GSoC pre-proposal: Matrix Archiver

I like how that connects! It sounds quite grand, haha. One idea I had was serac - a block of glacial ice.

This is going to be a bit stream of thought, but here goes:

The UI would be the primary concern here: it would have to be prerendered into digestible chunks for search engines rather than using infinite scroll. Of course it would have to be split by channel, but it also needs to be divided further than that, so pages don’t grow forever and end up gigabytes in size.

I pretty much agree with all of this. Infinite scrolling is hard to do well, especially in a cross-browser way - what works well in Chrome may not work well in Firefox, and vice versa.
My main problems with splitting by date are threefold: first, conversations late at night can be split across pages. Second, you often end up with very active (and thus long) days next to much quieter ones, making for very uneven page sizes. Finally, some days have nothing at all! None of these are really technical problems, just mildly annoying UX.

The ideal state in my opinion would be some algorithm that splits to a new page once it has accumulated a set number of messages and a long enough period of inactivity occurs. Hopefully that would result in relatively regular pages with minimal conversation splitting, but I’m sure there are problems with this I haven’t thought of.
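To make that concrete, here’s a minimal sketch of such a heuristic (all names and thresholds here are made up, not part of any design yet): break at the first quiet gap once a page is full enough, with a hard cap so a long busy run can’t grow unbounded.

```rust
// Hypothetical thresholds for the paging heuristic.
const MIN_MESSAGES: usize = 200; // page is "full enough" past this
const HARD_CAP: usize = 500; // never exceed this, even mid-conversation
const INACTIVITY_GAP_MS: u64 = 30 * 60 * 1000; // 30 minutes of quiet

/// Split a list of message timestamps (ms since epoch, ascending) into
/// pages, returning the index range covered by each page.
fn paginate(timestamps: &[u64]) -> Vec<std::ops::Range<usize>> {
    let mut pages = Vec::new();
    let mut start = 0;
    for i in 0..timestamps.len() {
        let len = i - start + 1;
        // Gap to the following message, if any.
        let gap_after = timestamps.get(i + 1).map(|&next| next - timestamps[i]);
        let quiet = gap_after.map_or(true, |g| g >= INACTIVITY_GAP_MS);
        // Break at a quiet point once the page is full enough,
        // or unconditionally at the hard cap.
        if (len >= MIN_MESSAGES && quiet) || len >= HARD_CAP {
            pages.push(start..i + 1);
            start = i + 1;
        }
    }
    if start < timestamps.len() {
        pages.push(start..timestamps.len());
    }
    pages
}
```

With something like this, a very active day just produces several pages, and a quiet week collapses into one.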

Looks like the message gets hidden under the header, and you need to scroll up a bit. I think this could be fixed with CSS. Not sure if the infinite scroll is causing additional issues, though.

So we would have to track the event ID - we should be embedding that in the HTML regardless. The thing is, going in and editing within a page would be finicky and fragile compared to regenerating each page from scratch, and that would necessitate knowing what’s in the page. If we split by date, for example, that would be relatively easy. But then edits, etc. complicate that and require storing exactly which IDs are in each page anyway. We’re essentially forced to maintain an event ID ↔ page mapping - the question is what else we get from that, and what constraints it adds.
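As a sketch of what that mapping buys us (names here are hypothetical): given an event ID → page map, an incoming edit or redaction tells us exactly which pages need regenerating.

```rust
use std::collections::{HashMap, HashSet};

/// Given the event-ID -> page mapping, return the set of pages that must
/// be regenerated because some of their events were edited or redacted.
/// Events we never archived are simply ignored.
fn pages_to_regenerate<'a>(
    page_of: &'a HashMap<String, String>, // event ID -> page path
    edited_event_ids: &[&str],
) -> HashSet<&'a str> {
    edited_event_ids
        .iter()
        .filter_map(|id| page_of.get(*id).map(String::as_str))
        .collect()
}
```

However the mapping is stored, this lookup is the core operation the options below all need to support.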

If we add something like PostgreSQL, we have to deploy a chunky database to run alongside the service. We also end up with a bunch of data that’s separate, easy to forget to back up, and no longer so easily accessible to the public (although it could be regenerated with some effort from each HTML file). Chances are that a whole bunch of state ends up stored there.
We don’t really need a dedicated RDBMS for this, so we could use an embedded database like RocksDB - that would be easier to manage, and with a little bit of setup could be backed up just by copying the public files (as it can do live backups). It’s got its own complexities to think about, though - being K/V would complicate less straightforward additions to the schema, for one thing.
Another option is to avoid a database entirely and just write a file with the page for each ID - essentially using the filesystem as a database. We could also use this to get redirects to the page as a bonus. Web server redirect configuration varies between servers, so it would have to be an HTML file with a meta refresh redirect. I’m not actually sure how much I like this idea, though, because wrapping the data in HTML for the browser complicates reading it back for us.
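A rough sketch of what that could look like (paths and layout are entirely hypothetical), including the awkward read-back step:

```rust
use std::fs;

/// Filesystem-as-database sketch: one tiny HTML file per event ID that both
/// records which page the event lives on and redirects browsers there.
fn write_event_redirect(dir: &str, event_id: &str, page_url: &str) -> std::io::Result<()> {
    // A real implementation would need to escape URL-unsafe characters in
    // event IDs; elided here.
    let html = format!(
        "<!DOCTYPE html>\n<meta http-equiv=\"refresh\" content=\"0; url={page_url}\">\n"
    );
    fs::write(format!("{dir}/{event_id}.html"), html)
}

/// Reading the mapping back means scraping the URL out of the HTML,
/// which is exactly what makes this option awkward.
fn read_event_redirect(dir: &str, event_id: &str) -> Option<String> {
    let html = fs::read_to_string(format!("{dir}/{event_id}.html")).ok()?;
    let start = html.find("url=")? + 4;
    let end = html[start..].find('"')? + start;
    Some(html[start..end].to_string())
}
```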
Finally, when I saw the search index, I noticed that it could be used as a database too - it would give pretty much all the benefits of the embedded database, plus search. (And it’s all pure Rust, too! Less chunky C++.) We could use something like Tantivy to get very fast search on the server, or something like summa (which seems to be a soft fork of Tantivy that compiles to WASM) to search on the client using the index from the server.

As you can see - there are a lot of options :sweat_smile:

There is one last point to make here, which is storing the contents of the messages themselves in the database - which, aside from the search index, seems relatively pointless to me because there’s no way to feed the database into any of the Matrix SDKs from what I can see. I would also assume that if we’re receiving new messages, we’re able to retrieve recent past messages too.

This seems unneeded to me?
Effectively the thumbnail API is just the media API with extra parameters for width and height. The complication would be deciding which thumbnail sizes to pre-download, but that would be simpler than generating our own thumbnails - even then, we’d still have to pick a size to generate. Of course, only using the original media would be the simplest option. It’s probably a ‘start simple, improve later’ thing.
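For reference, a small sketch of building such a thumbnail URL from an mxc:// URI, assuming the unauthenticated v3 media endpoints (newer spec versions move media behind authenticated client paths, so this would need adjusting there):

```rust
/// Build a Matrix thumbnail URL from an mxc:// URI. The thumbnail endpoint
/// is essentially the media download endpoint plus width/height/method
/// query parameters.
fn thumbnail_url(homeserver: &str, mxc_uri: &str, width: u32, height: u32) -> Option<String> {
    // mxc URIs have the form mxc://<server-name>/<media-id>
    let rest = mxc_uri.strip_prefix("mxc://")?;
    let (server_name, media_id) = rest.split_once('/')?;
    Some(format!(
        "{homeserver}/_matrix/media/v3/thumbnail/{server_name}/{media_id}?width={width}&height={height}&method=scale"
    ))
}
```

Picking which (width, height) pairs to request up front is the open question from above; the URL construction itself is trivial.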

This seems like a nuclear option to me, as it breaks some functionality, although it could be a final resort. Ideally it would be possible to configure headers correctly to avoid issues, although I’m not sure how easy that would be.

Ah, I was referring to


It seems like there was a Sentry metrics beta feature that was similar to Prometheus, but it’s been removed, and the page here has link-rotted and now redirects to a new, but similar, feature. Honestly kind of confusing.

That all sounds good. I think Maubot would be a good fit; it seems pretty much designed for that kind of stuff! Writing a couple of simple Python plugins would be a nice break from the Rust brain melt.
