GSoC pre-proposal: Matrix Archiver

Hi everyone! This is a sort of pre-proposal for a project idea I’ve been thinking about to help flesh it out and gather community feedback before I decide which project I’ll submit for my final proposal. (For that reason, I’ve not put it in the GSoC Applications category)

I’ve had a bit of a chat with @julian45 here, so this is building on that a bit.

The project idea from the wiki is as follows:

Matrix room archiver

Proposed mentor: julian45 (possibly zas, lucifer?)
Languages/skills: TBD languages/Matrix/Docker/Prometheus
Estimated project length: 350 hours
Difficulty: Medium

Last summer, we migrated from IRC to Matrix for communication. We have BrainzBot (a fork of BotBotMe), an IRC-based bot that reads all messages and logs them to a Postgres database. These logs are displayed on chatlogs.metabrainz.org. Although we have a functioning IRC-Matrix bridge so that BrainzBot’s chatlogging still works, BrainzBot is unmaintained and uses dependencies with known vulnerabilities. Due to this and our focus on Matrix as our communications platform of choice, we would quite prefer to move BrainzBot’s functions to a more modern bot.

The target solution should be Matrix homeserver-agnostic (i.e., not necessarily require reading out of a specific homeserver implementation’s database), be able to run in a Docker container, and expose a metrics endpoint for monitoring/alerting with Prometheus and Grafana (though Sentry would be a nice bonus). In addition, there are a few commands (e.g., macros for recalling specific reaction images, giving kudos to another user) that should be accommodated by the new bot.

To put it in bullet points, the primary aims of the project are to:

  • Create a public archive / log of MetaBrainz Matrix channels, to replace the IRC logs
  • These logs are to be public and indexable by search engines
  • The archive should persist regardless of the state of Matrix (protocol changes, chatbrainz going down)
  • The logs should support Matrix-native features such as reactions, replies, and mentions

The high-level plan would be to have a bot account join the Matrix room, controlled by a Matrix SDK. The bot would create and incrementally update HTML pages directly from the Matrix events. These pages would be written to the filesystem and then served by a regular webserver. Being prerendered HTML, these pages wouldn’t become unreadable in case of Matrix spec changes, and they would be easy for search engines to crawl and index.
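To make the render step concrete, here’s a minimal std-only sketch of turning one message event into an HTML fragment. The struct fields are simplified stand-ins, not the real Matrix event schema, and the markup is purely illustrative:

```rust
/// Simplified stand-in for a Matrix message event (not the real schema).
struct MessageEvent {
    event_id: String,
    sender: String,
    body: String,
}

/// Escape the characters that matter for HTML text content.
fn escape_html(s: &str) -> String {
    s.replace('&', "&amp;").replace('<', "&lt;").replace('>', "&gt;")
}

/// Render one message as an HTML fragment. The event ID becomes the
/// element's anchor, so replies can deep-link to it later.
fn render_message(ev: &MessageEvent) -> String {
    format!(
        "<div class=\"msg\" id=\"{}\"><b>{}</b>: {}</div>",
        escape_html(&ev.event_id),
        escape_html(&ev.sender),
        escape_html(&ev.body),
    )
}
```

Incremental updates would then be a matter of regenerating or patching the page containing a given `id` anchor.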

One alternative considered was reading directly from the Synapse database instead of using a bot account, but this was discarded due to the fragility of relying on private implementation details and its incompatibility with other homeserver implementations.

In terms of the output, there are a number of considerations:

  • I’m not sure what the best method of splitting pages is. Neither splitting by date nor by number of messages seems ideal from a user perspective.
  • We need the ability to find the correct page to incrementally update, for example for reactions and redactions.
    • Replies would be the simplest additional feature, requiring the ability to create deep links to other messages - potentially on other pages.
    • Edits and redactions require the ability to update prior messages
    • Reactions would similarly require updating prior messages
    • This could require creating some kind of index of event ID to page, as event IDs are arbitrary strings. This could be a database, or a bunch of files with redirects - but both methods have tradeoffs.
  • The next challenge is media. Since the Authenticated Media change, Matrix media cannot be hotlinked, so the bot would need to download the media and store it locally in some kind of media repository. Media would need to be tracked so it can be deleted if all messages referencing it are redacted. Additionally, Matrix has normal media and ‘thumbnails’, which are transformed versions of the original media. These would also need to be tracked.
    • Not all media is an image - the fact that media could be a PDF, a video, or pretty much any arbitrary file needs to be kept in mind both for security reasons and for compatibility with the webserver.
  • Threads have a distinctly different UX from normal messages. They could initially be treated as slightly special replies, but in the best case they would have their own pages, linked to from the message that started the thread.
  • Which channels can be logged needs to be controlled - whether this is a configurable list of room IDs, or a list of ‘admins’ able to invite the bot into new channels.
  • A home page that allows easy discovery of channels is important. Could also be related to space handling.
  • Should output be generated on a schedule, continuously, or perhaps on demand? Ideally the service would be architected so that the code changes between each option would be minimal.
  • It would be convenient and useful to embed the original JSON of each event in the HTML, so the logs are easily machine readable.
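To illustrate the event-ID-to-page index mentioned above, here’s a minimal in-memory sketch. The type and method names are mine, not from any SDK, and the flat-file persistence format is just one of the options with tradeoffs discussed in the bullets:

```rust
use std::collections::HashMap;

/// Illustrative index from Matrix event ID to the archive page containing it.
/// Needed because event IDs are arbitrary strings with no inherent ordering.
#[derive(Default)]
struct EventIndex {
    pages: HashMap<String, String>, // event ID -> relative page path
}

impl EventIndex {
    /// Record which page an event was rendered into.
    fn record(&mut self, event_id: &str, page: &str) {
        self.pages.insert(event_id.to_string(), page.to_string());
    }

    /// Find the page to regenerate when an edit, reaction, or redaction
    /// targets `event_id`.
    fn page_for(&self, event_id: &str) -> Option<&str> {
        self.pages.get(event_id).map(String::as_str)
    }

    /// Serialize to one tab-separated line per entry, e.g. for a flat
    /// index file next to the HTML output.
    fn to_lines(&self) -> String {
        let mut lines: Vec<String> = self
            .pages
            .iter()
            .map(|(id, page)| format!("{id}\t{page}"))
            .collect();
        lines.sort();
        lines.join("\n")
    }
}
```

A database, a redirect-file tree, or a search index could each back the same interface; this is just the smallest shape of the problem.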

For the language, I’d prefer Rust, with the following reasoning:

  • There are well-maintained, official SDKs for Rust and JavaScript
  • Additionally, there are good community SDKs for Java, Kotlin, Python, Go, C# and Dart
  • Of these languages, MetaBrainz seems to have codebases in Rust, JavaScript and Python. I personally know Rust and JavaScript better than Python.
  • Considering the stability aims of the project, I would choose Rust. Rust’s strong compile-time guarantees and general commitment to stability should make long-term maintenance easier and prevent accidental breakage.

Deployment & operation notes:

  • Deploying using Docker would be relatively easy, reusing a Dockerfile similar to those of other MeB Rust projects
  • Output could be written to a configurable path that could be a shared volume or a host mount, where the webserver lives separately and backups are possible
  • Prometheus and Sentry support can also be similarly reused from mb-mail-service
  • (as a side note, it looks like Sentry stats have been removed, so Prom is needed for that)

Bonus points

  • Message search could be added using something like Lunr, tinysearch or summa without any server API, which would provide a useful progressive enhancement. This could also be used as the message_id to page index.

The project idea on the Wiki states:

In addition, there are a few commands (e.g., macros for recalling specific reaction images, giving kudos to another user) that should be accommodated by the new bot.

Although this wouldn’t entirely fit with the main project, setting up a MauBot instance would be a great bonus. Various plugins could be used to replace the functionality of the IRC bots: for example, the github plugin could replace BrainzGit, the rss plugin could follow the MetaBrainz feeds, and the karma plugin could give kudos to other users. Plugins are written in Python, so a custom plugin could be set up relatively simply as well.

Hookshot, another Matrix bot, could be used for Git and RSS feeds and has Element-hosted public instances, but it doesn’t have the flexibility of MauBot - it has no plugin system AFAICT, and features like the kudos system would not be possible.


Finally, does anyone have any cool ideas for the name of this project? Just ‘Matrix Archiver’ seems a bit generic. Perhaps something Arctic-themed or music-themed?


Thanks a bunch for writing this pre-proposal out! I’d like to respond a bit to some of the things you’ve raised here. Since this ended up a bit long, I’ll put the arguably least-technical thing first:

Hmm… for music-themed, how about “libretto”? May be a bit too formal/classical, but just a thought.


Now, on to the rest:

This might be my current lack of caffeine talking, but I’m not quite sure what you mean by “splitting” here. Do you mean for storage, UI, or perhaps both? As far as the UI is concerned, the current IRC logger (whose approach I’m not necessarily opposed to, but could be convinced otherwise) splits logs out by channel at the top level, and each channel is then basically a continuous stream of chat logs; within each channel, only the last day or so is shown on first load, but by scrolling up, you can extend the stream theoretically as far back as the beginning of the log.

Yep, this is definitely something to think about. For what it’s worth, I tried writing and then editing a message in a test room, and when looking at the message’s source, although the top-level event ID changed, a new m.relates_to section was added to the content object with contents "event_id": "[original event ID]" and "rel_type": "m.replace", so this indicates that we have a way to track event IDs to update at all. For the index you mention, what kind of tradeoffs do you have in mind?

Agreed.

  • Re: thumbnails: to reduce complications in parsing and handling, it may be preferable to simply not consume them (assuming that the homeserver generates them server-side) and instead have the project generate its own thumbnails, essentially as dynamic by-products; if all messages referencing the media are deleted/redacted, then the project would no longer have to worry about them. (Or, as an alternative to generating them, simply not handle them at all and instead link to copies of the original media only?)
  • Re: “not all media is an image”: one approach to this could be to restrict the types of media the bot consumes and makes available. This could be either an explicit allowlist with implicit denial of everything else, or an explicit denylist with implicit allows for everything else. Tradeoffs either way, of course.
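The allowlist variant is tiny to sketch. The MIME types below are placeholders, not a decided policy, and real media handling would also want size limits and content sniffing:

```rust
/// Illustrative allowlist of MIME types the archiver would mirror.
/// Everything not listed is implicitly denied.
const ALLOWED_MIME: &[&str] = &["image/png", "image/jpeg", "image/gif", "image/webp"];

/// Decide whether to download and re-serve a piece of media, based on
/// the MIME type the homeserver reports for it.
fn should_mirror(mime: &str) -> bool {
    ALLOWED_MIME.contains(&mime)
}
```

A denylist would just invert the check (and inherit the usual risk of forgetting to deny something dangerous).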

I almost hesitate to suggest this, but combining this point with others above, I wonder if there may be utility in pairing the core project service with a database service (e.g., Postgres) so that things like messages, edits, etc. can be intrinsically interlinked with each other (Postgres and its ilk being a relational database, after all). That being said, complexity increases and scales when databases get involved, so any benefits brought by introducing one may not be worth it.

On other notes:

That’s odd, I can see that “email” with platform “rust” is still configured in our Sentry instance. Seems worth a look.

Thanks for the introduction to MauBot! BrainzBot currently does some things that don’t require explicit prompting (e.g., if someone mentions a Jira issue number in a message, it replies with the issue number, its title, and a link to the issue in the Jira UI), so if Hookshot or a MauBot plugin doesn’t currently handle that well, that would seem to call for a custom MauBot plugin with passive command handlers.
Also, to clarify: by “kudos”, I meant less of a karma-style approach and more stuff like the interaction shown in these two messages: (first, second).


The old Perl-based IRC logger we had before BrainzBot split pages by date (UTC) and I thought that worked great. The index page for the room could be an alias for, or redirect to, the current date. BrainzBot’s infinite scrolling is extremely buggy in my experience, and linking to a specific message has been outright broken for years. But maybe it can be done correctly.


How broken?
You mean you end up somewhere in the logs but you can’t see the linked message?

I thought it was only a problem with my browser, maybe (Vivaldi).

In fact you end up near the linked message but the scroll is shifted.

I have a chatlogs userscript that reveals the linked message and scrolls it into view; it works for me, but it is not 100% guaranteed.

It works on desktop and mobile.


Yeah, it’s never in the viewport. And I’m never sure which direction to scroll in, and the simple act of scrolling frequently causes the page to jump in the opposite direction. :slight_smile: Thank you for the script, I wish I knew about it sooner!


I like how that connects! It sounds quite grand, haha. One idea I had was serac - a block of glacial ice.

This is going to be a bit stream of thought, but here goes:

The UI would be the primary concern here, it would have to be prerendered into digestible chunks for search engines rather than infinite scroll. Of course it would have to be split by channel, but it also needs to be divided more than that, so it doesn’t grow forever and result in gigabyte pages.

I pretty much agree with all of this. Infinite scrolling is hard to do well, especially in a cross-browser way - what works well in Chrome may not work well in Firefox, and vice versa.
My main problems with splitting by date are threefold: first, conversations late at night can be split between pages. Second, you often end up with very active (and thus long) days followed by less active ones, making page sizes very uneven. Finally, some days have nothing at all! None of these are really technical problems, just mildly annoying UX.

The ideal state in my opinion would be some algorithm that splits to a new page after a set number of messages and a set period of inactivity. Hopefully that would result in relatively regular pages with minimal conversation splitting, but I’m sure there are problems with this I haven’t thought of.
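One possible reading of that algorithm, sketched with std only: once a page holds a minimum number of messages, break at the next sufficiently long gap between consecutive messages. The thresholds and the exact rule are assumptions, not a settled design:

```rust
/// Given message timestamps in seconds (sorted ascending), return the
/// indices at which new pages start. A page must reach `min_msgs`
/// messages before a gap of `min_gap_secs` or more triggers a break,
/// so quiet conversations aren't shredded into tiny pages.
fn page_breaks(timestamps: &[u64], min_msgs: usize, min_gap_secs: u64) -> Vec<usize> {
    let mut breaks = Vec::new();
    let mut page_len = 0;
    for i in 0..timestamps.len() {
        page_len += 1;
        if i + 1 < timestamps.len()
            && page_len >= min_msgs
            && timestamps[i + 1] - timestamps[i] >= min_gap_secs
        {
            breaks.push(i + 1); // next message starts a new page
            page_len = 0;
        }
    }
    breaks
}
```

Without a hard upper cap, a single nonstop conversation could still grow one page indefinitely, so a real version would probably also force a break after some maximum count.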

Looks like the message gets hidden under the header, and you need to scroll up a bit. I think this could be fixed with CSS. Not sure if the infinite scroll is causing additional issues, though.

So we would have to track the event ID, which we should be embedding in the HTML regardless. The thing is, editing in place within a page would be finicky and fragile compared to regenerating each page from scratch, and that would necessitate knowing what’s in the page. If we split by date, for example, that would be relatively easy. But then edits, etc. complicate that and require storing exactly which IDs are in each page anyway. We’re essentially forced to maintain an event ID ↔ page mapping - the question is what else we get from that, and what constraints it adds.

If we add something like PostgreSQL, we have to deploy a chunky database to run with the service. We also end up with a bunch of data that’s separate, easy to forget to back up, and no longer so easily accessible to the public (although it could be regenerated with some effort from each HTML file). Chances are that a whole bunch of state ends up stored there.
We don’t really need to have a dedicated RDBMS for this, so we could use an embedded database like rocksdb - that would be easier to manage, and with a little bit of setup could be backed up just by copying the public files (as it can do live backups). It’s got its own complexities to think about though - being K/V would complicate less straightforward additions to the schema, for one thing.
Another option is to avoid a database entirely and just write a file with the page for each ID - essentially using the filesystem as a database. We could also use this to get redirects to the page as a bonus. Webservers are all different, so it would be an HTML file with a meta redirect. I’m not actually sure how much I like this idea, though, because putting it in HTML for the browser complicates reading it back for us.
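Generating those per-event redirect files would be trivial; a sketch of what one could contain (the path layout and markup are illustrative, and a real version would need to percent-encode the target, since event IDs contain characters like `$`):

```rust
/// Build the body of a per-event redirect file for the
/// filesystem-as-index approach: a meta refresh plus a plain link as a
/// fallback. The target path is whatever page the event landed on.
fn redirect_page(target: &str) -> String {
    format!(
        "<!DOCTYPE html>\n<html><head>\n\
         <meta http-equiv=\"refresh\" content=\"0; url={target}\">\n\
         <link rel=\"canonical\" href=\"{target}\">\n\
         </head><body>\n\
         <a href=\"{target}\">Moved here</a>\n\
         </body></html>\n"
    )
}
```

The `canonical` link would at least keep search engines pointed at the real page rather than the stub.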
Finally, when I saw the search index, I noticed that it could be used as a database too - it would give pretty much all the benefits of the embedded database, plus search. (And it’s all pure Rust, too! Less chunky C++.) We could use something like Tantivy to get very fast search on the server, or something like summa (which seems to be a soft fork of Tantivy that compiles to wasm) to search on the client using the index from the server.

As you can see - there are a lot of options :sweat_smile:

There is one last point to make here, which is storing the contents of the message itself in the database - which, aside from the search index, seems relatively pointless to me because there’s no way to feed the database into any of the Matrix SDKs from what I can see. I would also assume that if we’re receiving new messages, we’re able to retrieve recent past messages too.

This seems unneeded to me?
Effectively the thumbnail API is just the media API with extra parameters for width and height. The complication would be deciding which thumbnail sizes to predownload, but that would be simpler than generating our own thumbnails! Even then, we’d still have to pick a size to generate. Of course, only using the original media would be the simplest option. It’s probably a ‘start simple, improve later’ thing.

This seems like a nuclear option to me, as it breaks some functionality, although it could be a final resort. Ideally it would be possible to configure headers correctly to avoid issues, although I’m not sure how easy that would be.

Ah, I was referring to


It seems like there was a Sentry metrics beta feature that was similar to Prometheus, but it’s been removed, and the page here has link-rotted and now redirects to a new, but similar, feature. Honestly kind of confusing.

That all sounds good. I think MauBot would be a good fit; it seems pretty much designed for that kind of stuff! Writing a couple of simple Python plugins would be a nice break from the Rust brain melt.


The main reason I like paginating by date is that that’s how I recall what I’m looking for (e.g., “let me review the conversation about X that happened after the meeting last Monday”). But if the system is designed so that you can also jump to a specific date, that could work too. :slightly_smiling_face:
