Hi everyone! This is a sort of pre-proposal for a project idea I’ve been thinking about to help flesh it out and gather community feedback before I decide which project I’ll submit for my final proposal. (For that reason, I’ve not put it in the GSoC Applications category)
I’ve had a bit of a chat with @julian45 here, so this is building on that a bit.
The project idea from the wiki is as follows:
Matrix room archiver
Proposed mentor: julian45 (possibly zas, lucifer?)
Languages/skills: TBD languages/Matrix/Docker/Prometheus
Estimated project length: 350 hours
Difficulty: Medium
Last summer, we migrated from IRC to Matrix for communication. We have BrainzBot (a fork of BotBotMe), an IRC-based bot that reads all messages and logs them to a Postgres database. These logs are displayed on chatlogs.metabrainz.org. Although we have a functioning IRC-Matrix bridge so that BrainzBot’s chatlogging still works, BrainzBot is unmaintained and uses dependencies with known vulnerabilities. Due to this and our focus on Matrix as our communications platform of choice, we would quite prefer to move BrainzBot’s functions to a more modern bot.
The target solution should be Matrix homeserver-agnostic (i.e., not necessarily require reading out of a specific homeserver implementation’s database), be able to run in a Docker container, and expose a metrics endpoint for monitoring/alerting with Prometheus and Grafana (though Sentry would be a nice bonus). In addition, there are a few commands (e.g., macros for recalling specific reaction images, giving kudos to another user) that should be accommodated by the new bot.
To bullet point, the primary aim of the project is to:
- Create a public archive / log of MetaBrainz Matrix channels, to replace the IRC logs
- These logs are to be public and indexable by search engines
- The archive should persist regardless of the state of Matrix (protocol changes, chatbrainz going down)
- The logs should support Matrix-native features such as reactions, replies, and mentions
The high-level plan would be to have a bot account join the Matrix room, controlled by a Matrix SDK. The bot would create and incrementally update HTML pages directly from the Matrix events. These pages would be written to the filesystem and then served by a regular webserver. Being prerendered HTML, these pages wouldn’t become unreadable in case of Matrix spec changes, and they would be easy to search and index by search engines.
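To make the rendering step a bit more concrete, here is a rough sketch of turning one message into a prerendered HTML fragment. It is only an illustration under my own assumptions: `Message`, `escape_html` and `render_message` are made-up names, the markup is a placeholder, and a real bot would take these fields from whatever event types the chosen Matrix SDK provides.

```rust
/// Hypothetical, simplified view of a Matrix text message. A real bot would
/// read these fields from its SDK's event types instead.
pub struct Message {
    pub event_id: String,
    pub sender: String,
    pub body: String,
}

/// Escape text so it is safe inside HTML element content and attribute values.
pub fn escape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}

/// Render one message as an HTML fragment. The event ID is used as the
/// element's `id`, so replies on this or other pages can deep-link to it.
pub fn render_message(msg: &Message) -> String {
    format!(
        "<article id=\"{id}\"><b>{sender}</b> <p>{body}</p></article>",
        id = escape_html(&msg.event_id),
        sender = escape_html(&msg.sender),
        body = escape_html(&msg.body),
    )
}
```

Escaping everything that comes from the room is the important part here, since the archive would be publishing untrusted user input as static HTML.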
One alternative to using a bot account considered was directly reading from the Synapse database, but this was discarded due to the fragility of relying on private implementation details and incompatibility with other homeserver implementations.
In terms of the output, there are a number of considerations:
- I’m not sure what the best method of splitting pages is. Neither splitting by date nor by number of messages seems ideal from a user perspective.
- We need the ability to find the correct page to incrementally update, for example for reactions and redactions.
- Replies would be the simplest additional feature, requiring the ability to create deep links to other messages - potentially on other pages.
- Edits and redactions require the ability to update prior messages.
- Reactions would similarly require updating prior messages.
- This could require creating some kind of index of event ID to page, as event IDs are arbitrary strings. This could be a database, or a bunch of files with redirects - but both methods have tradeoffs.
- The next challenge is media. Since the Authenticated Media change, Matrix media cannot be hotlinked, so the bot would need to download the media and store it locally in some kind of media repository. Media would need to be tracked so it can be deleted if all messages referencing it are redacted. Additionally, Matrix has normal media and ‘thumbnails’, which are transformed versions of the original media. These would also need to be tracked.
- Not all media is an image - the fact that media could be a PDF, a video, or pretty much any arbitrary file needs to be kept in mind both for security reasons and for compatibility with the webserver.
- Threads have a distinctly different UX from normal messages. They could initially be treated as slightly special replies, but in the best case they would have their own pages, linked to from the message that started the thread.
- Which channels can be logged needs to be controlled - whether this is a configurable list of room IDs, or a list of ‘admins’ able to invite the bot into new channels.
- A home page that allows easy discovery of channels is important. Could also be related to space handling.
- Should output be generated on a schedule, continuously or perhaps on demand? Ideally the service would be architected so that the code changes between each would be minimal.
- It would be convenient and useful to embed the original JSON of each event in the HTML, so the logs are easily machine readable.
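As a rough illustration of the bookkeeping the points above imply (finding the page for an event, and knowing when stored media is no longer referenced), here is a minimal in-memory sketch. Everything in it is an assumption for discussion: `ArchiveIndex`, `RedactionOutcome` and the method names are invented, page paths are just strings, and a real archiver would have to persist this in a database or in files rather than in memory.

```rust
use std::collections::{HashMap, HashSet};

/// In-memory bookkeeping for incremental updates. All names are hypothetical;
/// a real archiver would persist this state across restarts.
#[derive(Default)]
pub struct ArchiveIndex {
    /// Event ID -> path of the page that event was rendered on.
    page_of: HashMap<String, String>,
    /// Media URI (e.g. an `mxc://` URI) -> IDs of the events referencing it.
    media_refs: HashMap<String, HashSet<String>>,
}

/// What the renderer has to do after a redaction.
#[derive(Debug, PartialEq)]
pub struct RedactionOutcome {
    /// Page to re-render, if the redacted event was archived.
    pub page: Option<String>,
    /// Media no longer referenced by any message (sorted), safe to delete.
    pub orphaned_media: Vec<String>,
}

impl ArchiveIndex {
    /// Remember which page an event landed on and which media it uses.
    pub fn record_event(&mut self, event_id: &str, page: &str, media: &[&str]) {
        self.page_of.insert(event_id.to_string(), page.to_string());
        for uri in media {
            self.media_refs
                .entry(uri.to_string())
                .or_default()
                .insert(event_id.to_string());
        }
    }

    /// Page to update when a reaction, edit or reply targets `event_id`.
    pub fn page_for(&self, event_id: &str) -> Option<&str> {
        self.page_of.get(event_id).map(String::as_str)
    }

    /// Drop the event's media references. The page mapping is kept so a
    /// "redacted" placeholder can still be linked to from replies.
    pub fn redact(&mut self, event_id: &str) -> RedactionOutcome {
        let page = self.page_of.get(event_id).cloned();
        let mut orphaned = Vec::new();
        self.media_refs.retain(|uri, refs| {
            refs.remove(event_id);
            if refs.is_empty() {
                orphaned.push(uri.clone());
                false
            } else {
                true
            }
        });
        orphaned.sort();
        RedactionOutcome { page, orphaned_media: orphaned }
    }
}
```

With this shape, a reaction handler would call `page_for` on the target event and re-render that page, while a redaction handler would call `redact` and then delete whatever comes back in `orphaned_media`. Thumbnails could be tracked the same way by treating each thumbnail as its own entry.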
For the language, I’d prefer Rust, with the following reasoning:
- There are well-maintained, official SDKs for Rust and JavaScript
- Additionally, there are good community SDKs for Java, Kotlin, Python, Go, C# and Dart
- Of these languages, MetaBrainz seems to have codebases in Rust, JavaScript and Python. I personally know Rust and JavaScript better than Python.
- Considering the stability aims of the project, I would choose Rust. Rust’s strong compile-time guarantees and general commitment to stability should make long-term maintenance easier, and prevent accidental breakage.
Deployment & operation notes:
- Deploying using Docker would be relatively easy, reusing a similar Dockerfile to other MeB Rust projects
- Output could be written to a configurable path that could be a shared volume or a host mount, where the webserver lives separately and backups are possible
- Prometheus and Sentry support can also be similarly reused from mb-mail-service
- (as a side note, it looks like Sentry stats have been removed, so Prom is needed for that)
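For a sense of what the metrics endpoint would need to return, here is a small standard-library-only sketch that renders two counters in the Prometheus text exposition format. This is not how mb-mail-service does it (I haven't assumed anything about its code), and in practice a metrics crate would probably be used instead. The `Metrics` struct and the metric names are made up for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counters for the archiver; names are placeholders.
#[derive(Default)]
pub struct Metrics {
    pub events_archived: AtomicU64,
    pub pages_written: AtomicU64,
}

impl Metrics {
    /// Render the counters as Prometheus text exposition format, i.e. the
    /// body a `/metrics` HTTP handler would return.
    pub fn render(&self) -> String {
        let counters = [
            (
                "archiver_events_archived_total",
                "Matrix events written to the archive.",
                self.events_archived.load(Ordering::Relaxed),
            ),
            (
                "archiver_pages_written_total",
                "HTML pages written or rewritten.",
                self.pages_written.load(Ordering::Relaxed),
            ),
        ];
        let mut out = String::new();
        for (name, help, value) in counters {
            out.push_str(&format!(
                "# HELP {0} {1}\n# TYPE {0} counter\n{0} {2}\n",
                name, help, value
            ));
        }
        out
    }
}
```

The bot would bump these counters as it processes events (for example `metrics.events_archived.fetch_add(1, Ordering::Relaxed)`), and Prometheus would scrape the rendered text.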
Bonus points
- Message search could be added using something like Lunr, tinysearch or summa without any server API, which would provide a useful progressive enhancement. This could also be used as the message_id to page index.
The project idea on the Wiki states:
In addition, there are a few commands (e.g., macros for recalling specific reaction images, giving kudos to another user) that should be accommodated by the new bot.
Although this wouldn’t entirely fit with the main project, setting up a MauBot instance would be a great bonus. Various plugins could be used to replace the functionality of the IRC bots. For example the github plugin could be used to replace BrainzGit, the rss plugin could be used to follow the MetaBrainz feeds and the karma plugin could be used to give kudos to other users. Plugins are programmed in Python, so a custom plugin could be set up relatively simply as well.
Hookshot, another Matrix bot, could be used for Git and RSS feeds and has public instances hosted by Element, but it doesn’t have the flexibility of MauBot - it has no plugin system AFAICT and features like the kudos system would not be possible.
Finally, does anyone have any cool ideas for the name of this project? Just ‘Matrix Archiver’ seems a bit generic. Perhaps something Arctic-themed or music-themed?