Where does MessyBrainz data come from?

Alch_Emi · April 9, 2022, 2:46pm

Heya!

I recently got into MusicBrainz editing through ListenBrainz, when I saw that some of my favorite tracks (which have pretty small audiences) weren’t submitting correctly because they lacked MusicBrainz entries. I had a lot of fun learning how the system works and getting into the flow of it, and I’m proud of my work, even if there’s not as much as some of the bigger contributors.

One thing I noticed, though, is that even after I submitted an update to the database, the MSIDs still failed to link to the MBIDs, even several weeks after their submission - long after services like Last.fm were able to associate messy scrobble information with the release MBID. In fact, I don’t think I’ve seen a single one of my edits end up in the MessyBrainz database, although that could just be for lack of diligence in checking.

Anyway, I got to wondering where MessyBrainz data comes from. I found the MessyBrainz page on ListenBrainz, and the MessyBrainz directory on GitHub, but neither really provided a satisfactory answer. In both cases, they suggested that data could be submitted by a restricted set of users, but said nothing about how that data was submitted, who the restricted set of users was, or whether there were any sort of automated processes for this.

From poking around the forum a little bit, it seems like MessyBrainz in its current iteration is sorta new and unexplored. I get the feeling that a lot of the MSID associations are the result of a recent GSoC sprint. But I was hoping someone here with a bit more knowledge of the MetaBrainz community might be able to fill in some of the details about how MSIDs are mapped, where that data comes from, and what automated or semi-automated systems are in place to make it happen?

Thank you so much for your time!

lucifer · April 10, 2022, 8:45am

Hi! I hope this answers your questions. Feel free to ask any further questions you might have.

TL;DR: The listen data in LB can be considered unclean/raw. To make the best use of the listen data and build interesting things with it, we need to link it to MB somehow. MsB acts as this bridge. Whenever a listen is inserted into LB, it's assigned a MSID and automatically sent to a mapper to find a matching MBID for it. Any match found is recorded for later use. If your listens are not getting matched or are getting matched incorrectly, ping @lucifer or @rob here (@mayhem in IRC) or in the #metabrainz IRC channel.

Background

Each listen submitted to LB has an artist name and a track name. Listens can optionally include MBIDs to indicate the recording, artist, and album they relate to.

However, looking at the data in practice, most listens submitted to LB do not have MBIDs. Often the source of listens is a streaming service like Spotify. In this case, the user has no control over what metadata is sent with the listen. Spotify does not use MBIDs, so listens from it do not have MBIDs. Last.fm uses MBIDs, but there are issues with its MBID matching algorithm, so the MBIDs in listens received from the Last.fm importer are not trustworthy.

Apart from artist name and track name, all other metadata is optional, so it's often missing. Also, the data present in MB is vast and rich; it would be unreasonable to expect clients to submit much of it with their listens. Further, sometimes even the artist name and track name have typos. Therefore, the listen data in LB can be considered unclean/raw. To make the best use of the listen data and build interesting things with it, we need to link it to MB somehow.

MessyBrainz

MessyBrainz acts as the bridge between the raw LB data and clean MB data. MsB takes the artist name and track name of a listen and calculates a unique hash based on it. Each hash is assigned a UUID called MSID. This MSID is also attached to the listen. Each listen submitted to LB goes through MsB and thus has a MSID associated with it (except now playing listens which are temporary and not stored in the database). Broadly speaking if two listens have the same track and artist name, both get the same MSID.
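To make the hash-then-UUID idea concrete, here is a minimal sketch in Python. It is not the actual MsB code: the choice of SHA-256, the JSON serialisation, the in-memory dictionary, and the function names `listen_hash` and `get_or_create_msid` are all assumptions made for illustration. It only shows the property described above, that two listens with the same artist and track name end up with the same MSID.

```python
import hashlib
import json
import uuid

# Hypothetical in-memory stand-in for the MsB table: hash -> MSID.
_msid_by_hash: dict[str, uuid.UUID] = {}


def listen_hash(artist_name: str, track_name: str) -> str:
    """Hash the two required fields of a listen (assumed canonical form)."""
    payload = json.dumps(
        {"artist": artist_name, "track": track_name},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_or_create_msid(artist_name: str, track_name: str) -> uuid.UUID:
    """Return the MSID for this artist/track pair, assigning one on first sight."""
    key = listen_hash(artist_name, track_name)
    if key not in _msid_by_hash:
        _msid_by_hash[key] = uuid.uuid4()
    return _msid_by_hash[key]
```

In this sketch, calling `get_or_create_msid("Portishead", "Glory Box")` twice returns the same UUID, while any difference in either string (even a typo) produces a different hash and therefore a different MSID. That is why a separate mapping step is needed to connect MSIDs to MBIDs.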

So to answer who submits data to MessyBrainz, all ListenBrainz users do!

The wording about restricting user submissions to MsB is a relic. I am not sure why that was intended to be the case, because by the time I joined the MeB community MsB had been retired as a separate project and had become a part of the LB webserver. My vague guess would be that it had something to do with how MsB was originally intended to be used. Maybe @rob or @alastairp can shed light on that trivia some time.

Besides acting as a bridge between LB and MB, MSIDs help us implement various features in LB because every listen is guaranteed to have one. Listen deletion, love/hate feedback, pinned recordings and so on are implemented with the help of MSIDs.

Is MsB the result of a GSoC sprint? Not exactly. MsB has been around for over 7 years now, almost as long as LB itself. But there has been work on MsB over the years as part of GSoC. One such project was to form clusters of MSIDs based on similar data (say, a typo in the artist name or track name) and then map them somehow to a MBID. I'm a bit hazy on the details because this was again before my time. Anyway, this project never really progressed much, for various reasons.

Mapping MSIDs to MBIDs

About a year ago, we switched gears and @rob worked on a new way to map MSIDs to MBIDs. It works by first searching MB for an exact match on the track name and artist name. If a match is found, we store a record that the MSID links to the specific MBID.

If an exact match is not found, we look up the artist name and track name in a typo-resistant search index based on MB data. If there is still no match, we “detune” the artist and track name and search again. These “detunings” involve removing join phrases like ft. and with, de-accenting, and transliterating the names. We then search for the detuned names in the index.

At any point in the search when we find a match, we count the typos in the data (the difference between the submitted name and the match we found in MB) and check whether we had to detune the names. Based on these metrics we assign a quality to the match: high, medium or low.

If no match is found even after all steps, we record that no match was found for this MSID.
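The steps above can be sketched roughly as follows. This is a toy illustration, not the real mapper: the two-entry `MB_INDEX` with placeholder MBIDs, the `MAX_TYPOS` tolerance, the Levenshtein distance as the "typo count", the case-insensitive comparison, the crude `detune` rules (no transliteration), and the quality thresholds are all assumptions. The exact-match step is simply the zero-typo case of the search here.

```python
import re
import unicodedata
from typing import NamedTuple, Optional

# Toy stand-in for an index built from MB data; the MBIDs are placeholders.
MB_INDEX = {
    ("Beyoncé", "Halo"): "00000000-0000-0000-0000-000000000001",
    ("Daft Punk", "Get Lucky"): "00000000-0000-0000-0000-000000000002",
}
MAX_TYPOS = 2  # assumed tolerance for the typo-resistant lookup


class Match(NamedTuple):
    mbid: str
    typos: int
    detuned: bool
    quality: str


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, used here as the typo count."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def detune(name: str) -> str:
    """Crudely drop a trailing 'ft./feat./with ...' part and strip accents."""
    name = re.sub(r"\s+(ft\.?|feat\.?|with)\s+.*$", "", name, flags=re.IGNORECASE)
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def quality(typos: int, detuned: bool) -> str:
    """Assumed thresholds: only an untouched exact match counts as high."""
    if typos == 0 and not detuned:
        return "high"
    return "medium" if typos <= 1 else "low"


def search(artist: str, track: str, index: dict, detuned: bool) -> Optional[Match]:
    """Return the closest index entry within MAX_TYPOS (0 typos = exact match)."""
    best = None
    for (mb_artist, mb_track), mbid in index.items():
        typos = edit_distance(artist.lower(), mb_artist.lower())
        typos += edit_distance(track.lower(), mb_track.lower())
        if typos <= MAX_TYPOS and (best is None or typos < best.typos):
            best = Match(mbid, typos, detuned, quality(typos, detuned))
    return best


def map_listen(artist: str, track: str) -> Optional[Match]:
    """Search as submitted first; if that fails, detune both sides and retry."""
    found = search(artist, track, MB_INDEX, detuned=False)
    if found is None:
        detuned_index = {(detune(a), detune(t)): m for (a, t), m in MB_INDEX.items()}
        found = search(detune(artist), detune(track), detuned_index, detuned=True)
    return found  # None stands for the "no match found" record
```

With this toy index, `map_listen("Daft Pumk", "Get Lucky")` matches with one typo and no detuning, while `map_listen("Daft Punk ft. Pharrell Williams", "Get Lucky")` only matches after the join phrase is removed, so it is flagged as detuned.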

Whenever a listen is inserted in the database, it is automatically sent to the mapping process for finding a MBID match. This process was put into production at some point last year and continually processes incoming listens. At the same time, we had put another process in place to map listens which were inserted earlier. That process soon completed its job and was repurposed.

Often we fix bugs in the mapping process, new data is added to MB, or something else happens that means a match may now be found. To pick these up, we mark the listens we want re-scanned in the database; the mapper picks them up and repeats the match-finding process on them. Currently we mark the listens to be rescanned manually, but we intend to make this step somewhat automatic in the future.

Last time I checked, we were able to map 70-80% of listens with these processes.

The Present

We have set up an endpoint at Dataset hoster: explain-mbid-mapping to help debug why a particular listen doesn’t get mapped. You can look it up there and add the case to the [LB-1036] MBID Mapping improvement rollup ticket in the MetaBrainz JIRA. Probably also ping me (@lucifer) or @rob/@mayhem to look into it.

Recordings in MB can appear on multiple releases. Some recordings/releases are released multiple times. The differences between such instances usually include, but are not limited to, release country, release format, release group type etc. The mapping process makes various assumptions to ensure that the best match is found and that it is consistent. The consistency part is also really important to ensure good statistics in LB. We are already using the MBIDs found through mapping in generating statistics.

We have also started to serve these mapped MBIDs in the LB APIs and have made MSIDs optional for pinning a listen, providing love/hate feedback and so on. This is important because while the LB website and server have access to MSIDs, external devs using the LB API don’t. Also, it helps in supporting now playing listens better.

The Future

As said earlier, now playing listens are not assigned a MSID. This prevents them from being pinned, loved, hated and so on. Making MSIDs optional for these features and improving the mapping process gives us an opportunity to treat now playing listens on par with other listens. This is because the mapping process only needs the artist name and track name to find a match. The MSID is only needed later if we want to store matches in the database as well. Based on this, we are trying to work out a way to support now playing listens.

On the MBID mapper front, we intend to make it more intelligent by enabling it to detect albums in listens etc. We also intend to publish datasets based on the MsB and mapping data in the near future.

Alch_Emi · April 10, 2022, 2:52pm

Thank you so so much for the detailed response! This was super informative, and I really appreciate all the answers! I’ll definitely go post a couple of the unmapped MSIDs to the ticket you linked!

I’m wondering if you could say a little bit more about the process by which listens are flagged after new data is added to MB, just to sate my curiosity? Is the detuning algorithm just run in reverse on the clean metadata to get potential messy metadata that could produce it, or is it something different?

lucifer · April 10, 2022, 3:59pm

Currently, there is no process to flag listens automatically after new data is added. We do it manually. In the past, after bug fixes or user reports, we have flagged all unmatched listens of, say, 2021, a specific week, or a user in a specified range. We intend to make this process somewhat automatic in the future but haven’t figured out how yet.

Alch_Emi · April 10, 2022, 4:19pm

Alright, thank you! Just to make sure I understand, if a listen is submitted before an entry exists in the MB database, it will remain unmapped, even after the entry is added to the database, UNTIL a new listen with the same MSID is submitted, at which point the MSID, and both listens, will be mapped to the MBID?

lucifer · April 10, 2022, 6:43pm

Uh no! Sorry for being unclear earlier. Even if a new listen is submitted for an unmatched MSID, the database still holds the record that no match was found for this MSID last time, so it won’t be rechecked. The only way to get a MSID rechecked (for existing or new listens) is to mark it in the database for rechecking.
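To illustrate that behaviour, here is a small sketch. The `mapping` dictionary, the `recheck` set and the `process_listen` function are hypothetical stand-ins, not the real LB schema or code. The point is only that a stored "no match" result short-circuits the lookup until the MSID is explicitly marked.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for the database: msid -> mbid (None = "no match found"),
# plus the set of MSIDs that were marked for rechecking.
mapping: dict[str, Optional[str]] = {}
recheck: set[str] = set()


def process_listen(msid: str, lookup: Callable[[str], Optional[str]]) -> Optional[str]:
    """Run the mapper only if the MSID has no stored result or is marked for recheck."""
    if msid in mapping and msid not in recheck:
        return mapping[msid]  # the stored result wins, even a stored "no match"
    mapping[msid] = lookup(msid)
    recheck.discard(msid)
    return mapping[msid]
```

In this sketch, if `lookup` returns `None` the first time an MSID is seen, later calls for the same MSID keep returning `None` even after the recording exists in MB. Only after `recheck.add(msid)` does the next call run the lookup again and store the new result.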

lucifer · April 10, 2022, 7:32pm

BTW, I took a look at the link you posted in the ticket. There was an issue with the debugging backend that caused incomplete logs to appear. In the complete logs, I see that a match is found. So if you point me to some relevant sample listens on your profile, I can invalidate the associated MSIDs for rechecking.