Hi! I hope this answers your questions. Feel free to ask any further questions you might have.
TL;DR: The listen data in LB can be considered unclean/raw. To make the best use of it and build interesting things on top of it, we need to link it to MB somehow. MsB acts as this bridge. Whenever a listen is inserted into LB, it's assigned an MSID and automatically sent to a mapper that tries to find a matching MBID for it. Any match found is recorded for later use. If your listens are not getting matched, or are getting matched incorrectly, ping @lucifer or @rob here (@mayhem in IRC) or ask in the #metabrainz IRC channel.
Background
Each listen submitted to LB has an artist name and a track name. Listens can optionally include MBIDs to indicate the recording, artist and album they relate to.
In practice, however, most listens submitted to LB do not have MBIDs. Often the source of the listens is a streaming service like Spotify, in which case the user has no control over what metadata is sent with the listen. Spotify does not use MBIDs, so listens from it do not have them. Last.fm does use MBIDs, but there are issues with its MBID matching algorithm, so the MBIDs in listens received through the Last.fm importer are not trustworthy.
Apart from artist name and track name, all other metadata is optional, so it is often missing. Also, the data present in MB is vast and rich, and it would be unreasonable to expect clients to submit much of it with their listens. On top of that, sometimes even the artist name and track name contain typos. Therefore, the listen data in LB can be considered unclean/raw. To make the best use of the listen data and build interesting things on top of it, we need to link it to MB somehow.
MessyBrainz
MessyBrainz acts as the bridge between the raw LB data and the clean MB data. MsB takes the artist name and track name of a listen and calculates a unique hash based on them. Each hash is assigned a UUID called an MSID, and this MSID is attached to the listen. Every listen submitted to LB goes through MsB and thus has an MSID associated with it (except now playing listens, which are temporary and not stored in the database). Broadly speaking, if two listens have the same track name and artist name, both get the same MSID.
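To make the hash-then-UUID idea concrete, here is a minimal sketch. It is not the actual MsB code: the choice of SHA-256, the trimming and case-folding of names, the in-memory dictionary and the function names are all assumptions made for illustration.

```python
import hashlib
import json
import uuid

# Hypothetical in-memory store; a real service would persist this in a database.
_hash_to_msid: dict[str, str] = {}


def listen_hash(artist_name: str, track_name: str) -> str:
    """Hash the two required fields of a listen.

    Trimming and case-folding the names first is an assumption of this sketch.
    """
    payload = json.dumps(
        {
            "artist": artist_name.strip().casefold(),
            "track": track_name.strip().casefold(),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_or_create_msid(artist_name: str, track_name: str) -> str:
    """Return the MSID for this artist/track pair, creating one on first sight."""
    key = listen_hash(artist_name, track_name)
    if key not in _hash_to_msid:
        _hash_to_msid[key] = str(uuid.uuid4())
    return _hash_to_msid[key]
```

Calling `get_or_create_msid("Portishead", "Glory Box")` twice returns the same UUID both times, while a different track name produces a different one.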
So to answer who submits data to MessyBrainz, all ListenBrainz users do!
The wording about restricting user submissions to MsB is a relic. I am not sure why that was intended to be the case, because by the time I joined the MeB community MsB had been retired as a separate project and had become a part of the LB webserver. My vague guess is that it had something to do with how MsB was originally intended to be used. Maybe @rob or @alastairp can shed light on that trivia some time.
Besides acting as a bridge between LB and MB, MSIDs help us implement various features in LB because every listen is guaranteed to have one. Listen deletion, love/hate feedback, pinned recordings and so on are implemented with the help of MSIDs.
Is MsB a result of a GSoC sprint? Not exactly. MsB has been around for over 7 years now, making it almost as old as LB itself. But there has been work on MsB over the years as part of GSoC. One such project was to form clusters of MSIDs based on similar data (say, a typo in the artist name or track name) and then somehow map each cluster to an MBID. I am a bit hazy on the details because this was, again, before my time. Anyway, this project never really progressed much, for various reasons.
Mapping MSIDs to MBIDs
About a year ago, we switched gears and @rob worked on a new way to map MSIDs to MBIDs. It works by first searching MB for an exact match on the track name and artist name. If a match is found, we store a record that the MSID links to that specific MBID.
If an exact match is not found, we look up the artist name and track name in a typo-resistant search index built from MB data. If there is still no match, we “detune” the artist and track name and search again. These “detunings” involve removing join phrases like “ft.” and “with”, de-accenting and transliterating the names. We then search for the detuned names in the index again.
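As a rough illustration of what detuning could look like, here is a small sketch using only the Python standard library. The list of join phrases, the choice to drop only the phrase itself, and the function names are assumptions, not the mapper's real rules. Transliteration is left out because the standard library has no helper for it.

```python
import re
import unicodedata

# Illustrative join phrases only; the real mapper's list may differ.
# The phrase must be surrounded by whitespace, so a leading "With" is kept.
JOIN_PHRASE = re.compile(r"\s+(?:ft\.?|feat\.?|with)\s+", re.IGNORECASE)


def remove_join_phrases(name: str) -> str:
    """Replace each join phrase with a single space and tidy up whitespace."""
    return " ".join(JOIN_PHRASE.sub(" ", name).split())


def de_accent(name: str) -> str:
    """Decompose characters and drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def detune(name: str) -> str:
    """Apply both detunings; transliteration is intentionally not covered."""
    return de_accent(remove_join_phrases(name))
```

With these rules, `detune("Sigur Rós")` yields `"Sigur Ros"`. Note that detuning is deliberately lossy: it trades precision for a better chance of finding some match, which is why a detuned match deserves less trust than an exact one.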
At any point in the search when we find a match, we count the typos in the data (the difference between the submitted name and the match we found in MB) and note whether we had to detune the names. Based on these metrics we assign the match a quality: high, medium or low.
If no match is found even after all steps, we record that no match was found for this MSID.
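The whole cascade (exact match, typo-tolerant search, detuned search, then "no match") can be sketched as below. Everything here is a stand-in: the tiny `CATALOG` replaces MB data, the MBID strings are placeholders, plain edit distance replaces the typo-resistant search index, the simplified `detune` only de-accents, and the `MAX_TYPOS` cut-off and `grade` thresholds are invented for the example.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for MB data: (artist, track) -> placeholder MBID.
CATALOG = {
    ("portishead", "glory box"): "placeholder-mbid-1",
    ("sigur ros", "hoppipolla"): "placeholder-mbid-2",
}
MAX_TYPOS = 1  # hypothetical cut-off for the typo-tolerant step


@dataclass(frozen=True)
class Match:
    recording_mbid: Optional[str]  # None records that no match was found
    quality: str  # "high", "medium", "low" or "no_match"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as the typo count."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def detune(name: str) -> str:
    """Simplified detuning for this sketch: de-accent only."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def grade(typos: int, detuned: bool) -> str:
    """Hypothetical scoring: each typo costs 1, having to detune costs 2."""
    score = typos + (2 if detuned else 0)
    if score == 0:
        return "high"
    return "medium" if score <= 2 else "low"


def fuzzy_find(artist: str, track: str) -> Optional[tuple[str, int]]:
    """Return (mbid, typos) for the closest entry within MAX_TYPOS, if any."""
    best = None
    for (cat_artist, cat_track), mbid in CATALOG.items():
        typos = edit_distance(artist, cat_artist) + edit_distance(track, cat_track)
        if typos <= MAX_TYPOS and (best is None or typos < best[1]):
            best = (mbid, typos)
    return best


def map_listen(artist: str, track: str) -> Match:
    artist, track = artist.casefold(), track.casefold()
    mbid = CATALOG.get((artist, track))  # step 1: exact match
    if mbid is not None:
        return Match(mbid, grade(0, False))
    hit = fuzzy_find(artist, track)  # step 2: typo-tolerant search
    if hit is not None:
        return Match(hit[0], grade(hit[1], False))
    hit = fuzzy_find(detune(artist), detune(track))  # step 3: detune and retry
    if hit is not None:
        return Match(hit[0], grade(hit[1], True))
    return Match(None, "no_match")  # step 4: record the miss
```

Recording the miss explicitly, rather than storing nothing, is what lets a later re-scan tell "never tried" apart from "tried and failed".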
Whenever a listen is inserted in the database, it is automatically sent to the mapping process to find an MBID match. This process was put into production at some point last year and continually processes incoming listens. At the same time, we put another process in place to map listens that had been inserted earlier. That process soon completed its job and was repurposed.
Often we fix bugs in the mapping process, new data is added to MB, or something else changes so that a match may now be found where none was before. To pick these up, we mark the listens we want re-scanned in the database; the mapper then picks them up and repeats the match finding process on them. Currently we mark the listens to be re-scanned manually, but we intend to make this step somewhat automatic in the future.
Last time I checked, we were able to map 70-80% of listens with these processes.
The Present
We have set up an endpoint on the Dataset hoster, explain-mbid-mapping, to help debug why a particular listen doesn’t get mapped. You can look the listen up there and add the case to the [LB-1036] MBID Mapping improvement rollup ticket in the MetaBrainz JIRA. Probably also ping me (@lucifer) or @rob/@mayhem to look into it.
Recordings in MB can appear on multiple releases, and some recordings/releases are released multiple times. The differences between such instances usually include, but are not limited to, release country, release format and release group type. The mapping process makes various assumptions to ensure that the best match is found and that it is found consistently. The consistency part is really important for good statistics in LB. We are already using the MBIDs found through mapping when generating statistics.
We have also started to serve these mapped MBIDs in the LB APIs and have made MSIDs optional for pinning a listen, providing love/hate feedback and so on. This is important because while the LB website and server have access to MSIDs, external devs using the LB API don’t. It also helps in supporting now playing listens better.
The Future
As said earlier, now playing listens are not assigned an MSID. This prevents them from being pinned, loved, hated and so on. Making the MSID optional for these features and improving the mapping process gives us an opportunity to treat now playing listens on par with other listens. This is because the mapping process only needs the artist name and track name to find a match. The MSID is only needed later, if we want to store the match in the database as well. Based on this, we are trying to work out a way to support now playing listens.
On the MBID mapper front, we intend to make it more intelligent by enabling it to detect albums in listens, among other things. We also intend to publish datasets based on the MsB and mapping data in the near future.