ListenBrainz internally uses MSIDs (MessyBrainz IDs), so called because they index… messy data. The challenge I’ve been working on for months now (a big task, really) is to create a mapping between MSIDs and MBIDs – this effectively maps data in the wild to the best matches in MusicBrainz.
I’ve completed the first pass of this mapping, which maps individual listens from MSIDs to MBIDs, and 80+% of the recent listens are getting mapped. This part is looking pretty promising, but we’re seeing problems like the ones that you mention – data is being spread across multiple artists, which dilutes the clarity/effectiveness of the reports and the recommendations we’re trying to get going.
This week I’ve started work on a “release pass” over the MBID mapping, meaning that it will inspect a user’s listen stream and identify whole albums. If some of the tracks are spread across more than one album, we’ll hopefully fix this so that a collection of tracks that originated from one album all point to that same album.
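To make the idea of the release pass concrete, here is a rough sketch of one way it could work. This is not our actual code: the `release_pass` function, the `candidates` field and the sample data are all made up for illustration, and it assumes each listen already comes with a list of releases it might belong to.

```python
from collections import Counter

def release_pass(listens):
    """Point every listen at the candidate release that explains the
    most listens in the whole stream (hypothetical heuristic)."""
    # Count how many listens each candidate release could account for.
    coverage = Counter()
    for listen in listens:
        for release in set(listen["candidates"]):
            coverage[release] += 1

    # For each listen, keep the candidate with the highest coverage.
    # max() returns the first candidate on ties, so the original order wins.
    return [
        (listen["track"], max(listen["candidates"], key=lambda r: coverage[r]))
        for listen in listens
    ]

# Made-up listen stream: the same tracks appear on an album and a compilation.
listens = [
    {"track": "Track A", "candidates": ["Album X", "Best Of"]},
    {"track": "Track B", "candidates": ["Album X"]},
    {"track": "Track C", "candidates": ["Best Of", "Album X"]},
]

for track, release in release_pass(listens):
    print(track, "->", release)
```

In this toy stream all three tracks end up on "Album X", because that release accounts for more of the listens than "Best Of" does.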
However, none of this new work is in production yet – the production systems are still using an older mapping that pretty much relied on exact matches (the new one allows for various levels of fuzzy matching). This mapping was always “better than nothing”, but admittedly not very good. And it hasn’t been updated in quite some time.
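To illustrate the difference between exact and fuzzy matching, here is a minimal sketch using Python’s standard `difflib`. The catalog, the placeholder IDs, the function names and the 0.8 threshold are all assumptions for the example; the real mapping does not necessarily work this way.

```python
from difflib import SequenceMatcher

# Hypothetical catalog: "artist - track" strings mapped to placeholder IDs.
catalog = {
    "Portishead - Glory Box": "mbid-1",
    "Massive Attack - Teardrop": "mbid-2",
}

def exact_match(name, catalog):
    # Only succeeds when the string is identical, character for character.
    return catalog.get(name)

def fuzzy_match(name, catalog, threshold=0.8):
    # Score every catalog entry by case-insensitive string similarity
    # and accept the best one only if it clears the threshold.
    def score(candidate):
        return SequenceMatcher(None, name.lower(), candidate.lower()).ratio()

    best = max(catalog, key=score, default=None)
    if best is not None and score(best) >= threshold:
        return catalog[best]
    return None

messy = "portishead - glory box"
print(exact_match(messy, catalog))  # no match: the case differs
print(fuzzy_match(messy, catalog))  # matched despite the case difference
```

The threshold is the knob here: lower values catch more messy metadata but also let more wrong matches through.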
The whole team is one way or another drifting off to various levels of vacation for the rest of the month, but we hope to push all of this into production sometime in September. At that point I hope that many of the problems you’re seeing are going to get much better.
In the meantime, if you’re curious, the mapping is continually updating, and here is our stats page for it:
(You can create yourself an account if you don’t already have one)
60% of listens (including a ton of old stuff with bad metadata) have been matched with MBIDs. That translates to 80+% of the unique metadata that MessyBrainz has seen. The mapping contains 60M entries, 35+M of which are matches of varying quality. When compared to overall listens, these 60M entries represent the “unique listens”, meaning that if a user has listened to that track more than once, it will only be counted once here.
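Since the difference between overall listens and “unique listens” can be confusing, here is a small sketch of how the two match rates are counted. The `match_rates` function, the MSID strings and the mapping are invented for the example and only show the counting, not real data.

```python
def match_rates(listens, mapping):
    """Return (share of all listens matched, share of unique MSIDs matched).

    listens: one MSID per listen event, repeats included.
    mapping: MSID -> MBID, or None when no match was found.
    """
    matched_listens = sum(1 for msid in listens if mapping.get(msid))
    unique = set(listens)  # each MSID counts once, however often it was played
    matched_unique = sum(1 for msid in unique if mapping.get(msid))
    return matched_listens / len(listens), matched_unique / len(unique)

# Made-up data: "d" is unmatched but was listened to three times.
listens = ["a", "b", "c", "d", "d", "d"]
mapping = {"a": "mbid-a", "b": "mbid-b", "c": "mbid-c", "d": None}

per_listen, per_unique = match_rates(listens, mapping)
print(per_listen, per_unique)
```

In this toy data half of the listens are matched but three quarters of the unique entries are, which is why the two percentages do not have to agree.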
If you have specific questions about how the mapping comes together, let me know!