What exactly is MessyBrainz?


#1

The website says

MessyBrainz is a MetaBrainz project to support unclean metadata. While MusicBrainz is designed to link clean metadata to stable identifiers, there is a need to identify unclean or misspelled data as well. MessyBrainz provides identifiers to unclean metadata, and where possible, links it to stable MusicBrainz identifiers.

MessyBrainz is currently used in support of two projects, ListenBrainz and AcousticBrainz. Submission to MessyBrainz is restricted, however the resulting data will be made freely available.

But some more detail would be nice. Does this project in any way attempt to supersede the importer project that ianmcvordeo worked on but that got canned (I can’t for the life of me remember what it was called), or not?


#2

The easy answer here is “we don’t know”, but that’s a cop-out, so I’ll try.
It’s easier to start with a problem statement:

We get submissions of metadata from ListenBrainz which could include as little as a track name and artist credit as strings. Is it possible to match this data to MusicBrainz?

ListenBrainz has taken a few years to ramp up, so we’ve only been using MessyBrainz as a data store for now. We map sha256({metadata provided in a listen}) to a unique MessyBrainz ID. The grand plan is to eventually map this metadata to a recording ID, keeping in mind all of the issues with confusing, complex, and ambiguous metadata that MusicBrainz has tried to take care of over the years. How do we do this? The answer again is that we don’t know. We’re going to start looking at the data this year to see what is possible.
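To make the hash-to-ID idea concrete, here is a minimal sketch in Python. It is not the actual MessyBrainz code: the function names, the in-memory dictionary, and the choice to serialise the metadata as key-sorted JSON before hashing are all assumptions made for illustration.

```python
import hashlib
import json
import uuid

# Hypothetical in-memory store; a real service would presumably use a database.
_ids_by_hash = {}


def metadata_hash(metadata: dict) -> str:
    """Return the SHA-256 hex digest of a listen's metadata.

    Assumption: the metadata is serialised as JSON with sorted keys so that
    the same fields in a different order produce the same hash.
    """
    canonical = json.dumps(
        metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def get_or_create_msid(metadata: dict) -> str:
    """Return the ID already stored for this metadata, or mint a new UUID."""
    digest = metadata_hash(metadata)
    if digest not in _ids_by_hash:
        _ids_by_hash[digest] = str(uuid.uuid4())
    return _ids_by_hash[digest]
```

With this sketch, two listens carrying identical strings get the same ID, while any difference in spelling, however small, produces a different hash and therefore a different ID. That is why the later mapping step to MusicBrainz is still needed.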

The biggest reason that there is not more detail on the website is that no one has started on this mapping process, so we still don’t know what’s possible and what’s not.

In the mid-term, I don’t see MessyBrainz used as a way to add data to MusicBrainz, but it’s a future possibility.


#3

Thanks. It seems the major issue would be that submissions are just for tracks rather than albums, and to do a decent match you really need the whole album. With just the track (but no AcoustID), even if you can match an MB recording, I don’t see how you can accurately match to the correct recording, since in so many cases there is more than one version of a recording in MB. Or does that level of accuracy not matter?

I assume that when data is submitted to AcousticBrainz (sorry, it is still on my plan to add that to SongKong), people are more likely to submit an album at a time, and hence even if it has no MusicBrainz IDs it may be easier to match. However, one has to consider why no MusicBrainz ID was sent: is it simply because the release is not in MusicBrainz?


#4

It’s true that “Matching metadata when you have a whole album of metadata” is a much easier task, but our current goal isn’t to collect album metadata. Instead, we need a solution to the fact that many people are going to be sending us data from ListenBrainz that only contains an artist credit and track name. Of course, some submissions may also contain album name, track number, Spotify IDs, or MusicBrainz IDs. The more data that we get, the easier these submissions will be to map directly to a MusicBrainz ID. In fact, accepting full AcoustID fingerprints is an interesting idea! I wonder how many people have fingerprints but no MBIDs in their metadata?

Neither do we. I suspect we’re going to have a huge number of easy matches (where the artist credit and track name pair is unique in MusicBrainz) and a huge number of hard ones (where recordings exist with many MBIDs, or artist names are duplicated, or recording names are duplicated, or…). As a starting point, I suspect that we will map items in MessyBrainz to 0 or more MBIDs, not exactly 1.
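As a rough illustration of what “0 or more MBIDs” could look like, here is a small Python sketch. It is only a guess at a first pass, not a design we have settled on: the exact-match-after-normalisation rule, the function names, and the placeholder IDs are all assumptions.

```python
from collections import defaultdict


def normalise(text: str) -> str:
    """Case-fold and collapse whitespace; a deliberately naive normalisation."""
    return " ".join(text.casefold().split())


def build_index(recordings):
    """Index (artist credit, track name) pairs to every MBID that uses them.

    `recordings` is an iterable of (mbid, artist_credit, track_name) tuples.
    """
    index = defaultdict(list)
    for mbid, artist, title in recordings:
        index[(normalise(artist), normalise(title))].append(mbid)
    return index


def candidate_mbids(index, artist: str, title: str) -> list:
    """Return all candidate MBIDs for a listen: possibly none, possibly many."""
    return list(index.get((normalise(artist), normalise(title)), []))


# Placeholder data; the IDs below are made up and are not real MBIDs.
index = build_index([
    ("mbid-1", "Artist A", "Unique Song"),
    ("mbid-2", "Artist B", "Common Song"),
    ("mbid-3", "Artist B", "Common Song"),  # e.g. a live version
])

easy = candidate_mbids(index, "artist a", "Unique  Song")   # one candidate
hard = candidate_mbids(index, "Artist B", "Common Song")    # two candidates
missing = candidate_mbids(index, "Artist C", "Not In MB")   # no candidates
```

Returning a list rather than a single ID keeps the hard cases visible instead of forcing an arbitrary choice between duplicate recordings.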

We’ve just had a proposal for an SoC project, which might be a first step towards understanding this data. There’s also a data dump available if you want to take a look at the data right now. As I said previously, up to now we’ve only been collecting data in a way that we think is useful. Now we’re finally starting to look at it and see how achievable this task actually is.

Although the MessyBrainz website mentions AB, we currently don’t use it for submissions sent to AcousticBrainz, and neither MessyBrainz nor AcousticBrainz has any notion of an “album” of music at submission time. This is definitely a future possibility, but for now it adds much more complexity to the problem, perhaps for not much value.

Because we don’t want to prevent people from using ListenBrainz just because they don’t have audio with MBID tags, or they use software that doesn’t send them with the request.


#5

Thanks for the answer

I realize that, but my point was that there will actually be a lot of tracks that cannot be found in MusicBrainz. For example, so far AcoustID has mapped 10M AcoustIDs to MusicBrainz recordings, out of 18M MusicBrainz recordings. But AcoustID has 40M AcoustIDs, so even if every MusicBrainz recording had been linked, that would still leave over 20M AcoustIDs that cannot be found in MusicBrainz. That’s over 50%.
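As a quick check of that arithmetic, here is a small Python sketch using the rounded figures above. It assumes a best case in which each MusicBrainz recording accounts for exactly one AcoustID; the function name is made up for illustration.

```python
def unmatched_share(total_acoustids: int, mb_recordings: int):
    """Return (count, fraction) of AcoustIDs left without a MusicBrainz recording.

    Assumption: every MusicBrainz recording is linked to exactly one AcoustID.
    """
    unmatched = total_acoustids - mb_recordings
    return unmatched, unmatched / total_acoustids


# Rounded figures quoted in this post.
count, fraction = unmatched_share(40_000_000, 18_000_000)
print(f"{count:,} AcoustIDs unmatched ({fraction:.0%})")
```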


#6

That’s definitely a great future goal, but we don’t have the developer bandwidth for it yet.


#7

I was just making the point that much of your MessyBrainz data (using AcoustID as an example) will not have a MusicBrainz track to match to.