ListenBrainz as a correction engine?


ListenBrainz has recorded around 100 million listens since its launch. Most of the data received comes from genuine plays, but some of it may be fake.

  • The share of invalid data is probably under 1% (surely much lower; I would bet under 0.1%), but to make the data as accurate as possible we could still build a more robust database, especially if the database is used for research purposes.

  • Another idea is to remap the listens scrobbled by users and correct them on the fly.
    A user just uploads the song name and artist name, and LB fills in the remaining details for them. This could be done using the genuine data LB already has (maybe using Google’s BigQuery). It would also help LB be more accurate as a recommendation engine.

Any suggestions appreciated. :slightly_smiling:

I’m not sure I understand your suggestion. The premise of ListenBrainz was to use MusicBrainz as the database behind it.
Lookups from MB are already quite accurate, as evident in Picard. The main problem is differentiating between different releases of the same release group.

Are you suggesting LB could be used to correct data in MB?

No, not exactly; my idea is closer to the reverse.
The about page says, “Our data is mostly gathered by volunteers, and verified by a voting system to make sure it is consistent and correct.”

I am not very familiar with Picard myself, so correct me if I’m wrong.
I wonder, then, how the data collected by LB is used. Just for research and statistical purposes?
LB does not verify MBIDs while importing data, and that needs to be done. Another reason is that the data comes mostly from Last.fm, which itself has the same tracks available multiple times because of slight variations in spelling. So associating listens with MBIDs could be done using either the statistical data or MB.

At the moment, it isn’t used at all; it is just stored. The idea is to let people write recommendation engines that look at which songs are popular with the same set of people, so that they can tell you “Since you liked A, you may also like B”. is the thing you want not listenbrainz.
I believe that when you submit data to ListenBrainz it passes through MessyBrainz, which does the filtering and matching.
I am not a developer on this so hopefully someone with more knowledge can speak up.
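
To make the “Since you liked A, you may also like B” idea a bit more concrete, here is a minimal sketch of co-occurrence counting. It is only an illustration, not anything ListenBrainz ships: the function name `also_liked` and the `history` structure (user name mapped to a set of track names) are made up for the example.

```python
from collections import Counter


def also_liked(history, track, top=3):
    """Return the tracks most often listened to by users who also played `track`.

    `history` is a hypothetical structure: user -> set of track names.
    """
    counts = Counter()
    for tracks in history.values():
        if track in tracks:
            # Count every other track this user played alongside `track`.
            counts.update(tracks - {track})
    return [name for name, _ in counts.most_common(top)]


# Made-up sample data, purely for illustration.
history = {
    "user1": {"A", "B"},
    "user2": {"A", "B", "C"},
    "user3": {"C"},
}
suggestions = also_liked(history, "A")
```

A real engine would need weighting (popular tracks co-occur with everything), but this only works at all once listens for the same recording share one identifier.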

This is something that we have had some ideas about. The ideal goal for ListenBrainz would be for everyone to send listen data to us using a MBID. This way we know exactly what recording is being talked about, and we can get good quality data from MusicBrainz. However, we know that this isn’t always possible. The person listening to the music may not have added MusicBrainz tags to their music, the music may not be in MusicBrainz, or the software they use to play the music might not understand MBID tags.

To work around this, we made MessyBrainz, a component of ListenBrainz. In this system, we take all of the metadata given to us by a client (which could include Artist, Title, Album, Year, Track position, MBIDs, for example), and give each unique set of information its own ID (which we call a MessyBrainz ID).
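
As a rough illustration of “each unique set of information gets its own ID”, here is a small sketch. It is not the actual MessyBrainz code: the in-memory dictionary, the function name `messy_id` and the choice of a random UUID are all assumptions made for the example.

```python
import json
import uuid

# Hypothetical in-memory stand-in for the MessyBrainz table.
_known = {}


def messy_id(metadata):
    """Return a stable ID for one exact set of submitted metadata."""
    # Serialise with sorted keys so the order of fields in the
    # submission does not produce a different ID.
    key = json.dumps(metadata, sort_keys=True, ensure_ascii=False)
    if key not in _known:
        _known[key] = str(uuid.uuid4())
    return _known[key]


first = messy_id({"artist": "Some Artist", "title": "Some Track"})
same = messy_id({"title": "Some Track", "artist": "Some Artist"})
with_track_number = messy_id(
    {"artist": "Some Artist", "title": "Some Track", "tracknumber": 3}
)
```

Here `first` and `same` share an ID, while `with_track_number` gets a new one even though it is very likely the same recording, which is exactly the duplication problem described next.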

In our MessyBrainz database we currently have many IDs which probably represent the same data. Perhaps one submission includes a track number when another one doesn’t, or one contains a spelling mistake in the artist name or track title. Perhaps also, one of the submissions includes a MBID because someone tagged their file with Picard and submitted a listen with a MBID-supported music player.

There is an interesting task here to work out if two data submissions are actually the same. In fact, we think it’s so interesting that we proposed it as a potential SoC project.
The final goal of a project like this would be to query ListenBrainz with a MBID and find all submissions for this recording even if the recording title or artist name were spelt incorrectly.
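
A very naive starting point for deciding whether two submissions are “the same” is to normalise the text and compare it fuzzily. The sketch below uses only the Python standard library; the function names, the choice of fields and the 0.9 threshold are assumptions for illustration, not a proposed design.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", "", text.casefold())
    return " ".join(text.split())


def probably_same(a, b, threshold=0.9):
    """Compare artist and title only; extra fields such as track number are ignored."""
    for field in ("artist", "title"):
        ratio = SequenceMatcher(None, normalise(a[field]), normalise(b[field])).ratio()
        if ratio < threshold:
            return False
    return True


tagged = {"artist": "Sigur Rós", "title": "Hoppípolla", "tracknumber": 5}
untagged = {"artist": "sigur ros", "title": "Hopipolla"}
other = {"artist": "Sigur Rós", "title": "Glósóli"}
```

With these inputs `probably_same(tagged, untagged)` is true and `probably_same(tagged, other)` is false. Text similarity alone cannot separate two different artists with the same name, or two different songs with the same title.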

There are some subtle and not-so-subtle problems which we have thought about here, which make this a difficult problem:

Artists sometimes have the same name
On Last.fm, they don’t handle this very well:

    "There are multiple artists called James Morrison: 1) an English singer-songwriter from Rugby 2) an Australian jazz musician who plays numerous instruments; best known for his trumpet playing 3) a notable south Sligo-style Irish fiddler. 4) “Jim” Morrison, lead singer of 1960s American rock group The Doors."

Some artists name their albums the same

And if they do, sometimes people refer to them with different names

Or people simply don’t know how to write the track title when they tag their audio

Sometimes artists perform different songs with the same name

And sometimes they’re on the same album

And that’s not even starting to mention how to deal with classical/orchestral recordings.

If you’re interested in this kind of thing, definitely talk to us!


For “Artists sometimes have the same name”, we could look at the specific users who listen to an artist in order to tell the artists apart, because someone who listens to James Morrison is unlikely to be listening to all 4 variations. In the best case they are listening to one specific artist, which we could infer from their country, age and musical taste.
So we could differentiate an artist by the users who listen to them.
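
Something like this is what I have in mind, as a rough sketch only. It assumes we already know, for each candidate artist, which other artists its listeners tend to play; the candidate IDs and artist labels below are placeholders, not real data.

```python
def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pick_artist(user_artists, candidates):
    """Pick the candidate whose listeners' tastes overlap most with this user's.

    `candidates` maps a hypothetical artist ID to the set of other
    artists that listeners of that artist also play.
    """
    return max(candidates, key=lambda cid: jaccard(user_artists, candidates[cid]))


# Placeholder data for two artists sharing one name.
candidates = {
    "same-name-singer": {"pop-artist-1", "pop-artist-2"},
    "same-name-trumpeter": {"jazz-artist-1", "jazz-artist-2"},
}
choice = pick_artist({"jazz-artist-1", "jazz-artist-3"}, candidates)
```

Here `choice` comes out as `"same-name-trumpeter"`. A user with no history, or one who really does listen to both artists, would still need another signal.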

I’m a big fan of Last.fm; I use it to explore new music (from the API: Track.getSimilar, Track.getTags, Artist.getSimilar) with the help of scripts in foobar2000. With the recent changes, some APIs are not working, and they seem not to care about important things like the Neighbors system, which emphasizes the social aspect of listening to music. So I think it’s really great that MetaBrainz started something like this.

For all of this (the recommendation stuff) to work, the really important thing is to match all future listens to a specific track so that recommendation systems can make use of them. So, MBIDs aside, we could use things like track length and the user’s past listening patterns to map a listen to the exact song in the database.
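
For the track length part, a minimal sketch could look like this. It assumes the listen carries a duration and that we already have a list of candidate recordings with the same artist and title; the function name, the candidate IDs and the 3-second tolerance are made up for the example.

```python
def match_by_length(listen_ms, candidates, tolerance_ms=3000):
    """Return the ID of the candidate closest in duration, or None.

    `candidates` is a list of (recording_id, duration_ms) pairs.
    """
    best = min(candidates, key=lambda c: abs(c[1] - listen_ms), default=None)
    if best is None or abs(best[1] - listen_ms) > tolerance_ms:
        # Nothing is close enough, so do not guess.
        return None
    return best[0]


# Two hypothetical recordings of the same song.
candidates = [("studio-version", 215000), ("live-version", 342000)]
matched = match_by_length(341200, candidates)
unmatched = match_by_length(280000, candidates)
```

`matched` is `"live-version"` and `unmatched` is `None`. Past listening patterns could then be used as a tie-breaker when several candidates fall within the tolerance.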

As for MBIDs, Picard is great for full albums but not for single tracks that don’t have an album tag; even if you set the preferred release type to the max in the settings, it gives you an incorrect album tag. So it would be great if Picard had options like “use current artist/title tags”, “only match album release types”, etc.
If you encourage users to run Picard before scrobbling, a lot of things will be easier.

So are you guys planning to implement a desktop app for scrobbling, like in ?? (I’m currently using this script to scrobble.)

Bump because this seems like a really interesting problem to solve. :smiley:

From what I’ve seen in Picard, getting the release group should be relatively easy in most cases, but edge cases like those mentioned by @alastairp make some corrections necessary. In my opinion, the first thing we should build for this is some kind of interface that lets the user herself mark a group of listens, which we might think are different, as being the same release. Maybe an interface like this would merge all these listens in MessyBrainz with the correct one from MusicBrainz, and we would get at least some confirmation that our listen data is correct.

Also, what’s the lead devs’ vision for automatic correction? I’m not familiar with Picard’s lookup, but wouldn’t something inspired by it work for 99% of cases if we cluster listens into release_groups and assign the most relevant MBIDs ourselves when they are not present in the listen data? From what I see, does automatic correction by creating some mappings using audio fingerprints sourced from the community ( Could we do something like this also?
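
For the clustering idea (not the fingerprint one), here is a rough sketch of what I mean by assigning the most relevant MBID ourselves. It groups listens by a crude artist/title key and fills in a missing MBID with the most common one in the group. The field names, the key function and the sample MBID strings are all placeholders, not the real ListenBrainz schema.

```python
from collections import Counter, defaultdict


def cluster_key(listen):
    """Crude clustering key: case-insensitive artist and title."""
    return (listen["artist"].casefold().strip(), listen["title"].casefold().strip())


def assign_mbids(listens):
    """Return copies of the listens with missing 'mbid' values filled by majority vote."""
    votes = defaultdict(Counter)
    for listen in listens:
        if listen.get("mbid"):
            votes[cluster_key(listen)][listen["mbid"]] += 1

    filled = []
    for listen in listens:
        copy = dict(listen)
        group = votes[cluster_key(listen)]
        if not copy.get("mbid") and group:
            copy["mbid"] = group.most_common(1)[0][0]
        filled.append(copy)
    return filled


# Placeholder listens; "mbid-1" and "mbid-2" are not real MBIDs.
listens = [
    {"artist": "Some Artist", "title": "Some Track", "mbid": "mbid-1"},
    {"artist": "some artist", "title": "Some Track", "mbid": "mbid-1"},
    {"artist": "Some Artist", "title": "some track", "mbid": "mbid-2"},
    {"artist": "SOME ARTIST", "title": "Some Track"},
    {"artist": "Other Artist", "title": "Unknown Track"},
]
result = assign_mbids(listens)
```

The fourth listen ends up with `"mbid-1"`, and the last one stays without an MBID because nothing in its cluster has one. Whether a majority vote is trustworthy enough, given the edge cases mentioned above, is exactly the part where a user-facing confirmation step would help.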