Artists listed twice in charts and reports, one hyperlinked to MB and the other not

agatzk · August 9, 2021, 2:47pm

In my ListenBrainz reports and charts, many artists including my most popular ones are listed twice (see Taylor Swift and Oliver Tree, User "agatzk" - ListenBrainz). One is hyperlinked to the MB artist and the other is not. My concern is that statistics are significantly less useful when artist listens are split across two containers like this.

I know the ‘Listens’ feed sometimes links to recordings/artists and sometimes does not. Does the presence of a hyperlink determine which artist container the listen is attributed to? I listen to music in both Spotify and foobar2000.

MuLiO1 · August 10, 2021, 5:05am

I can confirm this behaviour for several artists in my own statistics.

As far as I understand, the link indicates a match between the scrobbled artist and a MusicBrainz ID. This could be why ListenBrainz treats the linked and unlinked entries as different artists. My question for some time now has been how to manipulate the artist tag to ensure this match without repeating all the tagging with Picard.
In parallel I’m scrobbling to Libre.Fm and it shows some artists with a MusicBrainz ID and many without one. The cause for this difference is unknown to me.

Leona Lewis and Selena have the mbid tag but Madonna and Lady Gaga don’t.

agatzk · August 10, 2021, 4:58pm

I’m thinking ListenBrainz uses mbid for primary artist identification and then exact character string if no mbid is present? I figured LB would know this is the same artist given the Spotify artist URL linked to the MB artist.
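To illustrate that guess, here is a minimal sketch of how grouping by MBID first and by exact name string otherwise would split one artist into two rows. The field names (`artist_name`, `artist_mbid`), the grouping rule, and the placeholder ID are all assumptions for illustration, not ListenBrainz's actual code or data.

```python
from collections import Counter

def artist_key(listen):
    """Hypothetical grouping key: the MBID when present, else the exact name string."""
    mbid = listen.get("artist_mbid")
    return ("mbid", mbid) if mbid else ("name", listen["artist_name"])

def count_by_artist(listens):
    """Count listens per grouping key."""
    return Counter(artist_key(listen) for listen in listens)

# Made-up listens; "mbid-taylor-swift" is a placeholder, not a real MBID.
listens = [
    {"artist_name": "Taylor Swift", "artist_mbid": "mbid-taylor-swift"},
    {"artist_name": "Taylor Swift", "artist_mbid": None},
    {"artist_name": "Taylor Swift", "artist_mbid": None},
]

counts = count_by_artist(listens)
for key, n in counts.items():
    print(key, n)
```

Under this assumed rule the same artist ends up in two buckets, one keyed by MBID and one keyed by name, which would match the linked and unlinked rows in the charts.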

To be honest I’m a tad surprised this hasn’t (as far as I’ve seen) been acknowledged before. It appears it should at least be attempting to map recordings/artists based on a char string, listenbrainz-server/listenbrainz/mbid_mapping at master · metabrainz/listenbrainz-server · GitHub. Maybe I’ll dive in myself and check it out.

aerozol · August 10, 2021, 8:13pm

@agatzk have a look at @rob’s notes from the last meeting, sounds like he’s looking at some improved matching as well?

rob · August 10, 2021, 8:56pm

Hiya!

ListenBrainz internally uses MSIDs, which are MessyBrainz IDs, because they index… messy data. The challenge I’ve been working on for months now (a big task, really) is to create a mapping between MSIDs and MBIDs – this effectively maps data in the wild to the best matches in MusicBrainz.

I’ve completed the first pass of this mapping which maps individual listens from MSIDs to MBIDs and about 80+% of the recent listens are getting mapped. This part is looking pretty promising, but we’re seeing problems like the ones that you mention – data is being spread across multiple artists which dilutes the clarity/effectiveness of the reports and the recommendations we’re trying to get going.

This week I’ve started work on a “release pass” over the MBID mapping, meaning that it will inspect a user’s listen stream and identify whole albums. If some of the tracks are spread across more than one album, we’ll hopefully fix this so that a collection of tracks that originated from one album all point to that same album.

However, none of this new work is in production yet – the production systems are still using an older mapping that pretty much relied on exact matches (the new one allows for various levels of fuzzy matching). This mapping was always “better than nothing”, but admittedly not very good. And it hasn’t been updated in quite some time.
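The difference between exact and fuzzy matching can be sketched roughly as follows. This is only an illustration using Python's standard `difflib`; the `normalize` helper, the 0.9 threshold, and the example strings are assumptions and do not reflect how the actual mapping code works.

```python
import difflib
import re

def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace (an assumed cleanup step)."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(name.split())

def exact_match(a, b):
    """Match only when the two strings are character-for-character identical."""
    return a == b

def fuzzy_match(a, b, threshold=0.9):
    """Match when the normalized strings are similar enough (threshold is arbitrary)."""
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

print(exact_match("Oliver Tree", "oliver tree."))   # differs in case and punctuation
print(fuzzy_match("Oliver Tree", "oliver tree."))   # identical after normalization
print(fuzzy_match("Oliver Tree", "Taylor Swift"))   # genuinely different names
```

An exact comparison rejects small differences in case or punctuation, while a normalized, thresholded comparison tolerates them at the cost of needing a cutoff to be chosen.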

The whole team is, one way or another, drifting off to various levels of vacation for the rest of the month, but we hope to push all of this into production sometime in September. At that point I hope that many of the problems you’re seeing are going to get much better.

In the meantime, if you’re curious, the mapping is continually updating and here is our stats page for it:

https://stats.metabrainz.org/d/OGg5QUCGz/listenbrainz-services?orgId=1&refresh=1m

(You can create yourself an account if you don’t already have one)

60% of listens (including a ton of old stuff with bad metadata) have been matched with MBIDs. That translates to 80+% of the unique metadata that MessyBrainz has seen. The mapping contains 60M entries, 35+M of which are matches of varying quality. When compared to overall listens, these 60M entries represent the “unique listens”, meaning that if a user has listened to that track more than once, it will only be counted once here.
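The distinction between counting every listen and counting each unique piece of metadata once can be shown with a toy calculation. The data below is made up and the function name `match_rates` is hypothetical; it only demonstrates why the two percentages can differ, not how the real figures were computed.

```python
def match_rates(listens, matched):
    """Return (share of all listens matched, share of unique metadata matched)."""
    unique = set(listens)
    per_listen = sum(1 for item in listens if item in matched) / len(listens)
    per_unique = len(unique & matched) / len(unique)
    return per_listen, per_unique

# Made-up listen stream: each tuple is (artist, track); repeats are repeat listens.
listens = [
    ("Artist A", "Track 1"),
    ("Artist A", "Track 1"),
    ("Artist A", "Track 1"),
    ("Artist B", "Track 2"),
    ("Artist C", "Track 3"),
]
# Assume only these unique entries were mapped to an MBID.
matched = {("Artist A", "Track 1"), ("Artist B", "Track 2")}

per_listen, per_unique = match_rates(listens, matched)
print(f"per listen: {per_listen:.0%}, per unique entry: {per_unique:.0%}")
```

Because a repeated track counts once among unique entries but many times among listens, the two rates move apart depending on whether the heavily played tracks are the ones that matched.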

If you have specific questions about how the mapping comes together, let me know!