Cluster tracks or artists based on user listens: algorithm proposal

@lucifer following up on what we discussed at FOSDEM. The idea is to build clusters of tracks or artists from users' listens. This might let us generate genres: tracks or artists in the same cluster would be treated as belonging to a similar genre. To name that genre, we could take the most common words appearing in the track titles or artist names of the cluster.
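
A minimal sketch of the naming step, assuming a cluster is just a list of track titles. The stop-word list and the `name_cluster` helper are made up for illustration and would need tuning on real data.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would need tuning.
STOP_WORDS = {"the", "a", "an", "of", "and", "for", "feat", "remix"}

def name_cluster(titles, top_n=3):
    """Return the top_n most frequent words across the given titles."""
    words = []
    for title in titles:
        # Lowercase, then keep runs of letters, digits and apostrophes.
        tokens = re.findall(r"[a-z0-9']+", title.lower())
        # Count each word once per title so one repetitive title cannot dominate.
        words.extend(set(tokens) - STOP_WORDS)
    return [word for word, _ in Counter(words).most_common(top_n)]

cluster_titles = ["Midnight Blues", "Blues for the Road", "Delta Blues Remix"]
print(name_cluster(cluster_titles, top_n=1))  # ['blues']
```

This only works when titles actually contain genre-like words, which is probably the exception rather than the rule.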

I’m not a mathematician tho’

  1. Create a table like this :

    track   | user 1 | user 2 | user 3 | user 4 | etc
    track_1 |      0 |      5 |   1000 |     42 |
    track_2 |    100 |     13 |     12 |      3 |
    

    This gives a space where each user is a dimension, and each track is
    placed in that space according to its listen counts: track_1 = (0, 5, 1000, 42)

  2. Use a clustering algorithm such as k-means or DBSCAN.
    These algorithms find groups of tracks in that space.
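
The two steps above could be sketched like this with scikit-learn, assuming the listens are available as (track, user, count) rows. The sample rows are invented, and a sparse matrix is used because most track/user pairs will be zero.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Hypothetical (track, user, listen_count) rows; real ones would come from the dump.
listens = [
    ("track_1", "user_1", 50), ("track_1", "user_2", 40),
    ("track_2", "user_1", 45), ("track_2", "user_2", 60),
    ("track_3", "user_3", 70), ("track_3", "user_4", 30),
    ("track_4", "user_3", 65), ("track_4", "user_4", 35),
]

# Map track and user names to row and column indices.
tracks = sorted({t for t, _, _ in listens})
users = sorted({u for _, u, _ in listens})
track_idx = {t: i for i, t in enumerate(tracks)}
user_idx = {u: i for i, u in enumerate(users)}

# Step 1: the track x user table, stored sparsely.
rows = [track_idx[t] for t, _, _ in listens]
cols = [user_idx[u] for _, u, _ in listens]
counts = [c for _, _, c in listens]
matrix = csr_matrix(
    (counts, (rows, cols)), shape=(len(tracks), len(users)), dtype=np.float64
)

# Step 2: cluster the rows (tracks). n_clusters=2 is an arbitrary choice here.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(matrix)
clusters = dict(zip(tracks, labels))
print(clusters)
```

With k-means the number of clusters has to be picked up front; DBSCAN avoids that but needs a distance threshold instead.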

Limitations :

  • Tracks with a lot of listens might end up grouped together just because they are popular. Listen counts could be normalized/standardized to avoid this, but how? Use only binary data (listened or not) instead of listen counts? Maybe also see StandardScaler.
  • There are a lot of users, so a lot of dimensions. To make the analysis less expensive, maybe PCA can be used: it reduces the number of dimensions to focus on the meaningful data. Then k-means can be run on the output (maybe this is more expensive, I'm not sure, but I recall we did something like that at university).
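
A rough sketch of both ideas on invented data. Two normalization options are shown: binary listens, and scaling each track to unit length. For the reduction step, `TruncatedSVD` is swapped in for PCA because it accepts sparse input directly; that is a substitution on my side, not what the post proposes. (`StandardScaler` can also take sparse input, but only with `with_mean=False`.)

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# Hypothetical track x user listen counts (rows are tracks, columns are users).
counts = csr_matrix(np.array([
    [1000, 800, 0, 0, 0, 0],   # very popular track
    [3, 2, 0, 0, 0, 0],        # niche track with the same listeners
    [0, 0, 900, 700, 0, 0],
    [0, 0, 4, 1, 0, 0],
    [0, 0, 0, 0, 500, 600],
    [0, 0, 0, 0, 2, 5],
], dtype=np.float64))

# Option A: keep only "listened or not" by setting every stored value to 1.
binary = counts.copy()
binary.data[:] = 1.0

# Option B: scale each track vector to unit length, so popularity
# (the magnitude of the row) no longer drives the distances.
unit = normalize(counts, norm="l2", axis=1)

# Reduce the user dimensions before clustering. n_components=3 is arbitrary.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(unit)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
print(labels)
```

On raw counts the popular tracks would sit far away from everything else; after either normalization, the popular and niche tracks that share listeners end up close together.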

What exactly would be the objective? Is it to impute missing genre tags? Or do you want to classify recordings and artists into genres irrespective of what genre tags were applied by the community?

It has nothing to do with genre tags; it's about grouping artists/tracks based on listening data. We could consider the result a genre, but really it's only a statistical relationship.
This approach builds groups of tracks; it doesn't give a relationship between two tracks, as is the case with sessions. Sessions give a relation with a value that represents how strong it is. Here, by contrast, there is no such value: we only know that the tracks/artists are in the same group.

Very well, but what will you be using this grouping for? If it is to create groupings of musically similar tracks, then why not make use of the genre tags?

Bit of a tangent, but something like this could be useful to prompt people to add tags, also.

Just an example, imagine a “game” in the LB explore section where it plays you tracks (either at random or in a genre you select/are an expert in), and asks you “is this X genre”, and you click yes or no or “input other”, and then the next track plays.

People don't use tags enough from my POV, so we don't have enough data on genres. This is a way to guess genres from listens. Hard to say if it would work tho' ^^

We could use users' tags to check whether the generated groups are meaningful. If a significant share of the tracks or artists in the same group have the same tag, then we can probably extend this tag to the artists/tracks in that group that were not tagged.
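
Something like this, as a sketch: take the majority tag of a cluster and suggest it for the untagged members only when the support is strong enough. The `dominant_tag` helper and both thresholds (`min_share`, `min_tagged`) are made up for illustration, and each track is assumed to carry at most one tag to keep it short.

```python
from collections import Counter

def dominant_tag(cluster_tracks, tags, min_share=0.6, min_tagged=3):
    """Return the majority tag of a cluster, or None if support is too weak."""
    tagged = [tags[t] for t in cluster_tracks if t in tags]
    if len(tagged) < min_tagged:
        return None  # not enough tagged tracks to say anything
    tag, count = Counter(tagged).most_common(1)[0]
    return tag if count / len(tagged) >= min_share else None

# Hypothetical cluster of five tracks, four of them tagged by users.
cluster = ["t1", "t2", "t3", "t4", "t5"]
tags = {"t1": "jazz", "t2": "jazz", "t3": "jazz", "t4": "blues"}

guess = dominant_tag(cluster, tags)
suggestions = {t: guess for t in cluster if t not in tags} if guess else {}
print(suggestions)  # {'t5': 'jazz'}
```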

Gotta be careful with that too. You don't want to encourage people to speedrun it without listening to the song fully. Songs may change genres partway through, and you need to listen to the whole song first to really tell.

That would indeed be one way to use the genre tags. Another would be to use the existing genres to decide upon the proper parameters, e.g. how many clusters to choose, at what distance is the cut-off or what is the minimum size of a cluster, depending on your choice of algorithm.
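
To illustrate the parameter-selection idea, here is a minimal sketch that tries several cluster counts for k-means and keeps the one whose clusters agree best with the existing tags, using scikit-learn's `adjusted_rand_score` as the agreement measure. The vectors and tags are synthetic, and the choice of score is an assumption; in practice the score would be computed only on the tracks that actually have a tag.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Hypothetical track vectors: three well-separated blobs of 20 tracks each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
vectors = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Hypothetical community genre tags for the same tracks, encoded as integers.
tag_ids = np.repeat([0, 1, 2], 20)

# Score each candidate number of clusters against the tags.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    scores[k] = adjusted_rand_score(tag_ids, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

The same loop would work for a DBSCAN distance cut-off or a minimum cluster size by swapping the model and the parameter being varied.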

It is my understanding that genres are “folksonomy” tags, so the classification is inherently noisy and subjective. For most songs it will not matter a great deal whether the tagger has intently studied the subject or just listened to a snippet.

I'm trying to use the ListenBrainz data dump to do this, but I can't find a way to make it work given the size of the dataset ^^ The good thing with the session approach (used by artist similarity) is that you can process things slowly and don't need to load all the data. With k-means I suppose all the data needs to be processed at once, so it will not work.
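
For what it's worth, scikit-learn has `MiniBatchKMeans`, whose `partial_fit` updates the centroids one chunk at a time, so the whole table does not have to be in memory at once. A minimal sketch on synthetic chunks follows; `iter_chunks` is a made-up stand-in for reading the data piece by piece. Each chunk has to contain complete track vectors, so the dump would still need to be aggregated per track first, which may be the harder part.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def iter_chunks(n_chunks=20, chunk_size=50, seed=0):
    """Hypothetical stand-in for reading track vectors chunk by chunk."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [10.0, 10.0]])
    for _ in range(n_chunks):
        # Each row is one track vector drawn around one of two centers.
        which = rng.integers(0, 2, size=chunk_size)
        yield centers[which] + rng.normal(scale=0.5, size=(chunk_size, 2))

model = MiniBatchKMeans(n_clusters=2, random_state=0)
for chunk in iter_chunks():
    # Only the current chunk is held in memory.
    model.partial_fit(chunk)

# Assign clusters to new track vectors once the centroids are learned.
labels = model.predict(np.array([[0.2, -0.1], [9.8, 10.3]]))
print(labels)
```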

How to manage large tables: listenbrainz-server/listenbrainz/spark/spark_dataset.py at master · metabrainz/listenbrainz-server · GitHub

Messages: listenbrainz-server/listenbrainz_spark/stats/incremental/message_creator.py at 8d82cb0f01d3a36251964e57fa772832e33d4ee5 · metabrainz/listenbrainz-server · GitHub

Also see: listenbrainz-server/listenbrainz/labs_api/labs/api/similar_artists.py at master · metabrainz/listenbrainz-server · GitHub