Cluster tracks or artists based on user listens: algorithm proposal

yuioen · February 7, 2025, 10:19pm

@lucifer coming back here about what we discussed at FOSDEM. The idea was to create clusters of tracks or artists based on user listens. This might allow us to generate genres: the tracks or artists in the same cluster would share a similar genre. To name the genre, we could take the most common words appearing in the track titles or artist names.
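
A minimal sketch of the naming step, assuming a cluster is already available as a list of track titles. The function name `suggest_cluster_name` and the tiny stop-word list are hypothetical, only there for illustration:

```python
from collections import Counter
import re

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "feat", "remix"}

def suggest_cluster_name(titles, top_n=3):
    """Return the top_n most frequent non-stop words across the given titles."""
    words = Counter()
    for title in titles:
        # Count each word once per title so one repetitive title cannot dominate.
        tokens = set(re.findall(r"[a-z0-9']+", title.lower()))
        words.update(tokens - STOP_WORDS)
    return [word for word, _ in words.most_common(top_n)]

cluster_titles = ["Midnight Blues", "Delta Blues Jam", "Blues for the Road", "Road Song"]
name_words = suggest_cluster_name(cluster_titles, top_n=2)
print(name_words)
```

Titles rarely contain the genre name itself, so this would probably only give a rough label.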

I’m not a mathematician tho’

Create a table like this:
```
track   | user 1 | user 2 | user 3 | user 4 | etc
track_1 |      0 |      5 |   1000 |     42 |
track_2 |    100 |     13 |     12 |      3 |
```
This gives a space where each user is a dimension and each track is placed according to its listen counts: track_1 = (0, 5, 1000, 42).
Use a clustering algorithm like k-means or DBSCAN.
These algorithms find groups of tracks in that space.
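
The table-and-clustering idea above can be sketched in plain Python. This is only an illustration under assumptions: the listen records are made up, and the hand-written `kmeans` is a plain Lloyd's algorithm standing in for a library implementation such as scikit-learn's `KMeans`.

```python
import random

# Hypothetical listen counts: (user, track, listen_count).
listens = [
    ("user_1", "track_a", 40), ("user_1", "track_b", 35),
    ("user_2", "track_a", 50), ("user_2", "track_b", 45),
    ("user_3", "track_c", 60), ("user_3", "track_d", 55),
    ("user_4", "track_c", 30), ("user_4", "track_d", 25),
]

users = sorted({u for u, _, _ in listens})
tracks = sorted({t for _, t, _ in listens})
user_index = {u: i for i, u in enumerate(users)}

# One row per track, one column (dimension) per user; missing pairs stay at 0.
vectors = {t: [0.0] * len(users) for t in tracks}
for user, track, count in listens:
    vectors[track][user_index[user]] = float(count)

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

labels = kmeans([vectors[t] for t in tracks], k=2)
clusters = dict(zip(tracks, labels))
print(clusters)
```

Here track_a and track_b share listeners (user_1, user_2), so they land in one group, and track_c and track_d in the other. The cluster numbers themselves carry no meaning.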

Limitations :

Tracks with a lot of listens might end up related to each other. Listen counts could be normalized/standardized to avoid this, but how? Using only binary data instead of listen counts? Maybe also see StandardScaler.
There are a lot of users, so a lot of dimensions. To make the analysis less expensive, maybe PCA can be used: it reduces the number of dimensions to focus on meaningful data. Then k-means can be run on the output (maybe this is more expensive, I’m not sure, but I recall we did something like that at university).
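
A small sketch of three possible answers to the normalization question: binary values, log scaling, and scaling each track vector to unit length. The vectors are made-up examples and the helper names are hypothetical. One thing to keep in mind: scikit-learn's `StandardScaler` standardizes each column (here, each user), so on its own it would not remove the per-track popularity effect, whereas the per-row scaling below does.

```python
import math

def binarize(vector):
    """1.0 if the user listened at all, else 0.0."""
    return [1.0 if v > 0 else 0.0 for v in vector]

def log_scale(vector):
    """Dampen heavy listeners: log(1 + count)."""
    return [math.log1p(v) for v in vector]

def l2_normalize(vector):
    """Scale a track vector to unit length so only the listener profile matters."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

popular = [1000.0, 500.0, 0.0]  # two listeners, many plays
niche = [10.0, 5.0, 0.0]        # the same two listeners, few plays
other = [0.0, 0.0, 8.0]         # a different listener entirely

# On raw counts the niche track looks closer to the unrelated track
# than to the popular track with the same audience.
raw_gap_same_audience = distance(popular, niche)
raw_gap_other_audience = distance(niche, other)

# After per-track normalization the shared audience dominates instead.
normalized_gap = distance(l2_normalize(popular), l2_normalize(niche))
print(raw_gap_same_audience, raw_gap_other_audience, normalized_gap)
```

Binary values throw away how much a user likes a track, while log scaling keeps some of that signal; which one works better on real listens would need to be tested.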

biocv · February 8, 2025, 10:33am

What exactly would be the objective? Is it to impute missing genre tags? Or do you want to classify recordings and artists into genres irrespective of what genre tags were applied by the community?

yuioen · February 17, 2025, 1:30pm

It has nothing to do with genre tags; it’s to group artists/tracks based on listening data. We could consider the result a genre, but really it’s only a statistical relationship.
This approach builds groups of tracks; it doesn’t give a relationship between two tracks, as is the case with sessions. Sessions can give a relation with a value that represents how strong the relation is. In this case there is no such value: we only know the tracks/artists are in the same group.

biocv · February 17, 2025, 8:50pm

Very well, but what will you be using this grouping for? If it is to create groupings of musically similar tracks, then why not make use of the genre tags?

aerozol · February 17, 2025, 9:14pm

Bit of a tangent, but something like this could be useful to prompt people to add tags, also.

Just an example, imagine a “game” in the LB explore section where it plays you tracks (either at random or in a genre you select/are an expert in), and asks you “is this X genre”, and you click yes or no or “input other”, and then the next track plays.

yuioen · February 20, 2025, 9:15pm

People don’t use tags enough, from my POV, so we don’t have enough data on genres. This is a way to guess genres from listens. Hard to say if it would work tho’ ^^

We could use users’ tags to check whether the generated relationships are meaningful. If a significant number of tracks or artists in the same group share the same tag, then we can probably extrapolate this tag to the artists/tracks that were not tagged.
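
A rough sketch of that extrapolation rule, assuming one tag per tagged track. The function name `propagate_tag` and the thresholds `min_share` and `min_tagged` are hypothetical; what counts as a "significant" share would need tuning on real data.

```python
from collections import Counter

def propagate_tag(cluster_tracks, known_tags, min_share=0.6, min_tagged=3):
    """Suggest the cluster's dominant tag for its untagged tracks.

    Returns {track: tag} for untagged tracks, or {} when the cluster has
    too few tagged tracks or no tag is dominant enough.
    """
    tagged = [known_tags[t] for t in cluster_tracks if t in known_tags]
    if len(tagged) < min_tagged:
        return {}
    tag, count = Counter(tagged).most_common(1)[0]
    if count / len(tagged) < min_share:
        return {}
    return {t: tag for t in cluster_tracks if t not in known_tags}

cluster = ["t1", "t2", "t3", "t4", "t5"]
known_tags = {"t1": "shoegaze", "t2": "shoegaze", "t3": "shoegaze", "t4": "dream pop"}
suggestions = propagate_tag(cluster, known_tags)
print(suggestions)
```

In this example 3 of the 4 tagged tracks agree, so the untagged track gets the dominant tag as a suggestion. A cluster with no dominant tag yields nothing, which would itself hint that the grouping does not line up with genres.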

RustyNova · February 21, 2025, 7:30am

Gotta be careful with that too. You don’t want to encourage people to speedrun it and not listen to the song fully. Songs may change genres partway through, and you need to listen to the whole song first to really tell.

biocv · February 21, 2025, 6:10pm

That would indeed be one way to use the genre tags. Another would be to use the existing genres to decide upon the proper parameters, e.g. how many clusters to choose, at what distance is the cut-off or what is the minimum size of a cluster, depending on your choice of algorithm.
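
One way to sketch this parameter selection is a purity score: for each cluster, count the tracks carrying its majority tag, then divide by the number of tagged tracks. The labelings below are made-up stand-ins for the output of two clustering runs, and the function name `purity` is hypothetical.

```python
from collections import Counter, defaultdict

def purity(cluster_of, tag_of):
    """Share of tagged tracks whose tag matches the majority tag of their cluster."""
    tags_per_cluster = defaultdict(Counter)
    for track, tag in tag_of.items():
        if track in cluster_of:
            tags_per_cluster[cluster_of[track]][tag] += 1
    total = sum(sum(c.values()) for c in tags_per_cluster.values())
    majority = sum(c.most_common(1)[0][1] for c in tags_per_cluster.values())
    return majority / total if total else 0.0

# Existing community tags (hypothetical).
tags = {"t1": "rock", "t2": "rock", "t3": "jazz",
        "t4": "jazz", "t5": "techno", "t6": "techno"}

# Hypothetical cluster assignments from runs with 2 and 3 clusters.
candidates = {
    2: {"t1": 0, "t2": 0, "t3": 0, "t4": 1, "t5": 1, "t6": 1},
    3: {"t1": 0, "t2": 0, "t3": 1, "t4": 1, "t5": 2, "t6": 2},
}

scores = {k: purity(labels, tags) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Purity never penalizes extra clusters (one cluster per track scores a perfect 1.0), so in practice it would have to be balanced against the number of clusters or replaced by a measure that accounts for it.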

biocv · February 21, 2025, 6:25pm

It is my understanding that genres are “folksonomy” tags, so the classification is inherently noisy and subjective. For most songs it will not matter a great deal whether the tagger has intently studied the subject or just listened to a snippet.

yuioen · February 27, 2025, 11:22pm

I’m trying to use the ListenBrainz data dump to do this, but I can’t find a way to make it work given the size of the dataset ^^ The good thing with the session approach (used by artist similarity) is that you can do things slowly and don’t need to load all the data. With k-means I suppose all the data needs to be processed at the same time, so it will not work.
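
For what it’s worth, k-means has incremental variants that update the centroids one point (or one batch) at a time, for example scikit-learn’s `MiniBatchKMeans` with `partial_fit`. Below is a sketch of the simplest such variant, a sequential (online) k-means, untested against the real dump. The chunk generator `stream_chunks` and its data are hypothetical, and it assumes the track vectors can be produced one chunk at a time, which is itself a non-trivial step with the dump.

```python
def stream_chunks():
    """Hypothetical stand-in for reading track vectors piece by piece."""
    yield [[1.0, 0.0], [0.9, 0.1]]
    yield [[0.0, 1.0], [0.1, 0.9]]
    yield [[1.0, 0.1], [0.0, 0.8]]

def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])),
    )

def online_kmeans(chunks, initial_centroids):
    """Sequential k-means: each centroid is the running mean of its points."""
    centroids = [list(c) for c in initial_centroids]
    counts = [0] * len(centroids)
    for chunk in chunks:  # only one chunk is held in memory at a time
        for point in chunk:
            c = nearest(point, centroids)
            counts[c] += 1
            rate = 1.0 / counts[c]
            # Move the centroid a fraction of the way towards the new point.
            centroids[c] = [m + rate * (x - m) for m, x in zip(centroids[c], point)]
    return centroids

centroids = online_kmeans(stream_chunks(), initial_centroids=[[1.0, 0.0], [0.0, 1.0]])
print(centroids)
```

The result depends on the starting centroids and on the order in which points arrive, so it is an approximation of batch k-means, not a replacement for it.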

yuioen · February 28, 2025, 11:49am

How to manage large tables: listenbrainz-server/listenbrainz/spark/spark_dataset.py at master · metabrainz/listenbrainz-server · GitHub

Messages: listenbrainz-server/listenbrainz_spark/stats/incremental/message_creator.py at 8d82cb0f01d3a36251964e57fa772832e33d4ee5 · metabrainz/listenbrainz-server · GitHub

Also see: listenbrainz-server/listenbrainz/labs_api/labs/api/similar_artists.py at master · metabrainz/listenbrainz-server · GitHub

github.com/metabrainz/listenbrainz-server, listenbrainz_spark/similarity/artist.py (master):

from datetime import datetime, date, time, timedelta

from more_itertools import chunked

import listenbrainz_spark
from listenbrainz_spark import config
from listenbrainz_spark.path import RECORDING_LENGTH_DATAFRAME, ARTIST_CREDIT_MBID_DATAFRAME
from listenbrainz_spark.stats import run_query
from listenbrainz_spark.utils import get_listens_from_dump


RECORDINGS_PER_MESSAGE = 10000
# the duration value in seconds to use for track whose duration data in not available in MB
DEFAULT_TRACK_LENGTH = 180
# main artist in a credit are consider to weigh 1, featured artists should weigh less.
FEATURED_ARTIST_WEIGHT = 0.25


def build_sessioned_index(listen_table, metadata_table, artist_credit_table, session, max_contribution, threshold, limit, skip_threshold):
    # TODO: Handle case of unmatched recordings breaking sessions!
