Artist similarity: save without score thresholds to allow more use cases

yuioen · February 28, 2025, 8:06pm

Hello
I’ve been looking at the query used to generate the artist-similarity.
It’s nice to see there is a score threshold to make sure only meaningful data is saved o/ But since we don’t have enough users, a lot of artists don’t have any similarity data.

Considering we are calculating the similarity anyway, why not save the data even if it’s under the threshold? The API can still filter using a threshold. If we want to find similar artists for unknown artists, we can use a super low threshold, and the algorithms based on artist-similarity can still use data with a higher one.
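To make the idea concrete, here is a rough sketch of "store everything, filter when reading". It is only an illustration: the `STORED` table, the `similar_artists` function, the `min_score` parameter, and the score values are all made up for this example and are not the actual ListenBrainz schema or API.

```python
from typing import NamedTuple


class SimilarArtist(NamedTuple):
    artist: str
    score: int


# Hypothetical stored data: every computed pair is kept, whatever its score.
STORED = {
    "artist_a": [
        SimilarArtist("artist_b", 120),
        SimilarArtist("artist_c", 15),
        SimilarArtist("artist_d", 3),
    ],
}


def similar_artists(artist: str, min_score: int = 0) -> list[SimilarArtist]:
    """Return stored neighbours with score >= min_score, best first."""
    rows = STORED.get(artist, [])
    kept = [row for row in rows if row.score >= min_score]
    return sorted(kept, key=lambda row: row.score, reverse=True)


# Algorithms that need reliable data ask for a high threshold.
strict = similar_artists("artist_a", min_score=50)

# Discovery for little-known artists can ask for a very low one.
loose = similar_artists("artist_a", min_score=1)
```

The threshold becomes a read-time choice for each caller instead of a write-time decision made once for everybody.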

thanks again for your work

rob · March 3, 2025, 12:10pm

The data under the threshold is pretty crap – have you actually looked at the data under the threshold?

The correct way to solve this problem is to finally process the MLHD from last.fm to import listen data all the way back to 2005. That will give us a lot more similarity data than we have now. Processing this data is on the short-list for getting done as soon as possible.

@lucifer has been working very hard on a spark cluster setup and soon we should have the capacity to generate new data-sets such as improved similarity data.

yuioen · March 4, 2025, 12:57pm

The data under the threshold is pretty crap – have you actually looked at the data under the threshold?

I don’t have the infra to generate this kind of data :s

The correct way to solve this problem is to finally process the MLHD from last.fm to import listen data all the way back to 2005. That will give us a lot more similarity data than we have now. Processing this data is on the short-list for getting done as soon as possible.

Wow, I didn’t know about this dataset, it’s huuuge. So nice to hear you’re gonna process this o/

@lucifer has been working very hard on a spark cluster setup and soon we should have the capacity to generate new data-sets such as improved similarity data.

Thank you @lucifer Can’t wait to see the results

Thanks a lot for the response @rob, it’s nice to have this news

mr_monkey · April 2, 2025, 3:22pm

A quick update: we have run the first preliminary processing of the MLHD, so some improvements are on the horizon.

For what it’s worth, I also think an optional threshold could potentially be useful for more recent artists who will have more sparse data. For artists with no similarity data at all, even bad similarity data could be interesting in some contexts (wouldn’t show that on the website, though…).
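One possible shape for such an optional threshold, as a sketch only: try the strict threshold first and fall back to a relaxed one only when the artist would otherwise get nothing. The function name `pick_similar`, both default thresholds, and the score values are assumptions for illustration, not anything that exists in the codebase.

```python
def pick_similar(rows, strict_min=50, relaxed_min=1):
    """Pick neighbours from rows, a list of (artist_name, score) tuples.

    Returns (results, used_fallback). used_fallback is True when nothing
    passed strict_min and the relaxed threshold was applied instead.
    """
    strict = [row for row in rows if row[1] >= strict_min]
    if strict:
        return strict, False
    relaxed = [row for row in rows if row[1] >= relaxed_min]
    return relaxed, True


# A well-covered artist: the strict threshold is enough.
popular = [("artist_b", 120), ("artist_c", 15)]
popular_result = pick_similar(popular)

# A recent artist with sparse data: only low scores are available.
recent = [("artist_x", 7), ("artist_y", 2)]
recent_result = pick_similar(recent)
```

The `used_fallback` flag lets each consumer decide what to do with the weaker results, for example hiding them on the website while still exposing them to experimental tools.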

RustyNova · April 2, 2025, 3:30pm

Wouldn’t mind a “fuck around” toggle for that… Just to find out…