Hello
I’ve been looking at the query used to generate the artist-similarity data.
It’s nice to see there is a score threshold to make sure only meaningful data is saved o/ But since we don’t have enough users, a lot of artists don’t have similarity data.
Considering we are calculating the similarity anyway, why not save the data, even if it’s under the threshold? The API can still be filtered using a threshold. If we want to find similar artists for unknown artists, we can use a super low threshold, and the algorithms based on artist-similarity can still use data with a higher one.
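To make the idea concrete, here is a rough sketch of what I mean by filtering at read time. Everything in it is made up for illustration (the `STORED` table, the `similar_artists` function, the artist names and the scores); it is not the actual query or API code, just the shape of the idea: store every computed pair, and let each caller pick its own minimum score.

```python
from typing import NamedTuple


class SimilarArtist(NamedTuple):
    artist: str
    score: float


# Hypothetical stored data: every computed pair is kept, including low scores.
STORED = {
    "artist_a": [
        SimilarArtist("artist_b", 0.82),
        SimilarArtist("artist_c", 0.31),
        SimilarArtist("artist_d", 0.04),
    ],
}


def similar_artists(artist: str, min_score: float) -> list[SimilarArtist]:
    """Return stored neighbours with score >= min_score, best first."""
    rows = STORED.get(artist, [])
    kept = [row for row in rows if row.score >= min_score]
    return sorted(kept, key=lambda row: row.score, reverse=True)


# Algorithms that need reliable data would ask for a high threshold,
# while a "find anything for this obscure artist" lookup could use a low one.
strict = similar_artists("artist_a", min_score=0.5)
loose = similar_artists("artist_a", min_score=0.01)
```

With this layout the threshold becomes a per-request choice instead of something baked into the stored data.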
The data under the threshold is pretty crap – have you actually looked at the data under the threshold?
The correct way to solve this problem is to finally process the MLHD from last.fm to import listen data all the way back to 2005. That will give us a lot more similarity data than we have now. Processing this data is on the short-list for getting done as soon as possible.
@lucifer has been working very hard on a spark cluster setup and soon we should have the capacity to generate new data-sets such as improved similarity data.
The data under the threshold is pretty crap – have you actually looked at the data under the threshold?
I don’t have the infra to generate this kind of data :s
The correct way to solve this problem is to finally process the MLHD from last.fm to import listen data all the way back to 2005. That will give us a lot more similarity data than we have now. Processing this data is on the short-list for getting done as soon as possible.
Wow, I didn’t know about this dataset, it’s huuuge. So nice to hear you’re gonna process this o/
@lucifer has been working very hard on a spark cluster setup and soon we should have the capacity to generate new data-sets such as improved similarity data.