How is user similarity score calculated?

biocv · October 29, 2023, 4:08pm

Sometimes I browse the ListenBrainz listening history of “similar users” in the hope of discovering some new artists. Users appear to be ranked based on some “similarity score”, but this is a bit of an arcane metric to me. For example, user throwawaytest140 has a wopping 10 out of 10 similarity. However, when I visit their page, I find that my similarity to them is a meagre 5.5. Hence, similarity scores do not appear to be symmetric.

So how is it calculated and what does it represent?

rob · October 30, 2023, 12:24pm

Hi!

FWIW, I totally agree with you – the way we present that isn’t very good, but I was lacking a good way to do this when we implemented it. And I said, at some point, someone is going to complain about this and then we have a chance to work out how to do this better. Thank you for complaining.

The similarity score is calculated using a pearson correlation coefficent algorithm: Pearson correlation coefficient - Wikipedia

The algorithm itself isn’t the problem – that works quite well. The problem comes after we get the result from the algorithm – the similarity scores that it provides are global across all users and is a floating point value from -1.0 to 1.0. Here is a good explanation of this: https://www.quora.com/What-is-an-intuitive-explanation-of-the-Pearson-product-moment-correlation-coefficient

Right now, we lookup all the values for a given user and scale the set to fit onto the unit interval from 0.0 to 1.0. This has the side-effect that at least one user will be considered as a 100% match, regardless of well they match. And this is why we’re having this conversation right now.

The last link had a great insight:

“You’re left with a value in [-1,1] that tells you how much they move together in the abstract, and not in an absolute sense”

I kept looking at this data in a global sense and had a hard time making it work. Clearly the way to look at the data is in relative terms, by literally comparing two users based on the difference between their coefficient values.

This helps put this into a better context – I can change the algorithm to calculate the percent based on the difference between users on a scale of -1.0 to 1.0. That will overall drop most people to be far less similar than they are now, but it would reflect reality better.

Thoughts?

DontMindMe · October 30, 2023, 3:57pm

How often is it updated? Like if a persons tastes change over time.

biocv · October 30, 2023, 7:14pm

Alerting all statistics nerds!

Mighty interesting. Thanks for the detailed explanation. What I don’t understand yet; between what two variables do you calculate the correlation coefficient? Is it some sort of presence-absence vector of listens to specific tracks?

The scaling explains why the similarity scores are not symmetric between two users: apparently throwawaytest140 has several buddies with more closely matching listening histories. I hadn’t noticed that everybody has their very own close 10 out of 10 listening friend .

This topic has gone a little rusty for me, but I believe the standard method to condense a multivariate data set into a single distance (or similarity) measure is multidimensional scaling (MDS). Is there any reason why you haven’t been adopting any of those approaches?

rob · October 30, 2023, 10:01pm

More or less – its not just presence, but actual listen counts (normally expressed as a user rating). The docs on this are actually good:

github.com

metabrainz/listenbrainz-server/blob/f768c79d80090c50d7d0c256ff926a6889603600/listenbrainz_spark/similarity/user.py#L122


      
                          row.append((x,
                                      y,
                                      (value - min_similarity) / similarity_range,
                                      (value - global_min_similarity) / global_similarity_range))
          
                      similar_users.extend(sorted(row, key = itemgetter(2), reverse = True)[:max_num_users])
          
              return similar_users
          
          
          def get_vectors_df(playcounts_df):
              """
              Each row of playcounts_df has the following columns: recording_id, spark_user_id and a play count denoting how many times
              a user has played that recording. However, the correlation matrix requires a dataframe having a column of user
              vectors. Spark has various representations built-in for storing sparse matrices. Of these, two are Coordinate
              Matrix and Indexed Row Matrix. A coordinate matrix stores the matrix as tuples of (i, j, x) where matrix[i, j] = x.
              An Indexed Row Matrix stores it as tuples of row index and vectors.
          
              Our playcounts_df is similar in structure to a coordinate matrix. We begin with mapping each row of the
              playcounts_df to a MatrixEntry and then create a matrix of these entries. The recording_ids are rows, user_ids are
              columns and the playcounts are the values in the matrix. We convert the coordinate matrix to indexed row matrix

Mainly because we use Apache Spark for all of our “big data” crunching and the pearson coefficent is part of the standard spark setup, so it is almost free for us to use without needing to do much of anything. And since its built on Spark, it will scale.

All of the idiosyncrasies you see where introduced by the muppet with whom you’re currently chatting. The fix, if I understand correctly, is mostly to remove some code and write 2 new lines. I’ll try and look at that tomorrow.

rob · October 31, 2023, 9:39am

It updates daily at 2:30UTC.

rob · October 31, 2023, 9:44am

Here is the PR for this fix – will require more testing:

Fix user similarity by mayhem · Pull Request #2613 · metabrainz/listenbrainz-server · GitHub

rob · November 2, 2023, 7:29pm

Ok, we’ve deployed this to beta now:

jambowned's Listens - ListenBrainz

And I have to say the percentages make a lot more sense!

DontMindMe · November 2, 2023, 9:34pm

Oh! That is interesting lol Went from thinking I had a musical soulmate to all alone in the music world 31% is my highest!

rob · November 2, 2023, 9:50pm

I think that means your tastes are diverse and that we have few people like you on LB. At least for now.

aerozol · November 2, 2023, 9:51pm

6% is my highest!

But my surprise is minimal - to be clear, it’s not that I listen to ‘underground’ music (though sometimes I do), it’s that I listen to quite a lot of bad music

biocv · November 3, 2023, 6:55pm

Yup, 9% and below here, probably because of all the sins of youth in my playlist .

biocv · November 3, 2023, 6:59pm

I like the percentages as well. It is good that if one has an …ehm… eccentric music taste, that this is reflected in the similarity score.

biocv · November 4, 2023, 12:04pm

I know you are currently busy solving the LB hiccup problems, so no need to respond immediately. Just putting some thoughts down here for later contemplation .

If I understand correctly, the “percentages” are not based on overlapping listens, as one might naively expect (e.g. as implemented in Jaccard similarity). Rather, they correspond to the Pearson correlation coefficient, with the -1 to 1 range squeezed into a 0 to 100 scale. Is that correct?

If so, then that would mean that every score below 50 is a negative correlation though. As these do appear, I am guessing that tracks that neither user in the comparison has listened to are dropped, which would cause negative correlations to be rather common. So is this close to what happens, or am I down the wrong path here?

UltimateRiff · November 4, 2023, 10:30pm

oh yeah, that feels a lot better~ several times I’ve looked at a 10/10 match and not seen much in common, that would explain why… lol

chaban · November 4, 2023, 10:37pm

From 10/10 down to 5%.

DontMindMe · November 4, 2023, 10:41pm

Someone beat aerozol! Now I’m curious who (with an actual listening history) has the most unique taste… so we can all go listen to their favourite tracks and ruin their low percentage!

rob · November 5, 2023, 10:00am

Correct.

I’m not sure if this interpretation is correct – this is what threw me for a loop when we first implemented it.

This coefficient represents the angle between two vectors and it is expressed via the cos() of that angle. If we think of the angle between two vectors, they can go from 0 to 180 (deg), where 0 means that the two users are not similar at all and 180 means that the users are perfectly similar.

Just because the cos() of the angle returns a negative number, this does not mean they are negatively correlated – it is only expressing an angle between two vectors. so there is not way to express a negative relation. And semantically speaking what would it mean for two users to be negatively related? Would that be a matter/anti-matter sort of situation?

biocv · November 5, 2023, 12:00pm

Right on! Now we are moving into some hardcore statistics

And here I learned a thing. I was not familiar with that interpretation of the correlation coefficient. For me, the correlation coefficient is the standardized covariance between two variables. I am old school, I suppose . I will try to present my case in vector lingo:

Correct me if I am wrong, but if the correlation coefficient is the cosinus of the angle between two vectors (in radians) then in my book 0 degrees would indicate perfect correlation (rho = 1) and 180 degrees would indicate a perfect negative correlation (rho = -1), whereas uncorrelated vectors would make a 90 degrees (0.5 * pi) angle.

Here is an extreme example. Suppose that the only two tracks in the whole world are Starship’s “We Built This City” and The Village People’s “YMCA”. Suppose that I listened to the first but not to the second. My vector would be (1,0). Now if you would have listened to “YMCA”, but not “We Built This City”, then your vector would be (0,1). These vectors point in opposite directions (ETA: see bottom page), hence the angle they are making is 180 degrees. Our listening histories are perfectly opposite: you listen to the stuff I never heard and vice versa.

To get an uncorrelated data set, we require a certain overlap in listening histories e.g. (1,1,0,0) and (1,0,1,0) where I have listened to one track you did as well, but also listened to a track you did not. These vectors are at a right angle, I believe.

Yes, sorta. It means that one user listens to the tracks that the other does not listen to, like in the extreme example above. Hence, a negative correlation would result from the situation where the majority of listens are not shared between two users. My gut feeling is that this would be extremely common if only tracks are included that at least one user has listened to. Conversely, if you include ALL tracks in the database in your analysis, then this should result in a huge number of shared points in the origin that neither user has listened to, forcing a positive correlation. ETA: Rubbish, scratch that.
Now, linear algebra is outside of my comfort zone and of course I do not know exactly what calculations you perform, so I could be wrong. But this is my understanding so far. Also, I am really wondering how many statistics nerds are lurking in this forum and whether they will weigh in .

ETA: or rather: the vectors of deviations from the mean run in opposite directions: (0.5, -0.5) and (-0.5, 0.5). Sorry for the mess.

rob · November 5, 2023, 4:17pm

From what I understand it isn’t about the direction of the vectors, but the angle between them and nothing else. But I am not an expert in these matters.

I’m certainly open for other interpretations of this – if you have a concrete method by which you suggest scaling/offsetting the resultant cosine value, please suggest it. If it isn’t complex, I’ll try to implement it and we’ll compare the results.