Right on! Now we are moving into some hardcore statistics.
And here I learned a thing. I was not familiar with that interpretation of the correlation coefficient. For me, the correlation coefficient is the standardized covariance between two variables. I am old school, I suppose. I will try to present my case in vector lingo:
Correct me if I am wrong, but if the correlation coefficient is the cosine of the angle between two vectors, then in my book an angle of 0 degrees would indicate perfect positive correlation (rho = 1) and 180 degrees would indicate perfect negative correlation (rho = -1), whereas uncorrelated vectors would make an angle of 90 degrees (0.5 * pi radians).
Here is an extreme example. Suppose that the only two tracks in the whole world are Starship’s “We Built This City” and The Village People’s “YMCA”. Suppose that I listened to the first but not to the second. My vector would be (1,0). Now if you had listened to “YMCA”, but not to “We Built This City”, then your vector would be (0,1). These vectors point in opposite directions (ETA: see bottom of page), hence the angle between them is 180 degrees. Our listening histories are perfectly opposite: you listen to the stuff I never heard and vice versa.
To get an uncorrelated data set, we require a certain overlap in listening histories, e.g. (1,1,0,0) and (1,0,1,0), where I have listened to one track that you listened to as well, but also to one track that you did not. These vectors (taken as deviations from the mean) are at a right angle, I believe.
Yes, sorta. It means that one user listens to the tracks that the other does not listen to, like in the extreme example above. Hence, a negative correlation would result from the situation where the majority of listens are not shared between two users. My gut feeling is that this would be extremely common if the analysis only includes tracks that at least one of the two users has listened to. Conversely, if you include ALL tracks in the database in your analysis, then this should result in a huge number of shared points at the origin (tracks that neither user has listened to), forcing a positive correlation. ETA: Rubbish, scratch that.
Now, linear algebra is outside of my comfort zone and of course I do not know exactly what calculations you perform, so I could be wrong. But this is my understanding so far. Also, I am really wondering how many statistics nerds are lurking in this forum and whether they will weigh in.
ETA: or rather: the vectors of deviations from the mean run in opposite directions: (0.5, -0.5) and (-0.5, 0.5). Sorry for the mess.
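
For anyone who wants to check the two toy examples above, here is a minimal Python sketch. It assumes the correlation coefficient is computed as the cosine of the angle between the mean-centered vectors; I do not know whether that is what the site actually does. The function names (cosine, center, pearson, angle_degrees) are my own, made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (no centering)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def center(v):
    """Subtract the mean, giving the vector of deviations from the mean."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def pearson(u, v):
    """Correlation as the cosine between the mean-centered vectors."""
    return cosine(center(u), center(v))

def angle_degrees(c):
    """Angle for a given cosine, clamped to [-1, 1] against rounding error."""
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

# Two-track example: I heard only track 1, you heard only track 2.
me, you = [1, 0], [0, 1]
print(round(angle_degrees(cosine(me, you))))   # raw vectors
print(round(angle_degrees(pearson(me, you))))  # deviations from the mean

# Four-track example with partial overlap.
a, b = [1, 1, 0, 0], [1, 0, 1, 0]
print(round(angle_degrees(cosine(a, b))))      # raw vectors
print(round(angle_degrees(pearson(a, b))))     # deviations from the mean
```

If I have done this right, the raw vectors (1,0) and (0,1) are at 90 degrees, and only their deviations from the mean are at 180 degrees (rho = -1). Likewise the raw four-track vectors are at 60 degrees, and it is the centered versions that are at a right angle (rho = 0).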