Hello everyone. I am making this post following the conversation I have been having in MusicBrainz on IRC. I hope this post can serve as a more central collection of thoughts as well as a place to discuss this feedback in a way that can be more easily followed.
Background
Recently, I was looking into some other MetaBrainz projects after making some contributions to MusicBrainz. After reading about ListenBrainz, I was considering switching away from Last.fm. Before doing this, I looked at one of the public datasets available, specifically an incremental one. I was very surprised that the data is not anonymized in any capacity, with user IDs and usernames included along with the listening data.
I may have had a false impression of how things work, but I believed that actual user information would not be in the dataset, at least not in a public one. I do think that releasing a listens database is useful for statistical analysis, but attaching data that directly identifies users is not necessary for that.
I am also going by how other projects approach aggregate data: information that can be used to identify individuals is often stripped out. As I wrote in the IRC chat, I am not sure what the (current) value is for including user information in the dump aside from spooking users and onlookers.
I am also aware that there are other posts on the forum concerning privacy features for ListenBrainz, such as this one. This is a slightly different subject, so I thought it warrants a separate thread.
The Ask
I would like the public datasets exported by ListenBrainz to be pseudo-anonymous. That is, user IDs and usernames would be stripped from the dataset and replaced with something else, so that the usefulness of the data remains mostly intact.
Reasoning
I view ListenBrainz as two different components for this discussion:
1. A social component, where users are encouraged to track their histories with their friends and gain insights from that data.
2. The public research component, where the aggregated data is available.
For (2), there is no need to have user IDs and usernames directly linked in aggregate data. A random identifier generated for the dataset would probably be sufficient as it allows grouping listens together.
Why pseudo-anonymous? At the current point in time, user profiles are all public. Even if user IDs and usernames were removed from the dataset and replaced with a randomly generated UUID, it would still be possible to identify a user’s history in the dataset. For context, I am following this presentation (slide 6).
Anonymous data:
“…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” (Recital 26, GDPR)
Pseudo-anonymous data:
“…the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.” (Article 4, GDPR)
Using this explanation, the user profile itself serves as the link between the user and the public data. If private profiles ever do become a thing, this can be revisited.
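To make the linking concrete, here is a rough Python sketch of how listens visible on a public profile could be matched against a pseudonymized dataset. Everything in it is an assumption for illustration: the `find_matching_pseudonyms` function, the flat `track` field (the real export nests this under `track_metadata`), and the sample data are all made up.

```python
from collections import defaultdict


def find_matching_pseudonyms(profile_listens, dataset_rows):
    """Return pseudonyms whose listens include every publicly visible listen.

    Hypothetical helper: `profile_listens` is a list of (timestamp, track)
    pairs read off a public profile, `dataset_rows` is the pseudonymized dump.
    """
    by_pseudonym = defaultdict(set)
    for row in dataset_rows:
        by_pseudonym[row["uuid"]].add((row["timestamp"], row["track"]))
    wanted = set(profile_listens)
    # A pseudonym matches if its listens are a superset of the profile's.
    return [p for p, listens in by_pseudonym.items() if wanted <= listens]


# Made-up data: two pseudonyms in the dump, one public profile.
dataset_rows = [
    {"uuid": "aaaa", "timestamp": 1234567890, "track": "Track A"},
    {"uuid": "aaaa", "timestamp": 1234567990, "track": "Track B"},
    {"uuid": "bbbb", "timestamp": 1234567890, "track": "Track C"},
]
profile_listens = [(1234567890, "Track A"), (1234567990, "Track B")]
matches = find_matching_pseudonyms(profile_listens, dataset_rows)
```

With only a couple of timestamped listens, the profile already narrows down to a single pseudonym in this toy data, which is why I call the result pseudo-anonymous rather than anonymous.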
How I Feel
Generally, I would say that I like the idea of ListenBrainz. Public data for research is great. I do think there should be a way to contribute data without having it easily linked back to me, though, at least for the public dataset. I believe that having generated UUID substitutes in place of user IDs and usernames should be doable. With profiles currently public, the data could be linked back, but I do not see much of a point in that, as listening history can be inspected directly without having to look at the dataset. In a way, you could say that the intents are different.
While I will continue with an example in the next section, I also want to state that a version 4 UUID is just an idea. I am not claiming that it is the best solution, only that it is one possible solution.
Example
For example, I would change
{"user_id":123,"user_name":"somebody","timestamp":1234567890,"track_metadata": {…}}
to
{"uuid":"7dcdc012-bc76-4d88-a9af-b7bc1b1798b6","timestamp":1234567890,"track_metadata": {…}}
in the exports. For every export, this UUID should be regenerated. Effectively, this means that a UUID is only useful for grouping a particular user's listens within the same dataset. This might make the incremental datasets themselves less useful, since a user's listens could no longer be connected from one incremental dump to the next, but I believe it is worth considering.
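As a minimal sketch of what I mean, assuming the export is one JSON object per line with the `user_id` and `user_name` fields shown above, something like the following Python would do the substitution. The `pseudonymize_export` function and the sample rows are hypothetical; I have not looked at how the actual ListenBrainz export code is structured.

```python
import json
import uuid


def pseudonymize_export(lines):
    """Replace user_id/user_name with a random UUID that is stable per export.

    `lines` is an iterable of JSON strings, one listen per line. The mapping
    only lives for the duration of one export, so the next export gets fresh
    identifiers that cannot be joined with this one.
    """
    mapping = {}  # user_id -> UUID string, discarded after the export
    for line in lines:
        listen = json.loads(line)
        user_id = listen.pop("user_id")
        listen.pop("user_name", None)
        if user_id not in mapping:
            mapping[user_id] = str(uuid.uuid4())
        yield json.dumps({"uuid": mapping[user_id], **listen})


# Made-up sample rows in the shape of the example above.
export = [
    '{"user_id": 123, "user_name": "somebody", "timestamp": 1234567890, "track_metadata": {}}',
    '{"user_id": 123, "user_name": "somebody", "timestamp": 1234567990, "track_metadata": {}}',
    '{"user_id": 456, "user_name": "someone_else", "timestamp": 1234568000, "track_metadata": {}}',
]
rows = [json.loads(r) for r in pseudonymize_export(export)]
rerun = [json.loads(r) for r in pseudonymize_export(export)]
```

Within one run, both listens from user 123 share a UUID, so grouping still works. Running the export again produces different UUIDs, which is the trade-off for the incremental datasets mentioned above.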
Short Term
In the short term, @aerozol mentioned that I can probably file a ticket to at least have usernames removed from the dataset, as they are unnecessary. I have made a ticket and will leave this post for the UUID discussion.