Ideas for Making the Public Dataset Pseudoanonymous

Hello everyone. I am making this post to follow up on a conversation I have been having in the MusicBrainz IRC channel. I hope this post can serve as a more central collection of thoughts as well as a place to discuss this feedback in a way that can be more easily followed.

Background

Recently, I was looking into some other MetaBrainz projects after making some contributions to MusicBrainz. After reading about ListenBrainz, I was considering switching away from Last.fm. Before doing this, I looked at one of the public datasets available, specifically an incremental one. I was very surprised that the data is not anonymized in any capacity, with user IDs and usernames included along with the listening data.

I may have had a false impression of how things work, but I believed that the actual user information would not be in the dataset, at least not in a public one. I do think that releasing a listens database is useful for statistical analysis, but attaching data that directly identifies users is not necessary.

I am also going by how other projects approach aggregate data: information that can be used to identify individuals is often stripped out. As I wrote in the IRC chat, I am not sure what the (current) value is for including user information in the dump aside from spooking users and onlookers.

I am also aware that there are other posts on the forum concerning privacy features for ListenBrainz, such as this one. This is a slightly different subject, so I thought it warrants a separate thread.

The Ask

I would like the public datasets exported by ListenBrainz to become pseudo-anonymous. That is, user IDs and usernames would be stripped from the dataset and replaced with something else so that the usefulness of the data remains mostly intact.

Reasoning

I view ListenBrainz as two different components for this discussion:

  1. A social component, where users are encouraged to track their histories with their friends and gain insights from that data.
  2. The public research component, where the aggregated data is available.

For (2), there is no need to have user IDs and usernames directly linked in aggregate data. A random identifier generated for the dataset would probably be sufficient as it allows grouping listens together.

Why pseudo-anonymous? At the current point in time, user profiles are all public. Even if user IDs and usernames were removed from the dataset and replaced with a randomly generated UUID, it would still be possible to identify a user’s history in the dataset. For context, I am following this presentation (slide 6).

Anonymous data:

“…information which does not relate to an identified or identifiable natural person or to
personal data rendered anonymous in such a manner that the data subject is not or no
longer identifiable.” (Recital 26, GDPR)

Pseudo-anonymous data:

“…the processing of personal data in such a manner that the personal data can no longer
be attributed to a specific data subject without the use of additional information, provided
that such additional information is kept separately and is subject to technical and
organisational measures to ensure that the personal data are not attributed to an identified
or identifiable natural person.” (Article 4, GDPR)

Using this explanation, the user profile itself serves as the link between the user and the public data. If private profiles ever do become a thing, this can be revisited.

How I Feel

Generally, I would say that I like the idea of ListenBrainz. Public data for research is great. I do think there should be a way to contribute data without having it easily linked back to me, though. At least for the public dataset. I believe that having generated UUID substitutes in place of user IDs and usernames should be doable. With profiles currently public, the data could be linked back, but I do not see much of a point in that as listening history can be inspected directly without having to look at the dataset. In a way, you could say that the intents are different.

While I will continue with an example in the next section, I also want to state that a UUID(4) is just an idea. I am not claiming that it is the best solution, but a solution.

Example

For example, I would change

{"user_id":123,"user_name":"somebody","timestamp":1234567890,"track_metadata": {…}}

to

{"uuid":"7dcdc012-bc76-4d88-a9af-b7bc1b1798b6","timestamp":1234567890,"track_metadata": {…}}

in the exports. For every export, this UUID should be regenerated. Effectively, this means that a UUID is only useful for grouping a particular user within the same dataset. This might make the incremental datasets themselves less useful, but I believe it is a worthy consideration.
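To make the idea a bit more concrete, here is a minimal sketch in Python of what a per-export substitution could look like. It is not taken from the ListenBrainz code base: the function name `pseudonymize_export` is hypothetical, and it assumes one JSON object per line with the `user_id` and `user_name` fields from the example above. The mapping from user ID to UUID is kept only in memory and thrown away when the export finishes, so two exports never share identifiers.

```python
import json
import uuid


def pseudonymize_export(lines):
    """Replace user_id/user_name with a random UUID that is stable within one export.

    Hypothetical helper: `lines` is an iterable of JSON strings, one listen each,
    shaped like the example in this post (an assumption, not the real dump code).
    """
    pseudonyms = {}  # user_id -> UUID string; never written to disk
    for line in lines:
        listen = json.loads(line)
        user_id = listen.pop("user_id")
        listen.pop("user_name", None)  # usernames are dropped entirely
        if user_id not in pseudonyms:
            pseudonyms[user_id] = str(uuid.uuid4())
        # Emit the pseudonym first, followed by the untouched listen fields.
        yield json.dumps({"uuid": pseudonyms[user_id], **listen})


# Made-up sample rows for illustration only.
sample = [
    '{"user_id": 123, "user_name": "somebody", "timestamp": 1234567890, "track_metadata": {}}',
    '{"user_id": 123, "user_name": "somebody", "timestamp": 1234567990, "track_metadata": {}}',
    '{"user_id": 456, "user_name": "someone_else", "timestamp": 1234567895, "track_metadata": {}}',
]

# Each call builds a fresh mapping, so the two exports use different UUIDs.
first_export = [json.loads(row) for row in pseudonymize_export(sample)]
second_export = [json.loads(row) for row in pseudonymize_export(sample)]
```

Within `first_export`, both listens from user 123 share one UUID, so grouping still works. In `second_export` the same user gets a different UUID, which is exactly what makes the incremental dumps harder to stitch together.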

Short Term

In the short term, @aerozol mentioned that I can probably file a ticket to at least have usernames removed from the dataset, as they are unnecessary. I have made a ticket and will leave this post for the UUID discussion.

An idea for incremental datasets: the randomization of the UUID could be opt-out, with a clear enough warning during, say, the registration process. This would most likely result in most people not opting out while still providing the would-be researcher with at least some continuous data.

Of course, there would have to be a way to discern new users with new UUIDs from old users with new UUIDs, which would defeat the purpose of the randomisation if there were very few people who would have opted out.

Ultimately, I believe incremental dumps would probably have less of a reason to exist with the UUID system implemented. Those who wish to draw larger insights on the data would probably be grabbing the full dataset anyway, which is already updated on at least a monthly basis.

UUIDs should be generated at the time of export and never saved, so incremental datasets would have different UUIDs even if the same user is present.

Hi!

Here at MetaBrainz we have a general rule that we apply to our data dumps: If a user can see the data on our webpages, we should include it in the data dumps.

If we were to pseudo-anonymize the dumps, then the following things would likely happen:

  1. Users can no longer find their data in the dumps, nor can they relate any of the data on our site to what is in the data dumps. This makes the data less useful – and our goals are not just to make a user’s data available, but to make our datasets useful on a global scale. Having user names in the data allows 3rd parties to more easily build other tools that complement LB or build their own tools, for… whatever.

  2. Someone would start scraping our web pages in order to collect or reverse engineer the missing data in order to de-anonymize it. I would posit that taking the listens for a user for a day from the web page and then looking for them in the data dumps would make it trivial to de-anonymize the data.

So, if the data is easy to de-anonymize we lose the intended anonymization benefit, but we now take the negatives in that our data becomes harder to use for its intended purposes.
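To illustrate point 2, here is a rough Python sketch of how such a matching could work. Everything in it is an assumption for illustration: the flat `uuid`/`timestamp`/`track` rows are a simplified stand-in for the real dump format, and `find_pseudonyms` is a hypothetical helper, not anything that exists in ListenBrainz.

```python
def find_pseudonyms(dump, known_listens):
    """Return every pseudonym whose listens contain all of the known (timestamp, track) pairs.

    Hypothetical helper: `dump` is a list of simplified rows, `known_listens`
    is what someone could copy from a public profile page.
    """
    listens_by_pseudonym = {}
    for row in dump:
        key = (row["timestamp"], row["track"])
        listens_by_pseudonym.setdefault(row["uuid"], set()).add(key)

    wanted = set(known_listens)
    # A pseudonym is a candidate if the known listens are a subset of its listens.
    return [p for p, seen in listens_by_pseudonym.items() if wanted <= seen]


# Made-up pseudonymized dump rows.
dump = [
    {"uuid": "aaa", "timestamp": 100, "track": "Song A"},
    {"uuid": "aaa", "timestamp": 200, "track": "Song B"},
    {"uuid": "bbb", "timestamp": 100, "track": "Song A"},
    {"uuid": "bbb", "timestamp": 250, "track": "Song C"},
]

# Two listens copied from one user's public profile page.
scraped_from_profile = [(100, "Song A"), (200, "Song B")]

matches = find_pseudonyms(dump, scraped_from_profile)
```

In this toy data only the pseudonym `aaa` contains both listens, so a couple of timestamped listens are already enough to single it out.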

When we started ListenBrainz, we had people from MB and the founders of last.fm hacking with us an entire weekend to stand the project up. One of the most ardent pieces of advice that was given to us was to not allow private listens. The private listens always caused a massive headache for the team and few people ever made use of them. We all quickly agreed that private listens were a bad idea.

We made a conscious decision to make the data public, and decided that people who were not OK with that should choose another service. Yes, we understand that we’re going to lose some users with this decision, but we are dedicated to useful public data, and anyone who wishes to join us in that mission is welcome to do so.

For me as a user the dumps are not only about research. I personally care very little about that. But the fact that dumps are made available is also part of the promise that the service remains open and no one can take it away.

For sure I trust the MetaBrainz Foundation and the people behind it. After all, CDDB becoming closed data is what sparked MusicBrainz. But knowing that even if things should go downhill we, as a community, could use the data to get back* what we contributed to is reassuring.

And given the public nature of the LB data, I would also not see much benefit in an anonymized dump, as all the data is available via the website and API as well.


*) I fully understand that the public dumps cannot contain private information and that getting everything up and running for the users requires additional effort, but the primary data is there.

What would be the purpose of a user trying to locate their data in the dumps when they can view their history via their profile page or export it outright via the settings interface?

I also had this concern when talking in the IRC and among some peers before writing. I concede that someone could go through the effort of finding a user’s data in the dataset but doing so with public profiles is moot right now. Should the feature of private profiles ever be introduced, this becomes a lot more interesting.

The implementation specifics of a private profile do not need to be fully fleshed out but suppose only friends of a user could view their listens. Only they would be able to find the data in the dataset. If they have a listening history before private profiles are introduced, I agree that some extra work would need to be done to sufficiently anonymize the data.

I suppose you could either make private profile listens pseudo-anonymous in the dataset or exclude them entirely.

I believe the dataset still has immense practical value despite the lack of identifying usernames. I would like the intent to be considered in this case. For aggregate data, it is not necessary. In my view, targeting actual users should be a direct action. The only benefit (that I can see) of having usernames and user IDs in the public dataset is to allow mass profiling of every user on the platform directly without lifting a finger. I see a lot of malicious use cases, but no positives.

What the public dataset is intended to be used for is an important question, for sure. Bulk listening data including this much information feels like a purely unnecessary measure.

Maybe it might be in vain for me to try and convince you, but I will try. There are certainly people out there, like me, who really do want to contribute to the public good, but also do it in such a way where there is a little privacy in doing so. I do not care if someone looks at my profile directly at this point to see my listening history, but I would not want my direct identifying information to be in the dataset. Too much outright exposure for no real gain. The dataset is useful without the necessity of directly identifying who is submitting the listens.

Last.fm, from what I can tell (reading their privacy policies on the Internet Archive), makes their datasets pseudo-anonymous. Profiles are mostly public even if recent listening history is hidden. I believe ListenBrainz can at least match this. I would much rather use ListenBrainz over Last.fm as it is open source and not commercial.

I noticed that the server is written in Python. At this point, I am not sure what component of the code deals with exports, but it is something I am willing to at least try to investigate.

I understand your point. However, is there any situation where someone would not be able to export their own data? I highly doubt that it would ever be removed. A user with their own exported listens could then import it.

If one wants to stand up a copy of the website, does it matter if the listens of a user are pseudo-anonymous? It would not be possible to create placeholder accounts as there would be no usernames, but the data itself should have value.

Perhaps a broader question should be: where is the value in the data? The direct user, the listen itself, or something else? In aggregate, it is still possible to group listens together using a generated UUID instead of a username, so it would still be possible to calculate something such as “x number of users are Taylor Swift superfans.” The only thing lost would be who the actual users are, but this is a bulk dataset. In pretty much any other research scenario, one would not expect identifying data to be present in a bulk dataset.
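As a small illustration of that kind of aggregate question, here is a Python sketch that counts "superfans" using only the generated UUID. The flat `uuid`/`artist` rows, the `count_superfans` helper, and the threshold of three plays are all made up for this example and are not the real dump layout.

```python
from collections import Counter


def count_superfans(listens, artist, threshold):
    """Count distinct pseudonyms with at least `threshold` listens of `artist`.

    Hypothetical helper working on simplified rows; no username is needed.
    """
    plays = Counter(row["uuid"] for row in listens if row["artist"] == artist)
    return sum(1 for count in plays.values() if count >= threshold)


# Made-up pseudonymized listens.
listens = [
    {"uuid": "aaa", "artist": "Taylor Swift"},
    {"uuid": "aaa", "artist": "Taylor Swift"},
    {"uuid": "aaa", "artist": "Taylor Swift"},
    {"uuid": "bbb", "artist": "Taylor Swift"},
    {"uuid": "bbb", "artist": "Other Artist"},
    {"uuid": "ccc", "artist": "Taylor Swift"},
    {"uuid": "ccc", "artist": "Taylor Swift"},
    {"uuid": "ccc", "artist": "Taylor Swift"},
]

superfans = count_superfans(listens, "Taylor Swift", threshold=3)
```

Here `aaa` and `ccc` each reach the threshold while `bbb` does not, so the statistic comes out the same as it would with real usernames.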

Another deeply held belief here at MetaBrainz is that we should never restrict access to data and always make it as easy as possible to use our data with the goal of making it as broadly useful as possible. We can never predict all the use cases of this dataset and we shouldn’t – if we limited the use of that data to the things we can imagine, we’d be stifling the use of our datasets. Including as much data as possible to make it as easy as possible to use, is an important key to that.

What you are suggesting hobbles our data for no clear win – if the data can be de-anonymized easily, why bother? It serves literally ZERO purpose other than to make more work for the team. And we’re a tiny team to begin with!

Thanks for the offer, but it’s not about writing the code; it’s about all the maintenance and extra hassles that would come after writing the code. And this proposition is lose-lose – more work for no clear payoff.

Seriously, if this is a deal breaker for you, then please don’t use ListenBrainz. We can’t change the fundamental assumptions about our project now. Especially when it makes no sense to do so.

Hi @Techman,

Technically, ListenBrainz profiles are already pseudo-anonymous. We just ask for an email to sign up, and you can always create a new address or use a disposable, one-off one.

Well, it is easy right now because all profiles are public. If private profiles were ever a thing (which is not a goal but it could technically become possible, per the About page), it would come down to what I said previously. That profile’s listening data could be excluded from the dataset entirely or it could be pseudo-anonymized.

Now that I think about it, it is probably much simpler to exclude that profile’s data from the dataset. It does not benefit consumers of the dataset, but ListenBrainz (the platform) could use it internally.

Another somewhat related question I just thought of: What if there is a class of people that trust ListenBrainz the platform, but not so much the (unknowable) downstream consumers of the data?

Ultimately, I like the idea of this project so I might use it and just accept the risk of what negative things could happen to me. The alternative is entrusting it to a for-profit entity…which is not saying much. Music listening history seems like such a small-scale thing to sweat about, but the risks are real.
