Ideas for Making the Public Dataset Pseudoanonymous

rob · March 4, 2024, 11:30am

Hi!

Here at MetaBrainz we have a general rule that we apply to our data dumps: If a user can see the data on our webpages, we should include it in the data dumps.

If we were to pseudo-anonymize the dumps, then the following things would likely happen:

Users can no longer find their data in the dumps, nor can they relate any of the data on our site to what is in the data dumps. This makes the data less useful – and our goals are not just to make a user’s data available, but to make our datasets useful on a global scale. Having user names in the data makes it easier for third parties to build tools that complement LB or to build their own tools, for… whatever.
Someone would start scraping our web pages to collect or reverse engineer the missing data in order to de-anonymize it. I would posit that taking the listens for a user for a day from the web page and then looking for them in the data dumps would make it trivial to de-anonymize the data.
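
To make the second point concrete, here is a minimal Python sketch of that kind of matching. Everything in it is an assumption for illustration: the row layout (pseudonym, listen timestamp, track name), the pseudonym strings, and the toy data are all made up and do not reflect the real dump format.

```python
from collections import Counter


def link_pseudonym(scraped_listens, dump_rows):
    """Return (pseudonym, hit_count) for the pseudonym that best matches.

    scraped_listens: set of (listened_at, track) pairs copied from a
        public profile page (hypothetical layout).
    dump_rows: iterable of (pseudonym, listened_at, track) rows from a
        hypothetical pseudonymized dump.
    """
    overlap = Counter()
    for pseudonym, listened_at, track in dump_rows:
        # A row counts as a hit when both timestamp and track match.
        if (listened_at, track) in scraped_listens:
            overlap[pseudonym] += 1
    if not overlap:
        return None, 0
    return overlap.most_common(1)[0]


# Toy data: one day of listens scraped from a profile page.
scraped = {
    (1709550000, "Track A"),
    (1709550240, "Track B"),
    (1709550500, "Track C"),
}

# Toy dump where user names were replaced by made-up pseudonyms.
dump = [
    ("u_91f3", 1709550000, "Track A"),
    ("u_91f3", 1709550240, "Track B"),
    ("u_91f3", 1709550500, "Track C"),
    ("u_07aa", 1709550000, "Track A"),
    ("u_07aa", 1709553000, "Track D"),
]

best, hits = link_pseudonym(scraped, dump)
print(best, hits)
```

In this toy case only one pseudonym matches all three scraped listens, and once that link is made, every other row carrying the same pseudonym is tied back to the user.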

So, if the data is easy to de-anonymize, we lose the intended benefit of anonymization, but we still pay the cost: our data becomes harder to use for its intended purposes.

When we started ListenBrainz, we had people from MB and the founders of last.fm hacking with us for an entire weekend to stand the project up. One of the most ardent pieces of advice we were given was to not allow private listens. Private listens had always caused a massive headache for the team, and few people ever made use of them. We all quickly agreed that private listens were a bad idea.

We made a conscious decision to make the data public, and decided that people who were not OK with that should choose another service. Yes, we understand that we’re going to lose some users with this decision, but we are dedicated to useful public data, and anyone who wishes to join us in that mission is welcome.