Any plans to give ListenBrainz users control over their listen data privacy?

I’m asking this question after looking into the ListenBrainz API recently and wondering what thinking went into deciding which endpoints require auth and which don’t (currently read-only endpoints have no auth, while any write endpoints obviously require it).
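For instance, here’s a rough Python sketch of the split as I understand it from the API docs (treat the exact endpoint paths and payload shape as my assumptions): fetching someone’s listens needs no token, while submitting a listen does:

```python
import requests

API_ROOT = "https://api.listenbrainz.org/1"

# Read-only endpoint: no auth header needed, anyone can fetch a user's listens.
resp = requests.get(f"{API_ROOT}/user/some_user/listens", params={"count": 5})
print(resp.json())

# Write endpoint: requires the user's token in the Authorization header.
listen = {
    "listen_type": "single",
    "payload": [{
        "listened_at": 1554076800,
        "track_metadata": {"artist_name": "Some Artist", "track_name": "Some Track"},
    }],
}
resp = requests.post(
    f"{API_ROOT}/submit-listens",
    json=listen,
    headers={"Authorization": "Token YOUR_USER_TOKEN"},
)
print(resp.status_code)
```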

The reason I ask is that I’ve been pondering this point under the Anti-goals section on the ListenBrainz website:

  1. A store for people’s private listen history. The point of this project is to build a public, shareable store of listen data. As we build out our sharing features, building a private listen store will become possible, but that is not part of our goals.

I totally get this philosophy - I wonder, though, if the public-only availability of personal data might be acting as a barrier to people wanting to contribute to ListenBrainz, especially given privacy is such a hot topic right now. Does anyone know what the thinking was behind this? It seems as if allowing some control over public vs private sharing could be good for general adoption and for looking after people’s privacy, and unless I’m misunderstanding something, it feels like the real value is in the data dumps etc. anyway?

For instance, a person might be quite comfortable knowing their anonymised data was being dumped and made open for the public to use to build awesome recommendation engines, but would still prefer to have their personal data visible and usable only by themselves (to start with), or by people they allow (a future feature?).

8 Likes

Ah. I just realised there’s already a Jira ticket about this exact topic, with a very similar question wondering if it’s possible to have a system with the best of both worlds:

4 Likes

I’m not a developer involved in ListenBrainz.

While having private listens is a feature that people clearly want, it is not a goal of ListenBrainz, and it would make things significantly harder for the developers now and severely limit what features can be developed in the future.

MusicBrainz is trying to be more than just a database for tagging your music, and to allow other uses.
The data is mostly licensed under CC0, with some under CC BY-NC-SA.

Things such as the GDPR make things a lot harder, as there are significant penalties if the developers get things wrong. If everything is public there is no problem; if some things are private but not protected well enough, there are going to be problems.

2 Likes

Yep, I get what the goals of ListenBrainz are - I probably should reiterate that my question comes from a place where I’m very much in support of the goals, and really want to see it succeed!

I spent years contributing data to things like last.fm and Google Play Music, only to have it dawn on me that I owned none of the recommendation power outside of each service, and my data could never be used by anyone else to build anything else.

When I think about my own relationship to this data - I’m happy and comfortable with my listen history being publicly available and tied to an online identifier, but I realise that others aren’t always in such a position, and for various possible reasons may feel like they aren’t able to contribute to ListenBrainz for fear of exposing themselves to danger or emotional stress online. They may be keen contributors though!

In the past, we’ve generally said to people “withdraw from the service”, or perhaps even “don’t sign up in the first place!”

One of the great things about things like the GDPR and ‘privacy by design’ is how they aim to challenge this notion that such goals can’t be achieved without sharing every bit of personal data under the sun. It’s essentially saying we should aim to build things where people feel safe to participate online and are looked after, whatever the individual’s vulnerability level or personal situation. Which I think is a cool thing to aim for :slight_smile:

I guess what I’m saying is I don’t think the goals of ListenBrainz, and one or two broad privacy controls are mutually exclusive things.

I agree though it would indeed add a little extra thinking and development effort - it’s tricky stuff! To be clear, I think there’s a good distinction between building a private listen store (which would obviously be awesome!), and having some control over public listen data visibility. If no one could see what I was listening to by visiting my page, or requesting my listen data from the API, but my anonymised data (without the userID) was still going into the large dataset made available for building cool recommendation engines - I think that’s a valuable contribution to ListenBrainz.

6 Likes

One problem I see here is the way of anonymizing the data. I see two options:

  1. Your user name gets replaced for all your listens with the same random string
  2. Your user name gets replaced with a different random string for each of your listens

The first option doesn’t add much to your anonymity. Your listen history is still a single dataset for a user, and it probably would be possible to correlate it back to you. Also, you can already choose a completely random user name today that isn’t linked to you as a person.

The second option undermines the ability to draw conclusions from the dataset, since all your listens are seemingly from a different user (with a listen history of only a single track). In such a dataset you cannot draw conclusions like “people listening to artist A also often listen to artist B”.
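To make the two options concrete, here’s a minimal, purely illustrative sketch (not anything ListenBrainz actually does; the secret key is hypothetical): option 1 maps a user name to one stable pseudonym, option 2 gives every listen its own random ID:

```python
import hmac
import hashlib
import uuid

SECRET_KEY = b"server-side secret"  # hypothetical key, never published with the dump

def stable_pseudonym(user_name: str) -> str:
    """Option 1: the same user always maps to the same random-looking string,
    so their listens stay bundled together in the dump."""
    return hmac.new(SECRET_KEY, user_name.encode(), hashlib.sha256).hexdigest()[:16]

def per_listen_id() -> str:
    """Option 2: every listen gets a fresh random ID, so the dump contains
    millions of 'users' with one listen each and co-listening patterns are lost."""
    return uuid.uuid4().hex[:16]
```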

6 Likes

Yeah, I definitely think you’d aim to leave the user’s listens bundled together. I don’t think it would be particularly easy to correlate back to the user in the first option though?

Let’s say all user data was anonymised in the dataset, but there was a separate control over the visibility of recent listen data which controls what’s seen on the user’s recent listens page, as well as changing all API requests to require a token - then there would be nothing to correlate to for people that needed to keep their data private?

3 Likes

If this privacy option is activated, I think there should be no publicly available view of a user’s listens, and the listens should also show up in recent listens only with the same anonymizing applied. But I would still be concerned that the data can be correlated back to people anyway. I’m not saying having both privacy and a useful dataset isn’t possible, just that it might be more difficult than it seems at first. I’m personally more concerned about fake privacy options that in the end don’t add much than about a lack of keeping the listens private. In the latter case I know what I get, while the former holds bad surprises :slight_smile:

In general I actually share your concerns; I would very much like to see some privacy options here. People are sometimes judged by their musical preferences. In Germany there was a case where a teacher who also fronted an extreme death metal band got fired from his job.

I don’t think requiring an API token for all requests would have much effect on the privacy of the dataset.

3 Likes

ListenBrainz is open source, so people could potentially run their own instance.
If people want to keep their data private, they (or someone else) could run an instance that does not make the listens public.

There are no public database dumps of ListenBrainz (but there is a public API).
What you could do is have a private instance, import the listens from everyone, and do the recommendations on your server.

3 Likes

I guess it depends on your listening story :slight_smile: Given my own, it would probably be reasonably easy to identify me as a Spanish person living in Estonia or the other way around - although it might be harder to trace it to me specifically. Of course, someone probably could use something like the music artists I like on Facebook to correlate further, but if I choose to make that data visible, then I can’t really expect privacy anyway :slight_smile:

3 Likes

Ah, this is the type of situation I was racking my brain to find - good point thanks! :slight_smile:

Regardless, it seems like there’s far less risk of correlating an individual, though yeah, it does seem easy to make some assumptions about groups of users perhaps.

This is essentially what I meant, yeah - all good points.

Agreed on the point that poorly thought through privacy features are possibly worse than none at all, especially where the user thinks it does something it perhaps doesn’t. Thanks for bringing that up - it’s these types of things I was sure must have been canvassed before. Gives some good context.

For what it’s worth, Spotify has had a number of cases similar to your teacher story resulting from their ‘public only’ listen history as well. They’ve got a couple of features buried in settings, such as one that lets you start a 6-hour ‘private listening session’, which seems mostly useless but is interesting nonetheless.

2 Likes

Yes – this was a quite conscious decision from the beginning. As you may or may not know, ListenBrainz got its start during a hack weekend in London. The people who hacked that weekend were a mix of MetaBrainz team members and the hard-core inner group of last.fm founders/hackers.

When the question of private “scrobbles” came up, RJ stated that he really regretted ever allowing private data. In the end it turned out to be a giant pain to maintain these separate code paths and to have to deal with increased privacy issues. And the real kicker? Very few people ever made use of this – the vast majority of people were fine with their scrobbles being public.

That was enough for us to decide to not allow private listens.

However, another part of the story that hasn’t unfolded yet is that we originally used Kafka as a data pipeline for ListenBrainz. The idea was that anyone could hook into the data pipeline and get a live feed of the listens – not just ListenBrainz. One of the paths that we had envisioned (loosely, not well defined, mind you) was that someone could use all of the soon-to-be-enabled ListenBrainz tools to record their listen history, but then not have ListenBrainz actually record that data. Instead they could insert a private hook that allowed them to record their own listens and not share them with anyone else.

Sadly, Kafka was a pain to use and we decided to toss it in favor of more stable tools like RabbitMQ (a wise decision we’ve been quite happy with). But, given that we have RabbitMQ, we can still build these sorts of private data hooks if we want to. However, as foreseen by RJ, the demand for private listens has been quite low so far, so we have not focused on it – instead we’re choosing to focus on the features that will hopefully encourage more people to record their listening histories.

That said, if you feel passionate about private listens, we can work out a scheme similar to what I just described to make it possible. But the bulk of the work would need to be done by others, not the core MetaBrainz team – for the reasons stated above.

Does that make sense?

10 Likes

Thanks for taking the time to write this up! Yeah that does make sense. I spent some time trying to piece together some background from the goals stated around the various MetaBrainz projects and was missing that context - I’m sure this will be useful for people wondering similar things later :slight_smile:.

As I’ve sort of alluded to already, I’ve become interested in those few people who I feel are often excluded somewhat in our systems (all of them, not just digital) for the sake of the vast majority - however I definitely understand the pragmatism around making that call, especially given the risk of having the added complexity hinder the ability to create value quickly.

The private data hooks you’ve mentioned are a pretty exciting thing - I’ve seen a fair amount of chatter begin to happen around that stuff. The idea of users having actual ownership over their data and where it’s stored is awesome. That way people building platforms like ListenBrainz could focus on what they’re attempting to do, and the user just brings their data to it from whatever data storage platform they choose to use. Or something along those lines. That service would be the company that specialises and is ultimately responsible for those extra code paths you talk about, and for risk around false privacy features as mentioned in earlier posts. I can kind of see a sort of marketplace eventually emerging for that sort of thing in the near future. Seems to me like we’d need some good options in that space for the users though before there was serious uptake - so my gut says it’d be a while out before it was useful to anyone. Someone is likely to know much more than me about it though!

I’m definitely interested in private listens done well, not from a selfish perspective (ie “I want services but I don’t want to share”), but coming from the point of view that I think users having control over their online presence is ultimately a good thing, and probably where things are heading longer term. I don’t think I know enough to drive it, or probably have the time right now - but I’m keen to help anyone that comes along wondering the same thing who thinks that they could! :slight_smile:

1 Like

There are weekly exports of ListenBrainz, so a starting point would be to write code that automatically finds the latest export and imports it into a local instance.
http://ftp.musicbrainz.org/pub/musicbrainz/listenbrainz/fullexport/listenbrainz-dump-20190401-000403/
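As a rough sketch (assuming the directory listing keeps its current naming scheme, which is an assumption on my part), finding the newest fullexport dump could look something like this:

```python
import re
import requests

FULLEXPORT_URL = "http://ftp.musicbrainz.org/pub/musicbrainz/listenbrainz/fullexport/"

def latest_dump_dir() -> str:
    """Scrape the plain HTTP directory listing and return the newest dump directory name."""
    listing = requests.get(FULLEXPORT_URL).text
    # Dump directories look like listenbrainz-dump-20190401-000403/
    names = re.findall(r"listenbrainz-dump-\d{8}-\d{6}", listing)
    if not names:
        raise RuntimeError("no dump directories found in listing")
    return sorted(set(names))[-1]  # fixed-width names sort chronologically

print(FULLEXPORT_URL + latest_dump_dir() + "/")
```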

There is an export to BigQuery.
For this to be used for replication you would need an import-time column in the output, so you could query for all public listens added since the last time you ran an import job.

ListenBrainz also has a public API, so you could write a replication method that queries each user for their listens and imports them into your local instance.
This could put a bit of load on the server, so it may not be the recommended approach.
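A minimal sketch of that kind of per-user replication, paging backwards through the public listens endpoint with max_ts (parameter names as I understand the current API; treat the details as assumptions):

```python
import requests

API_ROOT = "https://api.listenbrainz.org/1"

def fetch_all_listens(user_name: str, page_size: int = 100):
    """Page backwards through a user's public listen history via the public API."""
    max_ts = None
    while True:
        params = {"count": page_size}
        if max_ts is not None:
            params["max_ts"] = max_ts
        payload = requests.get(f"{API_ROOT}/user/{user_name}/listens", params=params).json()
        listens = payload["payload"]["listens"]
        if not listens:
            break
        yield from listens
        # Continue from just before the oldest listen seen so far.
        max_ts = min(l["listened_at"] for l in listens)

# Example (be gentle -- every page is a request against the public server):
# total = sum(1 for _ in fetch_all_listens("some_user"))
```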

I have thought about running my own instance and playing around with the data.

Just to add a +1 to privacy features: The primary (or maybe only?) reason I don’t use ListenBrainz is that I don’t want to publish real-time information about myself. I’d be totally happy with my listens being public after a week or so, just not in real time.

1 Like

This sounds even more involved than just a “public: yes/no” option, which is already more involved than what we want to deal with.

What you want could be achieved by having your client(s) cache listens for a week before submitting them to LB, or you could set up an intermediary web cache that would take the listens and delay submission/forwarding to LB by 7 days.
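For example, a small delaying relay could look roughly like this (a sketch under the assumption that your client writes raw listens to a local queue file; the queue file name is hypothetical, while the submit-listens call is the standard one):

```python
import json
import time
import requests

API_ROOT = "https://api.listenbrainz.org/1"
DELAY_SECONDS = 7 * 24 * 60 * 60  # hold listens back for a week
QUEUE_FILE = "pending_listens.json"  # hypothetical local cache written by your client

def flush_old_listens(token: str) -> None:
    """Submit any cached listens older than a week, keep the rest for later."""
    with open(QUEUE_FILE) as f:
        pending = json.load(f)  # list of {"listened_at": ..., "track_metadata": {...}}

    cutoff = time.time() - DELAY_SECONDS
    ready = [l for l in pending if l["listened_at"] <= cutoff]
    still_waiting = [l for l in pending if l["listened_at"] > cutoff]

    if ready:
        requests.post(
            f"{API_ROOT}/submit-listens",
            json={"listen_type": "import", "payload": ready},
            headers={"Authorization": f"Token {token}"},
        ).raise_for_status()

    with open(QUEUE_FILE, "w") as f:
        json.dump(still_waiting, f)
```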

4 Likes