Next AI discussion: AI and our data

rob · May 5, 2023, 12:43pm

Hello!

Matthew, one of our directors has just raised this question:

Datasets as training material for AIs is going to heat up as a political thing and we should have a POV on it as a foundation. While we’re not providing audio files for training, using the other metadata is potentially tricky and valuable to ML/AI companies building models.

I personally see AI companies are a new source of revenue in the future – because the music knowledge in things like ChatGPT leave a lot to be desired. This is a potential positive for us, but we need to ask ourselves:

Would making our data available to AI companies make us complicit in some sort of atrocity in the future?

What do you think?

reosarevok · May 5, 2023, 12:48pm

It is well established by now that (most) AI companies don’t give a shit about data ownership anyway, so as long as there is a way to get a CC0* dump of the data, some of them are going to use it and just not tell us nor support us anyway. As such, not sure there’s much we can do here.

* Or any other license really, CC0 just gives them a legal out but see image AI using clearly copyrighted datasets without consequences.

rob · May 5, 2023, 12:50pm

Correct. We cannot prevent them from using our data and yes, they don’t give a fuck. May they be compelled by law to do so in the future? Possibly.

However, if we did determine that we wish to have our data not be used in AI, we could condemn the use of our data in AI in a public way; that is the most effective thing that I could see us do – anything else would compromise our core values.

reosarevok · May 5, 2023, 12:53pm

I would be in favour of at the very least requiring anyone using our data in AI to sign up and explain what for and wait for us to agree, even if it’s non-commercially. I’m not sure anyone would respect our decision if we say no, but…

rob · May 5, 2023, 1:10pm

Well, if the new datasets page we just put up is enough to catch them, then I am happy to chat with them… and if it is against our values, tell them NO personally.

mfmeulenbelt · May 5, 2023, 1:33pm

If we are happy with the core data being in the public domain (as per the CC0 license), then I don’t see why we would make an exception for AI training. Because I don’t think that MB’s music metadata would be a useful tool to destroy humanity.

The companies behind AIs are probably evil, but are they more evil than other corporations?

aerozol · May 6, 2023, 1:09am

We have a in-house “go suck on a big one AI” attitude that I find fun, but I see a public anti-AI post/stance as a unnecessary risk to MeB (until it becomes necessary).

Not a legal risk or because I want to suck up to AI companies (lol). But with our stuff under CC0 licenses it would be grandstanding, and some of our current users and clients might not appreciate it - they might have quite cool AI plans.

I think encouraging sign ups and dialogue with our dataset users is the way to go, including for AI, so we can at least be in the loop.

On the other hand, no problem with MeB publicly condemning evil AI/dataset use, if someone (@rob?) is feeling the urge to write something pithy!

rob · May 8, 2023, 10:01am

I really don’t – I have no problems with our data being used in AI as long as we generally agree that it isn’t going to destroy the world.

Having a conversational AI that understands MB data would be a really cool tool I wish to use!

“MusicGPT, please make me a 20 track playlist that links Beethoven and Metallica based on artist influences”

Could be fun.

Victini · May 10, 2023, 10:08pm

Machine learning has been here for over a decade now, it’s just that it’s boomed thanks to companies like OpenAI and the accessibility of processors capable of training these advanced ML models (for both regular consumers and large businesses), and now it’s the hit new thing for venture capital firms to throw money at. I just did a cursory search for “musicbrainz machine learning” and there are some papers dating back to 2012 that used MB data one way or the other (mainly with data classification, which makes sense, and the Million Song Dataset seems to be mentioned a lot), so in a way, MB has been used for this industry already. The only thing that’s different now is that training these models have gotten a whole lot easier.

My opinion is that it absolutely depends on what the use case is, really, and even then I’m not sure if the license attached to the database allows that sort of selective approval (I presume not, but lemme know otherwise). As long as it’s not stepping on the careers of musicians and it aligns with the foundation’s values, I can’t see why MeB should disallow the use of its data with AI/ML projects.

If it does step on said musician’s careers, public condemnation of the specific bad actor/s should suffice (unless funding, which becomes hairier). There’s a whole general issue with bad actors using free and open source software and how that should be handled, which is a political issue that goes way beyond the scope of this particular subject but is at least worth mentioning briefly here.