Bot based on Discogs Data

joaopedroantonio · December 10, 2023, 6:15pm

Hi everyone!

First of all, I’m new around, so bear with me if I ask any silly questions! Long story short, I’m a Data Engineer and I’m a music enthusiast; I haven’t had a pet project for some time and a few weeks ago I stumbled upon Musicbrainz again and decided to check if I could contribute.

My proposition is to find ways to extract information from the Discogs data dumps and automatically create edits. There are so many possibilities, I don’t even know where to start.

Here’s an example: if I have a Musicbrainz Artist connected with a Discogs Artist (via a URL), I can extract URLs from the Discogs side and suggest edits in Musicbrainz (for URLs that don’t exist there yet, obviously). I ran this analysis focusing only on Spotify links and found 2217 Spotify links on the Discogs side that don’t exist on Musicbrainz.
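
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual pipeline: the function names, the in-memory dictionaries, and the choice to match only the `open.spotify.com` host are assumptions, and real input would come from parsed Discogs dumps and the existing Musicbrainz URL relationships.

```python
from urllib.parse import urlparse


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop query/fragment and any trailing slash."""
    p = urlparse(url.strip())
    return f"{p.scheme.lower()}://{p.netloc.lower()}{p.path.rstrip('/')}"


def missing_spotify_links(mb_urls_by_artist, discogs_urls_by_artist, mb_to_discogs):
    """Return (mb_artist_id, url) pairs for Spotify URLs that Discogs has
    and the linked Musicbrainz artist does not.

    All three arguments are hypothetical in-memory structures:
      mb_urls_by_artist:      {mb_artist_id: [url, ...]}
      discogs_urls_by_artist: {discogs_artist_id: [url, ...]}
      mb_to_discogs:          {mb_artist_id: discogs_artist_id}
    """
    candidates = []
    for mb_id, discogs_id in mb_to_discogs.items():
        known = {normalize(u) for u in mb_urls_by_artist.get(mb_id, [])}
        for url in discogs_urls_by_artist.get(discogs_id, []):
            norm = normalize(url)
            is_spotify = urlparse(norm).netloc == "open.spotify.com"
            if is_spotify and norm not in known:
                candidates.append((mb_id, norm))
                known.add(norm)  # avoid proposing the same URL twice
    return candidates
```

The output is only a list of candidates; nothing here submits an edit.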

With that said, I already found these links:

I guess what I’m trying to do here is something that looks like a Bot, to automate creation of edits; something to complement the existing userscripts (I haven’t explored them yet).

With all of that said (wall of text, I know), I have a few questions:

Is it a good idea to automatically extract data from Discogs and load it in Musicbrainz? From a license point of view I think it’s fine, am I missing something?
Does anyone have any suggestion on what would be more useful for me to tackle? The example I mentioned regarding URLs is quite simple but I’m not sure how useful it is.
Where can I find other bots that work similarly to what I’m suggesting here? Are there any internal pipelines doing similar work that would make such a bot redundant?
Any other databases / data dumps / APIs besides Discogs that would be interesting to explore?

Best,
João António

IvanDobsky · December 10, 2023, 6:33pm

This is better to do with a human involved to check the data…

The script that many of us use is really well written, but still makes errors.

spUdux · December 10, 2023, 6:36pm

Not completely. Discogs is generally a good reference, but it needs to be analyzed before automatically importing; they have different rules and oftentimes their data is just incorrect.

joaopedroantonio · December 12, 2023, 11:59am

@IvanDobsky @spUdux thank you for the feedback!

I understand your concerns and will take that into account. Maybe I can find a way to have some kind of automation to identify possible edits, but then those edits are reviewed before actually being submitted.
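
One way to picture that "identify automatically, review before submitting" split is a review queue: the automation only writes proposals to a file, and only rows a person has explicitly approved move on. This is a minimal sketch under assumptions; the CSV layout, the column names, and the `approve` keyword are hypothetical, and no actual submission to Musicbrainz is shown.

```python
import csv


def write_review_queue(candidates, out):
    """Write proposed (mb_artist_id, url) pairs with an empty decision column."""
    writer = csv.writer(out)
    writer.writerow(["mb_artist_id", "proposed_url", "decision"])
    for mb_id, url in candidates:
        writer.writerow([mb_id, url, ""])  # left blank for the human reviewer


def approved_edits(reviewed):
    """Return only the rows a reviewer explicitly marked 'approve'.

    Blank or unrecognised decisions are treated as not approved.
    """
    reader = csv.DictReader(reviewed)
    return [
        (row["mb_artist_id"], row["proposed_url"])
        for row in reader
        if (row["decision"] or "").strip().lower() == "approve"
    ]
```

Defaulting to "not approved" means an unreviewed row can never slip through by accident.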

What areas of the Discogs database do you think would be most interesting to explore first? Or, looking from the opposite perspective, what areas of Musicbrainz usually lag behind?

Just so that I have a concrete target (or set of targets) to aim at.

sound.and.vision · December 12, 2023, 3:45pm

Ultimately it’s to get track-level (what we consider recording-level) and release-level credits moved across. However, as someone who was brought up through the ranks with Discogs, hated the way things are generally handled there and is now an MBz convert, I can say that trying to automate this is going to cause a lot of problems.

I 100% get your intentions, they’re good and valid. However there are plenty of examples where I think MBz does a better job than Discogs.

I’m working at the moment, so can’t get into this but will come back later this evening.

The biggest issue and what you’re feeling the kick back on is that Discogs can (like so many other resources online, including MBz) have “dirty data”.

For me, if I’m trying to source credits from Discogs, I always try and x-reference it with what is written on the packaging. Not just trusting Discogs blindly.

sound.and.vision · December 12, 2023, 9:51pm

OK I’m here, keep in mind what is about to be written are my opinions and thoughts. Others can chime in with how they feel.

So firstly let’s address what Discogs is. Discogs is (today) primarily a marketplace for the sale of physical audio media (vinyl, CD, cassettes etc.). To facilitate these transactions, the website has an extensive database of versions of these physical releases wherein sellers can locate their specific edition/variant for sale; one could argue that this data is the secondary purpose of Discogs.

In my view Discogs is not a music database.

If it was, I think they would approach things differently, in a similar vein to how we do with an emphasis on the metadata.

However, as the platform grew over the course of the last decade along with the vinyl revival Discogs cemented itself as the place to research, discover and ultimately buy new music. With this popularity came a wide variety of users - those who were buyers, those who were simply collectors wanting a way to track what they bought (and maybe how much they paid, the condition etc.) and then the few freaks who wanted to document the data. I’d argue that the types of users follow that order, with the most popular being the buyers and the least popular being the freaks and nerds.

It’s of course the information we want from the latter part, the people who like me back in 2009 began to painstakingly add all the bits of metadata from a physical release. However, although Discogs made a few attempts to embrace being a haven for this metadata, many of their attempts failed (see when they tried to establish what we consider recording entities) and were simply reverted. How Discogs handles the tracks on a release has remained largely unchanged in recent years, and there are many of those dedicated freaks and nerds crying out for things that MBz already provides (such as being able to link recording entities to multiple releases).

So why is this a bad thing? Well it’s not inherently bad, you see a large amount of potential data and you want to bring it in.

You mention using a “bot” but I think if you were to have a bot scrape every bit of Discogs and pull it into MBz in a manner of automatic fashion you and your bot would be swiftly shown the door and many of the dedicated and hard working volunteers here would simply crumble and cry.

Why? Because a bot, no matter how much AI or “logic” you put behind it, doesn’t have the ability to smell a rat or see how the data over in Discogs-land might be a better fit for MusicBrainz.

For example, over at Discogs a release’s company credits are a big glob mess. If an album with 12 tracks has tracks 1, 3 and 5 recorded at Recording Studio A, Los Angeles and tracks 2, 4, 6-12 recorded at Recording Studio Z, Nashville there is nowhere for that distinction to be easily made. You’re hoping that whoever submitted that data to Discogs put it in the notes, or maybe they put it in an edit note. Or maybe they put it nowhere, because they’re limited by the way Discogs works and so that lives in the booklet. OK so the release has cover art, but it’s all nasty high-compression 600x600 artwork that Discogs stores.

A bot might come along and go “OK I will take these studios and add them at a release level on MBz” and that strictly speaking isn’t a fault. But it’s also not correct. A human submitter should hopefully go “OK I see that there are two recording studios but no note as to what was recorded where, so I now need to find some high resolution scans of this release, or maybe an article on another website that confirms where the tracks were recorded, or simply consult the physical booklet sitting on my desk”. Either way, once they’ve figured it out, they can add those studio credits to the relevant recordings. Where a piece of music is recorded is not a release-level credit really, it is best suited to a recording-level credit. This also means that if someone later finds that recording appears in an identical manner (which many studio recordings will do) on another release, all of that metadata is there. It does not need to be re-input, it is not now hiding under a specific release. It now appears for every occurrence of that recording.
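
The difference between the two credit levels can be illustrated with a toy data model. This is only a sketch of the idea, assuming simplified `Recording` and `Release` classes invented for the example; it is not the MusicBrainz schema.

```python
from dataclasses import dataclass, field


@dataclass
class Recording:
    """Hypothetical recording entity that carries its own studio credits."""
    title: str
    recorded_at: list[str] = field(default_factory=list)


@dataclass
class Release:
    """Hypothetical release that references recordings instead of copying them."""
    title: str
    tracks: list[Recording] = field(default_factory=list)

    def studios(self) -> list[str]:
        # Derived from the recordings, so nothing is stored per release.
        return sorted({s for rec in self.tracks for s in rec.recorded_at})


# The same recording appears on two releases.
track = Recording("Track 1")
album = Release("Album", [track])
compilation = Release("Compilation", [track])

# Crediting the recording once makes the studio visible on both releases.
track.recorded_at.append("Recording Studio A, Los Angeles")
```

Had the studio been attached to `album` alone, `compilation` would know nothing about it and the credit would need to be entered again.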

I’d argue that first and foremost MusicBrainz deals with recordings. It’s why we can add recording entries without a release. That is really important. Where those recordings appear, and how they work with other entity types like works is secondary.

So there’s that issue, the fact that the datasets for Discogs and MusicBrainz are just different - and the more you look at it, the more different it is. This isn’t just Discogs by the way, the same issue arises when we look to import/scrape data from any source be it streaming platforms, AllMusic, Wikipedia, museum collections, industry databases (like ISWCNet, SoundExchange).

It is why these tools we all use day-to-day to save us precious minutes copying and pasting and wearing out our wrists, are so beloved but can be so destructive.

For example, when I’m importing a digital release using Atisket to scrape information from Spotify, Deezer and Apple Music, it’s rarely a dump and run exercise. There’s always a level of checks that I’m doing to ensure I’m not about to make a big balls up that may go unnoticed for years - this means finding the existing artist entity, existing release groups, finding the correct label, confirming the release date I’ve been given looks legitimate, matching the recordings up to existing ones in the database, selecting the best artwork from these sources, ensuring the artwork matches.

Plenty of people use these tools and don’t make these checks, because they don’t realise it’s going to cause a lot of confusion.

There are a lot of examples already in the database of now giant messes that we volunteers have got to go and clean up, where if the original submitter had just taken a few extra seconds, we could have avoided it altogether.

There are still thousands of CD releases that came in from an import from FreeDB. These releases have nothing to evidence their existence other than the fact they existed in FreeDB. There’s rarely a catalog, any artwork, any identifier of any kind to distinguish them. So we volunteers have to make hard decisions on what happens to these lame bits of data.

I think a lot of people working on this project, paid or volunteer, would say they prize accuracy over quantity. If it was a case of the latter I think someone would have just simply dumped all that data that lives on Discogs, AllMusic, RateYourMusic, Spotify, Deezer, Apple Music, YouTube, SoundCloud, Bandcamp, 45cat and all of the other niche databases out there into MusicBrainz. But it would immediately render the project pointless, you’ve just got a load of data that hasn’t been checked, there’s no way of knowing it was even input correctly on the other platform. The idea of dumping data into MusicBrainz gives me the heebie jeebies. Sometimes I even think I move too fast when submitting things, but at least I’m limited to processing things one at a time; it’s not some run-away batch job doing millions of edits per second where, by the time I realise what I’ve done, I’ve ruined years upon years of hard work and made a decade’s worth of clean-up tasks for not only myself but others around me.

So what do people want from Discogs. Well as mentioned the metadata, the thing that people want the most is to be able to easily “import” credits from there over to here. But as I have already mentioned that seems simple in principle and extremely complex in execution.

At the moment I will go to the relevant Discogs entry, I have a userscript that cross-references the Discogs entity to the MusicBrainz entity so I get helpful little shortcuts, and I will manually copy each credit one at a time.

Is this slow - Yes.

Is this a good thing - Yes. As I have already found “oh this is wrong on Discogs, but I can correct it on MusicBrainz”.

Everyone wants to do things as quickly as possible, and there’s a good reason for that. There’s a mountain of music that we haven’t even begun to document and it’s growing rapidly every single day, and when you look at the competition (Discogs) you may get green with envy and want to copy their homework to try and “catch up”, taking all their mistakes with you.

But I like the pace we move at, because although we do have individuals who come in and try and do things “really quickly” and make a mess, we have dedicated members of the community who spot this in real time and take action.

I believe we will get there in the end, I don’t think humanity will ever stop making music. But as more dedicated individuals come on-board to this project, and understand what the point of all of this is, the amount of information we capture will only increase. At least for the things in the past, that’s a finite number. There are only so many bits of audio recorded and released in a year gone by - it’ll be a matter of time before we have all of that information recorded.

So TLDR, please don’t make a bot that mindlessly dumps data into this project. You can make a tool or a script that aids with the importing of that data but it must do everything in its power to ensure that data coming in has had someone check it. A “one-click” solution will surely be a catastrophe for the quality of our data.

aerozol · December 12, 2023, 11:33pm

Welcome!

I just wanted to add that there are mountains of other fun (and not fun, yay!) things developers (and editors, designers, content creators, etc etc) can contribute to MeB. Data imports are a hairy one to start with…

Come check out the ticket tracker, or chat to the other devs on IRC.

Listenbrainz explore is where fun stuff has been happening lately, perhaps you have ideas for data deep dives that could fit. Or updates to existing ones (Stats art generator could use some more templates).

If you are focussed on automated data imports then it’s going to be very tricky to find something suitable. Probably the last worthwhile one I can think of is when someone imported genre tags from releases with Bandcamp links… tags are reasonably hard to f*^% up, and the BC links were user-added/checked already. All the other mass imports have been chaos and carnage. Maybe someone can think of something suitable, but I can’t right now

UltimateRiff · December 13, 2023, 5:23am

a couple ideas along these lines… could do another Bandcamp genre import, or perhaps a Discogs genre import~ I don’t think we’ve ever done one of these, and there might be some good genres in there too~

apart from tag imports tho, I can’t think of anything offhand…

sound.and.vision · December 13, 2023, 9:02am

One issue is for the longest period they required a sub-genre to be chosen if Electronic was chosen, meaning a lot of music is tagged as “House” and it is nowhere close to that. But as mentioned, tags are tags and aren’t as precious as the real metadata.
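
If a Discogs genre import were attempted, that caveat could be handled by holding back the suspect sub-genres rather than importing them as-is. A minimal sketch, assuming a hypothetical `propose_tags` helper and a hand-picked set of suspect styles (only “House” comes from the point above; the rest of the design is illustrative):

```python
def propose_tags(genres, styles, suspect_styles=frozenset({"house"})):
    """Split Discogs genres/styles into tags to propose and tags to double-check.

    A style is held back for review when the release is filed under
    "Electronic" and the style is in suspect_styles, since it may have been
    chosen only because a sub-genre was mandatory.
    """
    genre_tags = [g.strip().lower() for g in genres]
    style_tags = [s.strip().lower() for s in styles]
    is_electronic = "electronic" in genre_tags

    propose, review = [], []
    for tag in genre_tags + style_tags:
        if is_electronic and tag in suspect_styles:
            if tag not in review:
                review.append(tag)
        elif tag not in propose:
            propose.append(tag)
    return propose, review
```

For a release filed as Electronic with styles House and Ambient, “house” lands in the review list while “electronic” and “ambient” are proposed.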