OK, I’m here. Keep in mind that what follows are my opinions and thoughts; others can chime in with how they feel.
So firstly let’s address what Discogs is. Discogs is (today) primarily a marketplace for the sale of physical audio media (vinyl, CDs, cassettes, etc.). To facilitate these transactions, the website maintains an extensive database of versions of these physical releases in which sellers can locate the specific edition/variant they are selling; one could argue that this data is the secondary purpose of Discogs.
In my view, Discogs is not a music database.
If it was, I think they would approach things differently, in a similar vein to how we do, with an emphasis on the metadata.
However, as the platform grew over the last decade alongside the vinyl revival, Discogs cemented itself as the place to research, discover and ultimately buy new music. With this popularity came a wide variety of users: buyers, collectors simply wanting a way to track what they bought (and maybe how much they paid, the condition, etc.), and then the few freaks who wanted to document the data. I’d argue the user types follow that order, with buyers being the most common and the freaks and nerds the least.
It’s of course the information we want from that last group, the people who, like me back in 2009, began painstakingly adding every bit of metadata from a physical release. However, although Discogs made a few attempts to embrace being a haven for this metadata, many of those attempts failed (see the time they tried to establish what we would consider recording entities) and were simply reverted. The way Discogs handles tracks on a release has remained largely unchanged, and many of those dedicated freaks and nerds are crying out for things that MBz already provides (such as being able to link recording entities to multiple releases).
So why is this a bad thing? Well, it’s not inherently bad: you see a large amount of potential data and you want to bring it in.
You mention using a “bot”, but I think if you had a bot scrape every bit of Discogs and pull it into MBz in some automated fashion, you and your bot would be swiftly shown the door, and many of the dedicated, hard-working volunteers here would simply crumble and cry.
Why? Because a bot, no matter how much AI or “logic” you put behind it, doesn’t have the ability to smell a rat, or to see how the data over in Discogs-land might best fit into MusicBrainz.
For example, over at Discogs a release’s company credits are one big glob. If an album with 12 tracks has tracks 1, 3 and 5 recorded at Recording Studio A, Los Angeles and tracks 2, 4 and 6–12 recorded at Recording Studio Z, Nashville, there is nowhere for that distinction to be easily made. You’re hoping that whoever submitted the data to Discogs put it in the notes, or maybe in an edit note. Or maybe they put it nowhere, because they’re limited by the way Discogs works, and so that information lives only in the booklet. OK, so the release has cover art, but it’s all the nasty, high-compression 600x600 artwork that Discogs stores.
A bot might come along and go “OK, I will take these studios and add them at a release level on MBz”, and strictly speaking that isn’t a fault. But it’s also not correct. A human submitter should hopefully go: OK, I see there are two recording studios but no note as to what was recorded where, so I now need to find some high-resolution scans of this release, or maybe an article on another website that confirms where the tracks were recorded, or simply consult the physical booklet sitting on my desk. Either way, once they’ve figured it out, they can add those studio credits to the relevant recordings. Where a piece of music was recorded is not really a release-level credit; it is best suited to a recording-level credit. This also means that if someone later finds that recording appearing in an identical manner on another release (as many studio recordings do), all of that metadata is already there. It does not need to be re-input, and it is not hiding under a specific release. It now appears for every occurrence of that recording.
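To make the point concrete, here’s a toy sketch (emphatically not MusicBrainz’s actual schema, just two made-up classes) of why a studio credit attached to a recording beats one attached to a release: the credit is entered once, and every release that reuses the recording picks it up automatically.

```python
# Toy model: a studio credit lives on the Recording, and any Release
# that includes that Recording sees the credit without re-entry.
from dataclasses import dataclass

@dataclass
class Recording:
    title: str
    recorded_at: str  # the studio credit lives here, entered once

@dataclass
class Release:
    title: str
    tracklist: list  # the Recording objects this release reuses

# Enter the credit a single time, on the recording itself...
track = Recording("Song A", recorded_at="Recording Studio A, Los Angeles")

# ...and both the original album and a later compilation get it for free.
album = Release("Original Album", [track])
compilation = Release("Greatest Hits", [track])

for release in (album, compilation):
    for rec in release.tracklist:
        print(f"{release.title}: {rec.title} recorded at {rec.recorded_at}")
```

If the credit had been stored on the album instead, the compilation would know nothing about it and someone would have to research and re-input it.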
I’d argue that first and foremost MusicBrainz deals with recordings. It’s why we can add recording entities without a release. That is really important. Where those recordings appear, and how they relate to other entity types like works, is secondary.
So there’s that issue: the datasets for Discogs and MusicBrainz are just different, and the more you look, the more different they are. This isn’t just Discogs, by the way; the same issue arises when we look to import or scrape data from any source, be it streaming platforms, AllMusic, Wikipedia, museum collections, or industry databases (like ISWCNet or SoundExchange).
It is why the tools we all use day-to-day, saving us precious minutes of copying and pasting and sparing our wrists, are so beloved but can be so destructive.
For example, when I’m importing a digital release using Atisket to scrape information from Spotify, Deezer and Apple Music, it’s rarely a dump-and-run exercise. There’s always a level of checking I do to ensure I’m not about to make a big balls-up that may go unnoticed for years: finding the existing artist entity and existing release groups, finding the correct label, confirming the release date I’ve been given looks legitimate, matching the recordings up to existing ones in the database, selecting the best artwork from those sources, and ensuring the artwork matches.
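Those checks are exactly the sort of thing a “one-click” tool skips. As a minimal sketch of what I mean (every function name and field here is hypothetical, not Atisket’s or MusicBrainz’s actual API), an importer could at least flag obvious red flags before anything is submitted:

```python
# Hypothetical pre-import sanity checks: warn, don't auto-submit.
import datetime

def sanity_check(candidate: dict, existing_titles: set) -> list:
    """Return a list of warnings; an empty list means nothing obviously wrong."""
    warnings = []
    # A release year in the future, or before recorded audio existed
    # (the phonograph dates to 1877), is suspect.
    year = candidate.get("release_year")
    if year is None or not (1877 <= year <= datetime.date.today().year + 1):
        warnings.append(f"suspicious release year: {year!r}")
    # Flag a likely duplicate release group before creating a new one.
    if candidate.get("title", "").casefold() in existing_titles:
        warnings.append("a release group with this title already exists; merge?")
    if not candidate.get("label"):
        warnings.append("no label credit; check the source again")
    return warnings

report = sanity_check(
    {"title": "Example Album", "release_year": 2031, "label": ""},
    existing_titles={"example album"},
)
for w in report:
    print("WARNING:", w)
```

None of this replaces a human looking at the source; it just refuses to pretend the data is clean when it plainly isn’t.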
Plenty of people use these tools without making these checks, because they don’t realise the confusion it’s going to cause.
There are already plenty of examples in the database of giant messes that we volunteers have to go and clean up, where if the original submitter had just taken a few extra seconds, we could have avoided it altogether.
There are still thousands of CD releases that came in from a FreeDB import. These releases have nothing to evidence their existence other than the fact that they existed in FreeDB. There’s rarely a catalogue number, any artwork, or any identifier of any kind to distinguish them. So we volunteers have to make hard decisions about what happens to these lame bits of data.
I think a lot of people working on this project, paid or volunteer, would say they prize accuracy over quantity. If it were the other way around, someone would have simply dumped all the data that lives on Discogs, AllMusic, RateYourMusic, Spotify, Deezer, Apple Music, YouTube, SoundCloud, Bandcamp, 45cat and all the other niche databases out there into MusicBrainz. But that would immediately render the project pointless: you’d just have a load of data that hasn’t been checked, with no way of knowing it was even input correctly on the other platform. The idea of dumping data into MusicBrainz gives me the heebie-jeebies; sometimes I think even I move too fast when submitting things. But at least I’m limited to processing things one at a time. It’s not some runaway batch job doing millions of edits per second, where by the time I realise what I’ve done, I’ve ruined years upon years of hard work and created a decade’s worth of clean-up tasks for not only myself but everyone around me.
So what do people want from Discogs? Well, as mentioned, the metadata; the thing people want most is to be able to easily “import” credits from there over to here. But as I’ve already said, that seems simple in principle and is extremely complex in execution.
At the moment I go to the relevant Discogs entry, where a userscript cross-references the Discogs entity to the MusicBrainz entity and gives me helpful little shortcuts, and I manually copy each credit, one at a time.
Is this slow? Yes.
Is this a good thing? Also yes, as I’ve already found myself going “oh, this is wrong on Discogs, but I can correct it on MusicBrainz”.
Everyone wants to do things as quickly as possible, and there’s a good reason for that. There’s a mountain of music we haven’t even begun to document, and it’s growing rapidly every single day. When you look at the competition (Discogs) you may get green with envy and want to copy their homework to try to “catch up”, taking all their mistakes with you.
But I like the pace we move at, because although we do have individuals who come in, try to do things “really quickly” and make a mess, we have dedicated members of the community who spot this in real time and take action.
I believe we will get there in the end; I don’t think humanity will ever stop making music. As more dedicated individuals come on board with this project and understand the point of all of this, the amount of information we capture will only increase. And at least for things in the past, there’s a finite amount. Only so many bits of audio were recorded and released in any year gone by; it’s only a matter of time before we have all of that information recorded.
So, TL;DR: please don’t make a bot that mindlessly dumps data into this project. You can make a tool or a script that aids with importing that data, but it must do everything in its power to ensure the data coming in has had someone check it. A “one-click” solution would surely be a catastrophe for the quality of our data.
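If it helps, the shape of a tool I would be comfortable with is one that refuses to submit anything without an explicit yes from a human, per edit. A rough sketch (where submit_edit is a hypothetical stand-in for whatever API such a tool would wrap):

```python
# A review gate: every candidate edit is shown to a human and only
# submitted on an explicit "y". No confirmation, no submission.

def submit_edit(edit: dict) -> None:
    # Hypothetical placeholder for the real submission call.
    print(f"submitted: {edit['field']} = {edit['value']}")

def review_then_submit(edits, confirm=input):
    """Walk through candidate edits one at a time; return (submitted, skipped)."""
    submitted, skipped = 0, 0
    for edit in edits:
        answer = confirm(f"Apply {edit['field']!r} -> {edit['value']!r}? [y/N] ")
        if answer.strip().lower() == "y":
            submit_edit(edit)
            submitted += 1
        else:
            skipped += 1
    return submitted, skipped
```

The key design choice is that the default answer is “no”: the tool can do all the scraping and pre-filling it likes, but a human decision sits between the source data and our database for every single edit.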