Satisfactory accuracy rate for MusicBrainz database bulk addition (Discogs Artists External link)

mmirG · December 3, 2019, 5:03am

Continuing the discussion from Is there any kind of project to improve holes in MusicBrainz coverage:

What do others think the minimum acceptable accuracy rate for bulk addition of Discogs Artist links is?

Please include an explanation for your suggestion on the rate, and if applicable also explain why your suggested rate would require much higher accuracy than would most likely be achieved by waiting for the links to be added “naturally”.

Or present whatever your view is.

InvisibleMan78 · December 3, 2019, 7:59am

How do you define

If such a bulk addition improves “holes” in MB, there is IMHO no “accuracy” because such data just does not yet exist in MB. If such data is accurate enough for Discogs, why should it not be for MB?

mmirG · December 3, 2019, 9:07am

There is a list of MB Artists.
And there is a list of Discogs Artists.
Editor ijabz understands that each Discogs Artist is matched to a MB Artist, but the External link has not been added on the MB Artist page.

http://reports.albunack.net/mbartist_discogsartist_report2.html
Editor ijabz created this list and writes,
“64,000 artists with possible link to a Discogs artist where it seems highly unlikely any are wrong”.

The question is, “How many of these 64,000 relationships would you require to be correct before you’d agree to the bulk creation of these MB-Discogs Artist links, without each link being checked by a human?”

If there was going to be only 63,999 correct links would that be OK?
If there was going to be only 63,950 correct links would that be OK?
If there was going to be only 60,000 correct links would that be OK?

InvisibleMan78 · December 3, 2019, 9:26am

62’720
( = 98% would be OK for me. Humans make errors too.)
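
For reference, the figures quoted so far can be converted between link counts and percentages. The following is just a small arithmetic sketch; the function names are made up for illustration and the only input taken from the thread is the total of 64,000 candidate links.

```python
TOTAL_LINKS = 64_000  # size of the candidate list quoted in this thread


def accuracy_rate(correct: int, total: int = TOTAL_LINKS) -> float:
    """Return the share of correct links as a percentage."""
    return 100.0 * correct / total


def required_correct(rate_percent: float, total: int = TOTAL_LINKS) -> int:
    """Return how many links must be correct to reach a given percentage."""
    return round(total * rate_percent / 100.0)


# The three thresholds asked about in the question above.
for correct in (63_999, 63_950, 60_000):
    wrong = TOTAL_LINKS - correct
    print(f"{correct:,} correct, {wrong:,} wrong: {accuracy_rate(correct):.3f}%")

# The 98% answer expressed as a link count.
print(f"98% of {TOTAL_LINKS:,} = {required_correct(98):,} correct links")
```

Put the other way round, accepting 98% means accepting up to 1,280 wrong links out of 64,000.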

IvanDobsky · December 3, 2019, 2:37pm

I am not a statistician, so the question is a bit confusing to me. I am a music collector who wants to help the MB project be one of the most accurate databases around.

If you want my opinion, bulk adding of data without human checks is always a bad idea. I’m still cleaning out USA Amazon links from European releases.

Seeing mention of 64,000 links sounds like a worryingly high number to trust to a script to just import en masse to fill some gaps. Especially as it would then not be clear which links were carefully added by a human who was checking their data, and which were just mass imported.

Quality over quantity. The guidelines are there to make sure we check our data. (I’ll go read that thread later in more detail…)

I would favour importing in small CONTROLLED BURSTs, where a single artist is updated and then checked by people who know the details. There are some artists in MB which are not well cared for, and they are chaotic at times from previous changes to the database. I can imagine what would happen if a Discogs import script suddenly pumped lots of fresh data on top of these.

And the more obscure an artist, the scrappier the data is on BOTH databases.

Data needs checking - 64,000 items can’t be checked in any sensible way.

InvisibleMan78 · December 4, 2019, 10:40am

It depends… Imagine a Wikipedia with 100 perfect entries. Nice? I prefer to have a Wikipedia full of entries which can also have errors. Human-made errors. Everyone could fix them. Currently the English part has nearly 6’000’000 articles. There is a total of 301 languages. Will they ever be perfect? Of course not.

For me the situation at MB is very similar. I prefer an additional release and don’t care that much about brackets or apostrophes. If people want to correct them: feel free. But you can’t fix entries that don’t exist.

For a self-defined “open music encyclopedia”, quality should not prevent quantity, IMHO.

elomatreb · December 4, 2019, 12:13pm

A counterpoint here could be the second-largest Wikipedia, the German one, which is infamous for its notability and citation strictness for new articles.

InvisibleMan78 · December 4, 2019, 12:38pm

Which keeps a lot of people from contributing.

ijabz · December 4, 2019, 1:27pm

The suggestion I was making was to create a bot to add these links, therefore:

The bot would not add all 64,000 items in one go; it would abide by the bot limits.
This is only adding a link from a MusicBrainz artist to a Discogs artist, so there are no issues of spelling mistakes et cetera; it would just be adding a link.
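
To make the “not in one go” point concrete, here is a rough sketch of splitting the candidate list into daily batches. The cap of 1,000 edits per day is purely hypothetical (the actual bot limits are not stated in this thread), and the IDs below are placeholders, not real MusicBrainz or Discogs identifiers.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical cap for illustration only; not the real MusicBrainz bot limit.
MAX_EDITS_PER_DAY = 1_000

Link = Tuple[str, str]  # (MusicBrainz artist ID, Discogs artist ID)


def daily_batches(links: Iterable[Link],
                  per_day: int = MAX_EDITS_PER_DAY) -> Iterator[List[Link]]:
    """Yield consecutive batches of at most `per_day` links."""
    it = iter(links)
    while True:
        batch = list(islice(it, per_day))
        if not batch:
            return
        yield batch


# Placeholder data standing in for the 64,000 candidate links.
links = [(f"mb-{i}", f"discogs-{i}") for i in range(64_000)]
batches = list(daily_batches(links))
print(len(batches), "batches of up to", MAX_EDITS_PER_DAY, "edits each")
```

Working in batches like this would also leave room to pause the run and review the edits already made before continuing.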

mmirG · December 4, 2019, 3:38pm

Are you knowledgeable about databases?

Do such people agree about whether adding those 64K links is almost certain to go without any problems?

What would be the probability of anything going wrong with the adding process?

sibilant · December 4, 2019, 9:05pm

Seeing mention of 64,000 links sounds like a worryingly high number to trust to a script to just import en masse to fill some gaps. Especially as it would then not be clear which links were carefully added by a human who was checking their data, and which were just mass imported.

I think we should add all 64,000 Discogs links. This is not a blind process. Don’t forget what the list actually is, and how these specific records were selected. The list was created from very conservative criteria to ensure that the risk of error in any entry is extremely low:

The artist name is unique in both databases.
The artist name in each database is associated with a release of exactly the same title.

These criteria were created (by a person) because they are especially low risk. The criteria seek to eliminate one of the potential variables: an ambiguous artist name. The release title is a further affirmation that the artist is a match.
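
The two criteria can be sketched in code. This is only an illustration and not the query behind the albunack report: the `Artist` tuple, the sample names and IDs, and the exact string comparison of names and titles are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Artist = Tuple[str, str, Set[str]]  # (artist ID, artist name, release titles)


def unique_by_name(artists: List[Artist]) -> Dict[str, Artist]:
    """Keep only artists whose name occurs exactly once in the database."""
    by_name: Dict[str, List[Artist]] = defaultdict(list)
    for artist in artists:
        by_name[artist[1]].append(artist)
    return {name: group[0] for name, group in by_name.items() if len(group) == 1}


def candidate_links(mb: List[Artist], discogs: List[Artist]) -> List[Tuple[str, str]]:
    """Pair artists with a unique name in both databases and a shared release title."""
    mb_unique = unique_by_name(mb)
    discogs_unique = unique_by_name(discogs)
    links = []
    for name, (mb_id, _, mb_titles) in mb_unique.items():
        match = discogs_unique.get(name)
        if match is None:
            continue  # name missing or ambiguous on the Discogs side
        discogs_id, _, discogs_titles = match
        if mb_titles & discogs_titles:  # at least one identical release title
            links.append((mb_id, discogs_id))
    return links


# Made-up sample data.
mb = [
    ("mb-1", "Xipazzo Q. Onslaught", {"First Light"}),
    ("mb-2", "The Same Name", {"Debut"}),
    ("mb-3", "The Same Name", {"Other"}),  # name is ambiguous in MB
    ("mb-4", "Solo Act", {"Only Here"}),
]
discogs = [
    ("dg-1", "Xipazzo Q. Onslaught", {"First Light", "B-Sides"}),
    ("dg-2", "The Same Name", {"Debut"}),
    ("dg-4", "Solo Act", {"Different Title"}),  # no shared release title
]
print(candidate_links(mb, discogs))
```

In this sample only the first artist survives both checks; the ambiguous name and the artist without a shared title are dropped rather than guessed at.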

Let’s consider some possible scenarios that could result in a mismatch:

Scenario 1
Two different artists with the same name also have a release with a shared name. Not impossible, but think about it: In order for this mismatch to happen, MusicBrainz must have Artist A but not Artist B, while Discogs must have Artist B but not Artist A. That coincidence is so improbable, it would actually be really funny if it happened even once!

Scenario 2
The artist appears unique in both Discogs and MB only because both databases attribute releases by different artists to the same entry, and no one has caught the error yet. For example: Imagine that there are actually two artists with the name Xipazzo Q. Onslaught, but everyone thinks there is only one, so they attribute all releases to the same artist in both databases. By linking the artists, MB reinforces the misconception that the record is authoritative. (In library science, this would be called an “authority control” problem.)

In order for this to happen, the two artists would have to be similar enough to confuse editors. Like, they play the same instrument or the same genre, or they were active in the same country during the same time period. In other words, a rather ambiguous entry. I submit that in a case like this, a cursory review by a human editor is unlikely to discover the error, either. It would probably require some careful research or specific knowledge about the music.

These are the kinds of extremes that would result in a mismatch. How many of these do you think there are in the dataset?

psychoadept · December 4, 2019, 11:43pm

Technically, given the limitations on bot edits and the fact that the type of edit being made is already done frequently and involves no changes to existing database queries, I would think nil. The possibilities @sibilant describes seem less remote than a technical problem.

dns_server · December 5, 2019, 2:49am

I have thought about writing this myself but it is something that could go wrong quite easily.

One thing we could do instead is build a parallel database that is built for bots, not humans.
This database would sit alongside and have the information ready for a human editor to check and import the data.
When someone visits the artist page in MusicBrainz there would be a prompt suggesting there is potentially a missing release and allowing you to seed the release editor with this information.
There are some sites already doing this but we could add more features to these to make them easier to use and detect when there are releases missing.

ijabz · December 5, 2019, 7:45am

We are in danger of going off-topic here, as the question was about a very specific task (linking Discogs artists to MusicBrainz artists).

I don’t understand how your database idea is for bots rather than humans; it sounds like it is for humans, because you are just providing data that a human can then use to seed a release. I have already done a version of this with albunack.net: you can browse by artist, and it shows you both MusicBrainz and Discogs albums and whether they are already linked. It then makes it easy to link or import a Discogs release into MusicBrainz. However, although it’s easy to seed the release this way, it is still rather slow, because MusicBrainz is very slow at adding a release, and if an attempt is made to add more than a few at a time there is a good chance MusicBrainz will add the basic release but not the tracks because of a bug in MusicBrainz.

outsidecontext · December 5, 2019, 9:31am

That is basically what we already have with albunack thanks to @ijabz

I think we should not conflate different things here. This is about a very specific import, which IMHO is pretty straightforward, rather safe and does no big harm if there are some mistakes in it.

We all know that importing things like releases or entire artist discographies is much more difficult and much more likely to have quality issues. But this is not the case here.

Llama_lover · December 5, 2019, 10:05am

If I’m understanding this post correctly and a link is all that would happen, it sounds like a great tool. I would suggest that the link be shown in a special color (colour) font or wording to help us immediately recognize it for what it is.

elomatreb · December 5, 2019, 10:19am

A lot of this discussion does seem a little overcautious for something as simple as an external link, I see little reason not to add those and correct them in the cases where they are inaccurate. But for more major edits (e.g. adding entirely new artist or release entities) I still am cautious about mass-auto-edits.

In any case, flagging automated edits in some way would be nice, so that they can be reviewed/monitored as what they are.

IvanDobsky · December 5, 2019, 8:52pm

Thank you for the patronising comment. So are you saying I should not take part in the conversation if I do not work with databases every day of the week? Wow!

@ijabz The main thing I’d ask is something nice and clear in the comments that can be understood by normal editors. Something that links to a thread\description about the import. Just so people know to double check the link. Discogs links can often be taken as perfect when that isn’t always the case.

@sibilant is thinking along similar lines to me. Some of the more obscure corners of MB have some weird \ confused data. Duplicate artists, etc. Wouldn’t want the two sets of confusion to confuse each other.

I know when entering some odd punk bands I have found artists confused on both sides due to how they were shown on the covers. Info I have needed to go and correct in both databases. Information that stood out when comparing both as a whole and cross referencing elsewhere.

It seems a sensible idea, as long as it stands out in the comments. I can also see plenty of checks are already done.

I also trust @ijabz to want, even more than most, to see this done correctly. To you it isn’t just some database experiment, but a source of more accurate data. It comes back to being about the music.

mmirG · December 6, 2019, 11:43am

You misread me. I am trying to gauge how much weight to put on your comments about things going wrong when adding the data. I’ve got absolutely no knowledge in that area.

And congrats on staying cool when you’re responding to a patronising comment.

jesus2099 · December 6, 2019, 1:46pm

I prefer having:

At least 3 common releases, counting singles, albums and EPs only (excluding compilations and live releases, which often have common names).
No release titles containing the following words: best, hits, greatest, anthology, songs, ベスト, コレクション, ヒッツ, ゴールド, ゴールデン, gold, golden
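
A possible reading of these stricter rules, as a sketch only: the `(title, release type)` tuples and the type labels are assumptions, and the word check is a plain case-insensitive substring test, which is coarser than a true whole-word match (it would also reject a title like “Bestiary”).

```python
from typing import Iterable, Set, Tuple

Release = Tuple[str, str]  # (title, release type), e.g. ("Debut", "album")

ALLOWED_TYPES = {"single", "album", "ep"}  # compilations and live releases are left out
EXCLUDED_WORDS = (
    "best", "hits", "greatest", "anthology", "songs",
    "ベスト", "コレクション", "ヒッツ", "ゴールド", "ゴールデン",
    "gold", "golden",
)
MIN_COMMON_RELEASES = 3


def usable_titles(releases: Iterable[Release]) -> Set[str]:
    """Return case-folded titles that pass the type and word filters."""
    titles = set()
    for title, release_type in releases:
        lowered = title.casefold()
        if release_type.casefold() not in ALLOWED_TYPES:
            continue
        if any(word in lowered for word in EXCLUDED_WORDS):
            continue
        titles.add(lowered)
    return titles


def is_strict_match(mb_releases: Iterable[Release],
                    discogs_releases: Iterable[Release]) -> bool:
    """True if enough filtered titles appear in both discographies."""
    common = usable_titles(mb_releases) & usable_titles(discogs_releases)
    return len(common) >= MIN_COMMON_RELEASES


# Made-up sample discographies.
mb_releases = [("Alpha", "album"), ("Beta", "single"), ("Gamma", "ep"),
               ("Greatest Hits", "album"), ("Live in Tokyo", "live")]
discogs_releases = [("alpha", "album"), ("beta", "single"), ("gamma", "ep"),
                    ("Greatest Hits", "album")]
print(is_strict_match(mb_releases, discogs_releases))
```

Requiring three shared titles would shrink the list of candidates, in exchange for a lower chance that a single generic title produces a false match.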

The last time a bot (initials: L.A.) mass-added lots of stuff, it left errors that we are still fixing years later.
OK, it was releases so it is much more damaging than discogs links.