Seeing mention of 64,000 links sounds like a worryingly high number to trust to a script to just import en masse to fill some gaps. Especially as it would then not be clear which links were carefully added by a human who was checking their data, and which were just mass imported.
I think we should add all 64,000 Discogs links. This is not a blind process. Don’t forget what the list actually is, and how these specific records were selected. The list was created from very conservative criteria to ensure that the risk of error in any entry is extremely low:
- The artist name is unique in both databases.
- The artist name in each database is associated with a release of exactly the same title.
These criteria were chosen (by a person) because they are especially low risk. They eliminate one of the potential variables: an ambiguous artist name. The matching release title is further confirmation that the artist is a match.
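To make the two criteria concrete, here is a rough sketch of how such a list could be built. This is not the actual script that produced the 64,000 entries; the record layout (`id`, `name`, `releases`), the function names, and the sample data are all made up for illustration, and it assumes plain exact string comparison with no normalization of names or titles.

```python
from collections import defaultdict


def unique_names(artists):
    """Return {name: record} for names that occur exactly once in one database."""
    by_name = defaultdict(list)
    for artist in artists:
        by_name[artist["name"]].append(artist)
    return {name: recs[0] for name, recs in by_name.items() if len(recs) == 1}


def candidate_links(mb_artists, discogs_artists):
    """Pair artists whose name is unique in both databases and who share
    at least one release with exactly the same title."""
    mb_unique = unique_names(mb_artists)
    dc_unique = unique_names(discogs_artists)
    links = []
    # Only names that are unique on both sides are considered at all.
    for name in sorted(mb_unique.keys() & dc_unique.keys()):
        mb, dc = mb_unique[name], dc_unique[name]
        # Second criterion: an identical release title on both sides.
        if set(mb["releases"]) & set(dc["releases"]):
            links.append((mb["id"], dc["id"]))
    return links


# Hypothetical sample data, not real database entries.
mb = [
    {"id": "mb-1", "name": "Xipazzo Q. Onslaught", "releases": ["First Light"]},
    {"id": "mb-2", "name": "John Smith", "releases": ["Blue"]},
    {"id": "mb-3", "name": "John Smith", "releases": ["Red"]},
    {"id": "mb-4", "name": "Solo Act", "releases": ["Alpha"]},
]
discogs = [
    {"id": "d-1", "name": "Xipazzo Q. Onslaught", "releases": ["First Light", "Other"]},
    {"id": "d-2", "name": "John Smith", "releases": ["Blue"]},
    {"id": "d-3", "name": "Solo Act", "releases": ["Beta"]},
]

links = candidate_links(mb, discogs)
print(links)
```

In this toy data, "John Smith" is dropped because the name is not unique on the MB side, and "Solo Act" is dropped because no release title matches, so only one pair survives. The point is that anything ambiguous never makes it into the list in the first place.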
Let’s consider some possible scenarios that could result in a mismatch:
Two different artists with the same name also have a release with the same title. Not impossible, but think about it: since the name has to be unique in each database, this mismatch can only happen if MusicBrainz has Artist A but not Artist B, while Discogs has Artist B but not Artist A. That coincidence is so improbable, it would actually be really funny if it happened even once!
The artist name appears unique in both Discogs and MB only because each database has lumped releases by two different artists under a single entry, and no one has caught the error yet. For example: Imagine that there are actually two artists with the name Xipazzo Q. Onslaught, but everyone thinks there is only one, so they attribute all releases to the same artist in both databases. By linking the artists, MB reinforces the misconception that the record is authoritative. (In library science, this would be called an “authority control” problem.)
In order for this to happen, the two artists would have to be similar enough to confuse editors. Like, they play the same instrument or the same genre, or they were active in the same country during the same time period. In other words, a rather ambiguous entry. I submit that in a case like this, a cursory review by a human editor is unlikely to discover the error, either. It would probably require some careful research or specific knowledge about the music.
These are the kinds of extremes that would result in a mismatch. How many of these do you think there are in the dataset?