Report showing Acoustids likely to have bad links to MusicBrainz recordings

Fantastic as Acoustid is, there are often spurious links to incorrect MusicBrainz recordings. This is fairly obvious when you look at an Acoustid page, because the valid links will have similar titles and a high source count, whereas the invalid ones will match a completely different title and have a source count of one. However, it is worth noting that many Acoustids have only a single link to a MusicBrainz recording with a source count of one, and these are completely valid.

I have created a report and put it on Albunack. It lists Acoustids that are linked to multiple MusicBrainz recordings where the recording titles are significantly different, one link has a submission count of only one, and another link has a higher submission count. The report is ordered by the good artist credit and the simplified MusicBrainz recording title, so all the Acoustids for a particular artist that seem to have some bad links are grouped together. If you click on a link it takes you to the Acoustid page, where you can disable a bad link.
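As a rough illustration of the selection rule described above (not the actual SQL behind the report), a minimal Python sketch might look like this. The tuple layout, the `titles_differ` helper and its 0.5 similarity threshold are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def titles_differ(a, b, threshold=0.5):
    """True when two titles look unrelated (hypothetical similarity cut-off)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() < threshold


def suspect_links(links, min_good_sources=2):
    """Return (acoustid, mbid) pairs that look like bad links.

    links: iterable of (acoustid, mbid, title, sources) tuples.
    """
    by_acoustid = defaultdict(list)
    for acoustid, mbid, title, sources in links:
        by_acoustid[acoustid].append((mbid, title, sources))

    suspects = []
    for acoustid, rows in by_acoustid.items():
        # Links with enough sources to be trusted as the "good" match.
        good = [r for r in rows if r[2] >= min_good_sources]
        if not good:
            continue  # a lone single-source link is not suspicious on its own
        for mbid, title, sources in rows:
            if sources == 1 and all(titles_differ(title, g[1]) for g in good):
                suspects.append((acoustid, mbid))
    return suspects


# Made-up sample data, for illustration only.
links = [
    ("aid-1", "mbid-a", "Debaser", 40),
    ("aid-1", "mbid-b", "Debaser (live)", 1),
    ("aid-1", "mbid-c", "Chapter 3", 1),
    ("aid-2", "mbid-d", "Gigantic", 1),
]
print(suspect_links(links))
```

In this sample only the "Chapter 3" link is flagged: the live version is close enough in title to the well-sourced link, and the Acoustid with a single one-source link has nothing to be compared against.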

The report is based on Acoustid data from September 2019 (when the last full dump was available), so it is a little out of date, but the testing I have done indicates that the potential bad links found are mostly still enabled in Acoustid. Most of the potential bad links do seem to be bad links, but I would appreciate others taking a look.

Report is here - http://reports.albunack.net/acoustid_report.html

I have written an article on how to use the report on my own forum.

Still a work in progress, but the SQL used to generate the report is here: https://bitbucket.org/ijabz/acoustid-server/src/master/mbmatch2.sql. I have made some adjustments since the initial posting.

A couple of thoughts

I do of course realize that an acoustid can match multiple completely different recordings by different artists, because the markers used by Acoustid to generate the fingerprint can be the same for completely different songs. But if one MusicBrainz link has only been submitted once for a particular acoustid, and yet another MusicBrainz link for the same acoustid has been submitted many times, then I think in most cases the single-submission link is highly likely to be wrong.

It is also true that the chances of two different songs by the same artist having the same Acoustid are virtually zero; this could really only happen if the same song actually has two different names. So I did wonder if I should just restrict the report to cases of acoustids linked to different songs by the same artist?

The code and testing are biased towards English, and may not work as well for non-English titles, especially those in non-Latin scripts.

Interesting to see a report as I often deal with these manually. It is quite common to find an album where everything is “off by one”, i.e. track 1 linked to track 2, and so on. Or where a live album has clearly been associated with the wrong gig. Funniest (and hardest to fix) were a number of seven-CD audiobooks where CDs 2, 3, 4, 5, 6 & 7 had all just been linked to CD 1.

Clearing these out is good. Leaving bad data just attracts more bad matches via Picard (and other tools). And if we sometimes “over scrub” and take out a few correct matches, they will soon be fixed up with good data.

I also find the userscript that highlights duplicated AcoustIDs on a Release very good for spotting this issue.

Any way to get that report to be searchable? Or is it just a case of hunt the first letter of an artist and wade in with flame throwers fully loaded? :fire:

Some of these are peaches: Track "f49af20f-b7c1-4d9d-892e-38e52923474f" | AcoustID

Five DIFFERENT tracks, all different artists, all 9:32 in length, each only has 1 or 2 samples. I’m leaving that for someone else :rofl:

Interesting there is no “Additional user-submitted metadata” listed below. Shows an example of some bad submission tools that don’t leave track details. It is surprisingly common to see submissions like this with common times but totally different tracks.

Good example of one that needs to be skipped. But this mission is GOOD and I am doing my bit on artists I know while I cook some dinner. Usually these stand out clearly and the more duffers are knocked out the cleaner the data is.

Currently it doesn’t show a possible bad match unless there is also a good match for the acoustid with at least 2 sources, but I’m wondering if I should increase this to maybe 5 sources to filter out cases such as this.

No search, so you just have to use your knowledge of the alphabet to pick a page between 1 and 78, and narrow it down from there to find the artist you want. They are just dumb HTML pages so I don’t think there is really a way to do search, but my HTML skills are limited, so maybe there is?

I suppose I could label the index so that instead of a number it shows the first few letters of the first entry on the page, e.g. the first five pages would be !!, 7S, Ac, Ai, Al. Would that be useful?
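A minimal sketch of how such index labels could be derived, assuming the report entries are already sorted and split into fixed-size pages. The function name, the page size and the sample artist names are all hypothetical.

```python
def page_labels(sorted_entries, page_size, prefix_len=2):
    """Label each page with the first few characters of its first entry."""
    return [
        sorted_entries[i][:prefix_len]
        for i in range(0, len(sorted_entries), page_size)
    ]


# Made-up artist names; the real report has far more entries per page.
entries = sorted(["!!!", "7 Seconds", "AC/DC", "Air", "Alabama", "Alice Cooper"])
print(page_labels(entries, page_size=2))
```

With two entries per page this yields one label per page, taken from the first artist on that page.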

Okay, I have done this so that we can concentrate on the most obvious ones; it has reduced the potential bad matches to 78,887.

It isn’t that hard to fish around the alphabet. I am going through some of the Pink Floyd and it is funny how old this list is as I spot many I have already dealt with. It also does show how it is important to work on an artist you know as sometimes tracks have multiple names.

It is not always Black and White due to the way an AcoustID match works. You have to keep an eye on the length of the fingerprint sample too. I’ve seen at least one example where the fingerprint is 23mins long, but only one item linked of that length, but also 11 other tracks of a different length. So some do need extra checking.

I think that could be a safer line to work from. If you have a 2 and a 1 and both the same length it is not clear enough which can be removed without the audio to hand.

The ones that this works best on are those with very high source counts. Maybe it is even worth having a special short list of “over 100 samples”. That will clear a lot of obvious rogues.

On a tangent again about potentially confusing data. I know I spent time a few months back with copies of The Wall where the tracks on a CD are just plain labelled wrong. That will not be obvious in a clean-up like this. So it is important to know the material well.

Oh - and why is it your HTML page does not have the links changing colour after being clicked? This would be REALLY helpful to know which ones I’ve clicked already.

I don’t know, does anyone know?

Oh, those! The worst out there are some x-teen-CD audiobook releases, all with just plain generic recording titles such as ‘track 1’ etc. Of course all the damned ‘track 1’s across the release, and oftentimes across many totally different releases, are then “expertly” matched to a single AcoustID :face_with_symbols_over_mouth:.
I wish I had never seen any of them :smiley:

Your Bootstrap CSS overwrites the default colour of links but does not define a new rule to give already visited links a different one, that’s why they have the same colour (by default):

a {
	color: #007bff;
	text-decoration: none;
	background-color: transparent;
	-webkit-text-decoration-skip: objects;
}

You would need to append an additional rule for a:visited and define a different colour.

Related Bootstrap issue which has been rejected:

Funnily enough it was @mr_monkey who has left a link to his solution there – the world is small :joy:

Thanks!
But because modifying the CSS requires redeploying the whole app, I just tried modifying the HTML as follows:
<a href="http://acoustid.org/track/d1f58778-ab15-421b-af2c-8f146deed15c" target="_blank" style="a:visited {color:green;}"> d1f58778-ab15-421b-af2c-8f146deed15c </a>

I don’t suppose you know what I have done wrong here? I tried without the a: as well:

<a href="http://acoustid.org/track/d1f58778-ab15-421b-af2c-8f146deed15c" target="_blank" style="visited {color:green;}"> d1f58778-ab15-421b-af2c-8f146deed15c </a>

The inline style attribute only allows properties and their values, i.e. style="color: green;" (which is not what you want in this case). But it does not support selectors because you have already “selected” the element to which it should apply by adding it as an inline attribute to it.

Edit: What you could do is to add a <style>a:visited {color:green;}</style> tag directly to the <head> of your generated HTML if you don’t want to or can’t modify the CSS now:

<head>
    <title>
        albunack
    </title>
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="shortcut icon" href="images/albunack_64.png" type="text/png">
    <link rel="icon" href="images/albunack_64.png" type="text/png">
    <meta name="description" content="music metadata api for musicbrainz discogs">
    <meta name="keywords" content="albunack jthink musicbrainz discogs">
    <meta name="author" content="jthink">
    <link href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.1/css/bootstrap.min.css" rel="stylesheet" type="text/css">
    <link href="http://www.albunack.net/style/albunack.css" rel="stylesheet" type="text/css">
    <style>a:visited {color:green;}</style><!-- FIXME: should be moved into http://www.albunack.net/style/albunack.css -->
</head>

Okay, I’ve done it, thanks for your help.

Should answer your question regarding the application not sending Additional data.

I am happy to see this initiative. After a lot of interest last year I gave up trying to clean manually, since it seemed an initial automatic clean-up could be performed as you demonstrate (difference in number of sources, huge difference in duration, …). It may look tough, but honestly just removing all links with fewer than 2 or 3 sources may help :slight_smile: Or, since it was reported that those errors are mostly due to people who submitted their collections en masse without checking, maybe those submissions could be identified and removed.

PS: Removed = totally removing, not just unlinking. That clears the pages, and if a link was legitimate, a new fingerprint upload will make it appear again, whereas when it is only unlinked it won’t. This behaviour may seem strange but is actually useful for similar prints, e.g. the silences on https://musicbrainz.org/release/04c77734-4d5c-4b91-bd0d-27be62f2e67b

Here is an example where just removing the lowest count is not an answer:
https://acoustid.org/track/f8bf526d-aa92-4cf9-9d16-940163002340

I noticed a pattern with a number of Pixies tracks all on the same Release. Here we see “Gigantic” having the same length as “Debaser”. So I chased down the rabbit hole and located the albums in the MB IDs. With a bit of detective work it showed the AcoustID was correct; it was the track title that was wrong. An “off by one” error on a couple of track lists.

It is an example of why we can’t just assume that the lone one is wrong and needs deleting. In this example it was right, just had the wrong name on the recording.

It is also something where AcoustID is really valuable in helping identify the error. Have had issues like this before.

(Note: For the completist, The AcoustID from this list that started that rabbit hole was this one: Track "324f18cb-30c4-4a97-802f-b71566e9d76d" | AcoustID)

@ulugabi - interesting what Picard does. As that MBID is enough, I’ll look slightly differently at these now, but will still doubt a single track on its own. Funny how many I just wiped out tonight where a random Audiobook track had been linked to a music album. A good example of those editors who let Picard mass tag thousands of tracks and then upload AcoustID data without checking.

Has made me laugh working on this old 2019 list tonight. Especially when in the middle of Pink Floyd, I was recognising so many of these I had already sorted out in the last 18 months.

Yes, they are definitely real cases that deserve some time to be investigated. Like the wrong name example, I had something similar with duration: I almost unlinked a track due to a length diff., then realized that it was MBz that was wrong.

The problem is that they are lost among dozens of garbage links next to them. You open a recording and start by removing the one which is 44 min, then the 4 which are off by >30 secs with 1 source, then the 3 with different band names, then you don’t know what to do with the 2 which have a 5 sec diff and only one source without Additional data, … and only after that (when you have lost focus and energy) can you look at those other cases :confused:

Maybe we could split by batches in order to only have to spot similar cases at the same time and slowly reduce the exclusion rules to more sensible cases?

e.g.:

  1. More than 5 mins diff. AcoustID/Recording with 1 source max
  2. More than 5 mins diff. AcoustID/Recording with more than 1 source
  3. More than 30 secs diff. AcoustID/Recording with 1 source max
  4. More than 30 secs diff. AcoustID/Recording with more than 1 source but less than 5
  5. Artist/Title/Duration (1 source) don’t match with Goodmatch (10 sources min)
  6. Artist/Title/Duration (1 source) don’t match with Goodmatch (5 sources min)
  7. Artist/Title (1 source) don’t match with Goodmatch (10 sources min) and only 2 recordings linked
  8. Artist/Title (1 source) don’t match with Goodmatch (5 sources min) and only 2 recordings linked
  9. Title (1 source) doesn’t match with Goodmatch (10 sources) and only 2 recordings linked
    … updating the variables

Then when obvious/easy cases are treated we could look at new patterns to refine the detection?
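The duration-based batches (1 to 4 in the list above) could be assigned with something like the following sketch. It is only an illustration: the function name is hypothetical, and batch 4 is read here as "more than 1 source but fewer than 5", which is an interpretation rather than something stated outright.

```python
def duration_batch(duration_diff_secs, sources):
    """Return the batch number (1-4) for a link, or None if no duration rule applies.

    duration_diff_secs: absolute difference between fingerprint and recording length.
    sources: number of submissions for this AcoustID/recording link.
    """
    if duration_diff_secs > 300:  # more than 5 minutes apart
        return 1 if sources <= 1 else 2
    if duration_diff_secs > 30:  # more than 30 seconds apart
        if sources <= 1:
            return 3
        if sources < 5:  # assumed reading of batch 4
            return 4
    return None  # leave for the artist/title based batches


print(duration_batch(44 * 60, 1))
print(duration_batch(45, 3))
```

Checking the larger difference first means each link lands in the earliest (most obvious) batch that applies to it, so the later batches only contain the less clear-cut cases.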

Regarding the report itself @ijabz, that’s 79k potential wrong links among how many in the dump?
Do you have any contact with the team in charge, to see if there is a way to unlink the wrong ones in bulk or, even better, remove them from the db? (Even a simple file with AcoustID + recording to remove.)

It would be awesome if someone who has had x amount or percent of submissions marked as incorrect could have their submissions removed or downgraded or marked :thinking:

The dump is 15.5M acoustid/mbid pairings. The 79K are only the most obvious errors, though; I could modify the report to show additional possible bad links, and this would probably increase the count to 200K, but here I am concentrating on the most obvious ones.

Note that the SQL simplifies the recording name and groups together acoustid/recording-name links whose names differ, as long as the simplified names match, so a potential bad link only gets into the report if it is the only link from that acoustid to a particular song. The point of the report is to find links that match entirely the wrong song, rather than possibly an incorrect version of the right song; whilst the second case is an issue, it’s not such a big one. Note also that the report lists highly likely bad pairings, but the link to Acoustid will of course show all the pairings for that Acoustid, so there should be an obvious bad pairing on that page, and there may be less obvious ones on that page as well.

There are many releases that have only been submitted by one person and hence only have 1 source for each song. So a link with 1 source can only really be considered a potential bad match if it is an outlier, with the acoustid having links to other mbids with a significantly higher number of sources.
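To illustrate the grouping idea in Python rather than SQL: the sketch below is not the logic in mbmatch2.sql, and the `simplify` rules (strip accents, bracketed suffixes and punctuation) are assumptions chosen for the example.

```python
import re
import unicodedata
from collections import defaultdict


def simplify(title):
    """Hypothetical normaliser: strip accents, bracketed suffixes and punctuation."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[\(\[].*?[\)\]]", "", text)  # drop "(live)", "[remastered]", ...
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def lone_song_links(rows):
    """rows: (mbid, title, sources) links belonging to ONE acoustid.

    Groups the links by simplified title and returns the mbids that are the
    only link to their song, have a single source, and sit next to a song
    group with more sources in total.
    """
    groups = defaultdict(list)
    for mbid, title, sources in rows:
        groups[simplify(title)].append((mbid, sources))
    totals = [sum(s for _, s in members) for members in groups.values()]
    best = max(totals, default=0)
    return [
        members[0][0]
        for members in groups.values()
        if len(members) == 1 and members[0][1] == 1 and best > 1
    ]


# Made-up sample data, for illustration only.
rows = [
    ("a", "Debaser", 30),
    ("b", "Debaser (Live)", 1),
    ("c", "Débaser", 1),
    ("d", "Track 1", 1),
]
print(lone_song_links(rows))
```

Because this normaliser keeps only a-z and 0-9, titles in non-Latin scripts all collapse to an empty string, which is one concrete way such an approach ends up biased towards English.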

I have contacted the Acoustid team, but the team is only one person.
I agree more could be done by Acoustid to tidy things up, but the project just seems to be ticking over these days, with not much in terms of new stuff. I don’t think Acoustid would be keen to delete pairings based on an automated algorithm, but maybe they could be convinced to delete pairings that have already been marked as disabled; that would be useful.

I have hopes that I may be able to get an up-to-date full dump soon, but in the meantime I would suggest you try some artists that you are familiar with but haven’t previously worked on, because for the artists I have tried, 95% of the potential bad matches indicated by the report are still active.
