Report showing AcoustIDs likely to be bad links to MusicBrainz recordings

Oh, those! The worst out there are some x-teen-CD audiobook releases, all with just plain generic recording titles such as 'track 1' etc. Of course all the damned 'track 1's across the release, and oftentimes across many totally different releases, are then "expertly" matched to a single AcoustID :face_with_symbols_over_mouth:.
I wish I had never seen any of them :smiley:

2 Likes

Your Bootstrap CSS overrides the default colour of links but does not define a new rule to give already-visited links a different one; that's why they have the same colour (by default):

a {
	color: #007bff;
	text-decoration: none;
	background-color: transparent;
	-webkit-text-decoration-skip: objects;
}

You would need to append an additional rule for a:visited and define a different colour.

Related Bootstrap issue which has been rejected:

Funnily enough it was @mr_monkey who has left a link to his solution there – the world is small :joy:

4 Likes

Thanks!
But because modifying the CSS requires redeploying the whole app, I just tried modifying the HTML as follows:
<a href="[http://acoustid.org/track/d1f58778-ab15-421b-af2c-8f146deed15c](view-source:http://acoustid.org/track/d1f58778-ab15-421b-af2c-8f146deed15c)" target="_blank" style="a:visited {color:green;}"> d1f58778-ab15-421b-af2c-8f146deed15c </a>

Don't suppose you know what I have done wrong here? Tried without the a: as well:

<a href="[http://acoustid.org/track/d1f58778-ab15-421b-af2c-8f146deed15c](view-source:http://acoustid.org/track/d1f58778-ab15-421b-af2c-8f146deed15c)" target="_blank" style="visited {color:green;}"> d1f58778-ab15-421b-af2c-8f146deed15c </a>

The inline style attribute only allows properties and their values, e.g. style="color: green;" (which is not what you want in this case). It does not support selectors, because you have already "selected" the element it applies to by putting the attribute on that element.

Edit: What you could do is to add a <style>a:visited {color:green;}</style> tag directly to the <head> of your generated HTML if you don’t want to or can’t modify the CSS now:

<head>
    <title>
        albunack
    </title>
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="shortcut icon" href="images/albunack_64.png" type="text/png">
    <link rel="icon" href="images/albunack_64.png" type="text/png">
    <meta name="description" content="music metadata api for musicbrainz discogs">
    <meta name="keywords" content="albunack jthink musicbrainz discogs">
    <meta name="author" content="jthink">
    <link href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.1/css/bootstrap.min.css" rel="stylesheet" type="text/css">
    <link href="http://www.albunack.net/style/albunack.css" rel="stylesheet" type="text/css">
    <style>a:visited {color:green;}</style><!-- FIXME: should be moved into http://www.albunack.net/style/albunack.css -->
</head>
3 Likes

Okay, I've done it, thanks for your help

1 Like

This should answer your question regarding the application not sending Additional data.

I'm happy to see this initiative. After a lot of interest last year I gave up trying to clean manually, since it seemed an initial automatic clean-up could be performed, as you demonstrate (difference in number of sources, huge difference in duration, …). It may look tough, but honestly just removing everything with fewer than 2 or 3 sources may help :slight_smile: Or, since it was reported that those errors are mostly due to people who massively submitted their collections without checking, maybe those could be identified and removed.

PS: Removed = totally removing, not just unlinking, in order to clear the pages; if a link was legit, a new fingerprint upload will make it appear again, whereas when it is only unlinked it won't. This behavior may seem strange but is actually useful for similar prints, e.g. the silences on https://musicbrainz.org/release/04c77734-4d5c-4b91-bd0d-27be62f2e67b

2 Likes

Here is an example where just removing the lowest count is not an answer:
https://acoustid.org/track/f8bf526d-aa92-4cf9-9d16-940163002340

I noticed a pattern with a number of Pixies tracks all on the same release. Here we see "Gigantic" having the same length as "Debaser". So I chased down the rabbit hole and located the albums from the MB IDs. With a bit of detective work it showed the AcoustID was correct; it was the track title that was wrong. An "off by one" error on a couple of track lists.

It is an example of why we can't just assume that the lone one is wrong and needs deleting. In this example it was right; it just had the wrong name on the recording.

It is also something where AcoustID is really valuable in helping identify the error. Have had issues like this before.

(Note: For the completist, The AcoustID from this list that started that rabbit hole was this one: Track "324f18cb-30c4-4a97-802f-b71566e9d76d" | AcoustID)

@ulugabi - interesting what Picard does. As that MBID is enough, I'll look at these slightly differently now, but will still doubt a single track on its own. Funny how many I just wiped out tonight where a random audiobook track had been linked to a music album. A good example of those editors who let Picard mass-tag thousands of tracks and then upload AcoustID data without checking.

It has made me laugh working on this old 2019 list tonight, especially when, in the middle of Pink Floyd, I was recognising so many of these I had already sorted out in the last 18 months.

1 Like

Yes, these are definitely real cases that deserve some time to be investigated. Like the wrong-name example, I had something similar with duration: I almost unlinked a track due to a length difference, then realized that it was MusicBrainz that was wrong.

The problem is that they are lost among the dozens of garbage links next to them: you open a recording, start by removing the one which is 44 min, then 4 which are off by >30 secs with 1 source, then the 3 with different band names, then you don't know what to do with the 2 which have a 5 sec difference and only one source without additional data, … and only after that (when you have lost focus and energy) can you look at those other cases :confused:

Maybe we could split into batches so we only have to spot similar cases at the same time, and slowly relax the exclusion rules to more sensible cases?

For example (a rough query for the first batch is sketched after the list):

  1. More than 5 mins diff. AcoustID/Recording with 1 source max
  2. More than 5 mins diff. AcoustID/Recording with more than 1 source
  3. More than 30 secs diff. AcoustID/Recording with 1 source max
  4. More than 30 secs diff. AcoustID/Recording with more than 1 source but fewer than 5
  5. Artist/Title/Duration (1 source) don't match with Goodmatch (10 sources min)
  6. Artist/Title/Duration (1 source) don't match with Goodmatch (5 sources min)
  7. Artist/Title (1 source) don't match with Goodmatch (10 sources min) and only 2 recordings linked
  8. Artist/Title (1 source) don't match with Goodmatch (5 sources min) and only 2 recordings linked
  9. Title (1 source) don't match with Goodmatch (10 sources) and only 2 recordings linked
    … updating the variables
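A rough sketch of what the first batch could look like as a query, assuming the dump has been imported next to a local copy of the MusicBrainz recording table - the recording table, its length column and the duration column on acoustid_track are assumptions here, not the real schema:

-- Batch 1 sketch: pairings more than 5 minutes away from the linked recording,
-- with at most 1 source. Items marked hypothetical will differ in the real import.
SELECT t.gid            AS acoustid,
       m.mbid           AS recording_mbid,
       m.submission_count,
       t.duration       AS acoustid_secs,        -- hypothetical column
       r.length / 1000  AS recording_secs        -- hypothetical local MB import
FROM   acoustid_track_mbid m
JOIN   acoustid_track      t ON t.id  = m.track_id
JOIN   recording           r ON r.gid = m.mbid
WHERE  NOT m.disabled
AND    m.submission_count <= 1
AND    abs(t.duration - r.length / 1000) > 300;  -- more than 5 minutes apart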

Then, when the obvious/easy cases are treated, we could look at new patterns to refine the detection?

Regarding the report itself, @ijabz, that's 79k potential wrong links among how many in the dump?
Have you any contact with the team in charge, to see if there is a way to massively unlink the wrong ones or, even better, remove them from the DB? (Even a simple file with AcoustID + recording to remove.)

It would be awesome if someone who has had a certain amount or percentage of submissions marked as incorrect could have their submissions removed, downgraded or flagged :thinking:

1 Like

The dump is 15.5M acoustid/mbid pairings; the 79K are only the most obvious errors, though. I could modify the report to show additional possible bad links, and this would probably increase to 200K, but here I am concentrating on the most obvious ones.

Note the SQL simplifies the recording name and groups together acoustid/mbrecordingname pairings that differ, as long as the simplified name matches, so a potential bad link only gets into the report if it is the only acoustid/mbrecordingname link to a particular song. So the point of the report is to find links that match entirely the wrong song, rather than possibly an incorrect version of the right song; whilst the second case is an issue, it is not as much of one. Note that the report lists highly likely bad pairings; however, the link to AcoustID will of course show all the pairings for that AcoustID, so there should be an obvious bad pairing on that page, but there may be less obvious bad pairings on that page as well.
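Roughly, the grouping step looks something like this - just a sketch, not the actual report SQL; simplify_name() and the local recording table are stand-ins for whatever the real import uses:

-- Sketch: sum sources per AcoustID / simplified recording name so that
-- different versions of the same song collapse into one row.
-- simplify_name() is a hypothetical function (lower-case, strip punctuation,
-- words-to-digits, ...) and "recording" is a hypothetical local MB import.
SELECT t.gid                   AS acoustid,
       simplify_name(r.name)   AS simplified_name,
       count(*)                AS linked_recordings,
       sum(m.submission_count) AS total_sources
FROM   acoustid_track_mbid m
JOIN   acoustid_track      t ON t.id  = m.track_id
JOIN   recording           r ON r.gid = m.mbid
WHERE  NOT m.disabled
GROUP  BY t.gid, simplify_name(r.name);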

There are many releases that have only been submitted by one person and hence only have 1 source for each song. So a link with 1 source can only really be considered a potential bad match if it is an outlier, with the AcoustID also linked to other MBIDs with a significantly higher number of sources.
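In query terms that outlier test might look something like this - again only a sketch, assuming the grouping above has been saved as a view called name_groups (a made-up name) and using an arbitrary threshold:

-- Sketch: keep a 1-source name group only when the same AcoustID also has
-- another name group with far more sources.
SELECT g.*
FROM   name_groups g                  -- hypothetical view from the sketch above
WHERE  g.total_sources = 1
AND    EXISTS (SELECT 1
               FROM   name_groups o
               WHERE  o.acoustid = g.acoustid
               AND    o.simplified_name <> g.simplified_name
               AND    o.total_sources >= 10);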

I have contacted the AcoustID team, but the team is only one person.
I agree more could be done by AcoustID to tidy things up, but the project just seems to be ticking over these days, not much in terms of new stuff. I don't think AcoustID would be keen to delete pairings based on an automated algorithm, but maybe they could be convinced to delete pairings that have already been marked as disabled; that would be useful.

I have hopes that I may be able to get an up-to-date full dump soon, but in the meantime I would suggest you try some artists that you are familiar with but haven't previously worked on, because for the artists I have tried, 95% of the potential bad matches indicated by the report are still active.

1 Like

In that list if tracks have the same band and track name, but are off by 30 secs, I will not remove them. Not on this kind of trawl. I’d want to hear them. Maybe there is crowd noise, or a chunk of silence on the end of a tape.

If the track name differs I will check whether the length is off too. That Pixies example above stood out to me as I have seen that pattern before: track listing errors in MB that have got out to people's MP3/FLAC files, which then get fed back in via Picard. (These are some of my favourite items to pick up on checks like this.)

Those with different band names but the same track name I'll check in case it is a cover, or a related artist (I have seen Syd Barrett tracks listed under Pink Floyd before). It's why I focus on bands I know and stay well away from classical.

The only ones I am 100% happy with are when I see 100+ samples for a track, and one sample for a totally separately named track with a different artist.

I am always wary of automation. During this trawl I have spotted tracks that would have been incorrectly removed by a script. Disabling is a good, safe way to clean up a database as it is reversible.

With Pink Floyd the Zabriskie Point soundtrack is something that would break all scripts. Same tracks have been repackaged many times, but the band name changes, the track names change, and sometimes the samples from the film are different lengths. (Had the same with some Blade Runner soundtracks last month) This leads to single examples that look rogue to a script, but a human check will spot them as legit.

I am just starting with a few more obvious artists as I jump in on letters. It is a kinda nice reminder to see so many already disabled from earlier manual trawls of mine.

Surely a cover of the same song would resolve to a different AcoustID, and therefore if there is a cover of a song with a source count of 1 alongside a link to the actual song, it would surely always be a bad link?

This is true, and it is good to be careful. However, regarding this particular list, for all these pairings there would have to be only one source for that particular ArtistName/Simplified Track Name combination. I would have thought that for all valid Pink Floyd combinations you would get more than one source, bearing in mind that all combinations of ArtistName/Simplified Track Name are summed together even if they point to different MB recordings (different versions of the same song). Do you have an example AcoustID in the report that contradicts that?

If we can get sufficient interest in working on this, it might be an idea for people to post the artists they have processed; I don't currently have any automated way to do this in the report.

It should do, and usually it will be unique. But some covers can be pretty spot-on identical, especially with electronic music.

It is one of those areas where I get a bit of doubt when I see a dozen other AcoustIDs added the same, so I will not touch it. If it is only 1, then it can likely be deleted. If there are a dozen fingerprints, then I'd want to know why.

My usual test is to click through on that cover recording and see if it has other acoustIDs. If it does, then I have a fingerprint to feed into the compare.

It can be tricky when you are in the depths of the bootlegs as you may not have many examples of AcoustIDs for the more rare bootlegs. Often I am working on something with only 3 or 4 fingerprints submitted.

I'll have to look for more "rule breakers". Lifting the test to over 5 was a good start for clearing some of these. The higher the gap between the numbers of samples, the easier it gets. Get to 500 and you are in bot territory.

The problem with bootlegs gets messier when recordings have not been merged. I was working on some Blade Runner bootlegs last month where the same tracks were repeated on multiple bootlegs, but they had different artists attached. Now all of them were correct; it is just how each bootlegger had named them. This meant that before the merge you had three recordings, sharing AcoustIDs, sharing names, but with different artists. Some of those bootlegs were more common so had a higher count of submissions. This would have led to a case where really they all needed to be merged, but it would have looked like a "badly linked cover version" to a script.

Maybe this is a bootleg thing… Let me see if I can find an example in the report where I haven’t already merged the recordings :smiley: (I am good at finding Edge Cases in any test algorithm… maybe cos I am an Edge Case myself :crazy_face: :rofl:)

No real need, as that stands out anyway. When you wade into an artist and already see the first few fixed, it gets obvious quite quickly. I am not the only one who has been fixing up Pink Floyd in the last 18 months. Even though there were a LOT in this list, probably over half of them had already been done. It is not just this list that has sent people into Fix the AcoustID sessions.

1 Like

OK, so the good news is I have now managed to import the AcoustID daily updates required for this report and have rebuilt the report, so the number of potential bad cases has gone up from 78,000 to 117,000.

The bad news is that the daily updates don't seem to be properly reporting when pairs are disabled.

I thought it might just be a very recent issue, so I looked at the first Pink Floyd record in the list and saw you disabled it on 23rd June 2020 - Track "0161da2e-85d6-435a-8978-0c15a8df3df5" | AcoustID

So I downloaded the files for that day and searched for the MBID in the file:

grep 58ce3b51-da28-4bdd-9279-56eb6c16ac30 2020-06-23-track_mbid-update.jsonl

but no matches

I then looked for any disabled records for that day, and there were some; here are the first two:

2020-06-23-track_mbid-update.jsonl:{"id":2263529,"track_id":8895037,"mbid":"a78d43e5-4724-4b57-ac8f-aff521f97d97","submission_count":1,"disabled":true,"created":"2011-08-26T13:46:22.519936+00:00","updated":"2020-06-23T08:57:19.723022+00:00"}
2020-06-23-track_mbid-update.jsonl:{"id":10827664,"track_id":18475023,"mbid":"03e02a24-1d09-45cb-92d4-32a84f36b37d","submission_count":1,"disabled":true,"created":"2014-06-03T11:59:18.380241+00:00","updated":"2020-06-23T13:38:22.6228

So I looked up the track to get the track gid:

jthinksearch=# select * from acoustid_track where id=8895037;
   id    |            created            |                 gid                  | new_id
---------+-------------------------------+--------------------------------------+--------
 8895037 | 2011-08-19 12:45:49.831943+00 | eeb951c4-c47b-4087-a23e-2e9b3ffa19b7 | \N
(1 row)

https://acoustid.org/track/eeb951c4-c47b-4087-a23e-2e9b3ffa19b7

and we can see that although a78d43e5-4724-4b57-ac8f-aff521f97d97 was disabled for this AcoustID, it was done on Jan 09, 2015 17:40:54 (by a bot), not on 2020-06-23T08:57:19 as the line in the file implies.

So I don’t see how to find out when a pairing has been disabled, is this a bug or misunderstanding on my part, if anyone understands this please let me know.

I also double-checked that the track I wanted wasn't mentioned in the file at all; for this I needed to find the internal track id:

jthinksearch=# select * from acoustid_track_mbid where mbid='58ce3b51-da28-4bdd-9279-56eb6c16ac30';
    id    | track_id |                 mbid                 |            created            | submission_count | disabled
----------+----------+--------------------------------------+-------------------------------+------------------+----------
 16459345 | 10158239 | 58ce3b51-da28-4bdd-9279-56eb6c16ac30 | 2018-06-16 15:10:20.894442+00 |                1 | f
  1304057 | 10158206 | 58ce3b51-da28-4bdd-9279-56eb6c16ac30 | 2011-08-22 11:24:13.782687+00 |               33 | f
  8811442 | 23987646 | 58ce3b51-da28-4bdd-9279-56eb6c16ac30 | 2013-08-20 02:16:34.453068+00 |                1 | f
  7700873 | 21655210 | 58ce3b51-da28-4bdd-9279-56eb6c16ac30 | 2012-12-31 15:57:04.133796+00 |                1 | f
(4 rows)

jthinksearch=# select * from acoustid_track where id in (10158239,23987646,21655210);
    id    |            created            |                 gid                  | new_id
----------+-------------------------------+--------------------------------------+--------
 10158239 | 2011-08-22 11:24:13.782687+00 | 0161da2e-85d6-435a-8978-0c15a8df3df5 | \N
 21655210 | 2012-12-31 15:57:04.133796+00 | dbbcfc6b-35ee-4cef-90d0-bd587b32a57a | \N
 23987646 | 2013-08-20 02:16:34.453068+00 | 102dea03-fe1b-4525-abaa-8f4c5fb44b7c | \N
(3 rows)

grep  10158239 *

returns no results for that day
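For reference, the two lookups above can also be combined into a single join over the same local tables; a small sketch using only the columns shown in the psql output:

-- Sketch: all pairings for an MBID together with the AcoustID track gid,
-- the disabled flag and the submission count, in one query.
SELECT m.id, t.gid AS acoustid, m.mbid, m.submission_count, m.disabled, m.created
FROM   acoustid_track_mbid m
JOIN   acoustid_track      t ON t.id = m.track_id
WHERE  m.mbid = '58ce3b51-da28-4bdd-9279-56eb6c16ac30';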

The only other good news is that any links visited for the previous report in your browser will still show as visited for this report, so you can see what you have already looked at.

1 Like

A post for awkward edge case examples. I’ve hidden it due to waffle. I’ll also add more oddities here as I find them. Examples that could trip up a bot. Yeah, I enjoy finding Edge Cases to break automation :rofl:

Lots of examples

All of these have already been fixed - but are interesting examples.

A few of the odd ones that needed the brain engaged.

The simple stuff for a bot is anything with a big gap in samples (20 to 1 and above), especially when there is only 1 sample, when times are different, when track names and artists differ, when lots of fingerprints are the same length, and lots of "additional user-submitted metadata" match. All in the "no brainer" category and easily culled.

These next ones needed thought.

Track "8db860df-6d81-4ab1-8ea2-b7fdaae80a52" | AcoustID
Sometimes you just look and wonder how a mess like that happens. I stripped out another three this time, but many more random ones had been connected and removed before.

https://acoustid.org/track/9c9bb9ca-9e04-41f8-a028-1c9b040bdfc6
That looks like the same track multiple times, but one of the titles is in Icelandic instead of English. As the length is the same, this is likely to still be the same track and not to be removed. How does a bot spot that?

Track "b5446faa-9a73-43b2-9bb1-da2b44b9f202" | AcoustID
Track "ba53e1da-24b6-4635-8918-65a8d7351617" | AcoustID
Variation on a theme - how do you spell "part 2"? Or a medley by a different name.

Track "f812ba4a-2de2-48c7-a09a-14944c07f9ca" | AcoustID
Or what do you do with hippies who forget what they call a track? Pretty sure both names used on this are the same recording.

Track "433732e4-884f-4b90-89dc-3240520ec4d1" | AcoustID
Or what happens when you have variations on a track name. Basically - don't let a bot drop something of the same length.

Classical takes those language variations to another level of confusion.

Track "9606be1e-b3d3-4a66-bc3c-2f39db233aad" | AcoustID
One track has 11 samples, the other has 5 samples. It is the 11-sample track that gets culled.

Why? Research. First the length and other data samples. Next, look through the recordings and other AcoustID samples. It seems to support that there are mislabelled bootleg tracks out there. (This would be skipped by a bot if a nice wide margin is kept between samples.)

Track "62c07a28-a25b-4701-9c95-2d004643360e" | AcoustID
Can we just have a filter that picks out tracks called "track nn" with one sample? So, so many of these audiobooks randomly attached to stuff.

Track "2ef97208-f2f4-4732-a6d9-61aa2d92c313" | AcoustID
A Motörhead track picks out the example of instrumental versions sharing AcoustIDs but with different names and lengths. Similar happens with 5.1 mixes, mono, etc. - versions that share AcoustIDs but are not the same recordings. In this case it is complicated by it being a longer track too. A bot will need some kind of "match first half of the track name".

Motörhead is a funny artist to go through. It is usually obvious, a couple of dozen samples, and I bet it is one person who uploaded most of the errors from a boxset that misaligned. A bot would have done well on Motörhead as the pattern was so clear and easy to check.

Track "2cec8545-5e14-4927-bf7f-095088ef7e7e" | AcoustID
An interesting issue - remixes. When remixes are on a 12" they are often named after the remix (Drop the Break) and not the original track name. This release also has a messy combination of artist names that are bot unfriendly.

Track "64ae8c73-c98a-4004-b0b2-7a8be6d0df58" | AcoustID
Ever wondered how you get a fingerprint of 4:16 but the track shows 0:39? That is what happens when the recording has two incorrectly merged recordings put together. That recording was a 4:16 and a 0:39 - bad data causes confusion.

1 Like

5 posts were merged into an existing topic: Blocking submissions from “bad submitters”?

You could look at the work the recording is linked to; if both recordings are linked to the same work then it is the same song. Although in this case it appears this is actually not just a change in the title - it is sung in Icelandic - and looking at the other pairings I think the incorrect pairing may be the English version - https://musicbrainz.org/recording/a616960a-4327-4359-82b0-4692417ee4a5

Part Two and Part 2 are already handled in the report, because we replace text numbers with digits as part of the simplification, but it doesn't do roman numerals yet.
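If roman numerals ever get added, one crude option inside the same simplification step would be a chain of word-boundary replacements covering just the small values that actually appear in track titles - a sketch only, not what the report currently does:

-- Sketch: map small roman numerals to digits during simplification.
-- Only covers a few values; anything fancier needs a proper parser.
SELECT regexp_replace(
         regexp_replace(
           regexp_replace(lower('Shine On You Crazy Diamond, Part IX'),
                          '\mix\M', '9'),
           '\miv\M', '4'),
         '\mii\M', '2')                 -- ...and so on for the other numerals
       AS simplified_title;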

Again, looking at the works could identify the same song where the name is different.

But it wouldn’t work for the medley, all I can think is to eliminate when songs lengths are similar and over a certain length.

In theory we could use the work again, but there isn't one in this case. I can only think you could construct some sort of manual exception list to handle such a case, but that would require user knowledge.

In the report we simplify the recording name and discount similar names using things such as Levenshtein distance, but that hasn't worked in this case.
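For that distance check, Postgres ships a levenshtein() function in the fuzzystrmatch extension, so a near-match test can stay in SQL; the threshold here is arbitrary and just for illustration:

-- Sketch: treat two simplified names as the same song when the edit
-- distance between them is small.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT levenshtein('astronomy domine', 'astronomy dominie') <= 2 AS close_enough;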

The report only picks up cases where the potential bad match has only one source, so this isn't in the report anyway. I think restricting to one source is a good starting point, but looking at this example, comparing the track length of the MB recording with the fingerprint would give a way to avoid picking the wrong one if the right one only had one source.

These are dumb web pages, so I can't implement a filter that would work over all the pages as a whole. What I could do is create a separate report for this.
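A separate report for those generic audiobook titles could be as blunt as matching the title against a "track <number>" pattern - a sketch, reusing the hypothetical local recording import from the earlier sketches:

-- Sketch: single-source pairings whose recording title is just "track <nn>".
SELECT t.gid AS acoustid, m.mbid, r.name, m.submission_count
FROM   acoustid_track_mbid m
JOIN   acoustid_track      t ON t.id  = m.track_id
JOIN   recording           r ON r.gid = m.mbid    -- hypothetical local MB import
WHERE  NOT m.disabled
AND    m.submission_count = 1
AND    lower(r.name) ~ '^track\s*\d+$';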

Interesting.

So in summary, I can improve the report to filter out some flagged matches that aren't actually bad matches. But of course it all takes time; I'm more inclined to do this the more people are making use of the report.

I'm not sure if these are suggestions to improve the report for manual unlinking, or if you are suggesting that we could possibly use the data behind the report as a feed into an automated bot that actually unlinks the records in AcoustID.

1 Like

I didn’t expect each to get a reply. :smiley: It was a list to put some of the oddities together to help refine the report (or reports). I am not knocking your excellent work, just showing examples that break the pattern to help improve the report script. As you noticed, many of those have their own patterns

Maybe multiple types of smaller reports would give better focus. Smaller sub-sets would also get more eyes looking I expect.

I am no fan of automatic bots making decisions due to the error rate that gets introduced. I don’t like @aerozol’s idea of deleting all of one user’s data because of a few errors. That could only work if it can focus on a date range - that can be checked. Picking on people is the wrong focus.

Unless it is an auto bot that removes all those random Stephen King audiobook additions. :rofl:

No problem, I realized that.

I may well do that when I have improved the existing report; if you have suggestions for specific reports I can consider them.

AcoustID.org adds more than 10,000 AcoustIDs daily to its database. Do you expect @IvanDobsky and others to process your reports from now on for all eternity?
What I meant was: AcoustID.org should fix it at the source, right after submission, or even not accept / reject obviously wrong submissions at all.

1 Like