Report showing acoustids likely to be bad link to musicbrainz recordings

IvanDobsky · September 16, 2021, 9:03am

In that list if tracks have the same band and track name, but are off by 30 secs, I will not remove them. Not on this kind of trawl. I’d want to hear them. Maybe there is crowd noise, or a chunk of silence on the end of a tape.

If track name differs I will check the length is off too. That Pixies example above stood out to me as I have seen that pattern before. Track listing errors in MB that have got out to the MP3\FLAC files of people which then get fed back in via Picard. (These are some of my favourite items to pick up on checks like this)

Those with different band names but same track name I’ll check if this is a cover. Or a related artist (I have seen Syd Barrett tracks listed under Pink Floyd before). Its why I focus on bands I know and stay well away from classical.

The only ones I am 100% happy at are when I see 100+ samples for a track, and one sample for a totally separately named track with a different artist.

I am always wary of automation. During this trawl I have spotted tracks that would have been incorrectly removed by a script. Disabling is a good safe way to clean up a database as it is reversable.

With Pink Floyd the Zabriskie Point soundtrack is something that would break all scripts. Same tracks have been repackaged many times, but the band name changes, the track names change, and sometimes the samples from the film are different lengths. (Had the same with some Blade Runner soundtracks last month) This leads to single examples that look rogue to a script, but a human check will spot them as legit.

I am just starting with a few more obvious artists as I jump in on letters. It is kinda nice reminder seeing so many already disabled from earlier manual trawls of mine.

ijabz · September 16, 2021, 9:29am

Surely a cover of the same song would resolve to a different acoustid, and therefore if there is a cover of a song with Sources of 1 alongside a link to the actual song it would surely always be a bad link ?

This is true and good to be careful. However regarding this particular list, for all these pairings there would have to be only one source for that particular ArtistName/Simplified Track Name combination. I would have thought that for all valid Pink Floyd combinations that you would get more than one source, bearing in mind that all combinations of ArtistName/Simplified Track Name would be summed together even if they point to different Mb Recordings (different versions of same song). Do you have an example Acoustid in the report that contradicts that ?

If we can get sufficient interest in working on this, might be an idea for people to post the artists they have processed, I dont currenlty have any automated way to do this in report.

IvanDobsky · September 16, 2021, 12:23pm

It should do, and usually will be unique. But some covers can be pretty spot on identical. Especially if electronic music.

It is one of those areas when I get a bit of doubt when I see a dozen other AcoustIDs added the same and will not touch it. If it is only 1, then it can likely be deleted. If there are a dozen fingerprints, then I’d want to know why.

My usual test is to click through on that cover recording and see if it has other acoustIDs. If it does, then I have a fingerprint to feed into the compare.

It can be tricky when you are in the depths of the bootlegs as you may not have many examples of AcoustIDs for the more rare bootlegs. Often I am working on something with only 3 or 4 fingerprints submitted.

I’ll have to look for more “rule breakers”. By lifting the test to over 5 was a good start for clearing some of these. The higher the gap between numbers of samples the easier it gets. Get to 500 and you are in bot territory.

The problem with bootlegs gets messier when recordings have not been merged. I was working on some Blade Runner bootlegs last month where the same tracks were repeated on multiple bootlegs, but they had different artists attached. Now all of them were correct, it is just how that bootlegger had named it. This meant that before the merge you had three recordings, sharing acoustIDs, sharing names, but different artists. Some of those bootlegs were more common so had a higher count of submissions. This would have led to a case where really they all needed to be merged, but would have looked like a “badly linked cover version” to a script.

Maybe this is a bootleg thing… Let me see if I can find an example in the report where I haven’t already merged the recordings (I am good at finding Edge Cases in any test algorithm… maybe cos I am an Edge Case myself )

No real need as that stands out anyway. When you wade into an artist and already see the first few fixed it gets obvious quite quickly. I am not the only one who has been fixing up Pink Floyd in the last 18 months. Even though there were a LOT in this list, probably over half of them had already been done. It is not just this list that has sent people into Fix the AcoustID sessions.

ijabz · September 21, 2021, 1:39pm

OK so the good news is I have now managed to import the acoustid daily updates required for this report and rebuilt the report, so now the number of potential bad cases has gone up from 78,000 to 117,000.

The bad news is that the daily updates doesn’t seem to be reporting when pairs are disabled properly.

I thought it might just be a very recent issue so I looked at the first Pink Floyd record in the list, and saw you disabled it on 23rd June 2020 - Track "0161da2e-85d6-435a-8978-0c15a8df3df5" | AcoustID

So I downloaded the files for that day and searched for mbid in the file

grep 58ce3b51-da28-4bdd-9279-56eb6c16ac30 2020-06-23-track_mbid-update.jsonl

but no matches

I then looked for any disabled records for that day, and there were some, here are the first two

2020-06-23-track_mbid-update.jsonl:{"id":2263529,"track_id":8895037,"mbid":"a78d43e5-4724-4b57-ac8f-aff521f97d97","submission_count":1,"disabled":true,"created":"2011-08-26T13:46:22.519936+00:00","updated":"2020-06-23T08:57:19.723022+00:00"}
2020-06-23-track_mbid-update.jsonl:{"id":10827664,"track_id":18475023,"mbid":"03e02a24-1d09-45cb-92d4-32a84f36b37d","submission_count":1,"disabled":true,"created":"2014-06-03T11:59:18.380241+00:00","updated":"2020-06-23T13:38:22.6228

So looked up the track to get track gid

jthinksearch=# select * from acoustid_track where id=8895037;
   id    |            created            |                 gid                  | new_id
---------+-------------------------------+--------------------------------------+--------
 8895037 | 2011-08-19 12:45:49.831943+00 | eeb951c4-c47b-4087-a23e-2e9b3ffa19b7 | \N
(1 row)

https://acoustid.org/track/eeb951c4-c47b-4087-a23e-2e9b3ffa19b7

and we can see that although a78d43e5-4724-4b57-ac8f-aff521f97d97 was disabled for this Acoustid it was done on Jan, 09 2015 17:40:54 (by a bot) not 2020-06-23T08:57:19 as the line in the file implies.

So I don’t see how to find out when a pairing has been disabled, is this a bug or misunderstanding on my part, if anyone understands this please let me know.

I also double checked the track I wanted wasn’t mention in the file at all at all, for this I needed to find the internal trackid

jthinksearch=# select * from acoustid_track_mbid where mbid='58ce3b51-da28-4bdd-9279-56eb6c16ac30';
    id    | track_id |                 mbid                 |            created            | submission_count | disabled
----------+----------+--------------------------------------+-------------------------------+------------------+----------
 16459345 | 10158239 | 58ce3b51-da28-4bdd-9279-56eb6c16ac30 | 2018-06-16 15:10:20.894442+00 |                1 | f
  1304057 | 10158206 | 58ce3b51-da28-4bdd-9279-56eb6c16ac30 | 2011-08-22 11:24:13.782687+00 |               33 | f
  8811442 | 23987646 | 58ce3b51-da28-4bdd-9279-56eb6c16ac30 | 2013-08-20 02:16:34.453068+00 |                1 | f
  7700873 | 21655210 | 58ce3b51-da28-4bdd-9279-56eb6c16ac30 | 2012-12-31 15:57:04.133796+00 |                1 | f
(4 rows)

jthinksearch=# select * from acoustid_track where id in (10158239,23987646,21655210);
    id    |            created            |                 gid                  | new_id
----------+-------------------------------+--------------------------------------+--------
 10158239 | 2011-08-22 11:24:13.782687+00 | 0161da2e-85d6-435a-8978-0c15a8df3df5 | \N
 21655210 | 2012-12-31 15:57:04.133796+00 | dbbcfc6b-35ee-4cef-90d0-bd587b32a57a | \N
 23987646 | 2013-08-20 02:16:34.453068+00 | 102dea03-fe1b-4525-abaa-8f4c5fb44b7c | \N
(3 rows)

grep  10158239 *

returns no results for that day

The only other good news is any links visited for previous report in your browser will still show as visited for this report so you can see what you have already looked at.

IvanDobsky · September 22, 2021, 5:05pm

A post for awkward edge case examples. I’ve hidden it due to waffle. I’ll also add more oddities here as I find them. Examples that could trip up a bot. Yeah, I enjoy finding Edge Cases to break automation

Lots of examples

All of these have already been fixed - but are interesting examples.

A few of the odd ones that needed brain engaged.

The simple stuff for a bot is anything with a big gap in Sample (20 to 1 and above), especially when only 1 sample, when times are different, when track names and artist different, if lots of fingerprints of same length, and lots of “additional user-submitted metadata” match. All in the “no brainer” category and easily culled.

These next ones needed thoughts

Track "8db860df-6d81-4ab1-8ea2-b7fdaae80a52" | AcoustID
Sometimes you just look and wonder how a mess like that happens? I stripped out another three this time, but many more random ones been connected and removed before.

https://acoustid.org/track/9c9bb9ca-9e04-41f8-a028-1c9b040bdfc6
That looks like the same track multiple times, but one of the titles is in Icelandic instead of English. As length is same this is likely to still be the same track and not to be removed. How does Bot spot that?

Track "b5446faa-9a73-43b2-9bb1-da2b44b9f202" | AcoustID
Track "ba53e1da-24b6-4635-8918-65a8d7351617" | AcoustID
Variation on a theme - how do you spell “part 2”? Or a Medley by a different name

Track "f812ba4a-2de2-48c7-a09a-14944c07f9ca" | AcoustID
Or what do you do with hippies who forget what they call a track? Pretty sure both names used on this is same recording.

Track "433732e4-884f-4b90-89dc-3240520ec4d1" | AcoustID
Or what happens when you have variations on a track name. Basically - don’t let a bot drop something of a same length

Classical takes those language variations to another level of confusion.

Track "9606be1e-b3d3-4a66-bc3c-2f39db233aad" | AcoustID
One track has 11 samples, other track has 5 samples. It is the 11 sample track that gets culled.

Why? Research. First the length and other data samples. Next look through at the recordings and other AcoustID samples. Seems to support that there are mislabelled bootleg tracks out there. (This would be skipped by a bot if a nice wide margin kept between samples)

Track "62c07a28-a25b-4701-9c95-2d004643360e" | AcoustID
Can we just have a filter that picks out tracks called “track nn” with one sample? So so many of these audiobooks randomly attached to stuff.

Track "2ef97208-f2f4-4732-a6d9-61aa2d92c313" | AcoustID
A Motörhead track picks out the example of Instrumental versions sharing AcoustIDs but different name and length. Similar happens with 5.1 mixes, mono, etc. Recordings that share acoustIDs but not recordings. In this case complicated by it being a longer track too. Bot will need some kinda “match first half of the track name”

Motörhead is a funny artist to go through. It is usually obvious, coulpe of dozen samples, and I bet it is one person who uploaded most of the errors from a boxset that misaligned. A BOT would have done well in Motörhead as the pattern was so clear and easy to check.

Track "2cec8545-5e14-4927-bf7f-095088ef7e7e" | AcoustID
An interesting issue - remixes. When remixes are on a 12" they are often named after the remix (Drop the Break) and not the original track name. This release also has a messy combination of artist names that are bot unfriendly.

Track "64ae8c73-c98a-4004-b0b2-7a8be6d0df58" | AcoustID
Even wondered how you get a fingerprint of 4:16 but the track shows 0:39? That is what happens when the recording has two incorrectly merged recordings put together. That recording was a 4:16 and a 0:39 - bad data causes confusions.

Freso · September 23, 2021, 11:12am

5 posts were merged into an existing topic: Blocking submissions from “bad submitters”?

ijabz · September 23, 2021, 8:09am

Could look at the work the recording is linked to, if both recordings linked to same work then same song. Although in this case it appears that this is actually not just a change in the title it is sung in Icelandic, and looking at the other pairings I think the incorrect pairing may be the English version - https://musicbrainz.org/recording/a616960a-4327-4359-82b0-4692417ee4a5

Part Two and Part 2 are already handled in the report because we replace text numbers to numbers as part of the simplification, but it doesn’t do roman numerals yet.

Again looking at the works could identify same songs where name is different

But it wouldn’t work for the medley, all I can think is to eliminate when songs lengths are similar and over a certain length.

In theory could use Work again, but doesn’t have one in this case, I can can only think you could construct some sort of manual exception list to handle such a case, but that would require user knowledge.

In report we simplify recording name and discount similar names using things such as levenstein distance, but hasn’t worked in this case.

The report only picks up cases where potential bad match only one source, so this isnt in report anyway. I think restricting to one source is a good starting point, but looking at this example comparing the track length of the mbrecording with the fingerprint would give a way to not pick the wrong one if the right one only had one source.

These are dumb webpages so It cant implement a filter that would work over al the pages as a whole. What I could do is create a separate report for this.

Interesting.

So in summary I can improve the report to filter out some bad matches that arent bad matches. But of course it all takes time, I’m more inclined to do this the more people that are making using the report.

Im not sure if these are suggestions to improve the report for manual unlinking or if you are suggesting that we could possibly use the data used by the report as a feed into an automated bot that actually unlinks the records in Acoustid.

IvanDobsky · September 23, 2021, 8:46am

I didn’t expect each to get a reply. It was a list to put some of the oddities together to help refine the report (or reports). I am not knocking your excellent work, just showing examples that break the pattern to help improve the report script. As you noticed, many of those have their own patterns

Maybe multiple types of smaller reports would give better focus. Smaller sub-sets would also get more eyes looking I expect.

I am no fan of automatic bots making decisions due to the error rate that gets introduced. I don’t like @aerozol’s idea of deleting all of one user’s data because of a few errors. That could only work if it can focus on a date range - that can be checked. Picking on people is the wrong focus.

Unless it is an auto bot that removes all those random Stephen King audio book random additions.

ijabz · September 23, 2021, 8:50am

No problem, I realized that.

I may well do that when have improved existing report, if you have suggestions of specific reports I can consider that.

InvisibleMan78 · September 23, 2021, 8:51am

AcoustID.org adds more then 10’000 AcoustID’s daily to its database. Do you expect that @IvanDobsky and others process your reports from now on for all eternity?
What I meant was: Acoustid.org should fix it at the source right after submission or even doesn’t accept /reject obviously wrong submissions.

IvanDobsky · September 23, 2021, 8:57am

By creating and refining these reports it would give AcoustID something to look at to help that out. The trouble is that AcoustID has problems knowing what is good or bad data. For example, there are many lone AcoustIDs with no data attached - these are especially confusing to spot. If it is a first fingerprint you have nothing to compare to. If the length doesn’t match the MB track, then how do you know the error isn’t on MB?

Messy

ijabz · September 23, 2021, 8:57am

No I agree that Acoustid could do alot more to improve their data quality, and I would prefer not to have to spend this time creating reports and then using them. Im just saying that at least I have come up with a partial solution that helps everyone, whereas your solution only helps Picard users and wouldnt work in many cases either, if it was easy to do I would have already done it in SongKong.

As for processing reports to eternity, the fact is musicbrainz editors are already correcting errors in the MusicBrainz database for eternity this is just another tool to help.

aerozol · September 23, 2021, 9:18pm

That is NOT my suggestion

That discussion has been moved here if anyone’s wondering what’s being referenced.

ijabz · September 25, 2021, 6:15am

Done some more work on this report

I have incorporated work name (where it exists) into the report creation, so if mutiple mb recordings for the same acoustid resolve to the same work name then they wont be considered a potential bad match if their simple names vary.

And I have split the results up, the first report now shows only those cases where the artistname is the same but the songname is different. The chances of the same artist having two actually different songs that resolve to the same fingerprint is virtually zero, so these are almost definitently bad pairings. The only cases when they are not will be when they are actually just different names for the same song that the report has not managed to filter out.

Then the other reports are where both the artistname and the songname are different, these maybe valid but since they have only one source it is more likely to be bad data, have split into 4 different reports bad upon how many data sources the good link has.

Lastly we have the original report, this contains more records than the others put together and in my testing most seem to be bad matches, so I have kept this report.

Note any links to Acoustid you have visited in old report should still shows as visited in new reports, all these reports listed at http://www.albunack.net/reports.jsp

ijabz · September 25, 2021, 3:02pm

New report added that shows Acoustid links to at least 5 different songs names, this is not just different recordings but recordings with different names

http://reports.albunack.net/new_acoustid_report6_1.html

IvanDobsky · September 25, 2021, 4:04pm

Interesting… but hard to fix ones like this: Track "e026986d-943c-4108-af54-c8b1e60da12a" | AcoustID

And goes to show that The Beatles are a random mess of randomness. (I had not looked at them before)

Also doesn’t surprise me to see plenty of Classical in there as the names vary a lot

ijabz · September 28, 2021, 5:24pm

Good news, all the reports have now each record checked against the live Acoustid database, and if the offending pairings have already been disabled then the record is removed from the report.

So as of this moment the report should only show rows that still have the potential problem.

ijabz · October 2, 2021, 8:06am

Just updated the reports so they have latest Acoustid/MusicBrainz data

Good to see making steady progress, for example the Multiple recordings for same artist with different name report has gone down from over 21,000 to just under 14,000 entries, so thanks to everyone who has helped with this.

ijabz · October 4, 2021, 7:34am

Added the average fingerprint length (can span 7 seconds) and the Mb Recording lengths to the Acoustid reports.

I have been going through the first report that shows Acoustids linked to multiple songs with different names by same artist, this is clearly wrong in almost all cases, however when there is little difference in track lengths and no of sources between the good and match it is not always so clear which is the bad match

So I have now split the report into:

Multiple songs by same artist, song length doesnt match fingerprint length

Multiple songs by same artist, song length matches fingerprint length

The first report has the easiest cases, and I think there is a strong case for fixing these automatically (i dont know if technically possible to do this), the second report has more difficult cases

ijabz · October 27, 2021, 3:14pm

Added one more report - http://reports.albunack.net/new_acoustid_report8_1.html

Shows cases where the linked MusicBrainz recording is at least 30 seconds different to the fingerprint length (and not covered by earlier reports) , so cannot really be a valid match

This gives us 132,000 matches.

This is not going to be cleared manually so my question is if it is possible to write a bot to remove the pairs from Acoustid - think the entries in http://reports.albunack.net/new_acoustid_report1_1.html would be safe to delete.

Probably is possible to write a bot but not something I’m familar with, my first issue would be how to programmatically login to Acoustid, if anyone can give help with this that would be great.