Report showing AcoustIDs likely to be bad links to MusicBrainz recordings

Fantastic work - I’ve always thought this would be a great feature, but holy moly a lot of work ahead of us x.x

30 seconds is nothing on a 10 minute track. Similarly, when the AcoustID is linked to a 30 min bootleg, 3 mins is not much of a difference. Not everything is a short pop single. These numbers need to be more of a percentage of the whole when working with bootlegs.

Removing the need to check an AcoustID on the AcoustID page is wrong, but then I have already said my thoughts before about checking data.

But you have spurred me to get way more aggressive on deleting dodgy looking data. Especially when there is only one or two samples of it. There is a hell of a lot of carp data in there so don’t let my thoughts sway your Flame Thrower mission. A dousing with a bit of napalm will likely do more good than harm.

I also think there needs to be a specific mission for recordings called “track 01”, “track 02”. I found a Russian band’s release the other week with a 19 track album all just labelled “track nn”. And you have never seen such a HUGE mess of crap AcoustIDs attached. Must have had about 50 IDs per track - and all wrong junk. It is Picard doing a name match that is then not checked by a human. I wanted a Flame Thrower Script to use on that Release Page to disconnect the lot.

The percentage idea is reasonable; it would probably be better to convert the 4/3/2/1 minute reports into percentages instead. I haven’t really considered taking the release type of the matching track into consideration. I could also do that if that would help.
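
As a rough illustration of the percentage idea (a sketch only, not the actual report code): the function name is made up here, and measuring the difference against the MusicBrainz recording length rather than the fingerprint length is an assumption.

```python
def length_diff_percent(fingerprint_secs: float, recording_secs: float) -> float:
    """Length difference as a percentage of the recording length.

    Hypothetical helper: the thread does not say which length the real
    reports divide by, so using the recording length is an assumption.
    """
    if recording_secs <= 0:
        raise ValueError("recording length must be positive")
    # Multiply before dividing so simple cases come out as exact floats.
    return abs(fingerprint_secs - recording_secs) * 100.0 / recording_secs
```

With this, 30 seconds extra on a 10 minute track comes out at 5%, while 1 minute extra on a 4 minute track comes out at 25%, so the short-track mismatch stands out far more than it would in a fixed "N minutes out" report.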

Yeah, well the reasoning was that too much bad data had built up over many years for us to ever get through it manually, and the manual edits using this report already seemed to have dried up. But my thought was that, having blitzed it, it would then be possible to keep on top of it in a more careful manner.

Something on MusicBrainz would be good, and the MB AcoustIDs page is severely lacking in detail. But are they listed in my “MB Recordings Ids linked to at least 50 Acoustids” report?

Okay, I’ve changed some of the reports to use percentages as requested - albunack

No. It was something really obscure. Every track named track 01, track 02, etc. I found it via it being linked to something else I was working on.

And your “over 50” list is a bit bogus as that is just a “these are popular albums” list. Some people do submit some very messed up data and it is attached to common albums. I have hit a lot of those Pink Floyd tracks before and an amazing volume of it is likely legit. But these are also artists where the “real” data has thousands of fingerprints.

A problem with a band like Floyd is many concept albums have tracks that are cut up in different ways, leading to overlapping AcoustIDs. Made more complex by home rippers making some very different versions of the album when converting to MP3. Concept albums are a whole totally different area of making confused AcoustIDs… Impossible to have a “perfect” set there.

And yes, I have spent evenings with Floyd clicking on every AcoustID and comparing far too many.

This is part of the confusion. Data geeks like us may care about this stuff, but the average user doesn’t care a toss. They just want to tag up 1000s of tracks in one go. Don’t check anything. And then hit that “Submit AcoustIDs” button. IMHO Picard should disable that button for anyone attempting to submit more than a dozen albums at a time. But now we are on ranting tangents… :exploding_head:

The percentage error makes the real errors stand out more. Now a 1 minute difference on a 4 min track is clearly going to be bogus. But as also pointed out by others above, an early fade that chops something off the end of a track will have a common AcoustID. This happens a lot on Live, but will also happen on Greatest Hits type tracks.

When I do a manual attack on this kinda data I look for other data to back it up. If I see no album names listed, and only 2 or 3 samples it is more likely to get culled.

Wow - and my browser history comes to the fore. I can literally say “and there is the bastard”. Mr Bastard that is:

Just look at the kinds of data tied up there. You can see much of it will be from partially ripped CDs where tags were not set up. So Picard (or similar) did a “match by track names” and then the clueless user of the software submitted their data without looking. I checked a few tracks as you can see - and have not found a SINGLE legit AcoustID. This is why a little bit of Napalm is required here.

May be an interesting example of “bad matching” though.

The more I poke around in AcoustIDs, the more I see matches where there are a few words in common in the track names. Some people should not be trusted with Submit AcoustID buttons.

@IvanDobsky So my question is: is there anything in the “fingerprint is at least 100% different to MusicBrainz length” report that should not be blitzed? I ask you to have a look, and if you cannot find anything then I could blitz them without having to manually check them all.

I did have a quick look at the 24 Pink Floyd songs listed, and to me they all look invalid, in most cases being linked to the wrong version of the right song, and in a few cases likely not to be a Pink Floyd recording at all.

Well, true, most of the albums listed are popular ones, but only the recordings that are out of range are listed, and these are likely wrong.

But it also picks up albums that aren’t popular, and when you see one of these in the list it is likely there is a bigger problem. Sometimes it is obvious, when it has a name like [silence], but other times there is no obvious reason.

e.g

Sorry, I have just got bored going through this stuff after many hours of manual changes. Going through the lists was useful as I found other related errors I could fix from it. Trouble is, very few other people seem to care. Load up the Napalm. Especially on those obvious ones. Collateral damage will be minimal compared with the positives that will come from this.

I took a quick poke at those Floyd tracks in the 100% list and it is showing classic mismatch errors. Picard “match by name” lookups where the user doesn’t manually verify. I am seeing Wall Versions mixed with demos and live. Nuke it from Orbit, only way to be sure.

Also if one or two AcoustIDs on an album are accidentally taken out by friendly fire, the rest of the album will be there to guide the user to the right matching album.

:fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire:

Okay, experiment time.

A deeper poke at the 24 Floyd AcoustIDs looks like Napalm is a good idea. In a couple of cases it will totally remove the AcoustID as there will be no recordings of that length. Which is positive.

The only curious one is this:
https://acoustid.org/track/90edb4eb-f84f-462d-9f53-f94c87c8c815

Look at that odd data. A 0:38 and a 0:43 fingerprint. BUT this matches two different Roger Waters recordings. And it may well be correct as these are likely to sound the same. It will certainly match the 0:43 recording credited to Pink Floyd as that is the same recording, just with the wrong credit from a bootleg.

So out of 24 AcoustIDs, there is only this one that could be confused by the Napalm option.

Now if this one example AcoustID only gets those above 100% recordings disconnected - then it is an all round win. We keep the matching recordings, but lose the obvious errors.

As a test I hit your Napalm button. And realised it was not strong enough. It only took out the 1:20 recording. I then manually disconnected the other errors on that page.

What I deduced from that sample of 24 is 100% errors can be torched without mercy. I’d probably go down to 50% errors without thinking. Only as I approached the 20% would I then want to look closer.

:fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire:

Edit: Damn - you got me reading again…

I would suggest if there is no time, do not delete that recording.
https://acoustid.org/track/13f48c08-6c9d-416f-b5fd-add6bf9eea36

With that Sting example, the AcoustID is 1:30 - so take out the 3:17 recording, but leave the no time recording. I have seen occasions where no time is in the MB database but there is a full set of AcoustIDs that are shown to be correct. No time does not mean a mistake.

The report is only highlighting the Pink Floyd/Bring the Boys Back Home (01:27) link. It wouldn’t touch those two; it’s not meant to.

The report only shows links to recordings that do have a time, so ones without time will not be deleted.
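
A minimal sketch of that rule, assuming a made-up `Link` record and function name (the real report logic is not shown in this thread, and the percentage is again taken against the recording length as an assumption). Recordings with no length are skipped and so can never be flagged:

```python
from typing import List, NamedTuple, Optional


class Link(NamedTuple):
    """Hypothetical AcoustID-to-recording link, for illustration only."""
    recording: str
    recording_secs: Optional[int]  # None when MusicBrainz has no length


def unlink_candidates(fingerprint_secs: int, links: List[Link],
                      min_diff_percent: float) -> List[Link]:
    """Return links whose recording length is too far from the fingerprint."""
    flagged = []
    for link in links:
        # No time (None or 0) does not mean a mistake, so leave it alone.
        if not link.recording_secs:
            continue
        diff = abs(fingerprint_secs - link.recording_secs) * 100.0 / link.recording_secs
        if diff >= min_diff_percent:
            flagged.append(link)
    return flagged
```

For the Sting case above (a 1:30 fingerprint linked to a 3:17 recording and to a recording with no time), calling `unlink_candidates(90, links, 30.0)` would flag only the 3:17 recording and leave the no time one untouched.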

And you got me working on these damn reports again, should be doing something else really. You may have noticed I have now added a No of Sources count for the possibly errant MB recording.

Okay I’ll nuke the links in the 100% report

Don’t blame me for your addiction. :joy: :crazy_face: You are as bonkers as I am. No sane person is looking at the data at this level. It is left to us Loonies in the corner to see this mess and clear it up.

We just understand the satisfaction involved in liberal application of napalm to Bad Data.

I noticed that when I punched the button. This is why I ended with that follow up note about going deeper into the percentage. After checking those other Pink Floyd items in that list they could all have been improved with a lower percentage. (Argh - now I got to go check on them and manually cull some extras…)

I don’t have much time these days to dive into this. From what I can tell from my experience so far, the AcoustID DB is far from perfect, so any effort to clean out the bad data is very welcome :+1:. Even if a few good bits get wiped I can easily live with that, as I believe they will get re-submitted sooner or later. What concerns me more is the fact that bad data can be re-submitted just as easily :frowning_face:

So I have nuked the 100% and 60%-100% reports; some are still listed, but they have all been done.

Wondering if you could look at Pink Floyd in the 30% to 60% report to establish if it would be safe to nuke these as well?

I’ll do some delving later this evening. I’ll dip into a few other bands too while I am there.

Your new manual buttons could be useful as they would let me check what is going to happen. Trouble is they don’t work for me (Vivaldi on Windows) I press 'em and nothing happens. (Already signed in to AcoustID, etc) Aha - hang on. Firefox lets them work. Makes sense - needs more Fire in the Browser.

Will report back findings… :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire:

Short version: My vote says - Throw the match in, burn the lot in the 30% to 60% list… :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire: :fire:

Waffle version: I randomly hit a few bands to check: Chumbawamba, Sub Focus, Stiff Little Fingers, Skunk Anansie, Pink Floyd, Pixies. That meant a LOT of tracks under Floyd. A good random sample as Floyd can get complex. Track names reappear in bootlegs and other reissues.

Generally all looks good to me. This will do more good than harm, I have “Hit the button” on all of the above.

A couple of weird things in here.

Odd examples (Boring observations most people will want to ignore… :zzz:):

http://acoustid.org/track/d43c929c-2339-4c8d-9464-888df1ae1763 is correctly in the list. This has a DJ Mix version (2:29) in the middle of normal length versions (4:55). Nuke button applied.

I then looked at the shorter recording: Recording “Could This Be Real (Joker remix)” by Sub Focus - Fingerprints - MusicBrainz. Why is AcoustID cdcf458a not being picked up? That looks like it should be wiped out, but not much other data with it. (Yes, I manually nuked it - but it was there when I typed this)

Are you only nuking data which has metadata attached? There are tons of carp examples with no metadata, with a single Fingerprint of a different length to the single recording attached. When I see these by eye I remove them now as they are more likely junk than the Recording having the wrong length.
-=-=-
Track "5ffe2606-b3e5-434e-b3b4-fa0399d7e166" | AcoustID - going by the metadata, that one will wipe out the opposite to the correct one. Kinda weird.
-=-=-
I can see examples where a shorter edit will be dropped when it could be generating the same AcoustID. Don’t think that is a massive issue though.
-=-=-
It is doing a good job of cleaning up the different Wall CD edits. (Geek Mode on) The Wall CD has been printed for 40+ years, but as a concept album this means the tracks have been cut up in different ways \ different lengths over the years. The first Japanese edits made a right mess of Young Lust and the tracks around it. These AcoustID cullings you have just caused seem to have done a pretty good job of spotting some of these.

Now a possible error that could not be spotted here is those cases where the recording would be correct but has the wrong CD track time attached to it due to people cloning releases and not correcting times. This is not a huge issue on MB (not like the mess at Discogs) but we do have some CDs with clearly the wrong times listed. This set of corrections you have just triggered will have wiped out a few of the attached AcoustIDs here. It is going to make some of my repairs of those CD times a little harder. Collateral damage. Your list is more good than bad I believe. (/geek mode off)
-=-=-

These reports just list where fingerprint and MBID are way out in length; it doesn’t exclude those without metadata.

Okay, nuked, and now merged all three nuked reports into a single one (albunack), since I am nuking them anyway. It is still showing some records for some reason, but I think it is all done.

Split the 20%-30% and 10%-30% into two reports each: cases where the AcoustID/MBID pair was only submitted once, so likely to be completely wrong, and those submitted more than once, so probably the correct song, but either linked to the wrong version of the song (maybe because MusicBrainz doesn’t have the correct version) or MusicBrainz has the recording incorrectly entered.
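
Roughly, the split described above could look like the following. This is only a sketch: `Pair`, its field names and `split_by_submissions` are invented for the example and are not taken from the real report code.

```python
from typing import List, NamedTuple, Tuple


class Pair(NamedTuple):
    """Hypothetical AcoustID/MBID pair with its submission count."""
    acoustid: str
    mbid: str
    submissions: int


def split_by_submissions(pairs: List[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (submitted once, submitted more than once)."""
    once: List[Pair] = []
    repeated: List[Pair] = []
    for pair in pairs:
        # A single submission is more likely to be plain wrong; repeated
        # submissions suggest the right song but maybe the wrong version.
        if pair.submissions == 1:
            once.append(pair)
        else:
            repeated.append(pair)
    return once, repeated
```

The first list is the one more likely to be safe to cull in bulk; the second is the one that still wants a human look.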

Could maybe break these down further, e.g. split into where there are valid fingerprint/MB pairs for this AcoustID for the same artist/song and where there are not. Also could look at the user submitted metadata, but there is so much of it that performance was a bit of an issue when I tried it before.

So why did Track "cdcf458a-d87c-46ae-aed2-e0140ce1568c" | AcoustID not get picked up? I didn’t see it in the list.

Edit: Already spotted another one like this that has been missed from the list. Track "76670e8d-d345-4d3f-805d-a8920f51e8cf" | AcoustID

-=-=-

With the new combined list - the band names at the top are handy, but they always scroll past the one I click on. If I click on an artist with only one item, then I’ll need to scroll up one to find them. I assume the green header is sitting on top of them.

Example: Art of Noise and Max Headroom Max is hiding under the header at the top.

-=-=-

May dig in a bit later, but more nervous about these. It can be common for a batch of recordings to be different lengths. Easy to have a 4:00 Recording mixed with 3:40 recordings due to “early fade” rules clipping audience from the track.

-=-=-

Haha - only one Pink Floyd in your list this time. And so far the handful I try and look at have all been manually fixed in the last week or two. This check will take a few more days I think. Just realised - most of the ones I am looking at you have already hit.

This is a funny one. Track "8a4b609a-7f8b-4ec1-b0b7-bbb09ba23327" | AcoustID Metadata split 50\50 between two very different recordings

I don’t know, had a quick look but couldn’t see a reason, will have to come back to you on this.

Yeah, noticed this. I don’t write much HTML so haven’t looked into fixing this.

Agreed, I may have to break these down further to see if there are subsets of this list that can be done automatically, with the remainder having to be done manually.

For the amount of carp data out there without metadata it is very noticeable that every result of your searches always brings up AcoustIDs with metadata sections. Don’t think I have seen one yet without.

And I know there is a lot of junk without metadata. (Another one: Track "76670e8d-d345-4d3f-805d-a8920f51e8cf" | AcoustID ) I find these when I click on a recording to go and cross check the other AcoustIDs attached.