As part of the MBID Mapper, our technology that maps incoming listens to MusicBrainz IDs, we have a tool called the metadata cleaner (formerly the metadata detuner). It attempts to strip the seemingly random cruft that gets added to recording names.
For instance:
“For The Love feat. Amy True” may not get matched in the database, but if we remove the feat. portion and look up “For The Love” instead, it has a greater chance of matching in our mapping. I’ve spent some time looking at the text of recordings to find cases that we should clean up, and then implemented that cleaning process.
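To make the “feat.” example concrete, here is a minimal sketch of that one cleaning step. It is not the actual metadata cleaner code: the `FEAT_RE` pattern and the `strip_featuring` name are hypothetical, and the real tool handles many more cases than this.

```python
import re

# Hypothetical pattern (an assumption, not the real cleaner's rule):
# a trailing "feat." / "ft." / "featuring" credit, optionally opened
# by a parenthesis or bracket, through to the end of the title.
FEAT_RE = re.compile(
    r"\s*[\(\[]?\s*\b(?:feat|ft|featuring)\b\.?\s+.*$",
    re.IGNORECASE,
)


def strip_featuring(title: str) -> str:
    """Return the title with a trailing featured-artist credit removed."""
    cleaned = FEAT_RE.sub("", title).strip()
    # If cleaning would leave nothing, keep the original title.
    return cleaned or title


print(strip_featuring("For The Love feat. Amy True"))
```

The word boundaries keep titles such as “Defeat the Enemy” untouched, since “feat” there is not a separate word.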
So far, I am pretty happy with it, but I would love it if we could crowdsource the spot checking of the results. If you feel like helping out and have a few minutes, please open this 22 MB file:
Don’t open the file in Google Docs; it’s too big for that to work. Just take a look in the preview. Read the introduction at the top of the file and follow the instructions there.
Please leave comments/questions here.
Once this is debugged it will be included in the production ListenBrainz mapper so that it will hopefully improve its accuracy around these crufty recordings.
Thanks for this feedback – good stuff. I’ll go over it in detail a bit later.
But there is one thing I should’ve mentioned: it is OK to do too much or too little cleaning in some cases. The cleaner is used as a fallback for when the default matching doesn’t work. If the cleaning butchers the metadata beyond repair, it will simply not match anything at all and the secondary lookup fails, which is OK. Our goal is to get as many cases as possible cleaned correctly, while keeping the incorrect ones (too much/too little cleaning) as low as possible.
That said, a fair number of the things you’ve seen won’t be fixed, but a few are clear improvements to make.
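The fallback behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the production mapper: `lookup_with_fallback`, `lookup` and `clean` are hypothetical names, with `lookup` standing in for the default matching and `clean` for the metadata cleaner.

```python
from typing import Callable, Optional


def lookup_with_fallback(
    title: str,
    lookup: Callable[[str], Optional[str]],  # hypothetical: returns an ID or None
    clean: Callable[[str], str],             # hypothetical: the metadata cleaner
) -> Optional[str]:
    """Try the default match first; only clean the title if that fails."""
    match = lookup(title)
    if match is not None:
        return match

    cleaned = clean(title)
    if cleaned == title:
        # Nothing was cleaned, so a second lookup would just fail again.
        return None

    # If the cleaning mangled the title, this simply returns None,
    # which is the acceptable failure mode described above.
    return lookup(cleaned)
```

Because the cleaned title is only tried after the untouched one has failed, over-cleaning in this sketch costs a missed match rather than overriding a result the default lookup would have found.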
Great to hear. My first concern was that we would be incorrectly matching heaps of stuff like:
“More to Life (Jn Radio Edit)”
To the album track “More to Life” by default
I suspect we’ll still get some negative feedback, but let’s see. Now that emergency manual re-linking is nice and smooth that’ll help.
Some things I found that stood out for some reason, take or leave as you please:
(You're So Square) Baby I Don't Care (Movie Edit, 2013, Take 16/2021, Take 6) Binaural → (You're So Square) Baby I Don't Care (Movie Edit,
(seems a weird place to cut)
(a) Cuban Love Song / (b) Honolulu Eyes → (a) Cuban Love Song
(medley) I Heard the Bells on Christmas Day / Silver Bells → (medley) I Heard the Bells on Christmas Day
(we should be careful matching multiple songs/medleys to just the first song)
-10 on the Care-Meter → -10 on the Care
264 - Das Herz → 264
(a lot of ‘-’ causing trouble/early cuts)
2 Pièces froides: No 2 - Danses de travers → 2 Pièces froides: No 2
2000-01-14: Programme 2, “Cherie Blair Meets Columbo” → 2000
(all correctly tagged broadcasts will be cut off like this)
28. Arioso (Soprano/Sopran) : Behold And See If There Be Any Sorrow → 28. Arioso (Soprano
(perhaps it’s sometimes better to cut the start rather than anything following)
3 (Joshua Treble remix) → 3
(naturally there’s a lot of this sort of thing, hopefully there is a safeguard to make sure a ton of recordings aren’t auto-matched to a specific search result for ‘[number]’ etc. We might end up with some strange ‘most popular tracks’)
That’s all my eyeballs can take for now. I don’t need feedback on any of these fyi @mayhem, hopefully some are helpful.