Metadata cleaning -- please help!

rob · May 16, 2023, 1:48pm

Hello!

As part of the MBID Mapper, our technology that maps incoming listens to MusicBrainz IDs, we have a tool called the metadata cleaner (formerly the metadata detuner). It attempts to strip the seemingly random cruft that gets added to recording names.

For instance:

“For The Love feat. Amy True” may not get matched in the database, but if we remove the “feat.” portion and look up “For The Love” instead, it has a greater chance of matching in our mapping. I’ve spent some time looking at recording names to find cases that we should clean up, and then implemented that cleaning process.

So far, I am pretty happy with it, but I would love it if we could crowdsource the spot checking of this result. If you feel like helping out and have a few minutes, please open this 22Mb file:

Don’t open the file in Google Docs, it’s too big for that to work – just take a look in the preview. Read the introduction at the top of the file and follow the instructions there.

Please leave comments/questions here.

Once this is debugged it will be included in the production ListenBrainz mapper so that it will hopefully improve its accuracy around these crufty recordings.

Thanks!

rob · May 16, 2023, 2:36pm

And now let’s do the same thing for artists:

Same concept as above – please open the file and read the instructions at the top.

Thanks!

chaban · May 16, 2023, 2:43pm

cleaned_recordings.txt

Too much cleaning:

307: & Jay-Z singt uns ein Lied → & Jay (Song “& Jay-Z singt uns ein Lied” - MusicBrainz)
1495: )--- ---x--- ---( → )
1505: ***(Znikad z miloscia...) → ***
1524: *Pitch Shift* → *Pitch Shi
1724~????: A couple audio books (or dramas?) have their part (“Teil”) and/or chapter name removed
3286~????: similar to above, different format
3460~????: Broadcast DJ-mixes have everything but the year removed
100807~100828: Post-Apocalyptic Ho-Down → Post
100852~100878: similar to above, e.g. Pot-Pourri Samba Reggae → Pot
125157~125182: T-Ball Rag → T

None or not enough cleaning:

80860: Madness (DJ Gollum feat. DJ Cap Remix) → Madness (DJ Gollum
95140: Other Place [Live]

Apart from less common stuff, edge cases and some classical titles it looks pretty good from what I’ve seen.

cleaned_artists.txt

Can’t find much bad about it except some silly names or unofficial/unknown name style i.e. brackets:

[intentionally left blank] → [intentionally le
[men's chorus from Bansko, Bulgaria] → [men's chorus from Bansko
[ - TCA - ] → [
[,silu:’et] → [
Amoeba (raft boy) → Amoeba (ra
A/N【eɪ-ɛn】 → A
3 Years - 3 Days → 3 Years
(~￣▽￣)~〰 → (

Some are not (fully) cleaned:

Trio Urs & Therese Fuhrer - Hermann Stucki → Trio Urs & Therese Fuhrer
Bach; Henryk Szeryng, Jean-Pierre Rampal, Michala Petri, Elisabeth Selin, George Malcolm, Academy of St. Martin‐in‐the‐F → Bach; Henryk Szeryng
Babylumalotoroony and the Jerry Lewis Bone-A-Thons featuring Athena → Babylumalotoroony and the Jerry Lewis Bone-A-Thon

rob · May 16, 2023, 3:35pm

Thanks for this feedback – good stuff. I’ll go over it in detail a bit later.

But there is one thing I should’ve mentioned: it is ok to do too much or too little cleaning in some cases. The cleaner is only used as a fallback for when the default matching doesn’t work. If the cleaning butchers the metadata beyond repair, it will simply not match anything at all, and the secondary lookup fails – this is ok. Our goal is to get as many cases as possible cleaned correctly, while keeping the incorrect ones (too much/too little cleaning) as low as possible.
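To make the fallback idea concrete, here is a rough sketch of the control flow. It is not the actual mapper code: `lookup_with_fallback`, `toy_lookup`, `toy_clean` and the `"placeholder-mbid"` value are all made up for illustration.

```python
from typing import Callable, Optional


def lookup_with_fallback(
    title: str,
    exact_lookup: Callable[[str], Optional[str]],
    clean: Callable[[str], str],
) -> Optional[str]:
    """Try the raw title first; only try the cleaned title on a miss."""
    match = exact_lookup(title)
    if match is not None:
        return match
    cleaned = clean(title)
    if not cleaned or cleaned == title:
        return None  # cleaning produced nothing new to try
    # May still return None: an over-cleaned title simply fails to match.
    return exact_lookup(cleaned)


# Toy stand-ins (hypothetical) for the real index and cleaner.
index = {"for the love": "placeholder-mbid"}


def toy_lookup(title: str) -> Optional[str]:
    return index.get(title.lower())


def toy_clean(title: str) -> str:
    return title.split(" feat. ")[0]


print(lookup_with_fallback("For The Love feat. Amy True", toy_lookup, toy_clean))
```

The point is that the cleaned string is never trusted on its own: it only gets a second chance at the same lookup, so a butchered title costs one failed query.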

That said, a fair number of the things you’ve seen won’t be fixed, but a few are clear improvements to make.

Thanks!

rdswift · May 16, 2023, 3:52pm

Perhaps checking for a word boundary at the start of the “ft/feat” test might help?

rob · May 16, 2023, 4:12pm

Yes, good catch. Fixed.
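For anyone following along, cuts like “[intentionally left blank] → [intentionally le” and “*Pitch Shift* → *Pitch Shi” are what you would get if “ft” is matched anywhere inside a word. A minimal sketch of the difference a word boundary makes, assuming a regex-based test (the two patterns here are illustrative, not the ones in the mapper):

```python
import re

# Without a boundary, "ft" also matches inside "left", "Shift" and "raft".
NAIVE = re.compile(r"\s*(?:feat|ft)\.?.*$", re.IGNORECASE)

# With \b on both sides, the marker has to be a standalone token.
BOUNDED = re.compile(r"\s*\b(?:featuring|feat|ft)\b\.?\s.*$", re.IGNORECASE)


def strip_feat(text: str, pattern: re.Pattern) -> str:
    """Remove the featured-artist marker and everything after it."""
    return pattern.sub("", text)


for sample in (
    "For The Love feat. Amy True",
    "[intentionally left blank]",
    "*Pitch Shift*",
    "Amoeba (raft boy)",
):
    print(repr(strip_feat(sample, NAIVE)), "vs", repr(strip_feat(sample, BOUNDED)))
```

With the bounded pattern only the first sample changes; the other three come back untouched.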

aerozol · May 16, 2023, 10:10pm

Lazy…

Great to hear, my first concern was that we would be incorrectly matching heaps of stuff like:
“More to Life (Jn Radio Edit)”
To the album track “More to Life” by default

I suspect we’ll still get some negative feedback, but let’s see. Now that emergency manual re-linking is nice and smooth that’ll help.

Some things I found that stood out for some reason, take or leave as you please:

(You're So Square) Baby I Don't Care (Movie Edit, 2013, Take 16/2021, Take 6) Binaural → (You're So Square) Baby I Don't Care (Movie Edit,
(seems a weird place to cut)
(a) Cuban Love Song / (b) Honolulu Eyes → (a) Cuban Love Song
(medley) I Heard the Bells on Christmas Day / Silver Bells → (medley) I Heard the Bells on Christmas Day
(we should be careful matching multiple songs/medleys to just the first song)
-10 on the Care-Meter → -10 on the Care
264 - Das Herz → 264
(a lot of ‘-’ causing trouble/early cuts)
2 Pièces froides: No 2 - Danses de travers → 2 Pièces froides: No 2
2000-01-14: Programme 2, “Cherie Blair Meets Columbo” → 2000
(all correctly tagged broadcasts will be cut off like this)
28. Arioso (Soprano/Sopran) : Behold And See If There Be Any Sorrow → 28. Arioso (Soprano
(perhaps it’s sometimes better to cut the start rather than anything following)
3 (Joshua Treble remix) → 3
(naturally there’s a lot of this sort of thing, hopefully there is a safeguard to make sure a ton of recordings aren’t auto-matched to a specific search result for ‘[number]’ etc. We might end up with some strange ‘most popular tracks’)

That’s all my eyeballs can take for now. I don’t need feedback on any of these fyi @mayhem, hopefully some are helpful.

rob · May 17, 2023, 9:34am

@chaban: I’ve learned and fixed the following things from your recordings list:

A hyphen in the first half of the string should likely be ignored (most of your examples cover this).
“(guff)” removal should be applied before “stuff - guff” matching (the Madness and Jay-Z cases).
Stuff in other “parens” ({}, [], <>) should also be treated like ().
Line 1945 should not be cleaned.
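
Roughly, the first three rules combine like the sketch below. This is a simplified illustration and not the real cleaner: `clean_title` and `TRAILING_BRACKETS` are made-up names, only a single trailing bracket group is handled, and the opener and closer are not checked to be the same kind.

```python
import re

# One trailing bracketed group, using any of (), [], {} or <>.
TRAILING_BRACKETS = re.compile(r"\s*[\(\[\{<][^\(\)\[\]\{\}<>]*[\)\]\}>]\s*$")


def clean_title(title: str) -> str:
    # Rule: strip bracketed guff first, whatever the bracket style...
    cleaned = TRAILING_BRACKETS.sub("", title)
    if not cleaned:
        cleaned = title  # the whole title was bracketed, so keep it as-is
    # Rule: ...then cut at a hyphen, ignoring any hyphen in the first half.
    idx = cleaned.find("-", len(cleaned) // 2)
    if idx != -1:
        cleaned = cleaned[:idx]
    return cleaned.strip()


print(clean_title("Madness (DJ Gollum feat. DJ Cap Remix)"))  # Madness
print(clean_title("Other Place [Live]"))                      # Other Place
print(clean_title("& Jay-Z singt uns ein Lied"))              # unchanged
```

A title with a late hyphen, such as the made-up “Some Long Song Title - Live”, is still cut down to “Some Long Song Title”, while “T-Ball Rag” is left alone.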

And from your artist list:

“Intentionally left blank” accidental feat fix
“mens chorus” could be improved to see if we can avoid splitting on - inside a set of brackets.
As far as fully cleaned, I realize that I need to possibly re-run the cleaning step more than once.
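
Re-running the cleaning step could look something like this: keep applying a single pass until the output stops changing. A sketch only, with a toy single-pass cleaner (`strip_last_part` is hypothetical and just drops the last “ - …” or “ & …” segment); whether the fully reduced name is actually the right one to look up is a separate question.

```python
import re
from typing import Callable

# Toy single pass: drop the last " - ..." or " & ..." segment.
LAST_PART = re.compile(r"\s+(?:-|&)\s+[^&\-]*$")


def strip_last_part(name: str) -> str:
    return LAST_PART.sub("", name, count=1)


def clean_until_stable(
    text: str, clean_once: Callable[[str], str], max_passes: int = 5
) -> str:
    """Apply clean_once repeatedly until nothing changes (or the cap is hit)."""
    for _ in range(max_passes):
        cleaned = clean_once(text)
        if not cleaned or cleaned == text:
            break  # stable, or cleaning would leave nothing behind
        text = cleaned
    return text


name = "Trio Urs & Therese Fuhrer - Hermann Stucki"
print(strip_last_part(name))                      # one pass
print(clean_until_stable(name, strip_last_part))  # repeated passes
```

The `max_passes` cap is there so that a cleaner that keeps nibbling at the string cannot loop for long.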

I learned a lot and added several test cases to our test suite – thanks!