Metadata cleaning -- please help!


As part of the MBID Mapper, our technology that maps incoming listens to MusicBrainz IDs, we have a tool called the metadata cleaner (formerly the metadata detuner). It attempts to strip the seemingly random cruft that gets added to recording names.

For instance:

“For The Love feat. Amy True” may not get matched in the database, but if we remove the feat. portion and look up “For The Love” instead, it has a greater chance of matching in our mapping. I’ve spent some time looking at recording names to find cases that we should clean up, and then implemented that cleaning process.
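To make the idea concrete, here is a minimal sketch of that kind of cleanup. It is not the actual metadata cleaner code; the pattern `FEAT_RE` and the function `strip_featuring` are names made up for illustration.

```python
import re

# Hypothetical pattern: whitespace, then "feat"/"ft"/"featuring" with an
# optional dot, more whitespace, and everything up to the end of the string.
FEAT_RE = re.compile(r"\s+(?:feat|ft|featuring)\.?\s+.*$", re.IGNORECASE)


def strip_featuring(title: str) -> str:
    """Drop a trailing 'feat. X' credit; return the title unchanged if none is found."""
    return FEAT_RE.sub("", title).strip()


print(strip_featuring("For The Love feat. Amy True"))
```

The cleaned string is what would then be used for the second lookup attempt.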

So far, I am pretty happy with it, but I would love it if we could crowdsource spot-checking of the results. If you feel like helping out and have a few minutes, please open this 22 MB file:

Don’t open the file in Google Docs, it’s too big for that to work – just take a look in the preview. Read the introduction at the top of the file and follow the instructions there.

Please leave comments/questions here.

Once this is debugged, it will be included in the production ListenBrainz mapper, where it will hopefully improve accuracy for these crufty recordings.



And now let’s do the same thing for artists:

Same concept as above – please open the file and read the instructions at the top.




Too much cleaning:

  • 307: & Jay-Z singt uns ein Lied → & Jay (Song “& Jay-Z singt uns ein Lied” - MusicBrainz)
  • 1495: )--- ---x--- ---()
  • 1505: ***(Znikad z miloscia...)***
  • 1524: *Pitch Shift* → *Pitch Shi
  • 1724~????: A couple audio books (or dramas?) have their part (“Teil”) and/or chapter name removed
  • 3286~????: similar to above, different format
  • 3460~????: Broadcast DJ-mixes have everything but the year removed
  • 100807~100828: Post-Apocalyptic Ho-Down → Post
  • 100852~100878: similar to above, e.g. Pot-Pourri Samba Reggae → Pot
  • 125157~125182: T-Ball Rag → T

None or not enough cleaning:

  • 80860: Madness (DJ Gollum feat. DJ Cap Remix) → Madness (DJ Gollum
  • 95140: Other Place [Live]

Apart from less common stuff, edge cases and some classical titles it looks pretty good from what I’ve seen.


Can’t find much bad about it except some silly names or unofficial/unknown name style i.e. brackets:

  • [intentionally left blank] → [intentionally le
  • [men's chorus from Bansko, Bulgaria] → [men's chorus from Bansko
  • [ - TCA - ] → [
  • [,silu:’et] → [
  • Amoeba (raft boy) → Amoeba (ra
  • A/N【eɪ-ɛn】 → A
  • 3 Years - 3 Days → 3 Years
  • (~ ̄▽ ̄)~〰 → (

Some are not (fully) cleaned:

  • Trio Urs & Therese Fuhrer - Hermann Stucki → Trio Urs & Therese Fuhrer
  • Bach; Henryk Szeryng, Jean-Pierre Rampal, Michala Petri, Elisabeth Selin, George Malcolm, Academy of St. Martin‐in‐the‐F → Bach; Henryk Szeryng
  • Babylumalotoroony and the Jerry Lewis Bone-A-Thons featuring Athena → Babylumalotoroony and the Jerry Lewis Bone-A-Thon

Thanks for this feedback – good stuff. I’ll go over it in detail a bit later.

But there is one thing I should’ve mentioned: it is ok to do too much or too little cleaning in some cases. The cleaner is only used as a fallback when the default matching doesn’t work. If the cleaning butchers the metadata beyond repair, the secondary lookup will simply not match anything and fail, which is ok. Our goal is to get as many cases as possible cleaned correctly, while keeping the incorrect ones (too much/too little cleaning) as low as possible.
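Roughly, the flow looks like the sketch below. This is a simplified illustration rather than the real mapper code: `lookup_with_fallback`, the plain dict used as an index, and the `clean` callable are all stand-ins I made up for the example.

```python
from typing import Callable, Optional


def lookup_with_fallback(
    title: str,
    index: dict[str, str],
    clean: Callable[[str], str],
) -> Optional[str]:
    """Try the raw title first; only if that misses, try the cleaned title."""
    key = title.casefold()
    if key in index:
        return index[key]

    cleaned = clean(title).casefold()
    # An empty or unchanged result cannot produce a new match, so stop here.
    if not cleaned or cleaned == key:
        return None

    # A butchered title simply misses the index and yields None.
    return index.get(cleaned)


# Hypothetical index entry and a deliberately crude cleaner.
index = {"for the love": "hypothetical-id-1"}
print(lookup_with_fallback("For The Love feat. Amy True", index, lambda t: t.split(" feat.")[0]))
```

The point is that an over-cleaned title such as “T” for “T-Ball Rag” just falls through to `None` unless the index happens to contain that exact string.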

That said, a fair number of the things you’ve seen won’t be fixed, but a few are clear improvements to make.



Perhaps checking for a word boundary at the start of the “ft/feat” test might help?
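Something like the comparison below is what I have in mind. I don’t know what the cleaner’s actual pattern looks like, so both regexes and the function names are guesses for illustration only.

```python
import re

# Naive test: matches "ft"/"feat" anywhere, even in the middle of a word.
NAIVE = re.compile(r"\s*(?:feat|ft)\.?.*$", re.IGNORECASE)
# Bounded test: \b requires a word boundary on both sides of the token.
BOUNDED = re.compile(r"\s*\b(?:feat|ft)\b\.?.*$", re.IGNORECASE)


def clean_naive(title: str) -> str:
    return NAIVE.sub("", title)


def clean_bounded(title: str) -> str:
    return BOUNDED.sub("", title)


for title in ["*Pitch Shift*", "[intentionally left blank]", "For The Love feat. Amy True"]:
    print(repr(clean_naive(title)), "vs", repr(clean_bounded(title)))
```

With the naive pattern, the “ft” inside “Shift” and “left” triggers the cut, which reproduces the truncations reported earlier in the thread. The bounded pattern leaves those titles alone and still strips the real “feat.” credit.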


Yes, good catch. Fixed.


Great to hear, my first concern was that we would be incorrectly matching heaps of stuff like:
“More to Life (Jn Radio Edit)”
to the album track “More to Life” by default :+1:

I suspect we’ll still get some negative feedback, but let’s see. Now that emergency manual re-linking is nice and smooth that’ll help.

Some things I found that stood out for some reason, take or leave as you please:

  • (You're So Square) Baby I Don't Care (Movie Edit, 2013, Take 16/2021, Take 6) Binaural → (You're So Square) Baby I Don't Care (Movie Edit,
    (seems a weird place to cut)
  • (a) Cuban Love Song / (b) Honolulu Eyes → (a) Cuban Love Song
  • (medley) I Heard the Bells on Christmas Day / Silver Bells → (medley) I Heard the Bells on Christmas Day
    (we should be careful matching multiple songs/medleys to just the first song)
  • -10 on the Care-Meter → -10 on the Care
  • 264 - Das Herz → 264
    (a lot of ‘-’ causing trouble/early cuts)
  • 2 Pièces froides: No 2 - Danses de travers → 2 Pièces froides: No 2
  • 2000-01-14: Programme 2, “Cherie Blair Meets Columbo” → 2000
    (all correctly tagged broadcasts will be cut off like this)
  • 28. Arioso (Soprano/Sopran) : Behold And See If There Be Any Sorrow → 28. Arioso (Soprano
    (perhaps it’s sometimes better to cut the start rather than anything following)
  • 3 (Joshua Treble remix) → 3
    (naturally there’s a lot of this sort of thing, hopefully there is a safeguard to make sure a ton of recordings aren’t auto-matched to a specific search result for ‘[number]’ etc. We might end up with some strange ‘most popular tracks’)

That’s all my eyeballs can take for now. I don’t need feedback on any of these fyi @mayhem, hopefully some are helpful.

@chaban: I’ve learned and fixed the following things from your recordings list:

  1. A hyphen in the first half of the string should likely be ignored (most of your examples cover this).
  2. “(guff)” removal should be applied before “stuff - guff” matching (the Madness and Jay-Z cases).
  3. Stuff in other “parens” {} [] <> should also be treated like ().
  4. Line 1945 should not be cleaned.
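As a rough sketch of how rules 1 to 3 fit together (this is not the code that went into the cleaner): the names below are made up, the “ - ” separator with surrounding spaces and the half-length threshold are my simplifying assumptions, and “Some Song Title - Live” is an invented example.

```python
import re

# Rule 3: a trailing group in any of the four bracket styles counts as guff.
TRAILING_GROUP = re.compile(
    r"\s*(?:\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}|<[^<>]*>)\s*$"
)


def clean_recording(title: str) -> str:
    # Rule 2: strip trailing bracketed guff first...
    cleaned = TRAILING_GROUP.sub("", title)

    # ...and only then look for a "stuff - guff" separator.
    pos = cleaned.rfind(" - ")
    # Rule 1: ignore a separator that falls in the first half of the string.
    if pos != -1 and pos >= len(cleaned) / 2:
        cleaned = cleaned[:pos]

    # Never return an empty string; fall back to the original title.
    return cleaned.strip() or title


for title in [
    "Madness (DJ Gollum feat. DJ Cap Remix)",
    "Other Place [Live]",
    "264 - Das Herz",
    "Some Song Title - Live",
]:
    print(repr(title), "->", repr(clean_recording(title)))
```

Because the bracketed part is removed before any hyphen handling, a hyphen or “feat.” inside the brackets never gets a chance to cause an early cut.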

And from your artist list:

  1. “Intentionally left blank” accidental feat fix
  2. “men's chorus” could be improved to see if we can avoid splitting on - inside a set of brackets.
  3. As for the cases that were not fully cleaned, I realize that I may need to re-run the cleaning step more than once.
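Re-running the cleaning step could be as simple as looping until the output stops changing. A minimal sketch, assuming a single-pass function that removes one piece of guff per call; `clean_until_stable`, `strip_one_group`, and the sample string are hypothetical.

```python
import re
from typing import Callable

# Hypothetical single pass: removes at most one trailing (...) or [...] group.
ONE_TRAILING_GROUP = re.compile(r"\s*(?:\([^()]*\)|\[[^\[\]]*\])\s*$")


def strip_one_group(text: str) -> str:
    return ONE_TRAILING_GROUP.sub("", text, count=1)


def clean_until_stable(
    text: str, clean_once: Callable[[str], str], max_passes: int = 5
) -> str:
    """Apply clean_once repeatedly until the output stops changing."""
    for _ in range(max_passes):
        cleaned = clean_once(text)
        if cleaned == text:
            break
        text = cleaned
    return text


print(clean_until_stable("Song (Live) [Remastered]", strip_one_group))
```

The `max_passes` cap is just a guard against a cleaning rule that never settles.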

I learned a lot and added several test cases to our test suite – thanks!
