Tutorial: How to build your own music tagger, with MusicBrainz Canonical Metadata

There’s a new blog post up: How to build your own music tagger, with MusicBrainz Canonical Metadata – MetaBrainz Blog

In it @rob, the scalliwag, has outlined how to use our new MusicBrainz Canonical Metadata dataset (link to dataset announcement) to write a semi-automated tagger in Python. It only has three steps!? That seems too few… ah well, I’m sure he knows what he’s doing. Anyway, if this is something you’re interested in, check it out, or give it a share to people who may be interested. With everywhere locking down on API usage lately it’s pretty cool to be releasing more open resources.

Also, just for the forums, a big pat on the back to the team that refreshed the MetaBrainz datasets pages. They’re now really clearly formatted, have an inviting button and info hierarchy, and there’s a new sign-up workflow! Really nice job.


And thanks to @lucifer, @reosarevok and @mr_monkey for all the hard work that went into creating the dataset and all the work around the new dataset pages. Thank you!

And if anyone has more comments about how we can improve the semi-automated tagger, please post a comment here!

I think there’s a tiny typo in this code snippet from step 3 – the parameter is named artist_name, but it’s referenced as artist_credit_name in the body:

def make_combined_lookup(self, artist_name, recording_name):
     return unidecode(re.sub(r'[^\w]+', '', 
                      artist_credit_name + recording_name).lower())

And just to mention it, it sounds from the docs like the \w here may preserve underscores. I’m not sure if that was the intent. \W may be clearer than a negated \w, too. :slight_smile:
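To make the underscore point concrete, here is a minimal sketch using only the standard `re` module. The helper name `strip_non_word` is hypothetical and not from the blog post, and `unidecode` is left out so the snippet runs without third-party packages:

```python
import re

# \w matches letters, digits AND the underscore, so \W+ (the same set as
# [^\w]+) leaves underscores in place.
kept = re.sub(r'\W+', '', 'AC/DC_Back in Black')
print(kept)  # ACDC_BackinBlack

def strip_non_word(text):
    """Hypothetical helper: drop everything except letters and digits."""
    # Adding "_" to the class removes underscores as well.
    return re.sub(r'[\W_]+', '', text).lower()

print(strip_non_word('AC/DC_Back in Black'))  # acdcbackinblack
```

Whether underscores should be dropped at all depends on what the lookup key is meant to match, so treat `[\W_]+` as one option rather than the fix.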


Thanks for that typo – fixed.

As for the underscore issue – damnit, you found a bug in our code. lol. thanks!


I’m afraid removing [^\w]+ (or indeed \W+, which is better) from artist names will blank all non-Latin artist names:


If supported, \P{Letter}+ (not a letter) would be better:
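For reference, as far as I know Python's standard `re` module does not accept `\P{Letter}` (the third-party `regex` package is the usual route for Unicode property classes). A rough standard-library equivalent is to filter on the Unicode general category. This is only a sketch, and `keep_letters` is a hypothetical name:

```python
import unicodedata

def keep_letters(text):
    """Hypothetical helper: keep only characters whose Unicode general
    category is a letter (Lu, Ll, Lt, Lm, Lo); drop everything else."""
    return ''.join(
        ch for ch in text
        if unicodedata.category(ch).startswith('L')
    )

print(keep_letters("モーニング娘。"))  # モーニング娘
print(keep_letters("AC/DC 2_x"))      # ACDCx
```

Note that this also drops digits and combining marks, which `\w` would partly keep, so the two approaches are not interchangeable.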



That is not how Python works, actually:

import re
re.sub(r'[\W]+', '', "モーニング娘。")


Cool, I don’t know much about Python!
Then it’s all good! :slight_smile:

It does not work with that online test tool (pythex), but I think that’s because it is not running in Unicode mode or in a proper locale, or something.

That tool is using the now long-obsolete Python 2.
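A small sketch of the difference, for anyone curious. In Python 3, `str` patterns are Unicode-aware by default, and the `re.ASCII` flag restricts `\w` and `\W` to ASCII. Using that flag to approximate what the Python 2 based tool does is an assumption on my part, not something verified against the tool itself:

```python
import re

name = "モーニング娘。"

# Python 3 default for str patterns: \w covers Unicode letters and digits,
# so only the trailing full stop "。" is removed.
unicode_result = re.sub(r'\W+', '', name)

# With re.ASCII, every non-ASCII character counts as \W, so the whole
# name is blanked. This roughly mimics a non-Unicode regex mode.
ascii_result = re.sub(r'\W+', '', name, flags=re.ASCII)

print(repr(unicode_result))  # 'モーニング娘'
print(repr(ascii_result))    # ''
```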