Tutorial: How to build your own music tagger, with MusicBrainz Canonical Metadata

aerozol · June 14, 2023, 11:03am

There’s a new blog post up: How to build your own music tagger, with MusicBrainz Canonical Metadata – MetaBrainz Blog

In it @rob, the scallywag, has outlined how to use our new MusicBrainz Canonical Metadata dataset (link to dataset announcement) to write a semi-automated tagger in Python. It only has three steps!? That seems too few… ah well, I’m sure he knows what he’s doing. Anyway, if this is something you’re interested in, check it out, or share it with people who may be interested. With so many services locking down API access lately, it’s pretty cool to be releasing more open resources.

Also, just for the forums, a big pat on the back to the team that refreshed the MetaBrainz datasets pages. They’re now really clearly formatted, have an inviting button and info hierarchy, and there’s a new sign-up workflow! Really nice job.

rob · June 14, 2023, 11:07am

And thanks to @lucifer, @reosarevok and @mr_monkey for all the hard work that went into creating the dataset and all the work around the new dataset pages. Thank you!

rob · June 14, 2023, 11:07am

And if anyone has more comments about how we can improve the semi-automated tagger, please post a comment here!

derat · June 14, 2023, 3:16pm

I think there’s a tiny typo in this code snippet from step 3 – the parameter is named artist_name, but it’s referenced as artist_credit_name in the body:

def make_combined_lookup(self, artist_name, recording_name):
     return unidecode(re.sub(r'[^\w]+', '', 
                      artist_credit_name + recording_name).lower())

And just to mention it, it sounds from the docs like the \w here may preserve underscores. I’m not sure if that was the intent. \W may be clearer than a negated \w, too.
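A minimal sketch of both points, assuming the intent is to keep only letters and digits. The free-function form and the `[\W_]+` pattern are illustrative assumptions, not necessarily the fix that went into the blog post, and the `unidecode` step from the original snippet is left out so the example only needs the standard library:

```python
import re


def make_combined_lookup(artist_name, recording_name):
    # Use the parameter name consistently (artist_name, not artist_credit_name).
    # [\W_]+ matches runs of non-word characters and underscores, so
    # underscores are stripped along with punctuation and whitespace.
    return re.sub(r"[\W_]+", "", artist_name + recording_name).lower()


# The negated \w class from the original snippet leaves underscores in place:
print(re.sub(r"[^\w]+", "", "AC/DC_Back In Black"))    # ACDC_BackInBlack

# The adjusted pattern removes them:
print(make_combined_lookup("AC/DC", "Back_In Black"))  # acdcbackinblack
```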

rob · June 14, 2023, 3:31pm

Thanks for that typo – fixed.

As for the underscore issue – damnit, you found a bug in our code. lol. thanks!

jesus2099 · June 14, 2023, 3:37pm

I’m afraid removing [^\w]+ (or \W+ indeed is better) from artist names will blank all non-Latin artist names:

[^]+

If supported, \P{Letter}+ (not a letter) would be better:

{Letter}

rob · June 15, 2023, 9:59am

That is not how python works, actually:

import re
re.sub(r'[\W]+', '', "モーニング娘。")
'モーニング娘'

jesus2099 · June 15, 2023, 10:11am

Cool, I don’t know much about Python!
Then it’s all good!

It does not work with that online test tool (pythex) but I think it’s because they are not in Unicode mode or in a proper Locale, or something.

outsidecontext · June 15, 2023, 6:05pm

That tool is using the now long-obsolete Python 2.
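The difference can be reproduced in Python 3 with the `re.ASCII` flag, which restricts `\w` to `[a-zA-Z0-9_]`, roughly what Python 2 did by default when `re.UNICODE` was not set. A small sketch using the artist name from the example above (the variable names are just for illustration):

```python
import re

name = "モーニング娘。"

# Python 3 default for str patterns: \w is Unicode-aware, so only the
# trailing punctuation mark counts as a non-word character.
unicode_result = re.sub(r"\W+", "", name)

# With re.ASCII, every character in the name is treated as a non-word
# character, so the whole string is removed.
ascii_result = re.sub(r"\W+", "", name, flags=re.ASCII)

print(repr(unicode_result))  # 'モーニング娘'
print(repr(ascii_result))    # ''
```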