Planning on writing a bot to submit ISRCs

derobert · September 11, 2016, 6:34am

As a result of ripping with morituri, I have a bunch of .cue files, and many of them contain ISRC data. At a quick count, I have 320 CDs with ISRC data, totalling 3845 ISRCs. I don’t think any code currently exists to submit these, so I’m thinking of writing some. I’d submit a patch to musicbrainz-isrcsubmit (which has an open ticket requesting it), but:

My solution is going to be somewhat targeted to my setup—morituri, files not rearranged by Picard, etc.
It won’t be aimed at one-at-a-time submissions.
I don’t actually know Python…

I’ve looked at a few of the cue sheets, and my ripping drives (mostly LG BD-RE WH14NS40 and Lite-On iHAS124) do not appear to suffer from the adjacent track duplicate ISRC problem some drives do.

I plan to write the bot in Perl, and it’ll be free software posted on either GitHub or GitLab. (Note to self: there is some ISRC submitting code in perl @ https://gist.github.com/njh/9159699)

At the moment, I’m thinking it should:

NOTE: All my rips are one directory per CD. Each directory contains a bunch of FLACs (one per track), a .cue file, and a .log file. There are other files as well, but they’re not relevant for this. (All of these are actually symlinks due to git-annex—again, not really relevant)

Check with some local, persistent database that the bot hasn’t already processed this directory.
Read the ISRCs from the .cue file. Do the following basic sanity checks:
ISRCs are unique (I have CDs which break this, such as DEF057301700, but will review them manually before submitting)
Each track has one and only one ISRC
Strip out hyphens (e.g., in NL-E42-11-02105), convert to uppercase (both of these are correct according to the ISRC validation bulletin).
Confirm it’s not the obviously invalid ISRC /^0+$/
Confirm it matches the expected format /^[A-Z]{2}[A-Z0-9]{10}$/i
If any check fails, skip the disc until I review the failure.
Read the CDDB discid from the .cue file, compare it to the CDDB discid in the .log file. Throw a tantrum if they don’t match. This should never happen.
Extract MB disc ID from .log file.
Extract release MBID and disc number from 01.*.flac
Query MB to confirm the disc ID is associated with the release. If not, skip disc until I deal with it (I expect some where I’ve failed to submit the discid when I created a new release in a release group)
Use the disc number tag to find the correct disc of the release.
For each recording on the given disc of the release (I’ll get the recordings currently on the release—not the recording MBIDs tagged in my files):
Some of my ISRCs have hyphens in them; pretty sure those should be stripped out. (Sometimes the field is quoted too, obviously the quotes aren’t part of the ISRC).
Confirm the ISRC I’m about to submit is not currently on it
AFAIK, I should submit even if there is a different ISRC already on it.
Submit the ISRC.
Mark the directory processed in the local database.
Check number of ISRCs submitted today vs. rate limit, then either proceed to the next directory or stop.
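
A minimal sketch of the cue-sheet ISRC checks from the list above. The bot itself is planned in Perl; Python is used here purely for illustration. All function names are hypothetical, and the cue parsing assumes the usual layout of a `TRACK nn AUDIO` line followed by an optional `ISRC ...` line, which may not match every morituri cue sheet exactly.

```python
import re
from collections import Counter

# Format check from the list above; a stricter pattern would be
# ^[A-Z]{2}[A-Z0-9]{3}[0-9]{7}$ (country, registrant, year, designation).
ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{10}")
TRACK_RE = re.compile(r"^\s*TRACK\s+(\d+)\s+\S+", re.IGNORECASE)
ISRC_LINE_RE = re.compile(r"^\s*ISRC\s+(.+?)\s*$", re.IGNORECASE)


def normalize_isrc(raw):
    """Drop surrounding quotes and hyphens, then uppercase."""
    return raw.strip().strip('"').replace("-", "").upper()


def is_valid_isrc(isrc):
    """Reject the all-zero placeholder and anything not matching the format."""
    if re.fullmatch(r"0+", isrc):
        return False
    return ISRC_RE.fullmatch(isrc) is not None


def read_isrcs(cue_text):
    """Return {track number: [normalized ISRCs found under that track]}."""
    tracks = {}
    current = None
    for line in cue_text.splitlines():
        track = TRACK_RE.match(line)
        if track:
            current = int(track.group(1))
            tracks[current] = []
            continue
        isrc = ISRC_LINE_RE.match(line)
        if isrc and current is not None:
            tracks[current].append(normalize_isrc(isrc.group(1)))
    return tracks


def check_disc(tracks):
    """Return a list of problems; an empty list means the disc passes."""
    problems = []
    for number, isrcs in sorted(tracks.items()):
        if len(isrcs) != 1:
            problems.append(f"track {number}: expected one ISRC, found {len(isrcs)}")
        elif not is_valid_isrc(isrcs[0]):
            problems.append(f"track {number}: invalid ISRC {isrcs[0]!r}")
    counts = Counter(i for isrcs in tracks.values() for i in isrcs)
    for isrc, uses in sorted(counts.items()):
        if uses > 1:
            problems.append(f"{isrc}: used on {uses} tracks")
    return problems
```

A disc would be skipped for manual review whenever `check_disc(read_isrcs(cue_text))` returns a non-empty list, which covers cases like the repeated DEF057301700.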

I’ve reviewed the Bot Code of Conduct and of course plan to comply with it.

I’m looking for any comments/suggestions/etc., before I start work on this.

Is there anything I should know about ISRCs before doing this?
Are there other sanity checks I should perform?
Is this code that’d be useful to anyone else?
I presume this counts as a bot?
Is there some reason this is a Bad Idea™?

ListMyCDs.com · September 11, 2016, 11:47am

I believe your bot or script would be useful for many of us. It seems you’ve already planned it well and I see no potential problems caused by it.

This is important because labels also make mistakes. I’ve seen completely wrong ISRCs on CDs dozens of times (and I’m not talking about read failures). There’s no point in fixing MB if the bot sends the same faulty ISRC back again.

It’s common for older classical recordings to share the same ISRC across all parts/recordings of the same work. If a symphony is split into 4 recordings, they could all have the same ISRC. This is usually seen with older recordings by Deutsche Grammophon, and the ISRC usually ends with two zeros. Later releases of the same material usually use similar codes; for example, your DEF057301700 appears on later releases as DEF057301701, DEF057301702, DEF057301703…

Recordings often have multiple ISRCs because of the MB guidelines. A new code is typically assigned for a remaster, but in MB we count different masters as the same recording. Sometimes there’s no good reason why the old code wasn’t reused for exactly the same recording.

It depends. I would count it as a bot only if you keep it running automatically. If you only run it manually when necessary, it’s just a script. For a bot I would recommend creating another MB account (and requesting that it be marked as a bot). For a script you could use your own account, because it isn’t much different from how people currently send ISRCs from their releases.

Jim_DeLaHunt · September 11, 2016, 7:07pm

Good for you! I would benefit from a bot/script like this. When you release it, I’ll be wanting to see if I can use it, and maybe contribute improvements.

My source of cue sheets will be a “CD Archive” directory tree, each leaf directory corresponding to one Release with a single FLAC file for each Medium in the Release. The cue sheets are both embedded in the single FLAC file, and available as separate files in the leaf directory.

jesus2099 · September 11, 2016, 8:32pm

But note that the cue sheet usually does not contain the last track’s length or the full disc length (they amount to the same thing), which is what you would need to compute a CD TOC / Disc ID.

If it comes from EAC, you should keep the rip log file for this.

Jim_DeLaHunt · September 11, 2016, 8:54pm

Thank you for the tip. I’m using XLD to rip my CDs. The log file there has a Table Of Contents entry:

TOC of the extracted CD
     Track |   Start  |  Length  | Start sector | End sector 
    ---------------------------------------------------------
        1  | 00:00:00 | 07:57:69 |         0    |    35843   
        2  | 07:57:69 | 01:59:58 |     35844    |    44826   
        3  | 09:57:52 | 13:20:71 |     44827    |   104897   
        4  | 23:18:48 | 04:44:65 |    104898    |   126262   
        5  | 28:03:38 | 07:39:04 |    126263    |   160691   
        6  | 35:42:42 | 01:52:58 |    160692    |   169149   
        7  | 37:35:25 | 01:26:30 |    169150    |   175629   
        8  | 39:01:55 | 09:16:30 |    175630    |   217359   
        9  | 48:18:10 | 06:58:59 |    217360    |   248768   
       10  | 55:16:69 | 01:58:03 |    248769    |   257621

I haven’t proved it by writing the code, but I’m hoping that this will be enough to compute a Disc ID.

But also, I’m making a disc archive. The objective is to be able to recreate the contents of the disc, and for that one needs the audio file in addition to the cue sheet or TOC.

jesus2099 · September 11, 2016, 9:14pm

It does have enough data indeed.
I only know how to build a TOC URL, I don’t know how to build a Disc ID, I only know that it is possible and documented somewhere.
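
A minimal Python sketch of how a Disc ID could be computed from the XLD TOC shown earlier, following the documented MusicBrainz Disc ID calculation: a SHA-1 over the hex-formatted first track, last track, lead-out offset and track offsets, then a modified Base64 encoding. The function names and the row-matching regex are hypothetical. It also assumes an audio-only, single-session disc, so that the lead-out starts one sector after the last track’s end sector; a disc with a data track would need different handling.

```python
import base64
import hashlib
import re

PREGAP = 150  # sectors (2 seconds) added to every offset in a MusicBrainz TOC

# Matches XLD rows such as "  1  | 00:00:00 | 07:57:69 |  0  |  35843"
ROW_RE = re.compile(r"^\s*(\d+)\s*\|[^|]*\|[^|]*\|\s*(\d+)\s*\|\s*(\d+)\s*$")


def parse_xld_toc(log_text):
    """Return (first track, last track, lead-out offset, track offsets)."""
    rows = []
    for line in log_text.splitlines():
        match = ROW_RE.match(line)
        if match:
            rows.append(tuple(int(group) for group in match.groups()))
    if not rows:
        raise ValueError("no TOC rows found")
    first, last = rows[0][0], rows[-1][0]
    offsets = [start + PREGAP for _, start, _ in rows]
    leadout = rows[-1][2] + 1 + PREGAP  # assumption: nothing after last track
    return first, last, leadout, offsets


def disc_id(first, last, leadout, offsets):
    """Compute the 28-character MusicBrainz Disc ID for a TOC."""
    frames = [0] * 100          # slot 0 is the lead-out, slots 1..99 the tracks
    frames[0] = leadout
    for number, offset in zip(range(first, last + 1), offsets):
        frames[number] = offset
    text = "%02X%02X" % (first, last) + "".join("%08X" % f for f in frames)
    digest = hashlib.sha1(text.encode("ascii")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    # MusicBrainz swaps the characters that are awkward in URLs.
    return encoded.replace("+", ".").replace("/", "_").replace("=", "-")
```

For the TOC above, `parse_xld_toc` would give first track 1, last track 10 and a lead-out offset of 257772 (257621 + 1 + 150), which can be passed directly to `disc_id`.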

aerozol · September 11, 2016, 11:30pm

Sounds awesome - maybe this could eventually be built into Picard?
Just a thought.

Quoting Jim_DeLaHunt: “I’m hoping that this will be enough to compute a Disc ID.”

Someone made a nice little webpage where you can use log files or something to submit disc IDs to MB, but I just spent ten minutes trying to find/google for it and had no luck! Arrrrghhh. I’ve used it a couple of times and it seems to work fine.

ListMyCDs.com · September 11, 2016, 11:32pm

http://eac-log-lookup.blogspot.fi/

derobert · September 12, 2016, 4:13pm

I’ll probably have some time to start working on this tomorrow (Tuesday). I’ll keep in mind Jim’s alternate layout.

derobert · September 12, 2016, 4:15pm

Do the FLAC files have MBID-containing tags in them? That’d definitely be an easy way to find the release. (Otherwise, it’ll be interesting, as there isn’t a 1:1 mapping between releases and disc IDs).

LordSputnik · September 16, 2016, 10:19am

I wrote https://gist.github.com/LordSputnik/7500554 a while back - it’s for EAC cue files, but I’d expect morituri’s to be similar, if not exactly the same.