Calculating and submitting DiscID without a physical CD

quinedot · February 6, 2022, 11:57am

For background, I have a bespoke CD ripping setup. As part of the ripping process, it records the CDDB signature of the CD, and once upon a time the “cue file” as well (until the cue file program I used disappeared from my distro). The CDDB files (amusingly named like discid-1.txt) are pretty standard for anyone familiar with such:

990afa0b 11 150 20040 35556 63347 83878 101224 121771 140953 154387 170750 180897 2812
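For readers unfamiliar with the format, here is a minimal sketch of how such a line can be read and sanity-checked. It assumes the usual layout (CDDB ID, track count, one frame offset per track, total length in seconds) and the classic CDDB checksum; the function names are illustrative and not taken from any particular tool.

```python
FRAMES_PER_SECOND = 75  # CD sectors ("frames") per second


def parse_cddb_line(line):
    """Split a line into (cddb_id, track_offsets, length_seconds)."""
    parts = line.split()
    ntracks = int(parts[1])
    offsets = [int(p) for p in parts[2:2 + ntracks]]
    length_seconds = int(parts[2 + ntracks])
    return parts[0].lower(), offsets, length_seconds


def digit_sum(n):
    """Sum of the decimal digits of n."""
    return sum(int(d) for d in str(n))


def cddb_id(offsets, length_seconds):
    """Recompute the 32-bit CDDB ID from the offsets and total length."""
    checksum = sum(digit_sum(off // FRAMES_PER_SECOND) for off in offsets) % 255
    playtime = length_seconds - offsets[0] // FRAMES_PER_SECOND
    return "%08x" % ((checksum << 24) | (playtime << 8) | len(offsets))


LINE = ("990afa0b 11 150 20040 35556 63347 83878 101224 "
        "121771 140953 154387 170750 180897 2812")
disc_id, offsets, length_seconds = parse_cddb_line(LINE)
print(cddb_id(offsets, length_seconds) == disc_id)  # True
```

Recomputing the ID this way is only a consistency check on the file; it says nothing about the exact lead-out, which the line records only to the nearest second.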

I have almost 900 of these files (almost 700 cue files).

As far as I can tell, the only supported way of uploading these IDs is to have a program scan the CD itself. Is that accurate? Chances are quite low I’d ever go to that much bother. However, after reading how they are calculated, it seems like the only missing information is (a) whether there is a data track, and (b) what the lead-out is.

I can tell if there’s a data track based on the number of audio (flac) files that are present. As for the lead-out, I ran an experiment by picking a random CD, calculating the number of CD sectors in the last track based on the number of samples, and then calculated the DiscID on the assumption that the end of the last track was the leadout. (I did also sanity check for sample rate, number of channels, and no “slop” – number of samples was a multiple of 588.)

I looked up the result and it matched.
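The experiment can be sketched roughly as follows. This is a sketch under assumptions, not the actual script: it follows the published MusicBrainz disc ID scheme (SHA-1 over a hex-encoded TOC, then a modified base64), the last-track sample count is a made-up placeholder, and the helper names are hypothetical.

```python
import base64
import hashlib

SAMPLES_PER_SECTOR = 588  # 44100 samples/s divided by 75 sectors/s


def sectors_from_samples(num_samples):
    """Convert a per-channel sample count to CD sectors, rejecting 'slop'."""
    if num_samples % SAMPLES_PER_SECTOR:
        raise ValueError("sample count is not a multiple of 588")
    return num_samples // SAMPLES_PER_SECTOR


def musicbrainz_discid(first_track, last_track, leadout, offsets):
    """Hash the TOC: slot 0 holds the lead-out, slots 1..99 the track offsets."""
    slots = [0] * 100
    slots[0] = leadout
    for track, offset in enumerate(offsets, start=first_track):
        slots[track] = offset
    toc = "%02X%02X" % (first_track, last_track)
    toc += "".join("%08X" % s for s in slots)
    digest = hashlib.sha1(toc.encode("ascii")).digest()
    # MusicBrainz swaps '+', '/' and '=' for '.', '_' and '-'.
    encoded = base64.b64encode(digest, altchars=b"._").decode("ascii")
    return encoded.replace("=", "-")


# Track offsets from the CDDB line above (they already include the 150-frame pregap).
offsets = [150, 20040, 35556, 63347, 83878, 101224,
           121771, 140953, 154387, 170750, 180897]
last_track_samples = 30050 * SAMPLES_PER_SECTOR  # hypothetical value

# Assumption under test: the lead-out starts where the last audio track ends.
leadout = offsets[-1] + sectors_from_samples(last_track_samples)
print(leadout // 75)  # whole seconds; should agree with the CDDB length field
print(musicbrainz_discid(1, len(offsets), leadout, offsets))
```

Comparing `leadout // 75` with the length stored in the CDDB line is a cheap cross-check that the sample count and the recorded TOC describe the same disc.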

So my questions are:

Is this leadout assumption reasonable?
Would there be an interest in getting any missing DiscIDs that I happen to have into MusicBrainz in some automated manner?
Or is there already such a way that I’ve missed?

jesus2099 · February 6, 2022, 10:31pm

I think we should avoid automatic FreeDB/CDDB to disc ID conversions.

Please read (but maybe you already know):

https://www.jwz.org/doc/cddb.html

If there is a data track, the last audio track will be wrong by 2:32.
It seems no automatic process can know whether there was a data track at the end of the CD, unless, with your rips, you also saved the data track files or something.
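To put a number on the 2:32: as far as I know, the gap between the audio session and a trailing data session is 11400 frames, which is what gets subtracted from the data track's offset to obtain the audio lead-out. A small sketch, with a made-up data track offset:

```python
FRAMES_PER_SECOND = 75
SESSION_GAP_FRAMES = 11400  # gap between audio session and data session


def audio_leadout(data_track_offset):
    """Lead-out of the audio session when the last track is a data track."""
    return data_track_offset - SESSION_GAP_FRAMES


gap_seconds = SESSION_GAP_FRAMES // FRAMES_PER_SECOND
print("%d:%02d" % divmod(gap_seconds, 60))  # 2:32

# Hypothetical disc whose data track starts at frame 250000.
print(audio_leadout(250000))  # 238600
```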

Also, and most importantly, you will never have the correct last audio track length, because this format only stores the total length in whole seconds; the sub-second part was not saved.

foxgrrl · February 6, 2022, 10:45pm

I too have a bespoke CD ripping setup, and I hacked partial BIN/CUE support into Picard, so that “Lookup CD” will calculate the TOC hash from the selected CUE file, rather than hitting the (non-existent) CD drive that I don’t have on my tagging machine. Picard then acts the way that it usually does, as if you had the physical CD in a drive…

I’ve been meaning to clean up the code and publish it for a while, but I had to work on some other stuff…

Anyway, it is so SO SO MUCH easier to use than needing to insert and read a physical CD a zillion times.

kellnerd · February 6, 2022, 10:51pm

Picard 2.8 will be able to use a log file to calculate a TOC:

foxgrrl · February 6, 2022, 10:58pm

Awesome!

The code cleanup I needed to do was to write a proper CUE parser for whatever any random ripper produces (not just the two I’m using), and better file handling for the split tracks (or BIN), because I needed to actually read the reference audio data to know the actual length of the CD and stuff… (Because CUE files are a terrible format.) Also to write a CloneCD (.CCD) parser. (And I vaguely recall needing to parse some cdrecord -toc output at some point… It’s been a while since I looked at this.)

foxgrrl · February 6, 2022, 11:03pm

Well, I’m interested… As for leadout accuracy, you’re probably going to need to empirically measure exactly what your drive and ripping software are doing with a large representative sample of CDs with extra CDPlus sessions, or MODE 1 tracks or whatever… and verify that they’re doing the exact same thing that libdiscid is doing.

foxgrrl · February 6, 2022, 11:09pm

Oh yeah, I also modified isrcsubmit.py to use the same BIN/CUE parsing code, so I could quickly and easily submit ISRCs without needing to take minutes physically swapping and reading physical CDs.

quinedot · February 7, 2022, 2:31am

I know, CDDB is trash (but it’s always a joy to reread some JWZ). However I’d still like to extract any value possible.

I did also save the number of audio tracks present (in addition to preserving track numbers), so I know when there was a data track. I was just going to skip those cases, at least to start, as I believe they’re relatively rare. Let me do a quick check… 5%.

That’s why I’m asking about my calculation of last audio track sectors from sample count, not CDDB/FreeDB/GnuDB.

Thanks! Those are good ideas. And thank you @kellnerd too for the ticket link; I can see in the related issues that a lot of the concerns I had about a general script to do this are shared. I’ll definitely check out the libdiscid source and work on some ways to sanity check the generation.

outsidecontext · February 7, 2022, 5:11am

I would not generally recommend CUE sheet based disc ID submission (lookup is a different thing, that can be tried, no harm done). But your approach with good quality ripping data sounds sane to me. If you can identify and exclude the problematic cases I think the calculated disc IDs should do.

Another issue could be whether you can identify the exact release to submit to. Do you have enough information for this? E.g. have the FLAC files been carefully tagged against the correct release, and do they include the MBIDs?

In any case I’d suggest only submitting disc IDs for releases that don’t have any disc ID attached yet. And maybe generate a report for those where there is a conflict with existing disc IDs, to check manually.
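That triage could look something like the sketch below. It is only an illustration: `triage` is a hypothetical helper, the IDs are placeholders, and fetching the disc IDs already attached to a release (for example via the MusicBrainz web service) is deliberately left out.

```python
def triage(computed_id, attached_ids):
    """Decide what to do with a computed disc ID for one release.

    attached_ids: the disc IDs already attached to the tagged release.
    """
    if not attached_ids:
        return "submit"    # release has no disc ID yet: candidate for submission
    if computed_id in attached_ids:
        return "match"     # already known, nothing to do
    return "conflict"      # differs from what is attached: check manually


# Placeholder IDs, not real disc IDs.
print(triage("ID-A", set()))             # submit
print(triage("ID-A", {"ID-A", "ID-B"}))  # match
print(triage("ID-A", {"ID-B"}))          # conflict
```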

quinedot · February 7, 2022, 10:28am

Preliminary results against IDs where I’m sure there’s no data track involved are about 70% matches, 15% no data, and 15% non-matching data. I haven’t looked at the last category in depth, just a couple at random, and it is indeed looking like I have either some mistags, or some legitimate tags to releases that have not yet had a (duplicate) DiscId associated (or both). Should be interesting to see how the mix falls out.

Yeah, at this point it’s clear that the only results I may be confident in submitting without some more legwork are those of releases which I created myself (which is still worthwhile). I’ll revisit that once I run some physical CD sanity checks. Perhaps there will also be some with only one release or only one release I can be confident is my version. Outside of those, it would be less work to pop the CD in the drive to verify than it would be to dig the liners out and check barcodes etc, probably (I ditched my jewel cases some time ago).

Guess I’ll be tuning my tagging as much as contributing with this exercise, which wasn’t the intention. Thanks everyone for your feedback; I think the upshot is that this is a valid approach for querying and probably sufficient for newly created releases, but limited as a general data-harvesting tool.