Internet Archive's CD Collection includes TOC/disc ID

I recently mentioned here that the IA has a CD collection that includes complete high-resolution scans of all the artwork (covers, booklets, the whole thing). I recently added a release based on their scans, but one thing that annoyed me is that I couldn’t confirm the track lengths, because they only have 30-second samples, so I had to use the digital release’s lengths, which might be different.

Looking harder into the files the IA provides, I realized they actually have a scandata.json with the complete TOC string and even an already calculated MB disc ID. (They also include logs for their whole CD ripping process.)

Thanks to this post by @outsidecontext I learnt how to submit the TOC string directly so that the release has the disc ID and correct lengths.
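
For anyone who would rather not paste the numbers by hand, here is a minimal Python sketch of turning a TOC string into a submission link. It assumes the `https://musicbrainz.org/cdtoc/attach?toc=...` form of the URL accepts the space-separated TOC (first track, last track, lead-out, then one offset per track). The function name `toc_attach_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: this is the MusicBrainz page that accepts a raw TOC string.
ATTACH_BASE = "https://musicbrainz.org/cdtoc/attach"


def toc_attach_url(toc_string: str, base: str = ATTACH_BASE) -> str:
    """Build a link that opens the 'attach CD TOC' page for this TOC."""
    fields = toc_string.split()
    if len(fields) < 4 or not all(f.isdigit() for f in fields):
        raise ValueError("TOC must be at least four whitespace-separated integers")
    first, last = int(fields[0]), int(fields[1])
    # After first/last/lead-out there must be exactly one offset per track.
    if len(fields) != 3 + (last - first + 1):
        raise ValueError("number of offsets does not match the track count")
    # urlencode turns the spaces into '+', so no digit is ever retyped by hand.
    return f"{base}?{urlencode({'toc': ' '.join(fields)})}"


if __name__ == "__main__":
    # Tiny made-up two-track TOC, only to show the shape of the result.
    print(toc_attach_url("1 2 30000 150 15000"))
```

The validation step is there because a single dropped digit or offset would otherwise still produce a plausible-looking link.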

Maybe someone can make a userscript to make this even easier? In any case, we have this treasure trove of hundreds of thousands of CD TOCs I feel most people are unaware of.


:open_mouth: I thought these files were locked from public view?

yeah it turns out some of them you genuinely can’t reach :frowning:

see:
https://archive.org/download/cd_modern-rock-august-1996_various-artists-alice-in-chains-better-tha

https://archive.org/download/cd_modern-rock-august-1996_various-artists-alice-in-chains-better-tha/cd_modern-rock-august-1996_various-artists-alice-in-chains-better-tha_scandata.json

Then there are some others which have other data - not sure if this is any good to re-generate?

https://ia800805.us.archive.org/13/items/cd_stuff-magazine-mad-fly-jams_various-artists-beanie-sigel-full-devil-ja/cd_stuff-magazine-mad-fly-jams_various-artists-beanie-sigel-full-devil-ja_scandata.json

(for https://beta.musicbrainz.org/release/93f2e6ec-1ba0-4a70-98ea-fa442c4fd202)

OK yeah, that’s enough to be able to calculate… something for evening play sessions to try and build something to work this out :slight_smile:

This isn’t locked:

"technical_metadata": {
        "discs": [
            {
                "mb_discid": "z_hFxbIpJ5RZrP1K.BDZzrnQ9nk-", 
                "toc_string": "1 19 325294 150 19268 35947 51160 70706 83411 101623 119328 136188 150237 177044 194486 213061 231685 242559 259986 274651 294189 305470", 
                "track_sectors": [
                    19118, 
                    16679, 
                    15213, 
                    19546, 
                    12705, 
                    18212, 
                    17705, 
                    16860, 
                    14049, 
                    26807, 
                    17442, 
                    18575, 
                    18624, 
                    10874, 
                    17427, 
                    14665, 
                    19538, 
                    11281, 
                    19824
                ]
            }
        ]

But it also has the TOC. It’s possible CDs ripped at different times have (slightly?) different formats, but all I’ve seen have TOC and disc ID. If some are actually locked, it’s probably a mistake; this is just metadata, there’s nothing copyrightable.
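
The three fields in that snippet are redundant with each other, which can be checked. The Python sketch below derives the per-track sector counts from `toc_string` (each track runs to the next offset, the last one to the lead-out) and computes a disc ID the way I understand the MusicBrainz algorithm: SHA-1 over the hex-formatted first track, last track, lead-out and 99 offset slots, then base64 with `+`, `/`, `=` swapped for `.`, `_`, `-`. The function names are hypothetical, and the disc ID step is written from memory of the published algorithm, so treat it as a sketch; it should agree with the `mb_discid` field if that understanding is right.

```python
import base64
import hashlib

# Values copied from the scandata.json snippet above.
TOC = ("1 19 325294 150 19268 35947 51160 70706 83411 101623 119328 136188 "
       "150237 177044 194486 213061 231685 242559 259986 274651 294189 305470")
IA_TRACK_SECTORS = [19118, 16679, 15213, 19546, 12705, 18212, 17705, 16860,
                    14049, 26807, 17442, 18575, 18624, 10874, 17427, 14665,
                    19538, 11281, 19824]


def parse_toc(toc_string):
    """Split 'first last leadout offset...' into integers and sanity-check it."""
    first, last, leadout, *offsets = (int(x) for x in toc_string.split())
    if not 1 <= first <= last <= 99 or len(offsets) != last - first + 1:
        raise ValueError("TOC fields do not match the track count")
    return first, last, leadout, offsets


def track_sectors(toc_string):
    """Length of each track in sectors: next start (or lead-out) minus own start."""
    _, _, leadout, offsets = parse_toc(toc_string)
    ends = offsets[1:] + [leadout]
    return [end - start for start, end in zip(offsets, ends)]


def mb_discid(toc_string):
    """Disc ID sketch: slot 0 holds the lead-out, slots 1..99 the track offsets."""
    first, last, leadout, offsets = parse_toc(toc_string)
    slots = [0] * 100
    slots[0] = leadout
    for number, offset in zip(range(first, last + 1), offsets):
        slots[number] = offset
    text = f"{first:02X}{last:02X}" + "".join(f"{s:08X}" for s in slots)
    digest = hashlib.sha1(text.encode("ascii")).digest()
    # altchars replaces '+' and '/'; the '=' padding is swapped separately.
    return base64.b64encode(digest, altchars=b"._").decode("ascii").replace("=", "-")


if __name__ == "__main__":
    print(track_sectors(TOC) == IA_TRACK_SECTORS)  # True
    print(mb_discid(TOC))
```

If the sector check passes but the computed ID differs from the one in the file, the TOC is the safer thing to submit, since the ID is derived from it.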


yep that’s what I’m thinking, a slight change in the way things were done generating different files… but now I know what I am looking for in the JSON files, I think I might be able to work something out over time - but I’m no expert, so someone else might figure it out before me :smiley:


OK I’ve [see end of post] made something very quick and dirty to speed up the process:

http://computer-legacy.com/musicbrainz/iatocsubmit.html

It’s a really simple HTML form: put in the URL to the JSON file on IA, and it will find the information inside and produce a URL to add.

Basic instructions for use:

  1. Find the item on IA
  2. Locate the JSON file
  3. Copy the URL
  4. Put URL in the field on the page
  5. Press Submit
  6. If all works OK click on the hyperlink to then complete ToC addition on MBz

Many Limitations

  • Really basic error handling, that will simply say “welp i dont know”
  • Can’t handle multiple disc sets (I need to adjust to enable array processing)
  • No logic to confirm legitimacy, this is in my opinion something the human contributor needs to do at the MBz stage
  • If the variable names in the JSON change, it won’t work
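
Two of these limitations (multi-disc sets and unhelpful errors) could be addressed roughly as in the Python sketch below. This is not the code behind the form. It assumes `technical_metadata` sits at the top level of the scandata JSON with a `discs` array whose entries carry `toc_string`, as in the snippet posted earlier, and that the MusicBrainz `cdtoc/attach?toc=` URL form is what gets produced. `attach_urls` is a made-up name.

```python
import json
from urllib.parse import urlencode

# Assumption: the MusicBrainz page that accepts a raw TOC string.
ATTACH_BASE = "https://musicbrainz.org/cdtoc/attach"


def attach_urls(scandata_text):
    """Return [(disc_number, url), ...], one entry per disc in the JSON text."""
    try:
        data = json.loads(scandata_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    # Assumed layout: {"technical_metadata": {"discs": [{"toc_string": ...}]}}
    meta = data.get("technical_metadata") if isinstance(data, dict) else None
    discs = meta.get("discs") if isinstance(meta, dict) else None
    if not isinstance(discs, list) or not discs:
        raise ValueError("no technical_metadata.discs array found")

    urls = []
    for number, disc in enumerate(discs, start=1):
        toc = disc.get("toc_string") if isinstance(disc, dict) else None
        if not toc:
            # Say which disc is the problem instead of failing silently.
            raise ValueError(f"disc {number} has no toc_string")
        urls.append((number, f"{ATTACH_BASE}?{urlencode({'toc': toc})}"))
    return urls


if __name__ == "__main__":
    # Made-up two-disc example, only to show the looping.
    sample = json.dumps({"technical_metadata": {"discs": [
        {"toc_string": "1 2 30000 150 15000"},
        {"toc_string": "1 1 20000 150"},
    ]}})
    for number, url in attach_urls(sample):
        print(f"Disc {number}: {url}")
```

Each disc gets its own link because each TOC has to be attached to its own medium on the release.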

Note: I am not good with userscript creation, or anything like that - but to try and do this quick and dirty I leveraged the powers of ChatGPT to do most of the script creation.


You work fast! This is already useful, it allows us to not edit the TOC directly and accidentally delete a digit. I tried a few releases, both already in MB and not, and it seems to be working as intended. Not supporting multiple disc sets is a big issue, though.

Yeah, this is unavoidable, and it’s the only thing that worries me: people can just search for the release name and add the TOC without confirming it is the same release. But I guess the same goes for most data here.

I can’t help thinking about how much could be done with this data, though:

  • An icon on IA item to submit the TOC directly.
  • A button/link on CD releases on MB to search for the TOC on IA (through UPC)
  • Doing this as a batch process (matching MB releases without disc ID with IA CD items with TOC and the same UPC).
  • An import script to import CD items from the IA, including track lists, artwork, IDs, TOC, the whole thing.

And more, probably.

I also noticed you added the IA link as “discography entry” to the release I’m working on. I guess that works, but I’m thinking we should have a specific IA external link type.


Cheers, but don’t thank me - thank the computer that wrote it :laughing:

Yeah indeed, although I’m getting strange results with some. These are weird low-run radio-station promotion compilations, so the chances of misprints are quite high, but some of the durations are coming out quite a bit off from what’s printed on the release. Example: https://musicbrainz.org/release/30f746ef-2ead-42cf-be59-8e2897c26908

So I am putting an annotation stating:

Present DiscID has been calculated from Internet Archive log file; however there are some discrepancies with the durations printed on the release to those calculated by the log. Needs someone with a physical copy to validate.

Yet other DiscIDs are pretty much bang on, matching existing entries or even matching physical copies I own. So maybe it is indeed a misprint that’s causing the odd result.
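
A quick way to see which tracks disagree is to convert the TOC sector counts to durations and compare them with the printed ones. The Python sketch below relies on audio CDs having 75 sectors per second. The two-second tolerance, the function names and the printed durations in the example are all made up for illustration.

```python
SECTORS_PER_SECOND = 75  # audio CD frames per second


def sectors_to_mmss(sectors):
    """Format a sector count as m:ss, rounded to the nearest second."""
    total = round(sectors / SECTORS_PER_SECOND)
    minutes, seconds = divmod(total, 60)
    return f"{minutes}:{seconds:02d}"


def mmss_to_seconds(text):
    """Parse a printed duration such as '4:15' into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def mismatched_tracks(track_sectors, printed, tolerance_s=2.0):
    """Return the 1-based numbers of tracks whose printed time is off by more than the tolerance."""
    if len(track_sectors) != len(printed):
        raise ValueError("track count differs between TOC and printed list")
    flagged = []
    for number, (sectors, label) in enumerate(zip(track_sectors, printed), start=1):
        toc_seconds = sectors / SECTORS_PER_SECOND
        if abs(toc_seconds - mmss_to_seconds(label)) > tolerance_s:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    # First two sector counts from the scandata snippet earlier in the thread;
    # the printed durations are hypothetical.
    sectors = [19118, 16679]
    printed = ["4:15", "3:00"]
    print([sectors_to_mmss(s) for s in sectors])
    print(mismatched_tracks(sectors, printed))  # [2]
```

A couple of seconds of slack seems reasonable since printed times often ignore gaps between tracks, but a disc where most tracks are flagged is more likely a different pressing than a misprint.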

Yep, certainly, I will try and make my script public so people can pull it apart and put it back together again in other ways. As mentioned, I’m not great with this kind of stuff and usually make tools that work well enough for myself.

Yes, I thought that was the best relationship type to choose. IA is meant to be a library anyway, so it would be considered an “entry”. Although these have streaming snippets, I feel like marking them as “streaming available at” is disingenuous.


also I make sure to very clearly state that the DiscID/ToC calculation is coming from Internet Archive log files


If the lengths are significantly off I’d prefer not to add it, I think. But my money is on the misprint. I also tried a couple of CDs I own and added the TOC myself to MB, and so far the TOCs seem perfect.

Yeah, I agree, my point is the IA is kind of its own thing, it should probably have its own type. Then you could have different attributes like, “free download”, “samples only”; probably the idea is to eventually make the music “available to rent” like with books, etc. But that really is a different issue.

Maybe, but considering how much they’ve had to learn about not pissing off these large book publishers, they might not get to that goal for a while.

I’m just going to say that this is awesome. Thanks for finding this and the creation of the TOC submit site. I’ve bookmarked it already. Never really thought about checking for this there before. I use the site all the time, but never stumbled across DiscIDs, etc. there. Very cool.


The same for me, actually. I love the IA, use it all the time, but somehow the wonders of their CD collection completely passed me by. Until now.
