I recently mentioned here that the IA has a CD collection that includes complete high-resolution scans of all the artwork (covers, booklets, the whole thing). I recently added a release based on their scans, but one thing that annoyed me is that I couldn’t confirm the track lengths, because they only have 30-second samples, so I had to use the digital release’s lengths, which might be different.
Looking harder into the files the IA provides, I realized they actually have a scandata.json with the complete TOC string and even an already calculated MB disc ID. (They also include logs for their whole CD ripping process.)
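For anyone wondering how those two fields relate: the MusicBrainz disc ID is derived purely from the TOC, so the pair can be cross-checked. Below is a rough Python sketch of the calculation as I understand it from the MusicBrainz documentation, assuming the TOC string has the usual "first last lead-out offset1 ... offsetN" layout. The function name is my own, not something IA or MB provides.

```python
import base64
import hashlib


def disc_id_from_toc(toc: str) -> str:
    """Derive a MusicBrainz-style disc ID from 'first last leadout off1 ... offN'."""
    first, last, leadout, *offsets = (int(x) for x in toc.split())
    if not 1 <= first <= last <= 99:
        raise ValueError("track numbers must be within 1..99")
    if len(offsets) != last - first + 1:
        raise ValueError("number of offsets does not match the track range")

    # 100 slots: slot 0 holds the lead-out, slots 1..99 hold the track
    # offsets indexed by track number; unused slots stay zero.
    slots = [0] * 100
    slots[0] = leadout
    for track, offset in zip(range(first, last + 1), offsets):
        slots[track] = offset

    # Hash the upper-case hex text form, not the raw integers.
    text = "%02X%02X" % (first, last) + "".join("%08X" % s for s in slots)
    digest = hashlib.sha1(text.encode("ascii")).digest()

    # Base64, with the three URL-unfriendly characters swapped out.
    encoded = base64.b64encode(digest).decode("ascii")
    return encoded.translate(str.maketrans("+/=", "._-"))


if __name__ == "__main__":
    print(disc_id_from_toc("1 6 95462 150 15363 32314 46592 63414 80489"))
```

If the value computed from the TOC differs from the disc ID stored next to it, one of the two was probably damaged or copied incorrectly.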
Thanks to this post by @outsidecontext I learnt how to submit the TOC string directly so that the release has the disc ID and correct lengths.
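In case it helps, here is a small Python sketch of what "submitting the TOC string directly" amounts to: turning the TOC into a link to the MusicBrainz attach page. It assumes the TOC is in the "first last lead-out offset1 ... offsetN" form and that the attach page accepts it as a `toc` query parameter. The function name and the sanity checks are my own additions.

```python
from urllib.parse import urlencode

ATTACH_ENDPOINT = "https://musicbrainz.org/cdtoc/attach"


def toc_attach_url(toc: str) -> str:
    """Build an attach link from a TOC string, rejecting obviously broken input."""
    numbers = [int(x) for x in toc.split()]  # raises ValueError on stray characters
    if len(numbers) < 4:
        raise ValueError("a TOC needs first, last, lead-out and at least one offset")
    first, last, leadout, *offsets = numbers
    if len(offsets) != last - first + 1:
        raise ValueError("number of offsets does not match the track range")
    if offsets != sorted(offsets) or offsets[-1] >= leadout:
        raise ValueError("offsets must ascend and end before the lead-out")
    # urlencode turns the spaces into '+', so no digit is ever retyped by hand.
    return ATTACH_ENDPOINT + "?" + urlencode({"toc": " ".join(map(str, numbers))})


if __name__ == "__main__":
    print(toc_attach_url("1 6 95462 150 15363 32314 46592 63414 80489"))
```

The checks are only there to catch a truncated or mangled paste; they say nothing about whether the TOC belongs to the release you attach it to.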
Maybe someone can make a userscript to make this even easier? In any case, we have this treasure trove of hundreds of thousands of CD TOCs I feel most people are unaware of.
But it also has the TOC. It’s possible CDs ripped at different times have (slightly?) different formats, but all I’ve seen have TOC and disc ID. If some are actually locked, it’s probably a mistake: this is just metadata, there’s nothing copyrightable.
yep that’s what I’m thinking, a slight change in the way things were done generating different files… but now I know what I am looking for in the JSON files, I think I might be able to work something out over time - but I’m no expert, so someone else might figure it out before me
Really simple HTML form, put in the URL to the JSON file on IA, it will then find the information inside and then produce a URL to add.
Basic instructions for use:
Find the item on IA
Locate the JSON file
Copy the URL
Put URL in the field on the page
Press Submit
If all works OK click on the hyperlink to then complete ToC addition on MBz
Many Limitations
Really basic error handling, that will simply say “welp i dont know”
Can’t handle multiple disc sets (I need to adjust to enable array processing)
No logic to confirm legitimacy, this is in my opinion something the human contributor needs to do at the MBz stage
If the variable changes it won’t work
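On the multiple-disc limitation, one possible approach is to walk the whole JSON document and collect every TOC string it contains, instead of reading one fixed path. A rough Python sketch follows; the key name `toc` and the sample layout are assumptions made up for illustration, since the real scandata.json structure may differ between items.

```python
import json
from typing import Any, Iterator


def find_toc_strings(node: Any, key: str = "toc") -> Iterator[str]:
    """Yield every string stored under `key`, at any depth, in document order."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                yield value
            else:
                yield from find_toc_strings(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_toc_strings(item, key)


if __name__ == "__main__":
    # Made-up two-disc example, NOT the actual scandata.json layout.
    sample = json.loads(
        '{"discs": [{"toc": "1 2 30000 150 15000"}, {"toc": "1 1 20000 150"}]}'
    )
    for number, toc in enumerate(find_toc_strings(sample), start=1):
        print(f"disc {number}: {toc}")
```

Because the search does not depend on where the key sits, a single-disc file and a multi-disc file go through the same code, and a renamed key only needs the `key` argument changed.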
Note: I am not good with userscript creation, or anything like that - but to try and do this quick and dirty I leveraged the powers of ChatGPT to do most of the script creation.
You work fast! This is already useful: it saves us from editing the TOC by hand and accidentally deleting a digit. I tried a few releases, both already in MB and not, and it seems to be working as intended. Not supporting multiple disc sets is a big issue, though.
Yeah, this is unavoidable, and it’s the only thing that worries me: people can just search for the release name and add the TOC without confirming it is the same release. But I guess the same goes for most data here.
I can’t help thinking about how much could be done with this data, though:
An icon on IA item to submit the TOC directly.
A button/link on CD releases on MB to search for the TOC on IA (through UPC).
Doing this as a batch process (matching MB releases without disc ID with IA CD items with TOC and the same UPC).
An import script to import CD items from the IA, including track lists, artwork, IDs, TOC, the whole thing.
And more, probably.
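As a starting point for the UPC matching ideas, here is a Python sketch that only builds the MusicBrainz search URL for a barcode. It assumes the `barcode` field of the MusicBrainz release search; fetching the result, checking whether the release already has a disc ID, and finding the UPC on the IA side are all left out, and the function name is hypothetical.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://musicbrainz.org/ws/2/release/"


def barcode_search_url(upc: str) -> str:
    """Return a release search URL for one barcode, in JSON format."""
    digits = upc.replace(" ", "").replace("-", "")
    if not digits.isdigit():
        raise ValueError("barcode must contain digits only")
    query = {"query": f"barcode:{digits}", "fmt": "json"}
    return SEARCH_ENDPOINT + "?" + urlencode(query)


if __name__ == "__main__":
    print(barcode_search_url("0 12345 67890 5"))
```

Any real batch run would also need to respect the MusicBrainz rate limit and send a meaningful User-Agent, and a human should still confirm each match.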
I also noticed you added the IA link as “discography entry” to the release I’m working on. I guess that works, but I’m thinking we should have a specific IA external link type.
Cheers, but don’t thank me - thank the computer that wrote it
Yeah indeed, although I’m getting strange results with some - now these are weird low-run radio-station promotion compilations so the chances of misprints are quite high, but some of the durations are coming out quite a bit off from what’s printed on the release, example: https://musicbrainz.org/release/30f746ef-2ead-42cf-be59-8e2897c26908
So I am putting an annotation stating:
Present DiscID has been calculated from Internet Archive log file; however there are some discrepancies between the durations printed on the release and those calculated from the log. Needs someone with a physical copy to validate.
Yet other DiscIDs are pretty much bang on, matching existing entries or even matching physical copies I own. So maybe it is indeed a misprint that’s causing the odd result.
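To compare against printed durations before attaching anything, the track lengths can be worked out from the TOC alone, since an audio CD has 75 sectors per second. A minimal Python sketch, assuming a plain audio disc with a TOC in "first last lead-out offset1 ... offsetN" form (function names are mine; discs with data tracks may need extra care):

```python
SECTORS_PER_SECOND = 75


def track_lengths_ms(toc: str) -> list[int]:
    """Return each track's length in milliseconds, in track order."""
    first, last, leadout, *offsets = (int(x) for x in toc.split())
    if len(offsets) != last - first + 1:
        raise ValueError("number of offsets does not match the track range")
    # Each track runs until the next track starts; the last one runs to the lead-out.
    boundaries = offsets + [leadout]
    return [
        (end - start) * 1000 // SECTORS_PER_SECOND
        for start, end in zip(boundaries, boundaries[1:])
    ]


def format_length(ms: int) -> str:
    """Format milliseconds as m:ss, rounded to the nearest second."""
    seconds = round(ms / 1000)
    return f"{seconds // 60}:{seconds % 60:02d}"


if __name__ == "__main__":
    toc = "1 6 95462 150 15363 32314 46592 63414 80489"
    for number, ms in enumerate(track_lengths_ms(toc), start=1):
        print(number, format_length(ms))
```

A difference of a second or two from the booklet is normal; a difference of many seconds on several tracks suggests either a misprint or a TOC from a different pressing.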
Yep, certainly, I will try and make my script public so people can pull it apart and put it back together again in other ways. As mentioned, I’m not great with this kind of stuff and usually make tools that work well enough for myself.
Yes, I thought that was the best relationship type to choose. IA is meant to be a library anyway, so it would be considered an “entry”. Although these have streaming snippets, I feel like marking them as “streaming available at” is disingenuous.
If the lengths are significantly off I’d prefer not to add it, I think. But my money is on the misprint, I also tried a couple CDs I own and added the TOC myself to MB, and so far the TOCs seem perfect.
Yeah, I agree, my point is the IA is kind of its own thing, it should probably have its own type. Then you could have different attributes like, “free download”, “samples only”; probably the idea is to eventually make the music “available to rent” like with books, etc. But that really is a different issue.
I’m just going to say that this is awesome. Thanks for finding this and the creation of the TOC submit site. I’ve bookmarked it already. Never really thought about checking for this there before. I use the site all the time, but never stumbled across DiscIDs, etc. there. Very cool.