Sounds like a good idea to me. I sent an email to let them know about the closure and about the problem that the Wayback Machine has with archiving product pages. (Also made a donation 'cause they rock.)
I’ve been having issues connecting to the IA’s servers to save all these pages. Right now, it’s alternating between 502s and ERR_CONNECTION_REFUSED. I hope this means they’re stressed out from lots of people archiving CD Baby pages before the latter site shuts down…
ETA: These issues are intermittent. I also occasionally get internal server errors, 503s and plain ol’ timeouts.
I hope this post doesn’t dissuade anyone from archiving all the links they can.
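For anyone scripting their saves, the intermittent errors mentioned above (502s, 503s, internal server errors, refused connections, timeouts) can be handled with a simple retry-and-back-off loop. This is only a rough sketch: it assumes the public `https://web.archive.org/save/` endpoint accepts a plain GET, and the function name, retry count and delays are made-up values, not anything IA recommends.

```python
import time
import urllib.error
import urllib.request

# Status codes treated as temporary (assumption: these match the errors seen above).
RETRYABLE = {500, 502, 503}


def save_with_retry(page_url, attempts=5, base_delay=2.0,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Ask the Wayback Machine to save page_url, retrying on temporary errors.

    Returns the HTTP status on success, or None if every attempt failed.
    """
    save_url = "https://web.archive.org/save/" + page_url
    for attempt in range(attempts):
        try:
            with opener(save_url, timeout=60) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            # Anything other than a temporary server error is re-raised.
            if err.code not in RETRYABLE:
                raise
        except (urllib.error.URLError, TimeoutError, ConnectionError):
            # Refused connections and timeouts: fall through and retry.
            pass
        if attempt < attempts - 1:
            # Exponential back-off: 2 s, 4 s, 8 s, ... with the defaults.
            sleep(base_delay * 2 ** attempt)
    return None
```

The back-off matters: if the servers really are overloaded, hammering them with immediate retries only makes things worse for everyone else.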
This came up on the IA forums a while back. IIRC, someone claimed that persistent errors of this sort went away after they cleared their cache (and maybe cookies and site settings/data).
If you already did that and still get 502s or 503s, maybe it has to do with the National Emergency Library going online?
Anyone have any thoughts on how to scrape Google’s search results and cache without risking an IP ban? Most of the unarchived data was for artists and albums from the past three years. Google has at least 75,000 albums from this time period in their cache, plus more than 500,000 artist pages.
The goal is to rescue artist bios, artist area info, album descriptions, personnel credits, and recording details that can’t be found anywhere else.
IA never fully crawled store.cdbaby.com, so there are only a few saved pages here and there. Before the store’s subdomain changed in May 2017, www.cdbaby.com was crawlable for many years; IA has lots of archives from that period. Also, the URLs for album pages stayed the same for more than ten years.
The best approach might be to use the Wayback Availability JSON API to determine which artist/album pages were archived from each of the two subdomains. Some pages can only be found at store.cdbaby.com, some only at [www.]cdbaby.com. Since each subdomain represents a different time period, you need both to have a more complete history.
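Here is a minimal sketch of the two-subdomain check described above, using the Wayback Availability JSON API at `https://archive.org/wayback/available`. The helper names and the `/cd/example` path are placeholders I made up for illustration; swap in real artist/album paths. It only reports the closest snapshot per host, so treat it as a starting point, not a full history.

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"
# The two hosts discussed above; each covers a different time period.
HOSTS = ("store.cdbaby.com", "www.cdbaby.com")


def availability_url(page_url, timestamp=None):
    """Build an Availability API query; timestamp (YYYYMMDD...) is optional."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response):
    """Return the closest snapshot URL from a parsed API response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def fetch_json(api_url):
    """Default fetcher: one GET request, parsed as JSON."""
    with urllib.request.urlopen(api_url, timeout=30) as resp:
        return json.load(resp)


def check_both_hosts(path, fetch=fetch_json):
    """Map each host to its closest snapshot URL (None if not archived)."""
    return {
        host: closest_snapshot(fetch(availability_url(host + path)))
        for host in HOSTS
    }


if __name__ == "__main__":
    # "/cd/example" is a hypothetical path, not a real album page.
    for host, snapshot in check_both_hosts("/cd/example").items():
        print(host, "->", snapshot or "not archived")
```

Passing `fetch` as a parameter makes it easy to add throttling or caching later without touching the rest of the code.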
I just wish MB hadn’t changed my URL when I added the ENDED flag.
Since the CD Baby store no longer exists, we should (or could?) remove this normalisation code.
In fact, even if a CD Baby URL hasn’t been marked “ended”, I don’t think the URLs should be “updated” to store.cdbaby.com. Some artist/album pages only had functional URLs at www.cdbaby.com, while others were created during the store.cdbaby.com era (May 2017 – March 2020).