CD Baby store closes on 31 March

sibilant · March 16, 2020, 4:54am

Starting on April 1st, our retail store will transition into a download portal where fans can download previous purchases and redeem download cards.

The store wasn’t profitable enough, so CD Baby is closing it down. If there’s anything you want to scrape (or save in the Internet Archive), now is probably the time.

Some courtesy links:

dashv · March 16, 2020, 11:31pm

Thanks. There are a number of artist profiles that may not exist anywhere else, so I will need to move them or save them in the Wayback Machine.

sibilant · March 17, 2020, 1:32am

If the Wayback Machine fails on any page, try archive.today. For me it worked successfully on release pages that Wayback Machine couldn’t process.

HibiscusKazeneko · March 17, 2020, 3:56am

Should we alert the IA to this, so we can get some help archiving everything? (I know this was floated in the thread where it was revealed FreeDB would be shutting down the same day.)

sibilant · March 18, 2020, 2:45am

Sounds like a good idea to me. I sent an email to let them know about the closure and about the problem that the Wayback Machine has with archiving product pages. (Also made a donation 'cause they rock.)

sibilant · March 28, 2020, 6:41pm

Sorry if I’m late in noticing this, but the Internet Archive has made a major update to the Wayback Machine, and it can now save CD Baby album pages.

https://web.archive.org/save

HibiscusKazeneko · March 28, 2020, 10:40pm

I’ve been having issues connecting to the IA’s servers to save all these pages. Right now, it’s alternating between 502s and ERR_CONNECTION_REFUSED. I hope this means they’re stressed out from lots of people archiving CD Baby pages before the latter site shuts down…

ETA: These issues are intermittent. I also occasionally get internal server errors, 503s and plain ol’ timeouts.

I hope this post doesn’t dissuade anyone from archiving all the links they can.
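For anyone scripting their saves, intermittent 502s, 503s and timeouts like these are usually worth retrying with a growing delay rather than giving up. Below is a minimal sketch, not an official client. It assumes that requesting `https://web.archive.org/save/<page URL>` triggers a capture, and the function names, the retry count and the delays are all made up for illustration.

```python
import time
import urllib.error
import urllib.request

# Status codes treated as temporary server trouble (an assumption, adjust to taste).
RETRYABLE = {500, 502, 503, 504}


def fetch_status(url, timeout=60):
    """Return the HTTP status for url, or None on a connection error or timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 502 or 503.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return None


def save_with_retry(page_url, fetch=fetch_status, attempts=5,
                    base_delay=2.0, sleep=time.sleep):
    """Ask the Wayback Machine to save page_url, retrying on temporary errors.

    Returns the first non-retryable status code, or None if every attempt failed.
    """
    # Assumed endpoint: the save URL followed by the page address.
    save_url = "https://web.archive.org/save/" + page_url
    for attempt in range(attempts):
        status = fetch(save_url)
        if status is not None and status not in RETRYABLE:
            return status
        if attempt < attempts - 1:
            # Exponential backoff: 2 s, 4 s, 8 s, ... with the defaults.
            sleep(base_delay * 2 ** attempt)
    return None
```

Calling `save_with_retry("https://store.cdbaby.com/cd/tranmanhtuan4")` would then try up to five times. Keeping the delays generous also avoids adding to the load on servers that are already struggling.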

sibilant · March 29, 2020, 12:23am

This came up on the IA forums a while back. Iirc, someone claimed that persistent errors of this sort went away after they cleared cache (and maybe cookies and site settings/data).

If you already did that and still get 502s or 503s, maybe it has to do with the National Emergency Library going online?

HibiscusKazeneko · April 1, 2020, 4:29am

The day is finally upon us…

sibilant · April 4, 2020, 1:48am

Anyone have any thoughts on how to scrape Google’s search results and cache without risking an IP ban? Most of the unarchived data was for artists and albums from the past three years. Google has at least 75,000 albums from this time period in their cache, plus more than 500,000 artist pages.

The goal is to rescue artist bios, artist area info, album descriptions, personnel credits, and recording details that can’t be found anywhere else.

jesus2099 · October 31, 2020, 11:35pm

@reosarevok, could we disable cdbaby URL cleanup now?
Or another solution for my issue:

https://musicbrainz.org/url/1f3a2d58-49c3-407c-9ac5-5f340652d86a/edits
A CD was linked to http://www.cdbaby.com/cd/tranmanhtuan4
When I tried to mark this URL as ENDED, it was first automatically changed from http://~~www~~.cdbaby.com/cd/tranmanhtuan4 to https://store.cdbaby.com/cd/tranmanhtuan4.

This makes it impossible to explore the archived versions of www.cdbaby.com/cd/tranmanhtuan4 and leads only to an empty Web Archive entry for store.cdbaby.com/cd/tranmanhtuan4 instead…

sibilant · November 1, 2020, 8:19pm

IA never fully crawled store.cdbaby.com, so there are only a few saved pages here and there. Before the store’s subdomain changed in May 2017, www.cdbaby.com was crawlable for many years; IA has lots of archives from that period. Also, the URLs for album pages stayed the same for more than ten years.

The best approach might be to use the Wayback Availability JSON API to determine which artist/album pages were archived from each of the two subdomains. Some pages can only be found at store.cdbaby.com, some only at [www.]cdbaby.com. Since each subdomain represents a different time period, you need both to have a more complete history.
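Here is a rough sketch of that two-subdomain check, using only the Python standard library. It assumes the Availability API is reachable at `https://archive.org/wayback/available?url=...` and returns JSON with an `archived_snapshots.closest` entry when a capture exists. The helper names are my own, not part of any library.

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"
# The two store subdomains discussed above.
HOSTS = ("www.cdbaby.com", "store.cdbaby.com")


def availability_query(host, path):
    """Build the Availability API request URL for one host/path pair."""
    target = f"{host}/{path.lstrip('/')}"
    return f"{API}?{urllib.parse.urlencode({'url': target})}"


def closest_snapshot(payload):
    """Return the closest snapshot URL from a decoded API response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_both_hosts(path, timeout=30):
    """Query the API once per subdomain (this makes network requests)."""
    results = {}
    for host in HOSTS:
        query = availability_query(host, path)
        with urllib.request.urlopen(query, timeout=timeout) as resp:
            results[host] = closest_snapshot(json.load(resp))
    return results
```

For example, `check_both_hosts("/cd/tranmanhtuan4")` would return a dict with one entry per subdomain, each either a snapshot URL or `None`. A page that only shows up under one of the two keys is exactly the case where rewriting the link to the other subdomain loses the archive.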

jesus2099 · November 1, 2020, 8:24pm

I would just like MB not to change my URL when I add the ENDED flag.
As cdbaby doesn’t exist, we should (or could?) remove this normalisation code.

sibilant · November 1, 2020, 8:44pm

I agree, your link should not have been changed.

In fact, even if a CD Baby URL hasn’t been marked “ended”, I don’t think the URLs should be “updated” to store.cdbaby.com. Some artist/album pages only had functional URLs at www.cdbaby.com, while others were created during the store.cdbaby.com era (May 2017 – March 2020).

silentbird · August 1, 2023, 7:30am

@reosarevok I just encountered a CD Baby stub that I want to import, but this issue is still a problem. Could you remove this automated URL cleanup?

jesus2099 · August 1, 2023, 7:55am

Apparently I was able to force the correct URL in my example.
Maybe I deactivated JavaScript, or something…

reosarevok · September 6, 2023, 12:05pm

This seems sensible actually: