Bandcamp as source of AcousticBrainz data

I am working on a script that finds releases without AcousticBrainz data but with a link to Bandcamp (there are now 191k such releases), downloads the release from Bandcamp streaming, and submits it to AB. It can submit 4000 releases to AB from one PC (the bottleneck is the AB client’s CPU consumption).
I think there are a few things to discuss:

Legal aspect. The legality of downloading music from streaming is, let’s say, questionable. But I don’t think it is a problem for me (Bandcamp seem to be good guys, and I don’t live in the US, which complicates prosecution). The legal purity of AB data collection does not seem to be in danger here.

Data quality. BC streaming is MP3 128 kbit/s, not exactly lossless. Besides, there can be (and I believe there are) wrong (or at least not entirely right) links to Bandcamp in the MB database. I carefully and automatically collate tracklists from MB and BC. The script considers about 15% of downloaded releases not to match the MB release, mostly falsely. But I’m sure that among 191k links there are some that will falsely pass my filter.
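To make the collation step more concrete, here is a minimal sketch of how an MB tracklist could be compared with a BC tracklist. It is not the actual script: the function names, the `(title, duration)` tuple shape, and the thresholds (`min_ratio`, `max_delta_s`) are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and drop punctuation so cosmetic differences do not matter."""
    return re.sub(r"[^\w\s]", "", title.lower()).strip()


def tracks_match(mb_track, bc_track, min_ratio=0.8, max_delta_s=5):
    """Compare two (title, duration_in_seconds) tuples.

    min_ratio and max_delta_s are illustrative thresholds, not tuned values.
    """
    mb_title, mb_len = mb_track
    bc_title, bc_len = bc_track
    ratio = SequenceMatcher(None, normalize(mb_title), normalize(bc_title)).ratio()
    return ratio >= min_ratio and abs(mb_len - bc_len) <= max_delta_s


def release_matches(mb_tracks, bc_tracks):
    """Require the same track count and a match at every position."""
    if len(mb_tracks) != len(bc_tracks):
        return False
    return all(tracks_match(m, b) for m, b in zip(mb_tracks, bc_tracks))
```

A strict check like this produces false mismatches (for example when a track has no streaming on BC), which is the trade-off described above: looser thresholds let more wrong links pass, stricter ones reject more correct releases.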

The script can also submit acoustic fingerprints to AcoustID, but AcoustID seems to have some issue now (maybe provoked by the mass data submission during the script test).
Some releases on Bandcamp can be «bought» for $0 and downloaded in lossless, but that is harder to automate. Maybe there are other similar sources of data, or other data that we can automatically get from this source.

P.S. Legal statement: I do not distribute music downloaded from Bandcamp streaming, do not listen to this music, and keep it on my computer for only a few minutes


According to my estimate, 20% of releases without AB data but with a Bandcamp link can be downloaded from Bandcamp for free (name-your-price) in FLAC. My plan is to finish the current submission (I hope it will take a month or so), set up more or less automatic processing for new releases, and then start working on a lossless downloader (I hope AB database dumps will be available by then)


Just wanted to say that this is really interesting. Can’t wait to hear more about your progress. :+1:


For the last 6 days the script has worked almost 24/7 on my PC (and it fully loads my CPU). During this time, and a few test runs before it, the script processed 66k releases with a BC link but without AB data (I really need a short term for this). 46k of them were submitted to AB.
Downloading failed for 9k (it can mean anything from a wrong link in the MB database to a bug in the script or a network glitch).
10k releases did not match the MB database closely enough, and therefore were not submitted to AB. I think there are many false mismatches at this stage.
I think such errors can be used to find problems in the MB database, but that requires a human check of 19k releases and I can’t do it. Here is the list of releases which the script failed to download, and of those with matching problems.

The script also submitted 12k releases to AcoustID before it went down. Now it submits only to AB. When AcoustID comes back, a separate run of the script will be needed to re-download and submit to AcoustID all releases that were submitted to AB but not to AcoustID. Fortunately, submitting to AcoustID is much faster than submitting to AB.

Here is a graph for the last 3 years: new unique records which have data in AcousticBrainz


By the way, today the total count of unique records reached 6,000,000

I hoped that I could process such releases faster than editors add new ones, but it seems new releases appear a little faster than the script processes them. So I need to speed up. The bottleneck is the CPU. I have only one Core i7-7700, not the fastest CPU in the world. If someone has an unused server that they want to temporarily contribute to this, you are welcome.

Concerning lossless, I have not come up with anything better than using a headless browser to download them. It’s slower than bandcamp-dl, which I use now, but downloading is not the bottleneck. On the other hand, calculating acoustic data for lossless demands even more CPU time than for MP3.
Here I need feedback from AB data users: what do they need, more data from MP3 128 or less data from lossless? How different are such data?


By the way, I would appreciate any feedback. Tips, ideas, suggestions…

Pretty sure everyone’s just speechless because this is amazing :joy:


Hello,

I looked at the first 21 from the failed-to-download list.
The results are below; I’m waiting for your feedback to see if something can be done before checking more:

  • A: 2 seem to work (I just looked quickly and they are playing). But they are not standard: for one, all the tracks are longer than 10 minutes; for the other, there are 31 tracks in the release. Do you think that could cause issues for the script?
    Release “217 XIII” by centrozoon - MusicBrainz tracks >10min
    Release “Calendar - 07 | 19” by Keith Hillebrandt - MusicBrainz 31 tracks in total

  • B: 1 is a non-digital release (Vinyl format in MBz, link to buy it): I guess you know how to prevent this one :slight_smile:
    Release “First Single” by Wirehead - MusicBrainz

  • C: 15 are dead links: for those it always shows the same error message (in the language set by the user). Sometimes it is the release which no longer exists, sometimes the artist account itself.

  • D: 1 is linked to the main account rather than the release: this can be seen directly from the URL, which is missing the parameters after “…bandcamp.com/”

  • E: 2 need a subscription to be listened to: the wording is not always the same (e.g. join, subscribe); more examples would be required to confirm a pattern.

Other possibilities:

  • For cases B and E, looking at the link description (e.g. purchase for mail order, download for free, …) could be interesting to spot cases where it is stated that the music is not directly available. The problem is that those are not always set correctly by editors. It could be more of a second check to perform in case the script returns a failure.

  • For case C, the results could be refined: if a dead link is detected, check the artist account (by removing the release parameters from the Bandcamp link); then the editors or MBz bots could know all the links to inactivate.
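Cases C and D both come down to looking at the path part of the Bandcamp URL, which can be sketched as below. The function names are hypothetical, and the assumption that release links use an `/album/` or `/track/` path is mine, not something taken from the script.

```python
from urllib.parse import urlsplit, urlunsplit


def artist_root(url: str) -> str:
    """Drop everything after the host, e.g. to re-check the artist account
    when a release link turns out to be dead (case C)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))


def is_release_link(url: str) -> bool:
    """Assumption: release pages live under /album/ or /track/.
    A link with no such path points at the main account (case D)."""
    path = urlsplit(url).path
    return path.startswith("/album/") or path.startswith("/track/")
```

For example, `artist_root("https://example.bandcamp.com/album/some-album")` gives `https://example.bandcamp.com/`, which could then be fetched to tell a deleted release apart from a deleted account.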

Hope AcoustID will work again soon


A: I believe a long track or a long tracklist is not a problem. Among the successfully processed releases there were long and large ones.
«217 XIII» has no streaming available for 2 of 6 tracks, which makes it hard to match tracks to the MB release.
It seems «Calendar - 07 | 19» was just a glitch.

B: I can separate out BC links without streaming and check their type in the MB database to find links with a wrongly set «free streaming» type

C: I can separate out such failures and post here a list of dead links and the releases which use them. I can also automatically check whether the account is deleted, if that would be really useful for editors

D: Is it a correct link for MB releases? I think editors sometimes add such links (instead of the album link) because BC displays the latest album page on the artist’s main page. Here is the list of such releases as of 2021-08-18: 4369 releases

E: I think it should be handled the same way as case B: check «free streaming» in MB and, if it is set, collect the release and post the list of such releases here

So I can detect two types of errors in MB data: broken links and mistakenly set «free streaming»


Found the bug responsible for a huge number of false mismatches: the downloader and the matcher used different versions of the database. Also, in the last 8 hours the script found 240 dead links (seems like it’s a serious problem), and only 5 links for non-digital releases


I wonder if you could automate adding them to a collection or adding a tag?

It’s not surprising to me that a lot of BC links are in the wrong format! Until you came along with this process I don’t think there was much motivation or reason to change them.


There is a way of capturing the streams for the “paid for” albums, if you’re doing it for the purposes of creating fingerprints: using the developer console, Network tab, reload and look for communications to Bandcamp Bits/Bitz URLs. It’s only 128kbps MP3, but it is fingerprintable

I asked @lukz about submitting to AcoustID. He says 45k fingerprints per day is OK, so now I submit data to AcoustID as well. The average speed over the last 24 hours is 240 releases/hour
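As a rough sanity check of these two figures, assuming one fingerprint is submitted per track and that the 240 releases/hour rate holds around the clock (both are my assumptions), the implied average tracklist size can be worked out like this:

```python
# Figures quoted above.
releases_per_hour = 240
fingerprints_per_day = 45_000

# Assumption: the script runs 24 hours a day at the average rate.
releases_per_day = releases_per_hour * 24

# Assumption: one fingerprint per track, so this is the average
# number of tracks per release at which the daily figure is reached.
tracks_per_release = fingerprints_per_day / releases_per_day

print(releases_per_day, round(tracks_per_release, 1))
```

This prints `5760 7.8`, so under these assumptions 45k fingerprints per day corresponds to roughly 8 tracks per release at the current speed.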


Nice to hear.
Tell us when you can rerun the scripts with those updates, to test the results on a larger scale.
And as mentioned by @aerozol, we could imagine a later process where the results are sent to collections (or just logs, as posted before) in order to edit the different links.

:slight_smile:

The script has been submitting data to AcoustID and AB for the last few days. The code runs on 4 PCs almost 24/7. I hope that in 6 to 9 days it will finish processing the 2021-08-21 database dump. On AcoustID my submissions are still in a pending state, but I gave up trying to understand what is happening inside the AcoustID black box and just rely on @lukz.

As for BC links cleanup: broken links are being collected, and there are now 9000 links with a 404 error. I think the only reasonable way to deal with them is to write a bot. I am now trying to understand how to do it. If someone wants to take on dealing with the broken links, I will be happy to give a list of them in any format. Here is the collection, Collection “bandcamp 404” - MusicBrainz; in a few hours I will add all found releases with broken links to it.

I am also thinking about other useful data we can get from Bandcamp pages: link type (stream/download/order), tags, license, cover art. I will need to download tons of BC pages anyway if I want to download freely available FLACs for AB. But it requires additional checks and some bot to mass-edit MB


With the broken links, wouldn’t it be better for the URLs to get an End Date set? (Even if it is the date the breakage was spotted, as the actual date will be unknown.)

I have seen a few Bandcamp accounts that get taken down. It would be useful to keep a history in MB that there used to be a Bandcamp link instead of deleting it.


Agreed. There is a useful «ended» checkbox in the relationship editor of the new MB version (beta.musicbrainz.org)

I am not skilled in bots, but I believe the current site can already do an “end” date. It is just harder to find in the GUI, hence the new beta version. You just have to know a specific URL to locate it. I expect the bot wizards would know the secret. We’ve all seen that lyric bot chugging around in our subscribed edit histories. That initially was setting the relationships to “ended”. That’s the bot you need.

BTW - have to add. Nice mission you are on here. Quality job.


It depends on their available bandwidth, but it may be possible to rely on the existing bot, Editor “MBBE_Bot” - MusicBrainz, by just creating a ticket rather than starting from scratch.

Then, to pursue this idea, here are the results after checking the links on the first page of the collection (100 releases)*:
A: Significant cases

  • 29 are totally dead
  • 57 have the link not working, but the artist page still exists (it would be interesting to have the bot add the link to the artist page rather than just removing the release link)
  • 11 have the link not working and the artist page exists, but it requires clicking “Follow” (rather than directly displaying the artist page)
  • 3 have their links already inactivated**
    Total: 100
    NB: There is a bias, as (despite the limited number tested) some releases have the same artist

B: Irrelevant due to the small numbers, but still interesting

  • 2 links that I counted as “Artist page still exists” were showing Traditional Chinese text with the same picture. The second case happened at the end, so I didn’t check further, but it could be a generic error disclaimer.
  • Fewer than 5 showed other links: 1 had the other link working, 1 had 5 links all dead for “free” or “score”, … It remains to be seen whether these are isolated cases in the sample or whether there are more like them.

* I proceeded like this:

  1. Copied the first collection page into an ODS Calc file
  2. Clicked on the release link: no page was ever shown (i.e. this confirms the results are in line with what we expect), so I always proceeded to step 3.
  3. Removed what comes after “…bandcamp.com/”: if it showed “Creating Artist page”, I counted it as “Totally dead”; if something else appeared, I counted it as “Artist page still exists”
  4. Wrote the results down in the file, plus some non-standard cases
    I still have the details per release if needed

** These should be detected by the script as requiring no action.
Among the 3 releases, 2 removals were made on the same date by the same user, but for the same artist, so that is not relevant.


As I understand it, «Follow» pages are pages of artists without releases

Yes. But the artist page link (example.bandcamp.com) should be added to the artist, not to the release.


To download lossless from Bandcamp I first need to download and parse the Bandcamp album page (to check whether a free download is available), so eventually I will have up to 220k saved Bandcamp album pages. That’s a data source! From a BC page we can get:

  1. Tags
  2. Correct relationship types for link (purchase for mail-order, purchase for download, download for free, stream for free)
  3. Cover art URL
  4. Links to the artist’s social media pages
  5. License info

Tags:
This is the easiest part. The plan is to filter out tags which are not in the list of MB genres and submit the rest via the API to the corresponding release / release group (and maybe artist). Would that be correct?
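The filtering step could look roughly like the sketch below. `filter_genre_tags` is a hypothetical helper; it assumes the MB genre list is already available as plain strings and that a case-insensitive exact comparison is good enough, which may not hold for spelling variants.

```python
def filter_genre_tags(bc_tags, mb_genres):
    """Keep only Bandcamp tags that are also MB genres.

    Comparison is case-insensitive; order is preserved and duplicates dropped.
    """
    known = {genre.casefold() for genre in mb_genres}
    seen = set()
    kept = []
    for tag in bc_tags:
        key = tag.strip().casefold()
        if key in known and key not in seen:
            seen.add(key)
            kept.append(key)
    return kept
```

With the tags `["Ambient", "berlin", " ambient ", "Drone"]` and the genres `["ambient", "drone", "rock"]`, only `ambient` and `drone` survive; location tags such as `berlin` are dropped.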

Relationship types:
I wrote a script that parses a BC page and detects whether free streaming / download for free / purchase for download / purchase for mail-order is available on it. But it should be tested by a human. Here is the parse result for 1200 links I have already downloaded and parsed. It would be good if someone checked some part of it. Also, I don’t know how to classify subscription-only and preorder-only pages.

Covers:
I am not sure that automatically uploading covers to CAA is a good idea, but if someone wants such data I can create a list of Digital Media releases which have no cover in CAA, with direct links to the covers on BC

Social media links:
For the purposes of automatic import, it seems a good idea to use only links to sites known to MB.
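One simple way to do that is an allowlist of host names, as in this sketch. `KNOWN_HOSTS` is a hypothetical example set, not the real list of sites that MB recognises, and `known_social_links` is an illustrative name.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; the real list of sites known to MB would go here.
KNOWN_HOSTS = {"twitter.com", "facebook.com", "instagram.com",
               "soundcloud.com", "youtube.com"}


def known_social_links(urls, known_hosts=KNOWN_HOSTS):
    """Return only the URLs whose host (ignoring a leading 'www.') is allowlisted."""
    kept = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host in known_hosts:
            kept.append(url)
    return kept
```

Anything outside the allowlist (personal sites, shops, unknown services) would be left for manual review rather than imported.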

License:
About 5% of BC pages declare some Creative Commons license. I am sure some of the corresponding MB releases have no license relationship. But this may also require a manual approach to ensure that the BC page and the MB release correspond

I can do the tags import by myself; for the other types of data I would prefer to rely on the MBBE bot or on people’s help

P.S. It seems the current DB dump will be processed in two days!
