I’m using the JSON dumps of the MusicBrainz data (from http://ftp.musicbrainz.org/pub/musicbrainz/data/json-dumps), and noticed something weird. The recording dump holds only about 60k-70k recordings, while the MusicBrainz database statistics mention over 17 million recordings (https://musicbrainz.org/statistics). Why aren’t they all in the dump? And is it possible to get a complete dump of the recordings? The numbers for the other JSON dumps I’ve looked at (releases, release groups, artists, labels, areas) do add up compared to the database statistics.
The “recording” dump is very useful to me for linking different music-related datasets for research, since it contains ISRCs where available (and another dataset I use has them as well).
Hi there, it actually intentionally only includes standalone recordings (I left a comment on the ticket). You’ll have to get recordings from the release dump.
Hi! Thanks for the response. I understand why the recording dump isn’t complete now, but my problem still stands. The release dump doesn’t include the ISRC code. The recording dump is the only one to include them as far as I have been able to see, so I still don’t have a way of getting the ISRC through the dumps.
Is there any way the recording dump could be completed or the ISRC codes added to the release dump?
You’re absolutely right, sorry for claiming that they included ISRCs without verifying it. That was a bug which I’ve now fixed. I’ve triggered a new full dump, so I’ll link that here when it’s ready.
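For anyone else who needs ISRCs from the release dump, here is a minimal sketch of pulling them out per recording. It assumes the dump holds one JSON document per line and that each release nests its recordings as `media` -> `tracks` -> `recording`, with an `isrcs` list on the recording; those key names and the function names are my assumptions, so check them against a line from your own copy first.

```python
import json
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def iter_recording_isrcs(release: Dict) -> Iterator[Tuple[str, List[str]]]:
    """Yield (recording id, ISRC list) for every track of one release document.

    The key names ("media", "tracks", "recording", "isrcs") are assumed,
    not confirmed by this thread.
    """
    for medium in release.get("media") or []:
        for track in medium.get("tracks") or []:
            recording = track.get("recording") or {}
            if recording.get("id"):
                yield recording["id"], list(recording.get("isrcs") or [])


def collect_isrcs(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Merge ISRCs per recording across releases (one JSON document per line)."""
    result: Dict[str, Set[str]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for recording_id, isrcs in iter_recording_isrcs(json.loads(line)):
            # The same recording can appear on several releases, so merge.
            result.setdefault(recording_id, set()).update(isrcs)
    return result
```

Pass it any iterable of lines, for example an open file handle on the extracted release file. Recordings without ISRCs still show up, mapped to an empty set.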
Hi, thanks! The ISRC is available indeed. I now notice the latest release dump is a lot smaller than the previous one I have from 9 August (1.31 GB vs 2.7 GB), and it holds fewer releases as well. Is this another problem?
I also saw that there is another dump with the same date as the latest one. That one has a release dump of 2.71 GB, as I would expect. Would it cause me any problems if I used that one instead?
There is definitely something strange with the file sizes that I’m investigating (it takes a while to decompress them…), but at a glance I see the opposite of what you’re saying: the latest release dump from 20170823-225118 is much larger (9.3G) than the one from 20170823-001002 (2.6G), not smaller. Can you point to exactly which file is 1.3G?
If the other dump from the same date you’re talking about is 20170823-001002, then no, that one can’t be used because it was from before I applied the fix to dump ISRCs.
Sorry, I still don’t understand where you’re getting/seeing 1.31 GB. The latest release.tar.xz is 9.3 GB and contains more releases as expected (1848011 vs. 1847460 from the previous one).
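If you want to compare release counts on your side without unpacking the whole archive to disk, something like the sketch below might help. It streams the `.tar.xz` and counts non-empty lines in one member. The member name `mbdump/release` and the one-document-per-line layout are assumptions here, and `count_json_lines` is just a name I made up.

```python
import tarfile


def count_json_lines(archive_path: str, member_name: str = "mbdump/release") -> int:
    """Count non-empty lines in one member of a .tar.xz archive.

    The archive is read as a forward-only stream ("r|xz"), so nothing is
    extracted to disk. The default member name is an assumption.
    """
    with tarfile.open(archive_path, mode="r|xz") as archive:
        for member in archive:
            if member.name != member_name or not member.isfile():
                continue
            extracted = archive.extractfile(member)
            # Each non-empty line is taken to be one JSON document.
            return sum(1 for line in extracted if line.strip())
    raise FileNotFoundError(f"{member_name} not found in {archive_path}")
```

A truncated download will usually fail with a read or decompression error partway through, rather than silently returning a lower count.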
Hi, sorry for my late reply. I left on vacation without a computer (thinking something had just gone wrong with my download after reading your last comment), so I wasn’t able to download again and check if everything worked before I got back.
Anyway, I just got back and downloaded the latest dump and still have a similar problem. In the latest JSON dump, the release.tar.xz seems to be 1.46 GB. It seems to hold only about 200,000 releases. I am really confused about what could be the difference between what you are seeing and what I am seeing.
Yep, that’s very strange. I also have a Mac (10.12.6), and it shows 1.9GB in Finder’s FTP window. But I can confirm it’s 9.4GB on disk, which is also what the OSUOSL web interface reports, so I assume it’s synced correctly. It could be hitting an ancient 2GB limit somewhere. Can you download it via curl instead? curl also reports the total as 9583M for me.
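To rule out a short download, here is a rough sketch that compares the size the server advertises with the size of the local file. It assumes the server answers a HEAD request with a `Content-Length` header. The `wrapped_size` helper is pure speculation on my part: it shows what a size looks like after being squeezed into a 32-bit counter, which happens to turn the 9583M curl figure into roughly 1.46 GB. That may be a coincidence and does not prove where the limit is.

```python
import os
import urllib.request


def remote_size(url: str) -> int:
    """Return the Content-Length the server reports for url (HEAD request)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        length = response.headers.get("Content-Length")
    if length is None:
        raise ValueError("server did not send a Content-Length header")
    return int(length)


def is_complete(local_path: str, expected_bytes: int) -> bool:
    """True if the local file has exactly the expected number of bytes."""
    return os.path.getsize(local_path) == expected_bytes


def wrapped_size(size_bytes: int, bits: int = 32) -> int:
    """Size as it would appear after overflowing a counter of `bits` bits.

    Hypothetical illustration only; nothing in this thread confirms that a
    32-bit counter is involved.
    """
    return size_bytes % (2 ** bits)
```

Usage would be along the lines of `is_complete("release.tar.xz", remote_size(url))`, with `url` pointing at the release.tar.xz you downloaded. For the arithmetic: `wrapped_size(9583 * 1024**2)` gives 1458569216 bytes.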
Regarding the size increase (from 2.6G to 9.3G): I changed the tool used to compress the dumps, and created a new dump (20170911-234731), and the release one is back to 2.5G, so it seems to have been an issue in pixz.
As I’m writing this the 20170911-234731 dump isn’t synced to FTP yet, but it should be soon.
Right, that works, thanks! I checked if the directory existed when I started looking into this, and while I solved my stupid previous errors, it must’ve been replaced.