Get the whole dataset

Hello everybody,
First of all, thank you for this new forum.

My question may be naive, but I would like to know if it's possible to get the whole MusicBrainz database in a common format (CSV, JSON… whatever)? I would like the whole dataset for a big data analysis exercise.

I found this link to download the DB (http://ftp.musicbrainz.org/pub/musicbrainz/data/fullexport/), but I'm wondering if it's really the whole database, because it doesn't seem that big (~5 GB).

Thanks
F.

Yes, see https://musicbrainz.org/doc/MusicBrainz_Database/Download

You found the right place. There is a lot of text data one can fit into 2 GB compressed :slight_smile:


That's the answer I was looking for, just to be sure.

Thank you very much!

Hey Fargo88,

Great question!
Did your project work? I'm currently working on a similar project, and for that purpose it is essential to work with a CSV file of the database. Is there any way you could share your CSV file?
Thank you for replying!
P.

Have you done your project? Please share the CSV file with me.

I think there are a few of us looking for a CSV file we can import into other tools.
In my case I want to load MusicBrainz data into the Neo4j graph database.

If I’m able to get a CSV file I will post a link back here!

Sharing and collaboration do not seem to be very frequent…

MusicBrainz has a web service which provides data as XML or JSON. With proper queries you can gather the data you’re interested in and convert it to whatever format you need.
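For instance, here is a minimal sketch of a JSON query in Python using the requests library. The User-Agent string and the search term are just placeholders (MusicBrainz asks clients to identify themselves with a meaningful User-Agent):

import requests

# Query the MusicBrainz web service (ws/2) for artists matching a search term.
# MusicBrainz expects a meaningful User-Agent; this one is a placeholder.
headers = {'User-Agent': 'MyDataProject/0.1 (me@example.com)'}

resp = requests.get(
    'https://musicbrainz.org/ws/2/artist',
    params={'query': 'artist:Nirvana', 'fmt': 'json'},
    headers=headers,
)
resp.raise_for_status()

for artist in resp.json().get('artists', []):
    print(artist['id'], artist['name'])

From there you can write the results out in whatever format you need, CSV included.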


Isn’t the MB database dump just a bunch of TSVs anyway? You can just load them into any decent CSV parser and tell it to use a tab instead of a comma as a delimiter.

Python:

import csv

# MB dump files are tab-separated text (PostgreSQL COPY output; \N marks NULL)
with open('<some db dump file>', encoding='utf-8', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    data = list(reader)

Thanks for the insight, Zas.
Since I'm not a developer, extracting MusicBrainz data as XML or JSON means that I still have to code in order to ETL something.
Not lazy, just not as talented as you are :wink:
This is not user-friendly enough lol

Thanks for the suggestion!
Again, your idea means installing Python and using some code like the one you suggested…
It seems simple, so I may give it a try.
Otherwise, could I import the MusicBrainz dump straight into a CSV parser?
Could you suggest one, please?
Regards,
Simon

For Neo4j specifically, you can actually just specify the delimiter directly upon import.
Via LOAD CSV in Cypher: https://neo4j.com/developer/kb/how-do-i-define-a-load-csv-fieldterminator-in-hexidecimal-notation/
Via neo4j-import: https://neo4j.com/developer/kb/how-do-i-specify-the-field-and-array-delimiter-to-neo4j-import-as-a-ascii-character/
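Here is a rough sketch of the first option driven from Python, using the official neo4j driver. The connection details, file name, node label, and column positions are all assumptions for illustration, and the dump file would need to be copied into Neo4j's import directory first:

from neo4j import GraphDatabase

# Hypothetical connection details; adjust for your own instance.
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

# LOAD CSV with a tab field terminator, per the first link above.
# 'artist.tsv', the :Artist label and the column indices are made up for
# illustration; the dump files have no header row, so columns are
# addressed by position.
query = r"""
LOAD CSV FROM 'file:///artist.tsv' AS row FIELDTERMINATOR '\t'
CREATE (:Artist {mbid: row[1], name: row[2]})
"""

with driver.session() as session:
    session.run(query)
driver.close()

For importing a full table, the batch tool in the second link will be far faster than LOAD CSV.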


Ropdebee, again, thanks for the quick response!
Simon

That's a very simplistic view of a modern relational database consisting of millions of entities and gigabytes of data. It's a bit like thinking a modern car is just a bunch of metal and plastic parts… and that you can disassemble and reassemble it without deep knowledge of mechanics… possible, but not "easy".

Because it's more complicated than you think to dump a whole complex database to simple CSVs. The thing is that you totally miss the meaning of "relational". Even though each table could be exported as a CSV/TSV somehow (and even that isn't simple), those tables are linked together. I can tell you you're on the wrong track if you think you can interact with MB data this way. At best, if someone manages to make it work, it will be extremely slow and probably totally unusable.
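To make the "linked together" point concrete: with raw TSVs, even a simple lookup such as "which area is this artist from?" means writing your own join across two dump tables. A hedged sketch of that in Python (the file names follow the dump's one-file-per-table convention, but the column positions are assumptions for illustration):

import csv

# First pass: build an id -> name map for one table (assumed columns).
areas = {}
with open('area', encoding='utf-8', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        areas[row[0]] = row[2]  # assumed: column 0 is the id, column 2 the name

# Second pass: resolve each artist's area through the map (assumed columns).
with open('artist', encoding='utf-8', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        print(row[2], '-', areas.get(row[11], 'unknown'))  # assumed FK in column 11

And that is one of the simplest relationships in the schema.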

See https://musicbrainz.org/doc/MusicBrainz_API/Examples

Since you're not a developer, you'd be better off hiring one and focusing on what you want to do with the data.

Well, yes, each table could be exported and then imported into a CSV parser, and with enough resources you might even be able to do something with it, but, to be frank, that's not something you want to do.


If you're only interested in a handful of fields from one entity, then I think such a simplistic view is totally acceptable. It's completely viable to just load in the TSVs in that case; I've done it before. IMHO, loading all of the data into an actual RDBMS, or making a bunch of requests to the MB API to eventually get the same data, is much more complicated for that specific use case.
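As an illustration, here is a sketch of that single-entity case, streaming one dump file and keeping only the column you care about (the file name and column index are assumptions):

import csv

# Stream one dump table and keep a single field; after tab-splitting,
# the dump's NULL marker appears literally as the two characters \N.
names = []
with open('release_group', encoding='utf-8', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        if row[2] != r'\N':       # skip NULLs (assumed: name in column 2)
            names.append(row[2])

No RDBMS, no API calls, and it runs in a single pass over the file.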

If you need data from multiple entities, then yes, I agree that it’s easier to import the data into an RDBMS and work with it from there, or use the API. In that case, it doesn’t even make much sense to talk about a single CSV file, as it would be heavily denormalised and barely usable.