Creating a mysql database of acousticbrainz data


#1

Hi!

So I have a script to go through the AcousticBrainz data dumps and insert them in to a mysql database using the dataset library for python and it’s been running for a few days and only half way finished. I’m new to databases so I read up and now realize “insert” is very costly.

Anyone have experience doing this and can give me some direction? Should I be doing something else completely?

Thanks!


#2

When loading a lot of data a lot of the time can be spent checking constraints and building indexes.
Is it possible to disable or drop the indexes while you load data to the tables and then rebuild the indexes.

As long as the primary key is unique you should be able to add this once at the end instead of having the database check this on every row.
Likewise it is faster to build an index once all the data has been inserted instead of having to calculate this for every record.

Myslql has a csv file engine, it may be possible to convert your data to a csv file and make this appear in mysql as a table. Note that you are likely to have problems if you include some of the name columns as song names can have a coma.
You could then specify that this table is a csv file and use it as is or do an insert as select statement to copy to your real tables. ie
insert into real_table as select * from csv_external;

Personally I prefer using postgres over mysql as it has a lot of features not found in mysql.


#3

You’d need to tell us what you want to accomplish.

Also, futzing around with the insert process may pay off, if you are going to do it a lot of times. If you just want to have a one-off database to do some queries, manual labour will not pay off: just let the computer burn CPU for a few days more, and be done with it. So, again, depends on what you want the database for.


#4

Here are a few suggestions which I use when I need to bulk insert data into postgresql. Mysql should have some similar features.

Convert the data to csv (or similar) and use the bulk insertion tools. It seems that mysql has LOAD DATA INFILE: http://dev.mysql.com/doc/refman/5.7/en/load-data.html. Also, by converting to CSV you separate the overhead of loading all of the json files and decoding all of them. If you know exactly what data you want, you can dump to CSV once and then import as many times as you need

When you do an import, make sure that the table you are using has no indexes, primary keys, or foreign checks. Every time you do an insert the database takes time to check these constraints. It’s much faster to do a bulk insert and then create the indexes when the import has finished.

Good luck!


#5

Inserts on their own should not be too costly, just make sure your not re-indexing each time.