Reasons for why there hasn’t been a full database dump in years

ijabz · May 21, 2019, 7:25am

I'm sorry, but I really can't understand why it seems either so difficult or such a low priority to create a full dump, more than four years after the last one.

alastairp · May 22, 2019, 3:34pm

There are a number of things going on here as to why we don’t have a dump for AcousticBrainz. First of all, we know that we haven’t been making dumps, and it’s something that I wish we had been able to do sooner.

With that in mind, there are several reasons why this hasn’t been higher on our priority list:

- I’m not paid to work on AcousticBrainz, which means that any effort I put in is mostly on a volunteer basis.
- We have lots of great contributions by other people, especially GSoC students, so when I have time to put towards reviewing code, I focus first on work by other contributors to show them that we appreciate their efforts.
- We have always given open, non-ratelimited access to the AcousticBrainz API, and also included bulk API GET methods. On your suggestion we updated the documentation to make this more visible.
- We did some work on the dumps late last year, which we were able to use to make a dump. However, we also found some bugs in the way that dumps, especially incremental dumps, were made. They also took too long to run (over 2 days) and put too much strain on our database server. Getting dumps to work well is difficult for a number of reasons, including:
  - we have a huge amount of data to dump, in 3 different formats
  - properly testing dumps means that we need to do this on a large database; often what seems OK on our local machines doesn’t scale to the size of the AB database
  - we have personal information of users in the AB database, which means we can’t just make a direct dump of the database tables
  - we have to consider how we want to make dumps that are incremental when new data is submitted after a dump is made
  - we have to consider how to make dumps work when we release new versions of our data models
- Unfortunately, you’ve really been the only person who has been pushing for us to publish dumps. If there had been more people requesting it, it’s possible that we would have pushed this higher up the development list.
- I had always hoped that we could finish all the dumps and get them released, but perhaps in hindsight we should have got at least one interim type of dump out before finishing all three types of dumps, with all of the incremental functionality that we want to provide.

I’ve been working with @iliekcomputers over the last few months to clean up the backlog of open PRs (as we discussed in our blog post), and when he’s finished with exams we’ll be doing another push on the ones that are still open. Keep an eye on this space.

ijabz · May 27, 2019, 9:36am

Thanks for replying, although I don’t agree with much of your logic.

One point to consider is that without access to the database data, I don’t really see how volunteers can contribute much, other than bug fixes to the code base.

reosarevok · December 30, 2020, 7:31am

3 posts were split to a new topic: Fixing AcoustID broken data