Doubts regarding GSoC project "Storage for AcousticBrainz v2 data"

Hello, I am Sweta Vooda, currently in the 2nd year of my BTech.

I am almost ready to submit my GSoC proposal for "Storage for AcousticBrainz v2 data". I need clarification on a few points to make sure I am heading in the right direction before I finish the proposal.

Regarding the requirements mentioned for this project:

  • The third point says "Update the client software to include a check where they announce to the server what version they are"

After checking the client-side code, it is clear that the client doesn't add any extra version fields but sends the data straight from the extractor output to the server.
The "process" function in acousticbrainz.py (in the client) stores the extractor output in a temporary JSON file ("tmpname") and sends it straight to the server after some checks.
It is also clear that the client itself does not set any extractor version; the JSON file posted with the low-level data already contains the extractor version:

{u'essentia': u'2.1-beta2', u'extractor': u'music 1.0', u'essentia_build_sha': u'70f2e5ece6736b2c40cc944ad0e695b16b925413', u'essentia_git_sha': u'v2.1_beta2'}

Then why do we need to send it again?

  1. Do you want the client to explicitly send one specific data-version attribute, set to the extractor version, instead of sending it as part of the version JSON structure that also contains version details of the other tools?

    (Or)

  2. Do you want it to be the version number in the PKG-INFO of the abzsubmit package?

I assume it is #1 above; please clarify/confirm.

  • Next, in the 1st point: "Update the database schema to include a data version field, and allow the Submit and Read methods to switch between them."

There already exists a version table in which the "data" field stores all versions as a string:

{u'essentia': u'2.1-beta2', u'extractor': u'music 1.0', u'essentia_build_sha': u'70f2e5ece6736b2c40cc944ad0e695b16b925413', u'essentia_git_sha': u'v2.1_beta2'}

So can we not use that itself to identify versions? Or should we store the extractor version separately in another table, to make it easier to query (read and submit) by version and to maintain lowlevel_json data for various versions?

  • Lastly, there is a lot of ambiguity regarding the 2nd point: "Update the frontend including the dataset editor"

As far as I have understood, the frontend should:

  1. show different versions through the API or server for the same MBIDs;

  2. allow the user to select the version of low-level data that they want to evaluate;

  3. include a version field for the extractor used in the final output when displaying the low- and high-level data corresponding to a recording.

@alastairp It would be helpful if you could clarify these points and point me in the right direction so I can submit my proposal soon.

Thanks for the questions.
First of all, a reminder and suggestion: These project outlines are very brief and don’t include all the information about the project. As we describe on our SoC intro pages, it’s a good idea to join the community and discuss any projects that you are interested in with us so that you understand what we’re hoping to make. At the moment there are not many days left before the proposal deadline, so I recommend that you get it submitted as soon as possible to start this discussion with us.

The idea here is that after we launch the new storage system, users with an old version of the client could continue submitting data, even though it’s not really useful to us any more (we would want users to submit using the new version of the extractor). We don’t want users to waste their time processing files with an old extractor. So this isn’t directly related to the version field in the json, but to the submission process of the client sending data to the server. As an idea, we could use the following process:

  1. when the user starts the client, it retrieves the current version of the extractor
  2. the client sends a request to the server saying what version it is
  3. the server either says OK, or says that the client is too old
  4. if the response from the server is that the client is too old, show an error message to the user and tell them how to update to a newer version
  5. optionally: automatically download and upgrade the extractor to a newer version and continue submitting
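The handshake in steps 1–4 could be sketched roughly as below. This is only an illustration of the flow, not actual AcousticBrainz code: the version-comparison rule, the response format, and the function names are all assumptions.

```python
def client_is_supported(client_version, minimum_version="music 2.0"):
    """Server-side check: compare extractor versions like 'music 1.0'
    by their numeric part. (The real comparison rule is an assumption.)"""
    def parse(version):
        _name, _, number = version.rpartition(" ")
        return tuple(int(part) for part in number.split("."))
    return parse(client_version) >= parse(minimum_version)

def handle_version_response(response):
    """Client-side handling of the server's (hypothetical) reply:
    either continue submitting, or tell the user how to upgrade."""
    if response.get("ok"):
        return "submit"
    return ("Your extractor is too old; please update to version %s"
            % response.get("latest", "unknown"))
```

For example, `handle_version_response({"ok": False, "latest": "music 2.0"})` would produce the "too old" message from step 4.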

You’re right that we have the version table. However, we have many items in this table for the same value of 'essentia_git_sha': u'v2.1_beta2'. This is because we have extractors on different operating systems (the build_sha is different), or we had some earlier versions of the extractor, or bug fixes applied (a different git_sha), even though the value of 'extractor': u'music 1.0' is the same. A new version of the extractor will have a different extractor version (e.g. ‘music 2.0’). This is the new version field that we want to additionally keep track of.
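One way to picture this is a dedicated, easily-queryable extractor version column alongside the existing raw version JSON. The schema below is purely illustrative (SQLite in memory, invented column names), not the real AcousticBrainz schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE version (
        id INTEGER PRIMARY KEY,
        data TEXT,               -- full version JSON, as stored today
        extractor_version TEXT   -- e.g. 'music 1.0' or 'music 2.0'
    )
""")
# Two submissions with different build/git SHAs but the same extractor
# version end up grouped under one extractor_version value.
conn.execute("INSERT INTO version (data, extractor_version) VALUES (?, ?)",
             ('{"extractor": "music 1.0", "essentia_build_sha": "70f2..."}',
              'music 1.0'))
conn.execute("INSERT INTO version (data, extractor_version) VALUES (?, ?)",
             ('{"extractor": "music 1.0", "essentia_build_sha": "ab12..."}',
              'music 1.0'))
rows = conn.execute(
    "SELECT count(*) FROM version WHERE extractor_version = ?",
    ("music 1.0",)).fetchone()
```

Filtering all submissions by extractor version then becomes a single indexed lookup instead of parsing the JSON string in `data`.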

When we have a new version of the extractor, this is going to become the primary version that we collect, use, and share with people. However, we have over 15 million submissions from users already, and it will take some time before we have this much data for the new extractor. Therefore, we need to decide what to do when users request data. Some ideas that I have about this are here:

  • The data already includes the version of the extractor, so we don’t need to add additional information here
  • When users request data through the API for an MBID, give them data for the new version if it exists
    • If there is no data for the new version, but there is for the old version, we could return it but we need a very clear way for the user to know that they are receiving old, outdated data
  • The same with the data page. We may have more useful data with the new version of the extractor, but may not have data for all MBIDs. We should make sure that the page shows that the data is out of date if we only have it for the old version
  • We shouldn’t give users access to the old data if we have submissions for a new extractor. We know that this data is out of date and we want users to use the new improved data as much as possible
  • When using the dataset editor, we have a similar issue. If we have a dataset of 10,000 items, but only 3000 of these items have features from the new extractor, we cannot train this model. We need a way of indicating to the user that they can either train the model with the old data, or that they have to wait until there are enough submissions with the new extractor before they can train the model again.

In your proposal it would be good if you can show that you understand the cases where we need to differentiate between old and new data, but keep in mind that you will not have enough time to finish all of these changes. You should describe as many situations as you understand, but be clear on the few ones which you think are important to finish first.

One last item which isn’t listed in the project overview and which is important for this project is where to store the low-level data. The data from the new version of the extractor will be ~10x larger than the current extractor, and so we need to improve the way that this is stored in the server. Some points here:

  • Determine a more efficient way to store the output of the extractor than using json. There are many ways of compressing this data significantly, and it would be nice to evaluate these methods to see which is the best
  • Don’t store lowlevel data directly in the database: This was an idea that we originally had when we started AcousticBrainz, but experience has shown us that it’s not useful. It makes the database slow, and we’re not getting any value out of storing the data here. We want to move this to store data as files on disk instead. Metadata (version table, etc) and highlevel data can still be stored in the database because this is useful information that we want to sort and filter.

I’m looking forward to seeing the proposal to see what you’re interested in working on for this project!

Thank you so much for your inputs @alastairp. I have submitted a part of my proposal here; please check it and let me know what you think.
Also, I have listed some more doubts there; I would be grateful if you could clarify them as well.
I hope I am going about this the right way and will submit a proper, detailed proposal to work on.

For this, I was thinking we could have a JSON-to-CSV converter and store the files in the file system. Still, I am not sure how efficient this is; I will research it more and try to come up with a better solution.
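A minimal sketch of that idea: flatten the nested feature dict into dotted column names and write one CSV row per document. This is only a proof of concept I'm imagining (the function names and flattening scheme are my own), and whether it is actually smaller than compressed JSON would need measuring:

```python
import csv
import io
import json

def flatten(d, prefix=""):
    """Flatten nested dicts into {'a.b': json-encoded leaf} pairs."""
    out = {}
    for key, value in d.items():
        full_key = prefix + key
        if isinstance(value, dict):
            out.update(flatten(value, full_key + "."))
        else:
            out[full_key] = json.dumps(value)
    return out

def to_csv(doc):
    """Render one low-level document as a two-line CSV (header + row)."""
    row = flatten(doc)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(row))
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue()
```

One caveat: CSV loses the nesting and types of the original JSON, so lists and floats have to be re-encoded inside each cell, which may cancel out the savings.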