GSOC Proposal for Storage for AcousticBrainz v2 data

sweta05 · March 27, 2020, 4:34pm

Personal information

Name: Sweta Vooda
Nickname: sweta
IRC nick: sweta05
email: sweta.vooda@gmail.com
GitHub: swetavooda
Time Zone: UTC+5:30 hours

Proposal for Storage for AcousticBrainz v2 data

Overview and benefits to the community

As music technology advances and contributions to the field grow, we need to keep our website data up to date and give our clients fresh data with better analysis and new features.
AcousticBrainz takes recordings referenced by MBID and processes them to extract the characteristics of the music, including low-level spectral information and information about genres, moods, keys, scales and many other features of the song.
AcousticBrainz aims to provide music technology researchers and open source hackers with a massive database of information about music. To produce good-quality data, it must keep updating and improving the tools used to extract data and make it available to the world.

Therefore it is very important to have the features required to store data for new versions along with the current data and provide better training models to generate more accurate features and characteristics for the recordings.

Aim

This project aims to store data for the new version of the extractor tools in addition to data from the current version of the extractor.
Points covered in the project

Server-side

Update the database schema to include a field for storing the version of the extractor.
Include a field in config file to store currently supported version of the extractor in use.
Add API to GET supported extractor version.
Add API to validate extractor version
Change all DataBase queries for insert and retrieve in data.py
Changes in dataset editor to evaluate new version data.
changes in API for dataset editor.

Client-side

Create a validation check to verify the version of extractor used at the client-side(through API).
Allow “already submitted” files to resubmit using the new version of the extractor and process them.

Other features for improving the efficiency of data storage

Implementation plan

Server-side

1. Update the database schema to include a field for storing a version of the extractor.

Current feature
The database schema at present contains a version table, which stores all version details of the software used.
Following is the version table description.

   Column    |           Type           | Collation | Nullable |               Default               
-------------+--------------------------+-----------+----------+-------------------------------------
 id          | integer                  |           | not null | nextval('version_id_seq'::regclass)
 data        | jsonb                    |           | not null | 
 data_sha256 | character(64)            |           | not null | 
 type        | version_type             |           | not null | 
 created     | timestamp with time zone |           |          | now()

The data field in the table stores the JSON version data sent by the extractor tool. It includes the version of the feature extractor used, the exact version of the source code (git commit hash) and an increasing version number.

Contents of data field:

                                                                          data                                                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------
 {"essentia": "2.1-beta2", "extractor": "music 1.0", "essentia_git_sha": "v2.1_beta2", "essentia_build_sha": "70f2e5ece6736b2c40cc944ad0e695b16b925413"}

Problem
Since the extractor version is stored in JSON format inside the table, it is difficult to query the version and use it to make changes.
Proposed changes
To explicitly tag the low-level data sent by the client with the extractor version, we will add a new column called extractor_version to the version table.
This will help distinguish existing data from data that came from the new extractor, so we can use this version information in further operations such as retrieval and evaluation as needed.
IMPLEMENTATION
Updated schema
Query
ALTER TABLE version ADD COLUMN extractor_version TEXT DEFAULT 'music 1.0';
[Image: database schema diagram (database schema2)]
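To illustrate how a column default would tag rows that already exist, here is a minimal sketch. It uses an in-memory SQLite database purely as a stand-in for the server's PostgreSQL database, and a simplified version table (only id and data), so both of those are assumptions for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL, and the table is
# reduced to two columns to keep the sketch short.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE version (id INTEGER PRIMARY KEY, data TEXT NOT NULL)")
conn.execute(
    "INSERT INTO version (data) VALUES (?)",
    ('{"essentia": "2.1-beta2", "extractor": "music 1.0"}',),
)

# Rows that existed before the change read back the default value.
conn.execute(
    "ALTER TABLE version ADD COLUMN extractor_version TEXT DEFAULT 'music 1.0'"
)

# A row written for the new extractor sets the column explicitly.
conn.execute(
    "INSERT INTO version (data, extractor_version) VALUES (?, ?)",
    ('{"extractor": "music 2.0"}', "music 2.0"),
)

rows = conn.execute(
    "SELECT id, extractor_version FROM version ORDER BY id"
).fetchall()
print(rows)
```

With a default in place, queries can filter on extractor_version without parsing the JSON in the data column.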

2. Include a field in config file to store currently supported version of the extractor in use.

Add an extractor version field to the config.py file to make it easy to know the latest extractor version supported by the server.
This can be used to check whether clients are using the latest version, and to notify them to update if they are using an old extractor.
IMPLEMENTATION
Add an EXTRACTOR_VERSION field to the config.py file and set it to the latest version in use, for example:

# extractor
EXTRACTOR_VERSION = "music 2.0"

3. Add API to GET supported extractor version.

External and third-party clients should be able to get the latest extractor version supported by AcousticBrainz. To let them check the version, we need to add an API endpoint that returns the version details on request.
IMPLEMENTATION

4. Add API to validate extractor version

This API is used to double-check whether the client is using the latest version of the extractor.
Process
When the client software initializes, and before it runs the extractor, it should check whether it is using the latest version. This API is used for that check.
API call

The endpoint will receive the extractor version used by the client in the request URL and compare it with the extractor version supported by the server (as set in the config file).

Input: extractor version at the client side
Output:
- if the extractor version is the same as the version currently supported by the server: return True/pass
- else: return a message asking the client to update, with <extractor version supported + URL for download>
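The comparison described above could look roughly like the sketch below. The function name, the shape of the response dictionary and the download URL are all hypothetical; only the idea of comparing the client's version with the server's supported version comes from the proposal.

```python
def check_extractor_version(client_version, supported_version, download_url):
    """Compare the client's extractor version with the supported one.

    Hypothetical helper: the response keys are illustrative, not an
    agreed API format.
    """
    if client_version == supported_version:
        return {"up_to_date": True}
    return {
        "up_to_date": False,
        "message": "Please update your extractor.",
        "supported_version": supported_version,
        "download_url": download_url,
    }


# Example call with a placeholder URL.
response = check_extractor_version(
    "music 1.0", "music 2.0", "https://example.org/download"
)
print(response["up_to_date"], response["supported_version"])
```

Keeping the check in a plain function like this would let the same logic back both the web endpoint and any server-side validation at submission time.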

5. Change all DataBase queries for insert and retrieve in data.py

a. Modify the query used for storing the low-level data received from the client.
Add logic to store extractor_version:
Location: data.py file, def insert_version
IMPLEMENTATION
The insert query will set the new extractor_version column to the extractor version info retrieved in core.py (point 2 server-side),
or we can get the extractor info from the JSON file submitted to the server.
Extracting it from the JSON costs some extra computation, but it gives the true value with no chance of error.
To reduce the chance of recording the wrong extractor version, we will also follow the double-check method described in point 4 server-side.
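As a rough illustration of the second option, the extractor version can be read from the version JSON shown earlier with a single key lookup. The function name is hypothetical, and the sketch assumes the version object is available on its own, as in the contents of the data field above.

```python
import json


def extractor_version_from_json(version_json):
    """Return the "extractor" entry of a version JSON string, or None.

    Hypothetical helper; assumes the layout shown in the data field,
    e.g. {"essentia": "2.1-beta2", "extractor": "music 1.0", ...}.
    """
    data = json.loads(version_json)
    return data.get("extractor")


example = '{"essentia": "2.1-beta2", "extractor": "music 1.0"}'
print(extractor_version_from_json(example))
```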

b. Change in the retrieval of low-level data (through API and views)
The present logic for retrieval of low-level data does not take the extractor version as a parameter and does not differentiate between versions.
Therefore, multiple changes are needed in the definitions and logic used to retrieve data.

When the API or views are used to retrieve the recording information (/MBID/low-level):
When the API is used to retrieve the low-level information for a given MBID, we need to show the latest version data.
If the latest version data is not available, we need to display a message saying “The data shown is from the old version and is outdated, new version data not available”
IMPLEMENTATION
Make changes in core.py and data.py

Filter the old low-level data by adding a check for whether data is available for the latest version.

  if (new version data available for the recording):
      call function in data.py(retrieve only new version data corresponding to the MBID)
  else:#no new low-level data 
      if(old version data available for the recording):
          call function in data.py(retrieve --only-- old version data corresponding to the MBID)
          Display message saying "The data is outdated(old version)"
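
The fallback above can be sketched in Python over an in-memory list of submissions. The function name and the dictionary layout (an extractor_version key on each submission, an outdated flag in the result) are assumptions for illustration; the real code would query the database in data.py.

```python
def select_lowlevel(submissions, current_version):
    """Prefer data from the current extractor, else fall back to old data.

    Hypothetical helper: `submissions` is a list of dicts, each with an
    "extractor_version" key and a "data" key.
    """
    current = [s for s in submissions if s["extractor_version"] == current_version]
    if current:
        return {"documents": current, "outdated": False}
    if submissions:
        # Only old-version data exists, so flag the result as outdated.
        return {"documents": list(submissions), "outdated": True}
    return None


old_only = [{"extractor_version": "music 1.0", "data": {}}]
print(select_lowlevel(old_only, "music 2.0")["outdated"])
```

Returning a flag rather than a message string leaves it to the caller to decide how to tell the user that the data is outdated.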

c. Other Functions logic to be changed
Changes in query in [GET]

def get_low_level(mbid):
def count(mbid):
def get_many_lowlevel():
Changes in the functions used to post and retrieve data
submit_low_level_data (check the version and raise an exception if an old version is used)
Query changes required in
def load_many_low_level(recordings):
def count_lowlevel(mbid):
def count_many_lowlevel(mbids):
def get_unprocessed_highlevel_documents_for_model(highlevel_model, within=None):
def get_unprocessed_highlevel_documents():
def write_low_level(mbid, data, is_mbid):

6. Changes in dataset editor to evaluate new version data.

Currently, the dataset editor checks for a low-level JSON dump for the given recordings and evaluates the dataset if low-level data is available.
Since we are changing versions, we should steer users away from evaluating datasets with the old version of the extractor, because we aim to collect data for the latest version.
However, initially we won’t have enough data in the new version, so we must give the user the option to either evaluate the dataset using old version data or wait until enough data is collected.
Implementing this requires several checks and refined logic.

Logic

When evaluating a dataset whose recordings have data from the new extractor version:
Check if enough data is available for training the model. (select a trigger)
- Case I: A significant number of the songs in the dataset don’t have data for the new version.
  - Display a message to the user that “not enough low-level data (new extractor version) available” and give the user a choice:
    - Wait until enough data is collected, or evaluate using old version data.
- Case II: A small portion of the dataset doesn’t have data from the new version.
  - Ignore the recordings which don’t have new version data and evaluate the rest of the recordings.
    - Display a message that the following recordings didn’t get evaluated because data in the new version is unavailable.

This work should simultaneously fix a part of the bug AB-148.
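
One possible shape for the trigger is a threshold on the fraction of recordings that lack new-version data. The sketch below is only an illustration: the function name, the input layout (a mapping from MBID to the set of extractor versions available) and the default threshold are all assumptions, and choosing the actual trigger is still open.

```python
def plan_evaluation(versions_by_mbid, current_version, max_missing_fraction=0.2):
    """Decide between Case I (ask the user) and Case II (skip a few).

    Hypothetical helper; the threshold value is a placeholder.
    """
    if not versions_by_mbid:
        raise ValueError("dataset has no recordings")
    missing = sorted(
        mbid for mbid, versions in versions_by_mbid.items()
        if current_version not in versions
    )
    usable = sorted(mbid for mbid in versions_by_mbid if mbid not in missing)
    if len(missing) / len(versions_by_mbid) > max_missing_fraction:
        # Case I: too much of the dataset lacks new-version data.
        return {"action": "ask_user", "usable": usable, "missing": missing}
    # Case II: evaluate the usable recordings and report the skipped ones.
    return {"action": "evaluate", "usable": usable, "missing": missing}


dataset = {"a": {"music 1.0"}, "b": {"music 1.0"}, "c": {"music 2.0"}}
print(plan_evaluation(dataset, "music 2.0")["action"])
```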

Implementation
Add logic in “dataset_eval.py” → validate_dataset_contents() to check whether the latest version of low-level data is available.
Pseudocode for the above:

if the number of valid recordings above limit:
    Give user the option to delete invalid recordings
    if selected:
        call delete recordings function for the MBIDs
        Create job
    else:
        if low-level data of old version available:
            give user option to use old version data and evaluate
            if selected:
                create job
            else:
                quit
        else:
            raise exception

Here a problem arises: the evaluator doesn’t know which version of the low-level data to load for each MBID.
Therefore we need to store the extractor version in the dataset_eval_jobs table, based on the selection made by the user.
Solution
Add version column to dataset_eval_jobs table and change insert query in _create_job.
Query
DB schema Diagram
After the changes are made in the database we must use the version to select the data for evaluation.
IMPLEMENTATION
Changes in evaluate.py

During the evaluation process we must change the logic of def evaluate_dataset(eval_job, dataset_dir, storage_dir): to select the low-level dump for the given version.
Change the logic of the dump_lowlevel_data function to return the dump for a given version.
dump_lowlevel_data must take an additional version parameter (eval_job[version]).
Changes in data.py
Change the logic in load_low_level(recording)
Add a version parameter to def create_groundtruth_dict(name, datadict): to insert the version (at present it is hardcoded).

7. changes in API for dataset editor.

At present, the datasets can also be created using the API so we must make sure to implement all mentioned points above in the API as well.
IMPLEMENTATION
Add checks to process the dataset before evaluation as mentioned earlier in point 6 server-side

Client-Side

1. Create a validation check to verify the version of extractor used at the client-side(through API).

When the client is submitting files for extraction, we must first check the version of the extractor used by the client.
This should be done before the extraction process starts, to avoid unnecessary computation and producing outdated data.
IMPLEMENTATION

During client initialization, check the version of the extractor and send it to the server to check compatibility.
If the response from the server is TRUE, proceed as normal;
otherwise stop and show the message received from the server.
Handling old clients which do not have the server-supported version check

2. Allow “already submitted” files to resubmit using the new version of the extractor and process them.

After the client downloads the latest extractor version, some old files may not get evaluated because they have already been processed. In that case, we must check whether the already processed files were extracted using the latest version of the extractor.

If they were, ignore these files.
Otherwise, allow these files to be processed.

Solution
The client uses an SQLite database to store the entries of submitted files.
The table stores “status” and “directory of file”.
So we can add a version column to the table and process files again if they used an old extractor.
a. Update SQLite table schema
Implementation
Alter table to add version column

import sqlite3

def alter_table_sqlite(dbfile):
    # Add a column recording which extractor version processed each file.
    conn = sqlite3.connect(dbfile)
    try:
        conn.execute("ALTER TABLE filelog ADD COLUMN version TEXT")
        conn.commit()
    finally:
        conn.close()
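
Running ALTER TABLE a second time on the same database fails because the column already exists, so the client would need some guard before applying it. One way to do that, sketched here under the assumption that the filelog table has filename and status columns, is to inspect the table with SQLite's PRAGMA table_info first. The function name is hypothetical.

```python
import sqlite3


def ensure_version_column(conn):
    """Add the version column to filelog only if it is missing.

    Returns True if the column was added, False if it already existed.
    """
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    columns = [row[1] for row in conn.execute("PRAGMA table_info(filelog)")]
    if "version" in columns:
        return False
    conn.execute("ALTER TABLE filelog ADD COLUMN version TEXT")
    conn.commit()
    return True


# Assumed table layout, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filelog (filename TEXT, status TEXT)")
print(ensure_version_column(conn), ensure_version_column(conn))
```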

b. Check whether previously evaluated files used the latest version
Implementation
Change def is_processed(filepath): to also check the version with which the file was processed.
c. During extraction, record the version of the extractor used
Implementation

Add a function to store the version for a file: def add_version_to_filelist(filepath,version):
After processing a file, call add_version_to_filelist and pass the version as a parameter.
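
A minimal sketch of steps b and c follows. It assumes a filelog table with filename, status and version columns, and it passes the database connection and the current version as explicit arguments, which differs from the signatures named above; treat those details as illustrative assumptions.

```python
import sqlite3


def add_version_to_filelist(conn, filepath, version):
    """Record which extractor version processed the given file."""
    conn.execute(
        "UPDATE filelog SET version = ? WHERE filename = ?", (version, filepath)
    )
    conn.commit()


def is_processed(conn, filepath, current_version):
    """True only if the file was already processed with current_version."""
    row = conn.execute(
        "SELECT version FROM filelog WHERE filename = ?", (filepath,)
    ).fetchone()
    return row is not None and row[0] == current_version


# Assumed table layout, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filelog (filename TEXT, status TEXT, version TEXT)")
conn.execute("INSERT INTO filelog (filename, status) VALUES ('a.mp3', 'success')")
print(is_processed(conn, "a.mp3", "music 2.0"))
add_version_to_filelist(conn, "a.mp3", "music 2.0")
print(is_processed(conn, "a.mp3", "music 2.0"))
```

Files logged before the version column existed have a NULL version, so they count as not processed and are picked up again by the new extractor.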

Other features for optimisation

The data from the new version of the extractor will be 10x larger than the current extractor so we need to improve the way that this is stored in the server.
SOLUTION
We can use JSON to CSV converter and store the data in the file system using the directory structure as follows:

lowlevel_json
├── llid1
│   ├── offset1
│   ├── offset2
│   └── offset3
└── llid2
    ├── offset1
    ├── offset2
    ├── offset3
    └── offset4

Timeline

Here is a detailed Timeline of 13 weeks of GSOC coding period.
Also, the required documentation for API and new changes will be updated simultaneously while working with the code.
coding → testing(scripts) → testing integration → documentation

Before community bonding period(April):
Contribute more to the organization by fixing some more bugs and adding features.
Learn more about data compression techniques to store and retrieve huge data.

Community Bonding period:
I will spend time learning more about Python unit testing and Docker, and gaining a deeper understanding of AcousticBrainz functionality and data compression techniques.
Get better acquainted with the members of the organization, discuss the project further and plan for the coding period.
Brainstorm to find more efficient ways to store JSON files.

Week 1:
Database schema changes, config changes and changes to the logic of the submission functions. (server-side points 1,2,5a)

Week 2:
Make required changes to mentioned functions in point 5 of server-side.

Week 3:
API changes point (server-side points 3,4)

Week 4:
writing test scripts

Phase 1 Evaluation here

Week 5:
Client-side changes for validation (client-side points 1)

Week 6:
allow already submitted files to submit again (client-side points 2)

Week 7:
test scripts and server-client integration testing

Week 8:
start dataset editor changes schema changes and (evaluate.py) (server-side points 6)

Phase 2 Evaluation here

Week 9:
finish changes in dataset editor and make API changes for datasets (server-side points 6)

Week 10:
write test scripts

Week 11:
buffer period and documentation and finish testing

Week 12:
Work on efficient ways to store data in the file system rather than in the database

Week 13:
Buffer period

Additional Ideas
Plan to compress JSON data into a more compact structure that is easier to store and serialize, because with newer versions the data can become larger and costlier to process and transfer.

Detailed information about yourself

I am a 2nd-year student pursuing a Btech in Information Technology at Keshav Memorial Institute of Technology, India.

I was very eager to contribute to an open-source organization and started searching for the most suitable organization that interests me.
I came across MetaBrainz and was very inclined towards this organization since the beginning.

I love coding and music, and found AcousticBrainz to be the best match for my interests. The goal of AcousticBrainz motivated me to contribute to the organization.

Tell us about the computer(s) you have available for working on your SoC project!
I have a Lenovo IdeaPad with an Intel Core i5 (7th generation), a 256 SSD and 8GB RAM, dual-booting Windows and Ubuntu.

When did you first start programming?
I wrote my first hello world program midway through high school.

What type of music do you listen to?
I mostly listen to melody and classical songs.
Some of the songs I like on MusicBrainz.(MBID)

Have you ever used MusicBrainz to tag your files?
Yes, only to test the client-side process and learn more but now that I have learnt how to use it I will surely try using them in future for my music purposes.

Have you contributed to other Open Source projects? If so, which projects and can we see some of your code? If you have not contributed to open source projects, do you have other code we can look at?
No, AcousticBrainz is the first open-source organization that I have contributed to, but I have some code that I wrote for college hackathons and classes; you can check it here.

What sorts of programming projects have you done on your own time?
I have participated in some competitive programming competitions and participate regularly in college hackathons.
I am also a senior developer of our college e-magazine, which is regularly revamped and maintained by the students of my college. I am also a mentor this year: I mentor 3 juniors and help them learn technologies and apply them to maintain the website.

How much time do you have available, and how would you plan to use it?
I am free all summer. According to my college schedule, classes resume in mid-July, but due to the COVID-19 situation in my country we are not sure of the schedule. I plan to give 30-35 hours a week to this project, and about 3-4 hours after college reopens.

Do you plan to have a job or study during the summer in conjunction with Summer of Code?
No

sweta05 · March 27, 2020, 4:53pm

Here is my proposal. @alastairp, I would be grateful if you could check it and clarify some of my doubts as well.
Firstly please excuse my bad formatting I will change it soon.
Secondly, I have some doubts listed below please clarify them.

Do we need a default or can we assume NULL to be the old version?

What should be the most optimal trigger for giving the user the option to go with old version data?

Any inputs on how we can get the version of the extractor in use at the client before running the extractor to avoid unnecessary computation? (instead of retrieving it after extraction)

When releasing a new version of the client-side software, are we going to update the existing client or download the new client software? This point can change a lot of the implementation details I have listed below.

alastairp · March 30, 2020, 3:31pm

Hi,
Thanks for submitting this proposal. It’s getting close to the submission time, so perhaps we won’t be able to iterate on the proposal as much as we could have if you had sent it earlier, but I think there are definitely some easy things that you can adjust to make it better.

One thing that I think this proposal is missing in general is discussion about the motivation or reasoning behind the changes that you want to make. In many cases you have given an example about a new method or a change to a method, and included some specific code changes, but personally I would prefer to see only an outline of the process, including any considerations or special cases that you think are important to consider in this change.
The reason that I prefer an outline of necessary steps is that implementation details can change - for example we might choose a different way to implement a change, or a previous change that we make might make a specific code change no longer required.

Having said that, here’s some feedback on some specific items in your proposal:

To me, this is a specific implementation detail. I have some other ideas about how we could store this version number, which I don’t think are important at this level of the proposal. Your description of how the database works currently is good. I recommend that for your implementation detail you just say “We will add a specific version field to each submission to know which version of the extractor was used to generate it” (much like you have in your “Proposed changes” section).

This is an important feature (have the server know which extractor version is in use), but my recommendation is that we have a separate table to store versions of all extractors. Here you can just say that we will have some system to know what the current version of the extractor is.

You’re missing a description under “Implementation” for 3. here. I recommend that you suggest an endpoint URL here and a general description of the kind of response that you might want to return. You don’t need a specific definition of the response at the moment. It seems to me that these two things are the same (“check that the client is up to date”). You could suggest it under a single header, and include two sub-parts for the modification to the server and to the client.
It might be a good idea to group items in this proposal by functionality - Here you could include information about the server version check and the client version check which you have below in the client section. I make this recommendation because the full functionality requires that you make changes to both the client and server. I think it’s a good idea to join the descriptions together to make this clear.

I think you’re suggesting here to get the extractor version from the JSON document that is submitted to the server. This is a good idea - we already read this json document to split out the version information, so this should be fine.

I’m not sure what you mean here, can you clarify?

We have two API endpoints for getting this data, /mbid/low-level, and /api/v1/mbid/low-level. One thing we should take care of is to ensure that we don’t break existing clients. Imagine that we have hundreds of clients that read data from the AB API. If we change this API to return the text “The data is outdated(old version)” then all of these programs will fail the moment that we release the new version. We should ensure that we make this change in a way that is backwards compatible. My suggestion

/mbid/low-level: Legacy endpoint, make no changes
/api/v1/mbid/low-level: Return old data with no additional parameters, otherwise if a version query parameter exists, load data for that parameter
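
A minimal sketch of that backwards-compatible behaviour is below. The function name and the in-memory mapping of versions to documents are hypothetical stand-ins for the real view and database code, and the legacy version string is taken from the proposal's example data.

```python
def lowlevel_for_request(documents_by_version, query_params, legacy_version="music 1.0"):
    """Return old data by default, or the requested version if given.

    Hypothetical helper: `query_params` is a dict of query-string values.
    """
    version = query_params.get("version", legacy_version)
    return documents_by_version.get(version)


documents = {"music 1.0": {"source": "old"}, "music 2.0": {"source": "new"}}
print(lowlevel_for_request(documents, {}))
print(lowlevel_for_request(documents, {"version": "music 2.0"}))
```

Existing clients that send no parameter keep receiving exactly what they receive today.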

Like I said earlier, it would be nice if you listed this in terms of feature, rather than code changes. That is, don’t just say “change get_unprocessed_highlevel_documents_for_model”, you should explain the feature that this method is used in, and describe the general changes that need to be made for this functionality to be used with new and old versions of low-level data.

This section is good. I recommend that you add it as a possible feature if you have time.

This is good, but I’d also like to see a description of how you will do the upgrade process here - we don’t want to add this column every time the client starts.

This is a very basic approach that doesn’t scale to the amount of data that we have. Unfortunately I’m not sure there’s enough time to have this discussion before the deadline, but here are some thoughts about parts that need to be considered:

What file format should we use? JSON and CSV are bad solutions, because numbers are stored as 1 byte per digit. We should perform a comparative analysis of binary numerical data formats, evaluating storage use, speed to generate and read, and compatibility over different programming languages
What changes need to be made to the submit endpoint and read endpoints to support this
Will we move all existing items for the old extractor to disk too, or only the new version?
What changes will need to be made to data backups?
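
To make the point about text formats concrete, here is a small sketch comparing the size of a few floating-point values serialized as JSON text with the same values packed as fixed-width binary using Python's struct module. The sample values are made up, and struct is used only as a simple stand-in for a binary format, not as a recommendation.

```python
import json
import struct

# Made-up sample values with many significant digits.
values = [0.123456789012345, 1234.56789012345, -0.000123456789012]

as_json = json.dumps(values).encode("utf-8")
# Little-endian 32-bit floats: 4 bytes per value, with some loss of precision.
as_float32 = struct.pack("<%df" % len(values), *values)
# Little-endian 64-bit floats: 8 bytes per value, full double precision.
as_float64 = struct.pack("<%dd" % len(values), *values)

print(len(as_json), len(as_float32), len(as_float64))
```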

I would also consider your priority order of tasks. From my experience, you have too much in this proposal to finish in the duration of the project. In my view, the priority of these items should be

Determine a way to store the new feature files
New version support in the server
New version support in the client
New version support in the high-level calculator
Store feature files on disk instead of in the database
Access new feature files from API
Show new features on the website (I see nothing in your proposal about this step)
Update dataset and model training system to use new versions

In specific response to your other questions:

No, we know the old version, this is the one which is currently in use

I don’t think this is important for this level of the proposal.

This output is shown in the music extractor when you run it. Have you tried this?

Currently the only place where we download the client from is the acousticbrainz.org download page. It would be nice to have this uploaded to pypi so that people can also install it with pip. We also have a proposal for running this in docker which would be another way of distributing updates. I think that it would be best to release a full new version of the extractor python package every time we have another version of the extractor binary.

sweta05 · March 30, 2020, 6:04pm

Thank you so much @alastairp for your feedback even in the last moment. It means a lot. I will try my best to make it much better.
I realise that I have concentrated more on the implementation part than the outline and features, I’ll make the changes and give a clear outline of the necessary steps.

I’ll change this point and write the outline clearly.

I’ll change the flow of the proposal to order by functionality.

Here I mean to say that before running the extractor itself, we should check with the server if the client is using the latest version or not.

Yes I agree, what I meant to say is that we will alert on the screen that they are seeing outdated data(to make sure the clients know). However, we will still show them the results.

makes sense, I’ll make these changes.

I was thinking we would add the column only the first time. Actually, I think the better option is to change the schema and create a new table with this field (since the client has to download the new software anyway).

Initially I hadn’t thought about this, so I gave it the least priority, but I realise how important it is to compress and store data efficiently now that you have said the new data will be much larger. I would love to think more about this and give it higher priority if time permits.
I also forgot the point about storage and providing data downloads for different versions, should this also be considered?

This looks good, but I am afraid I haven’t given the first point much thought and priority. Even I felt only a few things will be possible to be covered in this project and gave the rest higher priority.
Also, in point 7 do you mean to say that we should be able to dynamically generate and show the new features or can it be hardcoded like the current version features?

I am afraid I hadn’t noticed this and wasted a lot of time figuring out how to find it. I’ll try checking.

That’s great, it is surely much better to upload it to PyPI.

Summarizing all the feedback I got:
I will first try to change the formatting and flow of the proposal and give out the outline of the process and group according to functionality.
Second, I will prioritize the task as you have mentioned but I’ll leave the first point at the last because of time constraints. However, I will try my best to find better solutions for data compression and serialization and develop a complete project.
Lastly, I will try my best to implement all the proposed changes and submit the revised proposal tomorrow. I hope you will verify it and make sure it goes on the GSOC portal on time.
Finally once again thank you very much for all the valuable feedback and suggestions even in the last moment I will keep up the expectations and submit my proposal.

sweta05 · March 31, 2020, 3:29pm