GSoC 2018: More detailed integration of AcousticBrainz with MusicBrainz

little_rsh · March 12, 2018, 8:24am

A much firmer integration with the MusicBrainz database, allowing users to better understand their data.

Personal Information

Name : Rashi Sah
IRC nick: rsh7
E-mail: rashi.747@gmail.com
Github: https://github.com/rsh7
Blog: rashisah
Time Zone: UTC+0530

Proposal

Project Overview

A tight integration with the MusicBrainz database would be very helpful in AcousticBrainz because it would let us better understand our data and use MusicBrainz data in many places in AB: giving users real-time feedback about artist filters while they create datasets, using MBID redirects to determine duplicate recordings, adding recordings based on common features such as the same artist, tag or release, and showing visualizations and statistics for the MusicBrainz data present in AcousticBrainz.

There are two ways to implement direct database access: a direct connection between the MB and AB databases, or copying the metadata from the MB database into the AB database. I would implement both methods and then test them against a set of criteria to decide which one works better for using MB data in AB features.

Recordings in AcousticBrainz are stored by their MBID from MusicBrainz. When many users access recording information on the AcousticBrainz site, each page makes many requests to the web service, which takes a lot of time and increases the load on the server. With direct access to the MusicBrainz database we could query the database directly (one query per page), which would significantly increase the speed at which AcousticBrainz pages load.

How do we access the MusicBrainz database?

There are two ways of integrating with the MusicBrainz database (as per the ideas page). The first is a direct connection to the MusicBrainz database. The second is copying the relevant information from the MusicBrainz database and saving it in a separate schema in the AcousticBrainz database.

Directly connecting to the MusicBrainz database means running a separate container from the MusicBrainz-server docker image in AB, with the MB database dumps imported into it.

Copying the data into a separate schema in the AcousticBrainz database would make data access much faster than both mirroring MusicBrainz directly and the web service used at present, because we could join the tables and retrieve the data with a single query in the AcousticBrainz database.

Here, I am going to discuss pros and cons of both the methods of MB database access:

Direct connection with MusicBrainz Database

Pros

There are a lot of requests per page to the XML web service, and thus much more load on the service. Connecting directly to the MusicBrainz database would result in a significant increase in speed.
With a direct connection, we would not need periodic updates because the database is kept synchronized.

Cons

Because the two databases are separate, we cannot join their tables directly, so we have to run one query against each database.
Running separate queries against two databases can be slower than querying a replicated copy.

Copy relevant information from MusicBrainz database into a separate schema in AcousticBrainz database

Pros

There are a lot of requests per page to the XML web service, and thus much more load on the service. Importing the MusicBrainz data into a separate schema in the AcousticBrainz database, and querying it directly from the AB database, would result in a large increase in speed.
If we import the MB data into the AB database, we can fetch our data with one query that joins the tables. Example: if we want low-level data for a recording of a release, we can join the lowlevel_json table of the AB database with the recording, artist_credit and release tables of the MB database.

Cons

The problem here is that we need to update the local schema of MusicBrainz data in AcousticBrainz whenever MB is updated.
Because we copy only the relevant information from the MB database, whenever our needs or use cases change we have to consider copying the data again.

Implementation details

Setup for the direct connection with MusicBrainz database

The new infrastructure allows us to easily read data directly from the MusicBrainz database. To run AcousticBrainz in development, we would connect directly to a MusicBrainz database provided by the existing MusicBrainz-server docker image. This service has an option to download the database dumps and import the data into the local AcousticBrainz database.

I have started work on adding a new service, a separate container in the AB docker-compose files for development, which downloads the MB database dumps and provides the direct connection. I have opened a PR for this. To run AcousticBrainz in production, we would connect directly to the main MusicBrainz database without using an additional docker image.

After building the MusicBrainz database development environment in AB, the next step is fetching data from the database. I have used mbdata.models, which provides SQLAlchemy models mapped to the MusicBrainz database tables, to write the queries. In the same PR I have also worked on the Recording entity, which fetches recording information using mbdata.

Quite similar work has been done in CritiqueBrainz. As suggested by my mentor, I would move the existing code in CB to our cross-project Python library, brainzutils, and use it in AB so that other MetaBrainz projects can easily use it too.

Structure

└── acousticbrainz-server
    └── webserver
        └── external
            └── musicbrainz_db_direct
                ├── __init__.py
                ├── exceptions.py
                ├── artist.py
                ├── includes.py
                ├── recording.py
                ├── release.py
                ├── recording_gid_redirect.py
                ├── track.py
                ├── entities.py
                ├── tests       # testing functions here
                │   ├── __init__.py
                │   ├── artist_test.py
                │   ├── recording_test.py
                │   ├── release_test.py
                │   ├── recording_gid_redirect_test.py
                │   ├── track_test.py
                └── utils.py

I am describing the functions and details of each file below.

The musicbrainz_db_direct module would initially contain the following functions:

__init__.py
    This file contains functions like init_db engine to use SQLAlchemy for querying the MusicBrainz database. It also contains the musicbrainz_session() function.
    -  session()
    -  initialize_db()

musicbrainz_db_direct.exceptions
    -  class MBDatabaseException(Exception)
    -  class DataNotFoundException(MBDatabaseException)

musicbrainz_db_direct.artist
    -  fetch artist with artist_credit and artist_credit_name
    -  get_artist_by_id(mbid)
    -  to_dict_artists()

musicbrainz_db_direct.recording
    -  get information related to a recording, such as the recording gid and recording name, call other functions to get details for the recording page, and return the dictionary
    -  get_recording()
    -  to_dict_recordings()

musicbrainz_db_direct.includes
    -  contains the list of entity types related to any entity and checks if includes specified for an entity are valid includes
    -  check_includes()

musicbrainz_db_direct.release
    -  get release gid and name by id
    -  get_release()
    -  to_dict_release()

musicbrainz_db_direct.recording_gid_redirect
    -  checks if recording is present in the redirect table
    -  get_redirect_model_gid()

musicbrainz_db_direct.track
    -  get track_number, track_position and track_length
    -  get_track_by_id()

musicbrainz_db_direct.utils
    -  return unknown entities (if not present in MB)
    -  return entities that have multiple MBIDs with their actual mbid for mbid redirect

The main type of MusicBrainz data that we show in AcousticBrainz is about recordings, but AB also uses MB data in other places where direct access could be used. For artist filtering (in the dataset training part of AB), we currently get artist information from the low-level table (data supplied by the submission client); ideally we should fetch it from MusicBrainz data instead. In the dataset editor, the client sends an AJAX query to get metadata for each recording in a class, and when a recording is added to a class, MB data is used to fetch its recording information. AB presently uses python-musicbrainzngs to get the recording information for the web page. I would replace this with the direct access method, using mbdata.models to retrieve the recording information.

Implementation details to copy relevant information from MusicBrainz database

The MB database is too big to store a full copy on the AcousticBrainz server, so we would copy only the information we need. We should at least include the recording title, recording gid, artist name, artist gid, artist count, release year, release gid, release name, track length, track position and track number. The MusicBrainz database contains around 18 million recordings, but the local schema in AcousticBrainz would hold metadata for only around 3 million, because AB has around 3 million recordings in its database.

Ideal tables to import in our local schema from MusicBrainz database

To download the replication packets and apply them to our tables, each table we copy has to include all of its columns. Because the tables we need define foreign keys to other tables, we would also have to import those other tables even though we do not need them directly. I would like to discuss this in more depth with my mentor and other community members during the community bonding period.
A proposed initial list of required tables, followed by the additional tables we should import, is as follows:

recording
artist_credit
artist_credit_name
artist
release
track
recording_gid_redirect

In our required tables above, we have foreign keys defined for the following list of tables:

release_group
medium
release_status
language
release_packaging
script
area
gender
artist_type

I would connect to the MB database through the container of the docker image that I have added to AB in this PR. Because the tables are not in the same database, we cannot use a direct SQL COPY command to copy the metadata. Instead, the local schema would be populated using SQLAlchemy: one engine connects to the MusicBrainz database (through the docker container) to query the data, and a second engine connects to the AcousticBrainz database to store the results. We first build a new schema for the MB tables in the AB database and then execute the queries that insert the metadata into those tables. We would transfer the data in batches rather than copying an entire table at once, which reduces the load on the server.

I would add a function in manage.py for importing the metadata into the AB database.
The metadata can then be imported using: python manage.py import_musicbrainz_data. This command would be added to the Dockerfile.

A pseudo code to import the metadata

connect to the docker container of musicbrainz-server present in AB
engine to the MB database to get the metadata from
source_session = connect to musicbrainz database
engine to AB database to save the tables in
dest_session = connect to acousticbrainz database
create a new schema and new tables in AB database
read recording ids from source_session 
for each source session recording id:   # getting a bunch of data at a time
     get_data()
     execute the query to write the data to dest_session
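
The batched copy in the pseudo code above could look roughly like the sketch below. It is not the proposed implementation: in-memory `sqlite3` databases stand in for the two PostgreSQL databases (the proposal would use SQLAlchemy engines), and the one-table schema and the `copy_in_batches` helper are assumptions made for illustration.

```python
import sqlite3


def copy_in_batches(source, dest, batch_size=2):
    """Copy recording rows from source to dest, batch_size rows at a time."""
    last_id, copied = 0, 0
    while True:
        # Keyset pagination: resume after the last id already copied.
        rows = source.execute(
            "SELECT id, gid, name FROM recording WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return copied
        dest.executemany(
            "INSERT INTO recording (id, gid, name) VALUES (?, ?, ?)", rows
        )
        dest.commit()  # one transaction per batch keeps each write short
        last_id = rows[-1][0]
        copied += len(rows)


# In-memory stand-ins for the MB (source) and AB (destination) databases.
source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
for conn in (source, dest):
    conn.execute(
        "CREATE TABLE recording (id INTEGER PRIMARY KEY, gid TEXT, name TEXT)"
    )
source.executemany(
    "INSERT INTO recording VALUES (?, ?, ?)",
    [(i, "gid-%d" % i, "Recording %d" % i) for i in range(1, 6)],
)
source.commit()

copied = copy_in_batches(source, dest, batch_size=2)
```

Resuming from the last copied id, rather than using OFFSET, keeps every batch query equally cheap however far the copy has progressed.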

After getting the list of recording mbids to add, here is a pseudo code to perform the operation on a recording mbid.

Method to copy a recording:

check if recording id is present in recording_gid_redirect table and copy if it is present
copy recording table row
copy recording artist credit row
copy artist credit name row for artist
copy the list of artists performing on the recording
copy medium row for track
copy release row for each medium
copy release group for each release
copy release status for each release
copy artist credit rows for release and release groups
copy track table row
copy language table row
copy release_packaging table row
copy script table row
copy area table row
copy gender table row
copy artist_type table row

Once we have imported the metadata into a local schema, we would be dealing with 2 steps:

Updating the metadata in AcousticBrainz whenever the data in MusicBrainz server is updated
Fetching new metadata when a new recording is added to AcousticBrainz

To update the MB data we hold in AB, we can download the replication packets that MB provides, which allow a direct mapping from one database to the other. Replication packets are the mechanism that keeps a copy of the MusicBrainz database up to date. If we keep the structure of the tables exactly the same, we can look in each replication packet, check whether an item updated in MusicBrainz is present in the local schema, and if so copy the new data. A last-updated column can then hold the timestamp of the replication packet or its sequence number. We could also save space by using a single table to store this for all tables, like the replication_control table in the MB database. We would compare the timestamps of the original MB database and the local schema, and whenever the original's timestamp is greater than the one in our local schema, the local schema would be updated.

After modification of the LoadReplicationChanges script for:

skipping UPDATEs and DELETEs for rows not present in AB database.
skipping INSERTs for recordings not in AB tables.
skipping replication tables not copied in AB database.

we can use the replication script for making the changes in the tables in the local schema of MB database.
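
The three skip rules above amount to a filter applied to each change in a packet. The sketch below shows one possible shape for that filter; the dictionary layout of a change, the one-letter operation codes and the `should_apply` name are assumptions for illustration, not the actual format used by LoadReplicationChanges.

```python
def should_apply(change, copied_tables, local_keys, ab_recording_ids):
    """Decide whether one replication change belongs in the local schema.

    change: dict with "table", "op" ("i", "u" or "d") and "key" (row id).
    local_keys: dict mapping table name -> set of row ids already copied.
    ab_recording_ids: ids of recordings that AcousticBrainz has data for.
    """
    table, op, key = change["table"], change["op"], change["key"]
    if table not in copied_tables:
        return False  # table is not part of the local schema
    if op in ("u", "d"):
        # Only touch rows that we actually hold locally.
        return key in local_keys.get(table, set())
    if op == "i" and table == "recording":
        # Only insert recordings that AcousticBrainz knows about.
        return key in ab_recording_ids
    return True  # other inserts pass through in this sketch
```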

To fetch new metadata whenever a new recording is added to the AcousticBrainz database, I would write a script that copies and saves only the subset of information we require into our local schema of the MusicBrainz database in AcousticBrainz. When fetching data, we would first try AB's subset of MB. If an MBID exists in AB but is not present in our metadata tables (i.e. it is a new addition to the AcousticBrainz database), we would get it over the direct connection to the MB database and save it in the subset (the local schema of the MB database).
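
A minimal sketch of this local-first lookup, assuming a plain dictionary in place of the local schema and a caller-supplied function in place of the direct MB query (both names are hypothetical):

```python
def get_recording_metadata(mbid, local_store, fetch_from_musicbrainz):
    """Return metadata for mbid, filling the local subset on a miss."""
    metadata = local_store.get(mbid)
    if metadata is None:
        # New recording in AB: fall back to the direct MB connection once,
        # then keep the result so later lookups stay local.
        metadata = fetch_from_musicbrainz(mbid)
        local_store[mbid] = metadata
    return metadata


calls = []


def fake_fetch(mbid):
    """Stand-in for the direct MusicBrainz query; records each call."""
    calls.append(mbid)
    return {"gid": mbid, "name": "Example recording"}


store = {}
first = get_recording_metadata("mbid-1", store, fake_fetch)
second = get_recording_metadata("mbid-1", store, fake_fetch)
```

The second lookup is served from `store`, so the direct connection is only used once per new recording.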

A proposed structure for directories and files we would have for this method:

structure_for_copy_method.md (https://gist.github.com/rsh7/8206f3454db3e7f26aab8f86f5374d6c):

```
└── acousticbrainz-server
    └── db
        └── musicbrainz_external
            ├── __init__.py
            ├── artist.py
            ├── exceptions.py
            ├── entities.py
            ├── includes.py
            ├── recording.py
```

The listing above is truncated; the full structure is in the gist.

The files in this structure would contain functions similar to those shown above for the previous method, but we would use raw SQL to query the database instead of mbdata, because we don’t have models defined for the AB database like we have for the MB database.

Evaluation of both methods

Performing tests in order to decide the better method

What I propose is to find a reasonable compromise between the two methods so that we get the best of both. We can keep a direct connection to the MusicBrainz database while copying only a small subset of the MusicBrainz database into the AcousticBrainz schema, and use the direct connection where a recording has been newly added to the AB database. The most important part of the project is to test both methods at the scale of AcousticBrainz to see which works best for integrating the AB database with MB. After developing both methods, I would implement some test queries for the direct connection and for the imported metadata to check that the queries work.

I would test queries that reflect how we will use the data in different integrations and tasks. I would write testing scripts for these tasks that measure the speed of both database access methods, so that we can determine the better method from the time differences. I would test the information on the recording page, unique MBIDs based on duplicates, and artist filtering, because these give us feedback about real tasks that we have to perform in AcousticBrainz.

For information on the recording page, I would write test queries that get the recording information over the direct connection and from the local copy in the AB database, and check which method loads the information faster. I would also load the page in the browser with both methods and compare them with the loading time of the present structure.
An example test query for the direct connection using mbdata.models:

def test_recording_by_id():
    get 1000 recordings
    for each recording:
        get recording info
        calculate time taken to load the info in the web page
    measure the average time taken by both methods
    choose the method that gives us better results

def get_recording(mbid):
    initialize musicbrainz session
    query = session.query(models.Recording)
    fetch artists with artist-credit 
    query = query.options(joinedload("artist_credit")).\
                options(joinedload("artist_credit.artists")).\
                options(joinedload("artist_credit.artists.artist"))
    similarly fetch media with tracks info
    fetch release info
    convert entity information into dictionaries
    return rec dict
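
The timing part of the test above could be sketched as a small harness like the one below. The two lambdas are placeholders for the real access methods, and `average_seconds` and `compare_methods` are hypothetical helper names.

```python
import time


def average_seconds(fetch, mbids):
    """Average wall-clock time of fetch(mbid) over the given MBIDs."""
    start = time.perf_counter()
    for mbid in mbids:
        fetch(mbid)
    return (time.perf_counter() - start) / len(mbids)


def compare_methods(methods, mbids):
    """Map each method name to its average fetch time; lower is faster."""
    return {name: average_seconds(fetch, mbids) for name, fetch in methods.items()}


# Placeholder fetchers standing in for the two access methods.
sample = ["mbid-%d" % i for i in range(1000)]
timings = compare_methods(
    {
        "direct": lambda mbid: {"gid": mbid},
        "local_copy": lambda mbid: {"gid": mbid},
    },
    sample,
)
fastest = min(timings, key=timings.get)
```

Running both methods over the same sample of MBIDs keeps the comparison fair; repeating the run several times would smooth out caching effects.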

For unique MBIDs based on duplicates and for artist filtering, I would test a sample of about 100 recordings with both methods. I will write a test that uses both database access methods to get recording_gid_redirect information and checks whether we would add the items to the class or not. For artist filtering, we would fetch the artist information for each recording from the database and then check whether the artist has already been added, for both database access methods.

An example query for mbid_redirect information:

get recording mbids
get recording model for recording type using models.Recording
get recording_model.gid for mbids
get redirect model using models.RecordingGIDRedirect
calculate remaining ids from mbids which are not recording gids
if remaining ids, calculate again excluding the redirect gids
get the recording with their original mbids and return 
now we are able to detect the duplicate recordings

Test on both methods and calculate the speeds of getting the data and decide the suitable method.
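
The redirect lookup described in the pseudo code can be illustrated without a database. In this sketch a set and a dictionary stand in for the recording and recording_gid_redirect tables, and `resolve_mbids` and `find_duplicates` are hypothetical names.

```python
from collections import defaultdict


def resolve_mbids(mbids, recording_gids, redirects):
    """Map each requested MBID to its current recording MBID.

    recording_gids: set of gids in the recording table.
    redirects: dict of old gid -> current gid (recording_gid_redirect).
    MBIDs found in neither place are left out of the result.
    """
    resolved = {}
    for mbid in mbids:
        if mbid in recording_gids:
            resolved[mbid] = mbid
        elif mbid in redirects:
            resolved[mbid] = redirects[mbid]
    return resolved


def find_duplicates(resolved):
    """Group requested MBIDs that point at the same recording."""
    groups = defaultdict(list)
    for requested, current in resolved.items():
        groups[current].append(requested)
    return {current: ids for current, ids in groups.items() if len(ids) > 1}


resolved = resolve_mbids(
    ["aaa", "old-aaa", "bbb", "missing"],
    recording_gids={"aaa", "bbb"},
    redirects={"old-aaa": "aaa"},
)
duplicates = find_duplicates(resolved)
```

Here "aaa" and "old-aaa" resolve to the same recording, so they are reported as duplicates, while "missing" is dropped because it appears in neither table.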

Testing metadata access on the dataset editor
We can make loading a class in the dataset editor faster by loading the metadata for its recordings in bulk rather than one recording at a time.
We will write a test using both database access methods to load metadata for classes of 10, 20, 50, 100, 500, and 1000 items. Because we want the dataset editor to be responsive, we will set a maximum query time of 1 second to populate this data and respond to the API request. If we are unable to respond with the metadata of 500 items within 1 second, we will implement a bulk metadata API endpoint that the client queries with up to 100 items at once.
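
If the bulk endpoint turns out to be needed, the client side would split a class into requests of at most 100 items. A minimal sketch, where `fetch_bulk` is a hypothetical stand-in for one call to that endpoint:

```python
def chunk(mbids, size=100):
    """Split a list of MBIDs into consecutive lists of at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [mbids[i:i + size] for i in range(0, len(mbids), size)]


def load_class_metadata(mbids, fetch_bulk, size=100):
    """Collect metadata for a whole class with one bulk call per chunk."""
    metadata = {}
    for part in chunk(mbids, size):
        metadata.update(fetch_bulk(part))
    return metadata


class_mbids = ["mbid-%d" % i for i in range(250)]
requests = chunk(class_mbids)
metadata = load_class_metadata(
    class_mbids, lambda part: {mbid: {"gid": mbid} for mbid in part}
)
```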

Selection criteria

After I have the test queries working, I would perform the experiments on the complete AcousticBrainz database in collaboration with MetaBrainz team in order to decide the best method that fits our needs. We would decide the best method for us on the basis of many parameters:

Speed
Storage
Memory usage
Webpage rendering time, i.e. measuring the average time it presently takes to render a web page on the AcousticBrainz website, measuring the time taken by both methods, and comparing the three.
Response time: for example, with the direct connection, a problem in the MB database could leave the process waiting indefinitely or cause starvation.

Using MusicBrainz data in AcousticBrainz: How should we use our data in order to get a better understanding?

After implementing and testing the database access methods, we would fetch the data and use it in different places in AB. Presently we get recording information using python-musicbrainzngs. Example code to fetch recording information for the AB recording page using mbdata instead:

recording.py (https://gist.github.com/rsh7/60d493ef08b60596a06b16cbec210e73):

from brainzutils import cache
from mbdata import models
from mbdata import utils
from sqlalchemy.orm import joinedload
from webserver.external.musicbrainz_database import musicbrainz_session
import exceptions as mb_exceptions

CACHE_TIMEOUT = 86400  # 1 day

def get_recording_by_id(mbid):

The file is truncated here; the rest is in the gist.

This would then be the time to integrate the AB database with MB, so I would help perform the integrations and use the data in many places in AB.

Using MBID redirect information to determine when two distinct MBIDs refer to the same recording

An entity can have more than one MBID. When an entity is merged into another, its MBID is redirected to the other entity.
To determine duplicate recordings, I would implement a function that resolves a redirected MBID to its original entity. It would query the recording and recording_gid_redirect tables of the MusicBrainz database and return the entities with their MBIDs, which would then be used to prevent duplicate recordings from being added to a class. MBID redirect information would allow us to identify a duplicate recording regardless of which of its MBIDs is used, both in the API and on the AcousticBrainz website.

The function would return a dictionary of entities keyed by their MBIDs. I would maintain a set of the original MBIDs of the recordings that have already been added, so that whenever a new recording is added we compare its original MBID with those in the set. With the help of MBID redirects, a user would not be able to add duplicate recordings to a class, because their MBIDs would redirect to the same original entity.

We can implement this easily over the direct connection to the MB database using mbdata.models, much like what we are doing in CritiqueBrainz here. However, only after testing both database access methods would we be able to decide which one works well for this integration.

Using Artist information in AcousticBrainz

Artist filtering means that only one recording per artist should be present in any dataset class, so that during evaluation we have unique artists in each class and can present the user with a cutoff of class size during training. Artist filtering cannot be used when creating a challenge for classifying an artist. With fast MB database access, we can give users real-time feedback about artist filters while they create datasets, by fetching the MBID of a recording from the AB database and joining the artist and recording tables.

We may implement the process this way: whenever a user adds a recording, we fetch the artist information from the artist table and save the artist name or MBID in a set. When the user then attempts to add another recording to that class, we check whether the recording's artist is already in the set. If it is, we do not allow the user to add that recording to the class.

A pseudo code description


create an empty set artist_mbids 
for each addition of recording:
    get a recording by user
    fetch the artist mbid of the recording
    check if it is present in the artist_mbids
    if present:
        do not allow the user to add that recording in the class
    else:
        add the recording to that class
        store the artist mbid of that recording into the set
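
The pseudo code above translates fairly directly into Python. The sketch below assumes a single artist MBID per recording, which is a simplification since an artist credit can name several artists, and `DatasetClass` is a hypothetical name.

```python
class DatasetClass:
    """Dataset class that accepts at most one recording per artist."""

    def __init__(self):
        self.recordings = []
        self.artist_mbids = set()

    def add_recording(self, recording_mbid, artist_mbid):
        """Add the recording and return True, or return False if the artist is taken."""
        if artist_mbid in self.artist_mbids:
            return False
        self.recordings.append(recording_mbid)
        self.artist_mbids.add(artist_mbid)
        return True


dataset_class = DatasetClass()
results = [
    dataset_class.add_recording("rec-1", "artist-a"),
    dataset_class.add_recording("rec-2", "artist-b"),
    dataset_class.add_recording("rec-3", "artist-a"),  # same artist as rec-1
]
```

The boolean return value is what the dataset editor could use to show the real-time feedback mentioned above.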

Possible extensions

If I finish my GSoC project early, I plan to spend the remaining time using MusicBrainz data to show statistics.

We can use MB data widely in AB. I would like to add statistics and visualizations of the data using charts and graphs. We could add a new view, perhaps acousticbrainz.org/statistics, to show sitewide graphs. We could store stats in jsonb format in a statistics table in the AB database. Calculated stats must be saved in the database per entity, so that when the page is opened again the calculation is not repeated for the same data and the stats are fetched from the database instead. We would recalculate the statistics on a weekly or fortnightly basis.

I would add charts based on following:

Top X Artists by recordings
Most frequently submitted recordings
Top X submitted recordings by an Artist
Most frequently submitted releases over a year
Most / least submitted MusicBrainz tags

A graph for most commonly submitted recordings and top 10 artists would look like:

I have used Plotly for demo graphs. We could use plotly.js or Bokeh for data visualizations, depending on the opinions of the community.

Timeline

A phase by phase timeline of the work to be done is summarized as follows:

Community Bonding (April 23 - May 14)
I would spend this time formalizing exactly what I need to code, and start setting up the connection with the docker container that connects to the MB database for the AB development environment. I would also discuss design decisions with my mentor, including the ideal tables to import from the MusicBrainz database into the local schema in AcousticBrainz, to make sure that no bad decisions are made early in the process.
Phase 1 (May 14 - June 11)
I aim to complete the method of copying metadata from the MB database into a separate schema in the AB database: use the docker image to connect to the MB database, load the metadata, and import it into a local schema in AB through a new function in manage.py. I would also write a script that fetches data from the MusicBrainz database whenever a new MBID is added to AcousticBrainz.
Phase 2 (June 12 - July 9)
In this phase, I first aim to work on updating the local schema of the MB database whenever the MusicBrainz server is updated, by downloading the replication packets. I would then test both methods at the scale of AcousticBrainz to decide which works best for us. At the end of this phase, I will update the AcousticBrainz build documentation.
Phase 3 (July 10 - August 6)
This phase involves work on two integrations. I would first use MBID redirect information to detect duplicate recordings, and then work on allowing users to add only one recording per artist to each class, with real-time feedback, for artist filtering. If I finish early, I would help with using MB data to show statistics in AB.
After Summer of Code
I would continue adding functionality to AcousticBrainz, such as allowing users to add recordings to a class in the dataset editor based on criteria such as a given artist, a given release, the same release year or a given MusicBrainz tag, and I would work on other MetaBrainz projects as well. Working on the new machine learning infrastructure would also be very interesting.

Here is a week by week timeline of my work for summer:

GSoC 1st week (14th May to 20th May): Begin with the dockerfile setup. Connect to the container made for the direct connection, which connects to the MusicBrainz database. Start by loading and copying a small amount of data first.
GSoC 2nd and 3rd week (21st May to 3rd June): Write the script to import the data in batches from the MB database into a schema in the AB database.
GSoC 4th week (4th June to 10th June): Write the script to fetch new data into the local schema whenever a new MBID is added to the AcousticBrainz database.
GSoC 5th week (11th June to 17th June): Start writing the scripts for the update and insert changes from replication packets.
GSoC 6th week (18th June to 24th June): Work on downloading replication packets, set this up using cron, and update the local schema whenever there is an update in the MusicBrainz server.
GSoC 7th week (25th June to 1st July): After developing both methods, test sample queries for both and check that the queries work properly.
GSoC 8th week (2nd July to 8th July): Test both methods at the scale of AcousticBrainz on several parameters and decide which method works best for us.
GSoC 9th and 10th week (9th July to 22nd July): Use MBID redirect information to prevent duplicate recordings from being added to a class.
GSoC 11th week (23rd July to 29th July): Start working on artist information to give users real-time feedback about artist filters.
GSoC 12th week (30th July to 5th August): Complete any integration work left over from previous weeks and fix bugs. Work on documentation for the AcousticBrainz website.
GSoC 13th week (6th August to 14th August): Complete any pending work, prepare the final submission and make sure that everything is working fine.

Detailed information about myself

about_me.md (https://gist.github.com/rsh7/8b7308b9d4d4b298d677185b38ddbbc6):

# Detailed Information about myself:

I am a Computer Science undergraduate student at National Institute of Technology, Hamirpur. I've been helping out with development work in AcousticBrainz since last December. A list of my commits and pull requests to `acousticbrainz-server` and `acousticbrainz-client` can be found [here](https://github.com/metabrainz/acousticbrainz-server/commits?author=rsh7), [here](https://github.com/metabrainz/acousticbrainz-server/pulls?utf8=%E2%9C%93&q=is%3Apr+author%3Arsh7+) and [here](https://github.com/MTG/acousticbrainz-client/pulls?utf8=%E2%9C%93&q=is%3Apr+author%3Arsh7+). The most notable pull requests I've worked on are the MB database [image setup](https://github.com/metabrainz/acousticbrainz-server/pull/262) in AB and a feature to select [SVM parameters preferences](https://github.com/metabrainz/acousticbrainz-server/pull/257) for dataset evaluation.

**Question: Tell us about the computer(s) you have available for working on your SoC project!**

Answer: I have a DELL laptop with an Intel i3 processor and 6 GB RAM, running Ubuntu 16.04.

**Question: When did you first start programming?**

The file is truncated here; the rest is in the gist.

little_rsh · March 12, 2018, 8:26am

This is an initial draft of my GSoC proposal. Feedback and suggestions would be greatly appreciated.

alastairp · March 14, 2018, 5:02pm

Thanks for the proposal! The proposal has a good level of detail, and you’ve already made a start on some early parts of the project, which is great to see. I’m going to quote some specific things in the proposal that I think could be changed or made clearer, and then I’ll give some overall notes.

For me, this initial proposal section only says what we can do with the change, it doesn’t say why we want to do it. These proposals should come with a strong motivation to do the task, otherwise we’re just writing code because we want to (that is, why do we want to have tighter integration with the database when we can in theory do exactly the same with the web api?)

These shouldn’t be images, they’re too difficult to talk about and copy/paste from.

Direct connection pro: lots of requests per page with the xml webservice

This is a pro of both approaches, and is the reason why we want to stop using the webservice

You can make it explicit here that you are referring to development. When we run AcousticBrainz in production we can connect directly to the main MusicBrainz database without having to use an additional docker image.

We had a very similar SoC project last year to connect directly to a MusicBrainz database server for CritiqueBrainz (GSoC 2017: Directly Access the MusicBrainz Database in CritiqueBrainz, critiquebrainz/critiquebrainz/frontend/external/musicbrainz_db at master · metabrainz/critiquebrainz · GitHub). We shouldn’t duplicate any work here. It would be a good idea to, as part of this project, move the existing code in CB to our cross-project python library, brainzutils, and use the same code in both CB and AB.

I would make the direct database and local copy options different sections. In each of these sections it would be good to include a description of the method (like you already have), and then also add a description of how you will do this (e.g. you might want to talk about new modules (acousticbrainz.external?) or class names, or workflow - here you should show that you have thought a little bit about how you might write this system. It doesn’t matter if the final version doesn’t look like this, but we’re interested in seeing that you can think about the process in a “big picture” manner)

This also needs more detail. I’m not sure what you mean by dataframe here - are you considering using pandas? I don’t think that this is a good fit. You’ve already had some experience with the sqlalchemy bindings and seem to understand them well. Be clear about what tools you want to use, perhaps you just need to change some terminology.

Be explicit here. I want to know what the command should be called, and a simple pseudocode description of how it works, e.g.

get list of all newly added recording mbids
for each recording mbid:
    look up recording in database, using mbid redirect table if needed
    get list of artists performing on recording
    for each artist...
        ....

Some time ago I made a proposal of a schema for this metadata: WIP: Metadata tables · metabrainz/acousticbrainz-server@9d2b53d · GitHub
The difference here is that I also include release group, and don’t include track or medium. We should continue discussing this to see what the ideal tables to include are. I recommend that you talk with @reosarevok and @murdos to see if they have suggestions, they always bring up interesting points when asked a question like this.

This type of code is not necessary for the proposal

This is a very important part of this project. You should have an entire section in the proposal talking about the evaluation. We want to know at least:

How you will do the evaluation
A suggestion on how you will decide what is the best system

You asked in a previous forum post (and I didn’t answer, sorry), and this is very important. You correctly realised that we need a lot of data to replicate the size of the current acousticbrainz server. I suggest that you develop the two methods of getting metadata, and implement some test queries. When you have the test queries working, MetaBrainz can provide you with a virtual machine that contains a complete copy of the AcousticBrainz database, and we can run the experiments there.

I can tell you now that you won’t be able to finish both the integration of two methods of getting data, testing which one to use, and all of these implementations of the data. I recommend that you choose only two of the four proposals that you have listed here.
My favourite options are using artist information for doing Artist Filtering in datasets, and for determining redirect MBIDs to see what submissions are the same. However, you’re welcome to choose whichever items you prefer. Note that things like the statistics may require information from the duplicate recordings task, and so you would have to do that one first.

In general I like the overall goal of this proposal, but it needs to be improved to show two main things:

You need to explain not only what you want to do, but why. Start each section with a stand-alone sentence explaining the goal of this section, and then continue with the text that you already have explaining how you think you will do it
Include more detail in the “how you will do it” sections. Here you should show us that you have already looked at the acousticbrainz server and understand the data enough to start making proposals using module names, function names, or pseudocode. You should show that you’ve potentially found issues in the data or process (don’t worry if you don’t know how to solve these, we can help with that).

Thanks again for the proposal, I look forward to reading a revised version!

Freso · March 15, 2018, 7:50am

This information is stored in a table in the MusicBrainz database. If the database is already going to be queried directly or fully or partially mirrored, why not just also query/mirror the information in this table? (There might be a perfectly good reason, I just don’t know it.)

little_rsh · March 15, 2018, 9:42am

The idea is to use the MB database tables to implement a function that redirects an MBID to its original entity. For recordings, for example, I would query the recording and recording_gid_redirect tables of the MB database. The function would return the entities with their MBIDs, which would then be used to prevent duplicate recordings from being added to a class. I will also add this detail to the MBID redirect section of the proposal. Thank you!
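A minimal sketch of the redirect lookup described above, offered only as an illustration. It assumes a heavily simplified version of the `recording` and `recording_gid_redirect` tables (only the columns needed here), uses an in-memory SQLite database with toy MBIDs in place of the real MusicBrainz PostgreSQL database, and the function names `resolve_mbids` and `find_duplicates` are hypothetical, not existing AcousticBrainz code.

```python
import sqlite3


def resolve_mbids(conn, mbids):
    """Map each submitted MBID to its current recording MBID (None if unknown)."""
    resolved = {}
    for mbid in mbids:
        # An MBID that is still current appears directly in `recording`.
        row = conn.execute(
            "SELECT gid FROM recording WHERE gid = ?", (mbid,)
        ).fetchone()
        if row is None:
            # Otherwise follow the redirect left behind when recordings were merged.
            row = conn.execute(
                "SELECT r.gid FROM recording_gid_redirect AS rd "
                "JOIN recording AS r ON r.id = rd.new_id WHERE rd.gid = ?",
                (mbid,),
            ).fetchone()
        resolved[mbid] = row[0] if row else None
    return resolved


def find_duplicates(resolved):
    """Group submitted MBIDs that point at the same current recording."""
    groups = {}
    for submitted, canonical in resolved.items():
        if canonical is not None:
            groups.setdefault(canonical, []).append(submitted)
    return {c: sorted(m) for c, m in groups.items() if len(m) > 1}


# Toy data standing in for the MusicBrainz tables (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE recording (id INTEGER PRIMARY KEY, gid TEXT UNIQUE, name TEXT);
CREATE TABLE recording_gid_redirect (
    gid TEXT PRIMARY KEY,
    new_id INTEGER REFERENCES recording(id)
);
INSERT INTO recording VALUES (1, 'aaaa', 'Song A'), (2, 'bbbb', 'Song B');
INSERT INTO recording_gid_redirect VALUES ('old-a', 1);
""")

resolved = resolve_mbids(conn, ["aaaa", "old-a", "bbbb", "zzzz"])
duplicates = find_duplicates(resolved)
```

With this toy data, `old-a` and `aaaa` resolve to the same recording, so a dataset editor could refuse to add both to one class. A real implementation would likely batch the lookups into one query per page rather than two queries per MBID.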

little_rsh · March 17, 2018, 11:23am

Thanks for the review, @alastairp

Thank you!

I have updated the proposal accordingly.

Changed the pros and cons from image format to plain text.

I have updated the proposal accordingly.

I have a query regarding the suitable tables we should import:

Based on what we need in AcousticBrainz, I have added to the proposal the tables that we should ideally have (along with the tables that our required tables reference through foreign keys).
Since these tables have foreign keys to other tables, would it be ideal to include those other tables as well, even though we won’t be using their data?
For example, the release table has foreign keys to tables such as language and release_packaging. Should we include these other tables as well? Replication packets might only work against the present structure of the MusicBrainz tables.
Or would it be possible to skip the columns that we don’t need in AB? Looking forward to your feedback.

alastairp · March 26, 2018, 4:10pm

This is getting better, well done. There are still some changes that I would like to see:

Try and make a more interesting opening paragraph! This is quite a boring section which doesn’t tell me why I should be excited about this project. Your following section (“Why do we need to have a tighter integration with [the] MusicBrainz database?”) is a much more interesting section to get me interested in the project.

This is a good outline, but I’m not sure what many of these files are for. Can you add a small 5-10 word description next to each saying what it is for? For example, I don’t understand what mbid_redirect.py and artist_filtering.py are for. I think that packages in this module should be directly mapped to tables or concepts in the musicbrainz database. See the critiquebrainz database which has a separate file per musicbrainz database table.
You can also include a list of changes that you will make in existing code in AcousticBrainz. For example

I will use the method musicbrainz_db.recording.get_recording in [module x] to get recording information instead of accessing musicbrainz using the web service.

This list of locations that you will change should be complete - I would like to see that you can find every place that can be switched to use the direct database access.
At the moment the main type of MusicBrainz data that we show in AcousticBrainz is about recordings, however the project idea also talks about Artists (in the dataset training part of AcousticBrainz). Can you also briefly describe a change to the dataset_eval.artistfilter module saying how you can use access to the database instead of the current method that we use to filter artists?
Can you find any other types of MusicBrainz data in AcousticBrainz that we can change for direct access?

This is a small thing, but you can clean up the grammar here. I would say “For copying the MusicBrainz database, MB database is too big to store a copy of the database on the AcousticBrainz server.”

I find that the order of some parts of this document make it difficult to understand what you are suggesting in the plan. To me, it makes more sense to list sections as:

Use direct access to the MusicBrainz database
- Outline of module structure, changes that should be made to AcousticBrainz
Use a copy of some MusicBrainz tables
- What tables we need to copy
- How to populate the tables with data from MusicBrainz when new data is submitted to AcousticBrainz
- How to perform replication/updates of data with a MusicBrainz replication packet
- Outline of module structure, changes that should be made to AcousticBrainz
Evaluation of the two methods
- What tests to run
- Evaluation criteria
Features that you will implement using the metadata

You put the section about updating metadata after the section about evaluation, so it’s not clear that this step is only required for the database copy method. Instead, your proposal makes it seem like this is something that we would have to do for both methods.

little_rsh:

A pseudo code to import the metadata

source_engine = create_engine()
source_session = sessionmaker(source_engine)
# engine to AB database to save the tables in
destination_engine = create_engine()
dest_session = sessionmaker(destination_engine)

In pseudocode, you don’t need to write things like this if you’re not actually using the variables in the rest of the code. For example, you could write:

source_session = connect to musicbrainz database
dest_session = connect to acousticbrainz database
read recording ids from source_session
for each source session recording id:
    get_data and write to dest session

This only shows variable names that are important for the task

little_rsh:

for each recording mbid:
    look up recording in the database using mbid redirect table if needed
    get the list of artists performing on the recording
    for each artist:
        get gid and name

You have already listed all of the tables that you think are necessary for our local copy. You could add that section first, and then have this pseudocode. This block would be clearer if it performed an operation on just one recording MBID. For example:

method to copy a recording:

check if recording id is in recording_gid_redirect and copy if it is
copy recording table row
copy recording artist credit row
copy track row for recording
copy medium row for track
copy release row for each medium
copy release group row for each release
copy artist credit rows for release and release groups
... more
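The first few steps of this outline could be sketched as follows. This is only an illustration under stated assumptions: both databases are in-memory SQLite stand-ins with a heavily simplified schema, only the redirect, recording, and artist credit steps are covered (track, medium, release, and release group rows are left out), and `copy_recording` is a hypothetical name rather than an existing function.

```python
import sqlite3

# Simplified stand-in schema (assumption): real MusicBrainz tables have more columns.
SCHEMA = """
CREATE TABLE artist_credit (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE recording (
    id INTEGER PRIMARY KEY, gid TEXT UNIQUE, name TEXT,
    artist_credit INTEGER REFERENCES artist_credit(id)
);
CREATE TABLE recording_gid_redirect (
    gid TEXT PRIMARY KEY, new_id INTEGER REFERENCES recording(id)
);
"""
COLUMNS = "id, gid, name, artist_credit"


def copy_recording(source, dest, mbid):
    """Copy one recording and the rows it depends on. Return False if not found."""
    # Check whether the MBID is a redirect and follow it if so.
    redirect = source.execute(
        "SELECT new_id FROM recording_gid_redirect WHERE gid = ?", (mbid,)
    ).fetchone()
    if redirect:
        rec = source.execute(
            "SELECT %s FROM recording WHERE id = ?" % COLUMNS, (redirect[0],)
        ).fetchone()
    else:
        rec = source.execute(
            "SELECT %s FROM recording WHERE gid = ?" % COLUMNS, (mbid,)
        ).fetchone()
    if rec is None:
        return False
    credit = source.execute(
        "SELECT id, name FROM artist_credit WHERE id = ?", (rec[3],)
    ).fetchone()
    with dest:  # one transaction per recording
        # Insert referenced rows first so foreign keys always point at existing rows.
        if credit is not None:
            dest.execute("INSERT OR IGNORE INTO artist_credit VALUES (?, ?)", credit)
        dest.execute("INSERT OR IGNORE INTO recording VALUES (?, ?, ?, ?)", rec)
        if redirect:
            dest.execute(
                "INSERT OR IGNORE INTO recording_gid_redirect VALUES (?, ?)",
                (mbid, rec[0]),
            )
    return True


source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
source.executescript(SCHEMA)
dest.executescript(SCHEMA)
source.executescript("""
INSERT INTO artist_credit VALUES (10, 'Artist X');
INSERT INTO recording VALUES (1, 'aaaa', 'Song A', 10);
INSERT INTO recording_gid_redirect VALUES ('old-a', 1);
""")

copied = [copy_recording(source, dest, m) for m in ("old-a", "aaaa", "zzzz")]
recording_count = dest.execute("SELECT COUNT(*) FROM recording").fetchone()[0]
```

`INSERT OR IGNORE` keeps the copy idempotent, so submitting the same recording twice (directly or through a redirect) leaves a single row in the destination.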

It seems to me like you’re saying two different things here. You’re right that we must include tables for each table that is referred to from the main tables that we require. Your second list of required/foreign key tables is great. I think that you can remove the ER diagram, as it confuses the goal.
Be confident about your statements - “I suppose we have to add all the columns…” makes it sound like you’re unsure about finishing the task. Show us that you’re confident in what you are proposing!

You need to describe this process much more here. “I will write some tests” is not enough to explain this important part of the project. I’ve previously mentioned that you need to describe the tests in more detail. You have described three tasks that you would like to do with the data (information on the recording page, unique MBIDs based on duplicates, artist filtering). These are great tasks to perform the test with because they give us feedback about real tasks that we have to perform in AcousticBrainz. Describe how you want to test each of these. How many recordings do you want to put in your test database? What is an example query that you want to test? How many times will you test it? Will you test by loading the page in your browser, or will you write a testing script?

For example:

Testing metadata access on the dataset editor
When we load a class in the dataset editor we send an ajax query from the client to get metadata for each recording in the class. We can make this faster by sending metadata about the class with the API response containing the class members, or by allowing the client to bulk load a page of metadata at a time.
We will write a test using both database access methods to load metadata for classes of 10, 20, 50, 100, 500, and 1000 items. Because we want the dataset editor to be responsive, we will set a maximum query time of 1 second to populate this data and respond to the API request. If we are unable to respond with metadata of 500 items within 1 second, we will implement a bulk metadata API endpoint for the client to query with up to 100 items at once.
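A testing script for this example might look roughly like the sketch below. It is an assumption-laden illustration: `fake_fetch` is a placeholder for whichever metadata access method is under test (direct connection or local copy), the MBIDs are synthetic strings, and the helper names are hypothetical. Only the class sizes, the 1 second budget, and the 100-item bulk limit come from the description above.

```python
import time

CLASS_SIZES = [10, 20, 50, 100, 500, 1000]
MAX_SECONDS = 1.0   # responsiveness budget for the API request
BULK_LIMIT = 100    # most items a client may request from a bulk endpoint


def time_metadata_load(fetch, mbids, repeats=5):
    """Return the slowest wall-clock time over `repeats` calls to fetch(mbids)."""
    worst = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        fetch(mbids)
        worst = max(worst, time.perf_counter() - start)
    return worst


def run_benchmark(fetch, sizes=CLASS_SIZES, budget=MAX_SECONDS):
    """Time `fetch` for each class size and record whether it met the budget."""
    results = {}
    for size in sizes:
        mbids = ["mbid-%d" % i for i in range(size)]
        elapsed = time_metadata_load(fetch, mbids)
        results[size] = (elapsed, elapsed <= budget)
    return results


def chunk(mbids, limit=BULK_LIMIT):
    """Split a class into batches small enough for a bulk metadata request."""
    return [mbids[i:i + limit] for i in range(0, len(mbids), limit)]


def fake_fetch(mbids):
    # Placeholder: a real test would query MusicBrainz data here.
    return {mbid: {"title": "unknown"} for mbid in mbids}


results = run_benchmark(fake_fetch)
batches = chunk(list(range(250)))
```

Running `run_benchmark` once per access method gives two comparable tables of timings. Recording the slowest of several repeats, rather than the average, is one possible choice that fits a hard responsiveness limit.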

This section is good

As I said above, we can use mbdata in both methods, so we don’t need SQL examples here. Show an mbdata example if you can.

little_rsh · March 27, 2018, 11:15am

@alastairp Thanks a lot for your feedback and suggestions.

Thank you!

I have made more changes to my proposal since your last review. It would be great if you could review them so that I can work on any further changes before the final deadline. For now, I have uploaded the PDF of this draft to the GSoC website.

I have added the ‘recording page code’ and ‘About me’ section into the gist files here because the proposal was exceeding the word limit of this editor.

alastairp · March 27, 2018, 12:35pm

By reading this introduction I am still not convinced about your goals for the project and why it’s a good idea. You still start with a small implementation detail about how the current system works. Tell me what you want to do!
Here are some proposals from another project that could give you inspiration: CLTK announces GSoC 2017 students. Note how especially the second project gives a problem statement and gives a brief overview of what the participant will do.

Otherwise, well done on cleaning up the proposal. Good luck for the selection process!

little_rsh · March 27, 2018, 2:24pm

@alastairp I’ve reframed the introduction part and included what is to be done in the proposal.

Thanks a lot!