A tighter integration with the MusicBrainz database, allowing users to better understand their data.
- Name : Rashi Sah
- IRC nick: rsh7
- E-mail: email@example.com
- Github: https://github.com/rsh7
- Blog: rashisah
- Time Zone: UTC+0530
A tight integration with the MusicBrainz database would be very helpful in AcousticBrainz, because it would allow us to better understand our data by using it in many places in AB: giving real-time feedback to users about artist filters while creating datasets, using MBID redirects to determine duplicate recordings, adding recordings based on common features (same artist, same tag, or same release), and showing visualizations and statistics for the MusicBrainz data present in AcousticBrainz.
There are two ways to implement direct database access: a direct connection between the MB and AB databases, or copying the metadata from the MB database into the AB database. I would implement both methods and then evaluate them against certain criteria, to decide which method works better for using the MB data to implement features in AB.
Recordings in AcousticBrainz are stored by their MusicBrainz MBID. When many users access recording information on the AcousticBrainz site, the many web service requests per page take a lot of time and increase the load on the server. Direct access to the MusicBrainz database would let us query the database directly (one query per page), resulting in a significant increase in the speed with which AcousticBrainz loads.
How do we access the MusicBrainz database?
There are two ways of performing the integration with the MusicBrainz database (as per the ideas page): a direct connection to the MusicBrainz database, or copying the relevant information from the MusicBrainz database into a separate schema in the AcousticBrainz database.
For the direct connection, we would run a separate container of the MusicBrainz-server docker image in AB and import the MB database dumps into it.
Copying the data into a separate schema in the AcousticBrainz database would make data access really fast compared with mirroring MusicBrainz directly or with the web service currently in use, because we would be able to join between tables and retrieve the data with a single query in the AcousticBrainz database.
Here I discuss the pros and cons of both methods of MB database access:
Direct connection with MusicBrainz Database
- There are a lot of requests per page to the XML web service, and thus considerable load on the service. Connecting directly to the MusicBrainz database would result in a significant increase in speed.
- With the direct connection, we would not need periodic updates, as the database would stay synchronized.
- As the two databases are separate, we cannot join directly across their tables; we would have to run one query against each database.
- Writing separate queries against two databases is likely to be slower than replicating the database.
Copy relevant information from MusicBrainz database into a separate schema in AcousticBrainz database
- There are a lot of requests per page to the XML web service, and thus considerable load on the service. Importing the MusicBrainz data into a separate schema in the AcousticBrainz database and querying the AB database directly would result in a significant increase in speed.
- If we import the MB data into the AB database, we can fetch our data with one query and join across tables. Example: if we want low-level data for a recording on a release, we can join the lowlevel_json table of the AB database with the recording, artist_credit and release tables of the MB database (see the sketch after this list).
- The problem here is that we need to update the local schema of MusicBrainz data in AcousticBrainz whenever MB is updated.
- As we copy only some relevant information from the MB database, whenever our needs or use cases change we have to consider copying the data again.
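For illustration, here is a minimal sketch of such a join, assuming the copied MB tables live in a musicbrainz schema inside the AB database and that the AB lowlevel/lowlevel_json layout is as described; the connection string is a placeholder:

    from sqlalchemy import create_engine, text

    # Placeholder DSN for the AcousticBrainz database.
    engine = create_engine("postgresql://acousticbrainz@localhost/acousticbrainz")

    # One query joining AB low-level data with the copied MB metadata tables.
    query = text("""
        SELECT llj.data, r.name AS recording, ac.name AS artist_credit, rel.name AS release
          FROM lowlevel ll
          JOIN lowlevel_json llj ON llj.id = ll.id
          JOIN musicbrainz.recording r ON r.gid = ll.gid
          JOIN musicbrainz.track t ON t.recording = r.id
          JOIN musicbrainz.medium m ON m.id = t.medium
          JOIN musicbrainz.release rel ON rel.id = m.release
          JOIN musicbrainz.artist_credit ac ON ac.id = r.artist_credit
         WHERE ll.gid = :mbid
    """)

    recording_mbid = "<recording MBID>"
    with engine.connect() as connection:
        row = connection.execute(query, {"mbid": recording_mbid}).fetchone()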
Setup for the direct connection with MusicBrainz database
The new infrastructure allows us to easily read data directly from the MusicBrainz database. To run AcousticBrainz in development, we would connect directly to the MusicBrainz database, reusing the existing docker image of MusicBrainz-server. This service has an option to download the database dumps and import the data into the local AcousticBrainz database.
I have started work on adding a new service, a separate container, to the AB docker-compose files for development, which allows downloading the MB database dumps and establishing a direct connection; I have opened a PR for this. To run AcousticBrainz in production, we would connect directly to the main MusicBrainz database without needing an additional docker image.
After building the MusicBrainz database development environment in AB, the next step would be fetching data from the database. I have used mbdata.models, which provides SQLAlchemy models mapped to the MusicBrainz database tables, to write the queries for accessing the data. I have also worked on the Recording entity, which fetches recording information using mbdata, in this PR.
As quite similar work is being done in CritiqueBrainz, my mentor suggested that I write the code accordingly and move the existing code in CB into our cross-project Python library, brainzutils, and then use it in AB; other MetaBrainz projects could then reuse it easily as well.
    └── acousticbrainz-server
        └── webserver
            └── external
                └── musicbrainz_db_direct
                    ├── __init__.py
                    ├── exceptions.py
                    ├── artist.py
                    ├── includes.py
                    ├── recording.py
                    ├── release.py
                    ├── recording_gid_redirect.py
                    ├── track.py
                    ├── entities.py
                    ├── tests            # testing functions here
                    │   ├── __init__.py
                    │   ├── artist_test.py
                    │   ├── recording_test.py
                    │   ├── release_test.py
                    │   ├── recording_gid_redirect_test.py
                    │   └── track_test.py
                    └── utils.py
I am describing the functions and details of each file below.
The musicbrainz_db_direct module would initially contain the following functions:

__init__.py: contains functions to initialize an SQLAlchemy engine for querying the MusicBrainz database, as well as the musicbrainz_session() context manager.
- session()
- initialize_db()

musicbrainz_db_direct.exceptions
- class MBDatabaseException(Exception)
- class DataNotFoundException(MBDatabaseException)

musicbrainz_db_direct.artist: fetches an artist with artist_credit and artist_credit_name.
- get_artist_by_id(mbid)
- to_dict_artists()

musicbrainz_db_direct.recording: gets information related to a recording, such as its gid and name, calls other functions to get the details for the recording page, and returns the result as a dictionary.
- get_recording()
- to_dict_recordings()

musicbrainz_db_direct.includes: contains the list of entity types related to each entity and checks that the includes specified for an entity are valid.
- check_includes()

musicbrainz_db_direct.release: gets the release gid and name by id.
- get_release()
- to_dict_release()

musicbrainz_db_direct.recording_gid_redirect: checks whether a recording is present in the redirect table.
- get_redirect_model_gid()

musicbrainz_db_direct.track: gets track_number, track_position and track_length.
- get_track_by_id()

musicbrainz_db_direct.utils
- returns unknown entities (if not present in MB)
- returns entities that have multiple MBIDs along with their actual MBID, for MBID redirects
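For illustration, the session helpers in __init__.py could look roughly like this (a sketch only; the exact configuration handling is an assumption):

    from contextlib import contextmanager

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    engine = None
    Session = None

    def initialize_db(connect_str):
        """Set up the SQLAlchemy engine used to query the MusicBrainz database."""
        global engine, Session
        engine = create_engine(connect_str)
        Session = sessionmaker(bind=engine)

    @contextmanager
    def musicbrainz_session():
        """Yield a session bound to the MusicBrainz database, closing it when done."""
        session = Session()
        try:
            yield session
        finally:
            session.close()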
The main type of MusicBrainz data that we show in AcousticBrainz concerns recordings, but AB also uses MB data in other places where the direct access method could be applied. For artist filtering (in the dataset training part of AB), the artist information is currently fetched from the lowlevel table (data put there by the submission client); ideally we should fetch these results from MusicBrainz data instead. In the dataset editor, we send an ajax query from the client to get metadata for each recording in a class, and when we add a recording to a class, MB data is used to fetch the recording information. AB presently uses musicbrainzngs to get the recording information for the web page; I would change this by implementing the direct access method using mbdata.models to retrieve the recording information.
Implementation details to copy relevant information from MusicBrainz database
The MB database is too big to store a full copy on the AcousticBrainz server. We should at least include the title of the recording, recording gid, artist name, artist gid, artist count, release year, release gid, release name, track length, track position and track number. While the MusicBrainz database contains around 18 million recordings, the metadata stored in the local schema in AcousticBrainz would cover only around 3 million recordings, since that is roughly how many recordings AB has in its database.
Ideal tables to import in our local schema from MusicBrainz database
To download the replication packets and apply them to construct the tables, we would have to include all the columns we require for each table; and since our required tables define foreign keys to some other tables, we would also have to import those tables as well, even though we do not need them directly. I would like to discuss this in more depth with my mentor and other community members during the community bonding period.
A proposed initial list of tables along with the additional tables we should import is as follows:
In our required tables above, we have foreign keys defined for the following list of tables:
I would connect to the container of the docker image that I have added to AB in this PR to connect to the MB database. As the tables are not in the same database, we cannot use a direct SQL COPY command to copy the metadata. Instead, the local schema would be populated using SQLAlchemy: first create an engine to the MusicBrainz database (via the docker container) to query the data, then an engine to the AcousticBrainz database to store the results, after building a new schema for the MB tables in the AB database; finally, we execute the queries to insert the metadata into the tables. We would transfer the data in batches (a bunch of data at a time) rather than copying an entire table at once, which reduces the load on the server.
I would add a function in manage.py for importing metadata into the AB database.
The metadata can be imported using:

    python manage.py import_musicbrainz_data

This command would also be added to the Dockerfile.
Pseudocode to import the metadata:
    connect to the docker container of musicbrainz-server present in AB

    # engine to the MB database to get the metadata from
    source_session = connect to musicbrainz database

    # engine to the AB database to save the tables in
    dest_session = connect to acousticbrainz database

    create a new schema and new tables in AB database
    read recording ids from source_session
    for each source session recording id:
        # getting a bunch of data at a time
        get_data()
        execute the query to write the data to dest_session
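A rough Python sketch of this import, assuming SQLAlchemy engines for both databases; the DSNs, batch size, and the reduced recording column list are assumptions:

    from sqlalchemy import create_engine, text

    BATCH_SIZE = 1000  # arbitrary; to be tuned against server load

    mb_engine = create_engine("postgresql://musicbrainz@musicbrainz_db/musicbrainz_db")
    ab_engine = create_engine("postgresql://acousticbrainz@db/acousticbrainz")

    with ab_engine.begin() as ab_conn:
        # Create the local schema; CREATE TABLE statements for the copied tables go here.
        ab_conn.execute(text("CREATE SCHEMA IF NOT EXISTS musicbrainz"))

    with mb_engine.connect() as mb_conn:
        # Read the ids of the recordings to copy (in practice, only those AB knows about).
        ids = [row[0] for row in mb_conn.execute(text("SELECT id FROM recording"))]

    for start in range(0, len(ids), BATCH_SIZE):
        batch = ids[start:start + BATCH_SIZE]
        with mb_engine.connect() as mb_conn:
            rows = mb_conn.execute(
                text("SELECT id, gid, name, artist_credit, length "
                     "FROM recording WHERE id = ANY(:ids)"),
                {"ids": batch},
            ).fetchall()
        with ab_engine.begin() as ab_conn:
            for row in rows:
                ab_conn.execute(
                    text("INSERT INTO musicbrainz.recording "
                         "(id, gid, name, artist_credit, length) "
                         "VALUES (:id, :gid, :name, :artist_credit, :length)"),
                    {"id": row[0], "gid": row[1], "name": row[2],
                     "artist_credit": row[3], "length": row[4]},
                )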
After getting the list of recording MBIDs to add, here is pseudocode to perform the operation for one recording MBID.
Method to copy a recording:
    check if the recording id is present in the recording_gid_redirect table and copy it if present
    copy recording table row
    copy recording artist credit row
    copy artist credit name row for artist
    copy the list of artists performing on the recording
    copy medium row for track
    copy release row for each medium
    copy release group for each release
    copy release status for each release
    copy artist credit rows for release and release groups
    copy track table row
    copy language table row
    copy release_packaging table row
    copy script table row
    copy area table row
    copy gender table row
    copy artist_type table row
Once we have imported the metadata into a local schema, we would be dealing with 2 steps:
- Updating the metadata in AcousticBrainz whenever the data in MusicBrainz server is updated
- Fetching new metadata when a new recording is added to AcousticBrainz
For updating the MB data we have in AB, downloading the replication packets that MB provides will allow us to do a direct mapping from one database to the other. Replication packets are the mechanism by which copies of the MusicBrainz database are kept up to date. If we keep the structure of our tables exactly the same, we can simply look in the replication packet file, check whether an item on the MusicBrainz server was updated, and copy the data if the item is present in the local schema.
A last-updated column can then hold the timestamp of the replication packet or its sequence number. We could also save space by using a single table to store the timestamps for all tables, like the replication_control table in the MB database. We would compare the timestamps between the original MB database and the local schema; whenever the timestamp in the original database is greater than the one in our local schema, the local schema would be updated.
After modification of the LoadReplicationChanges script for:
- skipping UPDATEs and DELETEs for rows not present in AB database.
- skipping INSERTs for recordings not in AB tables.
- skipping replication tables not copied in AB database.
we can use the replication script to apply the changes to the tables in the local schema of the MB database.
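A sketch of that filtering logic, assuming each change in a packet can be represented by its table name, operation, and primary key (both helper functions are hypothetical):

    # Tables copied into the local schema (illustrative subset).
    COPIED_TABLES = {"recording", "recording_gid_redirect", "artist_credit",
                     "artist_credit_name", "artist", "track", "medium", "release"}

    def should_apply_change(table, operation, pk, ab_conn):
        """Decide whether a replication change should be applied to the local schema."""
        if table not in COPIED_TABLES:
            return False  # skip replication tables we did not copy
        if operation in ("UPDATE", "DELETE"):
            return row_exists_locally(ab_conn, table, pk)  # hypothetical helper
        if operation == "INSERT" and table == "recording":
            return recording_known_to_ab(ab_conn, pk)  # hypothetical helper
        return True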
For fetching new metadata whenever a new recording is added to the AcousticBrainz database, I would write a script that copies and saves only the subset of information we require into our local schema of MusicBrainz data in AcousticBrainz. When fetching data, we would first try AB's subset of MB; if an MBID exists in AB but is not present in our metadata tables (i.e. it is a new addition to the AcousticBrainz database), we would fetch it over the direct connection to the MB database and save it in the subset (the local schema of MB data).
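A sketch of that lookup order (all three helpers are hypothetical names from this proposal):

    def get_recording_metadata(mbid):
        """Prefer AB's local copy of the MB data; fall back to the direct connection."""
        metadata = get_recording_from_local_schema(mbid)
        if metadata is None:
            # A recording newly added to AB: fetch it over the direct connection
            # and cache it in the local schema for next time.
            metadata = get_recording_via_direct_connection(mbid)
            save_recording_to_local_schema(metadata)
        return metadata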
A proposed structure for directories and files we would have for this method:
The files in this structure would contain functions similar to those shown above for the previous method, but we would query the database with raw SQL rather than mbdata, because we do not have models defined for the AB database the way we do for the MB database.
Evaluation of both methods
Performing tests in order to decide the better method
I propose to find a reasonable compromise between the two methods so that we get the best of both: we can maintain a direct connection to the MusicBrainz database while copying only a small subset of the entire MusicBrainz database into the AcousticBrainz schema, using the direct connection in the places where a recording has newly been added to the AB database. The most important part of the project is to test both methods at the scale of AcousticBrainz to see which works best for integrating the AB database with MB. After developing both methods, I would implement test queries for the direct connection and for the imported metadata to verify that the test queries work.
I would test queries related to how we will use the data for the different integrations and tasks. I would write testing scripts for the different tasks to check the speed of both database access methods, so that we can determine the better method from the time difference between the queries. I would test recording-page information, unique MBIDs based on duplicates, and artist filtering, since these give us feedback about real tasks that we have to perform in AcousticBrainz.
For information on the recording page, I would write test queries that fetch the recording information over the direct connection and from the local copy, and check which method loads the information faster. I would also test both methods by loading the page in the browser, and compare them with the loading time of the present setup.
An example test query for the direct connection using mbdata.models:
    def test_recording_by_id():
        get 1000 recordings
        for each recording:
            get recording info
            calculate time taken to load the info in the web page
        measure the average time taken by both methods
        choose the method that gives us better results

    def get_recording(mbid):
        initialize musicbrainz session
        query = session.query(models.Recording)
        # fetch artists with artist-credit
        query = query.options(joinedload("artist_credit")).\
            options(joinedload("artist_credit.artists")).\
            options(joinedload("artist_credit.artists.artist"))
        # similarly fetch media with tracks info
        # fetch release info
        convert entity information into dictionaries
        return rec dict
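For reference, a more concrete sketch of get_recording() over the direct connection with mbdata (field selection simplified; error handling omitted):

    from mbdata import models
    from sqlalchemy.orm import joinedload

    def get_recording(session, mbid):
        """Fetch one recording with its artist credits via the direct MB connection."""
        recording = session.query(models.Recording) \
            .options(joinedload("artist_credit")) \
            .options(joinedload("artist_credit.artists")) \
            .options(joinedload("artist_credit.artists.artist")) \
            .filter(models.Recording.gid == mbid) \
            .first()
        if recording is None:
            return None
        return {
            "mbid": str(recording.gid),
            "title": recording.name,
            "length": recording.length,
            "artists": [{"mbid": str(credit.artist.gid), "name": credit.name}
                        for credit in recording.artist_credit.artists],
        }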
For unique MBIDs based on duplicates and for artist filtering, I would run sample tests on about 100 recordings with both methods. I would write a test using both database access methods to get the recording_gid_redirect information and check whether we would add the items to the class or not. For artist filtering, we would fetch the artist information for the recording from the database and then check, with both access methods, whether the artist has already been added.
An example query for mbid_redirect information:
    get recording mbids
    get recording model for recording type using models.Recording
    get recording_model.gid for mbids
    get redirect model using models.RecordingGIDRedirect
    calculate remaining ids from mbids which are not recording gids
    if remaining ids, calculate again excluding the redirect gids
    get the recordings with their original mbids and return
    # now we are able to detect the duplicate recordings
I would run this test with both methods, measure the data-retrieval speeds, and decide on the more suitable method.
Testing metadata access on the dataset editor
We can make loading a class in the dataset editor faster by loading the metadata for its recordings in bulk rather than one recording at a time.
We will write a test using both database access methods to load metadata for classes of 10, 20, 50, 100, 500, and 1000 items. Because we want the dataset editor to be responsive, we will set a maximum query time of 1 second to populate this data and respond to the API request. If we are unable to respond with metadata for 500 items within 1 second, we will implement a bulk metadata API endpoint that the client can query with up to 100 items at a time.
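For instance, a bulk lookup could fetch the metadata for a whole class in one query rather than one query per recording; a sketch using the direct-connection models (the field selection is an assumption):

    from mbdata import models
    from sqlalchemy.orm import joinedload

    def get_metadata_bulk(session, mbids):
        """Return {mbid: metadata} for a batch of recordings in a single query."""
        recordings = session.query(models.Recording) \
            .options(joinedload("artist_credit")) \
            .filter(models.Recording.gid.in_(mbids)) \
            .all()
        return {str(r.gid): {"title": r.name, "artist": r.artist_credit.name}
                for r in recordings}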
After I have the test queries working, I would run the experiments on the complete AcousticBrainz database, in collaboration with the MetaBrainz team, to decide which method best fits our needs. We would decide on the basis of several parameters:
- Memory usage
- Webpage rendering time, i.e. the average time it currently takes to render a web page on the AcousticBrainz website, compared against the time taken with each of the two methods.
- Response time and failure behaviour; for example, with a direct connection, a problem in the MB database could leave the process waiting indefinitely.
Using MusicBrainz data in AcousticBrainz: How should we use our data in order to get a better understanding?
After implementing and testing the database access methods, we would fetch the data and use it in different places in AB. Presently we get recording information using python-musicbrainzngs; a sketch of what fetching recording information for the AB recording page looks like with it (the user agent string and includes list are assumptions):
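    import musicbrainzngs

    # musicbrainzngs requires a user agent; the exact string AB uses is assumed here.
    musicbrainzngs.set_useragent("acousticbrainz", "1.0")

    def get_recording_by_id(mbid):
        """Fetch recording info (with artists and releases) from the MB web service."""
        try:
            result = musicbrainzngs.get_recording_by_id(
                mbid, includes=["artists", "releases"])
        except musicbrainzngs.ResponseError:
            raise Exception("Recording not found: %s" % mbid)
        return result["recording"]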
This would then be the time to integrate the AB database with MB, so I would work on performing the integrations and using the data in many places in AB.
Using MBID redirect information to determine when two distinct MBIDs refer to the same recording
An entity can have more than one MBID. When an entity is merged into another, its MBID is redirected to the other entity.
To determine duplicate recordings, I would implement a function that redirects an MBID to its original entity. I would query the recording_gid_redirect table of the MusicBrainz database and use this information in a function that returns the entities with their MBIDs, which would then be used to prevent duplicate recordings from being added to a class. MBID redirect information would allow us to select any duplicate recording regardless of its MBID, both in the API and on the AcousticBrainz website.
The function would return a dictionary of entities keyed by their MBIDs. I would maintain a set of the original MBIDs of the recordings that have already been added, so that whenever a new recording arrives we can compare its original MBID with those in the set. Thus, with the help of MBID redirects, a user would not be able to add the same recording twice to a particular class, because its MBIDs would redirect to the same original entity.
We can implement this easily over the direct connection to the MB database using mbdata.models, much like what CritiqueBrainz does here, as sketched below. But only after testing both database access methods would we be able to decide which one works better for this integration.
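A sketch of the redirect resolution with mbdata (the redirect attribute names on RecordingGIDRedirect are assumptions and may differ slightly):

    from mbdata import models

    def resolve_original_mbids(session, mbids):
        """Map each MBID to the canonical MBID of the recording it refers to."""
        results = {}
        recordings = session.query(models.Recording) \
            .filter(models.Recording.gid.in_(mbids)).all()
        for recording in recordings:
            results[str(recording.gid)] = str(recording.gid)
        # MBIDs not found directly may be redirects left over from merges.
        remaining = set(mbids) - set(results)
        if remaining:
            redirects = session.query(models.RecordingGIDRedirect) \
                .filter(models.RecordingGIDRedirect.gid.in_(remaining)).all()
            for redirect in redirects:
                results[str(redirect.gid)] = str(redirect.redirect.gid)
        return results

Two input MBIDs mapping to the same canonical MBID indicate the same underlying recording, which is exactly the duplicate case we want to reject.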
Using Artist information in AcousticBrainz
Artist filtering means that only one recording per artist should be present in any dataset class, so that during evaluation we have unique artists in each class and can present the user with a cutoff of class size during training. We cannot apply artist filtering when creating a challenge for classifying an artist. If we have fast MB database access, we can give users real-time feedback about artist filters while creating datasets, by fetching the MBID of a recording from the AB database and applying joins between tables.
We could implement the process this way: whenever a user adds a recording, we fetch the artist information from the artist table and save the artist name or MBID in a set. When the user then attempts to add another recording by the same artist to that class, we check whether the recording's artist is already present in the set; if it is, we do not allow the user to add that recording to the class.
A pseudocode description:
    create an empty set artist_mbids
    for each addition of a recording:
        get the recording added by the user
        fetch the artist mbid of the recording
        check if it is present in artist_mbids
        if present:
            do not allow the user to add that recording to the class
        else:
            add the recording to the class
            store the artist mbid of that recording in the set
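The same check as a small Python sketch (get_artist_mbid is a hypothetical helper that looks the artist up in the MB data):

    def try_add_recording(class_recordings, artist_mbids, recording_mbid):
        """Add a recording to a dataset class only if its artist is not yet present."""
        artist_mbid = get_artist_mbid(recording_mbid)  # hypothetical helper
        if artist_mbid in artist_mbids:
            return False  # reject: one recording per artist
        class_recordings.add(recording_mbid)
        artist_mbids.add(artist_mbid)
        return True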
In case I finish my GSoC project early, I plan to use the rest of my time on using MusicBrainz data to show statistics.
We can use MB data widely in many places in AB. I would like to add statistics and visualizations for the data with the help of charts and graphs. We could add a new view, perhaps acousticbrainz.org/statistics, as a place to show sitewide graphs. We could store the stats in jsonb format in a statistics table in the AB database. When stats are calculated, they would be saved in the database per entity, so that when the page is opened again the calculation is not repeated for the same data and the stats are instead fetched from the database. We would recalculate the statistics on a weekly or fortnightly basis.
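A sketch of the cache-or-compute behaviour for such stats (the statistics table and its columns are a proposal, not an existing schema):

    import json
    from sqlalchemy import text

    def get_statistics(connection, name, compute_func):
        """Return cached stats by name, computing and storing them on a cache miss."""
        row = connection.execute(
            text("SELECT data FROM statistics WHERE name = :name"), {"name": name}
        ).fetchone()
        if row is not None:
            return row[0]  # jsonb is returned as a Python object
        data = compute_func()
        connection.execute(
            text("INSERT INTO statistics (name, data, calculated) "
                 "VALUES (:name, :data, now())"),
            {"name": name, "data": json.dumps(data)},
        )
        return data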
I would add charts based on the following:
- Top X Artists by recordings
- Most frequently submitted recordings
- Top X submitted recordings by an Artist
- Most frequently submitted releases over a year
- Most / least submitted MusicBrainz tags
A graph for the most commonly submitted recordings and the top 10 artists would look like: [mockup image]
A phase-by-phase timeline of the work to be done is summarized as follows:
Community Bonding (April 23 - May 14)
I will spend this time formalizing exactly what I need to code, and I will start setting up the docker container connection to the MB database for the AB development environment. I will also discuss design decisions with my mentor, including the ideal tables to import from the MusicBrainz database into the local schema in AcousticBrainz, to make sure no bad decisions are made early in the process.
Phase 1 (May 14 - June 11)
I aim to complete the method of copying the metadata from the MB database into a separate schema in the AB database: use the docker image in the Dockerfile to connect to the MB database, load the metadata, and import the data into a local schema in AB by adding a function to manage.py. I would also work on writing a script that fetches data from the MusicBrainz database whenever a new MBID is added to AcousticBrainz.
Phase 2 (June 12 - July 9)
In this phase, I first aim to work on updating the local schema of the MB database whenever the MusicBrainz server is updated, by downloading the replication packets. I will then test both methods at the scale of AcousticBrainz to decide which works best for us. Toward the end of this phase, I will update the AcousticBrainz build documentation.
Phase 3 (July 10 - August 6)
This phase would involve work on two integrations. I would first work on using MBID redirect information to select duplicate recordings, and then on allowing users to add one recording per artist in real time to each class for artist filtering. If I finish early, I would help with using MB data to show statistics in AB.
After Summer of Code
I will continue my work by adding more functionality to AcousticBrainz, such as allowing users to add recordings to a particular class in the dataset editor based on criteria such as a given artist, a given release, the same release year, or a given MusicBrainz tag, and by working on other MetaBrainz projects as well. Working on the new machine learning infrastructure would also be very interesting.
Here is a week-by-week timeline of my work for the summer:
GSoC 1st week (14th May to 20th May): Begin with the Dockerfile setup. Connect to the container made for the direct connection to the MusicBrainz database. Start by loading and copying a small amount of data first.
GSoC 2nd and 3rd week (21st May to 3rd June): Write the script to import the data in batches from the MB database into a schema in the AB database.
GSoC 4th week (4th June to 10th June): Work on the script that fetches new data into the local schema whenever a new MBID is added to the AcousticBrainz database.
GSoC 5th week (11th June to 17th June): Start working on scripts to handle the update and insert changes from replication packets.
GSoC 6th week (18th June to 24th June): Work on downloading replication packets, set this up with cron, and update the local schema whenever there is an update on the MusicBrainz server.
GSoC 7th week (25th June to 1st July): With both methods developed, test sample queries for each and verify that the queries work properly.
GSoC 8th week (2nd July to 8th July): Test both methods at the scale of AcousticBrainz on the basis of several parameters, and decide which method works best for us.
GSoC 9th and 10th week (9th July to 22nd July): Work on using MBID redirect information to prevent duplicate recordings from being added to a class.
GSoC 11th week (23rd July to 29th July): Start working on artist information to give users real-time feedback about artist filters.
GSoC 12th week (30th July to 5th August): Complete any integrations left over from previous weeks and fix bugs. Work on documentation for the AcousticBrainz website.
GSoC 13th week (6th August to 14th August): Complete any pending work, prepare the final submission, and make sure everything is working fine.