GSoC 2018: A way to associate listens with MBIDs

kartikeyaSh · March 17, 2018, 7:46pm

Personal Information

Name: Kartikeya Sharma
IRC nick: kartikeyaSh
Email: 09kartikeya@gmail.com
Github: https://github.com/kartikeyaSh
Blog: https://kartikeyaSh.github.io
Time Zone: UTC+0530

Project Details:

Currently ListenBrainz uses MSIDs (MessyBrainz IDs) for retrieving useful user stats (e.g. user listens). ListenBrainz also plans to generate data which MusicBrainz could use to show useful information like artist popularity. MusicBrainz has MBIDs (MusicBrainz IDs) associated with each artist, recording, and release. To let MusicBrainz access information based on MBIDs, we have to associate recording_mbids, artist_mbids and release_mbids with the listens wherever we can. Most listens don’t have artist_mbids and release_mbids associated with them, but they often have recording_mbids. So I plan to associate MBIDs with MSIDs. To do this I’ve divided the project into four parts.

Create clusters and association based on MBIDs present in the recording.
Create clusters and association for artists and releases using recording MBIDs.
Make sure new listens are clustered on insertion properly whenever possible.
Create API endpoints in MessyBrainz.

Note: I’ve assumed that there will be a MusicBrainz database which can be queried.

Goals of the project:

Create infrastructure for creating, updating, collapsing, and deleting clusters and associating MBIDs to the created clusters.
Execute devised algorithms on the database using the infrastructure created.

I’ve written this proposal with the mindset that I’ll be creating some algorithm and to get that algorithm executed I’ll create the infrastructure that is needed. This infrastructure can be used in future.
In part 1 I’ll create infrastructure for creating, updating, collapsing and deleting clusters.
In part 2 I’ll create an infrastructure to access MusicBrainz database and store the required information in our database for future use.
In part 3 I’ll create the infrastructure to cluster listens newly inserted into the MessyBrainz database, using the cache tables created in part 2.
In part 4 I’ll create infrastructure for retrieving information from the MessyBrainz database using MSIDs and MBIDs which can be used by ListenBrainz and MusicBrainz to show useful stats.

Part 1:

Create clusters based on MBIDs already in the database.

In this phase, I’ll be creating clusters based on MBIDs which have been already inserted when submitting listens.

Why not create a cluster for MSIDs and then associate MBIDs?

We could use meta_sha256 to cluster the MSIDs and then assign MBIDs to those clusters. The meta_sha256 field is computed from the (artist, title) fields of the JSON data stored in the recording_json table, so recordings with the same artist name and title would be clustered together. But this approach can’t handle cases where the same artist has different recordings with the same name (Summertime by Miles Davis and Summertime by Miles Davis), or cases where recording, artist, and release all have the same value for different recordings (see track 1 and track 4 of release We Want Miles). Such recordings have the same (artist, title) fields and hence identical SHA values, so clustering on meta_sha256 would merge the two recordings into one cluster. That is incorrect: in the end we want to assign MBIDs to clusters, and we wouldn’t be able to assign a unique MBID to such a cluster. If, on the other hand, we use the MBIDs present in the data to create clusters from the start, we never have such ambiguities, because MBIDs are unique for every recording, artist, and release. Using MBIDs to form clusters also lets us cluster recordings which have spelling mistakes in the artist or title fields.

First, all this will be done on recordings.
Here is the schema of MessyBrainz database:

List of foreign keys:

recording.data --> recording_json.id
recording.artist --> artist_credit.gid
recording.release --> release.gid
recording_cluster.recording_gid --> recording.gid
recording_redirect.recording_cluster_id --> recording_cluster.cluster_id
artist_credit_redirect.artist_credit_cluster_id --> artist_credit_cluster.cluster_id
artist_credit_cluster.artist_credit_gid --> artist_credit.gid
release_redirect.release_cluster_id --> release_cluster.cluster_id
release_cluster.release_gid --> release.gid

--> represents “foreign key to” relationship
e.g. recording.data is “foreign key to” recording_json.id .

How to access MusicBrainz database:

As done in CritiqueBrainz I’ll use the Docker image which the MusicBrainz project is using. I’ll use mbdata to access the MusicBrainz data for my purposes.

Before proceeding further we need to get unique recording_mbids from the recording_json table which we can get by running a query like
SELECT DISTINCT data ->> 'recording_mbid' FROM recording_json WHERE data ->> 'recording_mbid' IS NOT NULL;

Validate data in recordings:

It may happen that the MBIDs submitted with a recording are incorrect. Before clustering/associating the MBID we can check whether the other data inside the JSON is similar to the data MusicBrainz holds for that MBID. We can use the MusicBrainz database to validate this information. I am thinking of a check based on edit distance (https://en.wikipedia.org/wiki/Edit_distance): we set some threshold, and only if the distance crosses that threshold do we say that the recording may have a wrong MBID. The threshold is needed because manually tagged files can easily contain a few errors. The comparison can be made case-insensitive. We can also check whether the submitted artist MBIDs and release MBID correspond to the recording.
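A minimal sketch of such a check, assuming a plain Levenshtein distance normalised by the longer string. The function names and the 0.4 threshold are hypothetical and would need tuning against real data:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def looks_like_wrong_mbid(submitted: str, canonical: str,
                          threshold: float = 0.4) -> bool:
    """Return True when the submitted text differs too much from the text
    MusicBrainz holds for the MBID. The comparison is case-insensitive and
    the threshold value is an assumption, not a tested setting."""
    a = submitted.casefold().strip()
    b = canonical.casefold().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return False
    return edit_distance(a, b) / longest > threshold
```

Normalising by length keeps a single typo in a long title from being treated the same as a single typo in a three-letter title.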

Create clusters for recordings :

Here are the steps to create a cluster for a single recording using recording MBID present in the data:

For a given recording MBID in the recording_json table, get all recording_json.id which contain this recording MBID.
From this list of recording_json.id query the recording table to get the recording.gid associated with this recording_json.id.
For this list of recording gids, increment the cluster_id in the recording_cluster table and associate this cluster_id with all the recording gids.
In the recording_redirect table add an entry for the cluster_id and the recording MBID which represents this cluster.
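The steps above can be sketched in memory as follows. This is only an illustration: the two returned lists stand in for the recording_cluster and recording_redirect tables, the counter stands in for the SERIAL cluster_id column, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import count


def cluster_recordings(rows):
    """rows: iterable of (recording_gid, json_data) pairs, i.e. the
    recording table joined to recording_json through recording.data.

    Returns (recording_cluster, recording_redirect) as lists of
    (cluster_id, recording_gid) and (cluster_id, recording_mbid) tuples.
    """
    # Steps 1 and 2: collect the recording gids for every recording MBID.
    gids_by_mbid = defaultdict(list)
    for gid, data in rows:
        mbid = data.get("recording_mbid")
        if mbid:
            gids_by_mbid[mbid].append(gid)

    next_cluster_id = count(1)  # stand-in for the SERIAL column
    recording_cluster, recording_redirect = [], []
    for mbid, gids in gids_by_mbid.items():
        cluster_id = next(next_cluster_id)
        # Step 3: every gid sharing the MBID gets the same cluster_id.
        recording_cluster.extend((cluster_id, gid) for gid in gids)
        # Step 4: the redirect row ties the cluster to its MBID.
        recording_redirect.append((cluster_id, mbid))
    return recording_cluster, recording_redirect
```

Recordings without a recording_mbid are skipped here; they are handled separately.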

Example:

Here is an example of how it will be done.
I’ve used data present in the data dump available here.

Let’s create clusters for 'recording_mbid' = '58e48a5d-0ce7-49b8-b1f9-b96a56892eec';

Get IDs from recording_json table for 'recording_mbid' ='58e48a5d-0ce7-49b8-b1f9-b96a56892eec'.
SELECT * FROM recording_json WHERE data ->> 'recording_mbid' ='58e48a5d-0ce7-49b8-b1f9-b96a56892eec';

id	data	data_sha256	meta_sha256
1054240	{"artist":"The Oh Hellos","recording_mbid":"58e48a5d-0ce7-49b8-b1f9-b96a56892eec","release":"The Oh Hellos EP","title":"Cold Is the Night"}	d8134f6e80c8d841ceb229024942b73154668ea0da20d83eeebf1bd29d0d01bb	550185d5dc2b9a269db15e2c94fffa22b0ab4cea89023770358a8b310ac58ca5
1107275	{"artist":"The Oh Hellos","recording_mbid":"58e48a5d-0ce7-49b8-b1f9-b96a56892eec","release":"The Oh Hello's","title":"Cold Is the Night"}	2cdea46c3446d0cca1732df5bce12cec3305b94b6c7bdcff9af87f376fbbd84b	550185d5dc2b9a269db15e2c94fffa22b0ab4cea89023770358a8b310ac58ca5
5864854	{"artist":"The Oh Hellos","recording_mbid":"58e48a5d-0ce7-49b8-b1f9-b96a56892eec","title":"Cold Is the Night"}	ebbce841db049579f6350c5e8cc665439e9bb7a16bf7ecb8a50b0bc1cc8abaaa	550185d5dc2b9a269db15e2c94fffa22b0ab4cea89023770358a8b310ac58ca5
8772454	{"artist":"The Oh Hellos","recording_mbid":"58e48a5d-0ce7-49b8-b1f9-b96a56892eec","release":"The Summer Kollection","title":"Cold Is the Night"}	33cf15c39be9368094480a3483923b759422f3bf0e551c475efed80bf42af98a	550185d5dc2b9a269db15e2c94fffa22b0ab4cea89023770358a8b310ac58ca5

Now using the IDs from recording_json table we query the gids from recording table and get the following table :
SELECT gid FROM recording WHERE data = 8772454 OR data = 1054240 OR data = 1107275 OR data = 5864854;

gid
f9c89d8b-b6d8-4996-83fa-8a9444962b98
28aae0dd-6f0a-49da-92a0-1ddddcf0ea90
1e30ac1b-6d1a-4e0d-bd12-4177eb7d9ecb
8ec1db13-aa0c-4a52-9885-2ffcd031534d

These four gids represent the same recording. So we should put them in the same cluster. For that, we first increment the serial column cluster_id in recording_cluster table and put these gids into the recording_cluster table. The following table depicts this:

cluster id	recording_gid
1	f9c89d8b-b6d8-4996-83fa-8a9444962b98
1	28aae0dd-6f0a-49da-92a0-1ddddcf0ea90
1	1e30ac1b-6d1a-4e0d-bd12-4177eb7d9ecb
1	8ec1db13-aa0c-4a52-9885-2ffcd031534d

Now we put this cluster_id into the recording_redirect table with the recording_mbid on which we were working and get the following table:

recording_cluster_id	recording_mbid
1	58e48a5d-0ce7-49b8-b1f9-b96a56892eec

Doing this for all unique MBIDs gives us MBID-based clusters for all the listens which have an MBID.

Based on the data in the available data dump, here are some stats:

We have 9335675 recordings in the recording_json table, of which 4377964 contain a recording_mbid. So by doing the above process we will end up associating 46.89% of recording MSIDs with recording MBIDs.

The same process can be applied to create clusters of artists and releases where MBIDs are present in the JSON data in the recording_json table. A small modification is needed for creating clusters for artists.
For artists we get a list of MBIDs, and all those MBIDs are to be associated with a single MSID. For this, we have the artist_credit_redirect table, where we can associate one cluster id with multiple MBIDs.
For example, the recording I Don’t Wanna Live Forever (Fifty Shades Darker) has two artists, ZAYN and Taylor Swift, so the artist_credit_redirect table will contain a single cluster_id which maps to both the ZAYN and the Taylor Swift artist MBIDs.

Here are a few stats based on data in data dump:
We have 689185 artists in the artist_credit table, and 12875 recordings contain artist_mbids. So by doing the above process we will end up associating 1.87% of artist MSIDs with artist MBIDs in the case where MSIDs are not clustered at all, i.e. MSIDs are already unique.

We have 1066187 releases in the release table, and 10691 recordings contain a release_mbid. So by doing the above process we will end up associating 1.0027% of release MSIDs with release MBIDs in the case where MSIDs are not clustered at all, i.e. MSIDs are already unique.

Even with some clustering these results will not be great. That is why I’ve proposed the second method in part 2.
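The percentages quoted above follow directly from the counts in the data dump. A small sketch to reproduce them (the helper name is hypothetical):

```python
def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


# Counts taken from the stats above.
recording_pct = percent(4377964, 9335675)  # recordings with a recording_mbid
artist_pct = percent(12875, 689185)        # recordings with artist_mbids vs. artists
release_pct = percent(10691, 1066187)      # recordings with a release_mbid vs. releases

print(f"recordings: {recording_pct:.2f}%")
print(f"artists:    {artist_pct:.2f}%")
print(f"releases:   {release_pct:.4f}%")
```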

Implementation details:

Scripts will be written to execute the above-specified algorithm on the database. The scripts to create clusters for all three entities have a similar structure: the code for creating clusters, merging two clusters, deleting clusters, and updating clusters (to add, remove, or edit some field) is much the same for all three. While creating clusters for recordings I’ll keep this in mind and put the shareable code in a file like data_utils.py, so that the scripts for artists and releases can simply reuse it.
Recording MBIDs and MSIDs have to be fetched many times. To save time we can create a table with fields for (recording MBIDs, recording MSIDs, release names), indexed on recording MBIDs and recording MSIDs for fast lookup.
The script should let the user run it only on data submitted after a specified timestamp. The tables already have fields (submitted, updated) which can be used for this.
The scripts will have a dry run feature which does not manipulate anything in the database but keeps track of information like how many clusters are formed, how many entities are examined, and how many MSID to MBID associations are made. A dry run will only create temporary tables to store information if required. During a dry run the script keeps this information in variables, which are logged after the script finishes.

Part 2:

Creating clusters for artist_credits and releases where we only have recording MBID in the recording_json.data

I’ll explain the process for the artist_credit table which will be applicable for release table also.

For creating these clusters we will also take the help of the MusicBrainz database.
We will use the recording MBIDs to fetch artist_mbids from the MusicBrainz database.
As written in the documentation of the MessyBrainz create_tables.sql file: “Messybrainz artists are artist credits. That is, they could represent more than 1 MusicBrainz id. These are linked in the artist_credit_redirect table.” Since an artist_credit can represent more than one MBID, an artist_credit_cluster will also represent multiple MBIDs.

Creating cache tables:

To access the required information from the MusicBrainz database we should create two new tables, which will store:

artist MBIDs corresponding to a recording MBID.
release MBID corresponding to a recording MBID.

By creating these tables we won’t have to query the MusicBrainz database for MBIDs every time. If for some reason the cache tables don’t have artist MBIDs for some recording MBID, we will query the MusicBrainz database and insert that information into the cache tables. I use the word cache in the sense that we first try to get the required information from these tables and query the MusicBrainz database only if it is missing. The data in these tables is permanent.
Before proceeding further, I’ll create scripts to fill up these tables. These scripts can be executed periodically and will fetch artist MBIDs and release MBIDs (where possible, as a recording can be present in multiple releases) from the MusicBrainz database for the recordings which only contain recording MBIDs. We can use the ‘submitted’ field in the recording_json table to run the script only for the recordings submitted after some timestamp.
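The cache-first lookup described above can be sketched like this. The class name is hypothetical, the dictionary stands in for the cache table, and `fetch_from_musicbrainz` is an assumed callable that wraps whatever query is eventually run against the MusicBrainz database:

```python
from typing import Callable, Dict, List


class ArtistMbidCache:
    """Maps a recording MBID to its artist MBIDs, querying MusicBrainz
    only when the cache table has no entry yet."""

    def __init__(self, fetch_from_musicbrainz: Callable[[str], List[str]]):
        self._fetch = fetch_from_musicbrainz
        self._table: Dict[str, List[str]] = {}  # stand-in for the cache table
        self.misses = 0  # how often MusicBrainz had to be queried

    def get(self, recording_mbid: str) -> List[str]:
        if recording_mbid not in self._table:
            # Cache miss: query MusicBrainz once and store the result
            # permanently, as the proposal describes.
            self.misses += 1
            self._table[recording_mbid] = self._fetch(recording_mbid)
        return self._table[recording_mbid]
```

The same shape would work for the release cache by swapping the fetch function.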

Algorithm to create clusters for artist_credit:

Let me explain the algorithm to create a cluster using a single recording_mbid.

For a recording_mbid fetch cluster_id from the recording_redirect table.
Using this cluster_id, find all the recording MSIDs that this recording_mbid represents, and for these recording MSIDs find all the artist MSIDs they correspond to in the recording table.
Put all these artist MSID values in the same cluster, as all these artists belong to the same recording and should be the same.
Fetch artist MBIDs for this recording_mbid from MusicBrainz database. This can have multiple MBIDs.
For each of the MBIDs we got from MusicBrainz database, we put them in the artist_credit_redirect table with the same cluster_id value.
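A sketch of these five steps for one recording_mbid. Plain dictionaries stand in for the recording_redirect, recording_cluster and recording tables, and the function and parameter names are hypothetical:

```python
def cluster_artists_for_recording(recording_mbid, recording_redirect,
                                  recording_cluster, recording_artist,
                                  fetch_artist_mbids, cluster_id):
    """recording_redirect: recording MBID -> recording cluster_id
    recording_cluster:  recording cluster_id -> list of recording MSIDs
    recording_artist:   recording MSID -> artist MSID (recording.artist)
    fetch_artist_mbids: callable returning artist MBIDs for a recording MBID
    cluster_id:         id to use for the new artist_credit cluster

    Returns rows for artist_credit_cluster and artist_credit_redirect.
    """
    # Step 1: cluster of the recording MBID.
    rec_cluster_id = recording_redirect[recording_mbid]
    # Step 2: recording MSIDs in that cluster, then their artist MSIDs.
    recording_msids = recording_cluster[rec_cluster_id]
    artist_msids = sorted({recording_artist[msid] for msid in recording_msids})
    # Step 3: all those artist MSIDs go into one artist_credit cluster.
    artist_credit_cluster = [(cluster_id, msid) for msid in artist_msids]
    # Step 4: artist MBIDs for the recording (possibly several).
    artist_mbids = fetch_artist_mbids(recording_mbid)
    # Step 5: one redirect row per MBID, all sharing the cluster_id.
    artist_credit_redirect = [(cluster_id, mbid) for mbid in artist_mbids]
    return artist_credit_cluster, artist_credit_redirect
```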

Example:

Let’s create a cluster for artists using "recording_mbid":"0b2432c3-9215-4115-a1c8-87ef048bd3df".

Get the list of id for the recording_mbid from the recording_json table.

id	data	data_sha256	meta_sha256
228602	{"artist":"Mohit Chauhan, Viviane Chaix, Tanvi Shah, Suvi Suresh & Shalini","recording_mbid":"0b2432c3-9215-4115-a1c8-87ef048bd3df","release":"Rockstar","title":"Hawaa Hawaa"}	2248cd19ba67bb99ab82bcabb8ef0806916aca0e5df069121c661f37f053d24c	791f18d4c22c67d3b3ed2b4c9bfa01a20220511e4cf26da542313b17bba1eb08
9594	{"artist":"Mohit Chauhan, Viviane Chaix, Tanvi Shah, Suvi Suresh & Shalini","recording_mbid":"0b2432c3-9215-4115-a1c8-87ef048bd3df","title":"Hawaa Hawaa"}	c14c9b3308fabd40878b43d2b6864a4e7934474e29a9b392d8dc086041f9e692	791f18d4c22c67d3b3ed2b4c9bfa01a20220511e4cf26da542313b17bba1eb08

Now using these IDs fetch the artist_msids from the recording table.

select distinct artist from recording where data=228602 or data=9594;

artist
afa7da69-ba11-4cc3-9193-3b67903f72b5

Now put all these artist MSID values in the same cluster.

cluster_id	artist_credit_gid	updated
1	afa7da69-ba11-4cc3-9193-3b67903f72b5	2018-03-05 20:06:00.425561+05:30

Fetch artist MBIDs for the "recording_mbid":"0b2432c3-9215-4115-a1c8-87ef048bd3df" from MusicBrainz database.

artist_MBIDs
1dd28f27-4ab3-4a3f-8174-4ccd571a9dce
e58e9ad6-66be-4ec2-b1d1-9f7f6def9711
fc8ee5d5-f03a-4e7e-97c5-624ee35c9894
10300673-e9b8-40ba-a7aa-5954238bb3e6
edd8f606-b78e-4410-baff-eacf17f169cc

For each of the MBIDs we got from MusicBrainz database, we put them in the artist_credit_redirect table with the same cluster_id value.

artist_credit_cluster_id	artist_mbid
1	1dd28f27-4ab3-4a3f-8174-4ccd571a9dce
1	e58e9ad6-66be-4ec2-b1d1-9f7f6def9711
1	fc8ee5d5-f03a-4e7e-97c5-624ee35c9894
1	10300673-e9b8-40ba-a7aa-5954238bb3e6
1	edd8f606-b78e-4410-baff-eacf17f169cc

Using this approach we will get clusters of artist_credits wherever possible. We will use the same approach to get clusters for releases.
When using the same approach to create clusters for releases, we will also use the release field in the JSON data. We will fetch the list of releases (with their release MBIDs) on which the recording has appeared. Among these releases, we look for one whose name matches the release field in the JSON data. If we find a match, the release MSID is mapped to the release MBID with that release name. If the release name alone cannot disambiguate the release, we will have to rely on additional information. We don’t cluster releases which have different names.
Even when MusicBrainz lists only one release for the recording, we can’t simply cluster the releases, because the release named in the JSON data may be new and not yet present in the MusicBrainz database. Clustering such releases would make the data in our database incorrect, which is something we don’t want.
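The release matching rule can be sketched as below. This assumes an exact, case-insensitive name comparison; the function name is hypothetical, and a fuzzier comparison could be substituted later:

```python
def match_release(submitted_name, candidates):
    """candidates: list of (release_mbid, release_name) pairs that
    MusicBrainz lists for the recording.

    Returns the release MBID only when exactly one candidate has the
    submitted name. Returns None when nothing matches (the release may
    be missing from MusicBrainz) or when several releases share the name
    (more information is needed to disambiguate).
    """
    if not submitted_name:
        return None
    wanted = submitted_name.casefold().strip()
    matches = [mbid for mbid, name in candidates
               if name.casefold().strip() == wanted]
    return matches[0] if len(matches) == 1 else None
```

Note that a single candidate is not accepted on its own: its name still has to match the submitted release name.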
For listens that don’t contain MBIDs we simply create a cluster for each recording, artist, and release, put them in a one-to-one relationship in the respective cluster tables, and don’t put these cluster_ids in the redirect tables.
Here are some stats based on the data available in the data dump.
We have 2200662 distinct recording_mbids in the data dump so applying the above approach we will be able to associate MBIDs to a significant number of listens.
We have 7512343 recordings in recording_json which contain all three fields i.e. artist, title and release so we will also be able to match most of the release MBIDs that we get by the above approach.
To get the exact stats we will have to first execute this approach on the data present in the data dump.

Implementation details:

After creating the scripts, manage.py will provide commands with options to create clusters for an entity (recording, artist, or release), to delete any formed clusters, and to do a dry run which will not manipulate anything in the database but will keep track of information like how many clusters are formed, how many entities are examined, and how many MSID to MBID associations are made. A dry run will only create temporary tables to store information if required.
During a dry run the script keeps this information in variables, which are logged after the script finishes.
Tests to verify the correctness of these scripts will also be written.

Part 3:

Cluster newly submitted data into appropriate clusters.

When new listens are inserted into the database we should place them into a cluster whenever possible. MusicBrainz data will be accessed in two steps: first the cache tables created in part 2 are queried, and only if the MBIDs are not found there do we query the MusicBrainz database.

Here is the pseudo code for clustering newly inserted recording:

Input to this algorithm is recording JSON.
First, we check whether the same recording already exists in the database by using sha256, as done here.

If we find that the recording is in the database already:
    We must have clustered and associated the recording MBID to this recording if it was possible. So, we don't work on it again.
Else: 
    Insert this recording to our database and assign it a new MSID.
    If this recording contains recording MBID:
        If this recording MBID is present in the recording_redirect table:
            Add this new recording MSID to the cluster represented by the recording MBID.
        Else:
            We create a new cluster and associate this cluster with the MBID present in the recording.
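The pseudo code above could look roughly like this in Python. The dictionaries stand in for the database tables, the class name is hypothetical, and hashing the sorted JSON is only an assumption about how the sha256 lookup works:

```python
import hashlib
import json
import uuid


class RecordingStore:
    def __init__(self):
        self.msid_by_sha = {}      # data_sha256 -> recording MSID
        self.cluster_of_mbid = {}  # recording_redirect: MBID -> cluster_id
        self.clusters = {}         # recording_cluster: cluster_id -> [MSID]

    def insert(self, recording: dict) -> str:
        payload = json.dumps(recording, sort_keys=True).encode("utf-8")
        sha = hashlib.sha256(payload).hexdigest()
        if sha in self.msid_by_sha:
            # Already in the database, so it was clustered when first seen.
            return self.msid_by_sha[sha]

        msid = str(uuid.uuid4())  # new MSID for the new recording
        self.msid_by_sha[sha] = msid

        mbid = recording.get("recording_mbid")
        if mbid:
            cluster_id = self.cluster_of_mbid.get(mbid)
            if cluster_id is None:
                # No cluster for this MBID yet: create one and link it.
                cluster_id = len(self.clusters) + 1
                self.cluster_of_mbid[mbid] = cluster_id
                self.clusters[cluster_id] = []
            self.clusters[cluster_id].append(msid)
        return msid
```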

As we know, an artist_credit can be associated with more than one MBID, so the approach we use for recordings needs some modifications.

For clustering artists here is the pseudo code:

Input to this algorithm is recording JSON
First, we check whether the same artist already exists in the database by querying the database, as done here.

If we find that the artist is in the database already:
    If the recording contains artist MBIDs:
        If the database already contains a cluster which points to only this list of MBIDs:
            We have already clustered and associated this artist_credit if it was possible. So, we don't work on it again.
        Else:
            Create a new cluster and assign the MSID of the artist_credit to it.
            Assign these MBIDs to this cluster.
    Else:
        If the recording contains recording MBID:
            Fetch artist MBIDs from MusicBrainz database.
            If the database already contains a cluster which points to only this list of MBIDs:
                We have already clustered this artist_credit. So, we don't work on it again.
            Else:
                Create a new cluster and assign the MSID of the artist_credit to it.
                Assign these MBIDs to this cluster.
Else: 
    Insert this artist to our database and assign it a new MSID.
    If the recording contains artist MBIDs:
        If the database already contains a cluster which points to only this list of MBIDs:
            We already have a cluster for this artist_credit. So, add this new MSID to the list of MSIDs in the cluster.
        Else:
            Create a new cluster and assign the MSID of the artist_credit to it.
            Assign these MBIDs to this cluster.
    Else:
        If the recording contains recording MBID:
            Fetch artist MBIDs from MusicBrainz database.
            If the database already contains a cluster which points to only this list of MBIDs:
                We already have a cluster for this artist_credit. So, add this new MSID to the list of MSIDs in the cluster.
            Else:
                Create a new cluster and assign the MSID of the artist_credit to it.
                Assign these MBIDs to this cluster.
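A compact sketch of the artist pseudo code above. It is a simplification rather than a literal translation: the two top-level branches are merged, because adding an MSID to a set is a no-op when the MSID is already a member. The class name, the name-based artist lookup and the `fetch_artist_mbids` callable are assumptions:

```python
import uuid


class ArtistCreditStore:
    def __init__(self, fetch_artist_mbids):
        self.fetch_artist_mbids = fetch_artist_mbids  # recording MBID -> [artist MBID]
        self.msid_by_name = {}      # artist credit text -> artist MSID
        self.cluster_by_mbids = {}  # frozenset of artist MBIDs -> cluster_id
        self.clusters = {}          # cluster_id -> set of artist MSIDs

    def insert(self, recording: dict) -> str:
        name = recording["artist"]
        msid = self.msid_by_name.get(name)
        if msid is None:
            msid = str(uuid.uuid4())  # new artist credit, new MSID
            self.msid_by_name[name] = msid

        # Prefer submitted artist MBIDs, else derive them from the recording MBID.
        mbids = recording.get("artist_mbids")
        if not mbids and recording.get("recording_mbid"):
            mbids = self.fetch_artist_mbids(recording["recording_mbid"])
        if not mbids:
            return msid  # nothing to cluster on

        key = frozenset(mbids)  # "points to only this list of MBIDs"
        cluster_id = self.cluster_by_mbids.get(key)
        if cluster_id is None:
            cluster_id = len(self.clusters) + 1
            self.cluster_by_mbids[key] = cluster_id
            self.clusters[cluster_id] = set()
        self.clusters[cluster_id].add(msid)
        return msid
```

Using a frozenset as the key makes the lookup independent of the order in which the MBIDs were submitted.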

This process is also applicable to releases by just putting a validation check on release name after finding the releases using MusicBrainz database.

Implementation details:

Because a lot of listens are submitted to MessyBrainz, we can’t do the above computation in the web container itself. Instead, newly submitted listens will be sent to a RabbitMQ queue, which is consumed by a separate container that continuously runs a script implementing the above functionality. This will be similar to what influx-writer does in ListenBrainz. While doing so, we can use the utility functions in data_utils.py created during part 1 and part 2.

Part 4:

Create Endpoints for API.

After we have created clusters we also need MessyBrainz API to provide the following functionality:

Get a list of all MSIDs provided a single MSID.
Get a list of all MSIDs provided an MBID.

This information will be used by ListenBrainz to calculate stats based on MBIDs and MSIDs. I’m mostly interested in adding the functionality to MessyBrainz to let ListenBrainz fetch the above information. ListenBrainz does not use MessyBrainz API but uses MessyBrainz directly. For the sake of completeness, I will add API endpoints too.

`GET /msid/?{params=value}`

The value of the parameters will be used to generate the list of all MSIDs that the MSID in the URL is equivalent to.

URL Parameters
Required
id=[UUID]: The value of id is an MSID. All the MSIDs which represent the same entity as the MSID in the URL will be returned.
request_type=[string]: This is the type of request which is made. Can have three types artist, release, and recording.
Optional
mbid=[boolean]: Set this to true if the MBIDs associated with the MSID are also wanted in the response, otherwise set it to false. By default, it is false. No MBIDs are returned if no association is present in the database.
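As an illustration of how these parameters might be validated before the lookup, here is a sketch using only the standard library. The function name and error messages are hypothetical, and the real endpoint would live in the web framework MessyBrainz already uses:

```python
import uuid
from urllib.parse import parse_qs, urlsplit

VALID_TYPES = {"artist", "release", "recording"}


def parse_msid_query(url: str):
    """Return (msid, request_type, include_mbid) or raise ValueError."""
    params = parse_qs(urlsplit(url).query)
    try:
        msid = str(uuid.UUID(params["id"][0]))  # must be a valid UUID
        request_type = params["request_type"][0]
    except (KeyError, ValueError):
        raise ValueError("id must be a UUID and request_type is required") from None
    if request_type not in VALID_TYPES:
        raise ValueError("request_type must be artist, release or recording")
    # mbid is optional and defaults to false.
    include_mbid = params.get("mbid", ["false"])[0].lower() == "true"
    return msid, request_type, include_mbid
```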

Sample Call: Request from curl to get MSIDs for request_type = recording.

$ curl -X GET "https://messybrainz.com/msid/?id=baf4f1b8-0665-4bba-b7fc-b23aa9cf0c95&request_type=recording&mbid=true"

Response Example:

    {
        "count":"1",
        "ids":[
                {   
                    "mbid_count": "1",
                    "msid_count": "2",
                    "mbid": "0b2432c3-9215-4115-a1c8-87ef048bd3df",
                    "msid": ["baf4f1b8-0665-4bba-b7fc-b23aa9cf0c95","dc696544-108f-4217-90da-f2b377b7327e"]
                }
            ]
    }

Sample Call: Request from curl to get MSIDs for request_type = artist.

$ curl -X GET "https://messybrainz.com/msid/?id=afa7da69-ba11-4cc3-9193-3b67903f72b5&request_type=artist&mbid=true"

Response Example:

    {
        "count":"2",
        "ids":[
                {   
                    "mbid_count": "1",
                    "msid_count": "1",
                    "mbid": ["88a8d8a9-7c9b-4f7b-8700-7f0f7a503688"],
                    "msid": ["afa7da69-ba11-4cc3-9193-3b67903f72b5"]
                },
                {
                    "mbid_count": "1",
                    "msid_count": "2",
                    "mbid": ["b49a9595-3576-44bb-8ac0-e26d3f5b42ff"],
                    "msid": ["afa7da69-ba11-4cc3-9193-3b67903f72b5", "93ec2d51-983d-4ce4-85c9-1380d63d86c0"]
                }
            ]
    }

This is a typical case in which a single artist_credit corresponds to multiple artist MBIDs; MSID afa7da69-ba11-4cc3-9193-3b67903f72b5 is such an MSID. This can happen when artists have the same name, for example James Morrison (UK singer) and James Morrison (Australian jazz musician). In this case, the MSIDs representing these MBIDs will be the same and we may get a response like the one above.

`GET /mbid/?{params=value}`

The value of the parameters will be used to generate the list of all MSIDs that the MBID in the URL is equivalent to.

URL Parameters
Required
id=[UUID]: The value of id is MBID. All the MSIDs which represent this MBID in the URL will be returned.
request_type=[string]: This is the type of request which is made. Can have three types artist, release, and recording.

Sample Call: Request from curl to get MSIDs for MBID.

$ curl -X GET "https://messybrainz.com/mbid/?id=93ec2d51-983d-4ce4-85c9-1380d63d86c0&request_type=recording"

Response Example:

    {   
        "msid_count": "2",
        "msids": ["baf4f1b8-0665-4bba-b7fc-b23aa9cf0c95", "dc696544-108f-4217-90da-f2b377b7327e"]
    }

Sample Call: Request from curl to get MSIDs for MBID.

$ curl -X GET "https://messybrainz.com/mbid/?id=afa7da69-ba11-4cc3-9193-3b67903f72b5&request_type=artist"

Response Example:

    {   
        "msid_count": "3",
        "msids": ["93ec2d51-983d-4ce4-85c9-1380d63d86c0", "61746abb-76a5-465d-aee7-c4c42d61b7c4", "0b2432c3-9215-4115-a1c8-87ef048bd3df"]
    }

If we query MSIDs for some MBID we will always get only one cluster with a list of MSIDs which represent the same MBID.

Implementation details:

To generate the response for a GET request using an MSID, we take the following steps:

Using this messybrainz_id we get the cluster_id from the recording_cluster table.
From that cluster_id, we get all the ids that are inside this cluster from the recording_cluster table.

For MBIDs we get the cluster_id from the *_redirect table and fetch the MSIDs from the *_cluster table.
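Both lookups can be sketched with two small queries. The example uses an in-memory SQLite database purely as a stand-in for the MessyBrainz Postgres database; the column names follow the tables shown earlier, while the function names and the sample ids are placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE recording_cluster (cluster_id INTEGER, recording_gid TEXT);
    CREATE TABLE recording_redirect (recording_cluster_id INTEGER, recording_mbid TEXT);
    INSERT INTO recording_cluster VALUES (1, 'msid-a'), (1, 'msid-b'), (2, 'msid-c');
    INSERT INTO recording_redirect VALUES (1, 'mbid-x');
""")


def msids_for_msid(conn, msid):
    """All MSIDs in the same cluster as the given MSID."""
    rows = conn.execute(
        """SELECT other.recording_gid
             FROM recording_cluster AS mine
             JOIN recording_cluster AS other
               ON other.cluster_id = mine.cluster_id
            WHERE mine.recording_gid = ?
            ORDER BY other.recording_gid""",
        (msid,),
    ).fetchall()
    return [row[0] for row in rows]


def msids_for_mbid(conn, mbid):
    """All MSIDs in the cluster that the given MBID redirects to."""
    rows = conn.execute(
        """SELECT c.recording_gid
             FROM recording_redirect AS r
             JOIN recording_cluster AS c
               ON c.cluster_id = r.recording_cluster_id
            WHERE r.recording_mbid = ?
            ORDER BY c.recording_gid""",
        (mbid,),
    ).fetchall()
    return [row[0] for row in rows]
```

The artist and release variants would differ only in the table and column names.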

Most of the code will be written in data.py which will include functions for adding this functionality and api.py will contain endpoints for clients to access this functionality.

Here is a link to my gist for additional ideas, timeline, about me and Q & A.

kartikeyaSh · March 17, 2018, 7:54pm

This is the initial draft of my proposal. Please provide reviews and feedback.

rob · March 19, 2018, 10:46am

I’ve just skimmed the proposal, I’ll have a more detailed read later. Some comments:

Did you download the messybrainz data to look at it? ftp://ftp.eu.metabrainz.org/pub/musicbrainz/messybrainz/ If not, you should and then update the data examples in your proposal to be more representative of the actual data. (e.g. MBIDs are always UUIDs)
That flowchart is pretty, but you could convey the flow more simply in text because a lot of steps are the same for the different entities. Tell us what you plan to do for one entity and we can imagine the rest.
On “creating clusters for recordings”: why do you need to assign a new UUID here? Please know that in our universe UUIDs are created for referring to our data from an external database. In this case, the outside world never needs to reference our clusters, so we can use a Postgres SERIAL column for identifying the clusters, which is already in the schema as cluster_id.
I’d like you to think about phases of applying your scripts. Right now there are no clusters in the database, so we need to have a script that can look at the existing database and find and make new clusters. We should be able to run this script at any time and have it “tidy up”. Also, the MessyBrainz ingestion code needs to start forming clusters automatically as things get imported, obviating the need for running the tidy up script.

kartikeyaSh · March 23, 2018, 9:22pm

1: Created examples using data present in the data dump.

2,4: Modified part 3 to include algorithms to cluster and associate MBIDs to incoming listens.

3: Used SERIAL column for identifying the clusters.

Looking forward to more reviews.

rob · March 24, 2018, 6:38pm

Your proposal has improved quite a lot – it looks like you’re on the right track. I’ve spent some time thinking about what I would like to see happen for MessyBrainz and I now have a better idea. Hopefully I can share my ideas with you tomorrow morning, giving you some time to revise your proposal one more time.

Stay tuned!

rob · March 25, 2018, 4:42pm

Hi!

I was hoping to get some feedback from @alastairp and @iliekcomputers, but sadly that didn’t happen. But, here are my unchecked ideas about how I would proceed with MessyBrainz if I were to spend my summer on it:

https://docs.google.com/document/d/1uiye3smNrMoCq0_NjlRuT3mfGgNYo7OHXFwvEh6pdps/edit?usp=sharing

Perhaps some of these ideas resonate with you so you can incorporate them into your proposal.

alastairp · March 26, 2018, 4:34pm

This is a neat proposal, and understands some of the messy things about dealing with this kind of data. Some small questions which I’d like to know about:

What if a submission has the wrong MBID in it? That is, what if the text of the artist credit or title is very different to all other items in the cluster? Do you think that it’d be a good idea to check this too?
As many recording MBIDs can refer to the same recording, will you also extend this matching process using mbids from the recording_gid_redirect table in MusicBrainz?

Remember that recordings can appear on many releases. Do you have a suggestion about how you might deal with this, and how it differs from the artist case? How will you deal with the artist case if a recording appears on 2 releases and the artist on each of these releases is different? Will you still create the credit?

From rob’s document:

This is a pretty neat extension to the work that you’ve already proposed. I think it’s a good idea to make a somewhat generic interface to allow us to add more cluster generation algorithms as we collect more data

Also a very safe idea!

alastairp · March 26, 2018, 4:38pm

One more thing - How many spotify ids are in the MessyBrainz data? This could be another easy win to create clusters

kartikeyaSh · March 26, 2018, 4:45pm

Only 89,670 in the available data dump. I thought about it, but I’m not sure that’s a big enough number given that we have 9,335,675 recordings in the database.

kartikeyaSh · March 26, 2018, 5:33pm

Yes, we can add a check for that. I am thinking about a check based on edit distance (see Edit distance on Wikipedia), where we set some threshold and only say that a recording may have wrong MBIDs if that threshold is crossed, because it is quite easy for manually tagged files to have a few errors. Maybe I can validate on the artist and title fields in the JSON and on the other MBIDs. But validating over releases seems risky because of recordings like the ones in this example: eg_diff_releases.md (GitHub). Sorry for the unformatted output; just have a look at the release values for all the recordings.
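A minimal sketch of such a threshold check, assuming a plain Levenshtein distance normalised by the longer string. The function names, the choice of the `artist` and `title` fields, and the 0.5 threshold are illustrative assumptions, not values taken from the proposal:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1], ignoring case and outer whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def may_have_wrong_mbid(submitted: dict, reference: dict,
                        threshold: float = 0.5) -> bool:
    """Flag a submission whose artist or title differs too much from the
    reference text of the cluster it would join (threshold is an assumption)."""
    return any(
        normalized_distance(submitted.get(field, ""), reference.get(field, "")) > threshold
        for field in ("artist", "title")
    )
```

Small tagging differences such as a missing apostrophe stay well below the threshold, while a completely different title crosses it. The threshold itself would need tuning against real MessyBrainz data.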

kartikeyaSh · March 26, 2018, 5:37pm

I’ve written about it in the next paragraph

kartikeyaSh:

While using the same approach for creating clusters for releases we will also take help of the release field in the JSON data. We will fetch the list of releases and release MBIDs in which the recording has been released. Out of these releases, we will match the name of the release that the release field contains in the JSON data. If we find a match then this release MSID will be mapped to the release MBID that has the same release name. In case of release name not being able to disambiguate the release, we will have to rely on additional information which can be used to disambiguate this. We don’t cluster releases which have different names.

Even when we find a unique release in the MusicBrainz database, we can’t just cluster the releases, because the release listed in the JSON data may be new and not yet present in the MusicBrainz database. By clustering such releases we would make the data in our database incorrect, which is something we don’t want.
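The name-matching rule described above could be sketched roughly as follows. This is only an illustration: the function name, the `(release_mbid, release_name)` candidate format, and the case-insensitive exact comparison are assumptions, and the candidate list is expected to come from a MusicBrainz lookup on the recording MBID.

```python
from typing import List, Optional, Tuple


def match_release_mbid(submitted_release: str,
                       candidates: List[Tuple[str, str]]) -> Optional[str]:
    """Return the release MBID whose name equals the submitted release name.

    `candidates` holds (release_mbid, release_name) pairs for every release
    the recording appears on. None is returned when the submission has no
    release name, when no candidate matches (the release may not be in
    MusicBrainz yet), or when several distinct releases share the name.
    """
    wanted = submitted_release.strip().casefold()
    if not wanted:
        return None
    matches = {mbid for mbid, name in candidates
               if name.strip().casefold() == wanted}
    if len(matches) == 1:
        return next(iter(matches))
    return None  # no match or ambiguous: do not cluster
```

Returning `None` in the ambiguous case keeps the release MSID unclustered until additional information can disambiguate it.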

kartikeyaSh · March 26, 2018, 5:54pm

I’ll try to write about it in a much cleaner way and stress the point that I’ll create infrastructure for that. I’ll try to elaborate on it. I’ve written the proposal with the mindset that I’ll devise some algorithm and create the infrastructure needed to execute it. This infrastructure can then also be used for other matching algorithms.

kartikeyaSh:

How to access MusicBrainz database:
As done in CritiqueBrainz I’ll use the Docker image which the MusicBrainz project is using. I’ll use mbdata to access the MusicBrainz data for my purposes.

Creating cache tables:
To hold the required information from the MusicBrainz database, we should create two new tables that store these two mappings:

artist MBIDs corresponding to a recording MBID.
release MBID corresponding to a recording MBID.

By creating these tables we won’t have to query the MusicBrainz database for MBIDs every time. If for some reason we don’t have artist MBIDs for some recording MBIDs in our cache tables, we will query the MusicBrainz database and insert that information into the cache tables. I’ve used the word cache in the sense that we will first try to get the required information from these tables, and only if it is not present there will we query the MusicBrainz database. The data in these tables is permanent.
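A rough sketch of this lookup for the artist table, using an in-memory SQLite database so that it runs on its own. The table name `recording_artist`, its columns, and the `fetch_from_musicbrainz` callback are hypothetical stand-ins; the real tables would live in the MessyBrainz PostgreSQL database and the callback would query MusicBrainz.

```python
import sqlite3
from typing import Callable, List


def create_cache_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) recording -> artist MBID cache table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS recording_artist (
               recording_mbid TEXT NOT NULL,
               artist_mbid    TEXT NOT NULL,
               PRIMARY KEY (recording_mbid, artist_mbid))"""
    )


def get_artist_mbids(conn: sqlite3.Connection, recording_mbid: str,
                     fetch_from_musicbrainz: Callable[[str], List[str]]) -> List[str]:
    """Read artist MBIDs from the cache table, falling back to MusicBrainz."""
    rows = conn.execute(
        "SELECT artist_mbid FROM recording_artist "
        "WHERE recording_mbid = ? ORDER BY artist_mbid",
        (recording_mbid,),
    ).fetchall()
    if rows:
        return [row[0] for row in rows]

    # Cache miss: ask MusicBrainz once and store the answer permanently.
    artist_mbids = sorted(set(fetch_from_musicbrainz(recording_mbid)))
    conn.executemany(
        "INSERT OR IGNORE INTO recording_artist VALUES (?, ?)",
        [(recording_mbid, artist_mbid) for artist_mbid in artist_mbids],
    )
    conn.commit()
    return artist_mbids
```

The release table would follow the same pattern, with the caveat that a recording can appear on several releases.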

Before proceeding further, I’ll create scripts to fill up these tables. These scripts can be executed periodically and will fetch artist MBIDs and release MBIDs (if possible, as recordings can be present in multiple releases) from the MusicBrainz database for the recordings which only contain recording MBIDs. We can use the ‘submitted’ field in the recording_json table to run these scripts only for the recordings submitted after some timestamp.

I’m working on elaborating this.

dns_server · March 27, 2018, 1:13pm

One thing you need to take care of is multiple artists with the same name.
There is a Shapeshifters in the UK and a Shapeshifters in NZ, for example.
For some common names there can be 10 or 20 with the same name.

There are also some common track names such as “intro” that are common to lots of releases.

kartikeyaSh · March 27, 2018, 2:47pm

I’ve added a section to describe how I plan to work on the project.

And another section for additional ideas.

kartikeyaSh · March 27, 2018, 2:51pm

Agreed, I’ve taken care of that.