GSoC 2021: Complete Rust binding for the MusicBrainz API


Personal Information

Proposal

Project: Complete Rust binding for the MusicBrainz API

Possible Mentors: oknozor, yvanzo, bitmap

Abstract

MusicBrainz has client libraries for various programming languages such as C++, Python, Java, Go and more. However, we're still missing a proper client library for the Rust programming language. There have previously been attempts at creating one, such as the musicbrainz_rust (not maintained at the moment) and musicbrainz_rs projects, but neither wraps the entire MusicBrainz Web-API. As part of GSoC'21, I propose to implement further functionality in musicbrainz_rs, a Rust client library which already has proper automated tests and careful documentation with examples. However, it still does not cover some important Web-API functionality, namely the Search feature and the CoverArt endpoint. There are also further possible improvements, such as gracefully handling rate-limiting from the MusicBrainz servers by auto-retrying failed queries in the library.

Implementation

CoverArt Endpoint

Network requests to query the MusicBrainz database are made against the endpoint https://musicbrainz.org/ws/2/{entity}/{mbid}. However, fetching the coverart uses an entirely new endpoint, which looks like https://coverartarchive.org/{entity}/{mbid}. We'll have to take this into account and make a request to the appropriate endpoint depending upon the context.

We currently have constants defined for dealing with the main database endpoint. Similarly, we’ll have to define new constants to deal with the CoverArt endpoint:

pub(crate) const BASE_COVERART_URL: &str = "http://coverartarchive.org";

The CoverArt docs showcase multiple ways to fetch the coverart for releases and release-groups. The endpoint accepts both GET and HEAD network requests. Our main focus will be on working with GET requests.

The basic way to access coverart information is to make a GET request with at least the entity type and the MBID. For example, fetching the release http://coverartarchive.org/release/76df3287-6cda-33eb-8e9a-044b5e15ffdd might look something like this in our implementation:

let in_utero_coverart = Release::fetch_coverart()
    .id("18d4e9b4-9247-4b44-914a-8ddec3502103")
    .execute()
    .expect("Unable to get coverart"); 

// Fetch 1200px cover images for every available type
for image in &in_utero_coverart.images {
    let thumbnail_1200_url = image
        .thumbnails
        .res_1200
        .as_ref();
    println!("{}", thumbnail_1200_url.unwrap());
}

We'll implement a new FetchCoverart trait with a fetch_coverart method returning a new Coverart struct, similar to how the fetch trait is currently implemented to query information from the main MusicBrainz database. This trait will be common to all the entities exposing coverart information, which presently are release and release-group as supported by MusicBrainz:

pub struct Coverart {
    pub images: Vec<CoverartImage>,
}

pub struct CoverartImage {
    pub approved: bool,
    pub back: bool,
    pub comment: String,
    pub edit: u64,
    pub front: bool,
    pub id: u64,
    pub image: String,
    pub thumbnails: Thumbnail,
    pub types: Vec<ImageType>,
}

pub struct Thumbnail {
    ...
}

pub enum ImageType {
    ...
}

impl<'a, T> FetchCoverartQuery<T>
where
    T: Clone + FetchCoverart<'a>,
{
    pub fn id(&mut self, id: &str) -> &mut Self {
        self.0.path.push_str(&format!("/{}", id));
        self
    }

    pub fn execute(&mut self) -> Result<Coverart, Error> {
        HTTP_CLIENT.get(&self.0.path).send()?.json()
    }
}

pub trait FetchCoverart<'a> {
    fn fetch_coverart() -> FetchCoverartQuery<Self>
    where
        Self: Sized + Path<'a>,
    {
        FetchCoverartQuery(Query {
            path: format!("{}/{}", BASE_COVERART_URL, Self::path()),
            phantom: PhantomData,
            include: vec![],
        })
    }
}

Another interesting way to access the coverart would be to add a relevant method on the entity itself, so we can access the coverart by calling this method after we've queried an entity from the MusicBrainz database:

let in_utero = Release::fetch()
    .id("18d4e9b4-9247-4b44-914a-8ddec3502103")
    .execute()
    .expect("Unable to get release"); 

let in_utero_coverart = in_utero.get_coverart()
    .execute()
    .expect("Unable to get cover art");

The CoverArt docs also mention making queries that access only some specific information from the response. For example, one can query http://coverartarchive.org/release/76df3287-6cda-33eb-8e9a-044b5e15ffdd/front-1200 if all they need is the 1200px front cover image and no other information. To make such queries in Rust, we'll implement new traits (common between the different coverart entities) and make use of the builder pattern as we've done so far, to have something like this:

let thumbnail_1200_url = Release::fetch_coverart()
    .id("18d4e9b4-9247-4b44-914a-8ddec3502103")
    .front()     // front/back
    .res_1200()  // 250/500/1200
    .execute()
    .expect("Unable to get coverart");

I've also submitted a pull request which partially implements some of the ideas listed above, which can be found here. Many of the ideas presented here are also based on my discussion with @oknozor there.

Search Endpoint

The MusicBrainz database supports making search queries based on various types; the full list of currently supported types is:

Artist
Release Group
Release
Recording
Work
Label
Area
Place
Annotation
CD Stub
Editor
Tag
Instrument
Series
Event
Documentation

Of the entire list, musicbrainz_rs currently only supports making search queries on the Artist type. Quoting the relevant example from the crate docs, this is how it works in practice:

// Build the request query
let query = Artist::query_builder()
    .name("Miles Davis")
    .and()
    .country("US")
    .build();

// Execute the network request
let query_result = Artist::search(query).execute()?;

// Extract all matching artist names from the response items
let query_result: Vec<String> = query_result
    .entities
    .iter()
    .map(|artist| artist.name.clone())
    .collect();

println!("{:?}", query_result);
// ["Miles Davis", "Miles Davis Quintet", "Miles Davis and His Orchestra"]

I plan to implement a similar way to make search queries on the other entities. The query_builder trait will be derived for all other entities, such as release, label, and more. This will allow us to generalize query_builder to pick up the attributes of the concerned entity automatically, without having to explicitly define similar behavior for every entity individually.
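The mechanics behind such a generated builder can be sketched with plain std Rust. This hand-rolled version is only an illustration of the idea (the real derive in musicbrainz_rs generates a typed field method per entity attribute): the builder simply assembles a Lucene-style query string from field/value pairs.

```rust
// Illustrative sketch only: the real query_builder is generated per
// entity; this shows the string-assembly mechanics behind it.
#[derive(Default)]
struct SearchQueryBuilder {
    parts: Vec<String>,
}

impl SearchQueryBuilder {
    // Add a `field:"value"` clause to the query.
    fn field(mut self, name: &str, value: &str) -> Self {
        self.parts.push(format!("{}:\"{}\"", name, value));
        self
    }

    // Join the previous and next clauses with the Lucene AND operator.
    fn and(mut self) -> Self {
        self.parts.push("AND".to_string());
        self
    }

    // Produce the final query string to send to the search endpoint.
    fn build(self) -> String {
        self.parts.join(" ")
    }
}
```

A derive can generate one such `field`-style method per struct attribute, which is what lets the same pattern extend to every entity without hand-written boilerplate.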

However, in the case of artist search, all the JSON response parameters delivered by the search endpoint are already handled by our existing Artist struct.
This is not the case with other entities. For example, when making search queries on the Release type, every Release item in the API response list also carries a ReleaseGroup sub-entity.

In more specific detail, if we make a query to https://musicbrainz.org/ws/2/release/?query=release:Tonight,%20Tomorrow, we get the release-group sub-entity as:

<release-list count="4792" offset="0">
  <release id="c0241bad-b8b8-4176-b941-349a4a3dbc94" ns2:score="100">
    ...
    <release-group id="1c2d144b-c92b-46a9-b5e2-55f0e21345a2" type="Single" type-id="d6038452-8ee0-3f68-affc-2de9a1ede0b9">
      <title>Tonight, Tomorrow</title>
      <primary-type id="d6038452-8ee0-3f68-affc-2de9a1ede0b9">Single</primary-type>
    </release-group>
    ...
  </release>
  ...
</release-list>

The problem with the release-group sub-entity here is that it doesn't contain all the attributes expected by our existing ReleaseGroup struct (specifically, first-release-date is missing in this case). By comparison, when we make a dedicated lookup call to https://musicbrainz.org/ws/2/release-group/1c2d144b-c92b-46a9-b5e2-55f0e21345a2, we get this:

<release-group id="1c2d144b-c92b-46a9-b5e2-55f0e21345a2" type="Single" type-id="d6038452-8ee0-3f68-affc-2de9a1ede0b9">
  <title>Tonight, Tomorrow</title>
  <first-release-date>2018-08-10</first-release-date>
  <primary-type id="d6038452-8ee0-3f68-affc-2de9a1ede0b9">Single</primary-type>
</release-group>

One way to work around this would be to convert such missing attributes to Option<T>, making them optional, but this would be a backwards-incompatible change and may confuse our users, since other entities won't have these similar attributes wrapped in Option<T>, so we won't be going with this approach.

Another way, which was discussed here and seems the better approach, is to use an associated type:

pub trait Search<'a> {
    type SearchResult;

    fn search(query: String) -> SearchQuery<Self::SearchResult> {
        ...
    }
    ...
}

This will allow us to additionally define new structs to hold the response data in cases where the existing ones are missing some attributes.
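A std-only sketch of how this could look for Release: the struct and method names below (ReleaseSearchResult, ReleaseGroupSummary) are hypothetical, and the network call is elided, but it shows how an associated type lets a searchable entity point its search results at a dedicated struct when the search response lacks lookup-only fields such as first-release-date.

```rust
// Hypothetical names throughout; a sketch of the associated-type idea,
// not the crate's actual API.
pub struct ReleaseGroupSummary {
    pub title: String, // note: no first_release_date here
}

pub struct ReleaseSearchResult {
    pub title: String,
    pub release_group: ReleaseGroupSummary,
}

pub struct Release;

pub trait Search {
    // Each searchable entity names the struct its search results
    // deserialize into.
    type SearchResult;
    fn search(query: String) -> Vec<Self::SearchResult>;
}

impl Search for Release {
    type SearchResult = ReleaseSearchResult;

    fn search(_query: String) -> Vec<Self::SearchResult> {
        // The real implementation would perform the network request and
        // deserialize the response into ReleaseSearchResult; elided here.
        Vec::new()
    }
}
```

Entities whose search responses match their lookup structs can simply set `type SearchResult = Self;`, so no duplicate structs are needed in the common case.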

Similarly, I expect to follow the same plan for adding Search support based on all the other entities.

We might also need to make changes to the library's JSON deserialization module to deal with cases where deserialization can be generalized for attributes with different names but similar behavior. It will probably make more sense to consider this during our work on the Search implementation, and see if there are any such closely related fields in the API responses of the various searchable entities.

Implement auto-retries when we’re being rate-limited

Once we implement both the above CoverArt and the Search functionality, I also have further plans to implement auto-retries in the library.

At the moment, the library doesn't handle auto-retries on failed queries. Due to this, our current test-suite also feels a bit hackish: we've been running it in a single thread (even though Rust makes it very easy to run tests in multiple threads), and we've been adding a 1s sleep after every test so that we do not get rate-limited and the test-suite doesn't fail.

Whenever our query is rejected by MusicBrainz due to rate-limiting, we're returned the HTTP error code 503 along with a Retry-After header that tells us how long to wait before making the next query. We'll implement auto-retries based on this header value by sleeping the current thread for the specified duration. Along with making it easier for our library users to handle rate-limiting, this might also allow us to remove the 1s sleep duration from our tests, or at least reduce it, thereby making the test-suite run faster.
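The retry loop itself can be sketched with std only. The QueryError type and all names here are assumptions standing in for the crate's actual error type, and the HTTP layer (reading the Retry-After header from the 503 response) is elided:

```rust
use std::thread;
use std::time::Duration;

// Hypothetical error type: RateLimited carries the wait duration parsed
// from the Retry-After header of a 503 response.
pub enum QueryError {
    RateLimited(Duration),
    Other(String),
}

// Retry a query up to `max_retries` times, sleeping for the
// server-specified duration between rate-limited attempts. Any other
// error (or success) is returned immediately.
pub fn with_retries<T, F>(max_retries: u32, mut query: F) -> Result<T, QueryError>
where
    F: FnMut() -> Result<T, QueryError>,
{
    let mut attempts = 0;
    loop {
        match query() {
            Err(QueryError::RateLimited(wait)) if attempts < max_retries => {
                attempts += 1;
                thread::sleep(wait);
            }
            other => return other,
        }
    }
}
```

Sleeping the current thread matches the crate's existing blocking-client design; once the retry budget is exhausted, the last rate-limit error is surfaced to the caller unchanged.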

We may then be able to set the default number of auto-retries for rate-limited queries by calling a library-wide method, something like this:

musicbrainz_rs::config::set_default_retries(2);

This is similar to how musicbrainz_rs currently allows users to set default headers for all future API calls.
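One std-only way such a crate-wide default could be stored is an atomic. This is a sketch under the assumption of the hypothetical set_default_retries API above; it is not the crate's actual config code:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical crate-wide default; 0 means "do not retry".
static DEFAULT_RETRIES: AtomicU32 = AtomicU32::new(0);

// Set the default number of auto-retries for rate-limited queries.
pub fn set_default_retries(n: u32) {
    DEFAULT_RETRIES.store(n, Ordering::Relaxed);
}

// Read the current default, e.g. from inside execute().
pub fn default_retries() -> u32 {
    DEFAULT_RETRIES.load(Ordering::Relaxed)
}
```

An atomic avoids the locking that a `Mutex<Config>` would need and keeps the setting safe to read from multi-threaded test runs.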

Relevant discussion on this issue can also be found here.

Further Plans

If time permits, I also plan to work towards completing existing features in the library, such as the Include parameter, which presently does not cover isrcs and url-rels on the Recording entity, and fetching relationships between various entities.


Timeline

This is the approximate timeline I propose to follow.

Community Bonding Period:

There are currently still parts of the library that are not clear to me, especially how the deserialization of the API response is put together. I'll dig further through the code base during this time to clear my present bottlenecks and see whether the ideas presented in this proposal can be refined further. I also plan to keep working on my work-in-progress CoverArt implementation during this time, to compensate for the time that will be lost to my final semester exams during the first week of the official coding period.

Week 1:

Continue working on the CoverArt implementation, as time allows. We'll have the fetch_coverart trait on Release and ReleaseGroup ready by this time.

Week 2:

Implement the get_coverart trait method on Release and ReleaseGroup, allowing us to fetch the coverart once we've retrieved data from the MusicBrainz database.

Week 3:

Implement the builder pattern allowing us to make specific calls to fetch the coverart.

Week 4:

Finalize the CoverArt implementation with appropriate tests and docs, along with suitable examples. It might also be a good idea to publish a new release at this point, once we're done with everything CoverArt. By now, I should be familiar with my current bottlenecks in the code base. I also plan to put together my thoughts on implementing the Search functionality for the coming weeks.

Week 5:

Set up an associated type to deal with cases where the Search API response for an entity is missing attributes currently expected by the existing entity structs in the library.

Week 6-7:

Implement the Search functionality on all entities supported by the Web-API, along with proper documentation and tests.

Week 8-9:

I’ll work on implementing auto-retries in the library when a query fails due to being rate-limited from MusicBrainz, and also see if this allows us to reduce our test-suite run-times.

Week 10:

A spare week in case anything does not go as planned above. Otherwise, I'll continue working towards the completion of already existing features in the library.

Additional Information

Tell us about the computer(s) you have available for working on your SoC project!

I have an Asus VivoBook running Manjaro with an Intel i5 processor and 8 GB of RAM.

When did you first start programming?

My first experience with programming was with QBASIC in middle school. However, it wasn't until 2016 that I first decided to open-source one of my hobby projects.

What type of music do you listen to? (Please list a series of MBIDs as examples.)

I mainly listen to pop, but melodic dubstep is also nice!

Loote - tomorrow tonight
RUNN - Handle With Care
Julie Bergan & Seeb - Kiss Somebody
Sabai - Million Days ft. Hoang & Claire Ridgely

What aspects of the project you’re applying for (e.g., MusicBrainz, AcousticBrainz, etc.) interest you the most?

I find the entire idea of MusicBrainz, an open music database where anyone can add or change entries, really interesting. If the community keeps improving the project the way it has so far, I think we'll have a really good alternative to closed music databases. And as the community grows, I imagine the MusicBrainz database will have an easier time staying up-to-date with newly released music, due to its open nature, making it an important resource for people who work with music metadata. Proper client libraries for various programming languages make the database easier to access, which I believe is an important milestone towards making the above a reality. I've also always wanted to learn more about Rust, which makes this a perfect opportunity for me.

Have you ever used MusicBrainz to tag your files?

No. Not as of yet.

Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?

Yes. I have previously participated in GSoC’18 under OpenAstronomy and had successfully completed the program (link to archive). I’ve also created spotdl, a tool to download music from YouTube based on metadata from Spotify, which I no longer actively maintain due to time constraints.

Specifically for my experience with Rust, I’ve created piano-rs (a multiplayer piano game), rafy-rs (a library to fetch video metadata from YouTube), and an image cropping tool. Some other small experimental projects I’ve written in Rust can also be found on my GitHub. I’ve also made a few contributions to rustc, raspotify, and musicbrainz_rs.

How much time do you have available, and how would you plan to use it?

I have my final exams during the first week of June (when the GSoC coding period officially starts), so I plan to start working on my project before that to compensate for the work time lost studying for my finals. After that, I'll be done with college and will have no problem committing at least 35 hours every week to the project. I am also willing to make slight adjustments to my work schedule to match that of my mentors, if necessary.


Thanks for the proposal @ritiek, it seems very nice!

We will see how much time you can spare, but I think you will be able to implement search on all entities (weeks 6-9) much faster than you think. This will probably be quite straightforward and repetitive once all the needed traits are in place.

Depending on how much you can do in those last 3 weeks, we could finalize some incomplete features (relations, include parameters, non-core lookup, etc.).

Regarding the deserializer, the main idea behind it is to use a generic struct to parse the API result, while the API JSON output has specific fields for every browsable/searchable entity:

For instance, when browsing a release you'll get the following response:

{
  "release-offset": 0,
  "releases": [],
  "release-count": 0
}

And this when you browse an area:

{
  "area-count": 0,
  "areas": [],
  "area-offset": 0
}

To avoid defining a custom struct for every browse result with the correct JSON field names, we use a generic one paired with the deserializer to get the correct fields.
The JSON format is found by the deserializer according to the associated type T.

#[derive(Debug, Serialize, PartialEq, Clone)]
#[serde(rename_all(deserialize = "kebab-case"))]
pub struct BrowseResult<T> {
    pub count: i32,
    pub offset: i32,
    pub entities: Vec<T>,
}
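The renaming the deserializer performs can be illustrated with a std-only helper. This is only an illustration of the idea, not the crate's actual deserializer code, and the function name is made up:

```rust
// Browse responses name their fields after the entity ("release-count",
// "areas", "area-offset"), while the generic BrowseResult<T> expects the
// plain names "count", "offset" and "entities". This helper maps one to
// the other.
fn generic_field<'a>(key: &'a str, singular: &str, plural: &str) -> &'a str {
    if key == plural {
        // "releases" / "areas" -> "entities"
        "entities"
    } else if let Some(rest) = key.strip_prefix(singular) {
        // "release-count" -> "count", "area-offset" -> "offset"
        rest.trim_start_matches('-')
    } else {
        key
    }
}
```

Note that the plural check must come first, since "releases" also starts with the singular prefix "release".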

I think I will start a demo app to showcase your progress on the crate. If you have ideas regarding this, I am open to suggestions :)

Don't hesitate to @ me on IRC if you have any questions.


@okno Thanks for the feedback!

We will see how much time you can spare, but I think you will be able to implement search on all entities (weeks 6-9) much faster than you think. This will probably be quite straightforward and repetitive once all the needed traits are in place.

Depending on how much you can do in those last 3 weeks, we could finalize some incomplete features (relations, include parameters, non-core lookup, etc.).

I see. I've updated my timeline accordingly to reduce the time spent working on the Search functionality, and have added implementing auto-retries as part of GSoC. I'll work on completing the currently incomplete features as an extended goal for GSoC, if time allows.

Regarding the deserializer, the main idea behind it is to use a generic struct to parse the API result, while the API JSON output has specific fields for every browsable/searchable entity

I think we might then also be able to make changes to the deserializer in this case, to help implement the Search functionality on various entities. I've added a mention of this in the proposal.

I think I will start a demo app to showcase your progress on the crate. If you have ideas regarding this I am open to suggestions

Yep, I think that would be nice. I don’t have any ideas at the moment, but I’ll let you know if I come up with something.

Don’t hesitate to @ me on the IRC if you have any question.

Sure, thanks!
