[GSoC 2026]: GraphQL Server as a MusicBrainz API Alternative

Project Summary

  • Title: GraphQL server as a MusicBrainz API alternative

  • Proposed mentors: @bitmap , @jadedblueeyes

  • Languages/skills: Rust, GraphQL, SQL

  • Estimated Project Length: 350 hours

Expected Outcomes

  • Working GraphQL server in Rust covering a specific subset of entity types

  • Schema design with depth limiting and query cost analysis

  • In-process caching layer

  • Multi-entity lookup in a single query

Extension Objectives

  • Cover additional related entity types beyond the initial subset

  • Add new database indexes or materialised views to support expensive links

  • Integrate with other MetaBrainz projects

Contact Information

Personal Introduction

Hi everyone! I’m Hari, also known online as owlpharoah(op3kay), and I’m a second-year student at IIIT Jabalpur. I’ve been an avid music fan ever since I was a child: I was brought up in a household of artists, and music played a huge role in my early years. I also really enjoy Rust and backend programming, so building a GraphQL server as a MusicBrainz API alternative felt like a dream come true :slight_smile:

Why This Project

The current MusicBrainz XML/JSON API works, but it’s not really optimal. You need a stack of inc parameters to get related data, browsing support is uneven across entity types, and you can’t look up five artists in a single request.

GraphQL is a really good fit here. Clients request exactly the fields and relationships they need, without custom server logic per link type, and multi-entity lookup comes naturally. Query depth and cost can be analysed and limited before execution, rather than after the database has already been hit. And this is something I’d want to spend the summer on.

Prior Work

Before writing this proposal, I built a rough prototype to validate the approach and get a better feel for the real problems. It covers Artist, Release Group, Release, and Recording, focusing on basic field resolution, the relationships between them, and an enforceable depth limit set at the schema level. The stack is async-graphql, sqlx, and Axum, which is what I’d use for the real thing.

Building it showed me a few issues I’ll need to focus on:

N+1 queries

In the prototype, resolving artist_type, gender, and area on an Artist each fires a separate query. That’s fine for a single artist lookup, but would hammer the database when resolving a list. DataLoaders fix this by batching identical lookups within a request, effectively acting as a per-request, schema-level cache.
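The batching idea can be sketched without the real libraries: collect the keys requested during one request, then resolve them all with a single lookup. `MockDb` and `load_areas` below are illustrative names standing in for one sqlx query with `WHERE id = ANY($1)`; this is not async-graphql’s actual `Loader` API.

```rust
use std::collections::HashMap;

// Stand-in for the database: a toy `area` table plus a query counter,
// so the batching effect is observable.
struct MockDb {
    areas: HashMap<i32, String>,
    queries_executed: u32,
}

impl MockDb {
    // Batched lookup: one round trip resolves every requested key,
    // instead of one query per Artist being resolved.
    fn load_areas(&mut self, ids: &[i32]) -> HashMap<i32, String> {
        self.queries_executed += 1; // a single "SELECT ... WHERE id = ANY($1)"
        ids.iter()
            .filter_map(|id| self.areas.get(id).map(|name| (*id, name.clone())))
            .collect()
    }
}
```

Fetching 100 artists’ areas this way still costs one query, which is exactly the property the DataLoader tests in week 4–5 would assert.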

Depth limiting

async-graphql makes direct depth limiting pretty easy, but it doesn’t catch everything: a shallow query can still be expensive. Fields of different types will have to be analysed individually and assigned weights, and an overall query cost limit must be enforced on top of the depth limit.

let schema = Schema::build(Query, EmptyMutation, EmptySubscription)
    .limit_complexity(200)
    .limit_depth(5)
    .data(pool)
    .finish();

#[ComplexObject]
impl Artist {
    // weight list fields by the complexity of their children
    #[graphql(complexity = "10 * child_complexity")]
    async fn releases(&self, ctx: &Context<'_>) -> async_graphql::Result<Vec<Release>> { ... }
}

Proposed Project

Scope

Rather than trying to cover everything, I’ll pick a focused subset of entity types and do them properly. My current plan is:

  • First-priority entities: Artist, Release, Release Group, Recording, Label

  • Core fields:

    • Artist: gid, name, sort_name, comment, type, gender, area

    • Release Group: gid, name, comment, artist credit

    • Release: gid, name, comment, artist credit, release group, status, language, script, country, packaging

    • Recording: gid, name, comment, length, video

    • Label: gid, name, comment, type, area, label_code, begin/end dates

  • Standard relationships between them

  • URL relationships

Schema Design

A sketch of the Artist type:

#[derive(SimpleObject)]
#[graphql(complex)]
pub struct Artist {
    pub gid: Uuid,
    pub name: String,
    pub sort_name: String,
    pub comment: Option<String>,
    #[graphql(skip)]
    pub id: i32, // internal DB id, i.e. not exposed
}

#[ComplexObject]
impl Artist {
    async fn area(&self, ctx: &Context<'_>) -> Result<Option<String>> { ... }
    async fn release_groups(&self, ctx: &Context<'_>) -> async_graphql::Result<Vec<ReleaseGroup>> { ... }
}

A similar sketch for Artist credit would be:

type ArtistCredit {
  artist: Artist!
  name: String!        # credited name as printed on the release
  joinPhrase: String   # e.g. "&", "feat."
}
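To make the joinPhrase semantics concrete, here is a small standalone sketch of how credit parts concatenate into a display name. `CreditPart` and `render_credit` are illustrative names, not part of the proposed schema.

```rust
// Illustrative model of one artist-credit part: the credited name plus
// an optional join phrase leading into the next part.
struct CreditPart {
    name: String,
    join_phrase: Option<String>,
}

// Concatenate parts in order; a missing join phrase contributes nothing.
fn render_credit(parts: &[CreditPart]) -> String {
    parts
        .iter()
        .map(|p| format!("{}{}", p.name, p.join_phrase.as_deref().unwrap_or("")))
        .collect()
}
```

So `[("Daft Punk", " feat. "), ("Pharrell Williams", None)]` renders as "Daft Punk feat. Pharrell Williams".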

Architecture

The server will be written using:

  • async-graphql: schema definition, with solid support for the DataLoader pattern and query depth/complexity limiting.
  • sqlx: handles the actual database connections and queries.
  • axum: the HTTP server framework.

A request flows through the system as follows: axum accepts the HTTP request, async-graphql parses and validates the query (including the depth and complexity checks), resolvers batch related lookups through DataLoaders, and sqlx executes the resulting queries against the MusicBrainz Postgres database.

Caching Strategy

The caching will be in two levels, both in-process to avoid external dependencies or extra server requirements:

  1. DataLoader batching: within a single request, identical lookups are deduplicated automatically.
  2. moka response cache: entity reads are cached in-process with a TTL. The cache runs in the same process, is bounded in size, and evicts the least-used entries, so hot entities stay fast without an external cache server.
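As a sketch of the TTL behaviour the moka layer should provide, here is a toy single-threaded version built only on the standard library. The real layer would use moka’s concurrent, size-bounded cache; `TtlCache` here is purely illustrative.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Toy TTL cache: each entry remembers when it was inserted and is only
// served while younger than the TTL, so stale rows age out.
struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    fn insert(&mut self, key: &str, value: V) {
        self.entries.insert(key.to_string(), (Instant::now(), value));
    }

    fn get(&mut self, key: &str) -> Option<V> {
        // Check freshness first, then either serve or evict the entry.
        let fresh = matches!(self.entries.get(key),
            Some((stored, _)) if stored.elapsed() < self.ttl);
        if fresh {
            self.entries.get(key).map(|(_, v)| v.clone())
        } else {
            self.entries.remove(key); // expired (or never present)
            None
        }
    }
}
```

Cache keys would follow an entity:id scheme so that invalidation and hit/miss metrics can be reported per entity type.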

Timeline

  • Week 1

    • Align with mentors on the final entity subset and schema conventions

    • I’ve already spent time with the MusicBrainz database schema and the existing API source, and built a prototype, so this week would mainly be about locking in decisions.

    • Set up the production project structure and testing scaffolding.

  • Week 2

    • Schema design review with mentors.

    • Field naming, nullability, and relationship directions need to be finalized before I start writing resolvers.

  • Week 3

    • I’ll start implementing the resolvers from the approved schema. These form the foundation of the entire server; after that I’ll move on to DataLoaders and the rest.
  • Weeks 4–5

    • DataLoader implementation for all entity types.

    • This is where N+1 query problems get fixed. I’ll write tests to verify that fetching 100 artists with their release groups produces an actually optimized number of database queries.

  • Week 6

    • Query depth limiting and analyzing complexity with weights.

    • async-graphql has extension points for this. I’ll need to decide on defaults and make them configurable.

    • I’ll also identify any relationship links that are structurally expensive and either add indexes or block them entirely.

  • Week 7

    • Moka caching layer.

    • This covers cache key design, TTL configuration, memory bounds, and making sure nothing serves stale data. Cache hit/miss metrics go in here too.

  • Week 8

    • Multi-entity lookup

    • A single query should be able to return 50 artists by MBID. Mostly about making sure the DataLoader handles list inputs correctly and the schema exposes it cleanly.

  • Week 9

    • Deployment work. Dockerfile, Docker Compose setup, and documentation for running the server locally and against the MusicBrainz database.
  • Week 10

    • Integration and documentation. API documentation, schema reference, and any integration work needed to connect the server to the broader MusicBrainz infrastructure.
  • Weeks 11–12

    • Buffer. Something will take longer than expected; if not, I’ll use the time to extend entity coverage or add materialised views for the most expensive relationship queries. This week takes Murphy’s Law into consideration.

Community Affinities

What music do you listen to?

I was brought up in a house of artists and hence developed a love for music early on. Some tracks that I play on repeat constantly:

What aspects of MusicBrainz interest you most?

The scale of it. Collecting structured metadata for all recorded music, maintained by contributors rather than a big company. I’m all for open-source communities and open data, and the fact that anyone can contribute to help it grow is what excites me the most.

Programming Background

I ran my first ā€˜hello world’ in Python back when I was a freshman in high school. I still remember all the possibilities my mind spun up as soon as the text ā€˜hello world’ popped up on the terminal; that moment was phenomenal and something I’ll never forget. After that I moved on to writing simple CLI games, like a stock-trading game with PnL and other metrics. Another instance I still remember was using Python to generate fake handwritten notes that I had to submit instead of actually writing them by hand (it actually worked :slight_smile: ). I then went exploring programming languages like I was PokĆ©mon hunting, from Java to JavaScript, Go, C, and C++, and I’ve finally ended up at Rust and haven’t looked back since.

During this time I also started exploring web dev and made a few fun projects, like gitmaps, which lets you see how spread across the world map your OSS contributors are. Through web dev I realised the backend is what I really enjoy, and that’s what I’m working on now.

Some other projects I’ve built include:

  • RFstarstarKC - CLI learning tool to learn about RFCs and their implementations using animations and markdown documents.

  • VReWind - an NPM package to scaffold react tailwind projects.

  • lobstorrent - CLI to find blazingly fast torrents for anything.


Practical Requirements

  • Available equipment: My primary laptop runs Arch with Niri (Wayland); my secondary runs Windows 11. Both have 16 GB RAM and 512 GB storage.

  • Time available: My university semester ends by May 5th, and my summer vacation starts from then. From that point I’m free of other commitments and can work roughly 30–35 hours a week on the project.


Hi @op3kay ! Welcome to the forums.

Nice work on getting this far.

Just having a quick go through:

  • I’m not sure there’s a reason not to start with a dataloader implementation - the code for a dataloader isn’t much more complex than a similar one without.
  • I would suggest looking at the current schema both in the database and in the JSON and XML APIs in more depth. Your scope isn’t that detailed or complete! Have a dive around the wiki and the transcluded docs pages too, and perhaps the community musicbrainz crate.
  • I think I half-mentioned this before, but caching can be deferred until performance becomes an issue (ā€œmake it work, make it good, make it fastā€) - and should only be applied to places it improves. This should ideally be backed up with at least basic benchmarks / load tests.
  • You’ll want to have a look through async-graphql’s API and examples regarding caching anyway.
  • There isn’t much mention of testing - what’s your testing strategy? How are you testing things, and when? Do you plan to make demos and examples?
  • ā€œMulti-entity lookupā€ - this is a feature of the graphql library, not something you need to spend a week implementing :slight_smile:
  • You seem to have put docs at the end - you need to be noting down information and reasoning for decisions throughout. You don’t want to get to this point and not remember things, or struggle!

I also had a look through your projects. They look reasonable, although I’d remind you about the MeB LLM policy for the duration of GSoC.
