Query the MusicBrainz API using GraphQL with graphbrainz

brian.beck · November 26, 2016, 9:20pm

Hey all,

I published a project last night you all might find interesting: graphbrainz, a GraphQL server for talking to the MusicBrainz API.

GraphQL is a nifty query language designed for modern APIs: instead of talking to individual resource-oriented endpoints to get the data you want, you just talk to one endpoint and ask it for the exact structure of data you want back.

(Under the hood, this particular implementation is making (potentially multiple) calls to the real MusicBrainz REST API, subject to rate-limiting, although it could just as well be talking directly to the database or a mirror.)

This is neat because you can fetch as deeply nested data as you need in a single query (although very deep queries might take a while, again simply due to rate-limiting). For example, who the members of the Beatles on a particular Apple Records release married, and when.
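To make the "one endpoint, one query" idea concrete, here is a minimal sketch of packaging a nested query as a single POST request. The endpoint URL, the MBID placeholder, and every field name in the query are assumptions for illustration and have not been checked against the real graphbrainz schema.

```python
import json
import urllib.request

# Hypothetical endpoint; point this at wherever a GraphQL server is running.
ENDPOINT = "http://localhost:3000/graphql"

# Illustrative query only: the field names are assumptions, not the
# actual graphbrainz schema.
NESTED_QUERY = """
query ($mbid: String!) {
  release(mbid: $mbid) {
    title
    artists {
      name
      members {
        name
        marriages { spouse date }
      }
    }
  }
}
"""


def build_graphql_request(endpoint, query, variables):
    """Package a query and its variables as a single JSON POST request."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_graphql_request(
    ENDPOINT, NESTED_QUERY, {"mbid": "hypothetical-release-mbid"}
)
# Sending it would be a single call: urllib.request.urlopen(request)
```

However deep the nesting goes, the client still makes one request; any fan-out to other APIs happens on the server.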

I’m really into GraphQL; I think next-gen approaches like it and Falcor are where APIs are headed. Then we can do away with making bespoke endpoints for our frontends due to the sheer chattiness of typical resource-oriented REST endpoints. I’m hoping to eventually point a production deployment of this at a MusicBrainz mirror to speed up deeply nested queries. Enjoy playing around with it as a neat toy in the meantime!

ijabz · November 27, 2016, 12:17pm

Very interesting, but as you say:

although very deep queries might take a while, again simply due to rate-limiting

Isn’t this the problem, though? The underlying API has to understand what data most users want and be written accordingly, because if it simply allows totally arbitrary queries it is going to be either too slow or overwhelm the server.

Would the ideal solution be for MusicBrainz to implement a GraphQL API (although how would you write it to overcome my point above)? In this project it is running on top of the MusicBrainz API, so when you say it only retrieves the data the user wants, you actually mean it retrieves more data than the user wants but throws the extra away before it gets to the user.

brian.beck · November 27, 2016, 5:55pm

Good questions @ijabz!

That’s true, and in that case users just do what they do now: orchestrate multiple queries over multiple round-trips to construct the data they need – they don’t just give up and only design their use cases around needing 1 API request. So at the very least, the GraphQL endpoint is giving you the benefit of not needing to do that orchestration yourself – it makes requests in parallel, uses cached responses where possible, and handles the rate-limiting and retrying for you.
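As a rough illustration of the orchestration being taken off the client’s hands, the sketch below fans several fetches out in parallel and collects the results. The `fetch` callable and the worker count are assumptions; this is not how graphbrainz itself is written.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=5):
    """Run fetch(url) for every URL in parallel and return {url: result}.

    `fetch` is any callable supplied by the caller (hypothetical here);
    pool.map preserves input order, so zip pairs each URL with its result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

A caller would pass a real HTTP fetch function; in a real deployment that function would also need to respect the rate limit and consult the cache.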

Unreasonably large queries could overwhelm the server (if we’re talking about a non-rate-limited implementation) but that’s something people have solved before: for example, there are many public SPARQL query servers (another, more well-established query language where you can construct arbitrarily complex queries across large graphs of data) – Wikidata and DBpedia’s SPARQL endpoints come to mind – and they simply have a limit on how long a query can take (their backend can presumably cancel pending queries on the server-side once the limit is reached). That’s one potential solution.

Another solution might be to allow the GraphQL server to make non-rate-limited REST API calls to fulfill queries, but cap the number of REST API requests it can make during the course of fulfilling a query. Then, sure, we wouldn’t be allowing unlimited complexity, but it would be reasonable & prevent abuse.
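The per-query cap could be sketched as a small budget object that every backend call has to draw from. The class and function names below are hypothetical, and the limit is whatever the operator chooses.

```python
class RequestBudgetExceeded(Exception):
    """Raised when a query needs more backend requests than it is allowed."""


class RequestBudget:
    """Counts backend requests made while fulfilling one query."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.used = 0

    def spend(self):
        # Refuse the request before it is made, not after.
        if self.used >= self.max_requests:
            raise RequestBudgetExceeded(
                f"query exceeded its budget of {self.max_requests} requests"
            )
        self.used += 1


def fetch_with_budget(budget, fetch, url):
    """Charge the budget, then perform the (caller-supplied) fetch."""
    budget.spend()
    return fetch(url)
```

A fresh `RequestBudget` would be created per incoming query, so one expensive query fails on its own without affecting anyone else’s.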

Anyway, we gotta assume that most requests are of a reasonable nature, otherwise MusicBrainz would be overloaded even worse in its current implementation, because people would still be making those requests but even less efficiently.

Having a GraphQL implementation deployed at the source is kind of my dream goal and an interesting idea the Metabrainz team should consider! In general, yes, implementing this as a translation layer to direct database queries (instead of the current REST API) would indeed be the ideal case, and open the door for more query optimizations.

Sort of – my particular implementation tries to make the most minimal REST API requests possible to satisfy the user’s query – the same as the user would need to do to orchestrate the same data fetching themselves. In other words, it’s not just requesting inc=everything + kitchen-sink-rels, etc. – it inspects the requested fields and makes a minimal set of requests.
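A minimal sketch of that idea: look at which fields were requested and add only the `inc` values they require. The mapping table and function name are assumptions for illustration, not the actual graphbrainz logic; the `inc` values themselves (`aliases`, `tags`, `releases`, `release-groups`) are ones the MusicBrainz web service accepts on artist lookups.

```python
# Hypothetical mapping from requested GraphQL fields to MusicBrainz `inc`
# values. Fields that are always present (name, country, ...) need none.
FIELD_TO_INC = {
    "aliases": "aliases",
    "tags": "tags",
    "releases": "releases",
    "releaseGroups": "release-groups",
}


def build_lookup_url(entity, mbid, requested_fields):
    """Build a lookup URL that includes only what the query asked for."""
    incs = sorted({FIELD_TO_INC[f] for f in requested_fields if f in FIELD_TO_INC})
    url = f"https://musicbrainz.org/ws/2/{entity}/{mbid}?fmt=json"
    if incs:
        url += "&inc=" + "+".join(incs)
    return url
```

A query asking only for `name` produces a URL with no `inc` at all, while one that also asks for `aliases` adds just that one subquery.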

As long as that’s the case, it’s not really a problem that more fields are retrieved on the backend than are requested: adding more fields to the response doesn’t increase backend latency much, if at all. The bigger factor is response payload size, which makes a far greater difference on slow/mobile connections than the extra time it takes the database to return an extra field. So you’re still getting quite a big win by allowing minimal response payloads via GraphQL, even if secretly on the backend some data is “thrown away”.

Another advantage is that the GraphQL server can control the cache policy. Currently, my implementation caches the full response from every REST API call for 24 hours – that’s why many of the example queries on the GitHub page load instantly. So when I said “thrown away” before, what it’s actually doing is caching those fields for potential re-use, and just not returning them in the query response. If someone makes a slightly different query (requesting different fields) that happens to translate to similar underlying REST API calls, there’s a good chance it won’t need to hit the real REST API at all.
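The caching behaviour described here can be sketched as a time-to-live cache keyed by REST URL. The class below is a hypothetical in-memory illustration (the real storage and eviction details are not covered in this thread); the injectable clock exists only to make it testable.

```python
import time

ONE_DAY_SECONDS = 24 * 60 * 60


class TTLCache:
    """Cache full responses by key and reuse them until they expire."""

    def __init__(self, ttl_seconds=ONE_DAY_SECONDS, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get_or_fetch(self, key, fetch):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no backend call needed
        value = fetch(key)  # missing or expired: hit the backend once
        self._entries[key] = (now, value)
        return value
```

Because the whole response is stored, a later query that needs different fields from the same REST URL is served from the cache without another backend call.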

In summary: yes, there’s absolutely a ton more optimization possible to make this more feasible for public consumption, but there are still advantages even with the current implementation. In general, I think the approach to APIs should not be “protect our server from doing too much work by forcing users to contort themselves & make many piecemeal queries” but rather “allow whatever queries we can to satisfy our users’ needs, and figure out a way to mitigate the unreasonable ones.”

ijabz · November 28, 2016, 11:02am

Thanks for the detailed answer

Currently I think the SQL lookups the server uses to fulfill API requests consist of loads of custom SQL. It would make much more sense for there to be some sort of object/graph layer that constructs database queries as required rather than directly writing SQL; this would then allow for caching as well. So I can see something similar to your dream goal making a lot of sense. I suggested using Hibernate myself as a way to generate the SQL, but my idea would have only used that to service the existing API, without the flexibility you could get with GraphQL.

Partially true, but in some cases applications just haven’t implemented the functionality they really want because it would be too slow.

In your current implementation you say you run queries in parallel. I don’t think this is actually allowed, is it?

reosarevok · November 28, 2016, 11:24am

Well, IIRC it’s actually implemented as something like X requests every Y seconds, so it might be possible to run a few in parallel as long as you don’t go over that limit? I always forget the exact number though (1 per second is an approximation)

brian.beck · November 28, 2016, 11:46am

Exactly right; the docs specifically say an average of 1 req/sec, so if it only takes a small burst of queries to fulfill a request, we might as well make them all at once. graphbrainz defaults to 5 req/5½ secs so it’s actually slightly under, and it tends to not get rejected.
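The "5 requests per 5½ seconds" policy can be sketched as a sliding-window limiter: allow a burst of up to five, then wait until the oldest request falls out of the window. This is an illustrative single-threaded sketch with hypothetical names, not the graphbrainz implementation; the clock and sleep functions are injectable for testing.

```python
import collections
import time


class WindowRateLimiter:
    """Allow at most `limit` requests in any `period`-second window."""

    def __init__(self, limit=5, period=5.5, clock=time.monotonic, sleep=time.sleep):
        self._limit = limit
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._sent = collections.deque()  # timestamps of recent requests

    def acquire(self):
        """Block until a request may be sent, then record it."""
        while True:
            now = self._clock()
            # Forget requests that have left the window.
            while self._sent and now - self._sent[0] >= self._period:
                self._sent.popleft()
            if len(self._sent) < self._limit:
                self._sent.append(now)
                return
            # Window is full: wait until the oldest request expires.
            self._sleep(self._period - (now - self._sent[0]))
```

Calling `acquire()` before each REST request lets a small burst go out immediately while keeping the long-run average just under one request per second.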

(It also supports pointing a deployment at a custom MusicBrainz mirror, which would have no or different rate limits – the parallelism would really come in handy there!)

brian.beck · December 14, 2016, 6:10pm

There have been a ton of additions and improvements to graphbrainz since this post! I don’t plan on adding noise by bumping for every release (there have been quite a few), but you may have noticed many corners of the API were incomplete the first time 'round.

- Better documentation, including a new Types document that offers an overview of every query, type, field, and argument in the schema. Everything has a description (mostly cobbled together from the MusicBrainz docs).
- More efficient REST API calls on the backend, taking advantage of the inc subquery feature in more cases.
- Added support for Media and Discs on releases.
- Added support for Collections including lookup and browse queries.
- Added support for Ratings.
- Added new lookup and browse queries by disc ID, URL, ISRC, and ISWC.
- Added missing fields like asin, ipis, isnis, isrcs, iswcs, and more.
- Lots of tests, providing 98% code coverage.

Still on the TODO list: missing fields like artwork, text representation, etc. – and potentially support for mutations (submitting ratings, barcodes, etc.).

graphbrainz has been downloaded over 600 times – probably many times by robots – hope someone is having fun.

Leo_Verto · June 4, 2017, 2:17am

I didn’t know this was a thing! I loved playing around with Wikidata’s GraphQL interface, and I thought there was only the dead-looking LinkedBrainz project for MB.

Thanks for making this!

reosarevok · June 4, 2017, 8:19am

FWIW, they’re “back, if basic for now” at http://linkedbrainz.org/