Improving Spotify export coverage via MusicBrainz URL relationships


The problem

When exporting a ListenBrainz playlist to Spotify, each recording MBID has to be resolved to a Spotify track ID. That resolution goes through the labs endpoint /1/spotify-id-from-mbid/json in listenbrainz-server, which uses a normalized-text join against mapping.spotify_metadata_index. If that text match fails, the track is silently dropped from the exported Spotify playlist.

For playlists with featured artists, non-English repertoire, or otherwise “long-tail” content, a significant fraction of tracks get dropped. On my own LB account I’m seeing roughly 20% drops on auto-generated Weekly Exploration playlists, and higher on manually-curated ones with lots of collaborations.

Concrete example

I ran an analysis on my own weekly exploration playlist (50 tracks, generated 2026-04-11). Here’s the breakdown:

Category | Meaning | Count
--- | --- | ---
HIT | text-match resolves normally | 39 / 50 (78%)
MISS_DRIFT | normalized MB string differs from normalized Spotify string | 8 / 50
MISS_CACHE_ONLY | normalized strings match but spotify_metadata_index has no row | 3 / 50

So 11 of 50 tracks (22%) would be silently dropped if I exported this playlist to Spotify today.

Of the 11 misses:

  • 8 are “featured artist” drift — 4 Gorillaz collaborations, 2 Daft Punk feat., 1 Aerosmith & YUNGBLUD, 1 Snarky Puppy multi-artist credit

  • 1 is a remaster variant drift (Led Zeppelin “Houses of the Holy”)

  • 1 is Apparat w/ Soap&Skin “Goodbye” — a true miss, no Spotify URL relationship in MB

  • 1 is Harry Styles “Ready, Steady, Go!” — cache miss (not in spotify_cache yet)

Methodology

For each MBID in the playlist:

  1. Call the prod labs /spotify-id-from-mbid/json endpoint

  2. If it returns a Spotify track ID → HIT

  3. If not:

    • Check whether the labs response includes canonical artist_name/track_name — if not, the MBID isn’t in canonical_musicbrainz_data

    • Query a local MB copy for the recording’s ISRCs, then use Spotify’s /v1/search?q=isrc:XXX&type=track to fetch Spotify’s canonical track data

    • Normalize both strings (MB canonical and Spotify canonical, using the same algorithm as labs: unidecode(re.sub(r'[^\w]+', '', text).lower()))

    • If they match → it’s a cache miss (text match would have worked if spotify_metadata_index had the row)

    • If they differ → it’s text drift

This lets us distinguish the two subcategories of misses empirically rather than guessing. The analysis script is small and I’m happy to share it.
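As a rough illustration of the classification above (not the actual analysis script), the check can be sketched in a few lines of Python. The normalization expression is the one quoted in the methodology. The function names `normalize` and `classify`, and the fallback used when the third-party `unidecode` package is not installed, are assumptions made for this sketch. It also assumes the ISRC search already returned Spotify's artist and track text.

```python
import re
import unicodedata

try:
    # Third-party package named in the methodology above.
    from unidecode import unidecode
except ImportError:
    def unidecode(text: str) -> str:
        # Rough stand-in (assumption): strip diacritics, drop other non-ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize(text: str) -> str:
    """Drop every non-word character, lowercase, then transliterate to ASCII."""
    return unidecode(re.sub(r"[^\w]+", "", text).lower())


def classify(labs_track_id, mb_text: str, spotify_text: str) -> str:
    """Label one MBID as HIT, MISS_CACHE_ONLY or MISS_DRIFT.

    labs_track_id: Spotify track ID returned by the labs endpoint, or None.
    mb_text / spotify_text: artist and track text from MB and from Spotify.
    """
    if labs_track_id:
        return "HIT"
    if normalize(mb_text) == normalize(spotify_text):
        # The text join would succeed if the index had the row.
        return "MISS_CACHE_ONLY"
    return "MISS_DRIFT"
```

A single extra token such as "feat" on one side is enough to turn an otherwise identical pair into MISS_DRIFT, because the labs join compares the normalized strings for exact equality.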

Proposed fix

Use MusicBrainz’s URL relationships as a fallback. MB editors have curated direct recording → Spotify URL links (l_recording_url joined to url with link type free streaming or streaming). We can extract the Spotify track ID from those URLs and return it for any MBID the text lookup didn’t resolve.

Pseudocode:

class SpotifyIdFromMBIDQuery(MetadataIndexFromMBIDQuery):
    def fetch(self, params, source, offset=-1, count=-1):
        # Existing text-match lookup (unchanged)
        results = super().fetch(params, source, offset, count)

        # NEW: for any MBID with empty spotify_track_ids (or not in results at all),
        # try MB URL relationships as a fallback.
        # input_mbids: the recording MBIDs taken from params (extraction omitted here)
        resolved = {r.recording_mbid for r in results if r.spotify_track_ids}
        unresolved = [m for m in input_mbids if m not in resolved]
        if unresolved:
            url_rel_matches = lookup_spotify_track_ids_from_mb_url_rels(unresolved)
            # Merge into results: update existing empty rows,
            # or append new rows for MBIDs that weren't even in canonical metadata
            ...
        return results

Where lookup_spotify_track_ids_from_mb_url_rels is a single SQL query against the existing MB database connection:

WITH input_mbids (gid) AS (VALUES %s)
SELECT input_mbids.gid::text AS recording_mbid,
       array_agg(DISTINCT substring(u.url FROM 'open\.spotify\.com/track/([A-Za-z0-9]+)')) AS track_ids
  FROM input_mbids
  JOIN musicbrainz.recording       r   ON r.gid = input_mbids.gid::uuid
  JOIN musicbrainz.l_recording_url lru ON lru.entity0 = r.id
  JOIN musicbrainz.url             u   ON u.id = lru.entity1
  JOIN musicbrainz.link            l   ON l.id = lru.link
  JOIN musicbrainz.link_type       lt  ON lt.id = l.link_type
 WHERE lt.name IN ('free streaming', 'streaming')
   AND u.url ~ '^https?://open\.spotify\.com/track/[A-Za-z0-9]+'
 GROUP BY input_mbids.gid
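For anyone who wants to sanity-check the URL pattern without a database, here is a small Python sketch that mirrors what the query does: the anchored match from the WHERE clause, the capture from substring(), and the per-MBID de-duplication of array_agg(DISTINCT ...). The helper names `spotify_track_id` and `group_track_ids` are made up for this example and are not part of listenbrainz-server.

```python
import re
from collections import defaultdict

# Same shape as the SQL pattern: scheme, host, /track/, then the ID.
SPOTIFY_TRACK_RE = re.compile(r"^https?://open\.spotify\.com/track/([A-Za-z0-9]+)")


def spotify_track_id(url: str):
    """Return the Spotify track ID embedded in url, or None if it is not a track URL."""
    match = SPOTIFY_TRACK_RE.match(url)
    return match.group(1) if match else None


def group_track_ids(rows):
    """Collapse (recording_mbid, url) pairs into {recording_mbid: [track_id, ...]}.

    Duplicate IDs for the same MBID are dropped, like array_agg(DISTINCT ...).
    """
    grouped = defaultdict(set)
    for mbid, url in rows:
        track_id = spotify_track_id(url)
        if track_id:
            grouped[mbid].add(track_id)
    return {mbid: sorted(ids) for mbid, ids in grouped.items()}
```

Album and artist URLs are rejected, and anything after the ID (such as a query string) is ignored, which matches the behaviour of the SQL regex.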

Key properties:

  • Purely additive. The text-match path is never modified, so any MBID currently resolved keeps its current Spotify ID. No existing exports are affected.

  • No new schema. Uses the existing MB_DATABASE_URI connection and reads only existing MB tables that LB already has access to.

  • No new cron, no offline build. Runtime SQL at query time, single join bounded by input size (usually 100 MBIDs or fewer per request).

  • No auth. Pure SQL; no external HTTP calls.
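To make the "purely additive" property concrete, here is one way the merge step (left as `...` in the pseudocode) could look. This is only a sketch under assumptions: rows are modelled as plain dicts with `recording_mbid` and `spotify_track_ids` keys, and `merge_url_rel_matches` is a hypothetical name. The real result objects in listenbrainz-server may have a different shape.

```python
def merge_url_rel_matches(results, url_rel_matches):
    """Fill in URL-relationship matches without touching text-match hits.

    results: list of dict rows (hypothetical shape) with 'recording_mbid'
             and 'spotify_track_ids'.
    url_rel_matches: dict mapping recording MBID -> list of Spotify track IDs.
    """
    remaining = dict(url_rel_matches)  # copy, so the caller's dict is not mutated
    for row in results:
        mbid = row["recording_mbid"]
        if not row.get("spotify_track_ids") and mbid in remaining:
            # Row exists but the text match found nothing: fill it in.
            row["spotify_track_ids"] = remaining.pop(mbid)
        else:
            # Text match already resolved this MBID (or there is no fallback):
            # keep the row as it is and make sure it is not appended again.
            remaining.pop(mbid, None)
    # MBIDs that were not in the text-match results at all get new rows.
    for mbid, track_ids in remaining.items():
        results.append({"recording_mbid": mbid, "spotify_track_ids": track_ids})
    return results
```

The point of the design is that a row with a non-empty `spotify_track_ids` is never overwritten, so an MBID that resolves today keeps exactly the same Spotify ID.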

Result on the example playlist

Applied against the same 50-track weekly exploration:

  • 10 of the 11 dropped tracks are recovered by the proposed fix (they all have ≥1 Spotify URL-rel in MB)

  • 1 remains dropped: Apparat w/ Soap&Skin “Goodbye”. MB has no Spotify URL relationship at all for this specific recording, so the proposed fix can’t help there. An ISRC-based fallback (MB has 5.6M recordings with ISRCs) would be a natural follow-up.

Example of a MISS_DRIFT recovery: Gorillaz feat. Mark E Smith “Delirium”. MB canonical is meridiandanfeatbighjmegermanwhipgermanwhip, Spotify canonical is meridiandanbighjmegermanwhipgermanwhip. The only difference is the feat join phrase, which survives normalization on the MB side but is absent on the Spotify side, so the equality join fails. But MB has https://open.spotify.com/track/6T9ZqPIWm4I4vygRZwgpJv curated on the recording, which extracts cleanly.

Example of a MISS_CACHE_ONLY recovery: Harry Styles “Ready, Steady, Go!”. The normalized MB canonical and Spotify canonical strings are byte-identical (harrystylesreadysteadygoreadysteadygo), so the text match would have worked if the track were in spotify_metadata_index. It apparently isn’t yet. URL-rel extracts the Spotify ID directly.

Next Steps

I’ve implemented the change in my own fork of listenbrainz-server and set up a full local dev environment on AWS to test it end-to-end (using my own Spotify developer client ID for the export flow). The labs endpoint test passes, the real Spotify export flow picks up the recovered tracks, and my dropped-track rate on my own playlists goes from ~22% to ~2% on the sample I ran.

I’d be happy to submit a PR if the dev team is comfortable with this approach. The diff is small — a new function in labs_api/labs/api/utils.py and a fetch() override in labs_api/labs/api/spotify/spotify_mbid_lookup.py. I can include unit tests against a MB fixture.

I want to get this working properly for a non-commercial personal project I’m building that connects to ListenBrainz. Once it’s up and running, I plan to make a donation to support MetaBrainz — thank you for the enormous amount of work that goes into keeping all of this running. Happy to help with upstream fixes in the meantime.

Regards,
Gerrit
