Missing data in JSON dumps

Hi there

I’m working with the JSON data dumps, mostly artists and release groups. Every now and then I notice that some data that’s available in the frontend is missing from the dumps, even months after it must have been added to the database:

  • Benson Boone’s “American Heart” has Wikipedia links in the frontend, but the JSON dump only has links to AllMusic, Discogs, and one German site. The frontend shows three more links to other German pages.
  • Roddy McIinnon is missing from early October dumps, but has been available online since September 20.

Somewhere I’ve read that dumps should be up to date within a few days. Am I not patient enough?…

Michael


Thanks for the report. Somehow our JSON dumps container was reverted to an older version of the code and got stuck. I’ve kicked it, but it’ll take a couple of days to catch up.


Sounds great! I’ll check again next week.


I can confirm that the latest dump has both my cases covered. Thanks a ton for the quick fix!


Did the container catch-up complete? Downloading full dumps seems to be working, but I can’t download the incremental changes. I tried using https://metabrainz.org/api/musicbrainz/json-dumps/json-dump-180797/artist.tar.xz?token=[MY TOKEN], but it’s just returning “Can’t find specified JSON dump!”. This is my first time attempting to download incremental changes, so I’m not sure if I’m doing something wrong or if the API is failing.

There was a permissions issue with the incremental dumps directory, which has now been resolved. Sorry for the inconvenience.

Hold on… are you saying there are incremental updates for the JSON dumps? Where would I find the documentation for this?

@mherger See “Hourly Incremental JSON Dumps” at API - MetaBrainz Foundation.

Thanks! How would I get from a dump timestamp to the packet number? E.g. the last dump I imported was last week’s 20251015-001001, and now there’s a new one from 20251018-001001. How would I get the packet number to fetch the changes since the 15th?

There should be a file named REPLICATION_SEQUENCE inside the full dumps you downloaded. That has the packet number at the time the dump was processed.
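
For example, something like this (just a rough sketch; I’m assuming REPLICATION_SEQUENCE sits at the top level of the extracted dump, and $TOKEN stands in for your access token):

$ SEQ=$(cat REPLICATION_SEQUENCE)   # packet number at the time the full dump was made
$ curl -f -o "artist-$SEQ.tar.xz" \
    "https://metabrainz.org/api/musicbrainz/json-dumps/json-dump-$SEQ/artist.tar.xz?token=$TOKEN"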

I’m trying it out with the REPLICATION_SEQUENCE values from a couple of release group dumps (Saturday’s and the one before that), and I’m getting a not found. Here’s an example URL: https://metabrainz.org/api/musicbrainz/json-dumps/json-dump-180708/release-group.tar.xz?token=<MY_TOKEN>

I did just generate the token; does it take some time for the token to become usable? Or am I just not using this as intended? The number there is from the dump from 20251015-001001.

Thanks! Alas: I’m still getting a “Can’t find specified…” for https://metabrainz.org/api/musicbrainz/json-dumps/json-dump-180708/artist.tar.xz?token=xyz. I only created my API key today. Could that be a problem?

Actually, the permissions issue I mentioned before wasn’t fully resolved. I found out that one of our scripts keeps resetting the permissions. I patched the container for now and will submit a proper fix tomorrow, but it should work for the time being.


Thanks! My test is working now. That should speed things up considerably on my end, since I only have to process 11 MB instead of 260 GB for the releases :slight_smile:.

So these are built hourly. If I wanted the diff since X, I’d have to go through all hourly dumps from X up to whatever I get from https://metabrainz.org/api/musicbrainz/replication-info?t
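
Roughly what I have in mind, as a sketch (assuming I pass in the current packet number from replication-info myself and reuse the URL pattern from above; $TOKEN is my access token):

#!/bin/sh
# Usage: fetch-artist-packets.sh <current-packet-number>
START=180708    # REPLICATION_SEQUENCE from the last full dump I imported
CURRENT=$1      # current packet number, read off the replication-info endpoint
for SEQ in $(seq "$((START + 1))" "$CURRENT"); do
    curl -f -o "artist-$SEQ.tar.xz" \
        "https://metabrainz.org/api/musicbrainz/json-dumps/json-dump-$SEQ/artist.tar.xz?token=$TOKEN" \
        || break    # stop at the first packet that isn't available (yet)
done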


Continuing here, as I’m still seeing odd differences between full dumps and incremental changes.

As mentioned before, I used to read the full dumps because I thought the hourly diffs were not available for JSON. This was painfully slow, obviously, and even caused my Mac to crash: macOS seems to keep different counts of free disk space which are not always in sync, so I ran out of disk space while Finder still reported hundreds of gigabytes free! But I digress…

I’m in the process of updating my update script to work with the hourlies. I’m now comparing the results of importing the full dump, as before, with the results of the incremental update, and I’m seeing differences. E.g. for The Allman Brothers Band - MusicBrainz: though the page says that record was last edited seven years ago, I’m getting different YouTube or Apple Music links in the hourly vs. the full dump (Oct 18). The latter is actually outdated. It’s not that the incremental missed something or added more: the incremental shows the same links as the web page, while the full dump doesn’t.

Or the full dump is up to date, but the web page isn’t?

though the page says that record was last edited seven years ago, I’m getting different YouTube or Apple Music links in the hourly vs. the full dump (Oct 18).

The “Last updated” date in the sidebar is unrelated to external links (or to the JSON dumps in general). That date only changes when the artist row in the database changes, which covers just the name, sort name, and most of the information listed under “Artist information” in the sidebar (besides IPIs/ISNIs).

Can you clarify what differences you’re seeing? I’m seeing the following Apple Music and YouTube links in the full artist dump from the 18th:

$ jq -c 'select( .id == "72359492-22be-4ed9-aaa0-efa434fb2b01" ) | ( .relations[] | select( ."target-type" == "url" ) | .url.resource ), halt' mbdump/artist | grep -E '(apple|youtube)'
"https://itunes.apple.com/us/artist/id127835"
"https://music.apple.com/gb/artist/127835"
"https://www.youtube.com/user/AllmanBrosBandVEVO"

There were several edits to these links on the 20th:

So it’s expected that those changes show up in the relevant incremental dump from the 20th. Are you seeing other changes that aren’t explained by the edit history?


Ah, ok, that would explain it: I’m seeing the same old URLs in the large dump (from the 18th), and those edits from the 20th in the incremental files. I’m good then. This is excellent.

BTW: the jq query you quoted, how long would that take to parse a full dump? I’ll have to keep a copy of that. I often use the parallel command to parallelise queries, but I’m not sure how well that would work with the halt instruction, which I assume would stop after the first match?


On my machine it takes under 2 seconds, but that’s with the dump on a spinning disk. :slight_smile: The artist dump is also relatively small (15 GB) compared to the release one, so it depends on which one you’re looking at.

halt causes it to stop after the first ID match rather than continuing to process the rest of the file. There’s only one record matching that ID, so it’s pointless to keep looking. If you’re parallelizing searches that aren’t by a unique key, you’d remove that.
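
For example, a search that isn’t keyed on a unique ID could look roughly like this (untested sketch; youtube.jq is just a made-up filter file name to keep the shell quoting sane):

$ cat youtube.jq
# emit the MBID of every artist that has at least one YouTube URL relationship
select( [ .relations[]? | select( ."target-type" == "url" ) | .url.resource ]
        | any( contains("youtube") ) ) | .id

$ parallel --pipepart -a mbdump/artist --block 500M jq -c -f youtube.jq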

We’re getting off topic… but this heavily depends on where within the file the record is stored, right? But my grepping actually isn’t helpful at all when dealing with a unique ID; the halt instruction might often be more efficient, as it prevents processing the full file.

At first I thought: <2s? Impossible! But then I ran the same query on my SSD-based system in under 1s. The record must be pretty high up in the file. Running the same with the last ID in the file would still take 3 minutes :grinning_face:. Thanks anyway! I’m still a relative beginner with jq.

We’re getting off topic… but this heavily depends on where within the file the record is stored, right?

Absolutely. I tried running the same query with the last ID in the file, and ended up just hitting ^C after a few minutes. :slight_smile:

You can also improve the time significantly by pre-filtering the file with grep/rg. Here’s locating the very last artist again:

$ time rg -F '124a614c-6d78-4c8d-b4b3-c270a94b8071' mbdump/artist | jq -c 'select( .id == "124a614c-6d78-4c8d-b4b3-c270a94b8071" )'
{"aliases":[],"disambiguation":"Italian Dance music producer","relations":[{"direction":"forward","begin":null,"artist":{"country":"IT","name":"Mentronik","sort-name":"Mentronik","type-id":"e431f5f6-b5d2-343d-8b36-72607fffb74b","type":"Group","disambiguation":"Italian project","id":"c951b815-969e-4e2a-9975-4d0c63c2b758"},"attributes":[],"ended":false,"target-type":"artist","end":null,"attribute-values":{},"target-credit":"","source-credit":"","attribute-ids":{},"type-id":"5be4c609-9afa-4ea0-910b-12ffb71e3821","type":"member of band"},{"ended":false,"end":null,"target-type":"artist","artist":{"country":"IT","name":"Ele Project","sort-name":"Ele Project","type":"Group","type-id":"e431f5f6-b5d2-343d-8b36-72607fffb74b","disambiguation":"Italian electro house trio","id":"763d28a1-6ce0-47b5-b47b-cb23fef66439"},"attributes":[],"begin":null,"direction":"forward","type-id":"5be4c609-9afa-4ea0-910b-12ffb71e3821","type":"member of band","attribute-ids":{},"source-credit":"","attribute-values":{},"target-credit":""},{"type":"discogs","type-id":"04a5b104-a4c2-4bac-99a1-7b837c37d9e4","attribute-ids":{},"url":{"id":"a2b72cfe-3279-4d7b-a443-99fe7f709225","resource":"https://www.discogs.com/artist/567882"},"target-credit":"","attribute-values":{},"source-credit":"","target-type":"url","end":null,"ended":false,"attributes":[],"begin":null,"direction":"forward"}],"id":"124a614c-6d78-4c8d-b4b3-c270a94b8071","begin-area":null,"name":"Filippo Arcangeli","annotation":null,"country":"IT","gender":"Male","tags":[],"area":{"name":"Italy","sort-name":"Italy","type":null,"type-id":null,"disambiguation":"","id":"c6500277-9a3d-349b-bf30-41afdbf42add","iso-3166-1-codes":["IT"]},"sort-name":"Arcangeli, Filippo","end-area":null,"type":"Person","type-id":"b6e035f4-3ce9-331c-97df-83397230b0df","isnis":[],"life-span":{"ended":false,"begin":null,"end":null},"gender-id":"36d3d30a-839d-3eda-8cb3-29be4384e4a9","ipis":[],"rating":{"value":null,"votes-count":0},"genres":[]}

________________________________________________________
Executed in    2.62 secs    fish           external
   usr time    1.09 secs    0.00 millis    1.09 secs
   sys time    1.50 secs    2.41 millis    1.50 secs

But if you’re doing tons and tons of lookups, it’d probably be worth importing the records into postgres as JSONB and setting up some GIN indexes (but we don’t provide any code or guidance to do this). And at that point you might consider just setting up a standard database mirror. :smiley:
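
For anyone who does want to try that route, the rough shape would be something like this (just a sketch, not something we provide or support; the database and table names are made up, and the bogus CSV quote/delimiter characters are a common trick for loading newline-delimited JSON without COPY mangling the backslashes):

createdb mbjson
psql -d mbjson -c 'CREATE TABLE artist_json (data jsonb NOT NULL)'
# load the newline-delimited JSON; each line becomes one jsonb row
psql -d mbjson -c "\copy artist_json (data) from 'mbdump/artist' with (format csv, quote e'\x01', delimiter e'\x02')"
# a GIN index lets containment queries (@>) use the index instead of scanning every row
psql -d mbjson -c 'CREATE INDEX ON artist_json USING gin (data jsonb_path_ops)'
# example lookup by MBID (the last artist from above)
psql -d mbjson -c "SELECT data->>'name' FROM artist_json WHERE data @> '{\"id\": \"124a614c-6d78-4c8d-b4b3-c270a94b8071\"}'"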