Bulk upload of “cultural heritage” records?


#1

Is MB’s capacity to handle bulk data uploads still as limited and/or nonexistent as it was at the time of this discussion: http://forums.musicbrainz.org/viewtopic.php?id=1682 ? I’m particularly curious whether MB has ever considered a direct upload option for data from “cultural heritage” partners (libraries, museums, etc.), or if this isn’t supportable.

Context: I’m doing research with a collection of ~100,000 historic commercial recordings that would benefit a lot, in terms of organization and linkability, from having records in MusicBrainz. But uploading them individually is impractical for me, and even if I tried it, the volume of data probably would overwhelm the voting system. Any feedback would be appreciated.


#2

That sounds interesting and valuable for MusicBrainz.

@ianmcorvidae was working on Geordi - an “ingestion engine” for label data so I guess that would have been useful here. Unfortunately it seems he is no longer around and the last work on geordi was over a year ago. Perhaps @reosarevok knows what if anything is planned here?

The voting system has changed in that new releases do not go through a voting period, but the same problem remains: not many editors check and correct the existing data. I’m not sure much can be done about that, so I’m not sure that stopping someone from adding a high-quality collection of data will improve matters.

It would be possible to write a bot to upload these with a lot of community engagement. Some of the concerns would probably be around matching of existing entities and (less so) title case correction.
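To make the matching concern a bit more concrete, here is a rough sketch of how a bot might flag likely matches against existing artist names before creating anything new. It uses only the Python standard library; the function names and the 0.85 threshold are my own assumptions, not anything MusicBrainz provides, and a real bot would need human review of whatever it flags.

```python
import difflib
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace (for comparison only)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def candidate_matches(name, existing_names, threshold=0.85):
    """Return (existing_name, score) pairs at or above the threshold, best first.

    The threshold is an arbitrary starting point (assumption), to be tuned
    against a hand-checked sample.
    """
    target = normalize_name(name)
    scored = []
    for existing in existing_names:
        matcher = difflib.SequenceMatcher(None, target, normalize_name(existing))
        score = matcher.ratio()
        if score >= threshold:
            scored.append((existing, round(score, 3)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Something like `candidate_matches("Edith Piaf", names_already_in_mb)` would then give a shortlist for a person to confirm, rather than letting the bot decide on its own.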


#3

I’m afraid that such bulk imports are still not possible. However, if you are planning to record all that information in some place, wouldn’t it be almost the same amount of work to enter it in MusicBrainz first and link it from there? Your research portal could then refer to your collection on MusicBrainz for its data.


#4

As others have said, there’s still no really good way to do this, even if it is something we would very much like. The Geordi project is pretty much dead as its sole developer left, and I think @Rob got pretty disillusioned about it. If someone wants to try and pick up Geordi again, well, the code is still open source. :slight_smile:

Until then, the best bet, aside from just chucking it all in manually, is probably to make a bot do this for you:
https://musicbrainz.org/doc/Bots

Some of the bots currently active on MusicBrainz should have links to their source code in their bot account’s description. I know there’s at least one Python and one Perl base currently floating around which I think most of the more active bots are based on.


#5

Thanks so much for these very helpful replies! We will look into the bot option.

The collection is already semi-organized on our servers, so we’re hoping that some combination of automated processing and human checking (possibly crowd-sourced) can fairly rapidly clean the metadata to a point where it can be added to MusicBrainz or linked to existing MB records. That would be a way better scenario for the collection than just continuing to gather dust in its corner of the internet.


#6

Are you willing to share the nature of the recordings - e.g. genre, time period, etc.?

What was the nature of his disillusionment?


#7

Development was slow and users barely ever used it, IIRC.


#8

Unfortunately, Geordi in its original form wasn’t very useful, and then it was rewritten so that it was totally useless!

If you can find those releases in Discogs then you might find http://www.albunack.net/ useful, as it makes it quite easy to import a Discogs release into MusicBrainz. Essentially, you click on a link and it opens the MusicBrainz submit new release form with the information pre-added, but even then you still have to check the information and submit the release. Also, MusicBrainz cannot cope with many releases being submitted at the same time: if you try this you end up with partial submissions, i.e. a new release added but with no tracklist, as the submit new release code is not transactional.
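Because of the partial-submission problem, anything automated would probably want to submit strictly one release at a time and check each result before moving on. A minimal sketch of that loop follows; `submit` and `is_complete` are hypothetical callables that you would have to supply yourself (they are not part of any MusicBrainz API I know of), and the one-second pause is just an assumption.

```python
import time


def submit_serially(releases, submit, is_complete, delay_seconds=1.0, sleep=time.sleep):
    """Submit one release at a time and record any that arrive incomplete.

    `submit` and `is_complete` are hypothetical callables supplied by the
    caller; they stand in for whatever submission and verification
    mechanism is actually used.
    """
    complete, incomplete = [], []
    for index, release in enumerate(releases):
        if index > 0:
            sleep(delay_seconds)  # pause so submissions never overlap
        release_id = submit(release)
        if is_complete(release_id):
            complete.append(release_id)
        else:
            # e.g. release created but tracklist missing: queue for manual repair
            incomplete.append(release_id)
    return complete, incomplete
```

The `incomplete` list is the important part: it gives you something to fix by hand instead of discovering the half-added releases later.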

Bots have been created to add information to existing releases but I do not think there is a bot to actually submit new releases.


#9

What? And nobody told me? :stuck_out_tongue_winking_eye:
There are so many hidden third party MB tools! :grinning: :thumbsup:


#10

They’re mostly mid 20th-century North American recordings. Some are more recent. I probably shouldn’t say much more about the collection in a public forum until this project moves beyond the vaporware stage.


#11

AIUI - MB database currently has ~15 million recordings.
~100K recordings would be around 0.67% of that total.

Is MB at a point where it can take such opportunities?


#12

[edit - sorry, missed your earlier reply]
I’m guessing based on your original question that all of this is stored in some sort of database already? Does it have any links between e.g. same artist on different recordings, artist and global identifiers (e.g. VIAF)?


#13

I don’t know of a public bot for submitting releases, but there was an unauthorised one that submitted about as many Japanese releases in September 2014 (i.e. it has been achieved).


#14

Which, on the other hand, caused a huge mess some people are still cleaning up, since it added lots of duplicates and meh-ly entered data. So: possible, yes; done properly before, no. There’s also the FreeDB bot that added tons of releases ages ago, but it was stopped because its data was also terrible. All this has made the community not too receptive to bot-adding.

That said, a properly written bot, with sources that are reliable and run in a decent way (tested, with a small test run checked carefully by humans afterwards to find any issues), for music that is unlikely to get added otherwise, and with some good way to check for duplicates and not add stuff to the wrong artist because it has the same name, would be useful and I would definitely lobby for it with any community members who don’t like the idea. The problem is that it’s not easy :slight_smile:
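To illustrate the "small test run" idea, here is a rough sketch of how a batch could be prepared: drop obvious duplicates within the source data, then pull a small reproducible sample for humans to check before anything larger runs. The field names (`artist`, `title`, `catalog_number`) and the pilot size are assumptions about what the source data might look like, nothing more.

```python
import random


def dedupe_key(record):
    """Key on artist, title, and catalogue number (hypothetical field names)."""
    return tuple(
        " ".join(str(record.get(field, "")).casefold().split())
        for field in ("artist", "title", "catalog_number")
    )


def plan_pilot(records, pilot_size=50, seed=0):
    """Drop in-batch duplicates, then pick a reproducible pilot sample."""
    seen, unique, duplicates = set(), [], []
    for record in records:
        key = dedupe_key(record)
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
            unique.append(record)
    rng = random.Random(seed)  # fixed seed so reviewers see the same sample
    pilot = rng.sample(unique, min(pilot_size, len(unique)))
    return pilot, unique, duplicates
```

Note that a name-based key like this only catches duplicates inside the batch. It does nothing about two different artists sharing a name, which still needs a human or a stronger identifier.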


#15

The metadata is already in a DB, but authority control and entity linking have been managed pretty poorly until now. So we’re throwing every refinement technique that we can think of at it: OpenRefine, AcoustID lookups of the songs, DBpedia Spotlight, MusicBrainz, Discogs, and VIAF searches for artist names, etc. Suggestions are always appreciated!

The majority of the content (probably at least 80%), however, is sufficiently outside the online mainstream that we don’t expect to find any LOD records about it, which is where the question about bulk uploads to MusicBrainz comes in – assuming that we can get sufficiently high data quality via some combination of the automated approaches described above with human verification.
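For the MusicBrainz artist-name searches mentioned above, a minimal sketch of building the search URL for the MusicBrainz web service (version 2) is below. It only constructs the URL and sends nothing; the helper name, the quoting approach, and the `limit` default are my own assumptions. Any real client should also send a meaningful User-Agent and keep to roughly one request per second, as the MusicBrainz API guidelines ask.

```python
from urllib.parse import urlencode

MB_ARTIST_SEARCH = "https://musicbrainz.org/ws/2/artist/"


def artist_search_url(name, limit=5):
    """Build a MusicBrainz artist search URL; no request is sent here."""
    # Escape backslashes and double quotes so the name stays one quoted phrase.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    params = {"query": f'artist:"{escaped}"', "fmt": "json", "limit": limit}
    return f"{MB_ARTIST_SEARCH}?{urlencode(params)}"
```

Calling `artist_search_url("Example Artist")` gives a URL whose JSON response can then be compared against the local record by a person or a scoring step before any link is made.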