MusicBrainz not fully indexed by Google

mmirG · December 11, 2017, 11:38am

Just checking that “33M” is the target figure.

Fortunately, “The Secret Mausoleum Of Mankind: Fetish Miniatures Of The Suicided Races” is now showing in Google search results. A few political tweaks here and there, institute world peace, and that will pretty much be a wrap for my day.

Unless there is an un-entered mystery bonus track on the MusicBrainz release page, you can probably hear The Secret Mausoleum Of Mankind: Fetish Miniatures Of The Suicided Races on YouTube.

rob · December 11, 2017, 11:55am

Yes, 33M is about the number of pages we expect Google (and others) to index.

bsammon · December 11, 2017, 7:04pm

Can you elaborate?
As in, is 33M all of MusicBrainz? Or do you not really expect to get all of MusicBrainz indexed in Google, and 33M is the portion (which portion?) that you expect to get indexed?

reosarevok · December 12, 2017, 1:43pm

I’d expect 33M is everything except the stuff we disallow in robots.txt.

justcheckingitout · December 13, 2017, 2:32am

If I am reading that correctly, you don’t allow
area, isrc, iswc, partners, recording, and track

Why would we not allow those things?
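For anyone who wants to check which URLs a crawler would skip, here is a rough sketch using Python's standard `urllib.robotparser`. The rules text is an assumption rebuilt from the six paths listed above, not a copy of the live robots.txt, and the `/recording/some-id` and `/isrc/some-code` paths are made-up placeholders.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: rules reconstructed from the paths named in this thread.
# The real robots.txt on musicbrainz.org may contain different entries.
ASSUMED_RULES = """\
User-agent: *
Disallow: /area
Disallow: /isrc
Disallow: /iswc
Disallow: /partners
Disallow: /recording
Disallow: /track
"""


def build_parser(rules_text):
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


def is_crawlable(parser, path, agent="Googlebot"):
    """Return True if `agent` may fetch `path` under the parsed rules."""
    return parser.can_fetch(agent, "https://musicbrainz.org" + path)


parser = build_parser(ASSUMED_RULES)
for path in (
    "/release/f88d6d9b-9664-4c54-887c-2bd83248bc2c",
    "/recording/some-id",   # hypothetical placeholder path
    "/isrc/some-code",      # hypothetical placeholder path
):
    print(path, "->", "allowed" if is_crawlable(parser, path) else "disallowed")
```

Note that `Disallow` rules are prefix matches, so a rule like `/track` would also cover any other path that starts with those characters.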

rob · December 13, 2017, 12:35pm

Indexing all of MusicBrainz would require some 2 billion+ pages, which is not realistic on any level.

Instead, we worked with Google to embed JSON-LD markup in our pages that encodes areas, ISRCs and recordings. I'm not sure about ISWC or track, but you can inspect this yourself:

https://musicbrainz.org/release/f88d6d9b-9664-4c54-887c-2bd83248bc2c

View source on the page and look for the rather large JSON-LD blob near the top:

https://gist.github.com/mayhem/a43c326ce3240d7ca21b2c6b43045689

By encoding all the relevant information into the release page, we obviate the need for Google to crawl vastly more pages. From what we can tell, 33M pages represents a complete indexing of all of the data that is relevant at this point in time.
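If you would rather not read the page source by eye, here is a minimal sketch of pulling JSON-LD blobs out of HTML with Python's standard `html.parser`. The sample HTML is invented for illustration (the release name, track name, and field selection are placeholders) and is not the markup MusicBrainz actually serves; compare against the gist above for the real thing.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list of decoded JSON-LD objects found in `html`."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.blocks


# ASSUMPTION: invented sample page, not real MusicBrainz output.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "MusicRelease",
 "name": "Example Release",
 "track": [{"@type": "MusicRecording", "name": "Example Track"}]}
</script>
</head><body></body></html>"""

blocks = extract_jsonld(SAMPLE_HTML)
for block in blocks:
    tracks = block.get("track", [])
    print(block["@type"], block["name"], "with", len(tracks), "embedded recording(s)")
```

The point of the sketch is that the recordings travel inside the release page's markup, so a crawler that reads this one page does not need to fetch a separate page per recording.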