MusicBrainz doesn't seem to appear in many search results?


What does it say as the explanation for the excluded pages? (“Crawled but not indexed”?) Just wondering also if there are any missing sitemaps or sitemaps that haven’t been read for years as well?

| Status | Type | Validation | Pages |
|---|---|---|---|
| Warning | Indexed, though blocked by robots.txt | Not Started | 55590 |
| Error | Submitted URL seems to be a Soft 404 | Started | 526 |
| Error | Submitted URL has crawl issue | Started | 30 |
| Error | Submitted URL blocked by robots.txt | N/A | 0 |
| Error | Server error (5xx) | N/A | 0 |
| Excluded | Crawled - currently not indexed | N/A | 382728 |
| Excluded | Discovered - currently not indexed | N/A | 303697 |
| Excluded | Duplicate without user-selected canonical | N/A | 218684 |
| Excluded | Alternate page with proper canonical tag | N/A | 67009 |
| Excluded | Blocked by robots.txt | N/A | 66672 |
| Excluded | Soft 404 | N/A | 46018 |
| Excluded | Page with redirect | N/A | 22286 |
| Excluded | Crawl anomaly | N/A | 13767 |
| Excluded | Excluded by ‘noindex’ tag | N/A | 20 |
| Excluded | Duplicate, submitted URL not selected as canonical | N/A | 12 |
| Excluded | Not found (404) | N/A | 4 |
| Valid | Indexed, not submitted in sitemap | N/A | 590560 |
| Valid | Submitted and indexed | N/A | 5437 |
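
To put the report in proportion, here is a rough sketch that totals the page counts per status. The numbers are copied from the report above; the names `REPORT` and `totals_by_status` are made up for illustration, and treating the four statuses as non-overlapping buckets is an assumption.

```python
# Page counts copied from the coverage report above: (status, type, pages).
REPORT = [
    ("Warning", "Indexed, though blocked by robots.txt", 55590),
    ("Error", "Submitted URL seems to be a Soft 404", 526),
    ("Error", "Submitted URL has crawl issue", 30),
    ("Error", "Submitted URL blocked by robots.txt", 0),
    ("Error", "Server error (5xx)", 0),
    ("Excluded", "Crawled - currently not indexed", 382728),
    ("Excluded", "Discovered - currently not indexed", 303697),
    ("Excluded", "Duplicate without user-selected canonical", 218684),
    ("Excluded", "Alternate page with proper canonical tag", 67009),
    ("Excluded", "Blocked by robots.txt", 66672),
    ("Excluded", "Soft 404", 46018),
    ("Excluded", "Page with redirect", 22286),
    ("Excluded", "Crawl anomaly", 13767),
    ("Excluded", "Excluded by 'noindex' tag", 20),
    ("Excluded", "Duplicate, submitted URL not selected as canonical", 12),
    ("Excluded", "Not found (404)", 4),
    ("Valid", "Indexed, not submitted in sitemap", 590560),
    ("Valid", "Submitted and indexed", 5437),
]


def totals_by_status(rows):
    """Sum the page counts for each status (Warning, Error, Excluded, Valid)."""
    totals = {}
    for status, _type, pages in rows:
        totals[status] = totals.get(status, 0) + pages
    return totals


totals = totals_by_status(REPORT)
grand_total = sum(totals.values())

# Print each status with its page count and its share of all reported URLs.
for status, pages in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{status:<9}{pages:>10,}  {pages / grand_total:6.1%}")
```

By this count, the excluded pages outnumber the valid ones by almost two to one.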
1 Like

This sounds a bit suspicious, especially at this volume. Is this intentional, or has Google missed a bunch of sitemaps?

200k pages is a lot. What does it count as a duplicate? Is the canonical URL configuration correct?

This looks very familiar though. I have a site with a similar issue, with a lot of pages being excluded.

I tried a lot of things and noticed that the only “cure” that noticeably worked was reducing the number of errors: for every error page fixed, Google seems willing to index tens if not hundreds more pages. My hunch is that Googlebot crawls a site until it encounters an error, then enters some massive cooldown period, sometimes skips the broken page, and continues.

2 Likes

I didn’t take the time to try all your sets, but it looks similar with Bing.

2 Likes

This sounds a bit suspicious, especially at this volume. Is this intentional, or has Google missed a bunch of sitemaps?

In the sitemaps, we actually only list pages that have embedded JSON-LD markup, to ensure those are fully ingested by Google (we even supply hourly, incremental sitemap updates to them). The only reason we have sitemaps to begin with is because they contracted us to embed semantic markup (JSON-LD) in our pages, and needed a way for us to ping them when any of the markup changed.
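
For readers unfamiliar with the format, here is a minimal sketch of what generating an incremental sitemap could look like. It follows the public sitemaps.org schema, but it is not MusicBrainz's actual code: `build_sitemap` is a hypothetical helper, and the timestamp in the usage note is invented.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified must be a timezone-aware datetime; it is written
    in UTC so crawlers can tell which pages changed since the last run.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, modified in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        stamp = modified.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        ET.SubElement(node, "lastmod").text = stamp
    return ET.tostring(urlset, encoding="unicode")
```

An hourly job could call it with only the pages whose markup changed in the last hour, for example `build_sitemap([("https://musicbrainz.org/artist/3ebb2aa0-c5ac-4aaf-b654-d6ad17526508", datetime(2020, 5, 27, 9, 0, tzinfo=timezone.utc))])`.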

So I’m not surprised if it says a ton of pages aren’t in the sitemaps.

200k pages is a lot. What does it count as a duplicate? Is the canonical URL configuration correct?

I checked which URLs it’s complaining about for this, and the vast majority are random URLs from our FTP site, nothing MusicBrainz related. For example… http://ftp.musicbrainz.org/pub/ros/ros_docs_mirror/electric/api/bmp085/html/structbmp085__smd500__calibration__param__t-members.html

Though there are a few MB ones like https://musicbrainz.org/artist/3ebb2aa0-c5ac-4aaf-b654-d6ad17526508?va=0, I assume because ?va=0 is a no-op there. That’s something we could improve.
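
As a rough illustration of that improvement, the sketch below derives a canonical URL by dropping query parameters that are assumed not to change the page. The `NO_OP_PARAMS` set and the `canonical_url` helper are hypothetical; whether `va` is really a no-op depends on the page, so treat that entry as an assumption rather than a rule.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters listed here never change the rendered page.
NO_OP_PARAMS = {"va"}


def canonical_url(url, no_op_params=NO_OP_PARAMS):
    """Return url with the assumed no-op query parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in no_op_params
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url(
    "https://musicbrainz.org/artist/3ebb2aa0-c5ac-4aaf-b654-d6ad17526508?va=0"
))
```

The result could then be emitted in a `<link rel="canonical">` element, so that both URL variants point search engines at the same page.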

5 Likes

Maybe worth excluding the ftp share using robots.txt?
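
A quick way to sanity-check such a rule before deploying it is Python's standard `urllib.robotparser`. The rule below is only a guess at what a robots.txt for the FTP host might contain (the `/pub/` prefix is taken from the example URL above), so adjust it to the real layout.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for ftp.musicbrainz.org: block everything under /pub/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pub/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

FTP_URL = (
    "http://ftp.musicbrainz.org/pub/ros/ros_docs_mirror/electric/api/bmp085/"
    "html/structbmp085__smd500__calibration__param__t-members.html"
)

# True means the crawler may fetch the URL, False means the rule blocks it.
print(parser.can_fetch("Googlebot", FTP_URL))
print(parser.can_fetch("Googlebot", "http://ftp.musicbrainz.org/"))
```

Note that robots.txt only stops crawling. As the “Indexed, though blocked by robots.txt” row in the report shows, blocked URLs can still end up in the index.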

You can use this page and set va and any other no-op parameters there as “No: Doesn’t affect page content (ex: tracks usage)”.

Any chance they could be contacted again to see why Google is indexing so few pages?

Here is an example of a Release that does not appear in my Google results:

The “issue” is that ?va=0 is sometimes not a no-op, so it does sometimes affect page content. Maybe the solution would be to not include that link for artists where it wouldn’t display or change anything? Edit: the problem with that solution is that users wouldn’t be “trained” to expect it to be there, which might (or might not!) be a UX issue. Whatever we do, it’s always a compromise… :)

2 Likes

This seems like a fair compromise? IMHO the lack of visibility on search engines seems a bit worse than not having users being trained to have the parameter there.

I’ve made a ticket for this one thing now at least:

3 Likes

It’s kinda amazing how so many pages are not shown by Google, yet there are folks so desperate that they want to remove releases from MB in hopes of influencing Google results.

An editor has concerns that the BBC Music links could negatively affect the indexing of MB.

I’ve searched for that Brazilian artist, including disambiguation and birth year.
Guess what: neither MB nor BBC showed up, but some SEO crap did. And Discogs, of course.

3 Likes

The thing is that Googlebot is fairly active; it’s crawling our websites almost constantly. So no, Googlebot (and also BingBot and others) are actively crawling.
But we still don’t appear in results, even for entities that have been in the database for years.

It is indexed.

It appears on second page for me, when searching on title.


I think a big factor is that most pages have no original textual content: we mostly have links to external resources and bits of data, such as titles of only a few words (tracklists), birth dates, etc. To most bots, I think our pages look “empty”. Also, since they are light on text, the short excerpts shown in results are rather unattractive, and I’m not sure many people click them.
We don’t provide an audio player or ways to buy the actual music either; the biographies shown mostly come from Wikipedia, and reviews are done elsewhere (CritiqueBrainz).

We don’t even show cover art, artist photos, or label imprint images by default (or at all). Also, we have good-quality data but lack quantity: someone who wants to know all the LP releases of an album has a better chance on Discogs, and could buy one directly from there.

And the MusicBrainz website isn’t really mobile-compatible yet, which doesn’t help.

In short, I think our bad ranking is due more to the very nature of MusicBrainz than to a lack of indexing.

9 Likes

I google:
“Music for Millions: Vol. 1” monada
and get as results this thread, including images, and an Amazon listing.
Nada Musicbrainz.
Google seems to index but not display here.

(I’ve just spent 20 minutes trying to get the browser of my choice, a fat slow incontinent dog that might need to be put down soon, to take that screen shot. Perhaps time for me to step away from the device.)

Googling:
site: musicbrainz.org “Music for Millions: Vol. 1” monada
= still no listing of Release displayed here.

2 Likes

1st or 2nd result for me, despite having France country checked: https://duckduckgo.com/?q=site%3Amusicbrainz.org+“Music+for+Millions%3A+Vol.+1”+monada

1 Like

If I do the same request, results are clearly showing, but if I remove site:musicbrainz.org from query, only one page of results, and no Musicbrainz at all…

2 Likes

@mmirG, can you link the release we are supposed to find? Is there a typo?

Come on, please paste a normal MB link.
My browser says I should not go there, security threat or something.

2 Likes

I copy and paste URL.
And something else appears in post.
Interesting.
Let’s try again.

2 Likes

I know of websites that contain nothing but descriptions of specific numbers, plaintext equivalents of hashes, or digits of pi, and they get indexed, along with a massive amount of SEO spam. It’s very weird that it’s MusicBrainz that gets excluded from the index that much, IMO.

Hopefully MBS-10573 gets fixed some time soon and we can see if Google just dislikes duplicates.

This could be a bigger issue than the lack of textual data, it’s hard to predict.