Two URLs for the same webpage

bsammon · April 1, 2019, 9:33pm

Occasionally, when I’m adding URLs to an artist, I run into websites where there’s more than one URL for the same webpage, and no obvious connection between the URLs.
For example, an artist’s Facebook page may be reachable via http://www.facebook.com/<artistname> or https://www.facebook.com/profile.php?id=<numeric id>

You also see similar situations with YouTube, where the same videos can be reached via https://www.youtube.com/channel/<channel code> or via https://www.youtube.com/user/<username>

Given that some people use MusicBrainz to “find the artist that matches this URL”, it would make sense for MusicBrainz to have both Facebook URLs, or both YouTube URLs, in cases like these.

It would be better still if there were some way to indicate that both URLs point to the same webpage. It makes me think that it might be better to have “Webpage” be a primary entity, instead of “URL”, with URLs perhaps being an attribute of a “Webpage” entry.
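
To make the idea concrete, here is a rough sketch in Python of how it might look. This is not the MusicBrainz schema: the names `Webpage` and `WebpageIndex` are hypothetical, and the identifiers in the example URLs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Webpage:
    """Hypothetical primary entity: one page reachable via several URLs."""
    label: str
    urls: set = field(default_factory=set)


class WebpageIndex:
    """Hypothetical lookup table mapping each known URL to its Webpage."""

    def __init__(self):
        self._by_url = {}

    def add(self, page):
        # Register every URL of the page so any of them finds the same entry.
        for url in page.urls:
            self._by_url[url] = page

    def find(self, url):
        # Returns None when the URL is not known.
        return self._by_url.get(url)


# Made-up identifiers, for illustration only.
page = Webpage(
    label="Example Artist on YouTube",
    urls={
        "https://www.youtube.com/channel/UCexample",
        "https://www.youtube.com/user/exampleartist",
    },
)
index = WebpageIndex()
index.add(page)
```

With this shape, looking up either URL returns the same `Webpage` entry, which is the “both URLs point to the same webpage” link I’m after.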

I understand that schema changes happen rarely, and take a long time…

Thoughts? Has anyone considered this issue and how have you dealt with it?

yvanzo · April 2, 2019, 7:32am

I don’t think MusicBrainz should store redirects; it is supposed to store standardised URLs only.

From Style / Relationships / URLs § Standardised URLs:
For many sites, we use a standardised URL format. In most cases, the URL will be automatically formatted correctly.

The issue is that automatic formatting is an offline method, whereas the examples you gave most probably require connecting to the Facebook API or YouTube API to find out the permalink. IMHO, MusicBrainz needs script improvements rather than a database schema change here.

Another question is whether we should store such URLs or just identifiers (such as <artist name> and <numeric id> in your example above), which is the approach followed by BookBrainz. This means parsing the input URL for identifiers and reconstructing the URL from those identifiers, and it also assumes the URL pattern is known. Most probably a hybrid solution should be adopted.
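
To illustrate the identifier approach, here is a minimal sketch in Python. It is not BookBrainz or MusicBrainz code: the function names `parse_identifier` and `build_url`, the `TEMPLATES` table, and the "name"/"numeric_id" labels are all hypothetical, and it only covers the four URL patterns mentioned in this thread.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical table of known URL patterns, keyed by (site, kind).
TEMPLATES = {
    ("facebook", "name"): "https://www.facebook.com/{id}",
    ("facebook", "numeric_id"): "https://www.facebook.com/profile.php?id={id}",
    ("youtube", "channel"): "https://www.youtube.com/channel/{id}",
    ("youtube", "user"): "https://www.youtube.com/user/{id}",
}


def parse_identifier(url):
    """Return (site, kind, identifier), or None if the pattern is unknown."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    segments = [s for s in parts.path.split("/") if s]

    if host == "facebook.com":
        if segments == ["profile.php"]:
            ids = parse_qs(parts.query).get("id")
            return ("facebook", "numeric_id", ids[0]) if ids else None
        if len(segments) == 1:
            return ("facebook", "name", segments[0])
    if host == "youtube.com" and len(segments) == 2:
        if segments[0] in ("channel", "user"):
            return ("youtube", segments[0], segments[1])
    return None


def build_url(site, kind, identifier):
    """Reconstruct a standardised URL from a stored identifier."""
    return TEMPLATES[(site, kind)].format(id=identifier)
```

For instance, `parse_identifier("http://www.facebook.com/someartist")` yields `("facebook", "name", "someartist")`, and passing that tuple to `build_url` gives back the https form of the URL. Note that this is still purely offline: it cannot tell that a given name and a given numeric id belong to the same page, which is where an API lookup would be needed.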

Note: The next schema change is scheduled for May 13th and is mostly unrelated to URL handling at the moment, but it is still open to contributions!