Link rot: Dead links and the Wayback Machine

An edit was made to remove a URL relationship that was once correct but is now essentially a dead link (edit #38467308). But there is more than one possible course of action when encountering a dead link:

  1. Remove the relationship.
  2. Set the “ended” attribute of the URL relationship.
  3. Add links to snapshots of the website taken while it existed.

Once you know that it is possible, it seems obvious that the link should not be removed but rather be set as ended. However, the link in question is not dead per se but is now used by some kind of spam network. This means that MusicBrainz will link to spam (possibly scams?), so we would have to consider the morality of that.

Snapshots are probably more useful than dead links, but more arbitrary. I propose that one URL entity should be created for each major revision of the homepage, preferably the last one available (in this case, 2010-08-14 and 2014-08-25). This would make sense in a URL-URL relationship to the actual link with the snapshot date as an attribute.

Ideally I imagine the snapshots would be displayed underneath the original link. Also, it might be appropriate to display the original link as plain text if it has ended to avoid linking to malicious sites.

What are your thoughts on these proposed changes? Until the relationship is implemented, I think it would make sense to add snapshots as ordinary homepage relationships alongside the ended original link. Do you agree?

Marking as ended is indeed better for sites that did contain actual content.
For what it’s worth, my user script called ALL LINKS will display the webarchive link for every URL marked as ended.
For instance in 陰陽座 where their former official home pages did contain interesting stuff, it links to OHP ended and OHP 1999‐05‐09—2003‐01‐28.

Automating* website snapshots via the Internet Archive would be amazing.
Band websites have a crazy short shelf life :neutral_face:

If we can mark the relationship as ‘ended’ then it should definitely only link to the archived snapshot (if possible), no need to link to the old URL.

*I think trying to decide on a manual style (not to mention getting people to follow it) is going to be very complex, which is why I specifically mention ‘automating’.

What do those dates correspond to?

Why not? I think it would be incorrect to remove the URL that has actually expired. I also think it would be technically incorrect to link snapshots as homepages, which is why I propose the new relationship, but for the time being both should be kept.

I’m not sure automating is possible, considering that a fair amount of snapshots are basically dead. What are the complexities in doing it manually?

There is no need to keep both, because the original URL is available from the snapshot URL.

They somehow duplicate the efforts of the Internet Archive Wayback Machine rather than taking advantage of it.

Here comes another proposal that does not require changing the database schema:

  • Style change: If a URL has expired, it would be recommended to replace it with the last relevant snapshot URL. If no relevant snapshot exists, just use the Wayback Machine “history” (wildcard) URL.
  • Display change: Links to snapshots would be automatically detected and displayed differently by appending a special notice at the end of the title, and by changing the icon (ideally using half of the original icon and half of the Internet Archive icon).

Real examples:

  1. With archives: The label Active Suspension had an official website which is now discontinued (and replaced by irrelevant ads). I simply replaced it with its last archived version. Currently, it is erroneously linked as “Official homepage”. But with the proposed change, it would be automatically linked as “Official homepage (archived 2009)”.
  2. With no archive: The band O.Lamm had an official website which unfortunately has never been archived due to robots.txt policy. With the proposed change, the Wayback Machine “history browser” URL could be added and automatically linked as “Official homepage (ended)”.
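
The display change above could be sketched roughly as follows. This is only an illustration in Python, not MusicBrainz server code: `describe_link` and `WAYBACK_RE` are made-up names, and the pattern only covers the plain `web.archive.org/web/<timestamp>/<original URL>` form of snapshot links.

```python
import re

# Wayback Machine snapshot links have the form
#   https://web.archive.org/web/<timestamp>/<original URL>
# where <timestamp> is 4 to 14 digits (YYYY[MM[DD[hhmmss]]]),
# or "*" for the history (calendar) page.
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(?P<ts>\d{4,14}|\*)/(?P<original>.+)$"
)


def describe_link(url, link_type="Official homepage"):
    """Return (label, original_url) for a link; hypothetical display helper."""
    match = WAYBACK_RE.match(url)
    if match is None:
        # Not a snapshot link: display it unchanged.
        return link_type, url
    ts = match.group("ts")
    original = match.group("original")
    if ts == "*":
        # History URL: no relevant snapshot was chosen.
        return f"{link_type} (ended)", original
    # The first four digits of the timestamp are the year.
    return f"{link_type} (archived {ts[:4]})", original
```

Since the original URL is recovered from the snapshot URL itself, nothing extra would need to be stored for this to work.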

Do you mean 1999-10-09—2000-10-20 with dates based on the Wayback Machine history?
This would not be relevant for Wayback Machine (related to the above example).

[quote=“August_Janse, post:4, topic:36336”]
I’m not sure automating is possible, considering that a fair amount of snapshots are basically dead. What are the complexities in doing it manually?[/quote]

I definitely agree with it as a new relationship/s and guideline, but in my experience URLs aren’t updated much by users, which is a shame.
I think this would be straightforward (not necessarily easy to code, I just mean machine-readable!) to automate:

  • User marks URL as ended
  • System checks if the Internet Archive has a snapshot of that URL
  • If so, it creates a link to that (using your new relationship) and hides the old link
  • If not, it hides the old link as per usual, no need to check the IA for snapshots again anytime, unless the site is marked as active again in MB at some point
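
To make the steps above concrete, here is a minimal Python sketch. It assumes the Internet Archive's availability endpoint (`https://archive.org/wayback/available`) and its JSON response shape; `handle_ended_url` and the dictionary it returns are hypothetical and do not correspond to anything in the MusicBrainz code base.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the availability query; timestamp is digits such as '20141101'."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response):
    """Return the snapshot URL from a decoded response, or None if absent."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def handle_ended_url(url, end_date=None, fetch=None):
    """Hypothetical hook run once when a user marks a URL as ended."""
    if fetch is None:
        # Default: a real HTTP request; pass a stub instead for offline use.
        def fetch(query):
            with urlopen(query, timeout=10) as resp:
                return json.load(resp)
    snapshot = closest_snapshot(fetch(availability_query(url, end_date)))
    # The old link is hidden either way; snapshot is None if nothing was found.
    return {"hide_original": True, "snapshot": snapshot}
```

The check runs only at the moment the URL is marked as ended, which matches the last step: no repeated polling of the IA.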

Just a thought, I know our devs are busy…

Makes sense, but the latest snapshot isn’t always what we want. For example, this is NEO’s latest snapshot. In this case, there’s absolutely no purpose in storing it. So I think it would have to be curated manually.

You can get the Wayback Machine bot to visit a site by doing a search for it.
It might be a reasonable idea to do that when a relationship is added to ensure it has been saved.

For what it’s worth, there was an ArchiveTeam project to archive all outward links from MB not too long ago, so it’s likely that there’s at least one useful snapshot for most of them.

Why not ignore all snapshots after the “ended” date? So for your example, I’d enter an end date of 2014-11, and all the strange stuff from December 2014 onward will not disturb us. I’m assuming the server code will be able to query the web archive about the latest snapshot before the cutoff date, and automatically link to that.

BTW, archive.org is fantastic. No offense to Tim Berners-Lee, but Brewster Kahle deserves knighthood, fast.:heart_eyes:

No, what’s automatic (with the user script) is the display of the webarchive link. The dates have been set by myself here and on various other URLs.

The script uses the dates that are displayed in http://musicbrainz.org/artist/88d8f38f-adb4-48a0-8c1f-ec34f2a675ff/relationships
I have set those dates by documenting their edits but usually not based on Wayback snapshots, except when two consecutive days can show the site going down. :slight_smile:

But I realised it’s a bad example (a fix is pending, which may be why you asked), as I made those ended links a long time ago by hard-coding the webarchive URL. Nowadays, I set the real URL, plus the ended or end date attribute; I don’t have any other example handy than this edit.

+1

Also it doesn’t have to show the latest snapshot (though that makes sense to me using docdem’s suggestion). It could expand to show a range of snapshot dates (a UI issue), or simply link to this page instead: Wayback Machine
Will all the links from there be relevant? No, but that’s expected behaviour imo.

It is a sensible assumption, using either the Time Travel APIs or the Wayback CDX Server API.
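
As a rough sketch of the CDX route, in Python: the parameter names (`url`, `to`, `filter`, `fl`, `output`, `limit`) are the ones I understand the Wayback CDX Server API to accept, with a negative `limit` returning the last captures. I am assuming a partial `to` such as `201411` covers the whole month, which should be verified against the API documentation. The function names are made up.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query(url, end_date):
    """Build a query for the last successful capture up to end_date.

    end_date is a MusicBrainz-style partial date such as "2014-11".
    """
    params = {
        "url": url,
        "to": end_date.replace("-", ""),  # assumed to include the whole period
        "filter": "statuscode:200",       # skip redirects and error pages
        "fl": "timestamp,original",       # only the fields needed for a link
        "output": "json",
        "limit": "-1",                    # negative limit: last capture only
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_from_rows(rows):
    """Turn decoded CDX JSON (a header row, then data rows) into a link."""
    if len(rows) < 2:
        return None  # header only, or an empty result: nothing archived
    record = dict(zip(rows[0], rows[-1]))
    return f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"
```

A status filter alone would not catch parked or spam pages that return 200, so manual curation might still be needed in cases like the NEO example.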

It is possible I think. I don’t know if web archive has an API, but it is very likely.
When the end date is unknown, the all links user script shows the * page link (there you choose your date) and when the end date is known, it links to that date in the archive (it is just about putting numbers instead of *).

The archive shows the closest snapshot so it could lead to after the end date (and you would have to click one left arrow) but in my cases it showed the proper sites, at least, as far as I can remember.
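
For illustration, the link construction described above amounts to something like this Python sketch (`wayback_link` is a made-up name, not the user script's actual code):

```python
def wayback_link(url, end_date=None):
    """Build a Wayback Machine link for an ended URL.

    end_date is a MusicBrainz-style date such as "2003-01-28" (or a partial
    date such as "2003-01"); without one, the "*" history page is linked.
    """
    stamp = end_date.replace("-", "") if end_date else "*"
    return f"https://web.archive.org/web/{stamp}/{url}"
```

With a date, the archive redirects to the closest snapshot it has, which is why the result can land slightly after the end date.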

Actually, my above proposal was a basic attempt to get round the lack of editing capabilities to set the “ended” flag and the “end date” field of a URL relationship directly from the target entity editing form. It is probably best to ignore these tricks and to address the form issue separately, though display expectations were very close to the initial proposal. I am fully backing what is emerging from this topic.

(Edit: I deleted additional paragraph for the sake of clarity.)

zos18 added some interesting background in an edit note. There are definitely differing opinions so I wish we could come to a consensus, but feel free to vote on the edit as it stands for now.

If you want to see the editing form issue addressed, you can vote for this ticket: http://tickets.musicbrainz.org/browse/MBS-3774

Edit: the issue discussed here has been pointed out by zos18 and requires an RFC: http://tickets.musicbrainz.org/browse/MBS-2289

Dead links don’t bother me. I’d rather have a dead link than no reference at all to go on.

Personally I’d like to see the following flags added to the URL entity:

  • Defunct/Dead
  • Has Archive.org mirror - this would generate side link to archive.org for a copy of the page
  • Has [Mirror] - whatever other mirror project has a copy.