Wayback Machine downloads

I really don’t like the idea of designing/configuring MusicBrainz in a way that assumes/requires that people will use user scripts. Isn’t that effectively assuming/requiring that people will use one of a very limited set of browsers?
I don’t even like assuming/requiring that users have JavaScript turned on (currently only the case for maybe 50% of editing functionality and 5% of viewing functionality).

The ideal solution to this would be to update the schema & editing UI so that a URL entry can have an “original URL” attribute and an “archive URL” attribute. The “archive URL” field would then pop up in the edit page whenever the “item has ended” checkbox is checked.

4 Likes

I’ve got a similar view as bsammon.
And add another negative of that pathway - it makes MB more unfriendly to those global citizens who aren’t members of the technological elite.

I can see in the short and medium term having a website that is neg-UX without scripts can be a reasonable solution to balancing resources against UX for a wide range of people.
But in the longer term it seems very exclusionary in its effect.

This goes to the idea of MB deciding who its target populations are.
And then prioritising delivering an encyclopedia those populations can use enjoyably.

1 Like

Wouldn’t a computed archive link be misleading in the cases where specific URLs aren’t available from the wayback machine? In my experience that’s a significant percentage.

1 Like

Or it might be there in another location, for that matter. I’ve noticed that website redesigns result in all kinds of screwy archives.

We should just keep the original URL and MB should compute and display the archive URL when it’s ENDED:

1 Like

What about when somebody else buys the domain and turns it into spam or malware? I don’t think an end date property addresses this very well.

2 Likes

My concern with this is that I doubt a computer-calculated archive URL would be as good as one that had been researched/verified by an actual person. “https://web.archive.org/web/*/” URLs can be easily computer-calculated, but they are often just links to a page that says “Here’s 50 times we tried to archive the page–go spend an hour figuring out which ones are useful”.
Determining which of the various snapshots on web.archive.org are useful is a more challenging task (for a computer program) and if someone has already gone to the trouble to manually research it, they should be able to share their results with the world.

Also, anyone interested in this from a technical standpoint should research Wikipedia’s Internet Archive bot (I haven’t) which is itself not 100% foolproof.

1 Like

Also, interesting reading at https://en.wikipedia.org/wiki/Wikipedia:Link_rot for some insight on how others deal with this issue.

1 Like

more interesting reading:

a previous discussion of the topic:

3 Likes

We can use the date the link was added, and it will link to the nearest available archived page.

We can display the URL (which is already hidden in the relationships tab, away from the sidebar) as text only, without a hyperlink. As a tooltip, maybe.

Would it solve it to auto link to the last archived page before the end date?

And the link is only a direct link if there is an end date - we could have a broader ‘ended’ checkbox that perhaps just makes it link to the Wayback Machine page that shows all the archived dates, and the user can choose where to go from there. …or just do that for all of them I guess.
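To make the two options in this post concrete, here is a minimal sketch of how a display layer might compute the link from the stored original URL. The function name `archive_link` and its parameters are hypothetical, not anything in the MusicBrainz codebase. It relies on two Wayback Machine URL forms: `/web/*/<url>` for the page listing all captures, and `/web/<timestamp>/<url>`, which the Wayback Machine redirects to the capture nearest that timestamp. Note that "nearest" can be a capture later than the date, so this is not strictly "the last archived page before the end date".

```python
from datetime import date
from typing import Optional

# Kept in one place so a change in the archive's URL pattern
# only requires changing display code, not stored relationships.
WAYBACK_PREFIX = "https://web.archive.org/web"


def archive_link(original_url: str, ended: bool,
                 last_active: Optional[date] = None) -> Optional[str]:
    """Return a Wayback Machine link for an ended URL, or None if it is still live.

    Hypothetical helper: `ended` mirrors the 'ended' checkbox and
    `last_active` mirrors an optional 'last known active date'.
    """
    if not ended:
        return None
    if last_active is None:
        # No date known: link to the overview page listing every capture.
        return f"{WAYBACK_PREFIX}/*/{original_url}"
    # Date known: a 14-digit timestamp (YYYYMMDDhhmmss) at midnight of that day.
    stamp = last_active.strftime("%Y%m%d") + "000000"
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


print(archive_link("http://example.com/some-page", True))
print(archive_link("http://example.com/some-page", True, date(2018, 6, 1)))
```

Neither link is verified to exist; the sketch only builds the URL and leaves it to the Wayback Machine to resolve it.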

2 Likes

The end date would presumably be the date when an editor noticed the link is no longer valid; we don’t know how long before that the link stopped working.

When archive.org has multiple snapshots, finding the last meaningful snapshot is a manual process, in my experience. I’ve found numerous cases where later snapshots are just error pages.

2 Likes

Naturally text and buttons would be tweaked to make it clear that the end date relates to when you know a site has been removed/ to link to a specific wayback page.

Your point re. multiple later snapshots being error pages is specifically what it would be looking to address.

1 Like

I still don’t see how. If I notice in July 2020 that example.com/some-page is no longer working, I put that as the end date of the link. But the last good snapshot may be from 2018 with a bunch of error page snapshots between then and now. I don’t know a way to find that last good snapshot other than manually going to the wayback machine and hunting through all the snapshots they have.
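For anyone who wants to see what partially automating that hunt might look like, here is a rough sketch. It assumes a list of capture records is already at hand, each with a 14-digit timestamp and the HTTP status code recorded for the capture (the Wayback Machine's CDX index exposes fields like these, but fetching them is left out here). The name `last_good_snapshot` and the record shape are made up for illustration.

```python
from typing import Iterable, Optional, Tuple

# (timestamp as "YYYYMMDDhhmmss", HTTP status code recorded for the capture)
Snapshot = Tuple[str, str]


def last_good_snapshot(snapshots: Iterable[Snapshot],
                       noticed_dead: str) -> Optional[str]:
    """Return the timestamp of the latest capture with status 200
    taken at or before `noticed_dead`, or None if there is none.

    Fixed-width timestamps sort correctly as plain strings.
    """
    good = [ts for ts, status in snapshots
            if status == "200" and ts <= noticed_dead]
    return max(good) if good else None


captures = [
    ("20180301120000", "200"),  # real content
    ("20190105080000", "404"),  # error page
    ("20200210093000", "503"),  # error page
    ("20200801000000", "200"),  # after the editor noticed the link was dead
]
print(last_good_snapshot(captures, "20200731235959"))
```

This only filters on the recorded status code. A placeholder or "domain for sale" page that was served with status 200 would still pass, so the result narrows the manual search rather than replacing it.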

Hmm… yeah… I imagine this would be good about 80% of the time. I’m imagining a number of (unlikely) possible cases where this “nearest” wouldn’t get the best version. The most likely would be if, as an example, I added an “official homepage” link in 2018 but it was an “official homepage coming soon” placeholder for all of 2018; then a 2019 snapshot (in 2019 the homepage had lots of actual information) would be better than a 2018 snapshot. (And yeah, in this hypothetical the homepage is gone in 2020.)

1 Like

Yeah but that would no longer be what you would do, because you’re smart and the text would say something like:

ended
last known active date (not today’s date - a date when the relevant site was still active): --

If you don’t know what date the site actually ended and don’t want to dig through the wayback machine, then I don’t see the point in you putting in a random end date; just check ‘ended’.

1 Like

If I have to dig through the wayback machine to find the proper date, where’s the benefit to the “automatic” URL calculation? At that point it would be just as easy to add a direct link to the last good snapshot.

Added: but thank you for assuming I’m smart. :slight_smile:

1 Like

I think that would also be fine (a direct link).

If it is too complicated though maybe just a link to the wayback machine page with all the dates, in every situation.

I always check archive to see when it disappeared.
If I don’t have time, or if I cannot see when because there are not enough archived versions, I just check ENDED.
No one should set a date without knowing, ENDED is enough in most cases.

Because we are not responsible for the Internet Archive URL pattern.
If they change the URL pattern, we just have to change our display code, not all thousands of relationships.

Also, the Internet Archive is often changing the genuine URL (adding :80, or superfluous index.htm).
It’s still historically nicer to not change our real URL.
And it can be hidden and displayed only as text (no hyperlink).
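As an illustration of the ":80" and "index.htm" point, here is a small sketch of how code could check that an archive-side URL refers to the same page as the stored one without ever modifying the stored URL. The function name `comparable` is hypothetical, and the two rules it applies (drop an explicit port 80 on `http`, drop a trailing `index.htm`) are just the two variations mentioned above; treating them as equivalent is an assumption that will not hold for every site.

```python
from urllib.parse import urlsplit, urlunsplit


def comparable(url: str) -> str:
    """Return a form of `url` used only for comparison, never for storage.

    Drops an explicit ':80' on http URLs and a trailing 'index.htm'.
    Any user:password part and the fragment are discarded as well.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    port = parts.port
    if port is None or (parts.scheme == "http" and port == 80):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path
    if path.endswith("/index.htm"):
        path = path[: -len("index.htm")]
    return urlunsplit((parts.scheme, netloc, path or "/", parts.query, ""))


stored = "http://example.com/"
from_archive = "http://example.com:80/index.htm"
print(comparable(stored) == comparable(from_archive))
```

The stored relationship keeps the genuine URL; only the comparison uses the reduced form.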

4 Likes