Implementing an alternative to the coverartarchive

gamebeaker · October 21, 2024, 10:04am

As the title states, I think we should start working on an alternative to the Cover Art Archive.
Why?
I fear that the Internet Archive is going to lose the ongoing court cases and will shut down because it can’t pay the fines.
If it is announced that they lost in court and have to shut down because of it, it will be too late. Even if they stay available for a few months after that, I think a lot of people will have the same idea and try to archive the Internet Archive, which will increase the load or even act like a DDoS attack.

It would be sad to lose all art contributions since 2013.

Even if no real alternative is created, I think an effort should be made to back up all original-size images independently of the IA. That would allow an alternative to be rebuilt if the IA goes dark. At one image request per second it would still take about 64 days (5548967/3600/24 ≈ 64.22 days) just to create the backup.
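
A quick sanity check of that estimate, as a minimal sketch: the image count is the figure quoted above, while the function name and the alternative request rates are only illustrative assumptions.

```python
SECONDS_PER_DAY = 3600 * 24


def backup_days(image_count: int, requests_per_second: float) -> float:
    """Days needed to fetch every image once at a constant request rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return image_count / requests_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    images = 5_548_967  # figure quoted in the post
    for rate in (1, 5, 20):  # hypothetical request rates, not agreed limits
        print(f"{rate:>2} req/s -> {backup_days(images, rate):.2f} days")
```

The time scales inversely with the request rate, so the real limit is however many requests per second the IA tolerates.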

rob · October 21, 2024, 12:14pm

I feel ya 100% and this has been on my mind for some time now. Here are the things I plan to do when the CAA comes back:

1. Make a backup of the whole archive (top priority).
2. Create a Docker project that creates a whole copy of the archive, or at least a caching proxy. The goal here is for others to be able to set up CAA services under their own control to avoid future disruptions.
3. Start thinking about where we can host an alternate place should the archive go dark.

#3 isn’t as concrete as it could be – and that is intentional. I don’t want to force our own hand until we have to make a decision. But, I plan to be ready so that we can deploy an alternative quickly should this event come to pass.
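
The caching proxy mentioned in the list above boils down to a read-through cache. This is only a minimal sketch of that idea, not a planned implementation: `cached_fetch`, the on-disk layout, and the injected `fetch` callable are all hypothetical names chosen for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> bytes:
    """Return the body for `url`, contacting the upstream only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any upstream path maps to a safe, flat file name.
    cache_file = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if cache_file.exists():
        return cache_file.read_bytes()
    body = fetch(url)  # may raise if the upstream is unreachable
    cache_file.write_bytes(body)
    return body
```

Once an image is on disk it keeps being served even if the upstream disappears, which is the property that matters for the scenario discussed here. Cache invalidation for replaced or removed covers is deliberately left out.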

gamebeaker · October 21, 2024, 12:45pm

Regarding your points 2 and 3: maybe the Docker cache could use BitTorrent.
You can add a webseed (https://…) to torrent files → if the IA is online, the client grabs the image from the website; if it is offline, it behaves like a normal torrent and looks for the file over DHT or a torrent tracker.
Problems:
I am not sure how torrent files are generated. I think you need to have the file and compute a hash of it to create the torrent. (Only possible after the backup has been created.)
How are the torrent files distributed? A new entry in the MusicBrainz database? (Maybe it is enough to save the hash of the file, and the user’s client can generate the right torrent file from it. Pro: if the IA is no longer available, there is no need to change the database.)
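
To make the first problem concrete, here is a minimal sketch of how a single-file BitTorrent v1 torrent with a webseed is built. It assumes the BEP 19 `url-list` key for the webseed; the function names, the 256 KiB piece length, and the example URL are illustrative assumptions, not anything MusicBrainz or the IA uses.

```python
import hashlib


def bencode(value) -> bytes:
    """Encode ints, bytes/str, lists and dicts in BitTorrent's bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Keys must be byte strings, emitted in sorted order.
        encoded = {(k.encode("utf-8") if isinstance(k, str) else k): v
                   for k, v in value.items()}
        body = b"".join(bencode(k) + bencode(encoded[k]) for k in sorted(encoded))
        return b"d" + body + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_single_file_torrent(data: bytes, name: str, webseed_url: str,
                              piece_length: int = 256 * 1024) -> tuple[bytes, str]:
    """Return (torrent file bytes, hex info-hash) for one in-memory file."""
    # The file is cut into fixed-size pieces; each piece gets a SHA-1 digest.
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {"length": len(data), "name": name,
            "piece length": piece_length, "pieces": pieces}
    metainfo = {"info": info, "url-list": [webseed_url]}  # webseed per BEP 19
    info_hash = hashlib.sha1(bencode(info)).hexdigest()
    return bencode(metainfo), info_hash


if __name__ == "__main__":
    # Placeholder bytes and URL; a real run would read the image from the backup.
    torrent, info_hash = build_single_file_torrent(
        b"fake image bytes", "front.jpg", "https://example.org/front.jpg")
    print(info_hash, len(torrent))
```

So the file itself is indeed needed to compute the piece hashes. On the other hand, the same bytes, name and piece length always give the same info-hash, so storing only that hash (for example as a `magnet:?xt=urn:btih:<info-hash>` link) might be enough, with the caveat that clients would then have to fetch the metadata from peers.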

Pro:
If there is one backup, it just needs to be turned into a seedbox for torrenting, and over time the files spread across the different proxies.
If Picard uses BitTorrent to get the covers, the load should automatically get “balanced” across the other Docker proxies. (The user doesn’t have to manually change the cover server etc., which is very user friendly.)
If a cover hasn’t changed, there is no need to re-download it. (At the moment it gets downloaded each time; there is no cover cache.)

If Picard uses BitTorrent:
Problem:
How do the Docker proxies find out which covers to cache if they never get a direct request from a user?
Solution:
Create a MusicBrainz tracker for the torrent files and an API endpoint such as “get the 100 torrents with the fewest seeders/peers”.
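
The selection logic behind such an endpoint could look like this. It is a sketch under the assumption that the tracker can produce a mapping from info-hash to current seeder count; `least_seeded` and the sample data are hypothetical.

```python
import heapq


def least_seeded(seeders_by_infohash: dict[str, int], limit: int = 100) -> list[str]:
    """Return up to `limit` info-hashes, fewest seeders first."""
    return heapq.nsmallest(
        limit, seeders_by_infohash, key=seeders_by_infohash.__getitem__
    )


if __name__ == "__main__":
    sample = {"aaa": 12, "bbb": 0, "ccc": 3, "ddd": 1}  # made-up counts
    print(least_seeded(sample, limit=2))
```

A proxy would poll this periodically and start seeding whatever comes back, so poorly seeded covers get picked up without any user ever requesting them.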

sound.and.vision · October 21, 2024, 2:28pm

I’m sure someone else might be able to confirm, but my understanding is that torrents are pretty static: they have a table of contents (i.e. files/directories) that is generated at the point of creation. This ToC can’t be updated dynamically, so a new torrent file would need to be produced to handle any changes.

With how rapidly the CAA changes, that would mean needing to periodically recreate that file.

Honestly, I think it should be handled with a 1:1 mirror of the CAA hosted somewhere else, and any application that needs to query it (be it the MusicBrainz website or Picard) could use a round-robin approach: if the first entry doesn’t respond or seems overloaded, it moves on to the next endpoint.
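
On the client side, that "try the next endpoint" behaviour could be sketched as below. Strictly speaking this is ordered failover rather than true round-robin, and it only treats connection errors and timeouts as failure, not "seemingly overloaded". The function names and the mirror list in the usage note are assumptions for illustration.

```python
import urllib.request
from typing import Callable, Sequence


def http_get(url: str, timeout: float = 10.0) -> bytes:
    """Plain HTTP GET using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_failover(path: str, mirrors: Sequence[str],
                        fetch: Callable[[str], bytes] = http_get) -> bytes:
    """Try each mirror in order and return the first successful response."""
    last_error = None
    for base in mirrors:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            return fetch(url)
        except OSError as error:  # URLError and timeouts are OSError subclasses
            last_error = error
    raise RuntimeError(f"all mirrors failed for {path!r}") from last_error
```

A client would call it with something like `fetch_with_failover("release/<MBID>/front", ["https://coverartarchive.org", "https://caa-mirror.example.org"])`, where the second host is a made-up placeholder for a future mirror.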

The major hurdle I can think of is this: with two endpoints A and B (where A is the Internet Archive and B is something MetaBrainz hosts), a change made on B must also happen on A, and a change made on A must also happen on B. Although there’s probably someone smart out there who could figure that out already.

HibiscusKazeneko · October 21, 2024, 2:45pm

This was discussed a while back. IIRC the consensus was there will effectively be no more CAA if the IA is gone.

gamebeaker · October 21, 2024, 2:47pm

That is right. I meant that one torrent file is created for each cover, not one for the whole database.

XonE · October 21, 2024, 3:34pm

As much as I would love to see that (!), I also want to mention copyright regulations. The Internet Archive has/had a pretty solid base for our purposes, and we had the luxury of not having to worry about it.

gamebeaker · October 21, 2024, 3:49pm

You are right, I didn’t think about that.

gamebeaker · October 29, 2024, 8:07am

@rob are you part of the MusicBrainz team?
If yes,
maybe contact https://www.themoviedb.org. As I understand it, they are also a community-driven collection of metadata and covers for films (and they have solved this problem). They host the uploaded covers themselves and, for legal reasons, allow DMCA takedowns (see the TMDB DMCA Policy).
My legal understanding is that in most countries a website isn’t responsible for user-created content, but there needs to be an option to remove illegal or copyrighted material.

rob · October 29, 2024, 11:13am

I am!

Yep, I know that this is a possible way forward for us, but the problem is our board of directors. They are very good at looking out for us, but this approach is one step too dangerous for their liking. While I dislike this decision, I fully agree with it.