GSoC 2024 - Internet Archive integration: saving external URLs via Wayback Machine

Personal Information

Name: Ashutosh Aswal

IRC nick: yellowhatpro

Email: yellowhatpro3119@gmail.com

GitHub: yellowhatpro

Time Zone: UTC+05:30

Internet Archive integration: saving external URLs via Wayback Machine

Proposed mentors: bitmap, yvanzo
Estimated time: 350 hours
Tech Stack used: Rust

Project Overview

The MusicBrainz database sees tens of thousands of edits every day. Each edit has an associated edit note, which provides additional information about the edit. Often these edit notes, as well as some edits themselves, contain external links that we want to archive in the Internet Archive.
This project aims to create a Rust application that can archive these external URLs using the Wayback Machine APIs.

My contributions so far:

The task at hand requires Rust bindings for the MusicBrainz database. For this purpose there is a library, sql-gen, that can generate Rust bindings for a given database schema. The library was not compatible with the MusicBrainz schema, however, so it required some fixes. With the library working, we are able to build a binary that generates all the Rust types and the important queries needed to work with the database.

The generated bindings are currently available in a GitHub repository, which we can use for the Rust app we are going to build. At the moment they compile successfully, but there might be runtime issues, which we can fix as we proceed with application development.

Implementation

Our main focus is to come up with an application that takes edits and edit notes from the database, extracts URLs from them, and saves the URLs using the Wayback Machine APIs. That is the abstract idea; here is how we are going to do it.

1. Making the sql-gen library work

The sql-gen library is a great tool for generating Rust types from Postgres database schemas. It uses the sqlx crate under the hood, which is a really nice tool for writing SQL queries. The problem with sql-gen is that it did not map Postgres types very well, which caused a lot of errors when generating the Rust entities. sql-gen also panicked on custom MusicBrainz types, like edit_note_status, fluency, cover_art_presence, etc. As a fix, the respective Rust enums were added.
This is the mapping function that maps Postgres types to sqlx-compliant Rust types:

pub fn convert_data_type(data_type: &str) -> &str {
    match data_type {
        "_int8" => "i64",
        "_int4" => "i32",
        "_int2" => "i16",
        "_text" => "String",
        "jsonb" => "serde_json::Value",
        "time" => "chrono::NaiveTime",
        "bool" => "bool",
        "bpchar" => "String",
        "char" => "String",
        "character" => "String",
        "edit_note_status" => "EditNoteStatus",
        "fluency" => "Fluency",
        "oauth_code_challenge_method" => "OauthCodeChallengeMethod",
        "taggable_entity_type" => "TaggableEntityType",
        "ratable_entity_type" => "RatableEntityType",
        "event_art_presence" => "EventArtPresence",
        "cover_art_presence" => "CoverArtPresence",
        // ... remaining type mappings elided for brevity
        _ => panic!("Unknown type: {}", data_type),
    }
}

With these fixes, sql-gen can now read the database tables and no longer panics during the generation process. The generated code had some shortcomings too, which have now been fixed.
All these fixes and more can be seen here.

2. Generating rust types and functions for the database using sql-gen

With sql-gen working, one can generate bindings using this command:

sql-gen generate --database postgres://musicbrainz:musicbrainz@localhost:5432/musicbrainz_db --output ./schema --schema musicbrainz

sql-gen generates Rust types for the database schema and is capable of generating helper functions for various database queries: SELECT, SELECT by primary key, SELECT by foreign key, INSERT, UPDATE and DELETE statements.
For more complex queries, we can use the sqlx crate to write raw SQL.

3. Look for URL-containing data in the database

OK, so by now we have the generated Rust bindings to interact with the tables. But where are the URLs located?

Let’s begin with the Edit Note.
This is how an edit note looks on the website:

This is how the URL is stored in the edit_note table:

As we can see, the edit note has a text column, where we can find the URL.

For edits, there are multiple edit types, but only a few are relevant to our use case. For example, Add Relationship:

And its representation in the edit_data table:

For edits, the data column stores the JSON-formatted data. For Edit Type 90 (Add relationship), we can see there is a type0/type1 mapping of release/url, and the entity1 key gives us the object that contains the URL we want to archive.

There are other types as well, which we can use, like Edit Type 101 (Edit URL):

The JSON representation is:

Other edit types which may contain external URLs are Add Artist Annotation and Edit Relationship.

Now that we know “where” to look for the URLs, we have to figure out “how” we will extract them.

4. Poll database tables to fetch URL-containing rows

To query the tables, we first have to interact with the database. Of the several ways to do so, the preferred and simpler solution, which came out of discussion on IRC, is polling: it is simple, and the chance of idle polling is low given the number of edits made every day. On each poll, we can call a Postgres function from our Rust app, which will DECLARE a cursor, OPEN it, FETCH the rows, and CLOSE the cursor. We keep track of the last processed row and pass its id to the cursor query, so previously processed rows are skipped.

The polling code will look something like this:


use std::time::Duration;

use sqlx::postgres::PgPoolOptions;
use tokio::time::interval;

#[tokio::main]
async fn main() {
    let db_url = "postgres://musicbrainz:musicbrainz@localhost:5432";
    let pool = PgPoolOptions::new()
        .max_connections(5)
        .connect(db_url)
        .await
        .unwrap();

    // Keep the app alive for as long as the polling task runs
    let poller = tokio::spawn(async move {
        poll(pool).await;
    });
    poller.await.unwrap();
}

async fn poll(pool: sqlx::PgPool) {
    // Duration can be configurable
    let mut interval = interval(Duration::from_secs(10));
    loop {
        interval.tick().await;
        if let Err(e) = poll_task(&pool).await {
            eprintln!("Error polling database: {}", e);
            // Decide what to do with the failed poll (back off, retry, etc.)
        }
    }
}

async fn poll_task(pool: &sqlx::PgPool) -> Result<(), sqlx::Error> {
    // Placeholder helpers: fetch the URL-containing rows, extract the URLs,
    // and store them in the internet_archive_urls table
    let urls = run_sql_queries_and_transform(pool).await?;
    store_urls(pool, &urls).await?;
    Ok(())
}
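As an illustration of the data flow, here is a simplified sketch of what poll_task could run. It uses a plain keyset query (WHERE id > last processed id) instead of the cursor-based Postgres function described above, and the generated EditNote struct shown in the next step; the batch size is an arbitrary placeholder.

async fn fetch_new_edit_notes(
    pool: &sqlx::PgPool,
    last_processed_id: i32,
) -> Result<Vec<EditNote>, sqlx::Error> {
    // Fetch a batch of edit notes added since the last processed row
    sqlx::query_as::<_, EditNote>(
        "SELECT id, editor, edit, text, post_time
         FROM edit_note
         WHERE id > $1
         ORDER BY id
         LIMIT 100",
    )
    .bind(last_processed_id)
    .fetch_all(pool)
    .await
}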

5. Transformations: URL-containing rows → list of URLs

The data we get from the SQL queries differs between tables, so we can't store it right away.

For example, here is edit_note.rs:

use crate::schema::types::*;
#[derive(sqlx::FromRow, Debug)]
pub struct EditNote {
  pub id: i32,
  pub editor: i32,
  pub edit: i32,
  pub text: String,
  pub post_time: Option<chrono::DateTime<chrono::Utc>>,
}

and edit_data.rs:

use crate::schema::types::*;
#[derive(sqlx::FromRow, Debug)]
pub struct EditData {
  pub edit: i32,
  pub data: serde_json::Value,
}

Once we have our results stored in the Rust structs, we can start extracting URLs from them. As we saw earlier, an edit note can contain text other than the URL, or it can contain multiple URLs, and the URL in edit_data is nested somewhere inside the JSON. We will have to perform a few transformations to get the desired outcome: a list of URLs.
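As a rough sketch of this transformation step (the URL regex and the exact key inside entity1 are assumptions for illustration; the real extraction rules will be refined per edit type):

use regex::Regex;

fn extract_urls_from_note(note_text: &str) -> Vec<String> {
    // Naive URL pattern, just for illustration; a stricter one will be needed
    let re = Regex::new(r"https?://[^\s<>]+").unwrap();
    re.find_iter(note_text)
        .map(|m| m.as_str().to_string())
        .collect()
}

fn extract_url_from_add_relationship(data: &serde_json::Value) -> Option<String> {
    // For Edit Type 90 (Add relationship), the URL is assumed to be the
    // "name" of the entity1 object in the edit data JSON
    data.get("entity1")
        .and_then(|entity| entity.get("name"))
        .and_then(|name| name.as_str())
        .map(|url| url.to_string())
}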

Once we have the list of URLs with us, we will save them in a new table.

6. Storing URLs in a new table

Now we have data to use, but we can't use it directly yet. Given the rate at which edits and edit notes are made, and the limits of the rate-limited API endpoint, it makes sense to use a buffer storage where we can keep the URLs.

create table internet_archive_urls (
    id serial,
    url text,
    job_id text, -- response returned when we make the URL save request
    from_table varchar, -- table from which the URL is taken
    from_table_id integer, -- id of the row from which the URL is taken
    created_at timestamp with time zone default now(),
    retry_count integer, -- number of retries made for the URL
    is_saved boolean
);

This can be a good schema for the table to start with.
One thing we want to make sure of is that we don't save the same URL twice in one day. So, before saving any URL to the table, we check whether a row with the same URL already exists with a created_at timestamp less than 24 hours old, and discard the duplicate.
Whenever we confirm a URL has been saved by the API, we also set its is_saved value to true. The is_saved value is helpful when, for whatever reason, the URL did not get saved and we want to retry saving it.
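A minimal sketch of such an insert-with-dedup, written with sqlx against the proposed schema (the exact query is still open to change):

async fn store_url(
    pool: &sqlx::PgPool,
    url: &str,
    from_table: &str,
    from_table_id: i32,
) -> Result<(), sqlx::Error> {
    // Insert only if the same URL was not already added in the last 24 hours
    sqlx::query(
        "INSERT INTO internet_archive_urls (url, from_table, from_table_id, retry_count, is_saved)
         SELECT $1, $2, $3, 0, FALSE
         WHERE NOT EXISTS (
             SELECT 1 FROM internet_archive_urls
             WHERE url = $1
               AND created_at > now() - interval '24 hours'
         )",
    )
    .bind(url)
    .bind(from_table)
    .bind(from_table_id)
    .execute(pool)
    .await?;
    Ok(())
}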

7. NOTIFY listeners periodically

Now comes the interesting part: sending data to the URL-archiving task. For this purpose we will use the NOTIFY command. The NOTIFY command will run periodically, every x seconds, since the work at the LISTENER's end is rate limited.

...
let notify_pool = pool.clone();
// Another async task that periodically notifies listeners
let _notifier = tokio::spawn(async move {
    // The duration can be configurable
    let mut interval = tokio::time::interval(Duration::from_secs(10));
    while !notify_pool.is_closed() {
        interval.tick().await;
        notify(&notify_pool).await;
    }
});
...

async fn notify(pool: &sqlx::PgPool) {
    // Placeholder: batch unsaved rows from internet_archive_urls and send
    // them through the notification channel
    batch_rows_from_table_and_send_through_a_channel(pool).await;
}

We will use the pg_notify(channel, payload) function to send the batched URLs to the listener, where we make the final API calls.
The pg_notify function in Postgres generates a notification event; it takes two arguments, the channel name and the payload.
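For example, a sketch of sending a batch as a JSON payload through pg_notify from the Rust side (the channel name archive_urls and the payload shape are assumptions for illustration):

async fn notify_batch(
    pool: &sqlx::PgPool,
    payload: &serde_json::Value,
) -> Result<(), sqlx::Error> {
    // pg_notify(channel, payload) delivers the payload to every session
    // currently LISTENing on the channel
    sqlx::query("SELECT pg_notify($1, $2)")
        .bind("archive_urls")
        .bind(payload.to_string())
        .execute(pool)
        .await?;
    Ok(())
}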

The idea behind using an asynchronous message-passing approach here is that the URL-saving/archiving task does not have to know about the concurrent tasks that feed it, which decouples them.
In our case, there are 2 concurrent tasks that need URLs saved: the main task polling batches of URLs, and the check_and_retry task, which checks the is_saved and job_id status of URLs and retries saving any unsaved ones. Since both of these run concurrently and are important to the app, we need a mechanism for them to communicate with the URL-saving task.
For cases like this we have either shared memory or message passing, and I think message passing, which only requires a channel, is the better option here. We could also do message passing with Rust's
std::sync::mpsc module and its send/recv methods, but since we are already dealing with Postgres, I think using the sqlx-postgres mechanism is the better option; both do the same job, and either is simpler and safer than sharing state behind Mutexes or Arcs.

8. Catching the payload at LISTENER’s end

The moment the NOTIFY command runs, any listener on the same channel receives the payload.
For this purpose we can use the sqlx-postgres PgListener struct.
This listener listens on the channel, and with the help of the recv() method we can get the payload sent by the notifier.

let mut listener = PgListener::connect_with(&pool).await?;  
println!("Start listen loop");  
listener.listen("channel_name").await?;  
loop {  
    let notification = listener.recv().await.unwrap();  
    println!("{}", notification.payload());
}

We now have a continuous supply of URLs arriving every x seconds (x being the notify delay we introduced). Finally, we can make the API calls.

9. Saving the URLs using the Wayback Machine APIs

Now we can begin saving our URLs, but we have to be careful not to cross the rate limit.
This is the API we will be interacting with:

Since the endpoint is rate limited, we need to rate limit our own requests.

The rate limit of the API is of the form M requests per N seconds (12 requests / 60 seconds), so we can use a cyclic queue of size M that tracks request times and ensures at most M requests are fired during any window of N seconds. This is the proposed idea, but we will have to play around with the rate-limiting parameters, as it's hard to know at this early stage which configuration will work.
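A minimal sketch of such an M-requests-per-N-seconds limiter, using a ring of timestamps (the 12/60 parameters follow the limits mentioned above; the real implementation is still open for discussion):

use std::collections::VecDeque;
use std::time::Duration;
use tokio::time::{sleep, Instant};

struct RateLimiter {
    window: Duration,
    max_requests: usize,
    sent_at: VecDeque<Instant>,
}

impl RateLimiter {
    fn new(max_requests: usize, window: Duration) -> Self {
        Self { window, max_requests, sent_at: VecDeque::new() }
    }

    // Wait until another request is allowed inside the current window
    async fn acquire(&mut self) {
        if self.sent_at.len() == self.max_requests {
            // The oldest timestamp decides when the next slot frees up
            let oldest = *self.sent_at.front().unwrap();
            let elapsed = oldest.elapsed();
            if elapsed < self.window {
                sleep(self.window - elapsed).await;
            }
            self.sent_at.pop_front();
        }
        self.sent_at.push_back(Instant::now());
    }
}

// Usage: let mut limiter = RateLimiter::new(12, Duration::from_secs(60));
// then call limiter.acquire().await before every Wayback Machine request.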

This IA-interacting logic can also be decoupled from the rest of the code, so we can extend it for more tasks, such as re-archiving links that artists may change later (e.g. Spotify, iTunes), which also need to be rate limited.
After we make the API call, we get a response containing a job_id, which we store in our internet_archive_urls table; the job's status can then be checked using the following API:

This endpoint is rate limited too, so we will have to be careful with it as well. Alternatively, we have the option of using this API, which checks the archival status by URL:
curl -X GET "https://archive.org/wayback/available?url=<url>".
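As a sketch of the save call itself, assuming the Save Page Now endpoint at https://web.archive.org/save with S3-style API keys (the exact endpoint, auth scheme, and response fields need to be confirmed against the Wayback Machine docs):

use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct SaveResponse {
    job_id: Option<String>,
}

async fn save_url(client: &reqwest::Client, url: &str) -> Result<SaveResponse, reqwest::Error> {
    client
        .post("https://web.archive.org/save")
        .header("Accept", "application/json")
        // Hypothetical credentials; real keys would come from configuration
        .header("Authorization", "LOW myaccesskey:mysecret")
        .form(&[("url", url)])
        .send()
        .await?
        .json::<SaveResponse>()
        .await
}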

10. Check_and_retry and Cleanup Task

There is a fair chance that a URL we wanted to save did not get saved, due to errors like internal-server-error, browsing-timeout, or bandwidth-limit-exceeded. To retry saving such URLs, and to verify the saved ones, we can run another async task periodically that checks whether each URL actually got saved.
We have a job_id associated with each URL we tried to save; the check_and_retry task uses this job_id to check whether the URL was saved or not.
Once we know a URL is saved, we mark its is_saved value true in the internet_archive_urls table. If it still did not get saved, we increase its retry_count. The retry count is necessary to limit the number of attempts.

There will also be an independently running cleanup task that checks for rows we can delete, considering the created_at and is_saved fields. Since we don't want internet_archive_urls to grow indefinitely, this cleanup has to run at regular intervals.
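A sketch of the cleanup pass, deleting rows saved more than, say, 7 days ago (the retention period is an assumption and should be configurable):

async fn cleanup(pool: &sqlx::PgPool) -> Result<u64, sqlx::Error> {
    // Remove rows whose URLs were archived long enough ago to be safe to drop
    let result = sqlx::query(
        "DELETE FROM internet_archive_urls
         WHERE is_saved = TRUE
           AND created_at < now() - interval '7 days'",
    )
    .execute(pool)
    .await?;
    Ok(result.rows_affected())
}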

Timeline

Following is the timeline for the 350-hour GSoC coding period I'll be devoting to the project:

Community Bonding Period (May 01 - May 26)

  • Discuss various engineering aspects with the mentors, including listen/notify mechanism, polling, rate limiting, error handling in different corner cases.
  • Study and understand Postgres functions, and the engineering tasks I’d be dealing with.

Week 1 (May 27 - June 2)

  • Set up the project environment by writing the Dockerfile and compose file, and set up the Rust app, which can make the API calls via reqwest, poll the database, LISTEN/NOTIFY on the new table asynchronously, retry saving unsaved URLs, and clean up the internet_archive_urls table.
  • This week’s objective would be to come up with a small, workable app, which can perform the above-mentioned tasks.
  • The polling and listen/notify mechanisms will be tested individually, with some example queries, which need not be final queries.

Week 2 (June 3 - June 9)

  • Discuss and implement database queries and constraints, like the query to poll the edit_note and edit_data tables, batch the results, and make sure the next poll starts after the last accessed row.
  • Write the queries that will run when retrying unsaved URLs and when cleaning up the internet_archive_urls table.
  • Also write the query that runs when we NOTIFY with batches from the internet_archive_urls table.
  • The objective of this week would be to come up with SQL queries that are needed across the application.

Week 3 (June 10 - June 16)

  • Work on the transformation logic to convert EditData and EditNote to plain URLs.
  • Transformations will include picking the right URL, ignoring certain domains, removing any unneeded text, and reading the URL out of the JSON.

Week 4 (June 17 - June 23)

  • Work on polling logic.
  • Integrate the SQL query into the polling logic, ensure it works correctly, and check that we get the list of URLs after the transformation function.
  • Any error handling and testing for this will also be done this week.

Week 5 (June 24 - June 30)

  • Once we get the results in the form of URLs, we will save the URLs to the internet_archive_urls table using sqlx.
  • Work on the NOTIFY method, and integrate the SQL query we will run on each notify.
  • Test and ensure batches of URLs are being sent through NOTIFY in the channel.

Week 6 (July 1 - July 7)

  • Work on the LISTEN logic using PgListener.
  • On each notification we will get URLs that we want to archive, so start work on the rate-limiting queue implementation, which enqueues URLs and makes the API calls with the necessary delay.

Week 7 (July 8 - July 14)

  • Continue work on the rate-limited queue, test locally that it works, and look for any optimizations that can be made.
  • Work on the retry/fallback logic, if the API response is not OK.

Week 8 (July 15 - July 21)

  • Work on analytics and monitoring of the app.
  • This will include integrating analytics tools, like Sentry for issue tracking and Prometheus for monitoring.
  • Add their crates to the project, write their configs, and ensure the metrics are being collected correctly.

Week 9 (July 22 - July 28)

  • This week is kept as a buffer.
  • Any outstanding issues/shortcomings will be fixed in this week, and I will make sure the app is working correctly.

Week 10 (July 29 - August 4)

  • This week will involve deploying the container to the appropriate registry and automating the container app's deployment process.
  • Write workflows for CI/CD.
  • Fix any post-deployment bugs.

Week 11/12 (August 5 - August 18)

  • Monitor the app in production, and fix production errors.
  • Publish the Rust bindings crate to crates.io.
  • Start working on other IA integration aspects.

Stretch goals (August 19 onwards )

  • If we don’t run into unexpected errors, my task at hand will be to work on 2 more IA integrations:
    1. Re-archiving links which can be changed later by artists, like Spotify, iTunes.
    2. MusicBrainz DB Dumps automation.

Detailed Information About Yourself

My name is Ashutosh Aswal. I am a senior pursuing a Bachelor’s in CSE at Punjab Engineering College, Chandigarh, India. Having contributed to Android projects at MetaBrainz before, I now want to explore new domains of tech and gain hands-on experience.

Tell us about the computer(s) you have available for working on your SoC project!

I have my HP Pavilion ec0101ax running Arch Linux, with Ryzen 5 3550H and 24 gigs of RAM.

When did you first start programming?

I began my programming journey with C++ in 11th grade.

What type of music do you listen to? (Please list a series of MBIDs as examples.)

I mostly listen to Indie music artists like The Local Train, Nikhil D’Souza, Anuj Jain etc.

What aspects of the project you’re applying for interest you the most?

I like exploring new tools and languages in my free time. Recently I started exploring Rust and wanted to build something with it. Other than that, I like how this particular project introduces different software engineering concepts which I have only read about, and I would get a chance to implement them.

Have you ever used MusicBrainz to tag your files?

Yes, I have.

Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?

I have earlier contributed to the MetaBrainz Android projects, musicbrainz-android and listenbrainz-android.
Here are my contributions to musicbrainz-android and listenbrainz-android.

What sorts of programming projects have you done on your own time?

I have prior experience in android development, and in my free time, I love to explore different tech stacks and frameworks.

  • i-remember: A CLI tool in Rust to save shell commands, batch them, and reuse them. (WIP)
  • codes-practice: A React app showing my solutions to various algorithmic and competitive programming problems.
  • Apart from them, I have many undeployed, half-baked, good-for-nothing projects I cook in free time, that you can see on my GitHub or Portfolio.

How much time do you have available, and how would you plan to use it?

I will be working ~30 hours per week on the project.


Hi yellowhatpro, thanks for your proposal! It’s definitely an exciting project. I have a few technical comments/questions.

But the library was not compatible with the musicbrainz schema, so it required some fixes.

I’d prioritize submitting your fixes upstream, because we don’t want to have to maintain an additional fork of this library for years to come. Is there a technical reason this can’t be done? I didn’t see it mentioned in the proposal.

The very first sentence on the crate’s crates.io (Rust Package Registry) page is “PR’s and Issues welcome!” :slight_smile:

The problem with sql-gen is, it does not very well map with Postgres types, which caused a lot of errors in generating the rust entities.

Their very first use case listed is: “Generate Rust structs and queries for PostgreSQL database tables or from migrations via a Postgres test container,” so it sounds like they’d definitely appreciate fixes for PostgreSQL types.

sql-gen also panicked on custom musicbrainz types, like edit_note_status, fluency, cover_art_presence, etc. As a fix, respective rust enums were added.

Perhaps the library should be modified to allow extending the supported types without having to hard-code them in the struct you linked?

For edits, there are multiple type of edits, but there are only few we want for our use case.

Since there are only a few, I think it’d make sense to mention them all in the proposal for completeness.

We can keep the track of the last processed row

Where will you store the last processed row outside of memory? That will be a problem when the container is restarted.

it makes sense to use a buffer storage, where we can keep the URLs. […] This can be a good schema for the table to start with.

What’s job_id used for?

The is_saved value can be helpful for cases when for any reason, the URL did not save, and we want to retry saving it.

Won’t you need a column indicating when the last attempt to save the URL was made, so that you don’t perform retries too frequently?

I assume this table won’t store saved URLs forever. When/how will they be cleaned up?

Now comes the interesting part, sending data to the URL-archiving code

I didn’t quite follow this section. Is the “URL-archiving code” a separate service? I didn’t see it mentioned previously. Why is a producer/consumer architecture needed here over just SELECT * FROM internet_archive_urls WHERE is_saved = FALSE ORDER BY created_at LIMIT x? Some additional information explaining the problem being solved would be helpful.

Timeline

Can you add dates to each of the weeks for clarity? :slight_smile:

Week 2

Do we need a dedicated week to come up with all SQL queries used across the application? I’d expect that the queries would be very simple and can be written as needed.


Thanks for the valuable feedback, Bitmap; it really helped me analyse the proposal.
I would like to address your questions here, and will discuss a couple on IRC, since they are more interesting to discuss.

My plan was to go with the forked one, but now I am motivated to fix the original one. Will have a discussion with jayy-lmao on this matter.

Sure, will do it soon.

I think we can keep the last processed rows in the table itself, using 2 more columns, from_table and from_table_id. Currently we are dealing with edit_note and edit_data, but later we may have more tables from which we want to save URLs. In any case, if the container restarts, we can get the last processed ids of edit_note and edit_data back from the internet_archive_urls table by checking its latest rows’ from_table and from_table_id columns.
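For example, a small sketch of recovering the last processed id after a restart, using the columns from the proposed schema:

async fn last_processed_id(pool: &sqlx::PgPool, table: &str) -> Result<i32, sqlx::Error> {
    // Highest from_table_id we have already ingested for the given source table
    let row: (Option<i32>,) = sqlx::query_as(
        "SELECT max(from_table_id) FROM internet_archive_urls WHERE from_table = $1",
    )
    .bind(table)
    .fetch_one(pool)
    .await?;
    Ok(row.0.unwrap_or(0))
}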

We get a job_id in the response from the URL-saving API. Since saving a URL takes time, the response returns a job_id, which we can later use to check the save status of the URL. My plan is to use a concurrent task to check whether the URL got saved or not.

Thanks for pointing this out, we can keep a column to track the status of retries. We could add a last_attempt timestamp, or a retry_count for the number of retries, or both. We can discuss this on IRC.

Another concurrent task will run at some interval to clean the table. We can discuss on IRC the parameters we may have to check before deleting a row.

Sorry for the bad choice of words :sweat_smile:. By “URL-archiving code”, I mean the task that will be making the API calls to save URLs.
Regarding the producer/consumer architecture, tbh I did not think we could do this with just a SQL query. Earlier, in discussion with yvanzo, I planned to use Redis as the source of URLs. But we later discussed Postgres queues, and then I found the NOTIFY/LISTEN mechanism for async communication. I thought this was it, so I built my proposal around it. You are right, it can be just a SQL query as well. But on further thought, I think we can still use the NOTIFY/LISTEN mechanism for our use case. Remember we also have a task that checks for unsaved URLs, running concurrently with our main task. This task will first check whether the URL has been saved (taking into account factors like the last retry); if not, it can simply notify the channel and pass the unsaved URL as the payload. On the listen side we have our URL-saving/archiving task, which has nothing to do with the source of the URL; it just has to save it. So I think it decouples the whole URL-saving logic from the rest of the code.
I am not really sure if this is best practice, or if it is overkill, or if I am just imagining things, but we can discuss it on IRC to reach a conclusion.

Yessir, soon will.

Frankly, I suck at writing queries. The main reason for a week of SQL work is to properly write, test, and make sure each query involved works fine. You’re right, it might take just a day or 2 at most to come up with the queries. I am just taking into account any unexpected issues I might face, since there are several cases involved, like polling the edit_data and edit_note tables and batching the results, the notify task running on the internet_archive_urls table, and the cleanup and retry tasks, each with different constraints and corner cases, and since it’s me doing the job, there are good chances some random unexpected error/issue will come up. I’m sure of it. Hence a week-long SQL task :sweat:.
