[DMP 2025] Proposal: BrainzPlayer Internet Archive Integration #499

veldora · May 5, 2025, 12:18pm

Mohammad Shahnawaz
Metabrainz: veldora
discord : veldora

Summary:

BrainzPlayer is the custom React component in ListenBrainz that searches for and plays tracks from multiple sources (e.g. YouTube, Spotify), using several data sources to facilitate track playback.
This project aims to let users play audio files from the Internet Archive (IA) by integrating it into BrainzPlayer. This integration will help ListenBrainz (LB) users discover and play songs that might not be available on other platforms.

My approach involves creating an Internet Archive indexer. It will extract the metadata and URLs of audio files from the Internet Archive using IA’s API and store them in a database. The indexer will stay synchronized: it will index metadata items from IA and monitor them for changes.
Then an API will be implemented to search the indexed data and extract the required media URL for a track. This API will be used by InternetArchivePlayer to play the track.

For the frontend, I will create an InternetArchivePlayer similar to the existing players in BrainzPlayer. It will search the indexed data using the API, find the relevant audio file URL, and play the track using an HTML5 audio element in BrainzPlayer.

This project lets ListenBrainz play songs that are not readily available elsewhere, combining backend indexing with a lightweight HTML5 audio player.

Project Overview

This project involves developing an Internet Archive player and integrating it with BrainzPlayer. It will connect ListenBrainz with the vast audio collection of the Internet Archive, giving users more options for playback in BrainzPlayer.

The key goals of this project are:

Develop an indexer to index music metadata items and media file URLs from the Internet Archive.
Develop an API for BrainzPlayer to search this index for media files.
Integrate the API with BrainzPlayer to play the relevant track.
Enable the indexer to index metadata items in the Internet Archive and monitor them for changes.

Understanding of the Project

Phase 1: Implement Internet Archive Indexer

Part 1:
Currently ListenBrainz has indexers (e.g. Spotify, Apple, SoundCloud) in listenbrainz/metadata_cache to extract track metadata from external sources.
In this part we will implement an indexer for Internet Archive data.
The Internet Archive (IA) has a vast repository of music and offers APIs and a Python library to search and retrieve metadata and media URLs. Using the IA Python library, we will extract the URLs of tracks along with their metadata.

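from urllib.parse import quote

from internetarchive import search_items, get_item

AUDIO_EXTENSIONS = ('.mp3', '.ogg', '.flac', '.wav')

# Collect direct download URLs for every audio file in the collection.
search_results = search_items('collection:78rpm')
urls = []
for result in search_results:
    identifier = result['identifier']
    item = get_item(identifier)  # fetches the item's metadata and file list
    for file in item.files:
        name = file.get('name')
        if name and name.lower().endswith(AUDIO_EXTENSIONS):
            # File names can contain spaces, so percent-encode them.
            url = f"https://archive.org/download/{identifier}/{quote(name)}"
            urls.append(url)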

Each URL will be stored together with its track metadata. The identifier for the track will be modeled on existing identifiers such as spotify:track:<spotify_id>.

{
  "title": "Song Name",
  "artist": "Artist Name",
  "collection": "78rpm",
  "url": "https://archive.org/download/...",
  "identifier": "InternetArchive:track:<IA_identifier>"
}

Part 2: Store the formatted data in Redis.
In this part we will use existing ListenBrainz utilities, e.g. brainzutils.cache.

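from brainzutils import cache

def store_ia_recording(metadata: dict) -> None:
    # Store one formatted track entry in Redis under its identifier.
    identifier = metadata.get("identifier")
    if not identifier:
        raise ValueError("metadata is missing an identifier")
    redis_key = f"ia:recording:{identifier}"
    cache.set(redis_key, metadata)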

Phase 2: Develop Search API in ListenBrainz

In this phase we will create a new API endpoint that will be able to:

Search the indexed data stored in Redis for the requested item.
Return the result of the query as JSON, including the URL of the track.

Part 1: Create a new API route /1/internet_archive/search.
This API route will be implemented in listenbrainz/webserver/views/internet_archive_api.py.
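
A rough sketch of the lookup this route could perform, kept framework-free so it runs on its own. The function name `search_index`, the parameters `artist`, `title`, and `limit`, and the in-memory `index` dict standing in for the Redis-backed store are all assumptions made for illustration, not part of the existing ListenBrainz code.

```python
from typing import Dict, List, Optional


def search_index(
    index: Dict[str, dict],
    artist: Optional[str] = None,
    title: Optional[str] = None,
    limit: int = 10,
) -> dict:
    """Return entries whose artist/title contain the query text (case-insensitive)."""

    def matches(value: Optional[str], query: Optional[str]) -> bool:
        # A missing query parameter matches everything.
        return query is None or query.lower() in (value or "").lower()

    results: List[dict] = []
    for entry in index.values():
        if matches(entry.get("artist"), artist) and matches(entry.get("title"), title):
            results.append(entry)
            if len(results) >= limit:
                break
    # Same top-level shape as the JSON response described in this proposal.
    return {"results": results}


# Hypothetical sample data standing in for the indexed recordings.
sample_index = {
    "ia:recording:a": {"title": "Blue Moon", "artist": "Artist One"},
    "ia:recording:b": {"title": "Moonlight", "artist": "Artist Two"},
}
response = search_index(sample_index, title="moon", limit=5)
```

A linear scan like this is only meant to show the request and response shape; how the real lookup is done depends on the storage decision discussed under Potential Issues.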

Phase 3: Integrate indexer with BrainzPlayer

In this phase we will extend the current BrainzPlayer functionality to support the new InternetArchivePlayer.

Part 1:
Create an Internet Archive player, similar to the current players, in frontend/js/src/common/brainzplayer/InternetArchivePlayer.tsx.
This React component will query the new internet_archive_api to get track metadata and URL.

Part 2: Create a custom audio player for InternetArchivePlayer
In this part we will create a custom audio player in frontend/js/src/common/brainzplayer/AudioPlayer.tsx that plays the audio file using the HTML5 audio element. This gives us a simple audio player without the wrappers that the existing players (e.g. Spotify, Apple, SoundCloud) use.

import React from "react";

interface Props {
  audioUrl: string;
  onEnded: () => void;
  onPause: () => void;
  onPlay: () => void;
  onTimeUpdate: (currentTime: number, duration: number) => void;
  onError?: () => void;
}

export const AudioPlayer = React.forwardRef<HTMLAudioElement, Props>(
  ({ audioUrl, onEnded, onPause, onPlay, onTimeUpdate, onError }, ref) => {
    return (
      <audio
        ref={ref}
        src={audioUrl}
        controls
        onEnded={onEnded}
        onPause={onPause}
        onPlay={onPlay}
        onError={onError}
        onTimeUpdate={(e) => {
          onTimeUpdate(e.currentTarget.currentTime, e.currentTarget.duration);
        }}
      />
    );
  }
);

Part 3: Integrate InternetArchivePlayer in BrainzPlayer
In this part we will extend the existing BrainzPlayer to handle the new InternetArchivePlayer.

We will make changes in frontend/js/src/common/brainzplayer/BrainzPlayer.tsx

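// When checking whether a listen belongs to a data source:
if (datasource instanceof InternetArchivePlayer) {
  return InternetArchivePlayer.isListenFromThisService(listen);
}

// When building the list of data sources:
switch (key) {
  case "InternetArchive":
    dataSources.push(InternetArchivePlayerRef);
    break;
  // cases for the existing players stay unchanged
}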

Phase 4: Testing and Error Handling

In this phase I will comprehensively test the indexer and API and add error handling. I will also check the functionality of the player.
I will also discuss any potential changes with my mentor.

Macro-Implementation Details with timelines:

Compile project details and Community Bonding. (Week 1)

Milestone 1: Develop an Indexer for Internet Archive (Weeks 1-5)

Deliverables:
An indexer similar to current ones in listenbrainz/metadata_cache with these functions:

Search the Internet Archive collection: search_ia_items(page: int). (Weeks 1-2)
Extract URLs from the search results: extract_urls(identifier: str). (Week 3)
Format track metadata for storage in the database: format_track(item: Dict). (Week 4)
Store track metadata in the database using Redis: store_track_db(entry: Dict). (Week 5)

Milestone 2: Develop API for ListenBrainz (Week 6-8)

Deliverables:

New API endpoint in ListenBrainz: /1/internet_archive/search. (Week 6)
Accept query parameters, e.g. artists, title. (Week 7)
Query the indexed database and search for matching items. (Week 8)
Return the JSON result of the query. (Week 8)

{
  "results": [
    {
      "title": "Song Name",
      "artist": "Artist Name",
      "collection": "78rpm",
      "url": "https://archive.org/download/...",
      "identifier": "InternetArchive:track:<IA_identifier>"
    }
  ]
}

The indexer is able to monitor changes in the IA database: sync_IA()

Milestone 3: Integrate Internet Archive tracks with BrainzPlayer (Week 8-11)

In this milestone we will modify the current BrainzPlayer in frontend/js/src/common/brainzplayer.
It will query the IA search endpoint (/1/internet_archive/search).
Update BrainzPlayer to add an HTML5 audio element to play audio.

Milestone 4: Testing and Error Handling ()

During this period I will do comprehensive testing and get feedback from my mentor on any code changes and error handling.
Deliverable:
BrainzPlayer will be able to play recordings from the Internet Archive without any errors.

Potential Issues

1. Redis Search Limitations and Scalability

Issue: Redis may not be designed for large-scale fuzzy searches or full-text queries across thousands of keys. This can become a bottleneck as the number of indexed recordings grows.
Support Needed:
- Guidance from the ListenBrainz team on the expected scale and whether it’s acceptable to query Redis with keys() or switch to an alternate lightweight search solution (e.g., RediSearch or SQLite).
Possible Solutions:
- Use limited namespaces (ia:recording:) and apply pagination or caching on results.
- If scale becomes a concern, switch to a batched index export and query from a local in-memory structure.
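
As one illustration of avoiding a scan over every key, here is a minimal sketch of a token-based inverted index with paginated results. The class name `InMemoryIndex`, its methods, and the choice to index only `title` and `artist` are assumptions for this example; it is not how ListenBrainz currently searches anything.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InMemoryIndex:
    """Maps each token to the set of record keys that contain it."""

    def __init__(self) -> None:
        self.tokens: Dict[str, Set[str]] = defaultdict(set)
        self.records: Dict[str, dict] = {}

    def add(self, key: str, record: dict) -> None:
        self.records[key] = record
        for field in ("title", "artist"):
            for token in tokenize(record.get(field) or ""):
                self.tokens[token].add(key)

    def search(self, query: str, offset: int = 0, limit: int = 10) -> List[dict]:
        """Return records containing every query token, one page at a time."""
        query_tokens = tokenize(query)
        if not query_tokens:
            return []
        # .get() avoids creating empty entries for unknown tokens.
        keys = set.intersection(*(self.tokens.get(t, set()) for t in query_tokens))
        ordered = sorted(keys)  # stable order so pages are repeatable
        return [self.records[k] for k in ordered[offset:offset + limit]]


index = InMemoryIndex()
index.add("ia:recording:1", {"title": "Blue Moon", "artist": "Artist One"})
index.add("ia:recording:2", {"title": "Blue Skies", "artist": "Artist Two"})
page = index.search("blue", offset=0, limit=1)
```

This only handles exact token matches. Fuzzy matching would need something more, which is part of why guidance on RediSearch or SQLite is requested above.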

2. Metadata Inconsistencies from Internet Archive

Issue: Internet Archive metadata can vary in format (missing fields, inconsistent naming, etc.).
Support Needed:
- Help with setting up fallback or canonical mapping rules—possibly guidance on how similar issues were handled in other metadata indexers in ListenBrainz.
Possible Solutions:
- Define a normalization layer in the indexer to ensure consistent keys (title, artist, identifier, audio_url, etc.).
- Maintain a validation schema and log skipped or malformed entries.
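
A minimal sketch of what such a normalization layer might look like. The helper names `first_value` and `normalize_entry`, the fallback from `artist` to `creator`, and the choice of required keys are assumptions for illustration; the actual mapping rules would be agreed with the ListenBrainz team.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

REQUIRED_KEYS = ("title", "artist", "identifier", "audio_url")


def first_value(value) -> Optional[str]:
    """Metadata fields can hold a string or a list of strings; return one string."""
    if isinstance(value, list):
        value = value[0] if value else None
    if value is None:
        return None
    text = str(value).strip()
    return text or None


def normalize_entry(raw: dict, audio_url: str) -> Optional[dict]:
    """Map raw item metadata onto consistent keys, or return None if unusable."""
    entry = {
        "title": first_value(raw.get("title")),
        # Assumption: fall back to "creator" when "artist" is missing.
        "artist": first_value(raw.get("artist")) or first_value(raw.get("creator")),
        "identifier": first_value(raw.get("identifier")),
        "audio_url": audio_url,
    }
    missing = [key for key in REQUIRED_KEYS if not entry[key]]
    if missing:
        # Log and skip malformed entries instead of storing partial data.
        logger.warning("Skipping %s: missing %s", raw.get("identifier"), missing)
        return None
    return entry


good = normalize_entry(
    {"identifier": "item1", "title": ["Song Name"], "creator": "Artist Name"},
    "https://archive.org/download/item1/track.mp3",
)
bad = normalize_entry({"identifier": "item2"}, "https://archive.org/download/item2/track.mp3")
```

Here `good` is a complete entry, while `bad` is `None` because the title and artist are missing.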

3. Changes in Internet Archive APIs or File Structure

Issue: The structure of files or the API response from Internet Archive might change or contain edge cases that break the indexer.
Support Needed:
- Input on how tightly to couple the indexer with IA APIs—should the IA CLI tool (internetarchive Python package) be preferred for future-proofing?
Possible Solutions:
- Use the official IA Python CLI/toolkit (internetarchive) to abstract low-level API changes.
- Set up unit tests on metadata extraction to catch schema drift early.

4. Playback Compatibility in BrainzPlayer

Issue: Some recordings may be in formats that not every browser can play through HTML5 audio (e.g. .flac, .ogg), and some links may require redirection or CORS headers.
Support Needed:
- Help from the frontend team to test multiple IA formats and ensure graceful fallback.
Possible Solutions:
- Prefer .mp3 links when available.
- Implement a client-side MIME type check in BrainzPlayer and log/play only compatible formats.
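
On the indexer side, the "prefer .mp3" rule could be applied before a URL is ever stored. The sketch below assumes a file list shaped like the one used in the indexing snippet earlier (dicts with a `name` key); the function name `pick_playable_file` and the preference order are hypothetical.

```python
from typing import List, Optional

# Assumed preference order: most widely playable format first.
PREFERRED_EXTENSIONS = (".mp3", ".ogg", ".flac", ".wav")


def pick_playable_file(files: List[dict]) -> Optional[str]:
    """Return the file name with the most preferred extension, or None."""
    best_name = None
    best_rank = len(PREFERRED_EXTENSIONS)
    for file in files:
        name = file.get("name") or ""
        for rank, ext in enumerate(PREFERRED_EXTENSIONS):
            if name.lower().endswith(ext) and rank < best_rank:
                best_name, best_rank = name, rank
    return best_name


files = [{"name": "track01.flac"}, {"name": "track01.mp3"}, {"name": "cover.jpg"}]
chosen = pick_playable_file(files)
```

This simplification picks one file per item. An item that holds several tracks would need the files grouped per track first.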

5. Monitoring for Updates to Internet Archive Collection

Issue: The IA collection is dynamic. New recordings are added and sometimes updated.
Support Needed:
- Advice on whether a full reindex strategy is acceptable or if ListenBrainz has a preferred method for incremental updates.
Possible Solutions:
- Add a lightweight scheduler (e.g., cron) to re-fetch recently modified items.
- Use the IA API’s metadata change logs to limit reprocessing.
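
The incremental approach can be sketched as a checkpoint comparison: keep the time of the last sync and reprocess only items changed since then. The `updated` field and the function name `select_changed` are hypothetical stand-ins; which timestamp the Internet Archive actually exposes for this would need to be confirmed.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Tuple


def select_changed(
    items: Iterable[dict], last_sync: datetime
) -> Tuple[List[dict], datetime]:
    """Return items modified after last_sync and the new checkpoint to store."""
    changed = []
    newest = last_sync
    for item in items:
        # "updated" is a hypothetical field holding an ISO 8601 timestamp.
        updated = datetime.fromisoformat(item["updated"])
        if updated > last_sync:
            changed.append(item)
            newest = max(newest, updated)
    return changed, newest


last_sync = datetime(2025, 5, 1, tzinfo=timezone.utc)
items = [
    {"identifier": "item1", "updated": "2025-04-20T10:00:00+00:00"},
    {"identifier": "item2", "updated": "2025-05-03T08:30:00+00:00"},
]
changed, checkpoint = select_changed(items, last_sync)
```

A scheduled job (e.g. cron) could call this on each run and persist `checkpoint` for the next one.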

6. Security or Abuse of the API Endpoint

Issue: Search endpoint may be spammed or queried heavily, impacting Redis performance.
Support Needed:
- Direction on API rate limiting—whether existing ListenBrainz middleware supports this.
Possible Solutions:
- Add simple caching or throttling logic to /1/internet_archive/search.
- Implement a max_limit for API calls.
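
A small sketch of both ideas, assuming max_limit means a cap on the number of results per request. The constants, `clamp_limit`, and `FixedWindowThrottle` are illustrative names only; if existing ListenBrainz rate-limiting middleware covers the endpoint, that would be preferable to anything hand-rolled like this.

```python
import time
from typing import Callable, Dict, Tuple

DEFAULT_LIMIT = 10  # assumed default page size
MAX_LIMIT = 100     # assumed upper bound (the max_limit)


def clamp_limit(raw_limit, default: int = DEFAULT_LIMIT, maximum: int = MAX_LIMIT) -> int:
    """Parse a user-supplied limit, falling back to the default and capping at the maximum."""
    try:
        limit = int(raw_limit)
    except (TypeError, ValueError):
        return default
    if limit < 1:
        return default
    return min(limit, maximum)


class FixedWindowThrottle:
    """Allows at most max_requests per client in each window of window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.windows: Dict[str, Tuple[float, int]] = {}

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        start, count = self.windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # window expired, start a new one
        if count >= self.max_requests:
            self.windows[client_id] = (start, count)
            return False
        self.windows[client_id] = (start, count + 1)
        return True


throttle = FixedWindowThrottle(max_requests=2, window_seconds=60)
first, second, third = (throttle.allow("client-a") for _ in range(3))
```

With two requests allowed per window, `first` and `second` are `True` and `third` is `False`.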


sound.and.vision · May 5, 2025, 8:22pm

I’m going to stick a bit of a cat amongst the pigeons here; do we want to actually do this considering the hot water that IA are in for illegally providing music?

I’ve written about this a bit on Reddit, and as I stated there - please do not think I’m a big hater of the Internet Archive, I think it serves a very true and real purpose; but they’ve ended up getting the backs up of the RIAA recently for allowing copyrighted music to be distributed (either via stream or download) without suitable license.

I definitely know and understand (like I hope everyone else here does) that Internet Archive is not a piracy first website, however plenty of people often use it in that manner; uploading things they well and truly know are in copyright and could potentially cause a lot of legal headaches for the organization and its employees.

To that effect the current court case that is on-going is actually from material that the organization itself published to the website, namely part of its 78’s archiving project - this is much like the same situation that they recently had with book publishers from a misguided approach during COVID-19 allowing extended and seemingly unlimited “lending” of digital renditions of various printed books (again well within copyright).

Just my two cents.

sound.and.vision · May 5, 2025, 8:29pm

Adding to this, assuming that the case is resolved (which it probably will do in an out-of-court undisclosed settlement) it is likely that only “snippets” of recordings may be made available; certainly of those provided by the organization itself - this is what they used to provide with their CD audio archive (which is still valuable for high resolution artwork scans at least).

Are those snippets even useful for the BrainzPlayer needs?