Importing listening history files: GSoC-2025 application


Proposed Mentor: Lucifer/Mayhem

About me

Name: Mohammad Shahnawaz

Nickname: veldora (IRC, Discord)

GitHub: Mshahnawaz1

Email: shahnawaz919956@gmail.com

Time zone: UTC+5:30

My PRs: Check out

Project Proposal

Problem Statement

ListenBrainz has a feature to import listening history from platforms like Last.fm and Spotify, but it doesn’t allow users to import listening history from exported files such as JSON (Spotify), CSV (Apple Music, Last.fm, Spinitron), JSONL (LB), and ZIP archives, which is inconvenient for many users.

Solution Overview

I will develop an API endpoint that validates the uploaded files, processes their data, and imports the listens into the user’s ListenBrainz account. The processing will be done in the background, so users don’t have to keep their browser open for the duration of the import.

Implementation

The basic workflow of the API is as follows:

Overview of workflow

I will create an API endpoint to get input from the user.

After that, I will create a new background task for imports, which will process these files in the background.

After that, I will implement the functions to process data from formats like JSON, CSV, JSONL, and ZIP into LB format.

Then I will add these listens to the queue to be stored in the database.

Finally, to integrate this with the user, we will create a UI to collect the user’s input.

Part 1: Create an API Endpoint

In this part, we will create an API endpoint at settings/import-listens-files:

# path: listenbrainz/webserver/views/settings.py
from flask import jsonify, request
from flask_login import current_user

@settings_bp.post("/import-listens-files/")
@api_login_required
def import_listens_file():
    if "file" not in request.files:
        return jsonify({"error": "no file provided"}), 400
    file = request.files["file"]
    start_date = request.form.get("start_date")
    end_date = request.form.get("end_date")

    validate(file)  # check the size and format of the file
    temp_file = save_file_temp(file)  # save the file temporarily
    # create a background task and record the import in user_data_import
    result = create_import_task(current_user.id, temp_file, start_date, end_date)

    return jsonify(result)
  1. The validate function will check the file size (< 100 MB) and the file format (JSONL, JSON, CSV, ZIP); a sketch of it follows item 2 below.

  2. The save_file_temp function will save the file temporarily:

filename = secure_filename(file.filename)
temp_path = os.path.join(TEMP_DIR, f"upload_{uuid.uuid4().hex}_{filename}")
file.save(temp_path)
return temp_path
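
A minimal sketch of the validate helper (the constants and the error-raising contract are placeholders, not a final API):

import os

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB limit from the proposal
ALLOWED_EXTENSIONS = {"json", "jsonl", "csv", "zip"}

def validate(file):
    """Reject uploads that are too large or have an unsupported extension."""
    extension = file.filename.rsplit(".", 1)[-1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file format: {extension}")
    file.seek(0, os.SEEK_END)  # FileStorage proxies seek/tell to its stream
    size = file.tell()
    file.seek(0)
    if size > MAX_FILE_SIZE:
        raise ValueError("file exceeds the 100 MB limit")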
  3. The create_import_task function will manage the import process, similar to create_export_task: it will record the saved temporary file, validate its contents, and initiate the import, while ensuring that any errors encountered are logged for later review. A sketch of this function follows the SQL below.

  • It will insert a row into the user_data_import table.

  • The schema for user_data_import will be:

CREATE TABLE user_data_import (
    id          SERIAL PRIMARY KEY,     -- column types are illustrative,
    user_id     INTEGER NOT NULL,       -- modeled on user_data_export
    service     TEXT,
    status      TEXT,
    progress    TEXT,
    filename    TEXT,
    created_at  TIMESTAMP WITH TIME ZONE DEFAULT NOW()
)
  • The create_import_task function will also insert the import task into the background_tasks table.

    • Here, task will be “import_listens_data_files”.
    • And metadata will contain the import_id.
INSERT INTO background_tasks (user_id, task, metadata) 
VALUES (:user_id, :task, :metadata) 
ON CONFLICT DO NOTHING 
RETURNING id
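
A rough sketch of create_import_task, modeled on create_export_task (the module path, column list, and return shape are my assumptions):

import json
import sqlalchemy

from listenbrainz import db

def create_import_task(user_id, filename, start_date, end_date):
    """Record a new import in user_data_import and enqueue a background task."""
    with db.engine.begin() as connection:
        # record the import so its progress can be reported to the user
        result = connection.execute(sqlalchemy.text("""
            INSERT INTO user_data_import (user_id, status, filename)
                 VALUES (:user_id, 'waiting', :filename)
              RETURNING id
        """), {"user_id": user_id, "filename": filename})
        import_id = result.scalar_one()

        # enqueue the background task that will process the file
        connection.execute(sqlalchemy.text("""
            INSERT INTO background_tasks (user_id, task, metadata)
                 VALUES (:user_id, :task, :metadata)
            ON CONFLICT DO NOTHING
        """), {
            "user_id": user_id,
            "task": "import_listens_data_files",
            "metadata": json.dumps({
                "import_id": import_id,
                "start_date": start_date,
                "end_date": end_date,
            }),
        })
    return {"import_id": import_id, "status": "waiting"}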

Part 2: Processing the data files in the background

In this part, we will implement the file validations and conversions that run in the background, similar to export_data.

After the listens are converted to the LB schema and augmented, they will be sent to RabbitMQ through the queue connection.

  1. Implement a function to check the file type and schema in order to select which importer to use; a sketch of this dispatch follows the list below.

  2. This function will also check for ZIP files and extract the files containing the listens. These files will then be processed by their respective importers.

Importers: SpotifyImporter, AppleMusicImporter, LastFmImporter, LBImporter.

  3. These importers will validate the file schema and convert the listens into the ListenBrainz schema.

  4. The converted listens will then be added to the queue in batches of 1000 listens and imported into the database using the _send_listens_to_queue function.

  5. Then we will update the progress status in our user_data_import table.
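
A rough sketch of this dispatch and batching, assuming hypothetical helpers extract_data_files and update_import_progress (the importer classes are defined in Part 3 below):

import zipfile

BATCH_SIZE = 1000  # listens per batch sent to the queue

IMPORTERS = {
    "spotify": SpotifyImporter,
    "apple_music": AppleMusicImporter,
    "lastfm": LastFmImporter,
    "listenbrainz": LBImporter,
}

def process_import_file(import_id, path, service, start_date, end_date):
    """Detect the file type, run the matching importer, and enqueue the listens."""
    if zipfile.is_zipfile(path):
        paths = extract_data_files(path)  # hypothetical: unpack the data files
    else:
        paths = [path]

    all_listens = []
    for p in paths:
        importer = IMPORTERS[service](p)
        listens = importer.validate(p)         # parse and validate the raw file
        listens = importer.transform(listens)  # convert to the LB listen schema
        listens = importer.filter_date(listens, start_date, end_date)
        all_listens.extend(listens)

    # enqueue in batches of 1000 and record progress after each batch
    for i in range(0, len(all_listens), BATCH_SIZE):
        _send_listens_to_queue(all_listens[i:i + BATCH_SIZE])
        update_import_progress(import_id, done=i + BATCH_SIZE)  # hypothetical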

Part 3: Implement the importers to convert listens to the LB schema

In this part, we will create an importer for each music service’s file format.

from abc import ABC, abstractmethod

class ListenImporter(ABC):
    def __init__(self, filename):
        self.filename = filename
        self.listens = []

    @abstractmethod
    def validate(self, filename) -> list[dict]:
        """validate the file and schema and output listens"""
        pass

    @abstractmethod
    def transform(self, listens) -> list[dict]:
        """transform the listens into LB format"""
        pass

    def filter_date(self, listens, start_date, end_date) -> list[dict]:
        # implement filters for the start and end dates of imports
        pass

class SpotifyImporter(ListenImporter):
    def validate(self, filename):
        pass

    def transform(self, listens):
        pass
  1. Create a function to convert LB exports into a format suitable for import.

In this function, we will read the JSONL file, extract the listens, and clean them by removing the inserted_at field; a sketch of this importer follows the sample listen below.

# Sample cleaned listen to import (LB import format)
{'listened_at': 1741758725,
 'track_metadata': {'track_name': 'Pehli Nazar Mein',
  'artist_name': 'Atif Aslam',
  'mbid_mapping': {'caa_id': 14432024376,
   'artists': [{'artist_mbid': '2c26fddb-3926-4004-ae27-22a3896a4f26',
     'join_phrase': '',
     'artist_credit_name': 'Atif Aslam'}],
   'artist_mbids': ['2c26fddb-3926-4004-ae27-22a3896a4f26'],
   'release_mbid': '00942d0e-73ec-48a3-8389-0139feb502e8',
   'recording_mbid': '44b4ed20-a0e6-464e-96de-c95b134c0c60',
   'recording_name': 'Pehli Nazar Mein',
   'caa_release_mbid': '00942d0e-73ec-48a3-8389-0139feb502e8'},
  'release_name': 'Race',
  'recording_msid': '12105304-a169-4538-9200-5a459be823fa',
  'additional_info': {'origin_url': 'https://www.youtube.com/watch?v=fs7-8M1VbZU',
   'duration_ms': 216021,
   'media_player': 'BrainzPlayer',
   'music_service': 'youtube.com',
   'submission_client': 'BrainzPlayer',
   'music_service_name': 'youtube'},
  'brainzplayer_metadata': {'track_name': 'Arijit Singh: Agar Tum Sath Ho | Alka Yagnik, A.R. Rehman, Irshad Kamil'}}}
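
A minimal sketch of the LB importer, assuming one JSON listen per line in the JSONL export:

import json

class LBImporter(ListenImporter):
    def validate(self, filename):
        """Read one listen per line from the JSONL export file."""
        listens = []
        with open(filename, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    listens.append(json.loads(line))
        return listens

    def transform(self, listens):
        """Drop export-only fields so the listens can be re-submitted."""
        for listen in listens:
            listen.pop("inserted_at", None)
        return listens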
  2. Create a SpotifyImporter class with functions to validate and transform Spotify listens into the LB format.

I will add validation for Spotify listens: a play will count as a valid listen only if it was played for at least half of the track or for 4 minutes.

Then I will create a function to transform these listens into the LB format. A sample entry from a Spotify streaming history export:

{'ts': '2020-12-23T14:51:53Z',
 'platform': 'Android OS 7.1.1 API 25 (samsung, SM-G550FY)',
 'ms_played': 163171,
 'conn_country': 'IN',
 'ip_addr': IP_address,
 'master_metadata_track_name': 'Demons',
 'master_metadata_album_artist_name': 'Alec Benjamin',
 'master_metadata_album_album_name': 'These Two Windows',
 'spotify_track_uri': 'spotify:track:57zRWXTQCFRV3zwg0NR8Ck',
 'episode_name': None,
 'episode_show_name': None,
 'spotify_episode_uri': None,
 'audiobook_title': None,
 'audiobook_uri': None,
 'audiobook_chapter_uri': None,
 'audiobook_chapter_title': None,
 'reason_start': 'trackdone',
 'reason_end': 'trackdone',
 'shuffle': False,
 'skipped': False,
 'offline': False,
 'offline_timestamp': None,
 'incognito_mode': False}

The conversion will be as follows:

ts → listened_at (UNIX)  
master_metadata_track_name → track_metadata.track_name  
master_metadata_album_artist_name → track_metadata.artist_name  
master_metadata_album_album_name → track_metadata.release_name  
spotify_track_uri → track_metadata.additional_info.spotify_id  
track_metadata.additional_info.media_player = 'Spotify'
track_metadata.additional_info.music_service = 'spotify.com'

I have taken kellnerd/elbisaur as a reference.
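
Filling in the SpotifyImporter stub from Part 3, a rough sketch of the validation rule and the mapping above (the export does not include the track length, so the half-track check is optional here):

import json
from datetime import datetime

class SpotifyImporter(ListenImporter):
    def validate(self, filename):
        """Load the streaming-history JSON file (full schema checks omitted)."""
        with open(filename, encoding="utf-8") as f:
            return json.load(f)

    def is_valid_listen(self, entry, track_length_ms=None):
        """Count a play that ran for half the track or at least 4 minutes."""
        played = entry.get("ms_played", 0)
        if track_length_ms:
            return played >= track_length_ms / 2 or played >= 240_000
        return played >= 240_000

    def transform(self, listens):
        converted = []
        for entry in listens:
            if not self.is_valid_listen(entry):
                continue
            ts = datetime.fromisoformat(entry["ts"].replace("Z", "+00:00"))
            converted.append({
                "listened_at": int(ts.timestamp()),
                "track_metadata": {
                    "track_name": entry["master_metadata_track_name"],
                    "artist_name": entry["master_metadata_album_artist_name"],
                    "release_name": entry["master_metadata_album_album_name"],
                    "additional_info": {
                        "spotify_id": entry["spotify_track_uri"],
                        "media_player": "Spotify",
                        "music_service": "spotify.com",
                    },
                },
            })
        return converted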

  3. Create an importer for the Last.fm export file.

The Last.fm exports produced by benjaminbenben’s tool contain the listen data in the order [artist, album, title, datetime] in CSV format, so the rows can be mapped directly to LB listens.
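
A minimal sketch, assuming benjaminbenben’s column order and “01 Jan 2024 10:15”-style timestamps (the exact timestamp format is an assumption):

import csv
from datetime import datetime, timezone

class LastFmImporter(ListenImporter):
    def validate(self, filename):
        """Read rows of [artist, album, title, datetime] from the CSV export."""
        with open(filename, newline="", encoding="utf-8") as f:
            return [row for row in csv.reader(f) if len(row) == 4]

    def transform(self, rows):
        converted = []
        for artist, album, title, listened in rows:
            ts = datetime.strptime(listened, "%d %b %Y %H:%M")  # assumed format
            converted.append({
                "listened_at": int(ts.replace(tzinfo=timezone.utc).timestamp()),
                "track_metadata": {
                    "artist_name": artist,
                    "track_name": title,
                    "release_name": album,
                },
            })
        return converted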

  4. Apple Music importer. A sample of the expected CSV:
Title, Artist, Album, Duration (ms), Play Date, Play Count, Is Complete, Source, Device, Genre, Explicit, Track ID, Album ID, Artist ID
"Shape of You","Ed Sheeran","÷ (Divide)",233713,"2024-03-19T10:15:45Z",3,TRUE,"Apple Music Streaming","iPhone 13","Pop",FALSE,"1234567890","1234567800","111222333"
"Uptown Funk","Mark Ronson","Uptown Special",269000,"2024-03-19T11:45:30Z",1,TRUE,"Apple Music Streaming","iPad Air","Funk",FALSE,"0987654321","0987654300","444555666"

In this importer, I will map the following fields directly into the LB format:

Title, Artist, Album, Duration (ms), Play Date (converted to a UNIX timestamp)

Then I will add the additional information for the listens:

track_metadata.additional_info.music_service = 'Apple Music Player'
track_metadata.additional_info.device = Device
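
A minimal sketch of this mapping, using the column names from the sample above (the sample header is illustrative, not a confirmed Apple Music schema):

import csv
from datetime import datetime

class AppleMusicImporter(ListenImporter):
    def validate(self, filename):
        """Read the play-history rows keyed by the CSV header."""
        with open(filename, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))

    def transform(self, rows):
        converted = []
        for row in rows:
            ts = datetime.fromisoformat(row["Play Date"].replace("Z", "+00:00"))
            converted.append({
                "listened_at": int(ts.timestamp()),
                "track_metadata": {
                    "track_name": row["Title"],
                    "artist_name": row["Artist"],
                    "release_name": row["Album"],
                    "additional_info": {
                        "duration_ms": int(row["Duration (ms)"]),
                        "music_service": "Apple Music Player",
                        "device": row["Device"],
                    },
                },
            })
        return converted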

Part 4: Add a frontend component to allow the user to submit a listens file and a start/end date

Part 5: Handle errors, clean up the code, and add documentation

In this part, I will test the API for any errors and then clean up the code. Furthermore, I will add documentation for the new feature.

Timeline

This is the detailed week-by-week timeline for the 350 hours of the GSoC coding period that I intend to follow.

Pre-community Bonding Period (April-May):

During this time, I will work on building a deeper understanding of the codebase, the workflow, and the related technical skills.

Community Bonding Period (May 8 - June 1):

In this period, I will discuss the project with my mentors, plan the roadmap, and get a clear picture of the plan of action for the project.

Week 1:

First, I will create a file handler for LB exports. This will be used for testing the API.

Week 2-3:

Then I will create the API endpoint and add simple file validations and error handling.

Week 4-5:

Create a task for the background processing of import data.

Week 6:

Create the tables to store the import progress.

Week 7:

Add functions to process the files in the background.

Week 8:

Create a Spotify importer to convert the export file from Spotify into LB listens.

Week 9:

Create a function to convert Apple Music listens to LB listen format.

Create a function to convert last.fm exports to LB listen format.

Week 10:

Make a frontend section to get file input from the user.

Week 11:

Check for any errors and troubleshoot invalid behaviors.

Week 12:

I will clean up my code and add documentation, and discuss any changes with my mentor before the final submission.

Stretch Goals:

  • Add the functionality to import data from YouTube Music.

  • Add functionality to import data from Spinitron exports.

Detailed Information about Myself

I am a third-year college student trying to improve my skills and gain experience by working on projects used by many people. This opportunity will help me gain valuable experience in the open-source world. I am really fascinated by the idea of open-source projects, which provide access to good, if not the best, quality software at no cost.
When I am not working, I love watching anime and reading novels.

Tell us about the computer(s) you have available for working on your SoC project!

I will be using my HP 14s (8GB RAM with 512GB SSD) with Windows 11 for this summer.

When did you first start programming?

I started programming in high school, but began coding seriously in the first year of college.

What type of music do you listen to?

I mostly listen to contemporary/pop, like Let Me Down Slowly, Agar Tum Saath Ho, and Aasan Nahin Yahan (Recording “Aasan Nahin Yahan” by Arijit Singh - MusicBrainz).

What aspects of the project you’re applying for (e.g., MusicBrainz, AcousticBrainz, etc.) interest you the most?

ListenBrainz provides a way to store listening history from multiple platforms and show stats and recommendations. It helps me in better understanding my listening habits.

Have you ever used MusicBrainz to tag your files?

Yes, I have.

Have you contributed to other open-source projects?

ListenBrainz is the first open-source project I am contributing to.

If you have not contributed to open source projects, do you have other code we can look at?

I have worked on an irrigation system that uses machine learning and a Flask-based API to decide whether irrigation is required.

What sorts of programming projects have you done on your own time?

In my free time, I have worked on Python and JavaScript projects like a content-based movie recommendation system and an automated irrigation system for a hackathon. I have also worked on a grocery app to manage user inventory.

How much time do you have available, and how would you plan to use it?

I will be available full-time for the duration of Summer of Code, which will be about 30-40 hours weekly. I will also be able to add more hours if required.


Please give your feedback and suggestions.

Not MetaBrainz staff… but your post should be in the MetaBrainz GSoC category instead.

Thanks! Changed category

Hi!

Thanks for the proposal. Comments below.

This is inefficient; the submit-listens endpoint enqueues the listens to RabbitMQ after validation, and the parser/transformer in your case should do the same.

Do you intend to validate and process the file in the same request itself? This would require the user to keep the webpage open and ensure that they don’t get disconnected during the import. Otherwise, the import would fail.

Limited validation, if possible, can be performed during upload, but the bulk of the processing should take place in the background. For instance, we have the listen history deletion and listen export tasks in listenbrainz-server/listenbrainz/background at master · metabrainz/listenbrainz-server · GitHub and listenbrainz-server/listenbrainz/webserver/views/export.py at 77381518ca5ae8f24f13c02b085f838e3456eed4 · metabrainz/listenbrainz-server · GitHub. The file import should work similarly.

We should also display the progress of the import to the user. For exports, we currently record the progress information in a table like this: listenbrainz-server/listenbrainz/background/export.py at 77381518ca5ae8f24f13c02b085f838e3456eed4 · metabrainz/listenbrainz-server · GitHub. We can make this progress table generic, add a column or two, and make the progress available to the frontend for the user to see.

I think you should add some mockups or pictures of what the UI will look like.

I think the proposal will change significantly to accommodate the feedback, so let’s discuss the timeline after that.


Hi @lucifer! Thanks for your valuable feedback.
These are some of the solutions I thought of for the issues you raised.

For this, we should directly enqueue the processed listens to RabbitMQ:

exchange = rabbitmq_connection.INCOMING_EXCHANGE
for chunk in chunked(submit, MAX_LISTENS_PER_RMQ_MESSAGE):
    publish_data_to_queue(chunk, exchange)

Here, submit is the list of processed listens that need to be imported.

The file will only be validated for size and type (JSON, JSONL, ZIP, CSV) in the frontend, and then it will be temporarily stored on the server for further processing:

filename = secure_filename(file.filename)
temp_import_path = os.path.join(temp_dir, f"upload_{uuid.uuid4().hex}.tmp")
file.save(temp_import_path)

Can we update the existing export table user_data_export by adding a column operation_type, which will be import/export, and another column filename (str)?
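
Something like this sketch (exact types and defaults to be decided with the schema migration):

ALTER TABLE user_data_export
    ADD COLUMN operation_type TEXT NOT NULL DEFAULT 'export',  -- 'import' or 'export'
    ADD COLUMN filename TEXT;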

@veldora I think you should add some more technical implementation details to your proposal, as well as a mockup of the frontend components. Also, I don’t see how you intend to handle the imports in the backend.


Thanks @lucifer, I have made the changes. Can you review it now?
