GSoC 2016: AcousticBrainz Dataset Creation Toolkit



Personal information

Complete Name: Daniele Scarano
Nickname: hellska
IRC nick: hellska
Skype: hellska75

Project Proposal

title: AcousticBrainz Dataset Creation Toolkit (AB DaCT)


The goal of this project is to import into AcousticBrainz the publicly available datasets used in research. These datasets usually contain 20–30 second song excerpts to which no MusicBrainz ID can be associated. Even though the metadata contained in those datasets is lossy, they contain useful annotations related to the research problem they were built for (e.g. BPM or mood). Importing those datasets into AcousticBrainz can improve their quality in terms of metadata and increase the value of AcousticBrainz in supporting research.


Defining a minimum set of metadata required to accept a submission is a mandatory task for achieving the project goal. This is the basic set of fields needed to accept a submission:

  • Artist
  • Track Title
  • Length of the audio excerpt
  • Temporary AcousticBrainz ID
  • Dataset Name
  • Dataset Annotations

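As an illustration of this minimum metadata set, here is a small Python sketch of a submission record with a validation check. The field names and the `is_valid_submission` helper are illustrative assumptions, not an existing AB schema:

```python
# A minimal submission record with the fields listed above.
# Field names are illustrative, not a fixed AcousticBrainz schema.
minimal_submission = {
    "artist": "Unknown Artist",
    "title": "Track 01",
    "length": 30.0,                # length of the audio excerpt, in seconds
    "temporary_ab_id": "tmp-0001", # temporary AcousticBrainz ID
    "dataset_name": "example-mood-dataset",
    "annotations": {"mood": "happy", "bpm": 120},
}

REQUIRED_FIELDS = {"artist", "title", "length",
                   "temporary_ab_id", "dataset_name", "annotations"}

def is_valid_submission(record):
    """Accept a submission only if every required field is present and non-empty."""
    return all(record.get(field) not in (None, "", {}) for field in REQUIRED_FIELDS)

print(is_valid_submission(minimal_submission))  # True
```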

The AcousticBrainz Dataset Creation Toolkit lets community users create research datasets from the AB database and supports the formatting/packaging and upload into AB of publicly accessible external datasets. The toolkit has two main components: the Dataset Upload GUI, which formats the dataset content, adds the missing metadata, analyses the audio with essentia_streaming_extractor_music, and finally uploads the dataset into the AcousticBrainz DB; and the Dataset Creation Tool, a web app installed on the AcousticBrainz server that gives the user the possibility to create a dataset with specific characteristics and download it from the server.

A side effect of this toolkit is to propose a standard for the creation of research datasets.

The Dataset Upload GUI

This utility is a standalone program that AB community users can download and install on their own computers. The software could also be developed as a Picard plugin so it can be integrated into the current “contribution toolkit”. Its basic functionality is to scan a directory tree for audio files and provide the user with an interface to edit the metadata. Once this process is finished, the software connects to the AB server and asks for temporary IDs for the audio files. The dataset format is then complete and the user can upload it to the AB server.
Once the dataset has been imported into AB, a ‘Dataset ID’ is generated by the system and linked to every uploaded item. Every file is scanned to find the corresponding MusicBrainz Recording ID; files for which no match is found are marked as ‘unknown’.
All this content should be accessible in a format similar to the MusicBrainz interface, showing the detailed dataset information and the entire track list. It could also be interesting to allow editing the ‘unknown’ tracks manually in the web page, along with the other dataset information.

Dataset Import Process Steps

1. Identify the dataset to import and download it locally
2. Load the data into the Dataset Upload GUI (standalone/Picard plugin)
3. Review the metadata and complete the missing fields
4. Upload the dataset into AcousticBrainz
5. Obtain an AcousticBrainz temporary ID
6. Review the uploaded data in the web interface

Dataset Creation Tool Improvements

In this part of the project we will improve the existing dataset creation code, which is already integrated into the AB website and lets users create a dataset. The user must provide the names of the classes and add the desired recordings to each class using their MBIDs.
The goal of this development is to provide a utility that creates the dataset automatically and offers the user the possibility to edit it before submitting it to the AB server. Another feature is downloading the dataset in JSON or other formats. This is the automatic dataset creation in a simple mock-up created with Google Forms (link to GF)

In the interface designed to create datasets for classification tasks, the user provides the basic information needed; this information is then used to build the query that retrieves the data to compile the dataset. Some relevant fields are missing from this form (e.g. the project in which the dataset will be used), but it is designed to be as simple as I imagine this tool.
The number of instances and the number of classes define the size of the dataset; the class category is selected from a list of possible values such as:

  • Genre
  • Style
  • Mood
  • Scale
  • …others
Once the user selects one item from the list, the tool generates the list of classes that can fit the user’s needs (e.g. with a sufficient number of items). The Create button performs the query and returns a summary of the dataset that the user can accept or modify. When the user confirms, the dataset is compiled and downloaded to the local computer. The functionality to access the datasets will be developed in the AB web API.
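The automatic creation described above could be sketched as follows. This is a minimal sketch under stated assumptions: `recordings` stands in for whatever query result the server returns, and the function name and signature are hypothetical, not existing AB code:

```python
import random

def build_dataset(recordings, category, n_classes, n_instances, seed=0):
    """Build a class -> [MBID, ...] dataset from candidate recordings.

    Groups the candidates by the chosen class category (e.g. "genre"),
    keeps the n_classes largest groups that can supply n_instances each,
    and samples n_instances MBIDs per class.
    `recordings` is a list of dicts like {"mbid": ..., "genre": ...}.
    """
    by_class = {}
    for rec in recordings:
        by_class.setdefault(rec.get(category), []).append(rec["mbid"])
    # Keep only classes with enough items for the requested dataset size.
    eligible = {cls: mbids for cls, mbids in by_class.items()
                if cls is not None and len(mbids) >= n_instances}
    rng = random.Random(seed)
    chosen = sorted(eligible, key=lambda c: len(eligible[c]), reverse=True)[:n_classes]
    return {cls: rng.sample(eligible[cls], n_instances) for cls in chosen}
```

The summary shown to the user before confirmation could simply be the class names and counts of this mapping.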

The Dataset Format

The main use for this kind of dataset is to train machine learning algorithms for classification tasks, so it is important to make some considerations about the data format. JSON, which integrates perfectly with AB, can be the default, but it would be nice to add the possibility to download the data in other formats such as CSV and ARFF.
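A CSV export from the default JSON could look like the sketch below. The two-column layout (MBID, class label) is my assumption about what classification toolkits would want, not a format AB defines:

```python
import csv
import io
import json

def dataset_to_csv(dataset_json):
    """Flatten a class -> [MBID, ...] JSON mapping into two-column CSV rows."""
    dataset = json.loads(dataset_json)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["mbid", "class"])
    for label, mbids in sorted(dataset.items()):
        for mbid in mbids:
            writer.writerow([mbid, label])
    return out.getvalue()
```

An ARFF export would be similar, with a `@relation`/`@attribute` header in front of the same rows.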

Time Schedule

Community Bonding: Set up all the necessary tools to start the development and participate in the community. Talk to the other students and check if some tasks are in common. Evaluate the project requirements against the organization requirements and refine the proposal.
Week 1: Design the Dataset Upload GUI (standalone/Picard plugin)
Week 2: ID3 tags annotator
Week 3: File browser
Week 4: GUI and Upload
Week 5: Test and Data Validation
Midterm Evaluation: Review the work, compile the documentation.
Week 6: Develop a script to create the dataset
Week 7: Automatic dataset creation Form design
Week 8: Web API improvements (dataset retrieve function)
Students Submit Code and Evaluation: Final tests, refactoring and submission

Detailed information about myself

I have been an AIX/Unix system administrator for many years, and I found some time to study in the Master in Sound and Music Computing at Pompeu Fabra University. I’m devoting all my time to combining my passion for music with my IT skills. I really love learning new things and solving problems. As a sysadmin I learnt to love well-designed tools, and I also learnt to develop ad-hoc solutions.

  • I use a MacBook Pro late 2012 (256GB SSD, 1TB HD, 16GB RAM) for development. I have another laptop that I use to run a web server, DNS, DHCP and other network services for development. I also have an iMac that I use to test deployment and installation tasks.

  • I started programming in C in 2001 with the book Operating Systems: Design and Implementation by Tanenbaum. Because of my daily job tasks I started developing tools using bash, Perl and Visual Basic.

  • I listen to almost every music genre and style; here is a list with some of my favourites:
    Sonic Youth, Bad Moon Rising MBID:d5418d35-5e61-348c-a3f7-5d8fdcd1ea30
    Tom Waits, Rain Dogs MBID:81f978e7-6f49-3d51-9243-a131a4edc259
    Melvins, Stoner Witch MBID:450a2f27-bd33-439c-ac3b-1e6861076399
    Claude Debussy, Nuages MBID:9758a96e-4617-4375-a8db-b9d58c1d06cf

  • I love music and I love creating tools. Under the MetaBrainz umbrella there are a lot of interesting projects that are closely connected to my interests; AcousticBrainz, with its research approach, focus on music content, and exploration of low-level and high-level descriptors, is one of my favourites. But also BookBrainz, because I love reading!

  • I have been using Picard for a while to tag my music collection and retrieve metadata. I have contributed a few files to AcousticBrainz. I’m running a local AB server in a Vagrant VM to understand better how it works.

  • I write software that I release as open source and publish on my GitHub profile, but I’ve never contributed to a big, structured project like this.
    Most of my projects are related to music: a CSound code generator written in Ruby was my Bachelor’s thesis; SuperCollider for the performance setup I use when I play live; Processing/Java for interactive installations; Python and Flask for web development; some bits of Arduino.

  • I will work both on my master’s thesis and on this project, and I plan to devote 20–40 hours per week to GSoC depending on the task.


@alastairp or @Gentlecat, any feedback?

@hellska, just a heads up: remember that the deadline for uploading your final PDF application is tomorrow evening (UTC).


Thanks Freso,
I was waiting for feedback to make some changes to the proposal if necessary, but I saw that I can upload the PDF and re-upload another version until tomorrow at 20:00, so I just uploaded the current version.

Anyway I will be waiting for feedback and ready to make changes to the proposal.



I’ll try and come up with some comments first thing in the morning.


Just want to note that we had a project related to datasets last year. Some parts of it might be useful here. I also had plans to work on an API for dataset creation tools that we have already, but nothing concrete yet.


@Gentlecat So at least some parts of the second half of the project already exist, and the goal could be shifted to developing an API to access the dataset creation tool. Are you referring to the code contained in the dataset_eval folder on the AcousticBrainz server? Is there any other relevant part to check?

‘The Dataset Creation Tool’ paragraph is quite general, but I edited it here in the forum to mention the web API approach.


The dataset_eval package contains tools for evaluating datasets. Creation is done in the webserver. Some relevant parts:


@Gentlecat As far as I can see, all the functionality to create a dataset is already present in the server. Recordings must be added manually, while I think this part of the creation could be done automatically.
Some rules for automatically creating the datasets should also be included in the project (e.g. one song per artist in each class).
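As an example of such a rule, the "one song per artist in each class" constraint could be a simple filter applied to each class before sampling. This is an illustrative sketch; the function name and record shape are my assumptions:

```python
def one_per_artist(recordings):
    """Keep the first recording seen for each artist and drop the rest.

    `recordings` is a list of dicts with "mbid" and "artist" keys,
    representing the candidate recordings for one class.
    """
    seen = set()
    kept = []
    for rec in recordings:
        if rec["artist"] not in seen:
            seen.add(rec["artist"])
            kept.append(rec)
    return kept
```

Applying this per class before sampling prevents a single prolific artist from dominating a class.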

I changed the schedule a little: week 6 is now devoted to creating a script that builds a complete dataset, instead of a dataset viewer, which is already present.


I made some small changes to the ‘Dataset Creation Tool Improvements’ paragraph and to the ‘Time Schedule’. I modified the Gdoc draft and the post here, and uploaded a new PDF on the GSoC website.


Also, just curious: did you add yourself to MusicBrainz yet? :wink:


@Freso Not yet! But I’ll try to do it soon … when I find a good definition for the genre :stuck_out_tongue:


But we don’t even support genres yet! :slight_smile:


Yes, I see, but the genre is usually in the tags, isn’t it?
I’m using this information to select songs (recordings) for dataset creation in AcousticBrainz … am I doing something inconsistent here?
Well, in the meantime I will miss this info in my songs :smiley:


I created a Gist on GitHub that contains all the information from this coding period. I used the pull requests as a reference for my work, without pointing out every commit.
Any feedback on the report before submitting it to the GSoC website is really appreciated!