Complete Name: Daniele Scarano
IRC nick: hellska
title: AcousticBrainz Dataset Creation Toolkit (AB DaCT)
The goal of this project is to import into AcousticBrainz the publicly available datasets used in research. These datasets usually contain 20 to 30 second song excerpts to which no MusicBrainz ID can be associated. Even though the metadata contained in these datasets is lossy, they contain useful annotations related to the research problem they were created for (e.g. BPM or mood). Importing these datasets into AcousticBrainz can improve their quality in terms of metadata and increase the value of AcousticBrainz in supporting research.
Defining a minimum set of metadata required to accept a submission is a mandatory task to achieve this project goal. This is the basic set of fields needed to accept a submission:
* Track Title
* Length of the audio excerpt
* Temporary AcousticBrainz ID
* Dataset Name
* Dataset Annotations
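The minimal metadata set above could be checked with a simple validator; the following is an illustrative sketch, and the field names are my own assumption, not the actual AcousticBrainz schema:

```python
# Hypothetical representation of the minimal submission metadata.
# Field names are illustrative, not the real AcousticBrainz schema.
REQUIRED_FIELDS = {
    "track_title",      # Track Title
    "length",           # length of the audio excerpt, in seconds
    "temporary_ab_id",  # temporary AcousticBrainz ID
    "dataset_name",     # Dataset Name
    "annotations",      # dataset annotations (e.g. BPM, mood)
}

def validate_submission(item):
    """Return the set of required fields missing or empty in a submission item."""
    present = {k for k, v in item.items() if v not in (None, "")}
    return REQUIRED_FIELDS - present
```

A submission would only be accepted when the returned set is empty.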
The AcousticBrainz Dataset Creation Toolkit lets community users create research datasets from the AB database and supports the formatting, packaging, and upload into AB of public-access external datasets. The toolkit has two main components: the Dataset Upload GUI, which formats the dataset content, adds the missing metadata, analyses the audio with essentia_streaming_extractor_music, and finally uploads the dataset into the AcousticBrainz DB; and the Dataset Creation Tool, a web app installed on the AcousticBrainz server that gives the user the possibility to create a dataset with specific characteristics and download it from the server.
A useful side effect of this toolkit is that it proposes a standard for the creation of research datasets.
The Dataset Upload GUI
This utility is standalone software that AB community users can download and install on their own computers. It could also be developed as a Picard plugin, so it would integrate with the current “contribution toolkit”. Its basic functionality is to scan a directory tree for audio files and provide the user with an interface to edit the metadata. Once this process is finished, the software connects to the AB server and requests temporary IDs for the audio files. The dataset is then complete and the user can upload it to the AB server.
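The directory-scanning step could look something like the sketch below; the set of audio extensions is my assumption, and the real tool would also read the existing tags of each file:

```python
# Minimal sketch of the directory-scanning step of the Dataset Upload GUI.
# The extension list is an assumption; the real tool would also read tags.
import os

AUDIO_EXTENSIONS = {".mp3", ".flac", ".ogg", ".wav", ".m4a"}

def scan_audio_files(root):
    """Walk a directory tree and yield the paths of audio files found."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in AUDIO_EXTENSIONS:
                yield os.path.join(dirpath, name)
```

Each path yielded here would then be passed to the metadata editor and, later, to the Essentia extractor.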
Once the dataset has been imported into AB, a ‘Dataset ID’ is generated by the system and linked to every uploaded item. Every file is scanned to find the corresponding MusicBrainz Recording ID; files for which no match is found are marked as ‘unknown’.
All this content should be accessible in a format similar to the MusicBrainz interface, showing the detailed dataset information and the entire track list. It could also be interesting to allow editing the ‘unknown’ tracks manually in the web page, along with the other dataset information.
Dataset Import Process Steps
Identify the dataset to import and download locally
Load the data into the Dataset Upload GUI (standalone/Picard plugin)
Review the metadata and complete the missing fields
Upload the dataset into AcousticBrainz
Obtain an AcousticBrainz temporal ID
Review the uploaded data in the web interface
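Before the upload step above, the client has to bundle the reviewed items into a single submission. The sketch below shows one possible payload shape; the structure and field names are hypothetical, not the real AcousticBrainz API:

```python
# Hedged sketch of the payload a client might assemble before uploading.
# The structure and field names are hypothetical, not the real AB API.
def build_upload_payload(dataset_name, items):
    """Bundle reviewed metadata items into a single dataset submission."""
    return {
        "dataset": {"name": dataset_name, "size": len(items)},
        "items": [
            {"track_title": it["track_title"],
             "length": it["length"],
             "annotations": it.get("annotations", {})}
            for it in items
        ],
    }
```

The server would respond with the temporary AcousticBrainz IDs and, after import, the Dataset ID.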
Dataset Creation Tool Improvements
In this part of the project we will improve the existing dataset creation code, which is already integrated into the AB website and lets users create a dataset. The user must provide the name of each class and add the desired recordings to each class using their MBIDs.
The goal of this development is to provide a utility that creates the dataset automatically and offers the user the possibility to edit it before submitting it to the AB server. Another feature is the download of the dataset in JSON or other formats. A simple mock-up of the automatic dataset creation, built with Google Forms, is available here (link to GF).
In the interface designed to create datasets for classification tasks, the user provides the basic information needed; this information is then used to build the query that retrieves the data to compile the dataset. Some relevant fields are missing from this form (e.g. “The Project” in which the dataset will be used), but it is designed to be as simple as I imagine this tool should be.
The number of instances and the number of classes define the size of the dataset; the class category is selected from a list of possible values, as in this example list:
Once the user selects one item from the list, the tool generates the list of classes that can fit the user’s needs (e.g. a sufficient number of items). The Create button performs the query and returns a summary of the dataset, which the user can accept or modify. When the user concludes the operation, the dataset is compiled and downloaded to the local computer. The functionality to access the datasets will be developed in the AB web API.
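The selection logic described above can be sketched as follows; the data here is mocked, since the real tool would query the AcousticBrainz database instead:

```python
# Illustrative sketch of the automatic dataset creation step: given a
# target number of instances per class, keep only the classes that have
# enough recordings and sample that many MBIDs from each. All data is
# mocked; the real tool would query the AcousticBrainz database.
import random

def build_dataset(recordings_by_class, instances_per_class, seed=0):
    """Return {class_name: [mbid, ...]} for classes with enough items."""
    rng = random.Random(seed)
    dataset = {}
    for cls, mbids in recordings_by_class.items():
        if len(mbids) >= instances_per_class:
            dataset[cls] = rng.sample(mbids, instances_per_class)
    return dataset
```

Classes with too few recordings are simply omitted, which matches the idea of only offering the user classes that can fit their needs.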
The Dataset Format
The main use for this kind of dataset is to train machine learning algorithms for classification tasks, so it is important to make some considerations about the data format. The JSON format, which integrates perfectly with AB, can be the default, but it would be nice to add the possibility to download the data in other formats such as CSV and ARFF.
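The export to the alternative formats could be done along these lines; the two-column layout (MBID plus class label) is my assumption, not an existing AB format:

```python
# Sketch of exporting a {class: [mbid, ...]} dataset as CSV and as a
# minimal ARFF document. The column layout is an assumption.
import csv
import io

def to_csv(dataset):
    """Serialize the dataset as two-column CSV text (mbid, class)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["mbid", "class"])
    for cls, mbids in dataset.items():
        for mbid in mbids:
            writer.writerow([mbid, cls])
    return buf.getvalue()

def to_arff(dataset, relation="ab_dataset"):
    """Serialize the same structure as a minimal ARFF document."""
    classes = ",".join(sorted(dataset))
    lines = ["@RELATION %s" % relation,
             "@ATTRIBUTE mbid STRING",
             "@ATTRIBUTE class {%s}" % classes,
             "@DATA"]
    for cls, mbids in dataset.items():
        lines.extend("%s,%s" % (m, cls) for m in mbids)
    return "\n".join(lines)
```

ARFF is convenient because Weka and similar tools can load it directly for classification experiments.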
Community Bonding: Set up all the necessary tools to start development and participate in the community. Talk to the other students and check whether some tasks are shared. Evaluate the project requirements against the organization's requirements and refine the proposal.
Week 1: Design the Dataset Upload GUI (standalone/Picard plugin)
Week 2: ID3 tags annotator
Week 3: File browser
Week 4: GUI and Upload
Week 5: Test and Data Validation
Midterm Evaluation: Review the work, compile the documentation.
Week 6: Develop a script to create the dataset
Week 7: Automatic dataset creation Form design
Week 8: Web API improvements (dataset retrieve function)
Students Submit Code and Evaluation: Final tests, refactoring, and submission.
Detailed information about myself
I have been an AIX/Unix system administrator for many years, and I found some time to study in the Master in Sound and Music Computing at Pompeu Fabra University. I’m devoting all my time to combining my passion for music with my IT skills. I really love learning new things and solving problems. As a sysadmin I learned to love well-designed tools, and I also learned to develop ad-hoc solutions.
I use a MacBook Pro (late 2012, 256GB SSD, 1TB HD, 16GB RAM) for development. I have another laptop that I use as a web server and for DNS, DHCP, and other network services during development. I also have an iMac that I use to test deployment and installation tasks.
I started programming in the C programming language in 2001 with the book Operating Systems: Design and Implementation by Tanenbaum. Because of my daily job tasks, I started developing tools using bash, Perl, and Visual Basic.
I listen to almost every music genre and style; here is a list with some of my favourites:
Sonic Youth, Bad Moon Rising MBID:d5418d35-5e61-348c-a3f7-5d8fdcd1ea30
Tom Waits, Rain Dogs MBID:81f978e7-6f49-3d51-9243-a131a4edc259
Melvins, Stoner Witch MBID:450a2f27-bd33-439c-ac3b-1e6861076399
Claude Debussy, Nuages MBID:9758a96e-4617-4375-a8db-b9d58c1d06cf
I love music and I love creating tools. Under the MetaBrainz umbrella there are a lot of interesting projects that are very connected to my interests; AcousticBrainz, with its research approach, its focus on music content, and its exploration of low-level and high-level descriptors, is one of my favourites. But also BookBrainz, because I love reading!
I have been using Picard for a while to tag my music collection and retrieve metadata. I have contributed a few files to AcousticBrainz. I’m running a local AB server in a Vagrant VM to better understand how it works.
I write software that I release as open source; it is published on my GitHub profile, but I have never contributed to a big, structured project like this one.
Most of my projects are related to music: a CSound code generator written in Ruby was my Bachelor’s thesis; SuperCollider to create the performance setup I use when I play live; Processing/Java for interactive installations; Python and Flask for web development; and some bits of Arduino.
I will work both on my master’s thesis and on this project, and I plan to devote 20 to 40 hours per week to GSoC depending on the task.