GSoC 2020: AcousticBrainz - New machine learning infrastructure
Name: Pantelis Tzamalis
IRC nick: jmp_music
Time Zone: UTC +3
This project proposes the development of a module that uses Machine Learning (ML) to classify the audio-based data extracted from music tracks. Python’s ML library, scikit-learn, is used to train the models and to predict the classification results.
AcousticBrainz is a project which aims to crowdsource acoustic information for all music in the world. This music information is available to the public and describes the acoustic characteristics of music and includes low-level spectral information and information for genres, moods, keys, scales and much more.
Currently, the platform uses the Essentia toolkit, an open-source library written in C++ that provides tools for audio and music analysis, description, and synthesis. The audio-based music information (features) is extracted with this library, and a custom library called gaia, which wraps the libsvm ML library, is used to train, test, and evaluate the Machine Learning model. The ML process uses the SVM (Support Vector Machine) algorithm to classify each track instance. Despite its excellent performance, this custom implementation cannot easily be extended with different ML algorithms or with Deep Learning.
Contribution. This work reproduces the existing ML classification procedure, currently implemented with the gaia library, in a new ML infrastructure built on the Python library scikit-learn. The new infrastructure is modeled at a high level, reproduces the classification process with the SVM model, and can easily be extended with other ML algorithms, which can then be compared with each other or even combined (e.g. using a Voting Classifier or a Bagging Classifier).
Requirements and Specifications
Building any system, module, or tool starts with defining the requirements and specifications needed for its successful development. The development of this module relies on several external and internal tools.
To start with, the module will be developed in Python 3.7, as provided by the Anaconda Distribution. Python has a large ecosystem of libraries (software packages) for application development as well as for data processing, analysis, and visualization, and the relevant ones need to be installed: NumPy, SciPy, and Pandas for data manipulation, handling, and processing, and Seaborn and Matplotlib for visualizing the raw data. The core library for the ML modeling, however, is scikit-learn. Its current version is 0.22.2, and it includes many algorithmic models for both classification and regression tasks. It is a powerful ML library that also handles multiclass classification problems.
Moreover, Jupyter Notebook will be used for the initial development steps, but the final module will be implemented as scripts that comply with the Organization’s conventions, following an Object-Oriented design.
No data analysis module can be used and tested without the appropriate data. For this project, the data is available through AcousticBrainz’s public Data API and can be accessed as described in the AcousticBrainz Server Documentation. AcousticBrainz stores two levels of data, low-level and high-level, in JSON format. Python’s json library will be used to load and access the data for further manipulation. The low-level data includes acoustic descriptors, which act as the input features of the classifier. It is stable compared to the high-level data, which means that it rarely changes. The high-level data includes information about moods, genres, vocals, and music types for each instance and is automatically inferred from the low-level data by the pre-trained classifiers. In addition, the high-level data includes the MusicBrainz Identifiers (MBIDs), which provide a reliable and unambiguous form of music identification and play an important role when managing a digital music collection. They are commonly used to identify:
- the recording itself
- the release
- the label
- the track artist
- the release artist
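As a minimal sketch of how a low-level JSON document could be loaded with Python’s json library and turned into classifier inputs, the snippet below flattens a nested document into dotted feature names. The sample document is a hypothetical, heavily truncated stand-in (the field names and values are assumptions for illustration, and the MBID is a placeholder), and flatten_features is an illustrative helper, not part of any existing AcousticBrainz code.

```python
import json

# Hypothetical, heavily truncated stand-in for a low-level document.
# Field names and values are assumptions; the MBID is a placeholder.
SAMPLE_LOWLEVEL = """
{
  "lowlevel": {"average_loudness": 0.81, "mfcc": {"mean": [-650.2, 120.5, 8.1]}},
  "rhythm": {"bpm": 124.0},
  "tonal": {"key_key": "A", "key_scale": "minor"},
  "metadata": {"tags": {"musicbrainz_recordingid": ["00000000-0000-0000-0000-000000000000"]}}
}
"""


def flatten_features(node, prefix=""):
    """Recursively flatten nested dicts and lists into {dotted.path: value}."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten_features(value, f"{prefix}{key}."))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            flat.update(flatten_features(value, f"{prefix}{index}."))
    else:
        # Leaf value: strip the trailing dot left by the recursion.
        flat[prefix.rstrip(".")] = node
    return flat


document = json.loads(SAMPLE_LOWLEVEL)

# Keep the metadata (e.g. the MBID) apart from the descriptors used as features.
metadata = document.pop("metadata")
features = flatten_features(document)

# Only numeric descriptors can be fed to the classifier directly;
# categorical ones (e.g. the key) would need encoding first.
numeric_features = {
    name: value
    for name, value in features.items()
    if isinstance(value, (int, float)) and not isinstance(value, bool)
}

for name, value in sorted(numeric_features.items()):
    print(name, value)
```

Flattening to dotted names keeps each descriptor traceable back to its place in the original document, which should make it easier to select or exclude groups of features later.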
Finally, the majority of Python’s libraries can be installed with pip, a package management system used in the terminal to install and manage software packages written in Python. Many packages can be found in the default source for packages and their dependencies, the Python Package Index. A Git repository will also be created for the project, to track the versions of the implementation, to resolve any issues that are introduced, and to give the project’s Google Mentors a complete overview of the development procedure.
Data Flow and Architecture
First of all, an extensive study is needed to understand how the gaia module handles data ingestion into the ML tool, data processing, and the training and prediction steps of the model.
After that, the module is developed/refactored according to the data flow and data handling procedure presented in the following Figure, which is divided into these steps:
- Feature Engineering: Fortunately, feature engineering and extraction (in our case, from the audio files), one of the biggest challenges and most time-consuming processes in ML, is already implemented by the Essentia toolkit that is embedded in the AcousticBrainz platform.
- Data preparation: The data flow starts by applying suitable transformations and enforcing the desired machine-readable formats, so that the instances’ features can be fed to the training phase and the subsequent analysis.
- Data post-processing: This step includes optionally fixing or removing outliers and, if missing values exist, either filling them with a statistic (mean, median, or zero) or dropping them.
- Feature scaling: Here, the values of the features are standardized or normalized.
- Train/Test split: The data is split into train and test sets for training and evaluation of the ML models.
- Training the Model: The task is a multiclass classification problem. The appropriate algorithms are therefore trained and tested, starting with the SVM, which the project requires. The data is fed into this component and the training process is applied. Other models that can be applied include Multinomial Naive Bayes, Random Forests, etc.
- Model evaluation: After training, the model is applied to the instances of the test set, and its predictions are evaluated with classification reports and confusion matrices. Performance can also be measured with N-fold cross-validation, by computing the mean and standard deviation of the performance measure over the N folds.
- Fine-tuning: To improve the performance of the model, this step automatically searches for the best hyperparameters. Techniques such as grid search and random search are applied to find them.
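The steps above can be sketched with scikit-learn roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification as a stand-in for the real low-level features), and the imputation strategy, kernel, parameter grid, and fold count are arbitrary choices, not the values used by the existing gaia setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the flattened low-level features (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Knock out a few values to mimic missing descriptors.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.02] = np.nan

# Train/Test split, stratified so every class appears in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Post-processing (median imputation), feature scaling, and the SVM model.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# N-fold cross-validation (N=5) on the training set.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Fine-tuning: grid search over a small, illustrative hyperparameter grid.
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Model evaluation on the held-out test set.
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

Wrapping the imputer and scaler in the same Pipeline as the classifier means they are fitted only on the training folds during cross-validation and grid search, which avoids leaking test-set statistics into training.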
The above processes will be automated as much as possible so that they can be easily handled and modified. Additionally, during the post-processing step, dimensionality reduction, such as Principal Component Analysis (PCA), can be applied to shorten the training time or to reduce the feature set to its most informative components. This technique is already included in the scikit-learn library. In this way, components that carry little useful information for the task are dropped. This step is not always necessary for Deep Learning, as Neural Networks themselves adjust the weights of each feature, thereby indicating its importance to the dataset. Finally, the Bagging and Voting methods built into scikit-learn can be applied to combine different models and choose the best predictions.
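A minimal sketch of combining PCA with a Voting classifier in scikit-learn is shown below. The data is again synthetic, and the 95% variance threshold, the choice of a Random Forest as the second model, and soft voting are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features (assumption).
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Soft voting averages the class probabilities of the individual models,
# so the SVM needs probability=True.
voter = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)

# A float n_components keeps as many principal components as are needed
# to explain that fraction of the variance (here 95%).
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), voter)
model.fit(X_train, y_train)

n_kept = model.named_steps["pca"].n_components_
accuracy = model.score(X_test, y_test)
print(f"Components kept: {n_kept} of {X.shape[1]}")
print(f"Test accuracy: {accuracy:.3f}")
```

Note that PCA projects the data onto new axes rather than selecting original descriptors, so the kept components are combinations of the input features.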
Note: The Essentia toolkit is now available in Python bindings too.
This section presents indicative deliverables and milestones for the successful delivery of the project, defined according to Google’s “Summer of Code” call. It includes the development plan of the module. Module development starts on June 01, 2020.
Phase 01 (June 01, 2020)
This is the first phase of the project, in which the community bonding takes place. Here, any further specifications and definitions for the project’s development life cycle will be determined. Extensive research will also be done on the gaia module and its integration with the libsvm library, covering how it works in terms of data flow and processing as well as its theoretical background.
Furthermore, the compatibility and the dependencies of the libraries that will be used for the development of the module will be checked and tested. Moreover, the Data API and the Web API with their corresponding documentation will be studied too.
Finally, the current ML modeling will be studied:
- the parameters it uses
- how it handles the data
- the features that are chosen from the instances
- the type of distribution (Gaussian, uniform, etc.)
Phase 02 (July 03, 2020)
After the study of the current ML module and the determination of the requirements and specifications, the implementation period starts. Here, a data exploration will take place, and the process will be recorded in a Jupyter Notebook. The following will also be studied:
- each attribute and its characteristics (e.g. noisiness, percentage of missing values, type of distribution, etc.)
- the target attribute and the number of unique labels
- visualizations of the data and the correlations between the attributes
Based on the data flow schema presented in the relevant section above, this phase covers the initial implementation steps, with the following milestones:
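The exploration points above could be computed with Pandas along these lines. The tiny table is made up for illustration (the column names and values are assumptions, not real AcousticBrainz data); in the actual notebook it would be built from the flattened low-level documents.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset: three numeric descriptors and a target label.
frame = pd.DataFrame({
    "bpm": [124.0, 98.5, np.nan, 130.2, 110.0, 126.3],
    "average_loudness": [0.81, 0.42, 0.77, np.nan, 0.55, 0.90],
    "danceability": [1.4, 0.9, 1.2, 1.5, 1.0, 1.6],
    "genre": ["house", "ambient", "techno", "house", "ambient", "techno"],
})

# Percentage of missing values per attribute.
missing_pct = frame.isna().mean() * 100

# Target attribute: number of unique labels and instances per label.
n_labels = frame["genre"].nunique()
label_counts = frame["genre"].value_counts()

# Basic distribution statistics of the numeric attributes.
summary = frame.describe()

# Pairwise correlations between the numeric attributes
# (missing values are excluded pair by pair).
correlations = frame.drop(columns="genre").corr()

print(missing_pct.round(1))
print(n_labels)
print(label_counts)
print(summary)
print(correlations.round(2))
```

In a notebook, the correlation matrix could then be passed to a Seaborn heatmap for the visualization step.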
- Data preparation, transformation, and post-processing
- Implementation of the basic SVM model, and other related models for comparison
- Classification reports and confusion matrices outcomes
Phase 03 (July 31, 2020)
This is the period in which the main core of the module is built and integrated into the AcousticBrainz toolkit. It includes only two basic milestones, which coincide with the deadline predefined by Google:
- Development of the whole ML module and its integration into the general toolkit.
- Testing how the model performs in a real-world scenario.
Phase 04 (August 31, 2020)
This is the final phase of development, in which performance evaluation and minor debugging take place. It comprises two milestones:
- Benchmark performance, Hyperparameter Tuning
- Initial data transformation for a Deep Learning scenario with TensorFlow 2.x and training of a DL model (e.g. a Sequential model with a few Dense layers and a softmax activation function)
Detailed information about yourself
Tell us about the computer(s) you have available for working on your SoC project!
At home I have a MacBook Pro (Retina, 15-inch, Mid 2015):
- Processor: 2.2 GHz Quad-Core Intel Core i7
- Memory: 16 GB 1600 MHz DDR3
- macOS Catalina 10.15.4
At the Lab of the University, I work on a Desktop with:
- Processor: 2.66 GHz Quad-Core Intel Core i7
- Memory: 16GB
- OS: Ubuntu 18.04 LTS
At the Lab of the University, I also have access to a Deep Learning workstation that includes:
- GPU: Tesla V100-PCIE-32GB with 640 Tensor Cores
When did you first start programming?
I started programming experimentally at the end of High School, but I learned the basics and concepts of programming, in several programming languages, from university lectures, and I applied them in the projects of my Bachelor studies.
What type of music do you listen to? (Please list a series of MBIDs as examples.)
I listen to various kinds of electronic music like House, Techno, Ambient, Dub, Trip-Hop, Electronica. Two of my favorite music groups are Depeche Mode and Thievery Corporation, while some of my favorite DJs/Producers are John Digweed, Ricardo Villalobos, Luciano, and Sven Vath. I also performed at Sven Vath’s legendary club Cocoon in Frankfurt back in 2010.
Depeche Mode: 8538e728-ca0b-4321-b7e5-cff6565dd4c0
Depeche Mode - Strange Love: 8dc90556-ab6b-3030-b742-0f715fbebafd
Thievery Corporation: a505bb48-ad65-4af4-ae47-29149715bff9
Thievery Corporation - The Richest Man in Babylon: b2a820cc-c0ad-4aa3-a2a7-ed42ead88017
John Digweed: 68cac857-3147-43d0-879e-c63dc9c82014
Luciano - Rise of an Angel: e8cfc933-f4e0-493d-9350-15be02823c0d
Sven Vath: 14008378-5900-4370-831e-3d17f9749caf
Sven Vath - Mind Games: 1f5b83ba-4edf-4dbb-9a68-8d60eab8b430
What aspects of the project you’re applying for (e.g., MusicBrainz, AcousticBrainz, etc.) interest you the most?
As someone passionate about music, I like working with AcousticBrainz’s Open Source approach (Data API and ML processes), and I would love to see how my ML skills apply to this field of Analytics. I was also impressed by AcousticBrainz’s classification labels (how many different categories it takes into account) and their accuracies! This information can be found here:
Have you ever used MusicBrainz to tag your files?
I have used the Picard tool to tag my music collection in the Lab office.
Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?
Yes, I successfully completed last year’s GSoC in collaboration with JBoss (Red Hat). Here is the link to the submitted project:
I have also made some tutorials about Machine Learning and Data Science with relevant small projects, which can be found here:
What sorts of programming projects have you done on your own time?
Following are some of the projects I made in my free time:
My Data Science and Machine Learning tutorials were made in my free time. The link can be found in my answer to the previous question.
My MSc thesis was also done in my free time. It is an allergy map that presents the monitoring, detection, and exacerbation of various allergens and allergic diseases across Greece by collecting, processing, and analyzing hybrid inputs: subjective inputs from users (crowdsensing/crowdsourcing) and objective inputs from sensors. It is also a personalized monitoring system in which each patient can track their allergies and treatment. The related publication can be found here:
In my free time, I extended this idea to the USA, this time with an additional input: public Twitter posts (raw text). Using various NLP (tokenization, lemmatization, sentiment analysis, etc.) and ML processes and methods, an allergy monitoring map and surveillance system was created from this data. The publication can be found here:
I have also used ML to detect fake news from the raw text of articles by training a model with a corpus that includes sentences of fake news and applied this model later on to real articles. Unfortunately, this tool is not yet uploaded to a GitHub repo but will be in the near future.
How much time do you have available, and how would you plan to use it?
Currently, I work part-time (with flexible working hours) on a project that is funded by the Greek state and related to my Ph.D. For the rest of the day, I’ll be glad to work on this GSoC project, as I did last year. I have also completed the auxiliary work needed for my Ph.D., so I have enough free time.
I would like to be part of this GSoC project because of my passion for music (as you will see in my CV, I’m a music producer too), data analytics, and Machine Learning in general. It is also a new field in which I would like to combine my personal interests, hobby, and knowledge.
Do you plan to have a job or study during the summer in conjunction with Summer of Code?
Only a part-time job that is related to my Ph.D. studies.
Note: Because I am a new user, the platform allowed me to include only up to 10 links in my topic. Thus, I removed the links from the artists’ MBIDs.