Clarification on the scope of GSoC idea: New machine learning infrastructure

The project idea is about replacing the existing Gaia-based training process with the scikit-learn library.

Gaia is used for three main purposes:
1. Creating low-level info using acousticbrainz-client, i.e. with essentia_streaming_extractor
2. Creating high-level info by calling essentia_streaming_extractor_music_svm
3. Evaluating datasets to generate confusion matrices.

So does the project scope cover all of the above, or will we only be changing the dataset evaluation models?

Thanks for the question.
We have two pieces of software used in AcousticBrainz: Essentia is used to compute the low-level information (your point 1), and Gaia is used for the machine learning component.

The tools are used in each component like this:

  1. Essentia - create low-level data
  2. Gaia - train models using a dataset of low-level data and the SVM algorithm
  3. Essentia [essentia_streaming_extractor_music_svm program] + gaia [SVM library] - use the computed models from 2. and the low-level data from 1. to create high-level data
  4. Confusion matrices don’t use any special libraries - we just have some python code in the gaia package which uses the ground-truth from a dataset, and the freshly computed high-level data and compares them.
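The comparison in step 4 could be sketched with scikit-learn's `confusion_matrix`, assuming ground truth and predictions are plain label lists (the genre labels below are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels from a dataset and the freshly
# computed high-level predictions for the same recordings.
ground_truth = ["rock", "jazz", "rock", "classical", "jazz"]
predicted    = ["rock", "rock", "rock", "classical", "jazz"]

# Fix the label ordering so rows (true) and columns (predicted) are stable.
labels = ["classical", "jazz", "rock"]
matrix = confusion_matrix(ground_truth, predicted, labels=labels)
print(matrix)
# [[1 0 0]
#  [0 1 1]
#  [0 0 2]]
```

The off-diagonal entry shows the one jazz recording misclassified as rock, which is exactly the kind of summary the current Gaia-based evaluation produces.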

My expectation is that the scope of this project would cover steps 2-4. The idea is to no longer have to install gaia. Instead we will have another package (perhaps part of acousticbrainz-server, but probably an external dependency) written in python which does these steps.
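As a very rough sketch of what step 2 might look like once Gaia is replaced, scikit-learn's `SVC` can train an SVM on flattened low-level feature vectors. The data here is synthetic, and the real project may well choose different parameters or a grid search:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical data: each row is a flattened low-level feature vector
# for one recording; each label is the ground truth from a dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
X[20:] += 2.0  # shift one class so the toy problem is separable
y = np.array(["genre_a"] * 20 + ["genre_b"] * 20)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC(kernel="rbf", C=1.0)  # Gaia also trains an SVM under the hood
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

The trained model object is what step 3 would then load to compute high-level data for new recordings.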
We already made a start at this here:
Feel free to take a look at the code and ask if you have any questions.

We have two other people actively working on this task as well. This means that during the proposal period we will have to work closely to work out which tasks potential students can do, and which tasks will be done by others. Please feel free to read the tickets on the repository which I posted above, and start asking questions about specific implementation details. From there we can work out what a good split of tasks is.

I have read the code at the repository linked above and tried running it locally.

I have decided to start working on Issue #9: Implement variable-length descriptor trimming
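My rough understanding of what this trimming could look like, with made-up names and a padding policy that the real implementation may not use:

```python
def trim_descriptors(values, target_length):
    """Trim or pad a variable-length descriptor list so that every
    recording ends up with the same number of values before flattening.

    Hypothetical helper: the real project may pick a different policy,
    e.g. truncation only, or replacing lists with statistical summaries.
    """
    trimmed = values[:target_length]
    # Pad with 0.0 when a list is shorter than the target, so no NaNs
    # appear after flattening into a fixed-size feature vector.
    return trimmed + [0.0] * (target_length - len(trimmed))

# Descriptor lists of different lengths from three recordings
raw = [[0.1, 0.2, 0.3, 0.4], [0.5], [0.6, 0.7]]
fixed = [trim_descriptors(v, 2) for v in raw]
print(fixed)  # [[0.1, 0.2], [0.5, 0.0], [0.6, 0.7]]
```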

Other issues to which I would like to contribute are:

  1. NaN values after flattening variable-length lists #1 (I think this would probably be resolved once we are clear on issue 9)
  2. Implement normalization #7
  3. Perform an estimate of an unknown input #5
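A hedged sketch of how issues #7 and #5 might fit together in scikit-learn, with invented example data: a `StandardScaler` handles normalization inside a pipeline, and `predict()` gives an estimate for an unknown input:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# Hypothetical training data: two features on very different scales,
# with ground-truth labels for each recording.
X = np.array([[1.0, 200.0], [1.2, 180.0], [8.0, 20.0], [7.5, 25.0]])
y = ["voice", "voice", "instrumental", "instrumental"]

# Issue #7: normalization - StandardScaler rescales each feature to
# zero mean and unit variance before it reaches the SVM.
# Issue #5: estimate of an unknown input - predict() on a new vector.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X, y)
print(model.predict([[7.8, 22.0]]))  # → ['instrumental']
```

Bundling the scaler into the pipeline also means the same normalization is applied automatically at prediction time, which is one reason it might be preferable to normalizing the data by hand.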

And afterwards on the issues below:
1. Save complete training state #2
2. One-class classification #3