Automatic clustering using low level data



I am interested in automatic music clustering using the low-level information available from AcousticBrainz.
The goal is to group music that sounds similar.

I set up a Kohonen network (self-organizing map) to perform the task. The network itself seems to perform well, but I have a lot of trouble selecting the relevant low-level data to use as the features of my input vectors.

It looks as if the low-level data are too “high-level” for such a task. Maybe a simple FFT on some parts of the tracks would work better?

What do you think?


Hi, Thanks for your note. It’s great to see you trying this.
When you have results, it’d be great if you could share them for us to look at.

Unfortunately, it’s not feasible for us to provide FFTs of the audio, for two main reasons: an FFT for every frame of an entire song is pretty big, and to some extent it can be reversed back into audio, which is not a mess that we want to get involved in.

We’re starting to make plans for computing and providing more detailed features (probably not FFT per frame, but a lot more detailed than the current summaries), and will hopefully finalise these plans in the next few months.



Thanks for your support.

Actually, I managed to get quite interesting results using only spectral information (mean, min, max and var for nearly all the spectral data).

The problem is that I am definitely not an audio expert. Which low level information should I select in order to better cluster the audio tracks?


To begin with, you can try classic k-means clustering on MFCC descriptors. They are widely used for timbre characterization of music. Obviously they are not enough for good clustering, as they capture only one facet of music, but they are a good starting point to improve upon.
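As a rough illustration of the k-means suggestion, here is a minimal pure-Python sketch. The MFCC mean vectors in `toy_mfcc_means` are made-up numbers (real ones would come from your own data), and the function names are hypothetical. It standardizes each coefficient first so that no single one dominates the distance, and it uses a deterministic farthest-first initialization, which is just one simple choice among many.

```python
def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [(sum((x - m) ** 2 for x in c) / n) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=50):
    """Plain k-means (Lloyd's algorithm) with farthest-first initialization."""
    # Start from the first row, then repeatedly add the row that is
    # farthest from all centroids chosen so far.
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(iterations):
        # Assignment step: each track goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(row, centroids[j]))
                      for row in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Made-up MFCC mean vectors (3 coefficients each) for six imaginary tracks.
toy_mfcc_means = [
    [-650.0, 120.0, 10.0],
    [-640.0, 118.0, 12.0],
    [-655.0, 125.0, 9.0],
    [-300.0, 40.0, -20.0],
    [-310.0, 45.0, -18.0],
    [-295.0, 38.0, -22.0],
]
labels, centroids = kmeans(standardize(toy_mfcc_means), k=2)
print(labels)
```

On this toy input the first three tracks end up in one cluster and the last three in the other. With real data you would probably want a library implementation and several values of `k`.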

Currently, there is no definite answer as to which features will be useful. You can also try the larger feature sets available in AB: ignore the min/max and dvar* descriptors and focus on *mean and *var, and include BPM and onset rate to capture rhythm and HPCPs to capture tonality.
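The selection described above could be sketched roughly as follows. This assumes each track's low-level document is a nested dictionary with `lowlevel`, `rhythm` and `tonal` groups, where each descriptor holds its statistics under keys such as `mean` and `var`; check that against the documents you actually download. `select_features` and `example_doc` are hypothetical, and the numbers are invented.

```python
def select_features(doc, keep=("mean", "var")):
    """Flatten one low-level document into parallel (names, values) lists.

    Only the statistics listed in `keep` are used, so min/max and the
    dvar* statistics are ignored.
    """
    names, values = [], []

    def add(name, value):
        # Scalars become one feature; lists of numbers (e.g. MFCC or
        # HPCP means) become one feature per element.
        if isinstance(value, (int, float)):
            names.append(name)
            values.append(float(value))
        elif isinstance(value, list) and all(
            isinstance(v, (int, float)) for v in value
        ):
            for i, v in enumerate(value):
                names.append(f"{name}[{i}]")
                values.append(float(v))

    def add_stats(prefix, stats):
        if isinstance(stats, dict):
            for stat in keep:
                if stat in stats:
                    add(f"{prefix}.{stat}", stats[stat])

    # Spectral/timbre descriptors (sorted so the feature order is stable).
    for descriptor, stats in sorted(doc.get("lowlevel", {}).items()):
        add_stats(f"lowlevel.{descriptor}", stats)

    # Tonality: HPCP statistics.
    add_stats("tonal.hpcp", doc.get("tonal", {}).get("hpcp"))

    # Rhythm: BPM and onset rate are single values, not statistics.
    rhythm = doc.get("rhythm", {})
    for key in ("bpm", "onset_rate"):
        if key in rhythm:
            add(f"rhythm.{key}", rhythm[key])

    return names, values


# Invented, heavily truncated document for illustration only.
example_doc = {
    "lowlevel": {
        "spectral_centroid": {
            "mean": 1200.0, "var": 350.0, "min": 90.0, "max": 5000.0, "dvar": 12.0,
        },
        "mfcc": {"mean": [-650.0, 120.0, 10.0]},
    },
    "rhythm": {"bpm": 128.0, "onset_rate": 3.5},
    "tonal": {"hpcp": {"mean": [0.1, 0.4], "var": [0.01, 0.02]}},
}
names, values = select_features(example_doc)
for name, value in zip(names, values):
    print(name, value)
```

The features sit on very different scales (BPM versus HPCP bins, for example), so it is probably worth standardizing each column before feeding the vectors to k-means or to the Kohonen network.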