Search engine system for AcousticBrainz

acousticbrainz
gsoc-2016

#1

A search system that lets users search by metadata or by extracted features, and also supports content-based retrieval: given a piece of music, it should return a list of similar tracks.

Some features:

  • search for songs by BPM (numeric field)
  • search for all the songs by a particular artist, etc.

Some Initial Ideas:

  • A basic vector space model can be implemented, where songs are represented in an n-dimensional space; given a query, relevant songs can be found by calculating a similarity score between the query vector and each song vector. The higher the similarity, the more relevant the song.

  • Probabilistic models could also be used as the relevance function. It would be interesting to see how a probabilistic model like Okapi BM25, which is state of the art in text-based search engines, behaves in this scenario with some tweaks. Learning good features would be important.

  • Songs could also be clustered using state-of-the-art clustering algorithms; it would be interesting to see what kinds of groups emerge. Songs can also be categorized by genre, tempo, etc.

  • Deep learning has solved many hard problems, and it would be interesting to see how it could benefit this system. Since deep networks learn an abstract representation at each computation layer, it would be nice to take features from a deep network and then perform classification on the newly learned features.
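
As a minimal sketch of the vector space idea above (pure Python; the feature vectors and descriptor names are invented for illustration, not the real AcousticBrainz schema):

```python
import math

def cosine_similarity(a, b):
    """Similarity score between a query vector and a song vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional feature space, e.g. (bpm, loudness, danceability),
# already normalised to comparable scales.
songs = {
    "song_a": [0.9, 0.2, 0.8],
    "song_b": [0.1, 0.9, 0.3],
}
query = [0.8, 0.3, 0.7]

# Rank songs by similarity to the query vector.
ranked = sorted(songs, key=lambda s: cosine_similarity(query, songs[s]),
                reverse=True)
```

A real system would of course normalise features properly and index the vectors for fast lookup; this only shows the ranking step.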


#2

Hi,
Thanks for the proposal. I already wrote a little bit about our ideas for search here: Search API for acousticbrainz
Please take a look at it and see if you have any additional comments.

There is a difference between the items you listed in your features, and the ideas that you listed. The features are quite easy to do with any text search system. As I mentioned in the previous post we have already done this with elasticsearch.

My feeling is that your ideas on relevance would be better suited to content-based search rather than text search. This is quite a different project, but really interesting. There are some related projects listed on the AcousticBrainz ideas page, including Content-similarity, Data description, and Spot the odd song out.

We have a sound similarity search system in Freesound, which uses Gaia, the same technology we use in AcousticBrainz to perform our machine learning processes. For example, see a similar sounds page. Do you think this kind of similarity could also be done for songs, not just short sounds?

Finding clusters of songs is a cool idea. It would be interesting to see if these clusters could be used to discover if two submissions in AcousticBrainz have different MBIDs but are in fact the same song.
What about clustering at a broader level? If we tried to cluster on mood, song “colour”, instrumentation, style, etc, how would you go about gathering ground truth descriptors to understand what these clusters mean?


#3

@alastairp,

Hi,

True, given those features, search based on metadata is quite easy to do with any text search system.

My idea is to extend the capabilities of the basic search engine already built with Elasticsearch. It should also support content-based retrieval, which would help in identifying and recommending similar music, and, with some tweaks, in duplicate detection.

Let's start by answering the questions you asked in the other post (search API).

What specific list of descriptors do you think would be a good idea to include in a search index?

Some of the descriptors could be :

  • bpm - could be used for finding songs with a specific BPM; also useful for grouping.
  • artist - users can search for a song/piece of music based on artist name
  • tempo - used for searching; some users like high-tempo songs while others prefer low.
  • instrumentation - search songs based on instrument type, say ‘guitar’
  • mood - search songs based on mood, e.g. songs with a ‘sad’ or ‘happy’ mood
  • release - get songs based on release info, date, etc.
  • track name - get music based on name, say ‘scuse me’
  • key - get songs or music based on key. Can be useful for musician users.
  • gender - get songs sung by a male/female singer
  • speech/non-speech - can be useful for grouping.
  • type of file - search by file type, e.g. get the mp3 format of a track. Helpful for serving field queries.
  • length - search by the total duration of a track. Can be useful for identifying duplicates (discussed later).
  • language - get all ‘english’ songs
  • genre - search songs based on genre: ‘rock’, ‘classic’, etc.

If we have these features, they would simplify many searching and grouping tasks, and would also help with data exploration.
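
To illustrate, a field query over descriptors like these might look as follows in Elasticsearch's query DSL (the field names and index are assumptions for this sketch, not an existing AB schema):

```python
# Hypothetical Elasticsearch query body: find 'rock' recordings by a
# given artist with a BPM near a target value. All field names here
# are invented for illustration.
def build_query(artist, genre, bpm, tolerance=5):
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"artist": artist}},
                    {"match": {"genre": genre}},
                ],
                "filter": [
                    {"range": {"bpm": {"gte": bpm - tolerance,
                                       "lte": bpm + tolerance}}},
                ],
            }
        }
    }

body = build_query("Queen", "rock", 120)
# would be sent with e.g. es.search(index="recordings", body=body)
```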

It would be cool to be able to search for a track by its name, or even
search for a release or artist to add all matching recordings to a
class. search could also match recordings which match certain tags. For
example, we have tags on recordings in MusicBrainz which look like
genres. Would we be able to say “find all recordings tagged with 'rock’
and add them to the rock class in our genre dataset”?

If the above-mentioned features can be extracted, this task becomes easy: query for a ‘tag’, have the search engine return all matches, and for each match call the API to add it to a class.

Do you think this kind of similarity could also be done for songs, not just short sounds?

Yes, I think it can be done for songs too, through content-based searching. Applications like Shazam have been doing it. I have put down some suggestions for this in my proposal.

Duplicate detection could be done through fingerprint matching plus duration matching (with some epsilon, since two copies of the same song can differ slightly in length). Representing songs in a new feature space learned with a deep learning approach also helps: similar songs end up near each other, so duplicates will be very close together and can be detected with a threshold radius. This is mentioned in my proposal.
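
A minimal sketch of this duration-plus-distance check (the epsilon and radius values below are placeholders, not tuned thresholds):

```python
import math

DURATION_EPSILON = 2.0   # seconds; two copies of a song may differ slightly
DISTANCE_RADIUS = 0.1    # threshold radius in the learned feature space

def euclidean(a, b):
    """Distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def likely_duplicates(song_a, song_b):
    """Each song is (duration_seconds, feature_vector); both checks must pass."""
    close_duration = abs(song_a[0] - song_b[0]) <= DURATION_EPSILON
    close_features = euclidean(song_a[1], song_b[1]) <= DISTANCE_RADIUS
    return close_duration and close_features
```

In practice the thresholds would have to be tuned against known duplicate pairs.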

Basic version of draft proposal: https://wiki.mozilla.org/Abhishek/metabrainz_GSoC2016Proposal


#4

Keep in mind that if you want to enable searching for artist/release data, that you will want to use data from MusicBrainz for that, and not just whatever information has been submitted with AB submissions.

Also note that gender is defined both as part of the data Essentia pulls out, but it is also possible to get a gender from the MusicBrainz Artist data (do note that there are relationships for vocal performances that should likely be preferred to the artist credit, but that may be something for a later stage - relationships are tricky), so you will likely want to make it possible to determine which is meant.

This sounds like a much better way to detect possible duplicates. The submitted (AcoustID) fingerprints can not be double checked by AB (since AB doesn’t have the actual audio data), and it is not unheard of for someone to apply the wrong AcoustID to a recording…

Please make and maintain your draft proposal here instead:


#5

Thanks for the more detailed proposal. It’s good to see some concrete ideas about this search project.

In response to some of the items that you listed on your draft proposal:

Text based AIR
Query can be any text, say ‘rock’ or ‘beethoven’, and the system will search through the text (tags, artist name, description) associated with the audio and return a list of relevant audio items for the matched text. It’s simple to implement, but doesn’t help much in audio retrieval, since most of the time the audio doesn’t contain enough annotations.

In AcousticBrainz we do have a lot of annotation data, including all of the information in MusicBrainz, which we can access via the MusicBrainz IDs which are used in both projects, and also the generated annotation data, which you listed in your post. For me this would be an ideal first part of the project, combined with some sort of visualisation of the data.

Vector Space Model(VSM)
This is a basic yet effective model. The idea is to represent every audio entity in vector form in the feature space.

This is a good idea, and is something that we can already do in Gaia. Was your idea to implement this from scratch, or use an existing system? If you were to do this, we should also do a comparison between the system you want to use and Gaia to see what one works best for our requirements.

Spectrum analysis(Fingerprint) model
This relies on fingerprinting music based on spectrogram.

The data in acousticbrainz isn’t detailed enough to build a fingerprint model. Additionally, MusicBrainz also has a fingerprinting system which it uses (https://acoustid.org/). Building and maintaining an additional fingerprinting system is probably too much work for a Summer of Code project, so I don’t think we would want to go with this idea.

Deep Neural Network Model (Deep learning Approach)

This is something that we’ve just started to experiment with at the Music Technology Group. We see this fitting more under our “Dataset building and model creation” section of AcousticBrainz, instead of search, but it is definitely something that we’re interested in. If you want to investigate more in this direction, you should add some more detailed information about how you think we could go about this project.


#6

@alastairp,

In AcousticBrainz we do have a lot of annotation data, including all of
the information in MusicBrainz, which we can access via the MusicBrainz
IDs which are used in both projects, and also the generated annotation
data, which you listed in your post. For me this would be an ideal
first part of the project, combined with some sort of visualisation of
the data.

Great! What sort of visualization are you expecting? Are you looking for visualization of clusters/groups, etc.? One option would be based purely on annotations: we could represent the audio data in a text feature space.

This is a good idea, and is something that we can already do in Gaia.
Was your idea to implement this from scratch, or use an existing
system? If you were to do this, we should also do a comparison between
the system you want to use and Gaia to see what one works best for our
requirements.

If there is already a system, it should be preferred. That said, I am not familiar with Gaia. Could you point me to some links or docs for Gaia, e.g. the underlying concepts behind it? We could also use Elasticsearch.

The data in acousticbrainz isn’t detailed enough to build a fingerprint
model.

The data isn’t detailed enough… could you elaborate a bit? I was expecting to use the audio signals of the songs for it.

This is something that we’ve just started to experiment with at the Music Technology Group. We see this fitting more under our “Dataset building and model creation” section of AcousticBrainz, instead of search, but it is definitely something that we’re interested in.

Can you give some use cases of “Dataset building and model creation”? I didn’t see this section on the AcousticBrainz ideas page.

If you want to investigate more in this direction, you should add some more detailed information about how you think we could go about this project.

I will think more about it and let you know.

PS: I will post the draft on the community forum for GSoC applications. My exams start next week, so I will do it after exams :slight_smile:


#7

Sure, I will keep that in mind. Thanks for the info :relaxed:

This sounds interesting, and tricky as well. Anyway, we will see; a nice thing to keep in mind.

Thanks a lot for your feedback.


#8

I was thinking of visualisations similar to the ones in the blog post that I linked to - pie charts or similar of the breakdown of particular attributes of the content in AB.

Gaia source code is here: https://github.com/MTG/gaia

Here is a brief description of how we do similarity in Freesound. It’s just taken from an email that one of our developers sent a few months ago. If you understand it, we can go into more detail.

To build similarity we build a feature space with all lowlevel features extracted from the Essentia’s Freesound extractor (list of descriptors here: http://www.freesound.org/docs/api/analysis_docs.html#lowlevel-descriptors).
We extract these features frame by frame and we compute 6 statistics measures at the end (mean, var and 2 derivatives of each).
This yields a high dimensional feature space of 6 statistics * 40 features = 240 “numbers” for each sound, some of them are multidimensional which increases the dimensionality of the feature space to ~500 or more.

What we do is that we normalize the feature space, and we perform dimensionality reduction (using PCA) to get a new feature space of 100 dimensions. This is the feature space that we use for the similarity search. So in summary: we use 100 first principal components of a very high dimensional feature space composed by all lowlevel features extracted by essentia. This is somehow defined in the code here: https://github.com/MTG/freesound/blob/master/similarity/similarity_settings.example.py (see PCA_DIMENSIONS and PCA_DESCRIPTORS).
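
The normalise-then-PCA pipeline described above can be sketched with numpy's SVD (random data and a tiny dimensionality stand in for the real ~500-dimensional Essentia feature space; this is an illustration of the idea, not Freesound's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))   # 200 sounds, 12 toy features (real space is ~500-d)
k = 4                            # Freesound keeps 100 components; 4 here for the toy

# 1. Normalise each feature to zero mean and unit variance.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. PCA via SVD: project onto the first k principal components.
U, S, Vt = np.linalg.svd(Xn, full_matrices=False)
X_reduced = Xn @ Vt[:k].T        # the reduced space used for similarity search
```

Nearest-neighbour queries would then run against `X_reduced` rather than the full feature space.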

I was hoping you could understand the data that we have available in AcousticBrainz by investigating the website and looking at the data that we have available to download there.

This is not described well anywhere yet. It was a project that we did last SoC. You can read a little about it in a paper that we published: http://mtg.upf.edu/node/3320
You can try it yourself if you create a user in AcousticBrainz and go to your profile page.


#9

@alastairp, apologies for the late reply; I had exams and then was out of town for work.

I had a look at the low-level descriptors generated by the Freesound API(s), and I understood the process used for similarity search in Freesound. The excerpts from the mail were really useful. Thanks for sharing :slight_smile:
PCA or LSI (SVD) could be used for dimensionality reduction; using LSI to project representations into a low-dimensional space that encodes some notion of similarity could be useful in our case.
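
As a toy illustration of the LSI idea (the recording-by-tag count matrix and tag names are invented for this sketch):

```python
import numpy as np

# Toy recording-by-tag count matrix (rows: recordings, cols: tags).
# Tags: ["rock", "guitar", "classical", "piano"] -- invented for illustration.
A = np.array([
    [3, 2, 0, 0],   # rock song
    [2, 3, 0, 0],   # another rock song
    [0, 0, 3, 2],   # classical piece
], dtype=float)

# LSI: truncated SVD keeps only the k strongest latent "topics".
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
docs_latent = U[:, :k] * S[:k]   # recordings in the k-dimensional latent space
```

In this latent space the two rock recordings land close together and far from the classical piece, which is the similarity notion LSI is meant to encode.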

Yes, sure. I would love to hear more details. Let me know if you have something in mind.

I went through the paper; a nice read indeed, thanks again for sharing. It helped me understand the whole picture with clarity. Now I understand that you were talking about building models from the low-level features submitted by users to the AB server, in order to automatically generate high-level descriptors (using different machine learning approaches). Together, these low-level and high-level descriptors would give better representations, and could also be used to learn new things about the collection.

I went through the Freesound APIs, and also watched some videos from the UPF course ‘Audio Signal Processing for Music Applications’. It covers good material on the Essentia and Freesound APIs.

Let me know your suggestions. I will start adding more details and timeline to the proposal.