GSoC 2019: Recording similarity for AcousticBrainz

Tags: #<Tag:0x00007f8d650eeeb8> #<Tag:0x00007f8d650eebe8>

Personal information

Name: Taras Lehinevych
IRC nick: letaiv
MB username: letaiv
Timezone: EEST – Eastern European Summer Time (UTC +3)


Google Doc version


  • Evaluate three libraries (Annoy[1], NMSLIB[2], Faiss[3]) for similarity search based on lookup speed, memory limitation and scalability
  • Build a search index with the chosen library, implement index update and an API for similarity search
  • Create an offline matrix of similarity and index update mechanism and provide pre-computed similarity matrix as a data dump


Evaluation of ANN libraries

The evaluation process will be implemented based on ANN-benchmark repository[4] where recall-queries per second (1/s) tradeoff is measured. We have 12 different metrics[5], it is planned to consider a combination of metrics into one as well as separate search per metric. Moreover, in order to determine how sensitive index is to a growing amount of recording, we will consider three datasets with 10 000, 100 000 and 1 000 000 recordings. Besides, there are more factors that have an influence on lookup speed, such as a number of queried vectors (k most similar ones), different index structures and algorithm, memory consumption and additionally should be analyzed other advantages of each algorithm and library. More detailed list of evaluation criteria are in Tasks section

The expected result is code for evaluation result reproduction and report of received results with a summary.

NOTE! For below I assume that we will use NMSLIB. The distance and similarity terms are interchangeable thus it is the matter of how we will defind space (l2, cosine, ip)

Search index and API endpoint

Based on the chosen algorithm and library we are going to implement the index creation and index update mechanism as well as API endpoint for similarity search based on Flask. The main work here to integrate the chosen library and index generation flow to acousticbranz-server repository. All the approaches support straightforward interface for index creation, the open question is about update mechanism thus some of the algorithms require index regeneration. Also, the error handling will be provided e.g. when MBID is not in index.

The index generation is straightforward when we select the require vector for MBIDs (most likely it will the combination of metrics defined in Tovstogan master thesis). Below is the snippet of code for creating index. The add_items could be used multiple times and based on this function we can update our index. I think we can implement index build and update similar to high level feature extraction from Iterate over new recording (initially over all recordings) and for given recording generate metrics and then add it to index.

import hnswlib
import numpy as np

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
data_labels = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip
# Initing index
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

# Element insertion (can be called several times):
p.add_items(data, data_labels)

# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k = 1)

API Endpoint

GET request /api/v1/(uuid:mbid)/similarity_search/

Query Parameters:
k: number of nearest neighbours to be returned (default k=10)

Response Headers:
Content-Type – application/json

Response Example:

      "mbid": "mnid-1-ca5dfcc3-83fb-3eee-9061-c27296b77b2c",
      "distance": 0.01
    . . .

GET request /api/v1/similarity_search/

Query Parameters:
recording_id: Required. A list of recording MBIDs
k: number of nearest neighbours to be returned (default k=10)

Response Headers:
Content-Type – application/json
Response Example:

    "mbid-1-ca5dfcc3-83fb-3eee-9061-4d4de4fb7b17": [
               "mbid": "mbid-10-ca5dfcc3-83fb-3eee-9061-c27296b77b2c";
               "distance": 0.01
               "mbid": "mbid-11-fdad0a6b-3d67-3bd5-a31f-4d4de4fb7b17";
               "distance": 0.015
     . . .

NMSLIB index supports batch search with multithreads. This endpoint could be utilized for better performance and decreasing the number of requests.

# get all nearest neighbours for all the datapoint
# using a pool of 4 threads to compute
neighbours = index.knnQueryBatch(data, k=10, num_threads=4)

Also, we could add /api/v1/datasets/(uuid:dataset_id)/similarity_search/ to get all similarity for specified dataset (basically it’s the same as for batch case).

The UI for exploring the similar recordings

The most similar MBIDs in table are clickable, what help to explore the similar recording
By instead query and displaying MBIDs I would propose to use musicbrainzngs in order to get correspondent entity and display the name of recording or simply querying musicbrainz from UI for MBIDs.

The expected result is an API endpoint covered by tests as well as index creation and index updated mechanism code (possible UI for exploring the similarity, I think there will be no time to update the AcousticBrainz website to include navigation by similarity, however if some time left, I will work on it)

Offline similarity matrix

I see three possible solution here:

  1. Based on index itself. The index could be simple save to file and simple shared with other projects. All they need to do is to add nmslib dependency and load the index file. Then querying could be done via calling knnQuery or knnQueryBatch.
    Advantages: simple index file sharing, flexibility to choose number of nearest neighbours, simple update mechanism (just use add_items function for new recordings and save the new index file)
    Disadvantages: we won’t achieve constant time guarantee for getting similar recordings, have to install nmslib dependency and not clear about memory consumption (more memory - better recall)

  2. Another approach is to use redis. Be generate all the nearest neighbours (for example k=1000) for all the datapoint and save then into redis.
    Advantages: guarantee constant time for getting similar recordings
    Disadvantages: memory consumption, for update we have to get all nearest neighbours for all the datapoint everytime

  3. Third approach is to save index into HDF5[6] file format that could be for search the most similar record. The benefit of this format that it search the required row by index, can read separate parts of file independently and compress data (binary format). This file could be shared and used in other projects. However, the main concern here is look up speed and that support of parallel I/O won’t provide constant query time (this is my hypothesis and should be experimentally verified).
    Advantages: smaller size and memory footprint, almost constant time for querying data
    Disadvantage: for update we have to get all nearest neighbours for all the data point every time, definitely slower than redis approach

For offline index generation we simply iterate over MBIDs and call index.knnQueryBatch(MBIDs, k=10), the result is saved to redis or HDF5 file as pairs MBID: list of similar MBIDs. We could also use index.knn_query(data, k=1), where data contains all out MBIDs, but we could get out of memory error.

For the second and third approaches we have to get all nearest neighbours for all the datapoint every time because for old recording the new one could appear to be quite similar. That’s why only generate the list of nearest neighbours for new recording is not enough. However for optimization purposes we can assume that we must regenerate only that old recording that appear in list for new one. For example we have list of nearest neighbours for 10 recordings (ids range 1-10). Then we generate a list for a new one and there appear ids 5 and 7, then for these two recordings we also have to generate new list of nearest neighbours.

Example for first approach:
import hnswlib
import numpy as np
dim = 16
num_elements = 10000
# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]
# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip
# Initing index
# ef_construction - controls index search speed/build speed tradeoff
# M - is tightly connected with internal dimensionality of the data. Strongly affects the memory consumption (~M) Higher M leads to higher accuracy/run_time at fixed ef/efConstruction
p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)
# Controlling the recall by setting ef higher ef leads to better accuracy, but slower search
# Adding first batch 
# Query the elements for themselves
labels, distances = p.knn_query(data1, k=1)
# Serializing and deleting the index:
del p
# Reiniting, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function. 
# Loading index from 'first_half.bin'  
# Adding the second batch
# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)


  1. Evaluation of ANN libraries
  • Data set collection. Based on research and code[5] create three collections (10 000, 100 000 and 1 000 000).
    • Get all metric specified in the master thesis for data
    • Combine metrics into one vector
    • Pregenerated datasets save in HDF5 format [6]
  • Based on ANN-benchmark code[4] implement approach for empirical comparison of proposed libraries in an objective way.
    • Consider a varied number of queried vectors (e.g. 100, 300, 1000)
    • Consider a memory limitation, when the index doesn’t fit into memory
    • Consider the increasing number of recording (three data sets)
    • Consider multiple indexes (e.g. SWGraph, HNSW and BallTree)
    • Consider other advantages of different indexes (for example some of the indexes support incremental construction)
  • Create a report with visualization and summary.
  1. Build a search index with update mechanism and developed an API for similarity search
  • Integrate chosen similarity search library and other required dependency with acousticbrainz-server
    • Adding dependency
    • Modifying Dockerfiles
  • Build an index and add update mechanism. It dependents on chosen algorithm and library
    • Add separate service to all docker-compose.yml files
    • Implement index building functionality
    • Implement index update functionality
  • Implement API endpoint for similarity search
    • Flask API endpoint
  • Add tests and write documentation
    • Tests for endpoint
    • Tests for similarity search index
    • Add documentation
  1. Create offline matrix of similarity, index update mechanism and provide similarity matrix as a data dump
  • Create offline matrix similarity based on index from task 2
  • The matrix should be saved in HDF5 file format
  • Load the matrix into redis
  • Run performance testing for both approaches
  • Create offline matrix dump


Duration Task
April 9 Deadline for submitting Project Proposal
May 6 - May 27 Learn more about acousticbrainz-server architecture. Read Documentation and learn more data flow and use cases. Work on simple bugs and issue to get familiar with the development process and tools. Get Involved with MetaBrainz community. Know more about mentors such as their timezone, preferred medium of communication etc.
May 27 - June 28 Working on Task 1
June 24 - June 28 First Evaluations
June 28 - July 22 Working on Task 2
July 22 - July 26 Second Evaluations
July 26 - August 19 Working on Task 3
August 19 - August 26 Buffer period. Bug fixes, additional testing and completing documentation. Code and evaluations submission.

Detailed information about yourself

I am Taras Lehinevych, a 2nd year PhD student at the Institute of Software Systems of NAS of Ukraine and National University of “Kyiv-Mohyla Academy” (joint program). I have experience in programming with multiple languages such as Python, C/C++, Java. I passionate about machine learning, natural language processing, and neural networks. My PhD thesis is on a topic “Similarity metric learning for objects from heterogeneous environment” and working on the recording similarity would help to more deeply explore the existing library for approximate nearest neighbor libraries in Python. On the other hand, I support the idea of open data and research and want to help the community.

Tell us about the computer(s) you have available for working on your SoC project!
Dell Inspiron 15R, Intel i7-4500U CPU @ 1.80GHz × 4 and 8 GB RAM, running Ubuntu.
PC: Inter i7 3.2 GHz x 8 and 16 GB RAM, running CentOS

When did you first start programming?
At school, Pascal and Delphi, later on at university C/C++ and Python.
What type of music do you listen to? (Please list a series of MBIDs as examples.)

  • ca5dfcc3-83fb-3eee-9061-c27296b77b2c
  • fdad0a6b-3d67-3bd5-a31f-4d4de4fb7b17

What aspects of the project you’re applying for (e.g., MusicBrainz, AcousticBrainz, etc.) interest you the most?
Have you ever used MusicBrainz to tag your files?
I’ve recently started ti use it
Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?
I was contributing to ManageIQ integration tests repository quite a long time ago.
If you have not contributed to open source projects, do you have other code we can look at?
You can take a look at MediaWikiAPI

What sorts of programming projects have you done on your own time?
Mostly on reproducing some machine learning papers
How much time do you have available, and how would you plan to use it?
Around 30 hours per week
Do you plan to have a job or study during the summer in conjunction with Summer of Code?
I have two conferences and research tasks related to my PhD thesis, but I planned to dedicate most of time to GSoC.


The upper limit of testing should be all unique tracks so about 4,000,000 recordings.

I like that you are suggesting an API endpoint – very good. However, I think for our immediate needs we would like just the data-set as a result. So, if more of this proposal can focus on picking the right algorithm and focus on running it on the 4M tracks before summer is done that would be a killer outcome. Considering the API as stretch goal would be a great way of thinking about it in this context.

Finally, I think the offline similarity matrix could initially go into Postgres, via on the fly import or by importing resultant CSV files, that’s is good enough for this project. Once we have it in postgres we can start using it and then work to get it into our Apache Spark cluster where we will likely need the output in the long run.

I think you should deemphasize the final matrix points and creating the API endpoint in favor or more deployment testing – how many VMs do we need to utilize for how many days to calculate the whole 4M item index. Putting these tools into docker images so that we can spin them up on a number of VMs would be very very useful in the context of this project. Any scripts to help fire up VMs and get the job done are all very helpful.

Making the data more useful to the public and other projects can be dealt with a little later on. I quite like your proposal, good luck on making a round of edits.


Also, if you are unfamiliar with docker, we can help with that. You can fudge the details of how docker gets involved in this proposal and I will not hold it against you. :slight_smile:


Hi, thanks a lot for your comments and clarifying the priorities.
I’ll add require information to the proposal.


Also make sure to get your proposal submitted to the summer of code site in time – the deadline is approaching fast!


One more thing, and this doesn’t have to happen for tomorrow’s deadline, but we’d really like to see some reasonably substantial PR towards the AB codebase so we can get a feel for your coding style and how you interact with the team.



Thanks for advice, I’ll work on PR after proposal submission.