GSOC 2021: Identifying bad data submissions in AcousticBrainz


Personal Information

Name: Rohan Pillai
IRC Nick: Rohan_Pillai
Email: rohanpillai22@gmail.com
Time Zone: UTC +5:30

INTRODUCTION

The concept of using similarity to label incorrect submissions was first discussed at the 2018 MetaBrainz Summit. Since then, it has mainly been developed in #364 as part of Aidan's 2019 GSoC project, which is heavily derived from Philip's thesis.

This project aims to build on previous work in this field by clustering recordings with the same MBID to detect bad submissions. These submissions can then be flagged, so they are not provided to users or used in other machine learning tasks.

The main objective of this project is to apply a clustering algorithm to flag submissions that are incorrectly labelled.

Detailed Implementation:

1. The current state of the Similarity Engine (and why it cannot be used to identify outliers)
According to the idea page “We have a nearest neighbour system which allows us to group data in an n-dimensional space. We want to use this system to cluster recordings which have the same MBID”.

It’s important to note here the difference between nearest-neighbour search and clustering. The two are frequently confused with one another, but they serve different purposes.

The current similarity engine (#364) is based on Spotify’s Annoy, an approximate nearest-neighbour index. Using nearest neighbours for classification (k-NN) is a supervised method and requires the data to be labelled, i.e. data points must be marked as either correctly or incorrectly tagged. When a new point is to be labelled, it is assigned the most frequent label among the k training samples approximately nearest to it. Since we do not have a dataset labelled with correct and incorrect submissions for each MBID, this project needs to be implemented using a clustering algorithm. Clustering is an unsupervised machine learning technique in which data points are grouped such that points in the same group are more similar to each other than to points in other groups.

2. Clustering Algorithm

To detect outliers in submissions, I intend to use DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a density-based clustering algorithm. The algorithm looks for areas of high density and assigns a cluster around each one, whereas points in less dense regions are not included in any cluster and are labelled as anomalies (in our case, this would imply an incorrect submission).

There are two main parameters in this algorithm:

— eps: the maximum distance between two points for them to be considered neighbours. If this distance is too large we might end up with all the points in one huge cluster; if it’s too small we might not form a cluster at all.

— min_points: the minimum number of points required to form a cluster. A low value might produce a lot of very small clusters, while a large value can stop the algorithm from creating any cluster at all, leaving a dataset formed only of anomalies.

The algorithm creates clusters by looking at how many neighbours each point has, where the neighbours of a point are all the points closer to it than a certain distance (eps). If a point has at least min_points neighbours, a cluster is created, and the cluster is then expanded with the neighbours of its neighbours.
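To make the expansion step concrete, here is a toy sketch of the idea in pure Python (for intuition only; the actual project would use scikit-learn’s DBSCAN):

```python
def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including itself)."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]


def dbscan(points, eps, min_points):
    labels = [None] * len(points)            # None = not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:     # not a core point: mark as noise
            labels[i] = -1                   # (may later become a border point)
            continue
        cluster_id += 1                      # start a new cluster at this core point
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:                         # expand via the neighbours of neighbours
            j = queue.pop()
            if labels[j] == -1:              # noise reachable from a core point
                labels[j] = cluster_id       # becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)
    return labels


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # one dense group
          (5.0, 5.0)]                                       # one isolated point
print(dbscan(points, eps=0.5, min_points=3))                # -> [0, 0, 0, 0, -1]
```

The isolated point has too few neighbours to be a core point and is never reached from one, so it gets the noise label -1, exactly the behaviour we rely on to flag bad submissions.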

Based on the above clustering algorithm, submissions can be labelled as ‘outlier’ or ‘standard’ submissions. ‘Outlier’ points are submissions that have been classified as noise by the algorithm (indicated with blue in the above diagram) and most probably correspond to submissions tagged with the wrong MBID. ‘Standard’ submissions correspond to either core or border points (yellow and red in the diagram) and to submissions that were tagged correctly.

3. Proof of Concept

The code below is a simple proof of concept using DBSCAN to identify incorrect submissions for On your own (MBID: f79c672c-8f08-4a9e-9d5d-db0b76ba9290). I used the AcousticBrainz API to get the first 17 of the 43 documents and stored them in a CSV for analysis.

For this proof of concept, I used the probabilities of genre_electronic, genre_rosamerica and genre_dortmund as the features; however, a more thorough investigation into feature selection will be done during the summer.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA

X = pd.read_csv('results.csv')
#X = X.drop('offset', axis=1)
print(X.head())

#Scaling the data to bring all the attributes to a comparable level
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#Normalizing the scaled data so that each sample has unit norm
X_normalized = normalize(X_scaled)

#Converting the numpy array back into a pandas DataFrame
X_normalized = pd.DataFrame(X_normalized)

#Reducing the data to 2 dimensions so it can be visualised
pca = PCA(n_components=2)
X_principal = pca.fit_transform(X_normalized)
X_principal = pd.DataFrame(X_principal, columns=['P1', 'P2'])
print(X_principal.head())

#Building the clustering model
db_default = DBSCAN(eps=0.5, min_samples=5).fit(X_principal)
labels = db_default.labels_

#Building the label to colour mapping: red for cluster 0, black for noise (-1)
colours = {0: 'r', -1: 'k'}

#Building the colour vector for each data point
cvec = [colours[label] for label in labels]

#Plotting P1 on the X-axis and P2 on the Y-axis
#according to the colour vector defined above
plt.figure(figsize=(8, 8))
plt.scatter(X_principal['P1'], X_principal['P2'], c=cvec)

#Building the legend from empty scatter handles
r = plt.scatter([], [], color='r')
k = plt.scatter([], [], color='k')
plt.legend((r, k), ('Label 0 (standard)', 'Label -1 (outlier)'))
plt.show()

Output:

Note: Although it seems like there are fewer than 17 points in the cluster, some of the points in the cluster are below one another.

4. Feature Selection and Tuning Hyperparameters

With any machine learning algorithm, a large part of the implementation involves choosing the correct features and settings.

Setting eps:
The value for eps can be chosen using a k-distance graph [1]. For every point, we compute the distances to its k nearest neighbours and average them. These k-distances are then plotted in ascending order. The aim is to determine the “knee”, a threshold where a sharp change occurs along the k-distance curve; it corresponds to the optimal eps parameter.
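A minimal sketch of this computation in pure Python (in practice, sklearn.neighbors.NearestNeighbors would do this efficiently on real data):

```python
def k_distance_curve(points, k):
    """For each point, the distance to its k-th nearest neighbour,
    sorted ascending. Plotting this curve and locating its 'knee'
    gives a candidate value for eps."""
    curve = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points) if j != i
        )
        curve.append(dists[k - 1])
    return sorted(curve)


print(k_distance_curve([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)], k=1))
# -> [1.0, 1.0, 2.0]
```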


In the diagram, the value of eps corresponds to approximately 0.15.

Setting min_points:
When Ester et al. [1] first presented DBSCAN in 1996, based on their research they recommended a default value of 4 for min_points (for two-dimensional data), while Sander et al. [2] suggested setting it to twice the dataset dimensionality, i.e. min_points = 2 × no_of_dimensions.

Selecting clustering features:
Performing cluster analysis on the entire high-level feature set will most probably run into the curse of dimensionality. To reduce the number of features, we can employ standard techniques such as correlation among features and a variance threshold.

If two features have a high correlation, we can predict one from the other. Therefore, the model only really needs one of them, as the second one adds no additional information. For example, the correlation between mood_happy(happy) and mood_sad(not_sad) should be high, so we could remove one of these features.
(Note: Interestingly, according to Liem [3], mood_happy(happy) and mood_sad(not_sad) seem to have almost no correlation.)

Variance threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples. Here we assume that features with a higher variance may contain more useful information.
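To make these two filters concrete, here is a toy sketch on made-up feature columns (the names and thresholds are illustrative, not real AcousticBrainz outputs; scikit-learn’s VarianceThreshold would be used in practice):

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (variance(xs) ** 0.5 * variance(ys) ** 0.5)


features = {
    "mood_happy":   [0.9, 0.1, 0.8, 0.2],
    "mood_not_sad": [0.8, 0.2, 0.9, 0.1],   # nearly duplicates mood_happy
    "constant":     [0.5, 0.5, 0.5, 0.5],   # zero variance: uninformative
}

# Variance threshold: drop features whose variance is (near) zero
kept = {name: col for name, col in features.items() if variance(col) > 1e-9}

# Correlation filter: of a highly correlated pair, only one is needed
r = correlation(features["mood_happy"], features["mood_not_sad"])
print(sorted(kept), round(r, 2))   # -> ['mood_happy', 'mood_not_sad'] 0.96
```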

While techniques such as correlation among features and a variance threshold will be effective in reducing the number of features, I think the right approach to reducing the dimensions (features) is to standardise the features and then perform Principal Component Analysis (PCA). PCA is a statistical technique that reduces the dimensionality of the data (the number of features in the dataset) by selecting the components that capture the most information about the dataset.
The original features are converted to principal components, which are linear combinations of the existing features, ranked by the variance they explain. The component that explains the highest variance is the first principal component, the component responsible for the second-highest variance is the second principal component, and so on.
Although principal components try to cover the maximum variance among the features in a dataset, if we don’t select enough of them we may lose information compared to the original list of features. Andrew Ng has an interesting video on choosing the number of principal components here
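One common rule of thumb for picking the number of components, the one Andrew Ng describes, is to keep the smallest number whose cumulative explained variance reaches a target. A sketch with made-up ratios (with scikit-learn, the ratios would come from PCA’s explained_variance_ratio_ attribute):

```python
def components_for_variance(explained_ratios, target=0.9):
    """Smallest n such that the first n principal components together
    explain at least `target` of the total variance."""
    total = 0.0
    for n, ratio in enumerate(explained_ratios, start=1):
        total += ratio
        if total >= target:
            return n
    return len(explained_ratios)


# e.g. ratios as reported by sklearn's PCA(...).explained_variance_ratio_
print(components_for_variance([0.5, 0.3, 0.15, 0.05], target=0.9))  # -> 3
```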

5. High-level Implementation:

This can be divided into 2 parts: a once-off job to cluster all the records already in the database, and extending the hl extractor to cluster incoming submissions.

5.1 Pseudo-code for the once-off job:

for each MBID in the database:
    get all offsets
    if number of offsets < minimum number of records required to cluster:
        skip this MBID
    else:
        standardise the data
        select the first n principal components
        cluster the submissions
        store the results of clustering as a JSON (where the offset acts as the key and the value is either 'standard' or 'outlier') in the 'quality' field of the 'info' table
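The final storage step of this pseudo-code, turning DBSCAN’s labels into the JSON kept in the quality field, might look like the sketch below (the function name and exact JSON shape are hypothetical, to be settled during implementation):

```python
import json


def quality_json(labels):
    """Map each submission offset to 'outlier' (DBSCAN noise, label -1)
    or 'standard' (member of a cluster), serialised for the quality field."""
    return json.dumps({str(offset): ("outlier" if label == -1 else "standard")
                       for offset, label in enumerate(labels)})


print(quality_json([0, 0, -1, 0]))
# -> {"0": "standard", "1": "standard", "2": "outlier", "3": "standard"}
```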

The minimum number of records needed to perform clustering and the time complexity are both dependent on the number of principal components used.

According to McClelland [5], the minimum sample size to conduct cluster analysis is 2^m, where m is the number of dimensions. However, there seems to be very little research on this and it is a parameter I would like to experiment with.
According to Nidhi and Patel [4] (https://doi.org/10.1016/j.procs.2016.07.336), the “computational complexity of the whole algorithm is O(n²), where n is the number of degrees (points) in the data set. If effective index structures are used and the dimension of degrees is low (d<=5), the computational complexity of DBSCAN can be reduced to O(n log n).” Hence we’re looking at between O(n²) and O(n log n), depending on the number of principal components.

5.2 hl extractor:

The hl extractor is a multi-threaded tool used to obtain high-level features from low-level data.
In the main function of acousticbrainz-server/hl_extractor/job_calc.py, as soon as the high-level data is obtained and added to the high-level table, I plan on calling a function cluster_new_submission(mbid, jdata). This function is responsible for clustering all the new submissions and adding their quality to the info table.

In acousticbrainz-server/hl_extractor/job_calc.py:

try:
    jdata = json.loads(hl_data)
except ValueError:
    current_app.logger.error("error %s: Cannot parse result document" % mbid)
    current_app.logger.error(hl_data)
    sys.stdout.flush()
    jdata = {}

db.data.write_high_level(mbid, ll_id, jdata, build_sha1)

"""
Passing the MBID and the JSON containing the high-level data to the cluster function for analysis.
submission_filter is a folder in the acousticbrainz-server directory containing the functions required for clustering.
"""
submission_filter.cluster_new_submission(mbid, jdata)
current_app.logger.info("done %s" % mbid)
sys.stdout.flush()
num_processed += 1

Note 1: Unfortunately, the process of storing the position of each point in a high-dimensional space and recalling it is complex; hence, every time a new submission is added, we will have to re-cluster all the data for that MBID.
Note 2: The basic steps in cluster_new_submission(mbid, jdata) would be similar to those of the once-off job.

5.3 Schema changes required:
To integrate this with AcousticBrainz, a new table called quality_info will be created.

CREATE TABLE quality_info (
    gid                UUID NOT NULL PRIMARY KEY,
    quality            VARCHAR,
    recommended_offset INTEGER
);

The gid field acts as the primary key and stores the MBID of a submission.
The quality field stores each submission offset and its quality (either 'outlier' or 'standard') in the form of a JSON.
Note: For the time being, the recommended_offset column will be null; it will be used later to store the offset of our suggested submission. In the case where there is not enough data to cluster, the value of recommended_offset will be -1.

6. Creating the /uuid/info API:

To display the information, I plan on creating an endpoint at /<uuid>/info.
The following code will be added to acousticbrainz-server/webserver/views/api/v1/core.py:

@bp_core.route("/<uuid:mbid>/info", methods=["GET"])
@crossdomain()
@ratelimit()
def quality_info(mbid):
    """
    This endpoint returns a list of all the offsets along with whether each one is an outlier or not.
    Returns:
    {"mbid1": {"offset1": "standard",
               "offset2": "standard"},
     "mbid2": {"offset1": "outlier"},
     "mbid_mapping": {"MBID1": "mbid1"}}
    """
    map_classes = _validate_map_classes(request.args.get("map_classes"))
    recordings = _get_recording_ids_from_request()
    mbid_mapping = _generate_normalised_mbid_mapping(recordings)
    recordings = [(mbid, offset) for _, mbid, offset in recordings]

    # load_quality_info(recordings, map_classes) fetches the quality
    # information from the quality_info table
    recording_details = db.data.load_quality_info(recordings, map_classes)
    if recording_details is None:
        return "the data for this MBID was inconclusive"

    recording_details['mbid_mapping'] = {}
    if mbid_mapping:
        recording_details['mbid_mapping'] = mbid_mapping
    return jsonify(recording_details)

7. Timeline:

week 1-2:
Experiment with different features and parameters to find the optimum settings for clustering. These include min_points, eps (using a k-distance graph), the number of principal components required (and its effect on runtime), and the minimum number of points required to cluster.
I also want to compare principal component analysis with selecting a subset of features using correlation and a variance threshold.

week 3-4:
Focus on implementing the changes required to the hl extractor to cluster new submissions and store the results of clustering. Develop a mechanism to continuously update the quality_info table when new recordings are added.

week 5-6:
Building the once-off job to cluster records already in the database and testing out the results locally using the data dumps.

week 7-8:
If everything goes according to schedule, at this point I want to run the once-off job on the production database.
This period has a lighter workload to account for the time the clustering job will take to run and give me time to fix any bugs and issues that might arise.

week 9-10:
Building /mbid/info API.

Detailed information about me:

Tell us about the computer(s) you have available for working on your SoC project!
I have a Dell Inspiron with an i5 processor and 12GB of RAM, running Linux.

When did you first start programming?
I began programming with Java when I was 15, and I’ve known Python for about 2 years now.

What type of music do you listen to?
I love old school rock and pop!
ABBA (d87e52c5-bb8d-4da8-b941-9f4928627dc8)
Queen (0383dadf-2a4e-4d10-a46a-e9e041da8eb3)
Adele (cc2c9c3c-b7bc-4b8b-84d8-4fbd8779e493)
Bryan Adams (4dbf5678-7a31-406a-abbe-232f8ac2cd63)
are some of my favourite artists.

What aspects of the project you’re applying for (e.g., MusicBrainz, AcousticBrainz, etc.) interest you the most?
My favourite aspect of AcousticBrainz is its role in building music recommendation systems. I think music recommendation systems have room for improvement (besides Spotify, most engines recommend artists I already listen to) and AcousticBrainz with its 20 million crowdsourced submissions can be a driving force in building new and better recommendation systems.

Have you ever used MusicBrainz to tag your files?
Yes! I’ve used Picard to tag them.

Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?
MetaBrainz is the first open-source organisation I have contributed to, with most of my contributions being to ListenBrainz.

What sorts of programming projects have you done on your own time?
I love hackathons and the concept of spending 24 hours straight building something amazing!
As part of a hackathon, I built a system to screen patients for Depression using acoustic features of a person’s speech such as pitch, loudness, and speaking rate. ( and we ended up winning the hack!)
I worked on a system that allows patients to store and update their medical records on the cloud and access them using face recognition.
Tried building a system to detect car accidents using the accelerometer in a person’s phone.

How much time do you have available, and how would you plan to use it?
I can dedicate 30 hours a week to GSOC

Do you plan to have a job or study during the summer in conjunction with Summer of Code?
No

Citations

[1] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96). AAAI Press, 226–231.

[2] Sander, J., Ester, M., Kriegel, HP. et al. Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery 2, 169–194 (1998). https://doi.org/10.1023/A:1009745219419

[3] Liem, C. C. S., & Mostert, C. (2020). Can’t trust the feeling? How open data reveals unexpected behavior of high-level music descriptors. In Proceedings of the 21st International Society for Music Information Retrieval Conference (pp. 240-247)

[4] Nidhi and Km Archana Patel. An Efficient and Scalable Density-based Clustering Algorithm for Normalize Data. 2nd International Conference on Intelligent Computing, Communication & Convergence

[5] McClelland, Robert. (2015). Re: What is the minimum sample size to conduct a cluster analysis?. Retrieved from:https://www.researchgate.net/post/What-is-the-minimum-sample-size-to-conduct-a-cluster-analysis/55ff1a97614325339b8b45d6/citation/download.


Hi @alastairp
This is the first draft of my GSOC proposal that I discussed with you. I’d love to hear your opinion on it.

Hi @r0han22, thanks for your proposal!

This proposal to use DBSCAN looks interesting. I hadn’t heard of the tool before but it looks like it’s useful for this specific task. Great work on the proof of concept, it makes the results very easy to see and understand. It seems that a large part of this project may involve choosing the correct features and settings. There are over 5 million unique mbids in the database and lots of different features that we could use.

Did you perform PCA to 2 dimensions only for the purposes of displaying the graph? Can we do DBSCAN with 10 or 100 dimensions? Do you know if this will improve the results? Will it have an effect on the runtime?

When we last talked, I gave a few suggestions, and I think we might have mixed up some of the ideas a bit. I think that we can take a bit of care and split these tasks into separate items.

identification of outliers

In this part we’re not interested in selecting a single good representation, instead we want to find all bad submissions in the database. We will have to do this in two places:

  1. on all of the existing data in the database. To do this we need some tool that we can run as a once-off job
  2. when new submissions come in to the database we need to re-compute outliers on clusters for new submissions

You mentioned that you wanted to perform this clustering on demand instead of on existing items and as new items come in. However, in our experience this type of on-demand computation often uses many more resources than you might expect. The correct way to do this task is to pre-compute all outliers ahead of time. We have a lot of computational resources available for us to do this.
We have some example tools that process incoming data, such as the hl extractor. You should add an overview about how a new tool would work. This doesn’t have to include any code examples, but some pseudo-code would be good.

The results of this tool can be stored in the database. I’m not sure if it’s best to put it directly on the lowlevel table, or to make a new table. This is something that we can test during the project. This data would include for every lowlevel row a boolean flag saying if the row is an outlier or not.

Is it possible that there could be more than 1 cluster for an mbid with hundreds of submissions? In this case do you have a suggestion about how we should present the data? Should we try and make only one cluster?

You made a comment in your timeline: “Create the helper function and edge cases, such as returning a message to the user if there are not enough records for an MBID to accurately cluster”.
This is really interesting, please add more information to your proposal about it. Why might this happen? What do you think that the behaviour should be for the clustering and the API if there is not enough data?

api for information about an mbid

You have given an example for a good-representation endpoint, but I think that this isn’t the right endpoint that we should provide yet. Instead, I think that it would be better if we had an endpoint called something like /<uuid>/info. This would include:

  1. the number of submissions that we’ve had for this mbid
  2. for each offset, a flag to say if it is an outlier or not.

A user of the API can retrieve this endpoint to get a list of all duplicate submissions and which ones are outliers. Then they can choose any non-outlier item to retrieve the actual contents. When we implement the good representation task, we can also add another field to the output of this info endpoint which indicates our suggested duplicate offset as the good submission.

identification of good representation

I see this as an extension of the previous task, and I’m not sure if we’ll have time to implement it, however we can definitely add it to the proposal as a new part of work and implement it if we have time.
In your proposal you described how we will identify outliers; however, in this part of the proposal I don’t see any discussion about how you will identify a single good submission if there are hundreds of possible items. You should dedicate a section of your proposal to explaining some ideas about how this might work. This will definitely require some explicit time in your timeline to perform this investigation.

This task probably splits into the same general structure as above - some investigation, implementation (how you will do it, how you will store it [you have this]), and API. Make sure you split these items out in your proposal and timeline.

Remember that we have two styles of requests in the AB API, one which gets one item at a time, and one which allows us to get bulk items (e.g. 25 items at once). We have decided to focus on these bulk endpoints in AB (because you can also get just one item from this API too). Your examples should be based off of this bulk API structure.


This is a great first draft of a proposal, but I think that it will be stronger if it includes some of the following suggestions:

  • describe in a bit more detail some of the possible research that we need to do to find the optimal settings for clustering
  • describe how we can pre-compute clusters ahead of time based on my earlier comments
  • starting at the section “Creating of new API”, give more detail about each of the sub-tasks that I suggested above. You can use some of my text and examples that I added here, but remember also that this is your proposal. Investigate some of my points further, and if you have any questions then please ask me - either here, in a new thread, or in irc.
  • When you want to explain a new feature or view, just do some very high-level pseudo-code. For example, I think that your examples of get_good_representation_low_level and load_good_representation are useful, but contain too much detail for a proposal of this level. See if you can rewrite each in just 4-5 lines of pseudocode
  • Don’t include lazy items in your timeline such as “add test cases” and “write documentation”. These tasks should never be done as separate items, instead they should be expected parts of the actual items in your timeline. If this means that you have to re-estimate your tasks because they might take longer then that’s fine.
  • Remember that because of the scale of data that we are working with, integrating a simple proof of concept tool that works with 100 items may take additional time when working with 10 million items. Keep this in mind in your timeline.
  • Don’t be too worried about specific implementation tasks in your timeline. For example the task “Modify the .sql files to alter and create the required tables” is very specific. What you’re actually doing here is “Compute clusters and store results in the database”. Talk about general ideas instead of specific tasks.
  • Your timeline could include some time in the middle for deploying this tool in the production website and computing clusters. Don’t put this at the end of the proposal, this way we can try and get your work released on the website as soon as possible! e.g. do something like: cluster settings investigation, tools for computing clusters, API for submission info, deploy, investigate choosing best submission, computation of best submission, api for best submission.
  • carefully revise the existing material that you cite at the beginning of the proposal, I believe you spelled Aidan’s name incorrectly.

thanks again!


Hey, thank you for the feedback. I appreciate the time and effort you put into writing it.

Based on your feedback, I’ve made some changes to my proposal’s structure, objectives and timeline. I’ve decided to focus on identifying bad submissions during the GSoC period and keep recommending good submissions as a stretch goal or something to work on post-GSoC.

I’ve restructured my proposal into

  1. A proof of concept for DBSCAN
  2. My initial research on feature selection and further work required.
  3. A high-level implementation (subdivided into a one-off job and the changes required to the hl extractor to cluster incoming submissions)
  4. Displaying the information to the user via an API

Addressing some of the questions you had:

Yes, using only 2 principal components was for visualising the results. DBSCAN can be performed with more than 2 features. While increasing the number of dimensions will let us extract more information from our data, it will also cause the runtime to increase. This seems to be a classic case of finding a balance between runtime and accuracy. To give you an idea of the runtime, according to Nidhi and Patel [4] (https://doi.org/10.1016/j.procs.2016.07.336), the “computational complexity of the whole algorithm is O(n²), where n is the number of degrees (points) in the data set. If effective index structures are used and the dimension of degrees is low (d<=5), the computational complexity of DBSCAN can be reduced to O(n log n).” Hence we’re looking at between O(n²) and O(n log n), depending on the number of principal components. Andrew Ng has an interesting video on choosing the number of principal components here

Yes, more than one cluster can exist. I can think of 2 reasons why this would happen.

  1. The formatting of the song and its compression will produce different results.
    Although the song’s formatting will cause the metadata to vary, this variation will be minimal and can be handled by selecting a suitable value for eps.

  2. A malicious user may submit an incorrectly tagged record multiple times to create a new cluster.
    This case is harder to detect. While I initially thought we could select the largest of the clusters and have a mechanism to stop a user from making too many submissions to the same MBID (to stop them from making their cluster the largest), I think throttling the number of submissions a user can make would not be the right approach and goes against the community-driven nature of AB. Plus, whatever fail-safes we create, a motivated person will find workarounds for them.

For now, the best approach here would be to have our API return a message saying that the data quality for this particular MBID is unknown or inconclusive.

[4] Nidhi and Km Archana Patel. An Efficient and Scalable Density-based Clustering Algorithm for Normalize Data. 2nd International Conference on Intelligent Computing, Communication & Convergence

Hi @r0han22, thanks for taking the time to update your proposal.
You said

But I don’t see this change reflected in your original post yet. Are you going to add it here too, or just in the final proposal that you submit?

Hey @alastairp, I just finished making the changes here.

A question I did not address in my previous post:

While I couldn’t find much research on this, according to McClelland’s answer [5] to a similar question, the minimum sample size to conduct cluster analysis is 2^m, where m is the number of dimensions.
For this case, I think we can also return a message stating that the data was unknown/inconclusive.

[5] McClelland, Robert. (2015). Re: What is the minimum sample size to conduct a cluster analysis?. Retrieved from:https://www.researchgate.net/post/What-is-the-minimum-sample-size-to-conduct-a-cluster-analysis/55ff1a97614325339b8b45d6/citation/download.

Great, thanks for taking the time to update the proposal! Don’t forget to submit the final version.