Are AcousticBrainz Submissions getting cleaned yet?

I just picked a recent submission from the AcousticBrainz front page:

https://acousticbrainz.org/dfff11b2-e667-4dd6-9488-0b4858c7d30e

This has nine submissions: the key for the first one is F Minor, and the key for the remaining eight is C Major. So it seems likely that the key for the first one is incorrect, and I vaguely remember @alastairp saying something about the key being incorrectly calculated. If this is the case, why don’t we remove these old records when a recording has more than one submission?

More generally, there is always going to be some variation in all the metrics, but where one value is totally out of line with the others, should it not be dropped? It’s not really viable for end users of this data to do this themselves, not least because too many API calls would have to be made to fetch each submission. The data would be more useful if AcousticBrainz was cleaned up a bit, or if AcousticBrainz offered a combined dataset per recording ID that represented the average value of each metric. (Exactly what is meant by average of course requires some thought, which is why it would be better if this was done by the AcousticBrainz team, who hopefully have a better understanding of the data than I do.)
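
To illustrate the kind of combined record I have in mind, here is a rough sketch in Python. It is only an assumption about how this could work, not anything AcousticBrainz actually does: the field names (`key`, `scale`, `bpm`) are simplified stand-ins for the real submission documents, the BPM values are made up, and "average" is taken to mean majority vote for the key and median for numeric values.

```python
from collections import Counter
from statistics import median


def combine_submissions(submissions):
    """Combine several submissions for one recording into one summary.

    Assumes each submission is a flat dict with hypothetical fields
    'key', 'scale' and 'bpm'; the real documents are more deeply nested.
    """
    # Majority vote on (key, scale): a single outlier gets outvoted.
    votes = Counter((s["key"], s["scale"]) for s in submissions)
    (key, scale), count = votes.most_common(1)[0]
    return {
        "key": key,
        "scale": scale,
        # Fraction of submissions agreeing with the chosen key.
        "key_agreement": count / len(submissions),
        # The median is less sensitive to one bad value than the mean.
        "bpm": median(s["bpm"] for s in submissions),
    }


# Mirrors the example above: one F minor submission, eight C major ones.
# The BPM values are invented for illustration.
example = [{"key": "F", "scale": "minor", "bpm": 120.3}] + [
    {"key": "C", "scale": "major", "bpm": 120.0} for _ in range(8)
]
combined = combine_submissions(example)
```

With something like this, the lone F Minor submission would simply be outvoted, and the agreement figure would tell users how much to trust the result.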

Hi,
You’re right that this isn’t being cleaned yet. However, we have two things that we’ve been working on that help in this direction:

  • Rashi’s GSoC project allows us to combine submissions that were made under different MBIDs but whose recordings have since been merged
  • This year I worked with a master’s student who built a system to perform content-based similarity of recordings. We found that we were able to discover clusters of recording submissions, including bad data. We want to integrate this work in the months remaining this year (he’s still working with us at MTG); see https://github.com/philtgun/acousticbrainz-server

@iliekcomputers has been doing some great work picking up AcousticBrainz development and is in Barcelona for the MeB summit this weekend. We’re having a planning session for AcousticBrainz, so keep an eye on this space.

Hi!
I have a related question: I use the beets plugin to submit AcousticBrainz data, but unfortunately it doesn’t record whether a recording has already been submitted. It just submits everything again, so I will end up submitting the same data several times. Do you store this kind of information? I can’t find it, but I have no experience with this. If someone submits the same data several times, it could seriously skew any kind of ‘mean’.
By the way, if any of you is the creator of the beets plugin, it could be worth fixing this, but that’s another topic and I have already reported it on the beets GitHub page.
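
To show what I mean about repeats skewing a mean, here is a small Python sketch of dropping exact duplicates before averaging. I don’t know whether the server does anything like this; the function name and the `bpm` field are my own invention, and it only catches submissions that are byte-for-byte identical once serialised, which may not hold for real resubmissions.

```python
import hashlib
import json


def drop_exact_duplicates(submissions):
    """Return submissions with exact repeats removed, keeping first-seen order.

    Hypothetical helper: two submissions count as duplicates only if their
    JSON serialisation (with sorted keys) is identical.
    """
    seen = set()
    unique = []
    for s in submissions:
        # Sorting keys makes the hash independent of dict ordering.
        blob = json.dumps(s, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique


# Made-up values: the same analysis submitted three times, plus one other.
subs = [{"bpm": 90.0}, {"bpm": 90.0}, {"bpm": 90.0}, {"bpm": 120.0}]
unique = drop_exact_duplicates(subs)
mean_before = sum(s["bpm"] for s in subs) / len(subs)
mean_after = sum(s["bpm"] for s in unique) / len(unique)
```

In this toy case the repeated submission drags the mean towards its own value until the duplicates are removed.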