And, it’s still written in Python 2!
It still works. The open issues do not seem to prevent its use, and the CLI client code is pretty straightforward (the core functionality is just a wrapper around the extractor plus a sender).
What puzzles me is why it isn’t parallel and why they’ve used a SQLite database. Extracting the features directly to JSON files, filling in the extractor’s missing SHA info and sending the JSON seems to work just fine (sorry if it crashed something on the server).
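Filling in the missing SHA can be a tiny post-processing step on the extracted JSON. A minimal sketch, assuming the feature files keep the extractor build SHA under `metadata.version.essentia_build_sha` (that key path is my assumption about the layout, not a documented schema):

```python
def fill_missing_sha(features, extractor_sha):
    """Fill in the extractor build SHA when the extractor left it empty.

    Assumption: the SHA lives at metadata -> version -> essentia_build_sha;
    adjust the key path if the real feature files differ.
    """
    version = features.setdefault("metadata", {}).setdefault("version", {})
    if not version.get("essentia_build_sha"):
        version["essentia_build_sha"] = extractor_sha
    return features
```

Existing non-empty values are left alone, so re-running the step over already-fixed files is harmless.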
What AB really needs is some form of manual review of the data, and maybe a BrainzCaptcha to help with machine-learning training (not sure if something like that doesn’t already exist).
Made a few changes here:

- Added a switch for offline processing (pull request 34).
- Added another switch to reprocess failed feature extractions (issue 47).
- Added parallel processing using N-1 threads (issue 37).
- Switched to argparse to provide a help interface (issue 36).
- Added a check for whether the server already has the recording’s features, preventing resubmission (issue 5).
- Removed SQLite and the issues associated with it (13 and 14).
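The N-1 parallel processing change can be as small as a worker pool sized to the core count minus one. A sketch, where `extract_features` is a placeholder for the real per-file work, not the actual client code:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def extract_features(path):
    # Placeholder for invoking the streaming extractor on one file.
    return path, "ok"

def process_all(paths):
    # N-1 workers, leaving one core free for the rest of the system;
    # fall back to 1 worker on single-core machines.
    workers = max(1, (os.cpu_count() or 2) - 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(extract_features, paths))
```

Threads are enough here because the heavy lifting happens in the external extractor process, so the workers mostly wait on I/O.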
Not thoroughly tested, nor did I check whether the metadata respects the server format.
I’ve also completely removed a profile setting used by the extractor without looking into what it does beforehand.
I guess it also needs an API rate limiter.
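A client-side rate limiter can be a simple sliding window over recent request timestamps. A sketch with made-up limits (the real MetaBrainz rate limits may differ); the clock and sleep hooks are injectable only so the logic is testable:

```python
import time

class RateLimiter:
    """Allow at most `max_calls` requests per `period` seconds,
    sleeping before the next call when the window is full."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = []  # timestamps of recent requests

    def wait(self):
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls = [t for t in self.calls if now - t < self.period]
        self.calls.append(now)
```

Call `wait()` immediately before each submission; it returns without sleeping while the window has room.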
I have a working UI, but it is really ugly. It reuses basically everything from the CLI (even the command-line argument parsing). You can’t drag and drop or anything fancy yet; just open the settings menu and add folders/files to process with a folder/file picker.
Using mutagen to check for duplicates before processing anything makes a ton of sense to speed things up (matching track MBID + extractor SHA). Done.
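The duplicate check boils down to keying on the (recording MBID, extractor SHA) pair. A sketch of just that logic; reading the tags (e.g. via mutagen) and fetching the already-submitted set from the server are left out, and the `musicbrainz_trackid` tag name follows the Vorbis-comment convention, which varies per container:

```python
def dedup_key(tags, extractor_sha):
    # `tags` stands in for a mutagen tag mapping, which returns a list
    # of values per tag; "musicbrainz_trackid" historically holds the
    # recording MBID in Vorbis comments.
    values = tags.get("musicbrainz_trackid") or []
    if not values:
        return None
    return (values[0].lower(), extractor_sha)

def files_to_process(files_with_tags, extractor_sha, submitted):
    # Skip files whose (MBID, extractor SHA) pair was already submitted;
    # files without an MBID tag fall through to normal processing.
    todo = []
    for path, tags in files_with_tags:
        key = dedup_key(tags, extractor_sha)
        if key is None or key not in submitted:
            todo.append(path)
    return todo
```

This saves both the local extraction cost and a pointless round trip that the server would reject anyway.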
Update: now with a progress bar + time-to-complete estimate.
Update 2: and with an API rate limit.
Update 3: and PyInstaller specs, an AB icon… I’d say it is feature complete (but still lacking testing around the profile setting).
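The time-to-complete estimate behind the progress bar can be as naive as extrapolating the average per-file time so far. A sketch (not the actual implementation):

```python
def eta_seconds(done, total, elapsed):
    """Naive ETA for a progress display: assume the remaining files
    take as long per file as the ones processed so far.
    Returns None until at least one file has finished."""
    if done == 0:
        return None
    return (total - done) * (elapsed / done)
```

For example, 10 of 40 files done in 50 seconds gives an estimate of 150 seconds remaining.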
Thanks for the questions. As @gabrielcarvfer pointed out, the biggest reason is that for the most part it works. I agree that there are a number of open tickets and unmerged pull requests that are waiting on our review. I’m unsure how many of these issues are actually preventing people from submitting, especially since, according to our statistics, we still see steady submissions to AcousticBrainz. It’s possible that improving these tools further will increase our submissions.
One reason for not picking up these tickets and PRs is that we had planned on releasing a newer version of the AcousticBrainz extractor, and would have made updates to the client at that time. However, due to other tasks getting in the way and slower development on the extractor tool, this task has been pushed back again and again. This is something we need to communicate better, and hopefully in the coming year we can look at making a more concrete development plan that includes these items.
I agree that Python 3 support is a key feature that we should add, and is something that we should do before the new version of the tools. I’ll take a look at this.
The main reason why it’s not parallel is that we made a quick and easy tool to process and submit the content, and there was no need at the time to add this functionality. We needed a good tradeoff between getting data submitted and being able to easily develop and maintain the tool. At the time, adding multithreading was too much effort for what we would get out of it. Having said that, this is a good idea, and we should prioritise it for the next version of the client.
What are your concerns about the SQLite database? This exists purely to verify that files are not submitted multiple times if the tool is run over the same directory multiple times. We wanted to have a quick check to make sure that people didn’t waste time computing stuff twice (because if they tried to submit it we would have rejected it).
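The bookkeeping described here is small: remember which files were already processed so a re-run over the same directory skips them. A sketch of that kind of check using `sqlite3` (the table and column names are my invention, not the client’s actual schema):

```python
import sqlite3

def open_db(path=":memory:"):
    # One row per already-processed file; PRIMARY KEY makes re-inserts cheap.
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS processed (filepath TEXT PRIMARY KEY)")
    return db

def mark_processed(db, filepath):
    db.execute("INSERT OR IGNORE INTO processed VALUES (?)", (filepath,))
    db.commit()

def already_processed(db, filepath):
    cur = db.execute("SELECT 1 FROM processed WHERE filepath = ?", (filepath,))
    return cur.fetchone() is not None
```

Keying on the file path alone is the simplest variant; keying on (MBID, extractor SHA) instead would also survive files being moved around.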
This is a much larger question and could be split out into another thread. We’re interested in automatic review of data: we think that, with some tools we have been working on, we can determine which submissions are bad when a recording has multiple submissions.
Do you have other ideas about review? What data specifically do you mean? For the high-level data, we can definitely do a lot to improve the models that we have. Do you know about our dataset creation tools? We had some ideas about tools for giving feedback (e.g. saying that a prediction is incorrect), but haven’t proceeded on this. If you’re interested then I’d definitely be happy to talk more about it!
Thanks for making these changes and opening the pull request! There are a lot of changes here, so it might take me a while to get through all of them. I’ll send you more comments on the PR.
Not a problem in itself, but also not really necessary. I’ve replaced it by keeping the feature files named after the input file (with ’_.json’ appended), separated into different folders for the different processing states. Not as easy as moving a single file around, but concurrent access is way easier to deal with, and the JSON file is kept for inspection after some kinds of errors.
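The folder-per-state scheme above can be sketched with `pathlib`: the feature JSON keeps the input file’s name with `_.json` appended and is moved between state directories. The directory names here (`pending`, `done`, etc.) are placeholders, not the actual ones:

```python
from pathlib import Path

def feature_path(state_dir: Path, audio_file: Path) -> Path:
    # e.g. pending/song.flac_.json for input song.flac
    return state_dir / (audio_file.name + "_.json")

def move_state(audio_file: Path, src: Path, dst: Path) -> Path:
    # Advance a file's feature JSON from one processing state to another.
    dst.mkdir(parents=True, exist_ok=True)
    old = feature_path(src, audio_file)
    new = feature_path(dst, audio_file)
    old.rename(new)
    return new
```

A crash mid-run leaves each JSON in the folder for its last completed state, which is what makes resuming and inspecting failures straightforward.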
I’ve included mutagen to check whether the MBID and extractor version already have a match on the server (it could probably be more economical with the requests, but it seemed fine for now). It should also save processing time on both ends.
Not really; I was just thinking that every large company seems to use some form of captcha to gather reviews, and it would be nice to have something like that. Yes, I was thinking more about the high-level data.
There is a ticket reporting that AB identified two male rappers as a female singer with a probability (or whatever obscure metric the ML uses) of 86%. Any human with a 5-second snippet could say that it was completely wrong, and a captcha could be used to identify problematic samples such as this one and put them into the training dataset.
It hits the same issue of hosting licensed songs, so no idea how to work around that. Maybe a Spotify/YouTube player, or maybe some benevolent record label could help.