Thanks for the interesting discussion.
I may have put you on a slightly incorrect path with the investigation. It looks like Google has two products, one called BigTable and one called BigQuery, and I confused them. The one that we're interested in using is BigQuery. It looks like it can support CSV-like and JSON data as input formats. The possibility of JSON fits in well with the type of structured data that we have. I'd also like to see what kinds of queries it's possible to run with this system.
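To make the JSON option a bit more concrete, here is a rough sketch of turning our documents into newline-delimited JSON (one object per line), which is the JSON layout BigQuery accepts for loading. The record fields (`recording_id`, `artist`, `album`, `genres`) are made up for illustration and are not our real schema.

```python
import json

# Hypothetical records; the field names are illustrative, not the real schema.
records = [
    {"recording_id": "rec-1", "artist": "Artist A", "album": "Album X", "genres": ["rock"]},
    {"recording_id": "rec-2", "artist": "Artist B", "album": "Album Y", "genres": ["jazz", "live"]},
]


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON: one complete object per line."""
    return "".join(json.dumps(row, sort_keys=True) + "\n" for row in rows)


ndjson = to_ndjson(records)
```

Each line is a complete JSON object, so nested values such as the list of genres survive without flattening.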
Some simple examples of what we'd like to do include basic searches - getting all items by an artist, or from an album, or with a particular genre/tag/etc.
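As a sketch of the kind of searches I mean, the snippet below runs them against an in-memory SQLite database. SQLite is only a stand-in here so that the example runs anywhere; it is not BigQuery, and the table and column names are hypothetical.

```python
import sqlite3

# SQLite stands in for the real store; tables and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE recordings (id TEXT PRIMARY KEY, artist TEXT, album TEXT);
    CREATE TABLE tags (recording_id TEXT, tag TEXT);
""")
conn.executemany(
    "INSERT INTO recordings VALUES (?, ?, ?)",
    [("rec-1", "Artist A", "Album X"),
     ("rec-2", "Artist A", "Album Y"),
     ("rec-3", "Artist B", "Album Z")],
)
conn.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [("rec-1", "rock"), ("rec-2", "jazz"), ("rec-3", "rock")],
)


def by_artist(artist):
    """All recording ids credited to one artist."""
    rows = conn.execute(
        "SELECT id FROM recordings WHERE artist = ? ORDER BY id", (artist,))
    return [row[0] for row in rows]


def by_album(album):
    """All recording ids that appear on one album."""
    rows = conn.execute(
        "SELECT id FROM recordings WHERE album = ? ORDER BY id", (album,))
    return [row[0] for row in rows]


def by_tag(tag):
    """All recording ids carrying a given genre/tag."""
    rows = conn.execute(
        "SELECT r.id FROM recordings r JOIN tags t ON t.recording_id = r.id "
        "WHERE t.tag = ? ORDER BY r.id", (tag,))
    return [row[0] for row in rows]
```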
As we will be storing frame-level data here as well, I'm not sure if it's useful for us to be able to make queries based on values in this data. My feeling is that the more useful types of queries for users will be on higher-level features, which they can then use to export the low-level data for other research.
I spoke with the Essentia maintainers and we think that we can probably have a new version of streaming_extractor_music ready around the end of summer, which would fit in well with this project if we went ahead with it. As the frame-level data is so large, my recommendation would be that we compress it on the client side before sending it, which means implementing protocol buffers (or whatever serialization format we decide on) in the client as well. I think that this could also be a part of this project.
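To illustrate the client-side step, here is a minimal sketch that serializes frame-level data and compresses it before upload. It uses JSON plus gzip from the Python standard library purely as a stand-in, since we haven't chosen a serialization format yet; it is not protocol buffers, and the `mfcc` payload is invented for the example.

```python
import gzip
import json


def pack_frames(frames):
    """Serialize frame-level data compactly, then gzip it for transfer."""
    raw = json.dumps(frames, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def unpack_frames(blob):
    """Reverse of pack_frames, as the server side would do on receipt."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Invented payload: 1000 frames with two values each.
frames = {"mfcc": [[round(0.01 * i, 2), round(0.02 * i, 2)] for i in range(1000)]}
blob = pack_frames(frames)
```

Whatever format we settle on, the shape stays the same: the client packs, the server unpacks, and the round trip must be lossless.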
I agree that work on this project doesn't need to wait for the release of the extractor. In the meantime, we should also look at adding the existing (non-frame-detail) data files to BigQuery.
We have been thinking about other changes to the server software to handle the new extractor. Our idea is that we need to introduce some kind of versioning system so that we can distinguish between data submitted from our current extractor and the new one.
For example, if a feature is compatible between the two versions of the extractor, a classifier which uses this feature can use both old and new submitted data. If it is not, a user must be able to ask specifically for features from a single version.
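A rough sketch of how that selection could work is below. Everything in it is hypothetical: the version labels (`v1`, `v2`), the feature names, and the idea of keeping a per-feature table of interchangeable versions are just one way we might model it.

```python
# Hypothetical table: for each feature, groups of extractor versions whose
# values are interchangeable. The names and versions are made up.
COMPATIBLE = {
    "bpm": [{"v1", "v2"}],      # same meaning in both extractors
    "mfcc": [{"v1"}, {"v2"}],   # computed differently, must not be mixed
}


def usable_versions(feature, requested_version):
    """Versions whose data may be combined with the requested one."""
    for group in COMPATIBLE.get(feature, []):
        if requested_version in group:
            return set(group)
    # Unknown feature or version: be conservative and allow only an exact match.
    return {requested_version}


def select(submissions, feature, version):
    """Submissions that contain the feature and come from a usable version."""
    allowed = usable_versions(feature, version)
    return [s for s in submissions
            if s["version"] in allowed and feature in s["features"]]


submissions = [
    {"id": 1, "version": "v1", "features": {"bpm": 120.0, "mfcc": [1.0, 2.0]}},
    {"id": 2, "version": "v2", "features": {"bpm": 98.5, "mfcc": [0.5, 0.7]}},
]
```

With this data, asking for `bpm` at `v2` returns both submissions, while asking for `mfcc` at `v2` returns only the second one.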
If you're interested in going ahead with this, you should write up another proposal in the SoC category and fill in the template. We can iterate on ideas there.