AcousticBrainz BigTable storage

Right now AcousticBrainz stores summary information in PostgreSQL as jsonb data. In order to run algorithms on audio we need to store frame-level data as well. This poses several challenges, discussed below.

The idea is to still generate aggregated summary data with Essentia streaming_extractor_music but also generate raw frame-level data and submit them together. PostgreSQL would still store the JSON summary data, but the frame-level data would be stored in a faster and more space-efficient format (think protocol buffers). That data would then be saved to Google’s BigTable storage so it can be queried later for analysis. BigTable is a key/value storage engine capable of supporting a huge number of keys. From their docs: “These tables will grow at the rate of approximately 2 billion rows per day, which Cloud Bigtable can handle without difficulty.” Currently, the collected data covers somewhere around 3.5 million songs.

Here is some info from their docs page:

  • Each table has only one index, the row key. There are no secondary indices.
  • Rows are sorted lexicographically by row key, from the lowest to the highest byte string
  • In general, keep all information for an entity in a single row. An entity that doesn’t need atomic updates and reads can be split across multiple rows. Splitting across multiple rows is recommended if the entity data is large (hundreds of MB).

The lexicographically sorted keys and the ability to index a large number of rows make it ideal for storing audio data. A similar example is given for server metrics data:
We can use the guidelines for server metrics data to design the db schema for AcousticBrainz data.
Before we can accomplish this we need support for frame-level data in streaming_extractor_music. I’m under the impression that the tool needs more work before a new version can be released, so for this to work by the end of summer we need coordination between the Essentia and AcousticBrainz webservice developers. Designing the BigTable schema and setting it up doesn’t have to wait for the tool.

So the flow would go like this:

  1. Generate summary and frame-level JSON data with streaming_extractor_music
  2. Submit the data to the webservice
  3. Store summary data to postgres and frame-level data to the local storage in protobuf format and to BigTable
    (Not sure when data should actually be converted to protobuf, maybe with streaming_extractor_music, maybe after upload, needs to be brainstormed)
  4. Efficiently fetch audio data frames from BigTable
  5. Do machine-learning magic
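
To give a rough feel for why a binary format is more space-efficient than JSON for frame-level data, here is a minimal sketch. It uses Python's `struct` module as a stand-in for protocol buffers (it is not protobuf), and the frame shape (1000 frames of 13 values) and the function names `pack_frames` / `unpack_frames` are made up for illustration, not taken from streaming_extractor_music.

```python
import json
import struct


def pack_frames(frames):
    """Pack equal-length frames of floats into bytes.

    Layout (an assumption for this sketch): two little-endian uint32
    values (frame count, values per frame) followed by float32 values.
    """
    n_frames = len(frames)
    n_values = len(frames[0]) if frames else 0
    header = struct.pack("<II", n_frames, n_values)
    flat = [value for frame in frames for value in frame]
    return header + struct.pack(f"<{len(flat)}f", *flat)


def unpack_frames(blob):
    """Inverse of pack_frames: rebuild the list of frames from bytes."""
    n_frames, n_values = struct.unpack_from("<II", blob, 0)
    flat = struct.unpack_from(f"<{n_frames * n_values}f", blob, 8)
    return [list(flat[i * n_values:(i + 1) * n_values]) for i in range(n_frames)]


# Hypothetical frame-level data: 1000 frames with 13 values each.
frames = [[0.125 * i + 0.001 * j for j in range(13)] for i in range(1000)]

as_json = json.dumps(frames).encode("utf-8")
as_binary = pack_frames(frames)

print("JSON bytes:  ", len(as_json))
print("binary bytes:", len(as_binary))
```

Note that float32 drops precision compared to the doubles JSON carries, so whether that trade-off is acceptable would depend on the features being stored.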

Think of this as a working draft for a GSOC proposal in the making if everyone is on board with the idea.
Any input or help with expanding this idea further?

Storing data in one format vs another should not make a big difference, but what I would suggest is just storing the raw data in external files on disk, as you don’t need to and should not read them that often.
You should process them once and only need them again when you rebuild the search index or do your machine learning.

If you are suggesting that you need to read the whole 20 TB of data for a query, it is not going to work any time soon.
What you need to do is build a search index, such as an inverted index, that lets you query the data.

With a normal database you have a table, and to search for records containing a given word you would start at the top and look at each record to find what you need.
In a search engine you process the data ahead of time and build up a table of which words are used in which records.
You can then query this words table to find where a word is used without having to read the whole data set from top to bottom.
Building this index takes time and resources, and the index will take up several times more space than the original data.
You are spending the time up front, but doing this makes the rest run faster as it does not need to do the processing for every query.
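
A minimal sketch of the inverted index described above, assuming records are short whitespace-separated text strings keyed by an integer id. The record contents and the function names are invented for illustration.

```python
from collections import defaultdict


def build_inverted_index(records):
    """Map each word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            index[word].add(record_id)
    return index


def search(index, word):
    """Look up a word without scanning the records themselves."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical records: id -> free-text tags.
records = {
    1: "rock guitar live",
    2: "jazz piano",
    3: "live jazz trio",
}

index = build_inverted_index(records)
print(search(index, "jazz"))
print(search(index, "live"))
```

The scan over all records happens once, in `build_inverted_index`; each query afterwards is a single dictionary lookup.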

The work flow should be:

  1. generate the data at the client
  2. submit to the web service
  3. store raw data on disk
  4. process raw data and build search index

Building a search index manually doesn’t make sense, IMHO. It’s unlikely that users will ever do queries like “give me all the rows where column is a certain value” (as commented in the ticket). What’s more likely is for the users to request a sequential lookup based on the mbid.
So we’re talking about keys with a pattern like

The row keys are already indexed, sorted lexicographically, and split across something called tablets, so I imagine the lookups are similar in principle to hash lookups. Here I’m relying on Google to efficiently do what it advertises.
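
To illustrate how sorted row keys make a sequential lookup by mbid cheap, here is a small in-memory sketch using `bisect` on a sorted list. It does not use the BigTable client, and the key layout (`<mbid>#<zero-padded frame index>`) is purely an assumption for this example, not the pattern proposed in this thread.

```python
import bisect


def make_row_key(mbid, frame_index):
    """Hypothetical key layout: mbid, '#', zero-padded frame index.

    Zero padding keeps lexicographic order equal to numeric order.
    """
    return f"{mbid}#{frame_index:08d}"


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix using two binary searches.

    Assumes ASCII keys, so '\uffff' sorts after any character in them.
    """
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]


# Placeholder ids standing in for real MBIDs.
keys = sorted(
    make_row_key(mbid, i)
    for mbid in ["mbid-b", "mbid-a", "mbid-c"]
    for i in [2, 0, 1]
)

print(prefix_scan(keys, "mbid-b#"))
```

Because all frames of one recording sit next to each other in key order, fetching them is one contiguous range read rather than many scattered lookups.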

Thanks for the interesting discussion.

I may have put you on a slightly incorrect path with the investigation. It looks like Google has two products, one called BigTable and one called BigQuery, and I confused them. The one that we’re interested in using is BigQuery. It looks like it can support CSV-like and JSON data as input formats. The possibility of JSON fits in well with the type of structured data that we have. I’d also like to see what kind of queries it’s possible to run with this system.
Some simple examples of what we’d like to do include basic searches - getting all items by an artist, or from an album, or with a particular genre/tag/etc.
As we will be storing frame-level data here as well, I’m not sure if it’s useful for us to be able to make queries based on values in this data. My feeling is that the more useful types of queries for users will be on higher level features, which they can use to export the lowlevel data for other research.

I spoke with the Essentia maintainers and we think that we can probably have a new version of streaming_extractor_music ready around the end of summer, which would fit in well with this project if we went ahead with it. As the frame-level data is so large, my recommendation would be that we compress it on the client side before sending it, which means implementing protocol buffers (or whatever serialization format we decide on) in the client as well. I think that this could also be a part of this project.

I agree that this project doesn’t need the release of the extractor to be finished to work on it, but we should also look at adding the existing (non-frame-detail) data files to BigQuery.

We have been thinking about other changes to the server software to handle the new extractor. Our idea is that we need to introduce some kind of versioning system so that we can distinguish between data submitted from our current extractor and the new one.
For example, if a feature is compatible between the two versions of the extractor, a classifier which uses this feature can use both old and new submissions. If it is not, a user must be able to specifically ask only for features from a specific version.
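
A rough sketch of what that selection could look like, assuming each submission carries an `extractor_version` field and the server keeps a table of which versions are interchangeable. The field name, the version strings and `select_submissions` are all hypothetical, not part of the current server.

```python
def select_submissions(submissions, wanted_version, compatible=None):
    """Return the submissions usable for wanted_version.

    compatible maps a version to the set of other versions whose
    features are interchangeable with it. Without it, only exact
    version matches are returned.
    """
    allowed = {wanted_version}
    if compatible:
        allowed |= compatible.get(wanted_version, set())
    return [s for s in submissions if s["extractor_version"] in allowed]


# Hypothetical submissions from an old and a new extractor.
submissions = [
    {"mbid": "mbid-a", "extractor_version": "v1"},
    {"mbid": "mbid-b", "extractor_version": "v2"},
    {"mbid": "mbid-c", "extractor_version": "v2"},
]

# Strict: only the requested version.
strict = select_submissions(submissions, "v2")

# Relaxed: v1 features declared compatible with v2.
relaxed = select_submissions(submissions, "v2", compatible={"v2": {"v1"}})

print(len(strict), len(relaxed))
```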

If you’re interested in going ahead with this, you should write up another proposal in the SoC category and fill in the template. We can iterate on ideas there.