BigQuery Data exploration

Personal information

Nickname: Cetko
IRC nick: cetko
Email: goran.cetusic@gmail.com
GitHub: gcetusic (Goran Cetušić)
Twitter: https://twitter.com/gorancetusic

Proposal

AcousticBrainz currently stores summary information in PostgreSQL as jsonb data. To run algorithms on the audio, we also need to store frame-level data. This poses several challenges, listed here: [AB-101] frame-level ll data is too big to store in postgres - MetaBrainz JIRA

The idea is to keep generating aggregated summary data with Essentia’s streaming_extractor_music, maintaining backwards compatibility on the server through versioning, and to additionally generate raw frame-level data, compress it as CSV and submit all the generated data to the server. The data would then be saved to Google’s BigQuery storage so it can be queried for later analysis. Possible use cases include getting all items by an artist, from an album, or with a particular genre, tag, etc. Currently, the collected data covers somewhere around 3.5 million songs.
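As a rough illustration of the "compress as CSV" step, here is a minimal sketch using only the Python standard library. The function names and the feature names (`loudness`, `spectral_centroid`) are hypothetical placeholders, not the actual output format of streaming_extractor_music, which is still to be defined.

```python
import csv
import gzip
import io


def frames_to_gzipped_csv(frames, columns):
    """Serialize a list of per-frame feature dicts to CSV and gzip the result.

    `frames` and `columns` are assumptions for this sketch; the real
    frame-level layout depends on the extractor.
    """
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=["frame"] + list(columns))
    writer.writeheader()
    for index, frame in enumerate(frames):
        row = {"frame": index}
        row.update({name: frame[name] for name in columns})
        writer.writerow(row)
    return gzip.compress(text.getvalue().encode("utf-8"))


def gzipped_csv_to_frames(payload):
    """Inverse of frames_to_gzipped_csv; all values come back as strings."""
    text = gzip.decompress(payload).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    example = [
        {"loudness": 0.5, "spectral_centroid": 1200.0},
        {"loudness": 0.75, "spectral_centroid": 1350.5},
    ]
    blob = frames_to_gzipped_csv(example, ["loudness", "spectral_centroid"])
    print(len(blob), "bytes compressed")
    print(gzipped_csv_to_frames(blob)[1])
```

The server side could decompress the payload the same way before loading it into storage. Frame-level data is highly repetitive in structure, so gzip should help noticeably, but the actual ratio would need to be measured on real submissions.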

Before we can accomplish this we need support for frame-level data in streaming_extractor_music. The tool needs more work before a new version can be released, so for this to work by the end of summer we need coordination between the Essentia and AcousticBrainz webservice developers. Designing the BigQuery schema and setting it up doesn’t have to wait for the tool.

Workflow:

  1. Generate summary and frame-level JSON data with
  2. Submit the data to the webservice
  3. Store summary and frame-level data to BigQuery
  4. Efficiently fetch audio data frames from BigQuery
  5. Do machine-learning magic

A BigQuery table is a standard, two-dimensional table with individual records organized in rows, and a data type assigned to each column (also called a field). Individual fields within a record may contain nested and repeated children fields. Its hybrid (row vs column) storage structure makes it possible to easily store frame-level data yet enables efficient querying over specific columns.
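To make the nested and repeated fields concrete, here is a minimal sketch of what a schema for frame-level data might look like, written in the JSON layout BigQuery uses for table schemas (`name`, `type`, `mode`, `fields`). All field names here (`mbid`, `artist`, `frames`, `frame_index`, `loudness`, `mfcc`) are assumptions for illustration; the real schema is to be designed during the project. The small `check_row` helper is also hypothetical and only checks the shape of a row locally, without talking to BigQuery.

```python
# Hypothetical schema: one row per recording, with a repeated nested
# record holding one entry per audio frame.
SCHEMA = [
    {"name": "mbid", "type": "STRING", "mode": "REQUIRED"},
    {"name": "artist", "type": "STRING", "mode": "NULLABLE"},
    {
        "name": "frames",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {"name": "frame_index", "type": "INTEGER", "mode": "REQUIRED"},
            {"name": "loudness", "type": "FLOAT", "mode": "NULLABLE"},
            {"name": "mfcc", "type": "FLOAT", "mode": "REPEATED"},
        ],
    },
]


def check_row(row, schema):
    """Return a list of structural problems found in `row` for `schema`.

    Only checks required fields and that repeated fields are lists;
    it does not validate value types.
    """
    problems = []
    for field in schema:
        name = field["name"]
        mode = field.get("mode", "NULLABLE")
        value = row.get(name)
        if value is None:
            if mode == "REQUIRED":
                problems.append("missing required field: " + name)
            continue
        if mode == "REPEATED" and not isinstance(value, list):
            problems.append("repeated field is not a list: " + name)
            continue
        if field["type"] == "RECORD":
            children = value if mode == "REPEATED" else [value]
            for child in children:
                problems.extend(check_row(child, field["fields"]))
    return problems


if __name__ == "__main__":
    row = {
        "mbid": "e8374874-4178-4869-b92e-fef6bf30dc04",
        "frames": [{"frame_index": 0, "loudness": 0.5, "mfcc": [1.0, 2.0]}],
    }
    print(check_row(row, SCHEMA))
```

With a layout like this, a query that only needs `loudness` would read just that column rather than every frame-level feature, which is the benefit of the columnar storage described above.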

Timeline

  • Community Bonding period
    Getting familiar with the codebase
    Defining BigQuery schema
  • Week 1 (23rd of May - 29th of May)
    BigQuery setup
  • Week 2 (30th of May - 5th of June)
    Test data insertion to BigQuery
    Running representative queries
  • Week 3 (6th of June - 12th of June)
    BigQuery data schema improvements
  • Week 4 (13th of June - 19th of June)
    AcousticBrainz server webservice → BigQuery API
  • Week 5 (20th of June - 26th of June)
    Test data insertion to BigQuery via AcousticBrainz
    Stress testing of AcousticBrainz/BigQuery
  • Week 6 (27th of June - 3rd of July)
    Midterm evaluation
    Resolving any design problems
    Maintenance and codebase improvements
  • Weeks 7 and 8 (4th of July - 17th of July)
    AcousticBrainz server webservice → BigQuery automatic storage
    Polishing AcousticBrainz API
  • Week 9 (18th of July - 24th of July)
    Website UI for most commonly used data extractions
  • Week 10 (25th of July - 31st of July)
    streaming_extractor_music testing
    (Assumes the next version of streaming_extractor_music is finished)
  • Week 11 (1st of August - 7th of August)
    Community streaming_extractor_music frame-level data resubmissions
    (Assumes the next version of streaming_extractor_music is finished)
  • Week 12 (8th of August - 14th of August)
    Final syncing of GSoC project
    Fixing any leftover issues
  • Endterm Evaluation

Detailed information about applicant

I’m an experienced software engineer taking time to attend my last year as a master’s student of Telecommunications and Informatics at the University of Zagreb.

Tell us about the computer(s) you have available for working on your SoC project!
I’ve had my trusty ThinkPad Edge E325 laptop since 2012. It’s still serving me well and I use it to stay mobile, but I do heavy-load development on my desktop, which has 16GB RAM and an i7 processor.

When did you first start programming?
My programming activities go back to playing with the turtle in the LOGO programming language in grade school, but I started doing more serious development during my first programming course in C back in 2006. Now I’m a Python software engineer with 5+ years of experience, mostly in web development.

What type of music do you listen to? (Please list a series of MBIDs as examples.)
I’m sort of a mix between punk, hard rock and classical music.
Examples would be Dropkick Murphys (e8374874-4178-4869-b92e-fef6bf30dc04), AC/DC (66c662b6-6e2f-4930-8610-912e24c63ed1) and Chopin (09ff1fe8-d61c-4b98-bb82-18487c74d7b7).

What aspects of the project you’re applying for (e.g., MusicBrainz, AcousticBrainz, etc.) interest you the most?
I’m keen on working with AcousticBrainz because I believe it poses some interesting challenges for these

Have you ever used MusicBrainz to tag your files?
Yes, Croatia has a lot of obscure artists. I’ve mostly used MusicBrainz to tag local punk bands :)

Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?
I’ve provided my GitHub account, where my contributions to open source projects are visible.
Besides smaller contributions to a couple of projects, like fixes, typos and issues, I’d like to draw your attention to a few projects which I contributed to extensively:

  • IMUNES - Linux/FreeBSD network emulator
  • GNS3 - Docker support
  • Stormpath Python SDK - Python binding for the Stormpath user identity API

What sorts of programming projects have you done on your own time?
I’ve mostly done web development with Flask and Django, and lately I’ve started to tinker with bioinformatics projects. I’m now determined to go through Rosalind and solve all the assignments.

How much time do you have available, and how would you plan to use it?
The general plan is to work 8 hours a week on the project with the possibility of working weekdays to finish the project as soon as possible so I can react in time if any problems occur before the end of the summer.

Do you plan to have a job or study during the summer in conjunction with Summer of Code?
This depends on how well I do in my exams by the end of May. I only have one exam this year and my Master’s thesis is already finished, so I intend to graduate in July. I do have a week of well-deserved vacation planned in July after the GSoC midterm as a reward for graduating.

Hi!

I quite like how complete (and pretty!) your application is! Overall I think you’re on the right track, though the first three weeks of the project may end up being a bit different while still taking up about three weeks.

I’m guessing the setup of BigQuery will be pretty fast, but the insertion of test data might take a bit longer. In the end the schedule may work out exactly as you planned, but a bit shifted around. Not a big deal really.

I think you should submit your application; it is close enough to complete that you should submit it now. Alastair, the chief behind this project, is busy this week, so he can look at this in detail next week. You can still edit your proposal after the submission deadline, so as long as you get your proposal in before the deadline you should be fine.

Thank you for the feedback! I’ll submit the application like this, for now.

Quoting sttaylor in gsoc (IRC) last night:

Remember a Final PDF must be submitted before the March 25th 19:00 UTC deadline to be considered for GSoC 2016
also if you submit your final pdf now you can upload a new one up until the deadline on March 25th
but I strongly encourage students to submit at least 6 hours before the final deadline - we do not extend the deadline under any circumstances
every year students miss the deadline because their wifi goes out or their computer dies and they have to wait until the next year to try again

So consider this a heads up and a reminder ;)