Note: This idea is an extension of the idea here and discussion on IRC here.
ListenBrainz has recently undergone a big overhaul and is now getting ready to stream its data to Google BigQuery for analysis and statistics generation. Right now, the user does not get any information from her data other than a flat list of listens. My proposal involves adding statistics and visualizations to ListenBrainz that would be informative and helpful to the user.
Currently data from ListenBrainz comes from the user and takes the following path to get to Google BigQuery:
This project would involve addition of code that makes queries to BigQuery periodically to generate stats, stores these stats in PostgreSQL and shows them to the user. The flow of data would be something like this:
What stats should ListenBrainz calculate?
I have divided statistics into three categories. Each of these stats will be accompanied with visualizations, so I will refer to them as graphs from here on out. Also all these statistics will be calculated and stored for two time-frames: last month (30 days) and all-time.
User graphs: graphs which tell the user about her own listening history.
Entity graphs: graphs which show details about entities like artists, releases and recordings.
Miscellaneous graphs: graphs which don’t fit into the above categories, like a sitewide listen-count graph.
All graphs would be keyed by MessyBrainz IDs (MSIDs) for now. Later, when MSIDs start getting matched to MBIDs, different MSIDs that point to the same entity will be merged into one.
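A minimal sketch of how that merge could work once a mapping exists. The function name and the `msid_to_mbid` mapping are assumptions made for illustration; the proposal does not define how the matching data will be exposed.

```python
from collections import Counter


def merge_counts_by_mbid(msid_counts, msid_to_mbid):
    """Fold listen counts keyed by MSID into counts keyed by MBID.

    msid_counts:  dict of MSID -> listen count
    msid_to_mbid: hypothetical dict of MSID -> MBID for matched MSIDs
    MSIDs with no known MBID keep their own key, so no listens are lost.
    """
    merged = Counter()
    for msid, count in msid_counts.items():
        key = msid_to_mbid.get(msid, msid)
        merged[key] += count
    return dict(merged)
```

Keeping unmatched MSIDs under their own key means the graphs stay usable while matching is only partially complete.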
I intend to either use d3.js or a charting library based off d3.js (like plotly.js) to create the visualizations, depending on feedback from the community and our needs.
Note: Please ignore design aspects such as fonts and colors for now; the demo graphs exist only to get across my ideas of what information the graphs would depict.
Right now, ListenBrainz does not have any entity views at all. New views inspired by CritiqueBrainz (artist page, release group page) for artists and releases will be added, all uniquely identified by MessyBrainz ids. The url structure of the views would be similar to CritiqueBrainz:
This would involve creating two new blueprints. It will also involve adding functionality to MessyBrainz that allows us to query the MessyBrainz db with an MSID and get basic information, like the entity name, out of the db.
Then we add the following stats to these pages:
- Total listen count of each entity
- Top tracks (artists and release groups): This would be a horizontal bar chart with number of listens sitewide on the x-axis for the top X tracks of the particular entity.
- Top releases (artists): This would be a graph on the artist page, showing her top releases in a bar chart.
- Add a new view, maybe listenbrainz.org/stats, which will be the place to show sitewide graphs.
- Add cumulative number of listens submitted sitewide over time to the stats page. This graph would look like this: https://codepen.io/iliekcomputers/full/LxqvZJ/
- Adding a sharable visualization here would be a great idea. One graph I’d add in this category is a stream graph showing which artists have been listened to most on ListenBrainz over a past period (say, week by week). See this reddit post for an example of what I’m thinking of. Such a graph would also make for good end-of-the-year posts about LB, as it shows the community’s listening trends (who was popular when?). The graph would be something like this.
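As an illustration of the data behind the cumulative sitewide listen graph, here is a small sketch that turns per-day listen counts into a running total. The input shape (ISO date strings paired with counts) is an assumption; the real data would come out of BigQuery in whatever form the stats module returns.

```python
from itertools import accumulate


def cumulative_listens(daily_counts):
    """Turn per-day listen counts into a running total for plotting.

    daily_counts: list of (iso_date_string, listens_that_day), any order.
    Returns a list of (iso_date_string, running_total) in date order.
    ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    """
    ordered = sorted(daily_counts)
    totals = accumulate(count for _, count in ordered)
    return [(day, total) for (day, _), total in zip(ordered, totals)]
```

The resulting pairs can be fed directly to a line chart with dates on the x-axis.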
Some pics of what the above graphs would look like (taken from the codepen links):
When and whose stats should ListenBrainz calculate?
Traffic to BigQuery is a limited resource and minimizing its usage while keeping our users happy is a big challenge. We should try to only calculate statistics that will be consumed by our users. In this vein, stats will be calculated in the following manner:
- We generate new stats every week (or every month), depending on various factors such as Google’s generosity, the money that we can afford to spend etc.
- We only generate new stats for users who have logged in at least once in the last month. We can modify the user table in PostgreSQL to add a column that keeps track of when a user logged in last. If a user comes after a month, we show them their old stats with a message saying something along the lines of “Hi, long time no see, we don’t have new data for you, here’s the old one while we’re working on new stats for you”.
- For entities, we only generate stats for the top X listened entities in the last cycle. This is okay in my opinion because for entities that aren’t listened to much, there won’t be a major change in their graphs over our cycle. Of course, we provide an option for users to request new stats for some particular entity on the respective entity view, which we’ll again track in redis.
- Sitewide stats are generated every cycle.
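The selection rules above could be sketched roughly as follows. The function names, the 30-day constant, and the input dictionaries are all assumptions for illustration; in practice the last-login times would come from the user table and the listen counts from the previous cycle's stats.

```python
from collections import Counter
from datetime import datetime, timedelta

ACTIVE_WINDOW = timedelta(days=30)  # "logged in at least once in the last month"


def users_needing_stats(last_login_by_user, now):
    """Return users whose last login falls inside the active window."""
    cutoff = now - ACTIVE_WINDOW
    return sorted(
        user for user, last_login in last_login_by_user.items()
        if last_login >= cutoff
    )


def entities_needing_stats(listen_counts, top_x, requested=()):
    """Return the top X most-listened entities plus any explicitly requested ones."""
    top = [msid for msid, _ in Counter(listen_counts).most_common(top_x)]
    extra = [msid for msid in requested if msid not in top]
    return top + extra
```

For example, with `top_x=2` and a user request for a rarely played entity, that entity is appended after the two most-listened ones.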
Where and how should Stats be stored?
Since statistics generation is a batch operation performed after some interval of time, we need to save these stats somewhere in the meanwhile. This will be done in the PostgreSQL database.
The data that we’re trying to store is not data that will be queried much. All that we’ll do is request a list of a user’s or an entity’s stats that has already been calculated. As a result, there is not much of a need to keep this data in a normalized manner. We can make use of PostgreSQL’s JSON data types to store JSON blobs of stats calculated by BigQuery and then we’ll get these out of the db when we need them.
The JSON that we’ll have to store will basically provide us the entire information required to create graphs. I envision a dictionary with entity msids as keys and listen counts as values for the top entities graphs etc. This would be a simple solution.
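A rough sketch of that dictionary-as-JSON idea, using only the standard `json` module. The function names are hypothetical, and the storage step itself (writing the string into a PostgreSQL JSON column) is left out.

```python
import json


def dump_stats(listen_counts):
    """Serialize a dict of MSID -> listen count to a JSON string for storage."""
    return json.dumps(listen_counts, sort_keys=True)


def load_chart_rows(blob, limit=10):
    """Read a stored blob back and return (msid, count) rows for a bar chart.

    Rows are sorted by count, highest first; ties are broken by MSID so
    the order is stable between page loads.
    """
    counts = json.loads(blob)
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return rows[:limit]
```

Because the blob already holds everything the chart needs, rendering a graph is a single read followed by a sort.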
If we generate graphs that are not rendered client side but are images, then we can also use JSON blobs with keys that include a title, a description of the graph and a link to the image we wish to show.
In order to keep track of when stats were generated for a user or for a particular entity last, we’ll add columns that store these values to the postgres tables for these entities.
Listen counts of users, artists etc. are fast enough to calculate in InfluxDB and can be cached in Redis, as done in this pull request by me. Similarly, there is no need to make a table for storing the sitewide listen count, as it can also be stored in Redis (see this pull request).
Implementation and code organization
Create a new module (called stats) which uses the Python bindings to Google’s API to send queries to Google BigQuery. This module will contain the following stuff initially, and then later functions can be added to it as more stats start being shown.
Of course, tests and documentation will be written for the above code also.
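To give a feel for what one function in such a module might do, here is a sketch that only builds the SQL text for a user's top artists. The table and column names (`user_name`, `artist_msid`, `listened_at`) are assumptions, not the actual ListenBrainz BigQuery schema, and sending the query through Google's client library is deliberately left out.

```python
def build_top_artists_query(table, limit=10, days=None):
    """Return BigQuery standard SQL for one user's most-listened artists.

    All table and column names here are hypothetical. The user is passed
    as the named query parameter @user_name rather than pasted into the
    SQL. days=None means all-time; days=30 means the last month.
    """
    conditions = ["user_name = @user_name"]
    if days is not None:
        conditions.append(
            "listened_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), "
            "INTERVAL {:d} DAY)".format(int(days))
        )
    return (
        "SELECT artist_msid, COUNT(*) AS listen_count\n"
        "FROM `{table}`\n"
        "WHERE {where}\n"
        "GROUP BY artist_msid\n"
        "ORDER BY listen_count DESC\n"
        "LIMIT {limit:d}"
    ).format(table=table, where=" AND ".join(conditions), limit=int(limit))
```

Keeping query construction separate from query execution would also make this part of the module easy to unit test without BigQuery credentials.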
Now write scripts that do the following:
- calculate which entities and which users need their stats generated this cycle.
- use the above module to generate stats.
- invalidate old stats and update the postgres db with new stats.
Then, configure these scripts to be run at regular intervals using apscheduler.
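One run of such a script might look roughly like the sketch below, which is the function a scheduler would call each cycle. The names are hypothetical, a plain dict stands in for the PostgreSQL table, and `generate` stands in for whatever the stats module ends up exposing.

```python
from datetime import datetime, timezone


def run_stats_cycle(targets, generate, store, now=None):
    """Regenerate stats for each target and overwrite the stored copy.

    targets:  ids (users or entity MSIDs) chosen for this cycle
    generate: callable taking an id and returning a stats dict
    store:    dict-like stand-in for the stats table
    Returns the list of ids that were updated.
    """
    now = now or datetime.now(timezone.utc)
    updated = []
    for target in targets:
        # Overwriting the entry both invalidates the old stats and
        # records when the new ones were generated.
        store[target] = {
            "stats": generate(target),
            "last_updated": now.isoformat(),
        }
        updated.append(target)
    return updated
```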
After this, the workflow to add new stats would be simple:
- Make modifications in the stats module for the stat you wish to add.
- Configure the scripts which generate stats to also generate your stat.
Stats API endpoints: API endpoints which allow applications to get the stats that we have calculated in JSON format would be pretty cool to have. However, this will require detailed discussion on what exactly these endpoints would be and what data they would provide. For now, I was thinking of endpoints which return a user’s top artists from our database in JSON format. For example, a user’s top artists could be returned in the following format:
{
  "payload": [
    {"artist_name": "Kanye West", "artist_msid": "msid here"}
  ]
}
Stats keyed by time (instead of listen count): This is a pretty popular request in the community here. The basic idea is that instead of showing graphs based on listen counts of a particular artist or release, we show graphs on the time the user has listened to the entity. However, I think this would require either adding track_length to our API JSON format or using musicbrainzngs to request this data from MusicBrainz for every listen (which is a bit too inefficient, in my opinion).
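If listens did carry a track_length field, the aggregation itself would be simple, as this sketch suggests. The listen dictionaries, the `artist_msid` key, and the assumption that lengths are in seconds are all illustrative, since the field does not exist in the API yet.

```python
from collections import defaultdict


def listening_time_by_artist(listens):
    """Sum track lengths per artist instead of counting listens.

    listens: iterable of dicts with an "artist_msid" key and an optional
    "track_length" key (assumed here to be in seconds).
    Listens without a length are skipped rather than guessed at.
    """
    totals = defaultdict(int)
    for listen in listens:
        length = listen.get("track_length")
        if length is None:
            continue
        totals[listen["artist_msid"]] += length
    return dict(totals)
```

Skipping listens with no length keeps the totals honest, at the cost of under-reporting for clients that never send the field.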
More graphs: The graphs I’ve listed in the proposal are pretty standard and only begin to scratch the surface of what can be done with LB data in BigQuery. Graphs like the stream graph referenced here would be cool to have also as they show the user what was popular with her at different points of time. After the basic architecture has been laid, addition of new graphs would be relatively easy.
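For a stream graph, the underlying data is just per-artist listen counts bucketed by week. A possible sketch, assuming listens arrive as (Unix timestamp, artist name) pairs and that ISO calendar weeks are an acceptable bucketing:

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def weekly_artist_counts(listens):
    """Group listens into ISO weeks and count listens per artist in each.

    listens: iterable of (unix_timestamp, artist_name) pairs.
    Returns a dict of (iso_year, iso_week) -> {artist_name: listen_count},
    with weeks in chronological order.
    """
    weeks = defaultdict(Counter)
    for timestamp, artist in listens:
        iso = datetime.fromtimestamp(timestamp, tz=timezone.utc).isocalendar()
        weeks[(iso[0], iso[1])][artist] += 1
    return {week: dict(counts) for week, counts in sorted(weeks.items())}
```

Each week's dictionary becomes one vertical slice of the stream graph, with one layer per artist.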
Search for entities: With entity pages being added, allowing a user to search for artists or releases by name is something that should be added also.
A broad timeline of the work to be done is as follows:
Community Bonding (May 5 - May 29)
Spend this time trying to formalize what exactly I need to code and discuss design decisions with mentor to make sure that no bad decisions are made early in the process. A deliverable here would be an exact spec of what code needs to be written.
Phase 1 (May 30 - June 25)
I aim to get the user stats done in this phase. This will involve getting the schema for entities added to LB and getting a basic version of the stats module and the scripts done. The stats will also be shown on the user page in this phase.
Phase 2 (June 26 - July 30)
This phase would involve addition of entity and sitewide stats to LB. This should be relatively easy to do, given that the basic foundation for the stats module and the scripts will have been done already. This will also involve creating views for entities that display the stats.
Phase 3 (August 1 - August 29)
This phase would involve clean up of the code written in the earlier phases, bug fixes, testing and making sure that the stuff added over the summer is usable. Also, start work on the optional ideas if there is time.
After Summer of Code
Continue working on ListenBrainz as I have been since January. Maybe start with matching listens to mbids automatically somehow, if it hasn’t been done as a part of SoC 2017.
Here is a more detailed week-by-week timeline of the 13-week GSoC coding period to keep me on track:
Week 1: Begin with setting up environment, bigquery credentials etc.
Week 2: Start with user stats, create new tables for postgres, begin work on the stats module
Week 3: Finish work on the user part of the stats module, write tests, begin the script that uses the stats module to get stats generated
Week 4: UI stuff, draw graphs from the stats generated so far. PHASE 1 evaluation here
Week 5: Taking mentor evaluation into account, fix stuff in code written so far and continue coding.
Week 6: Write the artist part of the stats module and start with the script that generates artist stats.
Week 7: Finish the script and begin work on the artist view, adding visualizations to the view.
Week 8: CUSHION WEEK: If behind on stuff, then catch up. If not, then continue with plan, with an intent to get some optional ideas started in the future. PHASE 2 evaluation here
Week 9: Work on the release stats and release stat generation script.
Week 10: Finish the work from week 9 and get to adding the release view with graphs.
Week 11: Start with sitewide stats. This is a relatively smaller task and should be done in a week.
Week 12: CUSHION WEEK: If behind on stuff, then catch up. If not, then work on optional ideas.
Week 13: Pencils down week. Work on final submission and make sure that everything is okay.
Detailed Information about yourself.
I am a junior CS undergrad at the National Institute of Technology, Hamirpur. I came across ListenBrainz when looking for alternatives to Last.fm last November and I’ve been helping out in development work since this January. Here is a list of pull requests I’ve worked on over time, most notable of which are the alpha_importer pull request, this pull request which adds listen counts to the user page of LB and this pull request which adds integration tests for the ListenBrainz API. I have a semi-active blog here but I intend to use it regularly over the Summer of Code period to blog about my progress with the project.
Question: Tell us about the computer(s) you have available for working on your SoC project!
Answer: I have an HP laptop with an Intel i5 processor and 4 GB RAM, running Xubuntu 16.04.2.
Question: When did you first start programming?
Answer: I have been programming since 10th grade, mostly small Java applications. I picked up Python in my senior year of high school, working on small text based games etc.
Question: What type of music do you listen to?
Answer: I listen mostly to hip hop, along with some indie rock etc. Favorites of mine (with links to MusicBrainz) include Radiohead, Kanye West, A Tribe Called Quest, Kendrick Lamar, Broken Bells and LCD Soundsystem. Here is a link to my last.fm: https://www.last.fm/user/singhparam
Question: What aspects of ListenBrainz interest you the most?
Answer: I have been using Last.fm to track my listening history since 2012 and the visualizations that they used to show in the playground were really interesting to see. The thing that interested me in ListenBrainz is that the data is open and this could allow for many more interesting usages of this data.
Question: Have you ever used MusicBrainz to tag your files?
Answer: Yes, I have been using Picard to tag my files ever since my music collection became a little too large to handle without well organized metadata.
Question: Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?
Answer: I have made contributions to SymPy, working on stuff in their parser. Here’s a link to my pull requests.
Question: What sorts of programming projects have you done on your own time?
Answer: I have worked on a Gameboy emulator in C++ and SDL, and an online judge for competitive programming that runs people’s code in a Docker container and issues judgements accordingly. I have also worked on a Checkers bot and a game environment for the bot in Pygame.
Question: How much time do you have available, and how would you plan to use it?
Answer: I have holidays during most of the coding period and can work full time (45-50 hrs per week) on the project.
Question: Do you plan to have a job or study during the summer in conjunction with Summer of Code?
Answer: No, not if I am selected for GSoC.