GSoC 2017: Adding Statistics and Graphs to ListenBrainz

iliekcomputers · March 12, 2017, 5:55am

Note: This idea is an extension of the idea here and discussion on IRC here.

Personal Information

Name : Param Singh
IRC nick: iliekcomputers
E-mail: paramsingh258@gmail.com
Github: https://github.com/paramsingh
Blog: https://paramsingh.github.io
Time Zone: UTC+0530

Project Details

ListenBrainz has recently undergone a big overhaul and is now getting ready to stream its data to Google BigQuery for analysis and statistics generation. Right now, the user does not get any information from her data other than a flat list of listens. My proposal involves adding statistics and visualizations to ListenBrainz that would be informative and helpful to the user.

Currently data from ListenBrainz comes from the user and takes the following path to get to Google BigQuery:

This project would involve addition of code that makes queries to BigQuery periodically to generate stats, stores these stats in PostgreSQL and shows them to the user. The flow of data would be something like this:

What stats should ListenBrainz calculate?

I have divided statistics into three categories. Each of these stats will be accompanied by visualizations, so I will refer to them as graphs from here on out. Also, all these statistics will be calculated and stored for two time-frames: last month (30 days) and all-time.

User graphs: graphs which tell the user about her own listening history.
Entity graphs: graphs which show details about entities like artists, releases and recordings.
Miscellaneous graphs: graphs which don’t fit into the above categories, like a sitewide listen-count graph.

All graphs would be keyed by MessyBrainz ids right now, and when later MessyBrainz ids start getting matched to MBIDs, different MSIDs that point to the same entity will be merged into one.
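A minimal sketch of what that later merge could look like, assuming we have a mapping from MSIDs to MBIDs available as a plain dictionary. The function name and the shape of both dictionaries are hypothetical and only meant to illustrate the idea:

```python
from collections import Counter


def merge_counts_by_mbid(msid_counts, msid_to_mbid):
    """Fold listen counts keyed by MSID into counts keyed by MBID.

    msid_counts: dict of MSID -> listen count (hypothetical shape).
    msid_to_mbid: dict of MSID -> MBID for the MSIDs that have been matched.
    MSIDs without a known MBID keep their own key.
    """
    merged = Counter()
    for msid, count in msid_counts.items():
        # Fall back to the MSID itself when no match exists yet.
        key = msid_to_mbid.get(msid, msid)
        merged[key] += count
    return dict(merged)
```

With this, two MSIDs that are spelling variants of the same artist would end up as a single bar in the graph once they are matched.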

I intend to either use d3.js or a charting library based off d3.js (like plotly.js) to create the visualizations, depending on feedback from the community and our needs.

User graphs

Three graphs will be shown on the profile page.
Top X artists by listen count.
Top X releases by listen count.
Top X recordings by listen count.
Link to demo graphs: https://codepen.io/iliekcomputers/full/rjPYEw

Note: Please ignore design aspects such as fonts and colors for now; the demo graphs exist only to get across what information each graph would depict.

Entity graphs

Right now, ListenBrainz does not have any entity views at all. New views inspired by CritiqueBrainz (artist page, release group page) for artists and releases will be added, all uniquely identified by MessyBrainz ids. The url structure of the views would be similar to CritiqueBrainz:

Artist: listenbrainz.org/artist/artist_msid
Release: listenbrainz.org/release/release_msid

This would involve creating two new blueprints. It will also involve adding functionality to messybrainz that allows us to query the messybrainz db with a msid and get basic information like the entity name out of the db.

Then we add the following stats to these pages:

Total listen count of each entity
Top tracks (artists and release groups): This would be a horizontal bar chart with number of listens sitewide on the x-axis for the top X tracks of the particular entity.
Top releases (artists): This would be a graph on the artist page, showing her top releases in a bar chart.

Miscellaneous graphs

Add a new view, maybe listenbrainz.org/stats which will be the place to show sitewide graphs.
Add cumulative number of listens submitted sitewide over time to the stats page. This graph would look like this: https://codepen.io/iliekcomputers/full/LxqvZJ/
Adding a sharable visualization here would be a great idea. So a graph that I’d add in this category is a stream graph showing which artists have been listened to most on ListenBrainz over a past period of time (say week by week). See this reddit post for an example of what I’m thinking of. Such a graph would make for good end-of-the-year posts about LB also as it shows us the community’s listening trends (who was popular when?) etc. The graph would be something like this.
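For the cumulative listen-count graph, the data preparation could be as simple as a running total over per-day counts. This is only a sketch: the function name and the `(date, count)` input shape are assumptions, not something that exists in ListenBrainz today.

```python
from itertools import accumulate


def cumulative_listen_counts(daily_counts):
    """Turn per-day listen counts into running totals.

    daily_counts: list of (date_string, count) tuples, already sorted by date
    (hypothetical input shape).
    Returns a list of (date_string, running_total) tuples.
    """
    dates = [date for date, _ in daily_counts]
    totals = accumulate(count for _, count in daily_counts)
    return list(zip(dates, totals))
```

The resulting list of points can be handed directly to whichever charting library we end up using.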

Some pics of what the above graphs would look like (taken from the codepen links):

When and whose stats should ListenBrainz calculate?

Traffic to BigQuery is a limited resource and minimizing its usage while keeping our users happy is a big challenge. We should try to only calculate statistics that will be consumed by our users. In this vein, stats will be calculated in the following manner:

We generate new stats every week (or every month), depending on various factors such as Google’s generosity, the money that we can afford to spend etc.
We only generate new stats for users who have logged in at least once in the last month. We can modify the user table in PostgreSQL to add a column that keeps track of when a user logged in last. If a user comes after a month, we show them their old stats with a message saying something along the lines of “Hi, long time no see, we don’t have new data for you, here’s the old one while we’re working on new stats for you”.
For entities, we only generate stats for the top X listened entities in the last cycle. This is okay in my opinion because for entities that aren’t listened to much, there won’t be a major change in their graphs over our cycle. Of course, we provide an option for users to request new stats for some particular entity on the respective entity view, which we’ll track in redis.
Sitewide stats are generated every cycle.
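The selection rules above could be sketched roughly as follows. The function names, the dictionary inputs and the 30-day default are illustrative assumptions; the real code would read last-login times from PostgreSQL and entity requests from redis.

```python
from datetime import datetime, timedelta


def users_needing_stats(last_logins, now, window_days=30):
    """Return names of users who logged in within the last `window_days` days.

    last_logins: dict of user name -> datetime of last login (hypothetical shape).
    """
    cutoff = now - timedelta(days=window_days)
    return sorted(name for name, last_login in last_logins.items()
                  if last_login >= cutoff)


def entities_needing_stats(listen_counts, top_x, requested=()):
    """Return the top X entities by listen count plus any explicitly requested ones.

    listen_counts: dict of entity MSID -> listen count in the last cycle.
    requested: MSIDs for which a user asked for fresh stats.
    """
    top = sorted(listen_counts, key=listen_counts.get, reverse=True)[:top_x]
    return set(top) | set(requested)
```

Keeping the two rules as separate functions would make it easy to tune the window or X later, depending on how much BigQuery traffic we can afford.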

Where and how should Stats be stored?

Since statistics generation is a batch operation performed after some interval of time, we need to save these stats somewhere in the meanwhile. This will be done in the PostgreSQL database.

The data that we’re trying to store is not data that will be queried much. All that we’ll do is request a list of a user’s or an entity’s stats that has already been calculated. As a result, there is not much of a need to keep this data in a normalized manner. We can make use of PostgreSQL’s JSON data types to store JSON blobs of stats calculated by BigQuery and then we’ll get these out of the db when we need them.

The JSON that we’ll have to store will basically provide us the entire information required to create graphs. I envision a dictionary with entity msids as keys and listen counts as values for the top entities graphs etc. This would be a simple solution.
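As a rough illustration of the blob for a "top entities" graph, assuming listen counts arrive as a dictionary of MSID to count (the function name is hypothetical):

```python
import json


def build_top_entities_blob(listen_counts, top_x):
    """Serialize the top X entities by listen count as a JSON object.

    listen_counts: dict of entity MSID -> listen count (hypothetical shape).
    Returns a JSON string mapping MSID -> listen count.
    """
    top = sorted(listen_counts.items(), key=lambda item: item[1], reverse=True)[:top_x]
    return json.dumps(dict(top))
```

One thing to keep in mind is that PostgreSQL's `jsonb` type does not preserve the order of object keys, so the code that draws the graph would need to sort by listen count again after reading the blob.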

If we generate graphs that are not rendered client side but are images, then we can also use JSON blobs with keys that include a title, a description of the graph and a link to the image we wish to show.

In order to keep track of when stats were generated for a user or for a particular entity last, we’ll add columns that store these values to the postgres tables for these entities.

Listen Counts of users and artists etc. are fast enough to calculate in influx and can be cached in redis as done in this pull request by me. Similarly, no need to make a table for storing sitewide listen count as it can also be stored in redis (see this pull request).

Implementation and code organization

Create a new module (called bigquery or stats) which uses the python bindings to Google’s API to send queries to Google BigQuery. This module will contain the following stuff initially, and then later functions can be added to it as more stats start being shown.

stats.users
    - get_top_artists(user_name)
    - get_top_releases(user_name)
    - get_top_recordings(user_name)
stats.artists
    - get_top_recordings(artist_msid)
    - get_top_releases(artist_msid)
stats.releases
    - get_top_recordings(release_msid)
stats.sitewide
    - get_listen_count()

Of course, tests and documentation will be written for the above code also.
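To give an idea of what one of these functions might look like, here is a sketch of `get_top_artists`. Everything about the BigQuery side is an assumption: the table name `listenbrainz.listen`, the column names, and the `run_query` callable, which stands in for whatever wrapper around Google's Python client we end up writing. It is passed in as an extra argument here only to keep the sketch self-contained and testable.

```python
# Standard SQL with a named parameter for the user; table and columns are assumed.
TOP_ARTISTS_QUERY = """
SELECT artist_msid, artist_name, COUNT(*) AS listen_count
FROM `{table}`
WHERE user_name = @user_name
GROUP BY artist_msid, artist_name
ORDER BY listen_count DESC
LIMIT {limit}
"""


def get_top_artists(user_name, run_query, limit=10, table="listenbrainz.listen"):
    """Return the user's top artists as a list of dicts.

    run_query: hypothetical callable taking (query_string, params_dict) and
    returning an iterable of dict-like rows.
    """
    # int() guards against anything but a number ending up in the LIMIT clause.
    query = TOP_ARTISTS_QUERY.format(table=table, limit=int(limit))
    rows = run_query(query, {"user_name": user_name})
    return [
        {
            "artist_msid": row["artist_msid"],
            "artist_name": row["artist_name"],
            "listen_count": row["listen_count"],
        }
        for row in rows
    ]
```

Passing the user name as a query parameter instead of formatting it into the string avoids quoting problems with unusual user names. The other functions in the module would follow the same pattern with different queries.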

Now write scripts that do the following:

calculate which entities and which users need their stats generated this cycle.
use the above module to generate stats.
invalidate old stats and update the postgres db with new stats.
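The core loop of such a script might look something like the sketch below. The `store` dictionary stands in for the PostgreSQL table and `calculate_stats` for a call into the stats module; both names, and the decision to keep old stats when a calculation fails, are assumptions made for illustration.

```python
from datetime import datetime


def run_stats_cycle(user_names, calculate_stats, store, now):
    """Recalculate stats for the given users and overwrite their old entries.

    calculate_stats: hypothetical callable taking a user name and returning
    a JSON-serializable dict of stats.
    store: dict standing in for the stats table, keyed by user name.
    Returns the list of users whose stats could not be generated.
    """
    failed = []
    for user_name in user_names:
        try:
            stats = calculate_stats(user_name)
        except Exception:
            # Keep the old stats for this user and move on to the next one.
            failed.append(user_name)
            continue
        # Overwriting the entry is what invalidates the old stats.
        store[user_name] = {"stats": stats, "last_updated": now}
    return failed
```

Returning the failed users lets the script log them or retry them in the next cycle without one bad query aborting the whole run.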

Then, configure these scripts to be run at regular intervals using apscheduler.

After this, the workflow to add new stats would be simple:

Make modifications in the stats module for the stat you wish to add.
Configure the scripts which generate stats to also generate your stat.

Optional Ideas

Stats API endpoints: API endpoints which allow applications to get the stats that we have calculated in json format would be pretty cool to have. However, this will require detailed discussion on what exactly these endpoints would be and what data they would provide. For now, I was thinking of endpoints which return user’s top artists from our database in json format. For example: A user’s top artists could be returned in the following format:
```
  {
   "user_id": "iliekcomputers",
   "artist_count": 10,
   "payload" : [
        {
            "artist_name": "Kanye West",
            "artist_msid": "msid here",
            "timeframe": "last_week",
            "listen_count": 80
        },
        ...
    ]
  }
```
Stats keyed by time (instead of listen count): This is a pretty popular request in the community here. The basic idea is that instead of showing graphs based on listen counts of a particular artist or release, we show graphs on the time the user has listened to the entity. However, I think this would require either adding track_length to our api json format or using musicbrainzngs to request this data from MusicBrainz for every listen (which is a bit too inefficient, in my opinion).
More graphs: The graphs I’ve listed in the proposal are pretty standard and only begin to scratch the surface of what can be done with LB data in BigQuery. Graphs like the stream graph referenced here would be cool to have also as they show the user what was popular with her at different points of time. After the basic architecture has been laid, addition of new graphs would be relatively easy.
Search for entities: With entity pages being added, allowing a user to search for artists or releases by name is something that should be added also.

Timeline

A broad timeline of the work to be done is as follows:

Community Bonding (May 5 - May 29)

Spend this time trying to formalize what exactly I need to code and discuss design decisions with mentor to make sure that no bad decisions are made early in the process. A deliverable here would be an exact spec of what code needs to be written.

Phase 1 (May 30 - June 25)

I aim to get the user stats done in this phase. This will involve getting the schema for entities added to LB and getting a basic version of the stats module and the scripts done. Also, the stats will be shown on the user page in this phase.

Phase 2 (June 26 - July 30)

This phase would involve addition of entity and sitewide stats to LB. This should be relatively easy to do, given that the basic foundation for the stats module and the scripts will have been done already. This will also involve creating views for entities that display the stats.

Phase 3 (August 1 - August 29)

This phase would involve clean up of the code written in the earlier phases, bug fixes, testing and making sure that the stuff added over the summer is usable. Also, start work on the optional ideas if there is time.

After Summer of Code

Continue working on ListenBrainz as I have been since January. Maybe start with matching listens to mbids automatically somehow, if it hasn’t been done as a part of SoC 2017.

Here is a more detailed week-by-week timeline of the 13 week GSoC coding period to keep me on track:

Week 1: Begin with setting up environment, bigquery credentials etc.
Week 2: Start with user stats, create new tables for postgres, begin work on the stats module
Week 3: Finish work on the user part of the stats module, write tests, begin the script that uses the stats module to get stats generated
Week 4: UI stuff, draw graphs from the stats generated so far. PHASE 1 evaluation here
Week 5: Taking mentor evaluation into account, fix stuff in code written so far and continue coding.
Week 6: Write the artist part of the stats module and start with the script that generates artist stats.
Week 7: Finish the script and begin work on the artist view, adding visualizations to the view.
Week 8: CUSHION WEEK: If behind on stuff, then catch up. If not, then continue with plan, with an intent to get some optional ideas started in the future. PHASE 2 evaluation here
Week 9: Work on the release stats and release stat generation script.
Week 10: Finish the work from week 9 and get to adding the release view with graphs.
Week 11: Start with sitewide stats. This is a relatively smaller task and should be done in a week.
Week 12: CUSHION WEEK: If behind on stuff, then catch up. If not, then work on optional ideas.
Week 13: Pencils down week. Work on final submission and make sure that everything is okay.

Detailed Information about yourself.

I am a junior CS undergrad at the National Institute of Technology, Hamirpur. I came across ListenBrainz when looking for alternatives to Last.fm last November and I’ve been helping out in development work since this January. Here is a list of pull requests I’ve worked on over time, most notable of which are the alpha_importer pull request, this pull request which adds listen counts to the user page of LB and this pull request which adds integration tests for the ListenBrainz API. I have a semi-active blog here but I intend to use it regularly over the Summer of Code period to blog about my progress with the project.

Question: Tell us about the computer(s) you have available for working on your SoC project!

Answer: I have an HP laptop with an Intel i5 processor and 4 GB RAM, running Xubuntu 16.04.2.

Question: When did you first start programming?

Answer: I have been programming since 10th grade, mostly small Java applications. I picked up Python in my senior year of high school, working on small text based games etc.

Question: What type of music do you listen to?

Answer: I listen mostly to hip hop, along with some indie rock etc. Favorites of mine (with links to MusicBrainz) include Radiohead, Kanye West, A Tribe Called Quest, Kendrick Lamar, Broken Bells and LCD Soundsystem. Here is a link to my last.fm: https://www.last.fm/user/singhparam

Question: What aspects of ListenBrainz interest you the most?

Answer: I have been using Last.fm to track my listening history since 2012 and the visualizations that they used to show in the playground were really interesting to see. The thing that interested me in ListenBrainz is that the data is open and this could allow for many more interesting usages of this data.

Question: Have you ever used MusicBrainz to tag your files?

Answer: Yes, I have been using Picard to tag my files ever since my music collection became a little too large to handle without well organized metadata.

Question: Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?

Answer: I have made contributions to Sympy, working on stuff in their parser. Here’s a link to my pull requests.

Question: What sorts of programming projects have you done on your own time?

Answer: I have worked on a Gameboy emulator in C++ and SDL, and an online judge for competitive programming that runs people’s code in a docker container and issues judgements accordingly. I have also worked on a Checkers bot and a game environment for the bot in Pygame.

Question: How much time do you have available, and how would you plan to use it?

Answer: I have holidays during most of the coding period and can work full time (45-50 hrs per week) on the project.

Question: Do you plan to have a job or study during the summer in conjunction with Summer of Code?

Answer: None, if selected for GSoC.

iliekcomputers · March 12, 2017, 5:58am

This is an initial draft of my application. Any reviews / feedback would be greatly appreciated!

Zas · March 12, 2017, 9:03am

Nice proposal

Stats API endpoints:
actually it would be a good idea to provide the stats through a json format usable by https://github.com/influxdata/telegraf/tree/master/plugins/inputs/httpjson
Storing values on regular basis in an influxdb database makes it very easy to represent data (we use grafana + influxdb + telegraf for nodes/services metrics already).

Stats keyed by time:
Of course, you could use influxdb as local storage and let people access it through the useful influx API.

iliekcomputers · March 12, 2017, 6:55pm

Thanks for the comments, @Zas.

Thank you!

I have been thinking about what JSON format the API endpoints would return. For the stats that I have proposed, it would be something like this. Suppose we make a request to get a user’s top artists; the JSON would be:

{
     "user_id": "iliekcomputers",
     "artist_count": 10,
     "payload" : [
          {
              "artist_name": "Kanye West",
              "artist_msid": "msid here",
              "timeframe": "last_week",
              "listen_count": 80
          },
          ...
      ]
 }

Some discussion around the format exactly would be needed before we start with the API endpoints, but I completely agree with making the json format usable by telegraf.
I am not completely sure how to design it so that it can be used by telegraf. Some help here would be appreciated. I will add this format to the draft while more discussion on it takes place.

I’m not sure how saving the data in influx would help with making stats show listen time instead of listen count. Could you please elaborate on this? It would be really helpful. The problem I came across when thinking of how to implement this is how to get listen times from listen counts. This data is not sent by the user and making requests to the MusicBrainz database would be expensive. Also, I think I may not have explained this idea well enough. I will edit the proposal to fix this.

Quesito · March 13, 2017, 3:44pm

Nice Proposal!

I’d like to see something like this work for more than just LB. Is this possible? I’d also like to make some Data Visualization Art (not graphs) for social media sharing. Is this something we can combine efforts on (in some way)?

iliekcomputers · March 16, 2017, 3:51pm

Thank you!

In my opinion, this might get a little too out of scope for me, as I only have a passing understanding of the other MeB projects. I am happy to give your suggestions a look, though.

This is something that I would definitely love to work on! I can add more visualizations to my proposal if you give me some examples of stuff that you want.

Quesito · March 16, 2017, 4:21pm

I wrote up a little proposal, in the hope of giving my idea some clarity, and quickly made some mock ups! Check it out and let’s keep discussing!

iliekcomputers · March 24, 2017, 7:06am

@Quesito,

Inspired by your post, I’ve added a more arty / sharable graph to the sitewide graphs category of the proposal. It is basically a stream graph showing which artists were listened to the most over the last month or so. This was recommended to me in the forums also, so I think it is something that people want. We could show this on the LB site and share it on our blog also. See this graph for an example.

Quesito · March 24, 2017, 11:05am

@iliekcomputers this is a really beautiful and shareable data visualization!!!

rob · March 27, 2017, 7:47am

Wow, what a proposal. Well done!

A couple of comments: Last login date is something that becomes important (since computing resources will be spent based on it) so maybe it should not live in Redis, but in Postgres.

Along the same lines, the stats may not need to be stored in a detailed postgres schema. In these contexts you should always ask yourself the question: Do we plan to query this data or do we just need to store it? Right now, it doesn’t seem like we would get a lot of use out of stats stored in a detailed set of database tables. Instead let’s just store the data as JSONB blobs in Postgres. This also removes a lot of work from a very ambitious proposal.

More thoughts later.

iliekcomputers · March 28, 2017, 4:00pm

Thank you!

Both of these make sense to me, I will edit the proposal soon after some thought. Last login can go into the user table already in LB.

Storing JSON blobs was something I didn’t consider seriously because denormalized data is something I was told to generally stay away from, but it definitely makes sense here as there is no reason for all the joins that we’ll have to do if we normalize the data. I will edit the proposal accordingly.