GSoC 2020: Adding Statistics and Graphs for ListenBrainz Users and Community

Personal Information

Project Overview

ListenBrainz now has a statistics infrastructure that collects and computes statistics from the listen data that has been stored in the database. Right now, the only information a user gets about their listening trends is a list of recent listens and top artists. This project aims to change this by displaying insightful graphs and statistics that would be more helpful to the user.

Graphs and Statistics which can be shown

We can classify the graphs and statistics to be shown into two categories:

User Statistics

These graphs tell the user about their listening history and habits.

  • Listening Activity: The number of listens submitted to ListenBrainz in the last week/month/year
  • Top Artists: The top artists that the user has listened to
  • Top Releases: The top releases that the user has listened to
  • Top Recordings: The top recordings that the user has listened to
  • Daily Activity: This graph shows the user’s activity over the course of a day
  • Top Genres: This graph shows the top 5 genres that the user listens to
  • Artist Origins: A map showing the locations of the artists that the user listens to
  • Mood Analysis: Information such as Danceability, BPM and the general Tone of the songs that the user listens to

Sitewide Statistics

These graphs tell about the sitewide trending artists, releases, and recordings in the ListenBrainz community. This data can also be used to calculate the popularity of the entities.

UI Mockups

This project will add three new views to serve the statistics that are being generated.

User statistics

UI Prototype

This view contains all the graphs and statistics that have been described in the User Statistics section.

User listen history

UI Prototype

This view shows a paginated list of the artists/recordings/releases that the user has listened to in a given time period.

Sitewide statistics

UI Prototype

This view shows the top 10 artists/recordings/releases that all ListenBrainz users are listening to. Moreover, the Listen Count shown on the homepage will be replaced by a graph showing the cumulative listens submitted to ListenBrainz over the last month.

Note: The mockups for the UI may change as per further discussions.

Implementation

Front End

ListenBrainz uses ReactJS for implementing UI components. I intend to use nivo, a React-based charting library built on d3.js, to render the various visualizations. I chose nivo as the charting library because it:

  • Has thorough and in-depth documentation
  • Has a lot of customization options
  • Has TypeScript definitions, which will help in development considering that ListenBrainz is going to use TypeScript for ReactJS code in future
  • Supports responsive components, which is essential in making the website mobile friendly

The code used to build graphs for the mock UI can be found here.

Back End

Listen Activity

The listen activity shows the number of listens that a user has submitted over a period of time. It is a good measure of how active a user is and on which days they are most active.

Generating the data required is fairly easy. The pseudocode for generating the data for weekly listen activity is given below.

Pseudocode:

def get_listen_activity(week, user_name):
  # Maps each day of the week to the number of listens submitted on that day
  message = {}
  df = get_listens(from_date=week.begin, to_date=week.end)
  df.createOrReplaceTempView("listens")
  for day in week:
    result = run_query("""SELECT *
                          FROM listens
                          WHERE user_name='{user_name}'
                          AND   timestamp>={utc(day.start)}
                          AND   timestamp<={utc(day.end)}
                       """)
    # count() runs on the Spark DataFrame, so the rows need not be collected first
    message[day] = result.count()
  return message
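
The same per-day bucketing can be illustrated without Spark. The following is only a minimal sketch in plain Python, assuming the listens of a single user are already available as a list of Unix timestamps; the function name `count_listens_per_day` and the sample data are made up for this example and are not part of the ListenBrainz code.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def count_listens_per_day(timestamps, week_start, days=7):
    """Count listens per UTC calendar day, starting at week_start.

    timestamps: iterable of Unix timestamps (seconds) for a single user.
    week_start: timezone-aware datetime marking 00:00 UTC of the first day.
    """
    counts = Counter()
    for ts in timestamps:
        listened_at = datetime.fromtimestamp(ts, tz=timezone.utc)
        offset = (listened_at - week_start).days
        if 0 <= offset < days:
            counts[offset] += 1
    # Report every day in the range, including days with zero listens
    return {
        (week_start + timedelta(days=i)).date().isoformat(): counts[i]
        for i in range(days)
    }


# Hypothetical sample data: two listens on the first day, one on the third
week_start = datetime(2020, 3, 2, tzinfo=timezone.utc)
sample = [
    int(datetime(2020, 3, 2, 9, 0, tzinfo=timezone.utc).timestamp()),
    int(datetime(2020, 3, 2, 21, 30, tzinfo=timezone.utc).timestamp()),
    int(datetime(2020, 3, 4, 12, 0, tzinfo=timezone.utc).timestamp()),
]
activity = count_listens_per_day(sample, week_start)
```

Days without any listens are kept with a count of zero so that the graph always has one bar per day.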

Top Artist/Recording/Release

The top artist/recording/release section shows the top 10 artists/recordings/releases that a user has listened to in the given period of time.
Generating the data required for this is fairly easy. We first generate a DataFrame for the specified time period, then register the DataFrame as a table and run the following SQL query on it.

SQL Query:

SELECT  artist_name
      , artist_msid
      , artist_mbid
      , COUNT(artist_name) as cnt
FROM    {table_name}
WHERE   user_name='{user_name}'
GROUP   BY  artist_name
          , artist_msid
          , artist_mbid
ORDER   BY  cnt DESC
LIMIT   10

Similar queries can be made to get Top Recordings/Releases.
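
To try a query of this shape outside Spark, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory table. The table layout, the placeholder MSID/MBID values and the helper `top_artists` are assumptions made for this example only. Note that the result has to be sorted in descending order of `cnt` for the most listened artists to come first.

```python
import sqlite3

TOP_ARTISTS_QUERY = """
    SELECT artist_name, artist_msid, artist_mbid, COUNT(*) AS cnt
      FROM listens
     WHERE user_name = ?
  GROUP BY artist_name, artist_msid, artist_mbid
  ORDER BY cnt DESC
     LIMIT 10
"""


def top_artists(conn, user_name):
    """Return (artist_name, listen_count) pairs, most listened first."""
    rows = conn.execute(TOP_ARTISTS_QUERY, (user_name,)).fetchall()
    return [(name, cnt) for name, _msid, _mbid, cnt in rows]


# Hypothetical in-memory data standing in for the listens table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listens (user_name TEXT, artist_name TEXT, "
    "artist_msid TEXT, artist_mbid TEXT)"
)
conn.executemany(
    "INSERT INTO listens VALUES (?, ?, ?, ?)",
    [
        ("alice", "Coldplay", "msid-1", "mbid-1"),
        ("alice", "Coldplay", "msid-1", "mbid-1"),
        ("alice", "Lenka", "msid-2", "mbid-2"),
        ("bob", "Maroon 5", "msid-3", "mbid-3"),
    ],
)
result = top_artists(conn, "alice")
```

The user name is passed as a bound parameter rather than formatted into the query string, which avoids quoting problems with unusual user names.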

Daily Activity

The daily activity graph shows at what times of day a user is most active. It reveals interesting information about a person’s work habits and daily routine.

Calculating the data for the graph is moderately easy. The pseudocode for calculating this data is given below. As this data doesn’t change much, it can be calculated once a week.

Pseudocode:

def get_daily_activity(week, user_name):
  df = get_listens(from_date=week.begin, to_date=week.end)
  df.createOrReplaceTempView("listens")
  # One bucket per hour of the day (0-23)
  data = [0] * 24
  for day in week:
    for i, hour in enumerate(day):
      result = run_query("""SELECT *
                            FROM listens
                            WHERE user_name='{user_name}'
                            AND   timestamp>={utc(hour.start)}
                            AND   timestamp<={utc(hour.end)}
                         """)
      # count() runs on the Spark DataFrame, so the rows need not be collected first
      data[i] += result.count()
  # Average over the 7 days of the week
  data = [cnt / 7 for cnt in data]

  return data
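
The hourly averaging can also be shown as a small plain-Python sketch. It assumes the listens of one user for the week are available as Unix timestamps and buckets them by UTC hour; the function name `average_hourly_activity` and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def average_hourly_activity(timestamps, days=7):
    """Average number of listens per hour of day (0-23, UTC) over `days` days."""
    totals = [0] * 24
    for ts in timestamps:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        totals[hour] += 1
    return [total / days for total in totals]


# Hypothetical week: one listen at 09:00 and two listens in the 21:00 hour, every day
base = datetime(2020, 3, 2, tzinfo=timezone.utc).timestamp()
DAY, HOUR = 86400, 3600
sample = [base + d * DAY + 9 * HOUR for d in range(7)]
sample += [base + d * DAY + 21 * HOUR + m * 60 for d in range(7) for m in (0, 30)]
hourly = average_hourly_activity(sample)
```

A single pass over the listens fills all 24 buckets, so this variant does not need one query per hour.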

History

This section shows a paginated list of all the different entities that a user has listened to in a given time period.

A bar graph can be used to show this data. The SQL query to calculate this is similar to the Top Artist graph.

Sitewide Statistics

The sitewide statistics page shows the top artists/recordings/releases that the users are listening to in a week. These graphs reveal the current trending artists/releases/recordings.

A stream graph will be used to show which entity was most listened to in the past week. The SQL query to calculate the data required is similar to the Top Artist graph.

Artist Origin

The Artist Origin map plots the origins of all the artists that a user has listened to. It shows how diverse a user’s listening habits are.

This map is a bit more difficult to implement because we have to query the MusicBrainz database to get each artist’s origin. This data will be calculated weekly or biweekly, depending on how fast the process is. A local cache that maps artists to their origins can be created, which will make subsequent lookups of a particular artist’s origin faster. The overall flow of this process is shown in the figure.
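
Separately from the figure, the caching idea can be sketched roughly as follows. This is only an illustration: the class `ArtistOriginCache`, the `lookup` callable standing in for a MusicBrainz query, and the placeholder MBIDs and origin codes are all assumptions, not an existing ListenBrainz or MusicBrainz API.

```python
class ArtistOriginCache:
    """In-memory cache mapping an artist MBID to its origin."""

    def __init__(self, lookup):
        # `lookup` is a hypothetical callable that queries MusicBrainz
        self._lookup = lookup
        self._origins = {}

    def get(self, artist_mbid):
        # Only hit the (slow) lookup the first time an artist is seen
        if artist_mbid not in self._origins:
            self._origins[artist_mbid] = self._lookup(artist_mbid)
        return self._origins[artist_mbid]


def count_origins(artist_mbids, cache):
    """Count how many of the given artist MBIDs map to each origin."""
    counts = {}
    for mbid in artist_mbids:
        origin = cache.get(mbid)
        if origin is not None:
            counts[origin] = counts.get(origin, 0) + 1
    return counts


# Fake lookup with placeholder data, recording how often it is called
calls = []
FAKE_ORIGINS = {"mbid-coldplay": "GB", "mbid-lenka": "AU"}


def fake_lookup(mbid):
    calls.append(mbid)
    return FAKE_ORIGINS.get(mbid)


cache = ArtistOriginCache(fake_lookup)
origins = count_origins(["mbid-coldplay", "mbid-lenka", "mbid-coldplay"], cache)
```

In this example the lookup runs only twice for three entries, because the repeated artist is served from the cache.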

Top Genres

The Top Genres chart shows the top 5 genres that the user listened to.

A pie chart can be used to show this data. The data required for this chart also has to be obtained from the MusicBrainz database, so it will be generated incrementally once per week. The overall flow of the process is similar to the one shown in Artist Origin.

Mood Analysis

AcousticBrainz provides a lot of useful information such as Danceability, BPM and the general Tone of a recording. This can be used to provide insightful information about users’ listening habits.

As the raw data provided by AB is hard to relate to, this data will be shown relative to other users. For example: the Danceability of a user’s songs is 20% higher than average. As AB only supports MBID lookups for now, this project will calculate these statistics only for listens that have a valid recording_mbid. Supporting all listens will be a stretch goal. These statistics will be calculated once a week. The statistics that can be shown are:

  • BPM
  • Danceability
  • Happiness
  • Acousticness
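
As a worked example of the "relative to other users" idea, the sketch below computes the signed percentage difference between a user's value and the average. The function names and the sample values (0.60 and 0.50) are made up for illustration; how the sitewide average itself would be computed is not covered here.

```python
def percent_vs_average(user_value, average_value):
    """Signed percentage difference of user_value relative to average_value."""
    if average_value == 0:
        raise ValueError("average_value must be non-zero")
    return (user_value - average_value) / average_value * 100


def describe(stat_name, user_value, average_value):
    """Turn the difference into a short human-readable message."""
    diff = percent_vs_average(user_value, average_value)
    direction = "more" if diff >= 0 else "less"
    return f"{stat_name} is {abs(diff):.0f}% {direction} than average"


# Hypothetical values: the user's mean Danceability vs. the sitewide mean
message = describe("Danceability", 0.60, 0.50)
```

With these sample values the message reads "Danceability is 20% more than average".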

Storing data in ListenBrainz Server

As the data required will be calculated in batches, it has to be stored in ListenBrainz Server so that it can be served to the user when required. ListenBrainz already has a table which stores this data. Additional columns will be added to the Schema for the new data that will be generated.

Storing data for Top Genres, Artist Origin and Mood Analysis

The data for Artist Origin, Top Genres and Mood Analysis will be calculated incrementally. That is, the data will be calculated for a week and then merged with the previous data. Hence we have to store this data in HDFS. A new table with the following schema will have to be created for this.

Column          Type                    Nullable
user_name       string                  not nullable
genre           map(string, integer)    nullable
artist_origin   map(string, integer)    nullable
mood            map(string, integer)    nullable
last_updated    long integer            not nullable
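
For count-like columns such as genre and artist_origin, the weekly merge step could look roughly like the sketch below. It assumes each map is a plain dictionary of counts; the function name `merge_weekly_counts` and the sample numbers are hypothetical, and values that are averages rather than counts would need a different merge rule.

```python
from collections import Counter


def merge_weekly_counts(previous, current_week):
    """Add this week's counts to the previously stored counts."""
    merged = Counter(previous)
    merged.update(current_week)  # Counter.update adds counts key by key
    return dict(merged)


# Hypothetical stored data and newly calculated weekly data
stored_genres = {"rock": 40, "pop": 25}
this_week = {"pop": 10, "indie": 5}
merged = merge_weekly_counts(stored_genres, this_week)

# The top 5 genres for the chart can then be read off the merged counts
top_genres = Counter(merged).most_common(5)
```

Only the new week has to be processed each time, which is the point of storing the merged result.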

Timeline

Community Bonding Period

I will use this time to discuss implementation details with mentors and finalize the mock UI. I will start configuring the ListenBrainz server to use TypeScript and the Spark server to start generating statistics.

Phase 1

I will implement the basic, easy-to-implement graphs in this period. The graphs completed in this phase will be Listen Activity, Daily Activity, Top Entity and User History.

Phase 2

I will complete the Sitewide Statistics. I will also work on integrating MB and LB and work on the Artist Origin map.

Phase 3

I will complete the Top Genres graph. I will also work on integrating AB and LB and work on Mood Analysis. If time permits, I will work on the additional ideas mentioned below.

Here is a more detailed week-by-week timeline of the 13-week GSoC coding period to keep me on track:

Week 1-2

Implement the frontend and backend for Listen Activity and Daily Activity.

Week 3

Implement the frontend and backend for Top Artists/Recordings/Releases.

Week 4 (Phase 1 evaluations here)

Implement the user history page.

Week 5

Buffer Period. Catch up on things if behind. If not, continue working. Improve the code written before based upon feedback from mentors in evaluation.

Week 6

Implement sitewide statistics.

Week 7

Work on integrating LB and MB. This will involve writing an API library to communicate with the MB database or setting up the MB database on Playground itself.

Week 8 (Phase 2 evaluations here)

Implement Artist Origin map.

Week 9

Buffer Period. Catch up on things if behind. If not, continue working. Improve the code written before based upon feedback from mentors in evaluation.

Week 10

Implement Top Genres graph.

Week 11

Work on integrating AB and LB. This will involve writing an API library to communicate with the AB database.

Week 12

Implement Mood Analysis.

Week 13 (Phase 3 evaluations here)

Buffer Period. Catch up on things if behind. If not, work on additional ideas.

Post GSoC / Additional Ideas

I would like to continue working with ListenBrainz after Summer of Code. This project aims at setting up basic architecture for generating statistics with Apache Spark. The addition of more statistics will be relatively easy.

Mood Analysis for listens not having recording_mbid

As mentioned in the proposal, the project aims to implement Mood Analysis for listens having a recording_mbid. Support for all listens can be added later.

Entity Graphs

These graphs will show details about entities such as artists, recordings and releases, for example when a user started listening to a given entity.

Mainstream Meter

This measures how mainstream a user’s musical taste is. This can be done by taking into account the popularity of an entity and the number of listens for that entity.
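
The proposal does not fix a formula for this. One possible reading, assumed here purely for illustration, is a listen-weighted mean of per-entity popularity scores between 0.0 and 1.0. The function name `mainstream_score`, the popularity scale and the sample data are all hypothetical.

```python
def mainstream_score(listen_counts, popularity):
    """Listen-weighted mean popularity (0.0 to 1.0) of a user's entities.

    listen_counts: maps an entity to the user's number of listens for it.
    popularity: maps an entity to an assumed popularity score in [0.0, 1.0].
    """
    total = sum(listen_counts.values())
    if total == 0:
        return 0.0
    weighted = sum(
        count * popularity.get(entity, 0.0)
        for entity, count in listen_counts.items()
    )
    return weighted / total


# Hypothetical user: three listens to a very popular artist, one to an obscure one
score = mainstream_score({"a": 3, "b": 1}, {"a": 1.0, "b": 0.0})
```

Under this assumption a user who mostly listens to popular entities scores close to 1.0, and other weightings would be equally valid choices.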

About Me

I am a freshman at IIIT-H (International Institute of Information Technology, Hyderabad). I have been working with ListenBrainz since January and have learned quite a few things along the way. You can find the list of PRs that I have made over here.

Question: Tell us about the computer(s) you have available for working on your SoC project!

I have an HP laptop with an Intel i5 processor and 8 GB RAM running Arch Linux. I also have a desktop computer with an Intel i7 processor, a GTX 960 graphics card, and 8 GB RAM running Arch Linux.

Question: When did you first start programming?

I have been programming since 10th grade. I started with C/C++ but now mostly code in Python and JavaScript.

Question: What type of music do you listen to?

I am a big fan of Coldplay. In addition to that, I also like listening to songs by Maroon 5, Lenka, and The Local Train.

Question: What aspects of ListenBrainz interest you the most?

The data collected by ListenBrainz is openly available and can be used to improve music technologies. Also, the open source nature of ListenBrainz allows me to add features which are currently unavailable in closed source competitors.

Question: Have you ever used MusicBrainz to tag your files?

I have used MusicBrainz Picard to tag my music collection.

Question: Have you contributed to other Open Source projects? If so, which projects, and can we see some of your code?

ListenBrainz is the first open source organization that I have contributed to. However, I have done some other projects that can be seen on my GitHub page.

Question: What sorts of programming projects have you done on your own time?

I wrote a bot that solved the Eight Puzzle as the final project for CS50. Recently I also worked on the platform used for Botomania, an onsite contest held at IIITH.

Question: How much time do you have available, and how would you plan to use it?

I plan to work for 35-45 hours per week as I would be on vacation during most of the coding period.

Question: Do you plan to have a job or study during the summer in conjunction with Summer of Code?

No

Hey everyone, this is the first draft of my GSoC proposal. Any feedback would be appreciated.

What does daily activity mean in terms of stats? What data goes into that graph? The stats you pointed out were mostly ones on our radar already. What stats would you add that we’ve not discussed yet?

More details, please!

Noice!!

This one isn’t as sexy as the previous one. Can we do better? For the sitewide ones as well: bar charts are not really engaging. Let’s find something better.

I’ve read up through the timeline so far. I would like to see a lot more detail on the types of graphs you propose to do. I would like to see more of them, including new ones that you came up with. Then I would like to see queries that you think can generate the data and then mockups of the graphs.

I really like the mock-up you did for page1 - stats. Let’s do more of those for other stats. You can look in Jira for other tickets that describe more stats, if you want.

The daily activity graph shows at what times of day a user is most active. It reveals interesting information about a person’s work habits and daily routine.

Artist Origin, Top Genres and Mood Analysis are statistics which were not discussed yet.

Thanks!

I feel that we should keep the bar charts for the history part because this view shows all the entities that a user has listened to. I think we can make the page more attractive by adding Artist Images/Album Art.

Another option for showing trending entities is a Stream graph. I have updated the mock UI to reflect this change. Let me know your thoughts on this.

I have updated the Back End section and have added more details about all the graphs as well as the queries which could generate the data for some of the graphs. I have added the link to mockups of the graphs in the Front End section.

Please have a look through the updated proposal and let me know if anything else has to be changed. Thanks!