As I am proposing a project that has been ongoing for several years now and is a continuation of my project from GSoC 2015, I will proceed with the understanding that the MusicBrainz community knows the basics of the project. Instead, I will use this as a platform to fill the community in on the state of the project and how I intend to bring the project to a conclusion.
To be absolutely clear though, finishing this project and replacing the old custom Lucene server is my top priority. Improvements can be made down the road.
MusicBrainz Simple Solr Search Server Schema (mbsssss)
Most of my work last summer involved working in the MusicBrainz Simple Solr Search Server Schema (mbsssss) found here: https://github.com/jeffweeksio/mbsssss
The directories in this repository (aside from _template, lib, and common) correspond to the searchable entities (Solr cores) indexed on the server. Each core directory has the following basic structure:
common (symlink to directory in root)
The schema.xml files contain the entity’s field names, its field types, and whether the entity’s data needs to be stored on the search server or just indexed. Some of these fields are unique to the search server. For example, I have indexed (not stored) data from core-critical fields into a “catch” field that serves as a “catch-all” index for the entity. This is the field that is searched by default in an “Indexed search with advanced query syntax” when the user doesn’t specify a search field. Another search-server-specific field is the ngram field/fieldtype, which indexes ngrams of core-critical fields for use in an edismax (“Indexed search”) query.
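As a rough illustration of what the “catch” and ngram fields hold, here is a minimal Python sketch. It is not the Solr configuration itself: in the real schema this work is done by copying fields and by analysis filters defined in schema.xml. The function names, the example field names (artist, alias, type), and the gram sizes (2 to 3) are all assumptions made for the example.

```python
def catch_values(doc, source_fields):
    """Collect the values of selected fields into one catch-all list.

    Hypothetical helper: mimics copying several core-critical fields
    into a single indexed (not stored) field.
    """
    values = []
    for field in source_fields:
        value = doc.get(field)
        if value is None:
            continue  # a missing field simply contributes nothing
        if isinstance(value, (list, tuple)):
            values.extend(value)  # multi-valued field
        else:
            values.append(value)
    return values


def char_ngrams(text, min_size=2, max_size=3):
    """Return the lowercase character n-grams of text.

    Gram sizes are assumed values, not the ones used in mbsssss.
    """
    token = text.lower()
    grams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


# Hypothetical document, for illustration only.
doc = {"artist": "Nina Simone", "alias": ["Eunice Waymon"], "type": "Person"}
print(catch_values(doc, ["artist", "alias"]))
print(char_ngrams("Bjork"))
```

Indexing the partial tokens is what lets an “Indexed search” match a query that only contains part of a name.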
The solrconfig.xml files contain the configuration that tells Solr how queries should be handled (requestHandler) and responded to (queryResponseWriter). The request handlers tell Solr which fields to search and whether and how results should be boosted. The query response writers call one of Mineo’s response writers, either MBXMLWriter or MBJSONWriter. These response writers generate mmd-schema-compliant responses.
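The parameters a request handler sets as defaults can also be passed on an individual request, which is convenient when experimenting with boosts. The following is a minimal sketch of building such a request in Python. The parameter names (defType, qf, mm, rows) are standard Solr edismax parameters, but the base URL, the core name “artist”, the boost values, and the minimum-match expression are assumptions for illustration and not the values used in mbsssss.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust as needed.
SOLR_BASE = "http://localhost:8983/solr"


def edismax_params(user_query, field_boosts, min_match="2<-25%", rows=25):
    """Build edismax query parameters from a mapping of field -> boost."""
    # qf lists the fields to search, each with an optional ^boost suffix.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": qf,
        "mm": min_match,  # minimum-match threshold (assumed value)
        "rows": str(rows),
    }


def select_url(core, params):
    """Return the URL of a select request against the given core."""
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


# Hypothetical boosts: favour the catch-all field over the ngram field.
params = edismax_params("nina simone", {"catch": 1.0, "ngram": 0.4})
print(select_url("artist", params))
```

Trying different boosts this way avoids editing solrconfig.xml and reloading the core for every experiment; values that work well can then be written back as request handler defaults.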
The work that remains to be done in this repo will fall under the “results optimization” portion of the project. I was able to optimize the artist entity fairly well last summer. This summer, I plan to go through each entity and tweak boosting factors, edismax parameters, “ngram” and “catch” fields, minimum match thresholds, etc.
This will involve actively engaging with the MusicBrainz community to receive and respond to feedback. I will monitor the project’s JIRA page closely.
One specific bug in this repo is SOLR-12. The fix for this will involve the filtering method used for Katakana characters. Currently using:
Search Index Rabbit (sir)
Much of my previous work was in:
Much of what needs to be completed is in:
convert.py defines the functions that convert an entity’s data to an XML document compliant with version 2 of the MMD schema. I believe that many, if not all, of the bugs in JIRA of the form “? is missing from ? results” are related to missing or unfinished compatibility converter functions in this file. If I am correct, completing these converter functions will resolve a large percentage of the current bugs in JIRA.
MusicBrainz Query Response Writer for Solr (mb-solrquerywriter)
If missing data in the results is not remedied via convert.py in sir, then the problem could be in the query response writer.
Defragmentation of MusicBrainz/Solr Codebase
There are several versions of the various repos floating around and I should probably sort things out and make sure everyone is on the same page. My inexperience with version control has been quite evident.
With all of the work that has gone into this project over the last few years, I want to be sure there is some thorough, unassuming, plain-spoken documentation to go with it and to help further encourage hacking.
Community Bonding Period
Solicitation of community feedback.
Discuss plans with CJ_ regarding Solr and zas regarding sysadmin issues and the monitoring of the new service.
Clean up the codebase and control my version control
Study up on Python and SQLAlchemy for sir
Getting my dev environment in order
Ensure the test VM is up and running.
Curate feedback into a final list of bugs.
Write blog post about currently known issues and announce a release immediately after fixing these issues. Short of any regressions, this will happen the week of July 4th.
Work on bugs in the form of “? is missing from ? results”. I figure I might be able to do 1 or 2 of these per day for the first few weeks. This may go more quickly, particularly since they all seem to be the same type of problem. If there is time toward the end of the summer, I could contribute to *Brainz elsewhere.
Work on other critical issues as determined by the community during the bonding period.
Deploy into production.
Work through each entity one by one to fine-tune its search configuration.
Solicitation of community feedback.
Bug fixes in response to feedback.
Inevitable bug fixes.
Computer: Lenovo Z580
Processor: Intel® Core™ i5-3210M CPU @ 2.50GHz × 4
Memory: 7.7 GiB
OS: Ubuntu GNOME 15.10 64-bit
Programming since: Autumn of 2012.
(You might notice I have a type.)
What about MusicBrainz interests me most?
I’m a musician, a music cataloger, and I’m trying to become a music hacker. MusicBrainz is the perfect project for me!
Have you ever used MusicBrainz to tag your files?
No. I spent a TON of time doing this myself back in the day before I was aware of MusicBrainz.
Have you contributed to other open source projects?
No. Just this one.
How much time do you have available?
25-35 hours per week
Do you plan to have a job or study during the summer?
Job: Yes, I will have to continue my position at Sibley Music Library (part-time).
Study: I have no plans to take any coursework over the summer officially, but I do need to practice coding challenges on HackerRank and I’d like to work through Professor Serra’s Coursera course on Audio Signal Processing for Music Applications.