Personal information
Nickname: Jeff
IRC nick: weeksio
Email: jeff@jeffweeks.io
GitHub: jeffweeksio
Twitter: jeffweeksio
Facebook: jeffweeks.io
LinkedIn: jeffweeks
Proposal
This proposal continues my GSoC 2015 project, which has been ongoing for several years now, so I will proceed with the understanding that the MusicBrainz community knows the basics of the project. I will instead use this proposal to fill the community in on the state of the project and on how I intend to bring it to a conclusion.
To be absolutely clear, though: finishing this project and replacing the old custom Lucene server is my top priority. Further improvements can be made down the road.
MusicBrainz Simple Solr Search Server Schema (mbsssss)
Most of my work last summer involved working in the MusicBrainz Simple Solr Search Server Schema (mbsssss) found here: https://github.com/jeffweeksio/mbsssss
The directories in this repository (aside from _template, lib, and common) correspond to the searchable entities (Solr cores) indexed on the server. Each core directory has the following basic structure:
[entity-name]
conf
→schema.xml
→solrconfig.xml
common (symlink to directory in root)
data
index
The schema.xml files contain the entity’s field names, its field types, and whether the entity’s data needs to be stored on the search server or just indexed. Some of these fields are unique to the search server. For example, I have indexed (not stored) data from core-critical fields into a “catch” field that serves as a “catch-all” index for the entity. This is the field that is searched by default in an “Indexed search with advanced query syntax” when the user doesn’t specify a search field. Another search-server-specific field is the ngram field/fieldtype, which indexes ngrams for core-critical fields for use in an edismax (“Indexed search”) query.
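To make the “catch” and ngram ideas concrete, here is a minimal Python sketch of what those two fields hold. It is only an illustration: on the server this work happens inside Solr’s own analysis chain as configured in schema.xml, not in Python. The field names (artist, sortname, alias) and the n-gram sizes are assumptions chosen for the example, not values taken from the repository.

```python
def ngrams(text, min_size=2, max_size=3):
    """Return every character n-gram of `text` with min_size <= n <= max_size."""
    text = text.lower()
    grams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


def build_catch(doc, source_fields):
    """Join the values of the chosen fields into one catch-all string."""
    values = []
    for name in source_fields:
        value = doc.get(name)
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            values.extend(str(v) for v in value)
        else:
            values.append(str(value))
    return " ".join(values)


# Hypothetical artist document; the field names are illustrative only.
artist = {
    "artist": "Bach",
    "sortname": "Bach, Johann Sebastian",
    "alias": ["J. S. Bach"],
}
catch = build_catch(artist, ["artist", "sortname", "alias"])
name_grams = ngrams(artist["artist"])
```

A query with no field specified would match against the text in catch, while a partial or misspelled term can still overlap with some of the entries in name_grams.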
The solrconfig.xml files contain the configuration that tells Solr how queries should be handled (requestHandler) and responded to (queryResponseWriter). The request handlers tell Solr which fields to search and whether and how results should be boosted. The query response writers call one of Mineo’s response writers, either MBXMLWriter or MBJSONWriter. These response writers generate mmd-schema-compliant responses.
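As a rough illustration of the knobs a request handler sets, the Python sketch below builds an edismax query URL by hand. The parameter names (defType, qf, mm, wt) are standard Solr query parameters; the host, core name, field names, boost values, and minimum-match expression are assumptions for the example. In the real setup these defaults would live in solrconfig.xml rather than in the URL, and wt would name one of the MusicBrainz response writers instead of Solr’s built-in json writer.

```python
from urllib.parse import urlencode


def build_search_url(base_url, core, user_query, field_boosts,
                     min_match="2<75%", writer="json"):
    """Build a Solr select URL that runs `user_query` through edismax."""
    # qf lists the fields to search, each with an optional ^boost factor.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = {
        "q": user_query,
        "defType": "edismax",  # use the extended DisMax query parser
        "qf": qf,
        "mm": min_match,       # minimum number of clauses that must match
        "wt": writer,          # which response writer formats the result
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical local server, core, and boosts.
url = build_search_url(
    "http://localhost:8983/solr",
    "artist",
    "bach",
    {"artist": 2.0, "alias": 1.0},
)
```

Tuning an entity then amounts to adjusting values like the boosts in qf and the mm expression, and checking how the ranking of results changes.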
The work that remains to be done in this repo will fall under the “results optimization” portion of the project. I was able to optimize the artist entity fairly well last summer. This summer, I plan to go through each entity and tweak boosting factors, edismax parameters, “ngram” and “catch” fields, minimum match thresholds, etc.
This will involve actively engaging with the MusicBrainz community to receive and respond to feedback. I will monitor the project’s JIRA page closely.
One specific bug in this repo is SOLR-12. The fix for this will involve the filtering method used for Katakana characters. Currently using:
Search Index Rabbit (sir)
Much of my previous work was in:
sir/sir/schema/init.py
sir/sir/schema/modelext.py
Much of what needs to be completed is in:
sir/sir/wscompat/convert.py
convert.py defines the functions that convert an entity’s data to an XML document compliant with version 2 of the MMD schema. I believe that many, if not all, of the bugs in JIRA of the form “? is missing from ? results” are related to missing or unfinished compatibility converter functions in this file. If I am correct, completing these converter functions will resolve a large percentage of the current bugs in JIRA.
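The sketch below shows, in simplified form, why an unfinished converter produces a “? is missing from ? results” bug. It is not sir’s actual code: the function name convert_artist, the dictionary keys, and the use of xml.etree.ElementTree are assumptions made for illustration. The element names (artist, name, sort-name, alias-list, alias) and the namespace follow the MMD version 2 schema.

```python
import xml.etree.ElementTree as ET

MMD_NS = "http://musicbrainz.org/ns/mmd-2.0#"


def convert_artist(row):
    """Convert a dict of artist data into an MMD-style <artist> element."""
    artist = ET.Element(f"{{{MMD_NS}}}artist", {"id": row["gid"]})
    ET.SubElement(artist, f"{{{MMD_NS}}}name").text = row["name"]

    if row.get("sort_name"):
        ET.SubElement(artist, f"{{{MMD_NS}}}sort-name").text = row["sort_name"]

    # If this branch were never written, aliases would be present in the
    # database but absent from every search result for this entity.
    aliases = row.get("aliases") or []
    if aliases:
        alias_list = ET.SubElement(artist, f"{{{MMD_NS}}}alias-list")
        for alias in aliases:
            ET.SubElement(alias_list, f"{{{MMD_NS}}}alias").text = alias

    return artist


# Hypothetical input row with a placeholder MBID.
element = convert_artist({
    "gid": "00000000-0000-0000-0000-000000000000",
    "name": "Bach",
    "sort_name": "Bach, Johann Sebastian",
    "aliases": ["J. S. Bach"],
})
xml_text = ET.tostring(element, encoding="unicode")
```

Each missing piece of data in a result corresponds to one such branch that has to be added to the relevant converter function.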
MusicBrainz Query Response Writer for Solr (mb-solrquerywriter)
If missing data in the results is not remedied via convert.py in sir, then the problem could be in the query response writer.
Defragmentation of MusicBrainz/Solr Codebase
There are several versions of the various repos floating around and I should probably sort things out and make sure everyone is on the same page. My inexperience with version control has been quite evident.
Documentation
With all of the work that has gone into this project over the last few years, I want to be sure there is thorough, unassuming, plainspoken documentation to go with it and to help encourage further hacking.
Schedule
Community Bonding Period
Solicitation of community feedback.
Discuss plans with CJ_ regarding Solr and zas regarding sysadmin issues and the monitoring of the new service.
Clean up the codebase and get my version control under control.
Study up on Python and SQLAlchemy for sir.
Get my dev environment in order.
Ensure the test VM is up and running.
Curate feedback into a final list of bugs.
Write a blog post about currently known issues and announce a release immediately after fixing these issues. Barring any regressions, this will happen the week of July 4th.
Weeks 1-5
Work on bugs of the form “? is missing from ? results”. I figure I might be able to do 1 or 2 of these per day for the first few weeks. This may go more quickly (particularly since they all seem to be the same type of problem); if so, and if there is time toward the end of the summer, I could contribute to *Brainz elsewhere.
Work on other critical issues as determined by the community during the bonding period.
Week 6
Deploy into production.
Weeks 7-9
Work through each entity one by one to fine-tune its search configuration.
Solicitation of community feedback.
Bug fixes in response to feedback.
Weeks 10-12
Documentation
Loose Ends
Inevitable bug fixes.
About Me
Computer: Lenovo Z580
Processor: Intel® Core™ i5-3210M CPU @ 2.50GHz × 4
Memory: 7.7 GiB
OS: Ubuntu GNOME 15.10 64-bit
Programming since: Autumn of 2012.
Musical Interests:
Favorite composers:
519dd32e-8f30-4380-8826-7aa99169e1bb
9ddd7abc-9e1b-471d-8031-583bc6bc8be9
0e43fe9d-c472-4b62-be9e-55f971a023e1
c278de2c-9696-4fdf-a919-0781cd945e2c
(You might notice I have a type.)
Favorite bassists:
2dc7a305-7a8c-4fcc-add4-5e9def0a9a0d
901caa09-383e-4dbc-95f2-a75ec7863b6a
7e4d50c0-3652-4564-b0f1-c6552204b6ec
f28fc731-59db-4399-b243-43ed7b5e6e49
1d0ae4e0-384f-4854-a1b4-ebb4cc81414b
46a6fac0-2e14-4214-b08e-3bdb1cffa5aa
384d6827-3b17-43bc-acf6-92e618b8ec83
Albums:
40c45052-9418-4aee-83c7-8ba518afae3f
97b0a3f5-4f7e-3744-8780-0449dc734439
0cb0f83f-b8d2-4cd7-a5fb-32a4bc21d053
3bd5a388-774f-3b47-b3f8-5b3463cbcb13
a159e102-18bb-42ca-a59e-5f96a7eff241
9b191575-fa1f-3531-bff5-121782318972
ca5dfcc3-83fb-3eee-9061-c27296b77b2c
c7b245c9-8099-32ea-af95-893acedde2cf
cd76f76b-ff15-3784-a71d-4da3078a6851
4f8bb33f-9e2d-34dc-8852-5d591cc104d8
What about MusicBrainz interests me most?
I’m a musician, a music cataloger, and I’m trying to become a music hacker. MusicBrainz is the perfect project for me!
Have you ever used MusicBrainz to tag your files?
No. I spent a TON of time doing this myself back in the day before I was aware of MusicBrainz.
Have you contributed to other open source projects?
No, just this one.
How much time do you have available?
25-35 hours per week
Do you plan to have a job or study during the summer?
Job: Yes, I will have to continue my position at Sibley Music Library (part-time).
Study: I have no plans to take any coursework over the summer officially, but I do need to practice coding challenges on HackerRank and I’d like to work through Professor Serra’s Coursera course on Audio Signal Processing for Music Applications.