MusicBrainz Solr Search Server Completion and Results Optimization

weeksio · March 25, 2016, 7:04am

See the original Google doc here

Personal information
Nick Name: Jeff
IRC nick: weeksio
Email: jeff@jeffweeks.io
GitHub: jeffweeksio
Twitter: jeffweeksio
Facebook: jeffweeks.io
LinkedIn: jeffweeks

Proposal

As I am proposing a project that has been ongoing for several years now and is a continuation of my project from GSoC 2015, I will proceed with the understanding that the MusicBrainz community knows the basics of the project. Instead, I will use this as a platform to fill the community in on the state of the project and how I intend to bring the project to a conclusion.

To be absolutely clear though, finishing this project and replacing the old custom Lucene server is my top priority. Improvements can be made down the road.

MusicBrainz Simple Solr Search Server Schema (mbsssss)
Most of my work last summer involved working in the MusicBrainz Simple Solr Search Server Schema (mbsssss) found here: https://github.com/jeffweeksio/mbsssss

The directories in this repository (aside from _template, lib, and common) correspond to the searchable entities (Solr cores) indexed on the server. Each core directory has the following basic structure:
[entity-name]
conf
→schema.xml
→solrconfig.xml
common (symlink to directory in root)
data
index

The schema.xml files contain the entity’s field names, its field types, and whether the entity’s data needs to be stored on the search server or just indexed. Some of these fields are unique to the search server. For example, I have indexed (not stored) data from core-critical fields into a “catch” field that serves as a “catch-all” index for the entity. This is the field that is searched by default in an “Indexed search with advanced query syntax” when the user doesn’t specify a search field. Another search-server-specific field is the ngram field/fieldtype, which indexes ngrams for core-critical fields for use in an edismax (“Indexed search”) query.
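As a rough illustration of the two search-server-specific fields described above, here is a minimal Python sketch of what happens conceptually at index time. It is not the Solr analyzer chain from schema.xml; the function names (`build_catch_field`, `char_ngrams`), the field names, and the gram sizes are assumptions made for this example.

```python
def build_catch_field(doc, core_fields):
    """Join the non-empty core-critical field values into one catch-all string.

    Hypothetical helper: in Solr this copying is configured in schema.xml,
    not written in Python.
    """
    return " ".join(str(doc[name]) for name in core_fields if doc.get(name))


def char_ngrams(text, min_gram=2, max_gram=3):
    """Return lowercase character n-grams for each whitespace-separated token.

    The gram sizes are illustrative defaults, not the values used in mbsssss.
    """
    grams = []
    for token in text.lower().split():
        for size in range(min_gram, max_gram + 1):
            for start in range(len(token) - size + 1):
                grams.append(token[start:start + size])
    return grams


# Example document with made-up field names.
artist = {"name": "Bach", "sortname": "Bach, J.S.", "comment": ""}
catch_value = build_catch_field(artist, ["name", "sortname", "comment"])
name_grams = char_ngrams(artist["name"])
```

The catch value is only searched, so it would be indexed but not stored. The n-grams let a partial query such as "bac" match the indexed name.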

The solrconfig.xml files contain the code that tells Solr how queries should be handled (requestHandler) and responded to (queryResponseWriter). The request handlers tell Solr what fields to search and how/if results should be boosted. The query response writers call one of Mineo’s response writers, either MBXMLWriter or MBJSONWriter. These response writers generate mmd-schema compliant responses.
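To make the request handler settings more concrete, the sketch below assembles the kind of edismax parameters such a handler carries. `defType`, `q`, `qf`, `mm`, and `wt` are standard Solr query parameters. The field names, boost values, minimum-match expression, and the helper `build_edismax_params` are assumptions for illustration, not the values in the repository.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, boosts, min_match="2<75%"):
    """Build edismax query parameters from a query string and per-field boosts.

    `boosts` maps a field name to its boost factor; both are hypothetical here.
    """
    # qf lists the searched fields, each with an optional ^boost suffix.
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": qf,
        "mm": min_match,  # minimum-match threshold
        "wt": "json",     # response writer to use
    }


params = build_edismax_params("miles davis", {"name": 2.0, "ngram": 0.5})
query_string = urlencode(params)
```

In solrconfig.xml the same keys would normally be set as defaults on the requestHandler, so clients only send `q`. Tuning a core then amounts to adjusting the `qf` boosts and the `mm` expression.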

The work that remains to be done in this repo will fall under the “results optimization” portion of the project. I was able to optimize the artist entity fairly well last summer. This summer, I plan to go through each entity and tweak boosting factors, edismax parameters, “ngram” and “catch” fields, minimum match thresholds, etc.

This will involve actively engaging with the MusicBrainz community to receive and respond to feedback. I will monitor the project’s JIRA page closely.

One specific bug in this repo is SOLR-12. The fix for this will involve the filtering method used for Katakana characters. Currently using:

Search Index Rabbit (sir)
Much of my previous work was in:
sir/sir/schema/__init__.py
sir/sir/schema/modelext.py
Much of what needs to be completed is in:
sir/sir/wscompat/convert.py

convert.py defines the functions that convert an entity’s data to an XML document compliant with version 2 of the MMD schema. I believe that many if not all of the bugs in JIRA of the form “? is missing from ? results” are related to missing or unfinished compatibility converter functions in this file. If I am correct, completing these converter functions will resolve a large percentage of the current bugs in JIRA.
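A minimal sketch of what such a converter does, and of why an unfinished one produces “? is missing from ? results” bugs. This is not the code in convert.py: it uses the standard library’s ElementTree rather than whatever bindings sir uses, and the function name `convert_artist_sketch`, the input keys, and the placeholder MBID are all assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace of the MusicBrainz XML metadata (MMD) schema, version 2.
MMD_NS = "http://musicbrainz.org/ns/mmd-2.0#"


def convert_artist_sketch(row):
    """Convert a dict of artist data into an MMD-style <artist> element.

    Hypothetical converter: any field not handled here never reaches the
    search results, which is how a "missing from results" bug arises.
    """
    artist = ET.Element(f"{{{MMD_NS}}}artist", {"id": row["gid"]})
    ET.SubElement(artist, f"{{{MMD_NS}}}name").text = row["name"]
    ET.SubElement(artist, f"{{{MMD_NS}}}sort-name").text = row["sort_name"]
    # Optional fields are only emitted when the entity actually has them.
    if row.get("comment"):
        ET.SubElement(artist, f"{{{MMD_NS}}}disambiguation").text = row["comment"]
    return artist


element = convert_artist_sketch({
    "gid": "00000000-0000-0000-0000-000000000000",  # placeholder MBID
    "name": "Example Artist",
    "sort_name": "Artist, Example",
})
xml_text = ET.tostring(element, encoding="unicode")
```

Under this model, fixing one of those bugs means adding the missing sub-element to the relevant converter function and checking the output against the schema.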

MusicBrainz Query Response Writer for Solr (mb-solrquerywriter)
If missing data in the results is not remedied via convert.py in sir, then the problem could be in the query response writer.

Defragmentation of MusicBrainz/Solr Codebase
There are several versions of the various repos floating around and I should probably sort things out and make sure everyone is on the same page. My inexperience with version control has been quite evident.

Documentation
With all of the work that has gone into this project over the last few years, I want to be sure there is some thorough, unassuming, plain-spoken documentation to go with it and to help further encourage hacking.

Schedule
Community Bonding Period
Solicitation of community feedback.
Discuss plans with CJ_ regarding Solr and zas regarding sysadmin issues and the monitoring of the new service.
Clean up the codebase and control my version control
Study up on Python and SQLAlchemy for sir
Getting my dev environment in order
Ensure the test VM is up and running.
Curate feedback into a final list of bugs.
Write blog post about currently known issues and announce a release immediately after fixing these issues. Short of any regressions, this will happen the week of July 4th.

Weeks 1-5
Work on bugs in the form of “? is missing from ? results”. I figure I might be able to do 1 or 2 of these per day for the first few weeks. Maybe this goes more quickly (particularly since they all seem to be the same type of problem), but if there is time toward the end of the summer I could contribute to *Brainz elsewhere.
Work on other critical issues as determined by the community during the bonding period.

Week 6
Deploy into production.

Weeks 7-9
Work through each entity one by one to fine-tune its search configuration.
Solicitation of community feedback.
Bug fixes in response to feedback.

Weeks 10-12
Documentation
Loose Ends
Inevitable bug fixes.

About Me
Computer: Lenovo Z580
Processor: Intel® Core™ i5-3210M CPU @ 2.50GHz × 4
Memory: 7.7 GiB
OS: Ubuntu GNOME 15.10 64-bit
Programming since: Autumn of 2012.
Musical Interests:
Favorite composers:
519dd32e-8f30-4380-8826-7aa99169e1bb
9ddd7abc-9e1b-471d-8031-583bc6bc8be9
0e43fe9d-c472-4b62-be9e-55f971a023e1
c278de2c-9696-4fdf-a919-0781cd945e2c
(You might notice I have a type.)

Favorite bassists:
2dc7a305-7a8c-4fcc-add4-5e9def0a9a0d
901caa09-383e-4dbc-95f2-a75ec7863b6a
7e4d50c0-3652-4564-b0f1-c6552204b6ec
f28fc731-59db-4399-b243-43ed7b5e6e49
1d0ae4e0-384f-4854-a1b4-ebb4cc81414b
46a6fac0-2e14-4214-b08e-3bdb1cffa5aa
384d6827-3b17-43bc-acf6-92e618b8ec83

Albums:
40c45052-9418-4aee-83c7-8ba518afae3f
97b0a3f5-4f7e-3744-8780-0449dc734439
0cb0f83f-b8d2-4cd7-a5fb-32a4bc21d053
3bd5a388-774f-3b47-b3f8-5b3463cbcb13
a159e102-18bb-42ca-a59e-5f96a7eff241
9b191575-fa1f-3531-bff5-121782318972
ca5dfcc3-83fb-3eee-9061-c27296b77b2c
c7b245c9-8099-32ea-af95-893acedde2cf
cd76f76b-ff15-3784-a71d-4da3078a6851
4f8bb33f-9e2d-34dc-8852-5d591cc104d8

What about MusicBrainz interests me most?
I’m a musician, a music cataloger, and I’m trying to become a music hacker. MusicBrainz is the perfect project for me!
Have you ever used MusicBrainz to tag your files?
No. I spent a TON of time doing this myself back in the day before I was aware of MusicBrainz.
Have you contributed to other open source projects?
No. Just this one.
How much time do you have available?
25-35 hours per week
Do you plan to have a job or study during the summer?
Job: Yes, I will have to continue my position at Sibley Music Library (part-time).
Study: I have no plans to take any coursework over the summer officially, but I do need to practice coding challenges on HackerRank and I’d like to work through Professor Serra’s Coursera course on Audio Signal Processing for Music Applications.

reosarevok · March 25, 2016, 10:24am

You self-hating American!

More seriously, yes. Yes. Yes. We definitely need that, and very happy to see you made it on time (assuming you’ve submitted to the GSoC page too! If not, please do it very soon).

rob · March 25, 2016, 11:45am

OK, in general I like where you are going with this. I would like to see you make another pass adjusting things a bit more:

I’d like to see more community feedback during the bonding period and less during the actual coding phase.
I’d like to not see any new features or improvement at first.
I’d like to see the project get released and pushed into production about half way through the summer.
The latter half would be bug fixes and improvements as the community suggests.

The key thing here is: FINISH this project. That needs to be your #1 goal.

rob · March 25, 2016, 11:54am

And whatever you do, get a proposal in SOON.

weeksio · March 25, 2016, 3:24pm

I made some slight changes. Working on official GSoC application now.

rob · March 25, 2016, 3:28pm

Deploy into production needs to be a week. Then leave a week for bug hot-fixes.

Really, post release things are going to be nebulous – how well the release went will define what else needs to be done, so I’m fine with a vague schedule for the latter half.

weeksio · March 25, 2016, 3:35pm

rob · March 25, 2016, 3:36pm

/me hits ignore proposal

(just kidding)

rob · March 25, 2016, 3:37pm

The student has submitted their final proposal for review. You will be able to view it immediately after the deadline:

We’re not certain why this is in place. Would you do me a favor and mail me a copy?

weeksio · March 25, 2016, 3:39pm

The official proposal is this [Google doc](https://docs.google.com/document/d/1eOMvBtNygeLmwjiRKy1n-RyQOooZxpPONM8tn2h4l7w/edit?usp=sharing) in PDF form.

Here is the abstract I posted.

It looked to me like we can submit a draft for review (which would be redundant with the MB community page we’re on) and then upload a final proposal. In my case, they are identical.

rob · March 25, 2016, 3:51pm

OK, getting there.

I’d like to see less technical detail about tuning the schema - for once.

I’d like to see more information on what steps are required to get this into production. Less about tuning, more about show-stopper bugs. You should also talk about your plans to pick CJ_'s brain to finish debugging the remaining bugs while he doesn’t have time to do so.

I’d also like to see a stated goal of using the community bonding period to determine the list of critical issues that the community believes need to be fixed before going into production. I see the following steps for this:

Ensure the test VM is up and running.
Write blog post about currently known issues. Threaten to release immediately after fixing these issues, which will bring more people out of the woodwork.
Curate feedback into a final list of bugs. Announce final list of bugs and short of regressions appearing, that we plan to push this into production on date X.

This ought to generate enough interest (read: panic) in the community to get a clear feeling for what is needed for actually getting a first version of search out the door. The list of things that are required for release should be clear at the beginning of the summer.

Finally, you should talk more about your plans to work closely with zas so that zas can work on the sysadmin related issues and monitoring this new service.

You have a lot of plans to document still.

weeksio · March 25, 2016, 4:01pm

Can I continue that here post deadline?

rob · March 25, 2016, 4:02pm

No.

I was under the impression that that was possible, but it isn’t.

weeksio · March 25, 2016, 4:06pm

Alright…I’ll do what I can in the next 3 hours (minus doctor’s appt. and nourishment). The purpose of including the schema technical details was to document where the project stands.

weeksio · March 25, 2016, 4:10pm

I admittedly got bogged down with and hyper-focused on the schema details last year…but they’re fun. I get your point though. Just get it done.

rob · March 25, 2016, 4:16pm

Don’t get me wrong – tuning is important, but getting it out the door, more so.

weeksio · March 25, 2016, 6:58pm

I resubmitted with your suggested timeline edits (both here and the GSoC page).

weeksio · March 25, 2016, 6:59pm

weeksio · March 27, 2016, 4:51pm

Not to be presumptuous or anything, but would it be acceptable to go ahead and start a Community topic/conversation with Zas, CJ_, and Mineo (and anyone else of course) regarding the Solr project in general? …or would it be best to just wait?

Freso · March 27, 2016, 5:38pm

We’re allowed to say that no one else has made a proposal for working on the Solr search server/mbssssssssssss, so whether or not your proposal is going to pull through, any work you do on it will not conflict with another student’s. So I’d say you’d be most welcome to start talking with people about the Solr project, as long as you keep in mind that there’s a very real chance you’re not going to get picked.