Lucene scoring not accurate

Richard_Dunn · January 23, 2017, 1:55am

Hi guys, I’ve been having a look at the v2 API (it’s awesome!), but can’t help but feel there’s an issue with the scoring. When running the following search to find Nirvana’s Nevermind:

http://musicbrainz.org/ws/2/release-group/?query=artist:nirvana%20AND%20release:"nevermind"

The API scores it joint second after “Nevermind Sessions” (100) and on par with “Nevermind, It’s an Interview” (91). Both of those are by Nirvana, but obviously not the most relevant.

At first glance it looks like a document length normalisation problem, as the longer documents are ranked higher, but the fact that Nevermind is ranked as 91 and not 100 makes it seem like there’s something else up with the algorithm. I’m not familiar with the Lucene scoring algorithm so I can’t really comment further, but is this not considered a fairly big issue?
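For reference, here is a rough sketch of what length normalisation would be expected to do in classic TF-IDF scoring, where the field norm is 1/sqrt(number of terms). The function name is made up for illustration, the token counts assume plain whitespace splitting rather than the real analyzer, and everything else that goes into a real Lucene score (idf, coordination, lossy norm encoding, boosts) is ignored:

```python
import math

def classic_length_norm(num_terms: int) -> float:
    """Classic TF-IDF length normalisation: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)

# Titles from the search above; token counts assume whitespace tokenisation,
# which is an assumption and not necessarily what the search server does.
titles = ["Nevermind", "Nevermind Sessions", "Nevermind, It's an Interview"]
norms = {title: classic_length_norm(len(title.split())) for title in titles}

for title, norm in norms.items():
    print(f"{title!r}: {norm:.3f}")
```

Under this factor alone the one-word title gets the largest norm, so the shortest match should come first, which is the opposite of the ordering the API returns.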

From a programmatic point of view, it seems pretty beneficial to have this scoring as accurate as possible, otherwise the result set needs to be filtered again on the client side to determine which result is actually most relevant.
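To give an idea of the kind of client-side workaround I mean, here is a minimal sketch that moves exact title matches to the front and otherwise keeps the server's score order. The `title` and `score` keys and the `rerank` helper are assumptions for illustration, and the sample data is just the three results mentioned above:

```python
def rerank(results, wanted_title):
    """Put exact (case-insensitive) title matches first, then order by server score."""
    wanted = wanted_title.casefold()
    return sorted(
        results,
        # False sorts before True, so exact matches come first;
        # the negated score keeps higher scores ahead within each group.
        key=lambda r: (r["title"].casefold() != wanted, -r["score"]),
    )

# Hypothetical, hand-written result list mirroring the scores quoted above.
results = [
    {"title": "Nevermind Sessions", "score": 100},
    {"title": "Nevermind, It's an Interview", "score": 91},
    {"title": "Nevermind", "score": 91},
]

reranked = rerank(results, "nevermind")
print([r["title"] for r in reranked])
```

This only helps when the client already knows the exact title it is looking for, which is why fixing the ordering at the source seems preferable.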

Apologies if I’m ranting nonsense, I’ve only been poking at the API for a couple of nights.

Edit: It’s been brought to my attention that the following search gives a better result:

http://musicbrainz.org/ws/2/release-group/?query=artist:nirvana%20AND%20releasegroup:"nevermind"

But the above question still stands, just to a lesser extent I guess: shouldn’t the query match the document more accurately?

chirlu · January 23, 2017, 3:20am

Here are some earlier topics regarding Lucene scoring:

Richard_Dunn · January 23, 2017, 11:20am

Thanks chirlu,

These are indeed all referring to the same issue, but nobody in those threads seems to be considering the root of the problem, just workarounds.

I might fire up a server and take a look; I’m either going to have to write client-side code to process the results, or fix the results at the source.

Edit: I still haven’t nailed down the weight associated with document length normalisation to address this issue exactly, as the Lucene scoring algorithm is new to me. However, it appears that the DefaultSimilarity class has been deprecated and that a more conventional BM25 implementation is now recommended/required for newer releases of Lucene. Someone mentioned in the IRC channel that there may be some plans to move to Solr; I think DefaultSimilarity will have to be replaced in order to do so. Has anyone been looking into this issue?
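For anyone unfamiliar with BM25, this is a sketch of the textbook per-term formula, not Lucene's actual code. The function name and the example numbers (collection size, document frequency, field lengths) are invented for illustration; k1 = 1.2 and b = 0.75 are the commonly used defaults:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term against one document field."""
    # Rarer terms get a higher inverse document frequency.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly fields longer than average are penalised.
    length_factor = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_factor)

# Made-up numbers: the term occurs once in a 1-token title and once in a 4-token title.
short_title = bm25_term_score(tf=1, doc_len=1, avg_doc_len=3, num_docs=1000, doc_freq=10)
long_title = bm25_term_score(tf=1, doc_len=4, avg_doc_len=3, num_docs=1000, doc_freq=10)

print(f"short title: {short_title:.3f}, long title: {long_title:.3f}")
```

With the same term frequency, the shorter field scores higher, and setting b to 0 turns the length penalty off entirely, so the strength of the normalisation is at least an explicit knob.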

P.S. BM25 also seems to be more performant:

dns_server · January 24, 2017, 9:54pm

I believe a student participated in GSoC to implement a Solr-based search server.

I am not sure how far the implementation got and if it needs more work before it can be used.

Richard_Dunn · January 25, 2017, 2:24am

Thanks DNS, just to be clear though, I’m not currently interested in helping with the whole migration to Solr (I simply don’t have the time for the next few months), I’m solely interested in improving the search results ordering. I just happened to read that the relevant classes were deprecated and not supported by Solr, so I’m mentioning it as I’ve also heard that people are considering this migration in the future.

That said, the repo you provided could prove very useful for the task I’m looking into, so cheers for that!

Freso · January 25, 2017, 11:57am

It is not fully implemented. One of @Gentlecat’s current official tasks is to try and get it Actually Working™ so we can finally make the switch.