GSoC 2025 Application: BookBrainz - Use Solr Search Server

Contact Information:

Nickname: Michael

IRC Nick: michael-w1

Email: [Redacted]

GitHub: michael-w1

High-level overview of how search works:
The user inputs text into the search field → the text is filtered (converted to lowercase, ignoring accents) → it is converted to tokens by splitting on whitespace and punctuation → certain tokens such as stop words are filtered out if needed.

I will look into reproducing the analysis settings currently used in Elasticsearch in Solr.

The search query is sent to the Elasticsearch index to find results. If the index has not been generated, the data is pulled from the SQL database and the index is created. To generate the index, the code pulls chunks of 50k records from the database and takes relationships in the data into account, such as the author of a work. I will need to look into doing the same thing in Solr; a rough sketch of this chunked loop follows.
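
As an illustration only, the chunked re-indexing could look like the sketch below. The names fetchEntityChunk and addDocuments are hypothetical placeholders (for the database query and the search-server client), not actual BookBrainz functions.

// Rough sketch of chunked index generation; fetchEntityChunk and addDocuments
// are hypothetical placeholders, not existing BookBrainz code.
const CHUNK_SIZE = 50000;

async function generateIndex(
	fetchEntityChunk: (offset: number, limit: number) => Promise<object[]>,
	addDocuments: (docs: object[]) => Promise<void>
): Promise<void> {
	let offset = 0;
	for (;;) {
		// Pull one chunk of records, already joined with related data
		// (for example, the authors of each work)
		const chunk = await fetchEntityChunk(offset, CHUNK_SIZE);
		if (chunk.length === 0) {
			break;
		}
		// Send the denormalized documents to the search server
		await addDocuments(chunk);
		offset += CHUNK_SIZE;
	}
}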

Proposal Timeline

May 8 - June 1 - Community Bonding Period

  • Get to know the mentors
  • Read documentation on Elasticsearch and the Solr search server
  • Get familiar with build and testing process

12 Week Project Period

Week 1 - June 2-9 - Initial Research

  • Look at the BookBrainz source code that uses Elasticsearch
  • Note the current data schema for the books
  • Review how other MetaBrainz websites use Solr

Week 2 - June 9-16 - Solr Setup and Configuration

  • Set up a Solr development environment
  • Create a Solr development instance and configure it for search queries

Week 3 - June 16-23 - Design Solr Schema

  • Analyze the data structure for all entities (authors, works, etc.)
  • Design Solr Schema that supports multi-entity search

Week 4 - June 23-30 - Develop Multi-Entity Search Model and Indexing

  • Implement a prototype Solr schema to support multi-entity search
  • Begin indexing data for different entity types
  • Test that the multi-entity search works in Solr with the sample data
  • Make sure that the schema supports combined queries across multiple entities

Week 5 - June 30-July 7 - Integrate with the Website

  • Adapt search logic and integrate with website
  • Update backend logic to use Solr instead of Elasticsearch
  • Update website routes to work with Solr

Week 6 - July 7-14 - Data Migration and Indexing

  • Migrate data from Elasticsearch to Solr and ensure that it is indexed correctly
  • Test the indexing process for different entities

July 14 - Submit midterm evaluation

Week 7 - July 14-21 - Finalize Multi-Entity Search Functionality

  • Finalize the multi-entity search feature to ensure all entity types are handled in one query
  • Ensure that Solr can return results from authors, works, editions, etc.

Week 8 - July 21-28 - Frontend Integration and Testing

  • Integrate Solr-based search into the frontend
  • Make sure the UI is working with Solr’s multi-entity search

Week 9 - July 28-August 4 - Performance Optimization and User Testing

  • Research whether any performance optimizations can be made
  • Test that search is working as expected
  • Test if search can handle large datasets and heavy usage

Week 10 - August 4-11 - Refinements, Deployment and Documentation

  • Finalize the search server integration and prepare for production deployment
  • Document the search system and changes made to the code
  • Deploy to production and monitor for any issues

2-week buffer for any unexpected delays

Week 11 - August 11-18

Week 12 - August 18-25

  • Submit Project

Community Affinities
Some music that I listen to: bd5bd7cf-cb0e-4848-87e2-df54084a3286

  • What interests me the most about BookBrainz is that it offers valuable data that is easily accessible to anyone. With the code being open source, the project is also community driven.
  • I’ve tested out BookBrainz search quite a few times

Programming Precedents

  • Started programming in 2018
  • I have not contributed to any open source project
  • Some of my projects include:
  • Full-stack websites using React, TypeScript and Express with REST APIs
  • Programming Language Interpreter using Golang

Practical Requirements

I will be using a MacBook Pro (M1 Pro, 16 GB RAM) for the project

  • I am able to spend 40 hours per week on this project

Hi @michael-w1, considering there aren’t any technical details in your proposal, it is hard for me to give you any feedback, other than that it is not a complete proposal.

For reference, you should already be part of the way through the items from the community bonding period, ideally having already got to know the codebase by fixing some bugs or looking at some issues, and having your local development setup ready.
Similarly, week 1 seems like the sort of preparatory work I would expect during the community bonding period.


Hi @mr_monkey,

Thanks for the feedback. I’ve added some more technical details about the process.
The current indexMappings code for Elasticsearch:

const indexMappings = {
	mappings: {
		_default_: {
			properties: {
				aliases: {
					properties: {
						name: {
							fields: {
								autocomplete: {
									analyzer: 'edge',
									type: 'text'
								},
								search: {
									analyzer: 'trigrams',
									type: 'text'
								}
							},
							type: 'text'
						}
					}
				},
				authors: {
					analyzer: 'trigrams',
					type: 'text'
				},
				disambiguation: {
					analyzer: 'trigrams',
					type: 'text'
				}
			}
		}
	},
	settings: {
		analysis: {
			analyzer: {
				edge: {
					filter: [
						'asciifolding',
						'lowercase'
					],
					tokenizer: 'edge_ngram_tokenizer',
					type: 'custom'
				},
				trigrams: {
					filter: [
						'asciifolding',
						'lowercase'
					],
					tokenizer: 'trigrams',
					type: 'custom'
				}
			},
			tokenizer: {
				edge_ngram_tokenizer: {
					max_gram: 10,
					min_gram: 2,
					token_chars: [
						'letter',
						'digit'
					],
					type: 'edge_ngram'
				},
				trigrams: {
					max_gram: 3,
					min_gram: 1,
					type: 'ngram'
				}
			}
		},
		'index.mapping.ignore_malformed': true
	}
};

All the documents are indexed using the default mappings. This indexes information from author, edition, edition group, series, work, and publisher.

Currently in search, we have defined special cases for:

  1. Aliases
    • Special case where we allow autocomplete
  2. Authors
  3. Disambiguation

Otherwise, the code uses default indexing in Elasticsearch. We can replicate this with a dynamic field in Solr:

<dynamicField name="*" type="text_general" indexed="true" stored="true" multiValued="false"/>
  • If a field in the document does not match any of our static fields, this default will be used to index it.

With ‘index.mapping.ignore_malformed’: true, we are skipping fields in documents where there is a type mismatch. There is no direct equivalent in Solr, but since we are pulling data from the SQL database (with defined types), this should not be an issue. Also, all the types are strings.

A draft of the Solr schema and field types might look something like this:

<schema name="example" version="1.6"> <!-- "version" is the schema syntax version, not the Solr release -->
	<uniqueKey>id</uniqueKey>

	<field name="id" type="string" indexed="true" stored="true" required="true"/>
	<field name="aliases" type="text_general" indexed="true" stored="true" multiValued="true"/>
	<field name="aliases.name" type="text_general" indexed="true" stored="true"/>
	<field name="aliases.name.autocomplete" type="text_autocomplete" indexed="true" stored="false"/>
	<field name="aliases.name.search" type="text_trigrams" indexed="true" stored="false"/>
	<field name="authors" type="text_trigrams" indexed="true" stored="true" multiValued="true"/>
	<field name="disambiguation" type="text_trigrams" indexed="true" stored="true"/>

	<!-- Catch-all for any document field not declared above -->
	<dynamicField name="*" type="text_general" indexed="true" stored="true" multiValued="false"/>

	<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
		<analyzer>
			<!-- EdgeNGramTokenizerFactory only takes minGramSize/maxGramSize;
			     there is no direct equivalent of Elasticsearch's token_chars setting -->
			<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="10"/>
			<filter class="solr.LowerCaseFilterFactory"/>
			<filter class="solr.ASCIIFoldingFilterFactory"/>
		</analyzer>
	</fieldType>

	<fieldType name="text_trigrams" class="solr.TextField" positionIncrementGap="100">
		<analyzer>
			<tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="3"/>
			<filter class="solr.LowerCaseFilterFactory"/>
			<filter class="solr.ASCIIFoldingFilterFactory"/>
		</analyzer>
	</fieldType>

	<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
		<analyzer>
			<tokenizer class="solr.StandardTokenizerFactory"/>
			<filter class="solr.LowerCaseFilterFactory"/>
			<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
			<filter class="solr.ASCIIFoldingFilterFactory"/>
		</analyzer>
	</fieldType>
</schema>
  • Added positionIncrementGap="100" since 100 is the Elasticsearch default when not specified.

With the general schema defined, we will need to query the database and create JSON objects to add documents to the Solr server.

To convert the relational data from SQL into a Solr document, it has to be denormalized into a flat structure. The code currently does this for Elasticsearch: for example, it finds the authors of every work and adds them to the work’s document, so all of this information is self-contained in one place for indexing.

For Solr, we can add documents formatted as flat JSON objects. Each document will use the entity’s BBID if it exists for that entity type; otherwise it will use its entity id. This id will also be used to delete a document from the index if needed. Each object will then be sent as an HTTP POST request, as sketched below.
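
A minimal sketch of adding documents, assuming Solr runs locally with a core named "bookbrainz" (the URL, core name, and field names are placeholders):

// Minimal sketch; SOLR_UPDATE_URL and the core name are assumptions.
const SOLR_UPDATE_URL = 'http://localhost:8983/solr/bookbrainz/update?commit=true';

interface SolrDocument {
	id: string;                // BBID, or the entity id as a fallback
	type: string;              // entity type, e.g. 'Work' or 'Author'
	[field: string]: unknown;  // remaining denormalized fields
}

async function addDocuments(docs: SolrDocument[]): Promise<void> {
	// Solr's /update handler accepts a plain JSON array of documents
	const response = await fetch(SOLR_UPDATE_URL, {
		body: JSON.stringify(docs),
		headers: {'Content-Type': 'application/json'},
		method: 'POST'
	});
	if (!response.ok) {
		throw new Error(`Indexing request failed with status ${response.status}`);
	}
}

Deleting by id can go through the same /update handler by posting a {"delete": {"id": "..."}} command in the request body.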

To query the Solr server, we can send an HTTP request.
The main use case of the BookBrainz search:

  • user → writes search term in search bar → searches all entities by default
  • There is a dropdown menu that allows searching for specific types

Once the Solr server is up, we can send HTTP requests like so:
curl "URL_TO_SOLR_SERVER/select?q=name:search_term&wt=json&rows=20"

The parameters are:

  • q = {field}:{what to search for}
  • fq = {field}:{value} (allows users to restrict the search to specific entity types)
  • wt = json (the response format we expect)
  • rows = 20 (the current Elasticsearch setup pulls 20 rows per page)
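
A backend query wrapper could then look something like the sketch below; the base URL, core name, and the "type" field used in fq are assumptions, not existing BookBrainz code.

// Sketch of a search query against Solr's /select handler;
// SOLR_SELECT_URL and the "type" field are placeholders.
const SOLR_SELECT_URL = 'http://localhost:8983/solr/bookbrainz/select';

async function searchSolr(term: string, entityType?: string, rows = 20) {
	const params = new URLSearchParams({
		q: `name:${term}`,
		rows: String(rows),
		wt: 'json'
	});
	// The entity-type dropdown maps to a filter query
	if (entityType) {
		params.set('fq', `type:${entityType}`);
	}
	const response = await fetch(`${SOLR_SELECT_URL}?${params.toString()}`);
	return response.json();
}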

I will need to restructure the handling of the response from the Solr server, since the response formats of Solr and Elasticsearch differ.

The output will look something like the example below, where status 0 means success and QTime is the query time in milliseconds. We would just parse the hits from docs and send them to the frontend.

{
  "responseHeader": {
    "status": 0,
    "QTime": 1
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "1",
        "title": "book title",
        "score": 1.234
      }
    ]
  }
}
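
Adapting the response handling could be as simple as pulling the documents out of response.docs instead of Elasticsearch's hits.hits (a sketch; the exact shape the frontend expects is not shown here):

// Sketch of extracting hits from a Solr select response;
// Elasticsearch keeps them under hits.hits[]._source instead.
interface SolrSelectResponse {
	responseHeader: {status: number; QTime: number};
	response: {
		numFound: number;
		start: number;
		docs: Array<Record<string, unknown>>;
	};
}

function extractHits(solrResponse: SolrSelectResponse): Array<Record<string, unknown>> {
	return solrResponse.response.docs;
}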

Thanks for updating your proposal @michael-w1

Don’t forget to submit the proposal on the GSOC website before tomorrow!


Thank you for the feedback @mr_monkey, I’ve just submitted my proposal on GSOC!
