GSoC 2026: Migration of ElasticSearch to Solr Search for BookBrainz

Contact Information

Name: Amaan Pathan

Email: eulerbutcooler@gmail.com

Github: eulerbutcooler

Timezone: UTC+5:30


Project Overview

Title: Migration of ElasticSearch to Solr Search for BookBrainz

Proposed Mentor: Monkey, Lucifer

Project Length: 350 hours

BookBrainz currently runs on ElasticSearch 5.6, which is both outdated and creates overhead for MetaBrainz by requiring the maintenance of two different search architectures. This project aligns BookBrainz with the search stack used by other MetaBrainz services by migrating from ElasticSearch to Solr, while keeping existing features intact and preserving multi-entity search.

All search and indexing logic lives in search.ts. The route layer, search.tsx, calls the functions exported from that file and requires no changes apart from response-object parsing. The database schema, frontend, and ORM layers are completely untouched. The migration is a contained rewrite of search.ts that replaces the ElasticSearch client calls with fetch() calls to Solr. Beyond that, only the schema configuration and Docker infrastructure need changes.

My Contributions

Merged PRs:

  1. PR#1235 - Replaced the problematic getong/elasticsearch-action@v1.2 with a native service container.
  2. PR#21 - Fixed a name mismatch in the BookBrainz installation docs to align them with the bookbrainz-site repo.
  3. PR#1254 - Fixed a typo in the index.js file: requireJS instead of requiresJS.

Open PRs:

  1. PR#1257 - Added native healthchecks in docker compose, removing the external dependency on waisbrot/wait image.

Experiments

After digging through numerous Stack Overflow discussions and the Solr docs, I found three ways of serving multi-entity search with Solr, whose collections are conventionally designed around a single document type.

I downloaded the latest BookBrainz SQL dump and wrote a simple streaming parser that follows the actual join chain in the BB schema:

entity → {type}_header → {type}_revision → {type}_data → alias_set__alias → alias

I then implemented and tested all three strategies using a Node.js script that creates three Solr collections, indexes 400 documents (200 authors + 200 works + filler), and runs 7 queries through each strategy side by side.

Option A - Single collection: All entity types in one collection with a type field. IDF computed across the whole pool. One request per search.

Option B - Cross-collection query: Per-type collections, queried together in one request via Solr’s comma syntax. IDF computed independently per collection.

Option C - Fan-out + Normalization: Per-type collections, queried in parallel, scores divided by each collection’s max score to produce a 0-1 scale, merged in Node.js.
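Option C's merge step can be sketched as follows. This is a minimal sketch with hypothetical document shapes (the actual experiment script read real Solr responses), but it shows exactly where the signal is lost: the top hit of every collection becomes 1.0.

```typescript
// Sketch of Option C's merge step (hypothetical types; the actual
// experiment script read real Solr response objects).
interface ScoredDoc {
  bbid: string;
  type: string;
  score: number;
}

// Normalize each collection's scores to a 0-1 scale by dividing by
// that collection's max score, then merge and sort descending.
function mergeNormalized(resultSets: ScoredDoc[][]): ScoredDoc[] {
  const normalized = resultSets.flatMap((docs) => {
    const max = Math.max(...docs.map((d) => d.score));
    return docs.map((d) => ({ ...d, score: max > 0 ? d.score / max : 0 }));
  });
  return normalized.sort((a, b) => b.score - a.score);
}
```

For example, a work with raw score 9.8 in the works collection and an author with raw score 4.1 in the authors collection both normalize to exactly 1.0, even though the work was the far stronger match.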

The two differentiating queries were:

  • “tolkien”:

Option B ranked a biography about Tolkien above Tolkien himself. The root cause is IDF skew: each collection computes IDF independently, so raw scores are not comparable across collections. The word "tolkien" appears in 4 of 200 works-collection documents (IDF ≈ 3.91) but only 2 of 200 authors-collection documents (IDF ≈ 4.61), and the biography, whose title is dominated by "tolkien", ends up with a raw score that outranks the author. Option A fixes this by computing IDF once across all 400 documents.
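The IDF figures above can be reproduced with the classic formula ln(N/df), where N is the collection size and df the number of documents containing the term (Lucene's BM25 variant differs slightly in formula, but skews the same way):

```typescript
// Classic IDF: ln(N / df), where N is the number of documents in a
// collection and df is the number of documents containing the term.
const idf = (n: number, df: number): number => Math.log(n / df);

// "tolkien" in the works collection: 4 of 200 documents
console.log(idf(200, 4).toFixed(2)); // 3.91
// "tolkien" in the authors collection: 2 of 200 documents
console.log(idf(200, 2).toFixed(2)); // 4.61
```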

  • “dune”:

Option C normalizes both collections to a 0-1 scale but destroys a different signal: it collapses the work "Dune" and the author "Frank Herbert" to identical 1.0 scores, erasing the meaningful information that the work is the much stronger match.

Conclusion:

Option B fails due to IDF skew: raw scores from different collections are computed with independent IDF statistics and cannot be compared directly. For "tolkien", which is rarer in the authors collection than in the works collection, this produces the absurd ranking of the biography above the author.

Option C attempts to fix the IDF skew by normalizing scores before merging, but this destroys the relative magnitude of the match: for the query "dune", both the author and the work collapse to the same score of 1.0.

Option A solves both problems by placing all entities into a single, unified Solr collection. IDF is computed over one shared pool, so there is no skew, and no score normalization is needed.

Architecture

Single Solr Collection

All entity types are indexed into a single BookBrainz Solr collection, with a type field to distinguish them. Type-specific searches use Solr's filter query (fq=type:Author), which does not affect relevance scoring and is cached separately from the main query.
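A minimal sketch of such a type-filtered query over HTTP, assuming a local Solr with a hypothetical `bookbrainz` collection and illustrative field weights (the real parameters will be finalized during implementation):

```typescript
// Sketch: type-filtered Solr query over HTTP. The collection name,
// URL, and qf weights are assumptions, not the final implementation.
const SOLR_URL = "http://localhost:8983/solr/bookbrainz";

async function searchByType(query: string, type?: string) {
  const params = new URLSearchParams({
    q: query,
    defType: "edismax",
    qf: "aliases_name^3 disambiguation",
  });
  // fq restricts the result set without contributing to relevance
  // scores, and Solr caches it independently of the main query.
  if (type) {
    params.append("fq", `type:${type}`);
  }
  const res = await fetch(`${SOLR_URL}/select?${params}`);
  const data = await res.json();
  return data.response.docs;
}
```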

Changes Required

  • src/common/helpers/search.ts - Full rewrite of internals - fetch() instead of ES Client.
  • docker-compose.yml - Solr container instead of ES containers.
  • solr/schema.xml - Field types, fields, copyField directives.
  • solr/solrconfig.xml - Request handlers, cache config.
  • src/server/routes/search.tsx - Response shape updated.

Solr Schema

Field Type Mapping

| ElasticSearch | Solr | Migration Notes |
| --- | --- | --- |
| aliases.name (nested text) | aliases_name (multi-valued text_general) | Flattened nested JSON objects into array |
| aliases.name.autocomplete (edge analyzer sub-field) | aliases_autocomplete via copyField | EdgeNGram 2-10, query-side keyword tokenizer |
| aliases.name.search (trigram sub-field) | aliases_search via copyField | NGram 1-3 for partial matching |
| authors (trigrams) | authors + authors_search via copyField | Stored as text_general, ngrams separately |
| disambiguation (trigrams) | disambiguation + disambiguation_search via copyField | Stored as text_general, ngrams separately |
| identifiers[].value (nested array) | identifiers_value (multi-valued string) | Flattened in getDocumentToIndex() |
| asciifolding + lowercase filters | ICUFoldingFilterFactory | Single filter, broader Unicode coverage |
| StandardTokenizerFactory | ICUTokenizerFactory | Proper CJK word segmentation |
| ES _type routing | type field + fq=type:{type} | Explicit field, filter at query time |
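A sketch of what the corresponding schema.xml fragment might look like. The field names follow the mapping above, but the type names and analyzer parameters are assumptions to be finalized with mentors (the ICU factories also require Solr's analysis-extras module):

```xml
<!-- Sketch: hypothetical schema.xml fragment. Type names and analyzer
     parameters are assumptions pending mentor review; text_edge_ngram
     would be defined similarly with an EdgeNGram filter. -->
<fieldType name="text_icu" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<field name="type" type="string" indexed="true" stored="true"/>
<field name="aliases_name" type="text_icu" indexed="true" stored="true" multiValued="true"/>
<field name="aliases_autocomplete" type="text_edge_ngram" indexed="true" stored="false" multiValued="true"/>

<copyField source="aliases_name" dest="aliases_autocomplete"/>
```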

Indexed Document Shape (Example)

```json
{
  "bbid": "ef212883-aba1-4ba8-9181-3d15d2aa0394",
  "type": "Author",
  "aliases_name": ["J.R.R. Tolkien", "Tolkien", "John Ronald Reuel Tolkien"],
  "disambiguation": "Author of The Lord of the Rings",
  "identifiers_value": ["0000000121463862"]
}
```

Implementation

In search.ts:

| Function | What changes |
| --- | --- |
| init() | Remove ElasticSearch.Client. Store the Solr URL as _solrUrl. Ping via fetch(${_solrUrl}/admin/ping). Check the collection via the Collections API. |
| getDocumentToIndex() | Flatten: aliases into aliases_name: [names], identifiers into identifiers_value: [values]. |
| _fetchEntityModelsForESResults() | Rename to _fetchEntityModelsForSolrResults(). Read response.docs[] (flat) instead of hits.hits[]._source. PostgreSQL fetch logic remains unchanged. |
| _searchForEntities() | Replace _client.search(dslQuery) with fetch(${_solrUrl}/select?${params}). Read data.response.numFound for the total. |
| searchByName() | Replace the ES multi_match DSL with eDisMax params: qf=aliases_name^3 aliases_search disambiguation_search identifiers_value, mm=80%. The type filter becomes fq=type:{type}. |
| autocomplete() | Replace match: { aliases.name.autocomplete } with eDisMax qf=aliases_autocomplete. The BBID shortcut becomes q=bbid:"<uuid>". |
| indexEntity() | Replace _client.index({body, id, index, type}) with fetch(${_solrUrl}/update/json/docs?commit=true, {body: [doc]}). |
| deleteEntity() | Replace _client.delete({id}) with fetch(/update?commit=true, {body: {delete: {id}}}). |
| refreshIndex() | Replace _client.indices.refresh() with POST /update?commit=true. |
| generateIndex() | All Postgres fetch logic unchanged. Replace index create/delete with the Collections API. Replace the bulk POST with /update/json/docs. |
| _bulkIndexEntities() | Replace the alternating ES bulk format with a plain JSON array to /update/json/docs. Keep the 429 retry logic. |
| sanitizeEntityType() | allEntities → null (no fq filter). All other types → fq=type:{type}. |
| checkIfExists() | No changes. |
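The planned update calls can be sketched as follows. This is a sketch under the assumptions above: the URL and collection name are placeholders, and error handling is elided.

```typescript
// Sketch of the planned fetch()-based update calls. The local URL and
// "bookbrainz" collection name are assumptions; error handling elided.
const SOLR = "http://localhost:8983/solr/bookbrainz";

// Add or replace a document. /update/json/docs accepts plain JSON
// documents (no ES-style alternating action/source lines), and
// commit=true makes the change immediately visible.
async function indexEntity(doc: { bbid: string; [k: string]: unknown }) {
  await fetch(`${SOLR}/update/json/docs?commit=true`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(doc),
  });
}

// Delete by document id via the generic /update endpoint.
async function deleteEntity(bbid: string) {
  await fetch(`${SOLR}/update?commit=true`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ delete: { id: bbid } }),
  });
}
```

Hard commits on every request are fine for correctness during the migration; tuning soft commits and commitWithin can follow once the end-to-end flow works.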

Timeline

| Period | Work |
| --- | --- |
| Community Bonding (May 1–24) | Finalize schema with mentor feedback. Confirm Solr version and standalone vs SolrCloud preference for dev. |
| Week 1–2 | Write schema.xml + solrconfig.xml. Add Solr to Docker. Verify the core starts; index sample data in the Admin UI. |
| Week 3–4 | Rewrite init(), getDocumentToIndex(), _bulkIndexEntities(), generateIndex(). Run a full re-index, verify document counts. |
| Week 5–6 | Rewrite searchByName(), autocomplete(), _fetchEntityModelsForSolrResults(). Manual search testing across all entity types. |
| Week 7 | Rewrite indexEntity(), deleteEntity(), refreshIndex(). Verify real-time updates work end-to-end. |
| Week 8 | Update search.tsx response parsing. Remove @elastic/elasticsearch from package.json. |
| Midterm | All search functionality and single-entity indexing work end-to-end on a standalone Solr container. Code is clean and passes basic tests. |
| Week 9–10 | Test suite updates and fixes. Integration tests against real Solr. Manual QA of all search features. |
| Week 11 | SolrCloud setup with ZooKeeper in Docker. Verify the application works without code changes. Document the setup. |
| Week 12 | Buffer: edge cases, error handling, performance sanity check (re-index time, ES vs Solr). PR polish. |
| Final submission | Code cleanup, final documentation for dev setup, and bringing the Solr infrastructure to a production-ready state. |

Stretch goals:

  • Faceted search (entity-type counts in results)
  • Spellcheck via Solr’s /spell component
  • Benchmark report comparing ES 5 and Solr 9

Detailed Information About Yourself

My name is Amaan Pathan. I’m in the final year of my bachelor’s in Computer Science and Engineering at the University of Lucknow. I have worked across the backend spectrum, from simple CRUD apps to queues and RPC.

Community Affinities

What type of books do you read? (Please list a series of BBIDs as examples)

I read a variety of genres, especially thrillers and classics. My latest read was The Nose by Gogol, and before that I read Kafka on the Shore by Murakami and The Idiot by Dostoyevsky. My favorite author is Agatha Christie, and my favorite work of hers is The ABC Murders.

What type of music do you listen to? (Please list a series of MBIDs as examples)

My music ranges all the way from Sufi to rock. Some of my favorite songs are Gulon Mein Rang Bhare by Mehdi Hassan, Jesus of Suburbia by Green Day, and Heaven or Las Vegas by Cocteau Twins. I love discovering new music, so I’m always open to recommendations.

What aspects of BookBrainz interest you the most?

I find it really cool that BookBrainz keeps all this information open and is always open to contributions. I even revised/added one title I couldn’t find on the site: White Nights by Dostoyevsky.

Have you ever used MusicBrainz Picard to tag your files or used any of our projects in the past?

I hadn’t before joining this project, but I am excited to explore it both as a developer and as a user.

Programming precedents

When did you first start programming?

I started in 4th grade by making simple apps in QBasic. I even tried my hand at C, but my brain wasn’t developed enough at the time. Then in high school, I started again with Python. More recently I’ve been working with JavaScript and Go.

Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?

I contributed to bluewave-labs back in my 2nd year of college, but they seem to have deleted the guidefox repo or made it private since it is a graduated project now. A couple of my PRs were merged: PR346, PR497. Since the PRs are unavailable, here’s a short summary of what I added:

  • Added schema-based input validations using Zod on both frontend and backend
  • Introduced Data Transfer Objects (DTOs) for clean separation of request logic and validation
  • Refactored conditional logic in React components with Zod-powered form schema validation
  • Collaborated on open-source PRs as part of a modular codebase under industry-standard practice

What sorts of programming projects have you done on your own time?

  • I recently built a Zapier-like automation engine, hermes, with Discord, Slack, HTTP-request, and email integrations. I wrote it in Go to better understand goroutines and building scalable backend systems, and added workflow chaining and payload passing. This is the final-year project for my bachelor’s.
  • I’ve also built an HTTP server from scratch in Go - HTTP from scratch. It includes RFC-compliant request-body parsing, chunked transfer encoding, and a TCP server with concurrent connection handling.
  • I also made a fun personal project, wikisillygoose. It returns a noteworthy or popular landmark closest to a user-clicked point on an Earth map, along with the associated Wikipedia page. I built it using Three.js and Next.js.

Practical requirements

What computer(s) do you have available for working on your SoC project?

I have a Xiaomi notebook with 16 GB of RAM, an Intel i5 (11th gen), and Intel integrated graphics. I’m running Ubuntu 24.04 LTS.

How much time do you have available per week, and how would you plan to use it?

I plan on working 35-40 hours a week, as I graduate before the summer. I can code 5-6 hours a day.
