GSoC 2026: Use Solr Search Server - Migration from ES to Solr

Contact Information

Name: Amaan Pathan

Email: eulerbutcooler@gmail.com

Github: eulerbutcooler

Timezone: UTC+5:30


Project Overview

Title: Use Solr search server

Proposed Mentor: Monkey, Lucifer

Project Length: 350 hours

BookBrainz currently runs on Elasticsearch 5.6, which is both outdated and a maintenance burden, since MetaBrainz has to support two different search architectures. This project aligns BookBrainz with the search stack used by the other MetaBrainz services by migrating from Elasticsearch to Solr, while keeping the existing features intact and preserving multi-entity search.

All search and indexing logic lives in search.ts. The route layer, search.tsx, calls the functions exported from that file and needs no changes beyond response-object parsing. The database schema, frontend and ORM layers are completely untouched. The migration is a contained rewrite of search.ts that replaces the Elasticsearch client calls with fetch() calls to Solr. Beyond that, only the schema configuration and the Docker infrastructure need changes.

My Contributions

Merged PRs:

  1. PR#1235 - Replaced the problematic getong/elasticsearch-action@v1.2 with a native service container.
  2. PR#21 - Fixed a name mismatch in the BookBrainz installation docs to align them with the bookbrainz-site repo.
  3. PR#1254 - Fixed a typo in index.js file - requireJS instead of requiresJS.

Open PRs:

  1. PR#1257 - Added native healthchecks in docker compose, removing the external dependency on waisbrot/wait image.

Experiments

After digging through numerous Stack Overflow discussions and the Solr docs, I found three ways of serving multi-entity search with Solr, which is designed around single-type queries.

Methodology

To ensure the experiment reflected real-world conditions, I recreated a stripped-down BookBrainz indexing pipeline locally.

First, I spun up a Solr 9 Docker container. Since TF-IDF (Term Frequency–Inverse Document Frequency) scoring is sensitive to data distribution, I built a controlled dataset of 400 documents (200 authors and 200 works) using real entity names and titles (Tolkien, Shakespeare, Dune, Good Omens, etc.) with procedural filler to generate realistic IDF statistics. I indexed these into three Solr collections (bb_exp_single, bb_exp_authors and bb_exp_works) and ran identical eDisMax (Extended DisMax) search queries against all three simultaneously.

eDisMax is Solr’s recommended query parser that supports multi-field searching with per-field boosting, flexible minimum-match rules, and phrase proximity scoring.

TF-IDF is the classical relevance scoring model behind both Elasticsearch and Solr (recent Lucene versions default to BM25, a refinement of TF-IDF, but both rely on the same two signals). TF (Term Frequency) measures how often a search term appears in a document: more occurrences = higher score. IDF (Inverse Document Frequency) measures how rare a term is across the entire collection: rarer terms = higher score. The product of these two values determines a document’s relevance score for a given query.
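As a minimal sketch of that scoring intuition, here is textbook TF-IDF in TypeScript. The functions and toy corpus are illustrative only; Lucene's production formula adds smoothing and length normalization, and modern Lucene defaults to BM25.

```typescript
// Classic TF-IDF, simplified for illustration. Modern Lucene (and thus
// Solr and Elasticsearch) defaults to BM25, a refinement of TF-IDF,
// but both rely on the same TF and IDF signals.

function tf(term: string, doc: string[]): number {
    return doc.filter((t) => t === term).length;
}

function idf(term: string, docs: string[][]): number {
    const df = docs.filter((d) => d.includes(term)).length;
    return Math.log(docs.length / df); // rarer term => larger IDF
}

// Toy corpus: "tolkien" appears in 1 of 4 documents, "dune" in 2 of 4,
// so "tolkien" carries the higher IDF weight.
const corpus = [
    ["tolkien", "biography"],
    ["hobbit", "adventure"],
    ["dune", "desert"],
    ["dune", "messiah"],
];

console.log(idf("tolkien", corpus) > idf("dune", corpus)); // true
console.log(tf("dune", ["dune", "dune", "messiah"])); // 2
```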

const EDISMAX = {
  defType: "edismax",
  qf: "aliases_name^3 disambiguation^1.5 author_names", // query fields, ^N sets boost weight
  mm: "75%", // minimum-match
  rows: "10", // results to return
  fl: "id,type,aliases_name,disambiguation,author_names,score", // fields included in response
};

Option A - Single collection: All entity types in one collection with a type field. IDF computed across the whole pool. One request per search.

Option B - Cross-collection query: Per-type collections, queried together in one request via Solr’s comma syntax. IDF computed independently per collection.

Option C - Fan-out + Normalization: Per-type collections, queried in parallel, scores divided by each collection’s max score to produce a 0-1 scale, merged in Node.js.
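Option C’s merge step can be sketched as follows. The types, helper names and scores here are illustrative, not the final BookBrainz implementation:

```typescript
// Sketch of Option C's fan-out merge step.

interface ScoredDoc {
    id: string;
    score: number;
}

// Divide every score by the collection's max score to get a 0-1 scale.
function normalize(docs: ScoredDoc[]): ScoredDoc[] {
    const max = Math.max(...docs.map((d) => d.score));
    return docs.map((d) => ({...d, score: d.score / max}));
}

// Merge normalized result lists from the parallel per-type queries.
function merge(...lists: ScoredDoc[][]): ScoredDoc[] {
    return lists.flatMap(normalize).sort((a, b) => b.score - a.score);
}

// The top hit of every collection lands at exactly 1.0 - the signal
// loss that shows up in the "dune" query.
const works = [{id: "work:dune", score: 8.2}];
const authors = [{id: "author:herbert", score: 2.1}];
console.log(merge(works, authors).map((d) => d.score)); // [1, 1]
```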

The 7 test queries I tried were:

  1. “tolkien” - There is an author as well as a book containing this name.
  2. “shakespeare” - Same as tolkien; exists in the Authors as well as the Works collection.
  3. “martin” - To test ambiguous partial matches (G.R.R. Martin vs other authors named Martin).
  4. “harry potter” - To test multi-word matches where the phrase only exists in the Works collection.
  5. “lord of the rings” - To test how well eDisMax handles long, multi-word titles and whether it successfully drops Authors down the list.
  6. “good omens” - To test a book with multiple authors to ensure the author_names array field boosts correctly.
  7. “dune” - To test a single-word Work title that has no overlap with the author’s name (Frank Herbert).

The two differentiating queries were:

  • “tolkien”:

Option B ranked a biography about Tolkien above Tolkien himself. The reason was IDF skew: the word “tolkien” appears in 4 of 200 works-collection documents (IDF ≈ 3.91, meaning it’s moderately common in that collection) but in only 2 of 200 authors-collection documents (IDF ≈ 4.61, meaning it’s rarer there and thus gets a higher weight). The biography, where “tolkien” dominates the title, gets an inflated score from the works IDF. Option A fixes this by computing IDF once across all 400 documents.
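The quoted IDF figures can be sanity-checked with the textbook idf = ln(N / df) formula. Lucene’s actual formula adds smoothing terms, so treat these as close approximations:

```typescript
// Sanity check of the IDF figures quoted above.
// N = documents in the collection, df = documents containing the term.
const idf = (N: number, df: number): number => Math.log(N / df);

console.log(idf(200, 4).toFixed(2)); // "3.91" - "tolkien" in the works collection
console.log(idf(200, 2).toFixed(2)); // "4.61" - "tolkien" in the authors collection
```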

  • “dune”:

Option C normalizes both collections to a 0-1 scale but destroys a different signal: it collapses the work “Dune” and the author “Frank Herbert” to identical 1.0 scores, erasing the meaningful information that the work is a much stronger match.

Conclusion:

Option B fails due to IDF skew when comparing raw scores from different collections. When “tolkien” is searched, it is rarer in the authors collection than in the works collection, and hence we get the absurd ranking of the biography above the author.

Option C attempts to fix the IDF skew by normalizing the scores before merging them, but this destroys the relative magnitude of the match: for the query “dune”, it normalizes both the author and the work to the same score of 1.0.

Option A solves both of these problems by placing all entities into a single, unified Solr core. Because all entities share one collection’s statistics, the IDF skew disappears, and no score normalization is needed. Moreover, BookBrainz has around 200k entities, which is not a large enough dataset to justify multiple Solr collections, so a single collection suffices.

Here’s a link to reproduce the experiment - BookBrainz Solr Search Strategies Experiment

Architecture

Single Solr Collection

All entity types are indexed into a single BookBrainz Solr collection. A type field distinguishes them. Type-specific searches use Solr’s filter query (fq) parameter which is a secondary query that restricts results without affecting relevance scores. It is applied post-scoring and cached separately by Solr for performance.
If BookBrainz ever grows to hundreds of millions of entities, we can shift to multiple Solr collections; the proposed architecture doesn’t lock BookBrainz into a single collection.
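To illustrate, a type-filtered request against the single collection could look like this. The collection name, host and boosts are assumptions for this sketch:

```typescript
// Illustrative type-filtered eDisMax query against a single collection.
const params = new URLSearchParams({
    defType: "edismax",
    q: "tolkien",
    qf: "aliases_name^3 disambiguation^1.5",
    // fq restricts results to Authors without affecting relevance
    // scores; Solr caches the filter independently of q.
    fq: "type:Author",
    rows: "10",
});

const url = `http://localhost:8983/solr/bookbrainz/select?${params}`;
console.log(url);
```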

Changes Required

  • src/common/helpers/search.ts - Full rewrite of internals - fetch() instead of ES Client.
  • docker-compose.yml - Solr container instead of ES containers.
  • solr/schema.xml - Field types, fields, copyField directives.
  • solr/solrconfig.xml - Request handlers, cache config.
  • src/server/routes/search.tsx - Response shape updated.

Solr Schema

Field Type Mapping

| Elasticsearch | Solr | Migration notes |
| --- | --- | --- |
| aliases.name (nested text) | aliases_name (multi-valued text_general) | Flattened nested JSON objects into an array |
| aliases.name.autocomplete (edge analyzer sub-field) | aliases_autocomplete via copyField | Edge NGram 2-10, query-side keyword tokenizer |
| aliases.name.search (trigram sub-field) | aliases_search via copyField | NGram 1-3 for partial matching |
| authors (trigrams) | authors + authors_search via copyField | Stored as text_general, ngrams separately |
| disambiguation (trigrams) | disambiguation + disambiguation_search via copyField | Stored as text_general, ngrams separately |
| identifiers[].value (nested array) | identifiers_value (multi-valued string) | Flattened in getDocumentToIndex() |
| asciifolding + lowercase filters | ICUFoldingFilterFactory | Single filter, broader Unicode coverage |
| StandardTokenizerFactory | ICUTokenizerFactory | Proper CJK word segmentation |
| ES _type routing | type field + fq=type:{type} | Explicit field, filter at query time |

Solr’s copyField directive automatically copies data from a source field into a destination field at index time, allowing the same content to be analyzed differently (e.g. once for full-text search, once for autocomplete) without duplicating it in the source document.

Schema:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="bookbrainz" version="1.6">

<!--
Replaces ES's default StandardAnalyzer + asciifolding + lowercase
ICUTokenizer handles CJK (Chinese, Japanese, Korean) word
segmentation properly
ICUFoldingFilter is a superset of asciifolding as it has a broader
unicode coverage
-->

<fieldType name="text_general" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<!--
text_edge_ngram replaces ES's edge analyzer (used for autocomplete)
Index-time: EdgeNGram(2,10) generates prefix tokens. ("Tolkien" ->
["To", "Tol", "Tolk"..]
Query-time: KeywordTokenizer keeps user input as one token. ("Tol"
as "Tol", hence matches the "Tol" ngram)
-->

<fieldType name="text_edge_ngram" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.EdgeNGramTokenizerFactory"
                 minGramSize="2" maxGramSize="10"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<!-- 
text_ngram replaces ES's trigram analyzer (used for fuzzy search)
Index-time: NGram(1,3) generates all substrings of length 1-3
("Tolkien -> ["T", "To", "Tol", "o", "ol", "olk", ...]
Query-time: ICUTokenizer splits normally so we don't generate
ngrams of the query itself, we match against indexed ngrams. 
-->

<fieldType name="text_ngram" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.NGramTokenizerFactory"
                 minGramSize="1" maxGramSize="3"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<!-- Fields -->

 <field name="bbid" type="string" indexed="true" stored="true"
         required="true"/>
 <field name="type" type="string" indexed="true" stored="true"/>
 <field name="aliases_name" type="text_general" indexed="true"
         stored="true" multiValued="true"/>
 <field name="aliases_autocomplete" type="text_edge_ngram" indexed="true"
         stored="false" multiValued="true"/>
 <field name="aliases_search" type="text_ngram" indexed="true"
         stored="false" multiValued="true"/>
 <field name="disambiguation" type="text_general" indexed="true"
         stored="true"/>
 <field name="disambiguation_search" type="text_ngram" indexed="true"
         stored="false"/>
 <field name="authors" type="text_general" indexed="true" stored="true"
         multiValued="true"/>
 <field name="authors_search" type="text_ngram" indexed="true"
         stored="false" multiValued="true"/>
 <field name="identifiers_value" type="string" indexed="true"
         stored="true" multiValued="true"/>
 <field name="_version_" type="plong" indexed="false" stored="false"/>


<!-- Copy Fields -->

<copyField source="aliases_name" dest="aliases_autocomplete"/>
<copyField source="aliases_name" dest="aliases_search"/>
<copyField source="disambiguation" dest="disambiguation_search"/>
<copyField source="authors" dest="authors_search"/>
<uniqueKey>bbid</uniqueKey>
</schema>

Since the Solr schema lives in a separate file, it can be version-controlled, reviewed and deployed independently of the application.

Implementation:

In search.ts :

| Function | What changes |
| --- | --- |
| init() | Remove Elasticsearch.Client. Store the Solr URL as _solrUrl. Ping via fetch(${_solrUrl}/admin/ping). Check the collection via the Collections API. |
| getDocumentToIndex() | Flatten: aliases into aliases_name: [names], identifiers into identifiers_value: [values]. |
| _fetchEntityModelsForESResults() | Rename to _fetchEntityModelsForSolrResults(). Read response.docs[] (flat) instead of hits.hits[]._source. PostgreSQL fetch logic remains unchanged. |
| _searchForEntities() | Replace _client.search(dslQuery) with fetch(${_solrUrl}/select?${params}). Read data.response.numFound for the total. |
| searchByName() | Replace the ES multi_match DSL with eDisMax params: qf=aliases_name^3 aliases_search disambiguation_search identifiers_value, mm=80%. The type filter becomes fq=type:{type}. |
| autocomplete() | Replace match: { aliases.name.autocomplete } with eDisMax qf=aliases_autocomplete. The BBID shortcut becomes q=bbid:"<uuid>". |
| indexEntity() | Replace _client.index({body, id, index, type}) with fetch(${_solrUrl}/update/json/docs?commit=true, {body: [doc]}). |
| deleteEntity() | Replace _client.delete({id}) with fetch(/update?commit=true, {body: {delete: {id}}}). |
| refreshIndex() | Replace _client.indices.refresh() with POST /update?commit=true. |
| generateIndex() | All Postgres fetch logic unchanged. Replace index create/delete with the Collections API. Replace the bulk POST with /update/json/docs. |
| _bulkIndexEntities() | Replace the alternating ES bulk format with a plain JSON array to /update/json/docs. Keep the 429 retry logic. |
| sanitizeEntityType() | allEntities → null (no fq filter). All other types → fq=type:{type}. |
| checkIfExists() | No changes. |

search.ts - Solr equivalent:

  • getDocumentToIndex()
// For ES this function converts the ORM model into a minimal JSON doc for indexing
// For Solr we flatten the nested structures into multi-valued fields

interface SolrDocument {
    bbid: string;
    type: string;
    aliases_name: string[];
    disambiguation: string;
    identifiers_value: string[];
    authors?: string[];
}

export function getDocumentToIndex(entity: any): SolrDocument {
    const entityType: IndexableEntities = entity.get('type');

    // Extract alias names as a flat string array
    let aliasNames: string[] = [];
    const aliasSet = entity.related('aliasSet')?.related('aliases');
    if (aliasSet) {
        aliasNames = aliasSet.map((alias) => alias.get('name')).filter(Boolean);
    } else {
    // Editors, Collections and Areas don't have the standard aliasSet structure hence we fallback to the 'name' attribute
        const name = entity.get('name');
        if (name) aliasNames = [name];
    }

    // Extract identifier values as a flat string array
    let identifierValues: string[] = [];
    const identifierSet = entity.related('identifierSet')?.related('identifiers');
    if (identifierSet) {
        identifierValues = identifierSet.map(id => id.get('value')).filter(Boolean);
    }

    const doc: SolrDocument = {
        bbid: entity.get('bbid') || entity.get('id'),
        type: entityType,
        aliases_name: aliasNames,          // multi-valued field
        disambiguation: entity.get('disambiguation') || '',
        identifiers_value: identifierValues // multi-valued field
    };

    // Only Works have the authors field
    if (entityType === 'Work') {
        doc.authors = entity.get('authors') || [];
    }

    return doc;
}

Output:

{
  "bbid": "ef212883-aba1-4ba8-9181-3d15d2aa0394",
  "type": "Author",
  "aliases_name": ["J.R.R. Tolkien", "Tolkien", "John Ronald Reuel Tolkien"],
  "disambiguation": "Author of The Lord of the Rings",
  "identifiers_value": ["0000000121463862"]
}
  • autocomplete()
export async function autocomplete(
    orm: ORM,
    query: string,
    type: IndexableEntitiesOrAll,
    size = 42
) {
    const params = new URLSearchParams();

    if (commonUtils.isValidBBID(query)) {
        // Direct BBID lookup
        params.set('q', `bbid:"${query}"`);
    } else {
        // Autocomplete via edge ngram field
        params.set('q', query);
        params.set('defType', 'edismax');
        params.set('qf', 'aliases_autocomplete');
        params.set('mm', '80%');
    }

    params.set('rows', String(size));
    params.set('wt', 'json');

    const sanitizedType = sanitizeEntityType(type);
    if (sanitizedType) {
        params.set('fq', `type:${sanitizedType}`);
    }

    const searchResponse = await _searchForEntities(orm, params);
    return searchResponse.results;
}
  • searchByName()
// aliases.name.search -> aliases_search (multi-field -> copyField)
// disambiguation -> disambiguation_search (NGram analyzed copy for fuzzy matching)
// identifiers.value -> identifiers_value (Flattened from nested array)
// authors -> authors_search (NGram analyzed copy)

export function searchByName(
    orm: ORM,
    name: string,
    type: IndexableEntitiesOrAll,
    size?: number,
    from?: number
) {
    const sanitizedType = sanitizeEntityType(type);

    const params = new URLSearchParams();
    params.set('q', name);
    params.set('defType', 'edismax');
    params.set('mm', '80%');
    params.set('start', String(from || 0));
    params.set('rows', String(size || 10));
    params.set('wt', 'json');
    params.set('fl', 'bbid,type,aliases_name,disambiguation,score');

    let qf = 'aliases_name^3 aliases_search disambiguation_search identifiers_value';

    const includesWork = !sanitizedType || sanitizedType === 'Work';
    if (includesWork) {
        qf += ' authors_search';
    }
    params.set('qf', qf);

    if (sanitizedType) {
        params.set('fq', `type:${sanitizedType}`);
    }

    return _searchForEntities(orm, params);
}
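To round out the function table above, here is a sketch of the update payloads that indexEntity() and deleteEntity() would send. The builder helpers are illustrative names introduced for this sketch; _solrUrl matches the module-level variable proposed for init(), and commit=true keeps the example simple where production code would likely batch commits with commitWithin.

```typescript
// Sketch of the Solr update payloads for real-time indexing/deletion.
const _solrUrl = "http://localhost:8983/solr/bookbrainz";

function buildIndexRequest(doc: object) {
    return {
        body: JSON.stringify([doc]), // Solr accepts a plain JSON array of docs
        url: `${_solrUrl}/update/json/docs?commit=true`,
    };
}

function buildDeleteRequest(bbid: string) {
    return {
        body: JSON.stringify({delete: {id: bbid}}),
        url: `${_solrUrl}/update?commit=true`,
    };
}

// Each would be POSTed via fetch(url, {method: "POST",
// headers: {"Content-Type": "application/json"}, body}).
console.log(buildDeleteRequest("some-bbid").body);
// {"delete":{"id":"some-bbid"}}
```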

Timeline

| Period | Work |
| --- | --- |
| Community Bonding (May 1–24) | Finalize schema with mentor feedback. Confirm Solr version and standalone vs SolrCloud preference for dev. |
| Week 1–2 | Write schema.xml + solrconfig.xml. Add Solr to Docker. Verify the core starts; index sample data in the Admin UI. |
| Week 3–4 | Rewrite init(), getDocumentToIndex(), _bulkIndexEntities(), generateIndex(). Run a full re-index, verify document counts. |
| Week 5–6 | Rewrite searchByName(), autocomplete(), _fetchEntityModelsForSolrResults(). Manual search testing across all entity types. |
| Week 7 | Rewrite indexEntity(), deleteEntity(), refreshIndex(). Verify real-time updates work end-to-end. |
| Week 8 | Update search.tsx response parsing. Remove @elastic/elasticsearch from package.json. |
| Midterm | All search functionality and single-entity indexing works end-to-end on a standalone Solr container. Code is clean and passes basic tests. |
| Week 9–10 | Test suite updates and fixes. Integration tests against real Solr. Manual QA of all search features. |
| Week 11 | SolrCloud setup with ZooKeeper in Docker. Verify the application works without code changes. Document the setup. |
| Week 12 | Buffer: edge cases, error handling, performance sanity check (re-index time comparison, ES vs Solr). PR polish. |
| Final submission | Code cleanup, final documentation for dev setup, and merging the Solr infrastructure into a production-ready state. |

Stretch goals:

  • Faceted search (entity type counts in results)
  • Spellcheck via Solr’s /spell component
  • Benchmark report comparing ES 5 and Solr 9.

Detailed Information About Yourself

My name is Amaan Pathan. I’m in the final year of my bachelor’s in Computer Science and Engineering at the University of Lucknow. I have worked across the backend, from simple CRUD apps to queues and RPC.

Community Affinities

What type of books do you read? (Please list a series of BBIDs as examples)

I read a variety of genres, especially thrillers and classics. My latest read was The Nose by Gogol and before that I had read Kafka On The Shore by Murakami and The Idiot by Dostoyevsky. My favorite author is Agatha Christie and my favorite work of hers is ABC Murders.

What type of music do you listen to? (Please list a series of MBIDs as examples)

My music taste ranges all the way from Sufi to rock. Some of my favorite songs are Gulon Mein Rang Bhare by Mehdi Hassan, Jesus of Suburbia by Green Day, and Heaven or Las Vegas by Cocteau Twins. I love discovering new music as well, so I’m always open to recommendations.

What aspects of BookBrainz interest you the most?

I find it really cool that BookBrainz has kept all this information open source and is always open to contributions. I even revised/added one title I couldn’t find on the site - White Nights by Dostoyevsky.

Have you ever used MusicBrainz Picard to tag your files or used any of our projects in the past?

I haven’t yet but I plan on using it since I will be shifting to offline media instead of Spotify.

Programming precedents

When did you first start programming?

I started in 4th grade by making simple apps in QBasic. I even tried my hand at C, but my brain wasn’t developed enough at the time. Then in high school, I started again with Python. More recently I’ve been working with JavaScript and Go.

Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?

I contributed to bluewave-labs back in my second year of college, but they seem to have deleted the guidefox repo or made it private, since it is a graduated project now. Here are a couple of my PRs that got merged - PR346, PR497. Since the PRs are unavailable, here’s a short summary of what I added -

  • Added schema-based input validations using Zod on both frontend and backend
  • Introduced Data Transfer Objects (DTOs) for clean separation of request logic and validation
  • Refactored conditional logic in React components with Zod-powered form schema validation
  • Collaborated on open-source PRs as part of a modular codebase under industry-standard practice

What sorts of programming projects have you done on your own time?

  • I recently built a Zapier-like automation engine - repo. I’ve added Discord, Slack, HTTP request and email integrations to the engine, along with workflow chaining and payload passing. I wrote it in Go to better understand goroutines and building scalable backend systems. This is the final-year project for my bachelor’s.
  • I’ve also built an HTTP server from scratch in Go - repo. It features RFC-compliant request body parsing, chunked transfer encoding, and a TCP server with concurrent connection handling.
  • I also made a fun personal project - site. It returns the noteworthy or popular landmark closest to the point the user clicks on a map of the Earth, along with the associated Wikipedia page. I built this using Three.js and Next.js.

Practical requirements

What computer(s) do you have available for working on your SoC project?

I have a Xiaomi notebook with 16GB of RAM, an Intel i5 (11th gen) and Intel integrated graphics. I’m running Ubuntu 24.04 LTS.

How much time do you have available per week, and how would you plan to use it?

I plan on working 35-40 hours a week, as I graduate before the summer. I can code 5-6 hours a day.


@mr_monkey @lucifer would love some feedback. thanks!

Hi, former GSoC student and occasional contributor to BB code here. Overall, I think this proposal shows that you’ve thought carefully about the problem, understand the options available, and have a clear path to implementing a solution. I appreciate that you’ve presented your proposal in a concise manner while still providing sufficient information to understand your approach and the challenges involved, even though I don’t have direct knowledge of Solr.

Though it’s not directly part of the proposal, it would be useful to provide the script you used for testing the different options so that the experiment could be recreated by someone else. I would also make sure that terms you use are explained (I’m not clear on what IDF stands for or what it signifies, for example) and that if you include code snippets, they’re of the highest possible quality (for example, the schema could do with formatting changes and uses // comments, which aren’t valid XML). That said, there’s sufficient technical detail here for me to feel confident that you understand the project and would be capable of completing it. The timeline is detailed, seems reasonable, and builds in buffer for polish, which I appreciate. There isn’t much feedback here, but hopefully it’s helpful, and good luck.


Thanks for the feedback @Leftmost_Cat ! I’ve added the gist for reproducing the experiment and have explained the terms used. I’ve also fixed the syntax in code.
Is there anything else I can improve?
