GSOC-2026 : Use Solr search server

Project Overview

Title: Use Solr search server
Proposed Mentor: Monkey, Lucifer
Project Length: 350 hours

Expected outcomes: A functional multi-entity search server with the same features as the existing search functionality

Other MetaBrainz projects use the Solr search server, while BookBrainz was built on Elasticsearch and has not evolved since. This creates extra overhead by running two separate search infrastructures and prevents us from optimizing resources.

For this project, you would entirely replace the search server infrastructure and adapt the existing search to work with Solr. This makes for a project relatively isolated from the rest of the website; the only surfaces of contact are this file, which handles most of the indexing and Elasticsearch-specific logic, and this file, which adds the website routes that allow users and the website to interact with the search server.

One relevant point of detail is that we want to maintain multi-entity search (search for authors, works, editions, etc. all in one go), unlike MusicBrainz search, for example, which requires selecting an entity type before performing a search. This would need to be investigated.

BookBrainz uses Elasticsearch 5.6 for search. MusicBrainz and other MetaBrainz projects already use Solr. Running two separate search backends means extra infrastructure and maintenance for the team. This project switches BookBrainz from ES to Solr.

The good news is that all ES interaction lives in one file → src/common/helpers/search.ts. Nothing else in the codebase touches Elasticsearch. So the migration comes down to: rewrite the internals of that file to talk to Solr over HTTP instead of through the @elastic/elasticsearch SDK, design a Solr schema that replicates the current indexing behavior, and swap the Docker container. Routes, frontend, ORM, database layer: none of that changes.

This is also a good opportunity to improve multilingual search. The current ES setup only uses asciifolding + lowercase, which does not handle CJK, Cyrillic, or Arabic text well. Solr’s ICU analysis, which MusicBrainz already uses, gives us proper Unicode-aware tokenization and cross-script normalization out of the box.

I have already built a working demo that runs on the actual BookBrainz website → tested with Japanese, Cyrillic, and Latin queries, as well as work-author relationship queries. The demo is up as a draft PR with a video.

Understanding the Current ES Architecture

Before explaining how I will migrate to Solr, it is worth walking through exactly how the current Elasticsearch setup works, because the whole migration is about replicating this behavior.

The whole ES layer lives in one file: src/common/helpers/search.ts (729 lines). It is the only file that imports @elastic/elasticsearch. It creates a client on startup and exports functions that the rest of the app uses. Nothing else knows about ES.

The key functions:

  • init() – runs at server startup, connects to ES, and auto-indexes everything if the index doesn’t exist yet.

  • searchByName() – the main search. Builds a multi_match query across aliases.name^3, aliases.name.search (trigram-analyzed), disambiguation, and identifiers.value. Uses cross_fields with minimum_should_match: '80%'. For work queries, it adds authors to the field list so searching an author name also finds their works.

  • autocomplete() – used by entity editor dropdowns. Queries aliases.name.autocomplete (EdgeNGram 2-10 chars) for prefix matching.

  • getDocumentToIndex() – flattens a Bookshelf.js model into a document with bbid, name, type, disambiguation, aliases, identifiers, and for works, authors.

  • indexEntity() / deleteEntity() – single-document CRUD, called when entities are created or edited.

  • _bulkIndexEntities() – bulk indexing with retry logic for ES’s 429 Too Many Requests.

  • generateIndex() – full reindex. Fetches entities from PostgreSQL in 50k chunks, attaches author names to works via relationship type 8 (“author wrote work”), and bulk-indexes. Also handles areas, editors, and collections.

  • checkIfExists() – the “does this exist?” check in entity editors. Doesn’t use ES at all – queries PostgreSQL directly.

The schema is defined as a JS object inside search.ts (not a separate file). Two custom analyzers:

  • edge – EdgeNGram (2-10 chars) + asciifolding + lowercase, for autocomplete

  • trigrams – NGram (1-3 chars) + asciifolding + lowercase, for partial matching

The route layer (src/server/routes/search.tsx) just calls exported functions from search.ts and returns results. It does not know what search backend is behind them – which is exactly why the migration is safe.

Important pattern: after getting results from ES, the code fetches full entities from PostgreSQL by bbid. The search index is just a lookup layer, not the source of truth. Same pattern applies with Solr.
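As a rough sketch of that pattern (helper names here are hypothetical, not the actual BookBrainz code):

```typescript
// Sketch of the "index as a lookup layer" pattern: the search server returns
// ranked bbids, and the full entities are fetched from PostgreSQL afterwards.
type SolrDoc = { bbid: string; type: string };
type Entity = { bbid: string; name: string };

async function hydrateResults(
    docs: SolrDoc[],
    fetchByBbid: (bbid: string) => Promise<Entity>
): Promise<Entity[]> {
    // fetch full entities in parallel while preserving the ranking order
    return Promise.all(docs.map((doc) => fetchByBbid(doc.bbid)));
}
```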

What changes in the migration:

  • ES client calls → fetch() to Solr HTTP

  • indexMappings JS object → schema.xml + solrconfig.xml

  • ES multi_match / cross_fields → Solr eDisMax with qf, mm, fq

  • ES response format (hits.hits._source) → Solr format (response.docs)

  • ES bulk format (alternating metadata/doc) → Solr JSON array

What stays the same: function signatures, route layer, ORM fetching, frontend, everything else.
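To make the bulk-format difference concrete, here is a hedged sketch (field names simplified, helper names illustrative, not the real migration code):

```typescript
// Illustrative contrast of the two bulk payload formats
type Doc = { bbid: string; name: string };

// ES bulk: newline-delimited JSON, alternating action metadata and document lines
function toEsBulkBody(index: string, docs: Doc[]): string {
    return docs
        .flatMap((doc) => [
            JSON.stringify({ index: { _id: doc.bbid, _index: index } }),
            JSON.stringify(doc)
        ])
        .join('\n') + '\n';
}

// Solr bulk: a single JSON array POSTed to /update
function toSolrBulkBody(docs: Doc[]): string {
    return JSON.stringify(docs);
}
```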

Migration Approach & Key Design Decisions

Single core, not multi-core

MusicBrainz uses separate Solr cores per entity type – artist, release, recording, etc. – because their search UI makes you pick an entity type first, they have millions of entities per type, and each type has different search fields.

BookBrainz is different. Multi-entity search is a core feature – users type a query and get authors, works, editions, publishers, series all in one list. The dataset is also much smaller – my rough guess is around 200k entities. A single core handles that easily.

If we went multi-core, every search would mean one HTTP request per core, then merging results and dealing with relevance scores that aren’t comparable across cores. Pagination would be painful. A single core avoids all this – one query, one response, Solr handles ranking. For type filtering, we just add fq=type:author as a filter query, which Solr caches separately from the main query.

The current ES setup already uses a single index. Matching that pattern keeps the migration straightforward.

Solr version: 9.7.0

Same version MusicBrainz runs in production (per mb-solr). If we hit anything version-specific we can ask the MB team directly.

No SDK, just fetch()

ES uses the @elastic/elasticsearch SDK. For Solr, I’m not adding a client library. Solr’s API is plain HTTP + JSON – GET /select for queries, POST /update for indexing. Node’s built-in fetch() is enough. The @elastic/elasticsearch dependency gets removed entirely.
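A minimal sketch of what such a fetch()-based helper could look like (buildSolrUrl and solrRequest are illustrative names, and the host/core values are assumptions):

```typescript
// Sketch of a no-SDK Solr helper built on Node's global fetch()
// (assumed host and core name; the real values come from config)
const SOLR_BASE = 'http://localhost:8983/solr/bookbrainz';

function buildSolrUrl(handler: string, params: Record<string, string | number>): string {
    const query = new URLSearchParams();
    for (const [key, value] of Object.entries(params)) {
        query.set(key, String(value));
    }
    return `${SOLR_BASE}${handler}?${query.toString()}`;
}

async function solrRequest(handler: string, params: Record<string, string | number>) {
    // Plain HTTP + JSON: GET /select for queries, POST /update for indexing
    const response = await fetch(buildSolrUrl(handler, params));
    if (!response.ok) {
        throw new Error(`Solr request failed: ${response.status}`);
    }
    return response.json();
}
```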

Keep trigram fields out of the default search path:

This was the biggest thing I learned from building the demo.

My schema defines name_trigram and alias_trigram fields (NGram 1-3 chars) for partial matching. Early on I tried adding these to eDisMax’s qf and pf parameters to get fuzzy matching on every search. The result: every query matched every document. Searching “tolkien” returned all 17 documents with roughly equal scores.

The reason: when trigram fields are in qf, Solr breaks the query into 1, 2, and 3-character grams – t, to, tol, o, ol, olk, … – and each tiny gram matches against grams from every indexed document. With a small dataset it’s hard to spot, but the relevance is garbage and at scale it would match everything.
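The gram explosion is easy to demonstrate with a quick sketch (illustrative code, not the demo's):

```typescript
// Generate all n-grams of sizes minGram..maxGram, mimicking NGramFilterFactory
function ngrams(text: string, minGram: number, maxGram: number): Set<string> {
    const grams = new Set<string>();
    for (let size = minGram; size <= maxGram; size++) {
        for (let i = 0; i + size <= text.length; i++) {
            grams.add(text.slice(i, i + size));
        }
    }
    return grams;
}

const query = ngrams('tolkien', 1, 3);
const unrelated = ngrams('lovecraft', 1, 3);
// An unrelated name still shares single-character grams like "t", "o", "l", "e"
// with the query, so every such document gets a nonzero score.
const shared = [...query].filter((gram) => unrelated.has(gram));
```

With 1-character grams in the query fields, almost any pair of names shares at least one gram, which is why relevance collapses.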

The fix: don’t set qf/pf for the default search path. Instead, rely on df=_text_ configured in the /select handler in solrconfig.xml. The _text_ field is a text_general catch-all populated by copyField rules from name, alias, disambiguation, author, and identifier. This gives proper word-level matching. The trigram and autocomplete fields still exist in the schema for specific use cases (the /autocomplete handler uses name_autocomplete and alias_autocomplete via its own qf), but they’re deliberately kept out of the main search path.

This is consistent with how MusicBrainz does it in mbsssss – they define multiple field types but keep the default search on standard-analyzed fields.

Schema Design

The current ES schema lives as a JS object inside search.ts.

For Solr, this moves into a proper schema.xml file.

Here’s how each piece maps across:

Current ES index mapping (the thing we’re replacing)

This is the full ES index config defined in search.ts:


const indexMappings = {
	mappings: {
		_default_: {
			properties: {
				aliases: {
					properties: {
						name: {
							fields: {
								autocomplete: {
									analyzer: 'edge',
									type: 'text'
								},
								search: {
									analyzer: 'trigrams',
									type: 'text'
								}
							},
							type: 'text'
						}
					}
				},
				authors: {
					analyzer: 'trigrams',
					type: 'text'
				},
				disambiguation: {
					analyzer: 'trigrams',
					type: 'text'
				}
			}
		}
	},
	settings: {
		analysis: {
			analyzer: {
				edge: {
					filter: [
						'asciifolding',
						'lowercase'
					],
					tokenizer: 'edge_ngram_tokenizer',
					type: 'custom'
				},
				trigrams: {
					filter: [
						'asciifolding',
						'lowercase'
					],
					tokenizer: 'trigrams',
					type: 'custom'
				}
			},
			tokenizer: {
				edge_ngram_tokenizer: {
					max_gram: 10,
					min_gram: 2,
					token_chars: [
						'letter',
						'digit'
					],
					type: 'edge_ngram'
				},
				trigrams: {
					max_gram: 3,
					min_gram: 1,
					type: 'ngram'
				}
			}
		},
		'index.mapping.ignore_malformed': true
	}
};

Two custom analyzers (edge and trigrams), both using asciifolding + lowercase.

That’s the whole text analysis story in the current setup.

Solr field types (replacing those analyzers):

Each ES analyzer maps to a Solr fieldType. But we don’t just replicate them 1:1 – we improve them by adding ICU support and character normalization:

  1. ES edge analyzer → Solr text_autocomplete:
<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

Same EdgeNGram(2-10) idea as ES, but the query analyzer skips n-gramming so the user’s input stays intact. ES doesn’t make this index-vs-query distinction as cleanly.
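To make the index-vs-query asymmetry concrete, a small sketch of EdgeNGram behavior on a single token (illustrative, not Solr's actual implementation):

```typescript
// Index side: expand a token into edge grams of sizes minGram..maxGram,
// mimicking what EdgeNGramFilterFactory stores for each indexed term
function edgeNgrams(term: string, minGram: number, maxGram: number): string[] {
    const grams: string[] = [];
    for (let size = minGram; size <= Math.min(maxGram, term.length); size++) {
        grams.push(term.slice(0, size));
    }
    return grams;
}

// "lovecraft" gets indexed as ["lo", "lov", "love", ..., "lovecraft"]
const indexed = edgeNgrams('lovecraft', 2, 10);

// Query side: the user's prefix "lov" is NOT gram-expanded; it matches the
// stored gram "lov" directly, which is what makes prefix autocomplete work.
const matches = indexed.includes('lov');
```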

  2. ES trigrams analyzer → Solr text_trigram:
<fieldType name="text_trigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="3"/>
  </analyzer>
</fieldType>

This is the workhorse for partial matching – it backs the name_trigram and alias_trigram copies plus the disambiguation and author fields, replacing the ES trigrams analyzer with ICU folding added on top.

  3. (not in ES at all) → Solr text_multilang:

<fieldType name="text_multilang" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Full multilingual support – covered in detail in the next section.

Every text field type starts with MappingCharFilterFactory using mapping-chars.txt. This normalizes punctuation before tokenization – fullwidth CJK spaces become regular spaces, smart quotes become straight quotes, various hyphens become a standard hyphen, ligatures get expanded. This file is based on MusicBrainz’s mapping-MBCharEquivToChar.txt:

# spaces
"\u3000" => " "
"\u00A0" => " "

# hyphens
"\u2010" => "-"
"\u2014" => "-"
"\u2212" => "-"

# ligatures
"\uFB01" => "fi"
"\uFB02" => "fl"
"\u00C6" => "AE"
"\u00E6" => "ae"

One improvement over the current ES setup:

ES uses asciifolding + lowercase as separate filters, which only covers basic Latin diacritics. The Solr schema adds ICUFoldingFilterFactory on top, which handles broader Unicode normalization – diacritics, case folding, and character equivalences across scripts. This is what lets José match jose and Björk match bjork without needing explicit per-language rules.
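As a rough JS-side approximation of what this folding does for Latin diacritics (Solr's ICUFoldingFilterFactory covers far more than this sketch):

```typescript
// Approximate diacritic folding: decompose to NFD, strip combining marks,
// lowercase. This mimics the José → jose / Björk → bjork examples; real ICU
// folding also handles case folding and equivalences across non-Latin scripts.
function foldLatin(text: string): string {
    return text
        .normalize('NFD')
        .replace(/[\u0300-\u036f]/g, '')
        .toLowerCase();
}
```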

Fields

ES supports nested objects natively (aliases: [{name: "..."}]).

Solr doesn’t; it uses flat multi-valued fields. So aliases becomes alias (multi-valued), and identifiers becomes identifier (multi-valued exact-match strings).

Here’s the Solr field definitions that replace the ES mapping:

<!-- core fields -->
<field name="bbid" type="string" indexed="true" stored="true" required="true"/>
<field name="type" type="string" indexed="true" stored="true" required="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="alias" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="disambiguation" type="text_trigram" indexed="true" stored="true"/>
<field name="identifier" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- denormalized work-author names -->
<field name="author" type="text_trigram" indexed="true" stored="true" multiValued="true"/>
<field name="author_exact" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- analyzed copies for autocomplete + trigram search -->
<field name="name_autocomplete" type="text_autocomplete" indexed="true" stored="false"/>
<field name="name_trigram" type="text_trigram" indexed="true" stored="false"/>
<field name="alias_autocomplete" type="text_autocomplete" indexed="true" stored="false" multiValued="true"/>
<field name="alias_trigram" type="text_trigram" indexed="true" stored="false" multiValued="true"/>

<!-- catch-all: the default search target (df=_text_) -->
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- stores full entity JSON for demo; in real migration, removed -->
<field name="_store" type="string" indexed="false" stored="true"/>

<uniqueKey>bbid</uniqueKey>

What each part replaces from ES:

| ES | Solr | What changed |
|---|---|---|
| aliases.name (nested property) | alias (flat multi-valued field) | Solr can’t nest, so we flatten |
| aliases.name.autocomplete (sub-field with edge analyzer) | name_autocomplete / alias_autocomplete via copyField | ES uses sub-fields, Solr uses copyField to separate analyzed copies |
| aliases.name.search (sub-field with trigrams analyzer) | name_trigram / alias_trigram via copyField | Same idea, different mechanism |
| authors with trigrams analyzer | author (text_trigram) + author_exact (string) | Added an exact-match copy for field-specific queries like author:lovecraft |
| disambiguation with trigrams analyzer | disambiguation (text_trigram) | Direct mapping |
| identifiers.value (nested) | identifier (string, multiValued) | Flattened to exact-match strings for ISBN / Wikidata ID lookups |
| ES _type param for entity routing | type string field + fq=type:author | ES has built-in type routing; Solr uses a regular field we filter on |
| asciifolding + lowercase filters | ICUFoldingFilterFactory | Single filter that covers broader Unicode normalization, not just Latin |

copyField rules

Instead of manually indexing the same content into multiple fields, Solr’s copyField handles it at index time:

<copyField source="name" dest="name_autocomplete"/>
<copyField source="name" dest="name_trigram"/>
<copyField source="name" dest="_text_"/>

<copyField source="alias" dest="alias_autocomplete"/>
<copyField source="alias" dest="alias_trigram"/>
<copyField source="alias" dest="_text_"/>

<copyField source="disambiguation" dest="_text_"/>
<copyField source="author" dest="_text_"/>
<copyField source="author" dest="author_exact"/>
<copyField source="identifier" dest="_text_"/>

When a document is indexed with just name, alias, author, etc., all the analyzed variants get populated automatically. The _text_ catch-all ends up containing everything searchable, which is what df=_text_ on the /select handler uses for the default search.

Document flattening

The ES version of getDocumentToIndex() returns nested objects:

// current ES format
return {
  ...entity.toJSON({ ignorePivot: true, visible: commonProperties.concat(additionalProperties) }),
  aliases,                    // [{name: "H. P. Lovecraft"}, {name: "Лавкрафт"}]
  identifiers: identifiers    // [{value: "Q169566"}]
};

For Solr, we flatten these into multi-valued fields:

// solr format
const doc = {
  bbid: entity.bbid,
  type: snakeCase(entity.type),
  name: entity.name,
  alias: entity.aliases?.map(a => a.name) || [],          // ["H. P. Lovecraft", "Лавкрафт"]
  identifier: entity.identifiers?.map(i => i.value) || [], // ["Q169566"]
  disambiguation: entity.disambiguation || '',
  ...(entity.type === 'work' && entity.authors ? { author: entity.authors } : {}),
  _store: JSON.stringify(entity)
};

For works, the authors array (author names attached at index time via relationship type 8) maps to the author field. This is the same denormalization pattern the current ES code uses – author names get baked into work documents so searching “lovecraft” also finds “The Call of Cthulhu”.

Request handlers (solrconfig.xml)

Instead of building the full query config in code every time, Solr lets you define request handlers with sensible defaults:

<!-- main search -- conservative, uses df=_text_ catch-all -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">_text_</str>
    <int name="rows">10</int>
  </lst>
</requestHandler>

<!-- dedicated autocomplete endpoint -->
<requestHandler name="/autocomplete" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name_autocomplete^3 alias_autocomplete^3</str>
    <str name="mm">80%</str>
  </lst>
</requestHandler>

<!-- indexing, wired to dedup chain keyed on bbid -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>

The deduplication chain uses SignatureUpdateProcessorFactory keyed on bbid, so re-indexing an entity overwrites the old document instead of creating duplicates:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <str name="signatureField">bbid</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">bbid</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The config also sets up autoCommit (hard commit every 15s, no new searcher opened) and autoSoftCommit (soft commit every 1s for near-real-time visibility), plus CaffeineCache for filter queries, query results, and document lookups.
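As a sketch, the commit settings described above would look roughly like this in solrconfig.xml (the element names are standard Solr; the exact values are the ones from my demo config):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit every 15s for durability, without reopening the searcher -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit every 1s so new documents become searchable near-real-time -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```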


Multi-language Support

This is where the Solr migration adds real value beyond just swapping backends.

The problem with the current ES setup

The ES 5.6 config uses two filters for text normalization:

// from the ES indexMappings in search.ts
analyzer: {
  edge: {
    filter: ['asciifolding', 'lowercase'],
    tokenizer: 'edge_ngram_tokenizer',
    type: 'custom'
  },
  trigrams: {
    filter: ['asciifolding', 'lowercase'],
    tokenizer: 'trigrams',
    type: 'custom'
  }
}

asciifolding + lowercase handles José → jose and Björk → bjork, but that’s about it. BookBrainz already has data in Japanese (katakana, hiragana, kanji), Chinese, Cyrillic, Arabic, and more.

The current ES setup handles these poorly:

  • StandardTokenizer splits CJK text character-by-character instead of by word boundaries

  • asciifolding does nothing for non-Latin scripts – Cyrillic Лавкрафт stays as-is, no normalization happens

  • there’s no script transformation, so searching in katakana won’t match hiragana

What Solr gives us

Note on the demo vs. this section: in the draft PR demo, name, alias, and _text_ use text_general, whose pipeline is mapping-chars.txt → Standard tokenizer → lowercase → asciifolding → ICUFolding. The text_multilang block below is already defined in schema.xml for the full migration, with stricter CJK handling (ICU tokenizer + script transforms). I describe that planned pipeline here so reviewers can see where we are headed, not to claim the demo already runs every step below on name/alias.

The schema defines a text_multilang field type built specifically for this:

<fieldType name="text_multilang" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

What each piece does:

  • ICUTokenizerFactory – uses Unicode word-break rules instead of just whitespace. Japanese, Chinese, Thai, Khmer all get segmented properly. It’s the same ICU tooling Solr ships behind analysis-extras (the same broad approach MetaBrainz uses in other projects)

  • ICUTransformFilterFactory with Katakana-Hiragana – normalizes Japanese scripts so ラヴクラフト (katakana) and らヴくらふと (hiragana) match each other.

  • ICUTransformFilterFactory with Traditional-Simplified – normalizes Chinese character variants so traditional and simplified forms match.

  • ICUFoldingFilterFactory – handles diacritics, case folding, and broader Unicode normalization than plain asciifolding alone; in this fieldType we still run LowerCaseFilterFactory after it (see the schema order).

Compare the analysis chain side by side:

ES (search.ts indexMappings): edge / trigram tokenizers + asciifolding + lowercase on those analyzers.

Solr — demo path on name/alias/_text_: mapping-chars → Standard tokenizer → lowercase → asciifolding → ICUFolding.

Solr — planned text_multilang path (below): mapping-chars → ICUTokenizer → Katakana–Hiragana → Traditional–Simplified → ICUFolding → lowercase.

The character mapping file

Before any tokenization happens, MappingCharFilterFactory runs through mapping-chars.txt. This file is based on MusicBrainz’s mapping-MBCharEquivToChar.txt and normalizes things like:

# fullwidth CJK space → regular space
"\u3000" => " "
"\u00A0" => " "

# smart quotes → straight quotes
"\u201C" => "\""
"\u201D" => "\""

# various unicode hyphens → standard hyphen
"\u2010" => "-"
"\u2014" => "-"
"\u2212" => "-"

# ligatures → expanded form
"\uFB01" => "fi"
"\uFB02" => "fl"
"\u00C6" => "AE"
"\u00E6" => "ae"

# hebrew punctuation
"\u05F3" => "'"
"\u05BE" => "-"

# arabic punctuation
"\u06D4" => "."
"\u060C" => ","

This matters because real user data has all kinds of Unicode variants, and we want "H.P. Lovecraft" to match regardless of which apostrophe or space character was used.

Enabling ICU in Docker

All the ICU classes need analysis-extras enabled. In the docker setup:

solr:
  image: solr:9.7.0
  environment:
    - SOLR_MODULES=analysis-extras
  command:
    - solr-precreate
    - bookbrainz
    - /opt/solr/server/solr/configsets/bookbrainz

Without SOLR_MODULES=analysis-extras, Solr silently fails to create the core. This tripped me up during setup – it’s not prominently documented. You just get a “core not found” error with no mention of missing ICU classes.

What I tested

In the demo, I tested these multilingual searches through the actual BookBrainz website search page.

| Query | Script | Expected | Result |
|---|---|---|---|
| ラヴクラフト | Japanese katakana | H. P. Lovecraft + The Call of Cthulhu | found both |
| トールキン | Japanese katakana | J. R. R. Tolkien + The Lord of the Rings | found both |
| 村上 | Japanese kanji | 村上 春樹 + Kafka on the Shore | found both |
| Остин | Cyrillic | Jane Austen | found |
| 指輪物語 | Japanese kanji | The Lord of the Rings (work) | found |

These cases work because the test data includes those scripts in indexed fields (mostly aliases), and index + query use the same analyzer.

In the demo that’s the text_general chain: mapping-chars → Standard tokenizer → lowercase → asciifolding → ICUFolding.

For example, the test data for Lovecraft includes:

{
  name: 'H. P. Lovecraft',
  alias: ['Howard Phillips Lovecraft', 'Лавкрафт', 'ラヴクラフト'],
  type: 'author'
}

When ラヴクラフト is indexed, it goes through the demo’s text_general chain → mapping-chars → tokenizer → lowercase → asciifolding → ICUFolding — and ends up in _text_ (via copyField from alias, same as the other searchable text).

When the user searches ラヴクラフト, Solr runs that same chain on the query, so the tokens match the stored katakana alias. The Call of Cthulhu shows up as well because the work has the author name in author, and author is also copied into _text_, so one query can hit both the author and the work.

What needs to happen in the full migration

Right now in the demo, most fields use text_general (which has ICUFolding but uses StandardTokenizer). For the full migration, the plan is to evaluate switching name and alias fields to text_multilang (which uses ICUTokenizer + the transform filters) for proper CJK word segmentation. This needs careful testing with real BookBrainz data to make sure it doesn’t break Latin-script searches, but the field type is already defined in the schema and ready to use.

Work-Author Relationship

How the current ES code handles it

During indexing, the generateIndex() function in search.ts does something clever. After fetching all Author and Work entities from PostgreSQL, it looks at each Work’s relationships, finds the ones where the relationship type is 8 (meaning “author wrote work”), and attaches the author names directly onto the Work document before indexing it:

// from search.ts generateIndex(), lines 517-544
const authorWroteWorkRels = relationshipSet
    .related('relationships')
    ?.filter(
        (relationshipModel) =>
            relationshipModel.get('typeId') === 8
    );
const authorNames = [];
authorWroteWorkRels.forEach((relationshipModel) => {
    const sourceBBID = relationshipModel.get('sourceBbid');
    const source = authorCollection.get(sourceBBID);
    const name = source?.related('defaultAlias')?.get('name');
    if (name) {
        authorNames.push(name);
    }
});
workEntity.set('authors', authorNames);

Then in searchByName(), when the search type includes works, it adds authors to the query fields:

// from search.ts searchByName(), lines 684-690
if (
    sanitizedEntityType === 'work' ||
    (Array.isArray(sanitizedEntityType) && sanitizedEntityType.includes('work'))
) {
    dslQuery.body.query.multi_match.fields.push('authors');
}

So the approach is: denormalize the author names into work documents at index time, then search across them. This means you don’t need a join or a second query at search time – everything is already in one place.

How we replicate this in Solr

The same denormalization pattern works in Solr. In the schema, we define an author field on every document:

<!-- denormalized work-author names -->
<field name="author" type="text_trigram" indexed="true" stored="true" multiValued="true"/>

<!-- copy into the catch-all so default search picks it up -->
<copyField source="author" dest="_text_"/>

When we index a work document in Solr, we can include the author names directly:

// from search-solr.ts in my demo: the document preparation in _bulkIndexEntities
const doc = {
    bbid: entity.bbid,
    type: 'work',
    name: 'The Call of Cthulhu',
    alias: ['The Call of Cthulhu', 'Call of Cthulhu'],
    author: ['H. P. Lovecraft'],   // <-- denormalized here
    // ...
};

The copyField rule copies author into _text_, so when someone searches “lovecraft” using the default df=_text_ path, Solr matches both:

  • the Author document (because “Lovecraft” is in its name and alias, which are also copied to _text_)

  • the Work document “The Call of Cthulhu” (because “H. P. Lovecraft” is in its author field, which is copied to _text_)

For type filtering, we just add fq=type:work and Solr narrows it down:

# search "lovecraft", no filter -- returns author + their works
/select?q=lovecraft&df=_text_&wt=json

# search "lovecraft", filter to works only -- returns just the works
/select?q=lovecraft&df=_text_&fq=type:work&wt=json

In the demo code, the searchByName() function handles this automatically:

// from search-solr.ts searchByName()
if (sanitizedType && sanitizedType !== 'allEntities') {
    const typeFilter = Array.isArray(sanitizedType)
        ? `type:(${sanitizedType.join(' OR ')})`
        : `type:${sanitizedType}`;

    if (Array.isArray(params.fq)) {
        params.fq.push(typeFilter);
    }
    else {
        params.fq = typeFilter;
    }
}

What I tested

In the demo, these are the work-author searches I ran through the actual website:

| Query | Filter | What came back |
|---|---|---|
| lovecraft | none | H. P. Lovecraft (author) + The Call of Cthulhu (work) |
| lovecraft | Type: Work only | The Call of Cthulhu |
| tolkien | none | J. R. R. Tolkien (author) + The Lord of the Rings (work) |
| tolkien | Type: Work only | The Lord of the Rings |
| asimov | none | Isaac Asimov (author) + Foundation (work) + Foundation Series (series) |
| asimov | Type: Work only | Foundation |

This all works because searching by author name is really just a text search – the author’s name appears on both the author document (as name) and the work document (as author), and both get copied into _text_. No joins, no second query, no cross-core lookup.

What stays the same in the full migration

The denormalization logic in generateIndex() (the relationship type 8 lookup) doesn’t change at all. That code reads from PostgreSQL and attaches author names onto Work entities before indexing. The only difference is the output format – instead of { authors: ["H. P. Lovecraft"] } as a nested field for ES, it becomes { author: ["H. P. Lovecraft"] } as a flat multi-valued field for Solr. The pattern is identical.

Search Logic: Function-by-Function Migration

Every function in search.ts that currently talks to the @elastic/elasticsearch SDK gets rewritten to use direct HTTP calls via fetch(). Here’s each one.

init() – connection setup

Current ES code:

// search.ts init(), lines 703-728
_client = new ElasticSearch.Client(options);
await _client.ping();

const mainIndexExists = await _client.indices.exists({ index: _index });
if (!mainIndexExists) {
    generateIndex(orm).catch(log.error);
}

Creates an ES client object, pings the cluster, checks if the index exists, and auto-indexes if it doesn’t.

In Solr we can have:

// search-solr.ts init()
_solrHost = `http://${host}:${port}`;
_client = { baseUrl: `${_solrHost}/solr/${_solrCore}` };

await solrRequest('/admin/ping', { wt: 'json' });

const checkResponse = await solrRequest('/select', { q: '*:*', rows: 0, wt: 'json' });
const docCount = checkResponse.response?.numFound || 0;

if (docCount === 0) {
    await generateIndex(orm);
}

No SDK. Just a base URL and fetch(). Instead of checking if the index exists (which is an ES-specific concept – Solr cores are created by Docker), we check if the core has any documents. If empty, auto-index.

searchByName() – the main search

This is the function every search page calls. It takes a search term and queries across multiple fields.

Current ES code:

// search.ts searchByName(), lines 661-693
const dslQuery = {
    body: {
        from,
        query: {
            multi_match: {
                fields: [
                    'aliases.name^3',
                    'aliases.name.search',
                    'disambiguation',
                    'identifiers.value'
                ],
                minimum_should_match: '80%',
                query: name,
                type: 'cross_fields'
            }
        },
        size
    },
    index: _index,
    type: sanitizedEntityType
};
if (sanitizedEntityType === 'work' ||
    (Array.isArray(sanitizedEntityType) && sanitizedEntityType.includes('work'))) {
    dslQuery.body.query.multi_match.fields.push('authors');
}
return _searchForEntities(orm, dslQuery);

ES uses multi_match with cross_fields type, which searches across all listed fields as if they were one combined field. Entity type filtering is done through the _type parameter built into ES.

The Solr equivalent:

// search-solr.ts searchByName()
const params = {
    defType: 'edismax',
    q: query,
    wt: 'json',
    start: from,
    rows: size,
    tie: 0.1,
    fl: 'bbid,id,name,type,disambiguation,author,_store'
};

// default path: let df=_text_ in solrconfig handle it.
// _text_ is a catch-all populated by copyField from name, alias, author, disambiguation, identifier.
// important: do NOT put trigram fields in qf here or everything matches everything.

// type filtering via fq (ES uses _type param, Solr uses filter query)
if (sanitizedType && sanitizedType !== 'allEntities') {
    params.fq = Array.isArray(sanitizedType)
        ? `type:(${sanitizedType.join(' OR ')})`
        : `type:${sanitizedType}`;
}

const response = await solrRequest('/select', params);

Key differences:

  • ES multi_match with explicit field list → Solr df=_text_ catch-all (populated by copyField rules in schema)

  • ES _type param for entity routing → Solr fq=type:author filter query

  • ES cross_fields → Solr eDisMax with tie=0.1 (the best-matching field dominates the score, with a small contribution from the other matching fields)

  • No qf set on the default search path – this is deliberate. The _text_ catch-all already contains everything (name, alias, author, disambiguation, identifier via copyField rules), so we let df=_text_ handle it

The code also handles two special cases that ES doesn’t have explicit paths for:

// ISBN lookup -- exact match on identifier field
if (isISBN) {
    params.fq = [`identifier:"${query}"`];
    params.q = '*:*';
}

// field-specific query like "author:lovecraft"
if (isFieldSpecificQuery) {
    params.defType = 'lucene';
    params.q = query.replace(/author:\S+/, `author_exact:*${searchTerm}*`);
}
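These two special cases can be sketched as a pure function. The ISBN check and the author: rewrite below are assumptions modeled on the demo's described behavior, not the exact production code:

```typescript
// Sketch of the two query-rewriting special cases from searchByName().

interface SolrQuery {
  defType: string;
  q: string;
  fq?: string[];
}

// ISBN-10 / ISBN-13 detection: strip separators, then check length and shape
function isISBN(raw: string): boolean {
  const digits = raw.replace(/[- ]/g, '');
  return /^(?:\d{9}[\dX]|97[89]\d{10})$/i.test(digits);
}

function rewriteQuery(rawQuery: string): SolrQuery {
  const query = rawQuery.trim();
  if (isISBN(query)) {
    // Exact lookup: match all docs, filter on the identifier field
    return {defType: 'edismax', q: '*:*', fq: [`identifier:"${query}"`]};
  }
  const fieldMatch = query.match(/^author:(.+)$/);
  if (fieldMatch) {
    // Field-specific query: switch to the lucene parser, target author_exact
    const term = fieldMatch[1].trim().toLowerCase();
    return {defType: 'lucene', q: `author_exact:*${term}*`};
  }
  // Default path: eDisMax against the df=_text_ catch-all
  return {defType: 'edismax', q: query};
}
```

The important property is that ordinary queries never hit the special paths: a plain word falls through to the default eDisMax branch untouched.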

autocomplete() – typeahead suggestions

Current ES code:

// search.ts autocomplete(), lines 338-372
queryBody = {
    match: {
        'aliases.name.autocomplete': {
            minimum_should_match: '80%',
            query
        }
    }
};
const dslQuery = {
    body: { query: queryBody, size },
    index: _index,
    type: sanitizeEntityType(type)
};

ES queries the aliases.name.autocomplete sub-field which uses the edge (EdgeNGram) analyzer.

The Solr equivalent:

// search-solr.ts autocomplete()
const params = {
    q: lowerQuery,
    defType: 'edismax',
    qf: 'name_autocomplete^10 alias_autocomplete^10',
    fl: 'bbid,id,name,type,disambiguation,_store',
    rows: size,
    wt: 'json',
    tie: 0.1
};

// type filter
if (sanitizedType && sanitizedType !== 'allEntities') {
    params.fq = Array.isArray(sanitizedType)
        ? `type:(${sanitizedType.join(' OR ')})`
        : `type:${sanitizedType}`;
}

const response = await solrRequest('/autocomplete', params);

Instead of querying a sub-field, we hit the dedicated /autocomplete handler in solrconfig.xml which already has qf=name_autocomplete^3 alias_autocomplete^3 as defaults. The code overrides with ^10 for stronger boosting. Both name_autocomplete and alias_autocomplete use the text_autocomplete field type (EdgeNGram 2-10), so typing “tolk” matches “Tolkien”.
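To make that matching behavior concrete, here is a toy simulation of EdgeNGram (2-10) analysis. This is an illustration of the concept only, not Solr's actual filter:

```typescript
// Toy simulation of an EdgeNGram filter (minGramSize=2, maxGramSize=10),
// showing why typing "tolk" matches the indexed name "Tolkien".

function edgeNGrams(token: string, min = 2, max = 10): string[] {
  const grams: string[] = [];
  const t = token.toLowerCase();
  for (let len = min; len <= Math.min(max, t.length); len++) {
    // Edge n-grams are prefixes: "to", "tol", "tolk", ...
    grams.push(t.slice(0, len));
  }
  return grams;
}

// At index time "Tolkien" is expanded into its prefix grams; at query time
// the typed prefix is matched as a whole token against those grams.
function autocompleteMatches(query: string, indexedName: string): boolean {
  return edgeNGrams(indexedName).indexOf(query.toLowerCase()) !== -1;
}
```

This also shows why edge grams stay cheap compared to full n-grams: only prefixes are indexed, so "olki" (a mid-word fragment) does not match.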

getDocumentToIndex() / indexEntity() – preparing and indexing a single entity

Current ES code:

// search.ts getDocumentToIndex(), lines 134-166
return {
    ...entity.toJSON({
        ignorePivot: true,
        visible: commonProperties.concat(additionalProperties)
    }),
    aliases,                    // [{name: "H. P. Lovecraft"}, {name: "Лавкрафт"}]
    identifiers: identifiers    // [{value: "Q169566"}]
};

// search.ts indexEntity(), lines 375-390
return _client.index({
    body: document,
    id: entity.get('bbid') || entity.get('id'),
    index: _index,
    type: snakeCase(entityType)
});

ES gets a nested object and uses _client.index() with the SDK.

Solr equivalent:

// search-solr.ts indexEntity()
const doc = {
    bbid: entity.bbid || entity.id,
    type: snakeCase(entity.type),
    name: entity.name,
    alias: entity.aliases?.map(a => a.name) || [],          // flattened
    identifier: entity.identifiers?.map(i => i.value) || [], // flattened
    disambiguation: entity.disambiguation || '',
    ...(entity.type === 'work' && entity.authors ? { author: entity.authors } : {}),
    _store: JSON.stringify(entity)
};

await solrPost('/update/json/docs', doc);
await solrPost('/update', { commit: {} });

Two differences:

(1) nested objects get flattened into multi-valued arrays,

(2) _client.index() becomes a POST to /update/json/docs.

The _store field holds the full entity JSON for the demo; in the real migration this goes away and we fetch from PostgreSQL by bbid instead.

_bulkIndexEntities() – batch indexing

Current ES code:

// search.ts _bulkIndexEntities(), lines 252-319
const bulkOperations = entitiesToIndex.reduce((accumulator, entity) => {
    accumulator.push({
        index: {
            _id: entity.bbid ?? entity.id,
            _index,
            _type: snakeCase(entity.type)
        }
    });
    accumulator.push(entity);
    return accumulator;
}, []);

const { body: bulkResponse } = await _client.bulk({ body: bulkOperations });

ES bulk indexing uses an alternating format – metadata object, then document, metadata, document, etc. It also has retry logic for HTTP 429 (queue overrun) with jitter.

The Solr equivalent:

// search-solr.ts _bulkIndexEntities()
const docs = entities.map(entity => ({
    bbid: entity.bbid || entity.id,
    type: snakeCase(entity.type),
    name: entity.name,
    alias: entity.aliases?.map(a => a.name) || entity.alias || [],
    identifier: entity.identifiers?.map(i => i.value) || entity.identifier || [],
    disambiguation: entity.disambiguation || '',
    ...(entity.type === 'work' && entity.authors ? { author: entity.authors } : {}),
    _store: JSON.stringify(entity)
}));

await solrPost('/update/json/docs?commit=false', docs);

Much simpler. Solr natively accepts a JSON array of documents – no alternating metadata/doc format. We skip the commit during bulk operations (commit=false) and do one hard commit at the end via refreshIndex().
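The batch flow can be sketched as follows. The 50,000 chunk size mirrors the ORM fetch chunks used by generateIndex(), and solrPost() is passed in since it isn't shown in the excerpt; both are assumptions:

```typescript
// Sketch of chunked bulk indexing: Solr takes a plain JSON array per POST,
// so batching is just array slicing, with one hard commit at the end.

function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

async function bulkIndex(
  docs: object[],
  post: (path: string, body: object) => Promise<void>
): Promise<void> {
  // commit=false while batches are in flight; visibility comes from the
  // single explicit commit below (refreshIndex() does the same thing)
  for (const batch of chunk(docs, 50000)) {
    await post('/update/json/docs?commit=false', batch);
  }
  await post('/update', {commit: {}});
}
```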

deleteEntity()

Current ES:

// search.ts deleteEntity(), lines 392-402
return _client.delete({
    id: entity.bbid ?? entity.id,
    index: _index,
    type: snakeCase(entity.type)
});

The Solr equivalent:

// search-solr.ts deleteEntity()
await solrPost('/update', {
    delete: { id: entity.bbid || entity.id }
});
await solrPost('/update', { commit: {} });

refreshIndex()

Current ES:

// search.ts refreshIndex(), lines 404-408
return _client.indices.refresh({ index: _index });

The Solr equivalent:

// search-solr.ts refreshIndex()
await solrPost('/update', { commit: {} });

ES “refresh” makes recent changes visible to search. Solr’s equivalent is a hard commit. The solrconfig.xml also has autoSoftCommit (every 1s) for near-real-time visibility during normal operation, but we do an explicit commit here so the demo feels instant.

checkIfExists() – “warn if entity already exists” flow

Current ES code:

// search.ts checkIfExists(), lines 621-659
const bbids = await orm.func.alias.getBBIDsWithMatchingAlias(
    transacting, snakeCase(type), name
);
// then fetches full entities from PostgreSQL by bbid

Interestingly, the current ES version doesn’t actually query Elasticsearch here – it goes straight to PostgreSQL.

For the Solr version, we query Solr instead since it already has the indexed data:

// search-solr.ts checkIfExists()
const params = {
    q: `name:"${escapeQuery(name)}"`,
    fq: `type:${snakeCase(type)}`,
    fl: 'bbid,id,name,type,_store',
    rows: 20,
    wt: 'json'
};
const response = await solrRequest('/select', params);

This is actually a slight improvement – we get fuzzy-ish matching via the analyzer instead of just exact alias matching from PostgreSQL. In the full migration, we’ll evaluate whether to keep the PostgreSQL path or switch to Solr for this.
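The escapeQuery() call above isn't defined in the excerpt. A minimal sketch, assuming we simply backslash-escape the characters that are special to Lucene query syntax so the name is matched literally:

```typescript
// Backslash-escape Lucene query-syntax characters so that a name like
// 'Who? (novel)' is searched as literal text rather than parsed as syntax.

const LUCENE_SPECIAL = /[+\-!(){}\[\]^"~*?:\\\/&|]/g;

function escapeQuery(value: string): string {
  return value.replace(LUCENE_SPECIAL, (ch) => `\\${ch}`);
}
```

Note that Lucene's two-character operators (&& and ||) are covered here by escaping each & and | individually, which is harmless for single occurrences.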

Response parsing – _fetchEntityModelsForESResults → _fetchEntityModelsForSolrResults

ES response structure:

{
  "body": {
    "hits": {
      "total": 42,
      "hits": [
        { "_source": { "bbid": "...", "type": "Author", "name": "..." } }
      ]
    }
  }
}

Solr response structure:

{
  "response": {
    "numFound": 42,
    "docs": [
      { "bbid": "...", "type": "author", "name": "..." }
    ]
  }
}

In the ES version, the function reads hit._source.bbid and then fetches the full entity from PostgreSQL using the ORM. In the Solr demo, we shortcut this by storing the full entity JSON in _store and parsing it back. In the real migration, _store goes away and we do the same PostgreSQL fetch the ES version does – just reading from response.docs instead of hits.hits.
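The mapping between the two shapes can be sketched as a small parser. SolrDoc here is a minimal assumed shape, and {results, total} is the shape the route layer expects from searchByName():

```typescript
// Sketch of Solr-side response parsing. In the real migration the docs are
// swapped for full entities fetched from PostgreSQL by bbid; only the place
// we read from changes (response.docs instead of ES's hits.hits._source).

interface SolrDoc {
  bbid?: string;
  id?: string;
  type: string;
  name: string;
}

interface SolrSelectResponse {
  response: {
    numFound: number;
    docs: SolrDoc[];
  };
}

function parseSolrResults(res: SolrSelectResponse): {results: SolrDoc[]; total: number} {
  // Solr nests everything under "response"; ES nested it under body.hits
  return {
    results: res.response?.docs ?? [],
    total: res.response?.numFound ?? 0
  };
}
```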


Docker & Infrastructure

Current ES setup

The existing docker-compose.yml runs ES 5.6.8:

elasticsearch:
    container_name: elasticsearch
    restart: unless-stopped
    image: docker.elastic.co/elasticsearch/elasticsearch:5.6.8
    environment:
      - transport.host=127.0.0.1
      - discovery.zen.minimum_master_nodes=1
      - xpack.security.enabled=false
    ports:
      - "127.0.0.1:9200:9200"
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data

This gets replaced entirely.

Solr for development (standalone mode)

For local development, Solr runs in standalone mode – just one container, no ZooKeeper, no cluster:

solr:
    image: solr:9.7.0
    container_name: bookbrainz-solr
    ports:
      - "8983:8983"
    volumes:
      - ./solr-config/conf:/opt/solr/server/solr/configsets/bookbrainz/conf
      - solr-data:/var/solr
    environment:
      - SOLR_HEAP=512m
      - SOLR_MODULES=analysis-extras
    command:
      - solr-precreate
      - bookbrainz
      - /opt/solr/server/solr/configsets/bookbrainz

Key points:

  • solr:9.7.0 – same version MusicBrainz runs in production (per mb-solr). If we hit anything version-specific, we can ask the MB team directly.

  • SOLR_MODULES=analysis-extras – this is the line that enables all ICU analysis classes (ICUTokenizerFactory, ICUFoldingFilterFactory, ICUTransformFilterFactory). Without it, Solr silently fails to create the core. This is the single most important environment variable in the whole setup and it’s easy to miss.

  • solr-precreate – creates the bookbrainz core on first startup using our configset (schema.xml + solrconfig.xml + mapping-chars.txt). If the core already exists, it’s a no-op.

  • Config mounting – ./solr-config/conf gets mounted into the container. During development, you can edit schema.xml locally, recreate the core, and test changes without rebuilding the image.

The solr-config/conf directory contains three files:

solr-config/
  conf/
    schema.xml          # field types, analyzers, fields, copyField rules
    solrconfig.xml      # request handlers, caching, dedup chain
    mapping-chars.txt   # pre-token character normalization

To run the demo locally:

# start the solr container
docker compose -f docker-compose.solr.yml up -d

# start the website with solr enabled
USE_SOLR=true ./develop.sh

When the website starts, init() in search-solr.ts pings the Solr core, sees it’s empty, and auto-indexes the demo dataset (17 documents). No manual indexing step needed.

SolrCloud for production

For production, we’d move to SolrCloud, which adds ZooKeeper for cluster management:

zookeeper:
    image: zookeeper:3.9
    ports:
      - "2181:2181"

solr:
    image: solr:9.7.0
    depends_on:
      - zookeeper
    environment:
      - ZK_HOST=zookeeper:2181
      - SOLR_MODULES=analysis-extras
    ports:
      - "8983:8983"
    volumes:
      - solr-data:/var/solr

The differences from standalone:

  • ZooKeeper manages the schema and cluster config instead of local files

  • Schema gets uploaded through ZooKeeper’s API (solr zk upconfig) instead of being mounted as a volume

  • “Collections” replace single cores (a collection can be sharded across nodes)

  • The search code in search-solr.ts doesn’t change at all – Solr’s HTTP API is the same regardless of standalone vs SolrCloud

This is the same setup MusicBrainz uses. BookBrainz’s dataset is much smaller (~200k entities vs millions), so we might not need sharding initially, but SolrCloud gives us the option if the data grows. It also means the MetaBrainz infrastructure team can manage BookBrainz Solr alongside MusicBrainz Solr with the same tools and processes.

What gets removed

  • The elasticsearch service from docker-compose.yml

  • The elasticsearch-data volume

  • The @elastic/elasticsearch npm dependency from package.json

  • All ES-specific environment variables (transport.host, discovery.zen.minimum_master_nodes, xpack.security.enabled)

What gets added

  • The solr service (and optionally zookeeper for prod)

  • solr-data volume

  • solr-config/conf/ directory with schema, config, and mapping files

  • SOLR_MODULES=analysis-extras environment variable

The rest of the Docker setup (PostgreSQL, Redis, the website container itself) stays exactly the same.

Lessons from Building the Demo

building the demo taught me things that reading documentation alone wouldn’t have. these are the real gotchas i hit and how they affect the full migration.

don’t put trigram fields in global qf

this was the biggest one. my schema defines name_trigram and alias_trigram fields (NGram 1-3 chars) for partial matching. early on i tried adding these to eDisMax’s qf parameter to get fuzzy matching on every search. the result: every query matched every document. searching “tolkien” returned all 17 documents with roughly equal scores.

the reason: when trigram fields are in qf, solr breaks the query into 1, 2, and 3-character grams – t, to, tol, o, ol, olk… – and each tiny gram matches against grams from every indexed document. with a small dataset it’s hard to spot, but the relevance is garbage and at scale it would match everything.

the fix: don’t set qf at all for the default search path. rely on df=_text_ (a text_general catch-all field populated by copyField rules) as the default field. this gives standard word-level tokenized matching – proper search behavior. the trigram fields are available for explicit partial matching as a separate strategy, not mixed into the default path.

in the full migration this means being careful about which fields go into which request handler. the /select handler should stay conservative (just df=_text_), and any partial/fuzzy matching should use a separate handler or explicit field targeting.
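A toy n-gram generator makes the over-matching visible: even two unrelated names share plenty of short grams, which is exactly why trigram fields in qf made every document score on every query.

```typescript
// Toy character n-gram (1-3) generator, illustrating the over-matching.
// Not Solr code -- just the analysis concept behind the trigram fields.

function nGrams(text: string, min = 1, max = 3): string[] {
  const grams: string[] = [];
  const t = text.toLowerCase();
  for (let len = min; len <= max; len++) {
    for (let i = 0; i + len <= t.length; i++) {
      const gram = t.slice(i, i + len);
      if (grams.indexOf(gram) === -1) {
        grams.push(gram);
      }
    }
  }
  return grams;
}

function sharedGrams(a: string, b: string): string[] {
  const bGrams = nGrams(b);
  return nGrams(a).filter((g) => bGrams.indexOf(g) !== -1);
}

// "tolkien" and "lovecraft" are unrelated, yet they share grams like
// "t", "o", "l", "e" -- enough for both documents to score on one query.
```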

SOLR_MODULES=analysis-extras is easy to miss

all the ICU classes (ICUTokenizerFactory, ICUFoldingFilterFactory, ICUTransformFilterFactory) need the analysis-extras module enabled. in docker, that’s one environment variable:

SOLR_MODULES=analysis-extras

without it, solr doesn’t throw an error at startup. it just silently fails to create the core, and you get a “core not found” error when you try to query. i spent time debugging this before figuring out the core never got created in the first place. the solr docs mention it but not prominently.

for the full migration, this needs to be clearly documented and part of the docker setup from day one.

flat documents are actually simpler

i was initially worried about losing ES’s nested object support. in ES, aliases are [{name: "Lovecraft"}, {name: "Лавкрафт"}] and identifiers are [{value: "Q169566"}]. solr needs flat multi-valued fields: alias: ["Lovecraft", "Лавкрафт"] and identifier: ["Q169566"].

turns out this is actually cleaner. the nested structure in ES doesn’t really buy us anything – we never query on alias properties other than name, and identifiers only have value. flattening removes a layer of complexity and makes the documents easier to debug in solr’s admin UI.

the full migration just needs to update getDocumentToIndex() to flatten these arrays. the rest of the pipeline stays the same.

the deduplication chain saves headaches

during development i kept re-running the indexing script and ending up with duplicate documents. solr’s SignatureUpdateProcessorFactory keyed on bbid fixes this – if you index a document with the same bbid as an existing one, it overwrites instead of creating a duplicate.

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <str name="signatureField">bbid</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">bbid</str>
  </processor>
</updateRequestProcessorChain>

this is important for the full migration because indexEntity() gets called whenever an entity is edited on the website. without dedup, you’d slowly accumulate stale copies.

no SDK is actually fine

ES uses the @elastic/elasticsearch npm package. for solr, i didn’t add any client library – just fetch(). solr’s API is plain HTTP + JSON: GET /select for queries, POST /update for indexing. node’s built-in fetch() handles everything.

this means one less dependency to maintain, no version compatibility issues between the client library and the server, and the code is easier to debug because you can copy-paste the exact URL into a browser or curl and see what solr returns.

Proof of Concept

i built a working demo that runs solr through the actual bookbrainz website. the demo indexes 17 hardcoded documents (5 authors, 5 works, 3 editions, 1 edition group, 2 publishers, 1 series) covering multiple languages and entity types.

draft PR: [link to PR] walkthrough video: [link to video]

here are some of the key results:

1. multilingual search – japanese katakana

searching ラヴクラフト (lovecraft in katakana) returns H. P. Lovecraft (author) and The Call of Cthulhu (work). this works because the alias ラヴクラフト is stored on the author document, gets analyzed through ICUFolding at index time, and the query goes through the same pipeline. the work shows up because “H. P. Lovecraft” is denormalized into its author field.

solr query built by search-solr.ts:

/select?q=ラヴクラフト&defType=edismax

no qf set – relies on df=_text_ from solrconfig. the _text_ catch-all contains name, alias, author, disambiguation, and identifier via copyField rules, all analyzed through text_general (which includes ICUFolding).

2. multilingual search – cyrillic

searching Остин (austen in cyrillic) returns Jane Austen (author) and Pride and Prejudice (work). the current ES setup with asciifolding + lowercase cannot do this – asciifolding only handles latin diacritics.

ICUFolding handles cyrillic normalization properly.

solr query built by search-solr.ts:

/select?q=остин&defType=edismax

3. work-author relationship

searching lovecraft with no type filter returns both the author and their work in one result list. this is exactly the multi-entity search behavior we want to preserve. it works because “H. P. Lovecraft” exists in both places: as the name on the author document, and as the author field on the work document “The Call of Cthulhu”. both get copied into _text_ via copyField.

solr query built by search-solr.ts:

/select?q=lovecraft&defType=edismax

4. type filtering

same search lovecraft, but with the type dropdown set to “Work”. now only The Call of Cthulhu shows up. the code adds fq=type:work to the query.

solr caches filter queries separately from the main query, so repeated type switches are fast.

solr query built by search-solr.ts:

/select?q=lovecraft&defType=edismax&fq=type:work

5. identifier search – ISBN lookup

searching 978-0-486-27204-8 returns only the 1928 edition of The Call of Cthulhu. the code detects the ISBN pattern via regex, switches from a text search to an exact filter query on the identifier field. no text analysis, no tokenization – pure exact match.

solr query built by search-solr.ts:

/select?q=*:*&fq=identifier:"978-0-486-27204-8"

notice q=*:* – the actual filtering is done entirely through fq. this is a common solr pattern for exact lookups.

6. field-specific query – author:name filtered to works

typing author:H. P. Lovecraft with the type filter set to “Work” returns only works written by Lovecraft. the code detects the author: prefix, switches to lucene query parser, and rewrites the query to target the author_exact field.

solr query built by search-solr.ts:

/select?q=author_exact:*h. p. lovecraft*&defType=lucene&fq=type:work

what’s not fully working yet

  • autocomplete suggestions while typing are not integrated into the website search UI (the /autocomplete solr endpoint works, but the frontend doesn’t call it on the main search page)

  • fuzzy/typo tolerance needs more tuning to be reliable without over-matching

  • the demo dataset is hardcoded (17 documents); the real migration needs the full generateIndex() flow reading from PostgreSQL

Timeline

pre-community bonding (april)

  • i already have a working demo with schema.xml, solrconfig.xml, mapping-chars.txt, search-solr.ts, and docker-compose.solr.yml – all tested with 17 hardcoded documents

  • i’ll keep contributing to open bookbrainz tickets during april, and keep refining my schema approach based on feedback

  • i’ll go through the real production bookbrainz database to understand the edge cases my demo doesn’t cover – entities with many aliases, works with multiple authors, long identifier lists – and improve my demo schema and approach so it carries over to real production work

community bonding (may 1 - may 24)

  • i’ll walk through the demo with mentors and get feedback on schema choices – the main open question is whether name/alias should stay as text_general or move to text_multilang for proper CJK segmentation

  • i’ll test the schema against a sample of real bookbrainz postgres data to see if the copyField rules and _text_ catch-all hold up at scale

  • i’ll finalize the checkIfExists() approach with mentors – keep the current postgres-direct path or switch to solr

  • i’ll confirm that SOLR_MODULES=analysis-extras works on the production docker setup

deliverable: schema reviewed and locked in with mentors. no design surprises going into week 1.

week 1-2: docker infrastructure swap + schema & config finalization

  • i’ll replace the elasticsearch:5.6.8 service in docker-compose.yml with solr:9.7.0, remove ES-specific env vars, and add SOLR_MODULES=analysis-extras with the solr-precreate command

  • i’ll mount solr-config/conf/ into the container and finalize all three config files based on community bonding feedback:

    • schema.xml: finalize field types (text_autocomplete, text_trigram, text_general, text_multilang), field definitions (bbid, type, name, alias, disambiguation, identifier, author, author_exact, the _text_ catch-all), and all copyField rules

    • solrconfig.xml: finalize request handlers (/select with df=_text_, /autocomplete with its own qf, /update with the dedup chain), autoCommit/autoSoftCommit settings, and CaffeineCache config

    • mapping-chars.txt: finalize character normalization rules (CJK spaces, smart quotes, hyphens, ligatures, hebrew/arabic punctuation)

  • i’ll remove @elastic/elasticsearch from package.json

  • will verify the core starts, accepts documents, and responds to /admin/ping

deliverable: solr container running in docker-compose, core created with finalized schema + config + mapping file, responding to queries. ES dependency gone.

week 3-4: rewriting the search and autocomplete functions

  • i’ll rewrite init() to connect to solr via fetch(), ping the core, and auto-index if empty

  • i’ll rewrite searchByName() – replacing the ES multi_match / cross_fields approach with eDisMax + df=_text_. type filtering moves from ES’s _type param to fq=type:....

  • i’ll add ISBN detection and field-specific author: prefix handling

  • i’ll rewrite autocomplete() to hit the dedicated /autocomplete handler with name_autocomplete and alias_autocomplete fields

  • i’ll rewrite the response parser to read from solr’s response.docs / response.numFound instead of ES’s hits.hits._source

  • i’ll verify that the work-author relationship works end-to-end: searching an author name should return both the author and their works (via the denormalized author field + copyField to _text_)

  • i’ll preserve the existing BBID lookup path in autocomplete(): if the query is a valid BBID (commonUtils.isValidBBID), i’ll skip the edge-ngram text match and query solr directly by bbid (instead of ES’s ids query)

deliverable: website search page returns correct results through solr. search, autocomplete, and work-author lookups all working.

week 5-6: indexing pipeline with real postgres data

  • i’ll rewrite getDocumentToIndex() to flatten nested aliases and identifiers into multi-valued fields for solr

  • i’ll rewrite indexEntity() and deleteEntity() to use solr’s /update API instead of the ES SDK

  • i’ll rewrite _bulkIndexEntities() – replacing ES’s alternating metadata/doc format with a simple JSON array POST

  • i’ll rewrite generateIndex() to output flat solr documents – the postgres fetching logic and relationship type 8 author-work enrichment stays the same, just the output format changes

  • i’ll keep the existing generateIndex(orm, entityType) behavior so we can reindex a single entity type (e.g. only Work) without always wiping and rebuilding the whole index - the solr side becomes “delete/replace docs for that type” or targeted bulk updates, not indices.create like ES

  • i’ll run a full reindex from real postgres data and compare results with the current ES output

deliverable: all functions in search.ts rewritten. full reindex from real data works. search results match or improve on current ES behavior.

midterm evaluation (~week 6-7)

at this point, all core functions talk to solr, and website search returns correct results from real postgres data.

week 7-8: testing + response parsing

  • i’ll remove the _store field and update _fetchEntityModelsForSolrResults() to fetch full entities from postgres by bbid using the ORM – making solr a pure lookup layer, not a data store

  • i’ll handle the special cases in response parsing the same way the current ES code does: Area entities fetched via Area.forge({gid}), Editor entities via Editor.forge({id}), Collection entities via UserCollection.forge({id})

  • for regular entities i’ll mirror _fetchEntityModelsForESResults(): load relationships and resolve source/target via commonUtils.getEntity() where needed, so the JSON shape matches what the UI expects (not just bbid + name stubs)

  • i’ll make sure the response format matches what the route layer (search.tsx) expects – {results, total} from searchByName(), plain array from autocomplete()

  • i’ll run the existing test suite against solr and fix any failures

  • i’ll manually test all user-facing features: search by name, type filter, pagination, autocomplete in entity editors, the “does this entity exist?” check

  • i’ll specifically test work-author searches with real data and edge cases: entities with empty aliases, special characters in identifiers, works with multiple authors

deliverable: _store removed, ORM-based response parsing working. test suite passes. all features manually verified.

week 9: multilingual refinement + fuzzy search

  • i’ll test text_multilang (ICUTokenizer + script transforms + ICUFolding) against real CJK data from bookbrainz

  • if latin-script searches don’t regress, i’ll switch name and alias fields to text_multilang

  • i’ll explore controlled fuzzy/typo tolerance – possibly a separate request handler or solr’s ~ operator – without polluting the default search path

deliverable: multilingual search validated with real data. fuzzy strategy decided and implemented if time allows.

week 10: solrcloud setup for production

  • i’ll add zookeeper to docker-compose, upload the schema via solr zk upconfig, and create a collection to replace the standalone core

  • i’ll verify all search functions work without any code changes – solr’s HTTP API is the same in standalone and solrcloud

  • i’ll document both setup paths for the team

deliverable: solrcloud running, all features working without code changes. setup docs written.

week 11: cleanup + hardening

  • i’ll remove all remaining ES artifacts from the codebase: docker service, volumes, env vars, any leftover code paths

  • i’ll do a final pass on error handling in solrRequest() and solrPost() – timeouts, retries, connection failures

  • i’ll fix any edge cases found during weeks 7-10

deliverable: codebase fully clean of ES. error handling production-ready.

week 12: documentation + final review

  • i’ll document the key design decisions: why single core, why df=_text_ catch-all, why ICU over asciifolding, why no SDK

  • i’ll write a developer setup guide

  • final PR review with mentors

deliverable: documentation complete. final PR ready for merge.

stretch goals (if time permits)

  • search result highlighting: solr can return highlighted snippets showing why a result matched. right now if you search “lovecraft” and get “The Call of Cthulhu”, there’s no indication why that work appeared. with solr’s highlighting component, the result would show “author: H. P. Lovecraft” – making it obvious. this is especially useful for multi-entity search where results come from different fields (name, alias, author, identifier).

  • result grouping by entity type: currently search results are ranked purely by relevance, which means one entity type can dominate the first page. solr’s result grouping (group=true&group.field=type) can return balanced results – top 3 authors, top 3 works, top 3 editions – giving users a better overview without having to manually filter by type.

  • zero-result query logging: track searches that return no results and surface them to editors. if users keep searching for “brandon sanderson” and getting nothing, that’s a signal that the entity is missing or has poor alias coverage. this is a simple solr-side log + a small admin dashboard, but it directly helps bookbrainz’s data quality by connecting search behavior to editorial gaps.

Architecture: How Search Works After the Migration

  1. when a user searches on BookBrainz, the search page sends GET /search?q=... to the server. the autocomplete searchbar sends GET /autocomplete?q=.... both requests land on search.tsx routes.

  2. the routes call searchByName() or autocomplete() in search.ts — the single file that handles all search logic. searchByName() builds an eDisMax query with df=_text_ and sends it to Solr’s /select endpoint via fetch(). autocomplete() hits the dedicated /autocomplete handler targeting EdgeNGram fields (name_autocomplete, alias_autocomplete).

  3. Solr (single core: bookbrainz) receives the query, runs it through the query parser using the field types and analyzers defined in schema.xml, with character normalization from mapping-chars.txt.

  4. Solr returns response.docs, a list of matching documents containing bbid and type.

  5. the code reads bbid and type from the response, and fetches the complete entity from PostgreSQL using the Bookshelf.js ORM. PostgreSQL is the source of truth; Solr is just a lookup layer.

  6. The full entity JSON is sent back to the search page for display.

  7. When an entity is created or edited, the entity editor sends a POST through the entity routes, which call indexEntity(). This sends a flat document to Solr’s /update endpoint. The dedupe chain (keyed on bbid) overwrites instead of duplicating.

  8. On server startup, init() pings Solr to check that it’s reachable. If the Solr index is empty, init() calls generateIndex().

  9. generateIndex() fetches all entities from PostgreSQL in 50k chunks using the ORM, attaches author names to work documents via relationship type 8 (“author wrote work”), converts them into flat Solr documents, and bulk-POSTs them to /update/json/docs.

  10. The Elasticsearch connection is removed entirely (red dashed line in the diagram): no SDK, no ES Docker service.

  11. Docker Compose runs Solr in standalone mode for development (solr-precreate plus a volume mount of schema.xml, solrconfig.xml, and mapping-chars.txt). For production, this moves to SolrCloud + ZooKeeper; the search code stays the same, since Solr’s HTTP API is identical in both modes.
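The query path (steps 2–5 above) can be sketched roughly as follows. This is a simplified illustration, not the actual search.ts code: the solrUrl parameter, the exact field list, and the fetchEntityFromOrm helper mentioned in the trailing comment are hypothetical placeholders.

```typescript
// Minimal sketch of the searchByName() query path described above.
// Assumes a single core named "bookbrainz"; all names are placeholders.
interface SolrHit { bbid: string; type: string; }

function buildSelectParams(query: string): URLSearchParams {
  return new URLSearchParams({
    q: query,
    defType: 'edismax', // eDisMax query parser (step 2)
    df: '_text_',       // default search field (step 2)
    fl: 'bbid,type',    // Solr only returns ids and types (step 4)
    wt: 'json',
  });
}

async function searchByName(solrUrl: string, query: string): Promise<SolrHit[]> {
  // Step 3: Solr runs the query through the analyzers defined in schema.xml.
  const res = await fetch(`${solrUrl}/select?${buildSelectParams(query)}`);
  const body = await res.json();
  // Step 4: Solr's JSON response nests matching documents under response.docs.
  return body.response.docs as SolrHit[];
}

// Step 5 would then hydrate each hit from PostgreSQL via the ORM, e.g.:
//   const entities = await Promise.all(
//     hits.map((h) => fetchEntityFromOrm(h.bbid, h.type)));
```

Keeping the Solr response down to bbid and type keeps the index small and makes PostgreSQL the only place full entity data lives, matching the current ES behaviour.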

Alternative: Multi-Core Approach

MusicBrainz uses a separate Solr core for each entity type: one for artists, one for releases, one for recordings, and so on. Each core has its own schema tailored to that entity’s fields. This works for MB because their search UI requires you to pick an entity type before searching, they have millions of entities per type, and each type has genuinely different search fields.

BookBrainz is different. Multi-entity search is the default: users type a query and get authors, works, editions, publishers, series, and edition groups all in one result list. A pure multi-core approach would mean every unfiltered search like “lovecraft”, with no type selected, fires 6 separate HTTP requests, one per core, instead of 1. The results from each core come back with independent relevance scores that aren’t directly comparable, so it would need custom merge-and-rank logic in Node.js. Pagination across cores is also painful: you cannot just ask each core for results 11–20; we would need to over-fetch from all of them and merge. And all BookBrainz entity types share the same base fields (name, alias, disambiguation, identifier), so there’s no strong schema reason to split them in the first place.
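To illustrate the merge-and-rank problem: per-core relevance scores are on independent scales, because each core computes them against its own index statistics, so naively merging by raw score skews results toward whichever core happens to produce larger numbers. A toy sketch with made-up data, using min-max normalization as one possible (not the only) fix:

```typescript
// Toy illustration of why merging multi-core results needs custom logic.
// Scores from different cores are NOT comparable.
interface Hit { bbid: string; score: number; }

// One possible fix: rescale each core's scores to [0, 1] before merging.
function normalize(hits: Hit[]): Hit[] {
  const max = Math.max(...hits.map((h) => h.score));
  const min = Math.min(...hits.map((h) => h.score));
  const range = max - min || 1; // avoid division by zero for uniform lists
  return hits.map((h) => ({ ...h, score: (h.score - min) / range }));
}

function mergeCores(...cores: Hit[][]): Hit[] {
  return cores
    .flatMap(normalize)                 // rescale per core, then flatten
    .sort((a, b) => b.score - a.score); // re-rank on the shared scale
}

// Hypothetical per-core results: the authors core happens to score 0-12,
// the works core 0-3. Raw merging would bury every work under the authors;
// after normalization the best hit from each core competes fairly.
const authors = [{ bbid: 'a1', score: 12.4 }, { bbid: 'a2', score: 3.1 }];
const works = [{ bbid: 'w1', score: 2.9 }, { bbid: 'w2', score: 0.4 }];
console.log(mergeCores(authors, works).map((h) => h.bbid));
```

Even this simple version has to decide how to handle single-hit lists and ties; a production merge would also need to deal with pagination and per-type quotas, which is exactly the complexity the single-core design avoids.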

That said, there is a real question worth testing: what if we use both?

The idea is a hybrid approach. A unified single core handles the default multi-entity search, where you want everything ranked together in one response. But when a user explicitly filters by entity type, say “show me only Authors”, the query could route to a dedicated author core with author-specific field weights and a schema optimized for that entity type. This is closer to how MusicBrainz works: they benefit from per-type boosting because artist searches need different field weights than release searches.

In the single-core design, type filtering already works with fq=type:author, and Solr caches filter queries separately, so it’s fast. But a per-type core could offer better relevance tuning for filtered searches, especially as the dataset grows. The tradeoff is added complexity: maintaining multiple schemas, syncing data across cores, and the routing logic to decide which path to take.
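A sketch of what that routing decision could look like. The routeQuery helper and the core names are hypothetical; only the fq=type:... filter reflects the single-core design described above.

```typescript
// Hypothetical routing for the hybrid approach: unfiltered searches hit the
// unified core; an explicit entity-type filter could route to a dedicated
// per-type core instead. Core names here are assumptions.
interface Route { core: string; params: URLSearchParams; }

function routeQuery(query: string, entityType?: string): Route {
  const params = new URLSearchParams({ q: query, defType: 'edismax' });
  if (!entityType) {
    // Default multi-entity search: one unified core, everything ranked together.
    return { core: 'bookbrainz', params };
  }
  // Single-core path shown here: a filter query, which Solr caches separately
  // from the main query. A hybrid setup would instead return something like
  // { core: `bookbrainz_${entityType}`, params } with per-type field weights.
  params.set('fq', `type:${entityType}`);
  return { core: 'bookbrainz', params };
}

console.log(routeQuery('lovecraft').core);                       // unified core
console.log(routeQuery('lovecraft', 'author').params.get('fq')); // type:author
```

Because the decision lives in one small function, switching the filtered branch between fq-on-unified-core and a dedicated per-type core during benchmarking would be a one-line change.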

This is something I want to test and benchmark during the project. My plan is to start with the single unified core, which already works in my demo and matches the current ES single-index setup. Then, during the testing phase, I’ll set up a few entity-specific cores and compare result quality and performance side by side. If per-type cores give noticeably better results for filtered searches without too much operational overhead, we go hybrid; if not, the single core stays. My mentor also suggested we discuss and test both approaches (unified for multi-entity search, separate cores for entity-specific search), and I’ll prioritize that comparison early in the project so we’re not locked into a decision we haven’t validated.

References I used for this proposal and for building my demo:

I also referred to src/common/helpers/search.ts and src/server/routes/search.tsx to understand the current behaviour.

About Me

I’m Rayyan Seliya, a computer science student from India. I started coding in my first year of college with C, moved to C++ for competitive programming, and then got into web development (React, Node.js, TypeScript).

My journey with MetaBrainz:

I started contributing to MetaBrainz in early 2025 with this PR on ListenBrainz, adding an “Add another” checkbox to the submit-listens modal so users can add multiple listens without reopening the modal each time. It was my first real open source contribution; MonkeyDo walked me through multiple rounds of review, and the PR was merged and deployed.

After that, I worked on a 3-month internship with MetaBrainz through C4GT 2025 (see the blog post on MetaBrainz): integrating the Internet Archive into BrainzPlayer. I worked with lucifer and MonkeyDo on that project, which gave me solid experience with the MetaBrainz codebase and development workflow.

Once the internship ended, I continued exploring BookBrainz and picked up a stale feature request, adding OpenLibrary cover images to edition pages, which Monkey had mentioned in the MetaBrainz summit YouTube video :). MonkeyDo reviewed and merged it.

Other open source:

I also interned at Knative (part of the Linux Foundation / CNCF ecosystem). You can see my CNCF contribution card here.

GitHub: RayyanSeliya

Community Affinities

What interests me about BookBrainz:

I have genuinely used BookBrainz since the day I discovered it! I like the idea of having a structured, community-maintained catalog for books, the same way MusicBrainz does for music.

Books I read: Harry Potter, Pride and Prejudice, etc.

ListenBrainz:

I use ListenBrainz as my main listening tracker. You can see my 2025 Year in Music here. I listen to a mix of pop, rock, and recitations of the Quran!

Practical Requirements

Device: HP Laptop 15s-du3xxx (11th Gen Intel i5-1135G7 @ 2.40GHz, 20GB RAM, 477GB storage, Windows 11)

Availability: I can easily work 40 hours per week during the summer, as I don’t have any other commitments.


(FWIW, I said the same about Atharv’s proposal, both of your proposals are great.)

The proposal is very well thought out and I don’t have many improvements to offer. Since the dataset is small, I agree that a single core solution would work well and be simplest. As BB grows, we may need to switch to multiple cores for different use cases, but we can cross that bridge later. I will think more about this implementation in the coming days and share more feedback if any.


Thanks @lucifer for the feedback!