Contact Information
Name: Amaan Pathan
Email: eulerbutcooler@gmail.com
Github: eulerbutcooler
Timezone: UTC+5:30
Project Overview
Title: Use Solr search server
Proposed Mentor: Monkey, Lucifer
Project Length: 350 hours
BookBrainz currently runs on Elasticsearch 5.6, which is both outdated and creates overhead by requiring MetaBrainz to maintain two different search architectures. This project aligns the BookBrainz search stack with what the other MetaBrainz services use by migrating from Elasticsearch to Solr, while keeping the existing features intact and maintaining multi-entity search.
All search and indexing logic lives in search.ts. The route layer (search.tsx) calls the functions exported from that file and needs no changes beyond response-object parsing. The database schema, frontend and ORM layers are completely untouched. The migration is a contained rewrite of search.ts that replaces the Elasticsearch client calls with fetch() calls to Solr. Beyond that, only the schema configuration and Docker infrastructure need changes.
My Contributions
Merged PRs:
- PR#1235 - Replaced the problematic getong/elasticsearch-action@v1.2 with a native service container.
- PR#21 - Fixed a name mismatch in the BookBrainz installation docs to align them with the bookbrainz-site repo.
- PR#1254 - Fixed a typo in the index.js file (requireJS instead of requiresJS).
Open PRs:
- PR#1257 - Added native healthchecks in docker compose, removing the external dependency on waisbrot/wait image.
Experiments
After digging through numerous Stack Overflow discussions and the Solr docs, I found three ways of serving multi-entity search on Solr, which was designed for single-type queries.
Methodology
To ensure the experiment reflected real-world conditions, I recreated a stripped-down BookBrainz indexing pipeline locally.
First I spun up a Solr 9 Docker container. Since TF-IDF (Term Frequency–Inverse Document Frequency) scoring is sensitive to data distribution, I built a controlled dataset of 400 documents (200 authors and 200 works) using real entity names and titles (Tolkien, Shakespeare, Dune, Good Omens, etc.) with procedural filler to generate realistic IDF values. I indexed these into three Solr collections (bb_exp_single, bb_exp_authors and bb_exp_works) and ran identical eDisMax (Extended DisMax) search queries against all three simultaneously.
eDisMax is Solr's recommended query parser that supports multi-field searching with per-field boosting, flexible minimum-match rules, and phrase proximity scoring.
TF-IDF is the classic relevance scoring model underlying both Elasticsearch and Solr. TF (Term Frequency) measures how often a search term appears in a document: more occurrences = higher score. IDF (Inverse Document Frequency) measures how rare a term is across the entire collection: rarer terms = higher score. The product of these two values determines a document's relevance score for a given query.
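Since the whole experiment hinges on how IDF behaves, here is a small TypeScript sketch of the classic formulation (idf = ln(N / df)); real Lucene scoring adds smoothing, so treat the numbers as illustrative:

```typescript
// Classic TF-IDF: score = tf * ln(N / df)
// N  = total documents in the collection
// df = documents containing the term
// tf = occurrences of the term in the scored document
function idf(totalDocs: number, docsWithTerm: number): number {
  return Math.log(totalDocs / docsWithTerm);
}

function tfIdf(tf: number, totalDocs: number, docsWithTerm: number): number {
  return tf * idf(totalDocs, docsWithTerm);
}

// A term found in 2 of 200 docs outweighs one found in 4 of 200:
const rareIdf = idf(200, 2);   // ≈ 4.61
const commonIdf = idf(200, 4); // ≈ 3.91
```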
const EDISMAX = {
  defType: "edismax",
  qf: "aliases_name^3 disambiguation^1.5 author_names", // query fields, ^N sets boost weight
  mm: "75%", // minimum-match
  rows: "10", // results to return
  fl: "id,type,aliases_name,disambiguation,author_names,score", // fields included in response
};
Option A - Single collection: All entity types in one collection with a type field. IDF computed across the whole pool. One request per search.
Option B - Cross-collection query: Per-type collections, queried together in one request via Solr's comma syntax. IDF computed independently per collection.
Option C - Fan-out + Normalization: Per-type collections, queried in parallel, scores divided by each collection's max score to produce a 0-1 scale, merged in Node.js.
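Option C's merge step amounts to something like the following sketch (the ScoredDoc shape and function name are illustrative, not existing BookBrainz code):

```typescript
interface ScoredDoc {
  id: string;
  type: string;
  score: number;
}

// Divide each collection's scores by that collection's max score,
// then merge and sort in Node.js.
function mergeNormalized(resultSets: ScoredDoc[][]): ScoredDoc[] {
  const merged = resultSets.flatMap((docs) => {
    if (docs.length === 0) {
      return [];
    }
    const max = Math.max(...docs.map((d) => d.score));
    // Every collection's best match becomes exactly 1.0 here,
    // erasing how strong that match was in absolute terms.
    return docs.map((d) => ({...d, score: d.score / max}));
  });
  return merged.sort((a, b) => b.score - a.score);
}
```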
The 7 test queries I tried were:
- "tolkien" - Since there is both an author and a book containing this name.
- "shakespeare" - Same as tolkien; exists in the Authors as well as the Works collection.
- "martin" - To test ambiguous partial matches (G.R.R. Martin vs other authors named Martin).
- "harry potter" - To test multi-word matches where the phrase only exists in the Works collection.
- "lord of the rings" - To test how well eDisMax handles long, multi-word titles and whether it successfully drops Authors down the list.
- "good omens" - To test a book with multiple authors, to ensure the author_names array field boosts correctly.
- "dune" - To test a single-word Work title that has no overlap with its Author's name (Frank Herbert).
The two differentiating queries were:
- "tolkien":
Option B ranked a biography about Tolkien above Tolkien himself. The reason is IDF skew: the word "tolkien" appears in 4 of 200 works-collection documents (IDF ≈ 3.91, meaning it is moderately common in that collection) but in only 2 of 200 authors-collection documents (IDF ≈ 4.61, meaning it is rarer there and thus gets a higher weight). The biography, where "tolkien" dominates the title, gets an inflated score from the works IDF. Option A fixes this by computing IDF once across all 400 documents.
- "dune":
Option C normalizes both collections to a 0-1 scale but destroys a different signal: it collapses the work "Dune" and the author "Frank Herbert" to identical 1.0 scores, erasing the meaningful information that the work is a much stronger match.
Conclusion:
Option B fails due to IDF skew when comparing raw scores from different collections: "tolkien" is rarer in the authors collection than in the works collection, which produces the absurd ranking of the biography above the author.
Option C attempts to fix the IDF skew by normalizing the scores before merging them, but this destroys the relative magnitude of the match; for the query "dune", it normalizes both the author and the work to the same score of 1.0.
Option A solves both problems by placing all entities into a single, unified Solr core. This fixes the IDF skew since all entities share one collection, and no score normalization is needed. Moreover, BookBrainz has around 200k entities, which is not a large enough dataset to justify multiple Solr collections, so a single collection suffices.
Here's a link to reproduce the experiment - BookBrainz Solr Search Strategies Experiment
Architecture
Single Solr Collection
All entity types are indexed into a single BookBrainz Solr collection, with a type field to distinguish them. Type-specific searches use Solr's filter query (fq) parameter: a secondary query that restricts results without affecting relevance scores. It is applied post-scoring and cached separately by Solr for performance.
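For example, a Works-only search could be issued like this (the collection name, host and boosts are assumptions from this proposal, not existing configuration):

```typescript
// `q` is scored by eDisMax; `fq` only filters the result set and is
// cached by Solr independently of the main query.
const params = new URLSearchParams({
  defType: 'edismax',
  q: 'dune',
  qf: 'aliases_name^3 aliases_search',
  fq: 'type:Work', // post-scoring filter, no effect on relevance
  rows: '10',
});
const url = `http://localhost:8983/solr/bookbrainz/select?${params.toString()}`;
```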
If in future BookBrainz grows to hundreds of millions of entities, we can easily shift to multiple Solr collections, since the proposed architecture doesn't lock BookBrainz into a single one.
Changes Required
- src/common/helpers/search.ts - Full rewrite of internals; fetch() instead of the ES client.
- docker-compose.yml - Solr container instead of the ES containers.
- solr/schema.xml - Field types, fields, copyField directives.
- solr/solrconfig.xml - Request handlers, cache config.
- src/server/routes/search.tsx - Response shape updated.
Solr Schema
Field Type Mapping
| Elasticsearch | Solr | Migration Notes |
|---|---|---|
| aliases.name (nested text) | aliases_name (multi-valued text_general) | Flattened nested JSON objects into array |
| aliases.name.autocomplete (edge analyzer sub-field) | aliases_autocomplete via copyField | Edge NGram 2-10, query-side keyword tokenizer |
| aliases.name.search (trigram sub-field) | aliases_search via copyField | NGram 1-3 for partial matching |
| authors (trigrams) | authors + authors_search via copyField | Stored as text_general, ngrams separately |
| disambiguation (trigrams) | disambiguation + disambiguation_search via copyField | Stored as text_general, ngrams separately |
| identifiers[].value (nested array) | identifiers_value (multi-valued string) | Flattened in getDocumentToIndex() |
| asciifolding + lowercase filters | ICUFoldingFilterFactory | Single filter, broader Unicode coverage |
| StandardTokenizerFactory | ICUTokenizerFactory | Proper CJK word segmentation |
| ES _type routing | type field + fq=type:{type} | Explicit field, filter at query time |
Solr's copyField directive automatically copies data from a source field into a destination field at index time, allowing the same content to be analyzed differently (e.g. once for full-text search, once for autocomplete) without duplicating it in the source document.
Schema:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="bookbrainz" version="1.6">
<!--
Replaces ES's default StandardAnalyzer + asciifolding + lowercase
ICUTokenizer handles CJK (Chinese, Japanese, Korean) word
segmentation properly
ICUFoldingFilter is a superset of asciifolding as it has a broader
unicode coverage
-->
<fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!--
text_edge_ngram replaces ES's edge analyzer (used for autocomplete)
Index-time: EdgeNGram(2,10) generates prefix tokens. ("Tolkien" ->
["To", "Tol", "Tolk", ...])
Query-time: KeywordTokenizer keeps user input as one token. ("Tol"
as "Tol", hence matches the "Tol" ngram)
-->
<fieldType name="text_edge_ngram" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.EdgeNGramTokenizerFactory"
minGramSize="2" maxGramSize="10"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!--
text_ngram replaces ES's trigram analyzer (used for fuzzy search)
Index-time: NGram(1,3) generates all substrings of length 1-3.
("Tolkien" -> ["T", "To", "Tol", "o", "ol", "olk", ...])
Query-time: ICUTokenizer splits normally so we don't generate
ngrams of the query itself, we match against indexed ngrams.
-->
<fieldType name="text_ngram" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.NGramTokenizerFactory"
minGramSize="1" maxGramSize="3"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- Fields -->
<field name="bbid" type="string" indexed="true" stored="true"
required="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="aliases_name" type="text_general" indexed="true"
stored="true" multiValued="true"/>
<field name="aliases_autocomplete" type="text_edge_ngram" indexed="true"
stored="false" multiValued="true"/>
<field name="aliases_search" type="text_ngram" indexed="true"
stored="false" multiValued="true"/>
<field name="disambiguation" type="text_general" indexed="true"
stored="true"/>
<field name="disambiguation_search" type="text_ngram" indexed="true"
stored="false"/>
<field name="authors" type="text_general" indexed="true" stored="true"
multiValued="true"/>
<field name="authors_search" type="text_ngram" indexed="true"
stored="false" multiValued="true"/>
<field name="identifiers_value" type="string" indexed="true"
stored="true" multiValued="true"/>
<field name="_version_" type="plong" indexed="false" stored="false"/>
<!-- Copy Fields -->
<copyField source="aliases_name" dest="aliases_autocomplete"/>
<copyField source="aliases_name" dest="aliases_search"/>
<copyField source="disambiguation" dest="disambiguation_search"/>
<copyField source="authors" dest="authors_search"/>
<uniqueKey>bbid</uniqueKey>
</schema>
Since the Solr schema lives in a separate file, it can be version-controlled, reviewed and deployed independently of the application.
Implementation:
In search.ts:
| Function | What changes |
|---|---|
| init() | Remove Elasticsearch.Client. Store Solr URL as _solrUrl. Ping via fetch(${_solrUrl}/admin/ping). Check collection via Collections API. |
| getDocumentToIndex() | Flatten: aliases into aliases_name: [names], identifiers into identifiers_value: [values] |
| _fetchEntityModelsForESResults() | Rename to _fetchEntityModelsForSolrResults(). Read response.docs[] (flat) instead of hits.hits[]._source. PostgreSQL fetch logic remains unchanged. |
| _searchForEntities() | Replace _client.search(dslQuery) with fetch(${_solrUrl}/select?${params}). Read data.response.numFound for total. |
| searchByName() | Replace ES multi_match DSL with eDisMax params: qf=aliases_name^3 aliases_search disambiguation_search identifiers_value, mm=80%. Type filter becomes fq=type:{type}. |
| autocomplete() | Replace match: { aliases.name.autocomplete } with eDisMax qf=aliases_autocomplete. BBID shortcut becomes q=bbid:"<uuid>". |
| indexEntity() | Replace _client.index({body, id, index, type}) with fetch(${_solrUrl}/update/json/docs?commit=true, {body: [doc]}) |
| deleteEntity() | Replace _client.delete({id}) with fetch(/update?commit=true, {body: {delete: {id}}}) |
| refreshIndex() | Replace _client.indices.refresh() with POST /update?commit=true. |
| generateIndex() | All Postgres fetch logic unchanged. Replace index create/delete with Collections API. Replace bulk POST with /update/json/docs. |
| _bulkIndexEntities() | Replace alternating ES bulk format with plain JSON array to /update/json/docs. Keep 429 retry logic. |
| sanitizeEntityType() | allEntities → null (no fq filter). All other types → fq=type:{type}. |
| checkIfExists() | No changes. |
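The indexEntity() and deleteEntity() rows above can be sketched as small request builders plus a fetch() call (the helper names are hypothetical; only the endpoints and body shapes come from this proposal):

```typescript
// Build the request for indexing one document via Solr's JSON update endpoint.
function buildIndexRequest(solrUrl: string, doc: object) {
  return {
    url: `${solrUrl}/update/json/docs?commit=true`,
    options: {
      body: JSON.stringify([doc]), // plain JSON array, unlike ES's bulk format
      headers: {'Content-Type': 'application/json'},
      method: 'POST',
    },
  };
}

// Build the request for deleting one document by its unique key (bbid).
function buildDeleteRequest(solrUrl: string, bbid: string) {
  return {
    url: `${solrUrl}/update?commit=true`,
    options: {
      body: JSON.stringify({delete: {id: bbid}}),
      headers: {'Content-Type': 'application/json'},
      method: 'POST',
    },
  };
}

// indexEntity()/deleteEntity() then become thin wrappers, e.g.:
//   const {url, options} = buildIndexRequest(_solrUrl, getDocumentToIndex(entity));
//   await fetch(url, options);
```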
search.ts - Solr equivalent:
- getDocumentToIndex()
// For ES this function converts the ORM model into a minimal JSON doc for indexing
// For Solr we flatten the nested structures into multi-valued fields
interface SolrDocument {
  bbid: string;
  type: string;
  aliases_name: string[];
  disambiguation: string;
  identifiers_value: string[];
  authors?: string[];
}

export function getDocumentToIndex(entity: any): SolrDocument {
  const entityType: IndexableEntities = entity.get('type');

  // Extract alias names as a flat string array
  let aliasNames: string[] = [];
  const aliasSet = entity.related('aliasSet')?.related('aliases');
  if (aliasSet) {
    aliasNames = aliasSet.map((alias) => alias.get('name')).filter(Boolean);
  }
  else {
    // Editors, Collections and Areas don't have the standard aliasSet
    // structure, hence we fall back to the 'name' attribute
    const name = entity.get('name');
    if (name) {
      aliasNames = [name];
    }
  }

  // Extract identifier values as a flat string array
  let identifierValues: string[] = [];
  const identifierSet = entity.related('identifierSet')?.related('identifiers');
  if (identifierSet) {
    identifierValues = identifierSet.map((id) => id.get('value')).filter(Boolean);
  }

  const doc: SolrDocument = {
    bbid: entity.get('bbid') || entity.get('id'),
    type: entityType,
    aliases_name: aliasNames, // multi-valued field
    disambiguation: entity.get('disambiguation') || '',
    identifiers_value: identifierValues // multi-valued field
  };

  // Only Works have the authors field
  if (entityType === 'Work') {
    doc.authors = entity.get('authors') || [];
  }

  return doc;
}
Output:
{
  "bbid": "ef212883-aba1-4ba8-9181-3d15d2aa0394",
  "type": "Author",
  "aliases_name": ["J.R.R. Tolkien", "Tolkien", "John Ronald Reuel Tolkien"],
  "disambiguation": "Author of The Lord of the Rings",
  "identifiers_value": ["0000000121463862"]
}
- autocomplete()
export async function autocomplete(
  orm: ORM,
  query: string,
  type: IndexableEntitiesOrAll,
  size = 42
) {
  const params = new URLSearchParams();
  if (commonUtils.isValidBBID(query)) {
    // Direct BBID lookup
    params.set('q', `bbid:"${query}"`);
  }
  else {
    // Autocomplete via edge ngram field
    params.set('q', query);
    params.set('defType', 'edismax');
    params.set('qf', 'aliases_autocomplete');
    params.set('mm', '80%');
  }
  params.set('rows', String(size));
  params.set('wt', 'json');
  const sanitizedType = sanitizeEntityType(type);
  if (sanitizedType) {
    params.set('fq', `type:${sanitizedType}`);
  }
  const searchResponse = await _searchForEntities(orm, params);
  return searchResponse.results;
}
- searchByName()
// aliases.name.search -> aliases_search (multi-field -> copyField)
// disambiguation -> disambiguation_search (NGram analyzed copy for fuzzy matching)
// identifiers.value -> identifiers_value (flattened from nested array)
// authors -> authors_search (NGram analyzed copy)
export function searchByName(
  orm: ORM,
  name: string,
  type: IndexableEntitiesOrAll,
  size?: number,
  from?: number
) {
  const sanitizedType = sanitizeEntityType(type);
  const params = new URLSearchParams();
  params.set('q', name);
  params.set('defType', 'edismax');
  params.set('mm', '80%');
  params.set('start', String(from || 0));
  params.set('rows', String(size || 10));
  params.set('wt', 'json');
  params.set('fl', 'bbid,type,aliases_name,disambiguation,score');

  let qf = 'aliases_name^3 aliases_search disambiguation_search identifiers_value';
  const includesWork = !sanitizedType || sanitizedType === 'Work';
  if (includesWork) {
    qf += ' authors_search';
  }
  params.set('qf', qf);

  if (sanitizedType) {
    params.set('fq', `type:${sanitizedType}`);
  }
  return _searchForEntities(orm, params);
}
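Both functions above delegate to _searchForEntities(); its response-parsing half could look like this sketch (the wrapper follows Solr's standard JSON response shape; the doc fields come from the proposed schema and the interface names are illustrative):

```typescript
interface SolrDoc {
  bbid: string;
  type: string;
  aliases_name?: string[];
  disambiguation?: string;
  score: number;
}

// Solr nests hits under response.docs, replacing ES's
// hits.hits[]._source unwrapping.
interface SolrSelectResponse {
  response: {
    numFound: number;
    docs: SolrDoc[];
  };
}

function parseSearchResponse(data: SolrSelectResponse) {
  return {
    results: data.response.docs, // already flat documents
    total: data.response.numFound,
  };
}
```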
Timeline
| Period | Work |
|---|---|
| Community Bonding (May 1–24) | Finalize schema with mentor feedback. Confirm Solr version and standalone vs SolrCloud preference for dev. |
| Week 1–2 | Write schema.xml + solrconfig.xml. Add Solr to Docker. Verify the core starts; index sample data in the Admin UI. |
| Week 3–4 | Rewrite init(), getDocumentToIndex(), _bulkIndexEntities(), generateIndex(). Run a full re-index, verify document counts. |
| Week 5–6 | Rewrite searchByName(), autocomplete(), _fetchEntityModelsForSolrResults(). Manual search testing across all entity types. |
| Week 7 | Rewrite indexEntity(), deleteEntity(), refreshIndex(). Verify real-time updates work end-to-end. |
| Week 8 | Update search.tsx response parsing. Remove @elastic/elasticsearch from package.json. |
| Midterm | All search functionality and single-entity indexing work end-to-end on a standalone Solr container. Code is clean and passes basic tests. |
| Week 9–10 | Test suite update and fixes. Integration tests against real Solr. Manual QA of all search features. |
| Week 11 | SolrCloud setup with ZooKeeper in Docker. Verify application works without code changes. Document setup. |
| Week 12 | Buffer: edge cases, error handling, performance sanity check (re-index time comparison ES vs Solr). PR polish. |
| Final submission | Code cleanup, final documentation for dev setup, and merging the Solr infrastructure into production-ready state. |
Stretch goals:
- Faceted search (entity type counts in results)
- Spellcheck via Solr's /spell component
- Benchmark report comparing ES 5 and Solr 9.
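The faceted-search stretch goal maps to just two extra Solr query parameters; a sketch (query values illustrative, field name from the proposed schema):

```typescript
// facet=true + facet.field=type asks Solr to return per-entity-type
// counts alongside the normal result list, in a single request.
const facetParams = new URLSearchParams({
  defType: 'edismax',
  q: 'tolkien',
  qf: 'aliases_name^3',
  facet: 'true',
  'facet.field': 'type',
  rows: '10',
});
```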
Detailed Information About Yourself
My name is Amaan Pathan. I'm in the final year of my bachelor's in Computer Science and Engineering at the University of Lucknow. I have worked across the backend stack, from simple CRUD apps to queues and RPC.
Community Affinities
What type of books do you read? (Please list a series of BBIDs as examples)
I read a variety of genres, especially thrillers and classics. My latest read was The Nose by Gogol and before that I had read Kafka On The Shore by Murakami and The Idiot by Dostoyevsky. My favorite author is Agatha Christie and my favorite work of hers is ABC Murders.
What type of music do you listen to? (Please list a series of MBIDs as examples)
My music ranges all the way from Sufi to Rock. Some of my favorite songs are Gulon Mein Rang Bhare by Mehdi Hassan, Jesus of Suburbia by Green Day, and Heaven or Las Vegas by Cocteau Twins. I love discovering new music as well, so I'm always open to recommendations.
What aspects of BookBrainz interest you the most?
I find it really cool that BookBrainz has kept all this information open source and is always open for contributions. I even revised/added one title I couldn't find on the site: White Nights by Dostoyevsky.
Have you ever used MusicBrainz Picard to tag your files or used any of our projects in the past?
I haven't yet, but I plan on using it since I will be shifting to offline media instead of Spotify.
Programming precedents
When did you first start programming?
I started in 4th grade by making simple apps in QBasic. I even tried my hand at C, but my brain wasn't developed enough at that time. Then in high school, I started again with Python. More recently I've been working with JavaScript and Go.
Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?
I contributed to bluewave-labs back in my 2nd year of college, but they seem to have deleted or made the guidefox repo private since it is a graduated project now. Here are a couple of my PRs that got merged: PR346, PR497. Since the PRs are unavailable, here's a short summary of what I added:
- Added schema-based input validations using Zod on both frontend and backend
- Introduced Data Transfer Objects (DTOs) for clean separation of request logic and validation
- Refactored conditional logic in React components with Zod-powered form schema validation
- Collaborated on open-source PRs in a modular codebase following industry-standard practices
What sorts of programming projects have you done on your own time?
- I recently built a Zapier-like automation engine - repo. I've added Discord, Slack, HTTP request and email integrations to this engine, along with workflow chaining and payload passing. I wrote it in Go to better understand goroutines and building scalable backend systems. This is the final-year project for my bachelor's.
- I've also built an HTTP server from scratch in Go - repo. It includes RFC-compliant request body parsing, chunked transfer encoding, and a TCP server with concurrent connection handling.
- I also made a fun personal project - site. It returns the noteworthy or popular landmark closest to a user-clicked point on an Earth map, along with the associated Wikipedia page. I built it using Three.js and Next.js.
Practical requirements
What computer(s) do you have available for working on your SoC project?
I have a Xiaomi notebook with 16GB of RAM, an Intel i5 (11th gen) and Intel integrated graphics. I'm running Ubuntu 24.04 LTS.
How much time do you have available per week, and how would you plan to use it?
I plan on working 35-40 hours a week, as I graduate before the summer. I can put in 5-6 focused hours a day.