GSoC 2024 - BookBrainz: Use Solr search server

PROPOSAL

Contact Information

Project Overview

Main Goal:- A functional multi-entity search server with the same features as the existing search functionality

Other MetaBrainz projects use Solr search server, while BookBrainz was created using ElasticSearch and has not evolved since. This creates some extra overhead by running two separate search infrastructures and prevents us from optimizing resources.

For this project, you would entirely replace the search server infrastructure and adapt the existing search to work with Solr. This makes for a project relatively isolated from the rest of the website, the only surface of contact being this file handling most of the indexing and ElasticSearch-specific logic, and this file which adds the website routes that allow users and the website to interact with the search server.

One relevant point of detail is that we want to maintain multi-entity search (search for authors, works, editions, etc. all in one go), compared to the MusicBrainz search, for example, which requires selecting an entity type before performing a search. This would need to be investigated.

Project Duration:- 300 Hours
Proposed Mentors:- monkey, lucifer
Languages/skills:- Javascript (Typescript), Solr

Terminology

Let’s start with the common terminology we come across when studying search infrastructure. Most of these terms are shared between Elasticsearch and Solr, with the differences discussed in the coming sections:-

  • Index- An index is a collection of documents, often with a similar structure, used to store and read documents. It’s the equivalent of a database in an RDBMS (relational database management system).
  • Document- A document is a basic unit of information. Documents can be stored and indexed.
  • Field- The field stores the data in a document holding a key-value pair, where key states the field name and value the actual field data. Each document is a collection of fields.
  • Shard- Shards allow you to split and store your index into one or more pieces.

Existing Infrastructure

In the current scenario, BookBrainz uses ElasticSearch as its search engine. ElasticSearch is a distributed, RESTful search and analytics engine built on top of Apache Lucene.
We use the Node.js client @elastic/elasticsearch to integrate Elasticsearch functionality.

Search Fundamentals and the way they are done in BB

Index generation:- Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. When mapping your data, you create a mapping definition, which contains a list of fields relevant for our index.
We use explicit mapping to define data (as the name suggests, the mappings are explicitly written).
Mappings: aliases, authors, disambiguation

When a person enters data on the BookBrainz website, it is first saved into the PostgreSQL database using BookshelfJS transactions. When the transaction completes, the server initiates the indexing of this data into Elasticsearch.
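A minimal sketch of that hand-off (the persistence wiring here is illustrative; the actual indexing logic lives in src/common/helpers/search.ts, discussed later):

// Illustrative sketch only: save the entity inside a Bookshelf transaction,
// then index it once the transaction has committed.
import * as search from '../common/helpers/search'; // module referenced later in this proposal

async function saveAndIndexEntity(orm: any, EntityModel: any, entityData: Record<string, any>) {
  const entity = await orm.bookshelf.transaction(
    (transacting: any) => new EntityModel(entityData).save(null, {transacting})
  );
  // Index only after the commit, so the search index never references
  // rows that could still be rolled back.
  await search.indexEntity(entity);
  return entity;
}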

Text analysis:- is the process of converting unstructured text, like the body of an email or a product description, into a structured format that’s optimized for search. It is performed by an analyzer, a set of rules that govern the entire process. An analyzer is a package which contains three lower-level building blocks: character filters, tokenizers, and token filters.

  • A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.
  • A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.
  • A token filter receives the token stream and may add, remove, or change tokens.

In BookBrainz,
Tokenizers used are:-
→ The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word. Its parameters are min_gram (minimum length of characters in a gram; 1 here), max_gram (maximum length; 2 here) and token_chars (character classes that should be included in a token; Elasticsearch will split on characters that don’t belong to the classes specified; defaults to [], i.e., keep all characters).
→ The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.

Token filters used are:-
→ asciifolding- Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (first 127 ASCII characters) to their ASCII equivalent, if one exists.
→ lowercase- Changes token text to lowercase
→ edge_ngram- Forms an n-gram of a specified length from the beginning of a token.
→ ngram- Forms n-grams of specified lengths from a token.

The ignore_malformed parameter has been set to true, which allows the exception thrown when we try to index the wrong data type into a field to be ignored. The malformed field is not indexed, but other fields in the document are processed normally.

Searching:- A search query, or query, is a request for information about data in Elasticsearch data streams or indices. A search consists of one or more queries that are combined and sent to Elasticsearch. Documents that match a search’s queries are returned in the hits, or search results, of the response. Elasticsearch provides a full Query DSL (Domain Specific Language) based on JSON to define queries.
It sorts matching search results by relevance score, which measures how well each document matches a query. The relevance score is a positive floating point number, returned in the _score metadata field of the search API. While each query type can calculate relevance scores differently, score calculation also depends on whether the query clause is run in a query or filter context.
Query context- Besides deciding whether or not the document matches, the query clause also calculates a relevance score in the _score metadata field.

In BB, full-text queries are used. Full-text queries enable you to search analysed text fields such as the body of an email. The query string is processed using the same analyzer that was applied to the field during indexing.
Match query- Returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching.
Multi-Match query- builds on the match query to allow multi-field queries.
The way the multi_match query is executed internally depends on the type parameter. In BookBrainz cross-fields type is used. It is a term-centric approach which first analyzes the query string into individual terms, then looks for each term in any of the fields, as though they were one big field.

In the Elasticsearch query DSL, individual fields can be boosted using the caret (^) notation. The boost factor is a floating-point number that determines the significance of the field in the relevance scoring. When Elasticsearch calculates the relevance score for a document, it considers the boost factor assigned to each field. Documents containing matches in fields with higher boost factors will be scored more favorably and are likely to appear higher in the search results.

Queries have a minimum_should_match parameter which indicates that this percent of the total number of optional clauses are necessary. The number computed from the percentage is rounded down and used as the minimum.
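As an illustration, the query body BookBrainz sends looks roughly like the following (field names, boost values and the minimum_should_match percentage are indicative, not the exact production values):

// Indicative shape of the cross_fields multi_match query with boosts;
// the exact fields and values in BookBrainz may differ.
const searchBody = {
  query: {
    multi_match: {
      query: 'harry potter',
      type: 'cross_fields',           // term-centric: each term may match in any of the fields
      fields: [
        'aliases.name^3',             // caret boost: alias matches weigh more in the relevance score
        'authors',
        'disambiguation'
      ],
      minimum_should_match: '80%'     // at least 80% of the optional term clauses must match
    }
  }
};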

Complete Process

The majority of the code for searching is the same for the website and the BookBrainz API:

  • Initialization:- On app startup, the init function is called. This function initializes the Elasticsearch client with the provided options and checks if the main index exists. If the index doesn’t exist, it generates the index using the generateIndex function.
  • Index Generation:- The generateIndex function fetches data through the ORM of the database (initialised during app setup from bookbrainz-data: const app = express(); app.locals.orm = BookBrainzData(config.database);). It processes this data, prepares documents for indexing for each entity, and bulk indexes them into Elasticsearch. It also handles special cases like Areas, Editors, and UserCollections separately, due to differences in their data structures. It then refreshes the index to make the changes visible.
  • Bulk Indexing and Retry Logic:- It indexes 10000 entities at a time. The bulk API makes it possible to perform many index/delete operations in a single API call. The code handles the scenario where indexing may fail due to too many requests, and it retries failed indexing operations after a small delay to ensure that all entities are eventually indexed (see the sketch after this list).
  • Indexing single entity:- The function indexEntity is responsible for indexing a single entity into Elasticsearch. It prepares a document for the required entity and then indexes it.
  • Searching entities:- Search operations are handled by functions like autocomplete and searchByName. These functions construct Elasticsearch DSL queries based on the search parameters provided (e.g., query string, entity type) and execute the queries using the Elasticsearch client. The search results are then processed and returned in a format suitable for the application.
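A simplified sketch of the chunk-and-retry pattern mentioned above (the indexBatch callback stands in for the actual Elasticsearch bulk call; the names are illustrative):

// Index documents in batches and retry batches rejected with HTTP 429
// ("too many requests") after a small delay.
const BATCH_SIZE = 10000;

async function bulkIndexWithRetry(
  documents: Record<string, any>[],
  indexBatch: (batch: Record<string, any>[]) => Promise<void>
) {
  const failedBatches: Record<string, any>[][] = [];
  for (let offset = 0; offset < documents.length; offset += BATCH_SIZE) {
    const batch = documents.slice(offset, offset + BATCH_SIZE);
    try {
      await indexBatch(batch);
    }
    catch (error: any) {
      if (error?.statusCode === 429) {
        failedBatches.push(batch);   // keep the batch aside and retry it later
      }
      else {
        throw error;
      }
    }
  }
  for (const batch of failedBatches) {
    await new Promise((resolve) => setTimeout(resolve, 5000)); // small delay before retrying
    await indexBatch(batch);
  }
}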

Project Goals:

  • Set up and configure Solr Search Server within BookBrainz infrastructure.
  • Implement efficient data indexing mechanisms in Solr to ensure timely updates and synchronization with the BookBrainz database (Search Index Rebuilder)
  • Configure Solr to optimize query performance with the support of Solr’s built-in features like caching, query parsing and warming
  • Multi-entity search support (explained in implementation part)
  • Configure Solr to give optimal responses for search queries, implementing features like caching, query warming and parsing (Search Server Schema)
  • Ensure that the multi-entity search functionality works well, enabling users to search for authors, works, editions, and other entities.
  • Implement custom tokenizers and filters in Solr to accommodate the specific requirements of BookBrainz search functionality, including language-specific analysis and stemming. (As is done in the MusicBrainz Query Parser.) This again needs to be discussed in detail: particular needs, current behaviour, and a comparison with the standard ones provided by Lucene.
  • Integrate the search server with the existing BookBrainz codebase and ensure things are working properly with the new setup. This would need changes to the files mentioned above.
  • Conduct rigorous testing of Solr implementation, including unit tests, integration tests, and end-to-end tests, to validate functionality and performance
  • Setup and configuration of Solr ZooKeeper ("Although Solr comes bundled with Apache ZooKeeper, you are strongly encouraged to use an external ZooKeeper setup in production")
  • Configure SolrCloud and extend all the working functions from standalone Solr instance to SolrCloud. It includes indexing, querying and sending results. (Major task)
  • Manage SolrCloud distributed routes for multi-entity search
  • Understand monitoring Solr using the Metrics API
  • Using Prometheus and Grafana for metrics storage and data visualization
  • Dockerize the whole search infrastructure

My goal throughout would be to ensure that all of these goals are completed, but the priority would be to set up a complete Solr server comparable to the present search infrastructure. I am confident that, with the support of the community and the guidance of my mentors, I will be able to complete the project well in time with the desired results.


Implementation:

Solr

Apache Solr (stands for Searching On Lucene w/ Replication) is a free, open-source search engine based on the Apache Lucene library. Lucene is a scalable and high-performance Java-based library used to index and search virtually any kind of text. The Lucene library provides the core operations required by any search application. Compared to the Elasticsearch setup, the setup process is similar yet different (oxymorons <3).

Configuration and setting up

Solr has several configuration files, majorly written in XML. Overall structure:-

  • solr.xml:- specifies configuration options for your Solr server instance.

    <solr>
    	<solrcloud>
    		<!--explained later-->
    	</solrcloud>
    	<str name="sharedLib">lib</str>
    	<metrics>
        <!--explained later-->
      </metrics>
    </solr>
    

    It sets the lib folder as sharedLib, which would be used to store and configure the Query Response Writer (if written separately).
    The cores to be created would be ‘Author’, ‘Edition’, ‘EditionGroup’, ‘Publisher’, ‘Series’, ‘Work’, ‘Editor’, ‘Collection’ and ‘Area’, of which the first six are BB entities.

  • core.properties- defines specific properties for each core such as its name, the collection the core belongs to, the location of the schema, and other parameters.
    It is a simple Java Properties file where each line is just a key-value pair, e.g., name=core1.
    The minimal core.properties file is an empty file, in which case all of the properties are defaulted appropriately (for example, the core name defaults to the folder name).
    To create a new core, simply add a core.properties file in the directory.

  • solrconfig.xml- controls the high-level behavior of the Solr core it is present in.

    <luceneMatchVersion>9.0.0</luceneMatchVersion>
    <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}" />
    <dataDir>${solr.home}/data/${solr.core.name:}</dataDir>
    <codecFactory name="CodecFactory" class="solr.SchemaCodecFactory" />
    <schemaFactory class="ClassicIndexSchemaFactory" />
    <updateHandler class="solr.DirectUpdateHandler2">
        <!-- to be discussed -->
    </updateHandler>
    <query>
      <queryResultCache class="solr.CaffeineCache" size="32768" initialSize="8192" autowarmCount="512" maxRamMB="200" />
      <documentCache class="solr.CaffeineCache" size="32768" initialSize="8192" autowarmCount="4096" />
    </query>
    

    dataDir and directoryFactory indicate the storage location of the index and other data files. solr.NRTCachingDirectoryFactory is filesystem-based and tries to pick the best implementation for the current JVM and platform.
    schemaFactory tells how the schema file will be configured. It can either be managed or manual. For the later part of the project, we’d need to shift from classic to managed.

  • schema.xml- stores details about the types and fields which are essential for the given core. uniqueKey specifies the field which is unique for each document. Including it is necessary for updating documents. UUIDUpdateProcessorFactory can be directly used for this.
    Similarity is a Lucene class used to score a document in searching. SchemaSimilarityFactory allows individual field types to be configured with a “per-type” specific Similarity and implicitly uses BM25Similarity for any field type which does not have an explicit Similarity.
    FieldType defines the analysis that will occur on a field when documents are indexed or queries are sent to the index. Fields are then defined based on the defined FieldTypes.

There are certain Solr concepts which need to be discussed prior to setup:-

Commits:- These are configured in the updateHandler settings, which affect how updates are done internally.
After data has been indexed, it must be committed before searches can see it. Commits can either be sent manually, or we can use the autoCommit option, which we would consider for the project as it offers much more control over the commit strategy.
There are 2 types- hard commit and soft commit. Determining the best autoCommit settings is a tradeoff between performance and accuracy. Settings that cause frequent updates will improve the accuracy of searches because new content will be searchable more quickly, but performance may suffer because of the frequent updates. Less frequent updates may improve performance but it will take longer for updates to show up in queries. The realtime get feature allows retrieval (by unique-key) of the latest version of any documents without the associated cost of reopening a searcher. These requests can be handled using an implicit /get request handler:-

<requestHandler name="/get" class="solr.RealTimeGetHandler">
  <lst name="defaults">
    <str name="omitHeader">true</str>
  </lst>
</requestHandler>

updateHandler settings:-

<updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.solr.home}/data/${solr.core.name:}</str>
    </updateLog>
    <autoCommit>
	    <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
		</autoCommit>
		<autoSoftCommit>
		  <maxTime>${solr.autoSoftCommit.maxTime:10000}</maxTime>
		</autoSoftCommit>
</updateHandler>

Caches:- Cache management is critical to a successful Solr implementation. By default cached Solr objects do not expire after a time interval; instead, they remain valid for the lifetime of the Index Searcher (this can be changed using maxIdleTime).
Solr comes with a default SolrCache implementation that is used for different types of caches

  • The queryResultCache holds the results of previous searches: ordered lists of document IDs (DocList) based on a query, a sort, and the range of documents requested.
  • The documentCache holds Lucene Document objects (the stored fields for each document). This cache is not autowarmed.

Request Handlers:- A request handler processes requests coming to Solr. These might be query requests, index update requests or specialized interactions. If a request handler is not expected to be used very often, it can be marked with startup="lazy" to avoid loading until needed.
Defaults:- The parameters there are used unless they are overridden by any other method.
Appends:- you can define parameters that are added to those already defined elsewhere
Invariants:- you can define parameters that cannot be overridden by a client

initParams:- allows you to define request handler parameters outside of the handler configuration. This helps you keep a single definition of the properties that are used across multiple handlers. The properties and configuration mirror those of a request handler. It can include sections for defaults, appends, and invariants, the same as any request handler. We can write these in a file to be stored in each core (like request-params.xml in MusicBrainz, which will be discussed later along with search handlers when talking about searching).

Request dispatcher:- controls the way the Solr HTTP RequestDispatcher implementation responds to requests. The requestParsers sub-element controls values related to parsing requests.

Update Request Processors:- All the update requests sent to Solr are passed through a series of plug-ins known as URPs. A chain consists of several Update Processor Factories (UPFs, as I call them), which can be configured based on our use case. There are 3 default UPFs which we should remember not to remove in case we define our own URP:-

LogUpdateProcessorFactory - Tracks the commands processed during this request and logs them
DistributedUpdateProcessorFactory - Responsible for distributing update requests to the right node e.g., routing requests to the leader of the right shard and distributing updates from the leader to each replica. This processor is activated only in SolrCloud mode.
RunUpdateProcessorFactory - Executes the update using internal Solr APIs.

ScriptUpdateProcessorFactory:- allows Java scripting engines to be used during Solr document update processing. It is implemented as an UpdateProcessor to be placed in an UpdateChain.

<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Document Analysis:- The process is quite similar to how it happens in ElasticSearch.
An analyzer examines the text of fields and generates a token stream. Tokenizers break field data into lexical units, or tokens. Filters examine a stream of tokens and keep them, transform or discard them, or create new ones.

Analyzers:- defined as a child of the <fieldType> element in the schema.xml file. In normal usage, only fields of type solr.TextField or solr.SortableTextField will specify an analyzer.
Analysis takes place at two instances: at index time, when a field is being created, the token stream that results from analysis is added to the index and defines the set of terms (including positions, sizes, and so on) for the field. At query time, the values being searched for are analyzed and the terms that result are matched against those that are stored in the field’s index. The analysis rules can be the same in both cases or different. The first case is desirable when you want to query for exact string matches, possibly with case-insensitivity, for example. In other cases, you may want to apply slightly different analysis steps during indexing than those used at query time.
For our use case, it would be better to have separate analyzers.

Tokenizers- read from a character stream (a Reader) and produce a sequence of token objects (a TokenStream). A filter looks at each token in the stream sequentially and decides whether to pass it along, replace it, or discard it. The order in which the filters are specified is significant. Typically, the most general filtering is done first, and later filtering stages are more specialized. Filters are configured with a <filter> element in the schema file as a child of <analyzer>, following the <tokenizer> element.

CharFilter- is a component that pre-processes input characters. They can be chained like Token Filters and placed in front of a Tokenizer.

Stemming- a set of mapping rules that maps the various forms of a word back to the base, or stem, word from which they derive. For example, in English the words “hugs”, “hugging” and “hugged” are all forms of the stem word “hug”. The stemmer will replace all of these terms with “hug”, which is what will be indexed. This means that a query for “hug” will match the term “hugged”, but not “huge”. A likely candidate here is the Snowball Porter Stemmer Filter.
KeywordRepeatFilterFactory emits each token twice, once with the KEYWORD attribute and once without. If placed before a stemmer, the result will be that the unstemmed token is preserved at the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected. To configure, add the KeywordRepeatFilterFactory early in the analysis chain. It is recommended to also include RemoveDuplicatesTokenFilterFactory to avoid duplicates when tokens are not stemmed.

Phonetic Matching- algorithms may be used to encode tokens so that two different spellings that are pronounced similarly will match. There are a few predefined filters for phonetic matching. This would need to be discussed well and tested on results before applying.

Language Analysis:- Currently, BookBrainz is only in the English language, thus there won’t be any need to set up an ICU filter. But since it’s a planned project for this summer, I would be more than happy to update the files accordingly in the future.

We can modify the rules and re-build our own tokenizer with JFlex. But based on our use case and the current text-analysis strategy implemented in ElasticSearch (which works quite well), I feel we can simply map some of the characters before tokenization with a CharFilter. This again needs to be discussed and then implemented. The code I’ve written here uses the standard ones only, and we would need to improve upon this.

<types>
<!-- general types used almost everywhere -->
  <fieldtype name="string" class="solr.StrField" sortMissingLast="false" />
  <fieldType name="long" class="solr.LongPointField" positionIncrementGap="0" />
  <fieldType name="bbid" class="solr.UUIDField" omitNorms="true" />
  <fieldType name="storefield" class="solr.StrField" />
  <fieldType name="bool" class="solr.BoolField" />
  <fieldType name="date" class="solr.DateRangeField" sortMissingLast="false" />
  <fieldType name="int" class="solr.IntPointField" sortMissingLast="false" />
  <fieldType name="float" class="solr.FloatPointField" />

<!-- sample fieldType for an edge N-gram analyser -->
  <fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.EdgeNGramTokenizerFactory" />
      <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
      <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1"
        generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"
        splitOnCaseChange="0" preserveOriginal="0" splitOnNumerics="0" stemEnglishPossessive="0"/>
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.EdgeNGramTokenizerFactory" />
      <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
      <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1"
        generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"
        splitOnCaseChange="0" preserveOriginal="0" splitOnNumerics="0" stemEnglishPossessive="0"/>
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="10" />
    </analyzer>
  </fieldType>
</types>

Indexing:-

After having set up the Solr server, the next step would be to set up a system for indexing the data.

Solr uses Lucene to create an “inverted index”: it inverts a page-centric data structure (documents ⇒ words) into a keyword-centric structure (word ⇒ documents). We can visualise this as the index at the end of a book, where the page numbers on which certain important words occur are listed. Solr therefore achieves faster responses because it searches for keywords in the index instead of scanning the text directly. For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. To cater to these, DocValues fields are column-oriented fields with a document-to-value mapping built at index time.
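As a toy illustration of the idea (not Solr code, just the data structure):

// Toy inverted index: map each term to the set of documents containing it.
const docs: Record<string, string> = {
  d1: 'the hobbit',
  d2: 'the lord of the rings'
};

const invertedIndex: Record<string, Set<string>> = {};
for (const [id, text] of Object.entries(docs)) {
  for (const term of text.split(/\s+/)) {
    (invertedIndex[term] ??= new Set()).add(id);
  }
}
// invertedIndex.the -> Set {'d1', 'd2'}: looking a term up is now a map access
// instead of a scan over every document's text.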

Before indexing the data, it needs to be imported and organized the way we want, so that searches can be performed as intended and on the fields we want. The current code present in src/common/helpers/search.ts can be used for this specific purpose, unless we want to change the fields to search on, which could be done in due course if required.

const baseRelations = [
		'annotation',
		'defaultAlias',
		'aliasSet.aliases',
		'identifierSet.identifiers'
	];

	const entityBehaviors = [
		{
			model: Author,
			relations: [
				'gender',
				'beginArea',
				'endArea'
			]
		},
		{
			model: Edition,
			relations: [
				'editionGroup',
				'editionFormat',
				'editionStatus'
			]
		},
		{model: EditionGroup, relations: []},
		{model: Publisher, relations: ['area']},
		{model: Series, relations: ['seriesOrderingType']},
		{model: Work, relations: ['relationshipSet.relationships.type']}
	];

	
	const behaviorPromise = entityBehaviors.map(
		(behavior) => behavior.model.forge()
			.query((qb) => {
				qb.where('master', true);
				qb.whereNotNull('data_id');
			})
			.fetchAll({
				withRelated: baseRelations.concat(behavior.relations)
			})
	);
	const entityLists = await Promise.all(behaviorPromise);
	/* eslint-disable @typescript-eslint/no-unused-vars */
	const entityFetchOrder:EntityTypeString[] = ['Author', 'Edition', 'EditionGroup', 'Publisher', 'Series', 'Work'];
	const [authorsCollection,
		editionCollection,
		editionGroupCollection,
		publisherCollection,
		seriesCollection,
		workCollection] = entityLists;
	/* eslint-enable @typescript-eslint/no-unused-vars */
	const listIndexes = [];

	workCollection.forEach(workEntity => {
		const relationshipSet = workEntity.related('relationshipSet');
		if (relationshipSet) {
			const authorWroteWorkRels = relationshipSet.related('relationships')?.filter(relationshipModel => relationshipModel.get('typeId') === 8);
			const authorNames = [];
			authorWroteWorkRels.forEach(relationshipModel => {
				const source = authorsCollection.get(relationshipModel.get('sourceBbid'));
				const name = source?.related('defaultAlias')?.get('name');
				if (name) {
					authorNames.push(name);
				}
			});
			workEntity.set('authors', authorNames);
		}
	});
	
	// except for these 6 types, we also want to perform search on 'area', 'editor' and
	// 'collection' which have their schema slightly differently arranged, hence we'd need to
	// organise this data in a similar way prior to indexing
	const areaCollection = await Area.forge()
		.fetchAll();

	const areas = areaCollection.toJSON({omitPivot: true});
	const processedAreas = areas.map((area) => ({
		aliases: [
			{name: area.name}
		],
		id: area.gid,
		type: 'Area'
	}));

	const editorCollection = await Editor.forge()
		.where('type_id', 1)
		.fetchAll();
	const editors = editorCollection.toJSON({omitPivot: true});
	const processedEditors = editors.map((editor) => ({
		aliases: [
			{name: editor.name}
		],
		id: editor.id,
		type: 'Editor'
	}));

	const userCollections = await UserCollection.forge().where({public: true})
		.fetchAll();
	const userCollectionsJSON = userCollections.toJSON({omitPivot: true});
	const processedCollections = userCollectionsJSON.map((collection) => ({
		aliases: [
			{name: collection.name}
		],
		id: collection.id,
		type: 'Collection'
	}));
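Each of these prepared entities would then be flattened into a plain document whose field names match whatever the schema ends up defining; a rough, hypothetical sketch:

// Hypothetical flattening of an ORM entity into a Solr document; the exact
// field names must match the fields declared in schema.xml.
function entityToSolrDoc(entity: any): Record<string, any> {
	const json = entity.toJSON({omitPivot: true});
	return {
		id: json.bbid,                                           // uniqueKey
		type: json.type,                                         // 'Author', 'Work', ...
		default_alias: json.defaultAlias?.name,
		aliases: (json.aliasSet?.aliases ?? []).map((alias: any) => alias.name),
		disambiguation: json.disambiguation?.comment,
		authors: json.authors ?? []                              // set for Works in the snippet above
	};
}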

Next, a solr connection needs to be set up from our application for sending the index and search requests.

Since the BookBrainz project is mainly written in Javascript/Typescript, it’d make sense to use the same for the search server setup. Using Solr from JavaScript clients is so straightforward that it is specifically mentioned in the official Solr documentation. In fact, it is so straightforward that there is no client API: you don’t need to install any packages or configure anything.
HTTP requests can be sent to Solr using the standard XMLHttpRequest mechanism.

There are also some client libraries for this purpose, of which solr-node-client is the most popular and the one I wanted to use, but its code hasn’t been updated for Solr v9.x yet, hence there might be certain problems in using it. Thus I propose using a code style similar to solr-node-client (GitHub - lbdremy/solr-node-client: A solr client for node.js) which would be consistent with the latest version of Solr for the purpose of this project.

Since the latest versions of Node.js support fetch requests by default, I’d be using that rather than setting up a separate Undici connection as is done in solr-node-client.
Relevant code snippets:-

export function createClient(options: FullSolrClientParams = {}) {
  return new Client(options);
}

The different Solr requests would be defined using a client class with default options (which define port, protocol, core, etc.) and methods for adding documents, updating them and sending commit messages. This can be extended to include other methods if needed:-

private async doRequest<T = JsonResponseData>(
    path: string,
    method: 'GET' | 'POST',
    body: string | null,
    bodyContentType: string | null,
    acceptContentType: string
  ): Promise<T> {
    const protocol = this.options.protocol;
    const url = `${protocol}://${this.options.host}:${this.options.port}${path}`;
    const requestOptions: RequestOptions = {
      headers: {
        'accept': acceptContentType,
        ...(method === 'POST' && bodyContentType && { 'content-type': bodyContentType }),
        ...(body && { 'content-length': String(Buffer.byteLength(body)) }),
      },
      body: method === 'POST' ? body : undefined,
    };

    const response: Response = await fetch(url, {
      method,
      ...requestOptions,
    });

    const text: string = await response.text();

    if (!response.ok) {
      throw new Error(`Request HTTP error ${response.status}: ${text}`);
    }

    return JSON.parse(text);
  }

  update<T>(
    data: Record<string, any>,
    queryParameters?: Record<string, any>
  ): Promise<T> {
    const path = this.getFullHandlerPath(this.UPDATE_JSON_HANDLER);
    const queryString = querystring.stringify({
      ...queryParameters,
      wt: 'json',
    });

    return this.doRequest<T>(
      `${path}?${queryString}`,
      'POST',
      JSON.stringify(data),
      'application/json',
      'application/json; charset=utf-8'
    );
  }

  add(
    docs: Record<string, any> | Record<string, any>[],
    queryParameters?: Record<string, any>
  ): Promise<AddResponse> {
    // format `Date` object into string understood by Solr as a date.
    docs = format.dateISOify(docs);
    docs = Array.isArray(docs) ? docs : [docs];
    return this.update<AddResponse>(docs, queryParameters);
  }

  commit(options?: Record<string, any>): Promise<JsonResponseData> {
    return this.update({
      commit: options || {},
    });
  }

Searching

After having indexed the documents, the last step is to correctly configure the ranking of fields in Solr. This is handled by a query parser:

A query parser converts a user’s search terms into a Lucene query to find appropriately matching documents. For use cases like ours, it makes sense to use the Extended DisMax Query Parser (eDisMax).
DisMax stands for Maximum Disjunction. It is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field. Extended DisMax (eDisMax) query parser is an improved version of it.

<!-- solrconfig.xml -->
<requestHandler name="/search" class="solr.SearchHandler" default="true">
  <lst name="invariants">
    <str name="defType">edismax</str>
    <str name="timeAllowed">3000</str>
  </lst>
</requestHandler>

On the basis of field types defined in schema.xml, we define a scoring logic in order to score documents for searching later. There are certain parameters which are defined for this based on the query parser used-

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <str name="fl">score,_store</str>
  <str name="qf"> </str>
  <str name="pf">  </str>
  <str name="bf"> </str>
</lst>

The echoParams parameter controls what information about request parameters is included in the response header. The echoParams parameter accepts the following values:
explicit: Only parameters included in the actual request will be added to the params section of the response header.

fl (Field List) parameter limits the information included in a query response to a specified list of fields. The fields must be either stored=“true” or docValues=“true”. They can be specified as a space-separated or comma-separated list of field names. The string “score” can be used to indicate that the score of each document for the particular query should be returned as a field.
The qf (Query Field) parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field’s importance in the query.
The pf (phrase Fields) parameter can be used to “boost” the score of documents in cases where all of the terms in the q parameter appear in close proximity. The format is the same as that used by the qf parameter: a list of fields and “boosts” to associate with each of them when making phrase queries out of the entire q parameter.
bq (boost query) parameter specifies an additional, optional, query clause that will be added to the user’s main query as optional clauses that will influence the score
bf (boost functions) parameter specifies functions (with optional query boost) that will be used to construct FunctionQueries which will be added to the user’s main query as optional clauses that will influence the score.

In the requests sent, the q parameter defines the main “query” constituting the essence of the search. The parameter supports raw input strings provided by users with no special escaping. The + and - characters are treated as “mandatory” and “prohibited” modifiers for terms. By default, all words or phrases specified in the q parameter are treated as “optional” clauses unless they are preceded by a “+” or a “-”. When dealing with these “optional” clauses, the mm (Minimum Should Match) parameter makes it possible to say that a certain minimum number of those clauses must match.
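For illustration, here is roughly how the website could send such a request to the /search handler defined above (the handler path, field names and boosts are placeholders to be tuned against the current Elasticsearch behaviour):

// Illustrative edismax request; defType is already set as an invariant on the
// /search handler above, so only the per-request parameters are passed here.
async function searchByName(query: string, rows = 20, start = 0) {
  const params = new URLSearchParams({
    q: query,                                        // raw user input, no special escaping needed
    qf: 'default_alias^3 aliases^2 disambiguation',  // per-field boosts (placeholder values)
    mm: '80%',                                       // minimum number of optional clauses that must match
    fl: 'score,*',                                   // return the score alongside the stored fields
    rows: String(rows),
    start: String(start),
    wt: 'json'
  });
  const response = await fetch(`http://localhost:8983/solr/author/search?${params}`);
  const json = await response.json();
  return json.response.docs;
}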


Solr Cluster

A Solr cluster is a group of servers (nodes) that each run Solr.
Shards- In cluster mode, a single logical index can be split across nodes as shards. Each shard contains a subset of the overall index. The number of shards also determines the amount of parallelization possible for an individual search request.
Replica- A replica has the same configuration as the shard and any other replicas for the same index. The number of replicas determines the level of fault tolerance the entire cluster has in the event of a node failure.
Leader- One of the replicas is made the leader. It acts as the source of truth for the other replicas. The replicas which are not leaders are followers.

SolrCloud Mode

This mode uses Apache ZooKeeper to provide centralized cluster management. ZooKeeper is used to track each node of the cluster and the state of each core on each node.
It enables making collections. A collection is the entire group of cores that represent an index: the logical shards and the physical replicas for each shard. Using this, operations can be performed on the entire collection at one time. This feature is primarily why the idea of integrating SolrCloud came to me, as this can be a way to enable global search across all the cores.

While searching for ways to query the entire index, I came across the feature of “Distributed Search”, which was earlier supported in Solr, when SolrCloud wasn’t a feature. It allowed one query to be executed across multiple shards, so the query was executed against the entire Solr index and no documents would be missed from the search results. But there were limitations to that method-

  • Shard splitting was a manual process
  • It was needed to explicitly send the data to each particular shard
  • There was no way of load balancing or failover

SolrCloud addresses all these limitations.

In Solr, every shard consists of at least one physical replica, exactly one of which is a leader. Zookeeper does the leader-election. If a leader goes down, one of the other replicas is automatically elected as the new leader.

When a document is sent to a Solr node for indexing, the system first determines which Shard that document belongs to, and then which node is currently hosting the leader for that shard. The document is then forwarded to the current leader for indexing, and the leader forwards the update to all of the other replicas.

Replicas also have types within them:-
NRT (Near Real Time)- This type of replica maintains a transaction log and writes new documents to its indexes locally. Any replica of this type is eligible to become a leader.
TLOG- maintains a transaction log but does not index document changes locally. This type helps speed up indexing since no commits need to occur in the replicas.
PULL- replica does not maintain a transaction log nor index document changes locally.

Since we require the NRT feature, the favourable combination of replicas for this project is all NRT replicas.
In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. In order to enforce this, Solr provides the IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commits and/or optimize requests from client applications without having to refactor your client application code (included in the update request processor chain discussed while explaining solrconfig.xml).

Distributed Request

When a Solr node receives a search request, the request is automatically routed to a replica of a shard that is part of the collection being searched.

The chosen replica acts as an aggregator: it creates internal requests to randomly chosen replicas of every shard in the collection, coordinates the responses, issues any subsequent internal requests as needed (for example, to refine facets values, or request additional stored fields), and constructs the final response for the client. The more replicas there are of every shard, the more likely that the Solr cluster will be able to handle search results in the event of node failures.
Care should be taken to ensure that the max number of threads serving HTTP requests is greater than the possible number of requests from both top-level clients and other shards. If this is not the case, the configuration may result in a distributed deadlock.

Because SolrCloud automatically load balances queries, a query across all shards for a collection is simply a query that does not define a shards parameter

http://localhost:8983/solr/gettingstarted/select?q=*:*

To limit the query to just one shard, use the shards parameter to specify the shard by its logical ID

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1

Thus, when a user searches from the BookBrainz website (https://bookbrainz.org/) or searches in All entities, we would send the query using the first method, whereas when querying a particular type, like

https://bookbrainz.org/search?q=harry&type=publisher&size=20&from=0

the second method would be used.
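A small sketch of how the website route could pick between the two forms (the collection name and the type-to-shard mapping are placeholders):

// Sketch of routing a BookBrainz search to SolrCloud; 'bookbrainz' as the
// collection name and the type->shard naming are assumptions.
function buildSolrSearchUrl(query: string, type?: string): string {
  const params = new URLSearchParams({q: query});
  if (type) {
    // Entity-specific search (e.g. ?type=publisher): restrict to that shard.
    params.set('shards', type.toLowerCase());
  }
  // Without a shards parameter, SolrCloud fans the query out to every shard
  // of the collection, giving us the multi-entity ("All entities") search.
  return `http://localhost:8983/solr/bookbrainz/select?${params}`;
}

buildSolrSearchUrl('harry');              // all entities
buildSolrSearchUrl('harry', 'publisher'); // only the publisher shard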

Collections API

It is provided to allow you to control your cluster, including the collections, shards, replicas, backups, leader election, and other operational needs.


Zookeeper configuration

Solr does come with a built-in Zookeeper, but using this in production is not recommended. The major issue being: if the Solr instance that hosts ZooKeeper shuts down, ZooKeeper is also shut down. Any shards or Solr instances that rely on it will not be able to communicate with it or each other. Hence, we would have an external ZooKeeper ensemble setup.

Solr currently uses Apache ZooKeeper v3.9.1.

We need to setup a <ZOOKEEPER_HOME>/conf/zoo.cfg file.

tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
4lw.commands.whitelist=mntr,conf,ruok

The minimum session timeout is defined as two “ticks”. The tickTime parameter specifies in milliseconds how long each tick should be. dataDir is the directory in which ZooKeeper will store data about the cluster. clientPort is the port on which Solr will access ZooKeeper.

We also need to set additional parameters so each node knows who it is in the ensemble and where every other node is. But all this would require discussions with mentors about scalability, performance, etc. Moreover, I feel I need to study this in more detail before providing final views, which I plan to do in the community bonding period.

Next, Solr must be made aware of the ZooKeeper configuration. First we would set the ZK_CREATE_CHROOT environment variable to true, to enable creation of the znode automatically on startup. Then, pointing Solr at the ZooKeeper ensemble is a simple matter of using the -z parameter when using the bin/solr script, or we can update the Solr include files.

Also there are a few things to do:-

  • Uploading configuration files on SolrCloud
  • Preparing zookeeper before cluster starts

Zookeeper also has a command line interface, which provides a lot of commands for help.

Monitoring with Prometheus and Grafana

Solr supports Prometheus and Grafana as monitoring software. The Solr distribution by default comes with a Prometheus exporter (solr-exporter) to collect metrics fetched from the Metrics API and other data.
The configuration for the solr-exporter (solr-exporter-config.xml) defines the data to get from Solr. This includes the metrics, but can also include queries to the PingRequestHandler, the Collections API, and a query to any query request handler. It defines the elements to request, how to scrape them, and where to place the extracted data in the JSON template.
The data the solr-exporter should request is defined between the file’s two container elements.

There are several possible types of requests to make:

  • Scrape the response to a Ping request.
  • Scrape the response to a Metrics API request.
  • Scrape the response to a Collections API request.
  • Scrape the response to a query request.

All the required data would be listed here and we would be able to get the metrics using our Prometheus server.
In order to configure the Prometheus server to get all the metrics from Solr, the listen address must be added to the Prometheus server’s prometheus.yml configuration file:-

scrape_configs:
  - job_name: 'solr'
    static_configs:
      - targets: ['localhost:9854']

The use of Grafana is basically for visualisation of the metrics gathered by Prometheus.
Both the Grafana and Prometheus servers have to be set up and deployed separately.
If BookBrainz already has these set up (the way MusicBrainz has; yvanzo helped me get this one), we can simply add the Solr-specific configuration there, but if this is not the case, I would be more than happy to dive deeper into these technologies, learn the necessary functions and then set them up for our organization.

That was all from my side in terms of implementation and uses. I tried my best to look at things from all angles and to propose getting things done in the best way possible. I am sure I will have made mistakes in places or chosen a less-optimal way of doing things, but I am confident that with the support of our community and respected mentors, I would be able to complete things and take this project to success :slightly_smiling_face::partying_face: (emojis too <3)


Based on all these implementations and ideas, I have made the following timeline for the project:-

Timeline

  • May 1- May 26 (Community Bonding period)-
    → Continue reading about SolrCloud features
    → Familiarize with solr basic functions
    → Have a closer look at MusicBrainz search infrastructure
    → Have discussions about final indexing and searching logic for entities
  • May 27- June 2 (Week 1)-
    → Implement basic data indexing mechanisms in Solr for synchronization with the BookBrainz database.
    → Configure Solr indexing settings, including document fields, analyzers, and update strategies, to optimize performance and accuracy.
    → Integrate custom tokenizers and filters into the Solr indexing pipeline, ensuring they are applied consistently during document indexing and query processing [based on discussion with mentors and community]
    → Basically setting up BookBrainz Search Server Schema
  • June 3- June 9 (Week 2)-
    → Set up functions for batch processing and indexing data into solr server
    → Work on Solr’s caching mechanisms and query parsing options to improve indexing and retrieval performance.
    → Completing work of Search Server Schema and start working on Search Index Rebuilder
  • June 10- June 16 (Week 3)-
    → Work on handling the query responses from Solr search
    → Ensuring all the results are in accordance with current results
    → Writing Query response writer (if required)
  • June 17- June 23 (Week 4)-
    → Integrate Solr with the existing BookBrainz codebase, including backend services and frontend components
    → Modify relevant code files (API endpoints, indexing logic, search and autocomplete functions, etc.) to incorporate Solr-based search functionality and replace any existing ElasticSearch-specific logic
  • June 24- June 30 (Week 5)-
    → Ensure proper communication and error handling between the BookBrainz frontend and Solr backend
    → Develop comprehensive tests for the Solr integration components
    → Ensure all the search functions behave as they do currently
  • July 1- July 12 (Week 6,7)-
    → Buffer week to complete any pending work and mid-term evaluations from mentors
    → Ensure all the work has been documented properly for further use
    → Continue reading the documentations of SolrCloud
  • July 13- July 21 (Week 7,8)-
    → Begin the setup and configuration of Solr Zookeeper for managing Solr instances in a distributed environment
    → Set up the SolrCloud service, ensuring a proper shift (changes in schema config files, use of the Collections API)
    → Extend existing Solr functionalities to work seamlessly with SolrCloud, ensuring compatibility and interoperability with the BookBrainz application.
  • July 22- July 28 (Week 9)-
    → Implement and manage Solr Cloud distribute routes for multi-entity search, defining routing rules and strategies for efficient query distribution.
    → Configure routing policies and load balancing mechanisms to evenly distribute search traffic across SolrCloud nodes and clusters.
    → Ensuring Parallelization and distributed search functions are working as required
  • July 29- August 4 (Week 10)-
    → Explore Solr’s metrics API for monitoring and performance analysis, including key performance indicators (KPIs) and resource utilization metrics
    → Configure alerts and notifications for critical performance thresholds or anomalies, ensuring timely detection and response to potential issues.
  • August 5- August 11 (Week 11)-
    → Install and configure Prometheus for metrics storage and data collection from Solr instances and clusters.
    → Read the documentation of Prometheus, Grafana and the Metrics API
  • August 12- August 18 (Week 12)-
    → Integrate Solr metrics with Prometheus and Grafana to populate dashboards and generate reports on search traffic, query performance, and system health
    → Test the integration between Solr, Prometheus, and Grafana to ensure seamless data flow and accurate representation of search metrics and monitoring data
  • August 19- August 25 (Week 13)-
    → Conduct final testing of the Solr implementation, testing on all use cases, ensuring all the current functions are implemented and are working properly
  • August 26- September 1
    → Buffer week and preparation for final submission
    → Complete documentation of all the functions, setup configuration, etc.

About Me

I am Bhumika Bachchan. I am a sophomore at the Indian Institute of Technology, Roorkee, India. I have been interested in programming for as long as I can remember. While in school, we were taught HTML and basic Python, which used to be a solace from the academic curriculum, which didn’t appeal to me as much as these subjects did, and I used to spend hours on my elder brother’s laptop; he used to teach me things slightly out of the curriculum. Things continued this way until I came to college, which is basically when I first got a laptop of my own. Since then, we together (my laptop and me) have shared many amazing experiences, from reading long documentation to trying to understand huge codebases, and it has been amazing so far. My area of experience is development using Typescript (Javascript), but I am interested in learning new programming languages and skills based on need. I also have knowledge of C++, which was taught to us as a part of the college curriculum, and I have tried learning bits and pieces of Java to understand Lucene for this project.

My journey with open source has been quite new, and in this very short duration I have met pretty amazing people, helping each other without expecting anything in return. I’ve always believed in collaboration over competition, and as soon as I came to know about the idea of open source, I started looking for a project to start from, on the recommendation of a senior. It was then that I came across the interview of Robert Kaye (aka Mayhem) with Prathamesh Ghatole (aka Pratha-Fish), and I was quite intrigued by their discussion about open source and community building, which made me curious enough to head to the MetaBrainz page and then find BookBrainz. And this is how I ended up here. Irrespective of what the results will be, I have had a great time here: understanding the code base, trying to solve tickets, asking all of my doubts in the IRC channel without any fear of being judged or scolded, and finally brushing up my skills to apply for this project. I now feel more comfortable in a development environment, the ongoing chats on IRC make more sense to me, and I feel less inhibited about putting my thoughts out there. I initially worked on a few tickets and made some pull requests, but for the past few days I haven’t had time for this as I was committing my time to learning more and more about Solr. I will be back to working on tickets now that the major work here has been completed, and I will continue contributing as much as I can; there’s a lot to learn.

What type of music do you listen to? (please list a series of MBIDs as examples)
→ Listening to music (and singing at times, though my voice is terrible) is one of my favourite recreational activities, and I have long playlists with tons of songs of different tastes: from soothing to pop.
I’m a big Eminem (b95ce3ff-3d05-4e87-9e01-c97b66af13d4) fan: his Lose Yourself (88df7110-f219-3821-83b8-56443eaa34c3) is one of my favourites. I’m also mostly listening to Ed Sheeran (b8a7c51f-362c-4dcb-a259-bc6e0095f0a6) and must mention his collab with Eminem and 50 Cent- Remember the Name (b8a7c51f-362c-4dcb-a259-bc6e0095f0a6).
Also, I am a big fan of Hindi and Punjabi music (tho I’m not from Punjab :relieved:). Nimrat Khaira (42d8064a-2eb1-42ec-8336-b113347c361e), PropheC (878391e2-c5e7-4b51-bbc1-236326fa810c) and Amrinder Gill (cd74f170-639f-4ff9-b01f-29b8e82912ae) are the singers I listen to the most.

What aspects of MusicBrainz/ListenBrainz/BookBrainz/Picard interest you the most?

→ The thing that interests me the most is definitely the community that MetaBrainz has as a whole. In the past few months, by lurking in the IRC channels and going through chats from almost 2 years ago to find stuff, I have developed a great respect and love for this community, and I feel grateful for having this chance to be a part of it and to learn and grow.

What sorts of programming projects have you done on your own time?

→ Being a software enthusiast, I have tried my hand at a lot of different projects, mostly in Javascript (Typescript). Some of them are open source and can be seen on my GitHub profile (insane-22 (Bhumika Bachchan) · GitHub).

What computer(s) do you have available for working on your SoC project?

→ I have an ASUS TUF Gaming A15 (15.6-inch 144 Hz display, AMD Ryzen 7 4800H, GeForce RTX 3050), which is my personal laptop and the machine I’d be using for coding.

How much time do you have available per week, and how would you plan to use it?

→ I can dedicate more than 45 hours a week. My college’s summer vacation runs from mid-May to July, so there would be no classes for the first two months of the coding period and no exams throughout it. My college reopens on July 16th; after that, I could still work more than 25 hours a week.
I generally work from 4 PM to 5 AM IST (6:30 AM to 7:30 PM Eastern Time), although I am flexible about adjusting my working hours if required.
Other than this project, I have no commitments or vacations planned for the summer. I shall keep the community posted on my status through the weekly Monday meetings and ensure that all deadlines are adhered to.

Involvement after GSoC:-
I have already learned a lot contributing to the BookBrainz project. Even after the GSoC period ends, I plan on contributing to this organization by building on my past work and taking up open issues, given my familiarity with the technical stack and the new challenges I am continually offered in the process.
Having picked up many development skills, my primary focus would be to help the project and the community grow, and also to explore other MetaBrainz projects, such as MusicBrainz, whose code I read extensively while learning about the Solr infrastructure. I would also like to learn about the Cover Art Archive (CAA) and, if possible, help in implementing it in BookBrainz.

References

These 3 messages together make up the full proposal; it was too long to send in a single message :sweat_smile:, which I never realised while writing it. I have tried my best to cover all the possible dimensions of the project, but there could be errors here and there, which I would love to discuss and resolve.
I would be grateful to receive feedback and suggestions!
I really enjoyed learning about new technologies and discovering new dimensions of things I already knew. Thank you for all your patience and help!

Thanks for your proposal @insane_22 !

I have one main general comment: although I can see that you’ve done a good amount of in-depth research, most of the content of the proposal is directly quoted from the Solr docs, or code copied from the solr-node-client library you mention.
I don’t see a lot of original content, and it makes it look like you’re trying to pass this off as your own work, which is a bad look.

While it is definitely good to do this in-depth research, the goal of these proposals is not to quote documentation but rather to show that you have integrated the knowledge and understand in advance where you will be heading with the project and where the potential pitfalls are.
All I can see above is that you know how to find what you need in the Solr docs, which while useful in itself makes it hard for me to evaluate how ready you are for this big project.
Having a hard time myself every time I need to wrap my head around search indexing, tokenizing, etc., I know you can’t just plow through and hope it all just works.

What do the schemas look like?
How do we do more advanced indexing such as a work’s author via its relationships?
Language-wise, contrary to what you suggest, we do have content in many languages and not just English, including a variety of alphabets and scripts. How are those managed?


Thanks a lot @mr_monkey for such a quick review!
The reason it is mostly quoted directly from the documentation is that that is where I learnt the material, so I thought it would make sense to quote it in the proposal. My idea of an ideal GSoC proposal was one that would be understandable to someone with no prior knowledge of the topic, which is why I have in places quoted exact lines from the documentation, adding explanations exactly where I needed them.
I am really sorry if it portrayed me as “trying to pass this as your own work”, which was never my intention. I never wanted to look ‘smarter’ than I am or take credit for things I don’t deserve. I am sorry for putting things this way; I never viewed it from the perspective that it might come across like that, which is definitely critical for a project this big that needs deep study, and is also not acceptable for a programme as reputed as GSoC.

Moving to the questions,

What do the schemas look like?

There would be 9 cores: ‘Author’, ‘Edition’, ‘EditionGroup’, ‘Publisher’, ‘Series’, ‘Work’, ‘Editor’, ‘User Collection’ and ‘Area’. Each would have its own schema.xml with its own fields, while all the field types would be kept in a common fieldtypes.xml (a sketch of which follows the schemas below).

<?xml version="1.0"?>
<schema name="author" version="1.0" >
  <!-- link to fieldType definitions for all cores-->
  <xi:include href="common/fieldtypes.xml" />
  <similarity class="solr.SchemaSimilarityFactory" />
  <field name="bbid" type="bbid" indexed="true" stored="true" required="true" />
  <field name="annotation" type="text_general" indexed="true" stored="true"/>
  <field name="default_alias" type="text_general" indexed="true" stored="true"/>
  <field name="alias_set_aliases" type="text_general" indexed="true" stored="true"/>
  <field name="identifier_set_identifiers" type="text_general" indexed="true" stored="true"/>
  <field name="author_gender" type="string" indexed="true" stored="true"/>
  <field name="author_beginArea" type="string" indexed="true" stored="true"/>
  <field name="author_endArea" type="string" indexed="true" stored="true"/>
  <field name="disambiguation" type="text_trigram" indexed="true" stored="true"/>
  <!-- To enable the update log in solrconfig.xml which would be essential for SolrCloud setup. -->
  <field name="_version_" type="long" indexed="true" stored="true" />
  <uniqueKey>bbid</uniqueKey>
 </schema>
<?xml version="1.0" ?>
<schema name="edition" version="1.0">
  <xi:include href="common/fieldtypes.xml" />
  <similarity class="solr.SchemaSimilarityFactory" />
  <field name="bbid" type="bbid" indexed="true" stored="true" required="true" />
  <field name="annotation" type="text_general" indexed="true" stored="true" />
  <field
    name="default_alias"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field
    name="alias_set_aliases"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field
    name="identifier_set_identifiers"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field name="authors" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="disambiguation" type="text_trigram" indexed="true" stored="true"/>
  <field name="edition_id" type="string" indexed="true" stored="true"/>
  <field name="edition_format" type="string" indexed="true" stored="true"/>
  <field name="edition_status" type="string" indexed="true" stored="true"/>
  <field name="_version_" type="long" indexed="true" stored="true" />
  <uniqueKey>bbid</uniqueKey>
</schema>
<?xml version="1.0" ?>
<schema name="editionGroup" version="1.0">
  <xi:include href="common/fieldtypes.xml" /> 
  <similarity class="solr.SchemaSimilarityFactory" />
  <field name="bbid" type="bbid" indexed="true" stored="true" required="true" />
  <field name="annotation" type="text_general" indexed="true" stored="true" />
  <field
    name="default_alias"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field
    name="alias_set_aliases"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field
    name="identifier_set_identifiers"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field name="authors" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="disambiguation" type="text_trigram" indexed="true" stored="true"/>
 <field name="_version_" type="long" indexed="true" stored="true" />
 <uniqueKey>bbid</uniqueKey>
</schema>
<?xml version="1.0" ?>
<schema name="publisher" version="1.0">
  <xi:include href="common/fieldtypes.xml" />
  <similarity class="solr.SchemaSimilarityFactory" />
  <field name="bbid" type="bbid" indexed="true" stored="true" required="true" />
  <field name="annotation" type="text_general" indexed="true" stored="true" />
  <field
    name="default_alias"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field
    name="alias_set_aliases"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field
    name="identifier_set_identifiers"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field name="disambiguation" type="text_trigram" indexed="true" stored="true"/>
  <field name="publisher_area" type="string" indexed="true" stored="true"/>
  <field name="_version_" type="long" indexed="true" stored="true" />
  <uniqueKey>bbid</uniqueKey>
</schema>
<?xml version="1.0" ?>
<schema name="series" version="1.0">
  <xi:include href="common/fieldtypes.xml" />
  <similarity class="solr.SchemaSimilarityFactory" />
  <field name="bbid" type="bbid" indexed="true" stored="true" required="true" />
  <field name="annotation" type="text_general" indexed="true" stored="true" />
  <field
    name="default_alias"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field
    name="alias_set_aliases"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field
    name="identifier_set_identifiers"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field name="disambiguation" type="text_trigram" indexed="true" stored="true"/>
  <field name="series_ordering_type" type="string" indexed="true" stored="true"/>
  <!-- To enable the update log in solrconfig.xml which would be essential for SolrCloud setup. -->
  <field name="_version_" type="long" indexed="true" stored="true" />
  <uniqueKey>bbid</uniqueKey>
</schema>
<?xml version="1.0" ?>
<schema name="work" version="1.0">
  <xi:include href="common/fieldtypes.xml" /> 
  <similarity class="solr.SchemaSimilarityFactory" />
   <field name="bbid" type="bbid" indexed="true" stored="true" required="true" />
  <field name="annotation" type="text_general" indexed="true" stored="true" />
  <field
    name="default_alias"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field
    name="alias_set_aliases"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field
    name="identifier_set_identifiers"
    type="text_general"
    indexed="true"
    stored="true"
  />
  <field name="work_relationship_type" type="string" indexed="true" stored="true"/>
  <field name="authors" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="disambiguation" type="text_trigram" indexed="true" stored="true"/>
  <!-- To enable the update log in solrconfig.xml which would be essential for SolrCloud setup. -->
  <field name="_version_" type="long" indexed="true" stored="true" />
  <uniqueKey>bbid</uniqueKey>
</schema>
   
<?xml version="1.0" ?>
<schema name="area" version="1.0" xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="common/fieldtypes.xml" />
  <similarity class="solr.SchemaSimilarityFactory" />
  <field name="area_name" type="text_general" indexed="true" stored="true" />
  <field name="area_type" type="string" indexed="true" stored="true" />
  <field name="area_gid" type="string" indexed="true" stored="true" />
  <field name="_version_" type="long" indexed="true" stored="true" />
  <uniqueKey>area_gid</uniqueKey>
</schema>
<?xml version="1.0" ?>
<schema name="editor" version="1.0">
  <xi:include href="common/fieldtypes.xml" />
  <similarity class="solr.SchemaSimilarityFactory" />
  <field name="editor_name" type="text_general" indexed="true" stored="true" />
  <field name="editor_type" type="string" indexed="true" stored="true" />
  <field name="editor_id" type="string" indexed="true" stored="true" />
  <field name="_version_" type="long" indexed="true" stored="true" />
  <uniqueKey>editor_id</uniqueKey>
</schema>
<?xml version="1.0" ?>
<schema name="user_collection" version="1.0">
  <xi:include href="common/fieldtypes.xml" />
  <similarity class="solr.SchemaSimilarityFactory" />
  <field name="user_collection_name" type="text_general" indexed="true" stored="true" />
  <field name="user_collection_type" type="string" indexed="true" stored="true" />
  <field name="user_collection_id" type="string" indexed="true" stored="true" />
  <field name="_version_" type="long" indexed="true" stored="true" />
  <uniqueKey>user_collection_id</uniqueKey>
</schema>
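For completeness, here is a rough sketch of what common/fieldtypes.xml could contain. The type names (bbid, string, long, text_general, text_trigram) are the ones referenced by the schemas above, but the concrete classes and analyzer chains below are only my assumptions at this stage and would be tuned during the project:

<?xml version="1.0"?>
<types>
  <!-- Assumption: BBIDs are UUIDs, so a UUID field type could be used (a plain string would also work) -->
  <fieldType name="bbid" class="solr.UUIDField" />
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="long" class="solr.LongPointField" />
  <!-- General-purpose text: standard tokenization plus lowercasing -->
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
  </fieldType>
  <!-- Trigram text, e.g. for approximate matching on short fields like disambiguation -->
  <fieldType name="text_trigram" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3" />
    </analyzer>
  </fieldType>
</types>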

How do we do more advanced indexing such as a work’s author via its relationships?

To index a work’s authors, we would populate the work’s authors field by iterating through each work, following its relationships, and mapping each one to the corresponding author by its id:

const [authorsCollection,
  editionCollection,
  editionGroupCollection,
  publisherCollection,
  seriesCollection,
  workCollection] = entityLists;

for (const workEntity of workCollection) {
  // "bbid" matches the uniqueKey declared in the work schema above
  const doc: any = {
    bbid: workEntity.id
  };

  // Retrieve authors associated with the current work entity
  const relationshipSet = workEntity.relationshipSet;
  if (relationshipSet) {
    const authorNames: string[] = [];
    // Keep only relationships of the "Author wrote Work" type (typeId 8)
    const authorWroteWorkRels = relationshipSet.relationships.filter(relationshipModel => relationshipModel.typeId === 8);
    for (const relationshipModel of authorWroteWorkRels) {
      // Search for the Author in the already fetched authorsCollection
      const source = authorsCollection.find(author => author.id === relationshipModel.sourceBbid);
      if (source) {
        authorNames.push(source.defaultAlias.name);
      }
    }

    // Add authors to the Solr document
    doc.authors = authorNames;
  }

  client.update(doc);
}
// Note: a commit (or autoCommit configured in solrconfig.xml) is still needed before these documents become searchable.

Language-wise, contrary to what you suggest, we do have content in many languages and not just English, including a variety of alphabets and scripts. How are those managed?

Ah yes, the names of many entities (authors, works, etc.) could individually be in several languages, and this would need to be addressed. What I had thought of earlier mixed this up with the website language (which in fact has nothing to do with the fields being searched).

Now, on the question about how to manage these alphabets and scripts. There are 2 things which come to my mind:

Solr allows us to add ICU (International Components for Unicode) filters, which provide transliteration, mapping and normalization. So my way of doing this would be to include these filters (solr.ICUFoldingFilterFactory, solr.ICUNormalizer2FilterFactory) in our analyzer to get this behaviour.
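As a rough sketch (the field type name is illustrative, and this assumes the ICU analysis module from analysis-extras is added to Solr’s classpath), such an analyzer chain could look like:

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- The ICU tokenizer also handles scripts that are not whitespace-delimited -->
    <tokenizer class="solr.ICUTokenizerFactory" />
    <!-- Unicode normalization (NFKC, composed form) -->
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
    <!-- Case folding and accent folding across scripts -->
    <filter class="solr.ICUFoldingFilterFactory" />
  </analyzer>
</fieldType>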

Unicode collation is another feature I would suggest using. It defines the order in which characters and strings should be sorted based on their Unicode code points, taking into account language-specific rules such as accents, case sensitivity and special character ordering. We could add dedicated sort fields using solr.ICUCollationField and copy our text fields into them. But, as you said, we have alphabets and scripts from a lot of languages, and adding a large number of per-language sort fields would increase disk and indexing costs. An alternative approach is to use the Unicode default (root) collator, whose rules are designed to work well for most languages.
So what we can do is have a field type like:

<fieldType name="collatedROOT" class="solr.ICUCollationField"
           locale=""
           strength="primary" />

and then copy the values from the individual text fields into a field of this type using <copyField>.
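A minimal sketch of that, assuming an illustrative sort field built on top of default_alias (the field name here is made up):

<!-- Root-locale collated copy of the default alias, used only for sorting -->
<field name="default_alias_sort" type="collatedROOT" indexed="true" stored="false" />
<copyField source="default_alias" dest="default_alias_sort" />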

Overall, I get your point and I will definitely try to add more detail here and there throughout the proposal. I’m extremely sorry that you had to point this out, and also sorry if I made a wrong impression (which I definitely did).
