GSoC 2026: Modernize search storage format for the MusicBrainz database

Project summary:

Title: Modernize search storage format for the MusicBrainz database
Proposed mentor: @lucifer
Proposed Co-mentors: bitmap, reosarevok, yvanzo
Estimated Project Length: 350 hours
Difficulty: medium
Expected Outcomes:

  • Upgrade the Solr schema version from 1.5 to 1.7
  • Complete fields in configsets and indexer to store all the data to be returned
  • Create two response writers to return data from fields to MB XML/MB JSON formats and add automated validation tests

Personal Information:

Name: Shaik Junaid
IRC Nickname: fettuccinae
Github: fettuccinae
Timezone: UTC +05:30

Introduction:
Hi! I’m Junaid (fettuccinae on Matrix), a third-year Computer Science student at Mahatma Gandhi Institute of Technology (MGIT), Hyderabad, India.

Last summer, I worked on creating a notification system for MetaBrainz projects as my GSoC project, and this year I’m excited to apply as a contributor to work on the Solr search engine.

Project proposal:

Description:

The MusicBrainz database has a Solr search engine that powers both the website search and the search API.
MB Search architecture for reference.

After Solr was upgraded to version 9, the schemas in the configsets weren’t updated and still use version 1.5. They need to be upgraded to version 1.7.

Solr currently keeps data in indexed fields for search, and also uses a _store field that duplicates the indexed field data plus other non-indexed data for the response writer, causing redundancy. To eliminate this redundancy, we need to store all the required data in the fields themselves.

The Solr response writer currently reads all the data from the _store field of a document and returns it in valid MB XML/MB JSON format.
After modifying the configsets and indexer to store data in the fields, we need to create response writers that read from these fields and return data in valid MB XML/MB JSON format, along with validation tests for these writers.

Since storing data in individual fields rather than a single _store blob will change performance characteristics, we will start with the artist core and benchmark it before rolling the change out to the remaining entities.

Implementation:

1. Migrate schema to version 1.7

The change in schema version 1.6 only concerns non-stored docValues fields, which we don’t use.

The change in schema version 1.7 is that certain field types (Numeric, Date, Bool, String, Enum, UUID) that support docValues will have docValues enabled by default. Since fields with docValues="true" perform poorly for retrieval queries compared to stored="true" fields, we need to explicitly set docValues="false" for these field types in fieldtypes.xml.

Note: storefieldmv already has docValues="true", so we won’t be modifying it.

Current: fieldtypes.xml
<types>
  <fieldtype name="string" class="solr.StrField" sortMissingLast="false" />
  <fieldType name="long" class="solr.LongPointField" positionIncrementGap="0" />
  <fieldType name="mbid" class="solr.UUIDField" omitNorms="true" />
  <fieldType name="storefield" class="solr.StrField" />
  <fieldType name="storefieldmv" class="solr.StrField" docValues="true" />
  <fieldType name="bool" class="solr.BoolField" />
  <fieldType name="date" class="solr.DateRangeField" sortMissingLast="false" />
  <fieldType name="int" class="solr.IntPointField" sortMissingLast="false" />
  <fieldType name="float" class="solr.FloatPointField" />
....
</types>
After: fieldtypes.xml

<types>
  <fieldtype name="string" class="solr.StrField" sortMissingLast="false" docValues="false" />
  <fieldType name="long" class="solr.LongPointField" positionIncrementGap="0" docValues="false" />
  <fieldType name="mbid" class="solr.UUIDField" omitNorms="true" docValues="false"  />
  <fieldType name="storefield" class="solr.StrField" docValues="false" />
  <fieldType name="storefieldmv" class="solr.StrField" docValues="true" />
  <fieldType name="bool" class="solr.BoolField" docValues="false" />
  <fieldType name="date" class="solr.DateRangeField" sortMissingLast="false" docValues="false" />
  <fieldType name="int" class="solr.IntPointField" sortMissingLast="false" docValues="false" />
  <fieldType name="float" class="solr.FloatPointField" docValues="false" />
   .....
</types>

Then, update all entities’ schema versions to 1.7.

Example:

_template/_conf/schema.xml

<?xml version="1.0"?>
<!-- This is a template for new cores. -->
<schema name="[new_entity]" version="1.7" xmlns:xi="http://www.w3.org/2001/XInclude">
....
</schema>
artist/conf/schema.xml

<?xml version="1.0"?>
<schema name="artist" version="1.7" xmlns:xi="http://www.w3.org/2001/XInclude">
....
</schema>
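Since the fieldtypes.xml edit is mechanical, it can also be scripted. The following is a hypothetical helper (not part of the existing tooling) that adds docValues="false" to every field type whose class would get docValues enabled by default under schema version 1.7; the class list is an assumption based on the types shown above:

```python
# Hypothetical migration helper (not part of the existing tooling): add
# docValues="false" to every field type whose class would get docValues
# enabled by default under schema version 1.7.
import xml.etree.ElementTree as ET

# Assumed list of classes that support docValues and would default to
# docValues="true" under schema version 1.7.
DOCVALUES_DEFAULT_CLASSES = {
    "solr.StrField", "solr.UUIDField", "solr.BoolField",
    "solr.DateRangeField", "solr.IntPointField", "solr.LongPointField",
    "solr.FloatPointField",
}

def disable_default_docvalues(xml_text: str) -> str:
    """Return fieldtypes.xml text with docValues="false" added where missing."""
    root = ET.fromstring(xml_text)
    # Both <fieldtype> and <fieldType> spellings occur in the configsets.
    for tag in ("fieldtype", "fieldType"):
        for elem in root.iter(tag):
            if (elem.get("class") in DOCVALUES_DEFAULT_CLASSES
                    and "docValues" not in elem.attrib):
                elem.set("docValues", "false")
    return ET.tostring(root, encoding="unicode")

src = ('<types>'
       '<fieldType name="bool" class="solr.BoolField" />'
       '<fieldType name="storefieldmv" class="solr.StrField" docValues="true" />'
       '</types>')
patched = disable_default_docvalues(src)
```

Field types that already declare docValues (like storefieldmv) are left untouched, matching the note above.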

2. Complete fields in configsets and remove _store

Currently, the search fields are used only for indexing and the _store blob is being used to store all the data required by response writers.

We can eliminate this redundancy by storing data in the fields themselves and removing the _store blob.

To flatten simple nested fields, we can add their inner elements to the schema.
Example: the area element, which is of type def_area-element_inner, can be flattened as follows:

<field name="area-id" type="text" indexed="false" stored="true" />
<field name="area-name" type="text" indexed="true" stored="true" />
<field name="area-type" type="mbid" indexed="false" stored="true" />
<field name="area-type-id" type="text" indexed="false" stored="true" />
<field name="area-sort-name" type="text" indexed="false" stored="true" />
<field name="area-lifespan-begin_date" type="date" indexed="false" stored="true" />
<field name="area-lifespan-end_date" type="date" indexed="false" stored="true" />
<field name="area-lifespan-ended" type="bool" indexed="false" stored="true" />

There are four types of fields:

  1. Flat fields used for both indexing and storing.
  2. Flat fields used only for indexing.
  3. Flat fields used only for storing.
  4. Complex nested fields which can’t be flattened, so we store them as XML strings.

Example:

1.   <field name="arid" type="mbid" indexed="true" stored="true" />
     <field name="artist" type="text" indexed="true" stored="true" />

2.   <field name="alias" type="text_mult" indexed="true" stored="false" multiValued="true" />
     <field name="area" type="text_mult" indexed="true" stored="false" multiValued="true" />

3.   <field name="gender-id" type="text" indexed="false" stored="true" />
     <field name="area-type" type="lowercase" indexed="false" stored="true" />

4.   <field name="alias_list_store" type="storefield" indexed="false" stored="true" />
     <field name="tag_store" type="storefield" indexed="false" stored="true" />

This approach removes our dependency on _store completely, and the response writer can read all the required data from the fields themselves.
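The flattening convention can be sketched in Python. The nested dict below mirrors the area example; the helper itself is purely illustrative and not part of SIR:

```python
# Illustrative sketch of the flattening convention: nested elements become
# flat fields named "<parent>-<child>", matching the schema example above.
def flatten(prefix, nested):
    """Flatten a nested dict into Solr-style "parent-child" field names."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}-{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(name, value))  # recurse into nested elements
        else:
            flat[name] = value
    return flat

area = {
    "id": "71bbafaa-e825-3e15-8ca9-017dcad1748b",
    "name": "Canada",
    "sort-name": "Canada",
    "lifespan": {"ended": "false"},
}
doc_fields = flatten("area", area)
# doc_fields["area-lifespan-ended"] == "false"
```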

Example for the Artist entity schema after completing all the fields:

artist/conf/schema.xml

<schema name="artist" version="1.7" xmlns:xi="http://www.w3.org/2001/XInclude">

  <!-- Search fields with stored="true" -->
  <field name="artist" type="text" indexed="true" stored="true" />
  <field name="sortname" type="text" indexed="true" stored="true" required="true" />
  <field name="arid" type="mbid" indexed="true" stored="true" />
  <field name="type-name" type="lowercase" indexed="true" stored="true" omitNorms="true" />
  <field name="comment" type="text" indexed="true" stored="true" />
  <field name="country" type="lowercase" indexed="true" stored="true" omitNorms="true" />
  <field name="area-name" type="text_mult" indexed="true" stored="true" multiValued="true" />
  <field name="life_span-begin_date" type="date" indexed="true" stored="false" />
  <field name="begin_area-name" type="text_mult" indexed="true" stored="false" multiValued="true" />
  <field name="life_span-end_date" type="date" indexed="true" stored="false" />
  <field name="end_area-name" type="text_mult" indexed="true" stored="false" multiValued="true" />
  <field name="life_span-ended" type="bool" indexed="true" stored="false" />
  <field name="gender-name" type="lowercase" indexed="true" stored="false" omitNorms="true" />
  <field name="ipis-ipi" type="strip_leading_zeroes_concat_mult" indexed="true" stored="false" multiValued="true" />
  <field name="isnis-isni" type="strip_leading_zeroes_concat_mult" indexed="true" stored="false" multiValued="true" />
  <field name="mbid" type="mbid" indexed="true" stored="true" required="true" />

   
  <!-- Search fields with stored="false" -->
  <field name="alias" type="text_mult" indexed="true" stored="false" multiValued="true" />
  <field name="primary_alias" type="text_mult" indexed="true" stored="false" multiValued="true" />
  <field name="tag" type="text_mult" indexed="true" stored="false" multiValued="true" />
  <field name="ref_count" type="int" indexed="true" stored="false" />
  <field name="ngram" type="ngram" indexed="true" stored="false" multiValued="true" />


<!-- Complex nested fields -->
<field name="alias_list" type="storefield" indexed="false" stored="true"  />
<field name="tag_list" type="storefield" indexed="false" stored="true"  />

  <!-- Non Search fields with stored="true" -->
<field name="area-id" type="text" indexed="false" stored="true" />
<field name="area-type" type="mbid" indexed="false" stored="true" />
<field name="area-type-id" type="text" indexed="false" stored="true" />
<field name="area-sort-name" type="text" indexed="false" stored="true" />
<field name="area-lifespan-begin_date" type="date" indexed="false" stored="true" />
<field name="area-lifespan-end_date" type="date" indexed="false" stored="true" />
<field name="area-lifespan-ended" type="bool" indexed="false" stored="true" />
<field name="begin_area-id" type="text" indexed="false" stored="true" />
<field name="begin_area-type" type="mbid" indexed="false" stored="true" />
<field name="begin_area-type-id" type="text" indexed="false" stored="true" />
<field name="begin_area-sort-name" type="text" indexed="false" stored="true" />
<field name="begin_area-lifespan-begin_date" type="date" indexed="false" stored="true" />
<field name="begin_area-lifespan-end_date" type="date" indexed="false" stored="true" />
<field name="begin_area-lifespan-ended" type="bool" indexed="false" stored="true" />
<field name="end_area-id" type="text" indexed="false" stored="true" />
<field name="end_area-type" type="mbid" indexed="false" stored="true" />
<field name="end_area-type-id" type="text" indexed="false" stored="true" />
<field name="end_area-sort-name" type="text" indexed="false" stored="true" />
<field name="end_area-lifespan-begin_date" type="date" indexed="false" stored="true" />
<field name="end_area-lifespan-end_date" type="date" indexed="false" stored="true" />
<field name="end_area-lifespan-ended" type="bool" indexed="false" stored="true" />
<field name="lifespan-ended" type="bool" indexed="false" stored="true" />
<field name="type-id" type="text" indexed="false" stored="true" />


  <field name="_version_" type="long" indexed="true" stored="true" />
  <copyField source="artist" dest="artistaccent" />
  <copyField source="mbid" dest="arid" />
  <copyField source="artist" dest="ngram" />
  <copyField source="sortname" dest="ngram" />
  <!-- field to use to determine and enforce document uniqueness. -->
  <uniqueKey>mbid</uniqueKey>

</schema>

We also need to modify request-params.xml for each entity to return all stored fields by using * and removing _store from the field list (fl) parameter.

artist/conf/request-params.xml

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <str name="fl">score, *</str>
  <str name="qf">alias^1.75 primary_alias^2 artist^2 artistaccent^2.2 comment ngram^0.5 sortname^1.85</str>
  <str name="pf">primary_alias^2 artist^2 artistaccent^2.2 alias^1.75 sortname^1.85 comment</str>
  <str name="bf">log(sum(ref_count,150))^4</str>
</lst>

3. Complete fields in indexer

The Search Index Rebuilder (SIR) builds the index by sending the search fields and the _store field to Solr.
SIR initializes a SearchEntity object for each document, with the list of search fields as fields and the remaining paths as extrapaths. It builds an entity query and then converts the result to a dict, which is later sent to the Solr engine.

To complete fields:

a. We need to rename the variables in Search<Entity> (in sir/schema/__init__.py) to match the configsets, and move them from extrapaths to fields.

Example for Artist entity:
Current code:

SearchArtist = E(modelext.CustomArtist, [
    F("mbid", "gid"),
    F("artist", "name"),
    F("sortname", "sort_name"),
    F("alias", "aliases.name"),
    # Does not require a trigger since this will get updated on an alias update
    F("primary_alias", "primary_aliases", trigger=False),
    F("begin", "begin_date", transformfunc=tfs.index_partialdate_to_string),
    F("end", "end_date", transformfunc=tfs.index_partialdate_to_string),
    F("ended", "ended", transformfunc=tfs.ended_to_string),
    F("area", ["area.name", "area.aliases.name"]),
    F("beginarea", ["begin_area.name", "begin_area.aliases.name"]),
    F("country", "area.iso_3166_1_codes.code"),
    F("endarea", ["end_area.name", "end_area.aliases.name"]),
    F("ref_count", "artist_credit_names.artist_credit.ref_count",
                    transformfunc=sum, trigger=False),
    F("comment", "comment"),
    F("gender", "gender.name"),
    F("ipi", "ipis.ipi"),
    F("isni", "isnis.isni"),
    F("tag", "tags.tag.name"),
    F("type", "type.name")
],
    1.5,
    convert.convert_artist,
    extrapaths=["tags.count",
                "aliases.type.name", "aliases.type.id",
                "aliases.type.gid", "aliases.sort_name",
                "aliases.locale", "aliases.primary_for_locale",
                "aliases.begin_date", "aliases.end_date",
                "begin_area.gid", "area.gid", "end_area.gid",
                "area.begin_date", "area.end_date", "area.ended",
                "begin_area.begin_date", "begin_area.end_date",
                "begin_area.ended", "end_area.begin_date",
                "end_area.end_date", "end_area.ended",
                "gender.gid", "area.type.gid", "area.type.name",
                "begin_area.type.gid", "begin_area.type.name",
                "end_area.type.gid", "end_area.type.name",
                "type.gid"]
)

After the change:

SearchArtist = E(modelext.CustomArtist, [
    F("mbid", "gid"),
    F("artist", "name"),
    F("sortname", "sort_name"),
    F("alias", "aliases.name"),
    # Does not require a trigger since this will get updated on an alias update
    F("primary_alias", "primary_aliases", trigger=False),
    F("life_span-begin_date", "begin_date", transformfunc=tfs.index_partialdate_to_string),
    F("life_span-end_date", "end_date", transformfunc=tfs.index_partialdate_to_string),
    F("life_span-ended", "ended", transformfunc=tfs.ended_to_string),
    F("area-name", ["area.name", "area.aliases.name"]),
    F("begin_area-name", ["begin_area.name", "begin_area.aliases.name"]),
    F("country", "area.iso_3166_1_codes.code"),
    F("end_area-name", ["end_area.name", "end_area.aliases.name"]),
    F("ref_count", "artist_credit_names.artist_credit.ref_count",
                    transformfunc=sum, trigger=False),
    F("comment", "comment"),
    F("gender", "gender.name"),
    F("ipis-ipi", "ipis.ipi"),
    F("isnis-isni", "isnis.isni"),
    F("tag", "tags.tag.name"),
    F("type-name", "type.name"),
    F("begin_area-lifespan-begin_date", "begin_area.begin_date"),

    # Similar fields for `area`, `begin-area`, `end-area`

    F("gender-id", "gender.gid"),
    F("type-id", "type.gid")
],
    1.7,
    convert.convert_artist,
    extrapaths=["tags.count",
                "aliases.type.name", "aliases.type.id",
                "aliases.type.gid", "aliases.sort_name",
                "aliases.locale", "aliases.primary_for_locale",
                "aliases.begin_date", "aliases.end_date"
                ]
)
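Since the renamed indexer fields must line up exactly with the configset, a small cross-check could catch drift early. A hypothetical sketch (not existing SIR tooling) that reports indexer field names missing from an entity's schema.xml:

```python
# Hypothetical cross-check (not existing SIR tooling): report indexer field
# names that have no matching <field> declaration in the entity's schema.xml.
import xml.etree.ElementTree as ET

def schema_field_names(schema_xml):
    """Collect the name attribute of every <field> in a schema.xml string."""
    root = ET.fromstring(schema_xml)
    return {field.get("name") for field in root.iter("field")}

def missing_fields(indexer_names, schema_xml):
    """Indexer field names absent from the schema."""
    return set(indexer_names) - schema_field_names(schema_xml)

schema = ('<schema name="artist" version="1.7">'
          '<field name="mbid" type="mbid" indexed="true" stored="true" />'
          '<field name="artist" type="text" indexed="true" stored="true" />'
          '</schema>')
```

For example, missing_fields({"mbid", "artist", "gendre-id"}, schema) would flag the misspelled "gendre-id" before it silently produces an empty field.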

b. Currently the converter function builds the full entity object and returns it.
We need to modify the existing convert functions to return a list of dictionaries representing the nested fields, alongside the entity object.

Current converter function for artist entity:

def convert_artist(obj):
    """
    :type obj: :class:`sir.schema.modelext.CustomArtist`
    """
    artist = models.artist(id=str(obj.gid), name=obj.name,
                           sort_name=obj.sort_name)

    if obj.comment:
        artist.set_disambiguation(obj.comment)

    if obj.gender is not None:
        artist.set_gender(convert_gender(obj.gender))

    if obj.type is not None:
        artist.set_type(obj.type.name)
        artist.set_type_id(str(obj.type.gid))

    if obj.begin_area is not None:
        artist.set_begin_area(convert_area_inner(obj.begin_area))

    if obj.area is not None:
        artist.set_area(convert_area_inner(obj.area))
        if len(obj.area.iso_3166_1_codes) > 0:
            artist.set_country(
                models.def_iso_3166_1_code(obj.area.iso_3166_1_codes[0].code)
            )

    if obj.end_area is not None:
        artist.set_end_area(convert_area_inner(obj.end_area))

    lifespan = convert_life_span(obj.begin_date, obj.end_date, obj.ended)
    artist.set_life_span(lifespan)

    if len(obj.aliases) > 0:
        artist.set_alias_list(convert_alias_list(obj.aliases))

    if len(obj.ipis) > 0:
        artist.set_ipi_list(convert_ipi_list(obj.ipis))

    if len(obj.isnis) > 0:
        artist.set_isni_list(convert_isni_list(obj.isnis))

    if len(obj.tags) > 0:
        artist.set_tag_list(convert_tag_list(obj.tags))

    return artist

After the change to return nested_field_list alongside:

sir/wscompat/convert.py

def convert_artist(obj):
   """
   :type obj: :class:`sir.schema.modelext.CustomArtist`
   """
   artist = models.artist(id=str(obj.gid), name=obj.name,
                          sort_name=obj.sort_name)
   nested_field_list = []

   if len(obj.tags) > 0:
       artist.set_tag_list(convert_tag_list(obj.tags))
       nested_field_list.append({"tag_list": artist.get_tag_list()})
   if len(obj.aliases) > 0:
       artist.set_alias_list(convert_alias_list(obj.aliases))
       nested_field_list.append({"alias_list": artist.get_alias_list()})

   if obj.comment:
       artist.set_disambiguation(obj.comment)

   if obj.gender is not None:
       artist.set_gender(convert_gender(obj.gender))

   if obj.type is not None:
       artist.set_type(obj.type.name)
       artist.set_type_id(str(obj.type.gid))

   if obj.begin_area is not None:
       artist.set_begin_area(convert_area_inner(obj.begin_area))

   if obj.area is not None:
       artist.set_area(convert_area_inner(obj.area))
       if len(obj.area.iso_3166_1_codes) > 0:
           artist.set_country(
               models.def_iso_3166_1_code(obj.area.iso_3166_1_codes[0].code)
           )

   if obj.end_area is not None:
       artist.set_end_area(convert_area_inner(obj.end_area))

   lifespan = convert_life_span(obj.begin_date, obj.end_date, obj.ended)
   artist.set_life_span(lifespan)

   if len(obj.ipis) > 0:
       artist.set_ipi_list(convert_ipi_list(obj.ipis))

   if len(obj.isnis) > 0:
       artist.set_isni_list(convert_isni_list(obj.isnis))

   return artist, nested_field_list

c. Update query_result_to_dict in searchentities.py to add nested fields to the data dict and remove _store from it.

sir/schema/searchentities.py

    def query_result_to_dict(self, obj):
        """
        Converts the result of single ``query`` result into a dictionary via the
        field specification of this entity.

        :param obj: A :ref:`declarative <sqla:declarative_toplevel>` object.
        :rtype: dict
        """
        # Unchanged code.

        if (config.CFG.getboolean("sir", "wscompat") and
                self.compatconverter is not None):
            # _store is not required anymore.
            # data["_store"] = str(tostring(self.compatconverter(obj).to_etree(), encoding='us-ascii'), encoding='us-ascii')

            _, nested_list = self.compatconverter(obj)
            for n in nested_list:
                for n_field, value in n.items():
                    data[n_field] = str(tostring(value.to_etree(), encoding='us-ascii'), encoding='us-ascii')

        return data

d. Fix the test/test_searchentities.py and test/test_indexing_real_data.py tests, as they depend on the _store field to validate the output.
Expand test/test_wscompat_convert.py to cover remaining convert_<entity> functions.
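The new converter contract is easy to pin down with a unit test. A minimal pytest-style sketch, using a fake converter since the real generated models differ:

```python
# Minimal sketch of the new converter contract: every convert_<entity>
# returns (entity, nested_field_list). fake_convert_artist stands in for
# the real sir.wscompat.convert.convert_artist, which uses generated models.
def fake_convert_artist(obj):
    artist = {"id": obj["gid"], "name": obj["name"]}
    nested = []
    if obj.get("tags"):
        # Complex nested structures travel separately from the entity object.
        nested.append({"tag_list": obj["tags"]})
    return artist, nested

def test_converter_returns_nested_fields():
    artist, nested = fake_convert_artist(
        {"gid": "abc", "name": "Howard Shore", "tags": ["classical"]})
    assert artist["name"] == "Howard Shore"
    assert nested == [{"tag_list": ["classical"]}]
```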

4. Create response writers for complete fields with validation tests in query response writer

The MB-XML writer parses the Solr document by extracting the _store XML string, unmarshalling it, and writing it to the output. The MB JSON format is automatically generated from the MB XML format.

We need to create new writers that parse the Solr document, read values from each field, and construct valid MB-XML/MB-JSON objects for output.

a. MB-XML Writer
We need to create a writer that unpacks all flat fields, unmarshals nested fields and combines them into an entity object, which is then written to the output.

We can reuse MBXMLWriter and modify both parseSolrResponse methods for our objective.

We need to add entity-specific builders that create an entity object from its stored fields and return it.

As we have two implementations of parseSolrResponse (one for BasicResultContext and one for SolrDocumentList), we need a FieldReader interface that can be passed to the buildEntityFromFields router.

The XML writer will look something like:

// FieldReader interface
private interface FieldReader {
    String get(String fieldName);
}
//Helper function to unmarshall nested fields

private Object unmarshalFragment(Unmarshaller unmarshaller, String fragment){
    return unmarshaller.unmarshal(new ByteArrayInputStream(fragment.getBytes()));
}

//Artist entity builder

private Artist buildArtist(FieldReader doc, Unmarshaller unmarshaller) {
    Artist artist = new Artist();
    artist.setId(doc.get("mbid"));
    artist.setName(doc.get("artist"));
    artist.setType(doc.get("type-name"));
    artist.setTypeId(doc.get("type-id"));
    artist.setDisambiguation(doc.get("comment"));

    Gender gender = new Gender();
    gender.setId(doc.get("gender-id"));
    gender.setContent(doc.get("gender"));
    artist.setGender(gender);

    DefAreaElementInner area = new DefAreaElementInner();
    area.setId(doc.get("area-id"));
    area.setName(doc.get("area-name"));
    LifeSpan ls = new LifeSpan();
    ls.setBegin(doc.get("area-lifespan-begin_date"));
    ls.setEnd(doc.get("area-lifespan-end_date"));
    ls.setEnded(doc.get("area-lifespan-ended"));
    area.setLifeSpan(ls);
    artist.setArea(area);

    // Similar logic for beginArea, endArea, ipiList, isniList

    artist.setAliasList(unmarshalFragment(unmarshaller, doc.get("alias_list")));
    artist.setTagList(unmarshalFragment(unmarshaller, doc.get("tag_list")));

    return artist;
}
// Router function

private Object buildEntityFromFields(FieldReader doc, Unmarshaller unmarshaller) {
    switch (entityType) {
        case annotation:    return buildAnnotation(doc, unmarshaller);
        case area:          return buildArea(doc, unmarshaller);
        case artist:        return buildArtist(doc, unmarshaller);
        case cdstub:        return buildCdstub(doc, unmarshaller);
        case editor:        return buildEditor(doc, unmarshaller);
        case event:         return buildEvent(doc, unmarshaller);
        case instrument:    return buildInstrument(doc, unmarshaller);
        case label:         return buildLabel(doc, unmarshaller);
        case place:         return buildPlace(doc, unmarshaller);
        case recording:     return buildRecording(doc, unmarshaller);
        case release:       return buildRelease(doc, unmarshaller);
        case release_group: return buildReleaseGroup(doc, unmarshaller);
        case series:        return buildSeries(doc, unmarshaller);
        case tag:           return buildTag(doc, unmarshaller);
        case work:          return buildWork(doc, unmarshaller);
        case url:           return buildUrl(doc, unmarshaller);
        default: throw new IllegalArgumentException("invalid entity type: " + entityType);
    }
}

public void parseSolrResponse(ResultContext con,
        MetadataListWrapper metadatalistwrapper,
        SolrQueryRequest req)
        throws IOException {

  // unchanged code.

    while (iter.hasNext()) {
        int id = iter.nextDoc();
        Document doc = req.getSearcher().doc(id);
        FieldReader fieldReader = new FieldReader() {
            @Override
            public String get(String fieldName) {
                return doc.getField(fieldName).stringValue();
            }
        };
        Object entity = buildEntityFromFields(fieldReader, unmarshaller);

        try {
            adjustScore(maxScore, entity, iter.score());
        } catch (NullPointerException e) {
            throw new RuntimeException(SCORE_NOT_IN_FIELD_LIST);
        }

        xmlList.add(entity);
    }
}

public void parseSolrResponse(SolrDocumentList doclist,
               MetadataListWrapper metadatalistwrapper){

         // No change.

        while (iter.hasNext()) {
            SolrDocument doc = iter.next();
            FieldReader fieldReader = new FieldReader() {
                @Override
                public String get(String fieldName) {
                    return (String) doc.get(fieldName);
                }
            };

            Object entity = buildEntityFromFields(fieldReader, unmarshaller);

            try {
                adjustScore(maxScore, entity, (float) doc.get("score"));
            } catch (NullPointerException e) {
                throw new RuntimeException(SCORE_NOT_IN_FIELD_LIST);
            }

            xmlList.add(entity);
        }
    }

b. MB-JSON Writer
MBJSONWriter works by converting the output object from the MB-XML format to the MB-JSON format. We can reuse this writer without any modification once the changes to MBXMLWriter are implemented.
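For intuition on what that XML-to-JSON step does, here is a toy converter. It is illustrative only: the real mapping lives in MBJSONWriter in Java and has different rules for edge cases.

```python
# Toy illustration of an MB-XML to MB-JSON style mapping; the real
# conversion happens inside MBJSONWriter in Java.
import xml.etree.ElementTree as ET

def xml_to_dict(elem):
    """Recursively convert an XML element into plain Python data."""
    node = dict(elem.attrib)          # attributes become keys
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        # A leaf with no attributes collapses to its text content.
        return text if not node else {**node, "value": text}
    for child in children:
        # Repeated child tags accumulate into lists.
        node.setdefault(child.tag, []).append(xml_to_dict(child))
    return node

doc = ET.fromstring('<artist id="abc"><name>Howard Shore</name></artist>')
# xml_to_dict(doc) == {"id": "abc", "name": ["Howard Shore"]}
```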

c. Benchmarking
To validate that eliminating _store doesn’t regress query performance, we’ll record production queries using Solr’s request logging, then replay them against both the old and new schemas. Metrics to compare: query latency (QTime) and end-to-end response time (QTime plus the time taken by the response writer).
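Once QTime samples are collected for both schemas, the comparison itself is simple. A sketch of the summary step, assuming the samples have already been gathered into millisecond lists:

```python
# Sketch of the benchmark comparison step; assumes QTime samples (in ms)
# were already collected by replaying recorded queries against each schema.
import math
import statistics

def latency_summary(qtimes_ms):
    """Median and 95th-percentile latency of a list of QTime samples."""
    ordered = sorted(qtimes_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
    }

# Compare medians and p95s rather than means so a single slow outlier
# doesn't dominate the comparison. The sample values are made up.
old_schema = latency_summary([12, 15, 14, 90, 13, 16, 14, 15, 13, 12])
new_schema = latency_summary([10, 11, 12, 70, 11, 12, 11, 10, 12, 11])
```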

d. Validation Tests
Similar to the existing test strategy, test documents are added to Solr. The modified writers are then queried and their output is validated against the expected <entity>.xml and <entity>.json files.

Example test for artist entity:

Populate all the required fields in the getDoc() method

test/../AbstractMBWriterArtistTest.java

package org.musicbrainz.search.solrwriter;

import java.util.ArrayList;
import java.util.Arrays;

public abstract class AbstractMBWriterArtistTest extends AbstractMBWriterTest {
    @Override
    public ArrayList<String> getDoc() {
        return new ArrayList<String>(Arrays.asList(new String[]{
                "mbid", uuid,
                "artist", "Howard Shore",
                "sortname", "Shore, Howard",
                "type-name", "Person",
                "gender-name", "Male",
                "country", "CA",

                "area-id", "71bbafaa-e825-3e15-8ca9-017dcad1748b",
                "area-name", "Canada",
                "area-sort-name", "Canada",

                "begin_area-id", "74b24e62-d2fe-42d2-9d96-31f2da756c77",
                "begin_area-name", "Toronto",
                "begin_area-sort-name", "Toronto",

                "life_span-begin_date", "1946-10-18",
                "lifespan-ended", "false",

                "alias_list", "<alias-list><alias sort-name=\"Shore\">Shore</alias><alias sort-name=\"Howard Shaw\">Howard Shaw</alias><alias sort-name=\"H. Shore\">H. Shore</alias></alias-list>",
                 "tag_list", "<tag-list count=\"10\">" +
                "<tag count=\"1\"><name>lord of the rings</name></tag>" +
               "<tag count=\"2\"><name>classical</name></tag>" +
                "<tag count=\"2\"><name>canadian</name></tag>" +
                "<tag count=\"1\"><name>film composer</name></tag>" +
            "<tag count=\"1\"><name>score</name></tag>" +
            "<tag count=\"1\"><name>academy award winner</name></tag>" +
            "<tag count=\"1\"><name>easy listening soundtracks and musicals</name></tag>" +
            "<tag count=\"2\"><name>soundtrack</name></tag>" +
            "<tag count=\"1\"><name>howard</name></tag>" +
            "<tag count=\"1\"><name>shore</name></tag>" +
            "</tag-list>"        }));
    }
}

Modify AbstractMBWriterTest to index all the document values and then validate the response output against the expected file.

test/../AbstractMBWriterTest.java

@Test
    public void performCoreTest() throws Exception {
        ArrayList<String> docValues = new ArrayList<>(getDoc());
        assertU(adoc(docValues.toArray(new String[0])));
        assertU(commit()); 

        String expectedFileName = String.format("%s-list.%s", getCorename(), getExpectedFileExtension());
        String expectedFilePath = AbstractMBWriterTest.class.getResource(expectedFileName).getFile();
        
        byte[] content = Files.readAllBytes(Paths.get(expectedFilePath));
        String expected = new String(content);

        String response = h.query(req("qt", "/advanced", "q", "*:*", "wt", getWritername()));
        
        compare(expected, response);
    }

Timeline

Phase 1 (May 4 - June 21) :

  • Week 1 (May 4 - May 10)
    Schema migration:

    • Set up local Solr + SIR + mb-solr dev environment

    • Add docValues="false" to all relevant field types in fieldtypes.xml

    • Bump schema version to 1.7 for all entities

  • Week 2 (May 11 - May 17):
    Complete configset:

    • Audit the artist configset: catalog which fields can be stored directly, which nested structures can be flattened, and which nested structures need to be stored as an XML string.

    • Set stored="true" for all searchable fields.

    • Add flat non-searchable fields based on extrapaths in SIR

    • Add storefields for complex nested fields

    • Rename fields which are child elements as parent_element-child_element

    • Remove _store field

    • Update request-params.xml (fl=score, *, remove _store)

    • Verify locally on Solr that the artist configset indexes and returns fields correctly

  • Week 3 (May 18- May 24):
    Complete indexer

    • In schema/__init__.py: rename all field names in SearchArtist to match the new configset names

    • Move non-searchable flat fields from extrapaths to fields for all entities

    • Update version numbers in E(...) call from 1.5 → 1.7

    • Modify the convert_artist function in sir/wscompat/convert.py to return an (artist, nested_field_list) tuple instead of just artist

    • Update query_result_to_dict in searchentities.py to unpack (artist, nested_list) from the compatconverter, serialize each nested field individually into data[n_field], and remove data["_store"]

    • Run SIR against a local MusicBrainz database and verify that all flat fields are populated correctly, storefield values contain valid XML strings, and no _store field is present

  • Week 4 (May 25 - May 31):
    Modify Response writers

    • Implement FieldReader interface and unmarshalFragment helper

    • Implement buildEntityFromFields router

    • Implement buildArtist function

    • Test the full search system end-to-end for Artist entity

  • Week 5 - 6 (June 1 - June 14):
    Benchmarking

    • Record production queries on the artist core and replay against old and new schema

    • Compare Qtime and end to end time taken

    • If performance holds up, proceed with the remaining entities.

    • If performance is worse, discuss tradeoffs with mentors and adjust the approach.

  • Week 7 (June 15 - June 21):
    Buffer period

Phase 2 (June 22 - August 24):

  • Week 8 - 11 (June 22 - July 19)
    Roll out to remaining entities.

    Configsets:

    • Audit the remaining entities (annotation, area, cdstub, editor, event, instrument, label, place, recording, release, release_group, series, tag, url, work)

    • Complete fields for each of the entities.

    • Verify locally on Solr that these entities index and return fields correctly

    Indexer:

    • Complete indexer to populate all of the fields and remove _store

    • Fix test/test_searchentities.py and test/test_indexing_real_data.py, as both currently use _store for output validation.

    • Expand test/test_wscompat_convert.py to cover all remaining convert_<entity> functions

    Response Writer:

    • Implement per-entity builder functions for remaining entities

    • Populate getDoc() for all 16 entities in their respective AbstractMBWriter<entity>Test classes with all required field values: flat fields as direct string pairs and complex nested field values as XML strings.

    • Update performCoreTest() in AbstractMBWriterTest to call adoc with the complete field array from getDoc(), commit, query, read the expected output and compare

    • Integration test: after the response writers are completed, test the full search system end-to-end for each entity.

  • Week 12 - 16 (July 20 - August 24)
    Main Project is completed

    1. Buffer period for anything that took longer than expected.
    2. Prepare final GSoC report.

    Stretch Goal

    • Replace custom response writers with Solr’s inbuilt writer for Artist core.
    • Make required changes in mb-server to format the data into valid MB XML / MB JSON format
    • Benchmark this against custom response writers and report the findings.

Other Information

  • Tell us about the computer(s) you have available for working on your SoC project!

    I use a laptop with 16GB RAM and 1TB SSD, running Windows 11 and Kubuntu.

  • When did you first start programming?

    I started programming in the first year of my engineering course.

  • What type of music do you listen to? (Please list a series of MBIDs as examples.)

    My music taste hasn’t evolved much since last year; it’s mostly been the same genres.
    Some new tunes I’ve been listening to lately:
    Bunce Road Blues by Jcole
    Freddie Freeloader by Miles Davis
    Moanin’ by Art Blakey and The Jazz Messengers
    56 Nights by Future
    Fox 5 by Lil Keed
    Mine by Tems

  • What aspects of the project you’re applying for (e.g., MusicBrainz, AcousticBrainz, etc.) interest you the most?

    I am highly interested in the performance of the Solr search engine powering the MusicBrainz search architecture, and I want to work towards making it faster and more efficient.

  • What sorts of programming projects have you done on your own time?

    I created a real time movie tickets tracker which sends out mail notifications when tickets are available. Github link

    A friend and I created a high performance upload system written in golang for a hackathon.

    I worked on some small GPU programming projects using CUDA to parallelize general algorithms.

    I implemented mapReduce in golang as part of my course work.

  • How much time do you have available, and how would you plan to use it?

    I can contribute 40 hours/week. I have a week of exams from June 8, so I can contribute around 20 hours during that week.
    I want to start the coding period early, from May itself, so we can have some time at the end to work on the stretch goals.

One quick comment for now: the intent is to use the inbuilt writers provided by Solr once the _store field is removed, unless there is a performance gap. Additionally, I feel these simplifications would improve the performance of the search. If they lead to worse performance, then it’s a tradeoff that we need to consider. So we should only experiment with one core to begin with.

@lucifer Thanks for the feedback.
I have made the required changes!