A modest proposal: delete the Script field of Release

Jim_DeLaHunt · March 7, 2020, 8:30am

The Release entity contains a field, Script. I have a modest proposal: let’s delete this Script field entirely.

I suspect that this field is not used. It is just stored, retrieved, and displayed. Anything which might be interested in this field could likely derive it from the Language field of Release and/or the character codes of the Release’s title, track titles, and other text fields. And populating this field takes a small but noticeable amount of work. Authoring a proper Release entity is plenty of work already; we should not ask people to do an extra task which adds no value.

I suspect that Script might have been significant in the early days of MusicBrainz, before the Release Title and Track Title strings were encoded in Unicode. In those days, strings might have been in a Western European encoding or a Russian encoding or a Japanese encoding, and a Script field might have been crucial in interpreting the byte codes of such strings. But one of the core benefits of Unicode encoding is that character codes are universal and need no script qualifier.

If my suspicion is wrong, if there is a use for the Script field, please let me know. Let’s document it in the Release entity docs, so people know better how to set it.

Jim_DeLaHunt · March 7, 2020, 8:45am

I searched the MusicBrainz server source code for “scriptID” or “Script_id”. There are only 20 hits, and they look to me like code that only stores, retrieves, and displays the field.

There is a statistics page for values in the script table, with numbers of Releases. 84% of Releases have Script == Latin. A further 9% have Script == Unknown script. Add in Script == Cyrillic and Script == Japanese, and you have 99% of all Releases. With this skewed a distribution, I suspect the Script field is not storing much information.

It would be an interesting experiment to do a database query, categorising the ranges of Unicode character codes used by the text fields for each Release, and comparing these to the Script field value. I suspect that there will be a number of mismatches. If the Script field contains inaccurate data, that reduces its value still further. I have not done this experiment.
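A minimal sketch of what that comparison could look like, assuming releases have already been exported as plain dictionaries. The `SCRIPT_RANGES` table, the `find_mismatches` helper, and the sample records are all hypothetical; the range table is deliberately tiny, and a real run would use the full Unicode script data and the actual database.

```python
from collections import Counter

# Illustrative, incomplete code point ranges (an assumption for this sketch).
# A real check would load the complete Unicode Scripts.txt data instead.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x00C0, 0x00D6, "Latin"),
    (0x00D8, 0x00F6, "Latin"),
    (0x00F8, 0x024F, "Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3040, 0x309F, "Japanese"),  # Hiragana block
    (0x30A0, 0x30FF, "Japanese"),  # Katakana block
]


def script_of(ch):
    """Return the script name for one character, or None if not in the table."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return None  # digits, punctuation, spaces, and anything not listed above


def dominant_script(strings):
    """Return the most frequent script across all strings, or None."""
    counts = Counter()
    for text in strings:
        for ch in text:
            name = script_of(ch)
            if name is not None:
                counts[name] += 1
    return counts.most_common(1)[0][0] if counts else None


def find_mismatches(releases):
    """Yield (title, stored_script, derived_script) where the two differ."""
    for rel in releases:
        derived = dominant_script([rel["title"], *rel["tracks"]])
        if derived is not None and derived != rel["script"]:
            yield rel["title"], rel["script"], derived


# Hypothetical sample records; the second has a deliberately wrong Script value.
sample_releases = [
    {"title": "Abbey Road", "tracks": ["Come Together"], "script": "Latin"},
    {"title": "Группа крови", "tracks": ["Группа крови"], "script": "Latin"},
]
mismatches = list(find_mismatches(sample_releases))
print(mismatches)
```

Picking the single most frequent script is itself a design choice; releases that genuinely mix scripts would need a different rule.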

It would be an interesting experiment to delete the Script field from the Release table of the database schema, delete the corresponding code to save and restore that field, and see what breaks. I have not done this experiment.

Does anybody on this forum remember the history of the Script field? That might give some insight.

I learned a long time ago in software engineering that it is no great trick to make a system more complex. It is a great accomplishment to make a system more simple, especially if it doesn’t reduce the functionality of the system. Deleting the right feature for the right reason is valuable.

mfmeulenbelt · March 7, 2020, 3:09pm

It’s information about the release, why would we selectively not record that information? And you can’t derive the script from the language, because some languages can be written in multiple scripts.

I can imagine that in some cases, people would prefer some metadata in a certain script if it is available in multiple scripts, and they could script Picard for that. Or someone would like to do a bit of research on how widespread a certain script or a combination of script and language is in music.

What do you have against script in the first place? This is such a random rant, you might as well campaign for removing the date field or whatever.

culinko · March 7, 2020, 5:44pm

Would that cover stuff like Release “サイレンス!” by Petteri Sariola - MusicBrainz?

Jim_DeLaHunt · March 7, 2020, 8:33pm

I argue that it’s redundant information about the release, which can already be derived from the text fields of the Release entity. Would we want a field “First character of Title”, where people type in a single character matching the start of the Release Title string? No, because it is redundant information, easily derived from other data in the entity, and wasted effort. The Script field seems similarly redundant to me.

Jim_DeLaHunt · March 7, 2020, 8:37pm

True, some languages can be written in multiple scripts. But you can derive the script from the character codes actually used in the strings for Release Title, Track Title, and maybe Release Artist (as credited, not in their native script; I’m looking at you, “Tchaikovsky”). My claim is

Jim_DeLaHunt · March 7, 2020, 8:44pm

Is this possible in MusicBrainz and Picard now? I believe not. There is one Release entity. It has one Release Title, one list of Track Titles, one Script field. I’m not aware of being able to ask for, say, the Latin script alternate of a Japanese-script Release. In any case, I believe that if there ever is a need for a Script field, then we could write a function to derive it by looking at the character codes of the Release Title and Track Titles. We don’t need to require editors to enter it.

I suggest that this would be better done by looking at the actual character codes in the Release and Track title strings, rather than trusting the imperfect data entry of MusicBrainz contributors. I don’t see the Script field adding much value for this use case.

Jim_DeLaHunt · March 7, 2020, 8:46pm

No, not random, it’s a targeted suggestion based on lots of experience.

Jim_DeLaHunt · March 7, 2020, 9:39pm

Absolutely. This Release has a quite simple Script situation. It is titled サイレンス!, which is Japanese katakana characters that spell out a sound much like the English word “Silence”.

The Release Title and Track Titles together are 66 characters, of which 58 are Katakana, 1 is Hiragana, 3 are Kanji, and 4 are digits and punctuation (which could be either half-width Latin script or full-width Japanese or Chinese script). Reading the documentation for the Script field of Release, the value which best matches this data is Japanese.
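A tally like this can be produced mechanically. The sketch below is one rough way to do it, classifying each character by the start of its Unicode character name; the `rough_class` and `tally` helpers are hypothetical names, and only the Release Title is used here because the track titles are not quoted in this thread.

```python
import unicodedata
from collections import Counter


def rough_class(ch):
    """Classify one character by the prefix of its Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("KATAKANA"):
        return "Katakana"
    if name.startswith("HIRAGANA"):
        return "Hiragana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Kanji"
    return "Other"  # digits, punctuation, Latin letters, and everything else


def tally(strings):
    """Count characters per rough class across all the given strings."""
    return Counter(rough_class(ch) for text in strings for ch in text)


# Release Title only; track titles would be appended to this list.
counts = tally(["サイレンス!"])
print(counts)
```

Name prefixes are a heuristic, not the Unicode Script property, so a production version would want proper script data.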

Side note: I think the definition of Katakana in the Script doc is a bit strange. It says, “Katakana should only be used for transliterations into Japanese (example, English->Japanese). Japanese language titles with words written in Katakana should use Japanese.” Thus, it is encoding language information in addition to Script information. No other value for Script does this. It is a bit of a special case for Japanese. Why represent this information, and not e.g. the usage of Latin script to transcribe Japanese words? If it is important to note a Release which has a foreign language transcribed into a different script, then let’s make a field for that specifically.

This Release, サイレンス!, also demonstrates a limitation of the Release entity. There is only a single Language field and a single Script field, but among the Release Title and Track Title strings there could be multiple Languages and multiple Scripts. In this Release, there is English transliterated into Katakana, and there is Japanese. That is, according to the docs, multiple values for Script.

This Release has a Language value of English. I think Japanese would be more correct. But that gets into a discussion about the Language field, which is not my purpose here. Artists like to mess around with the boundaries of categories. Whatever categories we set up, we will come across exceptional cases which challenge them.

Zas · March 7, 2020, 10:55pm

That’s especially true since in many cases editors have no clue what Script and Language are.
For Script, we can clearly rely on Unicode used ranges, and/or neural network classification.

For Language, I’m pretty sure neural network models for language identification can outperform humans in most cases (a program wouldn’t select the wrong entry in the select box…), plus detect all languages used with probabilities. Check a lib like GitHub - google/cld3 (Python bindings available at: GitHub - Elizafox/cld3: Python bindings for cld3)

So I would rather add “detected scripts” and “detected languages” and let editors fix those if needed.

Clearly, current fields aren’t that useful, because they are either empty, incorrectly set, or set to very common values (like English/Latin because even Japanese / French / German (etc.) artists are releasing songs titled in English nowadays).
Also we have many releases mixing languages and/or scripts that end with [Multiple Scripts] or [Multiple Languages] which don’t help much imho.

yvanzo · March 9, 2020, 3:06pm

Independently of Unicode, script is related to what people are able to read, so there is a use for it.

In addition to the causes mentioned above, this field is probably underused because MB is still difficult for non-English speakers to use. So I don’t think removing it will help with this at all, quite the opposite. That most probably explains the astonishment of @mfmeulenbelt.

However it is true it holds redundant information, and that having automatic detection would be ideal.

To clarify, the main added value of still storing this field is to be used for searching/filtering releases.

Automatic detection of script has already been suggested in the ticket below; feel free to vote for it:

Automatic detection of language is a bit more complex, thus less likely to be implemented soon.

Jim_DeLaHunt · March 9, 2020, 6:13pm

Agreed. However, I propose that to the extent we need a Script value to indicate what people can read, we can write code to derive that Script value from the Unicode characters used in the Release Title and Track Titles and other fields in the Release. We don’t need editors to manually set this value in order to use this value.

Instead of “automatic detection”, I propose “derivation from other fields at time of use”. But the difference is really only a matter of caching.
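To make the caching point concrete, here is a small sketch under stated assumptions. The `Release` class and `derive_script` function are hypothetical and are not MusicBrainz code; the derivation is a rough heuristic based on Unicode character names.

```python
import unicodedata
from collections import Counter
from functools import cached_property


def derive_script(strings):
    """Rough heuristic: most common first word of the Unicode names of letters."""
    words = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for text in strings
        for ch in text
        if ch.isalpha()
    )
    return words.most_common(1)[0][0].title() if words else None


class Release:
    """Hypothetical stand-in for a Release with a title and track titles."""

    def __init__(self, title, tracks):
        self.title = title
        self.tracks = tracks

    @property
    def script(self):
        # Derived at time of use: recomputed on every access, never stored.
        return derive_script([self.title, *self.tracks])

    @cached_property
    def script_cached(self):
        # Same derivation, computed on first access and then remembered.
        return derive_script([self.title, *self.tracks])


release = Release("Abbey Road", ["Come Together"])
first_cached = release.script_cached   # computed and stored now
release.title = "Группа крови"         # later edit to the text fields
release.tracks = ["Группа крови"]
print(release.script, release.script_cached)
```

The cached value is only as fresh as its last computation, so a stored copy needs to be recomputed whenever the titles are edited.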

Tell me more? Is this a theoretical possibility, or is there an actual tool or workflow now which uses the value in this field? For what kind of searching or filtering releases?

yvanzo · March 10, 2020, 7:22am

That is an extension of the above ticket; I forwarded your proposal to it in a comment.

once the automated function is reliable, which we don’t know yet as it has not been implemented.

Caching script in the database is probably the simplest approach and we already follow it for track count.
So there is no difference to me.

Theoretical for the MusicBrainz website, which isn’t using script yet except to display its value (e.g. with upcoming MBS-3609), but I don’t know about WS/2 clients. Theoretical example: filtering or searching for releases whose script matches the user’s locale or preferences.

scott967 · March 16, 2020, 4:07am

Just a user, but one area where I find script useful is Hans vs Hant for Chinese. I’m not sure how easy it is to determine from the Unicode code points used in a text. There might also be some instances in Indic languages where explicitly providing the script could assist a user.

Jim_DeLaHunt · March 16, 2020, 5:49am

Distinguishing “hans” (Simplified Chinese) vs “hant” (Traditional Chinese) is probably better done by the Language field than by the script field. The Language field takes ISO 639-3 language codes.

There is a long-standing question about the difference between “language”, something spoken and maybe or maybe not written, and “script”, a way of expressing language in persistent form, and the fact that one language might be written in alternative scripts. Generally, my opinion is that IETF Language Tags (aka BCP 47) work better than ISO 639-3 for this sort of thing, but that would be a separate change.

In any event, ISO 639-3 includes values for Chinese of all forms (zho); “Mandarin” (cmn); “Cantonese” (yue); and so on. Looking at the Database Statistics page, the language tags for specific kinds of Chinese are already in use. It would be worth cross-tabulating Language and Script values to see how well editors have correlated them. IETF Language Tags arguably has better tags for Chinese of all forms (zh); “Simplified Chinese” (zh-Hans); and “Traditional Chinese” (zh-Hant).
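As an illustration of how an IETF language tag carries an optional script qualifier, the sketch below splits a tag into its language and script subtags. The `split_language_tag` function is a hypothetical helper that handles only simple, well-formed tags; it ignores private-use and grandfathered tags, so a full BCP 47 parser would be needed for real data.

```python
def split_language_tag(tag):
    """Return (language, script) from a simple BCP 47 tag; script may be None."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = None
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha():
            # A four-letter alphabetic subtag after the language is the script.
            script = sub.title()
            break
        if len(sub) == 3 and sub.isalpha():
            # Three-letter extended language subtag; keep looking for a script.
            continue
        break  # region, variant, or extension: no script subtag is present
    return language, script


for example in ("zh", "zh-Hans", "zh-Hant-TW", "sr-Latn", "en-US"):
    print(example, split_language_tag(example))
```

The script part can be absent, which is what lets one tag scheme cover both plain languages and language-plus-script combinations.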

Likewise, there are ISO 639-3 language tags for various Indic languages, so that could perhaps encode the information about Indic scripts which the Script field purports to contain.

It is possible to classify a block of text as “Simplified Chinese” or “Traditional Chinese” or “Chinese — indeterminate if Simplified or Traditional”, as well as other languages, by examining character codes. The longer the text, the more reliable the classification. See https://github.com/jpatokal/script_detector .
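The idea can be sketched in a few lines. This is not how script_detector works internally (that is not described here); it is a simplified illustration using two tiny hand-picked character sets, which are assumptions for the example. A real classifier would use a complete variant mapping, and these sets are only meaningful for text already known to be Chinese.

```python
# Tiny illustrative samples of characters that differ between the two forms.
# Each position pairs a Simplified character with its Traditional counterpart.
SIMPLIFIED_ONLY = set("国汉语书们这爱乐东车门")
TRADITIONAL_ONLY = set("國漢語書們這愛樂東車門")


def classify_chinese(text):
    """Return 'Hans', 'Hant', or 'indeterminate' for a block of Chinese text."""
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified > traditional:
        return "Hans"
    if traditional > simplified:
        return "Hant"
    # No distinguishing characters, or an even split between the two forms.
    return "indeterminate"


for sample in ("汉语", "漢語", "中文"):
    print(sample, classify_chinese(sample))
```

Short strings often contain only characters shared by both forms, which is why longer text gives a more reliable answer.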

jesus2099 · March 16, 2020, 7:11am

But it does not seem that there are any three letter codes for simplified and traditional characters.

This ISO list seems made for spoken languages rather than for written languages / scripts.

Jim_DeLaHunt · March 16, 2020, 7:28am

Yes, and that’s why the IETF Language Tags work better for MusicBrainz’s purposes. They can include a script qualifier if useful, but don’t have to.