English Spelling Vs Japanese Spelling

Rob_209 · August 7, 2016, 10:12pm

Hi everyone. I am new as a user here but have been a lurking fan for some time. Decided to have a go at editing some albums today. I have a huge amount of video game music CDs from my favourite games from over the years & also have a last.fm account. Not sure if it is OCD yet but I hate seeing the names of Japanese artists spelt in Japanese. I would like to know why the names are spelt / tagged in Japanese and not in English?

mfmeulenbelt · August 7, 2016, 10:23pm

Because they are Japanese artists?

But there is an option in Picard to translate artist names to English. It’s under Metadata in Picard’s options.

sparkinson · August 8, 2016, 8:52am

Captain Obvious speaking: the option mentioned will only work if there actually is a suitable alias for the artist.

@Rob_209, it’s always good to give a concrete example. Which artist(s) are you talking about? Then we can see whether the Picard option should fix your case, or if the database is lacking in this regard.

mfmeulenbelt · August 8, 2016, 9:51am

That option used to work with the artist’s sort name, which should always be in the Latin alphabet. I’m not sure if it doesn’t still work that way (when did it change?). Maybe it could be re-added as a fall back?

sparkinson · August 8, 2016, 10:10am

Oh, that is news to me!

I always considered this some kind of crutch. And there are definitely editors/userscripts that keep the sortname in the „native“ script. See Edit #39705736 - MusicBrainz for an example.

Jim_DeLaHunt · August 8, 2016, 11:56pm

@Rob_209 : you might be interested in the earlier forum post, Benefit of having artist names with non-latin characters?.

This is another example of MusicBrainz not being really multilingual. Requiring the sort name to be in the Latin script really is a crutch. It doesn’t give good sorting results except for English-language users. It is a convention not always followed, apparently.

A better way to deliver good sorting is:

1. Let entities have multiple names, each tagged with a language and culture and purpose. One name might be “Japanese sortable”. One name might be “English sortable”. One might be “German sortable”.
2. Let the user’s desired language and conventions inform sorting. One user might ask for “Sorted by English conventions, with names in original script”. Another user might ask for “Sorted by German conventions, with name in German script”.
3. MusicBrainz selects a name for sort use according to the user’s preferences, when preparing to sort, then sorts according to the user’s conventions.
4. MusicBrainz selects a (possibly different) name for display use after sorting.
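
A minimal Python sketch of the four steps above, to make the idea concrete. Everything here is hypothetical: the `Alias` and `Artist` classes, the `purpose` values and the `sorted_for_user` helper are illustrative names, not part of the MusicBrainz schema, and `casefold()` only stands in for real language-aware collation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Alias:
    text: str
    locale: str   # e.g. "ja", "en", "ru" (assumed tagging scheme)
    purpose: str  # "display" or "sort" (assumed values)


@dataclass
class Artist:
    aliases: List[Alias] = field(default_factory=list)

    def pick(self, purpose: str, locales: List[str]) -> Optional[str]:
        """Return the first alias with this purpose, trying locales in preference order."""
        for locale in locales:
            for alias in self.aliases:
                if alias.purpose == purpose and alias.locale == locale:
                    return alias.text
        return None


def sorted_for_user(artists, sort_locales, display_locales):
    # Choose a sort name per the user's preferences, then sort on it.
    keyed = [(a.pick("sort", sort_locales) or "", a) for a in artists]
    keyed.sort(key=lambda pair: pair[0].casefold())  # stand-in for real collation
    # Choose a (possibly different) display name after sorting.
    return [a.pick("display", display_locales) for _, a in keyed]


tchaikovsky = Artist([
    Alias("Пётр Ильич Чайковский", "ru", "display"),
    Alias("Tchaikovsky, Pyotr Ilyich", "en", "sort"),
])
asai = Artist([
    Alias("浅井健一", "ja", "display"),
    Alias("Asai, Kenichi", "en", "sort"),
])

# "Sorted by English conventions, with names in original script"
print(sorted_for_user([tchaikovsky, asai], ["en"], ["ja", "ru"]))
```

An artist with no alias for the requested sort locale gets an empty key here and so sorts first; a real design would have to decide what to do in that case.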

dukeja · August 9, 2016, 12:16am

That presumes, of course, that all the various translations are present. Allowing is one thing, providing the various translations is another. How well will the collation algorithms work when some translations are absent? That needs to be worked out as well. We cannot require translations to be present; we can only provide for their existence.

Jim_DeLaHunt · August 9, 2016, 5:53am

It shouldn’t be necessary to have explicit aliases translating each name into each possible language. It’s possible, for most cases, to write code that transliterates a name from one script and language into another script according to the rules for another language. e.g. I expect it is tractable to write code that can transliterate “Чайковский” into “Tschaikowski” (for German) and “Tchaikovsky” (for English) and “チャイコフスキー” (for Japanese).

Exceptions: Japanese kanji names into anything will be flawed, because pronunciation of Japanese names is so irregular. (It’s common for native Japanese databases to include a “phonetic reading of name” field along with a “name” field. We could define a “phonetic reading” alias, I suppose.) Arabic to Latin script transliteration is full of alternate spellings: e.g. Mohammed, Mohamed, Muhammad, Muhammat, etc. Explicit aliases can smooth over frequently-encountered problems.

That should be no problem. The sorting conventions for each language probably have some convention for how to sort names not in that language or script, e.g. for English, sort all names with non-Latin script after the Latin script names, and group names of each script together. Something like the sorting rules of the Common Locale Data Repository will specify this.

sparkinson · August 9, 2016, 6:48am

I mostly agree. You know that the DB already allows for what you want, since aliases can have sort names as well?

I don’t think we can give a useful result for this (esoteric, IMHO) request, until transliterations are universal. You’d want to sort 相川嘉男 before Пётр Ильич Чайковский because A comes before T, but if you’re missing Aikawa-san’s transliteration, you’re bound to sort him after all the artists where you do have them.

Jim_DeLaHunt · August 9, 2016, 6:57am

I am proposing that MusicBrainz include code to transliterate names on demand, so that there is no need to add aliases to each entity for each target language.

And, I probably wasn’t being clear when I wrote:

What I mean is the English conventions for sorting a list of strings, some of which are non-Latin. The English convention I expect is: Latin-script strings first, in English-language sorting order; then all the non-Latin strings, grouped by script and in some reasonable order within script; e.g. the Cyrillic strings in Russian sort order, then all the Greek strings in Greek sort order, then the Japanese strings in kanji character code order, etc.
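
A rough Python sketch of that convention, using only the standard library. The script order in `SCRIPT_ORDER` is an assumption, the script is guessed from the Unicode character name of the first letter, and plain code point order stands in for the proper per-script sort orders (so it is not true Russian or Greek collation).

```python
import unicodedata

# Assumed order of script groups; any script not listed sorts last.
SCRIPT_ORDER = ["LATIN", "CYRILLIC", "GREEK", "HIRAGANA", "KATAKANA", "CJK"]


def script_of(text):
    """Guess the script from the Unicode name of the first letter in the string."""
    for ch in text:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def english_list_key(text):
    """Sort key: Latin strings first, then the other strings grouped by script."""
    script = script_of(text)
    rank = SCRIPT_ORDER.index(script) if script in SCRIPT_ORDER else len(SCRIPT_ORDER)
    # Latin: case-insensitive order. Others: code point order within the script.
    return (rank, text.casefold() if script == "LATIN" else text)


names = ["Чайковский", "Bowie", "浅井健一", "Καφετζόπουλος", "asai", "Догилева"]
print(sorted(names, key=english_list_key))
```

The original strings are what get displayed; no transliteration is involved in building the key.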

The English convention I expect for sorting a list of non-Latin strings does not call for sorting by the English Latin-script transliteration, but displaying the original non-Latin string.

dukeja · August 9, 2016, 12:06pm

That would be absolutely wonderful. But I’m dubious about it being tractable. I would love to be proven wrong, however. That would require extensive language and script expertise to write and could degenerate into an N x N mapping (where N is the number of languages supported). Perhaps some of the automatic translation services out there could be employed. Once transliteration is available, I don’t think we need to then have special collation rules for the transliterated names. Just let them be listed in their proper order. As an English speaker, I certainly wouldn’t want to see Tchaikovsky at the end or anywhere else than in the T’s.

After sending this, I did some poking around for translation APIs. There may be something there. But the sense I had, and still have, is that automatic translation is a non-trivial undertaking and I wouldn’t advise doing it ourselves. If we want to do it, I think the best bet would be to leverage the hard work of others through an API/Service.

Zastai · August 9, 2016, 6:49pm

Point 1: I think the idea is for MB to eventually move to a BB-style “there are no names, only aliases, some of which are marked as primary” model.

Point 2: Automatic transliteration does not remove the need for a sort name field (as in the posts above: with Kanji-based Japanese names, you’d want a katakana or hiragana version as sort name).

Point 3: Even with a sort name field present and transliteration, that does not necessarily produce a usable value for the target locale’s name sorting rules. An example might be the Dutch name sorting rules, where initial particles in last names (typically “van”, “de” and “van der”, but others exist) are not considered for the sort order (they are placed after the first name).
For example, the sort name in Dutch for “Ludwig van Beethoven” would be “Beethoven, Ludwig, van”. To make this extra fun, this only applies to Dutch in the Netherlands. In the Flanders region of Belgium, the “van” is considered a fixed part of the name. So there the sortname should be “vanBeethoven, Ludwig” (note: no space, since for Dutch, it is recommended to ignore spaces and apostrophes in last names).
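
To illustrate why one stored sort name cannot serve both regions, here is a small Python sketch of the two rules as described above. The function names, the `"NL"`/`"BE"` region codes and the short particle list are assumptions for the example; real names need a longer particle list and more careful parsing.

```python
# Particles to look for, longest first; others exist (assumed short list).
PARTICLES = ("van der", "van", "de")


def split_name(full_name):
    """Split 'Given [particle] Family' into (given, particle, family)."""
    for particle in PARTICLES:
        marker = f" {particle} "
        if marker in full_name:
            given, family = full_name.split(marker, 1)
            return given, particle, family
    given, _, family = full_name.rpartition(" ")
    return given, "", family


def dutch_sort_name(full_name, region):
    """Build a sort name for region 'NL' (Netherlands) or 'BE' (Flanders)."""
    if region not in ("NL", "BE"):
        raise ValueError(f"unknown region: {region}")
    given, particle, family = split_name(full_name)
    if not given:
        return family  # single-word name, nothing to reorder
    if not particle:
        return f"{family}, {given}"
    if region == "NL":
        # The particle is not considered for sorting and goes after the given name.
        return f"{family}, {given}, {particle}"
    # Flanders: the particle is a fixed part of the family name;
    # spaces and apostrophes in the family name are ignored.
    fused = (particle + family).replace(" ", "").replace("'", "").replace("’", "")
    return f"{fused}, {given}"


print(dutch_sort_name("Ludwig van Beethoven", "NL"))
print(dutch_sort_name("Ludwig van Beethoven", "BE"))
```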

Jim_DeLaHunt · August 9, 2016, 8:30pm

Good news, then. The Internationalization Classes for Unicode (ICU) project applied the extensive language and script expertise necessary, and released their library as free software. ICU transliterates your choice of input strings according to instructions given by “Transliterator Identifiers”. You can chain transliterations together.

Take a look at the ICU Transform Demonstration. Under “Insert Sample”, select “Names”.

In “Compound 1”, type Latin; Title . You get results like:
정, 병호 → Jeong, Byeongho
たけだ, まさゆき → Takeda, Masayuki
Догилева, Татьяна → Dogileva, Tatʹâna
Καφετζόπουλος, Θεόφιλος → Kaphetzópoulos, Theóphilos

Next, in “Compound 1”, type Latin; Cyrillic . You get results like:
정, 병호 → йеонг, быеонгхо
たけだ, まさゆき → такеда, масаыуки
Догилева, Татьяна → Догилева, Татьяна
Καφετζόπουλος, Θεόφιλος → Капхетзо́поулос, Тхео́пхилос

Not perfect for MB needs immediately, but a promising sign that the transliteration problem may be tractable.

Jim_DeLaHunt · August 9, 2016, 8:39pm

Agreed, correct name sorting needs to allow choice among many culturally-appropriate options. The good news is, we are not the only project that wants to sort lists in culturally-appropriate ways.

The ICU project also has a collation service. It lets you program in sorting rules that you want. I don’t see that they already have tailorings for Dutch name sorting, and Belgium-Flanders name sorting, but the architecture exists. And maybe someone else has come up with tailorings which we could re-use.

Rob_209 · August 9, 2016, 10:17pm

Hi & thanks for the replies. Here are a few examples of the artists I am on about

http://musicbrainz.org/artist/5cde54ba-59d2-4c4f-adb8-1a093c2ba0af http://vgmdb.net/artist/121

http://musicbrainz.org/artist/0fcb6831-ff9a-47ff-a509-45f267aa8c21 http://vgmdb.net/artist/161

Cheers Rob

jesus2099 · August 10, 2016, 8:47am

Very good stuff !
How did you test it ?

EDIT: Oh, OK I see your test link now !

Notice that MB in Japanese should not sort the same way as just transliterated.

They sort this way (in music stores for instance), you would expect those shelves:

あ where artists are ordered as あいうえお
か where artists are ordered as かきくけこ
さ where artists are ordered as さしすせそ
た where artists are ordered as たちつてと
な where artists are ordered as なにぬねの
は where artists are ordered as はひふへほ
ま where artists are ordered as まみむめも
や where artists are ordered as やゆよ
ら where artists are ordered as らりるれろ
わ where artists are ordered as わ maybe some を but I doubt
ん but as を above, I don’t remember such a shelf BTW

They won’t trim “The” from band names and they won’t change the order of the given and family names for the sake of ordering (kind of no sort name):

あ浅井健一 will be stored in Kenichi Asai あ shelf (あさいけんいち)
ア AJICO will be stored in AJICO あ shelf (アジコ)
お陰陽座 will be stored in Onmyōza あ shelf (おんみょうざ)
ククリスタルキング will be stored in Crystal King か shelf
ザ The Beatles will be stored in The さ shelf (ザ・ビートルズ), not in Beatles は shelf
テ鄧麗君 will be stored in Teresa Teng た shelf (テレサ・テン is her Japanese artist name)
デ David Bowie will be stored in David た shelf (デビッド・ボウイ), not in Bowie は shelf

The order above is realistic. Notice that, for instance, David Bowie would come after tata because we compare David’s で (not Bowie’s ぼ) and tata’s た, both in the same た shelf.
And then it’s neither Bowie nor David then tata, it’s TAta then DÉÏVID: ただちつづてでとど

Foreign artists usually have given name first and Japanese artists usually have family name first. And they don’t change it for the sort order.
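
The shelf logic above can be sketched in a few lines of Python. This assumes each artist already has a kana reading available (the readings below are the ones given in the list above); `SHELVES`, `to_hiragana` and `shelf` are illustrative names. Katakana is folded to hiragana, and the voicing mark is stripped only to find the shelf, so で lands on the た shelf. Small kana and other edge cases are not handled.

```python
import unicodedata

# One row per shelf, named after its first kana.
SHELVES = ["あいうえお", "かきくけこ", "さしすせそ", "たちつてと", "なにぬねの",
           "はひふへほ", "まみむめも", "やゆよ", "らりるれろ", "わを", "ん"]


def to_hiragana(reading):
    """Map katakana letters (U+30A1..U+30F6) to hiragana; leave everything else."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in reading)


def shelf(reading):
    """Return the shelf for a kana reading, based on its first character."""
    first = to_hiragana(reading)[0]
    base = unicodedata.normalize("NFD", first)[0]  # drops dakuten/handakuten
    for row in SHELVES:
        if base in row:
            return row[0]
    raise ValueError(f"no shelf for reading: {reading!r}")


readings = ["デビッド・ボウイ", "ザ・ビートルズ", "あさいけんいち", "テレサ・テン",
            "アジコ", "おんみょうざ", "クリスタルキング"]

# No sort name: the reading is compared as written, given or family name first.
for r in sorted(readings, key=to_hiragana):
    print(shelf(r), r)
```

Sorting on the hiragana form in code point order happens to give た before だ before ち, which matches the ただちつづてでとど order mentioned above, but it is only an approximation of real Japanese collation.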

sparkinson · August 10, 2016, 10:52am

I tried a few strings on the page you linked, and was underwhelmed. First, transliterations depend on the target language, not the target script – I could not set a destination language, anywhere.

In my tests the result was recognisable, but never identical to the popular transliteration in any language I know.

My opinion is: a 20% solution is easy, an 80% solution very, very hard.

dukeja · August 10, 2016, 11:30am

The ICU solution is not the only one out there. The best solutions, in my opinion, aren’t free, but there do seem to be some free ones. In all cases, even the very best are nowhere close to a 100% solution. Regardless of which automatic transliteration/translation approach is used, it should be treated as the solution of last resort, and we should rely on transliterations/translations provided by people. Therefore I think the solution will be something like the following:

1. Provide the means to have language/culture specific translations of the base name.
2. Provide the means to have language/culture specific versions of the collation name supplemented with language/culture collation rules.
3. In the absence of a manually provided translation/transliteration, employ the automatic facility.

I firmly believe that a 100% automated solution is not possible. I think we’d be lucky to get to a 50% solution. When I have examined the professional solutions, they rely on a set of rules backed up by a large and evolving database of exceptions.
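
As a sketch of that “people first, automation as last resort” order, in Python: `localized_name` and `toy_transliterator` are hypothetical names, and the transliterator is a stub with one canned answer, standing in for whatever library or service might be used.

```python
def localized_name(original, aliases, locale, auto_transliterate=None):
    """Pick a name for `locale`, preferring human-provided aliases.

    aliases: dict mapping locale -> name entered by an editor.
    auto_transliterate: optional callable (text, locale) -> str or None.
    Returns (name, source) so callers can tell how reliable the name is.
    """
    if locale in aliases:
        return aliases[locale], "manual"
    if auto_transliterate is not None:
        guess = auto_transliterate(original, locale)
        if guess:
            return guess, "automatic"
    return original, "original"


def toy_transliterator(text, locale):
    # Stub for illustration only: one canned answer, not a real rule set.
    canned = {("Чайковский", "de"): "Tschaikowski"}
    return canned.get((text, locale))


aliases = {"en": "Tchaikovsky"}
print(localized_name("Чайковский", aliases, "en", toy_transliterator))
print(localized_name("Чайковский", aliases, "de", toy_transliterator))
print(localized_name("Чайковский", aliases, "ja", toy_transliterator))
```

Returning the source alongside the name would let the site flag automatic results as candidates for a human-entered alias.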

Jim_DeLaHunt · August 10, 2016, 6:04pm

I agree. If you want a good sort in Japanese, then start with the name in Japanese (foreign names transliterated into Japanese), then sort with Japanese rules. Don’t start by transliterating Japanese names into Latin script, then sorting the Latin script with English language rules. Some Japanese names just can’t be sorted right without a “reading” or “sortname” field in Japanese kana. Some foreign names (e.g. “The Beatles”; many, in fact) will need a Japanese-language alias to specify the expected Japanese for the name (e.g. ザ・ビートルズ).

Bear in mind that the ICU Transform Demonstration demonstrates transliteration, not sorting. A different part of ICU provides sorting.

Jim_DeLaHunt · August 10, 2016, 6:17pm

Oh, no-one is promising a 100% automated solution. But my overall point is that what we have in MusicBrainz now is something like a 1% automated, 5% manual solution. The original poster was an English-language user of MusicBrainz wondering why MB would provide names of Japanese artists only in Japanese language, not in Latin transliteration. My point is that this is a reasonable expectation, and that MusicBrainz should aspire to become multilingual enough to satisfy it.

Of course, current MusicBrainz isn’t there yet. It is fantastic in many ways, but hasn’t yet become fantastic in multi-lingual service.

Yes, I agree. I expect that MusicBrainz will eventually be fully multi-lingual, and it will probably behave as you have described.

But I argue that current MusicBrainz does a tiny fraction of #1, and approximately none of #2 and #3.

There exist components, like ICU, that can provide big parts of the solution. Multilingual MusicBrainz is a reasonable aspiration.