Unicode roman numerals: Is there an official stance?

marlonob · December 8, 2016, 2:01pm

I was wondering if this topic has been discussed (I couldn’t find anything).

I’m aware that the Unicode Consortium discourages the use of this table, stating that “For most purposes, it is preferable to compose the Roman numerals from sequences of the appropriate Latin letters”. I fail to see the reasoning behind that recommendation, and none is provided by them. But it is a fact that has to be considered.

I do see various advantages in using them, and no disadvantages, since they can be handled the same as other Unicode characters (using search equivalents, and converted with the “Convert Unicode punctuation characters to ASCII” option in Picard). Here are some of the advantages:

They are logically considered numerical values, and as such, machines can unequivocally understand their meaning and value.
They are displayed more consistently and with better typographic handling (kerning and height: mostly taller than small caps and shorter than capital letters) by fonts that support them, of which there are plenty.
They can be correctly ordered alphabetically (e.g.: Ⅴ appears before Ⅸ, whereas IX appears before V).
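
For illustration, the first and third points can be checked with a short Python sketch. It assumes only the standard-library `unicodedata` module and plain code point ordering (not a locale-aware collation); the variable names are made up for the example.

```python
import unicodedata

nine = "\u2168"  # Ⅸ ROMAN NUMERAL NINE
five = "\u2164"  # Ⅴ ROMAN NUMERAL FIVE

# The dedicated characters carry a numeric value and the "letter number"
# general category, unlike the Latin capitals I, V and X.
numeric_value = unicodedata.numeric(nine)
numeral_category = unicodedata.category(nine)
letter_category = unicodedata.category("I")

# Plain code point sorting: dedicated numerals versus letter sequences.
dedicated_order = sorted([nine, five])
letter_order = sorted(["IX", "V"])

print(numeric_value, numeral_category, letter_category)
print(dedicated_order, letter_order)
```

Whether a given application sorts this way depends on the collation it uses.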

I don’t know whether the server accepts this substitution automatically (as it does for capitalization or " → “, etc.), but an official stance in favor would mean it should, and one against would mean adding a clarification note to the style guide, I suppose.

chirlu · December 8, 2016, 2:41pm

The reason why the precombined Roman numerals exist is that a legacy encoding had dedicated code points for the numbers on a clockface, and one of the goals of Unicode is to provide round-trip compatibility to those legacy encodings. This is also why there is no 13 (XIII) or other higher numbers – they aren’t needed on a clock.

I don’t see a good reason to use duplicate code points, especially when one of them is deprecated; people won’t generally be able to see (and therefore correctly edit) the difference, and sorting is properly handled by ordering relationships. Whether the issue is so important that it needs mentioning in the style guidelines is another question, though.

chirlu · December 8, 2016, 3:03pm

This topic was briefly mentioned here before:

jesus2099 · December 8, 2016, 3:45pm

It was discussed in the old style mailing list (perhaps) and in the old forum (if I remember correctly).
I like using them but the majority said no.

marlonob · December 8, 2016, 10:04pm

Yes, I mentioned that I’m aware of the stance of the Unicode Consortium, and if the verdict of the community is to follow that recommendation, then so be it. What I want to know is precisely that verdict.

And those characters are not deprecated: they exist permanently in the specification and have at least one current use recommended by the Unicode Consortium (keeping the glyph in one row in vertical text), not to mention that they are supported by the majority of Unicode fonts.

Regarding the difficulty in the editing process: the same can be said of various similar cases, such as the hyphen (-,‐), the non-breaking space (for French punctuation), or even the en dash. Yet it is recommended to make these changes if one is able to. Furthermore, in various cases the difference is easily discernible by selecting the characters (it will be obvious whether they are two I or one Ⅱ).

If I had to rephrase my question, it would be: if I made such changes, would they be seen as an improvement to the database? Would it depend on the personal preference of those who see and vote on the edit, or will we have a note in the style guide to quickly resolve the matter?

My personal opinion is that they would be an improvement, for the reasons that I have mentioned; and I have not seen a reason not to consider them as such (other than that they are deprecated, which they are not).

Of course, the stance of the Unicode Consortium is a big point to consider. It’s a shame that they do not provide a reasoning for it, so that it could be discussed beyond the appeal to authority.

But other than that, I don’t see why we wouldn’t encourage the use of a semantically and stylistically better option (of course, it would be important that Picard can convert those characters to ASCII when that option is enabled).

jesus2099 · December 8, 2016, 10:24pm

I think it is an improvement too.
In Japanese text, where I used it, it looks better and it would even follow vertical text more nicely.
When I see them on a release, they are visually different from the suboptimal succession of letters we use instead (they take one block and they have strokes top and bottom when the fonts are well done).

marlonob · December 9, 2016, 2:42am

Just a little note for clarification: The clock‐face equivalence may be the reason for the existence of Ⅺ and Ⅻ, but not the others, since Ⅼ, Ⅽ, Ⅾ, Ⅿ, and even alternate glyphs (e.g. ↁ or ↅ) do exist.

CallerNo6 · December 9, 2016, 5:59am

Can they? I’m playing with ICU collate, trying to figure out how to do it. Maybe there’s a switch I need to flip somewhere. By default they’re grouped with their Latin alphabet equivalents.

I agree that the pre-composed roman numerals look great. The incompleteness of the set is a bummer.

marlonob · December 9, 2016, 2:02pm

They do in Windows file explorer. They even are ordered before any letter:

I’m not a programmer, so I don’t understand what you’re trying to do. But using the sort() function in JavaScript and PHP gives the expected results (though it seems that both functions put them, as a block, after all the letters of the alphabet).

jesus2099 · December 9, 2016, 2:36pm

They do appear in order in the Unicode tables, so they are sorted OK in all the software I’ve seen so far.

reosarevok · December 9, 2016, 2:59pm

What would be the suggestion when you have 13 movements and you want to use unicode Roman numerals? You could just combine the separate Roman characters, but it seems just easier to avoid the characters, especially since Unicode themselves say they’re there only “for compatibility with East Asian standards”.

CallerNo6 · December 9, 2016, 4:37pm

If the software sorts simply by code point, then it won’t correctly sort the pre-composed numerals, because e.g. 13 (2169 2160 2160 2160) would sort before pre-composed 12 (216B).
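
A minimal Python sketch of that point, followed by one possible workaround. `roman_value` is a hypothetical helper, not part of any library; it assumes NFKC compatibility normalization to reduce the dedicated numerals to Latin letters, and it only handles well-formed numerals.

```python
import unicodedata

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_value(text: str) -> int:
    """Hypothetical helper: numeric value of a well-formed Roman numeral."""
    # NFKC maps the dedicated numerals to their Latin letter sequences.
    letters = unicodedata.normalize("NFKC", text).upper()
    values = [ROMAN[ch] for ch in letters]
    total = 0
    for i, value in enumerate(values):
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total

thirteen = "\u2169\u2160\u2160\u2160"  # ⅩⅠⅠⅠ (2169 2160 2160 2160)
twelve = "\u216B"                      # Ⅻ (216B)

# Code point order puts 13 first; sorting by parsed value does not.
by_code_point = sorted([twelve, thirteen])
by_value = sorted([twelve, thirteen], key=roman_value)
```

Sorting with an explicit key sidesteps the question of which code points were used, at the cost of having to know which part of a title is a numeral.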

marlonob · December 9, 2016, 7:00pm

I think that, for logical and stylistic consistency, it would be better to ignore Ⅺ and Ⅻ, and go with ⅩⅠ and ⅩⅡ.

jesus2099 · December 9, 2016, 8:21pm

Oh, not in my case, they do look good and I rarely needed more than twelve, if ever, so I consider that an exception and I use them up to twelve.

chirlu · December 10, 2016, 6:46pm

Nobody did claim that they were deprecated, nor that they were going to be removed (which is actually impossible for any character, because Unicode has very strict stability policies). Their existence does, however, clash with two main principles of Unicode: that it is a plain text format that doesn’t care about rich text features such as font selection, character positioning such as kerning, etc. (leaving that to a different level of a storage format), and that it doesn’t distinguish between characters that look the same based on semantic differences (e.g., no two different h characters for the silent h in hour and the pronounced h in hero, because they are written the same). Still, the Roman numerals exist as compatibility characters, because (as I said before) they are required for interoperability with certain legacy standards. This doesn’t mean that they should be used.

You can read more about this in chapter 2 of the standard, in particular sections 2.2 (Unicode design principles) and 2.3 (compatibility characters).

Note that these reasons apply to all those Roman numerals that correspond to letters (so, e.g., not ↅ, which exists for the benefit of historians and philologists). Precomposed characters for VI etc. have additional reasons for not using them.
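
As a rough illustration of the compatibility relationship, the Unicode Character Database can be queried from Python. This is only a sketch using the standard `unicodedata` module; `compat_letters` is a hypothetical helper name.

```python
import unicodedata

def compat_letters(char: str) -> str:
    """Hypothetical helper: letters of a compatibility decomposition, or ''."""
    mapping = unicodedata.decomposition(char)
    if not mapping.startswith("<compat>"):
        return ""
    # The mapping looks like "<compat> 0058 0049 0049": a tag, then hex code points.
    code_points = mapping.split()[1:]
    return "".join(chr(int(cp, 16)) for cp in code_points)

one = compat_letters("\u2160")       # Ⅰ ROMAN NUMERAL ONE
twelve = compat_letters("\u216B")    # Ⅻ ROMAN NUMERAL TWELVE
late_six = compat_letters("\u2185")  # ↅ ROMAN NUMERAL SIX LATE FORM

print(repr(one), repr(twelve), repr(late_six))
```

The numerals that correspond to letters report a compatibility mapping to those letters, while ↅ reports none.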

We may have different definitions of “obvious” here. I don’t tend to always try and select individual characters to see whether they are precomposed or not. On the other hand, the difference between - and – or between “ and " can be seen without any special action, so this is indeed obvious.

chirlu · December 10, 2016, 7:10pm

@CallerNo6 reminds me that I did say “deprecated” above. That was a loose usage of the word, as in “not recommended for use”. The Roman numerals are, of course, not deprecated in the strict technical sense.

Lotheric · December 10, 2016, 7:31pm

http://i3.kym-cdn.com/photos/images/original/000/909/991/48c.jpg

Couldn’t help myself

marlonob · December 11, 2016, 6:06am

To quote the very chapter you referred to:

[quote]Characters are the abstract representations of the smallest components of written language that have semantic value. The Unicode Standard deals only with character codes. (p. 15)
[/quote]
So, semantics are the very thing that defines a character according to Unicode. Now, contextual meaning or other attributes do not pertain to the definition of the character. I don’t think that your example with the h relates to semantics.

[quote]Characters have well-defined semantics. These semantics are defined by explicitly assigned character properties…The Unicode Character Database provides machine-readable character property tables for use in implementations of parsing, sorting, and other algorithms… The Unicode Standard identifies more than 100 different character properties, including numeric, casing, combination, and directionality properties (p. 18)
[/quote]

I honestly think that making a distinction between a letter and a number represented by the same glyph (using their nomenclature) but with very clear and incompatible definitions (another whole conceptual order, in this case) is essential to their mission. More so when they have things like α, ⍺, 𝛂, 𝛼, 𝜶, 𝝰, 𝞪, the last of them literally defined as mathematical sans-serif bold italic small alpha; or ² and a whole bunch of super- and subscript characters with no real distinction from their regular counterparts.

Of course, the kerning, height, and other visually appealing advantages depend on the font or the renderer used. They are an advantage derived from having a character with an unambiguous value/meaning. But this unambiguity is the true reason why using these characters is an improvement over using letters.

And here’s one more: none of those edits would be destructive, because if this is definitively settled against using these Unicode numerals, it would take just a few substitution rules to change them back (whereas the inverse conversion would be impossible, if automated).
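
Such substitution rules could look roughly like the following Python sketch. It assumes the standard `unicodedata` module and restricts itself to the U+2160 to U+217F range; `numerals_to_letters` is a hypothetical name, and this is not how Picard or the server actually does it.

```python
import unicodedata

# Translation table limited to the U+2160..U+217F numerals (Ⅰ..Ⅿ and ⅰ..ⅿ),
# so that other compatibility characters in a title are left untouched.
NUMERAL_TO_LETTERS = {
    cp: unicodedata.normalize("NFKC", chr(cp)) for cp in range(0x2160, 0x2180)
}

def numerals_to_letters(title: str) -> str:
    """Hypothetical helper: replace dedicated Roman numerals with Latin letters."""
    return title.translate(NUMERAL_TO_LETTERS)

converted = numerals_to_letters("Act \u2162, Scene \u2173")  # "Act Ⅲ, Scene ⅳ"
print(converted)
```

The reverse direction has no such table, because a plain I, V or X in a title may just as well be an ordinary letter.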

The Consortium have their rules and their reasons, but the fact is that these characters exist, are widely supported, offer clear advantages we could use in the database, and are easily convertible, if it comes to that.

reosarevok · December 11, 2016, 12:07pm

Since it seems we do need to specifically write this down, yes, there is now an official stance:

https://musicbrainz.org/doc/Style/Miscellaneous

marlonob · December 11, 2016, 3:31pm

Well, that was all that was needed… I suppose.