Character variants - Sign “☮” the Times

unicode

#1

Opening this thread to get some additional opinions:

Right now this title in the database uses U+262E PEACE SYMBOL, which is a character with both emoji and text-mode variants. You can optionally specify the variant by adding U+FE0E VARIATION SELECTOR-15 or U+FE0F VARIATION SELECTOR-16 for text style or emoji style, respectively.
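As a rough illustration of the three forms described above, here is a minimal Python sketch. The names `PEACE`, `VS15`, `VS16` and `describe` are made up for this example and are not taken from MusicBrainz or Picard.

```python
import unicodedata

PEACE = "\u262E"  # PEACE SYMBOL, no presentation specified
VS15 = "\uFE0E"   # VARIATION SELECTOR-15, requests text style
VS16 = "\uFE0F"   # VARIATION SELECTOR-16, requests emoji style

def describe(s):
    """List each code point in s as 'U+XXXX NAME' (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]

text_style = PEACE + VS15
emoji_style = PEACE + VS16

for label, s in [("unspecified", PEACE), ("text", text_style), ("emoji", emoji_style)]:
    print(label, describe(s))
```

How each of these actually renders still depends on the font and the environment; the sketch only shows what is stored.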

I’ve put in a series of edits to change “Sign “☮” the Times” from U+262E to “U+262E U+FE0E” to specify text style. My thinking here is that emoji didn’t exist in 1987 when Prince wrote the song, so it’s inappropriate for the emoji version to appear in the title, and we should specify. I don’t have a breakdown of which systems default to the emoji version if left unspecified, but it does so on my Mac.

In discussion on edit #55189838, the counter-argument is that modifiers are better left off if strictly unnecessary.

I’m happy to go with whatever consensus is here, but I feel like more community input would be valuable.

Total count of related edits: 1 work, three release groups, four recordings, 14 releases, and 21 mediums. Full list of edits.


#2

I wanted to work through what Unicode says about this. The document you linked says that there actually are some defaults. However, it also says that they are kinda like soft defaults and that they can be changed based on the type of environment.

I found it pretty hard to work out whether a character defaults to emoji or text, so here’s how to find out:

Using emoji-data.txt, we can list the properties of the peace symbol:

  • Emoji = Yes
  • Emoji_Presentation = No
  • Emoji_Modifier = No
  • Emoji_Modifier_Base = No
  • Emoji_Component = No
  • Extended_Pictographic = Yes

The peace symbol belongs to the second category, text-default.
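The lookup above can be sketched in Python. This is only an illustration: `default_presentation` is a hypothetical helper, the property values are typed in by hand from the list above rather than parsed from emoji-data.txt, and the "text-only" label for non-emoji characters is my assumption about how the remaining category is named.

```python
def default_presentation(emoji, emoji_presentation):
    """Map the two relevant properties to a default presentation category."""
    if not emoji:
        return "text-only"       # assumed label for characters that are not emoji
    if emoji_presentation:
        return "emoji-default"   # Emoji = Yes, Emoji_Presentation = Yes
    return "text-default"        # Emoji = Yes, Emoji_Presentation = No

# Property values for U+262E, copied by hand from the list above.
peace_symbol = {"Emoji": True, "Emoji_Presentation": False}

print(default_presentation(peace_symbol["Emoji"], peace_symbol["Emoji_Presentation"]))
```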

Alright, the document continues with this very important addition:

The presentation of a given emoji character depends on the environment, whether or not there is an emoji or text presentation selector, and the default presentation style (emoji vs text). In informal environments like texting and chats, it is more appropriate for most emoji characters to appear with a colorful emoji presentation, and only get a text presentation with a text presentation selector. Conversely, in formal environments such as word processing, it is generally better for emoji characters to appear with a text presentation, and only get the colorful emoji presentation with the emoji presentation selector.

Since we know that the character in the title should be displayed as text, we should use the text presentation selector. Otherwise the title will be displayed improperly if it is imported or copied into environments which override the default presentation style. The document even encourages overriding the default when that suits the environment and the character has no selector next to it.

I think of the emoji and the text forms as different characters; they feel so different. If an artist uses one form, then MB should use the same form. If the artist uses both forms in different places, then it can be harder to tell what the artist means. However, for the title we are discussing, it’s very clear.


#3

My question is - can Picard handle this okay?

Picard is a simple test for how this Unicode data translates to PCs, Macs and other environments. I’ll then take my tagged files on to other older devices like a Blackberry Bold phone, old MP3 players, car hifis. These devices don’t do all these modern emojis.

Which ever answer gives me the more readable, usable data is the one I’d vote for. I believe data needs to be usable on as many platforms as possible.

Back in the 1990s I used to do Unicode work so I could work in Japanese text. Coming into MusicBrainz I have now seen a new world of Unicode that is totally beyond me. (Especially bizarre to me being the hyphen swaps)

HAHA - I just noticed. Discourse can’t show the symbol in the title of this post. It is showing “:peace_symbol :” as text instead… If that is what would happen with Picard and tagging, that seems too weird to me. Is that what you are talking about here?


#4

Hmm… the peace symbol is showing correctly for me. However, the hyphen that we’re supposed to use (U+2010 HYPHEN) doesn’t display correctly for me on Discourse.


#5

I tested $set(title,$replace(%title%,☮,☮️)) and checked the results by copypasting them to https://unicodelookup.com. Seems to work fine. Picard itself doesn’t respect the selector characters and it only shows the default U+262e. The selectors are correctly being added to tags and filenames though.
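For anyone who wants to try the same replacement outside Picard, here is a rough Python equivalent. It is a sketch under my own assumptions, not what Picard does internally: `force_text_style` is a made-up name, and unlike a plain replace it skips peace symbols that already carry a selector, so running it twice doesn't stack selectors.

```python
import re

VS15 = "\uFE0E"  # VARIATION SELECTOR-15 (text style)

# U+262E that is NOT already followed by VS15 or VS16.
_BARE_PEACE = re.compile("\u262E(?![\uFE0E\uFE0F])")

def force_text_style(title):
    """Append VS15 to every bare U+262E in title (hypothetical helper)."""
    return _BARE_PEACE.sub("\u262E" + VS15, title)

title = "Sign \u262E the Times"
fixed = force_text_style(title)
print([f"U+{ord(ch):04X}" for ch in fixed if ord(ch) > 0x7F])
```

Titles that already use the emoji selector are left alone, since changing those would be a separate decision.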

One could say that because Picard doesn’t show the characters correctly, it isn’t handling them okay. I guess that on systems which otherwise handle Unicode just fine, the selectors are zero-width characters that simply have no visible effect. In Picard they are only noticeable when you delete one of those character + selector combos with backspace: the first backspace deletes the selector, the second one deletes the character.
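The backspace behaviour makes sense once you look at what is stored. A small Python sketch (the variable names are just for illustration, and dropping the last code point is only a stand-in for what a text field does on backspace):

```python
combo = "\u262E\uFE0E"  # peace symbol followed by the text-style selector

# The combo looks like one character but is stored as two code points.
print(len(combo))                  # number of code points
print(len(combo.encode("utf-8")))  # number of bytes in UTF-8

after_first_backspace = combo[:-1]                   # selector removed, symbol remains
after_second_backspace = after_first_backspace[:-1]  # symbol removed too
```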

Here’s how discourse shows the characters:

  • ☮ = U+262e
  • ☮︎ = U+262e U+fe0e
  • ☮️ = U+262e U+fe0f

Edit: The edit box shows the characters correctly but the actual comment doesn’t. See: peace_symbols
Edit2: Oh, needed to add them as preformatted text. This is irrelevant to MB but interesting nonetheless.
Edit3: Even as preformatted the characters don’t display correctly but there are selectors included in those.
Edit4: About :peace_symbol:. That’s a human-friendly way to add Unicode characters to text but isn’t really what this is about. I guess it’s part of markdown which is also responsible for the other text stylizing in these comments.


#6

The place where I see Discourse getting confused is just the title of this edit box when I hit Reply to the first post.

It is all fine on the right hand side. And good to see that Picard has been thought about and checked.

I just did a highly unscientific test on Windoze file names. Using the above list of Peace Symbols that @phonebox has in his post, I tried to copy and paste them into a filename. What is noticeable is that the first one gives the expected symbol legally in Windoze. But try to copy the second and third icons and now we get weird junk too: one of those squares appears, which I assume stands for the “modifier” character. (But that is probably PEBCAK from this ID-10T user :wink: )

Once this edit is fully into the database I’ll dig out a Sign of the Times album and see what I get when updating the tags. Seems odd to me to strip the modifier from the tag but leave it in the filename.

And another daft question - how come the above example is purple?


#7

Yes, I don’t like having bogus control characters in fields, invisible characters.
And it displays as an icon only on mobiles, where IMO it’s not an issue that some characters appear as symbols; it is a symbol after all.


#8

This same problem arises when we edit titles that use the character in its emoji form. Then the problem is turned upside down and we see the incorrect text form of the character in most desktop environments. Also, it’s not only about mobile vs desktop environments. In arturus’ first screenshot, a desktop environment seems to have defaulted to emoji. I don’t know whether it’s the font that causes this or something else, but in the end the system respects the variation selector and shows the correct character.

I agree. It’s problematic when editors don’t know about invisible characters in the titles they edit. Checking whether there’s an invisible character somewhere in a title takes lots of unnecessary work. That being said, I think it’s MB’s responsibility to show us those invisible characters instead of banning them for being hard to work with. I made an example mockup of how MB could solve this problem: mockup
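To give an idea of how cheap it would be to surface these characters, here is a minimal Python sketch. It is not how MB works today and not what the mockup implements; `reveal_invisible` is a hypothetical helper that marks variation selectors (U+FE00 to U+FE0F) and format characters such as the zero width joiner.

```python
import unicodedata

def reveal_invisible(title):
    """Replace selectors and format characters with a visible [U+XXXX] tag."""
    parts = []
    for ch in title:
        cp = ord(ch)
        is_selector = 0xFE00 <= cp <= 0xFE0F          # VARIATION SELECTOR-1..16
        is_format = unicodedata.category(ch) == "Cf"  # e.g. ZERO WIDTH JOINER
        parts.append(f"[U+{cp:04X}]" if is_selector or is_format else ch)
    return "".join(parts)

print(reveal_invisible("Sign \u262E\uFE0E the Times"))
```

An editor could then see at a glance whether a title carries a selector without pasting it into a lookup site.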

I think that not setting a variation selector in these cases would indicate that we don’t know which form is correct.

Edit: About Windows support of these selectors. If they show up as squares, I don’t think your system uses a character encoding which supports the whole range of Unicode (like UTF-8 does). I looked at my Windows 10 region settings and found a “Language for non-Unicode programs” setting. That makes me believe Win10 supports Unicode by default. However, U+262e U+fe0f still shows up as a text symbol in filenames in my system, meaning that Windows doesn’t respect the selectors.

As a character encoding standard, Unicode has become so important that if a program doesn’t support it, that’s the program’s fault.

The exact appearance of the character is decided by the font used. You could make a font which shows a swastika in place of U+262e if you thought it better represents a “peace symbol”. :smile: Fonts tend to follow other fonts on things like color to not confuse the users.


#9

PMFJI, the rectangle/box typically is from a truetype font. When the cmap table does not provide a glyph for a requested codepoint (such as unicode bmp) there is a defined .notdef glyph that is returned.

Looking at my Win7 system, I have coverage for U+262e in Segoe UI Symbol, DejaVu Sans, and Noto Sans Symbols. DejaVu and in particular Noto Sans, being liberally licensed, are becoming de facto standards, at least in cross-platform apps. U+262e appears to predate the concept of emoji, and in these 3 fonts no variation sequences are provided, so the default presentation is probably the best from a compatibility standpoint.


#10

I think it’s rather we don’t care.
I don’t care, it’s like a font choice to me.
I don’t think we should say « use this or that font or don’t use an emoji. »


#11

A post was merged into an existing topic: Abbreviations in community posts


#12

The swastika used to be a peace symbol until it was hijacked for other uses in the 1930s.


#13

At first, I was confused about what a variation sequence actually is so here’s an explanation: a variation sequence acts as a single character even though in reality it consists of multiple Unicode characters.

The DejaVu font family doesn’t seem to have good support for emoji. DejaVu Sans has the variation selectors needed for variation sequences, but it doesn’t have real emoji, only symbols. The same applies to the Noto Emoji font. (As a side note, I used Character Map in Windows to find out which characters a font includes.) Noto Color Emoji includes an extensive range of emoji, but the font format isn’t supported on Windows. (That’s probably a result of an ongoing battle between color font standards.) The Segoe UI family supports variation sequences in its Segoe UI Emoji font.

It doesn’t matter that U+262e predates emojis. From what I understand, the codepoint U+262e is both a symbol and an emoji in the latest Unicode standard. It only becomes a symbol or an emoji when it gets rendered in text. (This sounds like some quantum physics shit.)

I think it would be good to bring in another emoji, for example :couple_with_heart_woman_woman:. This emoji is also an official Unicode variation sequence and I don’t think anyone would be against having it in MB if it happened to be in some song title. It’s a sequence of:

  • 👩 + [zero width joiner] + ❤ + [variation selector-16] + [zero width joiner] + 👩
  • i.e. U+1f469 U+200d U+2764 U+fe0f U+200d U+1f469
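The same sequence can be built and inspected in Python. This is just a sketch to make the structure visible; the variable names are mine.

```python
WOMAN = "\U0001F469"  # 👩
ZWJ = "\u200D"        # zero width joiner
HEART = "\u2764"      # ❤
VS16 = "\uFE0F"       # variation selector-16 (emoji style)

couple = WOMAN + ZWJ + HEART + VS16 + ZWJ + WOMAN

# One visible emoji (where supported), six code points underneath.
code_points = " ".join(f"U+{ord(ch):04x}" for ch in couple)
print(len(couple), code_points)
```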

In a way, it doesn’t matter whether some fonts support these characters. If they are not supported, the characters just don’t show up as they were meant to. In Discourse, when I format :couple_with_heart_woman_woman: as “preformatted text”, it appears as 👩‍❤️‍👩. This (i.e. “woman”, “heart”, “woman”) is the correct way to represent the sequence if the display of the sequence isn’t supported: https://unicode.org/faq/vs.html#5. If none of those emojis were supported in some font, the correct way to display :couple_with_heart_woman_woman: would be something like □□□.


That’s the case for visible characters. Unicode’s Display of Unsupported Characters FAQ gives more options.

Later it states: “[Variation selectors etc.] should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering.” That is actually the way Picard is handling the characters.
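A rough Python sketch of that fallback, assuming a renderer that supports neither the joined sequence nor the selectors: the invisible parts are dropped and only the visible components remain. `fallback_components` is a hypothetical name, and real text renderers are of course more involved than this.

```python
ZWJ = "\u200D"  # zero width joiner

def fallback_components(seq):
    """Return the visible components of seq, dropping joiners and selectors."""
    return [
        ch for ch in seq
        if ch != ZWJ and not (0xFE00 <= ord(ch) <= 0xFE0F)
    ]

couple = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F469"
print(fallback_components(couple))          # woman, heart, woman
print(fallback_components("\u262E\uFE0E"))  # just the peace symbol
```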

Edit: Preformatted some emojis.
Edit2: Correction on font coverage.

Edit3: I can see how the handling of :couple_with_heart_woman_woman: is not the most relevant thing when talking of ☮︎. The base character (U+262e) can look exactly the same as the variation sequence ☮︎ (U+262e U+fe0e) but the base characters of :couple_with_heart_woman_woman: never look exactly the same as the variation sequence. The point of taking :couple_with_heart_woman_woman: into the discussion is more or less about us having a need to handle and support variation sequences anyway.