Correct hyphen: Unicode HYPHEN or HYPHEN-MINUS


#1

Hi,

I’ve noticed some contributors using U+2010 : HYPHEN and others using U+002D : HYPHEN-MINUS for a normal hyphen ‘-’.

What is the preferred code point?

Btw, OS X uses U+002D when using the hyphen/underscore key.

Scott


#2

Typographically correct punctuation is preferred (see http://wiki.musicbrainz.org/Style/Miscellaneous).

Since the true hyphen (U+2010) is difficult to enter with common keyboard configurations, the use of hyphen-minus is widespread. Both are accepted on MusicBrainz, however.

See also https://en.wikipedia.org/wiki/Hyphen#Hyphen-minus


#3

ASCII originally used some code points for multiple purposes. The best-known cases are " and ' (“typewriter quotation mark” and “typewriter apostrophe”), but - was also used for the regular hyphen, for various dashes, and for the minus sign.

Unicode “unsplit” those by making separate “”″ ‘’′ code points and a number of dashes, too. It also introduced U+2010 as an unambiguous way to designate a hyphen (whereas a - could be a legacy minus sign or dash). However, unlike the quotation/prime marks, and unlike the dashes, the hyphen and the hyphen-minus look identical, so interest in actually using U+2010 has remained rather low. Personally, I don’t think it makes much sense, either.

We should probably consider them as “quasi” canonically equivalent and convert all input to one of them consistently.

I had a look at the database: Currently, 558545 recording titles contain a hyphen-minus, and 4543 contain a U+2010 hyphen.
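
As a rough illustration of the “convert all input to one of them” idea, here is a minimal Python sketch. The function name `fold_hyphens` and the choice of hyphen-minus as the default target are assumptions made for the example, not anything MusicBrainz actually does. It also shows that Unicode's own NFKC normalization leaves U+2010 alone, so any such folding would have to be an explicit mapping.

```python
import unicodedata

HYPHEN = "\u2010"        # U+2010 HYPHEN
HYPHEN_MINUS = "\u002D"  # U+002D HYPHEN-MINUS

def fold_hyphens(text, target=HYPHEN_MINUS):
    """Map both hyphen code points to a single target (hypothetical helper).

    Only U+2010 and U+002D are touched; dashes and the minus sign are left as-is.
    """
    return text.translate({ord(HYPHEN): target, ord(HYPHEN_MINUS): target})

# Unicode normalization does not unify the two, so the mapping must be explicit.
nfkc_keeps_hyphen = unicodedata.normalize("NFKC", HYPHEN) == HYPHEN

print(fold_hyphens("Cross\u2010Eyed Mary"))
print(nfkc_keeps_hyphen)
```

Passing `target=HYPHEN` folds in the other direction, although that silently turns every legacy minus sign or dash typed as `-` into a hyphen.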


#4

Out of curiosity, how many have a U+2212 minus sign?


#5

The minus sign is used when the meaning is… minus sign (arithmetic operations). Titles containing arithmetic do exist, but they should be rare. :slight_smile:


#6

There are 301 such recordings. Furthermore, 9155 have an en-dash, 4729 an em-dash, and 40 a figure dash. 74 have a non-breaking hyphen.
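
For anyone who wants to run a similar tally on their own list of titles, here is a small Python sketch. It is not the database query used for the numbers above; the function name `count_titles_by_dash` and the sample titles are made up for illustration. It counts how many titles contain each dash-like code point at least once.

```python
import unicodedata
from collections import Counter

# Hyphen-minus, hyphen, non-breaking hyphen, figure dash, en dash, em dash, minus sign.
DASH_LIKE = "\u002D\u2010\u2011\u2012\u2013\u2014\u2212"

def count_titles_by_dash(titles):
    """Return a Counter mapping Unicode character name -> number of titles containing it."""
    counts = Counter()
    for title in titles:
        # set() ensures each title is counted once per character, however often it occurs.
        for ch in set(title) & set(DASH_LIKE):
            counts[unicodedata.name(ch)] += 1
    return counts

sample_titles = [  # hypothetical data
    "Cross-Eyed Mary",
    "Cross\u2010Eyed Mary",
    "The Winter of 1539\u20121540",
]
for name, n in sorted(count_titles_by_dash(sample_titles).items()):
    print(f"{n:>3}  {name}")
```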


#7

Ahem, I might be responsible for those. Could you point me to them so I can replace them with the more regular plain hyphen, please? :slight_smile:

I remember having set a minus sign on an arithmetic-like title.
I also use dashes when they are used like parentheses (EM for full size and EN for normal size).
It’s all in my keyboard macros (AutoHotKey). :sunglasses:


#8

Also, if anyone is interested, @jacobbrett and @Yurim have made some guides/references under their user pages on our wiki:
https://wiki.musicbrainz.org/User:Jacobbrett/English_Punctuation_Guide
https://wiki.musicbrainz.org/User:Yurim/Punctuation_and_Special_Characters


#9

According to Yurim’s page, a hyphen appearing in titles such as “The Winter of 1539‒1540” should be:

– (U+2013 ‘EN DASH’), as they are (date) ranges.

Scott

ps. I thought I’d be the only one interested in this issue. Glad to see I’m not alone! :slight_smile:


#10

If you are interested in editing with ellipses, curly quotes, apostrophes, hyphens and dashes, you may want to remap some characters on your keyboard so that you don’t have to make multiple keystrokes each time.
For MB I have completely switched my keyboard to typographic characters, so I now use them everywhere.
Each time I lend my keyboard or share my PC in a network conference, I have to remember to suspend my clumsy AHK setup. :blush:


#11

I’m a Mac user so I’ll need to find an equivalent to AHK (if I can’t do it with System Prefs). It’d be nice to remap the numpad minus to MINUS and the hyphen/underscore key to HYPHEN.

Thanks,
Scott


#12

Yes. It’s currently a figure dash (not a minus sign as I had written originally).


#13

I made a report for them: http://reports.mbsandbox.org/report/334 It’s one classical series and four independent cases.


#14

Whoops, I mixed up the channels and answered on IRC!
Let me say it again here:

Thanks very much, @chirlu!
According to my modified version (335+336+snap links), it seems I already reverted all my non-breaking hyphens in the past… I have only found one release of mine with remaining narrow no-break spaces (^_^;)

In the report syntax, what does the « | e » part of « {{ row.2 | e }} » mean? And the « E » part of « WHERE track.name ~ E'\u2011' »?


#15

I haven’t used it (so can’t vouch for it), but apparently there’s a way to get a Linux-like compose key working on OS X.

edit: I can vouch for the compose key in general, though :+1:


#16

A foolish consistency is the hobgoblin of little minds… so I have a little mind, I guess. We ought to pick one or the other, and run automated mass changes periodically. It’s bizarre when I get an e-mail update, and look to see an edit from “Cross-Eyed Mary” to “Cross‐Eyed Mary.” It took me a good while to figure out what had even changed.


#17

If we go with HYPHEN (which is my vote), running an automated HYPHEN-MINUS → HYPHEN conversion would be absolutely horrible. HYPHEN-MINUS can technically be either a HYPHEN or a MINUS and realistically can also be used to represent a number of dashes. Automatically replacing a vague character with a potentially outright wrong one would be really bad practice. If we end with HYPHEN it would also be easy to make a report for text with HYPHEN-MINUSes and it would be simple to go through and correct them (to hyphens, minuses, dashes, or whatever is needed). Making a report for HYPHENs would end up with a lot of already correct hits and would thus be largely a waste of time going through.
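
To make the “report, then fix by hand” workflow concrete, here is a minimal Python sketch. The name `hyphen_minus_report` and the `context` parameter are hypothetical, and this is not how the MusicBrainz report pages are implemented. It only lists every HYPHEN-MINUS with a bit of surrounding text, leaving the decision (hyphen, minus, or some dash) to a human.

```python
def hyphen_minus_report(titles, context=5):
    """List each HYPHEN-MINUS occurrence as (title, index, surrounding snippet).

    Nothing is replaced automatically; the rows are meant for manual review.
    """
    rows = []
    for title in titles:
        for i, ch in enumerate(title):
            if ch == "\u002D":
                start = max(0, i - context)
                rows.append((title, i, title[start:i + context + 1]))
    return rows

# Hypothetical sample: a compound word, an arithmetic title, and an already-fixed title.
for row in hyphen_minus_report(["Cross-Eyed Mary", "2 - 1", "Cross\u2010Eyed Mary"]):
    print(row)
```

Titles that already use U+2010 never show up in such a report, which is the point made above about avoiding already-correct hits.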

That means we should have a better diff interface, not that we should have less accurate data. E.g., for https://musicbrainz.org/edit/43309876 (the edit referenced), it could highlight just the character that actually changed instead of the entire word. A lot of people (myself included) used to have similar issues with «""» → «“”» conversions and couldn’t tell what had changed until it was pointed out. There might even already be a ticket for this; if not, I’d be happy to enter it.
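
A character-level highlight of that kind could look roughly like the following Python sketch, which uses the standard library's `difflib`. It is not the MusicBrainz diff code; the helper names `changed_chars` and `describe` are made up for the example.

```python
import difflib
import unicodedata

def changed_chars(old, new):
    """Return (old_fragment, new_fragment) pairs for every span that differs."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((old[i1:i2], new[j1:j2]))
    return changes

def describe(text):
    """Spell out each character as 'U+XXXX NAME' so lookalikes become visible."""
    return ", ".join(f"U+{ord(c):04X} {unicodedata.name(c, 'UNNAMED')}" for c in text)

for before, after in changed_chars("Cross-Eyed Mary", "Cross\u2010Eyed Mary"):
    print(f"{describe(before)} -> {describe(after)}")
```

Naming the code points sidesteps the problem that the two glyphs look identical in most fonts.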


#18

We intentionally use word-based diffs for things like titles (but character-based diffs for, e.g., URLs) because in most cases, those are easier to interpret.


#19

From IRC:

chirlu> The default font for the MB website is Bitstream Vera Sans, which doesn’t have a U+2010.

Why not change the default font of MB, then?


#20

Uh, there are reasons to choose a new default font, but “contains U+2010” would not be my main criterion when looking for one.