Suggestion: Automatic translation of ASCII to preferred Unicode



I’ve just been voted down on an edit for replacing Unicode quotes with ASCII quotes:

I was pointed to , which says ASCII is allowed, but Unicode is preferred.

I was correcting titles that were substantially incorrect, but which included three Unicode quote characters. I replaced them with sleeve-accurate titles, but included three ASCII quotes. (Technically, I don’t see where it’s forbidden to replace Unicode with ASCII as I was told, but if I were reviewing the edit I would also have had doubts about allowing it after reading the guidelines.)

While I am fine with re-doing the edit, it struck me that this is a detail that could be corrected by the web site, either automatically or with a button similar to the “suggested capitalization” buttons.

Unicode apostrophe standardization

The spirit of the guideline is “we shouldn’t force people to enter Unicode because it’s harder, but if it’s already Unicode then it should always be kept that way, since it’s preferred”.

It’s definitely not as easy as that sounds - different languages do things like quotation marks completely differently, and even for English, things like a hyphen or an apostrophe can be multiple different symbols in Unicode.


I think the ‘no’ vote was a bit harsh, but since a voter/editor would have to re-add/fix information (i.e., it’s replacing more correct information with less correct information, even if only slightly so), I don’t think it’s really out of line.

But annoying in any case.
If it makes you feel better, I almost certainly would make the same mistake; I don’t use Unicode because it’s “yet another thing to do”.

Does anybody have a userscript or something that helps out?


Well, sure, a hyphen can be multiple symbols in Unicode, but if the MusicBrainz guidelines prefer a certain symbol, then there’s only one preferred choice, right? And a button would allow the user to override the suggestions.

But I was referring mostly to the quote symbol. It pops up so often for me since I deal with mostly dance music. I don’t know how common the other symbols are but I am constantly typing 12" and 7".

(Also, this made me laugh on the English Punctuation Guide: “Usage of preferred characters is optional and sometimes unwarranted.” :slight_smile: )


It is annoying, but like I said, I would have probably done the same thing if the roles were reversed based on the guidelines. shrug


For each instance yes, but in general: No. Sometimes an ASCII hyphen-minus should be an em dash, sometimes an en dash, and sometimes a minus sign. Sometimes double-quote should be right, sometimes left, and sometimes double-prime. And so on.



En dash for small separations, em dash for large ones.
Hyphen for hyphenated words.
Minus sign for maths.

Typewriter double quotes

Left, right, top or bottom curly double quotes, depending on position (start, end) and language.
Double prime for inches.

Typewriter apostrophe

Right curly quote for apostrophes but left then right curly single quote for when they are used in English similarly to double quotes.


Multiplication sign for maths and for cross collaborations. Instead of letter x.
Ellipsis instead of three dots.


Then the only possible automatic thing would be unicode (many) to ascii (one), not the opposite. :slight_smile:
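To make the "many to one" direction concrete, here is a rough sketch of what such a fold could look like. The table and the name `fold_to_ascii` are made up for illustration and only cover the characters mentioned in this thread; nothing here is taken from the MusicBrainz code base.

```python
# Hypothetical table: each "preferred" Unicode character mentioned in this
# thread, mapped back to its single typewriter equivalent.
UNICODE_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201C": '"',    # left double quotation mark
    "\u201D": '"',    # right double quotation mark
    "\u201E": '"',    # low double quotation mark
    "\u2033": '"',    # double prime (inches)
    "\u2010": "-",    # hyphen
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2212": "-",    # minus sign
    "\u00D7": "x",    # multiplication sign
    "\u2026": "...",  # horizontal ellipsis
}

_FOLD_TABLE = str.maketrans(UNICODE_TO_ASCII)


def fold_to_ascii(text: str) -> str:
    """Replace every character in the table with its ASCII stand-in."""
    return text.translate(_FOLD_TABLE)
```

Going this way is unambiguous because every key has exactly one target. The reverse table would need several candidates per ASCII character, which is the whole problem.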


I’m pretty sure I’m being punked.


Come on, folks. I know what Unicode is. I know there’s a bunch of fancy characters that are helpful for all kinds of different situations and languages. I’m not saying make some multi-level fancy filter for all Unicode translations. I’m saying consider one specific common instance to make it easier for people trying to help populate common data on the site.

That’s the only character I care about. It’s the most common character that one would use, normally, when dealing with seven inch or twelve inch singles/mixes. I know it would only affect a certain amount of users, namely people entering pop and dance music data, but then so do a lot of the controls on the site. And it was only a suggestion to make the site friendlier to less patient and/or less knowledgeable people.


I think having a toolbox with common Unicode characters in each editing window would be useful. Click the appropriate character and it’s inserted at the cursor position. Might even be possible with a user script.
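For what it's worth, the core of such a toolbox is tiny. Below is a minimal sketch of the "insert at the cursor position" step on a plain string; `insert_at_cursor` is a hypothetical helper, and an actual user script would have to read the cursor position from the input field itself.

```python
from typing import Tuple


def insert_at_cursor(text: str, cursor: int, char: str) -> Tuple[str, int]:
    """Insert `char` at `cursor` and return the new text and cursor position."""
    # Clamp the cursor so an out-of-range position cannot raise or wrap around.
    cursor = max(0, min(cursor, len(text)))
    new_text = text[:cursor] + char + text[cursor:]
    # The cursor ends up just after the inserted character.
    return new_text, cursor + len(char)


# Example: the cursor sits right after "12" and the user clicks the double prime.
title, position = insert_at_cursor("Blue Monday (12 version)", 15, "\u2033")
```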


But we couldn’t really easily automatically decide for all texts:

I LOVE YOU (“I am 12” 12" version) → I LOVE YOU (“I am 12” 12″ version)

(open quotes, close quotes, double prime)

Not that easy, and prone to mistakes (only in rare cases, but it would still be a pity to auto‐add some mistakes). :slight_smile:


Adding a very specific 7″ and 12″ substitution to guess case might be fine. Even if it might insert a few errors like @jesus2099 suggested, that’s less likely than English guess case messing with prepositions vs. adverbs and whatnot and we still have that :slight_smile:
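A narrow substitution like that could look roughly like the sketch below. The function name `guess_inches` and the exact pattern are assumptions for illustration, not how guess case is actually implemented.

```python
import re

DOUBLE_PRIME = "\u2033"

# Only a standalone 7 or 12 directly followed by an ASCII double quote,
# and not followed by a letter or digit. "17\"" is left alone.
_INCH_PATTERN = re.compile(r'\b(7|12)"(?!\w)')


def guess_inches(title: str) -> str:
    """Replace the ASCII quote after a standalone 7 or 12 with a double prime."""
    return _INCH_PATTERN.sub(lambda m: m.group(1) + DOUBLE_PRIME, title)
```

This turns `Blue Monday (12" version)` into `Blue Monday (12″ version)`. It also gets the all-ASCII variant of the example above wrong: in `("I am 12" 12" version)` both quotes after a 12 become double primes, including the one that closes the quotation. That is the kind of rare mistake an overridable button would have to tolerate.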


It’s pie-in-the-sky stuff, but I’m envisioning something like this:

When entering one of the usual „wrong“ ASCII characters, it is marked with a pastel red background, and a speech bubble opens above it giving the common Unicode alternatives. Clicking one of these will replace the offending character with it. The same can be achieved with a suitable keyboard shortcut (say, Alt-2 to select the second Unicode alternative).

If a replacement is done, the speech bubble and colouration are removed. If the user instead keeps on entering text, ignoring the suggestion, the bubble is also removed, but the red background under the problem characters is kept. One can later pop up the speech bubble again by mousing over these characters.
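Leaving the display side out, the logic behind the bubble could be sketched as below. The alternatives table, its ordering, and the names `find_suggestions` and `apply_choice` are all assumptions for illustration; the 1-based `choice` mirrors the Alt-2 shortcut idea.

```python
from typing import Dict, List, Tuple

# Hypothetical list of alternatives per "wrong" ASCII character.
# The order decides which one Alt-1, Alt-2, ... would pick.
ALTERNATIVES: Dict[str, List[str]] = {
    '"': ["\u201C", "\u201D", "\u2033"],            # left quote, right quote, double prime
    "'": ["\u2018", "\u2019"],                      # left quote, right quote / apostrophe
    "-": ["\u2010", "\u2013", "\u2014", "\u2212"],  # hyphen, en dash, em dash, minus
}


def find_suggestions(text: str) -> List[Tuple[int, str, List[str]]]:
    """Return (index, character, alternatives) for each character to highlight."""
    return [(i, ch, ALTERNATIVES[ch]) for i, ch in enumerate(text) if ch in ALTERNATIVES]


def apply_choice(text: str, index: int, choice: int) -> str:
    """Replace the character at `index` with its `choice`-th alternative (1-based)."""
    replacement = ALTERNATIVES[text[index]][choice - 1]
    return text[:index] + replacement + text[index + 1:]
```

Characters the user ignores simply stay in the result of `find_suggestions`, which matches the "background is kept" behaviour described above.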



Perfect, can use the same kind of drop-down code that’s used when you type in an artist and you’re prompted to select which one, which also disappears once you’ve selected something else.


While it looks good as a mockup, it’s unfortunately not easily implemented because there is no way to style characters in an <input>. There are some workarounds, but the most promising way (contenteditable) has other disadvantages; e.g., you could no longer do copy and paste on the input field.