Unicode apostrophe standardization

IvanDobsky · December 30, 2017, 4:40pm

Hi @reosarevok - it is good to get an official reply. I am British English and use a British English keyboard. And going by that link to the miscellaneous page I read that as “use normal apostrophes” as they are typographically correct.

Use of basic ASCII punctuation characters such as ' and " is allowed, but typographically-correct punctuation is preferred

It is also easier than having to load up Character Map and try and find the specific silly curly thing as used in Word.

I’m therefore just going to stick with what I know. Especially as this keeps standard searches working okay. It also seems to be more logical with the way the rest of the rules of data entry work.

It certainly made me laugh when the guy told me to “read the style guide” and all I could find was the above example, which was clearly using a standard apostrophe from the keyboard. I’ll ignore this next time and just aim to keep things consistent with how the artist intended it to be seen.

jesus2099 · December 30, 2017, 5:04pm

You can see this standard more easily by opening any printed book around you.

IvanDobsky · December 30, 2017, 5:06pm

But surely this is a database and not a book?

I just want to get things right. Which is why I turned to the forum instead of the single person who complained.
So far I have had one reply from an official staff member, and that is the guidance I will stick with. Especially as no one has yet come up with a simple way of entering that curly thing. Even the guy who “corrected” my database entries pointed to a standard apostrophe when I asked him how to enter that thing.

reosarevok · December 31, 2017, 11:11am

I suspect most artists have zero intention of using one character or the other. They just don’t even know about typography (let’s be fair, many don’t seem to know about capitalization rules or even simple grammar) so there’s no decision there.

IvanDobsky · December 31, 2017, 4:53pm

This is getting messy. I have just spotted somewhere else this has been done, and ended up with a non-printable character instead.

I am using EAC to rip my music, linked to MusicBrainz to fill in initial details. Set file names, folder names.

I then pass the files over to MusicBrainz Picard to do the tagging.

I have now spotted that “Easy Star All-Stars” has an unprintable character instead of the dash when displayed in certain fonts.

I did wonder why I was looking at my folders yesterday and saw what looked like two Easy Star All-Stars folders side by side. Now I realise it was someone playing with the different dashes. (A difference that is almost invisible to the eye)

Try using standard fonts like Courier in Notepad++ and these typographical tweaks cannot be displayed.

Try making up a playlist - how does one type these special characters? (I notice no one has come up with an answer on that simple part of the puzzle?)

I really don’t understand why cosmetic stuff like this is happening to common data in a database. Surely if someone wants to prettify their own version then they can adjust on their output. I don’t see the sense in doing this on data that gets used in so many other places like Media Players.

I don’t want to be a grumpy old git. I am just confused as I thought MusicBrainz was a music database for common world wide use in many projects. Now I find I am putting weird hidden characters into my files making them less usable.

So I need to find a solution for me to get round this issue for my case in tagging music files, writing music tags, manually writing play lists, using the data in other media players like KODI. Using punctuation I can find on a keyboard, whilst still keeping all the standard European \ Asian characters in the text that can be displayed in standard fonts.

I don’t want to tick the “swap unicode to ASCII” options in Picard as I want to keep my Japanese text, etc.

Is this an issue I need to take over to the Picard devs? See if an addon can be made to re-standardise this stuff? Something that can fix punctuation to a standard whilst leaving the Unicode in place for non-ASCII characters?

IvanDobsky · December 31, 2017, 5:04pm

Talking to myself now as I realise this is my problem that I need to fix for my own usage.

I need to take my questions over to the Picard threads I guess. I’m looking through the settings and see “Convert Unicode Punctuation characters to ASCII” is a ticked setting in my copy of Picard. Now it implies to me that that tick would do exactly what I am asking about here… but I am still ending up with these odd characters in my filenames and tags.

Ah - hang on. Now I am starting to work out where the mess is in my files. It is EAC putting these bits in initially. Now that I am aware of these things, I may be able to find a fix at my end… time to work on EAC and Picard settings I think.

mfmeulenbelt · December 31, 2017, 5:08pm

It is trivial to turn typographically correct characters into their ASCII equivalents, but the other way around is in some cases impossible (the software would have to choose between “ and ” for example). That is why we prefer to store the former. You could take a look at the plugin Non-ASCII Equivalents (which is similar to the built-in option) and try to remove bits of code you don’t want (like Japanese to Latin). You can download that plugin here: https://picard.musicbrainz.org/plugins/
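
To illustrate why one direction is easy and the other is not, here is a minimal Python sketch. It is not the code of Picard or of the Non-ASCII Equivalents plugin; the table and the function name `punctuation_to_ascii` are made up for this example, and the list of characters is deliberately short. It folds a handful of typographic punctuation marks to ASCII and leaves every other character alone:

```python
# Hypothetical table: only punctuation is listed, so letters such as
# ö, ş or Japanese characters pass through unchanged.
PUNCTUATION_TO_ASCII = str.maketrans({
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK (typographic apostrophe)
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2010": "-",    # HYPHEN
    "\u2013": "-",    # EN DASH
    "\u2014": "-",    # EM DASH
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
})


def punctuation_to_ascii(text: str) -> str:
    """Replace the punctuation listed above with ASCII; keep everything else."""
    return text.translate(PUNCTUATION_TO_ASCII)


print(punctuation_to_ascii("Can\u2019t Touch This"))
print(punctuation_to_ascii("Bj\u00f6rk"))
```

Both “ and ” come out as the same ", which is why the conversion cannot simply be run backwards.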

IvanDobsky · December 31, 2017, 5:36pm

@mfmeulenbelt I am trying to make sense of the Picard plugins and options, but there is so little written down about them.

I have seen the option “Convert Unicode Punctuation characters to ASCII” in the settings, which seems the best fit - but there is no documentation on what it swaps.

I see the addon for the NON-ASCII Equivalents - but again find no details on what it swaps. (And I don’t really want to have to start editing source code unless I really really have to)

I am pretty sure it is just that first one I need to get right. I want to see my multilingual characters as I have Turkish and Japanese artists in my collection. My modern software in the modern media centre happily displays those okay.

It is just when getting to the level of filenames it gets awkward for me.

I have already realised that part of my problem is also coming from EAC when ripping. Because it now points at MusicBrainz instead of FreeDB I am getting these oddities in my filenames, which is a bigger headache when mixed in with previously ripped folders and files.

TBH - the main scream from me has been because I have only spotted this now after re-ripping 350+ disks in the past few months. It now means I need to go back through everything and normalise things for my system. HAHAHA - this really is a never ending mission. This was the fourth time ripping my music collection… at least cleaning the tags with Picard should be quicker this time.

In the New Year I’ll make sure I have a clearer picture of this mess that has now happened in my files. And then write it up in some way for other people who will also come across it.

mfmeulenbelt · December 31, 2017, 5:45pm

I would suggest you try a few different options with just one troublesome release and see if it comes out right. The source code for that plugin isn’t very complex. I think it will speak for itself if you open the .py file in notepad (you won’t need any programming knowledge).

IvanDobsky · December 31, 2017, 5:59pm

I’ll take a calm clean look at this all again in the new year. I know I have specific requirements that don’t fit other people’s needs. I only need the punctuation cleaned up - mainly because I can’t SEE the difference between one hyphen dash and another, but my computer file system can see it. Leading to confuddlement.

I will look at the source code deeper - not a problem with the comprehension of it as I have written Python addons for elsewhere. (C \ C++ background). The initial look into that plugin shows it is far too wide for what I want. It is removing far too many characters for me. I still want to see Ayşedeniz Gökçin and Björk but I also want to know that Easy Star All-Stars and Ed Alleyne-Johnson are not getting confused with Easy Star All‐Stars and Ed Alleyne‐Johnson!
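
A rough sketch of that requirement in Python, assuming a short hand-picked table of look-alike punctuation (the table and the name `find_lookalikes` are hypothetical, not taken from any plugin). It groups names that differ only in those characters, so duplicates like the two Easy Star All-Stars folders show up, while letters such as ş, ö and ç are never touched:

```python
from collections import defaultdict

# Assumed, incomplete table of punctuation that is hard to tell apart by eye.
LOOKALIKES = str.maketrans({
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
})


def find_lookalikes(names):
    """Return {folded name: [variants]} for names that differ only in punctuation."""
    groups = defaultdict(set)
    for name in names:
        groups[name.translate(LOOKALIKES)].add(name)
    # Keep only the groups where more than one distinct spelling exists.
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


folders = [
    "Easy Star All-Stars",
    "Easy Star All\u2010Stars",
    "Ay\u015fedeniz G\u00f6k\u00e7in",
    "Bj\u00f6rk",
]
for folded, variants in find_lookalikes(folders).items():
    print(folded, [ascii(v) for v in variants])
```

Printing with `ascii()` makes the otherwise invisible difference show up as an escape such as `\u2010`.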

I think my bigger issue is going to be with EAC ripping using MusicBrainz metadata as I need to manually add a list of substitutions into that program now to catch these dashes and oddities that a small number of people are entering into the database, even though the official line is clearly that they are optional.

aerozol · January 1, 2018, 2:38am

They are optional, but correct punctuation etc is preferred. That small number of people is improving the database.
That it makes tagging life difficult for you is unfortunate, but not a good reason to water down how we store data (MB aims to be a database first, a resource for Picard to use/for you to tag your files second).

Although I’m not sure what the exact issues are? I haven’t really heard of KODI or playlists having trouble reading or displaying MB tracks.

Kid_Devine · January 3, 2018, 12:53am

I can see this would be confuddling. Unfortunately the two hyphens used on MusicBrainz (Unicode HYPHEN and HYPHEN-MINUS) are supposed to look identical.
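
For anyone who wants to check which one is in a given name, a small Python sketch using the standard `unicodedata` module can list the code point and official name of each character (the helper name `describe` is just for this example):

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# HYPHEN-MINUS is the key on the keyboard; HYPHEN is the typographic one.
for code_point, name in describe("-\u2010"):
    print(code_point, name)
```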

anon18945670 · March 23, 2018, 9:48am

Seconded.
Without any further explanation (or any knowledge about ASCII or unicode) this sentence:

Use of basic ASCII punctuation characters such as ' and " is allowed, but typographically-correct punctuation is preferred.

sounds to me like 'em is allowed, but them is preferred.

I for one always use the apostrophe that I can use without hurting my fingers.
To write ' I only need to hit one key once, to write ´ I have to hit Alt Gr + ' twice.
At least I didn’t find any other option. This is my keyboard layout:

I’m using German Switzerland, because that’s where my computer is from and I’d like the signs on the keys to actually represent the output, but if e.g. I switched to English layout:

the correct apostrophe seems to be gone completely.

Do you guys all use copy paste to write apostrophes or do you have different keyboard layouts?

I tried to find a guide to create a custom layout on ubuntu, but that’s way over my head.

That’s a nice policy, but doesn’t always work well.
E.g.: I have recently made about 1000 edits where I moved the “feat. XY” from the title to the artist credits, and then I used Guess case, Reuse previous recordings and Copy all … to associated recordings(*). I assumed that if a recording’s title changed from e.g. “Can´t touch this” to “Can’t touch this”, this was because Guess case had switched it to the correct punctuation. In fact the titles were just different on the album and on the recording to begin with, and Guess case did nothing.
I probably changed a lot of recordings incorrectly lately until I finally got called out here.

(*) I changed obvious mistakes of the Guess case function back (like I'm from BK vs I'm from Bk) and unticked Copy all … to associated recordings if the track was e.g. called “Song (album version)” on the single and “Song” as a recording.

obtext · March 23, 2018, 10:19am

What I’d do on Ubuntu is:

Set one of the keys on your keyboard as the ‘compose key’ from system settings (I use right-Ctrl but it’s up to you).
Use the table here to find the key combinations you need (they’re pretty logical and you’ll soon learn the most common ones). Just hit the compose key followed by the two-or-three-key sequence from that table.

mfmeulenbelt · March 23, 2018, 10:20am

The English have little use for special characters. I have a German T2 keyboard (although I have switched the Z and Y back to their proper places), so ’ is a simple Alt-Gr + 1.

I don’t often see correct punctuation being turned back into its ASCII equivalents, so it doesn’t go wrong that often. But do be careful with Copy all … to recordings, it can introduce errors. It would be nice if you could see if recordings are shared among multiple tracks in the recordings tab of the release editor.

Llama_lover · March 23, 2018, 11:07am

I also use EAC to rip. If you find a solution to the above, I would like to see your resolution!

anon18945670 · March 23, 2018, 11:07am

Ok, now I’m completely confused. I tried to find a way to change the compose key, but that setting doesn’t seem to exist in Ubuntu 17.10. Anyways a websearch suggested that Shift+Alt Gr is the compose key by default and with that I found two additional apostrophes. So there are 5! ´‘’’` So which one is the correct one?

obtext · March 23, 2018, 11:58am

The official recommendation is to use right-single-quotation-mark for apostrophe and this is what most people do. (I think this was a mistake, but that’s another story.)
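
For the common case of an apostrophe inside a word, the recommendation can be sketched in Python as below. This is only a heuristic under a stated assumption: a typewriter apostrophe with a word character on both sides is taken to be an apostrophe. Anything at the start or end of a word is left alone, because there software cannot tell an apostrophe from a quotation mark. The name `curl_inner_apostrophes` is made up for this example.

```python
import re

# A typewriter apostrophe with a word character directly before and after it.
# Note that \w also matches digits and underscores, so this is a rough guess.
_INNER_APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")


def curl_inner_apostrophes(text: str) -> str:
    """Replace in-word ' with U+2019 RIGHT SINGLE QUOTATION MARK."""
    return _INNER_APOSTROPHE.sub("\u2019", text)


print(ascii(curl_inner_apostrophes("Can't Touch This")))
print(ascii(curl_inner_apostrophes("Rock 'n' Roll")))
```

The second title comes back unchanged, which shows the limit of the heuristic: those cases still need a human to decide.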

IvanDobsky · March 23, 2018, 11:59am

@Llama_lover EAC can do character swaps in the settings for the filenames. It already swaps out the obvious ones that upset file systems. So I have tagged a few more to that list.

From the EAC menu select EAC Options. Now look for the Character Replacements tab.

There are already swaps in here for slashes, colons, question marks. And a few empty boxes at the end. I have added my hyphen and apostrophe swaps in here to swap from these prettified ones to the standard ASCII ones.

I see the idea that is being attempted, but for my filenames I need to change them to avoid confusion.

I do still want to see every umlaut and Japanese character correctly. It is just the “hard to see by eye” items that I like to swap back in my filenames. (Happy to leave them in my tags)

jesus2099 · March 23, 2018, 12:27pm

FWIW, I have a normal French AZERTY keyboard (missing lots of French characters) but I have many changes made to it by my crappy AutoHotKey permanent script, with which this U+2019 apostrophe replaces the typewriter apostrophe, so it’s a single key stroke ’ (I use SHIFT+’ if I want the typewriter apostrophe).

But I am waiting for nice keyboards once the new French BÉPO norm applies (in 2018 or 2019, I don’t remember). I hope more manufacturers will provide one, with that single key stroke for the apostrophe and all other useful characters.