Unicode apostrophe standardization

tommycrock · June 3, 2016, 10:50pm

They’d get what they deserve
Obviously it’s not a good approach for non-Latin scripts!

derobert · June 3, 2016, 11:19pm

I’d expect setting the collation properly in MySQL to help—see an answer on Stack Overflow or the MySQL docs.

docdem · June 6, 2016, 7:23am

Well, U+2019 being the correct (for most cases) apostrophe, is the reason why MSWord does this kind of artificial „intelligence“ in the first place.

It’s especially irritating when applied to technical documentation, where the combination of stupid software and inattentive users turns a command line like
ls -l --si 'Tangerine Dream'/*
into
ls -l –si ‘Tangerine Dream’/*
which will not work at all.
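To see exactly which characters were swapped, one can list the non-ASCII code points in the mangled line. This is only a minimal Python sketch; the names `typed`, `mangled` and `non_ascii` are made up for illustration, and the escapes assume the post above contains an en dash and a pair of curly single quotes:

```python
import unicodedata

# The command as typed, and as it appears after "smart" punctuation was applied.
typed = "ls -l --si 'Tangerine Dream'/*"
mangled = "ls -l \u2013si \u2018Tangerine Dream\u2019/*"


def non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if ord(ch) > 127
    ]


for entry in non_ascii(mangled):
    print(entry)
```

The shell treats none of these three characters as an option prefix or a quote, which is why the second command fails.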

tigerman325 · January 30, 2017, 9:02pm

I’m getting a no vote for this very subject!

https://musicbrainz.org/edit/43119660

Please let him know that this is the preferred apostrophe.

cloudzzz · February 2, 2017, 9:29am

Is this already included in the [Guess case] functionality? If not, it should be, to make things consistent and stop the confusion.

chirlu · February 2, 2017, 9:52am

No, because it can’t be done – a human is needed to decide whether ’ or ‘ (or something else entirely) is meant.

Freso · February 2, 2017, 5:18pm

And even humans can make the wrong ' translation. E.g., just the other day I did Edit #42712692 - MusicBrainz – notice the ’s? Yeah. Those are not appropriate here! Such a git, that editor. Luckily a smart git that sometimes realises his mistakes: https://musicbrainz.org/edit/42767913

IvanDobsky · December 30, 2017, 2:41pm

Why is there nothing in the style guide about using these “correct” apostrophes?

It would help if there was a page in there explaining why these curly apostrophes are in use, and how to type one in from the keyboard. I laughed when I saw that this thread says it is for “cosmetic” reasons and that Microsoft Word was being taken as a standard here. (Must be a first - lolz)

I am using a standard Windows PC with a UK keyboard and have no idea how to do a ’ so all I can do is copy and paste. When I got picked up on it elsewhere I was told to Use a Unicode Apostrophe (ALT)+039 to give ’ which is what I was already doing.

So please, can someone write a definitive page in the Style Guide explaining this?

Currently, if you look at the English Style Guide page, even the style guide itself is written with standard apostrophes, using a normal ’ from the keyboard.

I am trying to analyse the differences in a hex editor, and my keyboard puts out the standard apostrophe same as (ALT039) ’ (Hex 0x27) whereas the tilted over ’ is hex 0x92.
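Those two hex values can be reproduced in Python. This is only a sketch, and it assumes the file in the hex editor was saved in the Windows-1252 code page (the post does not say which encoding was used); in UTF-8 the curly apostrophe takes three bytes instead:

```python
straight = "'"      # U+0027 APOSTROPHE, the key on a standard keyboard
curly = "\u2019"    # U+2019 RIGHT SINGLE QUOTATION MARK, the "tilted over" one

print(hex(ord(straight)))       # code point of the keyboard apostrophe
print(hex(ord(curly)))          # code point of the curly apostrophe
print(curly.encode("cp1252"))   # single byte in Windows-1252
print(curly.encode("utf-8"))    # three bytes in UTF-8
```

So 0x92 is not a Unicode code point; it is how one legacy Windows code page stores U+2019.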

https://musicbrainz.org/doc/Style/Language/English
The only example I can find so far is on the English style guide page, which has a specific example of “(Don’t Fear) The Reaper” that links to https://musicbrainz.org/recording/963bda54-eed7-47ba-b637-d6110d43db88

Look down that list and you see at least three different apostrophes in use. Yet the TITLE as shown on the style guide is using (ALT)039 ’ and not that odd tilted over apostrophe ’ you have in your example.

I am confused. (Not arguing as I want to get this right, but I also don’t want to be dragged into a weird *nix vs windoze argument as I live in both of those camps. I also know my music titles are going to get read back in a number of different fonts and I don’t want things getting too weird :D)

Help us get a definitive answer here

Edit Note: Oh great… I have just realised that my carefully typed out text has been trashed and the standard apostrophes have been replaced with curly ones, meaning the above is now not as clear as it should have been…

reosarevok · December 30, 2017, 3:18pm

Hi!

The style guide info is on Style / Miscellaneous - MusicBrainz - but it’s hard to have a complete guide for every language on what to use when. User:Jacobbrett/English Punctuation Guide - MusicBrainz Wiki is useful for English, but it’s made by a user and is not official as such. I use it when I’m confused, anyway

But generally: you only need to worry about these if you want to. Otherwise, just don’t change anything that is already there and might be typographically correct, and if someone complains because you’re using the basic ASCII punctuation when adding stuff, just remind them the guideline very specifically says “usage is allowed”.

IvanDobsky · December 30, 2017, 4:40pm

Hi @reosarevok - it is good to get an official reply. I am British English and use a British English keyboard. And going by that link to the miscellaneous page I read that as “use normal apostrophes” as they are typographically correct.

Use of basic ASCII punctuation characters such as ' and " is allowed, but typographically-correct punctuation is preferred

It is also easier than having to load up Character Map and try and find the specific silly curly thing as used in Word.

I’m just going to therefore stick with what I know. Especially as this keeps standard searches working okay. It also seems to be more logical with the way the rest of the rules of data entry work.

It certainly made me laugh when the guy told me to “read the style guide” and all I could find was the above example, where it was clearly using a standard apostrophe from the keyboard. I’ll ignore this next time and just aim to keep things consistent and match how the artist intended it to be seen.

jesus2099 · December 30, 2017, 5:04pm

You can see this standard more easily by opening any printed book around you.

IvanDobsky · December 30, 2017, 5:06pm

But surely this is a database and not a book?

I just want to get things right. Which is why I turned to the forum instead of the single person who complained.
So far I have had one reply from an official staff member, and that is the guidance I will stick with. Especially as no one has yet come up with a simple way of entering that curly thing. Even the guy who “corrected” my database entries pointed to a standard apostrophe when I asked him how to enter that thing.

reosarevok · December 31, 2017, 11:11am

I suspect most artists have zero intention of using one character or the other. They just don’t even know about typography (let’s be fair, many don’t seem to know about capitalization rules or even simple grammar) so there’s no decision there.

IvanDobsky · December 31, 2017, 4:53pm

This is getting messy. I have just spotted somewhere else this has been done, and ended up with non-printable characters instead.

I am using EAC to rip my music, linked to MusicBrainz to fill in initial details. Set file names, folder names.

I then pass the files over to MusicBrainz Picard to do the tagging.

I have now spotted that “Easy Star All-Stars” has an unprintable character instead of the dash when displayed in certain fonts.

I did wonder why I was looking at my folders yesterday and saw what looked like two Easy Star All-Stars folders side by side. Now I realise it was someone playing with the different dashes. (A difference that is almost invisible to the eye)

Try using standard fonts like Courier in Notepad++ and these typographical tweaks cannot be displayed.

Try making up a playlist - how does one type these special characters? (I notice no one has come up with an answer on that simple part of the puzzle?)

I really don’t understand why cosmetic stuff like this is happening to common data in a database. Surely if someone wants to prettify their own version then they can adjust on their output. I don’t see the sense in doing this on data that gets used in so many other places like Media Players.

I don’t want to be a grumpy old git. I am just confused as I thought MusicBrainz was a music database for common world wide use in many projects. Now I find I am putting weird hidden characters into my files making them less usable.

So I need to find a solution for me to get round this issue for my case in tagging music files, writing music tags, manually writing play lists, using the data in other media players like KODI. Using punctuation I can find on a keyboard, whilst still keeping all the standard European \ Asian characters in the text that can be displayed in standard fonts.

I don’t want to tick the “swap unicode to ASCII” options in Picard as I want to keep my Japanese text, etc.

Is this an issue I need to take over to the Picard devs? See if an addon can be made to re-standardise this stuff? Something that can fix punctuation to a standard whilst leaving the Unicode in place for non-ASCII characters?

IvanDobsky · December 31, 2017, 5:04pm

Talking to myself now as I realise this is my problem that I need to fix for my own usage.

I need to take my questions over to the Picard threads I guess. I’m looking through the settings and see “Convert Unicode Punctuation characters to ASCII” is a ticked setting in my copy of Picard. Now it implies to me that that tick would do exactly what I am asking about here… but I am still ending up with these odd characters in my filenames and tags.

Ah - hang on. Now I am starting to work out where the mess is in my files. It is EAC putting these bits in initially. Now I am getting aware of these things, I may be able to find a fix at my end… time to work on EAC and Picard settings I think.

mfmeulenbelt · December 31, 2017, 5:08pm

It is trivial to turn typographically correct characters into their ASCII equivalents, but the other way around is in some cases impossible (the software would have to choose between “ and ” for example). That is why we prefer to store the former. You could take a look at the plugin Non-ASCII Equivalents (which is similar to the built-in option) and try to remove bits of code you don’t want (like Japanese to Latin). You can download that plugin here: https://picard.musicbrainz.org/plugins/
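The asymmetry can be shown with a small Python sketch. The table below is a hypothetical four-entry mapping written for illustration; it is not the table used by Picard or by the Non-ASCII Equivalents plugin:

```python
from collections import defaultdict

# Hypothetical mapping from typographic quotes to ASCII (illustration only).
TO_ASCII = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}


def to_ascii_punctuation(text):
    """Typographic to ASCII: a plain lookup, always well defined."""
    return text.translate(str.maketrans(TO_ASCII))


# Going the other way, each ASCII character has more than one candidate,
# so software cannot pick the right one without understanding the context.
candidates = defaultdict(list)
for fancy, plain in TO_ASCII.items():
    candidates[plain].append(fancy)

print(to_ascii_punctuation("\u201cDon\u2019t\u201d"))
print(dict(candidates))
```

Storing the typographic form keeps the information; anyone who wants plain ASCII can derive it on output.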

IvanDobsky · December 31, 2017, 5:36pm

@mfmeulenbelt I am trying to make sense of the Picard plugins and options, but there is so little written down about them.

I have seen options for “Convert Unicode Punctuation characters to ASCII” in the options that seems the best fit - but no documentation on what it swaps.

I see the addon for the NON-ASCII Equivalents - but again find no details on what it swaps. (And I don’t really want to have to start editing source code unless I really really have to)

I am pretty sure it is just that first one I need to get right. I want to see my multilingual characters as I have Turkish and Japanese artists in my collection. My modern software in the modern media centre happily displays those okay.

It is just when getting to the level of filenames it gets awkward for me.

I have already realised that part of my problem is also coming from EAC when ripping. Because it now points at MusicBrainz instead of the FreeDB I am getting these oddities in my filenames, which is a bigger headache when mixed in with previously ripped folders and files.

TBH - the main scream from me has been because I have only spotted this now after re-ripping 350+ disks in the past few months. It now means I need to go back through everything and normalise things for my system. HAHAHA - this really is a never ending mission. This was the fourth time ripping my music collection… at least cleaning the tags with Picard should be quicker this time.

In the New Year I’ll make sure I have a clearer picture of this mess that has now happened in my files. And then write it up in some way for other people who will also come across it.

mfmeulenbelt · December 31, 2017, 5:45pm

I would suggest you try a few different options with just one troublesome release and see if it comes out right. The source code for that plugin isn’t very complex. I think it will speak for itself if you open the .py file in notepad (you won’t need any programming knowledge).

IvanDobsky · December 31, 2017, 5:59pm

I’ll take a calm clean look at this all again in the new year. I know I have specific requirements that don’t fit other people’s needs. I only need the punctuation cleaned up - mainly because I can’t SEE the difference between one hyphen dash and another, but my computer file system can see it. Leading to confuddlement.

I will look at the source code deeper - not a problem with the comprehension of it as I have written Python addons for elsewhere. (C \ C++ background). The initial look into that plugin shows it is far too wide for what I want. It is removing far too many characters for me. I still want to see Ayşedeniz Gökçin and Björk but I also want to know that Easy Star All-Stars and Ed Alleyne-Johnson are not getting confused with Easy Star All‐Stars and Ed Alleyne‐Johnson!
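A punctuation-only clean-up along those lines might look roughly like this in Python. It is a sketch under assumptions: the `PUNCTUATION` table and the `normalise_punctuation` name are hypothetical, the list of characters is a guess at what is wanted and is not taken from Picard or from the plugin, and it is not wired into Picard in any way:

```python
# Hypothetical table: only dashes, quotes and the ellipsis are mapped.
# Letters such as ş, ö or Japanese text are deliberately left out.
PUNCTUATION = {
    "\u2010": "-",    # hyphen
    "\u2011": "-",    # non-breaking hyphen
    "\u2012": "-",    # figure dash
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
}
_TABLE = str.maketrans(PUNCTUATION)


def normalise_punctuation(text):
    """Replace the listed punctuation with ASCII; leave everything else alone."""
    return text.translate(_TABLE)


for name in ["Easy Star All\u2010Stars", "Ed Alleyne\u2010Johnson",
             "Ayşedeniz Gökçin", "Björk"]:
    print(normalise_punctuation(name))
```

Because only the characters in the table are touched, the two spellings of each hyphenated name end up identical while accented and non-Latin letters pass through unchanged.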

I think my bigger issue is going to be with EAC ripping using MusicBrainz metadata as I need to manually add a list of substitutions into that program now to catch these dashes and oddities that a small number of people are entering into the database, even though the official line is clearly that they are optional.

aerozol · January 1, 2018, 2:38am

They are optional, but correct punctuation etc is preferred. That small number of people is improving the database.
That it makes tagging life difficult for you is unfortunate, but not a good reason to water down how we store data (MB aims to be a database first, a resource for Picard to use/for you to tag your files second).

Although I’m not sure what the exact issues are? I haven’t really heard of KODI or playlists having trouble reading or displaying MB tracks.