Unicode apostrophe standardization

unicode

#1

I mentioned this a few years ago in IRC but can’t remember the result, so thought I would bring it up here again.

MusicBrainz seems to use a weird Unicode apostrophe; example here:

You can see the first 3 releases use a different apostrophe character than the last release.

This may seem insignificant to some people, but it actually means that a search for that album will not match and, in some circumstances, will return no results.

I suspect this may be a problem with the source data, not really musicbrainz…

Anyone have an idea?


#2

The “weird Unicode apostrophe” is actually the preferred apostrophe on MusicBrainz. That’s because the ‘'’ is butt-ugly. Any reasonably good search thingy will find both apostrophes when someone searches for either one though.


#3

Hi,

Adding to [quote=“mfmeulenbelt, post:2, topic:64777”]
“Unicode apostrophe” is actually the preferred apostrophe on MusicBrainz.
[/quote]

Using the apostrophe as an example, the reason MB prefers ’ (U+2019) over ' (U+0027) is that this is the recommendation in the Unicode Standard.

[quote=“Basic Latin chart: 0027 ' APOSTROPHE”]
neutral (vertical) glyph with mixed usage; 2019 ’ is preferred for apostrophe[/quote]

When you need to do strict pattern-matching, you can use the Convert Unicode punctuation characters to ASCII option in Picard.


#4

Thanks guys. Now that I see it’s a style guideline, I can of course program around it. Unfortunately, neither MySQL nor PHP treats these characters as the same without special processing (I’m matching the iTunes RSS feed with the MusicBrainz Release-Group name). Again, not a problem; I was really interested in why it was sometimes different, even within the same Release-Group!

For anyone who is interested here is how I solved it in both:

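MySQL
-- Replace the curly apostrophe (U+2019) with the ASCII apostrophe (U+0027).
-- The hyphenated column name has to be quoted with backticks.
UPDATE album SET `release-group` = REPLACE(`release-group`, "’", "'");

PHP
// Transliterate to ASCII; the string arguments need straight quotes.
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);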


#5

You’ll probably need to run similar updates in MySQL for the opening and closing single quotes, and for double quotes (which have multiple Unicode code points in use).
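
For anyone doing the matching in application code rather than in the database, the same idea can be sketched in Python. This is only an illustration: the mapping table and the function name `to_ascii_quotes` are made up for this example, and the list of code points is a starting set, not a complete one.

```python
# Hypothetical mapping; extend it with whatever code points your data contains.
ASCII_PUNCTUATION = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK (the preferred apostrophe)
    "\u201A": "'",   # SINGLE LOW-9 QUOTATION MARK
    "\u201C": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u201E": '"',   # DOUBLE LOW-9 QUOTATION MARK
})


def to_ascii_quotes(text: str) -> str:
    """Replace typographic quotes with their ASCII counterparts."""
    return text.translate(ASCII_PUNCTUATION)


print(to_ascii_quotes("(Don\u2019t Fear) The Reaper"))  # (Don't Fear) The Reaper
```

Applying the same function to both sides of the comparison (feed title and MusicBrainz title) avoids having to rewrite the stored data.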


#6

Well, that’s the “preferred” part :slight_smile:

For editors who want to learn how to enter fancy unicode, awesome. Go for it.
For editors who don’t want to, cool. Do what you can. Another editor will probably come along and change it later.


#7

If you look at a book or a CD, you are more likely to see those curly (normal) apostrophes than the legacy upright (ugly) typewriter apostrophes that computers got us used to. :wink:
This ’ is not weird, it’s normal. :slight_smile:


#8

Yeah, the only reason I called it “weird” is that a quick Google search had shown me this was mostly a problem with people copying and pasting from Microsoft Word, which also converts the apostrophe to the curly version. I had no idea it was intended :wink: Anyway, I’m happy to convert; the consistency is a much better question.


#9

An alternative, depending on your needs, might be to strip all non-alphanumeric characters except spaces (and standardise those to single spaces) from both sides for a simplified search.
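
A rough sketch of that idea in Python, assuming a simple comparison of two title strings (the function name `simplify` is just for illustration):

```python
def simplify(text: str) -> str:
    """Drop everything except letters, digits and spaces; collapse runs of spaces."""
    # str.isalnum() is Unicode-aware, so accented and non-Latin letters are kept
    kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    # split() without arguments collapses any run of whitespace
    return " ".join(kept.split())


curly = simplify("(Don\u2019t Fear)  The Reaper")
straight = simplify("(Don't Fear) The Reaper")
print(curly)              # Dont Fear The Reaper
print(curly == straight)  # True
```

A title made up only of punctuation or symbols collapses to an empty string under this scheme, so it probably needs a fallback to an exact comparison.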


#10

Of course, that’d probably be problematic for stuff like !!!, ☾∧† ◯ and █ ▄ █ █ ▄ ██ ▄ ██ ▄█


#11

Depending on what you’re doing, you might want a more comprehensive normalize/simplify strategy (beyond apostrophes).

For comparison, you can see the transformations Picard does here.
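
As a rough idea of what a broader strategy could look like (this is not Picard’s code, just a sketch using Python’s standard `unicodedata` module, with a made-up function name):

```python
import unicodedata


def normalize_for_matching(text: str) -> str:
    """Fold accents, typographic apostrophes and case for loose comparison."""
    # NFKD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (category Mn), keeping the base letters
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Fold the curly single quotes to the ASCII apostrophe
    stripped = stripped.replace("\u2018", "'").replace("\u2019", "'")
    # casefold() is a more aggressive lower() intended for caseless matching
    return stripped.casefold()


print(normalize_for_matching("Caf\u00e9\u2019s"))  # cafe's
```

This should only be used to build a comparison key; the stored titles stay as entered.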


#12

They’d get what they deserve :stuck_out_tongue_winking_eye:
Obviously it’s not a good approach for non Latin scripts!


#13

I’d expect setting the collation properly in MySQL to help—see an answer on Stack Overflow or the MySQL docs.


#14

Well, U+2019 being the correct (for most cases) apostrophe, is the reason why MSWord does this kind of artificial „intelligence“ in the first place.

It’s especially irritating when applied to technical documentation, where the combination of stupid software and inattentive users turns a command line like
ls -l --si 'Tangerine Dream'/*
into
ls -l –si ‘Tangerine Dream’/*
which will not work at all.
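
The difference can be seen without running the commands at all. The sketch below uses Python’s `shlex` module, which approximates POSIX shell word splitting (it does no glob expansion), to show how each line is tokenized:

```python
import shlex

good = "ls -l --si 'Tangerine Dream'/*"
# Same line after "smart" replacement: en dash U+2013, curly quotes U+2018/U+2019
bad = "ls -l \u2013si \u2018Tangerine Dream\u2019/*"

print(shlex.split(good))  # ['ls', '-l', '--si', 'Tangerine Dream/*']
print(shlex.split(bad))   # ['ls', '-l', '–si', '‘Tangerine', 'Dream’/*']
```

In the second case the curly quotes no longer quote anything, so the directory name is split into two words, and `–si` is not recognised as an option.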


#15

I’m getting a no vote for this very subject!

https://musicbrainz.org/edit/43119660

Please let him know that this is the preferred apostrophe.


#16

Is this already included in the [Guess case] functionality? If not, it should be, to make things consistent and stop the confusion.


#17

No, because it can’t be done – a human is needed to decide whether an apostrophe, a single quotation mark (or something else entirely) is meant.


#18

And even humans can make the wrong ' translation. E.g., just the other day I did https://musicbrainz.org/edit/42712692 – notice the s? Yeah. Those are not appropriate here! Such a git, that editor. Luckily a smart git that sometimes realises his mistakes: https://musicbrainz.org/edit/42767913


#19

Why is there nothing in the style guide about using these “correct” apostrophes?

It would help if there were a page in there explaining why these curly apostrophes are in use, and how to type one from the keyboard. I laughed when I saw that this thread says it is for “cosmetic” reasons and that Microsoft Word was being taken as a standard here. (Must be a first - lolz)

I am using a standard Windows PC with a UK keyboard and have no idea how to type a ’, so all I can do is copy and paste. When I got picked up on it elsewhere I was told to use a Unicode apostrophe, (ALT)+039, to give ', which is what I was already doing.

So please, can someone write a definitive page in the Style Guide explaining this?

Currently, if you look at the English Style Guide page, even the style guide itself is written with standard apostrophes, using a normal ' from the keyboard.

I am trying to analyse the differences in a hex editor: my keyboard puts out the standard apostrophe, the same as (ALT)+039 ' (hex 0x27), whereas the tilted-over ’ is hex 0x92.
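
For reference, those byte values can be checked with a few lines of Python. One assumption here: the 0x92 byte suggests the text in the hex editor was saved as Windows-1252, where U+2019 is a single byte; in UTF-8 the same character takes three bytes. The helper name `encodings` is just for this example.

```python
def encodings(ch: str) -> tuple[str, str]:
    """Return the hex bytes of a character in Windows-1252 and in UTF-8."""
    return ch.encode("cp1252").hex(), ch.encode("utf-8").hex()


for ch in ("'", "\u2019"):
    # Print the code point followed by both byte sequences
    print(f"U+{ord(ch):04X}", *encodings(ch))
# U+0027 27 27
# U+2019 92 e28099
```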

https://musicbrainz.org/doc/Style/Language/English
The only example I can find so far is on the English style guide page, which has a specific example of “(Don’t Fear) The Reaper” linking to https://musicbrainz.org/recording/963bda54-eed7-47ba-b637-d6110d43db88

Look down that list and you will see at least three different apostrophes in use. Yet the TITLE as shown on the style guide uses the (ALT)+039 ' and not the odd tilted-over apostrophe ’ you have in your example.

I am confused. (Not arguing as I want to get this right, but I also don’t want to be dragged into a weird *nix vs windoze argument as I live in both of those camps. I also know my music titles are going to get read back in a number of different fonts and I don’t want things getting too weird :D)

Help us get a definitive answer here :slight_smile:

Edit Note: Oh great… I have just realised that my carefully typed out text has been trashed and the standard apostrophe’s been replaced with curly ones meaning the above is now not as clear as it should have been…


#20

Hi!

The style guide info is on https://musicbrainz.org/doc/Style/Miscellaneous - but it’s hard to have a complete guide for every language on what to use when. https://wiki.musicbrainz.org/User:Jacobbrett/English_Punctuation_Guide is useful for English, but it’s made by a user and is not official as such. I use it when I’m confused, anyway :slight_smile:

But generally: you only need to worry about these if you want to. Otherwise, just don’t change anything that is already there and might be typographically correct, and if someone complains because you’re using the basic ASCII punctuation when adding stuff, just remind them the guideline very specifically says “usage is allowed”.