Bookmarklet/Userscript to guess Unicode punctuation of titles

kellnerd · February 3, 2021, 8:14pm

Hello “Unicode pixies”,

During the last few days there has (again) been some debate about whether the guess case button could be enhanced to replace ASCII punctuation symbols with their preferred Unicode counterparts. As each ASCII symbol has multiple possible Unicode replacements, it is not easy to implement an automatic conversion. Another result of the discussion was that (even partial) automation could have more drawbacks than benefits if used by inexperienced editors.

On the other hand, I also believe that there should be a tool for advanced editors to speed up the conversion process, so I put together some lines of JavaScript to achieve this. As I did not want to spend more time on designing UI extensions to embed the tool into the MB website than on the functionality itself, I created a simple bookmarklet which you can find here:

It searches all title input fields for ASCII punctuation symbols and replaces them with their preferred Unicode counterparts. Of course these can only be guessed based on their context, as the ASCII symbols are ambiguous, so the editor has to validate the changed titles (which get highlighted by the bookmarklet). The code works for release titles and track titles in the release editor, and for recording and work titles on their respective edit pages.

I invite you to test and review the code, which is of course also available in a non-minified, non-obfuscated form with additional comments that explain the transformation steps performed for each title.
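To illustrate the general idea of context-based guessing, here is a minimal sketch. The three rules and the names `RULES` and `guessPunctuation` are assumptions made up for this example; they are not the rules of the actual bookmarklet, which are documented in the repository.

```javascript
// Ordered list of [pattern, replacement] pairs.
// Illustrative assumptions only, not the bookmarklet's real rule set.
const RULES = [
  [/(?<=\w)'(?=\w)/g, "\u2019"], // apostrophe between word characters: don't -> don’t
  [/\.\.\./g, "\u2026"],         // three full stops -> horizontal ellipsis
  [/(?<=\s)-(?=\s)/g, "\u2013"], // hyphen surrounded by whitespace -> en dash
];

// Applies every rule in order and reports whether the title was changed,
// so that a caller could highlight the fields which need to be validated.
function guessPunctuation(title) {
  const value = RULES.reduce(
    (text, [pattern, replacement]) => text.replace(pattern, replacement),
    title
  );
  return { value, changed: value !== title };
}

console.log(guessPunctuation("Don't Stop - Live..."));
console.log(guessPunctuation("Intro"));
```

In a browser, the returned `value` could be written back to each title input, and `changed` could decide which fields get highlighted.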

Maybe I will expand the code into a full userscript with a small UI later - if there is enough interest and I find some free time to implement this.
Edit: A userscript version is now also available in the same repository which I already linked above.

ulugabi · February 4, 2021, 7:39am

Again, thanks a lot. I tried it for some hours and it’s really useful.

I found one issue with these tracks:
Wasla maqam sika “Asïdil qualba bi waslek”
Wasla maqam kurdi “Sabani jamalek”

that are on https://musicbrainz.org/release/7874eead-a14b-4b65-af56-3541e7ce21c7
The quotes were not changed automatically; I had to do it manually.

jesus2099 · February 4, 2021, 8:28am

I will test with Chinese/Japanese/Korean (CJK) titles, as \w / \W usually do not work with them.
Neither do \s / \S, as these languages don’t use spaces.
I usually use word boundaries \b instead of non-word \W, as they work with more stuff.
But that’s not a solution for our case here, so I will see what’s possible if it does not work out of the box…
Maybe I’ll suggest a CJK mode detection, or something.
Or maybe it’s just better to have 2 bookmarklets.
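For reference, a small sketch of why `\w` falls short for CJK text, compared with Unicode property escapes. The sample strings and the helper name `compareWordClasses` are just assumptions for this demonstration.

```javascript
// Without the "u" flag, \w only matches ASCII letters, digits and underscore.
const asciiWord = /^\w+$/;
// Unicode property escapes (\p{L} = letters, \p{N} = numbers) require the "u" flag.
const unicodeWord = /^[\p{L}\p{N}_]+$/u;

// Reports for each sample whether it consists only of "word" characters
// according to the ASCII class and according to the Unicode-aware class.
function compareWordClasses(samples) {
  return samples.map((text) => ({
    text,
    ascii: asciiWord.test(text),
    unicode: unicodeWord.test(text),
  }));
}

console.log(compareWordClasses(["Love", "愛", "사랑", "うた"]));
```

Only the Latin sample passes the ASCII test, while all four samples pass the Unicode-aware one.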

kellnerd · February 4, 2021, 11:07am

Thank you for reporting this use case. I have now fixed the code to also match quoted text at the beginning and/or the end of titles; somehow I had missed that.

Strange that the topic now shows up as being edited four times although I edited it only twice… Maybe this is caused by the edit notes I entered for both changes? I think I have never used this feature before, so I cannot tell.

kellnerd · February 4, 2021, 11:23am

So far the code is mainly useful for English titles, but I already plan to support other languages, mainly because they use different Unicode quotes. My intention is to detect and use the release language (only available in the release editor) and maybe the lyrics language as a best-effort assumption for work titles. English style will be the fallback for unset language attributes, multiple languages and recording titles, which have no language attribute.

I think I will be able to implement this for the most common European languages, but I would appreciate your help with CJK titles as I do not really know much about their punctuation rules (read: almost nothing). Please also let me know if you find good test cases (and possible solutions?) where the RegEx character classes do not match for non-Latin scripts.

ulugabi · February 4, 2021, 11:56am

Tested and working.
Thanks

I can help if you have questions about French or English.

jesus2099 · February 4, 2021, 1:07pm

No, I think you can forget about my hasty comment for now.
For the vast majority, they use their own punctuation, which doesn’t have our Latin typewriter equivalents issue.
Where they would use Latin punctuation is in the parts of a title that are in English.
So it’s a false alarm.

Thanks for your script, I am eager to use it!
But each time I have some free time, I spend it only in the forums…

Honestly, in the OP, where you will soon no longer have edit permission, you should remove your source code as it will become obsolete, and you should just link to your GitHub and explain how to install or update the bookmarklet from there.
That will also be more relaxing for you than having to update your post every time you update your script.

IvanDobsky · February 4, 2021, 4:58pm

Is your Speech Marks processing picking up that “open is to the left of a word and close to the right?”

I hope you are not just counting as that won’t work for a quote in a quote:

“Here’s “Something I have to say” by A. Troublemaker”

Obviously in the above example that is Arnie Troublemaker…

kellnerd · February 4, 2021, 5:48pm

The speech mark processing detects text in quotes that is enclosed by non-word characters, i.e. spaces, brackets and dashes (before the opening quote and after the closing quote). But it does not check for nested quotes of the same type as they are more difficult to handle and probably not worth the extra effort. Matching uses the shortest possible quoted sequence of characters in order to handle titles which contain multiple quoted parts.

In your example, the match that gets converted starts with the first (= outer) opening quote and ends with the first (= inner) closing quote, leaving the other two quotes untouched. You need to run it a second time to also convert these (correctly), but I think that is acceptable for an edge case:

Before:   "Here's "Something I have to say" by A. Troublemaker"
1st pass: “Here’s "Something I have to say” by A. Troublemaker"
2nd pass: “Here’s “Something I have to say” by A. Troublemaker”

On the other hand, two quotation levels using different types of quotes, i.e. nested single and double quotes, do not cause problems and work as expected without additional logic.
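The matching described above can be sketched roughly as follows. The regular expression and the name `convertDoubleQuotes` are assumptions for this illustration and may differ from the code in the repository; the sketch only handles double quotes and leaves apostrophes alone.

```javascript
// An opening quote must be preceded by the start of the title or a non-word
// character, a closing quote must be followed by a non-word character or the
// end of the title. The lazy quantifier (.+?) picks the shortest such match.
const DOUBLE_QUOTED = /(?<=^|\W)"(.+?)"(?=\W|$)/g;

function convertDoubleQuotes(title) {
  return title.replace(DOUBLE_QUOTED, "\u201C$1\u201D");
}

const before = `"Here's "Something I have to say" by A. Troublemaker"`;
const firstPass = convertDoubleQuotes(before);
const secondPass = convertDoubleQuotes(firstPass);

console.log(firstPass);  // outer opening and inner closing quote converted
console.log(secondPass); // remaining two quotes converted
```

The inner opening quote is rejected as a closing quote in the first pass because it is followed by a word character, which is why the first match runs from the outer opening quote to the inner closing quote.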

IvanDobsky · February 4, 2021, 6:04pm

Is there a way to make your bookmarklet loop itself multiple times automagically then?

I am not just trying to cause trouble, but literally added text like that just now to an annotation. Quoting someone talking who was then quoting something in his speech. And as it was an annotation there were actually five sets of speechmarks in three sentences. ( https://musicbrainz.org/edit/77006051 )

I also like the challenge of an algorithm. As noted in my above comment, don’t all “open quotes start to the left of a character, and all close quotes to the right of one?” All about that White Space and where it sits.

Good to see Arnie keeps his capital A for his name. My extra curve ball didn’t catch you there.

kellnerd · February 6, 2021, 10:24pm

I had already started to write an answer for you two days ago but did not manage to finish it to my satisfaction until today. Finally here it is…

My guess punctuation script only replaces punctuation marks; it does not trigger the guess case button (which would probably fail for your example).

I could add a second iteration, and I could even add logic to detect how many iterations you need if there are, say, N nested quotes. But this logic would have to run every time for every single track title, so I will not add this for an edge case. Just keep clicking the bookmark (which will later become a button when I turn this into a userscript) until you are happy with the result and the content is no longer highlighted as changed.
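Purely for illustration, automating the "keep clicking until nothing changes" approach could look like the sketch below. The helper `applyUntilStable`, its `maxPasses` limit and the one-line quote transformation are assumptions for this example and are not part of the bookmarklet.

```javascript
// Re-applies a transformation until its output stops changing (or a safety
// limit is reached). "passes" counts only the passes that changed the text.
function applyUntilStable(transform, text, maxPasses = 10) {
  let current = text;
  for (let pass = 1; pass <= maxPasses; pass++) {
    const next = transform(current);
    if (next === current) {
      return { text: current, passes: pass - 1 };
    }
    current = next;
  }
  return { text: current, passes: maxPasses };
}

// Hypothetical single-pass transformation for double quotes only.
const convertQuotesOnce = (title) =>
  title.replace(/(?<=^|\W)"(.+?)"(?=\W|$)/g, "\u201C$1\u201D");

const nested = `"Here's "Something I have to say" by A. Troublemaker"`;
console.log(applyUntilStable(convertQuotesOnce, nested));
```

Each call costs one extra, unproductive pass to confirm that nothing changes anymore, which is the overhead for every single title mentioned above.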

One important point first: So far the bookmarklet only replaces punctuation characters in titles (of releases, tracks, recordings and works) and does nothing with annotations. Annotations are a bit tricky because their markup unfortunately uses ASCII apostrophes which would be replaced by Unicode apostrophes by the bookmarklet. But you have triggered my interest and hence I tried to implement this. You can find an experimental version with support for annotations and edit notes ~~here:~~ (feature is now part of the main version)

I have managed to preserve the apostrophe-based markup for bold and italic text ~~but I am quite sure that the code will break URLs that contain punctuation marks…~~ The latest version is now also able to preserve URLs.
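One possible way to keep the apostrophe-based markup intact is to match runs of two or more apostrophes first and leave them alone. This is only a sketch under that assumption: the name `convertApostrophes` is hypothetical, the context-free replacement of every single apostrophe is a simplification, and URL handling is left out entirely.

```javascript
// The alternation tries the markup (two or more apostrophes) first, so only
// apostrophes which are not part of such a run reach the second alternative.
const APOSTROPHES = /'{2,}|'/g;

function convertApostrophes(text) {
  return text.replace(APOSTROPHES, (match) =>
    match.length > 1 ? match : "\u2019" // keep ''italic'' / '''bold''' markers
  );
}

console.log(convertApostrophes("It's '''bold''' and ''italic''"));
```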

kellnerd · February 6, 2021, 10:42pm

Update:
Aside from the experimental version with support for annotations and edit notes there are also improvements for the main version.
@jesus2099 has fixed a bug that prevented release title changes from being recognized by MBS and reported an issue with ISO 8601 dates, thank you for that. Full (YYYY-MM-DD) and partial dates (YYYY-MM) are now supported by the most recent version.
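To illustrate why dates need special treatment, here is a sketch under the assumption that hyphens between numbers would otherwise be turned into en dashes (as in year ranges). The names `NUMBER_RUN`, `ISO_DATE` and `convertNumberRanges` are hypothetical, and the actual rules of the script may look different.

```javascript
// Runs of digits which are joined by hyphens, e.g. "1999-2009" or "2021-02-06".
const NUMBER_RUN = /\d+(?:-\d+)+/g;
// Full (YYYY-MM-DD) and partial (YYYY-MM) ISO 8601 dates.
const ISO_DATE = /^\d{4}-(?:0[1-9]|1[0-2])(?:-(?:0[1-9]|[12]\d|3[01]))?$/;

// Replaces hyphens in number ranges by en dashes, but keeps ISO dates as-is.
function convertNumberRanges(title) {
  return title.replace(NUMBER_RUN, (run) =>
    ISO_DATE.test(run) ? run : run.replace(/-/g, "\u2013")
  );
}

console.log(convertNumberRanges("Best of 1999-2009"));
console.log(convertNumberRanges("Live 2021-02-06"));
```

Something like "1999-04" stays ambiguous (a partial date or a range), and this sketch treats it as a date.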

In the background, I am in the process of setting up a toolchain to automatically build the bookmarklet and also a userscript version of it (which I hopefully can release soon™). Stay tuned and do not hesitate to report issues and/or ideas for improvements in this topic or on GitHub.

jesus2099 · February 6, 2021, 11:26pm

Congratulations @kellnerd, for this awesome script!
It requires great power!

Can you check whether it handles medium titles?
I’m not sure but I think I remember it does not.

kellnerd · February 6, 2021, 11:42pm

Here you go, you can fetch the new version. It was surprisingly easy to implement…
I thought this would be more complicated because medium titles have no common class which can be selected, only numeric IDs of the type disc-title-123456, but some quick research showed me that there is an advanced CSS selector to do this.
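For readers who wonder what such a selector can look like: an attribute selector with the "begins with" operator (`^=`) matches all IDs that share a prefix. The sketch below assumes that the medium titles are `input` elements; the selector constant and the function name are made up for this example and are not necessarily what the script uses.

```javascript
// [id^="disc-title-"] matches every element whose id starts with "disc-title-".
// Restricting the selector to input elements is an assumption of this sketch.
const MEDIUM_TITLE_SELECTOR = 'input[id^="disc-title-"]';

// "root" is anything with a querySelectorAll method, e.g. the document.
function findMediumTitleInputs(root) {
  return Array.from(root.querySelectorAll(MEDIUM_TITLE_SELECTOR));
}
```

In the browser this would be called as `findMediumTitleInputs(document)`.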

kellnerd · March 2, 2021, 9:28pm

Now we know how long “soon™” is: it is approximately a month in my world. I am finally able to present to you the first version of the Guess Unicode punctuation userscript, so far only with support for the release editor. You can install it from the README page on GitHub (see first post of this topic) or directly through this install link.
It adds a “Guess punctuation” button next to the “Guess case” button at the bottom of the release editor’s tracklist tab. The missing support for recording and work edit pages and also for the experimental annotation feature will be restored over the next few days.

Zas · April 14, 2021, 9:55am

Very useful userscript, thanks.

I noticed an issue though: when pressing Guess Punctuation, it also parses the edit note box and replaces ‘-’ in URLs, breaking them.
See edit note of https://musicbrainz.org/edit/78889188 for an example.

It should definitely skip URLs.

kellnerd · April 14, 2021, 2:35pm

Thank you for reporting this inconsistency, I have now published a fix with v2021.4.14.
So far I only had code in place to preserve bracketed URLs (with the syntax [link|label]) since I rarely use plain text URLs for annotations (and edit notes were just a bonus feature since they can be treated very similarly to annotations).
Although I already support plain text HTTP(S) URLs in my new Annotation Converter bookmarklet, I had forgotten to port this feature to the punctuation script. Now it should support (read: ignore) even more protocols than just HTTP(S) as long as the URL contains ~~a protocol followed by ://~~ two slashes // (or begins with them in the case of protocol-relative URLs).
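The "two slashes" heuristic can be sketched like this. The names `URL_LIKE` and `transformOutsideUrls` as well as the sample hyphen replacement are assumptions for this example and not the code of the userscript.

```javascript
// Any run of non-whitespace characters which contains "//" is treated as a URL,
// which also covers protocol-relative URLs that begin with "//".
const URL_LIKE = /(\S*\/\/\S+)/;

// Splitting by a pattern with one capturing group yields an array in which the
// odd indices hold the captured URLs and the even indices the text in between.
function transformOutsideUrls(text, transform) {
  return text
    .split(URL_LIKE)
    .map((part, index) => (index % 2 === 1 ? part : transform(part)))
    .join("");
}

// Hypothetical transformation: ASCII hyphen-minus -> Unicode hyphen (U+2010).
const replaceHyphens = (text) => text.replace(/-/g, "\u2010");

console.log(
  transformOutsideUrls(
    "See https://example.org/a-b and //example.org/c-d for a well-known case",
    replaceHyphens
  )
);
```

A limitation of this approach is that each text segment is transformed separately, so rules which need context from both sides of a URL would not see it.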

jesus2099 · April 14, 2021, 5:54pm

They don’t have the same syntax for links.
In edit notes, you just use plain links and, as a bonus, anything matching /Edit #[0-9]+/i is transformed into a link.
Oh but yes, '''bold''' and ''italic'' are the same, indeed.

BTW, I often use //urls (no protocol) in edit notes.
Maybe you should simply skip edit notes? (and annotations?)

aerozol · May 6, 2021, 10:06pm

Does anyone have a handy overview (or a good link) of what punctuation should be used where, before I dive into Google/Wikipedia? I’ve never really looked at it beyond the basics but I think I will soon.

For now, could someone help with an edge case I’ve come across strangely often? Fictional names (e.g. alien names) like:
Sur’Kesh
Should they have a curly apostrophe?

kellnerd · May 6, 2021, 10:46pm

https://wiki.musicbrainz.org/User:Yurim/Punctuation_and_Special_Characters

https://wiki.musicbrainz.org/User:Jacobbrett/English_Punctuation_Guide

The above wiki pages have quite nice collections of the punctuation essentials, collected from Wikipedia. I had also used them (and the linked Wikipedia articles) as a reference when I defined the initial transformation rules of my userscript.

Regarding the correct usage of punctuation for fictional names (also applies to languages I’m not familiar with), I am as clueless as you are. It depends on whether the apostrophe indicates missing characters/phonemes or has a different meaning (e.g. for pronunciation), which is hard to tell here. If this specific name is used in the same way as an ordinary word in an English language context, I would probably use the curly apostrophe.