Bookmarklet/Userscript to guess Unicode punctuation of titles

kellnerd · August 8, 2021, 4:39pm

It has been a while since the last update, but I finally found the time to implement my most wanted feature: support for localized quotes based on the language selected in the release editor.
This means the result of the “Guess punctuation” button for inputs of the form "..." and '...' now depends on the release’s (tracklist) language.

So far I have only integrated the rules for German and French quotes in addition to the English quotes, which are still used as the fallback.
I have skipped other languages that I am not familiar enough with (there might be pitfalls I don’t know about), but supporting most of them should only be a matter of adding one line of code:

github.com

kellnerd/musicbrainz-bookmarklets/blob/main/src/guessUnicodePunctuation.js#L44

    
      
	[/<i>/g, "''"],
	// decode Base64 URLs
	[/(?<=\/\/)([A-Za-z0-9+/=]+)/g, (_match, path) => atob(path)], // plain text URLs
	[/\[([A-Za-z0-9+/=]+)(\|.+?)?\]/g, (_match, url, label = '') => `[${atob(url)}${label}]`], // labeled link
];

/**
 * Language-specific double and single quotes (RegEx replace values).
 * @type {Record<string,string[]>}
 */
const languageSpecificQuotes = {
	de: ['„$1“', '‚$1‘'], // German
	en: ['“$1”', '‘$1’'], // English
	fr: ['« $1 »', '‹ $1 ›'], // French
};

/**
 * Indices of the quotation rules (double and single quotes) in `transformationRules`.
 */
const quotationRuleIndices = [0, 2];

So I am happy to receive PRs on GitHub or comments in this forum to add more languages that you want to have and are familiar with.
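
As a rough illustration of how the table and the indices above could work together, here is a minimal sketch. The `transformationRules` list and the `guessPunctuation` function are hypothetical, heavily simplified stand-ins and not the userscript's actual code; only `languageSpecificQuotes` and `quotationRuleIndices` are taken from the snippet.

```javascript
// Replacement values copied from the snippet above; `$1` is the quoted text.
const languageSpecificQuotes = {
	de: ['„$1“', '‚$1‘'], // German
	en: ['“$1”', '‘$1’'], // English
	fr: ['« $1 »', '‹ $1 ›'], // French
};

// ASSUMPTION: a tiny stand-in for the real rule list (search pattern, replacement).
const transformationRules = [
	[/"(.+?)"/g, '“$1”'], // index 0: double quotes
	[/\.{3}/g, '…'], // index 1: an unrelated rule (ellipsis)
	[/(?<=\W|^)'(.+?)'(?=\W|$)/g, '‘$1’'], // index 2: single quotes
];

// Positions of the double and single quote rules in `transformationRules`.
const quotationRuleIndices = [0, 2];

/** Applies all rules, swapping in the quotes of the given language (fallback: English). */
function guessPunctuation(text, language) {
	const quotes = languageSpecificQuotes[language] ?? languageSpecificQuotes.en;
	return transformationRules.reduce((result, [search, replacement], index) => {
		const quoteIndex = quotationRuleIndices.indexOf(index);
		return result.replace(search, quoteIndex >= 0 ? quotes[quoteIndex] : replacement);
	}, text);
}

console.log(guessPunctuation('"Heroes"...', 'de')); // „Heroes“…
console.log(guessPunctuation("Live at 'Olympia'", 'fr')); // Live at ‹ Olympia ›
console.log(guessPunctuation('"Heroes"', 'xx')); // “Heroes” (unknown language, English fallback)
```

With this layout, supporting another language only needs one more entry in `languageSpecificQuotes`; the search patterns stay untouched.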

IvanDobsky · August 8, 2021, 8:40pm

I always find these kinds of images funny. As a Brit I was always taught to use “double quotes” for a quote, with the less common option of ‘single quotes’ (mainly if you are quoting inside a quote).

For the sake of your Guess Case I would have thought Double Quotes would make more sense. (i.e. England\Scotland\Wales in your image would be the same as Ireland.)

jesus2099 · August 9, 2021, 7:04am

Same for French, where we use « » (with spaces) but I don’t remember seeing any ‹ ›.
We also very often use the double quotes “” (these without spaces inside).

But @IvanDobsky, I don’t think it is a search-and-replace list, it is just a list of possible quotes.
I don’t remember the script removing a quote style to put another one instead.

InvisibleMan78 · August 9, 2021, 8:40am

What Wikipedia says about the language-specific quotation marks, including guillemets (or duck-foot quotes)

kellnerd · August 9, 2021, 11:00am

That was exactly my thought, I also rather associate double quotes with English text. But I have just checked a random selection of British books from my collection and was surprised that all of them use the single quote variant; the only ones with double quotes that I found were North American editions.
Luckily this does not matter for the userscript, since it only converts ASCII single quotes to the language-specific Unicode single quotes and ASCII double quotes to Unicode double quotes. I mainly included the map to illustrate that there are many more combinations of different types of quotation marks in other countries/languages.

Yes, I’ve rarely seen these single angle quotes but if they aren’t there in the ASCII version there will be no issue with them. And if you want to use the curly quotes instead of the guillemets you can achieve this by temporarily changing the language to English before you press the button.

I’ve also thought about using (narrow) non-breaking spaces (instead of regular ASCII spaces) to pad the guillemets, but this seems to be pointless at the moment:

IvanDobsky · August 9, 2021, 12:11pm

I was working with someone on formatting their English university thesis a couple of years back, something that gets pretty fussy about formatting. And that was still using normal “double quotes” when quoting text.

Perfect. Artist intent should still rule. I know there are Pixies out there who are determined to ignore artists. If a plugin like this got too controversial it would not get used.

Please don’t go overboard with these substitutions. This is a music database, not a perfect language class. As it is I’m now planning to finish my first plugin as I need to strip this punctuation from my tags as it makes searching with my media player tricky. (Dozens of different hyphens being the biggest headaches) I get why people want to see it on screen, but it causes havoc on my files. And the current Unicode to ASCII plugin is too brutal as I still want to keep stuff like ™. It’s the punctuation I can’t see or type that is trouble in my files.

And every time you say guillemets I just see this:

Nice job on the plugin. I can never keep up with what needs to be used where, so I only ever do the apostrophes. Will this also highlight the changes (like the Search and Replace does)? I sometimes miss errors that Guess Case adds due to the lack of highlighting.

kellnerd · August 9, 2021, 12:53pm

I fully agree, but the thing with the non-breaking spaces is more of a display issue, i.e. they will prevent ugly line breaks directly after opening quotes or directly before closing quotes if you have a text like « This quote where the line break occurs right after the opening quote ». But since MBS converts them into normal spaces, there is nothing I can do or need to do.
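
For illustration only, padding guillemets with narrow no-break spaces (U+202F) could look like the following sketch. The function name `padGuillemets` is hypothetical and not part of the userscript, and as noted above the effect would currently be lost because MBS converts these spaces back to normal ones.

```javascript
const NNBSP = '\u202F'; // NARROW NO-BREAK SPACE

/**
 * Replaces any run of regular, no-break or narrow no-break spaces
 * directly inside guillemets with a single padding character.
 */
function padGuillemets(text, space = NNBSP) {
	return text
		.replace(/«[ \u00A0\u202F]*/g, `«${space}`)
		.replace(/[ \u00A0\u202F]*»/g, `${space}»`);
}

console.log(padGuillemets('« Bonjour »') === '«\u202FBonjour\u202F»'); // true
```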

Maybe one day I will also need your Picard plugin for some of my files, but so far I haven’t found a case where the Picard standard option has failed me and where I would be digging myself a hole.

Yes, it does (and it’s indeed inspired by @jesus2099’s userscript). This is even one of the features which is listed in the description because I can’t live without it:

Highlights all updated input fields in order to allow the user to review the changes.

There are some obscure titles for which I simply don’t know the applicable rule and where I need to see which hyphens have been replaced in order to immediately revert the changes.

@aerozol has even created a ticket to integrate this feature into MBS and @Zas has brought in the idea to have highlighting at character level and the possibility to revert changes (also for Guess Case):

chaban · August 9, 2021, 9:05pm

Alternate map by Jakub Marian:

aerozol · September 22, 2021, 2:42am

Hmm, I’ve come across this case where two different releases seem to get two different dashes with the guess punctuation button (on tracks that are named the same, both English + Latin):

Any idea why?

kellnerd · September 22, 2021, 9:19am

It took me a while to notice it, because I was experimenting with the release editor first, but the solution is simple: The two releases are using two different recordings, one of them still uses hyphen-minus, the other one and the track titles already use the correct Unicode hyphen. I have not corrected it, so you are still able to see it yourself.

aerozol · September 22, 2021, 10:30pm

thaaaaaaaaaanks

vzell · January 1, 2022, 1:41pm

I tried the following conversion in an annotation field with the “Guess Unicode Punctuation” script:

Before conversion (all three lines use hyphen minus (U002D) from the standard keyboard)

Figure Dash (U2012) Used as a dash within numbers (e.g. 555-1212).
En Dash (U2013) Indicates a range of numbers (e.g. 1989-90).
Hyphen (U2010) Joins words and syllables of a word (e.g. co-operate) and used within dates (e.g. 2022-01-01 or 2021-31)

After conversion

Figure Dash (U2012) Used as a dash within numbers (e.g. 555–1212). actual result: U2013 expected result: U2012
En Dash (U2013) Indicates a range of numbers (e.g. 1989‐90). actual result: U2010 expected result: U2013
Hyphen (U2010) Joins words and syllables of a word (e.g. co‐operate) and used within dates (e.g. 2022‐01‐01 or 2021‐31) actual result: U2010

So according to User:Jacobbrett/English Punctuation Guide - MusicBrainz Wiki, the first two lines seem to be wrong after conversion. Is this case supported by the script?
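
Since these dashes are hard to tell apart by eye, a small helper can list the code points that actually ended up in a converted string. This is a minimal sketch; `listDashes` and `DASH_NAMES` are hypothetical names and not part of the userscript.

```javascript
// Dash-like characters discussed in this thread, keyed by character.
const DASH_NAMES = {
	'\u002D': 'HYPHEN-MINUS',
	'\u2010': 'HYPHEN',
	'\u2012': 'FIGURE DASH',
	'\u2013': 'EN DASH',
	'\u2014': 'EM DASH',
};

/** Returns a description (code point and name) for every dash found in the text. */
function listDashes(text) {
	return Array.from(text)
		.filter((char) => char in DASH_NAMES)
		.map((char) => {
			const hex = char.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
			return `U+${hex} ${DASH_NAMES[char]}`;
		});
}

console.log(listDashes('1989-90')); // [ 'U+002D HYPHEN-MINUS' ]
console.log(listDashes('555\u20131212')); // [ 'U+2013 EN DASH' ]
```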

kellnerd · January 1, 2022, 6:58pm

The userscript is unable to distinguish between numeric ranges (en dash) and numbers whose digits are split into (two) groups (separated by figure dash). I have given a more detailed answer on GitHub:

github.com/kellnerd/musicbrainz-bookmarklets

Wrong Unicode conversion for "dash within numbers" and "range of numbers"

opened 12:49PM - 01 Jan 22 UTC

vzell

bug punctuation

Edit: Some improvements regarding the handling of numeric ranges and grouped digits have been released in version 2022.1.1.

vzell · January 2, 2022, 11:55pm

On bootleg releases you sometimes find date expressions in the following form:

Back in the USA: Live Vol 2, 21-7-1984 Montreol
Barcelona: Teatro Tivoli 7-5-1996
Born in the U.S.A.: Live Vol 1. 21-7-1984 Montreol

Is it possible to also transform them with HYPHEN instead of FIGURE DASH?

kellnerd · January 3, 2022, 12:26pm

It is possible, but I am neither sure whether HYPHEN is correct for this non-standard date format nor whether it is worth the effort to have an additional rule for them.

[/(?<=\W|^)\d{1,2}-\d{1,2}-\d{4}(?=\W|$)/g, (potentialDMY) => {
	const potentialYMD = potentialDMY.split('-').reverse().join('-');
	if (Number.isNaN(Date.parse(potentialYMD))) return potentialDMY; // skip invalid date strings
	return potentialDMY.replaceAll('-', '‐');
}],

You would have to insert the above code after the existing rule that converts ISO 8601 dates in the transformationRules array of the userscript.

But after reading the comments on this ticket I am no longer sure whether I should touch dates at all or just keep the HYPHEN-MINUS for those…
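
For anyone who wants to try the rule outside of the userscript, here is a self-contained sketch of the same idea. One difference is an assumption on my side: `Date.parse` is only required to understand the ISO format, and how it treats a string like `1984-7-21` depends on the JavaScript engine, so this version validates the date by building it with `Date.UTC` and checking that it round-trips. The names `isValidDate` and `hyphenateDMYDates` are hypothetical.

```javascript
const HYPHEN = '\u2010';

/** Checks that year/month/day form a real calendar date (no overflow like 31 February). */
function isValidDate(year, month, day) {
	const date = new Date(Date.UTC(year, month - 1, day));
	return date.getUTCFullYear() === year
		&& date.getUTCMonth() === month - 1
		&& date.getUTCDate() === day;
}

/** Replaces hyphen-minus with HYPHEN in D-M-YYYY dates, leaves invalid dates untouched. */
function hyphenateDMYDates(text) {
	return text.replace(
		/(?<=\W|^)(\d{1,2})-(\d{1,2})-(\d{4})(?=\W|$)/g,
		(match, day, month, year) => isValidDate(Number(year), Number(month), Number(day))
			? match.replaceAll('-', HYPHEN)
			: match,
	);
}

console.log(hyphenateDMYDates('Barcelona: Teatro Tivoli 7-5-1996')); // Barcelona: Teatro Tivoli 7‐5‐1996
console.log(hyphenateDMYDates('Not a date: 31-2-1984')); // unchanged
```

Years below 0100 are treated as invalid here because `Date.UTC` maps two-digit years to the 1900s, which should not matter for release titles.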

IvanDobsky · January 3, 2022, 1:59pm

I’d vote to keep dates as ISO standard, as that is the layout ISO is using, it is what the website is using, and what external apps will expect. I understand wanting to change apostrophes, but dates have their own standards.

But for me it is fairly irrelevant as I strip these all out with my own custom plugin.

vzell · January 3, 2022, 11:32pm

Too late for my use case … I just switched all Bruce Springsteen releases and release-group titles which use dates to HYPHEN.

aerozol · February 15, 2023, 2:11am

11 posts were split to a new topic: Typographical hyphen not displaying in browser

HibiscusKazeneko · August 23, 2023, 9:55pm

@kellnerd Would it be possible to configure the Guess Unicode Punctuation script to use 〜 (wave dash) to replace ～ (fullwidth tilde)?

kellnerd · August 24, 2023, 10:46am

It is very easy to add new transformation rules to the script, but I’ve decided to keep the scope limited to the most important ASCII to Unicode replacements.
Although it would be nice to also replace inappropriately used Unicode characters with the correct ones, that would potentially lead to many rarely used rules which have to run for every single input field.

I’ve also declined a similar previous request for the same reason and suggested a few possible alternatives there. Admittedly I still haven’t written the proposed “customizable search and replace rulesets” userscript and don’t have the time to do that soon.

Alternatively you could manually add a new rule to your locally installed script at the end of the punctuation rules array (or the language-specific rules):

		/* custom rules, manually added */
		[/～/g, '〜'], // fullwidth tilde -> wave dash

Beware that local changes will be lost with the next update, but I don’t have any new features planned currently and the code is relatively stable.
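
To try such a custom rule before touching the installed script, a small standalone sketch like the following can help. The `applyRules` helper and the `customRules` name are hypothetical; the rule itself follows the direction of the request above (fullwidth tilde U+FF5E to wave dash U+301C), written with escapes because the two characters look almost identical.

```javascript
// Each rule is a pair of search pattern and replacement, like in the userscript's arrays.
const customRules = [
	[/\uFF5E/g, '\u301C'], // fullwidth tilde -> wave dash
];

/** Applies the rules to the text in order, feeding each result into the next rule. */
function applyRules(text, rules) {
	return rules.reduce(
		(result, [search, replacement]) => result.replace(search, replacement),
		text,
	);
}

const fixed = applyRules('Love \uFF5E Destiny \uFF5E', customRules);
console.log(fixed === 'Love \u301C Destiny \u301C'); // true
```

Every rule in the list runs once per input, which is why a long list of rarely needed rules adds up across all input fields of the editor.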