Accented vowels are deaccented in tags

tylla · December 28, 2017, 9:48pm

I’m trying to tag a release which has accented vowels in the title and artist fields, but the downloaded information is all de-accented. The release I am trying to tag is Ákos: Hűség. The database contains the proper accented characters, but the only accented character I see in Picard is the ű in the release title and in the first track; all the other vowels are plain ASCII ones.
Am I doing something wrong? Should I include my picard.ini?

Edit: I’m using Picard 1.4.2 on Debian Jessie 8.9 and Debian Stretch 9.3 to tag MP3 files.

mmirG · December 29, 2017, 12:11am

Hi. There may be some other explanation but I think you might be looking at what is discussed in the thread below

tylla · December 29, 2017, 11:57am

Thanks for your reply, but I don’t think this is the same issue (meanwhile I have clarified that I am using only MP3s).
If I understand correctly the topic you mentioned is about differing tag names between different file formats. But I’m using a single file format and the problem is that despite what I am seeing on the web page, Picard downloads tags containing almost only ASCII characters.
E.g. the track “Mindenki táncol” appears in Picard as “Mindenki tancol”. (These are Hungarian songs, and the Hungarian language has lots of accented vowels.)
It seems some kind of conversion is happening during the download of data.
I have now double-checked, and the same happens with other Hungarian albums as well, e.g. Félre az útból or Flúgos futam.
Only “ő” and “ű” come through unaffected in the download; all the others (éáíúóüö) are de-accented.
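For anyone wanting to pin down exactly which characters get altered, a small comparison like the one below may help. It is only a sketch: `changed_chars` is a made-up helper name, the "received" strings are typed in by hand from what Picard displays (the second one is reconstructed from the description above), and it assumes both strings have the same length.

```python
import unicodedata


def changed_chars(expected, received):
    """Return (expected_char, received_char) pairs that differ, position by position.

    Both strings are normalized to NFC first so that a precomposed "á" and an
    "a" followed by a combining accent are not reported as different lengths.
    """
    expected = unicodedata.normalize("NFC", expected)
    received = unicodedata.normalize("NFC", received)
    return [(e, r) for e, r in zip(expected, received) if e != r]


# Title as shown on the website vs. title as shown in Picard
print(changed_chars("Mindenki táncol", "Mindenki tancol"))  # [('á', 'a')]
print(changed_chars("Hűség", "Hűseg"))                      # [('é', 'e')]
```

Running this over a few titles gives a quick list of which vowels are affected and which are not.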

There are no errors in the console.

mmirG · December 30, 2017, 9:55am

You are right.
Something else seems to be going on.

Unless someone else can come up with an explanation or solution you could, if you are feeling very IT capable, make a ticket - so treat this as a bug that needs reporting.

https://tickets.metabrainz.org/

If you do make a ticket please post the details on this thread.
I’m not sure what hoops you’ll have to jump through to sign up for JIRA currently.
Just don’t try signing up when you are in a rush, and do ask for help back here as soon as things look too obscure.

(If you don’t make a ticket I might eventually have a day where I want to try out my hand at doing so.)

tylla · January 1, 2018, 8:53pm

Thanks for the kind directions.
I have no problem with filing tickets, I do it frequently. And it happens I already have a JIRA account so I made a ticket #PICARD-1168:

I’m trying to tag a release which has accented vowels in the title and the artist fields but the downloaded information is containing almost only standard ASCII characters.
The release I am trying to tag is: Ákos: Hűség, the database contains the proper accented characters, but the only accented character I see in Picard is the “ű” in the release title and the first track, all the other vowels are simple ASCII ones.
Eg: the track “Mindenki táncol” appears in Picard as “Mindenki tancol”. (these are Hungarian songs and the Hungarian language has lots of accented vowels)
It seems some kind of conversion is happening during the download of data.
Now I have double checked and the same happens with other Hungarian albums as well: eg: Félre az útból or Flúgos futam
Only “ő” and “ű” come through unaffected in the download; all the others (éáíúóüö) are de-accented.
The files I am tagging are all MP3s, and I have checked that there are no errors in the console.
Am I missing something or doing something wrong? Is this a bug? Should I include my picard.ini?

We’ll see…

IvanDobsky · January 3, 2018, 1:08pm

Funnily enough, I am also having similar problems from a different angle.

Part of the problem here is also the fancy Unicode apostrophes and hyphens some of the editors on here like to use. They swap a standard ASCII apostrophe, as found on the keyboard, for a prettified curly version, as found on the printed page in a book. Because these can cause confusion in a file system, there is a mechanism to swap them for “standard” characters.

Someone then expanded that substitution within MB to cover ALL accented characters. I believe this was done for cheapo media players that had to have all-ASCII text. (Also add in US-based programmers, who will have less experience of Hungarian, Turkish, etc.)

Go and have a look at the options in Picard. Under METADATA there is an option to “Convert Unicode Punctuation characters to ASCII”. And then there are plugins that will do even more character swaps (Non-ASCII Equivalents is an example).

When I avoid the plugins and just tick the Convert Unicode Punctuation characters to ASCII option it looks like I am getting just the punctuation swapped out and the text entered correctly.
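To illustrate what "just the punctuation swapped out" means, here is a minimal sketch of a punctuation-only conversion. It is not Picard's actual code: the table and the name `ascii_punctuation` are assumptions made up for this example, and the real option may well cover a different set of characters.

```python
# Hypothetical table: only typographic punctuation is mapped, no letters.
PUNCTUATION = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / curly apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2010": "-",    # Unicode hyphen
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
})


def ascii_punctuation(text):
    """Replace typographic punctuation with ASCII, leaving letters untouched."""
    return text.translate(PUNCTUATION)


print(ascii_punctuation("Don\u2019t Stop"))   # Don't Stop
print(ascii_punctuation("Mindenki táncol"))   # Mindenki táncol
```

Because no letters appear in the table, accented vowels pass through unchanged, which matches what I see with only that option ticked.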

I am no expert on this. Just a noob to Picard and its oddities. My ripping with EAC using MB data has led to some odd filenames. So I am only just digging deeper to untangle this.

In a different discussion on this, I was handed a link which points into the Picard source and shows us what is being swapped out.

https://github.com/metabrainz/picard/blob/66ed358093ff4ae1c80b9fd33019390ae08b11cd/picard/util/textencoding.py

# -*- coding: utf-8 -*-
#
# Picard, the next-generation MusicBrainz tagger
# Copyright (C) 2004 Robert Kaye
# Copyright (C) 2006 Lukáš Lalinský
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.


Even if you don’t know Python, it is possible to look down that list and see what is going on. IMHO this is far too wide: it is too much of an all-or-nothing swap.

Sorry for the waffly post. I need to sit down and look carefully at which option does what, as I have been getting confused on this.

(The other discussion: Unicode apostrophe standardization)

tylla · January 5, 2018, 11:53pm

@IvanDobsky thanks for your thoughts. They really helped me, because as I read your reply I remembered that I have some plugins enabled, and bang, there it is: the “Non-ASCII Equivalents” plugin, enabled.
The description of the plugin mentions the “Replace non-ASCII characters” option of Picard, and this led me to think that the plugin would do the same thing, only better. But the plugin actually works on a very different level and causes exactly the effect my issue was about. (While “Replace non-ASCII characters” only changes filenames when renaming the tagged files, “Non-ASCII Equivalents” alters every downloaded tag to replace the non-ASCII characters in it.)
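A rough sketch of how a replacement applied to every tag could produce what I saw. This is not the plugin's code: `TABLE`, `table_replace` and `strip_marks` are hypothetical, and the table contents are simply assumed from the characters listed in my earlier posts. The point is that a lookup table leaves any character it does not list (such as “ő” and “ű”) untouched, whereas a generic approach based on Unicode decomposition would strip those as well.

```python
import unicodedata

# Hypothetical lookup table that happens to omit ő and ű (and Ő, Ű).
TABLE = str.maketrans("áéíóúöüÁÉÍÓÚÖÜ", "aeiououAEIOUOU")


def table_replace(text):
    """Replace only the characters listed in TABLE; everything else passes through."""
    return text.translate(TABLE)


def strip_marks(text):
    """For contrast: decompose to NFKD and drop all combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Applied to every downloaded tag, not just to the filename:
tags = {"artist": "Ákos", "album": "Hűség", "title": "Mindenki táncol"}
rewritten = {name: table_replace(value) for name, value in tags.items()}
print(rewritten)  # {'artist': 'Akos', 'album': 'Hűseg', 'title': 'Mindenki tancol'}

print(strip_marks("Hűség"))  # Huseg
```

With the table-based version the “ű” survives while the “é” does not, which is the pattern from my original report.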
Disabled this plugin and everything returned to normal.

I already closed the ticket and will file an issue against “Non-ASCII Equivalents” to change the description to emphasize the different working level of the plugin.

Thanks again guys for the help, it’s a really helpful community with nice people.

IvanDobsky · January 6, 2018, 2:42pm

@tylla glad my random rant helped you. It is all a bit too confusing really. The help files are pretty hopeless at explaining these options and add-ons, as there doesn’t seem to be anywhere this stuff is actually specified. And we are pointed to style guides which are then not really followed officially. The lack of a clean spec or descriptive help file then leads to this kind of confusion, as too many people assume the settings mean different things.