Diacritics at the beginning of a word not showing

silviaegt · May 31, 2018, 4:10pm

Hello everyone,

We’re working on an Omeka site with all the books that were announced in the oldest periodical in Latin America, and we are super happy users. We have one problem though: words with a diacritic on the first character are not showing properly, for instance:

Ágreda --> we get: greda

Órdenes --> we get: rdenes

Do you have any idea why this is happening and what we could do to solve it?

benbakelaar · May 31, 2018, 4:31pm

To me it appears that the “tag” of “Órdenes” is not loaded correctly? You appear to have utf-8 encoding set up, so diacritics should not be an issue.

If I search the database from the search form for “rdenes”, 2 records come up.

If I search the database from the search form for “ordenes”, 18 records come up. Of those, some contain “ordenes”, others “órdenes”, and some “Ordenes” as well.

In your link, you have a tag search:
http://sandbox.colmex.mx/~silvia/omeka25/items/browse?tags=rdenes

There are no results for variations:
http://sandbox.colmex.mx/~silvia/omeka25/items/browse?tags=órdenes
http://sandbox.colmex.mx/~silvia/omeka25/items/browse?tags=ordenes
http://sandbox.colmex.mx/~silvia/omeka25/items/browse?tags=Órdenes

Inside the two records that show up for tag search=rdenes, you have these tags:
Etiquetas
Arquitectura; rdenes; Tratados, manuales, etc.

But in other fields, there are instances of ordenes and órdenes. So I guess that if you imported the tags, maybe something happened between Excel (or equivalent) > CSV > CSVImport? I think you might be able to manually correct them, if not in the UI then in the SQL database itself.

silviaegt · May 31, 2018, 6:52pm

Thanks for your quick reply and willingness to help!

We don’t have a problem with diacritics in general, only with fields whose first letter is a capital with a diacritic. I didn’t mean to show a search for “rdenes” or “greda”, but to point out that these fields appear as “Órdenes” and “Ágreda” in our UTF-8 CSV and that the first character (Ó and Á) disappears in Omeka.

I tried changing this field manually

But it didn’t help: http://sandbox.colmex.mx/~silvia/omeka25/items/show/9265

All best,

S.

silviaegt · May 31, 2018, 7:13pm

Oddly enough, I was able to manually change the names of some Authors that start with “Á”:

http://sandbox.colmex.mx/~silvia/omeka25/items/browse?advanced[0][element_id]=39&advanced[0][type]=is+exactly&advanced[0][terms]=Ágreda%2C+María+de+Jesús+de

benbakelaar · May 31, 2018, 7:19pm

OK I see.

I just tested some capital letters with diacritics and it was not a problem.

So I think the issue is in the plugin and upload process. It could be that when the CSV file is uploaded to the web server, the encoding is changed at that point (before the plugin even digests the file). Or it could be that, because of the web server and PHP settings, the mark is dropped for some reason while the plugin is digesting and processing the data from the CSV.
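One way to rule out the first possibility is to inspect the CSV as it sits on the server after upload. The sketch below is only illustrative: `sample.csv` and its contents are hypothetical stand-ins for the real export, and it assumes `iconv` and `od` are available on the machine.

```shell
# Hypothetical sample standing in for the uploaded CSV.
# \303\223 and \303\201 are the UTF-8 bytes of precomposed Ó and Á.
printf '\303\223rdenes,Arquitectura\n\303\201greda,Autor\n' > sample.csv

# iconv exits non-zero if it meets a byte sequence that is not valid UTF-8.
if iconv -f UTF-8 -t UTF-8 sample.csv > /dev/null 2>&1; then
  echo "sample.csv is valid UTF-8"
else
  echo "sample.csv is NOT valid UTF-8 (or iconv is unavailable)"
fi

# Dump the first two bytes; an intact leading Ó shows up as c3 93.
first_bytes=$(od -An -tx1 -N 2 sample.csv | tr -d ' \n')
echo "first bytes: $first_bytes"
```

If the file on the server still starts with `c3 93`, the upload did not alter the encoding and the character is more likely lost later, while the file is being parsed.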

I suppose you are already sure - but have you checked to make sure that, for example, Microsoft Word (or Excel) has not modified the character to a “fancy” character? This even happens on my iMac in Notes when I do single quotes. Typically it is with quotes and punctuation, not letters - but just another thought.
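One variant of the “fancy character” idea can be checked at the byte level: an editor may store Ó as a plain O followed by a combining accent rather than as a single precomposed character. A minimal sketch, using a hypothetical two-line file `forms.csv` rather than the real data:

```shell
# Hypothetical sample: line 1 uses precomposed Ó (bytes c3 93),
# line 2 uses O followed by a combining acute accent (bytes cc 81).
printf '\303\223rdenes\nO\314\201rdenes\n' > forms.csv

# The two UTF-8 bytes of U+0301 COMBINING ACUTE ACCENT.
combining=$(printf '\314\201')

# LC_ALL=C makes grep compare raw bytes instead of decoded characters.
decomposed_lines=$(LC_ALL=C grep -c "$combining" forms.csv)
echo "lines with a combining acute accent: $decomposed_lines"
```

A count of zero on the real CSV would suggest the accented capitals are stored in precomposed form, so the editor is probably not the culprit.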

Maybe attach your CSV file?

silviaegt · June 15, 2018, 11:52pm

Ben!
Thanks again for your help and sorry for the late reply.
For some reason I can’t post a link, but at bitly dot com slash 2Mwp7b5 you will find a file with part of our database, namely the fields that start with accented capital letters (that’s where we’re having problems).
I just discovered that, for some reason, there is only one example that is recognised properly (Ética cristiana), but all the other instances of É are not uploaded properly.
[Like in the indices of our system
Ética cristiana (11)
vs
tica (10)
tica legal
tica política (2)
– which should be Ética, Ética legal, Ética política]

Daniel_KM · June 16, 2018, 5:07am

See https://gist.github.com/Daniel-KM/9754f18f9632423fb1a08909e9f01c04 for an explanation of the issue.

Daniel_KM · June 16, 2018, 5:18am

The link above is a test file. I copy the explanation here:

Two solutions are possible, and both require editing the file /etc/apache2/envvars, which contains:

## The locale used by some modules like mod_dav
export LANG=C
## Uncomment the following line to use the system default locale instead:
#. /etc/default/locale

The first solution is to uncomment the line as indicated and to add a generic value for numbers to avoid other issues:

. /etc/default/locale
export LC_NUMERIC=C

The second solution is more generic: don’t uncomment the line, but replace the value “C” in “export LANG=C” with “C.UTF-8”:

#export LANG=C
export LANG="C.UTF-8"

Don’t forget to restart the server between two tests:

sudo systemctl restart apache2

In fact, the default locale of Apache is “C” for historical and geographic reasons (USA based), so it should be changed to any UTF-8 compliant locale, for example the default locale of Debian, “en_US.UTF-8”. Apache does not apply the system locale by default, so this has to be fixed by hand.

Ideally, the default locale of Apache should be the generic “C.UTF-8”, but it is not possible, because American people wouldn’t understand why they would lose their “en_US.UTF-8”.
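To see what the locale change means for a process started by Apache, one can compare the character set each locale reports. This is a sketch and not part of the original gist; it assumes the `locale` utility is present, and the exact names printed vary between systems. The link to the import is also an assumption: PHP's `fgetcsv` is documented as taking the locale into account, but this thread does not confirm which function the plugin uses.

```shell
# Character set a process sees under Apache's default "C" locale.
c_charmap=$(LC_ALL=C locale charmap)
echo "C locale charmap: $c_charmap"

# Check that C.UTF-8 is installed before relying on it in envvars.
if locale -a 2>/dev/null | grep -qiE '^C\.utf-?8$'; then
  utf8_charmap=$(LC_ALL=C.UTF-8 locale charmap)
  echo "C.UTF-8 charmap: $utf8_charmap"
else
  echo "C.UTF-8 is not installed on this system"
fi
```

On glibc-based systems the “C” locale typically reports an ASCII-only character set, which would explain why multibyte characters such as Ó and Á are not handled as letters until a UTF-8 locale is set.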