Wednesday, May 30, 2012

Visually Identical, different encoding

And we’re back...

Fittingly, I am going to restart the blog on a bit of an old chestnut. I’m considering throwing in the towel on those Unicode characters that look identical (or are functionally identical in the minds of typists).

A bit of history. When I first started making keyboard layouts for indigenous languages, I quickly found that I needed to make a decision on apostrophes. Lots of languages use them for a whole variety of sounds: glottal stop, aspiration, ejectives, reduced vowels, etymological deletions, and so on. The first thing I did was just use U+0027 '. This started long before I was using Unicode, so in a standard ASCII font, U+0027 made sense.

Once I began to look at Unicode, I wanted to follow the Unicode directive and use U+02BC ʼ MODIFIER LETTER APOSTROPHE, from the Spacing Modifier Letters block, because according to the Standard I’m supposed to. Its annotations there read:
  • "glottal stop, glottalization, ejective"
  • "many languages use this as a letter of one of their alphabets"
  • "2019 is the preferred character for a punctuation apostrophe"
Almost immediately, I was getting emails from North American Native-language speakers that their word processor spell checker wasn’t working anymore. The first e-mail was from a Tohono ’O’odham speaker so we’ll go with that. Thing is, virtually all Native speakers in Canada and the USA are bilingual in English or French, and she was using the Native-language keyboard for typing in English – not a problem because the keyboards accommodate English. And, although there is a functional difference between the Tohono ’O’odham apostrophe and the English apostrophe (actually there isn’t, but that’s for another blog), there is no visual difference. Therefore, in the mind of the typist, an English apostrophe and a Native-language apostrophe are the same thing and are on the same key.


So when she was typing in English using the Tohono ’O’odham keyboard, words like “wasnʼt” were showing up as incorrect, because the apostrophe was typed as U+02BC. How on Earth is a typist going to know that a Tohono ’O’odham apostrophe is one entity while an English apostrophe is a different entity when they look exactly the same? They’re not going to know, so forcing the distinction is unfair. This is especially true in languages like Anishinaabemowin or Blackfoot, which have writing systems with no letters outside English except for an apostrophe. So the Blackfoot typist can’t use the US English keyboard layout because they need a special apostrophe, which, by the way, looks exactly like the English one? U+02BC doesn’t make sense and it’s out of my life.
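To make the spell-checker failure concrete, here is a minimal sketch in Python. The word list and the `is_known` helper are hypothetical stand-ins for illustration. A real spell checker is far more elaborate, but the core issue is the same: lookups compare code points, not the shapes on screen.

```python
# Hypothetical mini dictionary; its entries use U+0027, as English word lists typically do.
ENGLISH_WORDS = {"wasn't", "can't", "don't"}

typed_on_us_layout = "wasn\u0027t"      # apostrophe typed as U+0027
typed_on_native_layout = "wasn\u02bct"  # apostrophe typed as U+02BC


def is_known(word: str) -> bool:
    """Exact-match lookup: the comparison is by code point, not by glyph."""
    return word in ENGLISH_WORDS


print(is_known(typed_on_us_layout))      # True
print(is_known(typed_on_native_layout))  # False
```

The two strings render identically in most fonts, yet only one of them is found.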

All right then, I went through all the keyboards I had already made public, and redid the lot of them (well over 100). I changed them all to U+2019. This way I got the curly apostrophe I wanted, but no hassles with two different apostrophes underlyingly.

But this has had its own problems, primarily with line breaking. Because U+2019 is "punctuation", software treats it as a place where a line can break. Let’s look at a couple of examples:
  • Blackfoot: omatsini’katsitsai
  • Cheyenne: mostanotseevavo’hoveotse̊hevohe
Whenever a big long word like this is close to the end of a line of text, it should break with a hyphen somewhere. However, if the word has U+2019 as the apostrophe, the line break will always occur on the apostrophe.

  • omatsini’
    katsitsai
  • mostanotseevavo’
    hoveotse̊hevohe

The software won’t automatically insert a hyphen because, as far as it is concerned, it isn’t breaking inside a word: it treats the apostrophe as a word boundary. A Mohawk word like kotitakhehnóntie’s is going to break between the apostrophe and the final s, which is very wrong.
  • kotitakhehnóntie’
    s
Furthermore, because U+2019 is a word boundary, neither the double-click text highlight nor the Command/Ctrl Right Arrow word skip works properly.
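For reference, the three candidate characters can be inspected with Python's standard `unicodedata` module. This is only a sketch: the `describe` helper is a name made up for this example, and the general category shown is just one of the properties software may consult. Actual line-breaking and word-selection behaviour depends on each application's own rules.

```python
import unicodedata

# The three apostrophe candidates discussed in this post.
APOSTROPHE_CANDIDATES = ["\u0027", "\u02bc", "\u2019"]


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in APOSTROPHE_CANDIDATES:
    print(*describe(ch))
# U+0027 APOSTROPHE Po
# U+02BC MODIFIER LETTER APOSTROPHE Lm
# U+2019 RIGHT SINGLE QUOTATION MARK Pf
```

Only U+02BC is classed as a letter (`Lm`). The other two are punctuation (`Po` and `Pf`).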

What to do then? I’m going back to plain ol’ U+0027 APOSTROPHE. Yes it is vague as to what shape it ought to be. Yes it can be mutated into a curly apostrophe ‘ or ’ if smart quotes are turned on in your word processor. Yes the standard says that “2019 ’ is preferred for apostrophe.”

But in the end, U+0027 works. Software recognizes it for what it is, an apostrophe. Speakers recognize it for what it is, an apostrophe (English, French, and Native-language speakers alike). Drawbacks?
  1. With smart quotes turned on, any word-initial apostrophes (like ’O’odham) are going to be 6-curled: ‘O’odham. Ugly and wrong. But I see this all the time in English too, like “Hits from the ‘60’s”. Maybe one day we’ll have smart smart quotes.
  2. No apostrophes in website domain names. Sure, this is no fun, but languages have other punctuation-like letters, such as : for length, which can’t even go in file names. We manage without apostrophes in domain names in English just fine.
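The first drawback can be reproduced with a deliberately naive smart-quote rule. This is an assumption for illustration only, not the algorithm of any particular word processor: an apostrophe at the start of the text or after whitespace is taken to open a quotation, and every other apostrophe is taken to close one.

```python
def naive_smart_apostrophes(text: str) -> str:
    """Curl every U+0027 using a simplistic 'after whitespace means opening' rule."""
    out = []
    for i, ch in enumerate(text):
        if ch != "'":
            out.append(ch)
        elif i == 0 or text[i - 1].isspace():
            out.append("\u2018")  # 6-shaped: assumed to open a quotation
        else:
            out.append("\u2019")  # 9-shaped: apostrophe or closing quote
    return "".join(out)


print(naive_smart_apostrophes("Tohono 'O'odham"))  # Tohono ‘O’odham
print(naive_smart_apostrophes("wasn't"))           # wasn’t
```

Word-internal apostrophes come out right, but the word-initial one gets the 6-shaped curl. That is the mistake a "smart smart quotes" feature would have to avoid.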
My loyalty lies with the Native-language typist before IT. Usually there is no conflict between the two groups once the facts are on the table. I feel, however, that with apostrophes this may never be the case. So one of these days I will go back into all the keyboard layouts, ditch U+2019, and switch back to U+0027. If English ever starts using U+02BC for its apostrophe, which according to the standard it probably should, then I’ll revert the lot back to U+02BC. I’m done worrying about it. U+0027 works, and it is what English, French, Italian, and others use, so it’ll continue to work.
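For text that has already been typed with one of the other apostrophes, the same switch can be applied after the fact. Here is a minimal sketch, assuming every U+02BC and U+2019 in the text really is an apostrophe. The function name is made up for this example. A blanket replacement like this would also flatten genuine closing single quotation marks, so it is only safe on text where that assumption holds.

```python
# Map both look-alike apostrophes to plain U+0027.
TO_ASCII_APOSTROPHE = str.maketrans({"\u02bc": "'", "\u2019": "'"})


def normalize_apostrophes(text: str) -> str:
    """Replace U+02BC and U+2019 with U+0027; all other characters are untouched."""
    return text.translate(TO_ASCII_APOSTROPHE)


print(normalize_apostrophes("omatsini\u2019katsitsai"))  # omatsini'katsitsai
print(normalize_apostrophes("wasn\u02bct"))              # wasn't
```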

1 comment:

  1. I had this problem after I transcribed several hours of text, about half of which was done using the Coast Tsimshian keyboard. The problem came when the program I was using for coding turned all the punctuation into little clusters of random symbols in the output! It made for a lot of search and replace fun.
