Thursday, May 31, 2012

Canadian Aboriginal Syllabics Decombined

One look at the Unicode range Unified Canadian Aboriginal Syllabics (1400–167F) shows that there are a lot of precomposed characters. For example, the syllabic for /ta/ ᑕ has several extensions: ᑖ, ᑡ, ᑢ, ᑣ, ᑤ
  • The dot-on-top diacritic signifies that the vowel is lengthened or more peripheral in some way: ᑖ is /tā/. 
  • A mid-dot diacritic to the left or right indicates that there is a /w/ glide in the onset of the syllable: ᑢ is /twa/. Whether the dot appears on the left or right of the base character depends on dialect/ortholect, so in most of Ontario and points east ᑡ is the accepted form for /twa/.
  • Combine the two diacritics together and you get ᑤ or ᑣ ~ /twā/.

The guideline is that one should never use the character 1427 (CANADIAN SYLLABICS FINAL MIDDLE DOT) for the mid-dot diacritic, nor should one ever use a combining diacritic like 0307 (COMBINING DOT ABOVE) for the dot-on-top diacritic. Consequently, ᑖ 1456 (CANADIAN SYLLABICS TAA) and ᑕ̇ 1455 0307 (CANADIAN SYLLABICS TA + COMBINING DOT ABOVE) are not equivalent.
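The non-equivalence can be checked against the Unicode character database. Here is a minimal Python sketch using the standard unicodedata module; the variable names are only for illustration. It shows that normalization does not connect the two spellings.

```python
import unicodedata

TAA = "\u1456"           # ᑖ CANADIAN SYLLABICS TAA (precomposed)
TA_DOT = "\u1455\u0307"  # ᑕ CANADIAN SYLLABICS TA + COMBINING DOT ABOVE

# The precomposed syllabic carries no canonical decomposition...
has_decomposition = unicodedata.decomposition(TAA) != ""

# ...so neither NFC nor NFD turns one spelling into the other.
nfc_equal = unicodedata.normalize("NFC", TA_DOT) == TAA
nfd_equal = unicodedata.normalize("NFD", TAA) == TA_DOT

print(has_decomposition, nfc_equal, nfd_equal)
```

All three values come out False, so a search or spell checker that relies on normalization alone will treat the two spellings as different text.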

Ever since the addition of UCAS Extended, one can represent all the extended syllabic characters as single Unicode precomposed characters; at least one can for Cree, Ojibwe, Naskapi and Inuktitut. However, if base-syllabics with diacritics must be precomposed into a single character, then Blackfoot and Dene cannot be written following Unicode guidelines.

Before we carry on, there is an important principle to posit.

One symbol - one encoding Principle

As I discussed in the previous post, I believe that it is fundamentally a bad move to have one visual/graphical symbol with two different encodings and expect a typist to know the difference and be diligent about differentiating the two. For example, one could not expect an English typist to know that the apostrophe in won’t and the ‘single closing quotation mark’ are two different characters. Functionally they are quite different:
  • the apostrophe in won’t is lexical, it is built into the word and is essentially a letter.
  • the ‘closing single quotation mark’ is punctuation, it is not in the dictionary, it is not part of the word.
English typists are not expected to encode the lexical-apostrophe and the punctuation-quotation-mark differently – they’re visually identical, thus they should be encoded identically.

Why have 1427 (CANADIAN SYLLABICS FINAL MIDDLE DOT) anyway?

Back to syllabics and the mid-dot diacritic.

If we can’t use 1427 as a sort of diacritic, what is the point (pardon the pun) of this character anyway? The only source that I know of that would explain the inclusion of 1427 is the 1866 Recueil de prières, catéchisme, et cantiques à l’usage des Sauvages de la Baie d’Hudson (Moose Cree dialect). The writer follows the normal eastern practice of employing a left-side mid dot for /w/ onsets – on the chart on page 2, ᐺ is transcribed pwe. Interestingly for us, in the following row, ᐯᐧ is transcribed peu. Whereas most if not all other writers use a ring-final ᐤ (1424 CANADIAN SYLLABICS FINAL RING) to indicate a /w/ at the end of a syllable (in the coda), this version of the Recueil de prières uses a mid-dot.

This raises an encoding problem. As an example, let’s look on page 6 (dot-on-top marking is absent in the work), specifically the word ᓂ ᑕᐺᑕᐗᐧ /ni-tāpwētawāw/ I believe him. In other works, this would be written with a raised-ring final ᓂ ᑕᐺᑕᐗᐤ, but here final-w is written with a mid-dot.

The problem is, how can one know the difference between the mid-dots in ᓂ ᑕᐺᑕᐗᐧ? The first and second mid-dots indicate the w-onset: ᐺ /pwē/, ᐗ /wā/ – but the last is a final-w. According to Unicode guidelines, the first two mid-dots are part of precomposed characters: ᐺ (143A) and ᐗ (1417). The final mid-dot would be ᐧ (1427). Here we have two different encodings of the same graphical symbol, which goes squarely against my policy as outlined in the previous post on apostrophes. One cannot expect typists to know that two otherwise identical symbols need two different encodings; it’s an unfair expectation.

So what to do for this variant of Moose Cree? The only thing to do is abandon the precomposed w-onset characters altogether, and use 1427 in all cases. Thus in the word ᓂ ᑕᐧᐯᑕᐧᐊᐧ
  • the syllable /pwē/ ᐧᐯ should be the sequence 1427 142F
  • the syllable /wāw/ ᐧᐊᐧ should be 1427 140A 1427.
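The re-encoding of the example word can be sketched in Python. This is only an illustration: the table W_ONSET_BASE and the function decompose_w_onsets are hypothetical names, and the table lists just the two precomposed characters that occur in this one word, not the whole w-onset series.

```python
MID_DOT = "\u1427"  # ᐧ CANADIAN SYLLABICS FINAL MIDDLE DOT

# Hypothetical, partial table: precomposed w-onset syllabic -> plain base.
# Only the two characters from the example word are listed.
W_ONSET_BASE = {
    "\u143A": "\u142F",  # ᐺ PWE -> ᐯ PE
    "\u1417": "\u140A",  # ᐗ WA  -> ᐊ A
}

def decompose_w_onsets(text: str) -> str:
    """Rewrite precomposed w-onset syllabics as mid-dot + base (dot on the left)."""
    return "".join(
        MID_DOT + W_ONSET_BASE[ch] if ch in W_ONSET_BASE else ch
        for ch in text
    )

word = "\u14C2 \u1455\u143A\u1455\u1417\u1427"  # ᓂ ᑕᐺᑕᐗᐧ as Unicode guidelines would have it
decomposed = decompose_w_onsets(word)

# Print the code points so the three identical 1427 mid-dots are visible.
print(" ".join(f"{ord(ch):04X}" for ch in decomposed))
```

After the rewrite, all three mid-dots in the word are the same character, 1427, whether they mark a w-onset or a final-w.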

Blackfoot mid-dot

All right, so the above section was just about an obscure, historical orthographic variant of Moose Cree. Let’s move on to Blackfoot, where the mid-dot indicates an onset /s/: plain ᒣ is /ta/, ᒭ is /tsa/. Not a problem, one can use the precomposed ᒭ 14AD (CANADIAN SYLLABICS WEST-CREE MWE) to represent /tsa/. But Blackfoot can use the s-onset mid-dot after other syllabics base letters, most often the k-series: ᖿᐧ /ksa/. These Blackfoot k-series syllabics have no Unicode precomposed dotted characters.

Barring any additions to Unicode to include these characters (which I’m not inclined to propose), the simple solution is to just use 1427 for all instances of s-onset in Blackfoot, no precomposed characters anywhere.
  • ᖿᐧ is 15BF 1427
  • ᒣᐧ is 14A3 1427 (not the precomposed 14AD)

Dene mid-dot

Dene has the same issue for w-onset:
  • ᐘ /wa/ : could use precomposed 1418
  • ᑿ /gwa/ : could use precomposed 147F
  • ᗃᐧ /ghwa/ : there is no precomposed character, must use 15C3 1427
  • ᒈᐧ /k’wa/ : no precomposed character must use 1488 1427 (see later though for Dene ejectives)
Again, more additions to Unicode could be proposed to accommodate the unsupported characters, but like Blackfoot and old Moose Cree, it’s more efficient just to use 1427 in all cases: ᐊᐧ /wa/ is two Unicode characters 140A 1427.
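To use 1427 "in all cases", existing text would have to be checked for leftover precomposed forms. Below is a rough Python sketch of such a check, with the dot placed after the base as in the Blackfoot and Dene examples. The names PRECOMPOSED_TO_BASE, find_precomposed and to_mid_dot are made up for the example, and the table holds only the three precomposed characters cited in these two sections.

```python
MID_DOT = "\u1427"  # ᐧ CANADIAN SYLLABICS FINAL MIDDLE DOT

# Hypothetical, partial table: precomposed dotted syllabic -> plain base.
PRECOMPOSED_TO_BASE = {
    "\u14AD": "\u14A3",  # ᒭ WEST-CREE MWE -> ᒣ ME (Blackfoot /tsa/ -> /ta/)
    "\u1418": "\u140A",  # ᐘ WEST-CREE WA  -> ᐊ A
    "\u147F": "\u1472",  # ᑿ WEST-CREE KWA -> ᑲ KA
}

def find_precomposed(text: str) -> list[tuple[int, str]]:
    """Report the position and code point of each precomposed dotted syllabic."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in PRECOMPOSED_TO_BASE
    ]

def to_mid_dot(text: str) -> str:
    """Rewrite precomposed forms as base + 1427 (dot on the right)."""
    return "".join(
        PRECOMPOSED_TO_BASE[ch] + MID_DOT if ch in PRECOMPOSED_TO_BASE else ch
        for ch in text
    )

sample = "\u1418\u15C3\u1427"  # precomposed ᐘ followed by ᗃᐧ (already base + 1427)
print(find_precomposed(sample))
print([f"{ord(ch):04X}" for ch in to_mid_dot(sample)])
```

Sequences that already use base + 1427, like 15C3 1427, pass through unchanged, so the conversion can be run more than once without harm.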

Dene and why I have thrown in the towel

But is there something stopping the addition of these additional mid-dotted precomposed characters? Apart from the one symbol, two encodings problem for old Moose Cree?

Yes there is, and it comes from Dene as written in the Northwest Territories. In fact there are two major issues.

Nasal vowels

In Dene, nasal vowels are indicated by a raised "Grave" diacritic. In virtually every written document in Dene, this diacritic is spacing, i.e. comes after the base glyph. So in a word like ᓀᘕˋ /nezǫ/ he/she is good, that final ˋ indicates that the vowel is nasal.

However, in two very early Dënesųłiné versions of Prières, cantiques et catéchisme en langue montagnaise ou chipeweyan (1857, 1865), if the nasalized vowel had no consonant onset, that is, if it was a nasal version of one of the syllabics ᐊ, ᐁ, ᐃ or ᐅ, the nasal diacritic went on top of the syllabic: ᐅ̀ᐣᐟᓯ /ǫntsi/. By 1890, this practice had ceased and the nasal diacritic always went after the syllabic: ᐅˋᐣᐟᓯ.

The UCAS Unicode range includes precomposed nasalized vowels, but only for the no-onset syllabics, ᐫ ᐬ ᐭ ᐮ – and should the diacritic go atop or to the right of the base? But any syllable can be nasalized in Dene, so we need to account for ᘕˋ /zǫ/, ᗃˋ /ghą/, ᓇˋ /ną/ and so on. To manage this, we have two choices: either make a proposal to add at least 64 new nasalized characters to Unicode, or use ˋ 02CB (MODIFIER LETTER GRAVE ACCENT) for all cases of the nasal diacritic. If we take the second route, the precomposed characters ᐫ ᐬ ᐭ ᐮ should be abandoned completely; deprecate the lot.

I suggest that this is both preferable and more economical. More economical because we don’t need over 64 new characters in Unicode, and preferable because the internal logic of the syllabics system seems to be that the nasal diacritic is distinct. In the Dëne Yatı Ɂerehtł'ı́s, one can see that nasalization can occur after a final: ᐁᐣᑯᕍᐩˋ /ekuląy/. Obviously we cannot compose the nasalization mark with the base character ᕍ /la/ with the final ᐩ /y/ in the way.

Thus we are forced to abandon a precomposed set of characters for multi-character sequences.
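Retiring the four precomposed nasal vowels amounts to a one-to-two character substitution. Here is a small Python sketch, assuming that the bases of ᐫ ᐬ ᐭ ᐮ are the plain vowels ᐁ ᐃ ᐅ ᐊ; NASAL_TABLE and decompose_nasals are names invented for this example.

```python
NASAL = "\u02CB"  # ˋ MODIFIER LETTER GRAVE ACCENT

# Precomposed nasal vowel -> plain vowel followed by the spacing nasal mark.
NASAL_TABLE = str.maketrans({
    "\u142B": "\u1401" + NASAL,  # ᐫ EN -> ᐁ E + ˋ
    "\u142C": "\u1403" + NASAL,  # ᐬ IN -> ᐃ I + ˋ
    "\u142D": "\u1405" + NASAL,  # ᐭ ON -> ᐅ O + ˋ
    "\u142E": "\u140A" + NASAL,  # ᐮ AN -> ᐊ A + ˋ
})

def decompose_nasals(text: str) -> str:
    """Replace precomposed nasal vowels with vowel + 02CB; leave everything else alone."""
    return text.translate(NASAL_TABLE)

print([f"{ord(ch):04X}" for ch in decompose_nasals("\u142D")])
```

Syllables that never had a precomposed nasal form, such as ᓇˋ (14C7 02CB), are already in the target encoding and are left untouched.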

Ejective consonants

In Dene syllabics, ejective consonants are indicated with a raised vertical tick: ᑪ is /t'a/ (146A), ᒈ is /k'a/ (1488). In Dënesųłiné, there are other ejective consonants (ch', tł', ts') but these aren’t indicated in writing, and /tth'/ gets its own series ᕮ. No problem thus far.

In one variant of Tłįchǫ syllabics, these unmarked ejectives are given the vertical tick: so ᐟᕃˈ is /tł'e/. To write this sequence, we need 02C8 (MODIFIER LETTER VERTICAL LINE).

The raised vertical tick is also used in Dënesųłiné to indicate aspiration: ˈᐁ is /he/: 02C8 1401.

We could propose to add 12 new characters to account for Tłįchǫ, plus another 4 for Dënesųłiné. Double these 16 to include the nasalized variants (32), add the 8 nasal versions of the existing /t'/ and /k'/ series, and we are at 40 new characters. But given the situation with nasal vowels, it seems easier to use 02C8 for all cases, so ᑪ (146A) is decomposed into ᑕˈ: 1455 02C8. So I’ll abandon the t' and k' series.

Conclusions

I could go on, and I’m sure I will in another post, but for now what have I concluded?
  1. For Old Moose Cree, precomposed mid-dot characters break the one symbol - one encoding principle. Thus all mid-dots in Old Moose Cree must be 1427 – all precomposed characters are disallowed.
  2. For Blackfoot and Dene to use only precomposed characters, we would need to add well over 100 new characters. If we abandon the precomposed characters, we can write these languages with the UCAS as it is.
  3. We have to deprecate the nasalized-vowel series anyway because finals can interrupt the vowel-nasal bind.

We gain little by using precomposed characters; after much reflection, I can’t see any benefit. My keyboard layouts have always used the precomposed characters, and I have received complaints from users:
  • They cannot just insert a mid-dot. In editing a document, the typist wants to correct a misspelled word, let’s say ᑕᐯ should be ᑕᐻ /tāpwē/. They want to simply put the cursor after the ᐯ, and type the dot-key. They can’t though and have to delete the ᐯ and retype ᐻ. This complaint indicates that conceptually the mid-dot is a discrete entity.
  • Dot-on-top vowel indication is optional in many dialects. For /tāpwē/, it is acceptable for many to write either ᑖᐻ or ᑕᐻ. With precomposed characters, ᑕ (1455) and ᑖ (1456) are completely different characters, whereas if we have a common base character ᑕ which may or may not be followed by a combining accent, it is simple to ignore the accent mark and treat ᑖᐻ and ᑕᐻ as the same word in searches or spell checkers. People see the dotted and undotted characters as the same underlying thing, with one perhaps a more precise spelling. 
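The search argument can be illustrated with a short Python sketch. It assumes the dot-on-top is typed as 0307 (COMBINING DOT ABOVE) after the base, which is exactly the kind of sequence the current guideline discourages; search_key is a hypothetical helper name.

```python
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, assumed here as the dot-on-top mark

def search_key(word: str) -> str:
    """Drop the optional dot-on-top so dotted and undotted spellings compare equal."""
    return word.replace(DOT_ABOVE, "")

# /tāpwē/ written with a common base ᑕ, then ᐯ followed by the mid-dot 1427
dotted = "\u1455\u0307\u142F\u1427"
undotted = "\u1455\u142F\u1427"

# The same two spellings with precomposed characters: ᑖᐻ and ᑕᐻ
pre_dotted = "\u1456\u143B"
pre_undotted = "\u1455\u143B"

decomposed_match = search_key(dotted) == search_key(undotted)
precomposed_match = search_key(pre_dotted) == search_key(pre_undotted)
print(decomposed_match, precomposed_match)
```

With the decomposed spellings the two forms match once the dot is ignored. The precomposed pair does not match without a separate table that pairs every dotted character with its undotted partner.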

What am I to do?

I’ve thrown in the towel and am not going to bother with precomposed syllabics any longer. They’re gone, out of my life.

I’m going to put out new Syllabics keyboards with 1427 mid-dots. This is what speakers have been asking for, and it will make everyone’s lives easier. It’s what the Cree Wikipedia does anyway!

Wednesday, May 30, 2012

Visually Identical, different encoding

And we’re back...

Fittingly, I am going to restart the blog on a bit of an old chestnut. I’m considering throwing in the towel on those Unicode characters that look identical (or are functionally identical in the minds of typists).

A bit of history. When I first started making keyboard layouts for indigenous languages, I quickly found that I needed to make a decision on apostrophes: lots of languages use them for a whole variety of sounds, including glottal stop, aspiration, ejectives, reduced vowels, etymological deletions, and so on. The first thing I did was just use U+0027 '. This started long before I was using Unicode, so in a standard ASCII font, U+0027 made sense.

Once I began to look at Unicode, I wanted to follow the Unicode directive and use the Spacing Modifier Letter U+02BC ʼ, because I’m supposed to according to the Standard, whose notes on the character read:
  • "glottal stop, glottalization, ejective"
  • "many languages use this as a letter of one of their alphabets"
  • "2019 is the preferred character for a punctuation apostrophe"
Almost immediately, I was getting emails from North American Native-language speakers that their word processor spell checker wasn’t working anymore. The first e-mail was from a Tohono ’O’odham speaker so we’ll go with that. Thing is, virtually all Native speakers in Canada and the USA are bilingual in English or French, and she was using the Native-language keyboard for typing in English – not a problem because the keyboards accommodate English. And, although there is a functional difference between the Tohono ’O’odham apostrophe and the English apostrophe (actually there isn’t, but that’s for another blog), there is no visual difference. Therefore, in the mind of the typist, an English apostrophe and a Native-language apostrophe are the same thing and are on the same key.


So when she was typing in English using the Tohono ’O’odham keyboard, words like “wasnʼt” were showing up as incorrect, because the apostrophe was typed as U+02BC. How on Earth is a typist going to know that a Tohono ’O’odham apostrophe is one entity while an English apostrophe is a different entity when they look exactly the same? They’re not going to know, so forcing the distinction is unfair. Especially in languages like Anishinaabemowin or Blackfoot which have writing systems with no letters outside English but with an apostrophe. So the Blackfoot typist can’t use the US English keyboard layout because they need a special apostrophe, which, by the way, looks exactly like the English one? U+02BC doesn’t make sense and it’s out of my life.
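The spell-checker failure is easy to reproduce in miniature. In this Python sketch the set english_words is a made-up stand-in for a real spell checker's word list. The character names and general categories come from the standard unicodedata module.

```python
import unicodedata

# Hypothetical stand-in for an English spell checker's word list.
english_words = {"wasn't", "won't"}

typed_ascii = "wasn't"          # apostrophe typed as U+0027
typed_modifier = "wasn\u02BCt"  # apostrophe typed as U+02BC

ascii_ok = typed_ascii in english_words        # found in the word list
modifier_ok = typed_modifier in english_words  # not found: a different character

# Three look-alike characters with three different identities in Unicode.
for ch in ("'", "\u02BC", "\u2019"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

print(ascii_ok, modifier_ok)
```

To software, U+02BC is a letter (category Lm), while U+0027 and U+2019 are punctuation (Po and Pf). None of that is visible to the person at the keyboard.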

All right then, I went through all the keyboards I had already made public, and redid the lot of them (well over 100). I changed them all to U+2019. This way I got the curly apostrophe I wanted, but no hassles with two different apostrophes underlyingly.

But this has had its own problems, primarily with line breaking. Because U+2019 is "punctuation", it line-breaks. Let’s look at a couple of examples:
  • Blackfoot: omatsini’katsitsai
  • Cheyenne: mostanotseevavo’hoveotse̊hevohe
Whenever a big long word like this is close to the end of a line of text, it should break with a hyphen somewhere. However, if the word has U+2019 as the apostrophe, the line break will always occur on the apostrophe.

  • omatsini’
    katsitsai
  • mostanotseevavo’
    hoveotse̊hevohe

The software won’t automatically insert a hyphen because, as far as it is concerned, it is not breaking a word: the apostrophe is a word boundary. A Mohawk word like kotitakhehnóntie’s is going to line-break between the apostrophe and the final s, which is very wrong.
  • kotitakhehnóntie’
    s
Furthermore, because U+2019 is a word boundary, the double-click highlight text or Command/Ctrl Right Arrow word skip doesn’t work at all.

What to do then? I’m going back to plain ol’ U+0027 APOSTROPHE. Yes it is vague as to what shape it ought to be. Yes it can be mutated into a curly apostrophe ‘ or ’ if smart quotes are turned on in your word processor. Yes the standard says that “2019 ’ is preferred for apostrophe.”

But in the end, U+0027 works. Software recognizes it for what it is, an apostrophe. Speakers recognize it for what it is, an apostrophe (English, French, and Native-language alike). Drawbacks?
  1. With smart quotes turned on, any word-initial apostrophes (like ’O’odham) are going to be 6-curled: ‘O’odham. Ugly and wrong. But I see this all the time in English too, like “Hits from the ‘60’s”. Maybe one day we’ll have smart smart quotes.
  2. No apostrophes in web-site domain names. Sure this is no fun, but languages have other punctuation letters, like : for length which can’t go in file names even. We manage no apostrophes in domain names in English just fine.
My loyalty lies with the Native-language typist before IT. Usually there is no conflict between the two groups once the facts are on the table. I feel, however, that with apostrophes this may never be the case. So one of these days I will go back into all the keyboard layouts, ditch U+2019 and change it back to U+0027. If English ever starts using U+02BC for its apostrophe, which according to the standard it probably should, then I’ll revert the lot back to U+02BC. I’m done worrying about it: U+0027 works, and it is what English, French, Italian, and others use, so it’ll continue to work.
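Text already typed with the older layouts could be brought into line with a simple substitution. This is a rough Python sketch, not a finished tool: use_plain_apostrophe is a hypothetical name, and the blanket replacement assumes the text contains no genuine closing single quotation marks, which are also U+2019.

```python
# Assumed set of look-alikes to fold into U+0027; extend as needed.
TO_PLAIN_APOSTROPHE = str.maketrans({
    "\u02BC": "'",  # MODIFIER LETTER APOSTROPHE
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
})

def use_plain_apostrophe(text: str) -> str:
    """Replace U+02BC and U+2019 with U+0027 APOSTROPHE."""
    return text.translate(TO_PLAIN_APOSTROPHE)

words = ["omatsini\u2019katsitsai", "kotitakhehnóntie\u02BCs"]
converted = [use_plain_apostrophe(w) for w in words]
print(converted)
```

In running prose with real quotation marks, a word-internal test (a letter on both sides of the mark) would be a safer condition than replacing every U+2019.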

Let’s try this blog thing again

A couple of years ago, I thought about starting a blog about all things Languagegeeky. Unfortunately, it ended up in the closet along with old tennis racquets and other forgotten projects.

I’m going to try again.