Thursday, May 31, 2012

Canadian Aboriginal Syllabics Decombined

One look at the Unicode block Unified Canadian Aboriginal Syllabics (1400–167F) shows that there are a lot of precomposed characters. For example, the syllabic for /ta/ ᑕ has several extensions: ᑖ, ᑡ, ᑢ, ᑣ, ᑤ.
  • The dot-on-top diacritic signifies that the vowel is lengthened or more peripheral in some way: ᑖ is /tā/. 
  • A mid-dot diacritic to the left or right indicates that there is a /w/ glide in the onset of the syllable: ᑢ is /twa/. Whether the dot appears on the left or right of the base character depends on dialect/ortholect, so in most of Ontario and points east ᑡ is the accepted form for /twa/.
  • Combine the two diacritics together and you get ᑤ or ᑣ ~ /twā/.

The guideline is that one should never use the character 1427 (CANADIAN SYLLABICS FINAL MIDDLE DOT) for the mid-dot diacritic, nor should one ever use a combining diacritic like 0307 (COMBINING DOT ABOVE) for the dot-on-top diacritic. Consequently, ᑖ 1456 (CANADIAN SYLLABICS TAA) and ᑕ̇ 1455 0307 (CANADIAN SYLLABICS TA + COMBINING DOT ABOVE) are not equivalent.
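To make the non-equivalence concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `same_under_normalization` is my own, purely for illustration. It checks that the precomposed ᑖ carries no decomposition mapping, so no normalization form will ever turn one spelling into the other.

```python
import unicodedata

TAA = "\u1456"           # ᑖ CANADIAN SYLLABICS TAA (precomposed)
TA_DOT = "\u1455\u0307"  # ᑕ CANADIAN SYLLABICS TA + COMBINING DOT ABOVE


def same_under_normalization(a, b):
    """Return True if a and b become identical under any standard form.

    Hypothetical helper for illustration only.
    """
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in forms
    )


print(unicodedata.name(TAA))                   # CANADIAN SYLLABICS TAA
print(repr(unicodedata.decomposition(TAA)))    # '' (no decomposition mapping)
print(same_under_normalization(TAA, TA_DOT))   # False
```

In practice this means a search for one spelling will not find the other unless the software adds its own folding.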

Ever since the addition of UCAS Extended, one can represent all the extended syllabic characters as single Unicode precomposed characters; at least one can for Cree, Ojibwe, Naskapi and Inuktitut. However, if base-syllabics with diacritics must be precomposed into a single character, then Blackfoot and Dene cannot be written following Unicode guidelines.

Before we carry on, there is an important principle to posit.

One symbol - one encoding Principle

As I discussed in the previous post, I believe that it is fundamentally a bad move to have one visual/graphical symbol with two different encodings and expect a typist to know the difference and be diligent about differentiating the two. For example, one could not expect an English typist to know that the apostrophe in won’t and the ‘single closing quotation mark’ are two different characters. Functionally they are quite different:
  • the apostrophe in won’t is lexical, it is built into the word and is essentially a letter.
  • the ‘closing single quotation mark’ is punctuation, it is not in the dictionary, it is not part of the word mark.
English typists are not expected to encode the lexical-apostrophe and the punctuation-quotation-mark differently – they’re visually identical, thus they should be encoded identically.

Why have 1427 (CANADIAN SYLLABICS FINAL MIDDLE DOT) anyway?

Back to syllabics and the mid-dot diacritic.

If we can’t use 1427 as a sort of diacritic, what is the point (pardon the pun) of this character anyway? The only source that I know of that would explain the inclusion of 1427 is the 1866 Recueil de prières, catéchisme, et cantiques à l’usage des Sauvages de la Baie d’Hudson (Moose Cree dialect). The writer follows the normal eastern practice of employing a left-side mid dot for /w/ onsets – on the chart on page 2, ᐺ is transcribed pwe. Interestingly for us, in the following row, ᐯᐧ is transcribed peu. Whereas most if not all other writers use a ring-final ᐤ (1424 CANADIAN SYLLABICS FINAL RING) to indicate a /w/ at the end of a syllable (in the coda), this version of the Recueil de prières uses a mid-dot.

This raises an encoding problem. As an example, let’s look at page 6 (dot-on-top marking is absent in the work), specifically the word ᓂ ᑕᐺᑕᐗᐧ /ni-tāpwētawāw/ ‘I believe him’. In other works, this would be written with a raised-ring final, ᓂ ᑕᐺᑕᐗᐤ, but here final-w is written with a mid-dot.

The problem is, how can one know the difference between the mid-dots in ᓂ ᑕᐺᑕᐗᐧ? The first and second mid-dots indicate the w-onset: ᐺ /pwē/, ᐗ /wā/ – but the last is a final-w. According to Unicode guidelines, the first two mid-dots are part of precomposed characters: ᐺ (143A) and ᐗ (1417). The final mid-dot would be ᐧ (1427). Here we have two different encodings of the same graphical symbol, which goes squarely against my policy as outlined in the previous post on apostrophes. One cannot expect typists to know that two otherwise identical symbols need two different encodings; it’s an unfair expectation.

So what to do for this variant of Moose Cree? The only thing to do is abandon the precomposed w-onset characters altogether, and use 1427 in all cases. Thus in the word ᓂ ᑕᐧᐯᑕᐧᐊᐧ:
  • the syllable /pwē/ ᐧᐯ should be the sequence 1427 142F
  • the syllable /wāw/ ᐧᐊᐧ should be 1427 140A 1427.
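The re-encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `LEFT_DOT_TO_BASE` and the function names are hypothetical, and the table covers just the two left-dotted syllabics from the example word, not the whole w-series.

```python
MID_DOT = "\u1427"  # ᐧ CANADIAN SYLLABICS FINAL MIDDLE DOT

# Hypothetical and deliberately incomplete: only the two precomposed
# left-dotted syllabics discussed above. A real converter would need
# an entry for every w-onset character in the orthography.
LEFT_DOT_TO_BASE = {
    "\u143A": "\u142F",  # ᐺ PWE -> ᐯ PE
    "\u1417": "\u140A",  # ᐗ WA  -> ᐊ A
}


def decompose_left_dots(text):
    """Replace each known precomposed left-dotted syllabic with 1427 + base."""
    out = []
    for ch in text:
        base = LEFT_DOT_TO_BASE.get(ch)
        out.append(MID_DOT + base if base is not None else ch)
    return "".join(out)


def codepoints(text):
    """Show a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


print(codepoints(decompose_left_dots("\u143A")))        # 1427 142F
print(codepoints(decompose_left_dots("\u1417\u1427")))  # 1427 140A 1427
```

After conversion every mid-dot in the text is the same character, whether it marks an onset or a final.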

Blackfoot mid-dot

All right, so the above section was just about an obscure, historical orthographic variant of Moose Cree. Let’s move on to Blackfoot, where the mid-dot indicates an onset /s/: plain ᒣ is /ta/, ᒭ is /tsa/. Not a problem, one can use the precomposed ᒭ 14AD (CANADIAN SYLLABICS WEST-CREE MWE) to represent /tsa/. But Blackfoot can use the s-onset mid-dot after other syllabic base letters, most often the k-series: ᖿᐧ /ksa/. These Blackfoot k-series syllabics have no precomposed dotted characters in Unicode.

Barring any additions to Unicode to include these characters (which I’m not inclined to propose), the simple solution is to just use 1427 for all instances of s-onset in Blackfoot, no precomposed characters anywhere.
  • ᖿᐧ is 15BF 1427
  • ᒣᐧ is 14A3 1427 (not the precomposed 14AD)

Dene mid-dot

Dene has the same issue for w-onset:
  • ᐘ /wa/ : could use precomposed 1418
  • ᑿ /gwa/ : could use precomposed 147F
  • ᗃᐧ /ghwa/ : there is no precomposed character, must use 15C3 1427
  • ᒈᐧ /k’wa/ : no precomposed character, must use 1488 1427 (see later though for Dene ejectives)
Again, more additions to Unicode could be proposed to accommodate the unsupported characters, but like Blackfoot and old Moose Cree, it’s more efficient just to use 1427 in all cases: ᐊᐧ /wa/ is two Unicode characters 140A 1427.
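Using 1427 everywhere also keeps the keyboard logic uniform: the dot is just another character placed next to the base, whether or not Unicode happens to have a precomposed form. A rough Python sketch of that idea follows; the function `with_mid_dot` and its `dot_side` parameter are my own invention for illustration, not part of any existing keyboard layout.

```python
MID_DOT = "\u1427"  # ᐧ CANADIAN SYLLABICS FINAL MIDDLE DOT


def with_mid_dot(base, dot_side="right"):
    """Build a dotted syllable from a base syllabic and a separate 1427.

    dot_side is a hypothetical parameter: "right" for the western,
    Blackfoot and Dene placement, "left" for the eastern placement.
    """
    if dot_side == "right":
        return base + MID_DOT
    if dot_side == "left":
        return MID_DOT + base
    raise ValueError("dot_side must be 'left' or 'right'")


def codepoints(text):
    """Show a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# The same rule covers syllables with and without precomposed forms.
print(codepoints(with_mid_dot("\u140A")))  # 140A 1427 (instead of 1418)
print(codepoints(with_mid_dot("\u15C3")))  # 15C3 1427 (no precomposed form)
print(codepoints(with_mid_dot("\u14A3")))  # 14A3 1427 (instead of 14AD)
```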

Dene and why I have thrown in the towel

But is there something stopping the addition of these additional mid-dotted precomposed characters? Apart from the one symbol, two encodings problem for old Moose Cree?

Yes there is, and it comes from Dene as written in the Northwest Territories. In fact there are two major issues.

Nasal vowels

In Dene, nasal vowels are indicated by a raised "Grave" diacritic. In virtually every written document in Dene, this diacritic is spacing, i.e. comes after the base glyph. So in a word like ᓀᘕˋ /nezǫ/ he/she is good, that final ˋ indicates that the vowel is nasal.

However, in two very early Dënesųłiné versions of Prières, cantiques et catéchisme en langue montagnaise ou chipeweyan (1857, 1865), if the nasalized vowel had no consonant onset, that is, if it was a nasal version of one of the syllabics ᐊ ᐁ ᐃ or ᐅ, the nasal diacritic went on top of the syllabic: ᐅ̀ᐣᐟᓯ /ǫntsi/. By 1890, this practice had ceased, and the nasal diacritic always went after the syllabic: ᐅˋᐣᐟᓯ.

The UCAS Unicode range includes precomposed nasalized vowels, but only for the no-onset syllabics, ᐫ ᐬ ᐭ ᐮ – and should the diacritic go atop or to the right of the base? But any syllable can be nasalized in Dene, so we need to account for ᘕˋ /zǫ/, ᗃˋ /ghą/, ᓇˋ /ną/ and so on. To manage this, we have two choices: either make a proposal to add at least 64 new nasalized characters to Unicode, or use ˋ 02CB (MODIFIER LETTER GRAVE ACCENT) for all cases of the nasal diacritic. With the latter choice, the precomposed characters ᐫ ᐬ ᐭ ᐮ should be abandoned completely: deprecate the lot.

I suggest that this is both preferable and more economical. More economical because we don’t need over 64 new characters in Unicode, and preferable because the internal logic of the syllabics system seems to be that the nasal diacritic is distinct. In the Dëne Yatı Ɂerehtł'ı́s, one can see that nasalization can occur after a final: ᐁᐣᑯᕍᐩˋ /ekuląy/. Obviously we cannot compose the nasalization mark with the base character ᕍ /la/ with the final ᐩ /y/ in the way.

Thus we are forced to abandon a set of precomposed characters in favour of multi-character sequences.

Ejective consonants

In Dene syllabics, ejective consonants are indicated with a raised vertical tick: ᑪ is /t'a/ (146A), ᒈ is /k'a/ (1488). In Dënesųłiné, there are other ejective consonants (ch', tł', ts') but these aren’t indicated in writing, and /tth'/ gets its own series ᕮ. No problem thus far.

In one variant of Tłįchǫ syllabics, these unmarked ejectives are given the vertical tick: so ᐟᕃˈ is /tł'e/. To write this sequence, we need 02C8 (MODIFIER LETTER VERTICAL LINE).

The raised vertical tick is also used in Dënesųłiné to indicate aspiration: ˈᐁ is /he/: 02C8 1401.

We could propose to add 12 new characters to account for Tłįchǫ, plus another 4 for Dënesųłiné, for 16 in all. Double this to include the nasalized variants (32), add the nasal versions of the /t'/ and /k'/ series (8 more), and we are at 40 new characters. But given the situation with nasal vowels, it seems easier to use 02C8 for all cases, so ᑕˈ is decomposed into 1455 02C8. So I’ll abandon the precomposed t' and k' series.

Conclusions

I could go on, and I’m sure I will in another post, but for now, what have I concluded?
  1. For Old Moose Cree, precomposed mid-dot characters break the one symbol - one encoding principle. Thus all mid-dots in Old Moose Cree must be 1427 – all precomposed characters are disallowed.
  2. For Blackfoot and Dene to use only precomposed characters, we would need to add well over 100 new characters. If we abandon the precomposed characters, we can write these languages with the UCAS as it is.
  3. We have to deprecate the nasalized-vowel series anyway because finals can interrupt the vowel-nasal bind.

We gain little by using precomposed characters; after much reflection, I can’t see any benefit. My keyboard layouts have always used the precomposed characters, and I have received complaints from users:
  • They cannot just insert a mid-dot. In editing a document, the typist wants to correct a misspelled word, let’s say ᑕᐯ should be ᑕᐻ /tāpwē/. They want to simply put the cursor after the ᐯ and type the dot-key. They can’t, though, and have to delete the ᐯ and retype ᐻ. This complaint indicates that conceptually the mid-dot is a discrete entity.
  • Dot-on-top vowel indication is optional in many dialects. For /tāpwē/, it is acceptable for many to write either ᑖᐻ or ᑕᐻ. With precomposed characters, ᑕ (1455) and ᑖ (1456) are completely different characters, whereas if we have a common base character ᑕ which may or may not be followed by a combining accent, it is simple to ignore the accent mark and treat ᑖᐻ and ᑕᐻ as the same word in searches or spell checkers. People see the dotted and undotted characters as the same underlying thing, with one perhaps a more precise spelling. 
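Both complaints can be illustrated with a short Python sketch. It assumes the decomposed encoding argued for in this post (1427 for the mid-dot, and a combining 0307 for the dot-on-top); the function names `insert_mid_dot` and `search_key` are hypothetical and not taken from any existing spell checker or keyboard.

```python
import unicodedata

MID_DOT = "\u1427"    # ᐧ CANADIAN SYLLABICS FINAL MIDDLE DOT
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE (assumed length mark)


def insert_mid_dot(text, cursor):
    """Insert a mid-dot at the cursor position, as a plain string edit."""
    return text[:cursor] + MID_DOT + text[cursor:]


def search_key(text):
    """Drop combining marks so dotted and undotted spellings compare equal."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


# Correcting ᑕᐯ to /tāpwē/: put the cursor after ᐯ and add the dot.
fixed = insert_mid_dot("\u1455\u142F", 2)
print(fixed == "\u1455\u142F\u1427")  # True

# Decomposed spellings with and without the length mark share one key.
marked = "\u1455" + DOT_ABOVE + "\u142F" + MID_DOT
print(search_key(marked) == search_key(fixed))  # True

# Precomposed ᑖᐻ and ᑕᐻ stay distinct unless a special table is added.
print(search_key("\u1456\u143B") == search_key("\u1455\u143B"))  # False
```

With precomposed characters the same folding would need a lookup table pairing every dotted character with its undotted counterpart.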

What am I to do?

I’ve thrown in the towel and am not going to bother with precomposed syllabics any longer. They’re gone, out of my life.

I’m going to put out new Syllabics keyboards with 1427 mid-dots. This is what speakers have been asking for, and it will make everyone’s lives easier. It’s what the Cree Wikipedia does anyway!

2 comments:

  1. I don’t know much about Syllabics, but I am interested in Unicode, and the points you have made have given me some questions.

    • You use U+02CB for the nasal mark; why not U+1420? The same goes for U+02C8 vs. U+144A for the vertical tick.
    • Given that users can type the precomposed characters or not, there will be inconsistently encoded data, with all the negative effects. Have you told the Unicode Consortium about the problems with precomposed syllables? If not, I think you should.
    • Is there ever any graphical difference between ᒭ and ᒣᐧ?
    • How do left-dotted syllables collate? (I know this is not a reason to prefer precomposed characters; I am just curious.)
    • Do the arguments against precomposed syllables apply to ᙵ?

  2. To answer the questions
    1) I use U+02CB for the Dene nasal because U+1420 has a different value in Dene, /k/. Same goes for U+144A, which is Dene /b/. The problem is that the finals in Western Cree aren’t defined for line height: in some typefaces they’re top-line, in others they’re mid-line. In Dene, line height is important: the mid-line finals correspond to the generic Cree finals. There are three top-line finals in Dene, and for two of these, it’s good enough to use the Spacing Modifier Letters.
    2) There’s already a whole bunch of inconsistently encoded data. That’s not going to change. I’ve talked a lot about this to Unicode people, but the UCAS is what it is, warts (and there are some doozies) and all. I think I make pretty good points as to why the precomposed characters are the poorer choice to use.
    3) No, there is no graphical difference.
    4) There is no standard collation in syllabics cross-linguistically. Different traditions do things differently.
    5) I’d say so, but Inuktitut is standardized, and functions ok with precomposed syllables. Here’s the difference: in Inuktitut, something like ᖁ is perceived as a single unit, not a combination of ᕐ and ᑯ. That there are two parts making up the character is due to the developer largely restricting himself to the Cree inventory. In Algonquian, the w-dot is at least a diacritic on a base syllabic, at most a completely separate character. So ᑢ has two parts, the base ᑕ + a modifier ᐧ.
