Complex Text Layout Language Support in the Solaris Operating Environment

1.4 Character Representation

In CTL languages, the basic input character is often modified before being displayed. Character representation is the final display form of a character on screen or in print, and it depends on context, ligatures, diacritical marks, and character clusters.

Note that modified input does not necessarily have a one-to-one correspondence between the number of input characters and the number of output glyphs. A single glyph occupying one display cell can represent two or more typed characters. Internal algorithms determine the character representation from the number and sequence of input characters. An application that uses the PLS library obtains the appropriate character representation for each CTL language.

1.4.1 Contextual Analysis

The shape of a character in its final display form can depend on its position in a word or its position relative to neighboring characters. Changing the shape of a character is called shaping or contextual analysis. For example, in English handwriting, letters can have different shapes in isolation or connected to other letters: the letter "r" appears differently at the beginning, in the middle, or at the end of a word. In English printing, however, character representation is unaffected by position: a character in isolation or connected to other letters is represented identically. For example, lowercase "a" is represented as "a" whether it appears by itself or at the start, in the middle, or at the end of a word, as in: a, all, lap, formula.

Arabic uses a cursive script, connecting one character to another as in English handwriting. All characters are affected in many different ways by their context. Arabic characters can have up to four final display forms: initial, final, medial, or isolated. During input, the keystrokes are stored as basic code-point values. The Arabic language engine reads the code-point values and selects the appropriate final display form from the context. Ligatures and diacritics, however, must also be considered.

Note - Only the basic shape of each character appears on an Arabic keyboard.
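The selection of a final display form from stored code-point values can be sketched as follows. This is a minimal illustration, not the Arabic language engine's actual algorithm: it assumes the Unicode Arabic Presentation Forms-B block as the set of display forms, builds its lookup table from Python's unicodedata module, and ignores ligatures and diacritics (a diacritic between two letters would wrongly break the join here). The names FORMS, build_form_table, and shape are hypothetical.

```python
import unicodedata

FORMS = ("isolated", "initial", "medial", "final")

def build_form_table():
    """Map each basic Arabic code point to the display forms Unicode defines for it."""
    table = {}
    for cp in range(0xFE70, 0xFF00):  # Arabic Presentation Forms-B
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter entries such as "<initial> 0628"; skip ligatures.
        if len(parts) == 2 and parts[0].strip("<>") in FORMS:
            base = chr(int(parts[1], 16))
            table.setdefault(base, {})[parts[0].strip("<>")] = chr(cp)
    return table

def shape(text, table=None):
    """Replace each basic character with the form implied by its neighbors."""
    table = table or build_form_table()
    out = []
    for i, ch in enumerate(text):
        forms = table.get(ch)
        if forms is None:  # not an Arabic letter: pass through unchanged
            out.append(ch)
            continue
        prev = table.get(text[i - 1], {}) if i > 0 else {}
        nxt = table.get(text[i + 1], {}) if i + 1 < len(text) else {}
        # A letter joins backward only if the previous letter can join forward,
        # which in this sketch means the previous letter has an initial form.
        joins_prev = "final" in forms and "initial" in prev
        joins_next = "initial" in forms and "final" in nxt
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        out.append(forms.get(form, ch))
    return "".join(out)

# beh + alef + beh: initial beh, final alef, isolated beh
shaped = shape("\u0628\u0627\u0628")
print([hex(ord(c)) for c in shaped])
```

The third letter stays isolated because alef connects only to the letter before it, so the stored sequence of three identical-looking keystroke values yields three different shapes.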

1.4.2 Ligatures

A ligature is the combination of two or more characters to create a single character or syllable. For example, in English, the ligature æ unites two vowel characters into a single glyph: a + e = æ.

In Arabic, ligatures are the combinations of two and, sometimes, three characters into one glyph. The resulting glyph replaces the characters composing it. For example, an Arabic letter typed twice is stored in memory as two distinct keystrokes, occupying two display cells. However, the Arabic language engine recognizes the context of each character and returns one glyph as shown in Figure 1-3.

Figure 1-3 Arabic ligature


In the Arabic language engine, a ligature occupies the same number of display cells as input characters, with one exception: the combination character lamalif.

Figure 1-4 Lamalif combination character


The rules governing ligatures in Arabic text are very complex and do not depend solely on individual characters. Certain fonts define as many as 200 ligatures, while other fonts do not use ligatures at all.
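The lamalif exception described above can be illustrated with a short sketch. It assumes the Unicode code points for lam (U+0644), alef (U+0627), and the isolated lam-alef presentation form (U+FEFB), and it treats each resulting glyph as one display cell. The function names are hypothetical, and real engines also choose between isolated and final variants and handle the other alef forms, which this sketch does not.

```python
LAM = "\u0644"
ALEF = "\u0627"
LAM_ALEF_ISOLATED = "\ufefb"  # Arabic Presentation Forms-B, isolated form only

def substitute_lam_alef(text):
    """Replace every lam + alef pair with the single lamalif glyph."""
    return text.replace(LAM + ALEF, LAM_ALEF_ISOLATED)

def cell_counts(text):
    """Return (input characters, display cells) under the one-glyph-per-cell assumption."""
    glyphs = substitute_lam_alef(text)
    return len(text), len(glyphs)

typed, cells = cell_counts(LAM + ALEF)
print(typed, cells)
```

Two stored characters collapse into one cell here, whereas any other pair of letters passed to cell_counts keeps its original count.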

1.4.3 Diacritics

Diacritical marks (or diacritics) are added to a character to show pronunciation. For example, in French, an accent is a diacritical mark above a vowel which alters the pronunciation of the vowel: à á è é.

In Hebrew, diacritical marks represent vowel sounds and appear above, below, or inside characters. Typically, however, words are written without diacritics and vowel sounds are determined from the context.

In Arabic, diacritical marks can appear above (single or double diacritic) or below (single diacritic) any character. In the Arabic language engine, diacritics are entered as separate keystrokes and occupy the same cell as their associated character.

In Thai, diacritical marks appear above or below the base line containing consonants, vowels, symbols, and numbers. In the Thai language engine, diacritics are entered as separate keystrokes and occupy the same cell as their associated character.
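The rule that a diacritic is typed separately but shares the cell of its associated character can be approximated as below. This sketch assumes that diacritics correspond to the Unicode general category Mn (nonspacing mark), which holds for the Arabic, Hebrew, and Thai marks used here. It does not reflect how the language engines themselves classify characters, and display_cells is a hypothetical name.

```python
import unicodedata

def display_cells(text):
    """Count display cells, assuming each nonspacing mark shares its base character's cell."""
    return sum(1 for ch in text if unicodedata.category(ch) != "Mn")

arabic = "\u0628\u064e"        # beh + fatha: two keystrokes
thai = "\u0e01\u0e34\u0e48"    # ko kai + sara i + mai ek: three keystrokes
print(len(arabic), display_cells(arabic))
print(len(thai), display_cells(thai))
```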

1.4.4 Character Clusters

A character cluster is a collection of alphabetic characters forming a single word or syllable.

A Hebrew character cluster can contain characters and diacritics, though diacritics are not often used.

Character clusters are integral to written text in Thai. A Thai character cluster consists of up to four parallel lines:

One or more elements may change shape in the presence of other elements. In the Thai language engine, a character cluster occupies one display cell. Depending on the composition, one display cell can contain up to three input characters, as shown in Figure 1-5.

Figure 1-5 Thai-language character cluster
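Grouping input characters into clusters, each occupying one display cell, might look like the following sketch. It assumes a cluster is a base character followed by any nonspacing marks (Unicode category Mn), which is a simplification of the Thai language engine's actual composition rules. The name split_clusters is hypothetical.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the nonspacing marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1] += ch   # mark joins the current cluster
        else:
            clusters.append(ch)  # base character starts a new cluster
    return clusters

# Four input characters: ko kai, sara i, mai ek, ngo ngu
word = "\u0e01\u0e34\u0e48\u0e07"
clusters = split_clusters(word)
print(len(word), len(clusters), [len(c) for c in clusters])
```

The first cluster holds three input characters in one cell, and the second holds a single consonant.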