Complex Text Layout Language Support in the Solaris Operating Environment

Chapter 1 Complex Text Layout Languages

This chapter introduces features of Complex Text Layout languages and their support in the Solaris operating environment.

Complex Text Layout language support in Solaris
Bidirectional scripts (Arabic and Hebrew)
Character representation
Input and text handling
Numbers and dates

1.1 Introduction

Storing and displaying text has traditionally followed the basic structure of writing systems in Western languages (languages based on the Roman, Cyrillic, and Greek alphabets). Input characters are stored in the order in which they are typed, and displayed in the same order. There is no logical difference between how text is stored and how it is displayed.

Some writing systems, however, are different than those of Western languages. The direction in which characters are displayed can be different and the shape of a character or word can be modified depending on the adjoining text. Complex text layout transformations are required for display. A Complex Text Layout (CTL) language is any language which stores text differently from how it is displayed.

Particularly, many CTL languages use bidirectional script. Words and sentences are written from right to left, while some text, such as numbers and Roman-based words, are written from left to right. Arabic and Hebrew use bidirectional script. Furthermore, some CTL languages are context dependent. In Arabic, characters can be modified depending on the preceding and following characters. Written Thai uses character clusters, a collection of syllabic elements denoting consonants, vowels, and tonal values.

1.2 CTL Language Support in Solaris

The Solaris operating environment supports CTL easily and conveniently in all Unicode and Thai, Arabic, and Hebrew locales. CTL scripts are properly rendered as screen output or in printed format. Most applications need only be recompiled without changing the source code to run under these locales. A CTL-supported application enables users to input, view, and display CTL text without any additional development work.

The Solaris operating environment contains a standard API, the Portable Layout Services (PLS) from The Open Group, which provide applications with a consistent interface to the CTL language engines. Developers set various CTL attributes in the PLS configuration file, including cursor positioning and character and cell deletion.

Note -

Unless otherwise stated, the CTL language engines manage complex text layout transparently to the user.

1.3 Bidirectional Scripts

Arabic and Hebrew use bidirectional script. In a bidirectional script, the default text direction is from right to left. However, numbers and Western-language text (such as quotations or proper names) are written from left to right. Text justification depends on line direction--typically, right-to-left text is right justified and left-to-right text is left justified.

Because input characters are stored in order, an application must manage the display of blocks of right-to-left and left-to-right text in the proper order. In Figure 1-1, text direction is left to right for the English text and right to left for the Arabic text. Note that the cursor is to the left of the Arabic text after input.

Figure 1-1 Bidirectional text

Note -

In the Solaris operating environment, complex text layout issues are managed by Motif. Applications using Motif do not need to address these issues directly. For more information, see Section 2.1, Architecture.

1.3.1 Managing Bidirectional Text

In an application supporting bidirectional script, the global direction determines text flow and alignment. For example, in an Arabic word processor, the global direction is right to left for justification, tabulation, and related formatting operations. Roman words and phrases within an Arabic text string, however, remain left-to-right aligned. When a line break occurs in the middle of an Arabic string imbedded in an English sentence, however, the break must still respect the left-to-right text flow as shown in Figure 1-2.

Figure 1-2 English-Arabic-English left-to-right text flow

1.3.2 Symmetrical Swapping

Some characters have a meaning related to their direction, such as greater than (>) and less than (<). Bidirectional text strings containing these characters may need to replace one character with its complement to retain the correct meaning. For example, the expression 3 < 4 (three is less than four) is stored in the sequence in which it is typed. To display this expression correctly in a right-to-left string, however, it must appear as 4 > 3 (read right to left as 3 less than 4). Because a direct transposition of the three characters in storage would give 4 < 3, which is incorrect, the transformation process must swap < and >.

Replacing complementary symbols to retain their correct meaning is called symmetrical swapping. Characters requiring symmetrical swapping include parentheses: ( and ), angle brackets: < and >, square brackets: [ and ], and curly brackets: { and }.

1.4 Character Representation

In CTL languages, the basic input character is often modified before being displayed. Character representation is the final display form of the screen or printed character and depends on context, ligatures, diacritical marks, and character clusters.

Note that any modified input will not have a one-to-one correspondence between the number of input characters and the number of output glyphs. A single glyph occupying one display cell can represent two or more typed characters. Internal algorithms determine the character representation from the number and sequence of input characters. By using the PLS library, an application uses the appropriate character representation in each CTL language.

1.4.1 Contextual Analysis

The shape of a character in its final display form can depend on its position in a word or its position relative to neighboring characters. Changing the shape of a character is called shaping or contextual analysis. For example, in English handwriting, letters can have different shapes in isolation or connected to other letters: the letter "r" appears differently at the beginning, in the middle, or at the end of a word. In English printing, however, character representation is unaffected by position--a character in isolation or connected to other letters is identically represented. For example, lowercase "a" is represented as "a"--either by itself, at the start, in the middle, or at the end of a word as in: a, all, lap, formula.)

Arabic uses a cursive script, connecting one character to another as in English handwriting. All characters are affected in many different ways by their context. Arabic characters can have up to four final display forms: initial, final, medial, or isolated. During input, the keystrokes are stored as basic code-point values. The Arabic language engine reads the code-point values and selects the appropriate final display form from the context. Ligatures and diacritics, however, must also be considered.

Note -

Only the basic shape of each character appears on an Arabic keyboard.

1.4.2 Ligatures

A ligature is the combination of two or more characters to create a single character or syllable. For example, in English, a diphthong is a ligature which unites two vowel characters to create another sound: a + e = æ.

In Arabic, ligatures are the combinations of two and, sometimes, three characters into one glyph. The resulting glyph replaces the characters composing it. For example, an Arabic letter typed twice is stored in memory as two distinct keystrokes, occupying two display cells. However, the Arabic language engine recognizes the context of each character and returns one glyph as shown in Figure 1-3.

Figure 1-3 Arabic ligature

In the Arabic language engine, a ligature occupies the same number of display cells as input characters, with one exception: the combination character lamalif.

Figure 1-4 Lamalif combination character

The rules governing ligatures in Arabic text are very complex and do not depend solely on individual characters. Certain fonts define as many as 200 ligatures, while other fonts do not use ligatures at all.

1.4.3 Diacritics

Diacritical marks (or diacritics) are added to a character to show pronunciation. For example, in French, an accent is a diacritical mark above a vowel which alters the stress of the vowel: à á è é.

In Hebrew, diacritical marks represent vowel sounds and appear above, below, or inside characters . Typically, however, words are written without diacritics and vowel sounds are determined from the context.

In Arabic, diacritical marks can appear above (single or double diacritic) or below (single diacritic) any character. In the Arabic language engine, diacritics are entered as separate keystrokes and occupy the same cell as their associated character.

In Thai, diacritical marks appear above or below the base line containing consonants, vowels, symbols, and numbers. In the Thai language engine, diacritics are entered as separate keystrokes and occupy the same cell as their associated character.

1.4.4 Character Clusters

A character cluster is a collection of alphabetic characters forming a single word or syllable.

A Hebrew character cluster can contain characters and diacritics, though diacritics are not often used.

Character clusters are integral to written text in Thai. A Thai character cluster consists of up to four parallel lines:

Base line of consonant, vowel, or symbol
Lower line of lower vowel or diacritic
Upper line of upper vowel or diacritic
Line above the upper line for tone marks and upper diacritics

One or more elements may change shape in the presence of other elements. In the Thai language engine, a character cluster occupies one display cell. Depending on the composition, one display cell can contain up to three input characters, as shown in Figure 1-5.

Figure 1-5 Thai-language character cluster

1.5 Input and Text Handling

The Solaris operating environment locales support text input in a number of different writing systems. In the UTF-8 locale, to choose an input mode, press the Compose key and a two-letter code as in Table 1-1. For example, to input text in Thai, press Compose+tt. Alternatively, click the status area and select an input mode. (To select the English/European input mode, press Control+Space.)

Note -

In a CTL locale, to toggle between the English/European and CTL locale, press Control+Space.

The bottom left corner of the text window displays the current input mode, as shown in Figure 1-6.

Figure 1-6 Current input mode (Thai)

The user can mix text between English and the CTL language locale (Arabic, Hebrew, Thai) by switching input modes. The formatting rules governing the input stream are managed by the application.

Table 1-1 UTF-8 Input Mode two-letter codes


Language	Code
Cyrillic	`cc`
Greek	`gg`
Thai	`tt`
Arabic	`ar`
Hebrew	`hh`
Unicode Hex	`uh`
Unicode Octal	`uo`
Lookup	`ll`
English/European	`Control+Space`

Note -

To switch modes, press Compose+code. On a X86 machine, use Ctrl+Shift+F1 as the Compose key.

CTL text can also be input with a lookup table as shown in Figure 1-7.

Figure 1-7 Lookup Thai input mode

To view a lookup table, click the status area, [Lookup], and a Lookup input mode. To input a character, click the character.

1.5.1 Cursor Positioning

The default cursor position is the upper left corner of the screen. The cursor direction depends on the input mode (left-to-right for Western languages and right-to-left for Arabic and Hebrew). The cursor position is determined after the second keystroke.

In English and other Western languages, the cursor redisplays to the right of each keystroke. In Hebrew and Arabic, the cursor remains stationary, with the character stream appearing to the right or left of the cursor, depending on the page alignment.

In Thai, the character cluster is composed to the left of the cursor. If a set of two or three keystrokes belong to the same cluster, the language engine values these keystrokes as zero and the cursor remains in place. The cursor moves to the right as each cluster is composed.

Note -

Cursor attributes can be reconfigured in the application resource file. For more information, see Step 4 in Section 3.1 CTL Motif Resources.

1.5.2 Character and Cell Deletion

To manage editing character clusters, the CTL language supports both character-based deletion and cell-based deletion. In character-based deletion, the user deletes the separate elements of the character cluster with each press of the Delete key. In cell-based deletion, the user deletes the entire cluster with one keystroke.

The two deletion modes can be assigned to separate keys on the keyboard using translation and resources. For more information, see Step 3 in Section 3.1 CTL Motif Resources.

1.5.3 Processing Stored Text

In the Solaris operating environment, all CTL text is stored in logical order--the order of the character input stream. The CTL language engines manage the transformation from CTL scripts to final display form on screen.

Most legacy systems containing CTL text store data in physical order--the order represented on screen. This was done because there was no processing capability at the terminal for reading and correctly rendering logical input. Screens of text were stored and then read again to screen.

Applications can read legacy data using the CTL language engines by configuring the appropriate language engine modifier. For more information, see Step 1 in Section 3.1 CTL Motif Resources.

For other language-engine configuration parameters, see CAE Specification: Portable Layout Services (ISBN 1-85912-142-X).

1.6 Numbers and Dates

For numbers, most countries use the "Arabic" or Western format of ten digits: 1 2 3 4 5 6 7 8 9 0. Some Arabic countries, however, use Hindi digits.

Figure 1-8 Hindi digits

The Solaris operating environment supports both Arabic and Hindi digits. By default, Western language input modes display Arabic numbers while the Arabic input mode displays Hindi numbers, though this is configurable by the user. To change the default, see Step 1 in Section 3.1 CTL Motif Resources.

Arabic countries use both the Gregorian (Western) and Hijri (Islamic) date systems. These and other numeric data, such as time and monetary formats and currency symbol, are managed by the Solaris operating environment locale settings.

1.7 Summary

CTL languages require complex transformations to properly display or print characters. The fundamental characteristics of CTL languages are:

Bidirectional text--Arabic and Hebrew are written right to left, but numbers and Roman-based characters are written left to right.
Contextual analysis--each Arabic character has up to four display representations. The representation glyph depends on its position in the text.
Ligatures--combinations of two or more Arabic characters can form a different, single character.
Diacritics--diacritical marks placed above, below, or inside vowels and consonants modify their tonal value.
Character clusters--syllables and cells, especially in Thai, are composed of several alphabetic elements, including vowels, consonants, diacritics, and tone marks.
Number and date formats--some Arabic countries use Hindi digits rather than Arabic digits, and the Islamic calendar rather than the Western calendar.

Motif interfaces with PLS which interfaces with the CTL language engines. Applications that make use of Motif services need not address CTL issues directly because all CTL transformations and rendering are handled transparently.