Unicode Support in the Solaris Operating Environment

Chapter 3 Unicode in the Solaris 8 Operating Environment

Support for the Unicode Standard, Version 3.0, in the Unicode locales of the Solaris 8 Operating Environment provides an enhanced framework for developing multiscript applications. Properly internationalized applications require no changes to support the Unicode locales. All internationalized CUI and GUI utilities and commands in the Solaris operating environment are available in Unicode locales without modification.

All Unicode locales in the Solaris operating environment are based on the UTF-8 format. Each locale includes a base language in the UTF-8 codeset and regional data related to the base language and its cultural conventions (such as local formatting rules, text messages, help messages, and other related files). Each locale also supports several other scripts for input, display, code conversion, and printing.
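As a rough illustration of what "based on the UTF-8 format" means for multiscript text, the following Python sketch prints the UTF-8 byte sequences for a few characters from different scripts. The sample characters and the helper name `utf8_hex` are chosen here for illustration and are not part of the Solaris documentation.

```python
# Sample characters from several scripts (chosen for illustration only).
samples = {
    "Latin A": "A",
    "Greek alpha": "\u03b1",
    "Cyrillic zhe": "\u0436",
    "Hebrew alef": "\u05d0",
    "Thai ko kai": "\u0e01",
    "CJK ideograph": "\u4e2d",
}


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of one character as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


for name, ch in samples.items():
    # Show the code point next to its UTF-8 byte sequence.
    print(f"U+{ord(ch):04X} {name:14} -> {utf8_hex(ch)}")
```

ASCII characters keep their single-byte values, while characters from other scripts take two or three bytes, which is why a single UTF-8 locale can carry text in all of these scripts at once.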

3.1 Unicode UTF-8 `en_US.UTF-8` Locale

en_US.UTF-8 is the flagship Unicode locale in the Solaris operating environment. The en_US.UTF-8 locale is an American English-based locale with multiscript processing support for characters in many different languages. New and enhanced features of all Unicode locales include support for the Unicode 3.0 character set, correct rendering of complex text layout scripts, native Asian input methods, more MIME character sets in dtmail, various new iconv code conversions, and an enhanced PostScript print filter.

All Unicode locales in the Solaris operating environment support multiple scripts. Thirteen input modes are available: English/European, Cyrillic, Greek, Arabic, Hebrew, Thai, Unicode Hex, Unicode Octal, Table lookup, Japanese, Korean, Simplified Chinese, and Traditional Chinese.

Users can input characters from any combination of scripts and the entire Unicode coding space.

Note -

To choose an input mode, press the Compose key and a two-letter code. For example, to input text in Thai, press Compose+tt. Alternatively, click the status area and select an input mode as shown in Figure 3-1. (To select the default English/European mode, press Control+Space.)

Table 3-1 UTF-8 Input Mode two-letter codes


Input mode	Code
Cyrillic	`cc`
Greek	`gg`
Thai	`tt`
Arabic	`ar`
Hebrew	`hh`
Unicode Hex	`uh`
Unicode Octal	`uo`
Table lookup	`ll`
Japanese	`ja`
Korean	`ko`
Simplified Chinese	`sc`
Traditional Chinese	`tc`
English/European	`Control+Space`

Figure 3-1 UTF-8 Input Mode selection

To input text from a lookup table, select the Table lookup input mode. A lookup table with all input modes and various symbol and technical codesets appears, as shown in Figure 3-2.

The Table lookup input mode is the easiest way for non-native speakers to input characters in a foreign language: a lookup window displays characters from a selected script, as shown for the Asian input mode in Figure 3-3.

The Arabic, Hebrew, and Thai input modes provide full complex text layout features, including right-to-left display and context-sensitive character rendering. The Unicode octal and hexadecimal code input modes generate Unicode characters from their octal and hexadecimal equivalents, respectively.
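The mapping performed by the Unicode hexadecimal and octal input modes can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Solaris input method itself; the function name `char_from_code` is hypothetical.

```python
def char_from_code(digits: str, base: int) -> str:
    """Convert hex (base 16) or octal (base 8) digits to one Unicode character."""
    code_point = int(digits, base)
    # Unicode code points run from U+0000 to U+10FFFF.
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"{digits!r} is outside the Unicode code space")
    return chr(code_point)


# The euro sign is U+20AC, which is 20254 in octal.
euro_from_hex = char_from_code("20AC", 16)
euro_from_octal = char_from_code("20254", 8)
print(euro_from_hex, euro_from_octal)
```

Both calls yield the same character because the two digit strings name the same code point in different bases.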

The Japanese, Korean, Simplified Chinese, and Traditional Chinese input modes provide full native Asian input.

Figure 3-2 UTF-8 Table Lookup

Figure 3-3 Asian input mode

For more information on each input method, refer to the chapter "Overview of en_US.UTF-8 Locale Support" in the latest Solaris International Language Environments Guide, and to the ATOK12 User's Guide, Wnn6 User's Guide, cs00 User's Guide, Korean Solaris User's Guide, Simplified Chinese Solaris User's Guide, and Traditional Chinese Solaris User's Guide.

The Unicode locales can use the enhanced mp(1) printing filter to print text files. mp(1) prints flat text files written in UTF-8 using various Solaris system and printer-resident fonts (such as bitmap, Type 1, and TrueType fonts), depending on the script. The output is standard PostScript. For more information, refer to the mp(1) man page.

The Unicode locales support various MIME character sets in dtmail, including various Latin, Greek, Cyrillic, Thai, and Asian character sets. Examples include ISO-8859-1 through ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15, UTF-8, UTF-7, UTF-16, UTF-16BE, UTF-16LE, Shift_JIS, ISO-2022-JP, EUC-KR, ISO-2022-KR, TIS-620, Big5, GB2312, KOI8-R, KOI8-U, and ISO-2022-CN. With this support, users can send and receive email messages encoded in MIME character sets from almost any region in the world. dtmail automatically decodes email by recognizing the MIME character set and content transfer encoding in the message. The sender specifies the MIME character set for the recipient mail user agent.

Figure 3-4 Multiple character sets in `dtmail`
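The decoding step described for dtmail (read the MIME character set and content transfer encoding, then convert the body) can be approximated with Python's standard `email` package. This is a sketch of the general technique, not dtmail's implementation; the message built by `build_message` is a hypothetical example.

```python
import base64
from email import message_from_bytes


def build_message(text: str, charset: str) -> bytes:
    """Build a minimal, hypothetical MIME message with a base64 body."""
    body = base64.b64encode(text.encode(charset))
    header = (
        "MIME-Version: 1.0\r\n"
        f"Content-Type: text/plain; charset={charset}\r\n"
        "Content-Transfer-Encoding: base64\r\n\r\n"
    ).encode("ascii")
    return header + body + b"\r\n"


def decode_body(raw_message: bytes) -> str:
    """Decode a message body using the charset named in its own headers."""
    msg = message_from_bytes(raw_message)
    charset = msg.get_content_charset() or "us-ascii"
    # decode=True undoes the content transfer encoding (base64 here).
    payload = msg.get_payload(decode=True)
    return payload.decode(charset)


print(decode_body(build_message("Привет", "KOI8-R")))
```

The receiver needs no prior knowledge of the sender's codeset: everything required to recover the text travels in the message headers.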

3.2 Codeset Conversion

The Unicode locales in the Solaris operating environment support enhanced code conversion among the major codesets of several countries. Figure 3-5 shows the codeset conversions between UTF-8 and many other codesets.

Figure 3-5 Unicode codeset conversions

Codesets can be converted using the sdtconvtool utility or the iconv(1) command. sdtconvtool detects available iconv code conversions and presents them in an easy-to-use format.

Figure 3-6 `sdtconvtool` for converting between codesets

Users can also add their own code conversions and use them in iconv(3) functions, iconv(1) command line utilities, and sdtconvtool(1). For more information on user-extensible, user-defined code conversions, refer to the geniconvtbl(1) and geniconvtbl(4) man pages.

Developers can use iconv(3) to access the same functionality. This includes conversions between UTF-8 and many other codesets, including UCS-2, UCS-4, UTF-7, UTF-16, KOI8-R, Japanese EUC, Korean EUC, Simplified Chinese EUC, Traditional Chinese EUC, GBK, PCK (Shift JIS), BIG5, Johab, ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.

For a detailed listing of the supported code conversions, see Appendix A, Codeset Conversions.
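The idea behind an iconv-style conversion (decode from one codeset, re-encode into another) can be sketched with Python's built-in codecs. This does not call iconv(3); the codec names used are Python's own, and the helper `convert` is hypothetical.

```python
def convert(data: bytes, from_code: str, to_code: str) -> bytes:
    """Re-encode bytes from one codeset to another, in the style of iconv."""
    return data.decode(from_code).encode(to_code)


utf8_text = "日本語".encode("utf-8")

# UTF-8 -> Japanese EUC -> PCK (Shift JIS) -> UTF-8
euc = convert(utf8_text, "utf-8", "euc_jp")
sjis = convert(euc, "euc_jp", "shift_jis")
back = convert(sjis, "shift_jis", "utf-8")

print(len(utf8_text), len(euc), len(sjis), back == utf8_text)
```

Because every character in the sample exists in all three codesets, the round trip returns the original bytes. A character missing from the target codeset would raise an error in this sketch.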

3.3 European Unicode Locales

In the Solaris 8 operating environment, five European Unicode locales offer the same level of support as en_US.UTF-8 with modifications for language and cultural data.

The five European Unicode locales are:

fr_FR.UTF-8 (French)
de_DE.UTF-8 (German)
it_IT.UTF-8 (Italian)
es_ES.UTF-8 (Spanish)
sv_SE.UTF-8 (Swedish)

Each locale contains the same feature set as en_US.UTF-8 and regional definitions for numeric notation, date and time, currency, and translated text messages.

The following additional five European locales support the Euro currency symbol and monetary formatting conventions:

fr_FR.UTF-8@euro (French with euro monetary convention)
de_DE.UTF-8@euro (German with euro monetary convention)
it_IT.UTF-8@euro (Italian with euro monetary convention)
es_ES.UTF-8@euro (Spanish with euro monetary convention)
sv_SE.UTF-8@euro (Swedish with euro monetary convention)
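The locale names listed above follow the common `language_TERRITORY.codeset@modifier` pattern. The following Python sketch splits a name into those parts. The regular expression covers only names of the shape shown in this chapter, and `parse_locale` is a hypothetical helper, not a Solaris interface.

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    language: str
    territory: str
    codeset: str
    modifier: Optional[str]


# Matches names such as "fr_FR.UTF-8" and "fr_FR.UTF-8@euro".
_PATTERN = re.compile(r"^([a-z]{2})_([A-Z]{2})\.([A-Za-z0-9-]+)(?:@(\w+))?$")


def parse_locale(name: str) -> LocaleName:
    """Split a locale name into language, territory, codeset, and modifier."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized locale name: {name!r}")
    return LocaleName(*match.groups())


print(parse_locale("fr_FR.UTF-8@euro"))
print(parse_locale("sv_SE.UTF-8"))
```

The `@euro` variants differ from their base locales only in the modifier field, which matches the description of them as the same locale with a different monetary convention.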

Note -

All Unicode locales (including en_US.UTF-8 and Asian Unicode locales) support input and output of the new euro currency symbol.

Figure 3-7 Euro currency symbol
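As a small illustration of why the euro symbol is available in every UTF-8 locale but not in every older codeset, this Python sketch checks how the symbol (U+20AC) encodes in a few codesets. The helper `encodable` is hypothetical, and the codec names are Python's own.

```python
EURO = "\u20ac"  # EURO SIGN


def encodable(ch: str, codeset: str) -> bool:
    """Report whether a character can be represented in the given codeset."""
    try:
        ch.encode(codeset)
        return True
    except UnicodeEncodeError:
        return False


for codeset in ("utf-8", "iso8859_15", "iso8859_1"):
    print(codeset, encodable(EURO, codeset))
```

UTF-8 represents the symbol as three bytes, ISO-8859-15 assigns it a single byte, and ISO-8859-1 has no position for it at all.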

3.4 Asian Unicode Locales

The Solaris 8 operating environment also supports four Unicode locales with the same scope as en_US.UTF-8 and the European Unicode locales, with the necessary language and cultural modifications:

ja_JP.UTF-8 (Japanese)
ko_KR.UTF-8 (Korean)
zh_CN.UTF-8 (Simplified Chinese)
zh_TW.UTF-8 (Traditional Chinese)

Each Asian Unicode locale is tailored to the Asian customer's needs. For example, the Japanese Unicode locale supports additional characters from JIS X0212-1990 at the presentation layer. All existing native Asian input methods and systems are also transparently supported.

3.5 Unicode Font Resources

The Unicode Standard, Version 3.0, contains 49,194 characters from the world's scripts, with over 25,000 ideographic characters for Chinese, Japanese, and Korean. The font resources that represent these characters, however, do not always map one to one: some Unicode code points are associated with multiple, different glyphs, so that a code point can be rendered correctly for its context. For example, the unified Han ideographs are written and displayed differently in Simplified Chinese, Traditional Chinese, Japanese kanji, and Korean hanja.

To manage these difficulties, the Solaris operating environment contains an output method combining existing fonts to form a Unicode font set, instead of providing a single Unicode font. The Solaris 8 operating environment supports the following range of scripts:

English/European
Greek, Turkish, Cyrillic
Arabic, Hebrew, Thai
Simplified Chinese, Traditional Chinese, Japanese, Korean

For European scripts, there is a one-to-one mapping between Unicode characters and corresponding glyphs. For Complex Text Layout language text (Arabic, Hebrew, Thai), the Solaris Universal Multiscript Layout Engine pre-processes the text (right-to-left swapping, contextual analysis, and so on) before rendering the associated glyphs.
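One input to this kind of pre-processing is the directionality of each character. The Python sketch below uses the Unicode bidirectional class from the standard `unicodedata` module to classify characters. It shows only that first classification step, under the assumption that a layout engine starts from similar data; it is not the Solaris layout engine, and the function name `direction` is hypothetical.

```python
import unicodedata


def direction(ch: str) -> str:
    """Classify one character by its Unicode bidirectional class."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):  # Hebrew-type and Arabic-type letters
        return "right-to-left"
    if bidi == "L":
        return "left-to-right"
    return "neutral or other"


for ch in ("A", "\u05d0", "\u0627", "\u0e01", " "):
    print(f"U+{ord(ch):04X} {direction(ch)}")
```

Hebrew and Arabic letters are right-to-left, while Thai letters are left-to-right. Thai is a complex text layout script because of how its vowel and tone marks combine with base characters, not because of its direction.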

For Asian characters, the Solaris operating environment output methods provide dynamic remapping of the font and glyph index according to the locale definition. Each locale contains a font table with mapping mechanisms specifying which font and glyph to use for each character code. The mechanism remaps the Unicode code point values to existing Chinese, Japanese, and Korean fonts and glyph index pairs. A locale administrator can define the sort priority among fonts. For example, the mechanism may search the Simplified Chinese fonts for the appropriate glyph and then search the Traditional Chinese fonts, and so on.
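The priority-ordered lookup described above can be sketched as follows. The font names, glyph indexes, and table layout are all hypothetical and do not reflect the actual Solaris font table format; the sketch only shows how a search order determines which font supplies a glyph.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical font tables: font name -> {Unicode code point: glyph index}.
FONT_TABLES: Dict[str, Dict[int, int]] = {
    "zh_CN-font": {0x4E2D: 101, 0x56FD: 102},
    "zh_TW-font": {0x4E2D: 201, 0x570B: 202},
    "ja-font": {0x4E2D: 301, 0x3042: 302},
}


def find_glyph(code_point: int, priority: List[str]) -> Optional[Tuple[str, int]]:
    """Return the first (font, glyph index) pair found in priority order."""
    for font in priority:
        glyph = FONT_TABLES.get(font, {}).get(code_point)
        if glyph is not None:
            return font, glyph
    return None  # no font in the set covers this code point


zh_cn_order = ["zh_CN-font", "zh_TW-font", "ja-font"]
print(find_glyph(0x4E2D, zh_cn_order))  # covered by the first font
print(find_glyph(0x570B, zh_cn_order))  # falls through to the second font
```

Changing the order of the priority list changes which font renders a code point that several fonts share, which is how the same Unicode text can be displayed with locale-appropriate glyphs.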