Asian Application Developer's Guide

Chapter 2 Language Environment and Character Codes

Asian Solaris software lets you switch between languages and their character sets, and it can represent very large character sets. Many operating systems use the 7-bit ASCII codeset to represent English-language characters, which has led to the common assumption that one character equals one byte. The Korean and Chinese character sets are far too large for that: more than one byte is required to represent each character. Asian Solaris software therefore provides a mechanism for specifying multiple codesets, and these codesets may contain characters of more than one byte. Character sets and codesets are defined as follows:

A character set is a set of elements used for the organization, control, or representation of data. Character sets may be composed of alphabets, ideograms, or other units.
A codeset, or coded character set, is a set of unambiguous rules that establishes a character set and the one-to-one relationship between each character in the character set and its bit representation.

For example, the ASCII codeset contains a bit representation for each uppercase and lowercase alphabetic letter as well as punctuation, numbers, and control codes.

Extended UNIX Code (EUC)

Asian Solaris software implements Extended UNIX® Code (EUC) specified by the SVR4 Multi-National Language Supplement (MNLS), which follows the pattern of ISO 2022 standards. Four single-byte and multibyte codesets can be represented in EUC at both the process level and the file level.

EUC is used both as the file code for storing data and internally, in the CPU and in memory. Each EUC character is composed of one or more bytes and may be preceded by a single-shift character.

EUC Definition

EUC is composed of one primary codeset and three supplementary codesets. The primary codeset, codeset 0, is used for ASCII. The three supplementary codesets (codesets 1, 2, and 3) can be assigned to different character sets by the user. There is a system default assignment for these codesets.

The primary codeset uses single bytes with the most significant bit (MSB) set to zero. The supplementary codesets can use multiple bytes, and the MSB of each byte is set to one. Characters in codesets 2 and 3 are preceded by a single-shift character: SS2 (0x8E) for codeset 2 and SS3 (0x8F) for codeset 3. Codesets are differentiated as follows: If the MSB of a byte is 0, the code is one-byte ASCII. If the MSB is 1, the byte is compared with SS2 and SS3 to determine which codeset the character belongs to; a byte that is neither single-shift character begins a codeset 1 character. The length in bytes of characters in that codeset is then retrieved from the ANSI localization table governing character classification, and that number of bytes is read in.

Table 2-1 EUC Codeset Representations


Codeset	EUC Representation
codeset 0	0xxxxxxx
codeset 1	1xxxxxxx -or- 1xxxxxxx 1xxxxxxx -or- 1xxxxxxx 1xxxxxxx 1xxxxxxx
codeset 2	SS2 1xxxxxxx -or- SS2 1xxxxxxx 1xxxxxxx -or- SS2 1xxxxxxx 1xxxxxxx 1xxxxxxx
codeset 3	SS3 1xxxxxxx -or- SS3 1xxxxxxx 1xxxxxxx -or- SS3 1xxxxxxx 1xxxxxxx 1xxxxxxx
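
The differentiation rule and Table 2-1 can be sketched in C as follows. This is a minimal illustration, not the Solaris implementation: the function names euc_codeset and euc_char_len are hypothetical, and the width[] array stands in for the per-locale table that supplies the character length for each supplementary codeset.

```c
#include <stddef.h>

/* Single-shift bytes as given in the text. */
#define SS2 0x8E
#define SS3 0x8F

/*
 * Return the EUC codeset (0-3) of a character, judged from its first
 * byte. Hypothetical helper, not a Solaris library function. Other
 * supplementary control codes (0x80-0x9F) are not treated specially.
 */
int euc_codeset(unsigned char first)
{
    if ((first & 0x80) == 0)
        return 0;               /* MSB clear: one-byte ASCII */
    if (first == SS2)
        return 2;               /* single-shift 2 prefix */
    if (first == SS3)
        return 3;               /* single-shift 3 prefix */
    return 1;                   /* MSB set, no single shift */
}

/*
 * Total length in bytes of the character starting at s, including any
 * single-shift byte. width[] is an assumed per-locale table giving the
 * bytes per character for codesets 1, 2, and 3 (index 0 is unused).
 */
size_t euc_char_len(const unsigned char *s, const size_t width[4])
{
    int cs = euc_codeset(s[0]);

    if (cs == 0)
        return 1;
    if (cs == 1)
        return width[1];
    return 1 + width[cs];       /* SS2 or SS3 plus the character bytes */
}
```

With an assumed table of {0, 2, 1, 2}, the sequence SS3 0xA1 0xA1 is reported as one three-byte character in codeset 3.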

EUC Special Characters

In accord with ISO 2022 and ISO 6937/3, EUC divides the codeset space into graphic and special characters. Graphic characters are those that have a glyph or shape that can be displayed. Special characters include Control characters, unassigned characters, and the Space and Delete characters. Control characters are characters, other than graphic characters, whose occurrence in a particular context initiates, modifies, or stops a control operation.

Table 2-2 Single-Byte Special Character Representations


Special Character	EUC Representation
Space	00100000
Delete	01111111
Control codes (Primary)	000xxxxx
Control codes (Supplementary)	100xxxxx
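
The bit patterns in Table 2-2 translate directly into mask tests. The sketch below is illustrative only; the enum and the function name euc_special_class are hypothetical, and unassigned characters are not detected because the table gives no bit pattern for them.

```c
/* Classes of single-byte special characters from Table 2-2. */
enum euc_special {
    EUC_NOT_SPECIAL,
    EUC_SPACE,
    EUC_DELETE,
    EUC_CONTROL_PRIMARY,
    EUC_CONTROL_SUPPLEMENTARY
};

/* Classify one byte against the patterns in Table 2-2. */
enum euc_special euc_special_class(unsigned char b)
{
    if (b == 0x20)
        return EUC_SPACE;                   /* 00100000 */
    if (b == 0x7F)
        return EUC_DELETE;                  /* 01111111 */
    if ((b & 0xE0) == 0x00)
        return EUC_CONTROL_PRIMARY;         /* 000xxxxx */
    if ((b & 0xE0) == 0x80)
        return EUC_CONTROL_SUPPLEMENTARY;   /* 100xxxxx */
    return EUC_NOT_SPECIAL;
}
```

Note that SS2 (0x8E) and SS3 (0x8F) both match the 100xxxxx pattern, so this sketch reports them as supplementary control codes.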

Wide Character (WC)

The wide character (WC) is defined in Asian Solaris software to be a constant-width four-byte code. It provides a standard character size, which is useful in indexing, interprocess communication, memory management, and other tasks that use character counts and known array sizes.

Note - Wide characters are intended for internal processing only. Applications should not depend on the wide character implementation, but should instead use standard library APIs to handle wide characters.
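
As an illustration of relying on the standard library rather than on the wide character layout, the following sketch counts characters (not bytes) in a multibyte string with the ISO C function mbrtowc(). The helper name mb_char_count is hypothetical. It assumes the caller has already selected the locale, for example with setlocale(LC_CTYPE, ""); without that call the default "C" locale is in effect.

```c
#include <string.h>
#include <wchar.h>

/*
 * Count the characters in the multibyte string s under the current
 * LC_CTYPE locale. Returns (size_t)-1 if s contains an invalid or
 * truncated multibyte sequence.
 */
size_t mb_char_count(const char *s)
{
    mbstate_t st;
    size_t len = strlen(s);
    size_t n = 0;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (len > 0) {
        wchar_t wc;
        size_t k = mbrtowc(&wc, s, len, &st);

        if (k == (size_t)-1 || k == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete character */
        if (k == 0)
            break;                      /* embedded null character */
        s += k;
        len -= k;
        n++;
    }
    return n;
}
```

The code never inspects the value or size of wchar_t, so it does not depend on the four-byte representation described above.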

Korean Solaris Supported Character Sets

Four types of coding conventions are currently supported in the Korean Solaris software:

N-byte code. This single-byte code has each byte represent a consonant or vowel. These are combined together to build Hangul characters.
Johap or Packed code. This two-byte code consists of a leading bit followed by three 5-bit fields. These three fields contain the codes for a leading consonant, followed by a vowel, followed by a final consonant (if any) for a Hangul character. This two-byte code is specified in Korean Industry Standard KS C 5601-1992.
Wansung code. This two-byte code is specified in Korean Industry Standard KS C 5601-1987 for Hangul, Hanja, and other characters. In the Korean Solaris software these KS C 5601-1987 characters are in EUC codeset 1.
ko.UTF-8. This is the Korean locale for the Universal Multiple-Octet Coded Character Set (UCS) Transformation Format. See "The ko.UTF-8 Locale" for further information.
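
The Johap layout described above (one leading bit followed by three 5-bit fields) can be unpacked with shifts and masks. The sketch below is an illustration under two assumptions that the text does not state: the first byte holds the most significant bits, and the fields appear in the order leading consonant, vowel, final consonant from high bits to low. The struct and function names are hypothetical, and the mapping from field values to individual Jamo is not shown.

```c
/* The four fields of a two-byte Johap (packed) code. */
struct johap_fields {
    unsigned lead_bit;   /* bit 15 */
    unsigned initial;    /* bits 14-10: leading consonant code */
    unsigned vowel;      /* bits 9-5: vowel code */
    unsigned final;      /* bits 4-0: final consonant code, if any */
};

/* Split a two-byte Johap code into its fields (hi byte first, assumed). */
struct johap_fields johap_unpack(unsigned char hi, unsigned char lo)
{
    unsigned code = ((unsigned)hi << 8) | lo;
    struct johap_fields f;

    f.lead_bit = (code >> 15) & 0x01;
    f.initial  = (code >> 10) & 0x1F;
    f.vowel    = (code >> 5) & 0x1F;
    f.final    = code & 0x1F;
    return f;
}
```

For example, the byte pair 0x88 0x61 unpacks to a leading bit of 1 and field values 2, 3, and 1.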

Korean Solaris software provides code conversion between these four Korean code conventions at three levels of support:

User commands support file transfers for existing files in different codes.
Library functions support application development for existing codes.
STREAMS modules support existing TTY devices using different codes.

The ko.UTF-8 Locale

The Korean government announced the standard Korean codeset KS C 5700, which is based on Unicode 2.0. KS C 5700 will be widely used in the Korean market, replacing the previous standard, KS C 5601, which is based on ISO 2022.

To comply with this new standard, the ko.UTF-8 locale was developed. UTF-8 is the file-system-safe UCS (Universal Character Set) Transformation Format, an encoding of ISO 10646-1/Unicode 2.0.

ko.UTF-8 supports all the characters of KS C 5601 and 11,172 characters from Johap. ko.UTF-8 supports all Korean-related Unicode 2.0 characters and fonts. All Unicode characters can be accepted and processed, but some cannot be correctly displayed because of input and output limitations.

ko.UTF-8 supports the following subset of Unicode:

Basic Latin and Latin-1 (190 characters) - Row 00 of BMP (Basic Multilingual Plane)
Symbolic characters, including the box (line) drawing characters that are defined in KS C 5601 - Row 20 to Row 27, and Row 32 of BMP
Numerals that are defined in KS C 5601 (20 characters) - Row 21 and Row FF of BMP
Roman, Greek, Japanese, and Cyrillic alphabet characters that are defined in KS C 5601 (362 characters) - Row 03, Row 04, Row 30 and Row FF of BMP
Jamo (Hangul alphabet) characters (94 characters) - Row 31 of BMP
Pre-composed Hangul syllables (11,172 characters) - From Row AC to Row D7 of BMP
Hanja characters defined in KS C 5601 (4,888 characters) - From Row 4E to Row 9F and from Row F9 to Row FA of BMP
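
The row numbers in the list above are the high byte of a BMP code point, and the 11,172 pre-composed Hangul syllables occupy the contiguous range U+AC00 through U+D7A3. The following sketch shows the row computation, a range test for the syllable block, and the standard UTF-8 encoding of a BMP code point. The function names are hypothetical and are not part of any Solaris API.

```c
#include <stddef.h>

#define HANGUL_BASE  0xAC00UL
#define HANGUL_COUNT 11172UL    /* 19 * 21 * 28 syllables */

/* Row of a BMP code point: the high byte of its 16-bit value. */
unsigned bmp_row(unsigned long cp)
{
    return (unsigned)((cp >> 8) & 0xFF);
}

/* Nonzero if cp is one of the pre-composed Hangul syllables. */
int is_precomposed_hangul(unsigned long cp)
{
    return cp >= HANGUL_BASE && cp < HANGUL_BASE + HANGUL_COUNT;
}

/*
 * Encode a BMP code point as UTF-8 into out[]. Returns the number of
 * bytes written (1 to 3), or 0 if cp is outside the BMP or is a
 * surrogate value.
 */
size_t utf8_encode_bmp(unsigned long cp, unsigned char out[3])
{
    if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Every pre-composed Hangul syllable therefore takes three bytes in UTF-8; U+AC00, the first syllable in Row AC, encodes as 0xEA 0xB0 0x80.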

Simplified Chinese Solaris Supported Character Sets

Simplified Chinese Solaris supports the PRC Chinese national standard character set (GB 2312-80). GB 2312-80 consists of 7,445 characters: 3,755 level-1 Hanzi characters, 3,008 level-2 Hanzi characters and Hanzi radicals, Roman characters, Greek and Cyrillic characters, Arabic and Greek numerals, and miscellaneous symbols.

Simplified Chinese Solaris software provides code conversion between Chinese code conventions at two levels of support:

User commands support file transfers for existing files in different codes.
Library functions support application development for existing codes.

Traditional Chinese Solaris Supported Character Sets

Traditional Chinese Solaris supports the CNS 11643-1992 and Big5 character sets. CNS 11643-1992 is the Chinese National Standard in Taiwan. It defines 16 planes:

Plane 1: Miscellaneous symbols, Hanzi radicals, and Roman and Greek alphabets (684 symbol characters in total, in the range 0x2121 to 0x427E), and the 5,401 most commonly used Hanzi characters (range 0x4421 to 0x7D4B).
Plane 2: The 7,650 second most commonly used Hanzi characters (range 0x2121 to 0x7244).
Plane 3: A total of 6,148 other Hanzi characters (range 0x2121 to 0x6246), including some user-defined characters from the original plane 14 and differently shaped characters from the Republic of China's (ROC) Department of Education.
Plane 4: A total of 7,298 characters (range 0x2121 to 0x6E5C), including some of the CJK Unified Han characters defined in ISO/IEC 10646.
Plane 5: A total of 8,603 characters (range 0x2121 to 0x7C51) that the ROC Department of Education defined as currently used but that are not included in planes 1 through 4.
Plane 6: A total of 6,388 characters (range 0x2121 to 0x647A) that the ROC Department of Education defined as differently shaped characters but that are not included in planes 1 through 5.
Plane 7: A total of 6,539 characters (range 0x2121 to 0x6655) that the ROC Department of Education defined as differently shaped characters but that are not included in planes 1 through 6.
Planes 8 to 11: Not yet defined.
Planes 12 to 16: Reserved for user-defined characters.
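
The Hanzi counts quoted for the planes can be cross-checked against their code ranges. The sketch below assumes that each plane is a 94 x 94 grid in which both bytes of a code run from 0x21 to 0x7E, the usual ISO 2022 layout, and that the Hanzi ranges are filled contiguously. The function names are hypothetical.

```c
/* Assumed plane layout: 94 rows of 94 cells, each byte in 0x21..0x7E. */
#define CNS_MIN   0x21
#define CNS_MAX   0x7E
#define CNS_CELLS 94

/* Nonzero if code is a valid two-byte position within a plane. */
int cns_valid_position(unsigned code)
{
    unsigned row = (code >> 8) & 0xFF;
    unsigned cell = code & 0xFF;

    return code <= 0xFFFF &&
           row >= CNS_MIN && row <= CNS_MAX &&
           cell >= CNS_MIN && cell <= CNS_MAX;
}

/*
 * Number of code positions from first through last, inclusive.
 * Returns -1 if either end is not a valid position or first > last.
 */
long cns_positions(unsigned first, unsigned last)
{
    long rows, cells;

    if (!cns_valid_position(first) || !cns_valid_position(last) || first > last)
        return -1;
    rows = (long)(last >> 8) - (long)(first >> 8);
    cells = (long)(last & 0xFF) - (long)(first & 0xFF);
    return rows * CNS_CELLS + cells + 1;
}
```

Under these assumptions, cns_positions(0x4421, 0x7D4B) gives 5,401 and cns_positions(0x2121, 0x7244) gives 7,650, matching the counts for plane 1 Hanzi and plane 2. The plane 1 symbol range is different: 0x2121 to 0x427E spans 3,196 positions but holds only 684 characters, so it is not filled contiguously.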

Big5 was defined by five major Taiwanese computer vendors (including the Institute of Information Industry) in May of 1984. Although Big5 is not the national standard, it is more widely used than CNS 11643-1992.

The total number of characters defined in Big5 is 13,523. It is a subset of CNS 11643-1992.

Traditional Chinese Solaris software provides code conversion between Chinese code conventions at three levels of support:

User commands support file transfers for existing files in different codes.
Library functions support application development for existing codes.
STREAMS modules support existing TTY devices using different codes.