Asian Application Developer's Guide

Chapter 2 Language Environment and Character Codes

Asian Solaris software enables you to switch between different languages, or character sets, and can represent very large character sets. Some operating systems use the 7-bit ASCII codeset to represent English language characters. It has become a common assumption that characters equal bytes. Because the Korean and Chinese character sets are very large, more than one byte is required to represent each character. Asian Solaris software provides a mechanism for specifying multiple codesets. These codesets may contain characters of more than one byte. Character sets and codesets are defined as follows:

For example, the ASCII codeset contains a bit representation for each uppercase and lowercase alphabetic letter as well as punctuation, numbers, and control codes.

Extended UNIX Code (EUC)

Asian Solaris software implements Extended UNIX® Code (EUC) specified by the SVR4 Multi-National Language Supplement (MNLS), which follows the pattern of ISO 2022 standards. Four single-byte and multibyte codesets can be represented in EUC at both the process level and the file level.

EUC is used both as the file code for storing data and internally, in the CPU and in memory. An EUC character is composed of one or more bytes and may be preceded by a single-shift character.

EUC Definition

EUC is composed of one primary codeset and three supplementary codesets. The primary codeset, codeset 0, is used for ASCII. The three supplementary codesets (codesets 1, 2, and 3) can be assigned to different character sets by the user. There is a system default assignment for these codesets.

The primary codeset is defined to use single bytes with the most significant bit (MSB) set to zero. The supplementary codesets can use multiple bytes, and the MSB of each byte is set to one. Characters in codesets 2 and 3 are preceded by a single-shift character: SS2 (0x8E) for codeset 2 and SS3 (0x8F) for codeset 3. Differentiating between codesets is done as follows: If the MSB is 0, the code is one-byte ASCII. If the MSB is 1, the byte is compared against SS2 and SS3 to determine which codeset the character belongs to; a byte that is neither belongs to codeset 1. The length in bytes of characters from that codeset is retrieved from an ANSI localization table governing character classification, and that number of bytes is read in.

Table 2-1 EUC Codeset Representations

Codeset      EUC Representation

codeset 0    0xxxxxxx

codeset 1    1xxxxxxx -or-
             1xxxxxxx 1xxxxxxx -or-
             1xxxxxxx 1xxxxxxx 1xxxxxxx

codeset 2    SS2 1xxxxxxx -or-
             SS2 1xxxxxxx 1xxxxxxx -or-
             SS2 1xxxxxxx 1xxxxxxx 1xxxxxxx

codeset 3    SS3 1xxxxxxx -or-
             SS3 1xxxxxxx 1xxxxxxx -or-
             SS3 1xxxxxxx 1xxxxxxx 1xxxxxxx
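
The codeset test described above can be sketched in C. This is only an illustration, not the Solaris implementation: the function names euc_codeset and euc_count_chars are hypothetical, and the width array stands in for the per-codeset character lengths that the system would read from the locale's character classification table. As a simplification, any byte with the MSB set that is not SS2 or SS3 is treated as the start of a codeset 1 character.

```c
#include <assert.h>
#include <stddef.h>

#define SS2 0x8E  /* single-shift 2: introduces a codeset 2 character */
#define SS3 0x8F  /* single-shift 3: introduces a codeset 3 character */

/* Return the EUC codeset (0-3) selected by the first byte of a character. */
static int euc_codeset(unsigned char first)
{
    if ((first & 0x80) == 0)
        return 0;               /* MSB clear: one-byte ASCII */
    if (first == SS2)
        return 2;
    if (first == SS3)
        return 3;
    return 1;                   /* MSB set, no single shift */
}

/*
 * Count the characters in an EUC byte string of len bytes.
 * width[k] is the ASSUMED number of bytes per character in codeset k,
 * not counting the SS2/SS3 prefix (hypothetical stand-in for the
 * locale table). Returns (size_t)-1 on truncated or malformed input.
 */
static size_t euc_count_chars(const unsigned char *s, size_t len,
                              const size_t width[4])
{
    size_t i = 0, n = 0, k, need;
    int cs;

    assert(s != NULL && width != NULL);
    while (i < len) {
        cs = euc_codeset(s[i]);
        need = width[cs];
        if (cs >= 2)
            i++;                            /* skip the single-shift byte */
        if (need == 0 || need > len - i)
            return (size_t)-1;              /* character is cut short */
        for (k = 0; k < need; k++) {
            /* every byte of a supplementary character has its MSB set */
            if (cs != 0 && (s[i + k] & 0x80) == 0)
                return (size_t)-1;
        }
        i += need;
        n++;
    }
    return n;
}
```

With assumed widths of 1, 2, 1, and 2 bytes for codesets 0 through 3, the eight-byte sequence 0x41 0xB0 0xA1 0x8E 0xB1 0x8F 0xA1 0xA2 counts as four characters, one from each codeset.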

EUC Special Characters

In accordance with ISO 2022 and ISO 6937/3, EUC divides the codeset space into graphic and special characters. Graphic characters are those that have a glyph or shape that can be displayed. Special characters include control characters, unassigned characters, and the Space and Delete characters. Control characters are characters, other than graphic characters, whose occurrence in a particular context initiates, modifies, or stops a control operation.

Table 2-2 Single-Byte Special Character Representations

Special Character                EUC Representation

Space                            00100000

Delete                           01111111

Control codes (Primary)          000xxxxx

Control codes (Supplementary)    100xxxxx
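
The bit patterns in Table 2-2 translate directly into mask tests. The following C fragment is a minimal sketch of that classification; the enum and the function name euc_classify_byte are hypothetical and are not part of any Solaris interface. Bytes that match none of the patterns are reported as "other", since whether they are graphic or unassigned depends on the codeset in use.

```c
/* Classes of single bytes taken from Table 2-2 (names are illustrative). */
enum euc_byte_class {
    EUC_SPACE,                  /* 00100000 */
    EUC_DELETE,                 /* 01111111 */
    EUC_CONTROL_PRIMARY,        /* 000xxxxx */
    EUC_CONTROL_SUPPLEMENTARY,  /* 100xxxxx */
    EUC_OTHER                   /* graphic or unassigned; not decided here */
};

/* Classify one byte by comparing it against the patterns in the table. */
static enum euc_byte_class euc_classify_byte(unsigned char b)
{
    if (b == 0x20)
        return EUC_SPACE;
    if (b == 0x7F)
        return EUC_DELETE;
    if ((b & 0xE0) == 0x00)     /* top three bits are 000 */
        return EUC_CONTROL_PRIMARY;
    if ((b & 0xE0) == 0x80)     /* top three bits are 100 */
        return EUC_CONTROL_SUPPLEMENTARY;
    return EUC_OTHER;
}
```

Note that SS2 (0x8E) and SS3 (0x8F) both match the 100xxxxx pattern, so this sketch reports them as supplementary control codes.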

Wide Character (WC)

The wide character (WC) is defined in Asian Solaris software to be a constant-width four-byte code. It provides a standard character size, which is useful in indexing, interprocess communication, memory management, and other tasks that use character counts and known array sizes.


Note -

Wide characters are intended for internal processing only. Applications should not depend on the wide character implementation, but use standard library APIs to handle wide characters.
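
As an illustration of relying on the standard library rather than on the wide character layout, the following sketch uses only the ISO C functions mblen() and mbstowcs(). The helper names count_characters and to_wide are hypothetical. An application would normally call setlocale(LC_CTYPE, "") first so that these functions follow the user's locale; without that call they operate in the default "C" locale.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Count characters (not bytes) in a multibyte string, using the
 * current locale. Returns (size_t)-1 on an invalid sequence.
 */
static size_t count_characters(const char *mbs)
{
    size_t n = 0;
    int len;

    assert(mbs != NULL);
    (void)mblen(NULL, 0);               /* reset any shift state */
    while (*mbs != '\0') {
        len = mblen(mbs, MB_CUR_MAX);   /* bytes in the next character */
        if (len < 0)
            return (size_t)-1;
        mbs += len;
        n++;
    }
    return n;
}

/*
 * Convert a multibyte string to wide characters for internal processing.
 * out must have room for cap wchar_t elements; the result is always
 * terminated. Returns the number of wide characters stored, or
 * (size_t)-1 on an invalid sequence.
 */
static size_t to_wide(const char *mbs, wchar_t *out, size_t cap)
{
    size_t n;

    assert(mbs != NULL && out != NULL && cap > 0);
    n = mbstowcs(out, mbs, cap - 1);
    if (n == (size_t)-1)
        return n;
    out[n] = L'\0';
    return n;
}
```

Because the code never inspects the bits of a wchar_t value, it does not depend on how the wide character is laid out internally.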


Korean Solaris Supported Character Sets

Three types of coding conventions are currently supported in the Korean Solaris software:

Korean Solaris software provides code conversion between these Korean code conventions at three levels of support:

The ko.UTF-8 Locale

The Korean government announced the standard Korean codeset KS C 5700, which is based on Unicode 2.0. KS C 5700 will be widely used in the Korean market, replacing the previous standard, KS C 5601, which is based on ISO 2022.

To comply with this new standard, the ko.UTF-8 locale was developed. UTF-8 is the file-system-safe Universal Character Set Transformation Format, an encoding of ISO 10646-1/Unicode 2.0.

ko.UTF-8 supports all the characters of KS C 5601 and 11,172 characters from Johap. ko.UTF-8 supports all Korean-related Unicode 2.0 characters and fonts. All Unicode characters can be accepted and processed, but some cannot be correctly displayed because of input and output limitations.

ko.UTF-8 supports the following subset of Unicode:

Simplified Chinese Solaris Supported Character Sets

Simplified Chinese Solaris supports the PRC Chinese national standard character set (GB 2312-80). GB 2312-80 consists of 7,445 characters: 3,755 level-1 Hanzi characters, 3,008 level-2 Hanzi characters and Hanzi radicals, Roman characters, Greek and Cyrillic characters, Arabic and Roman numerals, and miscellaneous symbols.

Chinese Solaris software provides code conversion between Chinese code conventions at two levels of support:

Traditional Chinese Solaris Supported Character Sets

Traditional Chinese Solaris supports the Taiwanese Chinese National Standard CNS 11643-1992 and Big5 character sets. CNS 11643-1992 is a Chinese national standard in Taiwan. It defines 16 planes:

Big5 was defined by five major Taiwanese computer vendors (including the Institute of Information Industry) in May of 1984. Although Big5 is not the national standard, it is more widely used than CNS 11643-1992.

The total number of characters defined in Big5 is 13,523. It is a subset of CNS 11643-1992.

Traditional Chinese Solaris software provides code conversion between Chinese code conventions at three levels of support: