Common Desktop Environment: Internationalization Programmer's Guide

Code Set Structure

Each code set is divided into two principal areas: Graphic Left (GL) and Graphic Right (GR).

The first two columns of each area are reserved by ISO standards for control characters. The terms C0 and C1 are used to denote the control characters for the Graphic Left and Graphic Right areas, respectively.


Note -

The PC code sets use the C1 control area to encode graphic characters.


The remaining six columns of each area are used to encode graphic characters (see Figure 3-1). Graphic characters are considered to be printable characters, while control characters are used by devices and applications to indicate some special function.

Figure 3-1 Code Set Overview
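The column layout described above can be sketched in a few lines of Python. This is only an illustration: it assumes the usual ISO arrangement in which a byte's column is its high-order four bits (columns 0 through F), and the function name code_set_area is hypothetical. It does not apply to the PC code sets, which place graphic characters in the C1 columns.

```python
def code_set_area(byte: int) -> str:
    """Classify one byte value by its column (the high-order four bits)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in the range 0x00-0xFF")
    column = byte >> 4          # assumed layout: 16 columns, 0 through F
    if column <= 0x1:
        return "C0"             # two control columns of the Graphic Left area
    if column <= 0x7:
        return "GL"             # six graphic columns of the Graphic Left area
    if column <= 0x9:
        return "C1"             # two control columns of the Graphic Right area
    return "GR"                 # six graphic columns of the Graphic Right area


for value in (0x0A, 0x41, 0x85, 0xE9):
    print(f"0x{value:02X} -> {code_set_area(value)}")
```

The classification is purely positional. DEL (0x7F), for example, falls in a graphic column even though it is a control character.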

Control Characters

Based on the ISO definition, a control character initiates, modifies, or stops a control operation. A control character is not a graphic character, but can have a graphic representation in some instances. The control characters in the ISO646-IRV character set are present in all supported code sets, and the encoded values of the C0 control characters are consistent throughout the code sets.

Graphic Characters

Each code set can be considered to be divided into one or more character sets, such that each character is given a unique coded value. The ISO standard reserves six columns for encoding characters and does not allow graphic characters to be encoded in the control character columns.

Single-Byte Code Sets

Code sets that use all 8 bits of a byte can support European, Middle Eastern, and other alphabetic languages. Such code sets are called single-byte code sets. A single-byte code set can encode at most 191 characters, not including control characters.
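One way to arrive at the figure of 191 is to count the byte values that fall in the graphic columns. The breakdown below is an assumption rather than something this guide states: it takes the graphic columns to be 0x20-0x7F and 0xA0-0xFF and treats DEL (0x7F) as a control character.

```python
# Graphic positions in an 8-bit code set, assuming the ISO column layout:
# columns 2-7 (0x20-0x7F) and columns A-F (0xA0-0xFF), with DEL (0x7F)
# counted as a control character rather than a graphic one.
DEL = 0x7F

graphic_left = [b for b in range(0x20, 0x80) if b != DEL]
graphic_right = list(range(0xA0, 0x100))

total_graphic = len(graphic_left) + len(graphic_right)
print(len(graphic_left), len(graphic_right), total_graphic)  # 95 96 191
```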

Multibyte Code Sets

The term multibyte code sets is used to refer to all possible code sets regardless of the number of bytes needed to encode any specific character. Because the operating system should be capable of supporting any number of bits to encode a character, a multibyte code set may contain characters that are encoded with 8, 16, 32, or more bits. Even single-byte code sets are considered to be multibyte code sets.

Extended UNIX Code (EUC) Code Set

The EUC code set uses control characters to identify, and so to separate, the characters in some of its character sets. The encoding rules are based on the ISO2022 definition for the encoding of 7-bit and 8-bit data.

The term EUC denotes these general encoding rules. A code set based on EUC conforms to the EUC encoding rules but also identifies the specific character sets associated with the specific instances. For example, eucJP for Japanese refers to the encoding of the JIS characters according to the EUC encoding rules.
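As a rough illustration of one EUC-based code set, the following sketch encodes three characters with Python's built-in euc_jp codec, which is assumed here to stand in for eucJP. The sample characters and labels are chosen only for the example.

```python
samples = {
    "Latin letter A": "A",
    "Hiragana A": "\u3042",
    "Half-width katakana A": "\uff71",
}

# Encode each sample with the euc_jp codec from the standard library.
encoded = {label: text.encode("euc_jp") for label, text in samples.items()}

for label, data in encoded.items():
    print(f"{label}: {data.hex(' ')} ({len(data)} byte(s))")
```

The Latin letter encodes as the single byte 41, the hiragana character as a4 a2, and the half-width katakana character as 8e b1, where the leading 8e is a control character that identifies the character set.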

The first set (CS0) always contains an ISO646 character set. All of the other sets must have the most-significant bit (MSB) set to 1, and they can use any number of bytes to encode the characters. In addition, all characters within a set must have:

Each character in the third set (CS2) is always preceded by the control character SS2 (single-shift 2, 0x8e). Code sets that conform to EUC do not use the SS2 control character other than to identify the third set.

Each character in the fourth set (CS3) is always preceded by the control character SS3 (single-shift 3, 0x8f). Code sets that conform to EUC do not use the SS3 control character other than to identify the fourth set.
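The rules above can be sketched as a small routine that splits an EUC byte string into characters and names the set each one belongs to. This is a minimal sketch, not a complete decoder: the function split_euc and the table EUCJP_WIDTHS are hypothetical, the second set is taken to be CS1, and the byte widths are an assumption modeled on eucJP. Other EUC code sets use different widths, and no attempt is made to reject other C1 control bytes.

```python
SS2 = 0x8E  # single-shift 2, precedes each CS2 character
SS3 = 0x8F  # single-shift 3, precedes each CS3 character

# Bytes per character in each set, NOT counting the SS2/SS3 prefix.
# Assumption: widths modeled on eucJP; other EUC code sets differ.
EUCJP_WIDTHS = {"CS0": 1, "CS1": 2, "CS2": 1, "CS3": 2}


def split_euc(data: bytes, widths=EUCJP_WIDTHS):
    """Split an EUC byte string into (set_name, character_bytes) pairs."""
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead == SS2:
            name, i = "CS2", i + 1      # skip the single-shift prefix
        elif lead == SS3:
            name, i = "CS3", i + 1      # skip the single-shift prefix
        elif lead & 0x80:
            name = "CS1"                # MSB set, no single-shift prefix
        else:
            name = "CS0"                # MSB clear: ISO646 character
        char = data[i:i + widths[name]]
        if len(char) != widths[name]:
            raise ValueError(f"truncated {name} character at offset {i}")
        if name != "CS0" and any(b < 0x80 for b in char):
            raise ValueError(f"{name} byte without MSB set at offset {i}")
        result.append((name, char))
        i += len(char)
    return result


print(split_euc(b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"))
```

For the sample input, the routine reports one character from each set: the byte 41 in CS0, a4 a2 in CS1, b1 in CS2 (after the SS2 prefix), and b0 a1 in CS3 (after the SS3 prefix).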