Asian Application Developer's Guide

Extended UNIX Code (EUC)

Asian Solaris software implements Extended UNIX® Code (EUC) as specified by the SVR4 Multi-National Language Supplement (MNLS), which follows the pattern of the ISO 2022 standard. EUC can represent four codesets, each single-byte or multibyte, at both the process level and the file level.

EUC is used as the file code for stored data and internally for data being processed in memory. Each EUC character is composed of one or more bytes and may be preceded by a single-shift character.

EUC Definition

EUC is composed of one primary codeset and three supplementary codesets. The primary codeset, codeset 0, is used for ASCII. The three supplementary codesets (codesets 1, 2, and 3) can be assigned to different character sets by the user. There is a system default assignment for these codesets.

The primary codeset is defined to use single bytes with the most significant bit (MSB) set to zero. The supplementary codesets can use multiple bytes, and the MSB of each byte is set to one. Characters in codesets 2 and 3 are preceded by a single-shift character: SS2 (0x8E) for codeset 2 and SS3 (0x8F) for codeset 3.

Codesets are differentiated as follows. If the MSB of the first byte is 0, the character is one-byte ASCII. If the MSB is 1, the byte is compared with SS2 and SS3: SS2 indicates codeset 2, SS3 indicates codeset 3, and any other value indicates codeset 1. The length in bytes of characters from that codeset is then retrieved from an ANSI localization table governing character classification, and that number of bytes is read in.

Table 2-1 EUC Codeset Representations

Codeset      EUC Representation

codeset 0    0xxxxxxx

codeset 1    1xxxxxxx -or-
             1xxxxxxx 1xxxxxxx -or-
             1xxxxxxx 1xxxxxxx 1xxxxxxx

codeset 2    SS2 1xxxxxxx -or-
             SS2 1xxxxxxx 1xxxxxxx -or-
             SS2 1xxxxxxx 1xxxxxxx 1xxxxxxx

codeset 3    SS3 1xxxxxxx -or-
             SS3 1xxxxxxx 1xxxxxxx -or-
             SS3 1xxxxxxx 1xxxxxxx 1xxxxxxx
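The codeset test described above can be sketched in C as follows. This is a minimal illustration, not the Solaris implementation: the function names euc_codeset and euc_char_length and the widths array are hypothetical, and the array only stands in for the localization table that supplies per-codeset character lengths.

```c
#include <stddef.h>

#define EUC_SS2 0x8E  /* single-shift 2: introduces a codeset 2 character */
#define EUC_SS3 0x8F  /* single-shift 3: introduces a codeset 3 character */

/* Return the EUC codeset (0 to 3) selected by the first byte of a character.
   Simplification: any byte with the MSB set other than SS2 or SS3 is
   treated as codeset 1, including the other supplementary control codes. */
int euc_codeset(unsigned char first)
{
    if ((first & 0x80) == 0)
        return 0;              /* MSB is 0: one-byte ASCII */
    if (first == EUC_SS2)
        return 2;
    if (first == EUC_SS3)
        return 3;
    return 1;
}

/* Return the total number of bytes to read for the character that starts
   with 'first', including the single-shift byte for codesets 2 and 3.
   widths[i] is an assumed table giving the bytes per character in
   codeset i, not counting the single-shift byte; widths[0] must be 1. */
size_t euc_char_length(unsigned char first, const size_t widths[4])
{
    int codeset = euc_codeset(first);
    size_t shift = (codeset == 2 || codeset == 3) ? 1 : 0;

    return shift + widths[codeset];
}
```

As an example of one possible assignment, a Japanese locale might use widths of 1, 2, 1, and 2 for codesets 0 through 3. With that table, a character introduced by SS2 occupies two bytes in total and a character introduced by SS3 occupies three.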

EUC Special Characters

In accordance with ISO 2022 and ISO 6937/3, EUC divides the codeset space into graphic and special characters. Graphic characters are those that have a glyph or shape that can be displayed. Special characters include control characters, unassigned characters, and the Space and Delete characters. Control characters are characters, other than graphic characters, whose occurrence in a particular context initiates, modifies, or stops a control operation.

Table 2-2 Single-Byte Special Character Representations

Special Character                EUC Representation

Space                            00100000

Delete                           01111111

Control codes (Primary)          000xxxxx

Control codes (Supplementary)    100xxxxx
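The bit patterns in Table 2-2 translate directly into mask tests. The following C sketch classifies a single byte on that basis. The enum and function names are hypothetical, and the sketch cannot identify unassigned characters, because those depend on the character set assigned to each codeset.

```c
/* Classes of single-byte special characters, following Table 2-2.
   EUC_OTHER covers every remaining byte (graphic or unassigned). */
enum euc_byte_class {
    EUC_OTHER,
    EUC_SPACE,
    EUC_DELETE,
    EUC_CONTROL_PRIMARY,
    EUC_CONTROL_SUPPLEMENTARY
};

enum euc_byte_class euc_classify_byte(unsigned char b)
{
    if (b == 0x20)
        return EUC_SPACE;                  /* 00100000 */
    if (b == 0x7F)
        return EUC_DELETE;                 /* 01111111 */
    if ((b & 0xE0) == 0x00)
        return EUC_CONTROL_PRIMARY;        /* 000xxxxx: 0x00 to 0x1F */
    if ((b & 0xE0) == 0x80)
        return EUC_CONTROL_SUPPLEMENTARY;  /* 100xxxxx: 0x80 to 0x9F */
    return EUC_OTHER;
}
```

Note that the single-shift characters SS2 (0x8E) and SS3 (0x8F) fall in the supplementary control range, so this function reports them as supplementary control codes.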

Wide Character (WC)

The wide character (WC) is defined in Asian Solaris software to be a constant-width four-byte code. It provides a standard character size, which is useful in indexing, interprocess communication, memory management, and other tasks that use character counts and known array sizes.


Note - Wide characters are intended for internal processing only. Applications should not depend on the wide character implementation; they should use standard library APIs to handle wide characters.
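As one illustration of relying on standard library APIs rather than on the wide character layout, the following C sketch counts the characters in a multibyte string by converting them one at a time with the standard mbrtowc function. The function name count_characters is hypothetical. The result depends on the locale in effect, which an application would normally select by calling setlocale before processing text.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters (not bytes) in a multibyte string under the
   current locale. Returns (size_t)-1 if the string contains an invalid
   or incomplete multibyte sequence. */
size_t count_characters(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);       /* start in the initial state */

    while (remaining > 0) {
        wchar_t wc;                        /* value is never inspected */
        size_t n = mbrtowc(&wc, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;             /* invalid or truncated input */
        if (n == 0)
            break;                         /* null character reached */

        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

The code never examines the numeric value stored in wc, so it does not depend on how the wide character is encoded or on its size.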