Unicode Support in the Solaris Operating Environment

Chapter 2 Unicode

In most writing systems, keyboard input is converted into character codes, stored in memory, and converted to glyphs in a particular font for display and printing. The collection of characters and their character codes forms a codeset. Traditionally, each language or group of languages has been represented by its own codeset.

A character code in one codeset, however, does not necessarily represent the same character in another codeset. For example, the character code 0xB1 is the plus-minus sign (±) in Latin-1 (ISO 8859-1 codeset) and capital BE in Cyrillic (ISO 8859-5 codeset), but does not represent anything in Arabic (ISO 8859-6 codeset) or Traditional Chinese (CJK unified ideographs).
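The ambiguity of 0xB1 can be checked with a short sketch. It assumes the Python standard codecs `iso8859_1`, `iso8859_5`, and `iso8859_6` as stand-ins for the codesets named above; the helper name `decode_byte` is hypothetical and used only for illustration.

```python
def decode_byte(value, codeset):
    """Return the character that byte `value` denotes in `codeset`.

    Returns None when the codeset leaves that code position unassigned.
    """
    try:
        return bytes([value]).decode(codeset)
    except UnicodeDecodeError:
        return None


# The same byte value, interpreted under three different codesets.
for name in ("iso8859_1", "iso8859_5", "iso8859_6"):
    print(name, repr(decode_byte(0xB1, name)))
```

Under these assumptions the byte decodes to U+00B1 in Latin-1 and to U+0411 in Cyrillic, and has no mapping in the Arabic codeset.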

In Unicode, every character, ideograph, and symbol has a unique character code, eliminating any confusion between character codes of different codesets. In Unicode, multiple codesets need not be defined. Unicode represents characters from most of the world's languages as well as publishing characters, mathematical and technical symbols, and punctuation characters. This universal representation for text data has been further enhanced and extended in the latest release of Unicode: The Unicode Standard, Version 3.0.

2.1 Unicode Coded Representations

In recent years, the Unicode Consortium and other related organizations have developed different formats to represent and store a Unicode codeset. To represent characters from all major languages in multibyte format, the ISO/IEC International Standard 10646-1 (commonly referred to as 10646) has defined the Universal Multiple-Octet Coded Character Set (UCS) format. The coded forms defined in the 10646 specification are:

UCS-2 defines a 64K coding space, or BMP, to represent character codes in a two-octet row and cell format. The row and cell octets designate the cell location of a particular character code within a 256 by 256 (00-FF) plane.

UCS-4 defines a four-octet coding space divided into four units: group, plane, row, and cell. The row and cell octets designate the cell location of a particular character code within a plane. The plane octet designates the plane number (00-FF), and the group octet designates the group number (00-7F) to which the plane belongs. In total, there are 128 groups of 256 planes each.
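The four-octet layout can be illustrated by splitting a code into its units. This is a minimal sketch, assuming the code is held as a plain integer; the function name `split_ucs4` is hypothetical.

```python
def split_ucs4(code):
    """Split a 31-bit UCS-4 code into its (group, plane, row, cell) octets."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS-4 codes occupy at most 31 bits")
    group = (code >> 24) & 0x7F   # group number, 00-7F
    plane = (code >> 16) & 0xFF   # plane number within the group, 00-FF
    row = (code >> 8) & 0xFF      # row within the plane
    cell = code & 0xFF            # cell within the row
    return group, plane, row, cell


# A BMP character sits in group 00, plane 00; only row and cell vary.
print(split_ucs4(0x0411))
```

A UCS-2 code corresponds to the row and cell octets alone, with group and plane both 00.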

Figure 2-1 UCS-2 and UCS-4 coding schemes

In addition to the 10646 UCS forms, Unicode defines another form called UTF (UCS Transformation Format). One version of UTF is an extended UCS-2 encoding form designed to include characters from outside the BMP 64K coding space. This form was first called UCS-2E (extended UCS-2), but is now known as UTF-16 (UCS Transformation Format 16-bit form).

The UTF-16 form translates a range of UCS-4 codes into sequences of two-octet units. It does this by reserving an area of codes in the BMP coding space for mapping to and from 16 planes of group 00 of UCS-4; a character in one of these planes is represented by a pair of reserved two-octet codes. Specifically, Planes 01 to 0E (14 planes, or 14 x 65,536 = 917,504 characters) are reserved for standard encodings and Planes 0F and 10 (2 planes, or 2 x 65,536 = 131,072 characters) are reserved for private use.
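The mapping from planes 01 to 10 into the BMP can be sketched as follows. The reserved area is not spelled out in the text above; the sketch assumes the surrogate ranges D800-DBFF and DC00-DFFF that the Unicode Standard assigns for this purpose. The function name `to_utf16_units` is hypothetical.

```python
def to_utf16_units(code):
    """Return the list of 16-bit UTF-16 units for a UCS-4 code in planes 00-10."""
    if code < 0x10000:
        if 0xD800 <= code <= 0xDFFF:
            raise ValueError("code falls in the area reserved for UTF-16 pairs")
        return [code]                      # BMP codes are stored unchanged
    if code > 0x10FFFF:
        raise ValueError("UTF-16 reaches only planes 00 through 10")
    offset = code - 0x10000                # 20-bit offset into planes 01-10
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return [high, low]


# 16 planes of 65,536 codes each are reachable through reserved pairs.
assert 16 * 65536 == 1024 * 1024
print([hex(u) for u in to_utf16_units(0x10000)])
```

The two reserved blocks hold 1,024 codes each, and 1,024 x 1,024 pairs cover exactly the 16 planes of 65,536 codes described above.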

Although UCS-4 and UTF-16 provide comprehensive ways to represent several character sets, they do not preserve the byte values for ASCII characters. Because all UNIX systems are based on an ASCII kernel, they reserve certain character codes for I/O operations, such as the null character as a string terminator, the slash (/) character as a path name separator, the DEL control character, and the SPACE character. To circumvent this problem, another version of UTF was devised, called FSS-UTF (File System Safe-UTF), now commonly known as UTF-8.

UTF-8 is an encoding scheme that maps the entire UCS-4 character set to a series of single-octet and multi-octet strings. In this scheme, the most significant bit of a byte is 0 for ASCII characters and 1 for every byte of any other character. Each ASCII character is encoded in a single byte, and every other character is encoded in two to six bytes.

Table 2-1 UTF-8 encoding scheme

Bits  Hex Min   Hex Max   UTF-8 Binary Encoding
7     00000000  0000007F  0xxxxxxx
11    00000080  000007FF  110xxxxx 10xxxxxx
16    00000800  0000FFFF  1110xxxx 10xxxxxx 10xxxxxx
21    00010000  001FFFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
26    00200000  03FFFFFF  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
31    04000000  7FFFFFFF  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
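The bit patterns of Table 2-1 can be turned into a small encoder. The sketch below follows the one- to six-byte scheme exactly as tabulated; note that later revisions of UTF-8 limit sequences to four bytes, so the five- and six-byte rows are of historical interest. The names `_RANGES` and `encode_utf8` are hypothetical.

```python
# (largest code in the row, sequence length, lead-byte marker), per Table 2-1.
_RANGES = [
    (0x0000007F, 1, 0x00),  # 0xxxxxxx
    (0x000007FF, 2, 0xC0),  # 110xxxxx
    (0x0000FFFF, 3, 0xE0),  # 1110xxxx
    (0x001FFFFF, 4, 0xF0),  # 11110xxx
    (0x03FFFFFF, 5, 0xF8),  # 111110xx
    (0x7FFFFFFF, 6, 0xFC),  # 1111110x
]


def encode_utf8(code):
    """Encode a UCS-4 code as bytes using the 1- to 6-byte scheme of Table 2-1."""
    if code < 0:
        raise ValueError("code must be non-negative")
    for upper, length, marker in _RANGES:
        if code <= upper:
            break
    else:
        raise ValueError("code exceeds the 31-bit UCS-4 range")
    # Peel off 6 bits per continuation byte, starting from the low end.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))
        code >>= 6
    # The remaining high bits go into the lead byte after its marker.
    return bytes([marker | code] + tail[::-1])


print(encode_utf8(0x0411).hex())
```

For codes that Python itself can encode, the result can be compared against `chr(code).encode("utf-8")` as a sanity check.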


Note -

The UTF-8 scheme does not use any ASCII byte values in its 2- to 6-byte sequences, and ASCII characters keep their original single-byte values. Thus, UTF-8 is compatible with all legacy file systems and other systems that parse for ASCII bytes, while UCS-2/UTF-16 and UCS-4 are not compatible with ASCII.

Furthermore, applications supporting Unicode can use existing data in ASCII format without applying a conversion utility. In addition, there is support within the Internet community for adopting UTF-8 as the Internet encoding standard.
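The file system safety property can be observed directly. The following sketch assumes Python's built-in `utf-8` and `utf-16-be` codecs and picks U+022F as a sample non-ASCII character because its two-octet form happens to contain the slash byte; the helper name `ascii_bytes_in` is hypothetical.

```python
def ascii_bytes_in(text, encoding):
    """Return the bytes of the encoded text that fall in the ASCII range."""
    return [b for b in text.encode(encoding) if b < 0x80]


sample = "\u022f"  # a non-ASCII Latin letter

# UTF-8: every byte of a multi-byte sequence has its top bit set.
print(ascii_bytes_in(sample, "utf-8"))
# UTF-16: the two octets are 0x02 and 0x2F, and 0x2F is the ASCII slash.
print(ascii_bytes_in(sample, "utf-16-be"))
```

A byte-oriented path parser would therefore see a spurious path separator in the UTF-16 form but not in the UTF-8 form. Likewise, an ASCII letter such as "A" contains a null byte in UTF-16 and remains the single byte 0x41 in UTF-8.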


In addition to its backward compatibility with 7-bit ASCII, UTF-8 is a space-efficient encoding scheme when the encoded data fits in one byte per character (as for English and other Roman character-based writing systems). Because UTF-8 stores one-byte data as one byte, rather than, for example, the two bytes required by UTF-16, it can significantly decrease the storage space required to hold large blocks of international data.
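The storage difference is easy to measure. This is a minimal sketch using Python's built-in codecs, with `utf-32-be` standing in for UCS-4; the function name `encoded_sizes` is hypothetical.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of `text` under each encoding form."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


# ASCII-only text: one byte per character in UTF-8, two in UTF-16, four in UCS-4.
print(encoded_sizes("Solaris"))
```

The saving applies to text that is mostly ASCII. Characters from the upper part of the BMP, such as CJK ideographs, take three bytes each in UTF-8 against two in UTF-16.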

Because of its flexibility and compatibility with ASCII and UNIX, the Solaris operating environment bases its Unicode support on the UTF-8 format. UTF-8 provides developers with a format compatible with existing internationalized environments and an easy path for Internet and legacy data interoperability. As a file system safe format, UTF-8 supports one-byte unit I/O operations and can represent every character of the Unicode formats UCS-2 and UCS-4. Furthermore, UTF-8 fits well within the XPG internationalization framework.