Sun WorkShop Compiler C 5.0 User's Guide

Multibyte Characters and Wide Characters

At first, the internationalization of ANSI/ISO C affected only library functions. However, the final stage of internationalization--multibyte characters and wide characters--also affected the language proper.

Asian Languages Require Multibyte Characters

The basic difficulty in an Asian-language computer environment is the huge number of ideograms needed for I/O. To work within the constraints of usual computer architectures, these ideograms are encoded as sequences of bytes. The associated operating systems, application programs, and terminals understand these byte sequences as individual ideograms. Moreover, all of these encodings allow intermixing of regular single-byte characters with the ideogram byte sequences. Just how difficult it is to recognize distinct ideograms depends on the encoding scheme used.

The term "multibyte character" is defined by ANSI/ISO C to denote a byte sequence that encodes an ideogram, no matter what encoding scheme is employed. All multibyte characters are members of the "extended character set." A regular single-byte character is just a special case of a multibyte character. The only requirement placed on the encoding is that no multibyte character can use a null character as part of its encoding.

ANSI/ISO C specifies that program comments, string literals, character constants, and header names are all sequences of multibyte characters.

Encoding Variations

The encoding schemes fall into two camps. In the first, each multibyte character is self-identifying; that is, any multibyte character can simply be inserted between any pair of multibyte characters.

In the second, the presence of special shift bytes changes the interpretation of subsequent bytes. An example is the method used by some character terminals to get in and out of line-drawing mode. For programs written in multibyte characters with a shift-state-dependent encoding, ANSI/ISO C requires that each comment, string literal, character constant, and header name both begin and end in the unshifted state.
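A program can ask at run time which camp the current locale's encoding falls into. The following is a minimal sketch, not part of the original guide; the function name encoding_has_shift_states is hypothetical. It relies on the standard rule that mblen() called with a null pointer returns nonzero when the encoding is shift-state dependent and zero otherwise.

```c
#include <stdlib.h>   /* mblen, NULL */

/* Returns 1 if the current locale's multibyte encoding is
 * shift-state dependent, 0 if it is not. Calling mblen() with a
 * null pointer also puts the function back into its initial
 * shift state. */
int encoding_has_shift_states(void)
{
    return mblen(NULL, 0) != 0;
}
```

The answer depends on the locale in effect when the function is called, so a program that changes locale should ask again afterward.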

Wide Characters

Some of the inconvenience of handling multibyte characters would be eliminated if all characters were of a uniform number of bytes or bits. Since there can be thousands or tens of thousands of ideograms in such a character set, a 16-bit or 32-bit integral value is needed to hold all members. (The full set of Chinese ideograms numbers more than 65,000!) ANSI/ISO C includes the typedef name wchar_t as the implementation-defined integral type large enough to hold all members of the extended character set.

For each wide character, there is a corresponding multibyte character, and vice versa; the wide character that corresponds to a regular single-byte character is required to have the same value as its single-byte value, including the null character. However, there is no guarantee that the value of the macro EOF can be stored in a wchar_t, just as EOF might not be representable as a char.
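The correspondence for single-byte characters can be checked with mbtowc(). This is a sketch only, assuming the program runs in the default "C" locale and is given characters from the basic character set; the function name same_value_as_wide is hypothetical.

```c
#include <stdlib.h>   /* mbtowc, NULL */
#include <stddef.h>   /* wchar_t */

/* Converts the single-byte character c to a wide character and
 * reports whether the two values match. Returns 1 on a match,
 * 0 on a mismatch, and -1 if c is not a valid multibyte character
 * in the current locale. */
int same_value_as_wide(char c)
{
    wchar_t wc;
    char buf[2];
    int n;

    buf[0] = c;
    buf[1] = '\0';
    mbtowc(NULL, NULL, 0);        /* reset to the initial shift state */
    n = mbtowc(&wc, buf, 1);      /* returns 0 for the null character */
    if (n < 0)
        return -1;
    return wc == (wchar_t)c;
}
```

Under those assumptions, same_value_as_wide('a') and same_value_as_wide('\0') are both expected to return 1.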

Conversion Functions

ANSI/ISO C provides five library functions that manage multibyte characters and wide characters:

Table E-2 Multibyte Character Conversion Functions
Function      Description
mblen()       length of next multibyte character
mbtowc()      convert multibyte character to wide character
wctomb()      convert wide character to multibyte character
mbstowcs()    convert multibyte character string to wide character string
wcstombs()    convert wide character string to multibyte character string

The behavior of all of these functions depends on the current locale. (See "The setlocale() Function".)
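As an illustration of how these functions fit together, the sketch below counts the characters in a multibyte string with mblen() and converts a string to wide characters and back with mbstowcs() and wcstombs(). The function names count_multibyte_chars and round_trip and the 64-element buffer are assumptions made for this example. The results depend on the locale in effect; a program that wants the user's encoding would first call setlocale().

```c
#include <stdlib.h>   /* mblen, mbstowcs, wcstombs, MB_CUR_MAX */
#include <stddef.h>   /* size_t, wchar_t */

/* Counts the multibyte characters (not bytes) in s, using the
 * encoding of the current locale. Returns -1 if s contains an
 * invalid byte sequence. */
long count_multibyte_chars(const char *s)
{
    long count = 0;
    int n;

    mblen(NULL, 0);                       /* start in the initial shift state */
    while ((n = mblen(s, MB_CUR_MAX)) > 0) {
        s += n;                           /* skip the bytes of one character */
        count++;
    }
    return n < 0 ? -1 : count;            /* n == 0 means the null was reached */
}

/* Converts the multibyte string src to wide characters and back,
 * storing the result in dst (capacity dst_size bytes). Returns the
 * number of bytes stored, excluding the terminating null, or
 * (size_t)-1 on a conversion error or if a buffer is too small. */
size_t round_trip(const char *src, char *dst, size_t dst_size)
{
    wchar_t wide[64];                     /* arbitrary size for this sketch */
    size_t nwide, nbytes;

    nwide = mbstowcs(wide, src, 64);
    if (nwide == (size_t)-1 || nwide == 64)
        return (size_t)-1;                /* invalid input or no room for the null */
    nbytes = wcstombs(dst, wide, dst_size);
    if (nbytes == (size_t)-1 || nbytes == dst_size)
        return (size_t)-1;
    return nbytes;
}
```

In the default "C" locale every character of "abc" occupies one byte, so count_multibyte_chars("abc") returns 3. In a locale with a multibyte encoding, the count can be smaller than the byte length reported by strlen().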

Vendors providing compilation systems targeted to this market are expected to supply many more string-like functions to simplify the handling of wide character strings. However, for most application programs, there is no need to convert any multibyte characters to or from wide characters. Programs such as diff, for example, read in and write out multibyte characters, needing only to check for an exact byte-for-byte match. More complicated programs, such as grep, that use regular expression pattern matching, may need to understand multibyte characters, but only the common set of functions that manages the regular expression needs this knowledge. The program grep itself requires no other special multibyte character handling.
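The byte-for-byte approach described for diff needs nothing beyond the ordinary string functions. A minimal sketch follows; the function name lines_match is hypothetical and this is not taken from the diff source.

```c
#include <string.h>   /* strcmp */

/* Returns 1 if the two lines hold identical byte sequences and
 * 0 otherwise. The bytes are never decoded into characters, so
 * the comparison is independent of the encoding in use. */
int lines_match(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Because no multibyte character may contain a null byte, strcmp() stops only at the real end of each string.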

C Language Features

To give even more flexibility to the programmer in an Asian-language environment, ANSI/ISO C provides wide character constants and wide string literals. These have the same form as their non-wide versions, except that they are immediately prefixed by the letter L:

'x' regular character constant

'¥' regular character constant

L'x' wide character constant

L'¥' wide character constant

"abc¥xyz" regular string literal

L"abc¥xyz" wide string literal

Multibyte characters are valid in both the regular and wide versions. The sequence of bytes necessary to produce the ideogram ¥ is encoding-specific, but if it consists of more than one byte, the value of the character constant '¥' is implementation-defined, just as the value of 'ab' is implementation-defined. Except for escape sequences, a regular string literal contains exactly the bytes specified between the quotes, including the bytes of each specified multibyte character.

When the compilation system encounters a wide character constant or wide string literal, each multibyte character is converted into a wide character, as if by calling the mbtowc() function. Thus, the type of L'¥' is wchar_t; the type of L"abc¥xyz" is array of wchar_t with length eight. Just as with regular string literals, each wide string literal has an extra zero-valued element appended, but in these cases, it is a wchar_t with value zero.
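The element count can be confirmed with sizeof. In the sketch below, the escape \xA5 stands in for the ideogram ¥ so that the source file stays pure ASCII and does not depend on any particular source encoding; that substitution and the function names are assumptions made for this example.

```c
#include <stddef.h>   /* size_t, wchar_t */

/* Number of wchar_t elements in the wide literal: seven
 * characters plus the appended zero-valued element. */
size_t wide_literal_elements(void)
{
    return sizeof(L"abc\xA5xyz") / sizeof(wchar_t);
}

/* Returns 1 if the last element of the wide literal is zero. */
int wide_literal_ends_with_zero(void)
{
    return L"abc\xA5xyz"[7] == 0;
}

/* Returns 1 if a wide character constant occupies exactly the
 * storage of a wchar_t, as its type requires. */
int wide_constant_is_wchar_sized(void)
{
    return sizeof(L'\xA5') == sizeof(wchar_t);
}
```

Here wide_literal_elements() returns 8 regardless of how many bytes wchar_t occupies on the target system.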

Just as regular string literals can be used as a shorthand method for character array initialization, wide string literals can be used to initialize wchar_t arrays:


wchar_t *wp = L"a¥z";
wchar_t x[] = L"a¥z";
wchar_t y[] = {L'a', L'¥', L'z', 0};
wchar_t z[] = {'a', L'¥', 'z', '\0'};

In the above example, the three arrays x, y, and z, and the array pointed to by wp, have the same length. All are initialized with identical values.
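A short check of that claim is sketched below. As an assumption made for this example, \xA5 again replaces the ideogram ¥ to keep the source ASCII-only; the function name initializations_agree is hypothetical.

```c
#include <string.h>   /* memcmp */
#include <stddef.h>   /* wchar_t */

/* Returns 1 if all four initializations produce four wchar_t
 * elements with identical values, 0 otherwise. */
int initializations_agree(void)
{
    const wchar_t *wp = L"a\xA5z";
    wchar_t x[] = L"a\xA5z";
    wchar_t y[] = {L'a', L'\xA5', L'z', 0};
    wchar_t z[] = {'a', L'\xA5', 'z', '\0'};

    return sizeof x == 4 * sizeof(wchar_t)   /* three characters plus zero */
        && sizeof y == sizeof x
        && sizeof z == sizeof x
        && memcmp(x, y, sizeof x) == 0
        && memcmp(x, z, sizeof x) == 0
        && memcmp(x, wp, sizeof x) == 0;
}
```

The regular constants 'a', 'z', and '\0' in z are converted to wchar_t during initialization, which is why z matches the others.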

Finally, adjacent wide string literals are concatenated, just as with regular string literals. However, adjacent regular and wide string literals produce undefined behavior. A compiler is not required to produce an error if it does not accept such concatenations.
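The well-defined case, two adjacent wide string literals, can be sketched as follows. The function names are hypothetical. The mixed case is deliberately left out because its behavior is undefined in the dialect described here.

```c
#include <stddef.h>   /* size_t, wchar_t */

/* Two adjacent wide literals become one array: six characters
 * plus a single zero-valued terminator. */
static const wchar_t joined[] = L"abc" L"xyz";

/* Returns the number of wchar_t elements in the joined literal. */
size_t concatenated_elements(void)
{
    return sizeof joined / sizeof joined[0];
}

/* Returns the element at index i; the caller must keep i below
 * concatenated_elements(). */
wchar_t concatenated_at(size_t i)
{
    return joined[i];
}
```

Only one terminator is appended, so concatenated_elements() returns 7 and concatenated_at(3) is L'x'.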