JavaScript is required to for searching.
Skip Navigation Links
Exit Print View
Oracle Solaris Studio 12.3: C User's Guide     Oracle Solaris Studio 12.3 Information Library
search filter icon
search icon

Document Information


1.  Introduction to the C Compiler

2.  C-Compiler Implementation-Specific Information

3.  Parallelizing C Code

4.  lint Source Code Checker

5.  Type-Based Alias Analysis

6.  Transitioning to ISO C

6.1 Basic Modes

6.1.1 -Xc

6.1.2 -Xa

6.1.3 -Xt

6.1.4 -Xs

6.2 New-Style Function Prototypes

6.2.1 Writing New Code

6.2.2 Updating Existing Code

6.2.3 Mixing Considerations

6.3 Functions With Varying Arguments

6.4 Promotions: Unsigned Versus Value Preserving

6.4.1 Some Background History

6.4.2 Compilation Behavior

6.4.3 Example: The Use of a Cast

6.4.4 Example: Same Result, No Warning

6.4.5 Integral Constants

6.4.6 Example: Integral Constants

6.5 Tokenization and Preprocessing

6.5.1 ISO C Translation Phases

6.5.2 Old C Translation Phases

6.5.3 Logical Source Lines

6.5.4 Macro Replacement

6.5.5 Using Strings

6.5.6 Token Pasting

6.6 const and volatile

6.6.1 Types for lvalue Only

6.6.2 Type Qualifiers in Derived Types

6.6.3 const Means readonly

6.6.4 Examples of const Usage

6.6.5 Examples of volatile Usage

6.7 Multibyte Characters and Wide Characters

6.7.1 Asian Languages Require Multibyte Characters

6.7.2 Encoding Variations

6.7.3 Wide Characters

6.7.4 C Language Features

6.8 Standard Headers and Reserved Names

6.8.1 Standard Headers

6.8.2 Names Reserved for Implementation Use

6.8.3 Names Reserved for Expansion

6.8.4 Names Safe to Use

6.9 Internationalization

6.9.1 Locales

6.9.2 setlocale() Function

6.9.3 Changed Functions

6.9.4 New Functions

6.10 Grouping and Evaluation in Expressions

6.10.1 Expression Definitions

6.10.2 K&R C Rearrangement License

6.10.3 ISO C Rules

6.10.4 Parentheses Usage

6.10.5 The As If Rule

6.11 Incomplete Types

6.11.1 Types

6.11.2 Completing Incomplete Types

6.11.3 Declarations

6.11.4 Expressions

6.11.5 Justification

6.11.6 Examples: Incomplete Types

6.12 Compatible and Composite Types

6.12.1 Multiple Declarations

6.12.2 Separate Compilation Compatibility

6.12.3 Single Compilation Compatibility

6.12.4 Compatible Pointer Types

6.12.5 Compatible Array Types

6.12.6 Compatible Function Types

6.12.7 Special Cases

6.12.8 Composite Types

7.  Converting Applications for a 64-Bit Environment

8.  cscope: Interactively Examining a C Program

A.  Compiler Options Grouped by Functionality

B.  C Compiler Options Reference

C.  Implementation-Defined ISO/IEC C99 Behavior

D.  Features of C99

E.  Implementation-Defined ISO/IEC C90 Behavior

F.  ISO C Data Representations

G.  Performance Tuning

H.  Oracle Solaris Studio C: Differences Between K&R C and ISO C


6.7 Multibyte Characters and Wide Characters

At first, the internationalization of ISO C affected only library functions. However, the final stage of internationalization, multibyte characters and wide characters, also affected the language proper.

The 1990 ISO/IEC C standard provides five library functions that manage multibyte characters and wide characters, the 1999 ISO/IEC C standard provides many more such functions.

6.7.1 Asian Languages Require Multibyte Characters

The basic difficulty in an Asian-language computer environment is the huge number of ideograms needed for I/O. To work within the constraints of usual computer architectures, these ideograms are encoded as sequences of bytes. The associated operating systems, application programs, and terminals understand these byte sequences as individual ideograms. Moreover, all of these encodings allow intermixing of regular single-byte characters with the ideogram byte sequences. The level of difficulty recognizing distinct ideograms depends on the encoding scheme used.

The term “multibyte character” is defined by ISO C to denote a byte sequence that encodes an ideogram, no matter what encoding scheme is employed. All multibyte characters are members of the “extended character set.” A regular single-byte character is just a special case of a multibyte character. The only requirement placed on the encoding is that no multibyte character can use a null character as part of its encoding.

ISO C specifies that program comments, string literals, character constants, and header names are all sequences of multibyte characters.

6.7.2 Encoding Variations

The encoding schemes fall into two camps. The first is one in which each multibyte character is self-identifying, that is, any multibyte character can simply be inserted between any pair of multibyte characters.

The second scheme is one in which the presence of special shift bytes changes the interpretation of subsequent bytes. An example is the method used by some character terminals to enter and leave line-drawing mode. For programs written in multibyte characters with a shift-state-dependent encoding, ISO C requires that each comment, string literal, character constant, and header name must both begin and end in the unshifted state.

6.7.3 Wide Characters

Some of the inconvenience of handling multibyte characters would be eliminated if all characters were of a uniform number of bytes or bits. Because such a character set can contain t thousands or tens of thousands of ideograms, a 16-bit or 32-bit sized integer value should be used to hold all members. (The full Chinese alphabet includes more than 65,000 ideograms!) ISO C includes the typedef name wchar_t as the implementation-defined integer type large enough to hold all members of the extended character set.

Each wide character has a corresponding multibyte character, and vice versa. The wide character that corresponds to a regular single-byte character is required to have the same value as its single-byte value, including the null character. However, the macro EOF might not necessarily be stored in a wchar_t, just as EOF might not be representable as a char.

6.7.4 C Language Features

To give even more flexibility to the programmer in an Asian-language environment, ISO C provides wide character constants and wide string literals. These have the same form as their non-wide versions, except that they are immediately prefixed by the letter L:

Multibyte characters are valid in both the regular and wide versions. The sequence of bytes necessary to produce the ideogram¥ is encoding-specific. If it consists of more than one byte, the value of the character constant ’¥’ is implementation-defined, just as the value of ’ab’ is implementation-defined. Except for escape sequences, a regular string literal contains exactly the bytes specified between the quotes, including the bytes of each specified multibyte character.

When the compilation system encounters a wide character constant or wide string literal, each multibyte character is converted into a wide character, as if by calling the mbtowc() function. Thus, the type of L¥’ is wchar_t; the type of abc¥xyz is array of wchar_t with length eight. Just as with regular string literals, each wide string literal has an extra zero-valued element appended, but in these cases, it is a wchar_t with value zero.

Just as regular string literals can be used as a shorthand method for character array initialization, wide string literals can be used to initialize wchar_t arrays:

wchar_t *wp = L"a¥z";
wchar_t x[] = L"a¥z";
wchar_t y[] = {L’a’, L’¥’, L’z’, 0};
wchar_t z[] = {’a’, L’¥’, ’z’, ’\0’};

In this example, the three arrays x, y, and z, and the array pointed to by wp, have the same length. All are initialized with identical values.

Finally, adjacent wide string literals are concatenated, just as with regular string literals. However, with the 1990 ISO/IEC C standard, adjacent regular and wide string literals produce undefined behavior. Also, the 1990 ISO/IEC C standard specifies that a compiler is not required to produce an error if it does not accept such concatenations.