C User's Guide
Transitioning to ANSI/ISO C
This chapter contains the following sections:
- Basic Modes
- A Mixture of Old- and New-Style Functions
- Functions With Varying Arguments
- Promotions: Unsigned Versus Value Preserving
- Tokenization and Preprocessing
- const and volatile
- Multibyte Characters and Wide Characters
- Standard Headers and Reserved Names
- Internationalization
- Grouping and Evaluation in Expressions
- Incomplete Types
- Compatible and Composite Types
Basic Modes
The ANSI/ISO C compiler allows both old-style and new-style C code. The following -X (note case) options provide varying degrees of compliance to the ANSI/ISO C standard. -Xa is the default mode.

-Xa (a = ANSI)
ANSI/ISO C plus K&R C compatibility extensions, with semantic changes required by ANSI/ISO C. Where K&R C and ANSI/ISO C specify different semantics for the same construct, the compiler issues warnings about the conflict and uses the ANSI/ISO C interpretation. This is the default mode.

-Xc (c = conformance)
Maximally conformant ANSI/ISO C, without K&R C compatibility extensions. The compiler issues errors and warnings for programs that use non-ANSI/ISO C constructs.

-Xs (s = K&R C)
The compiled language includes all features compatible with pre-ANSI/ISO K&R C. The compiler warns about all language constructs that have differing behavior between ANSI/ISO C and K&R C.

-Xt (t = transition)
ANSI/ISO C plus K&R C compatibility extensions, without semantic changes required by ANSI/ISO C. Where K&R C and ANSI/ISO C specify different semantics for the same construct, the compiler issues warnings about the conflict and uses the K&R C interpretation.

A Mixture of Old- and New-Style Functions
ANSI/ISO C's most sweeping change to the language is the function prototype, borrowed from the C++ language. By specifying for each function the number and types of its parameters, not only does every regular compile get the benefits of argument and parameter checks (similar to those of lint) for each function call, but arguments are automatically converted (just as with an assignment) to the type expected by the function. ANSI/ISO C includes rules that govern the mixing of old- and new-style function declarations, since there are many lines of existing C code that could and should be converted to use prototypes.

Writing New Code

When you write an entirely new program, use new-style function declarations (function prototypes) in headers and new-style function declarations and definitions in other C source files. However, if there is a possibility that someone will port the code to a machine with a pre-ANSI/ISO C compiler, we suggest you use the macro __STDC__ (which is defined only for ANSI/ISO C compilation systems) in both header and source files. Refer to Mixing Considerations for an example.

An ANSI/ISO C-conforming compiler must issue a diagnostic whenever two incompatible declarations for the same object or function are in the same scope. If all functions are declared and defined with prototypes, and the appropriate headers are included by the correct source files, all calls should agree with the definition of the functions. This protocol eliminates one of the most common C programming mistakes.
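A minimal sketch of both points (square() and call_it() are hypothetical names, not from the original text): with a prototype in scope, every call is checked and its arguments converted like an assignment, and a second, incompatible declaration in the same scope must draw a diagnostic.

```c
double square(double);   /* prototype, normally placed in a header */

/* A second, incompatible declaration in this same scope, such as
 *     int square(int);
 * must be diagnosed by a conforming compiler. */

double
call_it(void)
{
    /* With the prototype in scope, the int argument 2 is converted
     * to double automatically, just as with an assignment. */
    return square(2);
}

double
square(double d)
{
    return d * d;
}
```

Without the prototype, an old-style compiler would pass the int 2 unconverted and the function would misinterpret its argument.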
Updating Existing Code
If you have an existing application and want the benefits of function prototypes, there are a number of possibilities for updating, depending on how much of the code you would like to change:
1. Recompile without making any changes.
2. Add function prototypes just to the headers.
3. Add function prototypes to the headers and start each source file with function prototypes for its local (static) functions.
4. Change all function declarations and definitions to use function prototypes.
For most programmers, choices 2 and 3 are probably the best cost/benefit compromise. Unfortunately, these options are precisely the ones that require detailed knowledge of the rules for mixing old and new styles.
Mixing Considerations
For function prototype declarations to work with old-style function definitions, both must specify functionally identical interfaces, or have "compatible types" in ANSI/ISO C's terminology.

For functions with varying arguments, there can be no mixing of ANSI/ISO C's ellipsis notation and the old-style varargs() function definition. For functions with a fixed number of parameters, the situation is fairly straightforward: just specify the types of the parameters as they were passed in previous implementations.

In K&R C, each argument was converted just before it was passed to the called function according to the default argument promotions. These promotions specified that all integral types narrower than int were promoted to int size, and any float argument was promoted to double, hence simplifying both the compiler and libraries. Function prototypes are more expressive--the specified parameter type is what is passed to the function. Thus, if a function prototype is written for an existing (old-style) function definition, there should be no parameters in the function prototype with any of the following types:
- char
- signed char
- unsigned char
- float
- short
- signed short
- unsigned short
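A sketch of why float, for example, cannot appear in such a prototype (half() is a hypothetical name): a K&R caller promotes every float argument to double, so the old-style definition actually receives a double, and the matching prototype must say so. The __STDC__ guard follows the convention used throughout this chapter.

```c
double half(double);       /* the prototype names the promoted type  */
/* double half(float);        would specify a different, incompatible
 *                            interface                              */

#ifdef __STDC__
double half(double f)      /* new-style definition */
{
    return f / 2;
}
#else
double
half(f)
    float f;               /* old-style: the caller promotes every
                              float argument to double               */
{
    return f / 2;
}
#endif
```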
There still remain two complications with writing prototypes: typedef names and the promotion rules for narrow unsigned types.

If parameters in old-style functions were declared using typedef names, such as off_t and ino_t, it is important to know whether or not the typedef name designates a type that is affected by the default argument promotions. For these two, off_t is a long, so it is appropriate to use in a function prototype; ino_t used to be an unsigned short, so if it were used in a prototype, the compiler issues a diagnostic because the old-style definition and the prototype specify different and incompatible interfaces.

Just what should be used instead of an unsigned short leads us to the final complication. The biggest incompatibility between K&R C and the ANSI/ISO C compiler is the promotion rule for the widening of unsigned char and unsigned short to an int value. (See Promotions: Unsigned Versus Value Preserving.) The parameter type that matches such an old-style parameter depends on the compilation mode used when you compile:

- -Xs and -Xt should use unsigned int
- -Xa and -Xc should use int

The best approach is to change the old-style definition to specify either int or unsigned int and use the matching type in the function prototype. You can always assign its value to a local variable with the narrower type, if necessary, after you enter the function.

Watch out for the use of identifiers in prototypes that may be affected by preprocessing. Consider the following example:
#define status 23
void my_exit(int status);  /* Normally, scope begins   */
                           /* and ends with prototype  */
Do not mix function prototypes with old-style function declarations that contain narrow types.

void foo(unsigned char, unsigned short);

void
foo(i, j)
    unsigned char i;
    unsigned short j;
{
    /* . . . */
}
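A sketch of the repair recommended above: widen the parameters to int in both the prototype and the definition, then recover the narrow types in local variables if necessary. (The file-scope variables here are only for illustration.)

```c
static unsigned char last_uc;
static unsigned short last_us;

void foo(int, int);        /* prototype uses the widened types */

#ifdef __STDC__
void foo(int i, int j)
#else
void
foo(i, j)
    int i;                 /* was: unsigned char i;  */
    int j;                 /* was: unsigned short j; */
#endif
{
    unsigned char uc = i;  /* recover the narrow types locally */
    unsigned short us = j;

    last_uc = uc;
    last_us = us;
}
```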
Appropriate use of __STDC__ produces a header file that can be used for both the old and new compilers:

header.h:

struct s { /* . . . */ };
#ifdef __STDC__
void errmsg(int, ...);
struct s *f(const char *);
int g(void);
#else
void errmsg();
struct s *f();
int g();
#endif
The following function uses prototypes and can still be compiled on an older system:
struct s *
#ifdef __STDC__
f(const char *p)
#else
f(p) char *p;
#endif
{
    /* . . . */
}
Here is an updated source file (as with choice 3 above). The local function still uses an old-style definition, but a prototype is included for newer compilers:

source.c:

#include "header.h"
typedef /* . . . */ MyType;

#ifdef __STDC__
static void del(MyType *);
/* . . . */
#endif

static void
del(p)
    MyType *p;
{
    /* . . . */
}
/* . . . */
Functions With Varying Arguments
In previous implementations, you could not specify the parameter types that a function expected, but ANSI/ISO C encourages you to use prototypes to do just that. To support functions such as printf(), the syntax for prototypes includes a special ellipsis (...) terminator. Because an implementation might need to do unusual things to handle a varying number of arguments, ANSI/ISO C requires that all declarations and the definition of such a function include the ellipsis terminator.

Since there are no names for the "..." part of the parameters, a special set of macros contained in stdarg.h gives the function access to these arguments. Earlier versions of such functions had to use similar macros contained in varargs.h.

Let us assume that the function we wish to write is an error handler called errmsg() that returns void, and whose only fixed parameter is an int that specifies details about the error message. This parameter can be followed by a file name, a line number, or both, and these are followed by a format and arguments, similar to those of printf(), that specify the text of the error message.

To allow our example to compile with earlier compilers, we make extensive use of the macro __STDC__, which is defined only for ANSI/ISO C compilation systems. Thus, the function's declaration in the appropriate header file is:
#ifdef __STDC__
void errmsg(int code, ...);
#else
void errmsg();
#endif
The file that contains the definition of errmsg() is where the old and new styles can get complex. First, the header to include depends on the compilation system:

#ifdef __STDC__
#include <stdarg.h>
#else
#include <varargs.h>
#endif
#include <stdio.h>
stdio.h is included because we call fprintf() and vfprintf() later.

Next comes the definition for the function. The identifiers va_alist and va_dcl are part of the old-style varargs.h interface.
void
#ifdef __STDC__
errmsg(int code, ...)
#else
errmsg(va_alist) va_dcl /* Note: no semicolon! */
#endif
{
    /* more detail below */
}
Since the old-style variable argument mechanism did not allow us to specify any fixed parameters, we must arrange for them to be accessed before the varying portion. Also, due to the lack of a name for the "..." part of the parameters, the new va_start() macro has a second argument--the name of the parameter that comes just before the "..." terminator.

As an extension, Sun ANSI/ISO C allows functions to be declared and defined with no fixed parameters, as in:
int f(...);
For such functions, va_start() should be invoked with an empty second argument, as in:

va_start(ap,)
The following is the body of the function:
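A body consistent with the surrounding description might look like the following sketch. The flag values for FILENAME, LINENUMBER, and WARNING are assumptions for illustration; as noted below, the real definitions live in the same header that declares errmsg().

```c
#ifdef __STDC__
#include <stdarg.h>
#else
#include <varargs.h>
#endif
#include <stdio.h>

/* Assumed flag values; the real ones belong in the header that
 * declares errmsg(). */
#define FILENAME   0x1
#define LINENUMBER 0x2
#define WARNING    0x4

void
#ifdef __STDC__
errmsg(int code, ...)
#else
errmsg(va_alist) va_dcl /* Note: no semicolon! */
#endif
{
    va_list ap;
    char *fmt;
#ifndef __STDC__
    int code;
#endif

#ifdef __STDC__
    va_start(ap, code);
#else
    va_start(ap);
    code = va_arg(ap, int);   /* fetch the fixed parameter first */
#endif
    if (code & FILENAME)
        (void)fprintf(stderr, "\"%s\": ", va_arg(ap, char *));
    if (code & LINENUMBER)
        (void)fprintf(stderr, "%d: ", va_arg(ap, int));
    if (code & WARNING)
        (void)fputs("warning: ", stderr);
    fmt = va_arg(ap, char *);          /* fetch the format ...        */
    (void)vfprintf(stderr, fmt, ap);   /* ... then pass ap separately */
    va_end(ap);
}
```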
Both the va_arg() and va_end() macros work the same for the old-style and ANSI/ISO C versions. Because va_arg() changes the value of ap, the call to vfprintf() cannot be:
(void)vfprintf(stderr, va_arg(ap, char *), ap);
The definitions for the macros FILENAME, LINENUMBER, and WARNING are presumably contained in the same header as the declaration of errmsg().

A sample call to errmsg() could be:

errmsg(FILENAME, "<command line>", "cannot open: %s\n",
    argv[optind]);
Promotions: Unsigned Versus Value Preserving
The following information appears in the Rationale section that accompanies the draft C Standard: "QUIET CHANGE. A program that depends on unsigned preserving arithmetic conversions will behave differently, probably without complaint. This is considered to be the most serious change made by the Committee to a widespread current practice."
This section explores how this change affects our code.
Background
According to K&R, The C Programming Language (First Edition), unsigned specified exactly one type; there were no unsigned chars, unsigned shorts, or unsigned longs, but most C compilers added these very soon thereafter. Some compilers did not implement unsigned long but included the other two. Naturally, implementations chose different rules for type promotions when these new types mixed with others in expressions.

In most C compilers, the simpler rule, "unsigned preserving," is used: when an unsigned type needs to be widened, it is widened to an unsigned type; when an unsigned type mixes with a signed type, the result is an unsigned type.

The other rule, specified by ANSI/ISO C, is known as "value preserving," in which the result type depends on the relative sizes of the operand types. When an unsigned char or unsigned short is widened, the result type is int if an int is large enough to represent all the values of the smaller type. Otherwise, the result type is unsigned int. The value preserving rule produces the least surprising arithmetic result for most expressions.

Compilation Behavior
Only in the transition or pre-ANSI/ISO modes (-Xt or -Xs) does the ANSI/ISO C compiler use the unsigned preserving promotions; in the other two modes, conforming (-Xc) and ANSI (-Xa), the value preserving promotion rules are used.

First Example: The Use of a Cast
In the following code, assume that an unsigned char is smaller than an int.

int f(void)
{
    int i = -2;
    unsigned char uc = 1;

    return (i + uc) < 17;
}
The code above causes the compiler to issue the following warning when you use the -xtransition option:
line 6: warning: semantics of "<" change in ANSI/ISO C; use explicit cast
The result of the addition has type int (value preserving) or unsigned int (unsigned preserving), but the bit pattern does not change between these two. On a two's-complement machine, we have:

     i: 111...110  (-2)
+   uc: 000...001  ( 1)
=======================
        111...111  (-1 or UINT_MAX)
This bit representation corresponds to -1 for int and UINT_MAX for unsigned int. Thus, if the result has type int, a signed comparison is used and the less-than test is true; if the result has type unsigned int, an unsigned comparison is used and the less-than test is false.

The addition of a cast serves to specify which of the two behaviors is desired:
value preserving:      (i + (int)uc) < 17
unsigned preserving:   (i + (unsigned int)uc) < 17
Since differing compilers chose different meanings for the same code, this expression can be ambiguous. The addition of a cast is as much to help the reader as it is to eliminate the warning message.
Bit-fields
The same situation applies to the promotion of bit-field values. In ANSI/ISO C, if the number of bits in an int or unsigned int bit-field is less than the number of bits in an int, the promoted type is int; otherwise, the promoted type is unsigned int. In most older C compilers, the promoted type is unsigned int for explicitly unsigned bit-fields, and int otherwise. Similar use of casts can eliminate ambiguous situations.
Second Example: Same Result
In the following code, assume that both unsigned short and unsigned char are narrower than int.

int f(void)
{
    unsigned short us;
    unsigned char uc;

    return uc < us;
}
In this example, both automatics are either promoted to int or to unsigned int, so the comparison is sometimes unsigned and sometimes signed. However, the C compiler does not warn you because the result is the same for the two choices.

Integral Constants
As with expressions, the rules for the types of certain integral constants have changed. In K&R C, an unsuffixed decimal constant had type int only if its value fit in an int; an unsuffixed octal or hexadecimal constant had type int only if its value fit in an unsigned int. Otherwise, an integral constant had type long. At times, the value did not fit in the resulting type. In ANSI/ISO C, the constant type is the first type encountered in the following list that can represent the value:

- unsuffixed decimal: int, long, unsigned long
- unsuffixed octal or hexadecimal: int, unsigned int, long, unsigned long
- suffixed by U: unsigned int, unsigned long
- suffixed by L: long, unsigned long
- suffixed by both U and L: unsigned long
The ANSI/ISO C compiler warns you, when you use the -xtransition option, about any expression whose behavior might change according to the typing rules of the constants involved. The old integral constant typing rules are used only in the transition mode; the ANSI and conforming modes use the new rules.

Third Example: Integral Constants
In the following code, assume ints are 16 bits.

int f(void)
{
    int i = 0;

    return i > 0xffff;
}
Because the hexadecimal constant's type is either int (with a value of -1 on a two's-complement machine) or an unsigned int (with a value of 65535), the comparison is true in -Xs and -Xt modes, and false in -Xa and -Xc modes.

Again, an appropriate cast clarifies the code and suppresses a warning:

-Xt, -Xs modes:    i > (int)0xffff
-Xa, -Xc modes:    i > (unsigned int)0xffff
               or  i > 0xffffU
The U suffix character is a new feature of ANSI/ISO C and probably produces an error message with older compilers.

Tokenization and Preprocessing
Probably the least specified part of previous versions of C concerned the operations that transformed each source file from a bunch of characters into a sequence of tokens, ready to parse. These operations included recognition of white space (including comments), bundling consecutive characters into tokens, handling preprocessing directive lines, and macro replacement. However, their respective ordering was never guaranteed.
ANSI/ISO C Translation Phases
The order of these translation phases is specified by ANSI/ISO C. First, every trigraph sequence in the source file is replaced. ANSI/ISO C has exactly nine trigraph sequences. These three-character sequences, invented solely as a concession to deficient character sets, each name a character not in the ISO 646-1983 character set:

TABLE 7-1 Trigraph Sequences

??=  #
??<  {
??-  ~
??>  }
??(  [
??/  \
??)  ]
??'  ^
??!  |
These sequences must be understood by ANSI/ISO C compilers, but we do not recommend their use. The ANSI/ISO C compiler warns you, when you use the -xtransition option, whenever it replaces a trigraph while in transition (-Xt) mode, even in comments. For example, consider the following:

/* comment *??/
/* still comment? */

The ??/ becomes a backslash. This character and the following newline are removed. The resulting characters are:

/* comment */* still comment? */
The first / from the second line is the end of the comment. The next token is the *.

The remaining translation phases are:

- Every backslash/new-line character pair is deleted.
- The source file is converted into preprocessing tokens and sequences of white space. Each comment is effectively replaced by a space character.
- Every preprocessing directive is handled and all macro invocations are replaced. Each #included source file is run through the earlier phases before its contents replace the directive line.
- Every escape sequence (in character constants and string literals) is interpreted.
- Adjacent string literals are concatenated.
- Every preprocessing token is converted into a regular token; the compiler properly parses these and generates code.
- All external object and function references are resolved, resulting in the final program.
Old C Translation Phases
Previous C compilers did not follow such a simple sequence of phases, nor were there any guarantees for when these steps were applied. A separate preprocessor recognized tokens and white space at essentially the same time as it replaced macros and handled directive lines. The output was then completely retokenized by the compiler proper, which then parsed the language and generated code.
Because the tokenization process within the preprocessor was a moment-by-moment operation and macro replacement was done as a character-based, not token-based, operation, the tokens and white space could have a great deal of variation during preprocessing.
There are a number of differences that arise from these two approaches. The rest of this section discusses how code behavior may change due to line splicing, macro replacement, stringizing, and token pasting, which occur during macro replacement.
Logical Source Lines
In K&R C, backslash/new-line pairs were allowed only as a means to continue a directive, a string literal, or a character constant to the next line. ANSI/ISO C extended the notion so that a backslash/new-line pair can continue anything to the next line. The result is a logical source line. Therefore, any code that relied on the separate recognition of tokens on either side of a backslash/new-line pair does not behave as expected.
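A small sketch of the extended splicing (the macro and function names are hypothetical): under ANSI/ISO C the backslash/new-line pair is deleted before tokenization, so the macro body below is the single token 14.

```c
/* The backslash/new-line pair is removed first, so the two source
 * lines form one logical line and the digits join into one token. */
#define FOURTEEN 1\
4

int
value(void)
{
    return FOURTEEN;   /* 14, not the two tokens 1 and 4 */
}
```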
Macro Replacement
The macro replacement process had never been described in detail prior to ANSI/ISO C. This vagueness spawned a great many divergent implementations. Any code that relied on anything fancier than manifest constant replacement and simple function-like macros was probably not truly portable. This manual cannot uncover all the differences between the old C macro replacement implementation and the ANSI/ISO C version. Nearly all uses of macro replacement, with the exception of token pasting and stringizing, produce exactly the same series of tokens as before. Furthermore, the ANSI/ISO C macro replacement algorithm can do things not possible in the old C version. For example,
#define name (*name)
causes any use of name to be replaced with an indirect reference through name. The old C preprocessor would produce a huge number of parentheses and stars and eventually produce an error about macro recursion.

The major change in the macro replacement approach taken by ANSI/ISO C is to require macro arguments, other than those that are operands of the macro substitution operators # and ##, to be expanded recursively prior to their substitution in the replacement token list. However, this change seldom produces an actual difference in the resulting tokens.

Using Strings
Note - In ANSI/ISO C, the examples below produce a warning about use of old features when you use the -xtransition option. Only in the -Xt and -Xs modes is the result the same as in previous versions of C.
In K&R C, the following code produced the string literal "x y!":

#define str(a) "a!"
str(x y)
Thus, the preprocessor searched inside string literals and character constants for characters that looked like macro parameters. ANSI/ISO C recognized the importance of this feature, but could not condone operations on parts of tokens. In ANSI/ISO C, all invocations of the above macro produce the string literal "a!". To achieve the old effect in ANSI/ISO C, we make use of the # macro substitution operator and the concatenation of string literals:

#define str(a) #a "!"
str(x y)
The above code produces the two string literals "x y" and "!" which, after concatenation, produce the identical "x y!".

There is no direct replacement for the analogous operation for character constants. The major use of this feature was similar to the following:

#define CNTL(ch) (037 & 'ch')
CNTL(L)
which evaluates to the ASCII control-L character. The best solution we know of is to change all uses of this macro to:

#define CNTL(ch) (037 & (ch))
CNTL('L')
This code is more readable and more useful, as it can also be applied to expressions.
Token Pasting
In K&R C, there were at least two ways to combine two tokens. Both invocations in the following produced a single identifier x1 out of the two tokens x and 1.

#define self(a) a
#define glue(a,b) a/**/b

self(x)1
glue(x,1)
Again, ANSI/ISO C could not sanction either approach. In ANSI/ISO C, both the above invocations would produce the two separate tokens x and 1. The second of the above two methods can be rewritten for ANSI/ISO C by using the ## macro substitution operator:

#define glue(a,b) a ## b
glue(x, 1)

# and ## should be used as macro substitution operators only when __STDC__ is defined. Since ## is an actual operator, the invocation can be much freer with respect to white space in both the definition and invocation.

There is no direct approach to effect the first of the two old-style pasting schemes, but since it put the burden of the pasting at the invocation, it was used less frequently than the other form.
const and volatile

The keyword const was one of the C++ features that found its way into ANSI/ISO C. When an analogous keyword, volatile, was invented by the ANSI/ISO C Committee, the "type qualifier" category was created. This category still remains one of the more nebulous parts of ANSI/ISO C.

Types, Only for lvalues

const and volatile are part of an identifier's type, not its storage class. However, they are often removed from the topmost part of the type when an object's value is fetched in the evaluation of an expression--exactly at the point when an lvalue becomes an rvalue. These terms arise from the prototypical assignment "L = R", in which the left side must still refer directly to an object (an lvalue) and the right side need only be a value (an rvalue). Thus, only expressions that are lvalues can be qualified by const or volatile or both.
The type qualifiers may modify type names and derived types. Derived types are those parts of C's declarations that can be applied over and over to build more and more complex types: pointers, arrays, functions, structures, and unions. Except for functions, one or both type qualifiers can be used to change the behavior of a derived type.
For example,

const int five = 5;

declares and initializes an object with type const int whose value is not changed by a correct program. The order of the keywords is not significant to C. For example, the declaration:

int const five = 5;

is identical to the above declaration in its effect.
The declaration

const int *pci = &five;

declares an object with type pointer to const int, which initially points to the previously declared object. The pointer itself does not have a qualified type--it points to a qualified type, and can be changed to point to essentially any int during program execution. pci cannot be used to modify the object to which it points unless a cast is used, as in the following:

*(int *)pci = 17;

If pci actually points to a const object, the behavior of this code is undefined.

The declaration
extern int *const cpi;
says that somewhere in the program there exists a definition of a global object with type const pointer to int. In this case, cpi's value will not be changed by a correct program, but it can be used to modify the object to which it points. Notice that const comes after the * in the above declaration. The following pair of declarations produces the same effect:

typedef int *INT_PTR;
extern const INT_PTR cpi;
These declarations can be combined, as in the following declaration in which an object is declared to have type const pointer to const int:
const int *const cpci;
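A compact sketch of which assignments each qualifier placement permits (x, y, and demo() are hypothetical names for illustration):

```c
int x = 1, y = 2;

int
demo(void)
{
    const int *pci = &x;   /* pointer to const int */
    int *const cpi = &x;   /* const pointer to int */

    pci = &y;    /* OK: the pointer itself is not qualified      */
    /* *pci = 3;   error: cannot modify the object through pci   */

    *cpi = 3;    /* OK: the pointed-to object is not qualified   */
    /* cpi = &y;   error: cpi itself is const                    */

    return *pci + *cpi;    /* y (2) plus the updated x (3) */
}
```

A const pointer to const int would reject both kinds of assignment.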
const Means readonly

In hindsight, readonly would have been a better choice for a keyword than const. If one reads const in this manner, declarations such as:
char *strcpy(char *, const char *);
are easily understood to mean that the second parameter is only used to read character values, while the first parameter overwrites the characters to which it points. Furthermore, despite the fact that in the above example the type of pci is pointer to const int, you can still change the value of the object to which it points through some other means, unless it actually points to an object declared with const int type.

Examples of const Usage

The two main uses for const are to declare large compile-time initialized tables of information as unchanging, and to specify that pointer parameters do not modify the objects to which they point.

The first use potentially allows portions of the data for a program to be shared by other concurrent invocations of the same program. It may cause attempts to modify this invariant data to be detected immediately by means of some sort of memory protection fault, since the data resides in a read-only portion of memory.

The second use helps locate potential errors before they generate a memory fault at run time. For example, functions that temporarily place a null character into the middle of a string are detected at compile time, if passed a pointer to a string that cannot be so modified.
volatile Means Exact Semantics

So far, the examples have all used const because it is conceptually simpler. But what does volatile really mean? To a compiler writer, it has one meaning: take no code generation shortcuts when accessing such an object. In ANSI/ISO C, it is the programmer's responsibility to declare every object that has the appropriate special properties with a volatile qualified type.

Examples of volatile Usage

The usual four examples of volatile objects are:
- An object that is a memory-mapped I/O port
- An object that is shared between multiple concurrent processes
- An object that is modified by an asynchronous signal handler
- An automatic storage duration object declared in a function that calls setjmp, and whose value is changed between the call to setjmp and a corresponding call to longjmp

The first three examples are all instances of an object with a particular behavior: its value can be modified at any point during the execution of the program. Thus, the seemingly infinite loop:
flag = 1;
while (flag);
is valid as long as flag has a volatile qualified type. Presumably, some asynchronous event sets flag to zero in the future. Otherwise, because the value of flag is unchanged within the body of the loop, the compilation system is free to change the above loop into a truly infinite loop that completely ignores the value of flag.

The fourth example, involving variables local to functions that call setjmp, is more involved. The fine print about the behavior of setjmp and longjmp notes that there are no guarantees about the values for objects matching the fourth case. For the most desirable behavior, it is necessary for longjmp to examine every stack frame between the function calling setjmp and the function calling longjmp for saved register values. The possibility of asynchronously created stack frames makes this job even harder.

When an automatic object is declared with a volatile qualified type, the compilation system knows that it has to produce code that exactly matches what the programmer wrote. Therefore, the most recent value for such an automatic object is always in memory and not just in a register, and is guaranteed to be up-to-date when longjmp is called.

Multibyte Characters and Wide Characters
At first, the internationalization of ANSI/ISO C affected only library functions. However, the final stage of internationalization--multibyte characters and wide characters--also affected the language proper.
Asian Languages Require Multibyte Characters
The basic difficulty in an Asian-language computer environment is the huge number of ideograms needed for I/O. To work within the constraints of usual computer architectures, these ideograms are encoded as sequences of bytes. The associated operating systems, application programs, and terminals understand these byte sequences as individual ideograms. Moreover, all of these encodings allow intermixing of regular single-byte characters with the ideogram byte sequences. Just how difficult it is to recognize distinct ideograms depends on the encoding scheme used.
The term "multibyte character" is defined by ANSI/ISO C to denote a byte sequence that encodes an ideogram, no matter what encoding scheme is employed. All multibyte characters are members of the "extended character set." A regular single-byte character is just a special case of a multibyte character. The only requirement placed on the encoding is that no multibyte character can use a null character as part of its encoding.
ANSI/ISO C specifies that program comments, string literals, character constants, and header names are all sequences of multibyte characters.
Encoding Variations
The encoding schemes fall into two camps. The first is one in which each multibyte character is self-identifying, that is, any multibyte character can simply be inserted between any pair of multibyte characters.
The second scheme is one in which the presence of special shift bytes changes the interpretation of subsequent bytes. An example is the method used by some character terminals to get in and out of line-drawing mode. For programs written in multibyte characters with a shift-state-dependent encoding, ANSI/ISO C requires that each comment, string literal, character constant, and header name must both begin and end in the unshifted state.
Wide Characters
Some of the inconvenience of handling multibyte characters would be eliminated if all characters were a uniform number of bytes or bits. Since there can be thousands or tens of thousands of ideograms in such a character set, a 16-bit or 32-bit integral type should be used to hold all members. (The full Chinese alphabet includes more than 65,000 ideograms!) ANSI/ISO C includes the typedef name wchar_t as the implementation-defined integral type large enough to hold all members of the extended character set.

For each wide character, there is a corresponding multibyte character, and vice versa; the wide character that corresponds to a regular single-byte character is required to have the same value as its single-byte value, including the null character. However, there is no guarantee that the value of the macro EOF can be stored in a wchar_t, just as EOF might not be representable as a char.

Conversion Functions
ANSI/ISO C provides five library functions that manage multibyte characters and wide characters: mblen(), mbtowc(), wctomb(), mbstowcs(), and wcstombs().
The behavior of all of these functions depends on the current locale. (See The setlocale() Function.)
It is expected that vendors providing compilation systems targeted to this market supply many more string-like functions to simplify the handling of wide character strings. However, for most application programs, there is no need to convert any multibyte characters to or from wide characters. Programs such as diff, for example, read in and write out multibyte characters, needing only to check for an exact byte-for-byte match. More complicated programs, such as grep, that use regular expression pattern matching may need to understand multibyte characters, but only the common set of functions that manages the regular expression needs this knowledge. The grep program itself requires no other special multibyte character handling.

C Language Features
To give even more flexibility to the programmer in an Asian-language environment, ANSI/ISO C provides wide character constants and wide string literals. These have the same form as their non-wide versions, except that they are immediately prefixed by the letter L:
'x'          regular character constant
'¥'          regular character constant
L'x'         wide character constant
L'¥'         wide character constant
"abc¥xyz"    regular string literal
L"abc¥xyz"   wide string literal

Multibyte characters are valid in both the regular and wide versions. The sequence of bytes necessary to produce the ideogram ¥ is encoding-specific, but if it consists of more than one byte, the value of the character constant '¥' is implementation-defined, just as the value of 'ab' is implementation-defined. Except for escape sequences, a regular string literal contains exactly the bytes specified between the quotes, including the bytes of each specified multibyte character.

When the compilation system encounters a wide character constant or wide string literal, each multibyte character is converted into a wide character, as if by calling the mbtowc() function. Thus, the type of L'¥' is wchar_t; the type of L"abc¥xyz" is array of wchar_t with length eight. Just as with regular string literals, each wide string literal has an extra zero-valued element appended, but in these cases, it is a wchar_t with value zero.

Just as regular string literals can be used as a shorthand method for character array initialization, wide string literals can be used to initialize wchar_t arrays:
wchar_t *wp = L"a¥z";
wchar_t x[] = L"a¥z";
wchar_t y[] = {L'a', L'¥', L'z', 0};
wchar_t z[] = {'a', L'¥', 'z', '\0'};
In the above example, the three arrays x, y, and z, and the array pointed to by wp, have the same length. All are initialized with identical values.

Finally, adjacent wide string literals are concatenated, just as with regular string literals. However, adjacent regular and wide string literals produce undefined behavior. A compiler is not required to produce an error if it does not accept such concatenations.
Standard Headers and Reserved Names
Early in the standardization process, the ANSI/ISO Standards Committee chose to include library functions, macros, and header files as part of ANSI/ISO C. While this decision was necessary for the writing of truly portable C programs, a side effect is the basis of some of the most negative comments about ANSI/ISO C from the public--a large set of reserved names.
This section presents the various categories of reserved names and some rationale for their reservations. At the end is a set of rules to follow that can steer your programs clear of any reserved names.
Balancing Process
To match existing implementations, the ANSI/ISO C committee chose names like printf and NULL. However, each such name reduced the set of names available for free use in C programs.

On the other hand, before standardization, implementors felt free to add both new keywords to their compilers and names to headers. No program could be guaranteed to compile from one release to another, let alone port from one vendor's implementation to another.
As a result, the Committee made a hard decision: to restrict all conforming implementations from including any extra names, except those with certain forms. It is this decision that causes most C compilation systems to be almost conforming. Nevertheless, the Standard contains 32 keywords and almost 250 names in its headers, none of which necessarily follow any particular naming pattern.
Standard Headers
The standard headers are:
TABLE 7-3 Standard Headers

assert.h   locale.h   stddef.h
ctype.h    math.h     stdio.h
errno.h    setjmp.h   stdlib.h
float.h    signal.h   string.h
limits.h   stdarg.h   time.h
Most implementations provide more headers, but a strictly conforming ANSI/ISO C program can only use these.
Other standards disagree slightly regarding the contents of some of these headers. For example, POSIX (IEEE 1003.1) specifies that fdopen is declared in stdio.h. To allow these two standards to coexist, POSIX requires that the macro _POSIX_SOURCE be #defined prior to the inclusion of any header to guarantee that these additional names exist. In its Portability Guide, X/Open has also used this macro scheme for its extensions. X/Open's macro is _XOPEN_SOURCE.

ANSI/ISO C requires the standard headers to be both self-sufficient and idempotent. No standard header needs any other header to be #included before or after it, and each standard header can be #included more than once without causing problems. The Standard also requires that its headers be #included only in safe contexts, so that the names used in the headers are guaranteed to remain unchanged.

Names Reserved for Implementation Use
The Standard places further restrictions on implementations regarding their libraries. In the past, most programmers learned not to use names like read and write for their own functions on UNIX systems. ANSI/ISO C requires that only names reserved by the Standard be introduced by references within the implementation.

Thus, the Standard reserves a subset of all possible names for implementations to use. This class consists of identifiers that begin with an underscore and continue with either another underscore or a capital letter, that is, all names matching the following regular expression:
_[_A-Z][0-9_a-zA-Z]*
Strictly speaking, if your program uses such an identifier, its behavior is undefined. Thus, programs using _POSIX_SOURCE (or _XOPEN_SOURCE) have undefined behavior.

However, undefined behavior comes in different degrees. If, in a POSIX-conforming implementation, you use _POSIX_SOURCE, you know that your program's undefined behavior consists of certain additional names in certain headers, and your program still conforms to an accepted standard. This deliberate loophole in the ANSI/ISO C standard allows implementations to conform to seemingly incompatible specifications. On the other hand, an implementation that does not conform to the POSIX standard is free to behave in any manner when encountering a name such as _POSIX_SOURCE.

The Standard also reserves all other names that begin with an underscore for use in header files as regular file scope identifiers and as tags for structures and unions, but not in local scopes. The common practice of having functions named _filbuf and _doprnt to implement hidden parts of the library is allowed.

Names Reserved for Expansion
In addition to all the names explicitly reserved, the Standard also reserves (for implementations and future standards) names matching certain patterns:

errno.h    macros that begin with E followed by a digit or an uppercase letter
ctype.h    functions that begin with is or to followed by a lowercase letter
locale.h   macros that begin with LC_ followed by an uppercase letter
math.h     the existing math function names suffixed with f or l
signal.h   macros that begin with SIG or SIG_ followed by an uppercase letter
stdlib.h   functions that begin with str followed by a lowercase letter
string.h   functions that begin with str, mem, or wcs followed by a lowercase letter

In the above list, names that begin with a capital letter are macros and are reserved only when the associated header is included. The rest of the names designate functions and cannot be used to name any global objects or functions.
Names Safe to Use
There are four simple rules you can follow to keep from colliding with any ANSI/ISO C reserved names:
- #include all system headers at the top of your source files (except possibly after a #define of _POSIX_SOURCE or _XOPEN_SOURCE, or both).
- Do not define or declare any names that begin with an underscore.
- Use an underscore or a capital letter somewhere within the first few characters of all file scope tags and regular names. Beware of the va_ prefix found in stdarg.h or varargs.h.
- Use a digit or a non-capital letter somewhere within the first few characters of all macro names. Almost all names beginning with an E are reserved if errno.h is #included.

These rules are just a general guideline to follow, as most implementations will continue to add names to the standard headers by default.
Internationalization
The section Multibyte Characters and Wide Characters introduced the internationalization of the standard libraries. This section discusses the affected library functions and gives some hints on how programs should be written to take advantage of these features.
Locales
At any time, a C program has a current locale--a collection of information that describes the conventions appropriate to some nationality, culture, and language. Locales have names that are strings. The only two standardized locale names are "C" and "". Each program begins in the "C" locale, which causes all library functions to behave just as they have historically. The "" locale is the implementation's best guess at the correct set of conventions appropriate to the program's invocation. "C" and "" can cause identical behavior. Other locales may be provided by implementations.
The setlocale() Function

The setlocale() function is the interface to the program's locale. In general, any program that uses the invocation country's conventions should place a call such as:
#include <locale.h>
/* ... */
setlocale(LC_ALL, "");
early in the program's execution path. This call causes the program's current locale to change to the appropriate local version, since LC_ALL is the macro that specifies the entire locale instead of one category. The following are the standard categories:

LC_ALL        the entire locale
LC_COLLATE    collation sequence (strcoll() and strxfrm())
LC_CTYPE      character classification and conversion (the ctype.h functions)
LC_MONETARY   monetary formatting (localeconv())
LC_NUMERIC    the decimal-point character (printf(), strtod(), localeconv())
LC_TIME       date and time formatting (strftime())
Any of these macros can be passed as the first argument to setlocale() to specify that category.

The setlocale() function returns the name of the current locale for a given category (or LC_ALL) and serves in an inquiry-only capacity when its second argument is a null pointer. Thus, code similar to the following can be used to change the locale, or a portion thereof, for a limited duration:
Most programs do not need this capability.
Changed Functions
Wherever possible and appropriate, existing library functions were extended to include locale-dependent behavior. These functions came in two groups:
- Those declared by the ctype.h header (character classification and conversion), and
- Those that convert to and from printable and internal forms of numeric values, such as printf() and strtod().

All ctype.h predicate functions, except isdigit() and isxdigit(), can return nonzero (true) for additional characters when the LC_CTYPE category of the current locale is other than "C". In a Spanish locale, isalpha('ñ') should be true. Similarly, the character conversion functions, tolower() and toupper(), should appropriately handle any extra alphabetic characters identified by the isalpha() function. The ctype.h functions are almost always macros implemented using table lookups indexed by the character argument. Their behavior is changed by resetting the table(s) to the new locale's values, so there is no performance impact.

Those functions that write or interpret printable floating values can change to use a decimal-point character other than period (.) when the LC_NUMERIC category of the current locale is other than "C". There is no provision for converting numeric values to printable form with thousands-separator characters. When converting from printable form to internal form, implementations are allowed to accept such additional forms, again in other than the "C" locale. The functions that make use of the decimal-point character are the printf() and scanf() families, atof(), and strtod(). The functions allowed implementation-defined extensions are atof(), atoi(), atol(), strtod(), strtol(), strtoul(), and the scanf() family.

New Functions
Certain locale-dependent capabilities were added as new standard functions. Besides setlocale(), which allows control over the locale itself, the Standard includes the following new functions:
localeconv()   numeric/monetary conventions
strcoll()      collation order of two strings
strxfrm()      translate string for collation
strftime()     locale-dependent date and time formatting
In addition, there are the multibyte functions mblen(), mbtowc(), mbstowcs(), wctomb(), and wcstombs().

The localeconv() function returns a pointer to a structure containing information useful for formatting numeric and monetary information appropriate to the current locale's LC_NUMERIC and LC_MONETARY categories. This is the only function whose behavior depends on more than one category. For numeric values, the structure describes the decimal-point character, the thousands separator, and where the separator(s) should be located. There are fifteen other structure members that describe how to format a monetary value.
The strcoll() function is analogous to the strcmp() function, except that the two strings are compared according to the LC_COLLATE category of the current locale. The strxfrm() function can be used to transform a string into another, such that any two such after-translation strings can be passed to strcmp() to get an ordering analogous to what strcoll() would have returned if passed the two pre-translation strings.

The strftime() function provides formatting, similar to that of sprintf(), of the values in a struct tm, along with some date and time representations that depend on the LC_TIME category of the current locale. This function is based on the ascftime() function released as part of UNIX System V Release 3.2.

Grouping and Evaluation in Expressions
One of the choices made by Dennis Ritchie in the design of C was to give compilers a license to rearrange expressions involving adjacent operators that are mathematically commutative and associative, even in the presence of parentheses. This is explicitly noted in the appendix of The C Programming Language by Kernighan and Ritchie. However, ANSI/ISO C does not grant compilers this same freedom.
This section discusses the differences between these two definitions of C and clarifies the distinctions between an expression's side effects, grouping, and evaluation by considering the expression statement from the following code fragment.
int i, *p, f(void), g(void);
/* ... */
i = *++p + f() + g();
Definitions
The side effects of an expression are its modifications to memory and its accesses to volatile-qualified objects. The side effects in the above expression are the updating of i and p and any side effects contained within the functions f() and g().

An expression's grouping is the way values are combined with other values and operators. The above expression's grouping is primarily the order in which the additions are performed.
An expression's evaluation includes everything necessary to produce its resulting value. To evaluate an expression, all specified side effects must occur anywhere between the previous and next sequence point, and the specified operations are performed with a particular grouping. For the above expression, the updating of i and p must occur after the previous statement and by the ; of this expression statement; the calls to the functions can occur in either order, any time after the previous statement, but before their return values are used. In particular, the operators that cause memory to be updated have no requirement to assign the new value before the value of the operation is used.

The K&R C Rearrangement License
The K&R C rearrangement license applies to the above expression because addition is mathematically commutative and associative. To distinguish between regular parentheses and the actual grouping of an expression, the left and right curly braces designate grouping. The three possible groupings for the expression are:
i = { {*++p + f()} + g() };
i = { *++p + {f() + g()} };
i = { {*++p + g()} + f() };
All of these are valid given K&R C rules. Moreover, all of these groupings are valid even if the expression were written instead, for example, in either of these ways:
i = *++p + (f() + g());
i = (g() + *++p) + f();
If this expression is evaluated on an architecture for which either overflows cause an exception, or addition and subtraction are not inverses across an overflow, these three groupings behave differently if one of the additions overflows.
For such expressions on these architectures, the only recourse available in K&R C was to split the expression to force a particular grouping. The following are possible rewrites that respectively enforce the above three groupings:
i = *++p; i += f(); i += g();
i = f(); i += g(); i += *++p;
i = *++p; i += g(); i += f();
The ANSI/ISO C Rules
ANSI/ISO C does not allow operations to be rearranged that are mathematically commutative and associative, but that are not actually so on the target architecture. Thus, the precedence and associativity of the ANSI/ISO C grammar completely describes the grouping for all expressions; all expressions must be grouped as they are parsed. The expression under consideration is grouped in this manner:
i = { {*++p + f()} + g() };
This code still does not mean that f() must be called before g(), or that p must be incremented before g() is called.

In ANSI/ISO C, expressions need not be split to guard against unintended overflows.
The Parentheses
ANSI/ISO C is often erroneously described as honoring parentheses or evaluating according to parentheses due to an incomplete understanding or an inaccurate presentation.
Since ANSI/ISO C expressions simply have the grouping specified by their parsing, parentheses still only serve as a way of controlling how an expression is parsed; the natural precedence and associativity of expressions carry exactly the same weight as parentheses.
The above expression could have been written as:
i = (((*(++p)) + f()) + g());
with no different effect on its grouping or evaluation.
The As If Rule
There were several reasons for the K&R C rearrangement rules:
- The rearrangements provide many more opportunities for optimizations, such as compile-time constant folding.
- The rearrangements do not change the result of integral-typed expressions on most machines.
- Some of the operations are both mathematically and computationally commutative and associative on all machines.
The ANSI/ISO C Committee eventually became convinced that the rearrangement rules were intended to be an instance of the as if rule when applied to the described target architectures. ANSI/ISO C's as if rule is a general license that permits an implementation to deviate arbitrarily from the abstract machine description as long as the deviations do not change the behavior of a valid C program.
Thus, all the binary bitwise operators (other than shifting) are allowed to be rearranged on any machine because there is no way to notice such regroupings. On typical two's-complement machines in which overflow wraps around, integer expressions involving multiplication or addition can be rearranged for the same reason.
Therefore, this change in C does not have a significant impact on most C programmers.
Incomplete Types
The ANSI/ISO C standard introduced the term "incomplete type" to formalize a fundamental, yet misunderstood, portion of C, implicit from its beginnings. This section describes incomplete types, where they are permitted, and why they are useful.
Types
ANSI/ISO separates C's types into three distinct sets: function, object, and incomplete. Function types are obvious; object types cover everything else, except when the size of the object is not known. The Standard uses the term "object type" to specify that the designated object must have a known size, but it is important to know that incomplete types other than void also refer to an object.

There are only three variations of incomplete types: void, arrays of unspecified length, and structures and unions with unspecified content. The type void differs from the other two in that it is an incomplete type that cannot be completed, and it serves as a special function return and parameter type.
An array type is completed by specifying the array size in a following declaration in the same scope that denotes the same object. When an array without a size is declared and initialized in the same declaration, the array has an incomplete type only between the end of its declarator and the end of its initializer.
An incomplete structure or union type is completed by specifying the content in a following declaration in the same scope for the same tag.
Declarations
Certain declarations can use incomplete types, but others require complete object types. Those declarations that require object types are array elements, members of structures or unions, and objects local to a function. All other declarations permit incomplete types. In particular, the following constructs are permitted:
- Pointers to incomplete types
- Functions returning incomplete types
- Incomplete function parameter types
- typedef names for incomplete types

The function return and parameter types are special. Except for void, an incomplete type used in such a manner must be completed by the time the function is defined or called. A return type of void specifies a function that returns no value, and a single parameter type of void specifies a function that accepts no arguments.

Since array and function parameter types are rewritten to be pointer types, a seemingly incomplete array parameter type is not actually incomplete. The typical declaration of main's argv, namely, char *argv[], as an unspecified-length array of character pointers, is rewritten to be a pointer to character pointers.

Expressions
Most expression operators require complete object types. The only three exceptions are the unary & operator, the first operand of the comma operator, and the second and third operands of the ?: operator. Most operators that accept pointer operands also permit pointers to incomplete types, unless pointer arithmetic is required. The list includes the unary * operator. For example, given:

void *p;

the expression &*p is a valid subexpression that makes use of this.
Why are incomplete types necessary? Ignoring void, there is only one feature provided by incomplete types that C has no other way to handle: forward references to structures and unions. If two structures need pointers to each other, the only way to declare them is with incomplete types:

struct a { struct b *bp; };
struct b { struct a *ap; };
All strongly typed programming languages that have some form of pointer and heterogeneous data types provide some method of handling this case.
Examples
Defining typedef names for incomplete structure and union types is frequently useful. If you have a complicated set of data structures that contain many pointers to each other, having a list of typedefs to the structures up front, possibly in a central header, can simplify the declarations:

typedef struct item_tag Item;
typedef union note_tag Note;
typedef struct list_tag List;
. . .
struct item_tag { . . . };
. . .
struct list_tag { . . . };
Moreover, for those structures and unions whose contents should not be available to the rest of the program, a header can declare the tag without the content. Other parts of the program can use pointers to the incomplete structure or union without any problems, unless they attempt to use any of its members.
A frequently used incomplete type is an external array of unspecified length. Generally, it is not necessary to know the extent of an array to make use of its contents.
Compatible and Composite Types
With K&R C, and even more so with ANSI/ISO C, it is possible for two declarations that refer to the same entity to be other than identical. The term "compatible type" is used in ANSI/ISO C to denote those types that are "close enough". This section describes compatible types as well as "composite types"--the result of combining two compatible types.
Multiple Declarations
If a C program were only allowed to declare each object or function once, there would be no need for compatible types. Linkage, which allows two or more declarations to refer to the same entity, function prototypes, and separate compilation all need such a capability. Separate translation units (source files) have different rules for type compatibility from within a single translation unit.
Separate Compilation Compatibility
Since each compilation probably looks at different source files, most of the rules for compatible types across separate compiles are structural in nature:
- Matching scalar (integral, floating, and pointer) types must be compatible, as if they were in the same source file.
- Matching structures, unions, and enums must have the same number of members. Each matching member must have a compatible type (in the separate compilation sense), including bit-field widths.
- Matching structures must have the members in the same order. The order of union and enum members does not matter.
- Matching enum members must have the same value.
- An additional requirement is that the names of members, including the lack of names for unnamed members, match for structures, unions, and enums, but not necessarily their respective tags.
Single Compilation Compatibility
When two declarations in the same scope describe the same object or function, the two declarations must specify compatible types. These two types are then combined into a single composite type that is compatible with the first two. More about composite types later.
The compatible types are defined recursively. At the bottom are the type specifier keywords. These are the rules that say that unsigned short is the same as unsigned short int, and that a type without type specifiers is the same as one with int. All other types are compatible only if the types from which they are derived are compatible. For example, two qualified types are compatible if the qualifiers, const and volatile, are identical, and the unqualified base types are compatible.
For two pointer types to be compatible, the types they point to must be compatible and the two pointers must be identically qualified. Recall that the qualifiers for a pointer are specified after the *, so that these two declarations

int *const cpi;
int *volatile vpi;

declare two differently qualified pointers to the same type, int.

Compatible Array Types
For two array types to be compatible, their element types must be compatible. If both array types have a specified size, the sizes must match. An incomplete array type (see Incomplete Types) is compatible both with another incomplete array type and with an array type that has a specified size.
Compatible Function Types
To make functions compatible, follow these rules:
- For two function types to be compatible, their return types must be compatible. If either or both function types have prototypes, the rules are more complicated.
- For two function types with prototypes to be compatible, they must also have the same number of parameters, including use of the ellipsis (...) notation, and the corresponding parameters must be parameter-compatible.
- For an old-style function definition to be compatible with a function type with a prototype, the prototype parameters must not end with an ellipsis (...). Each of the prototype parameters must be parameter-compatible with the corresponding old-style parameter, after application of the default argument promotions.
- For an old-style function declaration (not a definition) to be compatible with a function type with a prototype, the prototype parameters must not end with an ellipsis (...). All of the prototype parameters must have types that would be unaffected by the default argument promotions.
- For two types to be parameter-compatible, the types must be compatible after the top-level qualifiers, if any, have been removed, and after a function or array type has been converted to the appropriate pointer type.
Special Cases
signed int behaves the same as int, except possibly for bit-fields, in which a plain int may denote an unsigned-behaving quantity.

Another interesting note is that each enumeration type must be compatible with some integral type. For portable programs, this means that enumeration types are separate types. In general, the ANSI/ISO C standard views them in that manner.
Composite Types
The construction of a composite type from two compatible types is also recursively defined. The ways compatible types can differ from each other are due either to incomplete arrays or to old-style function types. As such, the simplest description of the composite type is that it is the type compatible with both of the original types, including every available array size and every available parameter list from the original types.
Sun Microsystems, Inc. All rights reserved.