Sun Studio 12: C User's Guide

6.5 Tokenization and Preprocessing

Probably the least specified part of previous versions of C concerned the operations that transformed each source file from a bunch of characters into a sequence of tokens, ready to parse. These operations included recognition of white space (including comments), bundling consecutive characters into tokens, handling preprocessing directive lines, and macro replacement. However, their respective ordering was never guaranteed.

6.5.1 ISO C Translation Phases

ISO C specifies the order of the translation phases. The eight phases are described below in the order in which they are applied.

Every trigraph sequence in the source file is replaced. ISO C defines exactly nine trigraph sequences, invented solely as a concession to deficient character sets. Each is a three-character sequence that names a character not in the ISO 646-1983 character set:

Table 6–1 Trigraph Sequences


Trigraph Sequence	Converts to
`??=`	`#`
`??-`	`~`
`??(`	`[`
`??)`	`]`
`??!`	`|`
`??<`	`{`
`??>`	`}`
`??/`	`\`
`??'`	`^`

These sequences must be understood by ISO C compilers, but we do not recommend their use. When you use the -xtransition option, the ISO C compiler warns you whenever it replaces a trigraph while in transition (-Xt) mode, even in comments. For example, consider the following:

/* comment *??/
/* still comment? */

The ??/ becomes a backslash. This character and the following newline are removed. The resulting characters are:

/* comment */* still comment? */

The first / from the second line is the end of the comment. The next token is the *.

Every backslash/new-line character pair is deleted.
The source file is converted into preprocessing tokens and sequences of white space. Each comment is effectively replaced by a space character.
Every preprocessing directive is handled and all macro invocations are replaced. Each #included source file is run through the earlier phases before its contents replace the directive line.
Every escape sequence (in character constants and string literals) is interpreted.
Adjacent string literals are concatenated.
Every preprocessing token is converted into a regular token; the compiler properly parses these and generates code.
All external object and function references are resolved, resulting in the final program.
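The effect of this ordering can be seen in a small sketch. The macro GREETING and the function phase_demo are hypothetical names chosen for illustration; they do not come from the compiler or its headers. The backslash/new-line pair is deleted first, the macro is replaced later, and only then are the adjacent string literals concatenated:

```c
/* GREETING and phase_demo are illustrative names, not part of any library. */
#define GREETING "hello, "

/*
 * Line splicing joins the two physical lines of the second literal,
 * macro replacement then substitutes GREETING, and finally the two
 * adjacent string literals are concatenated into one.
 */
static const char *phase_demo(void)
{
    return GREETING "wor\
ld";
}
```

Note that the continuation line starts in the first column. Any leading white space on that line would become part of the string literal, because the splice happens before tokenization.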

6.5.2 Old C Translation Phases

Previous C compilers did not follow such a simple sequence of phases, nor were there any guarantees for when these steps were applied. A separate preprocessor recognized tokens and white space at essentially the same time as it replaced macros and handled directive lines. The output was then completely retokenized by the compiler proper, which then parsed the language and generated code.

Because the tokenization process within the preprocessor was a moment-by-moment operation and macro replacement was done as a character-based, not token-based, operation, the tokens and white space could have a great deal of variation during preprocessing.

There are a number of differences that arise from these two approaches. The rest of this section discusses how code behavior may change due to line splicing, macro replacement, stringizing, and token pasting, which occur during macro replacement.

6.5.3 Logical Source Lines

In K&R C, backslash/new-line pairs were allowed only as a means to continue a directive, a string literal, or a character constant to the next line. ISO C extended the notion so that a backslash/new-line pair can continue anything to the next line. The result is a logical source line. Therefore, any code that relied on the separate recognition of tokens on either side of a backslash/new-line pair does not behave as expected.
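As a minimal sketch of a logical source line, the declaration below is split in the middle of an identifier. The names line_total and get_line_total are made up for this example. Under ISO C the two physical lines are spliced before tokenization, so the compiler sees a single identifier:

```c
/* line_total is an illustrative name; the split falls inside the identifier. */
static int line_to\
tal = 5;

/* After splicing, the declaration above reads: static int line_total = 5; */
static int get_line_total(void)
{
    return line_total;
}
```

Code written for an old C compiler that expected line_to and tal to be recognized as separate tokens would therefore see different behavior.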

6.5.4 Macro Replacement

The macro replacement process had never been described in detail prior to ISO C. This vagueness spawned a great many divergent implementations. Any code that relied on anything fancier than manifest constant replacement and simple function-like macros was probably not truly portable. This manual cannot uncover all the differences between the old C macro replacement implementation and the ISO C version. Nearly all uses of macro replacement, with the exception of token pasting and stringizing, produce exactly the same series of tokens as before. Furthermore, the ISO C macro replacement algorithm can do things not possible in the old C version. For example,

#define name (*name)

causes any use of name to be replaced with an indirect reference through name. The old C preprocessor would produce a huge number of parentheses and stars and eventually produce an error about macro recursion.
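A minimal sketch of this technique follows. The variable storage and the functions read_name and write_name are hypothetical, added only to give the macro something to act on. Because ISO C does not re-expand a macro name inside its own replacement list, each use of name becomes (*name) exactly once:

```c
/* storage, read_name, and write_name are illustrative names. */
static int storage = 7;
static int *name = &storage;   /* declared before the macro is defined */

#define name (*name)           /* from here on, name means (*name) */

static int read_name(void)
{
    return name;               /* expands to: return (*name); */
}

static void write_name(int value)
{
    name = value;              /* expands to: (*name) = value; */
}
```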

The major change in the macro replacement approach taken by ISO C is to require macro arguments, other than those that are operands of the macro substitution operators # and ##, to be expanded recursively prior to their substitution in the replacement token list. However, this change seldom produces an actual difference in the resulting tokens.
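The following sketch shows one of the places where the rule is visible. The macros LEVEL, STR, XSTR, and ADD_ONE are illustrative names, not taken from any header. An argument that is an operand of # is not expanded, while an argument passed through an intermediate macro is expanded first:

```c
/* LEVEL, STR, XSTR, and ADD_ONE are hypothetical macros for illustration. */
#define LEVEL 12
#define STR(x)     #x          /* x is an operand of #, so it is not expanded */
#define XSTR(x)    STR(x)      /* x is expanded before STR is applied */
#define ADD_ONE(x) ((x) + 1)   /* ordinary parameter, expanded as usual */

static const char *unexpanded(void)
{
    return STR(LEVEL);         /* yields "LEVEL" */
}

static const char *expanded(void)
{
    return XSTR(LEVEL);        /* yields "12" */
}

static int plain(void)
{
    return ADD_ONE(LEVEL);     /* yields ((12) + 1) */
}
```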

6.5.5 Using Strings

Note –

In ISO C, the examples below marked with a ? produce a warning about use of old features when you use the -xtransition option. Only in the transition modes (-Xt and -Xs) is the result the same as in previous versions of C.

In K&R C, the following code produced the string literal "x y!":

#define str(a) "a!"   ?
str(x y)

Thus, the preprocessor searched inside string literals and character constants for characters that looked like macro parameters. ISO C recognized the importance of this feature, but could not condone operations on parts of tokens. In ISO C, all invocations of the above macro produce the string literal "a!". To achieve the old effect in ISO C, we make use of the # macro substitution operator and the concatenation of string literals.

#define str(a) #a "!"
str(x y)

The above code produces the two string literals "x y" and "!" which, after concatenation, produce the identical "x y!".
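The ISO C form can be checked with a short sketch. The function names old_effect and old_effect_spaced are hypothetical. The second function illustrates that the # operator reduces each run of white space between the argument's tokens to a single space:

```c
#define str(a) #a "!"

/* old_effect and old_effect_spaced are illustrative names. */
static const char *old_effect(void)
{
    return str(x y);           /* "x y" "!" becomes "x y!" */
}

static const char *old_effect_spaced(void)
{
    return str(x     y);       /* interior white space collapses to one space */
}
```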

There is no direct replacement for the analogous operation for character constants. The major use of this feature was similar to the following:

#define CNTL(ch) (037 & 'ch')    ?
CNTL(L)

which produced

(037 & 'L')

which evaluates to the ASCII control-L character. The best solution we know of is to change all uses of this macro to:

#define CNTL(ch) (037 & (ch))
CNTL('L')

This code is more readable and more useful, as it can also be applied to expressions.
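A brief sketch of the rewritten macro, assuming an ASCII execution character set. The functions control_l and control_of are hypothetical helpers added for illustration; the second shows the macro applied to an expression rather than a single character constant:

```c
#define CNTL(ch) (037 & (ch))

/* control_l and control_of are illustrative names; ASCII is assumed. */
static int control_l(void)
{
    return CNTL('L');          /* control-L, the form feed character in ASCII */
}

static int control_of(int c)
{
    return CNTL(c);            /* the argument may be any integer expression */
}
```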

6.5.6 Token Pasting

In K&R C, there were at least two ways to combine two tokens. Both invocations in the following produced a single identifier x1 out of the two tokens x and 1.

#define self(a) a
#define glue(a,b) a/**/b ?
self(x)1
glue(x,1)

Again, ISO C could not sanction either approach. In ISO C, both the above invocations would produce the two separate tokens x and 1. The second of the above two methods can be rewritten for ISO C by using the ## macro substitution operator:

#define glue(a,b) a ## b
glue(x, 1)

# and ## should be used as macro substitution operators only when __STDC__ is defined. Since ## is an actual operator, the invocation can be much freer with respect to white space in both the definition and invocation.

There is no direct approach to effect the first of the two old-style pasting schemes, but since it put the burden of the pasting at the invocation, it was used less frequently than the other form.
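The two forms of glue can be combined in one sketch that selects the definition according to __STDC__, as suggested above. The variable x1 and the functions pasted and pasted_spaced are hypothetical names. The fallback branch is only the old-style definition from this section and is untested here, since an ISO C compiler always takes the first branch:

```c
#if defined(__STDC__)
#define glue(a, b) a ## b      /* ISO C token pasting */
#else
#define glue(a, b) a/**/b      /* old-style pasting, for pre-ISO preprocessors */
#endif

/* x1, pasted, and pasted_spaced are illustrative names. */
static int x1 = 42;

static int pasted(void)
{
    return glue(x, 1);         /* becomes the single identifier x1 */
}

static int pasted_spaced(void)
{
    return glue( x , 1 );      /* white space around the arguments is harmless */
}
```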