Tokenization and Preprocessing - Oracle® Developer Studio 12.6: C User's Guide

Language:

7.4 Tokenization and Preprocessing

Probably the least specified part of previous versions of C concerned the operations that transformed each source file from a bunch of characters into a sequence of tokens, ready to parse. These operations included recognition of white space (including comments), bundling consecutive characters into tokens, handling preprocessing directive lines, and macro replacement. However, their respective ordering was never guaranteed.

7.4.1 ISO C Translation Phases

The order of these translation phases is specified by ISO C.

Every trigraph sequence in the source file is replaced . ISO C has exactly nine trigraph sequences that were invented solely as a concession to deficient character sets. They are three-character sequences that name a character not in the ISO 646-1983 character set:

Table 16 Trigraph Sequences

Trigraph Sequence	Converts to
`??=`	`#`
`??-`	`~`
`??(`	`[`
`??)`	`]`
`??!`	`\|`
`??<`	`{`
`??>`	`}`
`??/`	`\`
`??’`	`^`

These sequences must be understood by ISO C compilers, but are not recommended. When you use the -xtransition option, the ISO C compiler warns you whenever it replaces a trigraph while in transition (–Xt) mode, even in comments. For example, consider the following:

/* comment *??/
/* still comment? */

The ??/ becomes a backslash. This character and the following newline are removed. The resulting characters are:

/* comment */* still comment? */

The first / from the second line is the end of the comment. The next token is the *.

Every backslash/new-line character pair is deleted.
The source file is converted into preprocessing tokens and sequences of white space. Each comment is effectively replaced by a space character.
Every preprocessing directive is handled and all macro invocations are replaced. Each #included source file is run through the earlier phases before its contents replace the directive line.
Every escape sequence (in character constants and string literals) is interpreted.
Adjacent string literals are concatenated.
Every preprocessing token is converted into a regular token. The compiler properly parses these and generates code.
All external object and function references are resolved, resulting in the final program.

7.4.2 Old C Translation Phases

Previous C compilers did not follow such a simple sequence of phases, and the order in which these steps were applied was not predictable. A separate preprocessor recognized tokens and white space at essentially the same time as it replaced macros and handled directive lines. The output was then completely retokenized by the compiler proper, which then parsed the language and generated code.

The tokenization process within the preprocessor was a moment-by-moment operation and macro replacement was done as a character-based, not token-based, operation. Therefore, the tokens and white space could greatly vary during preprocessing.

A number of differences arise from these two approaches. The rest of this section discusses how code behavior can change due to line splicing, macro replacement, stringizing, and token pasting, which occur during macro replacement.

7.4.3 Logical Source Lines

In K&R C, backslash/new-line pairs were allowed only as a means to continue a directive, a string literal, or a character constant to the next line. ISO C extended the notion so that a backslash/new-line pair can continue anything to the next line. The result is a logical source line. Therefore, any code that relies on the separate recognition of tokens on either side of a backslash/new-line pair does not behave as expected.

7.4.4 Macro Replacement

The macro replacement process was not described in detail prior to ISO C. This vagueness spawned a great many divergent implementations. Any code that relied on anything more complex than manifest constant replacement and simple function–like macros was probably not truly portable. This manual cannot uncover all the differences between the old C macro replacement implementation and the ISO C version. Nearly all uses of macro replacement with the exception of token pasting and stringizing produce exactly the same series of tokens as before. Furthermore, the ISO C macro replacement algorithm can do things not possible in the old C version. The following example causes any use of name to be replaced with an indirect reference through name.

#define name (*name)

The old C preprocessor would produce a huge number of parentheses and stars and eventually produce an error about macro recursion.

The major change in the macro replacement approach taken by ISO C is to require macro arguments, other than those that are operands of the macro substitution operators # and ##, to be expanded recursively prior to their substitution in the replacement token list. However, this change seldom produces an actual difference in the resulting tokens.

7.4.5 Using Strings

Note - In ISO C, the examples below marked with a ? produce a warning about use of old features when you use the -xtransition option. Only in the transition mode ( –Xt and -Xs) is the result the same as in previous versions of C .

In K&R C, the following code produced the string literal "x y!":

#define str(a) "a!"   ?
str(x y)

Thus, the preprocessor searched inside string literals and character constants for characters that looked like macro parameters. ISO C recognized the importance of this feature, but could not condone operations on parts of tokens. In ISO C, all invocations of the above macro produce the string literal "a!". To achieve the old effect in ISO C, use the # macro substitution operator and the concatenation of string literals.

#define str(a) #a "!"
str(x y)

This code produces the two string literals "x y" and "!" which, after concatenation, produce the identical "x y!".

There is no direct replacement for the analogous operation for character constants. The major use of this feature was similar to the following example:

#define CNTL(ch) (037 & ’ch’)    ?
CNTL(L)

This example produces the following result, which evaluates to the ASCII control-L character.

(037 & ’L’)

The best solution is to change all uses of this macro as follows:

#define CNTL(ch) (037 & (ch))
CNTL(’L’)

This code is more readable and more useful, as it can also be applied to expressions.

7.4.6 Token Pasting

K&R C had at least two ways to combine two tokens. Both invocations in the following code produced a single identifier x1 out of the two tokens x and 1.

#define self(a) a
#define glue(a,b) a/**/b ?
self(x)1
glue(x,1)

Again, ISO C could not sanction either approach. In ISO C, both invocations would produce the two separate tokens x and 1. The second of the two methods can be rewritten for ISO C by using the ## macro substitution operator:

#define glue(a,b) a ## b
glue(x, 1)

# and ## should be used as macro substitution operators only when __STDC__ is defined. Because ## is an actual operator, the invocation can be much freer with respect to white space in both the definition and invocation.

The compiler issues a warning diagnostic for an undefined ## operation (C standard, section 3.4.3), where undefined is a ## result that, when preprocessed, consists of multiple tokens rather than one single token (C standard, section 6.10.3.3(3)). The result of an undefined ## operation is now defined as the first individual token generated by preprocessing the string created by concatenating the ## operands.

No direct approach reproduces the first of the two old-style pasting schemes but because it put the burden of the pasting at the invocation, it was used less frequently than the other form .