Programming Utilities Guide

`lex` Routines

The following macros enable you to perform special actions.

input() reads another character

unput(c) puts the character c back to be read again a moment later

output(c) writes the character c on an output device

One way to ignore all characters between two special characters, such as between a pair of double quotation marks, is to use input() like this:

\"          { int c; while ((c = input()) != '"' && c != 0) ; /* stop at the closing quote or at end of file */ }

After the first double quotation mark, the scanner reads all subsequent characters, without looking for a match, until it reads the second double quotation mark. (See the further examples of input() and unput(c) usage in the "User Routines" section.)

For special I/O needs that are not covered by these default macros, such as writing to several files, use standard I/O routines in C to rewrite the macro functions.

These routines, however, must be modified consistently. In particular, the character set used must be consistent in all routines, and a value of 0 returned by input() must mean end of file. The relationship between input() and unput(c) must be maintained or the lex lookahead will not work.

If you do provide your own input(), output(c), or unput(c), write a #undef input and so on in your definitions section first:

#undef input
#undef output
  . . .
#define input() ...  etc.
more declarations 
  . . .

Your new routines will replace the standard ones. See the "Definitions" section for further details.
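As a rough illustration of the consistency requirement, the following C sketch pairs a replacement input routine with a matching unput routine. The names my_input(), my_unput(), src, and the fixed-size pushback stack are assumptions made for this example and are not part of lex. In a specification they would typically be hooked in after the #undef lines with definitions such as #define input() my_input() and #define unput(c) my_unput(c).

```c
#include <stdio.h>

#define PUSHBACK_MAX 128            /* hypothetical limit on pushed-back characters */

FILE *src;                          /* hypothetical current input stream */
int pushback[PUSHBACK_MAX];         /* characters returned by my_unput(), newest last */
int npushed = 0;

/* Return the next character, or 0 at end of file, as lex expects from input(). */
int my_input(void)
{
    int c;

    if (npushed > 0)                /* pushed-back characters are reread first */
        return pushback[--npushed];
    c = getc(src);
    return (c == EOF) ? 0 : c;
}

/* Put c back so that the next my_input() call returns it again. */
void my_unput(int c)
{
    if (npushed < PUSHBACK_MAX)     /* silently drops c if the stack is full */
        pushback[npushed++] = c;
}
```

Because both routines share one pushback stack, a character handed to my_unput() is the next one my_input() returns, which is the relationship the lookahead depends on.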

A lex library routine that you can redefine is yywrap(), which is called whenever the scanner reaches the end of file. If yywrap() returns 1, the scanner continues with normal wrapup on the end of input. To arrange for more input to arrive from a new source, redefine yywrap() to return 0 when more processing is required. The default yywrap() always returns 1.

Note that it is not possible to write a normal rule that recognizes end of file; the only access to that condition is through yywrap(). Unless a private version of input() is supplied, a file containing nulls cannot be handled because a value of 0 returned by input() is taken to be end of file.
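The yywrap() convention described above might be sketched in C as follows. This is only an illustration under stated assumptions: it supposes the scanner reads through the standard lex input stream yyin, and the queue sources[] and the helper add_source() are hypothetical names invented for the example. If you supply a private input(), it would have to read from the newly selected stream as well.

```c
#include <stdio.h>

#define MAX_SOURCES 8               /* hypothetical limit for this sketch */

FILE *yyin;                         /* defined here only to keep the sketch self-contained;
                                       a lex-generated scanner already provides yyin */
FILE *sources[MAX_SOURCES];         /* hypothetical queue of further input streams */
int nsources = 0;
int next_source = 0;

/* Queue another stream to be scanned after the current one is exhausted. */
int add_source(FILE *fp)
{
    if (fp == NULL || nsources >= MAX_SOURCES)
        return -1;
    sources[nsources++] = fp;
    return 0;
}

/* Called at end of file: 0 means more input is available, 1 means wrap up. */
int yywrap(void)
{
    if (next_source < nsources) {
        yyin = sources[next_source++];
        return 0;
    }
    return 1;
}
```

With no streams queued, this version behaves like the default and returns 1 immediately.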

lex routines that let you handle sequences of characters to be processed in more than one way include yymore(), yyless(n), and REJECT. Recall that the text that matches a given specification is stored in the array yytext[]. In general, once the action is performed for the specification, the characters in yytext[] are overwritten with succeeding characters in the input stream to form the next match.

The function yymore(), by contrast, ensures that the succeeding characters recognized are appended to those already in yytext[]. This lets you do things sequentially, such as when one string of characters is significant and a longer one that includes the first is significant as well.

Consider a language that defines a string as a set of characters between double quotation marks and specifies that to include a double quotation mark in a string, it must be preceded by a backslash. The regular expression matching that is somewhat confusing, so it might be preferable to write:

\"[^"]* { 
     if (yytext[yyleng-1] == '\\')   /* match ends in a backslash: the next quote is escaped */
         yymore();
     else
         ... normal processing
     }

When faced with the string "abc\"def", the scanner first matches the characters "abc\. Then the call to yymore() causes the next part of the string, "def, to be appended to the end. The double quotation mark terminating the string is picked up in the code labeled "normal processing."

With the function yyless(n) you can specify the number of matched characters on which an action is to be performed: only the first n characters of the expression are retained in yytext[]. The remaining characters are returned to the input, so subsequent processing resumes at character n + 1 of the original match.

Suppose you are deciphering code, and working with only half the characters in a sequence that ends with a certain one, say an uppercase or lowercase Z. You could write:

[a-yA-Y]+[Zz] { yyless(yyleng/2);
                ...  process first half of string
                ...
              }
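To make the effect of yyless(n) concrete, here is a small stand-alone C model of it. It is not the lex implementation: yytext and yyleng are local stand-ins for the scanner's variables, and model_match(), model_yyless(), model_next(), and the pending[] stack are hypothetical names used only for this sketch.

```c
#include <string.h>

#define BUF_MAX 256

char yytext[BUF_MAX];               /* stand-in for the lex match buffer */
int  yyleng;                        /* stand-in for the lex match length */
char pending[BUF_MAX];              /* hypothetical pushback stack, newest last */
int  npending = 0;

/* Record a match, as the scanner would before running an action. */
void model_match(const char *s)
{
    strncpy(yytext, s, BUF_MAX - 1);
    yytext[BUF_MAX - 1] = '\0';
    yyleng = (int)strlen(yytext);
}

/* Keep the first n matched characters and return the rest to the input. */
void model_yyless(int n)
{
    int i;

    if (n < 0 || n > yyleng)
        return;
    /* Push in reverse so the characters are reread in their original order. */
    for (i = yyleng - 1; i >= n && npending < BUF_MAX; i--)
        pending[npending++] = yytext[i];
    yytext[n] = '\0';
    yyleng = n;
}

/* Next character waiting to be rescanned, or 0 if none. */
int model_next(void)
{
    return (npending > 0) ? pending[--npending] : 0;
}
```

For a six-character match such as abcdeZ, calling model_yyless(yyleng / 2) leaves the first three characters in yytext and queues the other three for rescanning in their original order.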

Finally, with the REJECT function, you can more easily process strings of characters even when they overlap or contain one another as parts. REJECT does this by immediately jumping to the next rule and its specification without changing the contents of yytext[]. To count the number of occurrences both of the regular expression snapdragon and of its subexpression dragon in an input text, the following works:

snapdragon      {countflowers++; REJECT;}
dragon          countmonsters++;

As an example of one pattern overlapping another, the following counts the number of occurrences of the expressions comedian and diana, even where the input text has sequences such as comediana:

comedian        {comiccount++; REJECT;}
diana           princesscount++;

The actions here can be considerably more complicated than incrementing a counter. In all cases, you declare the counters and other necessary variables in the definitions section at the beginning of the lex specification.
