Programming Utilities Guide

Advanced lex Features

You can process input text containing complicated patterns by using a suite of features provided by lex. These include rules that decide which specification applies when more than one matches, functions that let you adjust or reprocess the matched text, and the use of definitions and subroutines.

Here is an example that draws together several of the points already covered:

%%
-[0-9]+             printf("negative integer");
\+?[0-9]+           printf("positive integer");
-0.[0-9]+           printf("negative fraction, no whole number part");
rail[ \t]+road      printf("railroad is one word");
crook               printf("Here's a crook");
function            subprogcount++;
G[a-zA-Z]*          { printf("may have a G word here:%s", yytext);
                      Gstringcount++; }

The first three rules recognize negative integers, positive integers, and negative fractions between 0 and -1. The terminating + in each specification ensures that one or more digits compose the number in question.

Each of the remaining rules recognizes a specific pattern: rail[ \t]+road matches railroad even when blanks or tabs separate its two syllables; crook matches only that literal string; function counts occurrences of the word; and the last rule matches any word beginning with G, prints it, and counts it.

Some Special Features

Besides storing the matched input text in yytext[], the scanner automatically counts the number of characters in a match and stores it in the variable yyleng. You can use this variable to refer to any specific character just placed in the array yytext[].

Remember that C language array indexes start with 0, so to print the third digit (if there is one) in a just-recognized integer, you might enter:

[1-9]+      { if (yyleng > 2)
                  printf("%c", yytext[2]);
            }

lex follows a number of high-level rules to resolve ambiguities that might arise from the set of rules that you write. In the following lexical analyzer example, the ``reserved word'' end could match the second rule as well as the eighth, the one for identifiers:

begin                   return(BEGIN);
end                     return(END);
while                   return(WHILE);
if                      return(IF);
package                 return(PACKAGE);
reverse                 return(REVERSE);
loop                    return(LOOP);
[a-zA-Z][a-zA-Z0-9]*    { tokval = put_in_tabl();
                          return(IDENTIFIER); }
[0-9]+                  { tokval = put_in_tabl();
                          return(INTEGER); }
\+                      { tokval = PLUS;
                          return(ARITHOP); }
\-                      { tokval = MINUS;
                          return(ARITHOP); }
>                       { tokval = GREATER;
                          return(RELOP); }
>=                      { tokval = GREATEREQL;
                          return(RELOP); }

lex follows the rule that, where there is a match with two or more rules in a specification, the first rule is the one whose action is executed. Placing the rule for end and the other reserved words before the rule for identifiers ensures that the reserved words are recognized.

Another potential problem arises from cases where one pattern you are searching for is the prefix of another. For instance, the last two rules in the lexical analyzer example above are designed to recognize > and >=.

lex follows the rule that it matches the longest character string possible and executes the rule for that string. If the text has the string >= at some point, the scanner recognizes the >= and acts accordingly, instead of stopping at the > and executing the > rule. This rule also distinguishes + from ++ in a C program.

When the analyzer must read characters beyond the string you are seeking, use trailing context. The classic example is the DO statement in FORTRAN. In the following DO statement, the first 1 looks like the initial value of the index k until the first comma is read:

DO 50 k = 1 , 20, 1 

Until then, this looks like the assignment statement:

DO50k = 1 

Remember that FORTRAN ignores all blanks. Use the slash, /, to signify that what follows is trailing context, something not to be stored in yytext[], because the slash is not part of the pattern itself.

The rule to recognize the FORTRAN DO statement could be:

DO/([ ]*[0-9]+[ ]*[a-zA-Z0-9]+=[a-zA-Z0-9]+,) { 
	       printf("found DO");
        }

Different versions of FORTRAN place different limits on the size of an identifier, here the index name; to simplify the example, this rule accepts an index name of any length. See the "Start Conditions" section for a discussion of the similar handling of prior context.

lex uses the $ symbol as an operator to mark a special trailing context -- the end of a line. An example would be a rule to ignore all blanks and tabs at the end of a line:

[ \t]+$ ; 

The previous example could also be written as:

[ \t]+/\n ; 

To match a pattern only when it starts a line or a file, use the ^ operator. Suppose a text-formatting program requires that you not start a line with a blank. You could check input to the program with the following rule:

^[ ]		  printf("error: remove leading blank"); 

Notice the difference in meaning when the ^ operator appears inside the left bracket: there it complements the character class, so [^a-z], for example, matches any character that is not a lowercase letter.

lex Routines

The macros input(), unput(c), and output(c) enable you to perform special I/O actions: input() returns the next character from the input stream, unput(c) puts a character back on the input stream to be read later, and output(c) writes a character on the output stream.

One way to ignore all characters between two special characters, such as between a pair of double quotation marks, is to use input() like this:

\"		    while (input() != '"'); 

After the first double quotation mark, the scanner reads all subsequent characters, and does not look for a match, until it reads the second double quotation mark. (See the further examples of input() and unput(c) usage in the "User Routines" section.)

For special I/O needs that are not covered by these default macros, such as writing to several files, use standard I/O routines in C to rewrite the macro functions.

These routines, however, must be modified consistently. In particular, the character set used must be consistent in all routines, and a value of 0 returned by input() must mean end of file. The relationship between input() and unput(c) must be maintained or the lex lookahead will not work.

If you do provide your own input(), output(c), or unput(c), write a #undef input and so on in your definitions section first:

#undef input
#undef output
  . . .
#define input() ...  etc.
more declarations 
  . . .

Your new routines will replace the standard ones. See the "Definitions" section for further details.

A lex library routine that you can redefine is yywrap(), which is called whenever the scanner reaches the end of file. If yywrap() returns 1, the scanner continues with normal wrapup on the end of input. To arrange for more input to arrive from a new source, redefine yywrap() to return 0 when more processing is required. The default yywrap() always returns 1.

Note that it is not possible to write a normal rule that recognizes end of file; the only access to that condition is through yywrap(). Unless a private version of input() is supplied, a file containing nulls cannot be handled because a value of 0 returned by input() is taken to be end of file.

lex routines that let you handle sequences of characters to be processed in more than one way include yymore(), yyless(n), and REJECT. Recall that the text that matches a given specification is stored in the array yytext[]. In general, once the action is performed for the specification, the characters in yytext[] are overwritten with succeeding characters in the input stream to form the next match.

The function yymore(), by contrast, ensures that the succeeding characters recognized are appended to those already in yytext[]. This lets you do things sequentially, such as when one string of characters is significant and a longer one that includes the first is significant as well.

Consider a language that defines a string as a set of characters between double quotation marks and specifies that to include a double quotation mark in a string, it must be preceded by a backslash. The regular expression matching that is somewhat confusing, so it might be preferable to write:

\"[^"]* {
     if (yytext[yyleng-1] == '\\')
         yymore();
     else
         ... normal processing
     }

When faced with the string "abc\"def", the scanner first matches the characters "abc\. Then the call to yymore() causes the next part of the string "def to be tacked on the end. The double quotation mark terminating the string is picked up in the code labeled ``normal processing.''

With the function yyless(n) you can specify the number of matched characters on which an action is to be performed: only the first n characters of the expression are retained in yytext[]. Subsequent processing resumes at the (n + 1)th character.

Suppose you are deciphering code, and working with only half the characters in a sequence that ends with a certain one, say an upper or lowercase Z. You could write:

[a-yA-Y]+[Zz]   { yyless(yyleng/2);
                  ... process first half of string
                  ...
                }

Finally, with the REJECT function, you can more easily process strings of characters even when they overlap or contain one another as parts. REJECT does this by immediately jumping to the next rule and its specification without changing the contents of yytext[]. To count the number of occurrences both of the regular expression snapdragon and of its subexpression dragon in an input text, the following works:

snapdragon      {countflowers++; REJECT;}
dragon          countmonsters++;

As an example of one pattern overlapping another, the following counts the number of occurrences of the expressions comedian and diana, even where the input text has sequences such as comediana:

comedian        {comiccount++; REJECT;}
diana           princesscount++;

The actions here can be considerably more complicated than incrementing a counter. In all cases, you declare the counters and other necessary variables in the definitions section at the beginning of the lex specification.

Definitions

The lex definitions section can contain any of several classes of items. The most critical are external definitions, preprocessor statements like #include, and abbreviations. For legal lex source this section is optional, but in most cases some of these items are necessary. Preprocessor statements and C source code appear between a line of the form %{ and one of the form %}.

All lines between these delimiters -- including those that begin with white space -- are copied to lex.yy.c immediately before the definition of yylex(). (Lines in the definition section that are not enclosed by the delimiters are copied to the same place provided they begin with white space.)

The definitions section is where you usually place C definitions of objects accessed by actions in the rules section or by routines with external linkage.

For example, when using lex with yacc, which generates parsers that call a lexical analyzer, include the file y.tab.h, which can contain #defines for token names:

%{ 
#include "y.tab.h"
extern int tokval; 
int lineno;
%} 

After the %} that ends your #includes and declarations, place your abbreviations for regular expressions in the definitions section. The abbreviation appears on the left of the line and, separated by one or more spaces, its definition or translation appears on the right.

When you later use abbreviations in your rules, be sure to enclose them within braces. Abbreviations avoid repetition in writing your specifications and make them easier to read.

As an example, reconsider the lex source reviewed in the section "Advanced lex Features". Using definitions simplifies later reference to digits, letters, and blanks.

This is especially true when the specifications appear several times:

D               [0-9]
L               [a-zA-Z]
B               [ \t]+
%%
-{D}+           printf("negative integer");
\+?{D}+         printf("positive integer");
-0.{D}+         printf("negative fraction");
G{L}*           printf("may have a G word here");
rail{B}road     printf("railroad is one word");
crook           printf("criminal");
 ...
 ...

Start Conditions

Start conditions provide greater sensitivity to prior context than is afforded by the ^ operator alone. You might want to apply different rules to an expression depending on a prior context that is more complex than the end of a line or the start of a file.

In this situation you could set a flag to mark the change in context that is the condition for the application of a rule, then write code to test the flag. Alternatively, you could define for lex the different ``start conditions'' under which it is to apply each rule.
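As a sketch of the flag alternative, the ``magic'' example that closes this section could be handled with an explicit state variable instead of start conditions (the variable name flag is an assumption made for illustration):

```lex
%{
int flag;
%}
%%
^a      {flag = 'a'; ECHO;}
^b      {flag = 'b'; ECHO;}
^c      {flag = 'c'; ECHO;}
\n      {flag = 0; ECHO;}
magic   {switch (flag)
         {
         case 'a': printf("first"); break;
         case 'b': printf("second"); break;
         case 'c': printf("third"); break;
         default: ECHO; break;
         }
        }
```

Testing a flag by hand like this becomes awkward as the number of contexts grows; start conditions let lex do the bookkeeping.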

Consider this problem: copy the input to the output, except that the word magic is changed to first on any line beginning with the letter a, to second on any line beginning with b, and to third on any line beginning with c.

To handle the same problem with start conditions, each start condition must be introduced to lex in the definitions section with a line, such as the following one, where the conditions can be named in any order:

%Start name1 name2 ...  

The word Start can be abbreviated to S or s. The conditions are referenced at the head of a rule with <> brackets. So the following is a rule that is recognized only when the scanner is in start condition name1:

<name1>expression 

To enter a start condition, execute the following action statement:

BEGIN name1; 

The above statement changes the start condition to name1. To resume the normal state, use the following:

BEGIN 0;

This resets the initial condition of the scanner.

A rule can be active in several start conditions. For example, the following is a legal prefix:

<name1,name2,name3> 

Any rule not beginning with the <> prefix operators is always active.

The example can be written with start conditions as follows:

%Start AA BB CC 
%% 
^a			{ECHO; BEGIN AA;} 
^b			{ECHO; BEGIN BB;} 
^c			{ECHO; BEGIN CC;} 
\n			{ECHO; BEGIN 0;} 
<AA>magic			printf("first"); 
<BB>magic			printf("second"); 
<CC>magic			printf("third"); 

User Routines

You can use your lex routines in the same ways you use routines in other programming languages. Action code used for several rules can be written once and called when needed. As with definitions, this simplifies program writing and reading.

The put_in_tabl() function, discussed in the "Using lex and yacc Together" section, fits well in the user routines section of a lex specification.

Another reason to place a routine in this section is to highlight some code of interest or to simplify the rules section, even if the code is to be used for one rule only. As an example, consider the following routine to ignore comments in a language like C where comments occur between /* and */:

%{
static void skipcmnts(void);
%}
%%
"/*"            skipcmnts();
 ...
 ...            /* rest of rules */
%%
static void
skipcmnts(void)
{
    for (;;)
    {
        while (input() != '*')
            ;

        if (input() != '/')
            unput(yytext[yyleng-1]);
        else
            return;
    }
}

There are three points of interest in this example.