Programming Utilities Guide

Some Special Features

Besides storing the matched input text in yytext[], the scanner automatically counts the number of characters in a match and stores it in the variable yyleng. You can use this variable to refer to any specific character just placed in the array yytext[].

Remember that C language array indexes start with 0, so to print the third digit (if there is one) in a just-recognized integer, you might enter:

[0-9]+     { if (yyleng > 2)
                 printf("%c", yytext[2]);
           }

lex follows a number of high-level rules to resolve ambiguities that might arise from the set of rules that you write. In the following lexical analyzer example, the "reserved word" end could match the second rule as well as the eighth, the one for identifiers:

begin                   return(BEGIN);
end                     return(END);
while                   return(WHILE);
if                      return(IF);
package                 return(PACKAGE);
reverse                 return(REVERSE);
loop                    return(LOOP);
[a-zA-Z][a-zA-Z0-9]*    { tokval = put_in_tabl();
                          return(IDENTIFIER); }
[0-9]+                  { tokval = put_in_tabl();
                          return(INTEGER); }
\+                      { tokval = PLUS;
                          return(ARITHOP); }
\-                      { tokval = MINUS;
                          return(ARITHOP); }
>                       { tokval = GREATER;
                          return(RELOP); }
>=                      { tokval = GREATEREQL;
                          return(RELOP); }

lex follows the rule that, where there is a match with two or more rules in a specification, the first rule is the one whose action is executed. Placing the rule for end and the other reserved words before the rule for identifiers ensures that the reserved words are recognized.

Another potential problem arises from cases where one pattern you are searching for is the prefix of another. For instance, the last two rules in the lexical analyzer example above are designed to recognize > and >=.

lex follows the rule that it matches the longest character string possible and executes the rule for that string. If the text has the string >= at some point, the scanner recognizes the >= and acts accordingly, instead of stopping at the > and executing the > rule. This rule also distinguishes + from ++ in a C program.
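
The two disambiguation rules can be pictured with a small C sketch. It is not the code that lex generates; it only mimics the selection logic for four of the rules above, under the assumption that each rule is represented by a hand-written matching function. All names in it (matcher_fn, select_rule, RULE_END, and so on) are hypothetical and chosen for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each hypothetical rule reports how many leading characters of s
 * it matches; 0 means no match. */
typedef size_t (*matcher_fn)(const char *s);

static size_t match_literal(const char *s, const char *lit)
{
    size_t n = strlen(lit);
    return strncmp(s, lit, n) == 0 ? n : 0;
}

static size_t match_end(const char *s) { return match_literal(s, "end"); }
static size_t match_gt(const char *s)  { return match_literal(s, ">"); }
static size_t match_ge(const char *s)  { return match_literal(s, ">="); }

/* Stands in for [a-zA-Z][a-zA-Z0-9]* (assumes the C locale). */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

/* Rules listed in the same relative order as in the specification. */
enum { RULE_END, RULE_IDENTIFIER, RULE_GT, RULE_GE, RULE_COUNT };
static const matcher_fn rules[RULE_COUNT] = {
    match_end, match_identifier, match_gt, match_ge
};

/* Returns the index of the selected rule, or -1 if no rule matches.
 * The length of the match is stored in *len. */
int select_rule(const char *s, size_t *len)
{
    int best = -1;
    size_t best_len = 0;
    int i;

    for (i = 0; i < RULE_COUNT; i++) {
        size_t n = rules[i](s);
        /* Only a strictly longer match replaces the current choice,
         * so a tie is resolved in favor of the earlier rule. */
        if (n > best_len) {
            best = i;
            best_len = n;
        }
    }
    *len = best_len;
    return best;
}
```

With this sketch, the input end; selects RULE_END because the keyword and identifier matches tie at three characters and the keyword rule comes first, while endless selects RULE_IDENTIFIER because the identifier match is longer. Likewise, >= selects RULE_GE even though the rule for > appears earlier.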

When the analyzer must read characters beyond the string you are seeking, use trailing context. The classic example is the DO statement in FORTRAN. In the following DO statement, the first 1 cannot be recognized as the initial value of the index k until the first comma is read:

DO 50 k = 1 , 20, 1 

Until then, this looks like the assignment statement:

DO50k = 1 

Remember that FORTRAN ignores all blanks. Use the slash, /, to signify that what follows is trailing context: text that must be present for the rule to match but that is not stored in yytext[]. The slash itself is not part of the pattern.

The rule to recognize the FORTRAN DO statement could be:

DO/([ ]*[0-9]+[ ]*[a-zA-Z0-9]+[ ]*=[ ]*[a-zA-Z0-9]+[ ]*,)    {
               printf("found DO");
           }

While different versions of FORTRAN limit the size of identifiers (here, the index name), this rule simplifies the example by accepting an index name of any length. See the "Start Conditions" section for a discussion of the similar handling of prior context.
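
To make the effect of trailing context concrete, the following C sketch mimics what a DO rule of this kind does: it examines the characters after DO, but reports only the two characters of DO as consumed. It is an illustration, not the code lex generates. The function names (match_do, skip_blanks, skip_run) are hypothetical, the sketch assumes that blanks may appear between any of the items in the lookahead, and it scans greedily without the backtracking a real regular-expression matcher performs.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Skip zero or more blanks. */
static const char *skip_blanks(const char *p)
{
    while (*p == ' ')
        p++;
    return p;
}

/* Skip one or more characters accepted by test();
 * return NULL if there is not at least one. */
static const char *skip_run(const char *p, int (*test)(int))
{
    const char *start = p;

    while (test((unsigned char)*p))
        p++;
    return p > start ? p : NULL;
}

/* Returns the number of characters consumed as the token
 * (2, for "DO") when the trailing context is present, else 0. */
size_t match_do(const char *s)
{
    const char *p;

    if (strncmp(s, "DO", 2) != 0)
        return 0;

    /* Everything from here on is lookahead only. */
    p = skip_blanks(s + 2);
    if ((p = skip_run(p, isdigit)) == NULL)    /* statement label */
        return 0;
    p = skip_blanks(p);
    if ((p = skip_run(p, isalnum)) == NULL)    /* index name */
        return 0;
    p = skip_blanks(p);
    if (*p != '=')
        return 0;
    p = skip_blanks(p + 1);
    if ((p = skip_run(p, isalnum)) == NULL)    /* initial value */
        return 0;
    p = skip_blanks(p);
    if (*p != ',')                             /* the deciding comma */
        return 0;

    /* Only "DO" is consumed; the lookahead stays in the input
     * to be scanned again by later rules. */
    return 2;
}
```

Called on DO 50 k = 1 , 20, 1 the function returns 2, whereas on DO50k = 1 it returns 0, because no comma follows the initial value and the text must be treated as an assignment.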

lex uses the $ symbol as an operator to mark a special trailing context -- the end of a line. An example would be a rule to ignore all blanks and tabs at the end of a line:

[ \t]+$ ; 

The previous example could also be written as:

[ \t]+/\n ; 
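
Because lex copies unmatched input to the output by default, a specification containing only this rule behaves as a filter that deletes blanks and tabs in front of each newline. The C sketch below shows the same effect on a string in memory. It is a hand-written equivalent for illustration only, and the function name strip_trailing_blanks is hypothetical.

```c
#include <stddef.h>

/* Removes, in place, every run of blanks and tabs that immediately
 * precedes a newline. Returns the length of the resulting string. */
size_t strip_trailing_blanks(char *s)
{
    char *start = s;
    char *out = s;     /* next position to write */
    char *keep = s;    /* position just past the last non-blank
                          character written on the current line */

    for (; *s != '\0'; s++) {
        if (*s == '\n') {
            out = keep;        /* discard the blanks before the newline */
            *out++ = '\n';
            keep = out;
        } else {
            *out++ = *s;
            if (*s != ' ' && *s != '\t')
                keep = out;
        }
    }
    *out = '\0';
    return (size_t)(out - start);
}
```

Blanks at the very end of the string that are not followed by a newline are left alone, which mirrors the requirement that the trailing context be a newline character.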

To match a pattern only when it starts a line or a file, use the ^ operator. Suppose a text-formatting program requires that you not start a line with a blank. You could check input to the program with the following rule:

^[ ]       printf("error: remove leading blank");

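Notice the difference in meaning when the ^ operator appears immediately inside the left bracket of a character class: there it does not anchor the pattern to the start of a line but instead denotes the complement of the class, that is, any character not listed.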