Programming Utilities Guide

Using `lex` and `yacc` Together

If you work on a compiler project or develop a program to check the validity of an input language, you might want to use the system tool yacc (Chapter 3, yacc -- A Compiler Compiler). yacc generates parsers, programs that analyze input to ensure that it is syntactically correct.

lex and yacc often work well together for developing compilers.

As noted, a program uses the lex-generated scanner by repeatedly calling the function yylex(). This name is convenient because a yacc-generated parser calls its lexical analyzer with this name.

To use lex to create the lexical analyzer for a compiler, end each lex action with the statement return token, where token is a defined term with an integer value.

The integer value of the token returned indicates to the parser what the lexical analyzer has found. The parser, called yyparse() by yacc, then resumes control and makes another call to the lexical analyzer to get another token.
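The calling pattern can be sketched in plain C. This is a minimal illustration, not generated code: yylex() here is a hypothetical stand-in that replays a fixed token sequence, the token values are invented, and drain_tokens() is an illustrative name for the parser's side of the loop. The one convention taken from lex itself is that yylex() returns 0 at end of input.

```c
#include <stddef.h>

/* Hypothetical token codes; with yacc they would come from y.tab.h. */
enum { WHILE = 258, IDENTIFIER = 259, RELOP = 260, INTEGER = 261 };

/* Canned token stream standing in for real scanner input. */
static const int canned[] = { WHILE, IDENTIFIER, RELOP, INTEGER };
static size_t next_tok = 0;

/* Stand-in for a lex-generated scanner: one token per call,
   then 0 once the input is exhausted. */
int yylex(void)
{
    if (next_tok < sizeof canned / sizeof canned[0])
        return canned[next_tok++];
    return 0;
}

/* Stand-in for the parser's side of the handshake: keep asking
   for tokens until yylex() reports end of input. */
int drain_tokens(void)
{
    int count = 0;

    while (yylex() != 0)
        count++;
    return count;
}
```

A real yyparse() does more than count, of course; the point is only that control passes back and forth one token at a time.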

In a compiler, the different values of the token indicate what, if any, reserved word of the language has been found or whether an identifier, constant, arithmetic operator, or relational operator has been found. In the latter cases, the analyzer must also specify the exact value of the token: what the identifier is, whether the constant is, say, 9 or 888, whether the operator is + or *, and whether the relational operator is = or >.

Consider the following portion of lex source for a scanner that recognizes tokens in a "C-like" language:

Table 2-2 Sample lex Source Recognizing Tokens

begin                   return(BEGIN);
end                     return(END);
while                   return(WHILE);
if                      return(IF);
package                 return(PACKAGE);
reverse                 return(REVERSE);
loop                    return(LOOP);
[a-zA-Z][a-zA-Z0-9]*    { tokval = put_in_tabl();
                          return(IDENTIFIER); }
[0-9]+                  { tokval = put_in_tabl();
                          return(INTEGER); }
\+                      { tokval = PLUS;
                          return(ARITHOP); }
\-                      { tokval = MINUS;
                          return(ARITHOP); }
>                       { tokval = GREATER;
                          return(RELOP); }
>=                      { tokval = GREATEREQL;
                          return(RELOP); }

The tokens returned, and the values assigned to tokval, are integers. Good programming style suggests using informative terms such as BEGIN, END, and WHILE to signify the integers the parser understands, rather than using the integers themselves.

You establish the association by using #define statements in your parser, the C routine that calls the scanner. For example:

#define BEGIN 1 
#define END 2 
	... 
#define PLUS 7 
	...

Then, to change the integer for some token type, change the #define statement in the parser rather than change every occurrence of the particular integer.

To use yacc to generate your parser, insert the following statement in the definitions section of your lex source:

#include "y.tab.h"

The file y.tab.h, which is created when yacc is invoked with the -d option, provides #define statements that associate token names such as BEGIN and END with the integers of significance to the generated parser.

To indicate the reserved words in Table 2-2, the returned integer values suffice. For the other token types, the integer value is stored in the variable tokval.

This variable is globally defined so that the parser and the lexical analyzer can access it. yacc provides the variable yylval for the same purpose.
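To make the role of the shared variable concrete, here is a minimal hand-written yylex() in C that passes a constant's value through yylval. It is a sketch under stated assumptions: the INTEGER code, the input buffer, and set_input() are hypothetical, and yylval is defined locally only so that the fragment compiles on its own (a yacc-generated parser supplies it).

```c
#include <ctype.h>

/* Hypothetical token code; yacc would normally define it in y.tab.h. */
#define INTEGER 258

/* Defined here only so the sketch stands alone; a yacc-generated
   parser provides yylval itself. */
int yylval;

/* Hypothetical input buffer standing in for the scanner's input. */
static const char *input = "";

void set_input(const char *text)
{
    input = text;
}

/* Hand-written stand-in for a lex-generated yylex(): skips blanks,
   returns INTEGER with the number's value in yylval, returns any
   other character as its own code, and returns 0 at end of input. */
int yylex(void)
{
    while (*input == ' ' || *input == '\t')
        input++;
    if (*input == '\0')
        return 0;
    if (isdigit((unsigned char)*input)) {
        yylval = 0;
        while (isdigit((unsigned char)*input))
            yylval = yylval * 10 + (*input++ - '0');
        return INTEGER;
    }
    return (unsigned char)*input++;
}
```

After set_input("9 + 888"), successive calls return INTEGER with yylval set to 9, then the character code for +, then INTEGER with yylval set to 888, then 0. The token code tells the parser the class; yylval tells it which constant.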

Note that Table 2-2 shows two ways to assign a value to tokval.

First, a function put_in_tabl() places the name and type of the identifier or constant in a symbol table so that the compiler can refer to it.

More to the present point, put_in_tabl() assigns a type value to tokval so that the parser can use the information immediately to determine the syntactic correctness of the input text. The function put_in_tabl() is a routine that the compiler writer might place in the user routines section of the parser.
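One possible shape for such a routine is sketched below. The text does not give the body of put_in_tabl(), so everything here is an assumption: the table is a fixed-size array of names, type information is omitted, the value returned (and so assigned to tokval) is simply the entry's index, and yytext is declared locally as a stand-in for the matched text that lex provides.

```c
#include <string.h>

#define MAX_SYMBOLS 64   /* hypothetical table capacity */
#define MAX_NAME    32   /* hypothetical name limit, including the NUL */

/* Stand-in for the scanner's matched text; lex normally provides yytext. */
const char *yytext = "";

static char symtab[MAX_SYMBOLS][MAX_NAME];
static int  nsymbols = 0;

/* Looks up the matched text, adding it on first sight.
   Returns the entry's index, or -1 if the table is full.
   Names are compared and stored up to MAX_NAME - 1 characters. */
int put_in_tabl(void)
{
    int i;

    for (i = 0; i < nsymbols; i++)
        if (strncmp(symtab[i], yytext, MAX_NAME - 1) == 0)
            return i;
    if (nsymbols == MAX_SYMBOLS)
        return -1;
    strncpy(symtab[nsymbols], yytext, MAX_NAME - 1);
    symtab[nsymbols][MAX_NAME - 1] = '\0';
    return nsymbols++;
}
```

With this sketch, a name seen for the second time yields the same index as the first time, so the parser can treat the value in tokval as a stable handle for that identifier or constant.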

Second, in the last few actions of the example, tokval is assigned a specific integer indicating which arithmetic or relational operator the scanner recognized.

If the name PLUS, for instance, is associated with the integer 7 by means of the #define statement above, then when a + is recognized, the action assigns to tokval the value 7, which indicates the +.

The scanner indicates the general class of operator by the value it returns to the parser (that is, the integer signified by ARITHOP or RELOP).
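The two-level scheme, a class in the return value and the specific operator in tokval, might be consumed on the parser side roughly as follows. This is an illustrative sketch only: apply_operator() is a hypothetical helper, and all integer values except PLUS (7, from the #define example above) are invented.

```c
/* Token classes returned by the scanner (hypothetical values). */
enum { ARITHOP = 300, RELOP = 301 };

/* Specific operators stored in tokval. PLUS is 7 as in the #define
   example; the other values are hypothetical. */
enum { PLUS = 7, MINUS = 8, GREATER = 9, GREATEREQL = 10 };

/* Shared between scanner and parser, as in Table 2-2. */
int tokval;

/* Uses the token class first, then tokval, to decide which
   operation to perform. Returns 0 for an unrecognized pair. */
int apply_operator(int token, int lhs, int rhs)
{
    if (token == ARITHOP) {
        switch (tokval) {
        case PLUS:  return lhs + rhs;
        case MINUS: return lhs - rhs;
        }
    } else if (token == RELOP) {
        switch (tokval) {
        case GREATER:    return lhs > rhs;
        case GREATEREQL: return lhs >= rhs;
        }
    }
    return 0;
}
```

Grouping operators into classes keeps the grammar small: the parser needs one rule per class, and only the action code has to look at tokval.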

When using lex with yacc, either can be run first. The following command generates a parser in the file y.tab.c:
```
$ yacc -d grammar.y
```

As noted, the -d option creates the file y.tab.h, which contains the #define statements that associate the yacc-assigned integer token values with the user-defined token names. Now you can invoke lex with the following command:

$ lex lex.l

You can then compile and link the output files with the command:

$ cc lex.yy.c y.tab.c -ly -ll

The yacc library is loaded with the -ly option before the lex library with the -ll option to ensure that the supplied main() calls the yacc parser.

To use yacc with CC, particularly when routines such as yyback(), yywrap(), and yylook() in .l files are to be extern "C" functions, the command line must include the following:

$ CC -D__EXTERN_C__ ... filename
