Programming Utilities Guide

Basic Specifications

Names refer to either tokens or nonterminal symbols. yacc requires token names to be declared as such. While the lexical analyzer might be included as part of the specification file, it is perhaps more in keeping with modular design to keep it as a separate file. Like the lexical analyzer, other subroutines can be included as well.

Thus, every specification file theoretically consists of three sections: the declarations, (grammar) rules, and subroutines. The sections are separated by double percent signs (%%; the percent sign is generally used in yacc specifications as an escape character).

When all sections are used, a full specification file looks like:

declarations 
%% 
rules 
%% 
subroutines

The declarations and subroutines sections are optional. The smallest legal yacc specification might be:

%% 
S:;

Blanks, tabs, and newlines are ignored, but they cannot appear in names or multicharacter reserved symbols. Comments can appear wherever a name is legal. They are enclosed in /* and */, as in the C language.

The rules section is made up of one or more grammar rules. A grammar rule has the form:

A: BODY ;

where A represents a nonterminal symbol, and BODY represents a sequence of zero or more names and literals. The colon and the semicolon are yacc punctuation.

Names can be of any length and can consist of letters, periods, underscores, and digits, although a digit cannot be the first character of a name. Uppercase and lowercase letters are distinct. The names used in the body of a grammar rule can represent tokens or nonterminal symbols.
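
The naming rule above can be expressed as a small check. The following is a minimal sketch in C; the function is_yacc_name() is hypothetical and is not part of yacc. It assumes the default C locale, so that letters and digits mean the ASCII ones.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of yacc: returns 1 if s is a legal
   name under the rule described above, and 0 otherwise. */
int is_yacc_name(const char *s)
{
    size_t i;

    assert(s != NULL);
    if (s[0] == '\0' || isdigit((unsigned char)s[0]))
        return 0;               /* empty, or begins with a digit */
    for (i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isalnum(c) && c != '.' && c != '_')
            return 0;           /* character outside the permitted set */
    }
    return 1;
}
```

Under this check, expr and name.1_x are accepted, while 9lives (leading digit) and a-b (hyphen) are rejected.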

A literal consists of a character enclosed in single quotes. As in the C language, the backslash is an escape character within literals. yacc recognizes all C language escape sequences. For a number of technical reasons, the null character should never be used in grammar rules.

If there are several grammar rules with the same left-hand side, the vertical bar can be used to avoid rewriting the left-hand side. In addition, the semicolon at the end of a rule is dropped before a vertical bar.

Thus the grammar rules:

A : B C D ; 
A : E F ; 
A : G ;

can be given to yacc as:

A    : B C D
     | E F
     | G
     ;

by using the vertical bar. It is not necessary that all grammar rules with the same left side appear together in the grammar rules section although it makes the input more readable and easier to change.

If a nonterminal symbol matches the empty string, this can be indicated by:

epsilon : ;

The blank space following the colon is understood by yacc to be the empty string, so the nonterminal symbol named epsilon matches no input.

Names representing tokens must be declared. This is most simply done by writing:

%token name1 name2 name3

and so on in the declarations section. Every name not defined in the declarations section is assumed to represent a nonterminal symbol. Every nonterminal symbol must appear on the left side of at least one rule.

Of all the nonterminal symbols, the start symbol has particular importance. By default, the symbol is taken to be the left-hand side of the first grammar rule in the rules section. It is possible and desirable to declare the start symbol explicitly in the declarations section using the %start keyword:

%start symbol

The end of the input to the parser is signaled by a special token, called the end-marker. The end-marker is represented by either a zero or a negative number.

If the tokens up to but not including the end-marker form a construct that matches the start symbol, the parser function returns to its caller after the end-marker is seen and accepts the input. If the end-marker is seen in any other context, it is an error.

It is the job of the user-supplied lexical analyzer to return the end-marker when appropriate. Usually the end-marker represents some reasonably obvious I/O status, such as end of file or end of record.

Actions

With each grammar rule, you can associate actions to be performed when the rule is recognized. Actions can return values and can obtain the values returned by previous actions. Moreover, the lexical analyzer can return values for tokens, if desired.

An action is an arbitrary C-language statement and as such can do input and output, call subroutines, and alter arrays and variables. An action is specified by one or more statements enclosed in { and }. For example, the following are two grammar rules with actions:

A    : '(' B ')'
         {
             hello( 1, "abc" );
         }

and

XXX  : YYY ZZZ
         {
             (void) printf("a message\n");
             flag = 25;
         }

The $ symbol is used to facilitate communication between the actions and the parser. The pseudo-variable $$ represents the value returned by the complete action.

For example, the action:

{ $$ = 1; }

returns the value of one; in fact, that's all it does.

To obtain the values returned by previous actions and the lexical analyzer, the action can use the pseudo-variables $1, $2, ... $n. These refer to the values returned by components 1 through n of the right side of a rule, with the components being numbered from left to right. If the rule is

A	     : B C D ;

then $2 has the value returned by C, and $3 the value returned by D. The following rule provides a common example:

expr	   : '(' expr ')' ;

You would expect the value returned by this rule to be the value of the expr within the parentheses. Since the first component of the rule is the literal left parenthesis, the desired logical result can be indicated by:

expr : '(' expr ')'
         {
             $$ = $2 ;
         }

By default, the value of a rule is the value of the first element in it ($1). Thus, grammar rules of the following form frequently need not have an explicit action:

A : B ;
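
One way to picture the pseudo-variables is as positions on a stack of values. The following C sketch illustrates the idea only; it is not the code that yacc generates, and the names val_stack, push_value(), dollar(), and reduce() are hypothetical. It assumes integer values, as in the examples so far.

```c
#include <assert.h>

#define STACK_MAX 64            /* arbitrary capacity for the sketch */

int val_stack[STACK_MAX];
int val_top = 0;                /* number of values currently on the stack */

/* Push the value of a token or of a reduced nonterminal. */
void push_value(int v)
{
    assert(val_top < STACK_MAX);
    val_stack[val_top++] = v;
}

/* $k for a rule with n components: the k-th of the top n values. */
int dollar(int n, int k)
{
    assert(1 <= k && k <= n && n <= val_top);
    return val_stack[val_top - n + (k - 1)];
}

/* Finish a reduction: replace the n component values with $$. */
void reduce(int n, int result)
{
    assert(0 <= n && n <= val_top);
    val_top -= n;
    push_value(result);
}

/* expr : '(' expr ')'  { $$ = $2; } */
void reduce_paren_expr(void)
{
    reduce(3, dollar(3, 2));
}

/* A : B ;  with no action, so the default $$ = $1 applies. */
void reduce_default(void)
{
    reduce(1, dollar(1, 1));
}
```

After the values for '(', an expr worth 42, and ')' have been pushed, reduce_paren_expr() leaves the single value 42 on the stack; reduce_default() then leaves it unchanged.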

In previous examples, all the actions came at the end of rules. Sometimes, it is desirable to get control before a rule is fully parsed. yacc permits an action to be written in the middle of a rule as well as at the end.

This action is assumed to return a value accessible through the usual $ mechanism by the actions to the right of it. In turn, it can access the values returned by the symbols to its left. Thus, in the rule below, the effect is to set x to 1 and y to the value returned by C:

A    : B
         {
             $$ = 1;
         }
       C
         {
             x = $2;
             y = $3;
         }
     ;

Actions that do not terminate a rule are handled by yacc by manufacturing a new nonterminal symbol name and a new rule matching this name to the empty string. The interior action is the action triggered by recognizing this added rule.

yacc treats the above example as if it had been written

$ACT : /* empty */
         {
             $$ = 1;
         }
     ;
A    : B $ACT C
         {
             x = $2;
             y = $3;
         }
     ;

where $ACT is the manufactured nonterminal symbol, which matches the empty string and carries the interior action.

In many applications, output is not a direct result of the actions. A data structure, such as a parse tree, is constructed in memory and transformations are applied to it before output is generated. Parse trees are particularly easy to construct, given routines to build and maintain the tree structure desired.

For example, suppose there is a C-function node written so that the call:

node( L, n1, n2 )

creates a node with label L and descendants n1 and n2 and returns the index of the newly created node. Then a parse tree can be built by supplying actions such as in the following specification:

expr : expr '+' expr
         {
             $$ = node( '+', $1, $3 );
         }
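
The text does not show node() itself. The following is one possible sketch in C, assuming the tree is kept in a fixed-size array so that a node can be identified by its index. The array nodes, the counter node_count, the structure layout, and the NO_CHILD marker are all hypothetical.

```c
#include <assert.h>

#define MAX_NODES 256           /* arbitrary capacity for the sketch */
#define NO_CHILD  (-1)          /* hypothetical marker for "no descendant" */

struct tree_node {
    int label;                  /* for example '+', or a token number */
    int left;                   /* index of first descendant, or NO_CHILD */
    int right;                  /* index of second descendant, or NO_CHILD */
};

struct tree_node nodes[MAX_NODES];
int node_count = 0;             /* number of nodes created so far */

/* Create a node labeled L with descendants n1 and n2; return its index. */
int node(int L, int n1, int n2)
{
    assert(node_count < MAX_NODES);
    nodes[node_count].label = L;
    nodes[node_count].left = n1;
    nodes[node_count].right = n2;
    return node_count++;
}
```

Because an index is an integer, it fits the integer values used in the examples so far, so the value of each expr can simply be the index of its subtree.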

You can define other variables to be used by the actions. Declarations and definitions can appear in the declarations section enclosed in %{ and %}. These declarations and definitions have global scope, so they are known to the action statements and can be made known to the lexical analyzer. For example:

%{ int variable = 0; %}

could be placed in the declarations section, making variable accessible to all of the actions. You should avoid names beginning with yy because the yacc parser uses only such names. Note, too, that in the examples shown thus far, all the values are integers.

A discussion of values is found in the section "Advanced Topics". Finally, note that in the following case:

%{
     int i;
     printf("%}");
%}

yacc starts copying after %{ and stops copying when it encounters the first %}, the one in printf(). In contrast, it would copy %{ in printf() if it encountered it there.

Lexical Analysis

You must supply a lexical analyzer to read the input stream and communicate tokens (with values, if desired) to the parser. The lexical analyzer is an integer-valued function called yylex(). The function returns an integer, the token number, representing the kind of token read. If a value is associated with that token, it should be assigned to the external variable yylval.

The parser and the lexical analyzer must agree on these token numbers in order for communication between them to take place. The numbers can be chosen by yacc or by the user. In either case, the #define mechanism of the C language is used to allow the lexical analyzer to return these numbers symbolically.

For example, suppose that the token name DIGIT has been defined in the declarations section of the yacc specification file. The relevant portion of the lexical analyzer might look like the following to return the appropriate token:

int yylex()
{
     extern int yylval;
     int c;
     ...
     c = getchar();
     ...
     switch (c)
     {
          ...
          case '0':
          case '1':
          ...
          case '9':
               yylval = c - '0';
               return (DIGIT);
          ...
     }
     ...
}

The intent is to return a token number of DIGIT and a value equal to the numerical value of the digit. You put the lexical analyzer code in the subroutines section and the declaration for DIGIT in the declarations section. Alternatively, you can put the lexical analyzer code in a separately compiled file, provided you:

Invoke yacc with the -d option, which generates a file called y.tab.h that contains #define statements for the tokens.

Use #include "y.tab.h" in the separately compiled lexical analyzer.

This mechanism leads to clear, easily modified lexical analyzers. The only pitfall to avoid is the use of any token names in the grammar that are reserved or significant in the C language or the parser.

For example, the use of token names if or while will almost certainly cause severe difficulties when the lexical analyzer is compiled. The token name error is reserved for error handling and should not be used naively.

In the default situation, token numbers are chosen by yacc. The default token number for a literal character is the numerical value of the character in the local character set. Other names are assigned token numbers starting at 257.

If you prefer to assign the token numbers, the first appearance of the token name or literal in the declarations section must be followed immediately by a nonnegative integer. This integer is taken to be the token number of the name or literal. Names and literals not defined this way are assigned default definitions by yacc. The potential for duplication exists here. Care must be taken to make sure that all token numbers are distinct.

For historical reasons, the end-marker must have token number 0 or negative. You cannot redefine this token number. Thus, all lexical analyzers should be prepared to return 0 or a negative number as a token upon reaching the end of their input.
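
The conventions described in this section can be brought together in a small hand-written analyzer. The following C sketch makes several assumptions. The #define of DIGIT stands in for #include "y.tab.h", and the value 257 is assumed. yylval is defined here so that the sketch compiles alone, whereas a real analyzer would declare it extern. Input comes from a string through the hypothetical helper set_lex_input() rather than from getchar(). Skipping white space is a choice made for the sketch, not something yacc requires.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in for #include "y.tab.h"; 257 is an assumed value. */
#define DIGIT 257

int yylval;                          /* normally defined by the generated parser */

static const char *lex_input = "";   /* hypothetical input buffer */

/* Hypothetical helper: point the analyzer at a string to scan. */
void set_lex_input(const char *text)
{
    assert(text != NULL);
    lex_input = text;
}

int yylex(void)
{
    int c;

    /* Skip blanks, tabs, and newlines between tokens. */
    while (*lex_input == ' ' || *lex_input == '\t' || *lex_input == '\n')
        lex_input++;

    if (*lex_input == '\0')
        return 0;                    /* end-marker */

    c = (unsigned char)*lex_input++;
    if (isdigit(c)) {
        yylval = c - '0';            /* value associated with the token */
        return DIGIT;
    }
    return c;                        /* literal: token number is the character */
}
```

Given the input "7 + 2", successive calls return DIGIT with yylval set to 7, the literal '+', DIGIT with yylval set to 2, and then 0 on every later call.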

As noted in Chapter 2, Lexical Analysis, lexical analyzers produced by lex are designed to work in close harmony with yacc parsers. The specifications for these lexical analyzers use regular expressions instead of grammar rules. lex can be used to produce quite complicated lexical analyzers, but there remain some languages that do not fit any theoretical framework and whose lexical analyzers must be crafted by hand.