Programming Utilities Guide

The Fundamentals of lex Rules

The mandatory rules section opens with the delimiter %%. If a routines section follows, another %% delimiter ends the rules section. The %% delimiters must be entered at the beginning of a line, that is, without leading blanks. If there is no second delimiter, the rules section is presumed to continue to the end of the program.

Lines in the rules section that begin with white space and that appear before the first rule are copied to the beginning of the function yylex(), immediately after the first brace. You might use this feature to declare local variables for yylex().

Each rule specifies the pattern sought and the actions to take on finding it. The pattern specification must be entered at the beginning of a line. The scanner writes input that does not match a pattern directly to the output file. So the simplest lexical analyzer program is just the beginning rules delimiter, %%. It writes out the entire input to the output with no changes at all.

Regular Expressions

You specify the patterns you are interested in with a notation called a regular expression. A regular expression is formed by stringing together characters with or without operators. The simplest regular expressions are strings of text characters with no operators at all:

apple 
orange
pluto 

These three regular expressions match any occurrences of those character strings in an input text. To have the scanner remove every occurrence of orange from the input text, you could specify the rule

orange ;

Because you specified a null action on the right with the semicolon, the scanner does nothing but print the original input text with every occurrence of this regular expression removed, that is, without any occurrence of the string orange at all.

Operators

Unlike orange, most expressions cannot be specified so easily. The expression itself might be too long, or, more commonly, the class of desired expressions is too large; it might, in fact, be infinite.

Using lex operators -- summarized in Table 2-1 -- you can form regular expressions for any expression of a certain class. The + operator, for instance, means one or more occurrences of the preceding expression, the ? means 0 or 1 occurrences of the preceding expression (which is equivalent to saying that the preceding expression is optional), and the * means 0 or more occurrences of the preceding expression. So m+ is a regular expression that matches any string of ms:

mmm 
m 
mmmmm 

and 7* is a regular expression that matches any string of zero or more 7s:

77 
77777 

777

The empty third line matches because it has no 7s in it at all.

The | operator indicates alternation, so that ab|cd matches either ab or cd. The operators {} specify repetition, so that a{1,5} looks for 1 to 5 occurrences of a, and A(B{1,4}) matches ABC, ABBC, ABBBC, and ABBBBC (notice the use of parentheses, (), as grouping symbols).

Brackets, [], indicate any one character from the string of characters specified between the brackets. Thus, [dgka] matches a single d, g, k, or a. The characters between brackets must be adjacent, without spaces or punctuation.

The ^ operator, when it appears as the first character after the left bracket, indicates all characters in the standard set except those specified between the brackets. (Note that |, {,}} and ^ may serve other purposes as well.)

Ranges within a standard alphabetic or numeric order (A through Z, a through z, 0 through 9) are specified with a hyphen. [a-z], for instance, indicates any lowercase letter.

[A-Za-z0-9*&#]

This is a regular expression that matches any letter (whether upper or lowercase), any digit, an asterisk, an ampersand, or a #.

Given the following input text, the lexical analyzer with the previous specification in one of its rules will recognize *, &, r, and #, perform on each recognition whatever action the rule specifies (we have not indicated an action here), and print the rest of the text as it stands:

$$$$?? ????!!!*$$ $$$$$$&+====r~~# (( 

To include the hyphen character in the class, have it appear as the first or last character in the brackets: [-A-Z] or [A-Z-].

The operators become especially powerful in combination. For example, the regular expression to recognize an identifier in many programming languages is:

[a-zA-Z][0-9a-zA-Z]*

An identifier in these languages is defined to be a letter followed by zero or more letters or digits, and that is just what the regular expression says. The first pair of brackets matches any letter. The second, if it were not followed by a *, would match any digit or letter.

The two pairs of brackets with their enclosed characters would then match any letter followed by a digit or a letter. But with the *, the example matches any letter followed by any number of letters or digits. In particular, it would recognize the following as identifiers:

e 
not 
idenTIFIER 
pH 
EngineNo99 
R2D2

Note that it would not recognize the following as identifiers because not_idenTIFIER has an embedded underscore; 5times starts with a digit, not a letter; and $hello starts with a special character:

not_idenTIFIER
5times 
$hello 

A potential problem with operator characters is how to specify them as characters to look for in a search pattern. The previous example, for instance, does not recognize text with a * in it. lex solves the problem in one of two ways: an operator character preceded by a backslash, or characters (except backslash) enclosed in double quotation marks, are taken literally, that is, as part of the text to be searched for.

To use the backslash method to recognize, say, an * followed by any number of digits, you can use the pattern:

\*[1-9]*

To recognize a \ itself, we need two backslashes: \\. Similarly, "x\*x" matches x*x, and "y\"z" matches y"z. Other lex operators are noted as they arise; see Table 1-1:

Table 2-1 lex Operators

Expression 

Description  

\x

x, if x is a lex operator  

"xy"

xy, even if x or y is a lex operator (except \)  

[xy]

x or y  

[x-z]

x, y, or z  

[^x]

Any character but x  

.

Any character but newline  

^x

x at the beginning of a line  

<y>x

x when lex is in start condition y  

x$

x at the end of a line  

x?

Optional x  

x*

0, 1, 2, ... instances of x  

x+

1, 2, 3, ... instances of x  

x{m,n}

m through n occurrences of x 

xx|yy

Either xx or yy  

x |

The action on x is the action for the next rule  

(x)

x  

x/y

x but only if followed by y  

{xx}

The translation of xx from the definitions section  

Actions

After the scanner recognizes a string matching the regular expression at the start of a rule, it looks to the right of the rule for the action to be performed. You supply the actions.

Kinds of actions include recording the token type found and its value, if any; replacing one token with another; and counting the number of instances of a token or token type. You write these actions as program fragments in C.

An action can consist of as many statements as are needed. You might want to change the text in some way or print a message noting that the text has been found. So, to recognize the expression Amelia Earhart and to note such recognition, apply the rule:

"Amelia Earhart" printf("found Amelia"); 

To replace lengthy medical terms in a text with their equivalent acronyms, a rule such as the following would work:

Electroencephalogram printf("EEG"); 

To count the lines in a text, you recognize the ends of lines and increment a line counter.

lex uses the standard C escape sequences, including \n for newline. So, to count lines you might have the following syntax, where lineno, like other C variables, is declared in the "Definitions " section.

\n	 lineno++; 

Input is ignored when the C language null statement, a colon ;, is specified. So the following rule causes blanks, tabs, and new-lines to be ignored:

[ \t\n] ; 

The alternation operator | can also be used to indicate that the action for a rule is the action for the next rule. The previous example could have been written with the same result:

" "	| 
\t
\n ;

The scanner stores text that matches an expression in a character array called yytext[]. You can print or manipulate the contents of this array as you like. In fact, lex provides a macro called ECHO that is equivalent to printf ("%s", yytext).

When your action consists of a long C statement, or two or more C statements, you might write it on several lines. To inform lex that the action is for one rule only, enclose the C code in braces.

For example, to count the total number of all digit strings in an input text, print the running total of the number of digit strings, and print out each one as soon as it is found, your lex code might be:

\+?[1-9]+						{ digstrngcount++; 
					            	printf("%d",digstrngcount); 
					            	printf("%s", yytext); }

This specification matches digit strings whether or not they are preceded by a plus sign because the ? indicates that the preceding plus sign is optional. In addition, it catches negative digit strings because that portion following the minus sign matches the specification.