recomp, rematch - regular expression compile/execute
char *recomp( pattern-1, [pattern-2, ...], 0 )
char *pattern-1, [*pattern-2, ...];

extern int _Cerrnbr;
extern char *_Cerrmsg[];

char *rematch( pat, text, [substr-0, ..., substr-9,] 0 );
char *pat, *text, [*substr-0, ..., *substr-9];

extern char *_Mbegin;
extern int _Merrnbr;
extern char *_Merrmsg[];
extern char _Eol;
The routines recomp() and rematch() provide a regular expression pattern matching scheme for C. There are two parts: a pattern compiler, recomp(), and a pattern interpreter, rematch(). They are, in effect and in spirit, extensions of the standard routines regcmp(3) and regex(3).
Significant features are the inclusion of regular expression alternation and portability of the code.
recomp() compiles a pattern, in the form of a regular expression, into an intermediate code sequence. rematch() then searches user text for a pattern match by interpreting the codes.
The code sequence, an array of characters, can be computed off-line by the command rex(1), which reads regular expressions from the standard input and writes the corresponding character arrays to the standard output. The output can then be included in a regular C compile.
The patterns for these routines are given with regular expressions, much like those used in the UNIX System editor, ed(1). The alternation operator, (|), has been added along with some other practical things. In general, however, there should be few surprises.
Regular expressions (REs) are constructed by applying any of the following production rules one or more times.
Rule                  Matching Text

character             itself (character is any ASCII character except
                      the special ones mentioned below).

\ character           itself, except as follows:
                          \n -> newline
                          \t -> tab
                          \b -> backspace
                          \r -> carriage return
                          \f -> formfeed

\ special-character   its unspecial self. The special characters are
                      . * + ? | ( ) [ { and \.

.                     any character except the end-of-line character
                      (usually newline or null).

^                     beginning of the line.

$                     end-of-line character.

[ class ]             any character in the class denoted by a sequence
                      of characters and/or ranges. A range is given by
                      the construct character-character. For example,
                      the character class [a-zA-Z0-9_] will match any
                      alphanumeric character or "_". To be included in
                      the class, a hyphen, "-", must be escaped
                      (preceded by a "\") or appear first or last in
                      the class. A literal "]" must be escaped or
                      appear first in the class. A literal "^" must be
                      escaped if it appears first in the class.

[ ^ class ]           any character in the complement of the class with
                      respect to the ASCII character set, excluding the
                      end-of-line character.

RE RE                 the sequence. (catenation)

RE | RE               either the left RE or the right RE. (left to
                      right alternation)

RE *                  zero or more occurrences of RE.

RE +                  one or more occurrences of RE.

RE ?                  zero or one occurrences of RE.

RE { n }              n occurrences of RE. n must be between 0 and 255,
                      inclusive.

RE { m, n }           m through n occurrences of RE, inclusive. A
                      missing m is taken to be zero. A missing n
                      denotes m or more occurrences of RE.

( RE )                explicit precedence/grouping.

( RE ) $ n            the text matching RE is copied into the nth user
                      buffer. n may be 0 through 9. User buffers are
                      cleared before matching begins and loaded only if
                      the entire pattern is matched.

There are three levels of precedence. In order of decreasing binding strength they are:

    catenation closure (*, +, ?, {...})
    catenation
    alternation (|)
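The three precedence levels can be checked experimentally. The sketch below does not call recomp() or rematch(); it assumes POSIX extended regular expressions (regcomp(3)/regexec(3) from <regex.h>) as a stand-in, since they rank closure, catenation, and alternation in the same order. The helper name full_match() is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the WHOLE of text matches pattern,
 * 0 if it does not, and -1 if the pattern is too long or fails to compile.
 * Uses POSIX extended REs as a stand-in; this is not recomp()/rematch(). */
int full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);

    /* Wrap the pattern as ^(pattern)$ so only whole-string matches count. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, "ab|cd*" accepts "ab" and "cddd" but rejects "abd" and "abcd", which shows that the pattern groups as (ab)|(c(d*)). Explicit parentheses change the grouping: "(ab|cd)*" accepts "abcd", and "a(b|c)d*" accepts "abd".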
As indicated above, parentheses are used to give explicit precedence.
recomp() concatenates its arguments up to a terminating zero into a single expression. The expression is then compiled into a character array whose address is returned as the function value.
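The zero-terminated argument convention can be illustrated with a small variadic function. This is only a sketch of the concatenation step under the assumption that it behaves like ordinary string joining; join_patterns() is a hypothetical name and is not part of the library, and no compilation into a code array is attempted.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: joins a null-terminated list of strings into one
 * malloc'd string. Returns NULL if memory cannot be obtained. */
char *join_patterns(const char *first, ...)
{
    va_list ap;
    const char *s;
    size_t total = 0, used = 0, len;
    char *out;

    assert(first != NULL);

    /* First pass: measure the combined length. */
    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, const char *))
        total += strlen(s);
    va_end(ap);

    out = malloc(total + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy each piece in order. */
    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, const char *)) {
        len = strlen(s);
        memcpy(out + used, s, len);
        used += len;
    }
    va_end(ap);

    out[used] = '\0';
    return out;
}
```

A call such as join_patterns("([a-zA-Z_]", "[a-zA-Z0-9_]*)", "$0", (char *)0) yields the single pattern "([a-zA-Z_][a-zA-Z0-9_]*)$0". In ANSI C the terminator of a variadic call is best written as (char *)0, because a plain 0 is passed as an int and may differ in size from a pointer.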
Space for the array is obtained from the standard C routine, malloc(3), and may be released (by the user) with a call to the standard free(3) routine.
recomp() returns a zero (NULL) value if the pattern cannot be processed. The reason is indicated by a global variable, _Cerrnbr, which is set to a non-zero value on any failure. _Cerrnbr may be used directly or as an index into a table of error messages, _Cerrmsg. _Cerrnbr is reset on each call to recomp(). The possible values for _Cerrnbr and the corresponding messages from _Cerrmsg are given below.
The first and second characters of the code array form the least significant byte and the most significant byte, respectively, of an unsigned 16 bit quantity that gives the length, in bytes, of the entire array. This value will prove useful for copying or otherwise manipulating the array.
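The length prefix can be decoded and used for copying as sketched below. The helper names code_length() and code_dup() are hypothetical; the sketch assumes only what is stated above, namely that byte 0 is the least significant byte, byte 1 the most significant, and that the length counts the entire array, prefix included.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decode the 16-bit length stored in the first two
 * bytes (least significant byte first). The bytes are read as unsigned
 * char so that values above 127 are not sign-extended. */
size_t code_length(const char *code)
{
    const unsigned char *p = (const unsigned char *)code;

    assert(code != NULL);
    return (size_t)p[0] | ((size_t)p[1] << 8);
}

/* Hypothetical helper: duplicate a code array using its own length prefix.
 * Returns NULL on allocation failure or if the length is implausible
 * (assumption: a valid array is at least as long as its 2-byte prefix). */
char *code_dup(const char *code)
{
    size_t n = code_length(code);
    char *copy;

    if (n < 2)
        return NULL;
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, code, n);
    return copy;
}
```

The copy returned by code_dup() is obtained from malloc(3) and should be released with free(3).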
rematch() interprets the code sequence produced by recomp() to search a user string for a match. When a match is found, rematch() returns as its value the address of the first character beyond the matching text (which may then be used as the text argument in a subsequent call to rematch()). Also, the variable _Mbegin is set to the address of the first character of the matching text.
Any text matching a specified sub-pattern (see "( RE ) $ n" above) is copied into the corresponding user buffer, provided one was supplied on the call. All supplied user buffers are reset on each rematch() call and filled only on a successful match.
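The calling convention described above (return the address just past the match, record where the match began, reset and then fill the caller's buffer) can be tried without the library. The sketch below is a hand-written stand-in hard-wired to C identifiers; next_ident() and its begin and bufsize parameters are hypothetical, and unlike rematch() it takes an explicit buffer size and truncates rather than overrunning.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for rematch(), fixed to the pattern
 * ([a-zA-Z_][a-zA-Z0-9_]*)$0. On success it copies the identifier into
 * buf, stores its start in *begin (the role _Mbegin plays), and returns
 * the address of the first character beyond it. Returns NULL if no
 * identifier is found. */
const char *next_ident(const char *text, const char **begin,
                       char *buf, size_t bufsize)
{
    const char *p = text, *start;
    size_t len;

    assert(text != NULL && begin != NULL && buf != NULL && bufsize > 0);

    buf[0] = '\0';                      /* buffer is reset on every call */

    /* Search forward for a character that can start an identifier. */
    while (*p != '\0' && !(isalpha((unsigned char)*p) || *p == '_'))
        p++;
    if (*p == '\0')
        return NULL;

    start = p;
    while (isalnum((unsigned char)*p) || *p == '_')
        p++;

    len = (size_t)(p - start);
    if (len >= bufsize)                 /* truncate instead of overflowing */
        len = bufsize - 1;
    memcpy(buf, start, len);
    buf[len] = '\0';

    *begin = start;
    return p;                           /* usable as text for the next call */
}
```

Feeding the returned pointer back in as the text argument walks through every identifier in a line, and a NULL return ends the loop with the buffer left empty.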
Note: rematch(), unlike its role model, regex(3), requires a zero terminating argument.
rematch() returns NULL if no match can be found or if something else goes wrong. If no match is found the variable, _Merrnbr, is set to zero. If something worse happens it is set to a non-zero value. As above, _Merrnbr serves as an index for a table of diagnostic messages as indicated below.
_Merrnbr values of 1, 2, or 5 are not likely to occur. They relate to the size of data structures used by rematch().
The variable _Eol is the current end-of-line character. It is initialized to '\n' but may be changed by the user to other reasonable values (e.g., '\0'). The end-of-line character determines what the special character, $, matches.
The following program scans its input for C identifiers and prints each one on a separate line.
    #include <stdio.h>

    main()
    {
        char *recomp(), *rematch();
        char *patVect, *cursor, line[100], usrBuf[100];

        patVect = recomp( "([a-zA-Z_][a-zA-Z0-9_]*)$0", 0 );
        while ( gets(line) ) {
            cursor = line;
            while ( cursor=rematch(patVect,cursor,usrBuf,0) )
                printf( "%s\n", usrBuf );
        }
    }
Note the use of the variable, cursor, to indicate a successful match as well as to provide (on success) the starting point for the next search. A less courageous programmer would check recomp()'s return value and restrict the length of the pattern match to the receiving buffer's size (e.g., "{0,98}" instead of "*").
recomp() and rematch() are written in portable C code. recomp() employs YACC, which accounts for the fact that it is bigger and somewhat slower than its counterpart, regcmp(3). The intermediate code produced by recomp() is generally more compact than that of regcmp(3).
rematch() is about the same size and has about the same speed as its counterpart, regex(3).
Support for the functions described in this manual page will be withdrawn in Release 5.0 of the TUXEDO System.
rex(1)
ed(1), regcmp(3), malloc(3), free(3), regex(3) in a UNIX System reference manual