recomp, rematch - regular expression compile/execute
char *recomp( pattern-1, [pattern-2, ...], 0 )
char *pattern-1, [*pattern-2, ...];
extern int _Cerrnbr;
extern char *_Cerrmsg[];
char *rematch( pat, text, [substr-0, ..., substr-9,] 0 );
char *pat, *text, [*substr-0, ..., *substr-9];
extern char *_Mbegin;
extern int _Merrnbr;
extern char *_Merrmsg[];
extern char _Eol;
The routines recomp() and rematch() provide a regular expression pattern matching scheme for C. There are two parts: a pattern compiler, recomp(), and a pattern interpreter, rematch(). They are, in effect and in spirit, extensions of the standard routines regcmp(3) and regex(3). Significant features are the inclusion of regular expression alternation and portability of the code.
recomp() compiles a pattern, in the form of a regular expression, into an intermediate code sequence. rematch() then searches user text for a pattern match by interpreting the codes.
The code sequence, an array of characters, can be computed off-line by the command rex(1), which reads regular expressions from the standard input and writes the corresponding character arrays to the standard output. The output can then be included in a regular C compile.
The patterns for these routines are given with regular expressions, much like those used in the UNIX System editor, ed(1). The alternation operator (|) has been added, along with some other practical things. In general, however, there should be few surprises.
Regular expressions (REs) are constructed by applying any of the following production rules one or more times.
There are three levels of precedence. In order of decreasing binding strength they are:
catenation closure (*,+,?,{...})
As indicated above, parentheses are used to give explicit precedence.
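As a small illustration of the closure forms listed above, the sketch below builds the identifier pattern used in this page's example with a bounded closure, "{0,N}", in place of "*", so that a matched sub-pattern cannot outgrow the caller's receiving buffer. The helper name build_ident_pattern and its parameters are assumptions made for this illustration; they are not part of the recomp() interface.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of recomp()/rematch()): write into `out`
 * an identifier pattern whose match, plus a terminating NUL, fits in a
 * receiving buffer of `bufsize` bytes.  One byte is taken by the leading
 * character class and one by the NUL, so the closure is capped at
 * bufsize - 2 repetitions.  Returns 0 on success, -1 if the arguments
 * cannot produce a usable pattern.
 */
int build_ident_pattern(char *out, size_t outsize, size_t bufsize)
{
    int n;

    if (out == NULL || bufsize < 2)
        return -1;
    n = snprintf(out, outsize, "([a-zA-Z_][a-zA-Z0-9_]{0,%lu})$0",
                 (unsigned long)(bufsize - 2));
    if (n < 0 || (size_t)n >= outsize)
        return -1;              /* pattern text was truncated */
    return 0;
}
```

For a 100-byte receiving buffer the helper produces the "{0,98}" bound; the resulting string would then be passed to recomp(), followed by the terminating 0.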
recomp -- Regular Expression Compiler
recomp() concatenates its arguments up to a terminating zero into a single expression. The expression is then compiled into a character array whose address is returned as the function value.
Space for the array is obtained from the standard C routine, malloc(3), and may be released (by the user) with a call to the standard free(3) routine.
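Because the compiled array records its own length in its first two bytes (least significant byte first, as this page describes for the code array), it can be copied without any other knowledge of its contents. The following is a minimal sketch of that idea; code_length and code_dup are names invented here and are not supplied by the library.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Total length, in bytes, of a compiled code array: byte 0 is the least
 * significant byte and byte 1 the most significant byte of an unsigned
 * 16-bit count that covers the entire array.
 */
size_t code_length(const char *code)
{
    const unsigned char *p = (const unsigned char *)code;

    return (size_t)p[0] | ((size_t)p[1] << 8);
}

/*
 * Hypothetical helper: duplicate a code array into fresh malloc(3) space.
 * Returns NULL if `code` is NULL, the recorded length is too small to
 * cover its own two length bytes, or malloc fails.  The caller releases
 * the copy with free(3).
 */
char *code_dup(const char *code)
{
    size_t len;
    char *copy;

    if (code == NULL)
        return NULL;
    len = code_length(code);
    if (len < 2)
        return NULL;
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, code, len);
    return copy;
}
```

Going through unsigned char avoids sign extension on platforms where plain char is signed, which would otherwise corrupt lengths whose low byte is 0x80 or above.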
recomp() returns a zero (NULL) value if the pattern cannot be processed. The reason is indicated by a global variable, _Cerrnbr, which is set to a non-zero value on any failure. _Cerrnbr may be used directly or as an index into a table of error messages, _Cerrmsg. _Cerrnbr is reset on each call to recomp(). The possible values for _Cerrnbr and the corresponding messages from _Cerrmsg are given below.
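Using an error number as an index into a message table can be sketched as follows. The helper re_error_text, its parameters, and the NMSGS constant mentioned in the closing comment are assumptions made for this illustration; in particular, this page does not state how many entries the message tables hold, so the count is passed in by the caller.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format a diagnostic from an error number and the
 * matching message table.  `nmsgs` is the caller's idea of how many
 * entries the table holds.  Falls back to the bare number when the index
 * is out of range.  Returns `out`.
 */
char *re_error_text(char *out, size_t outsize, const char *who,
                    int errnbr, char *const msgs[], int nmsgs)
{
    if (errnbr >= 0 && errnbr < nmsgs && msgs[errnbr] != NULL)
        snprintf(out, outsize, "%s: error %d: %s", who, errnbr, msgs[errnbr]);
    else
        snprintf(out, outsize, "%s: error %d", who, errnbr);
    return out;
}

/*
 * Intended use with the externs from the synopsis (NMSGS is assumed):
 *
 *     code = recomp(pattern, 0);
 *     if (code == NULL)
 *         fputs(re_error_text(buf, sizeof buf, "recomp",
 *                             _Cerrnbr, _Cerrmsg, NMSGS), stderr);
 */
```

The same helper would serve rematch() by passing _Merrnbr and _Merrmsg instead.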
Note: _Cerrnbr values of 2, 3, and 4 relate to the size of recomp()'s internal data structures and are unlikely to occur.
The first and second characters of the code array form the least significant byte and the most significant byte, respectively, of an unsigned 16-bit quantity that gives the length, in bytes, of the entire array. This value is useful for copying or otherwise manipulating the array.
rematch -- Regular Expression Matcher
rematch() interprets the code sequence produced by recomp() to search a user string for a match. When a match is found, rematch() returns as its value the address of the first character beyond the matching text (which may then be used as the text argument in a subsequent call to rematch()). Also, the variable _Mbegin is set to the address of the first character of the matching text.
Any text matching a specified sub-pattern (see "( rE ) $ n" above) is copied into the corresponding user buffer, provided one was supplied on the call. All supplied user buffers are reset on each rematch() call and filled only on a successful match.
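The address left in _Mbegin and the address returned by rematch() together delimit the matched text, so a caller that supplied no sub-pattern buffers can still extract the match. A minimal sketch follows; copy_match is a name invented for this illustration, and it takes the two addresses as ordinary parameters so that it does not depend on the library.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the text between `begin` (as rematch() would
 * leave in _Mbegin) and `end` (rematch()'s return value) into `dst`,
 * truncating to fit and always NUL-terminating.  Returns the number of
 * characters copied, not counting the NUL.
 */
size_t copy_match(char *dst, size_t dstsize, const char *begin, const char *end)
{
    size_t len;

    if (dst == NULL || dstsize == 0)
        return 0;
    if (begin == NULL || end == NULL || end < begin) {
        dst[0] = '\0';
        return 0;
    }
    len = (size_t)(end - begin);
    if (len > dstsize - 1)
        len = dstsize - 1;      /* leave room for the terminator */
    memcpy(dst, begin, len);
    dst[len] = '\0';
    return len;
}
```

After a successful call such as end = rematch(code, text, 0), the match could be retrieved with copy_match(buf, sizeof buf, _Mbegin, end).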
rematch(), unlike its role model, regex(3), requires a zero terminating argument.
rematch() returns NULL if no match can be found or if something else goes wrong. If no match is found, the variable _Merrnbr is set to zero. If something worse happens, it is set to a non-zero value. As above, _Merrnbr serves as an index for a table of diagnostic messages as indicated below.
_Merrnbr values of 1, 2, or 5 are not likely to occur. They relate to the size of data structures used by rematch().
The variable _Eol is the current end-of-line character. It is initialized to "\0" but may be changed by the user to other reasonable values (e.g., "\n"). The end-of-line character determines what the special character, $, matches.
Example
The following program scans its input for C identifiers and prints each one on a separate line.
#include <stdio.h>
main()
{
    char *recomp(), *rematch();
    char *patVect, *cursor, line[100], usrBuf[100];

    /* Compile once; "$0" sends the parenthesized match to the first buffer. */
    patVect = recomp( "([a-zA-Z_][a-zA-Z0-9_]*)$0", 0 );
    /* fgets() bounds each read to the size of line[]. */
    while ( fgets(line, sizeof line, stdin) ) {
        cursor = line;
        /* Each success returns the point at which to resume the search. */
        while ( cursor=rematch(patVect,cursor,usrBuf,0) )
            printf( "%s\n", usrBuf );
    }
}
Note the use of the variable, cursor, to indicate a successful match as well as to provide (on success) the starting point for the next search. A less courageous programmer would check recomp()'s return value and restrict the length of the pattern match to the receiving buffer's size (e.g., "{0,98}" instead of "*").
Implementation
recomp() and rematch() are written in portable C code. recomp() employs YACC, which accounts for the fact that it is bigger and somewhat slower than its counterpart, regcmp(3). The intermediate code produced by recomp() is generally more compact than that of regcmp(3).
rematch() is about the same size and has about the same speed as its counterpart, regex(3).
Notices
Support for the functions described in this manual page will be withdrawn in Release 5.0 of the BEA TUXEDO system.
See Also
rex(1), ed(1) in a UNIX System reference manual, regcmp(3), malloc(3), free(3), regex(3) in a UNIX System reference manual