recomp, rematch(3c)
Name
recomp(), rematch() - regular expression compile/execute
Synopsis
char *recomp( pattern-1, [pattern-2, ...], 0 )
char *pattern-1, [*pattern-2, ...];
extern int _Cerrnbr;
extern char *_Cerrmsg[];
char *rematch( pat, text, [substr-0, ..., substr-9,] 0 );
char *pat, *text, [*substr-0, ..., *substr-9];
extern char *_Mbegin;
extern int _Merrnbr;
extern char *_Merrmsg[];
extern char _Eol;
Description
The routines recomp() and rematch() provide a regular expression pattern matching scheme for C. There are two parts: a pattern compiler, recomp(), and a pattern interpreter, rematch(). They are, in effect and in spirit, extensions of the standard routines regcmp(3) and regex(3).
Significant features are the inclusion of regular expression alternation and portability of the code.
recomp() compiles a pattern, in the form of a regular expression, into an intermediate code sequence. rematch() then searches user text for a pattern match by interpreting the codes.
The code sequence, an array of characters, can be computed off-line by the command rex(), which reads regular expressions from the standard input and writes the corresponding character arrays to the standard output. The output can then be included in a regular C compile.
A thread in a multithreaded application may issue a call to recomp() or rematch() while running in any context state, including TPINVALIDCONTEXT.
Regular Expressions
The patterns for these routines are given as regular expressions, much like those used in the UNIX System editor, ed(1). The alternation operator (|) has been added, along with some other practical extensions. In general, however, there should be few surprises.
Regular expressions (REs) are constructed by applying any of the following production rules one or more times.
There are three levels of precedence. In order of decreasing binding strength they are:
As indicated above, parentheses are used to give explicit precedence.
recomp: Regular Expression Compiler
recomp() concatenates its arguments up to a terminating zero into a single expression. The expression is then compiled into a character array whose address is returned as the function value.
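The zero-terminated argument convention described above can be sketched as follows. This is only an illustration of how a variable-length list of pattern strings might be joined; concat_patterns() is a hypothetical helper, not part of the library, and it makes no claim about how recomp() is actually implemented. The sketch passes the terminator as (char *)0, which is the portable way to write a null pointer in a variable argument list.

```c
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the library): joins its string
 * arguments into one malloc()ed string, stopping at the first null
 * pointer. The caller releases the result with free().
 */
char *concat_patterns(const char *first, ...)
{
    va_list ap;
    size_t total = 0;
    const char *s;
    char *out;

    /* First pass: add up the lengths of all pieces. */
    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, const char *))
        total += strlen(s);
    va_end(ap);

    out = malloc(total + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy the pieces in order. */
    out[0] = '\0';
    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, const char *))
        strcat(out, s);
    va_end(ap);

    return out;
}
```

Under these assumptions, concat_patterns("[a-z]", "+", "$0", (char *)0) yields the single string "[a-z]+$0", which is the form in which a multi-argument call would present its pattern to the compiler.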
Space for the array is obtained from the standard C routine, malloc(), and may be released (by the user) with a call to the standard free() routine.
recomp() returns a zero (NULL) value if the pattern cannot be processed. The reason is indicated by a global variable, _Cerrnbr, which is set to a non-zero value on any failure. _Cerrnbr may be used directly or as an index into a table of error messages, _Cerrmsg. _Cerrnbr is reset on each call to recomp(). The possible values for _Cerrnbr and the corresponding messages from _Cerrmsg are given below.
Conditions that cause _Cerrnbr values of 2, 3, and 4 relate to the size of recomp()'s internal data structures and are unlikely to occur.
The first and second characters of the code array form the least significant byte and the most significant byte, respectively, of an unsigned 16 bit quantity that gives the length, in bytes, of the entire array. This value will prove useful for copying or otherwise manipulating the array.
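As a minimal sketch of how the length prefix might be used, the helpers below read the two-byte header (least significant byte first) and duplicate a code array. The names code_length() and code_copy() are hypothetical and are not provided by the library; the only fact assumed from the text above is the layout of the first two bytes.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns the total length, in bytes, of a code
 * array, taken from its first two bytes (low byte, then high byte).
 */
size_t code_length(const char *code)
{
    /* Read through unsigned char so bytes above 0x7F are not sign-extended. */
    const unsigned char *b = (const unsigned char *)code;

    return (size_t)b[0] | ((size_t)b[1] << 8);
}

/*
 * Hypothetical helper: returns a malloc()ed copy of a code array, or
 * NULL if memory cannot be obtained. The caller releases it with free().
 */
char *code_copy(const char *code)
{
    size_t n = code_length(code);
    char *dup = malloc(n);

    if (dup != NULL)
        memcpy(dup, code, n);
    return dup;
}
```

Because the array may contain embedded zero bytes, string routines such as strlen() or strcpy() are not suitable for it; the length prefix is what makes a byte-exact copy possible.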
rematch: Regular Expression Matcher
rematch() interprets the code sequence produced by recomp() to search a user string for a match. When a match is found, rematch() returns as its value the address of the first character beyond the matching text (which may then be used as the text argument in a subsequent call to rematch()). Also, the variable _Mbegin is set to the address of the first character of the matching text.
Any text matching a specified sub-pattern (see "( rE ) $ n" above) is copied into the corresponding user buffer, providing one was supplied on the call. All supplied user buffers are reset on each rematch() call and filled only on a successful match.
Note: rematch(), unlike its role model, regex(3), requires a zero terminating argument.
rematch() returns NULL if no match can be found or if something else goes wrong. If no match is found, the variable _Merrnbr is set to zero. If an error occurs, it is set to a non-zero value. As with _Cerrnbr, _Merrnbr serves as an index into a table of diagnostic messages, _Merrmsg, as indicated below.
_Merrnbr values of 1, 2, or 5 are not likely to occur. They relate to the size of data structures used by rematch().
The variable _Eol is the current end-of-line character. It is initialized to '\0' but may be changed by the user to other reasonable values (for example, '\n'). The end-of-line character determines what the special character, $, matches.
Example
The following program scans its input for C identifiers and prints each one on a separate line.
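#include <stdio.h>

int main(void)
{
    char *recomp(), *rematch();
    char *patVect, *cursor, line[100], usrBuf[100];

    /* "$0" copies the text matched by the parenthesized sub-pattern into usrBuf. */
    patVect = recomp( "([a-zA-Z_][a-zA-Z0-9_]*)$0", 0 );
    /* fgets() is used in place of gets() so that input cannot overrun line[]. */
    while ( fgets(line, sizeof line, stdin) ) {
        cursor = line;
        /* Each successful match returns the point at which to resume the search. */
        while ( (cursor = rematch(patVect, cursor, usrBuf, 0)) != NULL )
            printf( "%s\n", usrBuf );
    }
    return 0;
}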
Note the use of the variable, cursor, to indicate a successful match as well as to provide (on success) the starting point for the next search. A less courageous programmer would check recomp()'s return value and restrict the length of the pattern match to the receiving buffer's size (for example, "{0,98}" instead of "*").
Implementation
recomp() and rematch() are written in portable C code. recomp() employs YACC, which accounts for the fact that it is bigger and somewhat slower than its counterpart, regcmp(). The intermediate code produced by recomp() is generally more compact than that of regcmp().
rematch() is about the same size and has about the same speed as its counterpart, regex().
Notices
Support for the functions described in this manual page will be withdrawn in Release 5.0 of the BEA Tuxedo system.
See Also
rex(1)
ed(1), free(3), malloc(3), regcmp(3), regex(3) in a UNIX system reference manual
Copyright © 2000 BEA Systems, Inc. All rights reserved.