BEA Tuxedo Release 7.1

   BEA Tuxedo C Function Reference

recomp, rematch(3c)

Name

recomp(), rematch() - regular expression compile/execute

Synopsis

char *recomp( pattern-1, [pattern-2, ...], 0 ) 
char *pattern-1, [*pattern-2, ...];

extern int _Cerrnbr;
extern char *_Cerrmsg[];

char *rematch( pat, text, [substr-0, ..., substr-9,] 0 );
char *pat, *text, [*substr-0, ..., *substr-9];

extern char *_Mbegin;
extern int _Merrnbr;
extern char *_Merrmsg[];
extern char _Eol;

Description

The routines, recomp() and rematch(), provide a regular expression pattern matching scheme for C. There are two parts: a pattern compiler, recomp(); and a pattern interpreter, rematch(). They are, in effect and in spirit, extensions of the standard routines, regcmp(3) and regex(3).

Significant features are the inclusion of regular expression alternation and portability of the code.

recomp() compiles a pattern, in the form of a regular expression, into an intermediate code sequence. rematch() then searches user text for a pattern match by interpreting the codes.

The code sequence, an array of characters, can be computed off-line by the command rex(1), which reads regular expressions from the standard input and writes the corresponding character arrays to the standard output. The output can then be included in a regular C compile.

A thread in a multithreaded application may issue a call to recomp() or rematch() while running in any context state, including TPINVALIDCONTEXT.

Regular Expressions

The patterns for these routines are given with regular expressions, much like those used in the UNIX System editor, ed(1). The alternation operator, |, has been added, along with the other practical extensions listed in the rules below. In general, however, there should be few surprises.

Regular expressions (REs) are constructed by applying any of the following production rules one or more times.

Regular Expressions

Rule                  Matching Text

character             itself (character is any ASCII character except the
                      special ones mentioned below).
\ character           itself except as follows:
\ special-character   its unspecial self. The special characters are
                      . * + ? | ( ) [ { and \.
.                     any character except the end-of-line character
                      (usually newline or null).
^                     beginning of the line.
$                     end-of-line character.
[class]               any character in the class denoted by a sequence of
                      characters and/or ranges. A range is given by the
                      construct character-character. For example, the
                      character class, [a-zA-Z0-9_], will match any
                      alphanumeric character or "_". To be included in the
                      class, a hyphen, "-", must be escaped (preceded by a
                      "\") or appear first or last in the class. A literal
                      "]" must be escaped or appear first in the class. A
                      literal "^" must be escaped if it appears first in
                      the class.
[^ class ]            any character in the complement of the class with
                      respect to the ASCII character set, excluding the
                      end-of-line character.
RE RE                 the sequence. (catenation)
RE | RE               either the left RE or the right RE. (left to right
                      alternation)
RE *                  zero or more occurrences of RE.
RE +                  one or more occurrences of RE.
RE ?                  zero or one occurrence of RE.
RE { n }              n occurrences of RE. n must be between 0 and 255,
                      inclusive.
RE { m, n }           m through n occurrences of RE, inclusive. A missing m
                      is taken to be zero. A missing n denotes m or more
                      occurrences of RE.
( RE )                explicit precedence/grouping.
( RE ) $ n            the text matching RE is copied into the nth user
                      buffer. n may be 0 through 9. User buffers are
                      cleared before matching begins and loaded only if the
                      entire pattern is matched.

There are three levels of precedence. In order of decreasing binding strength they are:

    closure (*, +, ?, {...})
    catenation
    alternation (|)

As indicated above, parentheses are used to give explicit precedence.

recomp: Regular Expression Compiler

recomp() concatenates its arguments up to a terminating zero into a single expression. The expression is then compiled into a character array whose address is returned as the function value.

Space for the array is obtained from the standard C routine, malloc(), and may be released (by the user) with a call to the standard free() routine.
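The calling convention described above (any number of string arguments, ended by a zero) can be sketched with a variadic function. This is only an illustration of the convention, not recomp()'s actual implementation; join_patterns() is a hypothetical name, and it stops at concatenation without compiling anything. Like recomp(), it returns malloc() storage that the caller releases with free().

```c
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the BEA Tuxedo library: joins its string
 * arguments, up to the first null pointer, into one malloc()ed string.
 * Returns NULL if malloc() fails. The caller frees the result. */
char *join_patterns(const char *first, ...)
{
    va_list ap;
    const char *s;
    size_t total = 0;
    char *out, *pos;

    /* First pass: add up the lengths of all the pieces. */
    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, const char *))
        total += strlen(s);
    va_end(ap);

    out = malloc(total + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy the pieces end to end. */
    pos = out;
    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, const char *)) {
        size_t len = strlen(s);
        memcpy(pos, s, len);
        pos += len;
    }
    va_end(ap);
    *pos = '\0';

    return out;
}
```

A call such as join_patterns("([a-zA-Z_]", "[a-zA-Z0-9_]*)$0", (char *)0) yields the single expression used in the Example section. The cast on the terminator is a precaution for platforms where int and pointers differ in size.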

recomp() returns a zero (NULL) value if the pattern cannot be processed. The reason is indicated by a global variable, _Cerrnbr, which is set to a non-zero value on any failure. _Cerrnbr may be used directly or as an index into a table of error messages, _Cerrmsg. _Cerrnbr is reset on each call to recomp(). The possible values for _Cerrnbr and the corresponding messages from _Cerrmsg are given below.

Regular Expression Compiler

_Cerrnbr   _Cerrmsg[_Cerrnbr]

0          "Ok"
1          "Syntax error at col colnbr, char `char'"
           (colnbr is the position where the error is discovered; char is
           the character at that position)
2          "Out of node storage"
3          "Out of vector storage"
4          "Too many OR's"
5          "More than 255 repetitions"
           (a number in the "RE{...}" construct is greater than 255)
6          "Negative range"
           (a range for a character class or a closure is given backward)
7          "Out of heap storage"
           (malloc failed)

Conditions that cause _Cerrnbr values of 2, 3, and 4 relate to the size of recomp()'s internal data structures and are unlikely to occur.

The first and second characters of the code array form the least significant byte and the most significant byte, respectively, of an unsigned 16 bit quantity that gives the length, in bytes, of the entire array. This value will prove useful for copying or otherwise manipulating the array.
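As a minimal sketch of how that length prefix might be used, the helpers below decode the 16-bit length and duplicate a code array. The names code_length() and code_dup() are hypothetical and are not part of the library; the only assumption taken from the text is the byte order (least significant byte first).

```c
#include <stdlib.h>
#include <string.h>

/* Length in bytes of a compiled pattern: byte 0 is the least significant
 * byte and byte 1 the most significant byte of an unsigned 16-bit count. */
size_t code_length(const char *code)
{
    const unsigned char *p = (const unsigned char *)code;
    return (size_t)p[0] | ((size_t)p[1] << 8);
}

/* Hypothetical helper: returns a malloc()ed copy of a compiled pattern,
 * or NULL if the length is implausible or malloc() fails. */
char *code_dup(const char *code)
{
    size_t n = code_length(code);
    char *copy;

    if (n < 2)              /* the array must at least hold its own length */
        return NULL;
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, code, n);
    return copy;
}
```

Going through unsigned char avoids sign extension on platforms where plain char is signed.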

rematch: Regular Expression Matcher

rematch() interprets the code sequence produced by recomp() to search a user string for a match. When a match is found, rematch() returns as its value the address of the first character beyond the matching text (which may then be used as the text argument in a subsequent call to rematch()). Also, the variable _Mbegin is set to the address of the first character of the matching text.

Any text matching a specified sub-pattern (see "( RE ) $ n" above) is copied into the corresponding user buffer, providing one was supplied on the call. All supplied user buffers are reset on each rematch() call and filled only on a successful match.

Note: rematch(), unlike its role model, regex(3), requires a zero terminating argument.

rematch() returns NULL if no match can be found or if something else goes wrong. If no match is found the variable, _Merrnbr, is set to zero. If something worse happens it is set to a non-zero value. As above, _Merrnbr serves as an index for a table of diagnostic messages as indicated below.

_Merrnbr   _Merrmsg[_Merrnbr]

0          "Ok"
           (If rematch() returned NULL, no match was found)
1          "Too many closures"
2          "Line too long"
3          "Corrupt vector"
           (check recomp() for failure)
4          "More than 10 substr args"
           (User probably forgot to terminate rematch() arguments with a
           zero)
5          "Too many assignments"

_Merrnbr values of 1, 2, or 5 are not likely to occur. They relate to the size of data structures used by rematch().
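Because a NULL return from rematch() covers both "no match" and "failure", a caller has to look at _Merrnbr to tell them apart. The sketch below captures that decision in a small function; classify_match() and the enum are hypothetical names, and the function takes the two values as parameters so that it does not depend on the library.

```c
#include <stddef.h>

enum match_status { MATCH_FOUND, MATCH_NONE, MATCH_ERROR };

/* Hypothetical helper: 'result' is the value rematch() returned and
 * 'errnbr' is the value of _Merrnbr read right after that call. */
enum match_status classify_match(const char *result, int errnbr)
{
    if (result != NULL)
        return MATCH_FOUND;     /* result points just past the match */
    if (errnbr == 0)
        return MATCH_NONE;      /* NULL with "Ok": nothing matched */
    return MATCH_ERROR;         /* NULL with a non-zero code: see _Merrmsg */
}
```

In use, the return value of rematch() would be stored in a variable first and _Merrnbr read afterward, for example cursor = rematch(...); followed by classify_match(cursor, _Merrnbr). Writing both as arguments of one call would leave the order of evaluation unspecified.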

The variable _Eol is the current end-of-line character. It is initialized to "\0" but may be changed by the user to other reasonable values (for example, "\n"). The end-of-line character determines what the special character, $, matches.

Example

The following program scans its input for C identifiers and prints each one on a separate line.

#include <stdio.h>

main()
{
    char *recomp(), *rematch();
    char *patVect, *cursor, line[100], usrBuf[100];

    /* Compile once; $0 copies each matched identifier into usrBuf. */
    patVect = recomp( "([a-zA-Z_][a-zA-Z0-9_]*)$0", 0 );

    while ( gets(line) ) {
        cursor = line;
        /* Each search resumes just past the previous match. */
        while ( cursor=rematch(patVect,cursor,usrBuf,0) )
            printf( "%s\n", usrBuf );
    }
}

Note the use of the variable, cursor, to indicate a successful match as well as to provide (on success) the starting point for the next search. A less courageous programmer would check recomp()'s return value and restrict the length of the pattern match to the receiving buffer's size (for example, "{0,98}" instead of "*").
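One way to follow that advice is to derive the repetition bound from the size of the receiving buffer instead of typing it by hand. The sketch below builds the identifier pattern with such a bound; bounded_ident_pattern() is a hypothetical helper, and it assumes only the facts stated on this page: one leading character, a trailing null in the buffer, and a maximum repetition count of 255.

```c
#include <stdio.h>

/* Hypothetical helper: writes into 'pat' an identifier pattern whose match
 * fits a receiving buffer of 'bufsize' bytes, terminating null included.
 * Returns 0 on success, -1 if either buffer is too small. */
int bounded_ident_pattern(char *pat, size_t patsize, size_t bufsize)
{
    size_t tail;
    int n;

    if (bufsize < 2)            /* need one character plus the null */
        return -1;
    tail = bufsize - 2;         /* characters allowed after the first */
    if (tail > 255)
        tail = 255;             /* repetition counts may not exceed 255 */

    n = snprintf(pat, patsize, "([a-zA-Z_][a-zA-Z0-9_]{0,%lu})$0",
                 (unsigned long)tail);
    return (n < 0 || (size_t)n >= patsize) ? -1 : 0;
}
```

For the 100-byte usrBuf in the example this produces the "{0,98}" bound mentioned above. The resulting string would then be passed to recomp(), whose return value should be checked for NULL before it is handed to rematch().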

Implementation

recomp() and rematch() are written in portable C code. recomp() employs YACC, which accounts for the fact that it is bigger and somewhat slower than its counterpart, regcmp(). The intermediate code produced by recomp() is generally more compact than that of regcmp().

rematch() is about the same size and has about the same speed as its counterpart, regex().

Notices

Support for the functions described in this manual page will be withdrawn in Release 5.0 of the BEA Tuxedo system.

See Also

rex(1)

ed(1), free(3), malloc(3), regcmp(3), regex(3) in a UNIX system reference manual