Using Regular Expressions

Regular expressions (regex) are patterns that describe character combinations in text. Regex provides a concise and flexible means to match strings of text, such as particular characters, words, or patterns of characters. SIP messages are treated as sets of substrings on which regex pattern rules are executed. With regex you can create strings that match other string values, and use groupings to create stored values on which to operate.

Note:

An understanding of regex is required for successful HMRs. Refer to Mastering Regular Expressions from O'Reilly Media for more information.

Oracle's OCSBC supports the standardized regular expression format called Portable Operating System Interface (POSIX) Extended Regular Expressions. The OCSBC regex engine is a traditional regex-directed (NFA) type.

Example of HMR with Regex

The following HMR removes a P-Associated-URI header from a response to a REGISTER request. The regex ^<tel: restricts the removal to headers whose value is a tel-URI.

sip-manipulation
    name                         rem_telPAU
    description
    header-rule
        name                     modPAU
        header-name              P-Associated-URI
        action                   delete
        comparison-type          pattern-rule
        match-value              ^<tel:
        msg-type                 reply
        new-value
        methods                  REGISTER
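
The effect of the match-value above can be tried outside the OCSBC. The following is a minimal Python sketch, not OCSBC code: it assumes Python's re module behaves like the OCSBC engine for this simple anchored pattern, and the header values and the should_delete helper are hypothetical names used only for illustration.

```python
import re

# Same pattern as the match-value in the rule above.
TEL_PAU = re.compile(r"^<tel:")

def should_delete(header_value: str) -> bool:
    """Return True when the header value starts with a tel-URI."""
    return TEL_PAU.search(header_value) is not None

# Hypothetical P-Associated-URI values, for illustration only.
values = ["<tel:+15551230000>", "<sip:alice@example.com>"]

# Keep only the values that the rule would leave in place.
kept = [v for v in values if not should_delete(v)]
print(kept)
```

Only the tel-URI value satisfies the anchored pattern, so the SIP URI value is the one that remains.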

Regex Characters

Regular expressions are used to search for patterns of text using one or more of the following devices:

Character Type Example Description
Literal text foobar With the exception of a small number of characters that have a special meaning in a regex, text matches itself.
Special wildcard characters \d Known as metacharacters or metasequences, these match or exclude specific types of text, such as any number.
Character classes [1-5] When a suitable metacharacter or metasequence doesn't exist, you can create your own definition to match or exclude specified characters.
Quantifiers + or ? These specify how many times you want the preceding expression to match or whether it's optional.
Capturing groups and backreferences (foobar) or \1 These specify parts of the regex that you want remembered, either to find a similar match later on, or to preserve the value in a find and replace operation.
Boundaries and anchors ^ or $ These specify where the match should be made, for example at the beginning of a line or word.
Alternation | This specifies alternatives.

By default, regular expressions are case-sensitive, so A and a are treated as different characters. As long as what you're looking for fits a regular pattern, a regex can be created to find it.

Literal (Ordinary)

Many of the characters you can type on your keyboard are literal, ordinary characters; they represent their actual value in the pattern. For example, the regex pattern sip consists entirely of literal characters. It is matched from left to right at each position in the input string until a match is found. Given an input string of <sip:me@here.com>, the regex pattern sip successfully matches the sip, starting at the position of the s and ending at the position of the p. But the same regex also matches sip in <sips:me@here.com> and tel:12345;isip=192.168.0.3, because an s followed by an i followed by a p exists in both of those as well.
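
The three input strings above can be checked with a short Python sketch. Python's re module is used here only as a stand-in for the OCSBC engine, which is an assumption; the anchored pattern ^<sip: is an illustrative addition showing one way to exclude the unintended matches.

```python
import re

inputs = [
    "<sip:me@here.com>",
    "<sips:me@here.com>",
    "tel:12345;isip=192.168.0.3",
]

# The literal pattern matches wherever s, i, p appear in sequence.
spans = [re.search("sip", text).span() for text in inputs]
print(spans)

# Anchoring and adding context narrows the match to the intended input.
anchored = [re.search(r"^<sip:", text) is not None for text in inputs]
print(anchored)
```

The literal pattern finds a match in all three strings, while the anchored variant matches only the first.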

Special (Metacharacters)

Some characters have special meaning. They instruct the regex function (or engine which interprets the expressions) to treat the characters in designated ways. The following table outlines these special characters or metacharacters.

Character Name Description
. dot Matches any one character, including a space; it will match one character, but there must be one character to match.

Matches a literal dot when bracketed or preceded by a backslash: [.] or \. respectively.

* star/asterisk Matches zero or more of the preceding character (0, 1, or any number), bracketed character class, or group in parentheses. Used for quantification.

Typically used with a dot in the format .* to match any character, 0 or more times.

Matches a literal asterisk when bracketed: [*].

+ plus Matches one or more of the preceding character, bracketed character class, or group in parentheses. Used for quantification.

Matches a literal plus sign when bracketed: [+].

| bar/vertical bar/pipe Matches anything to the left or to the right; the bar separates the alternatives. The two sides are not always both tried; the right side is attempted only if the left side does not match. Used for alternation.
{ left brace Begins an interval range, ended with } (right brace) to match; identifies how many times the previous single character or group in parentheses must repeat.

Interval ranges are entered as a minimum and maximum {minimum,maximum}, where the character or group must appear at least the minimum number of times and at most the maximum. You can also use an interval to set the exact number of times a character must appear {count}, or to set a minimum without a maximum {minimum,}.

? question mark Signifies that the preceding character or group in parentheses is optional; the character or group can appear not at all or one time.
^ caret Acts as an anchor to represent the beginning of a string.
$ dollar sign Acts as an anchor to represent the end of a string.
[ left bracket Acts as the start of a bracketed character class, ended with the ] (right bracket). A character class is a list of character options; one and only one of the characters in the bracketed class must appear for a match. A - (hyphen) between two characters enclosed by brackets designates a range; for example, [a-z] is the range of the twenty-six lowercase letters of the alphabet.

Note that the ] (right bracket) ends a bracketed character class unless it sits directly next to the [ (left bracket) or the ^ (caret); in those two cases, it is the literal character.

( left parenthesis Creates a grouping when used with the ) (right parenthesis). Groupings have two functions:

They separate pattern strings so that special characters can be applied to a whole string as if it were a single character.

They allow the designated pattern to be stored and referenced later (so that other operations can be performed on it).
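
The metacharacters in the table can be exercised with a minimal Python sketch. It assumes that Python's re module agrees with the OCSBC engine for these basic constructs; the sample strings and the matches helper are hypothetical and exist only for illustration.

```python
import re

# pattern -> (string expected to match, string expected not to match)
EXAMPLES = {
    r"^a.c$": ("abc", "ac"),             # dot needs exactly one character
    r"^ab*c$": ("ac", "abd"),            # star: zero or more b
    r"^ab+c$": ("abbc", "ac"),           # plus: one or more b
    r"^ab?c$": ("abc", "abbc"),          # question mark: zero or one b
    r"^[0-9]{2,3}$": ("123", "1234"),    # interval: two to three digits
    r"^x{2,}$": ("xxx", "x"),            # interval: minimum only
    r"^(sip|tel):": ("tel:123", "sips:a"),  # alternation inside a group
    r"^1[.]5$": ("1.5", "125"),          # bracketed dot is literal
}

def matches(pattern: str, text: str) -> bool:
    """Return True when the pattern is found in the text."""
    return re.search(pattern, text) is not None

for pattern, (good, bad) in EXAMPLES.items():
    print(pattern, matches(pattern, good), matches(pattern, bad))
```

Each row prints True for the first sample and False for the second.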

Regex Tips

  • Limit the use of the wildcards asterisk * and plus sign +.
  • A character class enclosed by brackets [] is not a choice of one or more characters but rather a choice of one and only one character in the set.
  • The range 0-1000 is not the same as the range 0000-1000.
  • Spaces are legal characters and will be interpreted like any other character.

Matching New Lines

In the regular expression library, the dot . character does not match new lines or carriage returns. Conversely, the not-dot does match new lines and carriage returns. This provides a safety mechanism preventing egregious backtracking of the entire SIP message body when there are no matches. The OCSBC reduces backtracking to a single line within the body.
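
The single-line limit on the dot can be sketched in Python. Two assumptions apply: "not-dot" is read here as a negated character class such as [^;], and the sample body (hypothetical) uses \n line endings only, because Python's dot excludes \n but not \r, unlike the behavior described above.

```python
import re

# Hypothetical SDP-like body with \n line endings.
body = "v=0\no=alice 2890844526 IN IP4 host.example.com\ns=-"

# The wildcarded dot stops at the end of the line.
dot_match = re.search(r"o=.*", body).group(0)

# A negated character class continues across the line break.
negated_match = re.search(r"o=[^;]*", body).group(0)

print(repr(dot_match))
print(repr(negated_match))
```

The first match ends at the end of the o= line, while the second runs on into the following line.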

Escaped Characters

SIP HMR's support for escaped characters lets you search for values that you cannot type directly. Because they are necessary to MIME manipulation, support for escaped characters includes:

Syntax Description
\s Whitespace
\S Non-whitespace
\d Digits
\D Non-digits
\R Any \r, \n, or \r\n
\w Word
\W Non-word
\A Beginning of buffer
\Z End of buffer
\f Form feed
\n New line
\r Carriage return
\t Tab
\v Vertical tab
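
A few of these escapes can be tried in a minimal Python sketch. Python's re module is an assumption used for illustration; it supports \s, \d, \w, \A, and \Z, but not \R, so \R is not shown. The sample header line and URI are hypothetical.

```python
import re

# Hypothetical header line containing a tab, a space, and a CRLF.
line = "Content-Length:\t 142\r\n"

# \A and \Z anchor to the buffer, \s covers tab, space, CR and LF,
# \w covers word characters, \d covers digits.
m = re.search(r"\A([\w-]+):\s*(\d+)\s*\Z", line)
print(m.group(1), m.group(2))

# \d+ picks out every run of digits in a string.
digits = re.findall(r"\d+", "sip:1234@10.0.0.5")
print(digits)
```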

Building Expressions with Parentheses

You can use parentheses () when you use HMR to support order of operations and to simplify header manipulation rules that might otherwise prove complex. This means that expressions such as (sip + urp) - (u + rp) can now be evaluated to sip. Previously, the same expression would have evaluated to sipurprp. In addition, you previously would have been required to create several different manipulation rules to perform the same expression.
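
The two results quoted above can be reproduced with a small Python sketch. It rests on a loudly stated assumption: that the - operator removes the right-hand string only when it appears at the end of the left-hand string. That reading is consistent with both results in the text but is not confirmed here, and the append and subtract helpers are hypothetical.

```python
def append(left: str, right: str) -> str:
    """Model of the + string operator."""
    return left + right

def subtract(left: str, right: str) -> str:
    """ASSUMPTION: - removes `right` only when `left` ends with it."""
    if right and left.endswith(right):
        return left[: len(left) - len(right)]
    return left

# (sip + urp) - (u + rp): the parentheses are evaluated first.
with_parens = subtract(append("sip", "urp"), append("u", "rp"))

# sip + urp - u + rp: strictly left to right, no grouping.
left_to_right = append(subtract(append("sip", "urp"), "u"), "rp")

print(with_parens, left_to_right)
```

Under that assumption the grouped form yields sip and the ungrouped form yields sipurprp, matching the values given in the text.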

Boolean Operators

The following Boolean operators are supported:

  • &, meaning AND.
  • |, meaning OR.
  • !, meaning NOT.

You can only use Boolean operators when the comparison type is pattern-rule and you are evaluating stored matches. The OCSBC evaluates these Boolean expressions from left to right, and does not support any grouping mechanisms that might change the order of evaluation. For example, the OCSBC evaluates the expression A & B | C (where A=true, B=false, and C=true) as follows: A & B = false; false | true = true.
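
Strict left-to-right evaluation differs from the operator precedence found in most programming languages. The following Python sketch illustrates the difference; the eval_left_to_right helper is hypothetical, handles only & and |, and is not the OCSBC's parser.

```python
def eval_left_to_right(tokens):
    """Evaluate [value, op, value, op, value, ...] with no precedence."""
    result = tokens[0]
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        if op == "&":
            result = result and operand
        elif op == "|":
            result = result or operand
        else:
            raise ValueError(f"unsupported operator: {op}")
    return result

# A & B | C with A=true, B=false, C=true, as in the text.
example = eval_left_to_right([True, "&", False, "|", True])

# A | C & B: left to right gives (A | C) & B, whereas Python's own
# precedence would group it as A | (C & B).
no_precedence = eval_left_to_right([True, "|", True, "&", False])
python_precedence = True or True and False

print(example, no_precedence, python_precedence)
```

The second expression shows why operand order matters: left-to-right evaluation gives false where precedence-based evaluation gives true.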

Equality Operators

You can use equality operators in conjunction with string operators. You can also use equality operators with:

  • Boolean operators, as in this example: ($rule1.$0 == $rule2.$1) & $rule3.
  • The !, &, and | operators.
  • Variables and constant strings.

You can group them in parentheses for precedence.

Equality operators always evaluate to either true or false.

Equality Operator Symbol Short Description Detailed Information
== String case sensitive equality operator Performs a character-by-character, case-sensitive string comparison on both the left side and the right side of the operator.
~= String case insensitive equality operator Performs a character-by-character, case-insensitive string comparison on both the left side and the right side of the operator.
!= String case sensitive inequality operator Performs a character-by-character, case-sensitive string comparison on both the left side and the right side of the operator, returning true if the left side is not equal to the right side.
<= Less than or equal to operator Performs a string-to-integer conversion. If the string-to-integer conversion fails, the value is treated as 0. After the conversion, the operator will compare the two values and return true only if the left side is less than or equal to the right side of the operator.
>= Greater than or equal to operator Performs a string-to-integer conversion. If the string-to-integer conversion fails, the value is treated as 0. After the conversion, the operator will compare the two values and return true only if the left side is greater than or equal to the right side of the operator.
< Less than operator Performs a string-to-integer conversion. If the string-to-integer conversion fails, the value is treated as 0. After the conversion, the operator will compare the two values and return true only if the left side is less than the right side of the operator.
> Greater than operator Performs a string-to-integer conversion. If the string-to-integer conversion fails, the value is treated as 0. After the conversion, the operator will compare the two values and return true only if the left side is greater than the right side of the operator.
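
The operator behavior in the table can be modeled with a short Python sketch. The compare and to_int helpers are hypothetical, and treating any string that is not a whole integer as a failed conversion is an assumption; the OCSBC's exact conversion rules may differ.

```python
import operator

NUMERIC_OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}

def to_int(value: str) -> int:
    """Convert to an integer; a failed conversion is treated as 0."""
    try:
        return int(value)
    except ValueError:
        return 0

def compare(left: str, op: str, right: str) -> bool:
    """Evaluate one equality or relational operator on two strings."""
    if op == "==":
        return left == right                   # case-sensitive
    if op == "~=":
        return left.lower() == right.lower()   # case-insensitive
    if op == "!=":
        return left != right
    return NUMERIC_OPS[op](to_int(left), to_int(right))

print(compare("SIP", "==", "sip"), compare("SIP", "~=", "sip"))
print(compare("10", ">", "9"), compare("abc", "<=", "5"))
```

The relational operators compare numbers, so 10 is greater than 9 even though the string "10" sorts before "9", and a non-numeric left side is compared as 0.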

Normalizing EBNF ExpressionString Grammar

The expression parser grammar implies that any expression string can have boolean and string manipulation operators in the same expression. While technically this is possible, the expression parser prevents it.

Because all boolean expressions evaluate to the string value TRUE or FALSE, and since all manipulations are string manipulations, the result of a boolean expression returns the value TRUE or FALSE. The ExpressionString class interprets this as an actual TRUE or FALSE value. For this reason, boolean operators are not mixed with string manipulation operators (which is true with most programming languages).

The expression string grammar also indicates that it is possible to nest self-references and rule names indefinitely. For HMR, this is not allowed. A self-reference can only exist by itself, and a terminal index can only come at the end of a rule reference.

Storing Regex Patterns

Any HMR with a pattern-rule comparison type can store a regex pattern's matches for later use. In many cases you don't have to create store rules before manipulation rules. Data is only stored for items that later rules actually reference.

For example, if a later rule never references a header rule's stored value, but only its element rules, then the header rule itself doesn't store anything. Alternatively, you could delete a header or field, but still use its stored value later without having to create a separate store rule for it. In general, fewer rules improve OCSBC performance.
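
Stored matches correspond closely to capture groups in other regex tools. The following Python sketch is an analogy only: it assumes that $0 holds the whole match and $1, $2 hold the parenthesized groups, in line with the $rule1.$0 and $rule2.$1 references shown earlier. The pattern, the sample header value, and the stored dictionary are hypothetical.

```python
import re

# Hypothetical header value and pattern with two capturing groups.
value = "<sip:alice@example.com>"
m = re.search(r"^<(sip|sips|tel):([^@>]+)", value)

# Model of what a pattern-rule could make available to later rules.
stored = {
    "$0": m.group(0),  # the entire matched text
    "$1": m.group(1),  # first group: the URI scheme
    "$2": m.group(2),  # second group: the user part
}
print(stored)
```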

Performance Considerations

The regex engine consumes as much of the input string as it can before it backtracks or gives up trying, which is called greediness. Greediness can introduce errors in regex patterns and has an effect on performance. There is usually a trade-off between efficiency and exactness, so decide how exacting you need to be. Keep the following in mind to lessen the effect:

  • Poorly constructed regex patterns can affect the performance of regex matching for long strings.
  • Search on the smallest input string possible. Perform the regex search in element rules for the specific header component type you want to match.
  • Test the regex pattern against long strings which do not match to evaluate the effect on performance.
  • Test a regex with a wildcard in between characters against an input string with those characters repeated in different spots to evaluate performance.
  • If the input string format is fairly fixed and well-known, be explicit in the regex rather than using wildcards.
  • If the regex pattern is trying to capture everything before a specific character, use the negation of the character for the wildcard character. Note that this is true most times, except when there is an anchor at the end.
  • Use beginning-of-line and end-of-line anchors whenever the match must start or end at those positions.
  • A dot . means any character, including whitespace. A wild-carded dot, such as .* or .+, will capture/match everything until the end of line, and then it will backtrack if there are more characters after the wildcard that need to be matched. If you don't need to capture the things before the characters after the wildcard, don't use the wildcard.
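
The advice about negating the delimiter rather than using a wildcarded dot can be illustrated with a minimal Python sketch. Python's re module stands in for the OCSBC engine, which is an assumption, and the sample URI is hypothetical.

```python
import re

# Hypothetical URI with two parameters.
uri = "sip:alice@example.com;transport=tcp;lr"

# Greedy dot: runs to the end of the string, then backtracks
# to the LAST semicolon.
greedy = re.search(r"^(.*);", uri).group(1)

# Negated class: stops at the FIRST semicolon with no backtracking.
negated = re.search(r"^([^;]*);", uri).group(1)

print(greedy)
print(negated)
```

The greedy form captures everything up to the last semicolon, which is both slower on long input and usually not the intended value; the negated class captures only the part before the first semicolon.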

Additional References

To learn more about regex, you can visit the following Web site, which has information and tutorials that can help to get you started: http://www.regular-expressions.info/.