Using Regular Expressions
Regular expressions (regex) are patterns that describe character combinations in text. Regex provides a concise and flexible means to match strings of text, such as particular characters, words, or patterns of characters. SIP messages are treated as sets of substrings on which regex pattern rules are executed. With regex you can create strings to match other string values and use groupings to create stored values on which to operate.
Note:
An understanding of regex is required for successful HMRs. Refer to Mastering Regular Expressions from O'Reilly Media for more information.
Oracle's SBC supports the standardized regular expression format called Portable Operating System Interface (POSIX) Extended Regular Expressions. The SBC regex engine is a traditional regex-directed (NFA) type.
Example of HMR with Regex
The following HMR removes a P-Associated-URI from a response to a REGISTER request. The regex expression ^<tel: lets you specify the removal only if it is a tel-URI.
sip-manipulation
    name                    rem_telPAU
    description
    header-rule
        name                modPAU
        header-name         P-Associated-URI
        action              delete
        comparison-type     pattern-rule
        match-value         ^<tel:
        msg-type            reply
        new-value
        methods             REGISTER
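To see what the match-value in this rule selects, the following minimal Python sketch applies the same pattern to two sample header values. It is an illustration only: Python's re module is not the SBC's regex engine, and the function name and the sample values are hypothetical.

```python
import re

# The match-value from the header rule; ^ anchors it to the start of the value.
TEL_PAU = re.compile(r"^<tel:")

def keep_non_tel(values):
    """Return the header values that a delete-on-match rule would leave in place."""
    return [v for v in values if not TEL_PAU.search(v)]

# Hypothetical P-Associated-URI values from a response to a REGISTER request.
pau_values = [
    "<tel:+15551230000>",
    "<sip:+15551230000@example.com>",
]

remaining = keep_non_tel(pau_values)
print(remaining)
```

Only the value that begins with <tel: is selected for removal; the sip-URI is left alone.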
Regex Characters
Regular expressions are used to search for patterns of text using one or more of the following devices:
Character Type | Example | Description |
---|---|---|
Literal text | foobar | With the exception of a small number of characters that have a special meaning in a regex, text matches itself. |
Special wildcard characters | \d | Known as metacharacters or metasequences, these match or exclude specific types of text, such as any number. |
Character classes | [1-5] | When a suitable metacharacter or metasequence doesn't exist, you can create your own definition to match or exclude specified characters. |
Quantifiers | + or ? | These specify how many times you want the preceding expression to match or whether it's optional. |
Capturing groups and backreferences | (foobar) or \1 | These specify parts of the regex that you want remembered, either to find a similar match later on, or to preserve the value in a find and replace operation. |
Boundaries and anchors | ^ or $ | These specify where the match should be made, for example at the beginning of a line or word. |
Alternation | \| (vertical bar) | This specifies alternatives. |
By default, regular expressions are case-sensitive, so A and a are treated as different characters. As long as what you're looking for fits a regular pattern, a regex can be created to find it.
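Case sensitivity can be illustrated with a short Python sketch. Python's re module is used here only as a stand-in for the SBC engine, and the sample header line is hypothetical. Bracketed character classes are one portable way to accept both cases.

```python
import re

header = "Via: SIP/2.0/UDP 192.0.2.1"

lower = re.search(r"sip", header)             # no match: the text contains "SIP"
upper = re.search(r"SIP", header)             # matches the upper-case literal
either = re.search(r"[Ss][Ii][Pp]", header)   # each class accepts either case

print(lower is None, upper.group(0), either.group(0))
```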
Literal (Ordinary)
Many of the characters you can type on your keyboard are literal, ordinary characters; they present their actual value in the pattern. For example, the regex pattern sip is a pattern of all literal characters that will be matched from left to right, at each position in the input string, until a match is found. Given an input string of <sip:me@here.com>, the regex pattern sip will successfully match the sip, starting at the position of the s and ending at the position of the p. But the same regex will also match sip in <sips:me@here.com> and tel:12345;isip=192.168.0.3 because an s followed by an i followed by a p exists in both of those as well.
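The three input strings above can be checked with a small Python sketch. Python's re module stands in for the SBC engine here, and the anchored pattern at the end is an added illustration of how to narrow the match, not something taken from the text above.

```python
import re

pattern = re.compile(r"sip")
inputs = [
    "<sip:me@here.com>",
    "<sips:me@here.com>",
    "tel:12345;isip=192.168.0.3",
]

# span() gives the (start, end) offsets of the first match, scanning left to right.
spans = [pattern.search(text).span() for text in inputs]

# Anchoring and adding the surrounding literals accepts only the sip-URI.
strict = re.compile(r"^<sip:")
strict_hits = [strict.search(text) is not None for text in inputs]

print(spans)
print(strict_hits)
```

The unanchored literal matches in all three strings, while the more explicit pattern matches only the first.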
Special (Metacharacters)
Some characters have special meaning. They instruct the regex function (or engine which interprets the expressions) to treat the characters in designated ways. The following table outlines these special characters or metacharacters.
Character | Name | Description |
---|---|---|
. | dot | Matches any one character, including a space; it will match one character, but there must be one character to match. Matches a literal dot when bracketed or placed next to a backslash: [.] or \. |
* | star/asterisk | Matches zero or more of the preceding character (0, 1, or any number), bracketed character class, or group in parentheses. Used for quantification. Typically used with a dot in the format .* Matches a literal asterisk when bracketed: [*] |
+ | plus | Matches one or more of the preceding character, bracketed character class, or group in parentheses. Used for quantification. Matches a literal plus sign when bracketed: [+] |
\| | bar/vertical bar/pipe | Matches anything to the left or to the right; the bar separates the alternatives. Both sides are not always tried; if the left does not match, only then is the right attempted. Used for alternation. |
{ | left brace | Begins an interval range, ended with } (right brace) to match; identifies how many times the previous single character or group in parentheses must repeat. Interval ranges are entered as a minimum and a maximum, in the form {min,max}. |
? | question mark | Signifies that the preceding character or group in parentheses is optional; the character or group can appear not at all or one time. |
^ | caret | Acts as an anchor to represent the beginning of a string. |
$ | dollar sign | Acts as an anchor to represent the end of a string. |
[ | left bracket | Acts as the start of a bracketed character class, ended with the ] (right bracket). A character class is a list of character options; one and only one of the characters in the bracketed class must appear for a match. A - (hyphen) in between two characters enclosed by brackets designates a range; for example, [a-z] is the character range of the twenty-six lower case letters of the alphabet. |
( | left parenthesis | Creates a grouping when used with the ) (right parenthesis). Groupings have two functions: they separate pattern strings so that special characters can be applied to a whole string as if it were a single character, and they allow the designated pattern to be stored and referenced later (so that other operations can be performed on it). |
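Several of these metacharacters can be seen working together in the short Python sketch below. It is a sketch under assumptions: Python's re module stands in for the SBC engine, the sample URI is hypothetical, and only constructs that also exist in POSIX Extended Regular Expressions are used.

```python
import re

uri = "sips:alice@example.com:5061"

# Alternation inside a group, with ? making the trailing "s" optional.
scheme = re.match(r"(sips?|tel):", uri)

# A bracketed class with an interval: a port of one to five digits at the end.
port = re.search(r":([0-9]{1,5})$", uri)

# An escaped dot matches only a literal dot; a bare dot matches any one character.
literal_dot = re.search(r"example\.com", "exampleXcom")
any_char = re.search(r"example.com", "exampleXcom")

print(scheme.group(1), port.group(1), literal_dot, any_char.group(0))
```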
Regex Tips
- Limit use of the wildcards asterisk * and plus sign +.
- A character class enclosed by brackets [] is not a choice of one or more characters but rather a choice of one and only one character in the set.
- The range 0-1000 is not the same as the range 0000-1000.
- Spaces are legal characters and will be interpreted like any other character.
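Two of these tips, the one-character nature of a bracketed class and the significance of spaces, are sketched below in Python. Python's re module is only a stand-in for the SBC engine, and the sample strings are hypothetical.

```python
import re

# [abc] consumes exactly one character from the set, never a run of them.
one = re.fullmatch(r"[abc]", "a")
run = re.fullmatch(r"[abc]", "ab")       # None: two characters, one class
many = re.fullmatch(r"[abc]+", "abca")   # a quantifier is needed for a run

# A space in the pattern must be matched by a space in the input.
spaced = re.search(r"SIP/2\.0 200", "SIP/2.0 200 OK")
unspaced = re.search(r"SIP/2\.0200", "SIP/2.0 200 OK")

print(one is not None, run, many.group(0), spaced is not None, unspaced)
```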
Matching New Lines
In the regular expression library, the dot . character does not match new lines or carriage returns. Conversely, the not-dot does match new lines and carriage returns. This provides a safety mechanism preventing egregious backtracking of the entire SIP message body when there are no matches. The SBC reduces backtracking to a single line within the body.
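A similar effect can be observed in Python, with one caveat: by default Python's dot refuses only "\n" and does match "\r", so the hypothetical sample body below uses bare "\n" line endings. The negated character class is assumed to correspond to what the text calls the not-dot.

```python
import re

body = "v=0\ns=call\n"

# The dot stops at the new line, so .* stays on the first line.
dot_run = re.match(r".*", body).group(0)

# A negated class has no such limit and runs across line boundaries.
negated_run = re.match(r"[^;]*", body).group(0)

print(repr(dot_run), repr(negated_run))
```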
Escaped Characters
SIP HMR's support for escaped characters allows for searches for values you would be unable to enter yourself. Because they are necessary to MIME manipulation, support for escaped characters includes:
Syntax | Description |
---|---|
\s | Whitespace |
\S | Non-whitespace |
\d | Digits |
\D | Non-digits |
\R | Any \r, \n, or \r\n |
\w | Word |
\W | Non-word |
\A | Beginning of buffer |
\Z | End of buffer |
\f | Form feed |
\n | New line |
\r | Carriage return |
\t | Tab |
\v | Vertical tab |
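A few of these escapes are combined in the Python sketch below, which pulls a numeric value out of a hypothetical header line. Python's re module is only a stand-in for the SBC engine and differs in places: it has no \R, so the line ending is spelled out as \r\n, and the other anchors and escapes may not behave identically on the SBC.

```python
import re

line = "Expires:\t 3600\r\n"

# \w+ for the header name, \s* for the tab and space after the colon,
# \d+ for the digits, and an explicit \r\n for the line ending.
m = re.match(r"(\w+):\s*(\d+)\r\n", line)
name, seconds = m.group(1), int(m.group(2))

print(name, seconds)
```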
Building Expressions with Parentheses
You can use parentheses () when you use HMR to support order of operations and to simplify header manipulation rules that might otherwise prove complex. This means that expressions such as (sip + urp) - (u + rp) can now be evaluated to sip. Previously, the same expression would have evaluated to sipurprp. In addition, you previously would have been required to create several different manipulation rules to perform the same expression.
Boolean Operators
The following Boolean operators are supported:
- &, meaning AND.
- |, meaning OR.
- !, meaning NOT.
You can only use Boolean operators when the comparison type is pattern-rule and you are evaluating stored matches. The SBC evaluates these Boolean expressions from left to right, and does not support any grouping mechanisms that might change the order of evaluation. For example, the SBC evaluates the expression A & B | C (where A=true, B=false, and C=true) as follows: A & B = false; false | true = true.
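Strict left-to-right evaluation can be modeled with a short Python sketch. The helper function is hypothetical and only mimics the ordering described above; it is not the SBC's parser. The second case shows why the ordering matters, since Python's own operators give & a higher precedence than |.

```python
def eval_left_to_right(first, *steps):
    """Apply (operator, operand) pairs strictly left to right, ignoring precedence."""
    result = first
    for op, operand in steps:
        if op == "&":
            result = result and operand
        elif op == "|":
            result = result or operand
        else:
            raise ValueError(f"unsupported operator: {op}")
    return result

A, B, C = True, False, True

# A & B | C  ->  (A & B) | C  ->  false | true  ->  true
doc_example = eval_left_to_right(A, ("&", B), ("|", C))

# true | true & false is (true | true) & false when read left to right,
# whereas Python's precedence rules read it as true | (true & false).
strict = eval_left_to_right(True, ("|", True), ("&", False))
python_rule = True | True & False

print(doc_example, strict, python_rule)
```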
Equality Operators
You can use equality operators in conjunction with string operators. You can also use equality operators with:
- Boolean operators, as in this example: ($rule1.$0 == $rule2.$1) & $rule3.
- The !, &, and | operators.
- Variables and constant strings.
You can group them in parentheses for precedence.
Equality operators always evaluate to either true or false.
Equality Operator Symbol | Short Description | Detailed Information |
---|---|---|
== | String case sensitive equality operator | Performs a character-by-character, case-sensitive string comparison on both the left side and the right side of the operator. |
~= | String case insensitive equality operator | Performs a character-by-character, case-insensitive string comparison on both the left side and the right side of the operator. |
!= | String case sensitive inequality operator | Performs a character-by-character, case-sensitive string comparison on both the left side and the right side of the operator, returning true if the left side is not equal to the right side. |
<= | Less than or equal to operator | Performs a string-to-integer conversion. If the string-to-integer conversion fails, the value is treated as 0. After the conversion, the operator will compare the two values and return true only if the left side is less than or equal to the right side of the operator. |
>= | Greater than or equal to operator | Performs a string-to-integer conversion. If the string-to-integer conversion fails, the value is treated as 0. After the conversion, the operator will compare the two values and return true only if the left side is greater than or equal to the right side of the operator. |
< | Less than operator | Performs a string-to-integer conversion. If the string-to-integer conversion fails, the value is treated as 0. After the conversion, the operator will compare the two values and return true only if the left side is less than the right side of the operator. |
> | Greater than operator | Performs a string-to-integer conversion. If the string-to-integer conversion fails, the value is treated as 0. After the conversion, the operator will compare the two values and return true only if the left side is greater than the right side of the operator. |
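The behavior in this table can be modeled with a minimal Python sketch. The function names are hypothetical, and Python's int() is only an approximation of the SBC's string-to-integer conversion, which may treat inputs such as leading whitespace or trailing text differently.

```python
def to_int(text):
    """Convert a string to an integer, falling back to 0 when conversion fails."""
    try:
        return int(text)
    except ValueError:
        return 0

def compare(left, op, right):
    """Evaluate one equality operator on two strings, returning True or False."""
    if op == "==":
        return left == right
    if op == "~=":
        return left.lower() == right.lower()
    if op == "!=":
        return left != right
    numeric = {
        "<=": lambda a, b: a <= b,
        ">=": lambda a, b: a >= b,
        "<": lambda a, b: a < b,
        ">": lambda a, b: a > b,
    }
    return numeric[op](to_int(left), to_int(right))

print(compare("INVITE", "==", "invite"))  # case-sensitive comparison
print(compare("INVITE", "~=", "invite"))  # case-insensitive comparison
print(compare("10", ">", "9"))            # compared as integers, not as strings
print(compare("abc", "<", "1"))           # "abc" fails conversion and counts as 0
```

Comparing as integers is the design point worth noticing: a plain string comparison would rank "10" before "9".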
Normalizing EBNF ExpressionString Grammar
The expression parser grammar implies that any expression string can have boolean and string manipulation operators in the same expression. While technically this is possible, the expression parser prevents it.
Because all boolean expressions evaluate to the string value TRUE or FALSE, and because all manipulations are string manipulations, the result of a boolean expression is returned as the string TRUE or FALSE. The ExpressionString class interprets this as an actual TRUE or FALSE value. For this reason, boolean operators are not mixed with string manipulation operators (which is true of most programming languages).
The expression string grammar also indicates that it is possible to nest self-references and rule names indefinitely. For HMR, this is not allowed. A self-reference can only exist by itself, and a terminal index can only come at the end of a rule reference.
Storing Regex Patterns
Any HMR with a pattern-rule comparison type can store a regex pattern's matches for later use. In many cases you don't have to create store rules before manipulation rules. Data is only stored for items that later rules actually reference.
For example, if a later rule never references a header rule's stored value, but only its element rules, then the header rule itself doesn't store anything. Alternatively, you could delete a header or field, but still use its stored value later without having to create a separate store rule for it. In general, fewer rules improve SBC performance.
Performance Considerations
The regex engine consumes as much of the input string as it can before it backtracks or gives up trying, which is called greediness. Greediness can introduce errors in regex patterns and has an effect on performance. There is usually a trade-off of efficiency versus exactness - you should choose how exacting you need to be. Keep the following in mind in order to lessen the effect:
- Poorly constructed regex patterns can affect the performance of regex matching for long strings.
- Search on the smallest input string possible; perform the regex search in element rules for the specific header component type you want to match.
- Test the regex pattern against long strings that do not match to evaluate the effect on performance.
- Test a regex with a wildcard in between characters against an input string with those characters repeated in different spots to evaluate performance.
- If the input string format is fairly fixed and well-known, be explicit in the regex rather than using wildcards.
- If the regex pattern is trying to capture everything before a specific character, use the negation of the character for the wildcard character. Note that this is true most times, except when there is an anchor at the end.
- Use beginning-line and ending-line anchors whenever possible if you want to match only when the pattern begins or ends as such.
- A dot . means any character, including whitespace. A wild-carded dot, such as .* or .+, will capture/match everything until the end of line, and then it will backtrack if there are more characters after the wildcard that need to be matched. If you don't need to capture the things before the characters after the wildcard, don't use the wildcard.
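The difference between a greedy wildcard and the negation of the delimiting character can be seen in the Python sketch below. Python's re module stands in for the SBC engine, and the header value is hypothetical.

```python
import re

header = "<sip:alice@example.com;transport=tcp;lr>"

# Greedy dot-star runs to the end of the line, then backtracks to the LAST ";".
greedy = re.search(r"<(.*);", header).group(1)

# The negated class stops at the FIRST ";" and does not need to backtrack.
negated = re.search(r"<([^;]*);", header).group(1)

print(greedy)
print(negated)
```

The negated class is both cheaper and more exact when the goal is to capture everything before the first semicolon.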
Additional References
To learn more about regex, you can visit the following website, which has information and tutorials that can help you get started: http://www.regular-expressions.info/.