Write Performant Regular Expressions

J Write Performant Regular Expressions

Here we discuss some of the important aspects that you must consider while crafting performant regular expressions.

Character Classes

The character classes specify the characters that you're trying or not trying to match. Ensure to replace the . in your .*s with a more specific character. The .* will invariably shift to the end of your input and will then backtrack, that is return to a previously saved state to continue the search for a match. When using a specific character class, you have control over how many characters the * will cause the regex engine to consume, giving you the power to stop the rampant backtracking.

Consider the following example regular expression:

(\d{4})(\d{2})(\d{2}),(\S+),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+)

For the following input:

20150220,201502,16798186260,tvN,Entertainment/Music,Female,67,2,Individual,ollehtv Economic,Commercial Housing,5587,0,2,0,1,1

As the result of the specified regular expression, the match can run into backtracking. This situation is detected by Oracle Log Analytics and the match operation is aborted.

By changing the regular expression to the example below, you can ensure that the match completes faster. Notice that [\S\s]* is changed to [^,] which avoids unnecessary backtracking.

(\d{4})(\d{2})(\d{2}),(\S+),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+)

Lazy Quantifiers

In many regexes, greedy quantifiers (.*s) can be safely replaced by lazy quantifiers (.*?s), thus giving the regex a performance boost without changing the result.

Consider the input:

Trace file /u01/app/oracle/diag/rdbms/navisdb/NAVISDB/trace/NAVISDB_arc0_3941.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
System name: Linux
Node name: NAVISDB
Release: 2.6.18-308.el5
Version: #1 SMP Fri Jan 27 17:17:51 EST 2012
Machine: x86_64
Instance name: NAVISDB
Redo thread mounted by this instance: 1
Oracle process number: 21
Unix process pid: 3941, image: oracle@NAVISDB (ARC0)

Consider the following greedy regex for the given input:

Trace\sfile\s(\S*).*ORACLE_HOME\s*[:=]\s*(\S*).*System\sname:\s*(\S*).*Node\sname:\s*(\S*).*Release:\s*(\S*).*Machine:\s*(\S*).*Instance\sname:\s*(\S*).*Redo\sthread\smounted\sby\sthis\sinstance:\s(\d*).*Oracle\sprocess\snumber:\s*(\d*).*Unix\sprocess\spid:\s(\d*).*image:\s+([^\n\r]*)

The regex engine shoots to the end of the input every time it encounters .*.. The first time that the .* appears, it consumes all the input and then backtracks until it gets to ORACLE_HOME. This is an inefficient way of matching. The alternative lazy regex is as shown below:

Trace\sfileRelease:\s*(\S*).*?Machine:\s*(\S*).*?Instance\sname:\s*(\S*).*?Redo\sthread\smounted\sby\sthis\sinstance:\s(\d*).*?Oracle\sprocess\snumber:\s*(\d*).*?Unix\sprocess\spid:\s(\d*).*?image:\s+([^\n\r]*)

The above lazy regex consumes starting from the beginning of the string until it reaches ORACLE_HOME, at which point it can proceed to match the rest of the string.

Note: If the ORACLE_HOME field appears toward the beginning of the input, the lazy quantifier should be used. If the ORACLE_HOME field appears toward the end, it might be appropriate to use the greedy quantifier.

Anchors

Anchors tell the regex engine that you intend the cursor to be in a particular place in the input. The most common anchors are ^ and $, indicating the beginning and end of the input.

Consider the following regexes to find an IPv4 address:

\d{1,3}\.d{1,3}.\d{1,3}.\d{1,3}

^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Notice that the second regex begins with ^ and is specific about the IP address appearing at the beginning of the input.

We're searching for the regex in input that looks like the following example:

107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] "GET
      /extension/bsupport/design/cl/images/btn_letschat.png HTTP/1.1" 200
    2144

A non-matching input would look similar to the following example:

[07/Dec/2012:23:57:13 +0000] 1354924633 GET "/favicon.ico" "" HTTP/1.1 200 82726 "-"
      "ELB-HealthChecker/1.0"

The second regex (which starts with ^) runs faster on the non-matching input because it discards the non-matching input immediately.

The Importance of Alternation

The order of alternation counts, so place the more common options in the front so they can be matched faster. If the rarer options are placed first, then the regex engine will waste time in checking those before checking the more common options which are likelier to succeed. Also, try to extract common patterns. For example, instead of (abcd|abef) use ab(cd|ef).

Observe the following regexes:

{TIMEDATE}\s{0,1}:\s*(?:|\[)(\w+)(?:\:|\])(?:\[|)(\d+)(?:\:|\])(.{1,1000}).*

{TIMEDATE}\s{0,1}:\s*(?:\[|)(\w+)(?:\:|\])(?:\[|)(\d+)(?:\:|\])(.{1,1000}).*

On the following input:

2014-06-16 12:13:46.743: [UiServer][1166092608] {0:7:2} Done for
ctx=0x2aaab45d8330

The second regex matches faster as the alternation looks for character [ first, followed by null. As the input has [, the match runs faster.