1.3.2.13 RegEx Check

The RegEx Check processor checks the data in an attribute against reference lists of valid and invalid regular expressions for the attribute. It takes string, multiple strings, or string array as an input.

The case-sensitivity and matching technique (Whole Value / Contains / Starts With / Ends With) of the check can be controlled.

The RegEx Check processor is a powerful tool, allowing you to validate data according to its exact content, using the position of data, partial and exact values, and wild cards.

The RegEx Check is useful in order to check any data that should be in a consistent structure, for example, UK National Insurance Numbers.

Regular Expressions

Regular expressions are a standard technique for expressing patterns and manipulating Strings that is very powerful once mastered.

Tutorials and reference material about regular expressions are available on the Internet, and in books, including: Mastering Regular Expressions by Jeffrey E. F. Friedl published by O'Reilly UK; ISBN: 0-596-00289-0.

There are also software packages available to help you master regular expressions, such as RegExBuddy, and online libraries of useful regular expressions, such as RegExLib.

The following are some examples of regular expressions that could check data:

Regular Expression Pattern Meaning

^\d{5}$

5 integer US zip code.

([A-Z]{1,2}[0-9]{1,2}|[A-Z]{3}|[A-Z]{1,2}[0-9][A-Z])( |-)[0-9][A-Z]{2}

Valid UK Postcode.

^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-DFM]{0,1}$

Valid UK National Insurance number.

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

Valid email address.

The following table describes the configuration options:

Configuration Description

Inputs

Specify a single, multiple, or array attributes that you want to check based on lists of valid and invalid regular expressions (or both).

Options

Specify the following valid pattern options:

  • Reference Data: list of valid regular expressions for the attribute. Specified as Reference Data (Regular Expressions Category). Default value: None.

  • Regular Expression: allows a single regular expression for valid patterns to be specified without using Reference Data.

    Note that if this option is used as well as Reference Data, all regular expressions (in both options) are used in the check.

    Specified as a free text entry. Default value: None.

  • Categorize unmatched as: how to categorize values that do not match the list of valid regular expressions. Specified as a Selection (Unknown/Invalid). Default value: Unknown.

  • Valid Values in: how to categorize a record if it has multiple inputs, or array inputs, based on how many are categorized as Valid. Specified as a selection (All Values/Any Value). Default value: All Values.

Specify the following invalid pattern options:

  • Reference Data: list of invalid regular expressions for the attribute. Specified as Reference Data (Regular Expressions Category). Default value: None.

  • Regular Expression: allows a single regular expression for invalid patterns to be specified without using Reference Data.

    Note that if this option is used as well as Reference Data, all regular expressions (in both options) are used in the check.

    Specified as a free text entry. Default value: None.

  • Categorize unmatched as: how to categorize values that do not match the list of regular expressions. Specified as a Selection (Unknown/Invalid). Default value: Unknown.

  • Invalid Values in: how to categorize a record if it has multiple inputs, or array inputs, based on how many are categorized as Invalid. Specified as a selection (All Values/Any Value). Default value: All Values.

Specify the following match options:

  • Ignore case?: drives whether or not to ignore case when matching the list(s). Specified as Yes/No. Default value: Yes.

  • Match list by: drives how to match against the list(s). Specified as a Selection (Whole Value/Contains/Starts With/Ends With). Default value: Whole Value.

Outputs

Describes any data attribute or flag attribute outputs.

Data Attributes

None.

Flags

The following flags are output:

  • RegExValid: indicates which data passes the RegExCheck: Valid RegEx, Invalid RegEx and Unknown. Possible values are Y/N/-.

A single summary flag is also output:

  • RegExValidSummary: indicates whether the record passes the regular expression check. Possible values are Y/N/-.

The following table describes the statistics produced by the profiler:

Statistic Description

Valid records

The records that were categorized as Valid by the RegEx Check.

Unknown records

The records that were categorized as Unknown by the RegEx Check.

Invalid records

The records that were categorized as Invalid by the RegEx Check.

Drilling down on any of the above statistics reveals a count of the distinct patterns that were found to be Valid, Unknown or Invalid. You can then drill down again to see the records themselves.

Output Filters

The following output filters are available:

  • Valid records

  • Unknown records

  • Invalid records

Example

In this example, a RegEx Check is used to check the format of an Account Number attribute (CU_ACCOUNT), using a Whole Value match against the following regular expression:

^([0-9]{2})(-)([0-9]{4,5})(-)([a-zA-Z]{2})

This regular expression dictates that values must start with exactly 2 digits, following by a hyphen, followed by either 4 or 5 digits, followed by another hyphen, followed by two letters.

A summary view:

Valid Records Unknown Records Invalid Records

1997

0

14

A drill down on invalid records:

Value Count %

[Null]

4

0.2

OO-24077-SH

1

<0.1

OO-24282-LR

1

<0.1

OO-24276-LR

1

<0.1

0975t3487263

1

<0.1

OI-25057-JD

1

<0.1

97-19671-5H

1

<0.1

97-19601-

1

<0.1

02-999-ZZ

1

<0.1

00-0-XX

1

<0.1