1.3.2.12 Pattern Check

The Pattern Check processor checks the pattern of data in an attribute against reference lists of valid and invalid patterns. It takes multiple single or array inputs.

Use the Pattern Check processor to ensure that the data in an attribute conforms to one of the valid patterns for that attribute. Data may need to conform to a set of valid patterns for either technical or business reasons. For example, when migrating data, the target system may require that all data for a given attribute consists of numeric characters only, and with minimum and maximum length restrictions. Alternatively, for business reasons, it may be that you want to tag as invalid records that have bad data, or data in the wrong attributes, for example, numeric values in Name fields, malformed product codes etc.

You can create the lists of valid and invalid patterns from the data itself using the Patterns Profiler.

Pattern Check allows the use of up to two reference lists - a list of Valid patterns for the attribute, and a list of Invalid patterns.

You may choose only to use one of these two lists. For example, if you discover from profiling that there are many different valid patterns for an attribute, you may want only to check the attribute for invalid patterns, and consider the non-matching values as either Valid, or Unknown.

If, however, the attribute has a small number of valid patterns, you may want simply to check the data against a list of valid patterns, and consider the non-matching values as Invalid, or Unknown.

Finally, you can use both lists, and recognize both valid and invalid patterns, with values that do not match either list categorized as Unknown.

The following table describes the configuration options:

Configuration Description

Inputs

Specify one or more single or array attributes that you want to check for valid or invalid patterns (or both).

Options

Specify the following valid value options:

  • Character Map Reference Data: maps each character to a pattern character. Specified as Reference Data (Pattern Generation Category).

    The default *Base Tokenization map is designed for use with Latin-1 encoded data, as are the alternative *Unicode Base Tokenization and *Unicode Character Pattern maps. If these maps are not suited to the character-encoding of the data, it is possible to create and use a new one to take account of, for example, multi-byte Unicode (hexadecimal) character references.

    Default value: *Character Pattern Map.

Specify the following valid pattern options:

  • Reference Data: list of valid patterns for the attribute. Specified as Reference Data (Pattern Category). Default value: None.

  • Categorize unmatched as: how to categorize values that do not match the list of valid patterns. Specified as a Selection (Unknown/Invalid). Default value: Unknown.

  • Valid patterns in: how to categorize a record if it has multiple inputs, or array inputs, based on how many are categorized as Valid. Specified as a selection (All Values/Any Value). Default value: All Values.

Specify the following invalid pattern options:

  • Reference Data: list of invalid patterns for the attribute. Specified as Reference Data (Pattern Category). Default value: None.

  • Categorize unmatched as: how to categorize values that do not match the list of invalid patterns. Specified as a Selection (Unknown/Invalid). Default value: Unknown.

  • Invalid patterns in: how to categorize a record if it has multiple inputs, or array inputs, based on how many are categorized as Invalid. Specified as a selection (All Values/Any Value). Default value: All Values.

Outputs

Describes any data attribute or flag attribute outputs.

Data Attributes

None.

Flags

The following flags are output:

  • [Attribute Name].Pattern: indicates the pattern of the selected attribute. Contains the pattern which the attribute value matched.

  • [Attribute Name].PatternValid: indicates which data passes the Pattern Check: Valid Patterns, Invalid Patterns and Unknown Patterns. Possible values are Y/N/-.

A single summary flag is also output for the whole record:

  • PatternValid: indicates whether the record passes the Pattern Check. Possible values are Y/N/-.

The following table describes the statistics produced by the profiler:

Statistic Description

Valid records

The records that were categorized as Valid by the Pattern Check.

Unknown records

The records that were categorized as Unknown by the Pattern Check.

Invalid records

The records that were categorized as Invalid by the Pattern Check.

Drilling down on any of the above statistics reveals a count of the distinct patterns that were found to be Valid, Unknown or Invalid. You can then drill down again to see the records themselves.

Output Filters

The following output filters are available:

  • Valid records

  • Unknown records

  • Invalid records

Example

In this example, Pattern Check is used to verify values in an Account Number attribute (CU_ACCOUNT) using lists of valid and invalid patterns collated from Patterns Profiling.

Values that did not match either list of pattens were categorized as Unknown.

A summary view:

Valid Records Unknown Records Invalid Records

1991

1

9

A drill down on invalid records:

Value Count %

aa-NNNNN-aa

4

0.2

NN-NNN-aa

2

<0.1

NN-NNNNN-Na

1

<0.1

NN-NNNNN-

1

<0.1

NN-N-aa

1

<0.1