1.3.9.7 Parse

The Parse processor is an extremely powerful tool for the understanding and structural improvement of data. It allows you to analyze and understand the semantic meaning of data in one or many attributes, by applying both manually configured business rules, and artificial intelligence. It then allows you to use that semantic meaning in rules to validate and optionally restructure the data. For example, Parse allows you to recognize names data that is wrongly captured in address attributes, and optionally map it to new attributes in a different structure.

The Parse processor can be configured to understand and transform any type of data. For an introduction to parsing in EDQ, see the Parsing Concept Guide. More

The Parse processor has a variety of uses. For example, you can use Parse to:

  • Apply an improved structure to the data for a specific business purpose, such as to transform data into a structure more suitable for accurate matching.

  • Apply structure to data in unstructured, or semi-structured format, such as to capture into several output attributes distinct items of data that are all contained in a single Notes attribute.

  • Check that data is semantically fit for purpose across a number of attributes, either on a batch or real-time basis.

  • Change the structure of data from a number of input attributes, such as to migrate data from a number of different source formats to a single target format.

Parse Overview

The Parser runs in several steps. Each step is fully explained in the Configuration section below. However, the processing of the parser can be summarized as follows:

Input data >

  1. Tokenization: Syntactic analysis of data. Split data into smallest units (base tokens)

  2. Classification: Semantic analysis of data. Assign meanings to tokens

  3. Reclassification: Examine token sequences for new classified tokens

  4. Pattern Selection: Select the best description of the data, where possible

  5. Resolution: Resolve date to its desired structure and give a result

> Output data and flags

To understand how the Parse processor works as a whole, it is useful to follow an example record through it. In this example, we are parsing an individual name from three attributes - Title, Forenames, and Surname.

Example Input Record

The following record is input:

Title Forenames Surname

Mr

Bill Archibald

SCOTT

Tokenization

Tokenization tokenizes the record as follows, recognizing the tokens 'Mr', 'Bill', 'Archibald' and 'SCOTT' and assigning them a token tag of <A>. It also recognizes the space between 'Bill' and 'Archibald' as a token and assigns it a token tag of <_>. Note that Tokenization always outputs a single pattern of base tokens. In this case, the pattern is as shown below (from the Tokenization view):

Title Forenames Surname

<A>

<A>_<A>

<A>

Classification

Classification then classifies the tokens in the record using classification rules with lists of names and titles. As some of the names appear on multiple lists, some of the tokens are classified in multiple ways - for example, the token 'Archibald' is classified both as a <possible forename> and as a <possible surname>, and the token 'SCOTT' is classified both as a <possible forename> and as a <valid surname>. As a result, Classification outputs multiple classification patterns, as shown below in the Classification view:

Title Forenames Surname

<valid title>

<valid forename>_<possible_surname>

<possible forename>

<valid title>

<valid forename>_<possible_forename>

<possible forename>

<valid title>

<valid forename>_<possible_surname>

<valid surname>

<valid title>

<valid forename>_<possible_forename>

<valid surname>

Reclassification

As above, we now have multiple descriptions of the data. However, we might decide to apply the following Reclassification rule to the Forenames attribute to denote that because the token 'Archibald' follows a valid forename, we can be confident that it represents a middle name:

Name Look for Reclassify as Result

Middle name after Forename

<valid forename>(<possible forename>)

middlename

Valid

This rule acts on the pattern '<valid forename>(<possible forename>)' in the Forenames attribute, which affects the second and fourth classification patterns above. As Reclassification adds new patterns but does not take away existing ones, we now have the original four patterns and two new ones, as shown in the following table:

Title Forenames Surname

<valid title>

<valid forename>_<possible_surname>

<possible forename>

<valid title>

<valid forename>_<possible_forename>

<possible forename>

<valid title>

<valid forename>_<possible_surname>

<valid surname>

<valid title>

<valid forename>_<valid_middlename>

<valid surname>

<valid title>

<valid forename>_<valid_middlename>

<possible forename>

<valid title>

<valid forename>_<possible_forename>

<valid surname>

Note:

The Reclassification view will only show the patterns that have been pre-selected for entry into the Selection process. Pre-selection is a non-configurable first step in the selection process which eliminates patterns containing too many unclassified tokens. The pre-selection process first surveys all the patterns generated so far, and determines the minimum number of unclassified tokens present in any one pattern. Next, any patterns with more than that number of unclassified tokens are eliminated. In the example above, none of the patterns contain any unclassified tokens, so the minimum number of unclassified tokens is zero. Since none of the patterns contain more than zero unclassified tokens, none of the patterns are eliminated in the pre-selection process.

Selection

Selection now attempts to pick the best overall pattern from the six possibilities. In this case, we can see that the fourth pattern above is the strongest as all of its token classifications have a result of 'valid'. By scoring each of the patterns using the default selection rules, therefore, the first pattern is selected and displayed in the Selection view:

Title Forenames Surname

<valid title>

<valid forename>_<valid_middlename>

<valid surname>

Resolution

If we accept that the selected pattern is a good description of the record, we can then resolve the pattern to output attributes, and assign a result. In this case, we can do this by right-clicking on the selected pattern above and selecting Resolve... to add an Exact resolution rule.

We use the default output assignments (according to the classifications made), and assign the pattern a Pass result, with a Comment of 'Known name format'.

After re-running Parse with this rule, we can see that the rule has resolved the input record:

Id Rule Result Comment Count

1

Exact Rule

Pass

Known name format

1

Finally, we can drill down to the record and see that data has correctly been assigned to output attributes according to the resolution rule:

Title Forenames Surname UnclassifiedData.Parse title.Parse forename.Parse surname.Parse

Mr

Bill Archibald

SCOTT

Mr

Bill

SCOTT

Configuration

Parse is an advanced processor with several sub-processors, where each sub-processor performs a different step of parsing, and requires separate configuration. The following sub-processors make up Parse, each performing a distinct function as described below.

Sub-processor Description

Input

Allows you to select the input attributes to parse, and configure Dashboard publication options. Note that only String attributes are valid inputs.

Map

Maps the input attributes to the attributes required by the parser.

Tokenize

Tokenize analyzes data syntactically, and splits it into its smallest units (base tokens) using rules. Each base token is given a tag, for example <A> is used for an unbroken sequence of alphabetic characters.

Classify

Classify analyzes data semantically, assigning meaning to base tokens, or sequences of base tokens. Each classification has a tag, such as 'Building', and a classification level (Valid or Possible) that is used when selecting the best description of ambiguous data.

Reclassify

Reclassify is an optional step that allows sequences of classified and unclassified (base) tokens to be reclassified as a single new token.

Select

Select attempts to select the 'best' description of the data using a tuneable algorithm, where a record has many possible descriptions (or token patterns).

Resolve

Resolve uses rules to associate the selected description of the data (token pattern) with a Result (Pass, Review or Fail), and an optional Comment. It also allows you to configure rules for outputting the data in a new structure, according to the selected token pattern.

Advanced Options

The Parser offers two modes of execution in order to provide optimized performance where some of the results views are not required.

The two modes are:

  • Parse and Profile

  • Parse

Parse and Profile (the default mode) should be used when first parsing data, as the parser will output the Token Checks and Unclassified Tokens results views, which are useful when still in the process of defining the parsing rules, by creating and adding to lists used for classification.

Parse mode should be used when the classification configuration of the parser is complete, and optimal performance is needed. Note that the Token Checks and Unclassified Tokens views are not produced when running in this mode.

Options

All options are configurable within each sub-processor.

Outputs

Data Attributes

The output data attributes are configurable in the Resolve sub-processor.

Flags

Flag attribute Purpose Possible Values

[Attribute name].SelectedPattern

Indicates the selected token pattern for the record

The selected token pattern

[Attribute name].BasePattern

Indicates the base token pattern for the record, output from tokenization (if using the parser purely to generate this pattern)

The base token pattern

ParseResult

Indicates the result of the parser on the record.

Unknown/Pass/Review/Fail

ParseComment

Adds the user-specified comment of the record's resolution rule.

The comment on the resolution rule that resolved the record

Publication to Dashboard

The Parse processor's results may be published to the Dashboard.

The following interpretation of results is used by default:

Result Dashboard Interpretation

Pass

Pass

Review

Warning

Fail

Alert

Execution

Execution Mode Supported

Batch

Yes

Real time Monitoring

Yes

Real time Response

Yes

Results Browsing

The Parse processor produces a number of views of results as follows. Any of the views may be seen by clicking on the Parse processor in the process. The views may also be seen by expanding the Parse processor to view its sub-processors, and selecting the sub-processor that produces the view.

Base Tokenization View (produced by Tokenize)

This view shows the results of the Tokenize sub-processor, showing all the distinct patterns of Base Tokens across all the input attributes. The patterns are organized by frequency.

Note:

Each record has one, and only one, base token pattern. Many records have the same base token pattern.

Statistic Meaning

For each input attribute

The pattern of base tokens within the input attribute

Note that rows in the view exist for each distinct base token pattern across all attributes

Count

The number of records with the distinct base token pattern across all the input attributes

%

The percentage of records with the distinct base token pattern across all the input attributes

Token Checks View (produced by Classify)

This views shows the results of the Classify sub-processor, showing the results of each token check within each input attribute.

Statistic Meaning

Attribute

The attribute to which the token check was applied

Classifier

The name of the token check used to classify tokens

Valid

The number of distinct tokens that were classified as Valid by the token check

Possible

The number of distinct tokens that were classified as Possible by the token check

Drill down on the Valid or Possible statistics to see a summary of the distinct classified tokens and the number of records containing them. Drill down again to see the records containing those tokens.

Unclassified Tokens View (produced by Classify)

Statistic Meaning

Attribute

The input attribute

Unclassified Tokens

The total number of unclassified tokens in that attribute

Drill down on the Unclassified Tokens to see a list of all the unclassified tokens and their frequency. Drill down again to see the records containing those tokens.

Classification View (produced by Classify)

This view shows a list of all the generated token patterns after classification (but before reclassification). There may be many possible patterns for each input record.

Statistic Meaning

For each input attribute

The pattern of tokens across the attribute.

Note that rows in the view exist for each distinct token pattern across all attributes.

Count

The number of records where the token pattern is a possible description of the data. Note that the same record may have many possible token patterns in this view, and each token pattern may describe many records.

%

The Count expressed as a percentage of all the possible token patterns across the data set.

Reclassification Rules View (produced by Reclassify)

This view shows a list of all reclassification rules, and how they have affected your data.

Statistic Meaning

Rule Id

The id of the reclassification rule. The ids are auto-assigned. The id is useful where you have dependencies between rules - see the Precedents statistic below.

Rule Name

The name of the reclassification rule.

Attribute

The attribute to which the reclassification rule was applied.

Look for

The token pattern used to match the rule

Reclassify as

The target token of the reclassification rule

Result

The classification level (valid or possible) of the reclassification rule

Records affected

The number of records affected by the rule

Patterns affected

The number of classification patterns affected by the rule

Precedents

The number of other reclassification rules that preceded this rule before it could operate. For example, if you reclassify the <A> as <B> in one rule, and <B> to <C> in another rule, the first rule is a precedent of the second. Note that even reclassification rules that did not affect any records may have precedents, as these are calculated logically.

Reclassification View (produced by Reclassify)

This view shows a list of all the generated token patterns after reclassification (but before selection). There may be many possible patterns for each input record. The view presents all of the possible patterns and their frequency across the whole data set, before the Select step attempts to select the best pattern of each input record.

Note:

The data in this view may itself be used to drive which pattern to select; that is, it is possible to configure the Select step to select the pattern for a record by assessing how common it is across the data set. See the configuration of the Select sub-processor.

Statistic Meaning

For each input attribute

The pattern of tokens across the attribute.

Note that rows in the view exist for each distinct token pattern across all attributes.

Count

The number of records where the token pattern is a possible description of the data. Note that the same record may have many possible token patterns in this view, and each token pattern may describe many records.

%

The Count expressed as a percentage of all the possible token patterns across the data set.

Selection View (produced by Select)

After the Select step, each input record will have a selected token pattern.

This view shows a view of the selected patterns across the data set, and their frequency of occurrence.

Note:

Where the selection of a single token pattern to describe a record is not possible, because of ambiguities in selection, the pattern with the ambiguity (or ambiguities) is shown along with the number of records that had the same ambiguity; that is, the same set of potential patterns that selection could not choose between

Statistic Meaning

For each input attribute

The pattern of tokens across the attribute

Note that rows in the view exist for each distinct token pattern across all attributes.

Exact rule

The numeric identifier of the exact resolution rule (if any) that resolved the token pattern

Fuzzy rule

The numeric identifier of the fuzzy resolution rule (if any) that resolved the token pattern

Count

The number of records where the token pattern was selected as the best description of the data

%

The percentage of records where the token pattern was selected

Resolution Rule View (produced by Resolve)

This view shows a summary of the resolutions made by each Resolution Rule. This is useful to check that your rules are working as desired.

Statistic Meaning

ID

The numeric identifier of the rule as set during configuration.

Rule

The type of rule (Exact rule or Fuzzy rule)

Result

The Result of the rule (Pass, Review or Fail)

Comment

The Comment of the rule

Count

The number of records that were resolved using this rule. Click on the Additional Information button in the Results Browser to show this as a percentage.

Result View (produced by Resolve)

Statistic Meaning

Pass

The total number of records with a Pass result

Review

The total number of records with a Review result

Fail

The total number of records with a Fail result

Unknown

The number of records where Parse could not assign a result

Output Filters

The following output filters are available from the Parse processor:

  • Pass - records that were assigned a Pass result

  • Review - records that were assigned a Review result

  • Fail - records that were assigned a Fail result

  • Unknown - records that did not match any resolution rules, and therefore had no distinct result

Example