1.3.9.3 Classify

The Classify sub-processor of Parse adds semantic meaning to the data by classifying tokens using rules.

Classify applies a number of Token Checks to the data, where each Token Check attempts to classify either Base Tokens or sequences of Base Tokens with a given meaning (for example, as a 'Postcode').

Within each Token Check, a number of rules may be used. Each rule uses a method of checking the data (for example, matching it against a list) and classifies the data that passes the check with a tag, corresponding to the name of the Token Check, and a confidence level (Valid or Possible).

Note that a given token may have multiple semantic meanings if it matches more than one Token Check. For example, the token 'Scott' may be classified both as a 'valid Forename' and as a 'valid Surname'. The Select sub-processor will later attempt to assign each token its best meaning from all the available possibilities, based on the context of the token within the data.
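
The effect can be pictured with a minimal Python sketch. Everything here (the word lists, classify_token) is invented for illustration; it is not the product's API:

    # Hypothetical sketch: independent Token Checks may each tag the same
    # token, so a token keeps several candidate meanings until Select
    # resolves them using context.
    FORENAMES = {"scott", "james", "claudia"}
    SURNAMES = {"scott", "smith", "davies"}

    def classify_token(token):
        """Return every (tag, confidence) the token qualifies for."""
        meanings = []
        if token.lower() in FORENAMES:
            meanings.append(("Forename", "Valid"))
        if token.lower() in SURNAMES:
            meanings.append(("Surname", "Valid"))
        return meanings

    print(classify_token("Scott"))
    # [('Forename', 'Valid'), ('Surname', 'Valid')] - both meanings are
    # kept; the Select step later picks the best one from context.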

Classification is a vital part of parsing. Use classification rules to give meaning to the tokens (such as numbers, words and phrases) that appear in your data. The patterns of token classifications may be used in later steps to validate the data, and to resolve it into a new output structure if desired.

Classification rules will often use lists of words and phrases that are created from the data itself using the results of the Phrase Profiler and Frequency Profiler.

The Configuration window for Classify has two tabs: Token Checks and Attributes.

Use the Token Checks tab to configure your classification rules, in the form of a number of checks for different tokens.

Use the Attributes tab to associate these Token Checks with the input attributes.

Token Checks

A Token Check comprises one or more rules for identifying data that has a given semantic meaning.

Commonly, a Token Check will consist of a single rule that identifies data using a list of values. For example, for a 'Title' Token Check, a single List Check rule might be used with a reference list of valid titles (such as 'Mr', 'Mrs' and 'Ms').

However, more complex types of Token Check may be configured. This is often necessary where it is not possible to maintain lists of valid token values (for example, because there are too many possible values).

For example, the following Token Checks might be used when parsing people's names:

Table 1-124 Token Check: Forename

Order  Rule Type         Condition                         Decision
1      List Check        Matches list of common Forenames  Valid
2      Base Token Check  Matches base token tag: A         Possible

Table 1-125 Token Check: Surname

Order  Rule Type                    Condition                        Decision
1      List Check                   Matches list of common Surnames  Valid
2      List Check                   Matches list of bad data tokens  Invalid
3      Attribute Word Length Check  Is more than 2 words in length   Invalid
4      Base Token Check             Matches base token patterns:     Possible
                                    A (for example, 'Davies'),
                                    A-A (for example, 'Smith-Davies'),
                                    A_A (for example, 'Taylor Smith')

Note: By default, all token checks are displayed. To filter them by the attribute to which they are applied, select the required attribute in the drop-down selection field above the list of tokens.

Importantly, the rules within each Token Check are processed in order: if a higher rule in the Check matches a token, the lower rules are not applied to it. So, for example, if the token 'Smith' has been classified as a Valid Surname by the top rule in the Surname Token Check above, it will not be classified as a Possible Surname by rule 4. Equally, if the token 'Unknown' has been blocked from classification as a Surname by rule 2, it will not be classified as a Possible Surname by rule 4.

In this way, it is possible to use Token Checks either positively or negatively. You can positively identify Valid or Possible tokens by matching lists, or you can negatively identify tokens that are Invalid, and only classify the remaining tokens as Valid or Possible.
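
The first-match behaviour can be pictured as a short loop over the ordered rules. The following Python sketch is illustrative only; the rule conditions and names are invented, and real Token Checks are configured rather than coded:

    # Hypothetical sketch of ordered rule processing within one Token Check.
    # Rules are tried top to bottom; the first rule whose condition matches
    # decides the outcome, and later rules are never consulted.
    COMMON_SURNAMES = {"smith", "davies"}
    BAD_DATA = {"unknown", "n/a"}

    surname_rules = [
        (lambda t: t.lower() in COMMON_SURNAMES, "Valid"),
        (lambda t: t.lower() in BAD_DATA, "Invalid"),
        (lambda t: len(t.split()) > 2, "Invalid"),
        (lambda t: t.isalpha(), "Possible"),  # stands in for the base token check
    ]

    def run_token_check(token, rules):
        for condition, decision in rules:
            if condition(token):
                return decision
        return None  # no rule hit: this check leaves the token unclassified

    print(run_token_check("Smith", surname_rules))    # Valid   (rule 1 wins)
    print(run_token_check("Unknown", surname_rules))  # Invalid (rule 2 blocks rule 4)
    print(run_token_check("Bloggs", surname_rules))   # Possible (falls through to rule 4)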

The following types of classification rule are available within each Token Check:

Table 1-126 Classification Rules within Token Check

List Check
    Checks across the attribute for any data that matches a List or Map.
    Where a Map is used, it is possible to perform replacements (standardizations) of matched tokens within the parser. If the Use replacements in output option is checked, the mapped value (where present) will be used in preference to the matched value in output.
    It is possible to specify a Reference Data set of Noise Characters; that is, characters to remove before attempting to match the list.

Regular Expression Check
    Checks across the attribute for any data that matches a Regular Expression.

Attribute Completeness Check
    Checks that the attribute contains some meaningful data (other than whitespace).

Pattern Check
    Checks across the attribute for any Base Tokens that match a Character Pattern, or list of Character Patterns. This check is case sensitive.

Attribute Character Length Check
    Checks the length of the data in the attribute according to a number of characters.

Attribute Word Length Check
    Checks the length of the data in the attribute according to a number of words.

Base Token Check
    Checks across the attribute for tokens that match a given Base Token Tag (such as 'A'), or patterns of tokens that match a given pattern of Base Token Tags (such as 'A-A').
    See the note on Special Characters below.
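
As an illustration of the List Check options above, the following sketch assumes a Map from raw values to standardized forms and a single noise character. The names (NOISE, TITLE_MAP, list_check) are invented; they are not the product's API:

    # Hypothetical sketch of a List Check backed by a Map: noise characters
    # are stripped before matching, and with 'Use replacements in output'
    # the mapped value replaces the matched value in the output.
    NOISE = "."                                          # removed before matching
    TITLE_MAP = {"mr": "Mr", "mrs": "Mrs", "dr": "Dr"}   # value -> standardized form

    def list_check(token):
        cleaned = token.translate(str.maketrans("", "", NOISE)).lower()
        if cleaned in TITLE_MAP:
            return TITLE_MAP[cleaned], ("Title", "Valid")
        return token, None

    print(list_check("MR."))    # ('Mr', ('Title', 'Valid')) - standardized output
    print(list_check("Smith"))  # ('Smith', None) - not matched by this check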

Special Characters

Note that if you want to check for Base Token patterns that include full stops, such as 'A.A.A' for values such as 'www.example.com', the full stops must be escaped with a backslash (\) in the Reference Data, as the full stop is a special character in parsing. So, for example, to check for a Base Token pattern of 'A.A.A', you must enter 'A\.A\.A'.

Note: In order to tag a full stop as its own character (.), rather than with the default base token tag of P, you would need to edit the default Base Tokenization Map used in parsing.
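
The following sketch shows why the escaping matters, assuming (as the note above implies) that Base Token patterns are matched as regular expressions. The toy tokenizer is invented for illustration:

    import re

    def base_token_pattern(value):
        """Toy tokenizer: runs of letters -> 'A', runs of digits -> 'N'."""
        return re.sub(r"\d+", "N", re.sub(r"[A-Za-z]+", "A", value))

    pattern = base_token_pattern("www.example.com")
    print(pattern)                                        # A.A.A

    # Escaped: the full stops are matched literally, as intended.
    print(re.fullmatch(r"A\.A\.A", pattern) is not None)  # True

    # Unescaped: '.' matches any character, so unrelated patterns match too.
    print(re.fullmatch(r"A.A.A", "AXAXA") is not None)    # True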

Applying Token Checks to Attributes

To apply Token Checks to Attributes, use the Attributes tab, and use the arrow buttons (or drag-and-drop) to select and de-select Token Checks for each Attribute. You will commonly want to apply the same Token Check to many Attributes, and many Token Checks to the same Attribute.

The results of Phrase Profiling are often useful for determining which Token Checks to apply to which Attributes, as it is easy to see which types of tokens appear where.

Note that if you have added any Token Checks that are not associated with any Attributes (and which therefore will have no effect), you are warned before exiting the Classify configuration dialog.

Example

In this example, TITLE and NAME attributes are parsed, using a number of Token Checks. The TITLE attribute is simply checked for Title tokens. The NAME attribute is checked for Forenames, Surnames, Initials, Name Qualifiers, and Name Suffixes.

Token Checks View

The Token Checks View shows the summary of each Token Check within each attribute, with the counts of the distinct classified token values at each classification level (Valid and Possible):

Table 1-127 Token Checks View

Attribute  Token Check  Valid  Possible
NAME       <Forename>     772        72
NAME       <Initial>       19         0
NAME       <Surname>     1623        70
NAME       <Qualifier>      7         0
NAME       <Suffix>         0         0
TITLE      <Title>         10         0

It is then possible to drill down to see the distinct tokens, and the number of records that contain each token. For example, you could drill down on the tokens that were classified as valid forenames.

Drilling down again takes you to the records that contain the relevant token. Note that a single record may contain the same token more than once, but it is only counted once.

Classification View

The Classification View displays all of the token patterns (descriptions of the data) generated after the classification step. Note that for a given input record, multiple token patterns may be generated, because the same token may be classified by different checks. This means that the same record may appear under several of the token patterns.

Note that some of the records that have the most common token pattern (for example, <valid Title><valid Forename>_<valid Surname>) will also have the second most common token pattern (for example, <valid Title><valid Surname>_<valid Surname>). As the former pattern is more common, however, it can be selected as the most likely description of these records using pattern frequency selection in the Select sub-processor. Alternatively, we can use context-sensitive Reclassification rules (see Reclassify) to add the intelligence that a token in between a Title and Surname is unlikely to be another surname, even if it passes the context-free token check.
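
Pattern frequency selection can be pictured as follows. This is an illustrative sketch with made-up counts, not the Select sub-processor's actual implementation:

    from collections import Counter

    # Hypothetical counts of each token pattern across the whole data set.
    pattern_counts = Counter({
        "<valid Title><valid Forename>_<valid Surname>": 940,
        "<valid Title><valid Surname>_<valid Surname>": 180,
    })

    # Both patterns describe the same record; prefer the more frequent one.
    candidates = list(pattern_counts)
    best = max(candidates, key=lambda p: pattern_counts[p])
    print(best)  # <valid Title><valid Forename>_<valid Surname>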

Unclassified Tokens View

The Unclassified Tokens View shows the number of (base) tokens in each attribute that have not been classified by any of the token checks. This is useful for finding values that may need to be added to the lists used for classification.

From the example above, we might have the following view of unclassified tokens:

Table 1-128 Unclassified Tokens

Attribute  Unclassified Tokens
NAME       55
TITLE      1

Drilling down shows each distinct token, and its frequency of occurrence. For example, we can drill down on the 55 unclassified tokens in the NAME field above. This may reveal some unusual characters, some dummy values, and some mis-spellings.

Table 1-129 Drilling down on Unclassified Tokens

Token    Frequency  No. records
#               13           13
-               12           12
TEST             4            4
Test             3            3
Cluadia          1            1
DO               1            1
WHUR             1            1

We can now use this view to add to the classification lists, or create new lists (for example, to create a list to recognize dummy values).
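
The derivation of such a view can be sketched as counting the distinct tokens that no classification list recognizes. The names (KNOWN, records) and the tiny data set below are invented for illustration:

    from collections import Counter

    KNOWN = {"claudia", "smith", "davies"}  # stands in for all classification lists

    records = [["Cluadia", "Smith"], ["TEST", "TEST"], ["#", "Davies"]]

    unclassified = Counter()
    for tokens in records:
        # a record may repeat a token, but each token is counted once per record
        for token in set(tokens):
            if token.lower() not in KNOWN:
                unclassified[token] += 1

    for token, freq in unclassified.most_common():
        print(token, freq)  # e.g. Cluadia 1, TEST 1, # 1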

The next (optional) step in configuring a Parse processor is to Reclassify data.