1.3.9.9 Reclassify

The Reclassify sub-processor of Parse is an optional step that reduces the total number of distinct token patterns in your data. It recognizes sequences of classified and unclassified tokens, or tokens in a given context, and reclassifies them as a new token with a given confidence level (valid or possible).

Use Reclassify wherever you have sequences of tokens that you want to treat as a single unit in output, and wherever you want to treat many similar token patterns within an attribute as the same pattern.

For example, when parsing addresses, you may see the following different patterns of data in an Address1 attribute after initial Classification:

<N>_<A>_<valid Road Hint> (for example, "10 Harwood Street")

<N>_<A>_<A>_<valid Road Hint> (for example, "15 Long End Road")

<A>_<valid Road Hint> (for example, "Nuttall Lane")

You may want to reclassify all these different sequences of tokens as valid Thoroughfares.
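The effect on the pattern count can be illustrated outside the product. The following Python sketch is only an analogy: it writes the three patterns above as plain strings and uses an ordinary regular expression (an assumption made for illustration, not the Reclassify rule syntax) to collapse them into a single pattern.

```python
import re

# The three token patterns from the text, written as plain strings.
patterns = [
    "<N>_<A>_<valid Road Hint>",      # "10 Harwood Street"
    "<N>_<A>_<A>_<valid Road Hint>",  # "15 Long End Road"
    "<A>_<valid Road Hint>",          # "Nuttall Lane"
]

# Hypothetical stand-in for a reclassification rule: an optional number,
# one or more <A> tokens, then a valid Road Hint.
THOROUGHFARE = re.compile(r"^(?:<N>_)?(?:<A>_)+<valid Road Hint>$")


def reclassify(pattern):
    """Return the new token if the whole pattern matches, else the pattern unchanged."""
    return "<valid Thoroughfare>" if THOROUGHFARE.match(pattern) else pattern


reduced = {reclassify(p) for p in patterns}
print(len(patterns), "patterns reduced to", len(reduced))
```

Three distinct input patterns become one, which is the reduction that the Reclassify step is intended to achieve.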

The Reclassify step is also useful because the Classify step does not consider context, other than the attribute in which a given piece of data is found. You can use reclassification rules to correct tokens that were classified incorrectly for their context. For example, data such as "London Road" might be classified as <valid Town> <valid Road Hint>. You may choose to reclassify the whole sequence of tokens as a valid Thoroughfare, or to reclassify only the <valid Town> part as a ThoroughfareName by enclosing that part in round brackets in the rule.
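As a rough illustration of the "London Road" case, the Python sketch below reclassifies only the bracketed part of a match and uses the rest of the pattern purely as context. The regular expression is a hand-written equivalent of a rule such as (<valid Town>)<valid Road Hint>; both that rule text and the function name are assumptions made for this example.

```python
import re

# Assumed regex equivalent of the rule (<valid Town>)<valid Road Hint>.
# The capture group marks the part to reclassify; the Road Hint is context only.
RULE = re.compile(r"(<valid Town>)<valid Road Hint>")


def reclassify_town(pattern):
    """Replace the captured <valid Town> with a ThoroughfareName token."""
    match = RULE.fullmatch(pattern)
    if match is None:
        return pattern  # rule does not apply; leave the pattern as it is
    start, end = match.span(1)
    return pattern[:start] + "<valid ThoroughfareName>" + pattern[end:]


print(reclassify_town("<valid Town><valid Road Hint>"))
print(reclassify_town("<valid Town>"))
```

A town token followed by a road hint is reclassified, while a town token on its own is left unchanged.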

Configuration

Each reclassification rule uses a structured expression to match patterns of tokens after Classification, and to reclassify a part of each matched pattern as a new token.

Rules may be easily enabled and disabled using the tick box in the Reclassify Rules dialog. Rules must be associated with the required input attributes in the Attributes tab.

The following table gives a guide to the syntax of the expressions used in reclassification:

Characters: [ ]
Use: Groups a sequence of tokens in order to specify the number of times the sequence occurs. It is always followed by a range (enclosed in curly brackets) or by ?, * or +. If the group contains only a full stop, [.], this means any token or tokens.
Example: [<A>] matches the token <A>.

Characters: { }
Use: Specifies a range that expresses how many instances of the previous group (enclosed in square brackets) may occur in the pattern, in sequence. A range is given as a minimum and a maximum number, separated by a comma.
Example: [<A>]{1,3} matches 1-3 occurrences of the token <A> in sequence.
Example: [<A>]{2,2} matches exactly 2 occurrences of the token <A> in sequence.

Characters: ?
Use: Denotes that the group is optional. This has the same meaning as {0,1}; that is, it matches if the group does not appear, or appears only once.
Example: [<title>]? matches if the title token does not appear, or appears once.

Characters: +
Use: Used instead of numbers in curly brackets (as above) to indicate that the previous group must occur at least once, but may occur any number of times.
Example: [<A>]+ matches one or more occurrences of the token <A> in sequence.

Characters: *
Use: Used instead of numbers in curly brackets to indicate that the previous group may occur any number of times, or not at all.
Example: [.]* matches any number of occurrences of any token, including none.

Characters: [.]
Use: Denotes a wild card; that is, any token. Use this together with rules on how many tokens you expect. For example, [.]* means any number of occurrences of any token.
Example: [.]*(<N><valid RoadHint>)[.]* reclassifies the sequence <N><valid RoadHint> wherever it occurs, without reclassifying any of the other tokens in the pattern.

Characters: ( )
Use: Encloses the part of the pattern that you actually want to reclassify. This allows you to use pattern context in matching the reclassification rule, but not in the reclassification itself.
Example: (<N><valid RoadHint>)<valid Town> reclassifies the sequence <N><valid RoadHint> provided it occurs before a valid Town token.

Characters: " "
Use: Encloses exact data, where it is used in a rule instead of tokens.
Example: (none)
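Because the syntax above closely resembles regular expressions, one way to get a feel for it is to translate a rule into a Python regular expression and run it over a token pattern written as a string. The sketch below is an approximation under stated assumptions, not the product's matching engine: the function names are hypothetical, token patterns are written without delimiter or whitespace tokens, quoted exact data is not supported, and where several alignments are possible it simply follows Python's default greedy matching.

```python
import re

ANY_TOKEN = r"(?:<[^<>]+>)"  # regex for one token of any kind, e.g. <A> or <valid Town>
# Pieces of rule syntax this sketch understands: [.], <token>, {m,n}, [ ] ( ) ? + *
PIECE = re.compile(r"\[\.\]|<[^<>]+>|\{\d+,\d+\}|[\[\]()?+*]")


def rule_to_regex(rule):
    """Translate a reclassification expression into a compiled Python regex."""
    parts, pos = [], 0
    for m in PIECE.finditer(rule):
        if m.start() != pos:
            raise ValueError(f"unsupported syntax at position {pos}")
        pos = m.end()
        piece = m.group()
        if piece == "[.]":
            parts.append(ANY_TOKEN)         # wild card: any single token
        elif piece == "[":
            parts.append("(?:")             # [ ] groups without capturing
        elif piece == "]":
            parts.append(")")
        elif piece.startswith("<"):
            parts.append(re.escape(piece))  # a token label matches literally
        else:
            parts.append(piece)             # ( ) ? + * {m,n} carry over unchanged
    if pos != len(rule):
        raise ValueError(f"unsupported syntax at position {pos}")
    return re.compile("".join(parts))


def reclassify(pattern, rule, new_token):
    """Apply one rule to one whole token pattern and return the resulting pattern."""
    match = rule_to_regex(rule).fullmatch(pattern)
    if match is None:
        return pattern                      # the rule does not match this pattern
    if match.re.groups == 0:
        return new_token                    # no ( ) in the rule: reclassify the whole match
    start, end = match.span(1)              # only the ( ) part is reclassified
    return pattern[:start] + new_token + pattern[end:]


print(reclassify("<A><N><valid RoadHint><valid Town>",
                 "[.]*(<N><valid RoadHint>)[.]*", "<valid Thoroughfare>"))
# prints <A><valid Thoroughfare><valid Town>
```

Using a full match mirrors the behavior described for rules without wild cards, which must match the whole token pattern in the attribute.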

Additional Notes

Note the following on reclassification rules:

  • Unless you use wild cards, each rule will attempt to match the whole of the token pattern in the attribute.

  • Rules do not need to be ordered, and will all be applied against the data set. Rules that are dependent on each other will be ordered automatically. For example, if you reclassify <A>_<valid Road Hint> as a <valid Thoroughfare>, and add an additional rule that reclassifies <N>_<valid Thoroughfare>, the latter rule will be processed after the first rule. Rules that are cyclical are not allowed. For example, it is not possible to reclassify <A> as <B> in one rule, and then <B> as <A> in another rule.

  • You may choose either to specify, or not specify, Base Tokens that represent either Delimiters or Whitespace, as driven from the configuration of the Tokenize step, in a reclassification rule. If you do not include them, the rule will ignore them when matching patterns against the rule. If you include them in the rule, they will have to match exactly. For example, the rule <N><valid Road Hint> will match both <N>_<valid Road Hint> and <N>___<valid Road Hint>, but the rule <N>_<valid Road Hint> would only match the former pattern.

  • You may choose either to specify, or not specify, the validity level of the tokens you are matching. For example, the rule <N><Road Hint> will match both <N><valid Road Hint> and <N><possible Road Hint>.

  • It is possible to use multiple reclassification rules that will reclassify different sequences of tokens to the same target token.
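The automatic ordering of dependent rules, and the ban on cyclical rules, can be pictured as a topological sort. The Python sketch below (Python 3.9 or later, for graphlib) is an illustration only: the rule records, the NumberedThoroughfare target token, and the function name are all hypothetical, and nothing here describes how the product implements ordering internally.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical rule records: name -> (tokens the rule looks for, token it produces).
rules = {
    "numbered thoroughfare": ({"N", "Thoroughfare"}, "NumberedThoroughfare"),
    "thoroughfare": ({"A", "Road Hint"}, "Thoroughfare"),
}


def order_rules(rules):
    """Order rules so that a rule runs after every rule that produces a token it looks for."""
    producers = {}
    for name, (_consumed, produced) in rules.items():
        producers.setdefault(produced, set()).add(name)
    # Map each rule to the set of rules that must be processed before it.
    graph = {
        name: {p for token in consumed for p in producers.get(token, ())}
        for name, (consumed, _produced) in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(order_rules(rules))  # ['thoroughfare', 'numbered thoroughfare']

# Cyclical rules (<A> to <B>, then <B> back to <A>) cannot be ordered.
cyclical = {"a to b": ({"A"}, "B"), "b to a": ({"B"}, "A")}
try:
    order_rules(cyclical)
except CycleError:
    print("cyclical rules are rejected")
```

The order in which the rules are declared does not matter; only the dependency between the tokens they produce and look for does.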

Example

In this example, two reclassification rules are used to reduce the total number of patterns to resolve when parsing addresses by recognizing sequences of tokens that represent thoroughfares, such as "Acacia Avenue" and "London Road".

Rule Name                        Look For                         Reclassify as   Result
Thoroughfare                     [.]*([<A>]+<RoadHint>)[.]*       thoroughfare    valid
Reclassify towns in road names   [.]*([<Town>]+<RoadHint>)[.]*    thoroughfare    valid

These two rules are both applied to an Address1 attribute. The results of the rules can be seen after running the process.
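To make the effect of the two rules concrete, the following Python sketch applies hand-written regular-expression equivalents of the two Look For expressions to a few sample patterns. Several things here are assumptions for illustration: the sample patterns omit validity levels and delimiters, the leading wild card is made lazy so that a run of <A> tokens is kept together, and the function name is hypothetical. The product's own matching behavior may differ.

```python
import re

LEAD = r"(?:<[^<>]+>)*?"  # leading [.]*, lazy so it takes as few tokens as possible
TAIL = r"(?:<[^<>]+>)*"   # trailing [.]*

# Assumed regex equivalents of the two example rules.
RULES = [
    ("Thoroughfare", re.compile(LEAD + r"((?:<A>)+<RoadHint>)" + TAIL)),
    ("Reclassify towns in road names", re.compile(LEAD + r"((?:<Town>)+<RoadHint>)" + TAIL)),
]


def apply_rules(pattern):
    """Apply each rule in turn, replacing the bracketed part with a thoroughfare token."""
    for _name, regex in RULES:
        match = regex.fullmatch(pattern)
        if match:
            start, end = match.span(1)
            pattern = pattern[:start] + "<valid thoroughfare>" + pattern[end:]
    return pattern


for sample in ["<A><RoadHint>", "<Town><RoadHint>", "<N><A><A><RoadHint>", "<N><Town>"]:
    print(sample, "->", apply_rules(sample))
```

In this sketch, patterns such as those for "Acacia Avenue" (<A><RoadHint>) and "London Road" (<Town><RoadHint>) both end up as a single valid thoroughfare token, while a pattern with no road hint is left unchanged.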

We can then drill down on the affected records and patterns to check that each rule is working correctly. Note that each reclassification rule may affect many patterns within the attribute it is applied to, and these in turn may affect many of the overall classification patterns (if parsing several attributes). Because a single token may be classified in different ways, the same input record may have multiple classification patterns, and several of them may be affected by the same rule.

After the above reclassification rules have been applied, you can see the appearance of the <thoroughfare> token in a number of patterns in the Reclassification view.

The next step in configuring a Parse processor is to Select the best descriptions of your input data.