1.3.9.11 Select

Select is a sub-processor of Parse. The Select step takes all of the possible generated token patterns describing each record and attempts to select the pattern that corresponds to its best understanding of the data, using a combination of criteria.

The criteria used are:

  • The number of tokens that are unclassified

  • The frequency of occurrence of each possible description across the data set (optional)

  • The confidence level of the classifications (valid or possible)

Selection uses a tunable algorithm to select the token pattern that best describes the data. Sometimes a single token pattern cannot be selected - for example, because two or more candidate patterns have the same number of unclassified tokens, occur a similar number of times across the data set, and have the same confidence levels in their classifications. In this case, the record is marked as having an ambiguity in its pattern selection. Where a record has one or more ambiguities in its selection, it can still be assigned a Result (according to the option for Ambiguous Patterns in the Resolve step), but its data cannot be mapped to an output format.

The Parse processor performs selection in order to gain the best understanding of the tokens in each input record, which is then used in the Resolve step to give a result, and to resolve the data into a new output format.

For example, when parsing a single NAME field, the data "ADAM SCOTT" might be understood by simple classification rules to be either "<valid Forename>_<valid Surname>" or "<valid Surname>_<valid Forename>". The correct answer probably depends on the expected format of the data in the data set. If most of the remaining names are in the format "<Forename> <Surname>", then that seems the most likely pattern, and the person's name is most likely to be Adam Scott. However, if the remaining names are normally in the format "<Surname> <Forename>", then it is more likely that the person's name is Scott Adam.

Also, if a token has been classified by two different token checks, with two different confidence levels - for example, if the token "ADAM" has been classified both as a <valid Forename> and as a <possible Surname> - then by implication it is more likely to be a <valid Forename>.
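
This preference can be illustrated with a small sketch. The numeric ranking and the (confidence, token check) pair encoding are assumptions for illustration, not EDQ's internal representation:

```python
# Assumed ranking for illustration: a 'valid' classification outranks
# a 'possible' one.
CONFIDENCE_RANK = {'valid': 2, 'possible': 1}

def preferred(classifications):
    """Pick the classification with the higher confidence level."""
    return max(classifications, key=lambda c: CONFIDENCE_RANK[c[0]])

# "ADAM" classified two ways: the valid classification is preferred.
preferred([('valid', 'Forename'), ('possible', 'Surname')])
```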

To understand how to configure the Select sub-processor, it is important to understand the logic used for selecting the best pattern.

Delimiter treatment

This option defines the treatment of delimiter tokens by the selection process. In versions of EDQ prior to 8.1, delimiters were counted as unclassified tokens in the selection process. By default, new processors created in later versions of EDQ do not include delimiters in the unclassified tokens count.

Since only the patterns with the fewest unclassified tokens will be passed through to the final selection algorithm, the delimiter classification can alter the behavior of the processor.
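
The effect can be sketched as follows. The list-based pattern encoding, the '<A>' marker for unclassified tokens, and the '_' marker for delimiters are assumptions for illustration only:

```python
def unclassified_count(pattern, count_delimiters):
    """Count the tokens treated as unclassified; '<A>' marks an
    unclassified token and '_' marks a delimiter token."""
    count = pattern.count('<A>')
    if count_delimiters:  # pre-8.1 behavior
        count += pattern.count('_')
    return count

a = ['<valid Town>', '_', '<valid Postcode>']  # one delimiter, no <A>
b = ['<valid Town>', '<A>']                    # no delimiter, one <A>

# Counting delimiters (pre-8.1): both patterns score 1, so neither is
# discounted. With the later default, only b has an unclassified token,
# so a survives the discount and b is dropped.
```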

Discounting patterns with more unclassified tokens

Selection automatically discounts any patterns with more unclassified tokens than others. For example, when parsing addresses, the data "Newcastle Upon Tyne" in a Town attribute might generate the following token patterns, assuming both "Newcastle" and "Newcastle Upon Tyne" are in a list of valid Towns, and were therefore classified as <valid Town> tokens:

<valid Town>_<A>_<A>

<valid Town>

In this case, Parse will always prefer the second pattern, as it contains fewer unclassified tokens.
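
This pre-selection discount can be sketched as a filter that keeps only the patterns with the fewest unclassified tokens (here marked '<A>'; the list encoding is an assumption for illustration):

```python
def discount_unclassified(patterns):
    """Keep only the patterns with the fewest unclassified (<A>) tokens."""
    fewest = min(p.count('<A>') for p in patterns)
    return [p for p in patterns if p.count('<A>') == fewest]

candidates = [
    ['<valid Town>', '<A>', '<A>'],  # "Newcastle" classified; "Upon", "Tyne" not
    ['<valid Town>'],                # "Newcastle Upon Tyne" as one valid Town
]
# Only the second pattern survives, as it has no unclassified tokens.
```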

Selection by algorithm

Select then attempts to select the best token pattern for a given record, using the following algorithm. The algorithm is tunable at certain points - as marked below - so that you can adjust the sensitivity of selection.

The tunable parameters may all be adjusted in the Advanced tab.

Step 1 (optional - see below)

  Criterion used: Frequency of token pattern occurrence across sample data (generated from results)

  Logic:

  a) Select the most frequent pattern if it is more frequent than any other possible pattern by n % (configurable). If >1 possible pattern remains, proceed to b.

  b) Discount any patterns that are p % (configurable) less frequent than the most frequent pattern. If >1 possible pattern remains, proceed to Step 2.

  Tunable parameters: n (defaults to 10%), p (defaults to 20%)

Step 2 (not optional)

  Criterion used: Confidence level of token classifications in pattern (valid or possible)

  Logic: Give each possible pattern a score, starting with 100 points:

  a) Subtract q points for each unclassified token.

  b) Subtract r points for each token with a confidence level of Possible.

  Then select the highest-scoring pattern if it is highest by s points.

  Tunable parameters: q (defaults to 10), r (defaults to 5), s (defaults to 5)
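
The two-step algorithm can be sketched as follows. This is an illustration, not EDQ's implementation: it assumes a pattern is a tuple of (confidence, token) pairs, with confidence being 'valid', 'possible', or None for an unclassified token, and it interprets n and p as relative percentages, since the exact arithmetic is not specified above.

```python
def score(pattern, q=10, r=5):
    """Step 2 score: 100, minus q per unclassified token and r per
    token classified with only 'possible' confidence."""
    unclassified = sum(1 for conf, _ in pattern if conf is None)
    possibles = sum(1 for conf, _ in pattern if conf == 'possible')
    return 100 - q * unclassified - r * possibles

def select_pattern(candidates, sample_counts, n=10, p=20, q=10, r=5, s=5):
    """Return the selected pattern, or None if selection is ambiguous."""
    # Pre-step: discard patterns with more unclassified tokens than the best.
    def unclassified(pat):
        return sum(1 for conf, _ in pat if conf is None)
    fewest = min(map(unclassified, candidates))
    pool = [pat for pat in candidates if unclassified(pat) == fewest]

    # Step 1a: take the top pattern if it leads every other by more than n %.
    pool.sort(key=lambda pat: sample_counts.get(pat, 0), reverse=True)
    top = sample_counts.get(pool[0], 0)
    if len(pool) == 1 or top > sample_counts.get(pool[1], 0) * (1 + n / 100):
        return pool[0]

    # Step 1b: discount patterns more than p % less frequent than the top.
    pool = [pat for pat in pool
            if sample_counts.get(pat, 0) >= top * (1 - p / 100)]
    if len(pool) == 1:
        return pool[0]

    # Step 2: highest score wins, but only if it leads by at least s points.
    pool.sort(key=lambda pat: score(pat, q, r), reverse=True)
    if score(pool[0], q, r) - score(pool[1], q, r) >= s:
        return pool[0]
    return None  # ambiguous selection
```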

Selecting a pattern using a frequency sample (Step 1 in the table above)

This is an optional step, but is recommended for any complex parsing needs.

On the first run of the Parse processor, it is not possible to select the best token pattern by analyzing its frequency across the data set. This is because Parse first needs to generate all the possible patterns.

Once the Parse processor has run at least once:

  • You can create a new Pattern frequency sample from the latest results (data in the Reclassification View), by clicking on the + button.

  • You can update a selected Pattern frequency sample from the latest results, by clicking the ^ button.

Using a static sample, rather than automatically using the pre-selection patterns data generated on each run, ensures predictable selection for the Parse processor, regardless of the size of the input data set: provided the sample is the same, the same description will always be selected for a given record.

Updating the sample from the latest results is often necessary, since you may iterate through many runs of the parser, and many changes to classification and reclassification rules, before generating a set of descriptions that you are happy with.

Further Options

Options are available to tune the parameters used in the pattern selection algorithm described in the section above. These options should only be changed by advanced users who are aware that changing them may have significant effects on the 'intelligence' of the parser.

Note that in this example, all the tunable parameters of the selection algorithm use their default values (as above).

In the analysis of a single NAME attribute, a record with the value "DR Adam FOTHERGILL ESQ" might produce the following potential token patterns (amongst others):

1. <valid Title>_<possible Surname>_<possible Surname>_<valid Honorific>
 
2. <valid Title>_<valid Forename>_<possible Surname>_<valid Honorific>
 
3. <valid Title>_<valid Forename>_<possible Surname>_<possible Surname>
 
4. <A>_<A>_<A>_<A>

Etc.

First of all, pattern 4 will be discounted, as it contains more unclassified tokens than any of the other patterns.

The remaining 3 token patterns are then passed to the selection algorithm:

At step 1a, if one of the patterns is more than 10% more frequent than either of the others in the sample data, this pattern will be selected. If not, the logic would proceed to step 1b.

At step 1b, if any of the patterns are more than 20% less frequent in the sample data than the most common pattern, these patterns would be discounted. If more than one pattern remained, the logic would proceed to step 2.

At step 2, the remaining patterns would be scored. Assuming patterns 1, 2 and 3 were all to remain, they would be scored as follows:

Pattern 1: 100 points - 10 points for 2 Possible tokens = 90 points
Pattern 2: 100 points - 5 points for 1 Possible token = 95 points
Pattern 3: 100 points - 10 points for 2 Possible tokens = 90 points

So, in this case pattern 2 would be selected, assuming the default threshold difference of 5 points (and not a higher value) is used.
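
The step 2 arithmetic in this worked example can be reproduced with the default deductions (the helper name and count-based encoding are assumptions for illustration):

```python
Q, R = 10, 5  # default points deducted per unclassified / per Possible token

def pattern_score(possibles, unclassified=0, q=Q, r=R):
    """Step 2 score for a pattern with the given token counts."""
    return 100 - q * unclassified - r * possibles

scores = [pattern_score(possibles=2),  # pattern 1: two Possible tokens
          pattern_score(possibles=1),  # pattern 2: one Possible token
          pattern_score(possibles=2)]  # pattern 3: two Possible tokens
# scores == [90, 95, 90]: pattern 2 leads by exactly the default
# threshold of s = 5 points, so it is selected.
```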

Once you are happy with the way token patterns are selected, the final step in configuring a Parse processor is to Resolve data.