1.3.4.8.19 Comparison: Word Edit Distance

The Word Edit Distance comparison determines how well multi-word String or String Array values match each other by calculating the minimum number of word edits (word insertions, deletions and substitutions) required to transform one value into the other.

Use the Word Edit Distance comparison when matching multi-word String values (such as full names) that may be similar, but which would not match well using character-based comparisons such as Character Edit Distance or Character Match Percentage. For example, the values "Joseph Andrew Cole" and "Joseph Cole" could be considered a fairly strong match, but they have a Character Edit Distance of 6 and a Character Match Percentage of 63%, indicating a fairly weak match. The same two values have a Word Edit Distance of only 1, however, and this may be backed up by additional comparisons that match the first few and the last few characters to indicate a possible match.
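The word-level calculation described above can be sketched in Python as follows. This is only an illustration that assumes a standard Levenshtein distance computed over whitespace-separated words; the function name word_edit_distance is hypothetical, and the sketch does not model any of the configuration options or the product's internal algorithm.

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over whitespace-separated words (illustrative only)."""
    x, y = a.split(), b.split()
    # prev[j] is the distance between the words of x processed so far and y[:j].
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(
                prev[j] + 1,         # delete wx
                curr[j - 1] + 1,     # insert wy
                prev[j - 1] + cost,  # substitute wx with wy, or keep it
            ))
        prev = curr
    return prev[-1]


# Removing the single word "Andrew" is one word edit.
print(word_edit_distance("Joseph Andrew Cole", "Joseph Cole"))  # 1
```

Because the sketch compares words exactly, it is case sensitive and has no tolerance for typos within a word.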

This comparison supports the use of result bands.

The following table describes the configuration options:

| Option | Type | Description | Default Value |
|---|---|---|---|
| Match No Data pairs? | Yes/No | Determines the result of the comparison when it compares two No Data values (Null, or containing only whitespace characters) for an identifier. If set to No, the comparison gives a 'no data' result when comparing a No Data value against another No Data value. If set to Yes, the comparison returns a result of 0 when comparing a No Data value against another No Data value (as no word edits are needed to transform one into the other). A 'no data' result is then returned only if a No Data value is compared against a populated value. | No |
| Ignore case? | Yes/No | Sets whether or not to ignore case when comparing values. For example, if case is ignored, the Word Edit Distance between "Joseph Andrew COLE" and "Joseph Andrew Cole" will be 0. If case is not ignored, it will be 1. | Yes |
| Character error tolerance | Integer | Specifies the number of character edits that are 'tolerated' when comparing words with each other. All words with a Character Edit Distance of less than or equal to the specified figure will be considered as the same. For example, if set to 1, the Word Edit Distance between "Parnham, Middlesex" and "Parnam, Middlesex" would be 0, as all words match each other considering this tolerance. | 0 |
| Ignore tolerance on numbers? | Yes/No | Allows the Character error tolerance to be ignored for words that consist entirely of numerics. For example, if set to Yes, and using a Character error tolerance of 1, the Word Edit Distance between "95 Charnwood Court, Mile End, Parnham, Middlesex" and "96 Charnwood Court, Mile End, Parnam, Middlesex" would be 1, because the numbers 95 and 96 would be considered as different, despite the fact that they differ by only a single character. If set to No, numbers will be treated like any other words, so in the example above, the Word Edit Distance would be 0, as 95 and 96 would be considered as the same. | Yes |
| Treat tolerance value as percentage? | Yes/No | Allows the Character error tolerance to be treated as a percentage of the word length in characters. For example, to tolerate a single character error for every five characters in a word, a value of 20% should be used. This option may be useful to avoid treating short words as the same when they differ by a single character, but to retain the ability to be tolerant of typos in longer words - for example, to consider "Parnham" and "Parnam" as the same, but to treat "Bath" and "Batt" as different. If set to Yes, the Character error tolerance option should be entered as the maximum percentage of the characters in a word that you allow to be different while still considering the words as the same. For example, a Character error tolerance of 20% will mean "Parnam" and "Parnham" will be considered as the same, as they have an edit distance of 1 and a longer word length of 7 characters - meaning a character error percentage of 14%, which is below the 20% threshold. The values "Bath" and "Batt", however, will not be considered as the same, as they have a character error percentage of 25% (1 error in 4 characters). If set to No, the Character error tolerance option will be treated as a number of character edits tolerated between words. | No |
| Ignore word order? | Yes/No | If set to Yes, the order of the words in each value will not influence the result. For example, the Word Edit Distance between "Nomura International Bank" and "International Bank Nomura" would be 0. If set to No, the order of the words in each value will be considered, so the Word Edit Distance between "Nomura International Bank" and "International Bank Nomura" would be 3. | Yes |
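As a rough illustration of how the word-level options above might interact, the following Python sketch decides whether two individual words count as the same. It is not the product's implementation: the names char_edit_distance and words_match and all parameter names are hypothetical, the character distance is assumed to be a standard Levenshtein distance, and the numeric rule is assumed to apply when either word consists entirely of digits.

```python
def char_edit_distance(s: str, t: str) -> int:
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def words_match(a: str, b: str, tolerance: int = 0, ignore_case: bool = True,
                ignore_tolerance_on_numbers: bool = True,
                tolerance_is_percentage: bool = False) -> bool:
    """Return True if two words are treated as the same (illustrative only)."""
    if ignore_case:
        a, b = a.lower(), b.lower()
    if a == b:
        return True
    # Assumption: an all-digit word must match exactly when this option is on.
    if ignore_tolerance_on_numbers and (a.isdigit() or b.isdigit()):
        return False
    errors = char_edit_distance(a, b)
    if tolerance_is_percentage:
        # Compare the error rate against the length of the longer word,
        # using integer arithmetic to avoid rounding issues.
        return 100 * errors <= tolerance * max(len(a), len(b))
    return errors <= tolerance


print(words_match("Parnham", "Parnam", tolerance=1))                         # True
print(words_match("95", "96", tolerance=1))                                  # False
print(words_match("95", "96", tolerance=1, ignore_tolerance_on_numbers=False))  # True
print(words_match("Parnam", "Parnham", tolerance=20, tolerance_is_percentage=True))  # True
print(words_match("Bath", "Batt", tolerance=20, tolerance_is_percentage=True))       # False
```

The printed values follow the examples given in the option descriptions: 1 error in 7 characters is about 14%, which is within a 20% tolerance, whereas 1 error in 4 characters is 25%, which is not.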

Example

In this example, the Word Edit Distance comparison is used to match company names. The following options are specified:

  • Match No Data pairs? = No

  • Ignore case? = Yes

  • Character error tolerance = 1

  • Ignore tolerance on numbers? = No

  • Treat tolerance value as percentage? = No

  • Ignore word order? = Yes

In addition, a Strip Words transformation is added, using a Reference Data list that includes the following entries: PLC, LIMITED, OF.

Example results

With the above configuration, the following table illustrates some comparison results:

Table 1-57 Example Results: Word Edit Distance

| Value A | Value B | Comparison Result |
|---|---|---|
| International Bank of Nomura | Nomura International Bank | 0 |
| BA Systems Operations | BA SYSTEMS OPERATIONS | 0 |
| Oracle Limited | Oracle | 0 |
| Oracle Limited | Oraccle | 0 |
| George & Sons Plumbers Limited | George Plumber & Sons | 0 |
| Price Waterhouse Coopers | Price Waterhouse | 1 |
| British Telecom plc | First Telecom | 1 |
| Merrill Lynch | Merrills | 1 |
| Merrill Lynch | Merrillion Software | 2 |
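The results above can be reproduced with a short Python sketch of the example configuration (case ignored, a Character error tolerance of 1, word order ignored, and the words PLC, LIMITED and OF stripped). The sketch rests on several assumptions that are not confirmed by this page: words are split on whitespace, the order-insensitive distance is taken to be the word count of the longer value minus the number of word pairs that match within the tolerance, and pairs are found greedily (exact matches first). All function names are hypothetical.

```python
STRIP_WORDS = {"PLC", "LIMITED", "OF"}  # entries of the example Reference Data list


def char_edit_distance(s: str, t: str) -> int:
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def tokens(value: str) -> list[str]:
    """Split on whitespace, drop stripped words, and lower-case (Ignore case? = Yes)."""
    return [w.lower() for w in value.split() if w.upper() not in STRIP_WORDS]


def unordered_word_edit_distance(a: str, b: str, tolerance: int = 1) -> int:
    """Order-insensitive word edit distance (illustrative, greedy pairing)."""
    x, y = tokens(a), tokens(b)
    longest = max(len(x), len(y))
    matched = 0
    # Pass 1 pairs identical words; pass 2 pairs words within the tolerance.
    for limit in (0, tolerance):
        unpaired = []
        for wx in x:
            hit = next((wy for wy in y if char_edit_distance(wx, wy) <= limit), None)
            if hit is None:
                unpaired.append(wx)
            else:
                y.remove(hit)
                matched += 1
        x = unpaired
    # Every unpaired word in the longer value needs one insertion,
    # deletion or substitution.
    return longest - matched


ROWS = [
    ("International Bank of Nomura", "Nomura International Bank"),
    ("BA Systems Operations", "BA SYSTEMS OPERATIONS"),
    ("Oracle Limited", "Oracle"),
    ("Oracle Limited", "Oraccle"),
    ("George & Sons Plumbers Limited", "George Plumber & Sons"),
    ("Price Waterhouse Coopers", "Price Waterhouse"),
    ("British Telecom plc", "First Telecom"),
    ("Merrill Lynch", "Merrills"),
    ("Merrill Lynch", "Merrillion Software"),
]

for value_a, value_b in ROWS:
    print(f"{value_a!r} vs {value_b!r}: {unordered_word_edit_distance(value_a, value_b)}")
```

Greedy pairing is sufficient for these rows, but it is not guaranteed to find the best pairing in every case when a tolerance is used; a more careful sketch would use a maximum bipartite matching.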