1.3.4.8.12 Comparison: Longest Common Phrase Percentage

The Longest Common Phrase Percentage comparison compares two String/String Array values and determines whether they might match. It finds the longest sequence of words (separated by spaces) that is common to both values, and relates the length of that sequence to the number of words in either the value with the most words or the value with the fewest words.

The Longest Common Phrase Percentage comparison is useful for determining how close two values with many words are to each other, where the order of words in the values is important. For example, when matching Company Names, words in the full name are frequently left out when the name is captured in a system. It is useful to be able to identify matches that share a significant sequence of words, but where there aren't too many extra words in one of the values.

For example, the Longest Common Phrase between "T D Waterhouse UK" and "Price Waterhouse UK" is 2 words ("Waterhouse UK"), as it is between "Price Waterhouse" and "Price Waterhouse UK" ("Price Waterhouse"). However, the Longest Common Phrase Percentage takes into account the number of words in the values. It matches "T D Waterhouse UK" and "Price Waterhouse UK" with a score of only 50% (2 of the 4 words in the longer value), but "Price Waterhouse" and "Price Waterhouse UK" with a score of 67% (2 of 3 words), a stronger match.
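The calculation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's implementation: the function names and the `relate_to_shorter` parameter are hypothetical, words are compared exactly, and empty inputs simply score 0.

```python
def longest_common_phrase(a: str, b: str) -> int:
    """Length, in words, of the longest run of consecutive words shared by a and b."""
    words_a, words_b = a.split(), b.split()
    best = 0
    # prev[j] holds the length of the common run ending at the previous word
    # of `a` and at words_b[j - 1].
    prev = [0] * (len(words_b) + 1)
    for word_a in words_a:
        curr = [0] * (len(words_b) + 1)
        for j, word_b in enumerate(words_b, start=1):
            if word_a == word_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lcp_percentage(a: str, b: str, relate_to_shorter: bool = False) -> float:
    """Longest common phrase as a percentage of the longer (or shorter) value's word count."""
    len_a, len_b = len(a.split()), len(b.split())
    denominator = min(len_a, len_b) if relate_to_shorter else max(len_a, len_b)
    if denominator == 0:
        return 0.0  # assumption: No Data handling is outside the scope of this sketch
    return 100.0 * longest_common_phrase(a, b) / denominator
```

Under these assumptions, `lcp_percentage("T D Waterhouse UK", "Price Waterhouse UK")` gives 50.0, and `lcp_percentage("Price Waterhouse", "Price Waterhouse UK")` gives roughly 66.7, which rounds to the 67% quoted above.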

This comparison supports the use of result bands.

The following table describes the configuration options:

Option: Match No Data pairs?

Type: Yes/No

Description: This option determines the result of a comparison when it compares two No Data (Null, or containing only whitespace characters) values for an identifier.

If set to No, the comparison will give a 'no data' result when comparing a No Data value against another No Data value.

If set to Yes, the comparison will give a result of 0 when comparing a No Data value against another No Data value. A 'no data' result will only be returned if a No Data value is compared against a populated value.

Default Value: No

Option: Ignore case?

Type: Yes/No

Description: Sets whether or not to ignore case when comparing values.

Default Value: Yes

Option: Relate to shorter input?

Type: Yes/No

Description: This option determines whether to relate the Longest Common Phrase between two values to the longer, or the shorter, of those two values.

If set to Yes, the comparison will relate the Longest Common Phrase between two values to the shorter value of the two, in terms of the number of words. This would mean that the Longest Common Phrase Percentage between "T D Waterhouse UK" and "Price Waterhouse UK" would be 67% (2 of 3 words).

If set to No, the Longest Common Phrase will be related to the longer value of the two being compared. So, "T D Waterhouse UK" and "Price Waterhouse UK" would only be matched with a result of 50% (2 of 4 words).

Default Value: No

Option: Character error tolerance

Type: Integer

Description: This option specifies the number of character edits that are 'tolerated' when comparing words with each other. Two words with a Character Edit Distance of less than or equal to the specified figure will be considered as the same.

For example, if set to 1, the Longest Common Phrase Percentage between "95 Charnwood Court, Mile End, Parnham, Middlesex" and "95 Charwood Court, Mile End, Parnam, Middlesex" would be 100%, as all words match each other considering this tolerance.

Default Value: 0

Option: Ignore tolerance on numbers?

Type: Yes/No

Description: This option allows the Character error tolerance to be ignored for words that consist entirely of numerics.

For example, if set to Yes, and using a Character error tolerance of 1, the Longest Common Phrase Percentage between "95 Charnwood Court, Mile End, Parnham, Middlesex" and "96 Charnwood Court, Mile End, Parnam, Middlesex" would be 86% (6 of 7 words), because the numbers 95 and 96 would be considered as different, despite the fact that they only differ by a single character.

If set to No, numbers will be treated like any other words, so in the example above, the Longest Common Phrase Percentage would be 100% as 95 and 96 would be considered as the same.

Default Value: Yes

Option: Treat tolerance value as percentage?

Type: Yes/No

Description: This allows the Character error tolerance to be treated as a percentage of the word length in characters. For example, to tolerate a single character error for every five characters in a word, a value of 20% should be used.

This option may be useful to avoid treating short words as the same when they differ by a single character, while retaining the ability to be tolerant of typos in longer words. For example, it can consider "Parnham" and "Parnam" as the same, but treat "Bath" and "Batt" as different.

If set to Yes, the Character error tolerance option should be entered as the maximum percentage of the characters in a word that you allow to be different, while still considering each word as the same. For example, if set to Yes, a Character error tolerance of 20% will mean "Parnam" and "Parnham" will be considered as the same, as they have an edit distance of 1 and a longer word length of 7 characters, giving a character error percentage of 14%, which is below the 20% threshold. The values "Bath" and "Batt", however, will not be considered as the same, as they have a character error percentage of 25% (1 error in 4 characters).

If set to No, the Character error tolerance option will be treated as a character edit tolerance between words.

Default Value: No
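The interaction between the three tolerance options and the Ignore case? option can be illustrated with a small word-matching predicate. This is a hedged sketch rather than the product's logic: the function and parameter names are hypothetical, Character Edit Distance is assumed to be the Levenshtein distance, the percentage is assumed to be taken against the longer of the two words (as in the "Parnham" example), and the tolerance is assumed to be ignored when either word is entirely numeric.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        curr = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def words_match(w1: str, w2: str, tolerance: float = 0,
                ignore_case: bool = True,
                ignore_tolerance_on_numbers: bool = True,
                tolerance_is_percentage: bool = False) -> bool:
    """Decide whether two words count as the same under the given options."""
    if ignore_case:
        w1, w2 = w1.lower(), w2.lower()
    if w1 == w2:
        return True
    # Assumption: the tolerance is skipped if either word is all digits.
    if ignore_tolerance_on_numbers and (w1.isdigit() or w2.isdigit()):
        return False
    distance = edit_distance(w1, w2)
    if tolerance_is_percentage:
        # Assumption: errors are measured against the longer word's length.
        return 100.0 * distance / max(len(w1), len(w2)) <= tolerance
    return distance <= tolerance
```

With a tolerance of 1 character, this sketch treats "Bath" and "Batt" as the same. With a tolerance of 20 interpreted as a percentage, it treats "Parnham" and "Parnam" as the same (1 error in 7 characters) but "Bath" and "Batt" as different (1 error in 4 characters), matching the behaviour described for the option above.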

Example

In this example, the Longest Common Phrase Percentage comparison is used to identify possible matches in company names.

The following options are specified:

  • Match No Data pairs? = No

  • Ignore case? = Yes

  • Relate to shorter input? = No

  • Character error tolerance = 1

  • Treat tolerance value as percentage? = No

  • Ignore tolerance on numbers? = Yes

Example results

With the above configuration, the following table illustrates some comparison results:

Table 1-49 Example Results: Longest Common Phrase Percentage

| Value A | Value B | Comparison Result |
|---|---|---|
| Oracle Limited | ORACLES LIMITED | 100% |
| Accounting Software and Services Ltd | Accounting Software and Services Ltd (E-Retail) | 83% |
| The 365 Corporation | The 364 Corporation | 33% |
| Barclays Bank International | Barrclays Bank | 67% |
| Barclays | Barclays Bank | 50% |
| Oracle Professional Services Ltd | Oracle Proffessional Services | 75% |
| Marks and Spencer Financials | Marks and Spencer | 75% |
| Marks and Spencer Head Office | Marks and Spencer | 60% |
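As a cross-check, the results in the table can be reproduced with a short Python sketch that hard-codes the example configuration. It is an approximation under stated assumptions, not the product's implementation: the function names are hypothetical, Character Edit Distance is assumed to be the Levenshtein distance, results are assumed to be rounded to the nearest whole percent, and No Data inputs simply return `None`.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance between two words."""
    prev = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        curr = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def same_word(w1: str, w2: str) -> bool:
    w1, w2 = w1.lower(), w2.lower()        # Ignore case? = Yes
    if w1 == w2:
        return True
    if w1.isdigit() or w2.isdigit():       # Ignore tolerance on numbers? = Yes
        return False
    return edit_distance(w1, w2) <= 1      # Character error tolerance = 1


def example_lcpp(a: str, b: str):
    words_a, words_b = a.split(), b.split()
    longer = max(len(words_a), len(words_b))  # Relate to shorter input? = No
    if longer == 0:
        return None  # assumption: stands in for a 'no data' result
    best = 0
    prev = [0] * (len(words_b) + 1)
    for word_a in words_a:
        curr = [0] * (len(words_b) + 1)
        for j, word_b in enumerate(words_b, start=1):
            if same_word(word_a, word_b):
                curr[j] = prev[j - 1] + 1  # extend the run of consecutive matches
                best = max(best, curr[j])
        prev = curr
    return round(100 * best / longer)  # assumption: nearest whole percent
```

For instance, `example_lcpp("The 365 Corporation", "The 364 Corporation")` returns 33 in this sketch: the numbers differ, so the longest common phrase is a single word out of three.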