1.3.4.8.14 Comparison: Longest Common Substring Percentage

The Longest Common Substring Percentage comparison determines the similarity of two String/String Array values to each other by finding the Longest Common Substring between two values, and relating its length in characters to the length in characters of either the longer or the shorter input value.
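The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function names are hypothetical, and rounding to the nearest whole percent is an assumption inferred from the figures quoted in this section.

```python
def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters found in both a and b."""
    best = 0
    # prev[j] holds the length of the common run ending at the previous
    # character of a and at b[j - 1].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lcs_percentage(a: str, b: str, relate_to_shorter: bool = False) -> int:
    """Longest common substring length as a percentage of the longer (or shorter) value."""
    base = min(len(a), len(b)) if relate_to_shorter else max(len(a), len(b))
    if base == 0:
        return 0  # assumption for this sketch; the product treats empty values as No Data
    return round(100 * longest_common_substring_length(a, b) / base)


print(lcs_percentage("Excel", "Excel Europe"))                          # related to the longer value
print(lcs_percentage("Excel", "Excel Europe", relate_to_shorter=True))  # related to the shorter value
```

The dynamic-programming table keeps only two rows at a time, so memory use grows with the length of one value rather than with the product of both lengths.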

Use the Longest Common Substring Percentage comparison where the Longest Common Substring comparison on its own may give misleading results, because it can match a long word or words in a value without considering the rest of the data in that value. For example, the values "Ardent Design Birmingham" and "Britannia Design Birmingham" have a Longest Common Substring of 17 characters, indicating a strong result. However, they have a Longest Common Substring Percentage of only 63%, a much weaker result.

When the Longest Common Substring is related to the shorter of the two values, the comparison also provides a way to perform either exact or fuzzy 'contains' matching between two values. For example, with the result related to the shorter value, the values "Ardent" and "Ardent Design UK" will match with a score of 100%, and the values "Ardent UK" and "Ardent Design UK" will match with a score of 75% (assuming all whitespace is trimmed).
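As a rough illustration of 'contains' matching, the sketch below relates the longest common substring to the shorter value after removing whitespace. It uses Python's standard difflib module to find the longest matching block; the function name contains_score, the case folding, and the whole-percent rounding are assumptions made for this example only.

```python
from difflib import SequenceMatcher


def contains_score(a: str, b: str) -> int:
    """Percentage of the shorter value covered by the longest common substring."""
    # Assumption: all whitespace is removed and case is ignored before comparing.
    a = "".join(a.split()).lower()
    b = "".join(b.split()).lower()
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0  # assumption for this sketch only
    # autojunk=False disables difflib's heuristic that ignores very common
    # characters in long inputs, so the true longest block is found.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return round(100 * match.size / shorter)


print(contains_score("Ardent", "Ardent Design UK"))     # exact 'contains' match
print(contains_score("Ardent UK", "Ardent Design UK"))  # fuzzy 'contains' match
```

A score of 100 means the shorter value appears in full inside the longer one; lower scores indicate how much of the shorter value is covered by a single unbroken run.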

This comparison supports the use of result bands.

The following table describes the configuration options:

Option: Match No Data pairs?
Type: Yes/No
Default Value: No
Description:
  This option determines the result of a comparison when it compares two No Data (Null, or containing only whitespace characters) values for an identifier.
  If set to No, the comparison will give a 'no data' result when comparing a No Data value against another No Data value.
  If set to Yes, the comparison will give a result of 0 when comparing a No Data value against another No Data value. A 'no data' result will only be returned if a No Data value is compared against a populated value.

Option: Ignore case?
Type: Yes/No
Default Value: Yes
Description:
  Sets whether or not to ignore case when comparing values.

Option: Relate to shorter input?
Type: Yes/No
Default Value: No
Description:
  This option determines whether to relate the Longest Common Substring between two values to the longer, or the shorter, of those two values.
  If set to Yes, the shorter value will be used, meaning the Longest Common Substring Percentage will effectively measure how nearly one value is contained in another. For example, the LCSP between "Excel" and "Excel Europe" will be 100%.
  If set to No, the Longest Common Substring will be related to the longer value of the two being compared. So, the LCSP between "Excel" and "Excel Europe" will be only 42%, and the LCSP between "Britannia Design" and "Britannia Desn. UK" will be 72%.
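The way the three options interact can be sketched as follows. This is an illustrative model only: the function and parameter names are hypothetical, None stands in for a 'no data' result, and returning 'no data' when exactly one value is empty under Match No Data pairs? = No is an assumption, since the option description covers only the case where both values are No Data.

```python
from difflib import SequenceMatcher
from typing import Optional


def is_no_data(value: Optional[str]) -> bool:
    """No Data: Null, or containing only whitespace characters."""
    return value is None or value.strip() == ""


def compare(a: Optional[str], b: Optional[str],
            match_no_data_pairs: bool = False,
            ignore_case: bool = True,
            relate_to_shorter: bool = False) -> Optional[int]:
    """Return a whole-number percentage, or None to represent a 'no data' result."""
    if is_no_data(a) or is_no_data(b):
        if match_no_data_pairs and is_no_data(a) and is_no_data(b):
            return 0  # result stated in the option description for a No Data pair
        return None   # 'no data' result
    if ignore_case:
        a, b = a.lower(), b.lower()
    # Longest matching block, with difflib's junk heuristic switched off.
    size = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b)).size
    base = min(len(a), len(b)) if relate_to_shorter else max(len(a), len(b))
    return round(100 * size / base)


print(compare("Excel", "Excel Europe", relate_to_shorter=True))
print(compare("Excel", "Excel Europe"))
print(compare("Britannia Design", "Britannia Desn. UK"))
print(compare(None, "   ", match_no_data_pairs=True))
```

The defaults in the signature mirror the default values listed for each option.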

Example

In this example, the Longest Common Substring Percentage comparison is used to identify possible matches in the first line of an address.

The following options are specified:

  • Match No Data pairs? = No

  • Ignore case? = Yes

  • Relate to shorter input? = Yes

A Trim Whitespace transformation is used to remove all whitespace from values before they are compared.

Example results

With the above configuration, the following table illustrates some comparison results:

Table 1-51 Example Results: Longest Common Substring Percentage

Value A                Value B                     Comparison Result
4 Briars Lane          4 briars lane               100%
10 Beckenham Drive     10 Beckenham Lane           73%
Church Farm Cottage    Church Farm Flat 2          67%
Broomfield House       Broomfield Court Flat 14    67%
10 Galloway Road       14 Galloway Street          57%
5 Jedburgh Street      5 Bath St, Jedburgh         53%
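The figures in the table can be reproduced with a short script that mirrors the example configuration: whitespace removed, case ignored, and the result related to the shorter value. The helper names are hypothetical and the rounding to whole percentages is an assumption; the sketch is meant to show how the results arise, not how the product computes them.

```python
from difflib import SequenceMatcher

# Value pairs taken from the example results table.
ROWS = [
    ("4 Briars Lane", "4 briars lane"),
    ("10 Beckenham Drive", "10 Beckenham Lane"),
    ("Church Farm Cottage", "Church Farm Flat 2"),
    ("Broomfield House", "Broomfield Court Flat 14"),
    ("10 Galloway Road", "14 Galloway Street"),
    ("5 Jedburgh Street", "5 Bath St, Jedburgh"),
]


def normalize(value: str) -> str:
    """Remove all whitespace (Trim Whitespace) and fold case (Ignore case? = Yes)."""
    return "".join(value.split()).lower()


def lcsp_shorter(a: str, b: str) -> int:
    """Longest common substring as a percentage of the shorter normalized value."""
    a, b = normalize(a), normalize(b)
    size = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b)).size
    return round(100 * size / min(len(a), len(b)))


for value_a, value_b in ROWS:
    print(f"{value_a} | {value_b} | {lcsp_shorter(value_a, value_b)}%")
```

For instance, "10 Beckenham Drive" and "10 Beckenham Lane" share the 11-character run "10beckenham" after normalization, and the shorter value has 15 characters, giving 73%.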