1.3.4.8.13 Comparison: Longest Common Substring

The Longest Common Substring comparison compares two String or String Array values and assesses whether they might match by finding the length of the longest sequence of characters (substring) that is common to both values, whether that substring represents the whole or only a part of each String value.
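To illustrate what is being measured, the following Python sketch finds a longest common substring with a standard dynamic-programming approach. It is an illustration only, not the product's implementation; the function name `longest_common_substring` is hypothetical, and the inputs are assumed to be plain strings that have already been prepared for comparison.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest run of characters that appears contiguously in both a and b."""
    best_len = 0
    best_end = 0  # index in `a` just past the end of the best match found so far
    # prev[j] holds the length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one character earlier in both strings
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


common = longest_common_substring("JillLewis", "BillLewis")
print(common, len(common))  # illLewis 8
```

The comparison result corresponds to the length of the returned substring. Keeping only two rows of the table keeps memory use proportional to the length of the second value.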

Use the Longest Common Substring comparison to find matches between String values where there may be 'noise' at either the beginning or the end of the String that is difficult to remove by stripping words before comparison. It is also useful where you know that String values sharing a common sequence of characters over a certain length are likely to be related. For example, it can match "Nomura Securities Co., Ltd." with "Nomura Investor Relations Co., Ltd." on a Longest Common Substring of 6 characters, "Nomura", once the shared "Co., Ltd." suffix is disregarded.

The Longest Common Substring comparison is often used in match rules that sit low down in the decision table, in order to find and review possible matches that are similar but have failed to match using other rules, perhaps because of ordering issues or excess 'noise'.

This comparison supports the use of result bands.

The following table describes the configuration options:

| Option | Type | Description | Default Value |
|---|---|---|---|
| Match No Data pairs? | Yes/No | This option determines the result of a comparison when it compares two No Data (Null, or containing only whitespace characters) values for an identifier. If set to No, the comparison will give a 'no data' result when comparing a No Data value against another No Data value. If set to Yes, the comparison will give a result of 0 when comparing a No Data value against another No Data value. A 'no data' result will only be returned if a No Data value is compared against a populated value. | No |
| Ignore case? | Yes/No | Sets whether or not to ignore case when comparing values. | Yes |
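The effect of the two options can be sketched in Python as follows. This is a hedged illustration rather than the product's code. The function `compare_lcs`, its parameter names, and the use of `None` to stand for a 'no data' result are all hypothetical. The sketch also assumes that an empty string counts as No Data, and that a No Data value compared against a populated value gives 'no data' under either setting. The substring length itself is taken from the standard library's `difflib.SequenceMatcher.find_longest_match`.

```python
from difflib import SequenceMatcher
from typing import Optional

NO_DATA = None  # hypothetical stand-in for the 'no data' result


def is_no_data(value: Optional[str]) -> bool:
    """Treat None, empty, and whitespace-only values as No Data (assumption)."""
    return value is None or value.strip() == ""


def compare_lcs(a: Optional[str], b: Optional[str],
                match_no_data_pairs: bool = False,
                ignore_case: bool = True) -> Optional[int]:
    a_empty, b_empty = is_no_data(a), is_no_data(b)
    if a_empty and b_empty:
        # Both values are No Data: the option decides between 0 and 'no data'
        return 0 if match_no_data_pairs else NO_DATA
    if a_empty or b_empty:
        # Assumption: one-sided No Data gives 'no data' whatever the option is
        return NO_DATA
    if ignore_case:
        a, b = a.casefold(), b.casefold()
    # autojunk=False stops difflib from discarding frequent characters
    match = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return match.size


print(compare_lcs(None, "   "))                            # None
print(compare_lcs(None, "   ", match_no_data_pairs=True))  # 0
print(compare_lcs("JILL", "jill"))                         # 4
print(compare_lcs("JILL", "jill", ignore_case=False))      # 0
```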

Example

In this example, the Longest Common Substring comparison is used to identify possible matches in customer names.

The following options are specified:

  • Match No Data pairs? = No

  • Ignore case? = Yes

A Trim Whitespace transformation is used to remove all whitespace from values before they are compared.

Example results

With the above configuration, the following table illustrates some comparison results:

Table 1-50 Example Results: Longest Common Substring

| Value A | Value B | Comparison Result |
|---|---|---|
| Jill Lewis | Jill Lewis-Thompson | 9 |
| Jill Lewis | Bill Lewis | 8 |
| Jill Lewis | Jill Lonerghan | 5 |
| Michael Davis **DO NOT CALL** | Michael Davis | 12 |
| Tom Featherstone ----DECEASED---- | Thomas David Featherstone | 12 |
| Tom Featherstone | John Feathers | 8 |
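The example results can be reproduced with a short Python sketch, under the stated configuration: all whitespace removed before comparison, and case ignored. The helper name `example_lcs_length` is hypothetical, and the whitespace removal here only approximates the Trim Whitespace transformation. The sketch uses `difflib` from the standard library and is not the product's implementation.

```python
from difflib import SequenceMatcher


def example_lcs_length(a: str, b: str) -> int:
    """Longest common substring length after removing whitespace and ignoring case."""
    # Approximates the Trim Whitespace step: drop every whitespace character
    a = "".join(a.split()).casefold()
    b = "".join(b.split()).casefold()
    match = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return match.size


EXAMPLES = [
    ("Jill Lewis", "Jill Lewis-Thompson"),
    ("Jill Lewis", "Bill Lewis"),
    ("Jill Lewis", "Jill Lonerghan"),
    ("Michael Davis **DO NOT CALL**", "Michael Davis"),
    ("Tom Featherstone ----DECEASED----", "Thomas David Featherstone"),
    ("Tom Featherstone", "John Feathers"),
]

for value_a, value_b in EXAMPLES:
    print(f"{value_a!r} vs {value_b!r}: {example_lcs_length(value_a, value_b)}")
```

Because whitespace is removed first, "Jill Lewis" is compared as the 9-character value "JillLewis", which is why the first result is 9 rather than 10.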