1.3.4.8.16 Comparison: Longest Common Substring Sum Percentage

The Longest Common Substring Sum Percentage comparison offers a powerful way of determining the similarity between two String/String Array values, particularly where those values contain long strings of characters, or many words.

The Longest Common Substring Sum Percentage (LCSSP) calculates the Longest Common Substring Sum between two string values, and then relates it to the number of characters in either the longer or the shorter string being compared.
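The relationship described above can be written as a formula. This is a sketch only: the symbols a, b, and L are introduced here for illustration and do not come from the product documentation, and how the result is rounded is an assumption not stated in this section.

```latex
\mathrm{LCSSP}(a, b) = 100 \times \frac{\mathrm{LCSS}(a, b)}{L},
\qquad
L =
\begin{cases}
\max(|a|, |b|) & \text{when relating to the longer string} \\
\min(|a|, |b|) & \text{when relating to the shorter string}
\end{cases}
```

Here LCSS(a, b) is the Longest Common Substring Sum of the two values, and |a| and |b| are their lengths in characters.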

The Longest Common Substring Sum Percentage comparison is particularly useful when matching multi-word text strings where both word order and whitespace differences exist, and where you want to consider the similarity of the strings in proportion to their length.

This can happen, for example, when matching Asian names from different sources, which may not consistently represent names in the same order, and where whitespace may differ because of transliteration differences or typos. Note that whitespace differences weaken the results of word matching comparisons (such as Word Match Percentage), as these rely on words being consistently separated.

For example, consider the following names:

Mary Elizabeth Angus

Mary Elizabeth Francis

Mary Elizabeth

Xiaojian Zhong

ZHONG Xiao Jian

The last two names are a strong match, even though they are different in both word order and spacing. They will not have a strong Word Match Percentage result. They have a strong Longest Common Substring Sum result, but so do the first two names, and these are not a strong match.

Longest Common Substring Sum Percentage offers a way of considering the total length of common substrings between two values and relating that to the total number of characters being considered.

This comparison supports the use of result bands.

The following table describes the configuration options:

Option Type Description Default Value

Match No Data pairs?

Yes/No

This option determines the result of a comparison when it compares two No Data (Null, or containing only whitespace characters) values for an identifier.

If set to No, the comparison will give a 'no data' result when comparing a No Data value against another No Data value.

If set to Yes, the comparison will give a result of 0 when comparing a No Data value against another No Data value. A 'no data' result will only be returned if a No Data value is compared against a populated value.

No

Ignore case?

Yes/No

Sets whether or not to ignore case when comparing values.

Yes

Include substrings greater than length

Integer

Common substrings between the two values being compared must be greater than the specified value to contribute to the overall Longest Common Substring Sum score.

If set to 3, distinct (non-overlapping) substrings of 4 or more characters that are common between two values will be included in the LCSS calculation. For example, the values "Acme Micros Ltd Serv" and "Acme and Partners Micro Services Ltd" would give an LCSS of 9, assuming whitespace is trimmed before comparing. This would be calculated as 4 characters for the common substring "Acme", and 5 characters for the common substring "Micro". Note that the common substring "Ltd" would not be included in the calculation as its length is not greater than 3 characters.

4

Relate to shorter input?

Yes/No

Sets whether to relate the Longest Common Substring Sum to the shorter or the longer of the two strings being compared. Relating to the shorter input gives looser matching: a pair scores highly when most of the shorter string is also found in the longer string, even if the longer string contains extra data.

No

Example

In this example, the Longest Common Substring Sum Percentage comparison is used when comparing full names.

The following options are specified:

  • Match No Data pairs? = No

  • Ignore case? = Yes

  • Include substrings greater than length = 3

  • Relate to shorter input? = No

A Trim Whitespace transformation is used to remove all whitespace from values before they are compared.

Example results

With the above configuration, the following table illustrates some comparison results:

Table 1-53 Example Results: Longest Common Substring Sum Percentage

Value A                   Value B                   Comparison Result

Mary Elizabeth Angus      Mary Elizabeth Francis    65

Xiaojian Zhong            ZHONG Xiao Jian           100

Mary Elizabeth Angus      Mary Elizabeth            72

Tan Tan WONG              WONG Tantan               100

James Patrick Robinson    Robin Patrick Jameson     85
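
The example results above can be reproduced with a short Python sketch. It is not the product's implementation. It assumes a greedy approach (repeatedly take the longest common substring, count it if it is longer than the threshold, then exclude it from further matching) and assumes the percentage is rounded to the nearest whole number. The function names and parameters (`longest_common_substring`, `lcss`, `lcss_percentage`, `min_length`, `relate_to_shorter`) are hypothetical, and the No Data handling is simplified to returning `None`.

```python
def longest_common_substring(a, b):
    """Return (length, start_in_a, start_in_b) of the longest common substring."""
    best_len, best_a, best_b = 0, 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_a, best_b = i - curr[j], j - curr[j]
        prev = curr
    return best_len, best_a, best_b


def lcss(a, b, min_length=3):
    """Sum the lengths of non-overlapping common substrings longer than min_length.

    Assumption: greedy selection, longest substring first. Matched regions are
    cut out, and later matches may not span a cut.
    """
    frags_a, frags_b = [a], [b]
    total = 0
    while True:
        best = (0, 0, 0, 0, 0)  # length, fragment index A, fragment index B, start A, start B
        for ia, fa in enumerate(frags_a):
            for ib, fb in enumerate(frags_b):
                n, sa, sb = longest_common_substring(fa, fb)
                if n > best[0]:
                    best = (n, ia, ib, sa, sb)
        n, ia, ib, sa, sb = best
        if n <= min_length:
            return total
        total += n
        # Split each matched fragment into the parts before and after the match.
        fa, fb = frags_a[ia], frags_b[ib]
        frags_a[ia:ia + 1] = [fa[:sa], fa[sa + n:]]
        frags_b[ib:ib + 1] = [fb[:sb], fb[sb + n:]]


def lcss_percentage(a, b, min_length=3, ignore_case=True, relate_to_shorter=False):
    """Relate the LCSS to the longer (default) or shorter input, as a percentage."""
    # Stand-in for the Trim Whitespace transformation: remove all whitespace.
    a, b = "".join(a.split()), "".join(b.split())
    if ignore_case:
        a, b = a.casefold(), b.casefold()
    if not a or not b:
        return None  # simplified No Data handling (assumption)
    denom = min(len(a), len(b)) if relate_to_shorter else max(len(a), len(b))
    return round(100 * lcss(a, b, min_length) / denom)


pairs = [
    ("Mary Elizabeth Angus", "Mary Elizabeth Francis"),
    ("Xiaojian Zhong", "ZHONG Xiao Jian"),
    ("Mary Elizabeth Angus", "Mary Elizabeth"),
    ("Tan Tan WONG", "WONG Tantan"),
    ("James Patrick Robinson", "Robin Patrick Jameson"),
]
for value_a, value_b in pairs:
    print(value_a, "|", value_b, "|", lcss_percentage(value_a, value_b))
```

Under these assumptions the sketch returns 65, 100, 72, 100, and 85 for the five pairs, matching the table. Calling `lcss_percentage("Mary Elizabeth Angus", "Mary Elizabeth", relate_to_shorter=True)` returns 100, which illustrates how relating to the shorter input tolerates extra data in the longer value.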