Uncertainty String Comparators (Understanding the Sun Match Engine)

The Sun Match Engine provides the following uncertainty comparison functions for comparing string fields. Most uncertainty comparison functions are generic, but three comparison functions are designed for specific types of information (first name, last name, and house number).

Generic String Comparator (u)

This is the standard uncertainty comparison function, which processes string fields as described above. As more differences are found between two fields, the agreement weight decreases nonlinearly. Thus, the agreement weight can remain high for several differences, but will drop sharply at a certain point. This comparison function takes no parameters.

The uncertainty comparison function is based on the Jaro algorithm with McLaughlin adjustments for similarities. The Jaro algorithm is a string comparison function that accounts for insertions, deletions, and transpositions by performing the following steps.

Compute the lengths of both strings to be matched.
Determine the number of common characters between the two strings. For two characters to be considered common, they must be identical and their positions in the two strings must be within one-half the length of the shorter string of each other.
Determine the number of transpositions. A transposition means a character from the first string is out of order with the corresponding common character from the second string.
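
The three steps above can be sketched in Python as follows. This is an illustration only, not the Sun Match Engine implementation: the function name jaro_similarity is hypothetical, the search window follows the "one-half the length of the shorter string" rule stated above, and the final line uses the commonly published Jaro formula, which the text above does not spell out. The McLaughlin adjustments and the conversion to an agreement weight are not shown.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Return a Jaro-style similarity between 0.0 and 1.0 (illustrative sketch)."""
    # Step 1: compute the lengths of both strings.
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 1.0 if len1 == len2 else 0.0

    # Step 2: count common characters. A character is common if the same
    # character occurs in the other string within the search window
    # (assumed here: half the length of the shorter string, per the text).
    window = min(len1, len2) // 2
    matched1 = [False] * len1
    matched2 = [False] * len2
    common = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                common += 1
                break
    if common == 0:
        return 0.0

    # Step 3: count transpositions. Walk the common characters of each
    # string in order; every out-of-order pair counts as half a transposition.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    # Commonly published Jaro formula (an assumption; not given in the text).
    return (common / len1
            + common / len2
            + (common - transpositions) / common) / 3.0
```

For example, jaro_similarity("MARTHA", "MARHTA") finds six common characters and one transposition (the swapped "T" and "H"), so the score stays close to 1.0 despite the difference.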

Advanced Generic String Comparator (ua)

This comparison function is based on the standard uncertainty comparison function, u, with variants of Winkler/Lynch and McLaughlin. It has additional features to handle specific differences between fields, such as key punch and visual memory errors. Each feature makes use of the information made available from previous features. This comparison function takes no parameters. The following features are included in the advanced uncertainty function.

The function determines each character in exact agreement and then assigns a value of 1.0 to each agreeing character. It then determines each disagreeing but similar character and assigns a value of 0.3 to each. Similar characters might occur because of scanning errors (for example, “1” the number versus “l” the letter) or keypunch errors (for example, “S” versus “D”).
The function gives increased value to agreement on the beginning characters of a string. The algorithm adjusts the weighting value up by a fixed amount if the first four characters in each string agree; it adjusts the weighting value up by a smaller amount if only the first three, two, or one characters agree.
The function adjusts the string comparison value if the strings are longer than six characters and more than half of the characters after the fourth character agree.
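
The first two features above can be illustrated with a short Python sketch. Everything here is hypothetical: the names SIMILAR_PAIRS, character_credit, and prefix_boost are invented for the example, the similar-character table contains only the two pairs mentioned in the text, and the prefix adjustment uses the scaling commonly published for Jaro-Winkler comparators (0.1 per agreeing leading character, up to four), which may differ from the amounts the Sun Match Engine actually applies.

```python
# Hypothetical table of similar characters, limited to the pairs named above.
SIMILAR_PAIRS = {frozenset(("1", "l")), frozenset(("S", "D"))}


def character_credit(a: str, b: str) -> float:
    """Credit for one pair of characters: 1.0 exact, 0.3 similar, else 0.0."""
    if a == b:
        return 1.0
    if frozenset((a, b)) in SIMILAR_PAIRS:
        return 0.3
    return 0.0


def prefix_boost(score: float, s1: str, s2: str,
                 scale: float = 0.1, max_prefix: int = 4) -> float:
    """Raise a 0.0-1.0 score according to how many leading characters agree.

    The scale of 0.1 is an assumption taken from the usual Jaro-Winkler
    formulation, not a documented Sun Match Engine value.
    """
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    # The boost shrinks as the score approaches 1.0, so it never exceeds 1.0.
    return score + prefix * scale * (1.0 - score)
```

With these assumptions, agreement on all four leading characters gives the largest boost, and agreement on three, two, or one gives progressively smaller ones, matching the behavior described above.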

Simplified String Comparator (us)

This comparison function is a custom version of a generic string comparison function. It is similar to the basic uncertainty comparison function, u, but processes data in a simpler and more efficient manner, which improves processing speed. The agreement weights generated by this comparison function decrease more uniformly with each difference found between two fields.

Like the basic uncertainty function, the simplex function takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. Unlike the uncertainty comparison function (“u”), this function handles diacritical marks. This comparison function takes no parameters.

Simplified String Comparator - FirstName (uf)

This comparison function is designed specifically for matching on first name fields, and is based on the simplex uncertainty comparison function, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Simplified String Comparator - LastName (ul)

This comparison function is designed specifically for matching on last name fields, and is based on the simplex uncertainty comparison function, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Simplified String Comparator - House Numbers (un)

This comparison function is designed specifically for matching on house numbers, and is based on the simplex uncertainty comparison function, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Language-specific String Comparator (usu)

This comparison function is a custom version of a generic string comparison function. It is similar to the simplex uncertainty comparison function, us, but is based on Unicode to enable multilingual support. This locale-oriented comparator recognizes language-specific conventions. For example, when configured to use the German language set, the function recognizes “ß” and “ss” as equivalent. Like the simplex uncertainty function, the Unicode function takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. This comparison function takes the parameter described in Table 38.

Table 38 usu Comparison Function Parameter


Parameter	Description
language	An indicator of the language being used for the information stored in the database. Enter one of the following codes to indicate the language in use.
	da - Danish
	sv - Swedish
	nb - Norwegian Bokmål
	nn - Norwegian Nynorsk
	nl - Dutch
	es - Spanish
	fr - French
	en - English
	it - Italian
	de - German
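
To make the effect of the language parameter concrete, the following Python sketch validates a language code against the codes listed in Table 38 and applies the one equivalence named in the text (German “ß” and “ss”) before comparison. It is an assumption-laden illustration: LANGUAGE_CODES and normalize_for_language are hypothetical names, and the real usu comparator presumably applies many more language-specific rules than this single replacement.

```python
# Language codes accepted by the usu comparator, as listed in Table 38.
LANGUAGE_CODES = {
    "da": "Danish",
    "sv": "Swedish",
    "nb": "Norwegian Bokmål",
    "nn": "Norwegian Nynorsk",
    "nl": "Dutch",
    "es": "Spanish",
    "fr": "French",
    "en": "English",
    "it": "Italian",
    "de": "German",
}


def normalize_for_language(text: str, language: str) -> str:
    """Apply a language-specific equivalence before comparing (sketch only)."""
    if language not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language code: {language!r}")
    if language == "de":
        # The only rule taken from the text: treat "ß" and "ss" as equivalent.
        return text.replace("ß", "ss")
    return text
```

Under this sketch, "Straße" and "Strasse" normalize to the same string when the language is de, but remain different when the language is en.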