Uncertainty String Comparators

The Master Index Match Engine provides several uncertainty comparison functions for comparing string fields. Most uncertainty comparison functions are generic, but some comparison functions are designed for specific types of information (first name, last name, house number, and national identifiers).

The uncertainty functions include the following:

Advanced Jaro String Comparator (u)
Winkler-Jaro String Comparator (ua)
Condensed String Comparator (us)
Advanced Jaro Adjusted for First Names (uf)
Advanced Jaro Adjusted for Last Names (ul)
Advanced Jaro Adjusted for House Numbers (un)
Advanced Jaro AlphaNumeric Comparator (ujs)
Unicode String Comparator (usu)
Unicode AlphaNumeric Comparator (usus)

Advanced Jaro String Comparator (u)

The Advanced Jaro String Comparator is the standard uncertainty comparison function for processing string fields. This comparison function is based on the Jaro algorithm with McLaughlin adjustments for similarities. The Jaro algorithm is a string comparison function that accounts for insertions, deletions, and transpositions by performing the following steps.

Compute the lengths of both strings to be matched.
Determine the number of common characters between the two strings. For a character to be considered common, it must appear in both strings at positions no farther apart than one-half the length of the shorter string.
Determine the number of transpositions. A transposition means a character from the first string is out of order with the corresponding common character from the second string.
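
The three steps above can be sketched in Python as follows. This is a minimal illustration of the classic Jaro similarity, not the match engine's implementation: it omits the McLaughlin adjustments, the function name jaro is hypothetical, and the search window is assumed to be half the length of the shorter string as described above (other published variants derive the window from the longer string).

```python
def jaro(s1: str, s2: str) -> float:
    """Classic Jaro similarity in [0, 1]; an illustrative sketch only."""
    if not s1 or not s2:
        return 0.0

    # Step 1: compute the lengths of both strings.
    len1, len2 = len(s1), len(s2)

    # Step 2: count common characters inside the search window.
    # Assumption: window = half the length of the shorter string.
    window = min(len1, len2) // 2
    matched1 = [False] * len1
    matched2 = [False] * len2
    common = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                common += 1
                break
    if common == 0:
        return 0.0

    # Step 3: count transpositions, that is, common characters that
    # appear in a different order in the two strings (counted in pairs).
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (common / len1 + common / len2
            + (common - transpositions) / common) / 3
```

For example, jaro("MARTHA", "MARHTA") finds six common characters and one transposition, so the score stays close to 1.0 even though the strings differ.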

As more differences are found between two fields, the agreement weight decreases nonlinearly. Thus, the agreement weight can remain high for several differences, but will drop sharply at a certain point. This comparison function takes no parameters.

Winkler-Jaro String Comparator (ua)

The Winkler-Jaro String Comparator is based on the standard uncertainty comparison function, u, with variants of Winkler/Lynch and McLaughlin. It has additional features to handle specific differences between fields, such as key punch and visual memory errors. Each feature makes use of the information made available from previous features. This comparison function takes no parameters.

The Winkler-Jaro String Comparator includes the following features.

The function determines each character in exact agreement and then assigns a value of 1.0 to each agreeing character. It then determines each disagreeing but similar character and assigns a value of 0.3 to each. Similar characters might occur because of scanning errors (for example, reading the number “1” as the letter “l”) or key punch errors (for example, typing “S” instead of “D”).
The function gives increased value to agreement on the beginning characters of a string. The algorithm adjusts the weighting value up by a fixed amount if the first four characters in each string agree; it adjusts the weighting value up by a smaller amount if only the first three, two, or one characters agree.
The function adjusts the string comparison value if the strings are longer than six characters and more than half of the characters after the fourth character agree.
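
The first two features can be sketched in Python as follows. This is an illustration under stated assumptions, not the match engine's code: the names SIMILAR_PAIRS, char_credit, and prefix_boost are hypothetical, the similar-character table contains only the two pairs mentioned above, and the boost uses the conventional Winkler formula with a scaling factor of 0.1, which may differ from the amounts the engine applies.

```python
# Hypothetical table of similar characters; only the two example pairs
# from the text are included here.
SIMILAR_PAIRS = {frozenset(("1", "l")), frozenset(("S", "D"))}


def char_credit(a: str, b: str) -> float:
    """Credit for one pair of characters: 1.0 exact, 0.3 similar, else 0.0."""
    if a == b:
        return 1.0
    if frozenset((a, b)) in SIMILAR_PAIRS:
        return 0.3
    return 0.0


def prefix_boost(base: float, s1: str, s2: str, scale: float = 0.1) -> float:
    """Raise a base similarity according to agreement on the first
    one to four characters (conventional Winkler formula; an assumption)."""
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    # A longer agreeing prefix yields a larger upward adjustment.
    return base + prefix * scale * (1.0 - base)
```

With this formula, two strings that agree on their first four characters receive the largest adjustment, and strings that disagree on the first character receive none.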

Condensed String Comparator (us)

The Condensed String Comparator is a custom version of a generic string comparison function. It is similar to the Advanced Jaro String Comparator, u, but processes data in a simpler and more efficient manner, improving processing speed. The agreement weights generated by this comparison function decrease in a more uniform manner for each difference found between two fields.

Like the Advanced Jaro String Comparator, the Condensed String Comparator takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. Unlike the Advanced Jaro String Comparator, u, this function handles diacritical marks. This comparison function takes no parameters.

Advanced Jaro Adjusted for First Names (uf)

The Advanced Jaro Adjusted for First Names comparator is designed specifically for matching on first name fields, and is based on the Condensed String Comparator, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Advanced Jaro Adjusted for Last Names (ul)

The Advanced Jaro Adjusted for Last Names comparator is designed specifically for matching on last name fields, and is based on the Condensed String Comparator, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Advanced Jaro Adjusted for House Numbers (un)

The Advanced Jaro Adjusted for House Numbers comparator is designed specifically for matching on house numbers, and is based on the Condensed String Comparator, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Advanced Jaro AlphaNumeric Comparator (ujs)

The Advanced Jaro AlphaNumeric Comparator is a custom version of a generic string comparison function. It is based on the Advanced Jaro String Comparator, u, but is designed specifically for matching on national identifiers, such as social security numbers. This function takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. It can also take into consideration field length, allowed character types, and invalid values. This comparison function takes the parameters described in the following table.

Table 4 ujs Comparison Function Parameters

Parameter	Description
ssnLength	An optional parameter that takes the length of the field value into account. If a fixed length is specified, the match engine considers any field of a different length to be a non-match. Specify any integer smaller than the value specified for the field size in the matching configuration file (for more information, see Matching Rules Section).
recType	An indicator of whether the field must be all numeric. Specify “nu” for numeric only, or specify “an” to allow alphanumeric characters. The match engine considers any fields containing characters that are not allowed to be a non-match.
ssnList	A list of invalid characters for the field. If you specify a character, the match engine considers fields that consist of only that character to be a non-match. For example, if you specify “0”, then an SSN field cannot contain all zeros. Specify as many alphanumeric characters as needed, separated by a space.
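
The effect of the three parameters in the table above can be sketched in Python as follows. This is a hedged illustration of the checks as described, not the match engine's code: the function name passes_ujs_checks and its signature are hypothetical, and the sketch assumes that "alphanumeric" means ASCII letters and digits.

```python
from typing import Optional


def passes_ujs_checks(value: str,
                      ssn_length: Optional[int] = None,
                      rec_type: str = "an",
                      ssn_list: str = "") -> bool:
    """Return False when the described rules would make the field a non-match."""
    # ssnLength: a field of any other length is a non-match.
    if ssn_length is not None and len(value) != ssn_length:
        return False

    # recType: "nu" allows digits only, "an" allows letters and digits.
    if rec_type == "nu" and not (value.isascii() and value.isdigit()):
        return False
    if rec_type == "an" and not (value.isascii() and value.isalnum()):
        return False

    # ssnList: space-separated characters; a field made up of only one
    # of these characters (for example, all zeros) is a non-match.
    for ch in ssn_list.split():
        if value and set(value) == {ch}:
            return False
    return True
```

For example, with ssn_length=9, rec_type="nu", and ssn_list="0 9", the value "123456789" passes, while "000000000", "12345678", and "12A456789" do not.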

Unicode String Comparator (usu)

The Unicode String Comparator is a custom version of a generic string comparison function. It is similar to the Condensed String Comparator, us, but is based on Unicode to enable multilingual support. This locale-oriented comparator recognizes the nuances of each language and supports the complexities and subtleties of each. For example, when configured to use the German language set, the function recognizes “ß” and “ss” as equivalent. Like the Condensed String Comparator, the Unicode function takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. This comparison function takes the parameter described in the following table.

Table 5 usu Comparison Function Parameter

Parameter	Description
language	An indicator of the language being used for the information stored in the database. Enter one of the following codes to indicate the language in use: da (Danish), sv (Swedish), nb (Norwegian Bokmål), nn (Norwegian Nynorsk), nl (Dutch), es (Spanish), fr (French), en (English), it (Italian), de (German).
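
The German example given for this comparator can be illustrated with a small Python sketch. This is an assumption-laden illustration rather than the engine's locale handling: the names SUPPORTED_LANGUAGES and fold_for_language are hypothetical, and the only language rule shown is the “ß” and “ss” equivalence mentioned in the description.

```python
# Language codes accepted by the language parameter, per the table above.
SUPPORTED_LANGUAGES = {
    "da": "Danish", "sv": "Swedish", "nb": "Norwegian Bokmål",
    "nn": "Norwegian Nynorsk", "nl": "Dutch", "es": "Spanish",
    "fr": "French", "en": "English", "it": "Italian", "de": "German",
}


def fold_for_language(value: str, language: str) -> str:
    """Apply a language-specific equivalence before comparing strings."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if language == "de":
        # German: treat "ß" and "ss" as equivalent.
        return value.replace("ß", "ss")
    return value
```

Under this sketch, "Straße" and "Strasse" fold to the same string when the language is de, but remain distinct for the other codes.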

Unicode AlphaNumeric Comparator (usus)

This comparison function is a custom version of a generic string comparison function. It is similar to the Unicode String Comparator, but it is also similar to the Advanced Jaro AlphaNumeric Comparator in that it is designed to work on national identifiers like social security numbers. This locale-oriented comparator recognizes the nuances of each language and supports the complexities and subtleties of each. This function takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. It can also take into consideration field length, allowed character types, and invalid values. This comparison function takes the parameters described in the following table.

Table 6 usus Comparison Function Parameters

Parameter	Description
language	An indicator of the language being used for the information stored in the database. Enter one of the following codes to indicate the language in use: da (Danish), sv (Swedish), nb (Norwegian Bokmål), nn (Norwegian Nynorsk), nl (Dutch), es (Spanish), fr (French), en (English), it (Italian), de (German).
fixed-length	An optional parameter that takes the length of the field value into account. If a fixed length is specified, the match engine considers any field of a different length to be a non-match. Specify any integer smaller than the value specified for the field size in the matching configuration file (for more information, see Matching Rules Section).
character-type	An indicator of whether the field must be all numeric. Specify “nu” for numeric only, or specify “an” to allow alphanumeric characters. The match engine considers any fields containing characters that are not allowed to be a non-match.
invalid-characters	A list of invalid characters for the field. If you specify a character, the match engine considers fields that consist of only that character to be a non-match. For example, if you specify “0”, then an SSN field cannot contain all zeros. Specify as many alphanumeric characters as needed, separated by a space.

	Oracle Java CAPS Master Index Match Engine Reference Java CAPS Documentation