3 OHMPI Match Engine Comparison Functions

This chapter introduces the OHMPI Match Engine comparison functions and provides conceptual information about them.

This chapter includes the following section:

"Learning About the OHMPI Match Engine Comparison Functions"

Learning About the OHMPI Match Engine Comparison Functions

Match field comparison functions, or comparators, compare the values of a field in two records to determine whether the fields match or how closely they match. The fields are then assigned a matching weight based on the results of the comparison function. You can use several different types of comparison functions in the match configuration file in order to customize how the OHMPI Match Engine matches records. The comparators themselves are highly configurable and can be configured to assign differing weights or handle null values. Several comparators accept parameters that further fine-tune the matching process.

The OHMPI Match Engine provides a comprehensive group of match comparison functions to enable matching on a wide variety of data. While you should be able to configure any of the default comparison functions to accurately match your data, you can create new comparison functions and integrate them into a master person index application. For more information, see Chapter 4, "Creating Custom Comparators for the OHMPI Match Engine".

Certain comparison function types are very specific to the type of data being matched, such as the numeric functions and the date functions. Others, such as the Bigram and uncertainty functions, are more general and can be applied to various data fields.

"Bigram Comparators"
"Uncertainty String Comparators"
"Exact Character-to-Character Comparator (c)"
"Numeric Comparators"
"Condensed AlphaNumeric SSN Comparator (nS)"
"Date Comparators"
"Prorated Comparator (p)"

Be sure to review "Table 2-1 Match Configuration File Columns" for information about how the parameters in the match configuration file affect the outcome of the comparator functions. For example, parameters define how null fields are handled and what the actual agreement and disagreement weights are.

Note:

The names of the comparators are configurable. The default names are used here.

Bigram Comparators

The OHMPI Match Engine provides two comparison functions based on the Bigram algorithm: the standard bigram (b1) and the transposition bigram (b2). A Bigram algorithm compares two strings using all combinations of two consecutive characters within each string. For example, the word “bigram” contains the following bigrams: “bi”, “ig”, “gr”, “ra”, and “am”. The Bigram comparison function returns a value between 0 and 1, computed as the number of bigrams that the two strings have in common divided by the average number of bigrams in the strings. Bigrams handle minor typographical errors well.
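The calculation described above can be sketched as follows. This is a minimal illustration of a generic bigram similarity, not the OHMPI implementation; the function names `bigrams` and `bigram_similarity` are hypothetical, and the treatment of strings shorter than two characters is an assumption.

```python
from collections import Counter


def bigrams(text):
    """Return every pair of consecutive characters in text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_similarity(a, b):
    """Shared bigrams divided by the average bigram count (0.0 to 1.0)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a or not grams_b:
        # Assumption: strings too short to form a bigram only match exactly.
        return 1.0 if a == b else 0.0
    # Counter intersection counts each shared bigram at most as often
    # as it appears in both strings.
    common = sum((Counter(grams_a) & Counter(grams_b)).values())
    average = (len(grams_a) + len(grams_b)) / 2
    return common / average


print(bigrams("bigram"))
print(bigram_similarity("bigram", "bigrma"))
```

In this sketch, a single transposition at the end of “bigram” leaves three of the five bigrams in common, so the score stays well above zero.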

Bigram Comparator (b1)

The Bigram Comparator is a standard Bigram comparison function, processing match fields as described above. This comparison function takes no parameters.

Advanced Bigram Comparator (b2)

The Advanced Bigram Comparator is based on the standard Bigram comparison function, but handles transpositions of characters within a string. This comparison function takes no parameters.

Uncertainty String Comparators

The OHMPI Match Engine provides several uncertainty comparison functions for comparing string fields. Most uncertainty comparison functions are generic, but some comparison functions are designed for specific types of information (first name, last name, house number, and national identifiers).

The uncertainty functions include the following:

"Advanced Jaro String Comparator (u)"
"Winkler-Jaro String Comparator (ua)"
"Condensed String Comparator (us)"
"Advanced Jaro Adjusted for First Names (uf)"
"Advanced Jaro Adjusted for Last Names (ul)"
"Advanced Jaro Adjusted for House Numbers (un)"
"Advanced Jaro AlphaNumeric Comparator (ujs)"
"Unicode String Comparator (usu)"
"Unicode AlphaNumeric Comparator (usus)"

Advanced Jaro String Comparator (u)

The Advanced Jaro String Comparator is the standard uncertainty comparison function for processing string fields. This comparison function is based on the Jaro algorithm with McLaughlin adjustments for similarities. The Jaro algorithm is a string comparison function that accounts for insertions, deletions, and transpositions by performing the following steps.

Compute the lengths of both strings to be matched.
Determine the number of common characters between the two strings. In order for characters to be considered common, they must be within one-half the length of the shorter string.
Determine the number of transpositions. A transposition means a character from the first string is out of order with the corresponding common character from the second string.

As more differences are found between two fields, the agreement weight decreases nonlinearly. Thus, the agreement weight can remain high for several differences, but will drop sharply at a certain point. This comparison function takes no parameters.
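The three steps above can be sketched with the classic Jaro similarity formula. This is an illustration only: it omits the McLaughlin similar-character adjustments, uses the matching window stated in step 2 (one-half the length of the shorter string), and the function name `jaro` is hypothetical rather than part of the OHMPI API.

```python
def jaro(s1, s2):
    """Classic Jaro similarity between two strings (0.0 to 1.0)."""
    # Step 1: compute the lengths of both strings.
    if not s1 or not s2:
        return 0.0
    window = min(len(s1), len(s2)) // 2

    # Step 2: find common characters that lie within the window.
    matched2 = [False] * len(s2)
    common1 = []
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched2[j] = True
                common1.append(ch)
                break
    if not common1:
        return 0.0
    common2 = [s2[j] for j in range(len(s2)) if matched2[j]]

    # Step 3: count transpositions (common characters that are out of order).
    out_of_order = sum(a != b for a, b in zip(common1, common2))
    transpositions = out_of_order // 2

    c = len(common1)
    return (c / len(s1) + c / len(s2) + (c - transpositions) / c) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))
```

For “MARTHA” and “MARHTA”, all six characters are common and one transposition is counted, so the score remains high despite the swapped letters.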

Winkler-Jaro String Comparator (ua)

The Winkler-Jaro String Comparator is based on the standard uncertainty comparison function, u, with variants of Winkler/Lynch and McLaughlin. It has additional features to handle specific differences between fields, such as key punch and visual memory errors. Each feature makes use of the information made available from previous features. This comparison function takes no parameters.

The following features are included in the advanced uncertainty function.

The function determines each character in exact agreement and then assigns a value of 1.0 to each agreeing character. It then determines each disagreeing but similar character and assigns a value of 0.3 to each. Similar characters might occur because of scanning errors (for example, inserting “1” the number instead of “l” the letter) or keypunch errors (for example, typing “S” instead of “D”).
The function gives increased value to agreement on the beginning characters of a string. The algorithm adjusts the weighting value up by a fixed amount if the first four characters in each string agree; it adjusts the weighting value up by a smaller amount if only the first three, two, or one characters agree.
The function adjusts the string comparison value if the strings are longer than six characters and more than half of the characters after the fourth character agree.
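The second feature above, the adjustment for agreeing leading characters, can be sketched with the commonly published Jaro-Winkler prefix formula. The scaling factor of 0.1 and the function name `prefix_boost` are assumptions for illustration; the values used inside the OHMPI Match Engine are not stated here.

```python
def prefix_boost(score, s1, s2, scale=0.1, max_prefix=4):
    """Raise a similarity score according to the shared leading characters.

    score is an existing similarity between 0.0 and 1.0. Up to max_prefix
    leading characters are considered, so four agreeing characters give
    the largest increase and fewer agreeing characters give a smaller one.
    """
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    # The boost is proportional to the remaining distance from 1.0,
    # so the adjusted score can never exceed 1.0.
    return score + prefix * scale * (1.0 - score)


print(round(prefix_boost(0.9444, "MARTHA", "MARHTA"), 4))
```

Because the increase is scaled by the distance from 1.0, strings that already score highly gain little, while moderately similar strings with a shared prefix gain more.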

Condensed String Comparator (us)

The Condensed String Comparator is a custom version of a generic string comparison function. It is similar to the Advanced Jaro String Comparator, u, but processes data in a simpler and more efficient manner, improving processing speed. The agreement weights generated by this comparison function decrease in a more uniform manner for each difference found between two fields.

Like the Advanced Jaro String Comparator, the Condensed String Comparator takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. Unlike the uncertainty comparison function (“u”), this function handles diacritical marks. This comparison function takes no parameters.

Advanced Jaro Adjusted for First Names (uf)

The Advanced Jaro Adjusted for First Names comparator is designed specifically for matching on first name fields, and is based on the Condensed String Comparator, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Advanced Jaro Adjusted for Last Names (ul)

The Advanced Jaro Adjusted for Last Names comparator is designed specifically for matching on last name fields, and is based on the Condensed String Comparator, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Advanced Jaro Adjusted for House Numbers (un)

The Advanced Jaro Adjusted for House Numbers comparator is designed specifically for matching on house numbers, and is based on the Condensed String Comparator, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Advanced Jaro AlphaNumeric Comparator (ujs)

The Advanced Jaro AlphaNumeric Comparator is a custom version of a generic string comparison function. It is based on the Advanced Jaro String Comparator, u, but is designed specifically for matching on national identifiers, such as social security numbers. This function takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. It can also take into consideration field length, allowed character types, and invalid values. This comparison function takes the parameters described in the following table.

Table 3-1 ujs Comparison Function Parameters

Parameter	Description
ssnLength	An optional parameter that takes the length of the field value into account. If a fixed length is specified, the match engine considers any field of a different length to be a non-match. Specify any integer smaller than the value specified for the field size in the matching configuration file (for more information, see "Matching Rules Section").
recType	An indicator of whether the field must be all numeric. Specify “nu” for numeric only, or specify “an” to allow alphanumeric characters. The match engine considers any fields containing characters that are not allowed to be a non-match.
ssnList	A list of invalid characters for the field. If you specify a character, the match engine considers fields that consist of only that character to be a non-match. For example, if you specify “0”, then an SSN field cannot contain all zeros. Specify as many alphanumeric characters as needed, separated by a space.
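The three parameters in Table 3-1 act as pre-checks that can force a non-match before any string comparison takes place. The following sketch shows one way such checks might behave. The function name `passes_identifier_rules` and its argument names (modeled on ssnLength, recType, and ssnList) are hypothetical, and the handling of empty values is an assumption.

```python
def passes_identifier_rules(value, ssn_length=None, rec_type="an", ssn_list=""):
    """Return False when a field value must be treated as a non-match."""
    # ssnLength: a value of any other length is a non-match.
    if ssn_length is not None and len(value) != ssn_length:
        return False
    # recType: "nu" allows digits only, "an" allows letters and digits.
    if rec_type == "nu" and not (value.isascii() and value.isdigit()):
        return False
    if rec_type == "an" and not value.isalnum():
        return False
    # ssnList: space-separated characters; a value consisting only of
    # one listed character (for example, all zeros) is a non-match.
    for ch in ssn_list.split():
        if set(value) == {ch}:
            return False
    return True


print(passes_identifier_rules("123456789", 9, "nu", "0 9"))
print(passes_identifier_rules("000000000", 9, "nu", "0 9"))
```

Only values that pass all three checks would go on to the uncertainty comparison itself.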

Unicode String Comparator (usu)

The Unicode String Comparator is a custom version of a generic string comparison function. It is similar to the Condensed String Comparator, us, but is based in Unicode to enable multilingual support. This locale-oriented comparator recognizes the nuances of each language and supports the complexities and subtleties of each. For example, when configured to use the German language set, the function recognizes “ß” and “ss” as equivalent. Like the Condensed String Comparator, the Unicode function takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. This comparison function takes the parameter described in the following table.

Table 3-2 usu Comparison Function Parameter

Parameter	Description
language	An indicator of the language being used for the information stored in the database. Enter one of the following codes to indicate the language in use: da - Danish, sv - Swedish, nb - Norwegian Bokmål, nn - Norwegian Nynorsk, nl - Dutch, es - Spanish, fr - French, en - English, it - Italian, de - German.
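The “ß” and “ss” equivalence mentioned for the German language set can be illustrated with Unicode case folding. This sketch uses Python's default, locale-independent case folding, so it only demonstrates the idea of normalizing values before comparison; it does not reproduce the language-specific rules that the usu comparator applies, and the function name `fold_for_comparison` is hypothetical.

```python
import unicodedata


def fold_for_comparison(text):
    """Normalize and case-fold text so equivalent spellings compare equal."""
    # NFC composes characters such as "o" + combining diaeresis into "ö".
    # casefold() lowercases aggressively and expands "ß" to "ss".
    return unicodedata.normalize("NFC", text).casefold()


print(fold_for_comparison("Straße") == fold_for_comparison("STRASSE"))
```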

Unicode AlphaNumeric Comparator (usus)

This comparison function is a custom version of a generic string comparison function. It is similar to the Unicode String Comparator, but it is also similar to the Advanced Jaro AlphaNumeric Comparator in that it is designed to work on national identifiers like social security numbers. This locale-oriented comparator recognizes the nuances of each language and supports the complexities and subtleties of each. This function takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. It can also take into consideration field length, allowed character types, and invalid values. This comparison function takes the parameters described in the following table.

Table 3-3 usus Comparison Function Parameters

Parameter	Description
language	An indicator of the language being used for the information stored in the database. Enter one of the following codes to indicate the language in use: da - Danish, sv - Swedish, nb - Norwegian Bokmål, nn - Norwegian Nynorsk, nl - Dutch, es - Spanish, fr - French, en - English, it - Italian, de - German.
fixed-length	An optional parameter that takes the length of the field value into account. If a fixed length is specified, the match engine considers any field of a different length to be a non-match. Specify any integer smaller than the field size specified in the matching configuration file (for more information, see "Matching Rules Section").
character-type	An indicator of whether the field must be all numeric. Specify “nu” for numeric only, or specify “an” to allow alphanumeric characters. The match engine considers any fields containing characters that are not allowed to be a non-match.
invalid-characters	A list of invalid characters for the field. If you specify a character, the match engine considers fields that consist of only that character to be a non-match. For example, if you specify “0,” then an SSN field cannot contain all zeros. Specify as many alphanumeric characters as needed, separated by a space.

Exact Character-to-Character Comparator (c)

The OHMPI Match Engine provides one exact-match comparison function, “c.” With this comparison function, two fields must match exactly on each character in order to be considered a match. This comparison function takes no parameters.

Numeric Comparators

The OHMPI Match Engine provides two comparison functions for matching on numeric fields:

"Integer Comparator (nI)"
"Real Number Comparator (nR)"

The Integer Comparator and Real Number Comparator can perform either numeric string comparisons or relative distance calculations. When set for a string comparison, the functions compare numeric strings based on the advanced uncertainty comparator. When set for relative distance calculations, the matching weight between two numbers decreases as the numbers become further apart, until the relative distance plus one is reached. At this point, the numbers are considered non-matches. For example, if the relative distance is “10” and the base number for comparison is “2,” a field value of 8 receives a lower matching weight than a field value of 4; but a field value of 13 is considered a complete non-match (since the distance between 2 and 13 is 11).
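The relative distance example above can be sketched as follows. The text states only that the weight decreases with distance and reaches a non-match at the relative distance plus one; the linear decay used here, the 0.0 to 1.0 scale, and the function name `relative_distance_weight` are assumptions for illustration.

```python
def relative_distance_weight(base, value, distance):
    """Return a weight from 1.0 (identical) down to 0.0 (non-match).

    Assumes a linear decrease that reaches zero when the difference
    equals distance + 1, as described for the numeric comparators.
    """
    difference = abs(value - base)
    if difference > distance:
        return 0.0
    return 1.0 - difference / (distance + 1)


# Base number 2 with a relative distance of 10.
for candidate in (4, 8, 13):
    print(candidate, relative_distance_weight(2, candidate, 10))
```

With these assumptions, 4 receives a higher weight than 8, and 13 is a complete non-match because it is 11 away from the base number.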

Integer Comparator (nI)

The Integer Comparator matches specifically on integers using the logic described above. It accepts the parameters listed in the following table.

Table 3-4 nI Comparison Function Parameters

Parameter	Description
switch	Specifies whether a relative distance calculation or a direct string comparison is used. Specify “y” to use a relative distance calculation; specify “n” to use a string comparison.
range	The greatest difference between two integers at which the values could still be considered a possible match. When the difference between two numbers is greater than the relative distance, the numbers are considered a non-match (the weight becomes zero when the actual difference is the relative distance plus one).

Real Number Comparator (nR)

The Real Number Comparator function matches specifically on real numbers based on the logic described above. It accepts the parameters listed in the following table.

Table 3-5 nR Comparison Function Parameters

Parameter	Description
switch	Specifies whether a relative distance calculation or a direct string comparison is used. Specify “y” to use a relative distance calculation; specify “n” to use a string comparison.
range	The greatest difference between two real numbers at which the values could still be considered a possible match. When the difference between two numbers is greater than the relative distance, the numbers are considered a non-match (the weight becomes zero when the actual difference is the relative distance plus one).

Condensed AlphaNumeric SSN Comparator (nS)

The Condensed AlphaNumeric SSN Comparator is designed specifically for matching on numeric strings and is very useful for matching social security numbers or other unique identifiers. This comparison function can compare either alphanumeric values or numeric values, and takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. It can also take into consideration field length, allowed character types, and invalid values. It accepts the parameters listed in Table 3-6.

Table 3-6 nS Comparison Function Parameters

Parameter	Description
fixed-length	An optional parameter that takes the length of the field value into account. If a fixed length is specified, the match engine considers any field of a different length to be a non-match. Specify any integer smaller than the field size specified in the matching configuration file (for more information, see "Matching Rules Section").
character-type	An indicator of whether the field must be all numeric. Specify “nu” for numeric only, or specify “an” to allow alphanumeric characters. The match engine considers any fields containing characters that are not allowed to be a non-match.
invalid-characters	A list of invalid characters for the field. If you specify a character, the match engine considers fields that consist of only that character to be a non-match. For example, if you specify “0,” then an SSN field cannot contain all zeros. Specify as many alphanumeric characters as needed, separated by a space.

Date Comparators

The OHMPI Match Engine provides various date comparison functions. When comparing dates, the match engine compares each date component (for example, it compares the year in the first date against the year in the second date, the month against the month, and the day against the day). This allows for multiple transpositions in each date field. The date comparators use the Java date format (java.sql.Date), allowing the comparator to use the Gregorian calendar and to take into account the time zone where the date field originated.

The following comparison functions are available for matching on date fields.

"Date Comparator With Years as Units (dY)"
"Date Comparator With Months as Units (dM)"
"Date Comparator With Days as Units (dD)"
"Date Comparator With Hours as Units (dH)"
"Date Comparator With Minutes as Units (dm)"
"Date Comparator With Seconds as Units (ds)"

As with the numeric comparison functions, the date comparison functions can use either a direct string comparison or a relative distance calculation (see Numeric Comparators). When using a relative distance calculation, the matching weight between two dates decreases as the dates become further apart, until the relative distance is reached. When the difference becomes the relative distance plus one, the dates are considered non-matches. You can specify different relative distances for before and after the given date. Any dates falling outside of the specified time period receive a complete disagreement weight. The relative distances are specified in the smallest unit of time being matched.

The weight decreases as the difference between the two compared fields approaches either the before or the after relative distance. For example, suppose the before relative distance is 10 and the after relative distance is 5. When the base date is later than the compared date and the difference between the dates reaches 11 (distance before plus one), the fields are considered a non-match and are given the full disagreement weight. When the base date is earlier than the compared date and the difference between the dates reaches 6 (distance after plus one), the fields are considered a non-match.

The date comparison functions take the parameters listed in "Table 3-7 Date Comparison Function Parameters".

Table 3-7 Date Comparison Function Parameters

Parameter	Description
switch	Specifies whether a relative distance calculation or a direct string comparison is used. Specify “y” to use a relative distance calculation; specify “n” to use a string comparison.
llimit	The number of units prior to the reference date/time for which two date fields can still be considered a match.
ulimit	The number of units following the reference date/time for which two date fields can still be considered a match.
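The effect of llimit and ulimit can be sketched for a comparison measured in days. As with the numeric comparators, the linear decrease and the 0.0 to 1.0 scale are assumptions, and the function name `date_weight_days` is hypothetical; the sketch uses Python's `datetime.date` rather than the Java date type that the match engine uses.

```python
from datetime import date


def date_weight_days(base, compared, llimit, ulimit):
    """Weight two dates using separate before and after distances in days."""
    delta = (compared - base).days
    if delta < 0:
        # The compared date falls before the base date: use llimit.
        limit, difference = llimit, -delta
    else:
        # The compared date falls on or after the base date: use ulimit.
        limit, difference = ulimit, delta
    if difference > limit:
        return 0.0
    return 1.0 - difference / (limit + 1)


base = date(2020, 1, 20)
print(date_weight_days(base, date(2020, 1, 10), llimit=10, ulimit=5))
print(date_weight_days(base, date(2020, 1, 9), llimit=10, ulimit=5))
print(date_weight_days(base, date(2020, 1, 26), llimit=10, ulimit=5))
```

With a before distance of 10 and an after distance of 5, a date 11 days earlier or 6 days later than the base date is treated as a non-match in this sketch.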

Date Comparator With Years as Units (dY)

This date comparison function takes only the 4-character year into account for matching. If relative distance calculation is specified, the relative distance is specified in years.

Date Comparator With Months as Units (dM)

This date comparison function takes the month and year into account for matching. If relative distance calculation is specified, the relative distance is specified in months.

Date Comparator With Days as Units (dD)

This date comparison function takes the day, month, and year into account for matching. If relative distance calculation is specified, the relative distance is specified in days.

Date Comparator With Hours as Units (dH)

This date comparison function takes the hour, day, month, and year into account for matching. If relative distance calculation is specified, the relative distance is specified in hours.

Date Comparator With Minutes as Units (dm)

This date comparison function takes the minute, hour, day, month, and year into account for matching. If relative distance calculation is specified, the relative distance is specified in minutes.

Date Comparator With Seconds as Units (ds)

This date comparison function takes the second, minute, hour, day, month, and year into account for matching. If relative distance calculation is specified, the relative distance is specified in seconds.

Prorated Comparator (p)

The Prorated Comparator uses a relative distance calculation and allows you to specify how quickly the agreement weight between two fields decreases. Matching weights are assigned with a linear adjustment according to the parameters you specify. You specify an initial agreement range. If the difference between two fields falls within that range, the fields are considered a complete match. You also specify a disagreement range ending with the relative distance. If the difference between two fields falls within that range, the fields are considered a non-match. When the difference between the fields falls between those two ranges, they are considered to be partial matches and the agreement weight is adjusted linearly. Any difference greater than the relative distance is always considered a non-match.

Note:

Increasing the disagreement weight causes the prorated agreement weight to decrease more sharply.

The prorated comparison function takes the parameters listed in "Table 3-8 Prorated Comparison Function Parameters".

Table 3-8 Prorated Comparison Function Parameters

Parameter	Description
range	The greatest difference between two numbers at which they can still be considered a match or partial match.
tolerance1	The greatest difference between two numbers at which they are considered a full match. This number must be less than the relative distance.
tolerance2	The amount that determines the minimum difference at which two numbers are considered a non-match, which shortens or lengthens the weighting scale. To find this minimum difference, the match engine subtracts this value from the relative distance. If the fields differ by that amount or more, they are considered a non-match. The weighting scale decreases in size as the value of this parameter increases.
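The interaction of the three parameters in Table 3-8 can be sketched as a piecewise linear function. This is an illustration under assumptions: weights are expressed on a 0.0 to 1.0 scale rather than as agreement and disagreement weights, and the function name `prorated_weight` is hypothetical.

```python
def prorated_weight(difference, range_, tolerance1, tolerance2):
    """Return 1.0 for a full match, 0.0 for a non-match, linear in between."""
    difference = abs(difference)
    # Non-match begins at the relative distance minus tolerance2.
    non_match_start = range_ - tolerance2
    if difference <= tolerance1:
        return 1.0
    if difference >= non_match_start:
        return 0.0
    # Partial match: scale linearly across the remaining span.
    span = non_match_start - tolerance1
    return 1.0 - (difference - tolerance1) / span


# Relative distance 20, full match up to 5, non-match from 15 (20 - 5).
for diff in (3, 10, 15, 25):
    print(diff, prorated_weight(diff, 20, 5, 5))
```

Raising tolerance2 moves the non-match threshold closer to tolerance1, which shortens the span over which partial weights are assigned.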