2 Match Engine Matching Configuration

This chapter introduces the matching configuration files for the Oracle Healthcare Master Person Index (OHMPI) Match Engine, including the formatting rules and interdependencies that must be followed. The following sections provide an overview of the two matching configuration files, describe their architecture and format, and summarize the comparison functions used in the match configuration file.

This chapter includes the following sections:

Understanding the OHMPI Match Engine Match Configuration File
Learning About the OHMPI Match Engine Comparator Definition List

Understanding the OHMPI Match Engine Match Configuration File

The matching configuration files define how the OHMPI Match Engine processes records to assign matching probability weights, allowing the master person index application to identify matches, potential duplicates, and non-matches. The match engine includes two configurable files, the match configuration file and the comparators list. Together these files define additional logic for the OHMPI Match Engine to use when determining the matching probability between two records.

The matching configuration is very flexible, allowing you to customize the matching logic according to the type of data being matched and for the record matching requirements of your business. In a master person index application, the matching configuration files are stored in the master person index project and are located in the Match Engine node of the project. The OHMPI Standardization Engine typically standardizes the data prior to matching, so the match process is performed against the standardized data.

The match configuration file, matchConfigFile.cfg, contains the matching logic for each field on which matching is performed. By default, this file defines the matching logic for the three primary data types (person names, business names, and addresses), and can also handle generic data types, such as dates, numbers, social security numbers, and characters.

The OHMPI Match Engine provides several comparison functions that you can call in the match configuration file to fine-tune the match process. Comparison functions contain the logic to compare different types of data in very specific ways in order to arrive at a match weight for each field. These functions allow you to define how matching is performed for different data types and can be used in conjunction with either matching and unmatching probabilities or agreement and disagreement weight ranges for each field. The match configuration file also defines how missing fields are handled.

The following sections describe the format of the configuration file and provide an overview of the predefined comparison functions:

OHMPI Match Engine Match Configuration File Format
OHMPI Match Engine Matching Comparison Functions at a Glance

These sections describe the format of the files so you can modify them directly. You can also modify the match configuration file using the OHMPI Configuration Editor, which provides an easy, graphical way to configure matching rules.

OHMPI Match Engine Match Configuration File Format

The match configuration file is divided into two sections. The first section consists of one line that indicates the matching probability type. The second section consists of the matching rules to use for each match field. In a master person index application, this file can be modified from the Matching tab of the Master Person Index Configuration Editor. For more information, see the Oracle Healthcare Master Person Index Configuration Guide.

Match Configuration File Sample

Following is an excerpt from the default match configuration file. This excerpt illustrates the components that are described in the following sections.

ProbabilityType            1

FirstName              15  0   uf    0.99  0.001   10  -8
LastName               15  0   ul    0.99  0.001   10  -10
String                 25  0   ua    0.99  0.001   8   -8
DateDays               20  0   dD    0.99  0.001   10  -10 y 15      30
DateMonths             20  0   dM    0.99  0.001   10  -10 n
DateHours              20  0   dH    0.99  0.001   10  -10 y 30      60
DateMinutes            20  0   dm    0.99  0.001   10  -10 y 300 600
DateSeconds            20  0   ds    0.99  0.001   10  -10 y 75      60
Integer                15  0   nI    0.99  0.001   10  -10 n
Real                   15  0   nR    0.99  0.001   10  -10 n
Char                   1   0   c     0.99  0.001   5   -5
pro                    15  0   p     0.99  0.001   10  -10 20 5 5
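As a rough illustration of the file layout, the following Python sketch reads a shortened copy of the excerpt above: the first line gives the probability type, and each remaining row is split on whitespace into the columns named in the Matching Rules Section. The `MatchRule` class, the `parse_match_config` function, and the `extras` field are hypothetical names chosen for this example and are not part of the OHMPI Match Engine. Because parameters and data sources cannot be told apart without the comparator definition, the sketch keeps everything after the eighth column as unparsed strings.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MatchRule:
    """One match field row (hypothetical container for this sketch)."""
    match_field: str
    size: int
    null_field: str
    function: str
    m_prob: float
    u_prob: float
    agreement: float
    disagreement: float
    # Comparator parameters and data sources, left as raw strings.
    extras: List[str] = field(default_factory=list)


def parse_match_config(text: str) -> Tuple[int, List[MatchRule]]:
    """Split the file into its probability type and its match field rows."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    if len(header) != 2 or header[0] != "ProbabilityType":
        raise ValueError("first line must be 'ProbabilityType <0|1>'")
    rules = []
    for cols in body:
        if len(cols) < 8:
            raise ValueError("row needs at least 8 columns: " + " ".join(cols))
        rules.append(MatchRule(
            cols[0], int(cols[1]), cols[2], cols[3],
            float(cols[4]), float(cols[5]),
            float(cols[6]), float(cols[7]),
            cols[8:],
        ))
    return int(header[1]), rules


# Three rows copied from the sample excerpt.
SAMPLE = """\
ProbabilityType            1

FirstName              15  0   uf    0.99  0.001   10  -8
DateDays               20  0   dD    0.99  0.001   10  -10 y 15      30
pro                    15  0   p     0.99  0.001   10  -10 20 5 5
"""

probability_type, rules = parse_match_config(SAMPLE)
for rule in rules:
    print(rule.match_field, rule.function, rule.extras)
```

The sketch does no validation beyond the column count; a real reader would also check value ranges and the comparator codes against the comparators list.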

Probability Type Section

The first line of the match configuration file defines the probability type to use for matching. Specify "0" (zero) to use m-probabilities and u-probabilities to determine a field's match weight; specify "1" (one) to use agreement and disagreement weight ranges. If the probability type is set to use agreement and disagreement weight ranges, the m-prob and u-prob columns in the matching rules section are ignored. Likewise, if the probability type is set to use m-probabilities and u-probabilities, the agreement-weight and disagreement-weight columns in the matching rules section are ignored. The default is to use agreement and disagreement weight ranges because they are more intuitive.

For more information about probabilities and weights, see Probabilities and Direct Weights.
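This chapter states that, for probability type 0, the agreement and disagreement weights are calculated with a logarithmic formula based on the m-probabilities and u-probabilities, but it does not give the formula. The sketch below assumes the base-2 log ratios commonly used in probabilistic record linkage (the Fellegi-Sunter model); the formula the OHMPI Match Engine actually uses may differ. The function name `weights_from_probabilities` is hypothetical.

```python
import math
from typing import Tuple


def weights_from_probabilities(m_prob: float, u_prob: float) -> Tuple[float, float]:
    """Derive (agreement, disagreement) weights from m and u probabilities.

    Assumption: standard Fellegi-Sunter log2 ratios, not a formula
    confirmed by this chapter.
    """
    if not (0.0 < m_prob < 1.0 and 0.0 < u_prob < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    agreement = math.log2(m_prob / u_prob)
    disagreement = math.log2((1.0 - m_prob) / (1.0 - u_prob))
    return agreement, disagreement


# The m-prob and u-prob values used throughout the sample excerpt.
agreement, disagreement = weights_from_probabilities(0.99, 0.001)
print(round(agreement, 2), round(disagreement, 2))
```

Under this assumption, a high m-probability combined with a low u-probability produces a large positive agreement weight and a negative disagreement weight. The explicit weight ranges of probability type 1 express the same idea directly.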

Matching Rules Section

The section after the first line of the match configuration file contains match field rows, with each row defining how a certain data type or field will be matched. These are the rules you specify in the match string you define for a master person index application. The syntax for this section is:

match-field size null-field function m-prob u-prob agreement disagreement params data-sources

Table 2-1 describes each element in a match field row.

Table 2-1 Match Configuration File Columns

Column Number	Column Name	Description
1	match-field	A value that indicates to the Master Person Index Match Engine how each field should be weighted. Each field included in the match string (the MatchingConfig section of `mefa.xml`) must have a match type corresponding to a value in this column.
2	size	The number of characters in the field on which matching is performed, beginning with the first character. For example, to match on only the first four characters in a 10-digit field, the value of this column should be "4."
3	null-field	An index that specifies how to calculate the total weight for null fields or fields that only contain spaces. You can specify any of the following values:
0 - (zero) If one or both fields are empty, the weight used for the field is 0 (zero).
1 - (one) If both fields are empty, the agreement weight is used; if only one field is empty, the disagreement weight is used.
a# - An "a" followed by a number specifies to use the agreement weight if one or both fields are empty or null. The agreement weight is divided by the number following the "a" to obtain the match weight for that field. If no number is specified, the default is 2. You can specify any number from 1 through 10.
d# - A "d" followed by a number specifies to use the disagreement weight if one or both fields are empty or null. The disagreement weight is divided by the number following the "d" to obtain the match weight for the field. If no number is specified, the default is 2. You can specify any number from 1 through 10.
em# - An "em" (empty multiple) followed by a number specifies the use of a multiplication factor on disagreement weight if only one field is empty. The disagreement weight is multiplied by the number following the "em" to obtain the match weight for the field. If no number is specified, the default is 1. You can specify any number from 1 through 10. If both fields are empty, the weight used for the field is 0.
ef# - An "ef" (empty fraction) followed by a number specifies the use of a fractional factor on disagreement weight if only one field is empty. The disagreement weight is divided by the number following the "ef" to obtain the match weight for the field. If no number is specified, the default is 1. You can specify any number from 1 through 10. If both fields are empty, the weight used for the field is 0.
Note: In the above descriptions, the agreement and disagreement weights are either specified in the file or calculated using a logarithmic formula based on the m and u-probabilities (depending on the probability type).
4	function	The type of comparison to perform when weighting the field. For information about the available comparison functions, see Chapter 5, "OHMPI Match Engine Comparison Functions". An overview of the comparison functions is provided in Table 2-2.
5	m-prob	The initial probability that the specified field in two records will match if the records match. The probability is a double value between 0 and 1, and can have up to 16 decimal places.
6	u-prob	The initial probability that the specified field in two records will match if the records do not match. The probability is a double value between 0 and 1, and can have up to 16 decimal places.
7	agreement	The matching weight to be assigned to a field given that the fields match between two records. This number can be between 0 and 100 and can have up to 16 decimal places. It represents the maximum match weight for a field.
8	disagreement	The matching weight to be assigned to a field given that the fields do not match between two records. This number can be between 0 and -100 and can have up to 16 decimal places. It represents the minimum match weight for a field.
9	params	The parameters that correspond to the comparison function specified in column 4. Some comparison functions do not take any parameters and some take multiple parameters. For additional information about parameters, see Chapter 5, "OHMPI Match Engine Comparison Functions".
10	data-sources	The complete path to any data sources used by the comparison function specified in column 4. You can define as many data sources as there are data sources listed for the comparator in the comparators list file. The default comparators do not use data sources, but you can create a custom comparator that does.
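The null-field rules in Table 2-1 can be read as a small decision procedure. The following Python sketch is one interpretation of those rules, not the engine's actual code. The function name `null_field_weight` is hypothetical, and the agreement and disagreement weights are assumed to have been determined already, either from the file or from the probabilities.

```python
import re
from typing import Optional


def null_field_weight(index: str, agreement: float, disagreement: float,
                      a_empty: bool, b_empty: bool) -> Optional[float]:
    """Return the weight for a field when at least one value is empty.

    Returns None when both values are present, because the comparison
    function decides the weight in that case.
    """
    empty_count = int(a_empty) + int(b_empty)
    if empty_count == 0:
        return None
    if index == "0":
        return 0.0
    if index == "1":
        return agreement if empty_count == 2 else disagreement

    parsed = re.fullmatch(r"(a|d|em|ef)(\d+)?", index)
    if parsed is None:
        raise ValueError("unknown null-field index: " + index)
    kind, digits = parsed.group(1), parsed.group(2)
    # Defaults from Table 2-1: 2 for a# and d#, 1 for em# and ef#.
    default = 2 if kind in ("a", "d") else 1
    number = int(digits) if digits else default
    if not 1 <= number <= 10:
        raise ValueError("number must be from 1 through 10")

    if kind == "a":
        return agreement / number
    if kind == "d":
        return disagreement / number
    # em# and ef# only apply when exactly one field is empty.
    if empty_count == 2:
        return 0.0
    if kind == "em":
        return disagreement * number
    return disagreement / number


# FirstName weights from the sample excerpt: agreement 10, disagreement -8.
print(null_field_weight("a", 10, -8, True, False))    # 5.0
print(null_field_weight("em3", 10, -8, True, False))  # -24
```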

OHMPI Match Engine Matching Comparison Functions at a Glance

Match field comparison functions, or comparators, compare the values of a field in two records to determine whether the fields match. The fields are then assigned a matching weight based on the results of the comparison function. You can use several different types of comparison functions in the match configuration file to define how the OHMPI Match Engine should match the fields in the match string. The OHMPI Match Engine provides several options to use with each function. You can also define custom comparison functions. For more information, see Chapter 6, "Creating Custom Comparators for the OHMPI Match Engine".

Table 2-2 summarizes each comparison function. A complete reference of the comparison functions and their parameters is included in Chapter 5, "OHMPI Match Engine Comparison Functions".

Note:

The names of these comparison functions are configurable. Table 2-2 lists their default names.

Table 2-2 Comparison Function Summary

Comparison Function	Name	Description
b1	Bigram Comparator	Compares two strings using an algorithm based on the Bigram algorithm. This function compares two strings using all combinations of two consecutive characters and returns the total number of combinations that are the same.
b2	Advanced Bigram Comparator	Compares two strings allowing for character transpositions. This function is similar to the standard Bigram Comparator (b1).
u	Advanced Jaro String Comparator	Compares two strings taking into account uncertainty factors, such as string length, transpositions, and characters in common. This function is based on the Jaro algorithm.
ua	Winkler-Jaro String Comparator	Compares two strings similar to the Advanced Jaro String Comparator (u), but increases the agreement weight if the initial characters of each string are exact matches. This function takes into account key punch and visual memory errors. It is based on the Jaro algorithm with variants of Winkler/Lynch and McLaughlin.
uf	Advanced Jaro Adjusted for First Names	Based on the generic string comparator (u), this function is designed to specifically weight first name values. The string is analyzed and the weight adjusted based on statistical data.
ul	Advanced Jaro Adjusted for Last Names	Based on the generic string comparator (u), this function is designed to specifically weight last name values. The string is analyzed and the weight adjusted based on statistical data.
un	Advanced Jaro Adjusted for House Numbers	Based on the generic string comparator (u), this function is designed to specifically weight house number values. The string is analyzed and the weight adjusted based on statistical data.
us	Condensed String Comparator	A custom string comparator similar to the Advanced Jaro String Comparator (u). It compares two strings taking into account uncertainty factors such as string length, transpositions, key punch errors, and visual memory errors. Unlike the Advanced Jaro String Comparator, this function handles diacritical marks. It also improves processing speed.
usu	Unicode String Comparator	Compares two strings in a manner similar to the Condensed String Comparator (us), but this function is based on Unicode to support multiple languages and alphabets. This comparator takes one parameter indicating the language to use.
usus	Unicode AlphaNumeric Comparator	Compares two strings in a manner similar to the Unicode String Comparator (usu), but this function is designed to match on unique identifiers such as national IDs. This comparator takes one parameter indicating the language to use, plus any of the following parameters: field length, character types, and invalid values.
ujs	Advanced Jaro AlphaNumeric Comparator	Compares two strings in a manner similar to the Advanced Jaro String Comparator (u), but this function is designed to match on unique identifiers such as national IDs. This comparator takes any of the following parameters: field length, character types, and invalid values.
c	Exact Character-to-Character Comparator	Compares string fields character by character. Each character must match in order for an agreement weight to be assigned.
nI	Integer Comparator	Compares integer fields using a relative distance value to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. This comparator takes two parameters; the first indicates whether to use a relative distance or direct string comparison and the second indicates the relative distance to use.
nR	Real Number Comparator	Compares fields containing real numbers using a relative distance value to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. This comparator takes two parameters; the first indicates whether to use a relative distance or direct string comparison, and the second indicates the relative distance to use.
nS	Condensed AlphaNumeric SSN Comparator	Compares social security numbers or other unique identifiers, taking into account any of these parameters: field length, character types, and invalid values.
dY	Date Comparator With Years as Units	Compares year values using relative distance values prior to and following the given year to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. The date comparison functions handle Gregorian years. This comparator takes up to three parameters; the first indicates whether to use a relative distance or direct string comparison, and the second and third indicate the relative distance before and after.
dM	Date Comparator With Months as Units	Compares the month and year using a relative distance as described above for the year comparison function (dY).
dD	Date Comparator With Days as Units	Compares the day, month, and year using a relative distance as described above for the year comparison function (dY).
dH	Date Comparator With Hours as Units	Compares the hour, day, month, and year using a relative distance as described above for the year comparison function (dY).
dm	Date Comparator With Minutes as Units	Compares the minute, hour, day, month, and year using a relative distance as described above for the year comparison function (dY).
ds	Date Comparator With Seconds as Units	Compares the second, minute, hour, day, month, and year using a relative distance as described above for the year comparison function (dY).
p	Prorated Comparator	Prorates the disagreement weight for a date or numeric field based on values you specify. Differences greater than the amount you specify receive the full disagreement weight. This comparator takes three parameters indicating the relative distance and the agreement and disagreement ranges.
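To make two of the descriptions in Table 2-2 more concrete, the following Python sketch shows a count of shared bigrams, as described for b1, and a relative-distance weight, as described for nI, nR, and the date comparators. Both functions are simplified illustrations with hypothetical names. In particular, the linear decrease from the agreement weight to the disagreement weight is an assumption, since the table only says that the weight decreases as the difference grows.

```python
from collections import Counter


def shared_bigrams(a: str, b: str) -> int:
    """Count the two-character sequences that both strings contain."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    # Counter intersection keeps the smaller count of each shared bigram.
    return sum((grams_a & grams_b).values())


def relative_distance_weight(a: float, b: float, distance: float,
                             agreement: float, disagreement: float) -> float:
    """Weight two numbers by how far apart they are.

    Assumption: the weight falls linearly from the agreement weight
    (no difference) to the disagreement weight (difference equal to
    the relative distance), and stays there beyond that distance.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    difference = abs(a - b)
    if difference > distance:
        return disagreement
    return agreement - (agreement - disagreement) * difference / distance


print(shared_bigrams("martha", "marhta"))             # 2
print(relative_distance_weight(10, 12.5, 5, 10, -10))  # 0.0
```

The complete reference in Chapter 5 remains the authority on how each comparator actually scores a pair of values and which parameters it accepts.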

Learning About the OHMPI Match Engine Comparator Definition List

The comparator definition list defines each comparator that is included in a master person index application. If a comparator is not included in this list, it cannot be used in the application. If you define a comparator in this list that is not provided with the OHMPI Match Engine, you need to define the logic of the new comparator in Java classes (for more information, see Chapter 6, "Creating Custom Comparators for the OHMPI Match Engine").

Below is an excerpt from the default comparators list file that defines two numeric comparators, Real Number Comparator and Integer Comparator. Both comparators take two parameters and depend on a second comparator class named CondensedStringComparator.

<comparator description="Numerics comparator">
  <className>NumericsComparator</className>
  <codes>            
    <code description="Real Number Comparator" name="nR"/>
    <code description="Integer Comparator" name="nI" />
  </codes>   
  <params>
    <param description="distance/string comparison option" 
           name="switch" type="java.lang.String"/>
    <param description="Spectrum of comparison" 
           name="range" type="java.lang.Integer|java.lang.Double"/>
  </params>                                  
  <data-sources/>
  <dependency-classes>
    <dependency-class matchfield="CSC"
     name="com.sun.mdm.matcher.comparators.base.CondensedStringComparator"/>
  </dependency-classes>
  <curve-adjust status="false"/>
</comparator>

The comparators are defined in XML format. Table 2-3 lists and describes each element in the XML file.

Table 2-3 Comparator Definition List Elements

Element	Attribute	Description
group	-	An element that contains a list of comparators that all share the same Java package.
group	description	A brief description of the comparator group.
group	path	The Java package that contains the code that defines the comparators in the group.
comparator	-	A definition for one subgroup of comparators that are all based on the same Java class, have the same Java class dependencies, accept the same parameters and data sources, and have the same curve adjustment setting.
comparator	description	A brief description of the comparator subgroup.
className	-	The name of the class that defines the logic for the comparators. The class must be contained in the package specified for the group element, as described above.
codes	-	A container element for a list of the comparators in the subgroup, with descriptions and processing codes of each comparator.
code	-	A description and processing code for one comparator.
code	description	A description of the comparator. The value you specify here appears in the comparator drop-down list on the Master Person Index Configuration Editor.
code	name	A unique identifying name for the comparator. These are the comparator names used in the rules definitions in the match configuration file (`matchConfigFile.cfg`).
params	-	A container element for a list of static parameters for the subgroup of comparators. Parameters are optional.
param	-	One parameter definition for the comparators.
param	description	A brief description of the parameter.
param	name	A short name for the parameter.
param	type	The Java data type of the values that can be specified for the parameter.
data-sources	-	A container element for a list of data files that contain additional information for the subgroup of comparators. For example, a comparator that generates weights based on the distance between postal codes might use lookup files containing information about the zip codes. Data sources are optional.
data-source	-	A definition for one data source. Currently, only file data sources are supported.
data-source	description	A brief description of the data source.
data-source	name	The complete path and filename of the data source.
data-source	type	The type of data source being used. Currently, the only value you can specify is "java.io.File".
dependency-classes	-	A container element that defines a list of Java classes on which the comparator class is dependent. The current comparator class inherits from the comparator classes you specify here as well as all the match fields (defined in `matchConfigFile.cfg`) that use that comparator.
dependency-class	-	A definition for one comparator class, called a dependency comparator, on which the current comparator class is dependent.
dependency-class	matchField	The name of the dependency comparator's match field.
dependency-class	name	The name of the dependency comparator class.
curve-adjust	-	An indicator of whether to apply special adjustments to the weighting curve. The curve adjustment is defined for each comparator individually in a Java class named `comparator_nameCurveAdjustor`, where comparator_name is a placeholder for the comparator name.
curve-adjust	status	The status of the curve adjustor. Specify true to use the curve adjustor; specify false to disable the curve adjustor.
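As a sketch of how the elements in Table 2-3 fit together, the following Python example reads a copy of the numeric comparator excerpt with the standard `xml.etree.ElementTree` module and collects the values that the table describes. The `summarize_comparator` function and the dictionary keys are hypothetical names used only for this illustration. The Real Number Comparator code is written as `nR` here, matching Table 2-2.

```python
import xml.etree.ElementTree as ET

# Copy of the comparator excerpt shown earlier in this section.
COMPARATOR_XML = """\
<comparator description="Numerics comparator">
  <className>NumericsComparator</className>
  <codes>
    <code description="Real Number Comparator" name="nR"/>
    <code description="Integer Comparator" name="nI"/>
  </codes>
  <params>
    <param description="distance/string comparison option"
           name="switch" type="java.lang.String"/>
    <param description="Spectrum of comparison"
           name="range" type="java.lang.Integer|java.lang.Double"/>
  </params>
  <data-sources/>
  <dependency-classes>
    <dependency-class matchfield="CSC"
     name="com.sun.mdm.matcher.comparators.base.CondensedStringComparator"/>
  </dependency-classes>
  <curve-adjust status="false"/>
</comparator>
"""


def summarize_comparator(xml_text: str) -> dict:
    """Collect the main settings of one comparator element."""
    root = ET.fromstring(xml_text)
    return {
        "class_name": root.findtext("className"),
        # Codes are the names used in column 4 of matchConfigFile.cfg.
        "codes": [c.get("name") for c in root.findall("codes/code")],
        "params": [(p.get("name"), p.get("type"))
                   for p in root.findall("params/param")],
        "data_sources": [d.get("name")
                         for d in root.findall("data-sources/data-source")],
        "dependencies": [d.get("name") for d in
                         root.findall("dependency-classes/dependency-class")],
        "curve_adjust": root.find("curve-adjust").get("status") == "true",
    }


summary = summarize_comparator(COMPARATOR_XML)
print(summary["codes"], summary["curve_adjust"])
```

A listing like this can help confirm that every comparator code referenced in the match configuration file is actually defined in the comparators list.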