Oracle® Healthcare Master Person Index Match Engine Reference Release 1.1 Part Number E18470-01
This chapter introduces you to the matching configuration files for the Oracle Healthcare Master Person Index (OHMPI) Match Engine, including certain rules for formatting and interdependencies that must be followed. The following sections provide an overview of the two matching configuration files provided, the architecture of those files, and formatting descriptions. They also include an overview of comparison functions used in the match configuration file.
This chapter includes the following sections:
"Understanding the OHMPI Match Engine Match Configuration File"
"Learning About the OHMPI Match Engine Comparator Definition List"
The matching configuration files define how the OHMPI Match Engine processes records to assign matching probability weights, allowing the master person index application to identify matches, potential duplicates, and non-matches. The match engine includes two configurable files, the match configuration file and the comparators list. Together these files define additional logic for the OHMPI Match Engine to use when determining the matching probability between two records.
The matching configuration is very flexible, allowing you to customize the matching logic according to the type of data being matched and for the record matching requirements of your business. In a master person index application, the matching configuration files are stored in the master person index project and are located in the Match Engine node of the project. The OHMPI Standardization Engine typically standardizes the data prior to matching, so the match process is performed against the standardized data.
The match configuration file, matchConfigFile.cfg, contains the matching logic for each field on which matching is performed. By default, this file defines the matching logic for the three primary data types (person names, business names, and addresses), and can also handle generic data types, such as dates, numbers, social security numbers, and characters.
The OHMPI Match Engine provides several comparison functions that you can call in this file to fine-tune the match process. Comparison functions contain the logic to compare different types of data in very specific ways in order to arrive at a match weight for each field. These functions allow you to define how matching is performed for different data types and can be used in conjunction with either matching and unmatching probabilities or agreement and disagreement weight ranges for each field. This file also defines how to handle missing fields.
The following sections describe the format of the configuration file and provide an overview of the predefined comparison functions, so that you can modify the file directly. You can also modify the match configuration file using the OHMPI Configuration Editor, which provides a graphical way to configure matching rules.
The match configuration file is divided into two sections. The first section consists of one line that indicates the matching probability type. The second section consists of the matching rules to use for each match field. In a master person index application, this file can be modified from the Matching tab of the Master Person Index Configuration Editor. For more information, see “Configuring Comparison Functions for a Master Person Index Application” in Oracle Healthcare Master Person Index Configuration Guide (Part Number E18473-01).
Following is an excerpt from the default match configuration file. This excerpt illustrates the components that are described in the following sections.
    ProbabilityType 1
    FirstName 15 0 uf 0.99 0.001 10 -8
    LastName 15 0 ul 0.99 0.001 10 -10
    String 25 0 ua 0.99 0.001 8 -8
    DateDays 20 0 dD 0.99 0.001 10 -10 y 15 30
    DateMonths 20 0 dM 0.99 0.001 10 -10 n
    DateHours 20 0 dH 0.99 0.001 10 -10 y 30 60
    DateMinutes 20 0 dm 0.99 0.001 10 -10 y 300 600
    DateSeconds 20 0 ds 0.99 0.001 10 -10 y 75 60
    Integer 15 0 nI 0.99 0.001 10 -10 n
    Real 15 0 nR 0.99 0.001 10 -10 n
    Char 1 0 c 0.99 0.001 5 -5
    pro 15 0 p 0.99 0.001 10 -10 20 5 5
The first line of the match configuration file defines the probability type to use for matching. Specify “0” (zero) to use m-probabilities and u-probabilities to determine a field's match weight; specify “1” (one) to use agreement and disagreement weight ranges. If the probability type is set to use agreement and disagreement weight ranges, the m-prob and u-prob columns in the matching rules section are ignored. Likewise, if the probability type is set to use m-probabilities and u-probabilities, the agreement-weight and disagreement-weight columns in the matching rules section are ignored. The default is to use agreement and disagreement weight ranges because they are more intuitive.
For more information about probabilities and weights, see "Probabilities and Direct Weights".
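As a rough illustration of how m-probabilities and u-probabilities can translate into weights, the following sketch assumes the conventional Fellegi-Sunter log-likelihood ratios with base-2 logarithms. This guide states only that the weights are calculated using a logarithmic formula, so the exact formula and logarithm base used by the OHMPI Match Engine are assumptions here, and the function name is hypothetical.

```python
import math
from typing import Tuple


def weights_from_probabilities(m_prob: float, u_prob: float) -> Tuple[float, float]:
    """Return (agreement, disagreement) weights for one field.

    Assumption: standard Fellegi-Sunter weights with base-2 logarithms.
    The formula actually used by the match engine may differ.
    """
    if not (0.0 < m_prob < 1.0 and 0.0 < u_prob < 1.0):
        raise ValueError("probabilities must be strictly between 0 and 1")
    # Weight when the field agrees: how much more likely agreement is
    # among true matches than among non-matches.
    agreement = math.log2(m_prob / u_prob)
    # Weight when the field disagrees: the same ratio for disagreement.
    disagreement = math.log2((1.0 - m_prob) / (1.0 - u_prob))
    return agreement, disagreement


agree, disagree = weights_from_probabilities(0.99, 0.001)
print(f"agreement={agree:.2f} disagreement={disagree:.2f}")
```

Under these assumptions, the m-prob and u-prob values shown in the excerpt (0.99 and 0.001) give an agreement weight of roughly 9.95 and a disagreement weight of roughly -6.64.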
The section after the first line of the match configuration file contains match field rows, with each row defining how a certain data type or field will be matched. These are the rules you specify in the match string you define for a master person index application. The syntax for this section is:
match-field size null-field function m-prob u-prob agreement disagreement params data-sources
The following table describes each element in a match field row.
Table 2-1 Match Configuration File Columns
Column Number | Column Name | Description |
---|---|---|
1 | match-field | A value that indicates to the Master Person Index Match Engine how each field should be weighted. Each field included in the match string (the MatchingConfig section of mefa.xml) must have a match type corresponding to a value in this column. |
2 | size | The number of characters in the field on which matching is performed, beginning with the first character. For example, to match on only the first four characters in a 10-digit field, the value of this column should be “4.” |
3 | null-field | An index that specifies how to calculate the total weight for null fields or fields that only contain spaces. You can specify any of the following values:
Note: In the above descriptions, the agreement and disagreement weights are either specified in the file or calculated using a logarithmic formula based on the m and u-probabilities (depending on the probability type). |
4 | function | The type of comparison to perform when weighting the field. For information about the available comparison functions, see Chapter 3, "OHMPI Match Engine Comparison Functions". An overview of the comparison functions is provided in "Table 2-2 Comparison Function Summary". |
5 | m-prob | The initial probability that the specified field in two records will match if the records match. The probability is a double value between 0 and 1, and can have up to 16 decimal places. |
6 | u-prob | The initial probability that the specified field in two records will match if the records do not match. The probability is a double value between 0 and 1, and can have up to 16 decimal places. |
7 | agreement | The matching weight to be assigned to a field given that the fields match between two records. This number can be between 0 and 100 and can have up to 16 decimal places. It represents the maximum match weight for a field. |
8 | disagreement | The matching weight to be assigned to a field given that the fields do not match between two records. This number can be between 0 and -100 and can have up to 16 decimal places. It represents the minimum match weight for a field. |
9 | params | The parameters that correspond to the comparison function specified in column 4. Some comparison functions do not take any parameters and some take multiple parameters. For additional information about parameters, see Chapter 3, "OHMPI Match Engine Comparison Functions". |
10 | data-sources | The complete path to any data sources used by the comparison function specified in column 4. You can define as many data sources as there are data sources listed for the comparator in the comparators list file. The default comparators do not use data sources, but you can create a custom comparator that does. |
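The column layout above can be made concrete with a small parser. The following is a minimal sketch, not the engine's own loader: the class and function names are hypothetical, the null-field value is kept as a string because its allowed values are not shown here, and everything after column 8 is kept as one list because parameters and data sources cannot be told apart without the comparator definition.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MatchFieldRow:
    match_field: str
    size: int
    null_field: str          # kept as text; allowed values are not assumed
    function: str
    m_prob: float
    u_prob: float
    agreement: float
    disagreement: float
    extras: List[str] = field(default_factory=list)  # params, then data sources


def parse_match_config(text: str) -> Tuple[int, List[MatchFieldRow]]:
    """Split the file into the probability type and the match field rows."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines or len(lines[0]) != 2 or lines[0][0] != "ProbabilityType":
        raise ValueError("first line must be 'ProbabilityType 0' or 'ProbabilityType 1'")
    probability_type = int(lines[0][1])
    rows = []
    for tokens in lines[1:]:
        if len(tokens) < 8:
            raise ValueError(f"row has fewer than 8 columns: {tokens}")
        rows.append(MatchFieldRow(
            tokens[0], int(tokens[1]), tokens[2], tokens[3],
            float(tokens[4]), float(tokens[5]),
            float(tokens[6]), float(tokens[7]),
            tokens[8:],
        ))
    return probability_type, rows


SAMPLE = """\
ProbabilityType 1
FirstName 15 0 uf 0.99 0.001 10 -8
DateDays 20 0 dD 0.99 0.001 10 -10 y 15 30
"""

ptype, rows = parse_match_config(SAMPLE)
for row in rows:
    print(row.match_field, row.function, row.agreement, row.disagreement, row.extras)
```

The sample rows are taken from the default file excerpt; the DateDays row shows how the three comparator parameters end up in the trailing list.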
Match field comparison functions, or comparators, compare the values of a field in two records to determine whether the fields match. The fields are then assigned a matching weight based on the results of the comparison function. You can use several different types of comparison functions in the match configuration file to define how the OHMPI Match Engine should match the fields in the match string. The OHMPI Match Engine provides several options to use with each function. You can also define custom comparison functions. For more information, see Chapter 4, "Creating Custom Comparators for the OHMPI Match Engine".
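To give a feel for what a comparison function computes, the following sketch counts the two-character sequences that two strings have in common, in the spirit of the Bigram Comparator summarized in Table 2-2. It is an illustrative approximation only: the function names are hypothetical, and the normalization to a 0-1 score is an assumption rather than the engine's documented behavior.

```python
from collections import Counter


def bigrams(s: str) -> Counter:
    """Count every pair of consecutive characters in the string."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def shared_bigram_count(a: str, b: str) -> int:
    """Number of bigrams that occur in both strings (multiset intersection)."""
    return sum((bigrams(a) & bigrams(b)).values())


def bigram_similarity(a: str, b: str) -> float:
    """Assumed normalization: shared bigrams over the bigram count of the longer string."""
    total = max(len(a), len(b)) - 1
    if total <= 0:
        # Strings too short to contain a bigram: fall back to exact equality.
        return 1.0 if a == b else 0.0
    return shared_bigram_count(a, b) / total


print(shared_bigram_count("SMITH", "SMYTH"))
print(bigram_similarity("SMITH", "SMYTH"))
```

For "SMITH" and "SMYTH", the shared bigrams are "SM" and "TH", so the count is 2 and the normalized score is 0.5.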
The following table summarizes each comparison function. A complete reference of the comparison functions and their parameters is included in Chapter 3, "OHMPI Match Engine Comparison Functions".
Note: The names of these comparison functions are configurable. The following table lists their default names.

Table 2-2 Comparison Function Summary
Comparison Function | Name | Description |
---|---|---|
b1 | Bigram Comparator | Compares two strings using an algorithm based on the Bigram algorithm. This function compares two strings using all combinations of two consecutive characters and returns the total number of combinations that are the same. |
b2 | Advanced Bigram Comparator | Compares two strings allowing for character transpositions. This function is similar to the standard Bigram Comparator (b1). |
u | Advanced Jaro String Comparator | Compares two strings taking into account uncertainty factors, such as string length, transpositions, and characters in common. This function is based on the Jaro algorithm. |
ua | Winkler-Jaro String Comparator | Compares two strings similar to the Advanced Jaro String Comparator (u), but increases the agreement weight if the initial characters of each string are exact matches. This function takes into account key punch and visual memory errors. It is based on the Jaro algorithm with variants of Winkler/Lynch and McLaughlin. |
uf | Advanced Jaro Adjusted for First Names | Based on the generic string comparator (u), this function is designed to specifically weight first name values. The string is analyzed and the weight adjusted based on statistical data. |
ul | Advanced Jaro Adjusted for Last Names | Based on the generic string comparator (u), this function is designed to specifically weight last name values. The string is analyzed and the weight adjusted based on statistical data. |
un | Advanced Jaro Adjusted for House Numbers | Based on the generic string comparator (u), this function is designed to specifically weight house number values. The string is analyzed and the weight adjusted based on statistical data. |
us | Condensed String Comparator | Compares two strings similar to the Advanced Jaro String Comparator (u), but this function is a custom string comparator that compares two strings taking into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. Unlike the Advanced Jaro String Comparator, this function handles diacritical marks. This function also improves processing speed. |
usu | Unicode String Comparator | Compares two strings similar to the Condensed String Comparator (us), but this function is based in Unicode to support multiple languages and alphabets. This comparator takes one parameter indicating the language to use. |
usus | Unicode AlphaNumeric Comparator | Compares two strings similar to the Unicode String Comparator, but this function is designed to match on unique identifiers such as national IDs. This comparator takes one parameter indicating the language to use plus any of the following parameters:
|
ujs | Advanced Jaro AlphaNumeric Comparator | Compares two strings similar to the Advanced Jaro String Comparator, but this function is designed to match on unique identifiers such as national IDs. This comparator takes any of the following parameters:
|
c | Exact Character-to-Character Comparator | Compares string fields character by character. Each character must match in order for an agreement weight to be assigned. |
nI | Integer Comparator | Compares integer fields using a relative distance value to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. This comparator takes two parameters; the first indicates whether to use a relative distance or direct string comparison and the second indicates the relative distance to use. |
nR | Real Number Comparator | Compares fields containing real numbers using a relative distance value to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. This comparator takes two parameters; the first indicates whether to use a relative distance or direct string comparison, and the second indicates the relative distance to use. |
nS | Condensed AlphaNumeric SSN Comparator | Compares social security numbers or other unique identifiers, taking into account any of these parameters:
|
dY | Date Comparator With Years as Units | Compares year values using relative distance values prior to and following the given year to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. The date comparison functions handle Gregorian years. This comparator takes up to three parameters; the first indicates whether to use a relative distance or direct string comparison, and the second and third indicate the relative distance before and after. |
dM | Date Comparator With Months as Units | Compares the month and year using a relative distance as described above for the year comparison function (dY). |
dD | Date Comparator With Days as Units | Compares the day, month, and year using a relative distance as described above for the year comparison function (dY). |
dH | Date Comparator With Hours as Units | Compares the hour, day, month, and year using a relative distance as described above for the year comparison function (dY). |
dm | Date Comparator With Minutes as Units | Compares the minute, hour, day, month, and year using a relative distance as described above for the year comparison function (dY). |
ds | Date Comparator With Seconds as Units | Compares the second, minute, hour, day, month, and year using a relative distance as described above for the year comparison function (dY). |
p | Prorated Comparator | Prorates the disagreement weight for a date or numeric field based on values you specify. Differences greater than the amount you specify receive the full disagreement weight. This comparator takes three parameters indicating the relative distance and the agreement and disagreement ranges. |
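The relative-distance behavior described for the numeric and date comparators can be sketched as follows. The table says only that the match weight decreases as the difference grows and that a disagreement weight is assigned beyond the relative distance; the linear decrease used here is an assumption, and the function name is hypothetical.

```python
def relative_distance_weight(a: float, b: float, distance: float,
                             agreement: float, disagreement: float) -> float:
    """Weight two numeric values by how far apart they are.

    Assumption: the weight falls linearly from the agreement weight
    (identical values) to the disagreement weight (difference equal to
    the relative distance). The engine's actual curve may differ.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    diff = abs(a - b)
    if diff > distance:
        # Beyond the relative distance: full disagreement weight.
        return disagreement
    return agreement - (agreement - disagreement) * (diff / distance)


# Agreement weight 10, disagreement weight -10, relative distance 10.
for other in (100, 105, 150):
    print(other, relative_distance_weight(100, other, 10, 10.0, -10.0))
```

With these example values, identical inputs receive the full agreement weight, a difference of half the relative distance lands midway between the two weights, and a difference well beyond the relative distance receives the full disagreement weight.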
The comparator definition list defines each comparator that is included in a master person index application. If a comparator is not included in this list, it cannot be used in the application. If you define a comparator in this list that is not provided with the OHMPI Match Engine, you need to define the logic of the new comparator in Java classes (for more information, see Chapter 4, "Creating Custom Comparators for the OHMPI Match Engine").
Below is an excerpt from the default comparators list file that defines two numeric comparators, Real Number Comparator and Integer Comparator. Both comparators take two parameters, and are dependent on a second comparator class named CondensedStringComparator.

    <comparator description="Numerics comparator">
      <className>NumericsComparator</className>
      <codes>
        <code description="Real Number Comparator" name="n[R, ]"/>
        <code description="Integer Comparator" name="nI" />
      </codes>
      <params>
        <param description="distance/string comparison option" name="switch" type="java.lang.String"/>
        <param description="Spectrum of comparison" name="range" type="java.lang.Integer|java.lang.Double"/>
      </params>
      <data-sources/>
      <dependency-classes>
        <dependency-class matchfield="CSC" name="com.sun.mdm.matcher.comparators.base.CondensedStringComparator"/>
      </dependency-classes>
      <curve-adjust status="false"/>
    </comparator>
The comparators are defined in XML format. The following table lists and describes each element in the XML file.
Table 2-3 Comparator Definition List Elements
| Element | Attribute | Description |
|---|---|---|
| group | | An element that contains a list of comparators that all share the same Java package. |
| | description | A brief description of the comparator group. |
| | path | The Java package that contains the code that defines the comparators in the group. |
| comparator | | A definition for one subgroup of comparators that are all based on the same Java class, have the same Java class dependencies, accept the same parameters and data sources, and have the same curve adjustment setting. |
| | description | A brief description of the comparator subgroup. |
| className | | The name of the class that defines the logic for the comparators. The class must be contained in the package specified for the group element, as described above. |
| codes | | A container element for a list of the comparators in the subgroup, with descriptions and processing codes of each comparator. |
| code | | A description and processing code for one comparator. |
| | description | A description of the comparator. The value you specify here appears in the comparator drop-down list on the Master Person Index Configuration Editor. |
| | name | A unique identifying name for the comparator. These are the comparator names used in the rules definitions in the match configuration file (matchConfigFile.cfg). |
| params | | A container element for a list of static parameters for the subgroup of comparators. Parameters are optional. |
| param | | One parameter definition for the comparators. |
| | description | A brief description of the parameter. |
| | name | A short name for the parameter. |
| | type | The Java data type of the values that can be specified for the parameter. |
| data-sources | | A container element for a list of data files that contain additional information for the subgroup of comparators. For example, a comparator that generates weights based on the distance between postal codes might use lookup files containing information about the zip codes. Data sources are optional. |
| data-source | | A definition for one data source. Currently, only file data sources are supported. |
| | description | A brief description of the data source. |
| | name | The complete path and filename of the data source. |
| | type | The type of data source being used. Currently, the only value you can specify is "java.io.File". |
| dependency-classes | | A container element that defines a list of Java classes on which the comparator class is dependent. The current comparator class inherits from the comparator classes you specify here as well as all the match fields (defined in matchConfigFile.cfg) that use that comparator. |
| dependency-class | | A definition for one comparator class, called a dependency comparator, on which the current comparator class is dependent. |
| | matchField | The name of the dependency comparator's match field. |
| | name | The name of the dependency comparator class. |
| curve-adjust | | An indicator of whether to apply special adjustments to the weighting curve. The curve adjustment is defined for each comparator individually in a Java class named comparator_nameCurveAdjustor. |
| | status | The status of the curve adjustor. Specify true to use the curve adjustor; specify false to disable the curve adjustor. |
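To show how the elements in Table 2-3 fit together, the following sketch reads the comparator excerpt shown earlier with Python's standard xml.etree.ElementTree module and collects its codes, parameters, dependencies, and curve adjustment status. The function name and the shape of the returned dictionary are hypothetical, and the sketch covers a single comparator element rather than a complete comparators list file.

```python
import xml.etree.ElementTree as ET

COMPARATOR_XML = """
<comparator description="Numerics comparator">
  <className>NumericsComparator</className>
  <codes>
    <code description="Real Number Comparator" name="n[R, ]"/>
    <code description="Integer Comparator" name="nI"/>
  </codes>
  <params>
    <param description="distance/string comparison option" name="switch" type="java.lang.String"/>
    <param description="Spectrum of comparison" name="range" type="java.lang.Integer|java.lang.Double"/>
  </params>
  <data-sources/>
  <dependency-classes>
    <dependency-class matchfield="CSC" name="com.sun.mdm.matcher.comparators.base.CondensedStringComparator"/>
  </dependency-classes>
  <curve-adjust status="false"/>
</comparator>
"""


def summarize_comparator(xml_text: str) -> dict:
    """Collect the main settings of one comparator element."""
    root = ET.fromstring(xml_text.strip())
    curve = root.find("curve-adjust")
    return {
        "class_name": root.findtext("className"),
        # Map each comparator code (as used in matchConfigFile.cfg) to its description.
        "codes": {c.get("name"): c.get("description") for c in root.iter("code")},
        "params": [(p.get("name"), p.get("type")) for p in root.iter("param")],
        "data_sources": [d.get("name") for d in root.iter("data-source")],
        "dependencies": [d.get("name") for d in root.iter("dependency-class")],
        "curve_adjust": curve is not None and curve.get("status") == "true",
    }


summary = summarize_comparator(COMPARATOR_XML)
for key, value in summary.items():
    print(key, "=", value)
```

A listing like this can help confirm that every function code used in the match configuration file is defined in the comparators list, which the application requires before a comparator can be used.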