5 Match Engine Configuration for Common Data

This chapter provides conceptual information on how the OHMPI Match Engine can match on any type of data. Common data types for matching include person names, addresses, and business names. It also provides information on configuring the match engine for matching on these data types in a master person index application, fine-tuning weights and measures, and customizing match configuration and thresholds.

This chapter includes the following sections:

"Learning About the OHMPI Match String and Match Types"
"Configuring the Match String for a Master Person Index Application"
"Fine-Tuning Weights and Thresholds for Oracle Healthcare Master Person Index"

Learning About the OHMPI Match String and Match Types

This section provides information about the OHMPI match string, match string fields, and match types.

"The OHMPI Match String"
"OHMPI Match Engine Match String Fields"
"OHMPI Match Engine Match Types"

The OHMPI Match String

The data string that is passed to the OHMPI Match Engine for match processing is called the match string. For a master person index application, the match string is defined in the MatchingConfig section of mefa.xml. The match and standardization engine configuration files, the blocking query, and the matching configuration are closely linked in the search and matching processes. The blocking query defines the select statements for creating the candidate selection pool during the matching process. The matching configuration defines the match string that is passed to the match engine from the records in the candidate selection pool. Finally, the OHMPI Match Engine configuration files define how the match string is processed.

The OHMPI Match Engine configuration files depend on the match string. When you modify the match string, make sure that the match type you specify corresponds to the correct row in the match configuration file (matchConfigFile.cfg). For example, if you are using person matching and add “MaritalStatus” as a match field, you need to specify a match type for the MaritalStatus field that is listed in the first column of the match configuration file. You must also make sure that the matching logic defined in the corresponding row of the match configuration file is appropriate for matching on the MaritalStatus field. For more information about match types, see "OHMPI Match Engine Match Types".

OHMPI Match Engine Match String Fields

In a master person index application, the match string processed by the OHMPI Match Engine is defined by the match fields specified in mefa.xml, and the logic for how the fields are matched is defined in the match configuration file (matchConfigFile.cfg). The match engine can process any combination of fields you specify for matching using the predefined comparators or any new comparators you define. Not all fields in a record need to be processed by the OHMPI Match Engine. Before you define the match string, analyze your data to determine the fields that are most likely to indicate a match or non-match between two records.

The following sections provide additional information about the match string for different data types:

"Person Data Match String Fields"
"Address Data Match String Fields"
"Business Name Match String Fields"

Person Data Match String Fields

By default, the match configuration file (matchConfigFile.cfg) includes rows specifically for matching on first name, last name, social security numbers, and dates (such as a date of birth). It also includes a row for matching a single character with logic specialized for a gender field. You can use any of the existing rows for matching or you can add rows for the fields you want to match. When matching on person names, determine whether you want to use the original field values, the normalized field values, or the phonetic values. The match engine can handle any of these types of fields, but the best comparator for each type might be different. Also determine how much weight you want to give each field type and configure the match configuration file accordingly.

Address Data Match String Fields

By default, the match configuration file (matchConfigFile.cfg) includes rows specifically for matching on the fields that are parsed from the street address fields, such as the street number, street direction, and so on. The file also defines several generic match types you can configure for address fields. You can use any of the existing rows for matching or you can add rows for the fields you want to match. If you specify an “Address” match type for any field in the Master Person Index Wizard, the default fields that store the parsed data are automatically added to the match string in mefa.xml. These fields include the house number, street direction, street type, and street name. You can remove any of these fields from the match string.

When matching on address fields, determine whether you want to use the original field values, the standardized field values, or the phonetic values. The match engine can handle any of these types of fields, but the best comparator for each type might be different. Also determine how much weight you want to give each field type and configure the match configuration file accordingly.

Business Name Match String Fields

By default, the match configuration file (matchConfigFile.cfg) includes rows specifically for matching on the fields that are parsed from the business name fields. The file also defines several generic match types you can customize to use with business name fields. You can use any of the existing rows for matching or you can add rows for the fields you want to match. If you specify a “BusinessName” match type for any field in the wizard, most of the parsed business name fields are automatically added to the match string in mefa.xml, including the name, organization type, association type, sector, industry, and URL. You can remove any of these fields from the match string.

When matching on business name fields, determine whether you want to use the original field values, the standardized field values, or the phonetic values. The match engine can handle any of these types of fields, but the best comparator for each type might be different. Also determine how much weight you want to give each field type and configure the match configuration file accordingly.

OHMPI Match Engine Match Types

The default match configuration file, matchConfigFile.cfg, defines several rules that you can customize for the type of data being processed. Each rule is identified by a match type in the first column of each row. This value tells the match engine which type of matching to perform. In a master person index application, the match type is entered for each field in the match string section of mefa.xml.

The OHMPI Match Engine's match configuration file, matchConfigFile.cfg, appears under the Match Engine node of the master person index project. For more information about the comparison functions used for each match type and how the weights are tuned, see "Customizing the Match Configuration" and Chapter 3, "OHMPI Match Engine Comparison Functions".

The following four tables list match types that are typically used in processing different data types, including:

"Table 5-1 Person Data Match Types"
"Table 5-2 Address Match Types"
"Table 5-3 Business Name Match Types"
"Table 5-4 Miscellaneous Match Types"

The following match types are designed for matching on person data.

Table 5-1 Person Data Match Types

This match type ...	processes this data type ...
FirstName	A first name field, including middle name, alias first name, and alias middle name fields.
LastName	A last name field, including alias last name fields.
SSN	A field containing a social security number.
Gender	A field containing a gender code.

The following match types are designed for matching on address data.

Table 5-2 Address Match Types

This match type ...	processes this data type ...
StreetName	The parsed street name field of a street address.
HouseNumber	The parsed house number field of a street address.
StreetDir	The parsed street direction field of a street address.
StreetType	The parsed street type field of a street address.

The following match types are designed for matching on business names.

Table 5-3 Business Name Match Types

This match type ...	processes this data type ...
PrimaryName	The parsed name field of a business name.
OrgTypeKeyword	The parsed organization type field of a business name.
AssocTypeKeyword	The parsed association type field of a business name.
LocationTypeKeyword	The parsed location type field of a business name.
AliasList	The parsed alias type field of a business name.
IndustrySectorList	The parsed industry sector field of a business name.
IndustryTypeKeyword	The parsed industry type field of a business name.
Url	The parsed URL field of a business name.

Miscellaneous match types provide additional logic for matching on a variety of data types, such as date, numeric, string, and character fields.

Table 5-4 Miscellaneous Match Types

This match type ...	processes this data type ...
Date	The year of a date field.
DateDays	The day, month, and year of a date field.
DateMonths	The month and year of a date field.
DateHours	The hour, day, month, and year of a date field.
DateMinutes	The minute, hour, day, month, and year of a date field.
DateSeconds	The seconds, minute, hour, day, month, and year of a date field.
String	A generic string field.
Unistring	A generic Unicode string field.
Integer	A field containing integers.
Real	A field containing real numbers.
Char	A field containing a single character.
pro	Any field on which you want the OHMPI Match Engine to use prorated weights.
Exac	Any field you want the OHMPI Match Engine to match character for character.
CSC	A generic string.
DOB	A date of birth in string rather than date format.

Configuring the Match String for a Master Person Index Application

The MatchingConfig section of mefa.xml determines which fields are passed to the OHMPI Match Engine for matching (the match string). The match types specified in this section help the match engine determine the algorithm and custom logic to use for matching on each field.

If you are matching on fields parsed from a free-form text field, define each individual parsed field you want to use for matching in the Master Person Index Wizard or Configuration Editor. The match types you can use for each field in this section are defined in the first column of the match configuration file (matchConfigFile.cfg). Make sure the match type you specify has the correct matching logic defined in the match configuration file. See "OHMPI Match Engine Match Types" for more information.

The following topics provide more information about matching on different types of data:

"Configuring the Match String for Person Data"
"Configuring the Match String for Address Data"
"Configuring the Match String for Business Names"

Configuring the Match String for Person Data

When matching on person data, you can include any field stored in the database for matching. To configure the match string, follow the instructions under “Defining the Master Person Index Match String” in Oracle Healthcare Master Person Index Configuration Guide (Part Number E18473-01). For the OHMPI Match Engine, each data type has a different match type (specified by the match-type element in the matching configuration file). The FirstName, LastName, SSN, Gender, and DOB match types are specific to person matching. You can specify any of the other match types defined in the match configuration file as well. For more information, see "OHMPI Match Engine Match Types".

A sample match string for person matching is shown below. This sample matches on first and last names, date of birth, social security number, gender, and the street name of the address.

<match-system-object>
   <object-name>Person</object-name>
   <match-columns>
      <match-column>
         <column-name>
            Enterprise.SystemSBR.Person.FirstName_Std
         </column-name>
         <match-type>FirstName</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.LastName_Std
         </column-name>
         <match-type>LastName</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.SSN
         </column-name>
         <match-type>SSN</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.DOB
         </column-name>
         <match-type>DateDays</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Gender
         </column-name>
         <match-type>Char</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.StreetName
         </column-name>
         <match-type>StreetName</match-type>
      </match-column>
   </match-columns>
</match-system-object>
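
Because every match-type value must correspond to a row in matchConfigFile.cfg, it can help to check a match string before deploying it. The following Python sketch is illustrative only and is not part of OHMPI. It assumes the set of valid match types is the one listed in Tables 5-1 through 5-4 (in a real project, the first column of matchConfigFile.cfg is authoritative), and the MaritalStatus column with the misspelled MaritalStat match type is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Assumption: match types taken from Tables 5-1 through 5-4 of this chapter.
# The authoritative list is the first column of matchConfigFile.cfg.
KNOWN_MATCH_TYPES = {
    "FirstName", "LastName", "SSN", "Gender",
    "StreetName", "HouseNumber", "StreetDir", "StreetType",
    "PrimaryName", "OrgTypeKeyword", "AssocTypeKeyword",
    "LocationTypeKeyword", "AliasList", "IndustrySectorList",
    "IndustryTypeKeyword", "Url",
    "Date", "DateDays", "DateMonths", "DateHours", "DateMinutes",
    "DateSeconds", "String", "Unistring", "Integer", "Real", "Char",
    "pro", "Exac", "CSC", "DOB",
}


def unknown_match_types(fragment, known=KNOWN_MATCH_TYPES):
    """Return (column-name, match-type) pairs whose match type is not known."""
    root = ET.fromstring(fragment.strip())
    problems = []
    for column in root.iter("match-column"):
        # findtext returns None when the element is missing; treat as empty.
        name = (column.findtext("column-name") or "").strip()
        match_type = (column.findtext("match-type") or "").strip()
        if match_type not in known:
            problems.append((name, match_type))
    return problems


# Hypothetical fragment: the second column uses a match type that does not exist.
SAMPLE = """
<match-system-object>
   <object-name>Person</object-name>
   <match-columns>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.SSN
         </column-name>
         <match-type>SSN</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.MaritalStatus</column-name>
         <match-type>MaritalStat</match-type>
      </match-column>
   </match-columns>
</match-system-object>
"""

print(unknown_match_types(SAMPLE))
```

The check strips surrounding whitespace from element text, so column names that wrap onto a second line, as in the sample above, are still reported cleanly.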

Configuring the Match String for Address Data

For matching on street address fields, make sure the match string you specify in the MatchingConfig section of mefa.xml contains all or a subset of the fields that contain the standardized data (the original text in street address fields is generally too inconsistent to use for matching). You can include additional fields for matching, such as the city name or postal code.

To configure the match string, follow the instructions under “Defining the Master Person Index Match String” in Oracle Healthcare Master Person Index Configuration Guide (Part Number E18473-01). For the OHMPI Match Engine, each component of a street address has a different match type (specified by the match-type element in the matching configuration file). The default match types for addresses are StreetName, HouseNumber, StreetDir, and StreetType. You can specify any of the other match types defined in the match configuration file as well. For more information, see "OHMPI Match Engine Match Types".

A sample match string for address matching is shown below.

<match-system-object>
   <object-name>Person</object-name>
   <match-columns>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.StreetName
         </column-name>
         <match-type>StreetName</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.HouseNumber
         </column-name>
         <match-type>HouseNumber</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.StreetDir
         </column-name>
         <match-type>StreetDir</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.StreetType
         </column-name>
         <match-type>StreetType</match-type>
      </match-column>
   </match-columns>
</match-system-object>

Configuring the Match String for Business Names

For matching on business name fields, make sure the match string you specify in the MatchingConfig section of mefa.xml contains all or a subset of the fields that contain the standardized data (the unparsed business names are typically too inconsistent for matching). You can include additional fields for matching if required.

To configure the match string, follow the instructions under “Defining the Master Person Index Match String” in Oracle Healthcare Master Person Index Configuration Guide (Part Number E18473-01). For the OHMPI Match Engine, each data type has a different match type (specified by the match-type element of the matching configuration file). The PrimaryName, OrgTypeKeyword, AssocTypeKeyword, IndustrySectorList, IndustryTypeKeyword, and Url match types are specific to business name matching. You can specify any of the other match types defined in the match configuration file, as well. For more information, see "OHMPI Match Engine Match Types".

A sample match string for business name matching is shown below. This sample matches on the company name, the organization type, and the sector.

<match-system-object>
   <object-name>Company</object-name>
   <match-columns>
      <match-column>
         <column-name>Enterprise.SystemSBR.Company.Name_PrimaryName
         </column-name>
         <match-type>PrimaryName</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Company.Name_OrgType
         </column-name>
         <match-type>OrgTypeKeyword</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Company.Name_Sector
         </column-name>
         <match-type>IndustrySectorList</match-type>
      </match-column>
   </match-columns>
</match-system-object>

Fine-Tuning Weights and Thresholds for Oracle Healthcare Master Person Index

Each Oracle Healthcare Master Person Index implementation is unique, typically requiring extensive data analysis to determine how to best configure the structure and matching logic of the master person index application. The following topics provide an overview of the process of fine-tuning the matching logic in the match configuration file and fine-tuning the match and duplicate thresholds.

"Data Analysis Overview"
"Customizing the Match Configuration and Thresholds"

Data Analysis Overview

A thorough analysis of the data to be shared with the master person index application is essential before beginning any implementation. This analysis not only defines the types of data to include in the object structure, but also indicates the relative reliability of each system's data, helps determine which fields to use for matching, and indicates the relative reliability of each match field.

To begin the analysis, the legacy data that will be converted into the master person index database is extracted and analyzed. Once the initial analysis is complete, you can perform an iterative process to fine-tune the matching and duplicate thresholds and to determine the level of potential duplication in the existing data. If you plan to use the Data Profiler and Bulk Matcher tools generated by Oracle Healthcare Master Person Index to analyze data, review the information in Oracle Healthcare Master Person Index Analyzing and Cleansing Data User's Guide (Part Number 18589-01) and Oracle Healthcare Master Person Index Loading the Initial Data Set User's Guide (Part Number 18590-01) before you extract the legacy data.

Customizing the Match Configuration and Thresholds

There are three primary steps to customizing how records are matched in a master person index application.

"Determining the Match Fields"
"Customizing the Match Configuration"
"Determining the Weight Thresholds"

Determining the Match Fields

Before extracting data for analysis, review the types of data stored in the messages generated by each system. Use these messages to determine which fields and objects to include in the object structure of the master person index application. From this object structure, select the fields to use for matching. When selecting these fields, keep in mind how representative each field is of a specific object. For example, in a master person index, the social security number field, first and last name fields, and birth date are good representations whereas marital status, suffix, and title are not. Certain address information or a home telephone number might also be considered. In a master company index, the match fields might include any of the fields parsed from the complete company name field, as well as a tax ID number or address and telephone information.

Customizing the Match Configuration

Once you determine the fields to use for matching, determine how the weights will be generated for each field. The primary tasks include determining whether to use probabilities or agreement weight ranges and then choosing the best comparison functions to use for each match field.

Probabilities or Agreement Weights

The first step in configuring the match configuration is to decide whether to use m-probabilities and u-probabilities or agreement and disagreement weight ranges. Both methods will give you similar results, but agreement and disagreement weight ranges allow you to specify the precise maximum and minimum weights that can be applied to each match field, giving you control over the value of the highest and lowest matching weights that can be assigned to each record.

Defining Relative Value

For each field used for matching, define either the m-probabilities and u-probabilities or the agreement and disagreement weight ranges in the match configuration file. Review the information provided under "OHMPI Match Engine Matching Weight Formulation" to help determine how to configure these values. Remember that a higher m-probability or agreement weight gives the field a higher weight when field values agree.

Determining the Weight Range

To find the initial values for the match and duplicate thresholds, you must determine the total range of matching weights that can be assigned to a record. The matching weight of a record is the sum of the weights assigned to its match fields. The data analysis tools provided with Oracle Healthcare Master Person Index can also help you determine the match and duplicate thresholds.

Weight Ranges Using Agreement Weights

For agreement and disagreement weight ranges, determining the match weight ranges is very straightforward. Simply total the maximum agreement weights for each field to determine the maximum match weight. Then total the minimum disagreement weights for each match field to determine the minimum match weight. The following table provides a sample agreement/disagreement configuration for matching on person data. As you can see, the range of match weights generated for a master person index application with this configuration is from -36 to +38.

Table 5-5 Sample Agreement and Disagreement Weight Ranges

Field Name	Maximum Agreement Weight	Minimum Disagreement Weight
First Name	8	-8
Last Name	8	-8
Date of Birth	7	-5
Gender	5	-5
SSN	10	-10
Maximum Match Weight	38
Minimum Match Weight		-36
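
The totals in Table 5-5 can be reproduced with a few lines of Python. This is a minimal sketch for checking the arithmetic; the dictionary layout and function name are illustrative and do not reflect the format of matchConfigFile.cfg.

```python
# (maximum agreement weight, minimum disagreement weight) per field,
# copied from Table 5-5.
WEIGHT_RANGES = {
    "First Name": (8, -8),
    "Last Name": (8, -8),
    "Date of Birth": (7, -5),
    "Gender": (5, -5),
    "SSN": (10, -10),
}


def match_weight_range(ranges):
    """Return (minimum, maximum) composite match weight for the given fields."""
    maximum = sum(agree for agree, _ in ranges.values())
    minimum = sum(disagree for _, disagree in ranges.values())
    return minimum, maximum


print(match_weight_range(WEIGHT_RANGES))  # (-36, 38)
```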

Weight Ranges Using Probabilities

Determining the match weight ranges when using m-probabilities and u-probabilities is a little more complicated than using agreement and disagreement weights. To determine the maximum weight that will be generated for each field, use the following formula:

LOG2(m_prob/u_prob)

To determine the minimum match weight that will be generated for each field, use the following formula:

LOG2((1-m_prob)/(1-u_prob))
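
As a worked instance of the two formulas, take an m-probability of 0.996 and a u-probability of 0.004 (the First Name values in Table 5-6). The symbols m, u, and w are shorthand introduced here for m_prob, u_prob, and the resulting weights.

```latex
\[
w_{\text{max}} = \log_2\frac{m}{u} = \log_2\frac{0.996}{0.004} = \log_2 249 \approx 7.96,
\qquad
w_{\text{min}} = \log_2\frac{1-m}{1-u} = \log_2\frac{0.004}{0.996} \approx -7.96
\]
```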

The following table illustrates m-probabilities and u-probabilities, including the corresponding agreement and disagreement weights that are generated with each combination of probabilities. As you can see, the range of match weights generated for a master person index application with this configuration is from -35.93 to +38.

Table 5-6 Sample m-probabilities and u-probabilities

Field Name	m-probability	u-probability	Max Agreement Weight	Min Disagreement Weight
First Name	.996	.004	7.96	-7.96
Last Name	.996	.004	7.96	-7.96
Date of Birth	.97	.007	7.11	-5.04
Gender	.97	.03	5.01	-5.01
SSN	.999	.001	9.96	-9.96
Maximum Match Weight			38
Minimum Match Weight				-35.93
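
The weights in Table 5-6 can be recomputed from the probabilities with the two LOG2 formulas. The following Python sketch is for verification only; the function names are illustrative. Because the table rounds each weight to two decimal places, the recomputed values can differ from it in the last digit.

```python
import math

# (m-probability, u-probability) per field, copied from Table 5-6.
PROBABILITIES = {
    "First Name": (0.996, 0.004),
    "Last Name": (0.996, 0.004),
    "Date of Birth": (0.97, 0.007),
    "Gender": (0.97, 0.03),
    "SSN": (0.999, 0.001),
}


def field_weights(m_prob, u_prob):
    """Return (maximum agreement weight, minimum disagreement weight)."""
    agreement = math.log2(m_prob / u_prob)
    disagreement = math.log2((1 - m_prob) / (1 - u_prob))
    return agreement, disagreement


def total_range(probabilities):
    """Return (minimum, maximum) composite match weight over all fields."""
    weights = [field_weights(m, u) for m, u in probabilities.values()]
    minimum = sum(disagree for _, disagree in weights)
    maximum = sum(agree for agree, _ in weights)
    return minimum, maximum


for name, (m, u) in PROBABILITIES.items():
    agree, disagree = field_weights(m, u)
    print(f"{name:15s} {agree:6.2f} {disagree:7.2f}")

minimum, maximum = total_range(PROBABILITIES)
print(f"range: {minimum:.2f} to {maximum:.2f}")
```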

Comparison Functions

The match configuration file defines several match types for different types of fields. You can either modify existing rows in this file or create new rows that define custom matching logic. To determine which comparison functions to use, review the information provided in Chapter 3, "OHMPI Match Engine Comparison Functions". Choose the comparison functions that best suit how you want the match fields to be processed.

Determining the Weight Thresholds

Weight thresholds tell the master person index application how to process incoming records based on the matching probability weights generated by the OHMPI Match Engine. Two parameters in master.xml provide the master person index application with the information needed to determine if records should be flagged as potential duplicates, if records should be automatically matched, or if a record is not a potential match to any existing records.

Match Threshold - Specifies the weight at which two profiles are assumed to represent the same person and are automatically matched (this depends on the setting for the OneExactMatch parameter).
Duplicate Threshold - Specifies the minimum weight at which two profiles are considered potential duplicates of one another. The match threshold indicates the maximum weight for potential duplicates.
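
The effect of the two thresholds on a single matching weight can be sketched as follows. This Python example is a simplified illustration, not the application's actual logic: it ignores the OneExactMatch parameter, and it assumes that a weight exactly equal to a threshold falls into the higher category.

```python
def classify(weight, match_threshold, duplicate_threshold):
    """Classify one matching weight against the two thresholds.

    Assumption: weights equal to a threshold count toward the higher category.
    """
    if duplicate_threshold > match_threshold:
        raise ValueError("duplicate threshold must not exceed match threshold")
    if weight >= match_threshold:
        return "assumed match"
    if weight >= duplicate_threshold:
        return "potential duplicate"
    return "non-match"


# Illustrative thresholds and weights.
for weight in (30.0, 10.0, -5.0):
    print(weight, classify(weight, match_threshold=26.6, duplicate_threshold=3.8))
```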

Specifying the Weight Thresholds

There are many techniques for determining the initial settings for the match and duplicate thresholds. This section discusses two methods. You can also use the Data Profiler and Bulk Matcher to determine these thresholds. For more information, see Oracle Healthcare Master Person Index Analyzing and Cleansing Data User's Guide (Part Number 18589-01) and Oracle Healthcare Master Person Index Loading the Initial Data Set User's Guide (Part Number 18590-01).

The first method, the weight distribution method, is based on calculating the error rates of false matches and false non-matches by analyzing the distribution of all the weighted pairs. This is the standard method. The second method, the percentage method, relies on measuring the total maximum and minimum weights of all the matched fields and then specifying a certain percentage of these values as the initial thresholds.

The weight distribution method is more thorough and powerful but requires analyzing a large amount of data (match weights) to be statistically reliable. It does not apply well in cases where one candidate record is matched against very few reference records. The percentage method, though simple, is very reliable and precise when dealing with such situations. For both methods, defining the match threshold and the duplicate threshold is an iterative process.

Weight Distribution Method

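Each record pair in the master person index application can be classified into one of three categories: matches, non-matches, and potential matches. When the matching weights of all pairs are viewed as a distribution, pairs above the match threshold that do not represent the same entity fall into the False Matches region, true matches below the duplicate threshold fall into the False Non-matches region, and pairs between the two thresholds fall into the Manual Review region. Your goal is to make sure that very few records, if any, fall into the False Matches region, and that as few as possible fall into the False Non-matches region. Modifying the thresholds changes this distribution. Balance this against the number of records falling within the Manual Review region, as these will each need to be reviewed, researched, and resolved individually.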
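
The trade-off can be explored on a sample of record pairs whose true status has been established by manual review. The following Python sketch counts how many reviewed pairs land in each region for a given pair of thresholds. The sample weights and review outcomes are invented for illustration, and the function is not part of any OHMPI tool.

```python
def error_counts(reviewed_pairs, match_threshold, duplicate_threshold):
    """Count reviewed pairs per region.

    reviewed_pairs: iterable of (weight, is_true_match) tuples, where
    is_true_match comes from manual review of the pair.
    """
    counts = {"false_matches": 0, "false_non_matches": 0, "manual_review": 0}
    for weight, is_true_match in reviewed_pairs:
        if weight >= match_threshold:
            # Automatically matched; wrong if the pair is not a true match.
            if not is_true_match:
                counts["false_matches"] += 1
        elif weight >= duplicate_threshold:
            # Between the thresholds: flagged for manual review.
            counts["manual_review"] += 1
        elif is_true_match:
            # Below the duplicate threshold but actually a match.
            counts["false_non_matches"] += 1
    return counts


# Hypothetical reviewed sample: (matching weight, is it really a match?)
PAIRS = [(31.0, True), (27.2, False), (12.5, True),
         (5.1, False), (2.0, True), (-8.4, False)]

print(error_counts(PAIRS, match_threshold=26.6, duplicate_threshold=3.8))
print(error_counts(PAIRS, match_threshold=28.0, duplicate_threshold=1.0))
```

In this invented sample, widening the gap between the thresholds removes both errors but doubles the number of pairs that require manual review, which is the balance described above.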

Percentage Method

Using this method, you set the initial thresholds as a percentage of the maximum and minimum weights. Using the information provided under "Weight Ranges Using Agreement Weights" or "Weight Ranges Using Probabilities", determine the maximum and minimum values that can be generated for composite match weights. For the initial run, the match threshold is set intentionally high to catch only the most probable matches. The duplicate threshold is set intentionally low to catch a large set of possible matches.

Set the match threshold at 70% of the maximum composite weight, starting from zero as the neutral value. Using the weight range samples in Table 5-5, this would be 70% of 38, or 26.6. Set the duplicate threshold near the neutral value (that is, the value in the center of the maximum and minimum weight range). The value could be set between 10% of the maximum weight and 10% of the minimum weight. Using the samples above, this would be between 3.8 (10% of 38) and -3.6 (10% of -36).
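
The percentage method reduces to simple arithmetic, sketched here in Python. The function and parameter names are illustrative; the 70% and 10% defaults are the starting points suggested in the text, not fixed values.

```python
def initial_thresholds(max_weight, min_weight, match_pct=0.70, duplicate_pct=0.10):
    """Return the initial match threshold and the suggested duplicate range.

    The duplicate threshold is chosen somewhere inside the returned
    (low, high) range around the neutral value of zero.
    """
    match_threshold = match_pct * max_weight
    duplicate_low = duplicate_pct * min_weight
    duplicate_high = duplicate_pct * max_weight
    return match_threshold, (duplicate_low, duplicate_high)


# Weight range from Table 5-5: minimum -36, maximum 38.
match_threshold, (dup_low, dup_high) = initial_thresholds(38, -36)
print(f"match threshold: {match_threshold:.1f}")
print(f"duplicate threshold between {dup_low:.1f} and {dup_high:.1f}")
```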

Fine-tuning the Thresholds

Achieving the correct thresholds for your implementation is an iterative process. First, using the initial thresholds described earlier, process the data extracts into the master person index database. Then analyze the resulting assumed matches and potential duplicates, paying close attention to the assumed match records with matching weights close to the match threshold, to potential duplicate records close to either threshold, and to non-matches near the duplicate threshold.
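
One way to focus this review is to list the record pairs whose weights lie within a small band around each threshold. The following Python sketch assumes the pair weights have been exported as (identifier, weight) tuples. The band width of 2.0, the identifiers, and the weights are all hypothetical.

```python
def near_threshold(pairs, threshold, band=2.0):
    """Return pairs whose weight is within `band` of `threshold`, lowest first.

    pairs: iterable of (pair_id, weight) tuples.
    """
    selected = [p for p in pairs if abs(p[1] - threshold) <= band]
    return sorted(selected, key=lambda p: p[1])


# Hypothetical exported pair weights.
PAIR_WEIGHTS = [("A", 31.0), ("B", 27.9), ("C", 25.0), ("D", 12.5),
                ("E", 4.4), ("F", 2.9), ("G", -8.4)]

print(near_threshold(PAIR_WEIGHTS, threshold=26.6))  # near the match threshold
print(near_threshold(PAIR_WEIGHTS, threshold=3.8))   # near the duplicate threshold
```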

If you find that most or all of the assumed matches at the low end of the match range are not actually duplicate records, raise the match threshold accordingly. If, on the other hand, you find several potential duplicates at the high end of the duplicate range that are actual matches, decrease the match threshold accordingly. If you find that most or all of the potential duplicate records in the low end of the duplicate range should not be considered duplicate matches, consider raising the duplicate threshold. Conversely, if you find several non-matches with weight near the duplicate threshold that should be considered potential duplicates, lower the duplicate threshold.

Repeat the process of loading and analyzing data and adjusting the thresholds until you are satisfied with the results.