1 Oracle Healthcare Master Person Index Match Engine Reference

This chapter provides conceptual information about the Oracle Healthcare Master Person Index (OHMPI) Match Engine and how it matches data in a master person index application. It also introduces the OHMPI Standardization Engine, with which the OHMPI Match Engine works closely. For more information about the standardization engine, see Oracle Healthcare Master Person Index Standardization Engine Reference (Part Number E18471-01).

This chapter includes the following sections:

"Learning About the OHMPI Match Engine"
"Understanding the OHMPI Index Standardization and Matching Process"

Learning About the OHMPI Match Engine

The OHMPI Match Engine provides record matching capabilities for external applications, such as master person index applications. It works best together with the OHMPI Standardization Engine, which provides the preprocessing that accurate matching requires: data parsing, data standardization, and phonetic encoding. Before records can be compared to evaluate the possibility of a match, the data contained in those records must be standardized and, in certain cases, phonetically encoded. Once the data is conditioned, the match engine determines a match weight for each field defined for matching. The match weight is based on the fields on which matching is performed and how the matching logic is configured. The composite weight is usually the sum of the weights generated for all match fields in the records (but could also be a function of the match field weights). The composite weight indicates how closely two records match.

The OHMPI Match Engine is the standard match engine designed to work with the master person index applications created by the Oracle Healthcare Master Person Index. The match engine can also be called from other applications. It is highly configurable in the Oracle Healthcare Master Person Index environment and can be used to match on various types of data. The OHMPI Match Engine works in conjunction with the OHMPI Standardization Engine to improve the quality of your data.

The following sections provide information about matching concepts, the match process, and how the OHMPI Match Engine matches data.

"Data Matching Concepts"
"Understanding How the OHMPI Match Engine Works"

Data Matching Concepts

Data matching compares data stored in disparate systems in and across organizations, helping you reduce data duplication and improve data accuracy. Matching involves comparing specific fields in two standardized records and returning a weight that indicates the likelihood of a match between the two records. A higher weight between two records indicates a greater likelihood of a match. Data matching is based on proven algorithms that are designed to compare different types of data, such as strings, dates, integers, and so on. Matching is a key step in managing data quality, and the algorithms are typically quite complex. Some algorithms are configured to compare more specialized types of data, including first and last names, social security numbers, and dates of various formats.

The following topics provide additional information about standard data matching concepts:

"Deterministic and Probabilistic Data Matching"
"Weighting Thresholds"
"Probabilities and Direct Weights"

Deterministic and Probabilistic Data Matching

Data matching can be either deterministic or probabilistic. In deterministic matching, either unique identifiers for each record are compared to determine a match or an exact comparison is used between fields. Unique identifiers can include national IDs, system IDs, and so on. Deterministic matching is generally not completely reliable since in some cases no single field can provide a reliable match between two records. This is where probabilistic, or fuzzy, matching comes in. In probabilistic matching, several field values are compared between two records and each field is assigned a weight that indicates how closely the two field values match. The sum of the individual field weights indicates the likelihood of a match between two records.
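The contrast between the two approaches can be sketched as follows. This is a minimal illustration in Python, not the OHMPI implementation: the field names, the sample records, and the per-field weights are hypothetical, and field comparison is simplified to exact equality.

```python
# Hypothetical (agreement, disagreement) weights per field; not OHMPI defaults.
FIELD_WEIGHTS = {
    "first_name": (4.0, -2.0),
    "last_name": (6.0, -3.0),
    "dob": (5.0, -4.0),
}


def deterministic_match(a, b, id_field="national_id"):
    """Match only when both records carry the same non-empty identifier."""
    return bool(a.get(id_field)) and a.get(id_field) == b.get(id_field)


def probabilistic_weight(a, b, weights=FIELD_WEIGHTS):
    """Sum one weight per field: agreement if the values are equal, else disagreement."""
    total = 0.0
    for field, (agree, disagree) in weights.items():
        total += agree if a.get(field) == b.get(field) else disagree
    return total


rec_a = {"national_id": "", "first_name": "ANN", "last_name": "SMITH", "dob": "1980-01-02"}
rec_b = {"national_id": "", "first_name": "ANNE", "last_name": "SMITH", "dob": "1980-01-02"}

print(deterministic_match(rec_a, rec_b))   # False: no usable identifier on either record
print(probabilistic_weight(rec_a, rec_b))  # 9.0: -2.0 + 6.0 + 5.0
```

The deterministic check cannot decide anything without a shared identifier, while the summed field weights still express that the two records are likely to describe the same person.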

Weighting Thresholds

In a data management system, you can set duplicate and match threshold weights. The duplicate threshold is the weight above which two records potentially represent the same entity. The match threshold is the weight above which two records are considered to represent the same entity. Two records whose weight falls below the duplicate threshold are considered to represent separate entities.
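A short sketch of how the two thresholds partition composite weights is shown below. The function name, the labels, and the threshold values are hypothetical, and treating a weight exactly equal to a threshold as meeting it is an assumption; the text above does not specify the boundary behavior.

```python
def classify(weight, duplicate_threshold, match_threshold):
    """Place a composite weight into one of three bands defined by the thresholds."""
    if duplicate_threshold > match_threshold:
        raise ValueError("duplicate threshold must not exceed match threshold")
    if weight >= match_threshold:      # assumption: boundary counts as a match
        return "match"
    if weight >= duplicate_threshold:  # assumption: boundary counts as a potential duplicate
        return "potential duplicate"
    return "different"


# Hypothetical thresholds chosen only for illustration.
for w in (25.0, 15.0, 5.0):
    print(w, classify(w, duplicate_threshold=10.0, match_threshold=20.0))
```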

Probabilities and Direct Weights

Optimum (or ceiling) matching weights can be assigned to field values using matching (m) and unmatching (u) probabilities or, equivalently, using agreement and disagreement weights. Both types are based on a logarithmic function. Optimum agreement and disagreement weights are an equivalent logarithmic expression of the matching and unmatching probabilities, but for an end user, defining agreement and disagreement weight ranges is a more direct way to implement m-probabilities and u-probabilities.

Matching and Unmatching Probabilities

When matching and unmatching conditional probabilities are used, the match engine uses a logarithmic formula to determine agreement and disagreement weights between fields. The m-probabilities and u-probabilities you specify determine the maximum agreement weight and minimum disagreement weight for each field, and so define the agreement and disagreement weight ranges for each field and for the entire record. These probabilities allow you to specify which fields provide the most reliable matching information and which provide the least. For example, in person matching, the gender field is not as reliable as the SSN field for determining a match since a person's SSN is more specific. Therefore, the SSN field should have a higher m-probability than the gender field. The more reliable the field, the greater the m-probability for that field should be.

If a field matches between two records, an agreement weight, determined by the logarithmic formula using the m-probability and u-probability, is added to the composite match weight for the record. If the fields disagree, the logarithmic formula yields a negative value, and this disagreement weight lowers the composite match weight.
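The text does not give the logarithmic formula itself. The sketch below assumes the conventional Fellegi-Sunter form with base-2 logarithms, in which the agreement weight is log2(m/u) and the disagreement weight is log2((1-m)/(1-u)); whether the OHMPI Match Engine uses exactly this form is an assumption. The probability values are hypothetical.

```python
import math


def _check(p):
    """m- and u-probabilities must lie strictly between zero and one."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be between 0 and 1, exclusive")


def agreement_weight(m, u):
    """Assumed Fellegi-Sunter agreement weight: log2(m / u)."""
    _check(m)
    _check(u)
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Assumed Fellegi-Sunter disagreement weight: log2((1 - m) / (1 - u))."""
    _check(m)
    _check(u)
    return math.log2((1.0 - m) / (1.0 - u))


# Hypothetical probabilities: SSN is more reliable than gender.
ssn = (0.99, 0.001)
gender = (0.9, 0.5)

for name, (m, u) in (("ssn", ssn), ("gender", gender)):
    print(name, agreement_weight(m, u), disagreement_weight(m, u))
```

Under this assumed formula, the more reliable field (SSN) receives both a larger agreement weight and a more strongly negative disagreement weight than the gender field, which matches the behavior described above.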

Agreement and Disagreement Weight Ranges

Like probabilities, the maximum agreement and minimum disagreement weights you define for each field allow you to specify the relative reliability of each field; however, the match weight has a more linear relationship with the numbers you specify. When you use agreement and disagreement weight ranges to determine the match weight, you define for each field a maximum weight for when the two values are in complete agreement and a minimum weight for when they are in complete disagreement. The value assigned to a field is somewhere between the two numbers based on an underlying logarithmic formula. This provides a more convenient and intuitive representation of conditional probabilities.

Using the SSN and gender field example above, the SSN field is assigned a higher maximum agreement weight and a lower minimum disagreement weight than the gender field because it is more reliable. If you assign a maximum agreement weight of "10" and two SSNs match, the match weight for that field is "10". If you assign a minimum disagreement weight of "-10" and two SSNs are in complete disagreement, the match weight for that field is "-10".

Understanding How the OHMPI Match Engine Works

The OHMPI Match Engine compares records containing similar data types by calculating how closely certain fields in the records match. The resulting comparison weight is either a positive or negative numeric value that represents the degree to which the two sets of data are similar. The match engine relies on probabilistic algorithms to compare data of a given type using a comparison function specific to the type of data being compared. The comparison functions for each matching field are defined in a match configuration file that you can customize for the type of data you are indexing. You can also define custom comparison functions to plug in to the match engine. The formula used to determine the matching weight is based on either matching and unmatching probabilities or on agreement and disagreement weight ranges (described in Probabilities and Direct Weights).

The following sections provide additional information about how the OHMPI Match Engine works:

"OHMPI Match Engine Structure"
"OHMPI Match Engine Configuration Files"
"OHMPI Match Engine Matching Weight Formulation"
"OHMPI Match Engine Data Types"

OHMPI Match Engine Structure

The OHMPI Match Engine was designed to be very flexible and generic, allowing you to customize existing matching rules and to define additional rules using Java. The match engine framework allows you to create custom matching comparison functions, or comparators, and plug them in to the match engine to enable matching against any type of data. The OHMPI Match Engine framework includes two main modules. The real-time module stores the predefined and user-defined Java classes that define the matching comparator logic. The design-time module stores the configuration and validation classes for the comparators.

The OHMPI Match Engine provides a wide variety of customizable comparators for you to choose from. You can also create comparators in the real-time module, and create new validation and configuration rules in the design-time module. The structure of the design-time module supports validations, weighting curves, and class dependencies. There is also an option that allows you to load information from a data file and use that information to calculate a matching weight.
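The plug-in idea behind comparators can be illustrated with a small registry. This is only a conceptual sketch: OHMPI comparators are Java classes registered through the engine's own configuration, and every name below (the registry, the decorator, and both comparators) is hypothetical.

```python
from typing import Callable, Dict

# A comparator takes two field values and returns a similarity between 0.0 and 1.0.
Comparator = Callable[[str, str], float]

_REGISTRY: Dict[str, Comparator] = {}


def register(name):
    """Decorator that makes a comparator available under a short name."""
    def decorator(func):
        _REGISTRY[name] = func
        return func
    return decorator


@register("exact")
def exact(a, b):
    """Predefined-style comparator: 1.0 on equality, otherwise 0.0."""
    return 1.0 if a == b else 0.0


@register("prefix")
def common_prefix(a, b):
    """Custom-style comparator: share of the longer value covered by the common prefix."""
    if not a or not b:
        return 0.0
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b))


def compare(name, a, b):
    """Look up a comparator by name and apply it to two field values."""
    return _REGISTRY[name](a, b)


print(compare("exact", "SMITH", "SMITH"))
print(compare("prefix", "SMITH", "SMYTH"))
```

Keeping the comparator lookup separate from the comparator logic is what lets new comparison functions be added without changing the code that drives matching.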

OHMPI Match Engine Configuration Files

The OHMPI Match Engine compares two records and returns a match weight indicating the likelihood of a match between the two records based on information provided in configuration files. In a master person index application, the match engine is configured by these two files in the Match Engine node of the master person index project: the matching configuration file (matchConfigFile.cfg) and the comparators list (comparatorsList.xml). The matching configuration file defines the configuration and parameters for the matching comparator functions and the comparators list defines each comparator available to the match engine.

Matching criteria and logic are defined in the match configuration file in the master person index project (matchConfigFile.cfg). The data fields that are sent to the OHMPI Match Engine for matching, known as the match string, are defined in the MatchingConfig section of mefa.xml in the master person index project. The match engine configuration files define which matching rules to use to process each match field. The match engine provides a comprehensive set of comparator functions, and you can create custom comparators if needed.

OHMPI Match Engine Matching Weight Formulation

The OHMPI Match Engine determines the matching weight between two records by comparing the match string fields between the two records using the rules defined in the match configuration file and taking into account the matching logic specified for each field. The OHMPI Match Engine can use either matching (m) and unmatching (u) conditional probabilities or agreement and disagreement weight ranges to fine-tune the match process. It uses the underlying algorithm to arrive at a match weight for each match string field. The weight generated for each field in the match string indicates the level of match between each field. The weights assigned to each field are then summed together for a total, composite matching weight between the two records. Agreement and disagreement weight ranges or m-probabilities and u-probabilities are defined in the match configuration file.

m-probabilities and u-probabilities are expressed as double values between zero and one (excluding zero and one) and can have up to 16 decimal places. Agreement and disagreement weights are expressed as double values and can have up to 16 decimal places. When using agreement and disagreement weights, the OHMPI Match Engine assigns a matching weight to each field that falls between the agreement and disagreement weights specified for the field. Thus, the maximum agreement weight between two records is the sum of the defined agreement weights for each field. The minimum disagreement weight is the sum of the defined disagreement weights for each field. For more information about weight calculation, see "Determining the Weight Range".
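The range of possible composite weights follows directly from the per-field ranges. The sketch below sums them; the SSN values reuse the earlier "10" and "-10" example, while the other fields and their weights are hypothetical.

```python
# (maximum agreement weight, minimum disagreement weight) per match field.
# Only the ssn values come from the example in this chapter; the rest are hypothetical.
FIELD_RANGES = {
    "ssn": (10.0, -10.0),
    "last_name": (8.0, -5.0),
    "gender": (2.0, -1.0),
}


def composite_range(ranges):
    """Return (minimum, maximum) composite weight for a set of match fields."""
    max_total = sum(agree for agree, _ in ranges.values())
    min_total = sum(disagree for _, disagree in ranges.values())
    return min_total, max_total


print(composite_range(FIELD_RANGES))  # (-16.0, 20.0)
```

Knowing this range is useful when choosing duplicate and match thresholds, because both thresholds must fall within it to have any effect.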

OHMPI Match Engine Data Types

The OHMPI Match Engine is built on a flexible framework that allows you to customize and create matching rules for various types of data. The match engine provides an extensive set of comparison functions for matching on various types of fields, such as numbers, dates, single characters, and so on. The match engine also provides more specialized comparison functions for matching on specific types of data, such as person names, address fields, social security numbers, and genders. You can define custom comparison functions and custom standardization logic for different data types or variants on data types. These customizations are easily incorporated into a master person index application, allowing you to completely customize the match and standardization process for your specific data format.

The OHMPI Match Engine and the OHMPI Standardization Engine

The OHMPI Match Engine works with the OHMPI Standardization Engine to provide an accurate comparison of two records. The standardization engine reads input data and determines how to parse, normalize, and standardize the data in order to create a standard set of values to use for match comparison. The standardization engine can standardize free-form text fields, such as street address fields or business names, and separate them into their individual parts, such as house numbers, street names, and so on, allowing the match engine to generate a more accurate weight for free-form data.
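To make the parsing step concrete, the sketch below splits a free-form street address into parts and normalizes the street type. It is a deliberately simple stand-in: the regular expression, the lookup table, and the output field names are hypothetical and far less capable than the pattern-based rules of the OHMPI Standardization Engine.

```python
import re

# Hypothetical normalization table for street types.
STREET_TYPES = {
    "ST": "St", "STREET": "St",
    "AVE": "Ave", "AVENUE": "Ave",
    "RD": "Rd", "ROAD": "Rd",
}

# House number, street name (one or more words), then a trailing street type.
ADDRESS = re.compile(r"^\s*(\d+)\s+(.+?)\s+([A-Za-z]+)\.?\s*$")


def standardize_address(text):
    """Parse a free-form address into parts, or return None if it does not fit."""
    found = ADDRESS.match(text)
    if not found:
        return None
    number, name, street_type = found.groups()
    normalized = STREET_TYPES.get(street_type.upper())
    if normalized is None:
        return None
    return {
        "house_number": number,
        "street_name": name.title(),
        "street_type": normalized,
    }


print(standardize_address("123 main street"))
print(standardize_address("123 Main St."))
```

Both inputs produce the same parsed values, so a field-by-field comparison sees agreement where a raw string comparison would see two different addresses.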

Understanding the OHMPI Index Standardization and Matching Process

In a default Oracle Healthcare Master Person Index implementation, the master person index application uses the OHMPI Match Engine and the OHMPI Standardization Engine to cleanse data in real time. The standardization engine uses configurable pattern-matching logic to identify data and reformat it into a standardized form. The match engine uses a matching algorithm with a proven methodology to process and weight records in the master person index database. By incorporating both standardization and matching capabilities, you can condition data prior to matching. You can also use these capabilities to review legacy data prior to loading it into the database. This review helps you determine data anomalies, invalid or default values, and missing fields.

In a master person index application, both matching and standardization occur when two records are analyzed for the probability of a match. Before matching, certain fields are normalized, parsed, or converted into their phonetic values if necessary. The match fields are then analyzed and weighted according to the rules defined in a match configuration file. The weights for each field are combined to determine the overall matching weight for the two records. After these steps are complete, survivorship is determined by the master person index application based on how the overall matching weight compares to the duplicate and match thresholds of the master person index application.

In a master person index application, the standardization and matching process includes the following steps:

1. The master person index application receives an incoming record.
2. The OHMPI Standardization Engine standardizes the fields specified for parsing, normalization, and phonetic encoding. These fields are defined in mefa.xml and the rules for standardization are defined in the standardization engine configuration files.
3. The master person index application queries the database for a candidate selection pool (records that are possible matches) using the blocking query specified in master.xml. If the blocking query uses standardized or phonetic fields, the criteria values are obtained from the database.
4. For each possible match, the master person index application creates a match string (based on the match columns in mefa.xml) and sends the string to the OHMPI Match Engine.
5. The OHMPI Match Engine checks the incoming record against each possible match, producing a matching weight for each. Matching is performed using the weighting rules defined in the match configuration file.
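The steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the OHMPI code path: standardization is reduced to case and whitespace normalization, the blocking key is a crude stand-in for a blocking query, stored records are standardized on the fly rather than read in standardized form, and all names, fields, and weights are hypothetical.

```python
def standardize(record):
    """Normalize case and whitespace; a stand-in for parsing and phonetic encoding."""
    return {key: " ".join(str(value).upper().split()) for key, value in record.items()}


def blocking_key(record):
    """Crude candidate-selection key: first letter of last name plus birth year."""
    return record["last_name"][:1] + record["dob"][:4]


def match_string(record, fields):
    """Extract the match fields, in order, from a standardized record."""
    return tuple(record[field] for field in fields)


def weigh(a, b, weights):
    """Sum agreement or disagreement weights over two match strings (exact comparison)."""
    return sum(agree if x == y else disagree
               for x, y, (agree, disagree) in zip(a, b, weights))


def process(incoming, database, fields, weights):
    """Standardize, select candidates, build match strings, and weigh each candidate."""
    incoming = standardize(incoming)
    key = blocking_key(incoming)
    candidates = [r for r in map(standardize, database) if blocking_key(r) == key]
    target = match_string(incoming, fields)
    return [(r, weigh(target, match_string(r, fields), weights)) for r in candidates]


fields = ("first_name", "last_name", "dob")
weights = [(4.0, -2.0), (6.0, -3.0), (5.0, -4.0)]  # hypothetical (agree, disagree) pairs
database = [
    {"first_name": "Ann", "last_name": "Smith", "dob": "1980-01-02"},
    {"first_name": "Bob", "last_name": "Stone", "dob": "1980-05-06"},
    {"first_name": "Ann", "last_name": "Jones", "dob": "1975-03-04"},
]
incoming = {"first_name": " ann ", "last_name": "SMITH", "dob": "1980-01-02"}

results = process(incoming, database, fields, weights)
for record, weight in results:
    print(record["first_name"], record["last_name"], weight)
```

Only the two records that share the blocking key are weighed; the third record is never compared, which is the purpose of the candidate selection pool.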