Oracle Java CAPS Master Index Match Engine Reference     Java CAPS Documentation

Data Matching Concepts

Data matching compares data stored in disparate systems within and across organizations, helping you reduce data duplication and improve data accuracy. Matching compares specific fields in two standardized records and returns a weight that indicates the likelihood that the two records match; the higher the weight, the greater the likelihood. Data matching is based on proven algorithms designed to compare different types of data, such as strings, dates, and integers. Matching is a key step in managing data quality, and the algorithms are typically quite complex. Some algorithms are configured to compare more specialized types of data, including first and last names, social security numbers, and dates in various formats.

The following topics provide additional information about standard data matching concepts:

Deterministic and Probabilistic Data Matching

Data matching can be either deterministic or probabilistic. In deterministic matching, a match is determined either by comparing unique identifiers for each record or by performing an exact comparison between fields. Unique identifiers can include national IDs, system IDs, and so on. Deterministic matching is generally not completely reliable because in some cases no single field can provide a reliable match between two records. This is where probabilistic, or fuzzy, matching comes in. In probabilistic matching, several field values are compared between two records and each field is assigned a weight that indicates how closely the two field values match. The sum of the individual field weights indicates the likelihood of a match between the two records.
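The summation described above can be sketched as follows. This is a minimal illustration, not the match engine's implementation: the class name, field names, and weight values are all hypothetical, and the per-field weights are assumed to have been produced already by some comparison step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CompositeWeightSketch {

    // Sums the per-field weights into one composite match weight
    // for a pair of records.
    static double compositeWeight(Map<String, Double> fieldWeights) {
        double total = 0.0;
        for (double weight : fieldWeights.values()) {
            total += weight;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical weights: two fields agree, one disagrees.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("FirstName", 4.5);
        weights.put("LastName", 6.0);
        weights.put("DOB", -2.0);
        System.out.println("Composite weight: " + compositeWeight(weights));
    }
}
```

A field that disagrees contributes a negative weight, so it lowers the composite weight rather than being ignored.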

Weighting Thresholds

In a data management system, you can set duplicate and match threshold weights. The duplicate threshold is the weight above which two records potentially represent the same entity. The match threshold is the weight above which two records are considered to represent the same entity. Any two records whose match weight falls below the duplicate threshold are considered to represent completely separate entities.
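As a rough sketch of how the two thresholds partition composite weights, the following hypothetical helper classifies a record pair. The class, enum, and method names are illustrative only. Treating a weight exactly equal to a threshold as meeting it is an assumption; the text above only says "above".

```java
final class ThresholdSketch {

    enum Outcome { MATCH, POTENTIAL_DUPLICATE, NON_MATCH }

    // Classifies a composite match weight against the two thresholds.
    // Assumption: a weight equal to a threshold counts as meeting it.
    static Outcome classify(double weight, double duplicateThreshold, double matchThreshold) {
        if (duplicateThreshold > matchThreshold) {
            throw new IllegalArgumentException(
                "duplicate threshold must not exceed match threshold");
        }
        if (weight >= matchThreshold) {
            return Outcome.MATCH;
        }
        if (weight >= duplicateThreshold) {
            return Outcome.POTENTIAL_DUPLICATE;
        }
        return Outcome.NON_MATCH;
    }

    public static void main(String[] args) {
        // Hypothetical thresholds: duplicate = 7.25, match = 29.0.
        System.out.println(classify(31.0, 7.25, 29.0));
        System.out.println(classify(12.0, 7.25, 29.0));
        System.out.println(classify(-4.0, 7.25, 29.0));
    }
}
```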

Probabilities and Direct Weights

Optimum (or ceiling) matching weights can be assigned to field values either by using matching (m) and unmatching (u) probabilities or, equivalently, by using agreement and disagreement weights. Both types are based on a logarithmic function. Optimum agreement and disagreement weights are an equivalent logarithmic expression of the matching and unmatching probabilities, but for an end user, defining agreement and disagreement weight ranges is a more direct way to implement m-probabilities and u-probabilities.

Matching and Unmatching Probabilities

When matching and unmatching conditional probabilities are used, the match engine uses a logarithmic formula to determine agreement and disagreement weights between fields. The m-probabilities and u-probabilities you specify determine the maximum agreement weight and minimum disagreement weight for each field, and so define the agreement and disagreement weight ranges for each field and for the entire record. These probabilities allow you to specify which fields provide the most reliable matching information and which provide the least. For example, in person matching, the gender field is not as reliable as the SSN field for determining a match since a person’s SSN is more specific. Therefore, the SSN field should have a higher m-probability than the gender field. The more reliable the field, the greater the m-probability for that field should be.

If a field matches between two records, an agreement weight, determined by the logarithmic formula using the m-probability and u-probability, is added to the composite match weight for the record. If the field values disagree, the logarithmic formula yields a negative value, and the resulting disagreement weight reduces the composite match weight.
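The exact formula is not given in this topic. As an illustration only, the sketch below assumes the form commonly used in probabilistic record linkage: an agreement weight of log2(m/u) and a disagreement weight of log2((1-m)/(1-u)). The class name, method names, and probability values are hypothetical. Under this assumed form, the disagreement weight is negative whenever m is greater than u.

```java
final class ProbabilityWeightSketch {

    // Base-2 logarithm; java.lang.Math has no log2 method.
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Assumed form of the weight added when the field values agree.
    static double agreementWeight(double m, double u) {
        check(m, u);
        return log2(m / u);
    }

    // Assumed form of the (negative) weight applied when they disagree.
    static double disagreementWeight(double m, double u) {
        check(m, u);
        return log2((1.0 - m) / (1.0 - u));
    }

    private static void check(double m, double u) {
        if (!(m > 0.0 && m < 1.0 && u > 0.0 && u < 1.0)) {
            throw new IllegalArgumentException("m and u must lie strictly between 0 and 1");
        }
    }

    public static void main(String[] args) {
        // Hypothetical probabilities: SSN is far more reliable than gender.
        System.out.printf("SSN:    agree=%.2f disagree=%.2f%n",
            agreementWeight(0.99, 0.001), disagreementWeight(0.99, 0.001));
        System.out.printf("Gender: agree=%.2f disagree=%.2f%n",
            agreementWeight(0.90, 0.50), disagreementWeight(0.90, 0.50));
    }
}
```

With these sample values the SSN field yields a wider weight range than the gender field, which reflects the point above that a more reliable field should carry more influence on the composite weight.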

Agreement and Disagreement Weight Ranges

Like probabilities, the maximum agreement and minimum disagreement weights you define for each field allow you to specify the relative reliability of each field; however, the match weight has a more linear relationship with the numbers you specify. When you use agreement and disagreement weight ranges to determine the match weight, you define for each field a maximum weight that applies when the two field values are in complete agreement and a minimum weight that applies when they are in complete disagreement. The weight assigned to a field falls somewhere between the two numbers, based on an underlying logarithmic formula. This provides a more convenient and intuitive representation of conditional probabilities.

Using the SSN and gender field example above, the SSN field is assigned a higher maximum agreement weight and a lower minimum disagreement weight than the gender field because it is more reliable. If you assign a maximum agreement weight of “10” and two SSNs match, the match weight for that field is “10”. If you assign a minimum disagreement weight of “-10” and two SSNs are in complete disagreement, the match weight for that field is “-10”.
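The endpoints in this example can be sketched in code. The sketch below is a deliberate simplification: it assumes a comparator has already reduced the two field values to a similarity score between 0 (complete disagreement) and 1 (complete agreement), and it uses plain linear interpolation in place of the engine's underlying logarithmic formula, which is not given here. Only the two endpoints reflect the behavior described above; the class and method names are hypothetical.

```java
final class WeightRangeSketch {

    // Maps a similarity score in [0, 1] onto the configured weight range.
    // Assumption: linear interpolation stands in for the engine's
    // logarithmic formula, so intermediate values are illustrative only.
    static double fieldWeight(double similarity, double minDisagreement, double maxAgreement) {
        if (similarity < 0.0 || similarity > 1.0) {
            throw new IllegalArgumentException("similarity must be between 0 and 1");
        }
        if (minDisagreement > maxAgreement) {
            throw new IllegalArgumentException("minimum weight must not exceed maximum weight");
        }
        return minDisagreement + similarity * (maxAgreement - minDisagreement);
    }

    public static void main(String[] args) {
        // SSN range from the example: -10 to 10.
        System.out.println(fieldWeight(1.0, -10.0, 10.0)); // complete agreement
        System.out.println(fieldWeight(0.0, -10.0, 10.0)); // complete disagreement
    }
}
```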