Oracle Java CAPS Master Index Match Engine Reference     Java CAPS Documentation

The Master Index Match Engine Match Configuration File

The match configuration file, matchConfigFile.cfg, contains the matching logic for each field on which matching is performed. By default, this file defines the matching logic for the three primary data types (person names, business names, and addresses), and can also handle generic data types, such as dates, numbers, social security numbers, and characters.

The Master Index Match Engine provides several comparison functions that you can call in this file to fine-tune the match process. Comparison functions contain the logic to compare different types of data in very specific ways in order to arrive at a match weight for each field. These functions allow you to define how matching is performed for different data types, and they can be used in conjunction with either matching and unmatching probabilities or agreement and disagreement weight ranges for each field. The file also defines how to handle missing fields.

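The following topics describe the format of the configuration file and provide an overview of the predefined comparison functions:

  • Master Index Match Engine Match Configuration File Format

  • Master Index Match Engine Matching Comparison Functions At a Glance

  • Master Index Match Engine Comparator Definition List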

These topics describe the format of the files so you can modify them directly. You can also modify the match configuration file using the Master Index Configuration Editor, which provides an easy, graphical way to configure matching rules.

Master Index Match Engine Match Configuration File Format

The match configuration file is divided into two sections. The first section consists of one line that indicates the matching probability type. The second section consists of the matching rules to use for each match field. In a master index application, this file can be modified from the Matching tab of the Master Index Configuration Editor. For more information, see Configuring the Comparison Functions for a Master Index Application in Oracle Java CAPS Master Index Configuration Guide.

Match Configuration File Sample

Following is an excerpt from the default match configuration file. This excerpt illustrates the components that are described in the following sections.

ProbabilityType            1

FirstName              15  0   uf    0.99  0.001   15  -5
LastName               15  0   ul    0.99  0.001   15  -5
String                 25  0   ua    0.99  0.001   10  -5
DateDays               20  0   dD    0.99  0.001   10  -10 y 15      30
DateMonths             20  0   dM    0.99  0.001   10  -10 n
DateHours              20  0   dH    0.99  0.001   10  -10 y 30      60
DateMinutes            20  0   dm    0.99  0.001   10  -10 y 300 600
DateSeconds            20  0   ds    0.99  0.001   10  -10 y 75      60
Integer                15  0   nI    0.99  0.001   10  -10 n
Real                   15  0   nR    0.99  0.001   10  -10 n
Char                   1   0   c     0.99  0.001   5   -5
pro                    15  0   p     0.99  0.001   10  -10 20 5 5

Probability Type Section

The first line of the match configuration file defines the probability type to use for matching. Specify “0” (zero) to use m-probabilities and u-probabilities to determine a field’s match weight; specify “1” (one) to use agreement and disagreement weight ranges. If the probability type is set to use agreement and disagreement weight ranges, the m-prob and u-prob columns in the matching rules section are ignored. Likewise, if the probability type is set to use m-probabilities and u-probabilities, the agreement-weight and disagreement-weight columns in the matching rules section are ignored. The default is to use agreement and disagreement weight ranges because they are more intuitive.

For more information about probabilities and weights, see Probabilities and Direct Weights.
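
The relationship between the two probability types can be illustrated with a short calculation. The sketch below assumes the conventional log-ratio (Fellegi-Sunter style, base 2) formulas for deriving weights from m-probabilities and u-probabilities; this section does not spell out the engine's formula, so treat the formulas as an assumption and see Master Index Match Engine Matching Weight Formulation for the authoritative version. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of the Master Index Match Engine API.
final class ProbabilityWeights {
    private ProbabilityWeights() {}

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    private static void check(double m, double u) {
        if (!(m > 0 && m < 1 && u > 0 && u < 1)) {
            throw new IllegalArgumentException("m and u must be strictly between 0 and 1");
        }
    }

    // Assumed formula: weight when the field values agree.
    static double agreementWeight(double m, double u) {
        check(m, u);
        return log2(m / u);
    }

    // Assumed formula: weight when the field values disagree.
    static double disagreementWeight(double m, double u) {
        check(m, u);
        return log2((1.0 - m) / (1.0 - u));
    }

    public static void main(String[] args) {
        // m-prob and u-prob values taken from the sample file rows.
        double m = 0.99;
        double u = 0.001;
        System.out.printf("agreement=%.2f disagreement=%.2f%n",
                agreementWeight(m, u), disagreementWeight(m, u));
    }
}
```

Under these assumed formulas, the sample values of 0.99 and 0.001 give an agreement weight of roughly 9.95 and a disagreement weight of roughly -6.64, which shows why direct weights such as 10 and -10 are often easier to reason about.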

Matching Rules Section

The section after the first line of the match configuration file contains match field rows, with each row defining how a certain data type or field will be matched. These are the rules you specify in the match string you define for a master index application. The syntax for this section is:

match-field size null-field function m-prob u-prob agreement disagreement params data-sources
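
As an illustration of the row syntax above, the following sketch splits one match field row on whitespace and maps the first eight tokens to the documented columns. It is not the engine's parser: the class name and field names are hypothetical, and the trailing tokens are kept as one list because telling parameters from data sources requires the comparator definition list. It also assumes that data source paths contain no spaces.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one row of matchConfigFile.cfg; not part of the product API.
final class MatchFieldRow {
    final String matchField;    // column 1
    final int size;             // column 2
    final String nullField;     // column 3: 0, 1, a#, or d#
    final String function;      // column 4: comparator code such as "uf" or "dD"
    final double mProb;         // column 5
    final double uProb;         // column 6
    final double agreement;     // column 7
    final double disagreement;  // column 8
    final List<String> extras;  // columns 9 and up: params, then any data sources

    private MatchFieldRow(String[] t) {
        matchField = t[0];
        size = Integer.parseInt(t[1]);
        nullField = t[2];
        function = t[3];
        mProb = Double.parseDouble(t[4]);
        uProb = Double.parseDouble(t[5]);
        agreement = Double.parseDouble(t[6]);
        disagreement = Double.parseDouble(t[7]);
        extras = Arrays.asList(Arrays.copyOfRange(t, 8, t.length));
    }

    // Splits a row on runs of whitespace; the first eight columns are required.
    static MatchFieldRow parse(String line) {
        String[] t = line.trim().split("\\s+");
        if (t.length < 8) {
            throw new IllegalArgumentException("expected at least 8 columns: " + line);
        }
        return new MatchFieldRow(t);
    }

    public static void main(String[] args) {
        MatchFieldRow row = parse("DateDays   20  0   dD    0.99  0.001   10  -10 y 15      30");
        System.out.println(row.function + " " + row.extras); // prints "dD [y, 15, 30]"
    }
}
```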

The following table describes each element in a match field row.

Table 1 Match Configuration File Columns

| Column Number | Column Name | Description |
|---|---|---|
| 1 | match-field | A value that indicates to the Master Index Match Engine how each field should be weighted. Each field included in the match string (the MatchingConfig section of mefa.xml) must have a match type corresponding to a value in this column. |
| 2 | size | The number of characters in the field on which matching is performed, beginning with the first character. For example, to match on only the first four characters in a 10-digit field, the value of this column should be “4”. |
| 3 | null-field | An index that specifies how to calculate the total weight for null fields or fields that contain only spaces. Specify one of the following values. 0 (zero): if one or both fields are empty, the weight used for the field is 0 (zero). 1 (one): if both fields are empty, the agreement weight is used; if only one field is empty, the disagreement weight is used. a#: an “a” followed by a number specifies to use the agreement weight if both fields are empty; the agreement weight is divided by the number following the “a” to obtain the match weight for the field. d#: a “d” followed by a number specifies to use the disagreement weight if only one field is empty; the disagreement weight is divided by the number following the “d” to obtain the match weight for the field. For both a# and d#, you can specify any number from 1 through 10; if no number is specified, the default is “2”. |
| 4 | function | The type of comparison to perform when weighting the field. For information about the available comparison functions, see Master Index Match Engine Comparison Functions and Options. An overview of the comparison functions is provided in Table 2. |
| 5 | m-prob | The initial probability that the specified field in two records will match if the records match. The probability is a double value between 0 and 1, and can have up to 16 decimal places. |
| 6 | u-prob | The initial probability that the specified field in two records will match if the records do not match. The probability is a double value between 0 and 1, and can have up to 16 decimal places. |
| 7 | agreement | The matching weight to be assigned to a field given that the fields match between two records. This number can be between 0 and 100 and can have up to 16 decimal places. It represents the maximum match weight for a field. |
| 8 | disagreement | The matching weight to be assigned to a field given that the fields do not match between two records. This number can be between 0 and -100 and can have up to 16 decimal places. It represents the minimum match weight for a field. |
| 9 | params | The parameters that correspond to the comparison function specified in column 4. Some comparison functions do not take any parameters and some take multiple parameters. For additional information about parameters, see Master Index Match Engine Comparison Functions and Options. |
| 10 | data-sources | The complete path to any data sources used by the comparison function specified in column 4. You can define as many data sources as there are data sources listed for the comparator in the comparators list file. The default comparators do not use data sources, but you can create a custom comparator that does. |

Note - For the null-field column, the agreement and disagreement weights are either specified in the file or calculated using a logarithmic formula based on the m-probabilities and u-probabilities (depending on the probability type).
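
The null-field options described for column 3 can be sketched as a small function. This is an illustration under stated assumptions rather than the engine's code: the class and method names are hypothetical, and the table does not say what a# yields when only one field is empty or what d# yields when both are empty, so the sketch returns 0 in those two cases as a labeled assumption.

```java
// Hypothetical helper illustrating the null-field column; not part of the product API.
final class NullFieldWeight {
    private NullFieldWeight() {}

    // bothEmpty is true when both fields are empty, false when exactly one is empty.
    static double weigh(String option, boolean bothEmpty, double agreement, double disagreement) {
        if (option == null || option.isEmpty()) {
            throw new IllegalArgumentException("null-field option is required");
        }
        switch (option.charAt(0)) {
            case '0':
                return 0.0;
            case '1':
                return bothEmpty ? agreement : disagreement;
            case 'a':
                // Assumption: 0 when only one field is empty (not stated in the table).
                return bothEmpty ? agreement / divisor(option) : 0.0;
            case 'd':
                // Assumption: 0 when both fields are empty (not stated in the table).
                return bothEmpty ? 0.0 : disagreement / divisor(option);
            default:
                throw new IllegalArgumentException("unknown null-field option: " + option);
        }
    }

    // Reads the number after "a" or "d"; defaults to 2 and must be 1 through 10.
    private static int divisor(String option) {
        if (option.length() == 1) {
            return 2;
        }
        int n = Integer.parseInt(option.substring(1));
        if (n < 1 || n > 10) {
            throw new IllegalArgumentException("divisor must be 1 through 10: " + option);
        }
        return n;
    }

    public static void main(String[] args) {
        // With weights 15 and -5, "a" halves the agreement weight when both fields are empty.
        System.out.println(weigh("a", true, 15, -5));   // prints 7.5
        System.out.println(weigh("d5", false, 15, -5)); // prints -1.0
    }
}
```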

Master Index Match Engine Matching Comparison Functions At a Glance

Match field comparison functions, or comparators, compare the values of a field in two records to determine whether the fields match. The fields are then assigned a matching weight based on the results of the comparison function. You can use several different types of comparison functions in the match configuration file to define how the Master Index Match Engine should match the fields in the match string. The Master Index Match Engine provides several options to use with each function. You can also define custom comparison functions. For more information, see Creating Custom Comparators for the Master Index Match Engine.

The following table summarizes each comparison function. A complete reference of the comparison functions and their parameters is included in Master Index Match Engine Comparison Functions and Options.


Note - The names of these comparison functions are configurable. The following table lists their default names.


Table 2 Comparison Function Summary

| Comparison Function | Name | Description |
|---|---|---|
| b1 | Bigram Comparator | Compares two strings using an algorithm based on the Bigram algorithm. This function compares two strings using all combinations of two consecutive characters and returns the total number of combinations that are the same. |
| b2 | Advanced Bigram Comparator | Compares two strings allowing for character transpositions. This function is similar to the standard Bigram Comparator (b1). |
| u | Advanced Jaro String Comparator | Compares two strings taking into account uncertainty factors, such as string length, transpositions, and characters in common. This function is based on the Jaro algorithm. |
| ua | Winkler-Jaro String Comparator | Compares two strings in a manner similar to the Advanced Jaro String Comparator (u), but increases the agreement weight if the initial characters of each string are exact matches. This function takes into account key punch and visual memory errors. It is based on the Jaro algorithm with variants of Winkler/Lynch and McLaughlin. |
| uf | Advanced Jaro Adjusted for First Names | Based on the generic string comparator (u), this function is designed to specifically weight first name values. The string is analyzed and the weight adjusted based on statistical data. |
| ul | Advanced Jaro Adjusted for Last Names | Based on the generic string comparator (u), this function is designed to specifically weight last name values. The string is analyzed and the weight adjusted based on statistical data. |
| un | Advanced Jaro Adjusted for House Numbers | Based on the generic string comparator (u), this function is designed to specifically weight house number values. The string is analyzed and the weight adjusted based on statistical data. |
| us | Condensed String Comparator | A custom string comparator similar to the Advanced Jaro String Comparator (u). It compares two strings taking into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. Unlike the Advanced Jaro String Comparator, this function handles diacritical marks. This function also improves processing speed. |
| usu | Unicode String Comparator | Compares two strings in a manner similar to the Condensed String Comparator (us), but this function is based on Unicode to support multiple languages and alphabets. This comparator takes one parameter indicating the language to use. |
| usus | Unicode AlphaNumeric Comparator | Compares two strings in a manner similar to the Unicode String Comparator, but this function is designed to match on unique identifiers such as national IDs. This comparator takes one parameter indicating the language to use plus any of the following parameters: field length, character types, invalid values. |
| ujs | Advanced Jaro AlphaNumeric Comparator | Compares two strings in a manner similar to the Advanced Jaro String Comparator, but this function is designed to match on unique identifiers such as national IDs. This comparator takes any of the following parameters: field length, character types, invalid values. |
| c | Exact Character-to-Character Comparator | Compares string fields character by character. Each character must match in order for an agreement weight to be assigned. |
| nI | Integer Comparator | Compares integer fields using a relative distance value to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. This comparator takes two parameters; the first indicates whether to use a relative distance or direct string comparison, and the second indicates the relative distance to use. |
| nR | Real Number Comparator | Compares fields containing real numbers using a relative distance value to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. This comparator takes two parameters; the first indicates whether to use a relative distance or direct string comparison, and the second indicates the relative distance to use. |
| nS | Condensed AlphaNumeric SSN Comparator | Compares social security numbers or other unique identifiers, taking into account any of these parameters: field length, character types, invalid values. |
| dY | Date Comparator With Years as Units | Compares year values using relative distance values prior to and following the given year to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. The date comparison functions handle Gregorian years. This comparator takes up to three parameters; the first indicates whether to use a relative distance or direct string comparison, and the second and third indicate the relative distance before and after. |
| dM | Date Comparator With Months as Units | Compares the month and year using a relative distance as described above for the year comparison function (dY). |
| dD | Date Comparator With Days as Units | Compares the day, month, and year using a relative distance as described above for the year comparison function (dY). |
| dH | Date Comparator With Hours as Units | Compares the hour, day, month, and year using a relative distance as described above for the year comparison function (dY). |
| dm | Date Comparator With Minutes as Units | Compares the minute, hour, day, month, and year using a relative distance as described above for the year comparison function (dY). |
| ds | Date Comparator With Seconds as Units | Compares the second, minute, hour, day, month, and year using a relative distance as described above for the year comparison function (dY). |
| p | Prorated Comparator | Prorates the disagreement weight for a date or numeric field based on values you specify. Differences greater than the amount you specify receive the full disagreement weight. This comparator takes three parameters indicating the relative distance and the agreement and disagreement ranges. |
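
Two of the ideas summarized in Table 2 can be sketched in a few lines. The first method counts the two-character combinations that two strings share, as described for the Bigram Comparator; the second shows a relative-distance weight of the kind described for the numeric comparators. Both are simplified illustrations, not the engine's implementations. The names are hypothetical, repeated bigrams are counted once per occurrence in both strings, and the linear decrease between the agreement and disagreement weights is an assumption, since the table says only that the weight decreases as the difference grows.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketches of two comparison ideas; not the product's comparator classes.
final class ComparatorSketches {
    private ComparatorSketches() {}

    // Counts bigrams (pairs of consecutive characters) that occur in both strings.
    static int commonBigrams(String a, String b) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < a.length(); i++) {
            counts.merge(a.substring(i, i + 2), 1, Integer::sum);
        }
        int common = 0;
        for (int i = 0; i + 1 < b.length(); i++) {
            String gram = b.substring(i, i + 2);
            Integer left = counts.get(gram);
            if (left != null && left > 0) {
                common++;
                counts.put(gram, left - 1); // each occurrence in a is matched at most once
            }
        }
        return common;
    }

    // Assumed linear falloff from the agreement weight (difference 0)
    // to the disagreement weight (difference at or beyond the relative distance).
    static double distanceWeight(double x, double y, double range,
                                 double agreement, double disagreement) {
        if (range <= 0) {
            throw new IllegalArgumentException("range must be positive");
        }
        double diff = Math.abs(x - y);
        if (diff > range) {
            return disagreement;
        }
        return agreement - (agreement - disagreement) * (diff / range);
    }

    public static void main(String[] args) {
        System.out.println(commonBigrams("SMITH", "SMYTH"));        // prints 2 (SM and TH)
        System.out.println(distanceWeight(10, 12.5, 5, 10, -10));   // prints 0.0
    }
}
```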

Master Index Match Engine Comparator Definition List

The comparator definition list defines each comparator that is included in a master index application. If a comparator is not included in this list, it cannot be used in the application. If you define a comparator in this list that is not provided with the Master Index Match Engine, you need to define the logic of the new comparator in Java classes (for more information, see Creating Custom Comparators for the Master Index Match Engine).

Below is an excerpt from the default comparators list file that defines two numeric comparators, Real Number Comparator and Integer Comparator. Both comparators take two parameters, and are dependent on a second comparator class named CondensedStringComparator.

<comparator description="Numerics comparator">
  <className>NumericsComparator</className>
  <codes>            
    <code description="Real Number Comparator" name="n[R, ]"/>
    <code description="Integer Comparator" name="nI" />
  </codes>   
  <params>
    <param description="distance/string comparison option" 
           name="switch" type="java.lang.String"/>
    <param description="Spectrum of comparison" 
           name="range" type="java.lang.Integer|java.lang.Double"/>
  </params>                                  
  <data-sources/>
  <dependency-classes>
    <dependency-class matchfield="CSC"
     name="com.sun.mdm.matcher.comparators.base.CondensedStringComparator"/>
  </dependency-classes>
  <curve-adjust status="false"/>
</comparator>            

The comparators are defined in XML format. The following table lists and describes each element in the XML file.

Table 3 Comparator Definition List Elements

| Element | Attribute | Description |
|---|---|---|
| group | | An element that contains a list of comparators that all share the same Java package. |
| | description | A brief description of the comparator group. |
| | path | The Java package that contains the code that defines the comparators in the group. |
| comparator | | A definition for one subgroup of comparators that are all based on the same Java class, have the same Java class dependencies, accept the same parameters and data sources, and have the same curve adjustment setting. |
| | description | A brief description of the comparator subgroup. |
| className | | The name of the class that defines the logic for the comparators. The class must be contained in the package specified for the group element, as described above. |
| codes | | A container element for a list of the comparators in the subgroup, with descriptions and processing codes of each comparator. |
| code | | A description and processing code for one comparator. |
| | description | A description of the comparator. The value you specify here appears in the comparator drop-down list on the Master Index Configuration Editor. |
| | name | A unique identifying name for the comparator. These are the comparator names used in the rules definitions in the match configuration file (matchConfigFile.cfg). |
| params | | A container element for a list of static parameters for the subgroup of comparators. Parameters are optional. |
| param | | One parameter definition for the comparators. |
| | description | A brief description of the parameter. |
| | name | A short name for the parameter. |
| | type | The Java data type of the values that can be specified for the parameter. |
| data-sources | | A container element for a list of data files that contain additional information for the subgroup of comparators. For example, a comparator that generates weights based on the distance between postal codes might use lookup files containing information about the zip codes. Data sources are optional. |
| data-source | | A definition for one data source. Currently, only file data sources are supported. |
| | description | A brief description of the data source. |
| | name | The complete path and filename of the data source. |
| | type | The type of data source being used. Currently, the only value you can specify is “java.io.File”. |
| dependency-classes | | A container element that defines a list of Java classes on which the comparator class is dependent. The current comparator class inherits from the comparator classes you specify here as well as all the match fields (defined in matchConfigFile.cfg) that use that comparator. |
| dependency-class | | A definition for one comparator class, called a dependency comparator, on which the current comparator class is dependent. |
| | matchField | The name of the dependency comparator's match field. |
| | name | The name of the dependency comparator class. |
| curve-adjust | | An indicator of whether to apply special adjustments to the weighting curve. The curve adjustment is defined for each comparator individually in a Java class named comparator_nameCurveAdjustor. |
| | status | The status of the curve adjustor. Specify true to use the curve adjustor; specify false to disable the curve adjustor. |
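
Because a comparator can be used only if it appears in this list, it can be handy to extract the registered code names before editing the match configuration file. The following sketch does so with the standard Java XML parser. It is an illustration, not a product utility: the class and method names are hypothetical, the XML in main is trimmed from the excerpt above, and the names are returned literally without interpreting any pattern-like values.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for comparator definitions; not part of the product API.
final class ComparatorListReader {
    private ComparatorListReader() {}

    // Returns the name attribute of every <code> element, in document order.
    static List<String> codeNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList codes = doc.getElementsByTagName("code");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < codes.getLength(); i++) {
                names.add(((Element) codes.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse comparator definitions", e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<comparator description=\"Numerics comparator\">"
              + "<className>NumericsComparator</className>"
              + "<codes>"
              + "<code description=\"Real Number Comparator\" name=\"n[R, ]\"/>"
              + "<code description=\"Integer Comparator\" name=\"nI\"/>"
              + "</codes>"
              + "<curve-adjust status=\"false\"/>"
              + "</comparator>";
        List<String> names = codeNames(xml);
        System.out.println("nI registered: " + names.contains("nI"));
    }
}
```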