Understanding the Oracle Java CAPS Match Engine

Oracle Java CAPS Match Engine Matching Configuration

The matching configuration files define how the Oracle Java CAPS Match Engine processes records to assign matching probability weights, allowing the master index application to identify matches, potential duplicates, and non-matches. Two of these files are configurable: the match configuration file and the match constants file. Together they define additional logic for the Oracle Java CAPS Match Engine to use when determining the matching probability between two records. A third file, the internal match constants file, is read-only and used internally by the match engine. It defines each comparison function and the comparison options.

The matching configuration files are very flexible, allowing you to customize the matching logic according to the type of data stored in the master index application and for the record matching requirements of your business. The matching configuration files are stored in the master index project and appear as nodes in the Match Engine node of the project. The Oracle Java CAPS Match Engine typically standardizes the data prior to matching, so the match process is performed against the standardized data.

The matching configuration files for the Oracle Java CAPS Match Engine must follow certain rules for formatting and interdependencies. The following topics provide an overview of the two matching configuration files provided, the architecture of those files, and formatting descriptions. They also include an overview of comparison functions used in the match configuration file.

The Oracle Java CAPS Match Engine Match Configuration File

The match configuration file, matchConfigFile.cfg, contains the matching logic for each field on which matching is performed. By default, this file defines the matching logic for the three primary data types (person names, business names, and addresses), and can also handle generic data types, such as dates, numbers, social security numbers, and characters.

The Oracle Java CAPS Match Engine provides several comparison functions that you can call in this file to fine-tune the match process. Comparison functions contain the logic to compare different types of data in very specific ways in order to arrive at a match weight for each field. These functions allow you to define how matching is performed for different data types, and can be used with either matching and unmatching probabilities or agreement and disagreement weight ranges for each field. This file also defines how to handle missing fields.

The following topics describe the format of the configuration file and provide an overview of the predefined comparison functions.

Oracle Java CAPS Match Engine Match Configuration File Format

The match configuration file is divided into two sections. The first section consists of one line that indicates the matching probability type. The second section consists of the matching rules to use for each match field.

Match Configuration File Sample

Following is an excerpt from the default match configuration file. This excerpt illustrates the components that are described in the following sections.

ProbabilityType            1

FirstName              15  0   uf    0.99  0.001   15  -5
LastName               15  0   ul    0.99  0.001   15  -5
String                 25  0   ua    0.99  0.001   10  -5
DateDays               20  0   dD    0.99  0.001   10  -10 y 15      30
DateMonths             20  0   dM    0.99  0.001   10  -10 n
DateHours              20  0   dH    0.99  0.001   10  -10 y 30      60
DateMinutes            20  0   dm    0.99  0.001   10  -10 y 300 600
DateSeconds            20  0   ds    0.99  0.001   10  -10 y 75      60
Numeric                15  0   n     0.99  0.001   10  -10 y 8
Integer                15  0   nI    0.99  0.001   10  -10 n
Real                   15  0   nR    0.99  0.001   10  -10 n
Char                   1   0   c     0.99  0.001   5   -5
pro                    15  0   p     0.99  0.001   10  -10 20 5 5

Probability Type

The first line of the match configuration file defines the probability type to use for matching. Specify “0” (zero) to use m-probabilities and u-probabilities to determine a field’s match weight; specify “1” (one) to use agreement and disagreement weight ranges. If the probability type is set to use agreement and disagreement weight ranges, the m-prob and u-prob columns in the matching rules section are ignored. Likewise, if the probability type is set to use m-probabilities and u-probabilities, the agreement-weight and disagreement-weight columns in the matching rules section are ignored. The default is to use agreement and disagreement weight ranges because they are more intuitive.
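
When the probability type is 0, each field's agreement and disagreement weights are derived from its m-probability and u-probability rather than read from the weight columns. This section does not give the exact formula, so the sketch below assumes the standard Fellegi-Sunter form used in probabilistic record linkage (base-2 logarithms of the probability ratios). The function name is illustrative and is not part of the match engine.

```python
import math


def weights_from_probabilities(m_prob, u_prob):
    """Return (agreement_weight, disagreement_weight) for one field.

    ASSUMPTION: standard Fellegi-Sunter weights, log2(m/u) and
    log2((1-m)/(1-u)). The engine's own formula may differ in detail.
    """
    if not (0.0 < m_prob < 1.0 and 0.0 < u_prob < 1.0):
        raise ValueError("probabilities must be strictly between 0 and 1")
    agreement = math.log2(m_prob / u_prob)
    disagreement = math.log2((1.0 - m_prob) / (1.0 - u_prob))
    return agreement, disagreement


# The default rows in the sample use m-prob 0.99 and u-prob 0.001.
agree, disagree = weights_from_probabilities(0.99, 0.001)
print(round(agree, 2), round(disagree, 2))
```

Under this assumption, the default probabilities give an agreement weight of roughly 9.95 and a disagreement weight of roughly -6.64, which is the same order of magnitude as the explicit weights in the sample rows.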

Matching Rules

The section after the first line of the match configuration file contains match field rows, with each row defining how a certain data type or field will be matched. The syntax for this section is:

match-type size null-field function m-prob u-prob agreement-weight disagreement-weight parameters
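
Because each row is whitespace-delimited with eight fixed columns followed by optional parameters, a row in this syntax can be split mechanically. The following is a minimal sketch of such a reader, not the engine's own parser. The MatchRule class and parse_match_config function are hypothetical names, and the sketch assumes the file contains only the ProbabilityType line and match field rows (no comment lines).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MatchRule:
    match_type: str
    size: int
    null_field: str          # kept as text because values such as "a2" or "d" are allowed
    function: str
    m_prob: float
    u_prob: float
    agreement_weight: float
    disagreement_weight: float
    parameters: List[str]    # zero or more comparator parameters


def parse_match_config(text: str) -> Tuple[int, List[MatchRule]]:
    """Split the file into its probability type and its match field rows."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    if len(header) != 2 or header[0] != "ProbabilityType":
        raise ValueError("first line must be 'ProbabilityType <0|1>'")
    rules = []
    for tokens in body:
        if len(tokens) < 8:
            raise ValueError("row has fewer than 8 columns: %r" % tokens)
        rules.append(MatchRule(
            tokens[0], int(tokens[1]), tokens[2], tokens[3],
            float(tokens[4]), float(tokens[5]),
            float(tokens[6]), float(tokens[7]),
            tokens[8:],
        ))
    return int(header[1]), rules


SAMPLE = """\
ProbabilityType            1

FirstName              15  0   uf    0.99  0.001   15  -5
DateDays               20  0   dD    0.99  0.001   10  -10 y 15      30
pro                    15  0   p     0.99  0.001   10  -10 20 5 5
"""

probability_type, rules = parse_match_config(SAMPLE)
for rule in rules:
    print(rule.match_type, rule.function, rule.parameters)
```
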

Table 1 describes each element in a match field row.

Table 1 Match Configuration File Columns

Column 1, match-type: A value that indicates to the Oracle Java CAPS Match Engine how each field should be weighted. Each field included in the match string (the MatchingConfig section of the Match Field file) must have a match type corresponding to a value in this column.

Column 2, size: The number of characters in the field on which matching is performed, beginning with the first character. For example, to match on only the first four characters in a 10-character field, the value of this column should be “4”.

Column 3, null-field: An index that specifies how to calculate the total weight for null fields or fields that only contain spaces. You can specify any of the following values:
  • 0 - (zero) If one or both fields are empty, the weight used for the field is 0 (zero).

  • 1 - (one) If both fields are empty, the agreement weight is used; if only one field is empty, the disagreement weight is used.

  • a# - An “a” followed by a number specifies to use the agreement weight if both fields are empty. The agreement weight is divided by the number following the “a” to obtain the match weight for that field. If no number is specified, the default is “2”. You can specify any number from 1 through 10.

  • d# - A “d” followed by a number specifies to use the disagreement weight if only one field is empty. The disagreement weight is divided by the number following the “d” to obtain the match weight for the field. If no number is specified, the default is “2”. You can specify any number from 1 through 10.

Note - In the above descriptions, the agreement and disagreement weights are either specified in this file or calculated using a logarithmic formula based on the m-probabilities and u-probabilities (depending on the probability type).

Column 4, function: The type of comparison to perform when weighting the field. For information about the available comparison functions, see Match Configuration Comparison Functions for Oracle Java CAPS Match Engine (Repository).

Column 5, m-prob: The initial probability that the specified field in two records will match if the records match. The probability is a double value between 0 and 1, and can have up to 16 decimal places.

Column 6, u-prob: The initial probability that the specified field in two records will match if the records do not match. The probability is a double value between 0 and 1, and can have up to 16 decimal places.

Column 7, agreement-weight: The matching weight to be assigned to a field given that the fields match between two records. This number can be between 0 and 100 and can have up to 16 decimal places. It represents the maximum match weight for a field.

Column 8, disagreement-weight: The matching weight to be assigned to a field given that the fields do not match between two records. This number can be between 0 and -100 and can have up to 16 decimal places. It represents the minimum match weight for a field.

Column 9, parameters: The parameters that correspond to the comparison function specified in column 4. Some comparison functions do not take any parameters and some take multiple parameters. For additional information about parameters, see Match Configuration Comparison Functions for Oracle Java CAPS Match Engine (Repository).
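
The null-field options described for column 3 can be summarized in a few lines of code. This is a sketch of the rules as stated above, using a hypothetical function name. The description does not say what a# yields when only one field is empty, or what d# yields when both are empty, so the sketch returns 0 in those cases as an explicitly labeled assumption.

```python
def null_field_weight(option, left, right, agreement, disagreement):
    """Weight for a field when at least one value is empty.

    Returns None when both values are present, meaning the normal
    comparison function should be used instead.
    """
    left_empty = left is None or left.strip() == ""
    right_empty = right is None or right.strip() == ""
    if not (left_empty or right_empty):
        return None
    both_empty = left_empty and right_empty

    if not option:
        raise ValueError("null-field option is required")
    if option == "0":
        return 0.0
    if option == "1":
        return agreement if both_empty else disagreement

    kind, digits = option[0], option[1:]
    divisor = int(digits) if digits else 2    # default divisor is 2
    if kind not in ("a", "d") or not 1 <= divisor <= 10:
        raise ValueError("unsupported null-field option: %r" % option)
    if kind == "a" and both_empty:
        return agreement / divisor
    if kind == "d" and not both_empty:
        return disagreement / divisor
    # ASSUMPTION: cases the description leaves open contribute no weight.
    return 0.0


# With agreement weight 15 and disagreement weight -5:
print(null_field_weight("a", "", "", 15, -5))        # both empty, default divisor 2
print(null_field_weight("d5", "", "Smith", 15, -5))  # one empty, divisor 5
```
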

Oracle Java CAPS Match Engine Matching Comparison Functions

Match field comparison functions, or comparators, compare the values of a field in two records to determine whether the fields match. The fields are then assigned a matching weight based on the results of the comparison function. You can use several different types of comparison functions in the match configuration file to define how the Oracle Java CAPS Match Engine should match the fields in the match string. The Oracle Java CAPS Match Engine provides several options to use with each function.

The following table summarizes each comparison function. A complete reference of the comparison functions and their parameters is included in Match Configuration Comparison Functions for Oracle Java CAPS Match Engine (Repository).

Table 2 Comparison Functions

b1, Bigram String Comparator: Based on the Bigram algorithm, this function compares two strings using all combinations of two consecutive characters and returns the total number of combinations that are the same.

b2, Advanced Bigram String Comparator: Similar to the standard Bigram comparison function (b1), but allows for character transpositions.

u, Generic String Comparator: Based on the Jaro algorithm, this function compares two strings taking into account uncertainty factors, such as string length, transpositions, and characters in common.

ua, Advanced Generic String Comparator: Based on the Jaro algorithm with variants of Winkler/Lynch and McLaughlin, this function is similar to the generic string comparator (u), but increases the agreement weight if the initial characters of each string are exact matches. This comparison function takes into account key punch and visual memory errors.

uf, Simplified String Comparator - FirstName: Based on the generic string comparator (u), this function is designed to specifically weight first name values. The string is analyzed and the weight adjusted based on statistical data.

ul, Simplified String Comparator - LastName: Based on the generic string comparator (u), this function is designed to specifically weight last name values. The string is analyzed and the weight adjusted based on statistical data.

un, Simplified String Comparator - House Numbers: Based on the generic string comparator (u), this function is designed to specifically weight house number values. The string is analyzed and the weight adjusted based on statistical data.

us, Simplified String Comparator: A custom string comparator that compares two strings taking into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. Unlike the generic string comparator (u), this function handles diacritical marks. This function also improves processing speed.

usu, Language-specific String Comparator: A custom string comparator similar to the us comparator, except that it is based on Unicode to support multiple languages and alphabets. This comparator takes one parameter indicating the language to use.

c, Exact char-by-char Comparator: Compares string fields character by character. Each character must match in order for an agreement weight to be assigned.

n, Generic Number Comparator: Compares numeric fields using a relative distance value to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. This comparator takes two parameters; the first indicates whether to use a relative distance or direct string comparison, and the second indicates the relative distance to use.

nI, Integer Comparator: Compares integer fields using a relative distance comparison. This comparison function is based on the generic number comparator (n), and accepts the same parameters.

nR, Real Number Comparator: Compares fields containing real numbers using a relative distance comparison. This comparison function is based on the generic number comparator (n), and accepts the same parameters.

nS, Alphanumeric Comparator: Compares social security numbers or other unique identifiers, taking into account any of these parameters:
  • Field length

  • Character types

  • Invalid values

dY, Date Comparator - Year only: Compares year values using relative distance values prior to and following the given year to determine the match weight. As the difference between the two fields increases, the match weight decreases. Once the difference is beyond the relative distance, a disagreement weight is assigned. The date comparison functions handle Gregorian years. This comparator takes up to three parameters; the first indicates whether to use a relative distance or direct string comparison, and the second and third indicate the relative distance before and after.

dM, Date Comparator - Month-Year: Compares the month and year using a relative distance as described above for the year comparison function (dY).

dD, Date Comparator - Day-Month-Year: Compares the day, month, and year using a relative distance as described above for the year comparison function (dY).

dH, Date Comparator - Hour-Day-Month-Year: Compares the hour, day, month, and year using a relative distance as described above for the year comparison function (dY).

dm, Date Comparator - Min-Hour-Day-Month-Year: Compares the minute, hour, day, month, and year using a relative distance as described above for the year comparison function (dY).

ds, Date Comparator - Sec-Min-Hour-Day-Month-Year: Compares the second, minute, hour, day, month, and year using a relative distance as described above for the year comparison function (dY).

p, Prorated Comparator: Prorates the disagreement weight for a date or numeric field based on values you specify. Differences greater than the amount you specify receive the full disagreement weight. This comparator takes three parameters indicating the relative distance and the agreement and disagreement ranges.
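
Two of the ideas in Table 2 are easy to illustrate in code: counting the two-character combinations that two strings share (the basis of b1), and letting a numeric match weight fall as the difference grows until it reaches the disagreement weight (the basis of n). The sketch below is not the engine's implementation. The function names are hypothetical, the distance is treated as an absolute difference, and the linear decay between the agreement and disagreement weights is an assumption, since the table only states that the weight decreases.

```python
from collections import Counter


def bigrams(text):
    """All pairs of consecutive characters in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def shared_bigram_count(a, b):
    """Number of bigrams the two strings have in common (multiset intersection)."""
    common = Counter(bigrams(a)) & Counter(bigrams(b))
    return sum(common.values())


def relative_distance_weight(x, y, distance, agreement, disagreement):
    """Weight that falls from agreement to disagreement as |x - y| grows.

    ASSUMPTION: absolute difference and linear interpolation; the actual
    decay used by the n comparator is not described in this section.
    """
    diff = abs(x - y)
    if diff > distance:
        return disagreement          # beyond the relative distance
    if distance == 0:
        return agreement             # only exact equality reaches this point
    return agreement - (agreement - disagreement) * diff / distance


# SMITH -> SM, MI, IT, TH and SMYTH -> SM, MY, YT, TH share SM and TH.
print(shared_bigram_count("SMITH", "SMYTH"))

# Numeric row from the sample: weights 10 and -10, relative distance 8.
for other in (100, 104, 109):
    print(other, relative_distance_weight(100, other, 8, 10, -10))
```
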

The Match Constants File

The match constants file, matchConstants.cfg, defines certain configurable constants used by the match engine. This file includes four parameters, but currently only the first parameter, nFields, is used. This parameter defines the maximum number of fields used for matching, and its value must be equal to or greater than the number of fields defined in the match-columns element of the Match Field file. The match constants file defines the following constants for the match engine.