JavaScript is required to for searching.
Skip Navigation Links
Exit Print View
Understanding the Oracle Java CAPS Match Engine     Java CAPS Documentation
search filter icon
search icon

Document Information

Understanding the Oracle Java CAPS Match Engine

Related Topics

About the Oracle Java CAPS Match Engine

Oracle Java CAPS Match Engine Overview

About the Oracle Java CAPS Match Engine Matching Algorithm

Oracle Java CAPS Match Engine Standardization and Matching Process

Oracle Java CAPS Match Engine Data Types

How the Oracle Java CAPS Match Engine Works

Oracle Java CAPS Match Engine Matching Weight Formulation

Matching and Unmatching Probabilities

Agreement and Disagreement Weight Ranges

Oracle Java CAPS Match Engine Standardization Configuration

Oracle Java CAPS Match Engine Standardization File Types

Oracle Java CAPS Match Engine Internationalization

Oracle Java CAPS Match Engine Matching Configuration

The Oracle Java CAPS Match Engine Match Configuration File

Oracle Java CAPS Match Engine Match Configuration File Format

Match Configuration File Sample

Probability Type

Matching Rules

Oracle Java CAPS Match Engine Matching Comparison Functions

The Match Constants File

Oracle Java CAPS Match Engine and the Oracle Java CAPS Match Engine

Master Index Components and the Oracle Java CAPS Match Engine

Searching and Matching in Oracle Java CAPS Match Engine Applications (Repository)

Standardization and Matching Process in Master Index Applications (Repository)

The Master Index Match String (Repository)

Oracle Java CAPS Match Engine Field Identifiers

Oracle Java CAPS Match Engine Match and Standardization Types

Oracle Java CAPS Match Engine Configuration File Modifications

Configuring the Master Index Matching Service (Repository)

Master Index Standardization Configuration (Repository)

Normalization Structures

Standardization Structures (Parsing and Normalization)

Phonetic Encoding Structures

Master Index Match String Configuration (Repository)

Match and Standardization Engine Configuration

Master Index Phonetic Encoder Configuration (Repository)

Oracle Java CAPS Match Engine Person Data Type Configuration

Oracle Java CAPS Match Engine Person Matching Overview

Oracle Java CAPS Match Engine Person Data Processing Fields

Person Data Match String Fields

Person Data Standardized Fields

Person Data Object Structure

Oracle Java CAPS Match Engine Match Configuration for Person Data

Oracle Java CAPS Match Engine Person Data Standardization Files

Oracle Java CAPS Match Engine Common Standardization Files for Person Data

The Hyphenated Name Category File (personFirstNameDash.dat)

The Person Name Patterns File (personNamePatt.dat)

The Special Characters Reference File (personRemoveSpecChars.dat)

Oracle Java CAPS Match Engine Domain-Specific Standardization Files for Person Data

The Conjunction Reference File (personConjon*.dat)

The Person Constants File (personConstants*.cfg)

The First Name Category File (personFirstName*.dat)

The Generational Suffix Category File (personGenSuffix*.dat)

Last Name Prefix Category File (personLastNamePrefix*.dat)

The Last Name Category File (personLastName*.dat)

The Occupational Suffix Category File (personOccupSuffix*.dat)

The Three-Character Suffix File (personThree*.dat)

The Title Category File (personTitle*.dat)

The Two-Character Suffix File (personTwo*.dat)

The Business-Related Category File (businessOrRelated*.dat)

Configuring the Oracle Java CAPS Match Engine Standardization Files for Person Data

Configuring the Master Index Matching Service for Person Data (Repository)

Configuring the Standardization Structure for Person Data (Repository)

Person Data Normalization Structures

Person Data Phonetic Encoding

Configuring the Match String for Person Data (Repository)

Oracle Java CAPS Match Engine Address Data Type Configuration

Oracle Java CAPS Match Engine Address Matching Overview

Oracle Java CAPS Match Engine Address Data Processing Fields

Address Data Match String Fields

Address Data Standardized Fields

Address Data Object Structure

Match Configuration for Address Data (Repository)

Oracle Java CAPS Match Engine Standardization Configuration for Address Data

The Address Constants File (addressConstants*.cfg)

The Address Clues File (addressClueAbbrev*.dat)

The Address Internal Constants File (addressInternalConstants*.cfg)

The Address Master Clues File (addressMasterClues*.dat)

The Address Patterns File (addressPatterns*.dat)

The Address Output Patterns File (addressOutPatterns*.dat)

Address Pattern File Components

Address Type Tokens

Pattern Classes

Pattern Modifiers

Priority Indicators

Modifying Oracle Java CAPS Match Engine Address Data Configuration Files

Configuring the Matching Service for Address Data (Repository)

Configuring the Standardization Structure for Address Data (Repository)

Address Standardization Structures

Address Phonetic Encoding

Configuring the Match String for Address Data (Repository)

Oracle Java CAPS Match Engine Business Names Data Type Configuration

Oracle Java CAPS Match Engine Business Name Matching Overview

Oracle Java CAPS Match Engine Business Name Processing Fields

Business Name Match String Fields

Business Name Standardized Fields

Business Name Object Structure

Oracle Java CAPS Match Engine Match Configuration for Business Names

Oracle Java CAPS Match Engine Standardization Configuration for Business Names

The Business Constants File (bizConstants.cfg)

The Adjectives Key Type File (bizAdjectivesTypeKeys.dat)

The Alias Key Type File (bizAliasTypeKeys.dat)

The Association Key Type File (bizAssociationTypeKeys.dat)

The General Terms Reference File (bizBusinessGeneralTerms.dat)

The City or State Key Type File (bizCityorStateTypeKeys.dat)

The Business Former Name Reference File (bizCompanyFormerNames.dat)

The Merged Business Name Category File (bizCompanyMergerNames.dat)

The Primary Business Name Reference File (bizCompanyPrimaryNames.dat)

The Connector Tokens Reference File (bizConnectorTokens.dat)

The Country Key Type File (bizCountryTypeKeys.dat)

The Industry Sector Reference File (bizIndustryCategoryCode.dat)

The Industry Key Type File (bizIndustryTypeKeys.dat)

The Organization Key Type File (bizOrganizationTypeKeys.dat)

The Business Patterns File (bizPatterns.dat)

Business Name Tokens

The Special Characters Reference File (bizRemoveSpecChars.dat)

Modifying Oracle Java CAPS Match Engine Business Name Configuration Files

Configuring the Matching Service for Business Names (Repository)

Configuring the Standardization Structure for Business Names (Repository)

Business Name Standardization Structures

Business Name Phonetic Encoding

Configuring the Match String for Business Names (Repository)

Fine-Tuning Weights and Thresholds for Oracle Java CAPS Match Engine (Repository)

Data Analysis Overview

Customizing the Match Configuration and Thresholds

Determining the Match Fields

Customizing the Match Configuration

Probabilities or Agreement Weights

Defining Relative Value

Determining the Weight Range

Weight Ranges Using Agreement Weights

Weight Ranges Using Probabilities

Comparison Functions

Determining the Weight Thresholds

Specifying the Weight Thresholds

Fine-tuning the Thresholds

Match Configuration Comparison Functions for Oracle Java CAPS Match Engine (Repository)

Oracle Java CAPS Match Engine Comparison Functions

Bigram Comparators

Bigram String Comparator (b1)

Advanced Bigram String Comparator (b2)

Uncertainty String Comparators

Generic String Comparator (u)

Advanced Generic String Comparator (ua)

Simplified String Comparator (us)

Simplified String Comparator - FirstName (uf)

Simplified String Comparator - LastName (ul)

Simplified String Comparator - House Numbers (un)

Language-specific String Comparator (usu)

Exact char-by-char Comparator (c)

Numeric Comparators

Generic Number Comparator (n)

Integer Comparator (nI)

Real Number Comparator (nR)

Alphanumeric Comparator (nS)

Date Comparators

Date Comparator - Year only (dY)

Date Comparator - Month-Year (dM)

Date Comparator - Day-Month-Year (dD)

Date Comparator - Hour-Day-Month-Year (dH)

Date Comparator - Min-Hour-Day- Month-Year (dm)

Date Comparator - Sec-Min-Hour-Day- Month-Year (ds)

Prorated Comparator (p)

Oracle Java CAPS Match Engine Comparison Function Options

Match Configuration Comparison Functions for Oracle Java CAPS Match Engine (Repository)

Match field comparison functions, or comparators, compare the values of a field in two records to determine whether the fields match or how closely they match. The fields are then assigned a matching weight based on the results of the comparison function. You can use several different types of comparison functions in the match configuration file in order to customize how the Oracle Java CAPS Match Engine matches records. The following topics provide information about the predefined comparison functions and their parameters and options.

Oracle Java CAPS Match Engine Comparison Functions

There are six primary types of comparison functions used by the Oracle Java CAPS Match Engine. The following types of comparison functions are available.

Certain comparison function types are very specific to the type of data being matched, such as the numeric functions and the date functions. Others, such as the Bigram and uncertainty functions, are more general and can be applied to various data fields.

Be sure to review Table 1 for information about how the parameters in the match configuration file affect the outcome of the comparator functions. For example, these parameters define how null fields are handled and what the actual agreement and disagreement weights will be.


Note - The names of the comparators are configurable. The default names are used here.


Bigram Comparators

The Oracle Java CAPS Match Engine provides two different comparison functions based on the Bigram algorithm, the standard bigram (b1) and the transposition bigram (b2). A Bigram algorithm compares two strings using all combinations of two consecutive characters within each string. For example, the word “bigram” contains the following bigrams: “bi”, “ig”, “gr”, “ra”, and “am”. The Bigram comparison function returns a value between 0 and 1, which accounts for the total number of bigrams that are in common between the strings divided by the average number of bigrams in the strings. Bigrams handle minor typographical errors well.

Bigram String Comparator (b1)

This is a standard Bigram comparison function, processing match fields as described above. This comparison function takes no parameters.

Advanced Bigram String Comparator (b2)

This comparison function is based on the standard Bigram comparison function, but handles transpositions of characters within a string. This comparison function takes no parameters.

Uncertainty String Comparators

The Oracle Java CAPS Match Engine provides the following uncertainty comparison functions for comparing string fields. Most uncertainty comparison functions are generic, but three comparison functions are designed for specific types of information (first name, last name, and house number).

Generic String Comparator (u)

This is the standard uncertainty comparison function, which processes string fields as described above. As more differences are found between two fields, the agreement weight decreases nonlinearly. Thus, the agreement weight can remain high for several differences, but will drop sharply at a certain point. This comparison function takes no parameters.

The uncertainty comparison function is based on the Jaro algorithm with McLaughlin adjustments for similarities. The Jaro algorithm is a string comparison function that accounts for insertions, deletions, and transpositions by performing the following steps.

  1. Compute the lengths of both strings to be matched.

  2. Determine the number of common characters between the two strings. In order for characters to be considered common, they must be within one-half the length of the shorter string.

  3. Determine the number of transpositions. A transposition means a character from the first string is out of order with the corresponding common character from the second string.

Advanced Generic String Comparator (ua)

This comparison function is based on the standard uncertainty comparison function, u, with variants of Winkler/Lynch and McLaughlin. It has additional features to handle specific differences between fields, such as key punch and visual memory errors. Each feature makes use of the information made available from previous features. This comparison function takes no parameters. The following features are included in the advanced uncertainty function.

Simplified String Comparator (us)

This comparison function is a custom version of a generic string comparison function. It is similar to the basic uncertainty comparison function, u, but processes data in a more simple and efficient manner, improving processing speed. The agreement weights generated by this comparison function decrease in a more uniform manner for each difference found between two fields.

Like the basic uncertainty function, the simplex function takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. Unlike the uncertainty comparison function (“u”), this function handles diacritical marks. This comparison function takes no parameters.

Simplified String Comparator - FirstName (uf)

This comparison function is designed specifically for matching on first name fields, and is based on the simplex uncertainty comparison function, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Simplified String Comparator - LastName (ul)

This comparison function is designed specifically for matching on last name fields, and is based on the simplex uncertainty comparison function, us. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Simplified String Comparator - House Numbers (un)

This comparison function is designed specifically for matching on house numbers, and is based on the simplex uncertainty comparison function, u. This comparison function analyzes the string and then adjusts the weight based on statistical data. This comparison function takes no parameters.

Language-specific String Comparator (usu)

This comparison function is a custom version of a generic string comparison function. It is similar to the simplex uncertainty comparison function, us, but is based in Unicode to enable multilingual support. This locale-oriented comparator recognizes the nuances of each language and supports the complexities and subtleties of each. For example, when configured to use the German language set, the function recognizes “ß” and “ss” as equivalent. Like the simplex uncertainty function, the Unicode function takes into account such uncertainty factors as string length, transpositions, key punch errors, and visual memory errors. This comparison function takes the parameter described in Table 38.

Table 38 usu Comparison Function Parameter

Parameter
Description
language
An indicator of the language being used for the information stored in the database. Enter one of the following codes to indicate the language in use.

da - Danish

sv - Swedish

nb - Norwegian Bokmål

nn - Norwegian Nynorsk

nl - Dutch

es - Spanish

fr - French

en - English

it - Italian

de - German

Exact char-by-char Comparator (c)

The Oracle Java CAPS Match Engine provides one exact-match comparison function, “c”. With this comparison function, two fields must match on each character in order to be considered a match. This comparison function takes no parameters.

Numeric Comparators

The Oracle Java CAPS Match Engine provides several comparison functions for matching on numeric fields.

All but the nS comparison function can perform numeric string comparisons or relative distance calculations. When set for a string comparison, the functions compare numeric strings based on the advanced uncertainty comparator. When set for relative distance calculations, the matching weight between two numbers decreases as the numbers become further apart, until the relative distance plus one is reached. At this point, the numbers are considered non-matches. For example, if the relative distance is “10” and the base number for comparison is “2”, a field value of 8 receives a lower matching weight than a field value of 4; but a field value of 13 is considered a complete non-match (since the distance between 2 and 13 is 11).

Figure 3 illustrates how the weight is decreased as the difference between the two compared fields reaches the relative distance. In this diagram, the relative distance is 10 and the light blue line represents the agreement weight. When the difference between two fields reaches 11 (relative distance plus one), the fields are considered a non-match and are given the full disagreement weight.

Figure 3 Numeric Relative Distance Comparison

image:Figure shows how weights are assigned based on relative distances.
Generic Number Comparator (n)

This is a basic numeric comparison function, processing numeric fields as described above. It accepts the parameters listed in Table 39.

Table 39 n, nI, and nR Comparison Function Parameters

Parameter
Description
distance-or-string
Specifies whether a relative distance calculation or a direct string comparison is used. Specify “y” to use a relative distance calculation; specify “n” to use a string comparison.
relative-distance
The greatest difference between two integers at which the values could still be considered a possible match. When the difference between two numbers is greater than the relative distance, the numbers are considered a non-match (the weight becomes zero when the actual difference is the relative distance plus one).
Integer Comparator (nI)

This numeric comparison function matches specifically on integers and accepts the parameters listed in Table 39.

Real Number Comparator (nR)

This numeric comparison function matches specifically on real numbers and accepts the parameters listed in Table 39.

Alphanumeric Comparator (nS)

This numeric comparison function is designed specifically for matching on numeric strings and is very useful for matching social security numbers or other unique identifiers. This is the only numeric comparator that can compare alphanumeric values rather than just numeric values. It accepts the parameters listed in Table 40.

Table 40 nS Comparison Function Parameters

Parameter
Description
fixed-length
An optional parameter that takes the length of the field value into account. If a fixed length is specified, the match engine considers any field of a different length to be a non-match. Specify any integer smaller than the value specified for the size specified for the field (for more information, see Matching Rules).
character-type
An indicator of whether the field must be all numeric. Specify “nu” for numeric only, or specify “an” to allow alphanumeric characters. The match engine considers any fields containing characters that are not allowed to be a non-match.
invalid-characters
A list of invalid characters for the field. If you specify a character, the match engine considers fields that consist of only that character to be a non-match. For example, if you specify “0”, then an SSN field cannot contain all zeros. Specify as many alphanumeric characters as needed, separated by a space.

Date Comparators

The Oracle Java CAPS Match Engine provides various date comparison functions. When comparing dates, the match engine compares each date component (for example, it compares the year in the first date against the year in the second date, the month against the month, and the day against the day). This allows for multiple transpositions in each date field. The date comparators use the Java date format (java.sql.Date), allowing the comparator to use the Gregorian calendar and to take into account the time zone where the date field originated.

The following comparison functions are available for matching on date fields.

As with the numeric comparison functions, the date comparison functions can use either a direct string comparison or a relative distance calculation. When using a relative distance calculation, the matching weight between two dates decreases as the dates become further apart, until the relative distance is reached. When the difference becomes the relative distance plus one, the dates are considered non-matches. You can specify different relative distances for before and after the given date. Any dates falling outside of the specified time period receive a complete disagreement weight. The relative distances are specified in the smallest unit of time being matched.

Figure 4 illustrates how the weight is decreased as the difference between the two compared fields reaches either the before or after relative distance. In this diagram, the before relative distance is 11, the after relative distance is 5, and the light blue line represents the agreement weight. When the base date is later than the compared date and the difference between the dates reaches 11 (distance before plus one), the fields are considered a non-match and are given the full disagreement weight. When the base date is earlier than the compared date and the difference between the dates reaches 6 (distance after plus 1), the fields are considered a non-match.

Figure 4 Date Relative Distance Comparison

image:Figure illustrates how weights are assigned when using relative distance for date comparators.

The date comparison functions take the parameters listed in Table 41.

Table 41 Date Comparison Function Parameters

Parameter
Description
distance-or-string
Specifies whether a relative distance calculation or a direct string comparison is used. Specify “y” to use a relative distance calculation; specify “n” to use a string comparison.
distance-before
The number of units prior to the reference date/time for which two date fields can still be considered a match.
distance-after
The number of units following the reference date/time for which two date fields can still be considered a match.
Date Comparator - Year only (dY)

This date comparison function takes only the 4-character year into account for matching. If relative distance calculation is specified, the relative distance is specified in years.

Date Comparator - Month-Year (dM)

This date comparison function takes the month and year into account for matching. If relative distance calculation is specified, the relative distance is specified in months.

Date Comparator - Day-Month-Year (dD)

This date comparison function takes the day, month, and year into account for matching. If relative distance calculation is specified, the relative distance is specified in days.

Date Comparator - Hour-Day-Month-Year (dH)

This date comparison function takes the hour, day, month, and year into account for matching. If relative distance calculation is specified, the relative distance is specified in hours.

Date Comparator - Min-Hour-Day- Month-Year (dm)

This date comparison function takes the minute, hour, day, month, and year into account for matching. If relative distance calculation is specified, the relative distance is specified in minutes.

Date Comparator - Sec-Min-Hour-Day- Month-Year (ds)

This date comparison function takes the second, minute, hour, day, month, and year into account for matching. If relative distance calculation is specified, the relative distance is specified in seconds.

Prorated Comparator (p)

The prorated comparison function uses a relative distance calculation and allows you to specify how quickly the agreement weight between two fields decreases. Matching weights are assigned with a linear adjustment according to the parameters you specify. You specify an initial agreement range. If the difference between two fields falls within that range, the fields are considered a complete match. You also specify a disagreement range ending with the relative distance. If the difference between two fields falls within that range, the fields are considered a non-match. When the difference between the fields falls between those two ranges, they are considered to be partial matches and the agreement weight is adjusted linearly. Any difference greater than the relative distance is always considered a non-match.

Figure 5 illustrates how weighting is adjusted per the parameters you define. In these diagrams, the green line indicates full agreement, the light blue line indicates prorated agreement, and the red line indicates full disagreement. The diagrams illustrate how increasing the disagreement weight causes the prorated agreement weight to decrease more sharply.

Figure 5 Prorated Linear Adjustment Comparison

image:Figure shows two examples of how weights are assigned using the prorated comparison function.

The prorated comparison functions takes the parameters listed in Table 42.

Table 42 Prorated Comparison Function Parameters

Parameter
Description
relative-distance
The greatest difference between two numbers at which they can still be considered a match or partial match.
agreement-range
The greatest difference between two numbers at which they are considered a full match. This number must be less than the relative distance.
disagreement-range
This number indicates the minimum difference at which two numbers are considered a non-match and shortens or lengthens the weighting scale. To find this difference, the match engine subtracts this value from the relative distance. If the fields differ by that amount or greater, they are considered to be a non-match.

The weighting scale decreases in size as the value of the full-disagreement parameter increases (see diagram).

Oracle Java CAPS Match Engine Comparison Function Options

The options listed below can be used in conjunction with the above string comparison functions to give them more functionality. For example, you can use an ”ufI’ string comparison function that refers to the first name comparison function with the possibility to switch fields if the first one does not match.