Skip Navigation Links | |
Exit Print View | |
Understanding the Oracle Java CAPS Match Engine Java CAPS Documentation |
Understanding the Oracle Java CAPS Match Engine
About the Oracle Java CAPS Match Engine
Oracle Java CAPS Match Engine Overview
About the Oracle Java CAPS Match Engine Matching Algorithm
Oracle Java CAPS Match Engine Standardization and Matching Process
Oracle Java CAPS Match Engine Data Types
How the Oracle Java CAPS Match Engine Works
Oracle Java CAPS Match Engine Matching Weight Formulation
Matching and Unmatching Probabilities
Agreement and Disagreement Weight Ranges
Oracle Java CAPS Match Engine Standardization Configuration
Oracle Java CAPS Match Engine Standardization File Types
Oracle Java CAPS Match Engine Internationalization
Oracle Java CAPS Match Engine Matching Configuration
The Oracle Java CAPS Match Engine Match Configuration File
Oracle Java CAPS Match Engine Match Configuration File Format
Match Configuration File Sample
Oracle Java CAPS Match Engine Matching Comparison Functions
Oracle Java CAPS Match Engine and the Oracle Java CAPS Match Engine
Master Index Components and the Oracle Java CAPS Match Engine
Searching and Matching in Oracle Java CAPS Match Engine Applications (Repository)
Standardization and Matching Process in Master Index Applications (Repository)
The Master Index Match String (Repository)
Oracle Java CAPS Match Engine Field Identifiers
Oracle Java CAPS Match Engine Match and Standardization Types
Oracle Java CAPS Match Engine Configuration File Modifications
Configuring the Master Index Matching Service (Repository)
Master Index Standardization Configuration (Repository)
Standardization Structures (Parsing and Normalization)
Master Index Match String Configuration (Repository)
Match and Standardization Engine Configuration
Master Index Phonetic Encoder Configuration (Repository)
Oracle Java CAPS Match Engine Person Data Type Configuration
Oracle Java CAPS Match Engine Person Matching Overview
Oracle Java CAPS Match Engine Person Data Processing Fields
Person Data Match String Fields
Person Data Standardized Fields
Oracle Java CAPS Match Engine Match Configuration for Person Data
Oracle Java CAPS Match Engine Person Data Standardization Files
Oracle Java CAPS Match Engine Common Standardization Files for Person Data
The Hyphenated Name Category File (personFirstNameDash.dat)
The Person Name Patterns File (personNamePatt.dat)
The Special Characters Reference File (personRemoveSpecChars.dat)
Oracle Java CAPS Match Engine Domain-Specific Standardization Files for Person Data
The Conjunction Reference File (personConjon*.dat)
The Person Constants File (personConstants*.cfg)
The First Name Category File (personFirstName*.dat)
The Generational Suffix Category File (personGenSuffix*.dat)
Last Name Prefix Category File (personLastNamePrefix*.dat)
The Last Name Category File (personLastName*.dat)
The Occupational Suffix Category File (personOccupSuffix*.dat)
The Three-Character Suffix File (personThree*.dat)
The Title Category File (personTitle*.dat)
The Two-Character Suffix File (personTwo*.dat)
The Business-Related Category File (businessOrRelated*.dat)
Configuring the Oracle Java CAPS Match Engine Standardization Files for Person Data
Configuring the Master Index Matching Service for Person Data (Repository)
Configuring the Standardization Structure for Person Data (Repository)
Person Data Normalization Structures
Configuring the Match String for Person Data (Repository)
Oracle Java CAPS Match Engine Address Data Type Configuration
Oracle Java CAPS Match Engine Address Matching Overview
Oracle Java CAPS Match Engine Address Data Processing Fields
Address Data Match String Fields
Address Data Standardized Fields
Match Configuration for Address Data (Repository)
Oracle Java CAPS Match Engine Standardization Configuration for Address Data
The Address Constants File (addressConstants*.cfg)
The Address Clues File (addressClueAbbrev*.dat)
The Address Internal Constants File (addressInternalConstants*.cfg)
The Address Master Clues File (addressMasterClues*.dat)
The Address Patterns File (addressPatterns*.dat)
The Address Output Patterns File (addressOutPatterns*.dat)
Address Pattern File Components
Modifying Oracle Java CAPS Match Engine Address Data Configuration Files
Configuring the Matching Service for Address Data (Repository)
Configuring the Standardization Structure for Address Data (Repository)
Address Standardization Structures
Configuring the Match String for Address Data (Repository)
Oracle Java CAPS Match Engine Business Names Data Type Configuration
Oracle Java CAPS Match Engine Business Name Matching Overview
Oracle Java CAPS Match Engine Business Name Processing Fields
Business Name Match String Fields
Business Name Standardized Fields
Business Name Object Structure
Oracle Java CAPS Match Engine Match Configuration for Business Names
Oracle Java CAPS Match Engine Standardization Configuration for Business Names
The Business Constants File (bizConstants.cfg)
The Adjectives Key Type File (bizAdjectivesTypeKeys.dat)
The Alias Key Type File (bizAliasTypeKeys.dat)
The Association Key Type File (bizAssociationTypeKeys.dat)
The General Terms Reference File (bizBusinessGeneralTerms.dat)
The City or State Key Type File (bizCityorStateTypeKeys.dat)
The Business Former Name Reference File (bizCompanyFormerNames.dat)
The Merged Business Name Category File (bizCompanyMergerNames.dat)
The Primary Business Name Reference File (bizCompanyPrimaryNames.dat)
The Connector Tokens Reference File (bizConnectorTokens.dat)
The Country Key Type File (bizCountryTypeKeys.dat)
The Industry Sector Reference File (bizIndustryCategoryCode.dat)
The Industry Key Type File (bizIndustryTypeKeys.dat)
The Organization Key Type File (bizOrganizationTypeKeys.dat)
The Business Patterns File (bizPatterns.dat)
The Special Characters Reference File (bizRemoveSpecChars.dat)
Modifying Oracle Java CAPS Match Engine Business Name Configuration Files
Configuring the Matching Service for Business Names (Repository)
Configuring the Standardization Structure for Business Names (Repository)
Business Name Standardization Structures
Business Name Phonetic Encoding
Configuring the Match String for Business Names (Repository)
Fine-Tuning Weights and Thresholds for Oracle Java CAPS Match Engine (Repository)
Customizing the Match Configuration and Thresholds
Customizing the Match Configuration
Probabilities or Agreement Weights
Weight Ranges Using Agreement Weights
Weight Ranges Using Probabilities
Determining the Weight Thresholds
Match Configuration Comparison Functions for Oracle Java CAPS Match Engine (Repository)
Oracle Java CAPS Match Engine Comparison Functions
Advanced Bigram String Comparator (b2)
Uncertainty String Comparators
Advanced Generic String Comparator (ua)
Simplified String Comparator (us)
Simplified String Comparator - FirstName (uf)
Simplified String Comparator - LastName (ul)
Simplified String Comparator - House Numbers (un)
Language-specific String Comparator (usu)
Exact char-by-char Comparator (c)
Date Comparator - Year only (dY)
Date Comparator - Month-Year (dM)
Date Comparator - Day-Month-Year (dD)
Date Comparator - Hour-Day-Month-Year (dH)
Date Comparator - Min-Hour-Day- Month-Year (dm)
Each Oracle Java CAPS Match Engine implementation is unique, typically requiring extensive data analysis to determine how to best configure the structure and matching logic of the master index application. The following topics provide an overview of the process of fine-tuning the matching logic in the match configuration file and fine-tuning the match and duplicate thresholds.
A thorough analysis of the data to be shared with the master index application is a must before beginning any implementation. This analysis not only defines the types of data to include in the object structure, but indicates the relative reliability of each system’s data, helps determine which fields to use for matching, and indicates the relative reliability of each match field.
To begin the analysis, the legacy data that will be converted into the master index database is extracted and analyzed. Once the initial analysis is complete, you can perform an iterative process to fine-tune the matching and duplicate thresholds and to determine the level of potential duplication in the existing data. If you plan to use the Data Profiler and Bulk Matcher tools generated by Oracle Java CAPS Match Engine to analyze data, review the information in Analyzing and Cleansing Data for a Master Index before you extract the legacy data.
Note - These tools are only available from service-enabled master index applications. The document referenced above describes what you need to do to generate the tools if you are using Oracle Java CAPS Match Engine (Repository).
There are three primary steps to customizing how records are matched in a master index application.
Before extracting data for analysis, review the types of data stored in the messages generated by each system. Use these messages to determine which fields and objects to include in the object structure of the master index application. From this object structure, select the fields to use for matching. When selecting these fields, keep in mind how representative each field is of a specific object. For example, in a master person index, the social security number field, first and last name fields, and birth date are good representations whereas marital status, suffix, and title are not. Certain address information or a home telephone number might also be considered. In a master company index, the match fields might include any of the fields parsed from the complete company name field, as well as a tax ID number or address and telephone information.
Once you determine the fields to use for matching, determine how the weights will be generated for each field. The primary tasks include determining whether to use probabilities or agreement weight ranges and then choosing the best comparison functions to use for each match field.
The first step in configuring the match configuration is to decide whether to use m-probabilities and u-probabilities or agreement and disagreement weight ranges. Both methods will give you similar results, but agreement and disagreement weight ranges allow you to specify the precise maximum and minimum weights that can be applied to each match field, giving you control over the value of the highest and lowest matching weights that can be assigned to each record.
For each field used for matching, define either the m-probabilities and u-probabilities or the agreement and disagreement weight ranges in the match configuration file. Review the information provided under Oracle Java CAPS Match Engine Matching Weight Formulation to help determine how to configure these values. Remember that a higher m-probability or agreement weight gives the field a higher weight when field values agree.
In order to find the initial values to set for the match and duplicate thresholds, you must determine the total range of matching weights that can be assigned to a record. This weight is the sum of all weights assigned to each match field. Running the Bulk Matcher in match analysis mode can help you determine the match and duplicate thresholds. For more information about this tool, see Performing a Match Analysis in Loading the Initial Data Set for a Master Index.
The way you determine weight ranges varies depending on whether you are using m and u-probabilities or agreement and disagreement weights.
For agreement and disagreement weight ranges, determining the match weight ranges is very straightforward. Simply total the maximum agreement weights for each field to determine the maximum match weight. Then total the minimum disagreement weights for each match field to determine the minimum match weight. Table 36 provides a sample agreement/disagreement configuration for matching on person data. As you can see, the range of match weights generated for a master index application with this configuration is from -36 to +38.
Table 36 Sample Agreement and Disagreement Weight Ranges
|
Determining the match weight ranges when using m-probabilities and u-probabilities is a little more complicated than using agreement and disagreement weights. To determine the maximum weight that will be generated for each field, use the following formula:
LOG2(m_prob/u_prob)
To determine the minimum match weight that will be generated for each field, use the following formula:
LOG2((1-m_prob)/(1-u_prob))
Table 37 below illustrates a sample of m-probabilities and u-probabilities, including the corresponding agreement and disagreement weights that are generated with each combination of probabilities. As you can see, the range of match weights generated for a master index application with this configuration is from -35.93 to +38
Table 37 Sample m-probabilities and u-probabilities
|
The match configuration file defines several match types for different types of fields. You can either modify existing rows in this file or create new rows that define custom matching logic. To determine which comparison functions to use, review the information provided in Match Configuration Comparison Functions for Oracle Java CAPS Match Engine (Repository). Choose the comparison functions that best suit how you want the match fields to be processed.
Weight thresholds tell the master index application how to process incoming records based on the matching probability weights generated by the Oracle Java CAPS Match Engine. Two parameters in the Threshold file provide the master index application with the information needed to determine if records should be flagged as potential duplicates, if records should be automatically matched, or if a record is not a potential match to any existing records.
Match Threshold - Specifies the weight at which two profiles are assumed to represent the same person and are automatically matched (this depends on the setting for the OneExactMatch parameter).
Duplicate Threshold - Specifies the minimum weight at which two profiles are considered potential duplicates of one another. The matching threshold indicates the maximum weight for potential duplicates.
Figure 1 illustrates the match and duplicate thresholds in comparison to total composite match weights.
Figure 1 Weight Thresholds
There are many techniques for determining the initial settings for the match and duplicate thresholds. This section discusses two methods.
The first method, the weight distribution method, is based on the calculation of the error rates of false matches and false non-matches from analyzing the distribution spectrum of all the weighted pairs. This is the standard method, and is illustrated in Figure 2. The second method, the percentage method relies on measuring the total maximum and minimum weights of all the matched fields and then specifying a certain percentage of these values as the initial thresholds.
The weight distribution method is more thorough and powerful but requires analyzing a large amount of data (match weights) to be statistically reliable. It does not apply well in cases where one candidate record is matched against very few reference records. The percentage method, though simple, is very reliable and precise when dealing with such situations. For both methods, defining the match threshold and the duplicate threshold is an iterative process.
Weight Distribution Method
Each record pair in the master index application can be classified into three categories: matches, non-matches, and potential matches. In general, the distribution of records is similar to the graph shown in Figure 2. Your goal is to make sure that very few records fall into the False Matches region (if any), and that as few as possible fall into the False Non-matches region. You can see how modifying the thresholds changes this distribution. Balance this against the number of records falling within the Manual Review section, as these will each need to be reviewed, researched, and resolved individually.
Figure 2 Weight Distribution Chart
Percentage Method
Using this method, you set the initial thresholds as a percentage of the maximum and minimum weights. Using the information provided under Weight Ranges Using Agreement Weightsor Weight Ranges Using Probabilities, determine the maximum and minimum values that can be generated for composite match weights. For the initial run, the match threshold is set intentionally high to catch only the most probable matches. The duplicate threshold is set intentionally low to catch a large set of possible matches.
Set the match threshold at 70% of the maximum composite weight starting from zero as the neutral value. Using the weight range samples in Table 37, this would be 70% of 38, or 26.6. Set the duplicate threshold near the neutral value (that is, the value in the center of the maximum and minimum weight range). The value could be set between 10% of the maximum weight and 10% of the minimum weight. Using the samples above, this would be between 3.8 (10% of 38) and -3.6 (10% of -36).
Achieving the correct thresholds for your implementation is an iterative process. First, using the initial thresholds, process the data extracts into the master index database. Then analyze the resulting assumed match and potential duplicates, paying close attention to the assumed match records with matching weights close to the match threshold, to potential duplicate records close to either threshold, and to non-matches near the duplicate threshold.
If you find that most or all of the assumed matches at the low end of the match range are not actually duplicate records, raise the match threshold accordingly. If, on the other hand, you find several potential duplicates at the high end of the duplicate range that are actual matches, decrease the match threshold accordingly. If you find that most or all of the potential duplicate records in the low end of the duplicate range should not be considered duplicate matches, consider raising the duplicate threshold. Conversely, if you find several non-matches with weight near the duplicate threshold that should be considered potential duplicates, lower the duplicate threshold.
Repeat the process of loading and analyzing data and adjusting the thresholds until you are satisfied with the results.