Understanding the Oracle Java CAPS Match Engine
About the Oracle Java CAPS Match Engine
Oracle Java CAPS Match Engine Overview
About the Oracle Java CAPS Match Engine Matching Algorithm
Oracle Java CAPS Match Engine Standardization and Matching Process
Oracle Java CAPS Match Engine Data Types
How the Oracle Java CAPS Match Engine Works
Oracle Java CAPS Match Engine Matching Weight Formulation
Oracle Java CAPS Match Engine Standardization Configuration
Oracle Java CAPS Match Engine Standardization File Types
Oracle Java CAPS Match Engine Internationalization
Oracle Java CAPS Match Engine Matching Configuration
The Oracle Java CAPS Match Engine Match Configuration File
Oracle Java CAPS Match Engine Match Configuration File Format
Match Configuration File Sample
Oracle Java CAPS Match Engine Matching Comparison Functions
Oracle Java CAPS Master Index and the Oracle Java CAPS Match Engine
Master Index Components and the Oracle Java CAPS Match Engine
Searching and Matching in Master Index Applications (Repository)
Standardization and Matching Process in Master Index Applications (Repository)
The Master Index Match String (Repository)
Oracle Java CAPS Match Engine Field Identifiers
Oracle Java CAPS Match Engine Match and Standardization Types
Oracle Java CAPS Match Engine Configuration File Modifications
Configuring the Master Index Matching Service (Repository)
Master Index Standardization Configuration (Repository)
Standardization Structures (Parsing and Normalization)
Master Index Match String Configuration (Repository)
Match and Standardization Engine Configuration
Master Index Phonetic Encoder Configuration (Repository)
Oracle Java CAPS Match Engine Person Data Type Configuration
Oracle Java CAPS Match Engine Person Matching Overview
Oracle Java CAPS Match Engine Person Data Processing Fields
Person Data Match String Fields
Person Data Standardized Fields
Oracle Java CAPS Match Engine Match Configuration for Person Data
Oracle Java CAPS Match Engine Person Data Standardization Files
Oracle Java CAPS Match Engine Common Standardization Files for Person Data
The Hyphenated Name Category File (personFirstNameDash.dat)
The Person Name Patterns File (personNamePatt.dat)
The Special Characters Reference File (personRemoveSpecChars.dat)
Oracle Java CAPS Match Engine Domain-Specific Standardization Files for Person Data
The Conjunction Reference File (personConjon*.dat)
The Person Constants File (personConstants*.cfg)
The First Name Category File (personFirstName*.dat)
The Generational Suffix Category File (personGenSuffix*.dat)
Last Name Prefix Category File (personLastNamePrefix*.dat)
The Last Name Category File (personLastName*.dat)
The Occupational Suffix Category File (personOccupSuffix*.dat)
The Three-Character Suffix File (personThree*.dat)
The Title Category File (personTitle*.dat)
The Two-Character Suffix File (personTwo*.dat)
The Business-Related Category File (businessOrRelated*.dat)
Configuring the Oracle Java CAPS Match Engine Standardization Files for Person Data
Configuring the Master Index Matching Service for Person Data (Repository)
Configuring the Standardization Structure for Person Data (Repository)
Person Data Normalization Structures
Configuring the Match String for Person Data (Repository)
Oracle Java CAPS Match Engine Address Data Type Configuration
Oracle Java CAPS Match Engine Address Matching Overview
Oracle Java CAPS Match Engine Address Data Processing Fields
Address Data Match String Fields
Address Data Standardized Fields
Match Configuration for Address Data (Repository)
Oracle Java CAPS Match Engine Standardization Configuration for Address Data
The Address Constants File (addressConstants*.cfg)
The Address Clues File (addressClueAbbrev*.dat)
The Address Internal Constants File (addressInternalConstants*.cfg)
The Address Master Clues File (addressMasterClues*.dat)
The Address Patterns File (addressPatterns*.dat)
The Address Output Patterns File (addressOutPatterns*.dat)
Address Pattern File Components
Modifying Oracle Java CAPS Match Engine Address Data Configuration Files
Configuring the Matching Service for Address Data (Repository)
Configuring the Standardization Structure for Address Data (Repository)
Address Standardization Structures
Configuring the Match String for Address Data (Repository)
Oracle Java CAPS Match Engine Business Names Data Type Configuration
Oracle Java CAPS Match Engine Business Name Matching Overview
Oracle Java CAPS Match Engine Business Name Processing Fields
Business Name Match String Fields
Business Name Standardized Fields
Business Name Object Structure
Oracle Java CAPS Match Engine Match Configuration for Business Names
Oracle Java CAPS Match Engine Standardization Configuration for Business Names
The Business Constants File (bizConstants.cfg)
The Adjectives Key Type File (bizAdjectivesTypeKeys.dat)
The Alias Key Type File (bizAliasTypeKeys.dat)
The Association Key Type File (bizAssociationTypeKeys.dat)
The General Terms Reference File (bizBusinessGeneralTerms.dat)
The City or State Key Type File (bizCityorStateTypeKeys.dat)
The Business Former Name Reference File (bizCompanyFormerNames.dat)
The Merged Business Name Category File (bizCompanyMergerNames.dat)
The Primary Business Name Reference File (bizCompanyPrimaryNames.dat)
The Connector Tokens Reference File (bizConnectorTokens.dat)
The Country Key Type File (bizCountryTypeKeys.dat)
The Industry Sector Reference File (bizIndustryCategoryCode.dat)
The Industry Key Type File (bizIndustryTypeKeys.dat)
The Organization Key Type File (bizOrganizationTypeKeys.dat)
The Business Patterns File (bizPatterns.dat)
The Special Characters Reference File (bizRemoveSpecChars.dat)
Modifying Oracle Java CAPS Match Engine Business Name Configuration Files
Configuring the Matching Service for Business Names (Repository)
Configuring the Standardization Structure for Business Names (Repository)
Business Name Standardization Structures
Business Name Phonetic Encoding
Configuring the Match String for Business Names (Repository)
Fine-Tuning Weights and Thresholds for Oracle Java CAPS Match Engine (Repository)
Customizing the Match Configuration and Thresholds
Customizing the Match Configuration
Probabilities or Agreement Weights
Weight Ranges Using Agreement Weights
Weight Ranges Using Probabilities
Determining the Weight Thresholds
Specifying the Weight Thresholds
Match Configuration Comparison Functions for Oracle Java CAPS Match Engine (Repository)
Oracle Java CAPS Match Engine Comparison Functions
Advanced Bigram String Comparator (b2)
Uncertainty String Comparators
Advanced Generic String Comparator (ua)
Simplified String Comparator (us)
Simplified String Comparator - FirstName (uf)
Simplified String Comparator - LastName (ul)
Simplified String Comparator - House Numbers (un)
Language-specific String Comparator (usu)
Exact char-by-char Comparator (c)
Date Comparator - Year only (dY)
Date Comparator - Month-Year (dM)
Date Comparator - Day-Month-Year (dD)
Date Comparator - Hour-Day-Month-Year (dH)
Date Comparator - Min-Hour-Day-Month-Year (dm)
The Oracle Java CAPS Match Engine is the standard match engine designed to work with the master index applications created by Oracle Java CAPS Master Index (Repository). It is highly configurable in the Oracle Java CAPS Master Index environment and can be used to match on various types of data.
The Oracle Java CAPS Match Engine provides data parsing, data standardization, phonetic encoding, and record matching capabilities for master index applications created by Oracle Java CAPS Master Index. Before records can be compared to evaluate the possibility of a match, the data contained in those records must be standardized and, in certain cases, phonetically encoded or parsed. Once the data is conditioned, the match engine determines a match weight for each field defined for matching. The match weight is based on your configuration of the match engine and the fields on which matching is performed. The composite weight (the sum of the weights generated for all match fields in the records) indicates how closely two records match.
The following topics provide information about the configurable components of the match engine and how the Oracle Java CAPS Match Engine standardizes and matches data.
The Oracle Java CAPS Match Engine compares records containing similar data types by calculating how closely the records match. The resulting comparison weight is either a positive or negative numeric value that represents the degree to which the two sets of data are similar. The match engine relies on probabilistic algorithms to compare data of a given type using a comparison function specific to the type of data being compared. The comparison functions for each matching field are defined in a match configuration file that you can customize for the type of data you are indexing. The formula used to determine the matching weight is based on either matching and unmatching probabilities or on agreement and disagreement weight ranges.
The Oracle Java CAPS Match Engine is also designed to standardize free-form text fields, such as street address fields or business names. This allows the match engine to generate a more accurate weight for free-form data.
The Oracle Java CAPS Match Engine matching algorithm uses a proven methodology to process and weight records in the master index database. By providing both standardization and matching capabilities, the match engine allows you to condition data prior to matching. You can also use these capabilities to review legacy data prior to loading it into the database. This review helps you determine data anomalies, invalid or default values, and missing fields.
Both matching and standardization occur when two records are analyzed for the probability of a match. Before matching, certain fields are normalized, parsed, or converted into their phonetic values if necessary. The match fields are then analyzed and weighted according to the rules defined in a match configuration file. The weights for each field are combined to determine the overall matching weight for the two records. After the match engine has performed these steps, survivorship is determined by the master index application, based on how the overall matching weight compares to the duplicate and match thresholds of the master index application. These thresholds are configured for the Manager Service in the Threshold file.
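The last step of this process, comparing the overall weight to the two thresholds, can be sketched as follows. This is a minimal illustration rather than the master index implementation: the names `Thresholds`, `composite_weight`, and `classify`, the field weights, and the threshold values are all hypothetical, and the use of inclusive comparisons (`>=`) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the two thresholds described above."""
    duplicate: float  # at or above this weight, records are potential duplicates
    match: float      # at or above this weight, records are treated as a match


def composite_weight(field_weights):
    """Combine per-field weights into the overall matching weight (a plain sum)."""
    return sum(field_weights.values())


def classify(weight, thresholds):
    """Compare an overall matching weight to the thresholds.

    Assumption: comparisons are inclusive; the real application may differ.
    """
    if weight >= thresholds.match:
        return "match"
    if weight >= thresholds.duplicate:
        return "potential duplicate"
    return "no match"


# Illustrative values only; real weights come from the match engine.
weights = {"FirstName": 8.0, "LastName": 9.5, "SSN": 10.0, "Gender": -2.0}
limits = Thresholds(duplicate=7.25, match=29.0)
result = classify(composite_weight(weights), limits)
```

With these sample numbers the overall weight falls between the two thresholds, so the pair would be flagged for review rather than matched automatically.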
You can standardize and match on different types of data with the Oracle Java CAPS Match Engine. In its default implementation with a master index application, the match engine supports data standardization and matching on the three primary types of data listed below.
Person Information (described in Oracle Java CAPS Match Engine Person Data Type Configuration)
Street Addresses (described in Oracle Java CAPS Match Engine Address Data Type Configuration)
Business Names (described in Oracle Java CAPS Match Engine Business Names Data Type Configuration)
In addition, the Oracle Java CAPS Match Engine provides comparison functions for matching on various types of fields contained within the primary data types, such as numbers, dates, Social Security Numbers, single characters, and so on.
When processing person information, the match engine assumes that each match field is stored in a separate field. For street address and business name processing, the match engine can parse free-form text fields for searching and matching. Each data type requires specific customization to the Match Field file in the master index project.
The Oracle Java CAPS Match Engine compares two records and returns a match weight indicating the likelihood of a match between the two records. The three primary components of the Oracle Java CAPS Match Engine are the configuration files, the standardization engine, and the match engine.
Configuration Files - The Oracle Java CAPS Match Engine includes several sets of files that define standardization and matching logic for all supported data types. One set of standardization files is common to all national domains, and an additional set is provided for each of the following national domains: Australia, France, Great Britain, and the United States. You can customize these files to adapt the standardization and matching logic to your specific needs. The matching configuration file defines the configuration of the matching comparator functions.
Standardization Engine - Standardization involves converting nonstandard data into a standardized form for more accurate and efficient processing. Standardization consists of any one or more of the following actions:
Parsing - Separating a free-text field into its individual components, such as street address information or a business name.
Normalization - Changing the value of a field to a standard version, such as changing a nickname to a common name.
Phonetic Encoding - Changing the value of a field to its phonetic version. The field to be converted can be the original field, a parsed field, a normalized field, or a parsed and normalized field.
Using the person data type, for example, first names such as “Bill” and “Will” are normalized to “William”, which is then phonetically converted. Using the street address data type, street addresses are parsed into their component parts, such as house numbers, street names, and so on. The street name is then phonetically converted. Standardization logic is defined in the standardization engine configuration files and in the StandardizationConfig section of the Match Field file, and is performed prior to assigning match weights.
Match Engine - Matching involves comparing two standardized records and returning a weight that indicates the likelihood of a match between the two records. A higher weight indicates a greater likelihood of a match. Matching criteria and logic are defined in the match engine configuration file. The data fields that are sent to the Oracle Java CAPS Match Engine for matching, known as the match string, are defined in the MatchingConfig section of the Match Field file. The match engine configuration files define how the match string is standardized and which matching rules to use to process each match field.
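The three standardization actions described above (parsing, normalization, and phonetic encoding) can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's logic: the nickname table, the address pattern, and the function names are hypothetical, and classic Soundex is used only as a stand-in phonetic encoder, since the encoder actually applied depends on the master index configuration.

```python
import re

# Hypothetical nickname table; the real engine reads such mappings from its
# standardization files.
NICKNAMES = {"BILL": "WILLIAM", "WILL": "WILLIAM"}

# Classic Soundex letter groups.
SOUNDEX_CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit

# Very simplified address pattern: house number, street name, street type.
ADDRESS_RE = re.compile(
    r"^\s*(?P<house_number>\d+)\s+(?P<street_name>.+?)\s+(?P<street_type>\w+)\.?\s*$"
)


def normalize_first_name(name):
    """Normalization: map a nickname to its common form (upper-cased)."""
    key = name.strip().upper()
    return NICKNAMES.get(key, key)


def soundex(name):
    """Phonetic encoding: classic four-character Soundex code."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (first + "".join(digits) + "000")[:4]


def parse_street_address(text):
    """Parsing: split a free-form street address into component fields."""
    match = ADDRESS_RE.match(text)
    return match.groupdict() if match else None
```

For example, `soundex(normalize_first_name("Bill"))` and `soundex(normalize_first_name("Will"))` produce the same code because both names normalize to the same value before encoding, and `parse_street_address("12 North Main St")` separates the house number, street name, and street type.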
The Oracle Java CAPS Match Engine determines the matching weight between two records by comparing the match string fields of the two records, using the rules defined in the match configuration file and taking into account the matching logic specified for each field. The Oracle Java CAPS Match Engine can use either matching (m) and unmatching (u) conditional probabilities or agreement and disagreement weight ranges to fine-tune the match process. From these values it calculates a match weight for each match string field, which indicates how closely the two values of that field match. The weights assigned to each field are then summed to produce a total composite matching weight between the two records. Agreement and disagreement weight ranges or m-probabilities and u-probabilities are defined in the match configuration file.
The following topics describe probabilities and weights.
When matching and unmatching conditional probabilities are used, the match engine uses a logarithmic formula to determine agreement and disagreement weights between fields. The m-probabilities and u-probabilities you specify determine the maximum agreement weight and minimum disagreement weight for each field, and so define the agreement and disagreement weight ranges for each field and for the entire record. These probabilities allow you to specify which fields provide the most reliable matching information and which provide the least. For example, in person matching, the gender field is not as reliable as the SSN field for determining a match since a person’s SSN is more specific. Therefore, the SSN field should have a higher m-probability than the gender field. The more reliable the field, the greater the m-probability for that field should be.
If a field matches between two records, an agreement weight, determined by the logarithmic formula using the m-probability and u-probability, is added to the composite match weight for the record. If the fields disagree, a disagreement weight reduces the composite match weight. m-probabilities and u-probabilities are expressed as double values between zero and one (excluding zero and one) and can have up to 16 decimal places.
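The text does not give the logarithmic formula itself. The sketch below assumes the standard probabilistic record-linkage (Fellegi-Sunter) form, in which the agreement weight is log2(m/u) and the disagreement weight is log2((1-m)/(1-u)); the engine's actual formula may differ. The function names and the sample probabilities are hypothetical.

```python
import math


def _check_probability(name, value):
    """Probabilities must lie strictly between zero and one."""
    if not 0.0 < value < 1.0:
        raise ValueError(f"{name} must be between 0 and 1 (exclusive), got {value}")


def agreement_weight(m, u):
    """Assumed form: log2(m / u). Positive when m > u."""
    _check_probability("m", m)
    _check_probability("u", u)
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Assumed form: log2((1 - m) / (1 - u)). Negative when m > u."""
    _check_probability("m", m)
    _check_probability("u", u)
    return math.log2((1.0 - m) / (1.0 - u))


# Illustrative probabilities only: SSN is treated as far more reliable than gender.
ssn_range = (disagreement_weight(0.99, 0.001), agreement_weight(0.99, 0.001))
gender_range = (disagreement_weight(0.9, 0.5), agreement_weight(0.9, 0.5))
```

Under this assumed formula, the more reliable field (here SSN) ends up with both a larger agreement weight and a more negative disagreement weight, so it moves the composite weight further in either direction than the gender field does.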
Defining agreement and disagreement weight ranges is a more direct way to implement m-probabilities and u-probabilities. Like probabilities, the maximum agreement and minimum disagreement weights you define for each field allow you to define the relative reliability of each field; however, the match weight has a more linear relationship with the numbers you specify. When you use agreement and disagreement weight ranges to determine the match weight, you define for each field a maximum weight that applies when the two values are in complete agreement and a minimum weight that applies when they are in complete disagreement. The Oracle Java CAPS Match Engine assigns a matching weight to each field that falls between the agreement and disagreement weights specified for the field. This provides a more convenient and intuitive representation of conditional probabilities.
Using the SSN and gender field example above, the SSN field would be assigned a higher maximum agreement weight and a lower minimum disagreement weight than the gender field because it is more reliable. If you assign a maximum agreement weight of “10” and two SSNs match, the match weight for that field is “10”. If you assign a minimum disagreement weight of “-10” and two SSNs are in complete disagreement, the match weight for that field is “-10”. Agreement and disagreement weights are expressed as double values and can have up to 16 decimal places.
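One way to picture how a field weight falls inside the configured range is linear interpolation. The sketch below assumes that a comparison function yields a similarity score between 0 (complete disagreement) and 1 (complete agreement) and that the weight scales linearly between the two limits. Both are assumptions made for illustration; the actual comparison functions may map similarity to weight differently. The function name `field_weight` is hypothetical.

```python
def field_weight(similarity, max_agreement, min_disagreement):
    """Map a similarity score in [0, 1] onto the configured weight range.

    Assumption: linear interpolation between the minimum disagreement weight
    (similarity 0) and the maximum agreement weight (similarity 1).
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    if max_agreement < min_disagreement:
        raise ValueError("max_agreement must not be below min_disagreement")
    return min_disagreement + similarity * (max_agreement - min_disagreement)


# SSN example from the text: weights of 10 and -10 at the two extremes.
ssn_exact = field_weight(1.0, max_agreement=10.0, min_disagreement=-10.0)
ssn_different = field_weight(0.0, max_agreement=10.0, min_disagreement=-10.0)
ssn_partial = field_weight(0.5, max_agreement=10.0, min_disagreement=-10.0)
```

The two extremes reproduce the SSN example: complete agreement yields the maximum agreement weight and complete disagreement yields the minimum disagreement weight, with partial similarity landing in between.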