Understanding the Oracle Java CAPS Match Engine
Oracle Java CAPS Match Engine Standardization Configuration
The standardization configuration files define additional logic used by the Oracle Java CAPS Match Engine to standardize specific data types. This logic helps define how fields in incoming records are parsed, standardized, and classified for processing. Standardization files include data patterns files, category files, clues files, key type tables, constants files, and reference files.
The standardization configuration files are stored in the master index project and appear as nodes under the Standardization Engine node of the project. Several standardization files are common to all implementations of the Oracle Java CAPS Match Engine, and each national domain also uses a small set of files unique to that domain. The common files are listed directly under the Standardization Engine node of the master index project; the files unique to each national domain are listed in individual sub-folders under the Standardization Engine node.
The standardization configuration files for the Oracle Java CAPS Match Engine must follow certain rules for formatting and interdependencies. The following topics provide an overview of the types of configuration files provided for standardization.
Several different types of configuration files are included with the Oracle Java CAPS Match Engine, each providing specific information to help the engine standardize and match data according to requirements. Several of these files are common to all supported nationalities, but a small subset is specific to each.
Category Files - The Oracle Java CAPS Match Engine uses category files when processing person or business names. These files list common values for certain types of data, such as titles, suffixes, and nicknames for person names or industries and organizations for business names. Category files also define standardized versions of each term or classify the terms into different categories, and some files perform both functions. For address data, the equivalent files are called clues files.
Clues Files - The Oracle Java CAPS Match Engine uses clues files when processing address data types. These files list general terms used in street address fields, define standardized versions of each term, and classify the terms into various component types using predefined address tokens. These files are used by the standardization engine to determine how to parse a street address into its various components. Clues files provide clues in the form of tokens to help the engine recognize the component type of certain values in the input fields.
Constants Files - The Oracle Java CAPS Match Engine refers to constants files for information about the standardization files, such as the maximum length of the files. For the address data type, the constants file also describes input and output field lengths.
Patterns Files - The patterns files specify how incoming data should be interpreted for standardization based on the format, or pattern, of the data. These files are used only for processing data contained in free-form text fields that must be parsed prior to matching (such as street address fields or business names). Patterns files list possible input data patterns, which are encoded in the form of tokens. Each token signifies a specific component of the free-form text field. For example, in a street address field, the house number is identified by one token, the street name by another, and so on. Patterns files also define the format of the output fields for each input pattern.
Key Type Files - For business name processing, the Oracle Java CAPS Match Engine refers to a number of key type files for processing information. These files generally define standard versions of terms commonly found in business names and some classify these terms into various components or industries. These files are used by the standardization engine to determine how to parse a business name into its different components and to recognize the component type of certain values in the input fields.
Reference Files - Reference files define general terms that appear in input fields for each data type. Some reference files define terms to ignore and some define terms that indicate the business name is continuing. For example, in business name processing “and” is defined as a joining term. This helps the standardization engine to recognize that the primary business name in “Martin and Sons, Inc.” is “Martin and Sons” instead of just “Martin”. Reference files can also define characters to be ignored by the standardization engine.
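The way clues files and patterns files work together can be illustrated with a minimal sketch. It is not the engine's implementation and does not reproduce the format or contents of any shipped file: the token codes (`NU`, `AU`, `TY`, `DR`), the table entries, the output field names, and the functions `tokenize` and `standardize` are all hypothetical, chosen only to show the idea of mapping terms to tokens and then mapping a token pattern to output fields.

```python
# Hypothetical clue table: input term -> (standardized form, token).
# Entries and token codes are invented for illustration only.
CLUES = {
    "STREET": ("St", "TY"),
    "ST": ("St", "TY"),
    "AVENUE": ("Ave", "TY"),
    "AVE": ("Ave", "TY"),
    "NORTH": ("N", "DR"),
    "N": ("N", "DR"),
}

# Hypothetical patterns table: input token pattern -> output field per token.
PATTERNS = {
    ("NU", "AU", "TY"): ("HouseNumber", "StreetName", "StreetType"),
    ("NU", "DR", "AU", "TY"): (
        "HouseNumber", "StreetDirection", "StreetName", "StreetType",
    ),
}


def tokenize(address):
    """Return a list of (value, token) pairs for a free-form address."""
    pairs = []
    for word in address.replace(",", " ").replace(".", " ").split():
        key = word.upper()
        if key in CLUES:
            # Known term: use its standardized form and its clue token.
            pairs.append(CLUES[key])
        elif word.isdigit():
            pairs.append((word, "NU"))  # purely numeric value
        else:
            pairs.append((word, "AU"))  # unrecognized alphabetic value
    return pairs


def standardize(address):
    """Parse an address into named fields, or None if no pattern matches."""
    pairs = tokenize(address)
    pattern = tuple(token for _, token in pairs)
    fields = PATTERNS.get(pattern)
    if fields is None:
        return None
    return {field: value for field, (value, _) in zip(fields, pairs)}


print(standardize("123 North Main Street"))
print(standardize("45 Elm Ave"))
```

In this sketch an address whose token pattern is not listed (for example, a street name with two words) simply returns `None`; a real patterns file would need to cover many more input patterns.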
By default, the Oracle Java CAPS Match Engine supports addresses and names originating from Australia, France, Great Britain, and the United States. Each national domain uses a set of common standardization files and a smaller set of unique, domain-specific files to account for international differences in address formats, names, and so on. You can process your data using the standardization files for a single domain, or you can use multiple domains, depending on how the Match Field file is configured.