Understanding the Oracle Java CAPS Match Engine
Processing business name fields involves parsing, normalizing, and phonetically encoding certain fields prior to matching. The following topics describe the configuration files that define business name processing logic and provide instructions for modifying the Match Field file for processing business names.
Oracle Java CAPS Match Engine Business Name Matching Overview
Oracle Java CAPS Match Engine Match Configuration for Business Names
Oracle Java CAPS Match Engine Standardization Configuration for Business Names
Modifying Oracle Java CAPS Match Engine Business Name Configuration Files
Configuring the Matching Service for Business Names (Repository)
Matching on the business name data type includes standardizing and matching on free-form business name fields. You can implement business name standardization and matching on its own or within a master index application designed to process person information. For example, standardizing business name fields allows you to include these fields as search criteria, even though matching might not be performed against these fields.
The Oracle Java CAPS Match Engine can create standardized and phonetic values for business names. Several configuration files designed specifically for business names define additional logic for the standardization and phonetic encoding process. These include reference files, a patterns file, and key type files. The Oracle Java CAPS Match Engine can match on any field as long as the match type for the field is defined in the match configuration file (matchConfigFile.cfg). The business name standardization files are common to all national domains, so no domain-specific configuration is required.
For more information about the fields involved in business name standardization and matching, see Oracle Java CAPS Match Engine Business Name Processing Fields.
When matching on free-form business names, not all fields in a record need to be processed by the Oracle Java CAPS Match Engine. The match engine only needs to process fields that must be parsed, normalized, or phonetically converted, and the fields against which matching is performed. These fields are defined in the Match Field file, and processing logic for each field is defined in the standardization and matching configuration files.
The match string processed by the Oracle Java CAPS Match Engine is defined by the match fields specified in the Match Field file. If you specify a “BusinessName” match type for any field in the wizard, most of the parsed business name fields are automatically added to the match string in the Match Field file, including the name, organization type, association type, sector, industry, and URL. You can remove any of these fields from the match string.
The match engine can process any combination of fields you specify for matching. By default, the match configuration file (matchConfigFile.cfg) includes rows specifically for matching on the fields that are parsed from the business name fields. The file also defines several generic match types. You can use any of the existing rows for matching or you can add rows for the fields you want to match.
The Oracle Java CAPS Match Engine expects that business name data will be provided in a free-form text field containing several components that must be parsed. The match engine is designed to parse these components, and to normalize and phonetically encode the business name. You can specify additional fields for phonetic encoding.
If you specify the “BusinessName” match type for any field in the wizard, a standardization structure for that field is defined in the Match Field file. The fields defined as the target fields are listed in the next section, Business Name Object Structure.
For the default configuration of the business name data type, the business name fields specified for standardization are parsed into several additional fields, one of which is also normalized. If you specify the appropriate match type in the wizard, the following fields are automatically added to the object structure and database creation script.
field_name_Name
field_name_NamePhon
field_name_OrgType
field_name_AssocType
field_name_Industry
field_name_Sector
field_name_Alias
field_name_Url
where field_name is the name of the field for which you specified business name matching. For example, if you specify the BusinessName match type for the Company field, the fields automatically added to the structure include Company_Name, Company_NamePhon, Company_OrgType, and so on.
You can add these fields manually if you do not specify a match type in the wizard.
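The naming convention described above can be sketched in a few lines of Python. This is an illustration only; `derived_field_names` is a hypothetical helper and is not part of the product.

```python
# Suffixes of the fields that are added for a field that uses the
# BusinessName match type, taken from the list above.
BUSINESS_NAME_SUFFIXES = [
    "Name", "NamePhon", "OrgType", "AssocType",
    "Industry", "Sector", "Alias", "Url",
]


def derived_field_names(field_name):
    """Return the names of the fields derived from field_name (hypothetical helper)."""
    return [f"{field_name}_{suffix}" for suffix in BUSINESS_NAME_SUFFIXES]


print(derived_field_names("Company"))
```

If you add the fields manually, a list generated this way can serve as a checklist against the object structure.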
The default match configuration file, matchConfigFile.cfg, defines several match types for the kinds of business name data typically included in the match string. You can customize the existing match types or create new match types for the data being processed. The following match types are typical for matching on business names.
|
In addition, you can use any of these generic match types for matching on business names.
|
This file appears under the Match Engine node of the master index project. For more information about the comparison functions used for each match type and how the weights are tuned, see Customizing the Match Configuration and Match Configuration Comparison Functions for Oracle Java CAPS Match Engine (Repository).
Several configuration files are used to define business name processing logic for the Oracle Java CAPS Match Engine. You can customize any of the configuration files described in this section to fit your data processing and standardization requirements. These files appear under the Standardization Engine node of the master index project.
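The following topics describe each file used for business name standardization:
The Business Constants File (bizConstants.cfg)
The Adjectives Key Type File (bizAdjectivesTypeKeys.dat)
The Alias Key Type File (bizAliasTypeKeys.dat)
The Association Key Type File (bizAssociationTypeKeys.dat)
The General Terms Reference File (bizBusinessGeneralTerms.dat)
The City or State Key Type File (bizCityorStateTypeKeys.dat)
The Business Former Name Reference File (bizCompanyFormerNames.dat)
The Merged Business Name Category File (bizCompanyMergerNames.dat)
The Primary Business Name Reference File (bizCompanyPrimaryNames.dat)
The Connector Tokens Reference File (bizConnectorTokens.dat)
The Country Key Type File (bizCountryTypeKeys.dat)
The Industry Sector Reference File (bizIndustryCategoryCode.dat)
The Industry Key Type File (bizIndustryTypeKeys.dat)
The Organization Key Type File (bizOrganizationTypeKeys.dat)
The Business Patterns File (bizPatterns.dat)
The Special Characters Reference File (bizRemoveSpecChars.dat)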
The business constants file defines certain information about the standardization files used for processing business data, primarily the number of lines contained in each file. The number of lines specified must be equal to or greater than the number of lines actually contained in each file.
Table 22 lists and describes each parameter in the constants file. The files referenced by these parameters are described on the following pages.
Table 22 Business Constants File Parameters
|
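Because each declared line count must be equal to or greater than the actual number of lines in the corresponding file, a quick check after editing can catch a stale value. The following Python sketch shows one way to do this. The `declared` mapping is a hypothetical stand-in for the values read from bizConstants.cfg, and its counts are invented for illustration.

```python
def check_declared_line_counts(declared, file_lines):
    """Return the names of files whose declared line count is too small.

    declared:   hypothetical mapping of file name -> count stated in bizConstants.cfg
    file_lines: mapping of file name -> list of lines actually in that file
    """
    return [
        name
        for name, lines in file_lines.items()
        if declared.get(name, 0) < len(lines)
    ]


# Illustrative values only; the declared counts are invented for this example.
declared = {"bizConnectorTokens.dat": 9, "bizRemoveSpecChars.dat": 5}
actual = {
    "bizConnectorTokens.dat": ["AN", "DE", "DES", "DOS", "LA", "LAS", "LE", "OF", "THE"],
    "bizRemoveSpecChars.dat": ["[", "]", "{", "}", "<", ">", "/", "?"],
}
print(check_declared_line_counts(declared, actual))
```

Here only bizRemoveSpecChars.dat is reported, because it holds eight lines while the declared count is five.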
The adjectives key type file defines adjectives commonly found in business names so the Oracle Java CAPS Match Engine can recognize and process these values as a part of the business name. This file contains one column with a list of commonly used adjectives, such as General, Financial, Central, and so on.
You can modify or add entries in this file as needed. Following is an excerpt from the bizAdjectivesTypeKeys.dat file.
DIGITAL
DIRECTED
DIVERSIFIED
EDUCATIONAL
ELECTROCHEMICAL
ENGINEERED
EVOLUTIONARY
EXTENDED
FACTUAL
FEDERAL
The alias key type file lists business name acronyms and abbreviations along with their standardized names so the Oracle Java CAPS Match Engine can recognize and process these values correctly. You can add entries to the alias key type file using the following syntax.
alias standardized-name
Table 23 describes the columns in the bizAliasTypeKeys.dat file.
Table 23 Alias Key Type File
|
Following is an excerpt from the bizAliasTypeKeys.dat file.
BBH BARTLE BOGLE HEGARTY
BBH BROWN BROTHERS HARRIMAN
IBM INTERNATIONAL BUSINESS MACHINE
IDS INCOMES DATA SERVICES
IDS INSURANCE DATA SERVICES
IDS THE INTEGRATED DECISION SUPPORT GROUP
IDS THE INTERNET DATABASE SERVICE
CAL-TECH CALIFORNIA INSTITUTE OF TECHNOLOGY
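The excerpt shows that one alias can map to more than one standardized name (BBH and IDS, for example). The following Python sketch shows one way to load such entries for inspection. It assumes that the alias is the first whitespace-delimited token on each line, which holds for the excerpt but is not confirmed for the whole file. `parse_alias_lines` is a hypothetical helper.

```python
from collections import defaultdict


def parse_alias_lines(lines):
    """Map each alias to the list of standardized names it can stand for.

    Assumption: the alias is the first whitespace-delimited token on the line.
    """
    aliases = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2:
            alias, name = parts
            aliases[alias].append(name.strip())
    return dict(aliases)


sample = [
    "BBH BARTLE BOGLE HEGARTY",
    "BBH BROWN BROTHERS HARRIMAN",
    "IBM INTERNATIONAL BUSINESS MACHINE",
    "CAL-TECH CALIFORNIA INSTITUTE OF TECHNOLOGY",
]
print(parse_alias_lines(sample)["BBH"])
```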
The association key type file lists business association types along with their standardized names so the Oracle Java CAPS Match Engine can recognize and process these values correctly. You can add entries to the association key type file using the following syntax.
association-type standardized-type
Table 24 describes the columns in the bizAssociationTypeKeys.dat file.
Table 24 Association Type Key Table
|
Following is an excerpt from the bizAssociationTypeKeys.dat file.
ASSOCIATES 0
BANCORP 0
BANCORPORATION BANCORP
COMPANIES 0
GP GROUP
GROUP 0
PARTNERS 0
The general terms reference file lists terms commonly used in business names. This file is used to identify terms that indicate a business, such as bank, supply, factory, and so on, so the Oracle Java CAPS Match Engine can recognize and process the business name.
This file contains one column that lists common terms in the business names you process. You can add entries as needed. Below is an excerpt from the bizBusinessGeneralTerms.dat file.
BUILDING
CITY
CONSUMER
EAST
EYE
FACTORY
LATIN
NORTH
SOUTH
The city or state key type file lists various cities and states that might be used in business names. It also classifies each entry as a city (CT) or state (ST) and indicates the country in which the city or state is located. This enables the Oracle Java CAPS Match Engine to recognize and process these values correctly. You can add entries to the city or state key type file using the following syntax.
city-or-state type country
Table 25 describes the columns in the bizCityorStateTypeKeys.dat file.
Table 25 City or State Key Type File
|
Following is an excerpt from the bizCityorStateTypeKeys.dat file.
ADELAIDE CT AU
ALABAMA ST US
ALASKA ST US
ALGIERS CT DZ
AMSTERDAM CT NL
ARIZONA ST US
ARKANSAS ST US
ASUNCION CT PY
ATHENS CT GR
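As an illustration of the three-column syntax, the following Python sketch splits an entry into its name, type, and country. It assumes that the last two whitespace-delimited tokens are always the type and the country code, so that a multi-word name stays intact. `Place` and `parse_place_line` are hypothetical names, and the NEW YORK entry is invented to show a multi-word name.

```python
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    kind: str      # "CT" for a city, "ST" for a state
    country: str


def parse_place_line(line):
    """Split 'city-or-state type country', taking the last two tokens as type and country."""
    name, kind, country = line.rsplit(None, 2)
    if kind not in ("CT", "ST"):
        raise ValueError(f"unexpected type {kind!r} in line {line!r}")
    return Place(name, kind, country)


print(parse_place_line("ADELAIDE CT AU"))
print(parse_place_line("NEW YORK ST US"))  # invented entry, not from the excerpt
```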
The business former name reference file provides a list of common company names along with the names by which the companies were formerly known, so the Oracle Java CAPS Match Engine can recognize a business when processing a record that contains a previous business name. You can add entries to the business former name file using the following syntax.
former-name current-name
Table 26 describes each column in the bizCompanyFormerNames.dat file.
Table 26 Business Former Name Reference File
|
Below is an excerpt from the bizCompanyFormerNames.dat file.
HELLENIC BOTTLING          COCA-COLA HBC
INTERNATIONAL PRODUCTS     THE TERLATO WINE
ORGANIC FOOD PRODUCTS      SPECTRUM ORGANIC PRODUCTS
SUTTER HOME WINERY         TRINCHERO FAMILY ESTATES
The merged business name category file provides a list of companies whose names changed because of a merger, along with the name of each company after the merger. It also classifies the business names into industry sectors and sub-sectors. This enables the Oracle Java CAPS Match Engine to recognize the current company name and determine the sector of the business. You can add entries to the merged business name file using the following syntax.
former-name/merged-name sector-code
Table 27 describes each column in the bizCompanyMergerNames.dat file.
Table 27 Business Merger Name Category File
|
Below is an excerpt from the bizCompanyMergerNames.dat file.
DUKE/FLUOR DANIEL 20005
FAULTLESS STARCH/BON AMI 09004
FIND/SVP 10013
FIRST WAVE/NEWPARK SHIPBUILDING 27005
GUNDLE/SLT 19020
HMG/COURTLAND 23004
J BROWN/LMC 10014
KORN/FERRY 10020
LINSCO/PRIVATE LEDGER 14005
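The former-name/merged-name sector-code syntax can be read back with a short Python sketch such as the one below. It assumes that the sector code is the last whitespace-delimited token and that the first slash separates the two names. `parse_merger_line` is a hypothetical helper, not a product API.

```python
def parse_merger_line(line):
    """Split 'former-name/merged-name sector-code' into its three parts.

    Assumptions: the sector code is the last token on the line, and the
    first '/' separates the former name from the merged name.
    """
    names, sector_code = line.rsplit(None, 1)
    former, merged = names.split("/", 1)
    return former.strip(), merged.strip(), sector_code


print(parse_merger_line("FIRST WAVE/NEWPARK SHIPBUILDING 27005"))
```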
The primary business name reference file provides a list of companies by their primary name. It also classifies the business names into industry sectors and sub-sectors. This enables the Oracle Java CAPS Match Engine to determine the correct value of the sector field when parsing the business name. You can add entries to the primary business name file using the following syntax.
primary-name sector-code
Table 28 describes the columns in the bizCompanyPrimaryNames.dat file.
Table 28 Business Primary Name Reference File
|
Below is an excerpt from the bizCompanyPrimaryNames.dat file.
BROTHER INTERNATIONAL 12006
BRYSTOL-MYERS SQUIBB 11005
BURLINGTON COAT FACTORY 24003
BURLINGTON NORTHERN SANTA FE 27005
BV SOLUTIONS 06012
CABLEVISION 26001
CABOT 04006
CADENCE 06010
CAMPBELL 22006
CAPITAL BLUE CROSS 17001
The connector tokens reference file defines common values (typically conjunctions) that connect words in business names. For example, in the business name “Nursery of Venice”, “of” is a connector token. This helps the Oracle Java CAPS Match Engine recognize and process the full name of a business by indicating that the token connects two parts of the full name.
This file contains one column that lists the connector tokens in the business names you process. You can add entries as needed. Below is an excerpt from the bizConnectorTokens.dat file.
AN
DE
DES
DOS
LA
LAS
LE
OF
THE
The country key type file lists countries and continents, along with their abbreviations and assigned nationalities. For continents, the abbreviation is “CON” to separate them from countries. This enables the Oracle Java CAPS Match Engine to recognize and process these values as countries or continents. You can add entries to the country key type file using the following syntax.
country abbreviation nationality
Table 29 describes the columns in the bizCountryTypeKeys.dat file.
Table 29 Country Key Type Files
|
Following is an excerpt from the bizCountryTypeKeys.dat file.
AMERICA CON AMERICAN
AFRICA CON AFRICAN
EUROPE CON EUROPEAN
ASIA CON ASIAN
AFGHANISTAN AF AFGHAN
ALBANIA AL ALBANIAN
ALGERIA DZ ALGERIAN
The industry sector reference file lists and groups various industry sectors and sub-sectors, and includes an identification code for each type so the Oracle Java CAPS Match Engine can identify and process the industry sectors for different businesses. You can add entries to the industry sector reference file using the following syntax.
sector-code industry-sector
Table 30 describes each column in the bizIndustryCategoryCode.dat file.
Table 30 Industry Sector Reference File
|
Following is an excerpt from the bizIndustryCategoryCode.dat file.
02006 Automotive & Transport Equipment - Recreational Vehicles
02007 Automotive & Transport Equipment - Shipbuilding & Related Services
02008 Automotive & Transport Equipment - Trucks, Buses & Other Vehicles
03001 Banking - Banking
04001 Chemicals - Agricultural Chemicals
04002 Chemicals - Basic & Intermediate Chemicals & Petrochemicals
04003 Chemicals - Diversified Chemicals
04004 Chemicals - Paints, Coatings & Other Finishing Products
04005 Chemicals - Plastics & Fibers
04006 Chemicals - Specialty Chemicals
05001 Computer Hardware - Computer Peripherals
05002 Computer Hardware - Data Storage Devices
05003 Computer Hardware - Diversified Computer Products
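To see how a sector code relates to its sector and sub-sector, consider the following Python sketch. It assumes that each entry has the form shown in the excerpt: a code, then the sector and sub-sector separated by a hyphen with a space on each side. `parse_sector_line` is a hypothetical helper.

```python
def parse_sector_line(line):
    """Split 'sector-code industry-sector' into code, sector, and sub-sector.

    Assumption: the description uses ' - ' between sector and sub-sector,
    as in the excerpt above.
    """
    code, description = line.split(None, 1)
    sector, _, sub_sector = description.partition(" - ")
    return code, sector.strip(), sub_sector.strip()


print(parse_sector_line("04006 Chemicals - Specialty Chemicals"))
```

A lookup table built this way could be used to check that the sector codes in the merged and primary business name files refer to entries that exist in this file.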
The industry key type file is used to standardize the value of the Industry field into common industries to which businesses belong so the Oracle Java CAPS Match Engine can recognize and process the industry types for different businesses. You can add entries to the industry key type file using the following syntax.
industry-type standardized-form sectors
Table 31 describes each column in the bizIndustryTypeKeys.dat file.
Table 31 Industry Key Type File
|
Below is an excerpt from the bizIndustryTypeKeys.dat file.
TECH TECHNOLOGY 05001-05007
TECHNOLOGIES TECHNOLOGY 05001-05007
TECHNOLOGY 0 05001-05007
TECHSYSTEMS 0 05001-05007
TELE PHONE TELEPHONE 16005
TELE PHONES TELEPHONES 16005
TELEVISION TV 11013 21014
TELECOM 0 16005 26006 26009 26010
TELECOMM TELECOMMUNICATION 16005 26006 26008
TELECOMMUNICATION 0 16005 26006 26008
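The entries in this file combine an industry type, a standardized form, and one or more sector codes. The following Python sketch shows one possible reading of an entry. It rests on several assumptions that the excerpt suggests but does not confirm: sector codes are five-digit tokens at the end of the line, a token such as 05001-05007 denotes an inclusive range, and the standardized form is a single token. `expand_sectors` and `parse_industry_line` are hypothetical helpers.

```python
import re

# A sector token is a five-digit code or two such codes joined by a hyphen.
SECTOR_TOKEN = re.compile(r"^\d{5}(-\d{5})?$")


def expand_sectors(token):
    """Expand '05001-05007' into individual codes (assumed to be an inclusive range)."""
    if "-" in token:
        first, last = token.split("-")
        return [f"{n:05d}" for n in range(int(first), int(last) + 1)]
    return [token]


def parse_industry_line(line):
    """Return (industry_type, standardized_form, sector_codes) for one entry."""
    tokens = line.split()
    sectors = []
    # Collect the trailing sector tokens, keeping their original order.
    while tokens and SECTOR_TOKEN.match(tokens[-1]):
        sectors = expand_sectors(tokens.pop()) + sectors
    standardized = tokens.pop()        # assumed to be a single token
    industry_type = " ".join(tokens)   # whatever remains may be several words
    return industry_type, standardized, sectors


print(parse_industry_line("TELE PHONE TELEPHONE 16005"))
print(parse_industry_line("TECH TECHNOLOGY 05001-05007"))
```

The sketch returns the standardized form exactly as written in the file, including a value of 0, and does not interpret it.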
The organization key type file is used to standardize the value of the Organization field into common organizations to which businesses belong. This helps the Oracle Java CAPS Match Engine recognize and process the organization types for different businesses. You can add entries to the organization key type file using the following syntax.
original-type standardized-form
Table 32 describes each column in the bizOrganizationTypeKeys.dat file.
Table 32 Organization Key Type File
|
Below is an excerpt from the bizOrganizationTypeKeys.dat file.
INC INCORPORATED
INCORPORATED 0
KG 0
KK 0
LIMITED 0
LIMITED PARTNERSHIP 0
LLC 0
LLP 0
LP LIMITED PARTNERSHIP
LTD LIMITED
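The following Python sketch illustrates how entries like these could be used to standardize an organization type. It relies on two assumptions that are not stated in this section: a value of 0 in the second column means the original value is already in its standardized form, and when the second column is not 0, the original type is the first token on the line. `parse_org_type_lines` and `standardize_org_type` are hypothetical helpers.

```python
def parse_org_type_lines(lines):
    """Build a lookup of original organization type -> standardized form."""
    table = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[-1] == "0":
            # Assumption: 0 marks a value that is already standardized.
            key = " ".join(tokens[:-1])
            table[key] = key
        else:
            # Assumption: the original type is a single leading token.
            table[tokens[0]] = " ".join(tokens[1:])
    return table


def standardize_org_type(value, table):
    """Return the standardized form, or the value unchanged if it is not listed."""
    return table.get(value.upper(), value)


table = parse_org_type_lines([
    "INC INCORPORATED",
    "LIMITED PARTNERSHIP 0",
    "LP LIMITED PARTNERSHIP",
    "LTD LIMITED",
])
print(standardize_org_type("Ltd", table))
```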
The business patterns file defines multiple formats expected from the business name input fields along with the standardized output of each format. The patterns and output appear in two-row pairs in this file, as shown below.
4 PNT AST SEP-GLC ORT
PNT AST DEL ORT
The first line describes the input pattern and the second describes the output pattern using tokens to denote each component. The supported tokens are described in Business Name Tokens. A number at the beginning of the first line indicates the number of components in the given business name format. You can modify this file using the following syntax.
length input-pattern
output-pattern
Table 33 lists and describes the syntax components.
Table 33 Business Patterns File Components
|
Below is an excerpt from the bizPatterns.dat file.
4 PNT AST SEP-GLC ORT
PNT AST DEL ORT
4 NFG AJT SEP-GLC ORT
PNT PNT DEL ORT
4 NF AJT SEP-GLC ORT
PNT PNT DEL ORT
4 CST IDT NF ORT
PNT PNT PNT ORT
4 PNT AJT SEP-GLC ORT
PNT PNT DEL ORT
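Because the patterns appear in two-row pairs and the leading number must agree with the number of input components, a small consistency check can be useful after editing the file. The following Python sketch reads the rows in pairs and verifies the length. `parse_pattern_pairs` is a hypothetical helper, and it does not check that the tokens themselves are among the predefined ones.

```python
def parse_pattern_pairs(lines):
    """Read input/output pattern rows in pairs and check the declared length.

    Returns a list of (input_tokens, output_tokens) tuples.
    """
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) % 2:
        raise ValueError("expected input and output rows in pairs")
    patterns = []
    for input_row, output_row in zip(rows[0::2], rows[1::2]):
        length, input_tokens = int(input_row[0]), input_row[1:]
        if length != len(input_tokens):
            raise ValueError(f"declared {length} components, found {len(input_tokens)}")
        patterns.append((input_tokens, output_row))
    return patterns


sample = [
    "4 PNT AST SEP-GLC ORT",
    "PNT AST DEL ORT",
    "4 CST IDT NF ORT",
    "PNT PNT PNT ORT",
]
for input_tokens, output_tokens in parse_pattern_pairs(sample):
    print(input_tokens, "->", output_tokens)
```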
The business patterns file uses tokens to denote different components in a business name, such as the primary name, alias type key, URL, and so on. The file uses one set of tokens for input fields and another set for output fields. The tokens indicate the type key files to use to determine the appropriate values for each output field. You can use only the predefined tokens to represent business name components; the Oracle Java CAPS Match Engine does not recognize custom tokens.
Table 34 lists and describes each input token.
Table 34 Business Name Input Pattern Tokens
|
Table 35 lists and describes each output token.
Table 35 Business Name Output Pattern Tokens
|
The special characters reference file lists characters that should be removed from a business name before the field is processed. These are typically punctuation marks such as exclamation points and parentheses. Removing them enables the Oracle Java CAPS Match Engine to recognize the business name.
This file contains one column that lists the characters to be removed from the business names you process. You can add entries as needed. Below is an excerpt from the bizRemoveSpecChars.dat file.
[
]
{
}
<
>
/
?
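The effect of this file can be approximated with the following Python sketch. It assumes that each listed character is simply deleted rather than replaced with a space; the actual behavior of the match engine is not described in this section. `remove_special_chars` is a hypothetical helper.

```python
def remove_special_chars(name, special_chars):
    """Delete every listed character from the name (assumed behavior)."""
    # The third argument of str.maketrans lists the characters to delete.
    table = str.maketrans("", "", "".join(special_chars))
    return name.translate(table)


special_chars = ["[", "]", "{", "}", "<", ">", "/", "?"]
print(remove_special_chars("ACME [WEST] <HOLDINGS>", special_chars))
```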
To customize the Oracle Java CAPS Match Engine configuration files for processing business names, you can modify any of the business name standardization files using the text editor provided in NetBeans. Before modifying the match configuration file, review the information provided in Oracle Java CAPS Match Engine Matching Configuration and Match Configuration Comparison Functions for Oracle Java CAPS Match Engine (Repository). Make sure a thorough data analysis has been performed to determine the best fields for matching, and the best comparison functions to use for each field.
Updating most standardization files is a straightforward process. Make sure to follow the syntax guidelines provided in Oracle Java CAPS Match Engine Standardization Configuration for Business Names. If you add rows to any standardization files, make sure to modify the corresponding parameter in the business constants file (bizConstants.cfg). Before making any changes to the patterns file, make sure you understand the tokens used to represent business name field components.
For information about the standardization files you can modify, see Oracle Java CAPS Match Engine Standardization Configuration for Business Names.
To ensure correct processing of business names, you must customize the Matching Service. This includes modifying the Match Field file to support the fields on which you want to match, to standardize the appropriate fields, and to specify the Oracle Java CAPS Match Engine as the match and standardization engine. By default, the Oracle Java CAPS Match Engine is already specified, so this setting does not need to be changed. Perform the following tasks to configure the Matching Service.
Configuring the Standardization Structure for Business Names (Repository)
Configuring the Match String for Business Names (Repository)
When configuring the Matching Service, keep in mind the information presented in Configuring the Master Index Matching Service (Repository).
The standardization structure is configured in the StandardizationConfig section of the Match Field file, which is described in detail in Match Field Configuration (Repository) in Understanding Oracle Java CAPS Master Index Configuration Options (Repository). To configure the required fields for standardization and phonetic encoding, modify the standardization and phonetic encoding structures. The following sections provide additional guidelines and samples specific to standardizing business names.
Note - In the default configuration, the rules defined for the business data type assume that all input fields must be parsed as well as normalized. Thus, there is no need to configure fields only for normalization.
For business name fields, the source fields in the standardization structure must include the fields predefined for parsing and normalization. This includes any fields containing business name information, which are parsed into the business name fields listed in Business Name Object Structure (excluding the phonetic business name field). The target fields can include any of these parsed fields. Follow the instructions under Defining Master Index Standardization Rules (Repository) in Configuring Oracle Java CAPS Master Indexes (Repository) to define fields for normalization. For the standardization-type element, enter BusinessName (for more information, see Oracle Java CAPS Match Engine Match and Standardization Types). For a list of field IDs to use in the standardized-object-field-id element, see Table 3.
A sample standardization structure for business name data is shown below. This structure parses a business name field into the standard business name fields. Note that there is no domain selector specified, which would normally default to the United States domain; however, since business names are not domain dependent, it is irrelevant here.
<free-form-texts-to-standardize>
   <group standardization-type="BusinessName">
      <unstandardized-source-fields>
         <unstandardized-source-field-name>Company.Name</unstandardized-source-field-name>
      </unstandardized-source-fields>
      <standardization-targets>
         <target-mapping>
            <standardized-object-field-id>PrimaryName</standardized-object-field-id>
            <standardized-target-field-name>Company.Name_Name</standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>OrgTypeKeyword</standardized-object-field-id>
            <standardized-target-field-name>Company.Name_OrgType</standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>AssocTypeKeyword</standardized-object-field-id>
            <standardized-target-field-name>Company.Name_AssocType</standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>IndustrySectorList</standardized-object-field-id>
            <standardized-target-field-name>Company.Name_Sector</standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>IndustryTypeKeyword</standardized-object-field-id>
            <standardized-target-field-name>Company.Name_Industry</standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>AliasList</standardized-object-field-id>
            <standardized-target-field-name>Company.Name_Alias</standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>Url</standardized-object-field-id>
            <standardized-target-field-name>Company.Name_URL</standardized-target-field-name>
         </target-mapping>
      </standardization-targets>
   </group>
</free-form-texts-to-standardize>
When you match on business name fields, the name field should be specified for phonetic conversion (by default, the wizard defines this for you). Follow the instructions under Defining Phonetic Encoding for the Master Index (Repository) in Configuring Oracle Java CAPS Master Indexes (Repository) to define fields for phonetic encoding.
A sample of the phoneticize-fields element is shown below. This sample only converts the business name. You can define additional fields for phonetic encoding.
<phoneticize-fields>
   <phoneticize-field>
      <unphoneticized-source-field-name>Company.Name_Name</unphoneticized-source-field-name>
      <phoneticized-target-field-name>Company.Name_NamePhon</phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
</phoneticize-fields>
For matching on business name fields, make sure the match string you specify in the MatchingConfig section of the Match Field file contains all or a subset of the fields that contain the standardized data (the unparsed business names are typically too inconsistent for matching). You can include additional fields for matching if required.
To configure the match string, follow the instructions under Defining the Master Index Match String (Repository) in Configuring Oracle Java CAPS Master Indexes (Repository). For the Oracle Java CAPS Match Engine, each data type has a different match type (specified by the match-type element). The PrimaryName, OrgTypeKeyword, AssocTypeKeyword, IndustrySectorList, IndustryTypeKeyword, and Url match types are specific to business name matching. You can specify any of the other match types defined in the match configuration file, as well. For more information, see Oracle Java CAPS Match Engine Match and Standardization Types.
A sample match string for business name matching is shown below. This sample matches on the company name, the organization type, and the sector.
<match-system-object>
   <object-name>Company</object-name>
   <match-columns>
      <match-column>
         <column-name>Enterprise.SystemSBR.Company.Name_PrimaryName</column-name>
         <match-type>PrimaryName</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Company.Name_OrgType</column-name>
         <match-type>OrgTypeKeyword</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Company.Name_Sector</column-name>
         <match-type>IndustryTypeKeyword</match-type>
      </match-column>
   </match-columns>
</match-system-object>
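When reviewing a match string, it can help to list each column next to its match type. The following Python sketch does this with the standard library XML parser. It works on a shortened, well-formed fragment that uses the same element names as the sample; `match_columns` is a hypothetical helper and is not part of the master index tooling.

```python
import xml.etree.ElementTree as ET

# Shortened fragment using the element names from the sample match string.
MATCH_STRING = """
<match-system-object>
   <object-name>Company</object-name>
   <match-columns>
      <match-column>
         <column-name>Enterprise.SystemSBR.Company.Name_OrgType</column-name>
         <match-type>OrgTypeKeyword</match-type>
      </match-column>
   </match-columns>
</match-system-object>
"""


def match_columns(xml_text):
    """Return (column-name, match-type) pairs from a match-system-object element."""
    root = ET.fromstring(xml_text)
    return [
        (
            column.findtext("column-name", "").strip(),
            column.findtext("match-type", "").strip(),
        )
        for column in root.iter("match-column")
    ]


for column_name, match_type in match_columns(MATCH_STRING):
    print(column_name, "->", match_type)
```

Because the parser rejects malformed input, running a fragment through it also catches errors such as an unclosed or mistyped tag before the file is used.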