Oracle Java CAPS Master Index Standardization Engine Reference
Several configuration files are used to define business name processing logic for the Master Index Standardization Engine. These files provide information about business name patterns and tokens to help the standardization engine determine how to recognize business name components and break them out into their respective tokens. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for business names.
The following topics describe each file used for business name standardization:
The adjectives key type file (bizAdjectivesTypeKeys.dat) defines adjectives commonly found in business names so the Master Index Standardization Engine can recognize and process these values as a part of the business name. This file contains one column with a list of commonly used adjectives, such as General, Financial, Central, and so on.
You can modify or add entries in this file as needed. Following is an excerpt from the adjectives key type file.
DIGITAL
DIRECTED
DIVERSIFIED
EDUCATIONAL
ELECTROCHEMICAL
ENGINEERED
EVOLUTIONARY
EXTENDED
FACTUAL
FEDERAL
The alias key type file (bizAliasTypeKeys.dat) lists business name acronyms and abbreviations along with their standardized names so the standardization engine can recognize and process these values correctly. You can add entries to the alias key type file using the following syntax.
alias standardized-name
The following table describes the columns in the alias key type file.
Table 10 Alias Key Type File
Following is an excerpt from the alias key type file.
BBH        BARTLE BOGLE HEGARTY
BBH        BROWN BROTHERS HARRIMAN
IBM        INTERNATIONAL BUSINESS MACHINE
IDS        INCOMES DATA SERVICES
IDS        INSURANCE DATA SERVICES
IDS        THE INTEGRATED DECISION SUPPORT GROUP
IDS        THE INTERNET DATABASE SERVICE
CAL-TECH   CALIFORNIA INSTITUTE OF TECHNOLOGY
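Because one alias can map to several standardized names (BBH and IDS each appear more than once in the excerpt), a lookup built from this file needs to hold a list per alias. The following Python sketch illustrates the idea. It assumes the alias is the first whitespace-delimited token and the rest of the line is the standardized name; the real file may use fixed-width columns instead. The `load_aliases` helper is hypothetical and is not part of the standardization engine.

```python
from collections import defaultdict

# A few rows copied from the excerpt above.
SAMPLE = """\
BBH BARTLE BOGLE HEGARTY
BBH BROWN BROTHERS HARRIMAN
IBM INTERNATIONAL BUSINESS MACHINE
IDS INCOMES DATA SERVICES
IDS INSURANCE DATA SERVICES
CAL-TECH CALIFORNIA INSTITUTE OF TECHNOLOGY
"""

def load_aliases(text):
    """Map each alias to every standardized name listed for it."""
    aliases = defaultdict(list)
    for line in text.splitlines():
        # Assumption: alias is the first token, the name is the remainder.
        parts = line.split(None, 1)
        if len(parts) == 2:
            alias, name = parts
            aliases[alias].append(" ".join(name.split()))
    return dict(aliases)

aliases = load_aliases(SAMPLE)
print(aliases["BBH"])
```

Keeping every candidate name, rather than overwriting earlier rows, preserves the ambiguity that the file itself expresses.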
The association key type file (bizAssociationTypeKeys.dat) lists business association types along with their standardized names so the standardization engine can recognize and process these values correctly. You can add entries to the association key type file using the following syntax.
association-type standardized-type
The following table describes the columns in the association key type file.
Table 11 Association Key Type Table
Following is an excerpt from the bizAssociationTypeKeys.dat file.
ASSOCIATES       0
BANCORP          0
BANCORPORATION   BANCORP
COMPANIES        0
GP               GROUP
GROUP            0
PARTNERS         0
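In the excerpt, a 0 in the second column appears to mark an entry that is already in its standardized form, while any other value names the form to substitute. That reading is an assumption here; the column descriptions for the file are the authority. Under that assumption, and assuming single-token entries separated by whitespace, a lookup could be sketched in Python as follows. The `load_type_keys` helper is hypothetical.

```python
# Rows copied from the excerpt above.
SAMPLE = """\
ASSOCIATES       0
BANCORP          0
BANCORPORATION   BANCORP
GP               GROUP
GROUP            0
"""

def load_type_keys(text):
    """Return {entry: standardized form} for a two-column key type file."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        key, standard = fields
        # Assumption: "0" means the entry is already standardized.
        table[key] = key if standard == "0" else standard
    return table

association_types = load_type_keys(SAMPLE)
print(association_types["GP"])
```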
The general terms reference file (bizBusinessGeneralTerms.dat) lists terms commonly used in business names. This file is used to identify terms that indicate a business, such as bank, supply, factory, and so on, so the Master Index Standardization Engine can recognize and process the business name.
This file contains one column that lists common terms in the business names you process. You can add entries as needed. Below is an excerpt from the general terms reference file.
BUILDING
CITY
CONSUMER
EAST
EYE
FACTORY
LATIN
NORTH
SOUTH
The city or state key type file (bizCityorStateTypeKeys.dat) lists various cities and states that might be used in business names. It also classifies each entry as a city (CT) or state (ST) and indicates the country in which the city or state is located. This enables the standardization engine to recognize and process these values correctly. You can add entries to the city or state key type file using the following syntax.
city-or-state type country
The following table describes the columns in the file.
Table 12 City or State Key Type File
Following is an excerpt from the city or state key type file.
ADELAIDE    CT  AU
ALABAMA     ST  US
ALASKA      ST  US
ALGIERS     CT  DZ
AMSTERDAM   CT  NL
ARIZONA     ST  US
ARKANSAS    ST  US
ASUNCION    CT  PY
ATHENS      CT  GR
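As an illustration of the three-column layout, the following Python sketch reads each row from the right, so that the last two fields are always taken as the type and country and whatever precedes them is the name. It assumes whitespace-separated columns; `Place` and `parse_place` are hypothetical names introduced only for this example.

```python
from typing import NamedTuple

class Place(NamedTuple):
    name: str
    kind: str      # "CT" for a city, "ST" for a state
    country: str   # country abbreviation, as listed in the file

def parse_place(line):
    """Split one row into name, type, and country, reading from the right."""
    name, kind, country = line.strip().rsplit(None, 2)
    if kind not in ("CT", "ST"):
        raise ValueError(f"unexpected type {kind!r} in row: {line!r}")
    return Place(" ".join(name.split()), kind, country)

# Rows copied from the excerpt above.
SAMPLE = """\
ADELAIDE    CT  AU
ALABAMA     ST  US
ALGIERS     CT  DZ
"""

places = [parse_place(row) for row in SAMPLE.splitlines()]
print(places[0])
```

Splitting from the right means a name that contains spaces would still parse, although none of the rows in the excerpt has one.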
The business former name reference file (bizCompanyFormerNames.dat) provides a list of common company names along with names by which the companies were formerly known so the standardization engine can recognize a business when processing a record containing a previous business name. You can add entries to the business former name table using the following syntax.
former-name current-name
The following table describes each column in the business former name reference file.
Table 13 Business Former Name Reference File
Below is an excerpt from the business former name reference file.
HELLENIC BOTTLING         COCA-COLA HBC
INTERNATIONAL PRODUCTS    THE TERLATO WINE
ORGANIC FOOD PRODUCTS     SPECTRUM ORGANIC PRODUCTS
SUTTER HOME WINERY        TRINCHERO FAMILY ESTATES
The merged business name category file (bizCompanyMergerNames.dat) provides a list of companies whose name changed because of a merger along with the name of the company after the merge. It also classifies the business names into industry sectors and sub-sectors. This enables the standardization engine to recognize the current company name and determine the sector of the business. You can add entries to the business merger name file using the following syntax.
former-name/merged-name sector-code
The following table describes each column in the merged business name category file.
Table 14 Merged Business Name Category File
Below is an excerpt from the merged business name category file.
DUKE/FLUOR DANIEL                  20005
FAULTLESS STARCH/BON AMI           09004
FIND/SVP                           10013
FIRST WAVE/NEWPARK SHIPBUILDING    27005
GUNDLE/SLT                         19020
HMG/COURTLAND                      23004
J BROWN/LMC                        10014
KORN/FERRY                         10020
LINSCO/PRIVATE LEDGER              14005
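The former-name/merged-name syntax can be illustrated with a short Python sketch that takes the last field as the sector code and splits the remainder at the slash. It assumes whitespace-separated columns and exactly one slash per row, as in the excerpt; `parse_merger` is a hypothetical helper.

```python
def parse_merger(line):
    """Return (former_name, merged_name, sector_code) for one row."""
    names, code = line.strip().rsplit(None, 1)
    former, _, merged = names.partition("/")
    # The code stays a string so that a leading zero (09004) is preserved.
    return former.strip(), merged.strip(), code

# Rows copied from the excerpt above.
SAMPLE = """\
DUKE/FLUOR DANIEL                  20005
FAULTLESS STARCH/BON AMI           09004
KORN/FERRY                         10020
"""

mergers = [parse_merger(row) for row in SAMPLE.splitlines()]
print(mergers[1])
```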
The primary business name reference file (bizCompanyPrimaryNames.dat) provides a list of companies by their primary name. It also classifies the business names into industry sectors and sub-sectors. This enables the standardization engine to determine the correct value of the sector field when parsing the business name. You can add entries to the primary business name file using the following syntax.
primary-name sector-code
The following table describes the columns in the primary business name reference file.
Table 15 Primary Business Name Reference File
Below is an excerpt from the primary business name reference file.
BROTHER INTERNATIONAL           12006
BRYSTOL-MYERS SQUIBB            11005
BURLINGTON COAT FACTORY         24003
BURLINGTON NORTHERN SANTA FE    27005
BV SOLUTIONS                    06012
CABLEVISION                     26001
CABOT                           04006
CADENCE                         06010
CAMPBELL                        22006
CAPITAL BLUE CROSS              17001
The connector tokens reference file (bizConnectorTokens.dat) defines common values (typically conjunctions) that connect words in business names. For example, in the business name “Nursery of Venice”, “of” is a connector token. This helps the standardization engine recognize and process the full name of a business by indicating that the token connects two parts of the full name.
This file contains one column that lists the connector tokens in the business names you process. You can add entries as needed. Below is an excerpt from the connector tokens reference file.
AN
DE
DES
DOS
LA
LAS
LE
OF
THE
The country key type file (bizCountryTypeKeys.dat) lists countries and continents, along with their abbreviations and assigned nationalities. For continents, the abbreviation is “CON” to separate them from countries. This enables the standardization engine to recognize and process these values as countries or continents. You can add entries to the country key type file using the following syntax.
country abbreviation nationality
The following table describes the columns in the country key type file.
Table 16 Country Key Type File
Following is an excerpt from the country key type file.
AMERICA       CON   AMERICAN
AFRICA        CON   AFRICAN
EUROPE        CON   EUROPEAN
ASIA          CON   ASIAN
AFGHANISTAN   AF    AFGHAN
ALBANIA       AL    ALBANIAN
ALGERIA       DZ    ALGERIAN
The industry sector reference file (bizIndustryCategoryCode.dat) lists and groups various industry sectors and sub-sectors, and includes an identification code for each type so the standardization engine can identify and process the industry sectors for different businesses. You can add entries to the industry sector reference file using the following syntax.
sector-code industry-sector
The following table describes each column in the industry sector reference file.
Table 17 Industry Sector Reference File
Following is an excerpt from the industry sector reference file.
02006 Automotive & Transport Equipment - Recreational Vehicles
02007 Automotive & Transport Equipment - Shipbuilding & Related Services
02008 Automotive & Transport Equipment - Trucks, Buses & Other Vehicles
03001 Banking - Banking
04001 Chemicals - Agricultural Chemicals
04002 Chemicals - Basic & Intermediate Chemicals & Petrochemicals
04003 Chemicals - Diversified Chemicals
04004 Chemicals - Paints, Coatings & Other Finishing Products
04005 Chemicals - Plastics & Fibers
04006 Chemicals - Specialty Chemicals
05001 Computer Hardware - Computer Peripherals
05002 Computer Hardware - Data Storage Devices
05003 Computer Hardware - Diversified Computer Products
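Each row in the excerpt pairs a five-digit code with a description of the form "sector - sub-sector". The Python sketch below splits rows on that basis and groups the sub-sectors under their sector. It assumes the first " - " in the description separates sector from sub-sector, which holds for the rows shown but is not stated by the file format; `parse_sector` is a hypothetical helper.

```python
from collections import defaultdict

def parse_sector(line):
    """Return (code, sector, sub_sector) for one row."""
    code, description = line.strip().split(None, 1)
    # Assumption: the first " - " separates sector from sub-sector.
    sector, _, sub_sector = description.partition(" - ")
    return code, sector.strip(), sub_sector.strip()

# Rows copied from the excerpt above.
SAMPLE = """\
03001 Banking - Banking
04001 Chemicals - Agricultural Chemicals
04002 Chemicals - Basic & Intermediate Chemicals & Petrochemicals
05001 Computer Hardware - Computer Peripherals
"""

by_sector = defaultdict(list)
for row in SAMPLE.splitlines():
    code, sector, sub_sector = parse_sector(row)
    by_sector[sector].append((code, sub_sector))

print(by_sector["Chemicals"])
```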
The industry key type file (bizIndustryTypeKeys.dat) is used to standardize the value of the Industry field into common industries to which businesses belong so the standardization engine can recognize and process the industry types for different businesses. You can add entries to the industry key type file using the following syntax.
industry-type standardized-form sectors
The following table describes each column in the industry key type file.
Table 18 Industry Key Type File
Below is an excerpt from the industry key type file.
TECH                TECHNOLOGY          05001-05007
TECHNOLOGIES        TECHNOLOGY          05001-05007
TECHNOLOGY          0                   05001-05007
TECHSYSTEMS         0                   05001-05007
TELE PHONE          TELEPHONE           16005
TELE PHONES         TELEPHONES          16005
TELEVISION          TV                  11013 21014
TELECOM             0                   16005 26006 26009 26010
TELECOMM            TELECOMMUNICATION   16005 26006 26008
TELECOMMUNICATION   0                   16005 26006 26008
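The sectors column in the excerpt mixes individual codes with ranges such as 05001-05007. The Python sketch below shows one way to expand a row into explicit sector codes. It rests on several assumptions that the excerpt suggests but does not confirm: ranges are inclusive, the standardized form is a single token, a 0 means the industry type is already standardized, and columns are separated by whitespace. `expand` and `parse_industry` are hypothetical helpers.

```python
import re

# A sector token is either one five-digit code or a code range.
CODE = re.compile(r"^\d{5}(-\d{5})?$")

def expand(token):
    """Expand '05001-05007' into individual codes (assumed inclusive)."""
    start, _, end = token.partition("-")
    end = end or start
    return [f"{n:05d}" for n in range(int(start), int(end) + 1)]

def parse_industry(line):
    """Return (industry_type, standardized_form, sector_codes)."""
    tokens = line.split()
    sectors = []
    # Peel sector tokens off the right-hand end, keeping their order.
    while tokens and CODE.match(tokens[-1]):
        sectors[:0] = expand(tokens.pop())
    standard = tokens.pop()      # assumption: a single-token standard form
    key = " ".join(tokens)       # the industry type may contain spaces
    if standard == "0":          # assumption: 0 means already standardized
        standard = key
    return key, standard, sectors

print(parse_industry("TELE PHONE  TELEPHONE  16005"))
```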
The organization key type file (bizOrganizationTypeKeys.dat) is used to standardize the value of the Organization field into common organizations to which businesses belong. This helps the standardization engine recognize and process the organization types for different businesses. You can add entries to the organization key type file using the following syntax.
original-type standardized-form
The following table describes each column in the organization key type file.
Table 19 Organization Key Type File
Below is an excerpt from the organization key type file.
INC                   INCORPORATED
INCORPORATED          0
KG                    0
KK                    0
LIMITED               0
LIMITED PARTNERSHIP   0
LLC                   0
LLP                   0
LP                    LIMITED PARTNERSHIP
LTD                   LIMITED
The business patterns file (bizpatterns.dat) defines multiple formats expected from the business name input fields along with the standardized output of each format. The patterns and output appear in two-row pairs in this file, as shown below.
4 PNT AST SEP-GLC ORT
PNT AST DEL ORT
The first line describes the input pattern and the second describes the output pattern using tokens to denote each component. The supported tokens are described in Business Name Tokens. A number at the beginning of the first line indicates the number of components in the given business name format. You can modify this file using the following syntax.
length input-pattern
output-pattern
The following table lists and describes the components in the above syntax.
Table 20 Business Patterns File Components
Below is an excerpt from the business patterns file.
4 PNT AST SEP-GLC ORT
PNT AST DEL ORT
4 NFG AJT SEP-GLC ORT
PNT PNT DEL ORT
4 NF AJT SEP-GLC ORT
PNT PNT DEL ORT
4 CST IDT NF ORT
PNT PNT PNT ORT
4 PNT AJT SEP-GLC ORT
PNT PNT DEL ORT
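The two-row structure lends itself to a simple consistency check: the number at the start of an input row should equal the number of input tokens that follow it. The Python sketch below reads the rows in pairs and applies that check. It is only an illustration of the file layout, not the engine's own parser, and `load_patterns` is a hypothetical helper.

```python
# Two pattern pairs copied from the excerpt above.
SAMPLE = """\
4 PNT AST SEP-GLC ORT
PNT AST DEL ORT
4 NFG AJT SEP-GLC ORT
PNT PNT DEL ORT
"""

def load_patterns(text):
    """Return {input token tuple: output token list} from two-row pairs."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    if len(rows) % 2:
        raise ValueError("patterns must appear in input/output row pairs")
    patterns = {}
    for first, output in zip(rows[0::2], rows[1::2]):
        length, input_tokens = int(first[0]), first[1:]
        if length != len(input_tokens):
            raise ValueError(f"length {length} does not match {input_tokens}")
        patterns[tuple(input_tokens)] = output
    return patterns

patterns = load_patterns(SAMPLE)
print(patterns[("PNT", "AST", "SEP-GLC", "ORT")])
```

A check of this kind can catch a miscounted length or a missing output row before a modified file is deployed.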
The business patterns file uses tokens to denote different components in a business name, such as the primary name, alias type key, URL, and so on. The file uses one set of tokens for input fields and another set for output fields. The tokens indicate the type key files to use to determine the appropriate values for each output field. You can use only the predefined tokens to represent business name components; the standardization engine does not recognize custom tokens.
Table 21 lists and describes each input token; Table 22 lists and describes each output token.
Table 21 Business Name Input Pattern Tokens
Table 22 lists and describes each output token.
Table 22 Business Name Output Pattern Tokens