Oracle Java CAPS Master Index Standardization Engine Reference

Business Name Standardization Files

Several configuration files are used to define business name processing logic for the Master Index Standardization Engine. These files provide information about business name patterns and tokens to help the standardization engine determine how to recognize business name components and break them out into their respective tokens. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for business names.

The following topics describe each file used for business name standardization.

Business Name Adjectives Key Type File

The adjectives key type file (bizAdjectivesTypeKeys.dat) defines adjectives commonly found in business names so the Master Index Standardization Engine can recognize and process these values as a part of the business name. This file contains one column with a list of commonly used adjectives, such as General, Financial, Central, and so on.

You can modify or add entries in this file as needed. Following is an excerpt from the adjectives key type file.

DIGITAL
DIRECTED
DIVERSIFIED
EDUCATIONAL
ELECTROCHEMICAL
ENGINEERED
EVOLUTIONARY
EXTENDED
FACTUAL
FEDERAL
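
As a rough illustration of how a one-column file like this might be consumed, the Python sketch below loads the entries into a set and tests whether a word is a listed adjective. The names `load_terms` and `is_adjective`, and the case-insensitive comparison, are assumptions made for illustration; they are not part of the standardization engine.

```python
def load_terms(text):
    """Return the set of non-blank entries from a one-column file, uppercased."""
    return {line.strip().upper() for line in text.splitlines() if line.strip()}


# Sample lines copied from the adjectives key type file excerpt.
SAMPLE = """\
DIGITAL
DIRECTED
DIVERSIFIED
EDUCATIONAL
"""

adjectives = load_terms(SAMPLE)


def is_adjective(word):
    """Case-insensitive membership test (an assumption of this sketch)."""
    return word.strip().upper() in adjectives
```

The same approach would apply to the other one-column files in this section, such as the general terms and connector tokens reference files.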

Business Alias Key Type File

The alias key type file (bizAliasTypeKeys.dat) lists business name acronyms and abbreviations along with their standardized names so the standardization engine can recognize and process these values correctly. You can add entries to the alias key type file using the following syntax.

alias standardized-name

The following table describes the columns in the alias key type file.

Table 10 Alias Key Type File

Column               Description
alias                An abbreviation or acronym commonly used in place of a specific business name.
standardized-name    The normalized version of the alias name.

Following is an excerpt from the alias key type file.

BBH                 BARTLE BOGLE HEGARTY
BBH                 BROWN BROTHERS HARRIMAN
IBM                 INTERNATIONAL BUSINESS MACHINE
IDS                 INCOMES DATA SERVICES
IDS                 INSURANCE DATA SERVICES
IDS                 THE INTEGRATED DECISION SUPPORT GROUP
IDS                 THE INTERNET DATABASE SERVICE
CAL-TECH            CALIFORNIA INSTITUTE OF TECHNOLOGY
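
The excerpt shows that one alias (BBH, IDS) can appear on several lines with different standardized names. A minimal Python sketch of reading such a file therefore keeps a list of names per alias. It assumes the alias is the first whitespace-delimited token on each line and everything after it is the standardized name; `parse_alias_lines` is a hypothetical helper, not an engine API.

```python
from collections import defaultdict


def parse_alias_lines(text):
    """Map each alias to the list of standardized names given for it."""
    aliases = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumption: the alias is the first whitespace-delimited token and
        # the rest of the line is the standardized name.
        alias, name = line.split(None, 1)
        aliases[alias].append(name.strip())
    return dict(aliases)


# Sample lines copied from the alias key type file excerpt.
SAMPLE = """\
BBH                 BARTLE BOGLE HEGARTY
BBH                 BROWN BROTHERS HARRIMAN
CAL-TECH            CALIFORNIA INSTITUTE OF TECHNOLOGY
"""

alias_map = parse_alias_lines(SAMPLE)
```

How the engine chooses among several standardized names for the same alias is not described here, so the sketch simply preserves all of them in file order.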

Business Association Key Type File

The association key type file (bizAssociationTypeKeys.dat) lists business association types along with their standardized names so the standardization engine can recognize and process these values correctly. You can add entries to the association key type file using the following syntax.

association-type standardized-type

The following table describes the columns in the association key type file.

Table 11 Association Key Type File

Column               Description
association-type     A common association type for businesses, such as Partners, Group, and so on.
standardized-type    The standardized version of the association type. A zero (0) in this column indicates that the association type is already in its standardized form. If this column contains a name instead of a zero, that name must also be listed in a different entry as an association type with a standardized form of “0”.

Following is an excerpt from the bizAssociationTypeKeys.dat file.

ASSOCIATES          0
BANCORP             0
BANCORPORATION      BANCORP
COMPANIES           0
GP                  GROUP
GROUP               0
PARTNERS            0
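
The rule that every standardized name must also have its own entry marked with a zero can be checked mechanically. The Python sketch below is one way to do that, assuming both columns are single tokens (as in the excerpt) and that a zero marks an entry that is already in standardized form. The function names are hypothetical and chosen for illustration only.

```python
def parse_type_keys(text):
    """Return {type: standardized form}, keeping "0" as written in the file."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            # Assumption: both columns are single tokens, as in the excerpt.
            key, standardized = line.split()
            entries[key] = standardized
    return entries


def standardize(entries, value):
    """Resolve a value to its standardized form; "0" means already standard."""
    target = entries[value]
    return value if target == "0" else target


def missing_targets(entries):
    """List standardized names that lack their own entry marked "0"."""
    return sorted(
        {t for t in entries.values() if t != "0" and entries.get(t) != "0"}
    )


# Sample lines copied from the association key type file excerpt.
SAMPLE = """\
ASSOCIATES          0
BANCORP             0
BANCORPORATION      BANCORP
GP                  GROUP
GROUP               0
"""

entries = parse_type_keys(SAMPLE)
```

Running `missing_targets` after editing the file is a quick way to catch a new entry such as `GP GROUP` that was added without a matching `GROUP 0` line.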

Business General Terms Reference File

The general terms reference file (bizBusinessGeneralTerms.dat) lists terms commonly used in business names. This file is used to identify terms that indicate a business, such as bank, supply, factory, and so on, so the Master Index Standardization Engine can recognize and process the business name.

This file contains one column that lists common terms in the business names you process. You can add entries as needed. Below is an excerpt from the general terms reference file.

BUILDING
CITY
CONSUMER
EAST
EYE
FACTORY
LATIN
NORTH
SOUTH

Business City or State Key Type File

The city or state key type file (bizCityorStateTypeKeys.dat) lists various cities and states that might be used in business names. It also classifies each entry as a city (CT) or state (ST) and indicates the country in which the city or state is located. This enables the standardization engine to recognize and process these values correctly. You can add entries to the city or state key type file using the following syntax.

city-or-state type country

The following table describes the columns in the file.

Table 12 City or State Key Type File

Column           Description
city-or-state    The name of a city or state used in business names.
type             An indicator of whether the value is a city or state. “CT” indicates city and “ST” indicates state.
country          The country code of the country in which the city or state is located.

Following is an excerpt from the city or state key type file.

ADELAIDE                 CT   AU
ALABAMA                  ST   US
ALASKA                   ST   US
ALGIERS                  CT   DZ
AMSTERDAM                CT   NL
ARIZONA                  ST   US
ARKANSAS                 ST   US
ASUNCION                 CT   PY
ATHENS                   CT   GR
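
A line in this three-column layout could be split as in the following Python sketch. It splits from the right so that a multi-word name stays intact, on the assumption that the type and country columns never contain spaces. `Place` and `parse_place_line` are illustrative names, not part of the product.

```python
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    kind: str      # "CT" for city, "ST" for state
    country: str   # country code


def parse_place_line(line):
    """Split one line into name, type and country.

    Splitting from the right keeps multi-word names intact, on the
    assumption that the type and country columns never contain spaces.
    """
    name, kind, country = line.rsplit(None, 2)
    if kind not in ("CT", "ST"):
        raise ValueError(f"unexpected type {kind!r} in line: {line!r}")
    return Place(name.strip(), kind, country)
```

For example, `parse_place_line("ADELAIDE                 CT   AU")` yields a `Place` whose `kind` is `"CT"` and whose `country` is `"AU"`.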

Business Former Name Reference File

The business former name reference file (bizCompanyFormerNames.dat) provides a list of common company names along with names by which the companies were formerly known so the standardization engine can recognize a business when processing a record containing a previous business name. You can add entries to the business former name reference file using the following syntax.

former-name current-name

The following table describes each column in the business former name reference file.

Table 13 Business Former Name Reference File

Column          Description
former-name     One of the company’s previous names.
current-name    The company’s current name.

Below is an excerpt from the business former name reference file.

HELLENIC BOTTLING                       COCA-COLA HBC
INTERNATIONAL PRODUCTS                  THE TERLATO WINE
ORGANIC FOOD PRODUCTS                   SPECTRUM ORGANIC PRODUCTS
SUTTER HOME WINERY                      TRINCHERO FAMILY ESTATES
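
Because both columns in this file can contain spaces, a reader has to decide where one name ends and the next begins. The Python sketch below assumes the columns are separated by two or more spaces, as they are in the excerpt; that separator rule and the helper names are assumptions of this example rather than a documented file format.

```python
import re


def parse_former_names(text):
    """Return {former name: current name}."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumption: columns are separated by two or more spaces, because
        # both names may contain single spaces.
        former, current = re.split(r"\s{2,}", line.strip(), maxsplit=1)
        mapping[former] = current
    return mapping


# Sample lines copied from the business former name reference file excerpt.
SAMPLE = """\
HELLENIC BOTTLING                       COCA-COLA HBC
SUTTER HOME WINERY                      TRINCHERO FAMILY ESTATES
"""

former_names = parse_former_names(SAMPLE)


def current_name(name):
    """Return the current name for a former name, or the input unchanged."""
    return former_names.get(name.upper(), name)
```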

Merged Business Name Category File

The merged business name category file (bizCompanyMergerNames.dat) provides a list of companies whose name changed because of a merger, along with the name of the company after the merger. It also classifies the business names into industry sectors and sub-sectors. This enables the standardization engine to recognize the current company name and determine the sector of the business. You can add entries to the merged business name category file using the following syntax.

former-name/merged-name sector-code

The following table describes each column in the merged business name category file.

Table 14 Merged Business Name Category File

Column         Description
former-name    The name of the company whose name was not kept after the merger.
merged-name    The name of the company whose name was kept after the merger.
sector-code    The industry sector code of the business. Sector codes are listed in the industry sector reference file (bizIndustryCategoryCode.dat).

Below is an excerpt from the merged business name category file.

DUKE/FLUOR DANIEL                                 20005
FAULTLESS STARCH/BON AMI                          09004
FIND/SVP                                          10013
FIRST WAVE/NEWPARK SHIPBUILDING                   27005
GUNDLE/SLT                                        19020
HMG/COURTLAND                                     23004
J BROWN/LMC                                       10014
KORN/FERRY                                        10020
LINSCO/PRIVATE LEDGER                             14005
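
The `former-name/merged-name sector-code` layout can be taken apart in two steps, as in this Python sketch: peel the sector code off the right, then split the names at the slash. It assumes the first slash on the line is the separator between the two names, and it keeps the sector code as a string so that a leading zero (for example `09004`) is preserved. `parse_merger_line` is a hypothetical helper.

```python
def parse_merger_line(line):
    """Return (former_name, merged_name, sector_code) for one line."""
    # The sector code is the last whitespace-delimited token.
    names, sector_code = line.rsplit(None, 1)
    # Assumption: the first slash separates the two names.
    former, merged = names.strip().split("/", 1)
    return former.strip(), merged.strip(), sector_code
```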

Primary Business Name Reference File

The primary business name reference file (bizCompanyPrimaryNames.dat) provides a list of companies by their primary name. It also classifies the business names into industry sectors and sub-sectors. This enables the standardization engine to determine the correct value of the sector field when parsing the business name. You can add entries to the primary business name file using the following syntax.

primary-name sector-code

The following table describes the columns in the primary business name reference file.

Table 15 Primary Business Name Reference File

Column          Description
primary-name    The primary name of the company.
sector-code     The industry sector code of the business. Sector codes are listed in the industry sector reference file (bizIndustryCategoryCode.dat).

Below is an excerpt from the primary business name reference file.

BROTHER INTERNATIONAL                             12006
BRYSTOL-MYERS SQUIBB                              11005
BURLINGTON COAT FACTORY                           24003
BURLINGTON NORTHERN SANTA FE                      27005
BV SOLUTIONS                                      06012
CABLEVISION                                       26001
CABOT                                             04006
CADENCE                                           06010
CAMPBELL                                          22006
CAPITAL BLUE CROSS                                17001

Business Connector Tokens Reference File

The connector tokens reference file (bizConnectorTokens.dat) defines common values (typically conjunctions) that connect words in business names. For example, in the business name “Nursery of Venice”, “of” is a connector token. This helps the standardization engine recognize and process the full name of a business by indicating that the token connects two parts of the full name.

This file contains one column that lists the connector tokens in the business names you process. You can add entries as needed. Below is an excerpt from the connector tokens reference file.

AN
DE
DES
DOS
LA
LAS
LE
OF
THE

Business Country Key Type File

The country key type file (bizCountryTypeKeys.dat) lists countries and continents, along with their abbreviations and assigned nationalities. For continents, the abbreviation is “CON” to separate them from countries. This enables the standardization engine to recognize and process these values as countries or continents. You can add entries to the country key type file using the following syntax.

country abbreviation nationality

The following table describes the columns in the country key type file.

Table 16 Country Key Type File

Column          Description
country         The name of a country or continent.
abbreviation    The common abbreviation for the specified country. The abbreviation for a continent is always “CON”.
nationality     The nationality assigned to a person or business originating in the specified country.

Following is an excerpt from the country key type file.

AMERICA                       CON  AMERICAN
AFRICA                        CON  AFRICAN
EUROPE                        CON  EUROPEAN
ASIA                          CON  ASIAN
AFGHANISTAN                   AF   AFGHAN
ALBANIA                       AL   ALBANIAN
ALGERIA                       DZ   ALGERIAN
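
Since continents are marked with the abbreviation “CON”, a reader of this file can separate them from countries while parsing. The Python sketch below shows one way to do so. It assumes that the abbreviation and nationality are single tokens, so the name is whatever precedes the last two tokens; `parse_country_lines` is an illustrative name only.

```python
def parse_country_lines(text):
    """Return (countries, continents) as dicts keyed by name."""
    countries, continents = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumption: abbreviation and nationality are single tokens, so the
        # name is whatever precedes the last two tokens.
        name, abbreviation, nationality = line.rsplit(None, 2)
        name = name.strip()
        if abbreviation == "CON":
            continents[name] = nationality
        else:
            countries[name] = (abbreviation, nationality)
    return countries, continents


# Sample lines copied from the country key type file excerpt.
SAMPLE = """\
AMERICA                       CON  AMERICAN
EUROPE                        CON  EUROPEAN
ALBANIA                       AL   ALBANIAN
ALGERIA                       DZ   ALGERIAN
"""

countries, continents = parse_country_lines(SAMPLE)
```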

Business Industry Sector Reference File

The industry sector reference file (bizIndustryCategoryCode.dat) lists and groups various industry sectors and sub-sectors, and includes an identification code for each type so the standardization engine can identify and process the industry sectors for different businesses. You can add entries to the industry sector reference file using the following syntax.

sector-code industry-sector

The following table describes each column in the industry sector reference file.

Table 17 Industry Sector Reference File

Column             Description
sector-code        The identification code of the specified sector. The first two digits of each code identify the general industry sector; the last three digits identify a sub-sector.
industry-sector    A description of the industry category. This is written in the format “sector - sub-sector”, where sector is a general category of industry types, and sub-sector is a specific industry within that category.

Following is an excerpt from the industry sector reference file.

02006         Automotive & Transport Equipment - Recreational Vehicles
02007         Automotive & Transport Equipment - Shipbuilding & Related Services
02008         Automotive & Transport Equipment - Trucks, Buses & Other Vehicles
03001         Banking - Banking
04001         Chemicals - Agricultural Chemicals
04002         Chemicals - Basic & Intermediate Chemicals & Petrochemicals
04003         Chemicals - Diversified Chemicals
04004         Chemicals - Paints, Coatings & Other Finishing Products
04005         Chemicals - Plastics & Fibers
04006         Chemicals - Specialty Chemicals
05001         Computer Hardware - Computer Peripherals
05002         Computer Hardware - Data Storage Devices
05003         Computer Hardware - Diversified Computer Products
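
To make the structure of a sector code concrete, the following Python sketch splits one line into its code and description, then breaks the code into its two-digit sector part and three-digit sub-sector part. The dictionary keys and the function name `parse_sector_line` are assumptions chosen for this example.

```python
def parse_sector_line(line):
    """Return a dict describing one sector-code entry."""
    code, description = line.split(None, 1)
    # The description uses the form "sector - sub-sector".
    sector, sub_sector = description.strip().split(" - ", 1)
    return {
        "code": code,
        "sector_id": code[:2],       # first two digits: general sector
        "sub_sector_id": code[2:],   # last three digits: sub-sector
        "sector": sector,
        "sub_sector": sub_sector,
    }
```

Codes are kept as strings rather than integers so that leading zeros, as in `02007`, are not lost.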

Business Industry Key Type File

The industry key type file (bizIndustryTypeKeys.dat) is used to standardize the value of the Industry field into common industries to which businesses belong so the standardization engine can recognize and process the industry types for different businesses. You can add entries to the industry key type file using the following syntax.

industry-type standardized-form sectors

The following table describes each column in the industry key type file.

Table 18 Industry Key Type File

Column               Description
industry-type        The original value of the industry type in the input record.
standardized-form    The normalized version of the industry type. A zero (0) in this column indicates that the industry type is already in its standardized form. If this column contains a name instead of a zero, that name must also be listed in a different entry as an industry type with a standardized form of “0”.
sectors              The industry categories of the specified industry type. These values correspond to the sector codes listed in the industry sector file (bizIndustryCategoryCode.dat). You can list as many categories as apply for each type, but they must be entered with a space between each and no line breaks, and they must correspond to an entry in the industry sector file.

Below is an excerpt from the industry key type file.

TECH                TECHNOLOGY          05001-05007
TECHNOLOGIES        TECHNOLOGY          05001-05007
TECHNOLOGY          0                   05001-05007
TECHSYSTEMS         0                   05001-05007
TELE PHONE          TELEPHONE           16005
TELE PHONES         TELEPHONES          16005
TELEVISION          TV                  11013  21014
TELECOM             0                   16005  26006  26009  26010
TELECOMM            TELECOMMUNICATION   16005  26006  26008
TELECOMMUNICATION   0                   16005  26006  26008
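
The excerpt mixes single sector codes with entries such as `05001-05007`. The Python sketch below parses a line and expands such an entry on the assumption that it denotes an inclusive range of sector codes; that interpretation is not stated in this document and should be verified. The sketch also assumes that the first two columns are separated by two or more spaces (because an industry type such as `TELE PHONE` contains a single space) and that a zero means the industry type is already standardized. All function names are hypothetical.

```python
import re


def expand_sectors(tokens):
    """Expand sector tokens; "05001-05007" is read as an inclusive range."""
    codes = []
    for token in tokens:
        if "-" in token:
            # Assumption: a hyphenated entry is an inclusive range of codes.
            first, last = token.split("-")
            codes.extend(f"{n:05d}" for n in range(int(first), int(last) + 1))
        else:
            codes.append(token)
    return codes


def parse_industry_line(line):
    """Return (industry_type, standardized_form, sector_codes)."""
    # Assumption: columns are separated by two or more spaces, since an
    # industry type such as "TELE PHONE" contains a single space.
    fields = re.split(r"\s{2,}", line.strip())
    industry_type, standardized = fields[0], fields[1]
    if standardized == "0":
        standardized = industry_type
    # Sector codes may be separated by one or more spaces.
    sector_tokens = " ".join(fields[2:]).split()
    return industry_type, standardized, expand_sectors(sector_tokens)
```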

Business Organization Key Type File

The organization key type file (bizOrganizationTypeKeys.dat) is used to standardize the value of the Organization field into common organizations to which businesses belong. This helps the standardization engine recognize and process the organization types for different businesses. You can add entries to the organization key type file using the following syntax.

original-type standardized-form

The following table describes each column in the organization key type file.

Table 19 Organization Key Type File

Column               Description
original-type        The original value of the organization field in an input record.
standardized-form    The normalized version of an organization type. A zero (0) in this field indicates that the value in the first column is already in its standardized form. If this column contains a name instead of a zero, that name must also be listed in a different entry as an original type with a standardized form of “0”.

Below is an excerpt from the organization key type file.

INC                 INCORPORATED
INCORPORATED        0
KG                  0
KK                  0
LIMITED             0
LIMITED PARTNERSHIP 0
LLC                 0
LLP                 0
LP                  LIMITED PARTNERSHIP
LTD                 LIMITED

Business Patterns File

The business patterns file (bizpatterns.dat) defines multiple formats expected from the business name input fields along with the standardized output of each format. The patterns and output appear in two-row pairs in this file, as shown below.

4 PNT AST SEP-GLC ORT
PNT AST DEL ORT

The first line describes the input pattern and the second describes the output pattern using tokens to denote each component. The supported tokens are described in Business Name Tokens. A number at the beginning of the first line indicates the number of components in the given business name format. You can modify this file using the following syntax.

length input-pattern
output-pattern

The following table lists and describes the components in the above syntax.

Table 20 Business Patterns File Components

Component         Description
length            The number of business name components in the input field.
input-pattern     Tokens that represent a possible input pattern from the unparsed business name fields. Each token represents one component. For more information about business name tokens, see Business Name Tokens.
output-pattern    Tokens that represent the output pattern for the specified input pattern. Each token represents one component. For more information about business name tokens, see Business Name Tokens.

Below is an excerpt from the business patterns file.

4 PNT AST SEP-GLC ORT
PNT AST DEL ORT

4 NFG AJT SEP-GLC ORT
PNT PNT DEL ORT

4 NF AJT SEP-GLC ORT
PNT PNT DEL ORT

4 CST IDT NF ORT
PNT PNT PNT ORT

4 PNT AJT SEP-GLC ORT
PNT PNT DEL ORT
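
The two-row structure described above lends itself to a simple consistency check. The Python sketch below reads the rows in pairs and verifies that the leading number matches the count of input tokens. It additionally assumes that each input token corresponds to one output token by position, which holds for every pair in the excerpt but is not stated as a rule. `parse_patterns` is an illustrative helper, not part of the standardization engine.

```python
def parse_patterns(text):
    """Return a list of (input_tokens, output_tokens) pairs."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2:
        raise ValueError("patterns must come in two-line pairs")
    patterns = []
    for first, second in zip(lines[0::2], lines[1::2]):
        length, *input_tokens = first.split()
        output_tokens = second.split()
        # The leading number states how many components the input has.
        if int(length) != len(input_tokens):
            raise ValueError(f"length {length} does not match: {first!r}")
        # Assumption: each input token maps to one output token by position.
        if len(output_tokens) != len(input_tokens):
            raise ValueError(f"output length differs for: {first!r}")
        patterns.append((input_tokens, output_tokens))
    return patterns


# Sample pairs copied from the business patterns file excerpt.
SAMPLE = """\
4 PNT AST SEP-GLC ORT
PNT AST DEL ORT

4 CST IDT NF ORT
PNT PNT PNT ORT
"""

patterns = parse_patterns(SAMPLE)
```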

Business Name Tokens

The business patterns file uses tokens to denote different components in a business name, such as the primary name, alias type key, URL, and so on. The file uses one set of tokens for input fields and another set for output fields. The tokens indicate the type key files to use to determine the appropriate values for each output field. You can use only the predefined tokens to represent business name components; the standardization engine does not recognize custom tokens.

Table 21 lists and describes each input token; Table 22 lists and describes each output token.

Table 21 Business Name Input Pattern Tokens

Pattern Identifier    Description
CTT                   A connector token
PNT                   A primary name of a business
PN-PN                 A hyphenated primary name of a business
BCT                   A common business term
URL                   The URL of the business’ web site
ALT                   A business alias type key (usually an acronym)
CNT                   A country name
NAT                   A nationality
CST                   A city or state type key
IDT                   An industry type key
IDT-AJT               Both an industry and an adjective type key
AJT                   An adjective type key
AST                   An association type key
ORT                   An organization type key
SEP                   A separator key
NFG                   Generic term, not recognized as a specific business name component, with an internal hyphen
NF                    Generic term, not recognized as a specific business name component
NFC                   A single character, not recognized as a specific business name component
SEP-GLC               A joining comma (a glue type separator)
SEP-GLD               A joining hyphen (a glue type separator)
AND                   The text “and”
GLU                   A glue type key, such as a forward slash, connecting two parts of a business name component
PN-NF                 A business primary name followed by a hyphen and a generic term that is not recognized as a specific business name component
NF-PN                 A generic term that is not recognized as a specific business name component, followed by a hyphen and a recognized business primary name
NF-NF                 Two generic terms, not recognized as specific business name components and separated by a hyphen
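
Because the engine recognizes only predefined tokens, it can be useful to check a new input pattern against the identifiers in Table 21 before adding it to the business patterns file. The following Python sketch does this. `INPUT_TOKENS` is transcribed from Table 21, and `unknown_input_tokens` is a hypothetical helper that expects the pattern without its leading length value.

```python
# Input tokens listed in Table 21.
INPUT_TOKENS = frozenset(
    "CTT PNT PN-PN BCT URL ALT CNT NAT CST IDT IDT-AJT AJT AST ORT SEP "
    "NFG NF NFC SEP-GLC SEP-GLD AND GLU PN-NF NF-PN NF-NF".split()
)


def unknown_input_tokens(pattern):
    """Return the tokens in an input pattern that Table 21 does not list."""
    return [token for token in pattern.split() if token not in INPUT_TOKENS]
```

An empty result means every token in the pattern is one of the predefined input tokens.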

Table 22 lists and describes each output token.

Table 22 Business Name Output Pattern Tokens

Pattern Identifier    Description
PNT                   The primary name of the business
URL                   The URL of the business
ALT                   The alias type key of the business (usually an acronym)
IDT                   The industry type key of the business
AST                   The association type key of the business
ORT                   The organization type key of the business
NF                    A generic term not recognized as a business name component