Several configuration files are used to define business name processing logic for the Sun Match Engine. You can customize any of the configuration files described in this section to fit your data processing and standardization requirements. These files appear under the Standardization Engine node of the master index project.
The following topics describe each file used for business name standardization:
The General Terms Reference File (bizBusinessGeneralTerms.dat)
The City or State Key Type File (bizCityorStateTypeKeys.dat)
The Business Former Name Reference File (bizCompanyFormerNames.dat)
The Merged Business Name Category File (bizCompanyMergerNames.dat)
The Primary Business Name Reference File (bizCompanyPrimaryNames.dat)
The Connector Tokens Reference File (bizConnectorTokens.dat)
The Industry Sector Reference File (bizIndustryCategoryCode.dat)
The Organization Key Type File (bizOrganizationTypeKeys.dat)
The Special Characters Reference File (bizRemoveSpecChars.dat)
The business constants file defines certain information about the standardization files used for processing business data, primarily the number of lines contained in each file. The number of lines specified must be equal to or greater than the number of lines actually contained in each file.
Table 22 lists and describes each parameter in the constants file. The files referenced by these parameters are described on the following pages.
Table 22 Business Constants File Parameters
| Parameter | Description |
|---|---|
| | The maximum number of lines in the city or state key type file (bizCityorStateTypeKeys.dat). |
| | The maximum number of lines in the primary business names reference file (bizCompanyPrimaryNames.dat). |
| | The maximum number of lines in the country key type file (bizCountryTypeKeys.dat). |
| | The maximum number of lines in the industry key type file (bizIndustryTypeKeys.dat). |
| | The maximum number of lines in the business patterns file (bizPatterns.dat). |
| | The maximum number of lines in the merged business name category file (bizCompanyMergerNames.dat). |
| | The maximum number of lines in the adjective key type file (bizAdjectiveTypeKeys.dat). |
| | The maximum number of lines in the organization key type file (bizOrganizationTypeKeys.dat). |
| | The maximum number of lines in the association key type file (bizAssociationTypeKeys.dat). |
| | The maximum number of lines in the general terms reference file (bizBusinessGeneralTerms.dat). |
| | The maximum number of lines in the special characters reference file (bizRemoveSpecChars.dat). |
| | The maximum number of tokens allowed in the input business name. If no value is defined for this parameter, the default is the value set for the words parameter in the personConstants.cfg file. |
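The rule that each declared maximum must be at least the actual line count can be checked before deployment. The following Python sketch is illustrative only: the `limits` mapping from file name to declared maximum is a hypothetical stand-in, because the layout of the constants file itself is not shown in this section.

```python
from pathlib import Path
import tempfile


def count_lines(path: Path) -> int:
    """Count the lines in a reference file."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def find_undersized_limits(limits: dict[str, int], directory: Path) -> list[str]:
    """Return a message for each file whose line count exceeds its declared maximum."""
    problems = []
    for filename, declared_max in limits.items():
        actual = count_lines(directory / filename)
        if declared_max < actual:
            problems.append(f"{filename}: declared {declared_max}, actual {actual}")
    return problems


# Demonstration with a throwaway four-line file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "bizConnectorTokens.dat").write_text("AN\nDE\nOF\nTHE\n", encoding="utf-8")
    ok = find_undersized_limits({"bizConnectorTokens.dat": 10}, folder)
    bad = find_undersized_limits({"bizConnectorTokens.dat": 3}, folder)
```

Here `ok` is empty because the declared maximum of 10 covers the four lines, while `bad` reports the file whose declared maximum of 3 is too small.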
The adjectives key type file defines adjectives commonly found in business names so the Sun Match Engine can recognize and process these values as a part of the business name. This file contains one column with a list of commonly used adjectives, such as General, Financial, Central, and so on.
You can modify or add entries in this file as needed. Following is an excerpt from the bizAdjectivesTypeKeys.dat file.
    DIGITAL
    DIRECTED
    DIVERSIFIED
    EDUCATIONAL
    ELECTROCHEMICAL
    ENGINEERED
    EVOLUTIONARY
    EXTENDED
    FACTUAL
    FEDERAL
The alias key type file lists business name acronyms and abbreviations along with their standardized names so the Sun Match Engine can recognize and process these values correctly. You can add entries to the alias key type file using the following syntax.
alias standardized-name
Table 23 describes the columns in the bizAliasTypeKeys.dat file.
Table 23 Alias Key Type File
| Column | Description |
|---|---|
| alias | An abbreviation or acronym commonly used in place of a specific business name. |
| standardized-name | The normalized version of the alias name. |
Following is an excerpt from the bizAliasTypeKeys.dat file.
    BBH       BARTLE BOGLE HEGARTY
    BBH       BROWN BROTHERS HARRIMAN
    IBM       INTERNATIONAL BUSINESS MACHINE
    IDS       INCOMES DATA SERVICES
    IDS       INSURANCE DATA SERVICES
    IDS       THE INTEGRATED DECISION SUPPORT GROUP
    IDS       THE INTERNET DATABASE SERVICE
    CAL-TECH  CALIFORNIA INSTITUTE OF TECHNOLOGY
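The excerpt shows that one alias (for example BBH or IDS) can map to several standardized names. A minimal Python sketch of loading such a file follows. It assumes the alias is the first whitespace-delimited token on each line and that the rest of the line is the standardized name; the function name `load_aliases` is hypothetical, and how the Sun Match Engine chooses among multiple candidates is not described here.

```python
from collections import defaultdict


def load_aliases(lines):
    """Map each alias to every standardized name listed for it."""
    aliases = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)  # alias, then the remainder of the line
        if len(parts) == 2:
            aliases[parts[0]].append(parts[1].strip())
    return dict(aliases)


excerpt = [
    "BBH BARTLE BOGLE HEGARTY",
    "BBH BROWN BROTHERS HARRIMAN",
    "IBM INTERNATIONAL BUSINESS MACHINE",
    "CAL-TECH CALIFORNIA INSTITUTE OF TECHNOLOGY",
]
aliases = load_aliases(excerpt)
```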
The association key type file lists business association types along with their standardized names so the Sun Match Engine can recognize and process these values correctly. You can add entries to the association key type file using the following syntax.
association-type standardized-type
Table 24 describes the columns in the bizAssociationTypeKeys.dat file.
Table 24 Association Key Type File
| Column | Description |
|---|---|
| association-type | A common association type for businesses, such as Partners, Group, and so on. |
| standardized-type | The standardized version of the association type. A zero (0) indicates that the value in the first column is already in its standardized form. If this column contains a name instead of a zero, that name must also be listed in a different entry as an association type with a standardized form of “0”. |
Following is an excerpt from the bizAssociationTypeKeys.dat file.
    ASSOCIATES      0
    BANCORP         0
    BANCORPORATION  BANCORP
    COMPANIES       0
    GP              GROUP
    GROUP           0
    PARTNERS        0
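The requirement that every non-zero standardized type also appear as its own entry with a standardized form of “0” is easy to break when adding entries by hand. The following Python sketch checks it. It assumes each column holds a single word, as in the excerpt above (multi-word values would need a different column split), and `find_unanchored_forms` is a hypothetical helper name.

```python
def find_unanchored_forms(lines):
    """Return standardized forms that have no entry of their own marked with 0."""
    entries = [tuple(line.split()) for line in lines if line.strip()]
    already_standard = {name for name, form in entries if form == "0"}
    return sorted(
        {form for _, form in entries if form != "0" and form not in already_standard}
    )


excerpt = [
    "ASSOCIATES 0",
    "BANCORP 0",
    "BANCORPORATION BANCORP",
    "COMPANIES 0",
    "GP GROUP",
    "GROUP 0",
    "PARTNERS 0",
]
missing_in_excerpt = find_unanchored_forms(excerpt)
missing_in_broken = find_unanchored_forms(["GP GROUP", "PARTNERS 0"])
```

The excerpt passes the check, while the second call reports GROUP because no `GROUP 0` entry accompanies `GP GROUP`.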
The general terms reference file lists terms commonly used in business names. This file is used to identify terms that indicate a business, such as bank, supply, factory, and so on, so the Sun Match Engine can recognize and process the business name.
This file contains one column that lists common terms in the business names you process. You can add entries as needed. Below is an excerpt from the bizBusinessGeneralTerms.dat file.
    BUILDING
    CITY
    CONSUMER
    EAST
    EYE
    FACTORY
    LATIN
    NORTH
    SOUTH
The city or state key type file lists various cities and states that might be used in business names. It also classifies each entry as a city (CT) or state (ST) and indicates the country in which the city or state is located. This enables the Sun Match Engine to recognize and process these values correctly. You can add entries to the city or state key type file using the following syntax.
city-or-state type country
Table 25 describes the columns in the bizCityorStateTypeKeys.dat file.
Table 25 City or State Key Type File
| Column | Description |
|---|---|
| city-or-state | The name of a city or state used in business names. |
| type | An indicator of whether the value is a city or state. “CT” indicates city and “ST” indicates state. |
| country | The country code of the country in which the city or state is located. |
Following is an excerpt from the bizCityorStateTypeKeys.dat file.
    ADELAIDE   CT  AU
    ALABAMA    ST  US
    ALASKA     ST  US
    ALGIERS    CT  DZ
    AMSTERDAM  CT  NL
    ARIZONA    ST  US
    ARKANSAS   ST  US
    ASUNCION   CT  PY
    ATHENS     CT  GR
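As a rough illustration of the three-column layout, the Python sketch below parses one line into a record and rejects any type other than CT or ST. Splitting from the right is an assumption made so that multi-word names stay intact; `Place` and `parse_place` are hypothetical names, and the `NEW YORK` line in the test data is an invented example rather than an entry taken from the file.

```python
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    kind: str     # "CT" for a city, "ST" for a state
    country: str  # country code


def parse_place(line):
    """Parse 'city-or-state type country', keeping multi-word names whole."""
    name, kind, country = line.rsplit(None, 2)
    if kind not in ("CT", "ST"):
        raise ValueError(f"unknown type {kind!r} in line {line!r}")
    return Place(name, kind, country)


adelaide = parse_place("ADELAIDE CT AU")
new_york = parse_place("NEW YORK ST US")  # invented multi-word example
```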
The business former name reference file provides a list of common company names along with the names by which the companies were formerly known, so the Sun Match Engine can recognize a business when processing a record that contains a previous business name. You can add entries to the business former name file using the following syntax.
former-name current-name
Table 26 describes each column in the bizCompanyFormerNames.dat file.
Table 26 Business Former Name Reference File
| Column | Description |
|---|---|
| former-name | One of the company’s previous names. |
| current-name | The company’s current name. |
Below is an excerpt from the bizCompanyFormerNames.dat file.
    HELLENIC BOTTLING       COCA-COLA HBC
    INTERNATIONAL PRODUCTS  THE TERLATO WINE
    ORGANIC FOOD PRODUCTS   SPECTRUM ORGANIC PRODUCTS
    SUTTER HOME WINERY      TRINCHERO FAMILY ESTATES
The merged business name category file provides a list of companies whose name changed because of a merger along with the name of the company after the merger. It also classifies the business names into industry sectors and sub-sectors. This enables the Sun Match Engine to recognize the current company name and determine the sector of the business. You can add entries to the business merger name file using the following syntax.
former-name/merged-name sector-code
Table 27 describes each column in the bizCompanyMergerNames.dat file.
Table 27 Business Merger Name Category File
| Column | Description |
|---|---|
| former-name | The name of the company whose name was not kept after the merger. |
| merged-name | The name of the company whose name was kept after the merger. |
| sector-code | The industry sector code of the business. Sector codes are listed in the bizIndustryCategoryCode.dat file. |
Below is an excerpt from the bizCompanyMergerNames.dat file.
    DUKE/FLUOR DANIEL                20005
    FAULTLESS STARCH/BON AMI         09004
    FIND/SVP                         10013
    FIRST WAVE/NEWPARK SHIPBUILDING  27005
    GUNDLE/SLT                       19020
    HMG/COURTLAND                    23004
    J BROWN/LMC                      10014
    KORN/FERRY                       10020
    LINSCO/PRIVATE LEDGER            14005
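To illustrate the `former-name/merged-name sector-code` syntax, the Python sketch below splits one line into its three parts. It assumes the sector code is the last whitespace-delimited token and that the first forward slash separates the two names; `parse_merger_line` is a hypothetical helper name.

```python
def parse_merger_line(line):
    """Split 'former-name/merged-name sector-code' into its three parts."""
    names, sector_code = line.rsplit(None, 1)  # the code is the last token
    former, merged = names.split("/", 1)       # the first slash separates the names
    return former.strip(), merged.strip(), sector_code


first_wave = parse_merger_line("FIRST WAVE/NEWPARK SHIPBUILDING 27005")
korn_ferry = parse_merger_line("KORN/FERRY 10020")
```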
The primary business name reference file provides a list of companies by their primary name. It also classifies the business names into industry sectors and sub-sectors. This enables the Sun Match Engine to determine the correct value of the sector field when parsing the business name. You can add entries to the primary business name file using the following syntax.
primary-name sector-code
Table 28 describes the columns in the bizCompanyPrimaryNames.dat file.
Table 28 Business Primary Name Reference File
| Column | Description |
|---|---|
| primary-name | The primary name of the company. |
| sector-code | The industry sector code of the business. Sector codes are listed in the bizIndustryCategoryCode.dat file. |
Below is an excerpt from the bizCompanyPrimaryNames.dat file.
    BROTHER INTERNATIONAL         12006
    BRYSTOL-MYERS SQUIBB          11005
    BURLINGTON COAT FACTORY       24003
    BURLINGTON NORTHERN SANTA FE  27005
    BV SOLUTIONS                  06012
    CABLEVISION                   26001
    CABOT                         04006
    CADENCE                       06010
    CAMPBELL                      22006
    CAPITAL BLUE CROSS            17001
The connector tokens reference file defines common values (typically conjunctions) that connect words in business names. For example, in the business name “Nursery of Venice”, “of” is a connector token. This helps the Sun Match Engine recognize and process the full name of a business by indicating that the token connects two parts of the full name.
This file contains one column that lists the connector tokens in the business names you process. You can add entries as needed. Below is an excerpt from the bizConnectorTokens.dat file.
    AN
    DE
    DES
    DOS
    LA
    LAS
    LE
    OF
    THE
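The “Nursery of Venice” example can be sketched in Python as a simple tagging pass over the tokens of a name. The case-insensitive comparison and the function name `tag_connectors` are assumptions made for illustration; the engine's actual tokenization is not described in this section.

```python
def tag_connectors(name, connectors):
    """Pair each token with True when it is a known connector token."""
    known = {token.upper() for token in connectors}
    return [(token, token.upper() in known) for token in name.split()]


connectors = ["AN", "DE", "DES", "DOS", "LA", "LAS", "LE", "OF", "THE"]
tagged = tag_connectors("Nursery of Venice", connectors)
```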
The country key type file lists countries and continents, along with their abbreviations and assigned nationalities. For continents, the abbreviation is “CON” to separate them from countries. This enables the Sun Match Engine to recognize and process these values as countries or continents. You can add entries to the country key type file using the following syntax.
country abbreviation nationality
Table 29 describes the columns in the bizCountryTypeKeys.dat file.
Table 29 Country Key Type File
| Column | Description |
|---|---|
| country | The name of a country or continent. |
| abbreviation | The common abbreviation for the specified country. The abbreviation for a continent is always “CON”. |
| nationality | The nationality assigned to a person or business originating in the specified country. |
Following is an excerpt from the bizCountryTypeKeys.dat file.
    AMERICA      CON  AMERICAN
    AFRICA       CON  AFRICAN
    EUROPE       CON  EUROPEAN
    ASIA         CON  ASIAN
    AFGHANISTAN  AF   AFGHAN
    ALBANIA      AL   ALBANIAN
    ALGERIA      DZ   ALGERIAN
The industry sector reference file lists and groups various industry sectors and sub-sectors, and includes an identification code for each type so the Sun Match Engine can identify and process the industry sectors for different businesses. You can add entries to the industry sector reference file using the following syntax.
sector-code industry-sector
Table 30 describes each column in the bizIndustryCategoryCode.dat file.
Table 30 Industry Sector Reference File
| Column | Description |
|---|---|
| sector-code | The identification code of the specified sector. The first two digits of each code identify the general industry sector; the last three digits identify a sub-sector. |
| industry-sector | A description of the industry category. This is written in the format “sector - sub-sector”, where sector is a general category of industry types, and sub-sector is a specific industry within that category. |
Following is an excerpt from the bizIndustryCategoryCode.dat file.
    02006 Automotive & Transport Equipment - Recreational Vehicles
    02007 Automotive & Transport Equipment - Shipbuilding & Related Services
    02008 Automotive & Transport Equipment - Trucks, Buses & Other Vehicles
    03001 Banking - Banking
    04001 Chemicals - Agricultural Chemicals
    04002 Chemicals - Basic & Intermediate Chemicals & Petrochemicals
    04003 Chemicals - Diversified Chemicals
    04004 Chemicals - Paints, Coatings & Other Finishing Products
    04005 Chemicals - Plastics & Fibers
    04006 Chemicals - Specialty Chemicals
    05001 Computer Hardware - Computer Peripherals
    05002 Computer Hardware - Data Storage Devices
    05003 Computer Hardware - Diversified Computer Products
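The structure of a sector code (two digits for the sector, three for the sub-sector) and of the “sector - sub-sector” description can be shown with a short Python sketch. The function name and the dictionary keys are hypothetical; only the five-digit layout and the description format come from the table above.

```python
def parse_sector_line(line):
    """Split a sector line into its code parts and its two description parts."""
    code, description = line.split(None, 1)
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected a five-digit sector code, got {code!r}")
    sector, _, sub_sector = description.partition(" - ")
    return {
        "code": code,
        "sector_id": code[:2],       # first two digits: general sector
        "sub_sector_id": code[2:],   # last three digits: sub-sector
        "sector": sector.strip(),
        "sub_sector": sub_sector.strip(),
    }


paints = parse_sector_line("04004 Chemicals - Paints, Coatings & Other Finishing Products")
```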
The industry key type file is used to standardize the value of the Industry field into common industries to which businesses belong so the Sun Match Engine can recognize and process the industry types for different businesses. You can add entries to the industry key type file using the following syntax.
industry-type standardized-form sectors
Table 31 describes each column in the bizIndustryTypeKeys.dat file.
Table 31 Industry Key Type File
| Column | Description |
|---|---|
| industry-type | The original value of the industry type in the input record. |
| standardized-form | The normalized version of the industry type. If this column contains a name instead of a zero, that name must also be listed in a different entry as an industry type with a standardized form of “0”. |
| sectors | The industry categories of the specified industry type. These values correspond to the sector codes listed in the industry sector file (bizIndustryCategoryCode.dat). You can list as many categories as apply for each type, but they must be entered with a space between each and no line breaks, and they must correspond to an entry in the industry sector file. |
Below is an excerpt from the bizIndustryTypeKeys.dat file.
    TECH               TECHNOLOGY         05001-05007
    TECHNOLOGIES       TECHNOLOGY         05001-05007
    TECHNOLOGY         0                  05001-05007
    TECHSYSTEMS        0                  05001-05007
    TELE PHONE         TELEPHONE          16005
    TELE PHONES        TELEPHONES         16005
    TELEVISION         TV                 11013 21014
    TELECOM            0                  16005 26006 26009 26010
    TELECOMM           TELECOMMUNICATION  16005 26006 26008
    TELECOMMUNICATION  0                  16005 26006 26008
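The excerpt contains hyphenated values such as `05001-05007` in the sectors column in addition to space-separated codes. The Python sketch below treats such a value as an inclusive range of sector codes. That reading is an assumption, since the column description mentions only space-separated codes, and `expand_sectors` is a hypothetical helper name.

```python
def expand_sectors(tokens):
    """Expand sector tokens, reading 'first-last' as an inclusive range (assumed)."""
    codes = []
    for token in tokens:
        if "-" in token:
            first, last = token.split("-")
            # Keep the five-digit, zero-padded form used in the sector file.
            codes.extend(f"{number:05d}" for number in range(int(first), int(last) + 1))
        else:
            codes.append(token)
    return codes


technology = expand_sectors(["05001-05007"])
telecomm = expand_sectors(["16005", "26006", "26008"])
```

The expanded list can then be compared against the codes in the industry sector file to confirm that every listed category has a matching entry.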
The organization key type file is used to standardize the value of the Organization field into common organizations to which businesses belong. This helps the Sun Match Engine recognize and process the organization types for different businesses. You can add entries to the organization key type file using the following syntax.
original-type standardized-form
Table 32 describes each column in the bizOrganizationTypeKeys.dat file.
Table 32 Organization Key Type File
| Column | Description |
|---|---|
| original-type | The original value of the organization field in an input record. |
| standardized-form | The normalized version of an organization type. A zero (0) in this field indicates that the value in the first column is already in its standardized form. If this column contains a name instead of a zero, that name must also be listed in a different entry as an original type with a standardized form of “0”. |
Below is an excerpt from the bizOrganizationTypeKeys.dat file.
    INC                  INCORPORATED
    INCORPORATED         0
    KG                   0
    KK                   0
    LIMITED              0
    LIMITED PARTNERSHIP  0
    LLC                  0
    LLP                  0
    LP                   LIMITED PARTNERSHIP
    LTD                  LIMITED
The business patterns file defines multiple formats expected from the business name input fields along with the standardized output of each format. The patterns and output appear in two-row pairs in this file, as shown below.
    4 PNT AST SEP-GLC ORT
    PNT AST DEL ORT
The first line describes the input pattern and the second describes the output pattern using tokens to denote each component. The supported tokens are described in Business Name Tokens. A number at the beginning of the first line indicates the number of components in the given business name format. You can modify this file using the following syntax.
length input-pattern
output-pattern
Table 33 lists and describes the syntax components.
Table 33 Business Patterns File Components
| Component | Description |
|---|---|
| length | The number of business name components in the input field. |
| input-pattern | Tokens that represent a possible input pattern from the unparsed business name fields. Each token represents one component. For more information about business name tokens, see Business Name Tokens. |
| output-pattern | Tokens that represent the output pattern for the specified input pattern. Each token represents one component. For more information about business name tokens, see Business Name Tokens. |
Below is an excerpt from the bizPatterns.dat file.
    4 PNT AST SEP-GLC ORT
    PNT AST DEL ORT
    4 NFG AJT SEP-GLC ORT
    PNT PNT DEL ORT
    4 NF AJT SEP-GLC ORT
    PNT PNT DEL ORT
    4 CST IDT NF ORT
    PNT PNT PNT ORT
    4 PNT AJT SEP-GLC ORT
    PNT PNT DEL ORT
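The two-row structure of the patterns file lends itself to a simple consistency check: the leading number on each input row should equal the number of tokens that follow it. The Python sketch below parses the rows into a lookup from input pattern to output pattern and applies that check. The function name `parse_patterns` and the choice of a dictionary are illustrative assumptions, not a description of how the engine loads the file.

```python
def parse_patterns(lines):
    """Parse two-row pattern pairs into {input tokens: output tokens}."""
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) % 2:
        raise ValueError("patterns must appear in input/output pairs")
    patterns = {}
    for input_row, output_row in zip(rows[0::2], rows[1::2]):
        length, tokens = int(input_row[0]), input_row[1:]
        if length != len(tokens):
            raise ValueError(f"declared {length} components but found {len(tokens)}")
        patterns[tuple(tokens)] = tuple(output_row)
    return patterns


excerpt = [
    "4 PNT AST SEP-GLC ORT", "PNT AST DEL ORT",
    "4 NFG AJT SEP-GLC ORT", "PNT PNT DEL ORT",
    "4 NF AJT SEP-GLC ORT", "PNT PNT DEL ORT",
    "4 CST IDT NF ORT", "PNT PNT PNT ORT",
    "4 PNT AJT SEP-GLC ORT", "PNT PNT DEL ORT",
]
patterns = parse_patterns(excerpt)
```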
The business patterns file uses tokens to denote different components in a business name, such as the primary name, alias type key, URL, and so on. The file uses one set of tokens for input fields and another set for output fields. The tokens indicate the type key files to use to determine the appropriate values for each output field. You can use only the predefined tokens to represent business name components; the Sun Match Engine does not recognize custom tokens.
Table 34 lists and describes each input token; Table 35 lists and describes each output token.
Table 34 Business Name Input Pattern Tokens
| Pattern Identifier | Description |
|---|---|
| | A connector token |
| | A primary name of a business |
| | A hyphenated primary name of a business |
| | A common business term |
| | The URL of the business’ web site |
| | A business alias type key (usually an acronym) |
| | A country name |
| | A nationality |
| | A city or state type key |
| | An industry type key |
| | Both an industry and an adjective type key |
| | An adjective type key |
| | An association type key |
| | An organization type key |
| | A separator key |
| | Generic term, not recognized as a specific business name component, with an internal hyphen |
| | Generic term, not recognized as a specific business name component |
| | A single character, not recognized as a specific business name component |
| | A joining comma (a glue type separator) |
| | A joining hyphen (a glue type separator) |
| | The text “and” |
| | A glue type key, such as a forward slash, connecting two parts of a business name component |
| | A business primary name followed by a hyphen and a generic term that is not recognized as a specific business name component |
| | A generic term that is not recognized as a specific business name component, followed by a hyphen and a recognized business primary name |
| | Two generic terms, not recognized as specific business name components and separated by a hyphen |
Table 35 lists and describes each output token.
Table 35 Business Name Output Pattern Tokens
| Pattern Identifier | Description |
|---|---|
| | The primary name of the business |
| | The URL of the business |
| | The alias type key of the business (usually an acronym) |
| | The industry type key of the business |
| | The association type key of the business |
| | The organization type key of the business |
| | A generic term not recognized as a business name component |
The special characters reference file lists characters that should be removed from a business name before the field is processed. These are typically punctuation marks such as exclamation points, parentheses, and so on. Removing them enables the Sun Match Engine to recognize the business name.
This file contains one column that lists the characters to be removed from the business names you process. You can add entries as needed. Below is an excerpt from the bizRemoveSpecChars.dat file.
    [
    ]
    {
    }
    <
    >
    /
    ?
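The effect of this file can be approximated in Python with a translation table that deletes every listed character. Whether the engine deletes the characters outright or replaces them with spaces is not stated here; this sketch assumes deletion followed by whitespace normalization, and `remove_special_chars` is a hypothetical helper name.

```python
def remove_special_chars(name, special_chars):
    """Delete the listed characters, then collapse runs of whitespace."""
    table = str.maketrans("", "", "".join(special_chars))
    return " ".join(name.translate(table).split())


special_chars = ["[", "]", "{", "}", "<", ">", "/", "?"]
cleaned = remove_special_chars("ACME [HOLDINGS] <USA>?", special_chars)
```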