Understanding the Sun Match Engine

The Business Patterns File (bizPatterns.dat)

The business patterns file defines multiple formats expected from the business name input fields along with the standardized output of each format. The patterns and output appear in two-row pairs in this file, as shown below.


4 PNT AST SEP-GLC ORT
PNT AST DEL ORT

The first line describes the input pattern and the second line describes the output pattern; both use tokens to denote each component. The supported tokens are described in Business Name Tokens. The number at the beginning of the first line indicates the number of components in the given business name format. You can modify this file using the following syntax.


length input-pattern
output-pattern

Table 33 lists and describes the syntax components.

Table 33 Business Patterns File Components

Component         Description
length            The number of business name components in the input field.
input-pattern     Tokens that represent a possible input pattern from the unparsed business name fields. Each token represents one component. For more information about business name tokens, see Business Name Tokens.
output-pattern    Tokens that represent the output pattern for the specified input pattern. Each token represents one component. For more information about business name tokens, see Business Name Tokens.
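
The two-row syntax described above can be read with a short script. The following Python sketch is illustrative only and is not part of the Sun Match Engine; the names BizPattern and parse_biz_patterns are hypothetical. It assumes that the file contains only pattern pairs and blank lines, and that the length value equals the number of tokens in the input pattern, as the component descriptions state.

```python
from typing import List, NamedTuple


class BizPattern(NamedTuple):
    """One two-row pair from the patterns file (hypothetical helper type)."""
    length: int
    input_tokens: List[str]
    output_tokens: List[str]


def parse_biz_patterns(text: str) -> List[BizPattern]:
    """Parse 'length input-pattern' / 'output-pattern' row pairs.

    Assumption: blank lines only separate pairs and carry no meaning.
    """
    # Keep non-blank rows only, split into whitespace-separated tokens.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of non-blank rows")

    patterns = []
    for first, second in zip(rows[0::2], rows[1::2]):
        length = int(first[0])      # leading number on the first row
        input_tokens = first[1:]    # remaining tokens are the input pattern
        if length != len(input_tokens):
            raise ValueError(
                f"length {length} does not match input pattern {input_tokens}"
            )
        patterns.append(BizPattern(length, input_tokens, second))
    return patterns
```

Calling parse_biz_patterns on the text "4 PNT AST SEP-GLC ORT" followed by "PNT AST DEL ORT" on the next line would yield one BizPattern whose length is 4. A check like the length comparison can help catch a miscounted pattern after the file is edited by hand.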

Below is an excerpt from the bizPatterns.dat file.


4 PNT AST SEP-GLC ORT
PNT AST DEL ORT

4 NFG AJT SEP-GLC ORT
PNT PNT DEL ORT

4 NF AJT SEP-GLC ORT
PNT PNT DEL ORT

4 CST IDT NF ORT
PNT PNT PNT ORT

4 PNT AJT SEP-GLC ORT
PNT PNT DEL ORT
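
To make the input-to-output relationship in this excerpt concrete, the following Python sketch stores the five pairs in a dictionary and looks up the output pattern for a given input pattern. How the Sun Match Engine selects a pattern internally is not described here; the exact-match lookup and the names EXCERPT_PATTERNS and output_pattern_for are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Input pattern -> output pattern, copied from the excerpt above.
EXCERPT_PATTERNS = {
    ("PNT", "AST", "SEP-GLC", "ORT"): ("PNT", "AST", "DEL", "ORT"),
    ("NFG", "AJT", "SEP-GLC", "ORT"): ("PNT", "PNT", "DEL", "ORT"),
    ("NF", "AJT", "SEP-GLC", "ORT"): ("PNT", "PNT", "DEL", "ORT"),
    ("CST", "IDT", "NF", "ORT"): ("PNT", "PNT", "PNT", "ORT"),
    ("PNT", "AJT", "SEP-GLC", "ORT"): ("PNT", "PNT", "DEL", "ORT"),
}


def output_pattern_for(input_tokens: Sequence[str]) -> Optional[Tuple[str, ...]]:
    """Return the output pattern for an exact input pattern, or None.

    Assumption: a pattern applies only when every input token matches
    in order; this is a simplification, not the engine's documented logic.
    """
    return EXCERPT_PATTERNS.get(tuple(input_tokens))


if __name__ == "__main__":
    print(output_pattern_for(["CST", "IDT", "NF", "ORT"]))
```

In this excerpt, every output pattern has the same number of tokens as its input pattern, so each input component has a corresponding output token in the same position.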

Business Name Tokens

The business patterns file uses tokens to denote different components in a business name, such as the primary name, alias type key, URL, and so on. The file uses one set of tokens for input fields and another set for output fields. The tokens indicate the type key files to use to determine the appropriate values for each output field. You can use only the predefined tokens to represent business name components; the Sun Match Engine does not recognize custom tokens.

Table 34 lists and describes each input token; Table 35 lists and describes each output token.

Table 34 Business Name Input Pattern Tokens

Pattern Identifier  Description
CTT                 A connector token
PNT                 A primary name of a business
PN-PN               A hyphenated primary name of a business
BCT                 A common business term
URL                 The URL of the business’ web site
ALT                 A business alias type key (usually an acronym)
CNT                 A country name
NAT                 A nationality
CST                 A city or state type key
IDT                 An industry type key
IDT-AJT             Both an industry and an adjective type key
AJT                 An adjective type key
AST                 An association type key
ORT                 An organization type key
SEP                 A separator key
NFG                 Generic term, not recognized as a specific business name component, with an internal hyphen
NF                  Generic term, not recognized as a specific business name component
NFC                 A single character, not recognized as a specific business name component
SEP-GLC             A joining comma (a glue type separator)
SEP-GLD             A joining hyphen (a glue type separator)
AND                 The text “and”
GLU                 A glue type key, such as a forward slash, connecting two parts of a business name component
PN-NF               A business primary name followed by a hyphen and a generic term that is not recognized as a specific business name component
NF-PN               A generic term that is not recognized as a specific business name component, followed by a hyphen and a recognized business primary name
NF-NF               Two generic terms, not recognized as specific business name components and separated by a hyphen

Table 35 lists and describes each output token.

Table 35 Business Name Output Pattern Tokens

Pattern Identifier  Description
PNT                 The primary name of the business
URL                 The URL of the business
ALT                 The alias type key of the business (usually an acronym)
IDT                 The industry type key of the business
AST                 The association type key of the business
ORT                 The organization type key of the business
NF                  A generic term not recognized as a business name component
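
Because only predefined tokens are recognized, it can be useful to check an edited pattern against the token lists before deploying the file. The Python sketch below is illustrative only; the names INPUT_TOKENS, OUTPUT_TOKENS, and unknown_tokens are hypothetical, and the sets are copied from Tables 34 and 35 as they appear here. Note that the DEL token used in the output rows of the bizPatterns.dat excerpt is not listed in Table 35, so this check reports it; the tables may not be exhaustive.

```python
from typing import Iterable, List, Set

# Tokens listed in Table 34 (input patterns).
INPUT_TOKENS: Set[str] = {
    "CTT", "PNT", "PN-PN", "BCT", "URL", "ALT", "CNT", "NAT", "CST",
    "IDT", "IDT-AJT", "AJT", "AST", "ORT", "SEP", "NFG", "NF", "NFC",
    "SEP-GLC", "SEP-GLD", "AND", "GLU", "PN-NF", "NF-PN", "NF-NF",
}

# Tokens listed in Table 35 (output patterns).
OUTPUT_TOKENS: Set[str] = {"PNT", "URL", "ALT", "IDT", "AST", "ORT", "NF"}


def unknown_tokens(tokens: Iterable[str], allowed: Set[str]) -> List[str]:
    """Return the tokens that are not in the allowed set, in original order."""
    return [token for token in tokens if token not in allowed]


if __name__ == "__main__":
    # First pair from the bizPatterns.dat excerpt.
    print(unknown_tokens("PNT AST SEP-GLC ORT".split(), INPUT_TOKENS))
    print(unknown_tokens("PNT AST DEL ORT".split(), OUTPUT_TOKENS))
```

Treat a reported token as a prompt to verify the pattern against the product's shipped files rather than as proof of an error.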