Oracle Java CAPS Master Index Standardization Engine Reference
Three configuration files define address processing logic for the Master Index Standardization Engine. These files provide information about address patterns and tokens to help the standardization engine determine how to recognize address components and break them out into their respective tokens. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for address data.
The address configuration files are located in the resource folder under each variant name for the Address data type. The following topics provide information about each configuration file.
The address clues file (clues.dat) lists common terms in street addresses, specifies a normalized value for each common term, and categorizes the terms into street address component types. A term can be categorized into multiple component types. A relevance value specifies which of the component types the term is most likely to be. For example, the term “Junction” is standardized as “Jct” and is classified as a street type, building unit, and generic term (giving relevance in that order).
This file helps the Master Index Standardization Engine recognize common terms in street addresses in order to parse and normalize the values correctly. The syntax of this file is:
common-term normalized-term ID-number/type-token
You can modify or add entries in this table as needed. The following table describes the columns in the address clues file.
Table 4 Address Clues File Columns
Following is an excerpt from the US address clues file.
TRLR VLG    Trpk    59BU
TRPK        Trpk    59BU
TRPRK       Trpk    59BU
VILLA       Vlla    305TY 60BU
VLLA        Vlla    305TY 60BU
VILLAS      Vlla    60BU
VILL        Vlg     317TY 61BU 364AU
VILLAG      Vlg     317TY 61BU 364AU
VLG         Vlg     317TY 61BU 364AU
VILLAGE     Vlg     317TY 61BU 364AU
VILLG       Vlg     317TY 61BU 364AU
VILLIAGE    Vlg     317TY 61BU 364AU
VLGE        Vlg     317TY 61BU 364AU
VIVI        Vivi    62BU
VIVIENDA    Vivi    62BU
COLLEGE     Coll    64BU 0AU
CLG         Coll    64BU
COTTAGE     Cott    65BU 65BP 0AU
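As an illustration of the clues file syntax, the following Python sketch splits one entry into its common term, normalized term, and ID-number/type-token pairs. It is not part of the product. The names `parse_clue_line` and `ClueEntry` are hypothetical, and the sketch assumes that fields are separated by whitespace, that each type token is two characters long, and that the normalized term is a single word, which holds for the excerpt above but is not guaranteed for every entry.

```python
import re
from typing import List, NamedTuple, Tuple

# Assumption: an ID-number/type-token pair is digits followed by a
# two-character token, as in "317TY" or "0AU".
PAIR_RE = re.compile(r"^(\d+)([A-Z][A-Z0-9])$")


class ClueEntry(NamedTuple):
    common_term: str
    normalized_term: str
    types: List[Tuple[int, str]]  # (ID number, type token), in listed order


def parse_clue_line(line: str) -> ClueEntry:
    """Parse one clues entry (hypothetical helper, not the engine's parser)."""
    fields = line.split()
    pairs: List[Tuple[int, str]] = []
    # Peel ID/type pairs off the right-hand end of the line.
    while fields and PAIR_RE.match(fields[-1]):
        match = PAIR_RE.match(fields.pop())
        pairs.append((int(match.group(1)), match.group(2)))
    pairs.reverse()  # restore the order in which the pairs were listed
    if not pairs or len(fields) < 2:
        raise ValueError(f"not a clues entry: {line!r}")
    # Assumption: the normalized term is one word; the rest is the common term.
    normalized = fields.pop()
    return ClueEntry(" ".join(fields), normalized, pairs)


entry = parse_clue_line("VILL Vlg 317TY 61BU 364AU")
print(entry.common_term, entry.normalized_term, entry.types)
```

Keeping the pairs in listed order preserves the relevance ordering described earlier for terms that belong to more than one component type.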
The address master clues file (masterClues.dat) lists common terms in street addresses as defined by the United States Postal Service (USPS), the United Kingdom’s Royal Mail, the Australian Postal Corporation, or France’s La Poste (depending on the variant in use). For each common term, this file specifies a normalized value, defines postal information, and categorizes the terms into street address component types. A term can be categorized into multiple component types.
The syntax of this file is:
ID-number common-term normalized-term short-abbrev postal-abbrev CFCCS type-token usage-flag postal-flag
You can modify or add entries in this table as needed. The following table describes the columns in the address master clues file.
Table 5 Address Master Clue File Columns
Following is an excerpt from the US address master clues file.
11Alley Alley Al Aly A TY R U
12Alternate Route Alt Rte Alt Alt A TY R
15Arcade Arcade Arc Arc A TY R U
16Arroyo Arroyo Arryo ArryHA TY R
17Autopista Atpta Apta AptaA TY R
18Avenida Avenida Ava Ava A TY R
19Avenue Avenue Ave Ave A TY R U
26Boulevard Blvd Blvd BlvdA TY R U
32Bulevar Blvr Blv Blv A TY R
33Business Route Bus Rte BusRt BsRtA TY R
34Bypass Bypass Byp Byp A TY R U
36Calle Calle Calle ClleA TY R
37Calleja Calleja Cja Cja A TY R
38Callejon Callej Cjon CjonA TY R
39Camino Camino Cam Cam A TY R
47Carretera Carrt Carr CarrA TY R
48Causeway Cswy Cswy CswyAH TY R U
51Center Center Ctr Ctr DA TY R U
The address patterns file (patterns.dat) defines the expected input patterns of each individual street address field being standardized so the Master Index Standardization Engine can recognize and process these values. Tokens indicate the type of address component in the input and output fields. This file contains two rows for each pattern. The first row defines the input pattern for each address field and provides an example. The second row defines the output pattern for each address field, the pattern type, the relative importance of the pattern compared to other patterns, and usage flags. Below is an example.
AU A1 TY    01 Oak B Street
NA NA ST    T* 75 TX
When an address is parsed, the lines of the address are delimited by a pipe (|) and each line is sent to the parser separately. The output tokens for each line are then concatenated, and the engine checks whether the resulting output pattern is listed in the address patterns file. If the pattern is found, the output pattern is modified as indicated in the patterns file to resolve any ambiguities that arise when two lines of address information contain common elements. When the format of an input field matches more than one pattern, the relative importance determines which pattern to use. This file should be modified only by personnel with a thorough understanding of address patterns and tokens.
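The flow described above (split on the pipe, parse each line separately, concatenate the output tokens, then look up the combined pattern) can be sketched in Python as follows. This is only an illustration under stated assumptions: `concatenate_output_tokens`, `resolve_pattern`, and the toy per-line parser are hypothetical names, the real per-line parser is replaced by a lookup table, and the rewrite table stands in for whatever the patterns file specifies.

```python
from typing import Callable, Dict, List


def concatenate_output_tokens(
    address: str, parse_line: Callable[[str], List[str]]
) -> str:
    """Split on '|', parse each line separately, and join the output tokens."""
    tokens: List[str] = []
    for line in address.split("|"):
        line = line.strip()
        if line:
            tokens.extend(parse_line(line))
    return " ".join(tokens)


def resolve_pattern(pattern: str, rewrites: Dict[str, str]) -> str:
    """Return the rewritten pattern if it is listed, else the pattern unchanged."""
    return rewrites.get(pattern, pattern)


# Toy stand-in for the real per-line parser (hypothetical). The token
# sequences are copied from the patterns file examples in this section.
TOY_LINES = {
    "123 South Avenida B Oak": ["HN", "PD", "PT", "NA", "NA"],
    "Oak B Street": ["NA", "NA", "ST"],
}


def toy_parse_line(line: str) -> List[str]:
    return TOY_LINES[line]


combined = concatenate_output_tokens(
    "123 South Avenida B Oak | Oak B Street", toy_parse_line
)
print(resolve_pattern(combined, {}))
```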
The syntax of this file is:
input-pattern example output-pattern pattern-class pattern-modifier priority usage-flag exclude-flag
You can modify or add entries in this table as needed. The following table describes the columns in the address patterns file.
Table 6 Address Patterns File
Following is an excerpt from the address patterns file.
NU DR TY A1 AU    01 123 South Avenida B Oak
HN PD PT NA NA    H* 70
NU DR TY NU DR    01 123 South Avenida 1 West
HN PD PT NA SD    H* 70
NU A1 TY AU TY    01 123 C circle hill drive
HN HS NA NA ST    H* 70
NU A1 AM A1 TY    01 123 M & M road
HN NA NA NA ST    H* 65
NU TY AU A1       01 123 Avenida Oak B
HN PT NA NA       H* 60
NU TY NU A1       01 123 Avenida 1 B
HN PT NA NA       H* 60
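To make the two-row layout concrete, the following Python sketch reads one pattern entry into its parts. It is a minimal illustration, not the engine's own parser. `parse_pattern_rows` and `PatternEntry` are hypothetical names. The `01` field that precedes each example is not described in this section, so the sketch simply assumes that the first all-digit field on the first row marks the end of the input pattern. It also assumes the pattern class and modifier appear together as a single field such as `H*`.

```python
import re
from typing import List, NamedTuple

# Pattern class letter followed by a modifier, for example "H*" or "T+".
CLASS_RE = re.compile(r"^([A-Z])([*+])$")


class PatternEntry(NamedTuple):
    input_pattern: List[str]
    example: str
    output_pattern: List[str]
    pattern_class: str
    modifier: str
    priority: int
    flags: List[str]


def parse_pattern_rows(row1: str, row2: str) -> PatternEntry:
    """Parse the two rows of one patterns entry (hypothetical helper)."""
    first = row1.split()
    # Assumption: the first all-digit field ("01" in the excerpt) separates
    # the input pattern from the example text.
    cut = next((i for i, f in enumerate(first) if f.isdigit()), None)
    second = row2.split()
    pos = next((i for i, f in enumerate(second) if CLASS_RE.match(f)), None)
    if cut is None or pos is None or pos + 1 >= len(second):
        raise ValueError("rows do not look like a patterns entry")
    match = CLASS_RE.match(second[pos])
    return PatternEntry(
        input_pattern=first[:cut],
        example=" ".join(first[cut + 1:]),
        output_pattern=second[:pos],
        pattern_class=match.group(1),
        modifier=match.group(2),
        priority=int(second[pos + 1]),
        flags=second[pos + 2:],  # any remaining fields, such as "TX"
    )


entry = parse_pattern_rows("AU A1 TY 01 Oak B Street", "NA NA ST T* 75 TX")
print(entry)
```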
The address patterns files use pattern type tokens, pattern classes, pattern modifiers, and priority indicators to process and parse address data. Before modifying any of the patterns files, you must have a good understanding of these file components.
The address pattern and clues files use tokens to denote different components in a street address, such as street type, house number, street names, and so on. These files use one set of tokens for input fields and another set for output fields. You can use only the predefined tokens to represent address components; the Master Index Standardization Engine does not recognize custom tokens.
The following table lists and describes each input token.
Table 7 Input Address Pattern Type Tokens
The following table lists and describes each output token.
Table 8 Output Address Pattern Tokens
Each pattern defined in the address patterns file must have an associated pattern class. The pattern class indicates a portion of the input pattern or the type of address data that is represented by the pattern. You can specify any of the following pattern classes.
W - the address pattern represents a unit within a structure, such as an apartment or suite number
T - the address pattern represents a street type or direction
These classes are also specified as usage flags in the patterns file and the master clues file.
Each pattern class must be followed by a pattern modifier that indicates how to handle cases where one or more defined patterns are found to be sub-patterns of a larger input pattern. In these cases, the Master Index Standardization Engine must know how to prioritize each defined pattern that is part of the larger pattern. There are two pattern modifiers.
* - An asterisk indicates that the priority weight for the matching pattern is averaged down equally with the other matching sub-patterns.
+ - A plus sign indicates that the priority weight for the matching pattern is not averaged down equally with the other matching sub-patterns.
The priority indicator is a numeric value following the pattern modifier that indicates the priority weight of the pattern. These values work best as multiples of five from 35 through 95, inclusive. If a pattern is assigned a priority of 90 or 95 and the pattern matches, or is a sub-pattern of, the input pattern, the standardization engine stops searching for additional matching patterns and uses the high-priority matching pattern.
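The priority behavior can be illustrated with a short Python sketch. It covers only what the text states: the recommended value range, the preference for the higher priority weight, and the early stop at 90 or 95. The function names are hypothetical, the order in which candidates are examined and the handling of ties are assumptions, and the averaging applied by the `*` modifier is not modeled.

```python
from typing import Iterable, Optional, Tuple


def is_recommended_priority(value: int) -> bool:
    """True for multiples of five from 35 through 95, inclusive."""
    return 35 <= value <= 95 and value % 5 == 0


def choose_pattern(
    matches: Iterable[Tuple[str, int]]
) -> Optional[Tuple[str, int]]:
    """Pick a pattern from (name, priority) candidates (hypothetical helper).

    A candidate with priority 90 or 95 ends the search immediately.
    Otherwise the highest priority seen wins; on a tie the earlier
    candidate is kept, which is an assumption of this sketch.
    """
    best: Optional[Tuple[str, int]] = None
    for name, priority in matches:
        if priority in (90, 95):
            return (name, priority)
        if best is None or priority > best[1]:
            best = (name, priority)
    return best


print(choose_pattern([("first", 60), ("second", 70), ("third", 65)]))
```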