Oracle Java CAPS Master Index Standardization Engine Reference
Master Index Standardization Engine Overview
Data standardization transforms input data into common representations of values to give you a single, consistent view of the data stored in and across organizations. Standardizing the data stored in disparate systems provides a common representation of the data so you can easily and accurately compare data between systems.
Data standardization applies up to three transformations to the data: parsing into individual components, normalization, and phonetic encoding. These actions cleanse the data to prepare it for matching and searching. Some fields require all three steps, some only normalization and phonetic encoding, and others only phonetic encoding. Typically, data is first parsed, then normalized, and then phonetically encoded, though some cleansing might be needed before parsing.
Standardization can include any one or any combination of the following phases.
If incoming records contain data that is not formatted properly, the data must be parsed and reformatted before it can be normalized. Parsing identifies and separates each component of a free-form text field that contains multiple pieces of information. Reformatting can also include removing characters or strings that are not relevant to the data. A good example is standardizing free-form text address fields. If you are comparing or searching on street addresses that are contained in one or more free-form text fields (that is, the street address is contained in one field, the apartment number in another, and so on), those fields need to be parsed into their individual components, such as house number, street name, street type, and street direction. Certain components of the address, such as the street name and type, can then be normalized. Field components are also known as tokens, and the process of separating data into its tokens is known as tokenization.
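As a rough illustration of tokenization, the following Python sketch splits a free-form street address into the components named above. The function name, the dictionary keys, and the small sets of street types and directions are all hypothetical. The Master Index Standardization Engine's own address parsing is rules-based and is not reproduced here.

```python
import re

# Hypothetical, deliberately tiny lexicons used only for this sketch.
STREET_TYPES = {"ST", "AVE", "DR", "DRV", "RD", "BLVD", "LN"}
DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}


def parse_street_address(text):
    """Split a free-form street address into labeled tokens."""
    # Replace punctuation with spaces, then split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", text).split()
    parsed = {
        "house_number": None,
        "street_direction": None,
        "street_name": None,
        "street_type": None,
    }
    name_tokens = []
    for i, token in enumerate(tokens):
        upper = token.upper()
        if i == 0 and token.isdigit():
            # A leading all-digit token is treated as the house number.
            parsed["house_number"] = token
        elif (upper in DIRECTIONS and parsed["street_direction"] is None
              and not name_tokens):
            # A direction is only recognized before the street name starts.
            parsed["street_direction"] = token
        elif upper in STREET_TYPES and i == len(tokens) - 1:
            # A street type is only recognized as the final token.
            parsed["street_type"] = token
        else:
            name_tokens.append(token)
    if name_tokens:
        parsed["street_name"] = " ".join(name_tokens)
    return parsed
```

With these assumptions, `parse_street_address("123 N. Main St.")` yields house number `123`, direction `N`, street name `Main`, and street type `St`. The street name and type tokens are then candidates for normalization.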
Normalizing data converts it into a standard or common form. A common use for normalization is to convert nicknames into their standard names, such as converting “Rich” to “Richard” or “Meg” to “Margaret”. Another is normalizing street address components; for example, both “Dr.” and “Drv” in a street address might be normalized to “Drive”. Normalized values are obtained from lookup tables. Once a field value is normalized, it can be compared more accurately against values in other records to determine whether they match.
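The lookup-table approach can be sketched in a few lines of Python. The table contents come from the examples above. The names `NICKNAMES`, `STREET_TYPES`, and `normalize` are hypothetical, and the engine's actual normalization files are not shown here.

```python
# Hypothetical lookup tables, keyed by upper-cased input without a trailing period.
NICKNAMES = {"RICH": "Richard", "MEG": "Margaret"}
STREET_TYPES = {"DR": "Drive", "DRV": "Drive"}


def normalize(value, table):
    """Return the standard form of value, or the trimmed value if no entry exists."""
    trimmed = value.strip()
    key = trimmed.rstrip(".").upper()
    return table.get(key, trimmed)
```

Under these assumptions, `normalize("Dr.", STREET_TYPES)` and `normalize("Drv", STREET_TYPES)` both return `"Drive"`, while a value with no table entry, such as `normalize("Alice", NICKNAMES)`, is returned unchanged.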
Once data has gone through any necessary reformatting and normalization, it can be phonetically encoded. In a master index application, phonetic values are generally used in blocking queries to obtain all possible matches to an incoming record. They are also used to perform searches from the Master Index Data Manager (MIDM) that allow for misspellings and typographic errors. Typically, first names use Soundex encoding, and last names and street names use NYSIIS encoding, but the Master Index Standardization Engine supports several additional phonetic encoders as well.
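To make phonetic encoding concrete, here is a minimal Python sketch of the standard American Soundex algorithm. It is an illustration only, not the encoder shipped with the Master Index Standardization Engine, and the function name `soundex` is an assumption.

```python
# Consonant groups of standard American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the four-character Soundex code of name, or "" if it has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = [first]
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W do not break a run of identical codes.
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        # Vowels reset prev to "", so equal codes separated by a vowel are both kept.
        prev = code
    # Pad with zeros and truncate to one letter plus three digits.
    return ("".join(encoded) + "000")[:4]
```

For example, `soundex("Robert")` and `soundex("Rupert")` both return `"R163"`. A blocking query keyed on the phonetic value would therefore retrieve both spellings as candidate matches.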