JavaScript is required to for searching.
Skip Navigation Links
Exit Print View
Oracle Java CAPS Master Index Standardization Engine Reference     Java CAPS Documentation
search filter icon
search icon

Document Information

Oracle Java CAPS Master Index Standardization Engine Reference

About the Master Index Standardization Engine

Related Topics

Master Index Standardization Engine Overview

Standardization Concepts

Data Parsing or Reformatting

Data Normalization

Phonetic Encoding

How the Master Index Standardization Engine Works

Master Index Standardization Engine Data Types and Variants

Master Index Standardization Engine Standardization Components

Finite State Machine Framework

About the Finite State Machine Framework

FSM-Based Configuration

Rules-Based Framework

About the Rules-Based Framework

Rules-Based Configuration

Oracle Java CAPS Master Index Standardization and Matching Process

Master Index Standardization Engine Internationalization

Finite State Machine Framework Configuration

FSM Framework Configuration Overview

Process Definition File

Standardization State Definitions

Input Symbol Definitions

Output Symbol Definitions

Data Cleansing Definitions

Data Normalization Definitions

Standardization Processing Rules Reference

dictionary

fixedString

lexicon

normalizeSpace

pattern

replace

replaceAll

transliterate

uppercase

Lexicon Files

Normalization Files

FSM-Based Person Name Configuration

Person Name Standardization Overview

Person Name Standardization Components

Person Name Standardization Files

Person Name Lexicon Files

Person Name Normalization Files

Person Name Process Definition Files

Person Name Standardization and Oracle Java CAPS Master Index

Person Name Processing Fields

Person Name Standardized Fields

Person Name Object Structure

Configuring a Normalization Structure for Person Names

Configuring a Standardization Structure for Person Names

Configuring Phonetic Encoding for Person Names

FSM-Based Telephone Number Configuration

Telephone Number Standardization Overview

Telephone Number Standardization Components

Telephone Number Standardization Files

Telephone Number Standardization and Oracle Java CAPS Master Index

Telephone Number Processing Fields

Telephone Number Standardized Fields

Telephone Number Object Structure

Configuring a Standardization Structure for Telephone Numbers

Rules-Based Address Data Configuration

Address Data Standardization Overview

Address Data Standardization Components

Address Data Standardization Files

Address Clues File

Address Master Clues File

Address Patterns File

Address Pattern File Components

Address Type Tokens

Pattern Classes

Pattern Modifiers

Priority Indicators

Address Standardization and Oracle Java CAPS Master Index

Address Data Processing Fields

Address Standardized Fields

Address Object Structure

Configuring a Standardization Structure for Address Data

Configuring Phonetic Encoding for Address Data

Rules-Based Business Name Configuration

Business Name Standardization Overview

Business Name Standardization Components

Business Name Standardization Files

Business Name Adjectives Key Type File

Business Alias Key Type File

Business Association Key Type File

Business General Terms Reference File

Business City or State Key Type File

Business Former Name Reference File

Merged Business Name Category File

Primary Business Name Reference File

Business Connector Tokens Reference File

Business Country Key Type File

Business Industry Sector Reference File

Business Industry Key Type File

Business Organization Key Type File

Business Patterns File

Business Name Tokens

Business Name Standardization and Oracle Java CAPS Master Index

Business Name Processing Fields

Business Name Standardized Fields

Business Name Object Structure

Configuring a Standardization Structure for Business Names

Configuring Phonetic Encoding for Business Names

Custom FSM-Based Data Types and Variants

About Custom FSM-Based Data Types and Variants

About the Standardization Packages

Creating Custom FSM-Based Data Types

Creating the Working Directory

To Create the Working Directory

Defining the Service Type

To Define the Service Type

Defining the Variants

To Define the Variants

Packaging and Importing the Data Type

To Package and Import the Data Type

Service Type Definition File

Creating Custom FSM-Based Variants

Creating the Working Directory

To Create the Working Directory

Defining the Service Instance

To Define the Service Instance

Defining the State Model and Processing Rules

To Define the State Model and Processing Rules

Creating Normalization and Lexicon Files

To Create Normalization and Lexicon Files

Packaging and Importing the Variant

To Package and Import the Variant

Service Instance Definition File

Standardization Concepts

Data standardization transforms input data into common representations of values to give you a single, consistent view of the data stored in and across organizations. Standardizing the data stored in disparate systems provides a common representation of the data so you can easily and accurately compare data between systems.

Data standardization applies three transformations against the data: parsing into individual components, normalization, and phonetic encoding. These actions help cleanse data to prepare it for matching and searching. Some fields might require all three steps, some just normalization and phonetic conversion, and other data might only need phonetic encoding. Typically data is first parsed, then normalized, and then phonetically encoded, though some cleansing might be needed prior to parsing.

Standardization can include any one or any combination of the following phases.

Data Parsing or Reformatting

If incoming records contain data that is not formatted properly, it must be reformatted before it can be normalized. This process identifies and separates each component of a free-form text field that contains multiple pieces of information. Reformatting can also include removing characters or strings from a field that are not relevant to the data. A good example is standardizing free-form text address fields. If you are comparing or searching on street addresses that are contained in one or more free-form text fields (that is, the street address is contained in one field, apartment number in another, and so on), those fields need to be parsed into their individual components, such as house number, street name, street type, and street direction. Then certain components of the address, such as the street name and type, can be normalized. Field components are also known as tokens, and the process of separating data into its tokens is known as tokenization.

Data Normalization

Normalizing data converts it into a standard or common form. A common use for normalization is to convert nicknames into their standard names, such as converting “Rich” to “Richard” or “Meg” to “Margaret”. Another example is normalizing street address components. For example, both “Dr.” or “Drv” in a street address might be normalized to “Drive”. Normalized values are obtained from lookup tables. Once a field value is normalized, that value can be more accurately compared against values in other records to determine whether they are a match.

Phonetic Encoding

Once data has gone through any necessary reformatting and normalization, it can be phonetically encoded. In a master index application, phonetic values are generally used in blocking queries in order to obtain all possible matches to an incoming record. They are also used to perform searches from the Master Index Data Manager (MIDM) that allow for misspellings and typographic errors. Typically, first names use Soundex encoding and last names and street names use NYSIIS encoding, but the Master Index Standardization Engine supports several additional phonetic encoders as well.