Master Index Standardization Engine (Sun Master Data Management Suite Primer)

Master Index Standardization Engine

The Master Index Standardization Engine parses, normalizes, and phonetically encodes data for external applications, such as master index applications. Before records can be compared to evaluate the possibility of a match, the data must be normalized and in certain cases parsed or phonetically encoded. Once the data is conditioned, the match engine can determine a match weight for the records. The standardization engine is built on a flexible framework that allows you to customize the standardization process and extend standardization rules.

Standardization Concepts

Data standardization transforms input data into common representations of values to give you a single, consistent view of the data stored in and across organizations. This common representation allows you to easily and accurately compare data between systems.

Data standardization applies three transformations to the data: parsing into individual components, normalization, and phonetic encoding. These actions help cleanse data to prepare it for matching and searching. Some fields might require all three steps, some only normalization and phonetic encoding, and others only phonetic encoding. Typically, data is first parsed, then normalized, and then phonetically encoded, though some cleansing might be needed before parsing.

A common use of normalization is for first names. Nicknames need to be converted to their common names in order to make an accurate match; for example, converting “Beth” and “Liz” to “Elizabeth”. An example of data that needs to be parsed into its individual components before matching is street addresses. For example, the string “800 W. Royal Oaks Boulevard” would be parsed as follows:

Street Number: 800
Street Name: Royal Oaks
Street Type: Boulevard
Street Direction: W.

Once parsed, the data can be normalized into a common form. For example, “W.” might be converted to “West” and “Boulevard” to “Blvd” so that these values are consistent across all addresses.
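The parse-then-normalize sequence for the address example can be sketched as follows. This is a minimal illustration only, not the engine's implementation: the lookup tables, the function names parse_address and normalize_address, and the field names are all assumptions made for this example, and real addresses need far more rules than this.

```python
# Hypothetical lookup tables; the engine's real tables are far more extensive.
DIRECTIONS = {
    "n": "North", "s": "South", "e": "East", "w": "West",
    "north": "North", "south": "South", "east": "East", "west": "West",
}
STREET_TYPES = {
    "boulevard": "Blvd", "blvd": "Blvd",
    "street": "St", "st": "St",
    "avenue": "Ave", "ave": "Ave",
}


def _key(token):
    """Lookup key for a token: no trailing period, lowercase."""
    return token.rstrip(".").lower()


def parse_address(text):
    """Split a simple street address into its individual components."""
    tokens = text.split()
    parsed = {"number": "", "direction": "", "name": "", "type": ""}
    if tokens and tokens[0].isdigit():
        parsed["number"] = tokens.pop(0)
    if tokens and _key(tokens[0]) in DIRECTIONS:
        parsed["direction"] = tokens.pop(0)
    if tokens and _key(tokens[-1]) in STREET_TYPES:
        parsed["type"] = tokens.pop()
    parsed["name"] = " ".join(tokens)  # whatever remains is the street name
    return parsed


def normalize_address(parsed):
    """Convert direction and street type to a common form."""
    normalized = dict(parsed)
    normalized["direction"] = DIRECTIONS.get(_key(parsed["direction"]), parsed["direction"])
    normalized["type"] = STREET_TYPES.get(_key(parsed["type"]), parsed["type"])
    return normalized


parsed = parse_address("800 W. Royal Oaks Boulevard")
normalized = normalize_address(parsed)
```

Here parsed holds the raw components (800, W., Royal Oaks, Boulevard), and normalized holds the same record with the direction converted to West and the street type to Blvd.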

Phonetic encoding allows for typos and input errors when searching for data. Several different phonetic encoders are supported, but the two most commonly used are NYSIIS and Soundex.
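To show why phonetic encoding tolerates spelling variations, the following is a minimal sketch of the standard American Soundex algorithm. It is written for illustration and is not the encoder shipped with the engine; the function name soundex is an assumption. NYSIIS follows a more involved set of rewrite rules and is not shown.

```python
# Map each consonant to its Soundex digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    previous = CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code is kept
    return (result + "000")[:4]  # pad with zeros and truncate to length 4
```

With this sketch, the spellings "Smith" and "Smyth" both encode to S530, so a search on either spelling can find records stored under the other.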

Master Index Standardization Engine Configuration

The Master Index Standardization Engine uses two frameworks to define standardization logic. One framework is based on a finite state machine (FSM) model, and the other is based on rules programmed in Java. In the current implementation, person names and telephone numbers are processed using the FSM framework, and addresses and business names are processed using the rules-based framework. Both frameworks can be customized as needed.

A finite state machine (FSM) is composed of one or more states and the transitions between those states. In this case, a state is a value within a text field, such as a street address, that needs to be parsed from the text. The Master Index Standardization Engine FSM framework is designed to be highly configurable and can be easily extended. Standardization is defined using a simple markup language and no Java coding is required.
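The idea of parsing with states and transitions can be sketched as follows. This is an illustration of the general FSM technique only. The engine defines its state machines in its own markup language, which is not reproduced here, and the state names, token classes, output field names, and the parse_name function below are all hypothetical.

```python
# Hypothetical token classes; a real configuration would use lexicon files.
SALUTATIONS = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "ii", "iii"}


def classify(token):
    """Assign a token class that the state machine can branch on."""
    key = token.rstrip(".").lower()
    if key in SALUTATIONS:
        return "salutation"
    if key in SUFFIXES:
        return "suffix"
    return "word"


# (current state, token class) -> (next state, output field)
TRANSITIONS = {
    ("start", "salutation"): ("salutation", "Salutation"),
    ("start", "word"): ("first", "FirstName"),
    ("salutation", "word"): ("first", "FirstName"),
    ("first", "word"): ("last", "LastName"),
    ("last", "suffix"): ("suffix", "Suffix"),
}


def parse_name(text):
    """Walk the tokens through the state machine, filling one field per state."""
    state = "start"
    fields = {}
    for token in text.split():
        key = (state, classify(token))
        if key not in TRANSITIONS:
            raise ValueError(f"no transition from {state!r} on {token!r}")
        state, field = TRANSITIONS[key]
        fields[field] = token
    return fields
```

For the input "Dr. Elizabeth Warren Jr.", this sketch moves through the salutation, first, last, and suffix states in turn and assigns each token to the matching field. Supporting another pattern, such as a middle name, means adding transitions to the table rather than changing the parsing loop, which is the property that makes an FSM approach easy to configure.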

The Master Index Standardization Engine rules-based framework defines the standardization process for addresses and business names in Java classes. This framework can be extended by creating additional Java packages to define processing.

Both frameworks rely on sets of text files that help identify field values and determine how to parse and normalize them.
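As a rough illustration of how a text file can drive normalization, the sketch below loads a small nickname table and uses it to convert nicknames to common names. The pipe-delimited layout, the comment syntax, the uppercase convention, and the function names are assumptions made for this example. They do not describe the engine's actual file formats.

```python
# Hypothetical file contents; the pipe-delimited layout is an assumption.
NICKNAME_FILE = """\
# nickname|common name
BETH|ELIZABETH
LIZ|ELIZABETH
BILL|WILLIAM
"""


def load_table(text):
    """Build a nickname -> common name lookup from the file contents."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        nickname, common = line.split("|", 1)
        table[nickname.strip().upper()] = common.strip().upper()
    return table


def normalize_first_name(name, table):
    """Return the common name for a nickname, or the name itself if unlisted."""
    key = name.strip().upper()
    return table.get(key, key)


table = load_table(NICKNAME_FILE)
```

With this table, both "Beth" and "Liz" normalize to ELIZABETH, while a name that is not listed, such as "Maria", passes through unchanged apart from the case conversion.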

Master Index Standardization Engine Features

The Master Index Standardization Engine provides proven and extensive standardization capabilities to the Sun MDM Suite, and includes the following features:

Works with Sun Master Index applications and can also be called from other applications, such as Data Integrator, web services, and web applications.
Uses standardization algorithms based on research at the U.S. Census Bureau, Statistical Research Division (SRD).
Is highly configurable and can be used to standardize various types of data.
Supports data sets specific to Australia, France, Great Britain, and the United States by default, and can be extended to support additional locales.
Processes data using one of the defined locales or using multiple locales.
Supports a variety of data types, including addresses, person names, businesses, and telephone numbers. Additional data types can be easily added.
Provides comprehensive person name normalization tables for the four default locales.
Uses a probability-based mechanism to resolve ambiguity during processing.
Allows you to apply cleansing rules prior to the standardization processing.
Performs preprocessing, matching, and postprocessing during the parsing process based on customizable rules.
Lets you adapt the standardization and matching logic to specific needs, and create new data types or variants for even more customized processing.
Supports pluggable standardization sets, so you can define custom standardization processing for most types of data.
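The cleansing step mentioned in the list above can be pictured as an ordered set of text rewrites applied before parsing. The following is a minimal sketch under that assumption. The rule list, the regular expressions, and the cleanse function are hypothetical and do not represent the engine's cleansing rule syntax.

```python
import re

# Hypothetical cleansing rules as (pattern, replacement) pairs, applied in order.
CLEANSING_RULES = [
    (re.compile(r"[^\w\s.#-]"), " "),  # replace stray punctuation with a space
    (re.compile(r"\s+"), " "),         # collapse runs of whitespace
]


def cleanse(text):
    """Apply each cleansing rule in turn, then trim the ends."""
    for pattern, replacement in CLEANSING_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()
```

For example, cleanse("  800  W.  Royal Oaks   Boulevard!! ") yields "800 W. Royal Oaks Boulevard", which a parser can then split into components without having to handle irregular spacing or stray characters.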