Understanding the Master Index Standardization Engine

How the Master Index Standardization Engine Works

The Master Index Standardization Engine uses two frameworks to define standardization logic. One framework is based on a finite state machine (FSM) model and the other is based on rules programmed in Java. In the current implementation, the person names and telephone numbers are processed using the FSM framework, and addresses and business names are processed using the rules-based framework. The Master Index Standardization Engine includes several sets of files that define standardization logic for all supported data types. For person data and addresses, one set of standardization files is provided for the following national variants: Australia, France, Great Britain, and the United States. You can customize these files to adapt the standardization and matching logic to your specific needs or you can create new data types or variants for even more customized processing. With pluggable standardization sets, you can define custom standardization processing for most types of data.

The following topics provide information about the Master Index Standardization Engine, the standardization frameworks, and data is standardized:

Master Index Standardization Engine Data Types and Variants

A data type is the primary kind of data you are processing, such as person names, addresses, business names, automotive parts, and so on. A variant is a subset of a data type that is designed to standardize a specific kind of data. For example, for addresses and names, the variants typically define rules for the different countries in which the data originates. For automotive parts, the variants might be different manufacturers. Each data type and variant uses its own configuration files to define how fields in incoming records are parsed, standardized, and classified for processing. Data types are sometimes referred to as standardization types.

In the default implementation with a master index application, the engine supports data standardization on the following types of data:

Person Information (described in FSM–Based Person Name Configuration)
Telephone Numbers (described in FSM–Based Telephone Number Configuration)
Street Addresses (described in Rules–Based Address Data Configuration)
Business Names (described in Rules-Based Business Name Configuration)

In the default configuration, the standardization engine expects street address and business names to be in free-form text fields that need to be parsed prior to normalization and phonetic encoding. Person and phone information can also be contained in free-form text fields, but theses types of information can also be processed if the data is already parsed into its individual components. Each data type requires specific customization to mefa.xml in the master index project. This can be done by modifying the file directly or by using the Master Index Configuration Editor.

Master Index Standardization Engine Standardization Components

The Master Index Standardization Engine breaks down fields into various components during the parsing process. This is known an tokenization. For example, it breaks addresses into floor number, street number, street name, street direction, and so on. Some of these components are similar and might be stored in the same output field. In the default configuration for a master index application, for example, when the standardization engine finds a house number, rural route number, or PO box number, the value is stored in the HouseNumber database field. You can customize this as needed, as long as any field you specify to store a component is also included in the object structure defined for the master index application.

The standardization engine uses tokens to determine how to process fields that are defined for normalization or parsing into their individual standardization components. For FSM-based data types, the tokens are defined as output symbols in the process definition files and are referenced in the standardization structures in the Master Index Configuration Editor and in mefa.xml. The tokens determine how each field is normalized or how a free-form text field is parsed and normalized. For rules-based data types, the tokens are defined internally in the Java code. The tokens for business names specify which business type key file to use to normalize a specific standardization component. The tokens for addresses determine which database fields store each standardization component and how each component is standardized.

Finite State Machine Framework

A finite state machine (FSM) is composed of one or more states and the transitions between those states. The Master Index Standardization Engine FSM framework is designed to be highly configurable and can be easily extended with no Java coding. The following topics describe the FSM framework and the configuration files that define FSM–based standardization.

About the Finite State Machine Framework

In an FSM framework, the standardization process is defined as one or more states. In a state, only the input symbols defined for that state are recognized. When one of those symbols is recognized, the following action or transition is based on configurable processing rules. For example, when an input symbol is recognized, it might be preprocessed by removing punctuation, matched against a list of tokens, and then postprocessed by normalizing the input value. Once this has been completed for all input symbols, the standardization engine determines which token is the most likely match.

FSM-based processing includes the following steps:

Cleansing – The entire input string is modified to make sure it is broken down into its individual components correctly.
Tokenization – The input string is broken down into its individual components.
Parsing – The individual field components are processed according to configurable rules. Parsing can include any combination of the following three stages:
- Preprocessing – Each token is cleansed prior to matching to make the value more uniform.
- Matching – The cleansed token is matched against patterns or value lists.
- Postprocessing – The matched token is normalized.
  
  Note –
  Several parsing sequences might be performed against one field component in order to best match it with a token. Each sequence is carried out until a match is made.
Ambiguity Resolution – Some input strings might match more than one processing rule, so the FSM framework includes a probability-based mechanism for determining the correct state transition.

Using the person data type, for example, first names such as “Bill” and “Will” are normalized to “William”, which is then converted to its phonetic equivalent. Standardization logic is defined in the standardization engine configuration files and in the Master Index Configuration Editor or mefa.xml in a master index project.

FSM-Based Configuration

The FSM-based standardization configuration files are stored in the master index project and appear in the Standardization Engine node of the project. These files are separated into groups based on the primary data types being processed. Data type groups have further subsets of configuration files based on the variants for each data type. FSM-based data types and variants, such as PersonName and PhoneNumber, include the following configuration file types.

Service Definition Files – Each data type and data type variant is defined by a service definition file. Service type files define the fields to be standardized for a data type and service instance files define the variant and Java factory class for the variant. Both files are in XML format and should not be modified unless the data type is extended to include more output symbols.
Process Definition Files – These files define the different stages of processing data for the data type or variant. It defines the FSM states, input and output symbols, patterns, and data cleansing rules. These files use a domain-specific language (DSL) to define how the data fields are processed.
Lexicon Files – The standardization engine uses these files to recognize input data. A lexicon provides a list of possible values for a specific field, and one lexicon file should be defined for each field on which standardization is performed.
Normalization Files – The standardization engine uses these files to convert nonstandard values into a common form. For example, a nickname file provides a list of nicknames along with the common version of each name. For example, “Beth” and “Liz” might both be normalized to “Elizabeth”. Each row in the file contains a nickname and its corresponding normalized version separated by a pipe character (|).

Rules-Based Framework

In the rules-based framework, the standardization process is define in the underlying Java code. You can configure several aspects of the standardization process, such as the detectable patterns for each data type, how values are normalized, and how the input string is cleansed and parsed. You can define custom rules-based data types and variants by creating custom Java packages that define processing.

About the Rules-Based Framework

In the rules-based framework, individual field components are recognized by the patterns defined for each data type and by information provided in configurable files about how to preprocess, match, and postprocess each field components. The rules-based framework processes data in the following stages.

Parsing - A free-form text field is separated into its individual components, such as street address information or a business name. This process takes into account logic you can customize, such as token patterns, special characters, and priority weights for patterns.
Normalization - Once a field is parsed, individual components of the field are normalized based on the configuration files. This can include changing the input street name to a common form or changing the input business name to its official form.
Phonetic Encoding - After a field is parsed and optionally normalized, the value of a field is converted to its phonetic version. The value to be converted can be the original input value, a parsed value, a normalized value, or a parsed and normalized value.

Using the street address data type, for example, street addresses are parsed into their component parts, such as house numbers, street names, and so on. Certain fields are normalized, such as street name, street type, and street directions. The street name is then phonetically converted. Standardization logic is defined in the standardization engine configuration files and in the Master Index Configuration Editor or mefa.xml in a master index project.

Rules-Based Configuration

The rules-based standardization configuration files are stored in the master index project and appear as nodes in the Standardization Engine node of the project. These files are separated into groups based on the primary data types and variants being processed. Rules-based data types and variants, such as the default Address and Business Name types, use the following configuration file types.

Service Definition Files – Each data type and data type variant is configured by a service definition file. Service type files define the fields to be standardized for a data type, and service instance definition files define the variant and Java factory class for the variant. Both files are in XML format. These files should not be modified.
Category Files - The standardization engine uses category files when processing business names. These files list common values for certain types of data, such as industries and organizations for business names. Category files also define standardized versions of each term or classify the terms into different categories, and some files perform both functions. When processing address files, category files named clues files are used.
Clues Files - The standardization engine uses clues files when processing address data types. These files list general terms used in street address fields, define standardized versions of each term, and classify the terms into various component types using predefined address tokens. These files are used by the standardization engine to determine how to parse a street address into its various components. Clues files provide clues in the form of tokens to help the engine recognize the component type of certain values in the input fields.
Patterns Files - The patterns files specify how incoming data should be interpreted for standardization based on the format, or pattern, of the data. These files are used only for processing data contained in free-form text fields that must be parsed prior to matching (such as street address fields or business names). Patterns files list possible input data patterns, which are encoded in the form of tokens. Each token signifies a specific component of the free-form text field. For example, in a street address field, the house number is identified by one token, the street name by another, and so on. Patterns files also define the format of the output fields for each input pattern.
Key Type Files - For business name processing, the standardization engine refers to a number of key type files for processing data. These files generally define standard versions of terms commonly found in business names and some classify these terms into various components or industries. These files are used by the standardization engine to determine how to parse a business name into its different components and to recognize the component type of certain values in the input fields.
Reference Files - Reference files define general terms that appear in input fields for each data type. Some reference files define terms to ignore and some define terms that indicate the business name is continuing. For example, in business name processing “and” is defined as a joining term. This helps the standardization engine to recognize that the primary business name in “Martin and Sons, Inc.” is “Martin and Sons” instead of just “Martin”. Reference files can also define characters to be ignored by the standardization engine.