A finite state machine (FSM) is composed of one or more states and the transitions between those states. The Master Index Standardization Engine FSM framework is designed to be highly configurable and can be extended without Java coding. The following topics describe the FSM framework and the configuration files that define FSM-based standardization.
In an FSM framework, the standardization process is defined as one or more states. In each state, only the input symbols defined for that state are recognized. When one of those symbols is recognized, the next action or transition is determined by configurable processing rules. For example, when an input symbol is recognized, it might be preprocessed by removing punctuation, matched against a list of tokens, and then postprocessed by normalizing the input value. Once this processing is complete for all input symbols, the standardization engine determines which token is the most likely match.
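The sketch below is a minimal illustration of this idea, assuming a person-name field: each state recognizes a fixed set of input symbols, and a recognized symbol drives a transition to the next state. All class names, states, and symbols here are illustrative; they are not the engine's actual API.

import java.util.Map;

// Illustrative only: a minimal state machine showing states, input symbols,
// and transitions. The real engine defines these in its configuration files.
public class FsmSketch {
    enum State { START, FIRST_NAME, LAST_NAME, DONE }

    // Transition table: for each state, the input symbols it recognizes
    // and the state each symbol leads to.
    static final Map<State, Map<String, State>> TRANSITIONS = Map.of(
        State.START,      Map.of("salutation", State.FIRST_NAME,
                                 "firstName",  State.FIRST_NAME),
        State.FIRST_NAME, Map.of("middleName", State.LAST_NAME,
                                 "lastName",   State.LAST_NAME),
        State.LAST_NAME,  Map.of("suffix",     State.DONE));

    static State next(State current, String symbol) {
        // Only symbols defined for the current state are recognized;
        // an unrecognized symbol leaves the machine in its current state.
        return TRANSITIONS.getOrDefault(current, Map.of())
                          .getOrDefault(symbol, current);
    }

    public static void main(String[] args) {
        State state = State.START;
        for (String symbol : new String[] {"firstName", "lastName", "suffix"}) {
            state = next(state, symbol);
            System.out.println(symbol + " -> " + state);
        }
    }
}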
FSM-based processing includes the following steps:
Cleansing – The entire input string is modified to ensure that it can be broken down correctly into its individual components.
Tokenization – The input string is broken down into its individual components.
Parsing – The individual field components are processed according to configurable rules. Parsing can include any combination of the following three stages:
Preprocessing – Each token is cleansed prior to matching to make the value more uniform.
Matching – The cleansed token is matched against patterns or value lists.
Postprocessing – The matched token is normalized.
Several parsing sequences might be performed against a single field component to find the best-matching token. The sequences are carried out in turn until a match is made (see the sketch following this list).
Ambiguity Resolution – Some input strings might match more than one processing rule, so the FSM framework includes a probability-based mechanism for determining the correct state transition.
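The sketch below is a rough illustration of the parsing stages for a single field component. The field names, patterns, and value lists are hypothetical, and the actual engine drives these stages from its configuration files rather than from code like this.

import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: preprocess -> match -> postprocess for one field component,
// with several parsing sequences tried until one of them matches.
public class ParseSketch {
    // A parsing sequence: cleanse the token, try to match it, normalize the match.
    interface Sequence {
        Optional<String> apply(String token);
    }

    static final List<String> FIRST_NAMES = List.of("WILLIAM", "ELIZABETH", "BILL");
    static final Map<String, String> NICKNAMES = Map.of("BILL", "WILLIAM", "LIZ", "ELIZABETH");

    public static void main(String[] args) {
        List<Sequence> sequences = List.of(
            // Sequence 1: match against a first-name list, then normalize nicknames.
            token -> {
                String cleansed = token.replaceAll("[^A-Za-z]", "").toUpperCase(); // preprocess
                if (!FIRST_NAMES.contains(cleansed)) return Optional.empty();      // match
                return Optional.of(NICKNAMES.getOrDefault(cleansed, cleansed));    // postprocess
            },
            // Sequence 2: fall back to a pattern match for an initial such as "B.".
            token -> token.matches("[A-Za-z]\\.?")
                ? Optional.of(token.substring(0, 1).toUpperCase())
                : Optional.empty());

        String component = "Bill,";
        for (Sequence sequence : sequences) {
            Optional<String> result = sequence.apply(component);
            if (result.isPresent()) {
                System.out.println(component + " -> " + result.get());
                break; // stop once a sequence produces a match
            }
        }
    }
}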
For the person data type, for example, first names such as “Bill” and “Will” are normalized to “William”, which is then converted to its phonetic equivalent. Standardization logic is defined in the standardization engine configuration files and in the Master Index Configuration Editor or mefa.xml in a master index project.
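As a rough illustration of that example, the snippet below normalizes a nickname and then applies a simplified Soundex-style phonetic encoding. The nickname table and the encoder are stand-ins for whatever normalization file and phonetic encoder are actually configured.

// Illustrative only: nickname normalization followed by a simplified
// Soundex-style phonetic encoding (the configured encoder may differ).
public class PhoneticSketch {
    static final java.util.Map<String, String> NICKNAMES =
        java.util.Map.of("BILL", "WILLIAM", "WILL", "WILLIAM");

    // Simplified Soundex: keep the first letter, code the remaining letters,
    // skip repeats and uncoded letters, pad the result to four characters.
    static String soundexLike(String name) {
        String codes = "01230120022455012623010202"; // a..z -> digit ('0' = not coded)
        String upper = name.toUpperCase();
        StringBuilder out = new StringBuilder().append(upper.charAt(0));
        char previous = codes.charAt(upper.charAt(0) - 'A');
        for (int i = 1; i < upper.length() && out.length() < 4; i++) {
            char code = codes.charAt(upper.charAt(i) - 'A');
            if (code != '0' && code != previous) out.append(code);
            previous = code;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Bill", "Will"}) {
            String normalized = NICKNAMES.getOrDefault(name.toUpperCase(), name.toUpperCase());
            System.out.println(name + " -> " + normalized + " -> " + soundexLike(normalized));
        }
    }
}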
The FSM-based standardization configuration files are stored in the master index project and appear in the Standardization Engine node of the project. These files are separated into groups based on the primary data types being processed. Each data type group is further divided into subsets of configuration files based on the variants for that data type. FSM-based data types and variants, such as PersonName and PhoneNumber, include the following configuration file types.
Service Definition Files – Each data type and data type variant is defined by a service definition file. Service type files define the fields to be standardized for a data type, and service instance files define the variant and the Java factory class for the variant. Both files are in XML format and should not be modified unless the data type is extended to include more output symbols.
Process Definition Files – These files define the different stages of processing data for the data type or variant, including the FSM states, input and output symbols, patterns, and data cleansing rules. These files use a domain-specific language (DSL) to define how the data fields are processed.
Lexicon Files – The standardization engine uses these files to recognize input data. A lexicon provides a list of possible values for a specific field, and one lexicon file should be defined for each field on which standardization is performed. (Sample entries are shown following this list.)
Normalization Files – The standardization engine uses these files to convert nonstandard values into a common form. For example, a nickname file provides a list of nicknames along with the common version of each name; “Beth” and “Liz” might both be normalized to “Elizabeth”. Each row in the file contains a nickname and its corresponding normalized version separated by a pipe character (|).
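For illustration only, since actual file names and contents depend on the data type and variant, a given-name lexicon file might simply list the recognized values, one per line:

    ELIZABETH
    MARGARET
    WILLIAM

A nickname normalization file pairs each nickname with its normalized version, separated by the pipe character:

    Beth|Elizabeth
    Liz|Elizabeth
    Bill|William
    Will|William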