Understanding the Master Index Standardization Engine

FSM-Based Configuration

The configuration files for finite state machine (FSM) based standardization are stored in the master index project and appear under the project's Standardization Engine node. The files are grouped by the primary data type being processed, and each data type group is further divided into subsets of configuration files, one for each variant of that data type. FSM-based data types and variants, such as PersonName and PhoneNumber, include the following configuration file types.

Service Definition Files – Each data type and data type variant is defined by a service definition file. Service type files define the fields to be standardized for a data type, and service instance files define the variant and the Java factory class for that variant. Both kinds of file are in XML format and should not be modified unless the data type is extended to include more output symbols.
Process Definition Files – These files define the stages of processing data for the data type or variant. They define the FSM states, input and output symbols, patterns, and data cleansing rules, and they use a domain-specific language (DSL) to specify how the data fields are processed.
Lexicon Files – The standardization engine uses these files to recognize input data. A lexicon provides a list of possible values for a specific field, and one lexicon file should be defined for each field on which standardization is performed.
Normalization Files – The standardization engine uses these files to convert nonstandard values into a common form. For example, a nickname file provides a list of nicknames along with the common version of each name, so that “Beth” and “Liz” might both be normalized to “Elizabeth”. Each row in the file contains a nickname and its corresponding normalized version, separated by a pipe character (|).
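
The pipe-delimited row format described for normalization files can be illustrated with a short sketch. This is not the standardization engine's own code; it is a minimal Python illustration that assumes one `nickname|normalized` pair per row, uppercase sample data, and case-insensitive matching. The function names `load_normalization_table` and `normalize` are hypothetical and chosen only for this example.

```python
import io


def load_normalization_table(lines):
    """Parse 'nickname|normalized' rows into a lookup dictionary.

    Assumption for this sketch: keys are matched case-insensitively,
    so both sides are stored in uppercase.
    """
    table = {}
    for raw in lines:
        row = raw.strip()
        if not row or "|" not in row:
            continue  # skip blank or malformed rows
        nickname, normalized = (part.strip() for part in row.split("|", 1))
        table[nickname.upper()] = normalized.upper()
    return table


def normalize(value, table):
    """Return the normalized form of value, or the trimmed value if unknown."""
    cleaned = value.strip()
    return table.get(cleaned.upper(), cleaned)


# Hypothetical sample rows in the nickname|normalized format.
SAMPLE = "BETH|ELIZABETH\nLIZ|ELIZABETH\n"

table = load_normalization_table(io.StringIO(SAMPLE))
print(normalize("Beth", table))   # ELIZABETH
print(normalize("Liz", table))    # ELIZABETH
print(normalize("Maria", table))  # Maria (no matching row, returned unchanged)
```

In this sketch a value with no matching row is returned unchanged, which is one reasonable way to handle names that are already in their common form; the engine's actual behavior for unmatched values is not described here.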