Understanding the Master Index Standardization Engine

Defining the State Model and Processing Rules

The state model defines how data is read, tokenized, parsed, and modified during standardization. Both the state model and the processing rules are defined in the standardizer.xml file.

Before you begin this step, determine the different forms in which the data to be standardized can appear and how each form should be standardized. For example, name data might appear as “First Name, Last Name, Middle Initial” or as “First Name, Middle Name, Last Name”, and you need to account for each possibility. Determine each state in the process and the input and output symbols used by each state. It might be useful to create a finite state machine (FSM) model, as shown below. The model shows each state, the transitions to and from each state, and the output symbol for each state.

Figure 2 Sample Finite State Machine Model

The figure shows a sample FSM model for phone numbers.

For more information about the FSM model, see FSM Framework Configuration Overview.

Procedure – To Define the State Model and Processing Rules

  1. In /WorkingDirectory/resource, create a new XML file named standardizer.xml.


    Tip –

    You can copy the file from an existing variant in the data type to which you are adding the custom variant. Then you can modify the file for the new variant.
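Whether you copy an existing variant or start from scratch, the overall layout of standardizer.xml follows the sections referenced in the steps below. The sketch that follows is only an illustrative outline: the section names (inputSymbols, outputSymbols, cleanser, normalizer) come from this procedure, but the root element, the state-model element, and all attributes are placeholders, not the literal schema.

```xml
<!-- Illustrative outline of standardizer.xml. Only the section names
     referenced in this procedure are taken from the documentation;
     everything else is a placeholder. -->
<standardizer>
  <!-- State model (needed only when the data must be parsed) -->
  <stateModel>
    ...
  </stateModel>
  <!-- Input and output symbols used by the state model -->
  <inputSymbols>
    ...
  </inputSymbols>
  <outputSymbols>
    ...
  </outputSymbols>
  <!-- Cleansing rules applied before tokenization -->
  <cleanser>
    ...
  </cleanser>
  <!-- Normalization rules (for data that is normalized but not parsed) -->
  <normalizer>
    ...
  </normalizer>
</standardizer>
```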


  2. If the data you are processing does not need to be parsed, but only needs to be normalized, define normalization rules in the normalizer section of the file.

    For more information, see Data Normalization Definitions and Standardization Processing Rules Reference.
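As a rough sketch of what a normalization-only configuration might look like, consider normalizing a first-name field against a lexicon of common given names. The rule and attribute names below are hypothetical; see Data Normalization Definitions for the actual elements.

```xml
<!-- Hypothetical normalization rule: the element and attribute
     names here are illustrative, not the exact schema. -->
<normalizer>
  <rule field="FirstName" type="normalizeWithLexicon" lexicon="givenNames.txt"/>
</normalizer>
```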

  3. If the data you are processing needs to be parsed and normalized, define the state model in the upper portion of the file.

    For information about the state model and the elements that define it, see Standardization State Definitions.
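To connect this step to the FSM model in Figure 2, a state model for phone numbers might chain three states: area code, exchange, and line number. The fragment below is a hypothetical sketch; the element names, attributes, and symbol references are illustrative, and the real elements are described in Standardization State Definitions.

```xml
<!-- Hypothetical state-model fragment for a phone-number FSM.
     State, symbol, and attribute names are illustrative only. -->
<stateModel startState="AreaCode">
  <state name="AreaCode"   inputSymbol="threeDigits" outputSymbol="AC" nextState="Exchange"/>
  <state name="Exchange"   inputSymbol="threeDigits" outputSymbol="EX" nextState="LineNumber"/>
  <state name="LineNumber" inputSymbol="fourDigits"  outputSymbol="LN"/>
</stateModel>
```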


    Note –

    The next several steps use the processing rules described in Standardization Processing Rules Reference. Some of these rules might require that you create normalization and lexicon files.


  4. In the inputSymbols section of the file, define each input symbol along with any processing rules.

    For more information, see Input Symbol Definitions.
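Continuing the phone-number sketch, the input symbols would describe the token patterns the parser recognizes. The pattern syntax and processing-rule names below are assumptions for illustration; see Input Symbol Definitions for the supported elements and rules.

```xml
<!-- Hypothetical input-symbol definitions; pattern syntax and
     processing-rule names are illustrative. -->
<inputSymbols>
  <symbol name="threeDigits" pattern="[0-9]{3}"/>
  <symbol name="fourDigits"  pattern="[0-9]{4}">
    <processingRule name="trimWhitespace"/>
  </symbol>
</inputSymbols>
```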

  5. In the outputSymbols section of the file, define each output symbol along with any processing rules.

    For more information, see Output Symbol Definitions.
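The output symbols then map parsed tokens to standardized output fields. Again, the element and field names below are hypothetical placeholders; the actual elements are documented in Output Symbol Definitions.

```xml
<!-- Hypothetical output-symbol definitions mapping parsed tokens
     to standardized output fields; names are illustrative. -->
<outputSymbols>
  <symbol name="AC" outputField="PhoneAreaCode"/>
  <symbol name="EX" outputField="PhoneExchange"/>
  <symbol name="LN" outputField="PhoneLineNumber"/>
</outputSymbols>
```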

  6. In the cleanser section of the file, define any cleansing rules that should be applied to the data before tokenization.

    For more information, see Data Cleansing Definitions.
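For example, phone-number data might be uppercased and stripped of punctuation before tokenization. The rule names below are illustrative assumptions; see Data Cleansing Definitions for the cleansing rules the engine actually supports.

```xml
<!-- Hypothetical cleansing rules applied before tokenization;
     rule names and attributes are illustrative. -->
<cleanser>
  <rule name="uppercase"/>
  <rule name="stripPunctuation" characters="()-."/>
</cleanser>
```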

  7. If you created any rules that reference normalization or lexicon files, continue to Creating Normalization and Lexicon Files.