Understanding the Master Index Standardization Engine

Standardization State Definitions

An FSM framework is defined by its states and the transitions between them. Each FSM begins in a start state when it receives an input string. The first recognized input symbol in the input string determines the next state, based on customizable rules defined in the state model section of standardizer.xml. Each subsequent recognized input symbol determines the transition to the state after that. This continues until no more symbols are recognized and the termination state is reached.

Below is an excerpt from the state definitions for the PersonName data type. It shows the transitions out of the start state, followed by the definition of the headingFirstName state. In the headingFirstName state, the first name has been processed and the standardization engine is looking for one of the following: a first name (indicating a middle name), a last name, an abbreviation (indicating a middle initial), a conjunction, or a nickname. Each of these symbols is assigned a probability indicating how likely it is to be the next token.


<stateModel name="start">
   <when inputSymbol="salutation" nextState="salutation" 
         outputSymbol="salutation" probability=".15"/>
   <when inputSymbol="givenName" nextState="headingFirstName" 
         outputSymbol="firstName" probability=".6"/>
   <when inputSymbol="abbreviation" nextState="headingFirstName" 
         outputSymbol="firstName" probability=".15"/>
   <when inputSymbol="surname" nextState="trailingLastName" 
         outputSymbol="lastName" probability=".1"/>
   <state name="headingFirstName">
      <when inputSymbol="givenName" nextState="headingMiddleName" 
            outputSymbol="middleName" probability=".4"/>
      <when inputSymbol="surname" nextState="headingLastName" 
            outputSymbol="lastName" probability=".3"/>
      <when inputSymbol="abbreviation" nextState="headingMiddleName" 
            outputSymbol="middleName" probability=".1"/>
      <when inputSymbol="conjunction" nextState="headingFirstName" 
            outputSymbol="conjunction" probability=".1"/>
      <when inputSymbol="nickname" nextState="firstNickname" 
            outputSymbol="nickname" probability=".1"/>
   </state>
   ...
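
The transitions in the excerpt above can be traced by hand with a short Python sketch. This is only an illustration, not the engine's implementation: the `TRANSITIONS` table and the `trace` function are hypothetical names, the table covers only the two states shown in the excerpt, the input is assumed to be already classified into input symbols, and the probabilities are carried along but not used.

```python
# Transition table copied from the excerpt above:
# state -> input symbol -> (next state, output symbol, probability).
# Only the two states shown in the excerpt are included; the remaining
# states are elided in the source and are not guessed at here.
TRANSITIONS = {
    "start": {
        "salutation": ("salutation", "salutation", 0.15),
        "givenName": ("headingFirstName", "firstName", 0.6),
        "abbreviation": ("headingFirstName", "firstName", 0.15),
        "surname": ("trailingLastName", "lastName", 0.1),
    },
    "headingFirstName": {
        "givenName": ("headingMiddleName", "middleName", 0.4),
        "surname": ("headingLastName", "lastName", 0.3),
        "abbreviation": ("headingMiddleName", "middleName", 0.1),
        "conjunction": ("headingFirstName", "conjunction", 0.1),
        "nickname": ("firstNickname", "nickname", 0.1),
    },
}


def trace(input_symbols, start_state="start"):
    """Follow transitions for a list of already-classified input symbols.

    Returns (final_state, output_symbols). Stops early when the current
    state has no rule for the next symbol or is not in the table.
    """
    state = start_state
    outputs = []
    for symbol in input_symbols:
        rules = TRANSITIONS.get(state, {})
        if symbol not in rules:
            break
        next_state, output_symbol, _probability = rules[symbol]
        outputs.append(output_symbol)
        state = next_state
    return state, outputs


final_state, outputs = trace(["givenName", "surname"])
print(final_state, outputs)  # headingLastName ['firstName', 'lastName']
```

A given name followed by a surname moves from start to headingFirstName and then to headingLastName, producing the output symbols firstName and lastName along the way.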

The following table lists and describes the XML elements and attributes for the standardization state definitions.

| Element | Attribute | Description |
|---|---|---|
| stateModel | | The primary container element for the state model, which includes the definitions for each state in the FSM. This element contains a series of when elements, as described below, that define the transitions from the start state to any of the other states. It also contains a series of state elements that define the remaining FSM states. |
| | name | The name of the start state (by default, “start”). |
| state | | A definition for one state in the FSM (not including the start state). Each state element contains a series of when elements and attributes, as described below, that define the processing flow. |
| | name | The name of the state. The names defined here are referenced in the nextState attributes described below to specify the next state. |
| when | | A statement defining which state to transition to and which symbol to output when a specific input symbol is recognized in a given state. These elements define the possible transitions from one state to another. |
| | inputSymbol | The name of an input symbol that might occur next in the input string. This must match one of the input symbols defined later in the file. For more information about input symbols and their processing logic, see Input Symbol Definitions. |
| | nextState | The name of the next state to transition to when the specified input symbol is recognized. This must match the name of one of the states defined in the state model section. |
| | outputSymbol | The name of the symbol that the current state produces, based on the input symbol, when processing is complete for the state. Not all transitions have an output symbol. This must match one of the output symbols defined later in the file. For more information, see Output Symbol Definitions. |
| | probability | The probability that the given input symbol is actually the next symbol in the input string. Probabilities are expressed as a decimal value between 0 and 1, inclusive. All probabilities for a given state must add up to 1. If a state definition includes the eof element described below, all probabilities, including the eof probability, must add up to 1. |
| eof | probability | The probability that the FSM has reached the end of the input string in the current state. Probabilities are expressed as a decimal value between 0 and 1, inclusive. The sum of this probability and all other probabilities for a given state must be 1. |
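
The rule that the probabilities for each state must add up to 1 can be checked mechanically. The following Python sketch uses the standard library's `xml.etree.ElementTree` parser. It is not part of the product: `probability_sums` and `invalid_states` are hypothetical helper names, `SAMPLE` is the excerpt shown earlier with the elided states left out and the stateModel element closed so that it parses, and the code assumes that every when and eof element carries a probability attribute and that eof appears as a child of the state it belongs to.

```python
import math
import xml.etree.ElementTree as ET

# The excerpt shown earlier, closed so that it is well-formed XML.
# The states elided in the source are simply left out.
SAMPLE = """
<stateModel name="start">
   <when inputSymbol="salutation" nextState="salutation"
         outputSymbol="salutation" probability=".15"/>
   <when inputSymbol="givenName" nextState="headingFirstName"
         outputSymbol="firstName" probability=".6"/>
   <when inputSymbol="abbreviation" nextState="headingFirstName"
         outputSymbol="firstName" probability=".15"/>
   <when inputSymbol="surname" nextState="trailingLastName"
         outputSymbol="lastName" probability=".1"/>
   <state name="headingFirstName">
      <when inputSymbol="givenName" nextState="headingMiddleName"
            outputSymbol="middleName" probability=".4"/>
      <when inputSymbol="surname" nextState="headingLastName"
            outputSymbol="lastName" probability=".3"/>
      <when inputSymbol="abbreviation" nextState="headingMiddleName"
            outputSymbol="middleName" probability=".1"/>
      <when inputSymbol="conjunction" nextState="headingFirstName"
            outputSymbol="conjunction" probability=".1"/>
      <when inputSymbol="nickname" nextState="firstNickname"
            outputSymbol="nickname" probability=".1"/>
   </state>
</stateModel>
"""


def probability_sums(xml_text):
    """Return {state name: sum of when/eof probabilities} for each state."""
    root = ET.fromstring(xml_text)
    sums = {}
    # The stateModel element itself holds the start state's transitions,
    # so it is checked alongside every state element.
    for element in [root, *root.iter("state")]:
        total = sum(
            float(child.get("probability"))
            for child in element  # direct children only
            if child.tag in ("when", "eof")
        )
        sums[element.get("name")] = total
    return sums


def invalid_states(xml_text, tolerance=1e-9):
    """Return only the states whose probabilities do not sum to 1."""
    return {
        name: total
        for name, total in probability_sums(xml_text).items()
        if not math.isclose(total, 1.0, abs_tol=tolerance)
    }


print(invalid_states(SAMPLE))  # {} because both states in the excerpt sum to 1
```

The comparison uses a small tolerance rather than exact equality because decimal values such as .1 are not represented exactly in binary floating point.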