Standardization State Definitions (Understanding the Master Index Standardization Engine)

Standardization State Definitions

An FSM framework is defined by its states and the transitions between them. Each FSM begins in the start state when it receives an input string. The first recognized input symbol in the input string determines the next state, based on customizable rules defined in the state model section of standardizer.xml. Each subsequent recognized input symbol determines the transition to the following state. This continues until no more symbols are recognized and the termination state is reached.

Below is an excerpt from the state definitions for the PersonName data type. It shows the transitions out of the start state, followed by the definition of the headingFirstName state. In the headingFirstName state, the first name has been processed and the standardization engine is looking for one of the following: a first name (indicating a middle name), a last name, an abbreviation (indicating a middle initial), a conjunction, or a nickname. Each of these symbols is assigned a probability indicating how likely it is to be the next token.

<stateModel name="start">
   <when inputSymbol="salutation" nextState="salutation" 
         outputSymbol="salutation" probability=".15"/>
   <when inputSymbol="givenName" nextState="headingFirstName" 
         outputSymbol="firstName" probability=".6"/>
   <when inputSymbol="abbreviation" nextState="headingFirstName" 
         outputSymbol="firstName" probability=".15"/>
   <when inputSymbol="surname" nextState="trailingLastName" 
         outputSymbol="lastName" probability=".1"/>
   <state name="headingFirstName">
      <when inputSymbol="givenName" nextState="headingMiddleName" 
            outputSymbol="middleName" probability=".4"/>
      <when inputSymbol="surname" nextState="headingLastName" 
            outputSymbol="lastName" probability=".3"/>
      <when inputSymbol="abbreviation" nextState="headingMiddleName" 
            outputSymbol="middleName" probability=".1"/>
      <when inputSymbol="conjunction" nextState="headingFirstName" 
            outputSymbol="conjunction" probability=".1"/>
      <when inputSymbol="nickname" nextState="firstNickname" 
            outputSymbol="nickname" probability=".1"/>
   </state>
   ...
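
To illustrate how the excerpt above is read, the following minimal Python sketch walks a sequence of already-classified input symbols through the two states shown. The `TRANSITIONS` dictionary and the `trace` function are hypothetical; they only mirror the `when` entries in the excerpt and are not the engine's implementation. Multiplying the probabilities along the path is likewise an assumption made for illustration.

```python
from decimal import Decimal

# Hypothetical in-memory mirror of the <when> entries in the excerpt:
# state -> input symbol -> (next state, output symbol, probability).
TRANSITIONS = {
    "start": {
        "salutation": ("salutation", "salutation", Decimal(".15")),
        "givenName": ("headingFirstName", "firstName", Decimal(".6")),
        "abbreviation": ("headingFirstName", "firstName", Decimal(".15")),
        "surname": ("trailingLastName", "lastName", Decimal(".1")),
    },
    "headingFirstName": {
        "givenName": ("headingMiddleName", "middleName", Decimal(".4")),
        "surname": ("headingLastName", "lastName", Decimal(".3")),
        "abbreviation": ("headingMiddleName", "middleName", Decimal(".1")),
        "conjunction": ("headingFirstName", "conjunction", Decimal(".1")),
        "nickname": ("firstNickname", "nickname", Decimal(".1")),
    },
}


def trace(input_symbols, state="start"):
    """Follow one path through TRANSITIONS.

    Returns (output symbols, final state, product of probabilities).
    Stops early if the current state has no rule for a symbol.
    """
    outputs = []
    likelihood = Decimal(1)  # assumption: path likelihood is the product
    for symbol in input_symbols:
        rule = TRANSITIONS.get(state, {}).get(symbol)
        if rule is None:
            break  # symbol not recognized in this state
        state, output, probability = rule
        outputs.append(output)
        likelihood *= probability
    return outputs, state, likelihood


outputs, final_state, likelihood = trace(["givenName", "surname"])
print(outputs, final_state, likelihood)
```

For the input symbols `givenName` followed by `surname`, the sketch outputs `firstName` and `lastName` and ends in the `headingLastName` state, which the excerpt references but does not define.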

The following table lists and describes the XML elements and attributes for the standardization state definitions.

Element	Attribute	Description
stateModel		The primary container element for the state model, which includes the definitions for each state in the FSM. This element contains a series of `when` elements, described below, that define the transitions from the start state to any of the other states. It also contains a series of `state` elements that define the remaining FSM states.
	name	The name of the start state (by default, “start”).
state		A definition for one state in the FSM (not including the start state). Each state element contains a series of `when` elements and attributes, described below, that define the processing flow.
	name	The name of the state. The names defined here are referenced in the `nextState` attributes described below to specify the next state.
when		A statement defining which state to transition to and which symbol to output when a specific input symbol is recognized in each state. These elements define the possible transitions from one state to another.
	inputSymbol	The name of an input symbol that might occur next in the input string. This must match one of the input symbols defined later in the file. For more information about input symbols and their processing logic, see Input Symbol Definitions.
	nextState	The name of the next state to transition to when the specified input symbol is recognized. This must match the name of one of the states defined in the state model section.
	outputSymbol	The name of the symbol that the current state outputs, based on the input symbol, once processing for the state is complete. Not all transitions have an output symbol. When specified, this must match one of the output symbols defined later in the file. For more information, see Output Symbol Definitions.
	probability	The probability that the given input symbol is actually the next symbol in the input string. Probabilities are expressed as a decimal from 0 to 1, inclusive. All probabilities for a given state must add up to 1. If a state definition includes the `eof` element described below, the `eof` probability is included in that sum.
eof	probability	The probability that the FSM has reached the end of the input string in the current state. Probabilities are expressed as a decimal from 0 to 1, inclusive. The sum of this probability and all `when` probabilities for the same state must be 1.
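
The constraints in the table above (probabilities in each state summing to 1, and every `nextState` naming a defined state) can be checked mechanically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The `SAMPLE` state model and the `check_state_model` function are hypothetical and shortened for illustration; the placement of `eof` as a child of a state follows the description in the table, and a real standardizer.xml defines many more states and symbols.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical, shortened state model for illustration only.
SAMPLE = """
<stateModel name="start">
   <when inputSymbol="givenName" nextState="headingFirstName"
         outputSymbol="firstName" probability=".9"/>
   <when inputSymbol="surname" nextState="headingLastName"
         outputSymbol="lastName" probability=".1"/>
   <state name="headingFirstName">
      <when inputSymbol="surname" nextState="headingLastName"
            outputSymbol="lastName" probability=".7"/>
      <eof probability=".3"/>
   </state>
   <state name="headingLastName">
      <eof probability="1"/>
   </state>
</stateModel>
"""


def check_state_model(model):
    """Return a list of problems found in a <stateModel> element."""
    problems = []
    # The start state's <when> elements sit directly under <stateModel>.
    states = {model.get("name"): model}
    for state in model.findall("state"):
        states[state.get("name")] = state

    for name, element in states.items():
        total = Decimal(0)
        for child in element:
            # Both <when> and <eof> probabilities count toward the sum.
            if child.tag in ("when", "eof"):
                total += Decimal(child.get("probability"))
            if child.tag == "when" and child.get("nextState") not in states:
                problems.append(
                    f"{name}: unknown nextState {child.get('nextState')!r}"
                )
        if total != 1:
            problems.append(f"{name}: probabilities sum to {total}, expected 1")
    return problems


problems = check_state_model(ET.fromstring(SAMPLE))
print(problems)
```

The sketch uses `Decimal` rather than floating point so that values such as `.7` and `.3` add up to exactly 1 without rounding error.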