Understanding the Master Index Standardization Engine

Input Symbol Definitions

The input symbol definitions name each input symbol recognized by the states and define its processing logic. For each state, each possible input symbol is tried according to the rules defined here, and then the probability that it is the next token is assessed. Each input symbol might be subject to preprocessing, token matching, and postprocessing. Preprocessing can include removing punctuation or other regular expression substitutions. The value can then be matched against values in the lexicon file or against regular expressions. If the value matches, it can then be normalized based on the specified normalization file or on pattern replacement. One input symbol can have multiple preprocessing, matching, and postprocessing iterations. If there are multiple iterations, each is carried out in turn until a match is found. All of these steps are optional.
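The iteration order described above can be sketched in Python. This is only an illustrative model, not the engine's implementation: the `Matcher` class, the `process_symbol` function, and the sample suffix values are hypothetical, and the sketch assumes that each iteration starts from the original input value.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Matcher:
    """Hypothetical model of one processing iteration; every step is optional."""
    pre: Optional[Callable[[str], str]] = None     # preprocessing
    match: Optional[Callable[[str], bool]] = None  # lexicon or regex test
    post: Optional[Callable[[str], str]] = None    # normalization


def process_symbol(value: str, matchers: List[Matcher]) -> Optional[str]:
    """Carry out each iteration in turn until one matches; None if none do."""
    for m in matchers:
        # Assumption: each iteration preprocesses the original value.
        candidate = m.pre(value) if m.pre else value
        if m.match and not m.match(candidate):
            continue  # no match, so try the next iteration
        return m.post(candidate) if m.post else candidate
    return None


# Hypothetical example: a lexicon-style test first, then a regex fallback.
SUFFIX_MATCHERS = [
    Matcher(pre=lambda v: v.replace(".", ""), match=lambda v: v in {"Jr", "Sr"}),
    Matcher(match=lambda v: re.fullmatch(r"[IVX]+", v) is not None),
]

print(process_symbol("Jr.", SUFFIX_MATCHERS))
print(process_symbol("III", SUFFIX_MATCHERS))
```

The second iteration is reached only when the first one fails to match, which mirrors the "until a match is found" behavior.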

Below is an excerpt from the input symbol definitions for PersonName processing. This excerpt processes the salutation portion of the input string by first removing periods, then comparing the value against the entries in the salutation.txt file, and finally normalizing the matched value based on the corresponding entry in the salutationNormalization.txt file. For example, if the value to process is “Mr.”, it is first changed to “Mr”, matched against a list of salutations, and then converted to “Mister” based on the entry in the normalization file.


<inputSymbol name="salutation">
   <matchers>
      <matcher>
         <preProcessing>
            <replaceAll regex="\." replacement=""/>
         </preProcessing>
         <lexicon resource="salutation.txt"/>
         <postProcessing>
            <dictionary resource="salutationNormalization.txt" separator="\|"/>
         </postProcessing>
      </matcher>
   </matchers>
</inputSymbol>
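
As a rough illustration of what this excerpt does to the value "Mr.", the following Python sketch simulates the three steps. The file contents are hypothetical stand-ins, since the actual entries of salutation.txt and salutationNormalization.txt are not shown here. The helper names, the case-sensitive comparison, and passing an unmapped value through unchanged are also assumptions.

```python
import re

# Hypothetical file contents, used only for this illustration.
SALUTATION_TXT = "Mr\nMrs\nDr\n"
SALUTATION_NORMALIZATION_TXT = "Mr|Mister\nDr|Doctor\n"


def load_lexicon(text):
    """Read one lexicon entry per non-empty line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def load_dictionary(text, separator_regex):
    """Read key/value pairs split on the separator regular expression."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            key, value = re.split(separator_regex, line.strip(), maxsplit=1)
            table[key] = value
    return table


def match_salutation(value):
    # Preprocessing: <replaceAll regex="\." replacement=""/>
    cleaned = re.sub(r"\.", "", value)
    # Matching: <lexicon resource="salutation.txt"/>
    if cleaned not in load_lexicon(SALUTATION_TXT):
        return None
    # Postprocessing: <dictionary resource="salutationNormalization.txt" separator="\|"/>
    normalization = load_dictionary(SALUTATION_NORMALIZATION_TXT, r"\|")
    # Assumption: a matched value with no dictionary entry is kept as-is.
    return normalization.get(cleaned, cleaned)


print(match_salutation("Mr."))  # prints "Mister"
```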

The following table lists and describes the XML elements and attributes for the input symbol definitions.

| Element | Attribute | Description |
|---|---|---|
| inputSymbol | | A container element for the processing logic for one input symbol. |
| | name | The name of the input symbol to which the following logic applies. |
| matchers | | A list of processing definitions, each of which defines one preprocessing, matching, and postprocessing sequence. Not all definitions include all three steps. |
| matcher | | A processing definition for one sequence of preprocessing, matching, and postprocessing. A processing definition might contain only one of the three steps or any combination of them. |
| | factor | A factor to apply to the probability specified for the input symbol in the state definition. For example, if the state definition probability is 0.4 and this factor is 0.25, then the probability for this matching sequence is 0.1. Only define this attribute when the probability for this matching sequence is very low. |
| preProcessing | | A container element for the preprocessing rules to be carried out against an input symbol. For more information about the rules you can use, see Standardization Processing Rules Reference. |
| lexicon | resource | The name of the lexicon file containing the list of values to match the input symbol against. Note: You can also match against patterns or regular expressions. For more information, see matchAllPatterns and pattern in Standardization Processing Rules Reference. |
| postProcessing | | A container element for the postprocessing rules to be carried out against an input symbol that has been matched. For more information about the rules you can use, see Standardization Processing Rules Reference. |
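
The arithmetic behind the factor attribute can be checked with a short Python sketch. The function name is hypothetical, and treating an omitted factor as 1 (leaving the state definition probability unchanged) is an assumption rather than documented behavior.

```python
import math


def matcher_probability(state_probability, factor=1.0):
    """Scale the state definition probability by the matcher's factor.

    Assumption: when no factor is defined, the probability is unchanged.
    """
    return state_probability * factor


# The example from the factor description: 0.4 scaled by 0.25.
result = matcher_probability(0.4, 0.25)
print(math.isclose(result, 0.1))  # prints True
```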