Understanding the Master Index Standardization Engine

Standardization Processing Rules Reference

The Master Index Standardization Engine provides several matching and transformation rules for input values and patterns. You can add or modify any of these rules in the existing process definition files (standardizer.xml). Several of these rules use regular expressions to define patterns and values. See the Javadoc for java.util.regex for more information about regular expressions.

The available rules include the following:

dictionary

This rule checks the input value against a list of values in the specified normalization file and, if the value is found, converts the input value to its normalized value. This rule is generally used for postprocessing but can also be used for preprocessing tokens. The normalization files are located in the same directory as the process definition file (the instance folder for the data type or variant).

The syntax for dictionary is:


<dictionary resource="file_name" separator="delimiter"/>

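The parameters for dictionary are:

resource: The name of the normalization file to check the input value against.
separator: The character that separates each input value from its normalized value in the normalization file.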


Example 1 Sample dictionary Rule

The following sample checks the input value against the list in the first column of the givenNameNormalization.txt file, which uses a pipe symbol (|) to separate the input value from its normalized version. When a value is matched, the input value is converted to its normalized version.


<dictionary resource="givenNameNormalization.txt" separator="\|"/>
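
The lookup that a dictionary rule describes can be sketched in Java as follows. This is a minimal illustration, not the engine's implementation. It assumes that each line of the normalization file holds one input value and one normalized value, that the separator attribute is applied as a regular expression, and that unmatched input passes through unchanged. The class name, method names, and sample entries are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class DictionarySketch {
    // Builds a lookup table from lines such as "BILL|WILLIAM".
    // Assumption: the separator is treated as a regular expression.
    static Map<String, String> load(List<String> lines, String separatorRegex) {
        Pattern separator = Pattern.compile(separatorRegex);
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] parts = separator.split(line, 2);
            if (parts.length == 2) {
                table.put(parts[0].trim(), parts[1].trim());
            }
        }
        return table;
    }

    // Returns the normalized value, or the input unchanged when no entry matches
    // (the pass-through behavior is an assumption).
    static String normalize(Map<String, String> table, String input) {
        return table.getOrDefault(input, input);
    }

    public static void main(String[] args) {
        // Hypothetical file contents, for illustration only.
        Map<String, String> table = load(List.of("BILL|WILLIAM", "LIZ|ELIZABETH"), "\\|");
        System.out.println(normalize(table, "BILL"));  // prints WILLIAM
        System.out.println(normalize(table, "MARIA")); // prints MARIA
    }
}
```

The pipe is escaped in the sample rule (separator="\|"), which is consistent with a regular-expression separator, because an unescaped pipe means alternation in java.util.regex.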

fixedString

This rule checks the input value against a fixed value. This is generally used for the token matching step for input symbol processing. You can define a list of fixed strings for an input symbol by enclosing multiple fixedString elements within a fixedStrings element. The syntax for fixedString is:


<fixedString>string</fixedString>

The parameter for fixedString is:

string: The fixed value to compare against the input value.


Example 2 Sample fixedString Rules

The following sample matches the input value against the fixed values “AND”, “OR” and “AND/OR”. If one of the fixed values matches the input string, processing is continued for that matcher definition. If no fixed values match the input string, processing is stopped for that matcher definition and the next matcher definition is processed (if one exists).


<fixedStrings>
   <fixedString>AND</fixedString>
   <fixedString>OR</fixedString>
   <fixedString>AND/OR</fixedString>
</fixedStrings>
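
The fall-through behavior described above (try one matcher definition, then move on to the next if none of its fixed strings match) can be sketched as follows. This is only an illustration under stated assumptions: the comparison is taken to be exact and case-sensitive, and the second matcher definition ("C/O") is hypothetical.

```java
import java.util.List;
import java.util.Set;

class FixedStringSketch {
    // Returns the index of the first matcher definition whose fixed strings
    // contain the input, or -1 when no definition matches.
    // Assumption: matching is exact and case-sensitive.
    static int firstMatchingMatcher(List<Set<String>> matchers, String input) {
        for (int i = 0; i < matchers.size(); i++) {
            if (matchers.get(i).contains(input)) {
                return i; // processing continues with this matcher definition
            }
            // no fixed value matched: fall through to the next definition
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Set<String>> matchers = List.of(
                Set.of("AND", "OR", "AND/OR"), // the sample above
                Set.of("C/O"));                // hypothetical second definition
        System.out.println(firstMatchingMatcher(matchers, "AND/OR")); // prints 0
        System.out.println(firstMatchingMatcher(matchers, "C/O"));    // prints 1
        System.out.println(firstMatchingMatcher(matchers, "NOT"));    // prints -1
    }
}
```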

lexicon

This rule checks the input value against a list of values in the specified lexicon file. This rule is generally used for token matching. The lexicon files are located in the same directory as the process definition file (the instance folder for the data type or variant).

The syntax for lexicon is:


<lexicon resource="file_name"/>

The parameter for lexicon is:

resource: The name of the lexicon file to check the input value against.


Example 3 Sample lexicon Rule

The following sample checks the input value against the list in the givenName.txt file. When a value is matched, the standardization engine continues to the postprocessing phase if one is defined.


<lexicon resource="givenName.txt"/>

normalizeSpace

This rule removes leading and trailing white space from a string and changes multiple spaces in the middle of a string to a single space. The syntax for normalizeSpace is:


<normalizeSpace/>

Example 4 Sample normalizeSpace Rule

The following sample removes the leading and trailing white space from a last name field prior to checking the input value against the surnames.txt file.


<matcher>
   <preProcessing>
     <normalizeSpace/>
   </preProcessing>
   <lexicon resource="surnames.txt"/>
</matcher>
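
The effect of running normalizeSpace before a lexicon lookup can be sketched in Java as follows. This is a rough equivalent rather than the engine's code: it assumes that all whitespace characters (not only spaces) are collapsed, and the lexicon entries are hypothetical stand-ins for the contents of surnames.txt.

```java
import java.util.Set;

class NormalizeSpaceSketch {
    // Trims leading and trailing white space and collapses inner runs
    // of white space to a single space.
    static String normalizeSpace(String input) {
        return input.trim().replaceAll("\\s+", " ");
    }

    // Preprocesses the input, then checks it against the lexicon entries.
    static boolean matchesLexicon(Set<String> lexicon, String input) {
        return lexicon.contains(normalizeSpace(input));
    }

    public static void main(String[] args) {
        // Hypothetical lexicon contents, for illustration only.
        Set<String> surnames = Set.of("VAN DER BERG", "SMITH");
        System.out.println(normalizeSpace("  VAN   DER  BERG "));           // prints VAN DER BERG
        System.out.println(matchesLexicon(surnames, "  VAN   DER  BERG ")); // prints true
    }
}
```

Without the preprocessing step, the padded input would not equal any lexicon entry, which is why the sample places normalizeSpace ahead of the lexicon rule.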

pattern

This rule checks the input value against a specific regular expression to see if the patterns match. You can define a sequence of patterns by including them all in order in a matchAllPatterns element. You can also specify sub-patterns to exclude. The syntax for pattern is:


<pattern regex="regex_pattern"/>

The parameter for pattern is:

regex: The regular expression to match the input value against.

The pattern rule can be further customized by adding exceptFor rules that define patterns to exclude in the matching process. The syntax for exceptFor is:


<pattern regex="regex_pattern">
   <exceptFor regex="regex_pattern"/>
</pattern>

The parameter for exceptFor is:

regex: The regular expression that defines the pattern to exclude from matching.


Example 5 Sample pattern Rule

The following sample checks the input value against the sequence of patterns to see if the input value might be an area code. These rules specify a pattern that matches three digits contained in parentheses, such as (310).


<matchAllPatterns>
   <pattern regex="\("/>
   <pattern regex="[0-9]{3}"/>
   <pattern regex="\)"/>
</matchAllPatterns>

The following sample checks the input value to see if its pattern is a series of three letters excluding THE and AND.


<pattern regex="[A-Z]{3}">
   <exceptFor regex="THE"/>
   <exceptFor regex="AND"/>
</pattern>
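
Both pattern samples can be approximated with java.util.regex, as in the following sketch. It rests on two assumptions that this reference does not confirm: each pattern must match its input in full (not just a substring), and the patterns inside matchAllPatterns are applied to consecutive tokens, one pattern per token. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

class PatternSketch {
    // True when the input fully matches the regex and none of the
    // exceptFor patterns.
    static boolean matches(String regex, List<String> exceptFor, String input) {
        if (!Pattern.matches(regex, input)) {
            return false;
        }
        for (String excluded : exceptFor) {
            if (Pattern.matches(excluded, input)) {
                return false;
            }
        }
        return true;
    }

    // True when each token fully matches the pattern at the same position.
    static boolean matchesAll(List<String> regexes, List<String> tokens) {
        if (regexes.size() != tokens.size()) {
            return false;
        }
        for (int i = 0; i < regexes.size(); i++) {
            if (!Pattern.matches(regexes.get(i), tokens.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> areaCode = List.of("\\(", "[0-9]{3}", "\\)");
        System.out.println(matchesAll(areaCode, List.of("(", "310", ")"))); // prints true
        System.out.println(matches("[A-Z]{3}", List.of("THE", "AND"), "ABC")); // prints true
        System.out.println(matches("[A-Z]{3}", List.of("THE", "AND"), "THE")); // prints false
    }
}
```

In a Java string literal the backslash is doubled ("\\("), whereas the XML attribute carries the regular expression as written ("\(").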

replace

This rule checks the input value for a specific pattern. If the pattern is found, it is replaced by the specified replacement. This rule replaces only the first instance of the pattern that it finds. The syntax for replace is:


<replace regex="regex_pattern" replacement="regex_pattern"/>

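The parameters for replace are:

regex: The regular expression to search for in the input value.
replacement: The value that replaces the first match.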


Example 6 Sample replace Rule

The following sample tries to match the input value against “ST.”. If a match is found, the standardization engine replaces the value with “SAINT”.


<replace regex="ST\." replacement="SAINT"/>

replaceAll

This rule checks the input value for a specific pattern. If the pattern is found, all instances are replaced by a new pattern. The syntax for replaceAll is:


<replaceAll regex="regex_pattern" replacement="regex_pattern"/>

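The parameters for replaceAll are:

regex: The regular expression to search for in the input value.
replacement: The value that replaces every match.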


Example 7 Sample replaceAll Rule

The following sample finds all periods in the input value and replaces each one with an empty string, which removes the periods from the value.


<replaceAll regex="\." replacement=""/>
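
The difference between replace and replaceAll mirrors the difference between String.replaceFirst and String.replaceAll in Java. The following sketch assumes that the engine follows java.util.regex semantics for both rules, which this reference suggests but does not state outright. The class name and sample inputs are hypothetical.

```java
class ReplaceSketch {
    // Counterpart of the replace rule: only the first match is replaced.
    static String replace(String input, String regex, String replacement) {
        return input.replaceFirst(regex, replacement);
    }

    // Counterpart of the replaceAll rule: every match is replaced.
    static String replaceAll(String input, String regex, String replacement) {
        return input.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        System.out.println(replace("ST. PAUL ST.", "ST\\.", "SAINT"));    // prints SAINT PAUL ST.
        System.out.println(replaceAll("ST. PAUL ST.", "ST\\.", "SAINT")); // prints SAINT PAUL SAINT
        System.out.println(replaceAll("J.R. SMITH", "\\.", ""));          // prints JR SMITH
    }
}
```

If the engine does delegate to java.util.regex, note that a dollar sign or backslash in the replacement value has a special meaning there (group references and escapes) and may need escaping.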

transliterate

This rule converts the specified characters in the input string to a new set of characters. It is typically used to convert from one alphabet to another or to add or remove diacritical marks. The syntax for transliterate is:


<transliterate from="existing_char" to="new_char"/>

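The parameters for transliterate are:

from: The characters in the input string to convert.
to: The characters that replace them, listed in the same order as the characters in from.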


Example 8 Sample transliterate Rule

The following sample converts lower case vowels with acute accents to vowels with no accents.


<transliterate from="áéíóú" to="aeiou"/>
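
A character-by-character mapping of this kind can be sketched as follows. The sketch assumes that the characters in from and to correspond by position and that accented letters arrive as single precomposed characters. Neither assumption is confirmed by this reference. The accented vowels are written as Unicode escapes so that the source compiles regardless of file encoding.

```java
class TransliterateSketch {
    // Replaces each character found in 'from' with the character at the
    // same position in 'to'; all other characters are copied unchanged.
    static String transliterate(String input, String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("from and to must have the same length");
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int index = from.indexOf(c);
            out.append(index >= 0 ? to.charAt(index) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escapes spell the five lower-case vowels with acute accents.
        String from = "\u00e1\u00e9\u00ed\u00f3\u00fa";
        // The input is "jose martinez" with accents on the e and the i.
        System.out.println(transliterate("jos\u00e9 mart\u00ednez", from, "aeiou")); // prints jose martinez
    }
}
```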

uppercase

This rule converts all characters in the input string to upper case. The rule does not take any parameters. The syntax for uppercase is:


<uppercase/>

Example 9 Sample uppercase Rule

The following sample converts the entire input string to upper case before performing any pattern or value replacements. Because these rules are defined in the cleanser section, they run before tokenization.


<cleanser>
   <uppercase/>
   <replaceAll regex="\." replacement=". "/>
   <replaceAll regex="AND / OR" replacement="AND/OR"/>
   ...
</cleanser>
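
The three rules shown in this cleanser section behave like a small pipeline in which each rule receives the output of the previous one. The following sketch reproduces that order in Java. It assumes java.util.regex semantics and a locale-independent upper-case conversion (Locale.ROOT), neither of which is confirmed here. The class name and sample input are hypothetical, and the rules elided by the ellipsis in the sample are not represented.

```java
import java.util.Locale;

class CleanserSketch {
    // Applies the three rules from the sample in the order they are listed.
    static String cleanse(String input) {
        String value = input.toUpperCase(Locale.ROOT);       // <uppercase/>
        value = value.replaceAll("\\.", ". ");               // add a space after each period
        value = value.replaceAll("AND / OR", "AND/OR");      // collapse the spaced form
        return value;
    }

    public static void main(String[] args) {
        System.out.println(cleanse("st.john and / or smith")); // prints ST. JOHN AND/OR SMITH
    }
}
```

The order matters: because the input is converted to upper case first, the later AND / OR replacement also catches input that was entered in lower case.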