Oracle® Healthcare Master Person Index Standardization Engine Reference
Release 1.1

Part Number E18471-01

3 Finite State Machine Framework Configuration

This chapter provides conceptual information about the Finite State Machine (FSM) framework configuration. It also provides data and examples for you to use when you set up FSM-based person name and FSM-based telephone number configuration.

This chapter includes the following sections:

Learning About the FSM Framework Configuration

In the FSM framework, the state model definition, along with all the token processing logic, is provided in configuration files in XML format. In addition, lexicon and normalization files define logic used by the OHMPI Standardization Engine to recognize and normalize specific values for each data type or variant. The standardization configuration files for the OHMPI Standardization Engine must follow certain rules for formatting and interdependencies. The following topics provide an overview of the types of configuration files provided for standardization.

The configuration of the finite state machine (FSM) includes defining the various states, transitions between those states, and any actions to perform during each state. Each instance of the FSM begins in the start state. In each state, the standardization engine looks for the next token (or input symbol), optionally performs certain actions against the token, determines the potential output symbols, and then uses probability-based logic to determine the output symbol to generate for the state and how to transition to the next state. Within each state, only the input symbols defined for that state are recognized. When an input symbol is recognized, the processing defined for that symbol is carried out and the transition to the next state occurs. Note that some input symbols might trigger a transition back to the current state. Once the standardization engine does not recognize any input symbols, the FSM reaches a terminal state from which no further transitions are made.
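The selection logic described above can be sketched in Python. This is an illustrative model only, built around a hypothetical hand-written transition table; in the actual engine, states, transitions, and probabilities are defined declaratively in standardizer.xml, not in code.

```python
# Illustrative sketch only: the real engine reads transitions and
# probabilities from standardizer.xml rather than from code.
TRANSITIONS = {
    "start": [
        # (input_symbol, next_state, output_symbol, probability)
        ("givenName", "headingFirstName", "firstName", 0.6),
        ("surname", "trailingLastName", "lastName", 0.1),
    ],
}

def step(state, recognized):
    """Choose the most probable transition among recognized input symbols.

    recognized -- the set of input symbols that matched the next token.
    Returns (next_state, output_symbol), or None for a terminal state.
    """
    viable = [t for t in TRANSITIONS.get(state, []) if t[0] in recognized]
    if not viable:
        return None  # no recognized input symbol: the FSM terminates
    _symbol, next_state, output, _prob = max(viable, key=lambda t: t[3])
    return next_state, output
```

For example, if both a given name and a surname are recognized as candidates for the next token in the start state, the higher-probability givenName transition wins.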

You can define specialized processing rules for each input symbol in the state model. These rules include cleansing and data transformation logic, such as converting data to uppercase, removing punctuation, comparing the input value against a list of values, and so on. Both the state model and the processing rules are defined in the process definition file, standardizer.xml. The lists that you can use to compare and normalize values for each input symbol are contained in lexicon and normalization files.

The configuration files that configure the standardization engine are stored in the master person index project and appear as nodes in the Standardization Engine node of the project. The standardization files are separated into subsets that are each unique to a specific data type, which are further grouped into variants on those data types. You can define additional standardization file subsets to create new variants or even create new data types, such as automotive parts, inventory items, and so on.

The following topics provide information about the files you can configure or create to customize how your data is standardized:

Process Definition File

The process definition file (standardizer.xml) is the primary configuration file for standardization. It defines the state model, input and output symbol definitions, preprocessing and postprocessing rules, and normalization rules for any type of standardization. Using a domain-specific markup language, you can configure any type of standardization without having to code a new Java package. Each process definition file defines the different stages of processing data for one data type or variant. The process definition file is stored in the resource folder under the data type or variant it defines.

The process definition file is divided into six primary sections, which are described in the following sections:

The processing flow is defined in the state definitions. The input symbol definitions specify the token preprocessing, matching, and postprocessing logic. This is the logic carried out for each input token in a given state. The output symbols define the output for each state. The data cleansing definitions specify any transformations made to the input string prior to tokenization. Normalization definitions are used for data that does not need to be tokenized, but only needs to be normalized and optionally phonetically encoded. For example, if the input text provides the first name in its own field, the middle name in its own field, and so on, then only the normalization definitions are used to standardize the data. The standardization processing rules can be used in all sections except the standardization state definitions.

Standardization State Definitions

An FSM framework is defined by its different states and transitions between states. Each FSM begins with a start state when it receives an input string. The first recognized input symbol in the input string determines the next state based on customizable rules defined in the state model section of standardizer.xml. The next recognized input symbol determines the transition to the next state. This continues until no symbols are recognized and the termination state is reached.

Below is an excerpt from the state definitions for the PersonName data type. In the headingFirstName state, the first name has been processed and the standardization engine is looking for one of the following: a first name (indicating a middle name), a last name, an abbreviation (indicating a middle initial), a conjunction, or a nickname. A probability is given for each of these symbols indicating how likely it is to be the next token.

<stateModel name="start">
   <when inputSymbol="salutation" nextState="salutation" 
         outputSymbol="salutation" probability=".15"/>
   <when inputSymbol="givenName" nextState="headingFirstName" 
         outputSymbol="firstName" probability=".6"/>
   <when inputSymbol="abbreviation" nextState="headingFirstName" 
         outputSymbol="firstName" probability=".15"/>
   <when inputSymbol="surname" nextState="trailingLastName" 
         outputSymbol="lastName" probability=".1"/>
   <state name="headingFirstName">
      <when inputSymbol="givenName" nextState="headingMiddleName" 
            outputSymbol="middleName" probability=".4"/>
      <when inputSymbol="surname" nextState="headingLastName" 
            outputSymbol="lastName" probability=".3"/>
      <when inputSymbol="abbreviation" nextState="headingMiddleName" 
            outputSymbol="middleName" probability=".1"/>
      <when inputSymbol="conjunction" nextState="headingFirstName" 
            outputSymbol="conjunction" probability=".1"/>
      <when inputSymbol="nickname" nextState="firstNickname" 
            outputSymbol="nickname" probability=".1"/>
   </state>
   ...

The following table lists and describes the XML elements and attributes for the standardization state definitions.

Element Attribute Description
stateModel   The primary container element for the state model that includes the definitions for each state in the FSM. This element contains a series of when elements as described below to define the transitions from the start element to any of the other states. It also contains a series of state elements that define the remaining FSM states.
  name The name of the start state (by default, “start”).
state   A definition for one state in the FSM (not including the start state). Each state element contains a series of when elements and attributes as described above to define the processing flow.
  name The name of the state. The names defined here are referenced in the nextState attributes described below to specify the next state.
when   A statement defining which state to transition to and which symbol to output when a specific input symbol is recognized in each state. These elements define the possible transitions from one state to another.
  inputSymbol The name of an input symbol that might occur next in the input string. This must match one of the input symbols defined later in the file. For more information about input symbols and their processing logic, see Input Symbol Definitions.
  nextState The name of the next state to transition to when the specified input symbol is recognized. This must match the name of one of the states defined in the state model section.
  outputSymbol The name of the symbol that the current state produces when processing is complete for the state based on the input symbol. Not all transitions have an output symbol. This must match one of the output symbols defined later in the file. For more information, see Output Symbol Definitions.
  probability The probability that the given input symbol is actually the next symbol in the input string. Probabilities are expressed as a decimal between 0 and 1, inclusive. All probabilities for a given state must add up to 1; if a state definition includes the eof element described below, the eof probability is included in that sum.
eof probability The probability that the FSM has reached the end of the input string in the current state. Probabilities are expressed as a decimal between 0 and 1, inclusive. The sum of this probability and all other probabilities for the state must be 1.

Input Symbol Definitions

The input symbol definitions name and define processing logic for each input symbol recognized by the states. For each state, each possible input symbol is tried according to the rules defined here, and then the probability that it is the next token is assessed. Each input symbol might be subject to preprocessing, token matching, and postprocessing. Preprocessing can include removing punctuation or other regular expression substitutions. The value can then be matched against values in the lexicon file or against regular expressions. If the value matches, it can then be normalized based on the specified normalization file or on pattern replacement. One input symbol can have multiple preprocessing, matching, and postprocessing iterations. If there are multiple iterations, each is carried out in turn until a match is found. All of these steps are optional.

Below is an excerpt from the input symbol definitions for PersonName processing. This excerpt processes the salutation portion of the input string by first removing periods, then comparing the value against the entries in the salutation.txt file, and finally normalizing the matched value based on the corresponding entry in the salutationNormalization.txt file. For example, if the value to process is “Mr.”, it is first changed to “Mr” and then matched against a list of salutations before it is converted to “Mister” based on the entry in the normalization file.

<inputSymbol name="salutation">
   <matchers>
      <matcher>
         <preProcessing>
            <replaceAll regex="\." replacement=""/>
         </preProcessing>
         <lexicon resource="salutation.txt"/>
         <postProcessing>
            <dictionary resource="salutationNormalization.txt" separator="\|"/>
         </postProcessing>
      </matcher>
   </matchers>
</inputSymbol>
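The three matcher steps above can be mimicked in Python to make the flow concrete. The lexicon and normalization entries below are hypothetical stand-ins for salutation.txt and salutationNormalization.txt, not their actual contents.

```python
import re

# Hypothetical stand-ins for salutation.txt and salutationNormalization.txt.
SALUTATIONS = {"MR", "MRS", "MS", "DR"}
NORMALIZED = {"MR": "MISTER", "MRS": "MISSUS", "DR": "DOCTOR"}

def match_salutation(token):
    value = re.sub(r"\.", "", token).upper()  # preProcessing: strip periods
    if value in SALUTATIONS:                  # matching: lexicon lookup
        return NORMALIZED.get(value, value)   # postProcessing: dictionary
    return None                               # no match: try the next matcher
```

A return value of None here corresponds to the symbol not being recognized, in which case the engine moves on to the next matcher definition, if one exists.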

The following table lists and describes the XML elements and attributes for the input symbol definitions.

Element Attribute Description
inputSymbol   A container element for the processing logic for one input symbol.
  name The name of the input symbol against which the following logic applies.
matchers   A list of processing definitions, each of which defines one preprocessing, matching, and postprocessing sequence. Not all definitions include all three steps.
matcher   A processing definition for one sequence of preprocessing, matching, and postprocessing. A processing definition might contain only one or any combination of the three steps.
  factor A factor to apply to the probability specified for the input symbol in the state definition. For example, if the state definition probability is .4 and this factor is .25, then the probability for this matching sequence is .1. Only define this attribute when the probability for this matching sequence is very low.
preProcessing   A container element for the preprocessing rules to be carried out against an input symbol. For more information about the rules you can use, see "Standardization Processing Rules Reference".
lexicon resource The name of the lexicon file containing the list of values to match the input symbol against.

Note: You can also match against patterns or regular expressions. For more information, see matchAllPatterns and pattern in "Standardization Processing Rules Reference".

postProcessing   A container element for the postprocessing rules to be carried out against an input symbol that has been matched. For more information about the rules you can use, see "Standardization Processing Rules Reference".

Output Symbol Definitions

The output symbol definitions name each output symbol that can be produced by the defined states. This section can define additional processing for output symbols using the rules described in "Standardization Processing Rules Reference". Each output symbol defined in the state model definitions must match a value defined here. Below is an excerpt from the output symbol definitions for PersonName processing.

<outputSymbols>
   <outputSymbol name="salutation"/>
   <outputSymbol name="firstName"/>
   <outputSymbol name="middleName"/>
   <outputSymbol name="nickname"/>
   <outputSymbol name="lastName"/>
   <outputSymbol name="generation"/>
   <outputSymbol name="title"/>
   <outputSymbol name="conjunction"/>
</outputSymbols>

The following table lists and describes the XML elements and attributes for the output symbol definitions.

Element Attribute Description
outputSymbols   A list of output symbols for each processing state.
outputSymbol   A definition for one output symbol.
  name The name of the output symbol
occurrenceConcatenator   An optional class to specify the character that separates contiguous occurrences of the same output symbol. For example, this is used in the PhoneNumber data type to concatenate phone number components that are separated by dashes. Components are concatenated using blanks.
  class The name of the occurrence concatenator class. One concatenator class is predefined.
property   A parameter for the occurrence concatenator class. For the default class, the parameter specifies a separator character.
  name The name of the parameter. For the default class, the name is “separator”.
  value The parameter value.
tokenConcatenator   An optional class to specify the character that separates non-contiguous occurrences of the same output symbol. For example, this is used in the PhoneNumber data type to concatenate phone number components.
  class The name of the token concatenator class. One concatenator class is predefined.
property   A parameter for the token concatenator class. For the default class, the parameter specifies a separator character.
  name The name of the parameter. For the default class, the name is “separator”.
  value The value of the parameter.

Data Cleansing Definitions

You can define cleansing rules to transform the input data prior to tokenization to make the input record uniform and ensure the data is correctly separated into its individual components. This standardization step is optional.

Common data transformations include the following:

  • Converting a string to all uppercase.

  • Trimming leading and trailing white space.

  • Converting multiple spaces in the middle of a string to one space.

  • Transliterating accent characters or diacritical marks.

  • Adding a space on either side of extra characters (to help the tokenizer recognize them).

  • Removing extraneous content.

  • Fixing common typographical errors.

The cleansing rules are defined within a cleanser element in the process definition file. You can use any of the rules defined in "Standardization Processing Rules Reference" to cleanse the data. Cleansing attributes use regular expressions to define values to find and replace.

The following excerpt from the PhoneNumber data type does the following to the input string prior to processing:

  • Converts all characters to upper case.

  • Replaces the specified input patterns with new patterns.

  • Removes white space at the beginning and end of the string and concatenates multiple consecutive spaces into one space.

<cleanser>
   <uppercase/>
   <replaceAll regex="([0-9]{3})([0-9]{3})([0-9]{4})" replacement="($1)$2-$3"/>
   <replaceAll regex="([-(),])" replacement=" $1 "/>
   <replaceAll regex="\+(\d+) -" replacement="+$1-"/>
   <replaceAll regex="E?X[A-Z]*[.#]?\s*([0-9]+)" replacement="X $1"/>
   <normalizeSpace/>
</cleanser>
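The effect of this cleansing sequence can be traced in Python. This is a sketch of the regex steps above using re.sub, not the engine itself, and the sample inputs are made up.

```python
import re

def cleanse_phone(raw):
    s = raw.upper()
    # Format a bare 10-digit number as (nnn)nnn-nnnn.
    s = re.sub(r"([0-9]{3})([0-9]{3})([0-9]{4})", r"(\1)\2-\3", s)
    # Pad punctuation with spaces so the tokenizer can recognize it.
    s = re.sub(r"([-(),])", r" \1 ", s)
    # Re-attach a leading country code to the dash that follows it.
    s = re.sub(r"\+(\d+) -", r"+\1-", s)
    # Normalize extension markers (EXT., X#, and so on) to "X n".
    s = re.sub(r"E?X[A-Z]*[.#]?\s*([0-9]+)", r"X \1", s)
    # normalizeSpace: trim and collapse runs of spaces.
    return " ".join(s.split())
```

For instance, the raw string "3105551234 ext. 23" comes out as "( 310 ) 555 - 1234 X 23", with each component spaced out for tokenization.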

Data Normalization Definitions

If the data you are standardizing does not need to be parsed, but does require normalization, you can define data normalization rules to be used instead of the state model defined earlier in the process definition file. These rules are used, for example, for person names whose components are already contained in separate fields and do not need to be parsed. In this case, the standardization engine processes one field at a time according to the rules defined in the normalizer section of standardizer.xml. In this section, you can define preprocessing rules to be applied to the fields prior to normalization.

Below is an excerpt from the PersonName data type. These rules convert the input string to all uppercase, then process the FirstName and MiddleName fields based on the givenName input symbol and the LastName field based on the surname input symbol.

<normalizer>
   <preProcessing>
      <uppercase/>
   </preProcessing>
   <for field="FirstName" use="givenName"/>
   <for field="MiddleName" use="givenName"/>
   <for field="LastName" use="surname"/>
</normalizer>

The following table lists and describes the XML elements and attributes for the normalization definitions.

Element Attribute Description
normalizer   A container element for the normalization rules to use when field components do not require parsing, but do require normalization.
preProcessing   A container element for any preprocessing rules to apply to the input strings prior to normalization. For more information about preprocessing rules, see Standardization Processing Rules Reference.
for   The input symbol to use for a given field. This is defined in the following attributes.
  field The name of a field to be normalized.
  use The name of the input symbol to associate with the field. The processing logic defined for the input symbol earlier in the file is used to normalize the data contained in that field.

Standardization Processing Rules Reference

The OHMPI Standardization Engine provides several matching and transformation rules for input values and patterns. You can add or modify any of these rules in the existing process definition files (standardizer.xml). Several of these rules use regular expressions to define patterns and values. See the Javadoc for java.util.regex for more information about regular expressions.

The available rules include the following:

dictionary

This rule checks the input value against a list of values in the specified normalization file, and, if the value is found, converts the input value to its normalized value. This rule is generally used for postprocessing, but it can also be used for preprocessing tokens. The normalization files are located in the same directory as the process definition file (the instance folder for the data type or variant).

The syntax for dictionary is:

<dictionary resource="file_name" separator="delimiter"/>

The parameters for dictionary are:

  • resource - The name of the normalization file to use to look up the input value and determine the normalized value.

  • separator - The character used in the normalization file to separate the input value entries from the normalized versions. The default normalization files all use a pipe (|) as a separator.

Example 3-1 Sample dictionary Rule

The following sample checks the input value against the list in the first column of the givenNameNormalization.txt file, which uses a pipe symbol (|) to separate the input value from its normalized version. When a value is matched, the input value is converted to its normalized version.
<dictionary resource="givenNameNormalization.txt" separator="\|"/>
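A normalization file of this shape is straightforward to parse. The following sketch loads pipe-delimited entries like those shown later in this chapter (for example, BOB|ROBERT); the in-memory list stands in for reading the actual file.

```python
def load_normalization(lines, separator="|"):
    """Build a lookup table from delimited normalization entries."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        raw, normalized = line.split(separator, 1)
        table[raw] = normalized
    return table

# Sample entries in the same format as givenNameNormalization.txt.
lookup = load_normalization(["BEV|BEVERLY", "BOB|ROBERT", "BILL|WILLIAM"])
```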
fixedString

This rule checks the input value against a fixed value. This is generally used for the token matching step for input symbol processing. You can define a list of fixed strings for an input symbol by enclosing multiple fixedString elements within a fixedStrings element. The syntax for fixedString is:

<fixedString>string</fixedString>

The parameter for fixedString is:

  • string - The fixed value to compare the input value against.

Example 3-2 Sample fixedString Rules

The following sample matches the input value against “AND”, “OR” and “AND/OR” which are fixed values. If one of the fixed values matches the input string, processing is continued for that matcher definition. If no fixed values match the input string, processing is stopped for that matcher definition and the next matcher definition is processed (if one exists).
<fixedStrings>
   <fixedString>AND</fixedString>
   <fixedString>OR</fixedString>
   <fixedString>AND/OR</fixedString>
</fixedStrings>
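In Python terms, a fixedStrings match is simply membership in a set of literal values; a minimal sketch of the conjunction matcher above:

```python
# Hypothetical conjunction matcher equivalent to the fixedStrings sample.
CONJUNCTIONS = {"AND", "OR", "AND/OR"}

def match_fixed(value):
    # Continue this matcher only when the value equals a fixed string.
    return value in CONJUNCTIONS
```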
lexicon

This rule checks the input value against a list of values in the specified lexicon file. This rule is generally used for token matching. The lexicon files are located in the same directory as the process definition file (the instance folder for the data type or variant).

The syntax for lexicon is:

<lexicon resource="file_name/>

The parameter for lexicon is:

  • resource - The name of the lexicon file to use to look up the input value to ensure correct tokenization.

Example 3-3 Sample lexicon Rule

The following sample checks the input value against the list in the givenName.txt file. When a value is matched, the standardization engine continues to the postprocessing phase if one is defined.
<lexicon resource="givenName.txt"/>
normalizeSpace

This rule removes leading and trailing white space from a string and changes multiple spaces in the middle of a string to a single space. The syntax for normalizeSpace is:

<normalizeSpace/>

Example 3-4 Sample normalizeSpace Rule

The following sample removes the leading and trailing white space from a last name field prior to checking the input value against the surnames.txt file.
<matcher>
   <preProcessing>
     <normalizeSpace/>
   </preProcessing>
   <lexicon resource="surnames.txt"/>
</matcher>
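In Python, normalizeSpace is equivalent to trimming the ends of the string and collapsing internal whitespace runs; a sketch:

```python
import re

def normalize_space(value):
    # Trim leading/trailing whitespace and collapse internal runs to one space.
    return re.sub(r"\s+", " ", value.strip())
```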
pattern

This rule checks the input value against a specific regular expression to see if the patterns match. You can define a sequence of patterns by including them all in order in a matchAllPatterns element. You can also specify sub-patterns to exclude. The syntax for pattern is:

<pattern regex="regex_pattern"/>

The parameter for pattern is:

  • regex - A regular expression to validate the input value against. See the Javadocs for java.util.regex for more information.

The pattern rule can be further customized by adding exceptFor rules that define patterns to exclude in the matching process. The syntax for exceptFor is:

<pattern regex="regex_pattern"/>
   <exceptFor regex="regex_pattern"/>
</pattern>

The parameter for exceptFor is:

  • regex - A regular expression to exclude from the pattern match. See the Javadocs for java.util.regex for more information.

Example 3-5 Sample pattern Rule

The following sample checks the input value against a sequence of patterns to see if the input value might be an area code. These rules specify a pattern that matches three digits contained in parentheses, such as (310).
<matchAllPatterns>
   <pattern regex="\("/>
   <pattern regex="[0-9]{3}"/>
   <pattern regex="\)"/>
</matchAllPatterns>

Example 3-6 Sample pattern Rule

The following sample checks the input value to see if its pattern is a series of three letters, excluding THE and AND.
<pattern regex="[A-Z]{3}">
   <exceptFor regex="THE"/>
   <exceptFor regex="AND"/>
</pattern>
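One plausible reading of these rules in Python: matchAllPatterns requires every listed regex to match the value, and exceptFor rejects values that match an exclusion. Whether the engine uses substring search or full-string matching is an assumption in this sketch, not something the samples above specify.

```python
import re

def match_all_patterns(value, patterns):
    # Every pattern must be found in the value (assumes substring search).
    return all(re.search(p, value) for p in patterns)

def pattern_except_for(value, pattern, exclusions):
    # The value must match the pattern but none of the exclusions
    # (assumes full-string matching).
    if not re.fullmatch(pattern, value):
        return False
    return not any(re.fullmatch(e, value) for e in exclusions)
```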
replace

This rule checks the input value for a specific pattern. If the pattern is found, it is replaced by a new pattern. This rule only replaces the first instance it finds of the pattern. The syntax for replace is:

<replace regex="regex_pattern" replacement="regex_pattern"/>

The parameters for replace are:

  • regex - A regular expression that, if found in the input string, is converted to the replacement expression.

  • replacement - The regular expression that replaces the expression specified by the regex parameter.

Example 3-7 Sample replace Rule

The following sample tries to match the input value against “ST.”. If a match is found, the standardization engine replaces the value with “SAINT.”
<replace regex="ST\." replacement="SAINT"/>
replaceAll

This rule checks the input value for a specific pattern. If the pattern is found, all instances are replaced by a new pattern. The syntax for replaceAll is:

<replaceAll regex="regex_pattern" replacement="regex_pattern"/>

The parameters for replaceAll are:

  • regex - A regular expression that, if found in the input string, is converted to the replacement expression.

  • replacement - The regular expression that replaces the expression specified by the regex parameter.

Example 3-8 Sample replaceAll Rule

The following sample finds all periods in the input value and removes them.
<replaceAll regex="\." replacement=""/>
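The difference between replace and replaceAll corresponds to Python's re.sub with and without a count limit; a sketch using a made-up input value:

```python
import re

value = "ST. CHARLES ST."
first_only = re.sub(r"ST\.", "SAINT", value, count=1)  # like replace
every_one = re.sub(r"ST\.", "SAINT", value)            # like replaceAll
```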
    
transliterate

This rule converts the specified characters in the input string to a new set of characters, typically converting from one alphabet to another by adding or removing diacritical marks. The syntax for transliterate is:

<transliterate from="existing_char" to="new_char"/>

The parameters for transliterate are:

  • from - The characters that exist in the input string that need to be transliterated.

  • to - The characters that will replace the above characters.

Example 3-9 Sample transliterate Rule

The following sample converts lower case vowels with acute accents to vowels with no accents.
<transliterate from="áéíóú" to="aeiou"/>
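Python's str.translate performs the same kind of character-for-character mapping; a sketch of the rule above:

```python
# Map accented lowercase vowels to their unaccented equivalents,
# mirroring the transliterate rule above.
ACCENT_MAP = str.maketrans("áéíóú", "aeiou")

def strip_accents(value):
    return value.translate(ACCENT_MAP)
```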
uppercase

This rule converts all characters in the input string to upper case. The rule does not take any parameters. The syntax for uppercase is:

<uppercase/>

Example 3-10 Sample uppercase Rule

The following sample converts the entire input string into uppercase prior to doing any pattern or value replacements. Since this is defined in the cleanser section, this is performed prior to tokenization.
<cleanser>
   <uppercase/>
   <replaceAll regex="\." replacement=". "/>
   <replaceAll regex="AND / OR" replacement="AND/OR"/>
   ...
</cleanser>

Lexicon Files

Lexicon files list the possible values for a specific field that the standardization engine uses to recognize input data. A lexicon file can be defined for each field on which standardization is performed. These files are referenced from the process definition file when defining matching or processing rules. The lexicon files are located in the resource folder for the data type or variant from which they are referenced.

Lexicon files are simply text files with a single column that lists the possible field values. They are typically given the same name as the token type, or standardization component, that they define. For example, the lexicon files for first and last names are givenNames.txt and surnames.txt. You can modify these files as needed to suit your data requirements and you can create new lexicon files to reference from the process definition file.

Below is an excerpt of the given names lexicon file:

ALIA
ALICA
ALICAI
ALICE
ALICEMARIE
ALICEN
ALICIA
ALICJA
ALID
ALIDA
ALIHAN
ALINA
ALINE
ALIS
ALISA
ALISE
ALISHA
ALISHIA
ALISIA
ALISON

Normalization Files

Normalization files list nonstandard values for a field along with their corresponding normalized value. The standardization engine uses these files to convert nonstandard values into a standard form. These files are referenced from the process definition file when defining normalization rules. The normalization files are located in the resource folder for the data type or variant from which they are referenced.

The most common example of normalization is a nickname file that provides a list of nicknames along with the standard version of each name. For example, “Beth” and “Liz” might both be standardized to “Elizabeth.” Each row in the file contains a nickname and its corresponding standardized version separated by a pipe character (|). You can modify these files as needed to suit your data processing needs, or you can create new normalization files to reference from the process definition file.

Below is an excerpt of the given names normalization file:

BEV|BEVERLY
BIANCA|BLANCHE
BILLIE|WILLIAM
BILLYE|WILLIAM
BILLY|WILLIAM
BILL|WILLIAM
BIRGIT|BRIDGET
BLANCA|BLANCHE
BLANCH|BLANCHE
BOBBIE|ROBERT
BOBBI|ROBERT
BOBBYE|ROBERT
BOBBY|ROBERT
BOB|ROBERT
BONNY|BONNIE
BRADLY|BRADLEY

Setting FSM-Based Person Name Configuration

By default, person name data is standardized using the finite state machine (FSM) framework. Processing person data might involve parsing free-form data fields, but normally involves normalizing and phonetically encoding certain fields prior to matching. The following topics describe the default configuration that defines person processing logic and provide information about modifying mefa.xml in a master person index application for processing person data.

Person Name Standardization Overview

Processing data with the PersonName data type includes standardizing and matching a person's demographic information. The OHMPI Standardization Engine can normalize or standardize values for person data. These values are needed for accurate searching and matching on person data. Several configuration files designed specifically to handle person data are included to provide processing logic for the standardization and phonetic encoding process. The Master Person Index Standardization Engine can phonetically encode any field you specify.

In addition, when processing person information, you might want to standardize addresses to enable searching against address information. This requires working with the address configuration files described in Chapter 4, "Patterns-based Address Data Configuration."

Person Name Standardization Components

Standardization engines use tokens to determine how each field is standardized into its individual field components and to determine how to normalize a field value. Tokens also identify the field components to external applications like a master person index application. The following table lists each token generated by the OHMPI Standardization Engine for person data along with the standardization component they represent. These correspond to the output symbols in the process definition file and to the output fields listed in the service type definition file. For names, you can only specify the predefined field IDs that are listed in this table unless you customize an existing variant or create a new one.

Table 3-1 Person Name Tokens

Token Description
firstName Represents a first name field.
generation Represents a field containing generational information, such as Junior, II, or 3rd.
lastName Represents a last name field.
middleName Represents a middle name field.
nickname Represents a nickname field.
salutation Represents a field containing prefix information for a name, such as Mr., Miss, or Mrs.
title Represents a field containing a title, such as Doctor, Reverend, or Professor.

Person Name Standardization Files

Several configuration files are used to define standardization logic for processing person names. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for person data. There are three types of standardization files for person data: process definition, lexicon, and normalization. Five default variants on the PersonName data type are provided that are specialized for standardizing data from France, Australia, Mexico, the United Kingdom, or the United States. In a master person index project, these files appear under PersonName in the Standardization Engine node, within one sub-folder for each national variant.

You can customize these files to add entries of other nationalities or languages, including those containing diacritical marks. You can also create new variants to process data of other nationalities. For more information, see Custom Data Types and Variants.

The following sections provide information about each type of person name standardization file:

Person Name Lexicon Files

Each PersonName variant contains a set of lexicon files. Each lexicon file contains a list of possible values for a field. The standardization engine matches input values against the values listed in these files to recognize input symbols and ensure correct tokenization. The OHMPI Standardization Engine uses these files when processing input symbols as defined in the process definition file (standardizer.xml). They are primarily used during the token matching portion of parsing. You can modify these files as needed by adding, deleting, or modifying values in the list. You can also create additional lexicon files.

The PersonName data type includes the following lexicon files:

  • generation.txt

  • givenNames.txt

  • salutation.txt

  • surnames.txt

  • titles.txt

These files are located in the resource folder under each variant name.
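As a rough sketch of how a lexicon list can back token recognition, the following hypothetical Python fragment loads a one-value-per-line file into a set and tests candidate tokens against it. The file loader and helper names are illustrative assumptions; the OHMPI engine's actual matching is driven by the process definition file (standardizer.xml).

```python
# Illustrative sketch only -- not OHMPI code. A lexicon file holds one
# possible field value per line; token matching amounts to a
# case-insensitive membership test against that list.
def load_lexicon(path):
    """Read a lexicon file (e.g. titles.txt) into an uppercase set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().upper() for line in f if line.strip()}

def matches_lexicon(token, lexicon):
    """True if the input token is a recognized value for this field."""
    return token.strip().upper() in lexicon

# Example with an in-memory stand-in for titles.txt:
titles = {"DOCTOR", "DR", "PROFESSOR", "REVEREND"}
```

With this stand-in, an input token such as "Reverend" would be recognized as a title, while "Smith" would not.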

Person Name Normalization Files

Each PersonName variant contains a set of normalization files that are used to normalize input values. The OHMPI Standardization Engine uses these files when processing input symbols as defined in the process definition file (standardizer.xml). Each normalization file contains a column of unnormalized values, such as nicknames or abbreviations, and a second column that contains the corresponding normalized values. The values in each column are separated by a pipe symbol (|). You can modify these files as needed by adding, deleting, or modifying values in the list. You can also create additional normalization files to reference from the process definition file.

The PersonName data type includes the following normalization files:

  • generationNormalization.txt

  • givenNameNormalization.txt

  • salutationNormalization.txt

  • surnameNormalization.txt

  • titleNormalization.txt

These files are located in the resource folder under each variant name.
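For example, entries in a file such as givenNameNormalization.txt follow the unnormalized|normalized pattern described above. The values below are illustrative only, not necessarily the shipped defaults:

```text
BILL|WILLIAM
BETH|ELIZABETH
BOB|ROBERT
```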

Person Name Process Definition Files

Each variant has its own process definition file (standardizer.xml) that defines the state model for standardizing free-form person names. Each of these files also includes a section that defines just normalization without parsing for person names. The process definition file is located in the resource folder under each variant name. For information about the structure of this file, see "Process Definition File".

Person name standardization has several states, each defining how to process tokens when they are found in certain orders. The default file defines states for salutations, first names, middle names, last names, titles, suffixes, and separators. It defines provisions for instances when the fields do not appear in order or when the input string does not contain complete data. For example, the current definition handles instances where the input string is “FirstName, MiddleName, LastName” as well as instances where the input string is “LastName, FirstName, MiddleName”.

The process definition files for person names define several parsing rules for each field component. This file defines a set of cleansing rules to prepare the input string prior to any processing. Then the data is passed to the start state of the FSM. Most fields are preprocessed and then matched against regular expressions or against a list of values in a lexicon file (described in "Person Name Lexicon Files"). Postprocessing includes replacing regular expressions or normalizing the field value based on a normalization file (described in "Person Name Normalization Files"). The process definition files also define a set of normalization rules, which are followed when the incoming data already contains name information in separate fields and does not need to be parsed.

Person Name Standardization and Oracle Healthcare Master Person Index

Master person index applications rely on the OHMPI Standardization Engine to process person name data. To ensure correct processing of person information, you need to customize the Matching Service for the master person index application according to the rules defined for the standardization engine. This includes modifying mefa.xml to define normalization or standardization and phonetic encoding of the appropriate fields. You can modify mefa.xml with the Master Person Index Configuration Editor in the master person index project.

Standardization is defined in the StandardizationConfig section of mefa.xml, which is described in detail in “Match Field Configuration” in Oracle Healthcare Master Person Index Configuration Reference (Part Number: E18592-01). To configure the required fields for normalization, modify the normalization structure in mefa.xml. To configure the required fields for parsing and normalization, modify the standardization structure. To configure phonetic encoding, modify the phonetic encoding structure. These tasks can all be performed using the Master Person Index Configuration Editor.

Generally, person data arrives already parsed into separate fields, so you should not need to configure fields for parsing unless your person data is stored in free-form text fields with all name information in one field. When processing person data, you might also want to search on address information. In that case, you need to configure the address fields for standardization and normalization.

The following sections provide information about the fields used in processing person data and how to configure person data standardization for a master person index application. The information provided in these topics is based on the default configuration.

Person Name Processing Fields

When standardizing person data, not all fields in a record need to be processed by the Master Person Index Standardization Engine. The standardization engine only needs to process fields that must be standardized, normalized, or phonetically converted. For a master person index application, these fields are defined in mefa.xml and processing logic for each field is defined in the standardization engine configuration files.

Person Name Standardized Fields

The OHMPI Standardization Engine can process person data that is provided in separate fields within a single record, meaning that no parsing is required of the name fields prior to normalization. It can also process person data contained in one long free-form field and parse the field into its individual components, such as first name, last name, title, and so on. Typically, only first and last names are normalized and phonetically encoded when standardizing person data, but the standardization engine can normalize and phonetically encode any field you choose. By default, the standardization engine processes these fields: first name, middle name, last name, nickname, salutation, generational suffix, and title.

Person Name Object Structure

The fields you specify for person name matching in the Master Person Index wizard are automatically defined for standardization and phonetic encoding. If you specify the PersonFirstName or PersonLastName match type in the wizard, the following fields are automatically added to the object structure and database creation script:

  • field_name_Std

  • field_name_Phon

    where field_name is the name of the field for which you specified person name matching.

For example, if you specify the PersonFirstName match type for the FirstName field, two fields, FirstName_Std and FirstName_Phon, are automatically added to the structure. You can also add these fields manually if you do not specify match types in the wizard. If you are parsing free-form person data, be sure all output fields from the standardization process are included in the master person index object structure. If you store additional names in the database, such as alias names, maiden names, parent names, and so on, you can modify the phonetic structure to phonetically encode those names as well.

Configuring a Normalization Structure for Person Names

The fields defined for normalization for the PersonName data type can include any name fields. By default, normalization rules are defined in the process definition file for first, middle, and last name fields, and you can easily define additional fields. You only need to define a normalization structure for person data if you are processing individual fields that do not require parsing. Follow the instructions under “Defining OHMPI Normalization Rules” in Oracle Healthcare Master Person Index Configuration Guide to define fields for normalization. For the standardization-type element, enter PersonName. For a list of field IDs to use in the standardized-object-field-id element, see "Person Name Standardization Components".

A sample normalization structure for person data is shown below. This sample specifies that the PersonName standardization type is used to normalize the first name, alias first name, last name, and alias last name fields. For all name fields, both United States and United Kingdom domains are defined for standardization.

<structures-to-normalize>
   <group standardization-type="PersonName"
    domain-selector="com.sun.mdm.index.matching.impl.MultiDomainSelector">
      <locale-field-name>Person.PobCountry</locale-field-name>
      <locale-maps>
         <locale-codes>
            <value>UNST</value>
            <locale>US</locale>
         </locale-codes>
         <locale-codes>
            <value>GB</value>
            <locale>UK</locale>
         </locale-codes>
      </locale-maps>
      <unnormalized-source-fields>
         <source-mapping>
            <unnormalized-source-field-name>Person.FirstName
            </unnormalized-source-field-name>
            <standardized-object-field-id>FirstName
            </standardized-object-field-id>
         </source-mapping>
         <source-mapping>
            <unnormalized-source-field-name>Person.LastName
            </unnormalized-source-field-name>
            <standardized-object-field-id>LastName
            </standardized-object-field-id>
         </source-mapping>
      </unnormalized-source-fields>
      <normalization-targets>
         <target-mapping>
            <standardized-object-field-id>FirstName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.FirstName_Std
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>LastName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.LastName_Std
            </standardized-target-field-name>
         </target-mapping>
      </normalization-targets>
   </group>
   <group standardization-type="PersonName" domain-selector=
     "com.sun.mdm.index.matching.impl.MultiDomainSelector">
      <locale-field-name>Person.PobCountry</locale-field-name>
      <locale-maps>
         <locale-codes>
            <value>UNST</value>
            <locale>US</locale>
         </locale-codes>
         <locale-codes>
            <value>GB</value>
            <locale>UK</locale>
         </locale-codes>
      </locale-maps>
      <unnormalized-source-fields>
         <source-mapping>
            <unnormalized-source-field-name>Person.Alias[*].FirstName
            </unnormalized-source-field-name>
            <standardized-object-field-id>FirstName
            </standardized-object-field-id>
         </source-mapping>
         <source-mapping>
            <unnormalized-source-field-name>Person.Alias[*].LastName
            </unnormalized-source-field-name>
            <standardized-object-field-id>LastName
            </standardized-object-field-id>
         </source-mapping>
      </unnormalized-source-fields>
      <normalization-targets>
         <target-mapping>
            <standardized-object-field-id>FirstName
            </standardized-object-field-id>
            <standardized-target-field-name>
            Person.Alias[*].FirstName_Std
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>LastName
            </standardized-object-field-id>
            <standardized-target-field-name>
            Person.Alias[*].LastName_Std
            </standardized-target-field-name>
         </target-mapping>
      </normalization-targets>
   </group>
</structures-to-normalize>

Configuring a Standardization Structure for Person Names

For free-form name fields, the source fields that are defined for standardization should include the predefined standardization components. For example, fields containing person name information can include the first name, middle name, last name, suffix, title, and salutation. The target fields you define can include any of these parsed components. Follow the instructions under “Defining OHMPI Standardization Rules” in Oracle Healthcare Master Person Index Configuration Guide to define fields for standardization. For the standardization-type element, enter PersonName. For a list of field IDs to use in the standardized-object-field-id element, see "Person Name Standardization Components".

A sample standardization structure for person name data is shown below. Only the United States variant is defined in this structure.

<free-form-texts-to-standardize>
   <group standardization-type="PERSONNAME"
    domain-selector="com.sun.mdm.index.matching.impl.SingleDomainSelectorUS">
      <unstandardized-source-fields>
         <unstandardized-source-field-name>Person.Name
         </unstandardized-source-field-name>
      </unstandardized-source-fields>
      <standardization-targets>
         <target-mapping>
            <standardized-object-field-id>salutation
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Prefix
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>firstName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.FirstName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>middleName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.MiddleName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>lastName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.LastName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>suffix
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Suffix
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>title
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Title
            </standardized-target-field-name>
         </target-mapping>
      </standardization-targets>
   </group>
</free-form-texts-to-standardize>

Configuring Phonetic Encoding for Person Names

When you specify a first, middle, or last name field for person name matching in the Master Person Index wizard, that field is automatically defined for phonetic encoding. You can define additional names, such as maiden names or alias names, for phonetic encoding as well. Follow the instructions under “Defining Phonetic Encoding for the Master Person Index” in Oracle Healthcare Master Person Index Configuration Guide to define fields for phonetic encoding.

A sample of fields defined for phonetic encoding is shown below. This sample converts name and alias name fields, as well as the street name.

<phoneticize-fields>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.FirstName_Std
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.FirstName_Phon
      </phoneticized-target-field-name>
      <encoding-type>Soundex</encoding-type>
   </phoneticize-field>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.LastName_Std
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.LastName_Phon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.Alias[*].FirstName_Std
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.Alias[*].FirstName_Phon
      </phoneticized-target-field-name>
      <encoding-type>Soundex</encoding-type>
   </phoneticize-field>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.Alias[*].LastName_Std
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.Alias[*].LastName_Phon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.Address[*].AddressLine1_StName
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.Address[*].AddressLine1_StPhon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
</phoneticize-fields>
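The Soundex and NYSIIS encoders named in the sample are standard phonetic algorithms. As a point of reference only (this is not OHMPI's implementation; the engine supplies its own encoders), a minimal sketch of classic American Soundex looks like this:

```python
# A minimal sketch of the classic American Soundex algorithm, shown only
# to illustrate what a phonetic encoding produces. NOT the OHMPI encoder.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    """Return the first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":            # H and W do not break a run of equal codes
            continue
        code = CODES.get(ch, "")
        if code and code != prev:  # vowels reset prev, so a repeated code
            result += code         # across a vowel is encoded again
        prev = code
    return (result + "000")[:4]    # pad or truncate to four characters
```

Note that "Robert" and "Rupert" both encode as R163, which is what makes phonetically encoded fields useful for matching misspelled or variant names.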

Setting FSM-Based Telephone Number Configuration

By default, telephone number data is standardized using the finite state machine (FSM) framework. Processing telephone data involves parsing free-form data fields and normalizing certain field components prior to matching. The following topics describe the default configuration files that define telephone number processing logic and provide information about modifying mefa.xml in a master person index application for processing telephone data.

Telephone Number Standardization Overview

Processing data using the PhoneNumber data type includes standardizing and matching telephone numbers. The OHMPI Standardization Engine can create the parsed and normalized values for free-form telephone data. These values are required for accurate searching and matching. Several configuration files designed specifically to handle telephone data are included to provide processing logic for the standardization process.

In addition, when processing telephone information, you might want to standardize addresses to enable searching against address information. This requires working with the address configuration files described in Chapter 4, "Patterns-based Address Data Configuration."

Telephone Number Standardization Components

Standardization engines use tokens to determine how each field is standardized into its individual field components and to determine how to normalize a field value. Tokens also identify the field components to external applications, like a master person index application. The following table lists each token generated by the OHMPI Standardization Engine for telephone data along with the standardization component it represents. You can only specify the predefined field IDs that are listed in this table unless you customize the existing data type or create a new data type or variant.

Table 3-2 Telephone Number Tokens

Token Description
areaCode Represents a field containing an area code.
phoneNumber Represents a field containing the telephone number, excluding area code, country code, and extension.
extension Represents a field containing a telephone number extension.
countryCode Represents a field containing the country code for a telephone number.

Telephone Number Standardization Files

Only one configuration file is used to define standardization logic for processing telephone numbers. The process definition file (standardizer.xml) defines the state model and logic for processing telephone numbers. There is only one variant for the PhoneNumber data type that is designed to handle telephone numbers from all countries. The files that make up the variant are stored in the master person index project under PhoneNumber/Generic. The process definition file is located in the resource subdirectory. You can customize this file to fit your processing and standardization requirements for telephone numbers. For more information about the structure of this file, see "Process Definition File".

Telephone number standardization has several states, each defining how to process tokens when they are found in certain orders. The default file defines states for country codes, area codes, phone numbers, and extensions. It defines provisions for instances when the fields do not appear in order or when the input string does not contain complete data. For example, the current definition handles instances where the input string begins with a country code or an area code, where it contains an extension, where it does not contain an extension, and when it contains multiple telephone numbers.

The process definition file for telephone numbers defines several parsing rules for each field component. This file defines a set of cleansing rules to prepare the input string prior to any processing. Then the data is passed to the start state of the FSM. Most fields are matched against regular expressions and then postprocessed by replacing regular expressions. The output symbols are further processed by concatenating the digit groups of the actual phone number, separated by a hyphen.
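To make the four components concrete, the following hypothetical Python sketch splits a free-form North American-style number into the tokens of Table 3-2 with a single regular expression. The real engine uses the FSM defined in standardizer.xml, not a regex; this is only an illustration of the inputs and outputs involved.

```python
import re

# Illustrative only -- not the OHMPI FSM. Splits a free-form number into
# the countryCode, areaCode, phoneNumber, and extension components.
PHONE = re.compile(
    r"^\s*(?:\+?(?P<country>\d{1,3})[\s.-]*)?"    # optional country code
    r"\(?(?P<area>\d{3})\)?[\s.-]*"               # area code
    r"(?P<first>\d{3})[\s.-]*(?P<last>\d{4})"     # local number digit groups
    r"(?:\s*(?:x|ext\.?)\s*(?P<ext>\d+))?\s*$",   # optional extension
    re.IGNORECASE)

def parse_phone(text):
    m = PHONE.match(text)
    if m is None:
        return None
    return {
        "countryCode": m.group("country") or "",
        "areaCode": m.group("area"),
        # join the digit groups of the local number with a hyphen, as the
        # default telephone rules do
        "phoneNumber": m.group("first") + "-" + m.group("last"),
        "extension": m.group("ext") or "",
    }
```

For example, `parse_phone("+1 (303) 555-1234 x56")` yields a country code of 1, an area code of 303, a phone number of 555-1234, and an extension of 56.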

Telephone Number Standardization and Oracle Healthcare Master Person Index

Master person index applications rely on the OHMPI Standardization Engine to process telephone number data. To ensure correct processing of telephone information, you need to customize the Matching Service for the master person index application according to the rules defined for the standardization engine. This includes modifying mefa.xml to define standardization of the appropriate fields. You can modify mefa.xml using the Master Person Index Configuration Editor.

Standardization is defined in the StandardizationConfig section of mefa.xml, which is described in detail in “Match Field Configuration” in Oracle Healthcare Master Person Index Configuration Reference (Part Number: E18592-01). To configure the required fields for parsing, modify the standardization structure in mefa.xml.

The following topics provide information about the fields used in processing telephone data and how to configure telephone number standardization for a master person index application. The information provided in these topics is based on the default configuration.

Telephone Number Processing Fields

When standardizing telephone data, not all fields in a record need to be processed by the OHMPI Standardization Engine. The standardization engine only needs to process fields that must be parsed, normalized, or phonetically converted. For a master person index application, these fields are defined in mefa.xml and processing logic for each field is defined in the Standardization Engine node configuration files.

Telephone Number Standardized Fields

The OHMPI Standardization Engine can process telephone data that is contained in one long free-form field and can parse that field into its individual components. By default, the standardization engine separates telephone numbers into these field components: country code, area code, phone number, and extension.

Telephone Number Object Structure

To standardize telephone numbers in a master person index application, you need to manually define the standardization structure and you need to add the fields that will store the standardized field components to the object structure. In the default implementation, you can store any combination of the following telephone number field components in the master person index database.

  • Country Code

  • Area Code

  • Phone Number

  • Extension

The standardization engine can produce all of the above field components, but you only need to store those you require in the master person index database.

Configuring a Standardization Structure for Telephone Numbers

For free-form telephone number fields, the source fields you define for standardization should include the standardization components predefined for the PhoneNumber data type. For example, any fields containing telephone number information can include the country code, area code, phone number, and extension. The target fields you define can include any of these parsed fields. Follow the instructions under “Defining OHMPI Standardization Rules” in Oracle Healthcare Master Person Index Configuration Guide to define fields for standardization. For the standardization-type element, enter PhoneNumber. For a list of field IDs to use in the standardized-object-field-id element, see "Telephone Number Standardization Components".

A sample standardization structure for telephone number data is shown below. No variant is defined in this structure because the standardization rules apply to global numbers.

<free-form-texts-to-standardize>
   <group standardization-type="PHONENUMBER"
    domain-selector="com.sun.mdm.index.matching.impl.MultiDomainSelector">
      <unstandardized-source-fields>
         <unstandardized-source-field-name>Person.Phone[*].PhoneNumber
         </unstandardized-source-field-name>
      </unstandardized-source-fields>
      <standardization-targets>
         <target-mapping>
            <standardized-object-field-id>countryCode</standardized-object-field-id>
            <standardized-target-field-name>Person.Phone[*].CountryCode
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>areaCode</standardized-object-field-id>
            <standardized-target-field-name>Person.Phone[*].AreaCode
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>phoneNumber</standardized-object-field-id>
            <standardized-target-field-name>Person.Phone[*].Number
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>extension</standardized-object-field-id>
            <standardized-target-field-name>Person.Phone[*].Extension
            </standardized-target-field-name>
         </target-mapping>
      </standardization-targets>
   </group>
</free-form-texts-to-standardize>