Oracle® Healthcare Master Person Index Standardization Engine Reference
Release 1.1

Part Number E18471-01

2 Master Person Index Standardization Engine

This chapter provides conceptual information about how the OHMPI Standardization Engine works and about the standardization and matching process. It also describes the internationalization of the OHMPI Standardization Engine.

This chapter includes the following sections:

  • Learning About the OHMPI Standardization Engine

  • Understanding the OHMPI Standardization and Matching Process

  • Internationalizing the OHMPI Standardization Engine

Learning About the OHMPI Standardization Engine

The OHMPI Standardization Engine uses two frameworks to define standardization logic. One framework is based on a finite state machine (FSM) model, and the other is based on patterns defined in configurable dictionaries. In the current implementation, person names and telephone numbers are processed using the FSM framework, and addresses and business names are processed using the patterns-based framework. The OHMPI Standardization Engine includes several sets of files that define standardization logic for all supported data types. For person data and addresses, one set of standardization files is provided for each of the following national variants: Australia, France, Great Britain, Mexico, and the United States. You can customize these files to adapt the standardization and matching logic to your specific needs, or you can create new data types or variants for even more customized processing. With pluggable standardization sets, you can define custom standardization processing for most types of data.

The following topics provide information about the OHMPI Standardization Engine, the standardization frameworks, and how data is standardized:

  • OHMPI Standardization Engine Data Types and Variants

  • OHMPI Standardization Engine Standardization Components

  • Finite State Machine Framework

  • Patterns-based Framework

OHMPI Standardization Engine Data Types and Variants

A data type is the primary kind of data you are processing, such as person names, addresses, business names, automotive parts, and so on. A variant is a subset of a data type that is designed to standardize a specific kind of data. For example, for addresses and names, the variants typically define rules for the different countries in which the data originates. For automotive parts, the variants might be different manufacturers. Each data type and variant uses its own configuration files to define how fields in incoming records are parsed, standardized, and classified for processing. Data types are sometimes referred to as standardization types.

In the default implementation with a master person index application, the engine supports data standardization on the following types of data:

  • Person names

  • Street addresses

  • Business names

  • Telephone numbers

In the default configuration, the standardization engine expects street addresses and business names to be in free-form text fields that need to be parsed prior to normalization and phonetic encoding. Person and telephone information can also be contained in free-form text fields, but these types of information can also be processed when the data is already parsed into its individual components. Each data type requires specific customization to mefa.xml in the master person index project. This can be done by modifying the file directly or by using the OHMPI Configuration Editor.

OHMPI Standardization Engine Standardization Components

The OHMPI Standardization Engine breaks down fields into various components during the parsing process. This is known as tokenization. For example, it breaks addresses into floor number, street number, street name, street direction, and so on. Some of these components are similar and might be stored in the same output field. In the default configuration for a master person index application, for example, when the standardization engine finds a house number, rural route number, or PO box number, the value is stored in the HouseNumber database field. You can customize this as needed, as long as any field you specify to store a component is also included in the object structure defined for the master person index application.
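
For illustration, the following minimal Java sketch shows the many-to-one mapping just described, in which several parsed components share one output field. The token and field names are assumptions for demonstration, not the engine's actual identifiers.

    import java.util.Map;

    public class ComponentMappingSketch {
        // Hypothetical mapping: a house number, rural route number, or PO box
        // number all land in the same HouseNumber output field.
        static final Map<String, String> TOKEN_TO_FIELD = Map.of(
                "HOUSE_NUMBER",       "HouseNumber",
                "RURAL_ROUTE_NUMBER", "HouseNumber",
                "PO_BOX_NUMBER",      "HouseNumber",
                "STREET_NAME",        "StreetName",
                "STREET_DIRECTION",   "StreetDir");

        public static void main(String[] args) {
            // Prints "HouseNumber": three different tokens, one output field.
            System.out.println(TOKEN_TO_FIELD.get("PO_BOX_NUMBER"));
        }
    }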

The standardization engine uses tokens to determine how to process fields that are defined for normalization or parsing into their individual standardization components. For FSM-based data types, the tokens are defined as output symbols in the process definition files and are referenced in the standardization structures in the Master Person Index Configuration Editor and in mefa.xml. The tokens determine how each field is normalized or how a free-form text field is parsed and normalized. For patterns-based data types, the tokens are defined internally in the Java code. The tokens for business names specify which business type key file to use to normalize a specific standardization component. The tokens for addresses determine which database fields store each standardization component and how each component is standardized.

Finite State Machine Framework

A finite state machine (FSM) is composed of one or more states and the transitions between those states. The OHMPI Standardization Engine FSM framework is designed to be highly configurable and can be easily extended with no Java coding. The following sections describe the FSM framework and the configuration files that define FSM-based standardization.

About the Finite State Machine Framework

In an FSM framework, the standardization process is defined as one or more states. In a state, only the input symbols defined for that state are recognized. When one of those symbols is recognized, the following action or transition is based on configurable processing rules. For example, when an input symbol is recognized, it might be preprocessed by removing punctuation, matched against a list of tokens, and then postprocessed by normalizing the input value. Once this has been completed for all input symbols, the standardization engine determines which token is the most likely match.
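
The following short Java sketch illustrates the state-transition idea in the abstract. The states and input symbols here are invented for illustration; the engine's actual definitions live in the process definition files.

    import java.util.Map;

    public class FsmSketch {
        // Each state recognizes only its own input symbols; a recognized
        // symbol determines the transition to the next state.
        static final Map<String, Map<String, String>> TRANSITIONS = Map.of(
                "START",        Map.of("FIRST_NAME", "AFTER_FIRST"),
                "AFTER_FIRST",  Map.of("MIDDLE_INITIAL", "AFTER_MIDDLE",
                                       "LAST_NAME", "DONE"),
                "AFTER_MIDDLE", Map.of("LAST_NAME", "DONE"));

        public static void main(String[] args) {
            String state = "START";
            for (String symbol : new String[] {"FIRST_NAME", "LAST_NAME"}) {
                state = TRANSITIONS.get(state).get(symbol); // transition on symbol
            }
            System.out.println(state); // DONE
        }
    }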

FSM-based processing includes the following steps:

  • Cleansing - The entire input string is modified to make sure it is broken down into its individual components correctly.

  • Tokenization - The input string is broken down into its individual components.

  • Parsing - The individual field components are processed according to configurable rules. Parsing can include any combination of the following three stages:

    • Preprocessing - Each token is cleansed prior to matching to make the value more uniform.

    • Matching - The cleansed token is matched against patterns or value lists.

    • Postprocessing - The matched token is normalized.

      Note:

      Several parsing sequences might be performed against one field component in order to best match it with a token. Each sequence is carried out until a match is made.
  • Ambiguity Resolution - Some input strings might match more than one processing rule, so the FSM framework includes a probability-based mechanism for determining the correct state transition.

Using the person data type, for example, first names such as “Bill” and “Will” are normalized to “William,” which is then converted to its phonetic equivalent. Standardization logic is defined in the standardization engine configuration files and in the Master Person Index Configuration Editor or mefa.xml in a master person index project.
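
As a sketch of that normalize-then-encode flow, the following self-contained Java example maps a nickname to its normalized form and then applies a simplified Soundex encoding. The nickname table and the choice of Soundex are illustrative assumptions; the actual normalization files and phonetic encoders are configurable.

    import java.util.Map;

    public class NameStandardizationSketch {
        // Hypothetical nickname table; real values come from normalization files.
        static final Map<String, String> NICKNAMES =
                Map.of("BILL", "WILLIAM", "WILL", "WILLIAM", "LIZ", "ELIZABETH");

        // Simplified Soundex: keep the first letter, code the remaining
        // consonants, drop repeats and vowels, pad to four characters.
        static String soundex(String name) {
            final String codes = "01230120022455012623010202"; // classes for A-Z
            String s = name.toUpperCase().replaceAll("[^A-Z]", "");
            if (s.isEmpty()) return "";
            StringBuilder sb = new StringBuilder().append(s.charAt(0));
            char prev = codes.charAt(s.charAt(0) - 'A');
            for (int i = 1; i < s.length() && sb.length() < 4; i++) {
                char code = codes.charAt(s.charAt(i) - 'A');
                if (code != '0' && code != prev) sb.append(code);
                prev = code;
            }
            while (sb.length() < 4) sb.append('0');
            return sb.toString();
        }

        public static void main(String[] args) {
            String input = "Bill";
            String normalized = NICKNAMES.getOrDefault(input.toUpperCase(), input.toUpperCase());
            System.out.println(normalized + " -> " + soundex(normalized)); // WILLIAM -> W450
        }
    }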

FSM-Based Configuration

The FSM-based standardization configuration files are stored in the master person index project and appear in the Standardization Engine node of the project. These files are separated into groups based on the primary data types being processed. Data type groups have further subsets of configuration files based on the variants for each data type. FSM-based data types and variants, such as PersonName and PhoneNumber, include the following configuration file types.

  • Service Definition Files - Each data type and data type variant is defined by a service definition file. Service type files define the fields to be standardized for a data type and service instance files define the variant and Java factory class for the variant. Both files are in XML format and should not be modified unless the data type is extended to include more output symbols.

  • Process Definition Files - These files define the different stages of processing data for the data type or variant, including the FSM states, input and output symbols, patterns, and data cleansing rules. They use a domain-specific language (DSL) to define how the data fields are processed.

  • Lexicon Files - The standardization engine uses these files to recognize input data. A lexicon provides a list of possible values for a specific field, and one lexicon file should be defined for each field on which standardization is performed.

  • Normalization Files - The standardization engine uses these files to convert nonstandard values into a common form. A nickname file, for example, provides a list of nicknames along with the common version of each name, so that “Beth” and “Liz” might both be normalized to “Elizabeth.” Each row in the file contains a nickname and its corresponding normalized version separated by a pipe character (|); a minimal loading sketch follows this list.
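
As a minimal sketch of consuming that pipe-delimited format, the following Java example loads nickname-to-normalized-name pairs into a lookup map. The file name is hypothetical, and the engine's own loading logic is internal to it.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    public class NormalizationFileLoader {
        // Reads rows such as "Beth|Elizabeth" into an uppercase lookup map.
        public static Map<String, String> load(Path file) throws IOException {
            Map<String, String> map = new HashMap<>();
            for (String line : Files.readAllLines(file)) {
                String[] parts = line.split("\\|", 2); // nickname | normalized form
                if (parts.length == 2) {
                    map.put(parts[0].trim().toUpperCase(), parts[1].trim().toUpperCase());
                }
            }
            return map;
        }

        public static void main(String[] args) throws IOException {
            // Hypothetical file name for a given-name normalization list.
            Map<String, String> nicknames = load(Path.of("givenNameNormalization.txt"));
            System.out.println(nicknames.getOrDefault("BETH", "BETH")); // ELIZABETH
        }
    }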

Patterns-based Framework

In the patterns-based framework, the standardization process is defined in configurable dictionaries and in the underlying Java code. You can configure several aspects of the standardization process, such as the detectable patterns for each data type, how values are normalized, and how the input string is cleansed and parsed. You can define custom patterns-based data types and variants by creating custom Java packages that define the processing.

About the Patterns-based Framework

In the patterns-based framework, individual field components are recognized by the patterns defined for each data type and by information provided in configurable files about how to preprocess, match, and postprocess each field component. The patterns-based framework processes data in the following stages.

  • Parsing - A free-form text field is separated into its individual components, such as street address information or a business name. This process takes into account logic you can customize, such as token patterns, special characters, and priority weights for patterns.

  • Data-Type Identification - The engine looks up the locale-specific data dictionaries to identify the component types in the input. In the case of a postal address, for example, it identifies street directions, street names, apartment numbers, and so on.

  • Normalization - Once a field is parsed, individual components of the field are normalized based on the configuration files. This can include changing the input street name to a common form or changing the input business name to its official form.

  • Pattern Resolution - In general, more than one pattern can match the same input record, so an associated algorithm chooses the most appropriate pattern from the pattern dictionary table.

Using the street address data type, for example, street addresses are parsed into their component parts, such as house numbers, street names, and so on. Certain fields are normalized, such as street name, street type, and street directions. The street name is then phonetically converted. Standardization logic is defined in the standardization engine configuration files and in the Master Person Index Configuration Editor or mefa.xml in a master person index project.
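
A compact Java sketch of that parse-normalize-encode sequence follows. The token positions, abbreviation maps, and sample address are assumptions for illustration rather than the engine's actual rules.

    import java.util.Map;

    public class AddressStandardizationSketch {
        // Hypothetical normalization maps; real values come from clues files.
        static final Map<String, String> STREET_TYPES =
                Map.of("STREET", "ST", "AVENUE", "AVE", "BOULEVARD", "BLVD");
        static final Map<String, String> DIRECTIONS =
                Map.of("NORTH", "N", "SOUTH", "S", "EAST", "E", "WEST", "W");

        public static void main(String[] args) {
            // Naive positional parse of a well-formed sample address.
            String[] tokens = "123 North Main Street".toUpperCase().split("\\s+");
            String houseNumber = tokens[0];                   // "123"
            String streetDir   = DIRECTIONS.get(tokens[1]);   // "N"
            String streetName  = tokens[2];                   // "MAIN"
            String streetType  = STREET_TYPES.get(tokens[3]); // "ST"
            // In the real flow, a phonetic encoding of streetName would follow.
            System.out.println(houseNumber + " " + streetDir + " "
                    + streetName + " " + streetType);
        }
    }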

Patterns-based Configuration

The patterns-based standardization configuration files are stored in the master person index project and appear as nodes in the Standardization Engine node of the project. These files are separated into groups based on the primary data types and variants being processed. Patterns-based data types and variants, such as the default Address and Business Name types, use the following configuration file types.

  • Service Definition Files - Each data type and data type variant is configured by a service definition file. Service type files define the fields to be standardized for a data type, and service instance definition files define the variant and Java factory class for the variant. Both files are in XML format. These files should not be modified.

  • Category Files - The standardization engine uses category files when processing business names. These files list common values for certain types of data, such as industries and organizations for business names. Category files also define standardized versions of each term or classify the terms into different categories, and some files perform both functions. For address processing, the corresponding files are called clues files.

  • Clues Files - The standardization engine uses clues files when processing address data types. These files list general terms used in street address fields, define standardized versions of each term, and classify the terms into various component types using predefined address tokens. These files are used by the standardization engine to determine how to parse a street address into its various components. Clues files provide clues in the form of tokens to help the engine recognize the component type of certain values in the input fields.

  • Patterns Files - The patterns files specify how incoming data should be interpreted for standardization based on the format, or pattern, of the data. These files are used only for processing data contained in free-form text fields that must be parsed prior to matching (such as street address fields or business names). Patterns files list possible input data patterns, which are encoded in the form of tokens. Each token signifies a specific component of the free-form text field. For example, in a street address field, the house number is identified by one token, the street name by another, and so on. Patterns files also define the format of the output fields for each input pattern, as illustrated by the sketch following this list.

  • Key Type Files - For business name processing, the standardization engine refers to a number of key type files for processing data. These files generally define standard versions of terms commonly found in business names and some classify these terms into various components or industries. These files are used by the standardization engine to determine how to parse a business name into its different components and to recognize the component type of certain values in the input fields.

  • Reference Files - Reference files define general terms that appear in input fields for each data type. Some reference files define terms to ignore and some define terms that indicate the business name is continuing. For example, in business name processing “and” is defined as a joining term. This helps the standardization engine to recognize that the primary business name in “Martin and Sons, Inc.” is “Martin and Sons” instead of just “Martin.” Reference files can also define characters to be ignored by the standardization engine.
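
To illustrate the patterns-file idea referenced above, the following Java sketch maps token sequences to output field layouts. The token abbreviations and patterns are invented for illustration; the actual files define their own tokens and formats.

    import java.util.Map;

    public class PatternTableSketch {
        // Hypothetical patterns: each token sequence describes one way an
        // input street address can be composed, and maps to an output layout.
        static final Map<String, String> PATTERNS = Map.of(
                "NU SN ST",    "HouseNumber StreetName StreetType",
                "NU SD SN ST", "HouseNumber StreetDir StreetName StreetType");

        public static void main(String[] args) {
            // "123 N Main St" would tokenize to "NU SD SN ST" in this sketch.
            System.out.println(PATTERNS.get("NU SD SN ST"));
        }
    }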

Understanding the OHMPI Standardization and Matching Process

In a default Oracle Healthcare Master Person Index implementation, the master person index application uses the OHMPI Match Engine and the OHMPI Standardization Engine to cleanse data in real time. The standardization engine uses configurable pattern-matching logic to identify data and reformat it into a standardized form. The match engine uses a matching algorithm with a proven methodology to process and weight records in the master person index database. By incorporating both standardization and matching capabilities, you can condition data prior to matching. You can also use these capabilities to review legacy data prior to loading it into the database. This review helps you determine data anomalies, invalid or default values, and missing fields.

In a master person index application, both matching and standardization occur when two records are analyzed for the probability of a match. Before matching, certain fields are normalized, parsed, or converted into their phonetic values if necessary. The match fields are then analyzed and weighted according to the rules defined in a match configuration file. The weights for each field are combined to determine the overall matching weight for the two records. After these steps are complete, survivorship is determined by the master person index application based on how the overall matching weight compares to the duplicate and match thresholds of the master person index application.

In a master person index application, the standardization and matching process includes the following steps:

  1. The master person index application receives an incoming record.

  2. The OHMPI Standardization Engine standardizes and/or normalizes the fields. These fields are defined in mefa.xml and the rules for standardization are defined in the standardization engine configuration files.

  3. The master person index application queries the database for a candidate selection pool (records that are possible matches) using the blocking query specified in master.xml. If the blocking query uses standardized or phonetic fields, the criteria values are obtained from the database.

  4. For each possible match, the master person index application creates a match string (based on the match columns in mefa.xml) and sends the string to the OHMPI Match Engine.

  5. The OHMPI Match Engine checks the incoming record against each possible match, producing a matching weight for each. Matching is performed using the weighting rules defined in the match configuration file.
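
As a sketch of the survivorship decision that follows these steps (described earlier in this section), the following Java example compares the combined matching weight to the duplicate and match thresholds. The threshold values and outcome names are illustrative assumptions.

    public class MatchDecisionSketch {
        enum Outcome { ASSUMED_MATCH, POTENTIAL_DUPLICATE, NO_MATCH }

        // At or above the match threshold the records are treated as a match;
        // between the two thresholds they are flagged as potential duplicates.
        static Outcome decide(double weight, double duplicateThreshold,
                              double matchThreshold) {
            if (weight >= matchThreshold) return Outcome.ASSUMED_MATCH;
            if (weight >= duplicateThreshold) return Outcome.POTENTIAL_DUPLICATE;
            return Outcome.NO_MATCH;
        }

        public static void main(String[] args) {
            System.out.println(decide(31.5, 25.0, 30.0)); // ASSUMED_MATCH
        }
    }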

Internationalizing the OHMPI Standardization Engine

By default, the OHMPI Standardization Engine is configured for addresses and names originating from Australia, France, Great Britain, Mexico, and the United States, and for telephone numbers and business names of any origin. Each national variant for each data type uses a specific subset of configuration files. In addition, you can define custom national variants for the standardization engine to support addresses and names from other countries and to support other data types. You can process your data using the standardization files for a single variant, or you can use multiple variants depending on how the master person index application is configured.