Oracle Java CAPS Master Index Standardization Engine Reference

Process Definition File

The process definition file (standardizer.xml) is the primary configuration file for standardization. It defines the state model, input and output symbol definitions, preprocessing and postprocessing rules, and normalization rules for any type of standardization. Using a domain-specific markup language, you can configure any type of standardization without having to code a new Java package. Each process definition file defines the different stages of processing data for one data type or variant. The process definition file is stored in the resource folder under the data type or variant it defines.

The process definition file is divided into six primary sections, which are described in the following topics:

    Standardization State Definitions
    Input Symbol Definitions
    Output Symbol Definitions
    Data Cleansing Definitions
    Data Normalization Definitions
    Standardization Processing Rules Reference

The processing flow is defined in the state definitions. The input symbol definitions specify the token preprocessing, matching, and postprocessing logic. This is the logic carried out for each input token in a given state. The output symbols define the output for each state. The data cleansing definitions specify any transformations made to the input string prior to tokenization. Normalization definitions are used for data that does not need to be tokenized, but only needs to be normalized and optionally phonetically encoded. For example, if the input text provides the first name in its own field, the middle name in its own field, and so on, then only the normalization definitions are used to standardize the data. The standardization processing rules can be used in all sections except the standardization state definitions.

Standardization State Definitions

An FSM framework is defined by its different states and transitions between states. Each FSM begins with a start state when it receives an input string. The first recognized input symbol in the input string determines the next state based on customizable rules defined in the state model section of standardizer.xml. The next recognized input symbol determines the transition to the next state. This continues until no symbols are recognized and the termination state is reached.
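The traversal described above can be sketched in Python. The states, matchers, and probabilities below are simplified, hypothetical stand-ins for what standardizer.xml and the lexicon files actually define, and the sketch chooses transitions greedily, whereas the real engine weighs complete parse paths by probability:

```python
# state -> list of (inputSymbol, nextState, outputSymbol, probability)
TRANSITIONS = {
    "start": [
        ("salutation", "salutation", "salutation", 0.15),
        ("givenName", "headingFirstName", "firstName", 0.6),
    ],
    "salutation": [("givenName", "headingFirstName", "firstName", 1.0)],
    "headingFirstName": [("surname", "headingLastName", "lastName", 0.3)],
    "headingLastName": [],
}

# Hypothetical stand-ins for the lexicon-based input symbol matchers.
MATCHERS = {
    "salutation": lambda t: t.rstrip(".").upper() in {"MR", "MRS", "DR"},
    "givenName": lambda t: t.isalpha(),
    "surname": lambda t: t.isalpha(),
}

def standardize(tokens):
    state, output = "start", []
    for token in tokens:
        # Keep only the transitions whose input symbol recognizes the
        # token, then take the most probable one.
        candidates = [tr for tr in TRANSITIONS[state] if MATCHERS[tr[0]](token)]
        if not candidates:
            break  # no symbol recognized: the termination state is reached
        _, state, out_symbol, _ = max(candidates, key=lambda tr: tr[3])
        output.append((out_symbol, token))
    return output
```

For example, standardize(["Mr.", "John", "Smith"]) walks start, salutation, headingFirstName, and headingLastName, emitting a salutation, firstName, and lastName token in turn.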

Below is an excerpt from the state definitions for the PersonName data type. In the headingFirstName state, the first name has already been processed and the standardization engine is looking for one of the following: another first name (indicating a middle name), a last name, an abbreviation (indicating a middle initial), a conjunction, or a nickname. A probability is given for each of these symbols indicating how likely it is to be the next token.

<stateModel name="start">
   <when inputSymbol="salutation" nextState="salutation" 
         outputSymbol="salutation" probability=".15"/>
   <when inputSymbol="givenName" nextState="headingFirstName" 
         outputSymbol="firstName" probability=".6"/>
   <when inputSymbol="abbreviation" nextState="headingFirstName" 
         outputSymbol="firstName" probability=".15"/>
   <when inputSymbol="surname" nextState="trailingLastName" 
         outputSymbol="lastName" probability=".1"/>
   <state name="headingFirstName">
      <when inputSymbol="givenName" nextState="headingMiddleName" 
            outputSymbol="middleName" probability=".4"/>
      <when inputSymbol="surname" nextState="headingLastName" 
            outputSymbol="lastName" probability=".3"/>
      <when inputSymbol="abbreviation" nextState="headingMiddleName" 
            outputSymbol="middleName" probability=".1"/>
      <when inputSymbol="conjunction" nextState="headingFirstName" 
            outputSymbol="conjunction" probability=".1"/>
      <when inputSymbol="nickname" nextState="firstNickname" 
            outputSymbol="nickname" probability=".1"/>
   </state>
   ...

The following describes the XML elements and attributes used in the standardization state definitions.

stateModel
    The primary container element for the state model, which includes the definitions for each state in the FSM. This element contains a series of when elements, described below, that define the transitions from the start state to any of the other states. It also contains a series of state elements that define the remaining FSM states.

    name
        The name of the start state (by default, “start”).

state
    A definition for one state in the FSM (not including the start state). Each state element contains a series of when elements, with the attributes described below, that define the processing flow.

    name
        The name of the state. The names defined here are referenced by the nextState attributes, described below, to specify the next state.

when
    A statement defining which state to transition to and which symbol to output when a specific input symbol is recognized in the current state. These elements define the possible transitions from one state to another.

    inputSymbol
        The name of an input symbol that might occur next in the input string. This must match one of the input symbols defined later in the file. For more information about input symbols and their processing logic, see Input Symbol Definitions.

    nextState
        The name of the state to transition to when the specified input symbol is recognized. This must match the name of one of the states defined in the state model section.

    outputSymbol
        The name of the symbol that the current state produces when processing is complete for the state based on the input symbol. Not all transitions have an output symbol. This must match one of the output symbols defined later in the file. For more information, see Output Symbol Definitions.

    probability
        The probability that the given input symbol is actually the next symbol in the input string, expressed as a decimal between 0 and 1, inclusive. All probabilities for a given state must add up to 1. If a state definition includes the eof element described below, the eof probability is included in that sum.

eof
    probability
        The probability that the FSM has reached the end of the input string in the current state, expressed as a decimal between 0 and 1, inclusive. The sum of this probability and all other probabilities for the state must be 1.

Input Symbol Definitions

The input symbol definitions name and define the processing logic for each input symbol recognized by the states. For each state, each possible input symbol is tried according to the rules defined here, and the probability that it is the next token is then assessed. Each input symbol might be subject to preprocessing, token matching, and postprocessing. Preprocessing can include removing punctuation or performing other regular expression substitutions. The value can then be matched against values in the lexicon file or against regular expressions. If the value matches, it can then be normalized based on the specified normalization file or on pattern replacement. One input symbol can have multiple preprocessing, matching, and postprocessing iterations to go through. If there are multiple iterations, each is carried out in turn until a match is found. All of these steps are optional.

Below is an excerpt from the input symbol definitions for PersonName processing. This excerpt processes the salutation portion of the input string by first removing periods, then comparing the value against the entries in the salutation.txt file, and finally normalizing the matched value based on the corresponding entry in the salutationNormalization.txt file. For example, if the value to process is “Mr.”, it is first changed to “Mr”, matched against a list of salutations, and then converted to “Mister” based on the entry in the normalization file.

<inputSymbol name="salutation">
   <matchers>
      <matcher>
         <preProcessing>
            <replaceAll regex="\." replacement=""/>
         </preProcessing>
         <lexicon resource="salutation.txt"/>
         <postProcessing>
            <dictionary resource="salutationNormalization.txt" separator="\|"/>
         </postProcessing>
      </matcher>
   </matchers>
</inputSymbol>
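The salutation matcher above can be sketched in Python. The lexicon and normalization entries below are hypothetical stand-ins for the contents of salutation.txt and salutationNormalization.txt, and the sketch uppercases before the lookup (the engine's cleanser normally does this earlier):

```python
import re

# Hypothetical stand-ins for salutation.txt and salutationNormalization.txt.
SALUTATIONS = {"MR", "MRS", "MS", "DR"}
NORMALIZED = {"MR": "MISTER", "MRS": "MISSUS", "DR": "DOCTOR"}

def match_salutation(token):
    value = re.sub(r"\.", "", token).upper()  # preProcessing: replaceAll
    if value not in SALUTATIONS:              # matching: lexicon lookup
        return None                           # no match; try the next matcher
    return NORMALIZED.get(value, value)       # postProcessing: dictionary
```

Here match_salutation("Mr.") yields "MISTER", while an unrecognized token yields None so that the next matcher definition (if any) can be tried.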

The following describes the XML elements and attributes used in the input symbol definitions.

inputSymbol
    A container element for the processing logic for one input symbol.

    name
        The name of the input symbol to which the following logic applies.

matchers
    A list of processing definitions, each of which defines one preprocessing, matching, and postprocessing sequence. Not all definitions include all three steps.

matcher
    A processing definition for one sequence of preprocessing, matching, and postprocessing. A processing definition might contain only one, or any combination, of the three steps.

    factor
        A factor to apply to the probability specified for the input symbol in the state definition. For example, if the state definition probability is .4 and this factor is .25, the probability for this matching sequence is .1. Define this attribute only when the probability for this matching sequence is very low.

preProcessing
    A container element for the preprocessing rules to be carried out against an input symbol. For more information about the rules you can use, see Standardization Processing Rules Reference.

lexicon
    resource
        The name of the lexicon file containing the list of values to match the input symbol against.

        Note - You can also match against patterns or regular expressions. For more information, see matchAllPatterns and pattern in Standardization Processing Rules Reference.

postProcessing
    A container element for the postprocessing rules to be carried out against an input symbol that has been matched. For more information about the rules you can use, see Standardization Processing Rules Reference.

Output Symbol Definitions

The output symbol definitions name each output symbol that can be produced by the defined states. This section can define additional processing for output symbols using the rules described in Standardization Processing Rules Reference. Each output symbol defined in the state model definitions must match a value defined here. Below is an excerpt from the output symbol definitions for PersonName processing.

<outputSymbols>
   <outputSymbol name="salutation"/>
   <outputSymbol name="firstName"/>
   <outputSymbol name="middleName"/>
   <outputSymbol name="nickname"/>
   <outputSymbol name="lastName"/>
   <outputSymbol name="generation"/>
   <outputSymbol name="title"/>
   <outputSymbol name="conjunction"/>
</outputSymbols>

The following describes the XML elements and attributes used in the output symbol definitions.

outputSymbols
    A container element listing the output symbols that the processing states can produce.

outputSymbol
    A definition for one output symbol.

    name
        The name of the output symbol.

occurrenceConcatenator
    An optional class that specifies the character separating contiguous occurrences of the same output symbol. For example, this is used in the PhoneNumber data type to concatenate phone number components that are separated by dashes. Components are concatenated using blanks.

    class
        The name of the occurrence concatenator class. One concatenator class is predefined.

    property
        A parameter for the occurrence concatenator class. For the default class, the parameter specifies a separator character.

        name
            The name of the parameter. For the default class, the name is “separator”.

        value
            The parameter value.

tokenConcatenator
    An optional class that specifies the character separating non-contiguous occurrences of the same output symbol. For example, this is used in the PhoneNumber data type to concatenate phone number components.

    class
        The name of the token concatenator class. One concatenator class is predefined.

    property
        A parameter for the token concatenator class. For the default class, the parameter specifies a separator character.

        name
            The name of the parameter. For the default class, the name is “separator”.

        value
            The parameter value.

Data Cleansing Definitions

You can define cleansing rules to transform the input data prior to tokenization to make the input record uniform and ensure the data is correctly separated into its individual components. This standardization step is optional.

Common data transformations include the following:

    Converting the input string to all uppercase
    Removing or replacing punctuation
    Normalizing white space

The cleansing rules are defined within a cleanser element in the process definition file. You can use any of the rules defined in Standardization Processing Rules Reference to cleanse the data. Cleansing attributes use regular expressions to define values to find and replace.

The following excerpt from the PhoneNumber data type does the following to the input string prior to processing:

    Converts the string to all uppercase
    Reformats any run of ten digits as (nnn)nnn-nnnn
    Surrounds dashes, parentheses, and commas with spaces
    Removes the space between an international prefix and the dash that follows it
    Normalizes extension indicators (such as “EXT.”) to “X” followed by the extension number
    Normalizes white space

<cleanser>
   <uppercase/>
   <replaceAll regex="([0-9]{3})([0-9]{3})([0-9]{4})" replacement="($1)$2-$3"/>
   <replaceAll regex="([-(),])" replacement=" $1 "/>
   <replaceAll regex="\+(\d+) -" replacement="+$1-"/>
   <replaceAll regex="E?X[A-Z]*[.#]?\s*([0-9]+)" replacement="X $1"/>
   <normalizeSpace/>
</cleanser>
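In terms of Python's re module, the cleanser above is roughly equivalent to the following sketch (the helper name cleanse_phone is illustrative, not part of the engine):

```python
import re

def cleanse_phone(s):
    s = s.upper()                                                   # <uppercase/>
    s = re.sub(r"([0-9]{3})([0-9]{3})([0-9]{4})", r"(\1)\2-\3", s)  # 10 digits -> (nnn)nnn-nnnn
    s = re.sub(r"([-(),])", r" \1 ", s)                             # space out separators
    s = re.sub(r"\+(\d+) -", r"+\1-", s)                            # tighten "+nn -" prefixes
    s = re.sub(r"E?X[A-Z]*[.#]?\s*([0-9]+)", r"X \1", s)            # "EXT. 23" -> "X 23"
    return " ".join(s.split())                                      # <normalizeSpace/>
```

For example, cleanse_phone("3105551212 ext. 23") produces "( 310 ) 555 - 1212 X 23", a uniform token stream ready for tokenization.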

Data Normalization Definitions

If the data you are standardizing does not need to be parsed, but does require normalization, you can define data normalization rules to be used instead of the state model defined earlier in the process definition file. These rules are used, for example, for person names whose components are already contained in separate fields and do not need to be parsed. In this case, the standardization engine processes one field at a time according to the rules defined in the normalizer section of standardizer.xml. In this section, you can also define preprocessing rules to be applied to the fields prior to normalization.

Below is an excerpt from the PersonName data type. These rules convert the input string to all uppercase, then process the FirstName and MiddleName fields based on the givenName input symbol and the LastName field based on the surname input symbol.

<normalizer>
   <preProcessing>
      <uppercase/>
   </preProcessing>
   <for field="FirstName" use="givenName"/>
   <for field="MiddleName" use="givenName"/>
   <for field="LastName" use="surname"/>
</normalizer>
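A sketch of this field-by-field normalization in Python, using a hypothetical in-memory table in place of the givenName normalization file:

```python
# Hypothetical stand-in for the givenName normalization data.
GIVEN_NAME_NORM = {"BILL": "WILLIAM", "BOB": "ROBERT", "LIZ": "ELIZABETH"}

def normalize_fields(record):
    out = {}
    for field, symbol in (("FirstName", "givenName"),
                          ("MiddleName", "givenName"),
                          ("LastName", "surname")):
        value = record.get(field, "").upper()          # preProcessing: uppercase
        if symbol == "givenName":
            value = GIVEN_NAME_NORM.get(value, value)  # normalize given names
        out[field] = value
    return out
```

Each field is handled independently; no tokenization or state model is involved.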

The following describes the XML elements and attributes used in the normalization definitions.

normalizer
    A container element for the normalization rules used when field components do not require parsing, but do require normalization.

preProcessing
    A container element for any preprocessing rules to apply to the input strings prior to normalization. For more information about preprocessing rules, see Standardization Processing Rules Reference.

for
    Specifies the input symbol to use for a given field, as defined by the following attributes.

    field
        The name of the field to be normalized.

    use
        The name of the input symbol to associate with the field. The processing logic defined for that input symbol earlier in the file is used to normalize the data contained in the field.

Standardization Processing Rules Reference

The Master Index Standardization Engine provides several matching and transformation rules for input values and patterns. You can add or modify any of these rules in the existing process definition files (standardizer.xml). Several of these rules use regular expressions to define patterns and values. See the Javadoc for java.util.regex for more information about regular expressions.

The available rules include the following:

    dictionary
    fixedString
    lexicon
    normalizeSpace
    pattern
    replace
    replaceAll
    transliterate
    uppercase

dictionary

This rule checks the input value against a list of values in the specified normalization file and, if the value is found, converts the input value to its normalized value. This is generally used for postprocessing but can also be used for preprocessing tokens. The normalization files are located in the same directory as the process definition file (the instance folder for the data type or variant).

The syntax for dictionary is:

<dictionary resource="file_name" separator="delimiter"/>

The parameters for dictionary are:

    resource
        The name of the normalization file to match the input value against. The file contains a list of input values and their corresponding normalized versions.

    separator
        The character that separates an input value from its normalized version in the normalization file.

Example 1 Sample dictionary Rule

The following sample checks the input value against the list in the first column of the givenNameNormalization.txt file, which uses a pipe symbol (|) to separate the input value from its normalized version. When a value is matched, the input value is converted to its normalized version.

<dictionary resource="givenNameNormalization.txt" separator="\|"/>
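A sketch of how such a pipe-separated normalization file is read and applied. The two entries are hypothetical, and the engine treats the separator as a regular expression (hence the escaped pipe), while this sketch splits on the literal character:

```python
import io

# Hypothetical file contents in the raw|normalized format.
sample_file = io.StringIO("BILL|WILLIAM\nBOB|ROBERT\n")

def load_dictionary(fh, separator="|"):
    table = {}
    for line in fh:
        raw, normalized = line.rstrip("\n").split(separator, 1)
        table[raw] = normalized
    return table

TABLE = load_dictionary(sample_file)

def lookup(value):
    # A matched value is replaced by its normalized version; unmatched
    # values pass through unchanged.
    return TABLE.get(value, value)
```

For example, lookup("BILL") produces "WILLIAM", while an unlisted value is returned as-is.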

fixedString

This rule checks the input value against a fixed value. This is generally used for the token matching step for input symbol processing. You can define a list of fixed strings for an input symbol by enclosing multiple fixedString elements within a fixedStrings element. The syntax for fixedString is:

<fixedString>string</fixedString>

The parameter for fixedString is:

    string
        The fixed value to match the input value against.

Example 2 Sample fixedString Rules

The following sample matches the input value against the fixed values “AND”, “OR” and “AND/OR”. If one of the fixed values matches the input string, processing is continued for that matcher definition. If no fixed values match the input string, processing is stopped for that matcher definition and the next matcher definition is processed (if one exists).

<fixedStrings>
   <fixedString>AND</fixedString>
   <fixedString>OR</fixedString>
   <fixedString>AND/OR</fixedString>
</fixedStrings>

lexicon

This rule checks the input value against a list of values in the specified lexicon file. This is generally used for token matching. The lexicon files are located in the same directory as the process definition file (the instance folder for the data type or variant).

The syntax for lexicon is:

<lexicon resource="file_name"/>

The parameter for lexicon is:

    resource
        The name of the lexicon file containing the list of values to match the input value against.

Example 3 Sample lexicon Rule

The following sample checks the input value against the list in the givenName.txt file. When a value is matched, the standardization engine continues to the postprocessing phase if one is defined.

<lexicon resource="givenName.txt"/>

normalizeSpace

This rule removes leading and trailing white space from a string and changes multiple spaces in the middle of a string to a single space. The syntax for normalizeSpace is:

<normalizeSpace/>

Example 4 Sample normalizeSpace Rule

The following sample removes the leading and trailing white space from a last name field prior to checking the input value against the surnames.txt file.

<matcher>
   <preProcessing>
     <normalizeSpace/>
   </preProcessing>
   <lexicon resource="surnames.txt"/>
</matcher>
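In Python terms, the same transformation can be expressed as:

```python
import re

def normalize_space(s):
    # Trim leading and trailing white space, and collapse internal runs
    # of white space to a single space.
    return re.sub(r"\s+", " ", s).strip()
```

For example, normalize_space("  John   Smith ") produces "John Smith".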

pattern

This rule checks the input value against a specific regular expression to see if the patterns match. You can define a sequence of patterns by including them all in order in a matchAllPatterns element. You can also specify sub-patterns to exclude. The syntax for pattern is:

<pattern regex="regex_pattern"/>

The parameter for pattern is:

    regex
        The regular expression to match the input value against.

The pattern rule can be further customized by adding exceptFor rules that define patterns to exclude in the matching process. The syntax for exceptFor is:

<pattern regex="regex_pattern">
   <exceptFor regex="regex_pattern"/>
</pattern>

The parameter for exceptFor is:

    regex
        A regular expression defining a pattern that is excluded from matching.

Example 5 Sample pattern Rule

The following sample checks the input value against the sequence of patterns to see if the input value might be an area code. These rules specify a pattern that matches three digits contained in parentheses, such as (310).

<matchAllPatterns>
   <pattern regex="\("/>
   <pattern regex="[0-9]{3}"/>
   <pattern regex="\)"/>
</matchAllPatterns>
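One plausible reading of this sequence matching, sketched in Python under the assumption that each pattern must fully match the corresponding token in order:

```python
import re

def match_all_patterns(patterns, tokens):
    # Every pattern must fully match its corresponding token, in order.
    return (len(patterns) == len(tokens) and
            all(re.fullmatch(p, t) for p, t in zip(patterns, tokens)))
```

Under this reading, the token run "(", "310", ")" satisfies the area-code pattern sequence.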

The following sample checks the input value to see if its pattern is a series of three letters excluding THE and AND.

<pattern regex="[A-Z]{3}">
   <exceptFor regex="THE"/>
   <exceptFor regex="AND"/>
</pattern>
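The pattern-with-exclusions logic can be sketched as follows (the helper name matches_pattern is illustrative):

```python
import re

def matches_pattern(value, regex, except_for=()):
    # The value must fit the main pattern and none of the excluded patterns.
    if not re.fullmatch(regex, value):
        return False
    return not any(re.fullmatch(ex, value) for ex in except_for)
```

For example, "XYZ" matches the three-letter pattern, while "AND" is rejected by its exceptFor rule.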

replace

This rule checks the input value for a specific pattern. If the pattern is found, it is replaced by a new pattern. This rule only replaces the first instance it finds of the pattern. The syntax for replace is:

<replace regex="regex_pattern" replacement="regex_pattern"/>

The parameters for replace are:

    regex
        The regular expression defining the pattern to find in the input value.

    replacement
        The pattern that replaces the first instance of the matched pattern.

Example 6 Sample replace Rule

The following sample tries to match the input value against “ST.”. If a match is found, the standardization engine replaces the value with “SAINT”.

<replace regex="ST\." replacement="SAINT"/>
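Python's re.sub with count=1 shows the same first-match-only behavior:

```python
import re

# replace rewrites only the first occurrence of the pattern.
result = re.sub(r"ST\.", "SAINT", "ST. LOUIS ST.", count=1)  # "SAINT LOUIS ST."
```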

replaceAll

This rule checks the input value for a specific pattern. If the pattern is found, all instances are replaced by a new pattern. The syntax for replaceAll is:

<replaceAll regex="regex_pattern" replacement="regex_pattern"/>

The parameters for replaceAll are:

    regex
        The regular expression defining the pattern to find in the input value.

    replacement
        The pattern that replaces all instances of the matched pattern.

Example 7 Sample replaceAll Rule

The following sample finds all periods in the input value and removes them by replacing each with an empty string.

<replaceAll regex="\." replacement=""/>
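Unlike replace, every occurrence is rewritten; in Python, re.sub with no count limit behaves the same way:

```python
import re

# replaceAll rewrites every occurrence of the pattern.
result = re.sub(r"\.", "", "J. R. R. TOLKIEN")  # "J R R TOLKIEN"
```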

transliterate

This rule converts the specified characters in the input string to a new set of characters, typically converting from one alphabet to another by adding or removing diacritical marks. The syntax for transliterate is:

<transliterate from="existing_char" to="new_char"/>

The parameters for transliterate are:

    from
        The list of characters to convert.

    to
        The list of characters to convert to. Each character in the from list is converted to the character in the corresponding position of the to list.

Example 8 Sample transliterate Rule

The following sample converts lower case vowels with acute accents to vowels with no accents.

<transliterate from="áéíóú" to="aeiou"/>
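Python's str.translate performs the same positional character mapping:

```python
# Each character in the "from" list maps to the character at the same
# position in the "to" list; unmapped characters pass through unchanged.
accent_map = str.maketrans("áéíóú", "aeiou")
plain = "José Muñoz".translate(accent_map)  # "Jose Muñoz"
```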

uppercase

This rule converts all characters in the input string to upper case. The rule does not take any parameters. The syntax for uppercase is:

<uppercase/>

Example 9 Sample uppercase Rule

The following sample converts the entire input string into uppercase prior to doing any pattern or value replacements. Since this is defined in the cleanser section, this is performed prior to tokenization.

<cleanser>
   <uppercase/>
   <replaceAll regex="\." replacement=". "/>
   <replaceAll regex="AND / OR" replacement="AND/OR"/>
   ...
</cleanser>