Oracle Java CAPS Master Index Standardization Engine Reference
About the Master Index Standardization Engine
Master Index Standardization Engine Overview
How the Master Index Standardization Engine Works
Master Index Standardization Engine Data Types and Variants
Master Index Standardization Engine Standardization Components
Finite State Machine Framework
About the Finite State Machine Framework
About the Rules-Based Framework
Oracle Java CAPS Master Index Standardization and Matching Process
Master Index Standardization Engine Internationalization
Finite State Machine Framework Configuration
FSM Framework Configuration Overview
Standardization State Definitions
Data Normalization Definitions
FSM-Based Person Name Configuration
Person Name Standardization Overview
Person Name Standardization Components
Person Name Standardization Files
Person Name Normalization Files
Person Name Process Definition Files
Person Name Standardization and Oracle Java CAPS Master Index
Person Name Standardized Fields
Configuring a Normalization Structure for Person Names
Configuring a Standardization Structure for Person Names
Configuring Phonetic Encoding for Person Names
FSM-Based Telephone Number Configuration
Telephone Number Standardization Overview
Telephone Number Standardization Components
Telephone Number Standardization Files
Telephone Number Standardization and Oracle Java CAPS Master Index
Telephone Number Processing Fields
Telephone Number Standardized Fields
Telephone Number Object Structure
Configuring a Standardization Structure for Telephone Numbers
Rules-Based Address Data Configuration
Address Data Standardization Overview
Address Data Standardization Components
Address Data Standardization Files
Address Pattern File Components
Address Standardization and Oracle Java CAPS Master Index
Address Data Processing Fields
Configuring a Standardization Structure for Address Data
Configuring Phonetic Encoding for Address Data
Rules-Based Business Name Configuration
Business Name Standardization Overview
Business Name Standardization Components
Business Name Standardization Files
Business Name Adjectives Key Type File
Business Association Key Type File
Business General Terms Reference File
Business City or State Key Type File
Business Former Name Reference File
Merged Business Name Category File
Primary Business Name Reference File
Business Connector Tokens Reference File
Business Country Key Type File
Business Industry Sector Reference File
Business Industry Key Type File
Business Organization Key Type File
Business Name Standardization and Oracle Java CAPS Master Index
Business Name Processing Fields
Business Name Standardized Fields
Business Name Object Structure
Configuring a Standardization Structure for Business Names
Configuring Phonetic Encoding for Business Names
Custom FSM-Based Data Types and Variants
About Custom FSM-Based Data Types and Variants
About the Standardization Packages
Creating Custom FSM-Based Data Types
Creating the Working Directory
To Create the Working Directory
Packaging and Importing the Data Type
To Package and Import the Data Type
Creating Custom FSM-Based Variants
Creating the Working Directory
To Create the Working Directory
To Define the Service Instance
Defining the State Model and Processing Rules
To Define the State Model and Processing Rules
Creating Normalization and Lexicon Files
To Create Normalization and Lexicon Files
Packaging and Importing the Variant
The process definition file (standardizer.xml) is the primary configuration file for standardization. It defines the state model, input and output symbol definitions, preprocessing and postprocessing rules, and normalization rules for any type of standardization. Using a domain-specific markup language, you can configure any type of standardization without having to code a new Java package. Each process definition file defines the different stages of processing data for one data type or variant. The process definition file is stored in the resource folder under the data type or variant it defines.
The process definition file is divided into six primary sections, which are described in the following topics:
The six sections are summarized below.
The standardization state definitions specify the processing flow.
The input symbol definitions specify the token preprocessing, matching, and postprocessing logic; this is the logic carried out for each input token in a given state.
The output symbol definitions define the output for each state.
The data cleansing definitions specify any transformations made to the input string prior to tokenization.
The normalization definitions are used for data that does not need to be tokenized, but only needs to be normalized and, optionally, phonetically encoded. For example, if the input text provides the first name in its own field, the middle name in its own field, and so on, then only the normalization definitions are used to standardize the data.
The standardization processing rules can be used in all sections except the standardization state definitions.
An FSM framework is defined by its different states and transitions between states. Each FSM begins with a start state when it receives an input string. The first recognized input symbol in the input string determines the next state based on customizable rules defined in the state model section of standardizer.xml. The next recognized input symbol determines the transition to the next state. This continues until no symbols are recognized and the termination state is reached.
Below is an excerpt from the state definitions for the PersonName data type. In the headingFirstName state, the first name has already been processed and the standardization engine is looking for one of the following: a first name (indicating a middle name), a last name, an abbreviation (indicating a middle initial), a conjunction, or a nickname. A probability is given for each of these symbols, indicating how likely it is to be the next token.
<stateModel name="start">
    <when inputSymbol="salutation" nextState="salutation" outputSymbol="salutation" probability=".15"/>
    <when inputSymbol="givenName" nextState="headingFirstName" outputSymbol="firstName" probability=".6"/>
    <when inputSymbol="abbreviation" nextState="headingFirstName" outputSymbol="firstName" probability=".15"/>
    <when inputSymbol="surname" nextState="trailingLastName" outputSymbol="lastName" probability=".1"/>
    <state name="headingFirstName">
        <when inputSymbol="givenName" nextState="headingMiddleName" outputSymbol="middleName" probability=".4"/>
        <when inputSymbol="surname" nextState="headingLastName" outputSymbol="lastName" probability=".3"/>
        <when inputSymbol="abbreviation" nextState="headingMiddleName" outputSymbol="middleName" probability=".1"/>
        <when inputSymbol="conjunction" nextState="headingFirstName" outputSymbol="conjunction" probability=".1"/>
        <when inputSymbol="nickname" nextState="firstNickname" outputSymbol="nickname" probability=".1"/>
    </state>
    ...
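The traversal through these states can be sketched in Python. This is a toy illustration only: the lexicons and the transition table below are invented samples, and the real engine reads its state model from standardizer.xml.

```python
# Toy sketch of the FSM traversal: classify each token against the current
# state's input symbols and follow the matching transition. The lexicons and
# transitions below are invented samples, not the shipped PersonName files.
LEXICONS = {
    "salutation": {"MR", "MRS", "DR"},
    "givenName":  {"JOHN", "MARY", "LEE"},
    "surname":    {"SMITH", "JONES", "LEE"},
}

# state -> ordered list of (inputSymbol, nextState, outputSymbol)
TRANSITIONS = {
    "start":             [("salutation", "salutation", "salutation"),
                          ("givenName", "headingFirstName", "firstName")],
    "salutation":        [("givenName", "headingFirstName", "firstName")],
    "headingFirstName":  [("givenName", "headingMiddleName", "middleName"),
                          ("surname", "headingLastName", "lastName")],
    "headingMiddleName": [("surname", "headingLastName", "lastName")],
    "headingLastName":   [],
}

def tokenize_name(text):
    state, fields = "start", {}
    for token in text.upper().split():
        for symbol, next_state, output in TRANSITIONS[state]:
            if token in LEXICONS[symbol]:
                fields[output] = token
                state = next_state
                break
        else:
            break  # no input symbol recognized: stop processing
    return fields

print(tokenize_name("Mr John Lee Smith"))
# {'salutation': 'MR', 'firstName': 'JOHN', 'middleName': 'LEE', 'lastName': 'SMITH'}
```

Note that this sketch simply takes the first matching symbol; the real engine scores competing parses using the probability attributes (here, LEE matches both givenName and surname) and keeps the most likely interpretation.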
The following table lists and describes the XML elements and attributes for the standardization state definitions.
The input symbol definitions name and define processing logic for each input symbol recognized by the states. For each state, each possible input symbol is tried according to the rules defined here, and the probability that it is the next token is assessed. Each input symbol can be subject to preprocessing, token matching, and postprocessing. Preprocessing can include removing punctuation or performing other regular expression substitutions. The value can then be matched against values in the lexicon file or against regular expressions. If the value matches, it can then be normalized based on the specified normalization file or on pattern replacement. One input symbol can have multiple preprocessing, matching, and postprocessing iterations to go through. If there are multiple iterations, each is carried out in turn until a match is found. All of these steps are optional.
Below is an excerpt from the input symbol definitions for PersonName processing. This excerpt processes the salutation portion of the input string by first removing periods, then comparing the value against the entries in the salutation.txt file, and finally normalizing the matched value based on the corresponding entry in the salutationNormalization.txt file. For example, if the value to process is “Mr.”, it is first changed to “Mr”, matched against a list of salutations, and then converted to “Mister” based on the entry in the normalization file.
<inputSymbol name="salutation">
    <matchers>
        <matcher>
            <preProcessing>
                <replaceAll regex="\." replacement=""/>
            </preProcessing>
            <lexicon resource="salutation.txt"/>
            <postProcessing>
                <dictionary resource="salutationNormalization.txt" separator="\|"/>
            </postProcessing>
        </matcher>
    </matchers>
</inputSymbol>
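The matcher pipeline above can be sketched in Python. The file contents are inlined here as invented samples; the engine reads salutation.txt and salutationNormalization.txt from the variant's resource folder.

```python
import re

# Sketch of the salutation matcher: preprocess, lexicon match, then normalize.
# The inlined file contents are invented samples.
SALUTATIONS = {"MR", "MRS", "MS", "DR"}                         # salutation.txt
NORMALIZED = {"MR": "MISTER", "MRS": "MISSUS", "DR": "DOCTOR"}  # salutationNormalization.txt

def match_salutation(token):
    token = re.sub(r"\.", "", token.upper())  # preProcessing: strip periods
    if token not in SALUTATIONS:              # lexicon match
        return None                           # no match: try the next matcher, if any
    return NORMALIZED.get(token, token)       # postProcessing: dictionary lookup

print(match_salutation("Mr."))  # MISTER
```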
The following table lists and describes the XML elements and attributes for the input symbol definitions.
The output symbol definitions name each output symbol that can be produced by the defined states. This section can define additional processing for output symbols using the rules described in Standardization Processing Rules Reference. Each output symbol defined in the state model definitions must match a value defined here. Below is an excerpt from the output symbol definitions for PersonName processing.
<outputSymbols>
    <outputSymbol name="salutation"/>
    <outputSymbol name="firstName"/>
    <outputSymbol name="middleName"/>
    <outputSymbol name="nickname"/>
    <outputSymbol name="lastName"/>
    <outputSymbol name="generation"/>
    <outputSymbol name="title"/>
    <outputSymbol name="conjunction"/>
</outputSymbols>
The following table lists and describes the XML elements and attributes for the output symbol definitions.
You can define cleansing rules to transform the input data prior to tokenization to make the input record uniform and ensure the data is correctly separated into its individual components. This standardization step is optional.
Common data transformations include the following:
Converting a string to all uppercase.
Trimming leading and trailing white space.
Converting multiple spaces in the middle of a string to one space.
Transliterating accent characters or diacritical marks.
Adding a space on either side of extra characters (to help the tokenizer recognize them).
Removing extraneous content.
Fixing common typographical errors.
The cleansing rules are defined within a cleanser element in the process definition file. You can use any of the rules defined in Standardization Processing Rules Reference to cleanse the data. Cleansing attributes use regular expressions to define values to find and replace.
The following excerpt from the PhoneNumber data type does the following to the input string prior to processing:
Converts all characters to upper case.
Replaces the specified input patterns with new patterns.
Removes white space at the beginning and end of the string and concatenates multiple consecutive spaces into one space.
<cleanser>
    <uppercase/>
    <replaceAll regex="([0-9]{3})([0-9]{3})([0-9]{4})" replacement="($1)$2-$3"/>
    <replaceAll regex="([-(),])" replacement=" $1 "/>
    <replaceAll regex="\+(\d+) -" replacement="+$1-"/>
    <replaceAll regex="E?X[A-Z]*[.#]?\s*([0-9]+)" replacement="X $1"/>
    <normalizeSpace/>
</cleanser>
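Replaying these cleanser rules with Python's re module shows their combined effect on a sample value. re.sub is the analog of replaceAll; note that Java's $1 back-reference is written \1 in Python.

```python
import re

# Re-implementation of the cleanser rules above for illustration.
def cleanse_phone(s):
    s = s.upper()                                                   # <uppercase/>
    s = re.sub(r"([0-9]{3})([0-9]{3})([0-9]{4})", r"(\1)\2-\3", s)  # group 10 digits
    s = re.sub(r"([-(),])", r" \1 ", s)                             # space out punctuation
    s = re.sub(r"\+(\d+) -", r"+\1-", s)                            # tidy country codes
    s = re.sub(r"E?X[A-Z]*[.#]?\s*([0-9]+)", r"X \1", s)            # normalize extensions
    return re.sub(r"\s+", " ", s).strip()                           # <normalizeSpace/>

print(cleanse_phone("3105551234 ext. 23"))  # ( 310 ) 555 - 1234 X 23
```

The spaces inserted around punctuation are deliberate: they help the tokenizer recognize each component as a separate token.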
If the data you are standardizing does not need to be parsed, but does require normalization, you can define data normalization rules to be used instead of the state model defined earlier in the process definition file. These rules would be used in the case of person names where the field components are already contained in separate fields and do not need to be parsed. In this case, the standardization engine processes one field at a time according to the rules defined in the normalizer section of standardizer.xml. In this section, you can define preprocessing rules to be applied to the fields prior to normalization.
Below is an excerpt from the PersonName data type. These rules convert the input string to all uppercase, and then process the FirstName and MiddleName fields based on the givenName input symbol and the LastName field based on the surname input symbol.
<normalizer>
    <preProcessing>
        <uppercase/>
    </preProcessing>
    <for field="FirstName" use="givenName"/>
    <for field="MiddleName" use="givenName"/>
    <for field="LastName" use="surname"/>
</normalizer>
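A minimal sketch of this field-level normalization in Python (the normalization tables below are invented samples; the engine derives them from the input symbol's normalization files):

```python
# Sketch of field-level normalization: each field is normalized using one
# input symbol's rules, with no parsing. The tables are invented samples.
GIVEN_NAME = {"BILL": "WILLIAM", "BOB": "ROBERT"}  # givenName normalization
SURNAME = {}                                       # surname normalization

FIELD_SYMBOLS = {"FirstName": GIVEN_NAME, "MiddleName": GIVEN_NAME,
                 "LastName": SURNAME}

def normalize_record(record):
    out = {}
    for field, value in record.items():
        value = value.upper()                  # preProcessing: uppercase
        table = FIELD_SYMBOLS.get(field, {})
        out[field] = table.get(value, value)   # normalize if listed, else keep
    return out

print(normalize_record({"FirstName": "Bill", "LastName": "Smith"}))
# {'FirstName': 'WILLIAM', 'LastName': 'SMITH'}
```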
The following table lists and describes the XML elements and attributes for the normalization definitions.
The Master Index Standardization Engine provides several matching and transformation rules for input values and patterns. You can add or modify any of these rules in the existing process definition files (standardizer.xml). Several of these rules use regular expressions to define patterns and values. See the Javadoc for java.util.regex for more information about regular expressions.
The available rules include the following:
This rule checks the input value against a list of values in the specified normalization file and, if the value is found, converts the input value to its normalized value. This rule is generally used for postprocessing but can also be used for preprocessing tokens. The normalization files are located in the same directory as the process definition file (the instance folder for the data type or variant).
The syntax for dictionary is:
<dictionary resource="file_name" separator="delimiter"/>
The parameters for dictionary are:
resource – The name of the normalization file to use to look up the input value and determine the normalized value.
separator – The character used in the normalization file to separate the input value entries from the normalized versions. The default normalization files all use a pipe (|) as a separator.
Example 1 Sample dictionary Rule
The following sample checks the input value against the list in the first column of the givenNameNormalization.txt file, which uses a pipe symbol (|) to separate the input value from its normalized version. When a value is matched, the input value is converted to its normalized version.
<dictionary resource="givenNameNormalization.txt" separator="\|"/>
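The lookup behavior can be sketched in Python. The file content is inlined here as an invented sample; note that the separator is treated as a regular expression, which is why the pipe is escaped in the rule above.

```python
import re

# Sketch of the dictionary lookup; the engine loads the real
# givenNameNormalization.txt from the variant's resource folder.
FILE_CONTENT = "BILL|WILLIAM\nBOB|ROBERT\nBETH|ELIZABETH\n"  # invented sample

def load_dictionary(text, separator=r"\|"):
    table = {}
    for line in text.splitlines():
        if line:
            raw, normalized = re.split(separator, line, maxsplit=1)
            table[raw] = normalized
    return table

table = load_dictionary(FILE_CONTENT)
print(table.get("BOB", "BOB"))  # ROBERT
```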
This rule checks the input value against a fixed value. This is generally used for the token matching step for input symbol processing. You can define a list of fixed strings for an input symbol by enclosing multiple fixedString elements within a fixedStrings element. The syntax for fixedString is:
<fixedString>string</fixedString>
The parameter for fixedString is:
string – The fixed value to compare the input value against.
Example 2 Sample fixedString Rules
The following sample matches the input value against the fixed values “AND”, “OR” and “AND/OR”. If one of the fixed values matches the input string, processing is continued for that matcher definition. If no fixed values match the input string, processing is stopped for that matcher definition and the next matcher definition is processed (if one exists).
<fixedStrings>
    <fixedString>AND</fixedString>
    <fixedString>OR</fixedString>
    <fixedString>AND/OR</fixedString>
</fixedStrings>
This rule checks the input value against a list of values in the specified lexicon file. This rule is generally used for token matching. The lexicon files are located in the same directory as the process definition file (the instance folder for the data type or variant).
The syntax for lexicon is:
<lexicon resource="file_name"/>
The parameter for lexicon is:
resource – The name of the lexicon file to use to look up the input value to ensure correct tokenization.
Example 3 Sample lexicon Rule
The following sample checks the input value against the list in the givenName.txt file. When a value is matched, the standardization engine continues to the postprocessing phase if one is defined.
<lexicon resource="givenName.txt"/>
This rule removes leading and trailing white space from a string and changes multiple spaces in the middle of a string to a single space. The syntax for normalizeSpace is:
<normalizeSpace/>
Example 4 Sample normalizeSpace Rule
The following sample removes the leading and trailing white space from a last name field prior to checking the input value against the surnames.txt file.
<matcher>
    <preProcessing>
        <normalizeSpace/>
    </preProcessing>
    <lexicon resource="surnames.txt"/>
</matcher>
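In Python's re module, normalizeSpace is equivalent to trimming the string and collapsing internal runs of whitespace to a single space:

```python
import re

# Equivalent of <normalizeSpace/>: trim ends, collapse internal whitespace.
def normalize_space(s):
    return re.sub(r"\s+", " ", s).strip()

print(normalize_space("  VAN    DER  BERG "))  # VAN DER BERG
```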
This rule checks the input value against a specific regular expression to see if the patterns match. You can define a sequence of patterns by including them all in order in a matchAllPatterns element. You can also specify sub-patterns to exclude. The syntax for pattern is:
<pattern regex="regex_pattern"/>
The parameter for pattern is:
regex – A regular expression to validate the input value against. See the Javadocs for java.util.regex for more information.
The pattern rule can be further customized by adding exceptFor rules that define patterns to exclude in the matching process. The syntax for exceptFor is:
<pattern regex="regex_pattern">
    <exceptFor regex="regex_pattern"/>
</pattern>
The parameter for exceptFor is:
regex – A regular expression to exclude from the pattern match. See the Javadocs for java.util.regex for more information.
Example 5 Sample pattern Rule
The following sample checks the input value against the sequence of patterns to see if the input value might be an area code. These rules specify a pattern that matches three digits contained in parentheses, such as (310).
<matchAllPatterns>
    <pattern regex="\("/>
    <pattern regex="[0-9]{3}"/>
    <pattern regex="\)"/>
</matchAllPatterns>
The following sample checks the input value to see if its pattern is a series of three letters excluding THE and AND.
<pattern regex="[A-Z]{3}">
    <exceptFor regex="THE"/>
    <exceptFor regex="AND"/>
</pattern>
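A sketch of the pattern-with-exclusions logic in Python. Whether the engine anchors the pattern to the whole token is an assumption here; re.fullmatch is used for illustration.

```python
import re

# Sketch of pattern matching with exceptFor exclusions: the value must match
# the main pattern and must not match any of the excluded patterns.
def matches(token, regex, except_for=()):
    if not re.fullmatch(regex, token):
        return False
    return not any(re.fullmatch(ex, token) for ex in except_for)

print(matches("LLC", r"[A-Z]{3}", ["THE", "AND"]))  # True
print(matches("THE", r"[A-Z]{3}", ["THE", "AND"]))  # False
```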
This rule checks the input value for a specific pattern. If the pattern is found, it is replaced by a new pattern. This rule only replaces the first instance it finds of the pattern. The syntax for replace is:
<replace regex="regex_pattern" replacement="regex_pattern"/>
The parameters for replace are:
regex – A regular expression that, if found in the input string, is converted to the replacement expression.
replacement – The regular expression that replaces the expression specified by the regex parameter.
Example 6 Sample replace Rule
The following sample tries to match the input value against “ST.”. If a match is found, the standardization engine replaces the value with “SAINT”.
<replace regex="ST\." replacement="SAINT"/>
This rule checks the input value for a specific pattern. If the pattern is found, all instances are replaced by a new pattern. The syntax for replaceAll is:
<replaceAll regex="regex_pattern" replacement="regex_pattern"/>
The parameters for replaceAll are:
regex – A regular expression that, if found in the input string, is converted to the replacement expression.
replacement – The regular expression that replaces the expression specified by the regex parameter.
Example 7 Sample replaceAll Rule
The following sample finds all periods in the input value and removes them.
<replaceAll regex="\." replacement=""/>
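The difference between replace and replaceAll maps directly onto the count parameter of Python's re.sub (Java's String.replaceFirst versus String.replaceAll make the same distinction):

```python
import re

# replace changes only the first occurrence of the pattern; replaceAll
# changes every occurrence.
s = "ST. CHARLES ST."
print(re.sub(r"ST\.", "SAINT", s, count=1))  # SAINT CHARLES ST.
print(re.sub(r"\.", "", s))                  # ST CHARLES ST
```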
This rule converts the specified characters in the input string to a new set of characters, typically converting from one alphabet to another by adding or removing diacritical marks. The syntax for transliterate is:
<transliterate from="existing_char" to="new_char"/>
The parameters for transliterate are:
from – The characters that exist in the input string that need to be transliterated.
to – The characters that will replace the above characters.
Example 8 Sample transliterate Rule
The following sample converts lower case vowels with acute accents to vowels with no accents.
<transliterate from="áéíóú" to="aeiou"/>
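In Python, the same one-for-one character mapping can be expressed with str.translate:

```python
# transliterate maps characters one-for-one from one set to another;
# str.maketrans with str.translate performs the same substitution.
table = str.maketrans("áéíóú", "aeiou")
print("José Gómez".translate(table))  # Jose Gomez
```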
This rule converts all characters in the input string to upper case. The rule does not take any parameters. The syntax for uppercase is:
<uppercase/>
Example 9 Sample uppercase Rule
The following sample converts the entire input string into uppercase prior to doing any pattern or value replacements. Since this is defined in the cleanser section, this is performed prior to tokenization.
<cleanser>
    <uppercase/>
    <replaceAll regex="\." replacement=". "/>
    <replaceAll regex="AND / OR" replacement="AND/OR"/>
    ...
</cleanser>