Oracle Java CAPS Master Index Standardization Engine Reference     Java CAPS Documentation

Creating Custom FSM-Based Variants

The flexible framework of the Master Index Standardization Engine allows you to define new FSM-based variants on existing FSM-based data types so you can standardize different categories of the same type of data. For example, you might need to standardize names from several different countries. Variants are easily incorporated into a master index project and can be made globally available to all projects. Perform the following steps to create a custom variant.

Creating the Working Directory

The working directory for custom variants requires a specific structure. At a minimum, the working directory will look similar to the following:

/WorkingDirectory
   serviceInstance.xml
   /resource
      standardizer.xml

The resource directory might also contain several normalization and lexicon files.

To Create the Working Directory

  1. Create a working directory for the new variant.
  2. In the new working directory, create a resource directory.
  3. Continue to Defining the Service Instance.
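
The two directory steps above can also be scripted. The following is a minimal sketch, not part of the product tooling. The function name create_variant_skeleton and the temporary location are assumptions made for illustration, and WorkingDirectory is only a placeholder name.

```python
import tempfile
from pathlib import Path


def create_variant_skeleton(working_dir: Path) -> Path:
    """Create working_dir and its resource subdirectory; return the resource path."""
    resource_dir = working_dir / "resource"
    # parents=True creates the working directory itself if it does not exist yet.
    resource_dir.mkdir(parents=True, exist_ok=True)
    return resource_dir


# Demonstration in a temporary location (hypothetical path, for illustration only).
base = Path(tempfile.mkdtemp())
working_dir = base / "WorkingDirectory"
resource_dir = create_variant_skeleton(working_dir)
print(resource_dir)
```

The serviceInstance.xml and standardizer.xml files are added to this skeleton in the sections that follow.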

Defining the Service Instance

The serviceInstance.xml file for each variant defines the name of the variant, the data type it modifies, and additional Java class information.

To Define the Service Instance

  1. Create a file named serviceInstance.xml at the top level of your working directory.

    Tip - You can copy a service instance file from an existing variant in the data type to which you will add the new variant, and then modify it for the new variant.


  2. Define values for the elements and attributes described in Service Instance Definition File.

    This example defines a new Spanish variant for the PersonName data type.

    <serviceInstance type="PersonName">
      <description>Person Name Standardization: Spain</description>
      <parameter name="dataType" value="PersonName" />
      <parameter name="variantType" value="SP" />
      <componentManagerFactory 
           class="com.sun.inti.components.component.BeanComponentManagerFactory">
        <property name="stylesheetURL" 
           value="classpath:/com/sun/mdm/standardizer/impl/standardizer.xsl"/>
        <property name="urlSource" >
          <bean class="com.sun.inti.components.url.ResourceURLSource">
            <property name="resourceName" value="standardizer.xml" />
          </bean>
        </property>
      </componentManagerFactory>
    </serviceInstance>

    Note - The value you enter for the variantType parameter is the name under which the variant appears in the Standardization folder of the master index project, so enter the name you want displayed there.


  3. Save and close the file.
  4. Continue to Defining the State Model and Processing Rules.
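
Before moving on, it can help to confirm that serviceInstance.xml is well-formed XML and that the two parameters hold the intended values, since a small slip such as an unclosed attribute quote makes the file unparseable. The following is a minimal sketch that uses only the Python standard library. The helper name read_parameters is an assumption made for illustration, and the check is not something the Master Index tooling requires.

```python
import xml.etree.ElementTree as ET

# The Spanish PersonName example, embedded as a string so the sketch is self-contained.
SERVICE_INSTANCE_XML = """<serviceInstance type="PersonName">
  <description>Person Name Standardization: Spain</description>
  <parameter name="dataType" value="PersonName" />
  <parameter name="variantType" value="SP" />
  <componentManagerFactory
       class="com.sun.inti.components.component.BeanComponentManagerFactory">
    <property name="stylesheetURL"
       value="classpath:/com/sun/mdm/standardizer/impl/standardizer.xsl"/>
    <property name="urlSource">
      <bean class="com.sun.inti.components.url.ResourceURLSource">
        <property name="resourceName" value="standardizer.xml" />
      </bean>
    </property>
  </componentManagerFactory>
</serviceInstance>"""


def read_parameters(xml_text):
    """Parse the XML and return (type attribute, {parameter name: value})."""
    # ET.fromstring raises xml.etree.ElementTree.ParseError if the text is not well-formed.
    root = ET.fromstring(xml_text)
    params = {p.get("name"): p.get("value") for p in root.findall("parameter")}
    return root.get("type"), params


data_type, params = read_parameters(SERVICE_INSTANCE_XML)
print(data_type, params["dataType"], params["variantType"])
```

In this example the type attribute and the dataType parameter both name the PersonName data type, and variantType carries the variant name.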

Defining the State Model and Processing Rules

The state model defines how the data is read, tokenized, parsed, and modified during standardization. The state model and processing rules are all defined in the standardizer.xml file.

Before you begin this step, determine the different forms in which the data to be standardized can be presented and how it should be standardized for each form. For example, name data might be in the form “First Name, Last Name, Middle Initial” or in the form “First Name, Middle Name, Last Name”. You need to account for each possibility. Determine each state in the process, and the input and output symbols used by each state. It might be useful to create a finite state machine model, as shown below. The model shows each state, the transitions to and from each state, and the output symbol for each state.

Figure 2 Sample Finite State Machine Model (image not shown: a sample FSM model for phone numbers)

For more information about the FSM model, see FSM Framework Configuration Overview.
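
As a planning aid, the states, input symbols, and output symbols can be prototyped before they are written into standardizer.xml. The following is a minimal sketch of a finite state machine for two simple name forms. It does not reproduce the engine's implementation or the standardizer.xml syntax, and every state name, symbol name, and function in it is hypothetical.

```python
# Hypothetical transition table: (current state, input symbol) -> (next state, output symbol).
TRANSITIONS = {
    ("start", "word"): ("firstName", "FirstName"),
    ("firstName", "initial"): ("middle", "MiddleInitial"),
    ("firstName", "word"): ("lastName", "LastName"),
    ("middle", "word"): ("lastName", "LastName"),
}
FINAL_STATES = {"lastName"}


def classify(token):
    """Map a token to an input symbol: a single letter (optionally with a period) is an initial."""
    return "initial" if len(token.rstrip(".")) == 1 else "word"


def parse_name(text):
    """Walk the tokens through the state machine and collect one field per output symbol."""
    state = "start"
    fields = {}
    for token in text.split():
        key = (state, classify(token))
        if key not in TRANSITIONS:
            raise ValueError(f"no transition from state {state!r} for token {token!r}")
        state, output_symbol = TRANSITIONS[key]
        fields[output_symbol] = token
    if state not in FINAL_STATES:
        raise ValueError(f"input ended in non-final state {state!r}")
    return fields


print(parse_name("MARIA J LOPEZ"))
print(parse_name("MARIA LOPEZ"))
```

Each additional form the data can take adds transitions to the table, which is why it helps to enumerate the possible forms first.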

To Define the State Model and Processing Rules

  1. In /WorkingDirectory/resource, create a new XML file named standardizer.xml.

    Tip - You can copy the file from an existing variant in the data type to which you are adding the custom variant. Then you can modify the file for the new variant.


  2. If the data you are processing does not need to be parsed, but only needs to be normalized, define normalization rules in the normalizer section of the file.

    For more information, see Data Normalization Definitions and Standardization Processing Rules Reference.

  3. If the data you are processing needs to be parsed and normalized, define the state model in the upper portion of the file.

    For information about the state model and the elements that define it, see Standardization State Definitions.


    Note - The next several steps use the processing rules described in Standardization Processing Rules Reference. Some of these rules might require that you create normalization and lexicon files.


  4. In the inputSymbols section of the file, define each input symbol along with any processing rules.

    For more information, see Input Symbol Definitions.

  5. In the outputSymbols section of the file, define each output symbol along with any processing rules.

    For more information, see Output Symbol Definitions.

  6. In the cleanser section of the file, define any cleansing rules that should be performed against the data prior to tokenization.

    For more information, see Data Cleansing Definitions.

  7. If you created any rules that reference normalization or lexicon files, continue to Creating Normalization and Lexicon Files.

Creating Normalization and Lexicon Files

Lexicon files list the possible values for a field so the standardization engine can quickly and accurately recognize different field components. Normalization files list the nonstandard values that might be found in a field along with the standard version so the standardization engine can present a common form for the data. You need to create a file for each lexicon or normalization file you referenced from standardizer.xml.

For more information about normalization and lexicon files, see Lexicon Files and Normalization Files.

To Create Normalization and Lexicon Files

  1. For each normalization file you referenced in standardizer.xml, do the following:
    1. Create a text file in /WorkingDirectory/resource.
    2. Save the file under the name you used to reference it from standardizer.xml.
    3. In the file, enter a list of nonstandard values along with their standardized values, separating the nonstandard value from the standard value with a pipe (|) as shown below.
      COR|COURT
      CRT|COURT
      CR.|COURT
      CT|COURT
      CT.|COURT
      DR|DRIVE
      DR.|DRIVE
      DRV|DRIVE
      ...
    4. When you are finished, save and close the file.
  2. For each lexicon file you referenced in standardizer.xml, do the following:
    1. Create a text file in /WorkingDirectory/resource.
    2. Save the file under the name you used to reference it from standardizer.xml.
    3. In the file, enter a list of all possible values for the field as shown below.
      E
      EAST
      ET
      N
      NO
      NORTH
      NTH
      S
      SO
      SOUTH
      ...
    4. When you are finished, save and close the file.
  3. Continue to Packaging and Importing the Variant.
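
The two file formats described above can be sanity-checked with a few lines of code before packaging. The following is a minimal sketch that parses the sample entries shown in the steps. How the engine itself loads and matches these files is not described here, so the function names and the exact-match lookup are assumptions made for illustration.

```python
# Sample entries from the steps above, embedded so the sketch is self-contained.
NORMALIZATION_TEXT = """COR|COURT
CRT|COURT
CR.|COURT
CT|COURT
CT.|COURT
DR|DRIVE
DR.|DRIVE
DRV|DRIVE
"""

LEXICON_TEXT = """E
EAST
ET
N
NO
NORTH
NTH
S
SO
SOUTH
"""


def parse_normalization(text):
    """Return {nonstandard value: standard value} from pipe-delimited lines."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        nonstandard, standard = line.split("|", 1)
        table[nonstandard] = standard
    return table


def parse_lexicon(text):
    """Return the set of values listed one per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


normalization = parse_normalization(NORMALIZATION_TEXT)
lexicon = parse_lexicon(LEXICON_TEXT)


def normalize(token):
    """Replace a nonstandard value with its standard form; leave unknown values unchanged."""
    return normalization.get(token, token)


print(normalize("CT."), normalize("LANE"), "NORTH" in lexicon)
```

A line without a pipe character makes parse_normalization fail, which is a quick way to spot a malformed normalization entry.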

Packaging and Importing the Variant

Once you have created all the files for the variant, you need to package them into a ZIP file to be imported into a master index application.

To Package and Import the Variant

  1. In the working directory, select the folder and file at the top level and add them to a ZIP file.
  2. Give the ZIP file the same name as the variant, which is the value you entered for the variantType parameter in Defining the Service Instance.

    The ZIP file structure should be similar to the following. Note that this variant includes several normalization and lexicon files. Your variant might not contain any.


    Figure 3 Custom Variant ZIP File (image not shown: a sample ZIP file for a custom variant package)
  3. Import the file into a master index application as described in Importing Standardization Data Types and Variants in Oracle Java CAPS Master Index Configuration Guide.
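
The packaging steps above can also be scripted. The following is a minimal sketch that uses the Python standard library. It places the contents of the working directory at the top level of the archive and names the archive after the variant. The function name package_variant, the temporary paths, and the placeholder file contents are assumptions made for illustration. Importing the resulting file is still done as described in step 3.

```python
import tempfile
import zipfile
from pathlib import Path


def package_variant(working_dir: Path, variant_type: str, out_dir: Path) -> Path:
    """Zip everything under working_dir into out_dir/<variant_type>.zip."""
    zip_path = out_dir / f"{variant_type}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(working_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the working directory so that
                # serviceInstance.xml and resource/ sit at the top level of the archive.
                archive.write(path, path.relative_to(working_dir).as_posix())
    return zip_path


# Demonstration with placeholder files (hypothetical contents, for illustration only).
base = Path(tempfile.mkdtemp())
working_dir = base / "WorkingDirectory"
(working_dir / "resource").mkdir(parents=True)
(working_dir / "serviceInstance.xml").write_text("<serviceInstance/>")
(working_dir / "resource" / "standardizer.xml").write_text("<standardizer/>")

zip_path = package_variant(working_dir, "SP", base)
with zipfile.ZipFile(zip_path) as archive:
    names = sorted(archive.namelist())
print(zip_path.name, names)
```

The archive is written outside the working directory so that it does not end up inside itself.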

Service Instance Definition File

Each data type variant is configured by a service instance definition file. The service type definition file defines the fields to be standardized for a data type, and the service instance definition file defines the variant and the Java factory class for the variant. Both files are in XML format.

Element                  Attribute  Description
-----------------------  ---------  -----------
serviceInstance                     A container element for the description and any parameters for the variant.
                         type       The name of the data type to which the variant belongs.
description                         A brief description of the variant, such as “Person Names: Spain”.
parameter                           One parameter for the variant. The default variants contain two parameters, dataType and variantType. The dataType parameter specifies the name of the data type to which the variant belongs. The variantType parameter specifies the name of the variant. For a master index application, these are the names of the nodes that appear under the Standardization Engine node.
                         name       The name of the parameter.
                         value      The value of the parameter.
componentManagerFactory             The component manager factory class for the variant.
                         class      The name of the component manager factory class. The default class is com.sun.inti.components.component.BeanComponentManagerFactory.
property                            A property of the component manager factory class. The default class has two properties. The stylesheetURL property defines the location of the stylesheet, standardizer.xsl. The urlSource property defines the process definition file. Its value is a bean (by default, com.sun.inti.components.url.ResourceURLSource), which has a property called resourceName. The value for this property is standardizer.xml.
                         name       The name of the property.
                         value      The value for the property.