Oracle Java CAPS Master Index Standardization Engine Reference
Creating Custom FSM-Based Variants
Creating the Working Directory
To Create the Working Directory
To Define the Service Instance
Defining the State Model and Processing Rules
To Define the State Model and Processing Rules
Creating Normalization and Lexicon Files
To Create Normalization and Lexicon Files
Packaging and Importing the Variant
The flexible framework of the Master Index Standardization Engine allows you to define new FSM-based variants on existing FSM-based data types so you can standardize different categories of the same type of data. For example, you might need to standardize names from several different countries. Variants are easily incorporated into a master index project and can be made globally available to all projects. Perform the following steps to create a custom variant.
The working directory for custom variants requires a specific structure. At a minimum, the working directory will look similar to the following:
/WorkingDir
    serviceInstance.xml
    /resource
        standardizer.xml
The resource directory might also contain several normalization and lexicon files.
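As a rough illustration only, the minimal layout above could be scaffolded with a short Python script. The function name `create_working_dir` and the empty placeholder files are assumptions made for this sketch; they are not part of the Master Index tooling, and the files still need real content before the variant can be imported.

```python
from pathlib import Path


def create_working_dir(root):
    """Create the minimal variant layout: serviceInstance.xml and resource/standardizer.xml.

    Hypothetical helper for illustration; the files are created empty.
    """
    root = Path(root)
    resource = root / "resource"
    # parents=True also creates the root directory; exist_ok=True makes the call repeatable
    resource.mkdir(parents=True, exist_ok=True)
    created = []
    for path in (root / "serviceInstance.xml", resource / "standardizer.xml"):
        # touch() creates an empty placeholder and leaves existing content untouched
        path.touch(exist_ok=True)
        created.append(path)
    return created
```

Calling `create_working_dir("WorkingDir")` would produce the two files shown in the layout; any normalization and lexicon files would be added to the resource directory afterward.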
The serviceInstance.xml file for each variant defines the name of the variant, the data type it modifies, and additional Java class information.
Tip - You can copy a service instance file from an existing variant in the data type to which you will add the new variant, and then modify it for the new variant.
This example defines a new Spanish variant of the PersonName data type.
<serviceInstance type="PersonName">
    <description>Person Name Standardization: Spain</description>
    <parameter name="dataType" value="PersonName" />
    <parameter name="variantType" value="SP" />
    <componentManagerFactory class="com.sun.inti.components.component.BeanComponentManagerFactory">
        <property name="stylesheetURL" value="classpath:/com/sun/mdm/standardizer/impl/standardizer.xsl"/>
        <property name="urlSource">
            <bean class="com.sun.inti.components.url.ResourceURLSource">
                <property name="resourceName" value="standardizer.xml"/>
            </bean>
        </property>
    </componentManagerFactory>
</serviceInstance>
Note - The value you enter for the variantType parameter must match the name you want displayed for the variant in the Standardization folder of the master index project.
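Because a single malformed attribute makes the service instance file unreadable, it can help to check the file before packaging it. The following is a minimal sketch using Python's standard XML parser, not a Master Index utility. The function names are hypothetical, the sample is trimmed to the description and parameter elements, and treating a mismatch between the type attribute and the dataType parameter as a problem is an assumption based only on the example above.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the Spanish PersonName example (componentManagerFactory omitted)
SAMPLE = """<serviceInstance type="PersonName">
  <description>Person Name Standardization: Spain</description>
  <parameter name="dataType" value="PersonName" />
  <parameter name="variantType" value="SP" />
</serviceInstance>"""


def read_parameters(xml_text):
    """Return the <parameter> name/value pairs of a service instance as a dict."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if the XML is not well formed
    return {p.get("name"): p.get("value") for p in root.findall("parameter")}


def check_service_instance(xml_text):
    """Return a list of problems; an empty list means the basic checks passed."""
    root = ET.fromstring(xml_text)
    params = read_parameters(xml_text)
    problems = []
    for required in ("dataType", "variantType"):
        if not params.get(required):
            problems.append("missing parameter: " + required)
    # Assumption: the type attribute and the dataType parameter agree, as in the sample
    if params.get("dataType") and root.get("type") != params["dataType"]:
        problems.append("type attribute differs from dataType parameter")
    return problems
```

For the sample, `read_parameters(SAMPLE)` yields the dataType and variantType values, and `check_service_instance(SAMPLE)` returns an empty list. An unterminated attribute value would surface earlier as an `ET.ParseError`.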
The state model defines how the data is read, tokenized, parsed, and modified during standardization. The state model and processing rules are all defined in the standardizer.xml file.
Before you begin this step, determine the different forms in which the data to be standardized can be presented and how it should be standardized for each form. For example, name data might be in the form “First Name, Last Name, Middle Initial” or in the form “First Name, Middle Name, Last Name”. You need to account for each possibility. Determine each state in the process, and the input and output symbols used by each state. It might be useful to create a finite state machine model, as shown below. The model shows each state, the transitions to and from each state, and the output symbol for each state.
Figure 2 Sample Finite State Machine Model
For more information about the FSM model, see FSM Framework Configuration Overview.
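Before writing standardizer.xml, it can be useful to prototype the state model in a few lines of code to confirm that every expected input form reaches a final state. The sketch below is a simplified stand-in, not the engine's implementation: the state names, input symbols, output symbols, and the tiny given-name set are all hypothetical, and the real framework's tokenizing and rule handling are defined in standardizer.xml rather than in Python.

```python
# Hypothetical state model: (current state, input symbol) -> (next state, output symbol)
TRANSITIONS = {
    ("start", "givenName"): ("firstName", "FirstName"),
    ("firstName", "givenName"): ("middleName", "MiddleName"),
    ("firstName", "initial"): ("middleName", "MiddleName"),
    ("firstName", "surname"): ("lastName", "LastName"),
    ("middleName", "surname"): ("lastName", "LastName"),
}

GIVEN_NAMES = {"MARIA", "JOSE", "ANA"}  # illustrative lexicon content only


def input_symbol(token):
    """Classify a token into an input symbol (simplified stand-in for a lexicon lookup)."""
    if len(token) == 1:
        return "initial"
    if token in GIVEN_NAMES:
        return "givenName"
    return "surname"


def standardize(text):
    """Run the tokens through the state model and map each output symbol to its token."""
    state = "start"
    fields = {}
    # Uppercase, drop periods, and split on whitespace as a minimal tokenizer
    for token in text.upper().replace(".", "").split():
        key = (state, input_symbol(token))
        if key not in TRANSITIONS:
            raise ValueError("no transition from state %r for token %r" % (state, token))
        state, output = TRANSITIONS[key]
        fields[output] = token
    return fields
```

With these assumed transitions, `standardize("Maria J. Garcia")` assigns MARIA, J, and GARCIA to the first, middle, and last name output symbols, while an input form the model does not cover raises an error, which points to a missing state or transition.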
Tip - You can copy the file from an existing variant in the data type to which you are adding the custom variant. Then you can modify the file for the new variant.
For more information, see Data Normalization Definitions and Standardization Processing Rules Reference.
For information about the state model and the elements that define it, see Standardization State Definitions.
Note - The next several steps use the processing rules described in Standardization Processing Rules Reference. Some of these rules might require that you create normalization and lexicon files.
For more information, see Input Symbol Definitions.
For more information, see Output Symbol Definitions.
For more information, see Data Cleansing Definitions.
Lexicon files list the possible values for a field so the standardization engine can quickly and accurately recognize different field components. Normalization files list the nonstandard values that might be found in a field along with the standard version so the standardization engine can present a common form for the data. You need to create a file for each lexicon or normalization file you referenced from standardizer.xml.
For more information about normalization and lexicon files, see Lexicon Files and Normalization Files.
COR|COURT
CRT|COURT
CR.|COURT
CT|COURT
CT.|COURT
DR|DRIVE
DR.|DRIVE
DRV|DRIVE
...
E
EAST
ET
N
NO
NORTH
NTH
S
SO
SOUTH
...
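To make the difference between the two file types concrete, here is a minimal sketch of how such files could be read and applied. It assumes one entry per line, with normalization entries written as nonstandard|standard as in the sample above. The function names and the uppercase comparison are assumptions of this sketch, not a description of how the standardization engine loads its files.

```python
NORMALIZATION_TEXT = """COR|COURT
CRT|COURT
CT|COURT
DR|DRIVE
DR.|DRIVE
DRV|DRIVE"""

LEXICON_TEXT = """E
EAST
N
NORTH
S
SOUTH"""


def load_normalization(text):
    """Parse 'nonstandard|standard' lines into a lookup table."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        nonstandard, standard = line.split("|", 1)
        table[nonstandard.strip()] = standard.strip()
    return table


def load_lexicon(text):
    """Parse a lexicon file into the set of values it recognizes."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def normalize(token, table):
    """Return the standard form of a token, or the uppercased token if none is listed."""
    token = token.upper()  # assumption: entries are compared in uppercase
    return table.get(token, token)
```

A normalization table answers "what is the standard form of this value?", so `normalize("dr.", table)` gives DRIVE, whereas a lexicon only answers "is this a known value?" through a membership test such as `"EAST" in lexicon`.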
Once you have created all the files for the variant, you need to package them into a ZIP file to be imported into a master index application.
The ZIP file structure should be similar to the following. Note that this variant includes several normalization and lexicon files. Your variant might not contain any.
Figure 3 Custom Variant Zip File
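Any ZIP tool can be used for this step. As one possibility, the sketch below packages a working directory with Python's standard zipfile module. It assumes that archive entries are stored relative to the working directory, so that serviceInstance.xml sits at the top level and the resource files sit under resource/. Confirm the expected layout against the figure before importing. The function name `package_variant` is hypothetical.

```python
import zipfile
from pathlib import Path


def package_variant(working_dir, zip_path):
    """Zip every file under working_dir, storing paths relative to working_dir."""
    working_dir = Path(working_dir)
    zip_path = Path(zip_path)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(working_dir.rglob("*")):
            # Skip directories, and skip the archive itself if it lives in working_dir
            if not path.is_file() or path.resolve() == zip_path.resolve():
                continue
            arcname = path.relative_to(working_dir).as_posix()
            archive.write(path, arcname)
            names.append(arcname)
    return names
```

The returned list of entry names can be compared with the expected structure before the ZIP file is imported into the master index application.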
Each data type variant is configured by a service definition file. Service type files define the fields to be standardized for a data type, and service instance definition files define the variant and Java factory class for the variant. Both files are in XML format.