5 Custom FSM-Based Data Types and Variants

This chapter provides conceptual information and procedures for creating custom FSM-based data types and variants.

This chapter includes the following sections:

"Learning About Custom FSM-Based Data Types and Variants"
"Learning About the Standardization Packages"
"Creating Custom FSM-Based Data Types"

"Creating Custom FSM-Based Variants"

Learning About Custom FSM-Based Data Types and Variants

The finite state machine framework of the OHMPI Standardization Engine is very flexible, allowing you to define new data types and variants so you can standardize any type of data. This process requires no Java coding; all processing rules and logic are defined in XML files using predefined rules. The new data types and variants can be imported into NetBeans for use in master person index projects. The following sections provide information and instructions for creating custom data types and variants.

Creating a custom FSM data type or variant for the OHMPI Standardization Engine requires defining the processing logic for the data type in an XML file. No Java coding is required to incorporate the custom data types and variants into a master person index application. The processing logic is defined in the files described in "Finite State Machine Framework Configuration".

You define the following information for each data type or variant you create.

The state model that defines each state, its input and output symbols, and transitions
Any preprocessing, matching, or postprocessing logic for input and output symbols
Any cleansing rules to be applied to the data prior to parsing
Optionally, lists of non-standard values and the standard values to which they should be converted (such as a nickname table)
Optionally, lists of possible values for a field component that helps the standardization engine identify and parse the component

After you create the package, you can import the custom data type or variant into NetBeans using the easy import function of Oracle Healthcare Master Person Index. You can then define standardization and normalization structures for the master person index using the new data type or variant.

Learning About the Standardization Packages

After you create a custom data type or variant, you need to package the files in a ZIP file so they are available for import into NetBeans. Create a single package for each data type or variant.

For a custom data type, the ZIP file includes the following:

A service type definition file
One or more service instance definition files (depending on how many variants you include)
One or more process definition files (standardizer.xml)
Normalization files (optional)
Lexicon files (optional)

For a custom variant, the ZIP file includes the following:

One service instance definition file
One process definition file (standardizer.xml)
Normalization files (optional)
Lexicon files (optional)

Creating Custom FSM-Based Data Types

You can define new data types and their corresponding variants using the flexible FSM framework of the standardization engine. Data types are easily incorporated into a master person index project and can be made globally available to all projects. Perform the following steps to define a custom data type for the standardization engine.

"Creating the Working Directory"
"Defining the Service Type"
"Defining the Variants"
"Packaging and Importing the Data Type"
"Service Type Definition File"

Creating the Working Directory

The working directory for custom data types requires a specific structure. At a minimum, the working directory will look similar to the following:

/WorkingDir
   serviceType.xml
   /lib
   /instance
      /Generic
         serviceInstance.xml
         /resource
            standardizer.xml

If the data type has several variants, the directory structure does not include the Generic folder. Instead, it contains one folder for each variant, named after that variant. Each variant folder must have the same structure as the Generic folder shown above. The resource directory might also contain several normalization and lexicon files.

To Create the Working Directory

Create a working directory and add a lib and an instance directory at the top level.
Copy the files standardizer-api.jar and standardizer-impl.jar from /NetBeans_Home/soa2/modules/ext/mdm/standardizer/lib to the lib directory.
Do one of the following:
- If the data type only has one variant, create the following directory structure in the instance directory:
  
  /Generic/resource/
- If the data type has several variants, create the following directory structure in the instance directory for each variant:
  
  /VariantName/resource/
Continue to "Defining the Service Type".
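
The directory layout described above can also be created with a short script. The following is a minimal sketch, not part of the product: the function name create_data_type_skeleton and its arguments are hypothetical, and the two JAR files must still be copied into the lib directory as described in the procedure.

```python
from pathlib import Path


def create_data_type_skeleton(working_dir, variant_names=("Generic",)):
    """Create lib and instance/<variant>/resource under working_dir.

    Use the default ("Generic",) for a data type with one variant, or pass
    one name per variant for a data type with several variants.
    """
    root = Path(working_dir)
    # lib will hold standardizer-api.jar and standardizer-impl.jar,
    # which are copied in separately.
    (root / "lib").mkdir(parents=True, exist_ok=True)
    created = []
    for name in variant_names:
        resource_dir = root / "instance" / name / "resource"
        resource_dir.mkdir(parents=True, exist_ok=True)
        created.append(resource_dir)
    return created
```

For example, create_data_type_skeleton("WorkingDir", ["US", "UK"]) creates WorkingDir/lib, WorkingDir/instance/US/resource, and WorkingDir/instance/UK/resource.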

Defining the Service Type

The serviceType.xml file defines information about the data type, and is a required file for each data type.

To Define the Service Type

Create a file named serviceType.xml in your working directory.

Note:
You can copy the service type file from an existing data type and modify it for your use.

Enter text similar to the following, where description is the name of the data type and the value elements list the tokens, or standardization components, of the data type.

<serviceType configurationResource="standardizer.xml">
  <description>My Data Type Standardization</description>
  <parameter name="fields">
    <list>
      <value>Data Field1</value>
      <value>Data Field2</value>
      ...
    </list>
  </parameter>
</serviceType>

Note:

For more information about the elements in this file, see "Service Type Definition File".

Save and close the file.
Continue to "Defining the Variants".
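
Before you continue, it can be useful to confirm that the file is well-formed XML and lists the fields you expect. The following sketch uses only the Python standard library and is independent of the Standardization Engine. The function name read_service_type and the sample content are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Sample content modeled on the serviceType.xml example in this section.
SAMPLE = """<serviceType configurationResource="standardizer.xml">
  <description>My Data Type Standardization</description>
  <parameter name="fields">
    <list>
      <value>Data Field1</value>
      <value>Data Field2</value>
    </list>
  </parameter>
</serviceType>"""


def read_service_type(xml_text):
    """Return the configuration resource, description, and field names."""
    # ET.fromstring raises ET.ParseError if the text is not well-formed XML.
    root = ET.fromstring(xml_text)
    values = root.findall("./parameter[@name='fields']/list/value")
    return {
        "configurationResource": root.get("configurationResource"),
        "description": root.findtext("description"),
        "fields": [(v.text or "").strip() for v in values],
    }


info = read_service_type(SAMPLE)
print(info["fields"])
```

This check covers only the structure shown in the example above. It does not verify that the field names match the output symbols defined in standardizer.xml.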

Defining the Variants

For each data type you create, you need to create one or more variants that define the logic for processing a specific type of data.

To Define the Variants

Perform the following steps for each variant that will be used for the data type you are creating.

Define the service instance, as described in "Defining the Service Instance".

Create the serviceInstance.xml file in /WorkingDir/instance/VariantName.
Define the state model and processing logic, as described in "Defining the State Model and Processing Rules".

Create the standardizer.xml file in /WorkingDir/instance/VariantName/resource.
If needed, create normalization and lexicon files, as described in "Creating Normalization and Lexicon Files".

Create the files in /WorkingDir/instance/VariantName/resource.
Continue to "Packaging and Importing the Data Type".

Packaging and Importing the Data Type

Once you have created all the files for the data type, you need to package them into a ZIP file to be imported into a master person index application.

To Package and Import the Data Type

In the working directory, select the folders and files at the top level and add them to a ZIP file.
Name the ZIP file the same name as the data type.

The ZIP file should contain serviceType.xml, the lib directory, and the instance directory at its top level, mirroring the working directory.
Import the file into a master person index application as described in “Importing Standardization Data Types and Variants” in Oracle Healthcare Master Person Index Configuration Guide (Part Number E18473-01).
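
If you prefer to script the packaging step, the following sketch shows one way to build the ZIP file so that the contents of the working directory, rather than the working directory itself, sit at the top level of the archive. It uses only the Python standard library. The function name package_working_dir is hypothetical, and the sketch assumes that empty directories do not need to be preserved in the archive.

```python
import zipfile
from pathlib import Path


def package_working_dir(working_dir, zip_path):
    """Zip every file under working_dir using paths relative to it."""
    root = Path(working_dir)
    target = Path(zip_path).resolve()
    # Collect the files first so the archive never tries to include itself.
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.resolve() != target
    )
    names = []
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            name = path.relative_to(root).as_posix()
            archive.write(path, name)
            names.append(name)
    return names
```

The returned list of entry names can be compared against the expected structure before you import the file.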

Service Type Definition File

Each data type is configured by a service type definition file, serviceType.xml. Service type files define the fields to be standardized for a data type. The following table lists and describes the elements in the service type file.

Element	Attribute	Description
serviceType		A description and any parameters for the data type.
	configurationResource	The name of the standardization process file that defines the states and processing for the data type.
description		A brief description of the data type, such as “Address Standardization”.
parameter		A parameter for the configuration resource. By default, “fields” is the name of the parameter, and it is populated with a list of standardized field component names.
	name	The name of the parameter.
value		One or more values for the parameter.

Creating Custom FSM-Based Variants

The flexible framework of the OHMPI Standardization Engine allows you to define new FSM-based variants for existing FSM-based data types so you can standardize different categories of the same type of data. For example, you might need to standardize names from several different countries. Variants are easily incorporated into a master person index project and can be made globally available to all projects. Perform the following steps to create a custom variant.

"Creating the Working Directory"
"Defining the Service Instance"
"Defining the State Model and Processing Rules"
"Creating Normalization and Lexicon Files"
"Packaging and Importing the Variant"
"Service Instance Definition File"

Creating the Working Directory

The working directory for custom variants requires a specific structure. At a minimum, the working directory will look similar to the following:

/WorkingDir
   serviceInstance.xml
   /resource
      standardizer.xml

The resource directory might also contain several normalization and lexicon files.

To Create the Working Directory

Create a working directory for the new variant.
In the new working directory, create a resource directory.
Continue to "Defining the Service Instance".

Defining the Service Instance

The serviceInstance.xml file for each variant defines the name of the variant, the data type it modifies, and additional Java class information.

To Define the Service Instance

Create a file named serviceInstance.xml at the top level of your working directory.

Tip:
You can copy a service instance file from an existing variant in the data type to which you will add the new variant, and then modify it for the new variant.

Define values for the elements and attributes described in "Service Instance Definition File".

This example defines a new Spanish variant to the PersonName data type.

<serviceInstance type="PersonName">
  <description>Person Name Standardization: Spain</description>
  <parameter name="dataType" value="PersonName" />
  <parameter name="variantType" value="SP" />
  <componentManagerFactory 
       class="com.sun.inti.components.component.BeanComponentManagerFactory">
    <property name="stylesheetURL" 
       value="classpath:/com/sun/mdm/standardizer/impl/standardizer.xsl"/>
    <property name="urlSource" >
      <bean class="com.sun.inti.components.url.ResourceURLSource">
        <property name="resourceName" value="standardizer.xml" />
      </bean>
    </property>
  </componentManagerFactory>
</serviceInstance>

Note:

The value you enter for the variantType parameter must match the name you want the variant to display in the Standardization folder of the master person index project.

Save and close the file.
Continue to "Defining the State Model and Processing Rules".
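
A missing quotation mark or an unclosed element in serviceInstance.xml is easy to overlook. The following sketch, which uses only the Python standard library, parses the file and reports the values that identify the variant. The function name summarize_service_instance is hypothetical, and the sample is a shortened version of the example above.

```python
import xml.etree.ElementTree as ET

# Shortened sample modeled on the Spanish PersonName variant example.
SAMPLE = """<serviceInstance type="PersonName">
  <description>Person Name Standardization: Spain</description>
  <parameter name="dataType" value="PersonName" />
  <parameter name="variantType" value="SP" />
  <componentManagerFactory
       class="com.sun.inti.components.component.BeanComponentManagerFactory">
    <property name="urlSource">
      <bean class="com.sun.inti.components.url.ResourceURLSource">
        <property name="resourceName" value="standardizer.xml" />
      </bean>
    </property>
  </componentManagerFactory>
</serviceInstance>"""


def summarize_service_instance(xml_text):
    """Return the identifying values from a service instance definition."""
    # ET.fromstring raises ET.ParseError if the text is not well-formed XML.
    root = ET.fromstring(xml_text)
    params = {p.get("name"): p.get("value") for p in root.findall("parameter")}
    resource = root.find(".//property[@name='resourceName']")
    return {
        "type": root.get("type"),
        "dataType": params.get("dataType"),
        "variantType": params.get("variantType"),
        "resourceName": resource.get("value") if resource is not None else None,
    }


summary = summarize_service_instance(SAMPLE)
print(summary)
```

In the example, the type attribute and the dataType parameter both name the PersonName data type, so comparing summary["type"] with summary["dataType"] is a reasonable consistency check.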

Defining the State Model and Processing Rules

The state model defines how the data is read, tokenized, parsed, and modified during standardization. The state model and processing rules are all defined in the standardizer.xml file.

Before you begin this step, determine the different forms in which the data to be standardized can be presented and how it should be standardized for each form. For example, name data might be in the form “First Name, Last Name, Middle Initial” or in the form “First Name, Middle Name, Last Name” and you need to account for each possibility. Determine each state in the process, and the input and output symbols used by each state. It might be useful to create a finite state machine model, as shown below. The model shows each state, the transitions to and from each state, and the output symbol for each state.

For more information about the FSM model, see "Learning About Custom FSM-Based Data Types and Variants".
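
To make the terms state, input symbol, transition, and output symbol concrete, the following sketch walks a name through a very small finite state machine. It is a conceptual illustration only: it does not use the standardizer.xml format or the Standardization Engine, and the state names, symbol names, and the parse_name function are all hypothetical. It handles only the form "First Name, optional Middle Initial, Last Name".

```python
# (current state, input symbol) -> (next state, output symbol)
TRANSITIONS = {
    ("start", "word"): ("first", "FirstName"),
    ("first", "initial"): ("middle", "MiddleInitial"),
    ("first", "word"): ("last", "LastName"),
    ("middle", "word"): ("last", "LastName"),
}


def classify(token):
    """Map a token to an input symbol: a single letter is an initial."""
    return "initial" if len(token.rstrip(".")) == 1 else "word"


def parse_name(text):
    """Tokenize text and follow the transitions, collecting output symbols."""
    state = "start"
    result = {}
    for token in text.split():
        symbol = classify(token)
        if (state, symbol) not in TRANSITIONS:
            raise ValueError(f"no transition from {state!r} on {symbol!r}")
        state, output = TRANSITIONS[(state, symbol)]
        result[output] = token
    return result


print(parse_name("Maria J. Lopez"))
```

Each additional form of the data that you identify adds states or transitions to a table like this one, which is why it helps to enumerate the forms before you edit standardizer.xml.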

To Define the State Model and Processing Rules

In /WorkingDirectory/resource, create a new XML file named standardizer.xml.

Tip:
You can copy the file from an existing variant in the data type to which you are adding the custom variant. Then you can modify the file for the new variant.
If the data you are processing does not need to be parsed, but only needs to be normalized, define normalization rules in the normalizer section of the file.

For more information, see "Data Normalization Definitions" and "Standardization Processing Rules Reference".
If the data you are processing needs to be parsed and normalized, define the state model in the upper portion of the file.

For information about the state model and the elements that define it, see "Standardization State Definitions".

Note:
The next several steps use the processing rules described in "Standardization Processing Rules Reference". Some of these rules might require that you create normalization and lexicon files.

In the inputSymbols section of the file, define each input symbol along with any processing rules.

For more information, see "Input Symbol Definitions".
In the outputSymbols section of the file, define each output symbol along with any processing rules.

For more information, see "Output Symbol Definitions".
In the cleanser section of the file, define any cleansing rules that should be performed against the data prior to tokenization.

For more information, see "Data Cleansing Definitions".
If you created any rules that reference normalization or lexicon files, continue to "Creating Normalization and Lexicon Files".

Creating Normalization and Lexicon Files

Lexicon files list the possible values for a field so the standardization engine can quickly and accurately recognize different field components. Normalization files list the nonstandard values that might be found in a field along with the standard version so the standardization engine can present a common form for the data. You need to create a file for each lexicon or normalization file you referenced from standardizer.xml.

For more information about normalization and lexicon files, see "Lexicon Files" and "Normalization Files".

To Create Normalization and Lexicon Files

For each normalization file you referenced in standardizer.xml, do the following:
1. Create a text file in /WorkingDirectory/resource.
2. Save the file under the name you used to reference it from standardizer.xml.
3. In the file, enter a list of nonstandard values along with their standardized values, separating the nonstandard value from the standard value with a pipe (|) as shown below.
```
COR|COURT
CRT|COURT
CR.|COURT
CT|COURT
CT.|COURT
DR|DRIVE
DR.|DRIVE
DRV|DRIVE
...
```
4. When you are finished, save and close the file.
For each lexicon file you referenced in standardizer.xml, do the following:
1. Create a text file in /WorkingDirectory/resource.
2. Save the file under the name you used to reference it from standardizer.xml.
3. In the file, enter a list of all possible values for the field as shown below.
```
E
EAST
ET
N
NO
NORTH
NTH
S
SO
SOUTH
...
```
4. When you are finished, save and close the file.
Continue to "Packaging and Importing the Variant".
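
Because both file types are plain text, they are easy to check before packaging. The following sketch reads normalization lines into a lookup table and lexicon lines into a set, and rejects normalization lines that are missing the pipe separator. The function names are hypothetical, and the sketch assumes one entry per line as in the examples above.

```python
def parse_normalization(lines):
    """Build a {nonstandard: standard} table from NONSTANDARD|STANDARD lines."""
    table = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        nonstandard, sep, standard = line.partition("|")
        if not sep or not nonstandard.strip() or not standard.strip():
            raise ValueError(
                f"line {number}: expected NONSTANDARD|STANDARD, got {line!r}"
            )
        table[nonstandard.strip()] = standard.strip()
    return table


def parse_lexicon(lines):
    """Return the set of non-blank values listed in a lexicon file."""
    return {line.strip() for line in lines if line.strip()}


street_types = parse_normalization(["COR|COURT", "CT.|COURT", "DR|DRIVE"])
directions = parse_lexicon(["E", "EAST", "N", "NORTH"])
print(street_types["CT."], "NORTH" in directions)
```

To check a real file, pass the open file object instead of a list, for example parse_normalization(open(path, encoding="utf-8")).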

Packaging and Importing the Variant

Once you have created all the files for the variant, you need to package them into a ZIP file to be imported into a master person index application.

To Package and Import the Variant

In the working directory, select the folder and file at the top level and add them to a ZIP file.
Name the ZIP file the same name as the variant. This is the value you entered for the variantType parameter in "Defining the Service Instance".

The ZIP file should contain serviceInstance.xml and the resource directory at its top level, mirroring the working directory. The resource directory might include several normalization and lexicon files, or none.
Import the file into a master person index application as described in “Importing Standardization Data Types and Variants” in Oracle Healthcare Master Person Index Configuration Guide (Part Number E18473-01).

Service Instance Definition File

Each data type variant is configured by a service instance definition file, serviceInstance.xml. Service type files define the fields to be standardized for a data type, and service instance definition files define the variant and the Java factory class for the variant. Both files are in XML format. The following table lists and describes the elements in the service instance file.

Element	Attribute	Description
serviceInstance		A container element for the description and any parameters for the variant.
	type	The name of the data type to which the variant belongs.
description		A brief description of the variant, such as “Person Names: Spain”.
parameter		A parameter for the variant. The default variants contain two parameters, dataType and variantType. The dataType parameter specifies the name of the data type to which the variant belongs. The variantType parameter specifies the name of the variant. For a master person index application, these are the names of the nodes that appear under the Standardization Engine node.
	name	The name of the parameter.
	value	The value of the parameter.
componentManagerFactory		The component manager factory class for the variant.
	class	The name of the component manager factory class. The default class is com.sun.inti.components.component.BeanComponentManagerFactory.
property		A property of the component manager factory class. The default class has two properties. The stylesheetURL property defines the location of the stylesheet, standardizer.xsl. The urlSource property defines the process definition file. Its value is a bean (by default, com.sun.inti.components.url.ResourceURLSource), which has a property called resourceName. The value for this property is standardizer.xml.
	name	The name of the property.
	value	The value for the property.