JavaScript is required to for searching.
Skip Navigation Links
Exit Print View
Analyzing and Cleansing Data for a Master Index     Java CAPS Documentation
search filter icon
search icon

Document Information

Analyzing and Cleansing Data for a Master Index

Related Topics

Data Cleansing and Analysis Overview

About the Data Profiler

About the Data Cleanser

Data Cleansing and Profiling Process Overview

Required Format for Flat Data Files

Generating the Data Profiler and Data Cleanser

To Generate the Data Profiler and Data Cleanser

Configuring the Environment

To Configure the Environment

Extracting the Legacy Data

Determining the Fields to Analyze

Defining the Data Analysis Rules

To Define Data Analysis Rules

Performing the Initial Data Analysis

To Perform the Initial Data Analysis

Reviewing the Data Profiler Reports

Configuring the Data Cleansing Rules

To Configure the Data Cleansing Rules

Cleansing the Legacy Data

To Cleanse the Data

Performing Frequency Analyses on Cleansed Data

Adjusting the Master Index Configuration

Data Profiler Rules Syntax

Data Profiler Processing Attributes

Data Profiler Global Variables

Simple Frequency Analysis Rules

Constrained Frequency Analysis Rules

Pattern Frequency Analysis Rules

Data Cleanser Rules Syntax

Data Cleanser Processing Attributes

Data Cleanser Global Variables

Data Validation Rules

dataLength

dateRange

matchFromFile

patternMatch

range

reject

return

validateDBField

Data Transformation Rules

assign

patternReplace

replace

truncate

Conditional Data Rules

dataLength

equals

isnull

matches

Conditional Operators

Data Profiler Report Samples

Simple Frequency Analysis Report Samples

Constrained Frequency Analysis Report Samples

Pattern Frequency Analysis Report Samples

Data Profiler Rules Syntax

Data Profiler rules are expressed using a Rules Definition Language (RDL) in an XML configuration file. Using a markup language allows you to easily define and change rules without having to code them. The configuration file is divided into two sections. The first section, cleansingRulesdefines rules for the Data Cleanser and the second section, profilingRules, defines rules for the Data Profiler.

The profiling section includes five subsections:

Data Profiler Processing Attributes

The following table lists and describes the attributes for the profilerVariable element in the configuration file. These attributes define the data source and path names for the Data Profiler as well as batch size. Below is a sample of the profiler attributes.

profilerVariable objectdefFilePath="../../src/Configuration" 
DBconnection="../StagingDB" startFrom="50001" profileSize="50000" 
reportFilePath=/Reports
Attribute
Description
objectdefFilePath
The path and filename to the object.xml file to use to profile the data.
DBconnection
The path to the staging database or the path and name of the flat file containing the data to be profiled. In this path, use forward slashes rather than back slashes.
startFrom
The record number at which the Data Profiler will start analyzing data. Use this attribute, along with the profileSize attribute, if you are running the process in batches.
profileSize
The number of records to process in one batch. This attribute is optional.
reportFilePath
The path where the Data Profiler reports will be stored. The profiler generates three different types of reports: Simple Frequency Reports, Constrained Frequency Reports, and Pattern Frequency Reports. One report is generated for each frequency rule you define. To see examples of these reports, see Data Profiler Report Samples

Data Profiler Global Variables

You can define global variables for the fields used in the Data Profiler rules so you do not need to use the qualified field name for each field. When defining variables, the qualified field name syntax is used. For more information about qualified field names, see Master Index Field Notations in Oracle Java CAPS Master Index Configuration Reference . Variables are defined in the varList element of the profiling rules section, and each variable is defined by a var element.

The var element has these attributes:

Here is an example of a variable list for a master index application storing person data.

<varList>
   <var name="fname" default="Person.FirstName"/>
   <var name="lname" default="Person.LastName"/>
   <var name="ssn" default="Person.SSN"/>
   <var name="zip" default="Person.Address.PostCode"/>
   <var name="state" default="Person.Address.State"/>
</varList>

When you reference a variable in a rule, use the format “:[variable_name]” (where variable_name is the name you assigned to the variable). For example, to reference the FirstName field as defined above, it would look similar to this in the rules:

<field fieldName=":[fname]"/>

Simple Frequency Analysis Rules

A simple frequency analysis compares the values of the fields you specify and creates a report that lists each value for those fields along with the number of times each value occurs. You can perform the analysis on a single field or multiple fields, and you can sort the resulting report by any of the fields in the report. Each simple frequency analysis rule is defined within SimpleFrequencyAnalysis tags that include the elements and attributes listed in the following table.

Table 1 Simple Frequency Analysis Rules

Element
Attribute
Description
fields
A list of fields to include in the frequency analysis.
field
One field definition in the list of fields.
fieldName
The name of the field. If you defined a variable for the field, the syntax for this attribute is fieldName=”:[var_name]”, where var_name is the name you gave the variable. If you did not define a variable, enter the qualified field name within double quotes. For example, fieldName=“Person.FirstName”.
sortOrder
If defined, a field on which to sort in order of frequency. If multiple fields are defined, all of their frequencies will be sorted in this order.
fieldName
The name of the field on which to sort. Use the syntax described for fieldName above.
increasing
An indicator of whether to sort in increasing or decreasing order. Specify “true” to sort in increasing order, or specify “false” to sort in decreasing order.
threshold
If defined, a frequency threshold above which or below which field values will be listed on the report. If multiple fields are defined, the threshold setting applies the combination of the fields.
value
The frequency threshold. This is a cutoff value to help limit the results of the report.
more
An indicator of whether the threshold is an upper or lower threshold. Specify “true” to return field values with a frequency greater than or equal to the threshold. Specify “false” to return field values with a frequency less than the threshold.
topNpattern
If defined, the number of top frequencies to display. For example, you can restrict a report to just the top 10 frequencies of a field. If multiple fields are defined, this setting applies to the combination of fields.
value
The number of top frequencies to display.
showall
An indicator of whether to display more than the specified number of frequencies if there are multiple values tied at the lowest frequency to display. Specify “true” to show all values that are tied for the top frequencies. Specify “false” to only display the number of frequencies specified by the value element. If there is a tie, the displayed value is selected randomly.

Example 1 Sample Simple Frequency Analysis Rules

The following sample defines an analysis of unique values for first and last names, and the resulting report displays only those values that occur 25 or more times and only those with the top 10 frequencies. If more than one value has the same frequency at the 10th frequency, all of those values are displayed. The results appear in increasing order of frequency.

<SimpleFrequencyAnalysis>
  <fields>
    <field fieldName=":[fname]"/>
    <field fieldName=":[lname]"
  </field>
  <sortOrder fieldName=":[fname]"/>
  <threshold value="25" more="true"/>
  <topNpatterns value="10" showall="true"/>
</SimpleFrequencyAnalysis>
  

The following sample defines an analysis of social security numbers and lists any duplicate values (that is, values that appear two or more times).

<SimpleFrequencyAnalysis>
  <fields>
    <field fieldName="Person.SSN"/>
  </field>
  <sortOrder fieldName="Person.SSN"/>
  <threshold value="2" more="true"/>
<SimpleFrequencyAnalysis>

Constrained Frequency Analysis Rules

A constrained frequency analysis compares the values of the fields you specify based on validation rules you define. It creates a report for each rule that lists each value for the fields along with the number of times each value or combination of values occurs. You can perform the analysis on a single field or multiple fields, and you can sort the resulting report by frequency in increasing or decreasing order. Constrained frequency analysis rules are defined within ConstrainedFrequencyAnalysis tags that include the elements and attributes listed in the following table.

Table 2 Constrained Frequency Analysis Rules

Element
Attribute
Description
fields
A list of fields to include in the frequency analysis.
field
One field definition in the list of fields.
fieldName
The name of the field. If you defined a variable for the field, the syntax for this attribute is fieldName=”:[var_name]”, where var_name is the name you gave the variable. If you did not define a variable, enter the qualified field name within double quotes. For example, fieldName=“Person.FirstName”.
sortOrder
If defined, specifies a field on which to sort in order of frequency.
fieldName
The name of the field to sort on. Use the syntax described for fieldName above.
increasing
An indicator of whether to sort in increasing or decreasing frequency. Specify “true” to sort in increasing order, or specify “false” to sort in decreasing order.
threshold
If defined, specifies a frequency threshold above which or below which field values will be listed on the report.
value
The frequency threshold. This is a cutoff value to help limit the results of the report.
more
An indicator of whether the threshold is an upper or lower threshold. Specify “true” to return field values with a frequency greater than or equal to the threshold. Specify “false” to return field values with a frequency less than the threshold.
ruleList
A list of rules to apply to the frequency analysis.
rule
One rule definition to apply to the frequency analysis. You can define multiple rule definitions in a rule list. The following validation rules can be used in a constrained frequency analysis: dateRange, patternMatch, dataLength, and range. For descriptions and samples of these rules, see Data Validation Rules.

Example 2 Sample Constrained Frequency Analysis

The example below performs a frequency analysis on the date of birth, but only for those dates that fall within a range too early to be valid. This an example of a profiling you might do prior to data cleansing in order to determine invalid values. This can also bring out invalid formats for the date, such as MM/DD/YY.

<ConstrainedFrequencyAnalysis>
  <fields>
    <field fieldName="Person.DOB"/>
  </fields>
  <ruleList>
    <rule>
      <dateRange fieldName="Person.DOB" min="01/01/0001" max="01/01/1900"/>
    </rule>
  </ruleList>
</ConstrainedFrequencyAnalysis>

Pattern Frequency Analysis Rules

A pattern frequency analysis compares the regular expression patterns found in the values for the specified field and performs a frequency analysis based on the patterns found. It creates a report for each field that lists each pattern along with the number of times each pattern occurs. You can perform the analysis on a single field or multiple fields, and you can sort the resulting report by frequency in increasing or decreasing order. Patterns are represented by regular expressions. For more information, see the Javadoc for java.util.regex.

Pattern frequency analysis rules are defined within PatternFrequencyAnalysis tags that include the elements and attributes listed in the following table.

Table 3 Pattern Frequency Analysis Rules

Element
Attribute
Description
topNpattern
If defined, the number of top frequencies to display. For example, you can restrict a report to just the top 10 frequencies of a field. If multiple fields are defined, this setting applies to the combination of fields.
value
The number of top frequencies to display.
showall
An indicator of whether to display more than the specified number of frequencies if there are multiple values tied at the lowest frequency to display. Specify “true” to show all values that are tied for the top frequencies. Specify “false” to only display the number of frequencies specified by the value element. If there is a tie, the displayed value is selected randomly.
fields
A list of fields to include in the frequency analysis.
field
One field definition in the list of fields.
fieldName
The name of the field. If you defined a variable for the field, the syntax for this attribute is fieldName=”:[var_name]”, where var_name is the name you gave the variable. If you did not define a variable, enter the qualified field name within double quotes. For example, fieldName=“Person.FirstName”.
sortOrder
If defined, specifies a field on which to sort in order of frequency.
fieldName
The name of the field to sort on. Use the syntax described for fieldName above.
increasing
An indicator of whether to sort in increasing or decreasing frequency. Specify “true” to sort in increasing order, or specify “false” to sort in decreasing order.
threshold
If defined, specifies a frequency threshold above which or below which patterns will be listed on the report.
value
The frequency threshold. This is a cutoff value to help limit the results of the report.
more
An indicator of whether the threshold is an upper or lower threshold. Specify “true” to return field values with a frequency greater than or equal to the threshold. Specify “false” to return field values with a frequency less than the threshold.

Example 3 Sample Pattern Frequency Analysis

The examples below perform pattern frequency analyses on the date of birth and the social security number. This generates two reports, one for each field.

<PatternFrequencyAnalysis>
  <topNpatterns value='10' showall "true"/>
  <fields>
    <field fieldName="Person.SSN"/>
  </fields>
</PatternFrequencyAnalysis>
<PatternFrequencyAnalysis>
  <topNpatterns value='10' showall "false"/>
  <fields>
    <field fieldName="Person.DOB"/>
  </fields>
</PatternFrequencyAnalysis>