Skip Navigation Links | |
Exit Print View | |
Analyzing and Cleansing Data for a Master Index Java CAPS Documentation |
Analyzing and Cleansing Data for a Master Index
Data Cleansing and Analysis Overview
Data Cleansing and Profiling Process Overview
Required Format for Flat Data Files
Generating the Data Profiler and Data Cleanser
To Generate the Data Profiler and Data Cleanser
Determining the Fields to Analyze
Defining the Data Analysis Rules
Performing the Initial Data Analysis
To Perform the Initial Data Analysis
Reviewing the Data Profiler Reports
Configuring the Data Cleansing Rules
To Configure the Data Cleansing Rules
Performing Frequency Analyses on Cleansed Data
Adjusting the Master Index Configuration
Data Profiler Processing Attributes
Data Profiler Global Variables
Simple Frequency Analysis Rules
Data Cleanser Processing Attributes
Data Cleanser Global Variables
Simple Frequency Analysis Report Samples
Data Profiler rules are expressed using a Rules Definition Language (RDL) in an XML configuration file. Using a markup language allows you to easily define and change rules without having to code them. The configuration file is divided into two sections. The first section, cleansingRulesdefines rules for the Data Cleanser and the second section, profilingRules, defines rules for the Data Profiler.
The profiling section includes five subsections:
The following table lists and describes the attributes for the profilerVariable element in the configuration file. These attributes define the data source and path names for the Data Profiler as well as batch size. Below is a sample of the profiler attributes.
profilerVariable objectdefFilePath="../../src/Configuration" DBconnection="../StagingDB" startFrom="50001" profileSize="50000" reportFilePath=/Reports
|
You can define global variables for the fields used in the Data Profiler rules so you do not need to use the qualified field name for each field. When defining variables, the qualified field name syntax is used. For more information about qualified field names, see Master Index Field Notations in Oracle Java CAPS Master Index Configuration Reference . Variables are defined in the varList element of the profiling rules section, and each variable is defined by a var element.
The var element has these attributes:
name – The name of the variable.
default – The qualified field name to the field defined by the name attribute.
Here is an example of a variable list for a master index application storing person data.
<varList> <var name="fname" default="Person.FirstName"/> <var name="lname" default="Person.LastName"/> <var name="ssn" default="Person.SSN"/> <var name="zip" default="Person.Address.PostCode"/> <var name="state" default="Person.Address.State"/> </varList>
When you reference a variable in a rule, use the format “:[variable_name]” (where variable_name is the name you assigned to the variable). For example, to reference the FirstName field as defined above, it would look similar to this in the rules:
<field fieldName=":[fname]"/>
A simple frequency analysis compares the values of the fields you specify and creates a report that lists each value for those fields along with the number of times each value occurs. You can perform the analysis on a single field or multiple fields, and you can sort the resulting report by any of the fields in the report. Each simple frequency analysis rule is defined within SimpleFrequencyAnalysis tags that include the elements and attributes listed in the following table.
Table 1 Simple Frequency Analysis Rules
|
Example 1 Sample Simple Frequency Analysis Rules
The following sample defines an analysis of unique values for first and last names, and the resulting report displays only those values that occur 25 or more times and only those with the top 10 frequencies. If more than one value has the same frequency at the 10th frequency, all of those values are displayed. The results appear in increasing order of frequency.
<SimpleFrequencyAnalysis> <fields> <field fieldName=":[fname]"/> <field fieldName=":[lname]" </field> <sortOrder fieldName=":[fname]"/> <threshold value="25" more="true"/> <topNpatterns value="10" showall="true"/> </SimpleFrequencyAnalysis>
The following sample defines an analysis of social security numbers and lists any duplicate values (that is, values that appear two or more times).
<SimpleFrequencyAnalysis> <fields> <field fieldName="Person.SSN"/> </field> <sortOrder fieldName="Person.SSN"/> <threshold value="2" more="true"/> <SimpleFrequencyAnalysis>
A constrained frequency analysis compares the values of the fields you specify based on validation rules you define. It creates a report for each rule that lists each value for the fields along with the number of times each value or combination of values occurs. You can perform the analysis on a single field or multiple fields, and you can sort the resulting report by frequency in increasing or decreasing order. Constrained frequency analysis rules are defined within ConstrainedFrequencyAnalysis tags that include the elements and attributes listed in the following table.
Table 2 Constrained Frequency Analysis Rules
|
Example 2 Sample Constrained Frequency Analysis
The example below performs a frequency analysis on the date of birth, but only for those dates that fall within a range too early to be valid. This an example of a profiling you might do prior to data cleansing in order to determine invalid values. This can also bring out invalid formats for the date, such as MM/DD/YY.
<ConstrainedFrequencyAnalysis> <fields> <field fieldName="Person.DOB"/> </fields> <ruleList> <rule> <dateRange fieldName="Person.DOB" min="01/01/0001" max="01/01/1900"/> </rule> </ruleList> </ConstrainedFrequencyAnalysis>
A pattern frequency analysis compares the regular expression patterns found in the values for the specified field and performs a frequency analysis based on the patterns found. It creates a report for each field that lists each pattern along with the number of times each pattern occurs. You can perform the analysis on a single field or multiple fields, and you can sort the resulting report by frequency in increasing or decreasing order. Patterns are represented by regular expressions. For more information, see the Javadoc for java.util.regex.
Pattern frequency analysis rules are defined within PatternFrequencyAnalysis tags that include the elements and attributes listed in the following table.
Table 3 Pattern Frequency Analysis Rules
|
Example 3 Sample Pattern Frequency Analysis
The examples below perform pattern frequency analyses on the date of birth and the social security number. This generates two reports, one for each field.
<PatternFrequencyAnalysis> <topNpatterns value='10' showall "true"/> <fields> <field fieldName="Person.SSN"/> </fields> </PatternFrequencyAnalysis>
<PatternFrequencyAnalysis> <topNpatterns value='10' showall "false"/> <fields> <field fieldName="Person.DOB"/> </fields> </PatternFrequencyAnalysis>