JavaScript is required to for searching.
Skip Navigation Links
Exit Print View
Analyzing and Cleansing Data for a Master Index     Java CAPS Documentation
search filter icon
search icon

Document Information

Analyzing and Cleansing Data for a Master Index

Related Topics

Data Cleansing and Analysis Overview

About the Data Profiler

About the Data Cleanser

Data Cleansing and Profiling Process Overview

Required Format for Flat Data Files

Generating the Data Profiler and Data Cleanser

To Generate the Data Profiler and Data Cleanser

Configuring the Environment

To Configure the Environment

Extracting the Legacy Data

Determining the Fields to Analyze

Defining the Data Analysis Rules

To Define Data Analysis Rules

Performing the Initial Data Analysis

To Perform the Initial Data Analysis

Reviewing the Data Profiler Reports

Configuring the Data Cleansing Rules

To Configure the Data Cleansing Rules

Cleansing the Legacy Data

To Cleanse the Data

Performing Frequency Analyses on Cleansed Data

Adjusting the Master Index Configuration

Data Profiler Rules Syntax

Data Profiler Processing Attributes

Data Profiler Global Variables

Simple Frequency Analysis Rules

Constrained Frequency Analysis Rules

Pattern Frequency Analysis Rules

Data Cleanser Rules Syntax

Data Cleanser Processing Attributes

Data Cleanser Global Variables

Data Validation Rules

dataLength

dateRange

matchFromFile

patternMatch

range

reject

return

validateDBField

Data Transformation Rules

assign

patternReplace

replace

truncate

Conditional Data Rules

dataLength

equals

isnull

matches

Conditional Operators

Data Profiler Report Samples

Simple Frequency Analysis Report Samples

Constrained Frequency Analysis Report Samples

Pattern Frequency Analysis Report Samples

Data Profiler Report Samples

The following topics provide sample frequency configurations along with excerpts of the reports they produce.

Simple Frequency Analysis Report Samples

Simple frequency analysis reports list the frequencies of various data values found in the specified fields without using any data verification, transformation, or conditional rules. You can specify a sort order for this type of report, and you can specify a frequency threshold. The sample frequency rule defined below analyzes first and last names, reporting the top 6 frequencies and only if the frequency is three or more.

<SimpleFrequencyAnalysis>
  <fields>
    <field fieldName="Person.FirstName"/>
    <field fieldName="Person.LastName"/>
  </fields>
  <sortOrder fieldName="Person.FirstName" increasing="false"/>
  <threshold value="3" more="true"/>
  <topNpatterns value="6" showall="false"/> 
</SimpleFrequencyAnalysis>

This analysis generates a report similar to the following:

SF_PROFILE_SIMPLE_FRQ_1_1–0.csv

PERSON.LAWSTNAME
PERSON.FIRSTNAME
FREQUENCY
SMITH
ANN
38
JONES
SUSAN
31
SMITH
JOHN
31
THOMPSON
JAMES
28
JOHNSON
BETH
26
MILLER
FRANK
25

The sample frequency rule defined below analyzes social security numbers and analyzes whether there are duplicates (two or more occurrences).

<SimpleFrequencyAnalysis>
  <fields>
    <field fieldName="Person.SSN"/>
  </fields>
  <sortOrder fieldName="Person.SSN" increasing="false"/>
  <threshold value="2" more="true"/>
</SimpleFrequencyAnalysis>

This analysis generates a report similar to the following excerpt:

SF_PROFILE_SIMPLE_FRQ_2_1–0.csv

PERSON.SSN
FREQUENCY
999999999
457
000000000
125
123456789
41
222423535
4
992203847
2

Constrained Frequency Analysis Report Samples

Constrained frequency analysis reports list the frequencies of various data values found in the specified fields based on defined rules. For example, you can define rules that will only include certain patterns or that will exclude certain values. This topic includes two constrained analysis definitions along with corresponding sample reports.

<ConstrainedFrequencyAnalysis>
  <fields>
    <field fieldName="Person.SSN"/>
  </fields>
  <ruleList>
    <rule>
      <dataLength fieldName="Person.SSN" len="10" more="false"/>
    </rule>
  </ruleList>
</ConstrainedFrequencyAnalysis>

The above analysis generates a report for social security numbers with less than 10 characters (which means the hyphens are likely missing). Below is a sample output.

CF_PROFILE_CONSTRAINED_FRQ_1_1–100000.csv

PERSON.SSN
FREQUENCY
300555444
1
299557777
1
822331111
2
999999999
98
000000000
115

The following analysis generates a report for dates of birth that are prior to 01/01/1899 (which means they likely contain typographical errors). Below is a sample output.

<ConstrainedFrequencyAnalysis>
  <fields>
    <field fieldName="Person.DOB"/>
  </fields>
  <ruleList>
    <rule>
      <dataRange fieldName="Person.DOB" min="01/01/0001" max="01/01/1899"/>
    </rule>
  </ruleList>
</ConstrainedFrequencyAnalysis>

CF_PROFILE_CONSTRAINED_FRQ_2_1–100000.csv

PERSON.DOB
FREQUENCY
07/08/53
1
10/04/51
1
09/16/1682
1
12/28/1680
2
05/09/1898
2

Pattern Frequency Analysis Report Samples

Pattern frequency analysis reports list the frequencies of various data patterns found in the values of the specified fields. Patterns are expressed as regular expressions. This topic includes sample reports based on the pattern frequencies defined below for social security number and date of birth patterns.

<PatternFrqueencyAnalysis>
  <topNpatterns ="5" showall="true"/>
  <fields>
    <field fieldName="Person.SSN"/>
  </fields>
</PatternFrequencyAnalysis>
<PatternFrequencyAnalysis>
  <topNpatterns ="5" increasing="true"/>
  <fields>
    <field fieldName="Person.DOB"/>
  </fields>
</PatternFrequencyAnalysis>

The above rules generate two reports, one for social security number patterns and one for date of birth patterns. The reports only lists the top 5 patterns. Below are sample outputs for each. You can easily determine invalid values based on the patterns listed.

PF_PROFILE_PATTERN_FRQ_1_1–10000.csv

PERSON.SSN
FREQUENCY
NNN/NN/NNNN
11
51
NNN-NN-NNN
64
NNNNNNNNN
92
NNN-NN-NNNN
7614

PF_PROFILE_PATTERN_FRQ_2_1–10000.csv

PERSON.DOB
FREQUENCY
NNNN/NN/NN
14
22
NNNNNNNN
62
NN/NN/NN
84
NN/NN/NNNN
9766