Analyzing and Cleansing Data for a Master Index (Java CAPS Documentation)

Reviewing the Data Profiler Reports

When you run the Data Profiler, as described in Performing the Initial Data Analysis, the Data Profiler creates one report for each rule you defined and stores them in the location you specified in the configuration file. Review each report to help determine which fields need to be cleansed and how data needs to be validated or modified. When you profile prior to cleansing, you are looking for occurrences of invalid values or patterns, default values, missing values, and so on. This information translates into the rules you will write for the Data Cleanser.

Reports are written in CSV format so you can import them into a spreadsheet or reporting tool. Data patterns listed in the pattern frequency reports appear as regular expressions; see the Javadoc for the java.util.regex package for more information. The Data Profiler names each report based on the type of frequency analysis performed, the order in which the rule appears in the configuration file, and the range of records on which the analysis was performed. The syntax for report file names is:

UD_FreqType_Order_Records.csv

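where:

FreqType identifies the type of frequency analysis performed.

Order is the order in which the rule appears in the configuration file.

Records is the range of records on which the analysis was performed.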

For example, SF_PROFILE_SIMPLE_FRQ_3_1-100000.csv is the third simple frequency analysis report defined in the configuration file, performed against the first 100,000 records. CF_PROFILE_CONSTRAINED_FRQ_1_1-0.csv is the first constrained frequency analysis report defined in the configuration file, performed against all records.
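
When many reports are generated, it can help to sort or group them by the parts of their file names. The following is a minimal sketch, not part of the Data Profiler itself, that splits a report file name into its rule order and record range. The class name ProfilerReportName and its members are hypothetical. The sketch keeps everything before the order as a single string, and it assumes, based only on the second example above, that a range ending in 0 means all records.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that splits a Data Profiler report file name into its parts. */
class ProfilerReportName {
    // Everything before the last two underscore-separated fields is captured as one
    // string; this sketch does not try to separate the UD part from the FreqType part.
    private static final Pattern NAME =
            Pattern.compile("^(.+)_(\\d+)_(\\d+)-(\\d+)\\.csv$");

    final String prefixAndType; // for example "SF_PROFILE_SIMPLE_FRQ"
    final int order;            // order of the rule in the configuration file
    final long firstRecord;     // start of the record range
    final long lastRecord;      // end of the record range

    private ProfilerReportName(String prefixAndType, int order,
                               long firstRecord, long lastRecord) {
        this.prefixAndType = prefixAndType;
        this.order = order;
        this.firstRecord = firstRecord;
        this.lastRecord = lastRecord;
    }

    /** Parses a report file name, or throws if it does not fit the naming syntax. */
    static ProfilerReportName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a profiler report name: " + fileName);
        }
        return new ProfilerReportName(
                m.group(1),
                Integer.parseInt(m.group(2)),
                Long.parseLong(m.group(3)),
                Long.parseLong(m.group(4)));
    }

    /** Assumption taken from the 1-0 example: an end value of 0 means all records. */
    boolean coversAllRecords() {
        return lastRecord == 0;
    }

    public static void main(String[] args) {
        String[] names = {
            "SF_PROFILE_SIMPLE_FRQ_3_1-100000.csv",
            "CF_PROFILE_CONSTRAINED_FRQ_1_1-0.csv"
        };
        for (String name : names) {
            ProfilerReportName r = parse(name);
            String range = r.coversAllRecords()
                    ? "all records"
                    : r.firstRecord + "-" + r.lastRecord;
            System.out.println(r.prefixAndType + " order=" + r.order + " records=" + range);
        }
    }
}
```

Parsing from the end of the name keeps the sketch independent of any underscores that appear in the leading part of the file name.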

For examples of different types of frequency analysis reports, see Data Profiler Report Samples. When you finish analyzing the reports, continue to Configuring the Data Cleansing Rules.