JavaScript is required to for searching.
Skip Navigation Links
Exit Print View
Analyzing and Cleansing Data for a Master Index     Java CAPS Documentation
search filter icon
search icon

Document Information

Analyzing and Cleansing Data for a Master Index

Related Topics

Data Cleansing and Analysis Overview

About the Data Profiler

About the Data Cleanser

Data Cleansing and Profiling Process Overview

Required Format for Flat Data Files

Generating the Data Profiler and Data Cleanser

To Generate the Data Profiler and Data Cleanser

Configuring the Environment

To Configure the Environment

Extracting the Legacy Data

Determining the Fields to Analyze

Defining the Data Analysis Rules

To Define Data Analysis Rules

Performing the Initial Data Analysis

To Perform the Initial Data Analysis

Reviewing the Data Profiler Reports

Configuring the Data Cleansing Rules

To Configure the Data Cleansing Rules

Cleansing the Legacy Data

To Cleanse the Data

Performing Frequency Analyses on Cleansed Data

Adjusting the Master Index Configuration

Data Profiler Rules Syntax

Data Profiler Processing Attributes

Data Profiler Global Variables

Simple Frequency Analysis Rules

Constrained Frequency Analysis Rules

Pattern Frequency Analysis Rules

Data Cleanser Rules Syntax

Data Cleanser Processing Attributes

Data Cleanser Global Variables

Data Validation Rules

dataLength

dateRange

matchFromFile

patternMatch

range

reject

return

validateDBField

Data Transformation Rules

assign

patternReplace

replace

truncate

Conditional Data Rules

dataLength

equals

isnull

matches

Conditional Operators

Data Profiler Report Samples

Simple Frequency Analysis Report Samples

Constrained Frequency Analysis Report Samples

Pattern Frequency Analysis Report Samples

Performing Frequency Analyses on Cleansed Data

After the data is cleansed (see Cleansing the Legacy Data), you can perform additional analyses against the data to help you determine how to configure query blocks and matching rules. Typically you would analyze the fields that are included in the block definitions for the query that is used for matching, and you could also analyze fields used for matching. The frequencies of these fields indicate how reliable they might be in the matching process and also indicate whether the blocking definitions are too broad or narrow to retrieve a reliable group of records for matching.

For this process, you might want to run frequency analysis against groups of records to find the frequencies for the unique values of the fields in the blocking definitions. After the data has been cleansed, you can run frequencies on standardized and phonetically encoded fields. The input for this process is the good data file to which the Data Cleanser wrote all the corrected and validated records.


Note - The blocking query is defined in the master index project in query.xml, and the match fields are defined in mefa.xml.


To perform frequency analyses on the cleansed data, repeat these procedures: