JavaScript is required to for searching.
Skip Navigation Links
Exit Print View
Analyzing and Cleansing Data for a Master Index     Java CAPS Documentation
search filter icon
search icon

Document Information

Analyzing and Cleansing Data for a Master Index

Related Topics

Data Cleansing and Analysis Overview

About the Data Profiler

About the Data Cleanser

Data Cleansing and Profiling Process Overview

Required Format for Flat Data Files

Generating the Data Profiler and Data Cleanser

To Generate the Data Profiler and Data Cleanser

Configuring the Environment

To Configure the Environment

Extracting the Legacy Data

Determining the Fields to Analyze

Defining the Data Analysis Rules

To Define Data Analysis Rules

Performing the Initial Data Analysis

To Perform the Initial Data Analysis

Reviewing the Data Profiler Reports

Configuring the Data Cleansing Rules

To Configure the Data Cleansing Rules

Cleansing the Legacy Data

To Cleanse the Data

Performing Frequency Analyses on Cleansed Data

Adjusting the Master Index Configuration

Data Profiler Rules Syntax

Data Profiler Processing Attributes

Data Profiler Global Variables

Simple Frequency Analysis Rules

Constrained Frequency Analysis Rules

Pattern Frequency Analysis Rules

Data Cleanser Rules Syntax

Data Cleanser Processing Attributes

Data Cleanser Global Variables

Data Validation Rules

dataLength

dateRange

matchFromFile

patternMatch

range

reject

return

validateDBField

Data Transformation Rules

assign

patternReplace

replace

truncate

Conditional Data Rules

dataLength

equals

isnull

matches

Conditional Operators

Data Profiler Report Samples

Simple Frequency Analysis Report Samples

Constrained Frequency Analysis Report Samples

Pattern Frequency Analysis Report Samples

Cleansing the Legacy Data

After you customize the configuration file for the Data Cleanser, you can run the Data Cleanser against the staging database or a flat file. This step generates two files, one containing the records that passed all validation and was successfully cleansed and one containing records that failed validation along with an error message for each.

To Cleanse the Data

Before You Begin

Before performing this step, make sure you have completed the following procedures:

  1. Navigate to NetBeans_Projects/Project_Name/cleanser-generated/cleanser.
  2. If you changed the name of the configuration file from sampleConfig.xml or created multiple configuration files, do the following:
    1. Open run.bat (or run.sh on UNIX) for editing.
    2. Change “<Rule_Config_File>” to the name of the configuration file to use for this run.
  3. Do one of the following:
    • On Windows, navigate to the cleanser home directory and then double-click run.bat or type run.bat in the command line.
    • On UNIX, navigate to the cleanser home directory and then type run.sh.
  4. Review the output files.
  5. If there are any records in the bad data file, do one of the following:
    • If there are common errors for several records, define new cleansing rules in sampleConfig.xml to transform the bad data, delete the output files from the previous run, and then rerun the Data Cleanser against the staging database or flat file.
    • If there are unique errors for few records, fix the errant records in the bad data file, rename the bad data file, update the DBConnection and startcounter properties, and rerun the Data Cleanser against the updated file.

      Caution

      Caution - Be sure to change the DBConnection attribute in the configuration file to point to the renamed file and change the startcounter value to the next record to be processed. For example, if the original run processed 100 good records, change the value to “101” to start processing the bad records. Any records cleansed from the fixed file are appended to the good data file.


  6. Repeat the previous steps until there are no records being written to the bad data file.

    Note - The final output to the good file can be loaded into the master index database using the Initial Bulk Match and Load tool (see Loading the Initial Data Set for a Master Index). The Data Cleanser automatically places the data in the correct format based on the object.xml file.


Next Steps

Continue to Performing Frequency Analyses on Cleansed Data to perform frequency analyses on the cleansed data.