Analyzing and Cleansing Data for Sun Master Index

Cleansing the Legacy Data

After you customize the configuration file for the Data Cleanser, you can run the Data Cleanser against the staging database or a flat file. This step generates two files: one containing the records that passed all validation and were successfully cleansed, and one containing the records that failed validation, along with an error message for each.

To Cleanse the Data

Before You Begin

Before performing this step, make sure you have completed the following procedures:

Navigate to NetBeans_Projects/Project_Name/cleanser-generated/cleanser.

If you changed the name of the configuration file from sampleConfig.xml or created multiple configuration files, do the following:
1. Open run.bat (or run.sh on UNIX) for editing.
2. Change “<Rule_Config_File>” to the name of the configuration file to use for this run.
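
The substitution in step 2 can also be scripted when you switch between several configuration files. The following is only a minimal sketch: it assumes the placeholder appears literally as <Rule_Config_File> in the script text, and the helper name set_config_file and the sample script line are hypothetical, not taken from the shipped run.bat or run.sh.

```python
PLACEHOLDER = "<Rule_Config_File>"


def set_config_file(script_text: str, config_name: str) -> str:
    """Return script_text with the placeholder replaced by config_name.

    Hypothetical helper: assumes the run script contains the literal
    placeholder text and nothing else needs to change.
    """
    if PLACEHOLDER not in script_text:
        # Either the script was already edited or it uses a different marker.
        raise ValueError(f"{PLACEHOLDER} not found in script text")
    return script_text.replace(PLACEHOLDER, config_name)


if __name__ == "__main__":
    # Made-up script line, for illustration only.
    sample = "launcher <Rule_Config_File>"
    print(set_config_file(sample, "myConfig.xml"))
```

Raising an error when the placeholder is missing avoids silently running the Data Cleanser with a configuration file left over from an earlier edit.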

Do one of the following:
- On Windows, navigate to the cleanser home directory and then double-click run.bat or type run.bat at the command line.
- On UNIX, navigate to the cleanser home directory and then type run.sh.

Review the output files.
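
When the bad data file is large, tallying the error messages can help show whether errors are shared by many records or isolated to a few. The sketch below is an illustration under stated assumptions: the layout of the bad data file is not described here, so the code assumes one record per line with the error message as the last pipe-delimited field. The function name tally_errors and the sample lines are hypothetical.

```python
from collections import Counter


def tally_errors(lines, delimiter="|"):
    """Count how often each error message occurs.

    Assumption (not confirmed by the product documentation): the error
    message is the last delimiter-separated field of each line.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Take everything after the final delimiter as the message.
        message = line.rsplit(delimiter, 1)[-1].strip()
        counts[message] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "1001|SMITH|invalid date",
        "1002|JONES|invalid date",
        "1003|LEE|missing value",
    ]
    for message, count in tally_errors(sample).most_common():
        print(count, message)
```

Messages with high counts point to errors that a new cleansing rule could address; messages that occur once or twice are candidates for manual correction.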

If there are any records in the bad data file, do one of the following:
- If there are common errors for several records, define new cleansing rules in sampleConfig.xml to transform the bad data, delete the output files from the previous run, and then rerun the Data Cleanser against the staging database or flat file.
- If only a few records have unique errors, fix the errant records in the bad data file, rename the bad data file, update the DBConnection and startcounter values in the configuration file, and rerun the Data Cleanser against the renamed file.
  
  Caution –
  Be sure to change the DBConnection attribute in the configuration file to point to the renamed file and change the startcounter value to the next record to be processed. For example, if the original run processed 100 good records, change the value to “101” to start processing the bad records. Any records cleansed from the fixed file are appended to the good data file.
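
The startcounter arithmetic in the caution above can be sketched as follows. This is a minimal illustration, assuming the good data file holds one record per non-blank line with no header row; the function name next_startcounter is hypothetical.

```python
def next_startcounter(good_lines):
    """Return the startcounter value for the next run.

    Assumption: each non-blank line in the good data file is one
    processed record, so the next record number is the count plus one.
    """
    processed = sum(1 for line in good_lines if line.strip())
    return processed + 1


if __name__ == "__main__":
    # 100 good records from the first run, as in the example above.
    first_run = ["record"] * 100
    print(next_startcounter(first_run))  # prints 101
```

To apply this to an actual file, pass an open file object, for example `with open(path) as f: next_startcounter(f)`, where path is the location of your good data file.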

Repeat the previous steps until no records are written to the bad data file.

Note –
The final output to the good file can be loaded into the master index database using the Initial Bulk Match and Load tool (see Loading the Initial Data Set for a Sun Master Index). The Data Cleanser automatically places the data in the correct format based on the object.xml file.

Next Steps

Continue to Performing Frequency Analyses on Cleansed Data.