After you customize the configuration file for the Data Cleanser, you can run the Data Cleanser against the staging database or a flat file. This step generates two files: one containing the records that passed all validation and were successfully cleansed, and one containing the records that failed validation, along with an error message for each.
Before performing this step, make sure you have completed the following procedures:
Navigate to NetBeans_Projects/Project_Name/cleanser-generated/cleanser.
If you changed the name of the configuration file from sampleConfig.xml or created multiple configuration files, do the following:
Do one of the following:
Review the output files.
If there are any records in the bad data file, do one of the following:
If there are common errors for several records, define new cleansing rules in sampleConfig.xml to transform the bad data, delete the output files from the previous run, and then rerun the Data Cleanser against the staging database or flat file.
If there are unique errors for only a few records, fix the errant records in the bad data file, rename the bad data file, update the DBConnection and startcounter properties, and rerun the Data Cleanser against the updated file.
Be sure to change the DBConnection attribute in the configuration file to point to the renamed file and change the startcounter value to the next record to be processed. For example, if the original run processed 100 good records, change the value to “101” to start processing the bad records. Any records cleansed from the fixed file are appended to the good data file.
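The startcounter arithmetic described above can be sketched as a small helper. This is only an illustration: it assumes the good data file holds one record per line, which this section does not state, and the function name next_startcounter is hypothetical rather than part of the Data Cleanser.

```python
from typing import Iterable


def next_startcounter(good_records: Iterable[str]) -> int:
    """Return the startcounter value for a rerun against the fixed file.

    Assumption: each non-blank line is one successfully cleansed record.
    """
    # Count the records already written to the good data file.
    processed = sum(1 for line in good_records if line.strip())
    # The rerun starts at the next record number.
    return processed + 1


# In-memory stand-in for a good data file holding 100 records.
sample_good_file = [f"record {i}\n" for i in range(1, 101)]
print(next_startcounter(sample_good_file))  # prints 101
```

An open file object can be passed in place of the sample list, since iterating over a text file yields its lines.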
Repeat the previous steps until no records are written to the bad data file.
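When deciding whether the remaining bad records share a common error or need individual fixes, and whether another pass is required at all, a tally of the error messages can help. The sketch below is hypothetical: the layout of the bad data file is not described in this section, so it assumes one record per line with the error message after the last "|" delimiter. Adjust both assumptions to match your actual file.

```python
from collections import Counter
from typing import Iterable


def tally_errors(bad_records: Iterable[str], delimiter: str = "|") -> Counter:
    """Count how often each error message appears in the bad data file.

    Assumption: the error message is the text after the last delimiter
    on each non-blank line. An empty result means no bad records remain.
    """
    counts: Counter = Counter()
    for line in bad_records:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Everything after the last delimiter is treated as the message.
        _, _, message = line.rpartition(delimiter)
        counts[message.strip()] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_bad_file = [
    "1|Smith||missing first name",
    "2|Jones||missing first name",
    "3|Lee|Ann|invalid date",
]
for message, count in tally_errors(sample_bad_file).most_common():
    print(count, message)
```

A message that appears many times suggests a new cleansing rule in the configuration file. Messages that appear only once are candidates for fixing by hand in the bad data file.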
The final contents of the good data file can be loaded into the master index database using the Initial Bulk Match and Load tool (see Loading the Initial Data Set for a Sun Master Index). The Data Cleanser automatically places the data in the correct format based on the object.xml file.
To perform frequency analyses on the cleansed data, continue to Performing Frequency Analyses on Cleansed Data.