Analyzing and Cleansing Data for Sun Master Index

About the Data Cleanser

The Data Cleanser validates and modifies data based on predefined rules and on rules you define. Rules are written in the same Rules Definition Language (RDL) as the Data Profiler uses, which provides a flexible framework for defining cleansing rules. In addition to validating and transforming data against those rules, the Data Cleanser parses, normalizes, and phonetically encodes data using the standardization configuration in the mefa.xml file of the master index project. Cleansing rules can validate or transform data during the cleansing process and can include conditional rules and operators. If you need to perform custom processing, you can define Java classes to extend the functionality of the rules.
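For example, a custom rule class might encapsulate a field check such as the following. This is a minimal hypothetical sketch only: the class name, method signature, and SSN check are invented for illustration and do not reflect the actual Sun Master Index custom-rule API.

    // Hypothetical sketch: this class and its signature are invented for
    // illustration and are not part of the Sun Master Index API.
    public class SsnValidationRule {

        // Returns null if the value passes, or a failure reason otherwise.
        public String validate(String ssn) {
            if (ssn == null || ssn.isEmpty()) {
                return "SSN is missing";
            }
            String digits = ssn.replaceAll("-", "");
            if (!digits.matches("\\d{9}")) {
                return "SSN is not nine digits";
            }
            if (digits.equals("000000000") || digits.equals("999999999")) {
                return "SSN is a known default value";
            }
            return null; // passed all checks
        }

        public static void main(String[] args) {
            SsnValidationRule rule = new SsnValidationRule();
            System.out.println(rule.validate("123-45-6789")); // null (passes)
            System.out.println(rule.validate("999999999"));   // flagged as default
        }
    }

Returning a reason string rather than a simple boolean mirrors the way the Data Cleanser reports why each record failed, which makes the bad data easier to repair.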

The output of the Data Cleanser is two flat files: one contains the records that passed all validations and were successfully transformed and standardized, and the other contains the records that failed validation or could not be transformed correctly. The bad data file also records the reason each record failed, so you can easily determine how to fix the data. Cleansing is an iterative process, and you might run the Data Cleanser several times to make sure all data is processed correctly. The final run of the Data Cleanser should produce only a good data file, with no records failing the process.
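Conceptually, the split into good and bad files works like the following Java sketch. The file names, record format, and failure checks here are assumptions made for illustration, not the Data Cleanser's actual formats.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CleanseSplitSketch {
        // Invented checks: treat blank lines and all-9s fields as failures.
        static String failureReason(String line) {
            if (line.isBlank()) return "empty record";
            if (line.contains("999999999")) return "default SSN value";
            return null;
        }

        public static void main(String[] args) throws IOException {
            try (PrintWriter good = new PrintWriter("good.txt");
                 PrintWriter bad = new PrintWriter("bad.txt")) {
                for (String line : Files.readAllLines(Paths.get("input.txt"))) {
                    String reason = failureReason(line);
                    if (reason == null) {
                        good.println(line);                  // passed all checks
                    } else {
                        bad.println(reason + "\t" + line);   // reason kept for repair
                    }
                }
            }
        }
    }

After each run, you would correct the records in the bad file (or refine the rules that rejected them) and rerun the tool until every record lands in the good file.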

The final result of the Data Cleanser is a file of records that conform to the master index object definition and no longer contain invalid or default values. The fields are formatted correctly, and any fields defined for standardization in the master index application are standardized in the file. This is the file to load into the master index database using the Initial Bulk Match and Load tool (for more information, see Loading the Initial Data Set for a Sun Master Index).