Analyzing and Cleansing Data for Sun Master Index

Data Cleanser Processing Attributes

The attributes of the cleansingVariable element in the configuration file define the data source and path names for the Data Cleanser, as well as global validation rules. A sample of the cleansing attributes appears first, followed by a table that describes each attribute.


<cleansingVariable objectdefFilePath="../../src/Configuration" validateType="true" 
validateNull="false" validateLength="true" DBconnection="../StagingDB" 
goodFilePath="./Output/good.txt" badFilePath="./Output/bad.txt" startCount="1"
standardizer="true"/>
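As a rough illustration of how these attributes might be read, the following Python sketch parses a cleansingVariable element and converts the true/false flags to booleans. It is not part of the Data Cleanser. The XML string reuses the sample values with the angle brackets and quoting assumed, the enclosing configuration file is omitted, and the helper name read_cleansing_attributes is hypothetical. The only default applied is the one documented for standardizer.

```python
import xml.etree.ElementTree as ET

# Sample element using the values from the example above. The angle brackets
# and quoting are assumed; the surrounding configuration file is omitted.
SAMPLE = '''<cleansingVariable objectdefFilePath="../../src/Configuration"
    validateType="true" validateNull="false" validateLength="true"
    DBconnection="../StagingDB"
    goodFilePath="./Output/good.txt" badFilePath="./Output/bad.txt"
    startCount="1" standardizer="true"/>'''

BOOLEAN_FLAGS = ("validateType", "validateNull", "validateLength", "standardizer")


def read_cleansing_attributes(xml_text):
    """Return the element's attributes as a dict, with flags as booleans.

    Hypothetical helper for illustration only.
    """
    element = ET.fromstring(xml_text)
    settings = dict(element.attrib)
    for flag in BOOLEAN_FLAGS:
        if flag in settings:
            text = settings[flag].strip().lower()
            if flag == "standardizer" and text == "":
                # An empty standardizer value is treated as the default (true).
                settings[flag] = True
            else:
                settings[flag] = text == "true"
    # A missing standardizer attribute also defaults to true.
    settings.setdefault("standardizer", True)
    return settings


settings = read_cleansing_attributes(SAMPLE)
print(settings["validateNull"], settings["goodFilePath"])
```

Reading the flags into booleans once, up front, keeps any later validation logic free of string comparisons.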

Attribute 

Description 

objectdefFilePath 

The path and filename for the object.xml file to use to cleanse the data.

validateType 

An indicator of whether the cleanser should validate each field's data type against the type defined in object.xml. Specify true to validate field type; otherwise specify false. If you validate against type and the validation fails for any field in a record, the record is written to the bad file.

validateNull 

An indicator of whether the cleanser should check for null values in each field that is configured to be required in object.xml. Specify true to check for null values; otherwise specify false. If you check for null values and any required field in a record is null, the record is written to the bad file.

validateLength 

An indicator of whether the cleanser should validate each field's length against the length defined in object.xml. Specify true to validate field length; otherwise specify false. If you validate against length and the validation fails for any field in a record, the record is written to the bad file.

DBconnection 

The path to the staging database or the path and name of the flat file containing the data to be profiled. Use forward slashes in this path rather than backslashes.

badDataFilePath 

The path and name of the file that lists the records found to contain bad data during the cleansing process. For each record, this file includes an error message describing the reason the record was rejected. If the path you specify does not exist, you must create it.

goodDataFilePath 

The path and name of the file that lists the records that do not contain any bad data. These records can be processed through the Initial Bulk Match and Load tool into the master index database. If the path you specify does not exist, you must create it.

startCounter 

The starting number for the GID generator for the cleansed records. The GID is a unique value used by the Initial Bulk Match and Load tool, which takes the good data file created by the cleansing process as its input. Enter a non-negative long value. For the initial cleansing, set this to 1.

standardizer 

An indicator of whether the Data Cleanser should standardize the input data according to the standardization rules defined in the mefa.xml file in the master index project. Specify true to standardize the data and write the standardized values to the output file. Specify false to bypass standardization. If no value is specified or this property is missing, the default is true.
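
The validation flags, the good and bad output files, and the starting GID described in the table can be pictured with a short Python sketch. Everything here is illustrative rather than the Data Cleanser's actual implementation. The FIELDS dictionary is a hypothetical stand-in for the definitions in object.xml, the record layout and the function names validate and cleanse are invented for the example, and the GID is modeled as a simple counter that is assumed to advance only for good records.

```python
from itertools import count

# Hypothetical stand-in for the field definitions in object.xml.
FIELDS = {
    "FirstName": {"type": str, "length": 10, "required": True},
    "Age": {"type": int, "length": 3, "required": False},
}


def validate(record, validate_type=True, validate_null=False, validate_length=True):
    """Return an error message for the first failed check, or None."""
    for name, rule in FIELDS.items():
        value = record.get(name)
        if value is None or value == "":
            if validate_null and rule["required"]:
                return f"{name}: required field is null"
            continue  # nothing further to check on an empty field
        if validate_type and rule["type"] is int:
            try:
                int(value)
            except ValueError:
                return f"{name}: expected an integer"
        if validate_length and len(str(value)) > rule["length"]:
            return f"{name}: longer than {rule['length']} characters"
    return None


def cleanse(records, start_counter=1, **flags):
    """Split records into good rows (with a GID) and bad rows (with a reason)."""
    gids = count(start_counter)  # assumption: GIDs go only to good records
    good, bad = [], []
    for record in records:
        error = validate(record, **flags)
        if error is None:
            good.append((next(gids), record))
        else:
            bad.append((record, error))
    return good, bad


good, bad = cleanse(
    [
        {"FirstName": "Ann", "Age": "41"},
        {"FirstName": "Bartholomew", "Age": "30"},  # name exceeds 10 characters
        {"FirstName": "", "Age": "x"},  # non-numeric age
    ]
)
for record, reason in bad:
    print(record, "->", reason)
```

With the default flags shown, which mirror the sample values (validateNull set to false), the empty FirstName in the third record is not rejected for being null. That record still lands in the bad list because its Age fails the type check.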