Analyzing and Cleansing Data for Sun Master Index

Data Cleanser Rules Syntax

Data Cleanser rules are expressed using a Rules Definition Language (RDL) in an XML configuration file. Using a markup language allows you to easily define and change rules without having to code them, and you can also create custom Java classes to perform additional types of data validation or transformation. The first section of the configuration file, cleansingRules, defines rules for the Data Cleanser. The cleansing section includes three subsections. The first defines processing attributes, the second defines variables, and the third defines cleansing rules. The following topics provide information about each subsection.

Data Cleanser Processing Attributes

The following table lists and describes the attributes for the cleansingVariable element in the configuration file. These attributes define the data source and path names for the Data Cleanser as well as global validation rules. Below is a sample of the cleansing attributes.


cleansingVariable objectdefFilePath="../../src/Configuration" validateType="true" 
validateNull="false" validateLength="true" DBconnection="../StagingDB" 
goodFilePath="./Output/good.txt" badFilePath=./Output/bad.txt startCount="1"
standardizer="true"

Attribute 

Description 

objectdefFilePath 

The path and filename for the object.xml file to use to cleanse the data.

validateType 

An indicator of whether the cleanser should validate each field's data type against the type defined in object.xml. Specify true to validate field type; otherwise specify false. If you validate against type and the validation fails for any field in a record, the record is written to the bad file.

validateNull 

An indicator of whether the cleanser should check for null values in each field that is configured to be required in object.xml. Specify true to check for null values; otherwise specify false. If you check for null values and any required field in a record is null, the record is written to the bad file.

validateLength 

An indicator of whether the cleanser should validate each field's length against the length defined in object.xml. Specify true to validate field length; otherwise specify false. If you validate against length and the validation fails for any field in a record, the record is written to the bad file.

DBconnection 

The path to the staging database or the path and name of the flat file containing the data to be profiled. Use forward slashes in this path rather than back slashes. 

badDataFilePath 

The path and name of the file that lists the records that are found to contain bad data during the cleansing process. This file includes an error message for each record describing the reason it was rejected. If you specify a path that does not exist, you need to create the path. 

goodDataFilePath 

The path and name of the file that lists the records that do not contain any bad data. These records can be processed through the Initial Bulk Match and Load tool into the master index database. If you specify a path that does not exist, you need to create the path. 

startCounter 

The starting number for the GID generator for the cleansed records. The GID is a unique value used by the Initial Bulk Match and Load tool, which takes the good data file created by the cleansing process as its input. Enter a non-negative long value. For the initial cleansing, set this to 1.

standardizer 

An indicator of whether the Data Cleanser should standardize the input data according to the standardization rules defined in the mefa.xml file in the master index project. Specify true to standardize the data. This populates the standardized values into the output file. Specify false to bypass standardization. If no value is specified or this property is missing, the default is true.

Data Cleanser Global Variables

You can define global variables for the fields used in the cleansing rules so you do not need to enter the qualified field name for each rule. When defining variables, the qualified field name syntax is used. For more information about qualified field names, see Master Index Field Notations in Understanding Sun Master Index Configuration Options . Variables are defined in the varList element, and each variable is defined by a var element.

The var element has these attributes:

Here is an example of a variable list for a master index application storing person data.


<varList>
   <var name="fname" default="Person.FirstName"/>
   <var name="lname" default="Person.LastName"/>
   <var name="ssn" default="Person.SSN"/>
   <var name="zip" default="Person.Address.PostCode"/>
   <var name="state" default="Person.Address.State"/>
</varList>

When you reference a variable in a rule, use the format “:[variable_name]” (where variable_name is the name you assigned to the variable). For example, to reference the FirstName field as defined above, it would look similar to this in the rules:


<field fieldName=":[fname]"/>

Data Validation Rules

You can define rules to validate certain fields against specific values, a range of values, patterns, and length. You can define multiple rules, which are listed in the ruleList element. Each rule you define is contained within a rule element, and each rule element only defines one rule (though the rule can be complex and include conditional rules and operators).

The following validation rules are predefined to help you validate data during the cleansing process. These rules can be used in conjunction with transformation rules, conditional rules, and conditional operators.

dataLength

This rule validates the length of the specified field value against the length you specify. You can validate against a range “greater than or equal to” or “less than” the given length. Used alone for the Data Cleanser, this rule rejects records containing field values that fail the validation. You can also use this rule with conditional rules and operators to define more complex rules. When used in a constrained analysis for the Data Profiler, the frequency report is generated for all records for which the data length rule is true.

The syntax for dataLength is:


<dataLength fieldName="name" len="length" more="true/false"/>

The parameters for dataLength are:


Example 4 Sample dataLength Validation Rule

This sample defines a rule to check for records containing first names that are 30 or more characters and containing last names that are shorter than two characters.


<rule>
  <dataLength fieldName="Person.FirstName" len="30" more="true"/>
  <datalength fieldName="Person.LastName" len="2" more="false"/>
</rule>

dateRange

This rule validates the value of the specified date field against the range you specify. Used alone in the Data Cleanser, this rule rejects records that contain data that fails validation. You can also use this rule with conditional rules and operators to create more complex rules. When this rule is used with the Data Profiler, the frequency report includes all records that match the date range specified. Note that dates with a 2–digit year instead of a 4–digit year will have “00” appended to the front. For example, if the invalid year “54” is found in the data, it is considered to be “0054” and is converted to that value in the Data Cleanser output file. The Data Profiler does not check for valid day and month values, and they appear in frequency reports for your analysis. Days and months are validated in the Data Cleanser.

Enter the date arguments for this rule in the format specified for the master index application (defined in object.xml). The syntax for dateRange is:


<dateRange fieldName="name" min="minimum_date" max="maximum_date"/>

The parameters for dateRange are:


Example 5 Sample dateRange Rule

This sample defines a rule that validates the date field against a range beginning 01/01/1900 and ending 06/01/2008. When used in the Data Cleanser, if a date of birth does not fall within that range, the record will be rejected and written to the bad data file. When used with the Data Profiler, any dates of birth that fall within the range are included in the frequency report.


<rule>
  <dateRange fieldName="Person.DOB" min="01/01/1900" max="06/01/2008"/>
</rule>

matchFromFile

This rule validates the value of the specified field against a list of values that you supply in a delimited file known as a valid value list. If a field value or part of a field value does not match an entry in the file, the record is rejected and written to the bad data file. You can also define a list of bad data in the file, known as an exclusion list, and then use this rule with conditional rules to reject records containing the bad data. If you include an empty entry in this list to indicate a null value, include the empty field in the middle of the list and not at the beginning or end. The syntax for matchFromFile is:


<matchFromFile fieldName="name" filePath="path" delimiter="delimeter" exact="true/false"/>

The parameters for matchFromFile are:


Example 6 Sample matchFromFile Rule

This sample defines a rule that validates State fields against a list of valid state abbreviations contained in the file State.txt. The field value must exactly match an entry in the list or the record will be rejected.


<rule>
  <matchFromFile fieldName="Person.Address.State" 
   filePath="C:\\Profiling\\ValidValues\\StateCodes.txt"
   delimiter=";" exact="true"/>
</rule>

In this case, the valid value list would look similar to this:


AL;AK;AS;AZ;AR;CA;CO;CT;DE;DC;...

patternMatch

This rule validates the pattern of the specified field value against a regular expression. See the Javadoc for java.util.regex for more information about using regular expressions. Used alone in the Data Cleanser, this rule can be defined to either reject records containing field values do not match the pattern or to reject records containing fields values that do match the pattern. You can also use this rule with conditional rules and operators to define more complex rules. When this rule is used with the Data Profiler, the frequency report includes a frequency for any patterns that either match or do not match the specified pattern, depending on the value of the found parameter.

The syntax for patternMatch is:


<patternMatch fieldName="name" matchPattern="pattern" found="true/false"/>

The parameters for patternMatch are:


Example 7 Sample patternMatch Rule

This sample validates the social security number field to ensure it is in the format “NNN-NN-NNNN”. Records containing a social security number in any other format are rejected or are not included in the frequency report.


<rule>
  <patternMatch fieldName="Person.SSN" matchPattern="[0-9]{3}-[0-9]{2}-[0-9]{4}" 
   found="true"/>
</rule>

range

This rule validates the value of the specified field against a numeric range. The range is specified in integers. When used alone with the Data Cleanser, this rule rejects records containing field values that fall outside of the given range. You can also use this rule with conditional rules and operators to define more complex rules. When this rule is used with the Data Profiler, the frequency report includes a frequency of field values that fall within the range.

The syntax for range is:


<range fieldName="name" min="minimum" max="maximum"/>

The parameters for range are:


Example 8 Sample range Rule

This sample validates United States postal codes to be sure they fall within a valid range. Records containing postal codes less than 06000 or greater than 99950 will be rejected or will not be included in a frequency report. Note that the PostalCode field needs to be a numeric data type in order for this to be a valid rule.


<rule>
  <range fieldName="Person.Age" min="06000" max="99950"/>
</rule>

reject

This rule rejects the specified field as bad data and writes the record to the bad data file. This rule is designed to be used in conditional statements as one action to take if a field value fails its validation. The syntax for reject is:


<reject fieldName="name"/>

The parameter for reject is:


Example 9 Sample reject Rule

This sample checks whether the SSN field is null. If the SSN field is null, the record is rejected and written to the bad data file. If the field is not null, the record is returned as “good” data.


<rule>
  <if>
    <condition>
      <isnull fieldName="Person.SSN"/>
    </condition>
    <then>
      <reject fieldName="Person.SSN"/>
    </then>
    <else>
      <return fieldName="Person.SSN"/>
    </else>
  </if>
</rule>

return

This rule returns the specified field as good data . This rule is designed to be used in conditional statements as the action to take if a field value passes its validation. The syntax for return is:


<return fieldName="name"/>

The parameter for return is:


Example 10 Sample return Rule

For a sample of the return rule, see the sample for reject.


validateDBField

This rule validates the specified fields against the length defined for those fields in object.xml. You can specify whether to reject records that exceed the defined length or to truncate the field value to the defined length. The syntax for validateDBField is:


<validateDBField> 
  <field fieldName="name" action="reject/truncate"/>
  <field fieldName="name" action="reject/truncate"/>
  ...
</validateDBField>

The parameters for validateDBField are a list of fields and the action to take for each field. The field elements take the following parameters:


Example 11 Sample validateDBField Rule

The following sample checks the length of the social security number and last name in each record. If the social security number is too long, the record is rejected and written to the bad data file. If the last name is too long, it is truncated.


<rule>
  <validateDBField>
    <field fieldName="Person.SSN" action="reject"/>
    <field fieldName="Person.LastName" action="truncate"/>
  </validateDBFields>
</rule>

Data Transformation Rules

You can define rules that modify the values or the patterns of certain fields. Use these rules alone to modify a field across all records, or use them within conditional rules to modify the values or patterns only in fields that meet the criteria you define. You can define multiple rules, which are listed in a ruleList element. Each rule you define is contained within a rule element, and each rule element only defines one rule (though one rule can be complex and include conditional rules and operators).

The following transformation rules are predefined to help you cleanse data. These rules can be used in conjunction with conditional rules and operators, as well as with validation rules.

assign

This rule assigns a new value to the specified field. The syntax for assign is:


<assign fieldName="name" value"new_value"/>

The parameters for assign are:


Example 12 Sample assign Rule

The following rule checks whether the value of the gender field equals 2, FEM, or FML. If any of those values are found, they are changed to “F”. This standardizes the values that indicate a gender of female to the correct processing code for female. If the three listed values are not found in the gender field, the record is returned as good data.


<rule>
  <if>
    <condition>
      <or>
        <equals value1="Person.Gender" value2="2" exact="true"/>
        <equals value1="Person.Gender" value2="FEM" exact="true"/>
        <equals value1="Person.Gender" value2="FML" exact="true"/>
      </or>
    </condition>
    <then>
      <assign fieldName="Person.Gender" value="F"/>
    </then>
    <else>
      <return fieldName="Person.Gender"/>
    </else>
  </if>
</rule>

patternReplace

This rule checks the value of the specified field against a pattern. If the patterns match, this rule replaces the existing pattern with a new pattern. Use regular expressions to define the patterns (see the Javadoc for java.util.regex for more information). The syntax for patternReplace is:


<patternReplace fieldName="name" matchPattern"old_pattern" replace="new_pattern"/>

The parameters for patternReplace are:


Example 13 Sample patternReplace Rule

The following sample searches the SSN field for 9–digit values without hyphens. If it finds such values, it inserts hyphens at the appropriate places.


<rule>
  <patternReplace fieldName="Person.SSN" matchPattern="[0-9]{9}"
   replace="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
</rule>

replace

This rule checks the value of the specified field against a given string. If the field value or part of the field value matches the string you specify, this rule replaces the existing string with a new string. The syntax for replace is:


<replace fieldName="name" matchPattern="old_string" replace="new_string"/>

The parameters for replace are:


Example 14 Sample replace Rule

The following sample looks for the values 1, MALE, or MAL in the gender field. If those values are found, it replaces them with “M” in order to standardize them to the processing code for male. Note that the replacement for MALE is defined before the replacement for MAL. Since this rule looks for and replaces full field values or partial field values, performing the process on MAL first would replace the “MAL” in MALE with “M”, resulting in “ME” for the new field value.


<rule>
  <replace fieldName="Person.Gender" matchPattern="1" replace="M"/>
  <replace fieldName="Person.Gender" matchPattern="MALE" replace="M"/>
  <replace fieldName="Person.Gender" matchPattern="MAL" replace="M"/>
</rule>

truncate

This rule checks the value of the specified field against its length as defined in object.xml. If the field value exceeds the length, the rule truncates the field to its actual length. The syntax for truncate is:


<truncate fieldName="name"/>

The parameter for truncate is:


Example 15 Sample truncate Rule

The following sample checks the lengths defined for the LastName, MaidenName, and MotherMN fields in object.xml. If any of these field values are found to be longer than the defined length, the value is truncated to the exact length defined in object.xml.


<rule>
  <truncate fieldName="Person.LastName"/>
  <truncate fieldName="Person.MaidenName"/>
  <truncate fieldName="Person.MotherMN"/>
</rule>

Conditional Data Rules

You can define conditional rules to use in conjunction with the validation and transformation rules described in Data Validation Rules and Data Transformation Rules. Conditional rules return either true or false. They only define a condition and not an action, so they must be used with other types of rules.

Conditional rules use if, then, and else statements in the following format:


<rule>
  <if>
    <condition>
      ...
    </condition>
    <then>
      ...
    </then>
    <else>
      ...
    </else>
  </if>
</rule>

The following conditional rules are predefined:

In addition, you can use the conditional operators described in Conditional Operators.

dataLength

This rule checks the length of the value of the specified field against the length defined in object.xml. You can check for lengths greater than or equal to the defined length or less than the defined length. This rule returns “true” if the length matches the specified length range; otherwise it returns “false”. The syntax for dataLength is:


<dataLength fieldName="name" len="length" more="true/false"/>

The parameters for dataLength are:


Example 16 Sample dataLength Conditional Rule

The following sample checks postal code fields to verify they do not exceed the United States length of five characters. If a postal code containing more than five characters is found, the record is rejected an written to the bad data file. If a postal code contains five or fewer characters, the field is returned as good.


<rule>
  <if>
    <condition>
      <dataLength fieldName="Person.Address.PostalCode" len="6" more="true"/>
    </condition>
    <then>
      <reject fieldName="Person.Address.PostalCode"/>
    </then>
    <else>
      <return fieldName="Person.Address.PostalCode"/>
    </else>
  </if>
</rule>

equals

This rule checks whether a specific value is equal to or found within the value of the specified field. This rule returns true if the conditions are matched; otherwise it returns false. The syntax for equals is:


<equals fieldName="name" value2="value" exact="true/false"/>

The parameters for equals are:


Example 17 Sample equals Rule

The following rule checks whether the value of the gender field equals 2, FEM, or FEMALE. If any of those values are found, they are changed to “F”. This standardizes the values that indicate a gender of female to the correct processing code for female. If the three listed values are not found in the gender field, the record is returned as good data.


<rule>
  <if>
    <condition>
      <or>
        <equals value1="Person.Gender" value2="FEMALE" exact="true"/>
        <equals value1="Person.Gender" value2="2" exact="true"/>
        <equals value1="Person.Gender" value2="FEM" exact="true"/>
        <equals value1="Person.Gender" value2="FML" exact="true"/>
      </or>
    </condition>
    <then>
      <assign fieldName="Person.Gender" value="F"/>
    </then>
    <else>
      <return fieldName="Person.Gender"/>
    </else>
  </if>
</rule>

isnull

This rule checks the value of the specified field and returns true if the field value is null. The syntax for isnull is:


<isnull fieldName="name"/>

The parameter for isnull is:


Example 18 Sample isnull Rule

The following sample checks the first name field for null values. If any first names are null, the record is rejected and written to the bad data file. Otherwise, the field is returned as good data.


<rule>
  <if>
    <condition>
      <isnull fieldName="Person.FirstName"/>
    </condition>
    <then>
      <reject fieldName="Person.FirstName"/>
    </then>
    <else>
      <return fieldName="Person.LastName"/>
    </else>
  </if>
</rule>

matches

This rule checks the value of the specified field for a certain pattern. If the pattern is found, the rule returns true. Use regular expressions to define the pattern. For more information, see the Javadoc for java.util.regex. The syntax for matches is:


<matches fieldName="name" pattern="pattern"/>

The parameters for matches are:


Example 19 Sample matches Rule

The following sample checks for phone numbers of the format (ddd)ddd-dddd (where “d” is a digit). Phone fields that are in that format are returned as good data. Fields that are not in that format are rejected and the records are written to the bad data file.


<rule>
  <if>
    <condition>
      <matches fieldName="Person.Phone.Phone" pattern="([0-9]{3})[0-9]{3}-[0-9]{4}"/>
    </condition>
    <then>
      <return fieldName="Person.Phone.Phone"/>
    </then>
    <else>
      <reject fieldName="Person.Phone.Phone"/>
    </else>
  </if>
</rule>

Conditional Operators

You can use the following conditional operators with the conditional rules.Example 20 and Example 21 provide examples of conditional operator usage for multiple conditions.


Example 20 Sample for Nested Conditional Operators

The following sample checks for phone numbers of the format (ddd)ddd-dddd (where “d” is a digit). Phone fields that are in that format are returned as good data. Fields that are not in that format are rechecked to see if the field value is a string of 10 digits. Phone fields that return true for that validation are reformatted to include the parentheses and hyphen. Phone fields that return false for the final validation are rejected and the records are written to the bad data file.


<rule>
  <if>
    <condition>
      <matches fieldName="Person.Phone.Phone" pattern="([0-9]{3})[0-9]{3}-[0-9]{4}"/>
    </condition>
    <then>
      <return fieldName="Person.Phone.Phone"/>
    </then>
    <else>
      <if>
        <condition>
          <matches fieldName="Person.Phone.Phone" pattern="[0-9]{10}"/>
        </condition>
        <then>
          <patternReplace fieldname"Person.Phone.Phone" matchPattern="[0-9]{10}" 
           replace="([0-9]{3})[0-9]{3}-[0-9]{4}"/>
        </then>
        <else>
          <reject fieldName="Person.Phone.Phone"/>
        </else>
      </if>
    </else>
  </if>
</rule>


Example 21 Sample for Conditional Operations Using “and”

The following sample checks postal code fields to verify they do not exceed the United States length of five characters. If a postal code containing more than five characters is found, the record is rejected an written to the bad data file. If a postal code contains five or fewer characters, the field is then checked to see whether it contains nine characters (indicating the extension might be appended to the postal code). If it is nine characters, the value is truncated to five characters, leaving the postal code.


<rule>
  <if>
    <condition>
      <and>
        <dataLength fieldName="Person.Address.PostalCode" len="5" more="true"/>
        <datalength fieldName="Person.Address.PostalCode" len="6" more="false"/>
      </and>
    </condition>
    <then>
      <return fieldName="Person.Address.PostalCode"/>
    </then>
    <else>
      <if>
        <condition>
          <and>
            <dataLength fieldName="Person.Address.PostalCode" len="9" more="true"/>
            <datalength fieldName="Person.Address.PostalCode" len="10" more="false"/>
          </and>
        </condition>
        <then>
          <truncate fieldName="Person.Address.PostalCode"/>
        </then>
        <else>
            <reject fieldName="Person.Address.PostalCode"/>
        </else>
      </if>
    </else>
  </if>
</rule>