Analyzing and Cleansing Data for Sun Master Index

Data Cleanser Rules Syntax

Data Cleanser rules are expressed using a Rules Definition Language (RDL) in an XML configuration file. Using a markup language allows you to easily define and change rules without having to code them, and you can also create custom Java classes to perform additional types of data validation or transformation. The first section of the configuration file, cleansingRules, defines rules for the Data Cleanser. The cleansing section includes three subsections. The first defines processing attributes, the second defines variables, and the third defines cleansing rules. The following topics provide information about each subsection.

Data Cleanser Processing Attributes

The following table lists and describes the attributes for the cleansingVariable element in the configuration file. These attributes define the data source and path names for the Data Cleanser as well as global validation rules. Below is a sample of the cleansing attributes.

cleansingVariable objectdefFilePath="../../src/Configuration" validateType="true" 
validateNull="false" validateLength="true" DBconnection="../StagingDB" 
goodFilePath="./Output/good.txt" badFilePath=./Output/bad.txt startCount="1"
standardizer="true"

Attribute	Description
objectdefFilePath	The path and filename for the `object.xml` file to use to cleanse the data.
validateType	An indicator of whether the cleanser should validate each field's data type against the type defined in `object.xml`. Specify `true` to validate field type; otherwise specify `false`. If you validate against type and the validation fails for any field in a record, the record is written to the bad file.
validateNull	An indicator of whether the cleanser should check for null values in each field that is configured to be required in `object.xml`. Specify `true` to check for null values; otherwise specify `false`. If you check for null values and any required field in a record is null, the record is written to the bad file.
validateLength	An indicator of whether the cleanser should validate each field's length against the length defined in `object.xml`. Specify `true` to validate field length; otherwise specify `false`. If you validate against length and the validation fails for any field in a record, the record is written to the bad file.
DBconnection	The path to the staging database or the path and name of the flat file containing the data to be profiled. Use forward slashes in this path rather than back slashes.
badDataFilePath	The path and name of the file that lists the records that are found to contain bad data during the cleansing process. This file includes an error message for each record describing the reason it was rejected. If you specify a path that does not exist, you need to create the path.
goodDataFilePath	The path and name of the file that lists the records that do not contain any bad data. These records can be processed through the Initial Bulk Match and Load tool into the master index database. If you specify a path that does not exist, you need to create the path.
startCounter	The starting number for the GID generator for the cleansed records. The GID is a unique value used by the Initial Bulk Match and Load tool, which takes the good data file created by the cleansing process as its input. Enter a non-negative long value. For the initial cleansing, set this to `1`.
standardizer	An indicator of whether the Data Cleanser should standardize the input data according to the standardization rules defined in the `mefa.xml` file in the master index project. Specify `true` to standardize the data. This populates the standardized values into the output file. Specify `false` to bypass standardization. If no value is specified or this property is missing, the default is true.

Data Cleanser Global Variables

You can define global variables for the fields used in the cleansing rules so you do not need to enter the qualified field name for each rule. When defining variables, the qualified field name syntax is used. For more information about qualified field names, see Master Index Field Notations in Understanding Sun Master Index Configuration Options . Variables are defined in the varList element, and each variable is defined by a var element.

The var element has these attributes:

name – The name of the variable.
default – The qualified field name to the field defined by the name attribute.

Here is an example of a variable list for a master index application storing person data.

<varList>
   <var name="fname" default="Person.FirstName"/>
   <var name="lname" default="Person.LastName"/>
   <var name="ssn" default="Person.SSN"/>
   <var name="zip" default="Person.Address.PostCode"/>
   <var name="state" default="Person.Address.State"/>
</varList>

When you reference a variable in a rule, use the format “:[variable_name]” (where variable_name is the name you assigned to the variable). For example, to reference the FirstName field as defined above, it would look similar to this in the rules:

<field fieldName=":[fname]"/>

Data Validation Rules

You can define rules to validate certain fields against specific values, a range of values, patterns, and length. You can define multiple rules, which are listed in the ruleList element. Each rule you define is contained within a rule element, and each rule element only defines one rule (though the rule can be complex and include conditional rules and operators).

The following validation rules are predefined to help you validate data during the cleansing process. These rules can be used in conjunction with transformation rules, conditional rules, and conditional operators.

dataLength

This rule validates the length of the specified field value against the length you specify. You can validate against a range “greater than or equal to” or “less than” the given length. Used alone for the Data Cleanser, this rule rejects records containing field values that fail the validation. You can also use this rule with conditional rules and operators to define more complex rules. When used in a constrained analysis for the Data Profiler, the frequency report is generated for all records for which the data length rule is true.

The syntax for dataLength is:

<dataLength fieldName="name" len="length" more="true/false"/>

The parameters for dataLength are:

fieldName – The qualified field name or assigned variable for the field to validate.
len – The length to use to validate the field value.
more – An indicator of whether to consider lengths that are greater than or equal to the length or that are less than the length. Specify “true” if the field value should be greater than or equal to the specified length. Specify “false” if the field value should be less than the specified length.

Example 4 Sample dataLength Validation Rule

This sample defines a rule to check for records containing first names that are 30 or more characters and containing last names that are shorter than two characters.

<rule>
  <dataLength fieldName="Person.FirstName" len="30" more="true"/>
  <datalength fieldName="Person.LastName" len="2" more="false"/>
</rule>

dateRange

This rule validates the value of the specified date field against the range you specify. Used alone in the Data Cleanser, this rule rejects records that contain data that fails validation. You can also use this rule with conditional rules and operators to create more complex rules. When this rule is used with the Data Profiler, the frequency report includes all records that match the date range specified. Note that dates with a 2–digit year instead of a 4–digit year will have “00” appended to the front. For example, if the invalid year “54” is found in the data, it is considered to be “0054” and is converted to that value in the Data Cleanser output file. The Data Profiler does not check for valid day and month values, and they appear in frequency reports for your analysis. Days and months are validated in the Data Cleanser.

Enter the date arguments for this rule in the format specified for the master index application (defined in object.xml). The syntax for dateRange is:

<dateRange fieldName="name" min="minimum_date" max="maximum_date"/>

The parameters for dateRange are:

fieldName – The qualified field name or assigned variable for the field to validate.
min – The earliest date to include in the valid date range.
max – The latest date to include in the valid date range.

Example 5 Sample dateRange Rule

This sample defines a rule that validates the date field against a range beginning 01/01/1900 and ending 06/01/2008. When used in the Data Cleanser, if a date of birth does not fall within that range, the record will be rejected and written to the bad data file. When used with the Data Profiler, any dates of birth that fall within the range are included in the frequency report.

<rule>
  <dateRange fieldName="Person.DOB" min="01/01/1900" max="06/01/2008"/>
</rule>

matchFromFile

This rule validates the value of the specified field against a list of values that you supply in a delimited file known as a valid value list. If a field value or part of a field value does not match an entry in the file, the record is rejected and written to the bad data file. You can also define a list of bad data in the file, known as an exclusion list, and then use this rule with conditional rules to reject records containing the bad data. If you include an empty entry in this list to indicate a null value, include the empty field in the middle of the list and not at the beginning or end. The syntax for matchFromFile is:

<matchFromFile fieldName="name" filePath="path" delimiter="delimeter" exact="true/false"/>

The parameters for matchFromFile are:

fieldName – The qualified field name or assigned variable for the field to validate.
path – The path and filename of the file containing the good or bad field values.
delimiter – The delimiter used to separate the values in the value list file.
exact – An indicator of whether the full field value must exactly match one of the entries in the valid value list in order to be considered valid. Specify “true” if the full field value must match exactly. Specify “false” if any part of the field value can match one of the entries in the valid value list.

Example 6 Sample matchFromFile Rule

This sample defines a rule that validates State fields against a list of valid state abbreviations contained in the file State.txt. The field value must exactly match an entry in the list or the record will be rejected.

<rule>
  <matchFromFile fieldName="Person.Address.State" 
   filePath="C:\\Profiling\\ValidValues\\StateCodes.txt"
   delimiter=";" exact="true"/>
</rule>

In this case, the valid value list would look similar to this:

AL;AK;AS;AZ;AR;CA;CO;CT;DE;DC;...

patternMatch

This rule validates the pattern of the specified field value against a regular expression. See the Javadoc for java.util.regex for more information about using regular expressions. Used alone in the Data Cleanser, this rule can be defined to either reject records containing field values do not match the pattern or to reject records containing fields values that do match the pattern. You can also use this rule with conditional rules and operators to define more complex rules. When this rule is used with the Data Profiler, the frequency report includes a frequency for any patterns that either match or do not match the specified pattern, depending on the value of the found parameter.

The syntax for patternMatch is:

<patternMatch fieldName="name" matchPattern="pattern" found="true/false"/>

The parameters for patternMatch are:

field_name – The qualified field name or assigned variable for the field to validate.
matchPattern – A regular expression that defines the pattern to match against.
found – A boolean indicator of whether to validate for field values that do match the pattern or that do not match the pattern. When this parameter is set to true in a cleansing rule, the Data Cleanser rejects records that do not match the pattern. For the Data Profiler, the generated report only includes field values that match the pattern. When this parameter is set to false in a cleansing rule, the Data Cleanser rejects records that match the pattern. For the Data Profiler, the generated report only includes fields that do not match the pattern.

Example 7 Sample patternMatch Rule

This sample validates the social security number field to ensure it is in the format “NNN-NN-NNNN”. Records containing a social security number in any other format are rejected or are not included in the frequency report.

<rule>
  <patternMatch fieldName="Person.SSN" matchPattern="[0-9]{3}-[0-9]{2}-[0-9]{4}" 
   found="true"/>
</rule>

range

This rule validates the value of the specified field against a numeric range. The range is specified in integers. When used alone with the Data Cleanser, this rule rejects records containing field values that fall outside of the given range. You can also use this rule with conditional rules and operators to define more complex rules. When this rule is used with the Data Profiler, the frequency report includes a frequency of field values that fall within the range.

The syntax for range is:

<range fieldName="name" min="minimum" max="maximum"/>

The parameters for range are:

fieldName – The qualified field name or assigned variable for the field to validate.
min – An integer indicating the minimum value in the numeric range.
max – An integer indicating the maximum value in the numeric range.

Example 8 Sample range Rule

This sample validates United States postal codes to be sure they fall within a valid range. Records containing postal codes less than 06000 or greater than 99950 will be rejected or will not be included in a frequency report. Note that the PostalCode field needs to be a numeric data type in order for this to be a valid rule.

<rule>
  <range fieldName="Person.Age" min="06000" max="99950"/>
</rule>

reject

This rule rejects the specified field as bad data and writes the record to the bad data file. This rule is designed to be used in conditional statements as one action to take if a field value fails its validation. The syntax for reject is:

<reject fieldName="name"/>

The parameter for reject is:

fieldName – The qualified field name or assigned variable for the field to reject.

Example 9 Sample reject Rule

This sample checks whether the SSN field is null. If the SSN field is null, the record is rejected and written to the bad data file. If the field is not null, the record is returned as “good” data.

<rule>
  <if>
    <condition>
      <isnull fieldName="Person.SSN"/>
    </condition>
    <then>
      <reject fieldName="Person.SSN"/>
    </then>
    <else>
      <return fieldName="Person.SSN"/>
    </else>
  </if>
</rule>

return

This rule returns the specified field as good data . This rule is designed to be used in conditional statements as the action to take if a field value passes its validation. The syntax for return is:

<return fieldName="name"/>

The parameter for return is:

fieldName – The qualified field name or assigned variable for the field to return as good data.

Example 10 Sample return Rule

For a sample of the return rule, see the sample for reject.

validateDBField

This rule validates the specified fields against the length defined for those fields in object.xml. You can specify whether to reject records that exceed the defined length or to truncate the field value to the defined length. The syntax for validateDBField is:

<validateDBField> 
  <field fieldName="name" action="reject/truncate"/>
  <field fieldName="name" action="reject/truncate"/>
  ...
</validateDBField>

The parameters for validateDBField are a list of fields and the action to take for each field. The field elements take the following parameters:

fieldName – The qualified field name or assigned variable for the field to validate.
action – The action to take if the field value exceeds the allowed length. Specify “reject” to reject the record as bad data, or specify “truncate” to truncate the field value to the length defined in object.xml.

Example 11 Sample validateDBField Rule

The following sample checks the length of the social security number and last name in each record. If the social security number is too long, the record is rejected and written to the bad data file. If the last name is too long, it is truncated.

<rule>
  <validateDBField>
    <field fieldName="Person.SSN" action="reject"/>
    <field fieldName="Person.LastName" action="truncate"/>
  </validateDBFields>
</rule>

Data Transformation Rules

You can define rules that modify the values or the patterns of certain fields. Use these rules alone to modify a field across all records, or use them within conditional rules to modify the values or patterns only in fields that meet the criteria you define. You can define multiple rules, which are listed in a ruleList element. Each rule you define is contained within a rule element, and each rule element only defines one rule (though one rule can be complex and include conditional rules and operators).

The following transformation rules are predefined to help you cleanse data. These rules can be used in conjunction with conditional rules and operators, as well as with validation rules.

assign

This rule assigns a new value to the specified field. The syntax for assign is:

<assign fieldName="name" value"new_value"/>

The parameters for assign are:

fieldName – The qualified field name or assigned variable for the field to assign the value to.
value – The new value to replace the existing value in the field.

Example 12 Sample assign Rule

The following rule checks whether the value of the gender field equals 2, FEM, or FML. If any of those values are found, they are changed to “F”. This standardizes the values that indicate a gender of female to the correct processing code for female. If the three listed values are not found in the gender field, the record is returned as good data.

<rule>
  <if>
    <condition>
      <or>
        <equals value1="Person.Gender" value2="2" exact="true"/>
        <equals value1="Person.Gender" value2="FEM" exact="true"/>
        <equals value1="Person.Gender" value2="FML" exact="true"/>
      </or>
    </condition>
    <then>
      <assign fieldName="Person.Gender" value="F"/>
    </then>
    <else>
      <return fieldName="Person.Gender"/>
    </else>
  </if>
</rule>

patternReplace

This rule checks the value of the specified field against a pattern. If the patterns match, this rule replaces the existing pattern with a new pattern. Use regular expressions to define the patterns (see the Javadoc for java.util.regex for more information). The syntax for patternReplace is:

<patternReplace fieldName="name" matchPattern"old_pattern" replace="new_pattern"/>

The parameters for patternReplace are:

fieldName – The qualified field name or assigned variable for the field to check.
matchPattern – The pattern to check for in the specified field.
replace – The pattern that will replace the existing pattern in the specified field.

Example 13 Sample patternReplace Rule

The following sample searches the SSN field for 9–digit values without hyphens. If it finds such values, it inserts hyphens at the appropriate places.

<rule>
  <patternReplace fieldName="Person.SSN" matchPattern="[0-9]{9}"
   replace="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
</rule>

replace

This rule checks the value of the specified field against a given string. If the field value or part of the field value matches the string you specify, this rule replaces the existing string with a new string. The syntax for replace is:

<replace fieldName="name" matchPattern="old_string" replace="new_string"/>

The parameters for replace are:

fieldName – The qualified field name or assigned variable for the field to check.
matchPattern – The string to check for in the specified field.
replace – The string that will replace the existing string in the specified field.

Example 14 Sample replace Rule

The following sample looks for the values 1, MALE, or MAL in the gender field. If those values are found, it replaces them with “M” in order to standardize them to the processing code for male. Note that the replacement for MALE is defined before the replacement for MAL. Since this rule looks for and replaces full field values or partial field values, performing the process on MAL first would replace the “MAL” in MALE with “M”, resulting in “ME” for the new field value.

<rule>
  <replace fieldName="Person.Gender" matchPattern="1" replace="M"/>
  <replace fieldName="Person.Gender" matchPattern="MALE" replace="M"/>
  <replace fieldName="Person.Gender" matchPattern="MAL" replace="M"/>
</rule>

truncate

This rule checks the value of the specified field against its length as defined in object.xml. If the field value exceeds the length, the rule truncates the field to its actual length. The syntax for truncate is:

<truncate fieldName="name"/>

The parameter for truncate is:

fieldName – The qualified field name or assigned variable for the field to check and to truncate if it is too long.

Example 15 Sample truncate Rule

The following sample checks the lengths defined for the LastName, MaidenName, and MotherMN fields in object.xml. If any of these field values are found to be longer than the defined length, the value is truncated to the exact length defined in object.xml.

<rule>
  <truncate fieldName="Person.LastName"/>
  <truncate fieldName="Person.MaidenName"/>
  <truncate fieldName="Person.MotherMN"/>
</rule>

Conditional Data Rules

You can define conditional rules to use in conjunction with the validation and transformation rules described in Data Validation Rules and Data Transformation Rules. Conditional rules return either true or false. They only define a condition and not an action, so they must be used with other types of rules.

Conditional rules use if, then, and else statements in the following format:

<rule>
  <if>
    <condition>
      ...
    </condition>
    <then>
      ...
    </then>
    <else>
      ...
    </else>
  </if>
</rule>

The following conditional rules are predefined:

In addition, you can use the conditional operators described in Conditional Operators.

dataLength

This rule checks the length of the value of the specified field against the length defined in object.xml. You can check for lengths greater than or equal to the defined length or less than the defined length. This rule returns “true” if the length matches the specified length range; otherwise it returns “false”. The syntax for dataLength is:

<dataLength fieldName="name" len="length" more="true/false"/>

The parameters for dataLength are:

fieldName – The qualified field name or assigned variable for the field to check.
len – The length to use to validate the field value.
more – An indicator of whether to check for lengths greater than or equal to the given length or for lengths less than the given length. Specify “true” to check for lengths greater than or equal to the given length, or specify “false” to check for lengths less than the given length.

Example 16 Sample dataLength Conditional Rule

<rule>
  <if>
    <condition>
      <dataLength fieldName="Person.Address.PostalCode" len="6" more="true"/>
    </condition>
    <then>
      <reject fieldName="Person.Address.PostalCode"/>
    </then>
    <else>
      <return fieldName="Person.Address.PostalCode"/>
    </else>
  </if>
</rule>

equals

This rule checks whether a specific value is equal to or found within the value of the specified field. This rule returns true if the conditions are matched; otherwise it returns false. The syntax for equals is:

<equals fieldName="name" value2="value" exact="true/false"/>

The parameters for equals are:

fieldName – The qualified field name or assigned variable for the field to check.
value2 – The value to compare with the value of the specified field.
exact – An indicator of whether to match exactly against the specified value or to find the specified value within the field value. Specify “true” to match exactly against the specified value, or specify “false” to check for the value within the field.

Example 17 Sample equals Rule

The following rule checks whether the value of the gender field equals 2, FEM, or FEMALE. If any of those values are found, they are changed to “F”. This standardizes the values that indicate a gender of female to the correct processing code for female. If the three listed values are not found in the gender field, the record is returned as good data.

<rule>
  <if>
    <condition>
      <or>
        <equals value1="Person.Gender" value2="FEMALE" exact="true"/>
        <equals value1="Person.Gender" value2="2" exact="true"/>
        <equals value1="Person.Gender" value2="FEM" exact="true"/>
        <equals value1="Person.Gender" value2="FML" exact="true"/>
      </or>
    </condition>
    <then>
      <assign fieldName="Person.Gender" value="F"/>
    </then>
    <else>
      <return fieldName="Person.Gender"/>
    </else>
  </if>
</rule>

isnull

This rule checks the value of the specified field and returns true if the field value is null. The syntax for isnull is:

<isnull fieldName="name"/>

The parameter for isnull is:

fieldName – The qualified field name or assigned variable for the field to check for null values.

Example 18 Sample isnull Rule

The following sample checks the first name field for null values. If any first names are null, the record is rejected and written to the bad data file. Otherwise, the field is returned as good data.

<rule>
  <if>
    <condition>
      <isnull fieldName="Person.FirstName"/>
    </condition>
    <then>
      <reject fieldName="Person.FirstName"/>
    </then>
    <else>
      <return fieldName="Person.LastName"/>
    </else>
  </if>
</rule>

matches

This rule checks the value of the specified field for a certain pattern. If the pattern is found, the rule returns true. Use regular expressions to define the pattern. For more information, see the Javadoc for java.util.regex. The syntax for matches is:

<matches fieldName="name" pattern="pattern"/>

The parameters for matches are:

fieldName – The qualified field name or assigned variable for the field to check against the given pattern.
pattern – The pattern to find in the given field.

Example 19 Sample matches Rule

<rule>
  <if>
    <condition>
      <matches fieldName="Person.Phone.Phone" pattern="([0-9]{3})[0-9]{3}-[0-9]{4}"/>
    </condition>
    <then>
      <return fieldName="Person.Phone.Phone"/>
    </then>
    <else>
      <reject fieldName="Person.Phone.Phone"/>
    </else>
  </if>
</rule>

Conditional Operators

You can use the following conditional operators with the conditional rules.Example 20 and Example 21 provide examples of conditional operator usage for multiple conditions.

if
else
then
or
and
not

Example 20 Sample for Nested Conditional Operators

The following sample checks for phone numbers of the format (ddd)ddd-dddd (where “d” is a digit). Phone fields that are in that format are returned as good data. Fields that are not in that format are rechecked to see if the field value is a string of 10 digits. Phone fields that return true for that validation are reformatted to include the parentheses and hyphen. Phone fields that return false for the final validation are rejected and the records are written to the bad data file.

<rule>
  <if>
    <condition>
      <matches fieldName="Person.Phone.Phone" pattern="([0-9]{3})[0-9]{3}-[0-9]{4}"/>
    </condition>
    <then>
      <return fieldName="Person.Phone.Phone"/>
    </then>
    <else>
      <if>
        <condition>
          <matches fieldName="Person.Phone.Phone" pattern="[0-9]{10}"/>
        </condition>
        <then>
          <patternReplace fieldname"Person.Phone.Phone" matchPattern="[0-9]{10}" 
           replace="([0-9]{3})[0-9]{3}-[0-9]{4}"/>
        </then>
        <else>
          <reject fieldName="Person.Phone.Phone"/>
        </else>
      </if>
    </else>
  </if>
</rule>

Example 21 Sample for Conditional Operations Using “and”

The following sample checks postal code fields to verify they do not exceed the United States length of five characters. If a postal code containing more than five characters is found, the record is rejected an written to the bad data file. If a postal code contains five or fewer characters, the field is then checked to see whether it contains nine characters (indicating the extension might be appended to the postal code). If it is nine characters, the value is truncated to five characters, leaving the postal code.

<rule>
  <if>
    <condition>
      <and>
        <dataLength fieldName="Person.Address.PostalCode" len="5" more="true"/>
        <datalength fieldName="Person.Address.PostalCode" len="6" more="false"/>
      </and>
    </condition>
    <then>
      <return fieldName="Person.Address.PostalCode"/>
    </then>
    <else>
      <if>
        <condition>
          <and>
            <dataLength fieldName="Person.Address.PostalCode" len="9" more="true"/>
            <datalength fieldName="Person.Address.PostalCode" len="10" more="false"/>
          </and>
        </condition>
        <then>
          <truncate fieldName="Person.Address.PostalCode"/>
        </then>
        <else>
            <reject fieldName="Person.Address.PostalCode"/>
        </else>
      </if>
    </else>
  </if>
</rule>