Analyzing and Cleansing Data for Sun Master Index

Data Validation Rules

You can define rules to validate certain fields against specific values, a range of values, patterns, and length. You can define multiple rules, which are listed in the ruleList element. Each rule you define is contained within a rule element, and each rule element only defines one rule (though the rule can be complex and include conditional rules and operators).

The following validation rules are predefined to help you validate data during the cleansing process. These rules can be used in conjunction with transformation rules, conditional rules, and conditional operators.

dataLength

This rule validates the length of the specified field value against the length you specify. You can validate against a range “greater than or equal to” or “less than” the given length. Used alone for the Data Cleanser, this rule rejects records containing field values that fail the validation. You can also use this rule with conditional rules and operators to define more complex rules. When used in a constrained analysis for the Data Profiler, the frequency report is generated for all records for which the data length rule is true.

The syntax for dataLength is:


<dataLength fieldName="name" len="length" more="true/false"/>

The parameters for dataLength are:


Example 4 Sample dataLength Validation Rule

This sample defines a rule to check for records containing first names that are 30 or more characters and containing last names that are shorter than two characters.


<rule>
  <dataLength fieldName="Person.FirstName" len="30" more="true"/>
  <datalength fieldName="Person.LastName" len="2" more="false"/>
</rule>

dateRange

This rule validates the value of the specified date field against the range you specify. Used alone in the Data Cleanser, this rule rejects records that contain data that fails validation. You can also use this rule with conditional rules and operators to create more complex rules. When this rule is used with the Data Profiler, the frequency report includes all records that match the date range specified. Note that dates with a 2–digit year instead of a 4–digit year will have “00” appended to the front. For example, if the invalid year “54” is found in the data, it is considered to be “0054” and is converted to that value in the Data Cleanser output file. The Data Profiler does not check for valid day and month values, and they appear in frequency reports for your analysis. Days and months are validated in the Data Cleanser.

Enter the date arguments for this rule in the format specified for the master index application (defined in object.xml). The syntax for dateRange is:


<dateRange fieldName="name" min="minimum_date" max="maximum_date"/>

The parameters for dateRange are:


Example 5 Sample dateRange Rule

This sample defines a rule that validates the date field against a range beginning 01/01/1900 and ending 06/01/2008. When used in the Data Cleanser, if a date of birth does not fall within that range, the record will be rejected and written to the bad data file. When used with the Data Profiler, any dates of birth that fall within the range are included in the frequency report.


<rule>
  <dateRange fieldName="Person.DOB" min="01/01/1900" max="06/01/2008"/>
</rule>

matchFromFile

This rule validates the value of the specified field against a list of values that you supply in a delimited file known as a valid value list. If a field value or part of a field value does not match an entry in the file, the record is rejected and written to the bad data file. You can also define a list of bad data in the file, known as an exclusion list, and then use this rule with conditional rules to reject records containing the bad data. If you include an empty entry in this list to indicate a null value, include the empty field in the middle of the list and not at the beginning or end. The syntax for matchFromFile is:


<matchFromFile fieldName="name" filePath="path" delimiter="delimeter" exact="true/false"/>

The parameters for matchFromFile are:


Example 6 Sample matchFromFile Rule

This sample defines a rule that validates State fields against a list of valid state abbreviations contained in the file State.txt. The field value must exactly match an entry in the list or the record will be rejected.


<rule>
  <matchFromFile fieldName="Person.Address.State" 
   filePath="C:\\Profiling\\ValidValues\\StateCodes.txt"
   delimiter=";" exact="true"/>
</rule>

In this case, the valid value list would look similar to this:


AL;AK;AS;AZ;AR;CA;CO;CT;DE;DC;...

patternMatch

This rule validates the pattern of the specified field value against a regular expression. See the Javadoc for java.util.regex for more information about using regular expressions. Used alone in the Data Cleanser, this rule can be defined to either reject records containing field values do not match the pattern or to reject records containing fields values that do match the pattern. You can also use this rule with conditional rules and operators to define more complex rules. When this rule is used with the Data Profiler, the frequency report includes a frequency for any patterns that either match or do not match the specified pattern, depending on the value of the found parameter.

The syntax for patternMatch is:


<patternMatch fieldName="name" matchPattern="pattern" found="true/false"/>

The parameters for patternMatch are:


Example 7 Sample patternMatch Rule

This sample validates the social security number field to ensure it is in the format “NNN-NN-NNNN”. Records containing a social security number in any other format are rejected or are not included in the frequency report.


<rule>
  <patternMatch fieldName="Person.SSN" matchPattern="[0-9]{3}-[0-9]{2}-[0-9]{4}" 
   found="true"/>
</rule>

range

This rule validates the value of the specified field against a numeric range. The range is specified in integers. When used alone with the Data Cleanser, this rule rejects records containing field values that fall outside of the given range. You can also use this rule with conditional rules and operators to define more complex rules. When this rule is used with the Data Profiler, the frequency report includes a frequency of field values that fall within the range.

The syntax for range is:


<range fieldName="name" min="minimum" max="maximum"/>

The parameters for range are:


Example 8 Sample range Rule

This sample validates United States postal codes to be sure they fall within a valid range. Records containing postal codes less than 06000 or greater than 99950 will be rejected or will not be included in a frequency report. Note that the PostalCode field needs to be a numeric data type in order for this to be a valid rule.


<rule>
  <range fieldName="Person.Age" min="06000" max="99950"/>
</rule>

reject

This rule rejects the specified field as bad data and writes the record to the bad data file. This rule is designed to be used in conditional statements as one action to take if a field value fails its validation. The syntax for reject is:


<reject fieldName="name"/>

The parameter for reject is:


Example 9 Sample reject Rule

This sample checks whether the SSN field is null. If the SSN field is null, the record is rejected and written to the bad data file. If the field is not null, the record is returned as “good” data.


<rule>
  <if>
    <condition>
      <isnull fieldName="Person.SSN"/>
    </condition>
    <then>
      <reject fieldName="Person.SSN"/>
    </then>
    <else>
      <return fieldName="Person.SSN"/>
    </else>
  </if>
</rule>

return

This rule returns the specified field as good data . This rule is designed to be used in conditional statements as the action to take if a field value passes its validation. The syntax for return is:


<return fieldName="name"/>

The parameter for return is:


Example 10 Sample return Rule

For a sample of the return rule, see the sample for reject.


validateDBField

This rule validates the specified fields against the length defined for those fields in object.xml. You can specify whether to reject records that exceed the defined length or to truncate the field value to the defined length. The syntax for validateDBField is:


<validateDBField> 
  <field fieldName="name" action="reject/truncate"/>
  <field fieldName="name" action="reject/truncate"/>
  ...
</validateDBField>

The parameters for validateDBField are a list of fields and the action to take for each field. The field elements take the following parameters:


Example 11 Sample validateDBField Rule

The following sample checks the length of the social security number and last name in each record. If the social security number is too long, the record is rejected and written to the bad data file. If the last name is too long, it is truncated.


<rule>
  <validateDBField>
    <field fieldName="Person.SSN" action="reject"/>
    <field fieldName="Person.LastName" action="truncate"/>
  </validateDBFields>
</rule>