Analyzing and Cleansing Data for Sun Master Index

Constrained Frequency Analysis Rules

A constrained frequency analysis compares the values of the fields you specify based on validation rules you define. It creates a report for each rule that lists each value for the fields along with the number of times each value or combination of values occurs. You can perform the analysis on a single field or multiple fields, and you can sort the resulting report by frequency in increasing or decreasing order. Constrained frequency analysis rules are defined within ConstrainedFrequencyAnalysis tags that include the elements and attributes listed in the following table.

Table 2 Constrained Frequency Analysis Rules

| Element | Attribute | Description |
|---|---|---|
| fields | | A list of fields to include in the frequency analysis. |
| field | | One field definition in the list of fields. |
| | fieldName | The name of the field. If you defined a variable for the field, the syntax for this attribute is fieldName=":[var_name]", where var_name is the name you gave the variable. If you did not define a variable, enter the qualified field name within double quotes. For example, fieldName="Person.FirstName". |
| sortOrder | | If defined, specifies a field on which to sort in order of frequency. |
| | fieldName | The name of the field to sort on. Use the syntax described for fieldName above. |
| | increasing | An indicator of whether to sort in increasing or decreasing frequency. Specify "true" to sort in increasing order, or specify "false" to sort in decreasing order. |
| threshold | | If defined, specifies a frequency threshold above which or below which field values will be listed on the report. |
| | value | The frequency threshold. This is a cutoff value to help limit the results of the report. |
| | more | An indicator of whether the threshold is an upper or lower threshold. Specify "true" to return field values with a frequency greater than or equal to the threshold. Specify "false" to return field values with a frequency less than the threshold. |
| ruleList | | A list of rules to apply to the frequency analysis. |
| rule | | One rule definition to apply to the frequency analysis. You can define multiple rule definitions in a rule list. The following validation rules can be used in a constrained frequency analysis: dateRange, patternMatch, dataLength, and range. For descriptions and samples of these rules, see Data Validation Rules. |
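
The two forms of the fieldName attribute can be illustrated with a short fragment. This is only a sketch: it assumes a variable named lname (a hypothetical name) has already been defined elsewhere in the configuration file, and it lists two fields so that the report would count combinations of values rather than single values.

```xml
<fields>
  <!-- Qualified field name, used when no variable is defined for the field -->
  <field fieldName="Person.FirstName"/>
  <!-- Variable reference; "lname" is a hypothetical variable assumed to be defined elsewhere -->
  <field fieldName=":[lname]"/>
</fields>
```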


Example 2 Sample Constrained Frequency Analysis

The example below performs a frequency analysis on the date of birth, but only for those dates that fall within a range too early to be valid. This is an example of profiling you might do prior to data cleansing in order to identify invalid values. It can also expose invalid date formats, such as MM/DD/YY.


<ConstrainedFrequencyAnalysis>
  <fields>
    <field fieldName="Person.DOB"/>
  </fields>
  <ruleList>
    <rule>
      <dateRange fieldName="Person.DOB" min="01/01/0001" max="01/01/1900"/>
    </rule>
  </ruleList>
</ConstrainedFrequencyAnalysis>
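
The optional sortOrder and threshold elements from Table 2 could be added to the same rule to shorten the report. The variant below is a minimal sketch, not a verified configuration: the placement of sortOrder and threshold as direct children of ConstrainedFrequencyAnalysis is an assumption, and the threshold value of 5 is arbitrary. As written, it would list only the early dates of birth that occur five or more times, with the most frequent first.

```xml
<ConstrainedFrequencyAnalysis>
  <fields>
    <field fieldName="Person.DOB"/>
  </fields>
  <!-- Assumed placement: sortOrder and threshold as siblings of fields -->
  <!-- increasing="false" sorts in decreasing order of frequency -->
  <sortOrder fieldName="Person.DOB" increasing="false"/>
  <!-- more="true" keeps values with a frequency greater than or equal to 5 (arbitrary cutoff) -->
  <threshold value="5" more="true"/>
  <ruleList>
    <rule>
      <dateRange fieldName="Person.DOB" min="01/01/0001" max="01/01/1900"/>
    </rule>
  </ruleList>
</ConstrainedFrequencyAnalysis>
```

Setting more="false" instead would return the values that fall below the cutoff, which can help isolate rare, one-off entries.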