Analyzing and Cleansing Data for Sun Master Index

Pattern Frequency Analysis Rules

A pattern frequency analysis examines the regular expression patterns found in the values of the specified field and performs a frequency analysis on those patterns. It creates a report for each field that lists each pattern along with the number of times that pattern occurs. You can perform the analysis on a single field or on multiple fields, and you can sort the resulting report by frequency in increasing or decreasing order. Patterns are represented by regular expressions. For more information, see the Javadoc for java.util.regex.

Pattern frequency analysis rules are defined within PatternFrequencyAnalysis tags that include the elements and attributes listed in the following table.

Table 3 Pattern Frequency Analysis Rules

| Element | Attribute | Description |
| --- | --- | --- |
| topNpatterns | | If defined, the number of top frequencies to display. For example, you can restrict a report to just the top 10 frequencies of a field. If multiple fields are defined, this setting applies to the combination of fields. |
| | value | The number of top frequencies to display. |
| | showall | An indicator of whether to display more than the specified number of frequencies when multiple patterns are tied at the lowest frequency to display. Specify "true" to show all patterns that are tied at that frequency. Specify "false" to display only the number of frequencies specified by the value attribute; in that case, if there is a tie, the displayed pattern is selected randomly. |
| fields | | A list of fields to include in the frequency analysis. |
| field | | One field definition in the list of fields. |
| | fieldName | The name of the field. If you defined a variable for the field, the syntax for this attribute is fieldName=":[var_name]", where var_name is the name you gave the variable. If you did not define a variable, enter the qualified field name within double quotes. For example, fieldName="Person.FirstName". |
| sortOrder | | If defined, specifies a field on which to sort in order of frequency. |
| | fieldName | The name of the field to sort on. Use the syntax described for the fieldName attribute of the field element. |
| | increasing | An indicator of whether to sort in increasing or decreasing frequency. Specify "true" to sort in increasing order, or specify "false" to sort in decreasing order. |
| threshold | | If defined, specifies a frequency threshold above which or below which patterns are listed on the report. |
| | value | The frequency threshold. This is a cutoff value to help limit the results of the report. |
| | more | An indicator of whether the threshold is a lower or upper bound. Specify "true" to return patterns with a frequency greater than or equal to the threshold. Specify "false" to return patterns with a frequency less than the threshold. |
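
As a sketch of how the optional elements in Table 3 might be combined, the rule below limits the report to the ten most frequent patterns, sorts them by decreasing frequency, and lists only patterns that occur at least 50 times. The placement of sortOrder and threshold as direct children of PatternFrequencyAnalysis, their order, and the threshold value of 50 are assumptions made for illustration; verify them against your own profiler configuration before use.

```xml
<PatternFrequencyAnalysis>
  <!-- Show the 10 most frequent patterns; include ties at the cutoff -->
  <topNpatterns value="10" showall="true"/>
  <fields>
    <field fieldName="Person.SSN"/>
  </fields>
  <!-- Assumed placement: sort the report by decreasing frequency -->
  <sortOrder fieldName="Person.SSN" increasing="false"/>
  <!-- Assumed value: list only patterns that occur 50 or more times -->
  <threshold value="50" more="true"/>
</PatternFrequencyAnalysis>
```

Setting more to "false" with the same value would instead list only the patterns that occur fewer than 50 times, which can help surface rare and possibly malformed values.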


Example 3 Sample Pattern Frequency Analysis

The following rules perform pattern frequency analyses on the Social Security number and the date of birth. Because each field is defined in its own rule, two reports are generated, one for each field.


<PatternFrequencyAnalysis>
  <topNpatterns value='10' showall="true"/>
  <fields>
    <field fieldName="Person.SSN"/>
  </fields>
</PatternFrequencyAnalysis>

<PatternFrequencyAnalysis>
  <topNpatterns value='10' showall="false"/>
  <fields>
    <field fieldName="Person.DOB"/>
  </fields>
</PatternFrequencyAnalysis>
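
For comparison, a single rule can also name several fields. The sketch below lists both fields in one rule; according to Table 3, the top-N setting then applies to the combination of fields rather than to each field separately. The second field uses the variable syntax and assumes that a variable named dob (a hypothetical name) has been defined elsewhere in the configuration.

```xml
<PatternFrequencyAnalysis>
  <!-- With multiple fields, the top-5 limit applies to the combination of fields -->
  <topNpatterns value="5" showall="false"/>
  <fields>
    <field fieldName="Person.SSN"/>
    <!-- Hypothetical variable "dob", assumed to be defined elsewhere -->
    <field fieldName=":[dob]"/>
  </fields>
</PatternFrequencyAnalysis>
```

Keeping the fields in separate rules, as in Example 3, is the simpler choice when each field needs its own top-N limit.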