Analyzing and Cleansing Data for Sun Master Index

Data Profiler Report Samples

The following topics provide sample frequency configurations along with excerpts of the reports they produce.

Simple Frequency Analysis Report Samples

Simple frequency analysis reports list the frequencies of various data values found in the specified fields without using any data verification, transformation, or conditional rules. You can specify a sort order for this type of report, and you can specify a frequency threshold. The sample frequency rule defined below analyzes first and last names, reporting the top 6 frequencies and only if the frequency is three or more.


<SimpleFrequencyAnalysis>
  <fields>
    <field fieldName="Person.FirstName"/>
    <field fieldName="Person.LastName"/>
  </fields>
  <sortOrder fieldName="Person.FirstName" increasing="false"/>
  <threshold value="3" more="true"/>
  <topNpatterns value="6" showall="false"/> 
</SimpleFrequencyAnalysis>

This analysis generates a report similar to the following:

SF_PROFILE_SIMPLE_FRQ_1_1–0.csv

PERSON.LAWSTNAME 

PERSON.FIRSTNAME 

FREQUENCY 

SMITH 

ANN 

38 

JONES 

SUSAN 

31 

SMITH 

JOHN 

31 

THOMPSON 

JAMES 

28 

JOHNSON 

BETH 

26 

MILLER 

FRANK 

25 

The sample frequency rule defined below analyzes social security numbers and analyzes whether there are duplicates (two or more occurrences).


<SimpleFrequencyAnalysis>
  <fields>
    <field fieldName="Person.SSN"/>
  </fields>
  <sortOrder fieldName="Person.SSN" increasing="false"/>
  <threshold value="2" more="true"/>
</SimpleFrequencyAnalysis>

This analysis generates a report similar to the following excerpt:

SF_PROFILE_SIMPLE_FRQ_2_1–0.csv

PERSON.SSN 

FREQUENCY 

999999999 

457 

000000000 

125 

123456789 

41 

222423535 

992203847 

Constrained Frequency Analysis Report Samples

Constrained frequency analysis reports list the frequencies of various data values found in the specified fields based on defined rules. For example, you can define rules that will only include certain patterns or that will exclude certain values. This topic includes two constrained analysis definitions along with corresponding sample reports.


<ConstrainedFrequencyAnalysis>
  <fields>
    <field fieldName="Person.SSN"/>
  </fields>
  <ruleList>
    <rule>
      <dataLength fieldName="Person.SSN" len="10" more="false"/>
    </rule>
  </ruleList>
</ConstrainedFrequencyAnalysis>

The above analysis generates a report for social security numbers with less than 10 characters (which means the hyphens are likely missing). Below is a sample output.

CF_PROFILE_CONSTRAINED_FRQ_1_1–100000.csv

PERSON.SSN 

FREQUENCY 

300555444 

299557777 

822331111 

999999999 

98 

000000000 

115 

The following analysis generates a report for dates of birth that are prior to 01/01/1899 (which means they likely contain typographical errors). Below is a sample output.


<ConstrainedFrequencyAnalysis>
  <fields>
    <field fieldName="Person.DOB"/>
  </fields>
  <ruleList>
    <rule>
      <dataRange fieldName="Person.DOB" min="01/01/0001" max="01/01/1899"/>
    </rule>
  </ruleList>
</ConstrainedFrequencyAnalysis>

CF_PROFILE_CONSTRAINED_FRQ_2_1–100000.csv

PERSON.DOB 

FREQUENCY 

07/08/53 

10/04/51 

09/16/1682 

12/28/1680 

05/09/1898 

Pattern Frequency Analysis Report Samples

Pattern frequency analysis reports list the frequencies of various data patterns found in the values of the specified fields. Patterns are expressed as regular expressions. This topic includes sample reports based on the pattern frequencies defined below for social security number and date of birth patterns.


<PatternFrqueencyAnalysis>
  <topNpatterns ="5" showall="true"/>
  <fields>
    <field fieldName="Person.SSN"/>
  </fields>
</PatternFrequencyAnalysis>

<PatternFrequencyAnalysis>
  <topNpatterns ="5" increasing="true"/>
  <fields>
    <field fieldName="Person.DOB"/>
  </fields>
</PatternFrequencyAnalysis>

The above rules generate two reports, one for social security number patterns and one for date of birth patterns. The reports only lists the top 5 patterns. Below are sample outputs for each. You can easily determine invalid values based on the patterns listed.

PF_PROFILE_PATTERN_FRQ_1_1–10000.csv

PERSON.SSN 

FREQUENCY 

NNN/NN/NNNN 

11 

 

51 

NNN-NN-NNN 

64 

NNNNNNNNN 

92 

NNN-NN-NNNN 

7614 

PF_PROFILE_PATTERN_FRQ_2_1–10000.csv

PERSON.DOB 

FREQUENCY 

NNNN/NN/NN 

14 

 

22 

NNNNNNNN 

62 

NN/NN/NN 

84 

NN/NN/NNNN 

9766