Analyzing and Cleansing Data for Sun Master Index

About the Data Profiler

The Data Profiler analyzes the frequency of data values and patterns in your existing data based on predefined rules and rules you define. Rules are defined using a Rules Definition Language (RDL) in XML format. The RDL provides a flexible and extensible framework that makes defining rules an easy and straightforward process. Use the Data Profiler to perform an initial analysis of existing data to determine which fields contain invalid or default values, invalid formats, incorrect dates, and so on. This analysis spotlights values that need to be validated or modified during the cleansing process. For example, if you find you have several dates in the incorrect format, you can reformat the dates during cleansing. You might also find cases where the postal code extension is appended to the postal code, making the field value nine characters. You can truncate those fields to five characters during the cleansing phase, leaving the actual zip code.

The final data analysis is performed after the data is cleansed using the Data Cleanser. Use this analysis to verify the blocking definitions for the blocking query used in the master index match process. This analysis indicates whether the data blocks defined for the query are too wide or narrow in scope, which would result in an unreliable set of records being returned for a match search. This analysis can also show how reliably the field indicates a match between two records, which indicates how much relative weight should be given to each field in the match string.

The Data Profiler performs three types of analysis and outputs a report for each field being profiled. A simple frequency analysis provides a count of each value in the specified fields. You can specify a sort order and minimum frequency to display in the reports. A constrained frequency analysis provides a count of each value in the specified fields based on the validation rules you define. You can validate against a range of values, field length, or pattern to define the rules for the validation. A pattern frequency analysis provides a count of the patterns found in the specified fields. You can specify a sort order and the number of frequencies to display on the report.