Analyzing and Cleansing Data for Sun Master Index

Performing Frequency Analyses on Cleansed Data

After the data is cleansed (see Cleansing the Legacy Data), you can perform additional analyses against the data to help you determine how to configure query blocks and matching rules. Typically, you analyze the fields that are included in the block definitions of the blocking query used for matching. You can also analyze the match fields themselves. The frequencies of these fields indicate how reliable each field might be in the matching process. They also indicate whether the blocking definitions are too broad or too narrow to retrieve a reliable group of records for matching.

For this process, you might want to run frequency analyses against groups of records to find the frequencies of the unique values of the fields in the blocking definitions. Because the data has been cleansed, you can also run frequency analyses on standardized and phonetically encoded fields. The input for this process is the good data file, to which the Data Cleanser wrote all the corrected and validated records.
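To illustrate what such a frequency analysis measures, the following is a minimal standalone sketch in Python. It is not the profiling tool supplied with the product. The pipe-delimited layout, the column positions, and the sample records are all hypothetical and stand in for a good data file. The sketch counts how many records share each combination of values in the fields of one blocking definition (here, two phonetic codes).

```python
import csv
import io
from collections import Counter


def field_frequencies(rows, field_indexes):
    """Count how many records share each combination of values
    in the given columns (for example, the fields of one block definition)."""
    counts = Counter()
    for row in rows:
        key = tuple(row[i].strip().upper() for i in field_indexes)
        counts[key] += 1
    return counts


# Hypothetical sample standing in for the good data file.
# Assumed columns: id | first name | last name | first-name phonetic code |
# last-name phonetic code | date of birth. The real file layout may differ.
SAMPLE = """\
1001|JOHN|SMITH|J500|S530|1970-01-01
1002|JON|SMITH|J500|S530|1970-01-01
1003|MARY|JONES|M600|J520|1982-05-17
1004|JOHN|SMYTHE|J500|S530|1965-11-30
"""

rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="|"))

# Frequencies for an assumed block definition on the two phonetic codes
# (columns 3 and 4 in this hypothetical layout).
block_counts = field_frequencies(rows, [3, 4])

for key, count in block_counts.most_common():
    share = count / len(rows)
    print(key, count, f"{share:.0%}")
```

A value combination that accounts for a large share of the records suggests that the block definition is too broad. A file in which nearly every combination occurs only once suggests that the definition might be too narrow to bring likely matches together.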


Note –

The blocking query is defined in the master index project in query.xml, and the match fields are defined in mefa.xml.


To perform frequency analyses on the cleansed data, repeat these procedures: