Skip Navigation Links | |
Exit Print View | |
Analyzing and Cleansing Data for a Master Index Java CAPS Documentation |
Analyzing and Cleansing Data for a Master Index
Data Cleansing and Analysis Overview
Data Cleansing and Profiling Process Overview
Required Format for Flat Data Files
Generating the Data Profiler and Data Cleanser
To Generate the Data Profiler and Data Cleanser
Defining the Data Analysis Rules
Performing the Initial Data Analysis
To Perform the Initial Data Analysis
Reviewing the Data Profiler Reports
Configuring the Data Cleansing Rules
To Configure the Data Cleansing Rules
Performing Frequency Analyses on Cleansed Data
Adjusting the Master Index Configuration
Data Profiler Processing Attributes
Data Profiler Global Variables
Simple Frequency Analysis Rules
Constrained Frequency Analysis Rules
Pattern Frequency Analysis Rules
Data Cleanser Processing Attributes
Data Cleanser Global Variables
Simple Frequency Analysis Report Samples
Once you extract the data from your source systems (described in Extracting the Legacy Data), you should determine what you want to achieve from the initial pass-through with the Data Profiler before you run the Data Cleanser. You do not need to profile the data prior to cleansing, however running a profile first can help you determine which fields need to be validated or transformed by the cleanser and how the those fields need to be processed. The Data Profiler identifies common values and patterns for these fields and gives you information about how to configure the Data Cleanser.
Here are some examples of the types of fields you might want to analyze prior to cleansing. After reviewing your data processing requirements, you will likely come up with additional types of analysis to perform.
Fields that are likely to contain default values. Default values can include invalid values such as “999–99–9999” for Social Security Numbers or “John Doe” for first and last names.
Fields that must be presented in a specific format. Required formats can include hyphenated social security numbers or phone numbers with parentheses and a hyphen (for example, (780)555–1515).
Fields whose values are restricted to a valid value list. This can include fields such as gender, where there is generally one abbreviation for Female, one abbreviation for Male, one abbreviation for Unknown, and so on. Analyzing these fields helps identify incorrect abbreviations that cannot be read correctly by the master index.
Date fields, especially dates of birth. In a master person index, the date of birth is generally used for blocking and matching, so you should verify whether you have any that are obviously incorrect (such as birth dates prior to 1900 or later than 2008).
Once you determine the fields to profile, continue to Defining the Data Analysis Rules