Analyzing and Cleansing Data for a Master Index (Java CAPS Documentation)
When you run the Data Profiler, as described in Performing the Initial Data Analysis, it creates one report for each rule you defined and stores the reports in the location you specified in the configuration file. Review each report to determine which fields need to be cleansed and how the data needs to be validated or modified. When you profile before cleansing, look for invalid values or patterns, default values, missing values, and similar anomalies. These findings become the rules you write for the Data Cleanser.
Reports are written in CSV format so you can import them into a spreadsheet or reporting tool. Data patterns listed in the pattern frequency reports appear as regular expressions. See the Javadoc for java.util.regex for more information. The Data Profiler names each report based on the type of frequency analysis performed, the order in which the rules appear in the configuration file, and the records on which the analysis was performed. The naming syntax for the report file names is:
ID_FreqType_Order_Records.csv
where:
ID identifies the type of frequency analysis performed. SF indicates simple frequency, CF indicates constrained frequency, and PF indicates pattern frequency.
FreqType is a description of the type of frequency analysis performed; for example, PROFILE_PATTERN_FRQ.
Order is the order in which the rule that generated the report appears in the configuration file. The profiler numbers each type of frequency report. For example, you might have constrained frequency reports 1, 2, and 3, and also have pattern frequency reports 1 and 2.
Records is the range of records on which analysis was performed. If you did not specify a profile size, the ending number in the range is “0” (zero).
For example, SF_PROFILE_SIMPLE_FRQ_3_1-100000.csv is the third simple frequency analysis report defined in the configuration file and was performed against the first 100,000 records. CF_PROFILE_CONSTRAINED_FRQ_1_1-0.csv is the first constrained frequency analysis report defined in the configuration file and was performed against all records.
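If you post-process a directory of reports with a script, the naming syntax above can be split back into its parts. The following Python sketch is not part of the Data Profiler; the function name parse_report_name and the ReportName type are hypothetical, and the pattern assumes that FreqType contains only uppercase letters and underscores, as in the two example file names.

```python
import re
from typing import NamedTuple, Optional

# Two-letter prefixes as described in the naming syntax.
FREQUENCY_TYPES = {
    "SF": "simple frequency",
    "CF": "constrained frequency",
    "PF": "pattern frequency",
}

# ID_FreqType_Order_Records.csv, where Records is a "start-end" range.
# Assumption: FreqType is made of uppercase letters and underscores only.
REPORT_NAME = re.compile(
    r"^(?P<id>SF|CF|PF)_(?P<freq_type>[A-Z_]+)_(?P<order>\d+)"
    r"_(?P<start>\d+)-(?P<end>\d+)\.csv$"
)


class ReportName(NamedTuple):
    analysis: str                # human-readable frequency analysis type
    freq_type: str               # for example, PROFILE_PATTERN_FRQ
    order: int                   # position of the rule among rules of its type
    first_record: int
    last_record: Optional[int]   # None when the range ends in 0 (no profile size)


def parse_report_name(file_name: str) -> ReportName:
    """Split a report file name into the parts of the naming syntax."""
    match = REPORT_NAME.match(file_name)
    if match is None:
        raise ValueError(f"not a recognized report file name: {file_name}")
    end = int(match["end"])
    return ReportName(
        analysis=FREQUENCY_TYPES[match["id"]],
        freq_type=match["freq_type"],
        order=int(match["order"]),
        first_record=int(match["start"]),
        # An ending number of 0 means no profile size was specified.
        last_record=end if end != 0 else None,
    )
```

For instance, parse_report_name("CF_PROFILE_CONSTRAINED_FRQ_1_1-0.csv") yields a constrained frequency report with order 1 and last_record set to None, which this sketch uses to mean that all records were analyzed.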
For examples of different types of frequency analysis reports, see Data Profiler Report Samples. When you finish analyzing the reports, continue to Configuring the Data Cleansing Rules.