Skip Headers
Oracle® Warehouse Builder Data Modeling, ETL, and Data Quality Guide
11g Release 2 (11.2)

Part Number E10935-03
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Master Index
Master Index
Go to Feedback page
Contact Us

Go to previous page
Previous
Go to next page
Next
View PDF

21 Data Cleansing and Correction with Data Rules

This chapter describes the data cleansing features of Oracle Warehouse Builder and how to use them.

This chapter contains the following topics:

Overview of Data Rules in ETL and Automatic Data Correction

After you derive data rules from profiling results, you can automate the process of correcting source data based on profiling results. You can create the schema and mapping corrections. The schema correction creates scripts that you can use to create a corrected set of data objects with the same structure as the source objects, but with the derived data rules applied. The mapping correction creates new correction mappings to take your data from the source objects and load them into new objects.

For a given set of data objects (tables, views and so on) and a given set of data rules applied to those objects, Warehouse Builder can automatically generate the following data correction objects and logic:

To actually create your corrected data, you must then deploy the corrected schema objects, mappings and relevant data rules to the target location and either run the mappings or schedule them to run as needed. You can then implement further ETL using the cleansed schema objects as a source instead of the original dirty data.

You can also apply data rules in ETL mappings manually. When adding a data object such as a table to a mapping, you can select one or more data rules to be applied to the object. Error tables are created for the object, and you can define separate flows within your mapping for compliant and noncompliant data.

Generating Correction Mappings from Data Profiling Results

When automatically generating corrections for source tables or other objects based on data rules, the objects generated include the following.

To actually create your corrected data, you must then deploy the corrected schema objects, mappings and relevant data rules to the target database and either run the mappings or schedule them to run as needed.

Prerequisites for Creating Corrections

The prerequisites for creating corrections are:

  • You must already have a data profile where you have profiled the source data objects (tables, views and so on) to be corrected.

  • You must already have data rules to be used to identify noncompliant data for correction.

Steps to Create Correction Objects

The Data Profile Editor enables you to create mappings that will perform schema correction and data cleansing based on your data profiling results.

To create corrections:

  1. If the data profile is not already open, open it by right-clicking the data profile in the Projects Navigator and selecting Open.

  2. From the Profile menu, select Create Correction.

    The Create Correction Wizard is displayed.

  3. On the Welcome page, click Next.

  4. On the Select Target Module page, specify the target module that will contain the corrections and click Next.

    You can either create a new module or use an existing module.

    • To store the corrections in an existing target module, choose Select an existing module. The Available list displays the existing modules in which corrections can be stored. Select the module from this list.

    • To store the corrections in a new target module, select Create a new target module. The Create Module Wizard guides you through the steps of creating a new target module.

    To remove correction objects created as a result of previous corrections, select Remove previous correction objects.

  5. On the Select Objects page, select the objects for which corrections should be generated by moving them to the Selected list. Click Next.

    The Filter list enables you to filter the objects that are available for selection. The default selection is All Objects. You can display only particular types of data objects such as tables or views.

  6. On the Select Data Rules and Data Types page, select the corrections that must be generated to perform schema correction. Click Next.

    See "Selecting the Data Rules and Data Types for Corrected Schema Objects" for information about specifying data corrections.

  7. (Optional) On the Data Rules Validation page, note the validation errors, if any, and correct them before proceeding.

    If correction objects from a previous data correction action exist for the objects selected for correction, this page displays a message. Click Next to remove previously created correction objects.

  8. On the Verify and Accept Corrected Tables page, select the objects that you want to correct and click Next.

    See "Selecting the Objects to Be Corrected" for more information about how to specify how objects should be corrected.

  9. On the Choose Data Correction Actions page, specify the correction actions to be performed to cleanse source data and click Next.

    See "Choosing Data Correction and Cleansing Actions" for more details about specifying the actions that perform data correction and cleansing.

  10. On the Summary page, click Finish to create the correction objects.

The correction schema is created and added to the Projects Navigator. The correction objects and mappings are displayed under the module that you specify as the target module on the Select Target Module page of the Create Correction Wizard. The correction object uses the same name as the source object. The name of the correction mapping is the object name prefixed with M_. The correction mapping is used to cleanse source data and load it into the corrected target object.

Selecting the Data Rules and Data Types for Corrected Schema Objects

Use the Data Rules and Data Types page to select the schema corrections that should be generated for the corrected data objects. Based on the data profiling results, Warehouse Builder populates this page with data type corrections and data rules that you can apply to the data object.

Schema correction consists of correcting data type definitions and defining data rules that should be applied to the corrected objects. The objects selected for correction are displayed on the left side of the page and are organized into a tree by modules. The panel on the right contains two tabs: Data Rules and Data Types. Select an object by clicking the object name and then define how schema correction should be performed for this object using the Data Rules and Data Types tabs.

Data Rules The Data Rules tab displays the available data rules for the object selected in the object tree. Specify the data rules that should be generated for the corrected object by selecting the check box to the left of the data rule. Warehouse Builder uses these data rules to create constraints on the tables during the schema generation.

The Bindings section contains details about the table column to which the rule is bound. Click a rule name to display the bindings for that rule.

Warehouse Builder uses different methods of enforcing data rules on corrected schema objects. The method used depends on the type of data rule that you are implementing.

Table 21-1 describes the methods used for object schema correction. It also lists the data rule types for which each correction is used.

Table 21-1 Data Rules Implementation for Schema Correction

Schema Correction Method Description Data Rule Types for which Correction Method Can be Used

Create Constraints

Creates a constraint reflecting the data rule on the correction table. If a constraint cannot be created, then a validation message is displayed on the Data Rules Validation page of the Apply Data Rule Wizard.

Custom

Domain List

Domain Pattern List

Domain Range

Common Format

No Nulls

Unique Key

Change the data type

Changes the data type of the column to NUMBER or DATE according to the results of profiling. The data type is changed for data rules of type Is Number and Is Name.

Is Number

Is Date

Create a lookup table

Creates a lookup table and adds the appropriate foreign key or unique key constraints to the corrected table and the lookup table.

Functional Dependency

Name and Address Parse

Adds additional name and address attributes to the correction table. The name and address attributes correspond to a selection of the output values of the Name and Address operator. In the map that is created to cleanse data, a Name and Address operator is used to perform name and address cleansing.

Name and Address


Data Types The Data Types tab displays the columns that are selected for correction. The change could be a modification of the data type, precision, or from fixed-length to variable-length. The Documented Data Type column on this tab displays the existing column definition and the New Data Type column displays the proposed correction to the column definition.

To correct a column definition, select the check box to the left of the column name.

Selecting the Objects to Be Corrected

Use the Verify and Accept Corrected Tables page to confirm the objects that you want to correct and to provide additional details about how data correction should be performed. This page contains the objects you selected for schema correction on the Data Rules and Data Types page.

Use the following steps to specify how your data objects should be corrected.

  1. In the Verify and Accept Corrected Tables that Will be Generated section, select Create to the left of a data object to create this data object in the corrected schema.

    The Definition of the Corrected Table section displays the corrections details for the selected data object. The Columns tab displays the details of columns that will be created in the corrected data object. The Constraints tab displays details of constraints that will be created on the corrected data object. The Data Rules tab displays details of data rules that will be created on the corrected data object.

  2. On the Columns tab of the Definition of the Corrected Table section:

    • Select Create to the left of a column name to create this column in the corrected data object.

    • Deselect Create to the left of a column name to remove this column from the corrected object.

    • Edit the Data Type, Length, Precision, Seconds Precision, and Scale for a column by clicking the value and entering the new value. However, you cannot modify a column name.

  3. On the Constraints tab of the Definition of the Corrected Table section:

    • Click Add Constraint to create additional constraints.

    • Select the constraint and click Delete to remove a constraint from the corrected data object.

  4. On the Data Rules tab of the Definition of the Corrected Table section:

    • Select the check box to the left of a data rule to apply this derived data rule to the corrected data object.

      Ensure that the Bindings column contains the column to which the data rule should be applied.

    • Click Apply Rule to apply a new data rule to the corrected object. The Apply Data Rule Wizard guides you through the process of applying a data rule.

Choosing Data Correction and Cleansing Actions

When you decide to automatically generate corrected objects based on data profiling results, you must specify how inconsistent data from the source object should be cleansed before being stored in the corrected object. To do this, you specify a cleansing strategy for each data rule that is applied to the correction object.

The Choose Data Correction Actions page enables you to specify how to correct source data. This page contains two sections: Select a Corrected Table and Choose Data Correction Actions. The Select a Corrected Table section lists the objects that you selected for corrections. This section contains the following columns for each data object that you selected for correction:

  • Correct: Select this option to enable generation of correction objects for the data object listed in the Table column.

  • Table: Represents the name of the data object for which correction actions are being specified.

  • Load Option: Indicates which records should be loaded by the correction mapping. Select All Records to indicate that the generated correction mapping should load all records. Select Corrected Objects to indicate that the generated correction mapping should load only the records being corrected.

  • Audit Option: Select this option to create a data auditor for the table represented by the Table column.

  • Description: Represents a description for the correction mapping that is created.

Select a data object in the Select a Corrected Table section to display the affiliated data rules in the Choose Data Correction Actions section.

Choosing Data Correction Actions

For each data rule, you must choose a correction action that specifies how data values that violate data rules set for the data object should be handled. Use the list in the Action column to specify the correction action that you want to perform.

The correction actions that you can choose are:

  • Ignore: The data rule is ignored and, therefore, no values are rejected based on this data rule.

  • Report: The data rule is run only after the data has been loaded for reporting purposes. It is similar to the Ignore option, except that a report containing values that do not adhere to the data rules is created. This action can be used for some rule types only.

  • Cleanse: The values rejected by this data rule are moved to an error table where cleansing strategies are applied. When you select this option, you must specify a cleansing strategy as described in "Specifying the Cleansing Strategy".

Specifying the Cleansing Strategy

For each data rule, use the Cleansing Strategy list to specify how data that violates a set data rule should be cleansed. This option is enabled only if you select Cleanse in the Action column. The cleansing strategy depends on the type of data rule and the rule configuration. Error tables are used to store the records that do not conform to the data rule.

Table 21-2 describes the cleansing strategies and lists the types of data rules for which each strategy is applicable.

Table 21-2 Cleansing Strategies for Data Correction

Cleansing Strategy Description Applicable to Data Rule Types

Remove

Does not populate the target table with error records

All

Custom

Creates a function in the target table that contains a header, but no implementation details. You must add the implementation details to this function.

Domain List

Domain Pattern List

Domain Range

Common Format

No Nulls

Name and Address

Custom

Use Existing Function

Select from a list of existing functions to perform the correction

Domain List

Domain Pattern List

Domain Range

Common Format

No Nulls

Name and Address

Custom

Set to Min

Sets the attribute value of the error record to the minimum value defined in the data rule

Domain Range rules that have a minimum value defined

Set to Max

Sets the attribute value of the error record to the maximum value defined in the data rule

Domain Range rules that have a maximum value defined

Similarity

Uses a similarity algorithm based on permitted domain values to find a value that is similar to the error record. If no similar value is found, then the original value is used.

Domain List rules with character data types

Soundex

Uses a soundex algorithm based on permitted domain values to find a value that is similar to the error record. If no soundex value is found, then the original value is used.

Domain List rules with character data types

Merge

Uses the match-merge algorithm to merge duplicate records into a single row

Unique Key

Set to Mode

Uses the mode value to correct the error records if a mode value exists for the functional dependency partition that fails

Functional Dependency


Viewing the Correction Tables and Mappings

You can view the correction tables in the Table Editor to see the data rules and constraints created as part of the design of your table. You can also view the correction mappings as you can view any other ETL mapping.

To view the correction mappings:

  1. Double-click the mapping to open the object in the Mapping Editor.

  2. After the mapping is open, select View and then Auto Layout to view the entire mapping.

    Figure 21-1 displays a correction map generated by the Create Correction Wizard.

    Figure 21-1 Generated Correction Mapping

    Description of Figure 21-1 follows
    Description of "Figure 21-1 Generated Correction Mapping"

  3. Select the submapping ATTR_VALUE_1 and click Visit Child Graph on the toolbar to view the submapping.

    Figure 21-2 displays the submapping that is displayed.

    Figure 21-2 Correction Submapping

    Description of Figure 21-2 follows
    Description of "Figure 21-2 Correction Submapping"

    The submapping is the element in the mapping that performs the actual correction cleansing that you specified in the Create Correction Wizard. In the middle of this submap is the DOMAINSIMILARITY transformation that was generated as a function by the Create Correction Wizard.

Cleansing and Transforming Source Data Based on Data Profiling Results

After you generate correction objects, you must deploy and execute the correction objects to perform schema correction and data cleansing. Your data is corrected after you run the correction mappings with the data rules. The relevant data rules also remain bound to the objects in the corrected schema for optional use in data auditors.

Correcting your schema and cleansing data requires the following steps:

  1. Deploying Schema Corrections

  2. Deploying Correction Mappings

Deploying Schema Corrections

When you perform schema correction based on data profiling results, Warehouse Builder generates the schema correction actions that you specified and generates corrected data objects. The name of the corrected data object is the name of the original source object prefixed with TMP_.

When deploying schema corrections, deploy all corrected data objects, along with any data rules that were defined for the corrected objects as part of the data correction process.

Deploying Correction Mappings

When you generate correction mappings to cleanse source data based on data profiling results, Warehouse Builder creates the correction mappings in the workspace. The name of the correction mapping for a particular data object is the name of the data object prefixed with M_. For example, the correction mapping generated to cleanse the DEPT table is called M_DEPT.

To deploy the correction mappings created as part of the data correction process:

  1. Grant the SELECT privilege on the source tables to PUBLIC.

    For example, your correction mapping contains the table EMPLOYEES from the HR schema. You can successfully deploy this correction mapping only if the SELECT privilege is granted to PUBLIC on the HR.EMPLOYEES table.

  2. Deploy the correction tables created as a result of data profiling.

    You can right-click the table in the Projects Navigator and select Deploy. Or you can use the Control Center to deploy data objects.

  3. Deploy the correction mappings generated to cleanse source data.

  4. To cleanse source data and load it into the corrected tables, execute the correction mapping as you would any other ETL mapping.

    To execute the mapping, right-click the mapping in the Projects Navigator and select Start.

You can also schedule this mapping to run like any other mapping or include it in process flows.