16 Working with Oracle Data Quality Products

This chapter describes how to work with Data Quality Products in Oracle Data Integrator.

This chapter includes the following sections:

Section 16.1, "Introduction to Oracle Data Quality Products"
Section 16.2, "The Data Quality Process"

16.1 Introduction to Oracle Data Quality Products

Oracle Data Profiling and Oracle Data Quality for Data Integrator (also referred to as Oracle Data Quality Products) extend the inline Data Quality features of Oracle Data Integrator to provide more advanced data governance capabilities.

A complete Data Quality system includes data profiling, integrity and quality:

Profiling makes possible data investigation and quality assessment. It allows business users to get a clear picture of their data quality challenges, to monitor and track the quality of their data over time. Profiling is handled by Oracle Data Profiling. It allows business users to assess the quality of their data through metrics, to discover or infer rules based on this data, and finally to monitor over time the evolution of the data quality.
Integrity control is essential in ensuring the overall consistency of the data in your information system's applications. Application data is not always valid for the constraints and declarative rules imposed by the information system. You may, for instance, find orders with no customer, or order lines with no product, and so forth. Oracle Data Integrator provides built-in working environment to detect these constraint violation and store them for recycling or reporting purposes. Static and Flow checks in Oracle Data Integrator are integrity checks.
Quality includes integrity and extends to more complex quality processing. A rule-based engine apply data quality standards as part of an integration process to cleanse, standardize, enrich, match and de-duplicate any type of data, including names and addresses. Oracle Data Quality for Data Integrator places data quality as well as name and address cleansing at the heart of the enterprise integration strategy.

16.2 The Data Quality Process

The data quality process described in this section uses Oracle Data Quality products to profile and cleanse data extracted from systems using Oracle Data Integrator. The cleansed data is also re-integrated into the original system using Oracle Data Integrator.

The Quality Process has the following steps:

Create a Quality Input File from Oracle Data Integrator, containing the data to cleanse.
Create an Entity in Oracle Data Quality, based on this file.
Create a Profiling Project to determine quality issues.
Create a Oracle Data Quality Project cleansing this Entity.
Export the Data Quality Project for run-time.
Reverse-engineer the Entities using the RKM Oracle Data Quality.
Use Oracle Data Quality Input and Output Files in Interfaces
Run this Quality Project from Oracle Data Integrator using the OdiDataQuality tool.
Sequence the Process in a Package.

16.2.1 Create a Quality Input File

Oracle Data Quality uses as a source for the Quality project a flat file which contains the data to cleanse. This Quality input file can be created from Data Integrator and loaded from any source datastore using interfaces. This file should be a FILE datastore with the following parameters defined on the Files tab:

Parameter	Value
File Format	`Delimited`
Heading (Number of Lines)	`1`
Record Separator	`MS-DOS`
Field Separator	`Other`
[Field Separator] Other	`,(comma sign - Hexadecimal 2C)`
Text Delimiter	`" (double quotation marks)`
Decimal Separator	`empty, not specified`

For more information on creating a FILE datastore, refer to the Chapter 5, "Creating and Reverse-Engineering a Model". For more information on loading flat files, see "Files" in the Oracle Fusion Middleware Connectivity and Knowledge Modules Guide for Oracle Data Integrator.

16.2.2 Create an Entity

To import a data source into Oracle Data Quality for Data Integrator means to create an entity based on a delimited source file.

16.2.2.1 Step 1: Validate Loader Connections

Your administrator must set up at least one Loader Connection when he or she installs Oracle Data Quality for Data Integrator. This Loader Connection is used to access the Oracle Data Quality input file. As the input file is a delimited file, this Loader Connection should be a Delimited Loader Connection. Step 1 requires you validate this Delimited Loader Connection set up. Also verify that all the data and schema files you need are copied to the directory defined by the Loader Connection.

If you do not have access to the Metabase Manager, ask your Metabase administrator to verify the Loader Connection for you.

If you are a Metabase User and have access to the Metabase Manager, follow this procedure:

To validate a Loader Connection

Open the Metabase Manager (Start > All Programs > Oracle > Oracle Data Profiling and Quality > Metabase Manager).
Verify you are in Admin Mode.
Expand the Control Admin node.
Double-click Loader Connections.
On the right, the Loader Connections list view displays each Loader Connection, showing its name, type, data file, and parameters. Review the information to verify that the Loader Connection created by your administrator is a Delimited Loader Connection and that the data and schema directories are pointing to the correct location.

Note:

If you are a Metabase User with full Metabase privileges, you can create a new Loader Connection.

16.2.2.2 Step 2: Create Entity and Import Data

Use the Create Entity wizard to create an Entity. The Wizard takes you through each step, helps you to select data to load, and provides an interface for specifying connection and schema settings. It also gives you options for customizing how the data appears in an Entity.

To import a delimited source file into Oracle Data Quality for Data Integrator:

Copy the flat file that you want to import into Oracle Data Quality for Data Integrator into the data directory that you specified when you defined the Loader Connection.
Click on the Windows Start menu and select All Programs > Oracle > Oracle Data Profiling and Quality > Oracle Data Profiling and Quality.
Log in the user interface with your metabase user. The Oracle Data Profiling and Quality user interface opens
From the Main menu, select Analysis >Create Entity…
The Create Entity wizard opens in the upper right pane.
On the Connection Page of the Create Entity wizard, select the Loader Connection given to you by the administrator that you have checked in Step 1.
Leave the default settings for the filter and the connection and click Next.
Oracle Data Quality connects to the data source using the Loader Connection you selected in Step 4. If the connection fails, contact your Metabase Administrator
In the Entity Selection dialog, select the data source file name you want to import in the list and click Next.
Select the schema settings for the selected data file corresponding to the parameters of the file described in the section Section 16.2.1, "Create a Quality Input File"
- Delimiter: , (comma)
- Quote: " (double quotation marks)
- Attribute information: Names on first line
- Select Records are CR/LF terminated.
- Character encoding: ascii
For more information on configuring Entities for delimited files, see the Online Help for Oracle Data Profiling and Oracle Data Quality.

Note:
If the file is generated using Oracle Data Integrator These file format parameters should correspond to the file format specified in the Files tab of the datastore definition.
After you select the schema settings, click Preview. The Preview mode shows how the data will appear in the Entity, based on your selected schema settings. The data displays below in a list view. Use the Preview mode to customize how the data will appear in the new Entity.
When you are ready to continue, click Close.
Click Next. The Load Parameters dialog opens. Specify the parameters as follows:
- Select All Rows.
- Leave the default Job name.
Click Next to continue.
In Confirm Settings, review the list of settings and click Finish to schedule the Entity creation job. The Schedule Job window opens.
Click Run Now.

16.2.2.3 Step 3: Verify Entity

During the data import process, Oracle Data Quality for Data Integrator translates your data files into three basic components (Metabase objects): Entities, Attributes, and Rows.

Perform the following list of verification tasks to ensure that the data you expected has been successfully imported to a Metabase and are correctly represented in the Metabase Explorer.

Make sure that for every data file imported you have one corresponding Entity.
Make sure that the column names do not contain any special characters with the exception of underscore (_) or minus sign (-) characters. Minus signs and underscores will be translated into spaces during the data load process.
Make sure that for every field imported you have one corresponding Attribute.
Make sure that you have one Entity Row for every data row imported.

16.2.3 Create a Profiling Project

You can now run a Data Profiling Project with Oracle Data Profiling to find quality problems. Profiling discovers and analyzes the quality of your enterprise data. It analyzes data at the most detailed levels to identify data anomalies, broken filters and data rules, misaligned data relationships, and other concerns that allow data to undermine your business objectives.

For more information on Data Profiling see "Working with Oracle Data Profiling" in the Online Help for Oracle Data Profiling and Oracle Data Quality.

16.2.4 Create a Oracle Data Quality Project

You can now create an Oracle Data Quality Project to validate and transform your data, and resolve data issues such as mismatching and redundancy.

Oracle Data Quality for Data Integrator is a powerful tool for repairing and correcting fields, values and records across multiple business contexts and applications, including data with country-specific origins. Oracle Data Quality for Data Integrator enables data processing for standardization, cleansing and enrichment, tuning capabilities for customization, and the ability to view your results in real-time.

A Quality Project cleanses input files and loads cleansed data into output files. At the end of your Oracle Data Quality project this input file may be split into several output files, depending on the data Quality project.

Important Note: A Data Quality project contains many temporary entities, some of them not useful in the integration process. To limit the Entities reversed-engineered for usage by Oracle Integrator, a filter based on entities name can be used. To use this filter efficiently, it is recommended that you rename in your quality project the entities that you want to use in Oracle Data Integrator in a consistent way. For example rename the entities ODI_IN_XXX and the output (and no-match) files ODI_OUT_XXX , where XXX is the name of the entity.

For more information on Data Quality projects see "Working with Oracle Data Quality" in the Online Help for Oracle Data Profiling and Oracle Data Quality.

16.2.5 Export the Data Quality Project

Oracle Data Integrator is able to run projects exported from Oracle Data Quality. Once the Data Quality project is complete, you need to export it for Oracle Data Integrator. The exported project contains the data files, Data Dictionary Language (DDL) files, settings files, output and statistics files, user-defined tables and scripts for each process module you in the project. An exported project can be run on UNIX or Windows platforms without the user interface, and only requires the Oracle Data Quality Server.

To create a batch script:

In the Explorer or Project Workflow, right-click the Oracle Data Quality project and select Export... > ODQ Batch Project > No data.
In Browse for Folder, select or make a folder where you want the project to be exported.
Click OK. A message window appears indicating that the files are being copied. This export process creates a folder named after the metabase (<metabase_name>) at the location that you specified. This folder contains a projectN sub-folder (where N is the project identifier in Oracle Data Quality). This project folder contains the following folders among others:
- data: This folder is used to contain input and output data as well as temporary data files. These files have a .DAT extension. As you specified No data for the export, this folder is empty.
- ddl: This folder contains the entities metadata files (.DDX and .XML). These files describe the data files' fields. They are prefixed with eNN_ , where NN is the Entity number. Each entity is described in two metadata files. eNN_<name of the entity>.ddx is the description of the entity with possible duplicated columns (suitable for fixed files). enNN_<name of the entity_csv.ddx is the description of the entity with non-duplicated columns (suitable for fixed and delimited files). It recommended to use these files for the reverse-engineering process.
- scripts: This folder contains the batch script runprojectN. This script runs the quality process and is the one that will be triggered by Oracle Data Integrator.
- settings: This folder contains settings files (.ddt, .sto, .stt, .stx) and the configuration file config_batch.tbl.
After the message window has disappeared, examine the folder you have specified and check that all folders and files are correctly created.
Move the exported project to a folder on the run-time machine. This machine must have the Oracle Data Quality Server installed at it will run the quality project.
Open with a text Editor the batch script (runprojectN) and the configuration file (config_batch.tbl) in the /batch/settings sub-folder of your projectN folder.
Perform the following changes to configure the run-time directory in the project.
- In config_batch.tbl, specify the location (absolute path) of the directory containing the projectN folder for the DATABASE parameter.
- In runprojectN, specify the location (absolute path) of the projectN directory for the TS_PROJECT parameter.
For example, if you have the config_batch.tbl and runproject2.* files located in C:\oracle\oracledq\metabase_ data\metabase\oracledq\project2\batch\, you should specify
- in \settings\config_batch.tbl: DATABASE = C:\oracle\oracledq\metabase_ data\metabase\oracledq\project2\batch
- in \scripts\runprojectN.*: set TS_PROJECT=C:\oracle\oracledq\metabase_ data\metabase\oracledq\project2\batch
Save and close the config_batch.tbl file.
In runprojectN uncomment the very last line of the file (remove the :: character at the beginning of the last line).
Save and close the runprojectN file.
Oracle Data Integrator uses CSV formatted files (typically, comma-delimited with one header line) to provide the data quality project with input data, and expects output data to be in the same format.

In the /settings directory, open with an Editor the settings file corresponding to the first process of your project. This file is typically named eN_transfmr_p1.stx (where N is the internal ID of the entity corresponding to the quality input file) if the first process is a transformer.

Change the following input parameters in the settings file:

In DATA_FILE_NAME, specify the name and location (absolute path) of your quality input file in run-time.
In FILE_DELIMITER, specify the delimiter used in the quality input file.
In START_RECORD, specify the line number where data starts. For example, if there is a 1 line header, the value should be 2.

For example, if you have the customer_master.csv quality input file (comma-separated with one header line) located in C:/oracle/oracledq/metabase_data/metabase/oracledq/Data/, you should edit the following section:

<CATEGORY><INPUT><PARAMETER><INPUT_SETTINGS>
  <ARGUMENTS>
  <ENTRY>
    <ENTRY_ID>1</ENTRY_ID>
    <FILE_QUALIFIER>Customer_Master(1)</FILE_QUALIFIER>
    <DATA_FILE_NAME>$(INPUT)/e1_customer_master.dat</DATA_FILE_NAME>
    <DDL_FILE_NAME>$(DDL)/e1_customer_master.ddx</DDL_FILE_NAME>
    <FILE_DELIMITER/>
    <USE_QUOTES_AS_QUALIFIER/>
    <START_RECORD/>

as shown below:

<CATEGORY><INPUT><PARAMETER><INPUT_SETTINGS>
   <ENTRY>
     <ENTRY_ID>1</ENTRY_ID>
     <FILE_QUALIFIER>Customer_Master(1)</FILE_QUALIFIER>
<DATA_FILE_NAME>C:\oracle\oracledq\metabase_data\metabase\oracledq\Data\customer_master.csv</DATA_FILE_NAME>
<DDL_FILE_NAME>$(DDL)/e1_customer_master.ddx</DDL_FILE_NAME>
<FILE_DELIMITER>,</FILE_DELIMITER>
<USE_QUOTES_AS_QUALIFIER/>
<START_RECORD>2</START_RECORD>

Save and close the settings file.
Also in the /settings directory, open the file that corresponds to the settings of the process generating the output (cleansed) data. Typically, for a cleansing project which finishes with a Data Reconstructor process, it is named with eNN_datarec_pXX.stx. Change the following value in the settings file to give the full path of the generated output file.
```
<CATEGORY><OUTPUT><PARAMETER>
<OUTPUT_SETTINGS>
<ARGUMENTS>
<FILE_QUALIFIER>OUTPUT</FILE_QUALIFIER>
<DATA_FILE_NAME>C:\oracle\oracledq\metabase_data\metabase\oracledq\Data\customer_master_cleansed.csv</DATA_FILE_NAME>
<DDL_FILE_NAME>$(DDL)/e36_us_datarec_p11.ddx</DDL_FILE_NAME>
```
Save and close the settings file.
If you have several data quality processes that generate useful output files (for example, one data reconstructor per country). Repeat the two previous steps for each of these processes.

16.2.6 Reverse-engineer the Entities

In order to provide the Quality process with input data and use its output data in data integrator's integration processes, it is necessary to reverse-engineer these Entities. This operation is performed using a customized reverse-engineering method based on the Oracle Data Quality RKM. The RKM reads metadata from the .ddx files located in the /ddl folder of your data quality project.

To reverse-engineer the Entities of a data Quality project:

Import the RKM Oracle Data Quality into your Oracle Data Integrator project.
Insert a physical schema for the File technology in Topology Manager. Specifying for both, the Directory (Schema) and the Directory (Work Schema), the absolute path of your data folder. For example C:\oracle\oracledq\metabase_data\metabase\oracledq\projectN\data

This directory must be accessible to the agent that will be used to run the transformations. Oracle Data Integrator will look in the schema for the source and target data structures for the interfaces. The RKM will access the output data files and reverse-engineer them.

Create a File model and reverse the /ddl folder.

In Designer Navigator expand the Models panel.
Right-click then select New Model.
Enter the following fields in the Definition tab:

Name: Name of the model used in the user interface.

Technology: File

Logical Schema: Select the Logical Schema on which your model will be based.
In the Reverse tab and select:

Parameter Value/Action

Reverse: Customized

Context: Reverse-engineering Context

Type of objects to reverse-engineer: Table

KM Select the RKM Oracle Data Quality
Set the RKM options as shown in Table 16-1:

Parameter	Value/Action
Reverse:	Customized
Context:	Reverse-engineering Context
Type of objects to reverse-engineer:	Table
KM	Select the RKM Oracle Data Quality

Table 16-1 KM Options for RKM Oracle Data Quality

Parameter	Default Value	Description
DDX_FILE_NAME	`*.ddx`	Mask for DDX Files to process. If you have used a naming convention in the Quality project for the Entities that you want to use, enter a mask that will return only these Entities. For example, specify the ODI*_csv.ddx mask if you have used the ODI_IN_XX and ODI_OUT_XX naming convention for your input and output entities.
USE_FRIENDLY_NAMES	`No`	Set this option to Yes if you want the Reverse-Engineering process to generate user-friendly names for datastore columns based on the field name specified in the DDX file.
USE_LOG	`Yes`	Set to Yes if you want the reverse-engineering process activity be logged in a log file.
LOG_FILE_NAME	`/temp/reverse.log`	Name of the log file.

Click Apply. The model is created, but contains no datastores yet.
Click Reverse. Now, the model contains datastores that you can see in the Models view.

16.2.7 Use Oracle Data Quality Input and Output Files in Interfaces

You can now create in Oracle Data Integrator interfaces sourcing or targeting the data Quality input and output files.

For example, you can:

Create interfaces to load the input file using datastores from various sources.
Create interfaces to re-integrate the output data back into the sources after cleansing.

16.2.8 Run this Quality Project from Oracle Data Integrator

The OdiDataQuality tool executes the batch file to run the Oracle Data Quality project. This tool takes as a parameter the path to the runprojectN script file. It can run either in synchronous (the tool waits for the quality process to complete) or asynchronous mode.

For more information about the OdiDataQuality tool and its parameters, see Section A.6.3, "OdiDataQuality".

16.2.9 Sequence the Process in a Package

Create a package in Oracle Data Integrator sequencing the following process:

One or more Interfaces creating the Quality input file, containing the data to cleanse.
OdiDataQuality tool step launching the Oracle Data Quality process.
One or more Interfaces loading the data from the Oracle Data Quality output files into the target datastores.