13
Transformation

This chapter helps you create and manage a data warehouse, and discusses:

Techniques for Data Transformation Inside the Database

Techniques for Data Transformation Inside the Database

Data transformations are often the most complex and, in terms of processing time, the most costly part of the ETT process. They can run the gamut from simple data conversions to extremely complex data-scrubbing techniques. Many, if not all, data transformations can occur within an Oracle8i database, although transformations are also often implemented outside of the database (for example, on flat files) as well.

This chapter discusses the techniques used to implement data transformation within Oracle8i and special considerations for choosing among these techniques, and includes:

Transformation Flow

The data-transformation logic for most data warehouses consists of multiple steps. For example, in transforming new records to be inserted into a sales table, there may be separate logical transformation steps to validate each dimension key.

A graphical way of looking at the transformation logic is presented in Figure 13-1:

Figure 13-1 Data Transformation

When using Oracle8i as a transformation engine, a common strategy is to implement each different transformation as a separate SQL operation, and to create a separate, temporary table (such as the tables new_sales_step1 and new_sales_step2 in Figure 13-1) to store the incremental results for each step. This strategy also provides a natural checkpointing scheme to the entire transformation process, which enables to the process to be more easily monitored and restarted.

It may also be possible to combine many simple logical transformations into a single SQL statement or single PL/SQL procedure. While this may provide better performance than performing each step independently, it may also introduce difficulties in modifying, adding or dropping individual transformations, and, moreover, prevents any effective checkpointing.

Transformations Provided by SQL*Loader

Before any data-transformations can occur within the database, the raw data must first be loaded into the database. Chapter 12, "Transportation", discussed several techniques for transporting data to an Oracle data warehouse. Perhaps the most common technique for transporting data is via flat files.

SQL*Loader is used to move data from flat files into an Oracle data warehouse. During this data load, SQL*Loader can also be used to implement basic data transformations. When using direct-path SQL*Loader, datatype conversion and simple NULL handling can be automatically resolved during the data load. Most data warehouses choose to use direct-path loading for performance reasons.

Oracle's conventional-path loader provides broader capabilities for data transformation than direct-path loader: SQL functions can be applied to any column as those values are being loaded. This provides a rich capability for transformations during the data load. However, the conventional-path loader is less efficient than direct-path loader and is not parallelizable. For these reasons, the conventional-path loader should be considered primarily for loading and transforming smaller amounts of data.

For more information on SQL*Loader, see Oracle8i Utilities.

Transformations Using SQL and PL/SQL

Once data is loaded into an Oracle8i database, data transformations can be executed using SQL and PL/SQL operations. There are two basic techniques for implementing data-transformations within Oracle8i.

Technique 1: The CREATE TABLE ... AS SELECT ... statement

The CREATE TABLE ... AS SELECT ... statement (CTAS) is a very powerful tool for manipulating large sets of data. As we shall see in the examples below, many data transformations can be expressed in standard SQL, and Oracle's implementation of CTAS provides a mechanism for efficiently executing a SQL query, and storing the results of that query in a database table.

In a data warehouse environment, CTAS is typically run in parallel and using NOLOGGING mode for best performance.

Technique 2: PL/SQL Procedures

In a data warehouse environment, PL/SQL can be used to implement complex transformations in the Oracle8i database.

While CTAS operates on entire tables and emphasizes parallelism, PL/SQL provides a row-based approached and can accommodate very sophisticated transformation rules. For example, a PL/SQL procedure could open multiple cursors and read data from multiple source tables, combine this data using complex business rules, and finally insert the transformed data into one or more target tables; it may be difficult or impossible to express the same sequence of operations using standard SQL commands.

The following examples demonstrate transformations using SQL and PL/SQL.

Data Substitution

A simple and common type of data transformation is a data substitution. In a data-substitution transformation, some or all of the values of a single column are modified. For example, our sales table has a sales_channel_id column. This column is used to specify whether a given sales transaction was made by our company's own sales force (a direct sale) or via a distributor (an indirect sale).

We may receive data from multiple source systems for our data warehouse. Let us suppose that one of those source systems processes only direct sales, and thus the source system does not know indirect sales channels. When the data warehouse initially receives sales data from this system, all of sales records have a null value for the sales_channel_id field. These null values must be modified, and set to the proper key value.

This can be efficiently done using a CTAS statement:

CREATE TABLE temp_sales_step2 
NOLOGGING PARALLEL AS
SELECT sales_product_id, NVL(sales_channel_id, 1) sales_channel_id,
       sales_customer_id, sales_time_id, sales_quantity_sold,
       sales_dollar_amount 
FROM   temp_sales_step1;

Another possible technique for implementing a data-substitution is to use an UPDATE statement to modify the sales_channel_id column. An UPDATE will provide the correct result. However, many data-substitution transformations require that a very large percentage of the rows (often all of the rows) be modified. In these cases, it may be more efficient to use a CTAS statement than an UPDATE.

Key Lookups

Another simple transformation is a key lookup For example, suppose that, in a retail data warehouse, sales transaction data has been loaded into the data warehouse. Although the data warehouse's SALES table contains a PRODUCT_ID column, the sales transaction data extracted from the source system contains UPC codes instead of PRODUCT_IDs. Therefore, it is necessary to transform the UPC codes into PRODUCT_IDs before the new sales transaction data can be inserted into the SALES table.

In order to execute this transformation, there must be a lookup table which relates the PRODUCT_ID values to the UPC codes. This may be the PRODUCT dimension table, or perhaps there is another table in the data warehouse that has been created specifically to support this transformation. For this example, we assume that there is a table named PRODUCT, which has a PRODUCT_ID and an UPC_CODE column.

This data-substitution transformation can be implemented using the following CTAS statement:

CREATE TABLE temp_sales_step2 
NOLOGGING PARALLEL 
AS 
SELECT 
      sales_transaction_id,
      product.product_id sales_product_id, 
      sales_customer_id,
      sales_time_id, 
      sales_channel_id, 
      sales_quantity_sold, 
      sales_dollar_amount
FROM  temp_sales_step1, product
WHERE temp_sales_step1.upc_code = product.upc_code;

This CTAS statement will convert each valid UPC code to a valid PRODUCT_ID value. If the ETT process has guaranteed that each UPC code is valid, then this statement alone may be sufficient to implement the entire transformation.

Exception Handling: Invalid Data

However, it can be important to handle new sales data that does not have valid UPC codes.

One approach is to use an additional CTAS statement to identify the invalid rows:

CREATE TABLE temp_sales_step1_invalid 
NOLOGGING PARALLEL 
AS 
SELECT * FROM temp_sales_step1
WHERE temp_sales_step1.upc_code NOT IN (SELECT upc_code FROM product);

This invalid data is now stored in a separate table, TEMP_SALES_STEP1_INVALID, and can be handled separately by the ETT process.

A second approach is to modify the original CTAS to use an outer join:

CREATE TABLE temp_sales_step2 
NOLOGGING PARALLEL 
AS 
SELECT 
      sales_transaction_id,
      product.product_id sales_product_id, 
      sales_customer_id,
      sales_time_id, 
      sales_channel_id, 
      sales_quantity_sold, 
      sales_dollar_amount
FROM  temp_sales_step1, product
WHERE temp_sales_step1.upc_code = product.upc_code (+);

Using this outer join, the sales transactions that originally contained invalidated UPC codes will be assigned a PRODUCT_ID of NULL. These PRODUCT_IDs can be handled later.

There are other possible approaches to handling invalid UPC codes. Some data warehouses may choose to insert null-valued PRODUCT_IDs into their SALES table, while other data warehouses may not allow any new data from the entire batch to be inserted into the SALES table until all invalid UPC codes have been addressed. The correct approach is determined by the business requirements of the data warehouse. Regardless of the specific requirements, exception handling can be addressed by the same basic SQL techniques as transformations.

Pivoting

A data warehouse can receive data from many different sources. Some of these source systems may not be relational databases, and may store data in very different formats from the data warehouse. For example, suppose that you receive a set of sales records from a non-relational database having the form:

product_id, store_id, week_id, sales_sun, sales_mon, sales_tue, 
sales_wed, sales_thu, sales_fri, sales_sat

In your data warehouse, you would want to store the records in a more typical relational form:

product_id, store_id, time_id, sales_amount

Thus, you need to build a transformation such that each record in the input stream must be converted into seven records for the data warehouse's SALES table. This operation is commonly referred to as pivoting.

The CTAS approach to pivoting would require a UNION-ALL query:


CREATE table temp_sales_step2
NOLOGGING PARALLEL 
AS 
SELECT product_id, time_id, sales_amount 
FROM
(SELECT product_id, store_id, TO_DATE(week_id,'WW') time_id, 
          sales_sun sales_amount FROM temp_sales_step1
 UNION ALL
 SELECT product_id, store_id, TO_DATE(week_id,'WW')+1 time_id, 
        sales_mon sales_amount FROM temp_sales_step1
 UNION ALL
 SELECT product_id, store_id, TO_DATE(week_id,'WW')+2 time_id, 
        sales_tue sales_amount FROM temp_sales_step1
 UNION ALL
 SELECT product_id, store_id, TO_DATE(week_id,'WW')+3 time_id, 
        sales_web sales_amount FROM temp_sales_step1
 UNION ALL
 SELECT product_id, store_id, TO_DATE(week_id,'WW')+4 time_id, 
        sales_thu sales_amount FROM temp_sales_step1
 UNION ALL
 SELECT product_id, store_id, TO_DATE(week_id,'WW')+5 time_id, 
        sales_fri sales_amount FROM temp_sales_step1
 UNION ALL
 SELECT product_id, store_id, TO_DATE(week_id,'WW')+6 time_id, 
        sales_sat sales_amount FROM temp_sales_step1);

Like all CTAS operations, this operation can be fully parallelized. However, the CTAS approach also requires seven separate scans of the data, one for each day of the week. Even with parallelism, the CTAS approach may be time-consuming.

An alternative implementation is PL/SQL. A very basic PL/SQL function to implement the pivoting operation would be:

DECLARE
   CURSOR c1 is
      SELECT 
       product_id, store_id, week_id, sales_sun, sales_mon, sales_tue, 
       sales_wed, sales_thu, sales_fri, sales_sat
      FROM temp_sales_step1;
BEGIN
   FOR next IN c1 LOOP
      INSERT INTO temp_sales_step2 VALUES (product_id, store_id, 
         TO_DATE(week_id,'WW') time_id, sales_sun );
      INSERT INTO temp_sales_step2 VALUES (product_id, store_id, 
         TO_DATE(week_id,'WW')+1 time_id, sales_mon );
      INSERT INTO temp_sales_step2 VALUES (product_id, store_id, 
         TO_DATE(week_id,'WW')+2 time_id, sales_tue );
      INSERT INTO temp_sales_step2 VALUES (product_id, store_id, 
         TO_DATE(week_id,'WW')+3 time_id, sales_wed );
      INSERT INTO temp_sales_step2 VALUES (product_id, store_id, 
         TO_DATE(week_id,'WW')+4 time_id, sales_thu );
      INSERT INTO temp_sales_step2 VALUES (product_id, store_id, 
         TO_DATE(week_id,'WW')+5 time_id, sales_fri );
      INSERT INTO temp_sales_step2 VALUES (product_id, store_id, 
         TO_DATE(week_id,'WW')+6 time_id, sales_sat );
   END LOOP;
   COMMIT;
END;

This PL/SQL procedure could be modified to provide even better performance. Array inserts could accelerate the insertion phase of the procedure. Further performance could be gained by parallelizing this transformation operation, particularly if the TEMP_SALES_STEP1 table is partitioned, using techniques similar to the parallelization of data-unloading described in Chapter 11, "Extraction".

The primary advantage of this PL/SQL procedure over a CTAS approach is that it only requires a single scan of the data. Pivoting is an example of a complex transformation that may be more amenable to PL/SQL.

Emphasis on Transformation Techniques

This chapter is designed to introduce techniques for implementing scalable and efficient data transformations within Oracle8i. The examples in this chapter are relatively simple; real-world data transformations are often considerably more complex. However, the transformation techniques introduced in this chapter meet the majority of real-world data transformation requirements, often with more scalability and less programming than alternative approaches.

This chapter does not seek to illustrate all of the typical transformations that would be encountered in a data warehouse, but to demonstrate the types of techniques that can be applied to implement these transformations and to provide guidance in how to choose the best techniques.

13 Transformation