Oracle8i Data Warehousing Guide
Release 2 (8.1.6)

Part Number A76994-01

11
Extraction

This chapter discusses extraction, the process of taking data from an operational system and moving it into your data warehouse. The chapter discusses the following topics: Overview of Extraction, Extracting Via Data Files, Extracting Via Distributed Operations, and Change Capture.

Overview of Extraction

Extraction is the operation of copying data from a database into a file or onto a network connection. This is the first step of the ETT (extraction, transformation, transportation) process: data must be extracted from the source system(s) before it can be transformed and loaded into the data warehouse.

The source systems for a data warehouse are typically transaction-processing database applications. For example, one of the source systems for a sales-analysis data warehouse may be the order-entry system, which records all of the current order activity.

Designing and creating the extraction process is often one of the most time-consuming tasks in the ETT process and, indeed, in the entire data warehousing process. The source systems may be very complex, and thus determining which data needs to be extracted can be difficult. Moreover, the source system typically cannot be modified, nor can its performance or availability be impacted, to accommodate the needs of the data warehouse extraction process. These are very important considerations for extraction, and ETT in general.

This chapter, however, focuses on the technical considerations for extracting data. It assumes that the data warehouse team has already identified the data that will be extracted, and discusses common techniques used for extracting data from source databases. The techniques for extraction fall into two broad categories: extraction via data files and extraction via distributed operations.

Extracting Via Data Files

Most database systems provide mechanisms for exporting and/or unloading data from the internal database format into flat files. Extracts from mainframe systems often use COBOL programs, but many databases, as well as third-party software vendors, provide export and/or unload utilities.

Data extraction does not necessarily mean that entire database structures are unloaded into flat files. In many cases, it may be appropriate to unload entire database tables or objects. In other cases, it may be more appropriate to unload only a subset of a given table, or to unload the results of joining multiple tables together. Different extraction techniques may vary in their capabilities to support these two scenarios.

When the source system is an Oracle database, there are several alternatives for extracting data into files:

Extracting into Flat Files Using SQL*Plus

The most basic technique for extracting data is to execute a SQL query in SQL*Plus and direct the output of the query to a file. For example, to extract a flat file, empdept.dat, containing a list of employee names and department names from the EMP and DEPT tables, the following SQL script could be run:

REM Suppress command echo, page formatting, and row-count feedback
REM so that the spooled file contains only the data
SET echo off
SET pagesize 0
SET feedback off
SPOOL empdept.dat
SELECT ename, dname FROM emp, dept
WHERE emp.deptno = dept.deptno;
SPOOL off

The exact format of the output file can be controlled using SQL*Plus system variables.

Note that this extraction technique offers the advantage of being able to extract the output of any SQL statement. The example above extracts the results of a join.

This extraction technique can be parallelized by initiating multiple, concurrent SQL*Plus sessions, each session running a separate query representing a different portion of the data to be extracted. For example, suppose that you wish to extract data from an ORDERS table, and that the ORDERS table has been range partitioned by month, with partitions ORDERS_JAN1998, ORDERS_FEB1998, and so on. To extract a single year of data from the ORDERS table, you could initiate 12 concurrent SQL*Plus sessions, each extracting a single partition. The SQL script for one such session could be:

SPOOL order_jan.dat
SELECT * FROM orders PARTITION (orders_jan1998);
SPOOL off

These 12 SQL*Plus processes would concurrently spool data to 12 separate files. These files may need to be concatenated (using OS utilities) following the extraction.

Even if the ORDERS table is not partitioned, it is still possible to parallelize the extraction. By querying the data dictionary, you can identify the Oracle data blocks that make up the ORDERS table, and from this information derive a set of ROWID-range queries for extracting data from it:

SELECT * FROM orders WHERE rowid BETWEEN <value1> and <value2>;
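The ROWID ranges themselves can be derived from the extent information in the data dictionary. The following is a minimal sketch of this approach, assuming the ORDERS table belongs to the SCOTT schema (the schema name is a placeholder); it uses the DBMS_ROWID package to construct a starting and an ending ROWID for each extent of the table:

SELECT dbms_rowid.rowid_create(1, o.data_object_id, e.relative_fno,
                               e.block_id, 0) AS start_rowid,
       dbms_rowid.rowid_create(1, o.data_object_id, e.relative_fno,
                               e.block_id + e.blocks - 1, 10000) AS end_rowid
FROM   dba_extents e, dba_objects o
WHERE  e.owner = 'SCOTT' AND e.segment_name = 'ORDERS'
AND    o.owner = 'SCOTT' AND o.object_name = 'ORDERS';

Each resulting (start_rowid, end_rowid) pair can then be substituted for <value1> and <value2> in a separate session's query.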

Parallelization of the extraction of complex SQL queries may also be possible, although the process of breaking a single complex query into multiple components can be challenging.

Note that all parallel techniques can consume considerably more CPU and I/O resources on the source system than a serial extraction, and the impact on the source system should be evaluated before parallelizing any extraction technique.

Extracting into Flat Files Using OCI or Pro*C Programs

OCI programs (or other programs that use Oracle call interfaces, such as Pro*C programs) can also be used to extract data. These techniques typically provide improved performance over the SQL*Plus approach, although they also require additional programming. Like the SQL*Plus approach, an OCI program can extract the results of any SQL query. Furthermore, the parallelization techniques described for the SQL*Plus approach can be readily applied to OCI programs as well.

Exporting into Oracle Export Files Using Oracle's EXP Utility

Oracle's export utility allows tables (including data) to be exported into Oracle export files. Unlike the SQL*Plus and OCI approaches, which extract the results of a SQL statement, EXP provides a mechanism for extracting database objects. Thus, EXP differs from the previous approaches in several important ways.

Oracle provides a direct-path export, which is quite efficient for extracting data. Note, however, that in Oracle8i there is no direct-path import; this should be considered when evaluating the overall performance of an export-based extraction strategy.

Although the SQL*Plus and OCI extraction techniques are more common, EXP may be useful for certain ETT environments that require extraction of metadata as well as data.
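For example, a direct-path export of the ORDERS table could be invoked from the operating system as follows (a minimal sketch; the SCOTT credentials and the file name are placeholders):

exp scott/tiger FILE=orders.dmp TABLES=(orders) DIRECT=Y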

See Oracle8i Utilities for more information on using export.

Copying to Another Oracle Database Using Transportable Tablespaces

One of the most powerful features for extracting and moving large volumes of data between Oracle databases is transportable tablespaces. A more detailed example of using this feature to extract and transport data is provided in Chapter 12, "Transportation". When possible, transportable tablespaces are highly recommended for data extraction, because they often provide considerable advantages in performance and manageability over other extraction techniques.
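As a brief illustration of the mechanics (covered fully in Chapter 12), the source tablespace is first made read-only, and its metadata is then exported by a suitably privileged user; the tablespace name and file name below are placeholders:

ALTER TABLESPACE orders_ts READ ONLY;

exp TRANSPORT_TABLESPACE=y TABLESPACES=(orders_ts) FILE=orders_ts.dmp

The tablespace's datafiles, along with the export file, are then copied to the target system.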

Extracting Via Distributed Operations

Using distributed-query technology, one Oracle database can directly query tables located in another Oracle database. Specifically, a data warehouse or staging database can directly access tables and data located in an Oracle-based source system. This is perhaps the simplest method for moving data between two Oracle databases, because it combines the extraction and transformation into a single step and requires minimal programming.

Continuing our example from above, suppose that we wanted to extract a list of employee names with department names from a source database and store it in our data warehouse. Using a Net8 connection and distributed-query technology, this can be achieved with a single SQL statement:

CREATE TABLE empdept
AS 
SELECT ename, dname FROM emp@source_db, dept@source_db
WHERE emp.deptno = dept.deptno;

This statement creates a local table in the data warehouse, EMPDEPT, and populates it with data from the EMP and DEPT tables on the source system.
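This example assumes that a database link named source_db has already been defined in the data warehouse. A minimal sketch of creating such a link follows (the user name, password, and Net8 connect string are placeholders):

CREATE DATABASE LINK source_db
CONNECT TO scott IDENTIFIED BY tiger
USING 'sourcedb';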

This technique is ideal for moving small volumes of data. However, the data is transported from the source system to the data warehouse via a single Net8 connection. Thus, the scalability of this technique is limited. For larger data volumes, file-based data extraction and transportation techniques are often more scalable and thus more appropriate.

Gateways are another form of distributed-query technology; the difference is that gateways allow an Oracle database (such as a data warehouse) to access database tables stored in remote, non-Oracle databases. Like distributed queries, gateways are very easy to set up and use, but they also lack scalability for very large data volumes.

For more information on distributed queries, see Oracle8i Distributed Database Systems and Oracle8i Concepts.

Change Capture

An important consideration for extraction is incremental extraction, also called change data capture. If a data warehouse extracts data from an operational system on a nightly basis, then the only data it requires is the data that has changed since the last extraction (that is, the data that has been modified in the last 24 hours).

If it were possible to efficiently identify and extract only the most recently changed data, the extraction process (as well as all downstream operations in the ETT process) could be made much more efficient, because a far smaller volume of data would need to be extracted. Unfortunately, for many source systems, identifying the recently modified data is difficult or intrusive to the operation of the system. Change data capture is typically the most challenging technical issue in data extraction.

Many data warehouses do not use any change-capture techniques as part of the extraction process. Instead, entire tables from the source systems are extracted to the data warehouse or staging area, and these tables are compared with a previous extract from the source system to identify the changed data. While this approach may not have a significant impact on the source systems, it clearly can place a considerable burden on the data warehouse, particularly if the data volumes are large.

Thus, change data capture as part of the extraction process is often desirable. This section describes several techniques for implementing change data capture on Oracle source systems: timestamps, partitioning, and triggers.

These techniques are based upon the characteristics of the source systems, or may require modifications to the source systems. Thus, each of these techniques must be carefully evaluated by the owners of the source system prior to implementation.

Each of these change-capture techniques works in conjunction with the data-extraction techniques discussed earlier. For example, timestamps can be used whether the data is being unloaded to a file or accessed via a distributed query.

Timestamps

The tables in some operational systems have timestamp columns. The timestamp specifies the time and date that a given row was last modified. If the tables in an operational system have columns containing timestamps, then the latest data can easily be identified using the timestamp columns. For example, the following query might be useful for extracting today's data from an ORDERS table:

SELECT * FROM orders WHERE timestamp >= TRUNC(sysdate);

If timestamps are not available in an operational system, it may be possible to modify the system to include them. This requires, first, modifying the operational system's tables to include a new timestamp column and, second, creating a trigger (see "Triggers") that updates the timestamp column following every operation that modifies a given row.

Partitioning

Some source systems may use Oracle's range partitioning, such that the source tables are partitioned along a date key; this allows easy identification of new data. For example, if you are extracting from an ORDERS table, and the ORDERS table is partitioned by week, then it is easy to identify the current week's data, as shown in the sketch below.
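A minimal sketch of such a table definition and of extracting one week's data (the table, column, and partition names are illustrative):

CREATE TABLE orders
  ( order_id   NUMBER,
    order_date DATE,
    amount     NUMBER )
PARTITION BY RANGE (order_date)
  ( PARTITION orders_w01 VALUES LESS THAN (TO_DATE('08-01-1998', 'DD-MM-YYYY')),
    PARTITION orders_w02 VALUES LESS THAN (TO_DATE('15-01-1998', 'DD-MM-YYYY')) );

SELECT * FROM orders PARTITION (orders_w02);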

Triggers

Triggers can be created in operational systems to keep track of recently updated records. Used in conjunction with timestamp columns, they identify the exact time and date when a given row was last modified. You do this by creating a trigger on each source table that requires change data capture; following each DML statement that is executed on the source table, the trigger updates the timestamp column with the current time, as sketched below.
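A minimal sketch of such a trigger, assuming a LAST_MODIFIED timestamp column has been added to the ORDERS table (both names are illustrative):

ALTER TABLE orders ADD (last_modified DATE);

CREATE OR REPLACE TRIGGER orders_change_capture
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW
BEGIN
  -- Record the time of the modification in the row itself
  :new.last_modified := SYSDATE;
END;
/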

A similar trigger-based technique is to use Oracle's materialized view logs. These logs are used by materialized views to identify changed data, and they are accessible to end users. A materialized view log can be created on each source table requiring change data capture. Then, whenever any modifications are made to the source table, a record is inserted into the materialized view log indicating which rows were modified. The materialized view log can then be queried by the extraction process to identify the changed data.

Materialized view logs rely on triggers, but they provide an advantage in that the creation and maintenance of this change-data system is largely managed by Oracle.
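A minimal sketch of creating such a log on the ORDERS table (assuming the table has a primary key); Oracle maintains the underlying log table, conventionally named MLOG$_ORDERS, which the extraction process can then query:

CREATE MATERIALIZED VIEW LOG ON orders WITH PRIMARY KEY;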

Both of these trigger-based techniques affect performance on the source systems, and this impact should be carefully considered before implementing them on a production source system.

