Incremental Data Loading in Parallel

Parallel DML combined with the updatable join views facility provides an efficient solution for refreshing the tables of a data warehouse system. To refresh tables is to update them with the differential data generated from the OLTP production system.

In the following example, assume a refresh of a table named customers that has columns c_key, c_name, and c_addr. The differential data contains either new rows or rows that have been updated since the last refresh of the data warehouse. In this example, the updated data is shipped from the production system to the data warehouse system by means of ASCII files. These files must be loaded into a temporary table, named diff_customer, before starting the refresh process. You can use SQL*Loader with both the parallel and direct options to efficiently perform this task. You can use the APPEND hint when loading in parallel as well.

After diff_customer is loaded, the refresh process can be started. It can be performed in two phases or by merging in parallel, as demonstrated in the following:

Optimizing Performance for Updating the Table in Parallel

The following statement is a straightforward SQL implementation of the update using subqueries:

UPDATE customers SET(c_name, c_addr) = (SELECT c_name, c_addr
FROM diff_customer WHERE diff_customer.c_key = customer.c_key)
WHERE c_key IN(SELECT c_key FROM diff_customer);

Unfortunately, the two subqueries in this statement affect performance.

An alternative is to rewrite this query using updatable join views. To rewrite the query, you must first add a primary key constraint to the diff_customer table to ensure that the modified columns map to a key-preserved table:

CREATE UNIQUE INDEX diff_pkey_ind ON diff_customer(c_key) PARALLEL NOLOGGING;
ALTER TABLE diff_customer ADD PRIMARY KEY (c_key);

You can then update the customers table with the following SQL statement:

UPDATE /*+ PARALLEL(cust_joinview) */
(SELECT /*+ PARALLEL(customers) PARALLEL(diff_customer) */
CUSTOMER.c_name AS c_name CUSTOMER.c_addr AS c_addr,
diff_customer.c_name AS c_newname, diff_customer.c_addr AS c_newaddr
   FROM diff_customer
   WHERE customers.c_key = diff_customer.c_key) cust_joinview
   SET c_name = c_newname, c_addr = c_newaddr;

The underlying scans feeding the join view cust_joinview are done in parallel. You can then parallelize the update to further improve performance, but only if the customers table is partitioned.

Efficiently Inserting the New Rows into the Table in Parallel

The last phase of the refresh process consists of inserting the new rows from the diff_customer temporary table to the customers table. Unlike the update case, you cannot avoid having a subquery in the INSERT statement:

INSERT /*+PARALLEL(customers)*/ INTO customers SELECT * FROM diff_customer s);

However, you can guarantee that the subquery is transformed into an anti-hash join by using the HASH_AJ hint. Doing so enables you to use parallel INSERT to execute the preceding statement efficiently. Parallel INSERT is applicable even if the table is not partitioned.

Optimizing Performance by Merging in Parallel

You can combine update and insert operations into one statement, commonly known as a merge. The following statement achieves the same result as all of the statements in "Optimizing Performance for Updating the Table in Parallel" and "Efficiently Inserting the New Rows into the Table in Parallel":

MERGE INTO customers USING diff_customer
ON (diff_customer.c_key = customer.c_key) WHEN MATCHED THEN
  UPDATE SET (c_name, c_addr) = (SELECT c_name, c_addr 
  FROM diff_customer WHERE diff_customer.c_key = customers.c_key) 
WHEN NOT MATCHED THEN
 INSERT VALUES (diff_customer.c_key,diff_customer.c_data);