Oracle8i Data Warehousing Guide
Release 2 (8.1.6)

Part Number A76994-01

Library

Product

Contents

Index

Go to previous page Go to next page

14
Loading and Refreshing

This chapter discusses how to load and refresh a data warehouse, and discusses:

Refreshing a Data Warehouse

Following extraction and transformation, the final step of the ETT process is to physically insert the new, clean data into the production data warehouse schema, and take all of the other steps necessary (such as building indexes, validating constraints, taking backups) to make this new data available to the end-users.

Using Partitioning to Improve Data Warehouse Refresh

The partitioning scheme of the data warehouse is often crucial in determining the efficiency of refresh operations in the data warehouse load process. In fact, the load process is often the primary consideration in choosing the partitioning scheme of data-warehouse tables and indexes.

The partitioning scheme of the largest data-warehouse tables (for example, the fact table in a star schema) should be based upon the loading paradigm of the data warehouse.

Most data warehouses are loaded on a regular schedule: every night, every week or every month, new data is brought into the data warehouse. The data being loaded at the end of the week or month typically corresponds to the transactions for the week or month week. In this very common scenario, the data warehouse is being loaded by time. This suggests that the data-warehouse tables should be partitioned by a date key. In our data warehouse example, suppose the new data is loaded into the SALES table every month. Furthermore, the SALES table has been partitioned by month. These steps show how the load process will proceed to add the data for a new month (January, 2000) to SALES table:

Step 1: Place the new data into a separate table, SALES_00JAN. This data may be directly loaded into SALES_00JAN from outside the data warehouse, or this data may be the result of previous data-transformation operations that have already occurred in the data warehouse. SALES_00JAN has the exact same columns, datatypes, etc. as the SALES table. Gather statistics on the SALES_00JAN table.

Step 2: Create indexes and add constraints on SALES_00JAN. Again, the indexes and constraints on SALES_00JAN should be identical to the indexes and constraints on SALES.

Indexes can be built in parallel, and should use the NOLOGGING and the COMPUTE STATISTICS options. For example:

CREATE BITMAP INDEX sales_00jan_customer_id_bix 
      tablespace (sales_index) NOLOGGING parallel (degree 8) COMPUTE STATISTICS;
  

All constraints should be applied to the SALES_00JAN table that are present on the SALES table. This includes referential integrity constraints. Some typical constraints would be:

ALTER TABLE sales_00jan add CONSTRAINT sales_pk 
      unique(sales_transaction_id) disable validate;
ALTER TABLE sales_00jan add constraint sales_customer_ri
      sales_customer_id REFERENCES customer(customer_id);
  

Step 3: Add the SALES_00JAN table to the SALES table.

In order to add this new data to the SALES table, we need to do two things. First, we need to add a new partition to the SALES table. We will use the ALTER TABLE ... ADD PARTITION statement. This will add an empty partition to the SALES table:

ALTER TABLE sales ADD PARTITION sales_00jan 
VALUES LESS THAN (TO_DATE('01-FEB-2000', 'dd-mon-yyyy'));

Then, we can add our newly created table to this partition using the exchange partition operation. This will exchange the new, empty partition with the newly-loaded table.

ALTER TABLE sales EXCHANGE PARTITiON sales_00jan WITH TABLE sales_00jan 
INCLUDING INDEXES WITHOUT VALIDATION;
  

The exchange operator will preserve the indexes and constraints that were already present on the SALES_00JAN table. For unique constraints (such as the unique constraint on SALES_TRANSACTION_ID), the exchange operator might need to do additional processing to ensure consistency. If there were only foreign-key constraints, the exchange operation would be instantaneous.

The benefits of this partitioning technique are significant. First, the new data is loaded with minimal resource utilization. The new data is loaded into an entirely separate table, and the index-processing and constraint-processing are only applied to the new partition. If the SALES table was 50GB and had 12 partitions, then a new month's worth of data contains approximately 4 GB. Only the new month's worth of data needs to be indexed; none of the indexes on the remaining 46GB of data needs to be modified at all.

This partitioning scheme additionally ensures that the load-processing time is directly proportional to the amount of new data being loaded, not to the total size of the SALES table.

Second, the new data is loaded with minimal impact on concurrent queries. Since all of the operations associated with data loading are occurring on a separate SALES_00JAN table, none of the existing data or indexes of the SALES table is impacted during this data-refresh process. The SALES table, and its indexes, remain entirely untouched throughout this refresh process.

The EXCHANGE operator can be viewed as a publishing mechanism. Until the data-warehouse administrator exchanges the SALES_00JAN table into the SALES table, end-user cannot see the new data. Once the EXCHANGE has occurred, than any end-user query accessing the SALES table will immediately be able to see the SALES_00JAN data.

Note that partitioning is not only useful for adding new data, but also for removing data. Many data warehouses maintain a rolling window of data; that is, the data warehouse stores the most recent 12-months of SALES data for example. Just as a new partition can be added to the SALES table (as described above), an old partition can be quickly (and independently) removed from the SALES table. The above two benefits (reduced resources utilization and minimal end-user impact) are just as pertinent to removing a partition as they are to adding a partition.

This example is a simplification of the data warehouse load scenario. Real-world data-warehouse refresh characteristics are always more complex. However, the advantages of this rolling window approach are not diminished in more complex scenarios.

Consider two typical scenarios:

  1. Data is loaded daily. However, the data warehouse contains two years of data, so that partitioning by day might not be desired.

    SOLUTION: Partition by week or month (as appropriate). Use INSERT to add the new data to an existing partition. Since INSERT is only affecting a single partition, the benefits described above remain intact. The INSERT could occur while the partition remains a part of the table. INSERT into a single partition can be parallelized:

    INSERT INTO SALES PARTITION (SALES_00JAN) SELECT * FROM NEW_SALES;
    
    

    The indexes of this SALES partition will be maintained in parallel as well. Alternatively, the SALES_00JAN partition could be EXCHANGEd out of the SALES table, and the INSERT could occur there (this approach could be used, for example, if it was determined that based on data volumes it would be more efficient to drop and rebuild the indexes on this partition rather than maintain them)

  2. New data feeds, although consisting primarily of data for the most recent day/week/month, also contain some data from previous time periods.

    SOLUTION: Use parallel SQL operations (such as CREATE TABLE ... AS SELECT) to separate the new data from the data in previous time periods. Process the old data separately using other techniques.

    New data feeds are not solely time-based. This can occur when a data warehouse receives data from multiple operational systems. For example, the sales data from DIRECT channels may come into the data warehouse separately from the data from INDIRECT channels. For business reasons, it may furthermore make sense to keep the DIRECT and INDIRECT data in separate partitions.

    SOLUTION: Oracle supports concatenated partitioning keys. The SALES table could be partitioned by (month, channel). Care must be taken with this approach to ensure that the partition-pruning techniques (when querying the SALES table) are understood prior to implementation.

    Another possibility is composite (range/hash) partitioning. This approach is only feasible if the second key has a high cardinality. In this example, CHANNEL has only two possible values, so that it would not be a good candidate for a hash-partitioning key.

    The rolling window approach is the most basic technique for refreshing a data warehouse.

Example: Implementing an Efficient Upsert

Commonly, the data that is extracted from a source system is not simply a list of new records that needs to be inserted into the data warehouse. Instead, this new data set is a combination of new records as well as modified records. For example, suppose that most of data extracted from the OLTP systems will be new sales transactions. These records will be inserted into the warehouse's SALES table, however, some records may reflect modifications of previous transactions, such as returned merchandise or transactions that may have been incomplete or incorrect when initially loaded into the data warehouse. The records require updates to the SALES table.

As a typical scenario, suppose that there is a table, called NEW_SALES, which contains both inserts and updates that will be applied to the data warehouse's SALES table. When designing the entire data-warehouse load process, it was determined that the NEW_SALES table would contain records with the following semantics:

This UPDATE-ELSE-INSERT operation is often called an upsert. Upsert can be executed using two SQL statements. The first SQL statement updates the appropriate rows in the SALES tables, while the second SQL statement inserts the rows:

UPDATE sales SET 
       sales.sales_dollar_amount = sales.sales_dollar_amount 
        + new_sales.sales_dollar_amount,
       sales.sales_quantity_sold = sales.sales_quantity_sold
        + new_sales.sales_quantity_sold
WHERE new_sales.sales_transaction_id = sales.transaction_id;
  
INSERT INTO sales
SELECT * FROM new_sales
WHERE new_sales.sales_transaction_id NOT IN 
 (SELECT sales_transaction_id FROM sales); 

Both of these SQL statements can be parallelized, and this provides a very scalable mechanism for handling large amounts of changes.

An alternative implementation of upserts is to utilize a PL/SQL package, which successively reads each row of the NEW_SALES table, and applies if-then logic to either update or insert the new row into the SALES table. A PL/SQL-based implementation will certainly be effective if the NEW_SALES table is small, although the above approach will often be more efficient for larger data volumes.

Example: Maintaining Referential Integrity

In some data warehousing environments, the data warehouse administrator may wish to insert new data into tables in order to guarantee referential integrity. For example, a data warehouse may derive SALES from an operational system that retrieves data directly from cash registers. The SALES is refreshed nightly. However, the data for the PRODUCT dimension table may be derived from a separate operational system. The PRODUCT dimension table may only be refreshed once per week, since the PRODUCT table changes relatively slowly. If a new product was introduced on Monday, then it is possible for that product's PRODUCT_ID to appear in the SALES data of the data warehouse before that PRODUCT_ID has been inserted into the data warehouses PRODUCT table.

Although the sales transactions of the new product may be valid, this sales data will not satisfy the referential-integrity constraint between the PRODUCT dimension table and the SALES fact table.

Rather than disallow the new sales transactions, the data warehouse administrator may choose to insert the sales transactions into the SALES table.

However, the administrator may also wish to maintain the referential-integrity relationship between the SALES and PRODUCT tables. This can be accomplished by inserting new rows into the PRODUCT table as placeholders for the unknown products.

As in previous examples, we assume that the new data for the SALES table will be staged in a separate table, NEW_SALES. Using a single INSERT statement (which can be parallelized), the PRODUCT table can be altered to reflect the new products:

INSERT INTO PRODUCT_ID
  (SELECT SALES_PRODUCT_ID, 'Unknown Product Name', NULL, NULL ...
   FROM NEW_SALES WHERE SALES_PRODUCT_ID NOT IN
  (SELECT PRODUCT_ID FROM PRODUCT));

Example: Purging Data

Occasionally, it is necessary to remove large amounts of data from a data warehouse.

A very common scenario was discussed above with the rolling window, in which older data is rolled out of the data warehouse to make room for new data.

However, sometimes other data may need to removed from a data warehouse. Let's suppose that a retail company has previously sold products from 'MS Software', and that 'MS Software' has subsequently gone out of business. The business users of the warehouse may decide that they are no longer interested in seeing any data related to 'MS Software', so this data should be deleted.

One approach to removing a large volume of data is via parallel delete:

DELETE FROM SALES WHERE SALES_PRODUCT_ID IN (SELECT PRODUCT_ID FROM
PRODUCT WHERE PRODUCT_CATEGORY = 'MS Software');

This SQL statement will spawn one parallel process per partition. The advantage to this approach is that this will be much more efficient than a serial DELETE statement, and that none of the data in the SALES table will need to moved.

However, this approach also has some disadvantages. When removing a large percentage of rows, the delete statement will leave many empty row-slots in the existing partitions. If new data is being loaded using a rolling-window technique (or is being loaded using direct-path insert or load), then this storage will not be reclaimed. Moreover, even though the DELETE statement is parallelized, there may be more efficient methods. An alternative method is to recreate the entire SALES table, keeping the data for all product categories except 'MS Software'.

CREATE TABLE SALES2 AS 
SELECT * FROM SALES, PRODUCT
WHERE SALES.SALES_PRODUCT_ID = PRODUCT.PRODUCT_ID
AND PRODUCT_CATEGORY <> 'MS Software'
NOLOGGING PARALLEL (DEGREE 8)
PARTITION ... ;
CREATE INDEXES, constraints, etc. 
DROP TABLE SALES;
RENAME SALES2 TO SALES;

This approach may be more efficient than a parallel delete. However, this approach is also costly in terms of the amount of disk space, since the SALES table must effectively be instantiated twice.

An alternative method to utilize less space is to recreate the SALES table one partition at a time:

CREATE TABLE SALES_TEMP AS SELECT * FROM SALES WHERE 1=0;

INSERT INTO SALES_TEMP PARTITION (SALES_99JAN)
SELECT * FROM SALES, PRODUCT
WHERE SALES.SALES_PRODUCT_ID = PRODUCT.PRODUCT_ID
AND PRODUCT_CATEGORY <> 'MS Software';
<CREATE appropriate indexes and constraints on SALES_TEMP>
ALTER TABLE SALES EXCHANGE PARTITION (SALES_99JAN) WITH TABLE SALES_TEMP;

Continue this process for each partition in the SALES table.

Example: Re-synchronizing a Materialized View

Another major component of loading and refreshing a data warehouse is the refreshing of materialized views. Because many of the previous examples in this chapter relied on partition-maintenance operations, this section will initially discuss how to maintain materialized views following these operations. The following example demonstrates how you can manually re-synchronize a materialized view with its detail tables after performing partition maintenance operations on one of its detail tables. The method shown requires the materialized view to be partitioned on the same key column(s) as that of detail table and that the materialized view partitions have a one to one correspondence with the detail table partitions.

Drop an old partition from table FACT

ALTER TABLE fact DROP PARTITION month1;

Drop the corresponding old partition from materialized view DAILY_SUM using ALTER MATERIALIZED VIEW is currently not allowed, but ALTER TABLE works.)

ALTER TABLE daily_sum DROP PARTITION daily_sum_month1;

The materialized view is now stale and requires a complete refresh because of the partition operations. However, the materialized view is in fact fresh since we have manually re-synchronized it with its detail tables by dropping the corresponding materialized view partition. Therefore, we can alter the materialized view to tell Oracle to consider it fresh.

ALTER MATERIALIZED VIEW daily_sum CONSIDER FRESH;

DAILY_SUM's status is now neither known fresh nor known stale. Instead, it is UNKNOWN, enabling it to be used in QUERY_REWRITE_INTEGRITY = TRUSTED mode (if the materialized view is enabled for query rewrite). Moreover, the materialized view is again eligible for fast refresh after any subsequent updates.

In addition to re-synchronizing materialized views, this technique is also useful when the materialized view is intended to accumulate historical, aggregated data after it has been removed from the detail table. For example, you might wish to store 12 months of detail sales data in your warehouse, but also store 36 months of aggregate data in your materialized views. Oracle does not recommend you enable such a materialized view for query rewrite since the contents of the materialized view are deliberately not synchronized with its detail tables. This scenario can be implemented as shown above, except you omit the following statement:

ALTER TABLE daily_sum DROP PARTITION daily_sum_month1;

In order to deliberately not synchronize the materialized view with its detail tables. So long as the materialized view is not enabled for query rewrite, it is safe to use the following statement:

ALTER MATERIALIZED VIEW daily_sum CONSIDER FRESH;

This statement informs Oracle that DAILY_SUM is fresh for your purposes, and thereby re-enable fast refresh after subsequent updates.

Populating Databases Using Parallel Load

This section presents a case study illustrating how to create, load, index, and analyze a large data warehouse fact table with partitions in a typical star schema. This example uses SQL Loader to explicitly stripe data over 30 disks.

Figure 14-1 Datafile Layout for Parallel Load Example


Step 1: Create the Tablespaces and Add Datafiles in Parallel

Below is the command to create a tablespace named "Tsfacts1". Other tablespaces are created with analogous commands. On a 10-CPU machine, it should be possible to run all 12 CREATE TABLESPACE commands together. Alternatively, it might be better to run them in two batches of 6 (two from each of the three groups of disks).

CREATE TABLESPACE Tsfacts1 
DATAFILE /dev/D1.1'  SIZE 1024MB REUSE
DATAFILE /dev/D2.1'  SIZE 1024MB REUSE
DATAFILE /dev/D3.1'  SIZE 1024MB REUSE
DATAFILE /dev/D4.1' SIZE 1024MB REUSE
DATAFILE /dev/D5.1'  SIZE 1024MB REUSE
DATAFILE /dev/D6.1'  SIZE 1024MB REUSE
DATAFILE /dev/D7.1'  SIZE 1024MB REUSE
DATAFILE /dev/D8.1' SIZE 1024MB REUSE
DATAFILE /dev/D9.1'  SIZE 1024MB REUSE
DATAFILE /dev/D10.1  SIZE 1024MB REUSE
DEFAULT STORAGE (INITIAL 100MB NEXT 100MB PCTINCREASE 0)
CREATE TABLESPACE Tsfacts2
DATAFILE /dev/D4.2' SIZE 1024MB REUSE
DATAFILE /dev/D5.2'  SIZE 1024MB REUSE
DATAFILE /dev/D6.2'  SIZE 1024MB REUSE
DATAFILE /dev/D7.2'  SIZE 1024MB REUSE
DATAFILE /dev/D8.2' SIZE 1024MB REUSE
DATAFILE /dev/D9.2'  SIZE 1024MB REUSE
DATAFILE /dev/D10.2  SIZE 1024MB REUSE
DATAFILE /dev/D1.2'  SIZE 1024MB REUSE
DATAFILE /dev/D2.2'  SIZE 1024MB REUSE
DATAFILE /dev/D3.2'  SIZE 1024MB REUSE
DEFAULT STORAGE (INITIAL 100MB NEXT 100MB PCTINCREASE 0)
...
CREATE TABLESPACE Tsfacts4
DATAFILE /dev/D10.4'  SIZE 1024MB REUSE
DATAFILE /dev/D1.4'  SIZE 1024MB REUSE
DATAFILE /dev/D2.4'  SIZE 1024MB REUSE
DATAFILE /dev/D3.4  SIZE 1024MB REUSE
DATAFILE /dev/D4.4' SIZE 1024MB REUSE
DATAFILE /dev/D5.4'  SIZE 1024MB REUSE
DATAFILE /dev/D6.4'  SIZE 1024MB REUSE
DATAFILE /dev/D7.4'  SIZE 1024MB REUSE
DATAFILE /dev/D8.4' SIZE 1024MB REUSE
DATAFILE /dev/D9.4'  SIZE 1024MB REUSE
DEFAULT STORAGE (INITIAL 100MB NEXT 100MB PCTINCREASE 0)
...
CREATE TABLESPACE Tsfacts12
DATAFILE /dev/D30.4'  SIZE 1024MB REUSE
DATAFILE /dev/D21.4'  SIZE 1024MB REUSE
DATAFILE /dev/D22.4'  SIZE 1024MB REUSE
DATAFILE /dev/D23.4  SIZE 1024MB REUSE
DATAFILE /dev/D24.4' SIZE 1024MB REUSE
DATAFILE /dev/D25.4'  SIZE 1024MB REUSE
DATAFILE /dev/D26.4'  SIZE 1024MB REUSE
DATAFILE /dev/D27.4'  SIZE 1024MB REUSE
DATAFILE /dev/D28.4' SIZE 1024MB REUSE
DATAFILE /dev/D29.4'  SIZE 1024MB REUSE
DEFAULT STORAGE (INITIAL 100MB NEXT 100MB PCTINCREASE 0)

Extent sizes in the STORAGE clause should be multiples of the multiblock read size, where:

blocksize * MULTIBLOCK_READ_COUNT = multiblock read size

INITIAL and NEXT should normally be set to the same value. In the case of parallel load, make the extent size large enough to keep the number of extents reasonable, and to avoid excessive overhead and serialization due to bottlenecks in the data dictionary. When PARALLEL=TRUE is used for parallel loader, the INITIAL extent is not used. In this case you can override the INITIAL extent size specified in the tablespace default storage clause with the value specified in the loader control file, for example, 64KB.

Tables or indexes can have an unlimited number of extents provided you have set the COMPATIBLE system parameter to match the current release number, and use the MAXEXTENTS keyword on the CREATE or ALTER command for the tablespace or object. In practice, however, a limit of 10,000 extents per object is reasonable. A table or index has an unlimited number of extents, so set the PERCENT_INCREASE parameter to zero to have extents of equal size.


Note:

It is not desirable to allocate extents faster than about 2 or 3 per minute. Thus, each process should get an extent that lasts for 3 to 5 minutes. Normally such an extent is at least 50MB for a large object. Too small an extent size incurs significant overhead and this affects performance and scalability of parallel operations. The largest possible extent size for a 4GB disk evenly divided into 4 partitions is 1GB. 100MB extents should perform well. Each partition will have 100 extents. You can then customize the default storage parameters for each object created in the tablespace, if needed. 


Step 2: Create the Partitioned Table

We create a partitioned table with 12 partitions, each in its own tablespace. The table contains multiple dimensions and multiple measures. The partitioning column is named "dim_2" and is a date. There are other columns as well.

CREATE TABLE fact (dim_1 NUMBER, dim_2 DATE, ...
meas_1 NUMBER, meas_2 NUMBER, ... )
PARALLEL
(PARTITION BY RANGE (dim_2)
PARTITION jan95 VALUES LESS THAN ('02-01-1995') TABLESPACE
TSfacts1
PARTITION feb95 VALUES LESS THAN ('03-01-1995') TABLESPACE
TSfacts2
...
PARTITION dec95 VALUES LESS THAN ('01-01-1996') TABLESPACE
TSfacts12);

Step 3: Load the Partitions in Parallel

This section describes four alternative approaches to loading partitions in parallel.

The different approaches to loading help you manage the ramifications of the PARALLEL=TRUE keyword of SQL*Loader that controls whether individual partitions are loaded in parallel. The PARALLEL keyword entails restrictions such as the following:

However, regardless of the setting of this keyword, if you have one loader process per partition, you are still effectively loading into the table in parallel.

Case 1

In this approach, assume 12 input files are partitioned in the same way as your table. The DBA has 1 input file per partition of the table to be loaded. The DBA starts 12 SQL*Loader sessions concurrently in parallel, entering statements like these:

SQLLDR DATA=jan95.dat DIRECT=TRUE CONTROL=jan95.ctl
SQLLDR DATA=feb95.dat DIRECT=TRUE CONTROL=feb95.ctl
 . . . 
SQLLDR DATA=dec95.dat DIRECT=TRUE CONTROL=dec95.ctl

In the example, the keyword PARALLEL=TRUE is not set. A separate control file per partition is necessary because the control file must specify the partition into which the loading should be done. It contains a statement such as:

LOAD INTO fact partition(jan95)

The advantages of this approach are that local indexes are maintained by SQL*Loader. You still get parallel loading, but on a partition level--without the restrictions of the PARALLEL keyword.

A disadvantage is that you must partition the input prior to loading manually.

Case 2

In another common approach, assume an arbitrary number of input files that are not partitioned in the same way as the table. The DBA can adopt a strategy of performing parallel load for each input file individually. Thus if there are 7 input files, the DBA can start 7 SQL*Loader sessions, using statements like the following:

SQLLDR DATA=file1.dat DIRECT=TRUE PARALLEL=TRUE

Oracle partitions the input data so that it goes into the correct partitions. In this case all the loader sessions can share the same control file, so there is no need to mention it in the statement.

The keyword PARALLEL=TRUE must be used because each of the 7 loader sessions can write into every partition. In case 1, every loader session would write into only 1 partition, because the data was partitioned prior to loading. Hence all the PARALLEL keyword restrictions are in effect.

In this case, Oracle attempts to spread the data evenly across all the files in each of the 12 tablespaces--however an even spread of data is not guaranteed. Moreover, there could be I/O contention during the load when the loader processes are attempting to write to the same device simultaneously.

Case 3

In Case 3 (illustrated in the example), the DBA wants precise control over the load. To achieve this, the DBA must partition the input data in the same way as the datafiles are partitioned in Oracle.

This example uses 10 processes loading into 30 disks. To accomplish this, the DBA must split the input into 120 files beforehand. The 10 processes will load the first partition in parallel on the first 10 disks, then the second partition in parallel on the second 10 disks, and so on through the 12th partition. The DBA runs the following commands concurrently as background processes:

SQLLDR DATA=jan95.file1.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D1.1 
...
SQLLDR DATA=jan95.file10.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D10.1 
WAIT;
...
SQLLDR DATA=dec95.file1.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D30.4
...
SQLLDR DATA=dec95.file10.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D29.4 

For Oracle Parallel Server, divide the loader session evenly among the nodes. The datafile being read should always reside on the same node as the loader session.

The keyword PARALLEL=TRUE must be used, because multiple loader sessions can write into the same partition. Hence all the restrictions entailed by the PARALLEL keyword are in effect. An advantage of this approach, however, is that it guarantees that all of the data is precisely balanced, exactly reflecting your partitioning.


Note:

Although this example shows parallel load used with partitioned tables, the two features can be used independent of one another. 


Case 4

For this approach, all partitions must be in the same tablespace. You need to have the same number of input files as datafiles in the tablespace, but you do not need to partition the input the same way in which the table is partitioned.

For example, if all 30 devices were in the same tablespace, then you would arbitrarily partition your input data into 30 files, then start 30 SQL*Loader sessions in parallel. The statement starting up the first session would be similar to the following:

SQLLDR DATA=file1.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D1
. . .
SQLLDR DATA=file30.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D30

The advantage of this approach is that as in Case 3, you have control over the exact placement of datafiles because you use the FILE keyword. However, you are not required to partition the input data by value because Oracle does that for you.

A disadvantage is that this approach requires all the partitions to be in the same tablespace. This minimizes availability.

Refreshing Materialized Views

When creating a materialized view, you have the option of specifying whether the refresh occurs ON DEMAND or ON COMMIT. When ON_DEMAND refresh is used, then the materialized view can be refreshed by calling one of the procedures in DBMS_MVIEW.

The DBMS_MVIEW package provides three different types of refresh operations.

See "Manual Refresh Using the DBMS_MVIEW Package" for more information about this package.

Performing a refresh operation requires temporary space to rebuild the indexes, and can require additional space for performing the refresh operation itself.

Some sites may prefer to not refresh all of their materialized views at the same time. Therefore, if you defer refreshing your materialized views, you can temporarily disable query rewrite with ALTER SYSTEM SET QUERY_REWRITE_ENABLED = FALSE. Users who still want access to the stale materialized views can override this default with ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE. After refreshing the materialized views, you can re-enable query rewrite as the default for all sessions in the current database instance by setting ALTER SYSTEM SET QUERY_REWRITE_ENABLED = TRUE.

Refreshing a materialized view automatically updates all of its indexes; in the case of full refresh, this requires temporary sort space. If insufficient temporary space is available to rebuild the indexes, then you must explicitly drop each index or mark it unusable prior to performing the refresh operation.

When a materialized view is refreshed, one of four refresh methods may be specified as shown in the table below.

Refresh Option    Description 

COMPLETE 

Refreshes by recalculating the materialized view's defining query. 

FAST 

Refreshes by incrementally applying changes to the detail tables. 

FORCE 

Tries to do a fast refresh. If that is not possible, it does a complete refresh. 

ALWAYS 

Unconditionally does a complete refresh. 

Complete Refresh

A complete refresh occurs when the materialized view is initially defined, unless the materialized view references a prebuilt table, and complete refresh may be requested at any time during the life of the materialized view. Because the refresh involves reading the detail table to compute the results for the materialized view, this can be a very time-consuming process, especially if there are huge amounts of data to be read and processed. Therefore, one should always consider the time required to process a complete refresh before requesting it. See Oracle8i Designing and Tuning for Performance for further details.

However, there are cases when the only refresh method available is complete refresh because the materialized view does not satisfy the conditions specified in the following section for a fast refresh.

Fast Refresh

Most data warehouses require periodic incremental updates to their detail data. As described in "Schema Design Guidelines for Materialized Views", you can use the SQL*Loader direct path option, or any bulk load utility that uses Oracle's direct path interface, to perform incremental loads of detail data. Use of Oracle's direct path interface makes fast refresh of your materialized views efficient because, instead of having to recompute the entire materialized view, the changes are added to the existing data. Thus, applying only the changes can result in a very fast refresh time.

The time required to perform incremental refresh is sensitive to several factors:

The first factors can be addressed by partitioning the materialized view container by time, like the fact tables, and by creating a local concatenated index on the materialized view keys. The third factor can be addressed by creating dimensions and hierarchies for your schema, and by ensuring that all materialized view inner joins are strict 1:n relationships whenever possible, as described below.

Manual Refresh Using the DBMS_MVIEW Package

Three different refresh procedures are available in the DBMS_MVIEW package for performing ON DEMAND refresh and they each have their own unique set of parameters. To use this package, Oracle8 queues must be available, which means that the following parameters must be set in the initialization parameter file. If queues are unavailable, refresh will fail with an appropriate message.

See Also:

See Oracle8i Supplied PL/SQL Packages Reference for detailed information about the DBMS_MVIEW package. Oracle8i Replication explains how to use it in a replication environment. 

Required Initialization Parameters for Refresh

These packages also create a log which, by default, is called refresh.log and is useful in helping to diagnose problems during the refresh process. This log file can be renamed by calling the procedure DBMS_OLAP.SET_LOGFILE_NAME ('log filename').

Refresh Specific Materialized Views

The DBMS_MVIEW.REFRESH procedure is used to refresh one or more materialized views that are explicitly defined in the FROM list. This refresh procedure can also be used to refresh materialized views used by replication, so not all of the parameters are required. The required parameters to use this procedure are:

Therefore, to perform a fast refresh on the materialized view store_mv, the package would be called as follows:

DBMS_MVIEW.REFRESH('STORE_MV', 'A', '', TRUE, FALSE, 0,0,0, FALSE);

Multiple materialized views can be refreshed at the same time and they don't all have to use the same refresh method. To give them different refresh methods, specify multiple method codes in the same order as the list of materialized views (without commas). For example, the following specifies that store_mv will be completely refreshed and product_mv will receive a fast refresh.

DBMS_MVIEW.REFRESH('STORE_MV,PRODUCT_MV', 'AF', '', TRUE, FALSE, 0,0,0, FALSE);

Refresh All Materialized Views

An alternative to specifying the materialized views to refresh is to use the procedure DBMS_MVIEW.REFRESH_ALL_MVIEWS. This will result in all materialized views being refreshed. If any of the materialized views fails to refresh, then the number of failures is reported.

The parameters for this procedure are:

An example of refreshing all materialized views is:

DBMS_MVIEW.REFRESH_ALL_MVIEWS ( failures,'A','',FALSE,FALSE);

Refresh Dependent

The third option is the ability to refresh only those materialized views that depend on a specific table using the procedure DBMS_MVIEW. REFRESH_DEPENDENT. For example, suppose the changes have been received for the orders table but not customer payments. The refresh dependent procedure can be called to refresh only those materialized views that reference the ORDER table.

The parameters for this procedure are:

In order to perform a full refresh on all materialized views that reference the ORDERS table, use:

DBMS_mview.refresh_dependent  (failures, 'ORDERS', 'A', '', FALSE, FALSE );

To provide the list of materialized views that are directly dependent on a given object (table/MV), use:

DBMS_mview.get_mv_dependencies ( mvlist  IN   VARCHAR2, deplist OUT  VARCHAR2)

The input to the above functions is the name(s) of the materialized view and the output is a comma separated list of the materialized views that are defined on it. For example:

get_mv_dependencies ( "JOHN.SALES_REG, SCOTT.PROD_TIME", deplist)  

would populate deplist with the list of materialized views defined on the input arguments

deplist <= "JOHN.SUM_SALES_WEST, JOHN.SUM_SALES_EAST, SCOTT.SUM_PROD_MONTH".

Tips for Refreshing Using Refresh

If the process that is executing DBMS_MVIEW.REFRESH is interrupted or the instance is shut down, any refresh jobs that were executing in job queue processes will be requeued and will continue running. To remove these jobs, use the DBMS_JOB.REMOVE procedure.

Materialized Views with Joins and Aggregates

1-Enable Parallel DML.

2-Use atomic=FALSE, which will use TRUNCATE instead of DELETE to delete existing rows.

Here are some guidelines for using the refresh mechanism for materialized views with joins and aggregates.

  1. Always load new data using the direct-path option if possible. Avoid deletes and updates because a complete refresh will be necessary (for aggregates only). However, you can drop a partition on a materialized view and do a fast refresh.

  2. Place fixed key constraints on the fact table, and primary key constraints from the fact table to the dimension table. Doing this enables refresh to identify the fact table, which helps fast refresh.

  3. During loading, disable all constraints and re-enable when finished loading.

  4. Index the materialized view on the foreign key columns using a concatenated index.

  5. To speed up fast refresh, make the number of job queue processes greater than the number of processors.

  6. If there are many materialized views to refresh, it is faster to refresh all in a single command than to call them individually.

  7. Make use of the "?" refresh method to ensure getting a refreshed materialized view that can be used to query rewrite. If a fast refresh cannot be done, a complete refresh will be performed. Whereas, if a fast refresh had been requested and there was no need for a refresh, the materialized view would not be refreshed at all.

  8. Try to create materialized views that are fast refreshable because it refreshes more quickly.

  9. If a materialized view contains data that is based on data which is no longer in the fact table, maintain the materialized view using fast refresh. If no job queues are started, two job queue processes will be started by the refresh. This can be modified by:

    ALTER SYSTEM SET JOB_QUEUE_PROCESSES = value
    
    
  10. In general, the more processors there are, the more job queue processes should be created. Also, if you are doing mostly complete refreshes, reduce the number of job queue processes, since each refresh consumes more system resources than a fast refresh. The number of job queue processes limits the number of materialized views that can be refreshed concurrently. In contrast, if you perform mostly fast refreshes, increase the number of job queue processes.

Refresh of Materialized Views Containing a Single Table with Aggregates

A materialized view which contains aggregates and is based on a single table may be fast refreshable, provided it adheres to the rules in Requirements for Fast Refresh when data changes are made using either direct path or SQL DML statements. At refresh time, Oracle detects the type of DML that has been done (direct-load or SQL DML) and uses either the materialized view log or information available from the direct-path to determine the new data. If changes will be made to your data using both methods, then refresh should be performed after each type of data change rather than issuing one refresh at the end. This is because Oracle can perform significant optimizations if it detects that only one type of DML is done. It is therefore recommended that scenario 2 be followed rather than scenario 1.

To improve fast refresh performance, it is highly recommended that indexes be created on the columns which contain the rowids.

Scenario 1

Scenario 2

Furthermore, for refresh ON COMMIT, Oracle keeps track of the type of DML done in the committed transaction. It is thus recommended that the user does not do direct-path load and SQL DML to other tables in the same transaction as Oracle may not be able to optimize the refresh phase.

If the user has done a lot of updates to the table, it is better to bunch them in one transaction, so that refresh of the materialized view will be performed just once at commit time rather than after each update. In the warehouse, after a bulk load, the user should enable parallel DML in the session and perform the refresh. Oracle will use parallel DML to do the refresh, which will enhance performance tremendously. There is more to gain if the materialized view is partitioned.

As an example, assume that a materialized view is partitioned and has a parallel clause. The following sequence would be recommended in a data warehouse

  1. Bulk load into detail table

  2. ALTER SESSION ENABLE PARALLEL DML;

  3. Refresh materialized view

Refresh of Materialized Views Containing only Joins

If a materialized view contains joins but no aggregates, then having an index on each of the join column rowids in the detail table will enhance refresh performance greatly because this type of materialized view tends to be much larger than materialized views containing aggregates. For example, referring to the following materialized view:

CREATE MATERIALIZED VIEW  detail_fact_mv 
BUILD IMMEDIATE 
REFRESH FAST ON COMMIT 
 AS 
 SELECT  
 f.rowid "fact_rid", t.rowid "time_rid", s.rowid "store_rid", 
 s.store_key, s.store_name, f.dollar_sales, 
 f.unit_sales, f.time_key 
 FROM fact f, time t, store s 
 WHERE f.store_key = s.store_key(+) and 
 f.time_key = t.time_key(+); 
 

Indexes should be created on columns FACT_RID, TIME_RID and STORE_RID. Partitioning is highly recommended as is enabling parallel DML in the session before invoking refresh because it will greatly enhance refresh performance.

This type of materialized view can also be fast refreshed if DML is performed on the detail table. It is therefore recommended that the same procedure be applied to this type of materialized view as for a single table aggregate. That is, perform one type of change (direct-path load or DML) and then refresh the materialized view. This is because Oracle can perform significant optimizations if it detects that only one type of change has been done.

Also, it is recommended that the refresh be invoked after each table is loaded, rather than load all the tables and then perform the refresh. Therefore, try to use scenario 2 below for your refresh procedures.

Scenario 1

apply changes to fact
apply changes to store
refresh detail_fact_mv

Scenario 2

apply changes to fact 
refresh detail_fact_mv
apply changes to store 
refresh detail_fact_mv

For refresh ON COMMIT, Oracle keeps track of the type of DML done in the committed transaction. It is therefore recommended that you do not perform direct-path and conventional DML to other tables in the same transaction because Oracle may not be able to optimize the refresh phase. For example, the following is not recommended:

direct path new data into fact 
Conventional dml into store 
commit 

One should also try not to mix different types of conventional DML statements if possible. This would again prevent using various optimizations during fast refresh. For example, try to avoid:

insert into fact .. 
delete from fact .. 
commit 

If many updates are needed, try to group them all into one transaction because refresh will be performed just once at commit time, rather than after each update.

Scenario 1

update fact
commit 
update fact 
commit 
update fact 
commit 

Scenario 2

update fact
update fact 
update fact 
commit 

Note that if, when you use the DBMS_MVIEW package to refresh a number of materialized views containing only joins with the "atomic" parameter set to TRUE, parallel DML is disabled, which could lead to poorer refresh performance.

In a data warehousing environment, assuming that the materialized view has a parallel clause, the following sequence of steps is recommended:

  1. Bulk load into fact

  2. ALTER SESSION ENABLE PARALLEL DML;

  3. Refresh materialized view

Refreshing Nested Materialized Views

Refreshing materialized join views and single-table aggregate materialized views uses the same algorithms irrespective of whether or not the views are nested. All underlying objects are treated as ordinary tables. If the ON COMMIT refresh option is specified, then all the materialized views are refreshed in the appropriate order at commit time.

Example: Consider the schema in Figure 8-5. Assume all the materialized views are defined for ON COMMIT refresh. If table fact changes, at commit time, you could refresh join_fact_store_time first and then sum_sales_store_time and join_fact_store_time_prod (with no specific order for sum_sales_store_time and join_fact_store_time_prod because they do not have any dependencies between them).

In other words, Oracle builds a partially ordered set of materialized views and refreshes them such that, after the successful completion of the refresh, all the materialized views are fresh. The status of the materialized views can be checked by querying the appropriate (USER, DBA, ALL)_MVIEWS view.

If any of the materialized views is defined as ON DEMAND refresh (irrespective of whether the refresh method is fast, force, or complete), you will need to refresh them in the correct order (taking into account the dependencies between the materialized views) since the nested materialized view will be refreshed with respect to the current state of the other materialized views (whether fresh or not). You can find out dependent materialized views for a particular object using the PL/SQL function GET_MVIEW_DEPENDENCIES() in the DBMS_MVIEWS package.

If a refresh fails during commit time, the list of materialized views that have not been refreshed will be written to the alert log and the user will have to manually refresh them along with all their dependent materialized views.

APIs

As is the case with all materialized views, the functions in the DBMS_MVIEW package have to be used to refresh a nested materialized view on demand. These functions have the following behavior when used with nested materialized views:

Complex Materialized Views

A complex materialized view is one which cannot be fast refreshed. If you will be using only COMPLETE REFRESH, you can create a materialized view with any definition.

Recommended Initialization Parameters for Parallelism

The following parameters

Monitoring a Refresh

While a job is running, a SELECT * FROM V$SESSION_LONGOPS statement will tell you the progress of each materialized view being refreshed.

To look at the progress of which jobs are on which queue, use a SELECT * FROM DBA_JOBS_RUNNING statement.

The table ALL_MVIEWS contains the values, as a moving average, for the time most recently refreshed and the average time to refresh using both full and incremental methods.

Refresh will schedule the long running jobs first. Use the refresh log to check what each refresh did.

Tips after Refreshing Materialized Views

After you have performed a load or incremental load and rebuilt the detail table indexes, you need to re-enable integrity constraints (if any) and refresh the materialized views and materialized view indexes that are derived from that detail data. In a data warehouse environment, referential integrity constraints are normally enabled with the NOVALIDATE or RELY options. An important decision to make before performing a refresh operation is whether the refresh needs to be recoverable. Because materialized view data is redundant and can always be reconstructed from the detail tables, it may be preferable to disable logging on the materialized view. To disable logging and run incremental refresh non-recoverably, use the ALTER MATERIALIZED VIEW...NOLOGGING statement prior to REFRESH.

If the materialized view is being refreshed using the ON COMMIT method, then, following refresh operations, the alert log (alert_ <SID>.log) and the trace file (ora_<SID>_number.trc) should be consulted to check that no errors have occurred.


Go to previous page Go to next page
Oracle
Copyright © 1996-2000, Oracle Corporation.

All Rights Reserved.

Library

Product

Contents

Index