This chapter introduces integrated Oracle8 features for tuning enterprise-scale data warehouses. By intelligently tuning the system, the data layout, and the application, you can build a high-performance, scalable data warehouse.
Topics in this chapter include
Data warehousing applications process a substantial amount of data by means of many CPU- and I/O-bound, data-intensive tasks such as
The resources required to complete such tasks on many gigabytes of data distinguish data warehousing applications from other types of data processing. The bulk and complexity of your data may clearly indicate that you need to deploy multiple CPUs, investigate parallel processing, or consider specific data processing features that are directly relevant to the tasks at hand.
For example, in a typical data warehousing application, data-intensive tasks might be performed on 100 gigabytes of data. At a processing speed of 0.2 G to 2 G of data per hour per CPU, a single CPU might need from 2 days to more than 2 weeks to perform a task. With more than a single gigabyte of data (certainly with upwards of 10G), you need to consider increasing the number of CPUs.
Similarly, if you need to copy 10 gigabytes of data, consider that using Export/Import might take a single CPU 10 hours. By contrast, using parallel CREATE TABLE . . . AS SELECT on 10 CPUs might take only 1 hour.
Actual processing time depends on many factors, such as the complexity of the queries, the processing speed to which a particular hardware configuration can be tuned, and so on. Always run simple tests on your own system to find out its performance characteristics with regard to particular operations.
This section outlines Oracle8 features useful for building a data warehouse. It includes:
The ability to CREATE TABLE . . . AS SELECT in parallel enables you to reorganize extremely large tables efficiently. A serial approach to reorganizing or reclaiming space in a table containing tens to thousands of gigabytes of data may be prohibitive, and using Export/Import for such a task might result in an unacceptable amount of downtime. If you have sufficient temporary space available, you can instead use CREATE TABLE . . . AS SELECT to perform such tasks in parallel. With this approach, redefining integrity constraints is optional. This feature is often used for creating summary tables, which are precomputed results stored in the data warehouse.
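The approach might be sketched as follows; the table names, tablespace, and degree of parallelism are illustrative assumptions, not values from this guide:

```sql
-- Reorganize a large table into a new tablespace using parallel
-- CREATE TABLE . . . AS SELECT (all names here are hypothetical).
CREATE TABLE sales_reorg
  PARALLEL (DEGREE 10)
  TABLESPACE ts_new
AS SELECT * FROM sales;
```

After verifying the new table, you could drop the original, rename sales_reorg, and re-create any constraints and indexes as needed.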
The ability to create indexes in parallel benefits both data warehousing and OLTP applications. On extremely large tables, rebuilding an index may take a long time. DBAs periodically load large amounts of data and then rebuild indexes. With the ability to create indexes in parallel, you may be able to drop an index before loading new data and re-create it afterward.
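A common load cycle, sketched with hypothetical table and index names:

```sql
-- Drop the index before a large load, then re-create it in parallel.
DROP INDEX sales_cust_idx;

-- (load the new data here, for example with SQL*Loader)

CREATE INDEX sales_cust_idx
  ON sales (customer_id)
  PARALLEL (DEGREE 8);
```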
A fast full scan of an index is a faster alternative to a full table scan when an existing index already contains all the columns needed by the query. It can use multiblock I/O and can be parallelized just like a table scan. The INDEX_FFS hint forces a fast full index scan.
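For example, assuming a hypothetical index sales_cust_amt_idx on (customer_id, amount), the query below touches only indexed columns, so a fast full scan of the index can replace the table scan:

```sql
SELECT /*+ INDEX_FFS(s sales_cust_amt_idx) */
       customer_id, SUM(amount)
FROM   sales s
GROUP  BY customer_id;
```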
You can avoid downtime with very large or mission-critical tables by using partitions. You can divide a large table into multiple physical partitions based on partitioning criteria; in a data warehouse, for example, you can manage historical data by partitioning by date. You can then perform at the partition level all of the operations you might normally perform at the table level, such as backup and restore. You can add space for new data by adding a new partition, and delete old data by dropping an existing partition. Queries that use a key range to select from a partitioned table retrieve only the partitions that fall within that range. In this way partitions offer significant improvements in availability, administration, and table scan performance.
Note: For performance reasons, in Oracle8 partitioned tables should be used rather than partition views. Please see Oracle8 Migration for instructions on migrating from partition views to partitioned tables.
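A minimal sketch of date-based range partitioning on a partitioned table; the table name, partition names, and boundary values are illustrative:

```sql
CREATE TABLE sales (
  sale_date  DATE,
  amount     NUMBER )
PARTITION BY RANGE (sale_date) (
  PARTITION q1_1997 VALUES LESS THAN (TO_DATE('01-APR-1997','DD-MON-YYYY')),
  PARTITION q2_1997 VALUES LESS THAN (TO_DATE('01-JUL-1997','DD-MON-YYYY')) );

-- Add space for new data and remove old data at the partition level.
ALTER TABLE sales ADD PARTITION q3_1997
  VALUES LESS THAN (TO_DATE('01-OCT-1997','DD-MON-YYYY'));
ALTER TABLE sales DROP PARTITION q1_1997;
```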
You can use the ANALYZE command to analyze the storage characteristics of tables, indexes, and clusters to gather statistics which are then stored in the data dictionary. The optimizer uses these statistics in a cost-based approach to determine the most efficient execution plan for the SQL statements you issue. Note that statistics can be either computed or estimated, depending on the amount of overhead you are willing to allow for this purpose.
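For example (table names are hypothetical):

```sql
-- Compute exact statistics for a small dimension table;
-- estimate from a sample for a very large fact table.
ANALYZE TABLE products COMPUTE STATISTICS;
ANALYZE TABLE sales ESTIMATE STATISTICS SAMPLE 5 PERCENT;
```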
When very large amounts of data must be loaded, the destination table may be unavailable for an unacceptable amount of time. The ability to load data in parallel can dramatically reduce the amount of downtime necessary.
This section summarizes Oracle8 features useful for querying a data warehouse. It includes:
The Oracle Parallel Server option provides benefits important to both OLTP and data warehousing applications:
Oracle Parallel Server failover capability (the ability of the application to reconnect automatically if the connection to the database is broken) results in improved availability, a primary benefit for OLTP applications. Likewise, scalability in the number of users that can connect to the database is a major benefit in OLTP environments. OLTP performance on Oracle Parallel Server can scale as well, if an application's data is isolated onto a single server.
For data warehousing applications, scalability of performance is a primary benefit of Oracle Parallel Server. The architecture of Oracle Parallel Server allows parallel query to perform excellent load balancing of work at runtime. If a node in an Oracle Parallel Server cluster or MPP is temporarily slowed down, work that was originally assigned to parallel query servers on that node (but not yet commenced by those servers) may be performed by servers on other nodes, hence preventing that node from becoming a serious bottleneck. Even though Oracle Parallel Server is a cornerstone of parallel query on clusters and MPPs, in a mostly query environment the overhead on the distributed lock manager is minimal.
Knowledge about parallelism is incorporated into the Oracle8 cost-based optimizer. Parallel execution considerations are thus a fundamental component in arriving at query execution plans. In addition, you can control the trade-off of throughput for response time in plan selection.
The optimizer chooses intelligent defaults for the degree of parallelism based on available processors and the number of disk drives storing data the query will access. Access path choices (such as table scans vs. index access) take into account the degree of parallelism, resulting in plans that are optimized for parallel execution. Execution plans are more scalable, and there is improved correlation between optimizer cost and execution time for parallel query.
The initialization parameter OPTIMIZER_PERCENT_PARALLEL defines the weighting that the optimizer uses to minimize response time in its cost functions.
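A sketch of the corresponding init.ora setting; the value shown is illustrative:

```
# 0 (the default) costs plans for serial execution, favoring throughput;
# 100 costs plans at full parallelism, favoring response time.
OPTIMIZER_PERCENT_PARALLEL = 50
```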
Oracle8 provides improved performance through the use of parallel execution.
Parallel execution enables multiple processes to work together simultaneously to process a single query or DML statement. By dividing the task among multiple server processes, Oracle can execute the operation more quickly than if only one server process were used.
Parallel execution can dramatically improve performance for data-intensive data warehousing operations. It helps systems scale in performance when adding hardware resources. The greatest performance benefits are on symmetric multiprocessing (SMP), clustered, or massively parallel systems where query processing can be effectively spread out among many CPUs on a single system.
Regular B*-tree indexes work best when each key or key range references only a few records, such as employee names. Bitmap indexes, by contrast, work best when each key references many records, such as employee gender.
Bitmap indexing provides the same functionality as regular indexes, but uses a different internal representation, which makes it very fast and space efficient. Bitmap indexing benefits data warehousing applications that have large amounts of data and ad hoc queries, but a low level of concurrent transactions. It provides reduced response time for many kinds of ad hoc queries; considerably reduced space usage compared to other indexing techniques; and dramatic performance gains even on very low end hardware. Bitmap indexes can be created in parallel and are completely integrated with cost-based optimization.
See Also: "Using Bitmap Indexes" on page 10-13
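For example, a low-cardinality column such as gender is a good bitmap candidate; the table and index names here are illustrative:

```sql
CREATE BITMAP INDEX emp_gender_idx
  ON employees (gender)
  PARALLEL;
```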
One type of data warehouse design is known as a "star" schema. This typically consists of one or more very large "fact" tables and a number of much smaller "dimension" or reference tables. A star query is one that joins several of the dimension tables, usually by predicates in the query, to one of the fact tables.
Oracle cost-based optimization recognizes star queries and generates efficient execution plans for them; indeed, you must use cost-based optimization to get efficient star query execution. To enable cost-based optimization, simply ANALYZE your tables and be sure that the OPTIMIZER_MODE initialization parameter is set to its default value of CHOOSE.
Star transformation is a cost-based transformation designed to execute star queries efficiently. Whereas star optimization works well for schemas with a small number of dimensions and dense fact tables, star transformation works well for schemas with a large number of dimensions and sparse fact tables.
Star transformation is enabled by setting the initialization parameter STAR_TRANSFORMATION_ENABLED to TRUE. You can use the STAR_TRANSFORMATION hint to direct the optimizer to choose the best plan in which the transformation has been used.
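Sketched below with a hypothetical fact table (sales) and dimension tables (times, products); the hint asks the optimizer to consider the transformed plan:

```sql
-- init.ora: STAR_TRANSFORMATION_ENABLED = TRUE
SELECT /*+ STAR_TRANSFORMATION */
       t.calendar_month, p.category, SUM(s.amount)
FROM   sales s, times t, products p
WHERE  s.time_id    = t.time_id
AND    s.product_id = p.product_id
AND    t.calendar_year = 1997
GROUP  BY t.calendar_month, p.category;
```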
Recoverability has various levels: recovery from disk failure, human error, software failure, fire, and so on each requires different procedures, and Oracle8 provides only part of the solution. Organizations must decide how much to spend on backup and recovery by weighing the business cost of a long outage.
The NOLOGGING option enables you to perform certain operations without the overhead of generating a log. Even without logging, you can protect against disk failure by using disk mirroring or RAID technology. If you load your warehouse from tapes every day or week, you might recover satisfactorily from all failures simply by keeping copies of the tapes in several remote locations and reloading from tape when something goes wrong.
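For instance, a summary table can be built without redo generation, on the understanding that it can be rebuilt from source data after a failure (names are illustrative):

```sql
CREATE TABLE sales_summary
  NOLOGGING
AS SELECT customer_id, SUM(amount) AS total_amount
   FROM   sales
   GROUP  BY customer_id;
```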
At the other end of the spectrum, you could mirror disks, take backups and archive logs, and maintain a remote standby system. The mirrored disks prevent loss of availability from disk failure, while the backups and archived logs protect against total loss of data in the event of human error (such as dropping the wrong table) or software error (such as disk block corruption). In the event of fire, power failure, or other problems at the primary site, the standby site prevents long outages.