20
Understanding Parallel Execution Performance Issues

This chapter provides a conceptual explanation of parallel execution performance issues, and additional performance techniques.

See Also: Oracle8 Concepts, for basic principles of parallel execution.

See your operating system-specific Oracle documentation for more information about tuning while using parallel execution.

Understanding Parallel Execution Performance Issues

The Formula for Memory, Users, and Parallel Server Processes

Key to the tuning of parallel operations is an understanding of the relationship between memory requirements, the number of users (processes) a system can support, and the maximum number of parallel server processes. The goal is to obtain the dramatic performance enhancement made possible by parallelizing certain operations, and by using hash joins rather than sort merge joins. This performance goal must often be balanced with the need to support multiple users.

In considering the maximum number of processes a system can support, it is useful to divide the processes into three classes, based on their memory requirements. Table 20-1 defines high, medium, and low memory processes.

Analyze the maximum number of processes that can fit in memory as follows:

Figure 20-1 Formula for Memory/Users/Server Relationship

Table 20-1 Memory Requirements for Three Classes of Process

Class	Description
Low Memory Processes: 100K to 1MB	These processes include table scans; index lookups; index nested loop joins; single-row aggregates (such as sum or average with no GROUP BYs, or very few groups); sorts that return only a few rows; and direct loading. This class of Data Warehousing process is similar to OLTP processes in the amount of memory required. Process memory could be as low as a few hundred kilobytes of fixed overhead. You could potentially support thousands of users performing this kind of operation. You can take this requirement even lower by using the multithreaded server, and support even more users.
Medium Memory Processes: 1MB to 10MB	This class of process includes large sorts; sort merge join; GROUP BY or ORDER BY operations returning a large number of rows; parallel insert operations which involve index maintenance; and index creation. These processes require the fixed overhead needed by a low memory process, plus one or more sort areas, depending on the operation. For example, a typical sort merge join would sort both its inputs-resulting in two sort areas. GROUP BY or ORDER BY operations with many groups or rows also require sort areas. Look at the EXPLAIN PLAN output for the operation to identify the number and type of joins, and the number and type of sorts. Optimizer statistics in the plan show the size of the operations. When planning joins, remember that you do have a number of choices.
High Memory Processes: 10MB to 100MB	High memory processes include one or more hash joins, or a combination of one or more hash joins with large sorts. These processes require the fixed overhead needed by a low memory process, plus hash area. The hash area size required might range from 8MB to 32MB, and you might need two of them. If you are performing 2 or more serial hash joins, each process uses 2 hash areas. In a parallel operation, each parallel server process does at most 1 hash join at a time; therefore, you would need 1 hash area size per server. In summary, the amount of hash join memory for an operation equals parallel degree multiplied by hash area size, multiplied by the lesser of either 2, or the number of hash joins in the operation.

Class

Description

Low Memory Processes:

100K to 1MB

These processes include table scans; index lookups; index nested loop joins; single-row aggregates (such as sum or average with no GROUP BYs, or very few groups); sorts that return only a few rows; and direct loading.

This class of Data Warehousing process is similar to OLTP processes in the amount of memory required. Process memory could be as low as a few hundred kilobytes of fixed overhead. You could potentially support thousands of users performing this kind of operation. You can take this requirement even lower by using the multithreaded server, and support even more users.

Medium Memory Processes:

1MB to 10MB

This class of process includes large sorts; sort merge join; GROUP BY or ORDER BY operations returning a large number of rows; parallel insert operations which involve index maintenance; and index creation.

These processes require the fixed overhead needed by a low memory process, plus one or more sort areas, depending on the operation. For example, a typical sort merge join would sort both its inputs-resulting in two sort areas. GROUP BY or ORDER BY operations with many groups or rows also require sort areas.

Look at the EXPLAIN PLAN output for the operation to identify the number and type of joins, and the number and type of sorts. Optimizer statistics in the plan show the size of the operations. When planning joins, remember that you do have a number of choices.

High Memory Processes:

10MB to 100MB

High memory processes include one or more hash joins, or a combination of one or more hash joins with large sorts.

These processes require the fixed overhead needed by a low memory process, plus hash area. The hash area size required might range from 8MB to 32MB, and you might need two of them. If you are performing 2 or more serial hash joins, each process uses 2 hash areas. In a parallel operation, each parallel server process does at most 1 hash join at a time; therefore, you would need 1 hash area size per server.

In summary, the amount of hash join memory for an operation equals parallel degree multiplied by hash area size, multiplied by the lesser of either 2, or the number of hash joins in the operation.

Note: The process memory requirements of parallel DML and parallel DDL operations also depend upon the query portion of the statement.

Setting Buffer Pool Size for Parallel Operations

The formula whereby you can calculate the maximum number of processes your system can support (referred to here as max_processes) is:

In general, if max_processes is much bigger than the number of users, you can consider running parallel operations. If max_processes is considerably less than the number of users, you must consider other alternatives, such as those described in "How to Balance the Formula" on page 20-5.

With the exception of parallel update and delete, parallel operations do not generally benefit from larger buffer pool sizes. Parallel update and delete benefit from a larger buffer pool when they update indexes. This is because index updates have a random access pattern and I/O activity can be reduced if an entire index or its interior nodes can be kept in the buffer pool. Other parallel operations can benefit only if the buffer pool can be made larger and thereby accommodate the inner table or index for a nested loop join.

See Also: Oracle8 Concepts for a comparison of hash joins and sort merge joins.
"Tuning the Buffer Cache" on page 14-26 on setting buffer pool size.

How to Balance the Formula

Use the following techniques to balance the memory/users/server formula given in Figure 20-1:

Oversubscribe, with Attention to Paging

You can permit the potential workload to exceed the limits recommended in the formula. Total memory required, minus the SGA size, can be multiplied by a factor of 1.2, to allow for 20% oversubscription. Thus, if you have 1G of memory, you might be able to support 1.2G of demand: the other 20% could be handled by the paging system.

You must, however, verify that a particular degree of oversubscription will be viable on your system by monitoring the paging rate and making sure you are not spending more than a very small percent of the time waiting for the paging subsystem. Your system may perform acceptably even if oversubscribed by 60%, if on average not all of the processes are performing hash joins concurrently. Users might then try to use more than the available memory, so you must monitor paging activity in such a situation. If paging goes up dramatically, consider another alternative.

On average, no more than 5% of the time should be spent simply waiting in the operating system on page faults. More than 5% wait time indicates an I/O bound paging subsystem. Use your operating system monitor to check wait time: The sum of time waiting and time running equals 100%. If you are running close to 100% CPU, then you are not waiting. If you are waiting, it should not be on account of paging.

If wait time for paging devices exceeds 5%, it is a strong indication that you must reduce memory requirements in one of these ways:

Reducing the memory required for each class of process
Reducing the number of processes in memory-intensive classes
Adding memory

If the wait time indicates an I/O bottleneck in the paging subsystem, you could resolve this by striping.

Reduce the Number of Memory-Intensive Processes

Adjusting the Degree of Parallelism. You can adjust not only the number of operations that run in parallel, but also the degree of parallelism with which operations run. To do this, issue an ALTER TABLE statement with a PARALLEL clause, or use a hint. See the Oracle8 SQL Reference for more information.

You can limit the parallel pool by reducing the value of PARALLEL_MAX_SERVERS. Doing so places a system-level limit on the total amount of parallelism, and is easy to administer. More processes are then forced to run in serial mode.

Scheduling Parallel Jobs. Queueing jobs is another way to reduce the number of processes but not reduce parallelism. Rather than reducing parallelism for all operations, you may be able to schedule large parallel batch jobs to run with full parallelism one at a time, rather than concurrently. Queries at the head of the queue would have a fast response time, those at the end of the queue would have a slow response time. However, this method entails a certain amount of administrative overhead.

Decrease Data Warehousing Memory per Process

Note: The following discussion focuses upon the relationship of HASH_AREA_SIZE to memory, but all the same considerations apply to SORT_AREA_SIZE. The lower bound of SORT_AREA_SIZE, however, is not as critical as the 8MB recommended minimum HASH_AREA_SIZE.

If every operation performs a hash join and a sort, the high memory requirement limits the number of processes you can have. To allow more users to run concurrently you may need to reduce the DSS process memory.

Moving Processes from High to Medium Memory Requirements. You can move a process from the high-memory to the medium-memory class by changing from hash join to merge join. You can use initialization parameters to limit available memory and thus force the optimizer to stay within certain bounds.

To do this, you can reduce HASH_AREA_SIZE to well below the recommended minimum (for example, to 1 or 2MB). Then you can let the optimizer choose sort merge join more often (as opposed to telling the optimizer never to use hash joins). In this way, hash join can still be used for small tables: the optimizer has a memory budget within which it can make decisions about which join method to use. Alternatively, you can use hints to force only certain queries (those whose response time is not critical) to use sort-merge joins rather than hash joins.

Remember that the recommended parameter values provide the best response time. If you severely limit these values you may see a significant effect on response time.

Moving Processes from High or Medium Memory Requirements to Low Memory Requirements. If you need to support thousands of users, you must create access paths such that operations do not touch much data.

Decrease the demand for index joins by creating indexes and/or summary tables.
Decrease the demand for GROUP BY sorting by creating summary tables and encouraging users and applications to reference summaries rather than detailed data.
Decrease the demand for ORDER BY sorts by creating indexes on frequently sorted columns.

Decrease Parallelism for Multiple Users

In general there is a trade-off between parallelism for fast single-user response time and efficient use of resources for multiple users. For example, a system with 2G of memory and a HASH_AREA_SIZE of 32MB can support about 60 parallel server processes. A 10 CPU machine can support up to 3 concurrent parallel operations (2 * 10 * 3 = 60). In order to support 12 concurrent parallel operations, you could override the default parallelism (reduce it); decrease HASH_AREA_SIZE; buy more memory, or use some combination of these three strategies. Thus you could ALTER TABLE t PARALLEL (DEGREE 5) for all parallel tables t, set HASH_AREA_SIZE to 16MB, and increase PARALLEL_MAX_SERVERS to 120. By reducing the memory of each parallel server by a factor of 2, and reducing the parallelism of a single operation by a factor of 2, the system can accommodate 2 * 2 = 4 times more concurrent parallel operations.

The penalty for taking such an approach is that when a single operation happens to be running, the system will use just half the CPU resource of the 10 CPU machine. The other half will be idle until another operation is started.

To determine whether your system is being fully utilized, you can use one of the graphical system monitors available on most operating systems. These monitors often give you a better idea of CPU utilization and system performance than monitoring the execution time of an operation. Consult your operating system documentation to determine whether your system supports graphical system monitors.

Examples: Balancing Memory, Users, and Processes

The examples in this section show how to evaluate the relationship between memory, users, and parallel server processes, and balance the formula given in Figure 20-1. They show concretely how you might adjust your system workload so as to accommodate the necessary number of processes and users.

Example 1

Assume that your system has 1G of memory, the degree of parallelism is 10, and that your users perform 2 hash joins with 3 or more tables. If you need 300MB for the SGA, that leaves 700MB to accommodate processes. If you allow a generous hash area size (32MB) for best performance, then your system can support:

1 parallel operation (32MB * 10 * 2 = 640MB)
1 serial operation (32MB * 2 = 64MB)

This makes a total of 704MB. (Note that the memory is not significantly oversubscribed.)

Remember that every parallel, hash, or sort merge join operation takes a number of parallel server processes equal to twice the degree of parallelism (utilizing 2 server sets), and often each individual process of a parallel operation uses a lot of memory. Thus you can support many more users by having them run serially, or by having them run with less parallelism.

To service more users, you can drastically reduce hash area size to 2MB. You may then find that the optimizer switches some operations to sort merge join. This configuration can support 17 parallel operations, or 170 serial operations, but response times may be significantly higher than if you were using hash joins.

Notice the trade-off above: by reducing memory per process by a factor of 16, you can increase the number of concurrent users by a factor of 16. Thus the amount of physical memory on the machine imposes another limit on total number of parallel operations you can run involving hash joins and sorts.

Example 2

In a mixed workload example, consider a user population with diverse needs, as described in Table 20-2. In this situation, you would have to make some choices. You could not allow everyone to run hash joins-even though they outperform sort merge joins-because you do not have the memory to support this level of workload.

You might consider it safe to oversubscribe at 50% because of the infrequent batch jobs during the day: 700MB * 1.5 = 1.05GB. This would give you enough virtual memory for the total workload.

Table 20-2 How to Accommodate a Mixed Workload

User Needs	How to Accommodate
DBA: runs nightly batch jobs, and occasional batch jobs during the day. These might be parallel operations that do hash joins that use a lot of memory.	You might take 20 parallel server processes, and set HASH_AREA_SIZE to a mid-range value, perhaps 20MB, for a single powerful batch job in the high memory class. (This might be a big GROUP BY with join to produce a summary of data.) Twenty servers multiplied by 20MB equals 400MB of memory.
Analysts: interactive users who pull data into their spreadsheets	You might plan for 10 analysts running serial operations that use complex hash joins accessing a large amount of data. (You would not allow them to do parallel operations because of memory requirements.) Ten such serial processes at 40MB apiece equals 400MB of memory.
Users: Several hundred users doing simple lookups of individual customer accounts, making reports on already joined, partially summarized data	To support hundreds of users doing low memory processes at about 0.5MB apiece, you might reserve 200MB.

Example 3

Suppose your system has 2G of memory, and you have 200 parallel server processes and 100 users doing heavy data warehousing operations involving hash joins. You decide to leave such tasks as index retrievals and small sorts out of the picture, concentrating on the high memory processes. You might have 300 processes, of which 200 must come from the parallel pool and 100 are single threaded. One quarter of the total 2G of memory might be used by the SGA, leaving 1.5G of memory to handle all the processes. You could apply the formula considering only the high memory requirements, including a factor of 20% oversubscription:

Figure 20-2 Formula for Memory/User/Server Relationship: High-Memory Processes

Here, 5MB = 1.8G/300. Less than 5MB of hash area would be available for each process, whereas 8MB is the recommended minimum. If you must have 300 processes, you may need to force them to use other join methods in order to change them from the highly memory-intensive class to the moderately memory-intensive class. Then they may fit within your system's constraints.

Example 4

Consider a system with 2G of memory and 10 users who want to run intensive data warehousing parallel operations concurrently and still have good performance. If you choose parallelism of degree 10, then the 10 users will require 200 processes. (Processes running big joins need twice the number of parallel server processes as the degree of parallelism, so you would set PARALLEL_MAX_SERVERS to 10 * 10 * 2.) In this example each process would get 1.8G/200-or about 9MB of hash area-which should be adequate.

With only 5 users doing large hash joins, each process would get over 16MB of hash area, which would be fine. But if you want 32MB available for lots of hash joins, the system could only support 2 or 3 users. By contrast, if users are just computing aggregates the system needs adequate sort area size-and can have many more users.

Example 5

If a system with 2G of memory needs to support 1000 users, all of them running big operations, you must evaluate the situation carefully. Here, the per-user memory budget is only 1.8MB (that is, 1.8G divided by 1,000). Since this figure is at the low end of the medium memory process class, you must rule out parallel operations, which use even more resources. You must also rule out big hash joins. Each sequential process could require up to 2 hash areas plus the sort area, so you would have to set HASH_AREA_SIZE to the same value as SORT_AREA_SIZE, which would be 600K (1.8MB/3). Such a small hash area size is likely to be ineffective, so you may opt to disable hash joins altogether.

Given the organization's resources and business needs, is it reasonable for you to upgrade your system's memory? If memory upgrade is not an option, then you must change your expectations. To adjust the balance you might:

Accept the fact that the system will actually support a limited number of users doing big hash joins.
Expect to support the 1000 users doing index lookups and joins that do not require large amounts of memory. Sort merge joins require less memory, but throughput will go down because they are not as efficient as hash joins.
Give the users access to summary tables, rather than to the whole database.
Classify users into different groups, and give some groups more memory than others. Instead of all users doing sorts with a small sort area, you could have a few users doing high-memory hash joins, while most users use summary tables or do low-memory index joins. (You could accomplish this by forcing users in each group to use hints in their queries such that operations are performed in a particular way.)

Parallel Execution Space Management Issues

This section describes space management issues that come into play when using parallel execution.

ST (Space Transaction) Enqueue for Sorts and Temporary Data
External Fragmentation

These issues become particularly important for parallel operation running on a parallel server, the more nodes involved, the more tuning becomes critical.

ST (Space Transaction) Enqueue for Sorts and Temporary Data

Every space management transaction in the database (such as creation of temporary segments in PARALLEL CREATE TABLE, or parallel direct-load inserts of non-partitioned tables) is controlled by a single ST enqueue. A high transaction rate (more than 2 or 3 per minute) on the ST enqueue may result in poor scalability on Oracle Parallel Server systems with many nodes, or a timeout waiting for space management resources.

Try to minimize the number of space management transactions, in particular:

the number of sort space management transactions
the creation and removal of objects
transactions caused by fragmentation in a tablespace.

Use dedicated temporary tablespaces to optimize space management for sorts. This is particularly beneficial on a parallel server. You can monitor this using V$SORT_SEGMENT.

Set INITIAL and NEXT extent size to a value in the range of 1MB to 10MB. Processes may use temporary space at a rate of up to 1MB per second. Do not accept the default value of 40K for next extent size, because this will result in many requests for space per second.

If you are unable to allocate extents for various reasons, you can recoalesce the space by using the ALTER TABLESPACE ... COALESCE SPACE command. This should be done on a regular basis for temporary tablespaces in particular.

External Fragmentation

External fragmentation is a concern for parallel load, direct-load insert, and PARALLEL CREATE TABLE ... AS SELECT. Memory tends to become fragmented as extents are allocated and data is inserted and deleted. This may result in a fair amount of free space that is unusable because it consists of small, non-contiguous chunks of memory. To reduce external fragmentation on partitioned tables, set all extents to the same size. Set MINEXTENTS to the same value as NEXT, which should be equal to INITIAL; set PERCENT_INCREASE to zero. The system can handle this well with a few thousand extents per object, so you can set MAXEXTENTS to a few thousand. For tables that are not partitioned, the initial extent should be small.

Optimizing Parallel Execution on Oracle Parallel Server

This section describe several aspects of parallel execution on Oracle Parallel Server.

Lock Allocation

This section provides parallel execution tuning guidelines for optimal lock management on Oracle Parallel Server.

To optimize parallel execution on Oracle Parallel Server, you need to correctly set GC_FILES_TO_LOCKS. On Oracle Parallel Server a certain number of parallel cache management (PCM) locks are assigned to each data file. Data block address (DBA) locking in its default behavior assigns one lock to each block. During a full table scan a PCM lock must then be acquired for each block read into the scan. To speed up full table scans, you have three possibilities:

For data files containing truly read-only data, set the tablespace to read only. Then there will be no PCM locking at all.
Alternatively, for data that is mostly read-only, assign very few hashed PCM locks (for example, 2 shared locks) to each data file. Then these will be the only locks you have to acquire when you read the data.
If you want DBA or fine-grain locking, group together the blocks controlled by each lock, using the ! option. This has advantages over default DBA locking because with the default, you would need to acquire a million locks in order to read 1 million blocks. When you group the blocks you reduce the number of locks allocated by the grouping factor. Thus a grouping of !10 would mean that you would only have to acquire one tenth as many PCM locks as with the default. Performance improves due to the dramatically reduced amount of lock allocation. As a rule of thumb, performance with a grouping of !10 might be comparable to the speed of hashed locking.
To speed up parallel DML operations, consider using hashed locking rather than DBA locking. A parallel server process works on non-overlapping partitions; it is recommended that partitions not share files. You can thus reduce the number of lock operations by having only 1 hashed lock per file. Since the parallel server process only works on non-overlapping files, there will be no lock pings.

The following guidelines impact memory usage, and thus indirectly affect performance:

Never allocate PCM locks for datafiles of temporary tablespaces.
Never allocate PCM locks for datafiles that contain only rollback segments. These are protected by GC_ROLLBACK_LOCKS and GC_ROLLBACK_SEGMENTS.
Allocate specific PCM locks for the SYSTEM tablespace. This practice ensures that data dictionary activity such as space management never interferes with the data tablespaces at a cache management level (error 1575).
For example, on a read-only database with a data warehousing application's query-only workload, you might create 500 PCM locks on the SYSTEM tablespace in file 1, then create 50 more locks to be shared for all the data in the other files. Space management work will then never interfere with the rest of the database.

See Also: Oracle8 Parallel Server Concepts & Administration for a thorough discussion of PCM locks and locking parameters.

Allocation of Processes and Instances

Parallel execution assigns each instance a unique number, which is determined by the INSTANCE_NUMBER initialization parameter. The instance number regulates the order of instance startup.

Note: For Oracle Parallel Server, the PARALLEL_INSTANCE_GROUP parameter determines what instance group will be used for a particular operation. For more information, see Oracle8 Parallel Server Concepts & Administration.

Oracle computes a target degree of parallelism by examining the maximum of the degree for each table and other factors, before run time. At run time, a parallel operation is executed sequentially if insufficient parallel server processes are available. PARALLEL_MIN_PERCENT sets the minimum percentage of the target number of parallel server processes that must be available in order for the operation to run in parallel. When PARALLEL_MIN_PERCENT is set to n, an error message is sent if n percent parallel server processes are not available. If no parallel server processes are available, a parallel operation is executed sequentially.

Load Balancing for Multiple Concurrent Parallel Operations

Load balancing is the distribution of parallel server processes to achieve even CPU and memory utilization, and to minimize remote I/O and communication between nodes.

When multiple concurrent operations are running on a single node, load balancing is done by the operating system. For example, if there are 10 CPUs and 5 parallel server processes, the operating system distributes the 5 processes among the CPUs. If a second node is added, the operating system still distributes the workload.

For a parallel server, however, no single operating system performs the load balancing: instead, parallel execution performs this function.

If an operation requests more than one instance, allocation priorities involve table caching and disk affinity.

Thus, if there are 5 parallel server processes, it is advantageous for them to run on as many nodes as possible.

In Oracle Server release 8.0, allocation of processes and instances is based on instance groups. With instance groups a parallel server system will be partitioned into disjoint logical subsystems. Parallel resources will be allocated out of a particular instance group only if the parallel coordinator is part of the group. This approach supports application and data partitioning.

See Also: Oracle8 Parallel Server Concepts & Administration for more information about instance groups.

Disk Affinity

Some Oracle Parallel Server platforms use disk affinity. Without disk affinity, Oracle tries to balance the allocation evenly across instances; with disk affinity, Oracle tries to allocate parallel server processes for parallel table scans on the instances that are closest to the requested data. Disk affinity minimizes data shipping and internode communication on a shared nothing architecture. It can significantly increase parallel operation throughput and decrease response time.

Disk affinity is used for parallel table scans, parallel temporary tablespace allocation, parallel DML, and parallel index scan. It is not used for parallel table creation or parallel index creation. Access to temporary tablespaces preferentially uses local datafiles. It guarantees optimal space management extent allocation. Disks striped by the operating system are treated by disk affinity as a single unit.

In the following example of disk affinity, table T is distributed across 3 nodes, and a full table scan on table T is being performed.

Figure 20-3 Disk Affinity Example

If a query requires 2 instances, then two instances from the set 1, 2, and 3 are used.
If a query requires 3 instances, then instances 1, 2, and 3 are used.
If a query requires 4 instances, then all four instances are used.
If there are two concurrent operations against table T, each requiring 3 instances (and enough processes are available on the instances for both operations), then both operations will use instances 1, 2, and 3. Instance 4 will not be used. In contrast, without disk affinity instance 4 would be used.

Resource Timeout

A parallel DML transaction spanning Oracle Parallel Server instances may be waiting too long for a resource due to potential deadlock involving this transaction and other parallel or non-parallel DML transactions. Set the PARALLEL_TRANSACTION_RESOURCE_TIMEOUT parameter to specify how long a parallel DML transaction should wait for a resource before aborting.

See Also: Oracle8 SQL Reference

Parallel Execution Tuning Techniques

This section describes performance techniques for parallel operations.

Overriding the Default Degree of Parallelism

The default degree of parallelism is appropriate for reducing response time while guaranteeing use of CPU and I/O resources for any parallel operations. If an operation is I/O bound, you should consider increasing the default degree of parallelism. If it is memory bound, or several concurrent parallel operations are running, consider decreasing the default degree.

Oracle uses the default degree of parallelism for tables that have PARALLEL attributed to them in the data dictionary, or when the PARALLEL hint is specified. If a table does not have parallelism attributed to it, or has NOPARALLEL (the default) attributed to it, then that table is never scanned in parallel-regardless of the default degree of parallelism that would be indicated by the number of CPUs, instances, and devices storing that table.

Use the following guidelines when adjusting the degree of parallelism:

You can adjust the degree of parallelism either by using ALTER TABLE or by using hints.
To increase the number of concurrent parallel operations, reduce the degree of parallelism.
For I/O-bound parallel operations, first spread the data over more disks than there are CPUs. Then, increase parallelism in stages. Stop when the query becomes CPU bound.
For example, assume a parallel indexed nested loop join is I/O bound performing the index lookups, with #CPUs=10 and #disks=36. The default degree of parallelism is 10, and this is I/O bound. You could first try parallel degree 12. If still I/O bound, you could try parallel degree 24; if still I/O bound, you could try 36.

To override the default degree of parallelism:

Determine the maximum number of query servers your system can support.
Divide the parallel server processes among the estimated number of concurrent queries.

Rewriting SQL Statements

The most important issue for parallel query execution is ensuring that all parts of the query plan that process a substantial amount of data execute in parallel. Use EXPLAIN PLAN to verify that all plan steps have an OTHER_TAG of PARALLEL_TO_PARALLEL, PARALLEL_TO_SERIAL, PARALLEL_COMBINED_WITH_PARENT, or PARALLEL_COMBINED_WITH_CHILD. Any other keyword (or null) indicates serial execution, and a possible bottleneck.

By making the following changes you can increase the optimizer's ability to generate parallel plans:

Convert subqueries, especially correlated subqueries, into joins. Oracle can parallelize joins more efficiently than subqueries. This also applies to updates.
Use a PL/SQL function in the WHERE clause of the main query, instead of a correlated subquery.

Rewrite queries with distinct aggregates as nested queries. For example, rewrite

SELECT COUNT(DISTINCT C) FROM T;

SELECT COUNT(*)FROM (SELECT DISTINCT C FROM T);

Creating and Populating Tables in Parallel

Oracle cannot return results to a user process in parallel. If a query returns a large number of rows, execution of the query may indeed be faster; however, the user process can only receive the rows serially. To optimize parallel query performance with queries that retrieve large result sets, use PARALLEL CREATE TABLE ... AS SELECT or direct-load insert to store the result set in the database. At a later time, users can view the result set serially.

Note: Parallelism of the SELECT does not influence the CREATE statement. If the CREATE is parallel, however, the optimizer tries to make the SELECT run in parallel also.

When combined with the NOLOGGING option, the parallel version of CREATE TABLE ... AS SELECT provides a very efficient intermediate table facility.

For example:

CREATE TABLE summary PARALLEL NOLOGGING 
AS SELECT dim_1, dim_2 ..., SUM (meas_1) FROM facts
GROUP BY dim_1, dim_2;

These tables can also be incrementally loaded with parallel insert. You can take advantage of intermediate tables using the following techniques:

Common subqueries can be computed once and referenced many times. This may be much more efficient than referencing a complex view many times.
Decompose complex queries into simpler steps in order to provide application-level checkpoint/restart. For example, a complex multi-table join on a database 1 terabyte in size could run for dozens of hours. A crash during this query would mean starting over from the beginning. Using CREATE TABLE ... AS SELECT and/or PARALLEL INSERT AS SELECT, you can rewrite the query as a sequence of simpler queries that run for a few hours each. If a system failure occurs, the query can be restarted from the last completed step.
Materialize a Cartesian product. This may allow queries against star schemas to execute in parallel. It may also increase scalability of parallel hash joins by increasing the number of distinct values in the join column.
Consider a huge table of retail sales data that is joined to region and to department lookup tables. There are 5 regions and 25 departments. If the huge table is joined to regions using parallel hash partitioning, the maximum speedup is 5. Similarly, if the huge table is joined to departments, the maximum speedup is 25. But if a temporary table containing the Cartesian product of regions and departments is joined with the huge table, the maximum speedup is 125.
Efficiently implement manual parallel deletes by creating a new table that omits the unwanted rows from the original table, and then dropping the original table. Alternatively, you can use the convenient parallel delete feature, which can directly delete rows from the original table.
Create summary tables for efficient multidimensional drill-down analysis. For example, a summary table might store the sum of revenue grouped by month, brand, region, and salesperson.
Reorganize tables, eliminating chained rows, compressing free space, and so on, by copying the old table to a new table. This is much faster than export/import and easier than reloading.
Note: Be sure to use the ANALYZE command on newly created tables. Also consider creating indexes. To avoid I/O bottlenecks, specify a tablespace with at least as many devices as CPUs. To avoid fragmentation in allocating space, the number of files in a tablespace should be a multiple of the number of CPUs.

Creating Indexes in Parallel

Multiple processes can work together simultaneously to create an index. By dividing the work necessary to create an index among multiple server processes, the Oracle Server can create the index more quickly than if a single server process created the index sequentially.

Parallel index creation works in much the same way as a table scan with an ORDER BY clause. The table is randomly sampled and a set of index keys is found that equally divides the index into the same number of pieces as the degree of parallelism. A first set of query processes scans the table, extracts key,ROWID pairs, and sends each pair to a process in a second set of query processes based on key. Each process in the second set sorts the keys and builds an index in the usual fashion. After all index pieces are built, the parallel coordinator simply concatenates the pieces (which are ordered) to form the final index.

Parallel local index creation uses a single server set. Each server process in the set is assigned a table partition to scan, and to build an index partition for. Because half as many server processes are used for a given degree of parallelism, parallel local index creation can be run with a higher degree of parallelism.

You can optionally specify that no redo and undo logging should occur during index creation. This can significantly improve performance, but temporarily renders the index unrecoverable. Recoverability is restored after the new index is backed up. If your application can tolerate this window where recovery of the index requires it to be re-created, then you should consider using the NOLOGGING option.

The PARALLEL clause in the CREATE INDEX command is the only way in which you can specify the degree of parallelism for creating the index. If the degree of parallelism is not specified in the parallel clause of CREATE INDEX, then the number of CPUs is used as the degree of parallelism. If there is no parallel clause, index creation will be done serially.

Attention: When creating an index in parallel, the STORAGE clause refers to the storage of each of the subindexes created by the query server processes. Therefore, an index created with an INITIAL of 5MB and a PARALLEL DEGREE of 12 consumes at least 60MB of storage during index creation because each process starts with an extent of 5MB. When the query coordinator process combines the sorted subindexes, some of the extents may be trimmed, and the resulting index may be smaller than the requested 60MB.

When you add or enable a UNIQUE key or PRIMARY KEY constraint on a table, you cannot automatically create the required index in parallel. Instead, manually create an index on the desired columns using the CREATE INDEX command and an appropriate PARALLEL clause and then add or enable the constraint. Oracle then uses the existing index when enabling or adding the constraint.

Multiple constraints on the same table can be enabled concurrently and in parallel if all the constraints are already in the enabled novalidate state. In the following example, the ALTER TABLE ... ENABLE CONSTRAINT statement performs the table scan that checks the constraint in parallel:

CREATE TABLE a (a1 NUMBER CONSTRAINT ach CHECK (a1 > 0) ENABLE NOVALIDATE) 
PARALLEL 5; 
INSERT INTO a values (1); 
COMMIT; 
ALTER TABLE a ENABLE CONSTRAINT ach;

See Also: For more information on how extents are allocated when using the parallel query feature, see Oracle8 Concepts.
Refer to the Oracle8 SQL Reference for the complete syntax of the CREATE INDEX command.

Refreshing Tables in Parallel

Parallel DML combined with the updatable join views facility provides an efficient solution for refreshing the tables of a data warehouse system. To refresh tables is to update them with the differential data generated from the OLTP production system.

In the following example, assume that you want to refresh a table named CUSTOMER(c_key, c_name, c_addr). The differential data contains either new rows or rows that have been updated since the last refresh of the data warehouse. In this example, the updated data is shipped from the production system to the data warehouse system by means of ASCII files. These files must be loaded into a temporary table, named DIFF_CUSTOMER, before starting the refresh process. You can use SQL Loader with both the parallel and direct options to efficiently perform this task.

Once DIFF_CUSTOMER is loaded, the refresh process can be started. It is performed in two phases:

updating the table
inserting the new rows in parallel

Updating the Table

A straightforward SQL implementation of the update uses subqueries:

UPDATE customer 
SET(c_name, c_addr) = 
(SELECT c_name, c_addr
FROM diff_customer
WHERE diff_customer.c_key = customer.c_key)
WHERE c_key IN(SELECT c_key FROM diff_customer);

Unfortunately, the two subqueries in the preceding statement affect the performance.

An alternative is to rewrite this query using updatable join views. To do this you must first add a primary key constraint to the DIFF_CUSTOMER table to ensure that the modified columns map to a key-preserved table:

CREATE UNIQUE INDEX diff_pkey_ind on diff_customer(c_key) 
PARALLEL NOLOGGING;
ALTER TABLE diff_customer ADD PRIMARY KEY (c_key);

The CUSTOMER table can then be updated with the following SQL statement:

UPDATE /*+PARALLEL(customer,12)*/ customer
(SELECT customer.c_name as c_name,customer.c_addr as c_addr,
diff_customer.c_name as c_newname, diff_customer.c_addr as c_newaddr
FROM customer, diff_customer
WHERE customer.c_key = diff_customer.c_key)
SET c_name = c_newname, c_addr = c_newaddr;

If the CUSTOMER table is partitioned, parallel DML can be used to further improve the response time. It could not be used with the original SQL statement because of the subquery in the SET clause.

See Also: "Rewriting SQL Statements" on page 20-18
Oracle8 Application Developer's Guide for information about key-preserved tables

Inserting the New Rows into the Table in Parallel

The last phase of the refresh process consists in inserting the new rows from the DIFF_CUSTOMER to the CUSTOMER table. Unlike the update case, you cannot avoid having a subquery in the insert statement:

INSERT /*+PARALLEL(customer,12)*/ INTO customer
SELECT * FROM diff_customer
WHERE diff_customer.c_key NOT IN (SELECT /*+ HASH_AJ */ key FROM customer);

But here, the HASH_AJ hint transforms the subquery into an anti-hash join. (The hint is not required if the parameter ALWAYS_ANTI_JOIN is set to hash in the initialization file). Doing so allows you to use parallel insert to execute the preceding statement very efficiently. Note that parallel insert is applicable even if the table is not partitioned.

Using Hints with Cost Based Optimization

Cost-based optimization is a highly sophisticated approach to finding the best execution plan for SQL statements. Oracle automatically uses cost-based optimization with parallel execution.

Attention: You must use ANALYZE to gather current statistics for cost-based optimization. In particular, tables used in parallel should always be analyzed. Always keep your statistics current by running ANALYZE after DDL and DML operations.

Use discretion in employing hints. If used, hints should come as a final step in tuning, and only when they demonstrate a necessary and significant performance advantage. In such cases, begin with the execution plan recommended by cost-based optimization, and go on to test the effect of hints only after you have quantified your performance expectations. Remember that hints are powerful; if you use them and the underlying data changes you may need to change the hints. Otherwise, the effectiveness of your execution plans may deteriorate.

Always use cost-based optimization unless you have an existing application that has been hand-tuned for rule-based optimization. If you must use rule-based optimization, rewriting a SQL statement can give orders of magnitude improvements.

Note: If any table in a query has a parallel degree greater than one (including the default degree), Oracle uses the cost-based optimizer for that query-even if OPTIMIZER_MODE = RULE, or if there is a RULE hint in the query itself.

See Also: "OPTIMIZER_PERCENT_PARALLEL" on page 19-5. This parameter controls parallel awareness.

Tuning Parallel Insert Performance

This section provides an overview of parallel operation functionality.

See Also: Oracle8 Concepts for a detailed discussion of parallel Data Manipulation Language and degree of parallelism.
For a discussion of parallel DML affinity, please see Oracle8 Parallel Server Concepts & Administration.

INSERT

Oracle8 INSERT functionality can be summarized as follows:

Table 20-3 Summary of INSERT Features

Insert Type

Parallel

Serial

NOLOGGING

Conventional

Yes

Direct Load Insert (Append)

Yes: requires

* ALTER SESSION ENABLE PARALLEL DML

* Table PARALLEL attribute or PARALLEL hint

* APPEND hint (optional)

Yes: requires

* APPEND hint

Yes: requires

* NOLOGGING attribute set for table or partition

If parallel DML is enabled and there is a PARALLEL hint or PARALLEL attribute set for the table in the data dictionary, then inserts will be parallel and appended, unless a restriction applies. If either the PARALLEL hint or PARALLEL attribute is missing, then the insert is performed serially.

Direct-Load INSERT

Append mode is the default during a parallel insert: data is always inserted into a new block which is allocated to the table. Therefore the APPEND hint is optional. You should use append mode to increase the speed of insert operations-but not when space utilization needs to be optimized. You can use NOAPPEND to override append mode.

Note that the APPEND hint applies to both serial and parallel insert: even serial insert will be faster if you use it. APPEND, however, does require more space and locking overhead.

You can use NOLOGGING with APPEND to make the process even faster. NOLOGGING means that no redo log is generated for the operation. NOLOGGING is never the default; use it when you wish to optimize performance. It should not normally be used when recovery is needed for the table or partition. If recovery is needed, be sure to take a backup immediately after the operation. Use the ALTER TABLE [NO]LOGGING statement to set the appropriate value.

See Also: Oracle8 Concepts

Parallelizing INSERT, UPDATE, and DELETE

When the table or partition has the PARALLEL attribute in the data dictionary, that attribute setting is used to determine parallelism of UPDATE and DELETE statements as well as queries. An explicit PARALLEL hint for a table in a statement overrides the effect of the PARALLEL attribute in the data dictionary.

You can use the NOPARALLEL hint to override a PARALLEL attribute for the table in the data dictionary. Note, in general, that hints take precedence over attributes.

DML operations are considered for parallelization only if the session is in a PARALLEL DML enabled mode. (Use ALTER SESSION ENABLE PARALLEL DML to enter this mode.) The mode does not affect parallelization of queries or of the query portions of a DML statement.

See Also: Oracle8 Concepts for more information on parallel INSERT, UPDATE and DELETE.

Parallelizing INSERT ... SELECT

In the INSERT... SELECT statement you can specify a PARALLEL hint after the INSERT keyword, in addition to the hint after the SELECT keyword. The PARALLEL hint after the INSERT keyword applies to the insert operation only, and the PARALLEL hint after the SELECT keyword applies to the select operation only. Thus parallelism of the INSERT and SELECT operations are independent of each other. If one operation cannot be performed in parallel, it has no effect on whether the other operation can be performed in parallel.

The ability to parallelize INSERT causes a change in existing behavior, if the user has explicitly enabled the session for parallel DML, and if the table in question has a PARALLEL attribute set in the data dictionary entry. In that case existing INSERT ... SELECT statements that have the select operation parallelized may also have their insert operation parallelized.

Note also that if you query multiple tables, you can specify multiple SELECT PARALLEL hints and multiple PARALLEL attributes.

Example

Add the new employees who were hired after the acquisition of ACME.

INSERT /*+ PARALLEL(emp,4) */ INTO emp
SELECT /*+ PARALLEL(acme_emp,4) */ * 
FROM acme_emp;

The APPEND keyword is not required in this example, because it is implied by the PARALLEL hint.

Parallelizing UPDATE and DELETE

The PARALLEL hint (placed immediately after the UPDATE or DELETE keyword) applies not only to the underlying scan operation, but also to the update/delete operation. Alternatively, you can specify update/delete parallelism in the PARALLEL clause specified in the definition of the table to be modified.

If you have explicitly enabled parallel DML for the session or transaction, UPDATE/DELETE statements that have their query operation parallelized may also have their UPDATE/DELETE operation parallelized. Any subqueries or updatable views in the statement may have their own separate parallel hints or clauses, but these parallel directives do not affect the decision to parallelize the update or delete. If these operations cannot be performed in parallel, it has no effect on whether the UPDATE or DELETE portion can be performed in parallel.

Parallel UPDATE and DELETE can be done only on partitioned tables.

Example 1

Give a 10% salary raise to all clerks in Dallas.

UPDATE /*+ PARALLEL(emp,5) */ emp 
SET sal=sal * 1.1
WHERE job='CLERK' and 
deptno in 
(SELECT deptno FROM dept WHERE location='DALLAS');

The PARALLEL hint is applied to the update operation as well as to the scan.

Example 2

Fire all employees in the accounting department, which will now be outsourced.

DELETE /*+ PARALLEL(emp,2) */ FROM emp
WHERE deptno IN 
(SELECT deptno FROM dept WHERE dname='ACCOUNTING');

Again, the parallelism will be applied to the scan as well as update operation on table EMP.

Additional PDML Examples

The following examples show the use of parallel DML.

Note: As these examples demonstrate, you must enable parallel DML before using the PARALLEL or APPEND hints. You must issue a COMMIT or ROLLBACK command immediately after executing parallel INSERT, UPDATE, or DELETE. You can issue no other SQL commands before committing or rolling back.

The following statement enables parallel DML:

ALTER SESSION ENABLE PARALLEL DML;

Serial as well as parallel direct-load insert requires commit or rollback immediately afterwards.

INSERT   /*+ APPEND NOPARALLEL(table1) */ INTO table1

A select statement issued at this point would fail, with an error message, because no SQL can be performed before a COMMIT or ROLLBACK is issued.

ROLLBACK;

After this ROLLBACK, a SELECT statement will succeed:

SELECT * FROM V$PQ_SESSTAT;

Parallel update likewise requires commit or rollback immediately afterwards:

UPDATE   /*+ PARALLEL(table1,2) */ table1 
SET col1 = col1 + 1; 
COMMIT; 
SELECT * FROM V$PQ_SESSTAT;

As does parallel delete:

DELETE   /*+ PARALLEL(table3,2) */ FROM table3 
WHERE col2 < 5; 
COMMIT; 
SELECT * FROM V$PQ_SESSTAT;

20Understanding Parallel Execution Performance Issues

Understanding Parallel Execution Performance Issues

The Formula for Memory, Users, and Parallel Server Processes

Figure 20-1 Formula for Memory/Users/Server Relationship

Table 20-1 Memory Requirements for Three Classes of Process

Setting Buffer Pool Size for Parallel Operations

How to Balance the Formula

Oversubscribe, with Attention to Paging

Reduce the Number of Memory-Intensive Processes

Decrease Data Warehousing Memory per Process

Decrease Parallelism for Multiple Users

Examples: Balancing Memory, Users, and Processes

Example 1

Example 2

Table 20-2 How to Accommodate a Mixed Workload

Example 3

Figure 20-2 Formula for Memory/User/Server Relationship: High-Memory Processes

Example 4

Example 5

Parallel Execution Space Management Issues

ST (Space Transaction) Enqueue for Sorts and Temporary Data

External Fragmentation

Optimizing Parallel Execution on Oracle Parallel Server

Lock Allocation

Allocation of Processes and Instances

Load Balancing for Multiple Concurrent Parallel Operations

Disk Affinity

Figure 20-3 Disk Affinity Example

Resource Timeout

Parallel Execution Tuning Techniques

Overriding the Default Degree of Parallelism

Rewriting SQL Statements

Creating and Populating Tables in Parallel

Creating Indexes in Parallel

Refreshing Tables in Parallel

Updating the Table

Inserting the New Rows into the Table in Parallel

Using Hints with Cost Based Optimization

Tuning Parallel Insert Performance

INSERT

Table 20-3 Summary of INSERT Features

Direct-Load INSERT

Parallelizing INSERT, UPDATE, and DELETE

Parallelizing INSERT ... SELECT

Parallelizing UPDATE and DELETE

Additional PDML Examples

20
Understanding Parallel Execution Performance Issues