21
Diagnosing Parallel Execution Performance Problems

This section summarizes common tools and techniques you can use to obtain performance feedback on parallel operations.

See Also: Oracle8 Concepts, for basic principles of parallel execution.

See your operating system-specific Oracle documentation for more information about tuning while using parallel execution.

Diagnosing Problems

Use the decision tree in Figure 21-1 to diagnose parallel performance problems. Some key issues are the following:

Quantify your performance expectations, to determine whether there is a problem.
Determine whether a problem pertains to optimization (such as an inefficient plan, which may require re-analyzing tables or adding hints) or to execution (such as simple operations like scanning, loading, grouping, or indexing running much slower than published guidelines).
Determine whether the problem arises when running in parallel (such as load imbalance or resource bottleneck) or whether the problem is also present when running serially.
Figure 21-1 Parallel Execution Performance Checklist

Is There Regression?

Does parallel execution's actual performance deviate from what you expected? If performance is as you expected, can you justify the notion that there is a performance problem? Perhaps you have a desired outcome in mind, to which you are comparing the current outcome. Perhaps you have a justifiable performance expectation which the system is not achieving. You might have achieved this level of performance or particular execution plan in the past, but now, with a similar environment and operation, this is not being met.

If performance is not as you expected, can you quantify the deviation? For data warehousing operations, the execution plan is key. For critical data warehousing operations, save the EXPLAIN PLAN results. Then, as you analyze the data, reanalyze, upgrade Oracle, and load in new data over the course of time, you can compare any new execution plan with the old plan. You can take this approach either proactively or reactively.

Alternatively, you may find that you get a plan that works better if you use hints. You may want to understand why hints were necessary, and figure out how to get the optimizer to generate the desired plan without the hints. Try increasing the statistical sample size: better statistics may give you a better plan. If you had to use a PARALLEL hint, look to see whether you had OPTIMIZER_PERCENT_PARALLEL set to 100%.

Is There a Plan Change?

If there has been a change in the execution plan, determine whether the plan is (or should be) parallel or serial.

Is There a Parallel Plan?

If the execution plan is (or should be) parallel:

If you want a parallel plan, but the optimizer has not given you one, try increasing OPTIMIZER_PERCENT_PARALLEL to 100 and see if this improves performance.
Study the EXPLAIN PLAN output. Did you analyze all the tables? Perhaps you need to use hints in a few cases. Verify that the hint gives better results.

See Also: Parallel EXPLAIN PLAN tags are defined in Table 23-2.

Is There a Serial Plan?

If the execution plan is (or should be) serial, consider the following strategies:

Use an index. Sometimes adding the right index can greatly improve performance. Consider adding an extra column to the index: perhaps your operation could obtain all its data from the index, and not require a table scan. Perhaps you need to use hints in a few cases. Verify that the hint gives better results.
If you do not analyze often, and you can spare the time, it is a good practice to compute statistics. This particularly important if you are performing many joins; it will result in better plans. Alternatively, you can estimate statistics.
Note: Using different sample sizes can cause the plan to change. Generally, the higher the sample size, the better the plan.
Use histograms for non-uniform distributions.
Check initialization parameters to be sure the values are reasonable.
Replace bind variables with literals.
Note whether execution is I/O or CPU bound, then check the optimizer cost model.
Convert subqueries to joins.
Use CREATE TABLE ... AS SELECT to break a complex operation into smaller pieces. With a large query referencing five or six tables, it may be difficult to determine which part of the query is taking the most time. You can isolate the troublesome parts of the query by breaking it into steps and analyzing each step.

See Also: Oracle8 Concepts regarding CREATE TABLE ... AS SELECT
"Step 3: Analyzing Data" on page 19-45 regarding COMPUTE and ESTIMATE

Is There Parallel Execution?

If the cause of regression cannot be traced to problems in the plan, then the problem must be an execution issue. For data warehousing operations, both serial and parallel, consider memory. Check the paging rate and make sure the system is using memory as effectively as possible. Check buffer, sort, and hash area sizing. After you run a query or DML operation, you can look at the V$SESSTAT, V$PQ_SESSTAT and V$PQ_SYSSTAT views to see the number of server processes used, and other information for the session and system.

Is There Skew?

If parallel execution is occurring, is there unevenness in workload distribution (skew)? For example, if there are 10 CPUs and a single user, you can see whether the workload is evenly distributed across CPUs. This may vary over time, with periods that are more or less I/O intensive, but in general each CPU should have roughly the same amount of activity.

The statistics in V$PQ_TQSTAT show rows produced and consumed per parallel server process. This is a good indication of skew, and does not require single user operation.

Operating system statistics show you the per-processor CPU utilization and per-disk I/O activity. Concurrently running tasks make it harder to see what is going on, however. It can be useful to run in single-user mode and check operating system monitors which show system level CPU and I/O activity.

When workload distribution is unbalanced, a common culprit is the presence of skew in the data. For a hash join, this may be the case if the number of distinct values is less than the degree of parallelism. When joining two tables on a column with only 4 distinct values, you will not get scaling on more than 4. If you have 10 CPUs, 4 of them will be saturated but 6 will be idle. To avoid this problem, change the query: use temporary tables to change the join order such that all operations have more values in the join column than the number of CPUs.

If I/O problems occur you may need to reorganize your data, spreading it over more devices. If parallel execution problems occur, check to be sure you have followed the recommendation to spread data over at least as many devices as CPUs.

If there is no skew in workload distribution, check for the following conditions:

Is there device contention? Are there enough disk controllers to provide adequate I/O bandwidth?
Is the system I/O bound, with too little parallelism? If so, consider increasing parallelism up to the number of devices.
Is the system CPU bound, with too much parallelism? Check the operating system CPU monitor to see whether a lot of time is being spent in system calls. The resource may be overcommitted, and too much parallelism may cause processes to compete with themselves.
Are there more concurrent users than the system can support?

Executing Parallel SQL Statements

After analyzing your tables and indexes, you should be able to run operations and see speedup that scales linearly with the degree of parallelism used. The following operations should scale:

table scans
nested loop join
sort merge join
hash join
"not in"
group by
select distinct
union and union all
aggregation
PL/SQL functions called from SQL
order by
create table as select
create index
rebuild index
rebuild index partition
move partition
split partition
update
delete
insert ... select
enable constraint
star transformation

Start with simple parallel operations. Evaluate total I/O throughput with SELECT COUNT(*) FROM facts. Evaluate total CPU power by adding a complex WHERE clause. I/O imbalance may suggest a better physical database layout. After you understand how simple scans work, add aggregation, joins, and other operations that reflect individual aspects of the overall workload. Look for bottlenecks.

Besides query performance you should also monitor parallel load, parallel index creation, and parallel DML, and look for good utilization of I/O and CPU resources.

Using EXPLAIN PLAN to See How an Operation Is Parallelized

Use an EXPLAIN PLAN statement to view the sequential and parallel operation plan. The OTHER_TAG column of the plan table summarizes how each plan step is parallelized, and describes the text of the operation used by parallel server processes for each operation. Optimizer cost information for selected plan steps is given in the COST, BYTES, and CARDINALITY columns. Table 23-2 summarizes the meaning of the OTHER_TAG column.

EXPLAIN PLAN thus provides detailed information as to how specific operations are being performed. You can then change the execution plan for better performance. For example, if many steps are serial (where OTHER_TAG is blank, serial to parallel, or parallel to serial), then the parallel controller could be a bottleneck. Consider the following SQL statement, which summarizes sales data by region:

EXPLAIN PLAN SET STATEMENT_ID = `Jan_Summary' FOR
SELECT dim_1 SUM(meas1) FROM facts WHERE dim_2 < `02-01-1995'
GROUP BY dim_1

This script extracts a compact hierarchical plan from the EXPLAIN PLAN output:

SELECT 
SUBSTR( LPAD(' ',2*(level-1)) ||
DECODE(id, 0, statement_id, operation) ||
' '  || options || ' ' || object_name ||
' [' || partition_start || ',' || partition_stop ||
'] ' || other_tag, 
1, 79)  "step [start,stop] par"
FROM plan_table
START WITH id = 0
CONNECT BY PRIOR id = parent_id
AND PRIOR NVL(statement_id,' ') = 
NVL(statement_id);

Following is the query plan for "Jan_Summary":

Jan_Summary [,]
SORT GROUP BY [,] PARALLEL_TO_SERIAL
TABLE ACCESS FULL facts  [NUMBER(1),NUMBER(1)] PARALLEL_TO_PARALLEL;

Each parallel server scans a portion of the first partition of the facts table. All other partitions are pruned, as shown by the stop and start partition number.

Figure 21-2 illustrates how, with the PARALLEL_TO_PARALLEL keyword, data from the partitioned table is redistributed to a second set of parallel server processes for parallel grouping. Query server set 1 executes the table scan from the preceding example. The rows coming out are repartitioned through the table queue to parallel server set 2, which executes the GROUP BY operation. Because the GROUP BY operation indicates PARALLEL_TO_SERIAL, another table queue collects its results and sends it to the parallel coordinator, and then to the user.

Figure 21-2 Data Redistribution among Parallel Server Processes

As a rule, if the PARALLEL_TO_PARALLEL keyword exists, there will be two sets of parallel server processes. This means that for grouping, sort merge, or hash joins, twice the number of parallel server processes are assigned to the operation. This requires redistribution of data or rows from set 1 to set 2. If there is no PARALLEL_TO_PARALLEL keyword, then the operation gets just one set of servers. Such serial processes include aggregations, such as COUNT(*) FROM facts or SELECT * FROM facts WHERE DATE = '7/1/94'.

For non-distributed operations, the OBJECT_NODE column gives the name of the table queue. If the PARALLEL_TO_PARALLEL keyword exists, then the EXPLAIN PLAN of the parent operation should have SQL that references the child table queue in its FROM clause. In this way the plan describes the order in which the output from operations is consumed.

Using the Dynamic Performance Views

Dynamic performance views list internal Oracle8 data structures and statistics that you can query periodically to monitor progress of a long-running operation. When used in conjunction with data dictionary views, these tables provide a wealth of information. The challenge is visualizing the data and then acting upon it.

Note: On Oracle Parallel Server, global versions of these views aggregate the information over multiple instances. The global views have analogous names such as GV$FILESTAT for V$FILESTAT, and so on.

See Also: Oracle8 Parallel Server Concepts & Administration for more information about global dynamic performance views.

V$FILESTAT

This view sums read and write requests, number of blocks, and service times for every datafile in every tablespace. It can help you diagnose I/O problems and workload distribution problems.

The file numbers listed in V$FILESTAT can be joined to those in the DBA_DATA_FILES view to group I/O by tablespace or to find the filename for a given file number. By doing ratio analysis you can determine the percentage of the total tablespace activity used by each file in the tablespace. If you make a practice of putting just one large, heavily accessed object in a tablespace, you can use this technique to identify objects that have a poor physical layout.

You can further diagnose disk space allocation problems using the DBA_EXTENTS view. Ensure that space is allocated evenly from all files in the tablespace. Monitoring V$FILESTAT during a long-running operation and correlating I/O activity to the EXPLAIN PLAN output is a good way to follow progress.

V$PARAMETER

This view lists the name, current value, and default value of all system parameters. In addition, the view indicates whether the parameter may be modified online with an ALTER SYSTEM or ALTER SESSION command.

V$PQ_SESSTAT

This view is valid only when queried from a session that is executing parallel SQL statements. Thus it cannot be used to monitor a long running operation. It gives summary statistics about the parallel statements executed in the session, including total number of messages exchanged between server processes and the actual number of parallel server processes used.

V$PQ_SLAVE

This view tallies the current and total CPU time and number of messages sent and received per parallel server process. It can be monitored during a long-running operation. Verify that there is little variance among processes in CPU usage and number of messages processed. A variance may indicate a load-balancing problem. Attempt to correlate the variance to a variance in the base data distribution. Extreme imbalance could indicate that the number of distinct values in a join column is much less than the degree of parallelism. See "Parallel CREATE TABLE ... AS SELECT" in Oracle8 Concepts for a possible workaround.

V$PQ_SYSSTAT

The V$PQ_SYSSTAT view aggregates session statistics from all parallel server processes. It sums the total parallel server message traffic, and gives the status of the pool of parallel server processes.

This view can help you determine the appropriate number of parallel server processes for an instance. The statistics that are particularly useful are "Servers Busy", "Servers Idle", "Servers Started", and "Servers Shutdown".

Periodically examine V$PQ_SYSSTAT to determine whether the parallel server processes for the instance are actually busy, as follows:

SELECT * FROM V$PQ_SYSSTAT 
WHERE statistic = "Servers Busy";

STATISTIC             VALUE
--------------------- -----------
Servers Busy          70

V$PQ_TQSTAT

This view provides a detailed report of message traffic at the level of the table queue. It is valid only when queried from a session that is executing parallel SQL statements. A table queue is the pipeline between parallel server groups or between the parallel coordinator and a parallel server group or between a parallel server group and the coordinator. These table queues are represented in the plan by the tags PARALLEL_TO_PARALLEL, SERIAL_TO_PARALLEL, or PARALLEL_TO_SERIAL, respectively.

The view contains a row for each parallel server process that reads or writes each table queue. A table queue connecting 10 consumers to 10 producers will have 20 rows in the view. Sum the bytes column and group by TQ_ID (table queue identifier) for the total number of bytes sent through each table queue. Compare this with the optimizer estimates; large variations may indicate a need to analyze the data using a larger sample.

Compute the variance of bytes grouped by TQ_ID. Large variances indicate workload imbalance. Drill down on large variances to determine whether the producers start out with unequal distributions of data, or whether the distribution itself is skewed. The latter may indicate a low number of distinct values.

For many of the dynamic performance views, the system parameter TIMED_STATISTICS must be set to TRUE in order to get the most useful information. You can use ALTER SYSTEM to turn TIMED_STATISTICS on and off dynamically.

V$SESSTAT and V$SYSSTAT

The V$SESSTAT view provides statistics related to parallel execution for each session. The statistics include total number of queries, DML and DDL statements executed in a session and the total number of intra- and inter-instance messages exchanged during parallel execution during the session.

V$SYSSTAT does the same as V$SESSTAT for the entire system.

Querying the Dynamic Performance Views: Example

The following example illustrates output from two of these views:

SQLDBA> update /*+ parallel (iemp, 2) */ iemp set empno = empno +1; 
91 rows processed. 
SQLDBA> commit; 
Statement processed. 
SQLDBA> select * from v$pq_sesstat; 
STATISTIC                      LAST_QUERY SESSION_TOTAL 
------------------------------ ---------- ---------- 
Queries Parallelized                    0          0 
DML Parallelized                        1          2 
DFO Trees                               1          2 
Server Threads                          2          0 
Allocation Height                       2          0 
Allocation Width                        0          0 
Local Msgs Sent                        34         60 
Distr Msgs Sent                         0          0 
Local Msgs Recv'd                      34         60 
Distr Msgs Recv'd                       0          0 
11 rows selected. 

SQLDBA> select * from v$pq_sysstat; 
STATISTIC                      VALUE 
------------------------------ ---------- 
Servers Busy                            0 
Servers Idle                            2 
Servers Highwater                       2 
Server Sessions                         4 
Servers Started                         2 
Servers Shutdown                        0 
Servers Cleaned Up                      0 
Queries Initiated                       0 
DML Initiated                           2 
DFO Trees                               2 
Local Msgs Sent                        60 
Distr Msgs Sent                         0 
Local Msgs Recv'd                      60 
Distr Msgs Recv'd                       0 
15 rows selected.

In V$PQ_SESSTAT, some of the statistics provide the following information.

DML Parallelized

number of statements with insert, delete and update that were parallelized by the last operation and by this session

Queries Parallelized

number of all other parallel statements

DFO Trees

number of fragments of query plan that were parallelized

Server Threads

total number of server processes (typically 2x degree)

Allocation Height

requested number of servers on each instance

Allocation Width

requested number of instances

In V$PQ_SYSSTAT, the "DML Initiated" statistic indicates the number of DML operations done in the system.

Note that statements such as INSERT ... SELECT are treated as a single DML statement, not as one DML statement and one query.

See Also: Oracle8 SQL Reference for information about statistics.

Checking Operating System Statistics

There is considerable overlap between information available in Oracle and information available though operating system utilities (such as sar and vmstat on UNIX-based systems). Operating systems provide performance statistics on I/O, communication, CPU, memory and paging, scheduling, and synchronization primitives. The Oracle V$SESSTAT view provides the major categories of OS statistics as well.

Typically, Operating system information about I/O devices and semaphore operations is harder to map back to database objects and operations than is Oracle information. However, some operating systems have good visualization tools and efficient means of collecting the data.

Operating system information about CPU and memory usage is very important for assessing performance. Probably the most important statistic is CPU usage. The goal of low -level performance tuning is to become CPU bound on all CPUs. Once this is achieved, you can move up a level and work at the SQL level to find an alternate plan that might be more I/O intensive but uses less CPU.

Operating system memory and paging information is valuable for fine tuning the many system parameters that control how memory is divided among memory-intensive warehouse subsystems like parallel communication, sort, and hash join.

Minimum Recovery Time

If the system requires ten minutes to run an operation, it will take at least 10 minutes to roll back the operation: this is the best performance achievable. If there is some instance failure and some system failure, then recovery time will increase because not all the server processes are available to provide rollback.

Parallel DML Restrictions

You must either commit or roll back directly after you issue a parallel INSERT, UPDATE, or DELETE statement, or a serial insert with the APPEND hint. You can perform no other SQL commands until this is done.

Discrete transactions are not supported for parallel DML.

A session that is enabled for parallel DML may put transactions in the session in a special mode. If any DML statement in a transaction modifies a table in parallel, no subsequent serial or parallel query or DML statement can access the same table again in that transaction. This means that the results of parallel modifications cannot be seen during the transaction.

A complete listing of parallel DML and direct-load insert restrictions is found in Oracle8 Concepts. If a parallel restriction is violated, the operation is simply performed serially. If a direct-load insert restriction is violated, then the APPEND hint is ignored and a conventional insert is performed. No error message is returned.

DML Parallelized	number of statements with insert, delete and update that were parallelized by the last operation and by this session
Queries Parallelized	number of all other parallel statements
DFO Trees	number of fragments of query plan that were parallelized
Server Threads	total number of server processes (typically 2x degree)
Allocation Height	requested number of servers on each instance
Allocation Width	requested number of instances

21Diagnosing Parallel Execution Performance Problems