12
Tuning Oracle Parallel Server and Inter-Instance Performance

The chapter describes Oracle Parallel Server and Cache Fusion-related statistics and provides procedures that explain how to use these statistics to monitor and tune performance. This chapter also briefly explains how Cache Fusion resolves reader/writer conflicts in Oracle Parallel Server. It describes Cache Fusion's benefits in general terms that apply to most types of systems and applications.

The topics in this chapter include:

How Cache Fusion Produces Consistent Read Blocks

When a data block requested by one instance is in the memory cache of a remote instance, Cache Fusion resolves the read/write conflict using remote memory access, not disk access. The requesting instance sends a request for a consistent-read copy of the block to the holding instance. The Block Server Process (BSP) on the holding instance transmits the consistent-read image of the requested block directly from the holding instance's buffer cache to the requesting instance's buffer cache across a high speed interconnect.

As Figure 12-1 illustrates, Cache Fusion enables the buffer cache of one node to send data blocks directly to the buffer cache of another node by way of low latency, high bandwidth interconnects. This reduces the need for expensive disk I/O in parallel cache management.

Cache Fusion also leverages new interconnect technologies for low latency, user-space based, interprocessor communication. This potentially lowers CPU usage by reducing operating system context switches for inter-node messages. Oracle manages write/write contention using conventional disk-based Parallel Cache Management (PCM).

Figure 12-1 Cache Fusion Ships Blocks from Cache to Cache Across the Interconnect

Note:
Cache Fusion is always enabled.

Partitioning Data to Improve Write/Write Conflict Resolution

Cache Fusion only solves part of the block conflict resolution issue by providing improved scalability for applications that experience high levels of reader/writer contention. For applications with high writer/writer concurrency, you also need to accurately partition your application's tables to reduce the potential for writer/writer conflicts.

Improved Scalability with Cache Fusion

Cache Fusion improves application transaction throughput and scalability by providing:

Reduced context switches, and hence reduced CPU utilization, during reader/writer cache coherency conflicts
Further reduced CPU utilization for user-mode IPC platforms
Reduced I/O for block pinging and reduced X-to-S lock conversions
Consistent-read block transfers by way of high speed interconnects

Applications demonstrating high reader/writer conflict rates under disk-based PCM benefit the most from Cache Fusion. Packaged applications also scale more effectively as a result of Cache Fusion. Applications in which OLTP and reporting functions execute on separate nodes may also benefit from Cache Fusion.

Reporting functions that access data from tables modified by OLTP functions receive their versions of data blocks by way of high speed interconnects. This reduces the pinging of data blocks to disk. Performance gains are derived primarily from reduced X-to-S lock conversions and the corresponding reduction in disk I/O for X-to-S lock conversions.

Furthermore, the instance that was changing the cached data block before it received a read request for the same block from another instance would not have to request exclusive access to the block again for subsequent changes. This is because the instance retains the exclusive lock and the buffer after the block is shipped to the reading instance.

Note:
All applications achieve some performance gains from Cache Fusion. The degree of improvement depends upon the operating system, the application workload, and the overall system configuration.

Consistent-Read Block Transfers By Way of High Speed Interconnects

Because Cache Fusion exploits high speed IPCs, Oracle Parallel Server benefits from the performance gains of the latest technologies for low latency communication across cluster interconnects. Further performance gains can be expected with even more efficient protocols, such as Virtual Interface Architecture (VIA) and user-mode IPCs.

Cache Fusion reduces CPU utilization by taking advantage of user-mode IPCs, also known as "memory-mapped IPCs", for both Unix and NT based platforms. If the appropriate hardware support is available, operating system context switches are minimized beyond the basic reductions achieved with Cache Fusion alone. This also eliminates costly data copying and system calls.

User-mode IPCs, if efficiently implemented by hardware support, can reduce CPU use because user processes can communicate without using the operating system kernel. In other words, there is no need to switch from user execution mode to kernel execution mode.

Reduced I/O for Block Pinging and Reduced X to S Lock Conversions

Cache Fusion reduces expensive lock operations and disk I/O for data and undo segment blocks by transmitting consistent-read blocks directly from one instance's buffer cache to another. This can reduce the latency required to resolve reader/writer conflicts by as much as 90 percent.

Cache Fusion resolves reader/writer concurrency with approximately one tenth of the processing effort required by disk-based PCM, using little or no disk I/O. To do this, Cache Fusion only incurs overhead for processing the consistent-read request and for constructing a consistent-read copy of the requested block in memory and transferring it to the requesting instance. On some platforms this can take less than one millisecond.

The Interconnect and Interconnect Protocols for Oracle Parallel Server

The primary components affecting Cache Fusion performance are the interconnect and the protocols that process inter-node communication. The interconnect bandwidth, its latency, and the efficiency of the IPC protocol determine the speed with which Cache Fusion processes consistent-read block requests.

Influencing Interconnect Processing

Once your interconnect is operative, you cannot significantly influence its performance. However, you can influence a protocol's efficiency by adjusting the IPC buffer sizes.

Supported Interconnect Software

Interconnects that support Oracle Parallel Server and Cache Fusion use one of these protocols:

TCP/IP (Transmission Control Protocol/Interconnect Protocol)
UDP (User Datagram Protocol)
VIA (Virtual Interface Architecture)
Other proprietary protocols that are hardware vendor-specific

Oracle Parallel Server can use any interconnect product that supports these protocols. The interconnect product must also be certified for Oracle Parallel Server hardware cluster platforms.

Performance Expectations

Cache Fusion performance levels may vary in terms of latency and throughput from application to application. Performance is further influenced by the type and mixture of transactions your system processes.

The performance gains from Cache Fusion also vary with each workload. The hardware, the interconnect protocol specifications, and the operating system resource usage also affect performance.

If your application did not demonstrate a significant amount of consistent-read contention prior to Cache Fusion, your performance with Cache Fusion will likely remain unchanged. However, if your application experienced numerous lock conversions and heavy disk I/O as a result of consistent-read conflicts, your performance with Cache Fusion should improve significantly.

A comparison of the locking and I/O statistics for Oracle 8.1. and Oracle 8.0 reveals a major reduction of exclusive to shared lock requests and physical write I/O. The following section, "Monitoring Cache Fusion and Inter-Instance Performance", describes how to evaluate Cache Fusion performance in more detail.

See Also:
Chapter 7 for more information on lock types.

Monitoring Cache Fusion and Inter-Instance Performance

This section describes how to obtain and analyze Oracle Parallel Server and Cache Fusion statistics to monitor inter-instance performance. Topics in this section include:

Cache Fusion and Oracle Parallel Server Performance Monitoring Goals

The main goal of monitoring Cache Fusion and Oracle Parallel Server performance is to determine the cost of global processing and quantify the resources required to maintain coherency and synchronize the instances. Do this by analyzing the performance statistics from several views as described in the following sections. Use these monitoring procedures on an ongoing basis to observe processing trends and to maintain processing at optimal levels.

Many statistics are available to measure the work done by different components of the database kernel, such as the cache layer, the transaction layer or the I/O layer. Moreover, timed statistics allow you to accurately determine the time spent on processing certain requests or the time waited for specific events.

From these statistics sources, work rates, wait time and efficiency ratios can be derived.

See Also:
Chapter 11 for additional suggestions on which statistics to collect and how to use them to compute performance ratios.

Statistics for Monitoring Oracle Parallel Server and Cache Fusion

Oracle collects Cache Fusion-related performance statistics from the buffer cache and DLM layers. Oracle also collects general Oracle Parallel Server statistics for lock requests and lock waits. You can use several views to examine these statistics.

Maintaining an adequate history of system performance helps you identify trends as these statistics change. This facilitates identifying contributors to increasing response times and reduced throughput. It would also be helpful in spotting workload changes and peak processing requirements.

Procedures in this section use statistics that are grouped according to the following topics:

As described in Chapter 11, consider maintaining statistics from the V$SYSSTAT view and the V$SYSTEM_EVENT view on a per second and per transaction basis to obtain a general profile of the workload. Relevant observations from these views are:

Requests or counts per transaction, for example, physical reads per transaction and logical reads per transaction
Wait times or elapsed time per transaction, for example, read I/O wait time per transaction and lock convert time per transaction
Requests or counts per second
Average times per request, for example, average time to receive a consistent read buffer from another instance

By maintaining these statistics, you can accurately estimate the effect of an increasing cost for a certain type of operation on transaction response times. Major increases in work rates or average delays also contribute to identifying capacity issues.

You must set the parameter TIMED_STATISTICS to TRUE for Oracle to collect statistics for most views discussed in the procedures in this section. The timed statistics from views discussed in this chapter are displayed in units of 1/100ths of a second.

Note:
You must also run CATPARR.SQL to create Oracle Parallel Server-related views and tables for storing and viewing statistics as described under the next heading.

Creating Oracle Parallel Server Data Dictionary Views with CATPARR.SQL

The SQL script CATPARR.SQL creates parallel server data dictionary views. To run this script, you must have SYSDBA privileges.

CATALOG.SQL creates the standard V$ dynamic views, as well as:

GV$CACHE
GV$PING
GV$CLASS_PING
GV$FILE_PING
GV$ROWCACHE
GV$LIBRARYCACHE

You can rerun CATPARR.SQL if you want the EXT_TO_OBJ table to contain the latest information after you add extents. If you drop objects without rerunning CATPARR.SQL, EXT_TO_OBJ may display misleading information.

Global Dynamic Performance Views

Tuning and performance information for the Oracle database is stored in a set of dynamic performance tables known as the "V$ fixed views". Each active instance has its own set of fixed views. In Oracle Parallel Server, you can query a global dynamic performance (GV$) view to retrieve the V$ view information from all qualified instances. A global fixed view is available for all of the existing dynamic performance views except for V$ROLLNAME, V$CACHE_LOCK, V$LOCK_ACTIVITY, and V$LOCKS_WITH_COLLISIONS.

The global view contains all the columns from the local view, with an additional column, INST_ID (datatype INTEGER). This column displays the instance number from which the associated V$ information was obtained. You can use the INST_ID column as a filter to retrieve V$ information from a subset of available instances. For example, the query:

   SELECT * FROM GV$LOCK WHERE INST_ID = 2 or INST_ID = 5;

Retrieves information from the V$ views on instances 2 and 5.

Each global view contains a GLOBAL hint that creates a parallel query to retrieve the contents of the local view on each instance.

If you have reached the limit of PARALLEL_MAX_SERVERS on an instance and you attempt to query a GV$ view, one additional parallel server process will be spawned for this purpose. The extra process is not available for parallel operations other than GV$ queries.

Note:
If PARALLEL_MAX_SERVERS is set to zero for an instance, additional parallel server processes do not spawn to accommodate a GV$ query.

If you have reached the limit of PARALLEL_MAX_SERVERS on an instance and issue multiple GV$ queries, all but the first query will fail. In most parallel queries, if a server process could not be allocated this would result in either an error or a sequential execution of the query by the query coordinator.

See Also:

Chapter 2

Oracle8i Reference for restrictions on GV$ views and complete descriptions of all related parameters and V$ dynamic performance views

Analyzing Global Cache and Cache Fusion Statistics

Oracle collects global cache statistics at the buffer cache layer within an instance. These statistics include counts and timings of requests for global resources.

Requests for global locks on data blocks originate in the buffer cache of the requesting instance. Before a request enters the DLM, Oracle allocates data structures in the System Global Area to track the state of the request. These structures are called "lock elements".

To monitor global cache statistics, query the V$SYSSTAT view and analyze its output as described in the following procedures.

Procedures for Monitoring Global Cache Statistics

Complete the following steps to analyze global cache statistics.

Use this syntax to query V$SYSSTAT:

   SELECT * FROM V$SYSSTAT WHERE NAME LIKE 'global cache%';

Oracle responds with output similar to:

NAME                                                           VALUE
--------------------------------------------------------------------
global cache cr blocks received                                 7372
global cache cr block receive time                              2293
global cache cr blocks served                                   7882
global cache cr block serve time                                  60
global cache cr block send time                                  239
global cache cr block log flushes                                119
global cache cr block log flush time                             140
global cache cr timeouts                                           2
global cache cr requests blocked                                   0

The average latency of a consistent block request, in other words, its round-trip time, can be calculated as:

The result, which should typically be about 15 milliseconds depending on your system configuration and volume, is the average latency of a consistent-read request round trip from requesting instance, to holding instance, and back to the requesting instance. If your CPU has limited idle time and your system typically processes long-running queries, the latency may be higher. However, it is possible to have an average latency of less than one millisecond.

Consistent-read server request latency can also be influenced by a high value for the DB_MULTI_BLOCK_READ_COUNT parameter. This is because a requesting process may issue more than one request for a block depending on the setting of this parameter. Correspondingly, the requesting process may wait longer.

For a high number of incoming requests, especially in report-intensive applications, or if there are multiple nodes from which requests can arrive at a BSP, the round-trip time can increase because BSP's service time increases. To determine whether the length of the delay is caused by BSP, compute the average service time per request.

Track the average BSP service time per request and the total round-trip time per request as presented in this step.

To determine which part of the service time correlates most with the total service time, derive the time waited for a log flush and the time spent in sending the completed request using the following two equations:

By calculating these averages, you can account for almost all the processing steps of a consistent read block request. The remaining difference between the total round-trip time and the BSP service time per request falls onto processing time in the LMD processing and network IPC time.

Calculate the average convert times and average get times using one of these formulas:

High convert times may indicate excessive global concurrency. A large number of global cache gets, global cache converts, and a rapid increase in average convert or get times indicates that there is excessive contention. Another cause may be that latencies for lock operations are high due to overall system workload or system problems. A reasonable value for a cache get is 20 to 30 milliseconds while converts should take 10 to 20 milliseconds on average.

Oracle increments global cache gets when a new lock on a resource is opened. A convert is counted when there is already an open lock and Oracle converts it to another mode.

The elapsed time for a get thus includes the allocation and initialization of new locks. If the average cache get or average convert times are excessive, your system may be experiencing timeouts.

If the global cache convert times or global cache get times are high, refer to statistics in the V$SYSTEM_EVENTS view to identify events with a high value for TIME_WAITED statistics.

Analyze lock convert timeouts by examining the value for "global cache convert timeouts". If your V$SYSSTAT output shows a value other than zero for this statistic, check your system for congestion or high contention. In general, convert timeouts should not occur; their existence indicates serious performance problems.

Analyze the global cache consistent-read timeouts by examining the value for this statistic in your V$SYSSTAT output. Oracle increments this statistic after the system waits too long for the completion of a consistent-read request. If this statistic shows a value other than zero, too much time has elapsed after the initiation of a consistent-read request and a timeout has occurred. If this happens, you will also usually find that the average time for consistent-read request completions has increased. If you have timeouts and the latency is high, your system may have an excessive workload or there may be excessive contention for data blocks. It might also be used as an indicator of IPC or network problems.

Other Useful Cache Fusion Statistics

The following describes additional Cache Fusion statistics that you may find useful in diagnosing global cache and Cache Fusion operations. Use these statistics to monitor all the major operations of a consistent block request.

global cache cr blocks received

When a process requests a consistent read for a data block that it cannot satisfy from its local cache, it sends a request to another instance. Once the request is complete, in other words, the buffer has been received, Oracle decrements the request count.

global cache cr block receive time

This statistic records the total time it took for consistent read requests to complete, in other words, the accumulated round-trip time for all requests for consistent read blocks.

global cache cr timeouts

This statistic identifies a request for a consistent read block that has a long delay and that has timed out. This could be due to system performance problems, a slow interconnect network or dropped network packets. The value of this statistic should always be 0.

global cache cr blocks served

This is the number of requests for a consistent read block served by BSP. Oracle increments this statistic when the block is sent.

global cache cr block serve time

This statistic represents the accumulated time it took BSP to fill all incoming requests. For each request, the start time is recorded immediately after BSP takes a request off the request queue. The interval is computed after the blocks is sent.

global cache cr block send time

This is the time required by BSP to initiate a send of a consistent read block. For each request, timing starts when the block is sent and stops when the send has completed. It is a part of the serve time. Note that this statistic only measures the time it takes to initiate the send; it does not measure the time elapsed before the block arrives at the requestor.

global cache cr block log flushes

For changes to buffers containing a version of a data block that the block sever process has produced, a log flush must be initiated. BSP handles the wait asynchronously by managing a completion queue. Once LGWR has completed flushing the changes to a buffer that is on the log flush queue, BSP can send it. Therefore it periodically checks the queue. Oracle increments this statistic when a log flush is queued.

global cache cr block log flush time

This is the time waited for a log flush. It is part of the serve time.

Analyzing Global Lock Statistics

Global lock statistics provide counts and timings for both PCM and non-PCM lock activity. Oracle collects global lock statistics from the DLM API layer. All Oracle clients to the DLM, of which the buffer cache is only one, make their requests to the DLM through this layer. Thus, global lock statistics include lock requests originating from all layers of the kernel, while global cache statistics relate to buffer cache Oracle Parallel Server activity.

Use the procedures in this section to monitor data from the V$SYSSTAT view to derive averages, latencies, and counts. This establishes a rough indicator of the Oracle Parallel Server workload generated by an instance.

Procedures for Analyzing Global Lock Statistics

Use the following procedures to view and analyze statistics from the V$SYSSTAT view for global lock processing.

Use this syntax to query V$SYSSTAT:

   SELECT * FROM V$SYSSTAT WHERE NAME LIKE 'global lock%';

Oracle responds with output similar to:

NAME                                                             	VALUE
---------------------------------------------------------------- 	----------
global lock sync gets                                                   	703
global lock async gets                                                	12748
global lock get time                                                   	1071
global lock sync converts                                               	303
global lock async converts                                               	41
global lock convert time                                                 	93
global lock releases                                                    	573

Use your V$SYSSTAT output to perform the calculations and analyses described in the remaining procedures in this group of procedures.

Calculate the average global lock gets using this formula:

If the result is more than 20 or 30 milliseconds, query the TIME_WAITED column in the V$SYSTEM_EVENTS view using the DESCEND keyword to identify which lock events are waited for most frequently using this query:
```
   SELECT EVENT_TIME_WAITED, AVERAGE_WAIT 
   FROM V$SYSTEM_EVENTS 
   ORDER BY TIME_WAITED DESCEND;
```
Oracle increments global lock gets when a new lock on a resource is opened. A convert is counted when there is already an open lock and Oracle converts it to another mode.

The elapsed time for a get thus includes the allocation and initialization of new locks. If the average lock get or average lock convert times are excessive, your system may be experiencing timeouts.

If the global lock convert times or global lock get times are high, refer to statistics in the V$SYSTEM_EVENTS view to identify events with a high value for TIME_WAITED statistics.
Calculate the average global lock convert time using this formula:

If the result is more than 20 milliseconds, query the TIME_WAITED column in the V$SYSTEM_EVENTS view using the DESCEND keyword to identify the event causing the delay.
As mentioned, global lock statistics apply to buffer cache lock operations and lock operations for resources other than data blocks. To determine which type of resources may be a performance problem, divide the global lock get and global lock convert statistics into two categories:
- Synchronous Operations
  
  Synchronous lock gets includes, for example, global lock sync gets. These are usually performed for lock requests for resources other than cached data blocks. To determine the proportion of the time required for synchronous lock gets, divide global lock get time or global lock convert time by the corresponding number of synchronous operations.
- Asynchronous Operations
  
  Asynchronous lock operations include, for example, global lock async gets. These are typically lock operations for global cache locks. You can derive the proportion of the total time using the same calculation as for synchronous operations. In this way, the proportion of work and the cost of global cache lock requests and other lock requests can be determined.
Normally, if the proportion of global lock requests for resources other than global cache lock requests dominates the cost for all lock operations, the V$SYSTEM_EVENTS view shows high wait times for row cache locks, enqueues or library cache pins.
Analyze the V$LIBRARYCACHE and V$ROWCACHE views to observe DLM activity on non-PCM resources. These views have DLM-specific columns that identify DLM resource use. Analyze these views for DLM activity if you have frequent and extended waits for library cache pins, enqueues, or DFS lock handles.

Analyzing DLM Resource, Lock, Message, and Memory Resource Statistics

Oracle collects DLM resource, lock, and message statistics at the DLM level. Use these statistics to monitor DLM latency and workloads. These statistics appear in the V$DLM_CONVERT_LOCAL and V$DLM_CONVERT_REMOTE views.

These views record average convert times, count information, and timed statistics for each type of lock request. The V$DLM_CONVERT_LOCAL view shows statistics for local lock operations. The V$DLM_CONVERT_REMOTE view shows values for remote conversions. The average convert times in these views are in 100ths of a second.

Note:
Count information in these views is cumulative for the life of an instance.

How DLM Workloads Affect Performance

The DLM workload is an important aspect of Oracle Parallel Server and Cache Fusion performance because each consistent-read request results in a lock request. High DLM workloads as a result of heavy request rates can adversely affect performance.

The DLM performs local lock operations entirely within the local node, or in other words, without sending messages. Remote lock operations require sending messages to and waiting for responses from other nodes. Most down-converts, however, are local operations for the DLM.

The following procedures for analyzing DLM resource, locks, and message statistics appear in two groups. The first group of procedures explains how to monitor DLM resources and locks. The second group explains how to monitor message statistics.

Procedures for Analyzing DLM Resource and Lock Statistics

Use the following procedures to obtain and analyze statistics from the V$DLM_CONVERT_LOCAL and V$DLM_CONVERT_REMOTE views for DLM resource processing.

You must enable event 29700 to populate the V$DLM_CONVERT_LOCAL and V$DLM_CONVERT_REMOTE views. Do this by entering this syntax:

   EVENT="29700 TRACE NAME CONTEXT FOREVER"

Use this syntax to query the V$DLM_CONVERT_LOCAL view:

   SELECT CONVERT_TYPE, 
   AVERAGE_CONVERT_TIME, 
   CONVERT_COUNT 
   FROM V$DLM_CONVERT_LOCAL;

Oracle responds with output similar to:

CONVERT_TYPE                           AVERAGE_CONVERT_TIME CONVERT_COUNT
-------------------------------------- -------------------- -------------
NULL -> SS                                                0             0
NULL -> SX                                                0             0
NULL -> S                                                 1           146
NULL -> SSX                                               0             0
NULL -> X                                                 1            92
SS   -> SX                                                0             0
SS   -> S                                                 0             0
SS   -> SSX                                               0             0
SS   -> X                                                 0             0
SX   -> S                                                 0             0
SX   -> SSX                                               0             0
SX   -> X                                                 0             0
S    -> SX                                                0             0
S    -> SSX                                               0             0
S    -> X                                                 3            46
SSX  -> X                                                 0             0
16 rows selected.

Use this syntax to query the V$DLM_CONVERT_REMOTE view:
```
   SELECT * FROM V$DLM_CONVERT_REMOTE; 
```
Oracle responds with output identical in format to the output for the V$DLM_CONVERT_LOCAL view.

Use your output from the V$DLM_CONVERT_LOCAL and V$DLM_CONVERT_REMOTE views to perform the calculation described in the following procedure.

Calculate the ratio of local-to-remote lock operations using this query:

   SELECT r.CONVERT_TYPE,
    r.AVERAGE_CONVERT_TIME,
    l.AVERAGE_CONVERT_TIME,
    r.CONVERT_COUNT,
    l.CONVERT_COUNT,
      FROM V$DLM_CONVERT_LOCAL l, V$DLM_CONVERT_REMOTE r
      GROUP BY r.CONVERT_TYPE;

It is useful to maintain a history of workloads and latencies for lock converts. Changes in lock requests per transaction without increases in average latencies usually result from changing application workload patterns.

Deterioration of both request rates and latencies usually indicates an increased rate of lock conflicts or an increased workload due to message latencies, system problems, or timeouts. If the LMD process shows high CPU consumption, or consumption is greater than 20 percent of the CPU while overall system resource consumption is normal, consider binding the LMD process to a specific processor if the system has more than one CPU.

If latencies increase, also examine CPU data and other operating system statistics that you can obtain using utilities such as "sar," "vmstat" and "netstat" on UNIX or Perfmon on Windows NT.
Derive an estimate of CPU busy time for LMD from the V$SYSTEM_EVENT view.

For a quick estimate of the CPU time spent by LMD, you can transform the wait time event for LMD presented in the V$SYSTEM_EVENT view. To do this, look for the event name "lkmgr wait for remote messages" that represents the time that the LMD process is idle. The TIME_WAITED column contains the accumulated idle time for LMD in units of hundredths of a second.

To derive the busy time, divide the value for TIME_WAITED by the length of the measurement interval after normalizing it to seconds. In other words, a value of 17222 centiseconds is 172.22 seconds. The result is the idle time of the LMD process, or the percentage of idle time. Subtract that value from 1 and the result is the busy time for the LMD process. This is a fairly accurate estimate when compared with operating system utilities that provide information about CPU utilization per process.

Note:
You should have beginning and ending snapshots to make an accurate calculation.

DLM Message Statistics

The DLM sends messages either directly or by using flow control. For both methods, the DLM attaches markers known as "tickets" to each message. The allotment of tickets for each DLM is limited. However, the DLM can re-use tickets indefinitely.

DLMs send messages directly until no more tickets are available. When an DLM runs out of tickets, messages must wait in a flow control queue until outstanding messages have been acknowledged and more tickets are available. Flow-controlled messaging is managed by the LMD process.

The rationing of tickets prevents one node from sending an excessive amount of messages to another node during periods of heavy inter-instance communication. This also prevents one node with heavy remote consistent-read block requirements from assuming control of messaging resources throughout a cluster at the expense of other, less-busy nodes.

The V$DLM_MISC view contains the following statistics about message activity:

DLM messages sent directly
DLM messages flow controlled
DLM messages received
DLM total incoming message queue length

Procedures for Analyzing DLM Message Statistics

Use the following procedures to obtain and analyze message statistics in the V$DLM_MISC view.

Use this syntax to query the V$DLM_MISC view:

   SELECT NAME, VALUE FROM V$DLM_MISC;

Oracle responds with output similar to:

STATISTIC# NAME                                VALUE
---------- ----------------------------------- -----------
         0 dlm messages sent directly           29520

         1 dlm messages flow controlled          1851

         2 dlm messages received                29668

         3 dlm total incoming msg queue length    297

4 rows selected.

Note:
Oracle support may request information from your V$DLM_MISC output for debugging purposes.

Use your output from the V$DLM_MISC view to perform the following procedure.

Calculate the average receive queue length between two snapshots using this equation:

Oracle increments the value for "total incoming message queue length" whenever a new request enters the LMD process' message queue. When messages leave the LMD queue to begin processing, Oracle increments the value for "messages received".

The size of the queue may increase if a large number of requests simultaneously arrives at the LMD. This can occur when the volume of locking activity is high or when the LMD processes a large quantity of consistent-read requests. Typically, the average receive queue length is less than 10.

Analyzing Oracle Parallel Server I/O Statistics

In addition to the global cache and global lock statistics that were previously discussed, you can also use statistics in the V$SYSSTAT view to measure the I/O workload related to global cache synchronization. There are three important statistics in the V$SYSSTAT view for this purpose:

DBWR forced writes
Remote instance undo header writes
Remote instance undo block writes

DBWR forced writes occur when Oracle resolves inter-instance data block contention by writing the requested block to disk before the requesting node can use it.

Cache Fusion minimizes the disk I/O for consistent-reads. This can lead to a substantial reduction in physical writes and reads performed by each instance. Before Cache Fusion, a consistent-read requesting data from a remote instance could result in up to three write I/Os on the remote instance and three corresponding read I/Os for the requesting instance: one for the data block, one for the rollback segment header, and one for a rollback segment block.

Analyzing Oracle Parallel Server I/O Statistics in the V$SYSSTAT View

You can obtain the following statistics to quantify the write I/Os required for global cache synchronization.

Use this syntax to query the V$SYSSTAT view:
```
   SELECT NAME, VALUE FROM V$SYSSTAT 
   WHERE NAME IN ('DBWR forced writes', 
   'remote instance undo block writes',
   'remote instance undo header writes',
   'physical writes');
```
Oracle responds with output similar to:
```
NAME                                                           VALUE
--------------------------------------------------------- ----------
physical writes                                                41802
DBWR cross-instance writes                                      5403
remote instance undo block writes                                  0
remote instance undo header writes                                 2
4 rows selected.
```
Where the statistic "physical writes" refers to all physical writes that occurred from a particular instance performed by DBWR, the value for "DBWR cross-instance writes" accounts for all writes caused by writing a dirty buffer containing a data block that is requested for modification by another instance. As cross-instance writes are also handled by DBWR, it follows that "DBWR cross-instance writes" is a subset of all "physical writes".

The other notable statistics, "remote instance undo block writes" and "remote instance undo header writes", refer to the number of times that Oracle writes a rollback segment block to disk because another instance intends to build a consistent read version of a data block but the information required to roll back the block are not in the instance's cache. Both are a subset of "DBWR cross-instance writes". Their significance for performance is less critical in Oracle8i because Cache Fusion reduces the need to "export" and "import" rollback information. In most cases, instances send the complete version of a data block by way of the interconnect to the requesting instance.

Note that every lock conversion from Exclusive (X) to Null (N) or from Exclusive (X) to Shared (S) is associated with a write to disk when the buffer under the lock is dirty. However, in Oracle8i, the number of X to S lock conversions is reduced because Cache Fusion does not require them. In most cases, the holding instance retains the X lock.
Calculate the ratio of Oracle Parallel Server-related I/O to overall physical I/O using this equation:

You should see a noticeable decrease in this ratio between this calculation and pre-Cache Fusion statistics.
Use this equation to calculate how many writes to rollback segments occur when a remote instance needs to read from rollback segments that are in use by a local instance:

The ratio shows how much disk I/O is related to writes to rollback segments. With Cache Fusion, this ratio should be very low.
To estimate the number or percentage of reads due to global cache synchronization, use the number of lock requests for conversions from NULL(N) to Shared mode (S) counted in V$LOCK_ACTIVITY and the "physical reads" statistics from V$SYSSTAT.

The following formula computes the percentage of reads that are only for local work:

Where "lock buffers for read" represents the N to S lock conversions.

These so-called "forced reads" occur when a cached data block that was previously modified by the local instance had to be written to disk due to a "ping" from another instance and the block is then re-acquired by the local instance for a read.

See Also:
Chapter 11 for more observations regarding estimations of local and global work rates and percentages in Oracle Parallel Server clusters.

Analyzing Lock Conversions by Type

This section describes how to analyze output from three views to quantify lock conversions by type. The tasks and the views discussed in this section are:

Using the V$LOCK_ACTIVITY View to Analyze Lock Conversions

The V$LOCK_ACTIVITY view summarizes how many lock up- and down-converts have occurred during an instance's lifetime. X-to-N down-converts denote the number of times a lock was down-converted because another instance wanted to modify a resource.

The other major type of down-convert is X-to-S. This type of down-convert occurs when an instance reads a resource that was last modified by a local instance. Both types of lock conversions involve I/O. However, Cache Fusion should reduce X-to-S down-converts because they are not needed for buffer locks.

Using the V$CLASS_PING View to Identify Pinging by Block Class

The V$CLASS_PING view summarizes lock conversion activity by showing whether disk I/O is occurring on the following classes of blocks:

Data blocks
Segment headers
Extent headers
Undo blocks

All X_2_NULL_FORCED_WRITE and X_2_S_FORCED_WRITE conversions involve write I/O. In other words, values in the columns for each block class provide an indicator of the cause of the disk I/O.

Using the V$PING View to Identify Hot Objects

The V$PING view helps identify "hot" blocks and "hot" objects. The sum of each column, FORCED_READS and FORCED_WRITES, indicates the actual pinging activity on a particular block or object.

All three views provide different levels of detail. If you suspect that pinging or Oracle Parallel Server itself is the cause of a performance problem, monitor the V$LOCK_ACTIVITY view to generate an overall Oracle Parallel Server workload profile. Use information from the V$LOCK_ACTIVITY view to record the rate at which lock conversions occur.

For more details, use the V$CLASS_PING view to identify the type of block on which lock conversions and pinging are occurring. Once you have identified the class, use the V$PING view to obtain details about a particular table or index and the file and block numbers on which there is significant lock conversion activity.

If your response time or throughput requirements are no longer being met, you would normally examine the V$LOCK_ACTIVITY, V$CLASS_PING, V$CACHE, V$PING or V$FILE_PING views. In addition, you might also examine:

V$SYSSTAT to identify an increase in lock requests per transaction
V$SYSSTEM_EVENT to identify longer waits for global cache locks or consistent read server requests per transaction
Global and local work done as described in Chapter 11 to see if there is a noticeable change in performance percentages

In summary, a change in the application profile and the work rates typically warrant a detailed analysis using the above-mentioned views. Apart from diagnosing performance problems of existing applications, these views are also useful when developing applications or when deciding on a partitioning strategy.

Analyzing Latch, Oracle Parallel Server, and DLM Statistics

Latches are low-level locking mechanisms that protect System Global Area data structures. Excessive contention for latches degrades performance.

Use the V$DLM_LATCH and V$LATCH_MISSES views to monitor latch contention within the DLM. These views show information about a particular latch, its statistics, and the location in the code from where the latch is acquired.

For normal operations, the value latch statistics is limited. In some cases, multiple latches can help increase the performance for certain layers by a small amount. High latch contention is often the result of either:

Higher level performance issues or badly tuned system
Oracle internal inefficiencies or performance bugs

The following procedures are suggestions as to which information is available. Oracle does not recommend that you monitor these statistics on a regular basis and derive conclusions solely on the basis of latching issues. However, gathering this information might be useful to Oracle Support or Oracle's Development Staff. Also, latch tuning can be the object of advanced tuning activities, but in the majority of cases latch tuning will not be your actual performance problem.

On the other hand, record information from these procedures if the TIME_WAITED value for the "latch free" wait event is very high and ranks among the events that accrue the largest times as indicated by the V$SYSTEM_EVENT view.

Procedures for Analyzing Latch, Parallel Server, and DLM Statistics

Use the following procedures to analyze latch, Oracle Parallel Server, and DLM-related statistics.

Query the V$LATCH view using this syntax:

   SELECT * FROM V$LATCH;

Oracle responds with output similar to the following where the columns from left to right show the statistic name, gets, misses, and sleeps:

cache buffer handle

46184

1

0

cache buffers chained

84139946

296547

29899

cache buffers lru

4760378

11718

227

channel handle pool

1

0

0

channel operations

1

0

0

dlm ast latch

542776

494

1658

dlm cr bast queue

37194

1

0

dlm deadlock list

32839

0

0

dlm domain lock la

1

0

0

dlm domain lock ta

49164

1

0

dlm group lock lat

1

0

0

dlm group lock tab

25239

1

0

dlm lock table fre

325306

270

327

dlm process hash l

6346

0

0

dlm process table

2

0

0

dlm recovery domai

2014

0

0

dlm resource hash

683031

1709

41342

dlm resource scan

188

0

0

dlm resource table

182093

70

2

dlm shared communication

190766

211

313

dlm timeout list

113294

40

3

dml lock allocation

261

0

0

22 rows selected.

Note:
The content of the five columns in this output example from left to right are: gets, hits, misses, sleeps, and the sleeps-to-misses ratio.

If the output from the previous procedure reveals a high ratio of sleeps-to-misses, attempt to determine where the sleeps occur. To do this, execute this query on the V$LATCH_MISSES view:

   SELECT PARENT_NAME, "WHERE", SLEEP_COUNT
   FROM V$LATCH_MISSES
   ORDER BY SLEEP_COUNT DESCENDING;

Oracle responds with output similar to:

PARENT_NAME

WHERE

SLEEP_COUNT

------------------------

----------------------

--------------

dlm resource hash list

kjrrmas1: lookup master n

39392

cache buffers chains

kcbgtcr: kslbegin

27738

library cache

kglhdgn: child:

15408

shared pool

kghfnd: min scan

6876

cache buffers chains

kcbrls: kslbegin

2124

shared pool

kghalo

1667

dlm ast latch

kjucll: delete lock from

1464

7 rows selected.

Use your V$LATCH and V$LATCH_MISSES output to perform the following procedures.

Calculate the ratio of gets to misses using your V$LATCH output from the step 1 in this section in this formula:

High numbers for misses usually indicate contention for the same resources and locks. Acceptable ratios range from 90 to 95%.
Analyze the ratio of sleeps to misses using your V$LATCH_MISSES output from step 1 in this section. This ratio determines how often a process sleeps when it cannot immediately get a latch but wants to wait for the latch.

A ratio of 2 means that for each miss, a process attempts to get a latch twice before acquiring it. A high number of sleeps-to-misses usually indicates process scheduling delays or high operating system workloads. It can also indicate internal inefficiencies or high concurrency on one resource. For example, when many locks are opened simultaneously on the same resource, then processes might have to wait for a resource latch.

In the V$LATCH_MISSES view, the WHERE column shows the function in which the latch is acquired. This information is useful in determining internal performance problems. Usually, the latch slept on for long periods shows up in the V$SESSION_WAIT or V$SYSTEM_EVENT views under the 'latch free' wait event category.

The following section describes how to use the V$SYSTEM_EVENTS view in more detail.

Using the V$SYSTEM_EVENTS View to Identify Performance Problems

Data about Cache Fusion and Oracle Parallel Server events appears in the V$SYSTEM_EVENT view. To identify events for which processes have waited the longest, query the V$SYSTEM_EVENT view on the TIME_WAITED column using the DESCENDING keyword. The TIME_WAITED column shows the total wait time for each system event listed.

By generating an ordered list of event waits, you can easily locate performance bottlenecks. Each COUNT represents a voluntary context switch. The TIME_WAIT value is the cumulative time that processes waited for particular system events. The values in the TOTAL_TIMEOUT and AVERAGE_WAIT columns provide additional information about system efficiency.

Oracle recommends dividing the sum of values from the TOTAL_WAITS and TIME_WAITED columns by the number of transactions, as outlined in Chapter 11. Transactions can be defined as business transactions, for example, insurance quotes, order entry, and so on, or you can define them on the basis of "user commits" or "executions", depending on your perspective.

The goal is to estimate which event type contributes primarily to transaction response times, since in general:

By this rationale, the total wait time can be divided into subcomponents of the wait time, such as:

where tm is "time waited".

It is also useful to derive the total wait time by adding the individual events and then observing the percentages that are spent waiting for each event to derive the major cost factors for transaction response times. Reducing the time for the largest proportion of the waits will have the most significant effect on response time.

Parallel Server Events in V$SYSTEM_EVENTS

The following events appearing in the V$SYSTEM_EVENT output represent waits for Oracle Parallel Server events:
- global cache cr request
- library cache pin
- buffer busy due to global cache
- global cache lock busy
- global cache lock open x
- global cache lock open s
- global cache lock null to x
- global cache lock s to x
- global cache lock null to s
Events Related to Non-PCM Resources

You can monitor other events in addition to those listed under the previous heading because performance problems may be related to Oracle Parallel Server. These events are:
- Row cache locks
- Enqueues
- Library cache pins
- DFS lock handle
General Observations

If the time waited for global cache events is high relative to other waits, look for increased latencies, contention, or excessive system workloads using V$SYSSTAT statistics and operating system performance monitors. A high number of global cache busy or buffer busy waits indicates increased contention in the buffer cache.

In OLTP systems with data block address locking and a high degree of contention, it is not unusual when the global cache wait events represent a high proportion of the sum of the total time waited.

If a lot of wait time is used by waits for non-buffer cache resources as indicated by statistics in the rows "row cache lock", "enqueues", and "library cache pin", monitor the V$ROWCACHE and V$LIBRARYCACHE views for Oracle Parallel Server-related issues. Specifically, observe values in the DLM columns of each of these views.

Common Oracle Parallel Server problems arise from poorly managed space parameters or sequences that are not cached. In such cases, processes wait for row cache locks and enqueues and the V$ROWCACHE view will show a high number of conflicts for certain dictionary caches.

cache buffer handle	46184	1	0
cache buffers chained	84139946	296547	29899
cache buffers lru	4760378	11718	227
channel handle pool	1	0	0
channel operations	1	0	0
dlm ast latch	542776	494	1658
dlm cr bast queue	37194	1	0
dlm deadlock list	32839	0	0
dlm domain lock la	1	0	0
dlm domain lock ta	49164	1	0
dlm group lock lat	1	0	0
dlm group lock tab	25239	1	0
dlm lock table fre	325306	270	327
dlm process hash l	6346	0	0
dlm process table	2	0	0
dlm recovery domai	2014	0	0
dlm resource hash	683031	1709	41342
dlm resource scan	188	0	0
dlm resource table	182093	70	2
dlm shared communication	190766	211	313
dlm timeout list	113294	40	3
dml lock allocation	261	0	0
22 rows selected.

PARENT_NAME	WHERE	SLEEP_COUNT
------------------------	----------------------	--------------
dlm resource hash list	kjrrmas1: lookup master n	39392
cache buffers chains	kcbgtcr: kslbegin	27738
library cache	kglhdgn: child:	15408
shared pool	kghfnd: min scan	6876
cache buffers chains	kcbrls: kslbegin	2124
shared pool	kghalo	1667
dlm ast latch	kjucll: delete lock from	1464
7 rows selected.

12 Tuning Oracle Parallel Server and Inter-Instance Performance

How Cache Fusion Produces Consistent Read Blocks

Figure 12-1 Cache Fusion Ships Blocks from Cache to Cache Across the Interconnect

Partitioning Data to Improve Write/Write Conflict Resolution

Improved Scalability with Cache Fusion

Consistent-Read Block Transfers By Way of High Speed Interconnects

Reduced I/O for Block Pinging and Reduced X to S Lock Conversions

The Interconnect and Interconnect Protocols for Oracle Parallel Server

Influencing Interconnect Processing

Supported Interconnect Software

Performance Expectations

Monitoring Cache Fusion and Inter-Instance Performance

Cache Fusion and Oracle Parallel Server Performance Monitoring Goals

Statistics for Monitoring Oracle Parallel Server and Cache Fusion

Creating Oracle Parallel Server Data Dictionary Views with CATPARR.SQL

Global Dynamic Performance Views

Analyzing Global Cache and Cache Fusion Statistics

Procedures for Monitoring Global Cache Statistics

Other Useful Cache Fusion Statistics

global cache cr blocks received

global cache cr block receive time

global cache cr timeouts

global cache cr blocks served

global cache cr block serve time

global cache cr block send time

global cache cr block log flushes

global cache cr block log flush time

Analyzing Global Lock Statistics

Procedures for Analyzing Global Lock Statistics

Analyzing DLM Resource, Lock, Message, and Memory Resource Statistics

How DLM Workloads Affect Performance

Procedures for Analyzing DLM Resource and Lock Statistics

DLM Message Statistics

Procedures for Analyzing DLM Message Statistics

Analyzing Oracle Parallel Server I/O Statistics

Analyzing Oracle Parallel Server I/O Statistics in the V$SYSSTAT View

Analyzing Lock Conversions by Type

Using the V$LOCK_ACTIVITY View to Analyze Lock Conversions

Using the V$CLASS_PING View to Identify Pinging by Block Class

Using the V$PING View to Identify Hot Objects

Analyzing Latch, Oracle Parallel Server, and DLM Statistics

Procedures for Analyzing Latch, Parallel Server, and DLM Statistics

Using the V$SYSTEM_EVENTS View to Identify Performance Problems

Parallel Server Events in V$SYSTEM_EVENTS

Events Related to Non-PCM Resources

General Observations

12
Tuning Oracle Parallel Server and Inter-Instance Performance