11
General Tuning Recommendations

This chapter provides an overview of Oracle Parallel Server tuning issues by presenting a general methodology with which you can tune your Oracle Parallel Server applications. This chapter covers the following topics:

Overview of Tuning Oracle Parallel Server
Statistics for Monitoring Oracle Parallel Server Performance
Determining the Costs of Synchronization
Measuring Global and Local Work Ratios
Calculating the Cost of Global Cache Synchronization Due to Lock Contention
Resolving Problems in Oracle Parallel Server-Based Applications

See Also:
Oracle8i Parallel Server Setup and Configuration Guide for more information about performance monitoring and tuning with Oracle Parallel Server Management.

Overview of Tuning Oracle Parallel Server

With experience, you can anticipate many performance problems before deploying your Oracle Parallel Server applications. Some single-instance tuning practices are valid when tuning Oracle Parallel Server applications. However, you must also effectively tune the buffer cache, shared pool, and your shared disk subsystems with Parallel Server-specific goals in mind.

Oracle Parallel Server introduces parameters that are not used in single-instance environments. Many of these are tunable parameters that can significantly enhance Parallel Server performance when you properly set them. However, even the most effective tuning cannot overcome problems caused by poor analysis or database and application design flaws.

Tuning Oracle Parallel Server also requires monitoring several views and collecting Parallel Server-specific statistics. Do this using the methods described in this chapter.

Statistics for Monitoring Oracle Parallel Server Performance

This section describes the statistics that you can use for specifically monitoring and tuning Oracle Parallel Server applications. Topics in this section include:

Oracle maintains many statistics in local System Global Areas. Access these statistics using SQL against dynamic performance views or V$ tables as described in this chapter.

You can also use utilities such as UTLBSTAT and UTLESTAT to record statistics over a period of time to capture them for specific measurement intervals. Statistics are available as message request counters or as timed statistics. Message counters include statistics showing the number of a certain type of lock conversion. Oracle Parallel Server timed statistics reveal, for example, the total or average time waited for read and write I/O on particular operations.

The most significant statistics from an Oracle Parallel Server perspective are:

Cache-related statistics such as consistent gets, db block gets, db block changes, and waits for busy buffers
I/O statistics such as physical reads, physical writes, cross-instance writes, and wait times for reads and writes
Global cache lock requests and wait times, such as global cache gets, global cache converts, and waits for events such as null-to-X conversions

Two of the most important views showing Oracle Parallel Server-specific statistics are V$SYSSTAT and V$SYSTEM_EVENT.

Contents of V$SYSSTAT and V$SYSTEM_EVENT

The following section describes the contents of V$SYSSTAT and V$SYSTEM_EVENT and lists other Oracle Parallel Server-specific views.

Statistics in V$SYSSTAT

The V$SYSSTAT view includes the following statistics:

consistent gets
db block gets
db block changes
physical reads
physical writes
DBWR cross-instance writes
global cache get
global cache converts
global cache get time
global cache convert time
global cache cr blocks received
global cache cr block receive time
global cache cr timeouts
global cache convert timeouts
CPU used by this session

Statistics in V$SYSTEM_EVENT

The V$SYSTEM_EVENT view contains the following statistics:

db file sequential read
db file scattered read
db file parallel write
log file sync
global cache lock null to x
global cache lock null to s
global cache lock s to x
global cache lock open x
global cache lock open s
global cache cr request
global cache lock busy
buffer busy
buffer busy due to global cache
enqueue
row cache
library cache pin
lkmgr wait for remote messages

Other Parallel Server-Specific Views

Other important statistics relate to the following performance issues:

Buffer cache usage
Types of lock conversions
Lock activity with regard to block classes and files
Messages sent and received by the Distributed Lock Manager

These statistics appear in the following views:

V$CACHE
V$LOCK_ACTIVITY
V$CLASS_PING
V$FILE_PING
V$DLM_MISC

In addition to these views, you should analyze operating system statistics showing CPU usage, disk I/O, and the amount of CPU used for certain background processes such as LCK, LMD, BSP, DBW using procedures discussed in this chapter.

Recording Statistics for Tuning

Oracle recommends that you record statistics about the rates at which certain events occur. You should also record information about specific transactions within your Oracle Parallel Server environment. To do this, you can use performance-oriented utilities, such as UTLBSTAT and UTLESTAT, that compute statistic counts per second and per transaction. You can also measure the number of statement executions and the number of business transactions that applications submit.

For example, an insurance application might define a transaction as an insurance quote. This transaction might be composed of several DML operations and queries. If you know how many of these quotes your system processes in a certain time interval, divide that value by the number of quotes completed in the interval. Do this over a period of time to gauge performance.

This reveals an application profile in terms of the resources used per transaction and the application workload. The counts per transaction are useful for detecting changing workload patterns, while rates indicate the workload intensity.

Performance and Efficiency of Oracle Parallel Server Workloads

In Oracle Parallel Server, application performance and scalability are determined by the rate and cost of synchronization between nodes. You can measure the costs by identifying how effectively a transaction uses CPU resources. A transaction uses CPU resources to:

Send and receive messages
Maintain locks and resources required to guarantee global cache coherency
Process additional I/O requests due to the invalidation of cache buffers associated with ping processing

An additional cost is incurred waiting for I/O events and lock open or convert requests. Conflicts for resources between local and remote transactions while opening or converting a lock on a database resource can increase the cost of synchronization.

As a general example, approximate the cost of a transaction or request using this formula:

You can also calculate:

CPU time spent at the IPC layer
CPU time required to process lock requests
CPU spent for application processing, such as parsing SQL statements, fetching rows, and sorting
CPU time spent in processing read and write I/O caused by synchronization between nodes

Note:
Delays associated with these events are called "waits".

Statistics about these events appear in V$ tables such as V$SYSTEM_EVENT. The response time for each transaction depends on the number of requests for I/O and locks, and on the time required to process the instructions, plus the delay or wait time for each request.

Contention on certain resources adds to the cost of each measurable component. For example, increased disk service or queueing times due to:

Frequent I/O requests to certain disks
Contention for the same data or index blocks by local and remote transactions

can result in waits for busy buffers. This can in turn increase costs for each transaction and add to the operating system overhead.

As described in Chapter 5, you can create Oracle Parallel Server applications that are more scalable and that perform well enough to meet the service level requirements of a user community by minimizing either the rate or the cost of synchronization. In other words, strive to minimize inter-node synchronization and communication requirements. Partitioning, for example, where you isolate portions of an application that you can process by a group of transactions without interference from other transactions, reduces the cost and rate of global synchronization.

The choice of locking mechanisms and lock allocation, as described in Chapter 7, also has an effect on the rate and cost of global cache coherence. Recent optimizations in Oracle Parallel Server, such as Cache Fusion, attempt to reduce the cost of operations across instances by decreasing the delays for requests for shared or exclusive buffer access.

Determining the Costs of Synchronization

This section explains how to determine the cost incurred by synchronization and coherence between instances due to additional CPU time, I/O and global lock processing and contention. To do this, examine these groups of Oracle statistics:

Calculating CPU Service Time Required

To derive the CPU service time required per transaction, divide the CPU used by a session as shown in V$SYSSTAT by the number of user commits or the number of business transactions. Note that this is the amount of time required by the user process to execute in either user or kernel mode. This does not include the time spent by the operating system kernel on behalf of the transaction.

This measure is useful for comparing how applications behave in single instance environments running in exclusive mode to applications running in an Oracle Parallel Server environment. This measure is also instrumental in comparing the effect of different workloads and application design changes.

Estimating I/O Synchronization Costs

Refer to the V$SYSSTAT view for counts of the following requests:

DBWR cross-instance writes
physical writes
physical reads

Refer to the V$SYSTEM_EVENT view for time waited and average waits for the following actions:

db file parallel write
db file sequential read
db file scattered read

To estimate the time waited for reads incurred by rereading data blocks that had to be written to disk because of a request from another instance, multiply the statistic (for example, the time waited for db file sequential reads) by the percentage of read I/O caused by previous cache flushes as shown in this formula:

Where "lock buffers for read" is the value for lock converts from N to S derived from V$LOCK_ACTIVITY and "physical reads" is from the V$SYSSTAT view.

Similarly, the proportion of the time waited for database file parallel writes caused by pings can be estimated by multiplying db file parallel write time as found in V$SYSTEM_EVENTS by:

Measuring Global Cache Coherence and Contention

Table 11-1 describes some global cache coherence-related views and the types of statistics they contain.

Table 11-1 Global Cache Coherence and Contention Statistics and Their Views

View	Statistics
Refer to V$SYSSTAT to count requests for the actions shown to the right.	global cache gets (count of new locks opened) global cache converts (count of conversion for existing locks) global cache cr blocks received (count of consistent read buffers received from the BSP) Note: Also refer to the convert type-specific rows in V$LOCK_ACTIVITY.
Refer to V$SYSSTAT for the amount of time waited for the actions shown to the right.	global cache get time (total processing time including waits) global cache convert time (total processing time including waits) global cache cr block receive time (includes waits)
Refer to V$SYSTEM_EVENT for time waited for the events shown to the right.	global cache lock null to X global cache lock null to S global cache lock S to X global cache lock open X global cache lock open S global cache cr request

View

Statistics

Refer to V$SYSSTAT to count requests for the actions shown to the right.

global cache gets (count of new locks opened)

global cache converts (count of conversion for existing locks)

global cache cr blocks received (count of consistent read buffers received from the BSP)

Note: Also refer to the convert type-specific rows in V$LOCK_ACTIVITY.

Refer to V$SYSSTAT for the amount of time waited for the actions shown to the right.

global cache get time (total processing time including waits)

global cache convert time (total processing time including waits)

global cache cr block receive time (includes waits)

Refer to V$SYSTEM_EVENT for time waited for the events shown to the right.

global cache lock null to X

global cache lock null to S

global cache lock S to X

global cache lock open X

global cache lock open S

global cache cr request

For indicators of high contention or excessive delays, refer to the following statistics and views:

global cache cr timeouts as found in V$SYSSTAT
global cache convert timeouts as found in V$SYSSTAT
global cache lock busy as found in V$SYSTEM_EVENT
buffer busy due to global cache as found in V$SYSTEM_EVENT

As mentioned, it is useful to maintain application profiles per transaction and per unit of time. This allows you to compare two distinct workloads or to detect changes in a workload. The rates are also helpful in determining capacities and for identifying throughput issues. Oracle recommends that you incorporate the following ratios of statistics in your performance monitoring scripts:

lock gets per transaction, for example, global cache gets per transaction
lock converts per transaction, for example, global cache converts per transaction
cr request per transaction, for example, global cache cr blocks received per transaction
lock convert waits per transaction, the value shown in the TIME_WAITED column in V$SYSTEM_EVENT
global cache lock null to x per transaction
global cache lock null to s per transaction
global cache lock s to x per transaction
lock open waits per transaction, for example, from the TIME_WAITED column in V$SYSTEM_EVENT
global cache lock open x per transaction
global cache lock open per transaction
lock busy waits per transaction, for example, from the TIME_WAITED column in V$SYSTEM_EVENT
global cache lock busy per transaction

Calculate the same statistics per second or minute by dividing the total counts or times waited by the measurement interval.

Note:
To record timed statistics, set the TIMED_STATISTICS parameter to TRUE. Oracle records these statistics in hundredths of seconds.

Measuring Global and Local Work Ratios

The percentage of buffers accessed for global work or the percentage of I/O caused by inter-instance synchronization can be important measures of how efficient your application processes share data. It can also reveal whether the database is designed for optimum scalability.

Use the following calculation to determine the percentage of buffer accesses for local operations, in other words, reads and changes of database buffers that are not subject to a lock conversion:

Similarly, compute the percentage of read and write I/O for local operations using the following equations:

This calculation implies the percent of times DBWR writes for local work.

Also:

This calculation implies the number of percent reads by user processes for local work only; it does not refer to forced reads.

In the previous formula, the physical read statistic from V$SYSSTAT is combined with the "Lock buffers for read" value from V$LOCK_ACTIVITY. You can base the local write ratio entirely on the corresponding values from V$SYSSTAT.

Apart from determining the proportion of local and global work (the degree of partitioning) you can also use these percentages to detect changing workload patterns. Moreover, they represent the probability that a data block access is either global or local. You can therefore use this information as a rough estimator in scalability calculations.

In addition to these ratios, the proportion of delays due to unavailable buffers or locks is easy to derive using the formula:

More generally:

For lock opens and converts, the percentage of waits that are caused by busy locks (locks that are being acquired or released) can be indicative of either delays in opening or converting locks or very high concurrency on a small set of buffers.

Once you identify a problem area, such as a high ratio of global busy waits, convert and consistent read timeouts, or a high percentage of DBWR cross-instance writes, you can obtain more detail about the problem by referring to these views:

V$FILE_PING
V$CACHE
V$CLASS_PING

The statistics in these views help identify the files and blocks shared by both instances. These shared files may be responsible for the majority of the costs in inter-instance synchronization and global cache coherency processing.

Calculating the Cost of Global Cache Synchronization Due to Lock Contention

Reduced throughput and degradation of transaction response times are the result of the increased costs of synchronization between instances. There are several possible sources of such increased costs as described in this section under the following headings:

Contention for the Same Data Blocks

Contention for the same data blocks occurs if rows commonly accessed from multiple instances are spread over a limited range of blocks. The probability of this happening depends on the access distribution of the data within the table as a result of the application's behavior. The probability of contention can also depend on the block size. For example, more rows fit into an 8K block than into a 4K block. The PCTFREE defined for that particular table can also affect the level of contention. In fact, database block size and PCTFREE can be part of your Oracle Parallel Server design strategy: your goal is to reduce the number of rows per block and thus the probability of conflicting access.

Indicators of very "hot" globally accessed blocks include:

A high percentage of buffer busy waits or buffer busy due to global cache waits
Frequent convert timeouts or consistent read timeouts

If you see a high proportion of global cache lock waits per transaction, consider determining which files and blocks are accessed frequently from both nodes. The following describes one method for identifying contended files.

Using V$CACHE, V$PING, and V$BH to Identify Contended Objects

The dynamic performance views V$CACHE, V$PING and V$BH have two columns, FORCED_READS and FORCED_WRITES, that allow you to determine which objects and blocks are used by both nodes.

The FORCED_WRITE column carries a count of how many times a certain block was pinged out of the local buffer cache, that is, written to disk because the current version was requested by another instance. Correspondingly, the FORCED_READ column tracks how frequently a particular block had to be reread from disk because it was previously invalidated by an exclusive request from another instance. You can also use V$FILE_PING to identify files that experience the most pinging.

Using V$FILE_PING to Identify Files with Excessive Pinging

Use the V$FILE_PING view to identify files containing blocks that are most frequently written to due to pings. Once you have identified the files with the highest ping rates, reconsider your locking policy for these files (fixed or releasable). You may also want to consider whether you should physically distribute these files to reduce I/O delays. Of course, the best strategy is to increase the percentage of local work on the objects contained in the files, that is, to avoid pinging altogether.

Note:
A frequently pinged block will be visible in the local buffer caches of all instances in the cluster.

Contention for Segment Headers and Free List Blocks

Contention for segment headers can occur when the headers for tables or indexes have to be read and updated by many transactions at the same time. Usually this happens when transactions are searching for blocks with enough free space to hold the size of the data to be inserted or updated. Oracle also updates a segment header if new extents or additional free blocks are added to the table.

New applications that insert a significant amount of data from multiple nodes can become a serious performance bottleneck. This is because Oracle must copy the segment header block containing the free lists into another instance's cache. This results in a single point of contention.

You can significantly improve this situation by creating free list groups for the tables and indexes in question. The advantage of using free list groups is to partition access to segment free lists according to instance. This reduces conflicts between instances when the INSERT and DELETE rates are high.

Contention for Resources Other Than Database Blocks

Contention for resources other than database blocks should be infrequent. However, when this occurs, it will definitely have an adverse affect on performance. Usually, there are two layers that may exhibit such problems:

The row cache or data dictionary cache
The library cache

Using Oracle sequences that are not cached, or using poorly configured space parameters for a table or tablespace, may result in frequent changes to the data dictionary. Rows from the data dictionary are stored in a separate cache and use a different buffer format than that of the data buffer cache. If these objects are often read and updated from both nodes, Oracle must invalidate them and write them to disk. Unfortunately, these objects are often used in recursive, internal transactions while other locks are held. Because of the complexity of such internal data dictionary transactions, this processing can cause serious delays.

When this occurs, the values for "row cache lock" wait count and "time waited" statistics in V$SYSTEM_EVENT increase; these become two of the most waited-for events. The percentage of time waited for row cache locks should never be more than 5% of the total wait time. To determine which objects in the row cache may be causing the delays, query the V$ROWCACHE view. Oracle's response to the query is the name of the row cache object type, such as DC_SEQUENCES or DC_USED EXTENTS, and the DLM requests and DLM conflicts for these objects. If the conflicts exceed 10 to 15 percent of the requests and row cache lock wait time is excessive, then take preventive action by caching sequence numbers, defragmenting tablespaces, or by tuning space parameters.

Most frequently, problems occur when Oracle creates sequences without the CACHE option. In these cases, the values for DC_SEQUENCES show high DLM conflict rates. Frequent space management operations due to fragmented tablespaces or inadequate extent sizes can result in pinging for DC_USED_EXTENTS and DC_FREE_EXTENTS objects in the data dictionary cache.

If there are frequent changes to the data dictionary, many data blocks from the data dictionary tables, normally from file number 1, can be found in each instance's buffer cache when you query V$CACHE or V$PING.

Library cache performance problems in Oracle Parallel Server are rare. They usually manifest themselves as high wait times for "library cache pin" and could be caused by the frequent reparsing of cursors or by the loading of procedures and packages. Query the DLM column in the V$LIBRARYCACHE view to obtain more detail about this problem. Cross-instance invalidations of library cache objects should be rare. However, such invalidations can occur if you drop objects referenced by a cursor that is executed in two instances.

A Shortage of Locks

A shortage of locks or resources may occur if the database and the buffer cache are large and if there is a constraint on memory. You can set the parameter GC_RELEASABLE_LOCKS to a value lower than the default. This value should be equal to the size of the buffer cache as determined by DB_BLOCK_BUFFERS. This allows you to save memory for locks and resources can be saved.

However, if a lot of resources are used to maintain global cache coherency and the free lists are depleted, locks need to be freed and reopened more frequently. An increasing count of global cache lock free list waits and global cache gets or global cache lock open events can indicate this. Waiting for locks to be freed or a higher percentage of locks that are opened add to the cost incurred by synchronizing instances and usually results in higher transaction response times and higher CPU use.

Resolving Problems in Oracle Parallel Server-Based Applications

This section explains how to identify and resolve problems in Oracle Parallel Server-based applications. It contains the following sections:

Query Tuning Tips

Query-intensive applications benefit from tuning techniques that maximize the amount of data for each I/O request. Before trying these techniques, monitor the performance both before and after implementing them to measure and assess their success.

Large Block Size

Use a large block size to increase the number of rows that each operation retrieves. A large block size also reduces the depth of your application's index trees. Your block size should be at least 8K if your database is used primarily for processing queries.

Increase Value for DB_FILE_MULTIBLOCK_READ_COUNT

You should also set the value for DB_FILE_MULTIBLOCK_READ_COUNT to the largest possible value. Doing this also improves the speed of full table scans by reducing the number of reads required to scan the table. Note that your system I/O is limited by the block size multiplied by the number of blocks read.

If you use operating system striping, set the stripe size to DB_FILE_MULTIBLOCK_READ_COUNT multiplied by the DB_BLOCK_SIZE times 2. If your system can differentiate index stripes from table data stripes, use a stripe size of DB_BLOCK_SIZE time 2 for indexes.

Also remember to:

Use read-only tablespaces for data and indexes
Define tablespaces holding temporary segments as type TEMPORARY and use a large blocking factor for PCM locks on data files containing tables that are frequently fully scanned

Application Tuning Tips

Transaction-based applications generally write more data to disk than other application types. You can use several methods to optimize transaction-based applications. However, you cannot use these techniques on all types of systems. After initiating any of these methods, monitor your application's performance to make sure it is acceptable.

To improve the ability of the database writer processes (DBWn) to write large amounts of data quickly, use asynchronous I/O. Oracle8i uses multiple DBWn processes to improve performance. You can also use low granularity locking to avoid false pings.

If you have partitioned users by instance and if you have enough space to accommodate the multiple lists, use free list groups. This helps your application avoid dynamic data partitioning. You can also manually partition tables by value. If the access is random, consider using a smaller block size. Remember these points:

Be aware of contention on related indexes and sequence generators
Consider using a multi-tiered architecture to route users for partitioning and for failover

Diagnosing Performance Problems

If your application is not performing well, analyze each component of the application to identify which components are causing problems. To do this, check the operating system and DLM statistics, as explained under the next heading, for indications of contention or excessive CPU usage. Excessive lock conversions that you can measure with specific procedures may reveal excessive read/write activity or high CPU requirements by DLM components.

See Also:
Chapter 11 for more information about tuning Oracle Parallel Server performance.

DLM Statistics for Monitoring Contention and CPU Usage

If your application is not performing optimally, consider examining statistics as described in the following points:

Use standard tuning techniques by running UTLBSTAT, UTLESTAT, and then querying the V$SQL_AREA view. Examine the statistics from this view and analyze the hit ratios in the shared pool and the buffer cache.
Examine the dynamic performance table statistics created when you run CATPARR.SQL.
Use the V$LOCK_ACTIVITY table to monitor lock rates.
Use the V$BH table to identify which blocks are being pinged. This table sums the number of times each block's PCM locks are downgraded from exclusive to null.
The V$PING view shows rows from the V$CACHE table where the exclusive to null count is non-zero.

Note:
Blocks that are newly acquired by an object do not appear in these tables until you rerun CATPARR.SQL.

Contention Problems Specific to Parallel Server Environments

There are significant contention problems that you can avoid in Oracle Parallel Server environments. These are the result of inserts into index blocks when multiple instances share a sequence generator for primary key values. You can minimize these problems by:

Using Sequence Number Multipliers

You may need to use a multiplier such as SEQUENCE_NUMBER x INSTANCE_NUMBER x 1,000,000,000 to prevent the instances from inserting new entries into the same index.

Using Oracle Sequences

Creating a sequence without using the CACHE clause may create a lot of overhead. It may also cause synchronization overhead if both instances use the same sequence.

11 General Tuning Recommendations