|Oracle8i Parallel Server Administration, Deployment, and Performance
Release 2 (8.1.6)
Part Number A76970-01
This chapter provides an overview of Oracle Parallel Server tuning issues by presenting a general methodology with which you can tune your Oracle Parallel Server applications. This chapter covers the following topics:
With experience, you can anticipate many performance problems before deploying your Oracle Parallel Server applications. Some single-instance tuning practices are valid when tuning Oracle Parallel Server applications. However, you must also effectively tune the buffer cache, shared pool, and your shared disk subsystems with Parallel Server-specific goals in mind.
Oracle Parallel Server introduces parameters that are not used in single-instance environments. Many of these are tunable parameters that can significantly enhance Parallel Server performance when you properly set them. However, even the most effective tuning cannot overcome problems caused by poor analysis or database and application design flaws.
Tuning Oracle Parallel Server also requires monitoring several views and collecting Parallel Server-specific statistics. Do this using the methods described in this chapter.
This section describes the statistics that you can use for specifically monitoring and tuning Oracle Parallel Server applications. Topics in this section include:
Oracle maintains many statistics in local System Global Areas. Access these statistics using SQL against dynamic performance views or V$ tables as described in this chapter.
You can also use utilities such as UTLBSTAT and UTLESTAT to record statistics over a period of time to capture them for specific measurement intervals. Statistics are available as message request counters or as timed statistics. Message counters include statistics showing the number of a certain type of lock conversion. Oracle Parallel Server timed statistics reveal, for example, the total or average time waited for read and write I/O on particular operations.
The most significant statistics from an Oracle Parallel Server perspective are:
Two of the most important views showing Oracle Parallel Server-specific statistics are V$SYSSTAT and V$SYSTEM_EVENT.
The following section describes the contents of V$SYSSTAT and V$SYSTEM_EVENT and lists other Oracle Parallel Server-specific views.
The V$SYSSTAT view includes the following statistics:
The V$SYSTEM_EVENT view contains the following statistics:
Other important statistics relate to the following performance issues:
These statistics appear in the following views:
In addition to these views, you should analyze operating system statistics showing CPU usage, disk I/O, and the amount of CPU used for certain background processes such as LCK, LMD, BSP, DBW using procedures discussed in this chapter.
Oracle recommends that you record statistics about the rates at which certain events occur. You should also record information about specific transactions within your Oracle Parallel Server environment. To do this, you can use performance-oriented utilities, such as UTLBSTAT and UTLESTAT, that compute statistic counts per second and per transaction. You can also measure the number of statement executions and the number of business transactions that applications submit.
For example, an insurance application might define a transaction as an insurance quote. This transaction might be composed of several DML operations and queries. If you know how many of these quotes your system processes in a certain time interval, divide that value by the number of quotes completed in the interval. Do this over a period of time to gauge performance.
This reveals an application profile in terms of the resources used per transaction and the application workload. The counts per transaction are useful for detecting changing workload patterns, while rates indicate the workload intensity.
In Oracle Parallel Server, application performance and scalability are determined by the rate and cost of synchronization between nodes. You can measure the costs by identifying how effectively a transaction uses CPU resources. A transaction uses CPU resources to:
An additional cost is incurred waiting for I/O events and lock open or convert requests. Conflicts for resources between local and remote transactions while opening or converting a lock on a database resource can increase the cost of synchronization.
As a general example, approximate the cost of a transaction or request using this formula:
You can also calculate:
Statistics about these events appear in V$ tables such as V$SYSTEM_EVENT. The response time for each transaction depends on the number of requests for I/O and locks, and on the time required to process the instructions, plus the delay or wait time for each request.
Contention on certain resources adds to the cost of each measurable component. For example, increased disk service or queueing times due to:
can result in waits for busy buffers. This can in turn increase costs for each transaction and add to the operating system overhead.
As described in Chapter 5, you can create Oracle Parallel Server applications that are more scalable and that perform well enough to meet the service level requirements of a user community by minimizing either the rate or the cost of synchronization. In other words, strive to minimize inter-node synchronization and communication requirements. Partitioning, for example, where you isolate portions of an application that you can process by a group of transactions without interference from other transactions, reduces the cost and rate of global synchronization.
The choice of locking mechanisms and lock allocation, as described in Chapter 7, also has an effect on the rate and cost of global cache coherence. Recent optimizations in Oracle Parallel Server, such as Cache Fusion, attempt to reduce the cost of operations across instances by decreasing the delays for requests for shared or exclusive buffer access.
This section explains how to determine the cost incurred by synchronization and coherence between instances due to additional CPU time, I/O and global lock processing and contention. To do this, examine these groups of Oracle statistics:
To derive the CPU service time required per transaction, divide the CPU used by a session as shown in V$SYSSTAT by the number of user commits or the number of business transactions. Note that this is the amount of time required by the user process to execute in either user or kernel mode. This does not include the time spent by the operating system kernel on behalf of the transaction.
This measure is useful for comparing how applications behave in single instance environments running in exclusive mode to applications running in an Oracle Parallel Server environment. This measure is also instrumental in comparing the effect of different workloads and application design changes.
Refer to the V$SYSSTAT view for counts of the following requests:
Refer to the V$SYSTEM_EVENT view for time waited and average waits for the following actions:
To estimate the time waited for reads incurred by rereading data blocks that had to be written to disk because of a request from another instance, multiply the statistic (for example, the time waited for db file sequential reads) by the percentage of read I/O caused by previous cache flushes as shown in this formula:
Where "lock buffers for read" is the value for lock converts from N to S derived from V$LOCK_ACTIVITY and "physical reads" is from the V$SYSSTAT view.
Similarly, the proportion of the time waited for database file parallel writes caused by pings can be estimated by multiplying db file parallel write time as found in V$SYSTEM_EVENTS by:
Table 11-1 describes some global cache coherence-related views and the types of statistics they contain.
Refer to V$SYSSTAT to count requests for the actions shown to the right.
Note: Also refer to the convert type-specific rows in V$LOCK_ACTIVITY.
Refer to V$SYSSTAT for the amount of time waited for the actions shown to the right.
global cache cr block receive time (includes waits)
Refer to V$SYSTEM_EVENT for time waited for the events shown to the right.
global cache cr request
For indicators of high contention or excessive delays, refer to the following statistics and views:
As mentioned, it is useful to maintain application profiles per transaction and per unit of time. This allows you to compare two distinct workloads or to detect changes in a workload. The rates are also helpful in determining capacities and for identifying throughput issues. Oracle recommends that you incorporate the following ratios of statistics in your performance monitoring scripts:
Calculate the same statistics per second or minute by dividing the total counts or times waited by the measurement interval.
The percentage of buffers accessed for global work or the percentage of I/O caused by inter-instance synchronization can be important measures of how efficient your application processes share data. It can also reveal whether the database is designed for optimum scalability.
Use the following calculation to determine the percentage of buffer accesses for local operations, in other words, reads and changes of database buffers that are not subject to a lock conversion:
Similarly, compute the percentage of read and write I/O for local operations using the following equations:
This calculation implies the percent of times DBWR writes for local work.
This calculation implies the number of percent reads by user processes for local work only; it does not refer to forced reads.
In the previous formula, the physical read statistic from V$SYSSTAT is combined with the "Lock buffers for read" value from V$LOCK_ACTIVITY. You can base the local write ratio entirely on the corresponding values from V$SYSSTAT.
Apart from determining the proportion of local and global work (the degree of partitioning) you can also use these percentages to detect changing workload patterns. Moreover, they represent the probability that a data block access is either global or local. You can therefore use this information as a rough estimator in scalability calculations.
In addition to these ratios, the proportion of delays due to unavailable buffers or locks is easy to derive using the formula:
For lock opens and converts, the percentage of waits that are caused by busy locks (locks that are being acquired or released) can be indicative of either delays in opening or converting locks or very high concurrency on a small set of buffers.
Once you identify a problem area, such as a high ratio of global busy waits, convert and consistent read timeouts, or a high percentage of DBWR cross-instance writes, you can obtain more detail about the problem by referring to these views:
The statistics in these views help identify the files and blocks shared by both instances. These shared files may be responsible for the majority of the costs in inter-instance synchronization and global cache coherency processing.
Reduced throughput and degradation of transaction response times are the result of the increased costs of synchronization between instances. There are several possible sources of such increased costs as described in this section under the following headings:
Contention for the same data blocks occurs if rows commonly accessed from multiple instances are spread over a limited range of blocks. The probability of this happening depends on the access distribution of the data within the table as a result of the application's behavior. The probability of contention can also depend on the block size. For example, more rows fit into an 8K block than into a 4K block. The PCTFREE defined for that particular table can also affect the level of contention. In fact, database block size and PCTFREE can be part of your Oracle Parallel Server design strategy: your goal is to reduce the number of rows per block and thus the probability of conflicting access.
Indicators of very "hot" globally accessed blocks include:
If you see a high proportion of global cache lock waits per transaction, consider determining which files and blocks are accessed frequently from both nodes. The following describes one method for identifying contended files.
The dynamic performance views V$CACHE, V$PING and V$BH have two columns, FORCED_READS and FORCED_WRITES, that allow you to determine which objects and blocks are used by both nodes.
The FORCED_WRITE column carries a count of how many times a certain block was pinged out of the local buffer cache, that is, written to disk because the current version was requested by another instance. Correspondingly, the FORCED_READ column tracks how frequently a particular block had to be reread from disk because it was previously invalidated by an exclusive request from another instance. You can also use V$FILE_PING to identify files that experience the most pinging.
Use the V$FILE_PING view to identify files containing blocks that are most frequently written to due to pings. Once you have identified the files with the highest ping rates, reconsider your locking policy for these files (fixed or releasable). You may also want to consider whether you should physically distribute these files to reduce I/O delays. Of course, the best strategy is to increase the percentage of local work on the objects contained in the files, that is, to avoid pinging altogether.
Contention for segment headers can occur when the headers for tables or indexes have to be read and updated by many transactions at the same time. Usually this happens when transactions are searching for blocks with enough free space to hold the size of the data to be inserted or updated. Oracle also updates a segment header if new extents or additional free blocks are added to the table.
New applications that insert a significant amount of data from multiple nodes can become a serious performance bottleneck. This is because Oracle must copy the segment header block containing the free lists into another instance's cache. This results in a single point of contention.
You can significantly improve this situation by creating free list groups for the tables and indexes in question. The advantage of using free list groups is to partition access to segment free lists according to instance. This reduces conflicts between instances when the INSERT and DELETE rates are high.
Contention for resources other than database blocks should be infrequent. However, when this occurs, it will definitely have an adverse affect on performance. Usually, there are two layers that may exhibit such problems:
Using Oracle sequences that are not cached, or using poorly configured space parameters for a table or tablespace, may result in frequent changes to the data dictionary. Rows from the data dictionary are stored in a separate cache and use a different buffer format than that of the data buffer cache. If these objects are often read and updated from both nodes, Oracle must invalidate them and write them to disk. Unfortunately, these objects are often used in recursive, internal transactions while other locks are held. Because of the complexity of such internal data dictionary transactions, this processing can cause serious delays.
When this occurs, the values for "row cache lock" wait count and "time waited" statistics in V$SYSTEM_EVENT increase; these become two of the most waited-for events. The percentage of time waited for row cache locks should never be more than 5% of the total wait time. To determine which objects in the row cache may be causing the delays, query the V$ROWCACHE view. Oracle's response to the query is the name of the row cache object type, such as DC_SEQUENCES or DC_USED EXTENTS, and the DLM requests and DLM conflicts for these objects. If the conflicts exceed 10 to 15 percent of the requests and row cache lock wait time is excessive, then take preventive action by caching sequence numbers, defragmenting tablespaces, or by tuning space parameters.
Most frequently, problems occur when Oracle creates sequences without the CACHE option. In these cases, the values for DC_SEQUENCES show high DLM conflict rates. Frequent space management operations due to fragmented tablespaces or inadequate extent sizes can result in pinging for DC_USED_EXTENTS and DC_FREE_EXTENTS objects in the data dictionary cache.
If there are frequent changes to the data dictionary, many data blocks from the data dictionary tables, normally from file number 1, can be found in each instance's buffer cache when you query V$CACHE or V$PING.
Library cache performance problems in Oracle Parallel Server are rare. They usually manifest themselves as high wait times for "library cache pin" and could be caused by the frequent reparsing of cursors or by the loading of procedures and packages. Query the DLM column in the V$LIBRARYCACHE view to obtain more detail about this problem. Cross-instance invalidations of library cache objects should be rare. However, such invalidations can occur if you drop objects referenced by a cursor that is executed in two instances.
A shortage of locks or resources may occur if the database and the buffer cache are large and if there is a constraint on memory. You can set the parameter GC_RELEASABLE_LOCKS to a value lower than the default. This value should be equal to the size of the buffer cache as determined by DB_BLOCK_BUFFERS. This allows you to save memory for locks and resources can be saved.
However, if a lot of resources are used to maintain global cache coherency and the free lists are depleted, locks need to be freed and reopened more frequently. An increasing count of global cache lock free list waits and global cache gets or global cache lock open events can indicate this. Waiting for locks to be freed or a higher percentage of locks that are opened add to the cost incurred by synchronizing instances and usually results in higher transaction response times and higher CPU use.
This section explains how to identify and resolve problems in Oracle Parallel Server-based applications. It contains the following sections:
Query-intensive applications benefit from tuning techniques that maximize the amount of data for each I/O request. Before trying these techniques, monitor the performance both before and after implementing them to measure and assess their success.
Use a large block size to increase the number of rows that each operation retrieves. A large block size also reduces the depth of your application's index trees. Your block size should be at least 8K if your database is used primarily for processing queries.
You should also set the value for DB_FILE_MULTIBLOCK_READ_COUNT to the largest possible value. Doing this also improves the speed of full table scans by reducing the number of reads required to scan the table. Note that your system I/O is limited by the block size multiplied by the number of blocks read.
If you use operating system striping, set the stripe size to DB_FILE_MULTIBLOCK_READ_COUNT multiplied by the DB_BLOCK_SIZE times 2. If your system can differentiate index stripes from table data stripes, use a stripe size of DB_BLOCK_SIZE time 2 for indexes.
Also remember to:
Transaction-based applications generally write more data to disk than other application types. You can use several methods to optimize transaction-based applications. However, you cannot use these techniques on all types of systems. After initiating any of these methods, monitor your application's performance to make sure it is acceptable.
To improve the ability of the database writer processes (DBWn) to write large amounts of data quickly, use asynchronous I/O. Oracle8i uses multiple DBWn processes to improve performance. You can also use low granularity locking to avoid false pings.
If you have partitioned users by instance and if you have enough space to accommodate the multiple lists, use free list groups. This helps your application avoid dynamic data partitioning. You can also manually partition tables by value. If the access is random, consider using a smaller block size. Remember these points:
If your application is not performing well, analyze each component of the application to identify which components are causing problems. To do this, check the operating system and DLM statistics, as explained under the next heading, for indications of contention or excessive CPU usage. Excessive lock conversions that you can measure with specific procedures may reveal excessive read/write activity or high CPU requirements by DLM components.
Chapter 11 for more information about tuning Oracle Parallel Server performance.
If your application is not performing optimally, consider examining statistics as described in the following points:
There are significant contention problems that you can avoid in Oracle Parallel Server environments. These are the result of inserts into index blocks when multiple instances share a sequence generator for primary key values. You can minimize these problems by:
You may need to use a multiplier such as SEQUENCE_NUMBER x INSTANCE_NUMBER x 1,000,000,000 to prevent the instances from inserting new entries into the same index.
Creating a sequence without using the CACHE clause may create a lot of overhead. It may also cause synchronization overhead if both instances use the same sequence.