11 Tuning Backup and Recovery

Tuning RMAN performance is mostly a matter of maximizing the speed with which RMAN creates your backups and restores from backups, on disk and especially on tape. A secondary concern is limiting the effect of backup activities on database throughput.

You may also need to tune performance of the database during instance recovery.

This chapter covers the concepts needed for performance tuning, and the features in RMAN that can help you. The discussion is divided into the following sections:

Tuning Recovery Manager: Overview
Features and Options Used to Tune RMAN Performance
Tuning RMAN Backup Performance: Procedure
Instance Recovery Performance Tuning: Fast-Start Fault Recovery

Tuning Recovery Manager: Overview

RMAN backup and restore operations have the following distinct components:

Reading or writing input data
Processing data by validating blocks and copying them from the input to the output buffers

The slowest of these operations in any RMAN task is called the bottleneck. RMAN tuning involves identifying the bottlenecks for a given task and using RMAN commands, initialization parameter settings, or adjustments to physical media to improve performance on the backup.

The key to tuning RMAN is understanding how it performs I/O. RMAN's backup and restore jobs use two types of I/O buffers: DISK and tertiary storage (usually tape). When performing a backup, RMAN reads input files using disk buffers and writes the output backup file by using either disk or tape buffers. Restore operations use disk or tape buffers for input, depending on where the backup is stored, and disk buffers for output.

To tune RMAN effectively, you must thoroughly understand concepts such as synchronous and asynchronous I/O, disk and tape buffers, and channel architecture. When you understand these concepts, then you can learn how to use fixed views to monitor bottlenecks, and use the techniques described in "Tuning RMAN Backup Performance: Procedure" to solve problems.

There are a number of concepts that affect RMAN performance and that can therefore influence your strategy for backup performance tuning:

I/O Buffer Allocation
Allocation of Tape Buffers
Synchronous and Asynchronous I/O
Factors Affecting Backup Speed to Tape
Using the RATE Parameter to Control Disk Bandwidth Usage

I/O Buffer Allocation

RMAN I/O uses two different types of buffers: disk and tape. These buffers are typically different sizes. They are allocated differently, depending upon the device type and the role the buffer plays in an RMAN operation.

Allocation for Disk Buffers

To understand how RMAN allocates buffers to read datafiles during backups, you must understand how RMAN multiplexing works.

RMAN multiplexing is RMAN's ability to read a number of files in a backup simultaneously from different sources to improve reading performance, and then write them to a single backup piece. The level of multiplexing is the number of files read simultaneously.

Multiplexing is described at greater length in "Multiplexed Backup Sets". The level of multiplexing is determined by the algorithm described in "Algorithm for Multiplexed Backups". Review this section before proceeding.

When RMAN backs up from disk, it uses the rules described in the following table to determine how large to make the buffers.

Table 11-1 Datafile Read Buffer Sizing Algorithm

Level of Multiplexing	Resulting Buffer Size
Less than or equal to 4	RMAN allocates buffers of size 1 MB so that the total buffer size for all the input files is 16 MB.
Greater than 4 but less than or equal to 8	RMAN allocates disk buffers of size 512 KB so that the total buffer size for all the files is less than 16 MB.
Greater than 8	RMAN allocates a fixed 4 disk buffers of 128 KB for each file, so that the total size is 512 KB for each file.

The number of buffers allocated depends on the following rules:

When the output of the backup resides on disk, 4 buffers are allocated, their size being operating system dependent.
If the operation is a restore, and the backup resides on disk, 4 buffers are allocated, their size being operating system dependent.
When restoring a backup, for each active datafile 4 buffers of 128K are allocated.
When image copies are produced, only 4 buffers in total are allocated, each of an operating system dependent size.

Allocation of Tape Buffers

If you backup to or restore from an sbt device, then by default the database allocates four buffers for each channel for the tape writers (or reads if doing a restore). The size of these buffers is platform dependent, but is typically 256K. This value can be changed using the ALLOCATE or SEND command using the PARMS and the BLKSIZE option.

RMAN allocates the tape buffers in the SGA if I/O slaves are being used, or the PGA otherwise.

If you use I/O slaves, then set the LARGE_POOL_SIZE initialization parameter to set aside SGA memory dedicated to holding these large memory allocations. This prevents RMAN I/O buffers from competing with the library cache for SGA memory. If I/O slaves for tape I/O were requested but there is not enough space in the SGA for them, slaves are not used, and a message appears in the alert log.

Synchronous and Asynchronous I/O

When RMAN reads or writes data, the I/O is either synchronous or asynchronous. When the I/O is synchronous, a server process can perform only one task at a time. When it is asynchronous, a server process can begin an I/O and then perform other work while waiting for the I/O to complete. It can also begin multiple I/O operations before waiting for the first to complete.

Some operating systems support native asynchronous disk I/O. The database takes advantage of this feature if it is available. On operating systems that do not support native asynchronous I/O, the database can simulate it with special I/O slave processes that are dedicated to performing I/O on behalf of another process. You can control disk I/O slaves by setting the DBWR_IO_SLAVES parameter to a nonzero value. The database allocates four backup disk I/O slaves for any nonzero value of DBWR_IO_SLAVES.

By contrast, tape I/O is always synchronous. For tape I/O, each channel allocated (whether manually or based on a CONFIGURE command) corresponds to a server process, called here a channel process.

Synchronous I/O: Example

Figure 11-1 shows synchronous I/O in a backup to tape.

Figure 11-1 Synchronous I/O

Description of "Figure 11-1 Synchronous I/O"

The following steps occur:

The channel process composes a tape buffer.

The channel process executes media manager code that processes the tape buffer and internalizes it for further processing and storage by the media manager.

The media manager code returns a message to the server process stating that it has completed writing.

The channel process can initiate a new task.

Figure 11-2 shows asynchronous I/O in a tape backup. Asynchronous I/O to tape is simulated by using tape slaves. In this case, each allocated channel corresponds to a server process, which in the explanation which follows is identified as a channel process. For each channel process, one tape slave is started (or more than one, in the case of multiple copies).

Figure 11-2 Asynchronous I/O

Description of "Figure 11-2 Asynchronous I/O"

The following steps occur:

A channel process writes blocks to a tape buffer.

The channel process sends a message to the tape slave process to process the tape buffer. The tape slave process executes media manager code that processes the tape buffer and internalizes it so that the media manager can process it.

While the tape slave process is writing, the channel process is free to read data from the datafiles and prepare more output buffers.

Once the tape slave channel returns from the media manager code, it requests a new tape buffer, which usually is ready. Thus waiting time for the channel process is reduced, and the backup is completed faster.

Factors Affecting Backup Speed to Tape

The following factors affect the speed of the backup to tape:

Native Transfer Rate
Tape Compression
Tape Streaming
Physical Tape Block Size

Native Transfer Rate

The tape native transfer rate is the speed of writing to a tape without compression. This speed represents the upper limit of the backup rate. The upper limit of your backup performance should be the aggregate transfer rate of all of your tape drives. If your backup is already performing at that rate, and if it is not using an excessive amount of CPU, then RMAN performance tuning will not help.

Tape Compression

The level of tape compression is very important for backup performance. If the tape has good compression, then the sustained backup rate is faster. For example, if the compression ratio is 2:1 and native transfer rate of the tape drive is 6 MB/s, then the resulting backup speed is 12 MB/s. In this case, RMAN must be able to read disks with a throughput of more than 12 MB/s or the disk becomes the bottleneck for the backup.

Note:

You should not use both tape compression provided by the media manager and binary backupset compression as provided by RMAN. If the media manager compression is efficient, then it is usually the better choice. Using RMAN compressed backupsets can be an effective alternative if you need to reduce bandwidth used to move uncompressed backupsets over a network to the media manager, and if the CPU overhead required to compress the data in RMAN is acceptable.

See Oracle Database Backup and Recovery Basics for more on using compressed backupsets.

Tape Streaming

Tape streaming during write operations has a major impact on tape backup performance. Almost all tape drives currently on the market are fixed-speed, streaming tape drives. Because such drives can only write data at one speed, when they run out of data to write to tape, the tape must slow down and stop. Generally, when the drive's buffer empties, the tape is moving so quickly that it actually overshoots; to continue writing, the drive must rewind the tape to locate the point where it stopped writing.

Physical Tape Block Size

The physical tape block size can affect backup performance. The block size is the amount of data written by media management software to a tape in one write operation. In general, the larger the tape block size, the faster the backup. Note that physical tape block size is not controlled by RMAN or the Oracle database server, but by media management software. See your media management software's documentation for details.

Features and Options Used to Tune RMAN Performance

There are a number of features you can use to tune your backup performance, once you have sufficient knowledge of your database and its workload and bottlenecks.

Using the RATE Parameter to Control Disk Bandwidth Usage

The RATE parameter specifies the bytes/second that RMAN reads on this channel. Use this parameter to set an upper limit for bytes read so that RMAN does not consume excessive disk bandwidth and degrade online performance.

For example, set RATE=1500K. If each disk drive delivers 3 MB/second, then RMAN leaves some disk bandwidth available to the online system.

Tuning RMAN Backup Performance: Procedure

Many factors can affect backup performance. Often, finding the solution to a slow backup is a process of trial and error. To get the best performance for a backup, follow the suggested steps in this section:

Step 1: Remove RATE Parameters from Configured and Allocated Channels
Step 2: If You Use Synchronous Disk I/O, Set DBWR_IO_SLAVES
Step 3: If You Fail to Allocate Shared Memory, Set LARGE_POOL_SIZE
Step 4: Tune RMAN Tape Streaming Performance Bottlenecks
Step 5: Query V$ Views to Identify Bottlenecks

Step 1: Remove RATE Parameters from Configured and Allocated Channels

The RATE parameter on a channel is intended to reduce, rather than increase, backup throughput, so that more disk bandwidth is available for other database operations.

If your backup is not streaming to tape, then make sure that the RATE parameter is not set on the ALLOCATE CHANNEL or CONFIGURE CHANNEL commands.

Step 2: If You Use Synchronous Disk I/O, Set DBWR_IO_SLAVES

If and only if your disk does not support asynchronous I/O, then try setting the DBWR_IO_SLAVES initialization parameter to a nonzero value. Any nonzero value for DBWR_IO_SLAVES causes a fixed number (four) of disk I/O slaves to be used for backup and restore, which simulates asynchronous I/O. If I/O slaves are used, I/O buffers are obtained from the SGA. The large pool is used, if configured. Otherwise, the shared pool is used.

Note: By setting DBWR_IO_SLAVES, the database writer processes will use slaves as well. You may need to increase the value of the PROCESSES initialization parameter.

Step 3: If You Fail to Allocate Shared Memory, Set LARGE_POOL_SIZE

Set this initialization parameter if the database reports an error in the alert.log stating that it does not have enough memory and that it will not start I/O slaves. The message should resemble the following:

ksfqxcre: failure to allocate shared memory means sync I/O will be used whenever async I/O to file not supported natively

When attempting to get shared buffers for I/O slaves, the database does the following:

If LARGE_POOL_SIZE is set, then the database attempts to get memory from the large pool. If this value is not large enough, then an error is recorded in the alert log, the database does not try to get buffers from the shared pool, and asynchronous I/O is not used.
If LARGE_POOL_SIZE is not set, then the database attempts to get memory from the shared pool.
If the database cannot get enough memory, then it obtains I/O buffer memory from the PGA and writes a message to the alert.log file indicating that synchronous I/O is used for this backup.

The memory from the large pool is used for many features, including the shared server (formerly called multi-threaded server), parallel query, and RMAN I/O slave buffers. Configuring the large pool prevents RMAN from competing with other subsystems for the same memory.

Requests for contiguous memory allocations from the shared pool are usually small (under 5 KB) in size. However, it is possible that a request for a large contiguous memory allocation can either fail or require significant memory housekeeping to release the required amount of contiguous memory. Although the shared pool may be unable to satisfy this memory request, the large pool is able to do so. The large pool does not have a least recently used (LRU) list; the database does not attempt to age memory out of the large pool.

Use the LARGE_POOL_SIZE initialization parameter to configure the large pool. To see in which pool (shared pool or large pool) the memory for an object resides, query V$SGASTAT.POOL.

The formula for setting LARGE_POOL_SIZE is as follows:

LARGE_POOL_SIZE =  number_of_allocated_channels * 
                   (16 MB + ( 4 *  size_of_tape_buffer ) )

See Also:

Oracle Database Concepts for more information about the large pool, and Oracle Database Reference for complete information about initialization parameters

Step 4: Tune RMAN Tape Streaming Performance Bottlenecks

There are several tasks you can perform to identify and remedy bottlenecks that affect RMAN's performance on tape backups:

Using BACKUP... VALIDATE To Distinguish Between Tape and Disk Bottlenecks

One reliable way to determine whether the tape streaming or disk I/O is the bottleneck in a given backup job is to compare the time required to run backup tasks with the time required to run BACKUP VALIDATE of the same tasks. BACKUP VALIDATE of a backup to tape performs the same disk reads as a real backup but performs no tape I/O. If the time required for the BACKUP VALIDATE to tape is significantly less than the time required for a real backup to tape, then writing to tape is the likely bottleneck.

Using Multiplexing to Improve Tape Streaming with Disk Bottlenecks

In some situations when performing a backup to tape, RMAN may not be able to send data blocks to the tape drive fast enough to support streaming. For example, during an incremental backup, RMAN only backs up blocks changed since a previous datafile backup as part of the same strategy. If you do not turn on change tracking, RMAN must scan entire datafiles for changed blocks, and fill output buffers as it finds such blocks. If there are not many changed blocks, RMAN may not fill output buffers fast enough to keep the tape drive streaming.

You can improve performance by increasing the degree of multiplexing used for backing up. This increases the rate at which RMAN fills tape buffers, which makes it more likely that buffers are sent to the media manager fast enough to maintain streaming.

Using Incremental Backups to Improve Backup Performance With Tape Bottlenecks

If writing to tape is the source of a bottleneck for your backups, consider using incremental backups as part of your backup strategy. Incremental level 1 backups write only the changed blocks from datafiles to tape, so that any bottleneck on writing to tape has less impact on your overall backup strategy. In particular, if tape drives are not locally attached to the node running the database being backed up, then incremental backups can be faster.

Step 5: Query V$ Views to Identify Bottlenecks

If none of the previous steps improves backup performance, then try to determine the exact source of the bottleneck. Use the V$BACKUP_SYNC_IO and V$BACKUP_ASYNC_IO views to determine the source of backup or restore bottlenecks and to see detailed progress of backup jobs.

V$BACKUP_SYNC_IO contains rows when the I/O is synchronous to the process (or thread on some platforms) performing the backup. V$BACKUP_ASYNC_IO contains rows when the I/O is asynchronous. Asynchronous I/O is obtained either with I/O processes or because it is supported by the underlying operating system.

See Also:

Oracle Database Reference for more information about these views

To determine whether your tape is streaming when the I/O is synchronous, query the EFFECTIVE_BYTES_PER_SECOND column in the V$BACKUP_SYNC_IO or V$BACKUP_ASYNC_IO view. If EFFECTIVE_BYTES_PER_SECOND is less than the raw capacity of the hardware, then the tape is not streaming. If EFFECTIVE_BYTES_PER_SECOND is greater than the raw capacity of the hardware, the tape may or may not be streaming. Compression may cause the EFFECTIVE_BYTES_PER_SECOND to be greater than the speed of real I/O.

Identifying Bottlenecks with Synchronous I/O

With synchronous I/O, it is difficult to identify specific bottlenecks because all synchronous I/O is a bottleneck to the process. The only way to tune synchronous I/O is to compare the rate (in bytes/second) with the device's maximum throughput rate. If the rate is lower than the rate that the device specifies, then consider tuning this aspect of the backup and restore process. The DISCRETE_BYTES_PER_SECOND column in the V$BACKUP_SYNC_IO view displays the I/O rate. If you see data in V$BACKUP_SYNC_IO, then the problem is that you have not enabled asynchronous I/O or you are not using disk I/O slaves.

Identifying Bottlenecks with Asynchronous I/O

Long waits are the number of times the backup or restore process told the operating system to wait until an I/O was complete. Short waits are the number of times the backup or restore process made an operating system call to poll for I/O completion in a nonblocking mode. Ready indicates the number of time when I/O was already ready for use and so there was no need to made an operating system call to poll for I/O completion.

The simplest way to identify the bottleneck is to query V$BACKUP_ASYNC_IO for the datafile that has the largest ratio for LONG_WAITS divided by IO_COUNT.

Note:

If you have synchronous I/O but you have set BACKUP_DISK_IO_SLAVES, then the I/O will be displayed in V$BACKUP_ASYNC_IO.

See Also:

Oracle Database Reference for descriptions of the V$BACKUP_SYNC_IO and V$BACKUP_ASYNC_IO views

Instance Recovery Performance Tuning: Fast-Start Fault Recovery

This section describes instance recovery, and how Oracle's Fast-Start Fault Recovery improves availability in the event of a crash or instance failure. It also offers guidelines for tuning the time required to perform crash and instance recovery.

This section contains the following topics:

Understanding Instance Recovery
Checkpointing and Cache Recovery
Configuring the Duration of Cache Recovery: FAST_START_MTTR_TARGET
Tuning FAST_START_MTTR_TARGET and Using MTTR Advisor

Understanding Instance Recovery

Instance and crash recovery are the automatic application of redo log records to Oracle data blocks after a crash or system failure. During normal operation, if an instance is shut down cleanly (as when using a SHUTDOWN IMMEDIATE statement), rather than terminated abnormally, then the in-memory changes that have not already been written to the datafiles on disk are written to disk as part of the checkpoint performed during shutdown.

However, if a single instance database crashes or if all instances of an Oracle Real Application Cluster configuration crash, then Oracle performs crash recovery at the next startup. If one or more instances of an Oracle Real Application Cluster configuration crash, then a surviving instance performs instance recovery automatically. Instance and crash recovery occur in two steps: cache recovery followed by transaction recovery.

The database can be opened as soon as cache recovery completes, so improving the performance of cache recovery is important for increasing availability.

Cache Recovery (Rolling Forward)

During the cache recovery step, Oracle applies all committed and uncommitted changes in the redo log files to the affected data blocks. The work required for cache recovery processing is proportional to the rate of change to the database (update transactions each second) and the time between checkpoints.

Transaction Recovery (Rolling Back)

To make the database consistent, the changes that were not committed at the time of the crash must be undone (in other words, rolled back). During the transaction recovery step, Oracle applies the rollback segments to undo the uncommitted changes.

Checkpointing and Cache Recovery

Periodically, Oracle records a checkpoint. A checkpoint is the highest system change number (SCN) such that all data blocks less than or equal to that SCN are known to be written out to the data files. If a failure occurs, then only the redo records containing changes at SCNs higher than the checkpoint need to be applied during recovery. The duration of cache recovery processing is determined by two factors: the number of data blocks that have changes at SCNs higher than the SCN of the checkpoint, and the number of log blocks that need to be read to find those changes.

How Checkpoints Affect Performance

Frequent checkpointing writes dirty buffers to the datafiles more often than otherwise, and so reduces cache recovery time in the event of an instance failure. If checkpointing is frequent, then applying the redo records in the redo log between the current checkpoint position and the end of the log involves processing relatively few data blocks. This means that the cache recovery phase of recovery is fairly short.

However, in a high-update system, frequent checkpointing can reduce runtime performance, because checkpointing causes DBWn processes to perform writes.

Fast Cache Recovery Trade-offs

To minimize the duration of cache recovery, you must force Oracle to checkpoint often, thus keeping the number of redo log records to be applied during recovery to a minimum. However, in a high-update system, frequent checkpointing increases the overhead for normal database operations.

If daily operational efficiency is more important than minimizing recovery time, then decrease the frequency of writes to data files due to checkpoints. This should improve operational efficiency, but also increase cache recovery time.

Configuring the Duration of Cache Recovery: FAST_START_MTTR_TARGET

The Fast-Start Fault Recovery feature reduces the time required for cache recovery, and makes the recovery bounded and predictable by limiting the number of dirty buffers and the number of redo records generated between the most recent redo record and the last checkpoint.

The foundation of Fast-Start Fault Recovery is the Fast-Start checkpointing architecture. Instead of conventional event-driven (that is, log switching) checkpointing, which does bulk writes, fast-start checkpointing occurs incrementally. Each DBWn process periodically writes buffers to disk to advance the checkpoint position. The oldest modified blocks are written first to ensure that every write lets the checkpoint advance. Fast-Start checkpointing eliminates bulk writes and the resultant I/O spikes that occur with conventional checkpointing.

With the Fast-Start Fault Recovery feature, the FAST_START_MTTR_TARGET initialization parameter simplifies the configuration of recovery time from instance or system failure. FAST_START_MTTR_TARGET specifies a target for the expected mean time to recover (MTTR), that is, the time (in seconds) that it should take to start up the instance and perform cache recovery. Once FAST_START_MTTR_TARGET is set, the database manages incremental checkpoint writes in an attempt to meet that target. If you have chosen a practical value for FAST_START_MTTR_TARGET, you can expect your database to recover, on average, in approximately the number of seconds you have chosen.

Note:

You must disable or remove the FAST_START_IO_TARGET, LOG_CHECKPOINT_INTERVAL, and LOG_CHECKPOINT_TIMEOUT initialization parameters when using FAST_START_MTTR_TARGET. Setting these parameters interferes with the mechanisms used to manage cache recovery time to meet FAST_START_MTTR_TARGET.

Practical Values for FAST_START_MTTR_TARGET

The maximum value for FAST_START_MTTR_TARGET is 3600 seconds (one hour). If you set the value to more than 3600, then Oracle rounds it to 3600.

The following example shows how to set the value of FAST_START_MTTR_TARGET:

SQL> ALTER SYSTEM SET FAST_START_MTTR_TARGET=30;

In principle, the minimum value for FAST_START_MTTR_TARGET is one second. However, the fact that you can set FAST_START_MTTR_TARGET this low does not mean that target can be achieved. There are practical limits to the minimum achievable MTTR target, due to such factors as database startup time.

The MTTR target that your database can achieve given the current value of FAST_START_MTTR_TARGET is called the effective MTTR target. You can view your current effective MTTR by viewing the TARGET_MTTR column of the V$INSTANCE_RECOVERY view.

The practical range of MTTR target values for your database is defined to be the range between the lowest effective MTTR target for your database and the longest that startup and cache recovery will take in the worst-case scenario (that is, when the whole buffer cache is dirty). A procedure for determining the range of achievable MTTR target values, one step in the process of tuning your FAST_START_MTTR_TARGET value, is described in "Determine the Practical Range for FAST_START_MTTR_TARGET".

Note:

It is usually not useful to set your FAST_START_MTTR_TARGET to a value outside the practical range. If your FAST_START_MTTR_TARGET value is shorter than the lower limit of the practical range, the effect is as if you set it to the lower limit of the practical range. In such a case, the effective MTTR target will be the best MTTR target the system can achieve, but checkpointing will be at a maximum, which can affect normal database performance. If you set FAST_START_MTTR_TARGET to a time longer than the practical range, the MTTR target will be no better than the worst-case situation.

Reducing Checkpoint Frequency to Optimize Runtime Performance

To reduce the checkpoint frequency and optimize runtime performance, you can do the following:

Set the value of FAST_START_MTTR_TARGET to 3600. This enables Fast-Start checkpointing and the Fast-Start Fault Recovery feature, but minimizes its effect on runtime performance while avoiding the need for performance tuning of FAST_START_MTTR_TARGET.
Size your online redo log files according to the amount of redo your system generates. A good rule of thumb is to switch logs at most every twenty minutes. Having your log files too small can increase checkpoint activity and reduce performance. Also note that all redo log files should be the same size.

See Also:
Oracle Database Concepts for a complete discussion of checkpoints

Monitoring Cache Recovery with V$INSTANCE_RECOVERY

The V$INSTANCE_RECOVERY view displays the current recovery parameter settings. You can also use statistics from this view to determine which factor has the greatest influence on checkpointing.

The following table lists those columns most useful in monitoring predicted cache recovery performance:

Table 11-2 V$INSTANCE_RECOVERY Columns

Column	Description
`TARGET_MTTR`	Effective mean time to recover (MTTR) target in seconds. This field is 0 if `FAST_START_MTTR_TARGET` is not specified.
`ESTIMATED_MTTR`	The current estimated mean time to recover (MTTR) in seconds, based on the current number of dirty buffers and log blocks. This field is always calculated, whether or not `FAST_START_MTTR_TARGET` is specified.

For more details on the columns in V$INSTANCE_RECOVERY, see Oracle Database Reference.

As part of the ongoing monitoring of your database, you can periodically compare V$INSTANCE_RECOVERY.TARGET_MTTR to your FAST_START_MTTR_TARGET. The two values should generally be the same if the FAST_START_MTTR_TARGET value is in the practical range. If TARGET_MTTR is consistently longer than FAST_START_MTTR_TARGET, then set FAST_START_MTTR_TARGET to a value no less than TARGET_MTTR. If TARGET_MTTR is consistently shorter, then set FAST_START_MTTR_TARGET to a value no greater than TARGET_MTTR.

Tuning FAST_START_MTTR_TARGET and Using MTTR Advisor

To determine the appropriate value for FAST_START_MTTR_TARGET for your database, use the following four step process:

Calibrate the FAST_START_MTTR_TARGET
Determine the Practical Range for FAST_START_MTTR_TARGET
Evaluate Different Target Values with MTTR Advisor
Determine Optimal Size for Redo Logs

Calibrate the FAST_START_MTTR_TARGET

The FAST_START_MTTR_TARGET initialization parameter causes the database to calculate internal system trigger values, in order to limit the length of the redo log and the number of dirty data buffers in the data cache. This calculation uses estimated time to read a redo block, estimates of the time to read and write a data block, as well as characteristics of typical workload of the system, such as how many dirty buffers corresponds to how many change vectors, and so on.

Initially, internal defaults are used in the calculation. These defaults are replaced over time by data gathered on I/O performance during system operation and actual cache recoveries.

You will have to perform several instance recoveries in order to calibrate your FAST_START_MTTR_TARGET value properly. Before starting calibration, you must decide whether FAST_START_MTTR_TARGET is being calibrated for a database crash or a hardware crash. This is a consideration if your database files are stored in a file system or if your I/O subsystem has a memory cache, because there is a considerable difference in the read and write time to disk depending on whether or not the files are cached. The appropriate value for FAST_START_MTTR_TARGET will depend upon which type of crash is more important to recover from quickly.

To effectively calibrate FAST_START_MTTR_TARGET, make sure that you run the typical workload of the system for long enough, and perform several instance recoveries to ensure that the time to read a redo block and the time to read or write a data block during recovery are recorded accurately.

Determine the Practical Range for FAST_START_MTTR_TARGET

After calibration, you can perform tests to determine the practical range for FAST_START_MTTR_TARGET for your database.

Determining Lower Bound for FAST_START_MTTR_TARGET: Scenario

To determine the lower bound of the practical range, set FAST_START_MTTR_TARGET to 1, and start up your database. Then check the value of V$INSTANCE_RECOVERY.TARGET_MTTR, and use this value as a good lower bound for FAST_START_MTTR_TARGET. Database startup time, rather than cache recovery time, is usually the dominant factor in determining this limit.

For example, set the FAST_START_MTTR_TARGET to 1:

SQL> ALTER SYSTEM SET FAST_START_MTTR_TARGET=1;

Then, execute the following query immediately after opening the database:

SQL> SELECT TARGET_MTTR, ESTIMATED_MTTR 
    FROM V$INSTANCE_RECOVERY;

Oracle responds with the following:

TARGET_MTTR ESTIMATED_MTTR 
18          15

The TARGET_MTTR value of 18 seconds is the minimum MTTR target that the system can achieve, that is, the lowest practical value for FAST_START_MTTR_TARGET. This minimum is calculated based on the average database startup time.

The ESTIMATED_MTTR field contains the estimated mean time to recovery based on the current state of the running database. Because the database has just opened, the system contains few dirty buffers, so not much cache recovery would be required if the instance failed at this moment. That is why ESTIMATED_MTTR can, for the moment, be lower than the minimum possible TARGET_MTTR.

ESTIMATED_MTTR can be affected in the short term by recent database activity. Assume that you query V$INSTANCE_RECOVERY immediately after a period of heavy update activity in the database. Oracle responds with the following:

TARGET_MTTR ESTIMATED_MTTR 
18          30

Now the effective MTTR target is still 18 seconds, and the estimated MTTR (if a crash happened at that moment) is 30 seconds. This is an acceptable result. This means that some checkpoints writes might not have finished yet, so the buffer cache contains more dirty buffers than targeted.

Now wait for sixty seconds and reissue the query to V$INSTANCE_RECOVERY. Oracle responds with the following:

TARGET_MTTR ESTIMATED_MTTR 
18          25

The estimated MTTR at this time has dropped to 25 seconds, because some of the dirty buffers have been written out during this period

Determining Upper Bound for FAST_START_MTTR_TARGET

To determine the upper bound of the practical range, set FAST_START_MTTR_TARGET to 3600, and operate your database under a typical workload for a while. Then check the value of V$INSTANCE_RECOVERY.TARGET_MTTR. This value is a good upper bound for FAST_START_MTTR_TARGET.

The procedure is substantially similar to that in "Determining Lower Bound for FAST_START_MTTR_TARGET: Scenario".

Selecting Preliminary Value for FAST_START_MTTR_TARGET

Once you have determined the practical bounds for the FAST_START_MTTR_TARGET parameter, select a preliminary value for the parameter. Choose a higher value within the practical range if your concern is with database performance, and a lower value within the practical range if your priority is shorter recovery times. The narrower the practical range, of course, the easier the choice becomes.

For example, if you discovered that the practical range was between 17 and 19 seconds, it would be quite simple to choose 19, because it makes relatively little difference in recovery time and at the same time minimizes the effect of checkpointing on system performance. However, if you found that the practical range was between 18 and 40 seconds, you might choose a compromise value of 30, and set the parameter accordingly:

SQL> ALTER SYSTEM SET FAST_START_MTTR_TARGET=30;

You might then go on to use the MTTR Advisor to determine an optimal value.

Evaluate Different Target Values with MTTR Advisor

Once you have selected a preliminary value for FAST_START_MTTR_TARGET, you can use MTTR Advisor to evaluate the effect of different FAST_START_MTTR_TARGET settings on system performance, compared to your chosen setting.

Enabling MTTR Advisor

To enable MTTR Advisor, set the two initialization parameters STATISTICS_LEVEL and FAST_START_MTTR_TARGET.

STATISTICS_LEVEL governs whether all advisors are enabled and is not specific to MTTR Advisor. Make sure that it is set to TYPICAL or ALL. Then, when FAST_START_MTTR_TARGET is set to a non-zero value, the MTTR Advisor is enabled.

Using MTTR Advisor

After enabling MTTR Advisor, run a typical database workload for a while. When MTTR Advisor is ON, the database simulates checkpoint queue behavior under the current value of FAST_START_MTTR_TARGET, and up to four other different MTTR settings within the range of valid FAST_START_MTTR_TARGET values. (The database will in this case determine the valid range for FAST_START_MTTR_TARGET itself before testing different values in the range.)

Viewing MTTR Advisor Results: V$MTTR_TARGET_ADVICE

The dynamic performance view V$MTTR_TARGET_ADVICE lets you view statistics or advisories collected by MTTR Advisor.

The database populates V$MTTR_TARGET_ADVICE with advice about the effects of each of the FAST_START_MTTR_TARGET settings for your database. For each possible value of FAST_START_MTTR_TARGET, the row contains details about how many cache writes would be performed under the workload tested for that value of FAST_START_MTTR_TARGET.

Specifically, each row contains information about cache writes, total physical writes (including direct writes), and total I/O (including reads) for that value of FAST_START_MTTR_TARGET, expressed both as a total number of operations and a ratio compared to the operations under your chosen FAST_START_MTTR_TARGET value. For instance, a ratio of 1.2 indicates 20% more cache writes.

Knowing the effect of different FAST_START_MTTR_TARGET settings on cache write activity and other I/O enables you to decide better which FAST_START_MTTR_TARGET value best fits your recovery and performance needs.

If MTTR Advisor is currently on, then V$MTTR_TARGET_ADVICE shows the Advisor information collected. If MTTR Advisor is currently OFF, the view shows information collected the last time MTTR Advisor was ON since database startup, if any. If the database has been restarted since the last time the MTTR Advisor was used, or if it has never been used, the view will not show any rows.

See Also:

Oracle Database Reference for column details of the V$MTTR_TARGET_ADVICE view

Determine Optimal Size for Redo Logs

You can use the V$INSTANCE_RECOVERY view column OPTIMAL_LOGFILE_SIZE to determine the size of your online redo logs. This field shows the redo log file size in megabytes that is considered optimal based on the current setting of FAST_START_MTTR_TARGET. If this field consistently shows a value greater than the size of your smallest online log, then you should configure all your online logs to be at least this size.

Note, however, that the redo log file size affects the MTTR. In some cases, you may be able to refine your choice of the optimal FAST_START_MTTR_TARGET value by re-running the MTTR Advisor with your suggested optimal log file size.