Oracle® Database High Availability Best Practices
10g Release 2 (10.2)
This chapter describes Oracle configuration best practices for Oracle Database and related components.
This chapter contains these topics:
See Also: Appendix A, "Database SPFILE and Oracle Net Configuration File Samples" for complete examples of database parameter settings
This section describes best practices for configuring a fault-tolerant storage subsystem that protects data while providing manageability and performance. These practices apply to all Oracle Database high-availability architectures described in Oracle Database High Availability Overview.
This section contains these topics:
Characterize your database performance requirements using different application workloads. Extract statistics during your target workloads by getting the beginning and end statistical snapshots. Example target workloads include:
The necessary statistics can be derived from Automatic Workload Repository (AWR) reports or gathered from the
GV$SYSSTAT view. Along with understanding the database performance requirements, the performance capabilities of a storage array must be evaluated.
Low-cost storage arrays, low-cost storage networks, and Oracle Database 10g can in combination create a low-cost storage grid with excellent performance and availability. Low-cost storage is most successfully deployed for databases with certain types of performance and availability requirements. Compared to traditional high-end storage arrays, low-cost storage arrays have excellent data throughput and a superior price per gigabyte. However, low-cost storage arrays do not have better I/O rates for OLTP-type applications than traditional storage, although the cost per I/O per second is comparable. The Oracle Resilient Low-Cost Storage Initiative is designed to help customers reduce IT spending and promote the use of low-cost storage arrays in both departmental and enterprise environments.
The Oracle Database flash recovery area is an ideal candidate for low-cost storage. Because the flash recovery area contains recovery related files that are typically accessed with sequential 1MB streams, the performance characteristics of low-cost storage are well suited for the flash recovery area. The flash recovery area can be configured to use low-cost storage while the database area remains on traditional storage.
Best Practices for Creating a Low-Cost Storage Grid for Oracle Databases at
Oracle Resilient Low-Cost Storage Initiative Web site at
ASM is a vertical integration of both the file system and the volume manager built specifically for Oracle database files. It extends the concept of stripe and mirror everything (SAME) to optimize performance, while removing the need for manual I/O tuning (distributing the datafile layout to avoid hot spots). ASM helps manage a dynamic database environment by letting you grow the database size without shutting down the database to adjust the storage allocation. ASM also enables low-cost modular storage to deliver higher performance and greater availability by supporting mirroring as well as striping.
ASM should be used to manage all database files. However, ASM can be phased into your environment initially supporting only the flash recovery area. This approach is particularly well suited for introducing low-cost storage into an existing environment where traditional storage configurations currently exist.
To improve manageability, ASMLib should be used on platforms where it is available. ASMLib is a support library for ASM. ASMLib eliminates the impact when the mappings of disk device names change upon system reboot. Although ASMLib is not required to run ASM, it simplifies the management of disk device names, makes the discovery process simpler, and removes the problem of disks being added on one node but not known to the other nodes in the cluster.
Chapter 16 "Migrating Databases To and From ASM with Recovery Manager" in the Oracle Database Backup and Recovery Advanced User's Guide
Oracle Database 10g Release 2 Automatic Storage Management Overview and Technical Best Practices at
Oracle ASMLib Web site at
Oracle Database Administrator's Guide for more information on configuring Automatic Storage Management
When using ASM for database storage, you should create two disk groups: one disk group for the database area and another disk group for the flash recovery area:
The database area contains active database files, such as datafiles, control files, online redo log files, Data Guard Broker metadata files, and change tracking files used for RMAN incremental backups. For example:
CREATE DISKGROUP DATA DISK '/devices/lun01','/devices/lun02','/devices/lun03','/devices/lun04';
The flash recovery area contains recovery-related files, such as a copy of the current control file, a member of each online redo log file group, archived redo log files, RMAN backup sets, and flashback log files. For example:
CREATE DISKGROUP RECO DISK '/devices/lun05','/devices/lun06','/devices/lun07','/devices/lun08', '/devices/lun09','/devices/lun10','/devices/lun11','/devices/lun12';
To simplify file management, use Oracle managed files to control file naming. Enable Oracle managed files by setting the DB_CREATE_FILE_DEST and DB_RECOVERY_FILE_DEST initialization parameters.
Note: Using a flash recovery area by setting DB_RECOVERY_FILE_DEST requires that you also set DB_RECOVERY_FILE_DEST_SIZE to bound the amount of disk space used by the flash recovery area. DB_RECOVERY_FILE_DEST and DB_RECOVERY_FILE_DEST_SIZE are dynamic parameters that allow you to change the destination and size of the flash recovery area.
DB_CREATE_FILE_DEST=+DATA
DB_RECOVERY_FILE_DEST=+RECO
DB_RECOVERY_FILE_DEST_SIZE=500G
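Because DB_RECOVERY_FILE_DEST and DB_RECOVERY_FILE_DEST_SIZE are dynamic, the flash recovery area can be resized or relocated without restarting the database. A minimal sketch (the size and disk group name are illustrative):

```sql
-- Grow the flash recovery area; the size limit must cover the new destination
ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE = 600G SCOPE=BOTH;
ALTER SYSTEM SET DB_RECOVERY_FILE_DEST = '+RECO' SCOPE=BOTH;
```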
There are two options for allocating disks to the two disk groups:
Allocate entire disks to the database area and flash recovery area disk groups
Partition each disk into two partitions, one for the database area and another for the flash recovery area
Figure 2-1 illustrates allocating entire disks. The advantages of this option are:
It is easier to manage the disk partitions at the operating system level, because each disk is partitioned as just one large partition.
The disadvantage of this option is:
Less I/O bandwidth, because each disk group is spread over only a subset of the available disks.
The second option is illustrated in Figure 2-2. It requires partitioning each disk into two partitions: a smaller partition on the faster outer portion of each drive for the database area, and a larger partition on the slower inner portion of each drive for the flash recovery area. The ratio for the size of the inner and outer partitions depends on the estimated size of the database area and the flash recovery area.
The advantage of this approach is:
Higher I/O bandwidth available, because both disk groups are spread over all available spindles. This advantage is considerable for the database area disk group for I/O intensive applications.
The disadvantages are:
A double disk failure may result in the loss of both disk groups, requiring the use of a standby database or tape backups for recovery.
Higher initial administrative efforts are required to partition each disk properly.
Disk multipathing software aggregates multiple independent I/O paths into a single logical path. The path abstraction provides I/O load balancing across host bus adapters (HBA) and nondisruptive failovers when there is a failure in the I/O path. Disk multipathing software should be used in conjunction with ASM.
When specifying disk names during disk group creation in ASM, use the logical device representing the single logical path. For example, when using Device Mapper on Linux 2.6, a logical device path of /dev/dm-0 may be the aggregation of several physical disks, such as /dev/sdh. Within ASM, the asm_diskstring parameter should contain /dev/dm-* to discover the logical device /dev/dm-0, and that logical device should be used during disk group creation:
asm_diskstring='/dev/dm-*'
CREATE DISKGROUP DATA DISK '/dev/dm-0','/dev/dm-1','/dev/dm-2','/dev/dm-3';
When setting up redundancy to protect from hardware failures, there are two options to consider:
Oracle recommends that you configure redundancy in the storage array by enabling RAID protection, such as RAID1 (mirroring) or RAID5 (striping plus parity). For example, to create an ASM disk group where redundancy is provided by the storage array, first create the RAID-protected logical unit numbers (LUNs) in the storage array, and then create the ASM disk group using the EXTERNAL REDUNDANCY clause:
CREATE DISKGROUP DATA EXTERNAL REDUNDANCY DISK '/devices/lun1','/devices/lun2','/devices/lun3','/devices/lun4';
If the storage array does not offer the desired level of redundancy, or if you need to configure redundancy across multiple storage arrays, then use ASM redundancy. ASM provides redundancy through failure groups, which are defined during disk group creation. ASM redundancy can be either normal redundancy, where files are two-way mirrored, or high redundancy, where files are three-way mirrored. After a disk group is created, its redundancy level cannot be changed.
Failure group definition is specific to each storage setup, but these guidelines should be followed:
If every disk is available through every I/O path, as would be the case if using disk multipathing software, then leave each disk in its own failure group. This is the default ASM behavior if creating a disk group without explicitly defining failure groups.
For an array with two controllers where each controller sees only half of the drives, create a disk group with two failure groups, one for each controller, to protect against controller failure:
CREATE DISKGROUP DATA NORMAL REDUNDANCY FAILGROUP controller1 DISK '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4' FAILGROUP controller2 DISK '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
For an array with two controllers where every disk is seen through both controllers, create a disk group with each disk in its own failure group:
CREATE DISKGROUP DATA NORMAL REDUNDANCY DISK '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4', '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
To mirror across two storage arrays, place the disks from each array in their own failure group:
CREATE DISKGROUP DATA NORMAL REDUNDANCY FAILGROUP array1 DISK '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4' FAILGROUP array2 DISK '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
When determining the proper size of a disk group that is protected with ASM redundancy, enough free space must exist in the disk group so that when a disk fails, ASM can automatically reconstruct the contents of the failed drive on the other drives in the disk group while the database remains online. The amount of space required to ensure that ASM can restore redundancy following disk failure is shown in the REQUIRED_MIRROR_FREE_MB column of the V$ASM_DISKGROUP view. The amount of free space that can be safely used in a disk group, taking mirroring into account, while still being able to restore redundancy after a disk failure is shown in the USABLE_FILE_MB column of the same view. USABLE_FILE_MB should always be greater than zero. If USABLE_FILE_MB falls below zero, then add more disks to the disk group.
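Both columns can be monitored with a simple query against V$ASM_DISKGROUP; a sketch:

```sql
-- A disk group needs more disks when USABLE_FILE_MB approaches or drops below zero
SELECT name, required_mirror_free_mb, usable_file_mb
FROM   v$asm_diskgroup;
```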
Consider HARD-compliant storage for the greatest protection against data corruption. Data corruption is very rare, but it can have a catastrophic effect on a business when it occurs.
The goal of the Hardware Assisted Resilient Data (HARD) initiative is to eliminate a class of failures that the computer industry has so far been powerless to prevent. RAID has gained a wide following in the storage industry by ensuring the physical protection of data. HARD takes data protection to the next level by going beyond protecting physical data to protecting business data.
The HARD initiative is designed to prevent data corruption before it happens. Under the HARD initiative, Oracle partners with storage vendors to implement Oracle data validation and checking algorithms inside storage devices. This makes it possible to prevent corrupted data from being written to permanent storage.
The classes of data corruption that Oracle addresses with HARD include:
Writes that physically and logically corrupt Oracle blocks
Writes of database blocks to incorrect locations
Writes of partial or incomplete blocks
Writes by other applications to Oracle data blocks
End-to-end block validation is the technology employed by the operating system or storage subsystem to validate the Oracle Database data block contents. By validating Oracle Database data in the storage devices, data corruption is detected and eliminated before it can be written to permanent storage. This goes beyond the current Oracle Database block validation features that do not detect a stray, lost, or corrupted write until the next physical read.
Storage vendors who are partners with Oracle are given the opportunity to implement validation checks based on a specification. A particular vendor's implementation may offer features specific to its storage technology. Oracle maintains a Web site that shows a comparison of each vendor's solution by product and Oracle version.
Note: When using HARD-compliant storage with ASM-managed database storage, ASM should be configured with external redundancy. Additionally, HARD protections should be disabled when performing any rebalance operation, such as adding a new disk, to avoid the risk of HARD inadvertently flagging the movement of data as a bad write.
See Also: http://www.oracle.com/technology/deploy/availability/htdocs/HARD.html for the most recent information on the HARD initiative
The best practices discussed in this section apply to Oracle Database 10g database architectures in general, including all architectures described in Oracle Database High Availability Overview:
Oracle Database 10g
Oracle Database 10g with RAC
Oracle Database 10g with Data Guard
Oracle Database 10g with RAC and Data Guard (MAA)
These recommendations are identical for both the primary and standby databases when Oracle Data Guard is used. It is necessary to adopt these practices to reduce or avoid outages, reduce risk of corruption, and improve recovery performance.
This section contains the following types of best practices for configuring the database in general:
This section describes the following minimum requirements for configuring Oracle Database for high availability:
ARCHIVELOG mode enables online database backup and is necessary to recover the database to a point in time later than what has already been restored. Features such as Oracle Data Guard and Flashback Database require that the production database run in ARCHIVELOG mode.
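Switching a database to ARCHIVELOG mode requires a clean shutdown and a mount-state restart; a minimal SQL*Plus sketch:

```sql
SHUTDOWN IMMEDIATE
STARTUP MOUNT
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;
-- Verify the log mode
ARCHIVE LOG LIST
```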
See Also: Oracle Database Administrator's Guide for more information about using automatic archiving
By default, Oracle always validates the data blocks that it reads from disk. Enabling data and log block checksums by setting DB_BLOCK_CHECKSUM to TYPICAL enables Oracle to detect other types of corruption caused by underlying disks, storage systems, or I/O systems. Before a data block is written to disk, a checksum is computed and stored in the block. When the block is subsequently read from disk, the checksum is computed again and compared with the stored checksum. Any difference is treated as a media error, and an ORA-1578 error is signaled. Block checksums are always maintained for the SYSTEM tablespace. If DB_BLOCK_CHECKSUM is set to FULL, then in-memory corruption is also detected before the block is written to disk.
In addition to enabling data block checksums, Oracle also calculates a checksum for every redo log block before writing it to the current log. Redo record corruption is found as soon as the log is archived. Without this option, corruption in a redo log can go unnoticed until the log is applied to a standby database, or until a backup is restored and rolled forward through the log containing the corrupt log block.
Ordinarily, the overhead for TYPICAL is one to two percent, and the overhead for FULL is four to five percent. The default setting, TYPICAL, provides critical detection of corruption at very low cost and remains a requirement for high availability. Measure the performance impact with your workload on a test system, and ensure that the impact is acceptable before moving from TYPICAL to FULL on an active database.
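DB_BLOCK_CHECKSUM is a dynamic parameter, so the setting can be changed without restarting the instance; for example:

```sql
ALTER SYSTEM SET DB_BLOCK_CHECKSUM = TYPICAL SCOPE=BOTH;
```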
This section describes Oracle Database best practices for reducing recovery time or increasing its availability and redundancy:
Use Oracle log multiplexing to create multiple redo log members in each redo group, one in the data area and one in the flash recovery area. This protects against a failure involving the redo log, such as a disk or I/O failure for one of the members, or a user error that accidentally removes a member through an operating system command. If at least one redo log member is available, then the instance can continue to function.
All online redo log files should be the same size and configured to switch approximately once an hour during normal activity. They should not switch more frequently than every 20 minutes during peak activity.
There should be a minimum of four online log groups to prevent
LGWR from waiting for a group to be available following a log switch. A group might be unavailable because a checkpoint has not yet completed or because the group has not yet been archived.
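With disk groups named as in the earlier examples, a multiplexed log group with one member in the data area and one in the flash recovery area can be added as follows (the group number and size are illustrative):

```sql
-- One member in the database area, one in the flash recovery area
ALTER DATABASE ADD LOGFILE GROUP 5 ('+DATA','+RECO') SIZE 1024M;
```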
Oracle Database Administrator's Guide for more information about managing redo logs
Oracle Data Guard Concepts and Administration for more information about online, archived, and standby redo log files
The flash recovery area is Oracle managed disk space that provides a centralized disk location for backup and recovery files. The flash recovery area is defined by setting the following database initialization parameters:
This parameter specifies the default location for the flash recovery area.
This parameter specifies (in bytes) the hard limit on the total space to be used by target database recovery files created in the recovery area location.
The flash recovery area should be the primary location for recovery. When the flash recovery area is properly sized, files needed for repair will be readily available. The minimum recommended disk limit is the combined size of the database, incremental backups, all archived redo logs that have not been copied to tape, and flashback logs.
See Also: Oracle Database Backup and Recovery Basics for detailed information about sizing the flash recovery area and setting the retention period
Flashback Database enables you to rewind the database to a previous point in time without restoring backup copies of the datafiles. Flashback Database is a revolutionary recovery feature that operates on only the changed data. Flashback Database makes the time to correct an error proportional to the time to cause and detect the error, without recovery time being a function of the database size. You can flash back a database from both RMAN and SQL*Plus with a single command instead of using a complex procedure.
During normal runtime, Flashback Database buffers and writes before images of data blocks into the flashback logs, which reside in the flash recovery area. Ensure there is sufficient I/O bandwidth available to the flash recovery area to maintain flashback write throughput. If flashback writes are slow, as evidenced by the
flashback free buffer waits wait event, then database throughput is affected. The amount of disk writes caused by Flashback Database differs depending on the workload and application profile. For a typical OLTP workload that is using a flash recovery area with sufficient disk spindles and I/O throughput, the overhead incurred by Flashback Database is less than two percent.
Flashback Database can flash back a primary or standby database to a point in time prior to a role transition. In addition, Flashback Database can be performed to a point in time prior to a resetlogs operation, which gives administrators more flexibility to detect and correct human errors. Flashback Database is required when using fast-start failover so that Data Guard Broker can automatically reinstate the primary database following an automatic failover.
See Also: Oracle Database Backup and Recovery Basics for more information on restore points and Flashback Database
The fast-start fault recovery feature reduces the time required to recover from a crash. It also makes the recovery bounded and predictable by limiting the number of dirty buffers and the number of redo records generated between the most recent redo record and the last checkpoint. With this feature, the
FAST_START_MTTR_TARGET initialization parameter simplifies the configuration of recovery time from instance or system failure. This parameter specifies a target recovery time objective (RTO): the time, in seconds, that it should take to start up the instance and perform cache recovery. Once this parameter is set, the database manages incremental checkpoint writes in an attempt to meet the target. If you have chosen a practical value for this parameter, then you can expect your database to recover, on average, in approximately the number of seconds you have chosen.
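For example, to target roughly a five-minute cache recovery and then check the estimate the database maintains (the target value is illustrative):

```sql
ALTER SYSTEM SET FAST_START_MTTR_TARGET = 300 SCOPE=BOTH;

-- Compare the target with the current estimated recovery time
SELECT target_mttr, estimated_mttr FROM v$instance_recovery;
```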
See Also: Oracle Database Backup and Recovery Advanced User's Guide for more information on fast-start fault recovery
When DB_BLOCK_CHECKING is enabled, block checking is performed after any in-memory block change.
When block checking is enabled, Oracle Database verifies that the block is self-consistent whenever the corresponding type of block change occurs. If the block is inconsistent, then it is marked corrupt, an ORA-1578 error is returned, and a trace file is created containing details of the problem. Without block checking enabled, corruption can go undetected until the block is accessed again. Block checking for the SYSTEM tablespace is always enabled, no matter what value is chosen for DB_BLOCK_CHECKING.
Block checking can often prevent memory and data corruption. Turning on this feature typically causes an additional one percent to ten percent overhead, depending on the setting and the workload. Measure the performance impact on a test system using your workload and ensure that it is acceptable before introducing this feature on an active database.
To detect block corruption on disk that DB_BLOCK_CHECKING could not prevent, use a tool such as the RMAN BACKUP command with the VALIDATE option or the DBVERIFY utility.
Set DISK_ASYNCH_IO=TRUE to enable asynchronous disk I/O for optimal I/O performance.
For large production databases, set the
LOG_BUFFER initialization parameter to a minimum of 8 MB. This setting ensures the database allocates maximum memory (typically 16 MB) for writing Flashback Database logs.
Memory management has improved significantly with the advent of Automatic Shared Memory Management (ASMM). By setting the SGA_TARGET parameter to a nonzero value, the shared pool, large pool, Java pool, Streams pool, and buffer cache are automatically and dynamically resized as needed. See the Oracle Database Administrator's Guide for more information.
Increase the PARALLEL_EXECUTION_MESSAGE_SIZE initialization parameter from its default value to at least 4096. This configuration step accelerates parallel operations, including instance recovery.
Set PARALLEL_MIN_SERVERS so that the required number of parallel recovery processes is pre-spawned for fast recovery from an instance or node crash. This works with FAST_START_MTTR_TARGET to bound recovery time.
When the value of
RECOVERY_ESTIMATED_IOS in the
V$INSTANCE_RECOVERY view is small (for example, < 5000), then the overhead of parallel recovery may outweigh any benefit. This will typically occur with a very aggressive setting of
FAST_START_MTTR_TARGET. In this case, set
RECOVERY_PARALLELISM to 1 to disable parallel recovery.
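The decision can be based on the same view; a sketch:

```sql
-- If the estimated I/O count is small, parallel recovery may not pay off
SELECT recovery_estimated_ios FROM v$instance_recovery;

-- RECOVERY_PARALLELISM is static, so the change takes effect at the next startup
ALTER SYSTEM SET RECOVERY_PARALLELISM = 1 SCOPE=SPFILE;
```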
This section describes best practices for improving Oracle Database manageability:
Effective data collection and analysis is essential for identifying and correcting performance problems. Oracle provides a number of tools that allow a performance engineer to gather information regarding database performance.
The Oracle Database automatic performance tuning features include:
Set the AWR automatic snapshot interval to 10 to 20 minutes to capture performance peaks during stress testing or to diagnose performance issues. Under usual workloads, a 60-minute interval is sufficient.
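The snapshot interval (and retention) can be changed through the DBMS_WORKLOAD_REPOSITORY package; for example, a 15-minute interval:

```sql
BEGIN
  DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(interval => 15);
END;
/
```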
The server parameter file (SPFILE) enables a single, central parameter file to hold all database initialization parameters associated with all instances of a database. This provides a simple, persistent, and robust environment for managing database parameters. An SPFILE is required when using Data Guard Broker.
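A database still running from a traditional text parameter file can be migrated with a single statement; the file location shown is illustrative:

```sql
-- Create a server parameter file from the current text initialization file,
-- then restart the instance so that it uses the SPFILE
CREATE SPFILE FROM PFILE='/u01/app/oracle/admin/orcl/pfile/init.ora';
```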
Oracle Database Administrator's Guide for information on managing initialization parameters with an SPFILE
Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for information on initialization parameters with Real Application Clusters
Oracle Data Guard Broker for information on other prerequisites for using Oracle Data Guard Broker
With automatic undo management, the Oracle Database server effectively and efficiently manages undo space, leading to lower administrative complexity and cost. When Oracle Database internally manages undo segments, undo block and consistent read contention are eliminated because the size and number of undo segments are automatically adjusted to meet the current workload requirement.
To use automatic undo management, set the following initialization parameters:
UNDO_MANAGEMENT: Set this parameter to AUTO.
UNDO_RETENTION: This parameter specifies the desired time, in seconds, to retain undo data. It must be the same on all instances.
UNDO_TABLESPACE: This parameter should specify a unique undo tablespace for each instance.
Advanced object recovery features, such as Flashback Query, Flashback Version Query, Flashback Transaction Query, and Flashback Table, require automatic undo management. By default, Oracle Database automatically tunes undo retention by collecting database use statistics and estimating undo capacity needs. You can affect this automatic tuning by setting the
UNDO_RETENTION initialization parameter. It is only necessary to set this initialization parameter in the following cases:
The undo tablespace has the
AUTOEXTEND option enabled.
You want to have undo retention for LOBs.
You want a retention guarantee.
Note: By default, undo data can be overwritten by ongoing transactions, even if the UNDO_RETENTION setting specifies that the undo data should be maintained. To guarantee that unexpired undo data is not overwritten, retention guarantee must be enabled for the undo tablespace.
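Retention guarantee is enabled per undo tablespace, either at creation time or afterward; the tablespace name is illustrative:

```sql
ALTER TABLESPACE undotbs1 RETENTION GUARANTEE;
```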
With the retention guarantee option, unexpired undo is preserved even if that causes DML activity to fail for lack of space (DDL statements are still allowed). If the tablespace is configured with less space than the transaction throughput requires, then the following four events occur in this sequence:
If you have an autoextensible file, then it will automatically grow to accommodate the retained undo data.
A warning alert is issued at 85 percent full.
A critical alert is issued at 97 percent full.
Transactions receive an out-of-space error.
See Also: Oracle Database Administrator's Guide for more information about the UNDO_RETENTION setting and the size of the undo tablespace
Locally managed tablespaces perform better than dictionary-managed tablespaces, are easier to manage, and eliminate space fragmentation concerns. Locally managed tablespaces use bitmaps stored in the datafile headers and, unlike dictionary managed tablespaces, do not contend for centrally managed resources for space allocations and de-allocations.
See Also: Oracle Database Administrator's Guide for more information on locally managed tablespaces
Automatic segment space management simplifies space administration tasks, thus reducing the chance of human error. An added benefit is the elimination of performance tuning related to space management. It facilitates management of free space within objects such as tables or indexes, improves space utilization, and provides significantly better performance and scalability with simplified administration. The automatic segment space management feature is enabled by default for all new tablespaces created using default attributes.
See Also: Oracle Database Administrator's Guide for more information on segment space management
Temporary tablespaces improve the concurrency of multiple sort operations, reduce sort operation overhead, and avoid data dictionary space management operations altogether. This is a more efficient way of handling temporary segments, from the perspective of both system resource usage and database performance.
A default temporary tablespace should be specified for the entire database to prevent accidental exclusion of the temporary tablespace clause. This can be done at database creation time by using the
DEFAULT TEMPORARY TABLESPACE clause of the
CREATE DATABASE statement, or after database creation by the
ALTER DATABASE statement. Using the default temporary tablespace ensures that all disk sorting occurs in a temporary tablespace and that other tablespaces are not mistakenly used for sorting.
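For an existing database, the default temporary tablespace can be set with ALTER DATABASE; the tablespace name is illustrative:

```sql
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp;
```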
See Also: Oracle Database Administrator's Guide for more information on managing tablespaces
Resumable space allocation provides a way to suspend and later resume database operations if there are space allocation failures. The affected operation is suspended instead of the database returning an error. No processes need to be restarted. When the space problem is resolved, the suspended operation is automatically resumed.
To use resumable space allocation, set the RESUMABLE_TIMEOUT initialization parameter to the number of seconds that an operation can remain suspended. You must also issue the ALTER SESSION ENABLE RESUMABLE statement at the session level.
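A sketch of both steps (the timeout value is illustrative):

```sql
-- Suspend space-allocation failures for up to two hours instead of erroring out
ALTER SYSTEM SET RESUMABLE_TIMEOUT = 7200 SCOPE=BOTH;

-- Enable resumable mode for the current session
ALTER SESSION ENABLE RESUMABLE;
```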
See Also: Oracle Database Administrator's Guide for more information on managing resumable space allocation
The Database Resource Manager gives database administrators more control over resource management decisions, so that resource allocation can be aligned with the business objectives of an enterprise. The Database Resource Manager provides the ability to prioritize work within the Oracle Database server. Availability of the database encompasses both its functionality and performance. If the database is available but users are not getting the level of performance they need, then availability and service level objectives are not being met. Application performance, to a large extent, is affected by how resources are distributed among the applications that access the database. The main goal of the Database Resource Manager is to give the Oracle Database server more control over resource management decisions, thus circumventing problems resulting from inefficient operating system management and operating system resource managers. The Database Resource Manager is enabled by default.
See Also: Oracle Database Administrator's Guide for more information on Database Resource Manager
The best practices discussed in this section apply to Oracle Database 10g with RAC. These best practices build on the Oracle Database 10g configuration best practices described in Section 2.2, "Configuring Oracle Database 10g". These best practices are identical for the primary and standby databases if they are used with Data Guard in Oracle Database 10g with RAC and Data Guard - MAA. Some of these best practices might reduce performance levels, but are necessary to reduce or avoid outages. The minimal performance impact is outweighed by the reduced risk of corruption or the performance improvement for recovery.
This section includes the following topics:
With Oracle Database 10g, application workloads can be defined as services so that they can be individually managed and controlled. DBAs control which processing resources are allocated to each service during both normal operations and in response to failures. Performance metrics are tracked by service and thresholds set to automatically generate alerts should these thresholds be crossed. CPU resource allocations and resource consumption controls are managed for services using Resource Manager. Oracle tools and facilities such as Job Scheduler, Parallel Query, and Oracle Streams Advanced Queuing also use services to manage their workloads.
With Oracle Database 10g, rules can be defined to automatically allocate processing resources to services. Oracle RAC 10g instances can be allocated to process individual services or multiple services as needed. These allocation rules can be modified dynamically to meet changing business needs. For example, these rules could be modified at the end of a quarter to ensure that there are enough processing resources to complete critical financial functions on time. Rules can also be defined so that when instances running critical services fail, the workload will be automatically shifted to instances running less critical workloads. Services can be created and administered with Enterprise Manager, Database Configuration Assistant (DBCA), and the
DBMS_SERVICE PL/SQL package.
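As a sketch of the DBMS_SERVICE approach (the service name used here is hypothetical), a service can be created and started from PL/SQL:

```sql
-- Hypothetical service name; on RAC, SRVCTL is usually preferred because it
-- also records the preferred and available instances with Oracle Clusterware.
BEGIN
  DBMS_SERVICE.CREATE_SERVICE(
    service_name => 'sales_reporting',
    network_name => 'sales_reporting');
  DBMS_SERVICE.START_SERVICE(service_name => 'sales_reporting');
END;
/
```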
Application connections to the database should be made through a Virtual Internet Protocol (VIP) address to a service defined as part of the workload management facility; this allows the greatest degree of availability and manageability.
A VIP address is an alternate public address that client connections use instead of the standard public IP address. If a node fails, then the node's VIP address fails over to another node on which the VIP address can accept connections. Clients that attempt to connect to the VIP address receive a rapid connection refused error instead of waiting for TCP connect timeout messages, thereby reducing the time wasted during the initial connection attempt to a failed node. VIP addresses are configured using the Virtual Internet Protocol Configuration Assistant (VIPCA).
See Also:Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for more information on workload management
Oracle Clusterware is the only clusterware needed for most platforms on which RAC operates. You can also use clusterware from other vendors if the clusterware is certified for RAC. However, adding unnecessary layers of software for functionality that is already provided by Oracle Clusterware adds complexity and cost and can reduce system availability, especially for planned maintenance.
Oracle Clusterware includes a high-availability framework that provides an infrastructure to manage any application. Oracle Clusterware ensures the applications it manages start when the system starts. Oracle Clusterware also monitors the applications to make sure they are always available. For example, if a process fails, then Oracle Clusterware attempts to restart the process based on scripts that you customize. If a node in the cluster fails, then you can program processes that normally run on the failed node to restart on another node. The monitoring frequency, starting, and stopping of the applications and the application dependencies are configurable.
See Also:Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for more information on managing application availability with Oracle Clusterware
Client-side load balancing evenly spreads new connection requests across all listeners. It is defined in your client connection definition by setting the LOAD_BALANCE parameter to
ON. (The default is
ON for DESCRIPTION_LISTs.) When this parameter is set to
ON, Oracle Database randomly selects an address in the address list and connects to that node's listener. This balances client connections across the available listeners in the cluster. When the listener receives the connection request, it connects the user to an instance that it knows provides the requested service. To see which services a listener supports, run the
LSNRCTL services command.
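A minimal tnsnames.ora entry illustrating client-side load balancing might look like the following (the net service name, VIP host names, and database service name are hypothetical):

```
SALES =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521))
    )
    (CONNECT_DATA = (SERVICE_NAME = SALES))
  )
```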
Server-side load balancing uses the current workload being run on the available instances for the database service requested during a connection request and directs the connection request to the least loaded instance. Server-side connection load balancing requires each instance to register with all available listeners, which is accomplished by setting
REMOTE_LISTENER parameters for each instance. This is done by default when creating a database with DBCA. Connection load balancing can be further enhanced by using the load balancing advisory and defining the connection load balancing goal for each service by setting the
CLB_GOAL attributes with the
DBMS_SERVICE PL/SQL package.
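A sketch of these settings follows (the listener alias and service name are hypothetical; LISTENERS_SALES must resolve in tnsnames.ora to an address list covering all listeners in the cluster):

```sql
-- Register each instance with all listeners (DBCA sets this by default)
ALTER SYSTEM SET REMOTE_LISTENER = 'LISTENERS_SALES' SID='*' SCOPE=BOTH;

-- Connection load balancing goal: CLB_GOAL_SHORT for short-lived
-- (OLTP-style) connections, CLB_GOAL_LONG for long-lived sessions
-- such as those from connection pools
BEGIN
  DBMS_SERVICE.MODIFY_SERVICE(
    service_name => 'sales_reporting',
    clb_goal     => DBMS_SERVICE.CLB_GOAL_SHORT);
END;
/
```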
See Also:Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for more information on workload management
Oracle Database Net Services Administrator's Guide for more information on configuring listeners
The OCR maintains cluster configuration information as well as configuration information about any cluster database within the cluster. The OCR also manages information about processes that Oracle Clusterware controls. Protect the OCR from disk failure by using the ability of Oracle Clusterware to mirror the OCR. If you have external redundancy, create the OCR on the external redundant storage. If you do not have external redundancy, create a minimum of two OCRs across two different controllers.
RAC uses the voting disk to determine which instances are members of a cluster. The voting disk must reside on a shared disk. For high availability, Oracle recommends that you have multiple voting disks. Oracle Clusterware enables multiple voting disks, but you must have an odd number of voting disks, such as three, five, and so on. If you define a single voting disk, then you should use external mirroring to provide redundancy.
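The OCR mirroring and multiple voting disk recommendations above can be sketched as follows (device paths are hypothetical; both utilities are run as root):

```
# Add an OCR mirror on a second controller
ocrconfig -replace ocrmirror /dev/raw/raw2

# Add voting disks to reach an odd total of three
# (-force requires Oracle Clusterware to be stopped on all nodes)
crsctl add css votedisk /dev/raw/raw3 -force
crsctl add css votedisk /dev/raw/raw4 -force
```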
See Also:Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for more information on managing OCR and voting disks
The OCR contains cluster and database configuration information for RAC and Cluster Ready Services (CRS), such as the cluster database node list, CRS application resource profiles, and Event Manager (EVM) authorizations. Oracle Clusterware automatically creates OCR backups every four hours. Oracle always retains the last three backup copies of OCR. The
CRSD process that creates the backups also creates and retains an OCR backup for each full day and at the end of each week. The backup files created by Oracle Clusterware should be backed up as part of the operating system backup using Oracle Secure Backup, standard operating-system tools, or third-party tools.
Note:The default location for generating backups on UNIX-based systems is
CRS_home/cdata/cluster_name, where
cluster_name is the name of your cluster. The Windows-based default location for generating backups uses the same path structure.
In addition to using the automatically created OCR backup files, you should also export the OCR contents before and after making significant configuration changes, such as adding or deleting nodes from your environment, modifying Oracle Clusterware resources, or creating a database. Do this by using the
ocrconfig -export command, which exports the OCR content to a file. The export files created by
ocrconfig should be backed up as part of the operating system backup using Oracle Secure Backup, standard operating-system tools, or third-party tools.
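For example, an export before a planned change might look like the following sketch (the file path is hypothetical; run as root):

```
# Export the OCR contents before a planned configuration change
ocrconfig -export /backup/ocr_before_change.dmp

# If the change must be backed out, the export can be restored with:
#   ocrconfig -import /backup/ocr_before_change.dmp
```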
See Also:Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for more information on backing up OCR
For the most efficient network detection and failover, CRS and RAC should use the same interconnect subnet so that they share the same view of connections and accessibility. To verify the interconnect subnet used by RAC, run the Oracle
ORADEBUG utility on one of the instances:
SQL> ORADEBUG SETMYPID
Statement processed.
SQL> ORADEBUG IPC
Information written to trace file.
SQL> ORADEBUG tracefile_name
/u01/app/oracle/admin/prod/udump/prod1_ora_24409.trc
In the trace file, examine the SSKGXPT section to determine the subnet used by RAC. In this example, the subnet in use is 192.168.0.3 and the protocol used is UDP:
SSKGXPT 0xd7be26c flags info for network 0
        socket no 7     IP 192.168.0.3  UDP 14727
To verify the interconnect subnet used by CRS, examine the value of the keyname
SYSTEM.css.node_numbers.node<n>.privatename in OCR:
prompt> ocrdump -stdout -keyname SYSTEM.css.node_numbers
[SYSTEM.css.node_numbers.node1.privatename]
ORATEXT : halinux03ic0
…
[SYSTEM.css.node_numbers.node2.privatename]
ORATEXT : halinux04ic0
The hostnames (
halinux03ic0 and
halinux04ic0 in this example) should resolve to addresses on the subnet reported in the trace file produced by
ORADEBUG (subnet 192.168.0.3). Use operating system tools to verify. For example, on Linux:
prompt> getent hosts halinux03ic0 192.168.0.3 halinux03ic0.us.oracle.com halinux03ic0
During initial setup of a RAC database, the online redo log threads and undo tablespaces for any additional instances in the cluster should be created. If the database might be an Oracle Data Guard standby database at some point, then also create the standby redo logs for each thread at this time.
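As a sketch of adding resources for a second instance (group numbers, sizes, tablespace name, and the +DATA disk group are hypothetical; adjust to your layout):

```sql
-- Add online redo log groups for redo thread 2 and enable the thread
ALTER DATABASE ADD LOGFILE THREAD 2
  GROUP 5 ('+DATA') SIZE 100M,
  GROUP 6 ('+DATA') SIZE 100M;
ALTER DATABASE ENABLE PUBLIC THREAD 2;

-- Add an undo tablespace for the second instance
CREATE UNDO TABLESPACE undotbs2 DATAFILE '+DATA' SIZE 500M;
```

If the database might later serve as a Data Guard standby, standby redo logs for each thread would be created at the same time.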
See Also:Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for an overview of storage in Oracle Real Application Clusters
The best practices discussed in this section apply to Oracle Database 10g with Data Guard. These best practices build on the ones described in Section 2.2, "Configuring Oracle Database 10g". The proper configuration of Oracle Data Guard Redo Apply and SQL Apply is essential to ensuring that all standby databases work properly and perform their roles within service levels after switchovers and failovers. Most Data Guard configuration settings can be made using Oracle Enterprise Manager. For more advanced, less frequently used Data Guard configuration parameters, the Data Guard Broker command-line interface or SQL*Plus can be used.
Data Guard enables you to use either a physical standby database (Redo Apply) or a logical standby database (SQL Apply), or both, depending on the business requirements. A physical standby database provides a physically identical copy of the primary database, with on-disk database structures that are identical to the primary database on a block-for-block basis. The database schema, including indexes, are the same. A physical standby database is kept synchronized with the primary database by applying the redo data received from the primary database through media recovery.
A logical standby database contains the same logical information as the production database, although the physical organization and structure of the data can be different. It is kept synchronized with the primary database by transforming the data in the redo log files received from the primary database into SQL statements and then executing the SQL statements on the standby database. A logical standby database can be used for other business purposes in addition to disaster recovery requirements.
This section contains configuration best practices for the following aspects of Data Guard:
This section contains information that can help you choose between physical standby and logical standby databases.
This section contains these topics:
A physical standby database provides the following benefits:
Disaster recovery and high availability
A physical standby database enables a robust and efficient disaster recovery and high-availability solution. Easy-to-manage switchover and failover capabilities allow easy role reversals between primary and physical standby databases, minimizing the downtime of the primary database for planned and unplanned outages.
Using a physical standby database, Data Guard can ensure no data loss with certain configurations, even in the face of unforeseen disasters. A physical standby database supports all datatypes, and all DDL and DML operations that the primary database can support. It also provides a safeguard against data corruption and user errors. Storage-level physical corruption on the primary database does not propagate to the standby database. Similarly, logical corruption or user errors that cause the primary database to be permanently damaged can be resolved. Finally, the redo data is validated when it is applied to the standby database.
Reduction in primary database workload
Oracle Recovery Manager (RMAN) can use physical standby databases to off-load backups from the primary database saving valuable CPU and I/O cycles. The physical standby database can also be opened in read-only mode for reporting and queries.
The Redo Apply technology used by the physical standby database applies changes using low-level recovery mechanisms, which bypass all SQL level code layers; therefore, it is the most efficient mechanism for applying high volumes of redo data.
Read-write testing and reporting database
Using Flashback Database and a physical standby database, you can configure a temporary clone database for testing and reporting. The temporary clone can later be resynchronized with the primary database.
A logical standby database provides similar disaster recovery, high availability, and data protection benefits as a physical standby database. It also provides the following specialized benefits:
Efficient use of standby hardware resources
A logical standby database can be used for other business purposes in addition to disaster recovery requirements. It can host additional database schemas beyond the ones that are protected in a Data Guard configuration, and users can perform normal DDL or DML operations on those schemas any time. Because the logical standby tables that are protected by Data Guard can be stored in a different physical layout than on the primary database, additional indexes and materialized views can be created to improve query performance and suit specific business requirements.
Reduction in primary database workload
A logical standby database can remain open at the same time its tables are updated from the primary database, and those tables are simultaneously available for read access. This makes a logical standby database an excellent choice to do queries, summations, and reporting activities, thereby off-loading the primary database from those tasks and saving valuable CPU and I/O cycles.
Database rolling upgrade
A logical standby database can be upgraded to the next version and subsequently become the new primary database after a Data Guard switchover. This rolling upgrade procedure can dramatically reduce the planned downtime of a database upgrade.
Determining which standby type to implement can be accomplished by examining several key areas. Because logical standby does not support all datatypes, you must first determine if your application uses any unsupported datatypes. To determine if your application uses unsupported datatypes, run the following queries on the primary database:
To list unsupported tables, issue the following query:
SET PAGES 200 LINES 132
COL OWNER FORMAT A8
COL DATA_TYPE FORMAT A15
COL TABLE_NAME FORMAT A32
COL COLUMN_NAME FORMAT A25
COL ATTRIBUTES FORMAT A15
SELECT OWNER, TABLE_NAME, REASON
  FROM DBA_STREAMS_UNSUPPORTED
 WHERE OWNER NOT IN
       (SELECT OWNER FROM DBA_LOGSTDBY_SKIP
         WHERE STATEMENT_OPT = 'INTERNAL SCHEMA')
 ORDER BY OWNER;
To list unsupported tables with column and data type information, issue the following query:
COL OWNER FORMAT A9
COL DATA_TYPE FORMAT A35
COL TABLE_NAME FORMAT A35
SELECT OWNER, TABLE_NAME, COLUMN_NAME, DATA_TYPE
  FROM DBA_TAB_COLS
 WHERE OWNER NOT IN
       (SELECT OWNER FROM DBA_LOGSTDBY_SKIP
         WHERE STATEMENT_OPT = 'INTERNAL SCHEMA')
   AND DATA_TYPE NOT IN
       ('BINARY_DOUBLE', 'BINARY_FLOAT',
        'INTERVAL YEAR TO MONTH', 'INTERVAL DAY TO SECOND',
        'BLOB', 'CLOB', 'CHAR', 'DATE', 'LONG', 'LONG RAW',
        'NCHAR', 'NCLOB', 'NUMBER', 'NVARCHAR2', 'RAW',
        'TIMESTAMP', 'TIMESTAMP(6)', 'TIMESTAMP(6) WITH TIME ZONE',
        'TIMESTAMP(9)', 'TIMESTAMP WITH LOCAL TIMEZONE',
        'TIMESTAMP WITH TIMEZONE', 'VARCHAR', 'VARCHAR2')
 ORDER BY 1, 2;
If either query returns rows with essential application tables, then use a physical standby database or investigate changing the primary database to use only supported datatypes. If the query does not return any rows with essential application tables, then you can use either a physical or a logical standby database.
Next, consider the need for the standby database to be accessible while changes are being applied. If you require that the standby is opened read/write with read-only access to the data being maintained and your application does not make use of unsupported datatypes, then logical standby is your best choice. If access to the standby while changes are being applied is not required or you have datatypes that are not supported by logical standby, then you should implement a physical standby.
If a logical standby database is still a viable choice, then you need to evaluate if it can handle your peak workloads. Because a logical standby database applies changes with SQL instead of a low level recovery mechanism, you need to assess database performance carefully. See "Oracle Database 10g Release 2 Best Practices: Data Guard SQL Apply" at
In some situations, a business cannot afford to lose data at any cost. In other situations, the availability of the database might be more important than protecting data. Some applications require maximum database performance and can tolerate a potential loss of data if a disaster occurs.
Based on your business requirements, choose one of the following protection modes:
Maximum protection mode guarantees that no data loss will occur if the primary database fails. To ensure that data loss cannot occur, the primary database shuts down if a fault prevents it from writing the redo stream to at least one remote standby redo log.

Maximum availability mode provides the highest level of data protection that is possible without compromising the availability of the primary database. Like maximum protection mode, a transaction does not commit until the redo needed to recover that transaction is written to the local online redo log and to at least one remote standby redo log. Unlike maximum protection mode, the primary database does not shut down if a fault prevents it from writing its redo stream to a remote standby redo log; instead, the primary database operates in maximum performance mode until the fault is corrected and all redo log gaps are resolved, at which point it automatically resumes operating in maximum availability mode.
Maximum performance mode (the default mode) provides the highest level of data protection that is possible without affecting the performance of the primary database. This is accomplished by allowing a transaction to commit as soon as the redo data needed to recover that transaction is written to the local online redo log.
The redo data stream of the primary database is also written to at least one standby database, but that redo stream is written asynchronously with respect to the commitment of the transactions that create the redo data. When network links with sufficient bandwidth are used, this mode provides a level of data protection that approaches that of maximum availability mode, with minimal effect on primary database performance.
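The protection mode is selected on the primary database with an ALTER DATABASE statement; for example, to choose maximum availability:

```sql
-- Choose the protection mode on the primary database.
-- For MAXIMIZE PROTECTION, the database must be mounted (not open)
-- when the statement is issued.
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;

-- Verify the current mode
SELECT PROTECTION_MODE FROM V$DATABASE;
```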
See Also:Oracle Data Guard Concepts and Administration for more information about data protection modes and for information about setting the data protection mode
To determine the correct data protection mode for your application, ask the questions in Table 2-1.
Is data loss acceptable if the primary site fails?
Yes: Use any protection mode.
No: Use maximum protection or maximum availability modes.
How much data loss is tolerated if a site is lost?
None: Use maximum protection or maximum availability modes.
Some: Use maximum performance mode with the LGWR ASYNC attributes.
Is potential data loss between the production and the standby databases tolerated when a standby host or network connection is temporarily unavailable?
Yes: Use maximum performance or maximum availability modes.
No: Use maximum protection mode, or use maximum availability with multiple standby databases.
How far away should the disaster-recovery site be from the primary site?
The distance between sites and the network infrastructure between the sites determine the network latency and bandwidth, and therefore the protection mode that can be used. In general, latency increases and bandwidth reduces with distance.
For a low-latency, high bandwidth network, use maximum protection or maximum availability. In this case, the performance impact is minimal, and you can achieve zero data loss.
For a high-latency network, use maximum performance mode with the ASYNC redo transport option.
What is the current or proposed network bandwidth and latency between sites?
Bandwidth must be greater than maximum redo generation rate. A guideline for two-way communication is for bandwidth to be 50 percent of the stated network capacity, but you also need to consider the network usage of other applications.
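As a hypothetical worked example of this guideline: if AWR reports a peak redo generation rate of 4 MB/sec, the minimum link capacity could be estimated as:

```
peak redo rate        = 4 MB/sec ≈ 32 Mbit/sec
usable fraction       = 50% of stated network capacity
minimum link capacity = 32 / 0.50 = 64 Mbit/sec
                        (plus headroom for other applications' traffic)
```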
When running in maximum protection mode, consider using multiple standby databases. In maximum protection mode, when the standby host or a network connection is temporarily unavailable, the primary database continues to retry connecting to the standby database for the number of seconds specified by the
NET_TIMEOUT attribute of the
LOG_ARCHIVE_DEST_n initialization parameter. The primary database preserves zero data loss during this time period. When it is over, the primary database proceeds with subsequent transactions. By configuring multiple standby databases, the primary database transactions are not interrupted, assuming that the primary database can communicate with at least one standby database that can satisfy the protection mode requirements.
In many cases, logical standby databases can be used for reporting and data protection and recovery. However, if the logical standby database schema requires additional indexes or changes to optimize reporting functions, then it is recommended to have a separate physical standby database to maintain a consistent copy of the primary database as well.
When you use multiple standby databases, consider hosting each one in a different geographic location so that a network outage or natural disaster does not affect multiple standby databases. For example, host one standby database local to the primary database and another standby database at a remote location.
This section discusses the following configuration best practices for Data Guard:
Enable Flashback Database on both the primary and standby database so that the old primary database can be easily reinstated as a new standby database following a failover. If there is a failure during the switchover process, then it can easily be reversed when Flashback Database is enabled.
See Also:The "Enable Flashback Database" best practice in Section 2.2, "Configuring Oracle Database 10g" for more information about Flashback Database and for information about enabling Flashback Database
When the production database is in
FORCE LOGGING mode, all database data changes are logged.
FORCE LOGGING mode ensures that the standby database remains consistent with the production database. If this is not possible because you require the load performance with
NOLOGGING operations, then you must ensure that the corresponding physical standby datafiles are subsequently synchronized. The physical standby datafiles can be synchronized either by applying an incremental backup created from the primary database or by replacing the affected standby datafiles with a backup of the primary datafiles taken after the NOLOGGING operations. Before the file transfer, the physical standby database must stop recovery.
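One way to synchronize a physical standby after NOLOGGING operations is an RMAN incremental backup taken from the SCN at which the standby datafiles diverged; the SCN, format string, and tag below are hypothetical:

```
RMAN> BACKUP INCREMENTAL FROM SCN 1234567 DATABASE
      FORMAT '/tmp/standby_sync_%U' TAG 'NOLOGGING_REPAIR';
```

The resulting backup pieces are then transferred to the standby host and applied there with RMAN while recovery is stopped.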
For logical standby databases, when SQL Apply encounters a redo record for an operation performed with the
NOLOGGING clause, it skips over the record and continues applying changes from later records. Later, if an attempt is made to access one of the records that was updated with
NOLOGGING in effect, the following error is returned:
ORA-01403: no data found. To recover after the
NOLOGGING clause is specified for a logical standby database, re-create one or more tables from the primary database, as described in Oracle Data Guard Concepts and Administration in Section 9.4.6 "Adding or Re-Creating Tables On a Logical Standby Database."
You can enable force logging immediately by issuing an
ALTER DATABASE FORCE LOGGING statement. If you specify
FORCE LOGGING, then Oracle waits for all ongoing unlogged operations to finish.
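For example, force logging can be enabled and verified as follows:

```sql
-- Enable force logging; this statement waits for any in-flight
-- unlogged operations to finish before returning
ALTER DATABASE FORCE LOGGING;

-- Verify the setting (expect YES)
SELECT FORCE_LOGGING FROM V$DATABASE;
```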
Use Data Guard Broker to create, manage, and monitor a Data Guard configuration. The benefits of using Data Guard Broker include:
Integration with RAC
Data Guard Broker is integrated with CRS so that database role changes occur smoothly and seamlessly. This is especially apparent in the case of a planned role switchover (for example, when a physical standby database is directed to take over the primary role while the former primary database assumes the role of standby). Data Guard Broker and CRS work together to temporarily suspend service availability on the primary database, accomplish the actual role change for both databases during which CRS works with Data Guard Broker to properly restart the instances as necessary, and then resume service availability on the new primary database. Data Guard Broker manages the underlying Data Guard configuration and its database roles while CRS manages service availability that depends upon those roles. Applications that rely on CRS for managing service availability will see only a temporary suspension of service as the role change occurs in the Data Guard configuration.
Automated creation of a Data Guard configuration
Oracle Enterprise Manager provides a wizard that automates the complex tasks involved in creating a Data Guard Broker configuration, including:
Adding an existing standby database, or a new standby database created from existing backups taken through Enterprise Manager
Configuring the standby control file, server parameter file, and datafiles
Initializing communication with the standby databases
Creating standby redo log files
Enabling Flashback Database if you plan to use fast-start failover
Although the Data Guard command-line interface (DGMGRL) cannot automatically create a new standby database, you can use DGMGRL commands to configure and monitor an existing standby database, including those created using Enterprise Manager.
Simplified switchover and failover operations
Data Guard Broker simplifies switchovers and failovers by allowing you to invoke them using a single click in Oracle Enterprise Manager or a single command at the DGMGRL command-line interface (referred to in this documentation as manual failover). For lights-out administration, you can enable fast-start failover to allow Data Guard Broker to determine if a failover is necessary and to initiate the failover to a pre-specified target standby database automatically, with no need for DBA intervention and with no loss of data.
Fast-start failover allows you to increase availability with less need for manual intervention, thereby reducing management costs. Manual failover gives you control over exactly when a failover occurs and to which target standby database. Regardless of the method you choose, Data Guard Broker coordinates the role transition on all databases in the configuration.
Built-in monitoring and alert and control mechanisms
Data Guard Broker provides built-in validation that monitors the health of all databases in the configuration. From any system in the configuration connected to any database, you can capture diagnostic information and detect obvious and subtle problems quickly with centralized monitoring, testing, and performance tools. Both Enterprise Manager and DGMGRL retrieve a complete configuration view of the progress of redo transport services on the primary database and the progress of Redo Apply or SQL Apply on the standby database.
The ability to monitor local and remote databases and respond to events is significantly enhanced by the Data Guard Broker health check mechanism and by its tight integration with the Oracle Enterprise Manager event management system.
Each database uses a flash recovery area.
The production instances archive remotely to only one apply instance.
Table 2-2 describes the recommendations for a robust archiving strategy when managing a Data Guard configuration through SQL*Plus. All of the following items are handled automatically when the Data Guard Broker is managing a configuration.
Archiving must be started on the primary database
Maintaining a standby database requires archiving to be enabled and started on the primary database.
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE ARCHIVELOG;
SQL> ALTER DATABASE OPEN;
Remote archiving must be enabled.
Use a consistent log format (
LOG_ARCHIVE_FORMAT). If the flash recovery area is used, then this format is ignored.
Local archiving is done first by the archiver process (
ARCn). The default setting of LOG_ARCHIVE_LOCAL_FIRST=TRUE means that redo is archived to the local destination before it is transmitted to remote destinations.
Remote archiving should be done to only one standby instance and node for each standby RAC database.
All production instances archive to one standby destination, using the same net service name. Oracle Net Services connect-time failover is used if you want to automatically switch to the secondary standby host when the primary standby instance has an outage.
If the archives are accessible from all nodes because ASM or some other shared file system is being used for the flash recovery area, then remote archiving can be spread across the different nodes of a standby RAC database.
The standby archive destination should use the flash recovery area.
For simplicity, the standby archive destination (
STANDBY_ARCHIVE_DEST) should use the flash recovery area.
For a logical standby database,
Specify role-based destinations with the VALID_FOR attribute of the LOG_ARCHIVE_DEST_n initialization parameter.
The following example illustrates the recommended initialization parameters for a primary database communicating to a physical standby database. There are two instances,
SALES1 and SALES2, running in maximum protection mode.
*.DB_RECOVERY_FILE_DEST=+RECO
*.LOG_ARCHIVE_DEST_1='SERVICE=SALES_stby LGWR SYNC AFFIRM
     NET_TIMEOUT=30 REOPEN=300
     VALID_FOR=(ONLINE_LOGFILES,ALL_ROLES)
     DB_UNIQUE_NAME=SALES_stby'
*.LOG_ARCHIVE_DEST_STATE_1='ENABLE'
The flash recovery area must be accessible to any node within the cluster and use a shared file system technology such as automatic storage management (ASM), a cluster file system, a global file system, or high-availability network file system (HA NFS). You can also mount the file system manually to any node within the cluster very quickly. This is necessary for recovery because all archived redo log files must be accessible on all nodes.
On the standby database nodes, recovery from a different node is required when the node currently running the standby apply fails and cannot be restarted. In that case, any of the existing standby instances residing on a different node can initiate managed recovery. In the worst case, when the standby archived redo log files are inaccessible, the new managed recovery process (MRP) or logical standby process (LSP) on the different node fetches the archived redo log files directly from the production nodes by using the FAL (fetch archive log) server.
When configuring hardware vendor shared file system technology, verify the performance and availability implications. Investigate the following issues before adopting this strategy:
Is the shared file system accessible by any node regardless of the number of node failures?
What is the performance impact when implementing a shared file system?
Is there any effect on the interconnect traffic?
# of SRLs = sum of all production online log groups for each thread + number of threads
For example, if a primary database has two instances (threads) and each thread has four online log groups, then there should be ten SRLs. Having one more standby log group for each thread than the number of the online redo log groups for the production database reduces the likelihood that the LGWR for the production instance is blocked because an SRL cannot be allocated on the standby.
The following are additional guidelines for creating SRLs:
Create the same number of SRLs for both production and standby databases
All online redo logs and SRLs for both production and standby databases should be the same size
SRLs should exist on both production and standby databases
SRLs should be created in the Data Area protected through ASM or external redundancy
In a RAC environment, SRLs must be on a shared disk
In a RAC environment, assign the SRL to a thread when the SRL is created
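Following the formula above, a two-thread primary with four online log groups per thread would have ten SRLs, five per thread. A sketch for the first thread follows (group numbers, sizes, and the +DATA disk group are hypothetical; SRL sizes must match the online redo log size):

```sql
-- Five standby redo log groups for thread 1; repeat with different
-- group numbers for thread 2
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
  GROUP 11 ('+DATA') SIZE 100M,
  GROUP 12 ('+DATA') SIZE 100M,
  GROUP 13 ('+DATA') SIZE 100M,
  GROUP 14 ('+DATA') SIZE 100M,
  GROUP 15 ('+DATA') SIZE 100M;
```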
Appendix A, "Database SPFILE and Oracle Net Configuration File Samples" provides detailed examples of parameter settings, including SPFILE samples and Oracle Net configuration files.
This section discusses best practices for planning and implementing redo transport services for Data Guard.
This section contains these topics:
Oracle recommends that you conduct a performance assessment with your proposed network configuration and current (or anticipated) peak redo rate. The network impact between the production and standby databases, and the impact on the primary database throughput, needs to be understood. Because the network between the production and standby databases is essential for the two databases to remain synchronized, the infrastructure must have the following characteristics:
Sufficient bandwidth to accommodate the maximum redo generation rate
Minimal latency to reduce the performance impact on the production database
Multiple network paths for network redundancy
The required bandwidth of a dedicated network connection is determined by the maximum redo rate of the production database and the actual network efficiency. Depending on the data protection mode, there are other recommended practices and performance considerations. Maximum protection mode and maximum availability mode require the SYNC transport. Maximum performance mode uses the ASYNC transport option or the archiver (ARCn) to transmit redo to the standby database.
When you compare maximum protection mode or maximum availability mode (with SYNC operations) with maximum performance mode (with ASYNC operations), measure whether performance or throughput degrades due to the incurred latency, and check whether the new throughput and response time are within your application performance requirements. Distance and network configuration directly influence latency, and high latency can reduce your potential transaction throughput and increase response time. The network configuration, the number of repeaters, the overhead of protocol conversions, and the number of routers also affect the overall network latency and transaction response time.
See Also:Section 2.4.2, "Data Protection Mode"
When sending redo to a standby database using the SYNC attribute, a transaction on the primary database does not return commit complete to the foreground until the redo associated with that transaction has been written both locally and remotely. The commit time when using LGWR SYNC is directly affected by the network latency and bandwidth, as well as by the I/O capacity on the standby database. The total commit time comprises the primary database's local write (log file parallel write) plus the factors captured through the LNS wait event on SENDREQ: network time + standby write (RFS write, obtained from the V$SYSTEM_EVENT view on the standby database) + network acknowledgment. How much the primary database is affected depends on the application profile; in general, batch updates with infrequent commits and message queuing applications may not observe any noticeable difference.
When sending redo to a standby database using the ASYNC attribute, the effect on primary database throughput is minimal because of the truly asynchronous behavior of ASYNC redo transport, and there is little effect on primary database throughput (redo bytes per second) even as network latency increases. With the ASYNC attribute, the log writer (LGWR) process writes to the local online redo log file, while the network server (LNSn) processes (one for each destination) read redo from the online redo log and asynchronously transmit it to the remote destinations. The LGWR process continues processing requests without waiting for the LNS network I/O to complete. If redo transport services transmit redo data to multiple remote destinations, then the LNSn processes initiate the network I/O to all of the destinations in parallel.
Figure 2-3 shows the LNSn process collecting redo data from the online redo log files and transmitting it over Oracle Net to the RFS process on the standby database.
For remote destinations that are serviced by the ARCH process, you can configure the remote transfer to use multiple streams for more efficient use of network resources. Configure multiple streams by setting the MAX_CONNECTIONS attribute of the LOG_ARCHIVE_DEST_n initialization parameter to a value greater than 1. The value determines the maximum number of network connections used to perform the remote archival; the maximum is five streams for each remote destination.
Because the network connections used in the multiple streams are performed by the ARCH process, take care when setting the LOG_ARCHIVE_MAX_PROCESSES initialization parameter. The values of both the LOG_ARCHIVE_MAX_PROCESSES and PARALLEL_MAX_SERVERS initialization parameters must be at least one greater than the total number specified for MAX_CONNECTIONS for all remote destinations.
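The sizing rule above can be sketched as a small calculation; the destination values are hypothetical.

```python
# Sketch of the sizing rule: LOG_ARCHIVE_MAX_PROCESSES and PARALLEL_MAX_SERVERS
# must each be at least one greater than the total MAX_CONNECTIONS across all
# remote destinations. The per-destination values below are illustrative.

def min_required(max_connections_per_dest):
    """Minimum value for LOG_ARCHIVE_MAX_PROCESSES and PARALLEL_MAX_SERVERS."""
    return sum(max_connections_per_dest) + 1

# Two remote destinations, each configured with MAX_CONNECTIONS=5 (the maximum)
print(min_required([5, 5]))  # 11
```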
The following sections include best practices for network configuration and for achieving the highest network redo rates:
Set the TCP send and receive socket buffer sizes to twice the bandwidth-delay product (BDP) of the network link between the primary and standby systems. The BDP is the product of the network bandwidth and latency. Set the socket buffer sizes using the Oracle Net parameters RECV_BUF_SIZE and SEND_BUF_SIZE, so that the settings affect only Oracle TCP connections. The operating system may impose limits on the socket buffer size that must be adjusted so Oracle can use larger values. For example, on Linux, the net.core.rmem_max and net.core.wmem_max parameters limit the socket buffer size and must be set larger than RECV_BUF_SIZE and SEND_BUF_SIZE.
For example, if the bandwidth is 622 Mbits/sec and the latency is 30 ms, then you should set RECV_BUF_SIZE and SEND_BUF_SIZE as follows:
SEND_BUF_SIZE = 2 x 622,000,000 / 8 x 0.030 = 4,665,000 bytes
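The arithmetic above can be reproduced with a short calculation; bandwidth is in bits per second and latency in seconds.

```python
# Socket buffer size = 2 x bandwidth-delay product (BDP).
# bandwidth_bps: link bandwidth in bits per second; latency_s: latency in seconds.

def socket_buffer_bytes(bandwidth_bps, latency_s):
    """Recommended SEND_BUF_SIZE / RECV_BUF_SIZE in bytes (2 x BDP)."""
    bdp_bytes = bandwidth_bps / 8 * latency_s  # convert bits to bytes, then x latency
    return round(2 * bdp_bytes)

# The worked example from the text: 622 Mbits/sec bandwidth, 30 ms latency.
print(socket_buffer_bytes(622_000_000, 0.030))  # 4665000
```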
See Also:Primary Site and Network Configuration Best Practices at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
With Oracle Net Services, you can control data transfer by adjusting the size of the Oracle Net session data unit (SDU). Setting SDU to its maximum value of 32767 can improve performance. SDU designates the size of the Oracle Net buffer used to collect data before it is delivered to the TCP network layer for transmission across the network. Oracle internal testing of Oracle Data Guard has demonstrated that the maximum setting of 32767 performs best; the gain in performance results from the reduced number of system calls required to pass the data from Oracle Net buffers to the operating system TCP network layer. SDU can be set on a per connection basis with the SDU parameter in the local naming configuration file (tnsnames.ora) and the listener configuration file (listener.ora), or SDU can be set for all Oracle Net connections with the profile parameter DEFAULT_SDU_SIZE in the sqlnet.ora file.
See Also:Oracle Database Net Services Reference for more information about the SDU and DEFAULT_SDU_SIZE parameters
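A per-connection SDU setting might look like the following tnsnames.ora fragment; the net service name, host, and service name are illustrative.

```
standby_db =
  (DESCRIPTION =
    (SDU = 32767)
    (ADDRESS = (PROTOCOL = TCP)(HOST = standby-host)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = standby_db))
  )
```

Alternatively, set DEFAULT_SDU_SIZE = 32767 in sqlnet.ora to apply the maximum SDU to all Oracle Net connections.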
Set the LOG_ARCHIVE_MAX_PROCESSES and PARALLEL_MAX_SERVERS initialization parameters to accommodate the MAX_CONNECTIONS attribute setting for the LOG_ARCHIVE_DEST_n initialization parameter. The LOG_ARCHIVE_MAX_PROCESSES and PARALLEL_MAX_SERVERS initialization parameters are related to the MAX_CONNECTIONS attribute and affect the actual number of ARCn processes used by an instance. For example, if the total number of connections specified by the MAX_CONNECTIONS attributes on all destinations exceeds the value of LOG_ARCHIVE_MAX_PROCESSES, then Data Guard uses as many ARCn processes as possible, but the number of connections may be fewer than what is specified by MAX_CONNECTIONS.
Allow for sufficient I/O bandwidth to account for the additional read I/O operations caused by enabling ASYNC redo transmission. When the ASYNC attribute is specified on the LOG_ARCHIVE_DEST_n initialization parameter, the log writer process writes to the local online redo log file with no additional writes to an ASYNC buffer, as was done in Oracle Database 10g Release 1 (10.1). In Oracle Database 10g Release 2 (10.2), after LGWR has completed writing to the online log, the LNSn process reads the change vectors from the online redo log and ships the redo to the standby database. With this new approach, the LGWR process writes are completely decoupled from LNSn network writes. Because the LNS process now performs a disk read against the online redo log, which it did not do in previous releases, take care that sufficient disk I/O bandwidth exists.
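A destination using ASYNC transport might be configured as in the following sketch; the service name and DB_UNIQUE_NAME values are illustrative.

```sql
-- LGWR ASYNC: LGWR writes only the local online redo log; the LNS process
-- reads that log and ships redo to the standby asynchronously.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
  'SERVICE=standby_db LGWR ASYNC
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
   DB_UNIQUE_NAME=standby_db' SCOPE=BOTH;
```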
This section discusses the best practices for Data Guard log apply services for both physical and logical standby databases.
This section contains these topics:
To use Oracle Data Guard Redo Apply with a physical standby database, or to use any media recovery operation effectively, you need to tune your database recovery by following these best practices:
Maximize I/O rates on standby redo logs and archived redo logs.
Measure the read I/O rates on the standby redo log and archived redo log directories. Concurrent writing of shipped redo on a standby database might reduce the redo read rate due to I/O saturation. The overall recovery rate is always bounded by the rate at which redo can be read, so ensure that the redo read rate surpasses your required recovery rate.
Assess recovery rate.
To obtain a history of recovery rates, use the following query:
SELECT * FROM V$RECOVERY_PROGRESS;
If the reported apply rate is greater than the maximum redo generation rate at the primary database, or twice the average generation rate at the primary database, then no tuning is required; otherwise, follow the tuning tips below. The redo generation rate for the primary database can be monitored from Grid Control or extracted from AWR reports under the redo size statistic.
If the checkpoint time for each log is greater than ten seconds, then investigate tuning I/O and checkpoints.
Use defaults for DB_BLOCK_CHECKING and DB_BLOCK_CHECKSUM.
Setting DB_BLOCK_CHECKING to FULL may reduce the recovery rate. Block checking is always recommended on the primary database and might be enabled on the standby database if the recovery rate meets expectations.
The default setting for DB_BLOCK_CHECKSUM is TYPICAL. Block checksums should always be enabled for both primary and standby databases; they catch most block corruption while incurring negligible overhead.
Increase PARALLEL_EXECUTION_MESSAGE_SIZE. Increasing this parameter to 4096 may improve recovery by as much as twenty percent over the default setting of 2152. The message size parameter is used by parallel query operations, so there must be sufficient shared pool to support this increase.
Increase DB_CACHE_SIZE. Having a large database cache size can improve media recovery performance by reducing the number of physical data block reads. Because media recovery does not require DB_RECYCLE_CACHE_SIZE or a large SHARED_POOL_SIZE, that memory can be reallocated to DB_CACHE_SIZE.
Prior to converting the standby database into a primary database, reset these parameters to the primary database settings.
Assess database waits
You can determine the top system and session wait events by querying the standby database's V$SYSTEM_EVENT and V$EVENT_HISTOGRAM views and looking for the largest TIME_WAITED values. You may have to capture multiple snapshots of the query results and manually extract the difference to accurately assess a certain time period, because there is no equivalent AWR report for physical standby databases.
If recovery is applying a lot of redo efficiently, the system will be I/O bound and the I/O wait should be reasonable for your system. The following table shows the top recovery-related waits that you may observe and the tuning tips appropriate for each wait type. Apply the tuning tips only if the recovery events are in the top ten waits.
|Wait Name||Description||Tuning Tips|
|log file sequential read||Coordinator (recovery session or MRP process) wait for log file read I/O||Tune log read I/O|
| ||Coordinator synchronous wait for slave (wait for checkpoints)||Tune datafile write I/O, increase DBWR processes, increase primary/standby redo log size|
| ||Coordinator streaming wait for slave (wait for apply)||Increase PARALLEL_EXECUTION_MESSAGE_SIZE|
|free buffer waits||Foreground waiting for an available free buffer in the buffer cache||Increase DB_CACHE_SIZE|
|db file sequential read||Wait for data block read||Tune datafile read I/O|
| ||Coordinator wait for file header read at log boundary checkpoint||Tune datafile read I/O|
| ||Coordinator wait for file header write at log boundary checkpoint||Tune datafile write I/O|
|checkpoint completed||Wait for checkpoint completed||Tune datafile write I/O, increase number of DBWR processes|
| ||Wait for data block read||Tune file read I/O|
DBWR must write out modified blocks from the buffer cache to the data files. Always use native asynchronous I/O by setting DISK_ASYNCH_IO to TRUE (the default). In the rare case that asynchronous I/O is not available, use DBWR_IO_SLAVES to improve the effective data block write rate with synchronous I/O.
Ensure that you have sufficient I/O bandwidth and that I/O response time is reasonable for your system either by doing some base I/O tests, comparing the I/O statistics with those for the primary database, or by looking at some historical I/O metrics. Be aware that I/O response time may vary when many applications share the same storage infrastructure such as with a Storage Area Network (SAN) or Network Attached Storage (NAS).
Use system commands such as UNIX vmstat, or system monitoring tools, to assess system resources. Alternatively, refer to Enterprise Manager, AWR reports, or performance views such as V$OSSTAT and V$SYSMETRIC_HISTORY.
If there are I/O bottlenecks or excessive wait I/O operations, then investigate operational or application changes that increased the I/O volume. If the high waits are due to insufficient I/O bandwidth, then add more disks to the relevant ASM disk group. Verify that this is not a bus or controller bottleneck or any other I/O bottleneck. The read I/O from the standby redo log should be greater than the expected recovery rate.
Check for excessive swapping or memory paging.
Check to ensure the recovery coordinator or MRP is not CPU bound during recovery.
Increase log group size for the primary and standby databases.
Increase the online redo log size for the primary database and the standby redo logs size for the standby database to a minimum of 1 GB. Oracle Database does a full checkpoint and updates all the file headers (in an optimized manner) at each log file boundary during media recovery. To reduce the frequency of a full database checkpoint and updating all the file headers, increase the log group size so that a log switch is occurring at a minimum of 15 minute intervals.
If real-time apply is being used and redo is being sent synchronously or asynchronously by way of the LGWR process, then there is no additional data loss risk with this change. If archiver is sending the redo or the primary database is converting to ARCH mode due to heavy load, then you must balance faster recovery rates and higher data loss risk.
To ensure that the crash recovery time for the primary database is minimized even with very large redo group sizes, set
FAST_START_MTTR_TARGET to a nonzero value to enable fast-start fault recovery. If it is currently not set, then set it to 3600. This initialization parameter is relevant only for the primary database.
Assess different degrees of recovery parallelism
Parallel recovery is enabled by default for media and crash recovery with the default degree of parallelism set to the number of CPUs available. In most cases this is the optimal setting. However, in some circumstances faster recovery may be obtained by using a degree of parallelism that is different (higher or lower) than the default. To override the default setting, explicitly specify it as follows:
RECOVER MANAGED STANDBY DATABASE PARALLEL <#>;
See Also:Oracle Database 10g Release 2 Best Practices: Data Guard Redo Apply and Media Recovery at
This section discusses recommendations for Data Guard SQL Apply and logical standby databases.
This section contains these topics:
If the logical standby database is being used as a reporting or decision-support system, then increase the value of the MAX_SERVERS SQL Apply parameter to reserve parallel query slaves for such operations. Because SQL Apply by default uses all of the parallel query slaves, set the MAX_SERVERS parameter as shown in the following equation to reserve a specific number of parallel query slaves:
MAX_SERVERS = <current MAX_SERVERS setting> + <PQ slaves needed for reporting and decision-support operations>
Oracle recommends that you initially set the MAX_SERVERS parameter to either 9 or 3 + (3 x CPU), whichever value is larger.
If the current setting of the PARALLEL_MAX_SERVERS initialization parameter has not already been adjusted to account for the MAX_SERVERS processes, increase it by the value of the MAX_SERVERS parameter on both the primary and standby database instances:
PARALLEL_MAX_SERVERS = current value + MAX_SERVERS value
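The two sizing formulas above can be sketched as follows; the CPU count and current PARALLEL_MAX_SERVERS value are hypothetical.

```python
# Initial MAX_SERVERS guidance: the larger of 9 and 3 + (3 x CPU count).
def initial_max_servers(cpu_count):
    return max(9, 3 + 3 * cpu_count)

# PARALLEL_MAX_SERVERS grows by the MAX_SERVERS value on both databases.
def new_parallel_max_servers(current_pms, max_servers):
    return current_pms + max_servers

ms = initial_max_servers(4)              # 3 + 12 = 15, which exceeds 9
print(ms)                                # 15
print(new_parallel_max_servers(10, ms))  # 25
```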
For a reporting or decision-support system, the PRESERVE_COMMIT_ORDER SQL Apply parameter should be set to TRUE, except when the standby database has fallen behind the primary database. Temporarily set the parameter to FALSE while SQL Apply is catching the standby database up to the primary database, and reset it to TRUE after the gap has been resolved.
Database objects that do not require replication to the standby database should be skipped by using the
DBMS_LOGSTDBY.SKIP procedure. Skipping such objects reduces the processing of SQL Apply.
Set the LOG_AUTO_DELETE SQL Apply parameter by running the DBMS_LOGSTDBY.APPLY_SET procedure. The LOG_AUTO_DELETE parameter controls whether SQL Apply automatically deletes archived redo log files sent from the primary database once they have been applied on the logical standby database. Set this parameter to TRUE to enable automatic deletion of archived redo log files, or to FALSE to disable automatic deletion. The default value is TRUE.
See Also:Oracle Database PL/SQL Packages and Types Reference and Oracle Data Guard Concepts and Administration for more information on the DBMS_LOGSTDBY package
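Both recommendations can be sketched as follows; the schema name is illustrative, and SQL Apply must be stopped before apply parameters are changed.

```sql
-- Stop SQL Apply before changing apply parameters.
ALTER DATABASE STOP LOGICAL STANDBY APPLY;

-- Skip DML for objects that need not be replicated (illustrative schema).
EXECUTE DBMS_LOGSTDBY.SKIP(stmt => 'DML', schema_name => 'SCRATCH', object_name => '%');

-- Retain applied archived logs instead of deleting them (the default is TRUE).
EXECUTE DBMS_LOGSTDBY.APPLY_SET('LOG_AUTO_DELETE', 'FALSE');

-- Restart SQL Apply with real-time apply.
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
```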
With proper planning and execution, Data Guard role transitions can effectively minimize downtime and ensure that the database environment is restored with minimal impact on the business. Whether using physical standby or logical standby databases, MAA testing has determined that switchover and failover times with Oracle Data Guard 10g release 2 have been reduced to seconds. This section describes best practices for both switchover and failover.
This section contains these topics:
A database switchover performed by Oracle Data Guard is a planned transition that includes a series of steps to switch roles between a standby database and a production database. Following a successful switchover operation, the standby database assumes the production role and the production database becomes a standby database. In a RAC environment, a switchover requires that only one instance be active for each database, production and standby.
At times the term switchback is also used within the scope of database role management. A switchback operation is a subsequent switchover operation to return the roles to their original state.
Data Guard enables you to change these roles dynamically by:
Using Oracle Enterprise Manager, as described in "Using Enterprise Manager to Perform a Data Guard Switchover"
Using the Oracle Data Guard Broker command-line interface
Issuing SQL statements, as described in "Using SQL for Data Guard Switchover to a Physical Standby Database" and "Using SQL for Data Guard Switchover to a Logical Standby Database"
See Also:Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line to perform database switchover
To optimize switchover processing, use the following best practices and see the Oracle Database 10g Release 2 Best Practices: Data Guard Switchover and Failover white paper available at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
For logical standby databases:
See Oracle Database 10g Release 2 Best Practices: Data Guard SQL Apply at
http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm to obtain an optimal SQL Apply rate.
Verify that the LogMiner Multiversioned Data Dictionary was received by the primary database by querying the SWITCHOVER_STATUS column of the V$DATABASE fixed view on the primary database. When the query returns the TO LOGICAL STANDBY value, you can proceed with the switchover. See the discussion about "Switchovers Involving a Logical Standby Database" in Oracle Data Guard Concepts and Administration.
For physical standby databases:
See Oracle Database 10g Release 2 Best Practices: Data Guard Redo Apply and Media Recovery at
http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm to obtain an optimal Redo Apply rate.
When transitioning from read-only mode to Redo Apply (recovery) mode, restart the database.
Use real-time apply so that redo data is applied to the standby database as soon as it is received.
To enable real-time apply for a physical standby database, use the following SQL statement:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT USING CURRENT LOGFILE;
To enable real-time apply for a logical standby database, use the following SQL statement:
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
Enable Flashback Database so that if a failure occurs during the switchover, the process can be easily reversed.
Failover is the operation of taking the production database offline at one site and bringing one of the standby databases online as the new production database. A failover operation can be invoked when a catastrophic failure occurs on the production database, and there is no possibility of recovering the production database in a timely manner.
With Data Guard the process of failover can be completely automated using fast-start failover, or it can be user driven. Fast-start failover eliminates the uncertainty inherent in a process that requires manual intervention. It automatically executes a zero data loss failover within seconds of an outage being detected.
Oracle recommends that you use fast-start failover. The initial MAA tests running Oracle Database 10g Release 2 (10.2) show that failovers performed using the Data Guard Broker and fast-start failover offer a significant improvement in availability.
See Also:Oracle Database 10g Release 2 Best Practices: Data Guard Fast-Start Failover at
http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm for a comprehensive review of Oracle failover best practices
Manual failover allows for a failover process where decisions are user driven. Manual failover can be accomplished by:
Issuing SQL statements, as described in "Using SQL to Fail Over to a Physical Standby Database"
Using Oracle Enterprise Manager, as described in "Using Enterprise Manager to Perform a Data Guard Failover"
Using the Oracle Data Guard Broker command-line interface (DGMGRL)
See Also:Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line for database failover
This section contains these topics:
Fast-start failover can be used only in a Data Guard Broker configuration, and can be configured only through DGMGRL or the Data Guard management pages in Oracle Enterprise Manager. Fast-start failover also requires that redo transport services be configured for SYNC, and the Data Guard configuration must be in maximum availability mode to achieve the stated guarantee of zero data loss failover. In addition, both the primary and standby databases must have Flashback Database enabled. When enabled, fast-start failover monitors the Data Guard configuration and automatically initiates a failover to the specified target standby database, with no need for DBA intervention and with no loss of data.
The following conditions will trigger a fast-start failover:
Database instance failure (or last instance failure in a RAC configuration)
Shutdown abort (or shutdown abort of the last instance in a RAC configuration)
Datafiles taken offline due to I/O errors
Both the observer (fast-start failover monitoring process) and the standby database lose their network connection to the primary database, and the standby database confirms that it is in a synchronized state.
Following a fast-start failover, the old primary database is automatically reconfigured as a new standby database upon reconnection to the configuration. This enables Data Guard to restore disaster protection in the configuration quickly and easily, returning the database to a protected state as soon as possible.
A Data Guard manual failover is a series of steps to convert a standby database into a production database. The standby database essentially assumes the role of production database. A Data Guard failover is accompanied by an application failover to fail over the users to the new primary database. After the failover, the former production database must be re-created as a new standby database to restore resiliency. The standby database can be quickly re-created by using Flashback Database. See Section 4.3.2, "Restoring a Standby Database After a Failover".
See Also:Oracle Data Guard Broker for more information on failover operations
The following are fast-start failover best practices:
Use real-time apply so that redo data is applied to the standby database as soon as it is received.
Enable Flashback Database to protect from logical failures.
Run the fast-start failover observer process on a host that is not located in the same data center as the primary or standby database.
Ideally, the observer should run on a system that is equally distant from the primary and standby databases. It should connect to the primary and standby databases using the same network as any end-user client. If the designated observer fails, Enterprise Manager can detect the failure and can be configured to restart the observer automatically. If the observer cannot run at a third site, then install it on the same network as the application.
If the standby database has been opened in read-only mode, then restart the database before starting Redo Apply.
Consider configuring multiple standby databases to maintain data protection following a failover.
Set the value of the FastStartFailoverThreshold property according to your configuration characteristics, as described in Table 2-3.
|Configuration||Minimum Recommended Setting|
|Single-instance primary, low latency, and a reliable network||15 seconds|
|Single-instance primary and a high latency network over WAN||30 seconds|
|RAC primary||Reconfiguration time + 30 seconds Foot 1|
Footnote 1 For configurations running Oracle Database software prior to Release 10.2.0.3, use a minimum of RAC miscount + reconfiguration time + 30 seconds
For any of the settings shown in Table 2-3, perform testing to ensure that the fast-start failover threshold is not so aggressive that it will induce false failovers, or so high it does not meet your failover requirements.
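The threshold is set through the broker; a minimal DGMGRL sketch follows, using the 30-second WAN setting from Table 2-3 as an illustrative value.

```
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
DGMGRL> ENABLE FAST_START FAILOVER;
```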
See Also:Oracle Data Guard Broker for the FastStartFailoverThreshold property
Oracle Database 10g Release 2 Best Practices: Data Guard Switchover and Failover at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
A manual failover, which is user-driven, should be used only in case of an emergency and should be initiated due to an unplanned outage such as:
Site disaster that results in the primary database becoming unavailable
User errors that cannot be repaired in a timely fashion
Data failures, including widespread corruption, that impact the production application
A failover requires that the initial production database must be reinstated as a standby database to restore fault tolerance to your environment. The standby database can be quickly reinstated by using Flashback Database. See Section 4.3.2, "Restoring a Standby Database After a Failover".
To optimize failover processing and to maintain high availability, use the following best practices:
Enable Flashback Database to reinstate databases after the failover operation has completed.
Use real-time apply in conjunction with Flashback Database to apply redo data to the standby database as soon as it is received, and to quickly rewind the database should user error or logical corruption be detected.
For logical standby databases, see Oracle Database 10g Release 2 Best Practices: Data Guard SQL Apply at
http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm to obtain an optimal SQL Apply rate.
For physical standby databases:
See Oracle Database 10g Release 2 Best Practices: Data Guard Redo Apply and Media Recovery at
Go directly to the OPEN state from the MOUNTED state instead of restarting the standby database (as was required in previous releases).
Data Guard, Restore Points, and Recovery Manager can be combined to enable a Redo Apply (physical) standby database to be used as a read/write clone database temporarily for testing, reporting, development, or any other use clone technology might have within your organization.
The following are recommendations to consider when using a standby database as a clone:
Evaluate the performance of this approach
When evaluating this method for maintaining a clone database, measure the time needed to refresh the clone database, given the degree to which the clone database will diverge from the primary database. A clone database that is used heavily and diverges significantly from the primary database (more than 15% of the total data blocks changed between the databases) might be synchronized more quickly by simply running RMAN DUPLICATE to copy the entire database from a disk backup.
Also, ensure there are sufficient I/O resources available. The process of flashing back the changes made to the clone database and synchronizing it with the primary by applying an incremental backup is I/O intensive.
Use a separate standby database for disaster protection.
While the standby database is activated as the clone, it is not receiving redo data from the primary database and cannot provide disaster protection. To provide disaster protection while a physical standby database is activated, there should be multiple standby databases running in the Data Guard configuration so that the primary database remains protected while one of the standby databases is activated. Using a second standby database allows the Data Guard configuration to remain in maximum protection or maximum availability mode while the clone database is activated.
Enable block change tracking on the primary database.
The incremental SCN backup of the primary database will be created more quickly with block change tracking enabled. This backup is used to apply all changes that occurred on the primary database to the clone database while it was diverged. See Oracle Database Backup and Recovery Basics for information about enabling block change tracking.
Use RMAN parallelism for creating and applying the incremental backup.
When creating and applying the incremental SCN backup of the primary database, use RMAN parallelism to reduce the create and apply time. See Oracle Database Backup and Recovery Advanced User's Guide for more information about RMAN parallelism.
Create the primary incremental backup and the flashback clone simultaneously.
To reduce the overall synchronization time, create the incremental backup from the primary database at the same time the clone database is being flashed back to the initial restore point.
To use a standby database as a clone, locate a physical standby database that is not being used for disaster protection. The following sections list the general phases and steps, but you can find detailed information about using a standby database as a clone in Oracle Data Guard Concepts and Administration:
Prepare the physical standby database to be activated.
Prepare the primary database to have the physical standby be diverged.
Activate the physical standby database (diverge it from the primary database).
Use the activated clone database for testing.
Flash back the activated database to a physical standby database.
Catch up the standby database to the primary database.
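The activate/flashback cycle in the steps above can be sketched in SQL. This is a hedged outline only: the restore point name is illustrative, and several surrounding steps (deferring redo transport on the primary, creating and applying the incremental SCN backup with RMAN) are omitted; see Oracle Data Guard Concepts and Administration for the complete procedure.

```sql
-- On the standby: stop Redo Apply and mark the divergence point.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
CREATE RESTORE POINT before_clone GUARANTEE FLASHBACK DATABASE;

-- Activate the standby as a read/write clone for testing.
ALTER DATABASE ACTIVATE STANDBY DATABASE;

-- When testing is complete, rewind the clone to the restore point
-- before catching it up to the primary with an incremental backup.
SHUTDOWN IMMEDIATE
STARTUP MOUNT
FLASHBACK DATABASE TO RESTORE POINT before_clone;
```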
In a high-availability environment, nondatabase files must be protected along with database files. Oracle Secure Backup provides data protection for heterogeneous UNIX, Linux, Windows, and Network Attached Storage environments. Additionally, for disaster recovery purposes, some third-party tools enable remote synchronization between a set of local and remote files. For example, you can use tools such as rsync, csync2, and DRBD for remote synchronization. These tools are available for download on the Internet. The following are recommendations regarding these tools:
For software updates, use
rsync to synchronize the standby system with the changes made to software on the primary system.
For configuration files, use rsync daily or after a change, or use csync2 to synchronize multiple nodes.
For important log files, trace files, or debugging files, use rsync daily or hourly, or use DRBD to synchronize the entire file system.
For transaction logs or metadata files that must be synchronized with the database, use
csync2 frequently, or use a block synchronization tool such as
DRBD, a third-party mirroring utility, or another remote synchronization tool.
To accurately assess the primary database performance after adding Data Guard standby databases, obtain a history of statistics from the
V$SYSMETRIC_SUMMARY view or Automatic Workload Repository (AWR) snapshots before and after deploying Data Guard with the same application profile and load.
Application profile can be quickly assessed by comparing the following statistics:
Physical reads per transaction
Physical writes per transaction
CPU usage per transaction
Redo generated per transaction
Application performance can be quickly assessed by comparing the following statistics:
Redo generated per second or redo rate
User commits per second or transactions per second
Database time per second
Response time per transaction
SQL service response time
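For example, a baseline for these statistics can be captured from the V$SYSMETRIC_SUMMARY view before and after deploying Data Guard. The metric names below are a representative subset; verify the exact names in your release against V$METRICNAME:

```sql
-- Compare averages for key throughput and response time metrics
SELECT metric_name, average, maxval
  FROM v$sysmetric_summary
 WHERE metric_name IN ('Redo Generated Per Sec',
                       'User Commits Per Sec',
                       'Database Time Per Sec',
                       'Response Time Per Txn',
                       'SQL Service Response Time')
 ORDER BY metric_name;
```

Run the same query under the same application profile and load in both configurations, and compare the results side by side.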
If the application profile has changed between the two scenarios, then this is not a fair comparison. Repeat the test or tune the database or system with the general principles outlined in the Oracle Database Performance Tuning Guide.
If the application profile is similar but you see a drop in application performance on the primary database, shown by a decrease in throughput or an increase in response time, then assess these common problem areas:
If you are experiencing high load (excessive CPU utilization of over 90%, paging and swapping), then you need to tune the system before proceeding with Data Guard. Use the
V$SYSMETRIC_HISTORY view to monitor system usage statistics from the operating system.
Higher I/O waits
If you are experiencing higher I/O waits from
DBWR, then the slower I/O will impact throughput and response time. This can be observed by looking at historical data of the following wait events:
Log file parallel writes
Log file sequential reads
Log file parallel reads
Data file parallel writes
Data file sequential reads
With LGWR SYNC transport, commits take more time because the redo must be guaranteed to be available on the standby database before foreground processes get an acknowledgment from LGWR that the commit has completed. An LGWR commit includes the following wait events:
Log file parallel write (local write for LGWR)
LGWR wait on SENDREQ
This wait event includes:
Time to put the packet into the network
Time to send the packet to the standby database
RFS write or standby write to the standby redo log, which includes the RFS I/O wait event plus additional overhead for checksums
Time to send a network acknowledgement back to the primary database (for example, single trip latency time)
LGWR (log writer) commit times can cause longer response time and lower throughput, especially for small, time-sensitive transactions. However, you may obtain sufficient gains by tuning the LGWR local write (log file parallel write wait event) or the different components that make up the LGWR wait on SENDREQ wait event.
The disk write I/O (log file parallel write or the RFS I/O) can be tuned by adding more spindles or increasing the I/O bandwidth. The network time can be reduced by tuning the Oracle Net send and receive buffer sizes, setting
SDU=32K, increasing the network bandwidth if there is saturation, and possibly finding a closer site to reduce the network latency.
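As an illustration, the SDU and buffer sizes can be set in the Oracle Net connect descriptor used for redo transport. The host name, service name, and buffer sizes below are placeholders; size the buffers according to your bandwidth-delay product:

```
# tnsnames.ora entry on the primary pointing at the standby (sketch)
standby_db =
  (DESCRIPTION =
    (SDU = 32767)
    (SEND_BUF_SIZE = 262144)
    (RECV_BUF_SIZE = 262144)
    (ADDRESS = (PROTOCOL = TCP)(HOST = standby_host)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = standby_srv)))
```

A matching SDU setting is also needed on the standby side (for example, in the listener configuration) for the larger session data unit to take effect.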
With LGWR ASYNC transport and ARCH transport, LGWR never waits for LNSn or ARCH before writing its commit record to the current log file; however, both the LNSn and archiver processes read the online redo logs, causing more I/O contention and possibly longer wait times for LGWR writes (log file parallel writes). If I/O bandwidth and sufficient spindles are not allocated, then you will see higher log file parallel write and log file sequential read times, which may affect throughput and response time. In most cases, adding sufficient spindles reduces the I/O latency.
Note: To enable most of the new statistical gathering and advisors, ensure the
STATISTICS_LEVEL initialization parameter is set to TYPICAL (the default) or ALL.
Oracle Database Performance Tuning Guide for general performance tuning and troubleshooting best practices
The "Oracle 10g Data Guard: Primary Site and Network Configuration Best Practices" white paper for tuning the transport components. This white paper can be downloaded from the Oracle Technology Network (OTN) Web site.
While it is prudent that every database have a good backup, consider your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) when designing a backup and recovery strategy. While many recoveries involve restoring a backup, Oracle provides other database features, such as Data Guard and Flashback technology, to minimize the recovery time from a database outage.
This section discusses the best practices for maintaining a good database backup, as well as other backup options and strategies made possible by the available Oracle database features.
This section contains the following topics:
Oracle has multiple database features and products to facilitate Backup and Recovery operations, including Recovery Manager (RMAN), Oracle Secure Backup, the flash recovery area, Flashback Database and restore points.
This section contains these topics:
Recovery Manager (RMAN) is Oracle's utility for backing up and recovering the Oracle Database. Because of its tight integration with the database, RMAN automatically determines which files need to be backed up. More importantly, RMAN knows which files need to be restored for media recovery operations. RMAN uses server sessions to perform backup and recovery operations and stores metadata about backups in a repository. RMAN offers many advantages over typical user-managed backup methods, including:
Online database backups without placing tablespaces in backup mode
Data block integrity checks during backup and restore operations
Test backups and restores without actually performing the operation
RMAN automates backup and recovery. User-managed methods require you to locate backups for each datafile, copy them to the correct place using operating system commands, and choose which logs to apply. RMAN manages these tasks automatically.
There are also capabilities of Oracle backup and recovery that are only available when using RMAN, such as automated tablespace point-in-time recovery and block-media recovery.
Oracle Secure Backup provides data protection for heterogeneous UNIX, Linux, Windows and Network Attached Storage (NAS) environments. Oracle Secure Backup provides tape data protection for the entire Oracle Environment:
Oracle Database through integration with RMAN
Seamless support of Oracle Real Application Clusters (RAC)
File system data protection of distributed servers including:
Oracle Application Servers
Oracle Collaboration Suites
Oracle home and binaries
The combination of RMAN and Oracle Secure Backup provides an end-to-end tape backup solution, eliminating the need for third-party backup software.
See Also:The Oracle Secure Backup web site at
Oracle restore points can be used to protect against logical failures at risky junctures during database maintenance. Creating a normal restore point assigns a restore point name to a specific point in time or SCN. The restore point name can be used with Flashback Table and Flashback Database operations. Restore points can be guaranteed to ensure that a Flashback Database operation will succeed in rewinding the database back to the restore point. Guaranteed restore points are recommended for database-wide maintenance such as database or application upgrades, or running batch processes. Guaranteed restore points enable Flashback Database and retain all flashback logs necessary to ensure the database can be flashed back to the restore point. Once maintenance activities complete and results are verified, guaranteed restore points that are no longer needed should be deleted.
See Also:Oracle Database Backup and Recovery Basics for more information on Flashback Database
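As a sketch, a guaranteed restore point can bracket a risky maintenance window as follows; the restore point name is illustrative:

```sql
-- Before maintenance: guarantee the ability to flash back
CREATE RESTORE POINT before_upgrade GUARANTEE FLASHBACK DATABASE;

-- If the maintenance fails, rewind the database (issued in MOUNT state)
FLASHBACK DATABASE TO RESTORE POINT before_upgrade;

-- After results are verified, release the retained flashback logs
DROP RESTORE POINT before_upgrade;
```

Dropping the guaranteed restore point promptly after verification is important, because the retained flashback logs otherwise continue to consume space in the flash recovery area.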
This section contains these topics:
Using backups to resolve an unscheduled outage of a production database may not allow you to meet your Recovery Time Objective (RTO) or SLA. For example, some outages are handled best by using Flashback Database or the standby database. However, some situations require using database backups, including the following:
Recovering from Data Failures Using File or Block Media Recovery When a block corruption, media failure, or other data failure occurs in an environment that does not include Data Guard, the only method of recovery is using an existing backup.
Double Failure Resolution A double failure scenario affects the availability of both the production and standby databases. An example of a double failure scenario is a site outage at the secondary site, which eliminates fault tolerance, followed by a media failure on the production database. The only resolution of this situation is to re-create the production database from an available backup and then re-create the standby database.
Some multiple failures, or more appropriately disasters (such as a primary site outage followed by a secondary site outage), might require the use of backups that exist only in an offsite location. Developing and following a process to deliver and maintain backup tapes at an offsite location, therefore, is necessary to restore service in the most dire of circumstances.
It is important to determine a backup frequency policy and to perform regular backups. A backup retention policy helps ensure that needed data is not destroyed too soon.
Factors Determining Backup Frequency Frequent backups are essential for any recovery scheme. Base the frequency of backups on your estimated recovery time objective for outages that cannot be resolved by Data Guard or Flashback technology. Repair time will be dictated by restore time plus recovery time. The frequency of the backup and the location of the backup will impact both of these factors. The other factor that influences how frequently a datafile is backed up is the rate or frequency of database changes such as:
Addition and deletion of tables
Insertions and deletions of rows in existing tables
Updates to data in tables
To simplify database backup and recovery, the Oracle suggested backup strategy implements the flash recovery area while using incremental backups and updated incremental backup features. For more information about the Oracle suggested strategy, see the section titled "Using the Oracle-Suggested Backup Strategy" in Oracle Database 2 Day DBA.
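The suggested strategy centers on an incrementally updated image copy. A minimal daily script might look like the following sketch, with the tag name as an assumption:

```sql
RUN {
  -- Roll the existing image copy forward with the previous incremental
  RECOVER COPY OF DATABASE WITH TAG 'oracle_suggested';
  -- Take a new level 1 incremental (a level 0 copy is created on the
  -- first run, when no copy with this tag exists yet)
  BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY
    WITH TAG 'oracle_suggested' DATABASE;
}
```

After the first two runs, each daily execution leaves an image copy in the flash recovery area that is never more than a day behind the database.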
Establishing a Backup Retention Policy A backup retention policy is a rule set regarding which backups must be retained (on disk or other backup media) to meet recovery and other requirements. It may be safe to delete a specific backup because it is old enough to be superseded by more recent backups or because it has been stored on tape. You may also have to retain a specific backup on disk for other reasons such as archival requirements. A backup that is no longer needed to satisfy the backup retention policy is said to be obsolete.
Backup retention policy can be based on redundancy or a recovery window. In a redundancy-based retention policy, you specify a number n such that you always keep at least n distinct backups of each file in your database. In a recovery window-based retention policy, you specify a time interval in the past (for example, one week or one month) and keep all backups required to let you perform point-in-time recovery to any point during that window.
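For example, either style of retention policy is configured with a single RMAN command; the values shown are illustrative:

```sql
-- Redundancy-based: always keep at least 2 backups of each file
CONFIGURE RETENTION POLICY TO REDUNDANCY 2;

-- Or recovery window-based: keep all backups needed to recover to
-- any point within the last 7 days
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;
```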
Keeping Long-Term Backups Some businesses require the ability to maintain long-term backups that may be needed years into the future. By using RMAN with the
KEEP option, it is possible to retain backups that are exempt from the retention policy and never expire, providing the capability to restore and recover the database to any desired point in time. It is important that a recovery catalog be used for the RMAN repository so that backup metadata is not lost due to lack of space, which may occur when using the target database control file for the RMAN repository.
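For instance, an archival backup exempt from the retention policy might be created as follows; the tag is illustrative, and KEEP FOREVER requires a recovery catalog:

```sql
-- Long-term backup that never becomes obsolete; LOGS also retains
-- the archived logs needed to make the backup recoverable
BACKUP DATABASE TAG 'year_end_archive' KEEP FOREVER LOGS;
```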
RMAN automatically manages the backup metadata in the control file of the database that is being backed up. To protect and keep backup metadata for long periods of time, the RMAN repository is created in a separate database. This repository is usually referred to as a recovery catalog. The advantages of using a recovery catalog include:
Stores backup information long term
Stores metadata for multiple databases
Enables restoring an available backup onto another system
Another reason to use a recovery catalog is the limited maximum size of the target database control file. If the control file is too small to hold additional backup metadata, then existing backup information is overwritten, making it difficult to restore and recover using those backups.
See Also:Oracle Database Backup and Recovery Advanced User's Guide for more information on RMAN repository
Oracle Database 10g includes a change tracking feature for incremental backups, which improves incremental backup performance by recording changed blocks in each datafile in a change tracking file. If change tracking is enabled, then RMAN uses the change tracking file to identify changed blocks for incremental backup. This avoids the need to scan every block in the datafile, reducing the number of disk reads during backup.
See Also:Oracle Database Backup and Recovery Basics for more information on block change tracking
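Block change tracking is enabled with a single statement; the file location below is a placeholder (with Oracle Managed Files the USING FILE clause can be omitted):

```sql
-- Enable change tracking; incremental backups then read only the
-- blocks recorded as changed
ALTER DATABASE ENABLE BLOCK CHANGE TRACKING
  USING FILE '/u01/oradata/prod/change_tracking.f';

-- Confirm that tracking is active
SELECT status, filename FROM v$block_change_tracking;
```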
RMAN can be configured to automatically back up the control file and server parameter file (SPFILE) whenever the database structure metadata in the control file changes or when a backup record is added. The autobackup enables RMAN to recover the database even if the current control file, catalog, and SPFILE are lost. The RMAN autobackup feature is enabled with the
CONFIGURE CONTROLFILE AUTOBACKUP ON statement.
See Also:Oracle Database Backup and Recovery Basics for more information on autobackup
This section contains these topics:
When selecting a backup mechanism, the following priorities will drive your backup strategy:
Overall backup time
Impact on primary
Space used by backup
Table 2-4 compares different backup alternatives against the different priorities you might have. The table should guide you to choose the best backup approach for your specific business requirements. You might want to minimize backup space while sacrificing recovery time. Alternatively, a higher priority might be on recovery and backup times while space is not an issue.
|Backup Option||Overall Backup Time||Impact on Primary||Space Used by Backup||Recovery Time|
Full data file copy
Full backup sets
Incremental backup with roll forward immediately
Incremental backup with roll forward deferred until recovery
Best Practices on Optimizing Recovery Times If restore time is the primary concern, then either a database copy or an incremental backup with roll forward immediately should be performed. These two are the only options that provide an immediately usable backup of the database, which then needs only to be recovered to the time of the failure using archive logs created since the last backup was performed.
Best Practices on Minimizing Space Usage If space usage is the primary concern, then an incremental backup with a deferred roll forward should be performed. If a cumulative level 1 incremental backup is performed, then it stores only those blocks that have been changed since the last level 0 backup. With a cumulative incremental backup, the last level 1 backup need only be applied to the level 0 backup. With a differential incremental backup, all level 1 backups have to be applied to the level 0 backup. A cumulative incremental backup will initially consume more space in the flash recovery area than a differential incremental backup. But over time the cumulative incremental backup will consume less space.
Best Practices on Minimizing System Resource Consumption (I/O and CPU) If system resource consumption is the primary concern, then an incremental backup with a block change-tracking file will consume the least amount of resources on the database. This is true when the amount of data changed for each backup window is below 20% of the total database size. When the amount of data changed for each backup window exceeds 20%, then performing incremental backups will still reduce the amount of disk space required to hold the backup, but may not reduce the backup time.
For many applications, only a small percentage of the entire database changes each day, even if the transaction rate is very high. Applications frequently modify the same set of blocks, so the total set of changed blocks is small.
For example, consider a database that contains about 600 GB of user data, not including temp files and redo logs. Every 24 hours, approximately 2.5% of the database changes, which is approximately 15 GB of data. The first level 0 backup takes about 180 minutes, a subsequent level 1 backup takes 20 minutes, and the merge of the backups takes 45 minutes. In this example, we observed the following results:
Level 0 backup takes 180 minutes, including
READs from the data area and
WRITEs to the flash recovery area
Level 1 backup takes 20 minutes, including
READs from the data area and
WRITEs to the flash recovery area
Rolling forward and merging the backups takes only 45 minutes, consisting of READs and WRITEs within the flash recovery area, which offloads possible contention from the data area if the two areas use separate storage.
The net savings are:
115 minutes or 64% time savings to create a complete backup
Reduced I/O on the database during backups
For bigger databases, we observed even larger gains.
When creating backups to disk or tape, use the target database control file as the RMAN repository, so that backup success does not depend on the availability of the database holding the RMAN repository. This is accomplished by running RMAN with the
NOCATALOG option. After the backup is complete, the new backup information stored in the target database control file can be resynchronized with the recovery catalog using the
RESYNC CATALOG command.
See Also:Oracle Database Backup and Recovery Reference for more information on
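A sketch of this flow, with a hypothetical catalog connect string:

```sql
-- Back up using only the target control file as the repository
-- (started from the OS shell as: rman TARGET / NOCATALOG)
BACKUP DATABASE PLUS ARCHIVELOG;

-- Later, reconnect with the catalog and synchronize the metadata
-- (started from the OS shell as: rman TARGET / CATALOG rman_user@catdb)
RESYNC CATALOG;
```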
Using automatic disk-based backup and recovery, you can create a flash recovery area which automates management of backup-related files:
Choose a location on disk.
This location is specified by the DB_RECOVERY_FILE_DEST initialization parameter.
Choose an upper bound for storage space.
This upper bound is specified by the DB_RECOVERY_FILE_DEST_SIZE initialization parameter.
Set a retention policy that governs how long backup files are needed for recovery.
Oracle Database 10g then manages the storage used for backup, archived redo logs, and other recovery-related files for your database within this space. Files no longer needed are eligible for deletion when RMAN must reclaim space for new files.
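The location and size choices above map to two initialization parameters; the disk group name and size shown are placeholders, and the size must be set before the location:

```sql
-- Upper bound for space used by recovery-related files
ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE = 100G SCOPE=BOTH;

-- Location of the flash recovery area
ALTER SYSTEM SET DB_RECOVERY_FILE_DEST = '+FLASH' SCOPE=BOTH;
```

The retention policy itself is set in RMAN with the CONFIGURE RETENTION POLICY command.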
Take backups at the primary and secondary sites. The advantages of this practice include:
Significantly reduces RTO in certain double outage scenarios
Avoids introducing new backup procedures upon a switchover or failover
RMAN file and block media recovery is a recovery option for data failure outages at both primary and secondary sites
Consider a scenario in which backups are done only at the secondary site. Suppose there is a site outage at the secondary site where the estimated time to recover is three days. The primary site is completely vulnerable to an outage that is typically resolved by a failover, or any outage that could be resolved by having a local backup (such as a data failure outage resolved by block media recovery).
In this scenario, a production database outage can be resolved only by physically shipping the off-site tape backups that were taken at the standby site. If primary site backups were available, then restoring locally would be an available option in place of the failover that cannot be done. Data might be lost, but having primary site backups significantly shortens the RTO.
Primary site disk backups are also necessary to ensure a reasonable RTO when using RMAN file or block media recovery. Without a local on disk backup, a backup taken at the standby site must be restored to the primary site, significantly lengthening the RTO for this type of outage.
This section contains these topics:
Use RMAN command
BACKUP RECOVERY FILE DESTINATION to move disk backups created in the flash recovery area to tape. Using a single command, all files not backed up to tape previously are backed up. This prevents you from backing up files more than once and wasting tape or tracking files not backed up before. Tape backups are used to handle certain outage scenarios and for offsite and long-term storage.
Regardless of the architecture deployed, including the existence of a standby database, it is still important to have offsite backups for business requirements, to protect against disasters, and to comply with legal and regulatory requirements such as those of the Securities and Exchange Commission (SEC) and the Health Insurance Portability and Accountability Act (HIPAA).
This section contains these topics:
Use the RMAN command
BACKUP VALIDATE regularly to check database files for block corruption that has not yet been reported by a user session or by normal backup operations. RMAN scans the specified files and verifies their content, checking for physical and logical errors, but does not actually perform the backup or recovery operation. Oracle records the address of the corrupt block and the type of corruption in the control file. Access these records through the
V$DATABASE_BLOCK_CORRUPTION view, which can be used by RMAN block media recovery.
If BLOCK CHANGE TRACKING is enabled, then do not use the
INCREMENTAL LEVEL option with
BACKUP VALIDATE to ensure that all data blocks are read and verified.
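For example, the check-and-repair flow might look like the following sketch:

```sql
-- Scan all datafiles and archived logs for corruption without
-- producing any backup output
BACKUP VALIDATE DATABASE ARCHIVELOG ALL;

-- Review any corruption recorded in the control file
SELECT * FROM v$database_block_corruption;

-- Repair the recorded corrupt blocks with block media recovery
BLOCKRECOVER CORRUPTION LIST;
```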
Complete, successful, and tested backups are fundamental to the success of any recovery. Create test plans for different outage types. Start with the most common outage types and progress to the least probable.
Monitor the backup procedure for errors, and validate backups by testing your recovery procedures periodically. Also validate the ability to back up and restore by using the RMAN commands
BACKUP VALIDATE and RESTORE VALIDATE.
Include the recovery catalog database in your backup and recovery strategy. If you do not back up the recovery catalog and a disk failure occurs that destroys the recovery catalog database, then you may lose the metadata in the catalog. Without the recovery catalog contents, recovery of your other databases is likely to be more difficult.
To benefit fully from fast instance and database failover and switchover with Real Application Clusters (RAC) and Data Guard, you should configure fast application failover. When a database service becomes unavailable, fast application failover enables clients (mid-tier applications or any program that connects directly to a database) to failover quickly and seamlessly to an available database service.
Because client failover features have evolved through several Oracle Database releases, the time required for clients to respond to various outages will vary by release. The time required for failover in certain cases is a direct function of TCP/IP network timeouts.
Table 2-5 shows typical wait times when using client failover features.
|Oracle Database Version||Client Type||Site Failure||RAC Node Failure||Non-RAC Instance Failure||RAC Instance Failure|
8.0, 8i, 9i
Seconds to minutes (Footnote 1)
10g Release 1 (10.1)
Seconds to minutes (Footnote 1)
10g Release 1 (10.1)
Seconds to minutes (Footnote 1)
10g Release 2 (10.2)
10g Release 2 (10.2)
Footnote 1 The wait times required in non-RAC instance failures are determined by how much time is needed to activate the standby database as the new primary database and for the client to establish a new connection.
Footnote 2 Excluding ODP.NET clients, which suffer an outage equal to that of the TCP timeout
With Oracle Database 10g Release 2 (10.2), delays caused by TCP/IP network timeouts are overcome for both JDBC clients and OCI clients by using fast application failover. To use fast application failover for site failovers, a trigger is written that is invoked by the
DB_ROLE_CHANGE system event. This trigger can also manage post-failover tasks.
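A minimal sketch of such a trigger follows; the trigger and service names are assumptions, and a production version would add error handling and any other post-failover tasks:

```sql
-- Start the application service when this database becomes primary,
-- which publishes FAN notifications to waiting clients
CREATE OR REPLACE TRIGGER manage_service
  AFTER DB_ROLE_CHANGE ON DATABASE
DECLARE
  role VARCHAR2(30);
BEGIN
  SELECT database_role INTO role FROM v$database;
  IF role = 'PRIMARY' THEN
    DBMS_SERVICE.START_SERVICE('sales_rw');  -- hypothetical service name
  END IF;
END;
/
```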
A very detailed description of how to achieve seamless client failover is covered in Oracle Database 10g Release 2 Best Practices: Client Failover for Highly Available Oracle Databases at
This section contains these topics:
For JDBC client failover best practices, follow these steps:
Enable fast connection failover for JDBC clients by setting the DataSource property FastConnectionFailoverEnabled to true.
Configure JDBC clients to use a connect descriptor that includes an address list which includes the VIP address for each node in the cluster and connects to an existing service.
Configure a remote Oracle Notification Service (ONS) subscription on the JDBC client so that an ONS daemon is not required on the client.
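A connect descriptor of the shape described in step 2 might look like the following; the VIP host names and service name are placeholders:

```
# Address list naming each node's VIP, connecting to an existing service
sales =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521)))
    (CONNECT_DATA = (SERVICE_NAME = sales_rw)))
```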
For OCI client failover best practices, follow these steps:
Enable fast application notification (FAN) notifications for OCI clients by initializing the OCI environment with the OCI_EVENTS mode.
Link the OCI client applications with the thread library.
Set the AQ_HA_NOTIFICATIONS parameter to
TRUE and configure the transparent application failover (TAF) failover attributes for services.
Use Oracle Enterprise Manager to create new services.
Add all hosts in the cluster to the RAC ONS configuration.
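The service-side settings in the steps above can be made with the DBMS_SERVICE package; the service name and retry values below are illustrative:

```sql
-- Enable FAN notifications and TAF attributes for a service
BEGIN
  DBMS_SERVICE.MODIFY_SERVICE(
    service_name        => 'sales_rw',
    aq_ha_notifications => TRUE,
    failover_method     => DBMS_SERVICE.FAILOVER_METHOD_BASIC,
    failover_type       => DBMS_SERVICE.FAILOVER_TYPE_SELECT,
    failover_retries    => 180,
    failover_delay      => 5);
END;
/
```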
For client failover best practices when failing over from a RAC primary database to a standby database, follow these steps:
Create the necessary support files for Oracle Net alias modifications.
Configure the FAN ONS publisher program so that JDBC clients are notified of the primary site failure and instructed to reconnect to the new primary database.
Create a trigger on the
DB_ROLE_CHANGE system event to perform the proper steps so that JDBC and OCI clients connect to the new primary database following a role change after a Data Guard failover or switchover.