2.6 Configuring Oracle Database 11g with Oracle Data Guard

The proper configuration of Oracle Data Guard Redo Apply and SQL Apply is essential to ensuring that all standby databases work properly and perform their roles within the necessary service levels after switchovers and failovers.

The best practices for Oracle Data Guard build on the ones described in Section 2.2, "Configuring Oracle Database 11g". You can configure most Oracle Data Guard settings using Oracle Enterprise Manager and the broker. To set more advanced, less frequently used configuration parameters, use the DGMGRL command-line interface or SQL*Plus statements.

Oracle Data Guard enables you to use a physical standby database (Redo Apply), a snapshot standby database, real-time query (the Oracle Active Data Guard option), a logical standby database (SQL Apply), real-time apply, or a combination of these.

  • A physical standby database provides a physically identical copy of the primary database, with on-disk database structures that are identical to the primary database on a block-for-block basis. The database schemas, including indexes, are the same. A physical standby database is kept synchronized with the primary database by applying the redo data received from the primary database through media recovery. In addition, a physical standby database:

    • Can be opened for read-only access (real-time queries) while Redo Apply is active if you purchase a license for the Oracle Active Data Guard option.

    • Can be converted to a snapshot standby database for use as a testing or cloning database, and then later converted back to run in the physical standby database role.

    • Can be temporarily converted into a transient logical standby database on which you can perform a rolling upgrade, incurring minimal downtime.

  • A logical standby database contains the same logical information as the primary database, although the physical organization and structure of the data can be different from the primary database. SQL Apply keeps the logical standby database synchronized with the primary database by transforming the redo data received from the primary database into SQL statements and then executing the SQL statements on the standby database. You can use a logical standby database for other business purposes in addition to disaster-recovery requirements.

2.6.1 Determine Which Type of Standby Database Is Best for Your Application

This section differentiates physical and logical standby databases to help you determine which is the best solution for your business requirements.

Use a physical standby database when:

  • The simplicity and reliability of a physical replica are preferable.

  • The primary database has a very high redo generation rate.

  • The highest level of protection against corruption is required.

  • An up-to-date standby database that is open read-only can address planned uses of the standby database while it is running in the standby role. (Requires a license for the Oracle Active Data Guard option.)

  • You want to offload fast incremental backups to the standby database. (Requires a license for the Oracle Active Data Guard option.)

  • You want to use a snapshot standby database for quality assurance testing or other uses that require a database that is open read-write.

  • You want to perform rolling database upgrades using a transient logical standby database.

Use a logical standby database when:

  • You want to run reporting applications that require read/write access to the standby database.

    Note: Reporting applications can write to tables that are not maintained by SQL Apply, but they cannot modify the data that SQL Apply maintains.

  • You want to add tables, additional schemas, indexes, and materialized views to your standby database that do not exist on your primary database.

  • You will perform a rolling database upgrade from a database currently running an Oracle Database 10g release.

    Note: If your database is already running Oracle Database 11g, use a physical standby database and the transient logical standby database rolling upgrade process.

In addition, use a logical standby database if you have any of the preceding requirements and any of the following characteristics:

  • You require basic one-way replication of the entire database.

    Note: If you have more complex replication requirements (for example, multimaster, many-to-one, transformations, and so on) then use Oracle Streams (see Section 2.8).

  • You do not have data types that are unsupported by SQL Apply, or Extended Datatype Support (EDS) provides a viable workaround for those that are.

  • You meet other prerequisites of a logical standby database.

  • Performance tests confirm that logical standby databases can handle peak workloads.

2.6.2 Choose the Appropriate Level of Data Protection

In some situations, a business cannot afford to lose data under any circumstances. In other situations, the availability of the database might be more important than protecting against data loss. Some applications require maximum database performance and can tolerate a potential loss of data if a disaster occurs.

Based on your business requirements, choose one of the following protection modes:

  • Maximum protection mode guarantees that no data loss occurs if the primary database fails. To ensure that data loss cannot occur, the primary database shuts down if a fault prevents it from writing the redo stream to at least one standby database.

  • Maximum availability mode provides the highest level of data protection that is possible without compromising the availability of the primary database.

  • Maximum performance mode (the default mode) provides the highest level of data protection that is possible without affecting the performance of the primary database. This is accomplished by allowing a transaction to commit as soon as the redo data needed to recover that transaction is written to the local online redo log.

    The redo data stream of the primary database is also written to at least one standby database, but that redo stream is written asynchronously with the commitment of the transactions that create the redo data. When network links with sufficient bandwidth are used, this mode provides a level of data protection that approaches that of maximum availability mode, with minimal effect on primary database performance.
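
After you choose a protection mode, you can set it with a single SQL statement on the primary database. The following is a minimal sketch; substitute MAXIMIZE PROTECTION or MAXIMIZE PERFORMANCE as appropriate:

SQL> ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;

Note that maximum protection and maximum availability modes require at least one standby destination that uses the SYNC redo transport.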

To determine the appropriate data protection mode for your application, perform the following steps:

  1. Compare your business requirements to the data loss scenarios in Table 2-2.

    Table 2-2 Determining the Appropriate Data Protection Mode

    If data loss is not acceptable during ... Then ...

    A primary site failure, failure of one or all standby sites, or any network failure

    Use maximum protection mode.

    A primary site failure

    Use maximum availability mode. Otherwise, use maximum performance mode.


  2. Consider the effect of latency on application throughput:

    When using maximum protection mode or maximum availability mode, consider how the latency might affect application throughput. The distance between sites and the network infrastructure between the sites determine the network latency, and therefore determine the protection mode that can be used. In general, latency increases and bandwidth decreases with distance:

    • For a low-latency, high-bandwidth network, use maximum protection or maximum availability protection mode.

      In this case, the effect on performance is minimal and you can achieve zero data loss.

    • For a high-latency network, use maximum performance mode with the ASYNC transport.

      In this case, the performance effect on the primary database is minimal, and you can usually limit data loss to seconds. You can still use the maximum availability mode and maximum protection mode with the SYNC transport, but you must assess whether the additional COMMIT latency might exceed your application performance requirements. In some cases, the response time or throughput overhead is zero or within acceptable requirements. Large batch applications and message queuing applications are good examples of workloads for which maximum availability with SYNC is still applicable across a high-latency network.

    Bandwidth must be greater than the maximum redo generation rate. A guideline for two-way communication is for bandwidth to be 50% of the stated network capacity, but also consider the network usage of other applications. Using the maximum performance mode with the ASYNC redo transport mitigates the effect on performance.
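
    As a hypothetical worked example: on a link with a stated capacity of 100 Mbps, plan on roughly 50 Mbps of usable one-way bandwidth, or about 6 MB/sec; the peak redo generation rate of the primary database, plus the network usage of other applications, must stay below that figure.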

2.6.3 Implement Multiple Standby Databases

You should deploy multiple standby databases for any of the following purposes:

  • To provide continuous protection following failover

    The standby databases in a multiple standby configuration that are not the target of the role transition (these databases are referred to as bystander standby databases) automatically apply redo data received from the new primary database.

  • To achieve zero data loss protection while also guarding against widespread geographic disasters that extend beyond the limits of synchronous communication

    For example, one standby database that receives redo data synchronously is located 200 miles away, and a second standby database that receives redo data asynchronously is located 1,500 miles away from the primary.

  • To perform rolling database upgrades while maintaining disaster protection throughout the rolling upgrade process

  • To perform testing and other ad-hoc tasks while maintaining disaster-recovery protection

    When desired, use some standby databases for such purposes while reserving at least one standby database to serve as the primary failover target.

2.6.4 General Configuration Best Practices for Data Guard

Use the following configuration best practices for Data Guard:

See Also:

Appendix A, "Database SPFILE and Oracle Net Configuration File Samples" for detailed examples of initialization parameter settings, including SPFILE samples and Oracle Net configuration files

2.6.4.1 Use Recovery Manager to Create Standby Databases

Oracle recommends that you use the Recovery Manager (RMAN) utility to simplify the process of creating a physical standby database.

You can either create a standby database from backups of your primary database, or create a standby database over the network:

  • Use the RMAN DUPLICATE TARGET DATABASE FOR STANDBY command to create a standby database from backups of your primary database.

    You can use any backup copy of the primary database to create the physical standby database, provided the archived redo log files needed to completely recover the database are accessible by the server session on the standby host. RMAN restores the most recent data files unless you execute the SET UNTIL command.

  • Use the RMAN FROM ACTIVE DATABASE option to create the standby database over the network if a preexisting database backup is not accessible to the standby system.

    RMAN copies the data files directly from the primary database to the standby database. The primary database must be mounted or open.

You must choose between active and backup-based duplication. If you do not specify the FROM ACTIVE DATABASE option, then RMAN performs backup-based duplication. Creating a standby database over the network is advantageous because:

  • You can transfer the database directly to the remote host over the network, without first having to go through the steps of performing a backup on the primary database. (Backup-based duplication requires multiple steps: storing the backup locally on the primary database, transferring the backup over the network, storing the backup locally on the standby database, and then restoring the backup on the standby database.)

  • With active duplication you can back up a database (as it is running) from ASM, and restore the backup to a host over the network and place the files directly into ASM.

    Before this feature, restoration required you to back up the primary database and copy the backup files to the primary host file system, transfer the backup files over the network, place the backup files on the standby host file system, and then restore the files into ASM.
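
The following RMAN sketch illustrates active duplication; the net service names SALES_prim and SALES_stby are hypothetical, and the standby (auxiliary) instance is assumed to be started NOMOUNT. Omit the FROM ACTIVE DATABASE clause to perform backup-based duplication instead:

RMAN> CONNECT TARGET sys@SALES_prim
RMAN> CONNECT AUXILIARY sys@SALES_stby
RMAN> DUPLICATE TARGET DATABASE FOR STANDBY FROM ACTIVE DATABASE NOFILENAMECHECK;

You can add the DORECOVER option to have RMAN also recover the standby database after restoring it.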

2.6.4.2 Enable Flashback Database for Reinstatement After Failover

Enable Flashback Database on both the primary and standby databases so that, if the original primary database has not been damaged, you can reinstate it as a new standby database following a failover. If there is a failure during the switchover process, then the switchover can easily be reversed when Flashback Database is enabled.

See Also:

Section 2.2.1.4, "Enable Flashback Database" for more information about Flashback Database and for information about enabling Flashback Database

2.6.4.3 Use FORCE LOGGING Mode

When the primary database is in FORCE LOGGING mode, all database data changes are logged. FORCE LOGGING mode ensures that the standby database remains consistent with the primary database. If this is not possible because you require the load performance of NOLOGGING operations, then you must ensure that the corresponding physical standby data files are subsequently synchronized. To synchronize the physical standby data files, either apply an incremental backup created from the primary database or replace the affected standby data files with a backup of the primary data files taken after the NOLOGGING operation. Before the file transfer, you must stop Redo Apply on the physical standby database.

For logical standby databases, when SQL Apply encounters a redo record for an operation performed with the NOLOGGING clause, it skips over the record and continues applying changes from later records. Later, if an attempt is made to access a record that was updated with NOLOGGING in effect, the following error is returned: ORA-01403 no data found. To recover after the NOLOGGING clause is specified for a logical standby database, re-create one or more tables from the primary database, as described in Oracle Data Guard Concepts and Administration in the section about "Adding or Re-Creating Tables On a Logical Standby Database."

You can enable force logging immediately by issuing an ALTER DATABASE FORCE LOGGING statement. If you specify FORCE LOGGING, then Oracle waits for all ongoing unlogged operations to finish.
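
For example, a minimal sketch that enables force logging and confirms the setting:

SQL> ALTER DATABASE FORCE LOGGING;
SQL> SELECT FORCE_LOGGING FROM V$DATABASE;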

2.6.4.4 Use the Data Guard Broker

Use the broker to create, manage, and monitor a Data Guard configuration. The benefits of using the broker include:

  • Integration with Oracle RAC

    The broker is integrated with Cluster Ready Services (CRS) so that database role changes occur smoothly and seamlessly. This is especially apparent for a planned role switchover (for example, when a physical standby database is directed to take over the primary role while the original primary database assumes the role of standby). The broker and CRS work together to temporarily suspend service availability on the primary database, accomplish the actual role change for both databases during which CRS works with the broker to properly restart the instances as necessary, and then resume service availability on the new primary database. The broker manages the underlying Data Guard configuration and its database roles while CRS manages service availability that depends upon those roles. Applications that rely on CRS for managing service availability see only a temporary suspension of service as the role change occurs in the Data Guard configuration.

  • Automated creation of a Data Guard configuration

    Oracle Enterprise Manager provides a wizard that automates the complex tasks involved in creating a broker configuration, including:

    • Adding an existing standby database, or a new standby database created from existing backups taken through Enterprise Manager

    • Configuring the standby control file, server parameter file, and data files

    • Initializing communication with the standby databases

    • Creating standby redo log files

    • Enabling Flashback Database if you plan to use fast-start failover

    Although the Data Guard command-line interface (DGMGRL) cannot automatically create a standby database, you can use DGMGRL commands to configure and monitor an existing standby database, including those created using Enterprise Manager.

  • Simplified switchover and failover operations

    The broker simplifies switchovers and failovers by allowing you to invoke them using a single key click in Oracle Enterprise Manager or a single command at the DGMGRL command-line interface (using DGMGRL is referred to in this documentation as manual failover). For lights-out administration, you can enable fast-start failover to allow the broker to determine if a failover is necessary and initiate the failover to a pre-specified target standby database automatically, with no need for DBA intervention and with little or no loss of data.

    Fast-start failover enables you to increase availability with less need for manual intervention, thereby reducing management costs. Manual failover gives you control over exactly when a failover occurs and to which target standby database. Regardless of the method you choose, the broker coordinates the role transition on all databases in the configuration.

  • Fast Application Notification (FAN) after failovers

    The broker automatically publishes FAN/AQ (Advanced Queuing) notifications after a failover. Properly configured clients that are also configured for Fast Connection Failover can use these notifications to connect to the new primary database to resume application processing.

  • Built-in monitoring and alert and control mechanisms

    The broker provides built-in validation that monitors the health of all databases in the configuration. From any system in the configuration connected to any database, you can capture diagnostic information and detect obvious and subtle problems quickly with centralized monitoring, testing, and performance tools. Both Enterprise Manager and DGMGRL retrieve a complete configuration view of the progress of redo transport services on the primary database and the progress of Redo Apply or SQL Apply on the standby database.

    The ability to monitor local and remote databases and respond to events is significantly enhanced by the broker health check mechanism and by its tight integration with the Enterprise Manager event management system.

2.6.4.5 Use a Simple, Robust Archiving Strategy and Configuration

This archiving strategy is based on the following assumptions:

  • Each database uses a flash recovery area.

  • The primary database instances archive remotely to only one apply instance.

Table 2-3 describes the recommendations for a robust archiving strategy when managing a Data Guard configuration through SQL*Plus. All of the following items are handled automatically when the broker is managing a configuration.

Table 2-3 Archiving Recommendations

Recommendation Description

Start archiving on the primary and standby databases

Maintaining a standby database requires that you enable and start archiving on the primary database, as follows:

SQL> SHUTDOWN IMMEDIATE;
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE ARCHIVELOG;
SQL> ALTER DATABASE OPEN;

Archiving must also be enabled on the standby database to support role transitions. To enable archiving on the standby database:

SQL> SHUTDOWN IMMEDIATE;
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE ARCHIVELOG;

If the standby database is a logical standby, then open the database:

SQL> ALTER DATABASE OPEN;

Use a consistent log format (LOG_ARCHIVE_FORMAT).

The LOG_ARCHIVE_FORMAT parameter should specify the thread, sequence, and resetlogs ID attributes, and the parameter settings should be consistent across all instances. For example: LOG_ARCHIVE_FORMAT=arch_%t_%S_%r.arc

Note: If the flash recovery area is used, then this format is ignored.

Perform remote archiving to only one standby instance and node for each Oracle RAC standby database.

All primary database instances archive to one standby destination, using the same net service name. Oracle Net Services connect-time failover is used to automatically switch to the "secondary" standby host when the "primary" standby instance has an outage.

If the archives are accessible from all nodes because ASM or some other shared file system is being used for the flash recovery area, then remote archiving can be spread across the different nodes of an Oracle RAC standby database.

Specify role-based destinations with the VALID_FOR attribute

The VALID_FOR attribute enables you to configure destination attributes for both the primary and the standby database roles in one server parameter file (SPFILE), so that the Data Guard configuration operates properly after a role transition. This simplifies switchovers and failovers by removing the need to enable and disable the role-specific parameter files after a role transition.

See Also: Appendix A, "Database SPFILE and Oracle Net Configuration File Samples"


The following example illustrates the recommended initialization parameters for a primary database communicating with a physical standby database. There are two instances, SALES1 and SALES2, running in maximum protection mode.

*.DB_RECOVERY_FILE_DEST=+RECO
*.LOG_ARCHIVE_DEST_1='SERVICE=SALES_stby SYNC AFFIRM NET_TIMEOUT=30
    REOPEN=300 VALID_FOR=(ONLINE_LOGFILES, ALL_ROLES) DB_UNIQUE_NAME=SALES_stby'
*.LOG_ARCHIVE_DEST_STATE_1=ENABLE

The flash recovery area must be accessible to all nodes within the cluster, using a shared file system technology such as Automatic Storage Management (ASM), a cluster file system, a global file system, or high availability network file system (HA NFS); alternatively, the file system must be one that can be quickly mounted manually on any node within the cluster. This is necessary for recovery because all archived redo log files must be accessible on all nodes.

On the standby database nodes, recovery from a different node is required when a failure occurs on the node applying redo and the apply service cannot be restarted. In that case, any of the existing standby instances residing on a different node can initiate managed recovery. In the worst case, when the standby archived redo log files are inaccessible, the managed recovery process (MRP) or logical standby process (LSP) on the other node uses the fetch archive log (FAL) mechanism to retrieve the archived redo log files directly from the primary database.

When configuring hardware vendor shared file system technology, verify the performance and availability implications. Investigate the following issues before adopting this strategy:

  • Is the shared file system accessible by any node regardless of the number of node failures?

  • What is the performance impact when implementing a shared file system?

  • Is there any effect on the interconnect traffic?

2.6.4.6 Use Standby Redo Logs and Configure Size Appropriately

You should configure standby redo logs on both sites for improved availability and performance. To determine the recommended number of standby redo logs, use the following formula:

(maximum number of log file groups per thread + 1) * maximum number of threads

For example, if a primary database has two instances (threads) and each thread has two online log groups, then there should be six standby redo logs ((2 + 1) * 2 = 6). Having one more standby log group for each thread (that is, one greater than the number of online redo log groups for the primary database) reduces the likelihood that the log writer process for the primary instance is blocked because a standby redo log cannot be allocated on the standby database.

The following statements create two standby log members for each group, and each member is 10 MB. One member is created in the directory specified by the DB_CREATE_FILE_DEST initialization parameter, and the other member is created in the directory specified by DB_RECOVERY_FILE_DEST initialization parameter. Because this example assumes that there are two online redo log groups in two threads, the next group is group five.

SQL> ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
GROUP 5 SIZE 10M,
GROUP 6 SIZE 10M,
GROUP 7 SIZE 10M;

SQL> ALTER DATABASE ADD STANDBY LOGFILE THREAD 2
GROUP 8 SIZE 10M,
GROUP 9 SIZE 10M,
GROUP 10 SIZE 10M;

Consider the following additional guidelines when creating standby redo logs:

  • Create the same number of standby redo logs on both the primary and standby databases.

  • Create all online redo logs and standby redo logs for both primary and standby databases so that they are the same size.

  • Create standby redo logs in the Data Area protected through ASM or external redundancy.

  • In an Oracle RAC environment, create standby redo logs on a shared disk.

  • In an Oracle RAC environment, assign the standby redo log to a thread when the standby redo log is created. In the above example, three standby redo logs are assigned to each thread.

  • Do not multiplex the standby redo logs.

To check the number and group numbers of the redo logs, query the V$LOG view:

SQL> SELECT * FROM V$LOG;

To check the results of the ALTER DATABASE ADD STANDBY LOGFILE THREAD statements, query the V$STANDBY_LOG view:

SQL> SELECT * FROM V$STANDBY_LOG;

You can also see the members created by querying the V$LOGFILE view:

SQL> SELECT * FROM V$LOGFILE;

See Also:

The section about "Configuring an Oracle Database to Receive Redo Data" in Oracle Data Guard Concepts and Administration

2.6.5 Redo Transport Services Best Practices

This section discusses best practices for planning and implementing redo transport services for Data Guard.

2.6.5.1 Conduct Performance Assessment with Proposed Network Configuration

Oracle recommends that you conduct a performance assessment with your proposed network configuration and current (or anticipated) peak redo rate. The network effect between the primary and standby databases, and the effect on the primary database throughput, must be understood. Because the network between the primary and standby databases is essential for the two databases to remain synchronized, the infrastructure must have the following characteristics:

  • Sufficient bandwidth to accommodate the maximum redo generation rate

  • If using the SYNC transport, then minimal latency is necessary to reduce the performance impact on the primary database

  • Multiple network paths for network redundancy

In configurations that use a dedicated network connection, the required bandwidth is determined by the maximum redo rate of the primary database and the efficiency of the network. Depending on the data protection mode, there are other recommended practices and performance considerations. Maximum protection mode and maximum availability mode require the SYNC transport. Maximum performance mode uses the ASYNC transport option.

Unlike the ASYNC transport mode, the SYNC transport mode can affect primary database performance because of the network latency it incurs. Distance and network configuration directly influence latency, and high latency can reduce potential transaction throughput and lengthen response time. The network configuration, number of repeaters, the overhead of protocol conversions, and the number of routers also affect the overall network latency and transaction response time.

2.6.5.2 Best Practices for Primary Database Throughput

The following sections describe the behavior of the SYNC and ASYNC transport modes on primary database throughput, and setting the LOG_ARCHIVE_MAX_PROCESSES parameter.

2.6.5.2.1 The SYNC Transport

Configure the SYNC redo transport for a high degree of synchronization between the primary and standby databases. When sending redo data to a standby database using the SYNC attribute, a transaction on the primary database does not return a commit completed message to the foreground process until the redo associated with that transaction has been written both locally and remotely.

The commit time when using SYNC is directly affected by the network latency and bandwidth, and by the I/O capacity on the standby database. The total commit time consists of the primary database's local write (log file parallel write) plus the following factors, which are captured through the LNS wait event on SENDREQ: network time + standby write (the RFS write I/O, obtained from the V$SYSTEM_EVENT view on the standby database) + network acknowledgment.

Effect of the SYNC Transport on the Primary Database: How much the primary database is affected depends on the application profile. In general, batch updates with infrequent commits and message queuing applications may not observe any noticeable difference.

2.6.5.2.2 The ASYNC Transport

Configure the ASYNC redo transport for minimal impact on the primary database, but with a lower degree of synchronization. When sending redo data to a standby database using the ASYNC attribute, transactions on the primary database continue to be processed without waiting for the network I/O requests to complete.

There is little effect on the primary database throughput (redo bytes per second) as network latency increases. The log writer process writes to the local online redo log file, while the network server processes (one for each destination) read redo from the log buffer and asynchronously transmit the redo to remote destinations. If redo transport services transmit redo to multiple remote destinations, then the network server processes initiate the network I/O to all of the destinations in parallel.

Effect of the ASYNC Transport on the Primary Database: The effect on primary database throughput is minimal due to the true asynchronous behavior of ASYNC redo transport.

Note:

The network server processes read redo from the log buffer whenever possible; they read from the online redo log only when the redo they need has already been flushed from the log buffer.

2.6.5.2.3 Best Practices for Setting the LOG_ARCHIVE_MAX_PROCESSES Parameter

Because the network connections used in multiple streams are initiated by the archiver process, take care when setting the LOG_ARCHIVE_MAX_PROCESSES initialization parameter. The value of the LOG_ARCHIVE_MAX_PROCESSES initialization parameter must be at least one greater than the total number of all remote destinations. Use the following equation when setting the LOG_ARCHIVE_MAX_PROCESSES parameter for highly available environments:

LOG_ARCHIVE_MAX_PROCESSES = sum(remote_destinations) + count(threads)

You can adjust these parameter settings after evaluating and testing the initial settings in your production environment.
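
For example, for a hypothetical two-thread Oracle RAC primary database with two remote destinations, the equation gives 2 + 2 = 4, which you can set dynamically:

SQL> ALTER SYSTEM SET LOG_ARCHIVE_MAX_PROCESSES=4 SCOPE=BOTH;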

2.6.5.3 Best Practices for Network Configuration and Highest Network Redo Rates

The following sections include best practices for network configuration and highest network redo rates.

See Also:

The MAA white paper "Data Guard Redo Transport & Network Configuration" at http://www.otn.oracle.com/goto/maa

2.6.5.3.1 Properly Configure TCP Send / Receive Buffer Sizes

To achieve high network throughput, especially for a high-latency, high-bandwidth network, the minimum recommended setting for the sizes of the TCP send and receive socket buffers is the bandwidth delay product (BDP) of the network link between the primary and standby systems. Settings higher than the BDP may show incremental improvement. For example, in the MAA Linux test lab, simulated high-latency, high-bandwidth networks realize small, incremental increases in throughput when using TCP send and receive socket buffers settings up to three times the BDP.

BDP is the product of the network bandwidth and latency. Socket buffer sizes should be set using the Oracle Net parameters RECV_BUF_SIZE and SEND_BUF_SIZE, so that the socket buffer size setting affects only Oracle TCP connections. The operating system may impose limits on the socket buffer size that must be adjusted so Oracle can use larger values. For example, on Linux, the parameters net.core.rmem_max and net.core.wmem_max limit the socket buffer size and must be set larger than RECV_BUF_SIZE and SEND_BUF_SIZE.

For example, if bandwidth is 622 Mbits and latency is 30 ms, then you would calculate the minimum size for the RECV_BUF_SIZE and SEND_BUF_SIZE parameters as follows: 622,000,000 / 8 x 0.030 = 2,332,500 bytes.

In this example, you would set the Oracle Net parameters as follows:

RECV_BUF_SIZE=2332500

SEND_BUF_SIZE=2332500
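
To apply these values to only the Data Guard network connections, you can instead set the buffer sizes in the connect descriptor. The following TNSNAMES.ORA entry is a sketch in which the host and service names are hypothetical:

SALES_stby =
  (DESCRIPTION=
    (SEND_BUF_SIZE=2332500)
    (RECV_BUF_SIZE=2332500)
    (ADDRESS=(PROTOCOL=TCP)(HOST=standby_host)(PORT=1521))
    (CONNECT_DATA=(SERVICE_NAME=SALES_stby)))

Set the same parameters in the LISTENER.ORA file on the receiving side so that both directions of the connection use the larger buffers.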

2.6.5.3.2 Increase SDU Size

With Oracle Net Services, it is possible to control data transfer by adjusting the size of the Oracle Net setting for the session data unit (SDU). Oracle internal testing indicates that setting the SDU to its maximum value of 32767 can improve performance. You can set SDU on a per connection basis using the SDU parameter in the local naming configuration file (TNSNAMES.ORA) and the listener configuration file (LISTENER.ORA), or you can set the SDU for all Oracle Net connections with the profile parameter DEFAULT_SDU_SIZE in the SQLNET.ORA file.
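
For example, a minimal SQLNET.ORA sketch that applies the maximum SDU to all Oracle Net connections on a system:

DEFAULT_SDU_SIZE=32767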

See Also:

Oracle Database Net Services Reference for more information on the SDU and DEFAULT_SDU_SIZE parameters

2.6.5.3.3 Ensure TCP.NODELAY is YES

To preempt delays in buffer flushing in the TCP protocol stack, disable the TCP Nagle algorithm by setting TCP.NODELAY to YES in the SQLNET.ORA file on both the primary and standby systems.
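
For example, add the following line to the SQLNET.ORA file on each system:

TCP.NODELAY=YES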

See Also:

Oracle Database Net Services Reference for more information about the TCP.NODELAY parameter

2.6.5.4 Best Practices for Redo Transport Compression

Beginning in Oracle Database 11g, Oracle Data Guard provides the ability to compress redo data as it is transmitted over the network. This compression can be performed while resolving redo gaps with the archiver (ARCn) process or while shipping redo data during run time using the LGWR ASYNC transport. In certain environments, enabling compression of redo data can:

  • Reduce network utilization

  • Provide faster resolution of gaps in archived redo log files

  • Reduce redo transfer time

Redo transport compression is a feature of the Oracle Advanced Compression option; you must purchase a license for this option before using redo transport compression.

When to Use Compression

In general, compression is most beneficial when used over low bandwidth networks. As the network bandwidth increases, the benefit is reduced. Compressing redo in a Data Guard environment is beneficial if:

  • Sufficient CPU resources are available for the compression processing.

  • The database redo rate is being throttled by a low bandwidth network.

Before enabling compression, assess the available CPU resources and decide if enabling compression is feasible. For complete information about enabling compression for gap resolution and the LGWR ASYNC transport mode, see support note 729551.1 at http://support.oracle.com/.

2.6.6 Log Apply Services Best Practices

This section discusses the best practices for Data Guard log apply services for both physical and logical standby databases.

2.6.6.1 Redo Apply Best Practices for Physical Standby Databases

To use Redo Apply with a physical standby database, or to use any media recovery operation effectively, tune your database recovery by following these best practices:

  1. Maximize I/O rates on standby redo logs and archived redo logs.

    Measure read I/O rates on the standby redo logs and archived redo log directories. Concurrent writing of shipped redo on a standby database might reduce the redo read rate due to I/O saturation. The overall recovery rate is always bounded by the rate at which redo can be read; so ensure that the redo read rate surpasses your required recovery rate.

  2. Assess recovery rate.

    To obtain the history of recovery rates, use the following query to get a history of recovery progress:

    SELECT * FROM V$RECOVERY_PROGRESS;
    

    If your ACTIVE APPLY RATE is greater than the maximum redo generation rate at the primary database, or twice the average generation rate at the primary database, then no tuning is required; otherwise, follow the tuning tips below. The redo generation rate for the primary database can be monitored from Grid Control or extracted from AWR reports under the "redo size" statistic. If CHECKPOINT TIME PER LOG is greater than ten seconds, then investigate tuning I/O and checkpoints.

  3. Use defaults for DB_BLOCK_CHECKING and set DB_BLOCK_CHECKSUM=FULL

    Change DB_BLOCK_CHECKSUM from TYPICAL (the default setting) to FULL. Block checking is always recommended on the primary database and might be enabled on the standby database if the recovery rate meets expectations.

    Note:

    To check for block corruption that was not preventable through the DB_BLOCK_CHECKING parameter, use one of the following methods:
    • RMAN BACKUP command with the VALIDATE option

    • DBVERIFY utility

    • ANALYZE TABLE tablename VALIDATE STRUCTURE CASCADE SQL statement

    The default setting for DB_BLOCK_CHECKSUM is TYPICAL. Block checksum should always be enabled for both primary and standby databases. It catches most block corruption while incurring negligible overhead.

    Set the DB_LOST_WRITE_PROTECT parameter to FULL on the standby database to enable Oracle to detect writes that are lost in the I/O subsystem. The impact on media recovery is very small and generally less than 2 percent.
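
    As a minimal sketch, both parameters can be set dynamically; run the following on the standby database:

    SQL> ALTER SYSTEM SET DB_BLOCK_CHECKSUM=FULL SCOPE=BOTH;
    SQL> ALTER SYSTEM SET DB_LOST_WRITE_PROTECT=FULL SCOPE=BOTH;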

  4. Set DB_CACHE_SIZE to a value greater than that for the primary database. Set DB_KEEP_CACHE_SIZE and DB_RECYCLE_CACHE_SIZE to 0.

    Having a large database cache size can improve media recovery performance by reducing the amount of physical data block reads. Because media recovery does not require DB_KEEP_CACHE_SIZE and DB_RECYCLE_CACHE_SIZE or require a large SHARED_POOL_SIZE, the memory can be reallocated to the DB_CACHE_SIZE.

    Before converting the standby database into a primary database, reset these parameters to the primary database settings.

  5. Assess database wait events

    With the Active Data Guard option and real-time query, you can use Statspack from the primary database to collect data from a standby database that is opened read-only and performing recovery. Any tuning or troubleshooting exercise should start with collecting Standby Statspack reports. See support note 454848.1 at http://support.oracle.com/ for complete details about installing and using Standby Statspack.

    If you do not have a license for the Active Data Guard option, you can determine the top system and session wait events by querying the standby database's V$SYSTEM_EVENT, V$SESSION_WAIT, and V$EVENT_HISTOGRAM views and looking for the largest TIME_WAITED value. You may have to capture multiple snapshots of the query results and manually extract the difference to accurately assess a certain time period.
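
    For example, a simple query to list the top waits by total time waited:

    SQL> SELECT EVENT, TOTAL_WAITS, TIME_WAITED
         FROM V$SYSTEM_EVENT
         ORDER BY TIME_WAITED DESC;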

    If recovery is applying a lot of redo data efficiently, the system is typically I/O bound, and the I/O wait should be reasonable for your system. The vast majority of wait events related to parallel recovery coordinators and slaves apply to the coordinator. Slaves are either applying changes (consuming CPU) or waiting for changes to be passed from the coordinator. The database wait events are shown in Table 2-4 and Table 2-5.

    Table 2-4 Parallel Recovery Coordinator Wait Events

    Wait Name Description

    Log file sequential read

    The parallel recovery coordinator is waiting on I/O from the online redo log or the archived redo log.

    Parallel recovery read buffer free

    This event indicates that all read buffers are being used by slaves, and usually indicates that the recovery slaves lag behind the coordinator.

    Parallel recovery change buffer free

    The parallel recovery coordinator is waiting for a buffer to be released by a recovery slave. Again, this is a sign the recovery slaves are behind the coordinator.

    Datafile init write

    The parallel recovery coordinator is waiting for a file resize to finish, as would occur with file auto extend.

    Parallel recovery control message reply

    The coordinator has sent synchronous control messages to all slaves, and is waiting for all slaves to reply.


    When dealing with recovery slave events, it is important to know how many slaves were started. Divide the wait time for any recovery slave event by the number of slaves. Table 2-5 describes the parallel recovery slave wait events.

    Table 2-5 Parallel Recovery Slave Wait Events

    Wait Name Description

    Parallel recovery slave next change

    The parallel recovery slave is waiting for a change to be shipped from the coordinator. This is in essence an idle event for the recovery slave. To estimate the amount of CPU a recovery slave is using, divide the time spent in this event by the number of slaves started and subtract that value from the total elapsed time. The estimate is approximate because other waits are involved.

    DB File Sequential Read

    A parallel recovery slave (or serial recovery process) is waiting for a batch of synchronous data block reads to complete.

    Checkpoint completed

    Recovery is waiting for checkpointing to complete, and Redo Apply is not applying any changes currently.

    Recovery read

    A parallel recovery slave is waiting for a batched data block I/O.


  6. Tune I/O operations.

    DBWR must write out modified blocks from the buffer cache to the data files. Always use native asynchronous I/O by setting DISK_ASYNCH_IO to TRUE (default). In the rare case that asynchronous I/O is not available, use DBWR_IO_SLAVES to improve the effective data block write rate with synchronous I/O.

    Ensure that you have sufficient I/O bandwidth and that I/O response time is reasonable for your system either by doing some base I/O tests, comparing the I/O statistics with those for the primary database, or by looking at some historical I/O metrics. Be aware that I/O response time may vary when many applications share the same storage infrastructure such as with a Storage Area Network (SAN) or Network Attached Storage (NAS).

  7. Assess system resources.

    Use operating system commands such as the UNIX sar and vmstat utilities, or use system monitoring tools, to assess the system resources. Alternatively, you can monitor using Oracle Enterprise Manager, AWR reports, or performance views such as V$SYSTEM_EVENT, V$ASM_DISK, and V$OSSTAT.

    1. If there are I/O bottlenecks or excessive wait I/O operations, then investigate operational or application changes that increased the I/O volume. If the high waits are due to insufficient I/O bandwidth, then add more disks to the relevant ASM disk group. Verify that this is not a bus or controller bottleneck or any other I/O bottleneck. The read I/O rate from the standby redo log should be greater than the expected recovery rate.

    2. Check for excessive swapping or memory paging.

    3. Check to ensure the recovery coordinator or MRP is not CPU bound during recovery.

  8. Increase redo log size for the primary and standby databases.

    Increase the online redo log size for the primary database and the standby redo log size for the standby database to a minimum of 1 GB. Oracle Database does a full checkpoint and updates all the file headers (in an optimized manner) at each log file boundary during media recovery. To reduce the frequency of a full database checkpoint and the file header updates, increase the redo log size so that a log switch is occurring at a minimum of 20-minute intervals on a heavily loaded system. Otherwise, once every hour should be sufficient.

    To ensure that the crash recovery time for the primary database is minimized even with very large redo log sizes, set the FAST_START_MTTR_TARGET initialization parameter to a nonzero value to enable fast-start fault recovery. If the parameter is not set, then set it to 3600. This initialization parameter is relevant only for the primary database.

  9. Try different degrees of recovery parallelism.

    Parallel recovery is enabled by default for media and crash recovery with the default degree of parallelism set to the number of CPUs available. In most cases this is the optimal setting. However, in some circumstances, you may obtain faster recovery by using a degree of parallelism that is different (higher or lower) than the default. To override the default setting, explicitly specify parallel recovery using the following SQL*Plus statement:

    RECOVER MANAGED STANDBY DATABASE PARALLEL number_of_CPUs;
    

See Also:

The MAA white paper "Data Guard Redo Apply and Media Recovery" at http://www.otn.oracle.com/goto/maa

2.6.6.2 SQL Apply Best Practices for Logical Standby Databases

This section discusses recommendations for Data Guard SQL Apply and logical standby databases.

2.6.6.2.1 Set the MAX_SERVERS Initialization Parameter

Set the initial value of the MAX_SERVERS parameter to be eight times the number of CPUs. For example:

SQL> EXECUTE DBMS_LOGSTDBY.APPLY_SET('MAX_SERVERS', <8 x CPUs>);

SQL Apply automatically distributes the server processes. There is always one process for the Reader, Builder, and Analyzer roles, but usually you need a varied number of Preparer and Applier processes. By setting the MAX_SERVERS parameter to a default setting of eight times the number of CPUs, you can avoid performing a lot of initial tuning. The previous default of 9 was much too low for an application with large transactions or a lot of concurrent transactions. A higher default setting results in more PGA memory utilization.

2.6.6.2.2 Set the PRESERVE_COMMIT_ORDER Parameter

The PRESERVE_COMMIT_ORDER parameter controls the order in which transactions are applied to the standby database. When set to TRUE (the default), unrelated transactions are applied in the same order they were committed on the primary database. Dependent transactions are always committed in the correct sequence, regardless of the setting.

Follow these guidelines for setting the PRESERVE_COMMIT_ORDER parameter:

  • Set PRESERVE_COMMIT_ORDER to FALSE if you are using the logical standby database only for disaster-recovery purposes or if the reporting application using the standby database can accommodate transactions being applied out of sequence.

    Most third-party replication solutions tolerate this relaxed COMMIT processing. For OLTP applications, setting the parameter to FALSE can potentially double the SQL Apply rates. MAA testing has shown that OLTP workload performance improves by 40% or greater when PRESERVE_COMMIT_ORDER is set to FALSE.

  • Set PRESERVE_COMMIT_ORDER to TRUE for a reporting or decision-support system.

    However, if the standby database has fallen behind the primary database, then you can temporarily set the PRESERVE_COMMIT_ORDER parameter to FALSE when you want the logical standby database to quickly catch up to the primary database. For example:

    SQL> EXECUTE DBMS_LOGSTDBY.APPLY_SET('PRESERVE_COMMIT_ORDER', 'FALSE');
    

    Reset the parameter to TRUE after the gap has been resolved.

See Also:

The MAA white paper "SQL Apply Best Practices: Oracle Data Guard 11g Release 1" for examples that show the effect of setting the PRESERVE_COMMIT_ORDER parameter for different situations

2.6.6.2.3 Skip SQL Apply for Unnecessary Objects

Use the DBMS_LOGSTDBY.SKIP procedure to skip database objects that do not require replication to the standby database, as shown in the sketch that follows. Skipping such objects reduces the processing of SQL Apply.
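
The following is a hedged sketch that skips DML for a hypothetical SCOTT.EMP table; SQL Apply must be stopped before the skip rule is registered:

SQL> ALTER DATABASE STOP LOGICAL STANDBY APPLY;
SQL> EXECUTE DBMS_LOGSTDBY.SKIP(stmt => 'DML', schema_name => 'SCOTT', object_name => 'EMP');
SQL> ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;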

2.6.7 Role Transition Best Practices

With proper planning and execution, Data Guard role transitions can effectively minimize downtime and ensure that the database environment is restored with minimal impact on the business. Whether using physical standby or logical standby databases, MAA testing has determined that switchover and failover times with Oracle Data Guard 11g have been reduced to seconds. This section describes best practices for both switchover and failover.

2.6.7.1 Switchovers

A database switchover performed by Oracle Data Guard is a planned transition that includes a series of steps to switch roles between a standby database and a primary database. Following a successful switchover operation, the standby database assumes the primary role and the primary database becomes a standby database. Switchovers are typically completed in only seconds to minutes.

Data Guard enables you to change these roles dynamically by using Oracle Enterprise Manager, the broker's DGMGRL command-line interface, or a series of SQL*Plus statements.

See Also:

Oracle Data Guard Broker for information about using Enterprise Manager or the broker's DGMGRL command-line interface to perform database switchover

2.6.7.1.1 Switchover Best Practices

To optimize switchover processing, perform the following best practices before performing a switchover:

  • Disconnect as many sessions as possible using the ALTER SYSTEM KILL SESSION statement.

  • Stop job processing by setting the AQ_TM_PROCESSES parameter to 0.

  • Cancel any specified apply delay by using the NODELAY keyword to stop and restart log apply services on the standby database.

    • On a physical standby database:

      ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
      ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE NODELAY;
      
    • On a logical standby database:

      ALTER DATABASE STOP LOGICAL STANDBY APPLY;
      ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE NODELAY;
      

    You can view the current delay setting on the primary database by querying the DELAY_MINS column of the V$ARCHIVE_DEST view.

  • For logical standby databases:

    • Determine an optimal SQL Apply rate using the "SQL Apply Best Practices" white paper at http://www.otn.oracle.com/goto/maa.

    • When performing a switchover using SQL*Plus statements, you should first build the LogMiner Data Dictionary by issuing the ALTER DATABASE PREPARE TO SWITCHOVER SQL*Plus statement.

    • Verify that the LogMiner Data Dictionary was received by the primary database by querying the SWITCHOVER_STATUS column of the V$DATABASE fixed view on the primary database. When the query returns the TO LOGICAL STANDBY value, you can proceed with the switchover. See the discussion about "Switchovers Involving a Logical Standby Database" in Oracle Data Guard Concepts and Administration.

  • For physical standby databases in an Oracle RAC environment, ensure there is only one instance active for each primary and standby database. (A switchover involving a logical standby database does not require that only one instance is active.)

  • Configure the standby database to use real-time apply and, if possible, ensure the databases are synchronized before the switchover operation to optimize switchover processing.

    For the fastest switchover, use real-time apply so that redo data is applied to the standby database as soon as it is received, and the standby database is synchronized with the primary database before the switchover operation to minimize switchover time. To enable real-time apply:

    • For a physical standby database, use the following SQL*Plus statement:

      ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT USING CURRENT LOGFILE;
      
    • For a logical standby database, use the following SQL*Plus statement:

      ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
      
  • Enable Flashback Database so that if a failure occurs during the switchover, the process can be easily reversed.

  • For a physical standby database, reduce the number of archiver (ARCn) processes to the minimum needed for both remote and local archiving. Additional archiver processes can take additional time to shut down, thereby increasing the overall time it takes to perform a switchover. After the switchover has completed you can reenable the additional archiver processes.

  • Set the LOG_FILE_NAME_CONVERT initialization parameter to any valid value for the environment or, if it is not needed, set the parameter to null.

    As part of a switchover, the standby database must clear the online redo log files on the standby database before opening as a primary database. The time needed to complete the I/O can significantly increase the overall switchover time. By setting the LOG_FILE_NAME_CONVERT parameter, the standby database can pre-create the online redo logs the first time the MRP process is started. You can also pre-create empty online redo logs by issuing the SQL*Plus ALTER DATABASE CLEAR LOGFILE statement on the standby database.
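
For reference, when the broker manages the configuration, the switchover itself reduces to a single DGMGRL command (the standby database name shown is hypothetical):

DGMGRL> SWITCHOVER TO 'SALES_stby';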

See Also:

The MAA white paper "Switchover and Failover Best Practices" at http://www.otn.oracle.com/goto/maa

2.6.7.2 Failovers

A failover is typically used only when the primary database becomes unavailable, and there is no possibility of restoring it to service within a reasonable period. During a failover the primary database is taken offline at one site and a standby database is brought online as the primary database.

With Data Guard, the failover process can be completely automated using fast-start failover, or it can be a manual, user-driven process. Oracle recommends using fast-start failover to eliminate the uncertainty inherent in a process that requires manual intervention. Fast-start failover automatically executes a failover within seconds of an outage being detected.

2.6.7.2.1 Comparing Fast-Start Failover and Manual Failover

There are two distinct types of failover: manual failover and fast-start failover. An administrator initiates manual failover when the primary database fails. In contrast, Data Guard automatically initiates a fast-start failover without human intervention after the primary database has been unavailable for a set period (fast-start failover threshold).

Table 2-6 contrasts the characteristics of fast-start failover and manual failover.

Table 2-6 Comparing Fast-Start Failover and Manual Failover

Points of Comparison Fast-Start Failover Manual Failover

Benefits

Allows you to increase availability with less need for manual intervention, thereby reducing management costs.

Gives you control over exactly when a failover occurs and to which target standby database.

Failover triggers

The following conditions automatically trigger a fast-start failover:

  • Database instance failure (or last instance failure in an Oracle RAC configuration).

  • Shutdown abort (or a shutdown abort of the last instance in an Oracle RAC configuration).

  • Specific conditions that are detected through the database health-check mechanism (for example, data files taken offline due to I/O errors).

    Fast-start failover can be enabled for these conditions (using the ENABLE FAST_START FAILOVER CONDITION command) and for ORA errors raised by the Oracle server when they occur.

    See Oracle Data Guard Broker for a full list of conditions.

  • Both the observer and the standby database lose their network connection to the primary database.

  • Application initiated fast-start failover using the DBMS_DG.INITIATE_FS_FAILOVER PL/SQL procedure.

A manual failover is user initiated and involves performing a series of steps to convert a standby database into a primary database. A manual failover should be performed after an unplanned outage such as:

  • Site disaster which results in the primary database becoming unavailable (all instances of an Oracle RAC primary database).

  • User errors that cannot be repaired in a timely fashion.

  • Data failures that impact the production application.

Management

Use the following tools to manage fast-start failover:

  • Oracle Enterprise Manager

  • The broker command-line interface (DGMGRL)

See Section 5.2.1.3, "How to Perform Data Guard Switchover".

Use the following tools to perform manual failovers:

  • Oracle Enterprise Manager

  • The broker command-line interface (DGMGRL)

  • SQL statements

See Section 4.2.2.3, "Best Practices for Performing Manual Failover".

Restoring the original primary database after failover

Following a fast-start failover, the broker can automatically reconfigure the original primary database as a standby database upon reconnection to the configuration (FastStartFailoverAutoReinstate), or you can delay the reconfiguration to allow diagnostics on the failed primary. Automatic reconfiguration enables Data Guard to restore disaster protection in the configuration quickly and easily, returning the database to a protected state as soon as possible.

After manual failover, you must reinstate the original primary database as a standby database to restore fault tolerance.

Restoring bystander standby databases after failover

The broker coordinates the role transition on all databases in the configuration.

Bystanders that do not require reinstatement are available as viable standby databases to the new primary. Bystanders that require reinstatement are automatically reinstated by the observer.

A benefit of using the broker is that it provides the status of bystander databases and indicates whether a database must be reinstated. Status information is not readily available when using SQL*Plus statements to manage failover.

See Section 4.3.2, "Restoring a Standby Database After a Failover".

Application failover

The broker automatically publishes FAN/AQ (Advanced Queuing) notifications after a failover. Clients that are also configured for Fast Connection Failover can use these notifications to connect to the new primary database. You can also use the DB_ROLE_CHANGE system event to help user applications locate services on the primary database. (These events are also available for manual failovers performed by the broker. See Oracle Data Guard Broker.)

To configure fast client failover so applications can be available to the business after a failover, see Section 2.9, "Configuring Fast Connection Failover". You can also configure clients for Fast Connection Failover after a manual failover.
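
The application-initiated trigger listed in the table can be invoked from PL/SQL. The following minimal sketch assumes fast-start failover is already enabled; the condition text is arbitrary free text, and a nonzero return status indicates the request was not accepted:

    SET SERVEROUTPUT ON
    DECLARE
      status BINARY_INTEGER;
    BEGIN
      -- Ask the observer to initiate a fast-start failover to the target standby.
      -- The condition string is recorded for diagnostic purposes.
      status := DBMS_DG.INITIATE_FS_FAILOVER('Application detected fatal condition');
      DBMS_OUTPUT.PUT_LINE('Initiate status: ' || status);
    END;
    /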


2.6.7.2.2 General Best Practices for Failovers

To optimize failover processing, use the following best practices:

  • Use fast-start failover

The MAA tests running Oracle Database 11g show that failovers performed using the broker and fast-start failover offer a significant improvement in availability. For a comprehensive review of Oracle Data Guard failover best practices, see the MAA white papers on the MAA Web site at http://www.otn.oracle.com/goto/maa.

  • Enable Flashback Database to reinstate the failed primary database after a failover operation has completed. Flashback Database facilitates fast point-in-time recovery, if needed.

  • Use real-time apply with Flashback Database to apply redo data to the standby database as soon as it is received, and to quickly rewind the database should user error or logical corruption be detected.

  • Consider configuring multiple standby databases to maintain data protection following a failover.

  • For logical standby databases, see the MAA white paper "SQL Apply Best Practices" to obtain an optimal SQL Apply rate.

  • For physical standby databases:

    • See the MAA white paper "Oracle Data Guard Redo Apply and Media Recovery", available on the MAA Web site at http://www.otn.oracle.com/goto/maa, to optimize media recovery for Redo Apply.

    • Go directly to the OPEN state from the MOUNTED state instead of restarting the standby database (as required in previous releases).

    • When transitioning from read-only mode to Redo Apply (recovery) mode, restart the database.

  • Set the LOG_FILE_NAME_CONVERT parameter. As part of a failover, the standby database must clear its online redo logs before opening as the primary database. The time needed to complete this I/O can add significantly to the overall failover time. If you set the LOG_FILE_NAME_CONVERT parameter, the standby database pre-creates the online redo logs the first time the managed recovery process (MRP) is started. You can also pre-create empty online redo logs by issuing the SQL*Plus ALTER DATABASE CLEAR LOGFILE statement on the standby database.
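
    For example, the following hedged sketch sets the parameter (the directory paths are hypothetical, and LOG_FILE_NAME_CONVERT is static, so the change takes effect at the next restart) and then pre-clears one online redo log group while Redo Apply is stopped:

    SQL> ALTER SYSTEM SET LOG_FILE_NAME_CONVERT='/u01/oradata/prim/','/u01/oradata/stby/' SCOPE=SPFILE;
    SQL> ALTER DATABASE CLEAR LOGFILE GROUP 1;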

2.6.7.2.3 Fast-Start Failover Best Practices

Fast-start failover automatically, quickly, and reliably fails over to a designated standby database in the event of loss of the primary database, without requiring manual intervention to execute the failover. You can use fast-start failover only in an Oracle Data Guard configuration that is managed by the broker.

The Oracle Data Guard configuration can be running in either the maximum availability or maximum performance mode with fast-start failover. When fast-start failover is enabled, the broker ensures fast-start failover is possible only when the configured data loss guarantee can be upheld. Maximum availability mode provides an automatic failover environment guaranteed to lose no data. Maximum performance mode provides an automatic failover environment guaranteed to lose no more than the amount of data (in seconds) specified by the FastStartFailoverLagLimit configuration property.
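
For example, you might cap potential data loss at 30 seconds in maximum performance mode and then enable fast-start failover using DGMGRL (the value shown is illustrative):

    DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverLagLimit = 30;
    DGMGRL> ENABLE FAST_START FAILOVER;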

Use the following fast-start failover best practices in addition to the generic best practices listed in the "General Best Practices for Failovers" section:

  • Run the fast-start failover observer process on a host that is not located in the same data center as the primary or standby database.

    Ideally, you should run the observer on a system that is equally distant from the primary and standby databases. The observer should connect to the primary and standby databases through the same network as any end-user client. If the observer cannot run at a third site, then install the observer on the same network as the application. If a third, independent location is not available, then locate the observer in the standby data center on a separate host and isolate the observer as much as possible from failures affecting the standby database. If the designated observer fails, Enterprise Manager can detect the failure and automatically restart the observer.

  • Make the observer highly available by using Oracle Enterprise Manager to detect when the observer fails and to restart it automatically. Enterprise Manager also enables you to define an alternate host on which to restart the observer.

    After the failover completes, the original primary database is automatically reinstated as a standby database when a connection to it is reestablished, if you set the FastStartFailoverAutoReinstate configuration property to TRUE.

  • Set the value of the FastStartFailoverThreshold property according to your configuration characteristics, as described in Table 2-7.

    Table 2-7 Minimum Recommended Settings for FastStartFailoverThreshold

    • Single-instance primary with low latency and a reliable network: 15 seconds

    • Single-instance primary with a high-latency network over a WAN: 30 seconds

    • Oracle RAC primary: Oracle RAC reconfiguration time + 30 seconds (see Footnote 1)

    Footnote 1: For configurations running Oracle Database software earlier than release 10.2.0.3, calculate the minimum setting using this formula: Oracle RAC misscount + reconfiguration time + 30 seconds.

Test your configuration using the settings shown in Table 2-7 to ensure that the fast-start failover threshold is neither so aggressive that it induces false failovers, nor so high that it fails to meet your failover requirements.
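
For example, for an Oracle RAC primary whose reconfiguration time you have measured at roughly 15 seconds (an assumed value), you might set the threshold as follows:

    DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 45;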

2.6.7.2.4 Manual Failover Best Practices

Perform a manual failover, which is user driven, only in case of an emergency. Initiate the failover in response to an unplanned outage such as:

  • Site disaster that results in the primary database becoming unavailable

  • User errors that cannot be repaired in a timely fashion

  • Data failures, including widespread corruption, that affect the production application

Use the following manual failover best practices in addition to the generic best practices listed in the "General Best Practices for Failovers" section:

  • Reinstate the original primary database as a standby database to restore fault tolerance to your environment. The original primary database can be quickly reinstated by using Flashback Database. See Section 4.3.2, "Restoring a Standby Database After a Failover".

  • For manual failovers that involve Oracle RAC, issue the SHUTDOWN ABORT statement on all secondary Oracle RAC instances on the standby database before performing a failover.

  • For physical standby databases, see the MAA white paper "Oracle Data Guard Redo Apply and Media Recovery" at http://www.otn.oracle.com/goto/maa.
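
When you use the broker, a manual failover and the subsequent reinstatement reduce to the following sketch (the database names boston and chicago are hypothetical, and reinstatement requires the failed primary to be restarted to the mount state first):

    DGMGRL> FAILOVER TO 'boston';
    DGMGRL> REINSTATE DATABASE 'chicago';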

2.6.8 Best Practices for Snapshot Standby Database

Beginning with Oracle Database release 11g, you can convert a physical standby database into a fully updatable standby database called a snapshot standby database.

To convert a physical standby database into a snapshot standby database, issue the SQL*Plus ALTER DATABASE CONVERT TO SNAPSHOT STANDBY statement. This command causes Oracle Data Guard to perform the following actions:

  1. Recover all available redo data

  2. Create a guaranteed restore point

  3. Activate the standby database as a primary database

  4. Open the database as a snapshot standby database

To convert the snapshot standby back to a physical standby, issue the ALTER DATABASE CONVERT TO PHYSICAL STANDBY statement. This command causes the physical standby database to be flashed back to the guaranteed restore point that was created before the ALTER DATABASE CONVERT TO SNAPSHOT STANDBY statement was issued. Then, you must perform the following actions:

  1. Restart the physical standby database

  2. Restart Redo Apply on the physical standby database
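
Expressed as SQL*Plus statements, the complete round trip might look like the following sketch. It assumes a standby that is mounted with Redo Apply running; depending on your release, an additional restart may be needed before opening the snapshot standby:

    SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
    SQL> ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
    SQL> ALTER DATABASE OPEN;

    (perform read-write testing, then revert)

    SQL> SHUTDOWN IMMEDIATE
    SQL> STARTUP MOUNT
    SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
    SQL> SHUTDOWN IMMEDIATE
    SQL> STARTUP MOUNT
    SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;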

Follow these best practices when creating and managing snapshot standby databases:

  • Automate the conversion steps using the broker to expedite the process of reverting a snapshot standby database to the physical standby database role. This is helpful because you can use a snapshot standby database for a switchover or failover only by first reverting it to a physical standby.

  • Create multiple standby databases if your business requires a fast recovery time objective (RTO).

  • Ensure the physical standby database that you convert to a snapshot standby is caught up with the primary database, or has a minimal apply lag. See Section 2.6.6.1, "Redo Apply Best Practices for Physical Standby Databases" for information about tuning media recovery.

  • Configure a flash recovery area and ensure there is sufficient I/O bandwidth available. This is necessary because snapshot standby databases use guaranteed restore points.
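
    A minimal sketch of configuring a flash recovery area follows; the destination and size are assumptions to adapt to your environment, and the size must be set before the destination:

    SQL> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE=100G SCOPE=BOTH;
    SQL> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='+FLASH' SCOPE=BOTH;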

See Also:

Oracle Data Guard Concepts and Administration for complete information about creating a snapshot standby database

2.6.9 Best Practices for Deploying Multiple Standby Databases

The Oracle Database High Availability Overview describes how a multiple standby database architecture is virtually identical to a single standby database architecture. Therefore, the configuration guidelines for implementing multiple standby databases described in this section complement the existing best practices for physical and logical standby databases.

When deploying multiple standby databases, use the following best practices:

  • Use the broker (described in Chapter 3, "Monitoring Using Oracle Grid Control") to manage your configuration and perform role transitions. However, if you choose to use SQL*Plus statements, see the MAA white paper "Multiple Standby Databases Best Practices" for best practices.

  • Use Flashback Database to quickly reinstate the original primary as the standby after a failover instead of re-creating the entire standby database from backups or from the primary database. Set the DB_FLASHBACK_RETENTION_TARGET initialization parameter to the same value on all databases in the configuration. If you are using Flashback Database for the sole purpose of reinstating databases following a failover, a DB_FLASHBACK_RETENTION_TARGET of 60 minutes is the minimum recommended value.
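
    For example, the following hedged sketch sets the minimum recommended retention target and enables Flashback Database; in Oracle Database 11g Release 1, the FLASHBACK ON statement must be issued while the database is mounted:

    SQL> ALTER SYSTEM SET DB_FLASHBACK_RETENTION_TARGET=60 SCOPE=BOTH;
    SQL> ALTER DATABASE FLASHBACK ON;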

  • Enable supplemental logging in configurations containing logical standby databases. When creating a configuration with both physical and logical standby databases, issue the ALTER DATABASE ADD SUPPLEMENTAL LOG DATA statement to enable supplemental logging in the following situations:

    • When adding a logical standby database to an existing configuration consisting of all physical standby databases, you must enable supplemental logging on all existing physical standby databases in the configuration.

    • When adding a physical standby database to an existing configuration that contains a logical standby database, you must enable supplemental logging on the physical standby database when you create it.

    Supplemental logging is enabled on the primary database when the logical standby dictionary is built. Enabling supplemental logging is a control file change, so the change is not propagated to each physical standby database. Supplemental logging is enabled automatically on a logical standby database when it is first converted from a physical standby database to a logical standby database as part of the dictionary build process. To enable supplemental logging, issue the following SQL*Plus statement when connected to a physical standby database:

    SQL> ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY, UNIQUE INDEX) COLUMNS;
    
  • If logical standby databases are not configured to perform real-time queries, then consider configuring SQL Apply to delay applying redo data to the logical standby database. By delaying the application of redo, you can minimize the need to manually reinstate the logical standby database after failing over to a physical standby database.

    To set a time delay, use the DELAY=minutes attribute of the LOG_ARCHIVE_DEST_n initialization parameter.
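
    A hedged sketch of a delayed destination follows; the service name sales_lsby and the 60-minute delay are assumptions. Note that the delay takes effect only when the standby applies redo from archived redo logs rather than using real-time apply:

    SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=sales_lsby DELAY=60 VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=sales_lsby';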

See Also:

Oracle Database High Availability Overview to learn about the benefits of using multiple standby databases and for implementation examples

2.6.10 Best Practices for Real-Time Query (Oracle Active Data Guard Option)

If you have a license for the Oracle Active Data Guard option (available starting in Oracle Database 11g Release 1), then you can open a physical standby database for read-only access while Redo Apply on the standby database continues to apply redo data received from the primary database. All queries reading from the physical standby database execute in real time and return current results, providing more efficient use of system resources without compromising data protection or extending recovery time in the event a failover is required. Hence, this capability is referred to as real-time query.

This section summarizes the best practices for deploying real-time query:

  • Use the generic best practices described previously in this chapter.

  • Tune the network following the guidelines in Section 2.6.5, "Redo Transport Services Best Practices".

  • Use real-time apply on the standby database so that changes are applied as soon as the redo data is received.

  • For configurations running Oracle Database 11g release 11.1.0.6, shut down the standby instance and Redo Apply cleanly so that upon restart you can open the standby database directly in read-only mode.

  • Configure clients for efficient failover:

    • Connect both the primary applications and the reporting applications using an Oracle Net alias that contains all hosts (both primary and standby) in the ADDRESS_LIST.

    • Connect primary applications and reporting applications with role-specific services.

      Reporting applications that connect to the standby database, and primary applications that connect to the primary database, must connect to the database with the correct database role. A sketch of creating such services with DBMS_SERVICE follows the trigger example below.

    • Start and stop services based on the database role.

      For example, the following automates starting and stopping the service using an after startup trigger that checks the database role and starts the appropriate service:

      CREATE OR REPLACE TRIGGER manage_service
        AFTER STARTUP ON DATABASE
      DECLARE
        role VARCHAR2(30);
      BEGIN
        SELECT DATABASE_ROLE INTO role FROM V$DATABASE;
        IF role = 'PRIMARY' THEN
          -- Primary role: start the read/write service
          DBMS_SERVICE.START_SERVICE('sales_rw');
        ELSE
          -- Standby role: start the read-only reporting service
          DBMS_SERVICE.START_SERVICE('sales_ro');
        END IF;
      END;
      /
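
      The sales_rw and sales_ro services used by the trigger must exist before they can be started. A minimal sketch of creating them on the primary database with DBMS_SERVICE (the service and network names are carried over from the trigger example):

      SQL> EXECUTE DBMS_SERVICE.CREATE_SERVICE(service_name => 'sales_rw', network_name => 'sales_rw');
      SQL> EXECUTE DBMS_SERVICE.CREATE_SERVICE(service_name => 'sales_ro', network_name => 'sales_ro');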
      
  • Monitor standby performance by using Standby Statspack.

    See support note 454848.1 at http://support.oracle.com/ for complete details about installing and using Standby Statspack.

  • Query the CURRENT_SCN column of the V$DATABASE view to monitor how far Redo Apply on the standby database lags behind the primary database. See Section 2.6.10.2, "Monitoring Real-Time Query" for more information.

See Also:

The MAA white paper "Oracle Active Data Guard: Oracle Data Guard 11g Release 1" at
http://www.otn.oracle.com/goto/maa

2.6.10.1 Enabling Real-Time Query On a Consistent Standby Database

The following instructions describe how to enable real-time query on a standby database that is transactionally consistent with the primary database.

Note:

For an inconsistent standby database, you must make the standby consistent with the primary database before opening the standby for read-only access. A step-by-step description is provided in the "Oracle Active Data Guard 11g Release 1" MAA white paper.

If the standby database instance and Redo Apply have been cleanly shut down, then you can open a standby database directly to the read-only state. When open, you can start Redo Apply to enable real-time query.

Perform the following steps to enable a consistent physical standby database:

  1. Start the standby instance read-only.

    SQL> STARTUP
    

    It is unnecessary to include the READ ONLY keywords (the default startup state) on the STARTUP command because they are applied implicitly when a standby database is opened.

    Note: If the standby database is an Oracle RAC database, then you must first open one standby instance read-only and start Redo Apply before opening additional standby instances in read-only mode.

  2. After the database is open, start Redo Apply:

    SQL> RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
    

    If you are using the broker to manage the Data Guard configuration, then the default state for the standby database is APPLY ON and Redo Apply starts automatically. If this default has been changed, then start Redo Apply by issuing the following command:

    DGMGRL> EDIT DATABASE 'RTQ' SET STATE='APPLY-ON'
    

    If the standby database is mounted and Redo Apply is running, stop Redo Apply, open the standby database for read-only access, and restart Redo Apply.
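
    Expressed as SQL*Plus statements, that sequence might look like the following sketch:

    SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
    SQL> ALTER DATABASE OPEN READ ONLY;
    SQL> RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;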

2.6.10.2 Monitoring Real-Time Query

Real-time query allows you to query data as it is being applied to a standby database while guaranteeing a transactionally consistent view of the data. Oracle provides a read-consistent view of the data through a query SCN. The query SCN on the standby database is advanced by the recovery process after all dependent changes have been applied. As it is advanced, the new query SCN is propagated to all instances in an Oracle RAC standby. Once published to all standby instances, the query SCN is exposed to the user via the CURRENT_SCN column of the V$DATABASE view on the standby database.

The query SCN on the standby database is equivalent to the CURRENT_SCN on the primary database. This allows applications connected to the standby database instances to use the query SCN as a snapshot of where the data is in relation to the primary database. Queries on the primary and standby databases return identical results as of a particular CURRENT_SCN. To determine how far the query results on the standby database lag behind the primary database, compare the CURRENT_SCN column on the primary database to the CURRENT_SCN on the standby database. For example, in a configuration with a database link on the primary database that points to a standby database called RTQ_STBY, issue the following query on the primary database:

SQL> SELECT SCN_TO_TIMESTAMP((SELECT CURRENT_SCN FROM V$DATABASE)) -
            SCN_TO_TIMESTAMP((SELECT CURRENT_SCN FROM V$DATABASE@RTQ_STBY))
     FROM DUAL;

The value returned from the query indicates the number of seconds that data on the standby database lags behind the current position of the primary database. To determine an approximate query lag on the standby database without connecting to, or using, the primary database, use the apply lag metric from the V$DATAGUARD_STATS view. Note that the apply lag metric is valid only for the time at which it was computed, not the moment the query against the V$DATAGUARD_STATS view was issued. Use the TIME_COMPUTED column to determine when the apply lag was last computed.
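
For example, the following query issued on the standby database returns the computed apply lag and the time it was computed:

SQL> SELECT NAME, VALUE, TIME_COMPUTED FROM V$DATAGUARD_STATS WHERE NAME = 'apply lag';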

See Also:

The MAA white paper "Oracle Active Data Guard: Oracle Data Guard 11g Release 1" at

http://www.otn.oracle.com/goto/maa

2.6.11 Recommendations for Protecting Data Outside of the Database

In a highly available environment, you must protect nondatabase files along with database files. Oracle Secure Backup provides data protection for heterogeneous UNIX, Linux, Windows, and Network Attached Storage (NAS) environments. Additionally, for disaster recovery purposes, some third-party tools enable remote synchronization between a set of local and remote files. For example, you can use tools such as rsync, csync2, and DRBD for remote synchronization. These tools are available for download on the internet. The following list provides recommendations regarding these tools:

  • For software updates, use rsync to synchronize the standby system with the changes made to software on the primary system.

  • For configuration files, use rsync daily or after a change, or use csync2.

  • For important log files, trace files, or debugging files, use rsync daily or hourly, or use DRBD to synchronize the entire file system. However, do not overwrite the existing standby log or trace directories.

  • For transaction logs or metadata files that must be synchronized with the database, use rsync or csync2 frequently, or use a block synchronization tool such as DRBD, a third-party mirroring utility, or another remote synchronization tool.

2.6.12 Assess Data Guard Performance

To accurately assess the primary database performance after adding Data Guard standby databases, obtain a history of statistics from the V$SYSMETRIC_SUMMARY view or Automatic Workload Repository (AWR) snapshots before and after deploying Oracle Data Guard with the same application profile and load.

To assess the application profile, compare the following statistics:

  • Physical reads per transaction

  • Physical writes per transaction

  • CPU usage per transaction

  • Redo generated per transaction

To assess the application performance, compare the following statistics:

  • Redo generated per second or redo rate

  • User commits per second or transactions per second

  • Database time per second

  • Response time per transaction

  • SQL service response time
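
For example, a hedged sketch of retrieving several of these metrics from V$SYSMETRIC_SUMMARY for a before-and-after comparison (metric names may vary slightly by release):

    SQL> SELECT METRIC_NAME, AVERAGE, MAXVAL
         FROM V$SYSMETRIC_SUMMARY
         WHERE METRIC_NAME IN ('Redo Generated Per Sec',
                               'User Commits Per Sec',
                               'Database Time Per Sec',
                               'Response Time Per Txn');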

If the application profile has changed between the two scenarios, then the comparison is not valid. Repeat the test, or tune the database or system using the general principles outlined in the Oracle Database Performance Tuning Guide.

If the application profile is similar and you observe application performance changes on the primary database because of a decrease in throughput or an increase in response time, then assess these common problem areas:

  • CPU utilization

    If you are experiencing high load (CPU usage over 90 percent, or paging and swapping), then tune the system before proceeding with Data Guard. Use the V$OSSTAT view or the V$SYSMETRIC_HISTORY view to monitor system usage statistics from the operating system.

  • Higher I/O wait events

    If you are experiencing higher I/O waits from the log writer or database writer processes, then the slower I/O affects throughput and response time. To observe the I/O effects, look at the historical data for the following wait events:

    • Log file parallel writes

    • Log file sequential reads

    • Log file parallel reads

    • Data file parallel writes

    • Data file sequential reads
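
    A hedged way to review the cumulative history for these waits since instance startup follows; the V$ event names differ slightly from the display names above, and AWR provides the per-interval history:

    SQL> SELECT EVENT, TOTAL_WAITS, TIME_WAITED_MICRO/1000000 AS SECONDS_WAITED
         FROM V$SYSTEM_EVENT
         WHERE EVENT IN ('log file parallel write', 'log file sequential read',
                         'db file parallel write', 'db file sequential read');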

With SYNC transport, commits take more time because the redo data must be available on the standby database before foreground processes receive an acknowledgment from the log writer (LGWR) background process that the commit has completed. An LGWR process commit includes the following wait events:

  • Log File Parallel Write (local write for the LGWR process)

  • LGWR wait on SENDREQ

    This wait event includes:

    • Time to put the packet into the network

    • Time to send the packet to the standby database

    • RFS write or standby write to the standby redo log, which includes the RFS I/O wait event plus additional overhead for checksums

    • Time to send a network acknowledgment back to the primary database (for example, single trip latency time)

Longer commit times for the LGWR process can cause longer response time and lower throughput, especially for small time-sensitive transactions. However, you may obtain sufficient gains by tuning the log writer local write (Log File Parallel Write wait event) or the different components that comprise the LGWR wait on SENDREQ wait event.

To tune the disk write I/O (Log File Parallel Write or the RFS I/O), add more spindles or increase the I/O bandwidth.

To reduce the network time:

  • Tune the Oracle Net send and receive buffer sizes

  • Set SDU=32K

  • Increase the network bandwidth if there is saturation

  • Possibly find a closer site to reduce the network latency
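
For illustration, an Oracle Net alias incorporating these settings might look like the following sketch; the host, port, service name, and buffer sizes are assumptions to adjust to your bandwidth-delay product:

    sales_stby =
      (DESCRIPTION =
        (SDU = 32767)
        (SEND_BUF_SIZE = 4194304)
        (RECV_BUF_SIZE = 4194304)
        (ADDRESS = (PROTOCOL = TCP)(HOST = stbyhost)(PORT = 1521))
        (CONNECT_DATA = (SERVICE_NAME = sales)))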

With ASYNC transport, the LGWR process never waits for the network server processes to return before writing a COMMIT record to the current log file. However, if a network server process has fallen behind and the redo to be shipped has been flushed from the log buffer, then the network server process reads from the online redo logs. This causes more I/O contention and possibly longer wait times for log writer process writes (Log File Parallel Write). If sufficient I/O bandwidth and spindles are not allocated, then log file parallel writes and log file sequential reads increase, which may affect throughput and response time. In most cases, adding sufficient spindles reduces the I/O latency.

Note:

To enable most of the statistical gathering and advisors, ensure the STATISTICS_LEVEL initialization parameter is set to TYPICAL (recommended) or ALL.

