Oracle® Database High Availability Best Practices
11g Release 1 (11.1)
The best practices discussed in this section apply to Oracle Database 11g with Oracle Real Application Clusters (Oracle RAC). These best practices build on the Oracle Database 11g configuration best practices described in Section 2.2, "Configuring Oracle Database 11g" and Section 2.3, "Configuring Oracle Database 11g with Oracle Clusterware". These best practices are identical for the primary and standby databases if they are used with Data Guard in Oracle Database 11g with Oracle RAC and Data Guard—MAA. Some best practices may use your system resources more aggressively to reduce or eliminate downtime. This can, in turn, affect performance service levels, so be sure to assess the impact in a test environment before implementing these practices in a production environment.
Instance recovery, which is the process of recovering the redo thread from the failed instance, is a critical component affecting availability. The availability of the database during instance recovery has greatly increased over the last few major releases of the Oracle Database.
When using Oracle RAC, the SMON process in one surviving instance performs instance recovery of the failed instance. This is different from crash recovery, which occurs when all instances accessing a database have failed. In a single-instance Oracle Database, crash recovery is the only type of recovery that occurs when the instance fails.
In both Oracle RAC and single-instance environments, checkpointing is the internal mechanism used to bound Mean Time To Recover (MTTR). Checkpointing is the process of writing dirty buffers from the buffer cache to disk. With more aggressive checkpointing, less redo is required for recovery after a failure. Although the objective is the same, the parameters and metrics used to tune MTTR are different in a single-instance environment versus an Oracle RAC environment.
In a single-instance environment, you can set the FAST_START_MTTR_TARGET initialization parameter to the number of seconds that crash recovery should take. Note that crash recovery time includes the time to start up, mount, recover, and open the database.
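As a minimal sketch, the target can be set dynamically and then compared with the database's own estimate. The 60-second value is purely illustrative and is not a recommendation from this guide. SCOPE=BOTH assumes the instance was started with a server parameter file. The TARGET_MTTR and ESTIMATED_MTTR columns of V$INSTANCE_RECOVERY are reported in seconds.

```sql
-- Illustrative value only: request that crash recovery take about 60 seconds.
-- SCOPE=BOTH assumes the instance uses an spfile.
ALTER SYSTEM SET FAST_START_MTTR_TARGET=60 SCOPE=BOTH;

-- Compare the effective target with the current estimate (both in seconds).
SELECT target_mttr,
       estimated_mttr
  FROM v$instance_recovery;
```

If ESTIMATED_MTTR stays well above TARGET_MTTR under load, the target may be tighter than the I/O subsystem can sustain.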
Oracle provides several ways to help you understand the MTTR target your system is currently achieving and what your potential MTTR target could be, given the I/O capacity. See the MAA white paper "Best Practices for Optimizing Availability During Unplanned Outages Using Oracle Clusterware and Oracle Real Application Clusters" for more information.
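One way to explore the trade-off, offered as a sketch and not as the method from the white paper, is the MTTR advisory view V$MTTR_TARGET_ADVICE. This assumes that FAST_START_MTTR_TARGET is set to a nonzero value and that STATISTICS_LEVEL is TYPICAL or ALL; otherwise the view may return no useful rows.

```sql
-- Estimated I/O cost of alternative MTTR targets, relative to the current one.
-- An ESTD_TOTAL_IO_FACTOR above 1 means more total I/O than the current setting.
SELECT mttr_target_for_estimate,
       advice_status,
       estd_total_ios,
       estd_total_io_factor
  FROM v$mttr_target_advice
 ORDER BY mttr_target_for_estimate;
```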
The FAST_START_PARALLEL_ROLLBACK parameter determines how many processes are used for transaction recovery, which is done after redo application. Optimizing transaction recovery is important to ensure an efficient workload after an unplanned failure. As long as the system is not CPU bound, setting this parameter to HIGH is a best practice. This causes Oracle to use up to four times the CPU count (4 x CPU_COUNT) parallel processes for transaction recovery. The default setting for this parameter is LOW, or two times the CPU count (2 x CPU_COUNT). Set the parameter as follows:
ALTER SYSTEM SET FAST_START_PARALLEL_ROLLBACK=HIGH SCOPE=BOTH;
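To confirm the setting and to watch transaction recovery after a failure, queries along the following lines can help. This is a sketch only: it reads the local-instance V$ views, and in an Oracle RAC environment the corresponding GV$ views would be the natural choice for a cluster-wide picture.

```sql
-- Confirm the value currently in effect.
SELECT value
  FROM v$parameter
 WHERE name = 'fast_start_parallel_rollback';

-- Progress of each transaction being recovered (undo blocks done vs. total).
SELECT state,
       undoblocksdone,
       undoblockstotal
  FROM v$fast_start_transactions;

-- Recovery server processes currently performing the rollback.
SELECT state,
       undoblocksdone,
       pid
  FROM v$fast_start_servers;
```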
By employing this database configuration best practice along with the one described in Section 2.4.3, "Ensure Asynchronous I/O Is Enabled," it is possible to achieve approximately a 20% increase in total availability at the database level.
Using asynchronous I/O is a best practice that is recommended for all Oracle Databases. See the section "Set DISK_ASYNCH_IO" for guidelines.
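A quick check of the current setting might look like the following sketch. DISK_ASYNCH_IO is a static parameter, so the change shown here is written to the server parameter file (assumed to be in use) and takes effect only after the instance is restarted.

```sql
-- Inspect the current value; ISSYS_MODIFIABLE shows whether it can change at run time.
SELECT name,
       value,
       issys_modifiable
  FROM v$parameter
 WHERE name = 'disk_asynch_io';

-- Static parameter: record the change in the spfile, then restart the instance.
ALTER SYSTEM SET DISK_ASYNCH_IO=TRUE SCOPE=SPFILE;
```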
Separate dedicated channels on one fibre may be needed, or you can optionally configure Dense Wavelength Division Multiplexing (DWDM) to allow communication between the sites without using repeaters and to allow greater distances (greater than 10 km; see Footnote 6) between the sites. However, the disadvantage is that DWDM can be prohibitively expensive.
Footnote Legend
Footnote 6: For each 100 km add 1 ms latency with interconnect and I/O with 3 ms. LGWR is impacted by at least the network I/O. The sweet spot is within a metro area.