Oracle® Database High Availability Best Practices
11g Release 1 (11.1)
Oracle Clusterware is software that manages the availability of user applications and Oracle databases. Oracle Clusterware is the only clusterware needed for most platforms on which Oracle RAC operates. You can also use clusterware from other vendors if the clusterware is certified for use with Oracle RAC. However, adding unnecessary layers of software for functionality that is provided by Oracle Clusterware adds complexity and cost and can reduce system availability, especially for planned maintenance.
Oracle Clusterware includes a high availability framework that provides an infrastructure to manage any application. Oracle Clusterware ensures the applications it manages start when the system starts. Oracle Clusterware also monitors the applications to make sure they are always available. For example, if a process fails, then Oracle Clusterware attempts to restart the process based on scripts that you customize. If a node in the cluster fails, then you can program processes that normally run on the failed node to restart on another node. The monitoring frequency, starting, and stopping of the applications and the application dependencies are configurable.
See Also: Oracle Clusterware Administration and Deployment Guide for more information about managing application availability with Oracle Clusterware
Use the following configuration best practices for planned and unplanned maintenance activities. The following sections put these configuration best practices to use in an operational context:
Follow these best practices when installing different software releases of Oracle Clusterware, ASM, and the Oracle Database on your cluster:
Install an Oracle Clusterware release that is equal to or higher than the release of the database and ASM.
Maintain the same release between the database, the clusterware, and ASM if possible. Supported mixed-release environments (as described in compatibility matrices) increase administration costs and require diligent planning before applying Oracle patches and patch sets.
See Also: Your platform-specific Oracle Clusterware installation guide
Proper capacity planning is a critical success factor for all aspects of Oracle clustering technology, but it is of particular importance for planned maintenance. You must ensure that the work a cluster is responsible for can be done when a small part of the cluster (for example, a node) is unavailable. If the cluster cannot keep up after a planned or unplanned outage, the potential for cascading problems is higher due to system resource starvation.
When sizing your cluster, ensure that n percent of the cluster can meet your service levels, where n percent represents the computing resources left over after a typical planned or unplanned outage. For example, if you have a four-node cluster and you want to apply patches in a rolling fashion (one node upgraded at a time), then three nodes must be able to run the work requested by the application.
One other aspect of capacity planning that is important during planned maintenance is ensuring that any work done as part of the planned maintenance is separated from the application work when possible. For example, if a patch requires that a SQL script is run after all nodes have been patchedFoot 1, it is a best practice to run this script on the last node receiving the patch before allowing the application to start using that node. This technique ensures that the SQL script has full use of the operating system resources on the node and is less likely to affect the application.
All rolling patch features require that the software home being patched is local, not shared. The software must be physically present in a local file system on each node in the cluster, not on a shared cluster file system.
The reason for this requirement is that if a shared cluster file system is used, patching the software on one node affects all of the nodes, and would require that you shut down all components using the software on all nodes. Using a local file system allows software to be patched on one node without affecting the software on any other nodes.
Traditionally, Oracle Database patch sets have been applied in-place, which means that the new code is applied directly over the old code. There were a variety of reasons for applying patch sets in-place, such as lower space consumption and a simpler installation. However, many of these reasons are no longer valid in today's IT environment. The downside of an in-place database patch set upgrade is that the application cannot connect to the database while the new code is being copied inFoot 2. To avoid this availability impact, use a combination of Oracle cloning technology and an out-of-place patch set installation. Cloning technology allows the existing software to be copied to a new ORACLE_HOME, after which the patch set may be applied.
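As an illustrative sketch of this approach (the paths and home name here are hypothetical), the existing home can be copied and then registered with the clone.pl script shipped in the Oracle home, after which the patch set installer is run against the clone:

prompt> cp -r /u01/app/oracle/product/11.1.0/db_1 /u01/app/oracle/product/11.1.0/db_2
prompt> perl /u01/app/oracle/product/11.1.0/db_2/clone/bin/clone.pl \
        ORACLE_HOME=/u01/app/oracle/product/11.1.0/db_2 ORACLE_HOME_NAME=OraDb11g_home2

Applications continue to use the original home while the new home is patched; the database is switched to the new home during a short maintenance window.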
An out-of-place patch set installation with cloning has the following advantages:
Applications remain available while software is upgraded in the new ORACLE_HOME.
The configuration inside the ORACLE_HOME is retained because the cloning procedure involves physically copying the softwareFoot 3.
The one disadvantage of an out-of-place patch set installation with cloning is that you must change any $ORACLE_HOME environment variable hard-coded in application code and Oracle-specific scripts.
If application availability is more important to you than changing customizations, consider performing an out-of-place patch set installation with cloning.
Note: Oracle offers other solutions, such as SQL Apply Rolling Upgrade and Oracle Streams, to reduce downtime to seconds during upgrades. An out-of-place patch set installation provides benefits that you can obtain without the extra steps and potential limitations associated with these features. For example, SQL Apply rolling upgrade and Oracle Streams have datatype restrictions that may prevent their use.
The ability to migrate client connections to and from the nodes on which you are working is a critical aspect of planned maintenance. Migrating client connections should always be the first step in any planned maintenance activity requiring software shutdown (for example, when performing a rolling upgrade). The potential for problems increases if there are still active database connections when the software shutdown commences.
Oracle provides services, FAN, FAN-integrated clients, client-side load balancing, Fast Connection Failover, and runtime connection load balancing to achieve this objective. Detailed information about client failover best practices in an Oracle RAC environment is available in the "Workload Management with Oracle Real Application Clusters" white paper on the Oracle Technology Network at
An example of a best-practice process for client redirection during planned maintenance is as follows (note: the following example is specific to FAN ONSFoot 4):
FAN ONS-integrated clients are properly configured with runtime connection load balancing and Fast Connection Failover.
Oracle Clusterware stops services on the instance to be brought down, or relocates services to an alternate instance.
Oracle Clusterware posts a FAN event.
FAN ONS-integrated clients receive the event and move connections to other instances offering the service.
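For illustration, assuming a database named prod with a service named sales and instances prod1 and prod2 (all names hypothetical), the service can be stopped or relocated with SRVCTL before the instance is shut down:

prompt> srvctl stop service -d prod -s sales -i prod1
prompt> srvctl relocate service -d prod -s sales -i prod1 -t prod2

Stopping or relocating the service is what triggers the FAN event that the integrated clients act upon.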
While it is technically feasible to run ASM and Oracle Database instances out of the same ORACLE_HOME, it is not a preferred configuration. You should create an ORACLE_HOME for the database and a separate ASM home for ASM instances to enable more flexibility during patches and upgrades. For example, you want to avoid having to stop your volume manager (ASM) to apply a patch that fixes code used exclusively by the database. Doing so would require that you shut down all of the databases, including the ones not using the patched code.
If you have many Oracle homes, then managing the listener or listeners in use can be a confusing and error-prone task. When multiple listener versions are available, run the listener from the home with the latest version. If you typically update your ASM software before your database software, then running the listener out of the ASM home simplifies the manageability of your network configuration and proactively avoids potential bugs in older listener code.
For cases where a service has only one preferred instance, ensure that the service is started immediately on an available instance after it is brought down on its preferred instance. Starting the service immediately ensures that affected clients can instantly reconnect and continue working. Oracle Clusterware handles this responsibility, and it is of utmost importance during unplanned outages.

Even though you can rely on Oracle Clusterware to start the service during planned maintenance as well, it is safer to ensure that the service is available on an alternate instance by manually starting it on an alternate preferred instance ahead of time. Manually starting an alternate instance eliminates the single point of failure of a single preferred instance, and you have the luxury of doing this because it is a planned activity. Add at least a second preferred instance to the service definition and start the service before the planned maintenance. You can then stop the service on the instance where maintenance is being performed with the assurance that another service member is available. Adding one or more preferred instances does not have to be a permanent change; you can revert to the original service definition after performing the planned maintenance.

Manually relocating a service rather than changing the service profile is advantageous in cases such as the following:
If you are using Oracle XA, then use manual service relocation because running a service on multiple instances is not supported.
If an application is not designed to work properly with multiple service members, then application errors or performance issues can arise.
As with all configuration changes, you should test the effect of a service with multiple members to assess its viability and impact in a test environment before implementing the change in your production environment.
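Continuing the hypothetical prod/sales example, a second preferred instance can be added with SRVCTL before maintenance and the change reverted afterward:

prompt> srvctl modify service -d prod -s sales -n -i prod1,prod2
prompt> srvctl start service -d prod -s sales -i prod2
(perform maintenance on prod1)
prompt> srvctl modify service -d prod -s sales -n -i prod1 -a prod2

The -n option applies the new preferred (-i) and available (-a) instance lists to the service definition.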
With Oracle Database 11g, application workloads can be defined as services so that they can be individually managed and controlled. DBAs control which processing resources are allocated to each service, both during normal operations and in response to failures. Performance metrics are tracked by service, and thresholds are set to automatically generate alerts should these thresholds be crossed. CPU resource allocations and resource consumption controls are managed for services using Database Resource Manager. Oracle tools and facilities such as Job Scheduler, Parallel Query, and Oracle Streams Advanced Queuing also use services to manage their workloads.
With Oracle Database 11g, you can define rules to automatically allocate processing resources to services. Oracle RAC instances in Oracle Database 11g can be allocated to process individual services or multiple services, as needed. These allocation rules can be modified dynamically to meet changing business needs. For example, you could modify these rules at the end of a quarter to ensure that there are enough processing resources to complete critical financial functions on time. You can also define rules so that when instances running critical services fail, the workload is automatically shifted to instances running less critical workloads. You can create and administer services with Oracle Enterprise Manager and the DBMS_SERVICE PL/SQL package.
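A minimal sketch of creating and starting a service with the PL/SQL package (the service and network names here are hypothetical):

SQL> EXECUTE DBMS_SERVICE.CREATE_SERVICE(service_name => 'sales', network_name => 'sales');
SQL> EXECUTE DBMS_SERVICE.START_SERVICE('sales');

In an Oracle RAC environment, services are usually created with SRVCTL or Oracle Enterprise Manager instead, so that Oracle Clusterware manages their availability.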
You should make application connections to the database through a Virtual Internet Protocol (VIP) address to a service defined as part of the workload management facility to achieve the greatest degree of availability and manageability.
A VIP address is an alternate public address that client connections use instead of the standard public IP address. If a node fails, then the node's VIP address fails over to another node, but there is no listener listening on that VIP, so a client that attempts to connect to the VIP address receives a connection refused error (ORA-12541) instead of waiting for long TCP connect timeout messages. This error causes the client to quickly move on to the next address in the address list and establish a valid database connectionFoot 5. VIP addresses are configured using the Virtual Internet Protocol Configuration Assistant (VIPCA).
See Also: Oracle Real Application Clusters Administration and Deployment Guide for more information about workload management
Client-side load balancing evenly spreads connection requests across all listeners. It is defined in your client connection definition by setting the parameter LOAD_BALANCE=ON. (The default is ON for DESCRIPTION_LISTs.) When this parameter is set to ON, Oracle Database randomly selects an address in the address list and connects to that node's listener. This balances the number of client connections across the available listeners in the cluster. When the listener receives the connection request, it connects the user to an instance that it knows provides the requested service. To see which services a listener supports, run the LSNRCTL services command.
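A typical client connection definition using client-side load balancing across VIP addresses might look like the following (host and service names are hypothetical):

sales =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521)))
    (CONNECT_DATA = (SERVICE_NAME = sales)))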
Server-side load balancing uses the current workload being run on the available instances for the database service requested during a connection request, and directs the connection request to the least loaded instance on the least loaded node. Server-side connection load balancing requires each instance to register with all available listeners, which is accomplished by setting the REMOTE_LISTENER parameters for each instance. These parameters are set by default when creating a database with DBCA.
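As a sketch (alias and host names hypothetical), each instance can register with the other nodes' listeners through a tnsnames.ora alias referenced by the REMOTE_LISTENER parameter:

LISTENERS_PROD =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521)))

SQL> ALTER SYSTEM SET REMOTE_LISTENER='LISTENERS_PROD' SCOPE=BOTH SID='*';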
To further enhance connection load balancing, use the load balancing advisor and define the connection load balancing goal for each service by setting the CLB_GOAL attribute with the DBMS_SERVICE PL/SQL package:
When using connection pools without FAN integration, set CLB_GOAL=LONG. The CLB_GOAL=SHORT attribute setting is required for the Runtime Connection Load Balancing feature of ICC and UCP (Universal Connection Pool for Java), which use metrics from the database to properly distribute work among instances offering the application service.
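For example, the goal can be set on a hypothetical sales service as follows:

SQL> EXECUTE DBMS_SERVICE.MODIFY_SERVICE(service_name => 'sales', clb_goal => DBMS_SERVICE.CLB_GOAL_SHORT);

Use the DBMS_SERVICE.CLB_GOAL_LONG constant instead for applications with long-lived connections that do not use runtime connection load balancing.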
See Also:
Oracle Real Application Clusters Administration and Deployment Guide for more information about workload management
Oracle Database Net Services Administrator's Guide for more information about configuring listeners
The OCR contains important configuration data about cluster resources. Always protect the OCR by using the ability of Oracle Clusterware to mirror the OCR. Oracle Clusterware automatically manages two OCRs when you mirror the OCR.
The voting disk must reside on a shared disk. For high availability, Oracle recommends that you have multiple voting disks on multiple storage devices across different controllers, where possible. Oracle Clusterware enables multiple voting disks, but you must have an odd number of voting disks, such as three, five, and so on. If you define a single voting disk, then you should use external redundant storage to provide redundancy.
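As a sketch (the device path is hypothetical), voting disks can be listed and added with the CRSCTL utility:

prompt> crsctl query css votedisk
prompt> crsctl add css votedisk /dev/raw/raw3

Consult your platform-specific documentation for whether the cluster stack must be shut down (and the -force option used) when modifying voting disks.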
See Also: Oracle Real Application Clusters Administration and Deployment Guide for more information about managing OCR and voting disks
Oracle Clusterware automatically creates OCR backups every four hours on one node in the cluster, the OCR master node. Oracle always retains the last three backup copies of the OCR. The CRSD process that creates the backups also creates and retains an OCR backup for each full day and at the end of each week. Use Oracle Secure Backup, standard operating-system tools, or third-party tools to back up the backup files created by Oracle Clusterware as part of the operating system backup.
Note: The default location for generating OCR backups on UNIX-based systems is CRS_home/cdata/cluster_name, where cluster_name is the name of your cluster. The Windows-based default location for generating backups uses the same path structure. Backups are taken on the OCR master node. To list the node and the location of the backups, issue the ocrconfig -showbackup command.
In addition to using the automatically created OCR backup files, you can use the -manualbackup option of the ocrconfig command to perform a manual backup on demand. For example, you can perform a manual backup before and after you make changes to the OCR, such as adding or deleting nodes from your environment, modifying Oracle Clusterware resources, or creating a database. The ocrconfig -manualbackup command exports the OCR content to a file. You can then back up the export files created by ocrconfig as part of the operating system backup using Oracle Secure Backup, standard operating-system tools, or third-party tools.
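For example:

prompt> ocrconfig -manualbackup
prompt> ocrconfig -showbackup

The first command takes an on-demand backup of the OCR; the second lists the node and location of the automatic and manual backups.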
See Also: Oracle Clusterware Administration and Deployment Guide for more information about backing up the OCR
For the most efficient network detection and failover, Oracle Clusterware and Oracle RAC should use the same interconnect subnet so that they share the same view of connections and accessibility. Perform the following steps to verify the interconnect subnet:
To verify the interconnect subnet used by Oracle RAC, either check the instance startup section in the alert log of an instance of an existing Oracle RAC database, or run the Oracle ORADEBUG utility on one instance. For example:
SQL> ORADEBUG SETMYPID
Statement processed.
SQL> ORADEBUG IPC
Information written to trace file.
SQL> ORADEBUG tracefile_name
/u01/app/oracle/admin/prod/udump/prod1_ora_24409.trc
In the trace file, examine the SSKGXPT section to determine the subnet used by Oracle RAC. In this example, the subnet in use is 192.168.0.3 and the protocol used is UDP:
SSKGXPT 0xd7be26c flags info for network 0
socket no 7 IP 192.168.0.3 UDP 14727
To verify the interconnect subnet used by the clusterware, examine the value of the keyname SYSTEM.css.node_numbers.node_name.privatename in the OCR:
prompt> ocrdump -stdout -keyname SYSTEM.css.node_numbers
[SYSTEM.css.node_numbers.node1.privatename]
ORATEXT : halinux03ic0
. . .
[SYSTEM.css.node_numbers.node2.privatename]
ORATEXT : halinux04ic0
Use operating system tools to verify that the hostnames (halinux03ic0 and halinux04ic0 in this example) resolve to addresses in the subnet shown in the trace file produced by ORADEBUG (subnet 192.168.0.3). The following example is performed on Linux:
prompt> getent hosts halinux03ic0
192.168.0.3 halinux03ic0.us.oracle.com halinux03ic0
Use Oracle Clusterware to make any application, including a single-instance database, highly available in a cold failover cluster. You can find examples of using Oracle Clusterware to make applications highly available on the Oracle Technology Network at
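As an illustrative sketch (the resource name, node names, and action script path are all hypothetical), an application can be registered with Oracle Clusterware by creating an application profile that names an action script implementing start, stop, and check entry points, then registering and starting the resource:

prompt> crs_profile -create myapp -t application -a /opt/app/myapp.scr -h "node1 node2" -p favored
prompt> crs_register myapp
prompt> crs_start myapp

Oracle Clusterware then monitors the application through the action script's check entry point and restarts or fails it over according to the profile.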
Footnote Legend
Footnote 1: An example of this is the CATCPU.SQL script that must be run after installing the CPU patch on all nodes.
Footnote 2: Use the DBMS_SERVICE package to remove sessions from instances that are being worked on.
Footnote 5: A no listener error message is returned to the clients. The clients traverse to the next address in the address list that has a non-failed-over VIP with a listener running on it.