10
High Availability Concepts and Best Practices in Real Application Clusters

This chapter describes the concepts and some of the best practices for implementing high availability in Real Application Clusters. The topics in this chapter are:

Understanding High Availability

Computing environments configured to provide nearly full-time availability are known as high availability systems. Such systems typically have redundant hardware and software that makes the system available despite failures. Well-designed high availability systems do not have single points-of-failure. Any hardware or software component that can fail has a redundant component of the same type.

When failures occur, the failover process moves processing performed by the failed component to the backup component. This process remasters system-wide resources, recovers partial or failed transactions, and restores the system to normal, preferably within a matter of microseconds. The more transparent that failover is to users, the higher the availability of the system.

Oracle offers several products and features that provide high availability. These include Real Application Clusters, Oracle Real Application Clusters Guard I, Oracle Real Application Clusters Guard II, Oracle Replication, and Oracle9i Data Guard. You can use these products in various combinations to meet your specific high availability needs. Real Application Clusters systems are inherently high availability environments that can provide continuous service for both planned and unplanned outages. Real Application Clusters Guard II provides continuous service despite unplanned failures and for online maintenance operations.

See Also:

Oracle9i Real Application Clusters Real Application Clusters Guard I - Concepts and Administration and Oracle9i Real Application Clusters Guard II Concepts, Installation, and Administration for more information about these features

Configuring Real Application Clusters for High Availability

Real Application Clusters builds higher levels of availability on top of the standard Oracle features. All single instance high availability features such as Fast-Start Recovery and online reorganizations also apply to Real Application Clusters. Fast-Start Recovery can greatly reduce the mean time to recovery (MTTR) with minimal effects on online application performance. Online reorganizations reduce the duration of planned downtime and you can perform many operations online while users update the underlying objects.

In addition to these features, Real Application Clusters exploits the redundancy provided by clustering to deliver availability with n-1 node failures in an n-node cluster. In other words, all users have access to all data as long as there is one available node in the cluster. To configure Real Application Clusters for high availability, consider the hardware and software components of your cluster as described the following section.

Cluster Components and High Availability

This section describes high availability and cluster components in the following sections:

Cluster Nodes
Cluster Interconnects
Database Software

See Also:
Chapter 2 for more information about these components

Cluster Nodes

Real Application Clusters environments are fully redundant because all nodes access all the database. The failure of one node does not affect another node's ability to process transactions. As long as the cluster has one surviving node, all database clients can process all transactions, although the clients may be subject to increased response times due to capacity constraints on the surviving node.

Cluster Interconnects

Interconnect redundancy is often overlooked in clusters. This is because the mean time to failure (MTTF) is generally several years. Therefore, cluster interconnect redundancy might not be a high priority. Also, depending on the system and sophistication level, a redundant cluster interconnect could be cost prohibitive.

However, a redundant cluster interconnect is an important aspect of a fully redundant cluster. Without this, a system is not truly free of single points-of-failure. Cluster interconnects can fail for a variety of reasons and you cannot prevent all of them.

Database Software

In Real Application Clusters, Oracle executables are installed on either the cluster file system or on the local disks of each node; and at least one instance runs on each node of a cluster. Note that if your platform supports a cluster file system (CFS) and you use it, then only one copy of the Oracle Real Application Clusters software will be installed. All instances have equal access to all data and can process any transactions. In this way, Real Application Clusters ensure full database software redundancy.

Disaster Planning

Real Application Clusters is primarily a single site, high availability solution. This means the nodes in the cluster generally exist within the same building, if not the same room. Therefore, disaster planning can be critical. Depending on how mission critical your system is, and the potential exposure of your system's location for such disasters, disaster planning can be an important high availability component.

Oracle offers other solutions such as Oracle9i Data Guard to facilitate more comprehensive disaster recovery planning. You can use these solutions with Real Application Clusters where one cluster hosts the primary database and another remote system or cluster hosts the disaster recovery database. However, Real Application Clusters are not required on either site for disaster recovery.

See Also :

Oracle9i Data Guard Concepts and Administration for more information about Data Guard

Failure Protection Validation

Once you have carefully considered your system level issues, validate that your Real Application Clusters environment optimally protects against failures. Use the following list of failure points to plan and troubleshoot your failure protection system:

Cluster component
CPU
Memory
Interconnect software
Operating System
Cluster Manager
Oracle database instance media
Corrupt or lost control file, log file, or datafile
Dropped or deleted database object
Human error

Real Application Clusters environments protect against cluster component failures and software failures. However, media failures and human error could still cause system downtime. Real Application Clusters, as with single-instance Oracle databases, operates on one set of files. For this reason, you should adopt best practices to avoid the adverse effects of media failures.

RAID-based redundancy practices avoid file loss but might not prevent rare cases of file corruptions. Also, if you mistakenly drop a database object in an Real Application Clusters environment, then you can recover that object the same way you would in a single instance database. These are the primary limitations in an otherwise very robust and highly available Real Application Clusters system.

Once you deploy your system, the key issue is the transparency of failover and its duration as described in the following section.

Failover and Real Application Clusters

This section describes the principles of failover and the features Real Application Clusters offers to implement failover in high availability systems. Topics in this section include:

Failover Basics

Failover requires that highly available systems have accurate instance monitoring or heartbeat mechanisms. In addition to having this functionality for normal operations, the system must be able to quickly and accurately synchronize resources during failover.

The process of synchronizing, or remastering, requires the graceful shutdown of the failing system as well as an accurate assumption of control of the resources that were mastered on that system. In Real Application Clusters, your system records resource information to remote nodes as well as local. This makes the information needed for failover and recovery available to the recovering instances.

See Also:

Oracle9i Real Application Clusters Real Application Clusters Guard I - Concepts and Administration for information about how to set up Oracle Real Application Clusters Guard I on your system
Oracle9i Real Application Clusters Guard II Concepts, Installation, and Administration for information about using Oracle Real Application Clusters Guard II
Oracle9i Real Application Clusters Administration for details about recovery in Real Application Clusters
Oracle9i Recovery Manager User's Guide for details on recovery

Duration of Failover

The duration of failover includes the time a system requires to remaster system-wide resources and the time to recover from failures. The duration of the failover process can be a relatively short interval on certified platforms.

For existing users, failover entails both server and client failover actions
For new users, failover only entails the duration of server failover processing

Client Failover

It is important to hide system failures from database client connections. Such connections can include application users in client server environments or middle-tier database clients in multitiered application environments. Properly configured failover mechanisms transparently reroute client sessions to an available node in the cluster. This capability in the Oracle database is referred to as Transparent Application Failover.

Transparent Application Failover

Transparent Application Failover (TAF) enables an application user to automatically reconnect to a database if the connection fails. Active transactions roll back, but the new database connection, which is achieved using a different node, is identical to the original. This is true regardless of how the connection fails.

Elements Affected by Transparent Application Failover

There are several elements associated with active database connections. These include:

Client/Server database connections
Users' database sessions executing commands
Open cursors used for fetching
Active transactions
Server-side program variables

Transparent Application Failover automatically restores some of these elements. For example, during normal client/server database operations, a client maintains a connection to the database so the client and server can communicate. If the server fails, then so does the connection. The next time the client tries to use the connection the client issues an error. At this point, the user must log in to the database again.

With Transparent Application Failover, however, Oracle automatically obtains a new connection to the database. This enables users to continue working as if the original connection had never failed. Therefore, with Transparent Application Failover, a client notices no connection loss as long as one instance remains active to serve the application.

See Also:

Oracle9i Net Services Administrator's Guide for background and configuration information about Transparent Application Failover

Uses of Transparent Application Failover

While the ability to fail over client sessions is an important benefit of Transparent Application Failover, there are other useful scenarios where Transparent Application Failover improves system availability. These topics are discussed in the following subsections:

Transactional Shutdowns

It is sometimes necessary to take nodes out of service for maintenance or repair. For example, if you want to apply patch releases without interrupting service to application clients. Transactional shutdowns facilitate shutting down selected nodes rather than an entire database. Two transactional shutdown options are available:

Use the TRANSACTIONAL clause of the SHUTDOWN statement to remove a node from service so that the shutdown event is deferred until all existing transactions are completed. In this way, client sessions can be migrated to another node of the cluster at transaction boundaries.
Use the TRANSACTIONAL LOCAL clause of the SHUTDOWN statement to perform transactional shutdown on a specified local instance. You can use this statement to prevent new transactions from starting locally, and to perform an immediate shutdown after all local transactions have completed. With this option, you can gracefully move all sessions from one instance to another by shutting down selected instances transactionally.

After performing a transactional shutdown, Oracle routes newly submitted transactions to an alternate node. An immediate shutdown is performed on the node when all existing transactions complete.

Quiescing the Database

You may need to perform administrative tasks that require isolation from concurrent user transactions or queries. To do this, you can use the quiesce database feature. This prevents you, for example, from having to shut down the database and re-open it in restricted mode to perform such tasks.

To do this, you can use the ALTER SYSTEM statement with the QUIESCE RESTRICTED clause. The QUIESCE RESTRICTED clause enables you to perform administrative tasks in isolation from concurrent user transactions or queries.

Note:

You cannot open the database on one instance if the database is being quiesced on another node. In other words, if you issued the ALTER SYSTEM QUIESCE RESTRICTED statement but it is not finished processing, you cannot open the database. Nor can you open the database if it is already in a quiesced state.

See Also:

Oracle9i Real Application Clusters Administration and the Oracle9i Database Administrator's Guide for more detailed information about the quiesce database feature and Oracle9i SQL Reference for more information about the ALTER SYSTEM QUIESCE RESTRICTED syntax

Load Balancing

A database is available when it processes transactions in a timely manner. When the load exceeds a node's capacity, client transaction response times are adversely affected and the database availability is compromised. It then becomes important to manually migrate client sessions to a less heavily loaded node to maintain response times and application availability.

In Real Application Clusters, the Transport Network Services (TNS) listener files provide automated load balancing across nodes in both shared server and dedicated server configurations. Because the parameters that control cross-instance registration are also dynamic, Real Application Clusters' load balancing feature automatically adjusts for cluster configuration changes. For example, if you add a node to your cluster database, then Oracle updates all the listener files in the cluster with the new node's listener information.

Database Client Processing During Failover

Failover processing for query clients is different than the failover processing for Database Manipulation Language clients. The important issue during failover operations in either case is that the failure is masked from existing client connections as much as possible. The following subsections describe both types of failover processing.

Query Clients

At failover, in-progress queries are reissued and processed from the beginning. This might extend the duration of the next query if the original query required longer to complete. With Transparent Application Failover (TAF), the failure is masked for query clients with an increased response time being the only issue affecting the client. If the client query can be satisfied with data in the buffer cache of the surviving node to which the client reconnected, then the increased response time is minimal. Using TAF's PRECONNECT method eliminates the need to reconnect to a surviving instance and thus further minimizes response time. However, PRECONNECT allocates resources awaiting the failover event.

After failover, server-side recovery must complete before access to the datafiles is allowed. The client transaction experiences a system pause until server-side recovery completes, if server-side recovery has not already completed.

You can also use a callback function through an OCI call to notify clients of the failover so that the clients do not misinterpret the delay for a failure. This prevents the clients from manually attempting to reestablish connections.

Database Manipulation Language Clients

Database Manipulation Language (DML) database clients perform INSERT, UPDATE, and DELETE operations. Oracle handles certain errors and performs a reconnect when those errors occur.

Without this application code, INSERT, UPDATE, and DELETE operations on the failed instance return an un-handled Oracle error code. Upon re-submission, Oracle routes the client connections to a surviving instance. The client transaction then stops only momentarily until server-side recovery completes.

Transparent Application Failover Processing During Shutdowns

Queries that cross the network after shutdown processing completes will failover. However, Oracle returns an error for queries that are in progress during shutdowns. Therefore, TAF only operates when the operating system returns a network error and the instance is completely down.

Applications that use TAF for transactional shutdown must be written to process the error ORA-01033 "ORACLE initialization or shutdown in progress". In the event of a failure, an instance will return error ORA-01033 once shutdown processing begins. Such applications need to periodically retry the failed operation, even when Oracle reports multiple ORA-01033 errors. When shutdown processing completes, TAF recognizes the failure of the network connection to instance and restores the connection to an available instance.

Connection load balancing improves connection performance by balancing the number of active connections among multiple dispatchers. In single-instance Oracle environments, the listener selects the least loaded dispatcher to manage incoming client requests. In Real Application Clusters environments, connection load balancing also has the capability of balancing the number of active connections among multiple instances.

Due to dynamic service registration, a listener is always aware of all of the instances and dispatchers regardless of their locations. Depending on the load information, a listener determines to which instance and to which dispatcher to send incoming client requests if you are using the shared server configuration.

In shared server configurations, listeners select dispatchers using the following criteria in the order shown:

Least loaded node
Least loaded instance
Least loaded dispatcher for that instance

In dedicated server configurations, listeners select instances in the following order:

Least loaded node
Least loaded instance

If a database service has multiple instances on multiple nodes, then the listener chooses the least loaded instance on the least loaded node. If you have configured the shared server, then the least loaded dispatcher of the selected instance is chosen.

See Also:

Oracle9i Net Services Administrator's Guide for more information about load balancing

Transparent Application Failover Restrictions

When a connection fails, you might experience the following:

All PL/SQL package states on the server are lost at failover
ALTER SESSION statements are lost
If failover occurs when a transaction is in progress, then each subsequent call causes an error message until the user issues an OCITransRollback call. Then Oracle issues an Oracle Call Interface (OCI) success message. Be sure to check this message to see if you must perform additional operations.
Oracle fails over the database connection and if TYPE=SELECT in the FAILOVER_MODE section of the service name description, Oracle also attempts to fail over the query
Continuing work on failed-over cursors can result in an error message

If the first command after failover is not a SQL SELECT or OCIStmtFetch statement, then an error message results. Failover only takes effect if the application is programmed with OCI release 8.0 or greater.

Server Failover

Server-side failover processing in Real Application Clusters is different from host-based failover solutions that are available on many server platforms. The following subsections describe both types of failover processing.

Real Application Clusters Failover

Real Application Clusters provides rapid server-side failover. This is accomplished by the concurrent, active-active architecture in Real Application Clusters. In other words, multiple Oracle instances are concurrently active on multiple nodes and these instances synchronize access to the same database.

All nodes also have concurrent ownership and access to all disks. When one node fails, all other nodes in the cluster maintain access to all the disks; there is no disk ownership to transfer, and database application binaries are already loaded into memory.

Depending on the size of the database, the duration of failover can vary. The larger the database, or the greater the size of its datafiles, the greater the failover benefit of using Real Application Clusters.

Host-Based Failover

Many operating system vendors and other cluster software vendors offer high availability application failover products. These failover solutions monitor application services on a given primary cluster node. They then fail over services to a secondary cluster node as needed. Host-based failover solutions generally have one active instance performing useful work for a given database application. The secondary node monitors the application service on the primary node and initiates failover when the primary node service is unavailable.

Failover in host-based systems usually includes the following steps.

Detecting failure by monitoring the heartbeat
Reorganizing cluster membership in the Cluster Manager
Transferring disk ownership from the primary node to a secondary node
Restarting application and database binaries (Oracle executables)
Performing application and database recovery
Reestablishing client connections to the failover node

Failover Processing in Real Application Clusters

The following subsections describe server failover recovery processing in Real Application Clusters:

Detecting Failure

Real Application Clusters relies on the Cluster Manager software for failure detection because the Cluster Manager maintains the heartbeat functions. The time it takes for the Cluster Manager to detect that a node is no longer in operation is a function of a configurable heartbeat timeout parameter.

The use of this parameter varies, depending on your platform. Defaults can vary significantly, depending on the clusterware you use, such as Sun Cluster or the Hewlett-Packard Service Guard OPS Edition. The parameter value is inversely related to the number of false failure detections because the cluster might incorrectly determine that a node is failing due to transient failures if the timeout interval is set too low. When a failure is detected, cluster reorganization occurs.

Reorganizing Cluster Membership

When a node fails, Oracle must alter the node's cluster membership status. This is known as a cluster reorganization and it usually happens quickly. The duration of cluster reorganization is proportional to the number of surviving nodes in the cluster.

The Global Cache Service (GCS) and Global Enqueue Service (GES) provide the Cluster Manager interfaces to the software and expose the cluster membership map to the Oracle instances when nodes are added or deleted from the cluster. The LMON process on each cluster node communicates with the Cluster Manager on the respective node and exposes that information to the respective instances. LMON also provides two more useful functions by:

Continually sending messages from the node on which it runs
Often writing to the shared disk

If a node fails to perform these two functions, then other nodes consider that node to no longer be a member of the cluster. Such a failure causes a change in a node's membership status within the cluster. Then LMON initiates recovery actions that include remastering of the Global Cache Service (GCS) and Global Enqueue Service (GES) resources and instance recovery.

At this stage, the Real Application Clusters environment is in a state of system pause, and client transactions that do not have the needed resources to complete will suspend until Oracle completes recovery processing. Other in-progress transactions, however, continue processing.

Instance Membership Recovery

The process of instance membership recovery (IMR) guarantees that all members of a cluster are functional by:

Ensuring that communications are viable between all members, and that all members are capable of responding.
Providing a mechanism for removing members that are not active. IMR arbitrates the membership and removes members that it decides no longer belong to the cluster.
Voting on membership using the control file. Each member writes a bitmap to the control file. This is part of the checkpoint progress record.
Removing members based on a communication failure. IMR will perceive members as having failed if they do not transmit periodic heartbeat messages to the control file, or if they do not respond to a query about their status.
Settling the membership votes and locking the control file voting results record (CFVRR) with an arbiter.

All instances read the CFVRR. If a member is not in the membership map, then IMR assumes a node has expired. Appropriate diagnostic information is provided. As IMR is currently configured, all members wait indefinitely for notification of node expiration. There is no forced removal of instances. Part of the fault tolerance of Real Application Clusters is a provision for the possibility that the IMR arbiter itself could fail.

Performing Database Recovery

When an instance fails, Oracle must remaster the GCS resources from the failed instance onto the surviving cluster nodes and perform instance recovery as discussed in the following sections:

Remastering Global Cache Service Resources of the Failed Instance

The time required for remastering resources is proportional to the number of GCS resources in the failed instance. This number in turn depends upon the size of the buffer caches.

During this phase, all resources previously mastered at the failed instance are redistributed across the remaining instances. These resources are reconstructed at their new master instance. All other resources previously mastered at surviving instances are not affected. For any resource request, there is a 1/n chance that the request will be satisfied locally and a (n-1)/n chance that the request involves remote operations.

In the case of a cluster database having only one surviving instance, all resource operations are satisfied locally. Once the remastering of a failed instance's GCS resource completes, Oracle recovers the in-progress transactions of the failed instance. This is known as instance recovery.

Instance Recovery

Instance recovery includes cache recovery and transaction recovery. Instance recovery requires that an active Real Application Clusters instance detects failure and performs recovery processing for the failed instance. The first Real Application Clusters instance that detects the failure, using its LMON process, controls the recovery of the failed instance by taking over its redo log files and performing instance recovery. This is why the redo log files must be on either a cluster file system file or on a shared raw device.

Instance recovery is complete when Oracle has replayed the online redo log files of the failed instance. Because Oracle can perform transaction recovery in a deferred fashion, any suspended client transactions can begin processing when cache recovery is complete.

See Also:

Oracle9i Recovery Manager User's Guide and Reference for a description of Block Media Recovery (BMR)

Cache Recovery

For cache recovery, Oracle replays the online redo logs of the failed instance. You can also make Oracle perform cache recovery using parallel execution so that parallel processes, or threads, replay the redo logs of the failed Oracle instance. It could also be important that you keep the time interval for redo log replay to a predictable duration. The Fast-Start Recovery feature in Oracle9i enables you to control this.

Oracle also provides nonblocking rollback capabilities. This means that full database access can begin as soon as Oracle has replayed the online log files. After cache recovery completes, Oracle begins transaction recovery.

See Also:

Oracle9i Database Performance Guide and Reference for more information on how to use Fast-Start Recovery

Transaction Recovery

Transaction recovery comprises rolling back all uncommitted transactions of the failed instance. Uncommitted transactions are in-progress transactions that did not commit.

The Oracle9i Fast-Start Rollback feature performs this as deferred processing that runs in the background. Oracle uses a multiversion read consistency technology to provide on-demand rollback of only those rows blocked by expired transactions. This enables new transactions to progress with minimal delay. New transactions do not have to wait for long-running expired transactions to be rolled back. Therefore, large transactions generally do not affect database recovery time.

Just as with cache recovery, Oracle9i Fast-Start Rollback rolls back expired transactions in parallel. However, single-instance Oracle databases roll back expired transactions using the CPU of one node.

Real Application Clusters provides cluster-aware Fast-Start Rollback capabilities that use all the CPU nodes of a cluster to perform parallel rollback operations. Each cluster node spawns a recovery coordinator and recovery processes to assist with parallel rollback operations. The Fast-Start Rollback feature is thus cluster aware because the database is aware of and uses all cluster resources for parallel rollback operations.

While the default behavior is to defer transaction recovery, you could choose to configure your system so that transaction recovery completes before allowing client transactions to progress. In this scenario, the ability of Real Application Clusters to parallelize transaction recovery across multiple nodes is a more visible user benefit.

High Availability Configurations

This section discusses the following Real Application Clusters high availability configurations:

Default N-Node Configurations

The Real Application Clusters n-node configuration is the default environment. All nodes of the cluster participate in client transaction processing and client sessions can be load balanced at connect time. Response time is optimized for available cluster resources, such as CPU and memory, by distributing the load across cluster nodes to create a highly available environment.

Benefits of N-Node Configurations

In the event of node failures, an instance on another node performs the necessary recovery actions. The database clients on the failed instance can be load balanced across the surviving (n-1) instances of the cluster. The increased load on each of the surviving instances can be minimized and availability increased by keeping response times within acceptable bounds. In this configuration, the database application workload can be distributed across all nodes and therefore provide optimal use of cluster machine resources.

Basic High Availability Configurations

You can easily configure a basic high availability system for Real Application Clusters in two-node environments. The primary instance on one node accepts user connections while the secondary instance on the other node accepts connections when the primary node fails, or when specifically selected through the INSTANCE_ROLE parameter. You can configure this manually by controlling the routing of transactions to specific instances. However, Real Application Clusters provides the Primary/Secondary Instance Configuration feature to accomplish this automatically.

Primary/Secondary Instance Configurations

Configure the Primary/Secondary Instance feature by setting the initsid.ora parameter ACTIVE_INSTANCE_COUNT to 1. In a two-node environment, the instance that first mounts the database assumes the primary instance role. The other instance assumes the role of secondary instance. If the primary instance fails, then the secondary instance assumes the primary role. When the failed instance returns to active status, it assumes the secondary instance role.

Remote Clients and the Primary/Secondary Configuration

The secondary instance becomes the primary instance only after the Cluster Manager informs it about the failure of the primary instance. This occurs before GCS and GES reconfiguration and cache and transaction recovery processes begin. The redirection to the surviving instance happens transparently; application programming is not required. You only need to make minor configuration changes to the client connect strings.

In Primary/Secondary Instance configurations, both instances run concurrently, as in any n-node Real Application Clusters environment. However, database application users only connect to the designated primary instance. The primary node masters all of the GCS and GES resources. This minimizes communication between the nodes and provides performance levels that are nearly comparable to traditional single instance databases.

The secondary instance can be used by specially configured clients, known as administrative clients, for batch query reporting operations or database administration tasks. This enables some level of utilization of the second node. It also helps off-load CPU capacity from the primary instance and justify the investment in redundant nodes.

The Primary/Secondary Instance configuration works in both dedicated server and shared server environments. However, it functions differently in each as described in the following sections:

Primary/Secondary Instance Configurations in Dedicated Server Environments

In current high availability configurations, dedicated server environments do not use cross-instance listener registration. Connection requests made to a specific instance's listener can only be connected to that instance's service. This behavior is similar to the default n-node configuration in dedicated server environments.

Figure 10-1 shows a cluster configuration before a node failure.

SALES1 is in contact with a listener.
A client is in contact with a listener.
SALES1 becomes the primary instance.

Figure 10-1 Primary/Secondary Configurations in Dedicated Server Environments

Text description of pss81016.gif follows

Text description of the illustration pss81016.gif

When the primary instance fails, as shown in Figure 10-2, the following steps occur:

The failure of SALES1 is communicated throughout the cluster.
A reconnection request from the client is rejected by the failed instance's listener.
The secondary instance performs recovery and becomes the primary instance.
Upon resubmitting the client request, the client reestablishes the connection through the new primary instance's listener that connects the client to the new primary instance. Note that the connection is reestablished automatically when you use address lists or if your client is configured to use connection failover.

Figure 10-2 Primary/Secondary Configurations and Node Failure with Dedicated Server

Text description of pss81017.gif follows

Text description of the illustration pss81017.gif

Primary/Secondary Instance Configurations and the Shared Server

Real Application Clusters provides reconnection performance benefits when running in shared server mode. This is accomplished by the cross-registration of all the dispatchers and listeners in the cluster.

In the Primary/Secondary configurations, the primary instance's dispatcher registers as the primary instance with both listeners, as shown in Figure 10-3:

A client could connect to either listener. (Only the connection to the primary node listener is illustrated.)
The client contacts the listener.
The listener then connects the client to the dispatcher. (Only the listener/dispatcher connection on the primary node is illustrated.)

See Also:
Oracle9i Real Application Clusters Setup and Configuration for information about configuring client connect strings

Figure 10-3 Primary/Secondary Configurations in Shared Server Environments

Text description of pss81019.gif follows

Text description of the illustration pss81019.gif

Specially configured clients can use the secondary instance for batch operations. For example, batch reporting tasks or index creation operations can be performed on the secondary instance.

See Also:

Oracle9i Real Application Clusters Administration for instructions about how to connect to secondary instances

Figure 10-4 shows how a failed primary instance is replaced by a new primary instance.

If the primary node fails, then the dispatcher in the secondary instance registers as the new primary instance with the listeners.
The client requests a reconnection to the database through either listener.
The listener directs the request to the new primary instance's dispatcher.

Figure 10-4 Primary/Secondary Configurations and Node Failure with Shared Server

Text description of pss81018.gif follows

Text description of the illustration pss81018.gif

Warming the Library Cache on the Secondary Instance

Maintaining information about frequently executed SQL and PL/SQL statements in the library cache improves the performance of the Oracle database server. In Real Application Clusters primary and secondary instance configurations, the library cache associated with the primary instance contains up-to-date information. If failover occurs, then the benefit of that information is lost unless the library cache on the secondary instance is populated beforehand.

Use the DBMS_LIBCACHE package to transfer information in the library cache of the primary instance to the library cache of the secondary instance. This process is called warming the library cache. It improves performance immediately after failover because the new primary library cache does not need to be populated with parsed SQL statements and compiled PL/SQL units.

See Also:

Oracle9i Real Application Clusters Real Application Clusters Guard I - Concepts and Administration for more information about installing and configuring the library cache warming feature and Oracle9i Supplied PL/SQL Packages and Types Reference for more information about using DBMS_LIBCACHE

Benefits of Basic High Availability Configurations

There are several reasons for using the Primary/Secondary Instance feature for this scenario instead of a default two-node configuration. The Primary/Secondary Instance feature provides:

A viable transition path for upgrading to an n-node configuration
A highly available solution for applications that do not need to scale beyond one node
Performance that is comparable to single instance databases
A gradual way to migrate from a single instance application environment to a Real Application Clusters environment

Shared High Availability Node Configurations

Operating Real Application Clusters in an n-node configuration optimally utilizes cluster resources. However, as discussed previously, this is not always possible or advisable. On the other hand, the financial investment required to have an idle node for failover might be prohibitive. These situations might instead be best suited for a shared high availability node configuration.

This type of configuration typically has several nodes each running a separate application module or service where all application services share one Real Application Clusters database. In addition, you can configure a separate designated node as a failover node. While an instance is running on that node, no users are connected to it during normal operations. In the event that one of the nodes fails, Oracle can redirect the workload to the failover node.

While this configuration is useful for applications that need to run on separate nodes, it works best if a middle-tier application or transaction processing monitor directs the appropriate application users to the appropriate nodes. Unlike the Primary/Secondary Instance Configuration, there is no database setup that automates the workload transition to the failover node. Instead, the application or middle-tier software directs users from the failed application node to the failover node. The application also must control failing back the users once the failed node is operational. Failing back frees the failover node for processing user work from subsequent node failures.

Benefits of Shared High Availability Node Configurations

In this configuration, application performance is maintained in the event of a failover. In the n-node configuration, application performance could degrade by 1/n due to the same workload being redistributed over a smaller set of cluster nodes.

Full Active Configurations with Real Application Clusters Guard II

High availability as well as improved manageability is available with Real Application Clusters Guard II which is a full instance environment that enables you to control all the instances on which services run as well as their failover properties. On failure, Real Application Clusters Guard II transfers application service loads to other available nodes without service interruptions.

See Also:

Oracle9i Real Application Clusters Guard II Concepts, Installation, and Administration on the Real Application Clusters Guard II software CD for more information about Real Application Clusters Guard II

Deploying High Availability

Real Application Clusters provides a fully redundant fault resilient environment. All cluster nodes have an active instance that has equal access to all data and resources. If a node fails, then users can access the data using a surviving instance on another node. In-progress transactions on the failed node are recovered by the first node that detects the failure. In this way, there is minimal interruption to end-user application availability with Real Application Clusters.

10 High Availability Concepts and Best Practices in Real Application Clusters

Understanding High Availability

Configuring Real Application Clusters for High Availability

Cluster Components and High Availability

Cluster Nodes

Cluster Interconnects

Database Software

Disaster Planning

Failure Protection Validation

Failover and Real Application Clusters

Failover Basics

Duration of Failover

Client Failover

Transparent Application Failover

Elements Affected by Transparent Application Failover

Uses of Transparent Application Failover

Transactional Shutdowns

Quiescing the Database

Load Balancing

Database Client Processing During Failover

Query Clients

Database Manipulation Language Clients

Transparent Application Failover Processing During Shutdowns

Transparent Application Failover Restrictions

Server Failover

Real Application Clusters Failover

Host-Based Failover

Failover Processing in Real Application Clusters

Detecting Failure

Reorganizing Cluster Membership

Instance Membership Recovery

Performing Database Recovery

Remastering Global Cache Service Resources of the Failed Instance

Instance Recovery

Cache Recovery

Transaction Recovery

High Availability Configurations

Default N-Node Configurations

Benefits of N-Node Configurations

Basic High Availability Configurations

Primary/Secondary Instance Configurations

Remote Clients and the Primary/Secondary Configuration

Primary/Secondary Instance Configurations in Dedicated Server Environments

Figure 10-1 Primary/Secondary Configurations in Dedicated Server Environments

Figure 10-2 Primary/Secondary Configurations and Node Failure with Dedicated Server

Primary/Secondary Instance Configurations and the Shared Server

Figure 10-3 Primary/Secondary Configurations in Shared Server Environments

Figure 10-4 Primary/Secondary Configurations and Node Failure with Shared Server

Warming the Library Cache on the Secondary Instance

Benefits of Basic High Availability Configurations

Shared High Availability Node Configurations

Benefits of Shared High Availability Node Configurations

Full Active Configurations with Real Application Clusters Guard II

Deploying High Availability

10
High Availability Concepts and Best Practices in Real Application Clusters