| Oracle Real Application Clusters Guard Concepts and Administration Guide Release 3.2.1 for Windows NT and Windows 2000 Part Number A95197-01 |
|
Oracle Real Application Clusters software provides a high level of availability through its multi-instance implementation of the Oracle database server. Oracle Real Application Clusters Guard helps you to configure Oracle Real Application Clusters databases into an MSCS cluster. When you do so, Oracle Real Application Clusters Guard, along with Oracle Real Application Clusters and MSCS, works to monitor and maintain the availability of nodes and cluster resources that you configure into an MSCS cluster in units called groups. Oracle Real Application Clusters supports two types of deployments:
The default Oracle Real Application Clusters deployment is called a default n-node deployment. In this deployment, all nodes of the cluster participate in client transaction processing, and client sessions can be load balanced at connect time. Response time is optimized for available cluster resources, such as CPU and memory, by distributing the load across cluster nodes to create a highly available environment.
Oracle Real Application Clusters software also supports a primary/secondary instance deployment. The primary/secondary instance deployment lets you configure a basic two-node high-availability system for Oracle Real Application Clusters. An instance designated as the primary instance on one node accepts user connections, while an instance designated as the secondary instance on the other node accepts connections when the primary node fails, or when specifically selected through the INSTANCE_ROLE parameter in the CONNECT_DATA portion of tnsnames.ora.
The concepts in this chapter apply to both methods of deployment; special considerations for primary/secondary deployments are discussed in Section 2.4.
Before you begin to configure an Oracle Real Application Clusters database into an MSCS cluster, it is helpful to understand the concepts and policies that govern how configuration enhances high availability. This chapter discusses the following topics concerning cluster concepts and policies for maintaining high availability:
| Topic | Reference |
|---|---|
When a cluster node becomes unavailable, its cluster resources (for example, shared-nothing cluster disks, Oracle database instances and applications, and IP addresses) are failed over (moved) to an available node in units called groups. Clients request access to the resources in those groups at a node-independent network address called a virtual address. The following sections describe cluster resources, groups, and virtual addresses.
An MSCS cluster resource is any physical or logical component that is available to a computing system and has the following characteristics:
Because an MSCS resource exists on only one node at a time, Oracle Real Application Clusters databases are not considered cluster resources; however, the database instances and the components used by the database (listener and network name) are.
Each cluster resource is associated with a resource type, and each resource type (Oracle Real Application Clusters database instance, listener, network name, and IP address) is associated with a resource dynamic-link library (DLL) and is managed in the cluster environment using this resource DLL. There are standard MSCS resource DLLs as well as custom Oracle resource DLLs. The same resource DLL may support several different resource types.
For example, when you use Oracle Real Application Clusters Guard to configure an Oracle Real Application Clusters database into an MSCS cluster, Oracle Real Application Clusters Guard creates several database instance resources (one for each instance associated with the database) and Oracle listener resources.
The Oracle Real Application Clusters database instance resource DLLs (FsResOdbs.dll and FsResOPSInstEx.dll) provide functions that allow MSCS to check the status of the database instances, bring them online, or take them offline, and display their properties in MSCS.
See also:
A group is a logical collection of cluster resources that forms a minimal unit of failover. During a group failover, a group of cluster resources is moved from one cluster node to another cluster node. A group is owned by only one cluster node at a time. All cluster resources required for a given workload reside in the same group. Oracle Real Application Clusters Guard provides the Configure Database Wizard to help you to configure each Oracle Real Application Clusters database instance into a group. Each group created for an Oracle Real Application Clusters database instance includes the following resources:
The Oracle Real Application Clusters Guard Manager displays two types of group folders: Groups and Instance Groups. MSCS creates a group for each disk resource and for the cluster. Oracle Real Application Clusters Guard creates an instance group for each instance associated with the database when you configure an Oracle Real Application Clusters database into an MSCS cluster. Groups include both the groups included in the Instance Groups folder and groups created by MSCS. Commands and property sheets for both types of groups are the same.
Note that the raw disk partitions that Oracle Real Application Clusters databases use to store data, redo, and log files are not considered cluster resources. These disks must be accessible to all database instances (and therefore, all cluster nodes) concurrently, whereas cluster resources are accessible to only one cluster node at a time.
Each group has a preferred owner node. The preferred owner node for a group containing an Oracle Real Application Clusters database instance (an instance group) is the node on which the instance exists and is the only node on the cluster on which the instance can come online. There is one and only one preferred owner node for a group containing an Oracle Real Application Clusters database instance. Therefore, if the group containing a database instance fails over, the instance is not brought online on the failover node.
Each group also has a set of possible owner nodes. The possible owner nodes for a group containing an Oracle Real Application Clusters database instance are all cluster nodes where Oracle Services for MSCS is installed, less any nodes you explicitly remove from the set using Oracle Real Application Clusters Guard Manager.
When you configure an Oracle Real Application Clusters database into an MSCS cluster, Oracle Real Application Clusters Guard Manager helps you to create a group for each database instance, requests information about and adds one or more node-independent addresses (called virtual addresses), and automatically adds an Oracle Net listener to each group for you. When Oracle Real Application Clusters Guard adds these resources to the group, it sets up a relationship among them called resource dependencies. The resource dependencies define the order in which the cluster software brings the resources offline and online.
As shown in Figure 2-1, in a group containing an Oracle Real Application Clusters database instance, there is a dependency between the instance and the IP address. In addition, the listener has a dependency on the network name, which has a dependency on the IP address. Therefore, if a node fails, the Oracle database instance resource and the listener resource will be brought offline first, followed by the network name resource, and then the IP address resource. On the node to which the group fails over (the failover node), the order is reversed; MSCS brings the IP address resource online first, followed by the network name resource. Neither the database instance nor the listener is brought online on the failover node, because each can run only on the node on which it was created. (However, if a group were simply taken offline and then placed online on its current node, then the database instance resource and listener resource would be brought online after the IP address and network name resources had been brought online.)
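The ordering rules above can be illustrated with a short sketch. It is not how MSCS is implemented; it only models the dependencies named in the text as a graph and derives an online order (dependencies first) and an offline order (the reverse). The dictionary and function names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9 or later

# Each resource maps to the set of resources it depends on,
# as described for a group containing a database instance.
DEPENDS_ON = {
    "database instance": {"IP address"},
    "listener": {"network name"},
    "network name": {"IP address"},
    "IP address": set(),
}

def online_order(depends_on):
    """Return resources so that every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())

def offline_order(depends_on):
    """Return resources so that dependents go offline before their dependencies."""
    return list(reversed(online_order(depends_on)))
```

On a failover node only the IP address and network name would actually be brought online, because the instance and listener can run only on the node where they were created; the sketch shows the ordering constraint, not that restriction.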
A virtual address is a network address at which running resources in a group can be located, regardless of the cluster node hosting those resources. A virtual address provides a constant node-independent network location that allows clients to easily locate resources without needing to know which physical cluster node is hosting those resources.
Groups move from an unavailable node to an available one after a node fails or a virtual address fails (and cannot be restarted on its current node) in an operation called failover. You identify a virtual address for a group in Oracle Real Application Clusters Guard Manager by specifying a unique network name and IP address for each group. The Configure Database Wizard in Oracle Real Application Clusters Guard Manager helps you to specify one or more virtual addresses for each database instance. Figure 2-2 shows the dialog box that helps you add one or more virtual addresses to a group.
The virtual addresses in the group make the group a virtual server. Although at least one virtual address per group is required for client access, you can assign multiple virtual addresses to a group. You might assign multiple virtual addresses to provide increased bandwidth.
Each group appears to users and client applications as a highly available virtual server, independent of the physical identity of one particular node. To access the resources in a group, clients always access the group by connecting to the virtual address of a group. To the client, the virtual server is the interface to the cluster resources and looks like a physical node.
Figure 2-3 shows a four-node cluster with one instance group configured on each node. Clients access these groups through Virtual Server A, B, C, and D. By accessing the cluster resources through the virtual address of a group, as opposed to the physical address of an individual node, you ensure a quick connection to an available database instance even when the requested instance is not available. The process by which a quick remote connection is ensured is described in Section 2.3.1.1.
See Section 3.7 for details on the network configuration and virtual address for Oracle Real Application Clusters databases configured in an MSCS cluster.
Monitoring the state of components in an MSCS cluster is key to maintaining high availability. MSCS monitors the state of cluster nodes and cluster resources. Data it collects on Oracle Real Application Clusters database instances is communicated to Oracle Real Application Clusters Guard, so that it can monitor and evaluate the state of the database overall. The following sections describe the following:
The Windows systems that are members of a cluster are called cluster nodes. The cluster nodes are joined together through a shared storage interconnect as well as an internode network connection.
The private interconnect, sometimes referred to as a heartbeat connection or an internode network connection, allows one node to detect the availability or unavailability of another node. Typically, a private interconnect (that is distinct from the public network connection used for user and client application access) is used for this communication. If one node fails, the cluster software immediately fails over the groups from the unavailable node to an available node, and restarts the group's virtual address on an available node. Clients reconnect to a database instance through connect-time failover.
MSCS monitors the state of cluster resources (Oracle Real Application Clusters database instances, listeners, IP addresses, and network names) by polling the resources at regular intervals to determine if they are running, failed, or in the case of database instances, possibly hung. As shown in Figure 2-4 and Figure 2-5, you can set the parameters for how often each type of polling is performed and the amount of time that can pass without a response from the poll before it is considered to have failed, as follows:
The pending timeout value specifies how long the cluster software should wait for a resource in a pending state to come online (or offline) before considering that resource to have failed. By default, this value is 180 seconds.
The Is Alive interval specifies how frequently the cluster software should check the state of the resource. You can use either the default value for the resource type or specify a number (in milliseconds). This check is more thorough, but also uses more system resources than the check done during a Looks Alive interval.
The Looks Alive interval specifies how frequently the cluster software should check the registered state of the resource to determine if the resource appears to be active. You can use either the default value for the resource type or specify a number (in milliseconds). This check is less thorough, but also uses fewer system resources, than the check done during an Is Alive interval.
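As a rough illustration of how the three settings interact, the following sketch lists which check would be due at each point in time and tests a pending resource against the pending timeout. The function names and the interval values in the usage note are hypothetical, and the rule that an Is Alive check replaces a Looks Alive check falling due at the same moment is an assumption made for this example, not documented MSCS behavior.

```python
def poll_schedule(looks_alive_ms, is_alive_ms, horizon_ms):
    """List (time_ms, check) pairs for every check due up to horizon_ms."""
    due_times = sorted(
        set(range(looks_alive_ms, horizon_ms + 1, looks_alive_ms))
        | set(range(is_alive_ms, horizon_ms + 1, is_alive_ms))
    )
    schedule = []
    for t in due_times:
        # Assumption: the thorough check subsumes the lightweight one
        # when both fall due at the same time.
        check = "is_alive" if t % is_alive_ms == 0 else "looks_alive"
        schedule.append((t, check))
    return schedule

def pending_failed(seconds_pending, pending_timeout_s=180):
    """True if a resource has been pending longer than the pending timeout."""
    return seconds_pending > pending_timeout_s
```

For example, `poll_schedule(5000, 15000, 30000)` yields two Looks Alive checks followed by one Is Alive check, repeated twice. The default of 180 seconds for the pending timeout is the value given in the text.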
In addition, for non-database resources (resources other than database instances), you can specify the restart policy for the resource, which is defined when you select one of the following:
Specifies that an attempt to restart the resource on the current node should not be made.
Specifies the number of attempts that should be made in a given time period to restart the resource on the current node before implementing the resource failover policy.
You cannot change the resource failover policy. It is set as required by Oracle Real Application Clusters Guard to maintain high availability of all components. See Section 2.3 for details.
The restart policy for database instances is specified for all database instances rather than one instance at a time, so that the policy can be evaluated and applied across all database instances as a whole. (The resource failover policy for an Oracle Real Application Clusters database instance is always "If the resource is not restarted, do not fail over the group.")
Figure 2-4 shows the Policies property page for a non-database cluster resource, namely, an Oracle TNS listener. (Specifying the "Use value from resource type" option indicates that you want to use the default values that are set in MSCS. To view the default values, open MSCS Cluster Administrator, select Resource Types from the tree view, right-click Oracle Real Application Clusters Instance in the right pane, and then click Properties.)
Figure 2-5 shows the Policies property page for an Oracle Real Application Clusters database instance.
MSCS provides the results of Is Alive polling of each database instance to Oracle Real Application Clusters Guard so that it can monitor the status of the Oracle Real Application Clusters database as a whole. Section 2.2.3 describes how Oracle Real Application Clusters databases are monitored. Section 2.3.2 describes how the restart policy for database instances is specified and applied.
A global monitor component of Oracle Real Application Clusters Guard manages issues and policies that affect the database instances as a whole, such as policies that determine if and when failed instances are restarted and parameters for database instance hang detection and termination of hung instances. MSCS communicates the status of each Oracle Real Application Clusters database instance to the monitor so that the monitor has a global view of all of the database instances on the system.
Figure 2-6 shows a three-node cluster that includes nodes ntclu41, ntclu42, and ntclu43. An Oracle Real Application Clusters database, MyDB, has been configured into the cluster using Oracle Real Application Clusters Guard. The status of each database instance contained within a group is reported to the global monitor, currently on ntclu42.
If one or more of the instances fails or hangs as detected through MSCS Is Alive polling, the problem is reported to the global monitor. The database hang detection, termination, and restart policies determine what should be done with an unresponsive or failed instance. Section 2.3.2 describes how these policies are applied to Oracle Real Application Clusters database instances.
As with monitoring, the responses to an unavailable node, a non-database cluster resource, and an Oracle Real Application Clusters database instance are each handled a little differently. However, the objective in all cases is to maintain the availability of the Oracle Real Application Clusters database to clients. The following sections describe how failures are handled and availability is restored when any one of these components becomes unavailable.
Availability to the database instance associated with a listener, virtual address, or cluster node is maintained by failing over the group containing the resources, rerouting the client request using an operation called connect-time failover, or both, as follows:
If a listener fails, the cluster software attempts to restart the listener on the current node. If it cannot be restarted, then the listener is left in a failed state, and the instance and virtual address are left online. Clients that were connected to the database instance associated with the failed listener continue uninterrupted. New attempts to connect to the database instance associated with the failed listener are directed to another database instance through an operation called connect-time failover.
If a virtual address fails, the cluster software attempts to restart the virtual address on the current node. If it cannot be restarted, then the group containing the virtual address fails over to another node and the virtual address is brought online. The database instance and listener are not brought online on the new node. The group is left in a partially failed state. Clients reconnect to a database instance on another node through connect-time failover.
If a node fails or is taken offline, the cluster software fails over the groups belonging to that node to another possible owner node. The database instance and listener are not brought online on the new node, but the virtual address is. The database listener and instance are left in a failed state and thus the group will be in a partially failed state. Clients connect to the same address and make a connection to another database instance on another node through connect-time failover. When the failed node comes back online, Oracle Real Application Clusters Guard moves the group back to its preferred owner node.
A connect-time failover is a process by which a client connect request is forwarded to another listener if the first listener is not responding or if the database instance associated with that listener is unavailable. Clients that want to connect to any instance of an unconfigured Oracle Real Application Clusters database can take advantage of connect-time failover to ensure that they can connect to the database as long as at least one instance is running.
However, a significant delay can occur during connect-time failover for an unconfigured Oracle Real Application Clusters database due to TCP/IP timeout. If a node fails and new connection requests are made to that node's IP address, each connection request waits for the duration of the TCP/IP timeout interval before connecting to an instance on a running node.
When you configure an Oracle Real Application Clusters database into an MSCS cluster using Oracle Real Application Clusters Guard, the TCP/IP timeout is avoided for new connections as long as the virtual address associated with the instance is available. If the virtual address is up and running, new requests for an instance on a failed node do not wait the duration of the timeout period. Requests for the connection are refused immediately and are routed transparently to another instance.
Oracle Real Application Clusters Guard keeps the virtual address associated with an instance running as follows:
Oracle Real Application Clusters Guard, therefore, provides much faster connect-time failover by ensuring that the virtual address is available and thus eliminating TCP/IP timeout delays.
Figure 2-7 illustrates a connection request to database db.us.acme.com, when an entry in tnsnames.ora file appears as follows:
    db.us.acme.com=
      (description=
        (load_balance=on)
        (failover=on)
        (address_list=
          (address=(protocol=tcp)(host=138.2.26.155)(port=1521))
          (address=(protocol=tcp)(host=138.2.26.156)(port=1521)))
        (connect_data=
          (service_name=op.us.acme.com)))
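The effect of the load_balance and failover settings in this entry can be sketched in a few lines of Python. This is a simplified model, not the Oracle Net implementation: it shuffles the address list when load balancing is on, then tries each address until one accepts the connection. The function name and the injectable `connect` parameter are hypothetical and exist only to make the sketch testable without a network.

```python
import random
import socket

# The two addresses from the tnsnames.ora entry above.
ADDRESSES = [("138.2.26.155", 1521), ("138.2.26.156", 1521)]

def connect_with_failover(addresses, load_balance=True, connect=None, timeout=5.0):
    """Try each address in turn; return (address, connection) for the first success."""
    if connect is None:
        # Default to a plain TCP connection attempt.
        connect = lambda addr: socket.create_connection(addr, timeout=timeout)
    candidates = list(addresses)
    if load_balance:
        random.shuffle(candidates)  # pick the starting address at random
    last_error = None
    for addr in candidates:
        try:
            return addr, connect(addr)
        except OSError as exc:  # refused, unreachable, or timed out
            last_error = exc
    raise ConnectionError(f"no listener reachable: {last_error}")
```

When the virtual address is still online but no listener is running behind it, the attempt is refused at once and the loop moves to the next address immediately. When the address itself is unreachable, each attempt blocks until the timeout expires, which is the delay that keeping the virtual address available is meant to avoid.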
The group failover policy specifies the number of times during a given time period that the cluster software should allow the group to fail over before that group is taken offline. The failover policy provides a means to prevent a group from failing over repeatedly.
Values for group failover policy options are set to default values when you use the Oracle Real Application Clusters Guard Manager Configure Database Wizard. However, you can reset the values in these policy options with the Group Failover property page, shown in Figure 2-8. (To access this page, select the group of interest in the Oracle Real Application Clusters Guard Manager tree view and then click the Failover tab.)
Figure 2-8 shows the page for setting group failover policy.
The group failover policy consists of a failover threshold and a failover period:
The failover threshold specifies the maximum number of times a group failover can occur (during the failover period) before the cluster software stops attempting to fail over the group.
The failover period is the time during which the cluster software counts the number of times a group failover occurs. If the frequency of failover is greater than that specified for the failover threshold during the period specified for the failover period, then the cluster software stops attempting to fail over the group.
For example, if the group failover threshold is 3 and the failover period is 5 hours, the cluster software allows the group to fail over 3 times within 5 hours before discontinuing failovers for that group.
When the first group failover occurs, a timer to measure the failover period is set to 0 and a counter to measure the number of failovers is set to 1. The timer is not reset to 0 when the failover period expires. Instead, the timer is reset to 0 when the first failover occurs after the failover period has expired.
For example, assume again that the group failover period is 5 hours and the failover threshold is 3. As shown in Figure 2-9, when the first group failover occurs at point A, the timer is set to 0. Assume a second group failover occurs 4.5 hours later at point B, and the third group failover occurs at point C. The failover period has been exceeded when the third group failover occurs. Therefore, at point C, group failovers are allowed to continue, the timer is reset to 0, and the counter is reset to 1.
Assume that another group failover occurs at point D. If you look at the entire timeline, you might expect that group failovers will be discontinued. The group failovers at points B, C, and D have occurred within a 5-hour timeframe. However, because the timer for measuring the failover period was reset to 0 at point C, the failover threshold has not been exceeded, and the cluster software allows the group to fail over.
Assume that another group failover occurs at point E. When a problem that ordinarily results in a group failover occurs at point F, the cluster software does not fail over the group. Three failovers have occurred during the 5-hour period that has passed since the timer was reset to 0 at point C. The cluster software leaves the group on the current node in a failed state.
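The timer and counter behavior described above can be modeled with a small class. This is a sketch of the rules as stated in the text, not the cluster software's code. The class and method names are hypothetical, and treating a failover that falls exactly at the end of the period as still inside it is an assumption.

```python
class GroupFailoverPolicy:
    """Model of the failover threshold and failover period for one group."""

    def __init__(self, threshold, period_hours):
        self.threshold = threshold
        self.period_hours = period_hours
        self.timer_start = None  # time at which the timer was last set to 0
        self.count = 0           # failovers counted since the timer was set

    def allow_failover(self, now_hours):
        """Return True if a failover at now_hours is permitted."""
        expired = (
            self.timer_start is not None
            and now_hours - self.timer_start > self.period_hours
        )
        if self.timer_start is None or expired:
            # First failover ever, or first failover after the period expired:
            # reset the timer to 0 and the counter to 1.
            self.timer_start = now_hours
            self.count = 1
            return True
        if self.count < self.threshold:
            self.count += 1
            return True
        return False  # threshold reached within the period
```

With a threshold of 3 and a period of 5 hours, calling `allow_failover` at the illustrative times 0, 4.5, 6, 8, 10, and 10.5 hours (standing in for points A through F; the actual positions in Figure 2-9 are not reproduced here) permits the first five failovers and refuses the sixth.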
Sometimes group failovers occur more frequently than desired. For example, suppose a Northeast database instance resource is in a group called Customers_NodeA, and you specify the following:
Assume the virtual address for the group fails. Oracle Real Application Clusters Guard attempts to restart the virtual address on the current node. The attempts to restart the virtual address fail; therefore, the Customers_NodeA group fails over to another node.
On that node, attempts by Oracle Real Application Clusters Guard to restart the group's virtual address also fail, so the Customers_NodeA group fails over again. Oracle Real Application Clusters Guard will continue attempts to restart the virtual address and the Customers_NodeA group will continue to fail over until the virtual address restarts or the group has failed over 20 times within a 1-hour period. If the virtual address cannot be restarted, and the group fails over fewer than 20 times within a 1-hour time period, the Customers_NodeA group will fail over repeatedly. In such a case, consider reducing the failover threshold to reduce the likelihood of repeated failovers.
Oracle Real Application Clusters Guard uses a series of database policies to determine if a database instance has failed or is hanging, and if so, how to resolve the problem.
A quick check on the instance is done through Looks Alive polling. Looks Alive polling checks the health of the instance by confirming that the service is running. A more thorough check of the database is also performed at regular intervals (and if Looks Alive polling fails), as follows:
The flow chart in Figure 2-10 illustrates this process.
The following sections describe the restart, hang detection, and termination policies in detail. These policies are the same regardless of whether the deployment is a default n-node deployment or a primary/secondary deployment. However, the interpretation of the restart policy is different for a primary/secondary deployment. For details on how the instance restart policy is interpreted for a primary/secondary deployment, see Section 2.4.1.
Note: You can write a script that Oracle Real Application Clusters Guard will run when certain database instance state changes occur, such as when a database is placed online, taken offline, or terminated. For more information, see Section 3.6.
If Is Alive polling for an instance returns a failure status, Oracle Real Application Clusters Guard assumes that the instance has failed (or been terminated) and applies the database restart policy. The restart policy for Oracle Real Application Clusters database instances configured into an MSCS cluster with Oracle Real Application Clusters Guard is specified and managed at the database level, rather than at the instance resource level. This means that Oracle Real Application Clusters Guard, rather than the cluster software (MSCS), manages Oracle Real Application Clusters database instance restart policy.
Therefore, when Is Alive polling detects an instance failure, the instance is left in a failed state by the cluster software. However, the Oracle Real Application Clusters Guard global monitor is notified of the problem, and it examines the overall database restart policy to determine if the instance should be restarted. If so, the global monitor calls the cluster software to attempt to bring the instance online. (Under no circumstance of instance failure or hang is the group failed over to another node.)
As shown in Figure 2-11, there are three database restart options:
Oracle Real Application Clusters Guard will not attempt to restart any instance.
Oracle Real Application Clusters Guard will attempt to restart the most-recently failed or terminated instance on its current node if no other instance for the database is online.
Oracle Real Application Clusters Guard will attempt to restart any failed or terminated instance on its current node. When you select Always Restart Any Instance, you also specify the following:
Specifies the number of times within the specified time period that you want Oracle Real Application Clusters Guard to attempt to bring the instance online. If the instance cannot be brought online within the specified parameters, then it is left in a failed state and the group is left in a partially offline state. (Other resources within the group might remain online.) Valid values are in the range between 1 and 100 (inclusive) for the number of times to restart, and are in the range between 1 and 2000 (inclusive) for the number of minutes.
If you attempt to manually bring the instance online (using Oracle Real Application Clusters Guard or MSCS Cluster Administrator), then the counter for the y value is reset to zero; this is true regardless of whether the attempt was successful or not. The following examples describe what happens when you do and do not intervene to restart a failed instance. Assume you set this parameter to "Restart 2 times within 30 minutes" and the timer is currently at zero:
Specifies that Oracle Real Application Clusters Guard should wait the specified delay period before attempting to restart a failed or terminated instance. The purpose of the delay is to avoid conflicts that might occur if an instance were to be restarted while another instance was performing recovery operations for a failed instance.
Indicates that regardless of the values you specify for the preceding parameter, there will be no delay prior to restarting a terminated or failed instance if it was the last surviving instance for the database prior to being terminated or failing.
Regardless of the value you specify, there will be no delay prior to restarting a failed or terminated instance if that instance was the last one running prior to failing.
Specifies that Oracle Real Application Clusters Guard should not restart the instance if any other instance is hung or in the hang detection process. Typically, if an instance is hung, you do not want to restart another instance because it will stress the system further.
Specifies that Oracle Real Application Clusters Guard should not restart the instance if Oracle Real Application Clusters Guard terminated it because it was hung. The exception is when this instance was the last instance online before it failed.
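The restart options and their exceptions can be summarized as a single decision function. The following is a simplified sketch under stated assumptions: the enum members and parameter names are hypothetical (only Always Restart Any Instance is named in the text), the restart delay is omitted, and "the most-recently failed instance" is reduced to "the instance being evaluated".

```python
from enum import Enum

class RestartOption(Enum):
    NEVER = "never restart"
    ONLY_IF_NONE_ONLINE = "restart only if no other instance is online"
    ALWAYS = "always restart any instance"

def should_restart(option, other_instances_online, *,
                   attempts_in_window=0, max_attempts=1,
                   another_instance_hung=False, skip_if_other_hung=False,
                   terminated_for_hang=False, skip_if_terminated_for_hang=False):
    """Decide whether a failed or terminated instance should be restarted."""
    was_last_instance = other_instances_online == 0
    if option is RestartOption.NEVER:
        return False
    if option is RestartOption.ONLY_IF_NONE_ONLINE:
        return was_last_instance
    # RestartOption.ALWAYS: apply the additional settings.
    if not 1 <= max_attempts <= 100:
        raise ValueError("number of restarts must be between 1 and 100")
    if attempts_in_window >= max_attempts:
        return False  # restart attempts for the time period are used up
    if skip_if_other_hung and another_instance_hung:
        return False  # avoid stressing the system further
    if skip_if_terminated_for_hang and terminated_for_hang and not was_last_instance:
        return False  # exception: the last instance online is still restarted
    return True
```

For instance, with the always-restart option and the "do not restart if terminated because hung" setting enabled, an instance terminated for a hang is restarted only when no other instance was online.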
When an instance is unresponsive (as defined by a lack of response from the Is Alive query within the Pending timeout period), Oracle Real Application Clusters Guard checks several parameters to determine whether the unresponsiveness is due to a database instance hang, or an event that is more processing-intensive than most. If Oracle Real Application Clusters Guard determines that an Oracle Real Application Clusters database instance is hung, Oracle Real Application Clusters Guard may terminate one or more instances (in an effort to resolve the problem) based on the database termination policy. Oracle Real Application Clusters Guard uses the termination policy to determine if, when, and how many instances it can terminate in an effort to resolve instance hangs.
Oracle Real Application Clusters Guard checks for several processing-intensive events by executing a query designed to determine if a specified event is occurring. Events that Oracle Real Application Clusters Guard checks for include logon storms, parse storms, instance recovery, lock remastering, and stuck archiver.
In the Oracle Real Application Clusters Guard Manager Hang Detection property page, you can specify the parameters for what is considered a logon storm or parse storm, as follows:
Defines what you want Oracle Real Application Clusters Guard to consider a logon storm. A logon storm is an event that occurs when a large number of users connect to the instance within a small window of time. This might occur, for example, when many employees arrive at work and connect to the database at the same time every day.
Oracle Real Application Clusters Guard checks for a logon storm by querying the instance to determine if more than the specified number of users have connected to the instance within the specified period of time.
Valid values are between 1 and 999 (inclusive) for the number of user connections and between 1 and 600 (inclusive) for the number of seconds.
Defines what you want Oracle Real Application Clusters Guard to consider a parse storm. A parse storm is an event that occurs when many more than the typical load of queries are executed against the instance within a given time period.
Oracle Real Application Clusters Guard checks for a parse storm by querying the database for the total number of parse calls. If the number of parse calls per second is greater than the value you specify, Oracle Real Application Clusters Guard assumes that a parse storm is occurring.
A valid value for the number of parse calls is between 1 and 999 calls (inclusive).
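The storm definitions and their valid ranges can be sketched as follows. This is only an illustration of the comparisons described above, not the queries that Oracle Real Application Clusters Guard runs; the class, function, and field names are hypothetical.

```python
from dataclasses import dataclass


def _check(name, value, low, high):
    """Reject a setting that falls outside its documented inclusive range."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high} (inclusive), got {value}")


@dataclass(frozen=True)
class StormThresholds:
    """Hypothetical container for the logon storm and parse storm definitions."""
    logon_connections: int    # user connections, 1 to 999
    logon_window_secs: int    # length of the window in seconds, 1 to 600
    parse_calls_per_sec: int  # parse calls per second, 1 to 999

    def __post_init__(self):
        _check("logon_connections", self.logon_connections, 1, 999)
        _check("logon_window_secs", self.logon_window_secs, 1, 600)
        _check("parse_calls_per_sec", self.parse_calls_per_sec, 1, 999)


def is_logon_storm(thresholds, connections_in_window):
    """True if more than the specified number of users connected within the window."""
    return connections_in_window > thresholds.logon_connections


def is_parse_storm(thresholds, parse_calls, elapsed_secs):
    """True if the parse call rate is greater than the specified value."""
    return parse_calls / elapsed_secs > thresholds.parse_calls_per_sec
```

For example, with `StormThresholds(100, 60, 500)`, 101 connections within the 60-second window count as a logon storm, while exactly 100 do not.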
You can also specify whether Oracle Real Application Clusters Guard should check for each of the listed processing-intensive events by selecting or clearing the check box next to each event. Each event has an associated timeout value that you can adjust in the Oracle Real Application Clusters Guard Manager Hang Detection property page, as shown in Figure 2-12. If a query for an event does not return success or failure within the specified timeout period, then Oracle Real Application Clusters Guard checks for the next selected event. The following list describes the timeout for each event:
The amount of time Oracle Real Application Clusters Guard can spend checking for a logon storm event.
A valid value for this timeout is between 5 and 600 seconds (inclusive).
The amount of time Oracle Real Application Clusters Guard can spend checking for an instance recovery event. When remastering of a failed instance's locks completes, surviving instances clean up the in-progress transactions of the failed instance. This is known as instance recovery.
A valid value for this timeout is between 5 and 600 seconds (inclusive).
The amount of time Oracle Real Application Clusters Guard can spend checking for a lock remastering event. When an instance fails, surviving instances remaster lock resources from the failed instance. During this phase, all lock information is discarded and each surviving instance reacquires all the locks it held at the time of the failure.
A valid value for this timeout is between 5 and 600 seconds (inclusive).
The amount of time Oracle Real Application Clusters Guard can spend checking for a stuck archiver event. You can set up instances to automatically archive each group of online redo log files after it becomes an inactive redo log. If you have enabled automatic archiving, the archiver can become unresponsive, making the instance itself appear unresponsive.
A valid value for this timeout is between 5 and 600 seconds (inclusive).
The amount of time Oracle Real Application Clusters Guard can spend checking for a parse storm event.
A valid value for this timeout is between 5 and 600 seconds (inclusive).
Figure 2-12 shows the Oracle Real Application Clusters Guard Manager Hang Detection property page, which lets you define the parameters for a logon storm and a parse storm and adjust the timeout values.
When you set timeout values, consider that if success or failure is not returned for any event presented in the preceding list, an instance can be unresponsive for a time period equal to the sum of all the timeout values (maximum total timeout) before Oracle Real Application Clusters Guard takes further action. For example, if the timeout value for each event is 300 seconds (5 minutes), it is possible that the instance (or instances) will be unresponsive for 1500 seconds (25 minutes) before Oracle Real Application Clusters Guard applies the database termination policy. Conversely, if the timeout value for each event is set too low, an instance might be erroneously deemed hung and terminated when it is not hung.
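The maximum total timeout in this example is simple arithmetic over the five event timeouts. The following minimal sketch reproduces it; the function and dictionary names are hypothetical and exist only for illustration.

```python
TIMEOUT_MIN_SECS = 5    # documented lower bound for each event timeout
TIMEOUT_MAX_SECS = 600  # documented upper bound for each event timeout


def max_total_timeout(timeouts):
    """Return the worst-case unresponsive period: the sum of all event timeouts."""
    for event, secs in timeouts.items():
        if not TIMEOUT_MIN_SECS <= secs <= TIMEOUT_MAX_SECS:
            raise ValueError(f"timeout for {event} must be 5 to 600 seconds, got {secs}")
    return sum(timeouts.values())


# The five events from the preceding list, each set to 300 seconds (5 minutes).
timeouts = {
    "logon storm": 300,
    "instance recovery": 300,
    "lock remastering": 300,
    "stuck archiver": 300,
    "parse storm": 300,
}

total = max_total_timeout(timeouts)
print(f"{total} seconds ({total // 60} minutes)")  # prints "1500 seconds (25 minutes)"
```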
If Oracle Real Application Clusters Guard determines that an instance is hung, then it applies the database termination policy, as described in Section 2.3.2.3.
Once Oracle Real Application Clusters Guard determines that an instance is hung, the monitor applies the termination policy set for the Oracle Real Application Clusters database, as shown in Figure 2-13.
There are two basic termination policy options:
When this policy is selected, a hung instance is never terminated. It is left to the DBA to resolve the hang. When this option is selected, it is possible for all instances to become hung.
When this policy is selected, Oracle Real Application Clusters Guard terminates hung instances, one at a time, with the expectation that terminating a hung instance may allow other hung instances to resume operations. However, Oracle Real Application Clusters Guard never terminates an instance while another instance is being restarted or is coming offline. This is to ensure that Oracle Real Application Clusters Guard does not put an extra burden on the database while it is going through a transitional phase. In addition, if both instances in a primary/secondary deployment are hung, Oracle Real Application Clusters Guard always terminates the primary instance first.
If you allow hung instances to be terminated, you can also specify parameters for termination, as follows:
Oracle Real Application Clusters Guard terminates the first hung instance immediately. If additional instances are hung, one is terminated at the rate you specify.
Specifies the maximum number of database instances that can be terminated when the database is hung. For example, if you specify a value of 2, Oracle Real Application Clusters Guard will terminate up to two instances per database hang. If a third instance becomes hung or is still hung after two instances have been terminated, then the third will not be terminated.
If instances are still hung when the maximum has been reached, a database administrator should intervene to resolve the problem.
Oracle Real Application Clusters Guard uses an internal algorithm to determine which instance or instances will be terminated when several are hung; you cannot predict which hung instance Oracle Real Application Clusters Guard will select for termination unless the instances are in a primary/secondary deployment. When both instances in a primary/secondary deployment are hung, Oracle Real Application Clusters Guard always terminates the primary instance first.
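The interaction of the termination rate and the maximum number of terminations can be sketched as a schedule. This is a simplified model under stated assumptions: every listed instance stays hung, no instance is restarting or going offline, the rate is expressed as a hypothetical interval in seconds, and the input order stands in for the internal selection algorithm, which is not documented. Only the primary-first rule is taken from the text.

```python
def plan_terminations(hung, max_terminations, interval_secs, primary=None):
    """Return (seconds_after_hang_detected, instance) pairs for a hung database.

    hung             -- names of the hung instances (order is a placeholder for
                        the undocumented internal selection algorithm)
    max_terminations -- maximum number of instances to terminate per hang
    interval_secs    -- assumed spacing between successive terminations
    primary          -- name of the primary instance in a primary/secondary
                        deployment, or None for a default n-node deployment
    """
    # A hung primary instance is always terminated first; sorted() is stable,
    # so the remaining instances keep their input order.
    order = sorted(hung, key=lambda name: name != primary)
    # The first termination is immediate (offset 0); later ones follow at the rate.
    return [(i * interval_secs, name) for i, name in enumerate(order[:max_terminations])]
```

With three hung instances and a maximum of 2, `plan_terminations(["inst1", "inst2", "inst3"], 2, 120)` schedules only two terminations; the third instance is left for the database administrator.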
After a hung instance is terminated, Oracle Real Application Clusters Guard checks the instance restart policy to determine if an attempt should be made to bring that instance back online. See Section 2.3.2.1 for information on how the restart policy is applied.
Failures affect users and applications as follows:
Users and applications connected to the instance associated with the failure lose the connection and must reconnect to an available instance (using connect-time failover) to continue processing.
Any transactions that were in progress and uncommitted at the time of the failure are rolled back. A surviving instance replays the online redo log files of the failed instance.
Clients who attempt to establish a new connection to a failed instance or through a failed listener are redirected to a different instance using connect-time failover.
Client applications that are cluster-aware experience a brief interruption in service; to the client applications, it appears that a node was quickly rebooted. In most cases, the means to connect to a running instance is provided automatically, without operator intervention.
See Section 3.11 for information about cluster-aware applications.
Oracle Real Application Clusters supports a primary/secondary instance deployment. The primary/secondary instance deployment lets you configure a basic two-node high-availability system for Oracle Real Application Clusters. An instance designated as the primary instance on one node accepts user connections, while an instance designated as the secondary instance on the other node accepts connections when the primary node fails, or when specifically selected through the INSTANCE_ROLE parameter in the CONNECT_DATA portion of the tnsnames.ora file.
You specify the primary/secondary deployment by setting the ACTIVE_INSTANCE_COUNT parameter in each instance's initialization parameter file (init<sid>.ora) to 1. In a primary/secondary deployment, the instance that mounts the database first assumes the role of primary instance. The second instance to mount the database assumes the role of secondary instance. If the primary instance is shut down or fails, the secondary instance automatically assumes the primary role. When the failed instance returns to active status, it assumes the role of secondary instance. Figure 2-14 shows the Oracle Real Application Clusters Guard Manager property page that displays the role of an instance.
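The role rules in the preceding paragraph behave like a small state machine. The following sketch models only those rules (the first instance to mount is primary, the second is secondary, and a surviving secondary takes over the primary role); the class and method names are hypothetical and are not part of any Oracle interface.

```python
class PrimarySecondaryPair:
    """Toy model of instance roles when ACTIVE_INSTANCE_COUNT is set to 1."""

    def __init__(self):
        self.roles = {}  # instance name -> "primary" or "secondary"

    def mount(self, name):
        """An instance mounts the database and receives a role."""
        if name in self.roles:
            return self.roles[name]
        if len(self.roles) >= 2:
            raise ValueError("a primary/secondary deployment has only two instances")
        # The first instance to mount is primary; the next one is secondary.
        role = "secondary" if "primary" in self.roles.values() else "primary"
        self.roles[name] = role
        return role

    def fail(self, name):
        """An instance fails or is shut down; a surviving secondary becomes primary."""
        role = self.roles.pop(name)
        if role == "primary":
            for other in self.roles:
                self.roles[other] = "primary"
```

In this model, if instance A mounts first and then fails, instance B becomes primary, and A returns in the secondary role when it mounts again.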
The Oracle Net listener enforces the routing of work requests to the primary and secondary instances by using the INSTANCE_ROLE parameter in the CONNECT_DATA portion of the tnsnames.ora file.
All locks are mastered by the primary instance only, which minimizes communication between nodes and improves performance.
The instance restart policy for a primary/secondary instance deployment is as described in Section 2.3.2.1. However, the interpretation of this policy is complicated by the instance roles (primary instance role and secondary instance role) and failover operations that might occur in this configuration. The following sections provide examples to describe how the instance roles and instance restart policy interact.
During typical operations, the nodes running the primary and secondary instances are up and operational. Group A, containing instance A in the primary role, is running on node A. Group B, containing instance B in the secondary role, is running on node B. If the primary instance fails, but the secondary instance is still running, then the following occurs:
Oracle Real Application Clusters Guard leaves instance A in a failed state and stops its associated listener. Instance B has the primary role and instance A (and its listener) remain in a failed state.
Because instance B is still running, Oracle Real Application Clusters Guard leaves instance A in a failed state and stops its associated listener. Instance B has the primary role and instance A (and its listener) remain in a failed state. (However, if instance B were to fail, then Oracle Real Application Clusters Guard would restart instance B because there would be no other instance online.)
Oracle Real Application Clusters Guard restarts instance A in the secondary instance role. Instance B has the primary role.
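The three outcomes above differ only in whether the failed instance is restarted. The following sketch captures that decision. The policy names in the enumeration are hypothetical labels inferred from the described behavior; see Section 2.3.2.1 for the actual restart policy settings.

```python
from enum import Enum


class RestartPolicy(Enum):
    """Hypothetical labels for the three behaviors described above."""
    NEVER = "never restart"
    IF_NO_OTHER_ONLINE = "restart only if no other instance is online"
    ALWAYS = "always restart"


def should_restart(policy, other_instance_online):
    """Decide whether a failed instance is restarted under the given policy."""
    if policy is RestartPolicy.ALWAYS:
        return True
    if policy is RestartPolicy.IF_NO_OTHER_ONLINE:
        return not other_instance_online
    return False
```

In the scenario above, instance B is still online when instance A fails, so only the always-restart behavior brings instance A back, and it returns in the secondary role.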
During typical operations, the nodes running the primary and secondary instances are up and operational. Group A, containing instance A in the primary role, is running on Node A. Group B, containing instance B in the secondary role, is running on Node B. If Node A fails, but instance B is still running, then the following occurs:
Instance A is left in a failed state and Oracle Real Application Clusters Guard stops its associated listener. Using the preceding example, instance B now has the primary role and instance A is in a failed state on Node B.
Because the instance on Node B is still running, instance A is left in a failed state on Node B.
Oracle Real Application Clusters Guard takes Group A offline, moves it back to Node A, and brings the group back online. Instance A is restarted with the secondary instance role. The role of each instance is the reverse of its original role.
During typical operations, the nodes running the primary and secondary instances are up and operational. Group A, containing instance A in the primary role, is running on node A. Group B, containing instance B in the secondary role, is running on node B. If the secondary instance fails, but the primary instance is still running, then Oracle Real Application Clusters leaves the roles as they are. Oracle Real Application Clusters Guard applies the restart policy to determine whether or not it should restart the failed secondary instance, as follows:
Instance B is left in a failed state on Node B and Oracle Real Application Clusters Guard stops its associated listener.
Instance B is left in a failed state and Oracle Real Application Clusters Guard stops its associated listener.
Oracle Real Application Clusters Guard attempts to restart instance B in the secondary instance role.
This section describes the Oracle Real Application Clusters Guard Manager commands that are available for managing a primary/secondary Oracle Real Application Clusters deployment. These commands allow you to move the primary role to the secondary instance, swap roles between instances, stop the secondary instance, and restore the secondary role to an instance. The commands are commonly used for planned outages (hardware and operating system upgrades) and for recovering from unplanned outages.
Table 2-1 lists the relevant commands available on the Real Application Clusters menu of Oracle Real Application Clusters Guard Manager, their effect, and some common usages.
Note that you can also use the Oracle Real Application Clusters Guard Manager Place Online and Take Offline commands with instances in a primary/secondary configuration. However, when you use these commands, you might have to issue several commands in a specific order to achieve the desired results. For example, to swap roles between instances, you must issue the Take Offline command with the instance that holds the primary role, then issue a Place Online command with that instance. When you use the commands designed specifically for managing a primary/secondary Oracle Real Application Clusters deployment, the swap is made with a single Switchover command.
The following example demonstrates some typical uses of these commands. The role assignments are as follows for the Oracle Real Application Clusters database:
The Move Primary command takes the primary instance offline, which results in a role failover. Role assignments are now as follows:
Real Application Clusters -> Restore
Role assignments are now as follows:
Both instances are now online, but the instance roles are the reverse of their original assignment. You can leave them as they are, particularly if both nodes are the same in terms of processing power and memory. However, for the purposes of this example, assume you want to return the instances to their original roles.
Real Application Clusters -> Switchover
Sales2 is taken offline, which results in a role failover, then Sales2 is placed back online and reassigned with the secondary instance role. Role assignments are back to their original instances, as follows:
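The effect of the three commands on instance roles can be traced with a small model. This sketch assumes two instances, with Sales1 (an assumed name; only Sales2 appears in the text) initially primary and Sales2 secondary. The functions are hypothetical and only mirror the role changes described in this example; they do not call Oracle Real Application Clusters Guard Manager.

```python
def move_primary(roles):
    """Take the primary instance offline; the secondary assumes the primary role."""
    result = {}
    for name, role in roles.items():
        if role == "primary":
            result[name] = "offline"
        elif role == "secondary":
            result[name] = "primary"
        else:
            result[name] = role
    return result


def restore(roles):
    """Bring the offline instance back online in the secondary role."""
    return {name: ("secondary" if role == "offline" else role) for name, role in roles.items()}


def switchover(roles):
    """Swap roles in one step: the primary goes offline, then returns as secondary."""
    return restore(move_primary(roles))


roles = {"Sales1": "primary", "Sales2": "secondary"}  # assumed starting assignment
roles = move_primary(roles)  # Sales1 offline, Sales2 primary
roles = restore(roles)       # Sales1 secondary, Sales2 primary
roles = switchover(roles)    # Sales1 primary, Sales2 secondary
```

After the three commands the instances hold their original roles again, matching the final state described above.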
Copyright © 2001 Oracle Corporation. All Rights Reserved.