1 Introduction to Oracle Fail Safe

Increasingly, businesses expect products and services to be available 24 hours a day, 365 days a year. While no solution can ensure 100% availability, Oracle Fail Safe minimizes the downtime of Oracle databases and other applications running on Microsoft clusters and configured with Microsoft Cluster Server (MSCS).

This chapter discusses the following topics:

What Is Oracle Fail Safe?
Benefits of Oracle Fail Safe
A Typical Oracle Fail Safe Configuration
Deploying Oracle Fail Safe Solutions

1.1 What Is Oracle Fail Safe?

Oracle Fail Safe is easy-to-use software that works with Microsoft Cluster Server (MSCS) to provide highly available business solutions on Microsoft clusters. A cluster is a configuration of two or more Microsoft Windows systems that makes them appear to network users as a single, highly available system. Each system in a cluster is referred to as a cluster node.

Oracle Fail Safe works with MSCS cluster software to provide high availability for applications and single-instance databases running on a cluster. When a cluster node fails, the cluster software moves its workload to the surviving node based on parameters that you configure using Oracle Fail Safe. This operation is called a failover.

With Oracle Fail Safe, you can reduce downtime for single-instance Oracle databases and almost any application that can be configured as a Microsoft Windows service.

Oracle Fail Safe consists of Oracle Services for MSCS and Oracle Fail Safe Manager:

Oracle Services for MSCS works with the MSCS software to configure fast, automatic failover during planned and unplanned outages for resources configured for high availability. These resources can be the Oracle database or other Microsoft Windows services, along with the software and hardware upon which these items depend. In addition, Oracle Services for MSCS can attempt to restart a failed software resource so that a failover from one cluster node to another may not be required.

Note:
Oracle Services for MSCS was referred to as Oracle Fail Safe Server in previous releases of Oracle Fail Safe.

Oracle Fail Safe Manager provides a user-friendly interface and wizards that help you to configure and manage cluster resources, and troubleshooting tools that help you to diagnose problems.

Together, these components enable rapid deployment of highly available database, application, and Internet business solutions.
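The restart-before-failover behavior described above can be illustrated with a minimal Python sketch. It is a conceptual model only, not Oracle Fail Safe or MSCS code: the FailoverPolicy class, its max_restart_attempts parameter, and the try_start callback are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FailoverPolicy:
    # Hypothetical parameter: local restart attempts before failing over.
    max_restart_attempts: int = 3


def recover(resource: str,
            current_node: str,
            other_nodes: List[str],
            try_start: Callable[[str, str], bool],
            policy: FailoverPolicy) -> Optional[str]:
    """Return the node that ends up hosting the resource, or None."""
    # First, try to restart the resource on the node where it was running.
    for _ in range(policy.max_restart_attempts):
        if try_start(resource, current_node):
            return current_node
    # Local restarts are exhausted: try each remaining node in turn.
    for node in other_nodes:
        if try_start(resource, node):
            return node
    return None


if __name__ == "__main__":
    # Simulate a cluster in which nothing can start on NodeA.
    def node_a_is_down(resource: str, node: str) -> bool:
        return node != "NodeA"

    print(recover("SalesDB", "NodeA", ["NodeB"], node_a_is_down, FailoverPolicy()))
    # prints "NodeB"
```

In this model a failover happens only after the configured number of local restart attempts has failed, which mirrors the idea that a restart on the same node can make a failover unnecessary.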

1.2 Benefits of Oracle Fail Safe

Oracle Fail Safe provides the key benefits discussed in the following sections:

Highly Available Resources and Applications
Ease of Use
Product Accessibility
Ease of Integration with Applications

1.2.1 Highly Available Resources and Applications

Oracle Fail Safe works with MSCS to configure both hardware and software resources for high availability. Once configured, the multiple nodes in the cluster appear to end users and clients as a single virtual server; end users and client applications connect to a single, fixed network address, called a virtual address, without requiring any knowledge of the underlying cluster. If one node in the cluster becomes unavailable, then MSCS moves the workload of the failed node (and client requests) to another node.

For example, the left side of Figure 1-1 shows a two-node cluster configuration where both nodes are available and actively processing transactions. On the surface, this configuration may seem no different from setting up two independent servers, except that the storage subsystem is configured so that the disks are connected physically to both nodes by a shared storage interconnect. Although both nodes are physically connected to the same disks, MSCS ensures that each disk can be owned and accessed by only one node at a time.

The right side of Figure 1-1 shows how, when hardware or software becomes unavailable on one node, its workload automatically moves (fails over) to the surviving node and is restarted, without administrator intervention. During the failover, ownership of the cluster disks is released from the failed server (Node A) and acquired by the surviving server (Node B). If a single-instance Oracle database was running on Node A, then Oracle Fail Safe restarts the database instance on Node B. Clients then can access the database through Node B using the same virtual address that they used to access the database when it was hosted by Node A.
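The ownership transfer in this example can be modeled with a short Python sketch. It is a toy simulation under stated assumptions, not the MSCS implementation: the Cluster class, the disk names, and the virtual address name salesdb-vip are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Cluster:
    nodes: Dict[str, bool]                                      # node -> is it up?
    disk_owner: Dict[str, str] = field(default_factory=dict)    # disk -> owning node
    address_host: Dict[str, str] = field(default_factory=dict)  # virtual address -> node

    def fail_over(self, failed: str, survivor: str) -> None:
        if not self.nodes.get(survivor, False):
            raise RuntimeError(f"{survivor} is not available")
        self.nodes[failed] = False
        # Each disk has exactly one owner at a time; ownership is released
        # by the failed node and acquired by the survivor.
        for disk, owner in self.disk_owner.items():
            if owner == failed:
                self.disk_owner[disk] = survivor
        # The virtual address itself does not change, only the node behind it.
        for address, host in self.address_host.items():
            if host == failed:
                self.address_host[address] = survivor


if __name__ == "__main__":
    cluster = Cluster(
        nodes={"NodeA": True, "NodeB": True},
        disk_owner={"DiskE": "NodeA", "DiskF": "NodeB"},
        address_host={"salesdb-vip": "NodeA"},
    )
    cluster.fail_over("NodeA", "NodeB")
    print(cluster.disk_owner)    # prints {'DiskE': 'NodeB', 'DiskF': 'NodeB'}
    print(cluster.address_host)  # prints {'salesdb-vip': 'NodeB'}
```

Because clients only ever see the key of address_host, they keep using the same address before and after the failover.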

Figure 1-1 Failover with Oracle Fail Safe in a Microsoft Cluster

1.2.2 Ease of Use

Because of the numerous hardware and software components involved, configuring software and all of its dependent components (for example, disks, IP addresses, network) to work in a cluster can be a complex process. In contrast, Oracle Fail Safe is designed to be easy to install, administer, and use, and it simplifies configuration of software in a cluster.

Installation: Using Oracle Universal Installer, you can install Oracle Fail Safe either interactively or in silent mode. With the silent mode installation method, you install software by supplying input to Oracle Universal Installer with a response file. Also, you can perform rolling upgrades of both the operating system and application software. Rolling upgrades minimize downtime by allowing one cluster node to continue hosting the cluster workload while the other system is being upgraded. See Oracle Fail Safe Installation Guide for more information.

Administration and Use: Oracle Fail Safe Manager provides a user-friendly interface to set up, configure, and manage applications and databases on the cluster. Oracle Fail Safe Manager provides wizards that automate the configuration process and ensure that the configuration is replicated consistently across cluster nodes.

Oracle Fail Safe Manager includes:

A tree view of objects that displays multiple views of the same data to help you find information efficiently
Wizards that automate and simplify resource configuration, and drag-and-drop capabilities that help you quickly perform routine system maintenance, such as moving resources across nodes to balance the workload
An integrated family of verification tools that automatically diagnose and fix common configuration problems both before and after configuration
Online documentation, including a tutorial, help, and manuals available in HTML and PDF formats
A command-line interface (FSCMD) for managing the cluster through batch programs or scripts

Figure 1-2 shows an Oracle Fail Safe Manager window. The left pane displays a tree view showing multiple views (and the current state) of clusters and cluster resources. The right pane displays a property page that lists all groups on the cluster that have been selected from the tree view and the current state of those groups. Depending on the object chosen from the tree view, the display in the right pane changes. When you select a particular cluster, node, group, or resource, the property sheet for that cluster, node, group, or resource is displayed.

Figure 1-2 Oracle Fail Safe Manager

Figure 1-3 shows the Oracle Fail Safe Manager menus and the items within each menu.

Figure 1-3 Oracle Fail Safe Manager Menus and Contents

1.2.3 Product Accessibility

Oracle Fail Safe has two user interfaces: the FSCMD command-line interface and the Oracle Fail Safe Manager GUI. However, the Oracle Fail Safe Manager GUI is used more widely. The Oracle Fail Safe Manager GUI presents two panels: a navigation tree in the left panel and, in the right panel, the details of the item selected in the tree. Most of the time the right panel displays a tabbed set of property pages for the selection made in the navigation tree. At other times it displays a list of objects. Wizard pages are displayed when the user selects an action that requires multiple steps, such as adding a resource to a group.

Oracle Fail Safe Manager supports specific keyboard sequences for navigation that can be used instead of a pointer device. Oracle Fail Safe Manager can be accessed from the keyboard in the following ways:

In the main user interface window, use CTRL+T to navigate between the tree view and the multi-page (tabbed) property page.
Branches in the tree view are expanded using keyboard sequence ALT+V X and closed by using ALT+V B.

1.2.4 Ease of Integration with Applications

If you want to configure an existing application to access databases or other applications configured with Oracle Fail Safe, then few or no changes are required. Because applications always access cluster resources at the same virtual address, applications treat failover as a quick node restart.

After a failover occurs, database clients or users must reconnect and replay any transactions that were left undone (such as database transactions that were rolled back during instance recovery). Applications developed with OCI (including ODBC clients that use the Oracle ODBC driver) can take advantage of automatic reconnection after failover. See Section 7.9 for more information.
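For clients that do not use automatic reconnection, the reconnect-and-replay step can be handled with a retry loop. The following Python sketch shows one possible shape for such a loop. It is an illustration only: run_with_reconnect, connect, and transaction are hypothetical names, and the built-in ConnectionError stands in for whatever exception a real database driver raises when the connection is lost.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_reconnect(connect: Callable[[], object],
                       transaction: Callable[[object], T],
                       max_attempts: int = 5,
                       delay_seconds: float = 2.0) -> T:
    """Run a transaction, reconnecting and replaying it after a lost connection."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Always connect to the same virtual address; after a failover
            # it is served by the surviving node.
            connection = connect()
            # Replay the whole transaction, because uncommitted work was
            # rolled back during instance recovery.
            return transaction(connection)
        except ConnectionError as error:  # stand-in for a driver-specific error
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # give the failover time to complete
    raise RuntimeError("could not reconnect after failover") from last_error
```

The transaction callback should contain the complete unit of work, so that replaying it after a failover does not depend on any state from the lost session.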

1.3 A Typical Oracle Fail Safe Configuration

Oracle Fail Safe solutions can be deployed on any Microsoft Windows cluster certified by Microsoft for configuration with MSCS.

Most clusters are configured similarly, differing only in choice of storage interconnect (SCSI, Fibre Channel, or SAN) and in the way applications are deployed across the cluster nodes.

A typical cluster configuration includes the following hardware and software:

Hardware
- Microsoft cluster nodes, each with one or more local (private) disks where executable application files are installed.
- Private (heartbeat) interconnect between the nodes for intracluster communications.
- Public interconnect (Internet, Intranet, or both) to the local area network (LAN) or wide area network (WAN).
- NTFS formatted disks on the shared storage interconnect (SCSI, Fibre Channel, or SAN). All data files, log files, and other files that must fail over from one node to another are located on these cluster disks.
  
  Note:
  See the documentation for your cluster hardware for information about using redundant hardware, such as RAID, to further ensure high availability.
- Additional redundant components (UPS, network cards, disk controllers, and so on).
Software (installed on each node)
- Microsoft Windows
- Oracle Services for MSCS
- Oracle Fail Safe Manager (installed on one or more cluster nodes, one or more client workstations, or both)
- One or more of the following resources configured for high availability:
  - Oracle single-instance databases
  - Oracle Management Agent
  - Oracle applications or third-party applications that can be configured as Windows generic services

See Oracle Fail Safe Release Notes for information about the supported releases of these components.

Figure 1-4 shows the hardware and software components in a two-node cluster configured with Oracle Fail Safe. Note that the executable application files are installed on a private disk on each cluster node and the application data and log files reside on a shared cluster disk.

Figure 1-4 Hardware and Software Components Configured with Oracle Fail Safe

1.4 Deploying Oracle Fail Safe Solutions

Oracle Fail Safe works with MSCS to configure resources running on a cluster, to provide fast failover, and to minimize downtime during planned (system upgrades) and unplanned (hardware or software failure) outages.

Clusters provide high availability by managing:

Unplanned group failover

Clusters manage unplanned group failovers (failure of hardware or software components) in a way that is transparent to users. When one node on the cluster becomes unavailable, another node temporarily serves both its own workload and the workload from the failed node. When a resource fails and cannot be restarted on the current node, another node takes ownership of that resource (and any other resources upon which it depends) and attempts to restart it.

Planned failover

Clusters manage planned group failovers (those which you intentionally start, such as when you upgrade software on the cluster). You can fail over the resources to another node, perform a software or hardware upgrade, and then return the resources to the original node. (This is called failing back the resources.) Then, perform the same upgrade process on the other nodes in the cluster.
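The planned failover sequence for an upgrade (fail over, upgrade, fail back, then repeat on the next node) can be written out as an ordered plan. The Python sketch below only generates the list of steps. It does not call Oracle Fail Safe, and the function name, group names, and node names are hypothetical.

```python
from typing import Dict, List


def rolling_upgrade_plan(group_home: Dict[str, str], nodes: List[str]) -> List[str]:
    """Return the ordered steps of a rolling upgrade.

    group_home maps each group to the node that normally hosts it.
    """
    if len(nodes) < 2:
        raise ValueError("a rolling upgrade needs at least two nodes")
    steps: List[str] = []
    for index, node in enumerate(nodes):
        # Use the next node in the list as the temporary host.
        standby = nodes[(index + 1) % len(nodes)]
        groups = sorted(g for g, home in group_home.items() if home == node)
        for group in groups:
            steps.append(f"fail over {group} from {node} to {standby}")
        steps.append(f"upgrade {node}")
        for group in groups:
            steps.append(f"fail back {group} to {node}")
    return steps


if __name__ == "__main__":
    for step in rolling_upgrade_plan({"SalesDB": "NodeA"}, ["NodeA", "NodeB"]):
        print(step)
    # prints:
    # fail over SalesDB from NodeA to NodeB
    # upgrade NodeA
    # fail back SalesDB to NodeA
    # upgrade NodeB
```

At every point in the plan each group is hosted by a node that is not being upgraded, which is what keeps the workload available during the upgrade.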

Oracle Fail Safe also ensures efficient use of resources in the cluster environment by managing the following:

Independent workloads

The cluster nodes can serve separate workloads. For example, one node can host an Oracle database, and the others can host applications.

Load balancing

You can balance resources across the cluster nodes. For example, a database can be moved from a node that is heavily loaded to one that has spare capacity.
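As a rough illustration of the load-balancing decision, the following Python sketch suggests which group to move from the most heavily loaded node to the least loaded one. The load figures and the selection rule are assumptions made for this example. Oracle Fail Safe does not necessarily choose moves this way, and the function and group names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def suggest_move(group_load: Dict[str, Dict[str, float]]) -> Optional[Tuple[str, str, str]]:
    """Suggest (group, from_node, to_node), or None if no move helps.

    group_load maps node -> {group: load}, in any consistent unit.
    """
    if not group_load:
        return None
    totals = {node: sum(groups.values()) for node, groups in group_load.items()}
    busiest = max(totals, key=lambda node: totals[node])
    idlest = min(totals, key=lambda node: totals[node])
    gap = totals[busiest] - totals[idlest]
    # Moving a group helps only if its load is smaller than the current gap.
    candidates = {g: load for g, load in group_load[busiest].items() if 0 < load < gap}
    if not candidates:
        return None
    # After moving a group with load x, the gap becomes |gap - 2x|.
    best = min(candidates, key=lambda g: abs(gap - 2 * candidates[g]))
    return best, busiest, idlest


if __name__ == "__main__":
    loads = {
        "NodeA": {"SalesDB": 60.0, "ReportsDB": 30.0},
        "NodeB": {"WebApp": 20.0},
    }
    print(suggest_move(loads))  # prints ('ReportsDB', 'NodeA', 'NodeB')
```

In the example, moving ReportsDB leaves NodeA at 60 and NodeB at 50, a smaller imbalance than moving SalesDB would leave.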

Oracle Fail Safe has a variety of deployment options to satisfy a wide range of failover requirements. Chapter 3 explains how to configure an Oracle Fail Safe solution for your business needs, including active/passive solutions and active/active solutions.