3 Oracle Collaboration Suite High Availability Concepts

This chapter discusses concepts of high availability as they relate to Oracle Collaboration Suite. It contains the following sections:

Introduction
High Availability Terminology
Oracle Collaboration Suite High-Availability Features
Oracle Collaboration Suite High-Availability Framework
Oracle Collaboration Suite RAC-Enabled Architecture
External Load Balancers
Redundant Architectures in Oracle Collaboration Suite
Causes of Downtime

Introduction

Database software and the Internet have introduced a new level of worldwide collaboration and information sharing by extending the reach of database applications throughout organizations and communities. This new level of information availability highlights the importance of high availability in data management solutions. Both small businesses and global enterprises have users all over the world who require access to data 24 hours a day. Without this access, operations can stop, and revenue is lost. This potential loss of revenue underscores the necessity of high availability in a business's infrastructure.

What is Availability?

Availability is the degree to which an application, service, or functionality is available upon user demand. Oracle Collaboration Suite is designed to provide a wide variety of high-availability solutions, ranging from load balancing and basic clustering to providing maximum system availability during catastrophic hardware and software failures.
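As a rough illustration of the definition above, availability is often expressed as the percentage of time a service is usable. The sketch below assumes that common convention (uptime divided by total observed time); the function names and the figures in the usage note are illustrative and do not come from Oracle Collaboration Suite.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the total observed time."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_hours / total


def allowed_downtime_hours(target_percent: float, period_hours: float) -> float:
    """Return the downtime budget that a target availability leaves in a period."""
    return period_hours * (1.0 - target_percent / 100.0)
```

For example, under this convention a target of 99.9 percent over a year of 8760 hours leaves a downtime budget of roughly 8.76 hours, which must cover planned and unplanned outages alike.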

The following are the primary characteristics of a highly available solution:

Reliability: Reliable hardware is only one component of a high availability solution. Reliable software, including the database, Web servers, and applications, is just as critical to implementing a highly available solution.
Recoverability: It is important to determine what types of failures may occur in your high availability environment, and how to recover from those failures in the time that meets your business requirements.
Timely error detection: If a component in your architecture fails, fast detection is essential to recovering from the failure.
Continuous operations: Continuous access to data is essential when very little or no downtime is acceptable for maintenance activities. Activities such as moving a table to another location within the database, or adding CPUs to your hardware, should be transparent to the end user in a high availability architecture.

High Availability Terminology

The definitions of the following terms are useful for understanding the concepts of high availability in Oracle Collaboration Suite.

If you are already familiar with high availability terminology, you can skip this section and go directly to Oracle Collaboration Suite High-Availability Features.

Active-Active: A system is termed active-active if all the equivalent members of that system service requests concurrently and none of them is in standby mode.
Active-Passive: A system is termed active-passive if some members of that system actively service requests while other members are inactive. The inactive members are called passive members. They are not activated until one or more of the active members have failed.
Hardware Cluster: A hardware cluster is a collection of computers that appears to clients as a single system and provides network services (for example: an IP address) or application services (for example: databases, Web servers).

A hardware cluster achieves high availability and scalability through the use of specialized hardware (cluster interconnect, shared storage) and software (health monitors, resource monitors). The cluster interconnect is a private link that the hardware cluster uses to exchange heartbeat information and detect node death. A heartbeat is a periodic message sent between nodes to detect the failure of any node. Because they require specialized hardware and software, hardware clusters are commonly provided by hardware vendors, such as Sun, HP, IBM, and Dell. The number of nodes that can be configured in a hardware cluster is vendor dependent. For the purpose of high availability of Oracle Collaboration Suite, only two nodes are required.
Cluster Agent: Software that runs on a node of a hardware cluster and coordinates availability and performance operations with the other nodes. A cluster agent can automate service failover.
Clusterware: Software that manages the operations of the members of a cluster as a single system. Clusterware provides resource grouping, monitoring, and the ability to move services between cluster members.
Primary Node: The node that actively runs one or more infrastructure installations at any given time. If this node fails, the infrastructure is failed over to the secondary node. Because the primary node runs the active infrastructure installations, it is considered the hot node.
Secondary Node: The node that takes over the execution of the infrastructure if the primary node fails. Because the secondary node does not originally run the infrastructure, it is considered the cold node. Because the application fails over from a hot node (primary) to a cold node (secondary), this type of failover is called cold failover.
OracleAS Cluster (Identity Management): In this configuration, Oracle Identity Management components (Oracle Internet Directory, OracleAS Single Sign-On, Oracle Delegated Administration Services, and Oracle Directory Integration and Provisioning) are deployed together in two or more nodes. Each node runs all of the Oracle Identity Management components. The traffic to these nodes is load balanced by a redundant load balancer.
Failover: In a high-availability system, the transfer of the operations of a failed node to an equivalent node is termed failover. Failover ensures the continuity of services to customers.

If the system is an active-passive system, the passive member is activated during the failover operation and requests are directed to it instead of the failed member. If the system is an active-active system, the load balancer detects the failure and automatically redirects requests for the failed member to the surviving active members.
Failback: After a system undergoes a successful failover operation, the process of repairing the failed node and restoring it as an active node is termed failback. This process returns the system to its pre-failure configuration.
Switchover: During normal operation, active members of Oracle Collaboration Suite may require maintenance or upgrading. A switchover process can be initiated to let a substitute member take over the workload of the member that requires maintenance or upgrading (scheduled outage). The switchover operation ensures continued service to clients of Oracle Collaboration Suite.
Switchback: After a switchover operation, the process of activating the upgraded member is known as switchback. This process brings the system back to the pre-switchover configuration.
Physical Host Name: The physical host name is the internal name of the current machine. On UNIX, this is the name returned by the hostname command.
Network Host Name: A network host name is a name assigned to an IP address, either through the /etc/hosts file (on UNIX) or the C:\WINDOWS\system32\drivers\etc\hosts file (on Windows), or through DNS resolution. Often, the network host name and the physical host name are identical. However, each system has only one physical host name but may have multiple network host names. Thus, a system's network host name may not always be its physical host name.
Virtual IP: Also known as a cluster virtual IP or a load balancer virtual IP. Generally, a virtual IP can be assigned to a hardware cluster or to a load balancer. To present a single system view of a cluster to network clients, a virtual IP serves as an entry point IP address to the group of servers that are members of the cluster. A hardware cluster uses a cluster virtual IP to present the entry point into the cluster. A cluster virtual IP can also be set up on a standalone machine.

The hardware cluster's software manages the movement of this IP address between the two physical nodes of the cluster, while clients connect to this IP address without knowing which physical node this IP address is currently active on.

A load balancer also uses a virtual IP as the entry point to a set of servers. These servers tend to be active at the same time. This virtual IP address is not assigned to any individual server but to the load balancer which acts as a proxy between servers and their clients.
Virtual Host Name: Virtual host name is a network-addressable host name that maps to one or more physical machines through a load balancer or a hardware cluster. For load balancers, virtual server name is used interchangeably with virtual host name. A load balancer can hold a virtual host name on behalf of a set of servers, and clients communicate indirectly with the systems using the virtual host name.

For example, if the two physical host names of the hardware cluster are node1.mycompany.com and node2.mycompany.com, the single view of this cluster can be provided by the name selfservice.mycompany.com. In the DNS, selfservice.mycompany.com maps to the virtual IP address of the Oracle Collaboration Suite Infrastructure, while Oracle Collaboration Suite Applications connect to the IP address without knowing which physical node is active and actually servicing a particular request.
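The virtual host name example above can be sketched in a few lines of Python. This is a minimal model, not Oracle or vendor clusterware code: the VirtualHost class and its methods are hypothetical, and only the host names are taken from the example. It shows that clients keep addressing one stable name while the physical node behind it changes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VirtualHost:
    """Hypothetical model of a virtual host name backed by cluster nodes."""
    name: str                 # the name clients connect to
    nodes: List[str]          # physical host names of the cluster nodes
    active_index: int = 0     # position of the node currently holding the virtual IP

    def active_node(self) -> str:
        """Return the physical node that currently services requests."""
        return self.nodes[self.active_index]

    def fail_over(self) -> str:
        """Move the virtual IP to the next node and return that node's name."""
        self.active_index = (self.active_index + 1) % len(self.nodes)
        return self.active_node()


if __name__ == "__main__":
    vh = VirtualHost(
        "selfservice.mycompany.com",
        ["node1.mycompany.com", "node2.mycompany.com"],
    )
    print(vh.name, "->", vh.active_node())
    vh.fail_over()
    print(vh.name, "->", vh.active_node())
```

In a real deployment the movement of the virtual IP is handled by the hardware cluster's software; the point of the sketch is only that the client-facing name never changes.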

Oracle Collaboration Suite High-Availability Features

Scheduled outages can disrupt operations, especially in global enterprises that support users in multiple time zones. In this case, it is important to design a system to minimize planned interruptions. It is important to consider not only the time to perform the upgrade but also the effect the changes may have on the overall application.

Oracle Collaboration Suite offers local high-availability solutions, which provide high availability within a single data center deployment. Local high-availability solutions protect against process, node, and media failures as well as human errors. Geographically distributed disaster recovery solutions protect against local physical disasters. To ensure high availability, a number of technologies and best practices are recommended.

In addition to architectural redundancies, the following local high-availability solutions are also necessary in Oracle Collaboration Suite:

Process Death Detection and Automatic Restart: Processes may stop unexpectedly due to configuration or software problems. A proper process monitoring and restart system monitors all system processes constantly and restarts them if problems appear. The system should also track the number of restarts within a specified time interval, because restarting a process continuously within short time periods may lead to additional faults or failures.
Clustering: Clustering nodes together allows them to be viewed functionally as a single entity from the perspective of a client, for both run-time processing and manageability. A cluster is a set of processes that share the same workload and run on a single computer or on multiple computers.
Configuration Management: Similar components of a clustered group often need to share a common configuration. Proper configuration management ensures that Oracle Collaboration Suite nodes provide the same reply to the same incoming request, allows these nodes to synchronize their configurations, and provides highly available configuration management for less administration downtime.
Server Load Balancing and Failover: Client requests to the nodes of Oracle Collaboration Suite Applications can be load balanced to ensure that the nodes have roughly the same workload. With a load-balancing mechanism in place, and a set of redundant nodes, if any of the nodes or Oracle Collaboration Suite Applications instances fail, the load balancer will route requests to the surviving nodes or instances.
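The restart limit described under Process Death Detection and Automatic Restart can be sketched as a sliding-window counter. This is an illustrative policy only, not the mechanism Oracle Collaboration Suite uses: the RestartPolicy class, its parameters, and the thresholds in the usage note are assumptions.

```python
from collections import deque


class RestartPolicy:
    """Hypothetical policy: allow at most max_restarts per sliding time window."""

    def __init__(self, max_restarts: int, window_seconds: float):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restart_times = deque()  # timestamps of recent restarts

    def allow_restart(self, now: float) -> bool:
        """Return True and record the restart if the limit is not exceeded."""
        # Forget restarts that have aged out of the sliding window.
        while self._restart_times and now - self._restart_times[0] >= self.window_seconds:
            self._restart_times.popleft()
        if len(self._restart_times) >= self.max_restarts:
            return False  # too many recent restarts; leave the process down
        self._restart_times.append(now)
        return True
```

With RestartPolicy(2, 60), a process that dies three times within one minute is restarted twice and then left down for an administrator to investigate, which avoids the restart loops that the text warns about.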

Oracle Collaboration Suite High-Availability Framework

Oracle Collaboration Suite consists of different components deployed on multiple tiers. The availability of each component has a direct impact on the availability of the system. Besides providing high-availability features, Oracle Collaboration Suite must also be secure. This would ensure that both the Internet and intranet users can use the system without compromising availability and security.

High Availability in Oracle Collaboration Suite Applications

Oracle Collaboration Suite provides several features for ensuring application-level high availability. Oracle Collaboration Suite Applications is installed on separate systems outside Oracle Collaboration Suite Infrastructure.

In a cluster architecture, Oracle Collaboration Suite Applications is installed independently on each node of the farm. Oracle Collaboration Suite Applications can also be installed on a dedicated set of redundant machines, separate from Oracle Collaboration Suite Infrastructure.

High Availability in Oracle Collaboration Suite Infrastructure

Oracle Collaboration Suite provides a completely integrated infrastructure and framework for development and deployment of enterprise applications. Oracle Collaboration Suite Infrastructure provides centralized product metadata, security and management services, and configuration information and data repositories for Oracle Collaboration Suite Applications. The time and effort for developing enterprise applications are reduced by integrating the infrastructure services required by the applications. In turn, the total cost of developing and deploying these applications is reduced, and the deployed applications are more reliable.

For Oracle Collaboration Suite Infrastructure to provide all essential services, all of its components must be available. Any high-availability solution must be able to detect and recover from software failures of the processes associated with the Oracle Collaboration Suite Infrastructure components. Solutions must also be able to detect and recover from hardware failures on the hosts that are running Oracle Collaboration Suite Infrastructure.

To ensure high availability, Identity Management is installed separately against the Oracle Collaboration Suite Database on multiple nonclustered systems. All the components of Identity Management can either be co-located on the same system or distributed on separate systems.

Oracle Collaboration Suite RAC-Enabled Architecture

An Oracle Collaboration Suite cluster is a set of Oracle Collaboration Suite nodes configured to act in concert to deliver greater scalability and availability than a single node can provide. While a single Oracle Collaboration Suite node can leverage the operating resources of only a single host, a cluster can span multiple hosts and distribute application execution over a greater number of CPUs. While a single Oracle Collaboration Suite node is vulnerable to the failure of its host and operating system, a cluster continues to function despite the loss of an operating system or host, hiding any such failure from clients.

Clusters leverage the combined power and reliability of multiple Oracle Collaboration Suite nodes while maintaining the simplicity of a single Oracle Collaboration Suite node. For example, browser clients of applications running in a cluster interact with the applications as if the applications were running on a single server. With appropriate front-end load balancing, any node in an Oracle Collaboration Suite cluster can serve client requests. This simplifies configuration and deployment across multiple nodes and enables fault tolerance among clustered nodes.

What is RAC?

Oracle Real Application Clusters (RAC) allows the Oracle database to run any packaged or custom application unchanged across a set of clustered servers. This capability provides the highest levels of availability and the most flexible scalability. If a clustered server fails, the Oracle database continues running on the surviving servers. When more processing power is needed, another server can be added without interrupting users' access to data.

RAC enables multiple instances that are linked by an interconnect to share access to an Oracle database. In a RAC environment, the Oracle database runs on two or more systems in a cluster while concurrently accessing a single shared database. The result is a single database system that spans multiple hardware systems yet appears as a single unified database system to the application. This enables RAC to provide high availability, scalability, and redundancy during failures within the cluster. RAC accommodates all system types, from read-only data warehouse (DSS) systems to update-intensive online transaction processing (OLTP) systems.

High availability configurations have redundant hardware and software that maintain operations by avoiding single points of failure. To accomplish this, Oracle Clusterware is installed as part of the RAC installation process. Oracle Clusterware is a portable solution that is integrated and designed specifically for the Oracle database. In a RAC environment, Oracle Clusterware monitors all Oracle components (such as instances and listeners). If a failure occurs, Oracle Clusterware automatically attempts to restart the failed component. Other non-Oracle processes can also be managed by Oracle Clusterware. During outages, Oracle Clusterware relocates the processing performed by the inoperative component to a backup component. For example, if a node in the cluster fails, Oracle Clusterware causes client processes running on the failed node to reconnect and resume running on a surviving node.

Oracle Clusterware requires two files: the Oracle Cluster Registry (OCR) and the voting disk. To avoid single points of failure, Oracle Clusterware automatically maintains redundant copies of these files. Oracle Clusterware also enables you to replace a damaged copy of the OCR online. Oracle's recovery processes quickly re-master resources, recover partial or failed transactions, and rapidly restore the system.

RAC provides the following benefits:

Ability to tolerate and quickly recover from computer and instance failures
Fast, automatic, and intelligent connection and service relocation and failover
Rolling patch upgrades for qualified one-off patches
Rolling release upgrades of Oracle Clusterware
Load balancing advisory
Runtime connection load balancing
Flexibility to scale up processing capacity using commodity hardware without downtime or changes to the application
Comprehensive manageability integrating database and cluster features

RAC Configuration for Oracle Collaboration Suite

The RAC configuration for Oracle Collaboration Suite consists of Oracle Collaboration Suite Database deployed on a cluster with two or more nodes. Each Oracle Collaboration Suite Database node has a local copy of the Oracle Collaboration Suite software installed. There is a single Oracle Collaboration Suite Database which is shared by all the nodes.

Oracle database instances exist on each node and concurrently open the database for read or write operations. All Oracle Collaboration Suite database-related processes as well as the database listener on all the nodes use the same network port numbers for any communication. Thus each node is equivalent to another in terms of configuration and is active concurrently with other nodes.

Oracle Collaboration Suite Applications requests for Oracle Collaboration Suite Database services are equally met from all the nodes.

If a load balancer is required, it is configured to direct incoming requests to any one of the Oracle Collaboration Suite Database nodes. The load balancer is used only for non-Oracle Net traffic (HTTP, HTTPS, LDAP, and so on). Oracle Net traffic is expected to go directly to the nodes and is balanced across them using Oracle Net connect descriptors that contain multiple addresses in their address lists. RAC uses high-speed interprocess communication for internode communication.
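To make the multi-address connect descriptor concrete, the following sketch assembles one as a string. The helper function is hypothetical and exists only for illustration; the host names, the port 1521, and the service name are assumed values. LOAD_BALANCE and FAILOVER are standard Oracle Net address-list parameters, but the exact descriptor that a given Oracle Collaboration Suite installation uses may differ.

```python
def build_connect_descriptor(hosts, port, service_name):
    """Build an Oracle Net connect descriptor with one ADDRESS per RAC node.

    Hypothetical helper for illustration; all argument values are assumptions.
    """
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in hosts
    )
    return (
        "(DESCRIPTION="
        # LOAD_BALANCE spreads connections across the listed addresses;
        # FAILOVER lets the client try the next address if one is unreachable.
        f"(ADDRESS_LIST=(LOAD_BALANCE=on)(FAILOVER=on){addresses})"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name}))"
        ")"
    )


if __name__ == "__main__":
    print(build_connect_descriptor(
        ["node1.mycompany.com", "node2.mycompany.com"],
        1521,
        "ocsdb.mycompany.com",
    ))
```

Because every node appears in the address list, Oracle Net clients can reach a surviving instance without the external load balancer being involved.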

Oracle Calendar Server does not support RAC because Oracle Calendar Server supports only active-passive configuration.

Outages in RAC

Some of the unscheduled outages in RAC can be due to Oracle instance failure or database node failure. In case of Oracle instance failure, the load balancer is notified. The load balancer stops the non-Oracle Net traffic to this node and redirects the traffic to another active node. In case of database node failure, the load balancer detects that the node is gone, stops the traffic to the node, and redirects the traffic to another active node.

Some scheduled outages in RAC can be due to configuration changes on a node or maintenance of nodes. The configuration changes are implemented on all nodes either manually or through a process or command interface. For node maintenance, all processes on the node are brought down and the load balancer is notified of the unavailability of the node. The node is brought up after maintenance and all processes are restarted. Then the load balancer is notified of the availability of the node.

External Load Balancers

Load balancers can be employed to improve the availability of both clustered and non-clustered Oracle Collaboration Suite nodes.

Clients access the cluster through a load balancer that hides the cluster configuration. Because any node can service any request, the load balancer can send requests to any Oracle Collaboration Suite node in the cluster. Administrators can raise the capacity of the system by adding Oracle Collaboration Suite nodes to the cluster.

Load balancers can also be used to increase the availability of non-clusterable Oracle Collaboration Suite nodes. As long as the load balancer is configured to serve a set of nodes, it routes requests accordingly.

Benefits of External Load Balancers

The three main benefits of using external load balancers in Oracle Collaboration Suite are as follows:

Scalability: Load balancers improve scalability by providing an access point through which requests are routed to one of the available nodes. Nodes can be added to the group that the load balancer serves to accommodate additional users.

Availability: Load balancers improve availability by routing requests to the most available nodes. If one node goes down, or is too busy, a load balancer can send requests to another active node instead.

Manageability: Load balancers improve manageability by routing application deployment and system configuration requests to the most available node.
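The scalability and availability benefits listed above can be illustrated with a small round-robin router that skips failed nodes. This is a simplified sketch, not the behavior of any particular load balancer product: the RoundRobinBalancer class and its method names are assumptions, and real load balancers typically use health checks and more sophisticated routing policies.

```python
class RoundRobinBalancer:
    """Hypothetical round-robin load balancer that skips nodes marked down."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._healthy = set(self._nodes)
        self._next = 0  # index of the next node to try

    def add_node(self, node):
        """Scalability: add a node to the group that the balancer serves."""
        self._nodes.append(node)
        self._healthy.add(node)

    def mark_down(self, node):
        """Availability: stop routing requests to a failed or busy node."""
        self._healthy.discard(node)

    def mark_up(self, node):
        """Resume routing to a node after it has been repaired."""
        if node in self._nodes:
            self._healthy.add(node)

    def route(self):
        """Return the next healthy node, trying each node at most once."""
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if node in self._healthy:
                return node
        raise RuntimeError("no healthy nodes available")
```

Clients call only route(), so adding a node or marking one down changes where requests go without any change on the client side.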

Types of External Load Balancers

The main types of external load balancers that can be used with Oracle Collaboration Suite nodes include the following:

Hardware Load Balancer

Hardware load balancing involves placing a hardware load balancer in front of a group of Oracle Collaboration Suite nodes. The hardware load balancer routes requests to the nodes in a client-transparent fashion.

Leveraging Web Cache as an External Load Balancer

Web Cache supports content-aware load balancing and failover detection for Web based applications. These features ensure that cache misses are directed to the most available and highest performing application server.

Redundant Architectures in Oracle Collaboration Suite

Oracle Collaboration Suite provides support for redundant nodes as follows:

Database node redundancy through RAC
Identity Management node redundancy through OracleAS Cluster (Identity Management)
Middle Tier redundancy through the use of an external load balancer in front of multiple middle tiers

These redundant configurations provide increased availability either through a distributed workload, a failover setup or both. The configuration can be an active-active configuration or an active-passive configuration. These configurations are discussed in detail below.

Oracle Collaboration Suite Active-Active Configurations

The active-active configuration deploys two or more active Oracle Collaboration Suite nodes and can be used to improve scalability as well as to provide high availability. All nodes handle requests concurrently. With the preceding redundancies, Oracle Collaboration Suite components are active on all nodes at the same time.

Active-active solutions provide a robust cluster architecture for Oracle Collaboration Suite and are a transparent high availability solution. Because all the nodes are active, failover from one node to another is quick and requires no manual intervention. Active-active setups also provide scalability for Oracle Collaboration Suite. This configuration leverages the Real Application Clusters (RAC) feature of the Oracle database for running the Oracle Collaboration Suite database. Each node in the hardware cluster has its own ORACLE_HOME, which contains the configuration files and binaries needed to run Oracle Collaboration Suite on that node. Oracle Collaboration Suite installation across nodes is accomplished in one process. Additionally, all nodes access a set of shared files on the RAC database.

Features of an Active-Active Configuration

The features of an Oracle Collaboration Suite active-active configuration are as follows:

Identical Node Configuration: The nodes are meant to serve the same workload or application. Their configuration guarantees that they deliver the same reply to the same request. Some configuration properties may be identical and others may be node-specific, such as local host name information.
Equivalence Management: Changes made to one node will usually need to be propagated to the other nodes in an active-active configuration. This is done to maintain equivalence among all the nodes.
Independent Operation: To provide maximum availability, the loss of one Oracle Collaboration Suite node in an active-active configuration should not affect the ability of the other nodes to serve requests.
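The equivalence requirement described above can be checked mechanically by comparing node configurations while ignoring node-specific properties. The sketch below is illustrative only: the config_differences function and the property names in the usage example (host_name, http_port, session_timeout) are assumptions, not actual Oracle Collaboration Suite settings.

```python
def config_differences(reference, other, node_specific_keys):
    """Return properties that differ between two node configurations.

    Properties listed in node_specific_keys (for example, the local host
    name) are expected to differ and are excluded from the comparison.
    """
    keys = (set(reference) | set(other)) - set(node_specific_keys)
    return {
        key: (reference.get(key), other.get(key))
        for key in sorted(keys)
        if reference.get(key) != other.get(key)
    }


if __name__ == "__main__":
    node1 = {"host_name": "node1", "http_port": 7777, "session_timeout": 30}
    node2 = {"host_name": "node2", "http_port": 7777, "session_timeout": 45}
    # Only session_timeout is reported; host_name is node-specific.
    print(config_differences(node1, node2, {"host_name"}))
```

An empty result means the two nodes are equivalent for the properties that must match; any reported property is a change that still needs to be propagated.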

Advantages of an Active-Active Configuration

The advantages of an Oracle Collaboration Suite active-active configuration are as follows:

Increased Availability: An active-active configuration is a redundant configuration. Loss of one node can be tolerated because another node can continue to serve the same requests.
Increased Scalability and Performance: Multiple identically-configured nodes provide the capability to have a distributed workload shared among different machines and processes. If configured correctly, new nodes can also be added as the demand of the application grows.

Oracle Collaboration Suite Active-Passive Configurations

The active-passive configuration deploys an active node of Oracle Collaboration Suite that handles requests, and a passive Oracle Collaboration Suite node, which is on standby. In addition, a heartbeat mechanism is set up between these two nodes. This mechanism is provided and managed through vendor-specific clusterware. Generally, vendor-specific cluster agents are also available to automatically monitor the cluster nodes and fail over between them. When the active node fails, an agent shuts down the active node completely, brings up the passive node, and enables application services to successfully resume processing. As a result, the active and passive roles are switched. Active-passive configurations in a cluster are also generally referred to as cold failover clusters.

Oracle Collaboration Suite is active on only one node in the cluster at any time. When Oracle Collaboration Suite on the active node goes down, the cluster software brings up Oracle Collaboration Suite on one of the inactive nodes, with the same virtual host name as the failed node. Although there is some minimal downtime, this allows for faster recovery times on the middle tier, because the middle tier does not need to be reconfigured to point to a new Oracle Collaboration Suite node. From the perspective of middle tier applications, the new active node in the active-passive configuration is identical to the node that failed.
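The heartbeat-driven cold failover described above can be sketched as follows. This is a simplified model under stated assumptions, not vendor clusterware: the ColdFailoverCluster class, the timeout value, and the node names are hypothetical, and a real cluster agent would also stop and start the Oracle Collaboration Suite processes and move the virtual IP.

```python
class ColdFailoverCluster:
    """Hypothetical two-node active-passive cluster driven by heartbeats."""

    def __init__(self, active, passive, heartbeat_timeout):
        self.active = active
        self.passive = passive
        self.heartbeat_timeout = heartbeat_timeout  # seconds without a heartbeat
        self.last_heartbeat = 0.0

    def record_heartbeat(self, now):
        """Record that the active node reported in at time `now`."""
        self.last_heartbeat = now

    def check(self, now):
        """Fail over if the active node missed its heartbeat; return the active node."""
        if now - self.last_heartbeat > self.heartbeat_timeout:
            # Switch roles: the passive node becomes active and vice versa.
            self.active, self.passive = self.passive, self.active
            # The new active node starts with a fresh heartbeat window.
            self.last_heartbeat = now
        return self.active


if __name__ == "__main__":
    cluster = ColdFailoverCluster("node1", "node2", heartbeat_timeout=5.0)
    cluster.record_heartbeat(10.0)
    print(cluster.check(12.0))  # heartbeat is recent, node1 stays active
    print(cluster.check(16.0))  # heartbeat is stale, node2 takes over
```

The timeout is the main design trade-off: a short timeout detects failures quickly but risks failing over on a transient network delay, while a long timeout extends the outage before the passive node is started.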

Features of an Active-Passive Configuration

The features of an Oracle Collaboration Suite active-passive configuration are as follows:

Shared Storage: The passive Oracle Collaboration Suite node in an active-passive configuration has access to the same Oracle binaries, configuration files, and data as the active node.
Failover Procedure: An active-passive configuration usually requires a set of scripts and procedures to detect failure of the active instance and to fail over to the passive instance while minimizing downtime.

Advantages of an Active-Passive Configuration

The advantages of an Oracle Collaboration Suite active-passive configuration are as follows:

Availability: If the active node fails for any reason or must be taken offline, an identically configured passive node can take over with minimal downtime.
Reduced Operation Costs: In an active-passive configuration, only one set of processes is up and servicing requests. Managing a single active node generally requires less effort than managing an array of active nodes.
Application Independence: Some applications may not be suited to an active-active configuration. This includes applications which rely heavily on application state or on information stored locally. An active-passive configuration has only one node serving requests at any particular time.

Causes of Downtime

One of the challenges in designing a high availability solution is examining and addressing all the possible causes of downtime. It is important to consider causes of both unplanned and planned downtime when designing a fault tolerant and resilient IT infrastructure. Planned downtime can be just as disruptive to operations, especially in global enterprises that support users in multiple time zones.

The following table describes the outage categories and provides examples of each outage type.

Category	Outage Type	Description	Examples
Unplanned	Computer failure	A computer failure outage occurs when the system running the database becomes unavailable because it has shut down or is no longer accessible.	Database system hardware failure; Operating system failure; Oracle instance failure; Network interface failure
Unplanned	Storage failure	A storage failure outage occurs when the storage resource holding some or all of the database contents becomes unavailable because it has shut down or is no longer accessible.	Disk drive failure; Disk controller failure; Storage array failure
Unplanned	Human error	A human error outage occurs when unintentional (or perhaps malicious) actions are committed that cause data within the database to become logically corrupt or unusable.	Dropped database object; Inadvertent data changes; Malicious data changes
Unplanned	Data corruption	A data corruption outage occurs when a hardware or software component causes corrupt data to be read or written to the database. The service level impact of a data corruption outage can vary from a small portion of the database (down to a single database block) to a large portion of the database (rendering it essentially unusable).	Operating system or storage device driver, host bus adapter, disk controller, or volume manager error causing bad disk reads or writes; Stray writes by operating system or other application software
Unplanned	Site failure	A site failure outage occurs when an event causes a significant portion of an application to stop processing or to slow to an unusable service level. A site failure may affect all processing at a data center or a subset of applications supported by a data center.	Extended site-wide power failure; Site-wide network failure; Natural disaster making a data center inoperable; Terrorist or malicious attack on operations or the site
Planned	System changes	Planned system changes occur when performing routine and periodic maintenance operations as well as new deployments. Planned system changes include any scheduled changes to the operating environment that occur outside the organizational data structure within the database. The service level impact of a planned system change varies significantly depending on the nature and scope of the planned outage, the testing and validation efforts made prior to implementing the change, and the technologies and features in place to minimize the impact.	Adding/removing processors to/from an SMP server; Adding/removing nodes to/from a cluster; Adding/removing disk drives or storage arrays; Changing configuration parameters; Upgrading/patching system hardware and software; Upgrading/patching Oracle software; Upgrading/patching application software; System platform migration; Database relocation
Planned	Data changes	Planned data changes occur when there are changes to the logical structure or physical organization of Oracle database objects. The primary objective of these changes is to improve performance or manageability.	Table definition changes; Adding table partitioning; Creating and rebuilding indexes