1 Introduction to Oracle Collaboration Suite High Availability

This guide describes the Oracle Collaboration Suite high availability solutions. Oracle Collaboration Suite has incorporated recent advances in high availability that enable you to meet your business goals in a more efficient and affordable manner.

This chapter explains the concepts of high availability.

Overview of High Availability

Many enterprises use their information technology infrastructure to provide a competitive advantage, increase productivity, and empower users to make faster and more informed decisions. However, with these benefits comes an increasing dependence on that infrastructure. If a critical application becomes unavailable, then the entire business can be in jeopardy. The business can lose customers and revenue, incur penalties, or suffer bad publicity that adversely affects its stock price and customer base. Therefore, it is critical to examine how your data is protected and to maximize its availability to your users.

What is High Availability

Availability is the degree to which an application, service, or feature is available upon user demand. Availability is measured by the perception of the application user. Application users experience frustration when their data is unavailable, and they do not understand or care to differentiate between the complex components of an overall solution. Performance degradation due to higher than expected usage creates the same havoc as the failure of critical components in the solution.

Reliability, recoverability, timely error detection, and continuous operations are primary characteristics of a high availability solution:

Reliability: Reliable hardware is one component of a high availability solution. Reliable software (including the database, Web servers, and application) is an equally important component of a high availability solution.
Recoverability: There may be many choices in recovering from a failure if one occurs. It is important to determine what types of failures may occur in your high availability environment, and how to recover from those failures in the time that meets your business requirements. For example, if a critical table is accidentally deleted from the database, then what action should you take to recover it? Does your architecture provide the ability to recover in the time specified in a service level agreement?
Timely error detection: If a component in your architecture fails, then fast detection is another essential function in recovery. While you may be able to recover quickly from an outage, if it takes an additional 90 minutes to discover the problem, then you may not meet your service level agreement. Monitoring the health of your environment requires reliable software to view it quickly and the ability to notify the database administrator of a problem.
Continuous operations: Continuous access to your data is essential when very little or no downtime is acceptable to perform maintenance activities. Activities such as moving a table to another location within the database, or even adding additional CPUs to your hardware, should be transparent to users in a high availability architecture.
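
The interaction between detection time and recovery time described above can be illustrated with a minimal Python sketch. The function name, the 60-minute service level limit, and the 15-minute recovery time are hypothetical values chosen for illustration; only the 90-minute detection delay comes from the text.

```python
from datetime import timedelta


def meets_sla(detection: timedelta, recovery: timedelta, sla_limit: timedelta) -> bool:
    """Return True if the total outage (detection plus recovery) fits within the limit."""
    return detection + recovery <= sla_limit


# Hypothetical service level agreement: service restored within 60 minutes.
sla_limit = timedelta(minutes=60)
recovery = timedelta(minutes=15)  # assumed recovery time once the fault is known

# A 90-minute detection delay breaks the agreement even though recovery is quick.
slow = meets_sla(timedelta(minutes=90), recovery, sla_limit)
# An assumed 2-minute detection delay keeps the same recovery inside the limit.
fast = meets_sla(timedelta(minutes=2), recovery, sla_limit)
print(slow, fast)  # prints: False True
```

The point of the sketch is that the service level clock starts at the failure, not at the moment an administrator notices it.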

More specifically, a high availability architecture should have the following traits:

Be transparent to most failures
Provide built-in preventive measures
Provide monitoring and fast detection of failures
Provide fast recoverability
Automate the recovery operation
Protect the data so that there is minimal or no data loss
Implement the operational best practices to manage your environment
Provide the high availability solution to meet your service level agreement

As the number of mission-critical applications deployed on Internet and intranet environments increases, users expect higher quality of service and greater availability from these applications. A failure that directly affects critical customers or partners is unacceptable. Some failures may result in loss of data and business transactions (for example, in e-commerce and business integration systems), while others may simply affect user productivity. In turn, these outages generally result in lost business opportunities and an increase in the total cost of ownership of IT systems. Because of the constantly increasing number of systems used by employees and partners, high availability has shifted from being an occasional requirement to a general requisite that affects all types of deployments.

Oracle makes high availability solutions available to every customer regardless of the size of the enterprise. Small workgroups and global enterprises alike can extend their critical business applications. With Oracle and the Internet, applications and their data are now more reliably accessible everywhere at any time.

Oracle Collaboration Suite is designed to provide a wide variety of high availability solutions. These range from load balancing and basic clustering to maximum system availability during catastrophic hardware and software failures.

Causes of Downtime

One of the challenges in designing a high availability solution is examining and addressing all possible causes of downtime. It is important to consider causes of both unplanned and planned downtime when designing a fault-tolerant and resilient IT infrastructure. Planned downtime can be just as disruptive to operations as unplanned downtime, especially in global enterprises that support users in multiple time zones.

Table 1-1 describes the outage categories and provides examples of each outage type.

Table 1-1 Types of Outages

Category	Outage Type	Description	Examples
Unplanned	Computer failure	A computer failure occurs when the system that is running the database becomes unavailable because it has shut down or is no longer accessible.	Database system hardware failure; Operating system failure; Oracle instance failure; Network interface failure
Unplanned	Storage failure	A storage failure occurs when the storage containing some or all of the database content becomes unavailable because it has shut down or is no longer accessible.	Disk drive failure; Disk controller failure; Storage array failure
Unplanned	Human error	A human error occurs when an unintentional or malicious action is committed that causes data within the database to become logically corrupt or unusable. The service-level effect of a human error may vary significantly depending on the amount and critical nature of the affected data.	Dropped database object; Inadvertent data changes; Malicious data changes
Unplanned	Data corruption	A data corruption occurs when a hardware or software component causes corrupt data to be written to the database. The service level effect of a data corruption outage may vary, from a small portion of the database (down to a single database block) to a large portion of the database (making it essentially unusable).	Operating system or storage device driver, host bus adapter, disk controller, or volume manager error causing bad disk reads or writes; Stray writes by the operating system or other application software
Unplanned	Site failure	A site failure occurs when an event causes all or a significant portion of an application to stop processing or slow to an unusable service level. A site failure may affect all processing at a data center, or a subset of applications supported by a data center.	Extended sitewide power failure; Sitewide network failure; Natural disaster making a data center inoperable; Terrorist or malicious attack on operations or the site
Planned	System changes	Planned system changes occur during routine and periodic maintenance operations and new deployments. Planned system changes include any scheduled changes to the operating environment that occur outside the organizational data structure within the database. The service-level effect of a planned system change varies significantly depending on the nature and scope of the planned outage, the testing and validation efforts made prior to implementing the change, and the technologies and features in place to minimize the effect.	Adding or removing processors to or from an SMP server; Adding or removing nodes to or from a cluster; Adding or removing disk drives or storage arrays; Changing configuration parameters; Upgrading or patching system hardware and software; Upgrading or patching Oracle software; Upgrading or patching application software; System platform migration; Database relocation
Planned	Data changes	Planned data changes occur when there are changes to the logical structure or physical organization of Oracle database objects. The primary objective of these changes is to improve performance or manageability.	Table definition changes; Adding table partitioning; Creating and rebuilding indexes

Importance of Availability

The importance of high availability varies among applications. However, the need to deliver increasing levels of availability continues to accelerate as enterprises reengineer their solutions to gain competitive advantage. Most often, these new solutions rely on immediate access to critical business data. When data is not available, the operation can cease to function. Downtime can lead to lost productivity, lost revenue, damaged customer relationships, bad publicity, and lawsuits. Revenue losses and legal penalties are incurred because service level agreement objectives are not met.

Other factors to consider in the cost of downtime are the maximum tolerable duration of a single unplanned outage, and the maximum frequency of allowable incidents. If the event lasts less than 30 seconds, then it may have very little effect and may be barely perceptible to users. As the duration of the outage grows, the effect may grow exponentially and negatively affect the business. When designing a solution, it is important to take these issues into account. An organization should weigh the true cost of downtime and balance it with the expected availability improvement.
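
The two limits mentioned above, the maximum duration of a single outage and the maximum number of incidents, can be checked together. The following Python sketch is illustrative only: the class name, the function name, and the threshold values are assumptions, not part of any Oracle product.

```python
from dataclasses import dataclass


@dataclass
class DowntimePolicy:
    max_outage_seconds: int  # longest tolerable single unplanned outage
    max_incidents: int       # most incidents tolerated in one reporting period


def violations(outages_seconds, policy):
    """Return a list of messages describing how a period of outages breaks the policy."""
    problems = []
    for number, duration in enumerate(outages_seconds, start=1):
        if duration > policy.max_outage_seconds:
            problems.append(f"outage {number} exceeded the maximum duration")
    if len(outages_seconds) > policy.max_incidents:
        problems.append("too many incidents in the period")
    return problems


# Hypothetical policy: no outage longer than 5 minutes, at most 3 incidents.
policy = DowntimePolicy(max_outage_seconds=300, max_incidents=3)
short_blips = violations([20, 25, 10], policy)   # three barely perceptible events
long_outage = violations([20, 600], policy)      # one event well over the limit
```

With these assumed values, `short_blips` is empty while `long_outage` reports one violation, which mirrors the distinction the text draws between brief events and extended outages.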

Oracle provides a range of high availability solutions that fit every organization regardless of size. Small workgroups and global enterprises are able to extend the reach of their critical business applications. With Oracle and the Internet, applications and their data are now more reliably accessible everywhere, at any time.

High Availability Terminology

The following terms are used to explain the concepts of high availability of Oracle Collaboration Suite:

Active-Active: A system is active-active if all equivalent members of that system are actively servicing requests concurrently and none are in standby mode.
Active-Passive: A system is active-passive if some members of that system are active in servicing requests and some members are inactive (passive). These passive members are not activated until one or more active nodes have failed.
Hardware Cluster: A hardware cluster is a collection of computers that appears to clients as a single system and provides network services (for example: an IP address) or application services (for example: databases, Web servers).

A hardware cluster achieves high availability and scalability through the use of specialized hardware (cluster interconnect, shared storage) and software (health monitors, resource monitors). The cluster interconnect is a private link used by the hardware cluster for heartbeat information to detect node death. Heartbeat is the periodic message sent between nodes to detect system failure of any node. Due to the need for specialized hardware and software, hardware clusters are commonly provided by hardware vendors, such as Sun, HP, IBM, and Dell. The number of nodes that can be configured in a hardware cluster is vendor dependent. For Oracle Collaboration Suite high availability, only two nodes are required.
Cluster Agent: An application that runs on a node member of a hardware cluster and coordinates availability and performance operations with other nodes. A cluster agent can automate service failover.
Clusterware: An application that manages the operations of the members of a cluster as a system. Clusterware provides resource grouping, monitoring, and the ability to move services between cluster members.
Primary Node: The node that actively runs one or more infrastructure installations at any given time. If this node fails, then the secondary node takes over. Because the primary node runs the active infrastructure installations, it is considered the hot node.
Secondary Node: The node that takes over the execution of the infrastructure if the primary node fails. Because the secondary node does not originally run the infrastructure, it is considered the cold node. Because the application fails over from a hot node (primary) to a cold node (secondary), this type of failover is called cold failover.
Shared Storage: Storage that can be accessed by all the nodes in a cluster. While the nodes have equal access to the storage, only the primary node has active access to the storage at any given time. The hardware cluster software grants the secondary node access to this storage if the primary node fails. In some cases, the primary node may relinquish control of the shared storage and transfer it to the secondary node.
OracleAS Cluster (Identity Management): In this configuration, Oracle Identity Management components (Oracle Internet Directory, OracleAS Single Sign-On, Oracle Delegated Administration Services, Oracle Directory Integration and Provisioning, and Oracle Application Server Certificate Authority) are deployed together in two or more nodes. Each node runs all of the Oracle Identity Management components. The traffic to these nodes is load balanced by a redundant load balancer.
Failover: In a high availability system, the transfer of operations of a failed node to an equivalent node is known as failover. This is done to ensure the continuity of services to customers.

In an active-passive system, the passive member is activated during the failover operation and requests are directed to it instead of the failed member. In an active-active system, the load balancer detects the failure and automatically redirects requests for the failed member to the surviving active members.
Failback: After a system undergoes a successful failover operation, the process of repairing the failed node and restoring it as an active node is termed failback. This process reverts the system to its prefailure configuration.
Switchover: During normal operation, active members of Oracle Collaboration Suite may require maintenance or upgrading. A switchover process can be initiated to let a substitute member take over the workload of the member that requires maintenance or upgrading (scheduled outage). The switchover operation ensures continued service to clients of Oracle Collaboration Suite.
Switchback: After a switchover operation, the process of activating the upgraded member is known as switchback. This process brings the system back to the preswitchover configuration.
Physical Host Name: Physical host name is used to refer to the internal name of the current computer. In UNIX, this is the name returned by the command hostname.
Network Host Name: Network host name is a name assigned to an IP address either through the /etc/hosts file (in UNIX) or C:\WINDOWS\system32\drivers\etc\hosts file (in Windows), or through Domain Name System (DNS) resolution. Often, the network host name and physical host name are identical. However, each system has only one physical host name but may have multiple network host names. Thus, a network host name of a system may not always be its physical host name.
Virtual IP: It is also known as cluster virtual IP and load-balancer virtual IP. Generally, a virtual IP can be assigned to a hardware cluster or load balancer. To present a single system view of a cluster to network clients, a virtual IP serves as an entry point Internet Protocol (IP) address to the group of servers that are members of the cluster. A hardware cluster uses a cluster virtual IP to present the entry point into the cluster. It can also be set up on a standalone computer.

The hardware cluster software manages the movement of this IP address between the two physical nodes of the cluster, while clients connect to this IP address without knowing on which physical node it is currently active.

A load balancer also uses a virtual IP as the entry point to a set of servers. These servers tend to be active at the same time. This virtual IP address is assigned to the load balancer which acts as a proxy between servers and their clients.
Virtual Host Name: Virtual host name is a network-addressable host name that maps to one or more computers through a load balancer or a hardware cluster. For load balancers, virtual server name is used interchangeably with virtual host name. A load balancer can hold a virtual host name on behalf of a set of servers, and clients communicate indirectly with the systems using the virtual host name.

The virtual host name is the host name associated with the virtual IP. This is the name that is chosen to give the Oracle Collaboration Suite Applications a single system view of the Oracle Collaboration Suite Infrastructure with the help of a hardware cluster or a server load balancer. This name-IP entry must be added to the DNS that the site uses, so that the nodes of Oracle Collaboration Suite Applications can associate with Oracle Collaboration Suite Infrastructure without having to add this entry to their local /etc/hosts (or equivalent) file.

For example, if the two physical host names of the hardware cluster are node1.mycompany.com and node2.mycompany.com, the single view of this cluster can be provided by the name selfservice.mycompany.com. In the DNS, selfservice.mycompany.com maps to the virtual IP address of the Oracle Collaboration Suite Infrastructure, while the Oracle Collaboration Suite Applications components connect to the IP address without knowing which physical node is active and actually servicing a particular request.
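
The example above can be modeled in a few lines of Python. This is a toy sketch, not how cluster software or DNS is implemented: the class and function names are hypothetical, and 192.0.2.10 is an assumed address taken from the range reserved for documentation. The host names are the ones used in the example.

```python
class HardwareCluster:
    """Toy model of a two-node hardware cluster that owns one virtual IP."""

    def __init__(self, virtual_ip, nodes):
        self.virtual_ip = virtual_ip
        self.nodes = list(nodes)
        self.active = self.nodes[0]  # the virtual IP starts on the first node

    def fail_over(self):
        """Move the virtual IP to the other node, as cluster software would."""
        self.active = next(n for n in self.nodes if n != self.active)


# DNS holds only the virtual host name. The address is a hypothetical value.
dns = {"selfservice.mycompany.com": "192.0.2.10"}
cluster = HardwareCluster("192.0.2.10", ["node1.mycompany.com", "node2.mycompany.com"])


def connect(host_name):
    """Resolve the virtual host name and return the physical node that answers."""
    if dns[host_name] != cluster.virtual_ip:
        raise LookupError("host name does not map to the cluster virtual IP")
    return cluster.active


before = connect("selfservice.mycompany.com")
cluster.fail_over()
after = connect("selfservice.mycompany.com")
```

The client code calls `connect` with the same virtual host name before and after the failover. Only the cluster knows that the answering node changed from node1 to node2.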

Oracle Collaboration Suite High Availability Concepts

Oracle Collaboration Suite consists of different components deployed on multiple tiers. The availability of each component directly affects the availability of the system. Besides providing high availability features, Oracle Collaboration Suite must also be secure, so that both Internet and intranet users can use the system without compromising availability or security.

Oracle Collaboration Suite consists of Oracle Collaboration Suite Applications and Oracle Collaboration Suite Infrastructure.

This section explains the high availability features and framework of Oracle Collaboration Suite.

Oracle Collaboration Suite High Availability Features

Oracle Collaboration Suite is equipped with high availability features that avoid or eliminate planned and unplanned downtime. Planned outages can disrupt operations, especially in global enterprises that support users in multiple time zones. In this case, it is important to design a system to minimize planned interruptions. It is important to consider not only the time to perform the upgrade but also the effect the changes may have on the overall application. Oracle Collaboration Suite provides rolling upgrades to minimize planned downtime for these activities.

A rolling upgrade is an upgrade of a software version that is performed without noticeable downtime or other disruption of services. All systems need component replacement and upgrades from time to time. The need to facilitate software upgrades is greater for a system that is expected to provide continuous service, because such a system cannot simply be stopped for maintenance and upgrade. To provide service continuously, the software upgrades must be performed on a running system in such a way that the availability requirements are met.

Oracle Collaboration Suite supports the rolling upgrade of most Oracle Collaboration Suite components with minimal operational impact.
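
The general idea of a rolling upgrade can be sketched as follows. This Python example is a generic illustration under stated assumptions, not the Oracle Collaboration Suite upgrade procedure: the function name, the `upgrade` callback, the node names, and the `min_in_service` parameter are all hypothetical.

```python
def rolling_upgrade(nodes, upgrade, min_in_service=1):
    """Upgrade nodes one at a time and return the lowest in-service count observed."""
    nodes = list(nodes)
    if len(nodes) - 1 < min_in_service:
        raise ValueError("not enough nodes to upgrade without interrupting service")
    in_service = set(nodes)
    lowest = len(in_service)
    for node in nodes:
        in_service.remove(node)               # take one node out of rotation
        lowest = min(lowest, len(in_service))
        upgrade(node)                         # hypothetical per-node upgrade step
        in_service.add(node)                  # return the upgraded node to rotation
    return lowest


# Hypothetical node names and version labels.
versions = {"apps1": "old", "apps2": "old", "apps3": "old"}


def upgrade(node):
    versions[node] = "new"


lowest = rolling_upgrade(list(versions), upgrade)
```

With three nodes, at least two remain in service at every point of the upgrade. A single-node system is rejected by this sketch because it cannot be upgraded without stopping service.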

For unplanned outages, Oracle Collaboration Suite offers high availability solutions as local solutions that provide high availability in a single data center deployment.

Local high availability solutions protect against process, node, and media failures and human errors. Geographically distributed disaster recovery solutions protect against local physical disasters.

In addition to architectural redundancies, the following local high availability technologies are also necessary in a comprehensive high availability system:

Process Death Detection and Automatic Restart: Processes may die unexpectedly due to configuration or software problems. A proper process monitoring and restart system monitors all system processes constantly and restarts them should problems appear. The monitoring system also tracks the number of restarts within a specified time interval. This is important because restarting continuously within short time periods may lead to additional faults or failures.
Clustering: You can cluster the nodes together to allow the nodes to be viewed functionally as a single entity from the perspective of a client for run-time processing and manageability. A cluster is a set of processes that share the same workload, running on a single computer or multiple computers.
Configuration Management: Similar components of a clustered group often need to share a common configuration. Proper configuration management ensures that Oracle Collaboration Suite nodes provide the same reply to the same incoming request, allows these nodes to synchronize their configurations, and provides high availability configuration management to reduce administration downtime.
Server Load Balancing and Failover: Client requests to the nodes of Oracle Collaboration Suite Applications can be load balanced to ensure that the nodes have roughly the same workload. With a load balancing mechanism in place, and a set of redundant nodes, if any of the nodes or Oracle Collaboration Suite Applications instances fail, then the load balancer will route requests to the surviving nodes or instances.
Oracle Collaboration Suite Recovery Manager: A utility to perform backup and recovery of the complete installation of Oracle Collaboration Suite, including the configuration files, the Oracle Collaboration Suite Infrastructure, and the Oracle Collaboration Suite Applications tier.
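
The restart tracking described under Process Death Detection and Automatic Restart can be sketched with a sliding time window. This Python example is a generic illustration, not the mechanism that Oracle Collaboration Suite uses: the class name, the method name, and the limits are assumptions. Time is passed in explicitly so that the behavior is easy to follow.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts restarts within any window_seconds interval."""

    def __init__(self, max_restarts, window_seconds):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of recent restarts, oldest first

    def allow_restart(self, now):
        """Record and permit a restart at time now, or refuse if the limit is reached."""
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False  # restarting again would risk a restart loop
        self._restarts.append(now)
        return True


# Hypothetical limit: no more than 3 restarts in any 60-second interval.
limiter = RestartLimiter(max_restarts=3, window_seconds=60)
decisions = [limiter.allow_restart(t) for t in (0, 10, 20, 30, 61)]
```

In this run the first three restarts are allowed, the fourth (at 30 seconds) is refused, and the one at 61 seconds is allowed again because the earliest restart has left the window.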

Redundant Architectures

Oracle Collaboration Suite provides support for redundant nodes as follows:

Database node redundancy through Oracle Real Application Clusters (RAC)
Identity Management node redundancy through an OracleAS Cluster (Identity Management)
Middle-tier redundancy through the use of an external load balancer in front of multiple middle tiers

These redundant configurations provide increased availability either through a distributed workload, a failover setup, or both. The configuration can be an active-active configuration or an active-passive configuration. These configurations are described in the following sections.

Oracle Collaboration Suite Active-Active Configurations

This configuration deploys two or more active Oracle Collaboration Suite nodes and can be used to improve scalability as well as to provide high availability. All nodes handle requests concurrently.

Active-active solutions provide a robust cluster architecture for Oracle Collaboration Suite. Because all the nodes are active, failover from one node to another is quick and requires no manual intervention. Active-active setups also provide scalability for Oracle Collaboration Suite. This configuration leverages the RAC feature of the Oracle database for running Oracle Collaboration Suite Database. Each node in the hardware cluster has its own ORACLE_HOME, which contains the configuration files and binaries needed to run Oracle Collaboration Suite on that node. Oracle Collaboration Suite installation across nodes is accomplished in one process. Additionally, all nodes access a set of shared files on the RAC database.

Features

The features of an Oracle Collaboration Suite active-active configuration are as follows:

Identical Node Configuration: The nodes are meant to serve the same workload or application. Their configuration guarantees that they deliver the same reply to the same request. Some configuration properties may be identical and others may be node-specific, such as local host name information.
Equivalence Management: Changes made to one node will usually need to be propagated to the other nodes in an active-active configuration. This is done to maintain equivalence among all the nodes.
Independent Operation: To provide maximum availability, the loss of one Oracle Collaboration Suite node in an active-active configuration should not affect the ability of the other nodes to serve requests.
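
The distinction between identical and node-specific properties can be made concrete with a small equivalence check. In this Python sketch the property names, the node names, and the choice of `host_name` as the only node-specific key are hypothetical. It does not reflect the actual configuration files of Oracle Collaboration Suite.

```python
# Hypothetical set of properties that are allowed to differ between nodes.
NODE_SPECIFIC_KEYS = {"host_name"}


def shared_view(config):
    """Return the configuration without its node-specific properties."""
    return {k: v for k, v in config.items() if k not in NODE_SPECIFIC_KEYS}


def non_equivalent_nodes(configs):
    """Return the names of nodes whose shared settings differ from the first node."""
    names = list(configs)
    reference = shared_view(configs[names[0]])
    return [n for n in names[1:] if shared_view(configs[n]) != reference]


# Hypothetical configurations for three active nodes.
configs = {
    "node1": {"host_name": "node1.mycompany.com", "port": 7777, "max_sessions": 500},
    "node2": {"host_name": "node2.mycompany.com", "port": 7777, "max_sessions": 500},
    "node3": {"host_name": "node3.mycompany.com", "port": 7777, "max_sessions": 250},
}
drifted = non_equivalent_nodes(configs)
```

Here `drifted` contains only `node3`, whose `max_sessions` value was not propagated. The differing host names are ignored because they are expected to be node-specific.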

Advantages

The advantages of an Oracle Collaboration Suite active-active configuration are as follows:

Increased Availability: An active-active configuration is a redundant configuration. Loss of one node can be tolerated because another node can continue to serve the same requests.
Increased Scalability and Performance: Multiple identically configured nodes enable distributed workload to be shared among different computers and processes. New nodes with the same configuration can also be added as the demand of the application grows.
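
The way a load balancer distributes requests across active nodes and routes around a failed node can be sketched as follows. This Python example shows simple round-robin selection, which is only one of several possible balancing policies. The class name, method names, and node names are hypothetical, and a real load balancer would detect failures through its own health checks.

```python
class RoundRobinBalancer:
    """Toy round-robin load balancer that skips nodes marked as failed."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._next = 0  # index of the next node to try

    def mark_failed(self, node):
        self.healthy.discard(node)

    def mark_recovered(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def route(self):
        """Return the next healthy node in round-robin order."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["apps1", "apps2", "apps3"])  # hypothetical node names
normal = [balancer.route() for _ in range(3)]
balancer.mark_failed("apps2")
degraded = [balancer.route() for _ in range(4)]
```

Before the failure, each node receives one request in turn. After `apps2` is marked as failed, requests alternate between the two surviving nodes without any change on the client side.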

Oracle Collaboration Suite Active-Passive Configurations

This configuration deploys an active node of Oracle Collaboration Suite that handles requests and a passive Oracle Collaboration Suite node, which is on standby. A heartbeat mechanism is set up between these two nodes. This mechanism is provided and managed through vendor-specific clusterware. Generally, vendor-specific cluster agents are also available to automatically monitor and fail over between cluster nodes. When the active node fails, an agent shuts down the active node completely, brings up the passive node, and enables application services to successfully resume processing. As a result, the active-passive roles are now switched. Active-passive configurations in a cluster are also generally referred to as cold failover clusters.
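
The heartbeat and role switch described above can be sketched in Python. This is a simplified model under stated assumptions, not vendor clusterware: the class name, the method names, the node names, and the 5-second timeout are hypothetical, and time is passed in explicitly instead of being read from a clock.

```python
class ColdFailoverPair:
    """Toy model of an active-passive pair monitored through heartbeats."""

    def __init__(self, active, passive, timeout_seconds):
        self.active = active
        self.passive = passive
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        """Record a heartbeat received from the active node at time now."""
        self.last_heartbeat = now

    def check(self, now):
        """Fail over if the active node has been silent for longer than the timeout."""
        if now - self.last_heartbeat > self.timeout_seconds:
            # The roles switch: the passive node becomes the active node.
            self.active, self.passive = self.passive, self.active
            self.last_heartbeat = now  # start timing the new active node
            return True
        return False


pair = ColdFailoverPair("node1", "node2", timeout_seconds=5)
pair.heartbeat(1)
still_fine = pair.check(4)    # 3 seconds of silence, within the timeout
failed_over = pair.check(10)  # 9 seconds of silence, the agent fails over
```

After the second check, `pair.active` is `node2` and `pair.passive` is `node1`, which corresponds to the switched roles mentioned in the text.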

Oracle Collaboration Suite supports only Oracle Calendar in this configuration.

Features

The features of an Oracle Collaboration Suite active-passive configuration are as follows:

Shared Storage: The passive Oracle Collaboration Suite node in an active-passive configuration has access to the same Oracle binaries, configuration files, and data as the active node.
Failover Procedure: An active-passive configuration usually requires a set of scripts and procedures to detect failure of the active instance and to fail over to the passive instance while minimizing downtime.

Advantages

The advantages of an Oracle Collaboration Suite active-passive configuration are as follows:

Availability: If the active node fails for any reason or must be taken offline, an identically configured passive node is prepared to take over instantly.
Reduced Operation Costs: In an active-passive configuration, only one set of processes is up and servicing requests. Managing the active node is generally less expensive than managing an array of active nodes.
Application Independence: Some applications may not be suited to an active-active configuration. This includes applications that rely heavily on the application state or on information stored locally. An active-passive configuration has only one node serving requests at any particular time.

Oracle Collaboration Suite High Availability Framework

This section explains the framework of high availability of Oracle Collaboration Suite.

High Availability in Oracle Collaboration Suite Applications

Oracle Collaboration Suite provides several features for ensuring application-level high availability. Oracle Collaboration Suite Applications are installed in separate Oracle homes, on separate systems from the Oracle Collaboration Suite Infrastructure, or both. Each installation of Oracle Collaboration Suite Applications contains the following components:

Oracle Mail
Oracle Content Services
Oracle Collaboration Suite Search
Oracle Mobile Collaboration
Oracle Voicemail & Fax
Oracle Calendar
Oracle Calendar Application System
Oracle Collaborative Portlets
Oracle Real-Time Collaboration
Oracle Discussions
Oracle Workspaces

The Oracle Collaboration Suite Applications can be installed on redundant nodes which have the same set of Oracle Collaboration Suite Applications on each node. A load balancer is placed at the front end of this set of redundant nodes. Optionally, Oracle Collaboration Suite Applications can be installed separately on different nodes.

High Availability in Oracle Collaboration Suite Infrastructure

Oracle Collaboration Suite provides a completely integrated infrastructure and framework for development and deployment of enterprise applications. Oracle Collaboration Suite Infrastructure provides centralized product metadata, security and management services, and configuration information and data repositories for Oracle Collaboration Suite Applications. The time and effort for developing enterprise applications are reduced by integrating the infrastructure services required by the applications. In turn, the total cost of developing and deploying these applications is reduced, and the deployed applications are more reliable.

Oracle Collaboration Suite Infrastructure is made up of the following components:

Oracle Collaboration Suite Database
Oracle Identity Management

Oracle Identity Management includes the following components:

Oracle Internet Directory
Oracle Directory Integration and Provisioning
Oracle Delegated Administration Services
OracleAS Single Sign-On
OracleAS Certificate Authority

The OracleAS Single Sign-On tier comprises OracleAS Single Sign-On and Delegated Administration Services. The Oracle Internet Directory tier comprises Oracle Internet Directory and Oracle Directory Integration and Provisioning. The Oracle Internet Directory and OracleAS Single Sign-On tiers together provide Oracle Identity Management.

For Oracle Collaboration Suite Infrastructure to provide all essential services, all of these components must be available. Any high availability solution must be able to detect and recover from software failures of the processes associated with the Oracle Collaboration Suite Infrastructure components. Solutions must also be able to detect and recover from hardware failures on the hosts that are running Oracle Collaboration Suite Infrastructure.

To ensure high availability, Oracle Identity Management is installed separately against the Oracle Collaboration Suite Database on multiple nonclustered systems. All the components of Oracle Identity Management can either be collocated on the same system or distributed on separate systems.

In a collocated model, all Oracle Identity Management components are installed on one set of computers and all the components operate from a single Oracle home.

The distributed model is as follows:

OracleAS Single Sign-On and Oracle Delegated Administration Services are installed and operated from one set of computers.
Oracle Internet Directory and Oracle Directory Integration and Provisioning are installed and operated from another set of computers.

The components are separated in this manner because OracleAS Single Sign-On and Oracle Delegated Administration Services are typically the first components to be accessed directly by clients and other components. You can run these components on computers in the Demilitarized Zone (DMZ).