Sun Java System Directory Server Enterprise Edition 6.2 Deployment Planning Guide

Using Clustering for High Availability

From a physical perspective, a cluster consists of between one and eight servers that work together as a single entity to provide highly available access to applications, system resources, and data. Each server can be a symmetric multiprocessor with multiple CPUs.

A clustering solution can provide high availability for the following:

Servers and software
Storage subsystem
Network adapters

Clustering does not mitigate all SPOFs in a directory architecture. Failures in the external network, power generation, and data center must be mitigated outside of a clustering solution.

Currently, the only supported clustering technology for Directory Server is Sun Cluster 3.1. Using Sun Cluster 3.1 for directory service availability involves installing and configuring the Sun Cluster HA for Directory Server data service as a failover data service. This strategy allows Directory Server to fail over safely in a Sun Cluster 3.1 environment.

The following figure shows the position of the Sun Cluster HA for Directory Server data service in the Sun Cluster 3.1 architecture.

Figure 12–6 Sun Cluster 3.1 Architecture

Figure shows a high availability deployment using the Sun Cluster 3.1 architecture.

Hardware Redundancy

The architecture of a Sun Cluster hardware system is designed so that no SPOF can make a cluster unavailable. Redundant high-speed interconnects, storage system connections, and public networks ensure that no single failure interrupts cluster connectivity.

Clients connect to the cluster through public network interfaces. If a network adapter card has multiple hardware interfaces, the card can connect to one or more public networks. You can set up nodes to include multiple network interface cards. The cards are configured so that one card is active, and the other cards operate as backups.

A cluster file system is a proxy between the kernel on one or more nodes and the underlying file system and volume manager. The cluster file system runs on a node that has a physical connection to the disks. For a cluster file system to be highly available, you must attach the disks to multiple nodes. A local file system, that is, a file system stored on a node's local disk, is not highly available even when it is made into a cluster file system.

A volume manager provides for mirrored or RAID 5 configurations for data redundancy of multihost disks. You can combine multihost disks with disk mirroring and striping to protect against both node failure and individual disk failure.

The cluster interconnect is a private network that transfers cluster-private communications and data service communications between cluster nodes. Redundant NICs, junctions, and cables protect against network failure.

Monitoring in a Clustered Solution

The cluster continuously monitors all its members. It blocks failed nodes from participating in the cluster, which prevents any exchange of corrupt data. The cluster also monitors applications, and it fails over or restarts the applications in case of failures.

Public Network Management, a subsystem of the Sun Cluster software, monitors the active interface. If the active adapter fails, Network Adapter Failover software is called to fail over the interface to one of the backup adapters.
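The active and backup adapter behavior described above can be sketched as follows. This is only an illustrative model, not the Sun Cluster implementation: the class names, the adapter names, and the "first healthy backup wins" selection rule are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Adapter:
    name: str             # hypothetical interface name
    healthy: bool = True  # result of the most recent health check


@dataclass
class AdapterGroup:
    """One active adapter plus backups (illustrative model only)."""
    adapters: List[Adapter]
    active_index: int = 0

    @property
    def active(self) -> Adapter:
        return self.adapters[self.active_index]

    def check_and_failover(self) -> Optional[Adapter]:
        """Return the adapter that should carry traffic.

        If the active adapter is healthy, nothing changes. Otherwise the
        first healthy adapter in the group becomes active (an assumed
        selection rule). Returns None when no adapter is healthy.
        """
        if self.active.healthy:
            return self.active
        for index, adapter in enumerate(self.adapters):
            if adapter.healthy:
                self.active_index = index
                return adapter
        return None
```

For example, in a group built from Adapter("net0") and Adapter("net1"), marking net0 as unhealthy and calling check_and_failover() makes net1 the active adapter.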

The Cluster Membership Monitor (CMM) is a distributed set of agents, with one set per cluster member or node. The agents exchange messages over the cluster interconnect to ensure full connectivity among all nodes. When the CMM detects a change in cluster membership because of a node failure, for example, the CMM reconfigures the cluster. If the CMM detects a critical problem with a node, the CMM contacts the cluster framework. The cluster framework then forcibly shuts down the node and removes it from the cluster membership.
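As a rough illustration of membership reconfiguration, the sketch below removes any node that has not been heard from within a timeout. The timeout rule, the function name, and the data layout are assumptions for the example. The actual CMM protocol is more involved and is not described in this section.

```python
from typing import Dict, Set, Tuple


def reconfigure_membership(
    members: Set[str],
    last_heard: Dict[str, float],
    now: float,
    timeout: float,
) -> Tuple[Set[str], Set[str]]:
    """Split the current members into (still_members, removed).

    A node stays in the membership only if a message from it arrived
    within `timeout` seconds of `now`. Nodes never heard from are removed.
    """
    still_members = {
        node for node in members
        if node in last_heard and now - last_heard[node] <= timeout
    }
    removed = members - still_members
    return still_members, removed
```

With members n1, n2, and n3, last messages at times 10.0, 9.5, and 4.0, a current time of 10.0, and a timeout of 2.0, the function keeps n1 and n2 and removes n3.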

System Maintenance

You can minimize planned downtime for system maintenance by moving data and applications from the component that needs maintenance to another component on the system. When the maintenance is complete, you can move the data and applications back to the original component.

Directory Server Failover Data Service

The Directory Server Failover Data Service runs on a single node in a cluster. However, nodes can have multiple CPUs for scalability. A fault monitor periodically monitors this failover service.

The Resource Group Manager (RGM) manages data services as resources. When the CMM changes a cluster's membership, the RGM might request changes to the cluster's online or offline resources. The RGM starts and stops failover data services.

Disaster Recovery

The following sections describe how service is recovered when the Directory Server Data Service fails and when the server itself fails.

Recovery in the Event of Application Failure

If the fault monitor determines that the Directory Server Data Service has failed, the monitor initiates action to restart the service. The action that is taken depends on the service's configuration.

You can configure the failover data service to attempt to restart a failed service on the same node. Alternatively, the data service can be configured to immediately start a failed service on a different node. If the data service is configured to attempt to restart on the same node, the fault monitor contacts the local RGM. The local RGM then attempts to restart the failed service. If this action fails, the local RGM attempts to start the service on a different node.

If a failed data service cannot be restarted on the same node, the local node's RGM attempts to locate a version of the service on another node. This action also occurs if the data service is configured to start on a different node after failure. If the local RGM finds a version of the service, the local RGM contacts the local CMM and requests that it contact the remote node over the cluster interconnect. The remote CMM then contacts the RGM on its own node and directs it to start the service.
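The two recovery policies described above can be summarized in a short sketch. It models only the decision order (optional restart on the same node, then other nodes) and leaves out the RGM and CMM messaging. The function name, its parameters, and the try_start callback are hypothetical and are not part of any Sun Cluster interface.

```python
from typing import Callable, Iterable, Optional


def recover_service(
    service: str,
    local_node: str,
    other_nodes: Iterable[str],
    restart_on_same_node: bool,
    try_start: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the node where the service ended up running, or None.

    try_start(service, node) is a hypothetical callback that attempts to
    start the service on a node and reports whether the attempt succeeded.
    """
    # Policy 1: attempt a restart on the node where the service failed.
    if restart_on_same_node and try_start(service, local_node):
        return local_node
    # Policy 2, or the fallback after a failed local restart:
    # try each of the other nodes in turn.
    for node in other_nodes:
        if try_start(service, node):
            return node
    return None
```

Setting restart_on_same_node to False corresponds to a data service that is configured to start on a different node immediately after a failure.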

The following figure illustrates recovery in the event of application failure.

Figure 12–7 Application Failure and Recovery in a Sun Cluster 3.1 Architecture

Figure shows recovery after application failure in a Sun Cluster 3.1 architecture.

Recovery in the Event of Server Failure

If the server or node on which the Directory Server Data Service is running fails, the service is migrated to another working node. No user intervention is required. The service uses a failover resource group, which is a container that defines the Directory Server instances and the hosts that support the failover requirements.
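One way to picture a failover resource group is as a record of the Directory Server instances and the hosts that can run them. The sketch below is a simplified model under stated assumptions: the class, its field names, and the rule of choosing the next available host in list order are illustrative and are not taken from the Sun Cluster software.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class FailoverResourceGroup:
    name: str                        # hypothetical group name
    instances: List[str]             # Directory Server instances in the group
    node_list: List[str]             # hosts that can run the group (order is an assumed preference)
    online_on: Optional[str] = None  # host currently running the group

    def handle_node_failure(self, failed_node: str, up_nodes: Set[str]) -> Optional[str]:
        """Move the group off a failed node; return the node now hosting it."""
        if self.online_on != failed_node:
            # The failure does not affect the node hosting this group.
            return self.online_on
        for node in self.node_list:
            if node != failed_node and node in up_nodes:
                self.online_on = node
                return node
        # No surviving host can run the group.
        self.online_on = None
        return None
```

In this model, a group that is online on node1 with a node list of node1, node2, and node3 moves to node2 when node1 fails, provided node2 is still up.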

The following figure illustrates recovery in the event of server failure.

Figure 12–8 Server Failure and Recovery in a Sun Cluster 3.1 Architecture

Figure shows recovery after server failure in a Sun Cluster 3.1 architecture.