Sun Java System Directory Server Enterprise Edition 6.3 Deployment Planning Guide

Availability and Single Points of Failure

Directory Server Enterprise Edition deployments that provide high availability can quickly recover from failures. With a high availability deployment, component failures might impact individual directory queries but should not result in complete system failure. A single point of failure (SPOF) is a system component which, upon failure, renders an entire system unavailable or unreliable. When you design a highly available deployment, you identify potential SPOFs and investigate how these SPOFs can be mitigated.

SPOFs can be divided into three categories:

Hardware failures, for example, server crashes, network failures, power failures, or disk drive crashes
Software failures, for example, Directory Server or Directory Proxy Server crashes
Database corruption

Mitigating SPOFs

You can ensure that failure of a single component does not cause an entire directory service to fail by using redundancy. Redundancy involves providing redundant software components, hardware components, or both. Examples of this strategy include deploying multiple, replicated instances of Directory Server on separate hosts, or using redundant arrays of independent disks (RAID) for storage of Directory Server databases. Redundancy with replicated Directory Servers is the most efficient way to achieve high availability.

You can also use clustering to provide a highly available service. Clustering involves providing pre-packaged high availability hardware and software. An example of this strategy is deploying Sun Cluster hardware and software.

Deciding Between Redundancy and Clustering

The remainder of this chapter describes in more detail the use of redundancy and clustering to ensure high availability. This section summarizes the advantages and disadvantages of each solution.

Advantages and Disadvantages of Redundancy

The more common approach to providing a highly available directory service is to use redundant server components and replication. Redundant solutions are usually less expensive and easier to implement than clustering solutions. These solutions are also generally easier to manage. Note that replication, as part of a redundant solution, has numerous functions other than availability. While the main advantage of replication is the ability to split the read load across multiple servers, this advantage causes additional overhead in terms of server management. Replication also offers scalability on read operations and, with proper design, scalability on write operations, within certain limits. For an overview of replication concepts, see Chapter 4, Directory Server Replication, in Sun Java System Directory Server Enterprise Edition 6.3 Reference.

During a failure, a redundant system might provide poorer availability than a clustering solution. Imagine, for example, an environment in which the load is shared between two redundant server components. The failure of one server component might put an excessive load on the other server, making this server respond more slowly to client requests. A slow response might be considered a failure for clients that rely on quick response times. In other words, the availability of the service, even though the service is operational, might not meet the availability requirements of the client.

Advantages and Disadvantages of Clustering

The main advantage of a clustered solution is automatic recovery from failure, that is, recovery without user intervention. Disadvantages of clustering are complexity and inability to recover from database corruption.

In a clustered environment, the cluster uses the same IP address for Directory Server and Directory Proxy Server, regardless of which cluster node is actually running the service. That is, the IP address is transparent to the client application. In a replicated environment, each machine in the topology has its own IP address. In this case, Directory Proxy Server can be used to provide a single point of access to the directory topology. The replication topology is therefore effectively hidden from client applications. To increase this transparency, Directory Proxy Server can be configured to follow referrals and search references automatically. Directory Proxy Server also provides load balancing and the ability to switch to another machine when one fails.

How Redundancy and Clustering Handle SPOFs

In terms of the SPOFs that are described at the beginning of this chapter, redundancy and clustering handle failure in the following ways:

Single hardware failure. In a clustered environment, this kind of failure has no impact on the directory service. Only multiple hardware failures impact the service in a cluster.

A single hardware failure is fatal to a machine that is not in a clustered environment. Therefore, even if you have redundant hardware, manual intervention is required to repair the failure.
Directory Server or Directory Proxy Server failure. In a clustered environment, the server is automatically restarted. Software failure must occur multiple times in quick succession to trigger the service group to switch to another node in the cluster. This handling of a software failure is also true in a redundant environment.
Database corruption. A cluster cannot survive this kind of failure. Depending on the architecture, a redundant solution should be able to survive database corruption.

Redundancy at the Hardware Level

This section provides basic information about hardware redundancy. Many publications provide comprehensive information about using hardware redundancy for high availability. In particular, see “Blueprints for High Availability” published by John Wiley & Sons, Inc.

Hardware SPOFs can be broadly categorized as follows:

Network failures
Failure of the physical servers on which Directory Server or Directory Proxy Server are running
Load balancer failures
Storage subsystem failures
Power supply failures

Failure at the network level can be mitigated by having redundant network components. When designing your deployment, consider having redundant components for the following:

Internet connection
Network interface card
Network cabling
Network switches
Gateways and routers

You can mitigate the load balancer as an SPOF by including a redundant load balancer in your architecture.

In the event of database corruption, you must have a database failover strategy to ensure availability. You can mitigate against SPOFs in the storage subsystem by using redundant server controllers. You can also use redundant cabling between controllers and storage subsystems, redundant storage subsystem controllers, or redundant arrays of independent disks.

If you have only one power supply, loss of this supply could make your entire service unavailable. To prevent this situation, consider providing redundant power supplies for hardware, where possible, and diversifying power sources. Additional methods of mitigating SPOFs in the power supply include using surge protectors, multiple power providers, and local battery backups, and generating power locally.

Failure of an entire data center can occur if, for example, a natural disaster strikes a particular geographic region. In this instance, a well-designed multiple data center replication topology can prevent an entire distributed directory service from becoming unavailable. For more information, see Using Replication and Redundancy for High Availability.

Redundancy at the Software Level

Failure in Directory Server or Directory Proxy Server can include the following:

Excessive response time
Write overload
- Maximized file descriptors
- Maximized file system
- Poor storage configuration
- Too many indexes
Read overload
Cache issues
CPU constraints
Replication issues
- Synchronicity
- Replication propagation delay
- Replication flow
- Replication overload
Large wildcard searches

These SPOFs can be mitigated by having redundant instances of Directory Server and Directory Proxy Server. Redundancy at the software level involves the use of replication. Replication ensures that the redundant servers remain synchronized, and that requests can be rerouted with no downtime. For more information, see Using Replication and Redundancy for High Availability.