Sun Directory Server Enterprise Edition 7.0 Deployment Planning Guide

Availability and Single Points of Failure

Directory Server Enterprise Edition deployments that provide high availability can quickly recover from failures. With a high availability deployment, component failures might impact individual directory queries but should not result in complete system failure. A single point of failure (SPOF) is a system component that, upon failure, renders an entire system unavailable or unreliable. When you design a highly available deployment, you identify potential SPOFs and investigate how these SPOFs can be mitigated.

SPOFs can be divided into three categories:

Hardware failures, for example, server crashes, network failures, power failures, or disk drive crashes
Software failures, for example, Directory Server or Directory Proxy Server crashes
Database corruption

Mitigating SPOFs

By using redundancy, you can ensure that the failure of a single component does not cause the entire directory service to fail. Redundancy involves duplicating software components, hardware components, or both. Examples of this strategy include deploying multiple, replicated instances of Directory Server on separate hosts, or using redundant arrays of independent disks (RAID) for storage of Directory Server databases. Redundancy with replicated Directory Servers is the most efficient way to achieve high availability.
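As a rough illustration of why redundancy raises availability, the following sketch applies the standard parallel-availability formula. It assumes that instances fail independently and that any single surviving instance can serve all requests. Both are simplifying assumptions, and the 99% figure is illustrative only, not a measured value for Directory Server. The function name is hypothetical.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of a service that is up while at least one instance is up.

    Assumes independent failures. This is an idealization: shared power,
    a shared network, or corrupted data that is replicated to every
    instance can break the assumption.
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The service is down only when every instance is down at the same time.
    return 1.0 - (1.0 - instance_availability) ** instances


if __name__ == "__main__":
    # Illustrative figure: each instance is available 99% of the time.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, each added instance multiplies the probability of total outage by the unavailability of one instance, which is why a second replicated server improves availability far more than a fourth or fifth.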

Advantages and Disadvantages of Redundancy

The more common approach to providing a highly available directory service is to use redundant server components and replication. Redundant solutions are usually less expensive, easier to implement, and easier to manage. Note that replication, as part of a redundant solution, has numerous functions other than availability. While the main advantage of replication is the ability to split the read load across multiple servers, this advantage causes additional overhead in terms of server management. Replication also offers scalability on read operations and, with proper design, scalability on write operations, within certain limits. For an overview of replication concepts, see Chapter 7, Directory Server Replication, in Sun Directory Server Enterprise Edition 7.0 Reference.

During a failure, a redundant system might provide poor availability. Imagine, for example, an environment in which the load is shared between two redundant server components. The failure of one server component might put an excessive load on the other server, making this server respond more slowly to client requests. A slow response might be considered a failure for clients that rely on quick response times. In other words, the availability of the service, even though the service is operational, might not meet the availability requirements of the client.
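The two-server scenario above can be checked with simple arithmetic. The sketch below assumes that load is spread evenly and that the survivors absorb the failed servers' share evenly. The function names and the utilization figures are hypothetical and chosen for illustration only.

```python
def load_after_failures(per_server_utilization: float, servers: int, failed: int = 1) -> float:
    """Utilization of each surviving server after load is redistributed evenly.

    A result above 1.0 means the survivors are asked for more than their capacity.
    """
    if servers < 1 or not 0 <= failed < servers:
        raise ValueError("need at least one surviving server")
    if per_server_utilization < 0:
        raise ValueError("utilization cannot be negative")
    total_load = per_server_utilization * servers
    return total_load / (servers - failed)


def max_safe_utilization(servers: int, failed: int = 1, ceiling: float = 1.0) -> float:
    """Highest normal per-server utilization that keeps survivors at or below `ceiling`."""
    if servers < 1 or not 0 <= failed < servers:
        raise ValueError("need at least one surviving server")
    return ceiling * (servers - failed) / servers


if __name__ == "__main__":
    # Two servers, each normally 60% busy: the survivor would need 120% of its capacity.
    print(f"{load_after_failures(0.6, 2):.0%}")
    # With two servers, each must stay at or below 50% to absorb one failure.
    print(f"{max_safe_utilization(2):.0%}")
```

A calculation of this kind suggests how much headroom to reserve on each server so that the service still meets its response-time requirements while one component is down.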

How Redundancy Handles SPOFs

In terms of the SPOFs that are described at the beginning of this chapter, redundancy handles failure in the following ways:

Single hardware failure. A single hardware failure is fatal to a machine. Therefore, even if you have redundant hardware, manual intervention is required to repair the failure.
Directory Server or Directory Proxy Server failure. The server is automatically restarted.
Database corruption. Depending on the architecture, a redundant solution should be able to survive database corruption.

Redundancy at the Hardware Level

This section provides basic information about hardware redundancy. Many publications provide comprehensive information about using hardware redundancy for high availability. In particular, see “Blueprints for High Availability” published by John Wiley & Sons, Inc.

Hardware SPOFs can be broadly categorized as follows:

Network failures
Failure of the physical servers on which Directory Server or Directory Proxy Server is running
Load balancer failures
Storage subsystem failures
Power supply failures

Failure at the network level can be mitigated by having redundant network components. When designing your deployment, consider having redundant components for the following:

Internet connection
Network interface card
Network cabling
Network switches
Gateways and routers

You can mitigate the load balancer as an SPOF by including a redundant load balancer in your architecture.

In the event of database corruption, you must have a database failover strategy to ensure availability. You can mitigate SPOFs in the storage subsystem by using redundant server controllers. You can also use redundant cabling between controllers and storage subsystems, redundant storage subsystem controllers, or redundant arrays of independent disks (RAID).

If you have only one power supply, loss of this supply could make your entire service unavailable. To prevent this situation, consider providing redundant power supplies for hardware, where possible, and diversifying power sources. Additional methods of mitigating SPOFs in the power supply include using surge protectors, multiple power providers, and local battery backups, and generating power locally.

Failure of an entire data center can occur if, for example, a natural disaster strikes a particular geographic region. In this instance, a well-designed multiple data center replication topology can prevent an entire distributed directory service from becoming unavailable. For more information, see Using Replication and Redundancy for High Availability.

Redundancy at the Software Level

Failure in Directory Server or Directory Proxy Server can include the following:

Excessive response time
Write overload
- Maximized file descriptors
- Maximized file system
- Poor storage configuration
- Too many indexes
Read overload
Cache issues
CPU constraints
Replication issues
- Synchronicity
- Replication propagation delay
- Replication flow
- Replication overload
Large wildcard searches

These SPOFs can be mitigated by having redundant instances of Directory Server and Directory Proxy Server. Redundancy at the software level involves the use of replication. Replication ensures that the redundant servers remain synchronized, and that requests can be rerouted with no downtime. For more information, see Using Replication and Redundancy for High Availability.
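The idea of rerouting a request to another replicated instance can be sketched generically as follows. This is a minimal illustration only. It is not the Directory Proxy Server API or its routing algorithm, and the host names, the `send` callable, and the `AllReplicasFailed` exception are all hypothetical. The sketch assumes that an unreachable instance surfaces as a `ConnectionError`.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when no replica in the list could serve the request."""

    def __init__(self, errors: List[Tuple[str, Exception]]):
        super().__init__(f"{len(errors)} replica(s) failed")
        self.errors = errors


def send_with_failover(replicas: Sequence[str], send: Callable[[str], T]) -> T:
    """Try each replica in order and return the first successful response.

    `send` is a caller-supplied function (hypothetical here) that performs
    the request against one host and raises ConnectionError if that host
    cannot be reached.
    """
    errors: List[Tuple[str, Exception]] = []
    for host in replicas:
        try:
            return send(host)
        except ConnectionError as exc:
            # Record the failure and fall through to the next replica.
            errors.append((host, exc))
    raise AllReplicasFailed(errors)


if __name__ == "__main__":
    down = {"ds1.example.com"}  # hypothetical instance that is currently unavailable

    def fake_search(host: str) -> str:
        if host in down:
            raise ConnectionError(f"{host} is unreachable")
        return f"answered by {host}"

    # prints "answered by ds2.example.com"
    print(send_with_failover(["ds1.example.com", "ds2.example.com"], fake_search))
```

Rerouting of this kind is only safe for the client if replication keeps the instances synchronized, which is why the two mechanisms are used together.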