Sun Java Enterprise System Deployment Planning Guide

Availability

Availability specifies the uptime of a system and is typically measured as the percentage of time that the system is accessible to users. The time that the system is not accessible (downtime) can be due to failure of hardware, software, or the network, or to any other factor (such as loss of power) that brings the system down. Scheduled downtime for service (maintenance and upgrades) is not counted as downtime. A basic equation for calculating system availability as a percentage of uptime is:

Availability = uptime / (uptime + downtime) * 100%
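For example, a system that is up for 8,700 hours and down for 60 hours over a year has an availability of 8,700 / (8,700 + 60) * 100%, or approximately 99.3%.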

Availability is typically measured by the number of “nines” achieved. For example, 99% availability is two nines. Each additional nine significantly affects the deployment design. The following table quantifies the unscheduled downtime allowed at each level of availability for a system that runs 24x7 year-round (a total of 8,760 hours).

Table 3–3 Unscheduled Downtime for a System Running Year-Round (8,760 hours)

Number of Nines     Percentage Available     Unscheduled Downtime
2                   99%                      88 hours
3                   99.9%                    9 hours
4                   99.99%                   53 minutes
5                   99.999%                  5 minutes
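The downtime figures in the table follow directly from the availability equation: the allowed downtime is the total hours in the year multiplied by the fraction of time the system may be unavailable. The following Java sketch is purely illustrative (the class and method names are not part of Java Enterprise System); it reproduces the table values before rounding:

    public class DowntimeCalculator {
        // Hours in a 24x7, year-round schedule, as used in Table 3-3.
        private static final double HOURS_PER_YEAR = 8760.0;

        // Maximum unscheduled downtime, in hours, that still meets the
        // given availability percentage (for example, 99.99).
        static double allowedDowntimeHours(double availabilityPercent) {
            return HOURS_PER_YEAR * (1.0 - availabilityPercent / 100.0);
        }

        public static void main(String[] args) {
            double[] targets = {99.0, 99.9, 99.99, 99.999}; // two to five nines
            for (double pct : targets) {
                double hours = allowedDowntimeHours(pct);
                if (hours >= 1.0) {
                    System.out.printf("%s%% -> %.1f hours%n", pct, hours);
                } else {
                    System.out.printf("%s%% -> %.1f minutes%n", pct, hours * 60.0);
                }
            }
        }
    }

Running the sketch prints 87.6 hours for two nines, 8.8 hours for three, 52.6 minutes for four, and 5.3 minutes for five, which round to the values shown in Table 3–3.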

Fault-Tolerant Systems

Availability requirements of four or five nines typically call for a fault-tolerant system. A fault-tolerant system must be able to continue service even during a hardware or software failure. Typically, fault tolerance is achieved by redundancy both in hardware (such as CPUs, memory, and network devices) and in the software that provides key services.

A single point of failure is a hardware or software component that is part of a critical path but is not backed up by redundant components. The failure of this component results in the loss of service for the system. When designing a fault-tolerant system, you must identify and eliminate potential single points of failure.
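To make the redundancy principle concrete, the following Java sketch shows one common pattern: a client that fails over across a list of redundant service endpoints, so that no single endpoint becomes a single point of failure. It is a minimal illustration under assumed names (ServiceEndpoint, FailoverClient), not an API of Java Enterprise System:

    import java.util.List;

    public class FailoverClient {
        // A minimal stand-in for a remote service; any endpoint may fail.
        interface ServiceEndpoint {
            String call(String request) throws Exception;
        }

        private final List<ServiceEndpoint> replicas;

        FailoverClient(List<ServiceEndpoint> replicas) {
            this.replicas = replicas;
        }

        // Try each redundant replica in turn; the service remains
        // available as long as at least one replica can answer.
        String callWithFailover(String request) throws Exception {
            Exception lastFailure = null;
            for (ServiceEndpoint replica : replicas) {
                try {
                    return replica.call(request);
                } catch (Exception e) {
                    lastFailure = e; // this replica is down; try the next one
                }
            }
            // All replicas failed: the redundant path itself is exhausted.
            throw new Exception("no replica available", lastFailure);
        }
    }

The same idea applies at every tier of a fault-tolerant design: each component on the critical path needs a redundant peer and a mechanism, such as the loop above, for shifting work to it.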

Fault-tolerant systems can be expensive to implement and maintain. Make sure you understand the nature of the business requirements for availability and consider the strategies and costs of availability solutions that meet those requirements.

Prioritizing Service Availability

From a user perspective, availability often applies service by service rather than to the system as a whole. For example, the unavailability of instant messaging services usually has little or no impact on the availability of other services. However, the unavailability of a service upon which many other services depend (such as Directory Server) has a much wider impact. Specifications that call for higher availability should clearly reference the specific use cases and usage analysis that require it.

It is helpful to list availability needs according to an ordered set of priorities. The following table prioritizes the availability of different types of services.

Table 3–4 Availability of Services by Priority

Priority 1, Mission critical: Services that must be available at all times. For example, database services (such as LDAP directories) to applications.

Priority 2, Must be available: Services that must be available, but can run at reduced performance. For example, messaging service availability might not be critical in some business environments.

Priority 3, Can be postponed: Services that must be available within a given time period. For example, calendar service availability might not be essential in some business environments.

Priority 4, Optional: Services that can be postponed indefinitely. For example, in some environments instant messaging services can be considered useful but not necessary.
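In a deployment plan, these priorities can be recorded explicitly so that availability targets and failover designs follow from them. The sketch below is a hypothetical example; the service names and their assigned priorities are illustrations, not recommendations:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ServiceAvailabilityPlan {
        // Priorities from Table 3-4, highest first.
        enum Priority { MISSION_CRITICAL, MUST_BE_AVAILABLE, CAN_BE_POSTPONED, OPTIONAL }

        public static void main(String[] args) {
            // Hypothetical mapping of services to availability priorities.
            Map<String, Priority> plan = new LinkedHashMap<String, Priority>();
            plan.put("Directory Server", Priority.MISSION_CRITICAL);
            plan.put("Messaging Server", Priority.MUST_BE_AVAILABLE);
            plan.put("Calendar Server", Priority.CAN_BE_POSTPONED);
            plan.put("Instant Messaging", Priority.OPTIONAL);

            for (Map.Entry<String, Priority> entry : plan.entrySet()) {
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }
        }
    }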

Loss of Services

Availability design must also consider what happens when availability is compromised or a component is lost. Considerations include whether connected users must restart their sessions and how a failure in one area affects other areas of the system. The QoS requirements should cover these scenarios and specify how the deployment responds to them.
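For example, a QoS requirement might state that the failure of a single server instance must not force users to restart their sessions manually. The following sketch illustrates one way a client could satisfy such a requirement by reconnecting transparently; the Session and SessionFactory types are assumptions made for this example:

    public class ResilientSession {
        // Minimal stand-in for a server-side session that can be lost.
        interface Session {
            String send(String message) throws Exception;
        }

        interface SessionFactory {
            Session connect() throws Exception;
        }

        private final SessionFactory factory;
        private Session session;

        ResilientSession(SessionFactory factory) throws Exception {
            this.factory = factory;
            this.session = factory.connect();
        }

        // If the current session is lost, restart it once and retry, so
        // that a single component failure does not interrupt the user.
        String send(String message) throws Exception {
            try {
                return session.send(message);
            } catch (Exception lost) {
                session = factory.connect(); // restart the session
                return session.send(message);
            }
        }
    }

Whether such transparent recovery is required, and how quickly it must complete, is exactly what the QoS requirements should specify.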