Sun Java System Application Server Enterprise Edition 8.1 2005Q2 Deployment Planning Guide

Planning for Availability

Rightsizing Availability

To plan availability of systems and applications, assess the availability needs of the user groups that access different applications. For example, external fee-paying users and business partners often have higher quality of service (QoS) expectations than internal users. Thus, it may be more acceptable to internal users for an application feature, application, or server to be unavailable than it would be for paying external customers.

The following figure illustrates the increasing cost and complexity of protecting against decreasingly probable events. At one end of the continuum, a simple load-balanced cluster can tolerate localized application, middleware, and hardware failures. At the other end, geographically distinct clusters can protect against major catastrophes that affect an entire data center.

Figure 2–3 Availability versus Cost and Complexity

To realize a good return on investment, it often makes sense to identify the availability requirements of individual features within an application. For example, it may not be acceptable for an insurance quotation system to be unavailable (potentially turning away new business), but brief unavailability of the account management function (where existing customers can view their current coverage) is unlikely to turn away existing customers.

Using Clusters to Improve Availability

At the most basic level, a cluster is a group of application server instances, often hosted on multiple physical servers, that appear to clients as a single instance. This provides horizontal scalability as well as higher availability than a single instance on a single machine. This basic level of clustering works in conjunction with the Application Server’s HTTP load balancer plug-in, which accepts HTTP and HTTPS requests and forwards them to one of the application server instances in the cluster. The ORB and integrated JMS brokers also perform load balancing to application server clusters. If an instance fails, becomes unavailable (due to network faults), or becomes unresponsive, requests are redirected only to the remaining available instances. The load balancer can also recognize when a failed instance has recovered and redistribute load accordingly.
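The failover and recovery behavior described above can be illustrated with a small model. This is a minimal sketch, not the load balancer plug-in's actual algorithm: the class name `RoundRobinBalancer`, its methods, the instance names, and the choice of round-robin rotation are all assumptions made for illustration.

```python
class RoundRobinBalancer:
    """Toy model (hypothetical): rotate through instances, skipping unavailable ones."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.available = {name: True for name in self.instances}
        self._next = 0  # index of the next instance to try

    def mark_down(self, name):
        """Record that an instance has failed or become unresponsive."""
        self.available[name] = False

    def mark_up(self, name):
        """Record that a previously failed instance has recovered."""
        self.available[name] = True

    def route(self):
        """Return the next available instance, or raise if none is left."""
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if self.available[candidate]:
                return candidate
        raise RuntimeError("no available instances in the cluster")


lb = RoundRobinBalancer(["instance1", "instance2", "instance3"])

lb.mark_down("instance2")                        # instance2 fails
after_failure = [lb.route() for _ in range(4)]   # only instance1 and instance3 are used

lb.mark_up("instance2")                          # instance2 recovers
after_recovery = [lb.route() for _ in range(3)]  # all three instances share load again
```

In this model, requests issued while `instance2` is down alternate between the two surviving instances, and the recovered instance rejoins the rotation as soon as it is marked available.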

The HTTP load balancer also provides a health checker program that can monitor servers and specific URLs to determine whether they are available. You must carefully manage the overhead of health checking so that it does not become a large processing burden itself.

For stateless applications or applications that involve only low-value, simple user transactions, a simple load-balanced cluster is often all that is required. For stateful, mission-critical applications, consider using HADB for session persistence. For an overview of HADB, see High-Availability Database in Chapter 1, Product Concepts Application Server Administration Guide.

To perform online upgrades of applications, it is best to group the application server instances into multiple clusters. The Application Server has the ability to quiesce both applications and instances. Quiescence is the ability to take an instance (or group of instances) or a specific application offline in a controlled manner without impacting the users currently being served by the instance or application. As one instance is quiesced, new users are served by the upgraded application on another instance. This type of application upgrade is called a rolling upgrade. For more information on upgrading live applications, see Upgrading Applications Without Loss of Availability in Sun Java System Application Server Enterprise Edition 8.1 2005Q2 High Availability Administration Guide.
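The reason for using multiple clusters can be seen in a small simulation of a rolling upgrade. This is a sketch under stated assumptions, not the Application Server's administration interface: the function `rolling_upgrade`, the dictionary layout, and the cluster names and version strings are hypothetical.

```python
def rolling_upgrade(clusters, new_version):
    """Upgrade one cluster at a time (hypothetical model).

    Returns a list of (cluster being upgraded, clusters still serving users).
    """
    timeline = []
    for name in clusters:
        # Clusters that keep serving users while this one is offline.
        others = sorted(
            n for n, c in clusters.items() if n != name and c["serving"]
        )
        if not others:
            raise RuntimeError("upgrade would leave no cluster serving users")

        clusters[name]["serving"] = False        # quiesce: no new users routed here
        timeline.append((name, others))
        clusters[name]["version"] = new_version  # redeploy while offline
        clusters[name]["serving"] = True         # bring the cluster back online
    return timeline


clusters = {
    "cluster1": {"version": "1.0", "serving": True},
    "cluster2": {"version": "1.0", "serving": True},
}
timeline = rolling_upgrade(clusters, "2.0")
```

With two clusters, each upgrade step leaves the other cluster serving users. With a single cluster, the same function raises an error, because there is nowhere to send users while the cluster is quiesced.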

Adding Redundancy to the System

One way to achieve high availability is to add hardware and software redundancy to the system. When one unit fails, the redundant unit takes over. This is also referred to as fault tolerance. In general, to maximize availability, identify and remove every possible point of failure in the system.

Identifying Failure Classes

The level of redundancy is determined by the failure classes (types of failure) that the system needs to tolerate. Some examples of failure classes are:

System process
Machine
Power supply
Disk
Network failures
Building fires or other preventable disasters
Unpredictable natural catastrophes

Duplicated system processes tolerate single system process failures, as well as single machine failures. Attaching the duplicated mirrored (paired) machines to different power supplies tolerates single power failures. By keeping the mirrored machines in separate buildings, a single-building fire can be tolerated. By keeping them in separate geographical locations, natural catastrophes like earthquakes can be tolerated.
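The benefit of duplicating a unit can be estimated with simple arithmetic. The sketch below assumes that the copies fail independently of each other, which is exactly what separate power supplies, buildings, and locations are meant to approximate. The function name and the 99% figure are illustrative assumptions, not measurements of any product.

```python
def redundant_availability(unit_availability, copies=2):
    """Availability of a group of redundant units that fail independently.

    The group is unavailable only when every copy is unavailable at once.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1.0 - (1.0 - unit_availability) ** copies


single = redundant_availability(0.99, copies=1)  # one machine, available 99% of the time
paired = redundant_availability(0.99, copies=2)  # mirrored pair: roughly 99.99%
```

If both machines share a power supply or a building, their failures are no longer independent for that failure class, and the estimate is too optimistic.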

Using HADB Redundancy Units to Improve Availability

To improve availability, HADB nodes are always used in Data Redundancy Units (DRUs) as explained in Establishing Performance Goals.

Using HADB Spare Nodes to Improve Fault Tolerance

Using spare nodes improves fault tolerance. Although spare nodes are not mandatory, they provide maximum availability.

Planning Failover Capacity

Failover capacity planning means deciding how many additional servers and processes you need to add to the Application Server deployment so that, in the event of a server or process failure, the system can seamlessly recover data and continue processing. If your system becomes overloaded, a process or server failure might result, causing response time degradation or even total loss of service. Preparing for such an occurrence is critical to successful deployment.

To maintain capacity, especially at peak loads, add spare machines running Application Server instances to the existing deployment.

For example, consider a system with two machines running one Application Server instance each. Together, these machines handle a peak load of 300 requests per second. If one of these machines becomes unavailable, the system will be able to handle only 150 requests per second, assuming an even load distribution between the machines. Therefore, half the requests during peak load will not be served.
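The arithmetic in this example can be sketched as follows. The function names are hypothetical, and the calculation assumes that every machine has the same capacity (150 requests per second each, as implied by the even split of the 300 requests per second peak load).

```python
import math


def surviving_capacity(machines, per_machine_rps, failed):
    """Requests per second the deployment can serve after `failed` machines are lost."""
    return max(machines - failed, 0) * per_machine_rps


def unserved_fraction(peak_rps, capacity_rps):
    """Fraction of peak-load requests that cannot be served at the given capacity."""
    return max(peak_rps - capacity_rps, 0) / peak_rps


def machines_needed(peak_rps, per_machine_rps, tolerated_failures):
    """Machines required to carry peak load even after `tolerated_failures` are lost."""
    return math.ceil(peak_rps / per_machine_rps) + tolerated_failures


capacity = surviving_capacity(machines=2, per_machine_rps=150, failed=1)
lost = unserved_fraction(peak_rps=300, capacity_rps=capacity)
needed = machines_needed(peak_rps=300, per_machine_rps=150, tolerated_failures=1)
```

Under these assumptions, losing one of the two machines leaves capacity for 150 requests per second, so half of the peak load goes unserved. Tolerating one machine failure at peak load requires three machines rather than two.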