
Oracle9i Application Server Best Practices
Release 1 (v1.0.2.2)

Part Number A95201-01

1 Availability of Java Applications

This chapter presents Oracle recommendations for keeping Web-based information systems operating around the clock, day in and day out--commonly known as high availability. Our recommendations take the form of best practices for:

We recommend a combination of:

Failure detection is effective for total failures, such as the failure to accept a TCP connection. Detection is much less effective for partial failures, such as a process that accepts the connection but does not reply within a few seconds. These partial failures are often caused by deadlocks, resource leaks, and other server-side software bugs. Such failures are often a death by a thousand cuts: there is no single event or moment at which a failure detector can declare that the server has failed. For slow software failures like these, it is often easier to simply kill and restart processes or computers periodically, before they get into trouble.

This chapter contains these topics:

Availability Overview

To keep Web-based information systems operating around the clock day in and day out, robust software must be carefully executed on redundant hardware. Overall Web-site availability requires good practices in programming, deployment, and operations. Weakness in one area can compromise overall availability, even if the other areas follow good practices.

This section contains these topics:

Key Practices

Key deployment practices include:

Key programming practices include:

Key operational practices include:

These mid-tier practices put much of the burden for overall Web-site availability on the database; therefore it is important to understand how to make the database highly available. Database availability is not covered in depth in this manual.

See Also:

For more information on database availability, see the Oracle9i Database Administrator's Guide or Oracle8i Database Administrator's Guide in the Oracle Database Documentation Library  

Measuring Availability

Availability of an overall system or of an individual system component is defined as the percentage of time that it meets correctness and performance specifications. A component that meets specifications 8 hours a day has 33.3% availability, as does a component that works 20 minutes every hour. A system or component that has 99% availability is, on average, down 3.65 days per year.

Mission-critical systems often have goals of four nines (99.99% availability), which allows less than an hour of downtime per year, or even five nines (99.999% availability), which allows about five minutes of downtime per year.

Availability may not be constant over time. It may be higher during the day shift, for example, when most transactions are expected to take place, and lower during the night shift, when planned downtime may be needed to install hardware or software upgrades. Because the Internet is global in reach, however, there is less flexibility to time-shift availability requirements. It is common to require availability every hour of every week, often referred to as 24x7.

Redundant components can be used to improve availability, but only if they can take over for the failed primary quickly. For example, if a light bulb burns out after an hour of use, but it takes 10 minutes to detect the darkness and 20 minutes to replace the bulb, then the availability of light is 60/90 = 66.6%.

Planning for a highly available Web site requires first establishing an overall availability goal and then taking a critical look at the cost and availability of every piece of hardware, every module of software, every byte of data, and every task in the day-to-day operations to determine where an increase in component availability increases overall availability with reasonable cost.

Hardware Availability

Hardware is probably the easiest to analyze, because component availability may be specified by the manufacturer, and components can usually be assumed to fail independently. Independence allows us to calculate the availability of a system consisting of two components C1 and C2 that must both be available as:

Av(C1 and C2) = Av(C1) * Av(C2)

The availability of a system of two redundant components, only one of which must be available, assuming zero time to detect a failure in one component and switch over to the spare, is:

Av(C1 or C2) = 1 - (1 - Av(C1)) * (1 - Av(C2))
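For example, if C1 and C2 each have 97% availability (an illustrative figure, not a vendor specification), then a configuration that needs both gives Av(C1 and C2) = 0.97 * 0.97 = 0.9409, or about 94.1%, while a redundant pair needing only one gives Av(C1 or C2) = 1 - 0.03 * 0.03 = 0.9991, or 99.91%.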

In general, as more non-redundant components are added, availability decreases, and as more redundant components are added, availability increases. A Noah's Ark approach to hardware availability is to simply have two of everything. This is adequate when components have similar availability and cost. However, as the following example shows, more detailed analysis can often save money.

The goal in this example is to achieve Av(C1 and C2) > .99. Assume that:

Using the formulas above, we get

If the redundant components can dynamically check each other and failover automatically (rather than require manual intervention), then recovery will be quicker. It is even better if the extra pieces of hardware can work simultaneously to increase throughput or decrease response time in the case of no failure, although adequate capacity should exist when a failure does occur.

Modern hardware typically fails very infrequently (only after many months or even years), so the primary factors affecting hardware availability are:

Software Availability

Software can fail because of underlying hardware failure, but software usually fails because of logic defects (that is, bugs). All software that is complex enough to provide interesting Web-based applications has bugs. It is not unusual for complex Java server applications to crash daily under heavy load. Such server processes can take several minutes to restart, yielding an availability of about 99.5%.

Java bugs do not always immediately crash their containing process. Memory leaks can make the process sluggish, so that it is only partially available. Because of the inherent difficulty in detecting software failures, and the lack of good software failure detectors for JVMs in the current Oracle9iAS product, our recommended best practices are:

Data Availability and Reliability of Operational Procedures

Data can be lost or corrupted due to hardware failures, software failures, intentional acts (viruses or hackers) and unintentional acts (operations staff runs the wrong batch job, for example, or backup data sets are mislabeled or lost).

To prevent data loss, solid file system and database backup practices are essential, as are training and fire drills for operations staff. To improve data integrity, use database transactions and integrity constraints. To prevent modification to static Web pages and Java code, deploy them on read-only file systems or use intrusion-detection software.

To minimize operational mistakes, good system administration tools are important. These include graphical displays of topology and memory, CPU, and network activity. Good training is also important. Be wary of a hodgepodge of homegrown shell scripts without robust error checking and data validation logic.

Operations staff must record failures, and these failures must be analyzed periodically. Hardware failures that occur more often than the manufacturer's specifications indicate either adverse machine room conditions or the need for a different manufacturer. Software failure records aid in bug tracking, justifying more code profiling and monitoring tools, or refining the period (for example, daily or weekly) for restarting JVMs or Web servers. Operational failure records often suggest needed improvements in tools, scripts, or training.

Users' View of Application Availability

In a perfect world, users would benefit transparently from all the work that system administrators and programmers put into a highly available Web site. All application screens would appear in less than a second or two; "server not found" and "internal server error" screens would never appear; users would never need to repost a form after receiving such an error; and you would never need to search through application logs to find out whether the first post succeeded or failed. Unfortunately, the nature of HTTP and commercial Web browsers makes it impossible to mask all failures.

Even in an imperfect world, however, users should never be exposed to corrupt, missing, or incomplete high-value data such as payment transactions, addresses, or phone numbers. To allow for flexible implementation, low-value data such as catalog information or news feeds are often allowed to be less consistent.

When application data have been recovered to a consistent state, users should always be able to start a new application session. A large class of service failures can be handled by invalidating existing sessions and requiring the user to restart his session. While this procedure is not transparent to users, they can usually recover and resume their work.

When in session, a failure to process a single request should not cause the session to fail. Session failure may be seen by a user if multiple requests fail or if the session expires due to inactivity or administrative intervention.

Requests may fail with no indication as to whether session or transactional data was modified. Users may resubmit requests, whether failed or otherwise, using reload or refresh browser functions without starting a new session, but in general it is the user's responsibility to ensure that the request is idempotent. This is the typical behavior for e-commerce and e-brokerage sites. Indiscriminate retry after a failure can result in duplicate transactions, although applications typically use logs to warn of potential problems.

Many failures of idempotent and transactional requests can be retried transparently. Failures that occur in the scope of a servlet's doGet or doPost method can be transparently retried by try..catch logic. Good application software design will relieve users of retry responsibility, except for failures that occur between the browser and the servlet engine.

Hardware Redundancy and Load Balancers

All high-availability solutions take the same general approach: keep redundant components on hand to use when something breaks, and when something does break, switch over automatically and quickly.

In the reference topology illustrated in Figure 1-1, any of the following components can break:

Figure 1-1 Reference Topology



In this chapter, we are primarily concerned with availability of mid-tier components. It is equally important to have a thorough understanding of database availability issues and Internet service provider (ISP) availability issues, including redundant ISPs and the Border Gateway Protocol for ISP-facing routers.

Application server computers, Web cache computers, and firewall computers can be clustered using hardware load balancers to increase overall availability and provide scalability. Each computer in the cluster is independent but identically configured. The identical configuration is maintained either by using scripts to copy the same contents to each computer's local disk or by using a highly available shared disk system.

Load balancers are balancing, failure detecting, multi-layer (Open Systems Interconnect network reference model layers 2-7) switches like Cisco's CSS 11000 or F5's BigIP, which can detect failure of clustered mid-tier servers and route around them. Load balancers also allow mid-tier servers to be taken out of service gracefully. Once out of service, a mid-tier server may be reconfigured or simply rebooted in order to reduce the chance of a crash or slowdown due to resource leaks and other bugs.

Network appliances should all support a primary/backup configuration. Cabling and power supplies need special attention and are often the source of problems. For example, there are stories of a single slice of a carpenter's saw severing both primary and backup networking cables.

In general, redundant components should have independent failure modes. Clustered computers should not have a single power supply, single network cable, or single shared disk or file server. (Exceptions can be made for single components with very high availability.) Backups of your database and operating system files should not be kept in the computer room, where they might burn up along with the online copy.

Firewalls

Firewalls separate the global network into three security zones. The two implementations shown in Figure 1-2 are common.

Figure 1-2 Firewall Implementations



Implementation One shows the simplest way to implement a firewall with three zones. A single computer with three network interface cards (i1, i2, and i3) running appropriate software divides the network into zones z1, z2, and z3. Cisco's PIX, Checkpoint's FW-1, Symantec's Raptor, and many others can work this way. All these firewalls optionally support a second standby firewall machine that can take over should the primary fail.

Implementation Two is an alternative that uses two machines, each with two network interfaces, to achieve three zones. This approach may be more difficult to manage, because there are two sets of security rules that are not centralized. It is most useful when the duties of firewall machine 1 can be assumed by a router, load balancer, or other already-existing machine. It is also more challenging to configure Implementation Two for failover, because you must configure both machines for failover and then test to make sure the combination will failover in case machine 1 or machine 2 fails.

Instead of employing a passive standby firewall for availability, both firewall implementations may be clustered with load balancers to achieve both availability and scalability.

Clustering

In this chapter, a mid-tier cluster is defined as a set of processes (servers) that all provide more or less the same service. Usually, the servers are spread over several computers in order to make failures more independent, as well as to provide scalability. Requests to servers can be routed with hardware load balancers or software clustering services. Software solutions have several drawbacks:

Hardware load balancers offer several advantages:

In Figure 1-3, Figure 1-4, and Figure 1-5, load balancers labeled balancer1, balancer2, balancer3, and balancer4 all provide scalability and failover. Failover can hide both planned and unplanned outages. The balancers in the diagrams are logical.

Many logical balancers can be mapped to a single physical load balancer device using load balancer-specific configuration commands. The commands to define a logical balancer typically associate an input address (MAC or IP) with several output addresses, some load balancing options, and some routing options (also called sticky modes). There may be balancer-specific limits on the number and type of logical balancers that can be configured, and there are network throughput limits as well.

In this chapter we assume that the number of clustered servers is in the range 2..50. This is well within the supported range of most hardware load balancers. More importantly, with 50 or fewer clustered servers, you can use a single-threaded script that cycles through the servers one at a time to run administrative commands. Managing larger clusters would require more administrative support specifically targeted at clusters than currently exists in Oracle9iAS.

Application Server Cluster

A single application server consists of the Oracle HTTP Server and, for each CPU, a couple of JVM processes running Jserv and JavaServer Pages (JSPs), plus perhaps other Oracle9iAS components (although the other components are not covered in this chapter). Application servers can be clustered using load balancers in order to increase availability and throughput.

Figure 1-3 Application Server Cluster



In Figure 1-3, Balancer1 spreads connection requests to a virtual IP address and port over several real servers, based on such policies as round robin, static weighting, or least average response time. It supports stickiness based on SSL sessions and HTTP cookies. That is, SSL connections that are part of the same secure session are routed to the same target machine, which avoids renegotiating session keys, and HTTP connections that are part of the same HTTP session are routed to the same target machine, which avoids migrating session state.

Web Cache Cluster

A single Web cache server caches frequently accessed Web pages in RAM. The cache does not spill to disk, and the cache server process is single-threaded. Web cache servers can be clustered using a load balancer for greater availability, memory capacity, and throughput.

In Figure 1-4, Balancer2 supports stickiness based on URL patterns, which can be used to partition cached Web pages over a cache farm. For example, Balancer2 could be configured to direct requests for URLs whose hash is even to server 1, and whose hash is odd to server 2.
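The following sketch illustrates this style of hash partitioning in Java. The UrlHashRouter class and its host array are illustrative only; real load balancers implement the partitioning inside the switch, not in application code:

public class UrlHashRouter {
    private final String[] cacheServers;   // for example, { "cache1", "cache2" }

    public UrlHashRouter(String[] cacheServers) {
        this.cacheServers = cacheServers;
    }

    // Mask the sign bit so the index is non-negative, then partition by hash.
    // With two servers, even hashes go to server 0 and odd hashes to server 1.
    public String serverFor(String url) {
        int index = (url.hashCode() & 0x7fffffff) % cacheServers.length;
        return cacheServers[index];
    }
}

Because the partition is a pure function of the URL, each cached page lives on exactly one cache server, so the cluster's memory capacity scales with the number of servers.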

Figure 1-4 Web Cache Cluster



Firewall Clusters

An enterprise firewall solution will usually be made up of several machines for availability and scalability. In Figure 1-5, Balancer3 supports stickiness based on (source IP, destination IP) pairs, which ensures that all traffic for a given inbound or outbound connection can be inspected by the same firewall server. These balancers must be hardened so that they are not victims of denial-of-service attacks themselves. For a firewall cluster of M servers implementing N security zones, N balancers, one in each security zone, should work together to balance traffic over the M firewall servers. The diagram shows N=3 zones: Internet, DMZ, and intranet.

Balancer4 combines the capabilities of Balancer2 and Balancer3, as well as enough firewall capability to isolate z1 and z3.

In order to guarantee that network packets cannot be routed around firewalls due to misconfiguration, all logical balancers in a security zone must map to physical balancers in that same zone.

Figure 1-5 Three-Zone Firewall Clusters



Commercial Balancers

Commercial load balancers that support all the sticky modes needed to serve as balancer1, balancer2, balancer3, or balancer4 include:

HA-HW-1: Hardware Components

Simulate failure and replacement of every hardware component, including redundant spares. Measure the actual availability impact.

For this exercise, adopt a pessimistic, Murphy's Law attitude: if it can break, it will! Power off server machines, routers, load balancers, and firewalls. Unplug network cables. Unplug disk drives.

For each failure, does failover occur automatically? How long does it take a system administrator to locate the failure? Ideally, a management framework will issue specific alerts targeting the failed component until it is repaired. After a real failure, be sure to order replacement parts promptly.

Finally, what is the impact of repair? Are components hot-pluggable so that the repair can be effected without shutting down other components? Repair, reconfiguration, or just adding a new server to a cluster to increase capacity can be major sources of downtime.


Note:

If components are not hot-pluggable, then you should be careful not to cause real failures when unplugging components for this exercise. 


Having experienced a simulated failure in training can greatly decrease the risk of a bad problem being made worse by inexperienced operations staff under pressure.

HA-HW-2: Load Balancers

Use commercial load balancers to cluster application server computers and Web cache computers to achieve availability and scalability.

Load balancers like Cisco's CSS 11000 and F5's BigIP can quickly detect and route around failed mid-tier servers. For greatest failure isolation, each server should be an independent computer, although if the availability of the computer and its operating system is very high, then multiple servers may run on the same computer. Load balancers can also quiesce and take a mid-tier server offline, so that it can be restarted or reconfigured without having to handle incoming requests at the same time.

Ideally, N+2 mid-tier machines should be used, where N is the number of mid-tier computers needed to serve peak load. One extra computer is a ready-to-go spare, should one of the N fail. The other extra computer hosts a server that is offline for restart or maintenance.

Load balancers have failover configurations to prevent themselves from becoming single points of failure.

Load balancers can also be used to balance traffic across multiple firewall computers for scalability and availability. This can be a complex and expensive option, because you must have a load balancer in each security zone. Most commercial firewalls have a failover solution that does not require use of a load balancer. So unless scalability is a concern, use the availability solution recommended by the firewall vendor.

HA-HW-3: Configure mod_jserv

Each clustered application server runs a single Apache instance and two Jserv processes for each CPU. Module mod_jserv should route requests within a single application server, not across the cluster.

Multiple processes on a single computer can help isolate software failures. You want enough Jserv processes so that some can fail and enough survive to fully utilize the computer resources until the next JVM restart cycle (see Chapter 2, "Performance and Scalability").

Module mod_jserv (using the Apache Jserv Protocol, or AJP) can be configured to route servlet requests from the Apache HTTPD processes to both local Jserv processes and to remote Jserv processes in different application servers in the cluster. It is best to configure mod_jserv to route only to local Jserv processes because:

Trade-offs with this practice include:

HA-HW-4: Independent Servers

Each clustered server should be configured identically but be as independent as possible.

Independence means a mid-tier server can be added to a cluster, removed from a cluster, or fail without affecting the other clustered servers. Strictly speaking, if one clustered server fails, then the others are somewhat affected because their loads will increase. But to a good first approximation, only the load balancer needs to be reconfigured when servers are added and removed, and only the load balancer needs to detect and react to server failure automatically.

Resources that are shared by the mid-tier servers must be made highly available themselves. For example, the database, shared NFS server, network hubs, and routers must be made highly available. And if multiple servers are hosted on a single computer, then that computer and its operating system need to be more robust than a computer that is hosting a single server.

For independence, mod_jserv's configuration should connect Apache to the local computer's Jserv processes and not to Jserv processes on other computers. Each mid-tier computer is a failure container. It is best for the load balancer to route requests among the failure containers, and for each Apache to route requests solely within a failure container.

To ease management of the multiple computers, it helps to configure them as similarly as possible. Small performance differences can easily be handled with load balancing built into the load balancer. Large performance differences can be handled by running two or more application servers on a single big machine. It may be difficult to take the fastest computers completely offline for maintenance and still have enough capacity left over for peak load.

Software Design Principles

Many of the failures you need to handle are software problems. It is difficult to write efficient server software that handles request after request forever, even assuming no hardware failure. Our goal is to write software that can stay up for about a day between restarts.

The software design practices in this chapter are targeted to Java server programs such as servlets and JSPs that run in a standard JVM in an operating system process. The primary sources of bugs are:

One of the most important software design principles is that all mid-tier applications should be statesafe. This term is used instead of stateless, because we want to emphasize that it is common and normal for mid-tier application servers and caches to have a lot of state that is shared or serially reused by many requests from many users. But the per-user state (also known as session state) should be stored someplace safe between client requests. Safe places to store session state include browser cookies and a backend database, but not a single JVM process that could crash or be restarted at any time.

The Jserv implementation of HttpSession is not statesafe. That is, it does not secure a copy of its data in either the browser or the database between user requests. Therefore, HttpSession cannot be used to store recoverable session information.

HA-SW-1: Java Synchronization

Carefully monitor the use of Java synchronization. Synchronization can cause deadlock--a situation in which throughput and resource utilization both drop to zero, a combination that is often confusing to diagnose.

The best way to avoid problems is to avoid use of Java synchronization. One of the most common uses of synchronization is to implement a pool of serially reusable objects. Often, you can simply add a serially reusable object to an existing pooled object. For example, you can add a JDBC connection and statement object to the instance variables of a single thread model servlet, or you can use the Oracle JDBC connection pool, rather than implement your own synchronized pool of connections and statements.

If you must use synchronization, then you should either avoid deadlock or detect it and break it. Both strategies require code changes. So neither can be completely effective, because some system code uses synchronization and cannot be changed by the application.

To prevent deadlock, simply number the objects that you must lock and ensure that clients lock objects in the same order.
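A minimal sketch of this lock-ordering discipline follows, using a hypothetical Account class. The key point is that transfers in both directions acquire the two monitors in the same id-based order, so neither thread can hold one lock while waiting for the other:

class Account {
    final long id;
    long balance;
    Account(long id) { this.id = id; }
}

public class Transfer {
    static void transfer(Account from, Account to, long amount) {
        // Always lock the account with the lower id first.
        Account first = (from.id < to.id) ? from : to;
        Account second = (from.id < to.id) ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }
}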

Proprietary JVM extensions may be available to help spot deadlocks without having to instrument code, but there are no standard JVM facilities for detecting deadlock.

See Also:

Chapter 2, "Performance and Scalability" for more information on synchronization 

HA-SW-2: Resource Use

Monitor resource use and fix resource leaks.

The periodic restart strategy provides protection against slow resource leaks. But it is also important to spot applications that are draining resources too quickly, so that the buggy software can be fixed rather than restarted more frequently. Leaks that prevent continuous server operation for at least 24 hours must be fixed by programmers, not by application restart.

Common programming mistakes are:

Monitoring resource usage should be a combination of code instrumentation and external monitoring utilities. With code instrumentation, calls to an application-provided interface, or calls to a system-provided interface like Oracle Dynamic Monitoring Service (DMS), are inserted at key points in the application's resource usage lifecycle. Done correctly, this can give the most accurate picture of resource use. Unfortunately, the same programming errors that cause resource leaks are also likely to cause monitoring errors. That is, you could forget to release the resource, or forget to monitor the release of the resource.
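The following is a minimal sketch of such instrumentation. The ResourceCounter class is hypothetical--it is not part of DMS--but it illustrates the acquire/release bookkeeping:

public class ResourceCounter {
    private int inUse;       // resources currently checked out
    private int highWater;   // peak concurrent use observed

    public synchronized void acquired() {
        inUse++;
        if (inUse > highWater) highWater = inUse;
    }

    public synchronized void released() {
        inUse--;
    }

    public synchronized String report() {
        return "in use=" + inUse + ", high water=" + highWater;
    }
}

Call acquired() immediately after obtaining a resource such as a JDBC connection and released() in the corresponding finally block. An in-use count that climbs steadily under constant load suggests a leak.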

Operating system commands like vmstat or ps provide process-level information such as the amount of memory allocated, number/state of threads, or number of network connections. They can be used to detect a resource leak. Development tools like Purify or JavaProbe can be used to pinpoint the leak.


Note:

In addition to compromising availability, resource leaks and overuse decrease performance. See Chapter 2, "Performance and Scalability".


HA-SW-3: Session State

Store session state in a database, indexed by session identifier (ID) stored in a cookie.

Because Jserv's HttpSession is not statesafe, it cannot be the only store for session data if you want to prevent the failure of a single JVM process from destroying many sessions.

The biggest problem is that an application might have session state that cannot be stored as-is in a database. For example:

To address the first two, examine the non-serializable (in the Java sense) session state and separate what is really specific to a particular client from what could be cached or parameterized, pooled, and serially reused across many clients. For example, an application might open a JDBC connection and prepare several SQL statements for each client session. When Smith logs on, the application prepares a number of SQL statements like:

Select x,y from T where param1 = ? and client_id = 'Smith'

As Smith enters different values (via his browser) for param1, different (personalized) values of x and y are displayed. Session state can be reduced in this case to the single (serializable) value 'Smith' by parameterizing client_id in the SQL:

Select x,y from T where param1 = ? and client_id = ?

and then serially reusing the connection and its associated statements across many clients. The serial reuse can be accomplished with:

When implementing statesafety, tradeoffs affect performance, scalability, and availability. If you do not implement statesafe applications, then:

For those concerned with the performance of statesafe applications, here are some performance optimizations:

Note that statesafe middle tiers have been implemented by a number of high-end commercial Web sites, and they scale to extremely high workloads. A well-implemented database backing store can service thousands of state save and restore operations for each CPU.
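As an illustration, here is a minimal sketch of a database backing store for session state, keyed by the session ID from the browser cookie. The session_state table, its columns, and the byte-array serialization are assumptions made for the example, not an Oracle9iAS API, and statement cleanup is omitted for brevity:

import java.sql.*;

public class SessionStore {
    // Save serialized session state under the session ID from the cookie.
    public static void save(Connection conn, String sessionId, byte[] state)
            throws SQLException {
        PreparedStatement upd = conn.prepareStatement(
            "update session_state set state = ? where session_id = ?");
        upd.setBytes(1, state);
        upd.setString(2, sessionId);
        if (upd.executeUpdate() == 0) {   // first request in this session: insert
            PreparedStatement ins = conn.prepareStatement(
                "insert into session_state (session_id, state) values (?, ?)");
            ins.setString(1, sessionId);
            ins.setBytes(2, state);
            ins.executeUpdate();
        }
        conn.commit();
    }

    // Restore session state at the start of the next request; null if none.
    public static byte[] load(Connection conn, String sessionId)
            throws SQLException {
        PreparedStatement sel = conn.prepareStatement(
            "select state from session_state where session_id = ?");
        sel.setString(1, sessionId);
        ResultSet rs = sel.executeQuery();
        return rs.next() ? rs.getBytes(1) : null;
    }
}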

See Also:

Chapter 2, "Performance and Scalability" 

HA-SW-4: Catch-All Exceptions

Discard objects that throw catch-all exceptions like SQLException, RemoteException, Error, or RuntimeException. Avoid poisoned pools.

In many cases, these exceptions indicate that the internal state of the invoked object is corrupt and that further invocations will also fail. Keep the object reference only if careful scrutiny of the exception shows it is benign, and further invocations on this object are likely to succeed.

Adopt a guilty unless proven innocent attitude. For example, a SQLException thrown from an Oracle JDBC invocation could represent one of thousands of error conditions in the JDBC driver, the network, or the database server. Some of these errors (for example, subclass SQLWarning) are benign. Some SQLExceptions (for example, ORA-3113: end of file on communication channel) definitely leave the JDBC object useless. Most SQLExceptions do not clearly specify what state the JDBC object is left in. The best approach is to enumerate the benign error codes that could occur frequently in your application and can definitely be retried, such as a unique key violation for user-supplied input data. If any other error code is found, then discard the potentially corrupt object that threw the exception.

Discard all object references to the (potentially) corrupt object. Be sure to remove the corrupt object from all pools, in order to prevent pools from being poisoned by corrupt objects. Do not invoke the corrupt object again. Instantiate a brand new object instead.
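A minimal sketch of this policy follows. The ConnectionPool interface is hypothetical, and the single benign code shown (ORA-00001, unique constraint violated) stands in for whatever whitelist of retryable errors your application builds:

import java.sql.*;

interface ConnectionPool {
    Connection checkOut() throws SQLException;
    void checkIn(Connection c);     // return a healthy connection to the pool
    void discard(Connection c);     // close and drop; never reuse
}

public class PoolGuard {
    // Guilty unless proven innocent: only enumerated codes are benign.
    static boolean isBenign(SQLException e) {
        return e.getErrorCode() == 1;   // ORA-00001: unique constraint violated
    }

    public static void execute(ConnectionPool pool, String sql)
            throws SQLException {
        Connection conn = pool.checkOut();
        try {
            Statement stmt = conn.createStatement();
            stmt.executeUpdate(sql);
            stmt.close();
            pool.checkIn(conn);         // no exception: still trusted
        } catch (SQLException e) {
            if (isBenign(e)) {
                pool.checkIn(conn);     // known-benign error: safe to reuse
            } else {
                pool.discard(conn);     // possibly corrupt: keep the pool clean
            }
            throw e;                    // let the caller decide whether to retry
        }
    }
}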

When you are sure that the corrupt objects have been discarded and that the catching object is not corrupt, throw a non-catch-all exception so that the caller does not discard this object, or retry the failed invocation as per "HA-SW-6: Retry Transactions Once".

HA-SW-5: Finally Clause

Always use a finally clause in each method to clean up (restore invariants to) instance variables and static variables.

Remember that in Java, it is impossible to leave the try or catch blocks (even with a throw or return statement) without executing the finally block. If for any reason the instance variables cannot be cleaned, then throw a catch-all exception that should cause the caller to discard its object reference to this now corrupt object.

If for any reason the static variables cannot be cleaned, then throw an InternalError or equivalent that will ultimately result in restarting the now corrupt JVM.
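A minimal sketch follows, using a hypothetical Processor class whose invariant is that busy is false between requests:

public class Processor {
    private boolean busy = false;   // invariant: false between requests

    public void handle(Object request) {
        busy = true;
        try {
            doWork(request);        // may throw at any point
        } finally {
            busy = false;           // invariant restored even on exceptions
        }
    }

    private void doWork(Object request) {
        // application logic here
    }
}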

HA-SW-6: Retry Transactions Once

Retry failed transactions and idempotent HttpServlet.doGet() operations once. Otherwise, do not retry using the same method with the same parameters on the same object instance. Propagate an exception to the caller instead.

Retries are discouraged in general, because if every catch block of an N-frame try..catch stack performs M retries, then the innermost method gets retried on the order of M^N times. This is likely to be perceived by the end user as a hang, and hangs are worse than receiving an error message.

If you could pick just one try..catch block to retry, then it would be best to pick the outermost block. It covers the most code and therefore also covers the most exceptions. Of course, only idempotent operations should be retried. Transactions guarantee that database operations can be retried as long as the failed try results in a rollback, and as long as all finally blocks restore variables to a state consistent with the rolled-back database state. It will often be the case that a servlet's doGet method will perform the retry, and a servlet's doPost method will roll back any existing transaction and retry with a new transaction.
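A minimal sketch of the retry-once pattern for an idempotent doGet follows; the runQuery helper stands in for the servlet's real read-only work:

import java.io.IOException;
import java.sql.SQLException;
import javax.servlet.ServletException;
import javax.servlet.http.*;

public class ReportServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        try {
            runQuery(req, res);             // idempotent, read-only work
        } catch (SQLException first) {
            try {
                runQuery(req, res);         // retry exactly once
            } catch (SQLException second) {
                // Do not loop; propagate so the caller (or user) decides.
                throw new ServletException(second);
            }
        }
    }

    private void runQuery(HttpServletRequest req, HttpServletResponse res)
            throws SQLException, IOException {
        // obtain a fresh connection, query, and render the page
    }
}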

Other cases where a retry is warranted are:

For example, if the database tier uses Oracle Real Application Clusters, then a new connection may be made to any available database server machine that mounts the desired database. For JDBC, the DataSource.getConnection() method is usually configured to pick among the available Real Application Clusters machines.

HA-SW-7: Database Update

HttpServlet.doGet() should not update the database. HttpServlet.doPost() should use one transaction to update one database. The user must be able to query whether the update occurred or not.

The HTTP specification states that the GET method should be idempotent and free of side-effects. Proxies or caches along the route from client to mid-tier, as well as the client pressing the reload button, could cause the GET method at the mid-tier to be called 0, 1, or more times.

HTTP POST is not assumed to be idempotent. Browsers typically require client confirmation before re-POSTing, and intermediate proxies and caches do not retry or cache the result of a POST. However, a failure may require the client to manually retry (press RELOAD, or press BACK and then re-submit), which is not safe unless the update is idempotent.

One way applications can warn users about potential duplicate requests is to encode a unique request-ID in a hidden form field and write the request-ID of each update request to the database. An update request first compares its request-ID with the database of already-processed requests and, if a match is found, warns the user about a potential duplicate. Because the user may have intended to submit two updates, the system cannot make duplicate suppression transparent. (A sketch of this request-ID check appears at the end of this section.) Another good practice is to label non-idempotent submit buttons with:

Transactions should not span client requests, because this can tie up shared resources indefinitely.

Requests generally should not span more than one transaction, because a failure in mid-request could leave some transactions committed and others rolled back. If this requires application-level compensation to recover, then availability or data integrity may be compromised.

Transactions generally should not span more than one database, because distributed transactions lock shared resources longer, and failure recovery may require simultaneous availability and coordination of multiple databases.

Applications that require a single client request (for example, confirm checkout) to ultimately affect several databases (for example credit card, fulfillment, shopping cart, and customer history databases) should perform the first step at one database and in the same transaction enqueue a message at the first database addressed to the second database. The second database will perform the second step and enqueue the third step, and so on. This queued transaction chain will eventually complete automatically, or an administrator will see an undeliverable message and will have to manually compensate.
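As an illustration of the request-ID scheme described earlier in this section, here is a minimal sketch of the duplicate check. The processed_requests table and its column are assumptions made for the example; the check and the insert must run in the same transaction as the business update so that they commit or roll back together:

import java.sql.*;

public class DuplicateCheck {
    // True if this request-ID (from the hidden form field) was already processed.
    public static boolean alreadyProcessed(Connection conn, String requestId)
            throws SQLException {
        PreparedStatement sel = conn.prepareStatement(
            "select 1 from processed_requests where request_id = ?");
        sel.setString(1, requestId);
        return sel.executeQuery().next();
    }

    // Record the request-ID in the same transaction as the update itself.
    public static void record(Connection conn, String requestId)
            throws SQLException {
        PreparedStatement ins = conn.prepareStatement(
            "insert into processed_requests (request_id) values (?)");
        ins.setString(1, requestId);
        ins.executeUpdate();
    }
}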

Periodic Operations Procedures

The preceding sections covered provisioning Web-site hardware and software for availability. Operations staff also play an important role. First, they must carry out certain preventative procedures, primarily backups and application server process restarts, on a regular basis. Second, they must be prepared to react to failures in order to make the situation better, not worse. Finally, they must be able to react to a constantly changing environment, with new hardware for increased capacity and with frequent software and Web content updates to support the fast-paced online world.

HA-OP-1: Backup Schedule

Implement a database and operating system file backup schedule that supports restoring the entire Web site to brand-new computers, storage, and network equipment with less than 24 hours combined downtime and data loss.

This provides a last line of defense against the worst (but not all that uncommon) failures, such as careless employees, botched upgrades, security break-ins, and software bugs that corrupt stored data.

It is important to back up everything, including router, firewall, and load balancer configuration; operating systems and their configuration; and Oracle9iAS software and its configuration. But the backend Oracle database should be where most of your data is stored, and it should receive the most attention.

To ensure that backed up files are consistent with database data, and to further leverage the database backup and restore features, consider storing master copies of flat files in Oracle Internet File System. Because Oracle Internet File System is a database application, your master files are backed up the same way as your database. You can use FTP to copy files from Oracle Internet File System to the mid-tier file systems. On NT, you can simply create a drive letter for Oracle Internet File System. For better performance and availability, it is better to copy the mid-tier files to the local operating system file system (or cache them in Web cache) than to access Oracle Internet File System each time a file is requested.

If a failure requiring restoration of the database from a backup happens once a year and takes 24 hours to fix, then availability is 99.7%. Using a physical standby database can reduce the database restore time to a few minutes. The standby database is always in recovery mode, and its state lags the primary database by some fixed period (for example, 15 minutes). When the primary fails, applications switch to the standby. The hard part is detecting the failure--which could be a corrupt but infrequently accessed disk block, an accidentally run batch job, or some other non-obvious failure--within the fixed lag period, before it can propagate to the standby.

HA-OP-2: Server Restarts

Restart Java servers daily or according to measured frequency of failure.

Many hard-to-debug software failures are caused by resource leaks in serially reusable servers that gradually degrade performance. Degradation can become so severe for some Java server processes that the JVM process stops responding, even to kill commands. Resource leaks may also occur in the Apache Web server processes, especially if a lot of application code runs there, using mod_perl or mod_php, for example. Usually, the Apache server will not fail often enough to impact overall availability, but restarting it affords an opportunity to rotate the access and error log files.

Resource leaks can cause failure of other software servers, such as:

Availability of these servers, however, should be high enough not to warrant periodic restart.

Determining a good restart frequency for Apache and Jserv processes may take some trial and error. Begin with a daily restart and record all Java process crashes and hangs. If many failures are observed, then increase the frequency of restart slightly. But also attempt to localize the faulty software module (using dumps, traces, or error logging) and insist that the bugs be fixed. If very few failures are observed, then decrease the frequency of restart.

To periodically restart all Web servers and their associated Jserv servers in a cluster without causing any client requests to fail, cycle through all Web servers and for each Web server W:

    1. Issue the "out-of-service W" command to the load balancer.

    2. Perform a graceful restart on W.

    3. Restart local Jservs on W.

    4. Issue the "in-service W" command to the load balancer.

    5. Repeat with next W.
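A minimal sketch of this rolling-restart loop follows. The balancer-cli commands, host names, and restart-jservs script are hypothetical placeholders for your load balancer's actual administrative interface and your site's own restart scripts:

public class RollingRestart {
    public static void main(String[] args) throws Exception {
        String[] webServers = { "web1", "web2", "web3" };   // illustrative hosts
        for (int i = 0; i < webServers.length; i++) {
            String w = webServers[i];
            run("balancer-cli out-of-service " + w);  // 1. stop routing to W
            run("ssh " + w + " apachectl graceful");  // 2. graceful Apache restart
            run("ssh " + w + " restart-jservs");      // 3. restart local Jservs
            run("balancer-cli in-service " + w);      // 4. resume routing to W
        }                                             // 5. next W
    }

    static void run(String cmd) throws Exception {
        Process p = Runtime.getRuntime().exec(cmd);
        if (p.waitFor() != 0) {
            throw new Exception("command failed: " + cmd);
        }
    }
}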

This approach should scale to as many Web servers as most load balancers can handle. For example, if you must restart Jservs every twelve hours to stay ahead of failures, and if a restart and warm-up of the Apache and Jserv processes on a single machine takes ten minutes, then you can scale to 72 mid-tier machines.

These regular maintenance periods can also be used to refresh the configuration, code, and static content files from a master copy, archive and compress mid-tier logs, and so on.

HA-OP-3: Mid-Tier File Synchronization

Synchronize mid-tier files by taking one computer out of the cluster at a time, if possible. Otherwise stage file updates so the entire cluster goes offline for just seconds to synchronize.

File system content should be synchronized across mid-tier computers. Depending on configuration settings, some files can be changed without restarting Apache or Jserv processes, while other changes require a restart. Some files can be NFS mounted, while other files cannot. Some applications can behave erratically if new class files are dynamically loaded and mixed with old class files. In short, making a shared or networked file system highly available while changing files on a running mid-tier computer is full of unknowns and worry.

A good practice instead is to exploit the incremental offline capability afforded by the cluster load balancer to synchronize a mid-tier computer's local files with a master copy while the mid-tier computer is offline. The challenge of such an incremental offline synchronization is to move all mid-tier computers from version N of synchronized file system content to version N+1, while making sure that all requests for a given session are directed to mid-tier computers with the same version of file system content.

You can configure the load balancer to use sticky routing, so that all requests within a session are routed to the same mid-tier computer (provided it is still online). During the period of rolling synchronization, however, session failover across content versions should be prevented. For example, you could store the synchronization version number as part of the session ID and check it on each request. This prevents errors in reading or writing session data if its datatype changes from one version to the next.
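A minimal sketch of such a version check follows. The scheme of prefixing the session ID with the version number, separated by a dot, is illustrative only:

public class VersionCheck {
    static final int CURRENT_VERSION = 7;   // this server's content version N

    // True if the session was created under this server's content version.
    // A mismatch should invalidate the session and ask the user to restart it.
    static boolean sessionMatches(String sessionId) {
        int dot = sessionId.indexOf('.');
        if (dot < 1) return false;          // malformed ID: treat as mismatch
        try {
            return Integer.parseInt(sessionId.substring(0, dot)) == CURRENT_VERSION;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}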

As version N+1 computers come online, be sure to monitor them. If something goes wrong, then you can stop the synchronization, leave unmodified version N computers online, and take the troublesome version N+1 computers offline and fix the problem.

Finally, note that some configuration, code, or content changes cannot be performed one mid-tier server at a time. For example, versions N and N+1 of an application may not both be able to run against the same database schema. It is always good practice to have a method to update all servers in a cluster so that they are all consistent, with minimal downtime. One strategy: first, stage the updates to a separate directory on each mid-tier computer; second, go offline just long enough to swap in the updates (for example, with a symbolic link); third, restart the servers. To minimize downtime, the swap and restart should happen in parallel on every server in the cluster.

HA-OP-4: Document Procedures

Document all recovery and repair procedures, and practice them regularly.

During a failure, operations staff will be under pressure. It may be difficult to think clearly. It is not a good time to try a new, risky, or unfamiliar repair procedure. Document all of the following:

Conduct periodic fire drills. When a real failure occurs, it will not be the first time the staff has had to react under pressure. Try to simplify complex repair procedures to minimize errors.


Copyright © 2001 Oracle Corporation.

All Rights Reserved.