13
Miscellaneous Best Practices

This chapter describes miscellaneous best practices for availability, performance testing, and operations.

13.1 Simulate Failures and Compute Availability Impact

For this exercise, adopt a pessimistic, Murphy's Law attitude: if it can break, it will! Power off server machines, routers, load balancers, and firewalls. Unplug network cables. Unplug disk drives.

For each failure, does failover occur automatically? How long does it take a system administrator to locate the failure? Ideally, a management framework will issue specific alerts targeting the failed component until it is repaired. After a real failure, be sure to order replacement parts promptly.

Finally, what is the impact of repair? Are components hot-pluggable so that the repair can be effected without shutting down other components? Repair, reconfiguration, or just adding a new server to a cluster to increase capacity can be major sources of downtime. Note that if components are not hot-pluggable, you should be careful not to cause real failures when unplugging components for this exercise.

Having experienced a simulated failure in training can greatly decrease the risk of a bad problem being made worse by inexperienced operations staff under pressure.

13.2 Pooling and Sharing

Constantly creating and destroying resource objects can be very expensive in Java. A resource pool that lets clients share such resources can yield significant performance and scalability gains:

Shared resources are used by multiple clients at the same time (parallel re-use), for example, read-only query results from the database.
Pooled resources are used by one client at a time (serial re-use), for example, JDBC connection pools, stateless session beans, and servlets/JSPs.
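
The difference between serial and parallel re-use can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not Oracle9iAS code: the class names `SimplePool` and `SharedCache` are hypothetical, and a production system would normally rely on the pooling that the container or JDBC driver already provides.

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch of serial re-use: each pooled object is held by
// exactly one client between borrow() and giveBack().
class SimplePool<T> {
    private final BlockingQueue<T> idle;

    SimplePool(Collection<T> resources) {
        // The resources are created once up front, not once per request.
        idle = new ArrayBlockingQueue<>(Math.max(1, resources.size()), false, resources);
    }

    // Blocks until some client has returned a resource.
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a resource", e);
        }
    }

    // Hands the resource back so the next client can use it.
    void giveBack(T resource) {
        idle.add(resource);
    }

    int available() {
        return idle.size();
    }
}

// Hypothetical sketch of parallel re-use: many clients may read the same
// value at once, so the cached values should be immutable.
class SharedCache<K, V> {
    private final ConcurrentMap<K, V> values = new ConcurrentHashMap<>();

    // Loads the value on the first request and shares it afterwards.
    V get(K key, Function<K, V> loader) {
        return values.computeIfAbsent(key, loader);
    }
}
```

A pooled object can be mutable because only one client holds it at a time. A shared object is visible to all clients simultaneously, which is why the text restricts sharing to read-only results.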

The following are some objects that can be cached, shared, or pooled:

JDBC connections and statements can be pooled and reused across client requests.
Results of expensive queries can be pooled. Results of read-only queries can even be shared concurrently.
Computed Java / XML objects can either be pooled or shared depending upon their statefulness.
Parsed XML files and XSL stylesheets.
Output of HTTP requests in Oracle9iAS Web Cache, Akamai cache, Inktomi cache, etc.

You should analyze your system to determine other domain objects that may meet these criteria.
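
As one illustration of the list above, a parsed XSL stylesheet can be shared through the standard JAXP `Templates` interface, which is specified to be thread-safe, while each request creates its own lightweight `Transformer`. The sketch below assumes stylesheets are identified by a name chosen by the application. The class `StylesheetCache` and its methods are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: parse each stylesheet once, then share the result.
class StylesheetCache {
    private final TransformerFactory factory = TransformerFactory.newInstance();
    private final ConcurrentMap<String, Templates> compiled = new ConcurrentHashMap<>();

    // Parses the stylesheet only the first time a given name is requested.
    Templates lookup(String name, String xslText) {
        return compiled.computeIfAbsent(name, key -> compile(xslText));
    }

    // TransformerFactory is not guaranteed to be thread-safe, so guard it.
    private synchronized Templates compile(String xslText) {
        try {
            return factory.newTemplates(new StreamSource(new StringReader(xslText)));
        } catch (TransformerException e) {
            throw new IllegalArgumentException("invalid stylesheet", e);
        }
    }

    // A Transformer is not thread-safe, so a fresh one is created per call.
    String transform(String name, String xslText, String xml) {
        try {
            StringWriter out = new StringWriter();
            lookup(name, xslText).newTransformer()
                .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("transformation failed", e);
        }
    }
}
```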

13.3 Perform Incremental Performance Evaluation During the Development Cycle

Do not wait until the end of the project to do a test cycle. Performance tests should be run regularly after each stage of development. New performance test suites should be added as new features or functions are implemented. If possible, run performance regression tests regularly to compare performance results.
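
A regression comparison can be as simple as checking each test's current timing against a stored baseline. The following sketch assumes timings are recorded in milliseconds per named test and that a fixed tolerance (for example, 10 percent) is acceptable. The class `PerfRegressionCheck` and the test names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: flag tests that became slower than the baseline.
class PerfRegressionCheck {
    // Returns each regressed test with its slowdown ratio (current / baseline).
    // A tolerance of 0.10 allows results up to 10 percent slower.
    static Map<String, Double> regressions(Map<String, Double> baselineMs,
                                           Map<String, Double> currentMs,
                                           double tolerance) {
        Map<String, Double> slower = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : currentMs.entrySet()) {
            Double base = baselineMs.get(entry.getKey());
            if (base == null) {
                continue; // a newly added test has no baseline yet
            }
            double ratio = entry.getValue() / base;
            if (ratio > 1.0 + tolerance) {
                slower.put(entry.getKey(), ratio);
            }
        }
        return slower;
    }
}
```

Tests added for new features simply have no baseline on their first run. Their first result becomes the baseline for later comparisons.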

13.4 Run Your Performance Test on Systems That Will Simulate Your Production Environment

Developers commonly run their functional tests in single-user mode on their workstation or their desktop, which usually has only one CPU. This setup rarely represents the production environment, and it is not adequate for running performance tests. If the application is intended to be used by a large number of concurrent users running on multiple processors, simulate the production environment with a representative workload to study the performance impact.
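
The gap between single-user and concurrent testing can be illustrated with a small harness that issues the same request from several threads at once. This is only a sketch: `ConcurrentLoad` is a hypothetical class, the request is passed in as a `Runnable` stand-in for a real HTTP call, and a real test would use a proper load driver against production-like hardware.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: simulate several users issuing requests concurrently.
class ConcurrentLoad {
    // Runs `request` requestsPerUser times on each of `users` threads and
    // returns every observed latency in nanoseconds.
    static List<Long> run(int users, int requestsPerUser, Runnable request) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        try {
            List<Callable<List<Long>>> tasks = new ArrayList<>();
            for (int u = 0; u < users; u++) {
                tasks.add(() -> {
                    List<Long> mine = new ArrayList<>();
                    for (int i = 0; i < requestsPerUser; i++) {
                        long start = System.nanoTime();
                        request.run();
                        mine.add(System.nanoTime() - start);
                    }
                    return mine;
                });
            }
            List<Long> all = new ArrayList<>();
            // invokeAll waits until every simulated user has finished.
            for (Future<List<Long>> future : pool.invokeAll(tasks)) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load run interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("request failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Running the same request with `users` set to 1 and then to a realistic concurrency level shows contention effects that a single-user functional test cannot reveal.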

13.5 Understand How to Configure Your Test Driver and Analyze the Result

Commercial drivers used to simulate HTTP requests can be very effective, but they are often complicated to configure. Oracle has encountered situations where customers set up their driver incorrectly, did not know how to interpret the results, ended up drawing the wrong conclusions, and wasted valuable time in locating the real problems.
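
One common misreading of driver output is to rely on the mean response time alone. The sketch below, which uses a hypothetical `LatencyStats` class and made-up sample values, computes nearest-rank percentiles so that a few slow outliers can be distinguished from a general slowdown.

```java
import java.util.Arrays;

// Hypothetical sketch: summarize response times reported by a test driver.
class LatencyStats {
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    static double mean(long[] samples) {
        long sum = 0;
        for (long sample : samples) {
            sum += sample;
        }
        return (double) sum / samples.length;
    }
}
```

For the ten illustrative samples 10, 12, 11, 13, 12, 11, 10, 12, 11, and 900 milliseconds, the mean is 100.2 but the median (50th percentile) is 11. The mean alone would suggest that every request is slow, when in fact a single request is.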

13.6 Assign Someone Who is Experienced in Running and Analyzing Performance Tests

Running performance tests is not a push-button job that can be delegated to an inexperienced engineer. It requires someone who has knowledge and experience with operating systems and databases, and who understands the application that is to be analyzed.

13.7 Document All Recovery and Repair Procedures, and Practice Them Regularly

During a failure, operations staff will be under pressure. It may be difficult to think clearly. It is not a good time to try a new, risky, or unfamiliar repair procedure.

Document the following:

Backup files and archived log locations
Diagnostic tool syntax
Location of spare parts (disks, network cards)
How to replace failed parts (what should be powered off or on; which racks, chassis, or slots hold which components)
What needs to be restarted
Contacts: supervisors, support personnel, other experts

Conduct periodic fire drills so when a real failure occurs, it will not be the first time the staff has had to react under pressure. Try to simplify complex repair procedures to minimize errors.

13.8 Use Available Tools to Monitor Site Load and Status

Oracle9iAS includes tools that provide good status updates without impacting the production instance.

Some of these (for example, Dynamic Monitoring Service) are well integrated with Oracle Enterprise Manager and provide graphical updates on the Oracle Enterprise Manager console. Others (for example, iHAT, the Oracle9iAS Cluster Monitor tool) are available as utilities on the Oracle Technology Network Web site.

Collectively, these tools provide a means to monitor a production instance without requiring you to be physically at the machine. It is a good practice to use them in addition to any other system-level utilities or tools you already have in place.

13.9 Rolling Periodic Restarts Avoid Unexpected Errors

It is generally a good practice to restart (bounce) servers periodically. Doing so recovers from slow memory leaks, temporary disk space build-up, and other hidden problems that may manifest themselves only after long periods of operation. This is a simple way to avoid unexpected failures.

Oracle9iAS makes it easy to follow this practice without any client-visible system downtime. Set up a cluster of servers, and then set up a staggered reboot schedule for the individual servers.

When a process is down, Oracle9iAS automatically takes it out of its routing structure. For HTTP requests, this is ensured by Oracle9iAS Web Cache; for J2EE requests, it is ensured by mod_oc4j and the clustering components.

Thus, end-customer requests are never routed to the machine that is down, and the restart improves system performance. If restarting the machine is overkill, another option is to restart only the Oracle9iAS instance. If even that is overkill, restarting individual OC4J instances is recommended, especially if you have deployed unproven or new Java applications.
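
A staggered schedule can be derived by spreading the restarts evenly across the restart cycle so that only one server is down at a time. The following sketch is illustrative only: the class `StaggeredRestarts`, the server names, and the 24-hour cycle are assumptions, and it computes offsets without invoking any Oracle9iAS command.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compute evenly spaced restart offsets for a cluster.
class StaggeredRestarts {
    // Maps each server to its restart offset from the start of the cycle.
    static Map<String, Duration> offsets(List<String> servers, Duration cycle) {
        Map<String, Duration> plan = new LinkedHashMap<>();
        if (servers.isEmpty()) {
            return plan;
        }
        // The gap between consecutive restarts must exceed the time one
        // server needs to come back up, or two servers will overlap.
        Duration gap = cycle.dividedBy(servers.size());
        for (int i = 0; i < servers.size(); i++) {
            plan.put(servers.get(i), gap.multipliedBy(i));
        }
        return plan;
    }
}
```

For three servers and a 24-hour cycle, the offsets are 0, 8, and 16 hours, so each server restarts once a day while the other two continue to serve requests.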

13.10 Stock Spares and Have a Backup Schedule

Oracle9iAS provides easy commands to back up and restore configurations. However, these commands provide a benefit only if they are executed periodically. Moreover, Oracle9iAS will generally be just one piece of a larger puzzle.

Hence, take care that the backup and recovery operational schedule supports restoring the entire Web site, including onto brand-new computers, storage, and network equipment, with less than 24 hours of combined downtime and data loss. These backups and stored spare parts provide a last line of defense against the worst failures, such as careless employees, botched upgrades, security break-ins, and software bugs that corrupt stored data.

If a failure that requires restoring the database from a backup happens once a year and takes 24 hours to fix, then availability is about 99.7%. It is possible to reduce the database restore time to a few minutes by using a physical standby database. With this scheme, the standby database is always in recovery mode, and its state lags the primary database by some fixed period (for example, 15 minutes). When the primary fails, applications switch to the standby. The hard part is detecting the failure, which could be a corrupt but infrequently accessed disk block, an accidentally run batch job, or another non-obvious failure. It must be detected within the fixed lag period, before it can be propagated to the standby.
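
The availability figure above follows from dividing the hours of uptime by the hours in a year. A minimal sketch of that arithmetic, using a hypothetical `Availability` class and assuming a 365-day year:

```java
// Hypothetical sketch: availability as the fraction of the year spent up.
class Availability {
    static double fraction(double failuresPerYear, double hoursPerFailure) {
        double hoursPerYear = 365.0 * 24.0; // 8760 hours
        double downtime = failuresPerYear * hoursPerFailure;
        return (hoursPerYear - downtime) / hoursPerYear;
    }
}
```

`Availability.fraction(1, 24)` evaluates to 8736 / 8760, or roughly 0.9973, which is the 99.7% quoted above. If a standby database cut the same yearly outage to an assumed 15 minutes, `Availability.fraction(1, 0.25)` would be roughly 0.99997.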

It is important to back up everything, including router, firewall, and load balancer configurations; operating systems and their configurations; and the Oracle9iAS software and its configuration. However, the back-end Oracle database should be where most of your data is stored, so it should receive the most attention.
