8 Scalability and High Availability

This chapter gives an overview of the scalability and high availability solutions that Oracle Application Server provides. It covers the following topics:

Scalability
High Availability

8.1 Scalability

Scalability is the ability of a system to provide throughput in proportion to, and limited only by, available hardware resources. A scalable system is one that can handle increasing numbers of requests without adversely affecting response time and throughput.

Vertical scaling is the growth of computational power within one operating environment. Horizontal scaling uses multiple systems that work together in parallel on a common problem.

Oracle Application Server scales both vertically and horizontally. Horizontally, Oracle Application Server can increase its throughput with Oracle Application Server Clusters, where several application server instances are grouped together to share a workload. Vertically, Oracle Application Server lets you start several virtual machines from the same configuration files inside a single operating environment, automatically configuring ports, applications, and routing. This provides the advantage of vertical scaling based on multiple processes, but eliminates the overhead of administering several separate application server instances.

8.2 High Availability

The availability of a system or any component in that system is defined by the percentage of time that it works normally. The formula for determining the availability for a system is:

Availability = average time to failure (ATTF) / [average time to failure (ATTF) + average time to recover (ATTR)]

For example, a system that works normally for twelve hours per day is 50% available. A system that has 99% availability is down 3.65 days per year on average. Critical systems may need to meet exceptionally high availability standards, and experience as little as four to five minutes of downtime per year.
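The arithmetic behind these figures can be reproduced with a short calculation. The following is a minimal sketch in Python; the function names and the 365-day year are assumptions made for illustration and are not part of Oracle Application Server.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def availability(attf_hours: float, attr_hours: float) -> float:
    """Return ATTF / (ATTF + ATTR) as a fraction between 0 and 1."""
    return attf_hours / (attf_hours + attr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Return the average hours of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Twelve hours up followed by twelve hours down gives 50% availability.
print(availability(12, 12))  # 0.5

# 99% availability, expressed as days of downtime per year.
print(f"{downtime_hours_per_year(0.99) / 24:.2f} days")

# 99.999% availability, expressed as minutes of downtime per year.
print(f"{downtime_hours_per_year(0.99999) * 60:.2f} minutes")
```

The last line shows that roughly five minutes of downtime per year corresponds to an availability of about 99.999%.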

Oracle Application Server is designed to provide a wide variety of high availability solutions, ranging from load balancing and basic clustering to providing maximum system availability during catastrophic hardware and software failures.

High availability solutions can be divided into three basic categories: local high availability, backup and recovery, and disaster recovery.

8.2.1 Local High Availability Solutions

Local high availability solutions ensure availability in a single data center deployment. These solutions guard against process, node, and media failures, as well as human errors. Local high availability solutions can be further divided into two types: active-passive and active-active.

Active-passive solutions deploy an active instance that handles requests and a passive instance that is on standby. When the active instance fails, it is shut down, and the passive instance is brought online to resume application services. At this point the active and passive roles are switched. This process can be done manually or it can be handled through vendor-specific clusterware. Active-passive solutions are generally referred to as cold failover clusters.
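The failover sequence can be sketched in a few lines of Python. This is an illustrative model only: the `Instance` and `ColdFailoverPair` classes, the `healthy` flag, and the node names are hypothetical and do not correspond to any Oracle Application Server or clusterware interface.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True   # hypothetical health flag set by a monitor
    online: bool = False   # True while the instance serves requests


class ColdFailoverPair:
    """Toy model of an active-passive (cold failover) pair."""

    def __init__(self, active: Instance, passive: Instance) -> None:
        self.active = active
        self.passive = passive
        self.active.online = True
        self.passive.online = False

    def check_and_failover(self) -> bool:
        """Fail over if the active instance is unhealthy. Return True on failover."""
        if self.active.healthy:
            return False
        if not self.passive.healthy:
            raise RuntimeError("both instances have failed")
        self.active.online = False    # shut down the failed active instance
        self.passive.online = True    # bring the standby instance online
        # The roles are now switched.
        self.active, self.passive = self.passive, self.active
        return True

    def handle(self, request: str) -> str:
        self.check_and_failover()
        return f"{self.active.name} handled {request}"


pair = ColdFailoverPair(Instance("node1"), Instance("node2"))
print(pair.handle("request-1"))   # served by node1
pair.active.healthy = False       # simulate a failure of the active instance
print(pair.handle("request-2"))   # served by node2 after failover
```

In a real deployment the health check and role switch would be performed by an operator or by clusterware rather than inline with request handling.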

Active-active solutions deploy two or more active application server instances at all times. All instances handle requests concurrently.

Figure 8-1 illustrates active-active and active-passive system deployments.

Figure 8-1 Active-active and Active-passive High Availability Solutions

In addition to architectural redundancies, other local high availability features include:

Process death detection and automatic restart: Processes may die unexpectedly due to configuration or software problems. A proper process monitoring and restart system should monitor all system processes constantly and restart them if there are problems.
Clustering: Clustering components of a system together allows the components to be viewed functionally as a single entity from the client perspective. A cluster is a set of processes running on one or more computers that share the same workload. A cluster contains one or more runtime instances of an Oracle Application Server with all of the cluster-wide configuration parameters set to the same values. A cluster provides redundancy for one or more applications.
Configuration management: A clustered group of similar components often needs to share common configurations. Proper configuration management enables the components to synchronize their configurations and also provides highly available configuration management with less administrative downtime.
State replication and routing: For stateful client requests, client state can be replicated to enable stateful failover of requests in the event that processes serving these requests fail.
Server load balancing and failover: When multiple instances of identical server components are available, client requests to these components can be load balanced to ensure that the instances have roughly the same workload. With a load balancing mechanism in place, the instances are redundant. If any of the instances fail, requests to the failed instance can be sent to the surviving instances.
Connection failure management: Clients often connect to services on the server and reuse these connections. When a process implementing one of these services on the server is restarted, the connection may need to be re-established. Correct re-connection management ensures that clients have uninterrupted service.
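As an illustration of the server load balancing and failover feature listed above, the following minimal Python sketch distributes requests round-robin across identical instances and skips any instance that has been marked as failed. The `RoundRobinBalancer` class and the instance names are hypothetical; they are not an Oracle Application Server API, and real load balancers typically use richer health checks and routing policies.

```python
class RoundRobinBalancer:
    """Toy round-robin load balancer that routes around failed instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.failed = set()
        self._next = 0  # index of the next instance to try

    def mark_failed(self, name):
        self.failed.add(name)

    def mark_restarted(self, name):
        self.failed.discard(name)

    def route(self):
        """Return the name of the next surviving instance."""
        for _ in range(len(self.instances)):
            name = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if name not in self.failed:
                return name
        raise RuntimeError("no surviving instances")


balancer = RoundRobinBalancer(["inst1", "inst2", "inst3"])
print([balancer.route() for _ in range(3)])  # each instance gets one request
balancer.mark_failed("inst2")                # simulate a failed instance
print([balancer.route() for _ in range(4)])  # only inst1 and inst3 are used
```

Calling `mark_restarted` models the automatic restart case: once a process is back, it rejoins the rotation.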

8.2.2 Backup and Recovery Solutions

Backup and recovery refers to the various strategies and procedures involved in guarding against hardware failures and data loss, and reconstructing data should a loss occur. Not every failure involves the catastrophic loss of an entire production environment. Regardless of the type of failure, however, once a failure has occurred in your system it is important to restore the failed component or process as quickly as possible.

User errors may cause a system to malfunction. In certain circumstances, a component or system failure may not be repairable. A backup and recovery facility should be available to back up the system at certain intervals and restore a backup when an irreparable failure occurs.

8.2.3 Disaster Recovery Solutions

Disaster recovery solutions are usually geographically distributed deployments that protect your applications from disasters such as floods or regional network outages. Disaster recovery solutions typically set up two homogeneous sites, one active and one passive. Each site is a self-contained system. The active site is generally called the production site, and the passive site is called the standby site.

During normal operation, the production site services requests. In the event of a site failover or switchover, the standby site takes over the production role, and all requests are routed to that site.

To maintain the standby site for failover, not only must the standby site contain homogeneous installations and applications, but data and configurations must also be synchronized constantly from the production site to the standby site.
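The relationship between synchronization and switchover can be modeled with a small Python sketch. The `Site` and `DisasterRecoveryPair` classes, the site names, and the example `port` setting are all hypothetical and are used only to illustrate the idea; they do not represent how Oracle Application Server implements site synchronization.

```python
import copy


class Site:
    """Toy model of a self-contained site with its own configuration."""

    def __init__(self, name):
        self.name = name
        self.config = {}


class DisasterRecoveryPair:
    def __init__(self, production, standby):
        self.production = production
        self.standby = standby

    def synchronize(self):
        """Copy the production configuration to the standby site."""
        self.standby.config = copy.deepcopy(self.production.config)

    def in_sync(self):
        return self.standby.config == self.production.config

    def switchover(self):
        """Swap roles so that the standby site takes over production."""
        self.production, self.standby = self.standby, self.production
        return self.production


pair = DisasterRecoveryPair(Site("site-a"), Site("site-b"))
pair.production.config["port"] = 7777  # hypothetical configuration change
print(pair.in_sync())                   # False until the sites are synchronized
pair.synchronize()
print(pair.in_sync())                   # True
new_production = pair.switchover()
print(new_production.name, new_production.config)
```

In this model, any change made on the production site after the last call to `synchronize` would be missing from the standby site at switchover, which is why synchronization needs to happen constantly.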

Figure 8-2 illustrates a geographically distributed disaster recovery solution.

Figure 8-2 Geographically Distributed Disaster Recovery

8.2.4 Oracle Application Server High Availability

Oracle Application Server provides local high availability, backup and recovery, and disaster recovery solutions for maximum protection against any kind of failure, with flexible installation, deployment, and security options.

8.2.4.1 Oracle Application Server Local High Availability Solutions

Oracle Application Server provides local high availability through several active-active and active-passive solutions for the Oracle Application Server middle tier and the Oracle Application Server Infrastructure. With both active-active and active-passive high availability solutions, there are options that differ in ease of installation, cost, scalability, and security.

8.2.4.2 Oracle Application Server Backup and Recovery Solutions

Some failures require more involved recovery scenarios than simply restarting processes. In some cases, you will have to perform restoration operations based on backup procedures you have previously implemented.

8.2.4.2.1 Complete Backup and Restore

A complete Oracle Application Server environment backup includes:

A full backup of all files in the middle-tier Oracle homes (including Oracle software files and configuration files)
A full backup of all files in the Infrastructure Oracle home (including Oracle software files and configuration files)
A complete cold backup of the OracleAS Metadata Repository
A full backup of the Oracle system files on each host in your environment

Failures that require the complete backup and restore solution for recovery include node failure where the node needs to be completely replaced, and the deletion or corruption of Oracle software or binary files. These failures also require a manual restart of all processes. For details about specific failure types and how to recover, see the Oracle Application Server Administrator's Guide.

8.2.4.2.2 Online Backup and Restore

Depending on the type of failure your system is experiencing, you may need to restore your system from an online backup. An online backup includes:

An incremental backup of the configuration files in the middle tier Oracle homes
An incremental backup of the configuration files in the Infrastructure Oracle home
An online backup of the OracleAS Metadata Repository

Failures that require online backup and restore solutions for recovery include data failure in the metadata repository and deletion or corruption of Oracle Application Server component runtime configuration files. These failures also require one or more processes to be restarted. For details about specific failure types and how to recover, see the Oracle Application Server Administrator's Guide.

8.2.4.3 Oracle Application Server Disaster Recovery Solutions

The Oracle Application Server disaster recovery solution is built on top of the local high availability solutions. This solution requires homogeneous production and standby sites that mirror each other in Oracle Application Server and platform configurations. These configurations must be synchronized regularly to maintain the homogeneity.

Each of the high availability solutions you can implement with Oracle Application Server is described in detail in the Oracle Application Server High Availability Guide, along with instructions on how to configure, operate, and manage each one.