High Availability

Introduction

The subject of High Availability covers a range of features and options that can help to minimize planned and unplanned downtime, or facilitate recovery after a period of downtime. They include:

Patching Hints and Tips
Maintenance Mode
Shared Application Tier File System
Distributed AD
Staged Applications System
Nologging Operations
Disaster Recovery

This section provides a high-level guide to the key features that can help make an Oracle E-Business Suite system highly available, with the emphasis on guidelines for making the correct decisions when planning a new installation or upgrade.

Patching Hints and Tips

Patch application is a key activity undertaken by Oracle E-Business Suite DBAs. If you need to apply a large number of patches, the required downtime can be significant. However, there are several simple ways of minimizing this downtime. These strategies include:

Keep AD up-to-date - Running at the latest AD mini-pack level allows you to take full advantage of new features designed to reduce downtime and simplify maintenance.
Use a shared application tier file system - By default, Release 12 will configure multiple application tier nodes to use a shared application tier file system.
Consolidate multiple patches with AD Merge Patch - Merging multiple Oracle E-Business Suite patches into a single patch not only reduces overall downtime by eliminating duplicate tasks, but minimizes the scope for error that would arise in applying a number of separate patches.
Use Distributed AD - The Distributed AD feature allows workers of the same AD session to be started across multiple application tier nodes, making full use of available hardware resources and thereby minimizing patching downtime.
Employ a staged Applications system - A staged Applications system is an exact copy of a production system, and is used as a patching intermediary. Patches are applied to the staged system while the production system remains in use, reducing the time the production system is unavailable.
Keep your test system current with your production system - When you test the application of a patch, the test must be realistic in terms of current patch level and transaction data: you can employ either Oracle Applications Manager or the Rapid Clone tool to create a copy of your production system for tests.
Perform maintenance during normal operation where possible - For example, you can gather schema statistics or patch online help while the system is in use.
Schedule periodic downtime for application of the latest software updates - The more up-to-date your system, the less likely you are to experience known problems, and the easier it will be to resolve any new issues that may arise.

Where applicable, these strategies are described further below.

Note: For full details of carrying out patching and maintenance operations, see Oracle E-Business Suite Maintenance Procedures, Oracle E-Business Suite Maintenance Utilities, and Oracle E-Business Suite Patching Procedures.

Maintenance Mode

Maintenance Mode is a mode of operation in which the Oracle E-Business Suite system is made accessible only for patching activities. This provides optimal performance for AutoPatch sessions, and minimizes the downtime needed.

Note: Maintenance Mode is only needed for AutoPatch sessions. Other AD utilities do not require Maintenance Mode to be enabled.

Administrators can schedule system downtime using Oracle Applications Manager, and send alert messages to users about the impending downtime. When Maintenance Mode is entered, users attempting to log on to Oracle E-Business Suite are redirected to a system downtime URL.

There are several practical points relating to the use of Maintenance Mode:

You can toggle Maintenance Mode between Enabled and Disabled using the new Change Maintenance Mode menu in AD Administration, or the equivalent function in Oracle Applications Manager.
Although you can run AutoPatch with Maintenance Mode disabled, there will be a significant degradation in performance.
There is a separate logon page for Restricted Mode access while the system is in Maintenance Mode. Restricted Mode allows administrators access to specific privileged functionality, for example to view the timing report that shows the progress of a patching session.

Shared Application Tier File System

A traditional multi-node installation of Oracle E-Business Suite required each application tier node to maintain its own file system. Installation and migration options were subsequently introduced to enable a single APPL_TOP to be shared between all the application tier nodes of a multi-node system. This was referred to as a Shared APPL_TOP File System, usually abbreviated to Shared APPL_TOP.

A further capability that was introduced was the option to merge the APPL_TOPs of multiple nodes, each with its own set of application tier services, to give a single APPL_TOP that could then be shared between them all.

These concepts were subsequently extended to enable sharing of the application tier technology stack file system as well, the result being known as a Shared Application Tier File System.

This section describes the benefits of using a shared application tier file system in an Oracle E-Business Suite Release 12 environment. Current restrictions are also noted where applicable.

Shared Application Tier File System Features

In a shared application tier file system, all application tier files are installed on a single shared disk resource that is mounted on each application tier node. Any application tier node can be configured to perform any of the standard application tier services, such as serving forms or web pages, and all changes made to the shared file system are immediately visible on all the application tier nodes.

Benefits of using a shared application tier file system include:

Overall disk space requirements are greatly reduced, as there is only a single copy of the relevant Oracle E-Business Suite code.
Since there is only one physical application tier file system, administrative tasks need only be carried out once, on any node, and take effect immediately on all nodes.

Current restrictions on using a shared application tier file system include:

An application tier file system can only be shared across machines running either identical or binary compatible operating systems.
Sharing file systems between internal and external application tiers is not supported. This is true even for external application tiers that have reverse proxies in the DMZ.
Shared application tier file system functionality is not currently available on Windows.

Shared Disk Resources

A shared application tier file system can reside on any standard type of shared disk resource, such as a remote NFS-mounted disk or part of a RAID array. However, you should ensure that performance of the chosen disk resource is adequate to meet peak demand. For example, NFS-mounted disks may give inadequate read or write performance when there is a large amount of network traffic, and RAID arrays must be implemented carefully to strike the appropriate balance between high availability, performance and cost.

Creating a Shared Application Tier File System

By default, the Release 12 Rapid Install will configure a multi-node application tier environment to use a shared application tier file system.

Note: For further details of using a shared application tier file system, see My Oracle Support Knowledge Document 384248.1, Sharing the Application Tier File System in Oracle E-Business Suite Release 12.

High Availability Features of Shared Application Tier File System

Utilizing a shared application tier file system improves high availability in the following ways:

It is straightforward to add nodes to an existing installation, to provide greater resilience to node failure or to cater for additional users. This is particularly cost-effective with inexpensive Linux nodes.
A patch only needs to be applied to one application tier node for its effects to be visible on all other nodes that share the file system. Such a single installation also helps to minimize the duration of planned maintenance downtimes, and reduces the scope for errors during installation.

Distributed AD

Many deployments utilize large database servers and multiple, smaller application (middle) tier systems. With the increasing deployment of low cost Linux-based systems, this configuration is becoming more common.

AD has always utilized a job system, where multiple workers are assigned jobs. Information for the job system is stored in the database, and workers receive their assignments based on the contents of the relevant tables. The Distributed AD feature offers improved scalability, performance, and resource utilization, by allowing workers of the same AD session to be started on multiple application tier nodes, utilizing available resources to complete their assigned jobs more efficiently.

Requirements for Distributed AD

Because the AD workers create and update file system objects as well as database objects, a shared application tier file system (shared APPL_TOP in earlier releases) must be employed to ensure the files are created in a single, centralized location.

Using Distributed AD

On one of your shared application tier nodes, you start your AutoPatch or AD Administration session, specifying the number of local workers and the total number of workers.

While using AutoPatch or AD Administration, you can start a normal AD Controller session from any of the nodes in the shared APPL_TOP environment to perform any standard AD Controller operations, using both local and non-local workers. This is possible because the job system can be invoked multiple times during AutoPatch and AD Administration runs. Each time an individual invocation of the job system completes, distributed AD Controller sessions will wait until either the job system is invoked again (at which point it will once again start the local workers) or until the AD utility session ends (at which point distributed AD Controller will exit).
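The split between local and non-local workers can be pictured with a small model. The following Python sketch is illustrative only and is not part of AD: the function name, the node names, and the assumption that the local workers take the lowest worker numbers while the remaining workers are divided into contiguous ranges across the other nodes are all hypothetical.

```python
def allocate_workers(total_workers, local_workers, local_node, remote_nodes):
    """Model how worker numbers 1..total_workers might be divided between nodes.

    Assumption (hypothetical): the node that starts the AD session runs
    workers 1..local_workers, and the remaining workers are shared out in
    contiguous ranges among the other nodes of the shared file system.
    """
    if not 0 < local_workers <= total_workers:
        raise ValueError("local_workers must be between 1 and total_workers")

    plan = {local_node: list(range(1, local_workers + 1))}
    remaining = list(range(local_workers + 1, total_workers + 1))

    if not remote_nodes:
        if remaining:
            raise ValueError("non-local workers need at least one other node")
        return plan

    # Give each remote node an equal share; earlier nodes absorb any remainder.
    share, extra = divmod(len(remaining), len(remote_nodes))
    start = 0
    for index, node in enumerate(remote_nodes):
        size = share + (1 if index < extra else 0)
        plan[node] = remaining[start:start + size]
        start += size
    return plan


if __name__ == "__main__":
    # Eight workers in total, three of them on the node that starts the session.
    for node, workers in allocate_workers(8, 3, "node1", ["node2", "node3"]).items():
        print(node, workers)
```

In this model, a session with eight workers and three local workers leaves workers 4 to 8 to be started from AD Controller sessions on the other nodes.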

Note: See Oracle E-Business Suite Maintenance Utilities for further details of Distributed AD and AD Controller.

AD Controller Log Files

AD Controller creates its log file wherever the AD Controller session is started. This is to prevent file locking issues on certain platforms. It is therefore recommended that the name of the AD Controller log file should include the name of the node from which the AD Controller session is invoked.
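As a minimal sketch of this recommendation, the Python helper below builds a log file name that embeds the node name. The helper, the "adctrl" prefix, and the naming pattern are assumptions made for illustration; AD Controller does not mandate any particular pattern.

```python
import socket


def adctrl_log_name(node_name=None, prefix="adctrl"):
    """Return a log file name that embeds the (short) node name.

    The prefix and the "<prefix>_<node>.log" pattern are hypothetical;
    any convention that makes the originating node obvious would do.
    """
    node = node_name or socket.gethostname()
    short_name = node.split(".")[0]  # drop any domain suffix
    return f"{prefix}_{short_name}.log"


if __name__ == "__main__":
    # Uses the local host name when no node name is supplied.
    print(adctrl_log_name())
```

The resulting name can then be supplied when AD Controller asks for its log file, so that logs from sessions on different nodes are easy to tell apart.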

Staged Applications System

A staged Applications system represents an exact copy (clone) of your production system, including all APPL_TOPs as well as a copy of the production database. You can apply patches to a staged system while the production system remains in operation. Then you connect the staged system to the production system, update the database, and synchronize the APPL_TOPs. The downtime for the production system begins only after all patches have been successfully applied to the staged system, and you have tested the newly patched environment.

Important: A staged Applications system must duplicate the topology of your production system. For example, each physical APPL_TOP of your production system must exist in the staged system.

After the patches are applied to the staged system, and the production system is updated, you must export applied patches information from the staged system and import it to the production system. This ensures that the OAM patch history database in the production system is up-to-date and that you can continue to use patch-related features.

Note: For more information, see My Oracle Support Knowledge Document 734025.1, Using a Staged Applications System to Reduce Patching Downtime in Oracle Applications Release 12.

Nologging Operations

The nologging Oracle database feature is used to enhance performance in certain areas of Oracle E-Business Suite. For example, it may be used during patch installation, and when building summary data for Business Intelligence.

Use of nologging in an operation means that the database redo logs will contain incomplete information about the changes made, with any data blocks that have been updated during the nologging operation being marked as invalid. As a result, a database restoration to a point in time (whether from a hot backup or a cold backup) may require additional steps in order to bring the affected data blocks up-to-date, and make the restored database usable. These additional steps may involve taking new backups of the associated datafiles, or dropping and rebuilding the affected objects. The same applies to activation of a standby database.

Note: Oracle Database 11g also allows logging to be forced to take place, ensuring all data changes are written to the database redo logs in a way that can be recreated in a restored backup, or propagated to a standby database. See Oracle Data Guard Concepts and Administration 11g for details of the force logging clause for database and tablespace commands.

Nologging Principles

At certain times, Oracle E-Business Suite uses the database nologging feature to perform resource-intensive work more efficiently. When an operation uses nologging, blocks of data are written directly to their data file, rather than going through the buffer cache in the System Global Area (SGA).

Instance recovery uses the online redo logs to reconstruct the SGA after a crash, rolling forward through any committed changes in order to ensure the data blocks are valid. Use of nologging does not affect instance recovery.

Database recovery requires rolling forward through the redo logs to recreate the requisite changes, and hence restore the database to the desired point in time. Since nologging operations write directly to the data files, bypassing the redo logs, the redo logs will not contain enough data to roll forward during media recovery. Instead, they will only contain enough information to mark the new blocks as invalid. Rolling forward through a nologging operation would therefore result in invalid blocks in the restored database. The same problems will potentially occur upon activating a standby database.

To make the restored backup or activated standby database usable after a nologging operation is carried out, a mechanism other than database recovery must be used to get or create current copies of the affected blocks.

There are two options, either of which may be appropriate depending on the specific circumstances:

Create a new copy of the data files after the nologging operation is complete, either by backing up the tablespace again, or by refreshing the specific data files in the standby database.
Drop and recreate the object with the invalidated blocks, using the program that maintains the object.

Nologging Usage

Nologging is used in the following situations in the Oracle E-Business Suite:

Building new objects during patch application, where use of nologging makes the initial build faster, and the downtime required for patching shorter.
Changing the physical structure of existing objects during patch application (such as partitioning a table), where use of nologging reduces the time needed for the operation itself, and consequently the overall downtime.
Certain specialized tasks where logging is not required, such as manipulating data for data warehousing applications, or maintaining summary data for business intelligence queries.
Certain concurrent manager jobs. In most such cases, the object affected by nologging will be dropped at the end of the job, and the invalidated blocks cleaned up. If a recovery is needed while concurrent jobs are in progress, re-running the affected jobs will clean up any invalidated blocks that may exist.

Actions Needed

To monitor nologging activity in your environment, you should periodically query your production database to identify any datafiles that have experienced nologging operations. You should also run the query before and after applying an Oracle E-Business Suite patch, to determine whether any nologging activity was carried out.

A suitable query can be run via monitoring software such as Oracle Enterprise Manager. Alternatively, you can construct a query based on the unrecoverable_change# and unrecoverable_time columns of the data dictionary view v$datafile. These are updated every time an unrecoverable or nologging operation marks blocks as invalid in the datafile.

The results of a query can be saved as a snapshot and compared to the last snapshot. You can then identify each occasion when nologging operations have been carried out in the database, and hence when you need to refresh backup datafiles with new copies that will be usable in the event of restoration being needed.
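The snapshot comparison described above can be sketched in Python as follows. This is an illustration under stated assumptions, not an Oracle-supplied script: the query text is one possible form built from the v$datafile columns named above, the function name is hypothetical, and each snapshot is assumed to have already been loaded into a dictionary that maps a datafile name to its unrecoverable_change# value.

```python
# One possible snapshot query (an assumption; adapt it to your own standards).
SNAPSHOT_QUERY = """
SELECT name, unrecoverable_change#, unrecoverable_time
FROM   v$datafile
"""


def datafiles_needing_backup(previous, current):
    """Return datafiles whose unrecoverable_change# rose since the last snapshot.

    Both arguments map a datafile name to its unrecoverable_change# value.
    A datafile that is missing from the previous snapshot is treated as
    having started at 0, so it is reported only if nologging work has
    already touched it.
    """
    changed = []
    for name, change_number in sorted(current.items()):
        if change_number > previous.get(name, 0):
            changed.append(name)
    return changed


if __name__ == "__main__":
    before_patch = {"system01.dbf": 0, "apps_ts_tx_data01.dbf": 100}
    after_patch = {"system01.dbf": 0, "apps_ts_tx_data01.dbf": 250,
                   "apps_ts_summary01.dbf": 40}
    print(datafiles_needing_backup(before_patch, after_patch))
```

Running the comparison before and after a patch gives the list of datafiles whose backup or standby copies should be refreshed.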

Disaster Recovery

A significant problem that strikes an Oracle E-Business Suite installation could put the viability of the organization at risk. Such a problem could be:

An external disaster, such as a fire at a company's data center, resulting in a loss of service that severely hampers the organization's ability to do business.
An internal disaster, such as a serious error by a privileged user, resulting in major loss or corruption of data.
A hardware or system failure or malfunction, such as a media failure, or an operating system problem that corrupts the actual data in the database.

This section gives an overview of the area of disaster recovery, which can be considered as the final component of a high availability strategy. Disaster recovery involves taking steps to protect the database and its environment to ensure that they can still operate in the face of major problems. Oracle provides features such as Oracle Data Guard and Flashback Database.

Data Guard is used to set up and maintain a secondary copy of a database, typically referred to as a standby database. Such a standby database is brought into use after a failover from the primary database when the primary becomes unavailable following a significant problem, or via a switchover operation that is executed to allow service to continue during planned maintenance of the environment’s platform or building services.
Flashback Database is used to “rewind” a database to a prior point in time, making it possible to recover from major logical corruptions of a database without requiring a complete restore.

You must also install any other hardware and software required to run your standby environment as a production environment after a failover, ensuring that any changes on the primary are matched on the standby. Examples include tape backup equipment and software, system management and monitoring software, and other applications.

Data Guard and Release 12

Oracle Data Guard provides mechanisms for propagating changes from one database to another, to avoid possible loss of data if one site fails. The two main variants of a Data Guard configuration are Redo Apply (often referred to as Physical Standby) and SQL Apply (often referred to as Logical Standby). Both of these use the primary database’s redo information to propagate changes to the standby database.

Physical standby uses the normal database recovery mechanism to apply the primary database’s redo to the standby database, resulting in an identical copy of the production database.
Logical standby employs the Oracle LogMiner utility to build SQL statements that recreate changes made to the data. The logical standby mechanism is not currently utilized with Oracle E-Business Suite.

The secondary environment should be physically separate from the primary environment, to protect against disasters that affect the entire primary site. This necessitates having a reliable network connection between the two data centers, with sufficient bandwidth (capacity) for peak redo traffic. The other requirement is that the servers at the secondary site are the same type as at the primary site, in sufficient numbers to provide the required level of service; depending on your organization’s needs, this could either be a minimal level of service (supporting fewer users), or exactly the same level of service as you normally provide.

Data Guard’s reliance on redo generated from the production database has significant implications for operations in which Oracle E-Business Suite uses the nologging feature (described previously) to perform some resource-intensive tasks with faster throughput. Oracle recommends turning on the force logging feature at the database level to simplify your backup and recovery, and standby database maintenance procedures. In cases where the nologging feature is used in Release 12, and you have chosen not to use force logging, insufficient redo information will be generated to make the corresponding changes on the standby database. You may then be required to take manual steps to refresh the standby (or recreate the relevant objects) to ensure it will remain usable.

Finally, based on your organization’s business requirements, choose one of the following protection modes:

Maximum protection: This protection mode ensures that no data loss will occur if the primary database fails. To provide this level of protection, the redo data needed to recover each transaction must be written to both the local online redo log and to the standby redo log on at least one standby database before the transaction commits. To ensure data loss cannot occur, the primary database shuts down if a fault prevents it from writing its redo stream to the standby redo log of at least one transactionally-consistent standby database.
Maximum availability: This protection mode provides the highest level of data protection that is possible without compromising the availability of the primary database. Like maximum protection mode, a transaction will not commit until the redo needed to recover that transaction is written to the local online redo log, and to the standby redo log of at least one transactionally-consistent standby database. However, unlike maximum protection mode, the primary database does not shut down if a fault prevents it from writing its redo stream to a remote standby redo log. Instead, the primary database switches to maximum performance mode until the fault is corrected, and all gaps in redo log files are resolved. When all gaps have been resolved, the primary database automatically resumes operating in maximum availability mode. This strategy ensures that no data loss will occur if the primary database fails, unless a second fault prevents a complete set of redo data from being sent from the primary database to at least one standby database.
Maximum performance: This protection mode (the default) provides the highest level of data protection that is possible without affecting the performance of the primary database. This is accomplished by allowing a transaction to commit as soon as the redo data needed to recover that transaction is written to the local online redo log. The primary database's redo data stream is also written to at least one standby database, but that redo stream is written asynchronously with respect to the transactions that create the redo data. When network links with sufficient bandwidth are employed, this mode provides a level of data protection that approaches that of maximum availability mode, with minimal impact on primary database performance.
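The difference between the three modes comes down to what the primary database does at commit time, depending on whether a standby redo log can be written. The Python sketch below models only the behavior described in the list above; the enum and function names are hypothetical and do not correspond to any Oracle API.

```python
from enum import Enum


class ProtectionMode(Enum):
    MAXIMUM_PROTECTION = "maximum protection"
    MAXIMUM_AVAILABILITY = "maximum availability"
    MAXIMUM_PERFORMANCE = "maximum performance"


class CommitOutcome(Enum):
    # Commit waits for both the local online redo log and a standby redo log.
    COMMIT_AFTER_LOCAL_AND_STANDBY = "commit after local and standby redo writes"
    # Commit waits for the local redo log only; redo is shipped asynchronously.
    COMMIT_AFTER_LOCAL_ONLY = "commit after local redo write"
    # The primary shuts down rather than risk losing data.
    PRIMARY_SHUTS_DOWN = "primary database shuts down"


def commit_outcome(mode, standby_writable):
    """Model the commit-time behavior of each protection mode."""
    if mode is ProtectionMode.MAXIMUM_PERFORMANCE:
        # Never waits for the standby, whether or not it is reachable.
        return CommitOutcome.COMMIT_AFTER_LOCAL_ONLY
    if standby_writable:
        return CommitOutcome.COMMIT_AFTER_LOCAL_AND_STANDBY
    if mode is ProtectionMode.MAXIMUM_PROTECTION:
        return CommitOutcome.PRIMARY_SHUTS_DOWN
    # Maximum availability: behaves like maximum performance until the
    # fault is corrected and all redo gaps are resolved.
    return CommitOutcome.COMMIT_AFTER_LOCAL_ONLY


if __name__ == "__main__":
    for mode in ProtectionMode:
        for writable in (True, False):
            print(mode.value, writable, commit_outcome(mode, writable).value)
```

The model makes the trade-off explicit: only maximum protection gives up availability of the primary to guarantee zero data loss, while maximum availability gives up the guarantee temporarily to keep the primary running.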

Flashback Database

Oracle recommends you enable the Flashback Database feature, to:

Help protect against logical data corruption
Allow you to reinstantiate the production database as a standby after a failover to your secondary site
Create database restore points to which you can flash back in case an upgrade or major application change encounters a serious problem

Flashback Database enables you to rewind the database to a previous point in time without restoring backup copies of the data files. This is accomplished during normal operation by Flashback Database buffering and writing before images of data blocks into the flashback logs, which reside in the flash recovery area.

Flashback Database can also flash back a primary or standby database to a point in time prior to a Data Guard role transition. In addition, a Flashback Database operation can be performed to a point in time prior to a resetlogs operation, which allows administrators more flexibility to detect and correct human errors.