1 Introduction

A variety of events can lead to the failure of a server instance. Often one failure condition leads to another. Loss of power, hardware malfunction, operating system crashes, network partitions, or unexpected application behavior may each contribute to the failure of a server instance.

Oracle Communications Services Gatekeeper uses a clustered architecture as the basis for minimizing the impact of failure events. However, even in a clustered environment it is important to prepare for a sound recovery process in the event that an individual server or server machine fails.

This chapter summarizes Services Gatekeeper failure prevention and recovery features and describes the configuration artifacts that are required to restore different portions of a Services Gatekeeper domain. The remaining sections in this guide describe how to back up Services Gatekeeper configuration artifacts and how to use those artifacts to restore the system in the event of a server failure.

Failure Prevention and Automatic Recovery Features

Services Gatekeeper, and the underlying Oracle Fusion Middleware WebLogic Server platform, provide many features that protect against server failures. In a production system, all available features should be used to ensure uninterrupted service.

Overload Alarms

Services Gatekeeper's underlying Oracle Fusion Middleware WebLogic Server platform detects increases in system load that can affect deployed application performance and stability. Oracle Fusion Middleware WebLogic Server also allows administrators to configure failure prevention actions that occur automatically at predefined load thresholds. Automatic overload protection helps avoid failures that result from unanticipated levels of application traffic or resource utilization as indicated by:

A workload manager's capacity being exceeded
The HTTP session count increasing to a predefined threshold value
Impending out-of-memory conditions

Each backward-compatible communication service in Services Gatekeeper uses a pair of attributes, OverloadPercentage and SevereOverloadPercentage, that define the amount of load on the software module required to trigger an overload alarm. Always monitor for these alarms and perform system throttling as necessary to avoid failures that could result from unanticipated levels of application traffic or resource utilization.
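
The two-threshold scheme behind the OverloadPercentage and SevereOverloadPercentage attributes can be illustrated with a short Python sketch. This is not Services Gatekeeper code: the function name, the AlarmLevel type, the sample threshold values, and the choice of an inclusive comparison are all assumptions made for illustration only.

```python
from enum import Enum


class AlarmLevel(Enum):
    """Hypothetical alarm levels; the names are illustrative only."""
    NONE = "none"
    OVERLOAD = "overload"
    SEVERE_OVERLOAD = "severe overload"


def classify_load(load_percent: float,
                  overload_percentage: float,
                  severe_overload_percentage: float) -> AlarmLevel:
    """Map a measured load (0-100) onto an alarm level.

    The two threshold arguments mirror the OverloadPercentage and
    SevereOverloadPercentage attributes. How the load itself is measured
    is outside the scope of this sketch.
    """
    if overload_percentage > severe_overload_percentage:
        raise ValueError("overload threshold must not exceed the severe threshold")
    # Assumption: a load equal to a threshold already triggers the alarm.
    if load_percent >= severe_overload_percentage:
        return AlarmLevel.SEVERE_OVERLOAD
    if load_percent >= overload_percentage:
        return AlarmLevel.OVERLOAD
    return AlarmLevel.NONE


# Example with assumed thresholds of 80 and 95 percent.
for load in (50, 85, 99):
    print(load, classify_load(load, 80, 95).value)
```

A monitoring script built along these lines might start throttling traffic at the first level and alert an operator at the second.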

For more information on avoiding and managing performance problems, see "Avoiding and Managing Overload" in Oracle Fusion Middleware Configuring Server Environments for Oracle WebLogic Server 11g at:

http://download.oracle.com/docs/cd/E15523_01/web.1111/e13701/overload.htm

Redundancy and Failover for Clustered Services

Using multiple Access tier and Network tier servers in dedicated clusters increases the reliability and availability of applications.

Access tier clusters maintain no stateful information about applications, so the failure of a server does not result in any data loss.

Services Gatekeeper also performs automated failover for servers within the Network tier. Production installations must use the tiered configuration to protect against individual server failures. See "Redundancy, Load Balancing, and High Availability" in Oracle Communications Services Gatekeeper Concepts Guide for more information.

Automatic Restart for Managed Servers

Oracle Fusion Middleware WebLogic Server self-health monitoring features improve the reliability and availability of server instances in a domain. Selected subsystems within each server instance monitor their health status based on criteria specific to the subsystem. (For example, the JMS subsystem monitors the condition of the JMS thread pool while the core server subsystem monitors default and user-defined execute queue statistics.) If an individual subsystem determines that it can no longer operate in a consistent and reliable manner, it registers its health state as FAILED with the host server.

Each Oracle Fusion Middleware WebLogic Server instance, in turn, checks the health state of its registered subsystems to determine its overall viability. If one or more of its critical subsystems have reached the FAILED state, the server instance marks its own health state FAILED to indicate that it cannot reliably host an application.
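
The roll-up from subsystem health to server health described above can be sketched in a few lines of Python. This is a simplified model, not the WebLogic Server implementation: the function name, the subsystem names, and the "OK" placeholder state are assumptions. Only the FAILED state comes from the description above.

```python
from typing import Mapping, Set

OK = "OK"          # placeholder for any non-failed state (assumption)
FAILED = "FAILED"  # state named in the text above


def server_health(subsystem_states: Mapping[str, str],
                  critical_subsystems: Set[str]) -> str:
    """Return FAILED if any critical subsystem reports FAILED, else OK.

    Subsystems that are not listed as critical do not affect the result.
    """
    for name in critical_subsystems:
        if subsystem_states.get(name) == FAILED:
            return FAILED
    return OK


# Hypothetical subsystem names, used for illustration only.
states = {"JMS": FAILED, "core": OK, "reporting": OK}
print(server_health(states, {"JMS", "core"}))   # JMS is critical here
print(server_health(states, {"core"}))          # JMS is not critical here
```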

When used in combination with Node Manager, server self-health monitoring allows failed servers to be restarted automatically. This improves the overall reliability of a domain and requires no direct intervention from an administrator. For more information on how to set up Node Manager and use it to control servers, see Oracle Fusion Middleware Node Manager Administrator's Guide for Oracle WebLogic Server at:

http://download.oracle.com/docs/cd/E15523_01/web.1111/e13740/toc.htm

Managed Server Independence Mode

Managed Servers maintain a local copy of the domain configuration. When a Managed Server starts, it tries to contact the Administration Server to retrieve any changes to the domain configuration that were made since the Managed Server was last shut down. If a Managed Server cannot connect to the Administration Server during startup, it can use its locally cached configuration information, which describes the configuration that was current at the time of the Managed Server's most recent shutdown. A Managed Server that starts up without contacting its Administration Server to check for configuration updates is running in Managed Server Independence (MSI) mode. By default, MSI mode is enabled. For information on replicating domain config files, see "Replicate domain config files for Managed Server independence" in Oracle Fusion Middleware Oracle WebLogic Server Administration Console Online Help at:

http://download.oracle.com/docs/cd/E15523_01/apirefs.1111/e13952/core/index.html
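
The startup decision behind MSI mode can be modeled with the following Python sketch. It is a conceptual illustration under stated assumptions, not WebLogic Server code: the function name, the dictionary-based configuration, the msi_enabled flag, and the use of ConnectionError to represent an unreachable Administration Server are all hypothetical.

```python
from typing import Callable, Dict, Tuple

Config = Dict[str, str]


def load_startup_config(fetch_from_admin: Callable[[], Config],
                        local_cache: Config,
                        msi_enabled: bool = True) -> Tuple[Config, bool]:
    """Return (configuration, running_in_msi_mode).

    fetch_from_admin stands in for contacting the Administration Server
    and is expected to raise ConnectionError when it is unreachable.
    """
    try:
        return fetch_from_admin(), False
    except ConnectionError:
        if not msi_enabled:
            raise  # without MSI mode the server cannot start on its own
        # Fall back to the configuration cached at the last shutdown.
        return dict(local_cache), True


def unreachable_admin() -> Config:
    raise ConnectionError("Administration Server not reachable")


cached = {"listen-port": "8001"}  # hypothetical cached setting
config, msi_mode = load_startup_config(unreachable_admin, cached)
print(config, msi_mode)
```

The sketch also shows why the cached copy matters: in MSI mode the server runs with whatever configuration was current at its most recent shutdown.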

Automatic Migration of Failed Managed Servers

With Linux or UNIX operating systems, you can use Oracle Fusion Middleware WebLogic Server's server migration feature to start a candidate (backup) server automatically if a Network tier server's machine fails or becomes partitioned from the network. The server migration feature uses Node Manager, in conjunction with the wlsifconfig.sh script, to boot candidate servers automatically using a floating IP address. Candidate servers are booted only if the primary server hosting a Network tier instance becomes unreachable. For more information about using the server migration feature, see "High Availability for WebLogic Server" in Oracle Fusion Middleware High Availability Guide at:

http://download.oracle.com/docs/cd/E15523_01/core.1111/e10106/toc.htm

Geographic Redundancy for Regional Site Failures

In addition to server-level redundancy and failover capabilities, Services Gatekeeper enables configuration of peer sites to protect against catastrophic failures, such as power outages, that can affect an entire domain. It is possible to set up failover from one geographical site to another, avoiding complete service outages. See "Geographic Redundancy" in Oracle Communications Services Gatekeeper Concepts Guide for more information.

Overview of Configuration Artifacts

A Services Gatekeeper deployment utilizes two basic categories of configuration information: domain-level configuration, and database configuration. The domain-level configuration consists of the artifacts used by the underlying Oracle Fusion Middleware WebLogic Server platform to configure the behavior of managed servers, clusters, security, and other resources deployed to clusters and servers within the domain. The primary domain-level configuration artifact is the config.xml file, stored in the Domain_Home/config directory. The config.xml file generally references additional configuration files beneath the config directory to configure additional domain resources such as JDBC and JMS.

In addition to the basic domain-level configuration of the Oracle Fusion Middleware WebLogic Server platform, Services Gatekeeper stores some configuration for its core services in database tables. This includes the routing configuration for backward-compatible communication services and PRM integration data. The database tables are shared across the clustered Services Gatekeeper server instances. The database must be backed up at regular intervals to protect against data loss or corruption. An Oracle Real Application Clusters (RAC) deployment is also required for production installations, to provide redundancy and failover for the database configuration.

Both domain-level configuration backups and database backups may be required at different times to restore servers fully or to migrate server instances to new server hardware in a Services Gatekeeper installation.
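
As a rough illustration of capturing the domain-level artifacts, the Python sketch below archives the config directory beneath a domain home, including config.xml and the files it references. It is a minimal example and not a substitute for the backup procedures in the Oracle Fusion Middleware documentation. The function name, the archive naming scheme, and the paths in the usage note are assumptions.

```python
import tarfile
import time
from pathlib import Path


def backup_domain_config(domain_home: str, backup_dir: str) -> Path:
    """Archive Domain_Home/config into a timestamped .tar.gz file.

    Returns the path of the archive that was written.
    """
    config_dir = Path(domain_home) / "config"
    if not (config_dir / "config.xml").is_file():
        raise FileNotFoundError(f"no config.xml under {config_dir}")

    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = target_dir / f"domain-config-{stamp}.tar.gz"

    # Store everything under a top-level "config" folder in the archive so
    # that files referenced by config.xml (JDBC, JMS, ...) are kept together.
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(config_dir), arcname="config")
    return archive
```

A call such as backup_domain_config("/path/to/Domain_Home", "/path/to/backups") would be run on the machine that hosts the domain directory; both paths here are placeholders.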

Database Backup

Backing up the underlying database on a regular basis is an essential part of maintaining a Services Gatekeeper implementation. However, that task is beyond the scope of this documentation. See the Oracle database documentation for details on how to perform database backups.

Summary of Common Backup and Restoration Tasks

Maintaining system integrity requires that you make use of existing high availability and failover features, perform regular backups of configuration artifacts, and understand how to restore server instances or migrate servers onto viable hardware. These common tasks are summarized below.

Back up the Oracle 10g, Oracle 11g, or MySQL database.

See your database backup and recovery documentation for details.
Enable Oracle Fusion Middleware WebLogic Server platform reliability and recovery features.

For a description of how to avoid and manage overload, see "Avoiding and Managing Overload" in Oracle Fusion Middleware Configuring Server Environments for Oracle WebLogic Server at:

http://download.oracle.com/docs/cd/E15523_01/web.1111/e13701/overload.htm

For a description of how to replicate domain configuration files for Managed Server independence, see "Replicate domain config files for Managed Server independence" in Oracle Fusion Middleware Oracle WebLogic Server Administration Console Online Help at:

http://download.oracle.com/docs/cd/E15523_01/apirefs.1111/e13952/taskhelp/domainconfig/ReplicateDomainConfigFilesForManagedServerIndependence.html

For a description of server migration, see "High Availability for WebLogic Server" in Oracle Fusion Middleware High Availability Guide at: http://download.oracle.com/docs/cd/E15523_01/core.1111/e10106/toc.htm
Enable Services Gatekeeper reliability and recovery features.

See "Redundancy, Load Balancing, and High Availability" and "Geographic Redundancy" in Oracle Communications Services Gatekeeper Concepts Guide.
Back up Oracle Fusion Middleware WebLogic Server domain configuration.

See:
Back up the Oracle Fusion Middleware database configuration.

For Oracle Fusion Middleware 11g, see "Backing Up Your Environment" in Oracle Fusion Middleware Administrator's Guide 11g at:

http://download.oracle.com/docs/cd/E15523_01/core.1111/e10105/br_bkp.htm#ASADM376

and "Recovering Your Environment" in Oracle Fusion Middleware Administrator's Guide 11g at:

http://download.oracle.com/docs/cd/E15523_01/core.1111/e10105/br_rec.htm#BCGCBGDJ

For Oracle Fusion Middleware 10g, see the Oracle WebLogic Server 10g documentation at:

http://download.oracle.com/docs/cd/E12840_01/wls/docs103/sitemap.html