Introduction

A variety of events can lead to the failure of a server instance, and one failure condition often leads to another. Loss of power, hardware malfunction, operating system crashes, network partitions, and unexpected application behavior can each contribute to a failure.

WebLogic Network Gatekeeper uses a highly clustered architecture as the basis for minimizing the impact of failure events. However, even in a clustered environment it is important to prepare for a sound recovery process in the event that an individual server or server machine fails.

This chapter summarizes WebLogic Network Gatekeeper failure prevention and recovery features, and describes the configuration artifacts that are required to restore different portions of a WebLogic Network Gatekeeper domain. The remaining sections in this guide describe how to back up WebLogic Network Gatekeeper configuration artifacts, and how to use those artifacts to restore the system in the event of a server failure.

Failure Prevention and Automatic Recovery Features

WebLogic Network Gatekeeper, and the underlying WebLogic Server platform, provide many features that protect against server failures. In a production system, all available features should be used in order to ensure uninterrupted service.

Overload Alarms

Network Gatekeeper’s underlying WebLogic Server platform detects increases in system load that can affect deployed application performance and stability. WebLogic Server also allows administrators to configure failure prevention actions that occur automatically at predefined load thresholds. Automatic overload protection helps you avoid failures that result from unanticipated levels of application traffic or resource utilization as indicated by:

A workload manager’s capacity being exceeded
The HTTP session count increasing to a predefined threshold value
Impending out of memory conditions

Redundancy and Failover for Clustered Services

Using multiple Access tier and Network tier servers in dedicated clusters increases the reliability and availability of your applications. Access tier clusters maintain no stateful information about applications, so the failure of a server does not result in any data loss. Network Gatekeeper also performs automated failover for servers within the Network tier. Any production installation must use the tiered configuration to protect against individual server failures. See Redundancy, Load Balancing, and High Availability in the Architecture Overview for more information.

Automatic Restart for Managed Servers

WebLogic Server self-health monitoring features improve the reliability and availability of server instances in a domain. Selected subsystems within each server instance monitor their health status based on criteria specific to the subsystem. (For example, the JMS subsystem monitors the condition of the JMS thread pool, while the core server subsystem monitors default and user-defined execute queue statistics.) If an individual subsystem determines that it can no longer operate in a consistent and reliable manner, it registers its health state as FAILED with the host server.

Each WebLogic Server instance, in turn, checks the health state of its registered subsystems to determine its overall viability. If one or more of its critical subsystems have reached the FAILED state, the server instance marks its own health state FAILED to indicate that it cannot reliably host an application.
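The aggregation rule described above can be illustrated with a minimal Python sketch. This is a simplified model for illustration only, not WebLogic Server code: the `HealthState` enum, the `server_health` function, and the subsystem names are hypothetical, and the model tracks only the OK and FAILED states.

```python
from enum import Enum


class HealthState(Enum):
    """Simplified health states (hypothetical; only two states are modeled)."""
    OK = "OK"
    FAILED = "FAILED"


def server_health(subsystems, critical):
    """Return FAILED if any critical subsystem reports FAILED, otherwise OK.

    subsystems: dict mapping subsystem name -> HealthState it registered
    critical:   set of subsystem names the server treats as critical
    """
    for name, state in subsystems.items():
        if name in critical and state is HealthState.FAILED:
            return HealthState.FAILED
    return HealthState.OK


# Illustrative subsystem names; a real server registers its own set.
reported = {"JMS": HealthState.OK, "core": HealthState.FAILED}
print(server_health(reported, critical={"JMS", "core"}).value)
```

In this model, a FAILED report from a subsystem that is not in the critical set leaves the overall server state at OK.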

When used in combination with Node Manager, server self-health monitoring enables you to automatically reboot servers that have failed. This improves the overall reliability of a domain, and requires no direct intervention from an administrator. For more information, see Using Node Manager to Control Servers in the WebLogic Server 10 documentation.

Managed Server Independence Mode

Managed Servers maintain a local copy of the domain configuration. When a Managed Server starts, it tries to contact the Administration Server to retrieve any changes to the domain configuration that were made since the Managed Server was last shut down. If a Managed Server cannot connect to the Administration Server during startup, it can use its locally cached configuration information, which is the configuration that was current at the time of the Managed Server’s most recent shutdown. A Managed Server that starts up without contacting its Administration Server to check for configuration updates is running in Managed Server Independence (MSI) mode. By default, MSI mode is enabled. See Replicating domain config files for Managed Server Independence in the WebLogic Server 10 documentation.
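The startup decision described above can be sketched as follows. This is a conceptual model under stated assumptions, not the WebLogic Server implementation: `choose_startup_config`, `fetch_from_admin`, and the dictionary-based configuration are hypothetical names introduced for illustration.

```python
def choose_startup_config(fetch_from_admin, cached_config, msi_enabled=True):
    """Return (config, running_in_msi_mode) for a starting Managed Server.

    fetch_from_admin: callable that returns the current domain configuration,
                      or raises ConnectionError if the Administration Server
                      cannot be reached (hypothetical interface)
    cached_config:    locally cached configuration from the last shutdown,
                      or None if no local copy exists
    msi_enabled:      mirrors the default described in the text (enabled)
    """
    try:
        # Normal case: pick up changes made since the last shutdown.
        return fetch_from_admin(), False
    except ConnectionError:
        if not msi_enabled or cached_config is None:
            raise RuntimeError(
                "Administration Server unreachable and no usable local configuration"
            )
        # MSI mode: start with the configuration cached at the last shutdown.
        return cached_config, True
```

The sketch makes one consequence explicit: a server started in MSI mode runs with the configuration that was current at its last shutdown, so any later changes made on the Administration Server are not applied at startup.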

Automatic Migration of Failed Managed Servers

When using Linux or UNIX operating systems, you can use WebLogic Server’s server migration feature to automatically start a candidate (backup) server if a Network tier server’s machine fails or becomes partitioned from the network. The server migration feature uses Node Manager, in conjunction with the wlsifconfig.sh script, to automatically boot candidate servers using a floating IP address. Candidate servers are booted only if the primary server hosting a Network tier instance becomes unreachable. See Migration in the WebLogic Server 10 documentation for more information about using the server migration feature.

Geographic Redundancy for Regional Site Failures

In addition to server-level redundancy and failover capabilities, WebLogic Network Gatekeeper enables you to configure peer sites to protect against catastrophic failures, such as power outages, that can affect an entire domain. This enables you to fail over from one geographical site to another, avoiding complete service outages. See Geographic Redundancy in the Architecture Overview for more information.

Overview of Configuration Artifacts

A WebLogic Network Gatekeeper deployment utilizes two basic categories of configuration information: domain-level configuration, and database configuration. The domain-level configuration consists of the artifacts used by the underlying WebLogic Server platform to configure the behavior of managed servers, clusters, security, and other resources deployed to clusters and servers within the domain. The primary domain-level configuration artifact is the config.xml file, stored in the domain-home/config directory. The config.xml file generally references additional configuration files beneath the config directory to configure additional domain resources such as JDBC and JMS.
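Because config.xml references further files beneath the config directory, it can help to list those references when deciding what a backup must contain. The following Python sketch does this under a stated assumption: that referenced files appear in elements named `descriptor-file-name`. Verify that element name against your own config.xml. The sample document, its namespace, and the resource names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def referenced_descriptors(config_xml_text):
    """Return the text of every <descriptor-file-name> element, in document order.

    The element name is an assumption; the XML namespace is ignored so the
    function does not depend on a particular schema version.
    """
    root = ET.fromstring(config_xml_text)
    names = []
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop "{namespace}" prefix
        if local_name == "descriptor-file-name" and elem.text:
            names.append(elem.text.strip())
    return names


# Hypothetical sample; a real config.xml contains many more elements.
SAMPLE = """<domain xmlns="http://example.invalid/domain">
  <name>sample_domain</name>
  <jdbc-system-resource>
    <name>sampleDataSource</name>
    <descriptor-file-name>jdbc/sample-jdbc.xml</descriptor-file-name>
  </jdbc-system-resource>
  <jms-system-resource>
    <name>sampleJms</name>
    <descriptor-file-name>jms/sample-jms.xml</descriptor-file-name>
  </jms-system-resource>
</domain>"""

print(referenced_descriptors(SAMPLE))
```

To run it against a real domain, read the contents of domain-home/config/config.xml and pass the text to `referenced_descriptors`.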

In addition to the basic domain-level configuration of the WebLogic Server platform, WebLogic Network Gatekeeper stores some configuration for its core services in the form of database tables. This includes the routing configuration for backward-compatible communication services and PRM integration data. The database tables are shared across clustered WebLogic Network Gatekeeper server instances. The database must be backed up at regular intervals to protect against data loss or corruption. An Oracle RAC deployment is also required for production installations, to provide redundancy and failover for the database configuration.

Both domain-level configuration backups and database backups may be required at different times in order to fully restore servers, or migrate server instances to new server hardware, in a WebLogic Network Gatekeeper installation.
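As a rough illustration of a domain-level configuration backup, the following Python sketch archives the config directory of a domain into a timestamped tar file. It is a minimal example and not the procedure defined by this guide: the `archive_config` function and the archive naming scheme are hypothetical, and the sketch covers only the config directory, not the database or any files stored elsewhere.

```python
import tarfile
import time
from pathlib import Path


def archive_config(domain_home, backup_dir):
    """Archive <domain_home>/config into a timestamped .tar.gz; return its path."""
    config_dir = Path(domain_home) / "config"
    if not config_dir.is_dir():
        raise FileNotFoundError(f"no config directory at {config_dir}")

    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"config-{stamp}.tar.gz"

    # Store entries under "config/..." so the archive unpacks into a
    # directory with the same name as the original.
    with tarfile.open(str(target), "w:gz") as tar:
        tar.add(str(config_dir), arcname="config")
    return target
```

A call such as `archive_config("/path/to/domain-home", "/path/to/backups")` (both paths are placeholders) returns the path of the archive it created. Keeping archives on a different machine from the server they protect is the usual reason for taking them in the first place.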

Common Backup and Restoration Tasks

Maintaining system integrity requires that you make use of existing high availability and failover features, perform regular backups of configuration artifacts, and understand how to restore server instances or migrate servers onto viable hardware. These common tasks are summarized in Table 1-1.

Table 1-1 Common Backup and Restoration Tasks
Task	Links
Enable WebLogic Server platform reliability and recovery features.	Avoiding and Managing Overload (WebLogic Server 10 documentation); Replicating domain config files for Managed Server Independence (WebLogic Server 10 documentation); Using Node Manager to Control Servers (WebLogic Server 10 documentation); Migration (WebLogic Server 10 documentation)
Enable WebLogic Network Gatekeeper reliability and recovery features.	Redundancy, Load Balancing, and High Availability (Concepts and Architectural Overview); Geographic Redundancy (Concepts and Architectural Overview)
Back up WebLogic Server domain configuration.	Enabling Automatic Configuration Backups; Storing the Domain Configuration Offline; Backing Up Domain Security Data; Backing Up Additional Configuration Files
Back up WebLogic Network Gatekeeper database configuration.	Backing Up An Oracle 10g Single Instance Database; Backing Up An Oracle 10g RAC Database; Backing Up a MySQL Database; Restoring the Database from Backup
Restore a failed Access tier or Network tier server instance.	Restarting a Failed Administration Server; Restarting Failed Access and Network Tier Servers; Moving an Access or Network Tier Server to a Different Machine

System Backup and Restoration Guide