A variety of events can lead to the failure of a server instance. Often one failure condition leads to another. Loss of power, hardware malfunction, operating system crashes, network partitions, or unexpected application behavior may each contribute to the failure of a server instance.
Oracle Communications Services Gatekeeper uses a highly clustered architecture as the basis for minimizing the impact of failure events. However, even in a clustered environment it is important to prepare for a sound recovery process in the event that an individual server or server machine fails.
This chapter summarizes Oracle Communications Services Gatekeeper failure prevention and recovery features, and describes the configuration artifacts that are required in order to restore different portions of a Oracle Communications Services Gatekeeper domain. The remaining sections in this guide describe how to back up Oracle Communications Services Gatekeeper configuration artifacts, and how to use those artifacts to restore the system in the event of a server failure.
Oracle Communications Services Gatekeeper, and the underlying WebLogic Server platform, provide many features that protect against server failures. In a production system, all available features should be used in order to ensure uninterrupted service.
Oracle Communications Services Gatekeeper’s underlying WebLogic Server platform detects increases in system load that can affect deployed application performance and stability. WebLogic Server also allows administrators to configure failure prevention actions that occur automatically at predefined load thresholds. Automatic overload protection helps you avoid failures that result from unanticipated levels of application traffic or resource utilization as indicated by:
See Oracle WebLogic Server Configuring Server Environments atfor more information on avoiding and managing overload.
|Note:||Each backwards compatible communication service in Oracle Communications Services Gatekeeper uses a pair of attributes,
Using multiple Access tier and Network tier servers in dedicated clusters increases the reliability and availability of your applications. Access tier clusters maintain no stateful information about applications, so the failure of a server does not result in any data loss. Oracle Communications Services Gatekeeper also performs automated failover for servers within the Network tier. Any production installation must use the tiered configuration to protect against individual server failures. See Concepts and Architectural Overview for more information.
WebLogic Server self-health monitoring features improve the reliability and availability of server instances in a domain. Selected subsystems within each server instance monitor their health status based on criteria specific to the subsystem. (For example, the JMS subsystem monitors the condition of the JMS thread pool while the core server subsystem monitors default and user-defined execute queue statistics.) If an individual subsystem determines that it can no longer operate in a consistent and reliable manner, it registers its health state as “failed” with the host server.
Each WebLogic Server instance, in turn, checks the health state of its registered subsystems to determine its overall viability. If one or more of its critical subsystems have reached the FAILED state, the server instance marks its own health state FAILED to indicate that it cannot reliably host an application.
When used in combination with Node Manager, server self-health monitoring enables you to automatically reboot servers that have failed. This improves the overall reliability of a domain, and requires no direct intervention from an administrator. For more information on how to use Node Manager to control servers, see Oracle WebLogic Server Node Manager Administrator’s Guide at.
Managed Servers maintain a local copy of the domain configuration. When a Managed Server starts, it tries to contact the Administration Server to retrieve any changes to the domain configuration that were made since the Managed Server was last shut down. If a Managed Server cannot connect to the Administration Server during startup, it can use its locally-cached configuration information—this is the configuration that was current at the time of the Managed Server’s most recent shutdown. A Managed Server that starts up without contacting its Administration Server to check for configuration updates is running in Managed Server Independence (MSI) mode. By default, MSI mode is enabled. See Oracle WebLogic Server Administration Console on-line help atfor information on replicating domain config files for Managed Server independence in the WebLogic Server documentation.
When using Linux or UNIX operating systems, you can use WebLogic Server’s server migration feature to automatically start a candidate (backup) server if a Network tier server’s machine fails or becomes partitioned from the network. The server migration feature uses node manager, in conjunction with the
wlsifconfig.sh script, to automatically boot candidate servers using a floating IP address. Candidate servers are booted only if the primary server hosting a Network tier instance becomes unreachable. See Oracle WebLogic Server Using Clusters at for more information about using the server migration feature.
In addition to server-level redundancy and failover capabilities, Oracle Communications Services Gatekeeper enables you to configure peer sites to protect against catastrophic failures, such as power outages, that can affect an entire domain. This enables you to failover from one geographical site to another, avoiding complete service outages. Seein the Architecture Overview for more information.
A Oracle Communications Services Gatekeeper deployment utilizes two basic categories of configuration information: domain-level configuration, and database configuration. The domain-level configuration consists of the artifacts used by the underlying WebLogic Server platform to configure the behavior of managed servers, clusters, security, and other resources deployed to clusters and servers within the domain. The primary domain-level configuration artifact is the
config.xml file, stored in the
/config directory. The
config.xml file generally references additional configuration files beneath the
config directory to configure additional domain resources such as JDBC and JMS.
In addition to the basic domain-level configuration of the WebLogic Server platform, Oracle Communications Services Gatekeeper stores some configuration for its core services in the form of database tables. This includes the routing configuration for backward-compatible communication services and PRM integration data. The database tables are shared across clustered instances of Oracle Communications Services Gatekeeper server instances. The database must be backed up at regular intervals to protect against data loss or corruption. An Oracle RAC deployment is also required for production installations, to provide redundancy and failover for the database configuration.
Both domain-level configuration backups and database backups may be required at different times in order to fully restore servers, or migrate server instances to new server hardware, in a Oracle Communications Services Gatekeeper installation.
Maintaining system integrity requires that you make use of existing high availability and failover features, perform regular backups of configuration artifacts, and understand how to restore server instances or migrate servers onto viable hardware. These common tasks are summarized in Table 1-1.