Skip Headers
Oracle® Communications Converged Application Server Administration Guide
Release 5.0

Part Number E17647-03
Go to Documentation Home
Go to Book List
Book List
Go to Table of Contents
Go to Feedback page
Contact Us

Go to previous page
Go to next page
View PDF

19 Avoiding and Recovering From Server Failures

This chapter describes the Oracle Communications Converged Application Server failure prevention and recovery features, and describes the configuration artifacts that are required in order to restore different portions of a Converged Application Server domain:

A variety of events can lead to the failure of a server instance. Often one failure condition leads to another. Loss of power, hardware malfunction, operating system crashes, network partitions, or unexpected application behavior may each contribute to the failure of a server instance.

Converged Application Server uses a highly clustered architecture as the basis for minimizing the impact of failure events. However, even in a clustered environment it is important to prepare for a sound recovery process in the event that an individual server or server machine fails.

Failure Prevention and Automatic Recovery Features

Converged Application Server, and the underlying WebLogic Server platform, provide many features that protect against server failures. In a production system, all available features should be used in order to ensure uninterrupted service.

Overload Protection

Converged Application Server detects increases in system load that could affect the performance and stability of deployed SIP Servlets, and automatically throttles message processing at predefined load thresholds.

Using overload protection helps you avoid failures that could result from unanticipated levels of application traffic or resource utilization.

Converged Application Server attempts to avoid failure when certain conditions occur:

  • The rate at which SIP sessions are created reaches a configured value, or

  • The size of the SIP timer and SIP request-processing execute queues reaches a configured length.

See "overload" in the chapter Chapter 30, "Engine Tier Configuration Reference (sipserver.xml)" for more information.

The underlying WebLogic Server platform also detects increases in system load that can affect deployed application performance and stability. WebLogic Server allows administrators to configure failure prevention actions that occur automatically at predefined load thresholds. Automatic overload protection helps you avoid failures that result from unanticipated levels of application traffic or resource utilization as indicated by:

  • A workload manager's capacity being exceeded

  • The HTTP session count increasing to a predefined threshold value

  • Impending out of memory conditions

See the discussion on avoiding and managing overload in Configuring Server Environments for Oracle WebLogic Server in the Oracle WebLogic Server 11g documentation for more information.

Redundancy and Failover for Clustered Services

You can increase the reliability and availability of your applications by using multiple engine tier servers in a dedicated cluster, as well as multiple SIP data tier servers (replicas) in a dedicated SIP data tier cluster. Because engine tier clusters maintain no stateful information about applications, the failure of an engine tier server does not result in any data loss or dropped calls. Multiple replicas in a SIP data tier partition store redundant copies of call state information, and automatically failover to one another should a replica fail.

See "Goals of the Converged Application Server Architecture" in Chapter 1, "Overview of the Converged Application Server Architecture," for more information.

Automatic Restart for Failed Server Instances

WebLogic Server self-health monitoring features improve the reliability and availability of server instances in a domain. Selected subsystems within each server instance monitor their health status based on criteria specific to the subsystem. (For example, the JMS subsystem monitors the condition of the JMS thread pool while the core server subsystem monitors default and user-defined execute queue statistics.) If an individual subsystem determines that it can no longer operate in a consistent and reliable manner, it registers its health state as "failed" with the host server.

Each WebLogic Server instance, in turn, checks the health state of its registered subsystems to determine its overall viability. If one or more of its critical subsystems have reached the FAILED state, the server instance marks its own health state FAILED to indicate that it cannot reliably host an application.

When used in combination with Node Manager, server self-health monitoring enables you to automatically reboot servers that have failed. This improves the overall reliability of a domain, and requires no direct intervention from an administrator. For more information, see the discussion on using Node Manager to start Managed Servers in a domain or cluster in the Oracle WebLogic Server 11g documentation.

Managed Server Independence Mode

Managed Servers maintain a local copy of the domain configuration. When a Managed Server starts, it contacts its Administration Server to retrieve any changes to the domain configuration that were made since the Managed Server was last shut down. If a Managed Server cannot connect to the Administration Server during startup, it can use its locally-cached configuration information—this is the configuration that was current at the time of the Managed Server's most recent shutdown. A Managed Server that starts up without contacting its Administration Server to check for configuration updates is running in Managed Server Independence (MSI) mode. By default, MSI mode is enabled. See "Replicate domain config files for Managed Server Independence" in the Administration Console online Help.

Automatic Migration of Failed Managed Servers

When using Linux or UNIX operating systems, you can use WebLogic Server's server migration feature to automatically start a candidate (backup) server if a Network tier server's machine fails or becomes partitioned from the network. The server migration feature uses node manager, in conjunction with the script, to automatically boot candidate servers using a floating IP address. Candidate servers are booted only if the primary server hosting a Network tier instance becomes unreachable. See the discussion on whole server migration in Using Clusters for Oracle WebLogic Server in the Oracle WebLogic Server 11g documentation for more information about using the server migration feature.

Geographic Redundancy for Regional Site Failures

In addition to server-level redundancy and failover capabilities, Converged Application Server enables you to configure peer sites to protect against catastrophic failures, such as power outages, that can affect an entire domain. This enables you to failover from one geographical site to another, avoiding complete service outages. See "Geographically-Redundant Installations" in Chapter 1, "Overview of the Converged Application Server Architecture," for more information.

Directory and File Backups for Failure Recovery

Recovery from the failure of a server instance requires access to the domain's configuration data. By default, the Administration Server stores a domain's primary configuration data in a file called DOMAIN_HOME/config/config.xml, where DOMAIN_HOME is the root directory of the domain. The primary configuration file may reference additional configuration files for specific WebLogic Server services, such as JDBC and JMS, and for Converged Application Server services, such as SIP container properties and SIP data tier configuration. The configuration for specific services are stored in additional XML files in subdirectories of the DOMAIN_HOME/config directory, such as DOMAIN_HOME/config/jms, DOMAIN_HOME/config/jdbc, and DOMAIN_HOME/config/custom for Converged Application Server configuration files.

The Administration Server can automatically archive multiple versions of the domain configuration (the entire DOMAIN_HOME/config directory). The configuration archives can be used for system restoration in cases where accidental configuration changes need to be reversed. For example, if an administrator accidentally removes a configured resource, the prior configuration can be restored by using the last automated backup.

The Administration Server stores only a finite number of automated backups locally in DOMAIN_HOME/config. For this reason, automated domain backups are limited in their ability to guard against data corruption, such as a failed hard disk. Automated backups also do not preserve certain configuration data that are required for full domain restoration, such as LDAP repository data and server start-up scripts. Oracle recommends that you also maintain multiple backup copies of the configuration and security offline, in a source control system.

This section describes file backups that Converged Application Server performs automatically, as well as manual backup procedures that an administrator should perform periodically.

Enabling Automatic Configuration Backups

Follow these steps to enable automatic domain configuration backups on the Administration Server for your domain:

  1. Access the Administration Console for your domain.

  2. In the left pane of the Administration Console, select the name of the domain.

  3. In the right pane, click Configuration, and then select the General tab.

  4. Select Advanced to display advanced options.

  5. Select Configuration Archive Enabled.

  6. In the Archive Configuration Count box, enter the maximum number of configuration file revisions to save.

  7. Click Save.

When you enable configuration archiving, the Administration Server automatically creates a configuration JAR file archive. The JAR file contains a complete copy of the previous configuration (the complete contents of the DOMAIN_HOME\config directory). JAR file archive files are stored in the DOMAIN_HOME\configArchive directory. The files use the naming convention config-number.jar, where number is the sequential number of the archive.

When you save a change to a domain's configuration, the Administration Server saves the previous configuration in DOMAIN_HOME\configArchive\config.xml#n. Each time the Administration Server saves a file in the configArchive directory, it increments the value of the #n suffix, up to a configurable number of copies—5 by default. Thereafter, each time you change the domain configuration:

  • The archived files are rotated so that the newest file has a suffix with the highest number,

  • The previous archived files are renamed with a lower number, and

  • The oldest file is deleted.

Be aware that configuration archives are stored locally within the domain directory, and they may be overwritten according to the maximum number of revisions you selected. For these reasons, you must also create your own off-line archives of the domain configuration, as described in "Storing the Domain Configuration Offline".

Storing the Domain Configuration Offline

Although automatic backups protect against accidental configuration changes, they do not protect against data loss caused by a failure of the hard disk that stores the domain configuration, or accidental deletion of the domain directory. To protect against these failures, you must also store a complete copy of the domain configuration offline, preferably in a source control system.

Oracle recommends storing a copy of the domain configuration at regular intervals. For example, backup a new revision of the configuration when:

  • You first deploy the production system

  • You add or remove deployed applications

  • The configuration is tuned for performance

  • Any other permanent change is made.

The domain configuration backup should contain the complete contents of the DOMAIN_HOME/config directory. For example:

cd ~/ORACLE_HOME/Middleware/user_projects/domains/mydomain
tar cvf domain-backup-06-17-2007.jar config

Store the new archive in a source control system, preserving earlier versions should you need to restore the domain configuration to an earlier point in time.

Backing Up Server Start Scripts

In a Converged Application Server deployment, the start scripts used to boot engine and SIP data tier servers are generally customized to include domain-specific configuration information such as:

Backup each distinct start script used to boot engine tier, SIP data tier, or diameter relay servers in your domain.

Backing Up Logging Servlet Applications

If you use Converged Application Server logging Servlets (see Chapter 17, "Logging SIP Requests and Responses") to perform regular logging or auditing of SIP messages, backup the complete application source files so that you can easily redeploy the applications should the staging server fail or the original deployment directory becomes corrupted.

Backing Up Security Data

The WebLogic Security service stores its configuration data config.xml file, and also in an LDAP repository and other files.

Backing Up the WebLogic LDAP Repository

The default Authentication, Authorization, Role Mapper, and Credential Mapper providers that are installed with Converged Application Server store their data in an LDAP server. Each Converged Application Server contains an embedded LDAP server. The Administration Server contains the master LDAP server, which is replicated on all Managed Servers. If any of your security realms use these installed providers, you should maintain an up-to-date backup of the following directory tree:


where DOMAIN_NAME is the domain's root directory and AdminServer\data\ldap is the directory in which the Administration Server stores runtime and security data.

Each Converged Application Server has an LDAP directory, but you only need to back up the LDAP data on the Administration Server—the master LDAP server replicates the LDAP data from each Managed Server when updates to security data are made. WebLogic security providers cannot modify security data while the domain's Administration Server is unavailable. The LDAP repositories on Managed Servers are replicas and cannot be modified.

The ldap\ldapfiles subdirectory contains the data files for the LDAP server. The files in this directory contain user, group, group membership, policies, and role information. Other subdirectories under the ldap directory contain LDAP server message logs and data about replicated LDAP servers.

Do not update the configuration of a security provider while a backup of LDAP data is in progress. If a change is made—for instance, if an administrator adds a user—while you are backing up the ldap directory tree, the backups in the ldapfiles subdirectory could become inconsistent. If this does occur, consistent, but potentially out-of-date, LDAP backups are available.

Once a day, a server suspends write operations and creates its own backup of the LDAP data. It archives this backup in a ZIP file below the ldap\backup directory and then resumes write operations. This backup is guaranteed to be consistent, but it might not contain the latest security data.

For information about configuring the LDAP backup, see the discussion on backing up the LDAP repository in Avoiding and Recovering From Server Failure in the Oracle WebLogic Server 11g Documentation.

Backing Up SerializedSystemIni.dat and Security Certificates

All servers create a file named SerializedSystemIni.dat and place it in the server's root directory. This file contains encrypted security data that must be present to boot the server. You must back up this file.

If you configured a server to use SSL, also back up the security certificates and keys. The location of these files is user-configurable.

Backing Up Additional Operating System Configuration Files

Certain files maintained at the operating system level are also critical in helping you recover from system failures. Consider backing up the following information as necessary for your system:

  • Load Balancer configuration scripts. For example, any automated scripts used to configure load balancer pools and virtual IP addresses for the engine tier cluster, as well as NAT configuration settings.

  • NTP client configuration scripts used to synchronize the system clocks of engine and SIP data tier servers.

  • Host configuration files for each Converged Application Server machine (host names, virtual and real IP addresses for multi-homed machines, IP routing table information).

Restarting a Failed Administration Server

When you restart a failed Administration Server, no special steps are required. Start the Administration Server as you normally would.

If the Administration Server shuts down while Managed Servers continue to run, you do not need to restart the Managed Servers that are already running in order to recover management of the domain. The procedure for recovering management of an active domain depends upon whether you can restart the Administration Server on the same machine it was running on when the domain was started.

Restarting an Administration Server on the Same Machine

If you restart the WebLogic Administration Server while Managed Servers continue to run, by default the Administration Server can discover the presence of the running Managed Servers.


Make sure that the startup command or startup script does not include, which disables an Administration Server from discovering its running Managed Servers.

The root directory for the domain contains a file, running-managed-servers.xml, which contains a list of the Managed Servers in the domain and describes whether they are running or not. When the Administration Server restarts, it checks this file to determine which Managed Servers were under its control before it stopped running.

When a Managed Server is gracefully or forcefully shut down, its status in running-managed-servers.xml is updated to "not-running". When an Administration Server restarts, it does not try to discover Managed Servers with the "not-running" status. A Managed Server that stops running because of a system crash, or that was stopped by killing the JVM or the command prompt (shell) in which it was running, will still have the status "running' in running-managed-servers.xml. The Administration Server will attempt to discover them, and will throw an exception when it determines that the Managed Server is no longer running.

Restarting the Administration Server does not cause Managed Servers to update the configuration of static attributes. Static attributes are those that a server refers to only during its startup process. Servers instances must be restarted to take account of changes to static configuration attributes. Discovery of the Managed Servers only enables the Administration Server to monitor the Managed Servers or make runtime changes in attributes that can be configured while a server is running (dynamic attributes).

Restarting an Administration Server on Another Machine

If a machine crash prevents you from restarting the Administration Server on the same machine, you can recover management of the running Managed Servers as follows:

  1. Install the Converged Application Server software on the new administration machine (if this has not already been done).

  2. Make your application files available to the new Administration Server by copying them from backups or by using a shared disk. Your application files should be available in the same relative location on the new file system as on the file system of the original Administration Server.

  3. Make your configuration and security data available to the new administration machine by copying them from backups or by using a shared disk. For more information, refer to "Storing the Domain Configuration Offline" and "Backing Up Security Data".

  4. Restart the Administration Server on the new machine.

    Make sure that the startup command or startup script does not include, which disables an Administration Server from discovering its running Managed Servers.

When the Administration Server starts, it communicates with the Managed Servers and informs them that the Administration Server is now running on a different IP address.

Restarting Failed Managed Servers

If the machine on which the failed Managed Server runs can contact the Administration Server for the domain, simply restart the Managed Server manually or automatically using Node Manager. Note that you must configure Node Manager and the Managed Server to support automated restarts, as described in the discussion on using Node Manager to start Managed Servers in a Domain or Cluster in the Oracle WebLogic Server 11g release 1 patch set 2 documentation.

If the Managed Server cannot connect to the Administration Server during startup, it can retrieve its configuration by reading locally-cached configuration data. A Managed Server that starts in this way is running in Managed Server Independence (MSI) mode. For a description of MSI mode, and the files that a Managed Server must access to start up in MSI mode, see "Replicate domain config files for Managed Server independence" in the Administration Console online Help.

To start up a Managed Server in MSI mode:

  1. Ensure that the following files are available in the Managed Server's root directory:

    • msi-config.xml

    • SerializedSystemIni.dat


    If these files are not in the Managed Server's root directory:

    1. Copy the config.xml and SerializedSystemIni.dat file from the Administration Server's root directory (or from a backup) to the Managed Server's root directory.

    2. Rename the configuration file to msi-config.xml. When you start the server, it will use the copied configuration files.


      Alternatively, use the -Dweblogic.RootDirectory=path startup option to specify a root directory that already contains these files.
  2. Start the Managed Server at the command line or using a script.

    The Managed Server will run in MSI mode until it is contacted by its Administration Server. For information about restarting the Administration Server in this scenario, see "Restarting a Failed Administration Server".