Operations Guide

     Previous  Next    Open TOC in new window    View as PDF - New Window  Get Adobe Reader - New Window
Content starts here

Avoiding and Recovering From Server Failures

A variety of events can lead to the failure of a server instance. Often one failure condition leads to another. Loss of power, hardware malfunction, operating system crashes, network partitions, or unexpected application behavior may each contribute to the failure of a server instance.

WebLogic SIP Server uses a highly clustered architecture as the basis for minimizing the impact of failure events. However, even in a clustered environment it is important to prepare for a sound recovery process in the event that an individual server or server machine fails.

The following sections summarize WebLogic SIP Server failure prevention and recovery features, and describe the configuration artifacts that are required in order to restore different portions of a WebLogic SIP Server domain:

 


Failure Prevention and Automatic Recovery Features

WebLogic SIP Server, and the underlying WebLogic Server platform, provide many features that protect against server failures. In a production system, all available features should be used in order to ensure uninterrupted service.

Overload Protection

WebLogic SIP Server detects increases in system load that could affect the performance and stability of deployed SIP Servlets, and automatically throttles message processing at predefined load thresholds.

Using overload protection helps you avoid failures that could result from unanticipated levels of application traffic or resource utilization.

WebLogic SIP Server attempts to avoid failure when certain conditions occur:

See overload in the Configuration Reference for more information.

The underlying WebLogic Server platform also detects increases in system load that can affect deployed application performance and stability. WebLogic Server allows administrators to configure failure prevention actions that occur automatically at predefined load thresholds. Automatic overload protection helps you avoid failures that result from unanticipated levels of application traffic or resource utilization as indicated by:

See Avoiding and Managing Overload in the WebLogic Server 9.2 documentation for more information.

Redundancy and Failover for Clustered Services

You can increase the reliability and availability of your applications by using multiple engine tier servers in a dedicated cluster, as well as multiple data tier servers (replicas) in a dedicated data tier cluster. Because engine tier clusters maintain no stateful information about applications, the failure of an engine tier server does not result in any data loss or dropped calls. Multiple replicas in a data tier partition store redundant copies of call state information, and automatically failover to one another should a replica fail.

See Overview of the WebLogic SIP Server Architecture in the Configuration Manual for more information.

Automatic Restart for Failed Server Instances

WebLogic Server self-health monitoring features improve the reliability and availability of server instances in a domain. Selected subsystems within each server instance monitor their health status based on criteria specific to the subsystem. (For example, the JMS subsystem monitors the condition of the JMS thread pool while the core server subsystem monitors default and user-defined execute queue statistics.) If an individual subsystem determines that it can no longer operate in a consistent and reliable manner, it registers its health state as “failed” with the host server.

Each WebLogic Server instance, in turn, checks the health state of its registered subsystems to determine its overall viability. If one or more of its critical subsystems have reached the FAILED state, the server instance marks its own health state FAILED to indicate that it cannot reliably host an application.

When used in combination with Node Manager, server self-health monitoring enables you to automatically reboot servers that have failed. This improves the overall reliability of a domain, and requires no direct intervention from an administrator. For more information, see Using Node Manager to Control Servers in the WebLogic Server 9.2 documentation.

Managed Server Independence Mode

Managed Servers maintain a local copy of the domain configuration. When a Managed Server starts, it contacts its Administration Server to retrieve any changes to the domain configuration that were made since the Managed Server was last shut down. If a Managed Server cannot connect to the Administration Server during startup, it can use its locally-cached configuration information—this is the configuration that was current at the time of the Managed Server’s most recent shutdown. A Managed Server that starts up without contacting its Administration Server to check for configuration updates is running in Managed Server Independence (MSI) mode. By default, MSI mode is enabled. See Replicating domain config files for Managed Server Independence in the WebLogic Server 9.2 documentation.

Automatic Migration of Failed Managed Servers

When using Linux or UNIX operating systems, you can use WebLogic Server’s server migration feature to automatically start a candidate (backup) server if a Network tier server’s machine fails or becomes partitioned from the network. The server migration feature uses node manager, in conjunction with the wlsifconfig.sh script, to automatically boot candidate servers using a floating IP address. Candidate servers are booted only if the primary server hosting a Network tier instance becomes unreachable. See Migration in the WebLogic Server 9.2 documentation for more information about using the server migration feature.

Geographic Redundancy for Regional Site Failures

In addition to server-level redundancy and failover capabilities, WebLogic SIP Server enables you to configure peer sites to protect against catastrophic failures, such as power outages, that can affect an entire domain. This enables you to failover from one geographical site to another, avoiding complete service outages. See Configuring Geographically-Redundant Installations in the Configuration Guide for more information.

 


Directory and File Backups for Failure Recovery

Recovery from the failure of a server instance requires access to the domain’s configuration data. By default, the Administration Server stores a domain’s primary configuration data in a file called domain_name/config/config.xml, where domain_name is the root directory of the domain. The primary configuration file may reference additional configuration files for specific WebLogic Server services, such as JDBC and JMS, and for WebLogic SIP Server services, such as SIP container properties and data tier configuration. The configuration for specific services are stored in additional XML files in subdirectories of the domain_name/config directory, such as domain_name/config/jms, domain_name/config/jdbc, and domain_name/config/custom for WebLogic SIP Server configuration files.

The Administration Server can automatically archive multiple versions of the domain configuration (the entire domain-name/config directory). The configuration archives can be used for system restoration in cases where accidental configuration changes need to be reversed. For example, if an administrator accidentally removes a configured resource, the prior configuration can be restored by using the last automated backup.

The Administration Server stores only a finite number of automated backups locally in domain_name/config. For this reason, automated domain backups are limited in their ability to guard against data corruption, such as a failed hard disk. Automated backups also do not preserve certain configuration data that are required for full domain restoration, such as LDAP repository data and server start-up scripts. BEA recommends that you also maintain multiple backup copies of the configuration and security offline, in a source control system.

This section describes file backups that WebLogic SIP Server performs automatically, as well as manual backup procedures that an administrator should perform periodically.

Enabling Automatic Configuration Backups

Follow these steps to enable automatic domain configuration backups on the Administration Server for your domain:

  1. Access the Administration Console for your domain.
  2. In the left pane of the Administration Console, select the name of the domain.
  3. In the right pane, click the Configuration->General tab.
  4. In the Advanced Options bar, click Show.
  5. In the Archive Configuration Count box, enter the maximum number of configuration file revisions to save.
  6. Click Apply.

When you enable configuration archiving, the Administration Server automatically creates a configuration JAR file archive each time the Administrator uses the Activate Changes button in the Administration Console to change the active configuration. The JAR file contains a complete copy of the previous configuration (the complete contents of the domain-name\config directory). JAR file archive files are stored in the domain-name\configArchive directory. The files use the naming convention config-number.jar, where number is the sequential number of the archive.

When you save a change to a domain’s configuration, the Administration Server saves the previous configuration in domain-name\configArchive\config.xml#n. Each time the Administration Server saves a file in the configArchive directory, it increments the value of the #n suffix, up to a configurable number of copies—5 by default. Thereafter, each time you change the domain configuration:

Keep in mind that configuration archives are stored locally within the domain directory, and they may be overwritten according to the maximum number of revisions you selected. For these reasons, you must also create your own off-line archives of the domain configuration, as described in Storing the Domain Configuration Offline.

Storing the Domain Configuration Offline

Although automatic backups protect against accidental configuration changes, they do not protect against data loss caused by a failure of the hard disk that stores the domain configuration, or accidental deletion of the domain directory. To protect against these failures, you must also store a complete copy of the domain configuration offline, preferably in a source control system.

BEA recommends storing a copy of the domain configuration at regular intervals. For example, back up a new revision of the configuration when:

The domain configuration backup should contain the complete contents of the domain_name/config directory. For example:

cd ~/bea/user_projects/domains/mydomain
tar cvf domain-backup-06-17-2007.jar config

Store the new archive in a source control system, preserving earlier versions should you need to restore the domain configuration to an earlier point in time.

Backing Up Server Start Scripts

In a WebLogic SIP Server deployment, the start scripts used to boot engine and data tier servers are generally customized to include domain-specific configuration information such as:

Backup each distinct start script used to boot engine tier, data tier, or diameter relay servers in your domain.

Backing Up Logging Servlet Applications

If you use WebLogic SIP Server logging Servlets (see Logging SIP Requests and Responses) to perform regular logging or auditing of SIP messages, backup the complete application source files so that you can easily redeploy the applications should the staging server fail or the original deployment directory becomes corrupted.

Backing Up Security Data

The WebLogic Security service stores its configuration data config.xml file, and also in an LDAP repository and other files.

Backing Up the WebLogic LDAP Repository

The default Authentication, Authorization, Role Mapper, and Credential Mapper providers that are installed with WebLogic SIP Server store their data in an LDAP server. Each WebLogic SIP Server contains an embedded LDAP server. The Administration Server contains the master LDAP server, which is replicated on all Managed Servers. If any of your security realms use these installed providers, you should maintain an up-to-date backup of the following directory tree:

domain_name\adminServer\ldap

where domain_name is the domain’s root directory and adminServer is the directory in which the Administration Server stores runtime and security data.

Each WebLogic SIP Server has an LDAP directory, but you only need to back up the LDAP data on the Administration Server—the master LDAP server replicates the LDAP data from each Managed Server when updates to security data are made. WebLogic security providers cannot modify security data while the domain’s Administration Server is unavailable. The LDAP repositories on Managed Servers are replicas and cannot be modified.

The ldap/ldapfiles subdirectory contains the data files for the LDAP server. The files in this directory contain user, group, group membership, policies, and role information. Other subdirectories under the ldap directory contain LDAP server message logs and data about replicated LDAP servers.

Do not update the configuration of a security provider while a backup of LDAP data is in progress. If a change is made—for instance, if an administrator adds a user—while you are backing up the ldap directory tree, the backups in the ldapfiles subdirectory could become inconsistent. If this does occur, consistent, but potentially out-of-date, LDAP backups are available.

Once a day, a server suspends write operations and creates its own backup of the LDAP data. It archives this backup in a ZIP file below the ldap\backup directory and then resumes write operations. This backup is guaranteed to be consistent, but it might not contain the latest security data.

For information about configuring the LDAP backup, see Configuring Backups for the Embedded LDAP Server in the WebLogic Server 9.2 Documentation.

Backing Up SerializedSystemIni.dat and Security Certificates

All servers create a file named SerializedSystemIni.dat and place it in the server’s root directory. This file contains encrypted security data that must be present to boot the server. You must back up this file.

If you configured a server to use SSL, also back up the security certificates and keys. The location of these files is user-configurable.

Backing Up Additional Operating System Configuration Files

Certain files maintained at the operating system level are also critical in helping you recover from system failures. Consider backing up the following information as necessary for your system:

 


Restarting a Failed Administration Server

When you restart a failed Administration Server, no special steps are required. Start the Administration Server as you normally would.

If the Administration Server shuts down while Managed Servers continue to run, you do not need to restart the Managed Servers that are already running in order to recover management of the domain. The procedure for recovering management of an active domain depends upon whether you can restart the Administration Server on the same machine it was running on when the domain was started.

Restarting an Administration Server on the Same Machine

If you restart the WebLogic Administration Server while Managed Servers continue to run, by default the Administration Server can discover the presence of the running Managed Servers.

Note: Make sure that the startup command or startup script does not include -Dweblogic.management.discover=false, which disables an Administration Server from discovering its running Managed Servers.

The root directory for the domain contains a file, running-managed-servers.xml, which contains a list of the Managed Servers in the domain and describes whether they are running or not. When the Administration Server restarts, it checks this file to determine which Managed Servers were under its control before it stopped running.

When a Managed Server is gracefully or forcefully shut down, its status in running-managed-servers.xml is updated to “not-running”. When an Administration Server restarts, it does not try to discover Managed Servers with the “not-running” status. A Managed Server that stops running because of a system crash, or that was stopped by killing the JVM or the command prompt (shell) in which it was running, will still have the status “running’ in running-managed-servers.xml. The Administration Server will attempt to discover them, and will throw an exception when it determines that the Managed Server is no longer running.

Restarting the Administration Server does not cause Managed Servers to update the configuration of static attributes. Static attributes are those that a server refers to only during its startup process. Servers instances must be restarted to take account of changes to static configuration attributes. Discovery of the Managed Servers only enables the Administration Server to monitor the Managed Servers or make runtime changes in attributes that can be configured while a server is running (dynamic attributes).

Restarting an Administration Server on Another Machine

If a machine crash prevents you from restarting the Administration Server on the same machine, you can recover management of the running Managed Servers as follows:

  1. Install the WebLogic SIP Server software on the new administration machine (if this has not already been done).
  2. Make your application files available to the new Administration Server by copying them from backups or by using a shared disk. Your application files should be available in the same relative location on the new file system as on the file system of the original Administration Server.
  3. Make your configuration and security data available to the new administration machine by copying them from backups or by using a shared disk. For more information, refer to Storing the Domain Configuration Offline and Backing Up Security Data.
  4. Restart the Administration Server on the new machine.
  5. Make sure that the startup command or startup script does not include -Dweblogic.management.discover=false, which disables an Administration Server from discovering its running Managed Servers.

When the Administration Server starts, it communicates with the Managed Servers and informs them that the Administration Server is now running on a different IP address.

 


Restarting Failed Managed Servers

If the machine on which the failed Managed Server runs can contact the Administration Server for the domain, simply restart the Managed Server manually or automatically using Node Manager. Note that you must configure Node Manager and the Managed Server to support automated restarts, as described in Using Node Manager to Control Servers in the WebLogic Server 9.2 documentation.

If the Managed Server cannot connect to the Administration Server during startup, it can retrieve its configuration by reading locally-cached configuration data. A Managed Server that starts in this way is running in Managed Server Independence (MSI) mode. For a description of MSI mode, and the files that a Managed Server must access to start up in MSI mode, see Replicating domain config files for Managed Server Independence in the WebLogic Server 9.2 documentation.

To start up a Managed Server in MSI mode:

  1. Ensure that the following files are available in the Managed Server’s root directory:
    • msi-config.xml.
    • SerializedSystemIni.dat
    • boot.properties
    • If these files are not in the Managed Server’s root directory:

    1. Copy the config.xml and SerializedSystemIni.dat file from the Administration Server’s root directory (or from a backup) to the Managed Server’s root directory.
    2. Rename the configuration file to msi-config.xml. When you start the server, it will use the copied configuration files.
    3. Note: Alternatively, use the -Dweblogic.RootDirectory=path startup option to specify a root directory that already contains these files.
  2. Start the Managed Server at the command line or using a script.
  3. The Managed Server will run in MSI mode until it is contacted by its Administration Server. For information about restarting the Administration Server in this scenario, see Restarting a Failed Administration Server.


  Back to Top       Previous  Next