13 Avoiding and Recovering From Server Failures

This chapter describes the Oracle Communications WebRTC Session Controller failure prevention and recovery features, and includes the configuration artifacts that are required to restore different portions of a WebRTC Session Controller domain.

Failure Prevention and Automatic Recovery Features

A variety of events can lead to the failure of a server instance. Often one failure condition leads to another. Loss of power, hardware malfunction, operating system malfunctions, network partitions, or unexpected application behavior may each contribute to the failure of a server instance.

WebRTC Session Controller uses a highly clustered architecture as the basis for minimizing the impact of failure events. However, even in a clustered environment it is important to prepare for a sound recovery process if an individual server fails.

WebRTC Session Controller, and the underlying WebLogic Server platform, provide many features that protect against server failures. In a production system, use all available features to ensure uninterrupted service.

High Availability

High availability refers to a system design that eliminates or minimizes the amount of time that a system is inaccessible due to some type of system failure.

WebRTC Session Controller achieves high availability primarily through the features of the underlying WebLogic Server platform. These features include:

  • WebLogic Server clusters that distribute the work load among the multiple instances of WebLogic Server running on the nodes in the cluster. In the event of failure, the session state of the failed WebLogic Server is available to other nodes that can continue the work. If the cluster is configured correctly, services can also migrate to another node in the event of failure. See "Understanding WebLogic Server Clustering" in Administering Clusters for Oracle WebLogic Server for more information.

  • Coherence clusters that distribute data across members to ensure that data is always available. See "Configuring and Managing Coherence Clusters" in Administering Clusters for Oracle WebLogic Server for more information.

  • Overload protection that enables WebLogic Server to detect and recover from overload conditions. See "Avoiding and Managing Overload" in Administering Server Environments for more information.

  • Network channels that segregate traffic by type to use resources effectively. See "Configuring Network Resources" in Administering Server Environments for more information.

  • Work Managers that optimize and prioritize work based on rules and performance statistics. See "Using Work Managers to Optimize Scheduled Work" in Administering Server Environments for more information.

You can also use virtual machines (VMs) to mitigate system failure. An individual server has multiple points of potential failure, including CPU, RAM, network ports, and disk drives. A virtual machine, on the other hand, can satisfy its resource requirements from a pool of hardware resources so that a physical disk failure does not result in a failure of the virtual disk. The virtual machine simply employs another available disk drive to compensate for the one that failed. A balanced deployment of VMs running separate Signaling Engines and Media Engines on different hosts can take full advantage of cross-host high availability for both Signaling Engine and Media Engine clusters.

For information on installing a Media Engine cluster to support redundancy and failover, high-availability, and load balancing, see the sections on installing media engine clusters in the Oracle Communications WebRTC Session Controller Installation Guide.

Overload Protection

There are two sets of tuning parameters related to overload protection, one set for the SIP side and another set for the HTTP or WebSocket side. For WebRTC Session Controller, the greater threats are from the HTTP (Internet) side.

WebRTC Session Controller detects increases in system load that could affect the performance and stability of deployed SIP Servlets, and automatically throttles message processing at predefined load thresholds.

Using overload protection helps you avoid failures that could result from unanticipated levels of application traffic or resource utilization.

WebRTC Session Controller attempts to avoid failure when certain conditions occur:

  • The rate at which SIP sessions are created reaches a configured value, or

  • The size of the SIP timer and SIP request-processing execute queues reaches a configured length.

See "Engine Server Configuration Reference (sipserver.xml)" for more information.
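The two trigger conditions above can be pictured as a simple predicate. The following Python sketch is only an illustration of the decision described in this section, not the product's implementation. The names OverloadThresholds, max_session_rate, max_queue_length, and should_throttle are hypothetical and do not correspond to sipserver.xml elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverloadThresholds:
    """Hypothetical container for the two configured limits."""
    max_session_rate: float  # SIP sessions created per second (assumed unit)
    max_queue_length: int    # combined length of the timer and request queues


def should_throttle(session_rate: float, queue_length: int,
                    limits: OverloadThresholds) -> bool:
    """Return True when either configured limit has been reached."""
    return (session_rate >= limits.max_session_rate
            or queue_length >= limits.max_queue_length)


limits = OverloadThresholds(max_session_rate=50.0, max_queue_length=200)
print(should_throttle(session_rate=10.0, queue_length=250, limits=limits))
```

Either condition alone is enough to trigger throttling, which matches the "or" between the two bullets above.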

The underlying WebLogic Server platform also detects increases in system load that can affect deployed application performance and stability. WebLogic Server allows administrators to configure failure prevention actions that occur automatically at predefined load thresholds. Automatic overload protection helps you avoid failures that result from unanticipated levels of application traffic or resource utilization as indicated by:

  • A workload manager's capacity being exceeded

  • The HTTP session count increasing to a predefined threshold value

  • Impending out of memory conditions

See "Avoiding and Managing Overload" in Administering Server Environments for Oracle WebLogic Server for more information.

Redundancy and Failover for Clustered Services

You can increase the reliability and availability of your applications by using multiple servers and partitions in a dedicated cluster.

Server partitions store redundant copies of call state information, and automatically fail over to one another should a partition or server fail.

See Oracle Communications WebRTC Session Controller Concepts for more information.

Automatic Restart for Failed Server Instances

WebLogic Server self-health monitoring features improve the reliability and availability of server instances in a domain. Selected subsystems within each server instance monitor their health status based on criteria specific to the subsystem. (For example, the JMS subsystem monitors the condition of the JMS thread pool while the core server subsystem monitors default and user-defined execute queue statistics.) If an individual subsystem determines that it can no longer operate in a consistent and reliable manner, it registers its health state as failed with the host server.

Each WebLogic Server instance, in turn, checks the health state of its registered subsystems to determine its overall viability. If one or more of its critical subsystems have reached the FAILED state, the server instance marks its own health state FAILED to indicate that it cannot reliably host an application.
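The aggregation rule described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it uses only two states (OK and FAILED), whereas WebLogic Server defines more, and the names Health and server_health are hypothetical rather than WebLogic APIs.

```python
from enum import Enum
from typing import Dict, Iterable


class Health(Enum):
    """Simplified two-state health model (an assumption for illustration)."""
    OK = "OK"
    FAILED = "FAILED"


def server_health(subsystem_health: Dict[str, Health],
                  critical: Iterable[str]) -> Health:
    """A server is FAILED if any of its critical subsystems is FAILED."""
    for name in critical:
        if subsystem_health.get(name) is Health.FAILED:
            return Health.FAILED
    return Health.OK


reported = {"JMS": Health.FAILED, "core": Health.OK}
print(server_health(reported, critical=["core", "JMS"]))
```

A failed subsystem that is not in the critical list does not change the server's overall state in this model.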

When used in combination with Node Manager, server self-health monitoring enables you to automatically restart servers that have failed. This improves the overall reliability of a domain, and requires no direct intervention from an administrator. For more information, see "Using Node Manager to Control Servers" in Administering Node Manager for Oracle WebLogic Server.

Managed Server Independence Mode

Managed Servers maintain a local copy of the domain configuration. When a Managed Server starts, it contacts its Administration Server to retrieve any changes to the domain configuration that were made since the Managed Server was last shut down. If a Managed Server cannot connect to the Administration Server during startup, it can use its locally-cached configuration information—this is the configuration that was current at the time of the Managed Server's most recent shutdown. A Managed Server that starts without contacting its Administration Server to check for configuration updates is running in Managed Server Independence (MSI) mode. By default, MSI mode is enabled. See "Replicate domain config files for Managed Server Independence" in the Administration Console Online Help for more information.

Automatic Migration of Failed Managed Servers

When using Linux or UNIX operating systems, you can use WebLogic Server's server migration feature to automatically start a candidate (backup) server if a Network tier server fails or becomes partitioned from the network. The server migration feature uses Node Manager, with the wlsifconfig.sh script, to automatically start candidate servers using a floating IP address. Candidate servers are started only if the primary server hosting a Network tier instance becomes unreachable. See the discussion on "Whole Server Migration" in Administering Clusters for Oracle WebLogic Server for more information about using the server migration feature.

Geographic Redundancy for Regional Site Failures

In addition to server-level redundancy and failover capabilities, you can configure peer sites to protect against catastrophic failures, such as power outages, that can affect an entire domain. This configuration enables you to fail over from one geographical site to another, avoiding complete service outages.

There is no specific configuration in WebRTC Session Controller to support redundant sites. They are two independent sites that are not aware of each other, which means that you need to configure and provision each site manually.

Directory and File Backups for Failure Recovery

Recovery from the failure of a server instance requires access to the domain's configuration data. By default, the Administration Server stores a domain's primary configuration data in a file called domain_home/config/config.xml, where domain_home is the root directory of the domain.

The primary configuration file may reference additional configuration files for specific WebLogic Server services, such as JDBC and JMS, and for WebRTC Session Controller services, such as SIP container properties and SIP call-state storage configuration. The configuration for specific services is stored in additional XML files in subdirectories of the domain_home/config directory, such as domain_home/config/jms, domain_home/config/jdbc, and domain_home/config/custom for WebRTC Session Controller configuration files.

The Administration Server can automatically archive multiple versions of the domain configuration (the entire domain_home/config directory). Use the configuration archives for system restoration in cases where accidental configuration changes need to be reversed. For example, if an administrator accidentally removes a configured resource, the prior configuration can be restored by using the last automated backup.

The Administration Server stores only a finite number of automated backups locally in domain_home/config. For this reason, automated domain backups are limited in their ability to guard against data corruption, such as a failed hard disk. Automated backups also do not preserve certain configuration data that are required for full domain restoration, such as LDAP repository data and server start-up scripts. Oracle recommends that you also maintain multiple backup copies of the configuration and security offline, in a source control system.

This section describes file backups that WebRTC Session Controller performs automatically and manual backup procedures that an administrator should perform periodically.

Enabling Automatic Configuration Backups

Follow these steps to enable automatic domain configuration backups on the Administration Server for your domain:

  1. Access the Administration Console for your domain.

  2. In the left pane of the Administration Console, select the name of the domain.

  3. In the right pane, click Configuration, and then select the General tab.

  4. Select Advanced to display advanced options.

  5. Select Configuration Archive Enabled.

  6. In the Archive Configuration Count box, enter the maximum number of configuration file revisions to save.

  7. Click Save.

When you enable configuration archiving, the Administration Server automatically creates a configuration JAR file archive. The JAR file contains a complete copy of the previous configuration (the complete contents of the domain_home\config directory). The JAR archives are stored in the domain_home\configArchive directory. The files use the naming convention config-number.jar, where number is the sequential number of the archive.

When you save a change to a domain's configuration, the Administration Server saves the previous configuration in domain_home\configArchive\config.xml#n. Each time the Administration Server saves a file in the configArchive directory, it increments the value of the #n suffix, up to a configurable number of copies—5 by default. Thereafter, each time you change the domain configuration:

  • The archived files are rotated so that the newest file has a suffix with the highest number,

  • The previous archived files are renamed with a lower number, and

  • The oldest file is deleted.

Be aware that configuration archives are stored locally within the domain directory, and they may be overwritten according to the maximum number of revisions you selected. For these reasons, you must also create your own offline archives of the domain configuration, as described in "Storing the Domain Configuration Offline".
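When you need to restore from an automated archive, you first have to identify the most recent one. The following Python sketch picks the config-number.jar file with the highest number from a configArchive directory. It assumes that the highest sequence number is the newest archive. The function name newest_config_archive is hypothetical and is not part of WebLogic Server.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the config-number.jar naming convention described above.
ARCHIVE_NAME = re.compile(r"^config-(\d+)\.jar$")


def newest_config_archive(archive_dir: Path) -> Optional[Path]:
    """Return the config-<number>.jar with the highest number, or None.

    Assumption: the highest sequence number marks the newest archive.
    """
    best: Optional[Path] = None
    best_number = -1
    for entry in archive_dir.iterdir():
        match = ARCHIVE_NAME.match(entry.name)
        if match and entry.is_file():
            number = int(match.group(1))
            if number > best_number:
                best, best_number = entry, number
    return best
```

The numbers are compared as integers, so config-10.jar sorts after config-2.jar. A plain alphabetical sort would get that wrong.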

Storing the Domain Configuration Offline

Although automatic backups protect against accidental configuration changes, they do not protect against data loss caused by a failure of the hard disk that stores the domain configuration, or accidental deletion of the domain directory. To protect against these failures, you must also store a complete copy of the domain configuration offline, preferably in a source control system.

Oracle recommends creating a full snapshot of the domain at regular intervals. For example, you might want to create a snapshot when the following events occur:

  • You first deploy the production system

  • You add or remove deployed applications

  • The configuration is tuned for performance

  • Any other permanent change is made.

Note:

The domain directory is present on the Administration Server and on each Managed Server, but the Administration Server has the master copy, which you must back up. You do not need to back up any files on a Managed Server.

The WebLogic pack command creates a template archive file (.jar) based on an existing WebLogic domain. For example, the following command creates a template file called C:\oracle\user_templates\mydomain.jar.

pack -domain=C:\oracle\user_projects\domains\mydomain -template=C:\oracle\user_templates\mydomain.jar -template_name="My WebLogic Domain"

The name of the template is My WebLogic Domain.

See Creating Templates and Domains Using the Pack and Unpack Commands for information on using the pack and unpack commands.

Store the new archive in a source control system, preserving earlier versions should you need to restore the domain to an earlier point in time.
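To keep earlier versions distinguishable, a snapshot script might add a date to each template file name. The following Python sketch only builds the argument list for the pack command, using the -domain, -template, and -template_name options shown above. It does not run the command. The function name build_pack_command, the dated naming scheme, and the Linux-style paths in the usage lines are assumptions for illustration.

```python
import posixpath
from datetime import date
from typing import List


def build_pack_command(pack_script: str, domain_dir: str, template_dir: str,
                       domain_name: str, snapshot_date: date) -> List[str]:
    """Build a pack argument list with a dated template file name."""
    stamp = snapshot_date.isoformat()  # for example 2024-01-15
    template_file = posixpath.join(template_dir, f"{domain_name}-{stamp}.jar")
    return [
        pack_script,
        f"-domain={domain_dir}",
        f"-template={template_file}",
        f"-template_name={domain_name} {stamp}",
    ]


# Hypothetical locations; adjust them to your installation.
command = build_pack_command(
    "/opt/oracle/oracle_common/common/bin/pack.sh",
    "/opt/oracle/user_projects/domains/mydomain",
    "/opt/oracle/user_templates",
    "mydomain",
    date.today(),
)
print(" ".join(command))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with the space in the template name.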

Backing Up Logging Servlet Applications

If you use WebRTC Session Controller logging Servlets (see "Logging SIP Requests and Responses and EDRs") to perform regular logging or auditing of SIP messages, back up the complete application source files so that you can easily redeploy the applications should the staging server fail or the original deployment directory become corrupted.

Backing Up Security Data

The WebLogic Security service stores its configuration data in the config.xml file, and also in an LDAP repository and other files.

Backing Up the WebLogic LDAP Repository

The default Authentication, Authorization, Role Mapper, and Credential Mapper providers that are installed with WebRTC Session Controller store their data in an LDAP server. Each WebRTC Session Controller contains an embedded LDAP server. The Administration Server contains the master LDAP server, which is replicated on all Managed Servers. If any of your security realms use these installed providers, you should maintain an up-to-date backup of the following directory tree:

domain_home\servers\AdminServer\data\ldap

where domain_home is the domain's root directory and servers\AdminServer\data\ldap is the directory in which the Administration Server stores run-time and security data.

Each WebRTC Session Controller has an LDAP directory, but you only need to back up the LDAP data on the Administration Server. The master LDAP server replicates its LDAP data to each Managed Server when updates to security data are made. WebLogic security providers cannot modify security data while the domain's Administration Server is unavailable. The LDAP repositories on Managed Servers are replicas and cannot be modified.

The ldap\ldapfiles subdirectory contains the data files for the LDAP server. The files in this directory contain user, group, group membership, policies, and role information. Other subdirectories under the ldap directory contain LDAP server message logs and data about replicated LDAP servers.

Do not update the configuration of a security provider while a backup of LDAP data is in progress. If a change is made—for instance, if an administrator adds a user—while you are backing up the ldap directory tree, the backups in the ldapfiles subdirectory could become inconsistent. If this does occur, consistent, but potentially out-of-date, LDAP backups are available.

Once a day, a server suspends write operations and creates its own backup of the LDAP data. It archives this backup in a ZIP file below the ldap\backup directory and then resumes write operations. This backup is guaranteed to be consistent, but it might not contain the latest security data.

For information about configuring the LDAP backup, see the "Back Up LDAP Repository" section in Administering Server Startup and Shutdown for Oracle WebLogic Server.
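A manual backup of the Administration Server's ldap directory tree might be scripted along the following lines. This Python sketch is a minimal illustration: the function name backup_ldap_tree, the ldap-stamp.zip naming, and the backup location are assumptions. Run it only while no security data is being changed.

```python
import shutil
from pathlib import Path


def backup_ldap_tree(domain_home: Path, backup_dir: Path, stamp: str,
                     admin_server_name: str = "AdminServer") -> Path:
    """Archive domain_home/servers/<admin>/data/ldap into a ZIP file.

    backup_dir should be outside the ldap directory (ideally on another disk).
    """
    ldap_dir = domain_home / "servers" / admin_server_name / "data" / "ldap"
    if not ldap_dir.is_dir():
        raise FileNotFoundError(f"LDAP directory not found: {ldap_dir}")
    backup_dir.mkdir(parents=True, exist_ok=True)
    base_name = backup_dir / f"ldap-{stamp}"
    # make_archive appends ".zip" and returns the full archive path.
    archive = shutil.make_archive(str(base_name), "zip",
                                  root_dir=str(ldap_dir.parent),
                                  base_dir="ldap")
    return Path(archive)
```

For example, backup_ldap_tree(Path("/opt/oracle/user_projects/domains/mydomain"), Path("/backup/wsc"), "2024-01-15") would write /backup/wsc/ldap-2024-01-15.zip. Both paths are hypothetical.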

Backing Up Additional Operating System Configuration Files

Certain files maintained at the operating system level are also critical in helping you recover from system failures. Consider backing up the following information as necessary for your system:

  • Load Balancer configuration scripts. For example, any automated scripts used to configure load balancer pools and virtual IP addresses for the engine tier cluster and NAT configuration settings.

  • NTP client configuration scripts used to synchronize the system clocks of engine servers.

  • Host configuration files for each WebRTC Session Controller system (host names, virtual and real IP addresses for multi-homed machines, IP routing table information).

Restarting a Failed Administration Server

If an Administration Server fails, only configuration, deployment, and monitoring features are affected, but Managed Servers continue to operate and process client requests. Potential losses incurred due to an Administration Server failure include:

  • Loss of in-progress management and deployment operations.

  • Loss of ongoing logging functionality.

  • Loss of SNMP trap generation for WebLogic Server instances (as opposed to WebRTC Session Controller instances). On Managed Servers, WebRTC Session Controller traps are generated even without the Administration Server.

To resume normal management activities, restart the failed Administration Server instance as soon as possible.

When you restart a failed Administration Server, no special steps are required. Start the Administration Server as you normally would.

If the Administration Server shuts down while Managed Servers continue to run, you do not need to restart the Managed Servers that are already running to recover management of the domain. The procedure for recovering management of an active domain depends upon whether you can restart the Administration Server on the same system it was running on when the domain was started.

Restarting an Administration Server on the Same System

If you restart the WebLogic Administration Server while Managed Servers continue to run, by default the Administration Server can discover the presence of the running Managed Servers.

Note:

Ensure that the startup command or startup script does not include -Dweblogic.management.discover=false, which disables an Administration Server from discovering its running Managed Servers.
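One way to check a startup script for this option is a simple text scan. The following Python sketch reports each line that contains -Dweblogic.management.discover=false. It skips lines that begin with "#", which assumes shell-style comments. The function name find_discovery_disabled is hypothetical.

```python
from typing import List, Tuple

DISCOVER_DISABLED = "-Dweblogic.management.discover=false"


def find_discovery_disabled(script_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that disable discovery."""
    hits: List[Tuple[int, str]] = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # assumed shell-style comment; ignore it
        if DISCOVER_DISABLED in stripped:
            hits.append((number, stripped))
    return hits
```

An empty result means that this script does not set the option. It does not rule out the option being set elsewhere, for example in an environment variable or in another script that this one calls.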

The root directory for the domain contains a file, running-managed-servers.xml, which contains a list of the Managed Servers in the domain and describes their running state. When the Administration Server restarts, it checks this file to determine which Managed Servers were under its control before it stopped running.

When a Managed Server is gracefully or forcefully shut down, its status in running-managed-servers.xml is updated to "not-running." When an Administration Server restarts, it does not try to discover Managed Servers with the "not-running" status. A Managed Server that stops running because of a system malfunction, or that was stopped by killing the JVM or the command prompt (shell) in which it was running, still has the status "running" in running-managed-servers.xml. The Administration Server attempts to discover such a Managed Server, and throws an exception when it determines that the Managed Server is no longer running.

Restarting the Administration Server does not cause Managed Servers to update the configuration of static attributes. Static attributes are those that a server refers to only during its startup process. Server instances must be restarted to take account of changes to static configuration attributes. Discovery of the Managed Servers only enables the Administration Server to monitor the Managed Servers or make run-time changes to attributes configurable while a server is running (dynamic attributes).

Restarting an Administration Server on Another System

If a system malfunction prevents you from restarting the Administration Server on the same system, you can recover management of the running Managed Servers as follows:

  1. Install the WebRTC Session Controller software on the new system (if this has not already been done).

  2. Apply any patches that had been applied to the failed server.

  3. Use the unpack command to create a WebLogic domain from the template that you created when you backed up the domain. See "Storing the Domain Configuration Offline" for more information. See Creating Templates and Domains Using the Pack and Unpack Commands for more information on the pack and unpack commands.

    Your application files should be available in the same relative location on the new file system as on the file system of the original Administration Server.

  4. Make your configuration and security data available to the new administration system by copying them from backups or by using a shared disk. For more information, refer to "Storing the Domain Configuration Offline" and "Backing Up Security Data".

  5. Restart the Administration Server on the new system.

    Ensure that the startup command or startup script does not include -Dweblogic.management.discover=false, which disables an Administration Server from discovering its running Managed Servers.

When the Administration Server starts, it communicates with the Managed Servers and informs them that the Administration Server is now running on a different IP address.

Restarting Failed Managed Servers

If the system on which the failed Managed Server runs can contact the Administration Server for the domain, simply restart the Managed Server manually or automatically using Node Manager. You must configure Node Manager and the Managed Server to support automated restarts, as described in the discussion on "How Node Manager Restarts a Managed Server" in Administering Node Manager for Oracle WebLogic Server.

If the Managed Server cannot connect to the Administration Server during startup, it can retrieve its configuration by reading locally-cached configuration data. A Managed Server that starts in this way is running in Managed Server Independence (MSI) mode.

For a description of MSI mode, and the files that a Managed Server must access to start in MSI mode, see "Replicate domain config files for Managed Server independence" in Administration Console Online Help.

To start a Managed Server in MSI mode:

  1. Ensure that the following files are available in the Managed Server's root directory:

    • msi-config.xml

    • SerializedSystemIni.dat

    • boot.properties

    If these files are not in the Managed Server's root directory:

    1. Copy the config.xml and SerializedSystemIni.dat files from the Administration Server's root directory (or from a backup) to the Managed Server's root directory.

    2. Rename the configuration file to msi-config.xml. When you start the server, it will use the copied configuration files.

      Note:

      Alternatively, use the -Dweblogic.RootDirectory=path startup option to specify a root directory that already contains these files.

  2. Start the Managed Server from the command line or by using a script.

    The Managed Server will run in MSI mode until it is contacted by its Administration Server. For information about restarting the Administration Server in this scenario, see "Restarting a Failed Administration Server".
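The file check and the copy-and-rename steps in the procedure above can be sketched as follows. This Python example is an illustration only. The function names missing_msi_files and stage_msi_files are hypothetical, and the sketch assumes that config.xml and SerializedSystemIni.dat are in the directory you pass as admin_root, as the procedure states.

```python
import shutil
from pathlib import Path
from typing import List

# Files that must be present in the Managed Server's root directory.
MSI_FILES = ("msi-config.xml", "SerializedSystemIni.dat", "boot.properties")


def missing_msi_files(managed_root: Path) -> List[str]:
    """List the MSI files that are absent from the Managed Server root."""
    return [name for name in MSI_FILES
            if not (managed_root / name).is_file()]


def stage_msi_files(admin_root: Path, managed_root: Path) -> None:
    """Copy config.xml (renamed to msi-config.xml) and SerializedSystemIni.dat."""
    shutil.copy2(admin_root / "config.xml", managed_root / "msi-config.xml")
    shutil.copy2(admin_root / "SerializedSystemIni.dat",
                 managed_root / "SerializedSystemIni.dat")
```

The staging function covers only the two files named in the copy step. If missing_msi_files still reports boot.properties afterward, you must supply that file separately.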