4 Avoiding and Recovering From Server Failure

Oracle WebLogic Server instances may fail periodically, even in a clustered environment. Events such as loss of power, hardware malfunction, operating system crashes, network partitions, and unexpected application behavior can lead to the failure of a server instance. To meet high availability requirements, implement a clustered architecture that minimizes the impact of failure events and helps you recover the failed server.

See Failover and Replication in a Cluster in Administering Clusters for Oracle WebLogic Server.

Failure Prevention and Recovery Features

WebLogic Server provides several features that protect servers by avoiding failures that result from unanticipated levels of application traffic and resource utilization, and that help recover servers that do fail. These features include overload protection, failover for clustered services, automatic restart, server-level migration, service-level migration, and Managed Server Independence mode.

Overload Protection

WebLogic Server detects increases in system load that can affect application performance and stability, and allows administrators to configure failure prevention actions that occur automatically at predefined load thresholds.

Overload protection helps you avoid failures that result from unanticipated levels of application traffic or resource utilization.

WebLogic Server attempts to avoid failure when certain conditions occur:

  • Workload manager capacity is exceeded

  • HTTP session count increases to a predefined threshold value

  • Impending out of memory conditions

Failover for Clustered Services

You can increase the reliability and availability of your applications by hosting them on a WebLogic Server cluster. Clusterable services, such as EJBs and Web applications, can be deployed uniformly—on each Managed Server—in a cluster, so that if the server instance upon which a service is deployed fails, the service can fail over to another server in the cluster, without interruption in service or loss of state.

See Failover and Replication in a Cluster in Administering Clusters for Oracle WebLogic Server.

Automatic Restart for Failed Server Instances

WebLogic Server self-health monitoring improves the reliability and availability of server instances in a domain. Selected subsystems within each WebLogic Server instance monitor their health status based on criteria specific to the subsystem. For example, the JMS subsystem monitors the condition of the JMS thread pool while the core server subsystem monitors default and user-defined execute queue statistics. If an individual subsystem determines that it can no longer operate in a consistent and reliable manner, it registers its health state as "failed" with the host server.

Each WebLogic Server instance, in turn, checks the health state of its registered subsystems to determine its overall viability. If one or more of its critical subsystems have reached the FAILED state, the server instance marks its own health state FAILED to indicate that it cannot reliably host an application.
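
The aggregation rule described above can be illustrated with a minimal sketch. It is not WebLogic Server code: the HealthState enum, the server_health function, and the idea of passing in a set of critical subsystem names are assumptions made only for illustration.

```python
from enum import Enum


class HealthState(Enum):
    """Illustrative health states; only FAILED is named in the text above."""
    OK = "OK"
    FAILED = "FAILED"


def server_health(subsystem_states, critical):
    """Return FAILED if any critical subsystem reports FAILED, otherwise OK.

    subsystem_states: dict mapping a subsystem name to its HealthState.
    critical: set of subsystem names treated as critical (hypothetical input).
    """
    for name in critical:
        if subsystem_states.get(name) is HealthState.FAILED:
            return HealthState.FAILED
    return HealthState.OK
```

A failed subsystem that is not in the critical set does not change the result, which mirrors the statement that only critical subsystems determine the overall health state of the server.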

When used with Node Manager, server self-health monitoring enables you to automatically restart servers that have failed. This improves the overall reliability of a domain and requires no direct intervention from an administrator.

See Node Manager and System Crash Recovery in Administering Node Manager for Oracle WebLogic Server.

Server-Level Migration

WebLogic Server provides the capability to migrate clustered server instances. A clustered server that is configured to be migratable can be moved in its entirety from one machine to another, at the command of an administrator, or automatically, in the event of failure. The migration process makes all of the services running on the server instance available on a different machine, but not the state information for the singleton services that were running at the time of failure. See Whole Server Migration in Administering Clusters for Oracle WebLogic Server.

Service-Level Migration

In addition to the server-level migration capability described in the previous section, WebLogic Server supports migration of an individual singleton service. Singleton services are services that run in a cluster but must run on only a single instance at any given time, such as JMS and the JTA transaction recovery system.

An administrator can migrate a JMS server or the JTA transaction recovery service from one server instance to another in a cluster, either in response to a server failure or as part of regularly scheduled maintenance. This capability improves the availability of pinned services in a cluster, because those services can be quickly restarted on a redundant server should the host server fail.

See Service Migration in Administering Clusters for Oracle WebLogic Server.

Managed Server Independence Mode

Managed Servers maintain a local copy of the domain configuration. When a Managed Server starts, it contacts its Administration Server to retrieve any changes to the domain configuration that were made since the Managed Server was last shut down. If a Managed Server cannot connect to the Administration Server during startup, it can use its locally cached configuration information—this is the configuration that was current at the time of the Managed Server's most recent shutdown. A Managed Server that starts up without contacting its Administration Server to check for configuration updates is running in Managed Server Independence (MSI) mode. By default, MSI mode is enabled. See Disable Managed Server independence in Oracle WebLogic Server Administration Console Online Help.

Directory and File Backups for Failure Recovery

A backup is a copy of existing files and directories that you can restore in case of data loss. WebLogic Server performs a few backups automatically, and administrators should perform additional backup procedures. These procedures cover the domain configuration directory, the LDAP repository, and the security files and certificates associated with the servers.

This section describes file backups that WebLogic Server performs automatically, and recommended backup procedures that an administrator should perform.

Recovery from the failure of a server instance requires access to the domain configuration and security data. The WebLogic Security service stores its configuration data in the config.xml file, and also in an LDAP repository and other files.

See Domain Configuration Files in Understanding Domain Configuration for Oracle WebLogic Server.

Back Up Domain Configuration Directory

By default, an Administration Server stores the domain configuration data in the domain_name\config directory, where domain_name is the root directory of the domain.

Back up the config directory to a secure location in case a failure of the Administration Server renders the original copy unavailable. If an Administration Server fails, you can copy the backup version to a different machine and restart the Administration Server on the new machine.

Each time a Managed Server starts up, it contacts the Administration Server. If there are changes to the domain configuration, the Managed Server updates its local copy of the domain config directory.

During operation, if changes are made to the domain configuration, the Administration Server notifies the Managed Servers, which update their local config directory. As a result, each Managed Server always has a current copy of its configuration data cached locally.

Do not add non-configuration files to the config directory or its subdirectories. Non-configuration files include log (.log) and lock (.lck) files. The Administration Server replicates the config directory to all Managed Server instances, so storing non-configuration files in the config directory can cause performance issues in the domain.
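
As one way to automate the backup described above, the following minimal sketch archives the config directory into a timestamped file. The function name backup_config_dir, the archive naming scheme, and the choice of a .tar.gz archive are assumptions for illustration; any backup tool that preserves the directory tree serves the same purpose.

```python
import tarfile
import time
from pathlib import Path


def backup_config_dir(domain_home, backup_dir):
    """Archive <domain_home>/config into a timestamped .tar.gz and return its path."""
    config_dir = Path(domain_home) / "config"
    if not config_dir.is_dir():
        raise FileNotFoundError(f"no config directory under {domain_home}")
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: config-YYYYmmdd-HHMMSS.tar.gz
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"config-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # Store entries under "config/..." so the archive can be unpacked
        # directly into the root directory of a domain.
        tar.add(str(config_dir), arcname="config")
    return archive
```

Write the archive to a location outside the domain directory, ideally on a different machine, so that it survives the loss of the Administration Server host.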

Back Up LDAP Repository

The default Authentication, Authorization, Role Mapper, and Credential Mapper providers that are installed with WebLogic Server store their data in an LDAP server. Each WebLogic Server instance contains an embedded LDAP server. The Administration Server contains the primary LDAP server which is replicated on all Managed Servers. If any of your security realms use these installed providers, you should maintain an up-to-date backup of the following directory tree:

domain_name\servers\adminServer\data\ldap

where domain_name is the domain root directory and adminServer is the directory in which the Administration Server stores run time and security data.

Each WebLogic Server instance has an LDAP directory, but you only need to back up the LDAP data on the Administration Server, because the primary LDAP server replicates the LDAP data to each Managed Server when updates to security data are made. WebLogic security providers cannot modify security data while the domain Administration Server is unavailable. The LDAP repositories on Managed Servers are replicas and cannot be modified.

The ldap\ldapfiles subdirectory contains the data files for the LDAP server. The files in this directory contain user, group, group membership, policies, and role information. Other subdirectories under the ldap directory contain LDAP server message logs and data about replicated LDAP servers.

Do not update the configuration of a security provider while a backup of LDAP data is in progress. If a change is made—for instance, if an administrator adds a user—while you are backing up the ldap directory tree, the backups in the ldapfiles subdirectory could become inconsistent. If this does occur, consistent, but potentially out-of-date, LDAP backups are available, because once a day, a server suspends write operations and creates its own backup of the LDAP data. It archives this backup in a ZIP file below the ldap\backup directory and then resumes write operations. This backup is guaranteed to be consistent, but it might not contain the latest security data.

See Configure backups for embedded LDAP servers in Oracle WebLogic Server Administration Console Online Help.
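
If you need to fall back to one of the automatic daily backups, a short script can locate the most recent archive. The sketch below assumes only what is stated above, namely that the archives are ZIP files below the ldap\backup directory. It selects by modification time rather than by file name, because the naming pattern of the archives is not described here. The function name latest_ldap_backup is hypothetical.

```python
from pathlib import Path


def latest_ldap_backup(ldap_dir):
    """Return the most recently modified ZIP below <ldap_dir>/backup, or None."""
    backup_dir = Path(ldap_dir) / "backup"
    if not backup_dir.is_dir():
        return None
    # Search recursively, since the archives may sit in subdirectories.
    zips = [p for p in backup_dir.rglob("*.zip") if p.is_file()]
    if not zips:
        return None
    return max(zips, key=lambda p: p.stat().st_mtime)
```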

Back Up SerializedSystemIni.dat and Security Certificates

Each server instance creates a file named SerializedSystemIni.dat and places it in the /security directory. This file contains encrypted security data that must be present to boot the server. You must back up this file.

If you configured a server to use SSL, you must also back up the security certificates and keys. The location of these files is user-configurable.

WebLogic Server Exit Codes and Restarting After Failure

When a server instance stops, it issues an exit code. The value of the exit code provides information about the conditions under which the server process ended. Each range of exit codes has a specific meaning and a corresponding restart recommendation.

When a server instance under Node Manager control exits, Node Manager uses the exit code to determine whether or not to restart the server instance. The server exit code can also be used by other high-availability agents or scripts to determine what action, if any, to take after a server instance exits. Server exit codes are defined in the following table:

Table 4-1 WebLogic Server Exit Codes

Exit Code Value: Less than 0

Meaning: A negative value indicates that the server instance failed during a state transition, and did not terminate in a stable condition.

Example: If a Start in Standby command is issued for a server instance whose configuration is invalid, the server instance fails in the transitional STARTING state, and does not achieve the STANDBY state.

Restart Recommendation: Do not attempt to restart the server. Diagnose the problem that caused the server process to exit.

Exit Code Value: 0

Meaning: Indicates that the server process terminated normally, as a result of a shutdown command, either graceful or forced.

Restart Recommendation: None.

Exit Code Value: Greater than 0

Meaning: A positive value indicates that the server instance stopped itself after determining that one or more of its subsystems were unstable.

Example: A server instance detects an out of memory condition or stuck threads, and shuts itself down.

Restart Recommendation: The server instance can be restarted.
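
A custom high-availability script could encode the recommendations in Table 4-1 as a small decision function. The following is a minimal sketch under stated assumptions: the function name and the returned labels are hypothetical, and the exit code is assumed to arrive as a signed integer exactly as the server reported it. How a negative value is surfaced depends on the operating system and on the launching agent, so verify that behavior in your environment before relying on it.

```python
def restart_recommendation(exit_code):
    """Map a server exit code to the recommendation in Table 4-1.

    exit_code: signed integer as reported for the server process (assumed input).
    Returns one of three hypothetical labels.
    """
    if exit_code < 0:
        # Failed during a state transition: diagnose first, do not restart.
        return "do-not-restart"
    if exit_code == 0:
        # Normal termination after a shutdown command: nothing to do.
        return "none"
    # The server stopped itself because a subsystem was unstable.
    return "restart"
```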

Restarting a Failed Administration Server

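You can restart a failed Administration Server by using Node Manager or by starting it manually. When you restart it manually, the steps depend on how the listen address is defined and on whether you restart the server on the same machine, on a different machine with the same IP address, or on a different machine with a different IP address.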

The following sections describe how to start an Administration Server after a failure.

Note:

You can use Node Manager to automatically restart a failed Administration Server. See Restart Administration and Managed Servers Automatically in Administering Node Manager for Oracle WebLogic Server.

Restarting an Administration Server

See Starting and Stopping Servers.

Restarting Administration Server Scenarios

Table 4-2 Administration Server Restart Scenarios

Listen Address Definition: Not defined

Same Machine or Different Machine with Same IP Address:

  1. If you are starting the Administration Server on a different machine with the same IP address:

    a. Install WebLogic Server.

    b. Move data.

  2. Start the Administration Server.

    Running Managed Servers reconnect automatically at the next AdminReconnectIntervalSeconds interval.

  3. To start a Managed Server that was not running when the Administration Server failed, no change in command is required.

Different Machine with Different IP Address:

  1. Install WebLogic Server.

  2. Move data.

  3. Start the Administration Server.

    Managed Servers that were running when the Administration Server went down will not reconnect, because they know only the previous Administration Server URL. See Managed Servers and the Re-started Administration Server.

    Restart Managed Servers, supplying the new Administration Server listen address on the command line.

  4. To start a Managed Server that was not running when the Administration Server failed, supply the new Administration Server listen address on the command line.

Listen Address Definition: DNS name or IP address of the host

Same Machine or Different Machine with Same IP Address:

  1. If you are starting an Administration Server on a different machine with the same IP address:

    a. Install WebLogic Server.

    b. Move data.

  2. Start the Administration Server.

    Running Managed Servers reconnect automatically at the next AdminReconnectIntervalSeconds interval.

  3. To start a Managed Server that was not running when the Administration Server failed, no change in command is required.

Different Machine with Different IP Address:

  1. Install WebLogic Server.

  2. Move data.

  3. Update the config.xml with the IP address of the new host machine. See Restarting an Administration Server on Another Machine.

  4. Start the Administration Server.

    Managed Servers that were running when the Administration Server went down will not reconnect, because they know only the previous Administration Server URL. See Managed Servers and the Re-started Administration Server.

    Restart Managed Servers, supplying the new Administration Server listen address or DNS name on the command line.

  5. To start a Managed Server that was not running when the Administration Server failed, supply the new Administration Server listen address or DNS name on the command line.

Listen Address Definition: DNS name mapped to multiple hosts

Same Machine or Different Machine with Same IP Address:

  1. If you are starting an Administration Server on a different machine with the same IP address:

    a. Install WebLogic Server.

    b. Move data.

  2. Start the Administration Server.

    Running Managed Servers reconnect automatically at the next AdminReconnectIntervalSeconds interval.

  3. To start a Managed Server that was not running when the Administration Server failed, no change in command is required.

Different Machine with Different IP Address:

  1. Install WebLogic Server.

  2. Move data.

  3. Update the config.xml with the IP address of the new host machine. See Restarting an Administration Server on Another Machine.

  4. Start the Administration Server.

    Running Managed Servers that were started with a DNS name for the Administration Server URL that maps to multiple IP addresses will attempt to reconnect to the Administration Server on all the available URLs. Managed Servers can then locate the Administration Server that has been restarted at any of the URLs.

  5. To start a Managed Server that was not running when the Administration Server failed, supply the DNS name on the command line.

Restarting an Administration Server on Another Machine

If a machine crash prevents you from restarting the Administration Server on the same machine, you can recover management of the running Managed Servers as follows:

  1. Install the WebLogic Server software on the new administration machine (if this has not already been done).
  2. Make your application files available to the new Administration Server by restoring them from backups or by using a shared disk. Your application files should be available in the same relative location on the new file system as on the file system of the original Administration Server.
  3. Make your configuration and security data available to the new administration machine by restoring them from backups or by using a shared disk. See Directory and File Backups for Failure Recovery.
    • Update the config.xml with the IP address of the new host machine. If the listen address was set to blank, you do not need to change it. For example:

      <server>
         <name>AdminServer</name>
         ...
         <listen-address></listen-address>
      </server>
      
    • You can edit config.xml manually or use WLST offline to update the listen address.

  4. Restart the Administration Server on the new machine.
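
If you edit config.xml with a script rather than by hand or with WLST offline, the listen address update in step 3 might look like the following minimal sketch. The helper names local_name and set_listen_address are hypothetical. The sketch matches elements by local name so that it also tolerates a namespaced file, but ElementTree may rewrite namespace prefixes on output. Work on a copy of config.xml and review the result before starting the server.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def set_listen_address(config_xml, server_name, new_address):
    """Return config_xml with the named server's <listen-address> set to new_address."""
    root = ET.fromstring(config_xml)
    for server in root.iter():
        if local_name(server.tag) != "server":
            continue
        # Index the direct children of this <server> element by local name.
        children = {local_name(child.tag): child for child in server}
        name = children.get("name")
        if name is None or (name.text or "").strip() != server_name:
            continue
        address = children.get("listen-address")
        if address is None:
            raise ValueError(f"server {server_name!r} has no listen-address element")
        address.text = new_address
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"server {server_name!r} not found")
```

For example, calling set_listen_address(text, "AdminServer", "10.10.10.2") on the contents of config.xml returns the updated XML as a string, which you can then write back to the file.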

Managed Servers and the Re-started Administration Server

If an Administration Server stops running while the Managed Servers in the domain continue to run, each Managed Server periodically attempts to reconnect to the Administration Server, at the interval specified by the ServerMBean attribute AdminReconnectIntervalSeconds. By default, AdminReconnectIntervalSeconds is ten seconds.

In order for Managed Servers to reconnect after an Administration Server is restarted on a different IP address, you must have:

  • Configured a DNS name for the Administration Server URL that maps to multiple IP addresses. For example, a DNS name wlsadminserver that maps to 10.10.10.1 and 10.10.10.2.

  • Provided the DNS name for the Administration Server URL when starting the Managed Servers. For example:

    -Dweblogic.management.server=protocol://wlsadminserver:port

    or

    startManagedWebLogic.cmd managed_server_name protocol://wlsadminserver:port

    If the Administration Server goes down, Managed Servers will attempt to reconnect to the Administration Server on all the available URLs. When the Administration Server comes up on any of these URLs, Managed Servers connect to the Administration Server and stop attempting to reconnect on the other URLs. If the Administration Server goes down again, they attempt to reconnect again.
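
To check which addresses a DNS name such as wlsadminserver currently maps to, and which of them accepts connections, you could use a script like the minimal sketch below. It only imitates the reconnect behavior described above with the standard socket module and is not how WebLogic Server implements it. The function names, the timeout value, and the injectable resolve parameter are assumptions for illustration.

```python
import socket


def admin_server_addresses(host, port, resolve=socket.getaddrinfo):
    """Return the distinct IP addresses that host resolves to, in resolver order."""
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in resolve(
        host, port, type=socket.SOCK_STREAM
    ):
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def first_reachable(host, port, timeout=2.0, resolve=socket.getaddrinfo):
    """Return the first resolved address that accepts a TCP connection, or None."""
    for ip in admin_server_addresses(host, port, resolve):
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return ip
        except OSError:
            # This address is not answering; try the next one.
            continue
    return None
```

A TCP connection only shows that something is listening on the port. It does not confirm that the listener is the Administration Server of your domain.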

Restarting a Failed Managed Server

WebLogic Server provides several ways to restart a failed Managed Server, whether or not the Administration Server is accessible. If the Administration Server is reachable by the failed Managed Server, you can restart the Managed Server manually or automatically using Node Manager, or start it with a command or script. If the Managed Server cannot connect to the Administration Server during startup, it can retrieve its configuration by reading its locally cached configuration from the config directory and start in MSI mode.

The following sections describe how to start Managed Servers after failure. For recovery considerations related to transactions and JMS, see Additional Failure Topics.

Starting a Managed Server When the Administration Server Is Accessible

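If the Administration Server is reachable by a Managed Server that failed, you can:

  • Restart it manually or automatically using Node Manager. See Restart Administration and Managed Servers Automatically in Administering Node Manager for Oracle WebLogic Server.

  • Start it manually with a command or script. See Starting and Stopping Servers.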

Starting a Managed Server When the Administration Server Is Not Accessible

If a Managed Server cannot connect to the Administration Server during startup, it can retrieve its configuration by reading its locally cached configuration data from the config directory. A Managed Server that starts in this way is running in Managed Server Independence (MSI) mode.

Understanding Managed Server Independence Mode

When a Managed Server starts, it tries to contact the Administration Server to retrieve its configuration information. If a Managed Server cannot connect to the Administration Server during startup, it can retrieve its configuration by reading configuration and security files directly. A Managed Server that starts in this way is running in Managed Server Independence (MSI) mode. By default, MSI mode is enabled. For information about disabling MSI mode, see Disable Managed Server independence in Oracle WebLogic Server Administration Console Online Help.

In Managed Server Independence mode, a Managed Server:

  • Looks in its local config directory for config.xml—a replica of the domain config.xml.

  • Looks in its security directory for SerializedSystemIni.dat and for boot.properties, which contains an encrypted version of your user name and password. See Provide User Credentials to Start and Stop Servers.

If config.xml and SerializedSystemIni.dat are not in these locations in the server domain directory, you can copy them from the Administration Server domain directory.

MSI Mode and the Security Realm

A Managed Server must have access to a security realm to complete its startup process.

If you use the security realm that WebLogic Server installs, then the Administration Server maintains an LDAP server to store the domain security data. All Managed Servers replicate this LDAP server. If the Administration Server fails, Managed Servers running in MSI mode use the replicated LDAP server for security services.

If you use a third party security provider, then the Managed Server must be able to access the security data before it can complete its startup process.

MSI Mode and SSL

If you set up SSL for your servers, each server requires its own set of certificate files, key files, and other SSL-related files. Managed Servers do not retrieve SSL-related files from the Administration Server, although the domain configuration file does store the pathnames to those files for each server. Starting in MSI mode does not require you to copy or move the SSL-related files unless they are located on a machine that is inaccessible.

MSI Mode and Deployment

A Managed Server that starts in MSI mode deploys its applications from its staging directory: server_root\stage\appName.

MSI Mode and the Domain Log File

Each WebLogic Server instance writes log messages to its local log file and a domain-wide log file. The domain log file provides a central location from which to view messages from all servers in a domain.

Usually, a Managed Server forwards messages to the Administration Server, and the Administration Server writes the messages to the domain log file. However, when a Managed Server runs in MSI mode, it continues to write messages to its local server log file but does not forward messages to the domain log file.

See How a Server Instance Forwards Messages to the Domain Log in Configuring Log Files and Filtering Log Messages for Oracle WebLogic Server.

MSI Mode and Managed Server Configuration Changes

If you start a Managed Server in MSI mode, you cannot change its configuration until it restores communication with the Administration Server.

Starting a Managed Server in MSI Mode

Note:

If the Managed Server instance that failed was a clustered Managed Server that was the active server for a migratable service at the time of failure, perform the steps described in Migrating When the Currently Active Host is Unavailable in Administering Clusters for Oracle WebLogic Server. Do not start the Managed Server instance in MSI mode.

To start up a Managed Server in MSI mode:

  1. Ensure that the Managed Server's root directory contains the config subdirectory.

    If the config directory does not exist, copy it from the Administration Server's root directory or from a backup to the Managed Server's root directory.

    Note:

    Alternatively, you can use the -Dweblogic.RootDirectory=path startup option to specify a root directory that already contains these files.

  2. If it does not already exist, copy the SerializedSystemIni.dat file from the Administration Server domain directory or an existing server in the domain. The file location should be ${DOMAIN_HOME}/security/SerializedSystemIni.dat.
  3. Start the Managed Server at the command line or using a script. See Starting and Stopping Servers.

    The Managed Server will run in MSI mode until it is contacted by its Administration Server. See Restarting a Failed Administration Server.
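
Before starting the server, you can verify that the files named in steps 1 and 2 are in place. The following minimal sketch checks for config.xml in the config directory and for SerializedSystemIni.dat in the security directory under a given root directory. The function name missing_msi_files is hypothetical, and the sketch checks only that the files exist, not that their contents are current.

```python
from pathlib import Path


def missing_msi_files(domain_home):
    """Return the relative paths of required files that are absent under domain_home."""
    root = Path(domain_home)
    required = [
        Path("config") / "config.xml",
        Path("security") / "SerializedSystemIni.dat",
    ]
    return [rel for rel in required if not (root / rel).is_file()]
```

An empty list means both files were found. Otherwise, copy the listed files from the Administration Server domain directory or from a backup, as described in the steps above.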

Additional Failure Topics

The following topics provide more information about specific failure and recovery scenarios.

For information related to recovering JMS data from a failed server instance, see Configuring WebLogic JMS Clustering in Administering JMS Resources for Oracle WebLogic Server.

For information about transaction recovery after failure, see Transaction Recovery After a Server Fails in Developing JTA Applications for Oracle WebLogic Server.

For information about recovering from a corrupt or unusable embedded LDAP server file, which prevents the Administration Server from starting, see Backup and Recovery in Administering Security for Oracle WebLogic Server.