5 Restarting Software Processes

Introduction

This chapter describes how the LSMS automatically attempts to restart software processes after certain types of failures. It also describes how to manually verify and restart LSMS software components.

Automatically Restarting Software Processes

The LSMS Automatic Software Recovery feature, available as a standard feature for LSMS Release 2.0 and later, detects failures in certain LSMS processes and attempts to restart the processes without the need for manual intervention by the customer. This feature is implemented by the sentryd utility.

Detecting Failure Conditions

Table 5-1 shows which processes are checked by sentryd and the error conditions for which they are checked.

Table 5-1 Processes Monitored by the Automatic Software Recovery Feature

| Process | Unintentional Exit | Inability to Perform Defined Tasks | Failed to Initialize During Startup | See section |
|---|---|---|---|---|
| EAGLE agents | X | X | X | Automatically Monitoring and Restarting EAGLE Agent Processes |
| Regional NPAC agents | X | X | X | Automatically Monitoring and Restarting NPAC Agent Processes |
| OSI | X |  |  | Automatically Monitoring and Restarting OSI Process |
| Service Assurance | X |  |  | Automatically Monitoring and Restarting the Service Assurance Process |
| Local Services Manager | X | X | X | Automatically Monitoring and Restarting Other Processes |
| Local Data Manager | X | X | X | Automatically Monitoring and Restarting Other Processes |
| Logger Server | X |  | X | Automatically Monitoring and Restarting Other Processes |
| LSMS SNMP Agent | X |  | X | Automatically Monitoring and Restarting Other Processes |
| Apache web server | X |  | X | Automatically Monitoring and Restarting Other Processes |
| RMTP Manager | X |  | X | Automatically Monitoring and Restarting the rmtpmgr Process |
| RMTP Agent | X |  | X | Automatically Monitoring and Restarting the rmtpagent Process |
| Report Manager | X |  | X | Automatically Monitoring and Restarting Other Processes |

The sentryd process uses either of the following methods to detect failures:

  • Verifying that the process has updated its timestamp in the supplemental database periodically

  • Using standard Linux commands to determine whether a process is running

For more information about specific methods used to detect failures, see the section shown in the last column of Table 5-1.
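The two detection methods can be sketched roughly as follows. This is an illustration only, not the sentryd implementation: the function names, the 60-second threshold in the usage example, and the use of `pgrep` as the "standard Linux command" are all assumptions, and the timestamp is passed in directly rather than read from the supplemental database.

```python
import subprocess
from datetime import datetime, timedelta


def heartbeat_is_stale(last_update: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True if the process has not refreshed its timestamp within max_age."""
    return now - last_update > max_age


def process_is_running(name: str) -> bool:
    """Return True if at least one process with exactly this name exists.

    Assumption: uses `pgrep -x`, which exits with status 0 when a match is found.
    """
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


if __name__ == "__main__":
    now = datetime(1998, 9, 11, 8, 40, 0)
    last_update = now - timedelta(seconds=90)
    # The 60-second threshold is a hypothetical value chosen for illustration.
    print(heartbeat_is_stale(last_update, now, timedelta(seconds=60)))
```

A timestamp check can catch a process that is still running but stuck (for example, in an infinite loop), which a simple "is it running" check cannot.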

Reporting Failures Through the Surveillance Feature

If the Surveillance feature is not enabled, sentryd still detects failures and attempts to restart processes, but important information concerning the state of the LSMS is neither displayed nor logged.

To obtain the full benefit of this feature, the Surveillance feature must be enabled. The Surveillance feature displays and logs (in /var/TKLC/lsms/logs/survlog.log) notifications for the following conditions:

  • Software failures

  • Successful recovery of the software

  • Unsuccessful recovery of the software

Also, whether or not the Surveillance feature is enabled, surveillance agents will restart the sentryd process if it exits abnormally.

Automatically Restarting Processes Hierarchically

Figure 5-1 shows how sentryd restarts processes in a hierarchical order.

Figure 5-1 Order of Automatically Restarting Processes



This figure illustrates the following:

  • The processes that sentryd monitors.
  • The restart order. When sentryd detects a failure in a process, it attempts to restart the failed process and all processes shown below it.
  • The handling of the optional Service Assurance process, which sentryd monitors for failure but does not restart. If sentryd restarts the OSI process, it stops the Service Assurance process. (The Surveillance feature restarts the Service Assurance process whenever it detects that the Service Assurance process has stopped.)

All recovery procedures start within 60 seconds of failure detection.
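The rule "restart the failed process and all processes shown below it" amounts to a top-down walk of the hierarchy. The following is a minimal sketch under stated assumptions: the hierarchy and its process names are hypothetical placeholders (the real order is the one in Figure 5-1), and `restart_order` is an illustrative helper, not part of the LSMS software.

```python
from typing import Dict, List

# Hypothetical hierarchy for illustration only. Each key lists the
# processes directly below it; the real layout is defined by Figure 5-1.
HIERARCHY: Dict[str, List[str]] = {
    "parent": ["child_a", "child_b"],
    "child_a": ["grandchild"],
    "child_b": [],
    "grandchild": [],
}


def restart_order(failed: str, hierarchy: Dict[str, List[str]]) -> List[str]:
    """Return the failed process followed by every process below it, top-down."""
    order: List[str] = []
    queue = [failed]
    while queue:
        current = queue.pop(0)
        if current not in order:
            order.append(current)
            queue.extend(hierarchy.get(current, []))
    return order


if __name__ == "__main__":
    print(restart_order("child_a", HIERARCHY))
```

In this sketch a failure in `child_a` leads to restarting `child_a` and `grandchild` only; processes above or beside the failed one are left alone.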

Automatically Monitoring and Restarting EAGLE Agent Processes

The following sections describe the failure conditions for which sentryd monitors the EAGLE agent processes (eagleagent) and the steps performed in attempts to restart the process after failure has been detected.

Monitoring EAGLE Agent Processes

The sentryd process monitors each EAGLE agent process for the following conditions:

  • Failure to initialize during automatic system startup

  • Failure to initialize during manual startup using the eagle command

  • An abnormal exit during normal operation

  • Inability to perform its defined tasks, for example, because it is in an infinite loop

Restarting an EAGLE Agent Process

When one of the conditions described in “Monitoring EAGLE Agent Processes” has been detected, sentryd performs the following tasks:

  1. Generates the following Surveillance notification, where <CLLI> represents the Common Language Location Identifier (CLLI) of the EAGLE:

    
    LSMS6004|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - FAILD: eagleagent <CLLI>
    
  2. Attempts to stop and restart the eagleagent. If the eagleagent restarts, sentryd generates the following Surveillance notification:

    
    LSMS6005|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RECOV: eagleagent <CLLI>
    

Continuing Attempts to Restart an EAGLE Agent Process

If the attempt to restart the eagleagent fails, sentryd attempts again.

If this attempt is also unsuccessful, the sentryd process generates the following Surveillance notification and continues to attempt to restart the eagleagent process.


LSMS6006|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILD: eagleagent <CLLI>

If this notification appears several times in a row, contact your technical support provider.
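The sequence of notifications described in this section (FAILD, then RECOV on success, or RFAILD once a second attempt has failed) can be sketched as a small retry loop. This is an illustration under assumptions, not the sentryd logic itself: `recover`, `restart`, and `notify` are hypothetical names, `max_attempts` exists only to keep the sketch bounded, and emitting RFAILD after every failure from the second onward is an assumption drawn from the note that the notification can appear several times in a row.

```python
from typing import Callable


def recover(restart: Callable[[], bool], notify: Callable[[str], None],
            name: str, max_attempts: int) -> bool:
    """Try to restart a failed process, emitting notifications along the way.

    `max_attempts` bounds the loop for this sketch only; the text says
    sentryd continues to attempt restarts.
    """
    notify(f"FAILD: {name}")
    failures = 0
    for _ in range(max_attempts):
        if restart():
            notify(f"RECOV: {name}")
            return True
        failures += 1
        # Assumption: from the second failed attempt onward, each failure
        # produces another RFAILD notification.
        if failures >= 2:
            notify(f"RFAILD: {name}")
    return False


if __name__ == "__main__":
    outcomes = iter([False, False, True])  # two failed attempts, then success
    recover(lambda: next(outcomes), print, "eagleagent <CLLI>", max_attempts=5)
```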

Automatically Monitoring and Restarting NPAC Agent Processes

The following sections describe the failure conditions for which sentryd monitors the regional NPAC agent processes (npacagents) and the steps performed in attempts to restart an npacagent process after failure has been detected.

Monitoring NPAC Agent Processes

For each region, sentryd monitors its npacagent process for the following conditions:

  • Failure to initialize during automatic system startup

  • Failure to initialize during manual startup using the lsms command

  • An unintentional exit or crash during normal operation

  • Inability to perform its defined tasks, for example, because it is in an infinite loop

Restarting NPAC Agent Processes

When one of the conditions described in “Monitoring NPAC Agent Processes” has been detected, sentryd performs the following tasks:

  1. Generates the following surveillance notification:

    
    LSMS6008|08:40 Sep 11, 1998|xxxxxxx| Notify:Sys Admin - FAILED:
    <NPAC_region> agent
    

    where <NPAC_region> indicates the name of the region whose npacagent process has failed.

  2. Attempts to stop and restart the failed npacagent. If the npacagent restarts, sentryd generates the following Surveillance notification:

    
    LSMS6009|08:40 Sep 11, 1998|xxxxxxx| Notify:Sys Admin - RECOV:
    <NPAC_region> agent
    

Continuing Attempts to Restart NPAC Agent Processes

If the attempt to restart the npacagent fails, sentryd attempts again. If this attempt is also unsuccessful, the sentryd process generates the following Surveillance notification and continues to attempt to restart the npacagent process.


LSMS6010|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILED:
<NPAC_region> agent

If this notification appears several times in a row, contact your technical support provider.
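To check whether a notification has appeared several times in a row, the `|`-separated fields of each notification can be split and the most recent entries compared. The sketch below assumes each notification occupies a single line and has exactly the four fields shown in the examples in this chapter. The field names, the helper names, and the sample region name and timestamps are hypothetical.

```python
from typing import List, NamedTuple


class Notification(NamedTuple):
    event: str      # event number, e.g. "LSMS6010"
    timestamp: str  # e.g. "08:40 Sep 11, 1998"
    origin: str     # third field, shown as "xxxxxxx" in the examples
    text: str       # e.g. "Notify:Sys Admin - RFAILED: <NPAC_region> agent"


def parse_notification(line: str) -> Notification:
    """Split one notification line into its four '|'-separated fields."""
    event, timestamp, origin, text = (part.strip() for part in line.split("|", 3))
    return Notification(event, timestamp, origin, text)


def trailing_repeats(lines: List[str], event: str) -> int:
    """Count how many of the most recent notifications carry the given event number."""
    count = 0
    for line in reversed(lines):
        if parse_notification(line).event != event:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Sample lines made up for illustration, modeled on the examples above.
    sample = [
        "LSMS6008|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - FAILED: Midwest agent",
        "LSMS6010|08:41 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILED: Midwest agent",
        "LSMS6010|08:42 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILED: Midwest agent",
    ]
    print(trailing_repeats(sample, "LSMS6010"))
```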

Automatically Monitoring and Restarting OSI Process

The following sections describe the failure conditions for which sentryd monitors the OSI process and the steps performed in attempts to restart the processes after failure has been detected.

Monitoring the OSI Process

The sentryd process monitors the OSI process for the following condition:

  • An unintentional exit or crash during normal operation

Restarting the OSI Process

When the condition described in “Monitoring the OSI Process” has been detected, sentryd performs the following tasks:

  1. Generates the following surveillance notification:

    
    LSMS8037|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - FAILD: OSI
    
  2. Stops all running npacagent processes and the Service Assurance process, if it is running.

  3. Attempts to restart the OSI process and all lsmsagent processes that were previously running. If all processes restart, sentryd generates the following Surveillance notifications, where <NPAC_region> is the name of the region served by the npacagent process and <CLLI> is the name of the EAGLE agent:

    
    LSMS8038|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RECOV: OSI
    LSMS6005|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RECOV:
    eagleagent <CLLI>
    LSMS6009|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RECOV:
    <NPAC_region> agent
    

Continuing Attempts to Restart the OSI Process

If the attempt to restart the OSI process fails, sentryd attempts again. After two failed attempts, sentryd generates the following Surveillance notification.


LSMS8039|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILD: OSI

If this notification appears, contact your technical support provider.
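The ordering of the OSI recovery steps (stop the dependent processes, restart OSI, then restart the agents that were running before) can be sketched as a function that only builds the list of actions. This is an illustration under assumptions: `osi_recovery_plan`, the action tuples, and the region name are hypothetical, and the sketch covers only the npacagent and Service Assurance (sacw) processes named in step 2.

```python
from typing import List, Tuple

Action = Tuple[str, str]  # (verb, process name)


def osi_recovery_plan(running_npac_regions: List[str], sacw_running: bool) -> List[Action]:
    """Return the ordered stop/start actions for recovering the OSI process."""
    plan: List[Action] = []
    # Step 2: stop the dependent processes before restarting OSI.
    for region in running_npac_regions:
        plan.append(("stop", f"npacagent {region}"))
    if sacw_running:
        plan.append(("stop", "sacw"))
    # Step 3: restart OSI, then the agents that were running before.
    plan.append(("start", "osi"))
    for region in running_npac_regions:
        plan.append(("start", f"npacagent {region}"))
    # sacw is not restarted here; the Surveillance feature does that.
    return plan


if __name__ == "__main__":
    for action in osi_recovery_plan(["Midwest"], sacw_running=True):
        print(action)
```

Building the plan as data rather than executing it directly keeps the ordering easy to inspect and test.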

Automatically Monitoring and Restarting the Service Assurance Process

The following sections describe the failure conditions for which sentryd monitors the optional Service Assurance process (sacw) and explain that the Surveillance feature, not sentryd, restarts sacw when it fails.

Monitoring the Service Assurance Process

The sentryd process monitors the optional Service Assurance process (sacw) so that it can be stopped if the OSI process needs to be restarted. It is monitored for the following conditions:

  • An unintentional exit or crash during normal operation

  • Inability to perform its defined tasks, for example, because it is in an infinite loop

Restarting the Service Assurance Process

The sentryd process does not attempt to restart the Service Assurance process when it fails. The Surveillance feature performs that function. For more information about the Service Assurance process, see “Understanding the Service Assurance Feature”.

Automatically Monitoring and Restarting the rmtpmgr Process

The following sections describe the failure conditions for which sentryd monitors the RMTP Manager process (rmtpmgr) and the steps performed in attempts to restart rmtpmgr after failure has been detected.

Monitoring the rmtpmgr Process

The sentryd process monitors rmtpmgr for the following conditions:

  • Failure to initialize during automatic system startup

  • An unintentional exit or crash during normal operation

  • Inability to perform its defined tasks, for example, because it is in an infinite loop

Restarting the rmtpmgr Process

When one of the conditions described in “Monitoring the rmtpmgr Process” has been detected, sentryd performs the following tasks:

  1. Generates the following surveillance notification:

    
    LSMS4021|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - rmtpmgr
    failed
    
  2. Attempts to stop and restart the process. If the process restarts, no notification is posted. After sentryd has restarted the rmtpmgr process, it attempts to restart the processes that exited previously due to the rmtpmgr failure.

Continuing Attempts to Restart the rmtpmgr Process

If the attempt to restart the rmtpmgr process fails, sentryd attempts again. If the attempt fails again, sentryd generates the LSMS4021 notification again. If this notification appears several times in a row, contact your technical support provider.

Automatically Monitoring and Restarting the rmtpagent Process

The following sections describe the failure conditions for which sentryd monitors the RMTP Agent process (rmtpagent) and the steps performed in attempts to restart rmtpagent after failure has been detected.

Monitoring the rmtpagent Process

The sentryd process monitors rmtpagent for the following conditions:

  • Failure to initialize during automatic system startup

  • An unintentional exit or crash during normal operation

  • Inability to perform its defined tasks, for example, because it is in an infinite loop

Restarting the rmtpagent Process

When one of the conditions described in “Monitoring the rmtpagent Process” has been detected, sentryd performs the following tasks:

  1. Generates the following surveillance notification:

    
    LSMS4021|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - rmtpagent failed
    
  2. Attempts to stop and restart the process. If the process restarts, no notification is posted. After sentryd has restarted the rmtpagent process, it attempts to restart the processes that exited previously due to the rmtpagent failure.

Continuing Attempts to Restart the rmtpagent Process

If the attempt to restart the rmtpagent process fails, sentryd attempts again. If the attempt fails again, sentryd generates the LSMS4021 notification again. If this notification appears several times in a row, contact your technical support provider.

Automatically Monitoring and Restarting Other Processes

The following sections describe the failure conditions for which sentryd monitors the following processes and the steps performed in attempts to restart a process after failure has been detected:

  • Local Services Manager (lsman)

  • LSMS SNMP Agent (lsmsSNMPagent)

  • Local Data Manager (supman)

  • Report Manager (reportman)

  • Logger Server

  • Apache Web Server

Monitoring Other Processes

The sentryd process monitors each process for the following conditions:

  • Failure to initialize during automatic system startup

  • An unintentional exit or crash during normal operation

  • Inability to perform its defined tasks, for example, because it is in an infinite loop

Restarting Other Processes

When one of the conditions described in “Monitoring Other Processes” has been detected, sentryd performs the following tasks:

  1. Generates the following surveillance notification, where <process_name> is the name of the process:

    
    LSMS4021|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - <process_name>
    failed
    
  2. Attempts to stop and restart the process. If the process restarts, no notification is posted.

Continuing Attempts to Restart Other Processes

If the attempt to restart the process fails, sentryd attempts again. If the attempt fails again, sentryd generates the LSMS4021 notification again. If this notification appears several times in a row, contact your technical support provider.