5 Restarting Software Processes
Introduction
This chapter describes how the LSMS automatically attempts to restart software processes after certain types of failures. It also describes how to manually verify and restart LSMS software components.
Automatically Restarting Software Processes
The LSMS Automatic Software Recovery feature, available as a standard feature for LSMS Release 2.0 and later, detects failures in certain LSMS processes and attempts to restart the processes without the need for manual intervention by the customer. This feature is implemented by the sentryd utility.
Detecting Failure Conditions
Table 5-1 shows which processes are checked by sentryd and the error conditions for which they are checked.
Table 5-1 Processes Monitored by the Automatic Software Recovery Feature
The sentryd process uses either of the following methods to detect failures:
- Verifying that the process has updated its timestamp in the supplemental database periodically
- Using standard Linux commands to determine whether a process is running
For more information about specific methods used to detect failures, see the section shown in the last column of Table 5-1.
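The two detection methods can be illustrated with a minimal Python sketch. It is not the sentryd implementation: the function names, the two-minute limit, and the use of `pgrep -x` as the "standard Linux command" are all assumptions made for illustration, and the timestamp is passed in directly rather than read from the supplemental database.

```python
import subprocess
from datetime import datetime, timedelta


def heartbeat_is_stale(last_update: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the last recorded timestamp is older than max_age."""
    return now - last_update > max_age


def process_is_running(name: str) -> bool:
    """Return True when pgrep finds at least one process with exactly this name.

    pgrep exits with status 0 when a process matches and non-zero otherwise.
    """
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


# Hypothetical values: a process that last checked in 5 minutes ago,
# judged against an assumed 2-minute limit.
now = datetime(2024, 1, 1, 12, 0, 0)
last_seen = now - timedelta(minutes=5)
stale = heartbeat_is_stale(last_seen, now, timedelta(minutes=2))
```

A timestamp check can catch a process that is still running but stuck, which a process-table lookup alone cannot.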
Reporting Failures Through the Surveillance Feature
If the Surveillance feature is not enabled, sentryd still detects failures and attempts to restart processes, but important information concerning the state of the LSMS is neither displayed nor logged.
To obtain the full benefit of this feature, the Surveillance feature must be enabled. The Surveillance feature displays and logs (in /var/TKLC/lsms/logs/survlog.log) notifications for the following conditions:
- Software failures
- Successful recovery of the software
- Unsuccessful recovery of the software
Also, whether or not the Surveillance feature is enabled, surveillance agents will restart the sentryd process if it exits abnormally.
Automatically Restarting Processes Hierarchically
Figure 5-1 shows how sentryd restarts processes in a hierarchical order.
Figure 5-1 Order of Automatically Restarting Processes

This figure illustrates:
- Which processes sentryd monitors.
- When a failure is detected in a process, sentryd attempts to restart the failed process and all processes shown below it.
- The optional Service Assurance process is monitored for failure, but is not restarted by sentryd. Also, if sentryd restarts the OSI process, it stops the Service Assurance process. (The Surveillance feature restarts the Service Assurance process whenever it detects that the Service Assurance process has stopped.)
All recovery procedures start within 60 seconds of failure detection.
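The rule "restart the failed process and all processes shown below it" can be sketched as a walk over a parent-to-children map. The hierarchy below uses placeholder names only; the real relationships are the ones drawn in Figure 5-1, and the function name is hypothetical.

```python
from typing import Dict, List

# Placeholder hierarchy for illustration; Figure 5-1 defines the real one.
CHILDREN: Dict[str, List[str]] = {
    "parent": ["child_a", "child_b"],
    "child_a": ["grandchild"],
    "child_b": [],
    "grandchild": [],
}


def restart_order(failed: str, children: Dict[str, List[str]]) -> List[str]:
    """Return the failed process followed by everything below it, parents first."""
    order: List[str] = []
    pending = [failed]
    while pending:
        name = pending.pop(0)
        if name not in order:
            order.append(name)
            pending.extend(children.get(name, []))
    return order


top_failure = restart_order("parent", CHILDREN)
mid_failure = restart_order("child_a", CHILDREN)
```

A failure high in the hierarchy yields a longer restart list than a failure near the bottom, which matches the behavior the figure describes.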
Automatically Monitoring and Restarting EAGLE Agent Processes
The following sections describe the failure conditions for which sentryd monitors the EAGLE agent processes (eagleagent) and the steps performed in attempts to restart the process after failure has been detected.
Monitoring EAGLE Agent Processes
The sentryd process monitors each EAGLE agent process for the following conditions:
- Failure to initialize during automatic system startup
- Failure to initialize during manual startup using the eagle command
- An abnormal exit during normal operation
- Inability to perform its defined tasks, for example, because it is in an infinite loop
Restarting an EAGLE Agent Process
When one of the conditions described in “Monitoring EAGLE Agent Processes” has been detected, sentryd performs the following tasks:
- Generates the following Surveillance notification, where <CLLI> represents the Common Language Location Identifier (CLLI) of the EAGLE:
  LSMS6004|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - FAILD: eagleagent <CLLI>
- Attempts to stop and restart the eagleagent. If the eagleagent restarts, sentryd generates the following Surveillance notification:
  LSMS6005|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RECOV: eagleagent <CLLI>
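The notifications shown above are pipe-delimited, so a script that reads them can split each one into fields. The following Python sketch assumes four fields per notification; the field names (in particular `source` for the third field, shown as xxxxxxx in this chapter) and the CLLI value `STPA` are hypothetical.

```python
from typing import NamedTuple


class Notification(NamedTuple):
    event: str      # e.g. "LSMS6004"
    timestamp: str  # e.g. "08:40 Sep 11, 1998"
    source: str     # third field; shown as "xxxxxxx" in this chapter
    text: str       # e.g. "Notify:Sys Admin - FAILD: eagleagent <CLLI>"


def parse_notification(line: str) -> Notification:
    """Split a notification into its four pipe-delimited fields."""
    event, timestamp, source, text = line.strip().split("|", 3)
    return Notification(event, timestamp, source, text.strip())


# Sample line built from the format above with a made-up CLLI.
sample = "LSMS6004|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - FAILD: eagleagent STPA"
parsed = parse_notification(sample)
```

Limiting the split to three separators keeps any "|" that might appear in the message text inside the last field.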
Continuing Attempts to Restart an EAGLE Agent Process
If the attempt to restart the eagleagent fails, sentryd attempts again.
If this attempt is also unsuccessful, the sentryd process generates the following Surveillance notification and continues to attempt to restart the eagleagent process.
LSMS6006|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILD: eagleagent <CLLI>
If this notification appears several times in a row, contact the unresolvable-reference.html#GUID-646F2C79-C167-4B5A-A8DF-7ED0EAA9AD66.
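The retry behavior described above can be sketched as a loop. This is an illustration, not the sentryd logic: it assumes the failure notification is sent after the second failed attempt and after each failed attempt that follows, and it adds a `max_attempts` bound (not part of the description above) so the example terminates.

```python
from typing import Callable, List


def recover(restart: Callable[[], bool], notify: Callable[[str], None], max_attempts: int) -> bool:
    """Try restart() up to max_attempts times.

    Assumption: a failure notification is sent after the second failed
    attempt and after every failed attempt that follows it.
    """
    for attempt in range(1, max_attempts + 1):
        if restart():
            notify("RECOV")
            return True
        if attempt >= 2:
            notify("RFAILD")
    return False


# Simulated process that comes back on the fourth attempt.
outcomes = iter([False, False, False, True])
sent: List[str] = []
recovered = recover(lambda: next(outcomes), sent.append, max_attempts=10)
```

In this simulation the first failure is silent, the second and third each produce a failure notification, and the fourth attempt produces the recovery notification.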
Automatically Monitoring and Restarting NPAC Agent Processes
The following sections describe the failure conditions for which sentryd monitors the regional NPAC agent processes (npacagents) and the steps performed in attempts to restart an npacagent process after failure has been detected.
Monitoring NPAC Agent Processes
For each region, sentryd monitors its npacagent process for the following conditions:
- Failure to initialize during automatic system startup
- Failure to initialize during manual startup using the lsms command
- An unintentional exit or crash during normal operation
- Inability to perform its defined tasks, for example, because it is in an infinite loop
Restarting NPAC Agent Processes
When one of the conditions described in “Monitoring NPAC Agent Processes” has been detected, sentryd performs the following tasks:
- Generates the following Surveillance notification, where <NPAC_region> indicates the name of the region whose npacagent process has failed:
  LSMS6008|08:40 Sep 11, 1998|xxxxxxx| Notify:Sys Admin - FAILED: <NPAC_region> agent
- Attempts to stop and restart the failed npacagent. If the npacagent restarts, sentryd generates the following Surveillance notification:
  LSMS6009|08:40 Sep 11, 1998|xxxxxxx| Notify:Sys Admin - RECOV: <NPAC_region> agent
Continuing Attempts to Restart NPAC Agent Processes
If the attempt to restart the npacagent fails, sentryd attempts again. If this attempt is also unsuccessful, the sentryd process generates the following Surveillance notification and continues to attempt to restart the npacagent process.
LSMS6010|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILED: <region> agent
If this notification appears several times in a row, contact the unresolvable-reference.html#GUID-646F2C79-C167-4B5A-A8DF-7ED0EAA9AD66.
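One way to check whether a notification has appeared "several times in a row" is to measure the longest run of consecutive matching lines in the Surveillance log. The sketch below is a hypothetical helper, not an LSMS tool; the log lines are invented samples in the format shown above, with a made-up region name and timestamps.

```python
from typing import Iterable


def longest_run(lines: Iterable[str], event: str) -> int:
    """Return the longest run of consecutive lines that start with event + '|'."""
    longest = current = 0
    prefix = event + "|"
    for line in lines:
        if line.startswith(prefix):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Invented sample data; region name and times are placeholders.
log = [
    "LSMS6008|08:40 Sep 11, 1998|xxxxxxx| Notify:Sys Admin - FAILED: Midwest agent",
    "LSMS6010|08:41 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILED: Midwest agent",
    "LSMS6010|08:42 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILED: Midwest agent",
    "LSMS6010|08:43 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILED: Midwest agent",
]
run = longest_run(log, "LSMS6010")
```

The same helper could be pointed at the lines of /var/TKLC/lsms/logs/survlog.log, provided each log line begins with the event number as it does in the examples in this chapter.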
Automatically Monitoring and Restarting OSI Process
The following sections describe the failure conditions for which sentryd monitors the OSI process and the steps performed in attempts to restart the process after failure has been detected.
Monitoring the OSI Process
The sentryd process monitors the OSI process for the following condition:
- An unintentional exit or crash during normal operation
Restarting the OSI Process
When the condition described in “Monitoring the OSI Process” has been detected, sentryd performs the following tasks:
- Generates the following Surveillance notification:
  LSMS8037|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - FAILD: OSI
- Stops all running npacagent processes and the Service Assurance process, if it is running.
- Attempts to restart the OSI process and all lsmsagent processes that were previously running. If all processes restart, sentryd generates the following Surveillance notifications, where <NPAC_region> is the name of the region served by the npacagent process and <CLLI> is the name of the EAGLE agent:
  LSMS8038|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RECOV: OSI
  LSMS6005|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RECOV: eagleagent <CLLI>
  LSMS6009|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RECOV: <NPAC_region> agent
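The ordering of the tasks above can be sketched as a function that builds a list of actions. This is an illustration only: the function name and region names are hypothetical, and it assumes the processes restarted after OSI are the npacagent processes that were stopped in the earlier step.

```python
from typing import List


def osi_recovery_plan(running_regions: List[str], sacw_running: bool) -> List[str]:
    """List the recovery actions in the order the tasks above describe."""
    # Stop the regional agents first, then Service Assurance if it is up.
    plan = [f"stop npacagent {region}" for region in running_regions]
    if sacw_running:
        plan.append("stop sacw")
    plan.append("restart OSI")
    # Assumption: only agents that were running before are restarted.
    plan += [f"restart npacagent {region}" for region in running_regions]
    return plan


plan = osi_recovery_plan(["Midwest", "Northeast"], sacw_running=True)
```

The plan contains no action that restarts sacw, consistent with the statement in this chapter that the Surveillance feature, not sentryd, restarts the Service Assurance process.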
Continuing Attempts to Restart the OSI Process
If the attempt to restart the OSI process fails, sentryd attempts again. After two failed attempts, sentryd generates the following Surveillance notification.
LSMS8039|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILD: OSI
If this notification appears, contact the unresolvable-reference.html#GUID-646F2C79-C167-4B5A-A8DF-7ED0EAA9AD66.
Automatically Monitoring and Restarting the Service Assurance Process
The following sections describe the failure conditions for which sentryd monitors the optional Service Assurance process (sacw) and explain that the Surveillance feature, not sentryd, restarts sacw when it fails.
Monitoring the Service Assurance Process
The sentryd process monitors the optional Service Assurance process (sacw) so that it can be stopped if the OSI process needs to be restarted. It is monitored for the following conditions:
- An unintentional exit or crash during normal operation
- Inability to perform its defined tasks, for example, because it is in an infinite loop
Restarting the Service Assurance Process
The sentryd process does not attempt to restart the Service Assurance process when it fails. The Surveillance feature performs that function. For more information about the Service Assurance process, see “Understanding the Service Assurance Feature”.
Automatically Monitoring and Restarting the rmtpmgr Process
The following sections describe the failure conditions for which sentryd monitors the RMTP Manager process (rmtpmgr) and the steps performed in attempts to restart rmtpmgr after failure has been detected.
Monitoring the rmtpmgr Process
The sentryd process monitors rmtpmgr for the following conditions:
- Failure to initialize during automatic system startup
- An unintentional exit or crash during normal operation
- Inability to perform its defined tasks, for example, because it is in an infinite loop
Restarting the rmtpmgr Process
When one of the conditions described in “Monitoring the rmtpmgr Process” has been detected, sentryd performs the following tasks:
- Generates the following Surveillance notification:
  LSMS4021|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - rmtpmgr failed
- Attempts to stop and restart the process. If the process restarts, no notification is posted. After sentryd has restarted the rmtpmgr process, it attempts to restart the following processes that exited previously due to the rmtpmgr failure:
  - NPAC agents (see Restarting NPAC Agent Processes)
  - EAGLE agents (see Restarting an EAGLE Agent Process)
  - Local Data Manager (see Restarting Other Processes)
Continuing Attempts to Restart the rmtpmgr Process
If the attempt to restart the rmtpmgr process fails, sentryd attempts again. If the attempt fails again, sentryd generates the LSMS4021 notification again. If this notification appears several times in a row, contact the unresolvable-reference.html#GUID-646F2C79-C167-4B5A-A8DF-7ED0EAA9AD66.
Automatically Monitoring and Restarting the rmtpagent Process
The following sections describe the failure conditions for which sentryd monitors the RMTP Agent process (rmtpagent) and the steps performed in attempts to restart rmtpagent after failure has been detected.
Monitoring the rmtpagent Process
The sentryd process monitors rmtpagent for the following conditions:
- Failure to initialize during automatic system startup
- An unintentional exit or crash during normal operation
- Inability to perform its defined tasks, for example, because it is in an infinite loop
Restarting the rmtpagent Process
When one of the conditions described in “Monitoring the rmtpagent Process” has been detected, sentryd performs the following tasks:
- Generates the following Surveillance notification:
  LSMS4021|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - rmtpagent failed
- Attempts to stop and restart the process. If the process restarts, no notification is posted. After sentryd has restarted the rmtpagent process, it attempts to restart the following processes that exited previously due to the rmtpagent failure:
  - NPAC agents (see Restarting NPAC Agent Processes)
  - EAGLE agents (see Restarting an EAGLE Agent Process)
  - Local Data Manager (see Restarting Other Processes)
Continuing Attempts to Restart the rmtpagent Process
If the attempt to restart the rmtpagent process fails, sentryd attempts again. If the attempt fails again, sentryd generates the LSMS4021 notification again. If this notification appears several times in a row, contact the unresolvable-reference.html#GUID-646F2C79-C167-4B5A-A8DF-7ED0EAA9AD66.
Automatically Monitoring and Restarting Other Processes
The following sections describe the failure conditions for which sentryd monitors the following processes and the steps performed in attempts to restart a process after failure has been detected:
- Local Services Manager (lsman)
- LSMS SNMP Agent (lsmsSNMPagent)
- Local Data Manager (supman)
- Report Manager (reportman)
- Logger Server
- Apache Web Server
Monitoring Other Processes
The sentryd process monitors each process for the following conditions:
- Failure to initialize during automatic system startup
- An unintentional exit or crash during normal operation
- Inability to perform its defined tasks, for example, because it is in an infinite loop
Restarting Other Processes
When one of the conditions described in “Monitoring Other Processes” has been detected, sentryd performs the following tasks:
- Generates the following Surveillance notification, where <process_name> is the name of the process:
  LSMS4021|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - <process_name> failed
- Attempts to stop and restart the process. If the process restarts, no notification is posted.
Continuing Attempts to Restart Other Processes
If the attempt to restart the process fails, sentryd attempts again. If the attempt fails again, sentryd generates the LSMS4021 notification again. If this notification appears several times in a row, contact the unresolvable-reference.html#GUID-646F2C79-C167-4B5A-A8DF-7ED0EAA9AD66.
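The event numbers used throughout this chapter can be collected into a lookup table for quick triage. The sketch below is a convenience helper built only from the notifications shown in this chapter; the function name and the wording of the descriptions are assumptions. LSMS4021 covers rmtpmgr, rmtpagent, and the other monitored processes, so the process name must be read from the message text.

```python
from typing import Dict, Tuple

# (process, meaning) for each event number shown in this chapter.
EVENTS: Dict[str, Tuple[str, str]] = {
    "LSMS6004": ("eagleagent", "failed"),
    "LSMS6005": ("eagleagent", "recovered"),
    "LSMS6006": ("eagleagent", "restart failed"),
    "LSMS6008": ("npacagent", "failed"),
    "LSMS6009": ("npacagent", "recovered"),
    "LSMS6010": ("npacagent", "restart failed"),
    "LSMS8037": ("OSI", "failed"),
    "LSMS8038": ("OSI", "recovered"),
    "LSMS8039": ("OSI", "restart failed"),
    "LSMS4021": ("named in message", "failed"),
}


def describe(line: str) -> Tuple[str, str]:
    """Look up the event number at the start of a notification line."""
    event = line.split("|", 1)[0]
    return EVENTS.get(event, ("unknown", "unknown"))


osi_status = describe("LSMS8039|08:40 Sep 11, 1998|xxxxxxx|Notify:Sys Admin - RFAILD: OSI")
```

Event numbers not listed in this chapter fall through to ("unknown", "unknown") rather than raising an error.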