Sun Java System Messaging Server 6.3 Administration Guide

27.8.9 Monitoring Using msprobe and watcher Functions

Messaging Server provides two processes, watcher and msprobe, to monitor various system services. watcher watches for server crashes and restarts the crashed servers as necessary. msprobe monitors server hangs (unresponsiveness). Specifically, msprobe monitors the following:

watcher and msprobe are controlled by the configutil options shown in Table 27–5. Further information can be found in 4.5 Automatic Restart of Failed or Unresponsive Services.
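The failure retry time-out described in Table 27–5 (stop restarting a server that fails more than twice within the time-out period) can be pictured as a sliding-window rule. The following Python sketch is only an illustration of that rule, not the product's implementation: the class name RestartLimiter, its method, and the sliding-window bookkeeping are assumptions made for this example. The 600-second default comes from the table.

```python
from collections import deque


class RestartLimiter:
    """Hypothetical model of the failure retry time-out rule."""

    def __init__(self, window_seconds=600, max_failures=2):
        # 600 seconds is the documented default time-out; "more than
        # twice" is modeled as max_failures=2.
        self.window_seconds = window_seconds
        self.max_failures = max_failures
        self._failures = deque()

    def record_failure(self, now):
        """Record a failure at time `now` (in seconds).

        Returns True if a restart should still be attempted, False if
        the server has failed too often within the window.
        """
        self._failures.append(now)
        # Forget failures that are older than the window (assumption:
        # the window slides with the most recent failure).
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) <= self.max_failures


limiter = RestartLimiter()
for t in (0, 100, 200):
    print(t, limiter.record_failure(t))
```

In this model the third failure inside one 600-second window is the point at which restarts stop. Failures spaced further apart than the window never accumulate.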

Table 27–5 msprobe and watcher configutil Options




Enable automatic server restart. Automatically restarts failed or hung services. Default: no 


Failure retry time-out. If a server fails more than twice within this period, the system stops trying to restart the server. The value (in seconds) should be longer than the msprobe interval (local.schedule.msprobe). Default: 600 seconds


Timeout for a specific server before restart. service can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens.

Default: use service.readtimeout


Number of seconds of a specific server’s non-response before a warning message is logged to the default log file. service can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens.

Default: Use local.probe.warningthreshold 


Number of seconds of server non-response before a warning message is logged to the default log file.

Default: 5 seconds


MTA queue directory to check to determine whether the queue size exceeds the threshold defined by alarm.diskavail.msgalarmthreshold.

Default: none 


Period of server non-response before restarting that server. See local.schedule.msprobe. 

Default: 10 seconds 


msprobe run schedule. A crontab-style schedule string (see Table 20–10). Note that by default, this is automatically set. See 4.6.2 Pre-defined Automatic Tasks.

To disable: set local.schedule.msprobe.enable to NO.


Enable watcher, which monitors service failures (IMAP, POP, HTTP, job controller, dispatcher, message store (stored), imsched, and MMP). LMTP/SMTP servers are monitored by the dispatcher, and LMTP/SMTP clients are monitored by the job_controller. Logs error messages to the default log file for specific failures. Default: on

Alarm Messages

msprobe can issue alarms in the form of email messages to the postmaster (see To Monitor imapd, popd and httpd) warning of a specified condition. A sample email alarm sent when a certain threshold is exceeded is shown below:

Subject:    ALARM: server response time in seconds of “ldap_siroe.com_389” is 10
Date:    Tue, 17 Jul 2001 16:37:08 -0700 (PDT) 

Server instance: /opt/SUNWmsgsr
Alarmid: serverresponse 
Instance: ldap_siroe_europa.com_389 
Description: server response time in seconds 
Current measured value (17/Jul/2001:16:37:08 -0700): 10 
Lowest recorded value: 0 
Highest recorded value: 10 
Monitoring interval: 600 seconds 
Alarm condition is when over threshold of 10 
Number of times over threshold: 1


You can specify how often msprobe monitors disk and server performance, and under what circumstances it sends alarms, by using the configutil command to set the alarm parameters. Table 27–6 shows useful alarm parameters along with their default settings. See configutil Parameters in Sun Java System Messaging Server 6.3 Administration Reference.
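As a rough illustration of setting an alarm parameter, the following Python sketch builds a configutil command line without running it. It assumes the `configutil -o option -v value` invocation form. The helper name configutil_command and the value 20 are hypothetical and chosen only for this example. The option alarm.diskavail.msgalarmthreshold is the disk availability threshold named earlier in this section.

```python
import shlex
from typing import List


def configutil_command(option: str, value) -> List[str]:
    """Build (but do not execute) an argv list that sets one option.

    Assumes the `configutil -o <option> -v <value>` form; verify it
    against the Administration Reference before use.
    """
    if not option or any(ch.isspace() for ch in option):
        raise ValueError("option name must be non-empty and contain no whitespace")
    return ["configutil", "-o", option, "-v", str(value)]


# Hypothetical change: raise the disk availability alarm threshold to 20 percent.
cmd = configutil_command("alarm.diskavail.msgalarmthreshold", 20)
print(shlex.join(cmd))
```

Building the command as a list keeps the option name and value as separate arguments, so a value that contains spaces (for example, a description string) is quoted properly when the command is printed or passed to a process launcher.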

Table 27–6 Useful Alarm Message configutil Parameters


Description (Default in parentheses)


(localhost) Machine to which warning messages are sent.


(25) The SMTP port to connect to when sending an alarm message.


(Postmaster@localhost) Recipient of the alarm notice.


(Postmaster@localhost) Address of the sender of the alarm.


(percentage mail partition diskspace available.) Text for description field for disk availability alarm. 


(3600) Interval in seconds between disk availability checks. Set to 0 to disable checking of disk usage. 


(10) Percentage of disk space availability below which an alarm is sent. 


(-1) Specifies whether the alarm is issued when disk space availability goes below threshold (-1) or above it (1). 


(24) Interval in hours between subsequent repetitions of disk availability alarms.


(server response time in seconds.) Text for description field for server response alarm.


(600) Interval in seconds between server response checks. Set to 0 to disable checking of server response. 


(10) If server response time in seconds exceeds this value, an alarm is issued.


(1) Specifies whether the alarm is issued when server response time is greater than (1) or less than (-1) the threshold.


(24) Interval in hours between subsequent repetitions of server response alarms.
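The threshold, threshold direction, and warning interval settings in Table 27–6 combine into a simple decision rule. The following Python sketch shows one way to read that rule. It is a model for illustration only: the AlarmRule class and its method names are hypothetical, and the strict comparison is an assumption, because this section does not state whether a value exactly equal to the threshold raises an alarm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlarmRule:
    """Hypothetical model of one alarm's threshold settings."""
    threshold: float
    direction: int                      # 1: alarm above threshold, -1: alarm below
    warning_interval_hours: float = 24  # documented default repetition interval

    def is_triggered(self, value: float) -> bool:
        # Strict comparison is an assumption of this sketch.
        if self.direction == 1:
            return value > self.threshold
        if self.direction == -1:
            return value < self.threshold
        raise ValueError("direction must be 1 or -1")

    def should_send(self, value: float, now: float,
                    last_sent: Optional[float]) -> bool:
        """Decide whether to send an alarm at time `now` (in seconds)."""
        if not self.is_triggered(value):
            return False
        if last_sent is None:
            return True
        # Repeat the alarm only after the warning interval has elapsed.
        return now - last_sent >= self.warning_interval_hours * 3600


# Defaults from Table 27-6: disk availability alarms below 10 percent,
# server response alarms above 10 seconds.
disk = AlarmRule(threshold=10, direction=-1)
response = AlarmRule(threshold=10, direction=1)
print(disk.is_triggered(5), response.is_triggered(12))
```

With the defaults, 5 percent free disk space and a 12-second response time both trigger an alarm, and a repeat alarm for a condition that persists is held back until 24 hours after the previous one.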