Messaging Server provides two processes, watcher and msprobe, to monitor various system services. watcher watches for server crashes and restarts crashed servers as necessary. msprobe monitors server hangs (unresponsiveness). Specifically, msprobe monitors the following:
Server Response Time. msprobe connects to the enabled servers using their protocol commands and measures their response times. If the response time exceeds the alarm warning threshold, an alarm message is sent (see Alarm Messages). If msprobe is unable to connect to a server, or the server response time exceeds a specified timeout period, the server is restarted. Server response times are recorded in a counter database and logged to the default log file. The counterutil utility (see counterutil) can be used to display the server response time statistics.
The following servers are monitored by msprobe: imap, pop, http, cert, job_controller, smtp, lmtp, mmp and ens. When smtp or lmtp is not responding, the dispatcher is restarted. ens cannot be automatically restarted.
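The warning and restart decisions described above can be illustrated with a minimal sketch. This is not msprobe's implementation: the function names are hypothetical, the probe only times the first bytes of a greeting banner (which suits banner protocols such as IMAP, POP, and SMTP, whereas msprobe uses each server's protocol commands), and the defaults of 5 and 10 seconds are borrowed from the warning threshold and timeout defaults listed in Table 23–5.

```python
import socket
import time


def classify_response(elapsed, warning_threshold=5.0, timeout=10.0):
    """Map a measured response time in seconds to an action.

    elapsed is None when the probe could not connect or got no data.
    The action names are illustrative, not msprobe output.
    """
    if elapsed is None or elapsed > timeout:
        return "restart"
    if elapsed > warning_threshold:
        return "warn"
    return "ok"


def measure_greeting(host, port, timeout=10.0):
    """Return seconds until the server sends its first bytes, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            if not conn.recv(1024):
                return None  # connection closed without a greeting
    except OSError:
        # Covers refused connections and timeouts alike.
        return None
    return time.monotonic() - start
```

For example, `classify_response(measure_greeting("localhost", 143))` would classify a local IMAP server, assuming one is listening on port 143.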
Disk usage. msprobe checks the disk availability and usage for every message store partition. Specifically, it checks the message store mboxlist database directory and the MTA queue directory. If disk usage exceeds a configured threshold, an alarm message is sent. The disk sizes and usages are recorded in a counter database and logged to the default log file. Administrators can use the counterutil utility (see counterutil) to display the disk usage statistics.
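As a rough illustration of this kind of check, the sketch below computes the percentage of free space for a directory and compares it with a threshold. The function names are hypothetical, and the default of 10 percent is taken from the disk availability threshold default in Table 23–6; how msprobe itself measures disk space is not described here.

```python
import shutil


def percent_available(path):
    """Return the percentage of free space on the file system holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.free / usage.total


def disk_alarm_needed(available_pct, threshold=10.0):
    """True when availability has dropped below the threshold percentage."""
    return available_pct < threshold
```

A caller would pass the partition or queue directory path, for example `disk_alarm_needed(percent_available("/var"))`, where the path is only a placeholder.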
Message Store mboxlist Database Log File Accumulation. Log file accumulation is an indication of an mboxlist database error. msprobe counts the active log files, and if the count exceeds the threshold, msprobe logs a critical error message to the default log file informing the administrator to restart the server. If automatic restart is enabled (local.autorestart set to yes), the store daemon is restarted.
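The accumulation check can be sketched as follows. Everything specific here is an assumption made for illustration: the `log.*` file name pattern, the function names, and the action labels are hypothetical, and the threshold is a caller-supplied parameter because its value is not given in this section.

```python
from pathlib import Path


def count_log_files(db_dir, pattern="log.*"):
    """Count regular files in db_dir matching pattern (pattern is an assumption)."""
    return sum(1 for p in Path(db_dir).glob(pattern) if p.is_file())


def accumulation_actions(log_count, threshold, autorestart=False):
    """Return the actions to take for a given number of active log files."""
    if log_count <= threshold:
        return []
    actions = ["log_critical_error"]
    if autorestart:
        # Mirrors the behavior described when local.autorestart is yes.
        actions.append("restart_store_daemon")
    return actions
```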
watcher and msprobe are controlled by the configutil options shown in Table 23–5. Further information can be found in Automatic Restart of Failed or Unresponsive Services.
Table 23–5 msprobe and watcher configutil Options
Options |
Description |
---|---|
Enable automatic server restart. Automatically restarts failed or hung services. Default: no |
|
Failure retry time-out. If a server fails more than twice within this period, the system stops trying to restart the server. The value (in seconds) should be longer than the msprobe interval (local.schedule.msprobe). Default: 600 seconds |
|
Timeout for a specific server before restart. service can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens. Default: use service.readtimeout |
|
Number of seconds of a specific server’s non-response before a warning message is logged to the default log file. service can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens. Default: use local.probe.warningthreshold |
|
Number of seconds of server non-response before a warning message is logged to the default log file. Default: 5 seconds |
|
MTA queue directory to check for whether the queue size exceeds the threshold defined by alarm.diskavail.msgalarmthreshold. Default: none |
|
Period of server non-response before restarting that server. See local.schedule.msprobe. Default: 10 seconds |
|
msprobe run schedule. A crontab-style schedule string (see Table 18–10). |
|
Enable watcher, which monitors service failures for IMAP, POP, HTTP, job controller, dispatcher, message store (stored), imsched, and MMP. (LMTP/SMTP servers are monitored by the dispatcher, and LMTP/SMTP clients are monitored by the job_controller.) Logs error messages to the default log file for specific failures. Default: on |
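The failure retry time-out in Table 23–5 can be pictured as a limit on restarts within a time window. The sketch below is one possible reading, assuming a sliding window over failure timestamps; the class name and method are hypothetical, and the product may count failures differently.

```python
from collections import deque


class RestartGuard:
    """Allow restarts until a server fails more than max_failures times
    within window_seconds (600 is the documented default time-out)."""

    def __init__(self, window_seconds=600, max_failures=2):
        self.window_seconds = window_seconds
        self.max_failures = max_failures
        self._failures = deque()

    def should_restart(self, now):
        """Record a failure at time now (in seconds) and decide whether to restart."""
        self._failures.append(now)
        # Discard failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) <= self.max_failures
```

With the defaults, failures at 0 and 100 seconds each lead to a restart, while a third failure at 200 seconds does not, because it is the third within 600 seconds.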
msprobe can issue alarms in the form of email messages to the postmaster (see To Monitor imapd, popd and httpd) warning of a specified condition. A sample email alarm sent when a certain threshold is exceeded is shown below:
Subject: ALARM: server response time in seconds of “ldap_siroe.com_389” is 10
Date: Tue, 17 Jul 2001 16:37:08 -0700 (PDT)
From: postmaster@siroe.com
To: postmaster@siroe.com

Server instance: /opt/SUNWmsgsr
Alarmid: serverresponse
Instance: ldap_siroe_europa.com_389
Description: server response time in seconds
Current measured value (17/Jul/2001:16:37:08 -0700): 10
Lowest recorded value: 0
Highest recorded value: 10
Monitoring interval: 600 seconds
Alarm condition is when over threshold of 10
Number of times over threshold: 1
You can specify how often msprobe monitors disk and server performance, and under what circumstances it sends alarms, by using the configutil command to set the alarm parameters. Table 23–6 shows useful alarm parameters along with their default settings. See configutil Parameters in Sun Java System Messaging Server 6 2005Q4 Administration Reference.
Table 23–6 Useful Alarm Message configutil Parameters
Parameter |
Description (Default in parenthesis) |
---|---|
(localhost) Machine to which you send warning messages. |
|
(25) The SMTP port to which to connect when sending alarm messages. |
|
(Postmaster@localhost) Recipient of the alarm notice. |
|
(Postmaster@localhost) Address of the sender of the alarm. |
|
(percentage mail partition diskspace available.) Text for description field for disk availability alarm. |
|
(3600) Interval in seconds between disk availability checks. Set to 0 to disable checking of disk usage. |
|
(10) Percentage of disk space availability below which an alarm is sent. |
|
(-1) Specifies whether the alarm is issued when disk space availability goes below threshold (-1) or above it (1). |
|
(24) Interval in hours between successive repetitions of disk availability alarms. |
|
(server response time in seconds.) Text for description field for server response alarm. |
|
(600) Interval in seconds between server response checks. Set to 0 to disable checking of server response. |
|
(10) If the server response time in seconds exceeds this value, an alarm is issued. |
|
(1) Specifies whether the alarm is issued when the server response time is greater than (1) or less than (-1) the threshold. |
|
(24) Interval in hours between successive repetitions of the server response alarm. |
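The threshold direction flag and the repetition interval in Table 23–6 can be combined as in the sketch below. The function names are hypothetical, and the handling of the first alarm (sent immediately when no earlier alarm exists) is an assumption rather than documented behavior.

```python
def breaches_threshold(value, threshold, flag):
    """flag 1: alarm when value is above threshold; flag -1: alarm when below."""
    if flag == 1:
        return value > threshold
    if flag == -1:
        return value < threshold
    raise ValueError("flag must be 1 or -1")


def should_send_alarm(value, threshold, flag, now, last_sent, repeat_hours=24):
    """Decide whether to send an alarm at time now (in seconds).

    last_sent is the time of the previous alarm, or None if none was sent.
    """
    if not breaches_threshold(value, threshold, flag):
        return False
    if last_sent is None:
        return True
    return now - last_sent >= repeat_hours * 3600
```

With the documented defaults, disk availability uses a threshold of 10 with flag -1 (alarm when availability falls below 10 percent), and server response uses a threshold of 10 with flag 1 (alarm when the response time exceeds 10 seconds).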