Sun Java System Messaging Server 6.3 Administration Guide

4.5 Automatic Restart of Failed or Unresponsive Services

Messaging Server provides two processes, watcher and msprobe, that transparently monitor services and automatically restart them if they crash or become unresponsive (hang). watcher monitors server crashes. msprobe monitors server hangs by checking response times. When a server fails or stops responding to requests, it is automatically restarted. Table 4–4 shows the services monitored by each utility.

Table 4–4 Services Monitored by watcher and msprobe


watcher (crash)	msprobe (unresponsive hang)
IMAP, POP, HTTP, job controller, dispatcher, message store (`stored`), `imsched`, MMP. (LMTP/SMTP servers are monitored by the dispatcher and LMTP/SMTP clients are monitored by the job_controller.)	IMAP, POP, HTTP, cert, job controller, message store (`stored`), `imsched`, ENS, LMTP, SMTP

When local.watcher.enable is set to on (the default), the system monitors process failures and unresponsive services and logs error messages describing the specific failures to the default log file. To enable automatic server restart, set the configutil parameter local.autorestart to yes. By default, this parameter is set to no.

If any of the message store services fail or freeze, all message store services that were enabled at start-up are restarted. For example, if imapd fails, at the least, stored and imapd are restarted. If other message store services were running, such as the POP or HTTP servers, then those will be restarted as well, whether or not they failed.

Automatic restart also works if a message store utility fails or freezes. For example, if mboxutil fails or freezes, the system automatically restarts all the message store servers. Note, however, that it does not restart the utility itself. msprobe runs every 10 minutes. Service and process restarts are performed up to two times within a 10-minute period (the period is configurable using local.autorestart.timeout).
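The restart limit described above can be pictured as a sliding-window budget. The following is a minimal sketch for illustration only, not the product's implementation: the class name `RestartBudget`, its method `allow_restart`, and the idea of tracking timestamps in a queue are all assumptions. Only the default figures (two restarts, 600 seconds) come from the text.

```python
from collections import deque


class RestartBudget:
    """Toy model: allow at most max_restarts restarts per window_seconds.

    Hypothetical helper for illustration; Messaging Server's own
    bookkeeping is not described in this guide.
    """

    def __init__(self, max_restarts=2, window_seconds=600):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of restarts still inside the window

    def allow_restart(self, now):
        """Return True and record the restart if the budget permits one at `now`."""
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False
        self._restarts.append(now)
        return True


budget = RestartBudget()
# Failures at 0 s, 120 s, 300 s and 700 s after the first crash.
decisions = [budget.allow_restart(t) for t in (0, 120, 300, 700)]
print(decisions)  # [True, True, False, True]
```

In this model the third failure is refused because two restarts already fall inside the 600-second window; by 700 seconds the first restart has aged out, so a restart is allowed again.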

Whether or not local.autorestart is set to yes, the system still monitors the services and sends failure or non-response error messages to the console and to msg-svr-base/data/log/. watcher listens on port 49994 by default; this is configurable with local.watcher.port.
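A quick way to see whether anything is listening on the watcher port is to attempt a connection to it. The sketch below assumes that watcher accepts plain TCP connections and that the check runs on the Messaging Server host itself; the function name `is_listening` is hypothetical, and the check says nothing about whether the listener is actually watcher.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value on failure.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 49994 is the default from the text; substitute the value of
    # local.watcher.port if it has been changed.
    print(is_listening("127.0.0.1", 49994))
```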

A watcher log file is generated in msg-svr-base/data/log/watcher. This log file is not managed by the logging system (no rollover or purging) and records all server starts and stops. An example log is shown below:

watcher process 13425 started at Tue Oct 21 15:29:44 2003

Watched ’imapd’ process 13428 exited abnormally
Received request to restart:  store imap pop http
Connecting to watcher ...
Stopping http server 13440 .... done
Stopping pop server 13431 ... done
Stopping pop server 13434 ... done
Stopping pop server 13435 ... done
Stopping pop server 13433 ... done
imap server is not running
Stopping store server 13426 .... done
Starting store server .... 13457
checking store server status ...... ready
Starting imap server ..... 13459
Starting pop server ....... 13462
Starting http server ...... 13471
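Because the watcher log is not managed by the logging system, it can be useful to scan it with a small script. The following is a minimal sketch that assumes the line formats match the example log above; the function name `summarize` and the regular expressions are illustrative and may need adjusting for other releases.

```python
import re

# Accept straight or typographic quotes around the process name.
ABNORMAL_EXIT = re.compile(
    r"Watched\s+['‘’](?P<name>[^'‘’]+)['‘’]\s+process\s+(?P<pid>\d+)\s+exited abnormally"
)
STARTED = re.compile(r"^Starting (?P<server>\w+) server \.+\s*(?P<pid>\d+)$")


def summarize(log_text):
    """Return (crashes, started): crashed (name, pid) pairs and new server pids."""
    crashes, started = [], {}
    for raw in log_text.splitlines():
        line = raw.strip()
        match = ABNORMAL_EXIT.search(line)
        if match:
            crashes.append((match["name"], int(match["pid"])))
            continue
        match = STARTED.match(line)
        if match:
            started[match["server"]] = int(match["pid"])
    return crashes, started


sample = """\
Watched ’imapd’ process 13428 exited abnormally
Stopping store server 13426 .... done
Starting store server .... 13457
Starting imap server ..... 13459
"""
crashes, started = summarize(sample)
print(crashes)  # [('imapd', 13428)]
print(started)  # {'store': 13457, 'imap': 13459}
```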

See 27.8.9 Monitoring Using msprobe and watcher Functions for more details on how to configure this feature.

msprobe is controlled by imsched. If imsched crashes, watcher detects the event and triggers a restart (if autorestart is enabled). However, in the rare event that imsched hangs, you must kill it yourself with kill imsched_pid, which causes watcher to restart it.

4.5.1 Automatic Restart in High Availability Deployments

Automatic restart in high availability deployments requires the following configutil parameters to be set:

Table 4–5 HA Automatic Restart Parameters


Parameter	Description/HA Value
`local.watcher.enable`	Enable watcher on `start-msg` startup. Default is yes.
`local.autorestart`	Enable automatic restart of failed or frozen (unresponsive) servers including IMAP, POP, HTTP, job controller, dispatcher, and MMP servers. Default is No.
`local.autorestart.timeout`	Failure retry time-out. If a server fails more than once during this period, the system stops trying to restart that server. If this happens in an HA system, Messaging Server is shut down and a failover to the other system occurs. The value (in seconds) should be longer than the msprobe interval (see `local.schedule.msprobe` below). Default is 600.
`local.schedule.msprobe`	`msprobe` run schedule. A crontab-style schedule string (see Table 20–10). Default is `5,15,25,35,45,55 * * * * lib/msprobe`. To disable, set `local.schedule.msprobe.enable` to `NO`.
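To compare a schedule string with `local.autorestart.timeout`, it helps to convert the schedule into an interval in seconds. The sketch below is a simplified illustration with hypothetical function names: it reads only the minute field, assumes the remaining time fields are `*` as in the default, and handles comma lists, ranges and `*` but not step syntax.

```python
DEFAULT_SCHEDULE = "5,15,25,35,45,55 * * * * lib/msprobe"


def minute_runs(schedule):
    """Return the sorted minutes of the hour at which the schedule fires."""
    minute_field = schedule.split()[0]
    minutes = set()
    for part in minute_field.split(","):
        if part == "*":
            minutes.update(range(60))
        elif "-" in part:
            low, high = part.split("-")
            minutes.update(range(int(low), int(high) + 1))
        else:
            minutes.add(int(part))
    return sorted(minutes)


def longest_gap_seconds(minutes):
    """Return the longest wait between consecutive runs, wrapping past the hour."""
    following = minutes[1:] + minutes[:1]
    # A gap of 0 only occurs with a single run per hour, which means 60 minutes.
    gaps = [((b - a) % 60) or 60 for a, b in zip(minutes, following)]
    return max(gaps) * 60


runs = minute_runs(DEFAULT_SCHEDULE)
print(runs)                       # [5, 15, 25, 35, 45, 55]
print(longest_gap_seconds(runs))  # 600
```

The resulting interval can then be checked against the value chosen for `local.autorestart.timeout`.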