Sun Java System Messaging Server 6.3 Administration Guide

27.8 Utilities and Tools for Monitoring

The following tools are available for monitoring:

27.8.1 immonitor-access

immonitor-access monitors the status of the following Messaging Server components/processes: Mail Delivery (SMTP server), Message Access and Store (POP and IMAP servers), Directory Service (LDAP server) and HTTP server. This utility measures the response times of the various services and the total round trip time taken to send and retrieve a message. The Directory Service is monitored by looking up a specified user in the directory and measuring the response time. Mail Delivery is monitored by sending a message (SMTP) and the Message Access and Store is monitored by retrieving it. Monitoring the HTTP server is limited to finding out whether or not it is up and running.

For complete instructions, refer to immonitor-access in Sun Java System Messaging Server 6.3 Administration Reference.

27.8.2 imcheck

Use imcheck -s to monitor database statistics, including logs and transactions.

27.8.3 counterutil

This utility provides statistics acquired from different system counters. Here is a current list of available counter objects:

# /opt/SUNWmsgsr/sbin/counterutil -l
Listing registry (/opt/SUNWmsgsr/data/counter/counter)
numobjects = 11
refcount = 1
created = 25/Sep/2003:02:04:55 -0700
modified = 02/Oct/2003:22:48:55 -0700
     entry = alarm 
     entry = diskusage
     entry = serverresponse
     entry = imapstat
     entry = httpstat
     entry = popstat
     entry = cgimsg

Each entry represents a counter object and supplies a variety of useful counts for that object. This section discusses only the alarm, diskusage, serverresponse, popstat, imapstat, and httpstat counter objects. For details on counterutil command usage, refer to counterutil in Sun Java System Messaging Server 6.3 Administration Reference.

27.8.3.1 counterutil Output

counterutil has a variety of flags. A command format for this utility may be as follows:

counterutil -o CounterObject -i 5 -n 10

where:

-o CounterObject specifies one of the counter objects alarm, diskusage, serverresponse, popstat, imapstat, or httpstat.

-i 5 specifies a 5 second interval.

-n 10 represents the number of iterations (default: infinity).

An example of counterutil usage is as follows:

# counterutil -o imapstat -i 5 -n 10 
Monitor counteroobject (imapstat) 
registry /gotmail/iplanet/server5/msg-gotmail/counter/counter opened 
counterobject imapstat opened 

count = 1 at 972082466 rh = 0xc0990 oh = 0xc0968 

global.currentStartTime [4 bytes]: 17/Oct/2000:12:44:23 -0700 
global.lastConnectionTime [4 bytes]: 20/Oct/2000:15:53:37 -0700 
global.maxConnections [4 bytes]: 69 
global.numConnections [4 bytes]: 12480 
global.numCurrentConnections [4 bytes]: 48 
global.numFailedConnections [4 bytes]: 0 
global.numFailedLogins [4 bytes]: 15 
global.numGoodLogins [4 bytes]: 10446 
...
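When these counts need to be tracked over time, the output can be parsed by a script. The following is a minimal sketch in Python, assuming the output lines keep the `name [N bytes]: value` layout shown above; the function name `parse_counterutil` is hypothetical and is not part of Messaging Server.

```python
import re

# Matches counter lines such as "global.numFailedLogins [4 bytes]: 15".
COUNTER_LINE = re.compile(
    r"^\s*(?P<name>[\w.]+)\s+\[\d+ bytes\]:\s*(?P<value>.*?)\s*$"
)


def parse_counterutil(text):
    """Return a dict of counter name -> value (int when numeric, else str)."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if not match:
            continue  # skip banner lines, blank lines, and "..."
        value = match.group("value")
        counters[match.group("name")] = int(value) if value.isdigit() else value
    return counters


# A few lines copied from the example output above.
sample = """\
count = 1 at 972082466 rh = 0xc0990 oh = 0xc0968

global.currentStartTime [4 bytes]: 17/Oct/2000:12:44:23 -0700
global.maxConnections [4 bytes]: 69
global.numCurrentConnections [4 bytes]: 48
global.numFailedLogins [4 bytes]: 15
global.numGoodLogins [4 bytes]: 10446
"""

stats = parse_counterutil(sample)
print(stats["global.numCurrentConnections"])
```

Time stamps are kept as strings because the sketch does not assume a time zone policy.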

27.8.3.2 Alarm Statistics Using counterutil

These alarm statistics refer to the alarms sent by stored. The alarm counter provides the following statistics:

Table 27–1 counterutil alarm Statistics


Suffix	Description
`alarm.countoverthreshold`	Number of times crossing threshold.
`alarm.countwarningsent`	Number of warnings sent.
`alarm.current`	Current monitored value.
`alarm.high`	Highest ever recorded value.
`alarm.low`	Lowest ever recorded value.
`alarm.timelastset`	The last time current value was set.
`alarm.timelastwarning`	The last time warning was sent.
`alarm.timereset`	The last time reset was performed.
`alarm.timestatechanged`	The last time alarm state changed.
`alarm.warningstate`	Warning state (yes(1) or no(0)).

27.8.3.3 IMAP, POP, and HTTP Connection Statistics Using counterutil

To get information on the number of current IMAP, POP, and HTTP connections, number of failed logins, total connections from the start time, and so forth, you can use the command counterutil -o CounterObject -i 5 -n 10, where CounterObject represents the counter object popstat, imapstat, or httpstat. The meaning of the imapstat suffixes is shown in Table 27–2. The popstat and httpstat objects provide the same information in the same format and structure.

Table 27–2 counterutil imapstat Statistics


Suffix	Description
`currentStartTime`	Start time of the current IMAP server process.
`lastConnectionTime`	The last time a new client was accepted.
`maxConnections`	Maximum number of concurrent connections handled by IMAP server.
`numConnections`	Total number of connections served by the current IMAP server.
`numCurrentConnections`	Current number of active connections.
`numFailedConnections`	Number of failed connections served by the current IMAP server.
`numFailedLogins`	Number of failed logins served by the current IMAP server.
`numGoodLogins`	Number of successful logins served by the current IMAP server.
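Because numConnections, numFailedConnections, numFailedLogins, and numGoodLogins accumulate for the lifetime of the server process, two samples taken a known interval apart can be turned into rates. The following Python sketch shows one way to do that; the function name and the values in the second sample are hypothetical, chosen only for illustration.

```python
# Counters that only grow while the same server process is running.
CUMULATIVE = (
    "numConnections",
    "numFailedConnections",
    "numFailedLogins",
    "numGoodLogins",
)


def per_second_rates(previous, current, interval_seconds):
    """Return the per-second change of each cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        name: (current[name] - previous[name]) / interval_seconds
        for name in CUMULATIVE
    }


# First sample: values from the imapstat example earlier in this section.
first = {"numConnections": 12480, "numFailedConnections": 0,
         "numFailedLogins": 15, "numGoodLogins": 10446}
# Second sample: hypothetical values, 5 seconds later (-i 5).
second = {"numConnections": 12500, "numFailedConnections": 0,
          "numFailedLogins": 16, "numGoodLogins": 10461}

rates = per_second_rates(first, second, 5)
print(rates["numConnections"])
```

A change in currentStartTime between samples means the server was restarted, in which case the difference is not meaningful and the sample pair should be discarded.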

27.8.3.4 Disk Usage Statistics Using counterutil

The command counterutil -o diskusage generates the following information:

Table 27–3 counterutil diskstat Statistics


Suffix	Description
`diskusage.availSpace`	Total space available in the disk partition.
`diskusage.lastStatTime`	The last time the statistic was taken.
`diskusage.mailPartitionPath`	Mail partition path.
`diskusage.percentAvail`	Disk partition space available percentage.
`diskusage.totalSpace`	Total space in the disk partition.

27.8.3.5 Server Response Statistics

The command counterutil -o serverresponse generates the following information. This information is useful for checking whether the servers are running and how quickly they are responding.

Table 27–4 counterutil serverresponse Statistics


Suffix	Description
`http.laststattime`	Last time the HTTP server response was checked.
`http.responsetime`	Response time of the HTTP server.
`imap.laststattime`	Last time the IMAP server response was checked.
`imap.responsetime`	Response time of the IMAP server.
`pop.laststattime`	Last time the POP server response was checked.
`pop.responsetime`	Response time of the POP server.

27.8.4 Log Files

Messaging Server logs event records for SMTP, IMAP, POP, and HTTP. The policies for creating and managing the Messaging Server log files are customizable.

Because logging can affect server performance, consider carefully how much logging to enable before placing that burden on the server. Refer to Chapter 25, Managing Logging for more information.

27.8.5 imsimta counters

The MTA accumulates message traffic counters based upon the Mail Monitoring MIB, RFC 1566, for each of its active channels. The channel counters are intended to help indicate the trend and health of your e-mail system. Channel counters are not designed to provide an accurate accounting of message traffic. For precise accounting, instead see MTA logging as discussed in Chapter 25, Managing Logging.

The MTA channel counters are implemented using the lightest weight mechanisms available so that they cause as little impact as possible on actual operation. Channel counters do not try harder: if an attempt to map the section fails, no information is recorded; if one of the locks in the section cannot be obtained almost immediately, no information is recorded; when a system is shut down, the information contained in the in-memory section is lost forever.

The imsimta counters -show command provides MTA channel message statistics (see below). These counters need to be examined over time, noting the minimum values seen. The minimums may actually be negative for some channels. A negative value means that there were messages queued for a channel at the time that its counters were zeroed (for example, when the cluster-wide database of counters was created). When those messages were dequeued, the associated counters for the channel were decremented, leading to a negative minimum. For such a counter, the correct “absolute” value is the current value less the minimum value that counter has ever held since being initialized.

Channel          Messages    Recipients    Blocks 
-------          --------    ----------    ------- 
tcp_local
   Received       29379       79714      982252                (1)
   Stored            61         113       -2004                (2)
   Delivered      29369       79723      983903 (29369 first time)  (3)
   Submitted      13698       13699       18261                (4)
   Attempted          0           0           0                (5)
   Rejected           1          10           0                (6)
   Failed           104         104        4681                (7)

   Queue time/count        16425/29440 = 0.56                  (8)
   Queue first time/count  16425/29440 = 0.56                  (9)

   Total In Assocs           297637
   Total Out Assocs           28306

1) Received is the number of messages enqueued to the channel named tcp_local. That is, the messages enqueued (E records in the mail.log* file) to the tcp_local channel by any other channel.

2) Stored is the number of messages stored in the channel queue to be delivered.

3) Delivered is the number of messages which have been processed (dequeued) by the channel tcp_local. (That is, D records in the mail.log* file.) A dequeue operation may either correspond to a successful delivery (that is, an enqueue to another channel), or to a dequeue due to the message being returned to the sender. This will generally correspond to the number Received minus the number Stored.

The MTA also keeps track of how many of the messages were dequeued upon first attempt; this number is shown in parentheses.

4) Submitted is the number of messages enqueued (E records in the mail.log* file) by the channel tcp_local to any other channel.

5) Attempted is the number of messages which have experienced temporary problems in dequeuing, that is, Q or Z records in the mail.log* file.

6) Rejected is the number of attempted enqueues which have been rejected, that is, J records in the mail.log* file.

7) Failed is the number of attempted dequeues which have failed, that is, R records in the mail.log* file.

8) Queue time/count is the average time-spent-in-queue for the delivered messages. This includes both the messages delivered upon the first attempt, see (9), and the messages that required additional delivery attempts (hence typically spent noticeable time waiting fallow in the queue).

9) Queue first time/count is the average time-spent-in-queue for the messages delivered upon the first attempt.
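The correction for negative minimums described earlier in this section can be sketched in Python as follows. This is an illustration only: the function name and the sampled history are hypothetical, and leaving a counter unchanged when its minimum is not negative is an assumption based on the rule being stated only for counters with a negative minimum.

```python
def absolute_counter(current, minimum_seen):
    """Return the corrected value of a channel counter.

    A negative minimum means messages were already queued when the
    counter was zeroed, so the counter is shifted by that minimum.
    """
    if minimum_seen < 0:
        return current - minimum_seen
    return current  # assumption: no correction needed otherwise


# Hypothetical history of the Stored/Blocks counter for tcp_local,
# sampled over time; the last sample is the value shown above.
samples = [-1200, -2500, -2004]

current = samples[-1]
minimum = min(samples)
print(absolute_counter(current, minimum))
```

The minimum has to be tracked by whatever tool collects the samples, since the counters display only current values.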

Note that the number of messages submitted can be greater than the number delivered. This is often the case, since each message the channel dequeues (delivers) will result in at least one new message enqueued (submitted) but possibly more than one. For example, if a message has two recipients reached via different channels, then two enqueues will be required. Or if a message bounces, a copy will go back to the sender and another copy may be sent to the postmaster. Usually that will be two submissions (unless both are reached through the same channel).

More generally, the connection between Submitted and Delivered varies according to type of channel. For example, in the conversion channel, a message would be enqueued by some other arbitrary channel, and then the conversion channel would process that message and enqueue it to a third channel and mark the message as dequeued from its own queue. Each individual message takes a path:

elsewhere -> conversion E record Received
conversion -> elsewhere E record Submitted
conversion              D record Delivered

However, for a channel such as tcp_local which is not a “pass through,” but rather has two separate pieces (slave and master), there is no connection between Submitted and Delivered. The Submitted counter has to do with the SMTP server portion of the tcp_local channel, whereas the Delivered counter has to do with the SMTP client portion of the tcp_local channel. Those are two completely separate programs, and the messages travelling through them may be completely separate.

Messages submitted to the SMTP server:

tcp_local -> elsewhere E record Submitted

Messages sent out to other SMTP hosts via the SMTP client:

elsewhere -> tcp_local E record Received
tcp_local              D record Delivered
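The relationship between log records and counters in the paths above can be sketched as a small tally in Python. The record layout used here, a tuple of (action, channel, target queue), is a hypothetical simplification and not the actual mail.log* format; only E and D records are handled.

```python
from collections import defaultdict


def tally(records):
    """Count Received, Submitted, and Delivered messages per channel.

    Each record is (action, channel, target):
      ("E", enqueuing_channel, destination_channel)
      ("D", dequeuing_channel, None)
    """
    counters = defaultdict(
        lambda: {"Received": 0, "Submitted": 0, "Delivered": 0}
    )
    for action, channel, target in records:
        if action == "E":
            counters[channel]["Submitted"] += 1  # enqueued by this channel
            counters[target]["Received"] += 1    # enqueued to this channel
        elif action == "D":
            counters[channel]["Delivered"] += 1  # dequeued by this channel
    return counters


# One message passing through the conversion channel.
records = [
    ("E", "tcp_intranet", "conversion"),
    ("E", "conversion", "tcp_local"),
    ("D", "conversion", None),
]

result = tally(records)
print(result["conversion"])
```

For the pass-through conversion channel all three counters move together, whereas tcp_local only gains a Received count here because its Submitted and Delivered counters belong to separate server and client programs.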

27.8.5.1 Implementation on UNIX and NT

For performance reasons, a node running the MTA keeps a cache of channel counters in memory using a shared memory section (UNIX) or shared file-mapping object (NT). As processes on the node enqueue and dequeue messages, they update the counters in this in-memory cache. If the in-memory section does not exist when a channel runs, the section will be created automatically. (The imsimta start command also creates the in-memory section, if it does not exist.)

The command imsimta counters -clear or the imsimta qm command counters clear may be used to reset the counters to zero.

27.8.6 imsimta qm counters

The imsimta qm counters utility displays MTA channel queue message counters. You must be root or mailsrv to run this utility. The output fields are the same as those described in 27.8.5 imsimta counters. See also imsimta counters in Sun Java System Messaging Server 6.3 Administration Reference.

Example:

# imsimta counters -create
# imsimta qm counters show
Channel                Messages   Recipients Blocks
---------------------- ---------- ---------- ----------
tcp_intranet
   Received              13077      13859     264616 
   Stored                   92         91       -362 
   Delivered             12985      13768     264978 
   Submitted              2594       2594       3641
...

Every time you restart the MTA, you must run: # imsimta counters -create

27.8.7 MTA Monitoring Using SNMP

Messaging Server supports system monitoring through the Simple Network Management Protocol (SNMP). Using an SNMP client (sometimes called a network manager) such as Sun Net Manager or HP OpenView (not provided with this product), you can monitor certain parts of the Messaging Server. Refer to Appendix A, SNMP Support for details.

27.8.8 imquotacheck for Mailbox Quota Checking

You can monitor mailbox quota usage and limits by using the imquotacheck utility. The imquotacheck utility generates a report that lists defined quotas and limits, and provides information on quota usage.

For example, the following command lists all user quota information:

% imquotacheck 
-------------------------------------------------------------------------
Domain red.siroe.com (diskquota = not set msgquota = not set) quota usage
-------------------------------------------------------------------------
diskquota         size(K)    %use    msgquota      msgs    %use    user
# of domains = 1
# of users = 705

no quota          50418             no quota      4392             ajonk
no quota              5             no quota      2                andrt
no quota         355518             no quota      2500             ansri
 ...

The following example shows the quota usage for user sorook:

% imquotacheck -u sorook
-------------------------------------------------------------------------
quota usage for user sorook
-------------------------------------------------------------------------
diskquota      size(K)    %use    msgquota      msgs     %use    user

no quota       1487               no quota      305              sorook

27.8.9 Monitoring Using msprobe and watcher Functions

Messaging Server provides two processes, watcher and msprobe, to monitor various system services. watcher watches for server crashes and restarts the servers as necessary. msprobe monitors server hangs (unresponsiveness). Specifically, msprobe monitors the following:

Server Response Time. msprobe connects to the enabled servers using their protocol commands and measures their response times. If the response time exceeds the alarm warning threshold, an alarm message is sent (see 27.8.9.1 Alarm Messages). If msprobe cannot connect to a server, or the server response time exceeds a specified timeout period, the server is restarted. Server response times are recorded in a counter database and are logged to the default log file. counterutil can be used to display the server response time statistics (see 27.8.3 counterutil).

The following servers are monitored by msprobe: imap, pop, http, cert, job_controller, smtp, lmtp, mmp and ens. When smtp or lmtp are not responding, the dispatcher is restarted. ens cannot be automatically restarted.

Disk Usage. msprobe checks the disk availability and usage for every message store partition. Specifically, it checks the message store mboxlist database directory and the MTA queue directory. If disk usage exceeds a configured threshold, an alarm message is sent. The disk sizes and usages are recorded in a counter database and are logged to the default log file. Administrators can use the counterutil utility (see 27.8.3 counterutil) to display the disk usage statistics.

Message Store mboxlist Database Log File Accumulation. Log file accumulation is an indication of an mboxlist database error. msprobe counts the number of active log files and, if that number is larger than the threshold, logs a critical error message to the default log file to inform the administrator to restart the server. If autorestart is enabled (local.autorestart set to yes), the store daemon is restarted.
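The server response check described above amounts to a three-way decision per probe. The following Python sketch illustrates that decision under stated assumptions: the function name is hypothetical, a value of None stands for a failed connection, and the default thresholds of 5 and 10 seconds are taken from the defaults listed for local.probe.warningthreshold and service.readtimeout in Table 27–5. It is not the msprobe implementation.

```python
def classify_probe(response_seconds, warning_threshold=5.0, timeout=10.0):
    """Classify one probe result as "restart", "warn", or "ok".

    response_seconds is the measured response time, or None when the
    probe could not connect to the server at all (an assumption).
    """
    if response_seconds is None or response_seconds > timeout:
        return "restart"
    if response_seconds > warning_threshold:
        return "warn"
    return "ok"


# Hypothetical probe results, in seconds.
for measured in (0.3, 7.0, 12.0, None):
    print(measured, classify_probe(measured))
```

Whether a "restart" result actually leads to a restart presumably also depends on settings such as local.autorestart, and ens is never restarted automatically.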

watcher and msprobe are controlled by the configutil options shown in Table 27–5. Further information can be found in 4.5 Automatic Restart of Failed or Unresponsive Services.

Table 27–5 msprobe and watcher configutil Options


Options	Description
local.autorestart	Enable automatic server restart. Automatically restarts failed or hung services. Default: no
local.autorestart.timeout	Failure retry time-out. If a server fails more than twice in this designated amount of time, then the system stops trying to restart the server. The value (set in seconds) should be longer than the `msprobe` interval (`local.schedule.msprobe`). Default: 600 seconds
local.probe.service.timeout	Timeout for a specific server before restart. `service` can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens. Default: use `service.readtimeout`
local.probe.service.warningthreshold	Number of seconds of a specific server’s non-response before a warning message is logged to `default` log file. `service` can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens. Default: Use local.probe.warningthreshold
local.probe.warningthreshold	Number of seconds of server non-response before a warning message is logged to `default` log file. Default: 5 secs
local.queuedir	MTA queue directory to check if queue size exceeded threshold defined by alarm.diskavail.msgalarmthreshold. Default: none
service.readtimeout	Period of server non-response before restarting that server. See local.schedule.msprobe. Default: 10 seconds
local.schedule.msprobe	`msprobe` run schedule. A crontab style schedule string (see Table 20–10). Note that by default, this is automatically set. See 4.6.2 Pre-defined Automatic Tasks. To disable: set `local.schedule.msprobe.enable` to `NO`.
local.watcher.enable	Enable watcher, which monitors service failures (IMAP, POP, HTTP, job controller, dispatcher, message store (`stored`), `imsched`, and MMP). LMTP/SMTP servers are monitored by the dispatcher and LMTP/SMTP clients are monitored by the job_controller. Logs error messages to the default log file for specific failures. Default: on
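The retry rule given for local.autorestart.timeout (stop restarting once a server has failed more than twice within the time-out) can be pictured as a sliding window over failure times. The following Python sketch is one possible reading of that rule; the class name and the exact window handling are assumptions, not the product's implementation.

```python
class RestartLimiter:
    """Allow restarts until failures within the window exceed a limit."""

    def __init__(self, window_seconds=600, max_failures=2):
        self.window_seconds = window_seconds  # local.autorestart.timeout default
        self.max_failures = max_failures      # "more than twice" stops restarts
        self.failures = []

    def should_restart(self, now):
        """Record a failure at time `now` (seconds) and decide on a restart."""
        cutoff = now - self.window_seconds
        # Forget failures that fall outside the window.
        self.failures = [t for t in self.failures if t > cutoff]
        self.failures.append(now)
        return len(self.failures) <= self.max_failures


limiter = RestartLimiter()
decisions = [limiter.should_restart(t) for t in (0, 100, 200)]
print(decisions)
```

In this reading, the third failure inside 600 seconds is the one that stops further restart attempts, while failures spaced further apart than the window never accumulate.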

27.8.9.1 Alarm Messages

msprobe can issue alarms in the form of email messages to the postmaster (see 27.6.1.2 To Monitor imapd, popd and httpd) warning of a specified condition. A sample email alarm sent when a certain threshold is exceeded is shown below:

Subject:    ALARM: server response time in seconds of “ldap_siroe.com_389” is 10
Date:    Tue, 17 Jul 2001 16:37:08 -0700 (PDT) 
From:    postmaster@siroe.com 
To:     postmaster@siroe.com 

Server instance: /opt/SUNWmsgsr
Alarmid: serverresponse 
Instance: ldap_siroe_europa.com_389 
Description: server response time in seconds 
Current measured value (17/Jul/2001:16:37:08 -0700): 10 
Lowest recorded value: 0 
Highest recorded value: 10 
Monitoring interval: 600 seconds 
Alarm condition is when over threshold of 10 
Number of times over threshold: 1

You can specify how often msprobe monitors disk and server performance, and under what circumstances it sends alarms. This is done by using the configutil command to set the alarm parameters. Table 27–6 shows useful alarm parameters along with their default settings. See configutil Parameters in Sun Java System Messaging Server 6.3 Administration Reference.

Table 27–6 Useful Alarm Message configutil Parameters


Parameter	Description (Default in parenthesis)
alarm.msgalarmnoticehost	(localhost) Machine to which you send warning messages.
alarm.msgalarmnoticeport	(25) The SMTP port to which to connect when sending an alarm message.
alarm.msgalarmnoticercpt	(Postmaster@localhost) Recipient of the alarm notice.
alarm.msgalarmnoticesender	(Postmaster@localhost) Address of the sender of the alarm.
alarm.diskavail.msgalarmdescription	(percentage mail partition diskspace available.) Text for description field for disk availability alarm.
alarm.diskavail.msgalarmstatinterval	(3600) Interval in seconds between disk availability checks. Set to 0 to disable checking of disk usage.
alarm.diskavail.msgalarmthreshold	(10) Percentage of disk space availability below which an alarm is sent.
alarm.diskavail.msgalarmthresholddirection	(-1) Specifies whether the alarm is issued when disk space availability goes below threshold (-1) or above it (1).
alarm.diskavail.msgalarmwarninginterval	(24) Interval in hours between successive repetitions of the disk availability alarm.
alarm.serverresponse.msgalarmdescription	(server response time in seconds.) Text for description field for server response alarm.
alarm.serverresponse.msgalarmstatinterval	(600) Interval in seconds between server response checks. Set to 0 to disable checking of server response.
alarm.serverresponse.msgalarmthreshold	(10) If server response time in seconds exceeds this value, an alarm is issued.
alarm.serverresponse.msgalarmthresholddirection	(1) Specifies whether the alarm is issued when server response time is greater than (1) or less than (-1) the threshold.
alarm.serverresponse.msgalarmwarninginterval	(24) Interval in hours between successive repetitions of the server response alarm.
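The threshold and thresholddirection parameters combine into a single comparison. The following Python sketch shows that logic with a hypothetical function name. The strict comparison at the boundary is an assumption: the sample alarm above reports a measured value of 10 against a threshold of 10, so the product may compare inclusively or round the measured value.

```python
def alarm_triggered(value, threshold, direction):
    """Return True when `value` is on the alarm side of `threshold`.

    direction  1: alarm when the value goes above the threshold
    direction -1: alarm when the value goes below the threshold
    """
    if direction == 1:
        return value > threshold
    if direction == -1:
        return value < threshold
    raise ValueError("direction must be 1 or -1")


# Disk availability defaults: threshold 10 percent, direction -1.
print(alarm_triggered(8, 10, -1))
# Server response defaults: threshold 10 seconds, direction 1.
print(alarm_triggered(3, 10, 1))
```

With the default settings, 8 percent of disk space available raises the disk alarm, while a 3 second server response does not raise the server response alarm.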