The following tools are available for monitoring:
immonitor-access monitors the status of the following Messaging Server components/processes: Mail Delivery (SMTP server), Message Access and Store (POP and IMAP servers), Directory Service (LDAP server) and HTTP server. This utility measures the response times of the various services and the total round trip time taken to send and retrieve a message. The Directory Service is monitored by looking up a specified user in the directory and measuring the response time. Mail Delivery is monitored by sending a message (SMTP), and the Message Access and Store is monitored by retrieving it. Monitoring the HTTP server is limited to finding out whether or not it is up and running.
For complete instructions, refer to immonitor-access in Sun Java System Messaging Server 6 2005Q4 Administration Reference.
The stored utility performs maintenance tasks on the server and also can do monitoring. However, the monitoring tasks are now better handled by msprobe. See Monitoring Using msprobe and watcher Functions.
The counterutil utility provides statistics acquired from different system counters. Here is a current list of available counter objects:
Each entry represents a counter object and supplies a variety of useful counts for this object. In this section we will only be discussing the alarm, diskusage, serverresponse, db_lock, popstat, imapstat, and httpstat counter objects. For details on counterutil command usage, refer to counterutil in Sun Java System Messaging Server 6 2005Q4 Administration Reference.
counterutil has a variety of flags. A command format for this utility may be as follows:
counterutil -o CounterObject -i 5 -n 10
where,
-o CounterObject names the counter object: alarm, diskusage, serverresponse, db_lock, popstat, imapstat, or httpstat.
-i 5 specifies a 5 second interval.
-n 10 represents the number of iterations (default: infinity).
An example of counterutil usage is as follows:
# counterutil -o imapstat -i 5 -n 10
Monitor counteroobject (imapstat)
registry /gotmail/iplanet/server5/msg-gotmail/counter/counter opened
counterobject imapstat opened
count = 1 at 972082466 rh = 0xc0990 oh = 0xc0968
global.currentStartTime [4 bytes]: 17/Oct/2000:12:44:23 -0700
global.lastConnectionTime [4 bytes]: 20/Oct/2000:15:53:37 -0700
global.maxConnections [4 bytes]: 69
global.numConnections [4 bytes]: 12480
global.numCurrentConnections [4 bytes]: 48
global.numFailedConnections [4 bytes]: 0
global.numFailedLogins [4 bytes]: 15
global.numGoodLogins [4 bytes]: 10446
...
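When counterutil output is collected by a script, it can help to turn the counter lines into a dictionary. The following is a minimal sketch, assuming each counter is printed on its own line in the `name [N bytes]: value` form shown in the example; the function name `parse_counterutil` and the `SAMPLE` text are illustrative and not part of Messaging Server.

```python
import re

# Matches lines such as "global.numConnections [4 bytes]: 12480".
ENTRY = re.compile(r"^\s*(?P<name>[\w.]+)\s+\[\d+ bytes\]:\s*(?P<value>.*?)\s*$")


def parse_counterutil(text):
    """Return {counter name: value} for every counter line in `text`.

    Values that look like integers are converted to int; anything else
    (for example a timestamp) is kept as a string.
    """
    counters = {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # header lines, blank lines, "..." and so on
        value = match.group("value")
        if re.fullmatch(r"-?\d+", value):
            value = int(value)
        counters[match.group("name")] = value
    return counters


# Hypothetical captured output, shortened from the example above.
SAMPLE = """\
count = 1 at 972082466 rh = 0xc0990 oh = 0xc0968
global.currentStartTime [4 bytes]: 17/Oct/2000:12:44:23 -0700
global.maxConnections [4 bytes]: 69
global.numConnections [4 bytes]: 12480
global.numCurrentConnections [4 bytes]: 48
"""

stats = parse_counterutil(SAMPLE)
print(stats["global.numConnections"])
```

Lines that do not match the counter pattern, such as the `count = 1 at ...` header, are skipped rather than treated as errors.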
These alarm statistics refer to the alarms sent by stored. The alarm counter provides the following statistics:
Table 23–1 counterutil alarm Statistics
Suffix | Description
---|---
alarm.countoverthreshold | Number of times the threshold was crossed.
alarm.countwarningsent | Number of warnings sent.
alarm.current | Current monitored value.
alarm.high | Highest value ever recorded.
alarm.low | Lowest value ever recorded.
alarm.timelastset | The last time the current value was set.
alarm.timelastwarning | The last time a warning was sent.
alarm.timereset | The last time a reset was performed.
alarm.timestatechanged | The last time the alarm state changed.
alarm.warningstate | Warning state (yes(1) or no(0)).
To get information on the number of current IMAP, POP, and HTTP connections, number of failed logins, total connections from the start time, and so forth, you can use the command counterutil -o CounterObject -i 5 -n 10, where CounterObject represents the counter object popstat, imapstat, or httpstat. The meaning of the imapstat suffixes is shown in Table 23–2. The popstat and httpstat objects provide the same information in the same format and structure.
Table 23–2 counterutil imapstat Statistics
Suffix | Description
---|---
currentStartTime | Start time of the current IMAP server process.
lastConnectionTime | The last time a new client was accepted.
maxConnections | Maximum number of concurrent connections handled by IMAP server.
numConnections | Total number of connections served by the current IMAP server.
numCurrentConnections | Current number of active connections.
numFailedConnections | Number of failed connections served by the current IMAP server.
numFailedLogins | Number of failed logins served by the current IMAP server.
numGoodLogins | Number of successful logins served by the current IMAP server.
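As an illustration of how these counts can be combined, the sketch below derives two simple ratios from the imapstat values. The function names are hypothetical, and the input is assumed to be a plain dictionary keyed by the suffixes in Table 23–2; the figures are taken from the sample counterutil output earlier in this section.

```python
def login_failure_rate(stats):
    """Fraction of login attempts that failed (0.0 when there were none)."""
    failed = stats["numFailedLogins"]
    total = failed + stats["numGoodLogins"]
    return failed / total if total else 0.0


def load_relative_to_peak(stats):
    """Current connections as a fraction of the recorded maximum."""
    peak = stats["maxConnections"]
    return stats["numCurrentConnections"] / peak if peak else 0.0


# Values from the sample imapstat output.
imapstat = {
    "maxConnections": 69,
    "numCurrentConnections": 48,
    "numFailedLogins": 15,
    "numGoodLogins": 10446,
}

print(f"failed logins: {login_failure_rate(imapstat):.4%}")
print(f"load vs. peak: {load_relative_to_peak(imapstat):.1%}")
```

Because popstat and httpstat use the same structure, the same helpers would apply to those objects under the same assumptions.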
The command counterutil -o diskusage generates the following information:
Table 23–3 counterutil diskstat Statistics
Suffix | Description
---|---
diskusage.availSpace | Total space available in the disk partition.
diskusage.lastStatTime | The last time the statistic was taken.
diskusage.mailPartitionPath | Mail partition path.
diskusage.percentAvail | Percentage of disk partition space available.
diskusage.totalSpace | Total space in the disk partition.
The command counterutil -o serverresponse generates the following information. This information is useful for checking whether the servers are running and how quickly they are responding.
Table 23–4 counterutil serverresponse Statistics
Suffix | Description
---|---
http.laststattime | Last time the HTTP server response was checked.
http.responsetime | Response time for the HTTP server.
imap.laststattime | Last time the IMAP server response was checked.
imap.responsetime | Response time for the IMAP server.
pop.laststattime | Last time the POP server response was checked.
pop.responsetime | Response time for the POP server.
ldap_host1_389.laststattime | Last time the ldap_host1_389 server response was checked.
ldap_host1_389.responsetime | Response time for the ldap_host1_389 server.
ugldap_host2_389.laststattime | Last time the ugldap_host2_389 server response was checked.
ugldap_host2_389.responsetime | Response time for the ugldap_host2_389 server.
Messaging Server logs event records for SMTP, IMAP, POP, and HTTP. The policies for creating and managing the Messaging Server log files are customizable.
Because logging can affect server performance, consider carefully how much logging to enable before placing that burden on the server. Refer to Chapter 21, Managing Logging for more information.
The MTA accumulates message traffic counters, based upon the Mail Monitoring MIB (RFC 1566), for each of its active channels. The channel counters are intended to help indicate the trend and health of your e-mail system. Channel counters are not designed to provide an accurate accounting of message traffic. For precise accounting, see instead MTA logging as discussed in Chapter 21, Managing Logging.
The MTA channel counters are implemented using the lightest weight mechanisms available so that they cause as little impact as possible on actual operation. Channel counters do not try harder: if an attempt to map the section fails, no information is recorded; if one of the locks in the section cannot be obtained almost immediately, no information is recorded; when a system is shut down, the information contained in the in-memory section is lost forever.
The imsimta counters -show command provides MTA channel message statistics (see below). These counters need to be examined over time, noting the minimum values seen. The minimums may actually be negative for some channels. A negative value means that there were messages queued for a channel at the time that its counters were zeroed (for example, when the cluster-wide database of counters was created). When those messages were dequeued, the associated counters for the channel were decremented, leading to a negative minimum. For such a counter, the correct “absolute” value is the current value less the minimum value that counter has ever held since being initialized.
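The correction described above can be sketched as a small tracker that remembers the lowest value seen for a counter and subtracts it from each new reading. This is only an illustration: the class name `CounterTracker` is hypothetical, and starting the minimum at zero is an assumption based on the counters being zeroed when they are created.

```python
class CounterTracker:
    """Track one counter's minimum and report its corrected 'absolute' value."""

    def __init__(self):
        # Assumption: the counter held 0 when it was created or cleared,
        # so the minimum it has ever held is at most 0.
        self.minimum = 0

    def observe(self, current):
        """Record a new reading and return current value less the minimum."""
        if current < self.minimum:
            self.minimum = current
        return current - self.minimum


tracker = CounterTracker()
for reading in (-2004, -1500, 120):  # hypothetical successive readings
    print(reading, "->", tracker.observe(reading))
```

A reading that never goes negative is returned unchanged, since the minimum then stays at zero.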
Channel                     Messages  Recipients    Blocks
-------                     --------  ----------   -------
tcp_local
  Received                     29379       79714    982252  (1)
  Stored                          61         113     -2004  (2)
  Delivered                    29369       79723    983903  (29369 first time) (3)
  Submitted                    13698       13699     18261  (4)
  Attempted                        0           0         0  (5)
  Rejected                         1          10         0  (6)
  Failed                         104         104      4681  (7)
  Queue time/count             16425/29440 = 0.56  (8)
  Queue first time/count       16425/29440 = 0.56  (9)
  Total In Assocs             297637
  Total Out Assocs             28306
1) Received is the number of messages enqueued to the channel named tcp_local. That is, the messages enqueued (E records in the mail.log* file) to the tcp_local channel by any other channel.
2) Stored is the number of messages stored in the channel queue to be delivered.
3) Delivered is the number of messages which have been processed (dequeued) by the channel tcp_local. (That is, D records in the mail.log* file.) A dequeue operation may either correspond to a successful delivery (that is, an enqueue to another channel), or to a dequeue due to the message being returned to the sender. This will generally correspond to the number Received minus the number Stored.
The MTA also keeps track of how many of the messages were dequeued upon first attempt; this number is shown in parentheses.
4) Submitted is the number of messages enqueued (E records in the mail.log* file) by the channel tcp_local to any other channel.
5) Attempted is the number of messages which have experienced temporary problems in dequeuing, that is, Q or Z records in the mail.log* file.
6) Rejected is the number of attempted enqueues which have been rejected, that is, J records in the mail.log* file.
7) Failed is the number of attempted dequeues which have failed, that is, R records in the mail.log* file.
8) Queue time/count is the average time-spent-in-queue for the delivered messages. This includes both the messages delivered upon the first attempt, see (9), and the messages that required additional delivery attempts (hence typically spent noticeable time waiting fallow in the queue).
9) Queue first time/count is the average time-spent-in-queue for the messages delivered upon the first attempt.
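The arithmetic behind items (3), (8), and (9) can be checked with a few lines of code. This is a minimal sketch using the figures from the tcp_local sample output above; the function names are illustrative only.

```python
def average_queue_time(total_queue_time, message_count):
    """Average time in queue, as reported by 'Queue time/count'."""
    if message_count == 0:
        return 0.0
    return total_queue_time / message_count


def expected_delivered(received, stored):
    """Rough estimate of Delivered: Received minus Stored."""
    return received - stored


# Figures from the tcp_local sample output.
print(f"{average_queue_time(16425, 29440):.2f}")  # 0.56
print(expected_delivered(29379, 61))              # 29318
```

The estimate of 29318 is close to, but not the same as, the 29369 reported for Delivered, which fits the statement that the two numbers only generally correspond.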
Note that the number of messages submitted can be greater than the number delivered. This is often the case, since each message the channel dequeues (delivers) will result in at least one new message enqueued (submitted) but possibly more than one. For example, if a message has two recipients reached via different channels, then two enqueues will be required. Or if a message bounces, a copy will go back to the sender and another copy may be sent to the postmaster. Usually that will be two submissions (unless both are reached through the same channel).
More generally, the connection between Submitted and Delivered varies according to type of channel. For example, in the conversion channel, a message would be enqueued by some other arbitrary channel, and then the conversion channel would process that message and enqueue it to a third channel and mark the message as dequeued from its own queue. Each individual message takes a path:
elsewhere -> conversion    E record    Received
conversion -> elsewhere    E record    Submitted
conversion                 D record    Delivered
However, for a channel such as tcp_local which is not a “pass through,” but rather has two separate pieces (slave and master), there is no connection between Submitted and Delivered. The Submitted counter has to do with the SMTP server portion of the tcp_local channel, whereas the Delivered counter has to do with the SMTP client portion of the tcp_local channel. Those are two completely separate programs, and the messages travelling through them may be completely separate.
Messages submitted to the SMTP server:
tcp_local -> elsewhere E record Submitted
Messages sent out to other SMTP hosts via the SMTP client:
elsewhere -> tcp_local    E record    Received
tcp_local                 D record    Delivered
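The relationship between log records and counters described above can be modeled in a few lines. This is a simplified sketch, not the MTA's implementation: the class `ChannelCounters` is hypothetical, it counts messages only (not recipients or blocks), and it assumes every E record names both the enqueuing channel and the destination channel.

```python
from collections import defaultdict


class ChannelCounters:
    """Toy model of the Received, Submitted, and Delivered counters."""

    def __init__(self):
        self.received = defaultdict(int)
        self.submitted = defaultdict(int)
        self.delivered = defaultdict(int)

    def enqueue(self, source, destination):
        """E record: channel `source` enqueues a message to `destination`."""
        self.submitted[source] += 1
        self.received[destination] += 1

    def dequeue(self, channel):
        """D record: `channel` removes a message from its own queue."""
        self.delivered[channel] += 1


counters = ChannelCounters()

# Pass-through channel: one message through the conversion channel.
counters.enqueue("elsewhere", "conversion")
counters.enqueue("conversion", "elsewhere")
counters.dequeue("conversion")

# tcp_local: one message accepted by the SMTP server side only.
counters.enqueue("tcp_local", "elsewhere")

print(counters.submitted["conversion"], counters.delivered["conversion"])
print(counters.submitted["tcp_local"], counters.delivered["tcp_local"])
```

In this model the conversion channel ends with matching Submitted and Delivered counts, while tcp_local shows a submission with no corresponding delivery, mirroring the independence of its server and client portions.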
For performance reasons, a node running the MTA keeps a cache of channel counters in memory using a shared memory section (UNIX) or shared file-mapping object (NT). As processes on the node enqueue and dequeue messages, they update the counters in this in-memory cache. If the in-memory section does not exist when a channel runs, the section will be created automatically. (The imsimta start command also creates the in-memory section, if it does not exist.)
The command imsimta counters -clear or the imsimta qm command counters clear may be used to reset the counters to zero.
The imsimta qm counters utility displays MTA channel queue message counters. You must be root or inetuser to run this utility. The output fields are the same as those described in imsimta counters. See also imsimta counters in Sun Java System Messaging Server 6 2005Q4 Administration Reference.
Example:
# imsimta counters -create
# imsimta qm counters show
Channel                   Messages  Recipients      Blocks
----------------------  ----------  ----------  ----------
tcp_intranet
  Received                   13077       13859      264616
  Stored                        92          91        -362
  Delivered                  12985       13768      264978
  Submitted                   2594        2594        3641
...
Every time you restart the MTA, you must run: # imsimta counters -create
Messaging Server supports system monitoring through the Simple Network Management Protocol (SNMP). Using an SNMP client (sometimes called a network manager) such as Sun Net Manager or HP OpenView (not provided with this product), you can monitor certain parts of the Messaging Server. Refer to Appendix A, SNMP Support for details.
You can monitor mailbox quota usage and limits by using the imquotacheck utility. The imquotacheck utility generates a report that lists defined quotas and limits, and provides information on quota usage.
For example, the following command lists all user quota information:
% imquotacheck
-------------------------------------------------------------------------
Domain red.siroe.com (diskquota = not set msgquota = not set) quota usage
-------------------------------------------------------------------------
diskquota   size(K)  %use  msgquota   msgs  %use  user
# of domains = 1
# of users = 705
no quota      50418        no quota   4392        ajonk
no quota          5        no quota      2        andrt
no quota     355518        no quota   2500        ansri
...
The following example shows the quota usage for user sorook:
% imquotacheck -u sorook
-------------------------------------------------------------------------
quota usage for user sorook
-------------------------------------------------------------------------
diskquota   size(K)  %use  msgquota   msgs  %use  user
no quota       1487        no quota    305        sorook
Messaging Server provides two processes, watcher and msprobe, to monitor various system services. watcher watches for server crashes and restarts the servers as necessary. msprobe monitors server hangs (unresponsiveness). Specifically, msprobe monitors the following:
Server Response Time. msprobe connects to the enabled servers using their protocol commands and measures their response times. If the response time exceeds the alarm warning threshold, an alarm message is sent (see Alarm Messages). If msprobe is unable to connect to a server, or the server response time exceeds a specified timeout period, the server is restarted. Server response times are recorded in a counter database and are logged to the default log file. counterutil can be used to display the server response time statistics (see counterutil).
The following servers are monitored by msprobe: imap, pop, http, cert, job_controller, smtp, lmtp, mmp and ens. When smtp or lmtp are not responding, the dispatcher is restarted. ens cannot be automatically restarted.
Disk usage. msprobe checks the disk availability and usage for every message store partition. Specifically, it checks the message store mboxlist database directory and the MTA queue directory. If disk usage exceeds a configured threshold, an alarm message is sent. The disk sizes and usages are recorded in a counter database and are logged to the default log file. Administrators can use the counterutil utility (see counterutil) to display the disk usage statistics.
Message Store mboxlist Database Log File Accumulation. Log file accumulation is an indication of an mboxlist database error. msprobe counts the number of active log files, and if that number is larger than the threshold, msprobe logs a critical error message to the default log file to inform the administrator to restart the server. If autorestart is enabled (local.autorestart set to yes), the store daemon is restarted.
watcher and msprobe are controlled by the configutil options shown in Table 23–5. Further information can be found in Automatic Restart of Failed or Unresponsive Services.
Table 23–5 msprobe and watcher configutil Options
Options | Description
---|---
 | Enable automatic server restart. Automatically restarts failed or hung services. Default: no
 | Failure retry time-out. If a server fails more than twice in this designated amount of time, then the system stops trying to restart the server. The value (set in seconds) should be longer than the msprobe interval (local.schedule.msprobe). Default: 600 seconds
 | Timeout for a specific server before restart. service can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens. Default: use service.readtimeout
 | Number of seconds of a specific server’s non-response before a warning message is logged to the default log file. service can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens. Default: use local.probe.warningthreshold
 | Number of seconds of server non-response before a warning message is logged to the default log file. Default: 5 seconds
 | MTA queue directory to check whether the queue size exceeds the threshold defined by alarm.diskavail.msgalarmthreshold. Default: none
 | Period of server non-response before restarting that server. See local.schedule.msprobe. Default: 10 seconds
 | msprobe run schedule. A crontab style schedule string (see Table 18–10).
 | Enable watcher, which monitors service failures (IMAP, POP, HTTP, job controller, dispatcher, message store (stored), imsched, and MMP). LMTP/SMTP servers are monitored by the dispatcher, and LMTP/SMTP clients are monitored by the job_controller. Logs error messages to the default log file for specific failures. Default: on
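The failure retry time-out rule in Table 23–5 (stop restarting once a server has failed more than twice within the time-out) can be sketched as a sliding-window check. This is an illustrative model only: the class `RestartPolicy` and its parameter names are hypothetical, and the model assumes that restarts resume once older failures fall outside the window, which the text does not state.

```python
from collections import deque


class RestartPolicy:
    """Decide whether a failed server should be restarted."""

    def __init__(self, timeout_seconds=600, max_failures=2):
        self.timeout = timeout_seconds    # failure retry time-out
        self.max_failures = max_failures  # "more than twice" gives up
        self.failures = deque()           # times of recent failures

    def should_restart(self, now):
        """Record a failure at time `now` (seconds) and return the decision."""
        self.failures.append(now)
        # Forget failures that are older than the time-out window.
        while now - self.failures[0] > self.timeout:
            self.failures.popleft()
        return len(self.failures) <= self.max_failures


policy = RestartPolicy()
for t in (0, 100, 200):  # three failures within 600 seconds
    print(t, policy.should_restart(t))
```

With the default 600-second window, the first two failures lead to a restart and the third does not.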
msprobe can issue alarms in the form of email messages to the postmaster (see To Monitor imapd, popd and httpd) warning of a specified condition. A sample email alarm sent when a certain threshold is exceeded is shown below:
Subject: ALARM: server response time in seconds of “ldap_siroe.com_389” is 10
Date: Tue, 17 Jul 2001 16:37:08 -0700 (PDT)
From: postmaster@siroe.com
To: postmaster@siroe.com

Server instance: /opt/SUNWmsgsr
Alarmid: serverresponse
Instance: ldap_siroe_europa.com_389
Description: server response time in seconds
Current measured value (17/Jul/2001:16:37:08 -0700): 10
Lowest recorded value: 0
Highest recorded value: 10
Monitoring interval: 600 seconds
Alarm condition is when over threshold of 10
Number of times over threshold: 1
You can specify how often msprobe monitors disk and server performance, and under what circumstances it sends alarms. This is done by using the configutil command to set the alarm parameters. Table 23–6 shows useful alarm parameters along with their default setting. See configutil Parameters in Sun Java System Messaging Server 6 2005Q4 Administration Reference.
Table 23–6 Useful Alarm Message configutil Parameters
Parameter | Description (Default in parentheses)
---|---
 | (localhost) Machine to which warning messages are sent.
 | (25) The SMTP port to which to connect when sending an alarm message.
 | (Postmaster@localhost) Recipient of the alarm notice.
 | (Postmaster@localhost) Address of the sender of the alarm.
 | (percentage mail partition diskspace available.) Text for the description field of the disk availability alarm.
 | (3600) Interval in seconds between disk availability checks. Set to 0 to disable checking of disk usage.
 | (10) Percentage of disk space availability below which an alarm is sent.
 | (-1) Specifies whether the alarm is issued when disk space availability goes below the threshold (-1) or above it (1).
 | (24) Interval in hours between subsequent repetitions of the disk availability alarm.
 | (server response time in seconds.) Text for the description field of the server response alarm.
 | (600) Interval in seconds between server response checks. Set to 0 to disable checking of server response.
 | (10) If the server response time in seconds exceeds this value, an alarm is issued.
 | (1) Specifies whether the alarm is issued when the server response time is greater than (1) or less than (-1) the threshold.
 | (24) Interval in hours between subsequent repetitions of the server response alarm.
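The interaction between a threshold, its direction flag, and the repetition interval can be sketched as follows. This is a hedged illustration, not msprobe's actual logic: the names `threshold_crossed` and `Alarm` are hypothetical, and the use of strict comparisons (a value equal to the threshold does not trigger) is an assumption.

```python
def threshold_crossed(value, threshold, direction):
    """direction -1: alarm when value is below the threshold; 1: when above."""
    if direction == -1:
        return value < threshold
    if direction == 1:
        return value > threshold
    raise ValueError("direction must be -1 or 1")


class Alarm:
    """Send an alarm when the threshold is crossed, repeating at most
    once per `repeat_hours`."""

    def __init__(self, threshold, direction, repeat_hours=24):
        self.threshold = threshold
        self.direction = direction
        self.repeat_seconds = repeat_hours * 3600
        self.last_sent = None  # time of the last alarm, in seconds

    def check(self, value, now):
        """Return True if an alarm should be sent at time `now` (seconds)."""
        if not threshold_crossed(value, self.threshold, self.direction):
            return False
        if self.last_sent is not None and now - self.last_sent < self.repeat_seconds:
            return False  # still inside the repetition interval
        self.last_sent = now
        return True


# Disk availability defaults from Table 23-6: threshold 10 percent, direction -1.
disk = Alarm(threshold=10, direction=-1)
print(disk.check(8, now=0))       # below threshold: alarm
print(disk.check(7, now=3600))    # repeated too soon: no alarm
print(disk.check(7, now=86400))   # 24 hours later: alarm again
```

For the server response alarm, the same class would be created with the defaults `threshold=10` and `direction=1`.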