Sun Java System Messaging Server 6.3 Administration Guide

Chapter 27 Monitoring Messaging Server

In most cases, a well-planned, well-configured server will perform without extensive intervention from an administrator. As an administrator, however, it is your job to monitor the server for signs of problems. This chapter describes how to monitor Messaging Server.

Troubleshooting procedures can be found in Chapter 26, Troubleshooting the MTA.

27.1 Automatic Monitoring and Restart

Messaging Server provides a way to transparently monitor services and automatically restart them if they crash or become unresponsive (that is, if the services hang or freeze). It can monitor all message store, MTA, and MMP services including the IMAP, POP, HTTP, job controller, dispatcher, and MMP servers. It does not monitor other services such as SMS or TCP/SNMP servers. (TCP/SNMP is monitored by the job controller.) Refer to 4.5 Automatic Restart of Failed or Unresponsive Services and 27.8.9 Monitoring Using msprobe and watcher Functions.

27.2 Daily Monitoring Tasks

The most important tasks you should perform on a daily basis are checking postmaster mail, monitoring the log files, and setting up the msprobe utility. These tasks are described below.

27.2.1 Checking postmaster Mail

Messaging Server has a predefined administrative mailing list set up for postmaster email. Any users who are part of this mailing list will automatically receive mail addressed to postmaster.

The rules for postmaster mail are defined in RFC 822, which requires every email site to accept mail addressed to a user or mailing list named postmaster and to deliver that mail to a real person. All messages sent to postmaster@host.domain are sent to a postmaster account or mailing list.

Typically, the postmaster address is where users should send email about their mail service. As postmaster, you might receive mail from local users about server response time, from other server administrators who are encountering problems sending mail to your server, and so on. You should check postmaster mail daily.

You can also configure the server to send certain error messages to the postmaster address. For example, when the MTA cannot route or deliver a message, you can be notified via email sent to the postmaster address. You can also send exception condition warnings (low disk space, poor server response) to postmaster.

27.2.2 Monitoring and Maintaining the Log Files

Messaging Server creates a separate set of log files for each of the major protocols or services it supports, including SMTP, IMAP, POP, and HTTP. These are located in msg-svr-base/data/log. You should monitor the log files routinely, especially if you are having problems with the server.

Be aware that logging can impact server performance. The more verbose the logging you specify, the more disk space your log files will occupy for a given amount of time. You should define effective but realistic log rotation, expiration, and backup policies for your server. For information about defining logging policies for your server, see Chapter 25, Managing Logging.

27.2.3 Setting Up the msprobe Utility

The msprobe utility automatically performs monitoring and restart functions. For further information, see 27.8.9 Monitoring Using msprobe and watcher Functions.

27.3 Monitoring System Performance

This chapter focuses on Messaging Server monitoring; however, you will also need to monitor the system on which the server resides. A well-configured server cannot perform well on a poorly tuned system, and symptoms of server failure may indicate that the hardware is not powerful enough to serve the email load. This chapter does not provide all the details for monitoring system performance, as many of these procedures are platform-specific and may require that you refer to the platform-specific system documentation. The following procedures are described here for performance monitoring: monitoring end-to-end message delivery times, monitoring disk space, and monitoring CPU usage.

27.3.1 Monitoring End-to-end Message Delivery Times

Email needs to be delivered on time. This may be a service agreement requirement, but also it is good policy to have mail delivered as quickly as possible. Slow end-to-end times could indicate a number of things. It may be that the server is not working properly, or that certain times of the day experience overwhelming message loads, or that the existing hardware resources are being pushed beyond their capacity.

27.3.1.1 Symptoms of Poor End-to-end Message Delivery Times

Mail takes longer than normal to be delivered.

27.3.1.2 To Monitor End-to-end Message Delivery Times

Use any facility that sends a message and receives it. Compare the header timestamps between server hops, and the time between point of origin and retrieval. See 27.8.1 immonitor-access.
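As a rough illustration of comparing header timestamps between hops, the following Python sketch reads the Received headers of a retrieved test message and reports the delay in seconds between consecutive hops. The function name hop_delays, the host names, and the sample message are hypothetical and not part of Messaging Server. The sketch assumes that each Received header ends with a semicolon followed by a standard mail date.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def hop_delays(raw_message):
    """Return the seconds elapsed between consecutive Received headers.

    Each server prepends its own Received header, so the list is
    reversed to put the earliest hop first.
    """
    msg = message_from_string(raw_message)
    stamps = []
    for header in reversed(msg.get_all("Received", [])):
        # The timestamp follows the last semicolon in a Received header.
        _, _, date_part = header.rpartition(";")
        stamps.append(parsedate_to_datetime(date_part.strip()))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# Hypothetical two-hop test message (host names are placeholders).
SAMPLE = (
    "Received: from relay.example.com by store.example.com;"
    " Mon, 6 Oct 2003 00:24:10 -0700\n"
    "Received: from client.example.com by relay.example.com;"
    " Mon, 6 Oct 2003 00:24:03 -0700\n"
    "Subject: probe\n"
    "\n"
    "test body\n"
)

if __name__ == "__main__":
    print(hop_delays(SAMPLE))
```

Recording these delays at regular intervals gives a baseline against which unusually slow hops stand out.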

27.3.2 Monitoring Disk Space

Inadequate disk space is one of the most common causes of mail server problems and failure. Without space to write to the MTA queues or to the message store, the mail server will fail. In addition, unless log files are monitored and cleaned up, they can grow uncontrollably, filling up all disk space.

Message store partitions grow as new messages are delivered to the mailboxes; for example, if message store quotas are not enforced, the message store can outgrow the disk space available for a partition. Another cause of running out of disk space is the MTA message queues growing too large. A third area of concern is a problem with the log file monitoring facilities that lets the log files grow uncontrollably. (Note that there are a number of log files such as LDAP, MTA, and Message Access, and that each of these log files can be stored on different disks.)

27.3.2.1 Symptoms of Disk Space Problems

Different symptoms can occur depending on which disk or partition is running out of space. MTA queues can overflow and reject SMTP connections, messages might remain in the ims_master queue and not be delivered to the message store, and log files can overflow.

If a message store partition fills up, message access daemons can fail, and message store data can be corrupted. Message store maintenance utilities such as imexpire and reconstruct can repair the damage and reduce disk usage. However, these utilities require additional disk space, and repairing a partition that has filled an entire disk can potentially cause down time.

27.3.2.2 To Monitor Disk Space

Depending upon the system configuration you may need to monitor various disks and partitions. For example, MTA queues may reside on one disk/partition, message stores may reside on another, and log files may reside on yet another. Each of these spaces will require monitoring and the methods to monitor these spaces may differ.

Messaging Server provides specific methods for monitoring message store disk usage and preventing partitions from filling up all available disk space.

You can take the following steps to monitor the message store’s use of disk space:

Set parameters to monitor message store disk usage
Lock message store partitions when a disk-usage threshold is reached

For details, see the sections that follow: Monitoring the Message Store and Monitoring Message Store Partitions.

Monitoring the Message Store

It is recommended that message store disk usage not exceed 75% capacity. You can monitor message store disk usage by configuring the following alarm attributes using the configutil utility:

alarm.diskavail.msgalarmstatinterval
alarm.diskavail.msgalarmthreshold
alarm.diskavail.msgalarmwarninginterval
alarm.diskavail.msgalarmdescription

By setting these parameters, you can specify how often the system should monitor disk space and under what circumstances the system should send a warning. For example, if you want the system to monitor disk space every 600 seconds, specify the following command:

configutil -o alarm.diskavail.msgalarmstatinterval -v 600

If you want to receive a warning whenever available disk space falls below 20%, specify the following command:

configutil -o alarm.diskavail.msgalarmthreshold -v 20

Refer to Table 27–6 for more information on these parameters.
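The threshold comparison behind the 20% example can be sketched independently of Messaging Server. The Python below is only an illustration of the arithmetic, not the implementation used by the server; the function names and the path argument are hypothetical.

```python
import shutil


def percent_available(total_bytes, free_bytes):
    """Return free space as a percentage of total space."""
    return 100.0 * free_bytes / total_bytes


def below_threshold(total_bytes, free_bytes, threshold_percent=20):
    """True when available space has fallen below the warning threshold."""
    return percent_available(total_bytes, free_bytes) < threshold_percent


def partition_below_threshold(path, threshold_percent=20):
    """Apply the same check to the file system that holds path."""
    usage = shutil.disk_usage(path)
    return below_threshold(usage.total, usage.free, threshold_percent)
```

A script could call partition_below_threshold with the path of a message store partition as an additional, external check alongside the alarm parameters.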

Monitoring Message Store Partitions

You can halt messages from being delivered to a message store partition when the partition fills more than a specified percentage of available disk space. This is done by setting two configutil parameters to enable the feature and specify the disk-usage threshold.

With this feature, the message store daemon monitors the partition’s disk usage. As disk usage increases, the store daemon dynamically checks the partition more frequently (ranging from once every 100 minutes to once a minute).

If disk usage goes higher than the specified threshold, the store daemon:

Locks the partition. Incoming messages are held in the MTA message queue, but not delivered to the mailboxes in the message store partition.
Logs a message to the default log file.
Sends an email notification to the postmaster. (You can change the recipient of the email by setting the configutil parameter alarm.msgalarmnoticercpt.)

When disk usage falls below the threshold, the partition is unlocked, and messages are again delivered to the store.

The configutil parameters are as follows:

local.store.checkdiskusage enables the partition-monitoring feature.

Allowable values: yes, no

Default value: yes

local.store.diskusagethreshold specifies the disk-usage threshold. The value of local.store.diskusagethreshold is a percentage from 1 to 99.

Default value: 99

You should set the disk-usage threshold to a percentage low enough to give you time to repartition or assign more disk space to the local message store.

For example, suppose a partition fills up disk space at a rate of 2 percent per hour, and it takes an hour to allocate additional disk space for the local message store. In this case, you should set the disk-usage threshold to a value lower than 98 percent.
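The arithmetic in this example can be written out as a short Python sketch. The function names and the safety_margin parameter are hypothetical and only illustrate the reasoning; the valid range of 1 to 99 comes from the parameter description above.

```python
def max_disk_usage_threshold(fill_rate_percent_per_hour, hours_to_add_space):
    """Upper bound for the threshold: 100% minus the space consumed
    while additional disk space is being allocated."""
    headroom = fill_rate_percent_per_hour * hours_to_add_space
    return 100 - headroom


def suggested_threshold(fill_rate_percent_per_hour, hours_to_add_space,
                        safety_margin=5):
    """Pick a value below the upper bound, clamped to the 1-99 range.

    safety_margin is an illustrative choice, not a documented setting.
    """
    bound = max_disk_usage_threshold(fill_rate_percent_per_hour,
                                     hours_to_add_space)
    return max(1, min(99, bound - safety_margin))


if __name__ == "__main__":
    print(max_disk_usage_threshold(2, 1))
    print(suggested_threshold(2, 1))
```

With a fill rate of 2 percent per hour and one hour needed to add space, the upper bound is 98, which matches the example; any threshold you choose should sit below it.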

Monitoring the MTA Queues and Logging Space

You will need to monitor MTA queue disk and logging space disk usage.

For information on managing logging space, see Chapter 25, Managing Logging. For example, to learn how to monitor the mail.log file, see 25.3 Managing MTA Message and Connection Logs.

27.3.3 Monitoring CPU Usage

High CPU usage is either a sign that there is not enough CPU capacity for the level of usage or some process is using up more CPU cycles than is appropriate.

27.3.3.1 Symptoms of CPU Usage Problems

Poor system response time.
Slow user logins.
Slow rate of delivery.

27.3.3.2 To Monitor CPU Usage

Monitoring CPU usage is a platform specific task. Refer to the relevant platform documentation.

27.4 Monitoring the MTA

27.4.1 Monitoring the Size of the Message Queues

Excessive message queue growth may indicate that messages are not being delivered, are being delayed in their delivery, or are coming in faster than the system can deliver them. Possible causes include a denial of service attack in which huge numbers of messages flood your system, or the Job Controller not running.

See 8.5.2 Channel Message Queues, 26.3.6 Messages are Not Dequeued and 26.3.7 MTA Messages are Not Delivered for more information on message queues.

27.4.1.1 Symptoms of Message Queue Problems

Disk space usage grows.
Users do not receive messages in a reasonable time.
Message queue sizes are abnormally high.

27.4.1.2 To Monitor the Size of the Message Queues

Probably the best way to monitor the message queues is to use imsimta qm and imsimta summarize. Refer to 27.8.6 imsimta qm counters.

You can also monitor the number of files in the queue directories (msg-svr-base/data/queue/). The number of files will be site-specific, and you’ll need to build a baseline history to find out what is “too many.” This can be done by recording the number of files in the queues over a two-week period to get an approximate average.
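One way to collect such a baseline is a small script run at regular intervals. The Python sketch below counts the files under each channel directory of a queue root. It assumes that the queue root contains one subdirectory per channel; the function name queue_file_counts is hypothetical.

```python
import os


def queue_file_counts(queue_root):
    """Map each channel directory under queue_root to its file count."""
    counts = {}
    for channel in sorted(os.listdir(queue_root)):
        channel_dir = os.path.join(queue_root, channel)
        if not os.path.isdir(channel_dir):
            continue
        total = 0
        # Message files may sit in nested subdirectories, so walk the tree.
        for _dirpath, _dirnames, filenames in os.walk(channel_dir):
            total += len(filenames)
        counts[channel] = total
    return counts
```

Appending the result to a dated record once or twice a day for two weeks gives the approximate average against which later readings can be compared.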

27.4.2 Monitoring Rate of Delivery Failure

A delivery failure is a failed attempt to deliver a message to an external site. A large increase in rate of delivery failure can be a sign of a network problem such as a dead DNS server or a remote server timing out on responding to connections.

27.4.2.1 Symptoms of Rate of Delivery Failure

There are no outward symptoms. A large number of Q records will appear in mail.log_current.

27.4.2.2 To Monitor the Rate of Delivery Failure

Delivery failures are recorded in the MTA logs with the logging entry code Q. Look at the record in the file msg-svr-base/data/log/mail.log_current. Example:

mail.log:06-Oct-2003 00:24:03.66 501d.0b.9 ims-ms Q 5 durai.balusamy@Sun.COM rfc822;durai.balusamy@Sun.COM durai@ims-ms-daemon <00ce01c38bda$c7e2b240$6501a8c0@guindy> Mailbox is busy
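To watch the rate rather than individual records, you can count Q records per channel. The following Python sketch is one possible approach, not a Messaging Server tool. It assumes that the entry code appears as a separate field within the first six whitespace-separated fields of a record and that the field immediately before it is the channel name, as in the example above; the function name count_entries is hypothetical.

```python
from collections import Counter


def count_entries(lines, code="Q"):
    """Count log records with the given entry code, grouped by channel."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Assumption: date and time come first, then optional process
        # information, then the channel, then the entry code.
        for i in range(3, min(len(fields), 6)):
            if fields[i] == code:
                counts[fields[i - 1]] += 1
                break
    return counts
```

Passing an open mail.log_current file object to count_entries returns a Counter keyed by channel; comparing the totals from successive runs shows whether the failure rate is rising.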

27.4.3 Monitoring Inbound SMTP Connections

An unusual increase in the number of inbound SMTP connections from a given IP address may indicate:

An external user is trying to relay mail.
An external user is attempting a denial of service attack.

27.4.3.1 Symptoms of Unauthorized SMTP Connections

External user relaying mail: No outward symptoms.
Denial of service attack: An external party attempts to overload the SMTP servers with message requests.

27.4.3.2 To Monitor Inbound SMTP Connections

External user relaying mail: Look in msg-svr-base/log/mail.log_current for records with the logging entry code J (rejected relays). To turn on logging of remote IP addresses add the following line to the option.dat file:

log_connection=1

Note that there is a slight performance trade-off when this feature is enabled.

Denial of service attack: To find out who and how many users are connecting to the SMTP servers, run the netstat command and check for connections at the SMTP port (default: 25). Example:

Local Address       Remote Address        Swind Send-Q Rwind Recv-Q State
192.18.79.44.25     192.18.78.44.56035    32768   0  32768   0   CLOSE_WAIT
192.18.79.44.25     192.18.136.54.57390    8760   0  24820   0   ESTABLISHED
192.18.79.44.25     192.18.26.165.48508   33580   0  24820   0   TIME_WAIT

Note that you will first need to determine the appropriate number of SMTP connections and their states (ESTABLISHED, CLOSE_WAIT, etc.) for your system to determine if a particular reading is out of the ordinary.
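Building that baseline is easier if the netstat output is summarized. The Python sketch below counts connections to the local SMTP port by remote host and state. It assumes the address format shown in the example, where the port follows the final dot of the address; the function name smtp_connections is hypothetical.

```python
from collections import Counter

# Connection lines copied from the netstat example above.
SAMPLE = """\
192.18.79.44.25     192.18.78.44.56035    32768   0  32768   0   CLOSE_WAIT
192.18.79.44.25     192.18.136.54.57390    8760   0  24820   0   ESTABLISHED
192.18.79.44.25     192.18.26.165.48508   33580   0  24820   0   TIME_WAIT
"""


def smtp_connections(netstat_output, port="25"):
    """Count connections to the local SMTP port by (remote host, state)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        local, remote, state = fields[0], fields[1], fields[-1]
        # The port is the part of the local address after the final dot.
        if local.rpartition(".")[2] != port:
            continue
        remote_host = remote.rpartition(".")[0]
        counts[(remote_host, state)] += 1
    return counts


if __name__ == "__main__":
    for (host, state), n in sorted(smtp_connections(SAMPLE).items()):
        print(host, state, n)
```

A single remote host with an unusually high count, or many entries in one state, is the kind of reading to compare against your normal figures.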

If you find many connections staying in the SYN_RECEIVED state, this might be caused by a broken network or a denial of service attack. In addition, the lifetime of an SMTP server process is limited. This is controlled by the MTA configuration variable MAX_LIFE_TIME in the dispatcher.cnf file. The default is 86,400 seconds (one day). Similarly, MAX_LIFE_CONNS specifies the maximum number of connections a server process can handle in its lifetime. If you find a particular SMTP server that has been around for a long time, you may wish to investigate.

27.4.4 Monitoring the Dispatcher and Job Controller Processes

The Dispatcher and Job Controller processes must be operating for the MTA to work. You should have one process of each kind.

27.4.4.1 Symptoms of Dispatcher and Job Controller Processes Down

If the Dispatcher is down or does not have enough resources, SMTP connections are refused.

If the Job Controller is down, queue size will grow.

27.4.4.2 To Monitor Dispatcher and Job Controller Processes

Check to see that the processes called dispatcher and job_controller exist. See 26.2.4 Check that the Job Controller and Dispatcher are Running.
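A check of this kind can be scripted. The Python sketch below takes a list of command names, such as the lines printed by ps -e -o comm=, and reports which required processes are absent. The function names are hypothetical, and the exact form of the command names that ps prints varies by platform.

```python
import subprocess


def missing_processes(command_names,
                      required=("dispatcher", "job_controller")):
    """Return the required process names not found in command_names."""
    # Compare base names so that full executable paths also match.
    running = {name.strip().rsplit("/", 1)[-1]
               for name in command_names if name.strip()}
    return [name for name in required if name not in running]


def running_command_names():
    """List command names of all processes using POSIX ps options."""
    result = subprocess.run(["ps", "-e", "-o", "comm="],
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

Calling missing_processes(running_command_names()) returns an empty list when both processes are present. The same function can check other servers by passing a different required tuple.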

27.5 Monitoring LDAP Directory Server

27.5.1 Monitoring slapd

The LDAP directory server (slapd) provides directory information for the messaging system. If slapd is down, the system will not work properly. If slapd response time is too slow, this will affect login speed and any other transaction that requires LDAP lookups.

27.5.1.1 Symptoms of slapd Problems

Client POP, IMAP, or Webmail authentication fails or is slower than expected.
The MTA does not work properly.

27.5.1.2 To Monitor slapd

Check that the ns-slapd process is running.
Check the slapd access and errors log files in slapd-instance/logs/.
Check the ns-slapd response time while searching for a user.
See also 27.8.1 immonitor-access.

27.6 Monitoring Message Access

27.6.1 Monitoring imapd, popd and httpd

These processes provide access to IMAP, POP and Webmail services. If any of these is not running or not responding, the service will not function appropriately. If the service is running but is overloaded, monitoring will allow you to detect this and configure it more appropriately.

27.6.1.1 Symptoms of imapd, popd and httpd Problems

Connections are refused or system is too slow to connect. For example, if IMAP is not running and you try to connect to IMAP directly you will see something like this:

telnet 0 143
Trying 0.0.0.0...
telnet: Unable to connect to remote host: Connection refused

If you try to connect with a client, you will get a message such as:

“Client is unable to connect to the server at the location you have specified. The server may be down or busy.”

27.6.1.2 To Monitor imapd, popd and httpd

The servers can be monitored with watcher and msprobe. See 4.5 Automatic Restart of Failed or Unresponsive Services and 27.8.9 Monitoring Using msprobe and watcher Functions.
The servers can be monitored with SNMP.

If you have SNMP set up, this is a very good way to monitor these processes. See Appendix A, SNMP Support. The server information is in the Network Services Monitoring MIB.
Check the log files.

Look in the directory msg-svr-base/log/service, where service is http, imap, or pop. In that directory you will find a number of log files. One filename is the name of the service (imap, pop, http); the others are the service name followed by a sequence number and a date. For example:

imap imap.29.1010221593 imap.31.1010394412 imap.33.1010567224

The file with just the service name is the latest log. The others are ordered by sequence number (here 29, 31, 33), and the one with the highest sequence number is the next newest. (See Chapter 25, Managing Logging.)

If a server was shut down you might see something like this:

imap.12.1065431243:[07/Oct/2003:01:15:43 -0700] gotmail-2 imapd[20525]: General Warning: Sun Java System Messaging Server IMAP4 6.1 (built Sep 24 2003) shutting down

The servers can be checked with counterutil. See 27.8.3 counterutil and counterutil in Sun Java System Messaging Server 6.3 Administration Reference.
Run the platform-specific command to verify that the imapd, popd and httpd processes are running. For example, in Solaris you can use the ps command and look for imapd, popd and mshttpd.
You can set alarms for specified server performance thresholds by setting the server response configuration parameters described in 27.8.9.1 Alarm Messages.
See 27.8.1 immonitor-access.

27.7 Monitoring the Message Store

Messages are stored in a database. The distribution of users on disks, the size of their mailboxes, and disk requirements affect store performance. The following subsections describe how to monitor stored and the state of message store database locks.

27.7.1 Monitoring stored

stored performs a variety of important tasks such as deadlock and transaction operations of the message database, enforcing aging policies, and expunging and erasing messages stored on disk. If stored stops running, the messaging server will eventually run into problems. If stored doesn’t start when start-msg is run, no other processes will start. For more information about stored see stored in Sun Java System Messaging Server 6.3 Administration Reference.

27.7.1.1 Symptoms of stored Problems

There are no outward symptoms.

27.7.1.2 To Monitor stored

Check that the stored process is running. stored creates and updates a pid file in msg-svr-base/data/proc called store. The pid file shows an init state when recovering and a ready state when ready. For example:

231: cat store
28250
ready

The number on the first line is the process ID of stored.

232: ps -eaf | grep stored
inetuser 28250 1 0 Jan 05 ? 8:44 /opt/SUNWmsgsr/lib/stored -d
Check for log file build up in msg-svr-base/store/mboxlist. Note that not every log file build up is caused by direct stored problems. Log files may also build up if imapd dies or there is a database problem.
Check the timestamp on the following files in msg-svr-base/config:

stored.ckp - Touched when an attempt at checkpointing is made. Should get time stamped every 1 minute.
stored.lcu - Touched at every db log cleanup. Should get time stamped every 5 minutes.
stored.per - Touched at every spawn of peruser db writeout. Should get time stamped every 60 minutes.
Check for stored messages in the default log file msg-svr-base/log/default/default
Can be monitored with watcher and msprobe. See 4.5 Automatic Restart of Failed or Unresponsive Services and 27.8.9 Monitoring Using msprobe and watcher Functions.
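The timestamp check on stored.ckp, stored.lcu, and stored.per can be automated. The Python sketch below compares each file's modification time with the interval listed above and reports files that look stale. The function name stale_files and the slack factor are hypothetical; a missing file is reported as stale.

```python
import os
import time

# Expected touch intervals in seconds, from the list above.
EXPECTED_INTERVALS = {
    "stored.ckp": 60,
    "stored.lcu": 300,
    "stored.per": 3600,
}


def stale_files(config_dir, now=None, slack=2.0):
    """Return names of files not touched within slack * expected interval."""
    now = time.time() if now is None else now
    stale = []
    for name, interval in sorted(EXPECTED_INTERVALS.items()):
        path = os.path.join(config_dir, name)
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            stale.append(name)  # a missing file is treated as stale
            continue
        if age > interval * slack:
            stale.append(name)
    return stale
```

The slack factor allows for a check that runs slightly late; a value of 2.0 flags a file only after two expected intervals have passed without an update.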

27.7.2 Monitoring the State of Message Store Database Locks

Database locks are held by different server processes, and the state of these locks can affect the performance of the message store. In case of deadlocks, messages are not inserted into the store at reasonable speeds, and the ims-ms channel queue grows larger as a result. There are legitimate reasons for a queue to back up, so it is useful to have a history of the queue length in order to diagnose problems.

27.7.2.1 Symptoms of Message Store Database Lock Problems

Transactions accumulate and do not resolve.

27.7.2.2 To Monitor Message Store Database Locks

Use the command imcheck -s (formerly counterutil -o db_lock).

27.8 Utilities and Tools for Monitoring

The following tools are available for monitoring:

27.8.1 immonitor-access

immonitor-access monitors the status of the following Messaging Server components/processes: Mail Delivery (SMTP server), Message Access and Store (POP and IMAP servers), Directory Service (LDAP server) and HTTP server. This utility measures the response times of the various services and the total round trip time taken to send and retrieve a message. The Directory Service is monitored by looking up a specified user in the directory and measuring the response time. Mail Delivery is monitored by sending a message (SMTP) and the Message Access and Store is monitored by retrieving it. Monitoring the HTTP server is limited to finding out whether or not it is up and running.

For complete instructions, refer to immonitor-access in Sun Java System Messaging Server 6.3 Administration Reference.

27.8.2 imcheck

Use imcheck -s to monitor database statistics including logs and transactions.

27.8.3 counterutil

This utility provides statistics acquired from different system counters. Here is a current list of available counter objects:

# /opt/SUNWmsgsr/sbin/counterutil -l
Listing registry (/opt/SUNWmsgsr/data/counter/counter)
numobjects = 11
refcount = 1
created = 25/Sep/2003:02:04:55 -0700
modified = 02/Oct/2003:22:48:55 -0700
     entry = alarm 
     entry = diskusage
     entry = serverresponse
     entry = imapstat
     entry = httpstat
     entry = popstat
     entry = cgimsg

Each entry represents a counter object and supplies a variety of useful counts for this object. In this section we will only be discussing the alarm, diskusage, serverresponse, popstat, imapstat, and httpstat counter objects. For details on counterutil command usage, refer to counterutil in Sun Java System Messaging Server 6.3 Administration Reference.

27.8.3.1 counterutil Output

counterutil has a variety of flags. A command format for this utility may be as follows:

counterutil -o CounterObject -i 5 -n 10

where,

-o CounterObject represents one of the counter objects alarm, diskusage, serverresponse, popstat, imapstat, or httpstat.

-i 5 specifies a 5 second interval.

-n 10 represents the number of iterations (default: infinity).

An example of counterutil usage is as follows:

# counterutil -o imapstat -i 5 -n 10 
Monitor counteroobject (imapstat) 
registry /gotmail/iplanet/server5/msg-gotmail/counter/counter opened 
counterobject imapstat opened 

count = 1 at 972082466 rh = 0xc0990 oh = 0xc0968 

global.currentStartTime [4 bytes]: 17/Oct/2000:12:44:23 -0700 
global.lastConnectionTime [4 bytes]: 20/Oct/2000:15:53:37 -0700 
global.maxConnections [4 bytes]: 69 
global.numConnections [4 bytes]: 12480 
global.numCurrentConnections [4 bytes]: 48 
global.numFailedConnections [4 bytes]: 0 
global.numFailedLogins [4 bytes]: 15 
global.numGoodLogins [4 bytes]: 10446 
...
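Output in this form is straightforward to post-process. The Python sketch below parses lines of the form name [N bytes]: value into a dictionary and derives a failed-login ratio from it. It assumes the line layout shown in the example; the function names parse_counters and failed_login_ratio are hypothetical.

```python
import re

# Matches lines such as "global.numGoodLogins [4 bytes]: 10446".
COUNTER_LINE = re.compile(r"^(\S+)\s+\[\d+ bytes\]:\s*(.*?)\s*$")


def parse_counters(text):
    """Return a dict of counter names to values (int where numeric)."""
    values = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line.strip())
        if match:
            name, raw = match.groups()
            values[name] = int(raw) if raw.isdigit() else raw
    return values


def failed_login_ratio(values):
    """Fraction of logins that failed, or 0.0 if there were none."""
    failed = values["global.numFailedLogins"]
    good = values["global.numGoodLogins"]
    total = failed + good
    return failed / total if total else 0.0
```

For the example output above, the failed and successful login counts are 15 and 10446, so the ratio is 15 divided by 10461. Tracking this ratio over time is one way to notice unusual login failure activity.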

27.8.3.2 Alarm Statistics Using counterutil

These alarm statistics refer to the alarms sent by stored. The alarm counter provides the following statistics:

Table 27–1 counterutil alarm Statistics


Suffix	Description
`alarm.countoverthreshold`	Number of times crossing threshold.
`alarm.countwarningsent`	Number of warnings sent.
`alarm.current`	Current monitored value.
`alarm.high`	Highest ever recorded value.
`alarm.low`	Lowest ever recorded value.
`alarm.timelastset`	The last time current value was set.
`alarm.timelastwarning`	The last time warning was sent.
`alarm.timereset`	The last time reset was performed.
`alarm.timestatechanged`	The last time alarm state changed.
`alarm.warningstate`	Warning state (yes(1) or no(0)).

27.8.3.3 IMAP, POP, and HTTP Connection Statistics Using counterutil

To get information on the number of current IMAP, POP, and HTTP connections, number of failed logins, total connections from the start time, and so forth, you can use the command counterutil -o CounterObject -i 5 -n 10, where CounterObject represents the counter object popstat, imapstat, or httpstat. The meaning of the imapstat suffixes is shown in Table 27–2. The popstat and httpstat objects provide the same information in the same format and structure.

Table 27–2 counterutil imapstat Statistics


Suffix	Description
`currentStartTime`	Start time of the current IMAP server process.
`lastConnectionTime`	The last time a new client was accepted.
`maxConnections`	Maximum number of concurrent connections handled by IMAP server.
`numConnections`	Total number of connections served by the current IMAP server.
`numCurrentConnections`	Current number of active connections.
`numFailedConnections`	Number of failed connections served by the current IMAP server.
`numFailedLogins`	Number of failed logins served by the current IMAP server.
`numGoodLogins`	Number of successful logins served by the current IMAP server.

27.8.3.4 Disk Usage Statistics Using counterutil

The command counterutil -o diskusage generates the following information:

Table 27–3 counterutil diskstat Statistics


Suffix	Description
`diskusage.availSpace`	Total space available in the disk partition.
`diskusage.lastStatTime`	The last time statistic was taken.
`diskusage.mailPartitionPath`	Mail partition path.
`diskusage.percentAvail`	Disk partition space available percentage.
`diskusage.totalSpace`	Total space in the disk partition.

27.8.3.5 Server Response Statistics

The command counterutil -o serverresponse generates the following information. This information is useful for checking whether the servers are running and how quickly they are responding.

Table 27–4 counterutil serverresponse Statistics


Suffix	Description
`http.laststattime`	Last time the HTTP server response was checked.
`http.responsetime`	Response time of the HTTP server.
`imap.laststattime`	Last time the IMAP server response was checked.
`imap.responsetime`	Response time of the IMAP server.
`pop.laststattime`	Last time the POP server response was checked.
`pop.responsetime`	Response time of the POP server.

27.8.4 Log Files

Messaging Server logs event records for SMTP, IMAP, POP, and HTTP. The policies for creating and managing the Messaging Server log files are customizable.

Since logging can affect the server performance, logging should be considered very carefully before the burden is put on the server. Refer to Chapter 25, Managing Logging for more information.

27.8.5 imsimta counters

The MTA accumulates message traffic counters, based upon the Mail Monitoring MIB (RFC 1566), for each of its active channels. The channel counters are intended to help indicate the trend and health of your e-mail system. Channel counters are not designed to provide an accurate accounting of message traffic. For precise accounting, see instead MTA logging as discussed in Chapter 25, Managing Logging.

The MTA channel counters are implemented using the lightest weight mechanisms available so that they cause as little impact as possible on actual operation. Channel counters do not try harder: if an attempt to map the section fails, no information is recorded; if one of the locks in the section cannot be obtained almost immediately, no information is recorded; when a system is shut down, the information contained in the in-memory section is lost forever.

The imsimta counters -show command provides MTA channel message statistics (see below). These counters need to be examined over time, noting the minimum values seen. The minimums may actually be negative for some channels. A negative value means that there were messages queued for a channel at the time that its counters were zeroed (for example, when the cluster-wide database of counters was created). When those messages were dequeued, the associated counters for the channel were decremented, leading to a negative minimum. For such a counter, the correct “absolute” value is the current value less the minimum value that counter has ever held since being initialized.

Channel          Messages    Recipients    Blocks 
-------          --------    ----------    ------- 
tcp_local
   Received       29379       79714      982252                (1)
   Stored            61         113       -2004                (2)
   Delivered      29369       79723      983903 (29369 first time)  (3)
   Submitted      13698       13699       18261                (4)
   Attempted          0           0           0                (5)
   Rejected           1          10           0                (6)
   Failed           104         104        4681                (7)

   Queue time/count        16425/29440 = 0.56                  (8)
   Queue first time/count  16425/29440 = 0.56                  (9)

   Total In Assocs           297637
   Total Out Assocs           28306

1) Received is the number of messages enqueued to the channel named tcp_local. That is, the messages enqueued (E records in the mail.log* file) to the tcp_local channel by any other channel.

2) Stored is the number of messages stored in the channel queue to be delivered.

3) Delivered is the number of messages which have been processed (dequeued) by the channel tcp_local. (That is, D records in the mail.log* file.) A dequeue operation may either correspond to a successful delivery (that is, an enqueue to another channel), or to a dequeue due to the message being returned to the sender. This will generally correspond to the number Received minus the number Stored.

The MTA also keeps track of how many of the messages were dequeued upon first attempt; this number is shown in parentheses.

4) Submitted is the number of messages enqueued (E records in the mail.log* file) by the channel tcp_local to any other channel.

5) Attempted is the number of messages which have experienced temporary problems in dequeuing, that is, Q or Z records in the mail.log* file.

6) Rejected is the number of attempted enqueues which have been rejected, that is, J records in the mail.log* file.

7) Failed is the number of attempted dequeues which have failed, that is, R records in the mail.log* file.

8) Queue time/count is the average time-spent-in-queue for the delivered messages. This includes both the messages delivered upon the first attempt, see (9), and the messages that required additional delivery attempts (hence typically spent noticeable time waiting fallow in the queue).

9) Queue first time/count is the average time-spent-in-queue for the messages delivered upon the first attempt.
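The averages in (8) and (9), and the correction for negative minimums described before the sample output, are simple arithmetic. The Python sketch below illustrates both; the function names are hypothetical, and the minimum of -5 in the usage lines is an invented value for illustration only.

```python
def average_queue_time(total_queue_time, message_count):
    """Average time in queue, as in the "Queue time/count" line."""
    if message_count == 0:
        return 0.0
    return total_queue_time / message_count


def absolute_count(current, minimum_seen):
    """Correct a counter whose minimum went negative after being zeroed.

    The absolute value is the current value less the minimum value the
    counter has held; counters that never went negative are unchanged.
    """
    if minimum_seen < 0:
        return current - minimum_seen
    return current


if __name__ == "__main__":
    # 16425/29440 is taken from the sample output above.
    print(round(average_queue_time(16425, 29440), 2))
    # A Stored count of 61 with a hypothetical observed minimum of -5.
    print(absolute_count(61, -5))
```

Keeping a record of the lowest value seen for each counter is what makes the second calculation possible, which is why the counters need to be examined over time.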

Note that the number of messages submitted can be greater than the number delivered. This is often the case, since each message the channel dequeues (delivers) will result in at least one new message enqueued (submitted) but possibly more than one. For example, if a message has two recipients reached via different channels, then two enqueues will be required. Or if a message bounces, a copy will go back to the sender and another copy may be sent to the postmaster. Usually that will be two submissions (unless both are reached through the same channel).
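
The enqueue count in the examples above can be sketched as one enqueue per distinct destination channel. This is a simplified illustration, not the MTA's actual logic; the function name and the recipient-to-channel mapping are hypothetical.

```python
def enqueues_needed(recipient_channels):
    """Return how many enqueues one dequeued message would produce.

    recipient_channels maps each recipient address to the name of the
    channel assumed to reach it (an illustrative mapping). One enqueue
    is counted per distinct destination channel.
    """
    return len(set(recipient_channels.values()))


# Two recipients reached through different channels: two enqueues.
print(enqueues_needed({"a@siroe.com": "ims-ms", "b@example.org": "tcp_local"}))

# A bounce whose sender copy and postmaster copy share one channel: one enqueue.
print(enqueues_needed({"sender@siroe.com": "ims-ms", "postmaster@siroe.com": "ims-ms"}))
```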

More generally, the connection between Submitted and Delivered varies according to type of channel. For example, in the conversion channel, a message would be enqueued by some other arbitrary channel, and then the conversion channel would process that message and enqueue it to a third channel and mark the message as dequeued from its own queue. Each individual message takes a path:

elsewhere -> conversion E record Received
conversion -> elsewhere E record Submitted
conversion              D record Delivered

However, for a channel such as tcp_local which is not a “pass through,” but rather has two separate pieces (slave and master), there is no connection between Submitted and Delivered. The Submitted counter has to do with the SMTP server portion of the tcp_local channel, whereas the Delivered counter has to do with the SMTP client portion of the tcp_local channel. Those are two completely separate programs, and the messages travelling through them may be completely separate.

Messages submitted to the SMTP server:

tcp_local -> elsewhere E record Submitted

Messages sent out to other SMTP hosts via the SMTP client:

elsewhere -> tcp_local E record Received
tcp_local              D record Delivered
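
The paths above can be tallied with a short sketch. The event tuples below are a simplified stand-in for log records (record type, source channel, destination channel) and are an assumption for illustration; they are not the actual mail.log* line format.

```python
def tally(channel, events):
    """Count Received, Submitted and Delivered for one channel.

    Each event is (record_type, source_channel, destination_channel).
    For a D record the destination is None, because the dequeue is
    recorded against the channel whose queue held the message.
    """
    counts = {"Received": 0, "Submitted": 0, "Delivered": 0}
    for record_type, source, destination in events:
        if record_type == "E" and destination == channel:
            counts["Received"] += 1   # enqueued to this channel
        if record_type == "E" and source == channel:
            counts["Submitted"] += 1  # enqueued by this channel
        if record_type == "D" and source == channel:
            counts["Delivered"] += 1  # dequeued from this channel
    return counts


# A pass-through channel: every message touches all three counters.
conversion_events = [
    ("E", "tcp_local", "conversion"),
    ("E", "conversion", "ims-ms"),
    ("D", "conversion", None),
]
print(tally("conversion", conversion_events))

# tcp_local: one inbound message (SMTP server) and one unrelated
# outbound message (SMTP client).
tcp_events = [
    ("E", "tcp_local", "ims-ms"),
    ("E", "ims-ms", "tcp_local"),
    ("D", "tcp_local", None),
]
print(tally("tcp_local", tcp_events))
```

In the second case the Submitted count comes from a different message than the Received and Delivered counts, which is why the two figures need not track each other for tcp_local.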

27.8.5.1 Implementation on UNIX and NT

For performance reasons, a node running the MTA keeps a cache of channel counters in memory using a shared memory section (UNIX) or shared file-mapping object (NT). As processes on the node enqueue and dequeue messages, they update the counters in this in-memory cache. If the in-memory section does not exist when a channel runs, the section will be created automatically. (The imsimta start command also creates the in-memory section, if it does not exist.)

The command imsimta counters -clear or the imsimta qm command counters clear may be used to reset the counters to zero.

27.8.6 imsimta qm counters

The imsimta qm counters utility displays MTA channel queue message counters. You must be root or mailsrv to run this utility. The output fields are the same as those described in 27.8.5 imsimta counters. See also imsimta counters in Sun Java System Messaging Server 6.3 Administration Reference.

Example:

# imsimta counters -create
# imsimta qm counters show
Channel                Messages   Recipients Blocks
---------------------- ---------- ---------- ----------
tcp_intranet
   Received              13077      13859     264616 
   Stored                   92         91       -362 
   Delivered             12985      13768     264978 
   Submitted              2594       2594       3641
...

Every time you restart the MTA, you must run the following command:

# imsimta counters -create
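
Output in the layout of the example above can be post-processed by a script. The following is a minimal sketch that assumes the column layout shown in the example (channel name on its own line, then indented rows with three numeric columns); the function name parse_counters is hypothetical.

```python
SAMPLE = """\
Channel                Messages   Recipients Blocks
---------------------- ---------- ---------- ----------
tcp_intranet
   Received              13077      13859     264616
   Stored                   92         91       -362
   Delivered             12985      13768     264978
   Submitted              2594       2594       3641
"""


def parse_counters(text):
    """Return {channel: {counter: (messages, recipients, blocks)}}."""
    channels = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the column header and the dashed separator.
        if not fields or fields[0] == "Channel" or set(line.strip()) <= {"-", " "}:
            continue
        if not line[0].isspace() and len(fields) == 1:
            # An unindented single word starts a new channel.
            current = channels.setdefault(fields[0], {})
        elif current is not None and len(fields) == 4:
            current[fields[0]] = tuple(int(value) for value in fields[1:])
    return channels


counters = parse_counters(SAMPLE)["tcp_intranet"]
# Delivered generally corresponds to Received minus Stored, column by column.
print(all(
    counters["Received"][i] - counters["Stored"][i] == counters["Delivered"][i]
    for i in range(3)
))  # prints True for the sample above
```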

27.8.7 MTA Monitoring Using SNMP

Messaging Server supports system monitoring through the Simple Network Management Protocol (SNMP). Using an SNMP client (sometimes called a network manager) such as Sun Net Manager or HP OpenView (not provided with this product), you can monitor certain parts of the Messaging Server. Refer to Appendix A, SNMP Support for details.

27.8.8 imquotacheck for Mailbox Quota Checking

You can monitor mailbox quota usage and limits by using the imquotacheck utility. The imquotacheck utility generates a report that lists defined quotas and limits, and provides information on quota usage.

For example, the following command lists all user quota information:

% imquotacheck 
-------------------------------------------------------------------------
Domain red.siroe.com (diskquota = not set msgquota = not set) quota usage
-------------------------------------------------------------------------
diskquota         size(K)    %use    msgquota      msgs    %use    user
# of domains = 1
# of users = 705

no quota          50418             no quota      4392             ajonk
no quota              5             no quota      2                andrt
no quota         355518             no quota      2500             ansri
 ...

The following example shows the quota usage for user sorook:

% imquotacheck -u sorook
-------------------------------------------------------------------------
quota usage for user sorook
-------------------------------------------------------------------------
diskquota      size(K)    %use    msgquota      msgs     %use    user

no quota       1487               no quota      305              sorook
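
A report like the ones above can be scanned for the largest mailboxes. The sketch below handles only rows of the form shown in these examples, where both quotas read "no quota" and the %use columns are empty; rows with quotas set would need a different pattern. The function name largest_mailboxes is hypothetical.

```python
import re

# Matches only rows such as: "no quota   50418   no quota   4392   ajonk"
ROW = re.compile(r"^no quota\s+(\d+)\s+no quota\s+(\d+)\s+(\S+)$")


def largest_mailboxes(report, top=3):
    """Return up to `top` (user, size_kb, message_count) tuples, largest first."""
    rows = []
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match:
            size_kb, messages, user = match.groups()
            rows.append((user, int(size_kb), int(messages)))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


REPORT = """\
no quota          50418             no quota      4392             ajonk
no quota              5             no quota      2                andrt
no quota         355518             no quota      2500             ansri
"""
print(largest_mailboxes(REPORT, top=2))
```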

27.8.9 Monitoring Using msprobe and watcher Functions

Messaging Server provides two processes, watcher and msprobe, to monitor various system services. watcher watches for server crashes and restarts the servers as necessary. msprobe monitors server hangs (unresponsiveness). Specifically, msprobe monitors the following:

Server Response Time. msprobe connects to the enabled servers using their protocol commands and measures their response times. If the response time exceeds the alarm warning threshold, an alarm message is sent (see 27.8.9.1 Alarm Messages). If msprobe cannot connect to a server, or the server response time exceeds a specified timeout period, the server is restarted. Server response times are recorded in a counter database and are logged to the default log file. counterutil can be used to display the server response time statistics (see 27.8.3 counterutil).

The following servers are monitored by msprobe: imap, pop, http, cert, job_controller, smtp, lmtp, mmp and ens. When smtp or lmtp is not responding, the dispatcher is restarted. ens cannot be automatically restarted.

Disk usage. msprobe checks the disk availability and usage for every message store partition. Specifically, it checks the message store mboxlist database directory and the MTA queue directory. If disk usage exceeds a configured threshold, an alarm message is sent. The disk sizes and usages are recorded in a counter database and are logged to the default log file. Administrators can use the counterutil utility (see 27.8.3 counterutil) to display the disk usage statistics.

Message Store mboxlist Database Log File Accumulation. Log file accumulation is an indication of an mboxlist database error. msprobe counts the number of active log files, and if that number is larger than the threshold, msprobe logs a critical error message to the default log file to inform the administrator to restart the server. If autorestart is enabled (local.autorestart set to yes), the store daemon is restarted.
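
The response-time rule described above can be sketched as a small decision function. This is an illustration under stated assumptions, not msprobe's implementation: the function name is hypothetical, a failed connection is represented as None, the default values are borrowed from local.probe.warningthreshold (5) and service.readtimeout (10), and the sketch ignores whether local.autorestart is enabled.

```python
def probe_action(response_time, warning_threshold=5.0, timeout=10.0):
    """Classify one probe result as 'ok', 'warn' or 'restart'.

    response_time is the measured time in seconds, or None when the
    probe could not connect to the server at all.
    """
    if response_time is None or response_time > timeout:
        # No connection, or slower than the timeout: restart the server.
        return "restart"
    if response_time > warning_threshold:
        # Responding, but slower than the warning threshold.
        return "warn"
    return "ok"


for sample in (0.4, 7.0, 12.0, None):
    print(sample, probe_action(sample))
```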

watcher and msprobe are controlled by the configutil options shown in Table 27–5. Further information can be found in 4.5 Automatic Restart of Failed or Unresponsive Services.

Table 27–5 msprobe and watcher configutil Options


Options	Description
local.autorestart	Enable automatic server restart. Automatically restarts failed or hung services. Default: no
local.autorestart.timeout	Failure retry time-out. If a server fails more than twice in this designated amount of time, then the system stops trying to restart the server. The value (set in seconds) should be longer than the `msprobe` interval (`local.schedule.msprobe`). Default: 600 seconds
local.probe.service.timeout	Timeout for a specific server before restart. `service` can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens. Default: use `service.readtimeout`
local.probe.service.warningthreshold	Number of seconds of a specific server’s non-response before a warning message is logged to `default` log file. `service` can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens. Default: Use local.probe.warningthreshold
local.probe.warningthreshold	Number of seconds of server non-response before a warning message is logged to `default` log file. Default: 5 secs
local.queuedir	MTA queue directory to check to determine whether the queue size exceeds the threshold defined by alarm.diskavail.msgalarmthreshold. Default: none
service.readtimeout	Period of server non-response before restarting that server. See local.schedule.msprobe. Default: 10 seconds
local.schedule.msprobe	`msprobe` run schedule. A crontab style schedule string (see Table 20–10). Note that by default, this is automatically set. See 4.6.2 Pre-defined Automatic Tasks. To disable: set `local.schedule.msprobe.enable` to `NO`.
local.watcher.enable	Enable watcher, which monitors service failures (IMAP, POP, HTTP, job controller, dispatcher, message store (`stored`), `imsched`, and MMP). LMTP/SMTP servers are monitored by the dispatcher, and LMTP/SMTP clients are monitored by the job_controller. Logs error messages to the default log file for specific failures. Default: on
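
The retry rule for local.autorestart.timeout in Table 27–5 can be illustrated with a short sketch. The function name and the exact way failures are counted are assumptions made for illustration; the table states only that restarts stop once a server fails more than twice within the time-out period.

```python
def should_restart(failure_times, now, window=600):
    """Decide whether another restart attempt would be made.

    failure_times holds the times (in seconds) of the server's
    failures, including the current one; window plays the role of
    local.autorestart.timeout.
    """
    recent = [t for t in failure_times if now - t <= window]
    # More than two failures inside the window: stop retrying.
    return len(recent) <= 2


# Third failure within 600 seconds: no further restart attempt.
print(should_restart([0, 100, 200], now=200))

# The earlier failure is older than the window, so a restart is still attempted.
print(should_restart([0, 700], now=700))
```

This also shows why the time-out should be longer than the msprobe interval: if probes run less often than the window, two failures rarely fall inside it and the limit would never take effect.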

27.8.9.1 Alarm Messages

msprobe can issue alarms in the form of email messages to the postmaster (see 27.6.1.2 To Monitor imapd, popd and httpd) warning of a specified condition. A sample email alarm sent when a certain threshold is exceeded is shown below:

Subject:    ALARM: server response time in seconds of “ldap_siroe.com_389” is 10
Date:    Tue, 17 Jul 2001 16:37:08 -0700 (PDT) 
From:    postmaster@siroe.com 
To:     postmaster@siroe.com 

Server instance: /opt/SUNWmsgsr
Alarmid: serverresponse 
Instance: ldap_siroe_europa.com_389 
Description: server response time in seconds 
Current measured value (17/Jul/2001:16:37:08 -0700): 10 
Lowest recorded value: 0 
Highest recorded value: 10 
Monitoring interval: 600 seconds 
Alarm condition is when over threshold of 10 
Number of times over threshold: 1

You can specify how often msprobe monitors disk and server performance, and under what circumstances it sends alarms. This is done by using the configutil command to set the alarm parameters. Table 27–6 shows useful alarm parameters along with their default settings. See configutil Parameters in Sun Java System Messaging Server 6.3 Administration Reference.

Table 27–6 Useful Alarm Message configutil Parameters


Parameter	Description (Default in parenthesis)
alarm.msgalarmnoticehost	(localhost) Machine to which warning messages are sent.
alarm.msgalarmnoticeport	(25) The SMTP port to which to connect when sending an alarm message.
alarm.msgalarmnoticercpt	(Postmaster@localhost) Recipient of the alarm notice.
alarm.msgalarmnoticesender	(Postmaster@localhost) Address of the sender of the alarm.
alarm.diskavail.msgalarmdescription	(percentage mail partition diskspace available.) Text for description field for disk availability alarm.
alarm.diskavail.msgalarmstatinterval	(3600) Interval in seconds between disk availability checks. Set to 0 to disable checking of disk usage.
alarm.diskavail.msgalarmthreshold	(10) Percentage of disk space availability below which an alarm is sent.
alarm.diskavail.msgalarmthresholddirection	(-1) Specifies whether the alarm is issued when disk space availability goes below threshold (-1) or above it (1).
alarm.diskavail.msgalarmwarninginterval	(24) Interval in hours between subsequent repetitions of disk availability alarms.
alarm.serverresponse.msgalarmdescription	(server response time in seconds.) Text for description field for server response alarm.
alarm.serverresponse.msgalarmstatinterval	(600) Interval in seconds between server response checks. Set to 0 to disable checking of server response.
alarm.serverresponse.msgalarmthreshold	(10) If server response time in seconds exceeds this value, an alarm is issued.
alarm.serverresponse.msgalarmthresholddirection	(1) Specifies whether the alarm is issued when server response time is greater than (1) or less than (-1) the threshold.
alarm.serverresponse.msgalarmwarninginterval	(24) Interval in hours between subsequent repetition of server response alarm.
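
The interaction between the msgalarmthreshold and msgalarmthresholddirection parameters in Table 27–6 can be sketched as follows. The function name is hypothetical, and the strict comparison is an assumption: the sample alarm above reports a measured value equal to the threshold, so the actual comparison may be inclusive.

```python
def alarm_triggered(value, threshold, direction):
    """Return True when a measured value is on the alarm side of the threshold.

    direction is 1 to alarm when the value goes above the threshold,
    or -1 to alarm when it goes below (the two values the table lists).
    """
    if direction == 1:
        return value > threshold
    if direction == -1:
        return value < threshold
    raise ValueError("direction must be 1 or -1")


# Disk availability defaults: threshold 10 percent, direction -1.
print(alarm_triggered(8, 10, -1))   # 8 percent available is below 10

# Server response defaults: threshold 10 seconds, direction 1.
print(alarm_triggered(12, 10, 1))   # 12 seconds is above 10
```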