18 Monitoring Messaging Server

This chapter describes how to monitor Oracle Communications Messaging Server. In most cases, a well-planned, well-configured server performs without extensive intervention from an administrator. As an administrator, however, it is your job to monitor the server for signs of problems.

In addition to this chapter, see the following chapters for more information about monitoring Messaging Server:

Automatic Monitoring and Restart

Messaging Server provides a way to transparently monitor services and automatically restart them if they crash or become unresponsive (that is, a service hangs or freezes). It can monitor all message store, MTA, and MMP services, including the IMAP, POP, HTTP, job controller, dispatcher, and MMP servers. It does not monitor other services such as SMS or TCP/SNMP servers. (TCP/SNMP is monitored by the job controller.) See "Automatic Restart of Failed or Unresponsive Services" and "Monitoring Using msprobe and watcher Functions" for more information.

Daily Monitoring Tasks

The most important tasks you should perform on a daily basis are checking postmaster mail, monitoring the log files, and setting up the msprobe utility. These tasks are described below.

Checking Postmaster Mail

Messaging Server has a predefined administrative mailing list set up for postmaster email. Any users who are part of this mailing list will automatically receive mail addressed to postmaster.

The rules for postmaster mail are defined in RFC 822, which requires every email site to accept mail addressed to a user or mailing list named postmaster, and requires that mail sent to this address be delivered to a real person. All messages sent to postmaster@host.domain are sent to a postmaster account or mailing list.

Typically, the postmaster address is where users should send email about their mail service. As postmaster, you might receive mail from local users about server response time, from other server administrators who are encountering problems sending mail to your server, and so on. You should check postmaster mail daily.

You can also configure the server to send certain error messages to the postmaster address. For example, when the MTA cannot route or deliver a message, you can be notified via email sent to the postmaster address. You can also send exception condition warnings (low disk space, poor server response) to postmaster.

Monitoring and Maintaining the Log Files

Messaging Server creates a separate set of log files for each of the major protocols or services it supports, including SMTP, IMAP, POP, and HTTP. These are located in DataRoot/log. You should monitor the log files on a routine basis, especially if you are having problems with the server.

Be aware that logging can impact server performance. The more verbose the logging you specify, the more disk space your log files will occupy for a given amount of time. You should define effective but realistic log rotation, expiration, and backup policies for your server. See "Managing Logging" for information about defining logging policies for your server.

Setting Up the msprobe Utility

The msprobe utility automatically performs monitoring and restart functions. See "Monitoring Using msprobe and watcher Functions" for more information.

Utilities and Tools for Monitoring

The following tools are available for monitoring:

immonitor-access
imcheck
counterutil
Log Files
imsimta counters
imsimta qm counters
imsimta qm summarize
imsimta qm jobs
MTA Monitoring Using SNMP
Monitoring Using msprobe and watcher Functions
Monitoring Using msstatbot Tool

immonitor-access

"immonitor-access" monitors the status of the following Messaging Server components/processes: Mail Delivery (SMTP server), Message Access and Store (POP and IMAP servers), Directory Service (LDAP server) and HTTP server. This utility measures the response times of the various services and the total round trip time taken to send and retrieve a message. The Directory Service is monitored by looking up a specified user in the directory and measuring the response time. Mail Delivery is monitored by sending a message (SMTP) and the Message Access and Store is monitored by retrieving it. Monitoring the HTTP server is limited to finding out whether or not it is up and running.

See "immonitor-access" for complete instructions.

imcheck

Use "imcheck" to monitor database statistics including logs and transactions.

Note:

The imcheck -s command, which prints database statistics, is only valid for classic message store.

counterutil

This utility provides statistics acquired from different system counters. See "Gathering Message Store Counter Statistics by Using counterutil" for more information.

Log Files

Messaging Server logs event records for SMTP, IMAP, POP, and HTTP. The policies for creating and managing the Messaging Server log files are customizable.

Because logging can affect server performance, weigh the level of logging you need against the load it places on the server before you enable it. Refer to "Managing Logging" for more information.

imsimta counters

The MTA accumulates message traffic counters, based upon the Mail Monitoring MIB (RFC 1566), for each of its active channels. The channel counters are intended to help indicate the trend and health of your email system. Channel counters are not designed to provide an accurate accounting of message traffic. See the discussion about MTA logging in "Managing Logging" for precise accounting.

The MTA channel counters are implemented using the lightest weight mechanisms available so that they cause as little impact as possible on actual operation. Channel counters are maintained on a best-effort basis: if an attempt to map the section fails, no information is recorded. If one of the locks in the section cannot be obtained almost immediately, no information is recorded. When a system is shut down, the information contained in the in-memory section is lost.

The imsimta counters -show command provides MTA channel message statistics (see below). These counters need to be examined over time, noting the minimum values seen. The minimums may actually be negative for some channels. A negative value means that there were messages queued for a channel at the time that its counters were zeroed (for example, when the cluster-wide database of counters was created). When those messages were dequeued, the associated counters for the channel were decremented, leading to a negative minimum. For such a counter, the correct "absolute" value is the current value less the minimum value that counter has ever held since being initialized.

Channel Messages Recipients Blocks 
------- -------- ---------- ------- 
tcp_local 
Received 29379 79714 982252 (1) 
Stored 61 113 -2004 (2) 
Delivered 29369 79723 983903 (29369 first time) (3) 
Submitted 13698 13699 18261 (4) 
Attempted 0 0 0 (5) 
Rejected 1 10 0 (6) 
Failed 104 104 4681 (7) 
Queue time/count 16425/29440 = 0.56 (8) 
Queue first time/count 16425/29440 = 0.56 (9) 
Total In Assocs 297637 
Total Out Assocs 28306

1) Received is the number of messages enqueued to the channel named tcp_local. That is, the messages enqueued (E records in the mail.log* file) to the tcp_local channel by any other channel.

2) Stored is the number of messages stored in the channel queue to be delivered.

3) Delivered is the number of messages which have been processed (dequeued) by the channel tcp_local. (That is, D records in the mail.log* file.) A dequeue operation may either correspond to a successful delivery (that is, an enqueue to another channel), or to a dequeue due to the message being returned to the sender. This will generally correspond to the number Received minus the number Stored.

The MTA also keeps track of how many of the messages were dequeued upon first attempt; this number is shown in parentheses.

4) Submitted is the number of messages enqueued (E records in the mail.log* file) by the channel tcp_local to any other channel.

5) Attempted is the number of messages which have experienced temporary problems in dequeuing, that is, Q or Z records in the mail.log* file.

6) Rejected is the number of attempted enqueues which have been rejected, that is, J records in the mail.log* file.

7) Failed is the number of attempted dequeues which have failed, that is, R records in the mail.log* file.

8) Queue time/count is the average time-spent-in-queue for the delivered messages. This includes both the messages delivered upon the first attempt, see (9), and the messages that required additional delivery attempts (and hence typically spent noticeable time waiting in the queue).

9) Queue first time/count is the average time-spent-in-queue for the messages delivered upon the first attempt.
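
The arithmetic behind two of these figures, the corrected "absolute" value of a counter that has gone negative and the average in the Queue time/count line, can be sketched in a few lines of Python 3. This is only an illustration: the function names are hypothetical and not part of Messaging Server, and the minimum value used below is invented, because the sample output does not show the lowest reading a counter has held.

```python
def absolute_counter(current, minimum):
    """Correct a counter using the lowest value it has held since it was initialized."""
    return current - minimum


def average_queue_time(total_queue_time, count):
    """Average time in queue per delivered message; 0.0 when nothing was delivered."""
    return total_queue_time / count if count else 0.0


# Stored blocks for tcp_local reads -2004 in the sample output. The minimum of
# -2500 is a made-up value standing in for the lowest reading seen over time.
corrected_blocks = absolute_counter(current=-2004, minimum=-2500)

# Queue time/count from the sample output: 16425/29440.
average = average_queue_time(16425, 29440)

print(corrected_blocks)
print(round(average, 2))
```

In practice the minimum has to come from your own record of repeated imsimta counters -show readings, since only the current value is displayed.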

Note that the number of messages submitted can be greater than the number delivered. This is often the case, since each message the channel dequeues (delivers) will result in at least one new message enqueued (submitted) but possibly more than one. For example, if a message has two recipients reached via different channels, then two enqueues will be required. Or if a message bounces, a copy will go back to the sender and another copy may be sent to the postmaster. Usually that will be two submissions (unless both are reached through the same channel).

More generally, the connection between Submitted and Delivered varies according to type of channel. For example, in the conversion channel, a message would be enqueued by some other arbitrary channel, and then the conversion channel would process that message and enqueue it to a third channel and mark the message as dequeued from its own queue. Each individual message takes a path:

elsewhere -> conversion E record Received 
conversion -> elsewhere E record Submitted 
conversion D record Delivered

However, for a channel such as tcp_local which is not a "pass through," but rather has two separate pieces (slave and master), there is no connection between Submitted and Delivered. The Submitted counter has to do with the SMTP server portion of the tcp_local channel, whereas the Delivered counter has to do with the SMTP client portion of the tcp_local channel. Those are two completely separate programs, and the messages travelling through them may be completely separate.

Messages submitted to the SMTP server:

tcp_local -> elsewhere E record Submitted

Messages sent out to other SMTP hosts via the SMTP client:

elsewhere -> tcp_local E record Received 
tcp_local D record Delivered

imsimta counters Implementation

For performance reasons, a node running the MTA keeps a cache of channel counters in memory using a shared memory section. As processes on the node enqueue and dequeue messages, they update the counters in this in-memory cache. If the in-memory section does not exist when a channel runs, the section will be created automatically. (The imsimta start command also creates the in-memory section, if it does not exist.)

The command imsimta counters -clear or the imsimta qm command counters clear may be used to reset the counters to zero.

imsimta qm counters

The imsimta qm counters utility displays MTA channel queue message counters. You must be root or mailsrv to run this utility. The output fields are the same as those described in imsimta counters. See Messaging Server Reference for more information.

Example:

imsimta counters -create 
imsimta qm counters show 
Channel Messages Recipients Blocks 
---------------------- ---------- ---------- ---------- 
tcp_intranet 
Received 13077 13859 264616 
Stored 92 91 -362 
Delivered 12985 13768 264978 
Submitted 2594 2594 3641 
...

Every time you restart the MTA, you must run: imsimta counters -create

imsimta qm summarize

The imsimta qm summarize utility displays a summary of the number of messages and their status in the MTA channel queues.

For more details of the various switches available, see the summarize sub-command in imsimta qm and the imsimta qm help summarize command.

qm summarize modes

Like many of the qm sub-commands, summarize has two modes of operation. The -directory_tree mode examines the message files in the MTA queue directories on disk. The -database mode queries the job_controller process's in-memory database structures. The directory mode creates a heavier load on the I/O system and may not reflect what job_controller is actually working on, but it can be useful to know if there is a difference between the two. Because the job controller decides which messages are tried next, the database mode is generally more useful.

imsimta qm 
qm.maint> sum -directory_tree 
Channel Messages Size (Mb) 
-------------------------------- -------- --------- 
conversion 0 0.0 
hold 0 0.0 
ims-ms (stopped) 2 0.0 
process 0 0.0 
reprocess 0 0.0 
tcp_intranet (stopped) 0 0.0 
tcp_local (stopped) 2 0.0 
-------------------------------- -------- --------- 
Totals 4 0.0

Notice that the -database mode breaks down the number of messages into three categories. Active messages are currently being tried by a worker process. Pending messages are ready to be tried by a worker as soon as a thread or slot is available. Delayed messages have been tried before and are waiting for a specified time to be tried again, as set by the backoff option for that channel.

qm.maint> sum -database 
Total Total 
Channel Messages = Active + Pending + Delayed Size (Mb) 
-------------------------------- -------- -------- -------- -------- --------- 
conversion 0 0 0 0 0.0 
hold 0 0 0 0 0.0 
ims-ms (stopped) 2 0 2 0 0.0 
l 0 0 0 0 0.0 
process 0 0 0 0 0.0 
reprocess 0 0 0 0 0.0 
tcp_intranet (stopped) 0 0 0 0 0.0 
tcp_local (stopped) 2 0 2 0 0.0 
-------------------------------- -------- -------- -------- -------- --------- 
Totals 4 0 4 0 0.0

Note: In these examples, some channels had been stopped using the imsimta qm stop channel command to provide some data to look at.
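
If you capture the -database summary in a script, the relationship in the header (Messages = Active + Pending + Delayed) gives a simple sanity check. The following is a minimal Python sketch that assumes the whitespace-separated layout shown in the sample above; the function names are hypothetical, and the parsing is not an official interface to imsimta qm.

```python
SAMPLE_OUTPUT = """\
Channel Messages = Active + Pending + Delayed Size (Mb)
-------------------------------- -------- -------- -------- -------- ---------
conversion 0 0 0 0 0.0
ims-ms (stopped) 2 0 2 0 0.0
tcp_local (stopped) 2 0 2 0 0.0
-------------------------------- -------- -------- -------- -------- ---------
Totals 4 0 4 0 0.0
"""


def parse_row(line):
    """Parse one data row; return None for headers, separators, and blank lines."""
    parts = line.split()
    if len(parts) < 6:
        return None
    try:
        # The last five fields are numeric; everything before them is the channel name.
        total, active, pending, delayed = (int(p) for p in parts[-5:-1])
        size_mb = float(parts[-1])
    except ValueError:
        return None
    return {
        "channel": " ".join(parts[:-5]),
        "total": total,
        "active": active,
        "pending": pending,
        "delayed": delayed,
        "size_mb": size_mb,
    }


def parse_summary(text):
    """Return the parsed data rows of a summary listing, including the Totals row."""
    rows = (parse_row(line) for line in text.splitlines())
    return [row for row in rows if row is not None]


def is_consistent(row):
    """True when the total equals active + pending + delayed."""
    return row["total"] == row["active"] + row["pending"] + row["delayed"]


for row in parse_summary(SAMPLE_OUTPUT):
    print(row["channel"], row["total"], is_consistent(row))
```

Channel names such as "ims-ms (stopped)" contain spaces, which is why the sketch counts fields from the right rather than from the left.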

Held messages

A .HELD message file is a message that has encountered a loop or has otherwise been sidelined and requires administrative intervention. You can see such messages using the -held switch. Note that job_controller has no knowledge of held messages, so the -database and -held switches are mutually exclusive. See "Diagnosing and Cleaning up .HELD Messages" for more information about .HELD messages.

qm.maint> sum -held -database 
%QM-E-CMDERR, Conflicting parameters and/or qualifiers: (DATABASE AND HELD) 

qm.maint> sum -held 
Held Held 
Channel Messages Size (Mb) Oldest Queued Messages Size (Mb) Oldest Held 
-------------------------------- -------- --------- ----------------- -------- --------- ----------------- 
conversion 0 0.0 0 0.0 
hold 0 0.0 1 0.0 23 Apr, 21:35:16 
ims-ms (stopped) 2 0.0 6 Apr, 13:24:00 0 0.0 
process 0 0.0 0 0.0 
reprocess 0 0.0 0 0.0 
tcp_intranet (stopped) 0 0.0 0 0.0 
tcp_local (stopped) 2 0.0 5 May, 10:16:08 0 0.0 
-------------------------------- -------- --------- ----------------- -------- --------- ----------------- 
Totals 4 0.0 6 Apr, 13:24:00 1 0.0 23 Apr, 21:35:16

Displaying Summary by Destination Host

The -hosts switch displays a breakdown of the messages in the queue by destination host for channels where that is meaningful. This information is stored in the job_controller process in-memory queue cache database. Therefore -hosts implies -database.

qm.maint> sum -hosts 
Total Total 
Channel Host Messages = Active + Pending + Delayed Size (Mb) 
-------------------------------- -------- -------- -------- -------- --------- 
conversion 0 0 0 0 0.0 
hold 0 0 0 0 0.0 
ims-ms (stopped) 2 0 2 0 0.0 
l 0 0 0 0 0.0 
process 0 0 0 0 0.0 
reprocess 0 0 0 0 0.0 
tcp_intranet (stopped) 0 0 0 0 0.0 
tcp_local (stopped) 2 0 2 0 0.0 
aol.com 1 0 1 0 0.0 
sun.com 1 0 1 0 0.0 
-------------------------------- -------- -------- -------- -------- --------- 
Totals 4 0 4 0 0.0

imsimta qm jobs

After starting the tcp_local channel:

tcp_local 1 1 0 0 0.0 
aol.com 1 1 0 0 0.0

And to see what processes are working on what jobs:

qm.maint> jobs tcp_local 
tcp_local channel: 

Pending: 0 jobs 
Active: 1 jobs, 1 messages (0.00 Mb), 1 recipients 
Current jobs have delivered 1 messages, requeued 0 messages 

Active jobs and messages: 

22157: 1 messages (0.00 Mb), 1 recipients 
1 messages processed and 0 requeued 

Active hosts: 

aol.com 

Active messages: 

ZZg0u410_P_~1.01 (1.0 Kb)

MTA Monitoring Using SNMP

Messaging Server supports system monitoring through the Simple Network Management Protocol (SNMP). Using an SNMP client (sometimes called a network manager) such as Sun Net Manager or HP OpenView (not provided with this product), you can monitor certain parts of the Messaging Server. Refer to "SNMP Support" for details.

Monitoring Using msprobe and watcher Functions

Messaging Server provides two processes, watcher and msprobe, to monitor various system services. watcher watches for server crashes and restarts the servers as necessary. msprobe monitors server hangs (unresponsiveness). Specifically, msprobe monitors the following:

Server Response Time. msprobe connects to the enabled servers using their protocol commands and measures their response times. If the response time exceeds the alarm warning threshold, an alarm message is sent (see "Alarm Messages"). If msprobe cannot connect to a server, or the server response time exceeds a specified timeout period, the server is restarted. Server response times are recorded in a counter database and are logged to the default log file. counterutil can be used to display the server response time statistics (see "counterutil").

The following servers are monitored by msprobe: imap, pop, http, cert, job_controller, smtp, lmtp, mmp and ens. When smtp or lmtp are not responding, the dispatcher is restarted. ens cannot be automatically restarted.

Disk Usage. msprobe checks the disk availability and usage for every message store partition. Specifically, it checks the message store mboxlist database directory and the MTA queue directory. If disk usage exceeds a configured threshold, an alarm message is sent. The disk sizes and usages are recorded in a counter database and are logged to the default log file. Administrators can use the counterutil utility (see "counterutil") to display the disk usage statistics.

Message Store mboxlist Database Log File Accumulation. Log file accumulation is an indication of an mboxlist database error. msprobe counts the number of active log files, and if that number is larger than the threshold, msprobe logs a critical error message to the default log file to inform the administrator to restart the server. If autorestart is enabled (local.autorestart set to yes), the store daemon is restarted.

watcher and msprobe are controlled by the msconfig options shown in Table 18-1. See "Automatic Restart of Failed or Unresponsive Services" for more information.

Table 18-1 msprobe and watcher msconfig Options

Options	Description
base.autorestart.enable	Enable automatic server restart. Automatically restarts failed or hung services. Default: 1
base.autorestart.timeout	Failure retry time-out. If a server fails more than twice in this designated amount of time, then the system stops trying to restart the server. The value (set in seconds) should be longer than the msprobe interval (schedule.task:msprobe). Default: 600 seconds
msprobe.probe:service.timeout	Timeout for a specific server before restart. service can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens. Default: use msprobe.timeout
msprobe.probe:service.warningthreshold	Number of seconds of a specific server's non-response before a warning message is logged to default log file. service can be imap, pop, http, cert, job_controller, smtp, lmtp, mmp or ens. Default: Use msprobe.warningthreshold
msprobe.warningthreshold	Number of seconds of server non-response before a warning message is logged to default log file. Default: 25 seconds
msprobe.queuedir	MTA queue directory to check if queue size exceeded threshold defined by alarm.system:diskavail.thresholddirection. Default: none
msprobe.timeout	Period of server non-response before restarting that server. See "Expire and Purge Log and Scheduling Options" schedule.task:msprobe.crontab. Default: 30 seconds
schedule.task:msprobe.crontab	msprobe run schedule. A crontab style schedule string (see schedule.task:expire.enable in ). Note that by default, this is automatically set. See "Pre-defined Automatic Tasks". To disable: set schedule.task:msprobe.enable to 0.
watcher.enable	Enable watcher, which monitors service failures (IMAP, POP, HTTP, job controller, dispatcher, message store (stored), imsched, and MMP). LMTP/SMTP servers are monitored by the dispatcher, and LMTP/SMTP clients are monitored by the job_controller. Logs error messages to the default log file for specific failures. Default: 1
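
To see how the two thresholds in Table 18-1 relate, the following Python sketch classifies a single probe result using the documented defaults for msprobe.warningthreshold (25 seconds) and msprobe.timeout (30 seconds). It is an illustration of the described behavior, not msprobe's implementation: the function name is hypothetical, the strict comparisons are an assumption, and treating a failed connection the same as a timeout is also an assumption of this sketch.

```python
def classify_probe(response_seconds, warning_threshold=25, timeout=30):
    """Classify one probe result as "ok", "warn", or "restart".

    response_seconds is the measured response time, or None when the probe
    could not connect at all (assumed here to be handled like a timeout).
    """
    if response_seconds is None or response_seconds > timeout:
        return "restart"
    if response_seconds > warning_threshold:
        return "warn"
    return "ok"


for sample in (2.0, 27.5, 45.0, None):
    print(sample, classify_probe(sample))
```

A per-service override such as msprobe.probe:service.timeout would correspond to passing a different timeout for that service.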

Alarm Messages

msprobe can issue alarms in the form of email messages to the postmaster (see "To Monitor imapd, popd and httpd") warning of a specified condition. A sample email alarm sent when a certain threshold is exceeded is shown below:

Subject: ALARM: server response time in seconds of "ldap_example.com_389" is 10 
Date: Tue, 17 Jul 2001 16:37:08 -0700 (PDT) 
From: postmaster@example.com 
To: postmaster@example.com 
Server instance: /opt/sun/comms/messaging64 
Alarmid: serverresponse 
Instance: ldap_example_europa.com_389 
Description: server response time in seconds 
Current measured value (17/Jul/2001:16:37:08 -0700): 10 
Lowest recorded value: 0 
Highest recorded value: 10 
Monitoring interval: 600 seconds 
Alarm condition is when over threshold of 10 
Number of times over threshold: 1

You can specify how often msprobe monitors disk and server performance, and under what circumstances it sends alarms. This is done by using the msconfig command to set the alarm options. Table 18-2 shows some useful alarm options along with their default setting. See Messaging Server Reference for all options.

Table 18-2 Useful Alarm Message msconfig Options

Option	Description (Default in Parenthesis)
alarm.noticehost	(localhost) Machine to which you send warning messages.
alarm.noticeport	(587) The SMTP port to connect to when sending an alarm message.
alarm.noticercpt	(Postmaster@localhost) Recipient of the alarm notice.
alarm.noticesender	(Postmaster@localhost) Sender address of the alarm.
alarm.system:diskavail.description	(Percentage mail partition diskspace available.) Text for description field for disk availability alarm.
alarm.system:diskavail.statinterval	(3600) Interval in seconds between disk availability checks. Set to 0 to disable checking of disk usage.
alarm.system:diskavail.threshold	(10) Percentage of disk space availability below which an alarm is sent.
alarm.system:diskavail.thresholddirection	(-1) Specifies whether the alarm is issued when disk space availability goes below threshold (-1) or above it (1).
alarm.system:diskavail.warninginterval	(24) Interval in hours between repetitions of disk availability alarms.
alarm.system:serverresponse.description	(Server response time in seconds.) Text for description field for servers response alarm.
alarm.system:serverresponse.statinterval	(600) Interval in seconds between server response checks. Set to 0 to disable checking of server response.
alarm.system:serverresponse.threshold	(10) If the server response time in seconds exceeds this value, an alarm is issued.
alarm.system:serverresponse.thresholddirection	(1) Specifies whether the alarm is issued when server response time is greater than (1) or less than (-1) the threshold.
alarm.system:serverresponse.warninginterval	(24) Interval in hours between subsequent repetition of server response alarm.
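
The threshold and thresholddirection options work as a pair. The Python sketch below shows one way to read them, using the defaults from Table 18-2; the function name is hypothetical, and whether a value exactly equal to the threshold raises an alarm is not stated in the table, so the strict comparisons here are an assumption.

```python
def alarm_triggered(value, threshold, direction):
    """Return True when value crosses threshold in the configured direction.

    direction 1 means alarm when the value goes above the threshold;
    direction -1 means alarm when it goes below.
    """
    if direction == 1:
        return value > threshold
    if direction == -1:
        return value < threshold
    raise ValueError("thresholddirection must be 1 or -1")


# Disk availability: default threshold 10 (percent), direction -1.
print(alarm_triggered(8, threshold=10, direction=-1))

# Server response time: default threshold 10 (seconds), direction 1.
print(alarm_triggered(12, threshold=10, direction=1))
```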

Monitoring Using msstatbot Tool

The message store uses the msstatbot tool to perform basic administrative tasks and to monitor cluster health.

Beyond the message store, the Elasticsearch engine and the MTA also use the msstatbot monitoring tool to visualize and track their running state and health.

For the message store, the tool supports administrative and monitoring functions as follows:

Administrative Functions: Administrative functions include backing up and restoring the data. A backup is a snapshot of all on-disk data files (SSTable files) stored in the data directory. You can set a retention policy that defines how to handle the snapshot files for older backup data. The default policy is to retain On Server backup files for 30 days.

Monitoring Functions: Monitoring functions include monitoring clusters and diagnosing problems in the cluster and nodes. Table 18-3 lists the nodetool commands to collect statistics and status:

Table 18-3 nodetool Commands

Nodetool command	Description
status	Cluster information (state, load, IDs, ...)
tablestats	Statistics on tables. Supports JSON format (-F json, --format json).
tpstats	Usage statistics of thread pools. Supports JSON format (-F json, --format json).
gcstats	JVM GC statistics
netstats	Network information on the provided host (connecting node by default)

MTA stats can also be collected using mtastats.

For the nodetool commands details, see the Cassandra documentation at: http://cassandra.apache.org/doc/latest/tools/nodetool.

Stats Available from the msstatbot Tool

The following stats are available from the msstatbot tool:

netstats - cassandra node
tpstats - cassandra node
tablestats - cassandra node
gcstats - cassandra node
status - cassandra node
mtastats - mta node

Installing the msstatbot Tool

The msstatbot tool is distributed as a Python 2.7 package for the Messaging Server statistics monitoring daemon (msstatd). You can install the msstatbot tool using the following rpm command:

rpm -i msstatbot-1.0-1.noarch.rpm

msstatbot is installed in /opt/sun/comms/messaging64/lib/python2.7/site-packages/. This location is not relocatable.

On a Cassandra node, msstatbot is also installed in /opt/sun/comms/messaging64/lib/python2.7/site-packages/.

The msstatd server provides APIs for clients to configure, start, stop, and check the running health of Oracle Messaging Server services.

Configuration

The msstatbot tool supports three types of configuration:

When it is installed on a Messaging Server node, the configuration must be set with msconfig. Example:
```
<serverroot>/bin/msconfig set role.msstatbot.port 8889
```
It loads Messaging Server's unified configuration.

Make sure that the following parameters are configured in msconfig:
- elasticsearch.hostlist
- elasticsearch.port
  - ./msconfig show elasticsearch
  - role.elasticsearch.hostlist = <ip list>
  - role.elasticsearch.port = <port #>
  - role.store.searchengine elastic
- role.dispatcher.service:SMTP.tcp_ports: SMTP server tcp port
  - ./msconfig show service:SMTP.tcp_ports
  - role.dispatcher.service:SMTP.tcp_ports = 25
- msstatbot.port: default to 8889
- msstatbot.enabledstats: enabled stats to monitor, which should be set to mtastats on an MTA node
  - ./msconfig show msstatbot
  - role.msstatbot.port = 8190
  - role.msstatbot.enabledstats = mtastats:20

Example:

role.store.dbtype = cassandra
role.store.searchengine = elastic
role.store.casconnectpoints = 10.196.12.157
role.store.casmetarf = 1
role.store.casmsgrf = 1
role.msstatbot.port = 8889
role.msstatbot.enabledstats = mtastats:20

When it is installed on a Cassandra node, it loads a JSON-formatted configuration:

{
    "es_hosts": "replace_with_elasticsearch_host_name",
    "es_port": "replace_with_elasticsearch_port",
    "storetype": "cassandra",
    "port": 8889,
    "enablestats": "gcstats:15",
    "nodetoolpath": "path/to/nodetool",
    "pidfile": "cassbot",
    "logfile": "path/to/logfile",
    "loglevel": "INFO",
    "maxbytes": 20971520,
    "backupcount": 10
}

When it is installed on both a Cassandra node and Messaging Server, make sure that the following parameters are configured in msconfig:
- elasticsearch.hostlist
- elasticsearch.port
  - ./msconfig show elasticsearch
  - role.elasticsearch.hostlist = <ip list>
  - role.elasticsearch.port = <port #>
  - role.store.searchengine elastic
- role.dispatcher.service:SMTP.tcp_ports: SMTP server tcp port
  - ./msconfig show service:SMTP.tcp_ports
  - role.dispatcher.service:SMTP.tcp_ports = 25
- msstatbot.port: default to 8889
- msstatbot.enabledstats: enabled stats to monitor, which should be set to mtastats on an MTA node
  - ./msconfig show msstatbot
  - role.msstatbot.port = 8190
  - role.msstatbot.enabledstats = mtastats:20
  - role.msstatbot.nodetoolpath = <node tool path>

The ':<number>' suffix on a stat name is the frequency (in seconds) at which that stat is collected from the Cassandra or MTA node.
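
As an illustration of the name:interval convention, the following Python sketch splits a value such as mtastats:20 into its stat name and collection interval. The helper name is hypothetical and this is not msstatd's own parser; it also accepts a bare stat name with no interval, which is the form used when stopping a stat.

```python
def parse_stat_spec(spec):
    """Split "name:seconds" into (name, seconds); seconds is None when omitted."""
    name, separator, interval = spec.partition(":")
    name = name.strip()
    if not name:
        raise ValueError("missing stat name in %r" % spec)
    if not separator:
        return name, None
    seconds = int(interval)  # raises ValueError for a non-numeric interval
    if seconds <= 0:
        raise ValueError("interval must be a positive number of seconds")
    return name, seconds


print(parse_stat_spec("mtastats:20"))
print(parse_stat_spec("gcstats"))
```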

Notes

nodetool has to be in the path, or specified with nodetoolpath in the configuration.
Replace the hosts and ports for Elasticsearch.
If pidfile is set, the daemon uses it as the pidfile; otherwise it uses ./msstatpid. With a Messaging Server configuration, it is set to path/to/ms/data/proc/msstatpid.
logfile, loglevel, maxbytes, and backupcount are for logging configuration:
- logfile: the path to the log file;
- loglevel: the logging level, one of [CRITICAL, ERROR, WARNING, INFO, DEBUG]; defaults to INFO;
- maxbytes: the maximum log file size; the log file is rotated when it exceeds this size;
- backupcount: the total number of log files kept in the log folder.
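
The four logging keys map naturally onto Python's standard rotating log handler. The sketch below shows that mapping under the assumption that msstatd uses something similar; this is not confirmed by the documentation, and the function and logger names are hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler

LEVELS = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def build_logger(config, name="msstatd-sketch"):
    """Create a logger from logfile, loglevel, maxbytes, and backupcount keys."""
    # Unknown or missing levels fall back to INFO, matching the documented default.
    level = LEVELS.get(str(config.get("loglevel", "INFO")).upper(), logging.INFO)
    handler = RotatingFileHandler(
        config["logfile"],
        maxBytes=int(config.get("maxbytes", 0)),        # 0 disables rotation
        backupCount=int(config.get("backupcount", 0)),  # rotated files to keep
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger
```

With the sample values (maxbytes 20971520, backupcount 10), a handler built this way would rotate at 20 MB and keep ten rotated files next to the active log. Note that RotatingFileHandler counts rotated files separately from the active file, which may differ slightly from how msstatd counts backupcount.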

Assumptions

The nodetool path setting should meet one of the following three conditions:
- nodetool is on the path.
- nodetoolpath is given in the configuration and points to the location where nodetool was installed with Cassandra.
- nodetool is installed under /var/opt/cassandra/dse-*/bin (the default Cassandra installation location).

Elasticsearch dependency:
- Elasticsearch is installed and accessible to all nodes running this program.
- The Elasticsearch Python client library is installed for the Python version used to run the program.
- Elasticsearch hosts and ports should be configured.
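
The three nodetool conditions listed under Assumptions can be checked before starting the daemon. The Python 3 sketch below tries them in one plausible order (configured nodetoolpath, then the search path, then the default installation location); the order, the function name, and the handling of nodetoolpath as either a directory or the executable itself are assumptions of this sketch, not documented msstatd behavior.

```python
import glob
import os
import shutil

DEFAULT_NODETOOL_GLOB = "/var/opt/cassandra/dse-*/bin"


def _is_executable(path):
    return os.path.isfile(path) and os.access(path, os.X_OK)


def find_nodetool(nodetoolpath=None, default_glob=DEFAULT_NODETOOL_GLOB):
    """Return the path to a usable nodetool executable, or None if none is found."""
    # 1. Explicit nodetoolpath from the configuration (directory or executable).
    if nodetoolpath:
        candidate = nodetoolpath
        if os.path.isdir(candidate):
            candidate = os.path.join(candidate, "nodetool")
        if _is_executable(candidate):
            return candidate
    # 2. nodetool on the search path.
    found = shutil.which("nodetool")
    if found:
        return found
    # 3. Default Cassandra installation location.
    for directory in sorted(glob.glob(default_glob)):
        candidate = os.path.join(directory, "nodetool")
        if _is_executable(candidate):
            return candidate
    return None


print(find_nodetool())
```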

Starting and Stopping Statistics Monitoring

To start the msstatd server, run msstatd with the start command and a configuration file as follows:

python msstatd.py -c start -f msstatd.conf

To stop the msstatd server, run msstatd with the stop command and a configuration file as follows:

python msstatd.py -c stop -f msstatd.conf

When the msstatd tool is installed with Messaging Server, it loads the configuration from the msconfig XML. In this case, the msstatd tool is assumed to be installed under /opt/sun/comms/messaging64/lib/python2.7/site-packages/src/ and to be run with root or mailsrv privilege.

python msstatd.py -c start
python msstatd.py -c stop

msstatd Syntax

msstatd.py [-h] [-c {start,stop}] [-p PORT] [-i IP] [-f CONFIG]

Table 18-4 describes the msstatd options.

Table 18-4 msstatd Options

Option	Description
-h, --help	Displays the help.
-c {start,stop}, --command {start,stop}	Starts or stops the msstatd server.
-p PORT, --port PORT	Listening port for the msstatd server. PORT: the port the daemon listens on; if missing, the default is 8889.
-i IP, --ip IP	msstatd server IP. IP: the node IP address; if missing, the default is the hostname of the node.
-f CONFIG, --config CONFIG	msstatd server configuration file. CONFIG: the JSON-formatted configuration file. If missing, the daemon checks the node type: on a Messaging Server node, it loads the unified configuration; on a Cassandra node (nodetool is installed and in the path), it loads the default configuration.
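
For readers who wrap msstatd in their own scripts, the option syntax and defaults in Table 18-4 can be mirrored with Python's argparse module. The sketch below is a reconstruction from the table only; it is not the msstatd source, and the function name is hypothetical.

```python
import argparse
import socket


def build_parser():
    """Build a parser that mirrors the documented msstatd.py options."""
    parser = argparse.ArgumentParser(prog="msstatd.py")
    parser.add_argument("-c", "--command", choices=["start", "stop"],
                        help="start or stop the msstatd server")
    parser.add_argument("-p", "--port", type=int, default=8889,
                        help="listening port (documented default: 8889)")
    parser.add_argument("-i", "--ip", default=socket.gethostname(),
                        help="node IP address (default: hostname of the node)")
    parser.add_argument("-f", "--config", default=None,
                        help="JSON-formatted configuration file")
    return parser


args = build_parser().parse_args(["-c", "start", "-f", "msstatd.conf"])
print(args.command, args.port, args.config)
```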

To start, stop, or restart stat collection after the msstatd server has started:

curl -X POST <host>:<port>/stat/ -H "Content-Type: application/json" -d '{"start":"netstats:30"}'
curl -X POST <host>:<port>/stat/ -H "Content-Type: application/json" -d '{"stop":"gcstats"}'
curl -X POST <host>:<port>/stat/ -H "Content-Type: application/json" -d '{"restart":"netstats:25"}'
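
The same requests can be issued from Python 3 using only the standard library. The sketch below builds a request equivalent to the curl commands above; the helper name is hypothetical, the http scheme is assumed, and the host and port in the example are placeholders. Nothing is sent until the request is passed to urllib.request.urlopen.

```python
import json
import urllib.request


def build_stat_request(host, port, action, spec):
    """Build a POST to /stat/ with a body such as {"start": "netstats:30"}."""
    if action not in ("start", "stop", "restart"):
        raise ValueError("action must be start, stop, or restart")
    body = json.dumps({action: spec}).encode("utf-8")
    return urllib.request.Request(
        url="http://{}:{}/stat/".format(host, port),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_stat_request("localhost", 8889, "start", "netstats:30")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))

# To actually send it (requires a running msstatd server):
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status, response.read().decode("utf-8"))
```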

Querying the Node Statistics

Statistics are normally queried through Elasticsearch, but msstatd also provides RESTful APIs to query the data.

With browser:

http://<msstatd host>:<msstatd port>/stat/netstats?count=1&node=<cassandra node>

With curl:
```
curl -X GET <host>:<port>/stat/<tpstats|tablestats>?node=?&ks=?&table=?&count=?
```
where node is the Cassandra node, ks is the keyspace, table is the table name, and count is the number of results to return.
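
A script that polls these endpoints needs two small pieces: building the query URL and pulling the measurements out of the JSON reply. The Python 3 sketch below covers both, assuming the reply has the Elasticsearch-style shape of the sample tablestats response in this section (documents under hits.hits, each with a _source object). The function names and the trimmed sample are illustrative only.

```python
import json
from urllib.parse import urlencode


def build_query_url(host, port, stat, node=None, ks=None, table=None, count=None):
    """Build a /stat/<name> query URL, including only the parameters that are set."""
    pairs = (("node", node), ("ks", ks), ("table", table), ("count", count))
    params = [(key, value) for key, value in pairs if value is not None]
    url = "http://{}:{}/stat/{}".format(host, port, stat)
    return url + "?" + urlencode(params) if params else url


def extract_sources(response_text):
    """Return the _source object of every hit in an Elasticsearch-style reply."""
    document = json.loads(response_text)
    hits = document.get("hits", {}).get("hits", [])
    return [hit.get("_source", {}) for hit in hits]


# Trimmed copy of the sample tablestats response, keeping a few fields.
SAMPLE_RESPONSE = """
{"hits": {"hits": [{"_source": {"table_s": "ms_msg.message", "node": "kkm00cxy",
 "ts": 1551180643392, "local_read_count_f": 15, "local_write_count_f": 7}}],
 "total": 21}, "timed_out": false}
"""

print(build_query_url("mshost", 8889, "tablestats", node="kkm00cxy", count=1))
for source in extract_sources(SAMPLE_RESPONSE):
    print(source["table_s"], source["local_read_count_f"], source["local_write_count_f"])
```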

The following is a sample response to a tablestats query:

Tablestats JSON format: {"hits": {"hits": [{"sort": [1551180643392], "_type":
"cassstats", "_source": {"bloom_filter_space_used_f": 0.0,
"number_of_partitions_estimate_f": 6,
"bloom_filter_off_heap_memory_used_f": 0.0, "space_used_live_f": 0,
"table_s": "ms_msg.message",
"compression_metadata_off_heap_memory_used_f": 0,
"average_live_cells_per_slice_last_five_minutes_f": 1.0,
"memtable_off_heap_memory_used_f": 0, "percent_repaired_f": 100.0,
"sstable_compression_ratio_f": -1.0, "local_write_latency_ms_f": 0.028,
"ts": 1551180643392, "maximum_tombstones_per_slice_last_five_minutes_f":
1.0, "proc": "cassandra", "memtable_switch_count_f": 0, "node":
"kkm00cxy", "local_read_count_f": 15, "pending_flushes_f": 0,
"local_write_count_f": 7, "off_heap_memory_used_total_f": 0,
"average_tombstones_per_slice_last_five_minutes_f": 1.0,
"space_used_total_f": 0.0, "memtable_data_size_f": 27096.0,
"compacted_partition_minimum_bytes_f": 0,
"compacted_partition_maximum_bytes_f": 0.0,
"maximum_live_cells_per_slice_last_five_minutes_f": 1.0,
"bloom_filter_false_ratio_f": 0.0, "compacted_partition_mean_bytes_f":
0, "dropped_mutations_f": 0.0, "index_summary_off_heap_memory_used_f":
0.0, "bloom_filter_false_positives_f": 0.0, "local_read_latency_ms_f":
0.18, "memtable_cell_count_f": 7, "space_used_by_snapshots_total_f": 0},
"_score": null, "_index": "ms_tablestats_2019_02_25", "_id":
"tablestats-16929923c40"}], "total": 21, "max_score": null}, "_shards":
{"successful": 5, "failed": 0, "skipped": 0, "total": 5}, "took": 5,
"timed_out": false}

Log Files

Two log files, system.log and msstatd.log, are generated in the location /python2.7/site-packages/src/.

Uninstalling the msstatbot Tool

You can uninstall the msstatbot tool using the following rpm command:

rpm -e msstatbot

This command uninstalls the packages from /python2.7/site-packages/.