Sun Java System Messaging Server 6.3 Administration Guide

20.14 Troubleshooting the Message Store

This section provides guidelines for actively maintaining your message store. In addition, this section describes other message store recovery procedures you can use if the message store becomes corrupted or unexpectedly shuts down. Note that the section on these additional message store recovery procedures is an extension of 20.14.3 Repairing Mailboxes and the Mailboxes Database.

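Prior to reading this section, it is strongly recommended that you review this chapter as well as the command-line utility and configutil chapters in the Sun Java System Messaging Server Administration Reference. This section covers standard monitoring procedures, message store startup and recovery, repairing mailboxes and the mailboxes database, and common problems and solutions.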

20.14.1 Standard Message Store Monitoring Procedures

This section outlines standard monitoring procedures for the message store. These procedures are helpful for general message store checks, testing, and standard maintenance.

For additional information, see 27.7 Monitoring the Message Store.

20.14.1.1 Check Hardware Space

A message store needs adequate free disk space and hardware resources. When the message store approaches the limits of its disk space or other hardware resources, problems might occur within the message store.

Inadequate disk space is one of the most common causes of mail server problems and failure. Without space to write to the message store, the mail server will fail. In addition, when the available disk space goes below a certain threshold, there will be problems related to message delivery, logging, and so forth. Disk space can be rapidly depleted when the cleanup function of the stored process fails and deleted messages are not expunged from the message store.

For information on monitoring disk space, see 20.11.5 To Monitor Disk Space and 27.7 Monitoring the Message Store.
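As a minimal sketch of such a check, the following Python function reports whether the file system that holds the message store has dropped below a free-space threshold. The function name and the 10 percent default are illustrative assumptions, not Messaging Server settings.

```python
import shutil

def low_on_space(path, min_free_fraction=0.10):
    """Return True if the file system containing path has less than
    min_free_fraction of its capacity free.

    The 0.10 default is an assumed example value, not a product default.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Treat an unreadable or zero-sized file system as a problem.
        return True
    return usage.free / usage.total < min_free_fraction
```

For example, low_on_space("/var/opt/store") could be called from a scheduled job, where the path is a placeholder for your own store_root.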

20.14.1.2 Check Log Files

Check the log files to make sure the message store processes are running as configured. Messaging Server creates a separate set of log files for each of the major protocols, or services, it supports: SMTP, IMAP, POP, and HTTP. You can look at the log files in the directory msg-svr-base/log/. You should monitor the log files on a routine basis.

Be aware that logging can impact server performance. The more verbose the logging you specify, the more disk space your log files will occupy for a given amount of time. You should define effective but realistic log rotation, expiration, and backup policies for your server. For information about defining logging policies for your server, see Chapter 25, Managing Logging.

20.14.1.3 Check User IMAP/POP/Webmail Session by Using Telemetry

Messaging Server provides a feature called telemetry that can capture a user’s entire IMAP, POP or HTTP session into a file. This feature is useful for debugging client problems. For example, if a user complains that their message access client is not working as expected, this feature can be used to trace the interaction between the access client and Messaging Server.

To capture a session, create a directory of the following form, where the third path component is pop, imap, or http:

msg-svr-base/data/telemetry/pop_or_imap_or_http/userid

To capture a POP session, create the following directory:

msg-svr-base/data/telemetry/pop/userid

To capture an IMAP session, create the following directory:

msg-svr-base/data/telemetry/imap/userid

To capture a Webmail session, create the following directory:

msg-svr-base/data/telemetry/http/userid

Note that the directory must be owned or writable by the messaging server userid.
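The directory layout above can be sketched in Python as follows. The helper names are hypothetical, and the sketch only creates the directory; it does not change ownership, so it is assumed to run as the Messaging Server user.

```python
import os

# Service names taken from the telemetry paths shown above.
SERVICES = ("pop", "imap", "http")

def telemetry_dir(msg_svr_base, service, userid):
    """Build msg-svr-base/data/telemetry/<service>/<userid>."""
    if service not in SERVICES:
        raise ValueError("service must be one of %s" % (SERVICES,))
    return os.path.join(msg_svr_base, "data", "telemetry", service, userid)

def enable_telemetry(msg_svr_base, service, userid):
    """Create the telemetry directory if it does not exist yet.

    Assumption: the caller is the Messaging Server user, so the
    directory ends up owned by that user.
    """
    path = telemetry_dir(msg_svr_base, service, userid)
    os.makedirs(path, exist_ok=True)
    return path
```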

Messaging Server will create one file per session in that directory. Example output is shown below.


LOGIN redb 2003/11/26 13:03:21
>0.017>1 OK User logged in
<0.047<2 XSERVERINFO MANAGEACCOUNTURL MANAGELISTSURL MANAGEFILTERSURL
>0.003>* XSERVERINFO MANAGEACCOUNTURL {67}
http://redb@cuisine.blue.planet.com:800/bin/user/admin/bin/enduser 
MANAGELISTSURL NIL MANAGEFILTERSURL NIL
2 OK Completed
<0.046<3 select "INBOX"
>0.236>* FLAGS (\Answered flagged draft deleted \Seen $MDNSent Junk)
* OK [PERMANENTFLAGS (\Answered flag draft deleted \Seen $MDNSent Junk \*)]
* 1538 EXISTS
* 0 RECENT
* OK [UNSEEN 23]
* OK [UIDVALIDITY 1046219200]
* OK [UIDNEXT 1968]
3 OK [READ-WRITE] Completed
<0.045<4 UID fetch 1:* (FLAGS)
>0.117>* 1 FETCH (FLAGS (\Seen) UID 330)
* 2 FETCH (FLAGS (\Seen) UID 331)
* 3 FETCH (FLAGS (\Seen) UID 332)
* 4 FETCH (FLAGS (\Seen) UID 333)
* 5 FETCH (FLAGS (\Seen) UID 334)
<etc>

To disable the telemetry logging, move or remove the directory that you created.

20.14.1.4 Check stored Processes

The stored function performs a variety of important tasks such as deadlock and transaction operations of the message database, enforcing aging policies, and expunging and erasing messages stored on disk. If stored stops running, Messaging Server will eventually run into problems. If stored doesn’t start when start-msg is run, no other processes will start.

Table 20–12 stored Operations

stored Operation    Function
stored.ckp          Touched when a database checkpoint is initiated. Time stamped approximately every minute.
stored.lcu          Touched at every database log cleanup. Time stamped approximately every 5 minutes.
stored.per          Touched at every spawn of peruser db write out. Time stamped once an hour.

For more information on the stored process, see 20.11.6 The stored Daemon and the Sun Java System Messaging Server 6.3 Administration Reference.

For additional information on monitoring the stored function, see 27.7 Monitoring the Message Store.
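One way to act on the time stamps in Table 20–12 is to compare each file's modification time with its expected interval. The sketch below is an assumption-laden example: the directory that holds the stamp files is passed in by the caller because this section does not state it, and the slack factor of 2 is an arbitrary tolerance.

```python
import os
import time

# Expected touch intervals in seconds, taken from Table 20-12.
EXPECTED_INTERVALS = {
    "stored.ckp": 60,        # approximately every minute
    "stored.lcu": 5 * 60,    # approximately every 5 minutes
    "stored.per": 60 * 60,   # once an hour
}

def stale_stamp_files(directory, now=None, slack=2.0):
    """Return the stamp files that are missing or older than
    slack times their expected interval.

    directory and slack are assumptions supplied by the caller.
    """
    now = time.time() if now is None else now
    stale = []
    for name, interval in EXPECTED_INTERVALS.items():
        path = os.path.join(directory, name)
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            # A missing or unreadable stamp file is reported as stale.
            stale.append(name)
            continue
        if age > slack * interval:
            stale.append(name)
    return stale
```

A non-empty result suggests that stored has stopped performing the corresponding task and that the default log should be checked.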

20.14.1.5 Check Database Log Files

Database log files are the Sleepycat transaction checkpointing log files (in the directory store_root/mboxlist). If log files accumulate, then database checkpointing is not occurring. In general, there are two or three database log files at any one time. If there are more files, it could be a sign of a problem.
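The check described above can be sketched in Python. The sketch assumes the usual Berkeley DB naming of transaction logs (log. followed by ten digits); verify this against the files in your own mboxlist directory. The threshold of three comes from the guideline above.

```python
import os
import re

# Assumed naming convention for the database transaction logs.
LOG_NAME = re.compile(r"^log\.\d{10}$")

def count_db_logs(mboxlist_dir):
    """Count database transaction log files in the mboxlist directory."""
    return sum(1 for name in os.listdir(mboxlist_dir) if LOG_NAME.match(name))

def logs_accumulating(mboxlist_dir, max_expected=3):
    """Return True if there are more log files than normally expected."""
    return count_db_logs(mboxlist_dir) > max_expected
```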

20.14.1.6 Check User Folders

To check the user folders, you can run the command reconstruct -r -n (recursive, no fix), which reviews every user folder and reports errors. For more information on the reconstruct command, see 20.14.3 Repairing Mailboxes and the Mailboxes Database.

20.14.1.7 Check for Core Files

Core files only exist when processes have unexpectedly terminated. It is important to review these files, particularly when you see a problem in the message store. On Solaris, use coreadm to configure core file location.

20.14.2 Message Store Startup and Recovery

Message store data consists of the messages, index data, and the message store database. While this data is fairly robust, on rare occasions there may be message store data problems in the system. These problems will be indicated in the default log file, and almost always will be fixed transparently. In rare cases an error message in the log file may indicate that you need to run the reconstruct utility. In addition, as a last resort, messages are protected by the backup and restore processes described in 20.12 Backing Up and Restoring the Message Store. This section will focus on the automatic startup and recovery process of stored.

The message store automates many recovery operations which were previously the responsibility of the administrator. These operations are performed by message store daemon stored during startup and include database snapshots and automatic fast recovery as necessary. stored thoroughly checks the message store’s database and automatically initiates repairs if it detects a problem.

stored also provides a comprehensive analysis of the state of the database via status messages to the default log, reporting on repairs done to the message store and automatic attempts to bring it into operation.

20.14.2.1 Automatic Startup and Recovery—Theory of Operations

The stored daemon starts before the other message store processes. It initializes and, if necessary, recovers the message store database. The message store database keeps folder, quota, subscription, and message flag information. The database is logging and transactional, so recovery is already built in. In addition, some database information is copied redundantly in the message index area for each folder.

Although the database is fairly robust, on the rare occasions that it breaks, in most cases stored recovers and repairs it transparently. However, whenever stored is restarted, you should check the default log files to make sure that additional administrative intervention is not required. Status messages in the log file will tell you to run reconstruct if the database requires further rebuilding.

Before opening the message store database, stored analyzes its integrity and sends status messages to the default log under the category of warning. Some messages will be useful to administrators and some messages will consist of coded data to be used for internal analysis. If stored detects any problems, it will attempt to fix the database and try starting it again.

When the database is opened, stored will signal that the rest of the services may start. If the automatic fixes failed, messages in the default log will specify what actions to take. See Error Messages Signifying that reconstruct is Needed.

In previous releases, stored could start a recovery process which would take a very long time leaving the administrator wondering if stored was “stuck.” This type of long recovery has been removed and stored should determine a final state in less than a minute. However, if stored needs to employ recovery techniques such as recovering from a snapshot, the process may take a few minutes.

After most recoveries, the database will usually be up-to-date and nothing else will be required. However, some recoveries will require a reconstruct -m in order to synchronize redundant data in the message store. Again, this will be stated in the default log, so it is important to monitor the default log after a startup. Even though the message store will seem to be up and running normally, it is important to run any requested operations such as reconstruct.

Another reason for reading the log file is to determine what caused damage to the database in the first place. Although stored is designed to bring up the message store regardless of any problem on the system, you will still want to try to ascertain the cause of the database damage, as this may be a sign of a larger hidden problem.

Error Messages Signifying that reconstruct is Needed

This section describes the type of error messages that require reconstruct to be run.

When the error message indicates a mailbox error, run reconstruct <mailbox>. Examples:

"Invalid cache data for msg 102 in mailbox user/joe/INBOX. Needs reconstruct"

"Mailbox corrupted, missing fixed headers: user/joe/INBOX"

"Mailbox corrupted, start_offset beyond EOF: user/joe/INBOX"

When the error message indicates a database error, run reconstruct -m. Examples:

"Removing extra database logs. Run reconstruct -m soon after startup to resync redundant data"

"Recovering database from snapshot. Run reconstruct -m soon after startup to resync redundant data"
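The two rules above can be sketched as a small Python classifier for log lines. It is a rough illustration built only from the example messages in this section; the matching strings and the assumption that mailbox names start with user/ and contain no spaces are simplifications, not a complete description of the server's messages.

```python
import re

# Assumption: mailbox names look like user/<name>/<folder> without spaces.
MAILBOX = re.compile(r"user/\S+")

def suggested_repair(log_line):
    """Return the reconstruct command a log line calls for, or None."""
    text = log_line.lower()
    if "run reconstruct -m" in text:
        # Database-level message: resynchronize redundant data.
        return "reconstruct -m"
    if "needs reconstruct" in text or "mailbox corrupted" in text:
        match = MAILBOX.search(log_line)
        if match:
            # Drop trailing punctuation that follows the mailbox name.
            return "reconstruct " + match.group(0).rstrip('."')
    return None
```

The function only suggests a command; running it remains an administrator decision.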

Database Snapshots

A snapshot is a hot backup of the database and is used by stored to restore a broken database transparently in a few minutes. This is much quicker than using reconstruct, which relies on the redundant information stored in other areas.

Message Store Database Snapshot—Theory of Operations

Snapshots of the database, which is located in the mboxlist directory, are taken automatically, by default once every 24 hours. Snapshots are copied by default into a subdirectory of the store directory. By default, five copies of the database are kept at any given time: one live database, three snapshots, and one database/removed copy. The database/removed copy is newer and is an emergency copy of the database, which is put into the removed subdirectory of the mboxlist database directory.

If the recovery process decides to remove the current database because it is determined to be bad, stored will move it into the removed directory if it can. This allows the database to be analyzed if desired.

The data move will only happen once a week. If there is already a copy of the database there, stored will not replace it every time the store comes up. It will only replace it if the data in the removed directory is older than a week. This will prevent the original database which had the problem from being replaced too soon by successive startups.

To Specify Message Store Database Snapshot Interval and Location

Allow disk space equal to five times the size of the database for the database and snapshots combined. It is highly recommended that the administrator reconfigure snapshots to be written to a separate disk, and tune the snapshot settings to the system's needs.

If stored detects a problem with the database on startup, the best snapshot will automatically be recovered. Three snapshot variables set the following parameters: the location of the snapshot files, the interval for taking snapshots, and the number of snapshots saved. These configutil parameters are shown in Table 20–13.

A snapshot interval that is too small places a frequent burden on the system and increases the chance that a problem in the database will be copied into a snapshot. A snapshot interval that is too large means that a database recovered from a snapshot holds an older state, the state it had when the snapshot was taken.

A snapshot interval of a day is recommended and a week or more of snapshots can be useful if a problem remains on the system for a number of days and you wish to go back to a period prior to the point at which the problem existed.

stored monitors the database and is intelligent enough to refuse the latest snapshot if it suspects the database is not perfect. It will instead retrieve the latest most reliable snapshot. Despite the fact that a snapshot may be retrieved from a day ago, the system will use more up to date redundant data and override the older snapshot data, if available.

Thus, the ultimate role of the snapshot is to get the system as close to up-to-date as possible and to ease the burden on the rest of the system of rebuilding the data on the fly.

Table 20–13 Message Store Database Snapshot Parameters

Parameter                       Description
local.store.snapshotpath        Location of message store database snapshot files. Either an existing absolute path or a path relative to the store directory. Default: dbdata/snapshots
local.store.snapshotinterval    Minutes between snapshots. Valid values: 1 - 46080. Default: 1440 (1440 minutes = 1 day)
local.store.snapshotdirs        Number of different snapshots kept. Valid values: 2 - 367. Default: 3
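To see how the interval and the number of snapshots interact, the following Python sketch validates the two numeric parameters against the ranges in Table 20–13 and estimates how many days of history the retained snapshots span. The function names are hypothetical, and the estimate assumes snapshots are taken exactly on schedule.

```python
MINUTES_PER_DAY = 1440

def validate_snapshot_settings(interval_minutes, snapshot_dirs):
    """Check values against the valid ranges listed in Table 20-13."""
    if not 1 <= interval_minutes <= 46080:
        raise ValueError("local.store.snapshotinterval must be 1 - 46080")
    if not 2 <= snapshot_dirs <= 367:
        raise ValueError("local.store.snapshotdirs must be 2 - 367")

def snapshot_history_days(interval_minutes=1440, snapshot_dirs=3):
    """Approximate span, in days, covered by the retained snapshots."""
    validate_snapshot_settings(interval_minutes, snapshot_dirs)
    return interval_minutes * snapshot_dirs / MINUTES_PER_DAY
```

With the defaults the snapshots cover about three days; keeping a week of daily snapshots, as suggested above, corresponds to an interval of 1440 and seven snapshot directories.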

20.14.3 Repairing Mailboxes and the Mailboxes Database

If one or more mailboxes become corrupt, you can use the reconstruct utility to rebuild the mailboxes or the mailbox database and repair any inconsistencies.

The reconstruct utility rebuilds one or more mailboxes, or the master mailbox file, and repairs any inconsistencies. You can use this utility to recover from almost any form of data corruption in the mail store. See Error Messages Signifying that reconstruct is Needed.

Table 20–14 lists the reconstruct options. For detailed syntax and usage requirements, see the reconstruct utility in the Sun Java System Messaging Server 6.3 Administration Reference.

Table 20–14 reconstruct Options

Option  

Description  

-e

Removes the store.exp file before reconstructing. This eliminates any internal store record of removed messages which have not been cleaned out by the store process. It would also be useful to use the -f option when using -i or -e, because these options only work if the folder is actually reconstructed. Similarly, if you use the -n option (which performs a check, not a reconstruction), the -i and -e options do not work.

Running reconstruct -e will not recover removed messages if reconstruct does not detect damage. Use the -f option to force the reconstruct.

-i

Sets the store.idx file length to zero before reconstructing. It would also be useful to use the -f option when using -i or -e, because these options only work if the folder is actually reconstructed. Similarly, if you use the -n option (which performs a check, not a reconstruction), the -i and -e options do not work.

-f

Forces reconstruct to perform a fix on the mailbox or mailboxes.

-l

Reconstructs lright.db.

-m

Performs a consistency check and, if needed, repairs the mailboxes database. This option examines every mailbox it finds in the spool area, adding or removing entries from the mailboxes database as appropriate. The utility prints a message to the standard output file whenever it adds or removes an entry from the database. Specifically, it fixes folder.db, quota.db, and lright.db.

-n

Checks the message store only, without performing a fix on the mailbox or mailboxes. The -n option cannot be used by itself unless a mailbox name is provided. When a mailbox name is not provided, the -n option must be used with the -r option. The -r option may be combined with the -p option. For example, any of the following commands are valid:

reconstruct -n user/dulcinea/INBOX

reconstruct -n -r

reconstruct -n -r -p primary

reconstruct -n -r user/dulcinea/

-o

Obsolete, see mboxutil -o

-o -d filename

Obsolete, see mboxutil -o

-p partition

The -p option is used with the -m option and limits the scope of the reconstruction to the specified partition. If the -p option is not specified, reconstruct defaults to all partitions. Specifically, it fixes folder.db and quota.db, but not lright.db. This is because fixing lright.db requires scanning the ACLs for every user in the message store. Performing this for every partition is not very efficient. To fix lright.db, run reconstruct -l.

Specify a partition name; do not use a full path name. 

-q

Fixes any inconsistencies in the quota subsystem, such as mailboxes with the wrong quota root or quota roots with the wrong quota usage reported. The -q option can be run while other server processes are running.

-r [mailbox]

Repairs and performs a consistency check of the partition area of the specified mailbox or mailboxes. The -r option also repairs all sub-mailboxes within the specified mailbox. If you specify -r with no mailbox argument, the utility repairs the spool areas of all mailboxes within the user partition directory.

-u user

The -u option is used with the -m option and limits the scope of the reconstruction to the specified user. The -u option must be used with the -p option. If the -u option is not specified, reconstruct defaults to all partitions or to the partition specified with the -p option.

Specify a user name; do not use a full path name. 

20.14.3.1 To Rebuild Mailboxes

To rebuild mailboxes, use the -r option. You should use this option when:

reconstruct -r first runs a consistency check. It reports any inconsistencies and rebuilds only if it detects any problems. Consequently, performance of the reconstruct utility is improved with this release.

You can use reconstruct as described in the following examples:

To rebuild the spool area for the mailboxes belonging to the user daphne, use the following command:

reconstruct -r user/daphne

To rebuild the spool area for all mailboxes listed in the mailbox database:

reconstruct -r

You must use this option with caution, however, because rebuilding the spool area for all mailboxes listed in the mailbox database can take a very long time for large message stores. (See 20.14.3.3 reconstruct Performance.) A better method for failure recovery might be to use multiple disks for the store. If one disk goes down, the entire store does not. If a disk becomes corrupt, you need only rebuild a portion of the store by using the -p option as follows:

reconstruct -r -p subpartition

To rebuild mailboxes listed in the command-line argument only if they are in the primary partition:

reconstruct -p primary mbox1 mbox2 mbox3

If you do need to rebuild all mailboxes in the primary partition:

reconstruct -r -p primary

If you want to force reconstruct to rebuild a folder without performing a consistency check, use the -f option. For example, the following command forces a reconstruct of the user folder daphne:

reconstruct -f -r user/daphne

To check all mailboxes without fixing them, use the -n option as follows:

reconstruct -r -n

20.14.3.2 Checking and Repairing Mailboxes

To perform a high-level consistency check and repair of the mailboxes database:

reconstruct -m

To perform a consistency check and repair of the primary partition:

reconstruct -p primary -m

Note –

Running reconstruct with the -p and -m flags together will not fix lright.db. This is because fixing lright.db requires scanning the ACLs for every user in the message store. Performing this for every partition is not very efficient. To fix lright.db, run reconstruct -l.


To perform a consistency check and repair of the mailboxes of an individual user named john:

reconstruct -p primary -u john -m

You should use the -m option when:

20.14.3.3 reconstruct Performance

The time it takes reconstruct to perform an operation depends on the following factors:

The reconstruct -r option performs an initial consistency check; this check improves reconstruct performance depending on how many folders must be rebuilt.

The following performance was found with a system with approximately 2400 users, a message store of 85GB, and concurrent POP, IMAP, or SMTP activity on the server:


Note –

A reconstruct operation may take significantly less time if the server is not performing ongoing POP, IMAP, HTTP, or SMTP activity.


20.14.4 Common Problems and Solutions

This section lists common message store problems and solutions:

20.14.4.1 Reduced Message Store Performance

Message store problems can occur if the mboxlist database cache is too small. Specifically, message store performance can slow to unacceptable levels, and processes can even dump core. Refer to Tuning the mboxlist Database Cache.

20.14.4.2 Linux - Messaging Server Patch 120230-08 IMAP, POP and HTTP Servers Not Starting Due to Over Sessions Per Process

After installing this patch, when you try to start Messaging Server, the IMAP, POP and HTTP servers do not start and may log errors such as the following:


http server - log:
[29/May/2006:17:44:37 +051800] usg197 httpd[6751]: General Critical: Not enough file 
descriptors to support 6000 sessions per process; Recommend ulimit -n 12851 or 87 
sessions per process.

pop server - log:
[29/May/2006:17:44:37 +051800] usg197 popd[6749]: General Critical: Not enough file 
descriptors to support 600 sessions per process; Recommend ulimit -n 2651 or 58 
sessions per process.

Once these values were set with /opt/sun/messaging/sbin/configutil, the imap server failed to start:

imap server - log: 
[29/May/2006:17:44:37 +051800] usg197 imapd[6747]: General Critical: Not enough 
file descriptors to support 4000 sessions per process; Recommend ulimit -n 12851 
or 58 sessions per process.

Set the appropriate number of file descriptors for all three server sessions. Additional file descriptors are available by adding a line similar to the following to /etc/sysctl.conf and using sysctl -p to reread that file:


fs.file-max = 65536 

You must also add a line like the following to /etc/security/limits.conf:


*   soft  nofile  65536  
*   hard  nofile  65536

20.14.4.3 Messenger Express or Communications Express Not Loading Mail Page

If the user cannot load any Messenger Express pages or the Communications Express mail page, the problem may be that the data is getting corrupted after compression. This can sometimes happen if the system has deployed an outdated proxy server. To solve this problem, try setting local.service.http.gzip.static and local.service.http.gzip.dynamic to 0 to disable data compression. If this solves the problem, you may want to update the proxy server.

20.14.4.4 Command Using Wildcard Pattern Doesn’t Work

Some UNIX shells require quotes around wildcard parameters and some do not. For example, the C shell tries to expand arguments containing wildcards (*, ?) as files and will fail if no match is found. These pattern matching arguments may need to be enclosed in quotes to be passed to commands like mboxutil.

For example:

mboxutil -l -p user/usr44*

will work in the Bourne shell, but will fail with tcsh and the C shell. These shells would require the following:

mboxutil -l -p "user/usr44*"

If a command using a wildcard pattern doesn’t work, verify whether or not you need to use quotes around wildcards for that shell.
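When such commands are generated by a script, quoting can be applied automatically. The following Python sketch uses the standard shlex.quote function, which targets POSIX shells, to build the command line; the helper name is hypothetical, and other shells may treat some characters differently.

```python
import shlex

def mboxutil_list_command(pattern):
    """Build an 'mboxutil -l -p' command line with the pattern quoted
    so that the shell passes wildcards through unexpanded."""
    return "mboxutil -l -p " + shlex.quote(pattern)
```

For the pattern user/usr44* this yields a command line with the pattern in single quotes, while a pattern without special characters is left unquoted.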

20.14.4.5 Unknown/invalid Partition

A user can get the message “Unknown/invalid partition” in Messenger Express if their mailbox was moved to a new partition which was just created and Messaging Server was not refreshed or restarted. This problem only occurs on new partitions. If you now add additional user mailboxes to this new partition, you will not have to do a refresh/restart of Messaging Server.

20.14.4.6 User Mailbox Directory Problems

A user mailbox problem exists when the damage to the message store is limited to a small number of users and there is no global damage to the system. The following guidelines suggest a process for identifying, analyzing, and resolving a user mailbox directory problem:

  1. Review the log files, the error messages, or any unusual behavior that the user observes.

  2. To keep debugging information and history, copy the entire store_root/mboxlist/ user directory to another location outside the message store.

  3. To find the user folder that might be causing the problem, run the command reconstruct -r -n. If you are unable to find the folder using reconstruct, the folder might not exist in the folder.db.

    If you are unable to find the folder using the reconstruct -r -n command, use the hashdir command to determine the location. For more information on hashdir, see 20.11.2.3 The hashdir Utility and the hashdir utility in the Messaging Server Command-line Utilities chapter of the Sun Java System Messaging Server 6.3 Administration Reference.

  4. Once you find the folder, examine the files, check permissions, and verify the proper file sizes.

  5. Use reconstruct -r (without the -n option) to rebuild the mailbox.

  6. If reconstruct does not detect a problem that you observe, you can force the reconstruction of your mail folders by using the reconstruct -r -f command.

  7. If the folder does not exist in the mboxlist directory (store_root/mboxlist), but exists in the partition directory (store_root/partition), there might be a global inconsistency. In this case, you should run the reconstruct -m command.

  8. If the previous steps do not work, you can remove the store.idx file and run the reconstruct command again.


    Caution –

    You should only remove the store.idx file if you are sure there is a problem in the file that the reconstruct command is unable to find.


  9. If the issue is limited to a problematic message, you should copy the message file to another location outside of the message store and run the command reconstruct -r on the mailbox/ directory.

  10. If you determine the folder exists on the disk (store_root/partition/ directory), but is apparently not in the database (store_root/mboxlist/ directory), run the command reconstruct -m to ensure message store consistency.

For more information on the reconstruct command, see 20.14.3 Repairing Mailboxes and the Mailboxes Database

20.14.4.7 Store Daemon Not Starting

If stored won’t start and returns the following error message:


# msg-svr-base/sbin/start-msg

msg-svr-base: Starting STORE daemon ...Fatal error: Cannot
find group in name service

This indicates that the UNIX group configured in local.servergid cannot be found. stored and other processes need to set their gid to that group. Sometimes the group defined by local.servergid gets inadvertently deleted. In this case, re-create the deleted group, add mailsrv to the group, and change ownership of the instance_root and its files to mailsrv and the group.

20.14.4.8 User Mail Not Delivered Due to Mailbox Overflow

The message store has a hard limit of two gigabytes for a store.idx file, which is equivalent to about one million messages in a single mailbox (folder). If a mailbox grows to the point that the store.idx file will attempt to exceed two gigabytes, the user will stop receiving any new email. In addition, other processes that handle that mailbox, such as imapd, popd, mshttpd, could also experience degraded performance.

If this problem arises, you will see errors in mail.log_current such as this:

05-Oct-2005 16:09:09.63 ims-ms Q 7 ... System I/O error. Administrator, check server log for details. System I/O error.

In addition, the MTA log file will have an error such as this:

[05/Oct/2005:16:09:09 +0900] jmail ims_master[20745]: Store Error: Unable to append cache for user/admin: File too large

You can determine this problem conclusively by looking at the file in the user's message store directory, or by looking in the imta log file to see a more detailed message.

The immediate action is to reduce the size of the file. Either delete some mail, or move some of it to another mailbox. You could also use mboxutil -r to rename the folder out of the way, or mboxutil -d to delete the folder (see 20.11.2.1 The mboxutil Utility).

Long-term, you will need to inform the user of mailbox size limitations, implement an aging policy (see 20.9 To Set the Automatic Message Removal (Expire and Purge) Feature), implement a quota policy (see 20.8 About Message Store Quotas), set a mailbox limit by setting local.store.maxmessages (see configutil Parameters in Sun Java System Messaging Server 6.3 Administration Reference), set up an archiving system, or otherwise keep the mailbox size under control.
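A mailbox can be checked before it reaches the limit by comparing the size of its store.idx file with the two-gigabyte ceiling. The Python sketch below is a hedged example: it reads "two gigabytes" as 2 GiB, and the 80 percent warning level and function names are assumptions. Dividing the limit by one million messages gives roughly 2 KB of index data per message.

```python
import os

# Assumption: the "two gigabytes" limit is interpreted as 2 GiB.
IDX_LIMIT_BYTES = 2 * 1024 ** 3

def idx_fill_fraction(size_bytes):
    """Fraction of the store.idx size limit that is already used."""
    return size_bytes / IDX_LIMIT_BYTES

def idx_near_limit(store_idx_path, warn_fraction=0.8):
    """Return True if the store.idx file at the given path has reached
    warn_fraction of the limit. The 0.8 default is an example value."""
    size = os.path.getsize(store_idx_path)
    return idx_fill_fraction(size) >= warn_fraction
```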

20.14.4.9 IMAP Events Become Slow

Symptom: After working fine for a short period of time, many IMAP events become unreasonably slow, with some events taking over a second.

Diagnosis: You have the Event Notification Service (ENS) plugin, libibiff, configured, but ENS is not running or not reachable. See Appendix B, Administering Event Notification Service in Messaging Server for ENS details.

Solution: If you want ENS notifications, make sure the ENS is enabled and configured correctly. If you do not want ENS notifications, make sure that libibiff is not being loaded. Typical bad configuration:

local.store.notifyplugin = /opt/sun/comms/messaging/lib/libibiff
local.ens.enable = 0

Either of the following configurations resolves the problem:

local.store.notifyplugin = 
local.ens.enable = 0

or

local.store.notifyplugin = /opt/sun/comms/messaging/lib/libibiff
local.ens.enable = 1
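The bad combination described above can be detected mechanically. The following Python sketch parses settings written in the name = value form shown in this section and flags the case where the libibiff plugin is configured while ENS is disabled. The function names are hypothetical, and treating a missing local.ens.enable as disabled is an assumption, not a documented default.

```python
def parse_settings(text):
    """Parse 'name = value' lines into a dictionary."""
    settings = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

def ens_config_problem(settings):
    """Return a description of the misconfiguration, or None if the
    plugin and ENS settings are consistent."""
    plugin = settings.get("local.store.notifyplugin", "")
    # Assumption: an unset local.ens.enable is treated as disabled.
    ens_enabled = settings.get("local.ens.enable", "0") == "1"
    if "libibiff" in plugin and not ens_enabled:
        return "libibiff is configured but ENS is not enabled"
    return None
```

The check covers only the configuration; an enabled ENS that is not running or not reachable causes the same symptom and must be verified separately.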