This section provides guidelines for actively maintaining your message store. In addition, this section describes other message store recovery procedures you can use if the message store becomes corrupted or unexpectedly shuts down. Note that the section on these additional message store recovery procedures is an extension of 20.14.3 Repairing Mailboxes and the Mailboxes Database.
Prior to reading this section, it is strongly recommended that you review this chapter as well as the command-line utility and configutil chapters in the Sun Java System Messaging Server Administration Reference.
This section outlines standard monitoring procedures for the message store. These procedures are helpful for general message store checks, testing, and standard maintenance.
For additional information, see 27.7 Monitoring the Message Store.
A message store should have enough spare disk space and hardware resources. When the message store approaches the limits of its disk space or hardware capacity, problems might occur within the message store.
Inadequate disk space is one of the most common causes of mail server problems and failures. Without space to write to the message store, the mail server will fail. In addition, when the available disk space falls below a certain threshold, problems occur with message delivery, logging, and so forth. Disk space can be rapidly depleted when the cleanup function of the stored process fails and deleted messages are not expunged from the message store.
For information on monitoring disk space, see 20.11.5 To Monitor Disk Space and 27.7 Monitoring the Message Store.
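As a rough illustration of the kind of disk space check described above, the following Python sketch reports when a file system drops below a chosen free-space fraction. The 10 percent threshold and the function names are assumptions made for this example; they are not Messaging Server settings.

```python
import shutil


def free_fraction(total_bytes, free_bytes):
    """Return the fraction of the file system that is still free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def disk_space_low(path, min_free=0.10):
    """True when the file system holding `path` has less than `min_free` free.

    min_free=0.10 is an illustrative threshold, not a product default.
    """
    usage = shutil.disk_usage(path)
    return free_fraction(usage.total, usage.free) < min_free
```

A scheduled job could call disk_space_low() on the directory that holds the store partition and raise an alert when it returns True.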
Check the log files to make sure the message store processes are running as configured. Messaging Server creates a separate set of log files for each of the major protocols, or services, it supports: SMTP, IMAP, POP, and HTTP. You can look at the log files in the directory msg-svr-base/log/. You should monitor the log files on a routine basis.
Be aware that logging can impact server performance. The more verbose the logging you specify, the more disk space your log files will occupy for a given amount of time. You should define effective but realistic log rotation, expiration, and backup policies for your server. For information about defining logging policies for your server, see Chapter 25, Managing Logging.
Messaging Server provides a feature called telemetry that can capture a user’s entire IMAP, POP or HTTP session into a file. This feature is useful for debugging client problems. For example, if a user complains that their message access client is not working as expected, this feature can be used to trace the interaction between the access client and Messaging Server.
To capture a session, create a directory of the following form, where pop_or_imap_or_http is the protocol to capture:
msg-svr-base/data/telemetry/pop_or_imap_or_http/userid
To capture a POP session, create the following directory:
msg-svr-base/data/telemetry/pop/userid
To capture an IMAP session, create the following directory:
msg-svr-base/data/telemetry/imap/userid
To capture a Webmail session, create the following directory:
msg-svr-base/data/telemetry/http/userid
Note that the directory must be owned or writable by the messaging server userid.
Messaging Server will create one file per session in that directory. Example output is shown below.
LOGIN redb 2003/11/26 13:03:21
>0.017>1 OK User logged in
<0.047<2 XSERVERINFO MANAGEACCOUNTURL MANAGELISTSURL MANAGEFILTERSURL
>0.003>* XSERVERINFO MANAGEACCOUNTURL {67}
http://redb@cuisine.blue.planet.com:800/bin/user/admin/bin/enduser MANAGELISTSURL NIL MANAGEFILTERSURL NIL
2 OK Completed
<0.046<3 select "INBOX"
>0.236>* FLAGS (\Answered flagged draft deleted \Seen $MDNSent Junk)
* OK [PERMANENTFLAGS (\Answered flag draft deleted \Seen $MDNSent Junk \*)]
* 1538 EXISTS
* 0 RECENT
* OK [UNSEEN 23]
* OK [UIDVALIDITY 1046219200]
* OK [UIDNEXT 1968]
3 OK [READ-WRITE] Completed
<0.045<4 UID fetch 1:* (FLAGS)
>0.117>* 1 FETCH (FLAGS (\Seen) UID 330)
* 2 FETCH (FLAGS (\Seen) UID 331)
* 3 FETCH (FLAGS (\Seen) UID 332)
* 4 FETCH (FLAGS (\Seen) UID 333)
* 5 FETCH (FLAGS (\Seen) UID 334)
<etc>
To disable the telemetry logging, move or remove the directory that you created.
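The directory layout used for telemetry can be sketched in Python as follows. The function names are hypothetical, and the sketch only creates the directory; making it owned or writable by the messaging server userid is left to the administrator.

```python
import os

PROTOCOLS = ("pop", "imap", "http")


def telemetry_dir(msg_svr_base, protocol, userid):
    """Build msg-svr-base/data/telemetry/<protocol>/<userid>."""
    if protocol not in PROTOCOLS:
        raise ValueError("protocol must be one of %s" % (PROTOCOLS,))
    if not userid or "/" in userid or userid in (".", ".."):
        raise ValueError("userid must be a single path component")
    return os.path.join(msg_svr_base, "data", "telemetry", protocol, userid)


def enable_telemetry(msg_svr_base, protocol, userid):
    """Create the telemetry directory (if needed) and return its path."""
    path = telemetry_dir(msg_svr_base, protocol, userid)
    os.makedirs(path, exist_ok=True)
    return path
```

Removing or renaming the returned directory corresponds to disabling telemetry for that user.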
The stored function performs a variety of important tasks such as deadlock and transaction operations of the message database, enforcing aging policies, and expunging and erasing messages stored on disk. If stored stops running, Messaging Server will eventually run into problems. If stored doesn’t start when start-msg is run, no other processes will start.
Check that the stored process is running. Run imcheck.
Check for log file buildup in store_root/mboxlist.
Check for stored messages in the default log file msg-svr-base/log/default/default.
Check that the time stamps of the following files (in the directory msg-svr-base/config/) are updated whenever the stored process attempts one of the following functions:
| stored Operation | Function |
|---|---|
| stored.ckp | Touched when a database checkpoint was initiated. Stamped approximately every 1 minute. |
| stored.lcu | Touched at every database log cleanup. Time stamped approximately every 5 minutes. |
| stored.per | Touched at every spawn of peruser db write out. Time stamped once an hour. |
For more information on the stored process, see 20.11.6 The stored Daemon chapter of the Sun Java System Messaging Server 6.3 Administration Reference.
For additional information on monitoring the stored function, see 27.7 Monitoring the Message Store
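The time stamp check on stored.ckp, stored.lcu, and stored.per can be automated along the following lines. This is a minimal sketch: the intervals come from the table above, while the slack factor of 3 and the function name are assumptions chosen for the example.

```python
import os
import time

# Approximate update intervals in minutes for each stamp file.
EXPECTED_MINUTES = {"stored.ckp": 1, "stored.lcu": 5, "stored.per": 60}


def stale_stamps(config_dir, now=None, slack=3.0):
    """Return stamp files that are missing or older than slack * interval."""
    now = time.time() if now is None else now
    stale = []
    for name, minutes in EXPECTED_MINUTES.items():
        path = os.path.join(config_dir, name)
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            stale.append(name)  # a missing file is reported as stale
            continue
        if age > slack * minutes * 60:
            stale.append(name)
    return stale
```

An empty result suggests that stored is checkpointing and cleaning up on schedule; any name in the list is a reason to look at the default log.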
Database log files refer to Sleepycat transaction checkpointing log files (in the directory store_root/mboxlist). If log files accumulate, database checkpointing is not occurring. In general, there are two or three database log files at any given time. More files than that could be a sign of a problem.
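Counting the database log files can be scripted as in the sketch below. It assumes the usual Berkeley DB (Sleepycat) naming of log.NNNNNNNNNN for transaction logs, and the limit of three files simply mirrors the "two or three" guideline; both are assumptions to verify against your own store_root/mboxlist directory.

```python
import os
import re

# Assumption: transaction logs are named "log." followed by ten digits.
LOG_NAME = re.compile(r"^log\.\d{10}$")


def db_log_files(mboxlist_dir):
    """Return the sorted names of the database log files in the directory."""
    return sorted(n for n in os.listdir(mboxlist_dir) if LOG_NAME.match(n))


def logs_accumulating(mboxlist_dir, max_expected=3):
    """True when more log files exist than the guideline expects."""
    return len(db_log_files(mboxlist_dir)) > max_expected
```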
If you want to check the user folders, you might run the command reconstruct -r -n (recursive no fix) which will review any user folder and report errors. For more information on the reconstruct command, see 20.14.3 Repairing Mailboxes and the Mailboxes Database
Core files only exist when processes have unexpectedly terminated. It is important to review these files, particularly when you see a problem in the message store. On Solaris, use coreadm to configure core file location.
Message store data consists of the messages, index data, and the message store database. While this data is fairly robust, on rare occasions there may be message store data problems in the system. These problems will be indicated in the default log file, and almost always will be fixed transparently. In rare cases an error message in the log file may indicate that you need to run the reconstruct utility. In addition, as a last resort, messages are protected by the backup and restore processes described in 20.12 Backing Up and Restoring the Message Store. This section will focus on the automatic startup and recovery process of stored.
The message store automates many recovery operations which were previously the responsibility of the administrator. These operations are performed by message store daemon stored during startup and include database snapshots and automatic fast recovery as necessary. stored thoroughly checks the message store’s database and automatically initiates repairs if it detects a problem.
stored also provides a comprehensive analysis of the state of the database via status messages to the default log, reporting on repairs done to the message store and automatic attempts to bring it into operation.
The stored daemon starts before the other message store processes. It initializes and, if necessary, recovers the message store database. The message store database keeps folder, quota, subscription, and message flag information. The database is logging and transactional, so recovery is already built in. In addition, some database information is copied redundantly in the message index area for each folder.
Although the database is fairly robust, on the rare occasions that it breaks, in most cases stored recovers and repairs it transparently. However, whenever stored is restarted, you should check the default log files to make sure that additional administrative intervention is not required. Status messages in the log file will tell you to run reconstruct if the database requires further rebuilding.
Before opening the message store database, stored analyzes its integrity and sends status messages to the default log under the category of warning. Some messages will be useful to administrators and some messages will consist of coded data to be used for internal analysis. If stored detects any problems, it will attempt to fix the database and try starting it again.
When the database is opened, stored will signal that the rest of the services may start. If the automatic fixes failed, messages in the default log will specify what actions to take. See Error Messages Signifying that reconstruct is Needed
In previous releases, stored could start a recovery process which would take a very long time leaving the administrator wondering if stored was “stuck.” This type of long recovery has been removed and stored should determine a final state in less than a minute. However, if stored needs to employ recovery techniques such as recovering from a snapshot, the process may take a few minutes.
After most recoveries, the database will usually be up-to-date and nothing else will be required. However, some recoveries will require a reconstruct -m in order to synchronize redundant data in the message store. Again, this will be stated in the default log, so it is important to monitor the default log after a startup. Even though the message store will seem to be up and running normally, it is important to run any requested operations such as reconstruct.
Another reason for reading the log file is to determine what caused damage to the database in the first place. Although stored is designed to bring up the message store regardless of any problem on the system, you will still want to try to ascertain the cause of the database damage, as this may be a sign of a larger hidden problem.
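Because requests to run reconstruct appear only in the default log, it can help to scan that log after each startup. The following Python sketch maps a log message to the command it asks for. The matched phrases are taken from the example messages quoted in this chapter; real logs may contain other wordings, and mailbox names that contain spaces are not handled.

```python
import re

# A mailbox name as it appears in the example messages, e.g. user/joe/INBOX.
MAILBOX = re.compile(r"\buser/\S+")


def suggested_command(message):
    """Map a default-log message to the reconstruct command it calls for."""
    if "reconstruct -m" in message:
        return "reconstruct -m"
    if "Needs reconstruct" in message or "Mailbox corrupted" in message:
        match = MAILBOX.search(message)
        if match:
            # Drop sentence punctuation that trails the mailbox name.
            return "reconstruct " + match.group(0).rstrip('."')
    return None
```

Messages that match neither pattern return None, so the helper can be applied to every line of the log.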
This section describes the type of error messages that require reconstruct to be run.
When the error message indicates a mailbox error, run reconstruct <mailbox>. Example:
"Invalid cache data for msg 102 in mailbox user/joe/INBOX. Needs reconstruct"
"Mailbox corrupted, missing fixed headers: user/joe/INBOX"
"Mailbox corrupted, start_offset beyond EOF: user/joe/INBOX"
When the error message indicates a database error, run reconstruct -m. Example:
"Removing extra database logs. Run reconstruct -m soon after startup to resync redundant data"
"Recovering database from snapshot. Run reconstruct -m soon after startup to resync redundant data"
A snapshot is a hot backup of the database and is used by stored to restore a broken database transparently in a few minutes. This is much quicker than using reconstruct, which relies on the redundant information stored in other areas.
Snapshots of the database, located in the mboxlist directory, are taken automatically, by default once every 24 hours. Snapshots are copied by default into a subdirectory of the store directory. By default, five copies of the database are kept at any given time: one live database, three snapshots, and one database/removed copy. The database/removed copy is newer and is an emergency copy of the database, which is put into the removed subdirectory of the mboxlist database directory.
If the recovery process decides to remove the current database because it is determined to be bad, stored will move it into the removed directory if it can. This allows the database to be analyzed if desired.
The data move will only happen once a week. If there is already a copy of the database there, stored will not replace it every time the store comes up. It will only replace it if the data in the removed directory is older than a week. This will prevent the original database which had the problem from being replaced too soon by successive startups.
Allow five times the size of the database in disk space for the database and snapshots combined. It is highly recommended that the administrator reconfigure snapshots to run on a separate disk and tune them to the system’s needs.
If stored detects a problem with the database on startup, the best snapshot will automatically be recovered. Three configutil parameters set the location of the snapshot files, the interval for taking snapshots, and the number of snapshots saved. These parameters are shown in Table 20–13.
Having a snapshot interval which is too small will result in a frequent burden to the system and a greater chance that a problem in the database will be copied as a snapshot. Having a snapshot interval too large can create a situation where the database will hold the state it had back when the snapshot was taken.
A snapshot interval of a day is recommended and a week or more of snapshots can be useful if a problem remains on the system for a number of days and you wish to go back to a period prior to the point at which the problem existed.
stored monitors the database and is intelligent enough to refuse the latest snapshot if it suspects the database is not perfect. It will instead retrieve the latest most reliable snapshot. Despite the fact that a snapshot may be retrieved from a day ago, the system will use more up to date redundant data and override the older snapshot data, if available.
Thus, the ultimate role the snapshot plays is to get the system as close to up-to-date and ease the burden of the rest of the system trying to rebuild the data on the fly.
Table 20–13 Message Store Database Snapshot Parameters
| Parameter | Description |
|---|---|
| | Location of message store database snapshot files. Either existing absolute path or path relative to the store directory. Default: dbdata/snapshots |
| | Minutes between snapshots. Valid values: 1 - 46080 Default: 1440 (1440 minutes = 1 day) |
| local.store.snapshotdirs | Number of different snapshots kept. Valid values: 2 - 367 Default: 3 |
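The relationship between the snapshot interval, the number of snapshots kept, and the disk space to plan for can be worked through with a small calculation. This is a sketch under stated assumptions: the function names are invented for the example, the valid ranges come from Table 20–13, and the copy count follows the description of one live database, the retained snapshots, and one copy in the removed directory.

```python
def snapshot_history_days(interval_minutes=1440, snapshot_count=3):
    """Roughly how many days of history the retained snapshots span."""
    if not 1 <= interval_minutes <= 46080:
        raise ValueError("interval must be 1 - 46080 minutes")
    if not 2 <= snapshot_count <= 367:
        raise ValueError("snapshot count must be 2 - 367")
    return interval_minutes * snapshot_count / 1440


def database_copies(snapshot_count=3):
    """Live database + snapshots + the copy in the removed directory."""
    return 1 + snapshot_count + 1
```

With the defaults, the snapshots cover about three days and five copies of the database exist, which is where the five-times disk space guideline comes from. Keeping a week of daily snapshots (a count of 7) raises the number of copies to nine.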
If one or more mailboxes become corrupt, you can use the reconstruct utility to rebuild the mailboxes or the mailbox database and repair any inconsistencies.
The reconstruct utility rebuilds one or more mailboxes, or the master mailbox file, and repairs any inconsistencies. You can use this utility to recover from almost any form of data corruption in the mail store. See Error Messages Signifying that reconstruct is Needed
Table 20–14 lists the reconstruct options. For detailed syntax and usage requirements, see the reconstruct utility in the Sun Java System Messaging Server 6.3 Administration Reference.
Table 20–14 reconstruct Options
| Option | Description |
|---|---|
| -e | Removes the store.exp file before reconstructing. This eliminates any internal store record of removed messages which have not been cleaned out by the store process. It would also be useful to use the -f option when using -i or -e, because these options only work if the folder is actually reconstructed. Similarly, if you use the -n option (which performs a check, not a reconstruction), the -i and -e options do not work. Running a reconstruct -e will not recover removed messages if reconstruct does not detect damage. An -f will force the reconstruct. |
| -i | Sets the store.idx file length to zero before reconstructing. It would also be useful to use the -f option when using -i or -e, because these options only work if the folder is actually reconstructed. Similarly, if you use the -n option (which performs a check, not a reconstruction), the -i and -e options do not work. |
| -f | Forces reconstruct to perform a fix on the mailbox or mailboxes. |
| -l | Reconstructs lright.db. |
| -m | Performs a consistency check and, if needed, repairs the mailboxes database. This option examines every mailbox it finds in the spool area, adding or removing entries from the mailboxes database as appropriate. The utility prints a message to the standard output file whenever it adds or removes an entry from the database. Specifically, it fixes folder.db, quota.db, and lright.db. |
| -n | Checks the message store only, without performing a fix on the mailbox or mailboxes. The -n option cannot be used by itself unless a mailbox name is provided. When a mailbox name is not provided, the -n option must be used with the -r option. The -r option may be combined with the -p option. For example, any of the following commands are valid: reconstruct -n user/dulcinea/INBOX; reconstruct -n -r; reconstruct -n -r -p primary; reconstruct -n -r user/dulcinea/ |
| -o | Obsolete, see mboxutil -o |
| -o -d filename | Obsolete, see mboxutil -o |
| -p partition | The -p option is used with the -m option and limits the scope of the reconstruction to the specified partition. If the -p option is not specified, reconstruct defaults to all partitions. Specifically, it fixes folder.db and quota.db, but not lright.db. This is because fixing lright.db requires scanning the ACLs for every user in the message store. Performing this for every partition is not very efficient. To fix lright.db, run reconstruct -l. Specify a partition name; do not use a full path name. |
| -q | Fixes any inconsistencies in the quota subsystem, such as mailboxes with the wrong quota root or quota roots with the wrong quota usage reported. The -q option can be run while other server processes are running. |
| -r [mailbox] | Repairs and performs a consistency check of the partition area of the specified mailbox or mailboxes. The -r option also repairs all sub-mailboxes within the specified mailbox. If you specify -r with no mailbox argument, the utility repairs the spool areas of all mailboxes within the user partition directory. |
| -u user | The -u option is used with the -m option and limits the scope of the reconstruction to the specified user. The -u option must be used with the -p option. If the -u option is not specified, reconstruct defaults to all partitions or to the partition specified with the -p option. Specify a user name; do not use a full path name. |
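Several of the option descriptions in Table 20–14 state combination rules. The Python sketch below checks a proposed set of options against only those stated rules before a command is run. It is an illustration with a hypothetical function name; it does not reproduce the argument parsing of the reconstruct utility itself.

```python
def option_problems(options, mailboxes=()):
    """Return rule violations for a proposed set of reconstruct options.

    `options` is a collection such as {"-n", "-r"}; only the combination
    rules quoted in the options table are checked.
    """
    options = set(options)
    problems = []
    if "-n" in options and not mailboxes and "-r" not in options:
        problems.append("-n needs a mailbox name or -r")
    if "-n" in options and options & {"-i", "-e"}:
        problems.append("-i and -e have no effect with -n")
    if "-u" in options and not {"-p", "-m"} <= options:
        problems.append("-u must be combined with -p and -m")
    return problems
```

An empty list means that none of the checked rules is violated; it does not guarantee that the utility will accept the command.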
To rebuild mailboxes, use the -r option. You should use this option when:
Accessing a mailbox returns one of the following errors: “System I/O error” or “Mailbox has an invalid format”.
Accessing a mailbox causes the server to crash.
Files have been added to or removed from the spool directory.
reconstruct -r first runs a consistency check. It reports any inconsistencies and rebuilds only if it detects any problems. Consequently, performance of the reconstruct utility is improved with this release.
You can use reconstruct as described in the following examples:
To rebuild the spool area for the mailboxes belonging to the user daphne, use the following command:
reconstruct -r user/daphne
To rebuild the spool area for all mailboxes listed in the mailbox database:
reconstruct -r
You must use this option with caution, however, because rebuilding the spool area for all mailboxes listed in the mailbox database can take a very long time for large message stores. (See 20.14.3.3 reconstruct Performance.) A better method for failure recovery might be to use multiple disks for the store. If one disk goes down, the entire store does not. If a disk becomes corrupt, you need only rebuild a portion of the store by using the -p option as follows:
reconstruct -r -p subpartition
To rebuild mailboxes listed in the command-line argument only if they are in the primary partition:
reconstruct -p primary mbox1 mbox2 mbox3
If you do need to rebuild all mailboxes in the primary partition:
reconstruct -r -p primary
If you want to force reconstruct to rebuild a folder without performing a consistency check, use the -f option. For example, the following command forces a reconstruct of the user folder daphne:
reconstruct -f -r user/daphne
To check all mailboxes without fixing them, use the -n option as follows:
reconstruct -r -n
To perform a high-level consistency check and repair of the mailboxes database:
reconstruct -m
To perform a consistency check and repair of the primary partition:
reconstruct -p primary -m
Running reconstruct with the -p and -m flags together will not fix lright.db. This is because fixing lright.db requires scanning the ACLs for every user in the message store. Performing this for every partition is not very efficient. To fix lright.db, run reconstruct -l.
To perform a consistency check and repair of an individual user’s mailbox named john:
reconstruct -p primary -u john -m
You should use the -m option when:
One or more directories were removed from the store spool area, so the mailbox database entries also need to be removed.
One or more directories were restored to the store spool area, so the mailbox database entries also need to be added.
The stored -d option is unable to make the database consistent.
If the stored -d option is unable to make the database consistent, you should perform the following steps in the order indicated:
Shut down all servers.
Remove all files in store_root/mboxlist.
Restart the server processes.
Run reconstruct -m to build a new mailboxes database from the contents of the spool area.
The time it takes reconstruct to perform an operation depends on the following factors:
The kind of operation being performed and the options chosen
Disk performance
The number of folders when running reconstruct -m
The number of messages when running reconstruct -r
The overall size of the message store
What other processes the system is running and how busy the system is
Whether or not there is ongoing POP, IMAP, HTTP, or SMTP activity. Note that reconstruct is designed to run with the store services up. It is not necessary to keep the store offline to run reconstruct.
The reconstruct -r option performs an initial consistency check; this check improves reconstruct performance depending on how many folders must be rebuilt.
The following performance was found with a system with approximately 2400 users, a message store of 85GB, and concurrent POP, IMAP, or SMTP activity on the server:
reconstruct -m took about 1 hour
reconstruct -r -f took about 18 hours
A reconstruct operation may take significantly less time if the server is not performing ongoing POP, IMAP, HTTP, or SMTP activity.
This section lists common message store problems and solutions:
Message store problems can occur if the mboxlist database cache is too small. Specifically, message store performance can slow to unacceptable levels, and the message store can even dump core. Refer to Tuning the mboxlist Database Cache.
After installing this patch, when you try to start Messaging Server, the IMAP, POP, and HTTP servers do not start and may log errors such as the following:
http server - log:
[29/May/2006:17:44:37 +051800] usg197 httpd[6751]: General Critical: Not enough file descriptors to support 6000 sessions per process; Recommend ulimit -n 12851 or 87 sessions per process.
pop server - log:
[29/May/2006:17:44:37 +051800] usg197 popd[6749]: General Critical: Not enough file descriptors to support 600 sessions per process; Recommend ulimit -n 2651 or 58 sessions per process.
Once these values setting in /opt/sun/messaging/sbin/configutil then imap server failed to start
imap server - log:
[29/May/2006:17:44:37 +051800] usg197 imapd[6747]: General Critical: Not enough file descriptors to support 4000 sessions per process; Recommend ulimit -n 12851 or 58 sessions per process.
Set the appropriate number of file descriptors for all three server sessions. Additional file descriptors are available by adding a line similar to the following to /etc/sysctl.conf and using sysctl -p to reread that file:
fs.file-max = 65536
You must also add a line like the following to /etc/security/limits.conf:
* soft nofile 65536
* hard nofile 65536
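The "Not enough file descriptors" messages state the limit each server needs, so the largest recommended value can be extracted from the logs and compared with the nofile setting. The following Python sketch parses the message format shown in this section; the function names are hypothetical, and other releases may word the message differently.

```python
import re

FD_MESSAGE = re.compile(
    r"Not enough file descriptors to support (?P<sessions>\d+) sessions per "
    r"process; Recommend ulimit -n (?P<ulimit>\d+) or (?P<fallback>\d+) "
    r"sessions per process"
)


def parse_fd_message(line):
    """Return the numbers in a file descriptor error line, or None."""
    match = FD_MESSAGE.search(line)
    if not match:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def needed_nofile(lines):
    """Largest 'ulimit -n' value recommended by any of the log lines."""
    values = [m["ulimit"] for m in map(parse_fd_message, lines) if m]
    return max(values) if values else None
```

For the three example messages above, the largest recommendation is 12851, which the 65536 value used in the sample limits.conf lines covers.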
If the user cannot load any Messenger Express pages or the Communications Express mail page, the problem may be that the data is getting corrupted after compression. This can sometimes happen if the system has deployed an outdated proxy server. To solve this problem, try setting local.service.http.gzip.static and local.service.http.gzip.dynamic to 0 to disable data compression. If this solves the problem, you may want to update the proxy server.
Some UNIX shells may require quotes around wildcard parameters and some will not. For example, the C shell tries to expand arguments containing wild cards (*, ?) as files and will fail if no match is found. These pattern matching arguments may need to be enclosed in quotes to be passed to commands like mboxutil.
For example:
mboxutil -l -p user/usr44*
will work in the Bourne shell, but will fail with tcsh and the C shell. These shells would require the following:
mboxutil -l -p "user/usr44*"
If a command using a wildcard pattern doesn’t work, verify whether or not you need to use quotes around wildcards for that shell.
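When such commands are generated by a script rather than typed by hand, the quoting can be applied programmatically. The sketch below uses Python's shlex.quote, which follows POSIX shell quoting rules and wraps the pattern in single quotes; the helper name is an assumption for this example.

```python
import shlex


def mboxutil_list_command(pattern):
    """Return a command line that passes the pattern to mboxutil unexpanded."""
    # shlex.quote leaves safe strings alone and single-quotes anything
    # containing shell metacharacters such as * or ?.
    return "mboxutil -l -p " + shlex.quote(pattern)
```

Running the utility from a script with an argument list (for example, through subprocess.run without a shell) avoids shell expansion altogether, so no quoting is needed in that case.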
A user can get the message “Unknown/invalid partition” in Messenger Express if their mailbox was moved to a new partition which was just created and Messaging Server was not refreshed or restarted. This problem only occurs on new partitions. If you now add additional user mailboxes to this new partition, you will not have to do a refresh/restart of Messaging Server.
A user mailbox problem exists when the damage to the message store is limited to a small number of users and there is no global damage to the system. The following guidelines suggest a process for identifying, analyzing, and resolving a user mailbox directory problem:
Review the log files, the error messages, or any unusual behavior that the user observes.
To keep debugging information and history, copy the entire store_root/mboxlist/ user directory to another location outside the message store.
To find the user folder that might be causing the problem, run the command reconstruct -r -n. If you are unable to find the folder using reconstruct, the folder might not exist in the folder.db.
If you are unable to find the folder using the reconstruct -r -n command, use the hashdir command to determine the location. For more information on hashdir, see 20.11.2.3 The hashdir Utility and the hashdir utility in the Messaging Server Command-line Utilities chapter of the Sun Java System Messaging Server 6.3 Administration Reference.
Once you find the folder, examine the files, check permissions, and verify the proper file sizes.
Use reconstruct -r (without the -n option) to rebuild the mailbox.
If reconstruct does not detect a problem that you observe, you can force the reconstruction of your mail folders by using the reconstruct -r -f command.
If the folder does not exist in the mboxlist directory (store_root/mboxlist), but exists in the partition directory (store_root/partition), there might be a global inconsistency. In this case, you should run the reconstruct -m command.
If the previous steps do not work, you can remove the store.idx file and run the reconstruct command again.
You should only remove the store.idx file if you are sure there is a problem in the file that the reconstruct command is unable to find.
If the issue is limited to a problematic message, you should copy the message file to another location outside of the message store and run the command reconstruct -r on the mailbox/ directory.
If you determine the folder exists on the disk (store_root/partition/ directory), but is apparently not in the database (store_root/mboxlist/ directory), run the command reconstruct -m to ensure message store consistency.
For more information on the reconstruct command, see 20.14.3 Repairing Mailboxes and the Mailboxes Database
If stored won’t start and returns the following error message:
# msg-svr-base/sbin/start-msg
msg-svr-base: Starting STORE daemon ...Fatal error: Cannot find group in name service
This indicates that the UNIX group configured in local.servergid cannot be found. stored and other processes need to set their gid to that group. Sometimes the group defined by local.servergid gets inadvertently deleted. In this case, re-create the deleted group, add mailsrv to the group, and change ownership of the instance_root and its files to mailsrv and the group.
The message store has a hard limit of two gigabytes for a store.idx file, which is equivalent to about one million messages in a single mailbox (folder). If a mailbox grows to the point that the store.idx file will attempt to exceed two gigabytes, the user will stop receiving any new email. In addition, other processes that handle that mailbox, such as imapd, popd, mshttpd, could also experience degraded performance.
If this problem arises, you will see errors in mail.log_current such as this:
05-Oct-2005 16:09:09.63 ims-ms Q 7 ... System I/O error. Administrator, check server log for details. System I/O error.
In addition, the MTA log file will have errors such as this:
[05/Oct/2005:16:09:09 +0900] jmail ims_master[20745]: Store Error: Unable to append cache for user/admin: File too large
You can determine this problem conclusively by looking at the file in the user's message store directory, or by looking in the imta log file to see a more detailed message.
The immediate action is to reduce the size of the file. Either delete some mail, or move some of it to another mailbox. You could also use mboxutil -r to rename the folder out of the way, or mboxutil -d to delete the folder (see 20.11.2.1 The mboxutil Utility).
Long-term, you will need to inform the user of mailbox size limitations and keep the mailbox size under control: implement an aging policy (see 20.9 To Set the Automatic Message Removal (Expire and Purge) Feature), implement a quota policy (see 20.8 About Message Store Quotas), set a mailbox limit by setting local.store.maxmessages (see configutil Parameters in Sun Java System Messaging Server 6.3 Administration Reference), or set up an archiving system.
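Mailboxes that are approaching the store.idx limit can be found before delivery starts failing by scanning a partition directory for large index files. The following Python sketch is a minimal example: it treats two gigabytes as 2 x 1024^3 bytes and warns at 80 percent of that size, both of which are assumptions made for illustration.

```python
import os

TWO_GIB = 2 * 1024 ** 3  # assumed byte value of the two-gigabyte limit


def large_index_files(partition_dir, warn_fraction=0.8, limit=TWO_GIB):
    """Yield (path, size) for store.idx files at or above the warning size."""
    threshold = warn_fraction * limit
    for root, _dirs, files in os.walk(partition_dir):
        if "store.idx" in files:
            path = os.path.join(root, "store.idx")
            size = os.path.getsize(path)
            if size >= threshold:
                yield path, size
```

Walking a large partition reads every directory, so a scan like this is better scheduled for off-peak hours.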
Symptom: After working fine for a short period of time, many IMAP events become unreasonably slow, with some events taking over a second.
Diagnosis: You have the Event Notification Service (ENS) plugin, libibiff, configured, but ENS is not running or not reachable. See Appendix B, Administering Event Notification Service in Messaging Server for ENS details.
Solution: If you want ENS notifications, make sure the ENS is enabled and configured correctly. If you do not want ENS notifications, make sure that libibiff is not being loaded. Typical bad configuration:
local.store.notifyplugin = /opt/sun/comms/messaging/lib/libibiff
local.ens.enable = 0
Use either of the following configurations to solve the problem:
local.store.notifyplugin =
local.ens.enable = 0
or
local.store.notifyplugin = /opt/sun/comms/messaging/lib/libibiff
local.ens.enable = 1
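The mismatch behind this symptom can be detected mechanically once the two settings have been read into a dictionary. The Python sketch below is an illustration only: how the values are obtained is left open, and treating a missing key as empty or 0 is an assumption of the example, not a statement about product defaults.

```python
def ens_config_problem(config):
    """Describe the libibiff/ENS mismatch, or return None if consistent."""
    # Assumption: a missing key is treated as an empty plugin or ENS off.
    plugin = str(config.get("local.store.notifyplugin", "")).strip()
    ens_enabled = str(config.get("local.ens.enable", "0")).strip() == "1"
    if "libibiff" in plugin and not ens_enabled:
        return "libibiff is loaded but ENS is disabled"
    return None
```

The typical bad configuration shown above is reported as a problem, while both solution configurations return None.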