Handling Common Database Failures (Sun Java System Calendar Server 6 2005Q4 Administration Guide)

Handling Common Database Failures

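This section covers a few common database failures and suggests remedies. It contains the following topics:

- csadmind Won’t Start or Crashes During Startup
- Services Hung, and End Users Can’t Connect–Orphaned Locks
- csdb rebuild Never Finishes–Database Looping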

csadmind Won’t Start or Crashes During Startup

Because csadmind is the service that handles both the group scheduling engine (GSE) and the alarm dispatch engine, this failure can be caused by offending entries in the GSE queue or the alarm queue.

Remedies:

If csadmind is not running, issue stop-cal immediately.

Leaving Calendar Server running could cause transaction logs to accumulate, which could further corrupt the database and make it take much longer to reconcile the transaction log files with the database.

Try restarting csadmind by issuing start-cal again.
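The stop-and-retry steps above can be sketched as a small shell function. This is a minimal sketch, not a supported procedure: it assumes stop-cal lives in the same directory as start-cal (cal_svr_base/SUNWics5/cal/sbin), that pgrep is available on the host, and the names CAL_SBIN, csadmind_running, and recover_csadmind are hypothetical.

```shell
# Hypothetical variable: directory holding start-cal (and, by assumption, stop-cal).
CAL_SBIN="${CAL_SBIN:-cal_svr_base/SUNWics5/cal/sbin}"

# Succeeds if a process named exactly csadmind exists (assumes pgrep is installed).
csadmind_running() {
  pgrep -x csadmind >/dev/null 2>&1
}

# If csadmind is down, stop all services right away, then try starting them again.
recover_csadmind() {
  if csadmind_running; then
    echo "csadmind is running; nothing to do"
    return 0
  fi
  # The exit status of stop-cal is ignored here because its behavior when
  # some services are already down is not described in this section.
  ( cd "$CAL_SBIN" && ./stop-cal )
  ( cd "$CAL_SBIN" && ./start-cal )
}
```

Calling recover_csadmind returns the exit status of start-cal, so the queue checks that follow can be gated on it.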

If it starts successfully, make sure the two queues are functioning by:
1. Checking the GSE queue using csschedule.
2. Checking the alarm queue using dbrig.
  
  For instructions on running csschedule and dbrig, see Appendix D, Calendar Server Command-Line Utilities Reference.

If csadmind crashes with a core dump, analyze the pstack output.

If you notice any GSE-related functions in the trace (they have the letters GSE in their names), look at the first entry in the GSE queue and the entry it references in the events database. Most of the time, the event referred to in the GSE entry is the offending entry. To fix this problem:
1. Remove the GSE entry using csschedule.
2. Remove the offending event from the database using cscomponents.
  
  For instructions on running csschedule and cscomponents, see Appendix D, Calendar Server Command-Line Utilities Reference.
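As a rough sketch of the trace check described above, the following shell function scans a saved stack trace for frames containing the letters GSE. It assumes the pstack output was already saved to a text file; the function name gse_frames and the file name in the usage note are hypothetical.

```shell
# Print the lines of a saved stack trace that mention GSE.
# $1: path to a text file holding pstack output.
# Exit status is 0 if at least one GSE frame is found, nonzero otherwise.
gse_frames() {
  grep 'GSE' "$1"
}
```

For example, `gse_frames csadmind.pstack || echo "no GSE frames found"` prints the matching frames, or the message if there are none.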

If the entries are not corrupted, the failure might be a special case that Calendar Server cannot handle.

Take the following steps:
1. Take a calendar environment snapshot of the corrupted database, and contact customer support.
  
  To create an environmental backup:
  1. Use the db_checkpoint utility found at:
    
    cal_svr_base/SUNWics5/cal/tools/unsupported/bin/db_checkpoint
  2. Run db_archive -s.
    
    Use the -s option to identify all the database files, and copy them to a removable medium, such as a CD, DVD, or tape.
  3. Run db_archive -l.
    
    Use the -l option to identify all the log files, and copy the unapplied log files to a removable medium.
2. To avoid service interruptions, place your calendar database into a read-only state temporarily, and revert to a hot backup copy.
  - Placing your calendar database into a read-only state temporarily prevents any add, modify or delete transactions from taking place. End users will get an error message when they try to add, modify or delete any calendar data. Administrator tools that add, modify or delete calendar events and todos also will not work while the database is in read-only mode.
    
    To put your calendar database in read-only mode, edit the ics.conf file and set the following parameter to "yes", as shown:
    
    caldb.berkeleydb.readonly="yes"
  - Revert to a hot backup copy, using the instructions found in Restoring an Automatic Backup Copy.
    
    With csstored configured and enabled, a hot backup is available that should be within minutes of being up to date. Always verify your hot backup copy (run db_verify) to make sure that it is not corrupt as well.
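The snapshot and verification steps above can be sketched in shell as follows. This is an illustration under stated assumptions, not the supported procedure: it assumes db_archive and db_verify sit in the same directory as db_checkpoint, that db_checkpoint is invoked with the standard Berkeley DB -1 (checkpoint once) and -h (database home) options, and that db_archive prints names relative to the database home. The names TOOLS_BIN, snapshot_caldb, and verify_copies are hypothetical, and the sketch copies every log file that db_archive -l reports rather than selecting only unapplied ones.

```shell
# Hypothetical variable: directory holding the Berkeley DB utilities.
TOOLS_BIN="${TOOLS_BIN:-cal_svr_base/SUNWics5/cal/tools/unsupported/bin}"

# snapshot_caldb DB_HOME DEST
# Checkpoint the database, then copy its database and log files to DEST.
snapshot_caldb() {
  db_home=$1
  dest=$2
  mkdir -p "$dest" || return 1
  # Force a single checkpoint before collecting files.
  "$TOOLS_BIN/db_checkpoint" -1 -h "$db_home" || return 1
  # -s lists the database files; -l lists all the log files.
  for opt in -s -l; do
    for f in $("$TOOLS_BIN/db_archive" "$opt" -h "$db_home"); do
      cp "$db_home/$f" "$dest/" || return 1
    done
  done
}

# verify_copies DIR
# Run db_verify on every .db file in DIR; stop at the first failure.
verify_copies() {
  for f in "$1"/*.db; do
    [ -f "$f" ] || continue
    "$TOOLS_BIN/db_verify" "$f" || return 1
  done
}
```

DEST would typically be a staging directory that is then written to removable media, and verify_copies can be pointed at a hot backup directory before reverting to it.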

If all else fails, perform the dump and reload procedure to see whether it can salvage the database.

This procedure is described in Using the Dump and Load Procedure to Recover a Calendar Database.

Services Hung, and End Users Can’t Connect–Orphaned Locks

This condition can occur when a control thread that holds a Berkeley DB database page lock quits without releasing the lock. To confirm the problem, run pstack on the cshttpd processes and on csadmind. (pstack is a standard UNIX utility found at /usr/bin/pstack.) The output should show threads that are waiting to acquire a lock.
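One way to collect the traces described above is sketched below. It assumes pgrep is available to look up process IDs and that pstack accepts a process ID as its argument; the names PSTACK and dump_calendar_stacks and the output file naming are hypothetical. The sketch only saves the traces; finding the threads that wait on a lock is left to manual review.

```shell
# Hypothetical variable: location of pstack (the path given in the text).
PSTACK="${PSTACK:-/usr/bin/pstack}"

# dump_calendar_stacks OUT_DIR
# Save one pstack trace per cshttpd and csadmind process into OUT_DIR.
dump_calendar_stacks() {
  out_dir=$1
  mkdir -p "$out_dir" || return 1
  for name in cshttpd csadmind; do
    for pid in $(pgrep -x "$name"); do
      "$PSTACK" "$pid" > "$out_dir/$name.$pid.pstack" 2>&1
    done
  done
}
```

For example, `dump_calendar_stacks /tmp/cal-stacks` writes files such as cshttpd.PID.pstack that can be inspected or sent to support.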

To fix the problem, restart Calendar Server, as follows:

Change to the directory where start-cal resides.

cd cal_svr_base/SUNWics5/cal/sbin

Issue the start-cal command.

./start-cal

csdb rebuild Never Finishes–Database Looping

Database looping is usually caused by corruption in the database files. Because the database is corrupted, it might be unrecoverable. There are several options:

Revert to the hot backup.

If the corruption occurred recently, you can use one of your hot backups.

Use your catastrophe archival recovery process.

For a suggested process, see Restoring an Automatic Backup Copy.

Use the dump and reload procedure described in Using the Dump and Load Procedure to Recover a Calendar Database.