HADB achieves fault tolerance by replicating data on mirror nodes. In a production environment, a mirror node is on a separate Data Redundancy Unit (DRU) from the node it mirrors, as described in the Sun Java System Application Server Enterprise Edition 8.2 Deployment Planning Guide.
A failure is an unexpected event such as a hardware failure, power failure, or operating system reboot. HADB tolerates any single failure: the failure of one node, of one machine (provided that the machine hosts no mirror node pair), of one or more machines belonging to the same DRU, or even of one entire DRU. However, HADB does not automatically recover from a double failure, which is the simultaneous failure of one or more mirror node pairs. If a double failure occurs, you must clear HADB and recreate its session store, which erases all its data.
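The difference between a tolerated single failure and a double failure can be sketched in a few lines of Python. This is an illustration only, not part of HADB: the function names and the node numbering are hypothetical, and the mirror pairing would have to come from your own deployment records.

```python
from typing import List, Sequence, Set, Tuple


def failed_mirror_pairs(
    mirror_pairs: Sequence[Tuple[int, int]],
    failed_nodes: Set[int],
) -> List[Tuple[int, int]]:
    """Return every mirror pair in which BOTH nodes have failed."""
    return [
        (a, b)
        for a, b in mirror_pairs
        if a in failed_nodes and b in failed_nodes
    ]


def is_double_failure(
    mirror_pairs: Sequence[Tuple[int, int]],
    failed_nodes: Set[int],
) -> bool:
    """A double failure means at least one mirror pair is completely down."""
    return bool(failed_mirror_pairs(mirror_pairs, failed_nodes))


# Hypothetical layout: nodes 0 and 2 are in one DRU, nodes 1 and 3 in the other.
# Each tuple is (node, its mirror).
PAIRS = [(0, 1), (2, 3)]
```

With this layout, losing nodes 0 and 2 (one whole DRU) leaves one node of every pair alive, so it is not a double failure, whereas losing nodes 0 and 1 takes out a complete pair.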
There are different maintenance procedures, depending on whether you need to work on a single machine or multiple machines.
The single-machine procedure applies to both planned and unplanned maintenance, and does not interrupt HADB availability.
Perform the maintenance procedure and get the machine up and running.
Ensure that ma is running.
If ma runs as a Windows service or under init.d scripts (recommended for deployment), it should have been started by the operating system. If not, start it manually. See Starting the Management Agent.
Start all nodes on the machine.
For more information, see Starting a Node.
Check whether the nodes are active and running.
For more information, see Getting the Status of HADB.
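The last two steps of the single-machine procedure can be sketched as a list of commands to run once the machine and ma are back up. The sketch assumes that nodes are started with hadbm startnode and inspected with hadbm status --nodes, as covered in Starting a Node and Getting the Status of HADB; the helper name and the node numbers are hypothetical, and password and agent options are left out.

```python
from typing import List, Sequence


def single_machine_restart_commands(
    node_numbers: Sequence[int],
    dbname: str = "hadb",
) -> List[List[str]]:
    """Build (but do not run) the commands that bring a machine's nodes back.

    One 'hadbm startnode' per node on the machine, followed by a single
    'hadbm status --nodes' to check that the nodes are active and running.
    """
    commands = [["hadbm", "startnode", str(n), dbname] for n in node_numbers]
    commands.append(["hadbm", "status", "--nodes", dbname])
    return commands
```

Returning argument lists instead of executing them keeps the sketch safe to try; each list could later be passed to subprocess.run on a host where hadbm is installed.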
Planned maintenance includes operations such as hardware and software upgrades. This procedure does not interrupt HADB availability.
For each spare machine in the first DRU, one machine at a time, perform the single-machine procedure described in To perform maintenance on a single machine.
For each active machine in the first DRU, one machine at a time, perform the single-machine procedure described in To perform maintenance on a single machine.
Repeat step 1 and step 2 for the second DRU.
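The order in which machines are visited can be sketched as follows: one DRU at a time, spares before active machines, one machine after another. The data layout and the host names in the usage are hypothetical; HADB does not provide this function.

```python
from typing import Dict, List


def rolling_maintenance_order(drus: List[Dict[str, List[str]]]) -> List[str]:
    """Return the machines in the order they should be maintained.

    Each DRU is a dict with two keys (an assumed layout for this sketch):
      "spare"  - machines hosting spare nodes
      "active" - machines hosting active nodes
    Within a DRU, spare machines come first, then active machines.
    A DRU is finished completely before the next one is started.
    """
    order: List[str] = []
    for dru in drus:
        order.extend(dru["spare"])
        order.extend(dru["active"])
    return order
```

Each machine in the returned list is then taken through the single-machine procedure before the next machine is touched.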
This procedure applies whether HADB runs on a single machine or on multiple machines. It interrupts HADB service during the maintenance procedure.
Stop HADB. See Stopping a Database.
Perform the maintenance procedure and get all the machines up and running.
Ensure that ma is running.
Start HADB.
For more information, see Starting a Database.
After you complete the last step, HADB data becomes available again.
Check the database state.
See Getting the Status of HADB.
If the database state is Operational or better:
The machines needing unplanned maintenance do not include both nodes of any mirror node pair. Follow the single machine procedure for each failed machine, one DRU at a time. HADB service is not interrupted.
If the database state is Non-Operational:
The machines needing unplanned maintenance include both nodes of at least one mirror node pair. One such case is when the entire HADB is on a single failed machine. Get all the machines up and running first. Then clear HADB and recreate the session store. See Clearing a Database. This interrupts HADB service.
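The choice between the two unplanned-maintenance paths can be sketched as a lookup on the database state. The state names and their ordering below are an assumption about what the status check reports; this section itself names only Operational and Non-Operational, so verify the spelling against Getting the Status of HADB before relying on it.

```python
# Assumed database states, best first. Only "Operational" and
# "Non-Operational" are named in this section; the rest are assumptions.
STATES_BEST_FIRST = [
    "HAFaultTolerant",
    "FaultTolerant",
    "Operational",
    "NonOperational",
    "Stopped",
    "Unknown",
]


def unplanned_maintenance_plan(state: str) -> str:
    """Map a database state to the recovery path described above."""
    if state not in STATES_BEST_FIRST:
        raise ValueError(f"unrecognized database state: {state!r}")
    rank = STATES_BEST_FIRST.index(state)
    if rank <= STATES_BEST_FIRST.index("Operational"):
        # Operational or better: no mirror pair is completely lost.
        return "single-machine procedure for each failed machine, one DRU at a time"
    if state == "NonOperational":
        # A double failure: data cannot be recovered automatically.
        return "restore all machines, then clear HADB and recreate the session store"
    return "state not covered by this procedure"
```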
HADB history files record all database operations and error messages. HADB appends to the end of existing history files, so the files grow over time. To save disk space and prevent files from getting too large, periodically clear and archive history files.
To clear a database’s history files, use the hadbm clearhistory command.
The command syntax is:
hadbm clearhistory [--saveto=path] [dbname] [--adminpassword=password | --adminpasswordfile=file] [--agent=maurl]
The dbname operand specifies the database name. The default is hadb.
Use the --saveto option (short form -o) to specify the directory in which to store the old history files. This directory must have appropriate write permissions. See General Options for a description of other command options.
For more information, see hadbm-clearhistory(1).
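As an illustration of the syntax above, the following Python sketch assembles a hadbm clearhistory command line from optional settings. It uses only the options shown in the syntax; the function name is hypothetical, and the command is built, not executed.

```python
from typing import List, Optional


def clearhistory_command(
    dbname: Optional[str] = None,
    saveto: Optional[str] = None,
    adminpasswordfile: Optional[str] = None,
    agent: Optional[str] = None,
) -> List[str]:
    """Build the argument list for 'hadbm clearhistory'.

    Every argument is optional; when dbname is omitted, hadbm falls back
    to the default database name (hadb).
    """
    cmd = ["hadbm", "clearhistory"]
    if saveto:
        cmd.append(f"--saveto={saveto}")  # directory for the old history files
    if adminpasswordfile:
        cmd.append(f"--adminpasswordfile={adminpasswordfile}")
    if agent:
        cmd.append(f"--agent={agent}")
    if dbname:
        cmd.append(dbname)  # operand goes after the options
    return cmd
```

The sketch offers only --adminpasswordfile, not --adminpassword, so that no password ends up in the argument list; that is a design choice of the sketch, not a restriction of hadbm.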
The --historypath option of the hadbm create command determines the location of the history files. The names of the history files are of the format dbname.out.nodeno. For information on hadbm create, see Creating a Database.
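The naming scheme can be illustrated with a small Python sketch that builds and parses history file names of the form dbname.out.nodeno. The helper names are hypothetical, and the sketch assumes that nodeno is written as a plain decimal number.

```python
from typing import Tuple


def history_file_name(dbname: str, nodeno: int) -> str:
    """Build the history file name for one node of a database."""
    return f"{dbname}.out.{nodeno}"


def parse_history_file_name(name: str) -> Tuple[str, int]:
    """Split 'dbname.out.nodeno' into (dbname, nodeno).

    Splits on the LAST '.out.' so that a database name which itself
    contains '.out.' is still handled.
    """
    dbname, sep, nodeno = name.rpartition(".out.")
    if not sep or not dbname or not nodeno.isdigit():
        raise ValueError(f"not a history file name: {name!r}")
    return dbname, int(nodeno)
```

For the default database name, node 0 would therefore write to a file called hadb.out.0 in the history path.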
Each message in the history file contains the following information:
The abbreviated name of the HADB process that produced the message.
The type of message:
INF - general information
WRN - warnings
ERR - errors
DBG - debug information
A timestamp. The time is obtained from the host machine system clock.
The service set changes occurring in the system when a node stops or starts.
Messages about resource shortages contain the string “HIGH LOAD”.
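A history file can be given a quick first pass with a short script, sketched below in Python. The exact layout of a message line is not specified here, so the sketch makes one loud assumption: the type code (INF, WRN, ERR, or DBG) appears as a separate whitespace-delimited word on the line. The search for HIGH LOAD relies only on the string named above. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

MESSAGE_TYPES = ("INF", "WRN", "ERR", "DBG")


def high_load_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that report a resource shortage."""
    return [line.rstrip("\n") for line in lines if "HIGH LOAD" in line]


def count_message_types(lines: Iterable[str]) -> Counter:
    """Count lines per message type.

    ASSUMPTION: the type code is a standalone word on the line.
    Only the first matching word on each line is counted.
    """
    counts: Counter = Counter()
    for line in lines:
        for token in line.split():
            if token in MESSAGE_TYPES:
                counts[token] += 1
                break
    return counts
```

Both functions accept any iterable of lines, so an open history file can be passed in directly.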
You do not need a detailed knowledge of all entries in the history file. If for any reason you need to study a history file in greater detail, contact Sun customer support.