This chapter provides conceptual information about state database replicas. For information about performing related tasks, see Chapter 6, State Database (Tasks).
This chapter contains the following information:
The Solaris Volume Manager state database contains configuration and status information for all volumes, hot spares, and disk sets. Solaris Volume Manager maintains multiple copies (replicas) of the state database to provide redundancy and to prevent the database from being corrupted during a system crash (at most, only one database copy will be corrupted).
The state database replicas ensure that the data in the state database is always valid. When the state database is updated, each state database replica is also updated. The updates take place one at a time (to protect against corrupting all updates if the system crashes).
If your system loses a state database replica, Solaris Volume Manager must determine which state database replicas still contain valid data. It does so by using a majority consensus algorithm. This algorithm requires that a majority (half + 1) of the state database replicas be available and in agreement before any of them are considered valid. Because of this majority consensus algorithm, you must create at least three state database replicas when you set up your disk configuration. A consensus can then be reached as long as at least two of the three state database replicas are available.
During booting, Solaris Volume Manager ignores corrupted state database replicas. In some cases, Solaris Volume Manager tries to rewrite state database replicas that are corrupted. Otherwise, they are ignored until you repair them. If a state database replica becomes corrupted because its underlying slice encountered an error, you will need to repair or replace the slice and then enable the replica.
If all state database replicas are lost, you could, in theory, lose all data that is stored on your Solaris Volume Manager volumes. For this reason, it is good practice to create enough state database replicas on separate drives and across controllers to prevent catastrophic failure. It is also wise to save your initial Solaris Volume Manager configuration information, as well as your disk partition information.
See Chapter 6, State Database (Tasks) for information on adding additional state database replicas to the system, and on recovering when state database replicas are lost.
State database replicas are also used for RAID 1 volume resynchronization regions. Too few state database replicas relative to the number of mirrors can cause replica I/O to degrade RAID 1 volume performance. If you have a large number of mirrors, make sure that you have a total of at least two state database replicas per RAID 1 volume, up to the maximum of 50 replicas per disk set.
Each state database replica occupies 4 Mbytes (8192 disk sectors) of disk storage by default. Replicas can be stored on the following devices:
a dedicated disk partition
a partition that will be part of a volume
a partition that will be part of a UFS logging device
Replicas cannot be stored on the root (/), swap, or /usr slices, or on slices that contain existing file systems or data. After the replicas have been stored, volumes or file systems can be placed on the same slice.
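The default replica size given above can be checked with simple arithmetic. The sketch below assumes the standard 512-byte disk sector; the names are illustrative only:

```python
# Verify that 8192 disk sectors at 512 bytes each equal the
# 4-Mbyte default replica size (assumption: 512-byte sectors).
SECTOR_BYTES = 512
REPLICA_SECTORS = 8192

replica_bytes = REPLICA_SECTORS * SECTOR_BYTES
print(replica_bytes // (1024 * 1024))  # → 4 (Mbytes)
```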
Replicated databases have an inherent problem in determining which database has valid and correct data. To solve this problem, Solaris Volume Manager uses a majority consensus algorithm. This algorithm requires that a majority of the database replicas agree with each other before any of them are declared valid, which is why you must create at least three initial replicas. A consensus can then be reached as long as at least two of the three replicas are available. If there is only one replica and the system crashes, it is possible that all volume configuration data will be lost.
To protect data, Solaris Volume Manager will not function unless at least half of all state database replicas are available. The algorithm therefore guards against the use of corrupt data.
The majority consensus algorithm provides the following:
The system will stay running if at least half of the state database replicas are available.
The system will panic if fewer than half of the state database replicas are available.
The system will not reboot into multiuser mode unless a majority (half + 1) of the total number of state database replicas is available.
If insufficient state database replicas are available, you will have to boot into single-user mode and delete enough of the bad or missing replicas to achieve a quorum. See How to Recover From Insufficient State Database Replicas.
When the number of state database replicas is odd, Solaris Volume Manager computes the majority by dividing the number in half, rounding down to the nearest integer, then adding 1 (one). For example, on a system with seven replicas, the majority would be four (seven divided by two is three and one-half, rounded down is three, plus one is four).
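The majority computation and the availability rules above can be sketched in a few lines of Python. The function names are illustrative only and are not part of any Solaris Volume Manager interface:

```python
def majority(total_replicas):
    """Majority is half the total, rounded down, plus one."""
    return total_replicas // 2 + 1

def system_state(total, available):
    """Illustrative summary of the consensus rules: panic below half,
    run at half or more, allow a multiuser boot only with a majority."""
    if available * 2 < total:           # fewer than half → panic
        return "panic"
    if available >= majority(total):    # half + 1 → multiuser boot allowed
        return "runs; multiuser boot allowed"
    return "runs; multiuser boot not allowed"

print(majority(7))         # → 4, as in the seven-replica example above
print(system_state(4, 2))  # → runs; multiuser boot not allowed
```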
In general, it is best to distribute state database replicas across slices, drives, and controllers, to avoid single points-of-failure. You want a majority of replicas to survive a single component failure. If you lose a replica (for example, due to a device failure), it might cause problems with running Solaris Volume Manager or when rebooting the system. Solaris Volume Manager requires at least half of the replicas to be available to run, but a majority (half plus one) to reboot into multiuser mode.
When you work with state database replicas, consider the following Recommendations for State Database Replicas and Guidelines for State Database Replicas.
You should create state database replicas on a dedicated slice of at least 4 Mbytes per replica. If necessary, you could create state database replicas on a slice that will be used as part of a RAID 0, RAID 1, or RAID 5 volume, soft partitions, or transactional (master or log) volumes. You must create the replicas before you add the slice to the volume. Solaris Volume Manager reserves the starting part of the slice for the state database replica.
You can create state database replicas on slices that are not in use.
You cannot create state database replicas on existing file systems, or on the root (/), /usr, and swap file systems. If necessary, you can create a new slice (provided a slice name is available) by allocating space from swap, and then place state database replicas on that new slice.
A minimum of 3 state database replicas is recommended, up to a maximum of 50 replicas per Solaris Volume Manager disk set. The following guidelines are recommended:
For a system with only a single drive: put all three replicas in one slice.
For a system with two to four drives: put two replicas on each drive.
For a system with five or more drives: put one replica on each drive.
If you have a RAID 1 volume that will be used for small random I/O (as for a database), be sure that you have at least two extra replicas per RAID 1 volume on slices (and, preferably, on disks and controllers) that are unconnected to the RAID 1 volume, for best performance.
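The drive-count guidelines above can be encoded as a small helper. This is a sketch for illustration only; the function name and return shape are assumptions, not part of any Solaris Volume Manager tool:

```python
def replicas_per_drive(drive_count):
    """Illustrative encoding of the placement guidelines:
    one drive -> all three replicas in one slice,
    two to four drives -> two replicas on each drive,
    five or more drives -> one replica on each drive."""
    if drive_count == 1:
        return {"drives": 1, "replicas_each": 3}
    if 2 <= drive_count <= 4:
        return {"drives": drive_count, "replicas_each": 2}
    return {"drives": drive_count, "replicas_each": 1}

for n in (1, 3, 6):
    layout = replicas_per_drive(n)
    print(n, "drive(s):", layout["replicas_each"], "replica(s) per drive")
```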
You can add additional state database replicas to the system at any time. The additional state database replicas help ensure Solaris Volume Manager availability.
If you upgraded from Solstice DiskSuite to Solaris Volume Manager and you have state database replicas sharing slices with file systems or logical volumes (as opposed to on separate slices), do not delete the existing replicas and replace them with new replicas in the same location.
The default state database replica size in Solaris Volume Manager is 8192 blocks, while the default size in Solstice DiskSuite was 1034 blocks. If you delete a default-sized state database replica from Solstice DiskSuite, then add a new default-sized replica with Solaris Volume Manager, you will overwrite the first 7158 blocks of any file system that occupies the rest of the shared slice, thus destroying the data.
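The overwrite figure above follows directly from the two default sizes. A quick worked check:

```python
# Replica sizes in disk blocks, as given above.
DISKSUITE_REPLICA_BLOCKS = 1034   # Solstice DiskSuite default
SVM_REPLICA_BLOCKS = 8192         # Solaris Volume Manager default

# Blocks of a co-resident file system that a new default-sized
# replica would overwrite beyond the old replica's footprint.
overwritten = SVM_REPLICA_BLOCKS - DISKSUITE_REPLICA_BLOCKS
print(overwritten)  # → 7158
```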
When a state database replica is placed on a slice that becomes part of a volume, the capacity of the volume is reduced by the space that is occupied by the replica(s). The space used by a replica is rounded up to the next cylinder boundary and this space is skipped by the volume.
By default, the size of a state database replica is 4 Mbytes or 8192 disk blocks. Because your disk slices might not be that small, you might want to resize a slice to hold the state database replica. For information on resizing a slice, see “Administering Disks (Tasks)” in System Administration Guide: Basic Administration.
If multiple controllers exist, replicas should be distributed as evenly as possible across all controllers. This strategy provides redundancy in case a controller fails and also helps balance the load. If multiple disks exist on a controller, at least two of the disks on each controller should store a replica.
The system will continue to run if at least half of the replicas are available. The system will panic when fewer than half of the replicas are available.
The system can reboot into multiuser mode when at least one more than half of the replicas are available. If fewer than a majority of replicas are available, you must boot into single-user mode and delete the unavailable replicas (by using the metadb command).
For example, assume you have four replicas. The system will stay running as long as two replicas (half the total number) are available. However, to reboot the system, three replicas (half the total plus one) must be available.
In a two-disk configuration, you should always create at least two replicas on each disk. For example, assume you have a configuration with two disks, and you only create three replicas (two replicas on the first disk and one replica on the second disk). If the disk with two replicas fails, the system will panic because the remaining disk only has one replica and this is less than half the total number of replicas.
If you create two replicas on each disk in a two-disk configuration, Solaris Volume Manager will still function if one disk fails. But because one more than half of the total replicas must be available for the system to reboot, you will be unable to reboot into multiuser mode.
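The two-disk scenario above can be checked numerically. This is a sketch of the arithmetic only, not of any Solaris Volume Manager interface:

```python
total = 4       # two replicas on each of two disks
available = 2   # one disk (two replicas) has failed

keeps_running = available * 2 >= total        # at least half available
multiuser_boot = available >= total // 2 + 1  # strict majority needed

print(keeps_running)   # → True: the system stays up
print(multiuser_boot)  # → False: cannot reboot into multiuser mode
```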
The rest of your configuration should remain in operation. Solaris Volume Manager finds a valid state database during boot (as long as there are at least half plus one valid state database replicas).
When you manually repair or enable state database replicas, Solaris Volume Manager updates them with valid data.
State database replicas provide redundant data about the overall Solaris Volume Manager configuration. The following example, drawing on the sample system provided in Chapter 4, Configuring and Using Solaris Volume Manager (Scenario), describes how state database replicas can be distributed to provide adequate redundancy.
The sample system has one internal IDE controller and drive, plus two SCSI controllers, which each have six disks attached. With three controllers, the system can be configured to avoid any single point-of-failure. Any system with only two controllers cannot avoid a single point-of-failure relative to Solaris Volume Manager. By distributing replicas evenly across all three controllers and across at least one disk on each controller (across two disks if possible), the system can withstand any single hardware failure.
A minimal configuration could put a single state database replica on slice 7 of the root disk, then an additional replica on slice 7 of one disk on each of the other two controllers. To help protect against the admittedly remote possibility of media failure, using two replicas on the root disk and then two replicas on two different disks on each controller, for a total of six replicas, provides more than adequate security.
To round out the total, add 2 additional replicas for each of the 6 mirrors, on different disks than the mirrors. This configuration results in a total of 18 replicas with 2 on the root disk and 8 on each of the SCSI controllers, distributed across the disks on each controller.