This chapter provides information and procedures for planning your Sun Cluster configuration.
Configuration planning includes making decisions about:
The administrative workstation
Cluster-specific names and naming conventions
Network connections
Volume management
The Solaris operating environment
Multihost disk requirements
File system layout on the multihost disks
Logical host configuration (HA configurations only)
Cluster Configuration Database (CCD)
Quorum device (VxVM only)
Data migration strategy
Multihost backup strategy
Before you develop your configuration plan, consider the reliability issues described in "Configuration Rules for Improved Reliability". Also, the Sun Cluster environment imposes some configuration restrictions that you should consider before completing your configuration plan. These are described in "Configuration Restrictions".
"Configuration Worksheets" provides worksheets to help you plan your configuration.
The following sections describe the tasks and issues associated with planning your configuration. You are not required to perform the tasks in the order shown here, but you should address each task as part of your configuration plan.
You must decide whether to use a dedicated SPARC workstation, known as the administrative workstation, for administering the active cluster. The administrative workstation is not a cluster node. It can be any SPARC machine capable of running a telnet session to the Terminal Concentrator to facilitate console logins. On E10000 platforms, the administrative workstation instead must be able to log into the System Service Processor (SSP) and connect using the netcon command.
Sun Cluster does not require a dedicated administrative workstation, but using one provides these advantages:
Enables centralized cluster management by grouping console and management tools on the same machine
Provides potentially quicker problem resolution by Enterprise Services
You can use a cluster node as the administrative workstation. This entails installing that node as both "client" and "server."
Before configuring the cluster, you must decide on names for the following:
The cluster itself
Physical hosts
Logical hosts
Disk groups
Network interfaces
The network interface names (and associated IP addresses) are necessary for each logical host on each public network. Although you are not required to use a particular naming convention, the following naming conventions are used throughout the documentation and are recommended. Use the configuration worksheets included in "Configuration Worksheets".
Cluster - As part of the configuration process, you will be prompted for the name of the cluster. You can choose any name; there are no restrictions imposed by Sun Cluster.
Physical Hosts - Physical host names are created by adding the prefix phys- to the logical host names (for physical hosts that master only one logical host each). For example, the physical host that masters a logical host named hahost1 would be named phys-hahost1 by default. There is no Sun Cluster naming convention or default for physical hosts that master more than one logical host.
If you are using DNS as your name service, do not use an underscore in your physical or logical host names. DNS will not recognize a host name containing an underscore.
Logical Hosts and Disk Groups - Logical host names can be different from disk group names in Sun Cluster. However, using the same names is the Sun Cluster convention and eases administration. Refer to "Planning Your Logical Host Configuration", for more information.
Public Network - The names by which physical hosts are known on the public network are their primary physical host names. The names by which physical hosts are known on a secondary public network are their secondary physical host names. Assign these names using the following conventions, as illustrated in Figure 2-1:
For the primary physical host names, simply use the physical host names as described previously; for example, phys-hahost1 would be used for a physical host associated with logical host hahost1.
For the secondary physical host names, start with the physical host name and add a suffix indicating the secondary network address. For example, the connection to a secondary network with a network address 192.9.201 from physical host phys-hahost1 would be named phys-hahost1-201.
The primary physical host name should be the node name returned by uname -n.
Private Interconnect - There is no default naming convention for the private interconnect.
Naming convention examples are illustrated in Figure 2-1.
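As a sketch of these conventions, the /etc/hosts entries for a node that masters logical host hahost1 and is attached to a secondary network 192.9.201 might look like the following (all IP addresses and names here are hypothetical, chosen only to illustrate the conventions):

```
129.146.75.200   phys-hahost1        # primary physical host name (matches uname -n)
192.9.201.10     phys-hahost1-201    # secondary physical host name (network 192.9.201)
129.146.75.210   hahost1             # logical host name
```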
You must have at least one public network connection to a local area network and exactly two private interconnects between the cluster nodes. Refer to Chapter 1, Understanding the Sun Cluster Environment, for overviews of Sun Cluster network configurations, and to "Configuration Worksheets", for network planning worksheets.
Consider these points when planning your public network configuration:
You must have at least one public network that is attached to all cluster nodes. You can have as many additional public network connections as your hardware configuration allows.
In configurations providing HA data services, you must provide IP addresses and network interface names for each logical host on each public network. This can lead to many host names. Refer to "Establishing Names and Naming Conventions", for more information. You must add the logical host names to the /etc/hosts files on all cluster nodes.
Sun Cluster includes a Public Network Management (PNM) component, which enables a public network interface to fail over to another interface within a designated backup group. Refer to the chapter on administering network interfaces in the Sun Cluster 2.2 System Administration Guide for more information about PNM.
Sun Cluster requires two private networks for normal operation. You must decide whether to use 100 Mbit/sec Ethernet or 1 Gbit/sec Scalable Coherent Interface (SCI) connections for the private networks.
In two-node configurations, these networks may be implemented with point-to-point cables between the cluster nodes. In three- or four-node configurations, they are implemented using hubs or switches. Only private traffic between Sun Cluster nodes is transported on these networks.
If you connect nodes by using SCI switches, each node must be connected to the same port number on both switches. During the installation, note that the port numbers on the switches correspond to node numbers. For example, node 0 is the host physically connected to port 0 on the switch, and so on.
A class C network number (204.152.64) is reserved for private network use by the Sun Cluster nodes. The same network number is used by all Sun Cluster systems.
The Terminal Concentrator and administrative workstation connect to a public network with access to the Sun Cluster nodes. You must assign IP addresses and host names for them to enable access to the cluster nodes over the public network.
E10000 systems use a System Service Processor (SSP) instead of a Terminal Concentrator. You will need to assign the SSP a host name, IP address, and root password. You will also need to create a user named "ssp" on the SSP and provide a password for user "ssp" during Sun Cluster installation.
All nodes in a cluster must be installed with the same version of the Solaris operating environment (Solaris 2.6, Solaris 7, or Solaris 8) before you can install the Sun Cluster software. When you install Solaris on cluster nodes, follow the general rules in this section.
Install the Entire Distribution Solaris software group on all Sun Cluster nodes.
All platforms except the E10000 require at least the Entire Distribution Solaris installation. E10000 systems require the Entire Distribution + OEM.
After installing the Solaris operating environment, you must install the latest patches. For the current list of required patches for the Solaris operating environment, consult your Enterprise Services representative or service provider, or see the Sun website http://sunsolve.sun.com.
If you are upgrading from an earlier version of the Solaris operating environment:
You must use the upgrade option in the Solaris installation program (rather than reinstalling the operating environment) and be prepared to increase the size of your root (/) and /usr slices to accommodate the Solaris environment.
The upgrade option in the Solaris installation program provides the ability to reallocate disk space if the current file systems don't have enough space for the upgrade. By default, an auto-layout feature tries to determine how to reallocate the disk space so the upgrade can succeed. If auto-layout cannot determine how to reallocate disk space, you must specify which file systems can be moved or changed and run auto-layout again.
If you are installing Sun Cluster for the first time:
Set up each Sun Cluster node as a stand-alone machine. Do this in response to a question in the Solaris installation program.
Do not define an exported file system. HA-NFS file systems are not mounted on /export and only HA-NFS file systems should be NFS-shared on Sun Cluster nodes.
Disable the Solaris power management "autoshutdown" mechanism if it applies to any nodes in your Sun Cluster configuration. See the pmconfig(1M) and power.conf(4) man pages for details.
A new feature called interface groups was added to the Solaris 2.6 operating environment. This feature is implemented as default behavior in Solaris 2.6 software, but as optional behavior in subsequent releases.
As described in the ifconfig(1M) man page, logical or physical interfaces that share an IP prefix are collected into an interface group. IP uses the interface group to rotate source address selections when the source address is unspecified. An interface group made up of multiple physical interfaces is used to distribute traffic across different IP addresses on a per-IP-destination basis (see netstat(1M) for information about per-IP-destination).
When enabled, the interface groups feature causes a problem with switchover of logical hosts. The system will experience RPC timeouts and the switchover will fail, causing the logical host to remain mastered on its current host. Therefore, interface groups must be disabled on all cluster nodes. The feature is controlled by the ip module parameter ip_enable_group_ifs, which can be set permanently in the /etc/system file.
The value for this parameter can be checked with the following ndd command:
# ndd /dev/ip ip_enable_group_ifs
If the value returned is 1 (enabled), disable interface groups by editing the /etc/system file to include the following line:

set ip:ip_enable_group_ifs=0
Whenever you modify the /etc/system file, you must reboot the system for the changes to take effect.
When Solaris is installed, the system disk is partitioned into slices for root (/), /usr, and other standard file systems. You must change the partition configuration to meet the requirements of your volume manager. Use the guidelines in the following sections to allocate disk space accordingly.
See your Solaris documentation for file system sizing guidelines. Sun Cluster imposes no additional requirements for file system slices.
If you will be using Solstice DiskSuite, set aside a 10 Mbyte slice on the system disk for metadevice state database replicas. See your Solstice DiskSuite documentation for more information about replicas.
If you will be using VxVM, designate a disk for the root disk group (rootdg). See your VERITAS documentation for guidelines and details about creating the rootdg. Refer also to "VERITAS Volume Manager Considerations", for more information.
The root (/) slice on your local disk must have enough space for the various files and directories as well as space for the device inodes in /devices and symbolic links in /dev.
The root slice also must be large enough to hold the following:
Solaris system software
Sun Cluster, some components from your volume management software, and any third-party software packages
Data space for symbolic links in /dev for the disk units and for volume manager use
Sun Cluster uses various shell scripts that run as root processes. For this reason, the /.cshrc* and /.profile files for user root should be empty or non-existent on the cluster nodes.
Your cluster might require a larger root file system if it contains large numbers of disk drives.
If you run out of free space, you must reinstall the operating environment on all cluster nodes to obtain additional free space in the root slice. Make sure at least 20 percent of the total space on the root slice is left free.
The /usr slice holds the user file system. The /var slice holds the system log files. The /opt slice holds the Sun Cluster and data service software packages. See your Solaris advanced system administration documentation for details about changing the allocation values when installing Solaris software.
Sun Cluster uses volume management software to group disks into disk groups that can then be administered as one unit. Sun Cluster supports Solstice DiskSuite and VERITAS Volume Manager (VxVM). Sun Cluster supports the cluster feature of VERITAS Volume Manager for use with the Oracle Parallel Server data service.
You can use Solstice DiskSuite and VERITAS Volume Manager together only if you use Solstice DiskSuite to manage the local disks and VERITAS Volume Manager to control the multihost disks. In such a configuration, plan your physical disk needs accordingly. You might need additional disks to use for the VERITAS Volume Manager root disk group, for example. See your volume manager documentation for more information.
You must install the volume management software after you install the Solaris operating environment. You can install the volume management software either before or after you install Sun Cluster software. Refer to your volume manager software documentation and to Chapter 3, Installing and Configuring Sun Cluster Software, for instructions on installing the volume management software.
Use these guidelines when configuring your disks:
Mirroring of root disks is recommended, but not required.
All multihost disks must be mirrored across arrays. An exception is the Sun StorEdge A3x00, which is configured to mirror or provide hardware RAID5.
Use of hot spares is highly recommended, but not required.
See "Volume Manager Slices" for disk layout recommendations related to volume management, and consult your volume manager documentation for any additional restrictions.
Consider these points when planning Solstice DiskSuite configurations:
Always use trans metadevices for file systems within disksets.
If you run with only two disk expansion units, you will need to use Solstice DiskSuite mediators.
When using Solstice DiskSuite mediators with any disksets in your configuration, only two cluster nodes can act as mediator hosts. Those two nodes must be used for all disksets requiring mediators, regardless of which nodes master those disksets. Therefore, in Sun Cluster configurations with more than two nodes, it is possible for a diskset's mediator host to be one that is not actually a potential master of that diskset.
Consider these points when planning VxVM configurations:
You must create a default disk group (rootdg) on each cluster node. See your VERITAS documentation for guidelines and details about creating a rootdg.
If the rootdg is encapsulated, then two disk slice table entries must be left free. In addition, only the root (/), /usr, /var, and swap file systems should be present on the encapsulated root disk.
Insufficient disk space and slices prevent encapsulation of the boot disk later and increase installation time because the operating environment might have to be reinstalled.
You will need licenses for VERITAS Volume Manager if you use it with any storage devices other than SPARCstorage Arrays or Sun StorEdge A5000s. SPARCstorage Arrays and Sun StorEdge A5000s include bundled licenses for use with VxVM. Contact the Sun License Center for any necessary VxVM licenses; see http://www.sun.com/licensing/ for more information. You do not need licenses to run Solstice DiskSuite with Sun Cluster.
One important aspect of high availability is the ability to bring file systems back online quickly in the event of a node failure. This is best served by using a logging file system. Sun Cluster supports three logging file systems: VxFS logging from VERITAS, DiskSuite UFS logging, and Solaris UFS logging. The VERITAS Volume Manager cluster feature, when used with Oracle Parallel Server, uses raw partitions and therefore does not use a logging file system. However, you can also run VxVM with the cluster feature in a cluster with both OPS and HA data services. In this configuration, the OPS shared disk groups use raw partitions, but the HA disk groups can use either VxFS or Solaris UFS logging file systems. Excluding the coexistent VxVM with cluster feature configuration described above, Sun Cluster supports the following combinations of volume managers and logging file systems.
Table 2-1 Supported Logging File Systems
| Solaris Operating Environment | Volume Manager | Supported File Systems |
|---|---|---|
| Solaris 2.6 | VERITAS Volume Manager | VxFS, UFS (no logging) |
| Solaris 2.6 | Solstice DiskSuite | DiskSuite UFS logging |
| Solaris 7 | Solstice DiskSuite | DiskSuite UFS logging, Solaris UFS logging |
| Solaris 8 | Solstice DiskSuite | DiskSuite UFS logging, Solaris UFS logging |
VxVM with cluster feature uses Dirty Region Logging to aid in fast recovery after a reboot, similar to what the logging file systems provide. For information about the VxVM cluster feature, refer to your VERITAS Volume Manager documentation. For information on DiskSuite UFS logging, refer to the Solstice DiskSuite documentation. For information on VxFS logging, see the VERITAS documentation. Solaris UFS logging is described briefly below. See the mount_ufs(1M) man page for more details.
Solaris UFS logging is a new feature in the Solaris 7 operating environment.
Solaris UFS logging uses a circular log to journal the changes made to a UFS file system. As the log fills up, changes are "rolled" into the actual file system. The advantage of logging is that the UFS file system is never left in an inconsistent state, that is, with a half-completed operation. After a system crash, fsck has nothing to fix, so you boot up much faster.
Solaris UFS logging is enabled using the "logging" mount option. To enable logging on a UFS file system, you either add -o logging to the mount command or add the word "logging" to the /etc/opt/SUNWcluster/conf/hanfs/vfstab.logicalhost entry (the rightmost column).
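For example, a vfstab.logicalhost entry with logging enabled might look like the following (the VxVM volume and mount point names are hypothetical); the "logging" keyword appears in the rightmost, mount options, column:

```
#device to mount           device to fsck              mount point    FS type  fsck pass  mount at boot  mount options
/dev/vx/dsk/hahost1/vol01  /dev/vx/rdsk/hahost1/vol01  /hahost1/data  ufs      -          no             logging
```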
Solaris UFS logging always allocates the log using free space on the UFS file system. The log takes up 1 MByte on file systems less than 1 GByte in size, and 1 MByte per GByte on larger file systems, up to a maximum of 64 MBytes.
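The sizing rule above can be sketched as a small shell helper. This is not a Solaris interface, only the arithmetic just described: 1 Mbyte for file systems under 1 Gbyte, 1 Mbyte per Gbyte otherwise, capped at 64 Mbytes.

```shell
# Estimate the UFS log size (in Mbytes) that Solaris UFS logging would
# allocate for a file system of the given size in Gbytes.
ufs_log_size() {
  fs_gb=$1
  if [ "$fs_gb" -lt 1 ]; then
    echo 1                 # minimum: 1 Mbyte for file systems under 1 Gbyte
  elif [ "$fs_gb" -gt 64 ]; then
    echo 64                # maximum: 64 Mbytes
  else
    echo "$fs_gb"          # 1 Mbyte per Gbyte
  fi
}
```

For example, a 10 Gbyte file system gets a 10 Mbyte log, while a 100 Gbyte file system is capped at 64 Mbytes.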
Solaris UFS logging always puts the log files on the same disk as the file system. If you use this logging option, you are limited to the size of the disk. DiskSuite UFS logging allows the log to be separated on a different disk. This has the effect of reducing a bit of the I/O that is associated with the log.
With DiskSuite UFS logging, the trans device used for logging is itself a metadevice, and the log is yet another metadevice, which can be mirrored and striped. Furthermore, you can create a logging file system of up to 1 Tbyte with Solstice DiskSuite.
The "logging" mount option will not work if you already have logging provided by Solstice DiskSuite; you will receive a warning message explaining that you already have logging on that file system. If you require more control over the size or location of the log, use DiskSuite UFS logging.
Depending on the file system usage, Solaris UFS logging gives you performance that is the same or better than running without logging.
There is currently no support for converting from DiskSuite UFS logging to Solaris UFS logging.
Unless you are using a RAID5 configuration, all multihost disks must be mirrored in Sun Cluster configurations. This enables the configuration to tolerate single-disk failures. Refer to "Mirroring Guidelines", and to your volume management documentation, for more information.
Determine the amount of data that you want to move to the Sun Cluster configuration. If you are not using RAID5, double that amount to allow disk space for mirroring. With RAID5, you need extra space equal to 1/(# of devices -1). Use the worksheets in "Configuration Worksheets", to help plan your disk requirements.
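The space arithmetic above can be sketched in shell. The function name and interface are illustrative, not part of Sun Cluster; it returns the raw disk space required for a given amount of data under mirroring (double the data) or RAID5 (extra space of 1/(number of devices - 1)).

```shell
# Estimate raw disk space (Gbytes) needed for data_gb of data.
# Usage: required_space <data_gb> mirror
#        required_space <data_gb> raid5 <number_of_devices>
required_space() {
  data_gb=$1; scheme=$2; ndev=$3
  case "$scheme" in
    mirror) echo $((data_gb * 2)) ;;                    # two-way mirror doubles the space
    raid5)  echo $((data_gb + data_gb / (ndev - 1))) ;; # parity overhead: 1/(ndev - 1)
  esac
}
```

For example, 100 Gbytes of mirrored data requires 200 Gbytes of disk, while the same data on a six-device RAID5 set requires 120 Gbytes.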
Consider these points when planning your disk requirements:
Sun Cluster supports several multihost disk expansion units. Consider the size of disks available with each disk expansion unit when you calculate the amount of data to migrate to Sun Cluster.
With Sun StorEdge A3x00 units, you need only one disk expansion unit because each unit has two controllers. With Sun StorEdge MultiPacks, you must have at least two disk expansion units.
Under some circumstances, there might be an advantage to merging several smaller file systems into a single larger file system. This reduces the number of file systems to administer and might help speed up cluster takeovers.
The size of the dump media (backup system) might influence the size of the file systems in your configuration.
With Solstice DiskSuite, if you have only two disk expansion units, then you must configure dual-string mediators. If you have more than two disk expansion units, you need not configure mediators. VERITAS Volume Manager does not support mediators. See the appendix on dual-string mediators in the Sun Cluster 2.2 System Administration Guide for details on the dual-string mediator feature.
New disk types are continually being qualified with Sun Cluster. See your service provider for the most current list of supported disk types.
Consider these points when planning for disk space growth:
Less administration time is required to configure disks during initial configuration than to add them while the system is in service.
Leaving empty slots in the multihost disk expansion units during initial configuration allows you to add disks easily later.
When your site needs additional disk expansion units, you might have to reconfigure your data to prevent mirroring within a single disk expansion unit. Therefore, if all the existing disk expansion units are full, the easiest way to add disk expansion units without reorganizing data is to add them in pairs.
Several sizes of disks are supported in multihost disk expansion units. Consider these points when deciding which size drives to use:
If you use lower capacity drives, you can have more spindles; this increases potential I/O bandwidth, assuming the disks have the same I/O rates.
If you use higher capacity disks, then fewer devices are required in the configuration. This can help speed up takeovers because takeover time can be partially dependent on the number of drives being taken over.
You can determine the number of disks needed by dividing the total disk capacity that you have selected (including mirrors) by the disk size in your disk expansion units.
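This calculation can be sketched as a hypothetical shell helper that rounds up to whole disks:

```shell
# Number of disks needed: total required capacity (including mirrors)
# divided by per-disk size, rounded up. Both values in Gbytes.
disks_needed() {
  total_gb=$1; disk_gb=$2
  echo $(( (total_gb + disk_gb - 1) / disk_gb ))   # integer ceiling division
}
```

For example, 100 Gbytes of total capacity on 18 Gbyte drives requires 6 disks.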
Sun Cluster does not require any specific disk layout or file system size. The requirements for the file system hierarchy are dependent on the volume management software you are using.
Regardless of your volume management software, Sun Cluster requires at least one file system per disk group to serve as the HA administrative file system. This administrative file system should always be mounted on /logicalhost, and must be a minimum of 10 Mbytes. It is used to store private directories containing data service configuration information.
For clusters using Solstice DiskSuite, you need to create a metadevice to contain the HA administrative file system. The HA administrative file system should be configured the same as your other multihost file systems, that is, it should be mirrored and set up as a trans device.
For clusters using VxVM, Sun Cluster creates the HA administrative file system on a volume named dg-stat where dg is the name of the disk group in which the volume is created. dg is usually the first disk group in the list of disk groups specified when defining a logical host.
Consider these points when planning file system size and disk layout:
When mirroring, lay out disks so that they are mirrored across disk controllers.
Partitioning or subdividing all similar disks identically simplifies administration.
Solstice DiskSuite software requires some additional space on the multihost disks and imposes some restrictions on its use. For example, if you are using UNIX file system (UFS) logging under Solstice DiskSuite, one to two percent of each multihost disk must be reserved for metadevice state database replicas and UFS logging. Refer to Appendix B, Configuring Solstice DiskSuite, and to the Solstice DiskSuite documentation for specific guidelines and restrictions.
All metadevices used by each shared diskset are created in advance, at reconfiguration boot time, based on settings found in the md.conf file. The fields in the md.conf file are described in the Solstice DiskSuite documentation. The two fields that are used in the Sun Cluster configuration are md_nsets and nmd. The md_nsets field defines the number of disksets and the nmd field defines the number of metadevices to create for each diskset. You should set these fields at install time to allow for all predicted future expansion of the cluster.
Extending the Solstice DiskSuite configuration after the cluster is in production is time consuming because it requires a reconfiguration reboot for each node and always carries the risk that there will not be enough space allocated in the root (/) file system to create all of the requested devices.
The value of md_nsets must be set to the expected number of logical hosts in the cluster, plus one to allow Solstice DiskSuite to manage the private disks on the local host (that is, those metadevices that are not in the local diskset).
The value of nmd must be set to the predicted largest number of metadevices used by any one of the disksets in the cluster. For example, if a cluster uses 10 metadevices in its first 15 disksets, but 1000 metadevices in the 16th diskset, nmd must be set to at least 1000.
All cluster nodes (or cluster pairs in the cluster pair topology) must have identical md.conf files, regardless of the number of logical hosts served by each node. Failure to follow this guideline can result in serious Solstice DiskSuite errors and possible loss of data.
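For example, a md.conf excerpt for a cluster planned to grow to four logical hosts, with up to 1000 metadevices in its largest diskset, might include the following (the values are illustrative; consult the Solstice DiskSuite documentation for the full file format):

```
# md.conf excerpt (illustrative values)
# md_nsets: expected logical hosts (4) + 1 for the local diskset
md_nsets=5;
# nmd: largest number of metadevices used by any one diskset
nmd=1000;
```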
Consider these points when planning your Solstice DiskSuite file system layout:
The HA administrative file system cannot be grown using growfs(1M).
You must create mount points for other file systems at the /logicalhost level.
Your application might dictate a file system hierarchy and naming convention. Sun Cluster imposes no restrictions on file system naming, as long as names do not conflict with data service required directories.
Use the partitioning scheme described in Table 2-2 for the majority of drives.
In general, if UFS logs are created, the default size for Slice 6 should be 1 percent of the size of the largest multihost disk found on the system.
The overlap of Slices 6 and 0 by Slice 2 is used for raw devices where there are no UFS logs.
In addition, the first drive on each of the first two controllers in each of the disksets should be partitioned as described in Table 2-3.
Each disk group has an HA administrative file system associated with it. This file system is not NFS-shared. It is used for data service specific state or configuration information.
Partition 7 is always reserved for use by Solstice DiskSuite as the first or last 2 Mbytes on each multihost disk.
Table 2-2 Multihost Disk Partitioning

| Slice | Description |
|---|---|
| 7 | 2 Mbytes, reserved for Solstice DiskSuite |
| 6 | UFS logs |
| 0 | Remainder of the disk |
| 2 | Overlaps Slices 6 and 0 |
Table 2-3 Multihost Disk Partitioning for the First Drive on the First Two Controllers

| Slice | Description |
|---|---|
| 7 | 2 Mbytes, reserved for Solstice DiskSuite |
| 5 | 2 Mbytes, UFS log for HA administrative file systems |
| 4 | 9 Mbytes, UFS master for HA administrative file systems |
| 6 | UFS logs |
| 0 | Remainder of the disk |
| 2 | Overlaps Slices 6 and 0 |
You can create UNIX File System (UFS) or VERITAS File System (VxFS) file systems in the disk groups of logical hosts. When a logical host is mastered on a cluster node, the file systems associated with the disk groups of the logical host are mounted on the specified mount points of the mastering node.
When you reconfigure logical hosts, Sun Cluster must check the file systems before mounting them by running the fsck command. Although fsck checks UFS file systems in non-interactive parallel mode, the check still consumes some time, which affects the reconfiguration process. VxFS drastically reduces the file system check time, especially if the configuration contains large file systems (greater than 500 Mbytes) used for data services.
When setting up mirrored volumes, always add a Dirty Region Log (DRL) to decrease volume recovery time in the event of a system crash. When mirrored volumes are used in clusters, DRL must be assigned for volumes greater than 500 Mbytes.
With VxVM, it is important to estimate the maximum number of volumes that will be used by any given disk group at the time the disk group is created. If the number is less than 1000, default minor numbering can be used. Otherwise, you must carefully plan the way in which minor numbers are assigned to disk group volumes. It is important that no two disk groups shared by the same nodes have overlapping minor number assignments.
As long as default numbering can be used and all disk groups are currently imported, it is not necessary to use the minor option to the vxdg init command at disk group creation time. Otherwise, the minor option must be used to prevent overlapping the volume minor number assignments. It is possible to modify the minor numbering later, but doing so might require you to reboot and import the disk group again. Refer to the vxdg(1M) man page for details.
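As a sketch (the disk group and device names here are hypothetical), two disk groups that might later be imported by the same nodes could be created with non-overlapping base minor numbers spaced far enough apart that their volume minor ranges cannot collide:

```
# vxdg init hadg1 minor=33000 c1t0d0
# vxdg init hadg2 minor=34000 c2t0d0
```

See the vxdg(1M) man page for the exact syntax supported by your VxVM release.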
The /etc/vfstab file contains the mount points of file systems residing on local devices. For a multihost file system used for a logical host, all the nodes that can potentially master the logical host should possess the mount information.
The mount information for a logical host's file system is kept in a separate file on each node, named /etc/opt/SUNWcluster/conf/hanfs/vfstab.logicalhost. The format of this file is identical to the /etc/vfstab file for ease of maintenance, though not all fields are used.
You must keep the vfstab.logicalhost file consistent among all nodes of the cluster. Use the rcp command or file transfer protocol (FTP) to copy the file to the other nodes of the cluster. Alternatively, edit all of the files simultaneously by using crlogin or ctelnet.
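For example (host and logical host names hypothetical), after editing the file on one node you might propagate it with:

```
# rcp /etc/opt/SUNWcluster/conf/hanfs/vfstab.hahost1 \
    phys-hahost2:/etc/opt/SUNWcluster/conf/hanfs/vfstab.hahost1
```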
The same file system cannot be mounted by more than one node at the same time, because a file system can be mounted only if the corresponding disk group has been imported by the node. The consistency and uniqueness of the disk group imports and logical host mastery is enforced by the cluster framework logical host reconfiguration sequence.
Sun Cluster supports booting from a private or shared disk inside a SPARCstorage Array.
Consider these points when using a boot disk in an SSA:
Make sure that each cluster node's boot disk is on a different SSA. If nodes share a single SSA for their boot disks, the loss of a single controller would bring down all nodes.
For VxVM configurations, do not configure a boot disk and a quorum device on the same tray. This is especially true for a two-node cluster. If you place both on the same tray, the cluster loses one of its nodes as well as its quorum device when you remove the tray. If for any reason a reconfiguration happens on the surviving node during this time, the cluster is lost. If a controller containing the boot disk and the quorum disk becomes faulty, the node that has its boot disk on the bad controller inevitably hangs or crashes, causing the other node to reconfigure and abort, since the quorum device is inaccessible. (This is a likely scenario in a minimal configuration consisting of two SSAs with boot disks and no root disk mirroring.)
Mirroring the boot disks in a bootable SSA configuration is recommended. However, there is an impact on software upgrades. Neither Solaris nor the volume manager software can be upgraded while the root disk is mirrored. In such configurations, perform upgrades carefully to avoid corruption of the root file system. Refer to "Mirroring Root (/)", for additional information on mirroring the root file system.
A disk group stores the data for one or more data services. Generally, several data services share a logical host, and therefore fail over together. If you want to enable a particular data service to fail over independently of all other data services, then assign a logical host to that data service alone, and do not allow any other data services to share it.
As part of the installation and configuration, you need to establish the following for each logical host:
Default master - Each logical host can potentially be mastered by any physical host to which it is connected.
HA administrative file system - This is a mount point on the logical host for the administrative file system. Refer to "Planning Your File System Layout on the Multihost Disks", for more information.
vfstab file name - Each logical host needs a separate vfstab file to store file system information. This name is generally vfstab.logicalhost.
Disk group or Diskset - Each disk group or diskset has its own name. By convention, the disk group or diskset name and the logical host name are the same, but you can give the disk group or diskset another name.
Use the logical host worksheet in "Configuration Worksheets", to record this information.
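For illustration, an entry in a vfstab.logicalhost file for a VxVM volume might look like the following. The disk group, volume, and mount point names here are hypothetical; the field layout matches /etc/vfstab, though, as noted above, not all fields are used:

```
#device to mount           device to fsck              mount point  FS type  fsck pass  mount at boot  mount options
/dev/vx/dsk/hahost1/vol01  /dev/vx/rdsk/hahost1/vol01  /hahost1     ufs      -          no             -
```

The "mount at boot" field is no because the cluster framework, not the boot process, mounts the logical host's file systems.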
Consider these points when planning your logical host configuration:
If a data service does not put a heavy load on the CPU or memory, then you will not gain a load-balancing advantage by switching it over separately.
Use care when load balancing an N+1 configuration. If the data service puts a heavy load on the CPU or memory, then you should limit the workload on the hot-standby node. A large workload on the hot-standby node compromises its ability to take over should any active node fail.
If the data service software is relatively reliable and starts up quickly, then you will not gain much availability by failing it over separately.
If the data service uses only a small amount of disk space, you might waste a lot of disk space by putting it on a separate logical host, because you must have at least a mirrored pair of drives per disk group.
The administrative burden increases as the number of logical hosts grows.
As part of Sun Cluster installation and configuration, you configure a Cluster Configuration Database (CCD) volume to store internal configuration data. In a two-node cluster using VxVM, this volume can be shared between the nodes thereby increasing the availability of the CCD. In clusters with more than two nodes, a copy of the CCD is local to each node. See "Configuring the Shared CCD Volume" for details on configuring a shared CCD.
You cannot use a shared CCD in a two-node cluster using Solstice DiskSuite.
If each node keeps its own copy of the CCD, then updates to the CCD are disabled by default when one node is not part of the cluster. This prevents the database from getting out of synchronization when only a single node is up.
The CCD requires two disks as part of a disk group for a shared volume. These disks are dedicated for CCD use and cannot be used by any other application, file system, or database.
The CCD should be mirrored for maximum availability. The two disks comprising the CCD should be on separate controllers.
In clusters using VxVM, the scinstall(1M) script will ask you how you want to set up the CCD on a shared volume in your configuration.
Refer to Chapter 1, Understanding the Sun Cluster Environment, for a general overview of the CCD. Refer to the chapter on general cluster administration in the Sun Cluster 2.2 System Administration Guide for procedures used to administer the CCD.
Although the installation procedure does not prevent you from choosing disks on the same controller, this would introduce a possible single point of failure and is not recommended.
If you are using VERITAS Volume Manager (with or without the cluster feature), you must configure a quorum device regardless of the number of cluster nodes. During the Sun Cluster installation process, scinstall(1M) will prompt you to configure a quorum device.
The quorum device is either an array controller or a disk.
If it is an array controller, all disks in the array must be part of the cluster applications. No private data (a private file system or disk groups private to a node) can be stored in the array controller.
If the quorum device is a disk, that disk must be part of the cluster application. The disk cannot be private to either of the nodes.
During the cluster software installation, you will need to make decisions concerning:
Type of quorum configuration (simple mode or complex mode) - In simple mode, the quorum device is configured automatically. In complex mode, you must configure the quorum device manually.
Quorum device behavior - If the cluster is partitioned into subsets, you can configure the Cluster Membership Monitor either to automatically select which subset stays up, or to have the system prompt you for action.
Quorum device policy - If you choose to have the system automatically select which subset stays up, you must configure the policy. You choose either lowest or highest node ID, to specify which subset of nodes automatically becomes the new cluster in the event the quorum device is activated. Refer to "Quorum, Quorum Devices, and Failure Fencing", for more information on the quorum device policy.
Type of device - The quorum device can be a controller or disk in a multihost disk expansion unit.
If all the disks in an expansion unit are going to be used for shared disk groups (for OPS) or for HA disk groups, then the array controller can be used for the quorum device.
If one or more disks in an array are used for the private storage of a node (either as a file system or as a raw device), then one of the disks belonging either to the shared disk group (for OPS) or to one of the HA disk groups must be used as the quorum device.
You can also choose a dedicated disk as the quorum device (one on which no data is stored).
Before you select the quorum device for your cluster, be aware of the implications of your selection. Any node pair of the cluster must have a quorum device. That is, one quorum device must be specified for every node set that shares multihost disks. Each node in the cluster must be informed of all quorum devices in the cluster, not just the quorum device connected to it. During cluster installation, the scinstall(1M) command displays each possible node pair in sequence, along with any common devices that are quorum device candidates.
In two-node clusters with dual-ported disks, a single quorum device needs to be specified.
In clusters with more than two nodes and dual-ported disks, not all of the cluster nodes have access to the entire disk subsystem. In such configurations, you must specify one quorum device for each set of nodes that shares disks.
Sun Cluster configurations can consist of disk storage units (such as the Sun StorEdge A5000) that can be connected to all nodes in the cluster. This allows for applications such as OPS to run on clusters of greater than two nodes. A disk storage unit that is physically connected to all nodes in the cluster is referred to as a direct attached device. In this type of cluster a single quorum device needs to be selected from a direct attached device.
In clusters with direct attached devices, if the cluster interconnect fails, one of the following will happen:
If manual intervention was specified when the quorum device was configured, all nodes will prompt for operator assistance.
If automatic selection was specified when the quorum device was configured, the highest or lowest node ID will reserve the quorum device and all other nodes will prompt for operator assistance.
In clusters in which a direct attached device is not connected to all nodes, you will, by definition, have multiple quorum devices (one for each node pair that shares disks). In this configuration, the quorum device comes into play only when just two nodes remain and they share a common quorum device.
In the event of a node failure, the node that is able to reserve the quorum device remains as the sole survivor of the cluster. This is necessary to ensure the integrity of data on the shared disks.
Consider these points when deciding how to migrate existing data to the Sun Cluster environment.
You cannot move data into the Sun Cluster configuration by connecting existing disks that contain data. The volume management software must be used to prepare the disks before moving data to them.
You can use ufsdump(1M) and ufsrestore(1M) or other suitable file system backup products to migrate UNIX file system data to Sun Cluster nodes.
When migrating databases to a Sun Cluster configuration, use the method recommended by the database vendor.
Before you load data onto the multihost disks in a Sun Cluster configuration, you should have a plan for backing up the data. Sun recommends using Solstice Backup(TM), ufsdump, or VERITAS NetBackup to back up your Sun Cluster configuration.
If you are converting your backup method from Online: Backup(TM) to Solstice Backup, special considerations exist because the two products are not compatible. The primary decision for the system administrator is whether or not the files backed up with Online:Backup will be available online after Solstice Backup is in use. Refer to the Solstice Backup documentation for details on conversion.
VERITAS NetBackup provides a powerful, scalable backup solution. The NetBackup master server can be made highly available by placing it under the control of Sun Cluster HA for NetBackup. Refer to Chapter 14, Installing and Configuring Sun Cluster HA for NetBackup for more information about VERITAS NetBackup.
The following files should be saved after the system is configured and running. In the unlikely event that the cluster should experience problems, these files can help service providers debug and solve cluster problems.
/etc/did.conf - This file contains the disk ID (DID) mappings for Solstice DiskSuite configurations. This information can be used to track and verify the correct DID configurations after a catastrophic failure on one node.
/etc/opt/SUNWcluster/conf/ccd.database - This file contains important cluster configuration information. Instructions for troubleshooting the CCD are in Chapter 4 of the Sun Cluster 2.2 System Administration Guide.
/etc/opt/SUNWcluster/conf/cluster_name.cdb - This file contains current information about the cluster configuration. It can be used as a reference to determine what changes have occurred since the original setup.
metastat -s diskset_name -p > sav.diskset_name - This command saves the current Solstice DiskSuite diskset configuration.
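A script along the following lines can gather these files into one backup directory. The destination path is an assumption; the source paths are the ones listed above:

```shell
#!/bin/sh
# Collect the cluster configuration files worth saving after setup.
# BACKUPDIR is a hypothetical destination; choose any safe location,
# ideally one that is also backed up off the node.
BACKUPDIR=/var/tmp/cluster-config-save
mkdir -p "$BACKUPDIR"
for f in /etc/did.conf \
         /etc/opt/SUNWcluster/conf/ccd.database \
         /etc/opt/SUNWcluster/conf/*.cdb; do
    # Copy only the files that exist on this node.
    if [ -f "$f" ]; then
        cp "$f" "$BACKUPDIR"
    fi
done
# For Solstice DiskSuite, also capture each diskset configuration, e.g.:
#   metastat -s diskset_name -p > "$BACKUPDIR/sav.diskset_name"
```

Run the script on each node, since some of these files differ from node to node.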
You can install Solaris software from a local CD-ROM or from a network install server using JumpStart. If you are installing several Solaris machines, consider a network install. Otherwise, use the local CD-ROM.
Configurations using FDDI as the primary public network cannot be network-installed directly using JumpStart because the FDDI drivers are unbundled and are not available in "mini-unix." If you use FDDI as the primary public network, you must install Solaris software from CD-ROM.
You will receive a paper license for the Sun Cluster 2.2 framework, one for each hardware platform on which Sun Cluster 2.2 will run. You will also receive a paper license for each Sun Cluster data service, one per node. The Sun Cluster 2.2 framework does not enforce these licenses, but you should retain the paper licenses as proof of ownership when you need technical support or other support services.
You do not need licenses to run Solstice DiskSuite with a licensed Sun Cluster 2.2 configuration. However, you need a license for VERITAS Volume Manager (VxVM) and optionally for VERITAS Volume Manager cluster functionality. The base VxVM license certificates are included with Sun Cluster Server license kits, and VxVM cluster functionality license certificates are bundled with Oracle Parallel Server RTU license kits. The Sun Cluster and Oracle Parallel Server license kits are available from Sun. Follow the instructions printed on the license certificates to obtain active license keys.
You may need to obtain licenses for DBMS products and other third-party products. Contact your third-party service provider for third-party product licenses. See http://www.sun.com/licensing/ for more information.
The rules discussed in this section help ensure that your Sun Cluster configuration is highly available. These rules also help determine the appropriate hardware for your configuration.
Although it is not required in some configurations, in general the Sun Cluster nodes should have identical local hardware. This means that if one cluster node is configured with two FC/S cards, then all Sun Cluster nodes in the cluster also should have two FC/S cards.
Identify "redundant" hardware components on each node and plan their placement to prevent the loss of both components in the event of a single hardware failure. For example, consider the private networks on the E10000 system. The minimum configuration consists of two I/O boards, each supporting one of the private network connections and one of the multihost disk connections. A localized failure on an I/O board is unlikely to affect both private network connections or both multihost disk connections.
Configuring redundant hardware is not always possible--some configurations might contain only one system board--but some of the concerns can still be addressed easily with hardware options. For example, in an Ultra Enterprise(TM) 2 Cluster with two SPARCstorage Arrays, one private network can be connected to a Sun Quad FastEthernet(TM) Controller card (SQEC), while the other private network can be connected to the on-board interface.
Unless you are using a RAID5 configuration, all multihost disks must be mirrored in Sun Cluster configurations. This enables the configuration to tolerate single-disk failures.
Consider these points when mirroring multihost disks:
Each submirror of a given mirror or plex should reside in a different multihost disk expansion unit.
Mirroring doubles the amount of necessary disk space.
Three-way mirroring is supported by Solstice DiskSuite and VERITAS Volume Manager. However, only two-way mirroring is required by Sun Cluster.
Under Solstice DiskSuite, mirrors are made up of other metadevices such as concatenations or stripes. Large configurations might contain a large number of metadevices. For example, seven metadevices are created for each logging UFS file system.
If you mirror to a disk of a different size, your mirror capacity is limited to the size of the smallest submirror or plex.
For maximum availability, you should mirror root (/), /usr, /var, /opt, and swap on the local disks. Under VERITAS Volume Manager, this means encapsulating the root disk and mirroring the generated subdisks. However, mirroring the root disk is not a requirement of Sun Cluster.
You should consider the risks, complexity, cost, and service time for the various alternatives concerning the root disk. There is not one answer for all configurations. You might want to consider your local Enterprise Services representative's preferred solution when deciding whether to mirror root.
Refer to your volume manager documentation for instructions on mirroring root.
Consider the following issues when deciding whether to mirror the root file system.
Mirroring root adds complexity to system administration and complicates booting in single user mode.
Regardless of whether or not you mirror root, you also should perform regular backups of root. Mirroring alone does not protect against administrative errors; only a backup plan can allow you to restore files which have been accidentally altered or deleted.
Under Solstice DiskSuite, in failure scenarios in which metadevice state database quorum is lost, you cannot reboot the system until maintenance is performed.
Refer to the discussion on metadevice state database and state database replicas in the Solstice DiskSuite documentation.
Highest availability includes mirroring root on a separate controller.
You might regard a sibling node as the "mirror" and allow a takeover to occur in the event of a local disk drive failure. Later, when the disk is repaired, you can copy over data from the root disk on the sibling node.
Note, however, that there is nothing in the Sun Cluster software that guarantees an immediate takeover. In fact, the takeover might not occur at all. For example, presume some sectors of a disk are bad. Presume they are all in the user data portions of a file that is crucial to some data service. The data service will start getting I/O errors, but the Sun Cluster node will stay up.
You can set up the mirror to be a bootable root disk so that if the primary boot disk fails, you can boot from the mirror.
With a mirrored root, it is possible for the primary root disk to fail and work to continue on the secondary (mirror) root disk.
At a later point the primary root disk might return to service (perhaps after a power cycle or transient I/O errors) and subsequent boots are performed using the primary root disk specified in the OpenBoot(TM) PROM boot-device field. Note that a Solstice DiskSuite resync has not occurred--that requires a manual step when the drive is returned to service.
In this situation there was no manual repair task--the drive simply started working "well enough" to boot.
If there were changes to any files on the secondary (mirror) root device, they would not be reflected on the primary root device during boot time (causing a stale submirror). For example, changes to /etc/system would be lost. It is possible that some Solstice DiskSuite administrative commands changed /etc/system while the primary root device was out of service.
The boot program does not know whether it is booting from a mirror or an underlying physical device, and the mirroring becomes active part way through the boot process (after the metadevices are loaded). Before this point the system is vulnerable to stale submirror problems.
Upgrading to later versions of the Solaris environment while using volume management software to mirror root requires steps not currently outlined in the Solaris documentation. The current Solaris upgrade is incompatible with the volume manager software used by Sun Cluster. Consequently, a root mirror must be converted to a one-way mirror before running the Solaris upgrade. Additionally, all three supported volume managers require that other tasks be performed to successfully upgrade Solaris. Refer to the appropriate volume management documentation for more information.
Consider the following alternatives when deciding whether to mirror root (/) file systems under Solstice DiskSuite. The issues mentioned in this section are not applicable to VERITAS Volume Manager configurations.
For highest availability, mirror root on a separate controller with metadevice state database replicas on three different controllers. This tolerates both disk and controller failures.
Under Solstice DiskSuite, use one of the following methods to tolerate disk media failures:
Mirror the root disk on a second controller and keep a copy of the metadevice state database on a third disk on one of the controllers.
Mirror the root disk on the same controller and keep a copy of the metadevice state database on a third disk on the same controller.
It is possible to reboot the system before performing maintenance in these configurations, because a quorum is maintained after a disk media failure. These configurations do not tolerate controller failures, with the exception that the first method tolerates failure of the controller that contains metadevice state database replicas on only a single disk.
If the controller that contains replicas on two disks fails, quorum is lost.
Mirroring the root disk on the same controller and storing metadevice state database replicas on both disks tolerates a disk media failure and prevents an immediate takeover. However, you cannot reboot the machine until after maintenance is performed because more than half of the metadevice state database replicas are not available after the failure.
Do not mirror the root disk, but perform a daily manual backup of the root disk (with dd(1) or some other utility) to a second disk which can be used for booting if the root disk fails. Configure the second disk as an alternate boot device in the OpenBoot PROM. The /etc/vfstab file might need to be updated after the dd(1) operation to reflect the different root partition. Configure additional metadevice state database replicas on Slice 4 of the second disk. In the event of failure of the first disk, these will continue to point to the multihost disk replicas. Do not copy and restore the metadevice state database. Rather, let Solstice DiskSuite do the replication.
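The daily manual copy in the last alternative can be sketched as follows. This is shown as a dry run (echo prints each step instead of running it), and the device names are hypothetical; use your real root and backup slices:

```shell
#!/bin/sh
# Dry-run sketch of the daily root-disk copy described above.
# ROOTDEV and BACKDEV are hypothetical device names.
ROOTDEV=/dev/rdsk/c0t3d0s0   # current root slice (assumption)
BACKDEV=/dev/rdsk/c1t3d0s0   # backup disk on a second controller (assumption)
CMD1="dd if=$ROOTDEV of=$BACKDEV bs=128k"
CMD2="fsck -y $BACKDEV"
echo "$CMD1"
echo "$CMD2"
# After the copy, mount the backup slice and edit its /etc/vfstab so the
# root entry names the backup disk's partition before booting from it.
```

Remember to configure the second disk as an alternate boot device in the OpenBoot PROM, as described above, and to leave the metadevice state database replicas on Slice 4 of the second disk untouched by the copy.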
This section describes Sun Cluster configuration restrictions.
Note the following restrictions related to services and applications.
Sun Cluster can be used to provide service for only those data services that either are supplied with Sun Cluster or set up using the Sun Cluster data services API.
Do not configure the Sun Cluster nodes as mail servers, because sendmail(1M) is not supported in a Sun Cluster environment. No mail directories should reside on Sun Cluster nodes.
Do not configure Sun Cluster systems as routers (gateways). If the system goes down, the clients cannot find an alternate router and recover.
Do not configure Sun Cluster systems as NIS or NIS+ servers. Sun Cluster nodes can be NIS or NIS+ clients, however.
A Sun Cluster configuration cannot be used to provide a highly available boot or install service to client systems.
A Sun Cluster configuration cannot be used to provide highly available rarpd service.
The Solaris interface groups feature is not supported on Sun Cluster, as it disrupts cluster switchover and failover behavior. You must disable Solaris interface groups on all cluster nodes. See "Disabling Solaris Interface Groups" for more information.
Sun Cluster reserves certain port numbers for internal use. These port numbers are stored in the clustername.cdb file. Note the following reserved port numbers when planning your configuration, data services, and applications.
In addition, note that Solaris reserves ports 6000 to 6031 for UNIX Distributed Lock Manager (UDLM). UDLM is used with Oracle Parallel Server configurations.
Table 2-4 Default Ports Reserved by Sun Cluster

Port Number | Reserved For...
------------|------------------------------------------------
5556        | Cluster Membership Monitor
5568-5599   | VERITAS Volume Manager cluster feature, vxclust
5559        | VERITAS Volume Manager cluster feature, vxkmsgd
5560        | VERITAS Volume Manager cluster feature, vxconfigd
603         | sm_configd and smad (for TCP and UDP)
You cannot change these port numbers, with the exception of those used by sm_configd and smad. To change the port numbers used by sm_configd or smad, edit the /etc/services file on every node; the files must be identical on all nodes. For all other ports, you must change the application's port use rather than the cluster's.
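Because the /etc/services entries must be identical on all nodes, a small check like the following can help after an edit. It is a sketch that compares the sm_configd and smad entries in two copies of /etc/services; how you fetch each node's copy (rcp, rsh, and so on) is left to you:

```shell
#!/bin/sh
# Compare the sm_configd and smad entries in two copies of /etc/services.
# Exits 0 from the function when the entries agree, nonzero otherwise.
services_match() {
    a=`grep -E '^(sm_configd|smad)[[:space:]]' "$1" | sort`
    b=`grep -E '^(sm_configd|smad)[[:space:]]' "$2" | sort`
    [ "$a" = "$b" ]
}
```

For example, services_match /etc/services /tmp/services.othernode succeeds only when both files carry the same entries for those two services.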
Note the following restrictions related to Sun Cluster HA for NFS.
Do not run, on any Sun Cluster node, any applications that access the Sun Cluster HA for NFS file system locally. For example, on Sun Cluster systems, users should not locally access any Sun Cluster file systems that are NFS exported. This is because local locking interferes with the ability to kill and restart lockd(1M). Between the kill and the restart, a blocked local process is granted the lock, which prevents reclamation of the lock by the client machine.
Sun Cluster does not support cross-mounting of Sun Cluster HA for NFS resources.
Sun Cluster HA for NFS requires that all NFS client mounts be "hard" mounts.
For Sun Cluster HA for NFS, do not use host name aliases for the logical hosts. NFS clients mounting HA file systems using host name aliases for the logical hosts might experience statd lock recovery problems.
Sun Cluster does not support Secure NFS or the use of Kerberos with NFS. In particular, the secure and kerberos options to share_nfs(1M) are not supported.
Note the following hardware-related restrictions.
A pair of Sun Cluster nodes must have at least two multihost disk enclosures, with one exception: if you use Sun StorEdge A3x00 disks, you can use only one.
The SS1000 and SC2000 hardware platforms are supported with Sun Cluster 2.2 on only Solaris 2.6. This restriction is due to the removal of support for the SFE 1.0 be(7D) driver on versions of Solaris later than 2.6. The SFE 1.0 be(7D) driver is used for the cluster interconnect on the SS1000 and SC2000 machines.
You cannot mix UDWIS/DWIS and SCI controllers on the same I/O board.
The following restrictions apply only to Ultra 2 Series configurations:
The Sun Cluster node must be reinstalled to migrate from one basic hardware configuration to another. For example, a configuration with three FC/S cards and one SQEC card must be reinstalled to migrate to a configuration with two FC/S cards, one SQEC card, and one SFE or SunFDDI(TM) card.
Dual FC/OMs per FC/S card is supported only when used with the SFE or SunFDDI card.
In the SFE or SunFDDI card configuration, recovery from a dual FC/OM FC/S card failure is by failover, not by mirroring or hot sparing.
For more information about Sun Cluster 2.2 hardware considerations and procedures, see the Sun Cluster 2.2 Hardware Site Preparation, Planning, and Installation Guide and Sun Cluster 2.2 Hardware Service Manual.
Note the following restrictions related to Solstice DiskSuite.
In Solstice DiskSuite configurations using mediators, the number of mediator hosts configured for a diskset must be an even number.
Although the shared CCD is an optional feature for two-node clusters running VxVM, a shared CCD cannot be used in Solstice DiskSuite configurations.
The +D option to scconf(1M) cannot be used with Solstice DiskSuite.
The RAID5 feature in the Solstice DiskSuite product is not supported. RAID5 is supported under VERITAS Volume Manager. A hardware implementation of RAID5 is also supported by the Sun StorEdge A3x00 disk expansion unit.
In the event of a power failure that brings down the entire cluster, user intervention is required to restart the cluster. The administrator must determine the last node that went down (by examining /var/adm/messages) and run scadmin startcluster on that node. Then the administrator must run scadmin startnode on the other cluster nodes to bring the cluster back online.
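The restart sequence can be sketched as follows. This is a dry run (echo shows the intended commands); the cluster name and node names are assumptions, and the exact scadmin argument order should be confirmed against the Sun Cluster 2.2 System Administration Guide:

```shell
#!/bin/sh
# Dry-run sketch of restarting the cluster after a total power failure.
# CLUSTER and the node names are hypothetical examples.
CLUSTER=sc-cluster
LASTNODE=phys-node1    # the last node down, determined from /var/adm/messages
echo "on $LASTNODE: scadmin startcluster $LASTNODE $CLUSTER"
for node in phys-node2 phys-node3; do
    echo "on $node: scadmin startnode"
done
```

Starting the cluster from the last node down ensures that the most recent cluster configuration data is the copy that the rejoining nodes inherit.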
Sun Cluster software must run in C locale. This applies to Sun Cluster daemons, Sun Cluster rc scripts, and the superuser environment. As a consequence, you should not configure the superuser environment or the default environment (through the /etc/default/init file) to anything other than C locale on all hosts in the cluster.
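A quick check of the default locale can be scripted. The following sketch inspects /etc/default/init (the path from the text); the optional file argument simply makes the check easy to try against a test file:

```shell
#!/bin/sh
# Warn if a node's default locale is set to something other than C.
check_locale() {
    INITFILE=${1:-/etc/default/init}
    if [ -f "$INITFILE" ] && grep '^LANG=' "$INITFILE" | grep -qv '^LANG=C$'; then
        echo "WARN: default locale in $INITFILE is not C"
        return 1
    fi
    echo "OK: default locale is C (or unset)"
}
check_locale
```

Also confirm that the superuser environment does not set LANG or LC_* variables to a non-C locale.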
Sun Cluster does not support the use of the loopback file system (lofs) on Sun Cluster nodes.
Do not run client applications on the Sun Cluster nodes. Because of local interface group semantics, a switchover or failover of a logical host may cause a TCP (telnet/rlogin) connection to be broken. This includes both connections that were initiated by the server hosts of the cluster, as well as connections that were initiated by client hosts outside the cluster.
Do not run, on any Sun Cluster node, any processes that run in the real-time scheduling class.
Do not access the /logicalhost directories from shells on any nodes. If you have shell connections to any /logicalhost directories when a switchover or failover is attempted, the switchover or failover will be blocked.
The Sun Cluster HA administrative file system cannot be grown using the Solstice DiskSuite growfs(1M) command.
Logical network interfaces are reserved for use by Sun Cluster.
Sun Prestoserve is not supported. Prestoserve works within the host system, which means that any data contained in the Prestoserve memory would not be available to the Sun Cluster sibling in the event of a switchover.
The be FastEthernet device driver has reached end of life and is not supported in the Solaris 7 or Solaris 8 operating environments. Due to this situation, the SPARCserver(TM) 1000 and SPARCcenter(TM) 2000, which use the be driver for the cluster interconnect, are not supported for Sun Cluster 2.2 with the Solaris 7 or Solaris 8 operating environments. These servers are supported for Sun Cluster 2.2 with only the Solaris 2.6 operating environment.
The user-defined script clustername.reconfig.user_script is not supported in Sun Cluster 2.2.