Sun Cluster 2.2 Release Notes

Known Problems

The following known problems affect the operation of Sun Cluster 2.2.

Framework Bugs

4185966 - A bad trap following loss of heartbeat might result in the SCI module causing node panic.

4202413 - The cluster aborts when a majority of nodes halt simultaneously. If the volume manager is CVM or SSVM, this can be avoided by selecting a single direct-attached disk as a quorum disk when configuring the cluster.

4202418 - An SCI heartbeat-alive check failure might cause node failure.

4213128 - In Solstice DiskSuite configurations in which a logical host has multiple disksets, takeover of the logical host fails because the hactl(1M) utility does not parse the diskset names correctly. This bug compromises fault monitoring in certain scenarios. The workaround involves replacing the file /opt/SUNWcluster/ha/nfs/have_maj_util with a modified file. The modified file is available through your service provider.

Administrative Command Bugs

4209264 - The scconf -F command does not always mirror the administrative file system across controllers. Use vxprint to display the volumes; if the administrative file system is not mirrored across controllers, manually create a mirror of that volume on a different controller.
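
As a rough check under assumed names, suppose the logical host disk group is hahost1, the administrative volume is vol01, and disk05 is a disk attached to a different controller (all hypothetical; use vxprint to find the real names). The first command lists the volume, its plexes, and the disks they use; the second attaches a mirror on the disk behind the other controller:

# vxprint -g hahost1 -ht
# vxassist -g hahost1 mirror vol01 disk05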

4210684 - Installing and configuring a cluster by using scinstall(1M) command-line options in conjunction with its configuration menus does not work. In addition, when using scinstall(1M) command-line options to remove the server software, the cluster network packages are not removed. To perform these tasks, run the scinstall(1M) command interactively (without options).

4210191 - When all public network connections fail on a node with Solstice DiskSuite, the node aborts from the cluster and panics with the following panic string:

Failfast timeout - unit "abort_thread"

4213927 - The pnmset(1M) command fails in some cases due to ping(1M) timeout after an ifconfig(1M) operation on some gigabit Ethernet cards. Work around the problem by configuring the /etc/pnmconfig file manually. See the pnmconfig(4M) man page for more information.

Data Service Bugs

4210065 - In Solstice DiskSuite configurations in which a logical host has multiple disksets, the Sun Cluster HA for NFS shell script /opt/SUNWcluster/ha/nfs/fdl_enum_probe_disks reports an error. This causes fault monitoring of the disksets to fail. The workaround involves replacing the file /opt/SUNWcluster/ha/nfs/fdl_enum_probe_disks with a modified file. The modified file is available through your service provider.

4210646 - The Sun Cluster HA for Oracle fault monitor does not restart Oracle correctly if the character set is non-USASCII. This is commonly the case when Oracle is installed during SAP installation. To correct this, you must establish the following link so that NLS data files specified by the fault monitor's ORA_NLS33 environment variable will be found by Oracle during startup. Create this link on all cluster nodes:

# ln -s /opt/SUNWcluster /SUNWcluster

SCM Bugs

4207695 - In SCM, the Previous button on the syslog page is enabled even when the syslog is empty. Using the Previous button in this case will have no effect.

4207726 - SCM does not detect the loss of a public network until after network connection is reestablished.

4208089 - SCM does not display the correct current status for the Sun Cluster HA for Oracle data service. When an Oracle instance is stopped with the command haoracle stop, the instance is put into maintenance mode, and no message is posted to syslog. While an instance is in maintenance mode, it is not monitored by Sun Cluster. SCM interprets this state as unknown.

4211950 - If a logical host is put into maintenance mode, SCM displays the node as waiting to be given up. Manually refresh the screen to show the correct state.

4212030 - When the NFS service is off, SCM might still display the NFS service on some logical hosts as OK.

4212623 - When a cluster node leaves the cluster, the private and public network status displayed by SCM no longer reflects the correct state and should be ignored.

4212691 - In some cases, none of the nodes that can own a logical host are part of the cluster; the logical host is then down as well, but SCM displays it as up.

Other Known Issues

The following issues apply to Sun Cluster 2.2.

Running SCM With the HotJava Browser

If you choose to use the HotJava browser shipped with your Solaris 2.6 or Solaris 7 operating environment to run SCM, you might encounter problems.

Timeout Values

After configuring each logical host with the scinstall(1M) or scconf(1M) commands, you might need to use the scconf clustername -l command to set the timeout values for the logical host. The timeout value is site-dependent; it is tied to the number of logical hosts, spindles, and file systems.

Refer to the scconf(1M) man page for details. For procedures for setting timeout values, refer to Section 3.14, "Configuring Timeouts for Cluster Transition Steps," in the Sun Cluster 2.2 System Administration Guide.

Encapsulated Root Disks

If you are running SSVM with an encapsulated root disk, you must unencapsulate the root disk before installing Sun Cluster 2.2. After you install Sun Cluster 2.2, encapsulate the disk again. You also must unencapsulate the root disk before changing the major numbers.

Refer to your SSVM documentation for the procedures to encapsulate and unencapsulate the root disk.
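
Under SSVM, the root disk is typically unencapsulated with the vxunroot utility. The prerequisites (for example, removing any non-root volumes from the root disk first) are covered in the SSVM documentation, so the following is only a sketch of the general approach:

# vxunroot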

SNMP Default Port

As part of the client software installation, the SUNWcsnmp package is installed to provide simple network management protocol (SNMP) support for Sun Cluster. The default port used by Sun Cluster SNMP is the same as the default port number used by Solaris SNMP; both use Port 161. Once the SUNWcsnmp package is installed, you must change the Sun Cluster SNMP port number using the procedure described in Section D.6, "Configuring the Cluster SNMP Agent Port," in the Sun Cluster 2.2 System Administration Guide.

Installation Directory for Sun Cluster HA for Informix

The INFORMIX_ESQL Embedded Language Runtime Facility product must be installed in the /var/opt/informix directory on Sun Cluster servers. This is required even if Informix server binaries are installed on the physical host.

Lotus and Netscape Message Servers

You can set up Lotus Domino servers as HTTP, POP3, IMAP, NNTP, or LDAP servers. Lotus Domino will start server tasks for all of these types. However, do not set up instances of any Netscape message servers on a logical host that is potentially mastered by the node on which Lotus Domino is installed.

Lotus and Netscape Port Numbers

Within a cluster, do not configure Netscape services with the same port number as the one used by the Lotus Domino server. The following port numbers are used by default by the Lotus Domino server:

HTTP    Port 80
POP3    Port 110
IMAP    Port 143
LDAP    Port 389
NNTP    Port 119
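
Before configuring a Netscape service on a node that might also master Lotus Domino, you can list the ports already in use on that node and compare them with the list above; for example:

# netstat -an | grep LISTEN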

Failover/Switchover When Logical Host File System Is Busy

If a failover or switchover occurs while a logical host's file system is busy, the logical host fails over only partially; some of its disk groups remain on the original physical host. Do not attempt a switchover while a logical host's file system is busy. Also, do not access a logical host's file system locally, because file locking does not work correctly when both NFS locks and local locks are present.
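
Before attempting a planned switchover, you can check whether any local processes are holding the logical host's file system open. The following sketch assumes the file system is mounted at /hahost1 (a hypothetical mount point); fuser(1M) reports the processes using it:

# fuser -c /hahost1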

SSP Password Must Be Correct

If an incorrect password is used for the System Service Processor (SSP) on an Ultra Enterprise 10000, the system will behave unpredictably and might crash.

Harmless Error When Stopping a Node

When you stop a node, the following error message might be displayed:

in.rdiscd[517]: setsockopt (IP_DROP_MEMBERSHIP): Cannot assign requested address

The error is caused by a timing issue between the in.rdiscd daemon and the IP module. It is harmless and can be ignored safely.

Harmless Error by NFS lockd Daemon

For Sun Cluster HA for NFS running on Solaris 7, if the lockd daemon is killed before the statd daemon is fully running, the following error message is displayed:

WARNING: lockd: cannot contact statd (error 4), continuing.

This error message can be ignored safely.

Directory Permissions and Ownership of $ORACLE_HOME

If the Sun Cluster HA for Oracle fault monitor displays errors like those shown below, make sure that the $ORACLE_HOME directory permissions are set to 755 and that the directory is owned by the Oracle administrative user with group ID dba.

Feb 16 17:13:13 ID[SUNWcluster.ha.haoracle_fmon.2520]: hahost1:HA1:
DBMS Error: connecting to database: ORA-12546: TNS:permission denied
Feb 16 17:12:13 ID[SUNWcluster.ha.haoracle_fmon.2050]: hahost1:HA1:
RDBMS error, but HA-RDBMS Oracle will take no action for this error code
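
For example, assuming the Oracle administrative user is named oracle and $ORACLE_HOME is /oracle (both hypothetical; substitute your actual values), you could set the ownership and permissions on each node as follows:

# chown oracle:dba /oracle
# chmod 755 /oracle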

Displaying LOG_DB_WARNING Messages for the SAP Probe

The Sun Cluster HA for SAP parameter LOG_DB_WARNING determines whether warning messages should be displayed if the Sun Cluster HA for SAP probe cannot connect to the database. When LOG_DB_WARNING is set to -y and the probe cannot connect to the database, a message is logged at the warning level in the local0 facility. By default, the syslogd(1M) daemon does not display these messages to /dev/console or to /var/adm/messages. To see these warnings, you must modify the /etc/syslog.conf file to display messages of local0.warning priority. For example:

...
*.err;kern.notice;auth.notice;local0.warning            /dev/console
*.err;kern.debug;daemon.notice;mail.crit;local0.warning      /var/adm/messages
...

After modifying the file, you must restart syslogd(1M). Note that in /etc/syslog.conf the selector and action fields must be separated by tab characters, not spaces. See the syslog.conf(4) and syslogd(1M) man pages for more information.
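
On Solaris 2.6 and Solaris 7, syslogd(1M) is typically restarted with its init script, assuming the standard script location:

# /etc/init.d/syslog stop
# /etc/init.d/syslog start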

Nodelock Freeze After Cluster Panic

In a cluster with more than two nodes and direct-attached storage, a problem occurs if the last node in the cluster panics or leaves the cluster abnormally (without performing the stopnode transition). In that case, all nodes have been removed from the cluster and the cluster no longer exists, but because the last node left the cluster abnormally, it still holds the nodelock. A subsequent invocation of the scadmin startcluster command fails to acquire the nodelock.

To work around this problem, manually clear the nodelock before restarting the cluster.

Use the following procedure to manually clear the nodelock and restart the cluster, after the cluster has aborted completely.

  1. As root, display the cluster configuration.

    # scconf clustername -p
    

    Look for this line in the output:

    clustername Locking TC/SSP, port  : A.B.C.D, E
    
    • If E is a positive number, the nodelock is on Terminal Concentrator A.B.C.D and Port E. Proceed to Step 2.

    • If E is -1, the lock is on an SSP. Proceed to Step 3.

  2. For a nodelock on a Terminal Concentrator (TC), perform the following steps (otherwise, proceed to Step 3).

    1. Start a telnet connection to Terminal Concentrator tc-name.

      $ telnet tc-name
       Trying 192.9.75.51...
       Connected to tc-name.
       Escape character is `^]'.

Press Return to continue.

    2. Specify -cli (command line interface).

      Enter Annex port name or number: cli
      
    3. Log in as root.

    4. Run the admin command.

      annex# admin
      
    5. Reset Port E.

      admin : reset E
      
6. Close the telnet connection.

      annex# hangup
      
    7. Proceed to Step 4.

  3. For a nodelock on a System Service Processor (SSP), perform the following steps.

    1. Connect to the SSP.

      $ telnet ssp-name
      
    2. Log in as user ssp.

    3. Display information on the clustername.lock file by using the following command (this file is a symbolic link to /proc/csh.pid).

      $ ls -l /var/tmp/clustername.lock
      
    4. Search for the process csh.pid.

      $ ps -ef | grep csh.pid
      
    5. If the csh.pid process exists in the ps -ef output, kill the process by using the following command.

      $ kill -9 csh.pid 
      
    6. Delete the clustername.lock file.

      $ rm -f /var/tmp/clustername.lock
      
    7. Log out of the SSP.

  4. Restart the cluster.

# scadmin startcluster
    

Setting Up the /etc/nsswitch.conf Files With DBMS Data Services

The following applies to configurations using Sun Cluster HA for Oracle, Sun Cluster HA for Informix, or Sun Cluster HA for Sybase.

The Sun Cluster 2.2 Software Installation Guide contains erroneous information about how to set up the /etc/nsswitch.conf files for these DBMS data services. For the data services to start and stop correctly in the event of a switchover or failover, the /etc/nsswitch.conf files must be set up as follows.

On each node that can master the logical host running the DBMS data service, the /etc/nsswitch.conf file must have one of the following entries for group.

group: files
group: files [NOTFOUND=return] nis
group: files [NOTFOUND=return] nisplus

The DBMS data services use the su user command when starting and stopping the database node. The settings above ensure that the su user command does not refer to NIS or NIS+ when the network information name service is unavailable because the public network on the cluster node has failed.
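
To verify which sources are consulted after editing /etc/nsswitch.conf, you can look up a locally defined group with getent(1M); the dba group is used here only as an example:

# getent group dba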