Sun Cluster 2.2 Release Notes

Nodelock Freeze After Cluster Panic

In a cluster with more than two nodes and direct-attached storage, a problem occurs if the last node in the cluster panics or exits the cluster abnormally (without performing the stopnode transition). In that case, all nodes have left the cluster and the cluster no longer exists, but because the last node exited abnormally, it still holds the nodelock. A subsequent invocation of the scadmin startcluster command then fails to acquire the nodelock.

To work around this problem, manually clear the nodelock before restarting the cluster. Use the following procedure after the cluster has aborted completely.

  1. As root, display the cluster configuration.

    # scconf clustername -p
    

    Look for this line in the output:

    clustername Locking TC/SSP, port  : A.B.C.D, E
    
    • If E is a positive number, the nodelock is on Terminal Concentrator A.B.C.D and Port E. Proceed to Step 2.

    • If E is -1, the lock is on an SSP. Proceed to Step 3.
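The decision above can be sketched as a small shell helper. The field layout follows the sample lock line shown in Step 1; the function name and the sample address are illustrative, not part of the product.

```shell
#!/bin/sh
# parse_nodelock: given the "Locking TC/SSP, port" line from scconf
# output, report where the nodelock is held.
# Assumed line format (from Step 1): "clustername Locking TC/SSP, port  : A.B.C.D, E"
parse_nodelock() {
    line="$1"
    # Take everything after the colon ("A.B.C.D, E"), then split on the comma.
    addr=`echo "$line" | sed 's/.*: *//' | cut -d, -f1`
    port=`echo "$line" | sed 's/.*, *//'`
    if [ "$port" = "-1" ]; then
        echo "lock on SSP $addr (see Step 3)"
    else
        echo "lock on TC $addr port $port (see Step 2)"
    fi
}
```

For example, feeding it the line from Step 1 with a positive port number:

    $ parse_nodelock "clustername Locking TC/SSP, port  : 192.9.75.51, 2"
    lock on TC 192.9.75.51 port 2 (see Step 2)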

  2. For a nodelock on a Terminal Concentrator (TC), perform the following steps (otherwise, proceed to Step 3).

    1. Start a telnet connection to Terminal Concentrator tc-name.

      $ telnet tc-name
       Trying 192.9.75.51...
       Connected to tc-name.
       Escape character is `^]'.

      Press Return to continue.

    2. Specify -cli (command line interface).

      Enter Annex port name or number: cli
      
    3. Log in as root.

    4. Run the admin command.

      annex# admin
      
    5. Reset Port E.

      admin : reset E
      
    6. Close the telnet connection.

      annex# hangup
      
    7. Proceed to Step 4.

  3. For a nodelock on a System Service Processor (SSP), perform the following steps.

    1. Connect to the SSP.

      $ telnet ssp-name
      
    2. Log in as user ssp.

    3. Display information about the clustername.lock file by using the following command (this file is a symbolic link to /proc/csh.pid, where csh.pid is the process ID of the csh holding the lock).

      $ ls -l /var/tmp/clustername.lock
      
    4. Search for the process csh.pid.

      $ ps -ef | grep csh.pid
      
    5. If the csh.pid process exists in the ps -ef output, kill the process by using the following command.

      $ kill -9 csh.pid 
      
    6. Delete the clustername.lock file.

      $ rm -f /var/tmp/clustername.lock
      
    7. Log out of the SSP.
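Steps 3.3 through 3.6 can be collected into one shell sketch to run on the SSP. The lock-file path pattern and the /proc/csh.pid symlink target come from the steps above; the function name is illustrative, and the assumption that the symlink target ends in the csh process ID is hedged in the code.

```shell
#!/bin/sh
# clear_ssp_nodelock: remove a stale nodelock on the SSP,
# automating Steps 3.3 through 3.6.
# Usage: clear_ssp_nodelock /var/tmp/clustername.lock
clear_ssp_nodelock() {
    lock="$1"
    if [ ! -h "$lock" ]; then
        echo "no nodelock at $lock"
        return 0
    fi
    # The lock file is a symbolic link to /proc/csh.pid (Step 3.3);
    # recover the pid from the link target, tolerating either a bare
    # numeric name or a "csh." prefix.
    target=`ls -l "$lock" | sed 's/.*-> *//'`
    pid=`basename "$target" | sed 's/^csh\.//'`
    # Kill the holding csh only if it is still running (Steps 3.4-3.5).
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid"
    fi
    # Delete the lock file (Step 3.6).
    rm -f "$lock"
}
```

If the csh process has already exited, the sketch simply removes the dangling lock file, which matches the manual procedure.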

  4. As root on a cluster node, restart the cluster.

    # scadmin startcluster