Sun Cluster 2.2 System Administration Guide

Replacing a Failed Node

Complete the following steps when one node has a hardware failure and needs to be replaced with a new node.


Note -

This procedure assumes that the root disk of the failed node is still operational and can be used. If the root disk itself has failed and is not mirrored, contact your local Sun Enterprise Service representative or your local authorized service provider for assistance.


How to Replace a Failed Node

If the failed node is not operational, start at Step 5.

  1. If you have a parallel database configuration, stop the database.


    Note -

    Refer to the appropriate documentation for your data services. All HA applications are automatically shut down with the scadmin stopnode command.
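
    For example, here is a minimal sketch that assumes an Oracle Parallel Server instance whose environment (ORACLE_SID, ORACLE_HOME) is set by the oracle user's profile; your database product, account names, and shutdown method may differ:

    # su - oracle
    $ svrmgrl
    SVRMGR> connect internal
    SVRMGR> shutdown immediate
    SVRMGR> exit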


  2. Use the Cluster Console to open a terminal window.

  3. As root, enter the following command in the terminal window.

    This command removes the node from the cluster, stops the Sun Cluster software, and disables the volume manager on that node.


    # scadmin stopnode
    

  4. Halt the operating system on the node.

    Refer to your Solaris system administration documentation if necessary.
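
    For example, the following command shuts down Solaris with no grace period and leaves the node at the OpenBoot PROM:

    # /usr/sbin/shutdown -y -g0 -i0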

  5. Power off the node.

    Refer to your hardware service manual for more information.


    Caution -

    Do not disconnect any cables from the failed node at this time.


  6. Remove the boot disk from the failed node.

    Refer to your hardware service manual for more information.

  7. Place the boot disk in the identical slot in the new node.

    The root disk should be accessible at the same address as before. Refer to your hardware service manual for more information.


    Note -

    Be sure that the new node has the same IP address as the failed system. You may need to modify the boot servers or arp servers to remap the IP address to the new Ethernet address. For more information, refer to the NIS+ and DNS Setup and Configuration Guide.
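
    For example, if the mapping is kept in local files on a boot server rather than in NIS+ or DNS, the /etc/ethers entry for the node must be updated to the new Ethernet address once it is known (the hostname and address below are placeholders):

    8:0:20:xx:xx:xx   phys-node1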


  8. Power on the new node.

    Refer to your hardware service manual for more information.

  9. If the node automatically boots, shut down the operating system and take the system to the OpenBoot PROM monitor.

    For more information, refer to the shutdown(1M) man page.
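
    For example, either of the following returns the node to the OpenBoot PROM monitor:

    # /usr/sbin/shutdown -y -g0 -i0
    # init 0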

  10. Make sure that every scsi-initiator-id is set correctly.

    See Chapter 4 in the Sun Cluster 2.2 Hardware Site Preparation, Planning, and Installation Guide for the detailed procedure to set the scsi-initiator-id.
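
    For a quick check at the OpenBoot PROM prompt, you can display the current value before consulting the detailed procedure:

    <#0> printenv scsi-initiator-id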

  11. Power off the new node.

    Refer to your hardware service manual for more information.

  12. On the surviving node that shares the multihost disks with the failed node, detach all of the disks in one disk expansion unit attached to the failed node.

    Refer to your hardware service manual for more information.
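
    The exact commands depend on your volume manager. The following is a hedged VxVM sketch with placeholder names (diskgroup, vol01-02); use vxprint to identify the plexes that reside on the affected unit. With Solstice DiskSuite, metadetach performs the equivalent operation.

    # vxprint -g diskgroup -ht
    # vxplex -g diskgroup det vol01-02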

  13. Power off the disk expansion unit.

    Refer to your hardware service manual for more information.


    Note -

    As you replace the failed node, messages similar to the following might appear on the system console. You can disregard these messages; they do not necessarily indicate a problem.


    Nov  3 17:44:00 updb10a unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000/sd@2,0 (sd17):
    Nov  3 17:44:00 updb10a unix: SCSI transport failed: reason 'incomplete': retrying command
    Nov  3 17:44:03 updb10a unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000/sd@2,0 (sd17):
    Nov  3 17:44:03 updb10a unix:   disk not responding to selection


  14. Detach the SCSI cable from the failed node and attach it to the corresponding slot on the new node.

    Refer to your hardware service manual for more information.

  15. Power on the disk expansion unit.

    Refer to your hardware service manual for more information.

  16. Reattach all of the disks you detached in Step 12.

    Refer to your hardware service manual for more information.
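
    Continuing the hedged VxVM sketch from Step 12 (placeholder names; metattach is the Solstice DiskSuite equivalent):

    # vxplex -g diskgroup att vol01 vol01-02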

  17. Wait for volume recovery to complete on all the volumes in the disk expansion unit before detaching the corresponding mirror disk expansion unit.

    Use your volume manager software to determine when volume recovery has occurred.
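
    For example, with VxVM you can watch the plex states in vxprint output until resynchronization finishes; with Solstice DiskSuite, metastat shows a "Resync in progress" line while recovery is under way:

    # vxprint -g diskgroup -ht
    # metastat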

  18. Repeat Step 12 through Step 17 for all of the remaining disk expansion units.

  19. Power on the replaced (new) node.

    Refer to your hardware service manual for more information.

  20. Boot the node and wait for the system to come up.


    <#0> boot
    

  21. Determine the Ethernet address on the replaced (new) node.


    # /usr/sbin/arp nodename
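
    The ARP cache holds an entry only for a host that has communicated with this node recently, so run the command on a cluster node that can reach the replaced node. Alternatively, as root on the replaced node itself, ifconfig reports the Ethernet address directly:

    # /usr/sbin/ifconfig -a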
    

  22. Determine the node ID of the replaced node.

    By process of elimination, you can determine which node is not in the cluster. Node IDs are numbered consecutively, starting with node 0.


    # get_node_status
    sc: included in running cluster
    node id: 0        
    membership: 0
    interconnect0: unknown
    interconnect1: unknown
    vm_type: vxvm
    vm_on_node: master
    vm: up
    db: down

  23. Inform the cluster of the replaced node's new Ethernet address by entering the following command on all of the cluster nodes.


    # scconf clustername -N node-id ethernet-address-of-host
    

    Continuing with the example in Step 22, the node ID is 1:


    # scconf clustername -N 1 ethernet-address-of-host
    

  24. Start up the replaced node.


    # scadmin startnode
    

  25. If you have a parallel database configuration, restart the database.


    Note -

    Refer to the appropriate documentation for your data services. All HA applications are automatically started with the scadmin startcluster and scadmin startnode commands.
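
    For example, mirroring the hedged Oracle Parallel Server sketch in Step 1 (your database product and startup method may differ):

    # su - oracle
    $ svrmgrl
    SVRMGR> connect internal
    SVRMGR> startup
    SVRMGR> exit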