Sun Open Telecommunications Platform 1.1 Installation and Administration Guide

Repairing a Host in the Cluster

This section provides the procedure for repairing a failed host in a clustered OTP system. If a host in a multi-host cluster fails, that host must be repaired and rejoined to the cluster. The repair process involves two steps: removing the failed host from the cluster (described below), and adding the repaired host back to the cluster as described in Adding a Host to the Existing Cluster.

Procedure: To Remove a Failed Host From the Cluster

In this procedure, the host pcl17-ipp2 is removed from a two-host cluster configuration. The hosts are pcl17-ipp1 and pcl17-ipp2. Substitute your own cluster and host information.


Note –

If the host that is being removed is the first host in the cluster, back up the system management database as described in Backing Up The OTP System Management Service Database and Configuration Files so that the database can be restored to one of the remaining cluster hosts as described in Restoring the OTP System Management Service Database and Configuration Files to Another OTP Host.


  1. Log in as root (su - root) to the active host in the cluster.

    If the cluster has more than two hosts:

    1. Log in as root to an OTP host in the cluster.

    2. Type /usr/cluster/bin/scstat -g | grep Online to determine which host in the cluster is active.

      Make note of the host on which the resource group otp-system-rg is online.

      For example:


      # /usr/cluster/bin/scstat -g | grep Online
          Group: otp-system-rg          pcl17-ipp2   Online
       Resource: otp-lhn                pcl17-ipp2   Online  Online - LogicalHostname online.
       Resource: otp-sps-hastorage-plus pcl17-ipp2   Online  Online
       Resource: otp-spsms-rs           pcl17-ipp2   Online  Online
       Resource: otp-spsra-rs           pcl17-ipp2   Online  Online

      In the above example, the active host is pcl17-ipp2.

    3. Log in as root on the OTP host on which the resource group is active.

  2. Add the cluster binaries path to your $PATH.

    # PATH=$PATH:/usr/cluster/bin

  3. Move all the resource groups and disk device groups to pcl17-ipp1.

    # scswitch -z -g otp-system-rg -h pcl17-ipp1
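
    To confirm the switchover before continuing, you can repeat the status check from Step 1; the Group: otp-system-rg line should now report pcl17-ipp1 as Online.

    # scstat -g | grep Online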

  4. Remove the host from all resource groups.

    # scrgadm -c -g otp-system-rg -y RG_system=false

    # scrgadm -c -g otp-system-rg -y Nodelist=pcl17-ipp1
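
    To verify the change, you can list the resource group properties and check the Res Group Nodelist line; it should now show only the remaining host.

    # scrgadm -pvv | grep Nodelist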


    Note –

    Nodelist must contain all the node names except the node to be removed.


  5. If the node was set up as a mediator host, remove it from the mediator list for the disk set.

    # metaset -s sps-dg -d -m pcl17-ipp2

  6. Remove the node from the disk set.

    # metaset -s sps-dg -d -h -f pcl17-ipp2
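
    To confirm that pcl17-ipp2 is no longer listed as a host or mediator for the disk set, you can display the disk set status (a standard Solaris Volume Manager check):

    # metaset -s sps-dg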

  7. Remove all the disks connected to the node except the quorum disk.

    1. Check the disks connected to the node by typing the command scconf -pvv | grep pcl17-ipp2 | grep Dev. For example:


      # scconf -pvv |grep pcl17-ipp2|grep Dev
      (dsk/d12) Device group node list:                pcl17-ipp2
      (dsk/d11) Device group node list:                pcl17-ipp2
      (dsk/d10) Device group node list:                pcl17-ipp2
      (dsk/d9) Device group node list:                 pcl17-ipp2
      (dsk/d8) Device group node list:                 pcl17-ipp2
      (dsk/d7) Device group node list:                 pcl17-ipp1, pcl17-ipp2
      (dsk/d6) Device group node list:                 pcl17-ipp1, pcl17-ipp2
      (dsk/d5) Device group node list:                 pcl17-ipp1, pcl17-ipp2
      (dsk/d1) Device group node list:                 pcl17-ipp1, pcl17-ipp2
    2. Remove the local disks.

      # scconf -c -D name=dsk/d8,localonly=false

      # scconf -c -D name=dsk/d9,localonly=false

      # scconf -c -D name=dsk/d10,localonly=false

      # scconf -c -D name=dsk/d11,localonly=false

      # scconf -c -D name=dsk/d12,localonly=false

      # scconf -r -D name=dsk/d8

      # scconf -r -D name=dsk/d9

      # scconf -r -D name=dsk/d10

      # scconf -r -D name=dsk/d11

      # scconf -r -D name=dsk/d12

    3. Determine which disk is the quorum disk.

      To determine which disk is the quorum disk, type the command scstat -q | grep "Device votes". For example:


      # scstat -q | grep "Device votes"
      Device votes: /dev/did/rdsk/d1s2 1 1 Online
      

      In this example, the quorum disk is dsk/d1.

    4. Remove the shared disks except for the quorum disk.

      # scconf -r -D name=dsk/d5,nodelist=pcl17-ipp2

      # scconf -r -D name=dsk/d6,nodelist=pcl17-ipp2

      # scconf -r -D name=dsk/d7,nodelist=pcl17-ipp2

    5. Check that only the quorum disk is in the list.

      # scconf -pvv |grep pcl17-ipp2|grep Dev


      (dsk/d1) Device group node list:                 pcl17-ipp1, pcl17-ipp2
  8. Shut down the failed node.

    On the failed node, type the following command:

    # shutdown -y -g 0 -i 0

  9. Place the failed node in maintenance state.

    # scconf -c -q node=pcl17-ipp2,maintstate
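
    Once the node is in maintenance state, its quorum vote count drops to 0. You can confirm this with the quorum status command used earlier:

    # scstat -q | grep pcl17-ipp2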

  10. Remove the private interconnect interfaces.

    1. Check the private interconnect interfaces using the following command:

      # scconf -pvv | grep pcl17-ipp2 | grep Transport


      Transport cable:   pcl17-ipp2:ce0@0   switch1@2           Enabled
      Transport cable:   pcl17-ipp2:ce2@0   switch2@2           Enabled
    2. Disable and remove the private interconnect interfaces.

      # scconf -c -m endpoint=pcl17-ipp2:ce0,state=disabled

      # scconf -c -m endpoint=pcl17-ipp2:ce2,state=disabled

      # scconf -r -m endpoint=pcl17-ipp2:ce0

      # scconf -r -m endpoint=pcl17-ipp2:ce2

    3. Remove the private interfaces of the failed node.

      # scconf -r -A name=ce0,node=pcl17-ipp2

      # scconf -r -A name=ce2,node=pcl17-ipp2
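
    After the interfaces are removed, rerunning the check from the first substep should return nothing for pcl17-ipp2:

    # scconf -pvv | grep pcl17-ipp2 | grep Transport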

  11. Remove the quorum disk from the failed node.

    • For a two-node cluster, type the following commands:

      # scconf -r -D name=dsk/d1,nodelist=pcl17-ipp2

      # scconf -c -q installmode

      # scconf -r -q globaldev=d1

      # scconf -c -q installmodeoff

    • For a cluster with three or more hosts, type the following commands:

      # scconf -r -D name=dsk/d1,nodelist=pcl17-ipp2

      # scconf -r -q globaldev=d1
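
    In either case, you can verify the quorum configuration afterward; the removed device should no longer appear under Device votes in the scstat -q output.

    # scstat -q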

  12. Add the quorum devices only to the nodes that will remain in the cluster.

    # scconf -a -q globaldev=d[n],node=node1,node=node2

    Where n is the DID number of the quorum disk, and node1 and node2 are the hosts that remain in the cluster.
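
    For example, in a cluster that originally had three hosts, assuming the quorum disk is d1 and the remaining hosts are pcl17-ipp1 and a hypothetical pcl17-ipp3, the command would be similar to the following:

    # scconf -a -q globaldev=d1,node=pcl17-ipp1,node=pcl17-ipp3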

  13. Remove the failed node from the node authentication list.

    # scconf -r -T node=pcl17-ipp2

  14. Remove the failed node from the cluster node list.

    # scconf -r -h node=pcl17-ipp2

    Perform this step while the cluster is in install mode (scconf -c -q installmode); otherwise, a warning about a possible quorum compromise is displayed.
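
    If the cluster is not already in install mode, the full sequence would be similar to the following, reusing the install mode commands shown in Step 11:

    # scconf -c -q installmode

    # scconf -r -h node=pcl17-ipp2

    # scconf -c -q installmodeoff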

  15. Use the following commands to verify whether the failed node is still in the cluster configuration.

    # scconf -pvv |grep pcl17-ipp2

    # scrgadm -pvv|grep pcl17-ipp2

    If the failed node was successfully removed, both of the above commands return to the system prompt without displaying any output.

    • If the node was not completely removed, the scconf command output is similar to the following:


      # scconf -pvv | grep pcl17-ipp2
      Cluster nodes: pcl17-ipp1 pcl17-ipp2
      Cluster node name: pcl17-ipp2
      (ipp-node70) Node ID: 1
      (ipp-node70) Node enabled: yes
      (ipp-node70) Node private hostname: clusternode1-priv
      (ipp-node70) Node quorum vote count: 0
      (ipp-node70) Node reservation key: 0x462DC27400000001
      (ipp-node70) Node transport adapters:

    If the scrgadm command output is similar to the following, then Step 4 was not executed.


    # scrgadm -pvv|grep pcl17-ipp2
    (otp-system-rg) Res Group Nodelist: pcl17-ipp1 pcl17-ipp2
  16. Change the RG_system property of the otp-system-rg resource group back to true.

    # scrgadm -c -g otp-system-rg -y RG_system=true
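
    If you want to confirm the property change, you can list the properties of the resource group with scrgadm -pvv and check the group's system property; the exact property label in the output can vary by release.

    # scrgadm -pvv | grep otp-system-rg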

Next Steps

Add the host to the cluster as described in Adding a Host to the Existing Cluster.