N1 Provisioning Server 3.1, Blades Edition, System Administration Guide

Handling Failed Farm Devices

N1 Provisioning Server software actively and automatically monitors the availability of all devices within a farm to enable automated failover of logical server farm devices. No special configuration is necessary to enable this monitoring. When the software detects that a device is no longer available, the software can take the necessary automated steps to replace the physical device with an identical device from the pool of unused devices. In replacing the failed physical device, the system duplicates configuration and logically reattaches storage to the new device so that the new device can take over the role of the failed device.

Whenever a failure is detected, the system starts the failover process by placing a failover “job” request into the N1 Provisioning Server queuing mechanism. This same queuing mechanism is used to process all farm activation and farm update requests. You can view the current set of tasks and their status by running the request -l command.

Two options are available for specifying failover behavior when the monitoring system detects a device failure:

- Automatic failover: the N1 Provisioning Server places a replacefaileddevice request in the queue and processes it automatically. You are not required to intervene.
- Manual failover: the N1 Provisioning Server places a replacefaileddevice request in the request queue as a blocked request for the replacement of the failed device. You must unblock the request before it can be processed and the device replaced. To unblock the request, run the command request -u request-ID.

Note –

A blocked failover request blocks all subsequent requests for that farm, including farm update requests made through the Control Center. The system does not process any other changes to the farm until the failover request is either unblocked or deleted.

The automatic option enables immediate processing of the failover when the request enters the system's request queue. The manual option causes the failover request to be blocked in the system's request queue. As a result, the blocked failover request and any subsequent requests for the failed device's farm are not processed until you either unblock the request to allow the failover or delete the request to abort the failover.

Automatic failover can be costly if the server has not really failed but was only temporarily unavailable. In this case, the device might be replaced unnecessarily.

To configure the behavior of the failover mechanism, edit the com.terraspring.cs.services.DeviceStatus.blockReqFailedDevice property in the /etc/opt/terraspring/tspr.properties file.

If the com.terraspring.cs.services.DeviceStatus.blockReqFailedDevice property is set to false, any detected device failures will not result in blocked failover requests. If this property is set to true, any detected device failures will result in blocked failover requests. The property is set to true by default, which is the recommended setting.
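As a quick check of which mode is in effect, the property can be read back from the file. The following is a minimal sketch, not part of the product: the function name failover_mode is hypothetical, and the sketch assumes that the file uses key=value lines with the lowercase literals true and false. Any value other than false, including a missing property, is reported as manual because true is the stated default.

```shell
# Property that controls whether failover requests are blocked.
PROP=com.terraspring.cs.services.DeviceStatus.blockReqFailedDevice

# Hypothetical helper: print "manual" or "automatic" for a properties file.
# Assumes "key=value" lines; this file layout is an assumption.
failover_mode() {
    file=${1:-/etc/opt/terraspring/tspr.properties}
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # Use the last uncommented assignment of the property, if any.
    value=$(awk -F= -v key="$PROP" '
        { k = $1; gsub(/[ \t]/, "", k) }
        k == key { v = $2; gsub(/[ \t\r]/, "", v); val = v }
        END { print val }' "$file")
    case $value in
        false) echo automatic ;;   # failover requests are not blocked
        *)     echo manual ;;      # true or unset: requests are blocked
    esac
}
```

Calling failover_mode with no argument reads /etc/opt/terraspring/tspr.properties; passing a path as the first argument checks a copy of the file instead.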

Troubleshooting Farm Device Failure

Upon the detection of a farm device failure, the monitoring software sends a device DOWN event to the segment manager of the N1 Provisioning Server that manages the farm.

Note –

The monitoring software sends a device DOWN event only for active farms.

When the segment manager receives a DOWN event from the monitoring software, the segment manager takes the following actions:

A blocked replacePhysicalDevices request is sent to the farm manager of the farm that owns that device.
A critical error message is logged into the log file to alert operations. The critical message that the segment manager generates contains the failed device's name and the corresponding farm ID.

To Respond to Farm Device Failure

Perform the following procedure on a control plane server if you receive a critical farm device failure message.

Note –

If automatic failover is enabled, that is, if the com.terraspring.cs.services.DeviceStatus.blockReqFailedDevice property is set to false, no action needs to be taken.

Steps

List the current requests for the farm:
request -lf farm-ID

Review the list and obtain the requestID of the blocked replacePhysicalDevices request generated by the segment manager for the farm.

You can identify the requestID by finding the replacePhysicalDevices request whose state is listed as QUEUED_BLOCKED. The second argument of the replacePhysicalDevices request specifies the IDs of the devices that failed.
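On a busy farm the listing can be long, so it can help to filter it. The following is a rough sketch only: the function name blocked_failover_requests is hypothetical, and the sketch assumes that each request appears on a single line that contains both the request type and its state. The actual layout of the request -lf output might differ.

```shell
# Hypothetical filter: keep only blocked failover requests from a listing on stdin.
# Assumes one request per line, with type and state on the same line.
blocked_failover_requests() {
    grep 'replacePhysicalDevices' | grep 'QUEUED_BLOCKED'
}
```

For a farm with ID 5, for example, the filter would be used as request -lf 5 | blocked_failover_requests.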

Verify that the physical device has actually failed and that the error is not spurious. Temporary network failures can cause spurious errors. See Handling a Failed Control Plane Server for details.
- If the device has not failed, or you do not want to replace the device, delete the request by typing request -d request-ID.
- If only one device failed, start the device replacement by unblocking the replacePhysicalDevices request. To do so, type request -u request-ID.
- If multiple devices failed, you will see multiple replacePhysicalDevices requests. In this case:
  1. Identify the device ID of each of the failed devices.
  2. Run the replacedevice command with the device IDs of all the devices that you want to replace:
    
    replacedevice farm-ID failed-device-ID failed-device-ID failed-device-ID
    
    For example:
    
    replacedevice 5 19001 37001 2003

After replacing failed devices, delete the replacePhysicalDevices requests by typing:
request -d request-ID
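The branches of this procedure can be summarized as a small dry-run helper that prints the commands to type without running them. This is an illustrative sketch, not a product tool: the function name plan_response, its argument layout, and the request IDs 101 through 103 are hypothetical. The request and replacedevice commands are used only in the forms shown in the steps above.

```shell
# Hypothetical dry-run helper: print the commands for a blocked failover.
# Usage: plan_response FARM_ID yes|no REQUEST_ID:DEVICE_ID [REQUEST_ID:DEVICE_ID ...]
# The second argument states whether the devices were verified as really failed.
plan_response() {
    farm_id=$1
    really_failed=$2
    shift 2
    if [ "$really_failed" != "yes" ]; then
        # Spurious error: delete each blocked request to abort the failover.
        for pair in "$@"; do
            echo "request -d ${pair%%:*}"
        done
        return 0
    fi
    if [ "$#" -eq 1 ]; then
        # One failed device: unblock its request to start the replacement.
        echo "request -u ${1%%:*}"
        return 0
    fi
    # Several failed devices: replace them together, then delete the requests.
    devices=""
    for pair in "$@"; do
        devices="$devices ${pair#*:}"
    done
    echo "replacedevice $farm_id$devices"
    for pair in "$@"; do
        echo "request -d ${pair%%:*}"
    done
}
```

With the device IDs from the example above, plan_response 5 yes 101:19001 102:37001 103:2003 prints replacedevice 5 19001 37001 2003 followed by one request -d line for each of the three request IDs.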