This section gives information about debugging farm activation failures. If a farm fails to complete a request successfully, an error state is set on the farm. To determine the error state, run the command /opt/terraspring/sbin/farm -l; the second-to-last column in the command output indicates the error state of the farm.
Two error states are set for informational purposes only:
Error state 0 indicates that the farm is okay and ready to accept new requests.
Error state 1000 indicates that the farm is in a transition state. A farm request is currently being processed for this farm and the farm is moving from one state to another.
You might encounter the following problems during farm activation or when a farm transitions from one state to another. To reissue a farm request to a farm in an error state, add the -f option to the farm command, for example, farm -af farm-id. This option clears the farm error so that the farm request can be processed.
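As a quick way to spot farms in an error state, the second-to-last column of the farm -l output can be filtered with awk. This is a minimal sketch, assuming whitespace-separated output with the farm ID in the first column; the values 0 and 1000 are the informational states described above.

```shell
# farms_in_error: read `farm -l` output on stdin and print the IDs of
# farms whose error state (second-to-last column) is neither 0 (OK)
# nor 1000 (transition in progress).
# Assumes whitespace-separated output with the farm ID in column 1;
# adjust the column indexes if your output format differs.
farms_in_error() {
  awk '$(NF-1) != 0 && $(NF-1) != 1000 { print $1 }'
}

# Typical use (requires the Terraspring tools):
#   /opt/terraspring/sbin/farm -l | farms_in_error
# For each reported farm, clear the error and reissue the request:
#   /opt/terraspring/sbin/farm -af farm-id
```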
Problem: When you run the /opt/terraspring/sbin/farm -l command, the farm state displays as NEW/NEW_CONFIG/20. This value indicates that your farm could not be allocated.
Solution: Farm allocation fails if the farm requires more resources of a particular type than are available. To identify the resources that are preventing successful allocation of your farm, run the following command: /opt/terraspring/sbin/rsck farm-id. Retry farm activation after rsck reports that enough resources are available to allocate your farm.
Problem: During the dispatch phase of activation, all devices in the farm are powered on for the final boot. If the farm fails to dispatch, an error message appears in the Control Center window. Details about the error are in the debug log file /var/adm/tspr.debug.
Solution: One of the following two causes might apply:
The device did not report to the monitoring agent before the end of the timeout period.
The resource pool server might not be booting properly. Watch the resource pool server boot on the console connection. If a device does not boot properly, focus on networking problems. If the device boots successfully but the monitoring agent does not recognize that the device booted, verify that the monitoring agent process on the resource pool server started successfully. Run the following command on the resource pool server: /usr/ucb/ps auxwww | grep java. You should see a monitoring agent Java process running in the background. To debug the monitoring agent process, examine the log files at /opt/terraspring/log/tspr.log.
A problem could exist when trying to power on the device. Refer to the debugging method discussed in the second problem in Farm Wiring Issues and Solutions.
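The monitoring-agent check in the first cause above can be scripted. The function below is a sketch that reads ps output on stdin and reports whether a Java process is present, filtering out the grep process itself. Whether the Java process it finds is actually the monitoring agent is an assumption you should confirm in /opt/terraspring/log/tspr.log.

```shell
# agent_running: read `ps auxwww` output on stdin; succeed (exit 0) if a
# java process appears, fail (exit 1) otherwise. The monitoring agent
# runs as a Java process, per the text above.
agent_running() {
  grep -v grep | grep -q java
}

# On the resource pool server:
#   /usr/ucb/ps auxwww | agent_running && echo "agent up" || echo "agent down"
```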
Problem: If the farm fails to update, an error message appears in the Control Center window. Details about the error are in the debug log file /var/adm/tspr.debug.
Solution: The farm update procedure is very similar to the activation procedure. The farm first transitions from the ACTIVE state into the UPDATE state; during this transition, the farm attempts to allocate newly requested resources, and failures should be debugged in the same manner as allocation problems. Once the farm has reached the UPDATE state, it transitions through the WIRED and DISPATCHED states back to the ACTIVE state; refer to the two preceding problems to debug failures during these transitions.
Problem: In farm STANDBY state, scrubbing of removed disks fails.
Solution: Setting up a device for scrubbing follows the same procedure as preparing a device for a disk copy. Refer to the third problem in Farm Wiring Issues and Solutions to debug this problem.
Problem: In farm STANDBY state, you are unable to move a device into a VLAN.
Solution: Refer to the first problem in Farm Wiring Issues and Solutions to debug this problem.
Problem: In farm STANDBY state, you are unable to change the power state of a device.
Solution: Refer to the debugging method discussed in the second problem in Farm Wiring Issues and Solutions.
Problem: Snapshot of disk images fails. The completion of the snapshot request leaves the farm in an error state.
Solution: A disk snapshot is prepared in the same way as a disk copy to a server disk. Follow the instructions in the third problem in Farm Wiring Issues and Solutions to debug this problem.
In addition, the snapshot process attempts to reserve disk space for the snapshot image on the image server before taking the snapshot. If the image server is nearly at capacity, the snapshot process might fail. In this case, remove old images from the server or back them up to a separate storage device. Use the image command to delete old images: /opt/terraspring/sbin/image -d image-id.
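Before taking a snapshot, you can check how full the image server's filesystem is. This sketch assumes Solaris-style df -k output with the use percentage in column 5; the 90% threshold and the image filesystem path are assumptions, not product defaults.

```shell
# near_capacity: read `df -k` output on stdin; succeed (exit 0) if any
# filesystem is at or above 90% use. Assumes the use percentage is in
# column 5 (Solaris-style df output); the 90% threshold is arbitrary.
near_capacity() {
  awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= 90) full = 1 }
       END { exit !full }'
}

# On the image server (the image filesystem path is site-specific):
#   df -k /path/to/images | near_capacity &&
#       echo "image server nearly full: delete old images with image -d image-id"
```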
Problem: Farm deactivation failed. The completion of the deactivation request leaves the farm in an error state.
Solution: Farm deactivation follows the same steps as moving a farm into STANDBY state, except that no snapshot images are created for any removed devices. Follow the farm STANDBY debugging advice when troubleshooting farms that fail to deactivate.
Problem: A replacement device could not be allocated for device failover support. The completion of the “replace failed device” request leaves the farm in an error state.
Solution: To replace a failed device, a backup device must be available. If no more devices are available in the free pool, replacing a failed device fails due to allocation problems. Use the following command to verify that a replacement device is available: /opt/terraspring/sbin/device -LFr device-id.
Problem: The replacement device could not be provisioned for device failover support. The completion of the “replace failed device” request leaves the farm in an error state.
Solution: A replacement device is provisioned in the same way newly allocated devices are provisioned during initial farm activation and farm updates. Refer to Farm Wiring Issues and Solutions to debug these problems.
Problem: When you run the /opt/terraspring/sbin/request -lf farm-id command, requests show as queued but are not processed.
Solution: Verify the following items:
Check the state of the farm to confirm that the farm is not in an error state. Type the following command:
# /opt/terraspring/sbin/farm -l farm-id
If a farm is in an error state, no farm requests will be processed. If you know that the farm is in a reasonable state to process the queued farm requests, perform the following steps:
Clear the error state using the following command: /opt/terraspring/sbin/farm -pf farm-id. This command pings the farm and clears its error state.
Verify that the segment manager (sm) is running and responding by typing the following command: /opt/terraspring/sbin/aps.
Verify that the next request to be processed for this farm is not a request marked QUEUED_BLOCKED.
Because requests are handled in a first-in, first-out fashion, a farm request that you issue might not be processed because a QUEUED_BLOCKED request must be processed first. You have two options to remove this blockage: either unblock the blocked request, or delete it. Type the following command to unblock the request:
request -u request-id
Type the following command to delete the request:
request -d request-id
Verify that your farm is in a state other than DEACTIVATED.
A farm in DEACTIVATED state cannot be transitioned into any other state. Once a farm is deactivated, the only other farm operation allowed is farm deletion.
Verify that the Segment Manager process is alive.
If none of the above scenarios apply, the Segment Manager process might no longer be responding. To verify that all processes are running properly, use the following command: /opt/terraspring/sbin/aps. If the Segment Manager process is not running, use the /etc/rc3.d/S99sm -start script to start it.
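Scanning the request queue for the blocking entry can be partly scripted. This sketch reads request -lf output on stdin and prints the IDs of QUEUED_BLOCKED requests; it assumes the request ID is in the first column of a whitespace-separated listing, which you should confirm against your output.

```shell
# blocked_requests: read `request -lf farm-id` output on stdin and print
# the IDs of requests marked QUEUED_BLOCKED. Assumes the request ID is
# in column 1; adjust if your output format differs.
blocked_requests() {
  awk '/QUEUED_BLOCKED/ { print $1 }'
}

# Typical use (requires the Terraspring tools):
#   /opt/terraspring/sbin/request -lf farm-id | blocked_requests
# Then unblock (request -u request-id) or delete (request -d request-id)
# each reported request.
```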
Problem: You are unable to deactivate a farm that contains a faulty blade. Both farm deactivation and replace device requests fail.
Solution: Type the following command to mark the faulty blade as FAILED: device -sB blade-id. Then, reissue the deactivation request.