N1 Provisioning Server 3.1, Blades Edition, Control Center Management Guide


Chapter 6 Troubleshooting

You issue farm management requests by using either the Control Center or the command-line interface. Examples of these requests include activating farms, updating farms, deactivating farms, and so forth. As these requests are processed, a farm transitions from state to state. However, if a farm request fails at some point, the farm is left in an error state. This chapter describes how to diagnose these errors and strategies for correcting farms that are in the error state.

Note –

See Chapter 7, Troubleshooting in N1 Provisioning Server 3.1, Blades Edition, System Administration Guide for more detailed information regarding troubleshooting an I-Fabric.

Overview

As with any complex system, when farms transition from state to state, errors can occur. You must be able to remedy these errors quickly. Use the following general strategy to resolve an error state:

Determine that the farm request failed.
Diagnose the problem by determining the error state.
Fix the problem. For example, replace a failed server, free farm resources, or resolve networking issues. Then run the farm -af farm_ID command to activate the farm.
Alternatively, bypass the problem. For example, delete the request and return to the prior condition of the farm, or delete the farm and start over.

Every device in a logical server farm is continuously monitored for availability. The monitoring facility raises an alert when a device fails. The N1 Provisioning Server software then automatically brings up another identically configured physical device to replace the failed device. In these cases, failover is expected behavior and no error message is generated.

Note –

Most error states can be diagnosed and resolved by the administrator. However, in some rare cases, error states must be resolved by a Sun Service provider.

At a high level, failures fall into the following types: resource layer failures (device and networking failures, configuration errors, or insufficient available resources), software configuration errors, and software or control plane errors. The following list describes potential failure points in farm activation:

The action cannot be completed because there are not enough free resources
Provisionable equipment servers (PES) configuration issues
Network problems
Wiring problems

Other points of failure exist. Given the variety of devices and systems involved, there are a number of failure points to investigate. However, you know you have a problem if any of the following situations occurs:

The Control Center shows a failed status in the Message section of the Farm Request dialog of the Administration screen.
The Control Center shows a failed request in the Farm Details section of the Main and Editor screens.
When you run the farm -l farm_ID command, the farm ERROR is a nonzero number, other than 1000, and the farm is not in the desired state.

External and Internal Farm States

Farm lifecycle management is one of the major functions provided by the Control Center software. As a farm goes through different stages during its lifecycle, this stage information is represented as farm state in the control plane. To determine the error state, you must be familiar with external and internal farm states.

Two kinds of state information exist:

External state—displayed in the Control Center
Internal state—accessed by using the command-line interface

External States

External states are represented as strings. The following list shows the valid farm external state values:

NEW–Farm is just created
ACTIVE–Farm is active and ready for the customer
INACTIVE–Farm is inactive
STANDBY–Farm is in standby mode

Figure 6–1 illustrates the external farm states and state transitions:

Figure 6–1 External Farm States and Transitions

Note –

These external states do not map exactly to the farm lifecycle states displayed in the Control Center. For example, there is no equivalent Design state in external states, and there is no equivalent New state in the Control Center.

Internal States

The internal farm state, as maintained by the SP, is visible to you only through the SP command-line interface. You must understand these internal states because they help you monitor the progress of a farm through the various stages of automated activation, updates, and decommissioning, and help you troubleshoot problems. Internal states are represented as integers. The valid internal state values are described in the following table:

Table 6–1 Valid Internal State Values


Internal State	Internal State Value	External State	Meaning
CREATED	0	New	The farm has just been created but not submitted for activation.
NEW_CONFIG	10	New	Same as `CREATED` in terms of farm resource changes, but the SP has now taken over the farm.
ALLOCATED	20	New	Resources are allocated to the farm in the database.
WIRED	30	New	Physical devices are connected according to the farm topology.
DISPATCHED	40	New	An SP server owns the farm. Domain Name System (DNS), Dynamic Host Configuration Protocol (DHCP), and Network Interface Card (NIC) are set up for the farm. Farm monitoring is also registered or in the process of registering at this stage if applicable. This action is part of both the initial activation process and the farm update process.
ACTIVE	50	Active	The farm is active and running.
IDLE	60	Active	Reserved for Sun Microsystems.
STANDBY	70	Standby	The farm is on standby. IP addresses are still associated with the farm.
SHUTDOWN	90	Active (pending standby or inactive)	The farm devices are shut down.
UNWIRED	100	Active (pending standby or inactive)	Physical devices are detached from the farm.
DEACTIVATED	110	Inactive	The farm is deactivated and all resources are freed.
UPDATED	120	Active	The farm has been updated.
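
When you script around farm states, it can help to keep Table 6–1 as a lookup. The following is a minimal Python sketch of the table. The names InternalState, EXTERNAL_STATE, and external_state are illustrative assumptions and are not part of the N1 Provisioning Server software. Only the state names, values, and external states are taken from the table.

```python
from enum import IntEnum


class InternalState(IntEnum):
    """Internal farm states and their integer values, copied from Table 6-1."""
    CREATED = 0
    NEW_CONFIG = 10
    ALLOCATED = 20
    WIRED = 30
    DISPATCHED = 40
    ACTIVE = 50
    IDLE = 60
    STANDBY = 70
    SHUTDOWN = 90
    UNWIRED = 100
    DEACTIVATED = 110
    UPDATED = 120


# External state shown for each internal state, as listed in Table 6-1.
EXTERNAL_STATE = {
    InternalState.CREATED: "New",
    InternalState.NEW_CONFIG: "New",
    InternalState.ALLOCATED: "New",
    InternalState.WIRED: "New",
    InternalState.DISPATCHED: "New",
    InternalState.ACTIVE: "Active",
    InternalState.IDLE: "Active",
    InternalState.STANDBY: "Standby",
    InternalState.SHUTDOWN: "Active (pending standby or inactive)",
    InternalState.UNWIRED: "Active (pending standby or inactive)",
    InternalState.DEACTIVATED: "Inactive",
    InternalState.UPDATED: "Active",
}


def external_state(value: int) -> str:
    """Return the external state for an internal state value.

    Raises ValueError if the value is not listed in Table 6-1.
    """
    return EXTERNAL_STATE[InternalState(value)]
```

For example, external_state(30) returns "New", because a farm in the WIRED state is still reported externally as New.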

Use the farm -l command to list information about farms. Used without arguments, farm -l lists information about all farms. Used with a farm ID (a unique string assigned when the farm is created), farm -l farm_ID lists information for that farm only. The output looks like the following:

FARM_ID  FARM_NAME  CUSTOMER   STATE   ISTATE  ERROR  OWNER
123      testx      Customerx  ACTIVE  ACTIVE  0      SM:cp1

As shown in this example, both the farm's external and internal states are listed. Also, the internal state has been translated from a numerical value to a text string.
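
If you capture this output in a script, each data row can be split into named fields. The following Python sketch assumes that the seven columns are separated by whitespace, that no field contains embedded spaces, and that the OWNER column is always present. None of these assumptions is confirmed by this guide. The names FarmRow and parse_farm_row are hypothetical.

```python
from typing import NamedTuple


class FarmRow(NamedTuple):
    """One data row of farm -l output (column names taken from the header)."""
    farm_id: str
    farm_name: str
    customer: str
    state: str    # external state, for example ACTIVE
    istate: str   # internal state, already translated to text
    error: int
    owner: str


def parse_farm_row(line: str) -> FarmRow:
    """Split a whitespace-separated data row into its seven columns.

    Assumption: fields never contain spaces and all seven are present.
    """
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, found {len(fields)}: {line!r}")
    farm_id, name, customer, state, istate, error, owner = fields
    return FarmRow(farm_id, name, customer, state, istate, int(error), owner)


# The sample row shown in this section.
row = parse_farm_row("123      testx      Customerx  ACTIVE  ACTIVE  0      SM:cp1")
```

The resulting row.error and row.istate values are the fields that the rest of this chapter uses to decide whether a farm operation succeeded.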

Farm Requests

A request is the main communication mechanism used by the N1 Provisioning Server. Usually, a request starts from the Control Center and subsequent requests are generated within the control plane to assist with the completion of the Control Center request. Alternatively, you can use the command-line interface to directly send requests to the ID.

Typically, the Control Center initiates a farm operation by sending a request to the control plane. This farm request initially goes to the Segment Manager, which in turn sends the request to the Farm Manager to delegate the request.

There is not a one-to-one relationship from the Control Center request to the control plane. One farm request from the Control Center is actually completed by a series of requests destined for different request servers. The actual number of requests required to complete one Control Center request varies, depending on the implementation.

When a request is queued by the Control Center or CLI (client), the request is either processed by the control plane (server) or cancelled.

The request lifecycle starts in either the QUEUED_BLOCKED or the QUEUED state and ends in any of the following states: CANCELLED, TIMEDOUT, DONE, INTERNAL_ERROR, or DELETED.

Table 6–2 lists the status of the request lifecycle:

Table 6–2 Status of Request Lifecycle


Request State	Description of State
QUEUED or QUEUED_BLOCKED	Initial status of any request
INPROGRESS	The request is being served by the RequestHandler on the server side
DONE	The request has completed on the server side
INTERNAL_ERROR	An error occurred while the request was being processed on the server side
CANCELLED	The request is cancelled, usually by the requester
DELETED	The request is deleted
TIMEDOUT	The request is not finished by the specified time
FAILED	The request had an error while being processed.
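
The request lifecycle can be sketched as a small set of state groups. The following Python sketch uses the state names as spelled in Table 6–2. The end states are the five that the lifecycle description names. FAILED is listed in the table but is not named there as an end state, so the sketch leaves it unclassified rather than guessing. The names INITIAL_STATES, END_STATES, is_end_state, and is_known_state are illustrative assumptions.

```python
# State names as spelled in Table 6-2.
INITIAL_STATES = frozenset({"QUEUED", "QUEUED_BLOCKED"})

# States in which the request lifecycle ends, per the lifecycle description.
END_STATES = frozenset(
    {"CANCELLED", "TIMEDOUT", "DONE", "INTERNAL_ERROR", "DELETED"}
)

# Every state that appears in Table 6-2. FAILED is in the table, but this
# guide does not say whether it ends the lifecycle, so it is only "known".
ALL_STATES = INITIAL_STATES | END_STATES | {"INPROGRESS", "FAILED"}


def is_known_state(state: str) -> bool:
    """Return True if the state name appears in Table 6-2."""
    return state in ALL_STATES


def is_end_state(state: str) -> bool:
    """Return True if the request lifecycle ends in this state."""
    return state in END_STATES
```

A polling script could, for example, stop waiting once is_end_state returns True for a request, and then report whether the final state was DONE.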

Farm Activation Problems

Note –

For a detailed description of farm operation failure scenarios, refer to Troubleshooting Problems with Farm Operations in N1 Provisioning Server 3.1, Blades Edition, System Administration Guide.

Determining Whether a Farm Operation Succeeded or Failed

When a farm operation succeeds:
- The Control Center shows a completed status in the Message section of the Farm Request dialog of the Administration screen.
- The farm -l farm_ID command shows an ERROR of 0 and the farm state reflects the desired state for that operation.
When a farm operation fails:
- The Control Center shows a failed status in the Message section of the Farm Request dialog of the Administration screen.
- The farm ERROR is a nonzero number (other than 1000) and the farm is not in the desired state. An ERROR of 1000 is not an error; it means that a farm operation is in progress.
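
The rules above can be sketched as a small check on the ERROR value and the farm state. This Python sketch is illustrative only. The function names and the returned strings are assumptions, and the desired state is whatever state you expect the operation to produce.

```python
IN_PROGRESS = 1000  # ERROR value that means an operation is still running


def classify_farm_error(error: int) -> str:
    """Classify the ERROR column reported by farm -l farm_ID."""
    if error == 0:
        return "ok"
    if error == IN_PROGRESS:
        return "in progress"
    return "failed"


def operation_succeeded(error: int, state: str, desired_state: str) -> bool:
    """An operation succeeded only if ERROR is 0 and the farm state
    matches the state that the operation was meant to reach."""
    return error == 0 and state == desired_state
```

For example, classify_farm_error(1000) returns "in progress", so a script should keep polling rather than report a failure.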

Diagnosing the Failure

Run the farm -Lt farm_ID command to extract messages related to the specified farm from the log files.
If the farm has been assigned to an SP (as shown by the farm -l farm_ID command), look at the /var/adm/messages file and the /var/adm/tspr.debug file on the owning SP for any error messages for the farm.
Check the /var/adm/messages file and the /var/adm/tspr.debug file on the SP running the Master Segment Manager for any critical error messages for the farm.

The following example shows how a message appears in the log:

Oct 30 00:16:47 sp4 java[506]: [ID 289794 user.info] 
    TSPR [sev=okay] [apps=770034] TCPEventHandler:dispatch...
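
When you search the log files with a script, the bracketed key=value tags in an entry such as the one above can be extracted with a regular expression. The following Python sketch is based only on this single sample entry. The meaning of each tag and the set of possible sev values are not described here, so the sketch returns the tags as plain strings without interpreting them. The names TAG_PATTERN and parse_tspr_tags are hypothetical.

```python
import re

# Matches bracketed key=value tags such as [sev=okay] or [apps=770034].
# The syslog tag [ID 289794 user.info] has no "=" and is not matched.
TAG_PATTERN = re.compile(r"\[(\w+)=([^\]]+)\]")


def parse_tspr_tags(entry: str) -> dict:
    """Return the key=value tags found in one log entry as a dict of strings."""
    return dict(TAG_PATTERN.findall(entry))


# The sample entry from this section, joined onto one line.
sample = (
    "Oct 30 00:16:47 sp4 java[506]: [ID 289794 user.info] "
    "TSPR [sev=okay] [apps=770034] TCPEventHandler:dispatch..."
)
tags = parse_tspr_tags(sample)

# In this sample, the fourth whitespace-separated field is the host (sp4).
host = sample.split()[3]
```

A filter built on this sketch could, for example, keep only entries whose sev tag differs from okay, assuming that other severity values appear in the same position.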

Note –

See Chapter 6, Error Messages in N1 Provisioning Server 3.1, Blades Edition, System Administration Guide.

Use the following tools to help pinpoint the problem:

Monitor the farm activation process through the Control Center Farm Requests dialog of the Administration screen. During the activation process, a message reports when a device is added successfully to the farm. See if you can identify a device that failed.
Use the terminal server, or the serial port of the device if the terminal server is not available, as a console to connect to a specific device and obtain diagnostic information. Until the farm device is activated, the only way to connect to the device is through the console connection.

Resolving the Failure

Re-run the Request

After you have determined the cause of the error and taken any necessary corrective action, such as replacing a failed server, freeing farm resources, or resolving networking issues, you can re-run the farm operation. Use the -f option to clear the error. For example, if a farm activation failed, run the farm -af farm_ID command.

Inadequate Resources

If you have determined that the cause of the error is inadequate resources, and you cannot free resources to fix this problem, you can do the following steps:

1. Run the farm -pf farm_ID command to clear the error state. This command clears the internal state. However, this change is not reflected in the Control Center.
2. Open the farm in the Control Center Editor, and select the last “good” farm configuration from Farm Details on the left-hand side of the screen.
3. Make any changes necessary to this version of the farm in the Editor and click Commit.

Abandon Request and Start Over

You might decide to abandon the farm and deactivate it by using the farm -df farm_ID command. This command frees the farm resources and brings the farm to the deactivated state. You can then delete the farm by using the farm -D farm_ID command. You can then save the farm under a different name by using the Save As option in the File menu, and activate the saved farm.

Note –
The Control Center reflects the current farm status because it is automatically synchronized with the control plane.
