N1 Provisioning Server 3.1, Blades Edition, Control Center Management Guide

Chapter 6 Troubleshooting

You issue farm management requests by using either the Control Center or the command-line interface. Examples of these requests include activating farms, updating farms, deactivating farms, and so forth. As these requests are processed, a farm transitions from state to state. However, if a farm request fails at some point, the farm is left in an error state. This chapter describes how to diagnose these errors and strategies for correcting farms that are in the error state.


Note –

See Chapter 7, Troubleshooting in N1 Provisioning Server 3.1, Blades Edition, System Administration Guide for more detailed information regarding troubleshooting an I-Fabric.


This chapter includes the following topics:

Overview

As with any complex system, when farms transition from state to state, errors can occur. You must be able to remedy these errors quickly. Use the following general strategy to resolve an error state:

  1. Determine that the farm request failed.

  2. Diagnose the problem by determining the error state.

  3. Fix the problem. For example, replace a failed server, free farm resources, or resolve a networking issue. Then run the farm -af command to activate the farm.

  4. Alternatively, bypass the problem. For example, delete the request and return the farm to its prior condition, or delete the farm and start over.

Every device in a logical server farm is continuously monitored for availability. When a device fails, the monitoring facility raises an alert, and the N1 Provisioning Server software automatically brings up another identically configured physical device to replace the failed device. In these cases, failover is expected behavior and no error message is generated.


Note –

Most error states can be diagnosed and resolved by the administrator. However, in some rare cases, error states must be resolved by a Sun Service provider.


At a high level, failures fall into three types: resource layer failures (device and networking failures, configuration errors, or insufficient available resources), software configuration errors, and software or control plane errors. The following list describes potential failure points in farm activation:

Other points of failure exist. Given the variety of devices and systems involved, there are a number of failure points to investigate. However, you know you have a problem if the following situations occur:

External and Internal Farm States

Farm lifecycle management is one of the major functions provided by the Control Center software. As a farm goes through different stages during its lifecycle, this stage information is represented as farm state in the control plane. To determine the error state, you must be familiar with external and internal farm states.

Two kinds of state information exist: external states and internal states.

External States

External states are represented as strings. The following list shows the valid farm external state values:

Figure 6–1 illustrates the external farm states and state transitions:

Figure 6–1 External Farm States and Transitions



Note –

These external states do not map exactly to the farm lifecycle states displayed in the Control Center. For example, there is no equivalent Design state in external states, and there is no equivalent New state in the Control Center.


Internal States

The internal farm state, as maintained by the SP, is visible to you only through the SP command-line interface. You must understand these internal states because they help you monitor the progress of a farm through the various stages of automated activation, updates, and decommissioning, and help you troubleshoot problems. Internal states are represented as integers. The valid internal state values are described in the following table:

Table 6–1 Valid Internal State Values

Internal State  Internal State Value  External State                        Meaning
CREATED                               New                                   The farm has just been created but not submitted for activation.
NEW_CONFIG      10                    New                                   Same as CREATED in terms of farm resource changes, but the SP has now taken over the farm.
ALLOCATED       20                    New                                   Resources are allocated to the farm in the database.
WIRED           30                    New                                   Physical devices are connected according to the farm topology.
DISPATCHED      40                    New                                   An SP server owns the farm. Domain Name System (DNS), Dynamic Host Configuration Protocol (DHCP), and Network Interface Card (NIC) are set up for the farm. Farm monitoring is also registered or in the process of registering at this stage if applicable. This action is part of both the initial activation process and the farm update process.
ACTIVE          50                    Active                                The farm is active and running.
IDLE            60                    Active                                Reserved for Sun Microsystems.
STANDBY         70                    Standby                               The farm is on standby. IP addresses are still associated with the farm.
SHUTDOWN        90                    Active (pending standby or inactive)  The farm devices are shut down.
UNWIRED         100                   Active (pending standby or inactive)  Physical devices are detached from the farm.
DEACTIVATED     110                   Inactive                              The farm is deactivated and all resources are freed.
UPDATED         120                   Active                                The farm has been updated.
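When reading internal state values in scripts or logs, a small lookup table can save repeated trips to Table 6–1. The following Python sketch copies the numeric values from the table. The helper name describe_internal_state is hypothetical and is not part of the N1 Provisioning Server software. CREATED is omitted because the table lists no numeric value for it.

```python
# Internal state values copied from Table 6-1.
# CREATED is left out because the table shows no numeric value for it.
INTERNAL_STATES = {
    10: ("NEW_CONFIG", "New"),
    20: ("ALLOCATED", "New"),
    30: ("WIRED", "New"),
    40: ("DISPATCHED", "New"),
    50: ("ACTIVE", "Active"),
    60: ("IDLE", "Active"),
    70: ("STANDBY", "Standby"),
    90: ("SHUTDOWN", "Active (pending standby or inactive)"),
    100: ("UNWIRED", "Active (pending standby or inactive)"),
    110: ("DEACTIVATED", "Inactive"),
    120: ("UPDATED", "Active"),
}


def describe_internal_state(value):
    """Hypothetical helper: map an integer to (internal name, external state)."""
    try:
        return INTERNAL_STATES[value]
    except KeyError:
        raise ValueError(f"unknown internal state value: {value}") from None


if __name__ == "__main__":
    name, external = describe_internal_state(40)
    print(f"{name} is reported externally as {external}")
```

Note that several internal states share one external state, so the external state alone cannot tell you how far an activation has progressed.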

Use the farm -l command to list information about farms. Used with no argument, farm -l lists information about all farms. Used with a farm ID (a unique string assigned when the farm is created), farm -l farm_ID lists information for that farm only. The output looks like the following:

FARM_ID  FARM_NAME  CUSTOMER   STATE   ISTATE  ERROR  OWNER
123      testx      Customerx  ACTIVE  ACTIVE  0      SM:cp1

As shown in this example, both the farm's external and internal states are listed. Also, the internal state has been translated from a numerical value to a text string.
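If you capture the farm listing in a script, you can split it into fields to compare the STATE and ISTATE columns. The following Python sketch parses text shaped like the sample output. It rests on several assumptions that this guide does not confirm: the column headings appear on a single line, no field contains whitespace, and an ERROR value of 0 means that no error is recorded. The function names are hypothetical.

```python
# Sample text modeled on the farm listing shown in this chapter.
SAMPLE = """\
FARM_ID  FARM_NAME  CUSTOMER   STATE   ISTATE  ERROR  OWNER
123      testx      Customerx  ACTIVE  ACTIVE  0      SM:cp1
"""


def parse_farm_list(text):
    """Return one dict per farm row, keyed by the column headings."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip rows that do not match the heading layout
        rows.append(dict(zip(header, fields)))
    return rows


def farms_with_errors(rows):
    """Assumption: an ERROR column other than "0" marks a farm in error."""
    return [row for row in rows if row["ERROR"] != "0"]


if __name__ == "__main__":
    for farm in parse_farm_list(SAMPLE):
        print(farm["FARM_ID"], farm["STATE"], farm["ISTATE"], farm["ERROR"])
```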

Farm Requests

A request is the main communication mechanism used by the N1 Provisioning Server. Usually, a request starts from the Control Center and subsequent requests are generated within the control plane to assist with the completion of the Control Center request. Alternatively, you can use the command-line interface to directly send requests to the ID.

Typically, the Control Center initiates a farm operation by sending a request to the control plane. This farm request initially goes to the Segment Manager, which in turn sends the request to the Farm Manager to delegate the request.

Control Center requests do not map one-to-one to control plane requests. One farm request from the Control Center is actually completed by a series of requests destined for different request servers. The actual number of requests required to complete one Control Center request varies, depending on the implementation.

When a request is queued by the Control Center or CLI (client), the request is either processed by the control plane (server) or cancelled.

The request lifecycle starts in either the QUEUED_BLOCKED or the QUEUED state and ends in one of the following states: CANCELLED, TIMEDOUT, DONE, INTERNAL_ERROR, or DELETED.

Table 6–2 lists the status of the request lifecycle:

Table 6–2 Status of Request Lifecycle

Request State             Description of State
QUEUED or QUEUED_BLOCKED  Initial status of any request
INPROGRESS                The request is served by the RequestHandler at the server side
DONE                      The request is done at the server side
INTERNAL_ERROR            The request is in error during the processing at the server side
CANCELLED                 The request is cancelled, usually by the requester
DELETED                   The request is deleted
TIMEDOUT                  The request is not finished by the specified time
FAILED                    The request had an error while being processed.
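When you review many requests at once, it can help to sort the request states into a few coarse groups. The following Python sketch does so using the state names from Table 6–2. The group labels (waiting, running, succeeded, failed, withdrawn) and the function name are hypothetical and chosen only for this illustration. Treating FAILED and TIMEDOUT as failures alongside INTERNAL_ERROR is an assumption based on the table descriptions.

```python
# State names are taken from Table 6-2; the group labels are illustrative only.
REQUEST_STATE_GROUPS = {
    "QUEUED": "waiting",
    "QUEUED_BLOCKED": "waiting",
    "INPROGRESS": "running",
    "DONE": "succeeded",
    "INTERNAL_ERROR": "failed",
    "TIMEDOUT": "failed",   # assumption: a timeout counts as a failure
    "FAILED": "failed",     # assumption: grouped with the other error states
    "CANCELLED": "withdrawn",
    "DELETED": "withdrawn",
}


def classify_request_state(state):
    """Hypothetical helper: return the coarse group for a request state."""
    try:
        return REQUEST_STATE_GROUPS[state.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown request state: {state!r}") from None


def needs_attention(state):
    """True when the request ended in one of the states grouped as failed."""
    return classify_request_state(state) == "failed"


if __name__ == "__main__":
    for state in ("QUEUED_BLOCKED", "INPROGRESS", "DONE", "TIMEDOUT"):
        print(state, "->", classify_request_state(state))
```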

Farm Activation Problems


Note –

For a detailed description of farm operation failure scenarios, refer to Troubleshooting Problems with Farm Operations in N1 Provisioning Server 3.1, Blades Edition, System Administration Guide.


Determining Whether a Farm Operation Succeeded or Failed

Diagnosing the Failure

The following example shows how a message appears in the log:

Oct 30 00:16:47 sp4 java[506]: [ID 289794 user.info] 
    TSPR [sev=okay] [apps=770034] TCPEventHandler:dispatch...
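To search a large log for messages of interest, you can split each message into its parts. The following Python sketch parses the sample message shown above. The layout is inferred from that single sample, so other messages might not match. The field names (timestamp, host, process, and so on) are labels chosen for this sketch, and the meaning of the bracketed sev and apps values is not described here. The trailing ellipsis is kept exactly as it appears in the sample.

```python
import re

# The sample message from this chapter, including its line break.
SAMPLE = (
    "Oct 30 00:16:47 sp4 java[506]: [ID 289794 user.info] \n"
    "    TSPR [sev=okay] [apps=770034] TCPEventHandler:dispatch..."
)

# Layout inferred from the sample only; the group names are illustrative.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<process>[^\[\s]+)\[(?P<pid>\d+)\]: "
    r"\[ID (?P<msgid>\d+) (?P<facility>[\w.]+)\]\s+"
    r"(?P<rest>.*)$"
)
TAG_PATTERN = re.compile(r"\[(\w+)=([^\]]*)\]")


def parse_log_message(raw):
    """Return a dict of fields, or None if the text does not match."""
    normalized = " ".join(raw.split())  # fold the wrapped line into one
    match = LOG_PATTERN.match(normalized)
    if match is None:
        return None
    fields = match.groupdict()
    # Collect bracketed key=value pairs such as [sev=okay].
    fields["tags"] = dict(TAG_PATTERN.findall(fields["rest"]))
    return fields


if __name__ == "__main__":
    parsed = parse_log_message(SAMPLE)
    print(parsed["host"], parsed["process"], parsed["tags"])
```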

Note –

See Chapter 6, Error Messages in N1 Provisioning Server 3.1, Blades Edition, System Administration Guide.


Use the following tools to help pinpoint the problem:

Resolving the Failure