As with any software, you might have occasional difficulties with N1 Provisioning Server software or an Infrastructure Fabric (I-Fabric). This chapter helps you identify and resolve such problems by describing potential problems, their likely causes, and practical solutions. It features useful troubleshooting tables and flowcharts. The following major topics are covered in this chapter:
Failover of resource pool servers within an I-Fabric is automated, and requires you to perform few or no manual tasks after a failover and startup of a new device. Any tasks to be performed after a failover are described within the sections dedicated to specific I-Fabric devices within this chapter.
You can set the debug level in the tspr.properties configuration file according to your preference of debugging detail. The higher the debug level, the more logging information you receive. A debug level setting of 9 is recommended. You can view debug information in the tspr.debug log file. The following is an example tspr.properties file with the debug level set to 9. All entries in the tspr.properties file must have the following format:
package-name.class-name.attribute=value

com.terraspring.core.sys.GridOS.debuglevel=9
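For example, the following sketch checks the current debug level and then follows the debug log while you reproduce a problem. It assumes the default file locations described later in this chapter, /etc/opt/terraspring/tspr.properties and /var/adm/tspr.debug.

# Check the current debug level setting
grep debuglevel /etc/opt/terraspring/tspr.properties

# Follow the debug log while reproducing the problem
tail -f /var/adm/tspr.debug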
N1 Provisioning Server software actively and automatically monitors the availability of all devices within a farm to enable automated failover of logical server farm devices. No special configuration is necessary to enable this monitoring. When the software detects that a device is no longer available, the software can take the necessary automated steps to replace the physical device with an identical device from the pool of unused devices. In replacing the failed physical device, the system duplicates configuration and logically reattaches storage to the new device so that the new device can take over the role of the failed device.
Whenever a failure is detected, the system starts the failover process by placing a failover “job” request into the N1 Provisioning Server queuing mechanism. This same queuing mechanism is used to process all farm activation and farm update requests. You can view the current set of tasks and their status by running the request -l command.
There are two options for specifying failover behavior when the monitoring system detects a device failure:
Automatic failover consists of the N1 Provisioning Server placing and automatically processing a replacefaileddevice request in the queue. You are not required to intervene.
Manual failover consists of the N1 Provisioning Server placing a replacefaileddevice request as a blocked request in the request queue for the replacement of the failed device. You must unblock the request before it can be processed and the device replaced. Run the command request -u request-ID to do this.
A blocked failover request blocks all subsequent requests for that farm, including farm update requests made through the Control Center. The system does not process any other changes to the farm until the failover request is either unblocked or deleted.
The automatic option enables immediate processing of the failover when the request enters the system's request queue. The manual option causes the failover request to be blocked in the system's request queue. As a result, the blocked failover request and any subsequent request for the failed device's farm are not processed until you either unblock the request to allow the failover or you delete the request to abort the failover.
Automatic failover can be costly if the server has not really failed but was only temporarily unavailable. In this case, the device might be replaced unnecessarily.
To configure the behavior of the failover mechanism, alter the com.terraspring.cs.services.DeviceStatus.blockReqFailedDevice property in the /etc/opt/terraspring/tspr.properties file.
If the com.terraspring.cs.services.DeviceStatus.blockReqFailedDevice property is set to false, any detected device failures will not result in blocked failover requests. If this property is set to true, any detected device failures will result in blocked failover requests. The property is set to true by default, which is the recommended setting.
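For reference, the corresponding entry in /etc/opt/terraspring/tspr.properties looks like the following sketch, which uses the entry format shown at the start of this chapter. Change the value to false only if you want detected failures to be processed automatically.

# Recommended: keep failover requests blocked until an operator reviews them
com.terraspring.cs.services.DeviceStatus.blockReqFailedDevice=true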
Upon the detection of a farm device failure, the monitoring software sends a device DOWN event to the segment manager of the N1 Provisioning Server that manages the farm.
The monitoring software sends a device DOWN event only for active farms.
When the segment manager receives a DOWN event from the monitoring software, the segment manager performs the following procedures:
A blocked replacePhysicalDevices request is sent to the farm manager of the farm that owns that device.
A critical error message is logged into the log file to alert operations. The critical message that the segment manager generates contains the failed device's name and the corresponding farm ID.
Perform the following procedure on a control plane server if you receive a critical farm device failure message.
If automatic failover is configured (that is, the blockReqFailedDevice property is set to false), no action needs to be taken.
List the current requests for the farm:
request -lf farm-ID
Review the list and obtain the requestID of the blocked replacePhysicalDevices request generated by the segment manager for the farm.
You can identify the requestID by the replacePhysicalDevices request whose state is listed as QUEUED_BLOCKED. The second argument of the replacePhysicalDevices request specifies the IDs of the devices that failed.
Verify that the physical device has actually failed and that it is not a spurious error. See Handling a Failed Control Plane Server for details. Temporary network failures can cause spurious errors.
If the device has not failed, or you do not want to replace the device, delete the request by typing request -d request-ID.
If only one device failed, start the device replacement by unblocking the replacePhysicalDevices request, typing request -u request-ID.
If multiple devices failed, you will see many replacePhysicalDevices requests.
After replacing failed devices, delete the replacePhysicalDevices requests by typing:
request -d request-ID
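The following command sequence is a sketch of this procedure, using only the request options documented above. Replace farm-ID and request-ID with the values from your own request queue.

# List current requests for the farm and note the blocked replacePhysicalDevices request
request -lf farm-ID

# If the device really failed and you want it replaced, unblock the request
request -u request-ID

# If the error was spurious, or you do not want to replace the device, delete the request
request -d request-ID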
If the control plane server fails, see the server documentation for details on replacing the server. If the Oracle database on the control plane database (CPDB) server has been damaged, contact Sun Service at http://sun.com/service/contacting/index.html for assistance.
See Chapter 5, Backing Up and Restoring Components for a description of the files that you need to restore. See the N1 Provisioning Server 3.1, Blades Edition, Installation Guide for details on how to reinstall the software that runs on the control plane, such as the Control Center software and the N1 Provisioning Server software.
This section describes how to troubleshoot the failure of software processes on the control plane.
If the monitoring manager fails, attempt to restart it by running the /opt/terraspring/sbin/mmd start command. If the monitoring manager does not restart, see the N1 Provisioning Server 3.1, Blades Edition, Installation Guide for details on how to reinstall the monitoring manager software and restore the most recent backup.
If the Control Center fails, attempt to restart it by running the /opt/terraspring/sunone/bin/appserv start command. If the Control Center does not restart, see the N1 Provisioning Server 3.1, Blades Edition, Installation Guide for details on how to reinstall the Control Center software and restore the most recent backup.
Almost all farm operations are carried out asynchronously. That is, a request (message) is queued for the N1 Provisioning Server, which processes requests in the order in which they are received. Use the request command to view all pending and in-progress requests in the N1 Provisioning Server.
Any critical error occurring during the execution of a farm operation is reported by the Monitoring Manager through the monitoring system. Critical messages are also logged to the /var/adm/messages file on the N1 Provisioning Server. The critical error causes the current farm operation to exit and moves the farm into an error state. When the farm is in an error state, requests in the queue cannot be processed until the error is cleared manually. After the problem causing the error is resolved, you must reset the farm error state so that the stopped operation can be restarted.
In addition to the /var/adm/messages file, various debug messages are logged in the file /var/adm/tspr.debug. The amount of information in this file is determined by the debug level set in the /etc/opt/terraspring/tspr.properties file. The default debug level is 9. When a critical error is encountered, the additional information in the /var/adm/tspr.debug file is useful in determining the cause of the error.
To diagnose a problem with the N1 Provisioning Server, follow the steps illustrated in Figure 7–1.
All messages logged into the /var/adm/messages and the /var/adm/tspr.debug files are in a standard format. The following example is a typical message:
Nov 12 17:22:57 sp3 java[23033]: [ID 398540 user.info] TSPR [sev=crit] [fmid=1211] [MSG6718] FM Activate: Ready timeout expired.
Table 7–1 Message Log File Format
| Message Element | Description |
| --- | --- |
| Nov 12 17:22:57 | The date and timestamp of the message. |
| sp3 | The name of the control plane server. |
| java[23033] | The unique ID of the Solaris process. |
| [ID 398540 | The unique ID of the system log. |
| user.info] | The type of system log entry. |
| TSPR | The type of message. In the example, the message is an N1 Provisioning Server message. |
| [sev=crit] | Indicates the severity of the message. In the example, crit means this error message is critical. Other severity types are warn, okay, and dbug. |
| [fmid=1211] | Indicates the application ID that generated the message. In the example, the application is the farm manager for farm 1211. The segment manager application ID is in the format [smid=27003]. The command-line tools application ID is in the format [apps=26765]. |
| [MSG6718] | The message ID assigned by the N1 Provisioning Server. |
| FM Activate: Ready timeout expired | The message itself. |
When an error or exception occurs during execution, the name of the exception is logged as part of the message along with a description of the exception.
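Because all messages follow the format shown in Table 7–1, you can filter the system log for critical N1 Provisioning Server messages with a command similar to the following sketch:

# Show critical N1 Provisioning Server messages in the system log
grep "TSPR \[sev=crit\]" /var/adm/messages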
The farm operations initiated using the Control Center or the farm command are queued for the segment manager. The segment manager processes the request, and dispatches the request to the farm manager, which actually does the work. When investigating problems, ensure that your request has not stalled along the way to the farm manager.
If there are no critical errors or messages relating to requests that are being processed in the tspr.debug file of the segment manager, check the request queue to verify that your request has been processed.
The request command is used to view the request queue:
To list all current requests in the N1 Provisioning Server, run the command request -l.
To list all current requests for a farm, run the command request -lf farm-ID.
To list all completed requests for a farm, run the command request -lcf farm-ID.
To list all requests for a farm, run the command request -laf farm-ID.
For more information on the request command and its usage, see the request man page.
If your request is still in the queue in the QUEUED state, the request has not been processed yet. If no other requests are ahead of this request for your farm, the request queue might have stalled. The next section describes how to resolve this problem.
A request can sometimes fail to be processed by its intended server and might stay in the queue unattended. This section explains how to diagnose and solve this problem.
If starting the segment manager does not process all requests on the queue, some requests might be blocked. Blocked requests require manual intervention prior to their processing. When you have reviewed the blocked requests, you can unblock them to process them or delete them. You can run the request -u request-ID command to unblock or the request -d request-ID command to delete blocked requests.
After the blocked requests are cleared, the requests in the queue are processed in the order in which they were received.
Sometimes the request is not blocked but just queued, and the farm manager is still not processing the request. In this case, ping the farm. Issuing the ping command to the farm activates the farm manager queue handler to process requests. This command especially applies to requests for farms that are in an error state. Run the following command to ping the farm to activate the request queue handler:
farm -p farm-ID
If the farm was in an error state (that is, a previous operation ended in error), reset the error to move the queue:
farm -pf farm-ID
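As a sketch of this recovery sequence, where farm-ID identifies the stalled farm:

# Activate the farm manager's request queue handler
farm -p farm-ID

# If a previous operation ended in error, also reset the farm error state
farm -pf farm-ID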
You can look for failed power or switch operation requests by checking the tspr.debug log for ExpectTimedOut messages. Power operation requests include powerUp, ispowerUp, and powerDown operations. Switch operation requests include addPort, removeVlan, and removeAll operations.
A failure can occur if the N1 Provisioning Server is unable to access a device or the N1 Provisioning Server receives unexpected output from a device. If a failure occurs, check the following items:
Verify that the IP address for the device is set correctly and that communication from the N1 Provisioning Server to the device exists.
Run the device -lv device-ID command from the N1 Provisioning Server to verify that all login names and passwords for this device are correct.
Ensure that the firmware is the supported version and that any changes made to the default settings are acceptable.
If you need to debug further to resolve the issue, enable the Expect log by setting the following properties:
com.terraspring.drivers.util.expect.Expect.print=true
com.terraspring.drivers.util.expect.Expect.output=log-name-path
Run the tail -f log-name-path command and reissue your request to view the operation.
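The following sketch enables the Expect log and follows it while you reissue the failing request. It assumes that these properties belong in /etc/opt/terraspring/tspr.properties like the other entries shown earlier in this chapter, and the log path /tmp/expect.log is an arbitrary example.

# Enable Expect logging; /tmp/expect.log is an example path, choose your own
echo "com.terraspring.drivers.util.expect.Expect.print=true" >> /etc/opt/terraspring/tspr.properties
echo "com.terraspring.drivers.util.expect.Expect.output=/tmp/expect.log" >> /etc/opt/terraspring/tspr.properties

# Watch the Expect log while you reissue the request
tail -f /tmp/expect.log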
The following table describes N1 Provisioning Server troubleshooting issues related to farm operations. This list is not exhaustive. The message log file indicates whether your problem relates to farm operations.
Table 7–2 Troubleshooting Farm Operations
Every farm has an error status code that is associated with the farm to indicate whether the farm is currently in an abnormal state.
The error status code of 0 represents a healthy state.
The error status code of 1000 means that the farm manager is processing a request.
An error status code other than 0 or 1000 means that the farm has an error.
During request processing, the farm's internal state changes whenever a transition completes successfully. If the farm fails to transition from one internal state to another, the farm's internal state is not changed and the farm's error status is set. The value of the error status code is the value of the internal state that the farm failed to reach.
For example, if the farm failed the transition from state ALLOCATED (20) to state WIRED (30), the farm is still left with the internal state ALLOCATED (20) and the farm error status code is set to 30 to represent the failed state WIRED. Just before the code is set to 30, it is 1000 to indicate that the request is in progress.
Whenever a farm error occurs, a critical error message is generated in the system log file /var/adm/messages.
The farm manager will not process any further farm requests until the error condition is changed and the error status code is cleared to 0.
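Because farm manager messages carry an [fmid=farm-ID] tag (see Table 7–1), you can review the critical messages for a specific farm with a command like the following sketch, shown here for farm 1211 from the earlier example:

# Show critical messages logged by the farm manager for farm 1211
grep "\[sev=crit\] \[fmid=1211\]" /var/adm/messages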
The following table contains descriptions of troubleshooting scenarios involving images. This list is not exhaustive. The message log file indicates whether your problem relates to images.
Table 7–3 Troubleshooting Image Problems
| Problem | Possible Cause | Solution |
| --- | --- | --- |
| While allocating the resources for a farm, the N1 Provisioning Server generates a message indicating that it cannot find the named image. | The image ID in the FML file is incorrect. | Contact Sun Service at http://sun.com/service/contacting/index.html for assistance. Synchronize the Control Center. Also check the images list on the CPDB by running the command image -ls, and ensure that this list matches what displays in the Control Center. |
| A farm update request failed after making a snapshot. | The server was not completely backed up and running when the update request was made. | When you issue a farm update request after taking a snapshot, make sure that the server is completely up and running before issuing the request. Otherwise, the farm update fails. To verify that the server is up and running, run the command ping server-IP-address. |
| Farm activation fails after image creation. | Not enough external subnets. | The farm that is automatically created by the image creation process includes an external subnet. Define a new external subnet using the command subnet -cx -m netmask-size network-IP. |
| Farm standby request failed. | Not enough space on the image server for storing the image. | Go to the directory where all images are stored and issue the command df -k to determine how much space is available for storing images. If the capacity is above 85 percent, add more storage space for images. |
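For example, the following sketch checks the capacity of the image directory. The path /export/images is a placeholder for the directory where your images are actually stored.

# Check free space in the image directory (replace /export/images with your image directory)
cd /export/images
df -k .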
The image server manages images. The image server can either be a stand-alone NFS server or it can run on a control plane server. See Chapter 3, Managing Software Images for details.
If your image requires Gigabit Ethernet support for the Solaris operating environment, ensure that the appropriate drivers are loaded each time the system boots. You can initialize the list of managed interfaces by running the following commands:
devfsadm
ifconfig -a plumb
ifconfig -a | grep flags | cut -d: -f1 | grep -v lo0 > /etc/opt/terraspring/managed_interfaces
Now edit the contents of /etc/opt/terraspring/managed_interfaces by commenting out any interfaces that should not be managed by N1 Provisioning Server software.
The instance assigned to any Gigabit Ethernet card must always be 0. For example, ce0, skge0, alt0.
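The resulting file simply lists one interface name per line. The following is an illustrative sketch only: the interface names are examples, and the use of a leading # to comment out an unmanaged interface is an assumption rather than documented syntax.

# Example /etc/opt/terraspring/managed_interfaces (illustrative interface names)
ce0
ce1
# bge0  - example of an interface commented out so it is not managed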
This section describes some common monitoring problems. It shows how to diagnose these problems and suggests possible corrective actions. See Chapter 4, Monitoring and Messaging for an introduction to monitoring concepts.
The most common problems are as follows:
An UP message is not received from one or more resource pool servers.
Monitors do not show colors on the Element Monitor window of the Control Center.
A farm does not activate even though an UP message was sent.
Frequent UP and DOWN messages are received.
Many of these problems have interconnected root causes. The most common causes are:
Network, DNS, or DHCP issues.
Monitoring processes are not running on the control plane server.
Agent processes are not running on the resource pool servers.
This section describes how to diagnose these symptoms. Corrective Actions for Monitoring describes the corrective action to take to resolve these problems.
You can confirm an UP message problem on the control plane server by checking whether the following conditions exist:
In the /var/adm/tspr.debug file, a message is listed similar to the following:
"Still waiting for 1 device(s) in 2879974 ms" |
The farm activation shows ERROR 50, as shown in the following example:
FARM_ID FARM_NAME CUSTOMER STATE ISTATE     ERROR
123     Farm_Name Customer NEW   DISPATCHED 50
The following figure shows the steps needed to diagnose and resolve this problem.
The preceding illustration shows the following troubleshooting sequence:
Check for a network, DNS, or DHCP problem. See Network, DNS, or DHCP Problems for details on how to do this.
Check that the monitoring processes are running on the control plane server. See Monitoring Processes Are Not Running on the Control Plane for details. Follow the instructions in this section to restart the processes.
Check that the agent processes are running on the resource pool server. See Agent Processes Are Not Running on a Resource Pool Server for details. If the agent processes are not running, follow the instructions in this section to restart them.
Farm-specific monitors might not appear on the Control Center. This condition could be caused by one of the following problems:
Agent processes are not running on the servers.
The mapping between the gw-mon-vip and the IP address of the Control Center server software is not set in the /etc/hosts file on the control plane server.
The listener on the Control Center is not running. See Control Plane Server-to-Control Center Messages Not Working for information on how to verify this condition.
Figure 7–2 shows the sequence of steps for you to follow to diagnose and resolve the above error condition. See Control Plane Server-to-Control Center Messages Not Working for details on how to resolve these problems.
Even though the UP message was sent by the monitoring system, the segment manager might not be running. In this case, restart the segment manager. See Check for Blocked Requests for details on this procedure.
Frequent UP and DOWN messages for a server might be received as a result of incorrect configuration of the interfaces on the N1 Provisioning Server.
Clear the duplicate Ethernet interfaces on the control plane server by running the clearNicInterface command. See the man pages for details on using this command.
Several symptoms are common to a number of problems. This section describes how to diagnose these symptoms.
Perform the checks in the following table for network, DNS, or DHCP problems:
Table 7–4 Checking for Errors
| Error Check | Error Confirmation |
| --- | --- |
| Verify that all the resource pool servers can receive ping signals by running the following command on the control plane server: /opt/terraspring/sbin/mls -lf farm-ID. Note – This command lists all the servers in the farm that can receive ping signals. | Any of the servers are listed as ADDED. |
| Verify that all the resource pool servers are reachable by performing a telnet to each of the servers. | Any of the servers are not reachable with telnet. |
Sometimes a server can receive ping signals but is not reachable with telnet when the server is in single-user mode. To resolve this problem, connect to the console port and boot into multiuser mode.
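As a sketch, the checks from Table 7–4 can be run as follows, where farm-ID and server-IP are placeholders for your own values:

# List the servers in the farm that answer ping; investigate any listed as ADDED
/opt/terraspring/sbin/mls -lf farm-ID

# Verify that each resource pool server is reachable with telnet
telnet server-IP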
To determine whether the monitoring processes are running on the control plane server, run the following command:
/usr/ucb/ps -auxww | grep MM
If the monitoring process is running, you will see an output similar to this example:
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
root 14540 0.2 1.14 485 620 608? S Mar 05 18:32 /bin/../java/bin/../bin/sparc/native_threads/java -Dsun.net.inetaddr.ttl=0 com.terraspring.mon.MM
root 9529 0.1 0.1 976 672 pts/2 S 11:49:40 0:00 grep MM
If the monitoring process is not running, you will see an output similar to this example:
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
root 9565 0.1 0.1 976 672 pts/2 S 11:50:28 0:00 grep MM
See Restart the Monitoring Processes on the Control Plane Server for details on how to restart the process.
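A minimal check-and-restart sketch, using the mmd commands described in Corrective Actions for Monitoring later in this chapter:

# Check whether the monitoring manager (MM) process is running
/usr/ucb/ps -auxww | grep MM

# If it is not running, stop any leftover processes and restart the monitoring manager
/opt/terraspring/sbin/mmd stop
/opt/terraspring/sbin/mmd start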
Agent processes might not be running on a resource pool server. You can verify this condition by one of two methods:
On the control plane server run the following command:
/opt/terraspring/sbin/mls -a IP-address-of-host
To be able to use this command, you must know the IP address of the server.
On the server on which the agent you want to verify is running, run the following command:
/usr/ucb/ps -auxww | grep tspragt
If the agent processes are running, you will see output similar to the following example:
root 7652 0.1 0.1 976 656 pts/1 S 11:37:30 0:00 grep tspragt
root 321 0.1 0.73167213816 ? S 16:26:37 0:10 /usr/bin/../java/bin/../bin/sparc/native_threads/java -Dsun.net.inetaddr.ttl=0 com.terraspring.mon.client.tspragt start 10.42.14.2
If the agent processes are not running, you will see output similar to the following example:
root 7709 0.1 0.1 976 656 pts/1 S 11:39:54 0:00 grep tspragt
See Restart the Agent Processes on a Resource Pool Server for details on how to restart the process.
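As a sketch, where server-IP is the IP address of the resource pool server:

# From the control plane server, check the agent state for the server
/opt/terraspring/sbin/mls -a server-IP

# On the resource pool server itself, check for the agent process
/usr/ucb/ps -auxww | grep tspragt

# If the agent is not running, restart it on the resource pool server
/etc/init.d/N1PSagt start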
Messages between the control plane server and the Control Center might not work for a number of reasons. The most common reasons include:
The mapping between gw-mon-vip and the IP address of the Control Center server software is not set in the /etc/hosts file on the control plane server. To check this condition, verify that a suitable entry is present.
For example:
10.5.131.19 gw-mon-vip
The listener on the Control Center server software is not running. You can verify this condition by running finger test@gw-mon-vip on the control plane server. The expected sample output is similar to the following examples:
[gw-mon-vip]
or
[hostname]
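A sketch of both checks, run on the control plane server:

# Verify that gw-mon-vip is mapped to the Control Center IP address
grep gw-mon-vip /etc/hosts

# Verify that the listener on the Control Center is running
finger test@gw-mon-vip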
This section describes a number of corrective actions that you can take to resolve a monitoring problem.
To restart the monitoring processes, run the following commands on the control plane server:
/opt/terraspring/sbin/mmd stop
This command ensures that all relevant processes are stopped. Restart the monitoring processes with the following command:
/opt/terraspring/sbin/mmd start
If the control agent process terminated on the server, start the process on the server with the following command:
/etc/init.d/N1PSagt start
To verify that the processes have restarted, run the following command from the control plane server:
/opt/terraspring/sbin/mls -a server-IP-address
If the agent is running you will see output similar to the following:
FARM_ID IP_ADDRESS TYPE   STATE DB_STATE SINCE
134     10.9.0.35  Server UP    UP       Feb 05 14:15:32
Because the database state is updated only every five minutes, the real-time state (STATE) and the database state (DB_STATE) can differ temporarily. For example, an agent that is down in real time might still be marked as UP in the database. In that case, you will see output similar to the following:
FARM_ID IP_ADDRESS TYPE   STATE DB_STATE SINCE
134     10.9.0.35  Server DOWN  UP       Feb 10 14:20:33
The main problem you might encounter when working with the Control Center relates to the CPDB connection: either the connection between the Control Center and the CPDB is down, or parameters are incorrectly configured in the Control Center database.
To determine which of these two problems you might be encountering:
Log in as an administrator in the Control Center using the browser.
In the configuration tools section, select I-Fabrics to bring up a list of I-Fabrics.
On the I-Fabrics List, select the I-Fabric whose connection needs to be checked.
Click OK when the following dialog displays: Improper Configuration of these Properties may disrupt System Operation. Proceed with Caution.
In the Property Name Property Value dialog, take note of the IP address of the database server (defined in the primary field) and its port.
If no IP address is listed, telnet to the hostname of the URL on which the Control Center runs.
Try the telnet command (assuming 10.0.0.18 and 1521 are the IP address and port obtained at step 4):
telnet 10.0.0.18 1521
Trying 10.0.0.18...
If you see the following response:
Connected to cpdb
Escape character is "^]"
the connection is okay. Press Ctrl-] and then type quit to close the telnet session.
If you see the following response:
telnet: Unable to connect to remote host: Connection refused
the IP is okay, but the CPDB database server is not running (or it is running on a different port). Contact your DBA or consult the database manufacturer's documentation for information on how to verify the listener port.
If the telnet attempt does not respond or times out, check basic connectivity to the database server host:
ping 10.0.0.18
If you see the following result:
no answer from 10.0.0.18
no communication exists between the Control Center host and the database server host. This error might be caused by a routing problem or by the database server host being down.
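As a quick recap, the two connectivity checks in this procedure can be run as follows; 10.0.0.18 and 1521 are the example address and port used above.

# Verify basic IP connectivity to the database server host
ping 10.0.0.18

# Verify that the CPDB listener accepts connections on its port
telnet 10.0.0.18 1521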
The connection from the Control Center to the CPDB might at times be too slow, causing a timeout. This condition might be caused by either the database or the network being slow due to an excessive workload. To prevent a timeout when the connection is slow, increase the value in the timeout field on the I-Fabric Properties screen in the Control Center. The default is 30 seconds.