N1 Provisioning Server 3.1, Blades Edition, Troubleshooting Guide

Chapter 2 Solving Installation and Configuration Problems

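This chapter provides information about issues you might encounter when installing and configuring the N1 Provisioning Server 3.1, Blades Edition product. The issues are separated into the following categories:

- Installation Issues and Solutions
- Installation Validation Issues and Solutions
- Farm Activation and Transition Issues and Solutions
- Farm Wiring Issues and Solutions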

For each issue in this document, a description of the symptom that you will see is provided along with a recommended solution. Although the solution to the problem should enable you to install the N1 Provisioning Server software, you should still evaluate whether you encountered the problem due to a bug in the software. If you encounter what you believe is a bug in the software, inform your customer care center representative.

Installation Issues and Solutions

This section describes known installation and initial configuration issues and how you can respond to them. For information about problems that are related to validating the installation, see Installation Validation Issues and Solutions.

Problem:

No VLAN-capable gigabit Ethernet card was found during the installation procedure. If the installer does not find any VLAN-capable card or appropriate drivers on the system, the following message is displayed:

Could not detect any known gigacards. Check if your driver packages are installed.

Solution:

This problem might occur due to several reasons:

A supported VLAN-capable gigabit Ethernet card has not been installed on the N1 Provisioning Server system. The N1 Provisioning Server software requires that one of the following two supported network cards is installed with the specified driver:
- SysKonnect SK98xx: 64-bit driver version 6.02
- GigaSwift: Solaris 8 drivers require patch 112119-04.
Make sure that the supported version of the driver is installed on the N1 Provisioning Server system.
The driver for the card might not be installed properly. Make sure that the correct version of the network card driver is installed. Also, verify that the driver packages for the GigaSwift card have been installed to enable VLAN capabilities.
The card is not installed as instance 0. The N1 Provisioning Server software expects the VLAN-capable gigabit Ethernet card to be installed as instance 0. To verify this setting, look in the /etc/path_to_inst file for ce or skge. If the network card has been assigned a different instance number, you will need to reconfigure the server. To reconfigure the server for a different instance number, follow these steps:
1. Identify all instances that match the same device. For example, for the entry "/pci@8,700000/pci@3/network@0" 0 "ce", type the following command:
  # grep "/pci@8,700000/pci@3/network@0" /etc/path_to_inst
2. Remove all entries that were identified in the previous step from the /etc/path_to_inst file.
3. Use the following command to reboot the N1 Provisioning Server system:
  # reboot -- -r
4. After the reboot, verify that the ce or skge instance has been set to 0.
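
The instance check in the preceding steps can be scripted. The following is a minimal sketch in POSIX shell and is not part of the product. The function name list_driver_instances is hypothetical, and the sketch assumes the usual three-field layout of the path_to_inst file ("physical-path" instance "driver").

```shell
# Hypothetical helper (not part of the N1 Provisioning Server software).
# Prints "instance physical-path" for each entry bound to the given driver.
# Usage: list_driver_instances DRIVER [FILE]
list_driver_instances() {
    ldi_driver=$1
    ldi_file=${2:-/etc/path_to_inst}
    # Fields: "physical-path" instance "driver". Lines starting with # are comments.
    awk '$1 !~ /^#/ && $3 == ("\"" drv "\"") { print $2, $1 }' \
        drv="$ldi_driver" "$ldi_file"
}
```

For example, list_driver_instances ce prints one line for each ce entry. If the VLAN-capable card does not appear with instance number 0, the reconfiguration steps apply.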

Problem:

During the installation process, you see a message that not all required patches have been installed.

Solution:

The required operating system patches need to be installed before installation can proceed. The N1 Provisioning Server 3.1, Blades Edition release allows you to install the Solaris 8 Recommended patch cluster rather than installing each of the required patches listed in the installation guide individually. For more information about installing required patches, see Installing Required Patches in N1 Provisioning Server 3.1, Blades Edition, Installation Guide.

Problem:

After applying operating system patches to the N1 Provisioning Server system, named/dhcp does not start up properly. One indication of this problem is that farm requests start failing.

Solution:

The applied patches probably overwrote one of the N1-modified system tools, such as BIND or DHCP. Try the following solutions:

Identify the patches that overwrote the N1-modified version of the tools. Currently, N1 only replaces BIND-related and DHCP-related files. Once you have identified the list of patches that patched these system tools, remove those patches using the patchrm utility.
Reinstall the following two packages:
- TSPRdns
- TSPRdhcp
The package files can be found on the product media in the Solaris/Terraspring/3.1/Product path.

Problem:

Installation of the Sun Java™ System Application Server failed. The installation log file (/var/opt/terraspring/install/run/install.log) contains the specific error message.

Solution:

This problem is most likely due to a missing patch on the N1 Provisioning Server system. Make sure that you have all required patches installed on the N1 Provisioning Server system.

Problem:

When entering the Blade system chassis IP address and switch network configuration information, the chassis cannot be found. The following error message appears:

Cannot ping System Controller IP Value ip-address

Solution:

This problem is due to a networking issue. If you have not done so, you need to configure the Blade system chassis system controller (SC) to be IP-accessible from the N1 Provisioning Server. For more information, see Assigning IP Addresses in the Control Plane in N1 Provisioning Server 3.1, Blades Edition, Installation Guide.

Problem:

Configuration of the chassis failed. The installation log file (/var/opt/terraspring/install/run/install.log) contains the specific error message. The installer displays the following message, along with a series of options from which you can choose:

Installation may have failed due to incorrect user input or some other correctable error.

Solution:

When the installer fails to configure a chassis SC or switch, several items need to be verified:

Verify that the chassis SC is responding properly and does not hang. Execute showplatform -v from the SC prompt and verify that the output is printed to the screen without a significant delay (for example, a delay of more than one minute).
Verify that the login parameters provided to the installer match the user name and password configured on the chassis and switch.
Verify that the chassis SC is network accessible using the IP address provided to the installer.

Problem:

Chassis discovery fails. The installation log file (/var/opt/terraspring/install/run/install.log) contains the specific error message. The installer displays the following message, along with a series of options from which you can choose:

Installation may have failed due to incorrect user input or some other correctable error.

Solution:

Verify that the shelf controller's CLI prompt generation property is set to none and that the prompt ends with a > character.

Problem:

During the installation process, creating the Control Center database fails. The installation log file (/var/opt/terraspring/install/run/install.log) contains the specific error message. The installer displays the following message, along with a series of options from which you can choose:

Installation may have failed due to incorrect user input or some other correctable error.

Solution:

If you are using Oracle for your database system, the installer needs to know the user name and password of an Oracle dba account so that it can create the Control Center database. If this information is not provided to the installer correctly, or if the default system/manager account does not exist, creation of the Control Center database will fail.

If the creation of the database fails, you will need to either update the Oracle database to contain the proper user name and password combination or update the installer parameters by entering a new user name and password.

Problem:

You see an error indicating a problem loading the configuration on the switch.

Solution:

If you ran the switchsync tool on a new shelf system controller (SSC) that replaces a low-activity SSC, you can safely ignore this message. The switchsync tool is intended for use when an SSC that malfunctioned or had some other problem while in use is replaced by another SSC. However, if there has not been enough activity on the Provisioning Server, such as farm activation, to alter the configuration of the switch, you do not need to run this tool on the replacement SSC. If you run the tool anyway, you might see an error about a problem loading the configuration on the switch. The error occurs because the database does not contain enough information about the switch configuration when no switch-related activity has occurred.

Problem:

Installation failed, and you would like to start over.

Solution:

The N1 Provisioning Server software includes an uninstall program. Run the uninstall_PS command to uninstall a partially installed server.

Installation Validation Issues and Solutions

The following section provides information about issues that you might encounter when you are validating the installation of your N1 Provisioning Server 3.1, Blades Edition software.

Problem:

Resource pool server did not pass the final validation test. The installation log file (/var/opt/terraspring/install/run/install.log) contains the specific error message. The installer displays the following message, along with a series of options from which you can choose:

Installation may have failed due to incorrect user input or some other correctable error.

Solution:

During the final validation test, the installer attempts to boot all available resource pool server devices. During this process, the resource pool server will boot on the image provisioning network. This process relies on correctly configured data and control layer switches as well as proper configuration of the Boot Loader Image, DHCP, and BIND.

If a resource pool server fails the validation test, you should verify the following items:

Verify that the SNMP public community string on the chassis switch matches the admin password.
Verify that VLAN 8 is defined on all chassis switches and the data layer switch. Also ensure that VLAN 8 is added to the list of allowed VLANs on all trunked ports.
Verify that the Boot Loader Image has been set up properly.
Use the following command to verify that the Provisioning Server has an interface configured on the image VLAN (skge800 when using SysKonnect; ce8000 when using GigaSwift).
/sbin/ifconfig -a
Use the following command to verify that traffic on the image subnet interface on the Provisioning Server is not blocked by IPF.
/usr/sbin/ipfstat -io
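
The interface check in the preceding list can be automated by filtering the ifconfig output. The following is a minimal sketch, not part of the product: the function name has_interface is hypothetical, and it assumes that each interface block in the output begins with the interface name followed by a colon at the start of a line.

```shell
# Hypothetical helper: succeeds if the named interface appears in
# "ifconfig -a" output supplied on standard input.
# Usage: /sbin/ifconfig -a | has_interface ce8000
has_interface() {
    # Split on ":" so that the first field of an interface line is its name.
    # Indented continuation lines never match because they start with a tab.
    awk -F: '$1 == name { found = 1 } END { exit !found }' name="$1" -
}
```

A typical use is /sbin/ifconfig -a | has_interface ce8000 || echo "image VLAN interface not configured".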

To further debug this problem, you might need to look at the traffic on the image subnet interface and monitor the resource pool server device during boot up on the console port. Use the snoop utility.

Problem:

During installation, the final validation test (pestest) is taking too long to run, and you do not want to wait for it to finish.

Solution:

You can stop the test by pressing Control-C or by sending a kill signal. Stopping the test will not harm anything. However, the blades will not have been validated for farm use. If any of the blades had a problem, farm activation would fail later. You should let the test finish to detect any problems with the blades, such as defective hardware.

If you stop the test, the pestest tool restores the state of each blade to the state that the blade was in before the pestest was run. For example, a blade might have an initial state of FREE. During the validation test, the blade might acquire the USED state. However, if you were to kill or cancel the test before the test finishes, pestest would set the blade back to FREE before exiting.

Problem:

You think that the validation test (pestest) failed, but are not certain because you do not recognize the failure message.

Solution:

When the validation test fails, messages similar to the following are displayed on your screen:

50306: test FAILED: Reason was: - Cannot save state information for 50306:
Blade S6 seems to be faulty
50111: test FAILED: Reason was: - PES 50111 did not become active in 120 seconds

Warning: 1 Blade(s) timed out and did not complete the test.
  Some Blades (1) in your I-Fabric have failed the validation test
  and are not usable by the N1 Provisioning Server. This may probably
  be due to some configuration issues in the I-Fabric. Please diagnose
  and fix the problem before using these Blades.
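
If the pestest output has been saved to a file, the failed device IDs can be pulled out with a short filter. The following is a minimal sketch, not part of the product: the function name failed_blades is hypothetical, and it assumes that device IDs are numeric and that each failure line has the "ID: test FAILED:" form shown in the preceding sample.

```shell
# Hypothetical helper: reads pestest output on standard input and prints the
# device ID of each line of the form "ID: test FAILED: ...".
failed_blades() {
    sed -n 's/^\([0-9][0-9]*\): test FAILED:.*/\1/p'
}
```

For the sample messages shown, the filter prints 50306 and 50111, one per line.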

Problem:

During the validation test (pestest), the following message is displayed for some blades:

device-id: test FAILED: Reason was: - Cannot save state information for device-id: 
Blade Sn seems to be faulty

Solution:

This blade is defective and needs to be replaced as soon as possible. Follow these steps:

Type the following command to see the properties of the blade:
# /opt/terraspring/sbin/device -l device-id
Where device-id is the device ID shown in the error message.
Examine the FARM_ID column.

If the FARM_ID column does not contain a hyphen (-), the blade is part of a farm.

If the blade is part of a farm, type the following command to replace the failed blade in the farm with another blade that has similar attributes:
# /opt/terraspring/sbin/replacedevice farm-id failed-device-id

To find the shelf ID and IP address for this blade, use the console command.

# /opt/terraspring/sbin/console failed-device-id

In the following example, s2 is the shelf ID and 10.5.141.50 is the IP address.

# console 50102

Console Information
====================
IP address of Terminal-Server(Service Controller): 10.5.141.50
Port(Blade) ID: s2
#

Telnet to the shelf and type the following command to inform the shelf controller that the blade is to be prepared for removal:
# replace fru Sn
Where Sn is the shelf ID from the previous step.

In response to this command, a blue LED will light on the blade that is to be removed.
Approach the blade shelf front panel and remove the defective blade that has a blue LED.
Insert a good blade into the blade shelf to replace the defective blade.
To detect the new blade and update the information in the database, type the following command:
# /opt/terraspring/sbin/shelfsync
To retest the blades, type the following command:
# /opt/terraspring/sbin/pestest

Note –

If you do not replace a defective blade, you must mark that blade as FAILED. Otherwise, later farm activation could fail if that defective blade is used in the farm. Use the following command: /opt/terraspring/sbin/device -sB device-id

Problem:

During the validation test (pestest), the following message is displayed for some blades:

device-id: test FAILED: Reason was: - PES device-id did not become active in 120 seconds

Solution:

This generic message indicates that the blade is unable to boot in the time allowed. If only some of the blades have failed with this message, the cause could be defective hardware, network congestion, or network interference. Try to retest specific blades using the /opt/terraspring/sbin/pestest -d device-id command. If after several retests, those blades still fail with the same message, the cause is likely to be defective hardware or network interference from another machine on the network. Before proceeding to create farms, you must either take appropriate steps to troubleshoot and fix the problem, or mark these blades with the FAILED status so that they will not be used in a farm. To mark the blades as FAILED, use the following command: /opt/terraspring/sbin/device -sB device-id
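
When several blades must be marked as FAILED, it can be safer to generate the commands for review before running them. The following is a minimal sketch, not part of the product: the function name print_mark_failed is hypothetical, and it only prints the device -sB command given above for each device ID that it reads; it does not execute anything.

```shell
# Hypothetical helper: for each device ID read from standard input (one per
# line), print, but do not run, the command that marks the blade as FAILED.
print_mark_failed() {
    while read -r pmf_id; do
        # Skip blank lines so that no command is printed without an ID.
        [ -n "$pmf_id" ] && echo "/opt/terraspring/sbin/device -sB $pmf_id"
    done
    return 0
}
```

After reviewing the printed commands, you can run them individually as root.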

Problem:

During the validation test (pestest), the following message is displayed for all blades:

###: test FAILED: Reason was:  - PES ### did not become active in 120 seconds

Solution:

If all blades fail the validation test, pestest will print a generic message. This message tells you to diagnose and fix the problem before the blades can be used. This issue has three likely causes and solutions:

The network switch on the data plane is not configured properly. Follow these steps to check the configuration:
1. Verify that all the switch ports that connect the data plane switch to the blade shelves are set to “trunk.”
2. Verify that the image VLAN is created on VLAN 8.
3. Verify that the switch port that connects the N1 Provisioning Server machine to the data plane switch is set to VLAN 8.
For more information about configuring the switches, see Chapter 3, N1 Provisioning Server System and Network Preparation in N1 Provisioning Server 3.1, Blades Edition, Installation Guide.
A network problem prevents the blades from communicating with the N1 Provisioning Server machine. To resolve this issue, verify that no other server on the same network is configured as a DHCP server. Another DHCP server could be sending NACK to the blades, preventing them from properly acquiring an IP address from the N1 Provisioning Server machine.
Hardware connections are loose or otherwise not correct. Verify that all cables are properly connected.

Farm Activation and Transition Issues and Solutions

The following section gives information about debugging farm activation failures. If a farm fails to complete a request successfully, an error state will be set on the farm. You can determine the error state by running the command /opt/terraspring/sbin/farm -l. The second-to-last column in the command output indicates the error state of the farm.

Two error states are set for informational purposes only:

Error state 0 indicates that the farm is okay and ready to accept new requests.
Error state 1000 indicates that the farm is in a transition state. A farm request is currently being processed for this farm and the farm is moving from one state to another.
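
The two informational error states can be separated from real errors with a small filter. The following is a minimal sketch, not part of the product, and both function names are hypothetical. The error_state_column function takes the second-to-last whitespace-separated field of each line, so any header lines in the real farm -l output would need to be skipped first.

```shell
# Hypothetical helper: prints the second-to-last column of each line read
# from standard input, which is where "farm -l" reports the error state.
error_state_column() {
    awk 'NF >= 2 { print $(NF - 1) }'
}

# Hypothetical helper: maps an error state to the meaning described above.
describe_error_state() {
    case $1 in
        0)    echo "ok: ready to accept new requests" ;;
        1000) echo "in transition: a request is being processed" ;;
        *)    echo "error: farm request did not complete" ;;
    esac
}
```

For example, describe_error_state 1000 reports that the farm is in transition, so a new request should wait until the current one completes.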

You might encounter the problems below during farm activation or when a farm transitions from one state to another. To reissue a farm request to a farm in an error state, add the -f option to the farm command, for example farm -af farm-id. This option will cause the farm error to be cleared and the farm request will be processed.

Problem:

When you run the /opt/terraspring/sbin/farm -l command, the farm state displays as NEW/NEW_CONFIG/20. This value indicates that your farm could not be allocated.

Solution:

Farm allocation fails if the farm requires more resources of a particular type than are available. To verify the resources that are preventing successful allocation of your farm, run the following command: /opt/terraspring/sbin/rsck farm-id. Retry farm activation after rsck reports that enough resources are available to allocate your farm.

Problem:

During the dispatch phase of activation, all devices in the farm are powered on for the final boot. If the farm fails to dispatch, an error message appears in the Control Center window. Details about the error are in the debug log file /var/adm/tspr.debug.

Solution:

One of the following two solutions might apply:

Device did not report to the monitoring agent before end of timeout period.

The resource pool server might not be booting properly. Watch the resource pool server boot on the console connection. If a device does not boot properly, you need to focus on networking problems. If the device boots successfully but the monitoring agent does not recognize that the device booted, verify that the monitoring agent process on the resource pool server started successfully. Run the following command on the resource pool server: /usr/ucb/ps auxwww | grep java. You should see a monitoring agent Java process running in the background. To debug the monitoring agent process, examine the log files at /opt/terraspring/log/tspr.log.
A problem could exist when trying to power on the device. Refer to the debugging method discussed in the second problem in Farm Wiring Issues and Solutions.
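
When you search the ps output for the monitoring agent, the grep command itself also matches the word java and can produce a false positive. The following is a minimal sketch, not part of the product, that excludes the grep process. The function name has_java_process is hypothetical, and the check confirms only that some Java process is running, not that the process is the monitoring agent.

```shell
# Hypothetical helper: succeeds if "ps" output on standard input contains a
# java process other than the grep command itself.
# Usage: /usr/ucb/ps auxwww | has_java_process
has_java_process() {
    awk '/java/ && !/grep java/ { found = 1 } END { exit !found }'
}
```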

Problem:

Farm update failure. If the farm fails to update, an error message appears in the Control Center window. Details about the error are in the debug log file /var/adm/tspr.debug.

Solution:

The farm update procedure is very similar to the activation procedure. The farm will first transition from the ACTIVE state into the UPDATE state from which it transitions through the WIRED and DISPATCHED states back to the ACTIVE state. During the transition to the UPDATE state, the farm attempts to allocate newly requested resources. Failures during this transition should be debugged in the same manner as allocation problems are debugged. Once a farm has reached UPDATE state, it will transition through the WIRED, DISPATCHED, and ACTIVE states. Refer to the two preceding problems to debug failures.

Problem:

In farm STANDBY state, scrubbing of removed disks fails.

Solution:

Setting up a device for scrubbing follows the same procedure as preparing a device for a disk copy. Refer to the third problem in Farm Wiring Issues and Solutions to debug this problem.

Problem:

In farm STANDBY state, you are unable to move a device into a VLAN.

Solution:

Refer to the first problem in Farm Wiring Issues and Solutions to debug this problem.

Problem:

In farm STANDBY state, you are unable to change the power state of a device.

Solution:

Refer to the debugging method discussed in the second problem in Farm Wiring Issues and Solutions.

Problem:

Snapshot of disk images fails. The completion of the snapshot request leaves the farm in an error state.

Solution:

A disk snapshot is prepared in a similar way as disk copies to a server disk are set up. Follow the instructions in the third problem in Farm Wiring Issues and Solutions to debug this problem.

In addition, the snapshot process attempts to reserve disk space for the snapshot image on the image server prior to taking the snapshot image. If the image server is nearly full, the snapshot process might fail. In this case, remove old images from the server or back these images up to a separate storage device. Use the image command to delete old images: /opt/terraspring/sbin/image -d image-id.

Problem:

Farm deactivation failed. The completion of the deactivation request leaves the farm in an error state.

Solution:

Farm deactivation goes through the same motions as moving a farm into STANDBY with the exception that no snapshot images are created for any removed devices. Follow the farm STANDBY debugging advice when troubleshooting farms that failed to deactivate.

Problem:

A replacement device could not be allocated for device failover support. The completion of the “replace failed device” request leaves the farm in an error state.

Solution:

To replace a failed device with a backup device, a backup device must be available. If there are no more devices available in the free pool, replacing a failed device will fail due to allocation problems. Use the following command to verify that a replacement device is available: /opt/terraspring/sbin/device -LFr device-id.

Problem:

The replacement device could not be provisioned for device failover support. The completion of the “replace failed device” request leaves the farm in an error state.

Solution:

A replacement device will be provisioned in the same way newly allocated devices are provisioned during initial farm activation and farm updates. Refer to Farm Wiring Issues and Solutions to debug these problems.

Problem:

When executing the command /opt/terraspring/sbin/request -lf farm-id, requests show as queued but do not get processed.

Solution:

Verify the following items:

Check the state of the farm to confirm that the farm is not in an error state. Type the following command:
# /opt/terraspring/sbin/farm -l farm-id
If a farm is in an error state, no farm requests will be processed. If you know that the farm is in a reasonable state to process the queued farm requests, perform the following steps:
1. Clear the error state using the following command: /opt/terraspring/sbin/farm -pf farm-id. This command will ping the farm and clear its error state.
2. Verify that the segment manager (sm) is running and responding by typing the following command: /opt/terraspring/sbin/aps.
The next request to be processed for this farm is something other than a request marked QUEUED_BLOCKED.

Because the requests are handled in a first-in/first-out fashion, you might be filing a farm request that is not being processed because a QUEUED_BLOCKED request needs to be processed first. You have two options to eliminate this request. You can either unblock the blocked request, or delete the blocked request. Type the following command to unblock the request:
request -u request-id
Type the following command to delete the request:
request -d request-id
Your farm is in a state other than DEACTIVATED.

A farm in DEACTIVATED state cannot be transitioned into any other state. Once a farm is deactivated, the only other farm operation allowed is farm deletion.
The Segment Manager process is alive.

If none of the above scenarios applies, the Segment Manager process might no longer be responding. To verify that all processes are running properly, use the following command: /opt/terraspring/sbin/aps. If the process is not running properly, use the /etc/rc3.d/S99sm -start script to start the Segment Manager process.

Problem:

You are unable to deactivate a farm that contains a faulty blade. Both farm deactivation and replace device requests fail.

Solution:

Type the following command to mark the faulty blade as FAILED: device -sB blade-id. Then, re-issue the deactivation request.

Farm Wiring Issues and Solutions

During farm activation, various checks are made. When you run the /opt/terraspring/sbin/farm -l command, the farm state might display as NEW/ALLOCATED/30. This value indicates that the farm failed during the wiring phase. This section describes the possible causes and solutions to a farm wiring failure.

Problem:

Moving a device into a VLAN fails. Moving a device into the image VLAN or farm VLAN during the wiring phase could fail in several ways.

Solution:

Each failure has a unique solution:

If the SNMP community string on the switch is not set correctly, moving a farm device into a VLAN might fail. To resolve this issue, change the public SNMP community string on the switch to match the admin password.
The VLAN into which the device is being moved is not defined in the VLAN database of the chassis switch. To resolve this issue, update the VLAN database with the VLAN to which the farm device is being moved.
The switch is not IP accessible from the Provisioning Server. To resolve this issue, resolve any networking issues that could cause IP connectivity to fail. Once the IP connectivity problem has been resolved, reissue the activation request using the farm command: /opt/terraspring/sbin/farm -af farm-id.

Problem:

Unable to change the power state of a blade.

Solution:

This problem has several possible causes and solutions:

If a blade is marked faulty in the showplatform -v output on the chassis system controller (SC), the N1 Provisioning Server software will refuse to change the power state of the blade. A blade that has been marked faulty should be replaced.
If the chassis SC login does not match the login and password known to the N1 Provisioning Server software, you will be unable to change the power state of the blade. In this case, you should update the chassis SC login to match the user name and password known to the software.
During a power state change of a blade, the N1 Provisioning Server system needs to have IP connectivity to the chassis SC. If this connectivity is not granted, the power state change will fail.
The chassis SC occasionally hangs. It does not respond or responds slowly to showplatform -v commands. In this case, you should turn off the power to the chassis and turn the power back on. Then, reactivate the farm using the following command: /opt/terraspring/sbin/farm -af farm-id.
Finally, if you are trying to commit a power change to a device that is defined in the N1 Provisioning Server database but that has been physically removed from the chassis, the power command will fail. Issue a replace failed device request for the removed device in your farm and reactivate your farm.

Problem:

Disk copy failed. If a disk copy fails, an error message appears in the Control Center window. Details about the error are in the debug log file /var/adm/tspr.debug.

Solution:

The process of preparing a resource pool server for image copy is similar to the process executed during the final lab validation test during the installation. Refer to Installation Validation Issues and Solutions to troubleshoot a device that did not come alive within 300 seconds.

Image copies could also fail if the images requested to be copied to a resource pool server do not exist or are not in the READY state. To verify that the image you are trying to copy to a resource pool server is in the READY state, use the following command: /opt/terraspring/sbin/image -l image-id. To mark an image as ready to execute, use the following command: /opt/terraspring/sbin/imagesync --nosync image-id.

If the disk copy process starts successfully and later fails due to an interrupt, you might need to resolve networking problems within your environment.
