N1 Provisioning Server 3.1, Blades Edition, Troubleshooting Guide

Chapter 2 Solving Installation and Configuration Problems

This chapter provides information about issues you might encounter when installing and configuring the N1 Provisioning Server 3.1, Blades Edition product. The issues are separated into the following categories:

For each issue, this document describes the symptom that you will see and recommends a solution. Although the solution should enable you to install the N1 Provisioning Server software, you should still evaluate whether the problem was caused by a bug in the software. If you believe that you have encountered a bug, inform your customer care center representative.

Installation Issues and Solutions

This section describes known installation and initial configuration issues and how you can respond to them. For information about problems that are related to validating the installation, see Installation Validation Issues and Solutions.

Problem:

No VLAN-capable gigabit Ethernet card was found during the installation procedure. If the installer does not find any VLAN-capable card or appropriate drivers on the system, the following message is displayed:


Could not detect any known gigacards. Check if your driver packages are installed.

Solution:

This problem might occur due to several reasons:

Problem:

During the installation process, you see a message that not all required patches have been installed.

Solution:

The required operating system patches must be installed before installation can proceed. The N1 Provisioning Server 3.1, Blades Edition release allows you to install the Solaris 8 Recommended patch cluster instead of individually installing the required patches that are listed in the installation guide. For more information about installing required patches, see Installing Required Patches in N1 Provisioning Server 3.1, Blades Edition, Installation Guide.

Problem:

After applying operating system patches to the N1 Provisioning Server system, named/dhcp does not start up properly. One indication of this problem is that farm requests start failing.

Solution:

The applied patches probably overwrote one of the system tools that N1 modifies, such as BIND or DHCP. Try the following solutions:

Problem:

Installation of the Sun Java™ System Application Server failed. The installation log file (/var/opt/terraspring/install/run/install.log) contains the specific error message.

Solution:

This problem is most likely due to a missing patch on the N1 Provisioning Server system. Make sure that you have all required patches installed on the N1 Provisioning Server system.
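Because the installer records the specific error in its log file, filtering that file is often the quickest way to find the failing step. The following is a minimal sketch, not part of the product: the function name recent_install_errors is hypothetical, and matching on the words "error" and "fail" is an assumption about how the log is worded, so adjust the pattern to what your log actually contains. It relies on POSIX grep -E, which on older Solaris systems may require the XPG4 version of grep.

```shell
# Default log location taken from this guide; override INSTALL_LOG to test.
INSTALL_LOG=${INSTALL_LOG:-/var/opt/terraspring/install/run/install.log}

# recent_install_errors [LOGFILE] [COUNT]
# Print the last COUNT (default 20) log lines that mention "error" or
# "fail", case-insensitively, each prefixed with its line number.
# ASSUMPTION: the installer uses these words in its error lines.
recent_install_errors() {
    grep -i -n -E 'error|fail' "${1:-$INSTALL_LOG}" | tail -n "${2:-20}"
}
```

Run recent_install_errors with no arguments to scan the default log, or pass a different file and line count.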

Problem:

When entering the Blade system chassis IP address and switch network configuration information, the chassis cannot be found. The following error message appears:


Cannot ping System Controller IP Value ip-address

Solution:

This problem is due to a networking issue. If you have not done so, you need to configure the Blade system chassis system controller (SC) to be IP-accessible from the N1 Provisioning Server. For more information, see Assigning IP Addresses in the Control Plane in N1 Provisioning Server 3.1, Blades Edition, Installation Guide.

Problem:

Configuration of the chassis failed. The installation log file (/var/opt/terraspring/install/run/install.log) contains the specific error message. The installer displays the following message, along with a series of options from which you can choose:


Installation may have failed due to incorrect user input or some other correctable error.

Solution:

When the installer fails to configure a chassis SC or switch, several items need to be verified:

Problem:

Chassis discovery fails. The installation log file (/var/opt/terraspring/install/run/install.log) contains the specific error message. The installer displays the following message, along with a series of options from which you can choose:


Installation may have failed due to incorrect user input or some other correctable error.

Solution:

Verify that the shelf controller's CLI prompt generation property is set to none and that the prompt ends with a > character.

Problem:

During the installation process, creating the Control Center database fails. The installation log file (/var/opt/terraspring/install/run/install.log) contains the specific error message. The installer displays the following message, along with a series of options from which you can choose:


Installation may have failed due to incorrect user input or some other correctable error.

Solution:

If you are using Oracle for your database system, the installer needs the user name and password of an Oracle DBA account so that it can create the Control Center database. If this information is not provided to the installer correctly, or if the default system/manager account does not exist, creation of the Control Center database will fail.

If the creation of the database fails, either update the Oracle database to contain the proper user name and password combination, or update the installer parameters by entering a new user name and password.

Problem:

You see an error indicating a problem loading the configuration on the switch.

Solution:

If you ran the switchsync tool on a new shelf system controller (SSC) that replaces a low-activity SSC, you can safely ignore this message. The switchsync tool is intended for replacing an SSC that malfunctioned or developed some other problem while in use. If the Provisioning Server has seen too little activity, such as farm activation, to alter the configuration of the switch, you do not need to run the tool on the replacement SSC. If you run it anyway, you might see an error about loading the configuration on the switch. The error occurs because, with no switch-related activity, the database does not contain enough information about the configuration of the switch.

Problem:

Installation failed, and you would like to start over.

Solution:

The N1 Provisioning Server software includes an uninstall program. Run the uninstall_PS command to uninstall a partially installed server.

Installation Validation Issues and Solutions

The following section provides information about issues that you might encounter when you are validating the installation of your N1 Provisioning Server 3.1, Blades Edition software.

Problem:

Resource pool server did not pass the final validation test. The installation log file (/var/opt/terraspring/install/run/install.log) contains the specific error message. The installer displays the following message, along with a series of options from which you can choose:


Installation may have failed due to incorrect user input or some other correctable error.

Solution:

During the final validation test, the installer attempts to boot all available resource pool server devices. During this process, the resource pool server will boot on the image provisioning network. This process relies on correctly configured data and control layer switches as well as proper configuration of the Boot Loader Image, DHCP, and BIND.

If a resource pool server fails the validation test, you should verify the following items:

To debug this problem further, you might need to examine the traffic on the image subnet interface and monitor the resource pool server device on its console port during boot. Use the snoop utility to capture the network traffic.

Problem:

During installation, the final validation test (pestest) is taking too long to run, and you do not want to wait for it to finish.

Solution:

You can stop the test by pressing Control-C or by sending a kill signal. Stopping the test will not harm anything. However, the blades will not have been validated for farm use. If any of the blades has a problem, farm activation could fail later. You should let the test finish so that it can detect any problems with the blades, such as defective hardware.

If you stop the test, the pestest tool restores the state of each blade to the state that the blade was in before the pestest was run. For example, a blade might have an initial state of FREE. During the validation test, the blade might acquire the USED state. However, if you were to kill or cancel the test before the test finishes, pestest would set the blade back to FREE before exiting.

Problem:

You think that the validation test (pestest) failed, but are not certain because you do not recognize the failure message.

Solution:

When the validation test fails, messages similar to the following are displayed on your screen:


50306: test FAILED: Reason was: - Cannot save state information for 50306:
Blade S6 seems to be faulty
50111: test FAILED: Reason was: - PES 50111 did not become active in 120 seconds

Warning: 1 Blade(s) timed out and did not complete the test.
  Some Blades (1) in your I-Fabric have failed the validation test
  and are not usable by the N1 Provisioning Server. This may probably
  be due to some configuration issues in the I-Fabric. Please diagnose
  and fix the problem before using these Blades.
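If you save the pestest output to a file, a short filter can pull out the IDs of the blades that failed. This sketch assumes that the output follows the sample format shown above, in which each failure line starts with the device ID followed by ": test FAILED:". The function name failed_blades is hypothetical.

```shell
# failed_blades [FILE...]
# Read saved pestest output from FILE (or standard input) and print the
# device ID of every blade reported as FAILED, one per line, without
# duplicates.
# ASSUMPTION: failure lines look like "<id>: test FAILED: Reason was: ...".
failed_blades() {
    sed -n 's/^\([0-9][0-9]*\): test FAILED: .*/\1/p' "$@" | sort -u
}
```

For the sample messages above, the function prints 50111 and 50306.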
Problem:

During the validation test (pestest), the following message is displayed for some blades:


device-id: test FAILED: Reason was: - Cannot save state information for device-id:
Blade Sn seems to be faulty

Solution:

This blade is defective and needs to be replaced as soon as possible. Follow these steps:

  1. Type the following command to see the properties of the blade:


    # /opt/terraspring/sbin/device -l device-id
    

    Where device-id is the device id shown in the error message.

  2. Examine the FARM_ID column.

    If the FARM_ID column does not contain a hyphen (-), the blade is part of a farm.

    If the blade is part of a farm, type the following command to replace the failed blade in the farm with another blade that has similar attributes:


    # /opt/terraspring/sbin/replacedevice farm-id failed-device-id
    
  3. To find the shelf ID and IP address for this blade, use the console command.


    # /opt/terraspring/sbin/console failed-device-id
    

    In the following example, s2 is the shelf ID and 10.5.141.50 is the IP address.


    # console 50102
    
    Console Information
    ====================
    IP address of Terminal-Server(Service Controller): 10.5.141.50
    Port(Blade) ID: s2
    #
  4. Telnet to the shelf and type the following command to inform the shelf controller that the blade is to be prepared for removal:


    # replace fru Sn
    

    Where Sn is the shelf ID from the previous step.

    In response to this command, a blue LED will light on the blade that is to be removed.

  5. Approach the blade shelf front panel and remove the defective blade that has a blue LED.

  6. Insert a good blade into the blade shelf to replace the defective blade.

  7. To detect the new blade and update the information in the database, type the following command:


    # /opt/terraspring/sbin/shelfsync
    
  8. To retest the blades, type the following command:


    # /opt/terraspring/sbin/pestest
    

Note –

If you do not replace a defective blade, you must mark that blade as FAILED. Otherwise, later farm activation could fail if that defective blade is used in the farm. Use the following command: /opt/terraspring/sbin/device -sB device-id
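Step 3 of the procedure above reads two values from the console command output by eye. If you prefer to extract them in a script, the following sketch parses that output. It assumes the output has exactly the two labelled lines shown in the example in step 3. The function name console_info is hypothetical.

```shell
# console_info
# Read "console <device-id>" output on standard input and print the
# controller IP address and the port (blade) ID on one line.
# Returns nonzero if either value is missing.
# ASSUMPTION: the labels match the example output in step 3.
console_info() {
    awk -F': *' '
        /^ *IP address of Terminal-Server/ { ip = $2 }
        /^ *Port\(Blade\) ID/              { port = $2 }
        END {
            if (ip == "" || port == "") exit 1
            print ip, port
        }
    '
}
```

For example, piping the output of /opt/terraspring/sbin/console failed-device-id into console_info yields the two values that are needed for steps 4 and 5.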


Problem:

During the validation test (pestest), the following message is displayed for some blades:


device-id: test FAILED: Reason was: - PES device-id did not become active in 120 seconds

Solution:

This generic message indicates that the blade is unable to boot in the time allowed. If only some of the blades have failed with this message, the cause could be defective hardware, network congestion, or network interference. Retest specific blades by using the /opt/terraspring/sbin/pestest -d device-id command. If, after several retests, those blades still fail with the same message, the cause is likely to be defective hardware or network interference from another machine on the network. Before you create farms, you must either troubleshoot and fix the problem or mark these blades with the FAILED status so that they will not be used in a farm. To mark the blades as FAILED, use the following command: /opt/terraspring/sbin/device -sB device-id
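The retest-then-mark sequence described above can be sketched as a small shell function. This is an illustration only: the function name retest_or_fail and the default of three attempts are hypothetical, and the sketch assumes that a failed retest prints the same "test FAILED" text shown in the message above. Because marking a blade as FAILED removes it from use, review the logic before running it against real hardware.

```shell
# Command paths from this guide; overridable so the logic can be dry-run.
PESTEST=${PESTEST:-/opt/terraspring/sbin/pestest}
DEVICE=${DEVICE:-/opt/terraspring/sbin/device}

# retest_or_fail DEVICE_ID [ATTEMPTS]
# Retest one blade up to ATTEMPTS times (default 3). If every attempt
# prints a "test FAILED" line, mark the blade FAILED with "device -sB".
# ASSUMPTION: pestest -d reports failure with the "test FAILED" text.
retest_or_fail() {
    dev=$1
    attempts=${2:-3}
    # Refuse to guess if the test command is not available.
    command -v "$PESTEST" >/dev/null 2>&1 || {
        echo "cannot find $PESTEST" >&2
        return 2
    }
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$PESTEST" -d "$dev" 2>&1 | grep "test FAILED" >/dev/null; then
            n=$((n + 1))
        else
            echo "$dev passed on attempt $n"
            return 0
        fi
    done
    echo "$dev failed $attempts attempts; marking FAILED"
    "$DEVICE" -sB "$dev"
}
```

For example, retest_or_fail 50111 5 retests device 50111 up to five times before marking it.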

Problem:

During the validation test (pestest), the following message is displayed for all blades:


###: test FAILED: Reason was:  - PES ### did not become active in 120 seconds

Solution:

If all blades fail the validation test, pestest will print a generic message. This message tells you to diagnose and fix the problem before the blades can be used. This issue has three likely causes and solutions:

Farm Activation and Transition Issues and Solutions

The following section gives information about debugging farm activation failures. If a farm fails to complete a request successfully, an error state will be set on the farm. You can determine the error state by running the command /opt/terraspring/sbin/farm -l. The second-to-last column in the command output indicates the error state of the farm.

Two error states are set for informational purposes only:

You might encounter the problems below during farm activation or when a farm transitions from one state to another. To reissue a farm request to a farm in an error state, add the -f option to the farm command, for example farm -af farm-id. This option will cause the farm error to be cleared and the farm request will be processed.

Problem:

When you run the /opt/terraspring/sbin/farm -l command, the farm state displays as NEW/NEW_CONFIG/20. This value indicates that your farm could not be allocated.

Solution:

Farm allocation fails if the farm requires more resources of a particular type than are available. To verify the resources that are preventing successful allocation of your farm, run the following command: /opt/terraspring/sbin/rsck farm-id. Retry farm activation after rsck reports that enough resources are available to allocate your farm.

Problem:

During the dispatch phase of activation, all devices in the farm are powered on for the final boot. If the farm fails to dispatch, an error message appears in the Control Center window. Details about the error are in the debug log file /var/adm/tspr.debug.

Solution:

One of the following two solutions might apply:

Problem:

Farm update failure. If the farm fails to update, an error message appears in the Control Center window. Details about the error are in the debug log file /var/adm/tspr.debug.

Solution:

The farm update procedure is very similar to the activation procedure. The farm will first transition from the ACTIVE state into the UPDATE state from which it transitions through the WIRED and DISPATCHED states back to the ACTIVE state. During the transition to the UPDATE state, the farm attempts to allocate newly requested resources. Failures during this transition should be debugged in the same manner as allocation problems are debugged. Once a farm has reached UPDATE state, it will transition through the WIRED, DISPATCHED, and ACTIVE states. Refer to the two preceding problems to debug failures.

Problem:

In farm STANDBY state, scrubbing of removed disks fails.

Solution:

Setting up a device for scrubbing follows the same procedure as preparing a device for a disk copy. Refer to the third problem in Farm Wiring Issues and Solutions to debug this problem.

Problem:

In farm STANDBY state, you are unable to move a device into a VLAN.

Solution:

Refer to the first problem in Farm Wiring Issues and Solutions to debug this problem.

Problem:

In farm STANDBY state, you are unable to change the power state of a device.

Solution:

Refer to the debugging method discussed in the second problem in Farm Wiring Issues and Solutions.

Problem:

Snapshot of disk images fails. The completion of the snapshot request leaves the farm in an error state.

Solution:

A disk snapshot is prepared in much the same way as a disk copy to a server disk. Follow the instructions in the third problem in Farm Wiring Issues and Solutions to debug this problem.

In addition, the snapshot process attempts to reserve disk space for the snapshot image on the image server before taking the snapshot. If the image server is nearly full, the snapshot process might fail. In this case, remove old images from the server or back them up to a separate storage device. Use the image command to delete old images: /opt/terraspring/sbin/image -d image-id.
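To see whether disk space is the likely cause, you can check how full the file system that holds the images is before you delete anything. The following sketch uses the POSIX df -P output format. The function names and the 90 percent default threshold are hypothetical, and this guide does not state where images are stored, so you must supply the directory yourself.

```shell
# fs_used_pct DIR
# Print the used-capacity percentage (without the % sign) of the file
# system that holds DIR, based on the POSIX "df -P" output format.
fs_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# warn_if_full DIR [LIMIT]
# Warn when usage is at or above LIMIT percent (default 90).
# Returns 0 when below the limit, 1 when at or above it, 2 on error.
warn_if_full() {
    pct=$(fs_used_pct "$1")
    case "$pct" in
        ''|*[!0-9]*) echo "cannot read usage for $1" >&2; return 2 ;;
    esac
    if [ "$pct" -ge "${2:-90}" ]; then
        echo "WARNING: $1 is ${pct}% full; delete or archive old images"
        return 1
    fi
    echo "OK: $1 is ${pct}% full"
}
```

Run warn_if_full on the image server with the directory that holds your images as the first argument.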

Problem:

Farm deactivation failed. The completion of the deactivation request leaves the farm in an error state.

Solution:

Farm deactivation goes through the same motions as moving a farm into STANDBY with the exception that no snapshot images are created for any removed devices. Follow the farm STANDBY debugging advice when troubleshooting farms that failed to deactivate.

Problem:

A replacement device could not be allocated for device failover support. The completion of the “replace failed device” request leaves the farm in an error state.

Solution:

To replace a failed device with a backup device, a backup device must be available. If no devices remain in the free pool, the replacement request fails because no device can be allocated. Use the following command to verify that a replacement device is available: /opt/terraspring/sbin/device -LFr device-id.

Problem:

The replacement device could not be provisioned for device failover support. The completion of the “replace failed device” request leaves the farm in an error state.

Solution:

A replacement device will be provisioned in the same way newly allocated devices are provisioned during initial farm activation and farm updates. Refer to Farm Wiring Issues and Solutions to debug these problems.

Problem:

When executing the command /opt/terraspring/sbin/request -lf farm-id, requests show as queued but do not get processed.

Solution:

Verify the following items:

Problem:

You are unable to deactivate a farm that contains a faulty blade. Both farm deactivation and replace device requests fail.

Solution:

Type the following command to mark the faulty blade as FAILED: /opt/terraspring/sbin/device -sB blade-id. Then reissue the deactivation request.

Farm Wiring Issues and Solutions

During farm activation, various checks are made. When you run the /opt/terraspring/sbin/farm -l command, the farm state might display as NEW/ALLOCATED/30. This value indicates that the farm failed during the wiring phase. This section describes the possible causes and solutions to a farm wiring failure.
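This guide names two farm state values that identify the phase in which activation failed: NEW/NEW_CONFIG/20 for allocation and NEW/ALLOCATED/30 for wiring. As a convenience, the following sketch maps a state string that you copy from the farm -l output to the matching section of this guide. The function name farm_state_hint is hypothetical, and the function covers only the two values that are documented here.

```shell
# farm_state_hint STATE
# Map a farm state string, as displayed by "farm -l", to the
# troubleshooting starting point described in this guide.
farm_state_hint() {
    case "$1" in
        NEW/NEW_CONFIG/20)
            echo "allocation failed: check resources with rsck farm-id" ;;
        NEW/ALLOCATED/30)
            echo "wiring failed: see Farm Wiring Issues and Solutions" ;;
        *)
            echo "no hint for state: $1" ;;
    esac
}
```

For example, farm_state_hint NEW/ALLOCATED/30 points you to this section.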

Problem:

Moving a device into a VLAN fails. Moving a device into the image VLAN or farm VLAN during the wiring phase could fail in several ways.

Solution:

Each failure has a unique solution:

Problem:

Unable to change the power state of a blade.

Solution:

This problem has several possible causes and solutions:

Problem:

Disk copy failed. If a disk copy fails, an error message appears in the Control Center window. Details about the error are in the debug log file /var/adm/tspr.debug.

Solution:

The process of preparing a resource pool server for image copy is similar to the process executed during the final lab validation test during the installation. Refer to Installation Validation Issues and Solutions to troubleshoot a device that did not come alive within 300 seconds.

Image copies could also fail if the images requested to be copied to a resource pool server do not exist or are not in the READY state. To verify that the image you are trying to copy to a resource pool server is in the READY state, use the following command: /opt/terraspring/sbin/image -l image-id. To mark an image as ready to execute, use the following command: /opt/terraspring/sbin/imagesync --nosync image-id.

If the disk copy process starts successfully and later fails due to an interrupt, you might need to resolve networking problems within your environment.