CHAPTER 4
Firmware and Blade Server Management
This chapter contains the following sections:
The Netra CP3260 blade server contains a modular firmware architecture that gives you latitude in controlling boot initialization. You can customize the initialization, test the firmware, and even enable the installation of a custom operating system.
This platform also employs the Intelligent Platform Management Controller (IPMC), described in Section 5.2.8, Intelligent Platform Management Controller, which controls system management, hot-swap control, and some board hardware. The IPMC is configured by separate firmware.
The Netra CP3260 blade server boots from the 4-Mbyte system flash PROM device that includes the power-on self-test (POST) and OpenBoot firmware.
A system firmware progress sensor (SFPS) is available on the Sun Netra CP3260 blade server. This sensor models the state of the firmware running on the payload and reports it to external management software (the ShMM on Netra CT 900 servers) through the standard IPMI event mechanism.
The firmware states are Progress, Hang, and Error, each with various substates. The sensor generates an IPMI event message for each state. You can verify these messages by running the clia sel command on the ShMM, or by monitoring the HPI events and SNMP traps generated for each sensor event.
For more information, see Section B.4, Send Sensor State Command.
For detailed sensor command syntax and options, refer to the Netra CT 900 Software Developer’s Guide (819-1178). (Even if you are using a third-party chassis, the SFPS commands and options apply, and this document is available online.)
http://docs.sun.com/app/docs/prod/n900.srvr#hic
Power-on self-test (POST) is a firmware program that helps determine whether a portion of the system has failed. POST verifies the core functionality of the system, including the CPU modules, motherboard, memory, and some on-board I/O devices. The software then generates messages that can be useful in determining the nature of a hardware failure. POST can run even if the system is unable to boot.
If POST detects a faulty component, it is disabled automatically, preventing faulty hardware from potentially harming any software. If the system is capable of running without the disabled component, the system boots when POST is complete. For example, if one of the processor cores is deemed faulty by POST, the core is disabled, and the system boots and runs using the remaining cores.
POST diagnostic and error message reports are displayed on a console.
The POST diagnostics include the following tests:
These diagnostic and error messages use the following format:
Core-ID:Strand-ID ERROR: TEST = test-name
Core-ID:Strand-ID H/W under test = description
Core-ID:Strand-ID Repair Instruction
Core-ID:Strand-ID MSG = error-message-body
Core-ID:Strand-ID END_ERROR
The following is an example of a POST error message:
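For illustration, a message of this form might look like the following; the test name, hardware description, and error values shown here are hypothetical:

```
0:0 ERROR: TEST = Memory Data Bitwalk
0:0 H/W under test = Memory branch 0
0:0 Repair Instruction: Replace the FRU containing the failing memory
0:0 MSG = Data miscompare at address 0x40001000
0:0 END_ERROR
```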
The Solaris OS installed operates at different run levels. For a full description of run levels, refer to the Solaris system administration documentation.
Most of the time, the OS operates at run level 2 or run level 3, which are multiuser states with access to full system and network resources. Occasionally, you might operate the system at run level 1, which is a single-user administrative state. However, the lowest operational state is run level 0.
When the OS is at run level 0, the ok prompt appears. This prompt indicates that the OpenBoot firmware is in control of the system.
There are a number of scenarios under which OpenBoot firmware control can occur.
By default, before the operating system is installed, the system comes up under OpenBoot firmware control.
There are different ways of reaching the ok prompt. The methods are not equally desirable. See TABLE 4-1 for details.
If possible, back up system data before starting to access the ok prompt. Also exit or stop all applications, and warn users of the impending loss of service. For information about the appropriate backup and shutdown procedures, see Solaris system administration documentation.
The system firmware stores a configuration variable called auto-boot?, which controls whether the firmware will automatically boot the operating system after each reset. The default setting for Sun platforms is true.
Normally, if a system fails power-on diagnostics, auto-boot? is ignored and the system does not boot unless an operator boots the system manually. An automatic boot is generally not acceptable for booting a system in a degraded state. Therefore, the Netra CP3260 server OpenBoot firmware provides a second setting, auto-boot-on-error?. This setting controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? switches must be set to true to enable an automatic degraded boot. To set the switches, type:
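For example, at the ok prompt (the variable names are as documented above; the echoed confirmation may vary by firmware version):

```
ok setenv auto-boot? true
auto-boot? = true
ok setenv auto-boot-on-error? true
auto-boot-on-error? = true
```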
Note - The default setting for auto-boot-on-error? is false. The system will not attempt a degraded boot unless you change this setting to true. In addition, the system will not attempt a degraded boot in response to any fatal nonrecoverable error, even if degraded booting is enabled. For examples of fatal nonrecoverable errors, see Section 4.3.4, OpenBoot Configuration Variables.
You type the OpenBoot commands at the ok prompt. The OpenBoot commands that can provide useful diagnostic information include probe-scsi, probe-scsi-all, probe-ide, show-devs, watch-net, and watch-net-all.
For a complete list of OpenBoot commands and more information about the OpenBoot firmware, refer to the OpenBoot 4.x Command Reference Manual. An online version of the manual is included with the OpenBoot Collection AnswerBook that ships with Solaris software.
The probe-scsi and probe-scsi-all commands diagnose problems with the SCSI devices.
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, issuing the probe-scsi or probe-scsi-all command can hang the system.
The probe-scsi command communicates with all SCSI devices connected to on-board SCSI controllers. The probe-scsi-all command also accesses devices connected to any host adapters installed in PCI slots.
For any SCSI device that is connected and active, the probe-scsi and probe-scsi-all commands display its loop ID, host adapter, logical unit number, unique worldwide name (WWN), and a device description that includes type and manufacturer.
The following sample output is from the probe-scsi-all command with a Netra CP32x0 ARTM connected to the Netra CP3260 blade server.
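Output from probe-scsi-all generally takes a shape like the following; the device path, target number, and disk model shown here are hypothetical:

```
{0} ok probe-scsi-all
/pci@0/pci@0/pci@2/scsi@0

Target 0
  Unit 0   Disk   FUJITSU MAY2073RCSUN72G  0401    143374738 Blocks, 73 GB
```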
The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, issuing the probe-ide command can hang the system.
The following shows sample output from the probe-ide command.
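Output from probe-ide generally takes a shape like the following; the device model shown here is hypothetical:

```
{0} ok probe-ide
  Device 0  ( Primary Master )
        Removable ATAPI Model: DV-28E-B
  Device 1  ( Primary Slave )
        Not Present
```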
The show-devs command lists the hardware device paths for each device in the firmware device tree. The following shows some sample output.
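For example, an abbreviated and hypothetical device tree listing might look like this:

```
{0} ok show-devs
/pci@0
/cpu@0
/virtual-devices@100
/aliases
/options
/openprom
/chosen
/packages
```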
The watch-net diagnostic test monitors Ethernet packets on the primary network interface. The watch-net-all diagnostic test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the system board. Good packets received by the system are indicated by a period (.). Errors such as framing errors and cyclic redundancy check (CRC) errors are indicated by an X and an associated error description.
To start the watch-net diagnostic test, type the watch-net command at the ok prompt.
{0} ok watch-net
1000 Mbps full duplex Link up
Looking for Ethernet Packets.
‘.’ is a Good Packet. ‘X’ is a Bad Packet.
Type any key to stop.................................
To start the watch-net-all diagnostic test, type watch-net-all at the ok prompt.
The OpenBoot configuration variables are stored in the OBP flash PROM and determine how and when OpenBoot tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 4-2.
Changes to OpenBoot configuration variables take effect at the next reboot.
Halt the server to display the ok prompt.
The following example shows a short excerpt of this command’s output.
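An excerpt from printenv generally takes the following form; the variables and values shown here are illustrative:

```
{0} ok printenv
Variable Name           Value            Default Value
auto-boot-on-error?     false            false
auto-boot?              true             true
boot-device             disk net         disk net
diag-switch?            false            false
```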
Error handling during the power-on sequence falls into one of the following three cases:
Automatic system recovery (ASR) consists of self-test features and an autoconfiguration capability to detect failed hardware components and unconfigure them. By enabling ASR, the server is able to resume operating after certain nonfatal hardware faults or failures have occurred.
If a component is monitored by ASR and the server is capable of operating without it, the server automatically reboots if that component develops a fault or fails. This capability prevents a faulty hardware component from stopping operation of the entire system or causing the system to fail repeatedly.
If a fault is detected during the power-on sequence, the faulty component is disabled. If the system remains capable of functioning, the boot sequence continues.
To support this degraded boot capability, the OpenBoot firmware uses the 1275 client interface (by means of the device tree) to mark a device as either failed or disabled, creating an appropriate status property in the device tree node. The Solaris OS does not activate a driver for any subsystem marked in this way.
As long as a failed component is electrically dormant (not causing random bus errors or signal noise, for example), the system reboots automatically and resumes operation while a service call is made.
Once a failed or disabled device is replaced with a new one, the OpenBoot firmware automatically modifies the status of the device upon reboot.
Note - ASR is not enabled until you activate it (see Section 4.5.1.1, To Enable Automatic System Recovery).
The automatic system recovery (ASR) feature is not activated until you enable it. Enabling ASR requires changing configuration variables in OpenBoot.
2. To cause the parameter changes to take effect, type:
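Assuming the procedure uses the standard OpenBoot reset command, this step is:

```
ok reset-all
```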
The system permanently stores the parameter changes and boots automatically when the OpenBoot configuration variable auto-boot? is set to true (its default value).
Note - To store parameter changes, you can also power cycle the system using the front panel Power button.
2. To cause the parameter changes to take effect, type:
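Assuming the procedure uses the standard OpenBoot reset command, this step is:

```
ok reset-all
```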
The system permanently stores the parameter change.
Note - To store parameter changes, you can also power cycle the system using the front panel Power button.
After you disable the automatic system recovery (ASR) feature, it is not activated again until you re-enable it.
A device alias is a shorthand representation of a device path. The Solaris OS provides some predefined device aliases for the network devices so that you do not need to type the full device path name. TABLE 4-3 lists the network device aliases, the default Solaris OS device names, and associated ports for the Netra CP3260 blade server. You can use the devalias command to display the device aliases.
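For example, devalias output takes the following general form; the device paths shown here are hypothetical:

```
{0} ok devalias
net3       /pci@0/pci@0/pci@8/network@0,3
net2       /pci@0/pci@0/pci@8/network@0,2
net1       /pci@0/pci@0/pci@8/network@0,1
net0       /pci@0/pci@0/pci@8/network@0
```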
You can use the Solaris platform information and control library (PICL) framework for obtaining the state and condition of the Netra CP3260 blade server.
The PICL framework provides information about the system configuration that it maintains in the PICL tree. Within this PICL tree is a subtree named frutree, which represents the hierarchy of system field-replaceable units (FRUs) with respect to a root node in the tree called chassis. The frutree represents physical resources of the system. The PICL tree is updated whenever a change occurs in a device’s status.
TABLE 4-4 shows the frutree entries and properties that describe the condition of the Netra CP3260 blade server.
The prtpicl -v command shows the condition of all devices in the PICL tree. Sample output from the prtpicl command on the Netra CP3260 blade server is shown in CODE EXAMPLE 4-4.
For more information on the PICL framework, refer to the picld(1M) man page.
A multiplexer (MUX) controller and ShMM configuration is available on Netra CP3260 blade servers to multiplex 10GbE network interface unit (NIU) ports to Zone 2 (backplane), to Zone 3 (ARTM), or to both.
Note - The host must be configured to match the MUX configuration.
For customers using blade servers in a Netra CT 900 chassis, a complete end-to-end solution is provided. The MUX feature is implemented through the ShMM firmware and IPMI commands on the IPMC. These commands extend MUX configuration access to the management software so that during blade server hot-swaps, the MUX configuration is persistent across blade server activations and deactivations.
Customers who use Sun Netra CP3260 blade servers in a third-party chassis, without the Netra CT 900 chassis ShMM management software, can save MUX configurations in a configuration file or in persistent storage managed by system management software. When the system management software detects blade server activation, it sends the command to set the MUX to the programmed state. Because the management software sends this command during every blade server activation, the configuration persists across blade server deactivations and activations.
Refer to the following documentation:
Be aware of the following possible issues when multiplexing zones:
Copyright © 2009 Sun Microsystems, Inc. All rights reserved.