CHAPTER 4

 


Firmware and Blade Server Management

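This chapter contains the following sections:

  • Section 4.1, System Firmware
  • Section 4.2, Power-On Self-Test Diagnostics
  • Section 4.3, OpenBoot Firmware
  • Section 4.4, Error Handling Summary
  • Section 4.5, Automatic System Recovery
  • Section 4.6, Hot-Swap Information
  • Section 4.7, Network Device Aliases
  • Section 4.8, Retrieving Device Information
  • Section 4.9, Mandatory /etc/system File Entry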


4.1 System Firmware

The Sun Netra CP3060 blade server uses a modular firmware architecture that gives you latitude in controlling boot initialization. You can customize the initialization, test the firmware, and even enable the installation of a custom operating system.

This platform also employs the Intelligent Platform Management controller (IPMC), described in Section 5.1.6, Intelligent Platform Management Controller. The IPMC handles system management, hot-swap control, and some blade server hardware. The IPMC configuration is controlled by separate firmware.

The Sun Netra CP3060 blade server boots from the 4-Mbyte system flash PROM device, which includes the power-on self-test (POST) and OpenBoot firmware.


4.2 Power-On Self-Test Diagnostics

Power-on self-test (POST) is a firmware program that helps determine whether a portion of the system has failed. POST verifies the core functionality of the system, including the CPU modules, motherboard, memory, and some on-board I/O devices. The software then generates messages that can be useful in determining the nature of a hardware failure. POST can run even if the system is unable to boot.

If POST detects a faulty component, POST disables that component automatically, which prevents the faulty hardware from potentially harming any software. If the system is capable of running without the disabled component, the system boots when POST is complete. For example, if POST deems one of the processor cores faulty, the core is disabled, and the system boots and runs using the remaining cores.

POST diagnostic and error message reports are displayed on a console.

4.2.1 POST Test Coverage

The POST diagnostics include the following tests:


  1. UltraSPARC T1 Processor Tests:
       • MMU (Memory Management Unit), all cores
           • DMMU TLBs: tags, data RAM tests
           • IMMU TLBs: tags, data RAM tests
       • Caches, all cores
           • L2 cache
           • L1 Icache
           • L1 Dcache
       • FPU (Floating Point Unit)
           • Functional
           • Register
       • Interrupts
  2. Memory Tests (up to 2 Gbytes per DIMM):
       • SDRAM data line test
       • SDRAM address line test
       • SDRAM cell integrity test
       • Moving inversions memory test
  3. POST Image Tests:
       • POST PROM checksum test
       • POST memory checksum test
  4. ECC Error Test
  5. XBUS SRAM Test
  6. JBus-to-PCIE Bridge Tests:
       • Internal registers test
       • JBus interrupts
       • PCI-E MSI interrupts test
       • PLX interconnect test
       • PCI DMA tests
       • JBus-to-PCI-E loop-back test
  7. PCIE Tests:
       • Verify PCI-E bus configuration
       • Verify VID/DID registers for all onboard PCI devices
       • Verify link status of all onboard PCI-E channels

4.2.2 POST Diagnostic and Error Message Format

POST diagnostic and error messages are displayed on a console. These messages have the following format:

Core-ID:Strand-ID ERROR:  TEST = test-name
Core-ID:Strand-ID H/W under test = description
Core-ID:Strand-ID Repair Instruction
Core-ID:Strand-ID MSG = error-message-body
Core-ID:Strand-ID END_ERROR

The following is an example of a POST error message:


TABLE 4-1
3:2>ERROR: TEST = L2-Cache Functional
3:2>H/W under test = Core l2 Cache
3:2>Repair Instructions: Replace items in order listed by ’H/W under test’ above.
3:2>MSG = No way found to match tag address 00000000.00600000, state 3
3:2>END_ERROR
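
Because each POST error report follows the same line-oriented layout, a captured console log can be scanned mechanically. The following Python sketch is illustrative only and is not part of the Sun firmware or tools. The function name parse_post_errors and the dictionary keys are hypothetical, and the parser assumes that every line carries the core:strand> prefix shown in the sample message above.

```python
import re

# Matches one POST console line such as "3:2>ERROR: TEST = L2-Cache Functional".
# The "core:strand>" prefix is taken from the sample message (an assumption
# about real logs, which may contain other line shapes).
LINE_RE = re.compile(r"^(\d+):(\d+)>(.*)$")


def parse_post_errors(console_text):
    """Return a list of dicts, one per ERROR ... END_ERROR block."""
    errors = []
    current = None
    for raw in console_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # not a POST-prefixed line
        core, strand = int(match.group(1)), int(match.group(2))
        body = match.group(3).strip()
        if body.startswith("ERROR:"):
            # A new error block starts; the TEST field shares this line.
            current = {"core": core, "strand": strand}
            body = body[len("ERROR:"):].strip()
        if current is None:
            continue  # ignore lines outside an error block
        if body == "END_ERROR":
            errors.append(current)
            current = None
        elif body.startswith("TEST ="):
            current["test"] = body.split("=", 1)[1].strip()
        elif body.startswith("H/W under test ="):
            current["hardware"] = body.split("=", 1)[1].strip()
        elif body.startswith("Repair Instructions:"):
            current["repair"] = body.split(":", 1)[1].strip()
        elif body.startswith("MSG ="):
            current["message"] = body.split("=", 1)[1].strip()
    return errors


SAMPLE = """\
3:2>ERROR: TEST = L2-Cache Functional
3:2>H/W under test = Core l2 Cache
3:2>Repair Instructions: Replace items in order listed by 'H/W under test' above.
3:2>MSG = No way found to match tag address 00000000.00600000, state 3
3:2>END_ERROR
"""

if __name__ == "__main__":
    for err in parse_post_errors(SAMPLE):
        print(err["core"], err["strand"], err["test"], "->", err["hardware"])
```

Keeping the core and strand numbers with each record makes it easy to see whether repeated failures point at the same hardware thread.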


4.3 OpenBoot Firmware

The installed Solaris OS operates at different run levels. For a full description of run levels, refer to the Solaris system administration documentation.

Most of the time, the OS operates at run level 2 or run level 3, which are multiuser states with access to full system and network resources. Occasionally, you might operate the system at run level 1, which is a single-user administrative state. However, the lowest operational state is run level 0.

When the OS is at run level 0, the ok prompt appears. This prompt indicates that the OpenBoot firmware is in control of the system.

There are a number of scenarios under which OpenBoot firmware control can occur.

By default, before the operating system is installed the system comes up under OpenBoot firmware control.

4.3.1 Getting to the ok Prompt

There are different ways of reaching the ok prompt. The methods are not equally desirable. See TABLE 4-2 for details.


TABLE 4-2 Ways of Accessing the ok Prompt

| Access Method | What to Do |
|---|---|
| Graceful shutdown of the Solaris OS | From a shell or command tool window, issue an appropriate command (for example, the shutdown or init command) as described in Solaris system administration documentation. |
| Manual system reset | Set the OpenBoot auto-boot? variable to false. The system then stops at the ok prompt the next time the blade server is reset. |




Caution - Obtaining the ok prompt suspends all application and operating system software. After you issue firmware commands and run firmware-based tests from the ok prompt, the system might not be able to resume where it left off.


If possible, back up system data before accessing the ok prompt. Also exit or stop all applications, and warn users of the impending loss of service. For information about the appropriate backup and shutdown procedures, see Solaris system administration documentation.

4.3.2 Auto-Boot Options

The system firmware stores a configuration variable called auto-boot?, which controls whether the firmware will automatically boot the operating system after each reset. The default setting for Sun platforms is true.

Normally, if a system fails power-on diagnostics, auto-boot? is ignored and the system does not boot unless an operator boots the system manually. An automatic boot is generally not acceptable for booting a system in a degraded state. Therefore, the Sun Netra CP3060 server OpenBoot firmware provides a second setting, auto-boot-on-error?. This setting controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? switches must be set to true to enable an automatic degraded boot. To set the switches, type:


 
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - The default setting for auto-boot-on-error? is false. The system will not attempt a degraded boot unless you change this setting to true. In addition, the system will not attempt a degraded boot in response to any fatal nonrecoverable error, even if degraded booting is enabled. For examples of fatal nonrecoverable errors, see OpenBoot Configuration Variables.
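
The interaction between the two switches can be summarized as a small decision function. The following Python sketch only restates the rules described in this section. The function name boot_action and the result labels "boot", "degraded-boot", and "stop" are hypothetical, where "stop" stands for the system waiting for an operator instead of booting.

```python
def boot_action(auto_boot, auto_boot_on_error, diagnostics_failed, fatal_error=False):
    """Model what happens after a reset, per the rules in this section.

    auto_boot            value of auto-boot?
    auto_boot_on_error   value of auto-boot-on-error?
    diagnostics_failed   True if power-on diagnostics found a failed subsystem
    fatal_error          True for a fatal nonrecoverable error
    """
    if fatal_error:
        # No degraded boot is attempted, even if degraded booting is enabled.
        return "stop"
    if not diagnostics_failed:
        # Normal case: auto-boot? alone decides.
        return "boot" if auto_boot else "stop"
    if auto_boot and auto_boot_on_error:
        # Both switches must be true for an automatic degraded boot.
        return "degraded-boot"
    return "stop"


if __name__ == "__main__":
    # Default settings (auto-boot? true, auto-boot-on-error? false)
    # with a failed subsystem.
    print(boot_action(True, False, diagnostics_failed=True))
```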


4.3.3 OpenBoot Commands

You type the OpenBoot commands at the ok prompt. The sections that follow describe OpenBoot commands that can provide useful diagnostic information: probe-ide, show-devs, watch-net, and watch-net-all.

For a complete list of OpenBoot commands and more information about the OpenBoot firmware, refer to the OpenBoot 4.x Command Reference Manual. An online version of the manual is included with the OpenBoot Collection AnswerBook that ships with Solaris software.

4.3.3.1 probe-ide Command

The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.



Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, issuing the probe-ide command can hang the system.


CODE EXAMPLE 4-1 shows sample output from the probe-ide command.


CODE EXAMPLE 4-1 probe-ide Command Output
{0} ok probe-ide
   Device 0  ( Primary Master )
           ATA Model: FUJITSU MHV2040BH
 
   Device 1  ( Primary Slave )
           ATA Model: 
 
   Device 2  ( Secondary Master )
          Not Present
 
   Device 3  ( Secondary Slave )
          Not Present 

4.3.3.2 show-devs Command

The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE EXAMPLE 4-2 shows some sample output.


CODE EXAMPLE 4-2 show-devs Command Output
{0} ok show-devs
/pci@7c0
/pci@780
/cpu@17
/cpu@16
/cpu@15
/cpu@14
/cpu@13
/cpu@12
/cpu@11
/cpu@10
/cpu@f
/cpu@e
/cpu@d
/cpu@c
/cpu@b
/cpu@a
/cpu@9
/cpu@8
/cpu@7
/cpu@6
/cpu@5
/cpu@4
/cpu@3
/cpu@2
/cpu@1
/cpu@0
/virtual-devices@100
/virtual-memory
/memory@m0,800000
/aliases
/options
/openprom
/chosen
/packages
/pci@7c0/network@0,1
/pci@7c0/network@0
/pci@780/pci@0
/pci@780/pci@0/pci@9
/pci@780/pci@0/pci@8
/pci@780/pci@0/pci@2
/pci@780/pci@0/pci@1
/pci@780/pci@0/pci@2/network@0,1
/pci@780/pci@0/pci@2/network@0
/pci@780/pci@0/pci@1/pci@0
/pci@780/pci@0/pci@1/pci@0/ide@1f,1
/pci@780/pci@0/pci@1/pci@0/ide@1f
/pci@780/pci@0/pci@1/pci@0/ide@1f,1/cdrom
/pci@780/pci@0/pci@1/pci@0/ide@1f,1/disk
/pci@780/pci@0/pci@1/pci@0/ide@1f/cdrom
/pci@780/pci@0/pci@1/pci@0/ide@1f/disk
/virtual-devices@100/ipmi@f
/virtual-devices@100/flashupdate@e
/virtual-devices@100/led@d
/virtual-devices@100/explorer@c
/virtual-devices@100/sunmc@b
/virtual-devices@100/sunvts@a
/virtual-devices@100/fma@9
/virtual-devices@100/echo@8
/virtual-devices@100/loop@6
/virtual-devices@100/loop@7
/virtual-devices@100/rtc@5
/virtual-devices@100/ncp@4
/virtual-devices@100/console@1
/virtual-devices@100/flashprom@0
/virtual-devices@100/nvram@3
/openprom/client-services
/packages/SUNW,asr
/packages/obp-tftp
/packages/dropins
/packages/terminal-emulator
/packages/disk-label
/packages/deblocker
/packages/SUNW,builtin-drivers
{0} ok
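
A show-devs listing saved from the console can be summarized with a short script, for example to confirm how many cpu or network nodes the firmware reports. The following Python sketch is an illustration, not a Sun-provided tool. The function name summarize_device_paths is hypothetical, and the sample is a shortened excerpt in the same format as CODE EXAMPLE 4-2.

```python
from collections import Counter


def summarize_device_paths(lines):
    """Count device-tree nodes by the name of their last path component."""
    counts = Counter()
    for line in lines:
        path = line.strip()
        if not path.startswith("/"):
            continue  # skip prompt lines such as "{0} ok show-devs"
        leaf = path.rsplit("/", 1)[1]      # e.g. "network@0,1"
        name = leaf.split("@", 1)[0]       # drop the unit address
        counts[name] += 1
    return counts


# Shortened excerpt in the show-devs output format.
SAMPLE = """\
{0} ok show-devs
/pci@7c0
/pci@780
/cpu@1
/cpu@0
/virtual-devices@100
/pci@7c0/network@0,1
/pci@7c0/network@0
/pci@780/pci@0/pci@2/network@0,1
/pci@780/pci@0/pci@2/network@0
/packages/SUNW,asr
{0} ok
""".splitlines()

if __name__ == "__main__":
    for name, count in sorted(summarize_device_paths(SAMPLE).items()):
        print(name, count)
```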

4.3.3.3 Checking Network Using watch-net and watch-net-all Commands

The watch-net diagnostic test monitors Ethernet packets on the primary network interface. The watch-net-all diagnostic test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the blade server. Good packets received by the system are indicated by a period (.). Errors such as framing errors and cyclic redundancy check (CRC) errors are indicated with an X and an associated error description.

  • To start the watch-net diagnostic test, type the watch-net command at the ok prompt.



{0} ok watch-net
Internal loopback test -- succeeded.
Link is -- up
Looking for Ethernet Packets.
‘.’ is a Good Packet. ‘X’ is a Bad Packet.
Type any key to stop.................................
 

  • To start the watch-net-all diagnostic test, type watch-net-all at the ok prompt.



{0} ok watch-net-all
/pci@1f,0/pci@1,1/network@c,1
Internal loopback test -- succeeded.
Link is -- up 
Looking for Ethernet Packets.
‘.’ is a Good Packet. ‘X’ is a Bad Packet.
Type any key to stop.
 

 

4.3.4 OpenBoot Configuration Variables

The OpenBoot configuration variables are stored in the OBP flash PROM and determine how and when OpenBoot tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 4-3.

Changes to OpenBoot configuration variables take effect at the next reboot.

 


TABLE 4-3 OpenBoot Configuration Variables

| Variable | Possible Values | Default Value | Description |
|---|---|---|---|
| local-mac-address? | true, false | true | If true, network drivers use their own MAC address, not the server MAC address. |
| fcode-debug? | true, false | false | If true, include name fields for plug-in device FCodes. |
| scsi-initiator-id | 0-15 | 7 | SCSI ID of the Serial Attached SCSI controller. |
| oem-logo? | true, false | false | If true, use custom OEM logo; otherwise, use Sun logo. |
| oem-banner? | true, false | false | If true, use custom OEM banner. |
| ansi-terminal? | true, false | true | If true, enable ANSI terminal emulation. |
| screen-#columns | 0-n | 80 | Sets number of columns on screen. |
| screen-#rows | 0-n | 34 | Sets number of rows on screen. |
| ttya-mode | 9600,8,n,1,- | 9600,8,n,1,- | Serial management port (baud rate, bits, parity, stop, handshake). The serial management port only works at the default values. |
| output-device | virtual-console, screen | virtual-console | Power-on output device. |
| input-device | virtual-console, keyboard | virtual-console | Power-on input device. |
| auto-boot-on-error? | true, false | false | If true, boot automatically after system error. |
| load-base | 0-n | 16384 | Address. |
| auto-boot? | true, false | true | If true, boot automatically after power on or reset. |
| network-boot-arguments | [protocol, ] [key=value, ] | none | Arguments to be used by the PROM for network booting. Defaults to an empty string. network-boot-arguments can be used to specify the boot protocol (RARP/DHCP) to be used and a range of system knowledge to be used in the process. For further information, see the eeprom(1M) man page or your Solaris Reference Manual. |
| boot-command | variable-name | boot | Action following a boot command. |
| boot-file | variable-name | none | File from which to boot if diag-switch? is false. |
| boot-device | variable-name | disk net | Device(s) from which to boot if diag-switch? is false. |
| use-nvramrc? | true, false | false | If true, execute commands in NVRAMRC during server startup. |
| nvramrc | variable-name | none | Command script to execute if use-nvramrc? is true. |
| security-mode | none, command, full | none | Firmware security level. |
| security-password | variable-name | none | Firmware security password if security-mode is not none (never displayed). Do not set this directly. |
| security-#badlogins | variable-name | none | Number of incorrect security password attempts. |
| verbosity | max, min, none, normal | min | Controls the amount and detail of OpenBoot output. none: only error and fatal messages are displayed on the system console. min: notice, error, warning, and fatal messages are displayed on the system console. normal: summary progress and operational messages are displayed on the system console in addition to the messages displayed by the min setting. max: detailed progress and operational messages are displayed on the system console. |
| diag-switch? | true, false | false | If true, after a boot request, boot diag-file from diag-device. If false, after a boot request, boot boot-file from boot-device. |
| error-reset-recovery | boot, none, sync | boot | Specifies recovery action after an error reset. none: no recovery action. boot: system attempts to boot. sync: firmware attempts to execute a Solaris sync callback routine. |

4.3.4.1 Viewing and Setting OpenBoot Configuration Variables

  • Halt the server to display the ok prompt.

  • To display the current values of the OpenBoot configuration variables, type printenv at the ok prompt.

The following example shows a short excerpt of this command’s output.


TABLE 4-4
ok printenv
Variable Name         Value                          Default Value
 
local-mac-address?      true                           true
fcode-debug?            false                          false
scsi-initiator-id       7                              7
oem-logo?               false                          false
boot-command            boot                           boot
boot-file
boot-device             disk net                       disk net
use-nvramrc?            false                          false
nvramrc
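
When many variables are involved, it can help to list only those whose current value differs from the default. The following Python sketch parses saved printenv output in the three-column layout of the excerpt above. It is illustrative only: the function names are hypothetical, and the parser assumes that columns are separated by at least two spaces and that a row with a single field has neither a value nor a default.

```python
import re


def parse_printenv(output):
    """Map variable name -> (value, default) from saved printenv text."""
    variables = {}
    for line in output.splitlines():
        # Columns are separated by runs of two or more spaces, so a value
        # such as "disk net" (one space) stays in a single field.
        fields = re.split(r"\s{2,}", line.strip())
        if not fields[0] or "printenv" in line or fields[0] == "Variable Name":
            continue  # blank line, command echo, or header row
        name = fields[0]
        value = fields[1] if len(fields) > 1 else ""
        default = fields[2] if len(fields) > 2 else ""
        variables[name] = (value, default)
    return variables


def changed_from_default(variables):
    """Return the sorted names of variables whose value differs from the default."""
    return sorted(name for name, (value, default) in variables.items()
                  if value != default)


# Hypothetical excerpt: auto-boot-on-error? has been changed to true.
SAMPLE = """\
ok printenv
Variable Name         Value                          Default Value

local-mac-address?      true                           true
auto-boot-on-error?     true                           false
boot-file
boot-device             disk net                       disk net
"""

if __name__ == "__main__":
    print(changed_from_default(parse_printenv(SAMPLE)))
```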


4.4 Error Handling Summary

Error handling during the power-on sequence falls into one of the following three cases:



Note - If POST or OpenBoot firmware detects a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.



4.5 Automatic System Recovery

Automatic system recovery (ASR) consists of self-test features and an autoconfiguration capability to detect failed hardware components and unconfigure them. With ASR enabled, the server can resume operating after certain nonfatal hardware faults or failures have occurred.

If a component is monitored by ASR and the server is capable of operating without it, the server automatically reboots if that component develops a fault or fails. This capability prevents a faulty hardware component from stopping operation of the entire system or causing the system to fail repeatedly.

If a fault is detected during the power-on sequence, the faulty component is disabled. If the system remains capable of functioning, the boot sequence continues.

To support this degraded boot capability, the OpenBoot firmware uses the 1275 client interface (by means of the device tree) to mark a device as either failed or disabled, creating an appropriate status property in the device tree node. The Solaris OS does not activate a driver for any subsystem marked in this way.

As long as a failed component is electrically dormant (not causing random bus errors or signal noise, for example), the system reboots automatically and resumes operation while a service call is made.

Once a failed or disabled device is replaced with a new one, the OpenBoot firmware automatically modifies the status of the device upon reboot.



Note - ASR is not enabled until you activate it (see Section 4.5.1.1, To Enable Automatic System Recovery).


4.5.1 Enabling and Disabling Automatic System Recovery

The automatic system recovery (ASR) feature is not activated until you enable it. Enabling ASR requires changing configuration variables in ALOM as well as OpenBoot.

4.5.1.1 To Enable Automatic System Recovery

1. At the ok prompt, type:


 
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true

2. To cause the parameter changes to take effect, type:


TABLE 4-6
ok reset-all

The system permanently stores the parameter changes and boots automatically when the OpenBoot configuration variable auto-boot? is set to true (its default value).



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.


4.5.1.2 To Disable Automatic System Recovery

1. At the ok prompt, type:


 
ok setenv auto-boot-on-error? false

2. To cause the parameter changes to take effect, type:


TABLE 4-7
ok reset-all

The system permanently stores the parameter change.



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.


After you disable the automatic system recovery (ASR) feature, it is not activated again until you re-enable it.


4.6 Hot-Swap Information

The Sun Netra CP3060 blade server supports hot-swapping and includes a blue Hot-Swap LED.

4.6.1 Hot-Swapping the Sun Netra CP3060 Blade Server

If the Solaris OS is running on a Sun Netra CP3060 blade server and you open the blade server’s latches, you see a message that the operating system will shut down in one minute. When the blue LED on the blade server lights steadily, you can safely remove the blade server.

4.6.1.1 Hot-Swap LED

The blue Hot-Swap LED, located on the front panel of the Sun Netra CP3060 blade server (FIGURE 1-1), blinks when a hot-swap is initiated, and lights steadily when the blade server is ready to be removed from the system.

Unlatching the bottom latch on the Sun Netra CP3060 blade server initiates the hot-swap sequence. The LED lights steadily when the blade server can be safely removed from the system. The reverse is true when a Sun Netra CP3060 blade server is installed into the system. Once the Sun Netra CP3060 blade server is installed into the system and the bottom latch is latched, the blue Hot-Swap LED blinks until the blade server is ready and then turns off. The green LED lights steadily when the blade server is ready.

FIGURE 4-1 shows the hot-swap latch and Hot-Swap LED.

FIGURE 4-1 Hot-Swap Latch and Hot-Swap LED


Figure showing the proper way to press down on the ejector handle.


4.7 Network Device Aliases

A device alias is a shorthand representation of a device path. The OpenBoot firmware provides some predefined device aliases for the network devices so that you do not need to type the full device path name. TABLE 4-8 lists the network device aliases, the default Solaris OS device names, and associated ports for the Sun Netra CP3060 blade server. To display the device aliases, use the devalias command at the ok prompt.


TABLE 4-8 Network Device Aliases

| Device Alias | Default Solaris 10 OS Device Name | Port |
|---|---|---|
| net, net0 | e1000g0 | Base Interface Ethernet A, Management Ethernet A (Ethernet port A on front panel), RTM Ethernet port |
| net1 | e1000g1 | Base Interface Ethernet B, Management Ethernet A (Ethernet port A on front panel) |
| net2 | e1000g2 | Extended Interface Ethernet A (PICMG 3.1) |
| net3 | e1000g3 | Extended Interface Ethernet B (PICMG 3.1) |
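
The alias-to-device mapping in TABLE 4-8 can be captured in a small lookup table, for example in a provisioning script that translates between firmware aliases and Solaris interface names. This is only a sketch. NETWORK_ALIASES and solaris_device_for are hypothetical names, and the values restate the defaults listed in the table, which a particular installation might not use.

```python
# Default mapping from TABLE 4-8: network device alias -> Solaris 10 device name.
NETWORK_ALIASES = {
    "net": "e1000g0",
    "net0": "e1000g0",
    "net1": "e1000g1",
    "net2": "e1000g2",
    "net3": "e1000g3",
}


def solaris_device_for(alias):
    """Return the default Solaris device name for a network device alias."""
    try:
        return NETWORK_ALIASES[alias]
    except KeyError:
        raise ValueError(f"unknown network device alias: {alias!r}") from None


if __name__ == "__main__":
    for alias in ("net", "net2"):
        print(alias, "->", solaris_device_for(alias))
```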



4.8 Retrieving Device Information

You use the Solaris platform information and control library (PICL) framework for obtaining the state and condition of the Sun Netra CP3060 blade server.

The PICL framework provides information about the system configuration that it maintains in the PICL tree. Within this PICL tree is a subtree named frutree, which represents the hierarchy of system field-replaceable units (FRUs) with respect to a root node in the tree called chassis. The frutree represents physical resources of the system. The PICL tree is updated whenever a change occurs in a device’s status.

TABLE 4-9 shows the frutree entries and properties that describe the condition of the Sun Netra CP3060 blade server.


TABLE 4-9 PICL Frutree Entries and Description for the Sun Netra CP3060 Blade Server

| Frutree Entry:Property | Entry Description | Example of Condition |
|---|---|---|
| CPU (location):State | State of the receptacle or slot | connected |
| CPU (fru):Condition | Condition of the blade server or occupant | ok |
| CPU (fru):State | State of the blade server or occupant | configured |
| CPU (fru):FRUType | FRU type | bridge/fhs |


The prtpicl -v command shows the condition of all devices in the PICL tree. Sample output from the prtpicl command on the Sun Netra CP3060 blade server is shown in CODE EXAMPLE 4-3.


CODE EXAMPLE 4-3 prtpicl Command Output
# prtpicl  
/ (picl, 5a00000001)
     platform (sun4v, 5a00000005)
         scsi_vhci (devctl, 5a00000021)
         memory (obp-device, 5a000000cf)
         virtual-devices (virtual-devices, 5a000000e1)
             nvram (nvram, 5a000000f4)
             flashprom (obp-device, 5a000000fc)
             console (serial, 5a00000103)
             ncp (obp-device, 5a00000113)
             rtc (obp-device, 5a00000120)
             loop (obp-device, 5a00000128)
             loop (obp-device, 5a00000138)
             echo (obp-device, 5a00000148)
             fma (obp-device, 5a00000158)
             sunvts (obp-device, 5a00000168)
             sunmc (obp-device, 5a00000178)
             explorer (obp-device, 5a00000188)
             led (obp-device, 5a00000198)
             ipmi (obp-device, 5a000001a8)
         cpu (cpu, 5a000001b8)
         cpu (cpu, 5a000001c6)
         cpu (cpu, 5a000001d4)
         cpu (cpu, 5a000001e2)
         cpu (cpu, 5a000001f0)
         cpu (cpu, 5a000001fe)
         cpu (cpu, 5a0000020c)
         cpu (cpu, 5a0000021a)
         cpu (cpu, 5a00000228)
         cpu (cpu, 5a00000236)
         cpu (cpu, 5a00000244)
         cpu (cpu, 5a00000252)
         cpu (cpu, 5a00000260)
         cpu (cpu, 5a0000026e)
         cpu (cpu, 5a0000027c)
         cpu (cpu, 5a0000028a)
         cpu (cpu, 5a00000298)
         cpu (cpu, 5a000002a6)
         cpu (cpu, 5a000002b4)
         cpu (cpu, 5a000002c2)
         cpu (cpu, 5a000002d0)
         cpu (cpu, 5a000002de)
         cpu (cpu, 5a000002ec)
         cpu (cpu, 5a000002fa)
         pci (pciex, 5a00000308)
             pci (pciex, 5a0000032a)
                 pci (pciex, 5a00000347)
                     pci (pciex, 5a00000363)
                         ide (ide, 5a00000384)
                         ide (ide, 5a000003a8)
                             dad (block, 5a000003d3)
                 pci (pciex, 5a000003ea)
                     network (network, 5a00000407)
                     network (network, 5a00000438)
                 pci (pciex, 5a00000455)
                 pci (pciex, 5a0000046f)
         pci (pciex, 5a00000487)
             network (network, 5a000004a7)
             network (network, 5a000004c4)
         pseudo (devctl, 5a000004f6)
     obp (picl, 5a0000001e)
         ib (ib, 5a00000032)
         packages (packages, 5a0000003e)
             SUNW,builtin-drivers (SUNW,builtin-drivers, 5a00000044)
             deblocker (deblocker, 5a0000004a)
             disk-label (disk-label, 5a00000051)
             terminal-emulator (terminal-emulator, 5a00000057)
             dropins (dropins, 5a0000005e)
             obp-tftp (obp-tftp, 5a00000065)
             SUNW,asr (SUNW,asr, 5a0000006b)
             ufs-file-system (ufs-file-system, 5a00000072)
         chosen (chosen, 5a00000079)
         openprom (openprom, 5a00000086)
             client-services (client-services, 5a00000090)
         options (options, 5a00000096)
         aliases (aliases, 5a000000be)
         virtual-memory (virtual-memory, 5a000000d7)
         iscsi (iscsi, 5a000004e1)

For more information on the PICL framework, refer to the picld(1M) man page.
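
Output in the format of CODE EXAMPLE 4-3 is regular enough to post-process, for example to count how many nodes of each class the PICL tree contains. The following Python sketch is illustrative and not part of the Solaris tools. The names parse_prtpicl and count_by_class are hypothetical, and the parser assumes each node line has the form name (class, handle), as in the sample output.

```python
import re
from collections import Counter

# One node line looks like "         cpu (cpu, 5a000001b8)":
# leading indentation, node name, then "(class, hexadecimal handle)".
NODE_RE = re.compile(r"^(\s*)(\S+) \((\S+), ([0-9a-f]+)\)\s*$")


def parse_prtpicl(output):
    """Return a list of (indent_width, name, class_name, handle) tuples."""
    nodes = []
    for line in output.splitlines():
        match = NODE_RE.match(line)
        if not match:
            continue  # skip the command line and anything unrecognized
        indent, name, class_name, handle = match.groups()
        nodes.append((len(indent), name, class_name, handle))
    return nodes


def count_by_class(nodes):
    """Count nodes per PICL class name."""
    return Counter(class_name for _, _, class_name, _ in nodes)


# Shortened excerpt in the prtpicl output format.
SAMPLE = """\
# prtpicl
/ (picl, 5a00000001)
     platform (sun4v, 5a00000005)
         cpu (cpu, 5a000001b8)
         cpu (cpu, 5a000001c6)
         pci (pciex, 5a00000487)
             network (network, 5a000004a7)
             network (network, 5a000004c4)
"""

if __name__ == "__main__":
    for class_name, count in sorted(count_by_class(parse_prtpicl(SAMPLE)).items()):
        print(class_name, count)
```

The indentation width is kept with each node so that a caller can reconstruct the parent and child relationships if needed.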


4.9 Mandatory /etc/system File Entry

A mandatory entry must be listed in the /etc/system file to ensure the optimal functionality of the server.

The following entry must be in the /etc/system file:

set pcie:pcie_aer_ce_mask=0x1

Check that the entry is present before deploying the server.


To Check and Create the Mandatory /etc/system File Entry

1. Log in as superuser.

2. Check the /etc/system file to see if the mandatory line is present.


TABLE 4-10
# more /etc/system
*ident  "@(#)system     1.18 05/06/27 SMI" /* SVR4 1.5 */
*
* SYSTEM SPECIFICATION FILE
.
.
.
set pcie:pcie_aer_ce_mask=0x1
.

3. If the entry is not there, add it.

Use an editor to edit the /etc/system file and add the entry.

4. Reboot the server.
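
The check in Step 2 and the addition in Step 3 can also be scripted. The following Python sketch works on the text of the file and does not modify /etc/system itself. The function names has_entry and ensure_entry are hypothetical, and the sketch relies on the convention, visible in the listing above, that comment lines in /etc/system begin with an asterisk.

```python
MANDATORY_ENTRY = "set pcie:pcie_aer_ce_mask=0x1"


def has_entry(system_text, entry=MANDATORY_ENTRY):
    """Return True if an uncommented line matches the entry."""
    for line in system_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*"):
            continue  # comment line
        # Collapse runs of whitespace before comparing.
        if " ".join(stripped.split()) == entry:
            return True
    return False


def ensure_entry(system_text, entry=MANDATORY_ENTRY):
    """Return the file text with the entry appended if it was missing."""
    if has_entry(system_text, entry):
        return system_text
    if system_text and not system_text.endswith("\n"):
        system_text += "\n"
    return system_text + entry + "\n"


# Hypothetical file content without the mandatory entry.
SAMPLE = """\
*ident  "@(#)system     1.18 05/06/27 SMI" /* SVR4 1.5 */
*
* SYSTEM SPECIFICATION FILE
* set pcie:pcie_aer_ce_mask=0x1
"""

if __name__ == "__main__":
    print(has_entry(SAMPLE))
    print(ensure_entry(SAMPLE), end="")
```

To apply the result, read the file as superuser, pass its content to ensure_entry, write the returned text back, and then reboot the server as described in Step 4.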