CHAPTER 1

Server Diagnostics

This chapter describes the diagnostics that are available for monitoring and troubleshooting the server.

The following topics are covered:


1.1 Fault on Initial Power Up

If, on initial power up of a newly installed server, you see errors indicating faults with the Fully Buffered DIMMs (FB-DIMMs), PCI cards, or other components, the suspect component might have worked loose during shipment.

Conduct a visual inspection of the server internals and its components. Remove the top cover and physically reseat the cable connections, the PCI cards, and the FB-DIMMs. See:

If these steps do not resolve the errors, continue to Server Diagnostics Overview.


1.2 Server Diagnostics Overview

There are a variety of diagnostic tools, commands, and indicators you can use to monitor and troubleshoot a server:

The LEDs, ILOM, Solaris OS PSH, and many of the log files and console messages are integrated. For example, when the Solaris software detects a fault, it displays the fault, logs it, and passes information to ILOM, where the fault is also logged. Depending on the fault, one or more LEDs might light.

The diagnostic flowchart in FIGURE 1-1 and the actions in TABLE 1-1 describe an approach for using the server diagnostics to identify a faulty field-replaceable unit (FRU). The diagnostics you use, and the order in which you use them, depend on the nature of the problem you are troubleshooting, so you might perform some actions and not others.

The flowchart assumes that you have already performed some rudimentary troubleshooting such as verification of proper installation, visual inspection of cables and power, and possibly performed a reset of the server (refer to the server installation guide and server administration guide for details).

Use this flowchart to understand what diagnostics are available to troubleshoot faulty hardware. Use TABLE 1-1 to find more information about each diagnostic in this chapter.

FIGURE 1-1 Diagnostic Flowchart




TABLE 1-1 Diagnostic Flowchart Actions

Action No.

Diagnostic Action

Resulting Action

Additional Information

1.

Check Power OK and Input OK LEDs on the server.

The Power OK LED is located on the front and rear of the chassis.

The Input OK LED is located on the rear of the server on each power supply.

If these LEDs are not on, check the power source and power connections to the server.

Using LEDs to Identify the State of Devices

2.

Run the ALOM CMT CLI showfaults command to check for faults.

The showfaults command displays the following kinds of faults:

  • Environmental faults
  • Solaris Predictive Self-Healing (PSH) detected faults
  • POST detected faults

Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 2-1.

Displaying System Faults

3.

Check the Solaris log files for fault information.

The Solaris message buffer and log files record system events and provide information about faults.

  • If system messages indicate a faulty device, replace the FRU.
  • To obtain more diagnostic information, go to Action 4.

Collecting Information From Solaris OS Files and Commands

4.

Run SunVTS.

SunVTS is an application you can run to exercise and diagnose FRUs. To run SunVTS, the server must be running the Solaris OS.

  • If SunVTS reports a faulty device, replace the FRU.
  • If SunVTS does not report a faulty device, go to Action 5.

Exercising the System With SunVTS Software

5.

Run POST.

POST performs basic tests of the server components and reports faulty FRUs.

  • If POST indicates a faulty FRU, replace the FRU.
  • If POST does not indicate a faulty FRU, go to Action 9.

Running POST

6.

Determine if the fault is an environmental fault.

If the showfaults command lists a temperature or voltage fault, the fault is an environmental fault. Environmental faults can be caused by faulty FRUs (power supply, fan, or blower) or by environmental conditions, such as a computer room ambient temperature that is too high or blocked server airflow. When the environmental condition is corrected, the fault clears automatically.

Displaying System Faults

 

 

If the fault indicates that a fan, blower, or power supply is bad, you can perform a hot-swap of the FRU. You can also use the fault LEDs on the server to identify the faulty FRU (fans, blower, and power supplies).

Using LEDs to Identify the State of Devices

7.

Determine if the fault was detected by PSH.

If the fault message displays the following text, the fault was detected by the Solaris Predictive Self-Healing software:
Host detected fault

Using the Solaris Predictive Self-Healing Feature

 

 

If the fault is a PSH detected fault, identify the faulty FRU from the fault message and replace the faulty FRU.

Clearing PSH Detected Faults

 

 

After replacing the FRU, perform the procedure to clear PSH detected faults.

 

8.

Determine if the fault was detected by POST.

POST performs basic tests of the server components and reports faulty FRUs. When POST detects a faulty FRU, it logs the fault and, if possible, takes the FRU offline. Faults detected by POST display the following text in the fault message:

FRU-name deemed faulty and disabled

Running POST

 

 

In this case, replace the FRU and run the procedure to clear POST detected faults.

Clearing POST Detected Faults
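The fault-message markers in Actions 6 through 8 lend themselves to a simple triage helper. The following Python sketch is illustrative only and is not part of the server software. Only the two marker strings come from TABLE 1-1; the function name, the keyword list used to recognize environmental faults, the follow-up wording, and the sample message are assumptions made for this example.

```python
# Marker strings quoted from TABLE 1-1 (Actions 7 and 8).
PSH_MARKER = "Host detected fault"
POST_MARKER = "deemed faulty and disabled"
# Assumption: environmental faults mention temperature or voltage (Action 6).
ENV_KEYWORDS = ("temperature", "voltage")

# Follow-up for each fault class, paraphrased from TABLE 1-1.
NEXT_STEP = {
    "environmental": "Correct the condition or hot-swap the fan, blower, or power supply (Action 6).",
    "psh": "Replace the FRU, then clear the PSH detected fault (Action 7).",
    "post": "Replace the FRU, then clear the POST detected fault (Action 8).",
    "unknown": "Continue with the Solaris log files (Action 3).",
}


def classify_fault(message):
    """Return 'psh', 'post', 'environmental', or 'unknown' for one fault message."""
    text = message.lower()
    if PSH_MARKER.lower() in text:
        return "psh"
    if POST_MARKER.lower() in text:
        return "post"
    if any(word in text for word in ENV_KEYWORDS):
        return "environmental"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical message text, used only to exercise the helper.
    sample = "/SYS/MB/CMP0/BR0/CH0/D0 deemed faulty and disabled"
    kind = classify_fault(sample)
    print(kind, "->", NEXT_STEP[kind])
```

The PSH and POST markers are checked before the environmental keywords so that a message containing one of the quoted markers is never misread as an environmental fault.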


1.2.1 Memory Configuration and Fault Handling

A variety of features play a role in how the memory subsystem is configured and how memory faults are handled. Understanding the underlying features helps you identify and repair memory problems. This section describes how the memory is configured and how the server deals with memory faults.

1.2.1.1 Memory Configuration

The server has 16 slots that hold DDR-2 FB-DIMMs in the following sizes:

FB-DIMMs are installed in groups of 8, called ranks (ranks 0 and 1). At minimum, rank 0 must be fully populated with eight FB-DIMMs of the same capacity. A second rank of FB-DIMMs of the same capacity can be added to fill rank 1.

See Replacing FB-DIMMs for instructions about adding memory to a server.
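The population rules above can be expressed as a short check. This Python sketch is a hypothetical planning aid, not a Sun tool. It assumes capacities are given in gigabytes and checks only that each populated rank holds eight FB-DIMMs of one capacity; whether rank 1 must also match the capacity of rank 0 is not stated here, so the sketch does not enforce it.

```python
from typing import List, Sequence

RANK_SIZE = 8  # FB-DIMMs per rank, as described in this section


def check_population(rank0_gb: Sequence[int], rank1_gb: Sequence[int] = ()) -> List[str]:
    """Return a list of rule violations; an empty list means no rule is broken."""
    problems = []
    # Rank 0 is mandatory and must be fully and uniformly populated.
    if len(rank0_gb) != RANK_SIZE:
        problems.append("rank 0 must hold exactly 8 FB-DIMMs")
    elif len(set(rank0_gb)) != 1:
        problems.append("rank 0 FB-DIMMs must all have the same capacity")
    # Rank 1 is optional, but if used it must also be full and uniform.
    if rank1_gb:
        if len(rank1_gb) != RANK_SIZE:
            problems.append("rank 1 must be empty or hold exactly 8 FB-DIMMs")
        elif len(set(rank1_gb)) != 1:
            problems.append("rank 1 FB-DIMMs must all have the same capacity")
    return problems


if __name__ == "__main__":
    # Hypothetical layout: rank 0 full, rank 1 only half populated.
    print(check_population([2] * 8, [2] * 4))
```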

1.2.1.2 Memory Fault Handling

The server uses an advanced ECC technology, called chipkill, that corrects up to 4 bits in error on nibble boundaries, as long as all of the bits are in the same DRAM. If a DRAM fails, the FB-DIMM continues to function.

The following server features independently manage memory faults:

For correctable memory errors (CEs), POST forwards the error to the Solaris Predictive Self-Healing (PSH) daemon for error handling. If an uncorrectable memory fault is detected or if a “storm” of CEs is detected, POST displays the fault with the device name of the faulty FB-DIMMs, logs the fault, and disables the faulty FB-DIMMs by placing them in the ASR blacklist. Depending on the memory configuration and the location of the faulty FB-DIMM, POST disables half of physical memory in the system, or half the physical memory and half the processor threads. When this offlining process occurs in normal operation, you must replace the faulty FB-DIMMs based on the fault message. You then must enable the disabled FB-DIMMs with the ALOM CMT CLI enablecomponent command.
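The handling described in the preceding paragraph amounts to a two-way decision. The Python sketch below restates it for illustration; the function and argument names are hypothetical, and the returned strings paraphrase this section rather than quoting firmware output.

```python
def memory_fault_actions(uncorrectable, ce_storm=False):
    """List what POST does for a memory error, per the description above."""
    if not uncorrectable and not ce_storm:
        # An isolated correctable error (CE) is handed to PSH.
        return ["forward the error to the Solaris PSH daemon"]
    # An uncorrectable error, or a storm of CEs, takes the FB-DIMMs offline.
    return [
        "display the fault with the device name of the faulty FB-DIMMs",
        "log the fault",
        "disable the faulty FB-DIMMs (ASR blacklist)",
    ]


if __name__ == "__main__":
    for step in memory_fault_actions(uncorrectable=False, ce_storm=True):
        print(step)
```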

1.2.1.3 Troubleshooting Memory Faults

If you suspect that the server has a memory problem, follow the flowchart (FIGURE 1-1). Run the showfaults command from the ALOM CMT compatibility CLI (in ILOM). See Using the ALOM CMT Compatibility CLI in ILOM and Displaying System Faults. The showfaults command lists memory faults and the specific FB-DIMMs that are associated with each fault. Once you identify which FB-DIMMs to replace, see Replacing FB-DIMMs for replacement instructions. You must perform the instructions in that chapter to clear the faults and enable the replaced FB-DIMMs.


1.3 Using LEDs to Identify the State of Devices

The server provides the following groups of LEDs:

These LEDs provide a quick visual check of the state of the system.

1.3.1 Front and Rear Panel LEDs

The seven front panel LEDs (FIGURE 1-2) are located in the upper left corner of the server chassis. Three of these LEDs are also provided on the rear panel (FIGURE 1-3).

FIGURE 1-2 Location of the Bezel Server Status and Alarm Status Indicators


Figure showing the location of the server and alarm status indicators on the front bezel


Figure Legend

1   User (amber) Alarm Status Indicator
2   Minor (amber) Alarm Status Indicator
3   Major (red) Alarm Status Indicator
4   Critical (red) Alarm Status Indicator
5   Locator LED and Button
6   Fault LED
7   Activity LED
8   Power OK LED


FIGURE 1-3 Rear Panel Connectors, LEDs, and Features on the Sun Netra T5220 Server


Figure showing the rear panel connectors, LEDs, and features


Figure Legend

1   Power Supply 0 LEDs top to bottom: Locator LED and Button, Service Required LED, Power OK LED
2   Power Supply 0
3   Power Supply 1 LEDs top to bottom: Locator LED Button, Service Required LED, Power OK LED
4   Power Supply 1
5   Captive screw for securing motherboard (1 of 2)
6   System LEDs left to right: Locator LED Button, Service Required LED, Power OK LED
7   Service Processor Serial Management Port
8   Service Processor Network Management Port
9   Captive screws for securing the bottom PCI cards. Note that there are two screws on either side of each bottom PCI card (total 6).
10  Gigabit Ethernet Ports left to right: NET0, NET1, NET2, NET3
11  Alarm Port
12  USB ports left to right: USB0, USB1
13  TTYA Serial Port
14  Captive screw for securing motherboard (2 of 2)
15  PCI-X Slot 3
16  PCIe or XAUI Slot 0
17  PCI-X Slot 4
18  PCIe or XAUI Slot 1
19  PCIe Slot 5
20  PCIe Slot 2


TABLE 1-2 lists and describes the front and rear panel LEDs.


TABLE 1-2 Front and Rear Panel LEDs

LED

Location

Color

Description

Locator LED and Button

Front upper left and rear center

White

Enables you to identify a particular server. The LED is activated using one of the following methods:

  • Issuing the setlocator on or off command.
  • Pressing the button to toggle the indicator on or off.

This LED provides the following indications:

  • Off - Normal operating state.
  • Fast blink - The server received a signal as a result of one of the preceding methods.

Fault LED

Front upper left and rear center

Amber

If on, indicates that service is required. The ALOM CMT CLI showfaults command provides details about any faults that cause this indicator to be lit.

Activity LED

Front upper left

Green

  • On - Drives are receiving power. Solidly lit if drive is idle.
  • Flashing - Drives are processing a command.
  • Off - Power is off.

Power Button

Front upper left

 

Turns the host system on and off. This button is recessed to prevent accidental server power-off. Use the tip of a pen to operate this button.

Alarm:Critical LED

Front left

Red

Indicates a critical alarm. Refer to the server administration guide for a description of alarm states.

Alarm:Major LED

Front left

Red

Indicates a major alarm.

Alarm:Minor LED

Front left

Amber

Indicates a minor alarm.

Alarm:User LED

Front left

Amber

Indicates a user alarm.

Power OK LED

Rear center

Green

The LED provides the following indications:

  • Off - The system is unavailable. Either the system has no power or ILOM is not running.
  • Steady on - Indicates that the system is powered on and is running in its normal operating state.
  • Standby blink - Indicates that the service processor is running while the system is running at a minimum level in Standby mode, and is ready to be returned to its normal operating state.
  • Slow blink - Indicates that a normal transitory activity is taking place. The system diagnostics might be running, or the system might be booting.

1.3.2 Hard Drive LEDs

The hard drive LEDs (FIGURE 1-4 and TABLE 1-3) are located on the front of each hard drive that is installed in the server chassis.

FIGURE 1-4 Hard Drive LEDs


Figure showing the hard drive LEDs.


Figure Legend

1

OK to Remove

2

Fault

3

Activity



TABLE 1-3 Hard Drive LEDs

LED

Color

Description

OK to Remove

Blue

  • On - The drive is ready for hot-plug removal.
  • Off - Normal operation.

Fault

Amber

  • On - The drive has a fault and requires attention.
  • Off - Normal operation.

Activity

Green

  • On - The drive is receiving power. Solidly lit if drive is idle.
  • Flashing - The drive is processing a command.
  • Off - Power is off.

1.3.3 Power Supply LEDs

The power supply LEDs (FIGURE 1-5 and TABLE 1-4) are located on the rear of each power supply.

FIGURE 1-5 Power Supply LEDs


Figure showing the power supply LEDs


Figure Legend

1

Power OK power supply LED

2

Fault power supply LED

3

Input OK power supply LED



TABLE 1-4 Power Supply LEDs

LED

Color

Description

Power OK

Green

  • On - Normal operation. DC output voltage is within normal limits.
  • Off - Power is off.

Fault

Amber

  • On - Power supply has detected a failure.
  • Off - Normal operation.

Input OK

Green

  • On - Normal operation. Input power is within normal limits.
  • Off - No input voltage, or input voltage is below limits.

1.3.4 Ethernet Port LEDs

The ILOM management Ethernet port and the four 10/100/1000 Mbps Ethernet ports each have two LEDs, as shown in FIGURE 1-6 and described in TABLE 1-5.

FIGURE 1-6 Ethernet Port LEDs


Figure showing the Ethernet LEDs


Figure Legend

1

Link/Activity indicator LED (Same location for all Ethernet ports)

2

Speed indicator LED (Same location for all Ethernet ports)



TABLE 1-5 Ethernet Port LEDs

LED

Color

Description

Left LED

Green

Link/Activity indicator:

  • Steady On - A link is established.
  • Blinking - There is activity on this port.
  • Off - No link is established.

Right LED

Amber
or
Green

Speed indicator:

  • Amber On - The link is operating as a Gigabit connection (1000-Mbps).
  • Green On - The link is operating as a 100-Mbps connection.
  • Off - The link is operating as a 10-Mbps connection.



Note - The NET MGT port operates only at 100 Mbps or 10 Mbps, so the speed indicator LED can be green or off (never amber).
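As a quick reference, the speed indicator can be decoded with a small lookup. This Python sketch is illustrative only. It reads an unlit speed LED on an established link as 10 Mbps, which is how the note above pairs the two NET MGT speeds with green and off; the function name and the port labels passed to it are assumptions.

```python
# Speed in Mbps for each state of the right-hand (speed) LED.
SPEED_BY_LED = {"amber": 1000, "green": 100, "off": 10}


def link_speed_mbps(led, port="NET0"):
    """Translate the speed LED state into a link speed in Mbps."""
    color = led.strip().lower()
    if color not in SPEED_BY_LED:
        raise ValueError("unknown LED state: %r" % led)
    # The management port never runs at Gigabit speed, so amber is invalid there.
    if port == "NET MGT" and color == "amber":
        raise ValueError("the NET MGT speed LED is never amber")
    return SPEED_BY_LED[color]


if __name__ == "__main__":
    print(link_speed_mbps("Green", port="NET MGT"))
```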



1.4 Using the Service Processor Firmware for Diagnosis and Repair Verification

The Sun Integrated Lights Out Manager (ILOM) firmware runs on the service processor in the server and enables you to remotely manage and administer your server.

ILOM enables you to remotely run diagnostics, such as power-on self-test (POST), that would otherwise require physical proximity to the server’s serial port. You can also configure ILOM to send email alerts of hardware failures, hardware warnings, and other events related to the server or to ILOM.

The service processor runs independently of the server, using the server’s standby power. Therefore, ILOM firmware and software continue to function when the server operating system goes offline or when the server is powered off.



Note - ILOM provides an ALOM CMT compatibility CLI. Refer to the Sun Integrated Lights Out Management 2.0 Supplement for the Sun Netra T5220 Server for comprehensive ILOM and ALOM CMT compatibility information.


Faults detected by ILOM, POST, and the Solaris Predictive Self-Healing (PSH) technology are forwarded to ILOM for fault handling (FIGURE 1-7).

In the event of a system fault, ILOM ensures that the fault LED is lit, FRU ID PROMs are updated, the fault is logged, and alerts are displayed. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 2-1.

FIGURE 1-7 ILOM Fault Management


Figure showing the fault source interfaces.

The service processor detects when a fault is no longer present and clears the fault in several ways:

The service processor also detects the removal of a FRU, in many cases even if the FRU is removed while the service processor is powered off (that is, if the system power cables are unplugged during service procedures). This situation enables ILOM to know that a fault, diagnosed to a specific FRU, has been repaired.



Note - ILOM does not automatically detect hard drive replacement.


Many environmental faults can automatically recover. A temperature that is exceeding a threshold might return to normal limits. An unplugged power supply can be plugged in, and so on. Recovery of environmental faults is automatically detected. Recovery events are reported using one of two forms:

Environmental faults can be repaired through hot-removal of the faulty FRU. The environmental monitoring automatically detects FRU removal and clears all faults associated with the removed FRU. The message for that case, and the alert sent for all FRU removals, is:

fru at location has been removed.

There is no ILOM command to manually repair an environmental fault.

The Solaris Predictive Self-Healing technology does not monitor the hard drive for faults. As a result, the service processor does not recognize hard drive faults, and will not light the fault LEDs on either the chassis or the hard drive itself. Use the Solaris message files to view hard drive faults. See Collecting Information From Solaris OS Files and Commands.

1.4.1 Using the ALOM CMT Compatibility CLI in ILOM

There are three methods of interacting with the service processor:



Note - The examples in this section use the ALOM CMT compatibility CLI.


The ALOM CMT CLI emulates the ALOM CMT interface supported on the previous generation of CMT servers. With few exceptions, the commands you use in the ALOM CMT CLI resemble the ALOM CMT commands. The ILOM CLI and the ALOM CMT compatibility CLI are compared in the Sun Integrated Lights Out Management 2.0 Supplement for the Sun Netra T5220 Server.

The service processor reports each alert in three ways: it sends the alert to all ALOM CMT CLI users who are logged in, sends it through email to a configured email address, and writes the event to the ILOM event log.

1.4.2 Creating an ALOM CMT CLI Shell

To create an ALOM CMT CLI shell, do the following:

1. Log in to the service processor with username: root.

When powered on, the service processor boots to the ILOM login prompt. The factory default password is changeme.


SUNSPxxxxxxxxxxxx login: root
Password:
Waiting for daemons to initialize...
 
Daemons ready
 
Sun(TM) Integrated Lights Out Manager
 
Version 2.0.0.0
 
Copyright 2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
 
Warning: password is set to factory default.
 

2. Create a new user, set the account role to Administrator and the CLI mode to alom.


-> create /SP/users/admin
Creating user...
Enter new password: ********
Enter new password again: ********
Created /SP/users/admin
-> set /SP/users/admin role=Administrator
Set 'role' to 'Administrator'
-> set /SP/users/admin cli_mode=alom
Set 'cli_mode' to 'alom'



Note - The asterisks in the example will not appear when you enter your password.


You can combine the create and set commands on a single line:


-> create /SP/users/admin role=Administrator cli_mode=alom
Creating user...
Enter new password: ********
Enter new password again: ********
Created /SP/users/admin

3. Log out of the root account after you have finished creating the new account.


-> exit

4. Log in to the ALOM CMT CLI (indicated by the sc> prompt) from the ILOM login prompt.


SUNSPxxxxxxxxxxxx login: admin
Password:
Waiting for daemons to initialize...
 
Daemons ready
 
Sun(TM) Integrated Lights Out Manager
 
Version 2.0.0.0
 
Copyright 2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
 
sc>



Note - Multiple service processor accounts can be active concurrently. A user can be logged in under one account using the ILOM CLI, and another account using the ALOM CMT CLI.
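If you prepare the account-creation commands from Step 2 for several servers, a small generator keeps the two forms consistent. The following Python sketch only builds the command text shown in that step; it does not connect to the service processor, and the function name and its parameters are hypothetical.

```python
def ilom_create_user_commands(user, role="Administrator", cli_mode="alom", combined=True):
    """Return the ILOM CLI lines that create a user and set its role and CLI mode."""
    target = "/SP/users/%s" % user
    if combined:
        # Single-line form: create and set properties in one command.
        return ["create %s role=%s cli_mode=%s" % (target, role, cli_mode)]
    # Three-command form: create first, then set each property.
    return [
        "create %s" % target,
        "set %s role=%s" % (target, role),
        "set %s cli_mode=%s" % (target, cli_mode),
    ]


if __name__ == "__main__":
    for line in ilom_create_user_commands("admin", combined=False):
        print("->", line)
```

ILOM still prompts interactively for the new password after the create command, so the generated lines cover only the command text.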


1.4.3 Running ALOM CMT CLI Service-Related Commands

This section describes commands commonly used for service-related activities.

1.4.3.1 Connecting to ALOM CMT CLI

Before you can run ALOM CMT CLI commands, you must connect to the service processor in one of two ways:



Note - Refer to the Sun Integrated Lights Out Management 2.0 Supplement for the Sun Netra T5220 Server for instructions on configuring and connecting to the service processor.


1.4.3.2 Switching Between the System Console and Service Processor

1.4.3.3 Service-Related ALOM CMT CLI Commands

TABLE 1-6 describes the typical ALOM CMT CLI commands for servicing a server. For descriptions of all ALOM CMT CLI commands, issue the help command or refer to the Integrated Lights Out Management User’s Guide.


TABLE 1-6 Service-Related ALOM CMT CLI Commands

ALOM CMT Command

Description

help [command]

Displays a list of all ALOM CMT CLI commands with syntax and descriptions. Specifying a command name as an option displays help for that command.

break [-y][-c][-D]

Takes the host server from the OS to either kmdb or OpenBoot PROM (equivalent to a Stop-A), depending on the mode in which the Solaris software was booted.

  • -y skips the confirmation question
  • -c executes a console command after the break command completes
  • -D forces a core dump of the Solaris OS

clearfault UUID

Manually clears host-detected faults. The UUID is the unique fault ID of the fault to be cleared.

console [-f]

Connects you to the host system. The -f option forces the console to have read and write capabilities.

consolehistory [-b lines|-e lines|-v] [-g lines] [boot|run]

Displays the contents of the system’s console buffer. The following options enable you to specify how the output is displayed:

  • -g lines specifies the number of lines to display before pausing.
  • -e lines displays the specified number of lines from the end of the buffer.
  • -b lines displays the specified number of lines from the beginning of the buffer.
  • -v displays entire buffer.
  • boot|run specifies the log to display (run is the default log).

bootmode [normal|reset_nvram|bootscript=string]

Enables control of the firmware during system initialization with the following options:

  • normal is the default boot mode.
  • reset_nvram resets OpenBoot PROM parameters to their default values.
  • bootscript=string enables the passing of a string to the boot command.

powercycle [-f]

Performs a poweroff followed by poweron. The -f option forces an immediate poweroff, otherwise the command attempts a graceful shutdown.

poweroff [-y] [-f]

Powers off the host server. The -y option enables you to skip the confirmation question. The -f option forces an immediate shutdown.

poweron [-c]

Powers on the host server. Using the -c option executes a console command after completion of the poweron command.

removefru PS0|PS1

Indicates if it is okay to perform a hot-swap of a power supply. This command does not perform any action, but it provides a warning if the power supply should not be removed because the other power supply is not enabled.

reset [-y] [-c]

Generates a hardware reset on the host server. The -y option enables you to skip the confirmation question. The -c option executes a console command after completion of the reset command.

resetsc [-y]

Reboots the service processor. The -y option enables you to skip the confirmation question.

setkeyswitch [-y] normal | stby | diag | locked

Sets the virtual keyswitch. The -y option enables you to skip the confirmation question when setting the keyswitch to stby.

setlocator [on | off]

Turns the Locator LED on the server on or off.

showenvironment

Displays the environmental status of the host server. This information includes system temperatures, power supply, front panel LED, hard drive, fan, voltage, and current sensor status. See Displaying the Server’s Environmental Status.

showfaults [-v]

Displays current system faults. See Displaying System Faults.

showfru [-g lines] [-s | -d] [FRU]

Displays information about the FRUs in the server.

  • -g lines specifies the number of lines to display before pausing the output to the screen.
  • -s displays static information about system FRUs (defaults to all FRUs, unless one is specified).
  • -d displays dynamic information about system FRUs (defaults to all FRUs, unless one is specified). See Displaying FRU Information.

showkeyswitch

Displays the status of the virtual keyswitch.

showlocator

Displays the current state of the Locator LED as either on or off.

showlogs [-b lines | -e lines | -v] [-g lines] [-p logtype[r|p]]

Displays the history of all events logged in the ALOM CMT event buffers (in RAM or the persistent buffers).

showplatform [-v]

Displays information about the host system’s hardware configuration, the system serial number, and whether the hardware is providing service.
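The bracketed syntax in TABLE 1-6 implies that some options exclude one another. As an illustration, the Python sketch below assembles a consolehistory command line and rejects combinations that the syntax [-b lines|-e lines|-v] does not allow. The helper and its parameter names are hypothetical, and the command text it produces should be checked against the help output on your service processor.

```python
def consolehistory_cmd(begin=None, end=None, verbose=False, pause=None, log="run"):
    """Build a consolehistory command string from TABLE 1-6 style options."""
    # -b, -e, and -v are alternatives, so at most one of them can be chosen.
    chosen = [begin is not None, end is not None, bool(verbose)]
    if sum(chosen) > 1:
        raise ValueError("use only one of -b, -e, or -v")
    if log not in ("boot", "run"):
        raise ValueError("log must be 'boot' or 'run'")
    parts = ["consolehistory"]
    if begin is not None:
        parts += ["-b", str(begin)]
    if end is not None:
        parts += ["-e", str(end)]
    if verbose:
        parts.append("-v")
    if pause is not None:
        parts += ["-g", str(pause)]  # lines to display before pausing
    parts.append(log)  # run is the default log; it is spelled out here for clarity
    return " ".join(parts)


if __name__ == "__main__":
    print("sc>", consolehistory_cmd(end=20, pause=10))
```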




Note - See TABLE 1-10 for the ALOM CMT CLI automatic system recovery (ASR) commands.


1.4.4 Displaying System Faults

The ALOM CMT CLI showfaults command displays the following kinds of faults:

Use the showfaults command for the following reasons:

  • At the sc> prompt, type the showfaults command.

The following showfaults command examples show the different kinds of output from the showfaults command:

1.4.5 Manually Clearing PSH Diagnosed Faults

The ALOM CMT CLI clearfault command enables you to manually clear PSH diagnosed faults from the service processor, either without a FRU replacement or when the service processor was unable to automatically detect the FRU replacement.

  • At the sc> prompt, type the clearfault command.

1.4.6 Displaying the Server’s Environmental Status

The showenvironment command displays a snapshot of the server’s environmental status. This command displays system temperatures, hard drive status, power supply and fan status, front panel LED status, and voltage and current sensors. The output uses a format similar to that of the Solaris OS command prtdiag(1M).

  • At the sc> prompt, type the showenvironment command.

The output differs according to your system’s model and configuration.

EXAMPLE 1-1 shows abridged output of the showenvironment command.


EXAMPLE 1-1 showenvironment Command Output

sc> showenvironment
 
------------------------------------------------------------------------------
System Temperatures (Temperatures in Celsius):
------------------------------------------------------------------------------
Sensor                         Status  Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
------------------------------------------------------------------------------
/SYS/MB/T_AMB                  OK         29   -10     -5      0      50      55      60
/SYS/MB/CMP0/T_TCORE           OK         50   -14     -9     -4      86      96     106
/SYS/MB/CMP0/T_BCORE           OK         51   -14     -9     -4      86      96     106
/SYS/MB/CMP0/BR0/CH0/D0/T_AMB  OK         41   -10     -8     -5      95     100     105
...
------------------------------------------------------------------------------
System Indicator Status:
------------------------------------------------------------------------------
/SYS/LOCATE          /SYS/SERVICE         /SYS/ACT            
OFF                  OFF                  ON                  
------------------------------------------------------------------------------
/SYS/PSU_FAULT       /SYS/TEMP_FAULT      /SYS/FAN_FAULT      
OFF                  OFF                  OFF                 
 
------------------------------------------------------------------------------
System Disks:
------------------------------------------------------------------------------
Disk      Status           Service        OK2RM
------------------------------------------------------------------------------
/SYS/HDD0  OK               OFF           OFF     
/SYS/HDD1  NOT PRESENT      OFF           OFF     
...
------------------------------------------------------------------------------
Fan Status:
------------------------------------------------------------------------------
Fans (Speeds Revolution Per Minute):
Sensor                    Status       Speed     Warn      Low
------------------------------------------------------------------------------
/SYS/FANBD0/FM0/F0/TACH   OK            7000     4000     2400
...
------------------------------------------------------------------------------
Voltage sensors (in Volts):
------------------------------------------------------------------------------
Sensor               Status     Voltage LowSoft LowWarn HighWarn HighSoft
------------------------------------------------------------------------------
/SYS/MB/V_+3V3_STBY  OK         3.39    3.13     3.17     3.53      3.58
...
------------------------------------------------------------------------------
Power Supplies:
------------------------------------------------------------------------------
Supply     Status            Fan_Fault  Temp_Fault  Volt_Fault  Cur_Fault
------------------------------------------------------------------------------
/SYS/PS0    OK                   OFF       OFF          OFF         OFF
...



Note - Some environmental information might not be available when the server is in standby mode.
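If you capture showenvironment output to a file, the temperature section can be screened automatically. The following Python sketch is a hypothetical post-processing aid. It assumes the column layout shown in EXAMPLE 1-1, where each temperature row has nine whitespace-separated fields beginning with a /SYS/ sensor path, and it flags readings that fall outside the LowWarn to HighWarn band.

```python
from typing import List, NamedTuple


class TempReading(NamedTuple):
    sensor: str
    status: str
    temp: int
    low_warn: int
    high_warn: int


def parse_temperature_rows(output: str) -> List[TempReading]:
    """Extract temperature rows; other sections of the output are skipped."""
    readings = []
    for line in output.splitlines():
        fields = line.split()
        # Temperature rows: sensor, status, then seven integer columns.
        if len(fields) != 9 or not fields[0].startswith("/SYS/"):
            continue
        try:
            numbers = [int(value) for value in fields[2:]]
        except ValueError:
            continue
        # numbers = Temp, LowHard, LowSoft, LowWarn, HighWarn, HighSoft, HighHard
        readings.append(TempReading(fields[0], fields[1], numbers[0], numbers[3], numbers[4]))
    return readings


def outside_warning_band(readings: List[TempReading]) -> List[str]:
    """Return the sensors whose reading is at or beyond a warning threshold."""
    return [r.sensor for r in readings if not r.low_warn < r.temp < r.high_warn]


if __name__ == "__main__":
    sample = "/SYS/MB/T_AMB                  OK         29   -10     -5      0      50      55      60"
    print(parse_temperature_rows(sample))
```

Fan, voltage, and power supply rows have a different number of columns or non-integer values, so this parser ignores them.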


1.4.7 Displaying FRU Information

The showfru command displays information about the FRUs in the server. Use this command to see information about an individual FRU or about all the FRUs.



Note - By default, the output of the showfru command for all FRUs is very long.


  • At the sc> prompt, enter the showfru command.

In the following example, the showfru command is used to get information about the motherboard (MB).


sc> showfru /SYS/MB
/SYS/MB (container)
   SEGMENT: FL
      /Configured_LevelR
      /Configured_LevelR/UNIX_Timestamp32: Thu Jun  7 20:12:17 GMT 2007
      /Configured_LevelR/Sun_Part_No: 5412153
      /Configured_LevelR/Configured_Serial_No: BBX053
      /Configured_LevelR/Initial_HW_Dash_Level: 02
   SEGMENT: FD
      /InstallationR (1 iterations)
      /InstallationR[0]
      /InstallationR[0]/UNIX_Timestamp32: Thu Jun 21 19:37:57 GMT 2007
      /InstallationR[0]/Fru_Path: /SYS/MB
      /InstallationR[0]/Parent_Part_Number: 5017813
      /InstallationR[0]/Parent_Serial_Number: 110508
      /InstallationR[0]/Parent_Dash_Level: 01
      /InstallationR[0]/System_Id: 0721BBB050
      /InstallationR[0]/System_Tz: 0
...
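Because showfru output is long, it can help to load captured output into a dictionary and look up individual fields. This Python sketch is a hypothetical helper that assumes the record layout shown in the example above, where each field line starts with a slash and separates name and value with a colon and a space.

```python
def parse_showfru(output):
    """Map showfru field paths to their values, skipping headers and segment lines."""
    fields = {}
    for line in output.splitlines():
        stripped = line.strip()
        # Keep only "/path/name: value" lines; container and SEGMENT lines are skipped.
        if not stripped.startswith("/") or ": " not in stripped:
            continue
        key, value = stripped.split(": ", 1)
        fields[key] = value.strip()
    return fields


if __name__ == "__main__":
    captured = "\n".join([
        "/SYS/MB (container)",
        "   SEGMENT: FL",
        "      /Configured_LevelR/Sun_Part_No: 5412153",
        "      /InstallationR[0]/Fru_Path: /SYS/MB",
    ])
    print(parse_showfru(captured)["/Configured_LevelR/Sun_Part_No"])
```

Splitting on the first colon-space pair keeps timestamp values intact, even though they contain colons of their own.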


1.5 Running POST

Power-on self-test (POST) is a group of PROM-based tests that run when the server is powered on or reset. POST checks the basic integrity of the critical hardware components in the server (CPU, memory, and I/O buses).

If POST detects a faulty component, the component is disabled automatically, preventing faulty hardware from potentially harming any software. If the system is capable of running without the disabled component, the system will boot when POST is complete. For example, if one of the processor cores is deemed faulty by POST, the core will be disabled, and the system will boot and run using the remaining cores.

1.5.1 Controlling How POST Runs

The server can be configured for normal, extensive, or no POST execution. You can also control the level of tests that run, the amount of POST output that is displayed, and which reset events trigger POST by using ALOM CMT CLI variables.

TABLE 1-7 lists the ALOM CMT CLI variables used to configure POST. FIGURE 1-8 shows how the variables work together.



Note - Use the ALOM CMT CLI setsc command to set all the parameters in TABLE 1-7 except setkeyswitch.



TABLE 1-7 ALOM CMT CLI Parameters Used for POST Configuration

Parameter       Values          Description

setkeyswitch    normal          The system can power on and run POST (based on the other parameter settings). For details see FIGURE 1-8. This parameter overrides all other commands.
                diag            The system runs POST based on predetermined settings.
                stby            The system cannot power on.
                locked          The system can power on and run POST, but no flash updates can be made.

diag_mode       off             POST does not run.
                normal          Runs POST according to diag_level value.
                service         Runs POST with preset values for diag_level and diag_verbosity.

diag_level      max             If diag_mode = normal, runs all the minimum tests plus extensive CPU and memory tests.
                min             If diag_mode = normal, runs minimum set of tests.

diag_trigger    none            Does not run POST on reset.
                user_reset      Runs POST upon user-initiated resets.
                power_on_reset  Only runs POST for the first power on. This option is the default.
                error_reset     Runs POST if fatal errors are detected.
                all_resets      Runs POST after any reset.

diag_verbosity  none            No POST output is displayed.
                min             POST output displays functional tests with a banner and pinwheel.
                normal          POST output displays all test and informational messages.
                max             POST displays all test, informational, and some debugging messages.


FIGURE 1-8 Flowchart of ALOM CMT CLI Variables for POST Configuration


Figure showing POST flow chart

TABLE 1-8 shows typical combinations of ALOM CMT CLI variables and associated POST modes.


TABLE 1-8 ALOM CMT CLI Parameters and POST Modes

Parameter        Normal Diagnostic Mode      No POST Execution  Diagnostic Service Mode  Keyswitch Diagnostic Preset Values
                 (Default Settings)
diag_mode        normal                      off                service                  normal
setkeyswitch[1]  normal                      normal             normal                   diag
diag_level       max                         n/a                max                      max
diag_trigger     power-on-reset error-reset  none               all-resets               all-resets
diag_verbosity   normal                      n/a                max                      max

Description of POST execution

Normal Diagnostic Mode (Default Settings): This is the default POST configuration. This configuration tests the system thoroughly, and suppresses some of the detailed POST output.

No POST Execution: POST does not run, resulting in quick system initialization. This is not a suggested configuration.

Diagnostic Service Mode: POST runs the full spectrum of tests with the maximum output displayed.

Keyswitch Diagnostic Preset Values: POST runs the full spectrum of tests with the maximum output displayed.


1.5.2 Changing POST Parameters

1. Access the ALOM CMT CLI sc> prompt:

At the console, issue the #. key sequence:


#.

2. Use the ALOM CMT CLI sc> prompt to change the POST parameters.

Refer to TABLE 1-7 for a list of ALOM CMT CLI POST parameters and their values.

The setkeyswitch parameter sets the virtual keyswitch, so this parameter does not use the setsc command. For example, to change the POST parameters using the setkeyswitch command, enter the following:


sc> setkeyswitch diag

To change the POST parameters using the setsc command, you must first set the setkeyswitch parameter to normal. Then you can change the POST parameters using the setsc command:


sc> setkeyswitch normal
sc> setsc parameter value

For example:


sc> setkeyswitch normal
sc> setsc diag_mode service
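
The procedure above can be sketched as a small Python helper that validates a requested setting against TABLE 1-7 before producing the sc> command lines. This is an illustration only: the names POST_PARAMETERS and build_post_commands are hypothetical, the helper only builds strings and does not talk to the service processor, and diag_trigger is left out because TABLE 1-7 and TABLE 1-8 spell its values differently.

```python
# Allowed values copied from TABLE 1-7. setkeyswitch is excluded because
# it is a separate command and is not set with setsc.
POST_PARAMETERS = {
    "diag_mode": {"off", "normal", "service"},
    "diag_level": {"min", "max"},
    "diag_verbosity": {"none", "min", "normal", "max"},
}


def build_post_commands(settings):
    """Return the sc> command lines for a {parameter: value} mapping."""
    # The procedure sets the virtual keyswitch to normal before using setsc.
    commands = ["setkeyswitch normal"]
    for name, value in settings.items():
        allowed = POST_PARAMETERS.get(name)
        if allowed is None:
            raise ValueError(f"unknown POST parameter: {name}")
        if value not in allowed:
            raise ValueError(f"{value!r} is not a listed value for {name}")
        commands.append(f"setsc {name} {value}")
    return commands
```

For example, build_post_commands({"diag_mode": "service"}) yields the same two commands shown in the example above.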

1.5.3 Reasons to Run POST

You can use POST for basic hardware verification and diagnosis, and for troubleshooting as described in the following sections.

1.5.3.1 Verifying Hardware Functionality

POST tests critical hardware components to verify functionality before the system boots and accesses software. If POST detects an error, the faulty component is disabled automatically, preventing faulty hardware from potentially harming software.

1.5.3.2 Diagnosing the System Hardware

You can use POST as an initial diagnostic tool for the system hardware. In this case, configure POST to run in maximum mode (diag_mode=service, setkeyswitch=diag, diag_level=max) for thorough test coverage and verbose output.
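
How the parameters combine can be sketched in Python. The function below is a hypothetical illustration inferred from TABLE 1-7 and TABLE 1-8, not ALOM CMT code and not a transcription of FIGURE 1-8. It ignores diag_trigger and treats the locked keyswitch position like normal, which is an assumption.

```python
def effective_post_settings(keyswitch, diag_mode, diag_level, diag_verbosity):
    """Return the POST level and verbosity implied by the tables.

    Returns None when POST does not run. Hypothetical helper for
    illustration only.
    """
    if keyswitch == "stby":
        # TABLE 1-7: the system cannot power on, so POST cannot run.
        return None
    if keyswitch == "diag":
        # TABLE 1-8: the keyswitch preset applies maximum level and verbosity.
        return {"level": "max", "verbosity": "max"}
    if diag_mode == "off":
        return None
    if diag_mode == "service":
        # TABLE 1-8: service mode uses preset maximum values.
        return {"level": "max", "verbosity": "max"}
    if diag_mode == "normal":
        # Normal mode honors the configured level and verbosity.
        return {"level": diag_level, "verbosity": diag_verbosity}
    raise ValueError(f"unknown diag_mode: {diag_mode}")
```

Under these assumptions, setting the keyswitch to diag and setting diag_mode to service lead to the same result, which matches the last two columns of TABLE 1-8.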

1.5.4 Running POST in Maximum Mode

This procedure describes how to run POST when you want maximum testing, as in the case when you are troubleshooting a server or verifying a hardware upgrade or repair.

1. Switch from the system console prompt to the sc> prompt by issuing the #. escape sequence.


ok #.
sc>

2. Set the virtual keyswitch to diag so that POST will run in service mode.


sc> setkeyswitch diag

3. Reset the system so that POST runs.

There are several ways to initiate a reset. EXAMPLE 1-2 shows the powercycle command. For other methods, refer to the Sun Netra T5220 Server Administration Guide.


EXAMPLE 1-2 Initiating POST Using the powercycle Command

sc> powercycle
Are you sure you want to powercycle the system (y/n)? y
Powering host off at Fri Jul 27 08:11:52 2007
Waiting for host to Power Off; hit any key to abort.
Audit | minor: admin : Set : object = /SYS/power_state : value = soft : success
Chassis | critical: Host has been powered off
Powering host on at Fri Jul 27 08:13:08 2007
Audit | minor: admin : Set : object = /SYS/power_state : value = on : success
Chassis | major: Host has been powered on

4. Switch to the system console to view the POST output:


sc> console

EXAMPLE 1-3 depicts abridged POST output.


EXAMPLE 1-3 POST Output (Abridged)

sc> console
Enter #. to return to ALOM.
2007-07-03 10:25:12.081 0:0:0>@(#)Sun Netra[TM] T5220 POST 4.x.build_119 2007/06/06 09:48 
/export/delivery/delivery/4.x/4.x.build_119/post4.x/UltraSPARC/NetraT5220/integrated  (root)  
2007-07-03 10:25:12.386 0:0:0>Copyright 2007 Sun Microsystems, Inc. All rights reserved
2007-07-03 10:25:12.550 0:0:0>VBSC cmp0 arg is: 00ff00ff.ffffffff
2007-07-03 10:25:12.653 0:0:0>POST enabling threads: 00ff00ff.ffffffff
2007-07-03 10:25:12.766 0:0:0>VBSC mode is: 00000000.00000001
2007-07-03 10:25:12.867 0:0:0>VBSC level is: 00000000.00000001
2007-07-03 10:25:12.966 0:0:0>VBSC selecting POST MAX Testing.
2007-07-03 10:25:13.066 0:0:0>VBSC setting verbosity level 3
2007-07-03 10:25:13.161 0:0:0>	UltraSPARCT2, Version 2.1
2007-07-03 10:25:13.247 0:0:0>	Serial Number: 0fac006b.0e654482
2007-07-03 10:25:13.353 0:0:0>Basic Memory Tests.....
2007-07-03 10:25:13.456 0:0:0>Begin: Branch Sanity Check
2007-07-03 10:25:13.569 0:0:0>End  : Branch Sanity Check
2007-07-03 10:25:13.668 0:0:0>Begin: DRAM Memory BIST
2007-07-03 10:25:13.793 0:0:0>................................................................................................
2007-07-03 10:25:38.399 0:0:0>End  : DRAM Memory BIST
2007-07-03 10:25:39.547 0:0:0>Sys 166 MHz, CPU 1166 MHz, Mem 332 MHz 
2007-07-03 10:25:39.658 0:0:0>L2 Bank EFuse = 00000000.000000ff 
2007-07-03 10:25:39.760 0:0:0>L2 Bank status = 00000000.00000f0f 
2007-07-03 10:25:39.864 0:0:0>Core available Efuse = ffff00ff.ffffffff 
2007-07-03 10:25:39.982 0:0:0>Test Memory.....
2007-07-03 10:25:40.070 0:0:0>Begin: Probe and Setup Memory
2007-07-03 10:25:40.181 0:0:0>INFO:	  4096MB at Memory Branch 0 
...
 
2007-07-03 10:29:21.683 0:0:0>INFO:
2007-07-03 10:29:21.686 0:0:0>	POST Passed all devices.
2007-07-03 10:29:21.692 0:0:0>POST:	Return to VBSC.

5. Perform further investigation if needed.

a. Interpret the POST messages:

POST error messages use the following syntax:

c:s > ERROR: TEST = failing-test
c:s > H/W under test = FRU
c:s > Repair Instructions: Replace items in order listed by H/W under test above
c:s > MSG = test-error-message
c:s > END_ERROR

In this syntax, c = the core number, s = the strand number.

Warning and informational messages use the following syntax:

INFO or WARNING: message

In EXAMPLE 1-4, POST reports a memory error at FB-DIMM location /SYS/MB/CMP0/BR2/CH0/D0. The error was detected by POST running on core 7, strand 2.


EXAMPLE 1-4 POST Error Message

7:2>
7:2>ERROR: TEST = Data Bitwalk
7:2>H/W under test = /SYS/MB/CMP0/BR2/CH0/D0
7:2>Repair Instructions: Replace items in order listed by 'H/W
under test' above.
7:2>MSG = Pin 149 failed on /SYS/MB/CMP0/BR2/CH0/D0 (J2001)
7:2>END_ERROR
 
7:2>Decode of Dram Error Log Reg Channel 2 bits
60000000.0000108c
7:2> 1 MEC 62 R/W1C Multiple corrected
errors, one or more CE not logged
7:2> 1 DAC 61 R/W1C Set to 1 if the error
was a DRAM access CE
7:2> 108c SYND 15:0 RW ECC syndrome.
7:2>
7:2> Dram Error AFAR channel 2 = 00000000.00000000
7:2> L2 AFAR channel 2 = 00000000.00000000

b. Run the showfaults command to obtain additional fault information.

ALOM CMT captures the fault: the fault is logged, the Service Required LED is lit, and the faulty component is disabled.

Example:


EXAMPLE 1-5 showfaults Output

ok #.
sc> showfaults
Last POST Run: Wed Jun 27 21:29:02 2007
 
Post Status: Passed all devices
ID FRU                     Fault
0 /SYS/MB/CMP0/BR2/CH0/D0 SP detected fault: /SYS/MB/CMP0/BR2/CH0/D0 Forced fail (POST)

In this example, /SYS/MB/CMP0/BR2/CH0/D0 is disabled. The system can boot using memory that was not disabled until the faulty component is replaced.



Note - You can use ASR commands to display and control disabled components. See Managing Components With Automatic System Recovery Commands.
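
The POST error syntax described in Step 5a lends itself to simple parsing when console output has been captured to a file. The following Python sketch assumes the core:strand> prefix shown in EXAMPLE 1-4. It does not handle the timestamped prefix seen in EXAMPLE 1-3, and the names PREFIX, FIELDS, and parse_post_error are hypothetical.

```python
import re

# Matches the "core:strand>" prefix in front of each POST error line.
PREFIX = re.compile(r"^(\d+):(\d+)\s*>\s*(.*)$")

# Maps the labels used in the POST error syntax to result keys.
FIELDS = {"H/W under test": "fru", "MSG": "message"}


def parse_post_error(text):
    """Return the fields of the first ERROR ... END_ERROR block, or None."""
    result = None
    last_key = None
    for line in text.splitlines():
        match = PREFIX.match(line)
        if match is None:
            # A line without a prefix is treated as a wrapped continuation
            # of the most recent recognized field.
            if result is not None and last_key is not None and line.strip():
                result[last_key] += " " + line.strip()
            continue
        core, strand, body = match.groups()
        if body.startswith("END_ERROR"):
            if result is not None:
                return result
            continue
        key, sep, value = body.partition("=")
        key = key.strip()
        if sep and key == "ERROR: TEST":
            result = {"core": int(core), "strand": int(strand),
                      "test": value.strip()}
            last_key = "test"
        elif result is not None and sep and key in FIELDS:
            result[FIELDS[key]] = value.strip()
            last_key = FIELDS[key]
        else:
            # Unrecognized lines (for example Repair Instructions) are skipped.
            last_key = None
    return None
```

Applied to the text of EXAMPLE 1-4, the function reports core 7, strand 2, the Data Bitwalk test, and the FRU name /SYS/MB/CMP0/BR2/CH0/D0.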


1.5.5 Clearing POST Detected Faults

In most cases, when POST detects a faulty component, POST logs the fault and automatically takes the failed component out of operation by placing the component in the ASR blacklist (see Managing Components With Automatic System Recovery Commands).

In most cases, the replacement of the faulty FRU is detected when the service processor is reset or power cycled. In this case, the fault is automatically cleared from the system. This procedure describes how to identify POST detected faults and, if necessary, manually clear the fault.

1. After replacing a faulty FRU, at the ALOM CMT CLI prompt use the showfaults command to identify POST detected faults.

POST detected faults are distinguished from other kinds of faults by the text Forced fail, and no UUID number is reported.

Example:


EXAMPLE 1-6 POST Detected Fault

sc> showfaults
Last POST Run: Wed Jun 27 21:29:02 2007
 
Post Status: Passed all devices
ID FRU                     Fault
0 /SYS/MB/CMP0/BR2/CH0/D0 SP detected fault: /SYS/MB/CMP0/BR2/CH0/D0 Forced fail (POST)

If no fault is reported, you do not need to do anything else. Do not perform the subsequent steps.

2. Use the enablecomponent command to clear the fault and remove the component from the ASR blacklist.

Use the FRU name that was reported in the fault in Step 1.


EXAMPLE 1-7 Using the enablecomponent Command

sc> enablecomponent /SYS/MB/CMP0/BR2/CH0/D0 

The fault is cleared and should not show up when you run the showfaults command. Additionally, the Service Required LED is no longer on.

3. Power cycle the server.

You must reboot the server for the enablecomponent command to take effect.

4. At the ALOM CMT CLI prompt, use the showfaults command to verify that no faults are reported.


TABLE 1-9 Verifying Cleared Faults Using the showfaults Command
sc> showfaults
Last POST run: THU MAR 09 16:52:44 2006
POST status: Passed all devices
 
No failures found in System
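
When showfaults output has been saved to a file, the check in Step 1 can be sketched in Python. The function below assumes the non-verbose column layout of EXAMPLE 1-6 (ID, FRU, Fault); the verbose layout with a Time column is not handled. FAULT_ROW and post_detected_faults are hypothetical names.

```python
import re

# A fault row starts with a numeric ID followed by a FRU name under /SYS.
FAULT_ROW = re.compile(r"^\s*(\d+)\s+(/SYS\S*)\s+(.*)$")


def post_detected_faults(showfaults_output):
    """Return the FRU names of faults that look POST detected."""
    frus = []
    for line in showfaults_output.splitlines():
        row = FAULT_ROW.match(line)
        if row is None:
            continue
        _fault_id, fru, description = row.groups()
        # POST detected faults carry "Forced fail" and report no UUID.
        if "Forced fail" in description and "UUID" not in description:
            frus.append(fru)
    return frus
```

An empty list corresponds to the case where no POST detected fault is reported and no further action is needed.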


1.6 Using the Solaris Predictive Self-Healing Feature

The Solaris Predictive Self-Healing (PSH) technology enables the server to diagnose problems while the Solaris OS is running, and mitigate many problems before they negatively affect operations.

The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem. Once diagnosed, the fault manager daemon assigns the problem a Universally Unique Identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun’s knowledge article database.

The Predictive Self-Healing technology covers the following server components:

The PSH console message provides the following information:

If the Solaris PSH facility detects a faulty component, use the fmdump command to identify the fault. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 2-1.

1.6.1 Identifying PSH Detected Faults

When a PSH fault is detected, a Solaris console message similar to EXAMPLE 1-8 is displayed.


EXAMPLE 1-8 Console Message Showing Fault Detected by PSH

SUNW-MSG-ID: SUN4V-8000-DX, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Wed Sep 14 10:09:46 EDT 2005
PLATFORM: SUNW,Sun-Netra-T5220, CSN: -, HOSTNAME: hostname
SOURCE: cpumem-diagnosis, REV: 1.5
EVENT-ID: f92e9fbe-735e-c218-cf87-9e1720a28004
DESC: The number of errors associated with this memory module has exceeded acceptable levels.
AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
IMPACT: Total system memory capacity will be reduced as pages are retired.
REC-ACTION: Schedule a repair procedure to replace the affected memory module.  Use fmdump -v -u <EVENT_ID> to identify the module.
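
The console message in EXAMPLE 1-8 is a set of LABEL: value pairs, some of them sharing a line. If such messages are collected from a console log, a short Python sketch like the following can split them into fields. The names FIELD and parse_psh_message are hypothetical, and the pattern is derived only from the layout of EXAMPLE 1-8.

```python
import re

# A label is an uppercase word (hyphens allowed) at the start of a line or
# after ", ", followed by ": ".
FIELD = re.compile(r"(?:^|, )([A-Z][A-Z-]+): ")


def parse_psh_message(text):
    """Return a dict of the labeled fields in a PSH console message."""
    fields = {}
    for line in text.splitlines():
        matches = list(FIELD.finditer(line))
        for index, match in enumerate(matches):
            # A value runs up to the next label on the same line, if any.
            if index + 1 < len(matches):
                end = matches[index + 1].start()
            else:
                end = len(line)
            fields[match.group(1)] = line[match.end():end].strip()
    return fields
```

The EVENT-ID field returned by this sketch is the UUID that the REC-ACTION text asks you to pass to fmdump -v -u.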

Faults detected by the Solaris PSH facility are also reported through service processor alerts. EXAMPLE 1-9 depicts an ALOM CMT CLI alert of the same fault reported by Solaris PSH in EXAMPLE 1-8.


EXAMPLE 1-9 ALOM CMT CLI Alert of PSH Diagnosed Fault

SC Alert: Host detected fault, MSGID: SUN4V-8000-DX

The ALOM CMT CLI showfaults command provides summary information about the fault. See Displaying System Faults for more information about the showfaults command.



Note - The Service Required LED also turns on for PSH diagnosed faults.


1.6.1.1 Using the fmdump Command to Identify Faults

The fmdump command displays the list of faults detected by the Solaris PSH facility and identifies the faulty FRU for a particular EVENT_ID (UUID).

Do not use fmdump to verify a FRU replacement has cleared a fault because the output of fmdump is the same after the FRU has been replaced. Use the fmadm faulty command to verify the fault has cleared.

1. Check the event log using the fmdump command with -v for verbose output:


EXAMPLE 1-10 Output from the fmdump -v Command

# fmdump -v -u fd940ac2-d21e-c94a-f258-f8a9bb69d05b
TIME                 UUID                                 SUNW-MSG-ID
Jul 31 12:47:42.2007 fd940ac2-d21e-c94a-f258-f8a9bb69d05b SUN4V-8000-JA
  100%  fault.cpu.ultraSPARC-T2.misc_regs
 
        Problem in: cpu:///cpuid=16/serial=5D67334847
           Affects: cpu:///cpuid=16/serial=5D67334847
               FRU: hc://:serial=101083:part=541215101/motherboard=0
          Location: MB

In EXAMPLE 1-10, a fault is displayed, indicating the following details:



Note - fmdump displays the PSH event log. Entries remain in the log after the fault has been repaired.


2. Use the Sun message ID to obtain more information about this type of fault.

a. Obtain the message ID from the console output or the ALOM CMT CLI showfaults command.

b. Enter the message ID in the SUNW-MSG-ID field, and click Lookup.

In EXAMPLE 1-11, the message ID SUN4V-8000-JA provides information for corrective action:


EXAMPLE 1-11 PSH Message Output

CPU errors exceeded acceptable levels
 
Type
    Fault 
Severity
    Major 
Description
    The number of errors associated with this CPU has exceeded acceptable levels. 
Automated Response
    The fault manager will attempt to remove the affected CPU from service. 
Impact
    System performance may be affected. 
 
Suggested Action for System Administrator
    Schedule a repair procedure to replace the affected CPU, the identity of which can be determined using fmdump -v -u <EVENT_ID>. 
 
Details
    The Message ID:  SUN4V-8000-JA indicates diagnosis has determined that a CPU is faulty. The Solaris fault manager arranged an automated attempt to disable this CPU. The recommended action for the system administrator is to contact Sun support so a Sun service technician can replace the affected component. 

3. Follow the suggested actions to repair the fault.
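
The fmdump -v output shown in EXAMPLE 1-10 can be reduced to its key fields with a short script. The Python sketch below is based only on the layout of that example; EVENT, DETAIL, and parse_fmdump_verbose are hypothetical names, and other fmdump output formats are not covered.

```python
import re

# Event line: timestamp, 36-character UUID, then the Sun message ID.
EVENT = re.compile(r"^(\w{3}\s+\d{1,2}\s+[\d:.]+)\s+([0-9a-f-]{36})\s+(\S+)$")

# Indented detail lines that follow an event line.
DETAIL = re.compile(r"^\s*(Problem in|Affects|FRU|Location):\s*(.*)$")


def parse_fmdump_verbose(text):
    """Return one dict per event found in fmdump -v output."""
    events = []
    for line in text.splitlines():
        event = EVENT.match(line.strip())
        if event:
            time, uuid, msg_id = event.groups()
            events.append({"time": time, "uuid": uuid, "msg_id": msg_id})
            continue
        detail = DETAIL.match(line)
        if detail and events:
            # Attach the detail to the most recent event.
            events[-1][detail.group(1)] = detail.group(2).strip()
    return events
```

The msg_id value is the message ID used in Step 2, and the FRU and Location values identify the component to replace.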

1.6.2 Clearing PSH Detected Faults

When the Solaris PSH facility detects faults, the faults are logged and displayed on the console. In most cases, after the fault is repaired, the system detects the corrected state and clears the fault condition automatically. However, you must verify this and, if the fault condition is not cleared automatically, clear the fault manually.

1. After replacing a faulty FRU, power on the server.

2. At the ALOM CMT CLI prompt, use the showfaults command to identify PSH detected faults.

PSH detected faults are distinguished from other kinds of faults by the text Host detected fault.

Example:


sc> showfaults -v
Last POST Run: Wed Jun 29 11:29:02 2007
 
Post Status: Passed all devices
ID  Time              FRU                      Fault
0  Jun 30 22:13:02   /SYS/MB/CMP0/BR2/CH0/D0  Host detected fault, MSGID: SUN4V-8000-DX  UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86

3. Run the ALOM CMT CLI clearfault command with the UUID provided in the showfaults output.

Example:


sc> clearfault 7ee0e46b-ea64-6565-e684-e996963f7b86
Clearing fault from all indicted FRUs...
Fault cleared.

4. Clear the fault from all persistent fault records.

In some cases, even though the fault is cleared, some persistent fault information remains and results in erroneous fault messages at boot time. To ensure that these messages are not displayed, run the following Solaris command:

fmadm repair UUID

Example:


# fmadm repair 7ee0e46b-ea64-6565-e684-e996963f7b86
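
Steps 2 through 4 reuse the same UUID in two places, so it can help to extract it once. The following Python sketch reads saved showfaults -v output and lists the clearfault and fmadm repair commands for each PSH detected fault. It only builds command strings and does not run them; UUID_FIELD and psh_clear_commands are hypothetical names.

```python
import re

# UUID as printed by showfaults -v: 8-4-4-4-12 lowercase hex digits.
UUID_FIELD = re.compile(
    r"UUID:\s*([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def psh_clear_commands(showfaults_v_output):
    """Return (ALOM command, Solaris command) pairs for PSH detected faults."""
    steps = []
    for line in showfaults_v_output.splitlines():
        # PSH detected faults are marked with this text.
        if "Host detected fault" not in line:
            continue
        found = UUID_FIELD.search(line)
        if found:
            uuid = found.group(1)
            steps.append((f"sc> clearfault {uuid}", f"# fmadm repair {uuid}"))
    return steps
```

The first command in each pair is entered at the sc> prompt and the second at the Solaris superuser prompt, as in the procedure above.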


1.7 Collecting Information From Solaris OS Files and Commands

With the Solaris OS running on the server, you have the full complement of Solaris OS files and commands available for collecting information and for troubleshooting.

If POST, the service processor, or the Solaris PSH features do not indicate the source of a fault, check the message buffer and log files for fault notifications. Hard drive faults are usually captured by the Solaris message files.

Use the dmesg command to view the most recent system messages. To view the system messages log file, view the contents of the /var/adm/messages file.

1.7.1 Checking the Message Buffer

1. Log in as superuser.

2. Type the dmesg command:


# dmesg

The dmesg command displays the most recent messages generated by the system.

1.7.2 Viewing System Message Log Files

The error logging daemon, syslogd, automatically records various system warnings, errors, and faults in message files. These messages can alert you to system problems such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file. After a period of time (usually every ten days), a new messages file is automatically created. The original contents of the messages file are rotated to a file named messages.1. Over a period of time, the messages are further rotated to messages.2 and messages.3, and then deleted.

1. Log in as superuser.

2. Type the following command:


# more /var/adm/messages

3. If you want to view all logged messages, type the following command:


# more /var/adm/messages*
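
Because the messages file is rotated to messages.1, messages.2, and so on, a search across all of them is easier to read when the files are visited from newest to oldest. The Python sketch below shows one way to do that. The function names are hypothetical, and the default directory is the /var/adm location described above.

```python
import os
import re


def order_message_files(names):
    """Sort messages, messages.1, messages.2, ... from newest to oldest."""
    def rotation_index(name):
        match = re.fullmatch(r"messages(?:\.(\d+))?", name)
        # The unnumbered file is the current one, so it sorts first.
        return int(match.group(1) or 0)

    wanted = [n for n in names if re.fullmatch(r"messages(?:\.\d+)?", n)]
    return sorted(wanted, key=rotation_index)


def find_messages(pattern, directory="/var/adm"):
    """Return (file name, line) pairs for lines matching pattern."""
    hits = []
    for name in order_message_files(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, errors="replace") as handle:
            for line in handle:
                if re.search(pattern, line):
                    hits.append((name, line.rstrip("\n")))
    return hits
```

For example, find_messages("WARNING") would list matching lines from every messages file, starting with the most recent one. Reading these files typically requires superuser privileges.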


1.8 Managing Components With Automatic System Recovery Commands

The Automatic System Recovery (ASR) feature enables the server to automatically configure failed components out of operation until they can be replaced. In the server, the ASR feature manages the following components:

The database that contains the list of disabled components is called the ASR blacklist (asr-db).

In most cases, POST automatically disables a faulty component. After the cause of the fault is repaired (FRU replacement, loose connector reseated, and so on), you must remove the component from the ASR blacklist.

The ASR commands (TABLE 1-10) enable you to view, and manually add or remove components from the ASR blacklist. You run these commands from the ALOM CMT CLI sc> prompt.


TABLE 1-10 ASR Commands

Command                   Description
showcomponent             Displays system components and their current state.
enablecomponent asrkey    Removes a component from the asr-db blacklist, where asrkey is the component to enable.
disablecomponent asrkey   Adds a component to the asr-db blacklist, where asrkey is the component to disable.
clearasrdb                Removes all entries from the asr-db blacklist.




Note - The components (asrkeys) vary from system to system, depending on how many cores and how much memory are present. Use the showcomponent command to see the asrkeys on a given system.




Note - A reset or power cycle is required after disabling or enabling a component. If the status of a component is changed, there is no effect on the system until the next reset or power cycle.


1.8.1 Displaying System Components

The showcomponent command displays the system components (asrkeys) and reports their status.

At the sc> prompt, enter the showcomponent command.

EXAMPLE 1-12 shows partial output with no disabled components.


EXAMPLE 1-12 Output of the showcomponent Command With No Disabled Components

sc> showcomponent
Keys:
 
    /SYS/MB/RISER0/XAUI0
    /SYS/MB/RISER0/PCIE0
    /SYS/MB/RISER0/PCIE3
    /SYS/MB/RISER1/XAUI1
    /SYS/MB/RISER1/PCIE1
    /SYS/MB/RISER1/PCIE4
    /SYS/MB/RISER2/PCIE2
    /SYS/MB/RISER2/PCIE5
    /SYS/MB/GBE0
    /SYS/MB/GBE1
    /SYS/MB/PCIE
    /SYS/MB/PCIE-IO/USB
    /SYS/MB/SASHBA
    /SYS/MB/CMP0/NIU0
    /SYS/MB/CMP0/NIU1
    /SYS/MB/CMP0/MCU0
    /SYS/MB/CMP0/MCU1
    /SYS/MB/CMP0/MCU2
    /SYS/MB/CMP0/MCU3
 
    /SYS/MB/CMP0/L2_BANK0
    /SYS/MB/CMP0/L2_BANK1
    /SYS/MB/CMP0/L2_BANK2
    /SYS/MB/CMP0/L2_BANK3
    /SYS/MB/CMP0/L2_BANK4
    /SYS/MB/CMP0/L2_BANK5
    /SYS/MB/CMP0/L2_BANK6
    /SYS/MB/CMP0/L2_BANK7
    ...
    /SYS/TTYA
State: Clean

EXAMPLE 1-13 shows showcomponent command output with a component disabled:


EXAMPLE 1-13 Output of the showcomponent Command Showing Disabled Components

sc> showcomponent
Keys:
 
    /SYS/MB/RISER0/XAUI0
    /SYS/MB/RISER0/PCIE0
    /SYS/MB/RISER0/PCIE3
    /SYS/MB/RISER1/XAUI1
    /SYS/MB/RISER1/PCIE1
    /SYS/MB/RISER1/PCIE4
    /SYS/MB/RISER2/PCIE2
    /SYS/MB/RISER2/PCIE5
    ...
    /SYS/TTYA
Disabled Devices
  /SYS/MB/CMP0/L2_BANK0	Disabled by user
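
Saved showcomponent output can be split into the list of asrkeys and the disabled devices with a few lines of Python. This sketch relies only on the layout of EXAMPLE 1-12 and EXAMPLE 1-13; parse_showcomponent is a hypothetical name.

```python
def parse_showcomponent(output):
    """Return (keys, disabled) parsed from showcomponent output.

    keys is a list of asrkeys. disabled maps each disabled asrkey to the
    reason text printed after it.
    """
    keys = []
    disabled = {}
    section = None
    for line in output.splitlines():
        text = line.strip()
        if text == "Keys:":
            section = "keys"
        elif text.startswith("Disabled Devices"):
            section = "disabled"
        elif text.startswith("State:"):
            section = None
        elif text.startswith("/SYS"):
            if section == "keys":
                keys.append(text)
            elif section == "disabled":
                # The asrkey and the reason are separated by whitespace.
                parts = text.split(None, 1)
                disabled[parts[0]] = parts[1] if len(parts) > 1 else ""
    return keys, disabled
```

An empty disabled mapping corresponds to the State: Clean case in EXAMPLE 1-12.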

1.8.2 Disabling Components

The disablecomponent command disables a component by adding it to the ASR blacklist.

1. At the sc> prompt, enter the disablecomponent command.


sc> disablecomponent /SYS/MB/CMP0/BR1/CH0/D0
Chassis | major: /SYS/MB/CMP0/BR1/CH0/D0 has been disabled. Disabled by user

2. After receiving confirmation that the disablecomponent command is complete, reset the server so that the ASR command takes effect.


sc> reset

1.8.3 Enabling Disabled Components

The enablecomponent command enables a disabled component by removing it from the ASR blacklist.

1. At the sc> prompt, enter the enablecomponent command.


sc> enablecomponent /SYS/MB/CMP0/BR1/CH0/D0
Chassis | major: /SYS/MB/CMP0/BR1/CH0/D0 has been enabled.

2. After receiving confirmation that the enablecomponent command is complete, reset the server so that the ASR command takes effect.


sc> reset


1.9 Exercising the System With SunVTS Software

Sometimes a server exhibits a problem that cannot be isolated definitively to a particular hardware or software component. In such cases, it might be useful to run a diagnostic tool that stresses the system by continuously running a comprehensive battery of tests. Sun provides the SunVTS software for this purpose.

This section describes the tasks necessary to use SunVTS software to exercise your server:

1.9.1 Checking Whether SunVTS Software Is Installed

This procedure assumes that the Solaris OS is running on the server, and that you have access to the Solaris command line.

1. Check for the presence of SunVTS packages using the pkginfo command.


% pkginfo -l SUNWvts SUNWvtsr SUNWvtsts SUNWvtsmn

TABLE 1-11 lists SunVTS packages:


TABLE 1-11 SunVTS Packages

Package     Description
SUNWvts     SunVTS framework
SUNWvtsr    SunVTS framework (root)
SUNWvtsts   SunVTS for tests
SUNWvtsmn   SunVTS man pages


The SunVTS 6.0 PS3 software, and future compatible versions, are supported on the server.

SunVTS installation instructions are described in the SunVTS User’s Guide.

1.9.2 Exercising the System Using SunVTS Software

Before you begin, the Solaris OS must be running. You also must ensure that SunVTS validation test software is installed on your system. See Checking Whether SunVTS Software Is Installed.

The SunVTS installation process requires that you specify one of two security schemes to use when running SunVTS. The security scheme you choose must be properly configured in the Solaris OS for you to run SunVTS. For details, refer to the SunVTS User’s Guide.

SunVTS software features both character-based and graphics-based interfaces. This procedure assumes that you are using the graphical user interface (GUI) on a system running the Common Desktop Environment (CDE). For more information about the character-based SunVTS TTY interface, and specifically for instructions on accessing it by tip or telnet commands, refer to the SunVTS User’s Guide.

SunVTS software can be run in several modes. This procedure assumes that you are using the default mode.

This procedure also assumes that the server is headless. That is, it is not equipped with a monitor capable of displaying bitmap graphics. In this case, you access the SunVTS GUI by logging in remotely from a machine that has a graphics display.

Finally, this procedure describes how to run SunVTS tests in general. Individual tests might presume the presence of specific hardware, or might require specific drivers, cables, or loopback connectors. For information about test options and prerequisites, refer to the following documentation:

1.9.3 Exercising the System With SunVTS Software

1. Log in as superuser to a system with a graphics display.

The display system should be one with a frame buffer and monitor capable of displaying bitmap graphics such as those produced by the SunVTS GUI.

2. Enable the remote display.

On the display system, type:


# /usr/openwin/bin/xhost + test-system

where test-system is the name of the server you plan to test.

3. Remotely log in to the server as superuser.

Use a command such as rlogin or telnet.

4. Start SunVTS software.

If you have installed SunVTS software in a location other than the default /opt directory, alter the path, as in EXAMPLE 1-15.


EXAMPLE 1-15 Alternate Command for Starting SunVTS Software

# /opt/SUNWvts/bin/sunvts -display display-system:0

where display-system is the name of the machine through which you are remotely logged in to the server.

The SunVTS GUI is displayed (FIGURE 1-9).


FIGURE 1-9 SunVTS GUI

Figure showing the SunVTS GUI for the server and the various buttons and areas on the GUI screen.

5. Expand the test lists to see the individual tests.


The test selection area lists tests in categories, such as Network, as shown in FIGURE 1-10. To expand a category, left-click the expand category icon to the left of the category name.

FIGURE 1-10 SunVTS Test Selection Panel

Figure showing a small portion of the test selection area in the SunVTS graphical interface.

6. (Optional) Select the tests you want to run.

Certain tests are enabled by default, and you can choose to accept these.

Alternatively, you can enable and disable individual tests or blocks of tests by clicking the checkbox next to the test name or test category name. Tests are enabled when checked, and disabled when not checked.

TABLE 1-12 lists tests that are especially useful to run on this server.


TABLE 1-12 Useful SunVTS Tests to Run on This Server

SunVTS Tests                                      FRUs Exercised by Tests
cmttest, cputest, fputest, iutest, l1dcachetest,  FB-DIMMs, CPU motherboard
  dtlbtest, and l2sramtest; indirectly: mptest
  and systest
disktest                                          Disks, cables, disk backplane
cddvdtest                                         CD/DVD device, cable, motherboard
nettest, netlbtest                                Network interface, network cable, CPU motherboard
pmemtest, vmemtest, ramtest                       FB-DIMMs, motherboard
serialtest                                        I/O (serial port interface)
usbkbtest, disktest                               USB devices, cable, CPU motherboard (USB controller)
hsclbtest                                         Motherboard, service processor (host to service processor interface)


7. (Optional) Customize individual tests.

You can customize individual tests by right-clicking on the name of the test. For example, in FIGURE 1-10, right-clicking on the text string ce0(nettest) brings up a menu that enables you to configure this Ethernet test.

8. Start testing.

Click the Start button that is located at the top left of the SunVTS window. Status and error messages appear in the test messages area located across the bottom of the window. You can stop testing at any time by clicking the Stop button.

During testing, SunVTS software logs all status and error messages. To view these messages, click the Log button or select Log Files from the Reports menu. This action opens a log window from which you can choose to view the following logs:


1.10 Obtaining the Chassis Serial Number

To obtain support for your system, you need your chassis serial number. The chassis serial number is printed on a sticker on the front of the server and on another sticker on the side of the server. You can also run the ALOM CMT CLI showplatform command to obtain the chassis serial number.

For example:


TABLE 1-13 Obtaining the Chassis Serial Number With the showplatform Command
sc> showplatform
SUNW,Sun-Netra-T5220
Chassis Serial Number: xxxxxxxxxxxx
Domain Status
------ ------
S0 OS Standby
sc>
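
If showplatform output is captured as part of a support script, the serial number can be pulled out with a small Python function. This is a sketch based on the output layout shown above; chassis_serial_number is a hypothetical name.

```python
def chassis_serial_number(showplatform_output):
    """Return the chassis serial number from showplatform output, or None."""
    for line in showplatform_output.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() == "Chassis Serial Number":
            # An empty value is reported as None rather than "".
            return value.strip() or None
    return None
```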


1.11 Additional Service Related Information

In addition to this service manual, the following resources are available to help you keep your server running optimally. These documents are available at:

http://www.oracle.com/technetwork/indexes/documentation/index.html