Chapter 2 - SMS 1.5 Bugs

This chapter provides information about known SMS 1.5 bugs, as well as bugs that have been fixed in the SMS patches that support the UltraSPARC IV+ processor. The chapter includes the following sections: Bug Fixes in This Update, Known Bugs in SMS 1.5 Software, and SMS 1.5 Documentation Errata.

Bug Fixes in This Update

This section lists the bugs in SMS 1.5 software and related bugs that have been fixed in the SMS patches that support the UltraSPARC IV+ processor.

Enhance UltraSPARC IV+ CPU Error Handling (CR ID 6257778)

Patch 120843-01 enhances the error handling and recovery capabilities of OpenBoot PROM to include the UltraSPARC IV+ processors.

prtdiag Shows Wrong Bus Frequency for C5 Slots (CR ID 6286277)

After a card was hot-plugged into slot 1 (c5v0) and the system was restarted, prtdiag showed the correct bus frequency for the populated slot but reported an incorrect bus frequency for the empty slots. This has been fixed in Patch 120843-01.

"PCI IOC ECC Tests" Fails -l64 or Higher on Starcat with Dual-Core UltraSPARC IV+s (CR ID 6255743)

On Sun Fire E25K/E20K systems that have dual-core UltraSPARC IV+ boards installed, lpost might fail at diagnostic levels 64, 96, or 127. When the failure occurs, lpost returns the following error message:

{SB03/P0/C1} ERROR: TEST=PCI IOC Ecc Tests,SUBTEST=PCI IOC ECC

Modify hpost to Support UltraSPARC IV+ GA of 1500 MHz (CR ID 6270911)

hpost in SMS 1.5 needs to be modified to support the UltraSPARC IV+ boards. Patch 120648-02 makes this modification.

hpost -q Fails "Out Of Config on Timeout" When Rebooting from Solaris (CR ID 6324035)

Occasionally, a Sun Fire E25K/E20K system running the Solaris 9 4/04 OS on UltraSPARC IV+ boards will time out if you reboot a domain on the UltraSPARC IV+ board. The system returns the following error message:

Proccore SB0/P0/C0 timed out on test Domain Advanced Tests id=0x6F. Test Failed.
FAIL Proccore SB0/P0/C0: test_seq_cwd(): failed out of config on timeout (Timeout Secs Given: 30)

UltraSPARC IV+ Version 2.1 Early Lots Should Be Internal-Only (CR 6292571)

The first UltraSPARC IV+ processors released for customer systems are version 2.1.1. Patch 120648-02 modifies POST to detect earlier version 2.1 processors, which are not qualified for customer use, and fail them out of the configuration.

Note that versions 2.1 and 2.1.1 cannot be distinguished by MaskID, which is 2.1 for both. POST distinguishes them based on other electrically readable information.

UltraSPARC IV+: marginvoltage vcore minus on PN 1500 MHz Does Not Show the Correct Margined Voltage (CR 6288445)

This bug applies only to 1500 MHz UltraSPARC IV+ boards. Occasionally, using the marginvoltage command with the -m-1 option returns an incorrect value. If you issue the command again a few seconds later, it returns the correct value. This has been fixed in Patch 120789-01.

UltraSPARC IV+: marginvoltage Output Format for UltraSPARC IV+ vcore is Not Correct (CR 6290143)

This bug applies only to 1500 MHz UltraSPARC IV+ boards. When you use the -m-1 or -m+1 option with the marginvoltage command, the system returns output in an incorrect format. For example, on UltraSPARC IV+ boards the -m+1 option reports a changed value of Nom (voltage) instead of Nom+3% (voltage), whereas the same option returns correct output on UltraSPARC IV and UltraSPARC III boards. Patch 120789-01 fixes this issue.

RFE: AVL-FS2 (Starcat): Provide Diagnosis of New UltraSPARC IV+ CPU Errors (CR ID 6277467)

UltraSPARC IV+ processors include additional error detection and RAS capabilities beyond those in UltraSPARC IV and III+ processors. This CR describes an enhancement to the Availability functionality to diagnose the new errors an UltraSPARC IV+ can report. With this enhancement, Availability diagnoses all fatal errors for all processor types, as well as non-fatal errors for Solaris 9 domains. Patch 120827-01 provides this enhancement.

SC CPU Needs to Handle L3/L2 Cache Errors on Non-FMA Domains So As Not to Cause Processor Indictment (CR ID 6302265)

The UltraSPARC IV+ chips have three levels of cache. Levels 2 and 3 refer to data caches; Level 2 is internal to the processor, and Level 3 is external to the processor.

Sometimes an error produces additional errors as side effects. When an error occurs in either level of the data cache, the Availability software diagnoses the root cause of the error and discards the side-effect error (or errors). This not only aids diagnosability, but also ensures that a victim component is not indicted because of a side-effect error. Patch 120827-01 fixes this condition.

hwad Sending Dstop Events in Serial Causes Delay and Incorrect dsmd ASR (CR ID 6302843)

On a system running multiple domains, hwad must issue a dstop (domain stop) event to each of the running domains before dsmd can recover the domains after an error condition. Because these dstops were issued in series, there was a delay between the time that the initial dstop was issued and the time that all of the domains had been recovered.

Patch 120789-01 fixes this issue so that the dstops are now issued to the domains in parallel using separate threads, thus eliminating the delay.
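The difference between serial and parallel delivery can be illustrated with a small Python sketch. It is not the hwad implementation: the function issue_dstop is a hypothetical stand-in for whatever delivers a dstop event to one domain, and the domain names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def issue_dstop(domain):
    """Hypothetical stand-in for delivering a dstop event to one domain."""
    return f"dstop issued to domain {domain}"


def issue_dstops_serially(domains):
    # Serial delivery: each domain waits for every earlier delivery to finish.
    return [issue_dstop(d) for d in domains]


def issue_dstops_in_parallel(domains):
    # Parallel delivery: one worker thread per domain, so no delivery
    # waits on another. pool.map returns results in input order.
    if not domains:
        return []
    with ThreadPoolExecutor(max_workers=len(domains)) as pool:
        return list(pool.map(issue_dstop, domains))


if __name__ == "__main__":
    for line in issue_dstops_in_parallel(["A", "B", "C"]):
        print(line)
```

Both functions produce the same set of deliveries. Only the waiting time differs, which is what mattered for the recovery delay described above.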

SERD Tunables for CPU Events are Not Consistent Between S9U8, S10U1/FMA, and SMS 1.5 (CR ID 6309365)

To account for the additional cache level in the UltraSPARC IV+ processors, the SC-side SERD (Soft Error Rate Discriminator) required different threshold values to align with the existing thresholds on Solaris 9 domains. Without the adjustment, the domain takes the processor offline before the SC-side diagnosis, and the processor's health status is not updated correctly.

Patch 120827-01 fixes this issue so that diagnoses are consistent between the two operating system versions and SMS 1.5 software for all supported types of processors.
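The effect of mismatched thresholds can be sketched with a generic soft-error-rate counter in Python. This is an illustration only, not the SMS or Solaris code. It assumes a discriminator that fires once at least n events fall within a sliding window of t seconds, and the threshold values shown are invented for the example.

```python
from collections import deque


class SerdEngine:
    """Generic sketch: fire when at least n events occur within t seconds."""

    def __init__(self, n, t_seconds):
        self.n = n
        self.t = t_seconds
        self.events = deque()

    def record(self, timestamp):
        """Record one soft error and return True if the threshold is reached."""
        self.events.append(timestamp)
        # Drop events that have aged out of the window.
        while timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) >= self.n


if __name__ == "__main__":
    # Hypothetical thresholds: the domain side is stricter than the SC side.
    domain_side = SerdEngine(n=2, t_seconds=60)
    sc_side = SerdEngine(n=3, t_seconds=60)
    for ts in (0, 10):
        domain_fired = domain_side.record(ts)
        sc_fired = sc_side.record(ts)
    # After two errors the domain side has fired but the SC side has not.
    print(domain_fired, sc_fired)
```

With thresholds that differ in this way, one side acts on the processor before the other has reached a diagnosis, which is the kind of inconsistency the patch addresses.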

Known Bugs in SMS 1.5 Software

FMA Event Reporting to NetConnect Does Not Pick Up Modified Chassis Serial Number (CR ID 5052078)

If a Sun Fire high-end server runs without having its chassis serial number (CSN) set on the SCs using the setcsn command, any Fault Management Architecture (FMA) reports sent to NetConnect after a domain stop (Dstop) will show the serial number as blank in its event reports.

Workaround: Use the setcsn command to set the chassis serial number and then restart SMS. You must restart SMS in order for the CSN to appear in the event reports.

For more information about how to set the chassis serial number on the SC, refer to the System Management Services (SMS) 1.5 Installation Guide.

ndd /dev/scman man_pathgroups_report Output Needs Clarification (CR ID 6252771)

The ndd(1M) command can be run as root to read and write certain device driver parameters. The scman(7D) driver (/dev/scman) manages the Sun Fire E25K/E20K SC side of the Management (MAN) Network and supports the ndd(1M) command.

If the output of the man_pathgroups_report parameter of scman(7D) is misinterpreted, a software-caused error can look like a serious hardware error. As a result, you might incorrectly conclude that hardware must be swapped in order to find the root cause of the problem.

When the man_pathgroups_report parameter is specified, you can obtain output such as the following:

# ndd /dev/scman man_pathgroups_report

MAN Pathgroup report: (* == error)
Interface       Destination             Active Path     Alternate Paths
----------------------------------------------------------------
scman1          Other SSC               eri0            eri0 exp 0, hme1 exp 0 *

The asterisk (*) in the last line denotes that "the last time the hme1 physical interface was used, an error was found". Historically, the majority of occurrences are due to software, not hardware.

Software causes an error when either the MAN network peer no longer responds to "heartbeat" messages or an incorrect dlpi(7P) state transition occurs. You can reproduce the former case at will by running the following command as root (assuming that the output appears exactly as shown above):

# ndd -set /dev/scman man_set_active_path '1 0 1'

For the SC that executes the command (for example, SC0), its Active Path is switched from eri0 to hme1. For a while, SC1 will continue to send packets on the eri0 physical interface, and SC0 will send packets on hme1. After a short while, the two will synchronize and communicate using the same interface. However, an asterisk will be shown (on each SC) to show the last interface on which there was an error. In this case, the error is indeed caused by software (that is, the error is really a non-response to a "heartbeat" message sequence). It is not a fatal hardware error.

An asterisk will indeed be shown in the output if there is a persistent, fatal hardware error. However, you should not assume that hardware is the only possible cause of the asterisk.
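When many reports must be reviewed, the flagged interfaces can be extracted mechanically. The following Python sketch assumes that the report has the layout shown in the example above (a dashed separator line followed by one row per interface, with flagged paths written as "name exp N *"). The function name is illustrative. The result tells you only which interface last saw an error, not whether the cause was hardware or software.

```python
import re

# A flagged alternate path looks like "hme1 exp 0 *" in the sample report.
FLAGGED = re.compile(r"(\w+)\s+exp\s+\d+\s*\*")


def flagged_interfaces(report):
    """Return the physical interfaces marked with an asterisk."""
    flagged = []
    in_rows = False
    for line in report.splitlines():
        stripped = line.strip()
        if stripped and set(stripped) == {"-"}:
            in_rows = True  # data rows start after the dashed separator
            continue
        if in_rows and stripped:
            flagged.extend(FLAGGED.findall(line))
    return flagged


SAMPLE = """\
MAN Pathgroup report: (* == error)
Interface       Destination             Active Path     Alternate Paths
----------------------------------------------------------------
scman1          Other SSC               eri0            eri0 exp 0, hme1 exp 0 *
"""

if __name__ == "__main__":
    print(flagged_interfaces(SAMPLE))
```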

SMS 1.5 Documentation Errata

marginvoltage(1M)

That statement is true only for core voltages. All other settings are persistent.

rcfgadm(1M)

If the rcfgadm command fails, a board does not return to its original state. A dxs or dcs error message is logged to the domain. If the error is recoverable, you can retry the command.

If you are running the Solaris 8 or Solaris 9 OS on the domain, perform the following check:

1. Before you retry the command, ensure that the following dcs entries exist in /etc/inetd.conf on the domain, and that they have not been disabled.

sun-dr stream tcp wait root /usr/lib/dcs dcs

sun-dr stream tcp6 wait root /usr/lib/dcs dcs

2. If the error is unrecoverable, you must reboot the domain in order to use that board.
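The check in step 1 can be scripted. The following Python sketch works on the text of /etc/inetd.conf and assumes that a disabled entry is one that has been commented out with a leading # character. The function names are illustrative.

```python
def active_dcs_protocols(inetd_conf_text):
    """Return the protocols (for example tcp, tcp6) that have an active dcs entry."""
    protocols = set()
    for line in inetd_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and commented-out (disabled) entries
        fields = stripped.split()
        # Fields: service, socket type, protocol, wait flag, user, program, args
        if len(fields) >= 6 and fields[0] == "sun-dr" and fields[5] == "/usr/lib/dcs":
            protocols.add(fields[2])
    return protocols


def dcs_entries_present(inetd_conf_text):
    """True only if both the tcp and tcp6 dcs entries are active."""
    return {"tcp", "tcp6"} <= active_dcs_protocols(inetd_conf_text)


if __name__ == "__main__":
    with open("/etc/inetd.conf") as conf:
        print(dcs_entries_present(conf.read()))
```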

If you are running the Solaris 10 OS on the domain, dcs is managed by the Service Management Facility (SMF). Perform the following steps:

1. Check whether dcs is enabled by typing the following command:

# inetadm | grep dcs

disabled disabled svc:/platform/sun4u/dcs:default

2. If dcs is disabled as shown in the above example, enable it by typing the following command:

# svcadm enable svc:/platform/sun4u/dcs:tcp
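The inetadm output can also be checked in a script. The following Python sketch parses captured output and assumes three whitespace-separated columns (enabled state, run state, and service FMRI), with the FMRI printed as a single token without embedded spaces. The function name is illustrative.

```python
def dcs_service_states(inetadm_output):
    """Map each dcs service FMRI to True if enabled, False otherwise."""
    states = {}
    for line in inetadm_output.splitlines():
        fields = line.split()
        # Expect: <enabled|disabled> <run state> <FMRI>
        if len(fields) == 3 and "/dcs:" in fields[2]:
            states[fields[2]] = fields[0] == "enabled"
    return states


SAMPLE = "disabled  disabled  svc:/platform/sun4u/dcs:default\n"

if __name__ == "__main__":
    for fmri, enabled in dcs_service_states(SAMPLE).items():
        print(fmri, "enabled" if enabled else "disabled")
```

Any service reported as disabled is a candidate for the svcadm enable command shown above.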

testemail(1M)

The description of the -c option in the testemail(1M) man page should read as follows:

The fault class or comma-separated list of fault classes that testemail uses to generate an event.

Examples of valid fault classes are in the file /etc/opt/SUNWSMS/config/SF15000.dict.

When invoking testemail using an Ecache resource, make sure that the system board containing the Ecache is powered on. Otherwise, the testemail invocation will fail and no email will be generated.

System Management Services (SMS) 1.5 Administrator Guide

A voltage core monitoring parameter (VCMON) was added to the SMS software. When VCMON is enabled, it monitors any voltage changes or drifts on the processors. If VCMON detects an upward change in voltage (which usually indicates a socket attach issue), it notifies the user with an FMA event and marks the component health status (CHS) of that processor as faulty.

In the description of the showboards command, the -a option should read -v.

In the description of the showenvironment command, the category "Device" should be removed.

The following categories of error messages should be added between error codes 11300 and 50000:

System Management Services (SMS) 1.5 Installation Guide

The Hardware Compatibility Table (Table 2-1) should list Solaris 8 2/02 as the first supported version of Solaris 8 software for both the domains and the system controllers (SCs).

This table contains a typographical error; it refers to a 1.65 MHz UltraSPARC processor. The correct speed is 1.5 GHz (1500 MHz).

SMS 1.5 supports a /swap partition size of 2 Gbytes as well as the 4 Gbyte size described in the Installation Guide. The recommended partition sizes for SMS 1.5 are as follows:

Slice   Partition                     Size
0       / (root)                      8 Gbytes
1       swap                          4 Gbytes
4       OLDS/LVM database (metadb)    32 Mbytes
5       OLDS/LVM database (metadb)    32 Mbytes
7       /export/install               Remaining free space

To verify that Java version 1.2.2 has been installed, type the command java -version at the system prompt.

SMS must be up and running before you can record the chassis serial number (CSN).

The flashupdate example is missing the -f switch. It should read as follows:

Upgrade the Solaris OS. See "To Install or Upgrade the Solaris OS on the SC" on page 17.

Run smsupgrade to reinstall SMS after a major OS upgrade (see page 34). Otherwise, proceed to the next step and restore the SMS configuration.

The heading "To Reinstall SMS Software" should read "To Restore the SMS Configuration."
