CHAPTER 8

Open Issues for Sun Fire High-End Systems

This chapter describes open issues related to the Sun Fire high-end servers -- the Sun Fire E25K/E20K/15K/12K systems -- running Solaris 8 2/04 software.

Dynamic Reconfiguration on Sun Fire High-End Systems

Dynamic reconfiguration (DR) has two components: one that runs in the System Management Services (SMS) environment on the system controller (SC), another that runs in the Solaris environment on the domains.

This section describes open issues about domain-side DR running on Solaris 8 2/04 software. For information about SMS-side DR, see the System Management Services Dynamic Reconfiguration User Guide and the System Management Services Release Notes that correspond to the SMS release your system is running.

Known Dynamic Reconfiguration Bugs

This section lists important domain-side DR bugs known to exist as of the publication date of this document.

DR Operations Hang After a Few Loops When CPU Power Control Is Also Running (BugID 4114317)

When multiple DR operations run concurrently, or when psradm is run at the same time as a DR operation, the system can hang because of a mutex deadlock (deadly embrace).

Workaround: Perform DR operations serially (one DR operation at a time), and allow each operation to complete successfully before running psradm or beginning another DR operation.
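The workaround can be scripted so that each step starts only after the previous one has exited successfully. The following is a minimal sketch, not a Sun-supplied tool. The function name run_dr_serially is hypothetical, each argument is assumed to be one complete command line, and a zero exit status is assumed to mean that the operation completed successfully.

```shell
#!/bin/sh
# run_dr_serially: hypothetical helper, not part of Solaris.
# Runs each argument as one command line, strictly one after another,
# and stops at the first command that exits with a nonzero status.
run_dr_serially() {
    for step in "$@"; do
        echo "running: $step"
        # sh -c runs in the foreground, so the next step cannot start
        # until this one has exited.
        sh -c "$step" || {
            echo "failed: $step (remaining steps skipped)" >&2
            return 1
        }
    done
}
```

For example, run_dr_serially "cfgadm -c unconfigure IO3" "psradm -f 4" would start psradm only after cfgadm has returned successfully. The board name and processor ID here are illustrative only.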

Unable to Disconnect SCSI Controllers Using DR (BugID 4446253)

When a SCSI controller is configured but not busy, it cannot be disconnected using the DR cfgadm(1M) command.

Workaround: None.

DR Commands Hang Waiting for rcm_daemon While Running ipc, vm, and ism Stress (BugID 4508927)

In rare cases, a quiesce of the Solaris software fails to stop certain user threads and fails to restart others, which then remain in a stopped state. Depending on which threads are affected, applications running on the domain might stop running, and other DR operations might not be possible until the domain is rebooted.

Workaround: Do not use DR to remove a board that contains permanent memory.

cfgadm_sbd Plugin Signal Handling Is Completely Broken (BugID 4498600)

When a single-threaded or multithreaded client of the cfgadm library issues concurrent sbd requests, the system might hang.

Workaround: None. To avoid this bug, do not run multiple instances of cfgadm in parallel against system boards, and do not send signals, such as CTRL-C, to long-running cfgadm operations.
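One way to keep cfgadm instances from overlapping is to start them through a wrapper that takes a lock first. The following is a minimal sketch under stated assumptions: the function name with_dr_lock, the lock path, and the return code 75 are all hypothetical, and the lock protects only callers that use the wrapper.

```shell
#!/bin/sh
# with_dr_lock: hypothetical helper, not part of Solaris.
# Runs a command only if no other caller currently holds the lock.
# mkdir is atomic: it succeeds for one caller and fails for the rest.
DR_LOCK_DIR="${DR_LOCK_DIR:-/tmp/dr_cfgadm.lock}"   # assumed lock path

with_dr_lock() {
    if ! mkdir "$DR_LOCK_DIR" 2>/dev/null; then
        echo "another DR operation holds $DR_LOCK_DIR; not starting" >&2
        return 75   # arbitrary code meaning "lock busy"
    fi
    status=0
    "$@" || status=$?
    rmdir "$DR_LOCK_DIR"
    return "$status"
}
```

For example, with_dr_lock cfgadm -c unconfigure IO3 refuses to start while another wrapped operation is still running. The sketch deliberately installs no signal handler, in keeping with the advice not to interrupt cfgadm; if the wrapper is killed, the lock directory must be removed by hand.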

System Panics During Concurrent Slot 1 DR and PCI Hotplug Operations (BugID 4797110)

Unconfiguring an hsPCI I/O board at the same time a PCI option card is being configured into it causes a system panic. For example, the panic would occur if the following commands were executed simultaneously. In this example, pcisch18:e03b1slot2 is one of the four PCI Slots on IO3:

# cfgadm -c unconfigure IO3
# cfgadm -c configure pcisch18:e03b1slot2

Workaround: Do not execute a PCI hotplug operation while an hsPCI I/O board is being unconfigured.

PCI Slot With Empty Cassette May Show disconnected State After DR Operation (BugID 4809799)

Due to a race condition, a PCI slot with an empty cassette may show a disconnected state rather than the usual connected state after a DR operation on a Slot 1 I/O board (hsPCI). The PCI Slot with an empty cassette should be in a connected state for FRUID purposes. For example:

PCI Slot with empty cassette showing incorrect state: 
# cfgadm -al pcisch17:e00b1slot0 
pcisch17:e00b1slot0 unknown disconnected unconfigured unknown 
  
PCI Slot with empty cassette showing correct state:
# cfgadm -al pcisch17:e00b1slot0
pcisch17:e00b1slot0 unknown connected unconfigured unknown

Workaround: Run the cfgadm command to put the PCI Slot in a connected state. For example:

# cfgadm -c connect pcisch17:e00b1slot0
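On a domain with many PCI slots, the affected slots can be picked out of the cfgadm listing by matching the column values shown in the incorrect-state example. The following is a minimal sketch; the function name list_disconnected_empty_slots is hypothetical, and it only prints the connect commands for review rather than running them, because a slot can be disconnected for other reasons.

```shell
#!/bin/sh
# list_disconnected_empty_slots: hypothetical helper, not part of Solaris.
# Reads "cfgadm -al" style lines on standard input and prints a
# "cfgadm -c connect" command for each pcisch attachment point whose
# columns match the incorrect state shown above:
#   Type=unknown  Receptacle=disconnected  Occupant=unconfigured
list_disconnected_empty_slots() {
    awk '$1 ~ /^pcisch/ && $2 == "unknown" &&
         $3 == "disconnected" && $4 == "unconfigured" {
             print "cfgadm -c connect " $1
         }'
}
```

A typical invocation would be cfgadm -al | list_disconnected_empty_slots, followed by a manual check of each printed command before it is run.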

Sequence of Copy-Rename/Reboot Events Causes OS to Hang During Quiesce Stage (BugID 4806726)

A hang can occur when permanent memory from a 32 GB board has been unconfigured, a copy-rename writes it to a target board that has less than 32 GB of memory, and another copy-rename attempts to write it to a third board with less than 32 GB. For example, the hang can occur when memory is moved from a 32 GB board to an 8 GB board, and then to a 16 GB board.

Workaround: Do not assign permanent memory to a 32 GB board, or do not mix boards that have 32 GB of memory with boards that do not in a domain when the memory is unconfigured.
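The second form of the workaround amounts to a simple check on the board memory sizes in a domain. The following is a minimal sketch; the function name has_mixed_32gb_boards is hypothetical, and the sizes are assumed to be whole numbers of gigabytes supplied by the operator, because this sketch does not query the hardware.

```shell
#!/bin/sh
# has_mixed_32gb_boards: hypothetical helper, not part of Solaris.
# Arguments: memory size of each system board in the domain, in GB
# (whole numbers supplied by the operator).
# Returns 0 if the list mixes 32 GB boards with boards of any other
# size, and 1 otherwise.
has_mixed_32gb_boards() {
    has_32=0
    has_other=0
    for size in "$@"; do
        if [ "$size" -eq 32 ]; then
            has_32=1
        else
            has_other=1
        fi
    done
    [ "$has_32" -eq 1 ] && [ "$has_other" -eq 1 ]
}
```

For the example sequence in the bug description, has_mixed_32gb_boards 32 8 16 returns 0, which marks the domain as one where the hang is possible.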

page_retire Might Not Update Retired Page List (BugID 4893666)

If non-permanent memory is unconfigured, the system removes retired pages from the retired pages list to prevent them from becoming dangling pages -- that is, pages that point to physical memory that would have been unconfigured.

When permanent memory is unconfigured, a target board is identified and unconfigured first. Once the target board is ready, the contents of the source board (the permanent memory) are copied to it. The target board is then "renamed" (its memory controllers are programmed) to have the same address range as the source board. Therefore, if the source board contains any retired pages, these pages are not dangling pages after the rename: they point to valid addresses, but the physical memory behind those addresses is now in the target board. The problem is that this physical memory is probably good (does not contain ECC errors), yet the pages remain on the retired page list.

Workaround: None.

Page Removal Causes a Good Page to be Removed After a DR Operation (BugID 4860955)

The automatic page removal feature may result in removal of a good page after a DR operation.

Workaround: Disable automatic_page_removal.

DR Detach Fails With Solaris Failed to Deprobe Error (BugID 4873095); DR Attach Fails With Cannot Read Property Value: Device Node 0x0: Property Name (BugID 4913987)

These errors can occur on systems with devices that define a nonunique portID. For example, if you attempt a DR operation on a CPU for which the portID is defined as 0x000000, and the system contains an I/O device whose portID is also defined as 0x000000, the DR operation fails.
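Given a list of devices and their portID values, shared values can be found with a short filter. The following is a minimal sketch; the function name find_duplicate_portids and the input format (one "device-name portID" pair per line) are assumptions, and how the pairs are collected from the system is left to the operator.

```shell
#!/bin/sh
# find_duplicate_portids: hypothetical helper, not part of Solaris.
# Input (assumed format): one "device-name portID" pair per line.
# Output: each portID that is claimed by more than one device,
# followed by the names of the devices that share it.
# The comparison is textual, so 0x0 and 0x000000 count as different.
find_duplicate_portids() {
    awk '{
             id = $2 ""                  # force a string subscript
             count[id]++
             names[id] = names[id] " " $1
         }
         END {
             for (id in count)
                 if (count[id] > 1)
                     print id ":" names[id]
         }' | sort
}
```

Any portID that this filter reports belongs to more than one device, which is the condition under which the DR detach and attach errors can occur.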

Workaround: None.

Known Non-DR Bugs

Incorrect Board Speed Displayed (BugID 4964679)

The prtdiag, psrinfo, and cfgadm commands on a Sun Fire E25K or E20K might display the speed at which the board is rated rather than its actual speed.

Workaround: See your licensed Sun Service personnel for possible fixes.