CHAPTER 8
Open Issues for Sun Fire High-End Systems
This chapter describes open issues related to the Sun Fire high-end servers (the Sun Fire E25K/E20K/15K/12K systems) running Solaris 8 2/04 software.
Dynamic reconfiguration (DR) has two components: one runs in the System Management Services (SMS) environment on the system controller (SC), and the other runs in the Solaris environment on the domains.
This section describes open issues about domain-side DR running on Solaris 8 2/04 software. For information about SMS-side DR, see the System Management Services Dynamic Reconfiguration User Guide and the System Management Services Release Notes that correspond to the SMS release your system is running.
This section lists important domain-side DR bugs known to exist as of the publication date of this document.
When multiple concurrent DR operations occur, or when psradm is run at the same time as a DR operation, the system can hang because of a mutex deadly embrace.
Workaround: Perform DR operations serially (one DR operation at a time), and allow each to complete successfully before running psradm or beginning another DR operation.
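One way to enforce this serialization on a domain is to route every DR and psradm invocation through a small wrapper that holds a lock while the command runs. The following is a minimal sketch, not a Sun-supplied tool: the function name run_serial and the lock path in LOCKDIR are hypothetical, and the sketch assumes that all administrators on the domain use the wrapper consistently.

```shell
#!/bin/sh
# Hypothetical helper: run one command at a time under a directory lock.
# LOCKDIR and run_serial are illustrative names, not part of Solaris or SMS.
LOCKDIR=${LOCKDIR:-/tmp/dr_serial.lock}

run_serial() {
    # mkdir is atomic, so only one caller at a time can create the lock.
    until mkdir "$LOCKDIR" 2>/dev/null; do
        sleep 5    # another operation holds the lock; wait and retry
    done
    "$@"           # run the requested command to completion
    status=$?
    rmdir "$LOCKDIR"
    return $status
}

# Usage (arguments are whatever you would normally pass):
#   run_serial cfgadm <options> <ap_id>
#   run_serial psradm <options> <processor_id>
```

The wrapper deliberately installs no signal handling. If the shell is killed while a command is running, the lock directory remains and must be removed by hand once the operation is known to have finished.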
When a SCSI controller is configured but not busy, it cannot be disconnected using the DR cfgadm(1M) command.
In rare cases, a quiesce of the Solaris software fails to stop certain user threads, and to restart others, which remain in a stopped state. Depending on the threads affected, applications running on the domain may stop running and other DR operations might not be possible until the domain is rebooted.
Workaround: Do not use DR to remove a board that contains permanent memory.
Description: When a single-threaded or multi-threaded client of the cfgadm library issues concurrent sbd requests, the system may hang.
Workaround: None. To avoid this bug, do not run multiple instances of cfgadm in parallel against system boards, and do not send signals, such as CTRL-C, to long-running cfgadm operations.
Unconfiguring an hsPCI I/O board at the same time that a PCI option card is being configured into it causes a system panic. For example, the panic would occur if the following commands were executed simultaneously. In this example, pcisch18:e03b1slot2 is one of the four PCI slots on IO3:
Workaround: Do not execute a PCI hotplug operation while an hsPCI I/O board is being unconfigured.
Due to a race condition, a PCI slot with an empty cassette might show a disconnected state rather than the usual connected state after a DR operation on a Slot 1 I/O board (hsPCI). A PCI slot with an empty cassette should be in the connected state for FRUID purposes. For example:
Workaround: Run the cfgadm command to put the PCI slot in a connected state. For example:
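As a hedged illustration of this workaround, the sketch below reads a cfgadm listing on standard input and prints, without running it, a cfgadm -c connect command for each attachment point whose receptacle state is disconnected. The function name suggest_connect is hypothetical, and the sketch assumes the default listing column order of Ap_Id, Type, Receptacle, Occupant, Condition.

```shell
#!/bin/sh
# Hypothetical filter: suggest_connect is an illustrative name.
# Reads a cfgadm listing on standard input and prints (does not run)
# a connect command for every attachment point whose third column,
# Receptacle, reads "disconnected". The header row never matches.
suggest_connect() {
    awk '$3 == "disconnected" { print "cfgadm -c connect " $1 }'
}

# Usage: cfgadm | suggest_connect
```

Review the printed commands before running any of them. The filter reports every disconnected attachment point in its input, including any that were disconnected intentionally, so run only the command for the affected PCI slot.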
A hang can occur when permanent memory from a 32 GB board has been unconfigured, a copy-rename writes it to a target board that has less than 32 GB of memory, and another copy-rename attempts to write it to a third board with less than 32 GB. One example might be where memory is moved from a 32 GB board to an 8 GB board, then to a 16 GB board.
Workaround: Do not assign permanent memory to a 32 GB board, or do not mix 32 GB boards with smaller boards in a domain when permanent memory is unconfigured.
If non-permanent memory is unconfigured, the system removes retired pages from the retired pages list to prevent them from becoming dangling pages (that is, pages that point to physical memory that would have been unconfigured).
When permanent memory is unconfigured, a target board is identified and unconfigured first. Once a target board is ready, the contents of the source board (the permanent memory) are copied to the target board. The target board is then "renamed" (memory controllers are programmed) to have the same address range as the source board. Therefore, if the source board contains any retired pages, these pages are not dangling pages after the rename. They point to valid addresses, but the physical memory behind those addresses is in the target board. The problem is that the physical memory is probably good (does not contain ECC errors).
The automatic page removal feature may result in removal of a good page after a DR operation.
Workaround: Disable automatic_page_removal.
These errors can occur on systems with devices that define a nonunique portID. For example, if you attempt a DR operation on a CPU for which the portID is defined as 0x000000, and the system contains an I/O device whose portID is also defined as 0x000000, the DR operation fails.
The prtdiag, psrinfo, and cfgadm commands on a Sun Fire E25K or E20K might incorrectly display the speed at which the board is rated, not the actual speed.
Workaround: See your licensed Sun Service personnel for possible fixes.
Copyright © 2003, Sun Microsystems, Inc. All rights reserved.