CHAPTER 5

Dynamic Reconfiguration on Sun Fire High-End Systems

This chapter describes major domain-side dynamic reconfiguration (DR) bugs on Sun Fire high-end (Sun Fire E25K/E20K/15K/12K) systems running Solaris 9 9/05 software. It includes the known bugs at the time of this release.

For information about SMS-side DR bugs, see the SMS Release Notes for the version of SMS running on your system.

Known Bugs

Deleteboard Shows Leakage Error (BugID 4730142)

Description: When a DR command is executing on a system configured with the Freshchoice card (also called SunSwift PCI card, Option 1032), the system might display messages similar to the following:

Aug 12 12:27:41 machine genunix: WARNING: vmem_destroy('pcisch2_dvma'): leaked

These messages are benign; the DVMA space is properly refreshed during the DR operation. No true kernel memory leak occurs. This bug affects domains running both Solaris 8 and Solaris 9 operating environments.

Workaround: No workaround is necessary, but to prevent the message from displaying, add the following line to /etc/system:

set pcisch:pci_preserve_iommu_tsb=0
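If the line is added by script rather than by hand, it helps to avoid adding it twice. The following is a minimal sketch, assuming a POSIX-compliant shell; the function name add_system_setting is hypothetical and not part of Solaris. It appends the line only when an identical line is not already present.

```shell
# Hypothetical helper: append a setting to an /etc/system-style file
# only if that exact line is not already there.
# Arguments: $1 = file path, $2 = exact line to add.
add_system_setting() {
    file=$1
    line=$2
    if [ -f "$file" ]; then
        # Compare each existing line against the requested line.
        while IFS= read -r existing; do
            if [ "$existing" = "$line" ]; then
                return 0    # already present, nothing to do
            fi
        done < "$file"
    fi
    printf '%s\n' "$line" >> "$file"
}
```

A call such as add_system_setting /etc/system 'set pcisch:pci_preserve_iommu_tsb=0' would be run as superuser, preferably after saving a copy of the file. Changes to /etc/system take effect at the next reboot.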

glm: Hang in scsi_transport During DR (BugID 4737786)

Description: A cfgadm(1M) unconfigure operation on permanent memory might hang if the glm driver is active on the system. The problem is specific to DR operations involving permanent memory, which require that the system be quiesced by means of suspend/resume. The problem lies with the glm driver. This bug affects domains running both Solaris 8 and Solaris 9 operating environments.

Workaround: Do not unconfigure permanent memory in the system if the glm driver is active.
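To follow this workaround, it is useful to know which boards hold permanent memory. The following is a minimal sketch, assuming that verbose cfgadm listings on these systems report permanent memory in the Information field of a memory attachment point with text such as "KBytes permanent". That output format is an assumption and should be checked against the actual output on your domain. The function name list_permanent_memory is hypothetical.

```shell
# Hypothetical helper: read cfgadm -av style output on standard input and
# print the memory attachment points whose Information field mentions
# permanent memory. The matched text is an assumed output format.
list_permanent_memory() {
    awk '$1 ~ /::memory$/ && /KBytes permanent/ { print $1 }'
}
```

For example, cfgadm -av | list_permanent_memory would print attachment points such as SB0::memory. The sketch does not check whether the glm driver is active; that must be determined separately.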

System Panic During ddi_attach Sequence (BugID 4797110)

Description: Unconfiguring a hsPCI or hsPCI+ I/O board while a PCI option card is being configured into it causes a system panic. For example, the panic would occur if the following commands were executed simultaneously. In this example, pcisch18:e03b1slot2 is one of the four PCI slots on IO3:

Workaround: Do not execute a PCI hotplug operation while a hsPCI or hsPCI+ I/O board is being unconfigured.

Panic: mp_cpu_quiesce: cpu_thread != cpu_idle_thread (BugID 4873353)

Description: Under certain error conditions, using DR to unconfigure a processor can leave that processor in the powered-off state. If psradm(1M) is then used to transition the processor to the off-line state, a system panic may result. Factors contributing to the problem are that Solaris does not expect processors to be in the powered-off state long-term, and psradm(1M) does not allow transitioning of processors to the powered-off state.

Workaround: Do not use psradm(1M) to offline a processor that is in the powered-off state.
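The workaround can be enforced with a check of the processor state before psradm(1M) is called. The following is a minimal sketch, assuming a POSIX-compliant shell and awk (for example nawk on Solaris), and assuming that psrinfo(1M) prints the processor ID in the first column and the state in the second. The function names cpu_state and safe_to_offline are hypothetical.

```shell
# Hypothetical helper: read psrinfo-style output on standard input and
# print the state column for the processor ID given as $1.
cpu_state() {
    awk -v id="$1" '$1 == id { print $2 }'
}

# Hypothetical guard: read psrinfo-style output on standard input.
# Returns 0 if processor $1 may be taken offline, 1 if it is in the
# powered-off state, and 2 if the processor ID was not found.
safe_to_offline() {
    state=$(cpu_state "$1")
    case $state in
        powered-off) return 1 ;;
        "")          return 2 ;;
        *)           return 0 ;;
    esac
}
```

For example, psrinfo | safe_to_offline 1 && psradm -f 1 would take processor 1 offline only if it is not reported as powered-off.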

cfgadm_sbd Plugin Signal Handling Is Broken (BugID 4498600)

Description: Sending a catchable signal, such as SIGINT sent by CTRL-C, to one or more cfgadm instances can cause those instances to hang. The problem is more likely to occur when multiple cfgadm processes are running, and can affect cfgadm instances on system boards, processors, I/O boards, and PCI slot attachment points. The problem has not been observed with a SIGKILL, and does not affect cfgadm status commands.

Workaround: None. To avoid this bug, do not send a catchable signal to a cfgadm process invoked to change the state of a component; for example, one executed with its -c or -x option.

page_retire Does Not Update Retired Page List in Some Cases (BugID 4893666)

Description: If nonpermanent memory is unconfigured, the system removes retired pages from the retired pages list to prevent them from becoming dangling pages, that is, pages that point to physical memory that has been unconfigured. When permanent memory is unconfigured, a target board is identified and unconfigured first. Once the target board is ready, the contents of the source board (the permanent memory) are copied to the target board. The memory controllers on the target board are then "renamed" (programmed) with the same address range as the source board. This means that if the source board contained any retired pages, these pages would not be dangling pages after the rename. They would point to valid addresses, but the physical memory behind those addresses is on the target board. The problem is that this physical memory is probably good (does not contain ECC errors), yet the pages remain on the retired pages list.

Workaround: None.

Page Removal Causes a Good Page to Be Removed After a DR Operation (BugID 4860955)

Description: The automatic page removal feature may result in removal of a good page after a DR operation.

Workaround: Disable automatic_page_removal.