Chapter 7 - Dynamic Reconfiguration on Sun Fire High-End Systems

This chapter describes major domain-side dynamic reconfiguration (DR) bugs on Sun Fire high-end (Sun Fire 25K/20K/15K/12K) systems running Solaris 9 4/04 software.

For information about SMS-side DR bugs, see the SMS Release Notes for the version of SMS running on your system.

Known Software Bugs

memscrubber Periodically Runs Nonstop with Big Mem, Interferes with DR (BugID 4647808)

Description: When a domain is configured with a large amount of memory (340 Gbytes or more), either at boot time or due to subsequent DR operations, the memory scrubbing thread monopolizes a particular system lock for 60 to 90 minutes once every 12 hours. Any DR operation that attempts to configure or unconfigure memory in the domain during one of these windows hangs until the system lock is released. As long as a DR operation remains hung for this reason, any additional DR operations also hang.

Workaround: This problem resolves on its own within 90 minutes. To avoid it, add the following line to the /etc/system file prior to booting:

set memscrub_span_pages = 0x3000
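As an illustration only, the following Python sketch shows one way to add such a line to an /etc/system-style file without duplicating it. The function names (normalize, ensure_setting) are hypothetical, and the sketch assumes that lines beginning with an asterisk are comments. It works on a string, so it can be tried on a copy of the file rather than on /etc/system itself.

```python
import re


def normalize(line):
    """Collapse whitespace so 'set a = 1' and 'set a=1' compare equal."""
    return re.sub(r"\s*=\s*", "=", " ".join(line.split()))


def ensure_setting(text, setting):
    """Return text with `setting` present as an active (uncommented) line."""
    wanted = normalize(setting)
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*"):   # '*' starts a comment in /etc/system
            continue
        if normalize(stripped) == wanted:
            return text                # already present; leave the text alone
    if text and not text.endswith("\n"):
        text += "\n"
    return text + setting + "\n"


# Example (hypothetical file contents):
#   updated = ensure_setting(open("system.copy").read(),
#                            "set memscrub_span_pages = 0x3000")
```

The change takes effect only at the next boot, so the edited file must be in place before the domain is rebooted.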

Deleteboard Shows Leakage Error (BugID 4730142)

Description: When a DR command is executing on a system configured with the Freshchoice card (also called SunSwift PCI card, Option 1032), the system may display messages similar to the following:

Aug 12 12:27:41 machine genunix: WARNING: vmem_destroy('pcisch2_dvma'): leaked

These messages are benign; the DVMA space is properly refreshed during the DR operation. No true kernel memory leak occurs. This bug affects domains running both Solaris 8 and Solaris 9 operating environments.

Workaround: No workaround is necessary, but to prevent the message from displaying, add the following line to /etc/system:

set pcisch:pci_preserve_iommu_tsb=0

glm: Hang in scsi_transport During DR (BugID 4737786)

Description: A cfgadm(1M) unconfigure operation on permanent memory may hang if it is executed on a system with an active glm driver. The problem is specific to DR operations involving permanent memory, which require that the system be quiesced via suspend/resume. The problem lies with the glm driver. This bug affects domains running both Solaris 8 and Solaris 9 operating environments.

Workaround: Do not unconfigure permanent memory in the system if the glm driver is active.

System Panic During ddi_attach sequence (BugID 4797110)

Description: Unconfiguring an hsPCI or hsPCI+ I/O board while a PCI option card is being configured into it causes a system panic. For example, the panic would occur if the following commands were executed simultaneously. In this example, pcisch18:e03b1slot2 is one of the four PCI slots on IO3:

cfgadm -c unconfigure IO3

cfgadm -c configure pcisch18:e03b1slot2

Workaround: Do not execute a PCI hotplug operation while an hsPCI or hsPCI+ I/O board is being unconfigured.
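One way to keep the two operations from overlapping is to run every DR and hotplug command through a wrapper that holds an exclusive lock. The Python sketch below is a minimal illustration under stated assumptions: the lock file path and the function name run_serialized are hypothetical, and the lock protects only against commands started through the same wrapper, not against cfgadm run directly by another administrator.

```python
import fcntl
import subprocess

LOCK_PATH = "/var/run/dr-ops.lock"  # hypothetical location for the lock file


def run_serialized(cmd, lock_path=LOCK_PATH):
    """Run cmd while holding an exclusive advisory lock; return its exit status."""
    with open(lock_path, "w") as lock_file:
        # Blocks until any other wrapper instance has released the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            return subprocess.run(cmd).returncode
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


# Example, using the commands from the description above:
#   run_serialized(["cfgadm", "-c", "unconfigure", "IO3"])
#   run_serialized(["cfgadm", "-c", "configure", "pcisch18:e03b1slot2"])
```

With this arrangement, a configure request issued while the board is being unconfigured waits until the unconfigure finishes instead of running concurrently with it.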

Panic: mp_cpu_quiesce: cpu_thread != cpu_idle_thread (BugID 4873353)

Description: Under certain error conditions, using DR to unconfigure a processor can leave that processor in the powered-off state. If psradm(1M) is then used to transition the processor to the off-line state, a system panic may result. Factors contributing to the problem are that Solaris does not expect processors to be in the powered-off state long-term, and psradm(1M) does not allow transitioning of processors to the powered-off state.

Workaround: Do not use psradm(1M) to offline a processor that is in the powered-off state.
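The check described in the workaround can be scripted. The following Python sketch assumes that psrinfo(1M) output has the processor ID in the first column and the state (for example, on-line or powered-off) in the second; the helper names parse_psrinfo and offline_command are hypothetical. It only builds the psradm command line and refuses to do so for a powered-off processor.

```python
def parse_psrinfo(output):
    """Map processor ID -> state from psrinfo-style text."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            states[int(fields[0])] = fields[1]
    return states


def offline_command(states, cpu_id):
    """Return the psradm argv to offline cpu_id, or raise if that is unsafe."""
    state = states.get(cpu_id)
    if state is None:
        raise ValueError(f"processor {cpu_id} is not listed")
    if state == "powered-off":
        raise ValueError(f"processor {cpu_id} is powered-off; do not offline it")
    return ["psradm", "-f", str(cpu_id)]


# Example (illustrative text, not captured from a real system):
#   states = parse_psrinfo("0  on-line      since 04/12/2004 10:15:32\n"
#                          "1  powered-off  since 04/12/2004 10:20:01\n")
#   offline_command(states, 0)   # builds the psradm command for processor 0
#   offline_command(states, 1)   # raises ValueError
```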

Rated Proc Speed Used Instead of Actual with DR Operation on Sun Fire High-End Systems (BugID 4964679)

Description: Processors added using DR are shown by various tools as running at the processor's rated frequency rather than its actual frequency. In most cases, the rated and actual frequencies for a processor are the same. Processors present in the system at boot display the correct, actual frequency.

Workaround: None.

Failed to Indict L2 Cache on a Sun Fire E25K/E20K When the Board was Configured via DR (BugID 4984562)

Description: If automatic processor removal is enabled on a Sun Fire E25K/E20K system, an event notifying the system controller that a processor has been offlined due to L2 cache errors may not get delivered if the board was added using DR. The process of offlining the processor on the domain is not affected. Boards present in the domain at boot do not experience this problem.

Workaround: None.

cfgadm_sbd Plugin Signal Handling Is Broken (BugID 4498600)

Description: Sending a catchable signal, such as SIGINT sent by CTRL-C, to one or more cfgadm instances can cause those instances to hang. The problem is more likely to occur when multiple cfgadm processes are running, and can affect cfgadm instances on system boards, processors, I/O boards, and PCI slot attachment points. The problem has not been observed with a SIGKILL, and does not affect cfgadm status commands.

Workaround: None. To avoid this bug, do not send a catchable signal to a cfgadm process invoked to change the state of a component; for example, one executed with its -c or -x option.
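A wrapper can reduce the chance of an accidental CTRL-C reaching a state-changing cfgadm command. The Python sketch below is one possible approach, not a fix for the bug: the function name run_shielded is hypothetical, and it assumes that starting the command in a new session keeps terminal-generated signals away from it. It does not protect against signals sent explicitly, for example with kill(1).

```python
import signal
import subprocess


def run_shielded(cmd):
    """Run cmd so that a terminal CTRL-C does not reach it; return exit status."""
    # Ignore SIGINT in the wrapper itself while the command runs, so the
    # wrapper does not abort (and take the child down) on CTRL-C.
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        # A new session detaches the child from the terminal's foreground
        # process group, so CTRL-C is not delivered to it.
        return subprocess.run(cmd, start_new_session=True).returncode
    finally:
        if previous is not None:
            signal.signal(signal.SIGINT, previous)


# Example:
#   run_shielded(["cfgadm", "-c", "unconfigure", "IO3"])
```

Status commands are not affected by the bug, so they do not need this treatment.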

page_retire Does Not Update Retired Page List in Some Cases (BugID 4893666)

Description: If non-permanent memory is unconfigured, the system removes retired pages from the retired pages list to prevent them from becoming dangling pages, that is, pages that point to physical memory that has been unconfigured. When permanent memory is unconfigured, a target board is first identified and unconfigured. Once the target board is ready, the contents of the source board (the permanent memory) are copied to it. The target board is then "renamed" (its memory controllers are programmed) to have the same address range as the source board. As a result, any retired pages on the source board are not dangling pages after the rename: they point to valid addresses, but the physical memory behind those addresses is now on the target board. The problem is that this physical memory is probably good (does not contain ECC errors), yet the pages remain on the retired pages list.

Workaround: None.
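The copy-rename sequence can be illustrated with a toy model. The Python sketch below is not how the kernel tracks memory; the board names, page numbers, and helper functions are all hypothetical. It shows only why a page retired on the source board ends up referring to good memory on the target board after the rename.

```python
def board_serving(page, layout):
    """Return the name of the board whose range contains `page`, or None."""
    for name, (base, size) in layout.items():
        if base <= page < base + size:
            return name
    return None  # dangling: no board serves this page


def copy_rename(layout, source, target):
    """Model unconfiguring permanent memory: target takes over source's range."""
    new_layout = dict(layout)
    new_layout[target] = new_layout.pop(source)
    return new_layout


# Hypothetical domain: two boards of 100 pages each.
layout = {"SB0": (0, 100), "SB1": (100, 100)}
faulty = {("SB0", 10)}   # (board, offset) pairs with real ECC errors
retired = {10}           # retired pages list, keyed by physical page number

after = copy_rename(layout, "SB0", "SB1")
for page in sorted(retired):
    board = board_serving(page, after)
    base, _ = after[board]
    is_bad = (board, page - base) in faulty
    print(page, board, "faulty" if is_bad else "good but still retired")
# prints: 10 SB1 good but still retired
```

In the model, page 10 stays on the retired list because its address is still valid, even though the faulty memory left the domain with SB0.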

Page Removal Causes a Good Page to be Removed After a DR Operation (BugID 4860955)

Description: The automatic page removal feature may result in removal of a good page after a DR operation.

Workaround: Disable automatic_page_removal.

Known Hardware Bugs

GigaSwift Ethernet MMF Link Goes Down With CISCO 4003 Switch After DR Attach (BugID 4709629)

Description: Attempting to execute a DR operation on a system with Sun GigaSwift Ethernet MMF Option X1151A, part number 595-5773, attached to certain CISCO switches causes the link to fail. The problem is caused by a known bug in the following CISCO hardware/firmware:

CISCO WS-c4003 switch (f/w: WS-C4003 Software, Version NmpSW: 4.4(1))
CISCO WS-c4003 switch (f/w: WS-C4003 Software, Version NmpSW: 7.1(2))
CISCO WS-c5500 switch (f/w: WS-C5500 Software, Version McpSW: 4.2(1) and NmpSW: 4.2(1))

This problem is not seen on the CISCO 6509 switch.

Workaround: Use another switch or consult Cisco for a patch.