CHAPTER 5

Open Issues for Sun Fire Midrange Systems

This chapter describes open issues related to the Sun Fire midrange servers -- the Sun Fire E6900/E4900/6800/4810/4800/3800 systems -- running Solaris 8 2/04 software.

For information about the earlier Sun Enterprise midrange servers -- the Sun Enterprise 6500/6000/5500/5000/4500/3500/3000 systems -- see Chapter 6.

Dynamic Reconfiguration on Sun Fire Midrange Systems

This section describes DR on Sun Fire midrange systems running Solaris 8 2/04 software. This is the first release of Solaris 8 software to support the new Sun Fire E6900 and E4900 systems. The first system controller (SC) firmware release to support these systems is 5.16.0.

TABLE 5-1 shows acceptable combinations of Solaris software and SC firmware for each Sun Fire midrange system to run DR. If the platform listed in the first column is running the Solaris release shown in the second column, the minimum SC firmware release is on that same line in the third column.

TABLE 5-1 Minimum SC Firmware for Each Platform/Solaris Release

Platform               Solaris Release        Minimum SC Firmware
E6900/E4900            Solaris 8 2/04 only    5.16.0
6800/4810/4800/3800    Solaris 8 2/04         5.13.0
6800/4810/4800/3800    Solaris 8 2/02         5.12.6
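As a rough illustration, TABLE 5-1 can be encoded as a lookup for use in a pre-flight check script. The function names min_sc_firmware and version_ge are hypothetical, and the sketch assumes that firmware versions always have exactly three numeric fields. It does not attempt to query the SC for the running firmware version.

```sh
# Hypothetical helpers; the values come from TABLE 5-1.

# min_sc_firmware PLATFORM UPDATE
#   PLATFORM is a model such as E6900 or 4800; UPDATE is the Solaris 8
#   update (2/04 or 2/02). Prints the minimum SC firmware release, or
#   fails if the combination is not listed in the table.
min_sc_firmware() {
    case "$1:$2" in
        E6900:2/04|E4900:2/04) echo 5.16.0 ;;
        6800:2/04|4810:2/04|4800:2/04|3800:2/04) echo 5.13.0 ;;
        6800:2/02|4810:2/02|4800:2/02|3800:2/02) echo 5.12.6 ;;
        *) return 1 ;;
    esac
}

# version_ge A B
#   Succeeds if dotted version A is greater than or equal to B.
#   Assumes three numeric fields in each version (for example 5.13.0).
version_ge() {
    old_ifs=$IFS
    IFS=.
    set -- $1 $2          # split both versions into six positional fields
    IFS=$old_ifs
    [ "$1" -gt "$4" ] && return 0
    [ "$1" -lt "$4" ] && return 1
    [ "$2" -gt "$5" ] && return 0
    [ "$2" -lt "$5" ] && return 1
    [ "$3" -ge "$6" ]
}
```

For example, version_ge 5.16.0 "$(min_sc_firmware 4800 2/04)" succeeds, because 5.16.0 is at or above the 5.13.0 minimum for that platform and release.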


For the latest patch information, see http://sunsolve.sun.com



Note - Your Sun Fire midrange system should run the latest SC firmware version to take advantage of the most recent bug fixes and added features.



Sun Management Center

Sun Management Center software supports DR on domains running Solaris 8 2/04 software. Refer to the SunMC Software Supplement for Sun Fire Midrange Systems for complete instructions.

System-Specific DR Support

To view system-specific DR information, run the cfgadm(1M) command. System boards are indicated as class "sbd". CompactPCI (cPCI) cards are shown as class "pci". You may see other DR classes as well.

To view the classes that are associated with attachment points, run the following command as superuser:

# cfgadm -s "cols=ap_id:class"

You can use the cfgadm command with its -a option to list dynamic attachment points. To determine the class of a specific attachment point, add the attachment point as an argument to the above command.
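When scripting, the class listing can be filtered to show only the attachment points of one class. The following is a minimal sketch: the function name ap_ids_of_class is hypothetical, and it assumes that the listing has two whitespace-separated columns with a header line that begins with Ap_Id. On Solaris 8 you may need nawk or /usr/xpg4/bin/awk in place of awk for the -v option.

```sh
# ap_ids_of_class CLASS
#   Reads the output of: cfgadm -s "cols=ap_id:class"
#   on standard input and prints the attachment points whose class
#   matches CLASS. The header line (assumed to start with Ap_Id) is skipped.
ap_ids_of_class() {
    awk -v cls="$1" '$1 != "Ap_Id" && $2 == cls { print $1 }'
}
```

A typical use would be: cfgadm -s "cols=ap_id:class" | ap_ids_of_class sbd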

Page Retire Feature

The Dynamic Reconfiguration (DR) feature has been enhanced to take advantage of the Solaris Page Retire feature. DR now lets you logically detach a system board that is experiencing a high number of memory errors, in some cases where it would not let you do so before. The board can then be serviced to correct any failing memory problems.

Upgrading the System Firmware

Each firmware patch includes a file called Install.info, which contains firmware installation instructions. You can find all firmware patches for your system on SunSolve.

Known DR Limitations

This section contains known DR software limitations of the Sun Fire midrange systems.

General DR Limitations

Limitations Specific to CompactPCI

Procedures for Bringing a cPCI Network Interface (IPMP) Online or Offline

To Take a cPCI Network Interface (IPMP) Offline and Remove It

1. Retrieve the group name, test address, and interface index by typing the following command.

# ifconfig interface

For example, ifconfig hme0

2. Use the if_mpadm(1M) command as follows:

# if_mpadm -d interface

This takes the interface offline and causes the failover addresses to fail over to another active interface in the group. If the interface is already in a failed state, this step simply ensures that the interface is marked offline.

3. (Optional) Unplumb the interface.

This step is required only if you want to use DR to reconfigure the interface automatically at a later time.

4. Remove the physical interface.

Refer to the cfgadm(1M) man page and the Sun Fire Midrange Systems Dynamic Reconfiguration User Guide for more information.
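The four steps above can be sketched as a single shell function. This is an illustration only: the names run and ipmp_offline and the DRY_RUN variable are hypothetical, the attachment point must be supplied by the caller, and the use of cfgadm -c disconnect for step 4 is an assumption that you should check against the cfgadm(1M) man page and the DR user guide for your hardware. By default the sketch only prints the commands.

```sh
# run CMD [ARGS...]
#   Prints the command unless DRY_RUN=0, in which case it is executed.
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    else
        echo "$*"
    fi
}

# ipmp_offline INTERFACE AP_ID
#   INTERFACE is a name such as hme0; AP_ID is the attachment point of
#   the cPCI card as reported by cfgadm (supplied by the caller).
ipmp_offline() {
    ifname=$1
    ap_id=$2
    run ifconfig "$ifname" || return 1           # step 1: group, test address, index
    run if_mpadm -d "$ifname" || return 1        # step 2: take offline, fail over
    run ifconfig "$ifname" unplumb || return 1   # step 3: optional unplumb
    run cfgadm -c disconnect "$ap_id"            # step 4: assumed removal command
}
```

Printing the commands first makes it possible to review the sequence before running it with DRY_RUN=0 as superuser.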

To Attach and Bring Online a cPCI Network Interface (IPMP)

1. Attach the physical interface.

Refer to the cfgadm(1M) man page and the Sun Fire Midrange Systems Dynamic Reconfiguration User Guide for more information.

After you attach the physical interface, it is automatically configured using settings in the hostname configuration file (/etc/hostname.interface, where interface is a value such as hme1 or qfe2).

This triggers the in.mpathd daemon to resume probing and detect repairs. Consequently, in.mpathd causes the original IP addresses to fail back to this interface. The interface should now be online and ready for use under IPMP.



Note - If the interface was not unplumbed and set to the OFFLINE status before the previous detach, the attach operation described here does not configure it automatically. To set the interface back to the ONLINE status and fail back its IP address after the physical attach is complete, enter the following command: if_mpadm -r interface
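A script can decide whether the if_mpadm -r step is needed by checking the interface flags. The following sketch assumes that ifconfig reports an OFFLINE flag inside the flags=<...> list for an interface that was taken offline with if_mpadm -d; the function name if_is_offline is hypothetical.

```sh
# if_is_offline
#   Reads the output of "ifconfig interface" on standard input and
#   succeeds if the flags list contains OFFLINE.
if_is_offline() {
    grep 'flags=.*[<,]OFFLINE[,>]' >/dev/null
}
```

For example: ifconfig hme0 | if_is_offline && if_mpadm -r hme0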



Operating System Quiescence

This section discusses permanent memory, and the requirement to quiesce the operating system when unconfiguring a system board that has permanent memory.

A quick way to determine whether a board has permanent memory is to run the following command as superuser:

# cfgadm -av | grep permanent

The system responds with output such as the following, which describes system board 0 (zero):

N0.SB0::memory connected configured ok base address 0x0, 4194304 KBytes total, 668072 KBytes permanent

Permanent memory is where the Solaris kernel and its data reside. The kernel cannot be released from memory in the same way that user processes residing in other boards can release memory by paging out to the swap device. Instead, cfgadm uses the copy-rename technique to release the memory.
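To check for permanent memory from a script, the cfgadm -av output can be reduced to board names and sizes. This is a minimal sketch that assumes each memory entry appears on a single line in the format shown above; the function name permanent_memory_boards is hypothetical.

```sh
# permanent_memory_boards
#   Reads "cfgadm -av" output on standard input. For each line that
#   reports permanent memory, prints the board attachment point and the
#   permanent memory size in KBytes, for example: N0.SB0 668072
permanent_memory_boards() {
    awk '/KBytes permanent/ {
        for (i = 3; i <= NF; i++) {
            if ($i == "permanent") {
                split($1, ap, "::")    # strip the ::memory suffix
                print ap[1], $(i - 2)  # size precedes "KBytes permanent"
            }
        }
    }'
}
```

A typical use would be: cfgadm -av | permanent_memory_boards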

The first step in a copy-rename operation is to stop all memory activity on the system by pausing all I/O operations and thread activity; this is known as quiescence. During quiescence, the system is frozen and does not respond to external events such as network packets. The duration of the quiescence depends on two factors: how many I/O devices and threads need to be stopped; and how much memory needs to be copied. Typically the number of I/O devices determines the required quiescent time, because I/O devices must be paused and unpaused. Typically, a quiescent state lasts longer than two minutes.

Because quiescence has a noticeable impact, cfgadm requests confirmation before effecting quiescence. If you enter:

# cfgadm -c unconfigure N0.SB0

The system responds with a prompt for confirmation:

System may be temporarily suspended, proceed (yes/no)?

If you are using SunMC to perform the DR operation, a pop-up window displays this prompt.

Enter yes to confirm that the impact of the quiesce is acceptable, and to proceed.

Dynamic Reconfiguration Software Bugs

This section lists the more important bugs that have been discovered during testing of DR. This list does not include all bugs.

Known Dynamic Reconfiguration Bugs

cryptorand Exited After Removing CPU Board With Dynamic Reconfiguration (BugID 4456095)

Description: If a system is running the cryptorand process, which is found in the SUNWski package, an unconfigure of memory, such as part of a CPU/Memory (SB) board disconnect, causes cryptorand to exit, with messages recorded in /var/adm/messages. This denies random number services to secure subsystems. Any memory that is present when cryptorand is started should therefore not be unconfigured.

The cryptorand process supplies random numbers for /dev/random. After cryptorand is started, the amount of time before /dev/random becomes available depends on the amount of memory in the system. It takes about two minutes per GB of memory. Applications that use /dev/random to get random numbers may block temporarily. It is not necessary to restart cryptorand if a CPU/memory board is added to a domain.
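The two-minutes-per-GB figure gives a rough estimate of how long /dev/random may be unavailable after cryptorand starts. The sketch below applies that rule of thumb to a memory size in KBytes, rounding up to whole minutes; the function name cryptorand_wait_minutes is hypothetical, and the result is only an approximation.

```sh
# cryptorand_wait_minutes MEM_KB
#   Estimates the startup delay at two minutes per GB (1048576 KBytes),
#   rounded up to a whole number of minutes.
cryptorand_wait_minutes() {
    expr \( "$1" \* 2 + 1048575 \) / 1048576
}
```

For a domain with 4194304 KBytes (4 GB) of memory, cryptorand_wait_minutes 4194304 prints 8.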

Workaround: If a CPU/memory board is removed from the domain, restart cryptorand by entering the following command as superuser:

# sh /etc/init.d/cryptorand start

SBM Sometimes Causes System Panic During DR Operations (BugID 4506562)

Description: A panic may occur when a system board that contains CPUs is removed from the system while Solaris Bandwidth Manager (SBM) is in use.

Workaround: Do not install SBM on systems that will be used for DR trials, and do not perform CPU system board DR operations on systems with SBM installed.

DR Commands Hang Waiting for rcm_daemon While Running ipc, vm, and ism Stress (BugID 4508927)

Description: In rare cases, a quiesce of the Solaris software fails to stop certain user threads, or fails to restart others, which then remain in a stopped state. Depending on the threads affected, applications running on the domain may stop running, and other DR operations may not be possible until the domain is rebooted.

Workaround: Do not use DR to remove a board that contains permanent memory.

Unable to Disconnect SCSI Controllers Using DR (BugID 4446253)

Description: When a SCSI controller is configured but not busy, it cannot be disconnected using the DR cfgadm(1M) command.

Workaround: None.

cfgadm_sbd Plugin Signal Handling Is Completely Broken (BugID 4498600)

Description: When a single-threaded or multi-threaded client of the cfgadm library issues concurrent sbd requests, the system may hang.

Workaround: None. To avoid this bug, do not run multiple instances of cfgadm against system boards in parallel, and do not send signals, such as CTRL-C, to long-running cfgadm operations.

DR Operations Hang After a Few Loops When CPU Power Control Is Also Running (BugID 4114317)

Description: When multiple concurrent DR operations occur, or when psradm is run at the same time as a DR operation, the system can hang because of a mutex deadlock (deadly embrace).

Workaround: Perform DR operations serially (one DR operation at a time), and allow each to complete successfully before running psradm or beginning another DR operation.
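One way to enforce serial execution from scripts is a simple lock around each DR or psradm command. The following is a sketch only: the function name with_dr_lock and the lock path are hypothetical, the lock protects only commands that are started through the wrapper, and a stale lock directory left by a killed script must be removed by hand.

```sh
# Lock directory shared by all cooperating scripts (hypothetical path).
DR_LOCK_DIR=${DR_LOCK_DIR:-/tmp/dr_serial.lock}

# with_dr_lock CMD [ARGS...]
#   Runs CMD only if no other wrapped operation is in progress.
#   mkdir is atomic, so exactly one caller can create the directory.
with_dr_lock() {
    mkdir "$DR_LOCK_DIR" 2>/dev/null || {
        echo "another DR operation holds $DR_LOCK_DIR" >&2
        return 1
    }
    "$@"
    status=$?
    rmdir "$DR_LOCK_DIR"
    return $status
}
```

For example, with_dr_lock cfgadm -c unconfigure N0.SB0 refuses to start while another wrapped cfgadm or psradm command is still running.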

System May Panic When send_mondo_set Times Out (BugID 4518324)

Description: A Sun Fire system may panic if one or more of the CPU boards are sync paused during a DR operation. Sync pause is required to attach or detach boards. If there are outstanding mondo interrupts, and for any reason the SC is not able to complete sync pause within the one-second send_mondo timeout limit, the system panics.

Test sdrfunc_072.pl Panicked in DDI Layer (BugID 4622581)

Description: A cPCI slot operation cannot be performed concurrently with a PCI bus operation. If at least one second does not separate these actions, the system may panic. The risk is very small for manual cfgadm operations, but higher for automated executions, such as those done in a shell script.

Workaround: Insert at least a one-second delay between cPCI slot DR operations and PCI bus DR operations when automating these operations.
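In an automated script, the required delay can be built into a small wrapper so that it is never forgotten. This is a minimal sketch: the function name dr_step and the DR_GAP_SECONDS variable are hypothetical, and the two-second default is simply a margin above the one-second minimum stated above.

```sh
# Pause after every wrapped DR operation (seconds).
DR_GAP_SECONDS=${DR_GAP_SECONDS:-2}

# dr_step CMD [ARGS...]
#   Runs one DR command, then sleeps before returning, so that the next
#   cPCI slot or PCI bus operation cannot start immediately.
#   The exit status of CMD is preserved.
dr_step() {
    "$@"
    status=$?
    sleep "$DR_GAP_SECONDS"
    return $status
}
```

Each cfgadm invocation in the script is then written as dr_step cfgadm followed by its usual arguments.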

DR Disconnect on Gigaswift cPCI Device Causes ifconfig Hang (BugID 4942945)

Description: Under heavy network load, a disconnect operation on a Gigaswift cPCI device appears to hang. This problem occurs because the disconnect calls ifconfig to unplumb the interface, and the process is unable to make progress under heavy load. This problem applies to ifconfig unplumb operations that are initiated manually, as well.

Workaround: Do not attempt to disconnect or unplumb a Gigaswift cPCI device when heavy traffic is present.

page_retire Does Not Update Retired Page List in Some Cases (BugID 4893666)

Description: If non-permanent memory is unconfigured, the system removes retired pages from the retired pages list to prevent them from becoming dangling pages - that is, pages that point to physical memory that would have been unconfigured.

When permanent memory is unconfigured, a target board is identified and unconfigured first. Once a target board is ready, the contents of the source board (the permanent memory) are copied to the target board. The target board is then "renamed" (memory controllers are programmed) to have the same address range as the source board. Consequently, if the source board contained any retired pages, these pages are not dangling pages after the rename. They point to valid addresses, but the physical memory behind those addresses is now in the target board. The problem is that this physical memory is probably good (that is, it does not contain ECC errors), yet the pages remain retired.

Workaround: None.

Page Removal Causes a Good Page to be Removed After a DR Operation (BugID 4860955)

Description: The automatic page removal feature may result in removal of a good page after a DR operation.

Workaround: Disable automatic_page_removal.

Cannot DR out cPCI IB with P0 Disabled (BugID 4798990)

For more information about this bug, please see Sun Alert 56880.

Description: On Sun Fire E6900/E4900/6800/4810/4800/3800 systems, a CompactPCI (cPCI) I/O board cannot be unconfigured when Port 0 (P0) on that board is disabled. This problem exists only on systems running Solaris 9, running Solaris 8 with Sun Patch 108528-11 through -27 and possibly later, or running Solaris 8 with Sun Patch 111372-02 through -04. It occurs only during DR operations that involve cPCI boards, and displays an error message similar to the following:

# cfgadm -c unconfigure IB7

Workaround: If you do not need to disable P0 itself, disable its slots, instead.