Sun Enterprise 6x00, 5x00, 4x00, and 3x00 Systems Dynamic Reconfiguration User's Guide

Chapter 2 Description and Functions

This chapter describes how Dynamic Reconfiguration (DR) works and explains the terms used in DR.

Using this Guide

  1. Determine the name and status of the board or card cage slot. You will find it listed in the online DR status report. See "How to Monitor Board Status".

  2. In the following table, find the entry corresponding to the condition of the board or device, then go to the procedure or reference listed in the Service Reference column.

Table 2-1 DR Conditions

Condition: empty
Explanation: No board is present in the slot. All LEDs are off.
Service Reference: To install a board, see "Installing a New Board".

Condition: disconnected
Explanation: A board is present but is electrically disconnected. The system is able to identify the board type. The board LEDs show that the board is in low power mode and can be unplugged at any time.
LEDs: green, yellow, green (Off, On, Off)
Use cfgadm -c disconnect to enter this state.
Service Reference: To remove a disconnected board, refer to the service manual for the system. To power up a disconnected board, see "Installing a New Board".

Condition: connected
Explanation: The board is electrically connected and powered up. The system is actively monitoring the board for temperature and cooling.
LEDs: green, yellow, green (On, Off, Off)
Use cfgadm -c connect to enter this state.
Service Reference: To remove a connected board, see "Removing a Board". To use a connected board, see "Installing a New Board".

Condition: configured
Explanation: Devices on the board are fully initialized and may be mounted or configured for use. The LEDs show the normal running pattern.
LEDs: green, yellow, green (On, Off, Flash)
Use cfgadm -c configure to enter this state.
Service Reference: To remove a configured board, see "Removing a Board".

Condition: unconfigured
Explanation: The unconfigured state covers all other device states, including receptacles in the empty state. The LED pattern is the same as for the connected receptacle state.
LEDs: green, yellow, green (On, Off, Off)
Use cfgadm -c unconfigure to enter this state.
Service Reference: To remove an unconfigured board, see "Removing a Board". To use an unconfigured board, see "Installing a New Board".

Condition: unknown
Explanation: The current condition cannot be determined. This situation results when either a new board is inserted in a running system or a board is placed on the disabled board list prior to a reboot. A transition to a connected receptacle state changes the attachment point condition from unknown to either ok or failed.
Service Reference: To use an unknown board, see "Installing a New Board".

Condition: ok
Explanation: No problems have been detected. This condition can only occur after a board has been connected. It persists until either the board is physically removed or a problem is detected. An ok condition requires correct hardware compatibility, correct firmware revision, adequate power, adequate cooling, and adequate precharge.
Service Reference: To remove an ok board, see "Removing a Board".

Condition: failing
Explanation: A failing condition can only occur when a board that was in the ok condition develops a problem, for example when the board has begun to overheat. This condition is displayed until the problem is corrected or the attachment point is disconnected.
Service Reference: To remove a failing board, see "Removing a Board". To correct an overheating condition, see the system service manual.

Condition: failed
Explanation: The board has failed POST/OBP. A failed condition may occur either during bootup or after a failed connect attempt. This condition is considered uncorrectable and persists until the board is physically removed. For a failed attachment point condition, the receptacle state should never transition beyond disconnected.
Service Reference: To remove a failed board, see "Removing a Board".

Condition: unusable
Explanation: Either an attachment point has incompatible hardware, or an empty attachment point lacks power, cooling, or precharge current. An unusable condition is correctable. This condition is caused by one of the following events: (1) inadequate cooling in a slot, (2) power is detected in an empty slot, (3) a disconnected board has inadequate cooling, inadequate power, or unsupported hardware, or (4) firmware has detected a problem either during bootup or when a board is inserted.
Service Reference: To remove a board from an unusable slot, see "Removing a Board". To correct overheating conditions in the slot, refer to the system service manual.
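The lookup in Table 2-1 can be sketched as a small shell function. This is only an illustration of how the table is read: the function name dr_service_reference is hypothetical and is not part of cfgadm, and the strings it prints are abbreviations of the table's Service Reference entries.

```shell
#!/bin/sh
# Sketch: map a name from Table 2-1 to the procedure the table points to.
# dr_service_reference is a hypothetical helper, not a DR command.
dr_service_reference() {
    case "$1" in
        empty|unknown)
            echo 'Installing a New Board' ;;
        disconnected)
            echo 'system service manual (remove); Installing a New Board (power up)' ;;
        connected|unconfigured)
            echo 'Removing a Board (remove); Installing a New Board (use)' ;;
        configured|ok|failed)
            echo 'Removing a Board' ;;
        failing|unusable)
            echo 'Removing a Board (remove); system service manual (overheating)' ;;
        *)
            # Anything outside the ten table entries is rejected.
            echo "unrecognized condition: $1" >&2
            return 1 ;;
    esac
}

dr_service_reference connected
```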

How to Monitor Board Status

The cfgadm program can display the status of DR boards and slots.

When used without options, the cfgadm command displays a simple list of all known DR attachment points in the system. Here is a typical output:

Figure 2-1 Typical Display for the cfgadm Command


When used with the -v option, the cfgadm command displays a more detailed list:

Figure 2-2 Typical Display for the cfgadm Command with the -v Option


Here are some useful details of the display:

Figure 2-3 Details of the Display for cfgadm -v

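The simple list form of the display can be filtered with standard tools. The following sketch assumes the five-column list layout described in cfgadm(1M) (Ap_Id, Type, Receptacle, Occupant, Condition). The helper name dr_not_ok is hypothetical, and the rows fed to it are made-up sample data, not captured output from a real system.

```shell
#!/bin/sh
# Sketch: report attachment points whose Condition column is not "ok".
# Assumes the five-column cfgadm list layout; dr_not_ok is a hypothetical
# helper name.
dr_not_ok() {
    # Skip the header line, then print Ap_Id and Condition for every row
    # whose last field is something other than "ok".
    awk 'NR > 1 && NF >= 5 && $NF != "ok" { print $1, $NF }'
}

# On a live system the input would come from: cfgadm | dr_not_ok
# The rows below are hypothetical sample data.
dr_not_ok <<'EOF'
Ap_Id             Type      Receptacle    Occupant      Condition
sysctrl0:slot0    cpu/mem   connected     configured    ok
sysctrl0:slot1    sbus-upa  connected     configured    ok
sysctrl0:slot2    unknown   empty         unconfigured  unknown
EOF
```

With the sample rows, only the slot2 entry is reported, because it is the only row whose condition differs from ok.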

Hardware Support

The following table lists currently supported and unsupported boards.

Table 2-2 Supported and Unsupported Boards

Name                Supported?   Board Identification
CPU/memory          No
CPU/memory+         No
I/O type 1 (SBus)   Yes          3 SBus slots, 2 FC/OM fiber channel slots
I/O type 2          Yes          Graphics slot, 2 SBus slots, 2 FC/OM fiber channel slots
I/O type 3          No           2 PCI slots, 2 FC/OM fiber channel slots
I/O type 4          Yes          3 SBus slots, 2 GBIC (FC/AL) fiber channel slots
I/O type 5          Yes          Graphics slot, 2 SBus slots, 2 GBIC (FC/AL) fiber channel slots


Note -

Support for additional types of boards is being developed. Refer to the DR web site (see below) or the release notes supplement for Solaris(TM) 7 for any changes to this list.


http://sunsolve2.Sun.COM/sunsolve/Enterprise-dr/

Software Patches

For software patch requirements, refer to the release notes supplement for Solaris(TM) 7, or the DR web site at:

http://sunsolve2.Sun.COM/sunsolve/Enterprise-dr/

Definitions

Attachment Point

Attachment point: a collective term for a board and its card cage slot.

DR can display the status of the slot, the board, and the attachment point. For DR purposes, a board also includes the devices connected to it, so the DR term occupant is used to refer to the combination of board and attached devices.

There are two types of system names for attachment points:

Detachability

A board is not detachable if it has a critical resource (such as a boot drive) connected to it. Similarly, if a system has only one CPU board, the CPU board cannot be detached.

For a device to be detachable:

If there is no alternate pathway for an I/O board, you can:

Conditions and States

State: the operational status of either a receptacle (slot) or an occupant (board).

Condition: the operational status of an attachment point.

The cfgadm program can display 10 types of states and conditions. See Table 2-1.


Note -

For a receptacle procedure to be valid, the receptacle must transition in sequence through all three states (empty, disconnected, connected) or in the reverse sequence (connected, disconnected, empty).
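The sequencing rule in this note can be expressed as a check. The sketch below reads the rule as "each step moves to the neighbouring state", which is an interpretation of the note rather than documented cfgadm behavior. Both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: check that a list of receptacle states moves one step at a time
# through empty <-> disconnected <-> connected. Both helpers are
# hypothetical names.
receptacle_rank() {
    case "$1" in
        empty)        echo 0 ;;
        disconnected) echo 1 ;;
        connected)    echo 2 ;;
        *)            return 1 ;;
    esac
}

valid_receptacle_sequence() {
    prev=''
    for state in "$@"; do
        # Unknown state names make the whole sequence invalid.
        rank=$(receptacle_rank "$state") || return 1
        if [ -n "$prev" ]; then
            step=$((rank - prev))
            # Only a move to the neighbouring state is accepted.
            [ "$step" -eq 1 ] || [ "$step" -eq -1 ] || return 1
        fi
        prev=$rank
    done
    return 0
}

valid_receptacle_sequence empty disconnected connected && echo valid
valid_receptacle_sequence empty connected || echo 'invalid: skips disconnected'
```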


Connection and Configuration

There are four main types of DR operations:

Connection: in this operation, the slot provides power to the board and begins monitoring the board temperature.

Configuration: the operating system assigns functional roles to a board and loads device drivers for the board and for devices attached to the board.

Unconfiguration: the system detaches a board logically from the operating system and takes the associated device drivers offline. Environmental monitoring continues, but any devices on the board are not available for system use.

Disconnection: the system stops monitoring the board and power to the slot is turned off.

Before powering off and removing a system board that is in use, stop its use and unconfigure it. After a new or upgraded system board is inserted and powered on, connect its attachment point and configure it for use by the operating system.

cfgadm can connect and configure (or unconfigure and disconnect) in a single command, but if necessary, each operation (connection, configuration, unconfiguration, or disconnection) can be performed separately.
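When the operations are performed separately, their order matters: connect before configure, and unconfigure before disconnect. The following sketch only prints the commands in that order and runs nothing. The helper name dr_plan and the attachment point sysctrl0:slot3 are hypothetical; use the identifier shown in the cfgadm status display.

```shell
#!/bin/sh
# Sketch: print, without running, the separate cfgadm steps for bringing a
# board into use or taking it out of use. dr_plan is a hypothetical helper.
dr_plan() {
    action=$1
    ap_id=$2
    case "$action" in
        install)
            echo "cfgadm -c connect $ap_id"
            echo "cfgadm -c configure $ap_id" ;;
        remove)
            echo "cfgadm -c unconfigure $ap_id"
            echo "cfgadm -c disconnect $ap_id" ;;
        *)
            echo "usage: dr_plan install|remove ap_id" >&2
            return 1 ;;
    esac
}

# sysctrl0:slot3 is a made-up attachment point used for illustration.
dr_plan remove sysctrl0:slot3
```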

Hot-plug Hardware

Hot-plug: hot-plug boards and modules have special connectors which supply electrical power to the board or module before the data pins make contact. Boards and devices which do not have hot-plug connectors cannot be inserted or removed while the system is running.

I/O boards and CPU/memory boards used in Enterprise x000 and x500 systems are hot-plug devices. Some devices, such as the clock board and peripheral power supply (PPS), are not hot-plug modules and cannot be removed while the system is running.

Quiescence

Quiescence: during a DR unconfigure/disconnect operation on a system board with non-pageable Open Boot PROM (OBP) or kernel memory, the operating system is briefly paused, which is known as operating system quiescence. All operating system and device activity on the backplane must cease for a few seconds during a critical phase of the operation.

Before it can achieve quiescence, the operating system must temporarily suspend all processes, processors, and device activities. If the operating system cannot achieve quiescence, it displays the reasons, which may include the following:

The conditions that cause processes to fail to suspend are generally temporary. Examine the reasons for the failure. If the operating system encountered a transient condition--a failure to suspend a process--you can try the operation again.

Suspend-Safe and Suspend-Unsafe Devices

suspend-safe: a suspend-safe device is one that does not access memory or interrupt the system while the operating system is in quiescence. A driver is suspend-safe if it supports operating system quiescence (suspend/resume). It also guarantees that when a suspend request is successfully completed, the device that the driver manages will not attempt to access memory, even if the device is open when the suspend request is made.

suspend-unsafe: a suspend-unsafe device is one that allows a memory access or a system interruption while the operating system is in quiescence.

Suspend-safe drivers provide the ability to:

The operating system refuses a quiescence request if a suspend-unsafe device is open. To manually suspend the device, you may have to close the device by killing the processes that have it open, asking users not to use the device, or disconnecting the cables. For example, if a device that allows asynchronous unsolicited input is open, you can disconnect its cables prior to activating operating system quiescence and reconnect them after the operating system resumes. This action prevents traffic from arriving at the device and, thus, the device has no reason to access the backplane.

Testing for Suspend-Safe Drivers

The quiesce-test option tests for suspendable drivers.


# cfgadm -x quiesce-test sysctrl#:slot#

Tape Devices

The sequential nature of tape devices prevents them from being reliably suspended in the middle of an operation, and then resumed. Therefore, all tape drivers are suspend-unsafe. Before executing a DR operation that activates operating system quiescence, make sure all tape devices are closed or not in use.

Installation of a Board or Device

The installation of a new board involves the DR connection and configuration operations described below. If the board is intended to be a spare board, it must additionally be disabled now, then enabled when you later wish to use it.

For the board installation procedure, see "Installing a New Board".

To add a storage device to an existing board, see "Adding Storage Devices".

Board Connection

After physically inserting a board in the card cage, logically connect the board:


# cfgadm -c connect sysctrl#:slot#

sysctrl#:slot# is the logical attachment point identification (the system name for the board), which can be found in the cfgadm status display.

The states and conditions for the attachment point before a board is inserted are:

After a board is physically inserted, the states and conditions are:

After the attachment point is logically connected, the states and conditions are:

The system is now aware of the board, but not of the usable devices that reside on it. Temperature is monitored, and power and cooling affect the attachment point condition.

Board Configuration

To logically configure a board (add the board to the system configuration), enter:


# cfgadm -c configure sysctrl#:slot#

The states and conditions for a configured attachment point are:

The system is now also aware of the usable devices that reside on the board, and all devices may be mounted or configured for use.

If the configure operation fails for any reason, the states and conditions will still transition to configured. This creates a special situation where the board is partially configured. In this situation, only an unconfigure operation is allowed. A further attempt to configure the partial configuration is not permitted.

Disabling a Board

If a board is to be kept in the system for use as a spare board, enter this board in the disabled board list. This prevents the board from being used when the system is turned on or rebooted.

To disable a board, use the eeprom(1M) command:


# eeprom disabled-board-list=sysctrl#:slot#

Alternatively, you can use the DR command:


# cfgadm -c disconnect -o disable-at-boot sysctrl#:slot#

Note that disabled boards remain in the cfgadm status display even if a different board is subsequently placed in the same slot.

Enabling an Unconfigured Board

A running system may contain one or more unconfigured boards. That is, the boards are not being used by the system. These unconfigured boards may have been:

To enable a board, use the configure option described above.

Addition of Storage Devices

To add a storage device, see "Adding Storage Devices".

Removal of a Board

Removing a board requires preparing the devices attached to the board, then unconfiguring and disconnecting the board, as described below.

For the removal procedure, see "Removing a Board".

Preparing I/O and Network Devices

A board with vital system resources cannot be detached unless alternate resources are available on another board. A boot disk is an example of a vital system resource.

A board hosting non-vital system resources can be unconfigured whether or not there are alternate paths to the resources. All of its file systems must be unmounted and its swap partitions must be deleted. You may have to kill processes that have open files or devices, or place a hard lock on the file systems (using lockfs(1M)) before unmounting them. All I/O device drivers must be detachable.

The system swap space should be configured as multiple partitions on disks attached to controllers hosted by different boards. With this kind of configuration, a particular swap partition is not a vital resource because swap partitions can be added and deleted dynamically. See swap(1M) for more information.
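Before a board is unconfigured, the swap partitions on disks behind its controllers have to be deleted. The sketch below reads swap -l style output and prints the matching swap -d commands without running them. The helper name swap_delete_plan is hypothetical, the controller name is supplied by the caller, and the sample rows are made-up data.

```shell
#!/bin/sh
# Sketch: read "swap -l" style output on stdin and print the "swap -d"
# commands for swap areas on one disk controller. Nothing is executed.
# swap_delete_plan is a hypothetical helper name.
swap_delete_plan() {
    ctrl=$1
    while read swapfile rest; do
        case "$swapfile" in
            # The trailing "t" keeps c1 from also matching c10, c11, ...
            /dev/dsk/"$ctrl"t*) echo "swap -d $swapfile" ;;
        esac
    done
}

# On a live system the input would come from: swap -l | swap_delete_plan c1
# The rows below are hypothetical sample data.
swap_delete_plan c1 <<'EOF'
swapfile             dev  swaplo blocks   free
/dev/dsk/c0t0d0s1   32,1      16 524288 524288
/dev/dsk/c1t2d0s1   32,81     16 524288 524288
EOF
```

Printing the commands first leaves room to confirm that enough swap space remains on other boards before anything is deleted.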


Note -

When memory or disk swap space is detached, there must be enough memory or swap disk space remaining in the machine to accommodate currently running programs.


I/O Board Unconfiguration


Note -

The screen, mouse, and keyboard will not be operational while the system is suspended, but you will regain control of these devices after the suspension.


Preparation of an I/O Board for Removal

Before the unconfigure operation can be completed, you must manually terminate usage of all I/O devices on the board, including network interfaces.


Note -

To identify the components that are on the board to be unconfigured, use the ifconfig, mount, pf, or swap commands. The prtdiag(1M) command provides some information, but is less informative.


Termination of Network Devices

DR does not automatically terminate use of all network interfaces on the board that is being disconnected. You must manually terminate the use of each interface.
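One common way to take a Solaris network interface out of use is to mark it down and then unplumb it with ifconfig(1M). This guide does not prescribe those steps, so treat the following as a sketch under that assumption. It prints the commands rather than running them. The helper name net_down_plan and the interface names hme1 and qfe0 are hypothetical.

```shell
#!/bin/sh
# Sketch: print, without running, ifconfig steps that take each named
# interface out of use. net_down_plan is a hypothetical helper name.
net_down_plan() {
    for ifname in "$@"; do
        echo "ifconfig $ifname down"
        echo "ifconfig $ifname unplumb"
    done
}

# hme1 and qfe0 are made-up interface names used for illustration.
net_down_plan hme1 qfe0
```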

DR does not allow an unconfigure operation on any interface that fits the following conditions. In these cases, the unconfigure operation fails and DR displays an error message.

Replacement or Modification of a Board or Device

For the procedure to replace a board, see "Installing a Replacement Board".

For the procedure to add an interface to a board, see "Adding Storage Devices".

Replacement Sequence

When replacing other types of hardware at the same time that you add or replace a board in Enterprise x000 and x500 servers, replace the hardware in this order, as applicable, before adding or replacing a board:

  1. Clock board or clock+ board

  2. Peripheral power supply (PPS)--the PPS supplies hot-plug current

  3. Power and cooling module (PCM)--the PCM supplies cooling air

System Reconfiguration

This section describes how to reconfigure your system after you have configured or unconfigured a system board.

When to Reconfigure

You might need to reconfigure the system under several conditions, including:

I/O Device Reconfiguration

The DR reconfiguration sequence is the same as the Solaris reconfiguration boot sequence (boot -r):


drvconfig; devlinks; disks; ports; tapes;

When the reconfiguration sequence is executed after a board is configured, device path names not previously seen by the system are entered into the /etc/path_to_inst file. The same path names are also added to the /devices hierarchy and links to them are created in the /dev directory.

Disk Controller Renumbering during a Reconfiguration


Caution -

The disk controller number is part of the /dev link name used to access the disk. If that number changes during the reconfiguration sequence, the /dev link name also changes. This change may affect file system tables and software, such as Solstice(TM) DiskSuite(TM), which uses the /dev link names. Update /etc/vfstab files and execute other administrative actions necessary due to the changes in the /dev link names.
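To find the file system table entries that a controller number change would affect, the table can be searched for the controller's /dev link names. The following is a minimal sketch: the helper name vfstab_refs is hypothetical, and the table rows are made-up sample data rather than a real /etc/vfstab.

```shell
#!/bin/sh
# Sketch: read vfstab-format text on stdin and print the entries that refer
# to disks on one controller. vfstab_refs is a hypothetical helper name.
vfstab_refs() {
    # Drop comment lines, then match /dev/dsk/<ctrl>t... or
    # /dev/rdsk/<ctrl>t...; the trailing "t" keeps c2 from matching c20.
    grep -v '^#' | grep "/dev/r\{0,1\}dsk/${1}t"
}

# On a live system the input would come from: vfstab_refs c2 < /etc/vfstab
# The rows below are hypothetical sample data.
vfstab_refs c2 <<'EOF'
#device           device             mount  FS    fsck  mount    mount
#to mount         to fsck            point  type  pass  at boot  options
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 /      ufs   1     no       -
/dev/dsk/c2t1d0s6 /dev/rdsk/c2t1d0s6 /data  ufs   2     yes      -
EOF
```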


When the reconfiguration sequence is executed after a board is unconfigured or disconnected, the /dev links for all the disk partitions on that board are deleted. The remaining boards retain their current numbering. Disk controllers on a newly inserted board are assigned the lowest available number by disks(1M).

The disks(1M) utility creates symbolic links in the /dev/dsk and /dev/rdsk directories pointing to the actual special disk device files under the /devices directory tree. These entries take the form /dev/dsk/cXtXdXsX where:

When a board that contains one or more disk controllers is removed, the disks(1M) utility examines the entries in /dev/dsk and /dev/rdsk that list the disks attached to the removed controllers, and the references to the disconnected devices are removed from both directories. This removal makes the logical controller numbers available for reuse, which can cause confusion when unexpected controller numbers are assigned to disk controllers that are later added to the system.
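When checking which controller number a disk link carries, it can help to split a cXtXdXsX name into its parts. The sketch below relies on the usual Solaris reading of the fields (c for controller, t for target, d for disk, s for slice), which is stated here as an assumption. The helper name parse_dsk_name and the example path are hypothetical.

```shell
#!/bin/sh
# Sketch: split a cXtXdXsX disk link name into its numbered fields.
# parse_dsk_name is a hypothetical helper name. Names that do not match
# the pattern produce no output.
parse_dsk_name() {
    name=${1##*/}    # drop any leading /dev/dsk/ or /dev/rdsk/ path
    echo "$name" | sed -n \
        's/^c\([0-9][0-9]*\)t\([0-9][0-9]*\)d\([0-9][0-9]*\)s\([0-9][0-9]*\)$/controller=\1 target=\2 disk=\3 slice=\4/p'
}

# The path below is a made-up example.
parse_dsk_name /dev/dsk/c1t3d0s6
```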