This chapter describes how Dynamic Reconfiguration (DR) works and explains the terms used in DR.
Determine the name and status of the board or card cage slot. You will find it listed in the online DR status report. See "How to Monitor Board Status".
In the following table, find the entry corresponding to the condition of the board or device, then go to the procedure or reference listed in the Service Reference column.
Condition | Explanation | Service Reference |
---|---|---|
empty | No board is present in the slot. All LEDs are off. | To install a board, see "Installing a New Board". |
disconnected | A board is present but is electrically disconnected. The system is able to identify the board type. The board LEDs show that the board is in low power mode and can be unplugged at any time. LEDs: green, yellow, green (Off, On, Off). Use cfgadm -c disconnect to enable this state. | To remove a disconnected board, refer to the service manual for the system. To power up a disconnected board, see "Installing a New Board". |
connected | The board is electrically connected and powered up. The system is actively monitoring the board for temperature and cooling. LEDs: green, yellow, green (On, Off, Off). Use cfgadm -c connect to enable this state. | To remove a connected board, see "Removing a Board". To use a connected board, see "Installing a New Board". |
configured | Devices on the board are fully initialized and may be mounted or configured for use. The LEDs show the normal running pattern. LEDs: On, Off, Flash. Use cfgadm -c configure to enable this state. | To remove a configured board, see "Removing a Board". |
unconfigured | The unconfigured state covers all other device states, including receptacles in the empty state. The LED pattern is the same as for the connected receptacle state. LEDs: green, yellow, green (On, Off, Off). Use cfgadm -c unconfigure to enable this state. | To remove an unconfigured board, see "Removing a Board". To use an unconfigured board, see "Installing a New Board". |
unknown | The current condition cannot be determined. This situation results either when a new board is inserted in a running system, or a board is placed on the disabled board list prior to a reboot. A transition to a connected receptacle state will change an attachment point condition from unknown to either OK or Failed. | To use an unknown board, see "Installing a New Board". |
ok | No problems have been detected. This condition can only occur after a board has been connected. This condition will persist either until the board is physically removed, or a problem is detected. An ok condition requires correct hardware compatibility, correct firmware revision, adequate power, adequate cooling, and adequate precharge. | To remove an ok board, see "Removing a Board". |
failing | A failing condition can only occur when a board that was in the OK condition develops a problem. For example, the board has begun to overheat. This condition will be displayed until the problem is corrected or the attachment point is disconnected. | To remove a failing board, see "Removing a Board". To correct an overheating condition, see the system service manual. |
failed | The board has failed POST/OBP. A failed condition may occur either during bootup or after a failed connect attempt. This condition is considered uncorrectable and will persist until the board is physically removed. For a failed attachment point condition, the receptacle state should never transition beyond disconnected. | To remove a failed board, see "Removing a Board". |
unusable | Either an attachment point has incompatible hardware, or an empty attachment point lacks power, cooling, or precharge current. An unusable condition is correctable. This condition is caused by one of the following events: (1) inadequate cooling in a slot, (2) power is detected in an empty slot, (3) a disconnected board has inadequate cooling, inadequate power, or unsupported hardware, or (4) firmware has detected a problem either during bootup or when a board is inserted. | To remove a board from an unusable slot, see "Removing a Board". To correct overheating conditions in the slot, refer to the system service manual. |
The cfgadm program can display the status of DR boards and slots.
When used without options, the cfgadm command displays a simple list of all known DR attachment points in the system. Here is a typical output:
When used with the -v option, the cfgadm command displays a more detailed list:
Here are some useful details of the display:
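When the status report is read by a script rather than by eye, the condition column can be filtered directly. The following is a minimal sketch, assuming the listing has one header line followed by rows of five whitespace-separated columns (attachment point, type, receptacle state, occupant state, condition). The function name dr_filter_condition is hypothetical, and a POSIX shell such as ksh is assumed.

```shell
# Hypothetical helper: print the attachment points whose condition column
# matches the requested value. Reads a cfgadm-style listing on stdin.
# Assumption: one header line, then five whitespace-separated columns
# with the condition last. Adjust the variable list if your listing differs.
dr_filter_condition() {
    want=$1
    # Drop the header line, then compare the condition of each row.
    sed 1d | while read -r ap_id ap_type receptacle occupant cond; do
        if [ "$cond" = "$want" ]; then
            echo "$ap_id"
        fi
    done
}
```

On a DR-capable system the function would typically be fed from the command itself, for example `cfgadm | dr_filter_condition failing`.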
The following table lists currently supported and unsupported boards.
Table 2-2 Supported and Unsupported Boards
Name | Supported? | Board Identification |
---|---|---|
CPU/memory | No | |
CPU/memory+ | No | |
I/O type 1 (SBus) | Yes | 3 SBus slots, 2 FC/OM fiber channel slots |
I/O type 2 | Yes | Graphics slot, 2 SBus slots, 2 FC/OM fiber channel slots |
I/O type 3 | No | 2 PCI slots, 2 FC/OM fiber channel slots |
I/O type 4 | Yes | 3 SBus slots, 2 GBIC (FC/AL) fiber channel slots |
I/O type 5 | Yes | Graphics slot, 2 SBus slots, 2 GBIC (FC/AL) fiber channel slots |
Support for additional types of boards is being developed. Refer to the DR web site (see below) or the release notes supplement for Solaris(TM) 7 for any changes to this list.
http://sunsolve2.Sun.COM/sunsolve/Enterprise-dr/
For software patch requirements, refer to the release notes supplement for Solaris(TM) 7, or the DR web site at:
http://sunsolve2.Sun.COM/sunsolve/Enterprise-dr/
Attachment point: a collective term for a board and its card cage slot.
DR can display the status of the slot, the board, and the attachment point. For DR purposes, a board also includes the devices connected to it, so the DR term occupant is used to refer to the combination of board and attached devices.
A slot (also called a receptacle) may have the ability to electrically isolate the occupant from the host machine. That is, DR software can put a single slot into low-power mode.
Receptacles can be named according to slot numbers or can be anonymous (for example, a SCSI chain). To obtain a list of all available logical attachment points, use the -l option with the cfgadm command.
An occupant I/O board includes any external storage devices connected by interface cables.
There are two types of system names for attachment points:
A physical attachment point describes the software driver and location of the card cage slot. An example of a physical attachment point name is:
/devices/central@1f,0/fhc@0,f8800000/clock-board@0,900000:sysctrl,slot0
A logical attachment point is an abbreviated name created by the system to refer to the physical attachment point:
sysctrl0:slot0
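Scripts that work with logical attachment point names often need the two halves separately. The following is a small sketch, assuming names of the form shown above (a driver instance, a colon, then the word slot followed by a number). The helper names are hypothetical and a POSIX shell is assumed.

```shell
# Hypothetical helpers: split a logical attachment point name such as
# sysctrl0:slot0 into its parts.

# Text before the first colon (the driver instance).
ap_instance() {
    echo "${1%%:*}"
}

# Text after the first colon (the slot label).
ap_slot() {
    echo "${1#*:}"
}

# Only the number that follows the word "slot".
ap_slot_number() {
    echo "$1" | sed 's/^.*slot//'
}
```

For example, `ap_slot_number sysctrl0:slot12` prints 12.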
A board is not detachable if it has a critical resource (such as a boot drive) connected to it. Similarly, if a system has only one CPU board, the CPU board cannot be detached.
For a device to be detachable:
The device driver must support DDI_DETACH
Critical resources must be accessible through an alternate pathway
If there is no alternate pathway for an I/O board, you can:
Put the second disk chain on a separate I/O board. The secondary I/O board can be detached (with a loss of access to the secondary disk chain).
Add a second path to the device through a second I/O board. The I/O board can be detached (using Alternate Pathing software to switch access through the alternate board) without losing access to the secondary disk chain.
State: the operational status of either a receptacle (slot) or an occupant (board).
Condition: the operational status of an attachment point.
The cfgadm program can display 10 types of states and conditions. See Table 2-1.
For a receptacle procedure to be valid, the receptacle must transition in sequence through all three states (empty, disconnected, connected) or in the reverse sequence (connected, disconnected, empty).
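The sequencing rule can be expressed as a simple adjacency check. The following sketch only models the rule stated above and does not query the system. The function names are hypothetical.

```shell
# Hypothetical helper: map a receptacle state to its position in the
# sequence empty -> disconnected -> connected.
receptacle_rank() {
    case $1 in
        empty)        echo 0 ;;
        disconnected) echo 1 ;;
        connected)    echo 2 ;;
        *)            return 1 ;;   # not a receptacle state
    esac
}

# Succeeds only when the two states are neighbours in the sequence,
# in either direction. Skipping a state (empty -> connected) fails.
receptacle_step_ok() {
    from=$(receptacle_rank "$1") || return 1
    to=$(receptacle_rank "$2") || return 1
    diff=$(( $to - $from ))
    [ "$diff" -eq 1 ] || [ "$diff" -eq -1 ]
}
```

A call such as `receptacle_step_ok empty connected` returns a non-zero status because the disconnected state would be skipped.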
There are four main types of DR operations:
Connection: in this operation, the slot provides power to the board and begins monitoring the board temperature.
Configuration: the operating system assigns functional roles to a board and loads device drivers for the board and for devices attached to the board.
Unconfiguration: the system detaches a board logically from the operating system and takes the associated device drivers offline. Environmental monitoring continues, but any devices on the board are not available for system use.
Disconnection: the system stops monitoring the board and power to the slot is turned off.
If a system board is in use, before powering it off and removing it, stop its use and unconfigure it. After a new or upgraded system board is inserted and powered on, connect its attachment point and configure it for use by the operating system.
cfgadm can connect and configure (or unconfigure and disconnect) in a single command, but if necessary, each operation (connection, configuration, unconfiguration, or disconnection) can be performed separately.
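When the operations are performed separately, their order matters. The following sketch prints (but does not run) the individual commands for each direction, using only the -c functions named in this chapter. The function name dr_plan and its install and remove keywords are hypothetical.

```shell
# Hypothetical helper: print, without running them, the separate cfgadm
# commands for bringing a board into use or taking it out of use.
dr_plan() {
    action=$1
    ap_id=$2
    case $action in
        install)
            echo "cfgadm -c connect $ap_id"
            echo "cfgadm -c configure $ap_id"
            ;;
        remove)
            echo "cfgadm -c unconfigure $ap_id"
            echo "cfgadm -c disconnect $ap_id"
            ;;
        *)
            echo "usage: dr_plan install|remove ap_id" >&2
            return 2
            ;;
    esac
}
```

Printing the plan first gives the operator a chance to review it before entering the commands at the root prompt.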
Hot-plug: hot-plug boards and modules have special connectors which supply electrical power to the board or module before the data pins make contact. Boards and devices which do not have hot-plug connectors cannot be inserted or removed while the system is running.
I/O boards and CPU/memory boards used in Enterprise x000 and x500 systems are hot-plug devices. Some devices, such as the clock board and peripheral power supply (PPS), are not hot-plug modules and cannot be removed while the system is running.
Quiescence: during a DR unconfigure/disconnect operation on a system board with non-pageable Open Boot PROM (OBP) or kernel memory, the operating system is briefly paused, which is known as operating system quiescence. All operating system and device activity on the backplane must cease for a few seconds during a critical phase of the operation.
Before it can achieve quiescence, the operating system must temporarily suspend all processes, processors, and device activities. If the operating system cannot achieve quiescence, it displays the reasons, which may include the following:
A user thread did not suspend
Real-time processes are running
A device exists that cannot be paused by the operating system
The conditions that cause processes to fail to suspend are generally temporary. Examine the reasons for the failure. If the operating system encountered a transient condition--a failure to suspend a process--you can try the operation again.
suspend-safe: a suspend-safe device is one that does not access memory or interrupt the system while the operating system is in quiescence. A driver is suspend-safe if it supports operating system quiescence (suspend/resume). It also guarantees that when a suspend request is successfully completed, the device that the driver manages will not attempt to access memory, even if the device is open when the suspend request is made.
suspend-unsafe: a suspend-unsafe device is one that allows a memory access or a system interruption while the operating system is in quiescence.
Suspend-safe drivers provide the ability to:
Stop user threads
Execute the DDI_SUSPEND call in each device driver
Stop the clock
Stop the CPUs
The operating system refuses a quiescence request if a suspend-unsafe device is open. To manually suspend the device, you may have to close the device by killing the processes that have it open, asking users not to use the device, or disconnecting the cables. For example, if a device that allows asynchronous unsolicited input is open, you can disconnect its cables prior to activating operating system quiescence and reconnect them after the operating system resumes. This action prevents traffic from arriving at the device and, thus, the device has no reason to access the backplane.
The quiesce-test option tests for suspendable drivers.
# cfgadm -x quiesce-test sysctrl#:slot#
The sequential nature of tape devices prevents them from being reliably suspended in the middle of an operation, and then resumed. Therefore, all tape drivers are suspend-unsafe. Before executing a DR operation that activates operating system quiescence, make sure all tape devices are closed or not in use.
The installation of a new board involves the DR connection and configuration operations described below. If the board is intended to be a spare board, it must additionally be disabled now, then enabled when you later wish to use it.
For the board installation procedure, see "Installing a New Board".
To add a storage device to an existing board, see "Adding Storage Devices".
After physically inserting a board in the card cage, logically connect the board:
# cfgadm -c connect sysctrl#:slot#
sysctrl#:slot# is the logical attachment point identification (the system name for the board), which can be found in the cfgadm status display.
The states and conditions for the attachment point before a board is inserted are:
Receptacle state--Empty
Occupant state--Unconfigured
Condition--Unknown
After a board is physically inserted, the states and conditions are:
Receptacle state--Disconnected
Occupant state--Unconfigured
Condition--Unknown
After the attachment point is logically connected, the states and conditions are:
Receptacle state--Connected
Occupant state--Unconfigured
Condition--OK
Now the system is aware of the board, but not the usable devices which reside on the board. Temperature is monitored and power and cooling affect the attachment point condition.
To logically configure a board (add the board to the system configuration), enter:
# cfgadm -c configure sysctrl#:slot#
The states and conditions for a configured attachment point are:
Receptacle state--Connected
Occupant state--Configured
Condition--OK
Now the system is also aware of the usable devices which reside on the board and all devices may be mounted or configured to be used.
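The state and condition values listed for each installation stage can be collected into a lookup that a script compares against the status display. This is a sketch only: the stage names (before-insert, inserted, connected, configured) and function names are hypothetical labels, and the values are written in lowercase.

```shell
# Hypothetical helper: print the expected receptacle state, occupant
# state, and condition for each stage of a board installation.
dr_expected_states() {
    case $1 in
        before-insert) echo "empty unconfigured unknown" ;;
        inserted)      echo "disconnected unconfigured unknown" ;;
        connected)     echo "connected unconfigured ok" ;;
        configured)    echo "connected configured ok" ;;
        *)             return 1 ;;
    esac
}

# Succeeds when the reported values match the stage.
# Arguments: stage, receptacle state, occupant state, condition.
dr_row_matches() {
    expected=$(dr_expected_states "$1") || return 1
    [ "$expected" = "$2 $3 $4" ]
}
```

Because a failed configure operation can still report the configured state, a match at the configured stage does not prove that every device on the board initialized.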
If the configure operation fails for any reason, the states and conditions will still transition to configured. This creates a special situation where the board is partially configured. In this situation, only an unconfigure operation is allowed. A further attempt to configure the partial configuration is not permitted.
If a board is to be kept in the system for use as a spare board, enter this board in the disabled board list. This prevents the board from being used when the system is turned on or rebooted.
To disable a board, use the eeprom command:
# eeprom disabled-board-list=sysctrl#:slot#
Alternatively, you can use the DR command:
# cfgadm -c disconnect -o disable-at-boot sysctrl#:slot#
Note that disabled boards remain in the cfgadm status display even if a different board is subsequently placed in the same slot.
A running system may contain one or more unconfigured boards. That is, the boards are not being used by the system. These unconfigured boards may have been:
Hot-swapped into the system after the system was booted
Disabled by the EEPROM setting disabled-board-list
Previously unconfigured
To enable a board, use the configure option described above.
To add a storage device, see "Adding Storage Devices".
Removing a board requires that the devices attached to the board be prepared first, and that the board then be unconfigured and disconnected, as described below.
For the removal procedure, see "Removing a Board".
A board with vital system resources cannot be detached unless alternate resources are available on another board. A boot disk is an example of a vital system resource.
A board hosting non-vital system resources can be unconfigured whether or not there are alternate paths to the resources. All of its file systems must be unmounted and its swap partitions must be deleted. You may have to kill processes that have open files or devices, or place a hard lock on the file systems (using lockfs(1M)) before unmounting them. All I/O device drivers must be detachable.
The system swap space should be configured as multiple partitions on disks attached to controllers hosted by different boards. With this kind of configuration, a particular swap partition is not a vital resource because swap partitions can be added and deleted dynamically. See swap(1M) for more information.
When memory or disk swap space is detached, there must be enough memory or swap disk space remaining in the machine to accommodate currently running programs.
The screen, mouse, and keyboard will not be operational while the system is suspended, but you will regain control of these devices after the suspension.
Before the Unconfigure operation can be completed, you must manually terminate usage of all I/O devices on the board, including network interfaces.
To identify the components that are on the board to be unconfigured, use the ifconfig, mount, pf, or swap commands. The prtdiag(1M) command provides some information, but is less informative.
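As one illustration of this kind of check, the output of the mount command can be filtered for file systems that reside on a given disk controller. The sketch below assumes that you already know the controller number hosted by the board, and that mount prints lines in the Solaris form "mount point on device ...". The function name mounts_on_controller is hypothetical.

```shell
# Hypothetical helper: list the mount points whose device is on disk
# controller cN. Reads mount output on stdin.
# Assumption: each line has the form "<mount point> on <device> ...".
mounts_on_controller() {
    ctrl=$1
    while read -r mnt on dev rest; do
        case $dev in
            # The "t" after the number keeps c1 from matching c10.
            /dev/dsk/c"$ctrl"t*) echo "$mnt" ;;
        esac
    done
}
```

A typical use would be `mount | mounts_on_controller 1`, followed by unmounting each file system that is listed.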
DR does not automatically terminate use of all network interfaces on the board that is being disconnected. You must manually terminate the use of each interface.
DR does not allow an Unconfigure operation on any interface that fits the following conditions. In these cases, the Unconfigure operation fails and DR displays an error message.
The network interface is the primary network interface for the machine. That is, the IP address of the interface corresponds to the network interface name contained in the file /etc/nodename. Halting the primary network interface for the machine prevents network information name services from operating, which results in the inability to make network connections to remote hosts using applications such as ftp(1), rsh(1), rcp(1), rlogin(1). NFS client and server operations are also affected.
The network interface is the active alternate for an Alternate Pathing (AP) meta device when the AP meta device is plumbed. Interfaces used by the AP system should not be the active path when the board is being unconfigured. Manually switch the active path to one that is not on the board being unconfigured. If no such path exists, manually execute the ifconfig down and ifconfig unplumb commands on the AP interface. (To manually switch an active path, use the apconfig(1M) command.)
For the procedure to replace a board, see "Installing a Replacement Board"
For the procedure to add an interface to a board, see "Adding Storage Devices"
When replacing other types of hardware at the same time that you add or replace a board in Enterprise x000 and x500 servers, replace the hardware in this order, as applicable, before adding or replacing a board:
Clock board or clock+ board
Peripheral power supply (PPS)--the PPS supplies hot-plug current
Power and cooling module (PCM)--the PCM supplies cooling air
This section describes how to reconfigure your system after you have configured or unconfigured a system board.
You might need to reconfigure the system under several conditions, including:
Board addition: When adding a board, you must execute the reconfiguration sequence to configure the I/O devices associated with the board.
Board removal: If you remove a board that is not to be replaced, you may (but do not have to) execute the reconfiguration sequence to clean up the /dev links for disk devices.
Board replacement: If you remove a board and then insert it into a different slot, or replace a board with another board that has different I/O devices, you must execute the reconfiguration sequence to configure the I/O devices associated with the board. However, if you replace a board with another board that hosts the same set of I/O devices, inserting the replacement into the same slot, you do not need to execute the reconfiguration sequence. But be sure to insert a replacement into the same slot that was vacated to retain the original /dev link names.
The DR reconfiguration sequence is the same as the Solaris reconfiguration boot sequence (boot -r):
drvconfig; devlinks; disks; ports; tapes;
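The sequence can be wrapped in a small function so that it is always run in the same order. The following is a sketch with two assumptions of its own: it stops at the first command that fails (the plain semicolon form above continues regardless), and it honors a hypothetical DR_RUN variable that can be set to echo for a dry run.

```shell
# Hypothetical wrapper: run the reconfiguration sequence in order and
# stop at the first command that fails.
# Set DR_RUN=echo to print the command names instead of running them.
dr_reconfigure() {
    for cmd in drvconfig devlinks disks ports tapes; do
        ${DR_RUN:-} "$cmd" || return 1
    done
}
```

With DR_RUN unset, the function runs the five utilities themselves and must be invoked as root.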
When the reconfiguration sequence is executed after a board is configured, device path names not previously seen by the system are entered into the /etc/path_to_inst file. The same path names are also added to the /devices hierarchy and links to them are created in the /dev directory.
The disk controller number is part of the /dev link name used to access the disk. If that number changes during the reconfiguration sequence, the /dev link name also changes. This change may affect file system tables and software, such as Solstice(TM) DiskSuite(TM), which uses the /dev link names. Update /etc/vfstab files and execute other administrative actions necessary due to the changes in the /dev link names.
When the reconfiguration sequence is executed after a board is unconfigured or disconnected, the /dev links for all the disk partitions on that board are deleted. The remaining boards retain their current numbering. Disk controllers on a newly inserted board are assigned the lowest available number by disks(1M).
The disks(1M) utility creates symbolic links in the /dev/dsk and /dev/rdsk directories pointing to the actual special disk device files under the /devices directory tree. These entries take the form /dev/dsk/cXtXdXsX where:
cX is the disk controller number
tX corresponds to the disk target number, in most cases
dX refers to the logical unit number
sX is the partition number
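The naming scheme above can be taken apart mechanically, which is useful when checking whether a controller number changed after reconfiguration. The following sketch handles only the four-part cXtXdXsX form. The function name disk_fields is hypothetical.

```shell
# Hypothetical helper: split a /dev/dsk or /dev/rdsk link name into its
# controller, target, logical unit, and partition numbers.
# Prints nothing when the name does not have the cXtXdXsX form.
disk_fields() {
    name=${1##*/}   # drop the directory part
    echo "$name" | sed -n 's/^c\([0-9][0-9]*\)t\([0-9][0-9]*\)d\([0-9][0-9]*\)s\([0-9][0-9]*\)$/\1 \2 \3 \4/p'
}
```

For example, `disk_fields /dev/dsk/c1t3d0s7` prints the four numbers 1 3 0 7 separated by spaces.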
Removing boards that contain one or more disk controllers prompts the disks(1M) utility to examine entries in /dev/dsk and /dev/rdsk. These entries list the disks attached to the removed controller(s). When the disks(1M) utility finds references to disconnected devices, the references are removed from /dev/dsk and /dev/rdsk. This removal makes the logical controller numbers available for re-use, which can cause confusion when unexpected controller numbers are assigned to disk controllers that are later added to the system.
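To anticipate which number a newly added controller is likely to receive, the lowest-available rule can be modeled in a few lines. This is a sketch of the rule as described in this section, not of the internal logic of disks(1M). The function name next_controller is hypothetical.

```shell
# Hypothetical helper: given the controller numbers already in use
# (as arguments), print the lowest number that is free.
next_controller() {
    n=0
    while :; do
        in_use=no
        for used in "$@"; do
            if [ "$used" -eq "$n" ]; then
                in_use=yes
            fi
        done
        if [ "$in_use" = no ]; then
            echo "$n"
            return 0
        fi
        n=$(( $n + 1 ))
    done
}
```

If controllers 0, 1, and 3 are in use because the board that hosted controller 2 was removed, `next_controller 0 1 3` prints 2, so a new controller would take over the number that the removed controller used to have.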