CHAPTER 10

CPU/Memory Board Replacement and Dynamic Reconfiguration (DR)

This chapter describes how to dynamically reconfigure the CPU/Memory boards on Sun Fire entry-level midrange systems.


Dynamic Reconfiguration

Overview

DR software is part of the Solaris operating environment. With the DR software you can dynamically reconfigure system boards and safely remove them or install them into a system while the Solaris operating environment is running and with minimum disruption to user processes running on the system. You can use DR to do the following:

Command Line Interface

The Solaris cfgadm(1M) command provides the command line interface for the administration of DR functionality.

DR Concepts

Quiescence

During the unconfigure operation on a system board with permanent memory (OpenBoot PROM or kernel memory), the operating environment is briefly paused, which is known as operating environment quiescence. All operating environment and device activity on the baseplane must cease during a critical phase of the operation.



Note - Quiescence may take several minutes, depending on workload and system configuration.



Before it can achieve quiescence, the operating environment must temporarily suspend all processes, CPUs, and device activities. It may take a few minutes to achieve quiescence depending on system usage and activities currently in progress. If the operating environment cannot achieve quiescence, it displays the reasons, which may include the following:

The conditions that cause processes to fail to suspend are generally temporary. Examine the reasons for the failure. If the operating environment encountered a transient condition--a failure to suspend a process--you can try the operation again.
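Because these conditions are usually transient, a small retry wrapper can save retyping the command. The following is a minimal sketch, not part of the DR software. The function name retry_dr, the attempt count, and the delay are illustrative assumptions.

```shell
#!/bin/sh
# retry_dr ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS runs out.
retry_dr() {
    attempts=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $try of $attempts failed" >&2
        # Do not sleep after the final attempt.
        if [ "$try" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        try=$((try + 1))
    done
    return 1
}

# Example (commented out so that the sketch has no side effects):
# retry_dr 3 60 cfgadm -c unconfigure N0.SB2
```

Examine the reported reason before retrying. A retry only helps when the failure was transient.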

RPC or TCP Time-out or Loss of Connection

Time-outs occur by default after two minutes. Administrators may need to increase this time-out value to avoid time-outs during a DR-induced operating system quiescence, which may take longer than two minutes. Quiescing a system makes the system and related network services unavailable for a period of time that can exceed two minutes. These changes affect both the client and server machines.

Suspend-Safe and Suspend-Unsafe Devices

When DR suspends the operating environment, all of the device drivers that are attached to the operating environment must also be suspended. If a driver cannot be suspended (or subsequently resumed), the DR operation fails.

A suspend-safe device does not access memory or interrupt the system while the operating environment is in quiescence. A driver is suspend-safe if it supports operating environment quiescence (suspend/resume). A suspend-safe driver also guarantees that when a suspend request is successfully completed, the device that the driver manages will not attempt to access memory, even if the device is open when the suspend request is made.

A suspend-unsafe device allows a memory access or a system interruption to occur while the operating environment is in quiescence.

Attachment Points

An attachment point is a collective term for a board and its slot. DR can display the status of the slot, the board, and the attachment point. The DR definition of a board also includes the devices connected to it, so the term "occupant" refers to the combination of board and attached devices.

There are two formats used when referring to attachment points:

where N0 is node 0 (zero), SB is a system board, and x is a slot number. A slot number can be 0, 2, or 4 for a system board.
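As a sketch of how a script might check a board name before passing it to cfgadm, the following helpers validate the N0.SBx form and strip a component suffix such as ::cpu3. Both function names are hypothetical and are not part of Solaris.

```shell
#!/bin/sh
# is_system_board_ap_id AP_ID
# Returns 0 when AP_ID has the form N0.SBx with x equal to 0, 2, or 4.
is_system_board_ap_id() {
    case "$1" in
        N0.SB0|N0.SB2|N0.SB4) return 0 ;;
        *) return 1 ;;
    esac
}

# board_of_ap_id AP_ID
# Prints the board part of a component name, for example
# N0.SB2::cpu3 becomes N0.SB2. A plain board name is printed unchanged.
board_of_ap_id() {
    echo "${1%%::*}"
}
```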

DR Operations

There are four main types of DR operation.

TABLE 10-1 Types of DR Operation

Operation     Description
Connect       The slot provides power to the board and monitors its temperature.
Configure     The operating environment assigns functional roles to a board, loads device drivers for the board, and brings the devices on that board into use by the Solaris operating environment.
Unconfigure   The system detaches a board logically from the operating environment. Environmental monitoring continues, but devices on the board are not available for system use.
Disconnect    The system stops monitoring the board, and power to the slot is turned off.


If a system board is in use, stop its use and disconnect it from the system before you power it off. After a new or upgraded system board is inserted and powered on, connect its attachment point and configure it for use by the operating environment. The cfgadm(1M) command can connect and configure (or unconfigure and disconnect) in a single command, but if necessary, each operation (connection, configuration, unconfiguration, or disconnection) can be performed separately.
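When the operations are performed separately, the order matters. The sketch below only prints the commands in the order described above so that they can be reviewed first. It does not run them. The function name dr_steps is an illustrative assumption.

```shell
#!/bin/sh
# dr_steps add|remove AP_ID
# Hypothetical helper: prints the separate cfgadm steps in order.
dr_steps() {
    case "$1" in
        add)
            echo "cfgadm -c connect $2"
            echo "cfgadm -c configure $2"
            ;;
        remove)
            echo "cfgadm -c unconfigure $2"
            echo "cfgadm -c disconnect $2"
            ;;
        *)
            echo "usage: dr_steps add|remove ap_id" >&2
            return 2
            ;;
    esac
}

dr_steps remove N0.SB2
```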

Hot-Plug Hardware

Hot-plug devices have special connectors that supply electrical power to the board or module before the data pins make contact. Boards and devices that have hot-plug connectors can be inserted or removed while the system is running. The devices have control circuits to ensure they have a common reference and power control during the insertion process. The interfaces are not powered on until the board is fully seated and the System Controller instructs them to power on.

The CPU/Memory boards used in Sun Fire entry-level midrange systems are hot-plug devices.

Conditions and States

A state is the operational status of either a receptacle (slot) or an occupant (board). A condition is the operational status of an attachment point.

Before you perform any DR operation on a board or component, you must determine its state and condition. Use the cfgadm(1M) command with the -la options to display the type, state, and condition of each component and the state and condition of each board slot in the system. See the section Component Types for a list of the component types.
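A script can pick the state and condition of one attachment point out of the listing. The following sketch assumes the five-column layout (Ap_Id, Type, Receptacle, Occupant, Condition) that the non-verbose listing uses. The function name ap_status is hypothetical.

```shell
#!/bin/sh
# ap_status AP_ID
# Reads a non-verbose cfgadm listing on standard input and prints
# "receptacle occupant condition" for the matching attachment point.
# Returns 1 when the attachment point is not listed.
ap_status() {
    awk -v id="$1" '
        $1 == id { print $3, $4, $5; found = 1 }
        END { exit !found }'
}

# Example (commented out because it needs a live system):
# cfgadm -la | ap_status N0.SB0
```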

Board States and Conditions

This section contains descriptions of the states and conditions of CPU/Memory boards (also known as system slots).

Board Receptacle States

A board can have one of three receptacle states: empty, disconnected, or connected. Whenever you insert a board, the receptacle state changes from empty to disconnected. Whenever you remove a board the receptacle state changes from disconnected to empty.



Caution - Physically removing a board that is in the connected state, or that is powered on and in the disconnected state, crashes the operating system and can result in permanent damage to that system board.



TABLE 10-2 Board Receptacle States

Name           Description
empty          A board is not present.
disconnected   The board is disconnected from the system bus. A board can be in the disconnected state without being powered off. However, a board must be powered off and in the disconnected state before you remove it from the slot.
connected      The board is powered on and connected to the system bus. You can view the components on a board only after it is in the connected state.
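The removal rule above (disconnected and powered off) can be expressed as a simple check. This is a sketch under stated assumptions: the function name safe_to_remove is hypothetical, and the string powered-off is assumed by analogy with the powered-on text that cfgadm -av prints in its Information column. Always confirm with the board LEDs before removing hardware.

```shell
#!/bin/sh
# safe_to_remove RECEPTACLE INFORMATION
# RECEPTACLE is the receptacle state (empty, disconnected, connected).
# INFORMATION is the Information text, for example "powered-on, assigned".
# ASSUMPTION: a powered-off board reports "powered-off" in that text.
safe_to_remove() {
    [ "$1" = "disconnected" ] || return 1
    case "$2" in
        *powered-off*) return 0 ;;
        *) return 1 ;;
    esac
}

if safe_to_remove connected "powered-on, assigned"; then
    echo "board may be removed"
else
    echo "do not remove the board"
fi
```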


Board Occupant States

A board can have one of two occupant states: configured or unconfigured. The occupant state of a disconnected board is always unconfigured.

TABLE 10-3 Board Occupant States

Name           Description
configured     At least one component on the board is configured.
unconfigured   All of the components on the board are unconfigured.


Board Conditions

A board can be in one of four conditions: unknown, ok, failed, or unusable.

TABLE 10-4 Board Conditions

Name       Description
unknown    The board has not been tested.
ok         The board is operational.
failed     The board failed testing.
unusable   The board slot is unusable.


Component States and Conditions

This section contains descriptions of the states and conditions for components.

Component Receptacle States

A component cannot be individually connected or disconnected. Thus, components can have only one state: connected.

Component Occupant States

A component can have one of two occupant states: configured or unconfigured.

TABLE 10-5 Component Occupant States

Name           Description
configured     Component is available for use by the Solaris operating environment.
unconfigured   Component is not available for use by the Solaris operating environment.


Component Conditions

A component can have one of three conditions: unknown, ok, or failed.

TABLE 10-6 Component Conditions

Name      Description
unknown   Component has not been tested.
ok        Component is operational.
failed    Component failed testing.


Component Types

You can use DR to configure or to unconfigure several types of component.

TABLE 10-7 Component Types

Name     Description
cpu      Individual CPU
memory   All the memory on the board


Nonpermanent and Permanent Memory

Before you can delete a board, the operating environment must vacate the memory on that board. Vacating a board means flushing its nonpermanent memory to swap space and copying its permanent memory (that is, kernel and OpenBoot PROM memory) to another memory board. To relocate permanent memory, the operating environment must be temporarily suspended, or quiesced. The length of the suspension depends on the system configuration and the running workloads.

Detaching a board with permanent memory is the only time when the operating environment is suspended; therefore, you should know where permanent memory resides so that you can avoid significantly impacting the operation of the system. You can display the permanent memory by using the cfgadm(1M) command with the -v option.

When permanent memory is on the board, the operating environment must find another memory component of adequate size to receive the permanent memory. If that is not possible, the DR operation fails.

Limitations

Memory Interleaving

System boards cannot be dynamically reconfigured if system memory is interleaved across multiple CPU/Memory boards.

Reconfiguring Permanent Memory

When a CPU/Memory board containing non-relocatable (permanent) memory is dynamically reconfigured out of the system, a short pause in all domain activity is required, which may delay application response. Typically, this condition applies to one CPU/Memory board in the system. The memory on the board is identified by a non-zero permanent memory size in the status display produced by the cfgadm -av command.
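To find the board that holds permanent memory without reading the whole listing, a script can filter the verbose output. The sketch below assumes that the Information text for a memory component contains a phrase of the form "346176 KBytes permanent". That wording and the function name permanent_memory are assumptions, so check them against the output on your own system.

```shell
#!/bin/sh
# permanent_memory
# Reads verbose cfgadm output on standard input and prints the Ap_Id and
# the permanent size in KBytes for each line that reports a non-zero
# permanent memory size.
# ASSUMPTION: the size appears as "<number> KBytes permanent".
permanent_memory() {
    awk '{
        for (i = 1; i + 2 <= NF; i++)
            if ($(i + 1) == "KBytes" && $(i + 2) ~ /^permanent/ && $i + 0 > 0)
                print $1, $i
    }'
}

# Example (commented out because it needs a live system):
# cfgadm -av | permanent_memory
```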

DR supports reconfiguration of permanent memory from one system board to another only if one of the following conditions is met:

-OR-


Command Line Interface

The following procedures are discussed in this section:



Note - There is no need to enable dynamic reconfiguration explicitly. DR is enabled by default.



The cfgadm Command

The cfgadm(1M) command provides configuration administration operations on dynamically reconfigurable hardware resources. TABLE 10-8 lists the DR board states.

TABLE 10-8 DR Board States from the System Controller (SC)

Board States   Description
Available      The slot is not assigned.
Assigned       The board is assigned, but the hardware has not been configured to use it. The board may be reassigned by the chassis port or released.
Active         The board is being actively used. You cannot reassign an active board.


Displaying Basic Board Status

The cfgadm program displays information about boards and slots. Refer to the cfgadm(1M) man page for options to this command.

Many operations require that you specify the system board names. To obtain these system board names, type:

# cfgadm

When used without options, cfgadm displays information about all known attachment points, including board slots and SCSI buses. The following display shows a typical output.

CODE EXAMPLE 10-1 Output of the Basic cfgadm Command
# cfgadm
Ap_Id 	Type 	Receptacle 	Occupant 	Condition
N0.IB6 	PCI_I/O_Boa 	connected 	configured 	ok
N0.SB0 	CPU_Board 	connected 	configured 	unknown
N0.SB4 	unknown 	empty	unconfigured 	unknown
c0 	scsi-bus 	connected 	configured 	unknown
c1 	scsi-bus 	connected 	unconfigured 	unknown
c2 	scsi-bus 	connected 	unconfigured 	unknown
c3 	scsi-bus 	connected 	configured 	unknown

Displaying Detailed Board Status

For a more detailed status report, use the command cfgadm -av. The -a option lists attachment points and the -v option turns on expanded (verbose) descriptions.

CODE EXAMPLE 10-2 is a partial display produced by the cfgadm -av command. The output appears complicated because the lines wrap around in this display. (This status report is for the same system used in CODE EXAMPLE 10-1.) FIGURE 10-1 provides details of each display item.

CODE EXAMPLE 10-2 Output of the cfgadm -av Command
# cfgadm -av
Ap_Id Receptacle Occupant Condition Information
When Type Busy Phys_Id
N0.IB6 connected configured ok powered-on, assigned
Apr 3 18:04 PCI_I/O_Boa n /devices/ssm@0,0:N0.IB6
N0.IB6::pci0 connected configured ok device
/ssm@0,0/pci@19,70000
Apr 3 18:04 io n /devices/ssm@0,0:N0.IB6::pci0
N0.IB6::pci1 connected configured ok device
/ssm@0,0/pci@19,600000
Apr 3 18:04 io n /devices/ssm@0,0:N0.IB6::pci1
N0.IB6::pci2 connected configured ok device
/ssm@0,0/pci@18,700000
Apr 3 18:04 io n /devices/ssm@0,0:N0.IB6::pci2
N0.IB6::pci3 connected configured ok device
/ssm@0,0/pci@18,600000
Apr 3 18:04 io n /devices/ssm@0,0:N0.IB6::pci3
N0.SB0 connected configured unknown powered-on, assigned
Apr 3 18:04 CPU_Board n /devices/ssm@0,0:N0.SB0
N0.SB0::cpu0 connected configured ok cpuid 0, speed 750 MHz,
ecache 8 MBytes
Apr 3 18:04 cpu n /devices/ssm@0,0:N0.SB0::cpu0
N0.SB0::cpu1 connected configured ok cpuid 1, speed 750 MHz,
ecache 8 MBytes
Apr 3 18:04 cpu n /devices/ssm@0,0:N0.SB0::cpu1
N0.SB0::cpu2 connected configured ok cpuid 2, speed 750 MHz,
ecache 8 MBytes
Apr 3 18:04 cpu n /devices/ssm@0,0:N0.SB0::cpu2

FIGURE 10-1 shows details of the display in CODE EXAMPLE 10-2:

 FIGURE 10-1 Details of the Display for cfgadm -av


Command Options

The options to the cfgadm -c command are listed in TABLE 10-9.

TABLE 10-9 cfgadm -c Command Options

cfgadm -c Option   Function
connect            The slot provides power to the board and begins monitoring the board. The slot is assigned if it was not previously assigned.
disconnect         The system stops monitoring the board and power to the slot is turned off.
configure          The operating system assigns functional roles to a board and loads device drivers for the board and for the devices attached to the board.
unconfigure        The system detaches a board logically from the operating system and takes the associated device drivers offline. Environmental monitoring continues, but any devices on the board are not available for system use.


The options provided by the cfgadm -x command are listed in TABLE 10-10.

TABLE 10-10 cfgadm -x Command Options

cfgadm -x Option   Function
poweron            Powers on a CPU/Memory board.
poweroff           Powers off a CPU/Memory board.


The cfgadm_sbd man page provides additional information on the cfgadm -c and cfgadm -x options. The sbd library provides the functionality for hot-plugging system boards of the class sbd, through the cfgadm framework.

Testing Boards and Assemblies


To Test a CPU/Memory Board

Before you can test a CPU/Memory board, it must first be powered on and disconnected. If these conditions are not met, the board test fails.

You can use the Solaris cfgadm command to test CPU/Memory boards. As superuser, type:

# cfgadm -t ap-id

To change the level of diagnostics that cfgadm runs, supply a diagnostic level for the cfgadm command as follows:

# cfgadm -o platform=diag=<level> -t ap-id

where level is a diagnostic level, and ap-id is one of the following: N0.SB0, N0.SB2 or N0.SB4.

If you do not supply level, the diagnostic level is set to default. The diagnostic levels are:

TABLE 10-11 Diagnostic Levels

Diagnostic Level   Description
init               Only system board initialization code is run. No testing is done. This is a very fast pass through POST.
quick              All system board components are tested with few tests and test patterns.
default            All system board components are tested with all tests and test patterns, except for memory and Ecache modules. Note that max and default have the same definition.
max                All system board components are tested with all tests and test patterns, except for memory and Ecache modules. Note that max and default have the same definition.
mem1               Runs all tests at the default level, plus more exhaustive DRAM and SRAM test algorithms. For Memory and Ecache modules, all locations are tested with multiple patterns. More extensive, time-consuming algorithms are not run at this level.
mem2               The same as mem1, with the addition of a DRAM test that does explicit compare operations of the DRAM data.
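A mistyped diagnostic level is easy to catch before the command runs. The following sketch builds the test command shown earlier and rejects any level that is not one of the six listed levels. The function name board_test_command is hypothetical, and the function only prints the command.

```shell
#!/bin/sh
# board_test_command LEVEL AP_ID
# Prints the cfgadm board test command for a valid diagnostic level.
# Returns 1 for a level that is not in the list of diagnostic levels.
board_test_command() {
    case "$1" in
        init|quick|default|max|mem1|mem2) ;;
        *)
            echo "unknown diagnostic level: $1" >&2
            return 1
            ;;
    esac
    echo "cfgadm -o platform=diag=$1 -t $2"
}

board_test_command mem1 N0.SB2
```

Remember that the board must be powered on and disconnected before the printed command is run.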


Installing or Replacing CPU/Memory Boards



Caution - Physical board replacement should only be carried out by qualified service personnel.




To Install a New Board



Caution - For complete information about physically removing and replacing CPU/Memory boards, refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate. Failure to follow the stated procedures can result in damage to system boards and other components.





Note - When replacing boards, you sometimes need filler panels.



If you are unfamiliar with how to insert a board into the system, read the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate, before you begin this procedure.

1. Make sure you are properly grounded with a wrist strap.

2. After locating an empty slot, remove the system board filler panel from the slot.

3. Insert the board into the slot within one minute to prevent the system from overheating.

Refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate, for complete step-by-step board insertion procedures.

4. Power on, test, and configure the board using the cfgadm -c configure command:

# cfgadm -c configure ap_id

where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.
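After step 4, it is worth confirming that the board really reached the connected and configured states. The sketch below runs the configure command and then checks the listing for the same attachment point. The function name configure_board is an illustrative assumption, and the check relies on the five-column listing layout shown in CODE EXAMPLE 10-1.

```shell
#!/bin/sh
# configure_board AP_ID
# Hypothetical helper: configures the board, then verifies that cfgadm
# reports it as connected and configured. Returns non-zero otherwise.
configure_board() {
    cfgadm -c configure "$1" || return 1
    cfgadm -l "$1" | awk -v id="$1" '
        $1 == id && $3 == "connected" && $4 == "configured" { ok = 1 }
        END { exit !ok }'
}

# Example (commented out because it changes system state):
# configure_board N0.SB2 && echo "N0.SB2 is in use"
```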


To Hot-Swap a CPU/Memory Board



Caution - For complete information about physically removing and replacing boards, refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate. Failure to follow the stated procedures can result in damage to system boards and other components.



1. Make sure you are properly grounded using a wrist strap.

2. Power off the board with cfgadm.

# cfgadm -c disconnect ap_id

where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.

This command removes the resources from the Solaris operating environment and the OpenBoot PROM, and powers off the board.

3. Verify the state of the Power and Hotplug OK LEDs.

The green Power LED flashes briefly as the CPU/Memory board cools down. To safely remove the board from the system, the green Power LED must be off and the amber Hotplug OK LED must be on.

4. Complete the hardware removal and installation of the board.

For more information refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate.

5. After removing and installing the board, bring the board back into the Solaris operating environment with the Solaris dynamic reconfiguration cfgadm command.

# cfgadm -c configure ap_id

where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.

This command powers the board on, tests it, attaches the board, and brings all of its resources back to the Solaris operating environment.

6. Verify that the green Power LED is lit.


To Remove a CPU/Memory Board From the System



Note - Before you begin this procedure, make sure you have ready a system board filler panel to replace the system board you are going to remove. A system board filler panel is a metal board with slots that allow cooling air to circulate.



1. Detach and power off the board from the system by using the cfgadm -c disconnect command.

# cfgadm -c disconnect ap_id

where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.



Caution - For complete information about physically removing and replacing boards, refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate. Failure to follow the stated procedures can result in damage to system boards and other components.



2. Remove the board from the system.

Refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate, for complete step-by-step board removal procedures.

3. Insert a system board filler panel into the slot within one minute of removing the board to prevent system overheating.


To Disconnect a CPU/Memory Board Temporarily

You can use DR to power down the board and leave it in place. For example, you might want to do this if the board fails and a replacement board or a system board filler panel is not available.

Detach and power off the board using the cfgadm -c disconnect command.

# cfgadm -c disconnect ap_id

where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.


Troubleshooting

This section discusses common types of failure:

The following are examples of cfgadm diagnostic messages. (Syntax error messages are not included here.)

cfgadm: hardware component is busy, try again
cfgadm: operation: Data error: error_text
cfgadm: operation: Hardware specific failure: error_text
cfgadm: operation: Insufficient privileges
cfgadm: operation: Operation requires a service interruption
cfgadm: System is busy, try again
WARNING: Processor number number failed to offline.

See the following man pages for additional error message detail: cfgadm(1M), cfgadm_sbd(1M), and config_admin(3X).
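Two of the messages above end with "try again", which marks a condition that may clear by itself. A script that captures cfgadm error text can use that wording to decide whether a retry is reasonable. This is a sketch only, and the function name is_retryable_message is hypothetical.

```shell
#!/bin/sh
# is_retryable_message MESSAGE
# Returns 0 when the diagnostic message invites a retry, 1 otherwise.
is_retryable_message() {
    case "$1" in
        *"try again"*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_retryable_message "cfgadm: System is busy, try again"; then
    echo "retry later"
fi
```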

Unconfigure Operation Failure

An unconfigure operation for a CPU/Memory board can fail if the system is not in a correct state before you begin the operation.

CPU/Memory Board Unconfiguration Failures

Cannot Unconfigure a Board Whose Memory Is Interleaved Across Boards

If you try to unconfigure a system board whose memory is interleaved across system boards, the system displays an error message such as:

cfgadm: Hardware specific failure: unconfigure N0.SB2::memory: Memory is
interleaved across boards: /ssm@0,0/memory-controller@b,400000

Cannot Unconfigure a CPU to Which a Process is Bound

If you try to unconfigure a CPU to which a process is bound, the system displays an error message such as the following:

cfgadm: Hardware specific failure: unconfigure N0.SB2::cpu3: Failed to off-line:
/ssm@0,0/SUNW,UltraSPARC-III

Unbind the process from the CPU and retry the unconfigure operation.
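Finding the bound processes is usually the first step. The sketch below filters the output of the Solaris pbind -q command for one processor ID. It assumes that each line has the form "process id 1234: 3", which may differ between Solaris releases, so verify the format on your system first. The function name pids_bound_to is hypothetical.

```shell
#!/bin/sh
# pids_bound_to CPU_ID
# Reads pbind -q style output on standard input and prints the process
# IDs that are bound to CPU_ID, one per line.
# ASSUMPTION: lines look like "process id <pid>: <cpu>".
pids_bound_to() {
    awk -v cpu="$1" '
        $1 == "process" && $2 == "id" && $4 == cpu {
            sub(/:$/, "", $3)
            print $3
        }'
}

# Example (commented out because it needs a live system):
# pbind -q | pids_bound_to 3
# Each process listed can then be unbound with: pbind -u <pid>
```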

Cannot Unconfigure a CPU Before All Memory is Unconfigured

All memory on a system board must be unconfigured before you try to unconfigure a CPU. If you try to unconfigure a CPU before all memory on the board is unconfigured, the system displays an error message such as:

cfgadm: Hardware specific failure: unconfigure N0.SB2::cpu0: Can't unconfig cpu
if mem online: /ssm@0,0/memory-controller

Unconfigure all memory on the board and then unconfigure the CPU.
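The required order (memory first, then CPUs) can be sketched as a command list. The helper below only prints the commands for review. The function name unconfigure_order is an illustrative assumption, and the component names follow the N0.SB2::memory and N0.SB2::cpu0 style that appears in the error messages.

```shell
#!/bin/sh
# unconfigure_order BOARD CPU...
# Prints the unconfigure commands in the required order: the memory
# component first, then each named CPU component.
unconfigure_order() {
    board=$1
    shift
    echo "cfgadm -c unconfigure ${board}::memory"
    for cpu in "$@"; do
        echo "cfgadm -c unconfigure ${board}::${cpu}"
    done
}

unconfigure_order N0.SB2 cpu0 cpu1
```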

Unable to Unconfigure Memory on a Board With Permanent Memory

To unconfigure the memory on a board that has permanent memory, move the permanent memory pages to another board that has enough available memory to hold them. Such an additional board must be available before the unconfigure operation begins.

Memory Cannot Be Reconfigured

If the unconfigure operation fails with a message such as the following, the memory on the board could not be unconfigured:

cfgadm: Hardware specific failure: unconfigure N0.SB0: No available memory
target: /ssm@0,0/memory-controller@3,400000

Add enough memory to another board to hold the permanent memory pages, and then retry the unconfigure operation.

To confirm that a memory page cannot be moved, use the verbose option with the cfgadm command and look for the word permanent in the listing:

# cfgadm -av -s "select=type(memory)"

Not Enough Available Memory

If the unconfigure operation fails with the message below, there will not be enough available memory in the system if the board is removed:

cfgadm: Hardware specific failure: unconfigure N0.SB0: Insufficient memory

Reduce the memory load on the system and try again. If practical, install more memory in another board slot.

Memory Demand Increased

If the unconfigure operation fails with one of the following messages, the memory demand increased while the unconfigure operation was proceeding:

cfgadm: Hardware specific failure: unconfigure N0.SB0: Memory operation failed

cfgadm: Hardware specific failure: unconfigure N0.SB0: Memory operation refused

Reduce the memory load on the system and try again.

Unable to Unconfigure a CPU

CPU unconfiguration is part of the unconfiguration operation for a CPU/Memory board. If the operation fails to take the CPU offline, the following message is logged to the console:

WARNING: Processor number failed to offline.

This failure occurs if:

Unable to Disconnect a Board

It is possible to unconfigure a board and then discover that it cannot be disconnected. The cfgadm status display lists the board as not detachable. This problem occurs when the board is supplying an essential hardware service that cannot be relocated to an alternate board.

Configure Operation Failure

CPU/Memory Board Configuration Failure

Cannot Configure Either CPU0 or CPU1 While the Other Is Configured

Before you try to configure either CPU0 or CPU1, make sure that the other CPU is unconfigured. Once both CPU0 and CPU1 are unconfigured, it is then possible to configure both of them.

CPUs on a Board Must Be Configured Before Memory

Before configuring memory, all CPUs on the system board must be configured. If you try to configure memory while one or more CPUs are unconfigured, the system displays an error message such as:

cfgadm: Hardware specific failure: configure N0.SB2::memory: Can't
config memory if not all cpus are online: /ssm@0,0/memorycontroller
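A script can check this precondition before it configures memory. The sketch below reads a non-verbose component listing for the board and succeeds only if every CPU component is in the configured occupant state. The function name all_cpus_configured is hypothetical, and the column positions assume the five-column layout (Ap_Id, Type, Receptacle, Occupant, Condition).

```shell
#!/bin/sh
# all_cpus_configured BOARD
# Reads a cfgadm component listing on standard input. Returns 0 when at
# least one CPU of BOARD is listed and all listed CPUs are configured.
all_cpus_configured() {
    awk -v prefix="$1::cpu" '
        index($1, prefix) == 1 {
            seen = 1
            if ($4 != "configured") bad = 1
        }
        END { exit !(seen && !bad) }'
}

# Example (commented out because it needs a live system):
# cfgadm -a | all_cpus_configured N0.SB2 && cfgadm -c configure N0.SB2::memory
```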