APPENDIX B

Troubleshooting

This appendix discusses two common types of failure: unconfigure operation failures and configure operation failures.

The following are examples of cfgadm diagnostic messages. (Syntax error messages are not included here.)


cfgadm: Configuration administration not supported on this machine
cfgadm: hardware component is busy, try again
cfgadm: operation: configuration operation not supported on this machine
cfgadm: operation: Data error: error_text
cfgadm: operation: Hardware specific failure: error_text
cfgadm: operation: Insufficient privileges
cfgadm: operation: Operation requires a service interruption
cfgadm: System is busy, try again
WARNING: Processor number failed to offline. 

See the following man pages for additional error message detail: cfgadm(1M), cfgadm_sbd(1M), cfgadm_pci(1M), and config_admin(3CFGADM).


Unconfigure Operation Failure

An unconfigure operation for a system board or I/O board can fail if the system is not in a correct state when you begin the operation.

System Board Unconfiguration Failures

Cannot Unconfigure a Board Whose Memory Is Interleaved Across Boards

If you try to unconfigure a system board whose memory is interleaved across system boards, the system displays an error message such as:


cfgadm: Hardware specific failure: unconfigure N0.SB2::memory: Memory is
interleaved across boards: /ssm@0,0/memory-controller@b,400000 

Cannot Unconfigure a CPU to Which a Process Is Bound

If you try to unconfigure a CPU to which a process is bound, the system displays an error message such as the following:


cfgadm: Hardware specific failure: unconfigure N0.SB2::cpu3: Failed to off-line:
/ssm@0,0/SUNW,UltraSPARC-III 

Unbind the process from the CPU and retry the unconfigure operation.
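To find which processes to unbind, the output of pbind -q can be filtered by processor ID. The following is a minimal sketch, not a definitive procedure. It assumes that pbind -q prints one line per bound process in the form "process id 1234: 3" (the format can differ between Solaris releases), and the function name bound_pids is hypothetical. On Solaris systems where /usr/bin/awk does not accept -v, substitute nawk.

```shell
#!/bin/sh
# Sketch: print the PIDs bound to one processor.
# Assumption: each input line looks like "process id <pid>: <cpu>".
# bound_pids is a hypothetical helper, not part of Solaris.

bound_pids() {
    # $1 = processor ID to look for; input is read from stdin
    awk -v cpu="$1" '
        $1 == "process" && $2 == "id" && $4 == cpu {
            sub(/:$/, "", $3)   # drop the trailing colon from the PID field
            print $3
        }'
}
```

A possible use is `pbind -q | bound_pids 3`, followed by `pbind -u pid` for each PID reported, and then a retry of the unconfigure operation.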

Cannot Unconfigure a CPU Before All Memory Is Unconfigured (Midrange Only)

All memory on a system board must be unconfigured before you try to unconfigure a CPU. If you try to unconfigure a CPU before all memory on the board is unconfigured, the system displays an error message such as:


cfgadm: Hardware specific failure: unconfigure N0.SB2::cpu0: Can't unconfig cpu 
if mem online: /ssm@0,0/memory-controller 

Unconfigure all memory on the board and then unconfigure the CPU.
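The required ordering (memory first, then the CPU) can be sketched as a small script. This is an illustration under stated assumptions: the attachment point names N0.SB2::memory and N0.SB2::cpu0 are taken from the example message above, the helper names run and unconfigure_cpu_after_memory are hypothetical, and by default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: unconfigure a board's memory before one of its CPUs.
# By default (DRY_RUN=1) the commands are printed, not executed.

run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

unconfigure_cpu_after_memory() {
    board=$1   # for example N0.SB2
    cpu=$2     # for example cpu0
    # Stop if the memory cannot be unconfigured; the CPU step would fail anyway.
    run cfgadm -c unconfigure "${board}::memory" || return 1
    run cfgadm -c unconfigure "${board}::${cpu}"
}
```

Calling `unconfigure_cpu_after_memory N0.SB2 cpu0` prints the two cfgadm commands in order. Setting DRY_RUN=0 runs them, which should be done only after confirming the attachment point names with cfgadm -a.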

Unable to Unconfigure Memory on a Board With Permanent Memory

To unconfigure the memory on a board that has permanent memory, move the permanent memory pages to another board that has enough available memory to hold them. Such an additional board must be available before the unconfigure operation begins.

Memory Cannot Be Reconfigured

If the unconfigure operation fails with a message such as the following, the memory on the board could not be unconfigured:


cfgadm: Hardware specific failure: unconfigure N0.SB0: No available memory 
target: /ssm@0,0/memory-controller@3,400000 

Add enough memory to another board to hold the permanent memory pages, and then retry the unconfigure operation.

Confirm that the memory pages cannot be moved.

Look for the word "permanent" in the listing.


# cfgadm -av -s "select=type(memory)"
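Searching the listing for "permanent" can be automated with a short filter. The sketch below assumes that, in the verbose listing, the word "permanent" appears on the same line as the attachment point ID and that the ID is the first field on that line. The function name boards_with_permanent_memory is hypothetical.

```shell
#!/bin/sh
# Sketch: print the attachment points whose listing mentions permanent memory.
# Assumption: "permanent" is on the line that starts with the Ap_Id.

boards_with_permanent_memory() {
    # Input is the cfgadm listing, read from stdin.
    awk '/permanent/ { print $1 }'
}
```

A possible use is `cfgadm -av -s "select=type(memory)" | boards_with_permanent_memory`. An empty result suggests that no listed board reports permanent memory.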

Not Enough Available Memory

If the unconfigure operation fails with one of the following messages, removing the board would not leave enough available memory in the system.


cfgadm: Hardware specific failure: unconfigure N0.SB0: Insufficient memory

cfgadm: Hardware specific failure: unconfigure N0.SB0: Memory operation failed

 

Reduce the memory load on the system and try again; if practical, install more memory in another board slot.

Memory Demand Increased

If the unconfigure operation fails with the following message, memory demand increased while the unconfigure operation was in progress:


cfgadm: Hardware specific failure: unconfigure N0.SB0: Memory operation refused

Reduce the memory load on the system and try again.

Unable to Unconfigure a CPU

CPU unconfiguration is part of the unconfiguration operation for a system board. If the operation fails to take the CPU offline, the following message is logged to the console:


WARNING: Processor number failed to offline. 

This failure occurs if, for example, a process is bound to the CPU.

Unable to Disconnect a Board

It is possible to unconfigure a board and then discover that it cannot be disconnected. The cfgadm status display lists the board as not detachable. This problem occurs when the board is supplying an essential hardware service that cannot be relocated to an alternate board.

I/O Board Unconfiguration Failure

A device cannot be unconfigured or disconnected while it is in use. Many failures to unconfigure I/O boards occur because activity on the boards has not been stopped, or because an I/O device becomes active again after it has been stopped.

Device Busy

Disks attached to an I/O board must be idled before you attempt to unconfigure or disconnect that board. Any attempt to unconfigure or disconnect a board whose devices are still in use is rejected.

If an unconfigure operation fails because an I/O board has a busy or open device, the board is left only partially unconfigured. The operation stops at the busy device.

To regain access to the devices that were not unconfigured, the board must be completely unconfigured, then reconfigured.

If a device on the board is busy, the system logs a message such as the following after an attempt to unconfigure:


cfgadm: Hardware specific failure: unconfigure N0.IB6: Device busy: /ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@6,0

To continue, unmount the device and retry the unconfigure operation. The board must be in the unconfigured state before you try to reconfigure it.

Problems with I/O Devices

1. Use the fuser(1M) command to identify the processes that have these devices open.

2. Kill the vold daemon gracefully.


 # /etc/init.d/volmgt stop

3. Disconnect all SCSI controllers that are associated with the card you are trying to unconfigure.

To list all connected SCSI controllers, use the following command:


 # cfgadm -l -s "select=class(scsi)"

4. If the redundancy features of Solaris Volume Manager mirroring are used to access a device connected to the board, reconfigure these subsystems so that the device or network is accessible by way of controllers on other system boards.

5. Unmount file systems, including volume manager meta-devices that have a board-resident partition.


# umount /partition

6. Remove the volume manager database from board-resident partitions.

The location of the volume manager database is explicitly chosen by the user and can be changed.

7. Remove any private regions used by Solaris Volume Manager or Veritas Volume Manager.

Solaris Volume Manager by default uses a private region on each device that it controls, so such devices must be removed from Solaris Volume Manager control before they can be detached.

8. Remove disk partitions from the swap configuration.

9. Either kill any process that directly opens a device or raw partition, or direct it to close the open device on the board.



Note - Unmounting file systems might affect NFS client systems.
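Parts of the procedure above (identifying users of a file system, unmounting it, and removing a swap slice) can be sketched as a dry-run script. This is an illustration only: the helper names run and release_device are hypothetical, the mount point and swap slice are supplied by the caller, and the volume manager steps are not covered. By default the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch: release one board-resident file system and one swap slice.
# By default (DRY_RUN=1) the commands are printed, not executed.

run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

release_device() {
    mount_point=$1   # file system on a board-resident partition
    swap_slice=$2    # swap slice on a board-resident disk
    run fuser -c "$mount_point"             # report processes using the file system
    run umount "$mount_point" || return 1   # stop here if the unmount fails
    run swap -d "$swap_slice"               # remove the slice from the swap configuration
}
```

For example, `release_device /mnt/ib6 /dev/dsk/c1t6d0s1` (both names are made up for illustration) prints the three commands in order so that they can be reviewed before being run with DRY_RUN=0.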



RPC or TCP Time-out or Loss of Connection

Time-outs occur by default after two minutes. Administrators might need to increase this time-out value to avoid time-outs during a DR-induced operating system quiescence, which might take longer than two minutes. Quiescing a system makes the system and related network services unavailable for a period of time that can exceed two minutes. These changes affect both the client and server machines.


Configure Operation Failure

Memory Configuration Failure (Midrange Only)

All CPUs on the system board must be configured before you configure memory. If you try to configure memory while one or more CPUs are unconfigured, the system displays an error message such as:


cfgadm: Hardware specific failure: configure N0.SB2::memory: Can't config memory if not all cpus are online: /ssm@0,0/memory-controller

I/O Board Configuration Failure

A configure operation might fail because a device on the I/O board does not currently support hot-plugging. In that case, the board is left only partially configured, and the operation stops at the unsupported device. The board must be brought back to the unconfigured state before another configure attempt. The system logs a message such as:


cfgadm: Hardware specific failure: configure N0.IB6: Unsafe driver present: <device path>

To continue the configure operation, either remove the unsupported device driver or replace it with a new version of the driver that supports hot-plugging.
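The failure messages in this appendix and their remedies can be collected into a simple lookup. The following sketch matches on the message fragments quoted in this appendix. The function name cfgadm_hint is hypothetical, and actual message text can vary between releases.

```shell
#!/bin/sh
# Sketch: map a cfgadm failure message to the remedy given in this appendix.
# cfgadm_hint is a hypothetical helper; patterns are the fragments quoted above.

cfgadm_hint() {
    case $1 in
        *"Memory is interleaved"*)      echo "memory is interleaved across boards" ;;
        *"Failed to off-line"*)         echo "unbind the process from the CPU and retry" ;;
        *"Can't unconfig cpu"*)         echo "unconfigure all memory on the board, then the CPU" ;;
        *"No available memory target"*) echo "add memory to another board for the permanent pages" ;;
        *"Insufficient memory"*|*"Memory operation failed"*)
                                        echo "reduce the memory load or install more memory" ;;
        *"Memory operation refused"*)   echo "reduce the memory load and retry" ;;
        *"Device busy"*)                echo "unmount the device and retry" ;;
        *"Can't config memory"*)        echo "configure all CPUs on the board first" ;;
        *"Unsafe driver present"*)      echo "remove or replace the driver that lacks hot-plug support" ;;
        *)                              echo "see cfgadm(1M)" ;;
    esac
}
```

The function takes the full message as a single argument, for example `cfgadm_hint "$message"`, and prints one line of guidance.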