APPENDIX A

DR Error Messages

This appendix contains a list of error messages that you might see while performing DR operations. The list does not include Protocol Independent Module (PIM) layer errors, which are more generic than the error messages in the tables that follow.

All DR error messages are sent to one or both of the following locations:


Searching This Appendix

Before you use this appendix, be sure to read the following list of search tips.

Error-Type Links

The following are different types of errors:

SSP Errors

See the section SSP Error Messages.

Domain Errors

Use one of the following links to start a search of domain-related error messages:

DCS Error Messages

DR Driver Error Messages

Plugin Error Messages


SSP Error Messages

The following are SSP-related error messages:

TABLE 2-1 SSP-Related Error Messages

Error Message

Probable Cause

Suggested Action(s)

Domain domain_name has unknown DR model

SSP failed to determine which DR model is running on the domain. Possible causes for this error message are:

  1. The domain is down, hung, or too busy to respond to the request from SSP
  2. The SSP-to-domain link is down
  3. DCS is not running
  4. The DR driver is not loaded on the domain.

  1. Make sure the domain is up and running
  2. Check the link between SSP and the domain
  3. Retry at a later time when the domain is not as busy
  4. Make sure DCS is running on the domain
  5. Make sure the DR driver is properly loaded.

Board xx is in the intermediate state

A failed DR operation left the board in an intermediate state.

Re-run the command at a later time. If the error persists, check the domain message file for the root cause. Some DR operations are not allowed due to certain constraints. If the issue cannot be resolved, run the opposite DR command to restore the board to its original state.

Board xx is in intermediate attachment state

Another board xx in the target domain is in an intermediate attachment state and must be fully attached to the domain, or fully detached from the domain, before the current DR operation can proceed.

Use addboard(1M) to attach board xx to the target domain, or deleteboard(1M) to detach it from the target domain. Then re-run the current command.

Failed in complete attachment stage for board xx

The board could not be connected and configured into the domain.

Check the domain message file for detailed errors; the file may indicate the cause of the failure. Resolve the issue and re-run the command.

Board xx is not a member of a domain

The SSP software shows that the specified board does not belong to any domain.

Make sure the right board number is specified, and use domain_status(1M) to find out whether the board belongs to any domain.

Board xx is in intermediate detachment state

Another board xx in the target domain is in an intermediate detachment state and must be fully attached to the domain, or fully detached from the domain, before the current DR operation can proceed.

Use addboard(1M) to attach board xx to the target domain, or deleteboard(1M) to detach it from the target domain. Then re-run the current command.

Failed in complete detachment stage for board xx

The board could not be unconfigured and disconnected from the domain.

Check the domain message file for detailed errors; the file may indicate the cause of the failure. Resolve the issue and re-run the command. If the board cannot be detached due to certain constraints, run addboard(1M) to return the board to its original state.

Unable to connect to SNMP agent

The DR thread failed to establish a connection with snmpd, which may be down or too busy.

  • Make sure snmpd is up and running.
  • Make sure there are not too many jobs running at the same time that require the attention of snmpd.
  • Reboot the SSP machine if none of the previous actions succeeds.

RDR_ERROR

There was an error communicating with the dcs process.

Make sure the internal network for the domain is working properly. The SSP should be working and you should be able to ping the domain.

RDR_NET_ERR

There was an error setting up a socket connection with the dcs process.

Make sure the internal network for the domain is working properly. The SSP should be working and you should be able to ping the domain.

RDR_TIMEOUT

A poll() system call timed out, either because communication with the dcs process was lost or because the domain is busy.

Verify the network connection or retry the DR operation.

RDR_ABORTED

A poll(), read(), or write() system call was interrupted.

Retry the DR operation.

RDR_MSG_INVAL

A message being sent to, or received from, dcs is invalid.

No user intervention is needed or possible.

RDR_MEM_ALLO

Unable to allocate memory.

No user intervention is needed or possible.
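
The RDR_* rows above reduce to three kinds of response: check the network, retry, or do nothing. As a rough illustration only, the lookup below encodes that summary. The Action enum and both helper functions are hypothetical names introduced for this sketch and are not part of the SSP software; the code strings are copied from the table as printed.

```python
from enum import Enum


class Action(Enum):
    """Coarse summary of the 'Suggested Action' column (illustrative only)."""
    CHECK_NETWORK = "verify the internal network and the SSP-to-domain link"
    RETRY = "retry the DR operation"
    NONE = "no user intervention is needed or possible"


# Codes as printed in the table above; the grouping is this sketch's reading.
RDR_ACTIONS = {
    "RDR_ERROR": Action.CHECK_NETWORK,
    "RDR_NET_ERR": Action.CHECK_NETWORK,
    "RDR_TIMEOUT": Action.CHECK_NETWORK,  # the table also allows a plain retry
    "RDR_ABORTED": Action.RETRY,
    "RDR_MSG_INVAL": Action.NONE,
    "RDR_MEM_ALLO": Action.NONE,
}


def suggested_action(code):
    """Return the Action for an RDR code, or raise ValueError if unknown."""
    try:
        return RDR_ACTIONS[code.strip()]
    except KeyError:
        raise ValueError(f"unknown RDR code: {code!r}") from None


def user_can_act(code):
    """True when the table suggests something the operator can actually do."""
    return suggested_action(code) is not Action.NONE
```

A wrapper script could call user_can_act() to decide whether to prompt the operator or simply log the code.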



Domain Error Messages

DCS Error Messages

The following table contains DCS error messages that are sent to the console window, to the /var/adm/messages file, and to the $SSPLOGGER/domain_name/messages file.

TABLE 2-2 DCS Error Messages

Error Message

Probable Cause

Suggested Action

dcs: permission denied

ERROR: Only the superuser of the domain can run the DCS.

Check the inetd.conf file on the domain to ensure that the DCS is started with superuser ID.
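
The inetd.conf check can be sketched as a small parser. This is a minimal illustration under stated assumptions: it assumes the classic inetd.conf field order (service, socket type, protocol, wait flag, user, program, arguments) and assumes the DCS entry is listed under the service name sun-dr. The function names are hypothetical.

```python
def dcs_inetd_users(inetd_conf_text, service="sun-dr"):
    """Return the user field of every inetd.conf entry for the given service.

    Assumes the classic layout:
    service  socket-type  protocol  wait-flag  user  program  [args...]
    """
    users = []
    for raw in inetd_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 6 and fields[0] == service:
            users.append(fields[4])
    return users


def dcs_runs_as_root(inetd_conf_text, service="sun-dr"):
    """True only if at least one entry exists and all of them run as root."""
    users = dcs_inetd_users(inetd_conf_text, service)
    return bool(users) and all(user == "root" for user in users)
```

Feeding the function the contents of the domain's inetd.conf gives a quick yes/no answer; an empty result means no entry for the assumed service name was found at all.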

dcs: internal error: operation: error_description

ERROR: An internal error occurred within the DCS.

Use the error_description, which corresponds to the errno_value, to diagnose the error. The operation field refers to the function call that caused the error.
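
When scanning a log for this message, it can help to split the line into its operation and error_description fields. The helper below is a hedged sketch: parse_internal_error is a hypothetical name, and it assumes the fields are separated by a colon followed by a space, as the message template suggests.

```python
PREFIX = "dcs: internal error: "


def parse_internal_error(line):
    """Split a 'dcs: internal error' line into (operation, error_description).

    Returns None if the line does not contain the message. Any text before
    the prefix (for example a syslog timestamp) is ignored.
    """
    start = line.find(PREFIX)
    if start < 0:
        return None
    rest = line[start + len(PREFIX):]
    operation, sep, description = rest.partition(": ")
    if not sep:
        return None  # no description field present
    return operation.strip(), description.strip()
```

The returned operation names the failing function call, and the description is the text to match against errno documentation.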

dcs: unrecognized error reported

NOTICE: The DCS reported an unknown error condition.

Use the log file on the domain to help determine what caused the error.

dcs: network initialization failed

ERROR: The DCS failed to initialize the network connection used to accept DR requests from the SSP.

Retry the DR operation.

dcs: failed to acquire reserved port

ERROR: The DCS uses port 665, which is reserved through sun-dr. The error occurred because another process is using the port.

Determine if another process is still using the port. If so, kill the process, if possible, then retry the DR operation.
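
One way to see whether anything is accepting connections on the reserved port is a plain TCP connection attempt. The sketch below assumes the port is a TCP port and uses a hypothetical helper name. It only shows that something is listening; whether that listener is the DCS itself or another process still has to be determined with tools on the domain.

```python
import socket

DCS_PORT = 665  # port number taken from the table row above


def port_is_listening(host, port=DCS_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: nothing usable is listening.
        return False
```

Run from the SSP, port_is_listening("domain_name") gives a first indication before logging in to the domain to identify the owning process.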

dcs: connection attempt failed

ERROR: The DCS failed to establish a connection with the SSP.

Retry the DR operation.

dcs: unable to receive message

ERROR: The DCS failed to receive a message from the SSP.

Retry the DR operation.

dcs: unable to send message for operation_name operation

ERROR: The DCS failed to send a message to the SSP.

Retry the DR operation.

dcs: sun-dr service not found, using reserved port 665

ERROR: The DCS failed to find the sun-dr service in /etc/services.

None
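
The fallback described in this row (look up sun-dr, otherwise use port 665) can be sketched as follows. This is an illustration, not the DCS implementation: the function names are hypothetical, and the sketch assumes the standard /etc/services layout of a name, a port/protocol pair, and optional aliases, with the service registered for TCP.

```python
DEFAULT_DCS_PORT = 665  # reserved port named in the message text


def lookup_service_port(services_text, name, proto="tcp"):
    """Return the port for a service name or alias, or None if not listed."""
    for raw in services_text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        port, _, entry_proto = fields[1].partition("/")
        names = [fields[0]] + fields[2:]  # official name plus aliases
        if name in names and entry_proto == proto and port.isdigit():
            return int(port)
    return None


def dcs_port(services_text):
    """Port from the sun-dr entry if present, else the reserved default."""
    port = lookup_service_port(services_text, "sun-dr")
    return port if port is not None else DEFAULT_DCS_PORT
```

With the entry missing, dcs_port() still returns 665, which matches the behavior the message reports and explains why no action is needed.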

DCS NOTICE: client disconnected

NOTICE: The client unexpectedly disconnected.

None

dcs: unknown operation requested

ERROR: The SSP requested an operation that is not recognized by the DCS.

Retry the DR operation.

dcs: operation failed

ERROR: The current DCS operation failed to complete. The DR operation can still succeed, if the DCS failed only to send the results to the SSP.

Check the status of the operation manually. If the DR operation did not succeed, retry the operation.

dcs: invalid session establishment sequence

ERROR: The session establishment sequence (the initialization handshake) between the SSP and the DCS failed.

Retry the DR operation.

dcs: operation_name operation issued before session established

ERROR: A DR operation was requested before the session was established.

Retry the DR operation.

dcs: received an invalid message

ERROR: The DCS received unexpected information in the message.

Retry the DR operation.

dcs: confirm callback failed, aborting operation

ERROR: The DCS was unable to display the confirmation prompt to the user.

None

dcs: message callback failed, continuing

NOTICE: The DCS was unable to display a message to the user.

None

dcs: retry value invalid ( retry_value )

NOTICE: The value given for the retry_value was invalid, so the operation proceeded with the retry value set to zero.

None

dcs: timeout value invalid ( timeout_value )

NOTICE: The value given for the timeout_value was invalid, so the operation proceeded with the timeout value set to zero.

None
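
The two preceding rows describe the same pattern: an invalid numeric argument is replaced by zero, a notice is logged, and the operation continues. A minimal sketch of that pattern follows. What the DCS counts as invalid is not stated in this appendix, so the rule used here (not an integer, or negative) is an assumption, as are the function name and the exact spacing of the notice text.

```python
def sanitize_count(raw, name):
    """Return (value, notice) for a retry or timeout argument.

    Assumption for this sketch: a value is invalid if it is not an integer
    or is negative. Invalid input falls back to 0 and yields a notice string;
    valid input yields no notice.
    """
    try:
        value = int(str(raw).strip())
    except ValueError:
        value = -1
    if value < 0:
        return 0, f"dcs: {name} value invalid ({raw})"
    return value, None
```

For example, sanitize_count("abc", "timeout") yields zero plus a notice, so the caller proceeds rather than aborting.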

dcs: retrying operation, attempt attempt_number

INFO: The DCS is retrying the operation. The attempt_number field represents the current attempt.

None

dcs: failed to start a new session handler

ERROR: The DCS failed to start a concurrent session handler to process the incoming DR request.

Retry the DR operation.

dcs: abort attempt of session, session_id , unsuccessful

ERROR: The DCS failed to abort the session session_id.

Retry the abort request.

dcs: unsupported message protocol version: version_number

ERROR: The DCS does not support the reported protocol version, version_number.

Check the DR software on the domain and the SSP. Reinstall the proper version of the software on the domain if they are not compatible.

dcs: session aborted

INFO: The current DR operation was aborted by the user.

None

dcs: illegal option option , exiting

ERROR: The DCS was passed an illegal option name.

Check the inetd.conf file on the domain and remove the illegal option from the entries for the DCS.

dcs: illegal argument to option flag ( argument ), action

NOTICE: The option, option, was given the illegal argument, argument. The DCS will perform the action specified by action.

Check the inetd.conf file on the domain and fix the entries for the DCS.

dcs: resource info init error ( error_code )

ERROR: The DCS failed to initialize the module responsible for providing resource usage information.

Retry the operation.


DR Driver Error Messages

The following table contains DR driver error messages that are sent to the console window, to the /var/adm/messages file, and to the $SSPLOGGER/domain_name/messages file.

TABLE 2-3 DR Driver Error Messages

Error Message

Probable Cause

Suggested Action

ngdr: Internal error: dr.c line_number

An internal error has occurred in the DR driver.

Retry the operation that failed. If the error persists, exit and restart various DR software components, then retry the operation. If the problem still persists, reboot the domain. Check the console or the system logs for additional information.

ngdr: Insufficient memory: resource

The DR framework was unable to configure or unconfigure resources because a KPHYSM_ERESOURCE error or cpu_configure() / cpu_unconfigure() error with the ENOMEM errno occurred.

This condition might be transient. Retry the DR operation. If the error persists and if the operation that is failing is the unconfigure operation, then try configuring more memory into the domain from a different domain. If the error still persists, reboot the domain.

ngdr: Device busy: resource

cpu_configure() or cpu_unconfigure() returned an EBUSY errno, or an I/O device cannot be detached because it is busy. This error message is also returned if a CPU to be detached is online when dr_pre_detach_cpu is called. A CPU cannot be detached while a memory drain is in progress.

Use showdevices(1M) on the system controller to find out why the resource is busy. Or, on the domain, use fuser(1M), psrinfo(1M), prtdiag(1M), or similar tools to find out why the device is busy. Also check whether another memory drain is already in progress. Either reconfigure or shut down whatever is consuming the resource, or wait for the previous memory drain to complete, depending on the cause of the error. Then retry the DR operation.

ngdr: Operation already in progress: resource

cpu_configure() or cpu_unconfigure() returned an EALREADY errno.

Use showdevices(1M) on the system controller to examine the configuration of the specified resource. Or, on the domain, use cfgadm(1M), pbind(1M), psrinfo(1M), and similar commands to examine the configuration of the resource. Determine what operations are already in progress on this resource, and either wait for them to complete or cancel them. Then, retry the DR operation. The operation already in progress may have terminated in the meantime, so retrying the operation might succeed or might produce another error.

ngdr: I/O error: resource

An unexpected error code resulted from a call to kphysm_del_start. A more verbose cmn_err message is also printed.

Check the verbose error message from cmn_err in the system logs or on the console for a more specific condition and suggested action.

ngdr: Bad address: resource

kphysm_add_memory_dynamic returned KPHYSM_EFAULT.

Retry the DR operation. If this error persists, contact your Sun service representative.

ngdr: No device(s) on board: board_path

The board is connected or disconnected with no devices (I/O, memory, or CPU).

If devices were expected to be on the board, then disconnect the board and remove it from the system. The board's components should be reseated by a qualified technician.

ngdr: Invalid argument: attachment_point

DR was passed an invalid argument.

Retry the DR operation. If this error persists, contact your Sun service representative.

ngdr: Invalid state transition: attachment_point

A DR operation was sequenced out of order. This could be operator error if the cfgadm(1M) commands were issued out of order, or the DR driver could be in an inconsistent state because of an internal error condition.

Retry the DR operation. If this error persists, stop and restart (or unload and load) DR software components to recover from this error condition. If the error persists, reboot the domain.

ngdr: Device in fatal state

The device could not be suspended, or it refused to be suspended.

Retry the DR operation. If this error persists, the device could be in suspend-unsafe mode. Check the list of suspend-unsafe devices. If the device is unsafe, use showdevices(1M) or fuser(1M) to show whether the device is in use, and manually reconfigure the resource. Then, manually unload the driver or, if needed, disconnect the cables attached to the device. It should now be safe to retry the operation. Do not reconnect the cables to the device, reload its driver, or reconfigure its resources before the DR operation has completed successfully.

ngdr: Device failed to resume: path

A previously suspended device could not be resumed.

ngdr: Cannot stop user thread

DR could not stop one or more user threads while preparing a device to be suspended.

Retry the DR operation. If this error persists, examine the user threads that failed to suspend, and determine why they could not be suspended. You might have to kill the threads to enable the DR operation to proceed.

ngdr: Cannot stop kernel thread: name

DR could not stop a kernel thread.

Retry the DR operation. If this error persists, examine the kernel threads that failed to suspend, and determine why they could not be suspended. Kill the kernel threads, if possible, to enable the DR operation to proceed.

ngdr: Failed to off-line: cpu

A CPU could not be brought off-line, which prevents it from being unconfigured. The CPU might have one or more threads bound to it. An additional cmn_err message is logged if there are threads bound to the CPU. DR must be able to off-line and power off the CPUs before the board can be disconnected.

Check the console and system log messages to determine if threads are bound to the CPU. If they are, they can be manually unbound or rebound to CPUs on other boards in the domain. If threads are not bound to the CPU, use psrset(1M), pbind(1M), and psrinfo(1M) to determine what changes are required to enable DR to off-line the CPU. For example, you might have to add more CPUs to the domain from different boards, or bring other CPUs online. Finally, you might have to add more CPU boards to take over the CPU workload.
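
Part of the check above is confirming that other CPUs in the domain are on-line to take over the work. The sketch below parses psrinfo-style output for that purpose. It assumes the default output layout of one CPU per line, starting with the processor ID followed by a status word such as on-line or off-line; both function names are hypothetical, and the test for remaining on-line CPUs is a simplification, not the DR driver's own logic.

```python
def parse_psrinfo(output):
    """Map processor ID to status word from psrinfo-style text output."""
    cpus = {}
    for line in output.splitlines():
        fields = line.split()
        # Expected shape (assumed): "<id>  <status>  since <date> <time>"
        if len(fields) >= 2 and fields[0].isdigit():
            cpus[int(fields[0])] = fields[1]
    return cpus


def online_cpus_elsewhere(cpus, board_cpu_ids):
    """Return sorted IDs of on-line CPUs that are not on the given board."""
    board = set(board_cpu_ids)
    return sorted(
        cpu_id
        for cpu_id, status in cpus.items()
        if status == "on-line" and cpu_id not in board
    )
```

An empty result from online_cpus_elsewhere() indicates that no other CPU is available, which is the situation in which more CPUs have to be added or brought online first.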

ngdr: Failed to on-line: cpu

DR could not online a CPU on a newly-connected or previously unconfigured board.

ngdr: Failed to start CPU: cpu

DR could not start a CPU on a newly-connected or previously unconfigured board.

ngdr: Failed to stop CPU: cpu

DR could not power off a CPU on a board to be unconfigured. All of the CPUs on a board to be unconfigured must be taken offline and powered off before the operation can succeed.

ngdr: Kernel cage is disabled: resource

When the kernel cage is disabled, boards hosting permanent memory cannot be detached.

Enable the kernel cage in /etc/system and reboot the domain.
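
A quick way to confirm the setting before rebooting is to scan the file for the cage tunable. This sketch assumes the tunable is named kernel_cage_enable and is set with a line of the form "set kernel_cage_enable=1"; that name does not appear in this appendix and should be verified against the documentation for your Solaris release. The function name is hypothetical.

```python
import re

# Assumed tunable name; verify against your release's documentation.
CAGE_RE = re.compile(r"^\s*set\s+kernel_cage_enable\s*=\s*(\S+)")


def kernel_cage_enabled(system_text):
    """Return True or False from the last cage setting, or None if absent."""
    result = None
    for line in system_text.splitlines():
        if line.lstrip().startswith("*"):  # '*' begins a comment line
            continue
        match = CAGE_RE.match(line)
        if match:
            try:
                result = int(match.group(1), 0) != 0
            except ValueError:
                continue  # ignore values this sketch cannot interpret
    return result
```

A return value of None means the file has no active setting, in which case the line has to be added rather than edited.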

ngdr: No available memory target: resource

DR could not detach the board because it hosts permanent memory and there is no available target for the memory. Permanent memory must be moved to another memory component within the domain before the DR operation can succeed.

Configure an additional memory component that contains an adequate amount of memory to act as a target for this board. Then, retry the DR operation.

ngdr: VM viability test failed: resource

Translation of error code returned by kphysm_del_start .

Configure additional memory components into the domain to relieve memory resource pressure. Then, retry the DR operation.

ngdr: Memory operation refused: resource

Translation of error code returned by kphysm_del_start .

Configure additional memory components into the domain to relieve memory resource pressure. Then, retry the DR operation.

ngdr: Non-relocatable pages in span: resource

ngdr: Memory operation cancelled: resource

ngdr: Memory operation failed: resource

DR failed to attach the memory on a newly attached board.

ngdr: Can't unconfig cpu if mem online

DR cannot unconfigure a CPU if the memory on the board is online.

You must bring memory offline before you can unconfigure the board.

ngdrmach: Cannot read property value: Device Node node_address : property property_name

DR could not get the specified property of a particular device node.

ngdrmach: Cannot determine property length: board :: slot: property

DR could not get the length of the specified property for a particular device node.

ngdrmach: No CPU specified for connect: slot

A board connect operation requires that a CPU on the board being connected be specified as part of the DR request.

Retry the addboard operation.

ngdrmach: Cannot move SIGB assignment

No CPU could be found as a target for relocating the boot proc of the domain.

Make sure there are CPUs present and online on another board in the domain.

ngdrmach: Cannot disconnect CPU; SIGB is currently assigned: slot :: board

The disconnect operation is attempting to remove the board that has the boot proc.

Retry the deleteboard operation. It may be necessary to run addboard before retrying deleteboard.

ngdrmach: Device driver failure: path

Operations to online or offline a device failed.

Retry the operation.

ngdrmach: Must specify a CPU on the given board: cpu_id

A board connect operation requires that a CPU on the board being connected be specified as part of the DR request.

Retry the addboard operation.

ngdrmach: No such device: board :: slot

The specified device does not exist on the specified board.

ngdrmach: Memory configured with inter-board interleaving: board :: slot

Memory that is interleaved across boards cannot be unconfigured from the system.

Configure the system without memory interleaving.

ngdrmach: Invalid board number: board_number

An invalid board number was specified for the connect board operation.

Retry the deleteboard operation.

ngdrmach:: Cannot proceed; Board is configured or busy: component_name

DR cannot disconnect a board that is configured or busy.

Unconfigure the board, or wait for any previous DR operations on the board to complete. Then, retry the DR operation.

ngdrmach: Firmware probe failed: attachment_point

OBP failed to probe the board during board disconnect.

Retry the deleteboard operation. It may be necessary to run addboard before retrying deleteboard.

ngdrmach: Firmware deprobe failed: attachment_point

OBP failed to deprobe the board.

ngdrmach: Operation not supported

The operation you attempted is not supported.

None

ngdrmach: Unrecognized platform command: command / options

An unrecognized command was passed to DR.

Refer to the cfgadm_sbd(1M) man page to ensure that you use a valid argument. If you used a valid argument and this error persists, contact your Sun service representative.

ngdrmach: drmach parameter is not a valid ID

An invalid drmachid_t value was encountered.

ngdrmach: drmach parameter is inappropriate for operation

The wrong type of drmachid_t was passed to a function.

ngdrmach: Unexpected internal condition: drmach.c line_number

An internal drmach error occurred.

Use modunload(1M) and modload(1M) to unload and then reload the drmach driver. Then, retry the DR operation. If this error persists, you must reboot the domain.

ngdrmach: Firmware move_cpu0 failed: CPU cpu_id

OBP failed to move the boot proc during the unconfigure operation.

Retry the deleteboard operation. It may be necessary to run addboard before retrying deleteboard.


Plugin Error Messages

The following error messages are generated by the libcfgadm system board plugin. The messages are sent to the netcon(1M) console window, to the /var/adm/messages file, and to the $SSPLOGGER/domain_name/messages file.

TABLE 2-4 Plugin Error Messages

Error Message

Probable Cause

Suggested Action

Configuration operation cancelled: command ap_id

You did not confirm a configuration operation that requires confirmation.

See the cfgadm(1M) or cfgadm_sbd(1M) man page for more information about which configuration operations require confirmation.

Hardware specific failure: command ap_id : error : resource

A system error occurred during the execution of the command. The error message, error, can be a standard error, or it can be a more specific error message that is returned by the DR driver. (See the DR Driver error messages for more information.) The name of the resource, resource, that is causing the error (for example, a busy device) can also be returned by the DR driver.

For busy devices, identify and stop usage of the device. For other errors, refer to the driver's documentation for recovery options.

Library Error: command invalid: command

The specified command is invalid for system boards.

Refer to the cfgadm_sbd(1M) man page for a list of valid commands.

Library Error: command not supported: command ap_id

The command that was executed is not supported for the attachment point specified by ap_id. For example, the connect operation is not allowed for CPUs.

Refer to the cfgadm_sbd(1M) man page for a list of supported commands.

Library Error: command aborted: command

You aborted the command.

N/A

Library Error: option invalid: option

The specified option, option, is invalid.

Refer to the cfgadm_sbd(1M) man page for a list of the valid options.

Library Error: option requires value: option

The specified option, option, requires a value.

Refer to the cfgadm_sbd(1M) man page for a list of the option values.

Library Error: option requires no value: option

The specified option, option, does not require a value.

Refer to the cfgadm_sbd(1M) man page for a list of options that do not require values.

Library Error: option value invalid: option value

The specified value, value, for the option, option, is invalid.

Refer to the cfgadm_sbd(1M) man page for a list of valid option values.

Library Error: attachment point invalid: ap_id

The specified attachment point, ap_id, could not be parsed correctly. This error is rare and could indicate an internal error.

Refer to the cfgadm_sbd(1M) man page for a list of valid attachment points. If this error persists, contact your Sun service representative.

Library Error: component invalid: ap_id

The specified component, ap_id, is invalid.

Refer to the cfgadm_sbd(1M) man page for a list of valid dynamic attachment points.

Library Error: sequence invalid: command ( rstate ostate ) ap_id

The specified command, command, is invalid for the receptacle state, the occupant state, or both, of the specified attachment point. For example, trying to connect an empty slot results in an invalid sequence error.

Refer to the cfgadm_sbd(1M) man page for a list of valid operations.
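
The sequence check can be pictured as a lookup from a command to the receptacle and occupant states in which it is accepted. The table below is a deliberately simplified, hypothetical model for illustration; the real plugin's rules are documented in cfgadm_sbd(1M) and may allow transitions that this sketch rejects. The attachment point name in the usage note and the exact spacing of the message text are also assumptions.

```python
# Simplified model: (receptacle state, occupant state) pairs per command.
ALLOWED = {
    "connect": {("disconnected", "unconfigured")},
    "configure": {("connected", "unconfigured")},
    "unconfigure": {("connected", "configured")},
    "disconnect": {("connected", "unconfigured")},
}


def check_sequence(command, rstate, ostate, ap_id):
    """Return None if the command is allowed, else an error message string."""
    if command not in ALLOWED:
        raise ValueError(f"unknown command: {command!r}")
    if (rstate, ostate) in ALLOWED[command]:
        return None
    return (
        f"Library Error: sequence invalid: "
        f"{command} ({rstate} {ostate}) {ap_id}"
    )
```

Calling check_sequence("connect", "empty", "unconfigured", "SB3") reproduces the empty-slot example from the row above, while a configure request on a connected, unconfigured board passes.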

Library Error: offline ap_id ( path ): error

The Reconfiguration Coordination Manager (RCM) failed to take the resource, ap_id, offline. The error message, error, returned by the RCM indicates the reason for the failure. Usually, the reason is a busy device.

For busy devices, identify and remove the usage of the device.

Library Error: suspend ap_id ( path ): error

The Reconfiguration Coordination Manager (RCM) failed to suspend the resource, ap_id. The error message, error, returned by the RCM indicates the reason for the failure. Usually, the reason is a busy device.

For busy devices, identify and remove the usage of the device.

Library Error: not enough memory

The plugin operation failed due to a lack of memory.

Check the memory usage.

Library Error: change signal disposition failed

The plugin operation failed to set up signals before it started the DR operation.

None

Library Error: cannot get RCM handle

The Reconfiguration Coordination Manager (RCM) failed to initialize.

None

Library Error: cannot open library: error

The Reconfiguration Coordination Manager (RCM) library, library, was found, but an error occurred when it was opened. The error message, error, is returned by dlopen(3DL).

Check for proper installation of the RCM.

Library Error: cannot find symbol symbol in library

A required symbol, symbol, was not found in the Reconfiguration Coordination Manager (RCM) library, library.

Check for proper installation of the RCM.

Library Error: cannot stat library : error

The Reconfiguration Coordination Manager (RCM) library, library, exists, but the stat(2) function failed to get the file status. The error message, error, is returned by the Solaris operating environment.

None