Sun Enterprise 10000 Dynamic Reconfiguration User Guide

Introduction to the DR 2.0 Model

The DR 2.0 model is based on the use of the DR daemon, dr_daemon(1M), to control DR operations. This model includes the Automated DR (ADR) commands, such as addboard(1M), deleteboard(1M), and moveboard(1M). For information about using the ADR commands, see "DR 3.0 Procedures".

DR 2.0 still supports DR commands that are executed in the drshell(1M) and through the Hostview DR menu functions. For information about using DR 2.0, see "DR 2.0 Procedures".

DR 2.0 Operations

You can execute DR operations from the SSP through the Hostview GUI or the dr(1M) shell application (refer to the hostview(1M) and dr(1M) man pages for more information). DR supports the following operations:

While DR operations are being performed within a domain, the dr_daemon(1M) (refer to the Sun Enterprise 10000 Dynamic Reconfiguration Reference Manual) and the operating environment write messages regarding the status or exceptions of DR requests to the domain syslog message buffer (/var/adm/messages) and the SSP message files ($SSPOPT/adm/domainName/messages and $SSPOPT/adm/messages). In addition to the status and exception information displayed by Hostview and the dr(1M) shell application, the dr_daemon(1M) and operating environment messages are useful for determining the status of DR requests.
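The following is a minimal sketch of how you might pull recent DR-related entries out of one of these message files. The function name dr_messages is hypothetical, and the sketch assumes that the relevant entries contain the string dr_daemon; the actual message format may differ, so adjust the pattern to match what you see in your files.

```shell
# Sketch only: list DR-related lines from a syslog-style message file.
# Assumption: relevant entries contain the string "dr_daemon".
dr_messages() {
    # $1: path to a message file, for example /var/adm/messages
    # $2: optional number of most recent matches to show (default 20)
    grep 'dr_daemon' "$1" | tail -n "${2:-20}"
}

# Example call (commented out so the sketch has no side effects):
# dr_messages /var/adm/messages 10
```

The message file is passed as an argument so that the same function can be pointed at the domain file or at either of the SSP message files.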


Note -

Only one DR operation per platform can be active at any time. A partially completed DR operation must be finished before a subsequent DR operation is permitted in the same domain. A DR operation that is partially completed and then dismissed within one domain does not prevent a subsequent DR operation from being started in a different domain.


Memory

If you use memory interleaving between system boards, those system boards cannot be detached because DR does not yet support interboard interleaving. By default, hpost(1M) does not set up boards with interleaved memory. To check whether interboard interleaving has been enabled, look for the following line in the hpost(1M) file .postrc (see postrc(4)):


mem_board_interleave_ok

If mem_board_interleave_ok is present, you may not be able to detach a board that uses memory interleaving.
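The check described above can be sketched as a small shell function. The function name interleave_enabled is hypothetical, the location of the .postrc file is left to the caller, and the sketch assumes that a leading # marks a comment line in .postrc; refer to postrc(4) to confirm the file syntax.

```shell
# Sketch only: report whether a .postrc file enables interboard interleaving.
# Assumption: a line beginning with "#" is a comment, so a commented-out
# directive is not counted as present.
interleave_enabled() {
    # $1: path to the .postrc file to inspect
    # Succeeds (exit status 0) when the directive starts a line,
    # optionally preceded by whitespace.
    grep -q '^[[:space:]]*mem_board_interleave_ok' "$1"
}

# Example call (commented out; supply the path to your own .postrc):
# if interleave_enabled "$postrc_path"; then
#     echo "Interboard interleaving is enabled; detach may be refused."
# fi
```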

Pageable and Nonpageable Memory

Before you can detach a board, the operating system must vacate the memory on that board. Vacating a board means flushing its pageable memory to swap space and copying its nonpageable memory (that is, kernel and OBP memory) to another memory board. To relocate nonpageable memory, the operating environment on a domain must be temporarily suspended, or quiesced. The length of the suspension depends on the domain I/O configuration and the running workloads. Detaching a board with nonpageable memory is the only time when the operating environment is suspended; therefore, you should know where nonpageable memory resides, so you can avoid significantly impacting the operation of the domain. When permanent memory is on the board, the operating environment must find other memory to receive the copy.

You can use the dr(1M) command drshow(1M) to determine whether the memory on a board is pageable or nonpageable:


% dr
dr> drshow board_number mem

Similarly, you can determine whether the memory on a board is pageable by looking at the DR Memory Configuration window, which is available when you perform a detach operation within Hostview. The DR Memory Configuration window is described in the Sun Enterprise 10000 DR Configuration Guide in the Solaris 8, Update 6, Sun Hardware Answerbook Collection.

Target Memory Constraints

When permanent memory is detached, DR chooses a target memory area to receive a copy of the memory. The DR software automatically checks for total adherence. It does not allow the DR memory operation to continue if it cannot verify total adherence. A DR memory operation can be disallowed for the following reasons:

In Solaris 7 and later releases, if no target board is found, the detach operation is refused, and DR displays an error message. (See Appendix A, DR Error Messages, for more information about DR error messages.)

Correctable Memory Errors

Correctable memory errors indicate that the memory on a system board (that is, one or more of its Dual Inline Memory Modules (DIMMs), or portions of the hardware interconnect) may be faulty and need replacement. When the SSP detects correctable memory errors, it initiates a record-stop dump to save the diagnostic data, which can interfere with a DR detach operation. Therefore, Sun Microsystems suggests that when a record-stop occurs from a correctable memory error, you allow the record-stop dump to complete its process before you initiate a DR detach operation.

If the faulty component causes repeated reporting of correctable memory errors, the SSP performs multiple record-stop dumps. If this happens, you should temporarily disable the dump-detection mechanism on the SSP, allow the current dump to finish, then initiate the DR detach operation. After the detach operation finishes, you should re-enable the dump detection.

To Re-Enable Dump Detection
  1. Log in to the SSP as the user ssp.

  2. Disable record-stop dump detection:


    SSP% edd_cmd -x stop
    

    This command suspends all event detection on all of the domains.

  3. Monitor the in-progress record-stop dump:


    SSP% ps -ef | grep hpost
    

    In the grep(1) output, the -D option of hpost indicates that a record-stop dump is in progress. Wait until the dump finishes before you continue.

  4. Perform the DR detach operation.

  5. Enable event detection:


    SSP% edd_cmd -x start
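
The monitoring step in this procedure can be sketched as a pair of shell functions. The names dump_in_progress and wait_for_dump are hypothetical, and the sketch assumes that a record-stop dump appears in the ps(1) listing as an hpost process with -D among its arguments, as described in step 3. It is an illustration of the idea, not a supported tool.

```shell
# Sketch only: detect an in-progress record-stop dump from a process listing.
# Assumption: the dump shows up as an hpost process whose arguments include -D.
dump_in_progress() {
    # Reads a process listing on standard input.
    # Succeeds (exit status 0) if an hpost line carries the -D option;
    # lines for the grep process itself are discarded first.
    grep 'hpost' | grep -v 'grep' | grep -q -e ' -D'
}

# Poll until no record-stop dump remains in the listing.
wait_for_dump() {
    # $1: polling interval in seconds (default 10)
    while ps -ef | dump_in_progress; do
        sleep "${1:-10}"
    done
}

# Possible use between steps 2 and 4 (commented out):
# edd_cmd -x stop
# wait_for_dump 10
# ...perform the DR detach operation, then...
# edd_cmd -x start
```

Reading the listing from standard input keeps the detection logic separate from the ps(1) call, so you can try it on saved output before relying on it.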
    

DR 2.0 and IDNs

The IDN feature allows domains to communicate with each other over the interconnect by using standard TCP/IP protocols. To provide this capability, the IDN feature maintains detailed information about the hardware configuration and is dependent on the hardware configuration of the member domains.

The DR feature allows the user to reconfigure the hardware while the operating system is running. Thus, DR is required to make an IDN aware of the changes so that the IDN can maintain consistent, up-to-date information about the hardware.

DR meets this requirement by unlinking the domain from the IDN, reconfiguring the hardware, and relinking the domain to the IDN. DR determines whether the domain is a member of an IDN and, if it is, performs the unlinking and relinking during the complete attach or complete detach phase of the DR operation. No interaction is needed by the user. However, if a member domain is in an unknown state (that is, AWOL), the unlink operation will not succeed, especially if the domain is in a non-responsive state. If one or more domains are in an unknown state when you attempt to perform a DR operation, you must unlink all of the AWOL domains within the IDN in a single step (that is, use the domain_unlink(1M) command with all of the names of the AWOL domains).

During the period in which the domain is not linked to the IDN, no transmissions to or from the domain are allowed. However, the domain remains a member of the IDN as defined in the domain_config(4) file on the SSP, and the domain continues to be listed as a member of the IDN when you use the domain_status(1M) command.


Note -

Due to the interaction between the DR and IDN features, only one DR or IDN operation is allowed at any given time within a single Sun Enterprise 10000 system.


Certain conditions may require you to use the force option. In the context of a DR operation, you can use the DR force option, which is passed to the domain_unlink(1M) command. Use the force option with extreme care on a domain that is a member of an IDN. Refer to the Sun Enterprise 10000 InterDomain Networks User Guide for more information about the force option and its use.

RPC Time-Out or Loss of Connection

The dr_daemon(1M), which runs in each domain, communicates with Hostview and the dr(1M) shell application (both of which run on the SSP) by way of Remote Procedure Calls (RPCs).

For more information about RPC time-outs and loss of connection failures, refer to the Sun Enterprise 10000 DR Configuration Guide.