CHAPTER 1
System Configuration Parameters
Sun Enterprise 250 servers, like all UltraSPARC-based systems, are based on the high-speed Ultra Port Architecture (UPA) bus, a switched system bus that provides up to 32 port ID addresses (or slots) for high-speed motherboard devices like CPUs, I/O bridges, and frame buffers. The Sun Enterprise 250 server provides up to three active ports for the following subsystems.
The order of probing these three port IDs is not subject to user control; however, a list of ports can be excluded from probing via the upa-port-skip-list NVRAM variable. In the following example, the upa-port-skip-list variable is used to exclude CPU-1 from the UPA probe list.
This capability lets you exclude a given device from probing (and subsequent use) by the system without physically removing the plug-in card. This can be useful in helping to isolate a failing card in a system experiencing transient failures.
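The effect of a skip list on a fixed probe order can be modelled in a few lines of Python. This is only an illustrative sketch, not firmware code: the function name and the port IDs 0, 1, and 2 are hypothetical placeholders and do not represent the actual Sun Enterprise 250 port assignments.

```python
def ports_to_probe(probe_order, skip_list):
    """Return the ports that are still probed, keeping the fixed order.

    probe_order: port IDs in the order the firmware would probe them.
    skip_list:   port IDs to exclude, as a skip-list variable would hold.
    """
    skipped = set(skip_list)
    return [port for port in probe_order if port not in skipped]


# Hypothetical port IDs, for illustration only.
fixed_order = [0, 1, 2]
remaining = ports_to_probe(fixed_order, [1])
print(remaining)
```

The order of the remaining ports is unchanged; the skip list only removes entries, which matches the statement that probing order itself is not under user control.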
Of the Sun Enterprise 250 server's two PCI buses, Bus 0 (/pci@1f,4000 in the device tree) is unique in that it is the only PCI bus that contains motherboard (non-plug-in) devices such as the on-board SCSI controller. By definition, such devices cannot be unplugged and swapped to change the order in which they are probed. To control the probing order of these devices, the system provides the NVRAM variable pci0-probe-list. This variable controls both the probing order and exclusion of devices on PCI Bus 0. The values that you can specify in the pci0-probe-list are defined in the following table.
In the following example, the pci0-probe-list variable is used to define a probing order of 5-2-4, while excluding from the probe list the on-board SCSI controller for internal and external SCSI devices.
Note that the pci0-probe-list variable has no effect on probing of the top PCI slot (slot 3 on the system rear panel). However, another NVRAM variable, pci-slot-skip-list, is available for excluding any PCI slot from the PCI probe list. In the following example, the pci-slot-skip-list variable is used to exclude back panel slots 0 and 3 from the PCI probe list.
Note - The values in the pci-slot-skip-list correspond to the back panel numbering scheme of 0-3. If a PCI slot number appears in this list, it will be excluded from probing even if it appears in the pci0-probe-list variable.
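The precedence rule in the note above (the skip list wins over the probe list) can be sketched in Python as follows. The device-to-slot mapping passed in here is a hypothetical placeholder used only to make the example runnable; the real correspondence between pci0-probe-list values and back panel slots is the one defined in the table for pci0-probe-list.

```python
def pci0_devices_to_probe(probe_list, slot_skip_list, device_to_slot):
    """Apply a slot skip list on top of a Bus 0 probe list.

    probe_list:     Bus 0 device numbers, in the requested probing order.
    slot_skip_list: back panel slot numbers (0-3) to exclude.
    device_to_slot: maps a Bus 0 device number to its back panel slot;
                    devices with no entry (for example on-board devices)
                    are never removed by the slot skip list.
    """
    skipped_slots = set(slot_skip_list)
    result = []
    for device in probe_list:
        slot = device_to_slot.get(device)
        if slot is not None and slot in skipped_slots:
            continue  # the skip list overrides the probe list
        result.append(device)
    return result


# HYPOTHETICAL mapping, for illustration only.
example_mapping = {5: 0, 4: 1, 2: 2}
probed = pci0_devices_to_probe([5, 2, 4], [0, 3], example_mapping)
print(probed)
```

Under this assumed mapping, the device in slot 0 is dropped even though it is named in the probe list, while the relative order of the remaining devices is preserved.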
Environmental monitoring and control capabilities for Sun Enterprise 250 servers reside at both the operating system level and the OBP firmware level. This ensures that monitoring capabilities are operational even if the system has halted or is unable to boot. The way in which OBP monitors and reacts to environmental overtemperature conditions is controlled by the NVRAM variable env-monitor. The following table shows the various settings for this variable and the effect each setting has on OBP behavior. For additional information about the system's environmental monitoring capabilities, see "About Reliability, Availability, and Serviceability Features" in the Sun Enterprise 250 Server Owner's Guide.
The automatic system recovery (ASR) feature allows Sun Enterprise 250 servers to resume operation after experiencing certain hardware faults or failures. Power-on self-test (POST) and OpenBoot Diagnostics (OBDiag) can automatically detect failed hardware components, while an auto-configuring capability designed into the OBP firmware allows the system to deconfigure failed components and restore system operation. As long as the system is capable of operating without the failed component, the ASR features will enable the system to reboot automatically, without operator intervention. Such a "degraded boot" allows the system to continue operating while a service call is generated to replace the faulty part.
If a faulty component is detected during the power-on sequence, the component is deconfigured and, if the system remains capable of functioning without it, the boot sequence continues. In a running system, certain types of failures (such as a processor failure) can cause an automatic system reset. If this happens, the ASR functionality allows the system to reboot immediately, provided that the system can function without the failed component. This prevents a faulty hardware component from keeping the entire system down or causing the system to crash again.
To support a degraded boot capability, the OBP uses the IEEE 1275 Client Interface (via the device tree) to "mark" devices as either failed or disabled, by creating an appropriate "status" property in the corresponding device tree node. By convention, UNIX will not activate a driver for any subsystem so marked.
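The marking convention can be illustrated with a small Python model of a device tree. This is a sketch under stated assumptions: the node paths are invented, and the property values "failed" and "disabled" simply follow the wording of the text rather than the exact strings the firmware writes.

```python
# Status values that cause a node to be left without a driver.
# The strings follow the wording above and are an assumption.
UNUSABLE = {"failed", "disabled"}


def drivers_to_attach(device_tree):
    """Return the paths of nodes that are not marked failed or disabled.

    device_tree: dict mapping a node path to its dict of properties.
    A node with no "status" property is treated as usable.
    """
    return [
        path
        for path, properties in device_tree.items()
        if properties.get("status") not in UNUSABLE
    ]


# Hypothetical nodes, for illustration only.
tree = {
    "/example/device-a": {},
    "/example/device-b": {"status": "failed"},
    "/example/device-c": {"status": "disabled"},
}
usable = drivers_to_attach(tree)
print(usable)
```

The key point is that the marked node stays in the tree; it is the operating system that declines to activate a driver for it.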
Thus, as long as the failed component is electrically dormant (that is, it will not cause random bus errors or signal noise, etc.), the system can be rebooted automatically and resume operation while a service call is made.
In two special cases of deconfiguring a subsystem (CPUs and memory), the OBP actually takes action beyond just creating an appropriate "status" property in the device tree. At the first moments after reset, the OBP must initialize and functionally configure (or bypass) these functions in order for the rest of the system to work correctly. These actions are taken based on the status of two NVRAM configuration variables, post-status and asr-status, which hold the override information supplied either from POST or via a manual user override (see ASR User Override Capability).
If any CPU is marked as having failed POST, or if a user chooses to disable a CPU, then the OBP will set the Master Disable bit of the affected CPU, which essentially turns it off as an active UPA device until the next power-on system reset.
Detecting and isolating system memory problems is one of the more difficult diagnostic tasks. This problem is further complicated by the possibility of installing different-capacity DIMMs within the same memory bank. (Each memory bank must contain four DIMMs of the same capacity.) Given a failed memory component, the firmware will deconfigure the entire bank associated with the failure.
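The bank-level granularity of memory deconfiguration can be shown with a short Python sketch. The four-DIMMs-per-bank figure comes from the text; the zero-based, consecutive DIMM numbering is an assumption made only for this example and does not reflect the system's actual DIMM labels.

```python
DIMMS_PER_BANK = 4  # each memory bank holds four DIMMs


def bank_of(dimm_index):
    """Return the bank number for a DIMM, assuming consecutive numbering."""
    return dimm_index // DIMMS_PER_BANK


def dimms_to_deconfigure(failed_dimm_index):
    """Return every DIMM in the bank that contains the failed DIMM."""
    first = bank_of(failed_dimm_index) * DIMMS_PER_BANK
    return list(range(first, first + DIMMS_PER_BANK))


# With this hypothetical numbering, a fault in DIMM 5 removes DIMMs 4-7.
affected = dimms_to_deconfigure(5)
print(affected)
```

A single faulty DIMM therefore costs the capacity of its whole bank until the part is replaced.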
While the default settings will properly configure or deconfigure the server in most cases, it is useful to provide advanced users with a manual override capability. Because of the nature of "soft" versus "hard" deconfiguration, it is necessary to provide two related but different override mechanisms.
For any subsystem represented by a distinct device tree node, users may disable that function via the NVRAM variable asr-disable-list, which is simply a list of device tree paths separated by spaces.
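Because the variable is a space-separated list of paths, its value is easy to interpret. The Python sketch below shows one way to do so; the function names and the example paths are hypothetical and serve only to illustrate the format.

```python
def parse_asr_disable_list(value):
    """Split a space-separated list of device tree paths into a list."""
    return value.split()


def is_user_disabled(path, value):
    """Return True if the given device tree path appears in the list."""
    return path in parse_asr_disable_list(value)


# Hypothetical value, for illustration only.
example_value = "/example/device-a /example/device-b"
paths = parse_asr_disable_list(example_value)
print(paths)
print(is_user_disabled("/example/device-b", example_value))
```

An empty value yields an empty list, meaning that no device is disabled by the user.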
If a system fails power-on diagnostics, then auto-boot? is ignored and the system does not boot unless the user does it manually. This behavior is obviously not acceptable for a degraded boot scenario, so the Sun Enterprise 250 OBP provides a second NVRAM-controlled switch called auto-boot-on-error?. This switch controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? switches must be set to true to enable a degraded boot.
Note - The default setting for auto-boot-on-error? is false. Therefore, the system will not attempt a degraded boot unless you change this setting to true. In addition, the system will not attempt a degraded boot in response to any fatal unrecoverable error, even if degraded booting is enabled. An example of a fatal unrecoverable error is when both of the system's CPUs have been disabled, either by failing POST or as a result of a manual user override.
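The interaction between the two switches and the fatal-error rule can be summarized as a small decision function. This Python sketch restates the rules given above; the function name, parameter names, and result strings are hypothetical and do not correspond to anything in the firmware.

```python
def boot_decision(auto_boot, auto_boot_on_error,
                  diagnostics_failed, fatal_error=False):
    """Model the boot outcome after power-on diagnostics.

    auto_boot:           setting of auto-boot?
    auto_boot_on_error:  setting of auto-boot-on-error? (default false)
    diagnostics_failed:  True if a subsystem failure was detected
    fatal_error:         True for a fatal unrecoverable error, such as
                         both CPUs being disabled
    """
    if not auto_boot:
        return "wait-for-user"
    if not diagnostics_failed:
        return "normal-boot"
    if fatal_error or not auto_boot_on_error:
        return "wait-for-user"
    return "degraded-boot"


print(boot_decision(True, False, diagnostics_failed=True))
print(boot_decision(True, True, diagnostics_failed=True))
print(boot_decision(True, True, diagnostics_failed=True, fatal_error=True))
```

Only the combination of both switches set to true, a detected failure, and no fatal error produces a degraded boot.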
To support ASR in Sun Enterprise 250 servers, it is desirable to be able to run firmware diagnostics (POST/OBDiag) on any or all reset events. Rather than simply changing the default setting of diag-switch? to true, which carries with it other side effects (see the OpenBoot 3.x Command Reference Manual), the Sun Enterprise 250 OBP provides a new NVRAM variable called diag-trigger that lets you choose which reset events, if any, will automatically engage POST/OBDiag. The diag-trigger variable and its settings are described in the following table.
Disables the automatic triggering of diagnostics by any reset event. Users can still invoke diagnostics manually by holding down the Stop and d keys when powering on the system, or by turning the front panel keyswitch to the Diagnostics position when powering on the system.