CHAPTER 6
Diagnostic Tools
The Sun Fire V490 server and its accompanying software contain many tools and features that help you:
This chapter introduces the tools that let you accomplish these goals, and helps you to understand how the various tools fit together.
Topics in this chapter include:
If you only want instructions for using diagnostic tools, skip this chapter and turn to Part Three of this manual. There, you can find chapters that tell you how to isolate failed parts (Chapter 10), monitor the system (Chapter 11), and exercise the system (Chapter 12).
Sun provides a wide spectrum of diagnostic tools for use with the Sun Fire V490 server. These tools range from the formal, like Sun's comprehensive Validation Test Suite (SunVTS), to the informal, like log files that may contain clues helpful in narrowing down the possible sources of a problem.
The diagnostic tool spectrum also ranges from standalone software packages, to firmware-based power-on self-tests (POST), to hardware LEDs that tell you when the power supplies are operating.
Some diagnostic tools enable you to examine many computers from a single console; others do not. Some diagnostic tools stress the system by running tests in parallel, while other tools run sequential tests, enabling the machine to continue its normal functions. Some diagnostic tools function even when power is absent or the machine is out of commission, while others require the operating system to be up and running.
The full palette of tools discussed in this manual is summarized in TABLE 6-1.
Why are there so many different diagnostic tools?
There are a number of reasons for the lack of a single all-in-one diagnostic test, starting with the complexity of the server systems.
Consider the data bus built into every Sun Fire V490 server. This bus features a five-way switch called a CDX that interconnects all processors and high-speed I/O interfaces (refer to FIGURE 6-1). This data switch enables multiple simultaneous transfers over its private data paths. This sophisticated high-speed interconnect represents just one facet of the Sun Fire V490 server's advanced architecture.
Consider also that some diagnostics must function even when the system fails to start. Any diagnostic capable of isolating problems when the system fails to start up must be independent of the operating system. But any diagnostic that is independent of the operating system will also be unable to make use of the operating system's considerable resources for getting at the more complex causes of failures.
Another complicating factor is that different installations have different diagnostic requirements. You may be administering a single computer or a whole data center full of equipment racks. Alternatively, your systems may be deployed remotely, perhaps in areas that are physically inaccessible.
Finally, consider the different tasks you expect to perform with your diagnostic tools:
Not every diagnostic tool can be optimized for all these varied tasks.
Instead of one unified diagnostic tool, Sun provides a palette of tools each of which has its own specific strengths and applications. To appreciate how each tool fits into the larger picture, it is necessary to have some understanding of what happens when the server starts up, during the so-called boot process.
You have probably had the experience of powering on a Sun system and watching as it goes through its boot process. Perhaps you have watched as your console displays messages that look like the following:
It turns out these messages are not quite so inscrutable once you understand the boot process. These kinds of messages are discussed later.
It is important to understand that almost all of the firmware-based diagnostics can be disabled so as to minimize the amount of time it takes the server to start up. In the following discussion, assume that the system is configured to run its firmware-based tests.
As soon as you plug the Sun Fire V490 server into an electrical outlet, and before you turn on power to the server, the system controller (SC) inside the server begins its self-diagnostic and boot cycle. During this time, the locator LED blinks. Running off standby power, the system controller card begins functioning before the server itself comes up.
The system controller provides access to a number of control and monitoring functions through Remote System Control (RSC) software. For more information about RSC software, refer to Sun Remote System Control Software.
Every Sun Fire V490 server includes a chip holding about 2 Mbytes of firmware-based code. This chip is called the Boot PROM. After you turn on system power, the first thing the system does is execute code that resides in the Boot PROM.
This code, which is referred to as the OpenBoot firmware, is a small-scale operating system unto itself. However, unlike a traditional operating system that can run multiple applications for multiple simultaneous users, OpenBoot firmware runs in single-user mode and is designed solely to test, configure, and boot the system, thereby ensuring that the hardware is sufficiently "healthy" to run its normal operating system software.
When system power is turned on, the OpenBoot firmware begins running directly out of the Boot PROM, since at this stage system memory has not been verified to work properly.
Soon after power is turned on, the system hardware determines that at least one processor is powered on, and is submitting a bus access request, which indicates that the processor in question is at least partly functional. This becomes the master processor, and is responsible for executing OpenBoot firmware instructions.
The OpenBoot firmware's first actions are to check whether to run the power-on self-test (POST) diagnostics and other tests. The POST diagnostics constitute a separate chunk of code stored in a different area of the Boot PROM (refer to FIGURE 6-2).
The extent of these power-on self-tests, and whether they are performed at all, is controlled by configuration variables stored in a separate firmware memory device called the IDPROM. These OpenBoot configuration variables are discussed in Controlling POST Diagnostics.
As soon as POST diagnostics can verify that some subset of system memory is functional, tests are loaded into system memory.
The POST diagnostics verify the core functionality of the system. A successful execution of the POST diagnostics does not ensure that there is nothing wrong with the server, but it does ensure that the server can proceed to the next stage of the boot process.
For a Sun Fire V490 server, this means:
It is possible for a system to pass all POST diagnostics and still be unable to boot the operating system. However, you can run POST diagnostics even when a system fails to boot, and these tests are likely to disclose the source of most hardware problems.
POST generally reports errors that are persistent in nature. To catch intermittent problems, consider running a system exercising tool. Refer to About Exercising the System.
Each POST diagnostic is a low-level test designed to pinpoint faults in a specific hardware component. For example, individual memory tests called address bitwalk and data bitwalk ensure that binary 0s and 1s can be written on each address and data line. During such a test, the POST may display output similar to this:
In this example, processor 1 is the master processor, as indicated by the prompt 1:0>, and it is about to test the memory associated with processor 3, as indicated by the message "Slave 3."
Note - The x:y numbering system identifies processors that have multiple cores.
The failure of such a test reveals precise information about particular integrated circuits, the memory registers inside them, or the data paths connecting them:
1:0>ERROR: TEST = Data Bitwalk on Slave 3
1:0>H/W under test = CPU3 Memory
1:0>MSG = ERROR: miscompare on mem test!
        Address: 00000030.001b0038
        Expected: 00000000.00100000
        Observed: 00000000.00000000
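When reading a miscompare like the one above, it can help to XOR the expected and observed values to see which data bits differ. The following Python sketch is illustrative only and is not part of any Sun tool. The function names are hypothetical, and it assumes that the Address, Expected, and Observed fields are each printed as two 32-bit hexadecimal halves separated by a period, as in the sample.

```python
import re

# Sample text copied from the POST error message shown above.
POST_ERROR = """\
1:0>ERROR: TEST = Data Bitwalk on Slave 3
1:0>H/W under test = CPU3 Memory
1:0>MSG = ERROR: miscompare on mem test!
        Address: 00000030.001b0038
        Expected: 00000000.00100000
        Observed: 00000000.00000000
"""


def parse_hex_pair(text):
    """Convert 'hhhhhhhh.llllllll' into one 64-bit integer (assumed layout)."""
    high, low = text.split(".")
    return (int(high, 16) << 32) | int(low, 16)


def failing_bits(message):
    """Return the bit positions where Expected and Observed differ."""
    fields = dict(
        re.findall(r"(Address|Expected|Observed):\s*([0-9a-fA-F.]+)", message)
    )
    diff = parse_hex_pair(fields["Expected"]) ^ parse_hex_pair(fields["Observed"])
    return [bit for bit in range(64) if (diff >> bit) & 1]


print(failing_bits(POST_ERROR))  # prints [20]
```

For the sample message, the only differing position is bit 20, which is consistent with a bitwalk test that expected a single 1 and read back all 0s.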
When a specific power-on self-test discloses an error, it reports different kinds of information about the error:
Here is an excerpt of POST output showing another error message.
An important feature of POST error messages is the H/W under test line. (Refer to the arrow in CODE EXAMPLE 6-1.)
The H/W under test line indicates which FRU or FRUs may be responsible for the error. Note that in CODE EXAMPLE 6-1, three different FRUs are indicated. Using TABLE 6-13 to decode some of the terms, you can see that this POST error was most likely caused by a bad system interconnect circuit (Schizo) on the centerplane. However, the error message also indicates that the PCI riser board (I/O board) may be at fault. In the least likely case, the error might stem from the master processor, in this case processor 0.
Because each test operates at such a low level, the POST diagnostics are often more definite in reporting the minute details of the error, like the numerical values of expected and observed results, than they are about reporting which FRU is responsible. If this seems counter-intuitive, consider the block diagram of one data path within a Sun Fire V490 server, shown in FIGURE 6-3.
The dashed lines in FIGURE 6-3 represent boundaries between FRUs. Suppose a POST diagnostic is running in the processor in the left part of the diagram. This diagnostic attempts to initiate a built-in self-test in a PCI device located in the right side of the diagram.
If this built-in self-test fails, there could be a fault in the PCI controller, or, less likely, in one of the data paths or components leading to that PCI controller. The POST diagnostic can tell you only that the test failed, but not why. So, though the POST may present very precise data about the nature of the test failure, any of three different FRUs could be implicated.
You control POST diagnostics (and other aspects of the boot process) by setting OpenBoot configuration variables in the IDPROM. Changes to OpenBoot configuration variables generally take effect only after the machine is restarted. These variables affect OpenBoot Diagnostics tests as well as POST diagnostics.
TABLE 6-2 lists the most important and useful of these variables. You can find more extensive lists and descriptions in OpenBoot PROM Enhancements for Diagnostic Operation and OpenBoot 4.x Command Reference Manual. The former is included on the Sun Fire V490 Documentation CD. The latter is included with the Solaris Software Supplement CD that ships with Solaris software.
You can find instructions for changing OpenBoot configuration variables in How to View and Set OpenBoot Configuration Variables.
Determines whether the operating system automatically starts up. Default is true.
Determines whether the system attempts to boot after a nonfatal error. Default is true.
Determines the level or type of diagnostics executed. Default is max.
Redirects diagnostic and console messages to the system controller. Default is false.
Determines which devices are tested by OpenBoot Diagnostics. Default is normal.
Controls diagnostic execution in normal mode. Default is false.
Note: The above behaviors only apply to server machines like the Sun Fire V490 server. Workstations behave differently. For details, refer to OpenBoot PROM Enhancements for Diagnostic Operation.
Specifies the class of reset event that causes diagnostic tests to run. This variable can accept single keywords as well as combinations of the first three keywords separated by spaces. For details, refer to How to View and Set OpenBoot Configuration Variables. Default is power-on-reset and error-reset.
Selects where console input is taken from. Default is keyboard.
Note: Should the specified input device be unavailable, the system automatically reverts to ttya.
Selects where diagnostic and other console output is displayed. Default is screen.
Note: POST messages cannot be displayed on a graphics terminal. They are sent to ttya even when output-device is set to screen. Should the specified output device be unavailable, the system automatically reverts to ttya.
Controls whether the system is in service mode. Default is false.
Note: If the system control switch is in Diagnostics position, the system will boot in service mode even if the service-mode? variable is false.
Once POST diagnostics have finished running, POST reports the status of each test it has run to the OpenBoot firmware. Control then returns to the OpenBoot firmware code.
OpenBoot firmware code compiles a hierarchical "census" of all devices in the system. This census is called a device tree. Though different for every system configuration, the device tree generally includes both built-in system components and optional PCI bus devices.
Following the successful execution of POST diagnostics, the OpenBoot firmware proceeds to run OpenBoot Diagnostics tests. Like the POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the Boot PROM.
OpenBoot Diagnostics tests focus on system I/O and peripheral devices. Any device in the device tree, regardless of manufacturer, that includes an IEEE 1275-compatible self-test is included in the suite of OpenBoot Diagnostics tests. On a Sun Fire V490 server, OpenBoot Diagnostics test the following system components:
By default, the OpenBoot Diagnostics tests run automatically via a script when you start up the system. However, you can also run OpenBoot Diagnostics tests manually, as explained in the next section.
When you restart the system, you can run OpenBoot Diagnostics tests either interactively from a test menu, or by entering commands directly from the ok prompt.
Most of the same OpenBoot configuration variables you use to control POST (refer to TABLE 6-2) also affect OpenBoot Diagnostics tests. Notably, you can determine OpenBoot Diagnostics testing level--or suppress testing entirely--by appropriately setting the diag-level variable.
In addition, the OpenBoot Diagnostics tests use a special variable called test-args that enables you to customize how the tests operate. By default, test-args is set to contain an empty string. However, you can set test-args to one or more of the reserved keywords, each of which has a different effect on OpenBoot Diagnostics tests. TABLE 6-3 lists the available keywords.
If you want to make multiple customizations to the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:
It is easiest to run OpenBoot Diagnostics tests interactively from a menu. You access the menu by typing obdiag at the ok prompt. Refer to How to Isolate Faults Using Interactive OpenBoot Diagnostics Tests for full instructions.
The obdiag> prompt and the OpenBoot Diagnostics interactive menu (FIGURE 6-4) appear. For a brief explanation of each OpenBoot Diagnostics test, refer to TABLE 6-10 in Reference for OpenBoot Diagnostics Test Descriptions.
You run individual OpenBoot Diagnostics tests from the obdiag> prompt by typing:
where n represents the number associated with a particular menu item.
There are several other commands available to you from the obdiag> prompt. For descriptions of these commands, refer to TABLE 6-11 in Reference for OpenBoot Diagnostics Test Descriptions.
You can obtain a summary of this same information by typing help at the obdiag> prompt.
You can also run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:
Note - Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Sun Fire V490 system.
To customize an individual test, you can use test-args as follows:
This affects only the current test without changing the value of the test-args OpenBoot configuration variable.
You can test all the devices in the device tree with the test-all command:
If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the USB bus and all connected devices with self-tests:
OpenBoot Diagnostics error results are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. CODE EXAMPLE 6-2 displays a sample OpenBoot Diagnostics error message.
The i2c@1,2e and i2c@1,30 OpenBoot Diagnostics tests examine and report on environmental monitoring and control devices connected to the Sun Fire V490 server's Inter-IC (I2C) bus.
Error and status messages from the i2c@1,2e and i2c@1,30 OpenBoot Diagnostics tests include the hardware addresses of I2C bus devices:
The I2C device address is given at the very end of the hardware path. In this example, the address is 2,a8, which indicates a device located at hexadecimal address A8 on segment 2 of the I2C bus.
To decode this device address, refer to Reference for Decoding I2C Diagnostic Test Messages. Using TABLE 6-12, you can see that fru@2,a8 corresponds to an I2C device on DIMM 4 on processor 2. If the i2c@1,2e test were to report an error against fru@2,a8, you would need to replace this memory module.
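The address-splitting step described above can be sketched in Python. This is only an illustration: the function names are hypothetical, the example path is a shortened stand-in assembled from the node names mentioned in the text, and the sketch assumes both parts of the unit address are hexadecimal. Mapping the result to a physical FRU still requires TABLE 6-12.

```python
def last_node(path):
    """Return the final component of a hardware device path."""
    return path.rstrip("/").split("/")[-1]


def decode_unit_address(node):
    """Split a node such as 'fru@2,a8' into (name, segment, address)."""
    name, _, unit = node.partition("@")
    segment, _, address = unit.partition(",")
    # Assumption: both fields are hexadecimal, as the text states for the address.
    return name, int(segment, 16), int(address, 16)


# Shortened, illustrative path; a real message shows the full hardware path.
EXAMPLE_PATH = "/i2c@1,2e/fru@2,a8"

name, segment, address = decode_unit_address(last_node(EXAMPLE_PATH))
print(f"{name}: segment {segment}, address 0x{address:02X}")
# prints "fru: segment 2, address 0xA8"
```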
Beyond the formal firmware-based diagnostic tools, there are a few commands you can invoke from the ok prompt. These OpenBoot commands display information that can help you assess the condition of a Sun Fire V490 server. These include the following commands:
This section describes the information these commands give you. For instructions on using these commands, turn to How to Use OpenBoot Information Commands, or look up the appropriate man page.
The .env command displays the current environmental status, including fan speeds as well as voltages, currents, and temperatures measured at various system locations. For more information, refer to About OpenBoot Environmental Monitoring, and How to Obtain OpenBoot Environmental Status Information.
The printenv command displays the OpenBoot configuration variables. The display includes the current values for these variables as well as the default values. For details, refer to How to View and Set OpenBoot Configuration Variables.
For more information about printenv, refer to the printenv man page. For a list of some important OpenBoot configuration variables, refer to TABLE 6-2.
The probe-scsi and probe-scsi-all commands check the presence of SCSI or FC-AL devices and verify that the bus itself is operating properly.
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-scsi or probe-scsi-all command can hang the system.
The probe-scsi command communicates with all SCSI and FC-AL devices connected to on-board SCSI and FC-AL controllers. The probe-scsi-all command additionally accesses devices connected to any host adapters installed in PCI slots.
For any SCSI or FC-AL device that is connected and active, the probe-scsi and probe-scsi-all commands display its loop ID, host adapter, logical unit number, unique World Wide Name (WWN), and a device description that includes type and manufacturer.
The following is sample output from the probe-scsi command.
ok probe-scsi
LiD HA LUN  --- Port WWN ---  ----- Disk description -----
 0   0   0  2100002037cdaaca  SEAGATE ST336704FSUN36G 0726
 1   1   0  2100002037a9b64e  SEAGATE ST336704FSUN36G 0726
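If you capture probe-scsi output from the console (for example, in a terminal log), a short script can turn the rows into records for comparison against an inventory. The Python sketch below is an assumption-laden illustration, not a Sun utility: the function name and dictionary keys are hypothetical, and the row pattern is inferred only from the sample above (three short columns, a 16-digit hexadecimal WWN, then a free-form description).

```python
import re

# Sample text copied from the probe-scsi output shown above.
PROBE_SCSI = """\
ok probe-scsi
LiD HA LUN  --- Port WWN ---  ----- Disk description -----
 0   0   0  2100002037cdaaca  SEAGATE ST336704FSUN36G 0726
 1   1   0  2100002037a9b64e  SEAGATE ST336704FSUN36G 0726
"""

# A data row: three columns, a 16-hex-digit WWN, and a description.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\S+)\s+([0-9a-fA-F]{16})\s+(.*\S)\s*$")


def parse_probe_scsi(text):
    """Return one dict per device row; header and prompt lines are skipped."""
    devices = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            loop_id, adapter, lun, wwn, description = match.groups()
            devices.append(
                {
                    "loop_id": loop_id,
                    "host_adapter": adapter,
                    "lun": lun,
                    "wwn": wwn,
                    "description": description,
                }
            )
    return devices


for device in parse_probe_scsi(PROBE_SCSI):
    print(device["loop_id"], device["wwn"], device["description"])
```

The first three columns are kept as strings so that the sketch makes no assumption about their numeric base.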
The following is sample output from the probe-scsi-all command.
Note that the probe-scsi-all command lists dual-ported devices twice. This is because these FC-AL devices (refer to the qlc@2 entry in CODE EXAMPLE 6-4) can be accessed through two separate controllers: the on-board Loop-A controller and the optional Loop-B controller provided through a PCI card.
The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-ide command can hang the system.
The following is sample output from the probe-ide command.
ok probe-ide
  Device 0  ( Primary Master )
        Removable ATAPI Model: TOSHIBA DVD-ROM SD-C2512

  Device 1  ( Primary Slave )
        Not Present
The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE EXAMPLE 6-6 shows some sample output (edited for brevity).
If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser operating system. For most Sun systems, this means the Solaris OS. Once the server is running in multiuser mode, you have recourse to software-based diagnostic tools, like SunVTS and Sun Management Center. These tools can help you with more advanced monitoring, exercising, and fault isolating capabilities.
Note - If you set the auto-boot? OpenBoot configuration variable to false, the operating system does not boot automatically following completion of the firmware-based tests.
In addition to the formal tools that run on top of Solaris OS software, there are other resources that you can use when assessing or monitoring the condition of a Sun Fire V490 server. These include:
Error and other system messages are saved in the file /var/adm/messages. Messages are logged to this file from many sources, including the operating system, the environmental control subsystem, and various software applications.
For information about /var/adm/messages and other sources of system information, refer to your Solaris system administration documentation.
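As one way to narrow down a large log, the following Python sketch filters syslog-style lines by severity. It is a hedged illustration only: the function name is hypothetical, the two sample lines are invented for the example, and the sketch assumes that messages carry a bracketed tag of the form [ID number facility.level], which depends on how logging is configured on your system.

```python
import re

# Matches a tag such as "[ID 100001 kern.warning]" (assumed message format).
TAG = re.compile(r"\[ID \d+ (\w+)\.(\w+)\]")

SERIOUS_LEVELS = ("warning", "err", "crit", "alert", "emerg")


def lines_at_level(lines, levels=SERIOUS_LEVELS):
    """Return only the lines whose severity level is in `levels`."""
    hits = []
    for line in lines:
        match = TAG.search(line)
        if match and match.group(2) in levels:
            hits.append(line)
    return hits


# Invented sample lines, for illustration only.
SAMPLE = [
    "Jan  1 00:00:01 host1 unix: [ID 100000 kern.notice] NOTICE: example notice",
    "Jan  1 00:00:02 host1 unix: [ID 100001 kern.warning] WARNING: example warning",
]

for line in lines_at_level(SAMPLE):
    print(line)
```

To apply the same filter to a real log, you could pass an open file object for /var/adm/messages in place of SAMPLE, since the function only iterates over lines.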
Some Solaris commands display data that you can use when assessing the condition of a Sun Fire V490 server. These include the following commands:
This section describes the information these commands give you. For instructions on using these commands, turn to How to Use Solaris System Information Commands, or look up the appropriate man page.
The prtconf command displays the Solaris device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, like individual disks, that only the operating system software "knows" about. The output of prtconf also includes the total amount of system memory. CODE EXAMPLE 6-7 shows an excerpt of prtconf output (edited to save space).
The prtconf command's -p option produces output similar to the OpenBoot show-devs command (refer to show-devs Command). This output lists only those devices compiled by the system firmware.
The prtdiag command displays a table of diagnostic information that summarizes the status of system components.
The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. Following is an excerpt of some of the output produced by prtdiag on a healthy Sun Fire V490 system running Solaris 8, Update 7.
In addition to that information, prtdiag with the verbose option (-v) also reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.
System Temperatures (Celsius):
-------------------------------
Device          Temperature     Status
---------------------------------------
CPU0             59             OK
CPU2             64             OK
DBP0             22             OK
In the event of an overtemperature condition, prtdiag reports an error in the Status column.
System Temperatures (Celsius):
-------------------------------
Device          Temperature     Status
---------------------------------------
CPU0             62             OK
CPU1            102             ERROR
Similarly, if there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.
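Because prtdiag reports status in a simple column layout, saved output can be scanned for entries that are not OK. The Python sketch below illustrates the idea using the overtemperature excerpt above. The function name is hypothetical, and the parsing rule (three whitespace-separated columns with a numeric middle column) is an assumption based only on that excerpt; as noted earlier, the prtdiag display format can vary between Solaris versions.

```python
# Sample text copied from the overtemperature excerpt shown above.
PRTDIAG_TEMPS = """\
System Temperatures (Celsius):
-------------------------------
Device          Temperature     Status
---------------------------------------
CPU0             62             OK
CPU1            102             ERROR
"""


def abnormal_temperatures(text):
    """Return (device, temperature, status) for every row whose status is not OK."""
    flagged = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three columns with a numeric temperature.
        if len(parts) == 3 and parts[1].isdigit():
            device, temperature, status = parts
            if status != "OK":
                flagged.append((device, int(temperature), status))
    return flagged


print(abnormal_temperatures(PRTDIAG_TEMPS))  # prints [('CPU1', 102, 'ERROR')]
```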
The Sun Fire V490 system maintains a hierarchical list of all field-replaceable units (FRUs) in the system, as well as specific information about various FRUs.
The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable programmable read-only memory (SEEPROM) devices located on many FRUs. CODE EXAMPLE 6-12 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.
/frutree
/frutree/chassis (fru)
/frutree/chassis/io-board (container)
/frutree/chassis/rsc-board (container)
/frutree/chassis/fcal-backplane-slot
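The hierarchical list above lends itself to simple processing, for instance to pick out the entries marked as containers. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes each line consists of a path optionally followed by a single parenthesized tag, as in the excerpt.

```python
# Sample text copied from the prtfru -l excerpt shown above.
PRTFRU_LIST = """\
/frutree
/frutree/chassis (fru)
/frutree/chassis/io-board (container)
/frutree/chassis/rsc-board (container)
/frutree/chassis/fcal-backplane-slot
"""


def parse_prtfru_list(text):
    """Return (path, kind, depth) tuples; kind is None when no tag is present."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("/frutree"):
            continue
        path, _, tag = line.partition(" ")
        kind = tag.strip("()") or None
        depth = path.count("/") - 1  # /frutree itself is depth 0
        entries.append((path, kind, depth))
    return entries


entries = parse_prtfru_list(PRTFRU_LIST)
containers = [path for path, kind, _ in entries if kind == "container"]
print(containers)
```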
CODE EXAMPLE 6-13 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option.
Data displayed by the prtfru command varies depending on the type of FRU. In general, this information includes:
Information about the following Sun Fire V490 FRUs is displayed by the prtfru command:
The psrinfo command displays the date and time each processor came online. With the verbose (-v) option, the command displays additional information about the processors, including their clock speed. The following is sample output from the psrinfo command with the -v option.
The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 6-15 shows sample output of the showrev command.
When used with the -p option, this command displays installed patches. CODE EXAMPLE 6-16 shows a partial sample output from the showrev command with the -p option.
Different diagnostic tools are available to you at different stages of the boot process. TABLE 6-4 summarizes what tools are available to you and when they are available.
Each of the tools available for fault isolation discloses faults in different field-replaceable units (FRUs). The row headings along the left of TABLE 6-5 list the FRUs in a Sun Fire V490 system. The available diagnostic tools are shown in column headings across the top. A check mark (✔) in this table indicates that a fault in a particular FRU can be isolated by a particular diagnostic.
In addition to the FRUs listed in TABLE 6-5, there are several minor replaceable system components--mostly cables--that cannot directly be isolated by any system diagnostic. For the most part, you determine when these components are faulty by eliminating other possibilities. These FRUs are listed in TABLE 6-6.
Sun provides two tools that can give you advance warning of difficulties and prevent future downtime. These are:
These monitoring tools let you specify system criteria that bear watching. For instance, you can set a threshold for system temperature and be notified if that threshold is exceeded.
Sun Remote System Control (RSC) software, working in conjunction with the system controller (SC) card, enables you to monitor and control your server over a serial port or a network. RSC software provides both graphical and command-line interfaces for remotely administering geographically distributed or physically inaccessible machines.
You can also redirect the server's system console to the system controller, which lets you remotely run diagnostics (like POST) that would otherwise require physical proximity to the machine's serial port.
The system controller card runs independently, and uses standby power from the server. Therefore, the SC and its RSC software continue to be effective when the server operating system goes offline.
RSC software lets you monitor the following on the Sun Fire V490 server.
Before you can start using RSC software, you must install and configure it on the server and client systems. Instructions for doing this are given in the Sun Remote System Control (RSC) 2.2 User's Guide, which is included on the Sun Fire V490 Documentation CD.
You also have to make any needed physical connections and set OpenBoot configuration variables that redirect the console output to the system controller. The latter task is described in How to Redirect the System Console to the System Controller.
For instructions on using RSC software to monitor a Sun Fire V490 system, refer to How to Monitor the System Using the System Controller and RSC Software.
Sun Management Center software provides enterprise-wide monitoring of Sun servers and workstations, including their subsystems, components, and peripheral devices. The system being monitored must be up and running, and you need to install all the proper software components on various systems in your network.
Sun Management Center lets you monitor the following on the Sun Fire V490 server.
The Sun Management Center product comprises three software entities:
You install agents on systems to be monitored. The agents collect system status information from log files, device trees, and platform-specific sources, and report that data to the server component.
The server component maintains a large database of status information for a wide range of Sun platforms. This database is updated frequently, and includes information about boards, tapes, power supplies, and disks as well as operating system parameters like load, resource usage, and disk space. You can create alarm thresholds and be notified when these are exceeded.
The monitor components present the collected data to you in a standard format. Sun Management Center software provides both a standalone Java application and a Web browser-based interface. The Java interface affords physical and logical views of the system for highly intuitive monitoring.
Sun Management Center software provides you with additional tools in the form of an informal tracking mechanism and an optional add-on diagnostics suite. In a heterogeneous computing environment, the product can interoperate with management utilities made by other companies.
Sun Management Center agent software must be loaded on any system you want to monitor. However, the product lets you informally track a supported platform even when the agent software has not been installed on it. In this case, you do not have full monitoring capability, but you can add the system to your browser, have Sun Management Center periodically check whether it is up and running, and notify you if it goes out of commission.
The Hardware Diagnostic Suite is available as a premium package you can purchase as an add-on to the Sun Management Center product. This suite lets you exercise a system while it is still up and running in a production environment. Refer to Exercising the System Using Hardware Diagnostic Suite for more information.
If you administer a heterogeneous network and use a third-party network-based system monitoring or management tool, you may be able to take advantage of Sun Management Center software's support for Tivoli Enterprise Console, BMC Patrol, and HP OpenView.
Sun Management Center software is geared primarily toward system administrators who monitor large data centers or other installations with many computer platforms. If you administer a more modest installation, you need to weigh Sun Management Center software's benefits against the requirement of maintaining a significant database (typically over 700 Mbytes) of system status information.
The servers being monitored must be up and running if you want to use Sun Management Center, since this tool relies on the Solaris OS. For instructions, refer to How to Monitor the System Using Sun Management Center Software. For detailed information about the product, refer to the Sun Management Center User's Guide.
For the latest information about this product, go to the Sun Management Center Web site at: http://www.sun.com/sunmanagementcenter.
It is relatively easy to detect when a system component fails outright. However, when a system has an intermittent problem or seems to be "behaving strangely," a software tool that stresses or exercises the computer's many subsystems can help disclose the source of the emerging problem and prevent long periods of reduced functionality or system downtime.
Sun provides two tools for exercising Sun Fire V490 systems:
TABLE 6-9 shows the FRUs that each system exercising tool is capable of isolating. Note that individual tools do not necessarily test all the components or paths of a particular FRU.
The SunVTS software validation test suite performs system and subsystem stress testing. You can view and control a SunVTS session over a network. Using a remote machine, you can view the progress of a testing session, change testing options, and control all testing features of another machine on the network.
You can run SunVTS software in five different test modes:
Since SunVTS software can run many tests in parallel and consume many system resources, you should take care when using it on a production system. If you are stress-testing a system using SunVTS software's Comprehensive test mode, you should not run anything else on that system at the same time.
The Sun Fire V490 server to be tested must be up and running if you want to use SunVTS software, since it relies on the Solaris operating system. Since SunVTS software packages are optional, they may not be installed on your system. Turn to How to Check Whether SunVTS Software Is Installed for instructions.
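As a rough illustration of such a check, the following Python sketch asks the Solaris pkginfo utility whether a package is present. The package name SUNWvts is an assumption inferred from the /opt/SUNWvts/ directory named in this section, and the helper name is_package_installed is hypothetical. The referenced procedure remains the authoritative source for the packages to look for.

```python
import subprocess

def is_package_installed(name, run=subprocess.run):
    """Return True if `pkginfo -q NAME` exits with status 0."""
    try:
        result = run(
            ["pkginfo", "-q", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # pkginfo could not be started, so this is probably not a Solaris host
        return False
    return result.returncode == 0

# SUNWvts is an assumed package name; your installation may use others as well.
print("SunVTS installed:", is_package_installed("SUNWvts"))
```

The run parameter exists only so the helper can be exercised on a machine without pkginfo.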
It is important to use the most up-to-date version of SunVTS available, to ensure you have the latest suite of tests. To download the most recent SunVTS software, point your Web browser to: http://www.sun.com/oem/products/vts/.
For instructions on running SunVTS software to exercise the Sun Fire V490 server, refer to How to Exercise the System Using SunVTS Software. For more information about the product, refer to:
These documents are available on the Solaris Software Supplement CD and on the Web at: http://docs.sun.com. You should also consult the SunVTS README file located at /opt/SUNWvts/. This document provides late-breaking information about the installed version of the product.
During SunVTS software installation, you must choose either Basic or Sun Enterprise Authentication Mechanism (SEAM) security. Basic security uses a local security file in the SunVTS installation directory to limit the users, groups, and hosts permitted to use SunVTS software. SEAM security is based on Kerberos--the standard network authentication protocol--and provides secure user authentication, data integrity, and privacy for transactions over networks.
If your site uses SEAM security, you must have the SEAM client and server software installed in your networked environment and configured properly in both Solaris and SunVTS software. If your site does not use SEAM security, do not choose the SEAM option during SunVTS software installation.
If you enable the wrong security scheme during installation, or if you improperly configure the security scheme you choose, you may find yourself unable to run SunVTS tests. For more information, refer to the SunVTS User's Guide and the instructions accompanying the SEAM software.
The Sun Management Center product features an optional Hardware Diagnostic Suite, which you can purchase as an add-on. The Hardware Diagnostic Suite is designed to exercise a production system by running tests sequentially.
Sequential testing means the Hardware Diagnostic Suite has a low impact on the system. Unlike SunVTS, which stresses a system by consuming its resources with many parallel tests (refer to Exercising the System Using SunVTS Software), the Hardware Diagnostic Suite lets the server run other applications while testing proceeds.
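The difference between the two scheduling approaches can be sketched in Python. This is only an illustration of parallel versus sequential execution. The test names, the LoadTracker class, and the sleep that stands in for test work are all hypothetical and do not reflect how either product is implemented.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class LoadTracker:
    """Records how many stand-in tests are running at the same moment."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def run_test(self, name):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.05)  # stand-in for real test work
        with self._lock:
            self.active -= 1
        return f"{name}: passed"

def run_sequential(tests, tracker):
    # One test at a time: low load, other applications can keep running
    return [tracker.run_test(t) for t in tests]

def run_parallel(tests, tracker):
    # All tests at once: heavier load on the machine
    with ThreadPoolExecutor(max_workers=len(tests)) as pool:
        return list(pool.map(tracker.run_test, tests))

TESTS = ["disk", "memory", "network"]  # hypothetical test names
```

In this sketch the sequential run never has more than one test active at a time, while the parallel run usually has all of them active at once, which is the property that makes parallel testing more demanding on a production system.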
The best use of the Hardware Diagnostic Suite is to disclose a suspected or intermittent problem with a noncritical part on an otherwise functioning machine. Examples might include questionable disk drives or memory modules on a machine that has ample or redundant disk and memory resources.
In cases like these, the Hardware Diagnostic Suite runs unobtrusively until it identifies the source of the problem. The machine under test can be kept in production mode until and unless it must be shut down for repair. If the faulty part is hot-pluggable or hot-swappable, the entire diagnose-and-repair cycle can be completed with minimal impact on system users.
Since it is a part of Sun Management Center, you can run the Hardware Diagnostic Suite only if you have set up your data center to run Sun Management Center. This means you have to dedicate a master server to run the Sun Management Center server software that supports Sun Management Center software's database of platform status information. In addition, you must install and set up Sun Management Center agent software on the systems to be monitored. Finally, you need to install the console portion of Sun Management Center software, which serves as your interface to the Hardware Diagnostic Suite.
Instructions for setting up Sun Management Center, as well as for using the Hardware Diagnostic Suite, can be found in the Sun Management Center User's Guide.
This section describes the OpenBoot Diagnostics tests and commands available to you. For background information about these tests, refer to Stage Two: OpenBoot Diagnostics Tests.
Tests the registers of the Fibre Channel-Arbitrated Loop
|
||
Tests all writable registers in the Boot Bus Controller. Also verifies that at least one system processor has Boot Bus access |
||
Tests the PCI configuration registers, DMA control registers, and EBus mode registers. Also tests DMA controller functions |
||
Tests segments 0-4 of the I2C environmental monitoring subsystem, which includes various temperature and other sensors located throughout the system |
Multiple. Refer to Reference for Decoding I2C Diagnostic Test Messages. |
|
Same as above, for segment 5 of the I2C environmental monitoring subsystem |
||
Tests the on-board IDE controller and IDE bus subsystem that controls the DVD drive |
||
Tests the on-board Ethernet logic, running internal loopback tests. Can also run external loopback tests, but only if you install a loopback connector (not provided) |
||
Tests SC hardware, including the SC serial and Ethernet ports |
||
Tests the registers of the real-time clock and then tests the interrupt rates |
||
Tests all possible baud rates supported by the ttya serial line. Performs an internal and external loopback test on each line at each speed |
||
Tests the writable registers of the USB open host controller |
TABLE 6-11 describes the commands you can type from the obdiag> prompt.
Exits OpenBoot Diagnostics tests and returns to the ok prompt |
|
Displays a brief description of each OpenBoot Diagnostics command and OpenBoot configuration variable |
|
Sets the value for an OpenBoot configuration variable (also available from the ok prompt) |
|
Tests all devices displayed in the OpenBoot Diagnostics test menu (also available from the ok prompt) |
|
Tests only the device identified by the given menu entry number. (A similar function is available from the ok prompt. Refer to From the ok Prompt: The test and test-all Commands.) |
|
Tests only the devices identified by the given menu entry numbers |
|
Tests all devices in the OpenBoot Diagnostics test menu except those identified by the specified menu entry numbers |
|
Displays the version, last modified date, and manufacturer of each self-test in the OpenBoot Diagnostics test menu and library |
|
Displays selected properties of the devices identified by menu entry numbers. The information provided varies according to device type |
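The way the test commands in TABLE 6-11 select devices (all entries, only the listed entries, or everything except the listed entries) can be modeled as simple set operations. The following Python sketch is only an illustration. The function names and the sample menu are hypothetical, and the entry numbers do not correspond to an actual OpenBoot Diagnostics test menu.

```python
def _check(menu, entries):
    """Reject entry numbers that are not in the menu."""
    unknown = set(entries) - set(menu)
    if unknown:
        raise ValueError(f"no such menu entries: {sorted(unknown)}")
    return set(entries)

def select_all(menu):
    """Every device in the menu, in menu order."""
    return [menu[n] for n in sorted(menu)]

def select_only(menu, entries):
    """Only the devices identified by the given entry numbers."""
    wanted = _check(menu, entries)
    return [menu[n] for n in sorted(wanted)]

def select_except(menu, entries):
    """Every device except those identified by the given entry numbers."""
    excluded = _check(menu, entries)
    return [menu[n] for n in sorted(menu) if n not in excluded]

# Hypothetical menu: entry number -> device name
MENU = {1: "device-a", 2: "device-b", 3: "device-c", 4: "device-d"}

print(select_only(MENU, [3, 1]))
print(select_except(MENU, [2, 4]))
```

Treating the entry numbers as a set makes the "only" and "except" selections exact complements of each other for the same list of numbers.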
TABLE 6-12 describes each I2C device in a Sun Fire V490 system, and helps you associate each I2C address with the proper FRU. For more information about I2C tests, refer to I2C Bus Device Tests.
The status and error messages displayed by POST diagnostics and OpenBoot Diagnostics tests occasionally include acronyms or abbreviations for hardware sub-components. TABLE 6-13 is included to assist you in decoding this terminology and associating the terms with specific FRUs, where appropriate.
Advanced Power Control - A function provided by the SuperIO integrated circuit |
||
Boot Bus Controller - Interface between the processors and components on many other buses |
||
Direct Memory Access - In diagnostic output, usually refers to a controller on a PCI card |
||
Inter-Integrated Circuit (also written as I2C) - A bidirectional, two-wire serial data bus. Used mainly for environmental monitoring and control |
Various. Refer to TABLE 6-12. |
|
Joint Test Action Group - An IEEE subcommittee standard (1149.1) for scanning system components |
||
Media Access Controller - Hardware address of a device connected to a network |
||
Multifunction integrated circuit bridging the PCI bus with EBus and USB |
||
The system interconnect architecture--that is, the data and address buses |
||
A means for monitoring and altering the content of ASICs and system components, as provided for in the IEEE 1149.1 standard |
||
SuperIO integrated circuit - Controls the SC UART port and more |
||
Universal Asynchronous Receiver Transmitter - Serial port hardware |
Copyright © 2004, Sun Microsystems, Inc. All Rights Reserved.