Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Diagnostics

The following sections describe Solaris diagnostics that may be useful in troubleshooting various types of error conditions.

Network Diagnostics

You can use /usr/sbin/ping to check whether you can connect to the network interface on another node. For example:

% ping hpc-node3

tests the connection to hpc-node3 over the default network.
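
To check several nodes at once, the ping test can be wrapped in a loop. The following is a minimal sketch and is not part of ClusterTools: check_nodes and PROBE are hypothetical names, and the sketch assumes that Solaris ping accepts a timeout in seconds after the host name and returns a nonzero exit status when the node does not answer.

```shell
#!/bin/sh
# Sketch only: check_nodes and PROBE are illustrative names, not ClusterTools commands.
# PROBE is the command used to test one node; it defaults to Solaris ping.
PROBE=${PROBE:-/usr/sbin/ping}

# Print one status line for each node name given as an argument.
check_nodes() {
    for node in "$@"; do
        # Assumed Solaris usage: ping host [timeout-in-seconds]
        if $PROBE "$node" 5 >/dev/null 2>&1; then
            echo "$node: alive"
        else
            echo "$node: NO RESPONSE"
        fi
    done
}
```

For example, check_nodes hpc-node1 hpc-node2 hpc-node3 prints one line per node (the node names here are placeholders).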

You can use /usr/sbin/spray to determine whether a node can handle significant network traffic. spray indicates the amount of dropped traffic. For example:

% spray -c 100 hpc-node3 

sends 100 small packets to hpc-node3.

Checking Load Averages

Use mpinfo -N or, if the CRE is not running, /usr/bin/uptime to display load averages. uptime reports averages over the last 1, 5, and 15 minutes, so comparing them shows both the current load on the machine and how quickly it reached that load level.
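
The comparison of load averages can be scripted. The following is a minimal sketch that reads a line of uptime output and reports whether the load is rising or falling. The function names load_averages and load_trend are hypothetical, and the sketch assumes that the line contains the text "load average:" followed by three comma-separated numbers.

```shell
#!/bin/sh
# Sketch only: load_averages and load_trend are illustrative names.

# Read uptime-style output on stdin and print the 1-, 5-, and 15-minute
# load averages separated by spaces.
load_averages() {
    sed -n 's/.*load average: *\([0-9.]*\), *\([0-9.]*\), *\([0-9.]*\).*/\1 \2 \3/p'
}

# Compare the 1-minute average with the 15-minute average.
# Adding 0 forces a numeric comparison in older awk implementations.
load_trend() {
    load_averages | awk '{
        if ($1 + 0 > $3 + 0)      print "rising"
        else if ($1 + 0 < $3 + 0) print "falling"
        else                      print "steady"
    }'
}
```

A typical invocation would be /usr/bin/uptime | load_trend.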

Using Interval Diagnostics

The diagnostic programs described below report the status of various system parameters. Each accepts a numerical argument, given at the end of the command, that specifies the interval in seconds between status checks. Without an interval, each program prints a single report of values accumulated or averaged since boot time. Specify an interval to see current activity; the first report still covers the period since boot, and each subsequent report covers only the preceding interval.

Use /usr/bin/netstat to check local system network traffic. For example:

% netstat -ni 3 

checks and reports traffic every 3 seconds.

Use /usr/bin/iostat to display disk and system usage. For example:

% iostat -c 2 

displays CPU utilization percentages (user, system, I/O wait, and idle) every 2 seconds.

Use /usr/bin/vmstat to generate additional information about the virtual memory system. For example:

% vmstat -S 5

reports on swapping activity every 5 seconds.

It can be useful to run these diagnostics periodically, monitoring their output for multiple intervals.
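
Periodic monitoring of this kind can be sketched as a small script that records timestamped samples. The function names sample and monitor are hypothetical, and the sketch assumes a POSIX-style shell such as ksh rather than the legacy Bourne shell. It also assumes that the diagnostic accepts a count after the interval, as vmstat and iostat do, so that each call prints the since-boot report followed by one report for the interval; only the last line is kept.

```shell
#!/bin/ksh
# Sketch only: sample and monitor are illustrative names.

# Run the given command once, keep only its last output line,
# and prefix that line with the current date and time.
sample() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$("$@" | sed -n '$p')"
}

# monitor N command [args...]: take N samples of the given command.
monitor() {
    n=$1
    shift
    while [ "$n" -gt 0 ]; do
        sample "$@"
        n=$((n - 1))
    done
}
```

For example, monitor 12 vmstat -S 5 2 >> /var/tmp/vmstat.log would record twelve 5-second samples of swapping activity (the log file path is a placeholder).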