Sun MPI 4.0 User's Guide: With CRE

Chapter 4 Getting Information

The CRE user interface includes two commands for obtaining information: mpinfo reports on a Sun HPC cluster's configuration, and mpps reports on jobs running on the cluster.

`mpps`: Finding Out Job Status

The mpps command is comparable to the Solaris ps command. It returns information about jobs and processes currently running on the Sun HPC cluster.

By default mpps shows basic information about the user's jobs currently running in the default partition. For example,

% mpps
   JID   NPROC  UID   STATE  AOUT
   41    3      slu   RUN    AAA
   46    4      slu   EXNG   tmp
   49    1      slu   EXIT   tmp
   99    9      slu   EXNG   uname
   100   9      slu   EXNG   uname

In the response,

JID is the executing program's job ID.

NPROC is the number of processes in the job.

UID is the user ID of the person who executed the program.

STATE is the execution status of the job's processes. (See below for a list of possible process states.)

AOUT is the name of the executable program.
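Because the default listing is a whitespace-separated table, it is straightforward to post-process. The following Python sketch is not part of the CRE; it parses the sample output shown above into one record per job. The function name parse_mpps is hypothetical, and the sketch assumes that no column value contains spaces.

```python
# Sample text copied from the mpps example above.
SAMPLE = """\
   JID   NPROC  UID   STATE  AOUT
   41    3      slu   RUN    AAA
   46    4      slu   EXNG   tmp
   49    1      slu   EXIT   tmp
   99    9      slu   EXNG   uname
   100   9      slu   EXNG   uname
"""


def parse_mpps(text):
    """Turn default mpps output into a list of dicts keyed by column heading.

    Assumption: columns are separated by whitespace and values contain no spaces.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    headings = lines[0].split()
    return [dict(zip(headings, line.split())) for line in lines[1:]]


jobs = parse_mpps(SAMPLE)
# Job IDs of all jobs that are currently exiting.
exiting = [job["JID"] for job in jobs if job["STATE"] == "EXNG"]
print(exiting)
```

All values are kept as strings; convert JID and NPROC with int() if you need to sort or compare them numerically.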

Table 4-1 lists the states reported by mpps. Some states refer only to jobs, some only to processes, and some to both. (See "Displaying Process Information".)

Table 4-1 Job and Process States


State	mpps Display	Meaning
CORE	CORE	The job or process exited due to a signal and core was dumped.
COREING	CRNG	The job is exiting due to a signal. The first process to die dumped core.
EXIT	EXIT	The job or process exited normally.
EXITING	EXNG	The job is exiting. At least one process exited normally.
FAIL	FAIL	The job or process failed on startup or was aborted.
FAILING	FLNG	Initialization of the job failed, or a job abort has been signaled.
ORPHAN	ORPHAN	The process has been "orphaned," that is, the node on which it exists has gone offline.
RUNNING	RUN	The job or process is running.
SEXIT	SEXIT	The job or process exited due to a signal.
SEXITING	SEXNG	The job is exiting due to a signal. The first process to die was killed by a signal. At least one of its processes is still in the RUN state.
SPAWNING	SPAWN	The job or process is being spawned.
STOP	STOP	The job or process is stopped.
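If you process mpps output in a script, a lookup table can translate the abbreviated display values back to the state names. The following Python sketch copies its entries from Table 4-1; the names MPPS_STATES and state_name are hypothetical.

```python
# Display value -> state name, copied from Table 4-1.
MPPS_STATES = {
    "CORE": "CORE",
    "CRNG": "COREING",
    "EXIT": "EXIT",
    "EXNG": "EXITING",
    "FAIL": "FAIL",
    "FLNG": "FAILING",
    "ORPHAN": "ORPHAN",
    "RUN": "RUNNING",
    "SEXIT": "SEXIT",
    "SEXNG": "SEXITING",
    "SPAWN": "SPAWNING",
    "STOP": "STOP",
}


def state_name(display):
    """Return the state name for an mpps display value.

    Values that are not in Table 4-1 are returned unchanged.
    """
    return MPPS_STATES.get(display, display)


print(state_name("EXNG"))
```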

Use the -f option to also display each job's start time and arguments.

Use the -e option to display information on all jobs, not just your jobs.

Specifying the Partition

To show information about jobs running in all partitions, use the -A option.

To show information about jobs running in a specific partition, use the -a option, followed by the name of the partition.
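The options described so far can be assembled programmatically. The following Python sketch builds an mpps argument list from the -f, -e, -A, and -a options described above. The helper name mpps_argv is hypothetical. The text does not say whether these options can be combined on one command line or whether -A and -a can be given together, so the sketch assumes that combining is allowed and treats -A and -a as mutually exclusive.

```python
def mpps_argv(full=False, everyone=False, all_partitions=False, partition=None):
    """Build an argument list for mpps (hypothetical helper).

    full           -> -f (also show start time and arguments)
    everyone       -> -e (all jobs, not just your own)
    all_partitions -> -A (jobs in all partitions)
    partition      -> -a NAME (jobs in one named partition)
    """
    # Assumption: -A and -a are alternatives, so reject both at once.
    if all_partitions and partition is not None:
        raise ValueError("choose either all partitions (-A) or one partition (-a)")
    argv = ["mpps"]
    if full:
        argv.append("-f")
    if everyone:
        argv.append("-e")
    if all_partitions:
        argv.append("-A")
    elif partition is not None:
        argv += ["-a", partition]
    return argv


print(mpps_argv(everyone=True, partition="part0"))
```

The resulting list can be passed to a process-launching call such as subprocess.run on a node where mpps is installed.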

Displaying Process Information

Use the -p option to also view information about the processes that make up the jobs. The process information is listed below each job. For example,

% mpps -p
   JID  NPROC  UID   STATE  AOUT   RANK  PID    STATE  NODE
   2320    4   shaw  RUN    sleep  0     10190  RUN    node6
                                   1      4744  RUN    node7
                                   2     16564  RUN    node4
                                   3      9412  RUN    node5

In this example,

RANK is the process's rank within the job.
PID is the process's process ID.
STATE is the process's execution status.
NODE is the node on which the process is running.
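In the -p listing, the job columns appear only on the first row of each job, so a script has to attach the remaining process rows to the most recent job. The following Python sketch does this for the sample output above. The function name parse_mpps_p is hypothetical, and the sketch assumes the row layout of the sample: a row with nine fields starts a new job and carries its first process, and a row with four fields describes one more process of the current job.

```python
# Sample text laid out as in the mpps -p example above.
SAMPLE = """\
   JID  NPROC  UID   STATE  AOUT   RANK  PID    STATE  NODE
   2320    4   shaw  RUN    sleep  0     10190  RUN    node6
                                   1      4744  RUN    node7
                                   2     16564  RUN    node4
                                   3      9412  RUN    node5
"""


def parse_mpps_p(text):
    """Group the process rows of mpps -p output under their jobs."""
    lines = [line for line in text.splitlines() if line.strip()]
    jobs = []
    for line in lines[1:]:  # skip the heading row
        fields = line.split()
        if len(fields) == 9:
            # Job columns followed by the job's first process.
            jid, nproc, uid, state, aout = fields[:5]
            jobs.append({"JID": jid, "NPROC": nproc, "UID": uid,
                         "STATE": state, "AOUT": aout, "procs": []})
            fields = fields[5:]
        if len(fields) == 4 and jobs:
            rank, pid, pstate, node = fields
            jobs[-1]["procs"].append(
                {"RANK": rank, "PID": pid, "STATE": pstate, "NODE": node})
    return jobs


jobs = parse_mpps_p(SAMPLE)
for proc in jobs[0]["procs"]:
    print(jobs[0]["JID"], proc["RANK"], proc["NODE"])
```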

Displaying Specific Process and Job Information

You can also use the -P option to display one or more specific process values and the -J option to display one or more job values. Separate multiple values either with spaces or with commas and no spaces.

Arguments to -P are

rank - the rank of the process within the job.
pid - the process's process ID.
state - the current execution state of the process.
iod - the process ID of the I/O daemon for this process.
load - the load on the node on which the process is executing.
node - the name of the node on which the process is executing.

You can list these via the -lp option.

Arguments to -J are

part - the name of the partition in which the job will run.
jid - the job's unique ID, which can be used as an argument to mpkill.

nproc - the number of processes requested (the actual number of processes started may differ if the -W or -S flags were used with mprun).

uid - the user on whose behalf the job will be run (normally the user who submitted the job; see the -U flag to mprun for details).

gid - the group on whose behalf the job will be run (normally the group of the user who submitted the job; see the -G flag to mprun for details).

state - the job's state, which is one of the following:
- BUILD - The job is being submitted.
- WAIT - The job is waiting to run.
- SPAWN - The job is preparing to run.
- RUN - The job is running.
- RSTRT - The job has been killed because one of the nodes on which it was running went down; the job will be restarted.

running - the number of processes actually running for this job. This is not always equal to the number of processes started for this job, since processes that have exited are not counted.

wkdir - the directory in which the job's processes will be (or were) started.

aout - the name of the program to be run.

paout - the full path of the program to be run.

ctime - the job creation time (when mprun was invoked for the job).

args - the command-line arguments for the program to be run.

stime - the time the job was started.

prio - the job priority (higher numbers run first).
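The following Python sketch shows one way to build the -J and -P arguments in the comma-separated form and to catch misspelled value names before mpps is run. The value names are copied from the lists above; the helper name mpps_select_argv is hypothetical, and the sketch assumes that -J and -P can be given on the same command line.

```python
# Value names copied from the -P and -J lists above.
PROCESS_VALUES = {"rank", "pid", "state", "iod", "load", "node"}
JOB_VALUES = {"part", "jid", "nproc", "uid", "gid", "state", "running",
              "wkdir", "aout", "paout", "ctime", "args", "stime", "prio"}


def mpps_select_argv(job_values=(), process_values=()):
    """Build an mpps argument list that selects specific values to display."""
    for value in job_values:
        if value not in JOB_VALUES:
            raise ValueError("unknown -J value: " + value)
    for value in process_values:
        if value not in PROCESS_VALUES:
            raise ValueError("unknown -P value: " + value)
    argv = ["mpps"]
    if job_values:
        # Multiple values are joined with commas and no spaces.
        argv += ["-J", ",".join(job_values)]
    if process_values:
        argv += ["-P", ",".join(process_values)]
    return argv


print(mpps_select_argv(job_values=["jid", "state", "aout"],
                       process_values=["rank", "node"]))
```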

`mpinfo`: Configuration and Status

Use the mpinfo command to display information about the configuration of partitions and nodes, and status information about nodes.

Overview

You can display information on all partitions or nodes, or on any subset of them. You can either list the partitions or nodes, or you can use the -R option, along with a resource requirement specifier (RRS), to have the CRE determine which objects should be displayed. See "Expressing More Complex Resource Requirements" for information on RRSs. If you specify a partition, you must include only partition attributes in the RRS; if you specify a node, you must use only node attributes.

Use the -A option to specify an attribute whose value you want to display. If you want to display more than one attribute, separate them by commas with no spaces. Alternatively, you can issue multiple -A options on the same command line. If you omit -A, mpinfo displays values for a default set of attributes.

Use the -v option to display information about all attributes for one or more partitions or nodes. These include attributes defined by the system administrator.

When a Boolean attribute is displayed, yes indicates that the attribute is set, and no indicates that the attribute is not set.

Partitions

Use the -P option to display information for all partitions.

Use the -p option, followed by the name of the partition, to display information about an individual partition. To display information about multiple partitions, list the names, either separating them with commas and no spaces or enclosing the list in quotation marks.

Partition attributes whose settings you can view via mpinfo are shown in Table 4-2; the heading displayed for each attribute is shown in parentheses after its description.

The following summarizes various points discussed earlier.

You can specify one or more of these attributes via the -A option, or as part of an RRS as an argument to the -R option. You can use either the attribute's real name or, in some cases, a shorter version.

For attributes that are defined as negatives (for example, no_logins), you can specify a positive version (for example, logins) for -A.

You can list the settings of all attributes (including any system administrator-defined attributes) on a per-partition basis via the -v option.

You can list the names and brief descriptions of these attributes via the -lp option.

Table 4-2 Partition attributes available via mpinfo


Attribute (`mpadmin` form)	Description (`mpinfo` output heading)
`enabled`	Set if the partition is enabled, that is, if it is ready to accept jobs (ENA).
`maxt`	Maximum number of simultaneously running processes allowed on each node of the partition (MAXT).
`name`	Name of the partition (NAME).
`login`	Allow logins. When no_logins is unset, LOG is set. Note that this is the inverse of the `mpadmin` meaning (LOG).
`mp`	Allow multinode jobs. When no_mp_jobs is unset, MP is set. Note that this is the inverse of the `mpadmin` meaning (MP).
`nodes`	Number of nodes in the partition (NODES).

The following example illustrates the default mpinfo output for partitions:

% mpinfo -P 
  NAME         NODES: Tot(cpu) Enb(cpu) Onl(cpu) ENA LOG MP
  part10                1(  4)   1(  4)   1(  4) no  yes yes
  part11                1(  4)   1(  4)   1(  4) yes yes yes

The following example displays the names, numbers of nodes, and enabled status for all partitions:

% mpinfo -A name,enabled,nodes -P
 NAME         ENA NODES: Tot(cpu) Enb(cpu) Onl(cpu)
 part10       no           1(  4)   1(  4)   1(  4)
 part11       yes          1(  4)   1(  4)   1(  4)
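The NODES columns pack two numbers into each cell, for example 1(  4). The following Python sketch extracts them from one output row. It assumes, based on the Tot(cpu), Enb(cpu), and Onl(cpu) headings, that the first number is a node count and the number in parentheses is a CPU count; the function name parse_node_counts is hypothetical.

```python
import re

# One cell such as "1(  4)": a count, then a second count in parentheses.
COUNT = re.compile(r"(\d+)\(\s*(\d+)\)")


def parse_node_counts(row):
    """Return [(nodes, cpus), ...] for the Tot, Enb, and Onl cells of one row."""
    return [(int(nodes), int(cpus)) for nodes, cpus in COUNT.findall(row)]


# Row copied from the default mpinfo -P example above.
row = "  part10                1(  4)   1(  4)   1(  4) no  yes yes"
total, enabled, online = parse_node_counts(row)
print(total, enabled, online)
```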

Nodes

Use the -N option to display information about all nodes.

Use the -n option, followed by the name(s) of one or more nodes, to display information about specific nodes. When listing multiple node names, separate the names with commas and no spaces.

The following table shows the node attributes that you can display via mpinfo. The heading that is displayed for each attribute is shown in parentheses at the end of each description.

Note these points:

You can specify one or more of these attributes via the -A option, or as part of an RRS as an argument to the -R option. You can use either the attribute's real name or, in some cases, a shorter version.

You can list the settings of all attributes (including any system administrator-defined attributes) on a per-node basis via the -v option.

You can list the names and brief descriptions of these attributes via the -ln option.

Table 4-3 Node attributes available via mpinfo


Attribute	Short Form	Description (`mpinfo` output heading)
`cpu_idle`	`idle`	Percent of time CPU is idle (IDLE).
`cpu_iowait`	`iowait`	Percent of time CPU spends waiting for I/O (IWAIT).
`cpu_kernel`	`kernel`	Percent of time CPU spends in kernel (KERNL).
`cpu_swap`	`swap`	Percent of time CPU spends waiting for swap (SWAP).
`cpu_type`	`cpu`	CPU architecture (CPU).
`cpu_user`	`user`	Percent of time CPU spends running user's program (USER).
`domain`		DNS domain.
`enabled`		If set, node is available for spawning jobs on it.
`load1`		Load average for the past minute (LOAD1).
`load5`		Load average for the past five minutes (LOAD5).
`load15`		Load average for the past 15 minutes (LOAD15).
`manufacturer`	`manuf`	Hardware manufacturer (MANUFACTURER).
`mem_free`	`memf`	Node's available RAM (in Mbytes) (FMEM).
`mem_total`	`memr`	Node's total physical memory (in Mbytes) (MEM).
`name`		Name of the node (NAME).
`ncpus`	`ncpu`	Number of CPU modules in the node (NCPU).
`os_arch_kernel`	`mach`	Node's kernel architecture (MACH).
`os_max_proc`	`maxproc`	Maximum number of processes allowed on the node (note that this is all processes, including cluster daemons) (MPROC).
`os_name`	`os`	Name of the operating system running on the node (OS).
`os_release`	`osrel`	Operating system's release number (OSREL).
`os_release_maj`	`osmaj`	The major number of the operating system release number (MAJ).
`os_release_min`	`osmin`	The minor number of the operating system release number (MIN).
`os_version`	`osver`	Operating system's version (OSVER).
`partition`		The partition of which the node is a member (PARTITION).
`serial_number`	`serno`	Hardware serial number (SERIAL).
`swap_free`	`swapf`	Node's available swap space (in Mbytes) (FSWP).
`swap_total`	`swapr`	Node's total swap space (in Mbytes) (SWAP).

The following is an example of the mpinfo output for nodes:

% mpinfo -N
NAME  UP PARTITION OS    OSREL NCPU FMEM   FSWP    LOAD1 LOAD5 LOAD15
node0 y  p0        SunOS 5.6   1    0.89   158.34  0.09  0.11  0.13
node1 y  p0        SunOS 5.6   1    31.41  276.12  0.00  0.01  0.01
node2 y  p1        SunOS 5.6   1    25.59  279.77  0.00  0.00  0.01
node3 y  p1        SunOS 5.6   1    25.40  279.88  0.00  0.00  0.01

The following example shows only the names of nodes and the partition they're in:

% mpinfo -N -A name,partition
NAME         PARTITION
node0        part0
node1        part0
node2        part1
node3        part1
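A two-column listing like this one is convenient input for scripts. The following Python sketch groups the node names in the sample output above by partition. The function name nodes_by_partition is hypothetical, and the sketch assumes that each row holds exactly one node name and one partition name.

```python
from collections import defaultdict

# Sample text copied from the mpinfo -N -A name,partition example above.
SAMPLE = """\
NAME         PARTITION
node0        part0
node1        part0
node2        part1
node3        part1
"""


def nodes_by_partition(text):
    """Map each partition name to the list of its node names."""
    lines = [line for line in text.splitlines() if line.strip()]
    groups = defaultdict(list)
    for line in lines[1:]:  # skip the heading row
        name, partition = line.split()
        groups[partition].append(name)
    return dict(groups)


groups = nodes_by_partition(SAMPLE)
print(groups)
```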

Cluster

Use the -C option to display information about the entire cluster. For example,

% mpinfo -C
NAME   ADMINISTRATOR    DEF_INTER_PART
node0  wmitty           part0

where:

NAME - The name of the cluster

ADMINISTRATOR - The name of its administrator

DEF_INTER_PART - The default interactive partition