APPENDIX A

Troubleshooting
This appendix describes some common problem situations, resulting error messages, and suggestions for fixing the problems. Open MPI error reporting, including I/O, follows the MPI-2 Standard. By default, errors are reported in the form of standard error classes. These classes and their meanings are listed in TABLE A-1 (for non-I/O MPI) and TABLE A-2 (for MPI I/O), and are also available on the MPI man page.
Listed below are the error return classes you might encounter in your MPI programs. Error values can also be found in mpi.h (for C), mpif.h (for Fortran), and mpi++.h (for C++).
Open MPI I/O error reporting follows the MPI-2 Standard. By default, errors are reported in the form of standard error codes (found in /opt/SUNWhpc/include/mpi.h). Error classes and their meanings are listed in TABLE A-2. They can also be found in mpif.h (for Fortran) and mpi.h (for C).
You can change the default error handler by calling MPI_File_set_errhandler() with MPI_FILE_NULL as the file handle, even if no file is currently open. You can also use the same routine to change the error handler for a specific file.
If your application tries to open a file descriptor when the maximum limit of open file descriptors has been reached, the job will fail and display the following message:
Should this occur, do as the message says and try --mca opal_set_max_sys_limits 1. Alternatively, increase the number of available file descriptors.
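For example, the MCA parameter can be added to the job launch as in the following sketch (the process count and program name are placeholders, not from this document; the echo keeps the sketch safe to run where Open MPI is not installed — drop it to launch for real):

```shell
# Sketch: pass the opal_set_max_sys_limits MCA parameter to mpirun.
# -np 16 and ./my_mpi_program are hypothetical placeholders.
echo mpirun --mca opal_set_max_sys_limits 1 -np 16 ./my_mpi_program
```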
The Solaris OS default file descriptor limit is 256. When you start an MPI job, a program called an orted (for ORTE daemon) spawns the user processes. For each user process spawned, the orted takes up four file descriptors. In addition, the job takes 12 additional file descriptors regardless of the number of processes spawned.
To calculate the number of file descriptors needed to run a certain job, use the following formula:
file descriptors = 12 + 4 * np
where np is the number of processes launched.
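Worked through in shell arithmetic, the formula gives, for example:

```shell
# File descriptors needed for an np-process job: 12 + 4 * np
np=32
fds=$((12 + 4 * np))
echo "np=$np needs $fds file descriptors"   # np=32 needs 140 file descriptors
# With the Solaris default limit of 256, the formula is first exceeded at
# np=62: 12 + 4*61 = 256 still fits, but 12 + 4*62 = 260 does not.
```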
If the number of file descriptors needed is greater than 256, you must increase the number of available descriptors to a value equal to or greater than the number you calculated. Otherwise, the processes fail and the error message is displayed.
In a C shell:

1. Log in to a C shell as superuser.
2. Determine the current hard limit value for your Solaris implementation. Type the following command:

In a Bourne shell:

1. Log in to a Bourne shell as superuser.
2. Use the ulimit command. Type the following command:
Each command returns the file descriptor hard limit that was in effect. The new value you set for the number of available file descriptors must be less than or equal to this number. The usual default value for the hard limit in the Solaris OS is 64000 (64K).
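A sketch of these queries, assuming the standard shell builtins:

```shell
# Bourne shell: print the hard limit on open file descriptors.
ulimit -Hn
# C shell equivalent (run under csh/tcsh):  limit -h descriptors
```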
Note - You must perform this procedure on each of the nodes on which you plan to run.
1. Open the /etc/system file in a text editor.
2. Add the following line to the file:
where value is the new maximum number of file descriptors. For example, the following line added to the /etc/system file increases the maximum number of file descriptors to 1024:
3. Save the file and exit the text editor.
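A sketch of the resulting /etc/system entry, assuming the tunable involved is rlim_fd_max (the standard Solaris parameter for the file descriptor hard limit; this is an assumption, as the exact line is not reproduced here):

```
set rlim_fd_max=1024
```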
If you are using Sun Grid Engine to launch your jobs on very large multi-processor nodes, you might see an error message about exceeding your file descriptor limit, and your jobs might fail. This can happen because Sun Grid Engine cannot set the file descriptor limit in its queue.
There are three ways in which you can adjust the number of available file descriptors when you use Sun Grid Engine:
1. Set the file descriptor limit in your login shell (.cshrc, .tcshrc, .bashrc, and so on).
2. Modify the /etc/system file for each of the nodes on your cluster, as described in the previous section, To Increase the Number of File Descriptors. Remember that you must reboot all of the nodes in the cluster once you have finished modifying the files.
3. On a Sun Grid Engine execution host, modify the $SGE_ROOT/default/common/sgeexecd startup script to increase the file descriptor limit to the same value as the hard limit (as described in the previous section), and then restart the sgeexecd daemon on the host. Because this script is shared among the Sun Grid Engine execution hosts in the cluster over NFS, you can make the change on one host and it will propagate to the other Sun Grid Engine hosts in the cluster.
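Option 1 above might be sketched as follows in the shell startup files (the value 1024 is illustrative; it must not exceed the hard limit on your system):

```shell
# ~/.profile or ~/.bashrc (Bourne-compatible shells): raise the soft
# limit on open file descriptors.
ulimit -Sn 1024
# ~/.cshrc or ~/.tcshrc (csh/tcsh) equivalent:
#   limit descriptors 1024
```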
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.