APPENDIX A

Troubleshooting

This appendix describes some common problems, the error messages they produce, and suggestions for fixing them. Open MPI error reporting, including I/O, follows the MPI-2 Standard. By default, errors are reported in the form of standard error classes. These classes and their meanings are listed in TABLE A-1 (for non-I/O MPI) and TABLE A-2 (for MPI I/O), and are also available on the MPI man page.


MPI Messages

Standard Error Classes

Listed below are the error return classes you might encounter in your MPI programs. Error values can also be found in mpi.h (for C), mpif.h (for Fortran), and mpi++.h (for C++).


TABLE A-1 Open MPI Standard Error Classes

Error Class             Value   Meaning
----------------------  ------  ------------------------------------
MPI_SUCCESS             0       Successful return code.
MPI_ERR_BUFFER          1       Invalid buffer pointer.
MPI_ERR_COUNT           2       Invalid count argument.
MPI_ERR_TYPE            3       Invalid datatype argument.
MPI_ERR_TAG             4       Invalid tag argument.
MPI_ERR_COMM            5       Invalid communicator.
MPI_ERR_RANK            6       Invalid rank.
MPI_ERR_ROOT            7       Invalid root.
MPI_ERR_GROUP           8       Null group passed to function.
MPI_ERR_OP              9       Invalid operation.
MPI_ERR_TOPOLOGY        10      Invalid topology.
MPI_ERR_DIMS            11      Illegal dimension argument.
MPI_ERR_ARG             12      Invalid argument.
MPI_ERR_UNKNOWN         13      Unknown error.
MPI_ERR_TRUNCATE        14      Message truncated on receive.
MPI_ERR_OTHER           15      Other error; use MPI_Error_string().
MPI_ERR_INTERN          16      Internal error code.
MPI_ERR_IN_STATUS       17      Look in status for error value.
MPI_ERR_PENDING         18      Pending request.
MPI_ERR_REQUEST         19      Illegal MPI_Request handle.
MPI_ERR_KEYVAL          36      Illegal key value.
MPI_ERR_INFO            37      Invalid info object.
MPI_ERR_INFO_KEY        38      Illegal info key.
MPI_ERR_INFO_NOKEY      39      No such key.
MPI_ERR_INFO_VALUE      40      Illegal info value.
MPI_ERR_TIMEDOUT        41      Timed out.
MPI_ERR_SYSRESOURCES    42      Out of resources.
MPI_ERR_SPAWN           45      Error spawning.
MPI_ERR_WIN             46      Invalid window.
MPI_ERR_BASE            47      Invalid base.
MPI_ERR_SIZE            48      Invalid size.
MPI_ERR_DISP            49      Invalid displacement.
MPI_ERR_LOCKTYPE        50      Invalid locktype.
MPI_ERR_ASSERT          51      Invalid assert.
MPI_ERR_RMA_CONFLICT    52      Conflicting accesses to window.
MPI_ERR_RMA_SYNC        53      Erroneous RMA synchronization.
MPI_ERR_NO_MEM          54      Memory exhausted.
MPI_ERR_LASTCODE        55      Last error code.



MPI I/O Error Handling

Open MPI I/O error reporting follows the MPI-2 Standard. By default, errors are reported in the form of standard error codes, which are defined in /opt/SUNWhpc/include/mpi.h (for C) and mpif.h (for Fortran). Error classes and their meanings are listed in TABLE A-2.

To change the default error handler, call the routine MPI_File_set_errhandler() with MPI_FILE_NULL as the file handle. This works even if no file is currently open. To change the error handler for a specific file, pass that file's handle to the same routine.


TABLE A-2 Open MPI I/O Error Classes

Error Class                    Value   Meaning
-----------------------------  ------  ------------------------------------------------------
MPI_ERR_FILE                   20      Bad file handle.
MPI_ERR_NOT_SAME               21      Collective argument not identical on all processes.
MPI_ERR_AMODE                  22      Unsupported amode passed to open.
MPI_ERR_UNSUPPORTED_DATAREP    23      Unsupported datarep passed to MPI_File_set_view().
MPI_ERR_UNSUPPORTED_OPERATION  24      Unsupported operation, such as seeking on a file that
                                       supports only sequential access.
MPI_ERR_NO_SUCH_FILE           25      File (or directory) does not exist.
MPI_ERR_FILE_EXISTS            26      File exists.
MPI_ERR_BAD_FILE               27      Invalid file name (for example, path name too long).
MPI_ERR_ACCESS                 28      Permission denied.
MPI_ERR_NO_SPACE               29      Not enough space.
MPI_ERR_QUOTA                  30      Quota exceeded.
MPI_ERR_READ_ONLY              31      Read-only file system.
MPI_ERR_FILE_IN_USE            32      File operation could not be completed, as the file is
                                       currently open by some process.
MPI_ERR_DUP_DATAREP            33      Conversion functions could not be registered because a
                                       data representation identifier that was already
                                       defined was passed to MPI_REGISTER_DATAREP.
MPI_ERR_CONVERSION             34      An error occurred in a user-supplied data-conversion
                                       function.
MPI_ERR_IO                     35      I/O error.
MPI_ERR_INFO                   37      Invalid info object.
MPI_ERR_INFO_KEY               38      Illegal info key.
MPI_ERR_INFO_NOKEY             39      No such key.
MPI_ERR_INFO_VALUE             40      Illegal info value.
MPI_ERR_LASTCODE               55      Last error code.



Exceeding the File Descriptor Limit

If your application tries to open a file descriptor when the maximum limit of open file descriptors has been reached, the job will fail and display the following message:


% mpirun -np 64 foo -v
% ORTE_ERROR_LOG: The system limit on number of pipes a process can open was 
reached in file base/iof_base_setup.c at line 115
% ORTE_ERROR_LOG: The system limit on number of pipes a process can open was 
reached in file odls_default_module.c at line 233
% The system limit on number of network connections a process can open was reached 
in file oob_tcp.c at line 447
--------------------------------------------------------------------------
Error: system limit exceeded on number of network connections that can be open
 
This can be resolved by setting the mca parameter opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--------------------------------------------------------------------------

If this occurs, follow the suggestion in the message and rerun the job with --mca opal_set_max_sys_limits 1 on the mpirun command line. Alternatively, increase the number of available file descriptors.

The Solaris OS default file descriptor limit is 256. When you start an MPI job, a daemon called orted (the ORTE daemon) spawns the user processes. For each user process spawned, the orted uses four file descriptors. The job also uses 12 file descriptors regardless of the number of processes spawned.

To calculate the number of file descriptors needed to run a certain job, use the following formula:

file descriptors = 12 + 4 * np

where np is the number of processes launched.

If the number of file descriptors needed is greater than 256, you must increase the number of available descriptors to a value equal to or greater than the number you calculated. Otherwise, the processes fail and the error message is displayed.
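The formula above can be checked with a few lines of C. The following is a minimal sketch, not part of Open MPI: the function names fd_needed, fd_limit_ok, and max_np_for_limit are hypothetical, and the constants 12 and 4 are taken directly from the formula in this section.

```c
#include <assert.h>

/* Descriptors needed for a job of np processes, per the formula above:
 * a fixed 12 for the job plus 4 for each spawned process. */
long fd_needed(long np)
{
    assert(np >= 0);
    return 12 + 4 * np;
}

/* Returns 1 if a job of np processes fits within `limit` descriptors,
 * 0 otherwise. A job that needs exactly `limit` descriptors still fits. */
int fd_limit_ok(long np, long limit)
{
    return fd_needed(np) <= limit;
}

/* Largest np that fits within `limit` descriptors (0 if none does). */
long max_np_for_limit(long limit)
{
    return limit < 12 ? 0 : (limit - 12) / 4;
}
```

For the 64-process run shown earlier, fd_needed(64) is 268, which exceeds the default limit of 256. With that default, the formula allows at most 61 processes.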

Increasing the Number of Available File Descriptors


To View the Hard Limit from the C Shell

1. Log in to a C shell as superuser.

2. Determine the current hard limit value for your Solaris implementation. Type the following command:


# limit -h descriptors


To View the Hard Limit from the Bourne Shell

1. Log in to a Bourne shell as superuser.

2. Use the ulimit command. Type the following command:


# ulimit -Hn

Each command returns the file descriptor hard limit currently in effect. The new value you set for the number of available file descriptors must be less than or equal to this number. The usual default value for the hard limit in the Solaris OS is 65536 (64K).
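The same limits can also be read from inside a program with the POSIX getrlimit() call and the RLIMIT_NOFILE resource. The following is a minimal sketch; the helper names get_fd_limits and fits_under_hard_limit are hypothetical and are not part of Open MPI or the Solaris OS.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Fetch the soft (current) and hard (maximum) limits on open file
 * descriptors for the calling process.
 * Returns 0 on success, -1 if getrlimit() fails. */
int get_fd_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    assert(soft != NULL && hard != NULL);
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}

/* Returns 1 if `wanted` descriptors could be granted without raising
 * the hard limit, 0 otherwise (including when the query fails). */
int fits_under_hard_limit(rlim_t wanted)
{
    rlim_t soft, hard;

    if (get_fd_limits(&soft, &hard) != 0)
        return 0;
    return hard == RLIM_INFINITY || wanted <= hard;
}
```

A value that fits under the hard limit can be set without superuser privileges. A larger value requires raising the hard limit first.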


To Increase the Number of File Descriptors



Note - You must perform this procedure on each of the nodes on which you plan to run.


1. Open the /etc/system file in a text editor.

2. Add the following line to the file:


 

set rlim_fd_cur=value

where value is the new maximum number of file descriptors. For example, the following line added to the /etc/system file increases the maximum number of file descriptors to 1024:


 

set rlim_fd_cur=1024

3. Save the file and exit the text editor.

4. Reboot the system.

Setting File Descriptor Limits When Using Sun Grid Engine

If you are using Sun Grid Engine to launch your jobs on very large multi-processor nodes, you might see an error message about exceeding your file descriptor limit, and your jobs might fail. This can happen because Sun Grid Engine cannot set the file descriptor limit in its queue.

There are three ways in which you can adjust the number of available file descriptors when you use Sun Grid Engine:

1. Set the file descriptor limit in your login shell (.cshrc, .tcshrc, .bashrc, and so on).

2. Modify the /etc/system file for each of the nodes on your cluster as described in the previous section, To Increase the Number of File Descriptors. Remember that you must reboot all of the nodes in the cluster once you have finished modifying the files.

3. On a Sun Grid Engine execution host, modify the $SGE_ROOT/default/common/sgeexecd startup script to increase the file descriptor limit to the same value as the hard limit (as described in the previous section). You must restart the sgeexecd daemon on the host. Since this script is shared among the Sun Grid Engine execution hosts in the cluster using NFS, you may make the change on one host, and it will be propagated to the other Sun Grid Engine hosts in the cluster.
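Options 1 and 3 both raise the soft descriptor limit, up to the hard limit, before the job's processes start. For illustration only, the same operation can be sketched in C with the POSIX getrlimit() and setrlimit() calls. The function name raise_fd_soft_limit is hypothetical. The sketch assumes that the hard limit is already high enough and does not attempt to change it.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the soft descriptor limit of the calling process to at least
 * `wanted`, leaving the hard limit unchanged.
 * Returns 0 on success or if the soft limit is already high enough,
 * -1 if the hard limit is too low or a system call fails. */
int raise_fd_soft_limit(rlim_t wanted)
{
    struct rlimit rl;

    assert(wanted > 0);                /* a zero limit is not useful */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= wanted)
        return 0;                      /* nothing to do */
    if (rl.rlim_max != RLIM_INFINITY && wanted > rl.rlim_max)
        return -1;                     /* would need a higher hard limit */

    rl.rlim_cur = wanted;              /* only the soft limit changes */
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

The new soft limit applies to the calling process and to any processes it starts afterward, which is why the shell startup file and the sgeexecd startup script are effective places to set it.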