APPENDIX A

Troubleshooting
This appendix describes some common problem situations, resulting error messages, and suggestions for fixing the problems. Open MPI error reporting, including I/O, follows the MPI-2 Standard. By default, errors are reported in the form of standard error classes. These classes and their meanings are listed in TABLE A-1 (for non-I/O MPI) and TABLE A-2 (for MPI I/O), and are also available on the MPI man page.
Listed below are the error return classes you might encounter in your MPI programs. Error values can also be found in mpi.h (for C), mpif.h (for Fortran), and mpi++.h (for C++).
Open MPI I/O error reporting follows the MPI-2 Standard. By default, errors are reported in the form of standard error codes (found in /opt/SUNWhpc/include/mpi.h). Error classes and their meanings are listed in TABLE A-2. They can also be found in mpif.h (for Fortran) and mpi.h (for C).
You can change the default error handler by calling MPI_File_set_errhandler() with MPI_FILE_NULL as the file handle, even if no file is currently open. You can also use the same routine to change the error handler for a specific file.
If your application tries to open a file descriptor when the maximum limit of open file descriptors has been reached, the job will fail and display the following message:
Should this occur, do as the message says and try --mca opal_set_max_sys_limits 1. Alternatively, increase the number of available file descriptors.
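For example, the MCA parameter can be added to the job launch as in the following sketch (the process count and program name are placeholders, not from this document; the echo keeps the sketch safe to run where Open MPI is not installed — drop it to launch for real):

```shell
# Sketch: pass the opal_set_max_sys_limits MCA parameter to mpirun.
# -np 16 and ./my_mpi_program are hypothetical placeholders.
echo mpirun --mca opal_set_max_sys_limits 1 -np 16 ./my_mpi_program
```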
The Solaris OS default file descriptor limit is 256. When you start an MPI job, a program called an orted (for ORTE daemon) spawns the user processes. For each user process spawned, the orted takes up four file descriptors. In addition, the job takes 12 additional file descriptors regardless of the number of processes spawned.
To calculate the number of file descriptors needed to run a certain job, use the following formula:
file descriptors = 12 + 4 * np
where np is the number of processes launched.
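Worked through in shell arithmetic, the formula gives, for example:

```shell
# File descriptors needed for an np-process job: 12 + 4 * np
np=32
fds=$((12 + 4 * np))
echo "np=$np needs $fds file descriptors"   # np=32 needs 140 file descriptors
# With the Solaris default limit of 256, the formula is first exceeded at
# np=62: 12 + 4*61 = 256 still fits, but 12 + 4*62 = 260 does not.
```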
If the number of file descriptors needed is greater than 256, you must increase the number of available descriptors to a value equal to or greater than the number you calculated. Otherwise, the processes fail and the error message is displayed.
In a C shell:

1. Log in to a C shell as superuser.
2. Determine the current hard limit value for your Solaris implementation. Type the following command:

In a Bourne shell:

1. Log in to a Bourne shell as superuser.
2. Use the ulimit command. Type the following command:
Each command returns the file descriptor hard limit that was in effect. The new value you set for the number of available file descriptors must be less than or equal to this number. The usual default value for the hard limit in the Solaris OS is 64000 (64K).
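A sketch of these queries, assuming the standard shell builtins:

```shell
# Bourne shell: print the hard limit on open file descriptors.
ulimit -Hn
# C shell equivalent (run under csh/tcsh):  limit -h descriptors
```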
Note - You must perform this procedure on each of the nodes on which you plan to run.
1. Open the /etc/system file in a text editor.
2. Add the following line to the file:
where value is the new maximum number of file descriptors. For example, the following line added to the /etc/system file increases the maximum number of file descriptors to 1024:
3. Save the file and exit the text editor.
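A sketch of the resulting /etc/system entry, assuming the tunable involved is rlim_fd_max (the standard Solaris parameter for the file descriptor hard limit; this is an assumption, as the exact line is not reproduced here):

```
set rlim_fd_max=1024
```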
If you are using Sun Grid Engine to launch your jobs on very large multi-processor nodes, you might see an error message about exceeding your file descriptor limit, and your jobs might fail. This can happen because Sun Grid Engine cannot set the file descriptor limit in its queue.
There are three ways in which you can adjust the number of available file descriptors when you use Sun Grid Engine:
1. Set the file descriptor limit in your login shell (.cshrc, .tcshrc, .bashrc, and so on).
2. Modify the /etc/system file for each of the nodes on your cluster, as described in the previous section, To Increase the Number of File Descriptors. Remember that you must reboot all of the nodes in the cluster once you have finished modifying the files.
3. On a Sun Grid Engine execution host, modify the $SGE_ROOT/default/common/sgeexecd startup script to increase the file descriptor limit to the same value as the hard limit (as described in the previous section), and then restart the sgeexecd daemon on the host. Because this script is shared among the Sun Grid Engine execution hosts in the cluster over NFS, you can make the change on one host and it will propagate to the other Sun Grid Engine hosts in the cluster.
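Option 1 above might be sketched as follows in the shell startup files (the value 1024 is illustrative; it must not exceed the hard limit on your system):

```shell
# ~/.profile or ~/.bashrc (Bourne-compatible shells): raise the soft
# limit on open file descriptors.
ulimit -Sn 1024
# ~/.cshrc or ~/.tcshrc (csh/tcsh) equivalent:
#   limit descriptors 1024
```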
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.