The following sections describe product irregularities that were discovered during testing but too late to be fixed or documented.
This Sun N1 Grid Engine 6.1 software release has the following limitations:
Sun N1 Grid Engine 6.1 Update 5 – When the installation is started as root and you choose an administrative user that is different from the owner of the $SGE_ROOT directory, the installation fails when creating the cluster name.
Workaround – Before you start the installation, change the owner of the $SGE_ROOT directory to the administrative user that you want to use. For example, if the $SGE_ROOT directory is /sge and you want to use the administrative user sgeadmin, use the following command:
# chown sgeadmin /sge
After the ownership is changed, sgeadmin is suggested as the administrative user during the installation. Just accept that suggestion.
The stack size for sge_qmaster should be set to 16 MB. sge_qmaster might not run with the default stack size on the following architectures: IBM AIX and HP-UX 11.
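On systems where the default soft limit is lower, the limit can be raised in the shell that starts the daemon. A minimal sketch using the shell's ulimit built-in (16384 KB equals the 16 MB recommended above; the hard limit must permit the value):

```shell
# Raise the soft stack size limit to 16 MB (ulimit -s takes KB) for this
# shell and every process it starts, including sge_qmaster.
ulimit -s 16384
ulimit -s   # verify the new soft limit
```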
You should set a high file descriptor limit in the kernel configuration on hosts that are designated to run the sge_qmaster daemon, and you might want to set a high limit on the shadow master hosts as well. A large number of available file descriptors enables the communication system to keep connections open instead of constantly closing and reopening them. If you have many execution hosts, a high file descriptor limit significantly improves performance. Set the limit to a number that is higher than the number of intended execution hosts, and leave room for concurrent client requests, in particular for jobs submitted with qsub -sync or for DRMAA sessions that maintain a steady communication connection with the master daemon. Refer to your operating system documentation for information about how to set the file descriptor limit.
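How the limit is raised is operating-system specific, and kernel-level hard limits belong in the OS configuration (for example, /etc/system on Solaris). As a sketch, the shell that starts sge_qmaster can at least raise its own soft limit up to the configured hard limit:

```shell
# Show the hard and soft file descriptor limits for this shell
ulimit -Hn
ulimit -n
# Raise the soft limit to the hard limit before starting sge_qmaster;
# the hard limit itself should be set above your execution host count
# plus expected concurrent client requests.
ulimit -n "$(ulimit -Hn)"
ulimit -n   # verify
```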
The number of concurrent dynamic event clients is limited by the number of file descriptors; the default limit is 99. Dynamic event clients are jobs submitted with the qsub -sync command and DRMAA sessions. You can limit the number of dynamic event clients with the qmaster_params global cluster configuration setting. Set this parameter to MAX_DYN_EC=n. See the sge_conf(5) man page for more information.
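As a sketch, the parameter is set in the global cluster configuration; MAX_DYN_EC=500 below is a hypothetical value that must stay below the file descriptor limit:

```shell
# Open the global cluster configuration in an editor (requires manager
# privileges) and add or extend the qmaster_params line, for example:
qconf -mconf global
#
#   qmaster_params    MAX_DYN_EC=500
#
# Review the active setting afterwards:
qconf -sconf global
```

This is a configuration fragment and requires a running Grid Engine cluster.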
The ARCo module is available only for the Solaris Sparc, Solaris Sparc 64 bit, Solaris x86, Solaris x64, Linux x86, and Linux 64 bit kernels.
Only a limited set of predefined queries is currently shipped with ARCo. Later releases will include more comprehensive sets of predefined queries.
Jobs requesting the amount INFINITY for resources are not handled correctly with respect to resource reservation. INFINITY might be requested by default when no explicit request for a certain resource has been made. It is therefore important to explicitly request all resources that should be taken into account for resource reservation.
Resource reservation currently takes only pending jobs into account. Consequently, jobs that are in a hold state due to the submit options -a time and -hold_jid joblist, and are thus not pending, do not get reservations. Such jobs are treated as if the -R n submit option were specified for them.
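Taken together with the previous item, a submission that should benefit from resource reservation therefore enables reservation and names each relevant resource explicitly; the resource names and values below are illustrative:

```shell
# Enable reservation (-R y) and request every relevant resource explicitly,
# so that no request defaults to INFINITY; values are hypothetical.
qsub -R y -l h_rt=3600,h_vmem=2G myjob.sh
```

This is a submission fragment and requires a running Grid Engine cluster.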
Berkeley DB requires that the database files reside on a local disk unless sge_qmaster runs on Solaris 10 and uses an NFSv4 mount. (Fully NFSv4-compliant clients and servers from other vendors are also supported, but have not yet been tested.) If sge_qmaster cannot run on the file server intended to store the spooling data (for example, if you want to use the shadow master facility), a Berkeley DB RPC server can be used. The RPC server runs on the file server and provides the connection between Berkeley DB and the sge_qmaster instance. However, the Berkeley DB RPC server uses an insecure protocol for this communication, so it presents a security problem. Do not use the RPC server method if you are concerned about security at your site. Instead, spool to disks that are local to sge_qmaster and, for fail-over, use a high availability solution such as Sun Cluster, which maintains host-local file access in the fail-over case.
QMON can become busy when large array task numbers are used. In that case, enable “compact job array display” in the QMON Job Control dialog box customization. Otherwise, the QMON GUI causes high CPU load and shows poor performance.
The automatic installation option does not provide full diagnostic information in case of installation failures. If the installation process aborts, check for the presence and the contents of an installation log file in qmaster-spool-dir/install_hostname_timestamp.log or in /tmp/install.pid.
On IBM AIX, HP-UX 11, and SGI IRIX 6.5 systems, two different binaries are provided for sge_qmaster, spooldefaults, and spoolinit: one for the Berkeley DB spooling method and one for the classic spooling method. The binaries are named binary.spool_db and binary.spool_classic.
To change to the desired spooling method, modify three symbolic links before you install the master host. Do the following:
# cd sge-root/bin/arch
# rm sge_qmaster
# ln -s sge_qmaster.spool_classic sge_qmaster
# cd sge-root/utilbin/arch
# rm spooldefaults spoolinit
# ln -s spooldefaults.spool_classic spooldefaults
# ln -s spoolinit.spool_classic spoolinit
The default Mac OS X installation does not include the OpenMotif library that QMON needs. You can get the OpenMotif library for the PowerPC and x86 architectures from various web sites, such as http://www.ist-inc.com/DOWNLOADS/openmotif_download.html. You can also find information about how to install packages that have been ported to Mac OS X at http://www.macports.org.
PDF export in ARCo requires a lot of memory. Huge reports can result in an OutOfMemoryException when they are exported to PDF.
Workaround – Increase the JVM heap size for the Sun Java Web Console. The following command sets the maximum heap size to 512 MB:
# smreg add -p java.options="... -Xmx512M ..."
Restart the Sun Java Web Console to make the change effective:
# smcwebserver restart
For DBWriter (part of ARCo), the 64-bit support of the Java Virtual Machine must be installed on the Solaris Sparc 64-bit, Solaris x64, and Linux 64-bit kernels.
When you use Java bindings with DRMAA, verify that the LD_LIBRARY_PATH is set correctly.
If you are using a 32-bit Java Virtual Machine (JVM), you must set LD_LIBRARY_PATH to the directory that contains the 32-bit shared DRMAA library (for example, $SGE_ROOT/lib/sol-sparc), even when your application actually runs on a 64-bit operating system platform.
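For example, on Solaris SPARC the setting could look as follows; the $SGE_ROOT value shown is site specific:

```shell
# Assumption: SGE_ROOT points at the Grid Engine installation root.
SGE_ROOT=/sge; export SGE_ROOT
# Prepend the directory that holds the 32-bit DRMAA shared library,
# even when the operating system itself is 64-bit.
LD_LIBRARY_PATH=$SGE_ROOT/lib/sol-sparc${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
```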
The N1 Grid Engine 6.1 version of the drmaa.jar file is not compatible with the previous drmaa.jar file. The old drmaa.jar file has been renamed to drmaa-0.5.jar.
For a fully featured automatic installation (not using CSP), you must allow the root user to log in remotely through rsh or ssh without a password prompt. This enables the installation script to start the installation on the remote hosts. If remote login is not configured this way, you must log in to each execution host and run the automatic installation manually with the following command:
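One common way to set this up with ssh is a passphrase-less key pair for root; the host name and paths below are illustrative:

```shell
# On the master host, as root: create a key pair without a passphrase
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
# Install the public key on each designated execution host
ssh-copy-id root@exechost1
# Verify that the login no longer prompts for a password
ssh root@exechost1 true
```

This is a site configuration sketch; it requires ssh access to the remote execution hosts.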
inst_sge -x -auto <conf-file> -noremote
The installation of Services For UNIX (SFU) 3.5 requires a good administrative understanding of the Windows platform and its integration into a UNIX environment. For an overview of SFU, see Appendix A, Microsoft Services For UNIX, in Sun N1 Grid Engine 6.1 Installation Guide. You can find additional technical information and documentation about SFU on the Microsoft web site at http://www.microsoft.com/windows/sfu/default.asp.
Username mapping, NFS mounts, and hostname resolution in SFU require special attention for a successful installation of the Grid Engine execution daemon, the submit host functionality, and the integration of Windows hosts into an N1 Grid Engine cluster.
You cannot install a Windows execution host remotely with the automatic installation procedure. However, you can run the automatic installation locally on the Windows host through the inst_sge -noremote command.
You cannot submit a job from a Windows submit host as the Windows “local Administrator” to a Unix or Linux execution host. However, you can submit a job as local Administrator from Windows to Windows, and you can submit as user root from Unix or Linux to Windows, Unix, or Linux execution hosts.