Sun N1 Grid Engine 6.1 Installation Guide

Issues between IPMP and Grid Engine

The only major issue is the error messages that occur when you start the Grid Engine daemons on a machine whose main interface is part of an IPMP group. The errors arise because IPMP load balancing distributes connections across the interfaces in the group, so IP packets can show up at the receiving end as coming from a different host than the one associated with the main interface. For example, consider a machine with three interfaces named qfe0, qfe1, and qfe2, whose IP addresses are 10.1.1.1, 10.1.1.2, and 10.1.1.3 respectively. IPMP would also need an extra test address for each interface, but that requirement is ignored in this example. Each of these addresses has a hostname associated with it. The hosts table looks like the following example:


10.1.1.1 sge
10.1.1.2 sge-qfe1
10.1.1.3 sge-qfe2

The machine's hostname is sge. When a connection is established from sge to another machine, it might go out through the interface named sge, sge-qfe1, or sge-qfe2. After installation, Grid Engine recognizes only sge. When Grid Engine receives a connection request from sge-qfe2, it closes the connection because the request does not come from one of the authorized (or known) nodes.

You solve this problem by using the host_aliases file (see the sge_h_aliases man page for details). You can use this file to "tell" Grid Engine that sge, sge-qfe1, and sge-qfe2 all refer to the same machine. The host_aliases file in this case would look like this:


sge sge-qfe1 sge-qfe2
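
To see which primary name a given alias maps to in such a file, a small lookup helper can be useful. The following is a minimal sketch and not part of Grid Engine: the function name resolve_alias is hypothetical, it assumes a POSIX awk (on Solaris, nawk or /usr/xpg4/bin/awk rather than the older /usr/bin/awk), and it assumes that the first name on each line is the primary name. It also skips lines that begin with #, on the assumption that these are comments.

```shell
#!/bin/sh
# resolve_alias FILE NAME  (hypothetical helper, not a Grid Engine tool)
# Prints the first name on the line of FILE that contains NAME.
# Returns 0 if NAME was found, 1 otherwise.
resolve_alias() {
    awk -v name="$2" '
        # Assumption: lines whose first field starts with "#" are comments.
        $1 ~ /^#/ { next }
        {
            # Compare NAME against every whitespace-separated field.
            for (i = 1; i <= NF; i++) {
                if ($i == name) {
                    print $1      # first field is treated as the primary name
                    found = 1
                    exit
                }
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

With the example file above, calling resolve_alias "$SGE_ROOT/$SGE_CELL/common/host_aliases" sge-qfe2 would print sge.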

Note that if you make any changes to the $SGE_ROOT/$SGE_CELL/common/host_aliases file, you must stop and restart all running Grid Engine daemons (sge_qmaster, sge_schedd, and sge_execd). To do this, log in as root on each of your Grid Engine hosts and enter these commands:


/etc/init.d/sgemaster stop
/etc/init.d/sgeexecd stop
/etc/init.d/sgemaster start
/etc/init.d/sgeexecd start
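
Not every host necessarily has both startup scripts installed (an execution-only host, for example, may have only sgeexecd). The same stop-then-start sequence can be wrapped so that missing scripts are skipped. This is a minimal sketch under that assumption: the function name restart_sge and its optional directory argument are hypothetical and are not part of Grid Engine.

```shell
#!/bin/sh
# restart_sge [INITDIR]  (hypothetical wrapper)
# Stops, then starts, whichever of the sgemaster and sgeexecd scripts
# exist and are executable in INITDIR (default: /etc/init.d).
restart_sge() {
    initdir=${1:-/etc/init.d}
    for action in stop start; do
        for svc in sgemaster sgeexecd; do
            # Skip scripts that are not installed on this host.
            if [ -x "$initdir/$svc" ]; then
                "$initdir/$svc" "$action"
            fi
        done
    done
}
```

Run as root, restart_sge with no arguments issues the same four commands shown above on a host where both scripts are present.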