Sun Cluster Data Service for Sun Grid Engine Guide for Solaris OS

Sun Cluster HA for Sun Grid Engine Overview

Sun Grid Engine is a distributed resource management program, which runs jobs in parallel on multiple machines. To minimize the loss of work that a failure of a machine might cause, nodes in the management tier must be protected against failure. However, protection of individual execution nodes in the grid against failure is not required. Failure of an individual execution node in a grid causes only a minor loss of work.

To eliminate single points of failure in the management tier of a Sun Grid Engine system, Sun Cluster HA for Sun Grid Engine provides fault monitoring and automatic fault recovery for the following Sun Grid Engine daemons:

  • Queue master daemon (sge_qmaster)

  • Scheduling daemon (sge_schedd)

You must configure Sun Cluster HA for Sun Grid Engine as a failover service.

For conceptual information about failover data services and scalable data services, see Sun Cluster Concepts Guide for Solaris OS.

Because the management tier relies on the Sun Grid Engine file system, the NFS server that exports this file system must also be protected against failure. To eliminate single points of failure in the NFS server, use the Sun Cluster HA for NFS data service. For more information about this data service, see Sun Cluster Data Service for NFS Guide for Solaris OS.

When a Sun Grid Engine component is configured in Sun Cluster, a data service protects that component. The following table lists the data service for each component.

Table 1 Protection of Sun Grid Engine Components by Sun Cluster Data Services

Sun Grid Engine Component               Data Service
-------------------------------------   ----------------------------------
Sun Grid Engine daemons:                Sun Cluster HA for Sun Grid Engine
  • Queue master daemon (sge_qmaster)   The resource type is SUNW.gds.
  • Scheduling daemon (sge_schedd)

NFS server                              Sun Cluster HA for NFS
                                        The resource type is SUNW.nfs.
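
The mapping in Table 1 can also be written as a small lookup, which may be convenient when scripting checks against a planned configuration. The following Python sketch is illustrative only: the dictionary layout and the function name data_service_for are assumptions made for this example and are not part of Sun Cluster or Sun Grid Engine. Only the component names, data service names, and resource types come from Table 1.

```python
# Illustrative lookup that mirrors Table 1. The structure and the
# function name are assumptions for this sketch, not a Sun Cluster API.
PROTECTION = {
    "sge_qmaster": ("Sun Cluster HA for Sun Grid Engine", "SUNW.gds"),
    "sge_schedd": ("Sun Cluster HA for Sun Grid Engine", "SUNW.gds"),
    "NFS server": ("Sun Cluster HA for NFS", "SUNW.nfs"),
}


def data_service_for(component):
    """Return (data service, resource type) for a component in Table 1."""
    try:
        return PROTECTION[component]
    except KeyError:
        # Components outside Table 1, such as execution nodes, have no entry.
        raise ValueError(f"{component!r} is not listed in Table 1") from None


if __name__ == "__main__":
    for name in PROTECTION:
        service, resource_type = data_service_for(name)
        print(f"{name}: {service} ({resource_type})")
```

Both Sun Grid Engine daemons map to the same data service and resource type, while the NFS server maps to its own. A component that is not in the table raises ValueError instead of returning a guessed value.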