This chapter explains how to install and configure Sun Cluster HA for Sun Grid Engine.
Sun Grid Engine was formerly known as Sun ONE Grid Engine. In this book, references to Sun Grid Engine also apply to Sun ONE Grid Engine unless this book explicitly states otherwise.
This chapter contains the following sections.
Overview of Installing and Configuring Sun Cluster HA for Sun Grid Engine
Planning the Sun Cluster HA for Sun Grid Engine Installation and Configuration
Verifying the Installation and Configuration of Sun Grid Engine
Configuring the HAStoragePlus Resource Type to Work With Sun Cluster HA for Sun Grid Engine
Configuring Sun Cluster HA for NFS for Use With Sun Cluster HA for Sun Grid Engine
Registering and Configuring Sun Cluster HA for Sun Grid Engine
Verifying the Sun Cluster HA for Sun Grid Engine Installation and Configuration
Tuning the Sun Cluster HA for Sun Grid Engine Fault Monitors
Sun Grid Engine is a distributed resource management program, which runs jobs in parallel on multiple machines. To minimize the loss of work that a failure of a machine might cause, nodes in the management tier must be protected against failure. However, protection of individual execution nodes in the grid against failure is not required. Failure of an individual execution node in a grid causes only a minor loss of work.
To eliminate single points of failure in the management tier of a Sun Grid Engine system, Sun Cluster HA for Sun Grid Engine provides fault monitoring and automatic fault recovery for the following Sun Grid Engine daemons:
Queue master daemon
Scheduling daemon
You must configure Sun Cluster HA for Sun Grid Engine as a failover service.
For conceptual information about failover data services and scalable data services, see Sun Cluster Concepts Guide for Solaris OS.
Because the management tier relies on the Sun Grid Engine file system, the NFS server that exports this file system must also be protected against failure. To eliminate single points of failure in the NFS server, use the Sun Cluster HA for NFS data service. For more information about this data service, see Sun Cluster Data Service for NFS Guide for Solaris OS.
Each component of Sun Grid Engine has a data service that protects the component when the component is configured in Sun Cluster. See the following table.
Table 1 Protection of Sun Grid Engine Components by Sun Cluster Data Services
The following table summarizes the tasks for installing and configuring Sun Cluster HA for Sun Grid Engine and provides cross-references to detailed instructions for performing these tasks. Perform the tasks in the order that they are listed in the table.
Table 2 Tasks for Installing and Configuring Sun Cluster HA for Sun Grid Engine
Task | Instructions |
---|---|
Plan the installation | Sun Cluster HA for Sun Grid Engine Overview; Planning the Sun Cluster HA for Sun Grid Engine Installation and Configuration |
Prepare the nodes and disks | |
Install and configure Sun Grid Engine | |
Verify the Sun Grid Engine installation and configuration | Verifying the Installation and Configuration of Sun Grid Engine |
Install the Sun Cluster HA for Sun Grid Engine packages | |
Configure the HAStoragePlus resource type to work with Sun Cluster HA for Sun Grid Engine | Configuring the HAStoragePlus Resource Type to Work With Sun Cluster HA for Sun Grid Engine |
Configure Sun Cluster HA for NFS for use with Sun Cluster HA for Sun Grid Engine | Configuring Sun Cluster HA for NFS for Use With Sun Cluster HA for Sun Grid Engine |
Register and configure Sun Cluster HA for Sun Grid Engine | Registering and Configuring Sun Cluster HA for Sun Grid Engine |
Verify the Sun Cluster HA for Sun Grid Engine installation and configuration | Verifying the Sun Cluster HA for Sun Grid Engine Installation and Configuration |
Tune the Sun Cluster HA for Sun Grid Engine fault monitors | Tuning the Sun Cluster HA for Sun Grid Engine Fault Monitors |
Debug Sun Cluster HA for Sun Grid Engine | |
This section contains the information that you need to plan your Sun Cluster HA for Sun Grid Engine installation and configuration.
Before you begin, consult your Sun Grid Engine documentation for configuration restrictions and requirements that are not imposed by Sun Cluster software.
The configuration restrictions in the subsections that follow apply only to Sun Cluster HA for Sun Grid Engine.
Your data service configuration might not be supported if you do not observe these restrictions.
Do not use the Sun Grid Engine shadow daemon. The Sun Grid Engine shadow daemon provides an optional mechanism for recovery from failures. This mechanism interferes with the automatic fault recovery that Sun Cluster provides.
Do not choose the option to use a Berkeley DB spooling server. Choose either the classic spooling method or the local Berkeley DB spooling method. Currently, the Berkeley DB spooling server cannot be configured as a highly available service within the Sun Cluster framework.
Do not choose the start at boot option when installing Sun Grid Engine. To ensure that Sun Cluster HA for Sun Grid Engine can provide fault monitoring and automatic fault recovery, Sun Grid Engine must be started only by Sun Cluster.
The configuration requirements in this section apply only to Sun Cluster HA for Sun Grid Engine.
If your data service configuration does not conform to these requirements, the data service configuration might not be supported.
Use Sun Grid Engine version 6.0 or 6.1. Apply the most recent available patches to the Sun Grid Engine software.
The Sun Grid Engine management tier must run on Sun Cluster nodes. Because Sun Cluster runs only on the Solaris Operating System, the Sun Grid Engine management tier must also run on the Solaris Operating System. However, Sun Grid Engine supports other operating systems. Therefore, this requirement applies only to the management tier, not to individual execution nodes in the grid.
Ensure that enough free memory is available on the cluster nodes where you plan to run the Sun Grid Engine master.
The amount of free memory that is required on each cluster node depends on the number of jobs that are running on the grid. For example:
If 100 jobs are running, 10 Mbytes of free memory are required.
If 10,000 jobs are running, 1 Gbyte of free memory is required.
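The two figures above work out to roughly 0.1 Mbyte of free memory for each running job. The following sketch turns that ratio into a quick estimate. It assumes that the requirement scales linearly with the number of jobs, which the figures suggest but do not state, and the function name is hypothetical.

```shell
#!/bin/sh
# Rough estimate of the free memory (in Mbytes) needed on a master node.
# Assumption: linear scaling at 0.1 Mbyte per job, inferred from the two
# data points in the text (100 jobs -> 10 Mbytes, 10,000 jobs -> about 1 Gbyte).
estimate_free_mem_mb() {
    njobs=$1
    # Integer arithmetic: njobs / 10, rounded up.
    echo $(( (njobs + 9) / 10 ))
}

estimate_free_mem_mb 100      # prints 10
estimate_free_mem_mb 10000    # prints 1000
```

Treat the result as a lower bound for planning purposes and check it against the sizing guidance in your Sun Grid Engine documentation.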
Ensure that you have enough disk space in the Sun Grid Engine file system and on the local disk of each node.
The disk space requirements for each type of file or directory in the Sun Grid Engine file system are listed in the following table.
File Type or Directory Type | Required Disk Space |
---|---|
Binary files | 15 Mbytes for each architecture |
Spool directories | 30–200 Mbytes |
Installation tar file | 40 Mbytes |
On the local disk of each node, 10–20 Mbytes of disk space are required. If you are installing the Sun Grid Engine software on the local disk of a node, 15 Mbytes of disk space are additionally required for the binary files.
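As a rough illustration of the disk space figures for the Sun Grid Engine file system, the following sketch adds them up for a given number of architectures and a chosen spool size. Treating the figures as simply additive is an assumption, and the function name is hypothetical.

```shell
#!/bin/sh
# Estimate the disk space (in Mbytes) needed in the Sun Grid Engine file system.
# Usage: estimate_sge_fs_mb NUMBER_OF_ARCHITECTURES SPOOL_MBYTES
# Assumption: binary files (15 Mbytes per architecture), the spool
# directories, and the installation tar file (40 Mbytes) are all present
# at the same time.
estimate_sge_fs_mb() {
    n_arch=$1
    spool_mb=$2
    echo $(( 15 * n_arch + spool_mb + 40 ))
}

# Two architectures with the largest spool figure from the table.
estimate_sge_fs_mb 2 200    # prints 270
```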
Configure Sun Cluster HA for Sun Grid Engine as a failover data service. You cannot configure Sun Cluster HA for Sun Grid Engine as a scalable data service. For more information, see:
If you are using the Solaris 10 OS, install and configure this data service to run only in the global zone. At publication of this document, this data service is not supported in non-global zones. For updated information about supported configurations of this data service, contact your Sun service representative.
The Sun Grid Engine file system must reside on a multihost disk. This disk must be available to the other nodes in the cluster that will be used for the Sun Grid Engine administrative services.
You must use NFS to export the Sun Grid Engine file system to the noncluster nodes. The NFS server that exports this file system must also be protected against failure. To protect the NFS server against failure, use the Sun Cluster HA for NFS data service. For more information about this data service, see Sun Cluster Data Service for NFS Guide for Solaris OS.
Configure the resources for the Sun Grid Engine management tier in the same resource group as the resource for NFS. For more information, see Configuring Sun Cluster HA for NFS for Use With Sun Cluster HA for Sun Grid Engine.
The dependencies between Sun Grid Engine components are shown in the following table.
Table 3 Dependencies Between Sun Grid Engine Components
Sun Grid Engine Component | Dependency |
---|---|
Sun Grid Engine queue master daemon (sge_qmaster) | SUNW.HAStoragePlus resource |
Sun Grid Engine scheduling daemon (sge_schedd) | Sun Grid Engine queue master daemon (sge_qmaster) resource |
These dependencies are set when you register and configure Sun Cluster HA for Sun Grid Engine. For more information, see Registering and Configuring Sun Cluster HA for Sun Grid Engine.
The configuration considerations in the subsections that follow affect the installation and configuration of Sun Cluster HA for Sun Grid Engine.
You can install Sun Grid Engine on one of the following locations:
A highly available local file system
The cluster file system
For the advantages and disadvantages of placing the Sun Grid Engine binary files on a highly available local file system and the cluster file system, see Configuration Guidelines for Sun Cluster Data Services in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
To enable the type of file system to be identified from the mount point, use a prefix that indicates the type of file system as follows:
For mount points on a highly available local file system, use the /local prefix.
For mount points on the cluster file system, use the /global prefix.
The optimum distribution of spool directories and binary files among file systems depends on the grid configuration. See the following table.
Grid Configuration | File System Configuration |
---|---|
The execution tier contains fewer than 200 hosts. | Use a single shared NFS file system under the root of the Sun Grid Engine file system for the spool directories and binary files. |
The execution tier contains about 200 hosts, or the applications are disk intensive. | Use a separate area on an NFS file system for the spool directories. |
The execution tier contains more than 200 hosts, or NFS performance is likely to be a problem. | See the Sun Grid Engine documentation for alternate grid configurations. |
Use the questions in this section to plan the installation and configuration of Sun Cluster HA for Sun Grid Engine. Write the answers to these questions in the space that is provided on the data service worksheets in Configuration Worksheets in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
Which resource group will you use for the following resources:
Logical host name resource
HAStoragePlus resource
NFS resource
Sun Grid Engine application resources
Use the answer to this question when you perform the following procedures:
What is the logical host name for the Sun Grid Engine resource? Clients access the data service through this logical host name.
Use the answer to this question when you perform the procedure How to Enable Sun Grid Engine to Run in a Cluster.
Which resources will you use for the components of Sun Grid Engine?
You require one resource for each component in the following list:
Queue master daemon
Scheduling daemon
Use the answer to this question when you perform the procedure Specifying Configuration Parameters for Sun Cluster HA for Sun Grid Engine Resources.
Where will the system configuration files reside?
See Configuration Guidelines for Sun Cluster Data Services in Sun Cluster Data Services Planning and Administration Guide for Solaris OS for the advantages and disadvantages of using the local file system instead of the cluster file system.
Preparing the nodes and disks modifies the configuration of the operating system to enable Sun Cluster HA for Sun Grid Engine to eliminate single points of failure in a Sun Grid Engine system.
Before you begin, ensure that the requirements in the following sections are met:
Become superuser on all the cluster nodes where you are installing Sun Grid Engine.
Create an administrative user account for Sun Grid Engine on all those cluster nodes.
Either select an existing user account other than root for the grid administration, or create an account specifically for grid administration.
For consistency with the Sun Grid Engine documentation, name the account sgeadmin.
Create a directory for the root of the Sun Grid Engine file system.
# mkdir sge-root-dir
The sge-root-dir directory must reside in the cluster file system. Refer to Configuring the HAStoragePlus Resource Type to Work With Sun Cluster HA for Sun Grid Engine for more details.
Change the owner of the root of the Sun Grid Engine file system to the administrative user whose account you created in Step 2.
# chown sge-admin sge-root-dir
Set the mode of the root of the Sun Grid Engine file system to drwxr-xr-x.
# chmod 755 sge-root-dir
Specify the port number and protocol for the sge_qmaster and sge_execd services.
Choose an unused port number below 1024. The sge_qmaster and sge_execd services are to be provided through Transmission Control Protocol (TCP).
To specify the port numbers and protocol, add the following lines to the /etc/services file.
sge_qmaster port-no/tcp
sge_execd port-no/tcp
For each type of host in the grid, create a plain text file that contains the names of all hosts of that type in the grid.
The install_qmaster script uses these files when you install Sun Grid Engine. Create a separate file for each type of host in the grid:
Execution hosts
Administrative hosts
Submit hosts
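The host list files can be created with any text editor. As a minimal sketch, the following script writes one file per host type. The directory, the file names, the host names, and the one-name-per-line layout are all placeholders chosen for illustration; check your Sun Grid Engine documentation for the layout that the install_qmaster script expects.

```shell
#!/bin/sh
# Write one plain text file per host type, one host name per line.
# The directory, file names, and host names below are placeholders.
HOSTLIST_DIR=${HOSTLIST_DIR:-./sge-hostlists}
mkdir -p "$HOSTLIST_DIR"

write_host_list() {
    # $1 = file name, remaining arguments = host names
    file=$HOSTLIST_DIR/$1
    shift
    printf '%s\n' "$@" > "$file"
}

write_host_list exec_hosts   exec01 exec02 exec03
write_host_list admin_hosts  admin01
write_host_list submit_hosts submit01 submit02
```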
This example shows how to prepare the nodes and disks for a Sun Grid Engine installation that is to be configured as follows:
The root of Sun Grid Engine file system is the /global/gridmaster directory. This directory resides in the cluster file system.
The account for grid administration is named sgeadmin.
The sge_qmaster service is to be provided through port 536 and TCP.
The sge_execd service is to be provided through port 537 and TCP.
The sequence of operations for preparing the nodes and disks for the installation of Sun Grid Engine is as follows:
To create the /global/gridmaster directory for the root of the Sun Grid Engine file system, the following command is run:
# mkdir /global/gridmaster
To change the owner of the /global/gridmaster directory to the sgeadmin user, the following command is run:
# chown sgeadmin /global/gridmaster
To set the mode of the /global/gridmaster directory to drwxr-xr-x, the following command is run:
# chmod 755 /global/gridmaster
To specify that the sge_qmaster service is to be provided through port 536 and TCP, and that the sge_execd service is to be provided through port 537 and TCP, the following lines are added to the /etc/services file:
sge_qmaster 536/tcp
sge_execd 537/tcp
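After the preparation steps in this example, the result can be checked before the installation starts. The following is a minimal sketch of such a check, written for a POSIX shell. The function names and the SGE_ROOT_DIR and SERVICES_FILE variables are hypothetical; the paths and port numbers are the values from this example.

```shell
#!/bin/sh
# check_mode DIR               succeeds if DIR is a directory with mode drwxr-xr-x
# check_service FILE NAME PORT succeeds if FILE contains a "NAME PORT/tcp" entry
check_mode() {
    [ -d "$1" ] || return 1
    # The first 10 characters of the ls -ld output hold the type and mode.
    mode=$(ls -ld "$1" | cut -c1-10)
    [ "$mode" = "drwxr-xr-x" ]
}

check_service() {
    # Match the service name at the start of a line, then white space,
    # then the port number and protocol.
    grep "^$2[[:space:]][[:space:]]*$3/tcp" "$1" >/dev/null 2>&1
}

SGE_ROOT_DIR=${SGE_ROOT_DIR:-/global/gridmaster}
SERVICES_FILE=${SERVICES_FILE:-/etc/services}

check_mode "$SGE_ROOT_DIR" ||
    echo "$SGE_ROOT_DIR is missing or its mode is not drwxr-xr-x"
check_service "$SERVICES_FILE" sge_qmaster 536 ||
    echo "sge_qmaster 536/tcp not found in $SERVICES_FILE"
check_service "$SERVICES_FILE" sge_execd 537 ||
    echo "sge_execd 537/tcp not found in $SERVICES_FILE"
```

The script prints nothing when all three checks pass.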
The procedure that follows explains only the special requirements for installing Sun Grid Engine for use with Sun Cluster HA for Sun Grid Engine. For complete information about installing and configuring Sun Grid Engine, see your Sun Grid Engine documentation.
To enable Sun Grid Engine to run in a cluster, you must modify Sun Grid Engine to use a logical host name.
Before you begin, ensure that you have the host names of all hosts in the grid. Create a separate list of host names for each type of host in the grid:
Execution hosts
Administrative hosts
Submit hosts
Become superuser of the cluster node where you are installing Sun Grid Engine.
Install the Sun Grid Engine distribution files. You have to choose between the tar.gz format and the pkgadd format.
Follow the instructions outlined in How to Load the Distribution Files On a Workstation in the N1 Grid Engine 6 Installation Guide.
If you choose the pkgadd format, install patches for the Sun Grid Engine software on the same node on which the Sun Grid Engine packages are registered.
Set the SGE_ROOT environment variable to the directory for the root of the Sun Grid Engine file system that you created in Preparing the Nodes and Disks.
# SGE_ROOT=sge-root-dir
# export SGE_ROOT
Go to the directory for the root of the Sun Grid Engine file system.
# cd sge-root-dir
Start the script that installs the Sun Grid Engine master host.
# ./install_qmaster
Follow the prompts on screen to provide or confirm the following information:
The name of the Sun Grid Engine administrative user
The value of the SGE_ROOT environment variable
The TCP port number
The name of the Sun Grid Engine cell to be configured
The path to the spool directory
The setup for the correct file permissions
Details of your domain name service (DNS) domains
When you are asked whether you want to use classic spooling or Berkeley DB, do not choose to use a Berkeley DB spooling server.
Either choose the classic spooling method, or choose Berkeley DB with local spooling.
When you are prompted, specify the range of group IDs for Sun Grid Engine to use.
To ensure that you allocate enough group IDs, specify a range of approximately 100 group IDs, for example, 20000-20100.
Follow the prompts on screen to provide or confirm the following information:
The path to the spooling directory for the execution daemon
The email address of the user who should receive problem reports
Confirm the configuration parameters
When you are asked whether you want to install the script that starts Sun Grid Engine at boot time, reply no.
The prompt and the reply are as follows:
We can install the startup script that will start qmaster/scheduler at machine boot (y/n) [y] >> n
To ensure that Sun Cluster HA for Sun Grid Engine can provide fault monitoring and automatic fault recovery, Sun Grid Engine must be started only by Sun Cluster.
Follow the prompts on screen to provide or confirm the following information:
Specify the list of execution, admin and submit hosts
Do not use a shadow host
Select a scheduler profile
Become superuser of a node in the cluster that will host Sun Grid Engine.
Create a failover resource group to contain the Sun Cluster HA for Sun Grid Engine resources.
Use the resource group that you identified when you answered the questions in Configuration Planning Questions.
# clresourcegroup create -p Pathprefix=sge-root-dir sge-rg
-p Pathprefix=sge-root-dir: Specifies a directory on a cluster file system that Sun Cluster HA for NFS uses to maintain administrative and status information. This directory must be the directory that you created for the root of the Sun Grid Engine file system in Preparing the Nodes and Disks.
sge-rg: Specifies that the resource group that you are creating is named sge-rg.
Add a resource for the Sun Grid Engine logical host name to the failover resource group that you created in Step 2.
# clreslogicalhostname create \
-g sge-rg \
-h hostlist \
sge-lh-rs
-g sge-rg: Specifies that the logical host name resource is to be added to the failover resource group that you created in Step 2.
-h hostlist: Specifies a comma-separated list of host names that are to be made available by this logical host name resource.
sge-lh-rs: Specifies that the resource that you are creating is named sge-lh-rs.
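The two cluster commands in this procedure can be collected in a small script so that the names are defined in one place. The following is a sketch only: the variable names, the run helper, and the default logical host name sge-lh are hypothetical, and the script prints the commands instead of running them unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Create the failover resource group and the logical host name resource.
# All default values are placeholders; replace them with your own names.
SGE_ROOT_DIR=${SGE_ROOT_DIR:-/global/gridmaster}
SGE_RG=${SGE_RG:-sge-rg}
SGE_LH_RS=${SGE_LH_RS:-sge-lh-rs}
SGE_LH_HOSTS=${SGE_LH_HOSTS:-sge-lh}    # hypothetical logical host name
DRY_RUN=${DRY_RUN:-1}                   # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_sge_rg() {
    # Only the options that appear in this procedure are used.
    run clresourcegroup create -p Pathprefix="$SGE_ROOT_DIR" "$SGE_RG" &&
    run clreslogicalhostname create -g "$SGE_RG" -h "$SGE_LH_HOSTS" "$SGE_LH_RS"
}

create_sge_rg
```

Reviewing the printed commands before setting DRY_RUN=0 helps to catch a wrong resource group name or path before the cluster configuration is changed.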
Before you install the Sun Cluster HA for Sun Grid Engine packages, verify that the Sun Grid Engine software is correctly installed and configured to run in a cluster. This verification does not verify that the Sun Grid Engine application is highly available because the Sun Cluster HA for Sun Grid Engine data service is not yet installed.
If any step in this procedure fails, see your Sun Grid Engine documentation for more information about how to verify the Sun Grid Engine installation.
You verify the installation and configuration of Sun Grid Engine by submitting a dummy job and checking that the required processes are running.
Log in to the master host as the administrative user whose account you created in Preparing the Nodes and Disks.
Set the SGE_ROOT environment variable to the directory for the root of the Sun Grid Engine file system that you created in Preparing the Nodes and Disks.
$ SGE_ROOT=sge-root-dir
$ export SGE_ROOT
Start the script that modifies your environment to enable Sun Grid Engine to run.
$ . $SGE_ROOT/default/common/settings.sh
Submit a dummy job to Sun Grid Engine.
$ qsub $SGE_ROOT/examples/jobs/sleeper.sh
your job 1 (*Sleeper*) has been submitted
On the master host, confirm that these processes are running:
sge_qmaster
sge_schedd
# ps -ef | grep sge_
root 429 1 0 Jul 27 3:37 /global/gridmaster/bin/solaris64/sge_qmaster
root 429 1 0 Jul 27 3:37 /global/gridmaster/bin/solaris64/sge_schedd
View the global configuration of the grid.
If you are using the command line, type the following command:
$ qconf -sconf
If you are using the QMON graphical user interface (GUI), select Cluster Configuration.
On at least one execution host, confirm that the following process is running:
sge_execd
# ps -ef | grep sge_
root 451 1 0 Jul 27 3:37 /global/gridmaster/bin/solaris64/sge_execd
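The process checks in this procedure can also be scripted. The following sketch reports whether each management-tier daemon appears in the process listing. The function names are hypothetical, and the check assumes that the daemon appears in the ps -ef output as a path that ends in the daemon name, as in the sample output above.

```shell
#!/bin/sh
# list_processes is a separate function so that the listing can be
# replaced, for example with saved ps -ef output.
list_processes() {
    ps -ef
}

daemon_running() {
    # Match the daemon name after a slash or a space, either at the end
    # of the line or followed by a space (arguments).
    list_processes | grep -e "[/ ]$1\$" -e "[/ ]$1 " >/dev/null
}

for daemon in sge_qmaster sge_schedd; do
    if daemon_running "$daemon"; then
        echo "$daemon: running"
    else
        echo "$daemon: NOT running"
    fi
done
```

On an execution host, the same function can be called with sge_execd as its argument.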
If you did not install the Sun Cluster HA for Sun Grid Engine packages during your initial Sun Cluster installation, perform this procedure to install the packages. To install the packages, use the Sun Java Enterprise System Installation Wizard.
You need to install the Sun Cluster HA for Sun Grid Engine packages in the global cluster and not in the zone cluster.
Perform this procedure on each cluster node where you are installing the Sun Cluster HA for Sun Grid Engine packages.
You can run the Sun Java Enterprise System Installation Wizard with a command-line interface (CLI) or with a graphical user interface (GUI). The content and sequence of instructions in the CLI and the GUI are similar.
Ensure that you have the Sun Java Availability Suite DVD-ROM.
If you intend to run the Sun Java Enterprise System Installation Wizard with a GUI, ensure that your DISPLAY environment variable is set.
On the cluster node where you are installing the data service packages, become superuser.
Load the Sun Java Availability Suite DVD-ROM into the DVD-ROM drive.
If the Volume Management daemon vold(1M) is running and configured to manage DVD-ROM devices, the daemon automatically mounts the DVD-ROM on the /cdrom directory.
Change to the Sun Java Enterprise System Installation Wizard directory of the DVD-ROM.
Start the Sun Java Enterprise System Installation Wizard.
# ./installer
When you are prompted, accept the license agreement.
If any Sun Java Enterprise System components are installed, you are prompted to select whether to upgrade the components or install new software.
From the list of Sun Cluster agents under Availability Services, select the data service for Sun Grid Engine.
If you require support for languages other than English, select the option to install multilingual packages.
English language support is always installed.
When prompted whether to configure the data service now or later, choose Configure Later.
Choose Configure Later to perform the configuration after the installation.
Follow the instructions on the screen to install the data service packages on the node.
The Sun Java Enterprise System Installation Wizard displays the status of the installation. When the installation is complete, the wizard displays an installation summary and the installation logs.
(GUI only) If you do not want to register the product and receive product updates, deselect the Product Registration option.
The Product Registration option is not available with the CLI. If you are running the Sun Java Enterprise System Installation Wizard with the CLI, omit this step.
Exit the Sun Java Enterprise System Installation Wizard.
Unload the Sun Java Availability Suite DVD-ROM from the DVD-ROM drive.
For information about how to install the Sun Cluster HA for NFS packages, see Sun Cluster Data Service for NFS Guide for Solaris OS.
For maximum availability of the Sun Grid Engine application, resources that Sun Cluster HA for Sun Grid Engine requires must be available before the Sun Grid Engine management tier is started. An example of such a resource is the Sun Grid Engine file system. To ensure that these resources are available, configure the HAStoragePlus resource type to work with Sun Cluster HA for Sun Grid Engine.
For information about the relationship between resource groups and disk device groups, see Relationship Between Resource Groups and Device Groups in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
Configuring the HAStoragePlus resource type to work with Sun Cluster HA for Sun Grid Engine involves the following operations:
Synchronizing the startups between resource groups and disk device groups as explained in Synchronizing the Startups Between Resource Groups and Device Groups in Sun Cluster Data Services Planning and Administration Guide for Solaris OS
Registering and configuring an HAStoragePlus resource
Become superuser on a node in the cluster that will host Sun Grid Engine.
Register the SUNW.HAStoragePlus resource type.
# clresourcetype register SUNW.HAStoragePlus
Add an HAStoragePlus resource for the Sun Grid Engine file system to the resource group that you created in How to Enable Sun Grid Engine to Run in a Cluster.
# clresource create \
-g sge-rg \
-t SUNW.HAStoragePlus \
-p FilesystemMountPoints=sge-root \
sge-hasp-rs
-g sge-rg: Specifies that the resource is to be added to the resource group that you created in How to Enable Sun Grid Engine to Run in a Cluster.
-p FilesystemMountPoints=sge-root: Specifies that the mount point for this file system is the root of the Sun Grid Engine file system.
sge-hasp-rs: Specifies that the resource that you are creating is named sge-hasp-rs.
You must use NFS to export the Sun Grid Engine file system to the noncluster nodes. The NFS server that exports this file system must also be protected against failure. To protect the NFS server against failure, use the Sun Cluster HA for NFS data service.
The procedure that follows explains only the special requirements for using Sun Cluster HA for NFS with Sun Cluster HA for Sun Grid Engine. For complete information about installing and configuring Sun Cluster HA for NFS, see Sun Cluster Data Service for NFS Guide for Solaris OS.
Commands in this procedure assume that you have set the $SGE_ROOT environment variable to specify the root of the Sun Grid Engine file system.
Register the SUNW.nfs resource type.
# clresourcetype register SUNW.nfs
From any cluster node, create a directory for NFS configuration files.
Create the directory under the root of the Sun Grid Engine file system. Name the directory SUNW.nfs.
# mkdir -p $SGE_ROOT/SUNW.nfs
In the directory that you created in Step 2, create a file that contains the share command for the root of the Sun Grid Engine file system.
Name the file dfstab.sge-nfs-rs, where sge-nfs-rs is the name of the NFS resource that you will create in Step 4.
# echo "share -F nfs -o rw sge-root" \
> $SGE_ROOT/SUNW.nfs/dfstab.sge-nfs-rs
Add a SUNW.nfs resource to the failover resource group that you created in How to Enable Sun Grid Engine to Run in a Cluster.
# clresource create \
-g sge-rg \
-t SUNW.nfs \
-p Resource_dependencies=sge-hasp-rs \
sge-nfs-rs
This example shows the command for creating a dfstab file for the root of the Sun Grid Engine file system.
The root of the Sun Grid Engine file system is /global/gridmaster.
The name of the NFS resource for which this file is created is sge-nfs-rs.
# echo "share -F nfs -o rw /global/gridmaster" \
> /global/gridmaster/SUNW.nfs/dfstab.sge-nfs-rs
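The directory creation and the dfstab command from this procedure can be combined in one helper so that the file name always matches the NFS resource name. This is a sketch; the function name make_dfstab and its argument order are hypothetical.

```shell
#!/bin/sh
# Create the dfstab file for the NFS resource under SGE_ROOT/SUNW.nfs.
# Usage: make_dfstab SGE_ROOT NFS_RESOURCE_NAME
make_dfstab() {
    sge_root=$1
    nfs_rs=$2
    mkdir -p "$sge_root/SUNW.nfs" || return 1
    # Same share command as in the example: read-write NFS export of the root.
    echo "share -F nfs -o rw $sge_root" > "$sge_root/SUNW.nfs/dfstab.$nfs_rs"
}

# Call with the values from this example:
# make_dfstab /global/gridmaster sge-nfs-rs
```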
Before you perform this procedure, ensure that the Sun Cluster HA for Sun Grid Engine data service packages are installed.
Use the configuration and registration files in the /opt/SUNWscsge/util directory to register the Sun Cluster HA for Sun Grid Engine resources. The files define the dependencies that are required between Sun Grid Engine components. For information about these dependencies, see Dependencies Between Sun Grid Engine Components. For a listing of these files, see Files for Configuring and Removing Sun Cluster HA for Sun Grid Engine Resources.
Registering and configuring Sun Cluster HA for Sun Grid Engine involves the tasks that are explained in the following sections:
Specifying Configuration Parameters for Sun Cluster HA for Sun Grid Engine Resources
How to Create and Enable Sun Cluster HA for Sun Grid Engine Resources
Sun Cluster HA for Sun Grid Engine provides scripts that automate the process of configuring and removing Sun Cluster HA for Sun Grid Engine resources. These scripts obtain configuration parameters from the sge_config file in the /opt/SUNWscsge/util/ directory. To specify configuration parameters for Sun Cluster HA for Sun Grid Engine resources, edit the sge_config file.
Each configuration parameter in the sge_config file is defined as a keyword-value pair. The sge_config file already contains the required keywords and equals signs. For more information, see Listing of sge_config. When you edit the sge_config file, add the required value to each keyword. Use the values that you identified in Configuration Planning Questions.
The keyword-value pairs in the sge_config file are as follows:
QMASTERRS=sge-qmaster-rs
SCHEDDRS=sge-schedd-rs
MASTERRG=sge-rg
MASTERLH=sge-lh-rs
MASTERPORT=portno
MASTERHASP=sge-hasp-rs
SGE_ROOT=sge-root-dir
SGE_CELL=cell-name
SGE_VER=6.0
The meaning and permitted values of the keywords in the sge_config file are as follows:
QMASTERRS: Specifies the name that you are assigning to the resource for the Sun Grid Engine queue master daemon sge_qmaster. This keyword must be defined.
SCHEDDRS: Specifies the name that you are assigning to the resource for the Sun Grid Engine scheduling daemon sge_schedd. This keyword must be defined.
MASTERRG: Specifies the name of the resource group that contains the Sun Cluster HA for Sun Grid Engine resources. This name must be the name that you assigned when you created the resource group as explained in How to Enable Sun Grid Engine to Run in a Cluster. This keyword must be defined.
MASTERLH: Specifies the name of the logical host name resource for Sun Grid Engine. This name must be the name that you assigned when you created the resource in How to Enable Sun Grid Engine to Run in a Cluster. This keyword must be defined.
MASTERPORT: Specifies the port number that is configured for sge_qmaster. The default is 536. The value must be an integer. This keyword must be defined.
MASTERHASP: Specifies the name of the SUNW.HAStoragePlus resource for Sun Grid Engine. This name must be the name that you assigned when you created the resource in Configuring the HAStoragePlus Resource Type to Work With Sun Cluster HA for Sun Grid Engine. If this resource is used, this keyword must be defined.
SGE_ROOT: Specifies the root directory of the Sun Grid Engine file system. This directory must be the directory that you created for the root of the Sun Grid Engine file system in Preparing the Nodes and Disks. This keyword must be defined.
SGE_CELL: Specifies the cell that Sun Grid Engine references. This keyword must be defined.
SGE_VER: Specifies the version of the installed Sun Grid Engine configuration. This keyword must be defined, and its value must be set to "6.0".
You must set the SGE_VER keyword to “6.0”, even if you are using Sun Grid Engine version 6.1.
This example shows an sge_config file in which configuration parameters are set as follows:
The name of the resource for the Sun Grid Engine queue master daemon sge_qmaster is sge_qmaster-rs.
The name of the resource for the Sun Grid Engine scheduling daemon sge_schedd is sge_schedd-rs.
The name of the resource group that contains the Sun Cluster HA for Sun Grid Engine resources is sge-rg.
The name of the logical host name resource for Sun Grid Engine is sge-lh-rs.
The port number for sge_qmaster is set to 536.
The name of the SUNW.HAStoragePlus resource for Sun Grid Engine is sge-hasp-rs.
The root directory of the Sun Grid Engine file system is /global/gridmaster.
Sun Grid Engine references the default cell.
The version for Sun Grid Engine is set to 6.0.
QMASTERRS=sge_qmaster-rs
SCHEDDRS=sge_schedd-rs
MASTERRG=sge-rg
MASTERLH=sge-lh-rs
MASTERPORT=536
MASTERHASP=sge-hasp-rs
SGE_ROOT=/global/gridmaster
SGE_CELL=default
SGE_VER=6.0
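Before the registration script is run, an edited sge_config file can be checked against the rules that are described for its keywords. The following sketch is not part of the data service; the function name check_sge_config is hypothetical, and the check assumes that each keyword appears as KEYWORD=value at the start of a line, without quotation marks. MASTERHASP is not checked because it is required only if an HAStoragePlus resource is used.

```shell
#!/bin/sh
# Check an sge_config file: required keywords have values, MASTERPORT is
# an integer, and SGE_VER is 6.0.
# Usage: check_sge_config FILE
check_sge_config() {
    cfg=$1
    status=0
    for key in QMASTERRS SCHEDDRS MASTERRG MASTERLH MASTERPORT \
               SGE_ROOT SGE_CELL SGE_VER; do
        # Print the value of the first KEY=value line, if any.
        value=$(sed -n "s/^$key=//p" "$cfg" | head -n 1)
        if [ -z "$value" ]; then
            echo "$key is not set"
            status=1
        fi
    done
    port=$(sed -n 's/^MASTERPORT=//p' "$cfg" | head -n 1)
    case $port in
        ''|*[!0-9]*) echo "MASTERPORT must be an integer"; status=1 ;;
    esac
    ver=$(sed -n 's/^SGE_VER=//p' "$cfg" | head -n 1)
    if [ "$ver" != "6.0" ]; then
        echo "SGE_VER must be 6.0"
        status=1
    fi
    return $status
}

# Example call; the path is a placeholder:
# check_sge_config /mypath/sge_config
```

The function prints one line for each problem that it finds and returns a nonzero status if any check fails.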
Before you begin, ensure that you have edited the sge_config file or a copy of it to specify configuration parameters for Sun Cluster HA for Sun Grid Engine resources. For more information, see Specifying Configuration Parameters for Sun Cluster HA for Sun Grid Engine Resources.
Register the SUNW.gds resource type.
# clresourcetype register SUNW.gds
Go to the directory that contains the script for creating the Sun Grid Engine resources.
# cd /opt/SUNWscsge/util/
Run the script that creates the Sun Grid Engine resources.
# ./sge_register -f /mypath/sge_config
Bring online the failover resource group that you created in How to Enable Sun Grid Engine to Run in a Cluster.
This resource group contains the following resources:
Logical host name resource
HAStoragePlus resource
NFS resource
Sun Grid Engine application resources
# clresourcegroup online -M sge-rg
Specifies that the resource group that you created in How to Enable Sun Grid Engine to Run in a Cluster is to be brought online.
Before you bring the failover resource group online, make sure that the Sun Grid Engine daemons (sge_qmaster and sge_schedd) are not running. They might be running because the install_qmaster installation script started them, or because they were left running after the verification described in How to Verify the Sun Cluster HA for Sun Grid Engine Installation and Configuration.
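One way to check that the daemons are stopped is sketched below for a POSIX shell. The function name daemons_stopped is hypothetical, and the sketch assumes that the pgrep command is available on the node and that the daemon process names match the names given here.

```shell
# Hypothetical helper: return 0 only if none of the named daemons is running.
# Assumption: pgrep is available and -x matches the exact process name.
daemons_stopped() {
    rc=0
    for d in "$@"; do
        if pgrep -x "$d" >/dev/null 2>&1; then
            echo "$d is still running"
            rc=1
        fi
    done
    return $rc
}
```

A possible use is to guard the online step, for example daemons_stopped sge_qmaster sge_schedd && clresourcegroup online -M sge-rg, so that the resource group is brought online only when neither daemon is found.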
Extension properties for Sun Cluster HA for Sun Grid Engine resources are set when you run the script that creates these resources. You need to set these properties only if you require values other than the values that are set by the script. For information about Sun Cluster HA for Sun Grid Engine extension properties, see the SUNW.gds(5) man page. You can update some extension properties dynamically. You can update other properties, however, only when you create or disable a resource. The Tunable entry indicates when you can update a property.
To update an extension property of a resource, run the clresource(1CL) command with the following option to modify the resource:
-p property=value
Identifies the extension property that you are setting
Specifies the value to which you are setting the extension property
You can also use the procedures in Chapter 2, Administering Data Service Resources, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS to configure resources after the resources are created.
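As an illustration of the -p option, the following sketch builds a clresource set command for one resource. The helper names run and set_property are hypothetical, and Probe_timeout with a value of 60 is only an assumed example; check the Tunable entry of the property that you want to change. By default the sketch prints the command instead of running it, so that it can be reviewed first.

```shell
# Hypothetical helpers. DRY_RUN=1 (the default) prints the command
# instead of executing it; set DRY_RUN=0 on a cluster node to run it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = resource name, $2 = property name, $3 = new value
set_property() {
    run clresource set -p "$2=$3" "$1"
}
```

With the default setting, set_property sge_qmaster-rs Probe_timeout 60 prints the line clresource set -p Probe_timeout=60 sge_qmaster-rs.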
After you install, register, and configure Sun Cluster HA for Sun Grid Engine, verify the Sun Cluster HA for Sun Grid Engine installation and configuration. Verifying the Sun Cluster HA for Sun Grid Engine installation and configuration determines if the Sun Cluster HA for Sun Grid Engine data service makes the Sun Grid Engine application highly available.
Become superuser on a node that will host Sun Grid Engine.
Verify that all Sun Grid Engine resources are online.
# cluster status -t rg,rs
If a Sun Grid Engine resource is not online, enable the resource.
# clresource enable sge-rs
Switch the Sun Grid Engine resource group to another cluster node.
# clresourcegroup switch -n node sge-rg
The Sun Cluster HA for Sun Grid Engine fault monitors verify that the following daemons are running correctly:
Queue master daemon sge_qmaster
Scheduling daemon sge_schedd
Each Sun Cluster HA for Sun Grid Engine fault monitor is contained in the resource that represents the corresponding Sun Grid Engine component. You create these resources when you register and configure Sun Cluster HA for Sun Grid Engine. For more information, see Registering and Configuring Sun Cluster HA for Sun Grid Engine.
System properties and extension properties of these resources control the behavior of the fault monitor. The default values of these properties determine the preset behavior of the fault monitor. The preset behavior should be suitable for most Sun Cluster installations. Therefore, you should tune the Sun Cluster HA for Sun Grid Engine fault monitor only if you need to modify this preset behavior.
Tuning the Sun Cluster HA for Sun Grid Engine fault monitors involves the following tasks:
Setting the interval between fault monitor probes
Setting the timeout for fault monitor probes
Defining the criteria for persistent faults
Specifying the failover behavior of a resource
For more information, see Tuning Fault Monitors for Sun Cluster Data Services in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
The config file in the /opt/SUNWscsge/etc directory enables you to activate debugging for Sun Grid Engine resources. This file enables you to activate debugging for all Sun Grid Engine resources or for a specific Sun Grid Engine resource on a particular node. If you require debugging for Sun Cluster HA for Sun Grid Engine to be enabled throughout the cluster, repeat this procedure on all nodes.
Determine whether debugging for Sun Cluster HA for Sun Grid Engine is active.
If debugging is inactive, daemon.notice is set in the file /etc/syslog.conf.
# grep daemon /etc/syslog.conf
*.err;kern.debug;daemon.notice;mail.crit /var/adm/messages
*.alert;kern.err;daemon.err operator
#
If debugging is inactive, edit the /etc/syslog.conf file to change daemon.notice to daemon.debug.
Confirm that debugging for Sun Cluster HA for Sun Grid Engine is active.
If debugging is active, daemon.debug is set in the file /etc/syslog.conf.
# grep daemon /etc/syslog.conf
*.err;kern.debug;daemon.debug;mail.crit /var/adm/messages
*.alert;kern.err;daemon.err operator
#
Restart the syslogd daemon.
If your operating system is Solaris 9, perform:
# pkill -1 syslogd
If your operating system is Solaris 10, perform:
# svcadm restart system-log
Edit the /opt/SUNWscsge/etc/config file to change DEBUG= to DEBUG=ALL or DEBUG=sge-rs.
# cat /opt/SUNWscsge/etc/config
#
# Copyright 2006 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
# ident "@(#)config 1.1 06/02/18 SMI"
#
# Usage:
# DEBUG=<RESOURCE_NAME> or ALL
#
DEBUG=ALL
#
To deactivate debugging, reverse the preceding steps.
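The checks and the edit in this procedure can be sketched as two small POSIX shell functions. The function names daemon_debug_active and set_debug_target are hypothetical. The sketch assumes that the DEBUG keyword starts in the first column of the config file, as in the listing above, and it writes through a temporary file rather than relying on an in-place option of sed.

```shell
# Hypothetical helper: return 0 if a non-comment line of the given
# syslog configuration file (default /etc/syslog.conf) sets daemon.debug.
daemon_debug_active() {
    grep -v '^#' "${1:-/etc/syslog.conf}" | grep 'daemon\.debug' >/dev/null
}

# Hypothetical helper: set the DEBUG keyword in config file $1 to $2,
# for example ALL or a resource name. An empty $2 deactivates debugging.
# Commented usage lines such as "# DEBUG=..." are left unchanged.
set_debug_target() {
    cfg=$1
    target=$2
    tmp="${cfg}.tmp.$$"
    # Copy the result back with cat so that the original file keeps
    # its ownership and permissions.
    sed "s/^DEBUG=.*/DEBUG=${target}/" "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
    rm -f "$tmp"
}
```

For example, set_debug_target /opt/SUNWscsge/etc/config ALL activates debugging for all Sun Grid Engine resources on the node, and set_debug_target /opt/SUNWscsge/etc/config "" restores the empty DEBUG= setting.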