This chapter explains how to install and configure Sun Cluster HA for Sun Grid Engine.
Sun Grid Engine was formerly known as Sun ONE Grid Engine. In this book, references to Sun Grid Engine also apply to Sun ONE Grid Engine unless this book explicitly states otherwise.
This chapter contains the following sections.
Overview of Installing and Configuring Sun Cluster HA for Sun Grid Engine
Planning the Sun Cluster HA for Sun Grid Engine Installation and Configuration
Verifying the Installation and Configuration of Sun Grid Engine
Configuring the HAStoragePlus Resource Type to Work With Sun Cluster HA for Sun Grid Engine
Configuring Sun Cluster HA for NFS for Use With Sun Cluster HA for Sun Grid Engine
Registering and Configuring Sun Cluster HA for Sun Grid Engine
Verifying the Sun Cluster HA for Sun Grid Engine Installation and Configuration
Tuning the Sun Cluster HA for Sun Grid Engine Fault Monitors
Sun Grid Engine is a distributed resource management program, which runs jobs in parallel on multiple machines. To minimize the loss of work that a failure of a machine might cause, nodes in the management tier must be protected against failure. However, protection of individual execution nodes in the grid against failure is not required. Failure of an individual execution node in a grid causes only a minor loss of work.
To eliminate single points of failure in the management tier of a Sun Grid Engine system, Sun Cluster HA for Sun Grid Engine provides fault monitoring and automatic fault recovery for the following Sun Grid Engine daemons:
Queue master daemon
Scheduling daemon
You must configure Sun Cluster HA for Sun Grid Engine as a failover service.
For conceptual information about failover data services and scalable data services, see Sun Cluster Concepts Guide for Solaris OS.
Because the management tier relies on the Sun Grid Engine file system, the NFS server that exports this file system must also be protected against failure. To eliminate single points of failure in the NFS server, use the Sun Cluster HA for NFS data service. For more information about this data service, see Sun Cluster Data Service for NFS Guide for Solaris OS.
Each component of Sun Grid Engine has a data service that protects the component when the component is configured in Sun Cluster. See the following table.
Table 1 Protection of Sun Grid Engine Components by Sun Cluster Data Services
The following table summarizes the tasks for installing and configuring Sun Cluster HA for Sun Grid Engine and provides cross-references to detailed instructions for performing these tasks. Perform the tasks in the order that they are listed in the table.
Table 2 Tasks for Installing and Configuring Sun Cluster HA for Sun Grid Engine
Task | Instructions |
---|---|
Plan the installation | Sun Cluster HA for Sun Grid Engine Overview; Planning the Sun Cluster HA for Sun Grid Engine Installation and Configuration |
Prepare the nodes and disks | |
Install and configure Sun Grid Engine | |
Verify the Sun Grid Engine installation and configuration | Verifying the Installation and Configuration of Sun Grid Engine |
Install the Sun Cluster HA for Sun Grid Engine packages | |
Configure the HAStoragePlus resource type to work with Sun Cluster HA for Sun Grid Engine | Configuring the HAStoragePlus Resource Type to Work With Sun Cluster HA for Sun Grid Engine |
Configure Sun Cluster HA for NFS for use with Sun Cluster HA for Sun Grid Engine | Configuring Sun Cluster HA for NFS for Use With Sun Cluster HA for Sun Grid Engine |
Register and configure Sun Cluster HA for Sun Grid Engine | Registering and Configuring Sun Cluster HA for Sun Grid Engine |
Verify the Sun Cluster HA for Sun Grid Engine installation and configuration | Verifying the Sun Cluster HA for Sun Grid Engine Installation and Configuration |
Tune the Sun Cluster HA for Sun Grid Engine fault monitors | Tuning the Sun Cluster HA for Sun Grid Engine Fault Monitors |
Debug Sun Cluster HA for Sun Grid Engine | |
This section contains the information that you need to plan your Sun Cluster HA for Sun Grid Engine installation and configuration.
Before you begin, consult your Sun Grid Engine documentation for configuration restrictions and requirements that are not imposed by Sun Cluster software.
The configuration restrictions in the subsections that follow apply only to Sun Cluster HA for Sun Grid Engine.
Your data service configuration might not be supported if you do not observe these restrictions.
Do not use the Sun Grid Engine shadow daemon. The Sun Grid Engine shadow daemon provides an optional mechanism for recovery from failures. This mechanism interferes with the automatic fault recovery that Sun Cluster provides.
Do not choose the option to use a Berkeley DB spooling server. Choose either the classic spooling method or the local Berkeley DB spooling method. Currently, the Berkeley DB spooling server cannot be configured for high availability within the Sun Cluster framework.
Do not choose the start at boot option when installing Sun Grid Engine. To ensure that Sun Cluster HA for Sun Grid Engine can provide fault monitoring and automatic fault recovery, Sun Grid Engine must be started only by Sun Cluster.
The configuration requirements in this section apply only to Sun Cluster HA for Sun Grid Engine.
If your data service configuration does not conform to these requirements, the data service configuration might not be supported.
Use Sun Grid Engine version 6.0. Apply the most recent available patches to the Sun Grid Engine software.
Although Sun Grid Engine version 5.3 has reached its end of life, Sun Cluster HA for Sun Grid Engine still supports this version. If you are using Sun Grid Engine version 5.3, consider upgrading to Sun Grid Engine version 6.0.
The instructions in this book apply only to Sun Grid Engine version 6.0. For information about how to install and configure Sun Cluster HA for Sun Grid Engine with Sun Grid Engine version 5.3, see Sun Cluster Data Service for Sun Grid Engine Guide for Solaris OS for Sun Cluster 3.1 9/04.
Since Sun Cluster 3.1 9/04 was released, keywords in the sge_config file are changed as follows:
RG is changed to MASTERRG.
LH is changed to MASTERLH.
PORT is changed to MASTERPORT.
SGE_VER is introduced.
For an explanation of these keywords, read the comments in the sge_config file.
The Sun Grid Engine management tier must run on Sun Cluster nodes. Because Sun Cluster runs only on the Solaris Operating System, the Sun Grid Engine management tier must also run on the Solaris Operating System. However, Sun Grid Engine supports other operating systems. Therefore, this requirement applies only to the management tier, not to individual execution nodes in the grid.
Ensure that enough free memory is available on the cluster nodes where you plan to run the Sun Grid Engine master.
The amount of free memory that is required on each cluster node depends on the number of jobs that are running on the grid. For example:
If 100 jobs are running, 10 Mbytes of free memory are required.
If 10,000 jobs are running, 1 Gbyte of free memory is required.
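Both figures above work out to roughly 0.1 Mbytes of free memory per running job. The following sketch applies that ratio as a rough linear estimate. The linear scaling and the function name estimate_free_mem_mb are assumptions made for illustration, not a sizing rule from the Sun Grid Engine documentation. The sketch assumes a POSIX shell such as ksh or bash.

```shell
# Rough estimate of the free memory (in Mbytes) needed on a master node.
# Assumption: memory scales linearly at about 0.1 Mbytes per running job,
# which is consistent with the two figures quoted in the text.
estimate_free_mem_mb() {
    njobs=$1
    # Integer arithmetic: divide by 10 and round up.
    echo $(( (njobs + 9) / 10 ))
}

estimate_free_mem_mb 100      # prints 10
estimate_free_mem_mb 10000    # prints 1000, that is, about 1 Gbyte
```

Treat the result as a lower bound for planning and confirm it against the actual workload.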
Ensure that you have enough disk space in the Sun Grid Engine file system and on the local disk of each node.
The disk space requirements for each type of file or directory in the Sun Grid Engine file system are listed in the following table.
File Type or Directory Type | Required Disk Space |
---|---|
Binary files | 15 Mbytes for each architecture |
Spool directories | 30–200 Mbytes |
Installation tar file | 40 Mbytes |
On the local disk of each node, 10–20 Mbytes of disk space are required. If you are installing the Sun Grid Engine software on the local disk of a node, 15 Mbytes of disk space are additionally required for the binary files.
Configure Sun Cluster HA for Sun Grid Engine as a failover data service. You cannot configure Sun Cluster HA for Sun Grid Engine as a scalable data service. For more information, see:
The Sun Grid Engine file system must reside on a multihost disk. This disk must be available to the other nodes in the cluster that will be used for the Sun Grid Engine administrative services.
You must use NFS to export the Sun Grid Engine file system to the noncluster nodes. The NFS server that exports this file system must also be protected against failure. To protect the NFS server against failure, use the Sun Cluster HA for NFS data service. For more information about this data service, see Sun Cluster Data Service for NFS Guide for Solaris OS.
Configure the resources for the Sun Grid Engine management tier in the same resource group as the resource for NFS. For more information, see Configuring Sun Cluster HA for NFS for Use With Sun Cluster HA for Sun Grid Engine.
The dependencies between Sun Grid Engine components are shown in the following table.
Table 3 Dependencies Between Sun Grid Engine Components
Sun Grid Engine Component | Dependency |
---|---|
Sun Grid Engine queue master daemon (sge_qmaster) | SUNW.HAStoragePlus resource |
Sun Grid Engine scheduling daemon (sge_schedd) | Sun Grid Engine queue master daemon (sge_qmaster) resource |
These dependencies are set when you register and configure Sun Cluster HA for Sun Grid Engine. For more information, see Registering and Configuring Sun Cluster HA for Sun Grid Engine.
The configuration considerations in the subsections that follow affect the installation and configuration of Sun Cluster HA for Sun Grid Engine.
You can install Sun Grid Engine on one of the following locations:
A highly available local file system
The cluster file system
For the advantages and disadvantages of placing the Sun Grid Engine binary files on a highly available local file system and the cluster file system, see Configuration Guidelines for Sun Cluster Data Services in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
To enable the type of file system to be identified from the mount point, use a prefix that indicates the type of file system as follows:
For mount points on a highly available local file system, use the /local prefix.
For mount points on the cluster file system, use the /global prefix.
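With this naming convention, a script can infer the type of file system from the mount point alone. The following is a minimal sketch under that convention. The function name and the labels that it prints are hypothetical, and the sketch assumes a POSIX shell such as ksh or bash.

```shell
# Classify a mount point by the prefix convention described above.
# The function name and the printed labels are illustrative only.
fs_type_from_mount_point() {
    case "$1" in
        /local|/local/*)   echo "highly-available-local" ;;
        /global|/global/*) echo "cluster" ;;
        *)                 echo "unknown" ;;
    esac
}

fs_type_from_mount_point /global/gridmaster   # prints cluster
fs_type_from_mount_point /local/gridmaster    # prints highly-available-local
```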
The optimum distribution of spool directories and binary files among file systems depends on the grid configuration. See the following table.
Grid Configuration | File System Configuration |
---|---|
The execution tier contains fewer than 200 hosts. | Use a single shared NFS file system under the root of the Sun Grid Engine file system for the spool directories and binary files. |
The execution tier contains about 200 hosts, or the applications are disk intensive. | Use a separate area on an NFS file system for the spool directories. |
The execution tier contains more than 200 hosts, or NFS performance is likely to be a problem. | See the Sun Grid Engine documentation for alternate grid configurations. |
Use the questions in this section to plan the installation and configuration of Sun Cluster HA for Sun Grid Engine. Write the answers to these questions in the space that is provided on the data service worksheets in Configuration Worksheets in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
Which resource group will you use for the following resources:
Logical host name resource
HAStoragePlus resource
NFS resource
Sun Grid Engine application resources
Use the answer to this question when you perform the following procedures:
What is the logical host name for the Sun Grid Engine resource? Clients access the data service through this logical host name.
Use the answer to this question when you perform the procedure How to Enable Sun Grid Engine to Run in a Cluster.
Which resources will you use for the components of Sun Grid Engine?
You require one resource for each component in the following list:
Queue master daemon
Scheduling daemon
Use the answer to this question when you perform the procedure Specifying Configuration Parameters for Sun Cluster HA for Sun Grid Engine Resources.
Where will the system configuration files reside?
See Configuration Guidelines for Sun Cluster Data Services in Sun Cluster Data Services Planning and Administration Guide for Solaris OS for the advantages and disadvantages of using the local file system instead of the cluster file system.
Preparing the nodes and disks modifies the configuration of the operating system to enable Sun Cluster HA for Sun Grid Engine to eliminate single points of failure in a Sun Grid Engine system.
Before you begin, ensure that the requirements in the following sections are met:
Become superuser on all the cluster nodes where you are installing Sun Grid Engine.
Create an administrative user account for Sun Grid Engine on all those cluster nodes.
Either select an existing user account other than root for the grid administration, or create an account specifically for grid administration.
For consistency with the Sun Grid Engine documentation, name the account sgeadmin.
Create a directory for the root of Sun Grid Engine file system.
# mkdir sge-root-dir
The sge-root-dir must reside in the cluster file system. Refer to Configuring the HAStoragePlus Resource Type to Work With Sun Cluster HA for Sun Grid Engine for more details.
Change the owner of the root of the Sun Grid Engine file system to the administrative user whose account you created in Step 2.
# chown sge-admin sge-root-dir
Set the mode of the root of the Sun Grid Engine file system to drwxr-xr-x.
# chmod 755 sge-root-dir
Specify the port number and protocol for the sge_qmaster and sge_execd services.
Choose an unused port number below 1024. The sge_qmaster and sge_execd services are to be provided through Transmission Control Protocol (TCP).
To specify the port number and protocol, add the following lines to the /etc/services file.
sge_qmaster port-no/tcp
sge_execd port-no/tcp
For each type of host in the grid, create a plain text file that contains the names of all hosts of that type in the grid.
The install_qmaster script uses these files when you install Sun Grid Engine. Create a separate file for each type of host in the grid:
Execution hosts
Administrative hosts
Submit hosts
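The host list files can be created with any text editor or with a few shell commands, as in the following sketch. The source states only that the files are plain text and contain host names, so the one-name-per-line layout is an assumption. The directory, the file names, and the host names are placeholders. The sketch assumes a POSIX shell such as ksh or bash.

```shell
# Create one plain text file for each type of host in the grid.
# Assumptions: one host name per line; the directory, file names, and
# host names below are placeholders, not values from the product docs.
hostlist_dir=${TMPDIR:-/tmp}/sge-hostlists.$$
mkdir -p "$hostlist_dir"

printf '%s\n' exec-host-1 exec-host-2 > "$hostlist_dir/exec_hosts"
printf '%s\n' admin-host-1            > "$hostlist_dir/admin_hosts"
printf '%s\n' submit-host-1           > "$hostlist_dir/submit_hosts"

# Show how many hosts each file lists.
for f in exec_hosts admin_hosts submit_hosts; do
    count=$(grep -c . "$hostlist_dir/$f")
    echo "$f: $count"
done
```

Check your Sun Grid Engine documentation for the exact file format that the install_qmaster script expects.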
This example shows how to prepare the nodes and disks for a Sun Grid Engine installation that is to be configured as follows:
The root of Sun Grid Engine file system is the /global/gridmaster directory. This directory resides in the cluster file system.
The account for grid administration is named sgeadmin.
The sge_qmaster service is to be provided through port 536 and TCP.
The sge_execd service is to be provided through port 537 and TCP.
The sequence of operations for preparing the nodes and disks for the installation of Sun Grid Engine is as follows:
To create the /global/gridmaster directory for the root of Sun Grid Engine file system, the following command is run:
# mkdir /global/gridmaster
To change the owner of the /global/gridmaster directory to the sgeadmin user, the following command is run:
# chown sgeadmin /global/gridmaster
To set the mode of the /global/gridmaster directory to drwxr-xr-x, the following command is run:
# chmod 755 /global/gridmaster
To specify that the sge_qmaster service is to be provided through port 536 and TCP, and that the sge_execd service is to be provided through port 537 and TCP, the following lines are added to the /etc/services file:
sge_qmaster 536/tcp
sge_execd 537/tcp
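After editing the services file, it can be useful to confirm that both entries are present and that each port is below 1024. The following sketch is one hypothetical way to do that. The function name service_tcp_port is not part of any product, and the demonstration runs against a scratch file instead of /etc/services. The sketch assumes a POSIX shell and an awk that supports the -v option; on Solaris that probably means nawk or /usr/xpg4/bin/awk instead of the legacy awk.

```shell
# Print the TCP port that a services-format file assigns to a service.
# Prints nothing and returns non-zero if no TCP entry is found.
# The function name is illustrative; pass /etc/services on a real node.
service_tcp_port() {
    svc=$1
    file=$2
    awk -v svc="$svc" '
        $1 == svc && $2 ~ /^[0-9]+\/tcp$/ {
            split($2, parts, "/")
            print parts[1]
            found = 1
            exit
        }
        END { exit(found ? 0 : 1) }
    ' "$file"
}

# Demonstration against a scratch file that mirrors the example above.
sample=${TMPDIR:-/tmp}/services.sample.$$
printf 'sge_qmaster 536/tcp\nsge_execd 537/tcp\n' > "$sample"

for svc in sge_qmaster sge_execd; do
    port=$(service_tcp_port "$svc" "$sample") || { echo "$svc: missing"; continue; }
    if [ "$port" -lt 1024 ]; then
        echo "$svc: port $port/tcp"
    else
        echo "$svc: port $port is not below 1024"
    fi
done
rm -f "$sample"
```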
The procedure that follows explains only the special requirements for installing Sun Grid Engine for use with Sun Cluster HA for Sun Grid Engine. For complete information about installing and configuring Sun Grid Engine, see your Sun Grid Engine documentation.
To enable Sun Grid Engine to run in a cluster, you must modify Sun Grid Engine to use a logical host name.
Before you begin, ensure that you have the host names of all hosts in the grid. Create a separate list of host names for each type of host in the grid:
Execution hosts
Administrative hosts
Submit hosts
Become superuser of the cluster node where you are installing Sun Grid Engine.
Install the Sun Grid Engine distribution files in either the tar.gz format or the pkgadd format.
Follow the instructions in How to Load the Distribution Files On a Workstation in the N1 Grid Engine 6 Installation Guide.
If you choose the pkgadd format, install patches for the Sun Grid Engine software on the same node on which the Sun Grid Engine packages are registered.
Set the SGE_ROOT environment variable to the directory for the root of Sun Grid Engine file system that you created in Preparing the Nodes and Disks.
# SGE_ROOT=sge-root-dir
# export SGE_ROOT
Go to the directory for the root of Sun Grid Engine file system.
# cd sge-root-dir
Start the script that installs the Sun Grid Engine master host.
# ./install_qmaster
Follow the prompts on screen to provide or confirm the following information:
The name of the Sun Grid Engine administrative user
The value of the SGE_ROOT environment variable
The TCP port number
The name of the Sun Grid Engine cell to be configured
The path to the spool directory
The setup for the correct file permissions
Details of your domain name service (DNS) domains
When you are asked whether you want to use classic spooling or Berkeley DB, do not choose a Berkeley DB spooling server.
Choose either the classic spooling method or Berkeley DB with local spooling.
When you are prompted, specify the range of group IDs for Sun Grid Engine to use.
To ensure that you allocate enough group IDs, specify a range of approximately 100 group IDs, for example, 20000-20100.
Follow the prompts on screen to provide or confirm the following information:
The path to the spooling directory for the execution daemon
The email address of the user who should receive problem reports
Confirm the configuration parameters
When you are asked if you want to install the script that starts Sun Grid Engine at boot time, reply n.
We can install the startup script that will start qmaster/scheduler at machine boot (y/n) [y] >> n
To ensure that Sun Cluster HA for Sun Grid Engine can provide fault monitoring and automatic fault recovery, Sun Grid Engine must be started only by Sun Cluster.
Follow the prompts on screen to provide or confirm the following information:
Specify the list of execution, admin and submit hosts
Do not use a shadow host
Select a scheduler profile
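The group ID range that you enter during the installation is easy to miscount, because both ends of the range are included. The following sketch counts the group IDs in a range written as FIRST-LAST. The function name is hypothetical and the sketch checks only the arithmetic. It assumes a POSIX shell such as ksh or bash.

```shell
# Count the group IDs in a range written as FIRST-LAST (both inclusive).
# The function name is illustrative only.
gid_range_size() {
    first=${1%-*}
    last=${1#*-}
    echo $(( last - first + 1 ))
}

gid_range_size 20000-20100   # prints 101
```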
Become superuser of a node in the cluster that will host Sun Grid Engine.
Create a failover resource group to contain the Sun Cluster HA for Sun Grid Engine resources.
Use the resource group that you identified when you answered the questions in Configuration Planning Questions.
# scrgadm -a -g sge-rg \
-y Pathprefix=sge-root-dir
-g sge-rg: Specifies that the resource group that you are creating is named sge-rg.
-y Pathprefix=sge-root-dir: Specifies a directory on a cluster file system that Sun Cluster HA for NFS uses to maintain administrative and status information. This directory must be the directory that you created for the root of the Sun Grid Engine file system in Preparing the Nodes and Disks.
Add a resource for the Sun Grid Engine logical host name to the failover resource group that you created in Step 2.
# scrgadm -a -L -j sge-lh-rs \
-g sge-rg \
-l hostlist
-j sge-lh-rs: Specifies that the resource that you are creating is named sge-lh-rs
-g sge-rg: Specifies that the logical host name resource is to be added to the failover resource group that you created in Step 2
-l hostlist: Specifies a comma-separated list of host names that are to be made available by this logical host name resource
Before you install the Sun Cluster HA for Sun Grid Engine packages, verify that the Sun Grid Engine software is correctly installed and configured to run in a cluster. This verification does not verify that the Sun Grid Engine application is highly available because the Sun Cluster HA for Sun Grid Engine data service is not yet installed.
If any step in this procedure fails, see your Sun Grid Engine documentation for more information about how to verify the Sun Grid Engine installation.
You verify the installation and configuration of Sun Grid Engine by submitting a dummy job and checking that the required processes are running.
Log in to the master host as the administrative user whose account you created in Preparing the Nodes and Disks.
Set the SGE_ROOT environment variable to the directory for the root of Sun Grid Engine file system that you created in Preparing the Nodes and Disks.
$ SGE_ROOT=sge-root-dir
$ export SGE_ROOT
Start the script that modifies your environment to enable Sun Grid Engine to run.
$ . $SGE_ROOT/default/common/settings.sh
Submit a dummy job to Sun Grid Engine.
$ qsub $SGE_ROOT/examples/jobs/sleeper.sh
your job 1 (*Sleeper*) has been submitted
On the master host, confirm that these processes are running:
sge_qmaster
sge_schedd
# ps -ef | grep sge_
root 429 1 0 Jul 27 3:37 /global/gridmaster/bin/solaris64/sge_qmaster
root 429 1 0 Jul 27 3:37 /global/gridmaster/bin/solaris64/sge_schedd
View the global configuration of the grid.
If you are using the command line, type the following command:
$ qconf -sconf
If you are using the QMON graphical user interface (GUI), select Cluster Configuration.
On at least one execution host, confirm that the following process is running:
sge_execd
# ps -ef | grep sge_
root 451 1 0 Jul 27 3:37 /global/gridmaster/bin/solaris64/sge_execd
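The process checks in this procedure can be scripted. The following sketch reads a process listing from standard input and reports whether each named daemon appears in it. The function name check_daemons is hypothetical. The sketch assumes that each daemon appears at the end of a line of ps output, as in the sample output above, and that a POSIX shell such as ksh or bash is in use.

```shell
# Report which of the named daemons appear in a process listing that is
# read from standard input. Returns non-zero if any daemon is missing.
# Assumption: the daemon path ends the line, as in the sample output.
check_daemons() {
    listing=$(cat)
    rc=0
    for name in "$@"; do
        if printf '%s\n' "$listing" | grep "/$name\$" > /dev/null; then
            echo "$name: running"
        else
            echo "$name: NOT running"
            rc=1
        fi
    done
    return $rc
}

# Demonstration with the sample listing from the master host.
check_daemons sge_qmaster sge_schedd <<'EOF'
root 429 1 0 Jul 27 3:37 /global/gridmaster/bin/solaris64/sge_qmaster
root 429 1 0 Jul 27 3:37 /global/gridmaster/bin/solaris64/sge_schedd
EOF
```

On a live master host, the equivalent call would be ps -ef | check_daemons sge_qmaster sge_schedd, and on an execution host ps -ef | check_daemons sge_execd.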
If you did not install the Sun Cluster HA for Sun Grid Engine packages during your initial Sun Cluster installation, perform this procedure to install the packages. Perform this procedure on each cluster node where you are installing the Sun Cluster HA for Sun Grid Engine packages. To complete this procedure, you need the Sun Cluster Agents CD-ROM.
If you are installing more than one data service simultaneously, perform the procedure in Installing the Software in Sun Cluster Software Installation Guide for Solaris OS.
Install the Sun Cluster HA for Sun Grid Engine packages by using one of the following installation tools:
The Web Start program
The scinstall utility
If you are using Solaris 10, install these packages only in the global zone. To ensure that these packages are not propagated to any local zones that are created after you install the packages, use the scinstall utility to install these packages. Do not use the Web Start program.
You can run the Web Start program with a command-line interface (CLI) or with a graphical user interface (GUI). The content and sequence of instructions in the CLI and the GUI are similar. For more information about the Web Start program, see the installer(1M) man page.
On the cluster node where you are installing the Sun Cluster HA for Sun Grid Engine packages, become superuser.
(Optional) If you intend to run the Web Start program with a GUI, ensure that your DISPLAY environment variable is set.
Insert the Sun Cluster Agents CD-ROM into the CD-ROM drive.
If the Volume Management daemon vold(1M) is running and configured to manage CD-ROM devices, it automatically mounts the CD-ROM on the /cdrom/cdrom0 directory.
Change to the Sun Cluster HA for Sun Grid Engine component directory of the CD-ROM.
The Web Start program for the Sun Cluster HA for Sun Grid Engine data service resides in this directory.
# cd /cdrom/cdrom0/components/SunCluster_HA_SUN_GRID_ENG_3.1
Start the Web Start program.
# ./installer
When you are prompted, select the type of installation.
Follow the instructions on the screen to install the Sun Cluster HA for Sun Grid Engine packages on the node.
After the installation is finished, the Web Start program provides an installation summary. This summary enables you to view logs that the Web Start program created during the installation. These logs are located in the /var/sadm/install/logs directory.
Exit the Web Start program.
Remove the Sun Cluster Agents CD-ROM from the CD-ROM drive.
Perform this procedure on all of the cluster members that can master Sun Cluster HA for Sun Grid Engine.
Ensure that you have the Sun Cluster Agents CD-ROM.
Load the Sun Cluster Agents CD-ROM into the CD-ROM drive.
Run the scinstall utility with no options.
This step starts the scinstall utility in interactive mode.
Select the menu option, Add Support for New Data Service to This Cluster Node.
The scinstall utility prompts you for additional information.
Provide the path to the Sun Cluster Agents CD-ROM.
The utility refers to the CD as the “data services cd.”
Specify the data service to install.
The scinstall utility lists the data service that you selected and asks you to confirm your choice.
Exit the scinstall utility.
Unload the CD from the drive.
For maximum availability of the Sun Grid Engine application, resources that Sun Cluster HA for Sun Grid Engine requires must be available before the Sun Grid Engine management tier is started. An example of such a resource is the Sun Grid Engine file system. To ensure that these resources are available, configure the HAStoragePlus resource type to work with Sun Cluster HA for Sun Grid Engine.
For information about the relationship between resource groups and disk device groups, see Relationship Between Resource Groups and Disk Device Groups in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
Configuring the HAStoragePlus resource type to work with Sun Cluster HA for Sun Grid Engine involves the following operations:
Synchronizing the startups between resource groups and disk device groups as explained in Synchronizing the Startups Between Resource Groups and Disk Device Groups in Sun Cluster Data Services Planning and Administration Guide for Solaris OS
Registering and configuring an HAStoragePlus resource
Become superuser on a node in the cluster that will host Sun Grid Engine.
Register the SUNW.HAStoragePlus resource type.
# scrgadm -a -t SUNW.HAStoragePlus
Add an HAStoragePlus resource for the Sun Grid Engine file system to the resource group that you created in How to Enable Sun Grid Engine to Run in a Cluster.
# scrgadm -a -j sge-hasp-rs \
-g sge-rg \
-t SUNW.HAStoragePlus \
-x FilesystemMountPoints=sge-root
-j sge-hasp-rs: Specifies that the resource that you are creating is named sge-hasp-rs
-g sge-rg: Specifies that the resource is to be added to the resource group that you created in How to Enable Sun Grid Engine to Run in a Cluster
-x FilesystemMountPoints=sge-root: Specifies that the mount point for this file system is the root of the Sun Grid Engine file system
You must use NFS to export the Sun Grid Engine file system to the noncluster nodes. The NFS server that exports this file system must also be protected against failure. To protect the NFS server against failure, use the Sun Cluster HA for NFS data service.
The procedure that follows explains only the special requirements for using Sun Cluster HA for NFS with Sun Cluster HA for Sun Grid Engine. For complete information about installing and configuring Sun Cluster HA for NFS, see Sun Cluster Data Service for NFS Guide for Solaris OS.
Commands in this procedure assume that you have set the $SGE_ROOT environment variable to specify the root of the Sun Grid Engine file system.
Register the SUNW.nfs resource type.
# scrgadm -a -t SUNW.nfs
From any cluster node, create a directory for NFS configuration files.
Create the directory under the root of the Sun Grid Engine file system. Name the directory SUNW.nfs.
# mkdir -p $SGE_ROOT/SUNW.nfs
In the directory that you created in Step 2, create a file that contains the share command for the root of the Sun Grid Engine file system.
Name the file dfstab.sge-nfs-rs, where sge-nfs-rs is the name of the NFS resource that you will create in Step 4.
# echo "share -F nfs -o rw sge-root" \
> $SGE_ROOT/SUNW.nfs/dfstab.sge-nfs-rs
Add a SUNW.nfs resource to the failover resource group that you created in How to Enable Sun Grid Engine to Run in a Cluster.
# scrgadm -a -j sge-nfs-rs \
-g sge-rg \
-t SUNW.nfs \
-y Resource_dependencies=sge-hasp-rs
This example shows the command for creating a dfstab file for the root of the Sun Grid Engine file system.
The root of the Sun Grid Engine file system is /global/gridmaster.
The name of the NFS resource for which this file is created is sge-nfs-rs.
# echo "share -F nfs -o rw /global/gridmaster" \
> /global/gridmaster/SUNW.nfs/dfstab.sge-nfs-rs
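The directory and file steps can be combined into a short script, which makes the relationship between the resource name and the file name explicit. The following is a sketch only. It writes into a scratch directory that stands in for the root of the Sun Grid Engine file system, so that it can be tried on any machine, and the variable names are hypothetical. It assumes a POSIX shell such as ksh or bash.

```shell
# Write the dfstab file for the NFS resource under <root>/SUNW.nfs.
# Assumption: scratch_root stands in for the real root of the
# Sun Grid Engine file system, so nothing under /global is touched.
shared_root=/global/gridmaster           # path that the share command exports
nfs_rs=sge-nfs-rs                        # name of the NFS resource
scratch_root=${TMPDIR:-/tmp}/gridmaster.$$

mkdir -p "$scratch_root/SUNW.nfs"
dfstab="$scratch_root/SUNW.nfs/dfstab.$nfs_rs"
echo "share -F nfs -o rw $shared_root" > "$dfstab"

# Confirm the result.
echo "$dfstab contains:"
cat "$dfstab"
```

On a cluster node, scratch_root would be replaced by the real root of the Sun Grid Engine file system.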
Before you perform this procedure, ensure that the Sun Cluster HA for Sun Grid Engine data service packages are installed.
Use the configuration and registration files in the /opt/SUNWscsge/util directory to register the Sun Cluster HA for Sun Grid Engine resources. The files define the dependencies that are required between Sun Grid Engine components. For information about these dependencies, see Dependencies Between Sun Grid Engine Components. For a listing of these files, see Appendix A, Files for Configuring and Removing Sun Cluster HA for Sun Grid Engine Resources.
Registering and configuring Sun Cluster HA for Sun Grid Engine involves the tasks that are explained in the following sections:
Specifying Configuration Parameters for Sun Cluster HA for Sun Grid Engine Resources
How to Create and Enable Sun Cluster HA for Sun Grid Engine Resources
Sun Cluster HA for Sun Grid Engine provides scripts that automate the process of configuring and removing Sun Cluster HA for Sun Grid Engine resources. These scripts obtain configuration parameters from the sge_config file in the /opt/SUNWscsge/util/ directory. To specify configuration parameters for Sun Cluster HA for Sun Grid Engine resources, edit the sge_config file.
Each configuration parameter in the sge_config file is defined as a keyword-value pair. The sge_config file already contains the required keywords and equals signs. For more information, see Listing of sge_config. When you edit the sge_config file, add the required value to each keyword. Use the values that you identified in Configuration Planning Questions.
The keyword-value pairs in the sge_config file are as follows:
COMMDRS=sge-commd-rs
QMASTERRS=sge-qmaster-rs
SCHEDDRS=sge-schedd-rs
MASTERRG=sge-rg
MASTERLH=sge-lh-rs
MASTERPORT=portno
SGE_ROOT=sge-root-dir
SGE_CELL=cell-name
SGE_VER=6.0|5.3
The meaning and permitted values of the keywords in the sge_config file are as follows:
COMMDRS: Specifies the name that you are assigning to the resource for the Sun Grid Engine communications daemon sge_commd. This keyword is needed only for Sun Grid Engine 5.3 and can be left empty for Sun Grid Engine 6.0.
QMASTERRS: Specifies the name that you are assigning to the resource for the Sun Grid Engine queue master daemon sge_qmaster. This keyword must be defined.
SCHEDDRS: Specifies the name that you are assigning to the resource for the Sun Grid Engine scheduling daemon sge_schedd. This keyword must be defined.
MASTERRG: Specifies the name of the resource group that contains the Sun Cluster HA for Sun Grid Engine resources. This name must be the name that you assigned when you created the resource group as explained in How to Enable Sun Grid Engine to Run in a Cluster. This keyword must be defined.
MASTERLH: Specifies the name of the logical host name resource for Sun Grid Engine. This name must be the name that you assigned when you created the resource in How to Enable Sun Grid Engine to Run in a Cluster. This keyword must be defined.
MASTERPORT: Specifies the port number that is configured for sge_qmaster in /etc/inet/services (normally 536). Although the Sun Cluster HA for Sun Grid Engine data service does not use this value, documenting it here is good practice. The value must be an integer and must always be defined.
SGE_ROOT: Specifies the root directory of the Sun Grid Engine file system. This directory must be the directory that you created for the root of the Sun Grid Engine file system in Preparing the Nodes and Disks. This keyword must be defined.
SGE_CELL: Specifies the cell that Sun Grid Engine references. This keyword must be defined.
SGE_VER: Specifies the version of the installed Sun Grid Engine configuration. This keyword must be defined and can have the value 5.3 or 6.0.
This example shows an sge_config file in which configuration parameters are set as follows:
The name of the resource for the Sun Grid Engine communications daemon sge_commd is "" (that is, unset), because this resource is not needed with Sun Grid Engine 6.0.
The name of the resource for the Sun Grid Engine queue master daemon sge_qmaster is sge_qmaster-rs.
The name of the resource for the Sun Grid Engine scheduling daemon sge_schedd is sge_schedd-rs.
The name of the resource group that contains the Sun Cluster HA for Sun Grid Engine resources is sge-rg.
The name of the logical host name resource for Sun Grid Engine is sge-lh-rs.
The root directory of the Sun Grid Engine file system is /global/gridmaster.
Sun Grid Engine references the default cell.
The port number is set to 536. This number is ignored.
The version for Sun Grid Engine is set to 6.0.
COMMDRS=""
QMASTERRS=sge_qmaster-rs
SCHEDDRS=sge_schedd-rs
MASTERRG=sge-rg
MASTERLH=sge-lh-rs
MASTERPORT=536
SGE_ROOT=/global/gridmaster
SGE_CELL=default
SGE_VER=6.0
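The rules that are stated for the keywords can be checked before the file is used. The following is a minimal sketch, not part of the data service: check_sge_config is a hypothetical helper, and the sketch assumes that sge_config consists only of shell-style keyword=value lines, as the listing above suggests.

```shell
# Sketch: sanity-check an sge_config file against the documented rules.
# check_sge_config is a hypothetical helper, not shipped with the data service.
# Pass a path that contains a slash, because the file is read with ".".
check_sge_config() {
    cfg=$1
    [ -r "$cfg" ] || { echo "cannot read $cfg" >&2; return 1; }
    (
        # Read the file in a subshell so the keywords do not leak into the caller.
        COMMDRS= QMASTERRS= SCHEDDRS= MASTERRG= MASTERLH=
        MASTERPORT= SGE_ROOT= SGE_CELL= SGE_VER=
        . "$cfg" || exit 1
        rc=0
        # Keywords that the text says must be defined.
        for kw in QMASTERRS SCHEDDRS MASTERRG MASTERLH MASTERPORT \
                  SGE_ROOT SGE_CELL SGE_VER; do
            eval "val=\$$kw"
            if [ -z "$val" ]; then
                echo "$kw must be defined" >&2
                rc=1
            fi
        done
        # MASTERPORT must be an integer.
        case $MASTERPORT in
            *[!0-9]*) echo "MASTERPORT must be an integer" >&2; rc=1 ;;
        esac
        # SGE_VER must be 5.3 or 6.0; COMMDRS is needed only for 5.3.
        case $SGE_VER in
            5.3) [ -n "$COMMDRS" ] || { echo "COMMDRS is needed for 5.3" >&2; rc=1; } ;;
            6.0|'') ;;
            *) echo "SGE_VER must be 5.3 or 6.0" >&2; rc=1 ;;
        esac
        exit $rc
    )
}
```

For example, check_sge_config /opt/SUNWscsge/util/sge_config returns zero only if every check passes. The location of sge_config shown here is an assumption that is based on the directory of the registration script.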
Before you begin, ensure that you have edited the sge_config file to specify configuration parameters for Sun Cluster HA for Sun Grid Engine resources. For more information, see Specifying Configuration Parameters for Sun Cluster HA for Sun Grid Engine Resources.
Register the SUNW.gds resource type.
# scrgadm -a -t SUNW.gds
Go to the directory that contains the script for creating the Sun Grid Engine resources.
# cd /opt/SUNWscsge/util/
Run the script that creates the Sun Grid Engine resources.
# ./sge_register
Bring online the failover resource group that you created in How to Enable Sun Grid Engine to Run in a Cluster.
This resource group contains the following resources:
Logical host name resource
HAStoragePlus resource
NFS resource
Sun Grid Engine application resources
# scswitch -Z -g sge-rg
-g sge-rg
Specifies that the resource group that you created in How to Enable Sun Grid Engine to Run in a Cluster is to be brought online.
Before you bring the failover resource group online, ensure that the Sun Grid Engine daemons (sge_qmaster and sge_schedd) are not running. These daemons might be running because the install_qmaster installation script started them, or because they were left running after you performed the verification that is described in How to Verify the Sun Cluster HA for Sun Grid Engine Installation and Configuration.
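The preceding steps can be collected into one function, as in the following sketch. The function name register_sge and the DRY_RUN variable are hypothetical. The cluster commands are the commands that are shown in this procedure, and the pgrep check reflects the requirement that sge_qmaster and sge_schedd are not running.

```shell
# Sketch: run the registration steps in order, stopping at the first failure.
# register_sge and DRY_RUN are hypothetical names used for illustration.
register_sge() {
    rg=${1:-sge-rg}

    # With DRY_RUN set, print each command instead of executing it.
    run() {
        if [ -n "${DRY_RUN:-}" ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    # The daemons must not be running before the group is brought online.
    for daemon in sge_qmaster sge_schedd; do
        if pgrep -x "$daemon" >/dev/null 2>&1; then
            echo "$daemon is still running; stop it first" >&2
            return 1
        fi
    done

    run scrgadm -a -t SUNW.gds &&
    run cd /opt/SUNWscsge/util/ &&
    run ./sge_register &&
    run scswitch -Z -g "$rg"
}
```

Running the function with DRY_RUN=1 prints the four commands for review without changing the cluster configuration.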
Extension properties for Sun Cluster HA for Sun Grid Engine resources are set when you run the script that creates these resources. You need to set these properties only if you require values other than the values that are set by the script. For information about Sun Cluster HA for Sun Grid Engine extension properties, see the SUNW.gds(5) man page. You can update some extension properties dynamically. You can update other properties, however, only when you create or disable a resource. The Tunable entry indicates when you can update a property.
To update an extension property of a resource, run the scrgadm(1M) command with the following option to modify the resource:
-x property=value
property
Identifies the extension property that you are setting.
value
Specifies the value to which you are setting the extension property.
You can also use the procedures in Chapter 2, Administering Data Service Resources, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS to configure resources after the resources are created.
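As an illustration of the -x option, the following sketch builds the complete command line for changing one extension property of an existing resource. The helper name ext_prop_cmd is hypothetical, and the sketch assumes the scrgadm -c -j form for changing a resource. The helper only prints the command so that you can review the command before you run it as superuser.

```shell
# Sketch: print the scrgadm command that changes one extension property.
# ext_prop_cmd is a hypothetical helper; it does not modify the cluster.
ext_prop_cmd() {
    if [ $# -ne 3 ]; then
        echo "usage: ext_prop_cmd resource property value" >&2
        return 2
    fi
    # -c changes an existing object, -j names the resource,
    # -x sets the extension property.
    echo "scrgadm -c -j $1 -x $2=$3"
}
```

For example, ext_prop_cmd sge_qmaster-rs Probe_timeout 60 prints scrgadm -c -j sge_qmaster-rs -x Probe_timeout=60. Probe_timeout is used here only as an example property name. Check the SUNW.gds(5) man page for the properties that apply and for their Tunable entries.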
After you install, register, and configure Sun Cluster HA for Sun Grid Engine, verify the installation and configuration. This verification determines whether the Sun Cluster HA for Sun Grid Engine data service makes the Sun Grid Engine application highly available.
Become superuser on a node that will host Sun Grid Engine.
Verify that all Sun Grid Engine resources are online.
# scstat
If a Sun Grid Engine resource is not online, enable the resource.
# scswitch -e -j sge-rs
Switch the Sun Grid Engine resource group to another cluster node.
# scswitch -z -g sge-rg -h node
The Sun Cluster HA for Sun Grid Engine fault monitors verify that the following daemons are running correctly:
Queue master daemon sge_qmaster
Scheduling daemon sge_schedd
Each Sun Cluster HA for Sun Grid Engine fault monitor is contained in the resource that represents the corresponding Sun Grid Engine component. You create these resources when you register and configure Sun Cluster HA for Sun Grid Engine. For more information, see Registering and Configuring Sun Cluster HA for Sun Grid Engine.
System properties and extension properties of these resources control the behavior of the fault monitor. The default values of these properties determine the preset behavior of the fault monitor. The preset behavior should be suitable for most Sun Cluster installations. Therefore, you should tune the Sun Cluster HA for Sun Grid Engine fault monitor only if you need to modify this preset behavior.
Tuning the Sun Cluster HA for Sun Grid Engine fault monitors involves the following tasks:
Setting the interval between fault monitor probes
Setting the timeout for fault monitor probes
Defining the criteria for persistent faults
Specifying the failover behavior of a resource
For more information, see Tuning Fault Monitors for Sun Cluster Data Services in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
The config file in the /opt/SUNWscsge/etc directory enables you to activate debugging for Sun Grid Engine resources. This file enables you to activate debugging for all Sun Grid Engine resources or for a specific Sun Grid Engine resource on a particular node. If you require debugging for Sun Cluster HA for Sun Grid Engine to be enabled throughout the cluster, repeat this procedure on all nodes.
Determine whether debugging for Sun Cluster HA for Sun Grid Engine is active.
If debugging is inactive, daemon.notice is set in the file /etc/syslog.conf.
# grep daemon /etc/syslog.conf
*.err;kern.debug;daemon.notice;mail.crit /var/adm/messages
*.alert;kern.err;daemon.err operator
#
If debugging is inactive, edit the /etc/syslog.conf file to change daemon.notice to daemon.debug.
Confirm that debugging for Sun Cluster HA for Sun Grid Engine is active.
If debugging is active, daemon.debug is set in the file /etc/syslog.conf.
# grep daemon /etc/syslog.conf
*.err;kern.debug;daemon.debug;mail.crit /var/adm/messages
*.alert;kern.err;daemon.err operator
#
Restart the syslogd daemon.
# pkill -1 syslogd
Edit the /opt/SUNWscsge/etc/config file to change DEBUG= to DEBUG=ALL or DEBUG=sge-rs.
# cat /opt/SUNWscsge/etc/config
#
# Copyright 2003 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
# Usage:
# DEBUG=<RESOURCE_NAME> or ALL
#
DEBUG=ALL
#
To deactivate debugging, reverse the preceding steps.
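The file edits in this procedure can also be scripted. The following is a minimal sketch under stated assumptions: the function names debug_is_active and enable_sge_debug are hypothetical, and both functions take the file paths as arguments so that you can try them on copies of /etc/syslog.conf and /opt/SUNWscsge/etc/config first. The sketch avoids in-place sed options, which are not available in every sed implementation.

```shell
# Sketch: script the file edits of the debugging procedure.
# Both function names are hypothetical.

# Succeeds if daemon.debug is set in the given syslog.conf file.
debug_is_active() {
    grep 'daemon\.debug' "$1" >/dev/null 2>&1
}

# Change daemon.notice to daemon.debug in the syslog.conf file and set
# DEBUG=<target> in the config file. The target defaults to ALL.
enable_sge_debug() {
    syslog_conf=$1
    sge_conf=$2
    target=${3:-ALL}
    tmp=${TMPDIR:-/tmp}/sge_debug.$$

    # Write to a temporary file, then copy the content back with cat so that
    # the ownership and permissions of the original files are kept.
    sed 's/daemon\.notice/daemon.debug/' "$syslog_conf" > "$tmp" &&
        cat "$tmp" > "$syslog_conf" &&
        sed "s/^DEBUG=.*/DEBUG=$target/" "$sge_conf" > "$tmp" &&
        cat "$tmp" > "$sge_conf"
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

After the files are changed, syslogd must still be restarted with pkill -1 syslogd, as shown in the procedure. Only the uncommented DEBUG= line is rewritten, so the usage comment in the config file is left intact.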