This manual contains information about accessing and using GEMM as well as managing N1GE versions.
The Grid Engine Management Module (GEMM) for the Sun™ Control Station allows you to install and set up a grid. It also allows you to monitor the performance of the hosts in the grid. This document explains the features and services available through the Grid Engine control module.
GEMM supports the following operating systems and hardware platforms:
Solaris 9 and 10 on SPARC, x86, and x64 hardware platforms
RedHatLinux 7.3, RedHatLinux 8.0, RedHatLinux 9 on x86 hardware platforms
FedoraLinux 1, FedoraLinux 2, FedoraLinux 3 on x86 and x64 hardware platforms
RedHatLinux 2.1 WS, RedHatLinux 2.1 AS, RedHatLinux 2.1 ES on x86 hardware platforms
RedHatLinux 3 WS, RedHatLinux 3 AS, RedHatLinux 3 ES on x86 and x64 hardware platforms
SuSELinux 9.0 on x86 and x64 hardware platforms
JDS 1, JDS 2 on x86 hardware
The GEMM module allows you to:
manage different versions of N1 Grid Engine
install a master host on the grid
install additional compute hosts on the grid
monitor and diagnose the performance of the grid
uninstall the Grid Engine components from selected compute and access hosts
uninstall the Grid Engine components from the master host and all compute and access hosts
check the status of Grid Engine (from Station Settings > Active Monitor on the SCS)
configure the monitoring settings for the grid
The following sections describe each of these functions.
In Grid Engine terminology, compute hosts are called execution hosts. Access hosts are called submit hosts.
GEMM is part of the Grid Engine Update 4 distribution and is included in the Gemm/tar/n1ge-6_0u4-gemm.tar.gz package. Use the following steps to install GEMM.
Unpack the GEMM archive file.
Add GEMM to a Sun Control Station 2.2 system. See the Sun Control Station documentation for instructions on adding a new management module.
You access the GEMM features by clicking on the Monitor menu item on the Sun Control Station main screen as shown in the following figure.
In most of the short procedures in this chapter, the first step is to click the Grid Engine item in the left menu bar and the second step is to click on a sub-menu item. To reduce the number of steps in each procedure, the menu commands are grouped together and shown in Initial Caps. Right-angle brackets separate the individual items. For example, select Grid Engine > Settings means to click Grid Engine in the left menu bar and then click the Settings sub-menu item.
When you launch a task (for example, when installing a master host or uninstalling a host), a Task Progress dialog appears in the user interface (UI). This dialog has a Status field indicating the current status of the task and a progress bar. When the progress bar displays 100%, the task has completed.
If you want to perform another task in the UI while the current task is underway, you can put the Task Progress dialog in the background. Simply click the Run Task In Background button located below the progress bar.
To return to the Task Progress dialog, select Administration > Tasks on the left. The Task table appears. If the task is still underway, a status message displays in the Duration column. Click on the progress-bar icon in this column to re-display the Task Progress dialog for this task.
Once the task is complete and the progress bar displays 100%, two buttons appear below the Task Progress dialog: Done and View Events.
To view the list of events associated with the completed task, click View Events. The Events For <Task> table appears. If you then click the up-arrow icon in the top-right corner, the Tasks table appears.
To return to the previous screen, click Done.
GEMM allows you to upload one or more versions of N1 Grid Engine, and choose which one to deploy on the grid. To do these tasks, click the Versions menu item to display the Version Management page.
This page is where you can upload, modify and manage different versions of N1 Grid Engine software.
You can only deploy one version at any given time. If you wish to deploy another version, you must first uninstall the grid from all the hosts.
In the Version list, each version has three icons:
The Minus icon removes a version and all its files.
The Modify icon lets you rename a version.
The Inspect icon lets you add or remove individual files from a version.
The Version Management main page displays a list of versions currently available on the SCS server. Initially, no versions are defined.
Click the Add button which produces a dialog where you name a version.
The version name can be anything, as long as it contains no whitespace and no punctuation characters other than “-” and “_”.
After you name the version, click the Submit button.
Once you submit the name, the Version list displays again with the newly-created version present. You can add more versions at any time.
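The naming rule for versions can be checked before you submit a name. The following is a minimal sketch, not part of GEMM. It assumes the rule means that only ASCII letters, digits, “-” and “_” are accepted; the function name is hypothetical.

```python
import re

# Assumption: a valid name consists only of ASCII letters, digits, "-" and "_".
_VERSION_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_version_name(name: str) -> bool:
    """Return True if the name has no whitespace and no punctuation
    other than '-' and '_' (and is not empty)."""
    return _VERSION_NAME.fullmatch(name) is not None
```

For example, a name such as n1ge-6_0 passes this check, while a name containing a space or a period does not.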
You must first add files to a version before you deploy it to the grid. Adding files consists of adding all N1 Grid Engine package files that are part of the given version.
The following criteria apply to the package files:
GEMM requires N1 Grid Engine 6 Update 4 or later.
All package files must be in .tar.gz format. Although N1 Grid Engine currently is made available in .pkg format for Solaris, the .pkg format files cannot be used by GEMM.
For any given version, there must be the “common” package for that version, as well as all the “bin” (binary) packages which support the kinds of hosts in your grid. For example, if your grid consists of Solaris 9 SPARC hosts as well as Solaris 10 and x64 hosts, then you must include in the version these files:
where -<name> is the name given to that version, for example n1ge-6_0.
In any given version, you cannot mix different update levels of N1 Grid Engine. All packages associated with a version must belong to the same update level, for example n1ge-6_0. The only exception is when you deploy N1GE 6 patches, which is described later in this chapter.
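Some of the criteria above can be checked on a list of file names before you upload them. The following is a rough sketch only. It assumes that package file names contain the word "common" or "bin" to identify the package type; this naming convention and the function name are assumptions for illustration and are not taken from the N1GE distribution. The sketch does not check update levels or host architectures.

```python
def check_version_files(filenames):
    """Return a list of problems found in a candidate set of package files.

    Assumption: the "common" package has "common" in its file name and
    binary packages have "bin" in their file names.
    """
    problems = []
    for name in filenames:
        # GEMM accepts only .tar.gz package files, not .pkg files.
        if not name.endswith(".tar.gz"):
            problems.append(f"{name}: not a .tar.gz file")
    if not any("common" in name for name in filenames):
        problems.append("no common package found")
    if not any("bin" in name for name in filenames):
        problems.append("no bin package found")
    return problems
```

An empty list means that none of the checked criteria failed. You must still confirm that a bin package is present for every kind of host in your grid.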
To add files to a version, click the Inspect icon in the version list. You will see a list of files currently contained in that version.
Clicking the Add button produces a dialog box where you can load version files one at a time. You can also upload files from the local browser using the File browser or from a remote URL.
When you upload files from a remote URL, you can only specify a URL which can be accessed from the SCS server directly without going through a proxy server. You cannot specify a proxy server when using the Version Management web dialog. Please see the documentation for the command-line equivalent of Version Management, gemmVersionMgmt.pl, to learn how to upload files using a web proxy.
N1GE software updates are made available through the mechanism of patch files. You cannot use an N1GE patch alone; you must use it in conjunction with a full distribution of N1GE software. When you install a patch, it replaces various files in the existing full version.
There are two ways you can install N1GE patches:
Install patch files on a live N1GE grid already running an existing full installation of N1GE 6. This procedure is described in the patch documentation but is not supported by GEMM.
Install patch files at the same time as you install a fresh installation of an original, full, N1GE software distribution. You can use this technique when you are creating a new grid and want to install it with the latest N1GE updates. You also can use this approach when you want to use N1GE with the latest updates and don't mind getting rid of your old setup entirely (without worrying about saving old configurations or maintaining jobs currently in the system). GEMM can handle this procedure automatically as described here.
Create a new version in the Version Manager of GEMM. Populate this version with N1GE files from an original full version, just as if you were going to deploy this version.
Get the desired patch files. When Sun Microsystems creates and releases patch updates, these files are made available on the SunSolve website (http://sunsolve.sun.com). For each patch release, there is one patch for the N1GE "common" package, as well as one patch for each architecture-specific package. Get all the patch files necessary for your particular environment.
Patch files are distributed in both .pkg format and .tar.gz format. Make sure to obtain only the .tar.gz form of the patches. These patch files are themselves contained in a ZIP archive; be sure to unzip the archive to extract the .tar.gz files.
Put these .tar.gz patch files into the previously created version, using either the Version Manager web UI or the command line. You can now use GEMM to deploy this version onto any Grid host, just as with an original, unpatched version of N1GE.
Be sure that only one patch level of N1GE is deployed across the grid. Take care to avoid mixing different patch levels in the same distribution. Also, do not use patch files for only some of the architecture-specific packages required for your environment; use them for all of them or for none.
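The step of extracting the .tar.gz patch files from the downloaded ZIP archive can be scripted. The following is a minimal sketch that uses only the Python standard library. The function name and the paths you pass to it are hypothetical; the sketch assumes only that the patch files inside the archive have names ending in .tar.gz.

```python
import zipfile
from pathlib import Path


def extract_tar_gz_members(zip_path, dest_dir):
    """Extract only the .tar.gz members of a ZIP archive into dest_dir.

    Returns the paths of the extracted files.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip READMEs, .pkg files, and anything else in the archive.
            if member.endswith(".tar.gz"):
                archive.extract(member, dest)
                extracted.append(dest / member)
    return extracted
```

The returned files are the ones you would then add to the version, using either the Version Manager web UI or the command line.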
To set up a compute grid, you must first select one of the managed hosts to be the master host. You can then set up additional compute hosts.
After you install the N1GE6 software on a server and add hosts to the grid using GEMM, all the N1GE daemons on the hosts will be running, but you must submit jobs separately.
For documentation on the N1GE6 software, refer to the user manuals at the following URL:
You can configure only one managed host to be the master host. If you have already configured a master and you select the Install Master sub-menu item, a message appears that you have already configured a master for the compute grid.
The Grid Engine module deploys only a dedicated N1GE6 master host. Unless you plan to have relatively low job throughput on your grid, you should not have the N1GE6 master host also act as a compute host. To add a host as a master host in the compute grid, you must first import the host into the SCS 2.2 framework. For more information, see “About Adding Managed Hosts” in the SCS 2.2 Release Notes.
The SCS server cannot serve as an N1 Grid Engine (N1GE) Master Host or Compute Host, since only SCS clients can have those roles. An SCS server cannot also be an SCS client at the same time. Thus, the SCS server has to be a different host than either the N1GE Master Host or Compute Hosts.
To install a master host for the grid:
Select Grid Engine > Install Master.
The selector appears, displaying the list of managed hosts; see Installing Grid Engine Hosts.
Click to highlight the managed host that you want to configure as the master host in the compute grid.
Pick a Version from the list presented.
The version picked at this step will be installed on the Master as well as all the hosts in the grid.
Click Install in the bottom right corner.
The Task Progress dialog appears.
Once you have configured one of the managed hosts as the master host, you can add additional hosts to act as compute hosts or access hosts in the grid.
To add a host as a compute host in the grid, you must first import the host into the Sun Control Station framework. For more information, see “About Adding Managed Hosts” in the SCS 2.2 Product Notes.
Before you can add a compute host to a grid, you must first designate a master host. If you have not yet designated a master host, the system instructs you to do so. For more information, see Installing Grid Engine Hosts.
Select Grid Engine > Install Compute Host.
The selector appears, displaying the list of managed hosts; see the previous figure.
Click to highlight one or more hosts. You can also click Select All at the top to choose all hosts in the list.
You can install the selected hosts as either compute hosts or access hosts. Click the corresponding button at the bottom of the page, labelled “Install Compute Hosts” or “Install Access Hosts”.
The Task Progress dialog appears. When the installation completes, a new dialog box appears which allows you to either finish the installation or view the installation events. If you choose View Events, a dialog similar to the following appears.
When you are finished installing hosts, click Done.
When you click the Monitor Grid menu item, a page with a high-level overview of the state of the grid appears.
This page has tables that allow you to:
View Summary Status
Examine Cluster Queue status
Check Job Alerts
Check Host Alerts
Check Queue Alerts
Buttons on the main page let you go to pages where you can:
View Job Details
View Queue Details
View Host Details
Examine Daemon Log files
Also available from the SCS menu is the ability to quickly see the state of the Grid by choosing Station Settings > Active Monitor.
The Summary Status table shows the total number of jobs in various states (pending, running, suspended, and so forth). It also shows the load averaged across all compute hosts and the total amount of used and installed memory summed over all compute hosts.
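The way the load and memory figures are aggregated can be illustrated as follows. This is a sketch of the arithmetic only, not GEMM code; the HostSample type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    """Monitoring values for one compute host (hypothetical field names)."""
    load: float
    mem_used_mb: float
    mem_total_mb: float


def summarize(hosts):
    """Average the load and sum the memory over all compute hosts."""
    if not hosts:
        return {"avg_load": 0.0, "mem_used_mb": 0.0, "mem_total_mb": 0.0}
    return {
        # Load is averaged across hosts.
        "avg_load": sum(h.load for h in hosts) / len(hosts),
        # Used and installed memory are summed over hosts.
        "mem_used_mb": sum(h.mem_used_mb for h in hosts),
        "mem_total_mb": sum(h.mem_total_mb for h in hosts),
    }
```

For two hosts with loads of 1.0 and 3.0, the sketch reports an average load of 2.0 and the sum of both hosts' memory values.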
The subheading of this table contains a timestamp for when the data was obtained. By default, most monitoring data is automatically refreshed every minute. To display the most up-to-date database information in the tables, click the Monitor Grid menu item again. You can also reload the browser window. If the monitoring is not working properly for any reason, the subheading displays a warning and displays the timestamp for when the data was most recently obtained. This timestamp applies to all monitoring information displayed in GEMM, not just the Summary Status table.
Above this table is the Update button. Clicking this button retrieves the data immediately instead of waiting for the next one-minute interval. A progress bar shows the progress of the update. When the update completes, click the Done button to return to the main Monitor Grid page with the new data and updated timestamp.
If an update of the monitor is already in progress when you click the button, a message indicates this situation. As soon as the update in progress completes, the Update button will again be available to force a new update.
You access the Jobs details page by clicking the Jobs button in the Summary Status table on the main Monitor page.
This page has a table which shows a summary of all current jobs in the system including jobs which are pending, running, suspended, held, or in an error state. Completed jobs are not listed. The top row of three buttons lets you see the list of jobs according to three different views: Overview, Utilization, and Allocation.
The initial view is always the Overview. Clicking any of the other buttons displays the other corresponding views. In all views, the back button on the table leads back to the main page.
Also present in all views at the bottom of the frame is the Filter, which you can use to limit the jobs displayed by providing configured criteria. Finally, the three buttons corresponding to the three different views are always shown at the top of each view, allowing you to move directly among the three views.
The Overview view shows an overall summary of the jobs.
The columns in this table provide the:
Job state, indicated by one or more letters plus a colored circle and icon
User who submitted the job
Project under which the job was submitted
Department of the submitter
Priority of the job
Job time, either the time spent pending, or for running jobs the time spent running
Job task ID; for pending jobs, all task IDs are grouped together
The icon scheme for the job state is:
A Gray Icon means the job is pending
A Green Icon means the job is running
A Yellow icon indicates the job is suspended
A Red Icon indicates that the job is in an error state
The letters shown for the job state are the same letters used by N1 Grid Engine to indicate the job state when you run the qstat command. For more information, see the N1 Grid Engine Administration manual.
Jobs display ten rows at a time. You can see the entire list by using the pagination controls at the bottom of the table. By default, rows are displayed numerically by job ID, but you can use any column whose header is white to change the ordering of the rows. Clicking on a column header sorts the rows according to the values in that column. Clicking again on the column header reverses the sort. The sorting is preserved across pages if you click on a pagination button.
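The interaction between sorting and pagination can be pictured with a short sketch. It assumes that the whole list is sorted first and then cut into pages of ten rows, which is one way to keep the ordering consistent from page to page. The function and the row layout (a list of dictionaries) are hypothetical and do not reflect how GEMM is implemented.

```python
def page_of(rows, column, page, reverse=False, page_size=10):
    """Return one page of rows after sorting the whole list by a column.

    rows:    list of dictionaries, one per job
    column:  key to sort on, for example "id"
    page:    zero-based page number
    reverse: True after a second click on the column header
    """
    ordered = sorted(rows, key=lambda row: row[column], reverse=reverse)
    start = page * page_size
    return ordered[start:start + page_size]
```

Because every page is cut from the same sorted list, moving to the next page keeps the ordering chosen on the previous page.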
Clicking the Inspect icon next to the ID of each job retrieves details about the job. A progress bar indicates the progress of this process. When the Done button appears, clicking it leads to a page with the details displayed for the chosen job. These details appear in three tables.
The first table shows the job details, including various properties related to the job's environment, resource requests, submit options, and so forth.
The second table shows the current resource utilization for that job. If this information is not available, for example, because the job started too recently or the job is still pending, then this table is empty. For jobs with multiple tasks, the usage of each task appears on a separate line.
The third table shows the scheduling information for that job.
The information displayed in these three tables corresponds directly to the output from the N1 Grid Engine 6 qstat -j command. For more information on job details, see the N1 Grid Engine 6 Administration manual.
Clicking the Back button of the first table returns you to the Overview page.
You access the Utilization view of the job by clicking the Utilization button on the Jobs page.
Unlike the Overview view, only running and suspended jobs appear. In the Utilization view, the columns are the:
Job state, indicated by a colored circle and icon
Queue instance where the job is running
CPU utilization of the job
Memory utilization of the job
Normalized Ticket priority
Normalized Urgency priority
Normalized POSIX priority
Job task ID; tasks belonging to the same job are never grouped
If the CPU usage or memory usage values are blank, the usage information for that job has not yet been reported. Check back at a later time to see if the usage is then reported.
The description for the Overview page regarding the meaning of the icons for the job state is the same for this view, except that no letters are shown. The pagination of the table and the sorting based upon different columns all apply similarly to the Utilization View.
An Inspect icon for the job Task ID is displayed for all jobs above the final column. Clicking this icon retrieves the current diagnostic information for that job. This diagnostic information corresponds to the data found in the job spool files in the job's spool directory. A progress bar indicates the progress of this process. When the Done button appears, clicking it leads to a page with the status information displayed for the chosen job as in the following figure.
You can only obtain job diagnostic information if the job is running on a compute host that was deployed by GEMM. If the host on which the job is running was not deployed by GEMM, then clicking the Inspect icon results in an error message; clicking Done leads back to the Utilization view.
The Job Diagnostic details given in these tables include:
Interpreting the Tables
Each table corresponds to a different file from the job spool directory. For more information on the information in the job spool directory, see the N1 Grid Engine 6 Administration manual.
Clicking the back button of the addgrpid table returns you to the Utilization view.
If a job has already completed by the time you click the Inspect button, or if the job completes during the information retrieval process, the information is lost and cannot be displayed. In this case, the progress bar will indicate a failure and clicking on the Done button leads back to the Utilization view.
Clicking the Allocation button switches to the Allocation view of the jobs.
In this view, information is presented for all jobs and the columns provide details for the:
Job state, indicated by a colored circle and icon
Total number of tickets for the job
Number of override tickets
Number of functional tickets
Number of share tree tickets
Total urgency for the job
Resource contribution to the urgency
Deadline contribution to the urgency
Waiting time contribution to the urgency
The description of the icons for the job state on the Overview page applies here also, except that no letters are shown. The pagination of the table and the sorting based upon different columns apply similarly to the Allocation View.
For more information on the meaning of each column, see the N1 Grid Engine Administration manual.
In each of the three views Overview, Utilization, and Allocation, the Filter option appears below the job table.
You use the filter to limit the jobs displayed to those matching a specified search condition. The filter lets you choose a column on which to filter, a search type to use, and a value on which to search.
You select the column and search type from a drop-down table, while you type the value into a text entry box. The drop-down table for column changes with each view depending on which columns are being displayed. The type of search can be one of: equals, not equals, less than, less than or equal to, greater than, and greater than or equal to.
You can define up to three filters at one time; the effects of multiple filters are combined together to provide the final result. After you set up the desired filter, click the Filter button to redisplay the current view with the filter applied. Pagination is still active and will maintain the filter across pages. Clicking the Clear button restores the unfiltered view.
The following figure shows you how a sorted jobs utilization page would look.
When you choose the Job State as a search column, the search value is compared against the job status letter code as displayed in the Overview view, even though these letters are not displayed for the Utilization and Allocation view.
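The filter behavior can be sketched in a few lines. The sketch assumes that multiple filters are combined with a logical AND, so that a job is shown only if it matches every filter; the text says only that the effects are combined. The names and the row layout are hypothetical.

```python
import operator

# The six search types offered by the Filter.
SEARCH_TYPES = {
    "equals": operator.eq,
    "not equals": operator.ne,
    "less than": operator.lt,
    "less than or equal to": operator.le,
    "greater than": operator.gt,
    "greater than or equal to": operator.ge,
}

MAX_FILTERS = 3  # up to three filters can be defined at one time


def apply_filters(rows, filters):
    """Keep the rows that match every (column, search_type, value) filter.

    Assumption: filters are combined with AND.
    """
    if len(filters) > MAX_FILTERS:
        raise ValueError("at most three filters can be defined")
    return [
        row for row in rows
        if all(SEARCH_TYPES[kind](row[column], value)
               for column, kind, value in filters)
    ]
```

Calling apply_filters with an empty filter list returns all rows, which corresponds to the unfiltered view restored by the Clear button.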
You access the Queue Details page by clicking the Queue button in the Summary Status table on the main Monitor page.
Note that this table provides information on all queue instances on the currently selected master host, including instances on hosts that were not added by GEMM framework. The information appears in groups of ten rows at a time, with the ability to page back and forth between the rows.
For each queue instance, there are columns for the Queue instance name, the status, the total number of slots and number of used slots. The status is indicated by a colored circle and icon similar to the Job Alerts previously described. The only additional feature is a green icon to indicate queue instances that have no alert conditions. Clicking the Back icon in the table header returns you to the Monitor Grid main page.
By default, rows display alphabetically by queue instance name but you can use any column whose header is written in white to change the ordering of the rows. Clicking on a column header sorts the rows according to the values in that column; clicking again on the column header reverses the sort. The sorting is preserved across pages if you click a pagination button.
The final column of each row has an Inspect icon. Clicking on this icon displays a table with the full details for that queue instance. The final entry in this table shows the timestamp when the data was obtained. For information on the meaning of the other table entries, consult the N1 Grid Engine 6 Administration manual. Clicking on the Back icon for this table returns you to the Queue Details page.
You access the Host Details page by clicking the Host button in the Summary Status table on the main Monitor page.
This page displays a table with the state of all the compute hosts that are members of the grid. The title of the table also indicates which host is currently chosen as the Proxy Host.
Note that this table has information on all compute hosts reporting to the currently-chosen master host, including those that were not added by GEMM framework.
The information appears in groups of ten rows at a time, with the ability to page back and forth between the rows. For each host, there are columns for the Hostname, Architecture, Load per CPU, Memory in use, Total Memory, and Swap Space in use. The status is also indicated by a colored circle and icon similar to the Host Alerts table with an additional green icon to indicate hosts that have no alert conditions. Clicking the Back icon in the table header returns you to the Monitor Grid main page.
By default, rows display alphabetically but you can use any column whose header is white to change the ordering of the rows. Clicking on a column header sorts the rows according to the values in that column; clicking again on the column header reverses the sort. The sorting is preserved across pages if you click a pagination button.
The final column of each row has an Inspect icon. Clicking on this icon displays a table where full details for that host appear. The final entry in this table shows the timestamp when the data was obtained. For information on the meaning of the other table entries, please consult the N1 Grid Engine 6 Administration manual. Clicking the Back icon on this table returns you to the Host Details page.
You access the Grid Engine Daemon Logs page by clicking the Daemons button in the Summary Status table on the main Monitor page.
The Logs page contains a table which displays the names of all compute hosts that were deployed by GEMM, plus the name of master host if it was deployed by GEMM.
Two additional columns are also shown. The first column, labeled Master, contains an Inspect icon for the master host. The second column, labeled execd, contains an Inspect icon for each compute host. Clicking these icons lets you retrieve the actual log message files.
If the master host was not deployed by GEMM, no host in the table will have the Inspect icon for the Qmaster column. Similarly, if there are compute hosts that were not deployed by GEMM, these hosts will not appear in this table. Clicking the Back icon in the table header returns you to the Monitor Grid main page.
Clicking an Inspect icon retrieves and displays the qmaster or execd daemon messages file for the corresponding host. A progress bar indicates the progress of this process. When the Done button appears, clicking it displays the contents of the chosen messages file with each line appearing in its own row in a table. Rows display 25 at a time with the ability to page through them.
The rows display in reverse chronological order, so that the most recent message appears at the top of the list. Clicking on the Back icon for this table returns you to the Grid Engine Daemon Logs page. For more information on daemon messages, see the N1 Grid Engine 6 Administration manual.
The first column of this table shows a colored circle and icon to indicate the severity of that message. A green circle indicates a message of type Info. A yellow circle indicates a message of type Warning or Critical. A red circle indicates a message of type Error. The second column shows the time stamp for the message and the third column shows the actual text of the message.
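The ordering and color scheme of the messages table can be summarized in a short sketch. The tuple layout and the function name are hypothetical; only the mapping from message type to color and the newest-first ordering come from the description above.

```python
# Message type to circle color, as described for the messages table.
SEVERITY_COLOR = {
    "Info": "green",
    "Warning": "yellow",
    "Critical": "yellow",
    "Error": "red",
}


def to_rows(messages):
    """Turn (timestamp, message_type, text) tuples into table rows.

    Rows are (color, timestamp, text), newest message first.
    Timestamps only need to be comparable with each other.
    """
    ordered = sorted(messages, key=lambda m: m[0], reverse=True)
    return [(SEVERITY_COLOR[mtype], ts, text) for ts, mtype, text in ordered]
```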
This table shows a summary of the state of all the cluster queues configured on the grid, indicating the numbers of slots in various states. For information on cluster queues, see the N1GE 6 Administration Guide.
This table shows all hosts where the threshold for either the load or memory has been crossed. There are two types of alerts, each indicated by a different colored circle and icon.
A warning alert is indicated by a yellow icon. This alert displays if the load goes above the load warning threshold or the memory goes below the memory warning threshold.
A critical alert is indicated by a red icon. This alert displays if the load goes above the load critical threshold or the memory goes below the memory critical threshold.
The Host Alerts table is empty if no hosts have crossed any threshold. You configure the values for the load and memory warning and critical thresholds on the Settings page.
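The threshold logic for host alerts can be expressed as a small function. This is a sketch under stated assumptions: the load is compared as system load divided by the number of CPUs, memory is compared as megabytes of free memory, and a critical alert takes precedence when both levels are crossed. The function name and the dictionary keys are hypothetical.

```python
def host_alert(load, cpus, free_mem_mb, settings):
    """Classify a host as "critical", "warning", or None (no alert).

    settings holds the four thresholds from the Settings page under
    hypothetical keys: load_warning, load_critical,
    memory_warning, memory_critical.
    """
    load_per_cpu = load / cpus
    # Assumption: the critical level is checked first.
    if (load_per_cpu > settings["load_critical"]
            or free_mem_mb < settings["memory_critical"]):
        return "critical"
    if (load_per_cpu > settings["load_warning"]
            or free_mem_mb < settings["memory_warning"]):
        return "warning"
    return None
```

A host for which the function returns None would not appear in the Host Alerts table.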
This table shows queue instances that are not in the usual running state. There are three types of alerts, each indicated by a different colored circle and icon.
A red icon indicates the queue instance is in either the Unknown or Error state.
A yellow icon indicates the queue instance is in either an Alarm or Suspended state.
A gray icon indicates the queue instance is in a Disabled state.
The exact state of the queue instance is also given in the Status column. For more information on queue instance states, see the N1 Grid Engine 6 Administration Manual.
This table displays grid jobs which are not in the usual running state. There are two types of alerts, each indicated by a different colored circle and icon.
A red icon indicates the job is in an Error state.
A yellow icon indicates the job's pending time has exceeded the pending time threshold.
You configure the values for the pending time threshold on the Settings page. For more information on job states, see the N1 Grid Engine 6 Administration manual.
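The two job alert conditions can be sketched as follows. The function name and parameters are hypothetical. The sketch assumes that an error state takes precedence over a long pending time and that the threshold is given in hours, as on the Settings page.

```python
def job_alert(in_error, pending_hours, max_pending_hours):
    """Return "red", "yellow", or None for a single job.

    in_error:          True if the job is in an Error state
    pending_hours:     hours spent pending, or None if the job is not pending
    max_pending_hours: the Maximum Job Pending Time setting, in hours
    """
    if in_error:
        return "red"
    if pending_hours is not None and pending_hours > max_pending_hours:
        return "yellow"
    return None
```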
You can quickly see the status of the Grid by using the SCS Active Monitor feature. Choose Station Settings > Active Monitor and scroll down the page to the Base Services table shown in the following figure.
When the status of the grid changes due to an event like a queue alert, the button next to the Grid Engine entry changes color in the following way:
Green: N1GE is up and running fine.
Yellow: the SCS cannot contact the proxy host or cannot obtain monitoring information from it but it is still possible that the master is running.
Red: the proxy host indicates that the master is down.
Grey: N1GE is not installed anywhere.
When you click the Settings menu item, a table displays with all the configurable settings available in GEMM.
The parameters are grouped in four categories: Monitor Alert settings, N1GE settings, NFS mount settings and Proxy settings.
These settings affect the display of alerts in the GEMM Monitor. All these parameters must be set using decimal numbers. Any other type of input produces a formatting error.
Load Warning -- You use this parameter to specify the load warning threshold. If this threshold is exceeded, a load warning alert appears in the Monitor. The value is in terms of system load, as reported by the OS, divided by the number of CPUs.
Certain microprocessors with special features such as hyperthreading may be registered as having more than one CPU per physical CPU socket, depending upon factors such as the BIOS or PROM configuration.
Load Critical -- You use this parameter to specify the load critical threshold. If this threshold is exceeded, a load critical alert appears in the Monitor. Similar to the Load Warning parameter, you set this parameter in terms of the system load scaled by number of CPUs.
Memory Warning -- You use this parameter to set the memory warning threshold. If the value drops below this threshold, a memory warning alert appears in the Monitor. You set the parameter value in terms of megabytes of free virtual memory.
Memory Critical -- You use this parameter to set the memory critical threshold. If the value drops below this threshold, a memory critical alert appears in the Monitor. You set the value in terms of megabytes of free virtual memory.
Maximum Job Pending Time -- You use this parameter to specify the amount of time that a job spends pending after which a Job Pending alert appears in the Monitor. You set the value in hours.
It is important that you set these five parameters to sensible values, according to the characteristics of your particular grid. Otherwise, an excessive number of alerts will appear on the Monitor main page, cluttering the display.
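The requirement that these parameters be decimal numbers can be illustrated with a small validation sketch. It assumes that a decimal number means digits with an optional fractional part, such as 24 or 1.5. GEMM's own validation rules are not documented here and may differ; the function name is hypothetical.

```python
import re

# Assumption: digits with an optional fractional part, for example "24" or "1.5".
_DECIMAL = re.compile(r"\d+(\.\d+)?")


def parse_threshold(text):
    """Convert a Monitor Alert setting to a number, or raise ValueError."""
    text = text.strip()
    if not _DECIMAL.fullmatch(text):
        raise ValueError(f"not a decimal number: {text!r}")
    return float(text)
```

With this sketch, input such as "high" or "1,5" is rejected, which corresponds to the formatting error described above.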
The N1GE settings affect the way N1GE is installed onto the master, compute and access hosts. The N1GE administrator must determine the various parameter values suited to their local Grid environment.
Factors you should determine include the local namespace for users, TCP services, file directory structure, operating system, and so forth. The values have default options which are suitable for a generic installation. You should be familiar with the N1GE 6 product before changing any of these values. If you wish to change more advanced configuration settings, please see Chapter 3, Using the Setup configuration file.
Once you deploy the master host, you cannot edit these values, which remain in effect for all further deployments of compute and access hosts. You can only edit the values again if you uninstall the master host. The following section describes each setting.
SGE Root -- This setting is the root directory under which the N1GE files will be installed. Note that the files will be installed on all hosts in this directory.
SGE Cell -- This setting is the N1GE cell name used for the deployment.
Qmaster TCP Port -- This setting is the TCP port to use for the N1GE qmaster daemon.
Execd TCP Port -- This setting is the TCP port to use for the N1GE execd daemon.
Admin Username -- This setting is the username of the N1GE admin user.
Admin UID -- This setting is the UID of the N1GE admin user.
Grid Engine Version -- This parameter indicates the version of N1 Grid Engine that will be deployed on the compute and access hosts.
These settings affect the way the N1GE “common” directory for the chosen cell name is mounted on all access and compute hosts. The settings are described as follows.
NFS Server Name -- The name of the NFS server from which all compute and access hosts will mount the N1GE “common” directory. When you deploy the master host using GEMM, this parameter is set automatically to the master host. Once you deploy the master host you cannot edit this value and it remains in effect for all further deployments of compute and access hosts. You can only edit the setting again if you uninstall the master host.
NFS Mount Point -- The directory which is mounted from the NFS server for the N1GE “common” directory. When deploying the master host using GEMM, this is set automatically to <SGE_Root>/<SGE_Cell>/common, where <SGE_Root> and <SGE_Cell> are the values specified above. Once you deploy the master host you cannot edit this value and it remains in effect for all further deployments of compute and access hosts. You can only edit the setting again if you uninstall the master host.
Linux NFS Mount Options -- This setting specifies the options used when mounting the “common” directory onto a Linux compute or access host. The value in this field is inserted into the Linux /etc/fstab file on each host as:
<Servername>:<Mountpoint> <Mountpoint> nfs <Mountoptions> 0 0
where <Servername> and <Mountpoint> are the values specified above and <Mountoptions> are the specified Linux NFS mount options.
This parameter cannot contain any spaces.
Solaris NFS Mount Options -- This setting specifies the options used when mounting the “common” directory onto a Solaris compute or access host. The value in this field is inserted into the Solaris /etc/vfstab file on each host as:
<Servername>:<Mountpoint> - <Mountpoint> nfs - yes <Mountoptions>
where <Servername> and <Mountpoint> are the values specified above and <Mountoptions> are the specified Solaris NFS mount options.
This parameter cannot contain any spaces.
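The two table entries described above can be sketched as simple string templates. This is an illustration only, not GEMM's code: the function names, the server name master01, the root directory /gridware/sge, and the mount options are hypothetical. The Solaris entry assumes the standard seven-field /etc/vfstab layout, in which the fsck pass and mount-at-boot fields are the separate tokens "-" and "yes".

```python
def common_dir(sge_root, sge_cell):
    """Build <SGE_Root>/<SGE_Cell>/common, the default NFS mount point."""
    return f"{sge_root.rstrip('/')}/{sge_cell}/common"


def check_options(options):
    """Reject mount options containing whitespace, as the manual requires.

    Rejecting an empty string is an extra assumption of this sketch: an
    empty field would shift the columns of the resulting table entry.
    """
    if not options or any(ch.isspace() for ch in options):
        raise ValueError("mount options must be non-empty and contain no spaces")
    return options


def linux_fstab_line(server, mountpoint, options):
    """Entry for /etc/fstab on a Linux compute or access host."""
    return f"{server}:{mountpoint} {mountpoint} nfs {check_options(options)} 0 0"


def solaris_vfstab_line(server, mountpoint, options):
    """Entry for /etc/vfstab on a Solaris compute or access host.

    Fields: device to mount, device to fsck, mount point, FS type,
    fsck pass, mount at boot, mount options.
    """
    return f"{server}:{mountpoint} - {mountpoint} nfs - yes {check_options(options)}"


# Hypothetical values for illustration.
mount = common_dir("/gridware/sge", "default")
print(linux_fstab_line("master01", mount, "rw,hard,intr"))
print(solaris_vfstab_line("master01", mount, "rw,hard,intr"))
```

Because both files are whitespace-delimited, a space inside the options value would be read as the start of the next field, which is why the parameter cannot contain spaces.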
Currently, there is only one proxy setting, which indicates the host on which monitoring commands are executed. If the master host has been previously deployed using GEMM, then the proxy host is set to this host and cannot be changed until the master is uninstalled. To choose the proxy host, click the Choose Proxy button at the bottom of the page. A table appears listing all the hosts on which the GEMM framework has been installed. Select one host from this table.
The host you choose must be an N1GE admin host; otherwise, installing and uninstalling other hosts, as well as monitoring, could fail.
To set the N1GE version parameter, click the Choose Version button at the bottom of the page. This action presents a table from which you select a version by clicking its Inspect icon. The available versions are those uploaded in the GEMM Version management page. If you deployed the master host previously using GEMM, the version chosen at that time is displayed. Manual changes to this parameter are not allowed until you uninstall the master host.
You can use GEMM for deployment and monitoring even with an N1GE master host not configured by GEMM. Possible scenarios include:
There is an already-existing N1GE installation.
You wish to deploy the master host on a platform not supported by the Sun Control Station framework.
You need to install the master host in a configuration unsupported by GEMM, such as with a shadow host, or with a high-availability cluster via Sun Cluster software.
If you have an externally-configured master host, you can still use GEMM to deploy compute and access hosts, as well as for monitoring. However, you need to follow these steps:
Collect the N1GE and NFS settings
Establish a Proxy Host
Deploy the Chosen Proxy Host as a Compute or Access Host
Once you have configured the master host and ensured that it is up and running properly, take note of all the values for the N1GE settings as well as the NFS settings. These settings are essentially the parameters you would use if you were to install an execution host manually and associate it with the master host, including the choice of NFS options for mounting the N1GE common directory. For example, you might mount the common directory from the master host, or you may need to mount it from a separate file server system or appliance. Note that the correct choice of the NFS settings for the N1GE common directory is a critical step, since the common directory contains a file which tells the compute and access hosts where to find the master host.
Part of this step is to ensure that the exact same version of N1 Grid Engine which is running on the master host has been uploaded to GEMM using the Version page. N1 Grid Engine will not function properly unless the same version, including update level, is used on the master host and all compute and access hosts.
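A version check of this kind can be sketched as a comparison of version and update level. The sketch below is hypothetical: it assumes version strings of the form 6.0u4 (modeled on the GEMM package name) and treats a missing update suffix as update 0; neither convention is confirmed by GEMM itself.

```python
import re

# Assumed format: <major>.<minor> with an optional u<update> suffix, e.g. 6.0u4.
_VERSION = re.compile(r"^(\d+)\.(\d+)(?:u(\d+))?$")


def parse_version(text):
    """Split a version string into (major, minor, update) integers."""
    match = _VERSION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, update = match.groups()
    # A missing update suffix is treated as update 0 (an assumption).
    return int(major), int(minor), int(update or 0)


def same_release(master_version, uploaded_version):
    """True only when version AND update level match exactly."""
    return parse_version(master_version) == parse_version(uploaded_version)


print(same_release("6.0u4", "6.0u4"))  # prints True
print(same_release("6.0u4", "6.0u3"))  # prints False
```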
Once you have determined and set the N1GE and NFS settings, it is important not to modify them again. Otherwise, further compute and access host deployments could be corrupted and will not work.
In order for GEMM to deploy additional compute and access hosts and perform monitoring, you must choose a host from the Sun Control Station as an N1GE Admin Host. This host must remain as an N1GE admin host as long as GEMM is in use. You may choose a system which will be a compute host as well or you may choose a system which will only be an access host. This choice is determined by factors such as:
Security concerns about a compute host having admin privileges --- this factor depends upon your established policy for using N1GE.
Concerns about monitoring being impacted by compute tasks --- by default the monitoring command runs once a minute on the chosen host, which probably will not have a large impact unless the host is running a very resource-intensive job.
Permanence --- the host you choose must be one that you do not expect to take down at any time while GEMM is in use; otherwise, monitoring and deployments will not work.
Once you have decided which host to make the admin host, then perform these steps:
Set this host as the proxy host as previously described.
On the master host, add this host to the list of admin hosts. You can add the host using the N1GE GUI or add it from the command line by using the N1GE qconf -ah command.
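If you script the command-line variant of the second step, it might look like the following sketch. The function names and the host name proxy01 are hypothetical. The sketch assumes that it runs on the master host with the N1GE environment already loaded, so that qconf is on the PATH, and it defaults to a dry run that only returns the command.

```python
import subprocess


def add_admin_host_command(hostname):
    """Build the argument list for 'qconf -ah <hostname>'."""
    if not hostname or any(ch.isspace() for ch in hostname):
        raise ValueError("hostname must be non-empty and contain no spaces")
    return ["qconf", "-ah", hostname]


def add_admin_host(hostname, dry_run=True):
    """Return the command line; execute it only when dry_run is False."""
    cmd = add_admin_host_command(hostname)
    if not dry_run:
        # Assumes qconf is on the PATH and the caller has N1GE
        # administrative rights on the master host.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


# Dry run with a hypothetical host name.
print(add_admin_host("proxy01"))  # prints qconf -ah proxy01
```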
At this point, click the Install Host menu item, select the chosen proxy, and install it as a compute or access host. You must select only the chosen proxy host and no other host in this step. You must wait to deploy additional hosts until the proxy host has been successfully established.
From the Grid Engine main page you have two uninstall choices. You can uninstall a particular host or hosts or you can uninstall everything.
You can remove one or more compute hosts from the compute grid.
When you uninstall a compute host, the N1GE software is shut down and removed from the selected hosts. The N1GE master host is instructed to remove those compute hosts from the N1GE compute grid.
Before you start the uninstall procedure, ensure that no jobs are running on the compute hosts that you want to uninstall. Any jobs that are currently running on these hosts will be terminated. If the jobs are marked as “re-runnable”, they are automatically resubmitted to the N1GE compute grid for execution on other compute hosts. However, if they are marked as “not re-runnable”, they are not rescheduled and are not automatically executed elsewhere.
Select Grid Engine > Uninstall Host.
The selector appears, displaying the list of hosts currently in the compute grid; see Uninstalling Hosts.
Click to highlight one or more hosts. You can also click Select All at the top to choose all hosts in the list.
Click Uninstall Selected Nodes in the bottom right corner.
The Uninstall Task Progress Dialog appears.
You can remove all components of the Grid Engine module from the master host and all compute hosts.
Before you uninstall everything, be aware that:
all jobs (both running and suspended) are killed
all pending jobs are lost
all configurations and all records of previously run jobs are lost.
To uninstall everything: