Tuning the HA for PeopleSoft Process Scheduler Fault Monitor

This section describes the probing algorithm and functionality of the HA for PeopleSoft process scheduler fault monitor, and states the conditions, messages, and recovery actions associated with unsuccessful probing.

For conceptual information about fault monitors, see the Oracle Solaris Cluster 4.3 Concepts Guide.

Resource Properties

The HA for PeopleSoft process scheduler fault monitor uses the resource properties that are specified in the resource type ORCL.PeopleSoft_process_scheduler. Refer to the r_properties(5) man page for a list of general resource properties used. Refer to ORCL.PeopleSoft_process_scheduler Extension Properties for a specific list of resource properties for this resource type.

Probing Algorithm and Functionality

The HA for PeopleSoft process scheduler fault monitor is controlled by resource properties that govern the probing frequency and restart behavior. The default values of these properties determine the preset behavior of the fault monitor and are suitable for most Oracle Solaris Cluster installations. You can modify this preset behavior by changing the following settings:

The interval between fault monitor probes (Thorough_probe_interval)
The timeout for fault monitor probes (Probe_timeout)
The number of times the fault monitor attempts to restart the resource (Retry_count)
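
As an illustration of how these settings might be changed, the following minimal sketch assembles a `clresource set` command line that updates the three properties in one call. The resource name `psft-prcs-rs` and the values shown are hypothetical and are not recommendations; the helper only builds the command and does not run it.

```python
import shlex


def build_tuning_command(resource, **properties):
    """Build the argv list for a `clresource set` call that updates properties."""
    argv = ["clresource", "set"]
    for name, value in properties.items():
        # Each property is passed as its own -p name=value pair.
        argv += ["-p", f"{name}={value}"]
    argv.append(resource)
    return argv


# Hypothetical resource name and example values, for illustration only.
cmd = build_tuning_command(
    "psft-prcs-rs",
    Thorough_probe_interval=120,
    Probe_timeout=60,
    Retry_count=3,
)
print(shlex.join(cmd))
```

Review the printed command before running it on a cluster node with the required authorization.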

The HA for PeopleSoft process scheduler fault monitor runs in an infinite loop. During each cycle, the fault monitor checks the domain state and reports either success or failure.

If the fault monitor is successful, it returns to its infinite loop and continues the next cycle of probing and sleeping.
If the fault monitor reports a failure, a request is made to the cluster to restart the resource. Each subsequent failure results in another restart request. If the number of successive restarts exceeds the Retry_count within the Retry_interval, a request is made to fail over the resource group onto a different node.
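
The restart and failover decision described above can be sketched as a counter over a sliding time window. This is a simplified model for illustration, not the cluster framework's implementation; the class name `RestartPolicy` and the parameter `window_seconds` are hypothetical.

```python
from collections import deque


class RestartPolicy:
    """Simplified model: restart on failure, fail over when restarts are exhausted."""

    def __init__(self, retry_count, window_seconds):
        self.retry_count = retry_count
        self.window_seconds = window_seconds
        self.restarts = deque()  # timestamps of recent restart requests

    def on_probe_failure(self, now):
        # Forget restarts that fall outside the counting window.
        while self.restarts and now - self.restarts[0] > self.window_seconds:
            self.restarts.popleft()
        # Another restart would exceed the allowed count: request a failover.
        if len(self.restarts) >= self.retry_count:
            self.restarts.clear()
            return "failover"
        self.restarts.append(now)
        return "restart"


policy = RestartPolicy(retry_count=2, window_seconds=300)
actions = [policy.on_probe_failure(t) for t in (0, 60, 120)]
print(actions)
```

In this model, failures that are spaced further apart than the window never accumulate, so an occasional isolated failure leads only to a local restart.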

Operations of the PeopleSoft Process Scheduler Probe

The PeopleSoft process scheduler probe operates as follows:

If the control_process_scheduler script for the resource is still running with the start option, the probe returns 100. This return value implements “wait for online” behavior during start. Otherwise, the probe continues.
If the output from psadmin for the boot option contains the string ERROR:, the probe returns 100 to indicate a failed start. Otherwise, the probe continues.
If the output for the psadmin -p sstatus -d ${Psft_Domain} command contains the string ERROR:, the probe checks for the following specific message:
```
Can not find DBBL on master and backup nodes.
```
- If that message is detected, the probe assumes that the critical BBL service has failed and tries to restart the BBL by sending the bbc command through tmadmin. The probe returns 50, which puts the service into degraded mode. If a subsequent probe detects the same error, the probe returns 50 again. The accumulated total of 100 results in a failed probe.
- If the specific error message is not matched, the probe immediately returns 100.
- If no error message is found, the probe continues.
The probe checks whether at least one of each of the services that are defined as critical is running. The following services are regarded as critical:
- BBL
- PSMONITORSRV
- PSPRCSRV
- PSDSTSRV
If the probe does not detect that all of the critical services are running, the probe returns 100. Otherwise, the probe returns 0.
If the PeopleSoft process scheduler resource is repeatedly restarted and subsequently exhausts the Retry_count within the Retry_interval, and if Failover_enabled is set to TRUE, a failover to another node is initiated for the resource group.
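
The order of the probe checks can be summarized in a short sketch. This is a model of the decision sequence only, not the shipped probe. The function name `probe`, its parameters, and the `restart_bbl` callback are hypothetical, and the sketch assumes that the psadmin output has already been captured as text and that the names of the running services have already been extracted from it.

```python
DBBL_ERROR = "Can not find DBBL on master and backup nodes."
CRITICAL_SERVICES = ("BBL", "PSMONITORSRV", "PSPRCSRV", "PSDSTSRV")


def probe(start_in_progress, boot_output, status_output, running_services,
          restart_bbl=lambda: None):
    """Return 0 (healthy), 50 (degraded), or 100 (failed), following the steps above."""
    # Step 1: a start is still in progress, so the resource is not yet online.
    if start_in_progress:
        return 100
    # Step 2: the boot output reported an error, so the start failed.
    if "ERROR:" in boot_output:
        return 100
    # Step 3: the status output reported an error.
    if "ERROR:" in status_output:
        if DBBL_ERROR in status_output:
            # Known BBL failure: attempt a BBL restart and report a partial failure.
            restart_bbl()
            return 50
        return 100
    # Step 4: at least one instance of every critical service must be running.
    if all(name in running_services for name in CRITICAL_SERVICES):
        return 0
    return 100


healthy = probe(False, "", "", ["BBL", "PSMONITORSRV", "PSPRCSRV", "PSDSTSRV"])
degraded = probe(False, "", "ERROR: " + DBBL_ERROR, [])
missing = probe(False, "", "", ["BBL", "PSPRCSRV"])
print(healthy, degraded, missing)
```

Because the first two checks return before the status output is examined, a resource that is still starting is never treated as degraded.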