Sun Cluster Data Service for SAP Guide for Solaris OS

Sun Cluster HA for SAP Fault Probes for Application Server

For the application server, the fault probe executes the following steps.

  1. Retrieves the process ID for the main dispatcher

  2. Loops infinitely (sleeps for Thorough_probe_interval)

  3. Checks the availability of the SAP resources

    1. Abnormal exit – If the Process Monitor Facility (PMF) detects that the SAP process tree has failed, the fault monitor treats this problem as a complete failure. The fault monitor either restarts the SAP resource or fails it over to another node, based on the resource's failure history.

    2. Availability check of the SAP resources through probe – The probe uses the ps(1) command to check the SAP Message Server and main dispatcher processes. If the SAP main dispatcher process is missing from the system's list of active processes, the fault monitor treats the problem as a complete failure. See the first sketch after this list.

    3. Database connection status through probe – The probe calls the SAP-supplied utility R3trans to check the status of the database connection. The Sun Cluster HA for SAP fault probes verify that SAP can connect to the database. However, Sun Cluster HA for SAP depends on the fault probes of the highly available database to determine database availability. If the database connection status check fails, the fault monitor logs the message, Database might be down, to /var/adm/messages and sets the status of the SAP resource to DEGRADED. If the probe checks the status of the database again and the connection is reestablished, the fault monitor logs the message, Database is up, to /var/adm/messages and sets the status of the SAP resource to OK. See the second sketch after this list.

  4. Evaluates the failure history

    Based on the failure history, the fault monitor completes one of the following actions. The last sketch after this list illustrates this decision.

    • no action

    • local restart

    • failover

      If the application server resource is a failover resource, the fault monitor fails over the application server.

      If the application server resource is a scalable resource, after the local restart attempts are exhausted, the RGM brings up the application server on a different node, if another node is available in the cluster.
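
The probe loop and the ps(1) availability check can be pictured with the following C sketch. It is illustrative only: the dw.sap process-name pattern, the 60-second value standing in for Thorough_probe_interval, and the dispatcher_is_running() helper are assumptions, not the agent's actual source.

    /*
     * Illustrative sketch of the probe loop and the ps(1) dispatcher check.
     * The pattern, interval, and helper name are placeholders.
     */
    #include <stdio.h>
    #include <unistd.h>

    /* Return 1 if the SAP main dispatcher appears in the active process
     * list, 0 otherwise.  SAP dispatcher processes typically show up
     * with a dw.sap prefix in ps(1) output. */
    static int
    dispatcher_is_running(void)
    {
        FILE *p = popen("/usr/bin/ps -ef | grep dw.sap | grep -v grep", "r");
        char line[1024];
        int found = 0;

        if (p == NULL)
            return (0);
        while (fgets(line, sizeof (line), p) != NULL)
            found = 1;
        (void) pclose(p);
        return (found);
    }

    int
    main(void)
    {
        /* Placeholder for the Thorough_probe_interval resource property. */
        const unsigned int thorough_probe_interval = 60;

        for (;;) {
            (void) sleep(thorough_probe_interval);
            if (!dispatcher_is_running()) {
                /* A missing dispatcher is treated as a complete failure. */
                (void) fprintf(stderr, "SAP main dispatcher not found\n");
                return (1);
            }
        }
    }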
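
The database connection check can be sketched in the same way. The R3trans -d command line, the sidadm user name, and the treatment of a zero exit status as success are assumptions about a typical SAP installation; the shipped probe logs its result to /var/adm/messages and sets the SAP resource status, as described above.

    /*
     * Illustrative sketch of the R3trans database connection check.
     * The command line, the sidadm user, and the exit-status handling
     * are assumptions about a typical SAP installation.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    #define STATUS_OK        0
    #define STATUS_DEGRADED  1

    int
    main(void)
    {
        /* Run R3trans as the SAP administrative user; a zero exit status
         * is taken to mean that the database connection succeeded. */
        int rc = system("su - sidadm -c 'R3trans -d' >/dev/null 2>&1");

        if (rc == -1) {
            perror("system");
            return (2);
        }
        if (WIFEXITED(rc) && WEXITSTATUS(rc) == 0) {
            /* Connection is (re)established: the agent would log
             * "Database is up" and set the resource status to OK. */
            (void) printf("Database is up\n");
            return (STATUS_OK);
        }
        /* Connection failed: the agent would log "Database might be down"
         * and set the resource status to DEGRADED. */
        (void) printf("Database might be down\n");
        return (STATUS_DEGRADED);
    }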
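
Finally, the failure-history evaluation can be thought of as counting complete failures within the Retry_interval window and comparing the count against Retry_count, both standard resource properties. The following sketch of that decision is an illustration, not the agent's implementation; the evaluate_history() helper and its sample values are hypothetical.

    /*
     * Illustrative sketch of the failure-history evaluation.  Retry_count
     * and Retry_interval are standard resource properties; the decision
     * logic and helper name here are hypothetical.
     */
    #include <stdio.h>
    #include <time.h>

    enum action { ACTION_NONE, ACTION_RESTART, ACTION_FAILOVER };

    /* Count the complete failures that occurred within the last
     * retry_interval seconds and decide what to do next.  For a failover
     * resource the third outcome means failing over the resource group;
     * for a scalable resource the RGM starts the application server on
     * another available node. */
    static enum action
    evaluate_history(const time_t *failures, int nfailures,
        int retry_count, int retry_interval, time_t now)
    {
        int recent = 0;
        int i;

        for (i = 0; i < nfailures; i++) {
            if (now - failures[i] <= retry_interval)
                recent++;
        }
        if (recent == 0)
            return (ACTION_NONE);
        if (recent <= retry_count)
            return (ACTION_RESTART);
        return (ACTION_FAILOVER);
    }

    int
    main(void)
    {
        time_t now = time(NULL);
        /* Three failures in the last few minutes (sample data). */
        time_t failures[] = { now - 30, now - 120, now - 300 };
        enum action a = evaluate_history(failures, 3, 2, 370, now);

        (void) printf("action: %s\n",
            a == ACTION_NONE ? "no action" :
            a == ACTION_RESTART ? "local restart" : "failover");
        return (0);
    }

In this example, three failures fall inside a 370-second Retry_interval and exceed a Retry_count of 2, so the sketch reports a failover.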