JavaScript is required to for searching.
Skip Navigation Links
Exit Print View
Oracle Solaris Cluster Data Services Developer's Guide     Oracle Solaris Cluster 4.1
search filter icon
search icon

Document Information

Preface

1.  Overview of Resource Management

2.  Developing a Data Service

3.  Resource Management API Reference

4.  Modifying a Resource Type

5.  Sample Data Service

6.  Data Service Development Library

7.  Designing Resource Types

8.  Sample DSDL Resource Type Implementation

X Font Server

X Font Server Configuration File

TCP Port Number

ORCL.xfnts RTR File

Naming Conventions for Functions and Callback Methods

scds_initialize() Function

xfnts_start Method

Validating the Service Before Starting the X Font Server

Starting the Service With svc_start()

Returning From svc_start()

xfnts_stop Method

xfnts_monitor_start Method

xfnts_monitor_stop Method

xfnts_monitor_check Method

ORCL.xfnts Fault Monitor

xfonts_probe Main Loop

svc_probe() Function

Determining the Fault Monitor Action

xfnts_validate Method

xfnts_update Method

9.  Oracle Solaris Cluster Agent Builder

10.  Generic Data Service

11.  DSDL API Functions

12.  Cluster Reconfiguration Notification Protocol

13.  Security for Data Services

A.  Sample Data Service Code Listings

B.  DSDL Sample Resource Type Code Listings

C.  Requirements for Non-Cluster-Aware Applications

D.  Document Type Definitions for the CRNP

E.  CrnpClient.java Application

Index

ORCL.xfnts Fault Monitor

The RGM does not directly call the PROBE method, but rather calls the Monitor_start method to start the monitor after a resource is started on a node. The xfnts_monitor_start method starts the fault monitor under the control of the PMF. The xfnts_monitor_stop method stops the fault monitor.

The ORCL.xfnts fault monitor performs the following operations:

xfonts_probe Main Loop

The xfonts_probe method implements a loop.

Before implementing the loop, xfonts_probe performs the following operations:

The xfnts_probe method implements the following loop:

for (ip = 0; ip < netaddr->num_netaddrs; ip++) {
         /*
          * Grab the hostname and port on which the
          * health has to be monitored.
          */
         hostname = netaddr->netaddrs[ip].hostname;
         port = netaddr->netaddrs[ip].port_proto.port;
         /*
          * HA-XFS supports only one port and
          * hence obtain the port value from the
          * first entry in the array of ports.
          */
         ht1 = gethrtime(); /* Latch probe start time */
         scds_syslog(LOG_INFO, "Probing the service on port: %d.", port);

         probe_result =
         svc_probe(scds_handle, hostname, port, timeout);

         /*
          * Update service probe history,
          * take action if necessary.
          * Latch probe end time.
          */
         ht2 = gethrtime();

         /* Convert to milliseconds */
         dt = (ulong_t)((ht2 - ht1) / 1e6);

         /*
          * Compute failure history and take
          * action if needed
          */
         (void) scds_fm_action(scds_handle,
             probe_result, (long)dt);
      }   /* Each net resource */
   }    /* Keep probing forever */

The svc_probe() function implements the probe logic. The return value from svc_probe() is passed to scds_fm_action(), which determines whether to restart the application, fail over the resource group, or do nothing.

svc_probe() Function

The svc_probe() function makes a simple socket connection to the specified port by calling scds_fm_tcp_connect(). If the connect fails, svc_probe() returns a value of 100, which indicates a complete failure. If the connect succeeds, but the disconnect fails, svc_probe() returns a value of 50, which indicates a partial failure. If the connect and disconnect both succeed, svc_probe() returns a value of 0, which indicates success.

The code for svc_probe() is as follows:

int svc_probe(scds_handle_t scds_handle,
char *hostname, int port, int timeout)
{
   int  rc;
   hrtime_t   t1, t2;
   int    sock;
   char   testcmd[2048];
   int    time_used, time_remaining;
   time_t      connect_timeout;


   /*
    * probe the data service by doing a socket connection to the port
    * specified in the port_list property to the host that is
    * serving the XFS data service. If the XFS service which is configured
    * to listen on the specified port, replies to the connection, then
    * the probe is successful. Else we will wait for a time period set
    * in probe_timeout property before concluding that the probe failed.
    */

   /*
    * Use the SVC_CONNECT_TIMEOUT_PCT percentage of timeout
    * to connect to the port
    */
   connect_timeout = (SVC_CONNECT_TIMEOUT_PCT * timeout)/100;
   t1 = (hrtime_t)(gethrtime()/1E9);

   /*
    * the probe makes a connection to the specified hostname and port.
    * The connection is timed for 95% of the actual probe_timeout.
    */
   rc = scds_fm_tcp_connect(scds_handle, &sock, hostname, port,
       connect_timeout);
   if (rc) {
      scds_syslog(LOG_ERR,
          "Failed to connect to port <%d> of resource <%s>.",
          port, scds_get_resource_name(scds_handle));
      /* this is a complete failure */
      return (SCDS_PROBE_COMPLETE_FAILURE);
   }

   t2 = (hrtime_t)(gethrtime()/1E9);

   /*
    * Compute the actual time it took to connect. This should be less than
    * or equal to connect_timeout, the time allocated to connect.
    * If the connect uses all the time that is allocated for it,
    * then the remaining value from the probe_timeout that is passed to
    * this function will be used as disconnect timeout. Otherwise, the
    * the remaining time from the connect call will also be added to
    * the disconnect timeout.
    *
    */

   time_used = (int)(t2 - t1);

   /*
    * Use the remaining time(timeout - time_took_to_connect) to disconnect
    */

   time_remaining = timeout - (int)time_used;

   /*
    * If all the time is used up, use a small hardcoded timeout
    * to still try to disconnect. This will avoid the fd leak.
    */
   if (time_remaining <= 0) {
      scds_syslog_debug(DBG_LEVEL_LOW,
          "svc_probe used entire timeout of "
          "%d seconds during connect operation and exceeded the "
          "timeout by %d seconds. Attempting disconnect with timeout"
          " %d ",
          connect_timeout,
          abs(time_used),
          SVC_DISCONNECT_TIMEOUT_SECONDS);

      time_remaining = SVC_DISCONNECT_TIMEOUT_SECONDS;
   }

   /*
    * Return partial failure in case of disconnection failure.
    * Reason: The connect call is successful, which means
    * the application is alive. A disconnection failure
    * could happen due to a hung application or heavy load.
    * If it is the later case, don't declare the application
    * as dead by returning complete failure. Instead, declare
    * it as partial failure. If this situation persists, the
    * disconnect call will fail again and the application will be
    * restarted.
    */
   rc = scds_fm_tcp_disconnect(scds_handle, sock, time_remaining);
   if (rc != SCHA_ERR_NOERR) {
      scds_syslog(LOG_ERR,
          "Failed to disconnect to port %d of resource %s.",
          port, scds_get_resource_name(scds_handle));
      /* this is a partial failure */
      return (SCDS_PROBE_COMPLETE_FAILURE/2);
   }

   t2 = (hrtime_t)(gethrtime()/1E9);
   time_used = (int)(t2 - t1);
   time_remaining = timeout - time_used;

   /*
    * If there is no time left, don't do the full test with
    * fsinfo. Return SCDS_PROBE_COMPLETE_FAILURE/2
    * instead. This will make sure that if this timeout
    * persists, server will be restarted.
    */
   if (time_remaining <= 0) {
      scds_syslog(LOG_ERR, "Probe timed out.");
      return (SCDS_PROBE_COMPLETE_FAILURE/2);
   }

   /*
    * The connection and disconnection to port is successful,
    * Run the fsinfo command to perform a full check of
    * server health.
    * Redirect stdout, otherwise the output from fsinfo
    * ends up on the console.
    */
   (void) sprintf(testcmd,
       "/usr/openwin/bin/fsinfo -server %s:%d > /dev/null",
       hostname, port);
   scds_syslog_debug(DBG_LEVEL_HIGH,
       "Checking the server status with %s.", testcmd);
   if (scds_timerun(scds_handle, testcmd, time_remaining,
      SIGKILL, &rc) != SCHA_ERR_NOERR || rc != 0) {

      scds_syslog(LOG_ERR,
         "Failed to check server status with command <%s>",
         testcmd);
      return (SCDS_PROBE_COMPLETE_FAILURE/2);
   }
   return (0);
}

When finished, svc_probe() returns a value that indicates success (0), partial failure (50), or complete failure (100). The xfnts_probe method passes this value to scds_fm_action().

Determining the Fault Monitor Action

The xfnts_probe method calls scds_fm_action() to determine the action to take.

The logic in scds_fm_action() is as follows:

For example, suppose the probe makes a connection to the xfs server, but fails to disconnect. This indicates that the server is running, but could be hung or just under a temporary load. The failure to disconnect sends a partial (50) failure to scds_fm_action(). This value is below the threshold for restarting the data service, but the value is maintained in the failure history.

If during the next probe the server again fails to disconnect, a value of 50 is added to the failure history maintained by scds_fm_action(). The cumulative failure value is now 100, so scds_fm_action() restarts the data service.