Sun Cluster 3.1 Data Services Developer's Guide

Chapter 2 Developing a Data Service

This chapter provides detailed information about developing a data service.


Analyzing the Application for Suitability

The first step in creating a data service is to determine whether the target application satisfies the requirements for being made highly available or scalable. If the application does not meet all of the requirements, you might be able to modify the application source code so that it does.

The list that follows summarizes the requirements for an application to be made highly available or scalable. If you need more detail or if you need to modify the application source code, refer to Appendix B, Sample Data Service Code Listings.

Note –

A scalable service must meet all the following conditions for high availability as well as some additional criteria.

Both network-aware (client-server model) and non-network-aware (clientless) applications are potential candidates for being made highly available or scalable in the Sun Cluster environment. However, Sun Cluster cannot provide enhanced availability in time-sharing environments in which applications run on a server that is accessed through telnet or rlogin.
The application must be crash tolerant. That is, it must recover disk data (if necessary) when it is started after an unexpected node death. Furthermore, the recovery time after a crash must be bounded. Crash tolerance is a prerequisite for making an application highly available because the ability to recover the disk and restart the application is a data integrity issue. The data service is not required to recover client connections.
The application must not depend upon the physical hostname of the node on which it is running. See Host Names for additional information.
The application must operate correctly in environments in which multiple IP addresses are configured up. Examples include environments with multihomed hosts, in which the node is on more than one public network, and environments with nodes on which multiple logical interfaces are configured up on one hardware interface.
To be highly available, the application data must reside in the cluster file system. See Multihosted Data.

If the application uses a hard-wired path name for the location of the data, you could change that path to a symbolic link that points to a location in the cluster file system, without changing application source code. See Using Symbolic Links for Multihosted Data Placement for additional information.
Application binaries and libraries can reside locally on each node or on the cluster file system. The advantage of residing on the cluster file system is that a single installation is sufficient. The disadvantage is that rolling upgrade becomes an issue because the binaries are in use while the application is running under control of the RGM.
The client should have some capacity to retry a query automatically if the first attempt times out. If the application and protocol already handle the case of a single server crashing and rebooting, then they also will handle the case of the containing resource group being failed over or switched over. See Client Retry for additional information.
The application must not create UNIX domain sockets or named pipes in the cluster file system.
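Before you move application data into the cluster file system, you can check that no UNIX domain sockets or named pipes are present there. The following C sketch tests a single path with lstat(2). The function name is_socket_or_fifo is hypothetical and is not part of the RMAPI or the DSDL; a Validate method could, for example, apply it to each entry of a data directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if path names a UNIX domain socket or a
 * named pipe (FIFO), 0 if it names any other file type, and -1 if the path
 * cannot be examined. lstat() reports a symbolic link as a link instead of
 * following it.
 */
int is_socket_or_fifo(const char *path)
{
    struct stat sb;

    if (lstat(path, &sb) == -1)
        return -1;
    return (S_ISSOCK(sb.st_mode) || S_ISFIFO(sb.st_mode)) ? 1 : 0;
}
```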

Additionally, scalable services must meet the following requirements.

The application must have the ability to run multiple instances, all operating on the same application data in the cluster file system.
The application must provide data consistency for simultaneous access from multiple nodes.
The application must implement sufficient locking with a globally visible mechanism, such as the cluster file system.
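One way to meet the locking requirement is with POSIX advisory record locks (fcntl(2)) on a lock file in the cluster file system. The sketch below assumes that the cluster file system makes fcntl locks visible on all nodes, which you should confirm for your release. The names shared_data_lock and shared_data_unlock are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Apply (F_WRLCK) or release (F_UNLCK) an advisory lock on the whole file. */
static int set_whole_file_lock(int fd, short type)
{
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */

    /* F_SETLKW blocks until the lock is granted; retry if a signal interrupts. */
    while (fcntl(fd, F_SETLKW, &fl) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}

/* Hypothetical helpers that bracket every update to shared application data. */
int shared_data_lock(int fd)
{
    return set_whole_file_lock(fd, F_WRLCK);
}

int shared_data_unlock(int fd)
{
    return set_whole_file_lock(fd, F_UNLCK);
}
```

Because fcntl locks belong to a process, they serialize separate application instances (processes), not threads within one instance.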

For a scalable service, application characteristics also determine the load-balancing policy. For example, the load-balancing policy, LB_WEIGHTED, which allows any instance to respond to client requests, does not work for an application that makes use of an in-memory cache on the server for client connections. In this case, you should specify a load-balancing policy that restricts a given client's traffic to one instance of the application. The load-balancing policies, LB_STICKY and LB_STICKY_WILD, repeatedly send all requests by a client to the same application instance—where they can make use of an in-memory cache. Note that if multiple client requests come in from different clients, the RGM distributes the requests among the instances of the service. See Implementing a Scalable Resource for more information about setting the load balancing policy for scalable data services.
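To make the difference between the policies concrete, the following C sketch models the two behaviors. It is a conceptual model only and does not reflect how Sun Cluster networking actually implements LB_WEIGHTED or LB_STICKY. The weighted chooser spreads successive requests across instances in proportion to integer weights. The sticky chooser maps a given client address to the same instance every time. Both functions and the hash are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Conceptual model of weighted distribution: request number n (0, 1, 2, ...)
 * goes to the instance whose share of the total weight contains n modulo the
 * total, so over time each instance receives requests in proportion to its
 * weight. Returns 0 if all weights are zero.
 */
size_t pick_weighted(const unsigned *weights, size_t ninst, unsigned long n)
{
    unsigned long total = 0, slot;
    size_t i;

    for (i = 0; i < ninst; i++)
        total += weights[i];
    if (total == 0)
        return 0;
    slot = n % total;
    for (i = 0; i < ninst; i++) {
        if (slot < weights[i])
            return i;
        slot -= weights[i];
    }
    return 0;   /* not reached when total > 0 */
}

/*
 * Conceptual model of sticky distribution: the instance depends only on the
 * client address, so every request from one client reaches the same
 * instance (and its in-memory cache), while different clients spread out.
 */
size_t pick_sticky(uint32_t client_ipv4, size_t ninst)
{
    uint32_t h = client_ipv4 * 2654435761u;   /* multiplicative hash */
    return ninst ? (size_t)(h % ninst) : 0;
}
```

With weights {1, 3}, the weighted model sends one request in every four to instance 0 and three to instance 1. This spreading is harmless for stateless servers but defeats a per-instance in-memory cache.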

Determining the Interface to Use

The Sun Cluster developer support package (SUNWscdev) provides two sets of interfaces for coding data service methods:

The Resource Management API (RMAPI), a set of low-level routines (in the libscha.so library)
The Data Service Development Library (DSDL), a set of higher level functions (in the libdsdev.so library) that encapsulate the functionality of the RMAPI and provide some additional functionality

Also included in the Sun Cluster developer support package is SunPlex Agent Builder, a tool that automates the creation of a data service.

The recommended approach to developing a data service is:

Decide whether to code in C or Korn shell (ksh). If you decide to use the Korn shell, you cannot use the DSDL, which provides a C interface only.
Run Agent Builder, specify the requested inputs, and generate a data service, which includes source and executable code, an RTR file, and a package.
If the generated data service requires customizing, you can add DSDL code to the generated source files. Agent Builder indicates, with comments, specific places in the source files where you can add your own code.
If the code requires further customizing to support the target application, you can add RMAPI functions to the existing source code.

In practice, you could take numerous approaches to creating a data service. For example, rather than add your own code to specific places in the code generated by Agent Builder, you could entirely replace one of the generated methods, or the generated monitor program, with a program that you write from scratch using DSDL or RMAPI functions. However, regardless of how you proceed, in almost every case starting with Agent Builder makes sense, for the following reasons:

The code generated by Agent Builder, while generic in nature, has been tested in numerous data services.
Agent Builder generates an RTR file, a makefile, a package for the resource, and other support files for the data service. Even if you use none of the generated data service code, using these other files can save you a considerable amount of work.
You can modify the generated code.

Note –

Unlike the RMAPI, which provides a set of C functions and a set of commands for use in scripts, the DSDL provides a C function interface only. Therefore, if you specify ksh output in Agent Builder, the generated source code makes calls to the RMAPI because there are no DSDL ksh commands.

Setting Up the Development Environment for Writing a Data Service

Before beginning data service development, you must have installed the Sun Cluster development package (SUNWscdev) to have access to the Sun Cluster header and library files. Although this package is already installed on all cluster nodes, typically, you do development on a separate, non-cluster development machine, not on a cluster node. In this typical case, you must use pkgadd(1M) to install the SUNWscdev package on your development machine.

When compiling and linking your code, you must set particular options to identify the header and library files. When you have finished development (on a non-cluster node) you can transfer the completed data service to a cluster for running and testing.

Note –

Be certain you are using a development version of Solaris 8 (SunOS 5.8) or later.

Use the procedures in this section to:

Install the Sun Cluster development package (SUNWscdev) and set the appropriate compiler and linker options
Transfer the data service to a cluster

How to Set Up the Development Environment

This procedure describes how to install the SUNWscdev package and set the compiler and linker options for data service development.

Change directory to the appropriate CD-ROM directory.
cd appropriate_CD-ROM_directory

Install the SUNWscdev package in the current directory.
pkgadd -d . SUNWscdev

In the makefile, specify compiler and linker options to identify the include and library files for your data service code.

Specify the -I option to identify the Sun Cluster header files, the -L option to specify the compile-time library search path on the development system, and the -R option to specify the library search path to the runtime linker on the cluster.
# Makefile for sample data service
...
-I /usr/cluster/include -L /usr/cluster/lib -R /usr/cluster/lib
...

How to Transfer a Data Service to a Cluster

When you have completed development of a data service on a development machine, you must transfer it to a cluster for testing. To reduce the chance of error, the best way to accomplish this transfer is to package together the data service code and the RTR file and then install the package on all nodes of the cluster.

Note –

Whether you use pkgadd or some other way to install the data service, you must put the data service on all cluster nodes. Agent Builder automatically packages together the RTR file and data service code.

Setting Resource and Resource Type Properties

Sun Cluster provides a set of resource type properties and resource properties that you use to define the static configuration of a data service. Resource type properties specify the type of the resource, its version, the version of the API, and so on, as well as paths to each of the callback methods. Table A–1 lists all the resource type properties.

Resource properties, such as Failover_mode, Thorough_probe_interval, and method timeouts, also define the static configuration of the resource. Dynamic resource properties such as Resource_state and Status reflect the active state of a managed resource. Table A–2 describes the resource properties.

You declare the resource type and resource properties in the resource type registration (RTR) file, which is an essential component of a data service. The RTR file defines the initial configuration of the data service at the time the cluster administrator registers the data service with Sun Cluster.

It is recommended that you use Agent Builder to generate the RTR file for your data service because Agent Builder declares the set of properties that are both useful and required for any data service. For example, certain properties (such as Resource_type) must be declared in the RTR file or registration of the data service fails. Other properties, though not required, are not available to a system administrator unless you declare them in the RTR file. Still other properties are available whether or not you declare them, because the RGM defines them and provides a default value. To avoid this level of complexity, you can simply use Agent Builder to generate a correct RTR file. Later on, you can edit the RTR file to change specific values if you need to do so.

The rest of this section leads you through a sample RTR file, created by Agent Builder.

Declaring Resource Type Properties

The cluster administrator cannot configure the resource type properties you declare in the RTR file. They become part of the permanent configuration of the resource type.

Note –

Installed_nodes is the one resource type property that a system administrator can configure. In fact, only a system administrator can configure it, and you cannot declare it in the RTR file.

The syntax for resource type declarations is:

property_name = value;

Note –

The RGM treats property names as case insensitive. The convention for properties in Sun-supplied RTR files, with the exception of method names, is uppercase for the first letter of the name and lowercase for the rest of the name. Method names—as well as property attributes—contain all uppercase letters.
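To illustrate the declaration syntax and the case-insensitive treatment of property names, the following C sketch splits one property_name = value; line into its name and value and compares names with strcasecmp(). The function parse_rt_decl is hypothetical and is not how the RGM parses RTR files. It ignores comments, quoted strings that contain ';' or '=', and the bracketed resource property syntax.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Copy s[0..len) into dst (capacity cap), trimming surrounding blanks. */
static int copy_trimmed(char *dst, size_t cap, const char *s, size_t len)
{
    while (len > 0 && isspace((unsigned char)*s)) {
        s++;
        len--;
    }
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        len--;
    if (len + 1 > cap)
        return -1;
    memcpy(dst, s, len);
    dst[len] = '\0';
    return 0;
}

/* Parse "name = value;" into name and value. Returns 0 on success. */
int parse_rt_decl(const char *line, char *name, size_t ncap,
                  char *value, size_t vcap)
{
    const char *eq = strchr(line, '=');
    const char *semi;

    if (eq == NULL)
        return -1;
    semi = strchr(eq + 1, ';');
    if (semi == NULL)
        return -1;                      /* each declaration ends with ';' */
    if (copy_trimmed(name, ncap, line, (size_t)(eq - line)) != 0 ||
        copy_trimmed(value, vcap, eq + 1, (size_t)(semi - eq - 1)) != 0)
        return -1;
    return name[0] != '\0' ? 0 : -1;
}

/* Property names are case insensitive: RT_version matches Rt_version. */
int same_property(const char *a, const char *b)
{
    return strcasecmp(a, b) == 0;
}
```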

Following are the resource type declarations in the RTR file for a sample (smpl) data service:

# Sun Cluster Data Services Builder template version 1.0
# Registration information and resources for smpl
#
#NOTE: Keywords are case insensitive, i.e., you can use
#any capitalization style you prefer.
#
Resource_type = "smpl";
Vendor_id = SUNW;
RT_description = "Sample Service on Sun Cluster";

RT_version ="1.0"; 
API_version = 2;
Failover = TRUE;

Init_nodes = RG_PRIMARIES;

RT_basedir=/opt/SUNWsmpl/bin;

Start           =    smpl_svc_start;
Stop            =    smpl_svc_stop;

Validate        =    smpl_validate;
Update          =    smpl_update;

Monitor_start   =    smpl_monitor_start;
Monitor_stop    =    smpl_monitor_stop;
Monitor_check   =    smpl_monitor_check;

Tip –

You must declare the Resource_type property as the first entry in the RTR file. Otherwise, registration of the resource type will fail.

The first set of resource type declarations provides basic information about the resource type, as follows:

Resource_type and Vendor_id

Provide a name for the resource type. You can specify the resource type name with the Resource_type property alone (smpl) or using the Vendor_id as a prefix with a “.” separating it from the resource type (SUNW.smpl), as in the sample. If you use Vendor_id, make it the stock symbol for the company defining the resource type. The resource type name must be unique in the cluster.

Note –

By convention, the package name is the Vendor_id followed directly by the Resource_type, with no separator (SUNWsmpl in the sample). Package names are limited to nine characters, so it is a good idea to limit the total number of characters in these two properties to nine or fewer, though the RGM does not enforce this limit. Agent Builder, on the other hand, explicitly generates the package name from the resource type name, so it does enforce the nine-character limit.
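The naming rules above can be expressed as two small C helpers. Both are hypothetical illustrations, not Sun Cluster functions. full_rt_name builds the Vendor_id.Resource_type form, and conventional_pkg_name builds the package name and applies the nine-character limit that Agent Builder enforces.

```c
#include <stdio.h>
#include <string.h>

#define PKG_NAME_MAX 9   /* package names are limited to nine characters */

/* Build the full resource type name, Vendor_id "." Resource_type
   (for example SUNW.smpl). Returns 0 on success, -1 if buf is too small. */
int full_rt_name(char *buf, size_t cap, const char *vendor_id, const char *rt)
{
    int n = snprintf(buf, cap, "%s.%s", vendor_id, rt);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}

/* Build the conventional package name, Vendor_id immediately followed by
   Resource_type (for example SUNWsmpl). Rejects names longer than nine
   characters; the RGM itself does not apply this check. */
int conventional_pkg_name(char *buf, size_t cap,
                          const char *vendor_id, const char *rt)
{
    int n;

    if (strlen(vendor_id) + strlen(rt) > PKG_NAME_MAX)
        return -1;
    n = snprintf(buf, cap, "%s%s", vendor_id, rt);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```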

RT_version

Identifies the version of the sample data service.

API_version

Identifies the version of the API. For example, API_version = 2 indicates that the data service runs under Sun Cluster version 3.0.

Failover = TRUE

Indicates that the data service cannot run in a resource group that can be online on multiple nodes at once, that is, specifies a failover data service. See Implementing a Failover Resource for more information.

Start, Stop, Validate, and so on

Provide the paths to the respective callback method programs called by the RGM. These paths are relative to the directory specified by RT_basedir.
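Resolving a method path against RT_basedir amounts to joining the two strings. The helper below, resolve_method_path, is hypothetical. It also assumes that an absolute method path is used unchanged, which the text implies but does not state.

```c
#include <stdio.h>

/* Complete a callback method path: a relative path is taken relative to
   RT_basedir; an absolute path (assumption) is used as is.
   Returns 0 on success, -1 if out is too small. */
int resolve_method_path(char *out, size_t cap,
                        const char *rt_basedir, const char *method)
{
    int n;

    if (method[0] == '/')
        n = snprintf(out, cap, "%s", method);
    else
        n = snprintf(out, cap, "%s/%s", rt_basedir, method);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

With the sample declarations, Start = smpl_svc_start resolves to /opt/SUNWsmpl/bin/smpl_svc_start.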

The remaining resource type declarations provide configuration information, as follows:

Init_nodes = RG_PRIMARIES: Specifies that the RGM call the Init, Boot, Fini, and Validate methods only on nodes that can master the data service. The nodes specified by RG_PRIMARIES are a subset of all nodes on which the data service is installed. Set the value to RT_INSTALLED_NODES to specify that the RGM call these methods on all nodes on which the data service is installed.
RT_basedir: Points to /opt/SUNWsmpl/bin as the directory path to complete relative paths, such as callback method paths.

Declaring Resource Properties

As with resource type properties, you declare resource properties in the RTR file. By convention, resource property declarations follow the resource type declarations in the RTR file. The syntax for resource declarations is a set of attribute value pairs enclosed by curly brackets:

{
    Attribute = Value;
    Attribute = Value;
             .
             .
             .
    Attribute = Value;
}

For resource properties provided by Sun Cluster, so-called system-defined properties, you can change specific attributes in the RTR file. For example, Sun Cluster provides method timeout properties for each of the callback methods, and specifies default values. In the RTR file, you can specify different default values.

You can also define new resource properties in the RTR file, so-called extension properties, using a set of property attributes provided by Sun Cluster. Table A–4 lists the attributes for changing and defining resource properties. Extension property declarations follow the system-defined property declarations in the RTR file.

The first set of system-defined resource properties specifies timeout values for the callback methods:

...

# Resource property declarations appear as a list of bracketed
# entries after the resource-type declarations. The property 
# name declaration must be the first attribute after the open
# curly bracket of a resource property entry.
#
# Set minimum and default for method timeouts.
{
        PROPERTY = Start_timeout;
        MIN=60;
        DEFAULT=300;
}

{
        PROPERTY = Stop_timeout;
        MIN=60;
        DEFAULT=300;
}
{
        PROPERTY = Validate_timeout;
        MIN=60;
        DEFAULT=300;
}
{
        PROPERTY = Update_timeout;
        MIN=60;
        DEFAULT=300;
}
{
        PROPERTY = Monitor_Start_timeout;
        MIN=60;
        DEFAULT=300;
}
{
        PROPERTY = Monitor_Stop_timeout;
        MIN=60;
        DEFAULT=300;
}
{
        PROPERTY = Monitor_Check_timeout;
        MIN=60;
        DEFAULT=300;
}

The name of the property (PROPERTY = value) must be the first attribute for each resource-property declaration. You can configure resource properties, within limits defined by the property attributes in the RTR file. For example, the default value for each method timeout in the sample is 300 seconds. An administrator can change this value; however, the minimum allowable value, specified by the MIN attribute, is 60 seconds. See Table A–4 for a complete list of resource property attributes.

The next set of resource properties defines properties that have specific uses in the data service.

{
        PROPERTY = Failover_mode;
        DEFAULT=SOFT;
        TUNABLE = ANYTIME;
}
{
        PROPERTY = Thorough_Probe_Interval;
        MIN=1;
        MAX=3600;
        DEFAULT=60;
        TUNABLE = ANYTIME;
}

# The number of retries to be done within a certain period before concluding 
# that the application cannot be successfully started on this node.
{
        PROPERTY = Retry_Count;
        MAX=10;
        DEFAULT=2;
        TUNABLE = ANYTIME; 
}

# Set Retry_Interval as a multiple of 60 since it is converted from seconds
# to minutes, rounding up. For example, a value of 50 (seconds)
# is converted to 1 minute. Use this property to time the number of 
# retries (Retry_Count).
{
        PROPERTY = Retry_Interval;
        MAX=3600;
        DEFAULT=300;
        TUNABLE = ANYTIME;
}

{
        PROPERTY = Network_resources_used;
        TUNABLE = WHEN_DISABLED;
        DEFAULT = "";
}
{
        PROPERTY = Scalable;
        DEFAULT = FALSE;
        TUNABLE = AT_CREATION;
}
{
        PROPERTY = Load_balancing_policy;
        DEFAULT = LB_WEIGHTED;
        TUNABLE = AT_CREATION;
}
{
        PROPERTY = Load_balancing_weights;
        DEFAULT = "";
        TUNABLE = ANYTIME;
}
{
        PROPERTY = Port_list;
        TUNABLE = AT_CREATION;
        DEFAULT = ;
}

These resource-property declarations add the TUNABLE attribute, which limits the occasions on which the system administrator can change their values. AT_CREATION means the administrator can only specify the value when the resource is created and cannot change it later.
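The effect of the three TUNABLE values used in this listing can be summarized as a small decision function. In the sketch below, the enum and may_set_property are hypothetical. The ANYTIME and AT_CREATION rules follow the text; the WHEN_DISABLED rule (settable at creation or while the resource is disabled) is an assumption based on the name, so check the rt_reg(4) man page for the authoritative definition.

```c
#include <stdbool.h>

/* Hypothetical encoding of the TUNABLE values used in the sample RTR file. */
enum tunable { TUNABLE_ANYTIME, TUNABLE_AT_CREATION, TUNABLE_WHEN_DISABLED };

/*
 * May the administrator set the property now?
 *   creating - the resource is being created
 *   disabled - the existing resource is currently disabled
 */
bool may_set_property(enum tunable t, bool creating, bool disabled)
{
    switch (t) {
    case TUNABLE_ANYTIME:
        return true;
    case TUNABLE_AT_CREATION:
        return creating;                /* never changeable afterward */
    case TUNABLE_WHEN_DISABLED:
        return creating || disabled;    /* assumption, see rt_reg(4) */
    }
    return false;
}
```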

For most of these properties you can accept the default values as generated by Agent Builder unless you have a reason to change them. Information about these properties follows (for additional information, see Resource Properties or the r_properties(5) man page):

Failover_mode: Indicates whether the RGM should relocate the resource group or abort the node in the case of a failure of a Start or Stop method.
Thorough_probe_interval, Retry_count, Retry_interval: Used in the fault monitor. They are tunable at any time (TUNABLE = ANYTIME), so a system administrator can adjust them if the fault monitor is not functioning optimally.
Network_resources_used: A list of logical hostname or shared address resources used by the data service. Agent Builder declares this property so a system administrator can specify a list of resources, if there are any, when configuring the data service.
Scalable: Set to FALSE to indicate this resource does not use the cluster networking (shared address) facility. This setting is consistent with the resource type Failover property set to TRUE to indicate a failover service. See Implementing a Failover Resource and Implementing a Scalable Resource for additional information about how to use this property.
Load_balancing_policy, Load_balancing_weights: Agent Builder declares these properties automatically; however, they have no use in a failover resource type.
Port_list: Identifies the list of ports on which the server is listening. Agent Builder declares this property so a system administrator can specify a list of ports when configuring the data service.
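The comment on Retry_Interval in the listing above describes a seconds-to-minutes conversion that rounds up, and Retry_Count bounds the number of restarts within that interval. The following C sketch models both. retry_interval_minutes and may_restart are hypothetical helpers, not DSDL functions. A DSDL-based fault monitor may be able to leave this bookkeeping to library functions such as scds_fm_action(3HA).

```c
#include <stddef.h>
#include <time.h>

/* Convert Retry_interval from seconds to whole minutes, rounding up,
   as described in the RTR comment (50 seconds -> 1 minute). */
long retry_interval_minutes(long seconds)
{
    return (seconds + 59) / 60;
}

/*
 * Hypothetical restart bookkeeping for a fault monitor. restarts[] holds the
 * times of earlier restarts on this node. Returns 1 if one more local restart
 * at time now stays within Retry_count restarts for the rounded
 * Retry_interval window, or 0 if the monitor should give up on this node
 * (for example, by requesting that the resource group fail over).
 */
int may_restart(const time_t *restarts, size_t nrestarts, time_t now,
                int retry_count, long retry_interval_s)
{
    long window = retry_interval_minutes(retry_interval_s) * 60;
    int recent = 0;
    size_t i;

    for (i = 0; i < nrestarts; i++) {
        if ((long)(now - restarts[i]) < window)
            recent++;
    }
    return recent < retry_count;
}
```

With the sample defaults (Retry_Count = 2, Retry_Interval = 300), two restarts within the last five minutes exhaust the allowance. A third failure in that window therefore leads the monitor to give up on the node.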

Declaring Extension Properties

At the end of the sample RTR file are extension properties, as shown in the following listing:

# Extension Properties
#

# The cluster administrator must set the value of this property to point to the 
# directory that contains the configuration files used by the application. 
# For this application, smpl, specify the path of the configuration file on 
# PXFS (typically named.conf).
{
        PROPERTY = Confdir_list;
        EXTENSION;
        STRINGARRAY;
        TUNABLE = AT_CREATION;
        DESCRIPTION = "The Configuration Directory Path(s)";
}

# The following two properties control restart of the fault monitor.
{
        PROPERTY = Monitor_retry_count;
        EXTENSION;
        INT;
        DEFAULT = 4;
        TUNABLE = ANYTIME;
        DESCRIPTION = "Number of PMF restarts allowed for fault monitor.";
}
{
        PROPERTY = Monitor_retry_interval;
        EXTENSION;
        INT;
        DEFAULT = 2;
        TUNABLE = ANYTIME;
        DESCRIPTION = "Time window (minutes) for fault monitor restarts.";
}
# Time out value in seconds for the probe.
{
        PROPERTY = Probe_timeout;
        EXTENSION;
        INT;
        DEFAULT = 120;
        TUNABLE = ANYTIME;
        DESCRIPTION = "Time out value for the probe (seconds)";
}

# Child process monitoring level for PMF (-C option of pmfadm).
# Default of -1 means to not use the -C option of pmfadm.
# A value of 0 or greater indicates the desired level of child-process
# monitoring.
{
        PROPERTY = Child_mon_level;
        EXTENSION;
        INT;
        DEFAULT = -1;
        TUNABLE = ANYTIME;
        DESCRIPTION = "Child monitoring level for PMF";
}
# User added code -- BEGIN VVVVVVVVVVVV
# User added code -- END   ^^^^^^^^^^^^

Agent Builder creates some extension properties that are useful for most data services, as follows.

Confdir_list: Specifies the path to the application configuration directory, which is useful information for many applications. The system administrator can provide the location of this directory when configuring the data service.
Monitor_retry_count, Monitor_retry_interval: Control restarts of the fault monitor itself, not of the server daemon.
Probe_timeout: Sets the timeout value, in seconds, for the probe.
Child_mon_level: Sets the level of monitoring to be done by PMF. See pmfadm(1M) for more information.

You can create additional extension properties in the area delimited by the User added code comments.

Implementing Callback Methods

This section provides some information that pertains to implementing the callback methods in general.

Accessing Resource and Resource Group Property Information

Generally, callback methods require access to the properties of the resource. The RMAPI provides both shell commands and C functions that you can use in callback methods to access the system-defined and extension properties of resources. See the scha_resource_get(1HA) and scha_resource_get(3HA) man pages.

The DSDL provides a set of C functions (one for each property) to access system-defined properties, and a function to access extension properties. See the scds_property_functions(3HA) and scds_get_ext_property(3HA) man pages.

You cannot use the property mechanism to store dynamic state information for a data service because no API functions are available for setting resource properties (other than for setting Status and Status_msg). Rather, you should store dynamic state information in global files.
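The following C sketch stores dynamic state in a global file. It assumes a hypothetical state file path such as /global/smpl/state. It relies on rename() replacing the target atomically, as POSIX requires, so that a reader sees either the old or the new contents. save_state is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the contents of a state file, for example
 * /global/smpl/state on the cluster file system. The new contents go to a
 * temporary file in the same directory, which rename() then moves over the
 * target, so readers never see a partial write. The temporary name includes
 * the process ID; writers on different nodes would also need a node-unique
 * suffix. Returns 0 on success, -1 on failure.
 */
int save_state(const char *path, const char *state)
{
    char tmp[4096];
    FILE *fp;
    int n = snprintf(tmp, sizeof tmp, "%s.tmp.%ld", path, (long)getpid());

    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    if ((fp = fopen(tmp, "w")) == NULL)
        return -1;
    if (fputs(state, fp) == EOF) {
        fclose(fp);
        unlink(tmp);
        return -1;
    }
    if (fclose(fp) == EOF || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```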

Note –

The cluster administrator can set certain resource properties using the scrgadm command or through an available graphical administrative interface. However, do not call scrgadm from any callback method because scrgadm fails during cluster reconfiguration, that is, when the RGM calls the method.

Idempotency for Methods

In general, the RGM does not call a method more than once in succession on the same resource with the same arguments. However, if a Start method fails, the RGM could call a Stop method on a resource even though the resource was never started. Likewise, a resource daemon could die of its own accord and the RGM might still invoke its Stop method on it. The same scenarios apply to the Monitor_start and Monitor_stop methods.

For these reasons, you must build idempotency into your Stop and Monitor_stop methods, which means that repeated calls of Stop or Monitor_stop on the same resource with the same parameters achieve the same results as a single call.

One implication of idempotency is that Stop and Monitor_stop must return 0 (success) even if the resource or monitor is already stopped and there is no work to be done.
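The following C sketch shows what idempotency means for a Stop method that tracks its daemon through a PID file. It is a hypothetical illustration; smpl_stop and the PID-file convention are assumptions. A DSDL-based Stop method would more typically stop the process tree that PMF manages, for example with scds_pmf_stop(3HA). Every already-stopped case returns 0, and the function does not return until the process has exited or it gives up.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Poll until pid no longer exists or about limit_ms milliseconds pass.
   Returns 1 if the process is gone, 0 otherwise. */
static int wait_gone(pid_t pid, long limit_ms)
{
    struct timespec ts = { 0, 100L * 1000000L };   /* 100 ms */
    long waited;

    for (waited = 0; waited <= limit_ms; waited += 100) {
        if (kill(pid, 0) == -1 && errno == ESRCH)
            return 1;
        nanosleep(&ts, NULL);
    }
    return 0;
}

/*
 * Hypothetical idempotent Stop logic for a daemon that writes its process ID
 * to pidfile. Every "already stopped" case returns 0: no pidfile, a pidfile
 * without a usable PID, or a process that no longer exists. Returns 1 only
 * if the daemon could not be stopped.
 */
int smpl_stop(const char *pidfile, long grace_ms)
{
    FILE *fp = fopen(pidfile, "r");
    long pid = 0;

    if (fp == NULL)
        return errno == ENOENT ? 0 : 1;   /* no pidfile: nothing to stop */
    if (fscanf(fp, "%ld", &pid) != 1)
        pid = 0;                          /* unusable contents: treat as stopped */
    fclose(fp);

    if (pid > 1) {                        /* never signal 0, 1, or a process group */
        if (kill((pid_t)pid, SIGTERM) == -1) {
            if (errno != ESRCH)
                return 1;                 /* for example EPERM: a real failure */
        } else if (!wait_gone((pid_t)pid, grace_ms)) {
            (void)kill((pid_t)pid, SIGKILL);   /* last resort */
            if (!wait_gone((pid_t)pid, grace_ms))
                return 1;
        }
    }
    (void)unlink(pidfile);                /* a repeated call finds nothing to do */
    return 0;
}
```

Because PIDs can be reused, a production implementation should also confirm that the PID still belongs to the daemon before signaling it, and should keep twice grace_ms below Stop_timeout.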

Note –

The Init, Fini, Boot, and Update methods must also be idempotent. A Start method need not be idempotent.

Generic Data Service

A generic data service (GDS) is a mechanism for making simple applications highly available or scalable by plugging them into the Sun Cluster Resource Group Manager framework. This mechanism does not require you to code an agent, which is the typical approach for making an application highly available or scalable.

The GDS model relies on a precompiled resource type, SUNW.gds, to interact with the RGM framework.

See Chapter 10, Generic Data Services for additional information.

Controlling an Application

Callback methods enable the RGM to take control of the underlying resource (application) whenever nodes are in the process of joining or leaving the cluster.

Starting and Stopping a Resource

A resource type implementation requires, at a minimum, a Start method and a Stop method. The RGM calls a resource type's method programs at appropriate times and on the appropriate nodes for bringing resource groups offline and online. For example, after the crash of a cluster node, the RGM moves any resource groups mastered by that node onto a new node. You must implement a Start method to provide the RGM with a way of restarting each resource on the surviving host node.

A Start method must not return until the resource has been started and is available on the local node. Be certain that resource types requiring a long initialization period have sufficiently long timeouts set on their Start methods (set default and minimum values for the Start_timeout property in the resource type registration file).
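A common way to make a Start method wait until the resource is available is to poll a readiness probe with a deadline shorter than Start_timeout. In the sketch below, wait_until_ready and tcp_port_accepting are hypothetical helpers. The probe only checks whether something accepts TCP connections on a loopback port, which is a rough stand-in for an application-level check. On Solaris, socket code typically links with -lsocket -lnsl.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Readiness probe: returns nonzero once the service answers. */
typedef int (*ready_fn)(void *arg);

/*
 * Hypothetical helper: call ready(arg) every interval_ms milliseconds until
 * it succeeds or about limit_ms milliseconds have passed. Keep limit_ms
 * below Start_timeout so that Start can report the failure itself.
 * Returns 0 when ready, -1 on timeout.
 */
int wait_until_ready(ready_fn ready, void *arg, long interval_ms, long limit_ms)
{
    struct timespec ts;
    long waited = 0;

    if (interval_ms <= 0)
        interval_ms = 1;
    ts.tv_sec = interval_ms / 1000;
    ts.tv_nsec = (interval_ms % 1000) * 1000000L;

    for (;;) {
        if (ready(arg))
            return 0;
        if (waited >= limit_ms)
            return -1;
        nanosleep(&ts, NULL);
        waited += interval_ms;
    }
}

/* Example probe (hypothetical): does anything accept TCP connections on
   127.0.0.1 at the port that arg points to? */
int tcp_port_accepting(void *arg)
{
    int port = *(const int *)arg;
    struct sockaddr_in sa;
    int fd, ok;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return 0;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons((unsigned short)port);
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    ok = connect(fd, (struct sockaddr *)&sa, sizeof sa) == 0;
    close(fd);
    return ok;
}
```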

You must implement a Stop method for situations in which the RGM takes a resource group offline. For example, suppose a resource group is taken offline on Node1 and back online on Node2. While taking the resource group offline, the RGM calls the Stop method on resources in the group to stop all activity on Node1. After the Stop methods for all resources have completed on Node1, the RGM brings the resource group back online on Node2.

A Stop method must not return until the resource has completely stopped all its activity on the local node and has completely shut down. The safest implementation of a Stop method would terminate all processes on the local node related to the resource. Resource types requiring a long time to shut down should have sufficiently long timeouts set on their Stop methods. Set the Stop_timeout property in the resource type registration file.

Failure or timeout of a Stop method causes the resource group to enter an error state that requires operator intervention. To avoid this state, the Stop and Monitor_stop method implementations should attempt to recover from all possible error conditions. Ideally, these methods should exit with 0 (success) error status, having successfully stopped all activity of the resource and its monitor on the local node.

Deciding on the Start and Stop Methods to Use

This section provides some tips about when to use the Start and Stop methods versus using the Prenet_start and Postnet_stop methods. You must have in-depth knowledge of both the client and the data service's client-server networking protocol to decide which methods are appropriate.

Services that use network address resources might require that start or stop steps be done in a certain order relative to the logical hostname address configuration. The optional callback methods Prenet_start and Postnet_stop allow a resource type implementation to do special start-up and shutdown actions before and after network addresses in the same resource group are configured up or configured down.

The RGM calls methods that plumb (but do not configure up) the network addresses before calling the data service's Prenet_start method. The RGM calls methods that unplumb the network addresses after calling the data service's Postnet_stop method. The sequence is as follows when the RGM takes a resource group online:

Plumb network addresses.
Call data service's Prenet_start method (if any).
Configure network addresses up.
Call data service's Start method (if any).

The reverse happens when the RGM takes a resource group offline:

Call data service's Stop method (if any).
Configure network addresses down.
Call data service's Postnet_stop method (if any).
Unplumb network addresses.
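The two sequences above can be modeled in a few lines of C. A model like this can help you reason about which of your methods run while the network addresses are up. The struct rt_methods type and the two functions are hypothetical illustrations of the documented order, not RGM code.

```c
#include <stddef.h>

/* Which optional callback methods a hypothetical resource type implements. */
struct rt_methods {
    int has_prenet_start, has_start, has_stop, has_postnet_stop;
};

/* Fill steps[] (at least 4 entries) with the order used when the RGM takes
   a resource group online. Returns the number of steps. */
size_t online_sequence(const struct rt_methods *m, const char **steps)
{
    size_t n = 0;

    steps[n++] = "plumb network addresses";
    if (m->has_prenet_start)
        steps[n++] = "Prenet_start";
    steps[n++] = "configure network addresses up";
    if (m->has_start)
        steps[n++] = "Start";
    return n;
}

/* Same for taking the resource group offline: the reverse order. */
size_t offline_sequence(const struct rt_methods *m, const char **steps)
{
    size_t n = 0;

    if (m->has_stop)
        steps[n++] = "Stop";
    steps[n++] = "configure network addresses down";
    if (m->has_postnet_stop)
        steps[n++] = "Postnet_stop";
    steps[n++] = "unplumb network addresses";
    return n;
}
```

For a resource type that implements only Prenet_start and Postnet_stop, the model shows the service starting before the addresses come up and stopping only after they go down.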

When deciding whether to use the Start, Stop, Prenet_start, or Postnet_stop methods, first consider the server side. When bringing online a resource group containing both data service application resources and network address resources, the RGM calls methods to configure up the network addresses before it calls the data service resource Start methods. Therefore, if a data service requires network addresses to be configured up at the time it starts, use the Start method to start the data service.

Likewise, when bringing offline a resource group that contains both data service resources and network address resources, the RGM calls methods to configure down the network addresses after it calls the data service resource Stop methods. Therefore, if a data service requires network addresses to be configured up at the time it stops, use the Stop method to stop the data service.

For example, to start or stop a data service, you might have to invoke the data service's administrative utilities or libraries. Sometimes, the data service has administrative utilities or libraries that use a client-server networking interface to perform the administration. That is, an administrative utility makes a call to the server daemon, so the network address might need to be up to use the administrative utility or library. Use the Start and Stop methods in this scenario.

If the data service requires that the network addresses be configured down at the time it starts and stops, use the Prenet_start and Postnet_stop methods to start and stop the data service. Consider whether your client software will respond differently depending on whether the network address or the data service comes online first after a cluster reconfiguration, scha_control giveover, or scswitch switchover. For example, the client implementation might do minimal retries, giving up soon after determining that the data service port is not available.

If the data service does not require the network address to be configured up when it starts, start it before the network interface is configured up. This ensures that the data service is able to respond immediately to client requests as soon as the network address has been configured up, and clients are less likely to stop retrying. In this scenario, use the Prenet_start method rather than the Start method to start the data service.

If you use the Postnet_stop method, the data service resource is still running when the network address is configured down; the RGM invokes Postnet_stop only after the network address has been configured down. As a result, the data service's TCP or UDP service port, or its RPC program number, always appears to be available to clients on the network, except when the network address itself is not responding.

The decision to use the Start and Stop methods versus the Prenet_start and Postnet_stop methods, or to use both, must take the requirements and behavior of both the server and client into account.

Initializing and Terminating a Resource

Three optional methods, Init, Fini, and Boot, allow the RGM to execute initialization and termination code on a resource. The RGM invokes the Init method to perform a one-time initialization of the resource when the resource becomes managed—either when the resource group it is in is switched from an unmanaged to a managed state, or when it is created in a resource group that is already managed.

The RGM invokes the Fini method to clean up after the resource when the resource becomes unmanaged: either when the resource group it is in is switched to an unmanaged state or when the resource is deleted from a managed resource group. The cleanup must be idempotent; that is, if the cleanup has already been done, Fini must still exit 0 (success).

The RGM invokes the Boot method on nodes that have newly joined the cluster, that is, have been booted or rebooted.

The Boot method normally performs the same initialization as Init. This initialization must be idempotent; that is, if the resource has already been initialized on the local node, Boot and Init must still exit 0 (success).
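The following C sketch shows one way to make Init, Boot, and Fini idempotent. It assumes, purely for illustration, that the resource's one-time setup consists of creating a state directory; hypothetical_init and hypothetical_fini are invented names, and a real method would also parse the arguments the RGM passes to it and log errors.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical one-time setup: create a per-resource state directory.
 * Returns 0 whether or not the directory already exists, so Init and
 * Boot can both call it safely on the same node.
 */
int hypothetical_init(const char *state_dir)
{
    if (mkdir(state_dir, 0755) == 0 || errno == EEXIST)
        return 0;
    return 1;   /* genuine failure: method should exit non-zero */
}

/*
 * Hypothetical cleanup: remove the state directory.
 * Returns 0 even if an earlier Fini already removed it.
 */
int hypothetical_fini(const char *state_dir)
{
    if (rmdir(state_dir) == 0 || errno == ENOENT)
        return 0;
    return 1;
}
```

Calling either function twice in a row returns 0 both times, which is the property the RGM relies on. A stricter version would use stat() to confirm that an existing path really is a directory rather than accepting any EEXIST.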

Monitoring a Resource

Typically, you implement monitors to run periodic fault probes on resources to detect whether the probed resources are functioning correctly. If a fault probe fails, the monitor can attempt to restart the resource locally or request failover of the affected resource group by calling the scha_control RMAPI function or the scds_fm_action DSDL function.
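The restart-or-failover choice can be illustrated with a small C sketch of the bookkeeping a monitor might keep. This is not how scds_fm_action is implemented; the parameters retry_count and retry_interval are modeled loosely on the Retry_count and Retry_interval resource properties, and their exact semantics here are an assumption. The names probe_history and decide_action are hypothetical, and the sketch only decides; carrying out the action would still go through scha_control or scds_fm_action.

```c
#include <time.h>

/* Possible outcomes of a probe (hypothetical names). */
enum probe_action { ACTION_NONE, ACTION_RESTART, ACTION_FAILOVER };

#define MAX_HISTORY 16

/* Timestamps of recent probe failures kept by a hypothetical monitor. */
struct probe_history {
    time_t failures[MAX_HISTORY];
    int count;
};

/*
 * Record one probe result and decide what to do. A failure triggers a
 * local restart unless more than retry_count failures have occurred in
 * the last retry_interval seconds, in which case failover is requested.
 */
enum probe_action
decide_action(struct probe_history *h, int probe_ok, time_t now,
              int retry_count, int retry_interval)
{
    if (probe_ok)
        return ACTION_NONE;

    /* Forget failures that fall outside the retry window. */
    int kept = 0;
    for (int i = 0; i < h->count; i++)
        if (now - h->failures[i] < retry_interval)
            h->failures[kept++] = h->failures[i];
    h->count = kept;

    /* Record this failure, dropping the oldest entry if the buffer is full. */
    if (h->count == MAX_HISTORY) {
        for (int i = 1; i < MAX_HISTORY; i++)
            h->failures[i - 1] = h->failures[i];
        h->count--;
    }
    h->failures[h->count++] = now;

    return (h->count > retry_count) ? ACTION_FAILOVER : ACTION_RESTART;
}
```

With retry_count set to 2, the first two failures inside the window produce restarts and the third produces a failover request; a failure long after the window has elapsed starts the count over.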

You can also monitor the performance of a resource and tune or report performance. Writing a resource type-specific fault monitor is completely optional. Even if you choose not to write such a fault monitor, the resource type benefits from the basic monitoring of the cluster that Sun Cluster itself does. Sun Cluster detects failures of the host hardware, gross failures of the host's operating system, and a host's inability to communicate on its public networks.

Although the RGM does not call a resource monitor directly, it does provide for automatically starting monitors for resources. When bringing a resource offline, the RGM calls the Monitor_stop method to stop the resource's monitor on the local node before stopping the resource itself. When bringing a resource online, the RGM calls the Monitor_start method after the resource itself has been started.

The scha_control RMAPI function and the scds_fm_action DSDL function (which calls scha_control) allow resource monitors to request the failover of a resource group to a different node. As one of its sanity checks, scha_control calls Monitor_check (if defined), to determine if the requested node is reliable enough to master the resource group containing the resource. If Monitor_check reports back that the node is not reliable, or the method times out, the RGM looks for a different node to honor the failover request. If Monitor_check fails on all nodes, the failover is canceled.
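To make the Monitor_check rule concrete, the C sketch below models how a failover request proceeds through candidate nodes. monitor_check_fn, pick_failover_node, and demo_check are hypothetical names, and the sketch tries nodes in list order, which is an assumption: this guide does not say in what order the RGM considers nodes.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for running Monitor_check on one node: returns 0 if the node
 * is reliable enough, non-zero on failure or timeout (hypothetical). */
typedef int (*monitor_check_fn)(const char *node);

/*
 * Return the first candidate node whose Monitor_check succeeds, or NULL
 * if it fails on every node, in which case the failover is canceled.
 */
const char *
pick_failover_node(const char *const nodes[], size_t n, monitor_check_fn check)
{
    for (size_t i = 0; i < n; i++)
        if (check(nodes[i]) == 0)
            return nodes[i];
    return NULL;
}

/* Demo check for the usage example: pretend that node1 is unhealthy. */
int demo_check(const char *node)
{
    return strcmp(node, "node1") == 0 ? 1 : 0;
}
```

With the candidates node1 and node2, pick_failover_node skips node1 and returns node2; with node1 as the only candidate it returns NULL.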

The resource monitor can set the Status and Status_msg properties to reflect the monitor's view of the resource state. Use the RMAPI scha_resource_setstatus(1HA) command or scha_resource_setstatus(3HA) function, or the DSDL scds_fm_action(3HA) function, to set these properties.

Note –

Although Status and Status_msg are of particular use to a resource monitor, any program can set these properties.

See Defining a Fault Monitor for an example of a fault monitor implemented with the RMAPI. See The SUNW.xfnts Fault Monitor for an example of a fault monitor implemented with the DSDL. See the Sun Cluster 3.1 Data Services Installation and Configuration Guide for information on fault monitors built into Sun supplied data services.

Adding Message Logging to a Resource

If you want to record status messages in the same log file as other cluster messages, use the convenience function scha_cluster_getlogfacility to retrieve the facility number being used to log cluster messages.

Use this facility number with the regular Solaris syslog function to write messages to the cluster log. You can also access the cluster log facility information through the generic scha_cluster_get interface.
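A minimal C sketch of writing to the cluster log with the standard syslog interface follows. It assumes cluster_facility already holds the value obtained from scha_cluster_getlogfacility and that this value is a syslog facility code (such as LOG_DAEMON) suitable for openlog(); the scha_cluster_getlogfacility call itself is left out so the sketch compiles without the Sun Cluster libraries. The function name log_cluster_error is hypothetical.

```c
#include <syslog.h>

/*
 * Write one error message to the cluster log.
 * cluster_facility: assumed to be the facility returned by
 * scha_cluster_getlogfacility (not called here).
 */
void log_cluster_error(int cluster_facility, const char *tag, const char *msg)
{
    openlog(tag, LOG_PID, cluster_facility);  /* default facility for this process */
    syslog(LOG_ERR, "%s", msg);               /* "%s" avoids format-string surprises */
    closelog();
}
```

Passing the facility to openlog() rather than OR-ing it into every syslog() priority keeps each call site short; either form selects the same facility.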

Providing Process Management

The Resource Management API and the DSDL provide process management facilities to implement resource monitors and resource control callbacks. The RMAPI defines the following facilities (see the man pages for details on each of these commands and programs):

Process Monitor Facility: pmfadm and rpc.pmfd: The Process Monitor Facility (PMF) provides a means of monitoring processes and their descendants, and of restarting them if they die. The facility consists of the pmfadm command for starting and controlling monitored processes, and the rpc.pmfd daemon.
halockrun: A program for running a child program while holding a file lock. This command is convenient for use in shell scripts.
hatimerun: A program for running a child program under time-out control. This is a convenience command for use in shell scripts.

The DSDL provides the scds_hatimerun function to implement the hatimerun functionality.
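The idea behind hatimerun, running a child program under time-out control, can be sketched in plain POSIX C. This is not hatimerun or scds_hatimerun; run_with_timeout is a hypothetical name, and the return value 99 for a timeout is an arbitrary sentinel chosen for this sketch rather than a documented exit code of the real command.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Run argv with a time limit of roughly timeout_sec seconds.
 * Returns the child's exit status, 99 if the limit was reached
 * (child killed), or -1 if fork failed or the child did not exit normally.
 */
int run_with_timeout(char *const argv[], int timeout_sec)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* exec failed */
    }

    /* Poll every 100 ms until the child exits or the deadline passes. */
    const struct timespec tick = { 0, 100 * 1000 * 1000 };
    for (long waited_ms = 0; waited_ms < timeout_sec * 1000L; waited_ms += 100) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Deadline passed: kill the child and reap it. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return 99;
}
```

Polling with WNOHANG keeps the sketch free of signal handlers; an alarm()-based version is a common alternative.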

The DSDL provides a set of functions (scds_pmf_*) to implement the PMF functionality. See PMF Functions for an overview of the DSDL PMF functionality and for a list of the individual functions.
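The restart behavior that PMF provides can be illustrated with a deliberately simplified C supervisor loop. It is not the PMF and does not use pmfadm or the scds_pmf_* functions; supervise is a hypothetical name, and unlike the real facility this sketch neither tracks descendant processes nor applies a retry interval.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Start argv as a child and restart it each time it exits, up to
 * max_restarts times. Returns the number of restarts performed,
 * or -1 if fork() fails.
 */
int supervise(char *const argv[], int max_restarts)
{
    int restarts = 0;
    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);                 /* exec failed */
        }
        waitpid(pid, NULL, 0);          /* block until the child dies */
        if (restarts == max_restarts)
            return restarts;            /* restart budget used up: give up */
        restarts++;
    }
}
```

Supervising a program that always exits immediately, such as false, with a budget of 3 starts it four times and returns 3.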

Providing Administrative Support for a Resource

Administrative actions on resources include setting and changing resource properties. The API defines the Validate and Update callback methods so you can hook into these administrative actions.

The RGM calls the optional Validate method when a resource is created and when administrative action updates the properties of the resource or its containing group. The RGM passes the property values for the resource and its resource group to the Validate method. The RGM calls Validate on the set of cluster nodes indicated by the Init_nodes property of the resource's type (see Resource Type Properties, or the rt_properties(5) man page, for information about Init_nodes). The RGM calls Validate before the creation or update is applied, and a failure exit code from the method on any node causes the creation or update to fail.

The RGM calls Validate only when resource or group properties are changed through administrative action, not when the RGM sets properties, or when a monitor sets the resource properties Status and Status_msg.

The RGM calls the optional Update method to notify a running resource that properties have been changed. The RGM invokes Update after an administrative action succeeds in setting properties of a resource or its group. The RGM calls this method on nodes where the resource is online. This method can use the API access functions to read property values that might affect an active resource and adjust the running resource accordingly.

Implementing a Failover Resource

A failover resource group contains network addresses, such as the built-in resource types logical hostname and shared address, and failover resources, such as the data service application resources for a failover data service. The network address resources, along with their dependent data service resources, move between cluster nodes when data services fail over or are switched over. The RGM provides a number of properties that support implementation of a failover resource.

Set the boolean resource type property, Failover, to TRUE, to restrict the resource from being configured in a resource group that can be online on more than one node at a time. This property defaults to FALSE, so you must declare it as TRUE in the RTR file for a failover resource.

The Scalable resource property determines if the resource uses the cluster shared-address facility. For a failover resource, set Scalable to FALSE because a failover resource does not use shared addresses.

The RG_mode resource group property allows the cluster administrator to identify a resource group as failover or scalable. If RG_mode is FAILOVER, the RGM sets the Maximum_primaries property of the group to 1 and restricts the resource group to being mastered by a single node. The RGM does not allow a resource whose Failover property is TRUE to be created in a resource group whose RG_mode is SCALABLE.

The Implicit_network_dependencies resource group property specifies that the RGM should enforce implicit strong dependencies of non-network-address resources on all network-address resources (logical hostname and shared address) within the group. This means that the non-network address (data service) resources in the group will not have their Start methods called until the network addresses in the group are configured up. The Implicit_network_dependencies property defaults to TRUE.

Implementing a Scalable Resource

A scalable resource can be online on more than one node simultaneously. Scalable resources include data services such as Sun Cluster HA for Sun ONE Web Server and HA-Apache.

The RGM provides a number of properties that support implementation of a scalable resource.

Set the boolean resource type property, Failover, to FALSE, to allow the resource to be configured in a resource group that can be online on more than one node at a time.

The Scalable resource property determines if the resource uses the cluster shared-address facility. Set this property to TRUE because a scalable service uses a shared-address resource to make the multiple instances of the scalable service appear as a single service to the client.

The RG_mode property enables the cluster administrator to identify a resource group as failover or scalable. If RG_mode is SCALABLE, the RGM allows Maximum_primaries to have a value greater than 1, meaning the group can be mastered by multiple nodes simultaneously. The RGM allows a resource whose Failover property is FALSE to be instantiated in a resource group whose RG_mode is SCALABLE.

The cluster administrator creates a scalable resource group to contain scalable service resources, and a separate failover resource group to contain the shared-address resources upon which the scalable resource depends.

The cluster administrator uses the RG_dependencies resource group property to specify the order in which resource groups are brought online and offline on a node. This ordering is important for a scalable service because the scalable resources and the shared address resources upon which they depend are in different resource groups. A scalable data service requires that its network address (shared address) resources be configured up before it is started. Therefore, the administrator must set the RG_dependencies property (of the resource group containing the scalable service) to include the resource group containing the shared address resources.

When you declare the Scalable property in the RTR file for a resource, the RGM automatically creates the following set of scalable properties for the resource:

Network_resources_used

Identifies the shared address resources used by this resource. This property defaults to the empty string, so the cluster administrator must provide the actual list of shared addresses the scalable service uses when creating the resource. The scsetup command and SunPlex Manager provide features to automatically set up the necessary resources and groups for scalable services.

Load_balancing_policy

Specifies the load balancing policy for the resource. You can explicitly set the policy in the RTR file (or allow the default, LB_WEIGHTED). In either case, the cluster administrator can change the value when creating the resource (unless you set tunability for Load_balancing_policy to NONE or FALSE in the RTR file). Legal values are:

LB_WEIGHTED: The load is distributed among various nodes according to the weights set in the Load_balancing_weights property.
LB_STICKY: A given client (identified by the client's IP address) of the scalable service is always sent to the same node of the cluster.
LB_STICKY_WILD: A given client (identified by the client's IP address) that connects to an IP address of a wildcard sticky service is always sent to the same cluster node, regardless of the destination port number.

For a scalable service with Load_balancing_policy LB_STICKY or LB_STICKY_WILD, changing Load_balancing_weights while the service is online can cause existing client affinities to be reset. In that case, a different node might service a subsequent client request even if the client had been previously serviced by another node in the cluster.

Similarly, starting a new instance of the service on a cluster might reset existing client affinities.

Load_balancing_weights

Specifies the load to be sent to each node. The format is weight@node,weight@node, where weight is an integer reflecting the relative portion of load distributed to the specified node. The fraction of load distributed to a node is the weight for this node divided by the sum of all weights of active instances. For example, 1@1,3@2 specifies that node 1 receives 1/4 of the load and node 2 receives 3/4.

Port_list

Identifies the ports on which the server is listening. This property defaults to the empty string. You can provide a list of ports in the RTR file. Otherwise, the cluster administrator must provide the actual list of ports when creating the resource.
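The Load_balancing_weights arithmetic described under that property can be checked with a short C sketch that parses a weight@node,weight@node string and computes one node's share of the load. load_fraction is a hypothetical helper; it assumes every listed node is running an active instance, because the real divisor is the sum of weights of active instances only.

```c
#include <stdlib.h>

/*
 * Fraction of the load that a Load_balancing_weights string such as
 * "1@1,3@2" assigns to node_id, assuming all listed nodes are active.
 * Returns -1.0 if the string is malformed or all weights are zero.
 */
double load_fraction(const char *weights, int node_id)
{
    long total = 0, mine = 0;
    const char *p = weights;

    while (*p != '\0') {
        char *end;
        long weight = strtol(p, &end, 10);      /* weight before '@' */
        if (end == p || *end != '@' || weight < 0)
            return -1.0;
        p = end + 1;
        long node = strtol(p, &end, 10);        /* node after '@' */
        if (end == p || (*end != ',' && *end != '\0'))
            return -1.0;
        total += weight;
        if (node == node_id)
            mine += weight;
        p = (*end == ',') ? end + 1 : end;      /* next pair, if any */
    }
    return total > 0 ? (double)mine / (double)total : -1.0;
}
```

For the example in the text, load_fraction("1@1,3@2", 1) returns 0.25 and load_fraction("1@1,3@2", 2) returns 0.75.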

You can create a data service that can be configured by the administrator to be either scalable or failover. To do so, declare both the Failover resource type property and the Scalable resource property as FALSE in the data service's RTR file. Specify the Scalable property to be tunable at creation.

The Failover property value (FALSE) allows the resource to be configured into a scalable resource group. The administrator can enable shared addresses by changing the value of Scalable to TRUE when creating the resource, thereby creating a scalable service.

On the other hand, even though Failover is set to FALSE, the administrator can configure the resource into a failover resource group to implement a failover service. The administrator does not change the value of Scalable, which is FALSE. To support this contingency, you should provide a check in the Validate method on the Scalable property. If Scalable is FALSE, verify that the resource is configured into a failover resource group.
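The Validate check recommended above can be sketched in C as a pure function. The two inputs stand in for the Scalable resource property and the RG_mode of the containing resource group; reading them through RMAPI or DSDL calls is deliberately omitted, and validate_mode is a hypothetical name. The return value follows the Validate convention that a non-zero exit status rejects the creation or update.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extra Validate check for a resource type that may be configured as
 * either failover or scalable. Returns 0 if acceptable, 1 otherwise.
 */
int validate_mode(int scalable, const char *rg_mode)
{
    if (!scalable && strcmp(rg_mode, "FAILOVER") != 0) {
        fprintf(stderr,
                "Scalable is FALSE, so the resource must be in a "
                "failover resource group (RG_mode is %s)\n", rg_mode);
        return 1;
    }
    return 0;
}
```

A resource with Scalable set to FALSE in a group whose RG_mode is SCALABLE is rejected; the other combinations pass this particular check.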

See Sun Cluster 3.1 Concepts for additional information regarding scalable resources.

Validation Checks For Scalable Services

Whenever a resource is created or updated with the Scalable property set to TRUE, the RGM validates various resource properties. If the properties are not configured correctly, the RGM rejects the attempted update or creation. The RGM performs the following checks:

The Network_resources_used property must be non-empty and contain the names of existing shared address resources. Every node in the Nodelist of the resource group containing the scalable resource must appear in either the NetIfList property or AuxNodeList property of each of the named shared address resources.
The RG_dependencies property of the resource group that contains the scalable resource must include the resource groups of all shared address resources listed in the scalable resource's Network_resources_used property.
The Port_list property must be non-empty and contain a list of port-protocol pairs such that protocol is either tcp or udp. For example,
Port_list=80/tcp,40/udp
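The Port_list rule can be expressed as a short C sketch, for example to catch mistakes early in your own Validate method. port_list_valid is a hypothetical helper, not the RGM's own code, and the 1 to 65535 port range check is an extra assumption beyond the checks listed above.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Check a Port_list value such as "80/tcp,40/udp": it must be non-empty
 * and every entry must be port/protocol with protocol tcp or udp.
 * Returns 1 if valid, 0 otherwise.
 */
int port_list_valid(const char *list)
{
    if (list == NULL || *list == '\0')
        return 0;

    const char *p = list;
    for (;;) {
        char *end;
        long port = strtol(p, &end, 10);
        if (end == p || *end != '/' || port < 1 || port > 65535)
            return 0;
        p = end + 1;

        /* The protocol runs up to the next comma or the end of the string. */
        size_t len = strcspn(p, ",");
        if (!(len == 3 && (strncmp(p, "tcp", 3) == 0 ||
                           strncmp(p, "udp", 3) == 0)))
            return 0;
        p += len;
        if (*p == '\0')
            return 1;
        p++;                            /* skip the comma */
    }
}
```

The example value 80/tcp,40/udp is accepted, while an empty string, 80/sctp, or a bare 80 is rejected.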

Writing and Testing Data Services

This section provides some information about writing and testing data services.

Using Keep-Alives

On the server side, using TCP keep-alives protects the server from wasting system resources for a down (or network-partitioned) client. If these resources are not cleaned up (in a server that stays up long enough), eventually the wasted resources grow without bound as clients crash and reboot.

If the client-server communication uses a TCP stream, then both the client and the server should enable the TCP keep-alive mechanism. This provision applies even in the non-HA, single-server case.
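On both the client and the server, TCP keep-alive is enabled per socket with the standard SO_KEEPALIVE socket option. The helper name enable_keepalive is hypothetical; note that this call only turns keep-alive on, and the interval between probes comes from system-wide TCP tuning rather than from this code.

```c
#include <sys/socket.h>

/*
 * Enable TCP keep-alive on a socket (a connected socket, or a listening
 * socket whose accepted connections should inherit it).
 * Returns 0 on success, -1 on failure with errno set by setsockopt.
 */
int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}
```

A server would typically call this on each accepted connection, and a client right after connect() succeeds.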

Other connection-oriented protocols might also have a keep-alive mechanism.

On the client side, using TCP keep-alives enables the client to be notified when a network address resource has failed over or switched over from one physical host to another. That transfer of the network address resource breaks the TCP connection. However, unless the client has enabled the keep-alive, it does not necessarily learn of the connection break if the connection happens to be quiescent at the time.

For example, suppose the client is waiting for a response from the server to a long-running request, and the client's request message has already arrived at the server and has been acknowledged at the TCP layer. In this situation, the client's TCP module has no need to keep retransmitting the request, and the client application is blocked, waiting for a response to the request.

Where possible, in addition to using the TCP keep-alive mechanism, the client application must also perform its own periodic keep-alive at the application level, because the TCP keep-alive mechanism is not perfect in all possible boundary cases. Using an application-level keep-alive typically requires that the client-server protocol support a null operation or at least an efficient read-only operation such as a status operation.
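An application-level keep-alive can be sketched in C as a request for a null operation followed by a bounded wait for the reply. The "NOOP" request and "OK" reply are a hypothetical protocol invented for this sketch, as is the name app_keepalive; a real service would use the null or status operation its own protocol defines. The sketch also assumes the short reply arrives in a single recv().

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Send a hypothetical null operation over a connected stream socket and
 * wait up to timeout_ms for the reply.
 * Returns 0 if the server answered "OK\n" in time, -1 otherwise.
 */
int app_keepalive(int fd, int timeout_ms)
{
    static const char req[] = "NOOP\n";
    if (send(fd, req, sizeof req - 1, 0) != (ssize_t)(sizeof req - 1))
        return -1;

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;              /* timed out or poll error: assume dead */

    char reply[8];
    ssize_t n = recv(fd, reply, sizeof reply - 1, 0);
    if (n <= 0)
        return -1;              /* connection closed or read error */
    reply[n] = '\0';
    return strcmp(reply, "OK\n") == 0 ? 0 : -1;
}
```

A client that gets -1 would close the connection and reconnect to the network address, which may by then be hosted on a different node.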

Testing HA Data Services

This section provides suggestions about how to test a data service implementation in the HA environment. The test cases are suggestions and are not exhaustive. You need access to a test-bed Sun Cluster configuration so the testing work does not impact production machines.

Test that your HA data service behaves properly in all cases where a resource group is moved between physical hosts. These cases include system crashes and the use of the scswitch command. Test that client machines continue to get service after these events.

Test the idempotency of the methods. For example, replace each method temporarily with a short shell script that calls the original method two or more times.
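The text suggests a short shell wrapper for this test. The same idea expressed in C is a small harness that runs a method program twice and requires a zero exit status both times; run_once and run_twice are hypothetical names, and when testing a real callback method you would pass it the same arguments the RGM supplies.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv once; return its exit status, or -1 if it could not run. */
static int run_once(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                     /* exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/*
 * Idempotency check: an idempotent method exits 0 on both runs.
 * Returns the first non-zero status, or 0 if both runs succeeded.
 */
int run_twice(char *const argv[])
{
    for (int i = 0; i < 2; i++) {
        int rc = run_once(argv);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```

Running each method twice in a row on a test cluster is a cheap way to catch a Stop or Fini that fails when there is nothing left to stop or clean up.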

Coordinating Dependencies Between Resources

Sometimes one client-server data service makes requests on another client-server data service while fulfilling a request for a client. Informally, a data service A depends on a data service B if, for A to provide its service, B must provide its service. Sun Cluster provides for this requirement by permitting resource dependencies to be configured within a resource group. The dependencies affect the order in which Sun Cluster starts and stops data services. See the scrgadm(1M) man page for details.

If resources of your resource type depend on resources of another type, you need to instruct the user to configure the resources and resource groups appropriately, or provide scripts or tools to correctly configure them. If the dependent resource must run on the same node as the depended-on resource, then both resources must be configured in the same resource group.

Decide whether to use explicit resource dependencies, or to omit them and poll for the availability of the other data service(s) in your HA data service's own code. In the case that the dependent and depended-on resource can run on different nodes, configure them into separate resource groups. In this case, polling is required because it is not possible to configure resource dependencies across groups.

Some data services store no data directly themselves, but instead depend on another back-end data service to store all their data. Such a data service translates all read and update requests into calls on the back-end data service. For example, consider a hypothetical client-server appointment calendar service that keeps all of its data in an SQL database such as Oracle. The appointment calendar service has its own client-server network protocol. For example, it might have defined its protocol using an RPC specification language, such as ONC™ RPC.

In the Sun Cluster environment, you can use HA-ORACLE to make the back-end Oracle database highly available. Then you can write simple methods for starting and stopping the appointment calendar daemon. Your end user registers the appointment calendar resource type with Sun Cluster.

If the appointment calendar application must run on the same node as the Oracle database, then the end user configures the appointment calendar resource in the same resource group as the HA-ORACLE resource, and makes the appointment calendar resource dependent on the HA-ORACLE resource. This dependency is specified using the Resource_dependencies property tag in scrgadm.

If the HA-ORACLE resource is able to run on a different node than the appointment calendar resource, the end user configures them into two separate resource groups. The end user might configure a resource group dependency of the calendar resource group on the Oracle resource group. However, resource group dependencies are only effective when both resource groups are being started or stopped on the same node at the same time. Therefore, the calendar data service daemon, after it has been started, might poll while waiting for the Oracle database to become available. The calendar resource type's Start method usually would just return success in this case, because if the Start method blocked indefinitely it would put its resource group into a busy state, which would prevent any further state changes (such as edits, failovers, or switchovers) on the group. However, if the calendar resource's Start method timed out or exited non-zero, it might cause the resource group to ping-pong between two or more nodes while the Oracle database remained unavailable.
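The polling approach can be sketched in C as a helper that the calendar daemon itself, not its Start method, would call after startup. wait_for_backend and backend_reachable are hypothetical names, and a successful TCP connect to the back end's listener is an assumed stand-in for "the Oracle database is available"; a real check would more likely use a database-level probe. Because connect() blocks, a production version would also bound each attempt, for example with a non-blocking connect and poll().

```c
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Return 1 if a TCP connection to host:port succeeds, 0 otherwise. */
static int backend_reachable(const char *host, int port)
{
    char service[16];
    snprintf(service, sizeof service, "%d", port);

    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, service, &hints, &res) != 0)
        return 0;

    int ok = 0;
    for (struct addrinfo *ai = res; ai != NULL && !ok; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        ok = (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0);
        close(fd);
    }
    freeaddrinfo(res);
    return ok;
}

/*
 * Poll until the back end accepts connections, sleeping interval_ms
 * between attempts. max_attempts <= 0 means poll indefinitely, which
 * suits a daemon that should simply wait for its back end.
 * Returns 0 once reachable, -1 if max_attempts is exhausted.
 */
int wait_for_backend(const char *host, int port, int interval_ms, int max_attempts)
{
    struct timespec delay = { interval_ms / 1000,
                              (interval_ms % 1000) * 1000000L };
    for (int attempt = 1; max_attempts <= 0 || attempt <= max_attempts; attempt++) {
        if (backend_reachable(host, port))
            return 0;
        nanosleep(&delay, NULL);
    }
    return -1;
}
```

Keeping this loop in the daemon lets the Start method return success promptly, which avoids both the busy resource group and the ping-pong behavior described above.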
