Sun Cluster Data Services Developer's Guide for Solaris OS

Chapter 2 Developing a Data Service

This chapter tells you how to make an application highly available or scalable, and provides detailed information about developing a data service.

This chapter covers the following topics:

Analyzing the Application for Suitability

The first step in creating a data service is to determine whether the target application satisfies the requirements for being made highly available or scalable. If the application fails to meet all requirements, you might be able to modify the application source code to make it highly available or scalable.

The list that follows summarizes the requirements for an application to be made highly available or scalable. If you need more detail or if you need to modify the application source code, see Appendix B, Sample Data Service Code Listings.


Note –

A scalable service must meet all the following conditions for high availability as well as some additional criteria, which follow the list.


Additionally, scalable services must meet the following requirements:

For a scalable service, application characteristics also determine the load-balancing policy. For example, the load-balancing policy Lb_weighted, which allows any instance to respond to client requests, does not work for an application that makes use of an in-memory cache on the server for client connections. In this case, specify a load-balancing policy that restricts a given client's traffic to one instance of the application. The load-balancing policies Lb_sticky and Lb_sticky_wild repeatedly send all requests by a client to the same application instance, where they can make use of an in-memory cache. Note that if multiple client requests come in from different clients, the RGM distributes the requests among the instances of the service. See Implementing a Scalable Resource for more information about setting the load-balancing policy for scalable data services.

Determining the Interface to Use

The Sun Cluster developer support package (SUNWscdev) provides two sets of interfaces for coding data service methods:

Also included in the Sun Cluster developer support package is Sun Cluster Agent Builder, a tool that automates the creation of a data service.

    Here is the recommended approach to developing a data service:

  1. Decide whether to code in C or the Korn shell. If you decide to use the Korn shell, you cannot use the DSDL, which provides a C interface only.

  2. Run Agent Builder, specify the requested information, and generate a data service, which includes source and executable code, an RTR file, and a package.

  3. If the generated data service requires customizing, you can add DSDL code to the generated source files. Agent Builder indicates, with comments, specific places in the source files where you can add your own code.

  4. If the code requires further customizing to support the target application, you can add RMAPI functions to the existing source code.

In practice, you can take numerous approaches to creating a data service. For example, rather than add your own code to specific places in the code that is generated by Agent Builder, you could entirely replace one of the generated methods or the generated monitor program with a program that you write from scratch using DSDL or RMAPI functions.

However, regardless of how you proceed, in almost every case, starting with Agent Builder makes sense, for the following reasons:


Note –

Unlike the RMAPI, which provides a set of C functions and a set of commands for use in scripts, the DSDL provides a C function interface only. Therefore, if you specify Korn shell (ksh) output in Agent Builder, the generated source code makes calls to RMAPI because there are no DSDL ksh commands.


Setting Up the Development Environment for Writing a Data Service

Before you begin to develop your data service, you must install the Sun Cluster development package (SUNWscdev) to have access to the Sun Cluster header and library files. Although this package is already installed on all cluster nodes, you typically develop your data service on a separate, non-cluster development machine, rather than on a cluster node. In this typical case, you must use the pkgadd command to install the SUNWscdev package on your development machine.


Note –

On the development machine, ensure that you are using the Developer or Entire Distribution software group of the Solaris 9 OS or the Solaris 10 OS.


When compiling and linking your code, you must set particular options to identify the header and library files.


Note –

You cannot mix compatibility-mode compiled C++ code and standard-mode compiled C++ code in the Solaris Operating System and Sun Cluster products.

Consequently, if you intend to create a C++ based data service for use on Sun Cluster, you must compile that data service, as follows:


When you have finished development (on a non-cluster node), you can transfer the completed data service to a cluster for testing.

The procedures in this section describe how to complete the following tasks:

How to Set Up the Development Environment

This procedure describes how to install the SUNWscdev package and set the compiler and linker options for data service development.

  1. Become superuser or assume a role that provides solaris.cluster.modify RBAC authorization.

  2. Change directory to the CD-ROM directory that you want.


    # cd cd-rom-directory
    
  3. Install the SUNWscdev package in the current directory.


    # pkgadd -d . SUNWscdev
    
  4. In the makefile, specify compiler and linker options that identify the include and library files for your data service code.

    Specify the -I option to identify the Sun Cluster header files, the -L option to specify the compile-time library search path on the development system, and the -R option to specify the library search path to the runtime linker in the cluster.

    # Makefile for sample data service
    ...
    
    -I /usr/cluster/include
    
    -L /usr/cluster/lib
    
    -R /usr/cluster/lib
    ...
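
    For reference, these options might appear in a complete makefile rule such
    as the following sketch. The target and source file names are examples
    only, and the libscha and libdsdev libraries are assumed for code that
    uses the RMAPI and the DSDL, respectively.

    # Hypothetical makefile fragment for building a data service method
    CC      = cc
    CFLAGS  = -I/usr/cluster/include
    LDFLAGS = -L/usr/cluster/lib -R/usr/cluster/lib
    LDLIBS  = -lscha -ldsdev

    smpl_svc_start: smpl_svc_start.c
            $(CC) $(CFLAGS) $(LDFLAGS) -o $@ smpl_svc_start.c $(LDLIBS)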

Transferring a Data Service to a Cluster

When you have completed the data service on a development machine, you must transfer the data service to a cluster for testing. To reduce the chance of error during the transfer, combine the data service code and the RTR file into a package. Then, install the package on the Solaris hosts on which you want to run the service.


Note –

Agent Builder creates this package automatically.


Setting Resource and Resource Type Properties

Sun Cluster provides a set of resource type properties and resource properties that you use to define the static configuration of a data service. Resource type properties specify the type of the resource, its version, the version of the API, and the paths to each of the callback methods. Resource Type Properties lists all the resource type properties.

Resource properties, such as Failover_mode, Thorough_probe_interval, and method timeouts, also define the static configuration of the resource. Dynamic resource properties, such as Resource_state and Status, reflect the active state of a managed resource. Resource Properties describes the resource properties.

You declare the resource type and resource properties in the resource type registration (RTR) file, which is an essential component of a data service. The RTR file defines the initial configuration of the data service at the time that the cluster administrator registers the data service with the Sun Cluster software.

Use Agent Builder to generate the RTR file for your data service. Agent Builder declares the set of properties that are both useful and required for any data service. For example, particular properties, such as Resource_type, must be declared in the RTR file. Otherwise, registration of the data service fails. Other properties, although not required, are not available to a cluster administrator unless you declare them in the RTR file. Some properties are available whether you declare them or not because the RGM defines them and provides default values. To avoid this level of complexity, use Agent Builder to guarantee the generation of a correct RTR file. Later, you can edit the RTR file to change specific values if necessary.

The rest of this section shows a sample RTR file, which was created by Agent Builder.

Declaring Resource Type Properties

The cluster administrator cannot configure the resource type properties that you declare in the RTR file. They become part of the permanent configuration of the resource type.


Note –

Only a cluster administrator can configure the resource type property Installed_nodes. You cannot declare Installed_nodes in the RTR file.


The syntax of resource type declarations is as follows:

property-name = value;

Note –

Property names for resource groups, resources, and resource types are not case sensitive. You can use any combination of uppercase and lowercase letters when you specify property names.


These are resource type declarations in the RTR file for a sample (smpl) data service:

# Sun Cluster Data Services Builder template version 1.0
# Registration information and resources for smpl
#
#NOTE: Keywords are case insensitive, i.e., you can use
#any capitalization style you prefer.
#
Resource_type = "smpl";
Vendor_id = SUNW;
RT_description = "Sample Service on Sun Cluster";

RT_version = "1.0";
API_version = 2;
Failover = TRUE;

Init_nodes = RG_PRIMARIES;

RT_basedir=/opt/SUNWsmpl/bin;

Start           =    smpl_svc_start;
Stop            =    smpl_svc_stop;

Validate        =    smpl_validate;
Update          =    smpl_update;

Monitor_start   =    smpl_monitor_start;
Monitor_stop    =    smpl_monitor_stop;
Monitor_check   =    smpl_monitor_check;

Tip –

You must declare the Resource_type property as the first entry in the RTR file. Otherwise, registration of the resource type fails.


The first set of resource type declarations provides basic information about the resource type.

Resource_type and Vendor_id

Provide a name for the resource type. You can specify the resource type name with the Resource_type property alone (smpl) or by using the Vendor_id property as a prefix with a period (.) separating it from the resource type (SUNW.smpl), as shown in the sample. If you use Vendor_id, make it the stock market symbol of the company that is defining the resource type. The resource type name must be unique in the cluster.


Note –

By convention, the resource type name (vendoridApplicationname) is used as the package name. Starting with the Solaris 9 OS, the combination of vendor ID and application name can exceed nine characters.

Agent Builder, on the other hand, in all cases explicitly generates the package name from the resource type name, so it enforces the nine-character limit.


RT_description

Briefly describes the resource type.

RT_version

Identifies the version of the sample data service.

API_version

Identifies the version of the API. For example, API_version = 2 indicates that the data service can be installed on any version of Sun Cluster starting with Sun Cluster 3.0. API_version = 7 indicates that the data service can be installed on any version of Sun Cluster starting with Sun Cluster 3.2. However, API_version = 7 also indicates that the data service cannot be installed on any version of Sun Cluster that was released before Sun Cluster 3.2. This property is described in more detail under the entry for API_version in Resource Type Properties.

Failover = TRUE

Indicates that the data service cannot run in a resource group that can be online on multiple nodes at the same time. In other words, this declaration specifies a failover data service. This property is described in more detail under the entry for Failover in Resource Type Properties.

The remaining resource type declarations provide configuration information.

Init_nodes = RG_PRIMARIES

Specifies that the RGM call the Init, Boot, Fini, and Validate methods only on nodes that can master the data service. The nodes that are specified by RG_PRIMARIES are a subset of all nodes on which the data service is installed. Set the value to RT_INSTALLED_NODES to specify that the RGM call these methods on all nodes on which the data service is installed.

RT_basedir

Points to /opt/SUNWsmpl/bin as the directory path to complete relative paths, such as callback method paths.

Start, Stop, and Validate

Provide the paths to the respective callback method programs that are called by the RGM. These paths are relative to the directory that is specified by RT_basedir.

Declaring Resource Type Properties for a Zone Cluster

You (and the cluster administrator) can register a resource type for use in a particular zone cluster by creating an RTR file under the zone root path. To correctly configure this RTR file, ensure that it meets the following conditions:

You can also register a resource type for a zone cluster by placing an RTR file in the /usr/cluster/lib/rgm/rtreg/ directory. The cluster administrator cannot configure the resource type properties that you declare in an RTR file in this directory.

Resource types that are defined in RTR files in the /opt/cluster/lib/rgm/rtreg/ directory are for the exclusive use of the global cluster.

Declaring Resource Properties

As with resource type properties, you declare resource properties in the RTR file. By convention, resource property declarations follow the resource type declarations in the RTR file. The syntax for resource declarations is a set of attribute value pairs enclosed by braces ({}):

{
    attribute = value;
    attribute = value;
             .
             .
             .
    attribute = value;
}

For resource properties that are provided by Sun Cluster, which are called system-defined properties, you can change specific attributes in the RTR file. For example, Sun Cluster provides default values for method timeout properties for each callback method. In the RTR file, you can specify different default values.

If an RGM method callback times out, the method's process tree is killed by a SIGABRT signal (not a SIGTERM signal). As a result, all members of the process group generate a core dump file in the /var/cluster/core directory or in a subdirectory of the /var/cluster/core directory on the node on which the method exceeded its timeout. This core dump file is generated to enable you to determine why your method exceeded its timeout.


Note –

Avoid writing data service methods that create a new process group. If your data service method must create a new process group, write a signal handler for the SIGTERM and SIGABRT signals. Also, ensure that your signal handler forwards the SIGTERM or SIGABRT signal to the child process group or groups before the signal handler terminates the process. Writing a signal handler for these signals increases the likelihood that all processes that are spawned by your method are correctly terminated.
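
The following C sketch illustrates one way to provide such a handler. It assumes that the method saved the ID of the child process group that it created; the child_pgid variable and its initialization are hypothetical.

#include <sys/types.h>
#include <signal.h>
#include <unistd.h>

static pid_t child_pgid = -1;   /* hypothetical: set after fork()/setpgid() */

/* Forward the received signal to the child process group, then exit. */
static void
forward_signal(int sig)
{
        if (child_pgid > 0)
                (void) kill(-child_pgid, sig);  /* negative PID targets the group */
        _exit(1);
}

/* Install the handler for SIGTERM and SIGABRT in the method's setup code. */
static void
install_handlers(void)
{
        struct sigaction act;

        act.sa_handler = forward_signal;
        (void) sigemptyset(&act.sa_mask);
        act.sa_flags = 0;
        (void) sigaction(SIGTERM, &act, NULL);
        (void) sigaction(SIGABRT, &act, NULL);
}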


You can also define new resource properties in the RTR file, which are called extension properties, by using a set of property attributes that are provided by Sun Cluster. Resource Property Attributes lists the attributes for changing and defining resource properties. Extension property declarations follow the system-defined property declarations in the RTR file.

The first set of system-defined resource properties specifies timeout values for the callback methods.

...

# Resource property declarations appear as a list of bracketed
# entries after the resource type declarations. The property 
# name declaration must be the first attribute after the open
# curly bracket of a resource property entry.
#
# Set minimum and default for method timeouts.
{
        PROPERTY = Start_timeout;
        MIN=60;
        DEFAULT=300;
}

{
        PROPERTY = Stop_timeout;
        MIN=60;
        DEFAULT=300;
}
{
        PROPERTY = Validate_timeout;
        MIN=60;
        DEFAULT=300;
}
{
        PROPERTY = Update_timeout;
        MIN=60;
        DEFAULT=300;
}
{
        PROPERTY = Monitor_Start_timeout;
        MIN=60;
        DEFAULT=300;
}
{
        PROPERTY = Monitor_Stop_timeout;
        MIN=60;
        DEFAULT=300;
}
{
        PROPERTY = Monitor_Check_timeout;
        MIN=60;
        DEFAULT=300;
}

The name of the property (PROPERTY = value) must be the first attribute for each resource-property declaration. You can configure resource properties within limits that are defined by the property attributes in the RTR file. For example, the default value for each method timeout in the sample is 300 seconds. The cluster administrator can change this value. However, the minimum allowable value, specified by the MIN attribute, is 60 seconds. Resource Property Attributes contains a list of resource property attributes.

The next set of resource properties defines properties that have specific uses in the data service.

{
        PROPERTY = Failover_mode;
        DEFAULT=SOFT;
        TUNABLE = ANYTIME;
}
{
        PROPERTY = Thorough_Probe_Interval;
        MIN=1;
        MAX=3600;
        DEFAULT=60;
        TUNABLE = ANYTIME;
}

# The number of retries to be done within a certain period before concluding
# that the application cannot be successfully started on this node.
{
        PROPERTY = Retry_count;
        MAX=10;
        DEFAULT=2;
        TUNABLE = ANYTIME; 
}

# Set Retry_interval as a multiple of 60 since it is converted from seconds
# to minutes, rounding up. For example, a value of 50 (seconds)
# is converted to 1 minute. Use this property to time the number of
# retries (Retry_count).
{
        PROPERTY = Retry_interval;
        MAX=3600;
        DEFAULT=300;
        TUNABLE = ANYTIME;
}

{
        PROPERTY = Network_resources_used;
        TUNABLE = WHEN_DISABLED;
        DEFAULT = "";
}
{
        PROPERTY = Scalable;
        DEFAULT = FALSE;
        TUNABLE = AT_CREATION;
}
{
        PROPERTY = Load_balancing_policy;
        DEFAULT = LB_WEIGHTED;
        TUNABLE = AT_CREATION;
}
{
        PROPERTY = Load_balancing_weights;
        DEFAULT = "";
        TUNABLE = ANYTIME;
}
{
        PROPERTY = Port_list;
        TUNABLE = ANYTIME;
        DEFAULT = "";
}

These resource-property declarations include the TUNABLE attribute. This attribute limits the occasions on which the cluster administrator can change the value of the property with which this attribute is associated. For example, the value AT_CREATION means that the cluster administrator can only specify the value when the resource is created and cannot change the value later.

For most of these properties, you can accept the default values as generated by Agent Builder unless you have a reason to change them. Information about these properties follows. For additional information, see Resource Properties or the r_properties(5) man page.

Failover_mode

Indicates whether the RGM should relocate the resource group or abort the node in the case of a failure of a Start or Stop method.

Thorough_probe_interval, Retry_count, and Retry_interval

Used in the fault monitor. Tunable equals ANYTIME, so a cluster administrator can adjust them if the fault monitor is not functioning optimally.

Network_resources_used

A list of logical-hostname or shared-address resources on which this resource has a dependency. This list contains all network-address resources that appear in the properties Resource_dependencies, Resource_dependencies_weak, Resource_dependencies_restart, or Resource_dependencies_offline_restart.

The RGM automatically creates this property if the Scalable property is declared in the RTR file. If the Scalable property is not declared in the RTR file, Network_resources_used is unavailable unless it is explicitly declared in the RTR file.

If you do not assign a value to the Network_resources_used property, its value is updated automatically by the RGM, based on the setting of the resource-dependencies properties. You do not need to set this property directly. Instead, set the Resource_dependencies, Resource_dependencies_offline_restart, Resource_dependencies_restart, or Resource_dependencies_weak property.

To maintain compatibility with earlier releases of Sun Cluster software, you can still set the value of the Network_resources_used property directly. If you set the value of the Network_resources_used property directly, the value of the Network_resources_used property is no longer derived from the settings of the resource-dependencies properties. If you add a resource name to the Network_resources_used property, the resource name is automatically added to the Resource_dependencies property as well. The only way to remove that dependency is to remove it from the Network_resources_used property. If you are not sure whether a network-resource dependency was originally added to the Resource_dependencies property or to the Network_resources_used property, remove the dependency from both properties.

Scalable

Set to FALSE to indicate that this resource does not use the cluster networking (shared address) facility. If you set this property to FALSE, the resource type property Failover must be set to TRUE to indicate a failover service. See Transferring a Data Service to a Cluster and Implementing Callback Methods for additional information about how to use this property.

Load_balancing_policy and Load_balancing_weights

Agent Builder automatically declares these properties. However, these properties have no use in a failover resource type.

Port_list

Identifies the list of ports on which the application is listening. Agent Builder declares this property so that a cluster administrator can specify a list of ports when the cluster administrator configures the data service.

Declaring Extension Properties

Extension properties appear at the end of the sample RTR file.

# Extension Properties
#

# The cluster administrator must set the value of this property to point to the 
# directory that contains the configuration files used by the application.
# For this application, smpl, specify the path of the configuration file on
# PXFS (typically named.conf).
{
        PROPERTY = Confdir_list;
        EXTENSION;
        STRINGARRAY;
        TUNABLE = AT_CREATION;
        DESCRIPTION = "The Configuration Directory Path(s)";
}

# The following two properties control restart of the fault monitor.
{
        PROPERTY = Monitor_retry_count;
        EXTENSION;
        INT;
        DEFAULT = 4;
        TUNABLE = ANYTIME;
        DESCRIPTION = "Number of PMF restarts allowed for fault monitor.";
}
{
        PROPERTY = Monitor_retry_interval;
        EXTENSION;
        INT;
        DEFAULT = 2;
        TUNABLE = ANYTIME;
        DESCRIPTION = "Time window (minutes) for fault monitor restarts.";
}
# Time out value in seconds for the probe.
{
        PROPERTY = Probe_timeout;
        EXTENSION;
        INT;
        DEFAULT = 120;
        TUNABLE = ANYTIME;
        DESCRIPTION = "Time out value for the probe (seconds)";
}

# Child process monitoring level for PMF (-C option of pmfadm).
# Default of -1 means to not use the -C option of pmfadm.
# A value of 0 or greater indicates the desired level of child-process
# monitoring.
{
        PROPERTY = Child_mon_level;
        EXTENSION;
        INT;
        DEFAULT = -1;
        TUNABLE = ANYTIME;
        DESCRIPTION = "Child monitoring level for PMF";
}
# User added code -- BEGIN VVVVVVVVVVVV
# User added code -- END   ^^^^^^^^^^^^

Agent Builder creates the following extension properties, which are useful for most data services.

Confdir_list

Specifies the path to the application configuration directory, which is useful information for many applications. The cluster administrator can provide the location of this directory when the cluster administrator configures the data service.

Monitor_retry_count, Monitor_retry_interval, and Probe_timeout

Control the fault monitor itself, not the server daemon. Monitor_retry_count and Monitor_retry_interval govern restarts of the fault monitor, and Probe_timeout sets the timeout value, in seconds, for each probe that the fault monitor runs.

Child_mon_level

Sets the level of monitoring to be carried out by the PMF. See the pmfadm(1M) man page for more information.

You can create additional extension properties in the area that is delimited by the User added code comments.

Implementing Callback Methods

This section provides general information that pertains to implementing the callback methods.

Accessing Resource and Resource Group Property Information

Generally, callback methods require access to the properties of the resource. The RMAPI provides both shell commands and C functions that you can use in callback methods to access the system-defined and extension properties of resources. See the scha_resource_get(1HA) and scha_resource_get(3HA) man pages.

The DSDL provides a set of C functions (one function for each property) to access system-defined properties, and a function to access extension properties. See the scds_property_functions(3HA) and scds_get_ext_property(3HA) man pages.
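
For example, a method that uses the RMAPI might read the Retry_count property as in the following sketch. Error handling is abbreviated; the resource and resource group names are passed to each callback method on its command line.

#include <scha.h>

/* Read the Retry_count property of a resource (RMAPI sketch). */
int
get_retry_count(const char *rname, const char *rgname)
{
        scha_resource_t handle;
        int retry_count = -1;

        if (scha_resource_open(rname, rgname, &handle) != SCHA_ERR_NOERR)
                return (-1);
        if (scha_resource_get(handle, SCHA_RETRY_COUNT,
            &retry_count) != SCHA_ERR_NOERR)
                retry_count = -1;
        (void) scha_resource_close(handle);
        return (retry_count);
}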

You cannot use the property mechanism to store dynamic state information for a data service because no API functions are available for setting resource properties other than Status and Status_msg. Rather, you should store dynamic state information in global files.


Note –

The cluster administrator can set particular resource properties by using the clresource command or through a graphical administrative command or interface. However, do not call clresource from any callback method because clresource fails during cluster reconfiguration, that is, when the RGM calls the method.


Idempotence of Methods

In general, the RGM does not call a method more than once in succession on the same resource with the same arguments. However, if a Start method fails, the RGM can call a Stop method on a resource even though the resource was never started. Likewise, a resource daemon could die of its own accord and the RGM might still run its Stop method on it. The same scenarios apply to the Monitor_start and Monitor_stop methods.

For these reasons, you must build idempotence into your Stop and Monitor_stop methods. In other words, repeated calls to Stop or Monitor_stop on the same resource with the same arguments must achieve the same results as a single call.

One implication of idempotence is that Stop and Monitor_stop must return 0 (success) even if the resource or monitor is already stopped and no work is to be done.
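
A sketch of this logic follows. The two helper functions are hypothetical; a real implementation might query and stop the application through the PMF (for example, with the DSDL scds_pmf_* functions).

/* Idempotent Stop logic (sketch): repeated calls yield the same result. */
extern int service_is_running(void);   /* hypothetical helper */
extern int stop_service(void);         /* hypothetical helper */

int
stop_method(void)
{
        if (!service_is_running())
                return (0);             /* already stopped: report success */
        return (stop_service() == 0 ? 0 : 1);
}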


Note –

The Init, Fini, Boot, and Update methods must also be idempotent. A Start method need not be idempotent.


How Methods Are Invoked in Zones

If declared in the RTR file, the Global_zone resource type property indicates whether the methods of a resource type execute in the global zone. If the Global_zone property equals TRUE, methods execute in the global zone even if the resource group that contains the resource is configured to run in a non-global zone.

If the resource for which Global_zone equals TRUE is configured in a non-global zone, methods that are invoked in the global zone are invoked with the -Z zonename option. The zonename operand specifies the Solaris zone name of the non-global zone in which the resource is actually configured. The value of this operand is passed to the method.

If the resource is configured in the global zone, the -Z zonename option is not used and no non-global zone name is passed to the method.

The Global_zone resource type property is described in more detail in Resource Type Properties and in the rt_properties(5) man page.

Generic Data Service

A generic data service (GDS) is a mechanism for making simple applications highly available or scalable by plugging them into the Sun Cluster Resource Group Manager (RGM) framework. This mechanism does not require the coding of a data service, which is the typical approach for making an application highly available or scalable.

The GDS model relies on a precompiled resource type, SUNW.gds, to interact with the RGM framework. See Chapter 10, Generic Data Services for additional information.

Controlling an Application

Callback methods enable the RGM to take control of the underlying resource (that is, the application). For example, callback methods enable the RGM to take control of the underlying resource when a node joins or leaves the cluster.

Starting and Stopping a Resource

A resource type implementation requires, at a minimum, a Start method and a Stop method.

Using Start and Stop Methods

The RGM calls a resource type's method programs at correct times and on the correct nodes for bringing resource groups offline and online. For example, after the crash of a cluster node, the RGM moves any resource groups that are mastered by that node onto a new node. In this case, you must implement a Start method to provide the RGM with, among other things, a way of restarting each resource on the surviving host node.

A Start method must not return until the resource has been started and is available on the local node. Be certain that resource types that require a long initialization period have sufficiently long timeouts set on their Start methods. To ensure sufficient timeouts, set the default and minimum values for the Start_timeout property in the RTR file.

You must implement a Stop method for situations in which the RGM takes a resource group offline. For example, suppose a resource group is taken offline in ZoneA on Host1 and brought back online in ZoneB on Host2. While taking the resource group offline, the RGM calls the Stop method on resources in the resource group to stop all activity in ZoneA on Host1. After the Stop methods for all resources have completed in ZoneA on Host1, the RGM brings the resource group back online in ZoneB on Host2.

A Stop method must not return until the resource has completely stopped all its activity on the local node and has completely shut down. The safest implementation of a Stop method terminates all processes on the local node that are related to the resource. Resource types that require a long time to shut down need sufficiently long timeouts set on their Stop methods. Set the Stop_timeout property in the RTR file.

If an RGM method callback times out, the method's process tree is killed by a SIGABRT signal (not a SIGTERM signal). As a result, all members of the process group generate a core dump file in the /var/cluster/core directory or in a subdirectory of the /var/cluster/core directory on the node on which the method exceeded its timeout. This core dump file is generated to enable you to determine why your method exceeded its timeout.


Note –

Avoid writing data service methods that create a new process group. If your data service method must create a new process group, write a signal handler for the SIGTERM and SIGABRT signals. Also, ensure that your signal handler forwards the SIGTERM or SIGABRT signal to the child process group or groups before the signal handler terminates the process. Writing a signal handler for these signals increases the likelihood that all processes that are spawned by your method are correctly terminated.


Failure or timeout of a Stop method causes the resource group to enter an error state that requires the cluster administrator to intervene. To avoid this state, the Stop and Monitor_stop method implementations must attempt to recover from all possible error conditions. Ideally, these methods should exit with 0 (success) error status, having successfully stopped all activity of the resource and its monitor on the local node.

Deciding Which Start and Stop Methods to Use

This section provides some tips about when to use the Start and Stop methods as opposed to using the Prenet_start and Postnet_stop methods. You must have in-depth knowledge of both the client and the data service's client-server networking protocol to decide the correct methods to use.

Services that use network address resources might require that start or stop steps be performed in a particular order relative to the logical hostname address configuration. The optional callback methods Prenet_start and Postnet_stop enable a resource type implementation to perform special startup and shutdown operations before and after network addresses in the same resource group are configured to go up or down.

The RGM calls methods that plumb the network addresses (but do not configure network addresses to go up) before calling the data service's Prenet_start method. The RGM calls methods that unplumb the network addresses after calling the data service's Postnet_stop method.

    The sequence is as follows when the RGM takes a resource group online:

  1. Plumb network addresses.

  2. Call the data service's Prenet_start method (if any).

  3. Configure network addresses to go up.

  4. Call the data service's Start method (if any).

    The reverse happens when the RGM takes a resource group offline:

  1. Call the data service's Stop method (if any).

  2. Configure network addresses to go down.

  3. Call the data service's Postnet_stop method (if any).

  4. Unplumb network addresses.

When deciding whether to use the Start, Stop, Prenet_start, or Postnet_stop methods, first consider the server side. When bringing online a resource group that contains both data service application resources and network address resources, the RGM calls methods to configure the network addresses to go up before it calls the data service resource Start methods. Therefore, if a data service requires network addresses to be configured to go up at the time it starts, use the Start method to start the data service.

Likewise, when bringing offline a resource group that contains both data service resources and network address resources, the RGM calls methods to configure the network addresses to go down after it calls the data service resource Stop methods. Therefore, if a data service requires network addresses to be configured to go up at the time it stops, use the Stop method to stop the data service.

For example, to start or stop a data service, you might have to run the data service's administrative utilities or libraries. Sometimes, the data service has administrative utilities or libraries that use a client-server networking interface to perform the administration. That is, an administrative utility makes a call to the server daemon, so the network address might need to be up to use the administrative utility or library. Use the Start and Stop methods in this scenario.

If the data service requires that the network addresses be configured to go down at the time it starts and stops, use the Prenet_start and Postnet_stop methods to start and stop the data service. Also consider whether your client software responds differently, depending on whether the network address or the data service comes online first after a cluster reconfiguration (either scha_control() with the SCHA_GIVEOVER argument or a switchover with the clnode evacuate command). For example, the client implementation might perform the fewest retries, giving up soon after determining that the data service port is not available.

If the data service does not require the network address to be configured to go up when it starts, start the data service before the network interface is configured to go up. Starting the data service in this way ensures that the data service is able to respond immediately to client requests as soon as the network address has been configured to go up. As a result, clients are less likely to stop retrying. In this scenario, use the Prenet_start method rather than the Start method to start the data service.

If you use the Postnet_stop method, the data service resource is still up at the point the network address is configured to be down. Only after the network address is configured to go down is the Postnet_stop method run. As a result, the data service's TCP or UDP service port, or its RPC program number, always appears to be available to clients on the network, except when the network address is also not responding.


Note –

If you install an RPC service in the cluster, the service must not use the following program numbers: 100141, 100142, and 100248. These numbers are reserved for the Sun Cluster daemons rgmd_receptionist, fed, and pmfd, respectively. If the RPC service that you install uses one of these program numbers, change the program number of that RPC service.


The decision to use the Start and Stop methods as opposed to the Prenet_start and Postnet_stop methods, or to use both, must take into account the requirements and behavior of both the server and client.

Using the Optional Init, Fini, and Boot Methods

Three optional methods, Init, Fini, and Boot, enable the RGM to execute initialization and termination code on a resource.

Using the Init Method

The RGM executes the Init method to perform a one-time initialization of the resource when the resource becomes managed as a result of one of the following conditions:

Using the Fini Method

The RGM executes the Fini method to clean up after a resource when that resource is no longer managed by the RGM. The Fini method usually undoes any initializations that were performed by the Init method.

The RGM executes Fini on the node where the resource becomes unmanaged when the following situations arise:

A “node list” is either the resource group's Nodelist or the resource type's Installed_nodes list, depending on the setting of the resource type's Init_nodes property. You can set the Init_nodes property to RG_PRIMARIES or RT_INSTALLED_NODES. For most resource types, Init_nodes is set to RG_PRIMARIES, the default. In this case, both the Init and Fini methods are executed on the nodes that are specified in the resource group's Nodelist.

The type of initialization that the Init method performs defines the type of cleanup that the Fini method that you implement needs to perform, as follows:

Guidelines for Implementing a Fini Method

The Fini method that you implement needs to determine whether to perform only cleanup of node-specific configuration or cleanup of both node-specific and cluster-wide configuration.

When a resource becomes unmanaged on only a particular node, the Fini method can clean up local, node-specific configuration. However, the Fini method must not clean up global, cluster-wide configuration, because the resource remains managed on other nodes. If the resource becomes unmanaged cluster-wide, the Fini method can perform cleanup of both node-specific and global configuration. Your Fini method code can distinguish these two cases by determining whether the resource group's node list contains the local node on which your Fini method is executing.

If the local node appears in the resource group's node list, the resource is being deleted or is moving to an unmanaged state. The resource is no longer active on any node. In this case, your Fini method needs to clean up any node-specific configuration on the local node as well as cluster-wide configuration.

If the local node does not appear in the resource group's node list, your Fini method can clean up node-specific configuration on the local node. However, your Fini method must not clean up cluster-wide configuration. In this case, the resource remains active on other nodes.

You must also code the Fini method so that it is idempotent. In other words, even if the Fini method has cleaned up a resource during a previous execution, subsequent calls to the Fini method exit successfully.

Using the Boot Method

The RGM executes the Boot method on nodes that join the cluster, that is, that have just been booted or rebooted.

The Boot method normally performs the same initialization as Init. You must code the Boot method so that it is idempotent. In other words, even if the Boot method has initialized the resource during a previous execution, subsequent calls to the Boot method exit successfully.

Monitoring a Resource

Typically, you implement monitors to run periodic fault probes on resources to detect whether the probed resources are working correctly. If a fault probe fails, the monitor can attempt to restart locally or request failover of the affected resource group. The monitor requests the failover by calling the scha_control() or scha_control_zone() RMAPI function or the scds_fm_action() DSDL function.

You can also monitor the performance of a resource and tune or report performance. Writing a resource type-specific fault monitor is optional. Even if you choose not to write such a fault monitor, the resource type benefits from the basic monitoring of the cluster that Sun Cluster itself does. Sun Cluster detects failures of the host hardware, gross failures of the host's operating system, and failures of a host to be able to communicate on its public networks.

Although the RGM does not call a resource monitor directly, the RGM does provide for automatically starting monitors for resources. When bringing a resource offline, the RGM calls the Monitor_stop method to stop the resource's monitor on the local node before stopping the resource itself. When bringing a resource online, the RGM calls the Monitor_start method after the resource itself has been started.

The scha_control() or scha_control_zone() RMAPI function and the scds_fm_action() DSDL function (which calls scha_control()) enable resource monitors to request the failover of a resource group to a different node. As one of their sanity checks, scha_control() and scha_control_zone() call Monitor_check (if defined) to determine whether the requested node is reliable enough to master the resource group that contains the resource. If Monitor_check reports back that the node is not reliable, or the method times out, the RGM looks for a different node to honor the failover request. If Monitor_check fails on all nodes, the failover is canceled.

The resource monitor can set the Status and Status_msg properties to reflect the monitor's view of the resource state. Use the scha_resource_setstatus() or scha_resource_setstatus_zone() RMAPI function, the scha_resource_setstatus command, or the scds_fm_action() DSDL function to set these properties.
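
For example, a fault monitor that uses the RMAPI might report a degraded state as in the following sketch. The status message is an example only; status values are defined in scha_calls(3HA).

#include <scha.h>

/* Report a degraded resource status from a fault monitor (sketch). */
void
report_degraded(const char *rname, const char *rgname)
{
        (void) scha_resource_setstatus(rname, rgname,
            SCHA_RSSTATUS_DEGRADED, "probe latency above threshold");
}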


Note –

Although the Status and Status_msg properties are of particular use to a resource monitor, any program can set these properties.


See Defining a Fault Monitor for an example of a fault monitor that is implemented with the RMAPI. See SUNW.xfnts Fault Monitor for an example of a fault monitor that is implemented with the DSDL. See the Sun Cluster Data Services Planning and Administration Guide for Solaris OS for information about fault monitors that are built into data services that are supplied by Sun.

Implementing Monitors and Methods That Execute Exclusively in the Global Zone

Most resource types execute their methods on whatever node appears in the resource group's node list. A few resource types must execute all of their methods in the global zone, even when the resource group is configured in a non-global zone, that is, either a zone-cluster node or a global-cluster non-voting node. Executing in the global zone is necessary for resource types that manage system resources, such as network addresses or disks, which can be managed only from the global zone. These resource types are identified by setting the Global_zone property to TRUE in the resource type registration (RTR) file.


Caution – Caution –

Do not register a resource type for which the Global_zone property is set to TRUE unless the resource type comes from a known and trusted source. Resource types for which this property is set to TRUE circumvent zone isolation and present a risk.


A resource type that declares Global_zone=TRUE might also declare the Global_zone_override resource property. In this case, the value of the Global_zone_override property supersedes the value of the Global_zone property for that resource. The Global_zone_override property is described in more detail in Resource Properties and the r_properties(5) man page.

If the Global_zone resource type property is not set to TRUE, monitors and methods execute on whatever nodes are listed in the resource group's node list.

The scha_control() and scha_resource_setstatus() functions and the scha_control and scha_resource_setstatus commands operate implicitly on the node from which the function or command is run. If the Global_zone resource type property equals TRUE, these functions and commands need to be invoked differently when the resource is configured in a non-global zone.

When the resource is configured in a non-global zone, the value of the zonename operand is passed to the resource type method by the -Z option. If your method or monitor invokes one of these functions or commands without handling this option, it incorrectly operates on the global zone rather than on the non-global zone, included in the resource group's node list, in which the resource is actually configured.

To ensure that your method or monitor code handles these conditions correctly, check that it does the following:

If a resource for which the Global_zone resource type property equals TRUE invokes scha_cluster_get() with the ZONE_LOCAL query optag value, it returns the name of the global zone. In this case, the calling code must concatenate the string :zonename to the local node name to obtain the zone in which the resource is actually configured. The zonename is the same zone name that is passed down to the method in the -Z zonename command-line option. If there is no -Z option in the command line, the resource group is configured in the global zone and you do not need to concatenate a zone name to the node name.

Similarly, if the calling code queries, for example, the state of a resource in the non-global zone, it must invoke scha_resource_get() with the RESOURCE_STATE_NODE optag value rather than the RESOURCE_STATE optag value. In this case, the RESOURCE_STATE optag value queries in the global zone rather than in the non-global zone in which the resource is actually configured.

The DSDL functions inherently handle the -Z zonename option. Therefore, the scds_initialize() function obtains the relevant resource and resource group properties for the non-global zone in which a resource is actually configured. Other DSDL queries operate implicitly on that node.

You can use the DSDL function scds_get_zone_name() to query the name of the zone that is passed to the method in the -Z zonename command-line option. If no -Z zonename is passed, the scds_get_zone_name() function returns NULL.
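
The following sketch shows how a DSDL-based method might branch on the zone name. scds_initialize() parses the command line, including any -Z option, so the method needs no additional argument handling.

#include <rgm/libdsdev.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
        scds_handle_t handle;
        const char *zname;

        if (scds_initialize(&handle, argc, argv) != SCHA_ERR_NOERR)
                return (1);

        /* NULL means no -Z option: the resource is in the global zone. */
        zname = scds_get_zone_name(handle);
        if (zname != NULL)
                (void) printf("resource configured in zone %s\n", zname);
        else
                (void) printf("resource configured in the global zone\n");

        scds_close(&handle);
        return (0);
}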

Multiple Boot methods might run simultaneously in the global zone if both of the following conditions occur:

Adding Message Logging to a Resource

If you want to record status messages in the same log file as other cluster messages, use the convenience function scha_cluster_getlogfacility() to retrieve the facility number that is being used to log cluster messages.

Use this facility number with the regular Solaris syslog() function to write messages to the cluster log. You can also access the cluster log facility information through the generic scha_cluster_get() interface.
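
A sketch of this pattern follows. The syslog tag and message are examples only.

#include <scha.h>
#include <syslog.h>

/* Write a message to the cluster log facility (sketch). */
void
log_probe_failure(const char *rname)
{
        int facility;

        if (scha_cluster_getlogfacility(&facility) == SCHA_ERR_NOERR) {
                openlog("smpl_probe", LOG_CONS, facility);
                syslog(LOG_ERR, "probe of resource %s failed", rname);
                closelog();
        }
}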

Providing Process Management

The RMAPI and the DSDL provide process management facilities to implement resource monitors and resource control callbacks. The RMAPI defines the following facilities:

Process Monitor Facility (PMF): pmfadm and rpc.pmfd

Provides a means of monitoring processes and their descendants, and restarting processes if they die. The facility consists of the pmfadm command for starting and controlling monitored processes, and the rpc.pmfd daemon.

The DSDL provides a set of functions (preceded by the name scds_pmf_) to implement the PMF functionality. See PMF Functions for an overview of the DSDL PMF functionality and for a list of the individual functions.

The pmfadm(1M) and rpc.pmfd(1M) man pages describe this command and daemon in more detail.
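
For example, a Start method might launch the application daemon under PMF control with a command line such as the following. The nametag, retry values, and daemon path are examples only.

pmfadm -c smpl,svc -n 2 -t 60 /opt/SUNWsmpl/bin/smpl_daemon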

halockrun

A program for running a child program while holding a file lock. This command is convenient to use in shell scripts.

The halockrun(1M) man page describes this command in more detail.

hatimerun

A program for running a child program under timeout control. This command is convenient to use in shell scripts.

The DSDL provides the scds_hatimerun() function to implement the features of the hatimerun command.

The hatimerun(1M) man page describes this command in more detail.
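
For example, a method script might run a probe command under a 30-second timeout as follows. The probe path is an example only.

hatimerun -t 30 /opt/SUNWsmpl/bin/smpl_probe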

Providing Administrative Support for a Resource

Actions that cluster administrators perform on resources include setting and changing resource properties. The API defines the Validate and Update callback methods so that you can create code that hooks into these administrative actions.

The RGM calls the optional Validate method when a resource is created. The RGM also calls the Validate method when a cluster administrator updates the properties of the resource or its containing group. The RGM passes the property values for the resource and its resource group to the Validate method. The RGM calls Validate on the set of cluster nodes that is indicated by the Init_nodes property of the resource's type. See Resource Type Properties or the rt_properties(5) man page for information about Init_nodes. The RGM calls Validate before the creation or the update is applied. A failure exit code from the method on any node causes the creation or the update to fail.

The RGM calls Validate only when the cluster administrator changes resource or resource group properties, not when the RGM sets properties, or when a monitor sets the Status and Status_msg resource properties.

The RGM calls the optional Update method to notify a running resource that properties have been changed. The RGM runs Update after the cluster administrator succeeds in setting properties of a resource or its group. The RGM calls this method on nodes where the resource is online. This method can use the API access functions to read property values that might affect an active resource and adjust the running resource accordingly.

Implementing a Failover Resource

A failover resource group contains network addresses, such as the built-in resource types LogicalHostname and SharedAddress, and failover resources, such as the data service application resources for a failover data service. The network address resources, along with their dependent data service resources, move between cluster nodes when data services fail over or are switched over. The RGM provides a number of properties that support implementation of a failover resource.

In a global cluster, a failover resource group can fail over to a node on another Solaris host or to another node on the same Solaris host. A failover resource group cannot fail over in this way in a zone cluster. However, if the host itself fails, failing over to a node on the same host does not provide high availability. Nonetheless, you might find failing over a resource group to a node on the same host useful in testing or prototyping.

Set the Boolean Failover resource type property to TRUE to restrict the resource from being configured in a resource group that can be online on more than one node at a time. The default for this property is FALSE, so you must declare it as TRUE in the RTR file for a failover resource.

The Scalable resource property determines if the resource uses the cluster shared address facility. For a failover resource, set Scalable to FALSE because a failover resource does not use shared addresses.

The RG_mode resource group property enables the cluster administrator to identify a resource group as failover or scalable. If RG_mode is FAILOVER, the RGM sets the Maximum_primaries property of the group to 1. The RGM also restricts the resource group to being mastered by a single node. The RGM does not allow a resource whose Failover property is TRUE to be created in a resource group whose RG_mode is SCALABLE.

The Implicit_network_dependencies resource group property specifies that the RGM should enforce implicit strong dependencies of nonnetwork address resources on all network address resources (LogicalHostname and SharedAddress) within the group. As a result, the Start methods of the nonnetwork address (data service) resources in the group are not called until the network addresses in the group are configured to go up. The Implicit_network_dependencies property defaults to TRUE.
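
Taken together, the RTR file declarations for a failover service might look like the following fragment, consistent with the sample RTR file shown earlier in this chapter:

Failover = TRUE;
...
{
        PROPERTY = Scalable;
        DEFAULT = FALSE;
        TUNABLE = AT_CREATION;
}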

Implementing a Scalable Resource

A scalable resource can be online on more than one node simultaneously. You can configure a scalable resource (which uses network load-balancing) to run on a global-cluster non-voting node as well. However, you can run such a scalable resource in only one node per Solaris host. Scalable resources include data services such as Sun Cluster HA for Sun Java System Web Server (formerly Sun Cluster HA for Sun ONE Web Server) and Sun Cluster HA for Apache.

The RGM provides a number of properties that support the implementation of a scalable resource.

Set the Boolean Failover resource type property to FALSE to allow the resource to be configured in a resource group that can be online on more than one node at a time.

The Scalable resource property determines if the resource uses the cluster shared address facility. Set this property to TRUE because a scalable service uses a shared address resource to make the multiple instances of the scalable service appear as a single service to the client.

The RG_mode property enables the cluster administrator to identify a resource group as failover or scalable. If RG_mode is SCALABLE, the RGM allows Maximum_primaries to be assigned a value greater than 1. The resource group can be mastered by multiple nodes simultaneously. The RGM allows a resource whose Failover property is FALSE to be instantiated in a resource group whose RG_mode is SCALABLE.

The cluster administrator creates a scalable resource group to contain scalable service resources and a separate failover resource group to contain the shared address resources upon which the scalable resource depends.

The cluster administrator uses the RG_dependencies resource group property to specify the order in which resource groups are brought online and offline on a node. This ordering is important for a scalable service because the scalable resources and the shared address resources upon which they depend are located in different resource groups. A scalable data service requires that its network address (shared address) resources be configured to go up before the scalable data service is started. Therefore, the cluster administrator must set the RG_dependencies property (of the resource group that contains the scalable service) to include the resource group that contains the shared address resources.
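
For example, the cluster administrator might create the two resource groups as follows. The resource group names and the number of primaries are examples only; see the clresourcegroup(1CL) man page for details.

# clresourcegroup create sa-rg
# clresourcegroup create -p Maximum_primaries=4 -p Desired_primaries=4 \
-p RG_dependencies=sa-rg scalable-rg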

When you declare the Scalable property in the RTR file for a resource, the RGM automatically creates the following set of scalable properties for the resource.

Network_resources_used

Identifies the shared-address resources on which this resource has a dependency. This list contains all network-address resources that appear in the properties Resource_dependencies, Resource_dependencies_weak, Resource_dependencies_restart, or Resource_dependencies_offline_restart.

The RGM automatically creates this property if the Scalable property is declared in the RTR file. If the Scalable property is not declared in the RTR file, Network_resources_used is unavailable unless it is explicitly declared in the RTR file.

If you do not assign a value to the Network_resources_used property, its value is updated automatically by the RGM, based on the setting of the resource-dependencies properties. You do not need to set this property directly. Instead, set the Resource_dependencies, Resource_dependencies_offline_restart, Resource_dependencies_restart, or Resource_dependencies_weak property.

Load_balancing_policy

Specifies the load-balancing policy for the resource. You can explicitly set the policy in the RTR file (or allow the default LB_WEIGHTED). In either case, the cluster administrator can change the value when he or she creates the resource (unless you set Tunable for Load_balancing_policy to NONE or FALSE in the RTR file). These are the legal values that you can use:

LB_WEIGHTED

The load is distributed among various nodes according to the weights that are set in the Load_balancing_weights property.

LB_STICKY

A given client (identified by the client IP address) of the scalable service is always sent to the same node of the cluster.

LB_STICKY_WILD

A given client (identified by the client's IP address) that connects to an IP address of a wildcard sticky service is always sent to the same cluster node regardless of the port number to which it is coming.

For a scalable service with a Load_balancing_policy of LB_STICKY or LB_STICKY_WILD, changing Load_balancing_weights while the service is online can cause existing client affinities to be reset. In this case, a different node might service a subsequent client request, even if the client had been previously serviced by another node in the cluster.

Similarly, starting a new instance of the service on a cluster might reset existing client affinities.

Load_balancing_weights

Specifies the load to be sent to each node. The format is weight@node,weight@node. weight is an integer that reflects the relative portion of load that is distributed to the specified node. The fraction of load that is distributed to a node is the weight for this node divided by the sum of all weights of active instances. For example, 1@1,3@2 specifies that node 1 receives ¼ of the load and node 2 receives ¾ of the load.

Port_list

Identifies the ports on which the application is listening. This property defaults to the empty string. You can provide a list of ports in the RTR file. Otherwise, the cluster administrator must provide the actual list of ports when creating the resource.
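The following command shows how a cluster administrator might set several of the preceding properties when creating a scalable resource. This is a sketch only: the resource type SUNW.xyzapp, the resource names, and the property values are hypothetical, and the example assumes the shared address resource sa-rs shown earlier.

    # Create a scalable resource that depends on the shared address,
    # listens on port 80, and sends 1/4 of the load to node 1 and
    # 3/4 of the load to node 2.
    clresource create -g app-rg -t SUNW.xyzapp \
        -p Resource_dependencies=sa-rs \
        -p Load_balancing_policy=LB_WEIGHTED \
        -p Load_balancing_weights=1@1,3@2 \
        -p Port_list=80/tcp \
        app-rs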

You can create a data service that the cluster administrator can configure to be either scalable or failover. To do so, declare both the Failover resource type property and the Scalable resource property as FALSE in the data service's RTR file. Specify the Scalable property to be tunable at creation.
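For example, the RTR file for such a data service might contain declarations like the following fragment. The resource type name is hypothetical; Failover and Scalable are the standard properties that are discussed in this section.

    # Fragment of a hypothetical RTR file.
    Resource_type = "xyzapp";
    Failover = FALSE;
    # ... other resource type and paramtable declarations ...

    {
        PROPERTY = Scalable;
        DEFAULT = FALSE;
        TUNABLE = AT_CREATION;
    }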

The Failover property value FALSE allows the resource to be configured in a scalable resource group. The cluster administrator can then create a scalable service by changing the value of Scalable to TRUE when he or she creates the resource.

On the other hand, even though Failover is set to FALSE, the cluster administrator can configure the resource in a failover resource group to implement a failover service. In that case, the cluster administrator leaves the value of Scalable at FALSE. To support this scenario, provide a check on the Scalable property in the Validate method: if Scalable is FALSE, verify that the resource is configured into a failover resource group.
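A minimal sketch of such a check in the Korn shell follows. It assumes that the value of the Scalable property has already been parsed into $scalable from the arguments that the RGM passes to Validate, and that the resource group name has been parsed into $rg.

    # Reject a nonscalable resource that is placed in a scalable group.
    rg_mode=$(scha_resourcegroup_get -O RG_MODE -G "$rg")
    if [[ $scalable == FALSE && $rg_mode != FAILOVER ]]; then
        print -u2 "Validate: Scalable=FALSE requires a failover resource group"
        exit 1
    fi
    exit 0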

The Sun Cluster Concepts Guide for Solaris OS contains additional information about scalable resources.

Validation Checks for Scalable Services

Whenever you create or update a resource with the Scalable property set to TRUE, the RGM validates various resource properties. If the properties are not configured correctly, the RGM rejects the attempted update or creation.

The RGM performs the following checks:

Writing and Testing Data Services

This section describes how to write and test a data service. Topics that are covered include using TCP keep-alives to protect the server, testing highly available data services, and coordinating dependencies between resources.

Using TCP Keep-Alives to Protect the Server

On the server side, using TCP keep-alives protects the server from wasting system resources on a down (or network-partitioned) client. If these resources are not cleaned up, a server that stays up long enough accumulates wasted resources without bound as clients crash and reboot.

If the client-server communication uses a TCP stream, both the client and the server should enable the TCP keep-alive mechanism. This provision applies even in the non-HA, single-server case.
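For example, on Solaris you can inspect and tune the system-wide keep-alive interval with the ndd command, as in the following sketch. Note that the application must still enable the SO_KEEPALIVE socket option on each of its connections for keep-alives to be sent.

    # Display the current keep-alive interval in milliseconds
    # (the Solaris default is 7200000, that is, two hours).
    ndd -get /dev/tcp tcp_keepalive_interval

    # Lower the interval to 10 minutes.
    ndd -set /dev/tcp tcp_keepalive_interval 600000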

Other connection-oriented protocols might also have a keep-alive mechanism.

On the client side, using TCP keep-alives enables the client to be notified when a network address resource has failed over or switched over from one physical host to another physical host. That transfer of the network address resource breaks the TCP connection. However, unless the client has enabled the keep-alive, it does not necessarily learn of the connection break if the connection happens to be quiescent at the time.

For example, suppose the client is waiting for a response from the server to a long-running request, and the client's request message has already arrived at the server and has been acknowledged at the TCP layer. In this situation, the client's TCP module has no need to keep retransmitting the request. Also, the client application is blocked, waiting for a response to the request.

Where possible, in addition to using the TCP keep-alive mechanism, the client application should perform its own periodic keep-alive at the application level, because the TCP keep-alive mechanism does not cover all possible boundary cases. Using an application-level keep-alive typically requires that the client-server protocol support a null operation or at least an efficient read-only operation, such as a status operation.

Testing HA Data Services

This section provides suggestions about how to test a data service implementation in a highly available environment. The test cases are suggestions and are not exhaustive. You need access to a test-bed Sun Cluster configuration so that the testing work does not affect production machines.

Test your HA data service on global-cluster non-voting nodes on a single Solaris host rather than on all Solaris hosts in the cluster. After you determine that your data service works as expected in the global-cluster non-voting nodes, you can test it on the entire cluster. Even if it is ill-behaved, an HA data service that runs in a global-cluster non-voting node on a host probably will not perturb the operation of data services that are running in other nodes or on other hosts.

Test that your HA data service behaves correctly in all cases where a resource group is moved between physical hosts. These cases include system crashes and the use of the clnode command. Test that client machines continue to get service after these events.

Test the idempotence of the methods. For example, replace each method temporarily with a short shell script that calls the original method two or more times.
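For example, a temporary wrapper along the following lines calls the real Start method twice; both invocations should succeed. The path and the renamed method are hypothetical, and the same technique applies to the other methods.

    #!/usr/bin/ksh
    # Idempotence test wrapper: invoke the real Start method twice
    # with the same arguments that the RGM passed to this script.
    /opt/XYZapp/bin/xyzapp_start.orig "$@" || exit 1
    /opt/XYZapp/bin/xyzapp_start.orig "$@"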

Coordinating Dependencies Between Resources

Sometimes one client-server data service makes requests on another client-server data service while fulfilling a request for a client. For example, data service A depends on data service B if, for A to provide its service, B must provide its service. Sun Cluster provides for this requirement by permitting resource dependencies to be configured within a resource group. The dependencies affect the order in which Sun Cluster starts and stops data services. See the r_properties(5) man page.

If resources of your resource type depend on resources of another type, you need to instruct the cluster administrator to configure the resources and resource groups correctly. As an alternative, provide scripts or tools to correctly configure them.
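For example, a configuration script might express such a dependency with a command like the following. The resource names a-rs and b-rs are hypothetical.

    # Make resource a-rs depend on resource b-rs so that the RGM
    # starts b-rs before a-rs and stops b-rs only after a-rs stops.
    clresource set -p Resource_dependencies=b-rs a-rs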

Decide whether to use explicit resource dependencies, or to omit them and poll for the availability of other data services in your HA data service's code. If the dependent and depended-on resources can run on different nodes, configure them in separate resource groups. In this case, polling is required because configuring resource dependencies across groups is not possible.

Some data services store no data directly themselves. Instead, they depend on another back-end data service to store all their data. Such a data service translates all read and update requests into calls on the back-end data service. For example, consider a hypothetical client-server appointment calendar service that keeps all of its data in an SQL database, such as Oracle. The appointment calendar service uses its own client-server network protocol. For example, it might have defined its protocol using an RPC specification language, such as ONC RPC.

In the Sun Cluster environment, you can use HA-ORACLE to make the back-end Oracle database highly available. Then, you can write simple methods for starting and stopping the appointment calendar daemon. The cluster administrator registers the appointment calendar resource type with Sun Cluster.

If the HA-ORACLE resource is to run on a different node than the appointment calendar resource, the cluster administrator configures them into two separate resource groups. The cluster administrator then makes the appointment calendar resource dependent on the HA-ORACLE resource.

The cluster administrator makes the resources dependent by doing either of the following:

The calendar data service daemon, after it has been started, might poll while waiting for the Oracle database to become available. The calendar resource type's Start method usually returns success in this case. If the Start method blocks indefinitely, however, it moves its resource group into a busy state. This busy state prevents any further state changes, such as edits, failovers, or switchovers, on the resource group. If the calendar resource's Start method instead times out or exits with a nonzero status, the resource group might ping-pong between two or more nodes while the Oracle database remains unavailable.
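A minimal sketch of such a Start method in the Korn shell follows. The daemon path and the PMF nametag are hypothetical; the daemon itself is assumed to poll for the back-end database, so that the method can return promptly instead of blocking.

    #!/usr/bin/ksh
    # Start method sketch: launch the calendar daemon under the
    # Process Monitor Facility and return promptly. The daemon
    # performs its own polling for the back-end database.
    pmfadm -c xyzcalendar,svc /opt/XYZcalendar/bin/calendard
    exit $?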