Sun Cluster 2.2 API Developer's Guide

Chapter 1 Data Services API

This chapter introduces the Sun Cluster Data Services API and the concepts needed to make your data service applications highly available. There is also a section on the differences between the API implementation on Solstice HA 1.3 and Sun Cluster 2.x.

Overview

The Sun Cluster Data Service API consists of command-line utilities and a C-callable library. For convenience, all C-callable functionality is also available through the command-line utility programs. This enables you to code in a scripting language such as the Bourne shell (sh(1)), if you choose.

The API is defined by its man pages; both the command-line utilities and the C-callable library are documented there.

Interaction Between Data Services and the Sun Cluster Software

When a data service first registers with Sun Cluster, it registers a set of call-back programs, or methods. Sun Cluster makes call-backs to the data service's methods when certain key events occur in the cluster. The remainder of this section describes the three basic methods required to make any data service run in the Sun Cluster environment. The methods are start, stop, and abort.

After the failure of a host, Sun Cluster itself moves the logical host (both its diskset and its logical network IP addresses) to one of the surviving hosts. At this point, the data service's software must be restarted on the surviving host. Sun Cluster itself cannot restart a data service. Instead, it makes a call to the data service telling it to restart itself. This call is to the data service's start or start_net method.
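
For illustration, a start method is often a small Bourne shell script that simply starts the data service's daemons against the data kept on the logical host's diskset. The following is a minimal sketch; the daemon, paths, and pid file are hypothetical, and a real method also examines the arguments that Sun Cluster passes to it (see hareg(1M)).

#!/bin/sh
# ha_myservice_start -- hypothetical start method sketch.
# Starts the data service daemon against data kept on a file system
# in the logical host's diskset. All names and paths are
# illustrative only.

DATADIR=/myservice_fs/data                # file system on the diskset
PIDFILE=/var/run/myservice_hahost1.pid    # local pid file, illustrative

/opt/MYsvc/bin/myservice_daemon -d "$DATADIR" &
echo $! > "$PIDFILE"

exit 0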

The Sun Cluster haswitch(1M) command smoothly shuts down a logical host on one physical server in preparation for moving the logical host to another physical server. For Sun Cluster to coordinate this shut-down work with layered data services, each data service also registers a stop method. Sun Cluster calls the data service's stop method during scadmin switch or haswitch(1M) operations, and whenever Sun Cluster is stopped using scadmin stopnode. This stop method performs a smooth, safe shutdown of the data service. This occurs without waiting for clients on the network to completely finish their work, because waiting for a client could introduce an unbounded delay.
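
Correspondingly, a stop method can be as simple as terminating the daemons that the start method launched. The following is a minimal sketch using the same hypothetical names and paths as the start method sketch above.

#!/bin/sh
# ha_myservice_stop -- hypothetical stop method sketch.
# Performs a smooth shutdown by terminating the daemon that the
# start method launched; it does not wait for network clients.
# All names and paths are illustrative only.

PIDFILE=/var/run/myservice_hahost1.pid

if [ -f "$PIDFILE" ]; then
    kill -TERM `cat "$PIDFILE"` 2>/dev/null
    rm -f "$PIDFILE"
fi

exit 0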

Sun Cluster continuously monitors the health of the physical servers in the cluster. In some cases, Sun Cluster decides that a physical server is failing but that the server can still execute some "last wishes" cleanup code before Sun Cluster halts and reboots it. Each data service is given an opportunity to take part in this cleanup: Sun Cluster calls the abort_net method of each data service before halting the server. A data service that does not need the last-wishes cleanup opportunity can choose not to register an abort method.

Logical Host Configuration Issues

A data service is made highly available by exploiting the Sun Cluster concept of a logical host. The data service's data is placed on a logical host's diskset. A diskset is dual-ported, making the data accessible by a surviving server in the event that one server fails. For access by clients on the network, the data service advertises the logical host name as the server name that clients should use. A logical network IP address failover causes network clients of the data service to move with the logical host.

Data Service Use of Single or Multiple Logical Hosts

In Sun Cluster, there can be any number of logical hosts, so your data service implementation should not depend on any particular number of them. You must decide whether your data service will keep its data on just one logical host or on multiple logical hosts.

Generally, it is easier to design and implement a data service that uses just one logical host. In that case, all of the data service's data is placed only on that logical host's diskset. The data service needs just one set of daemons. A physical host runs the daemons for that data service only if the physical host currently masters the single logical host that the data service uses. When the physical host takes over mastery of the logical host, the data service's start method can start up the daemons. When the physical host is giving up mastery of the logical host, the data service's stop method can stop the daemons. In some cases, killing the daemons by sending a kill signal will suffice.

If you use multiple logical hosts, you must be able to split the data service's data into disjoint sets. The sets must be split so that no operation the data service needs to perform requires data from more than one set.

Consider Sun's HA-NFS product, which has multiple file systems with different data residing in each file system. For HA-NFS, each logical host has its own set of NFS(TM) file systems. Each physical host NFS shares the file systems that belong to the logical hosts that it masters. The sets of NFS file systems belonging to the two logical hosts are disjoint.

Using multiple logical hosts enables some rudimentary load balancing: when both physical hosts are up, each physical host can master one of the logical hosts and handle the data service's traffic for that logical host. Thus, both physical hosts do useful work in addition to acting as standbys for each other.

For some data services, splitting the data into disjoint collections such that no data service operation requires more than one collection is not feasible. The in.named example described in Chapter 2, "Sample Data Service," is such a data service. It has only one set of interdependent data files, and it would be difficult to split them into disjoint sets.


Note -

Configure the data service to use just one of the logical hosts, unless the data is easily split into disjoint collections and there is significant benefit to the rudimentary load balancing enabled by use of multiple logical hosts.


Required File System for Each Logical Host

Each Sun Cluster logical host has at least one diskset containing one or more file systems or raw partitions. Sun Cluster requires that each logical host have one special file system: it must exist and must have a particular name, that is, it must be mounted on a particular directory in the name space hierarchy. When Sun Cluster is first installed and configured, the scconf(1M) program assists the administrator in creating the required file system, thus following the required convention. Sun Cluster uses the term administrative file system to refer to this special required file system.

Required Administrative File System Conventions

If your data service uses the administrative file system, it must adhere to the conventions described in this section.

Per Data-Service Subdirectory

Each data service should place its administrative data in its own subdirectory of the administrative file system. For example, if the data service uses Solaris packages, then the subdirectory should have a name of the form /administrative_file_system/PkgName, where PkgName is the name of your data service package.

If the package mechanism is not used, then the data service should use the same name that it supplied as its data service name when it registered with Sun Cluster using hareg(1M). The hareg(1M) utility detects and prohibits naming conflicts. For example, if your implementation uses logical host "hahost1" and calls hareg(1M) with the name "hainnamed," you create the administrative subdirectory /hahost1/hainnamed.
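
An installation or start script can create this subdirectory the first time it runs. The following is a minimal sketch, using the logical host and data service names from the example above and assuming the administrative file system for hahost1 is mounted at /hahost1.

#!/bin/sh
# Create this data service's subdirectory of the administrative
# file system if it does not already exist.

ADMINDIR=/hahost1/hainnamed

if [ ! -d "$ADMINDIR" ]; then
    mkdir "$ADMINDIR" && chmod 755 "$ADMINDIR"
fi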

Small Amount of Data

The administrative file system is relatively small. Each data service should limit the amount of administrative data it keeps in the administrative file system to a few kilobytes. If a larger amount of administrative data is required, use the administrative file system to point at another directory in one of the logical host's file systems. The data service's user data should not be stored in the administrative file system, because for most data services, that data would be too large.
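
For example, rather than storing bulky state under the administrative file system, a data service can keep only a small pointer file there. The following is a sketch with hypothetical paths; the pointed-to directory is assumed to be in another file system on the same logical host's diskset.

#!/bin/sh
# Keep only a small pointer file in the administrative file system;
# the bulky administrative data lives in another file system on the
# same logical host's diskset. All paths are illustrative only.

echo "/myservice_fs/admin" > /hahost1/hainnamed/datadir

# A method later reads the pointer to locate the data:
ADMINDATA=`cat /hahost1/hainnamed/datadir`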

Data Service Requirements

The following sections present the requirements that a data service must meet to participate in the Sun Cluster Data Service API.

Client-Server Environment

Sun Cluster is designed for client-server networking environments. Sun Cluster cannot operate in time-sharing environments in which applications are run on a server that is accessed through telnet or rlogin. Such models typically have no inherent ability to recover from a server crash.

Crash Tolerance

The data service must be crash-tolerant. This means that the data service's daemon processes must be relatively stateless, in that they write all updates to disk synchronously.

When a physical host that masters a logical host crashes and a new physical host takes over, Sun Cluster calls the start method of each data service. The start method triggers any crash recovery of the on-disk data. For example, if the data service uses logging techniques, the start method should cause the data service to carry out crash recovery using the log.
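
For example, a start method might run the data service's own log-based recovery procedure before starting the daemons. The following fragment is a sketch; the recovery command and paths are hypothetical.

#!/bin/sh
# Fragment of a hypothetical start method: replay the data service's
# log on the logical host's diskset before starting the daemon.

DATADIR=/myservice_fs/data

/opt/MYsvc/bin/myservice_recover -d "$DATADIR" || exit 1
/opt/MYsvc/bin/myservice_daemon  -d "$DATADIR" &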

Multihosted Data

The logical host's disksets are multihosted so that when a physical host crashes, one of the surviving hosts can access the disk. For a data service to be highly available, its data must be highly available, and thus its data must reside on the logical host's diskset.

A data service might have command-line switches or configuration files pointing to the location of the data files. If the data service uses hard-wired path names, it might be possible to change the path name to a symbolic link that points to a file in the logical host's diskset, without changing the data service code. See Appendix A, Using Symbolic Links for Multihosted Data Placement, for a more detailed discussion about using symbolic links.
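
For example, if a data service is hard-wired to read its data from a fixed path, a symbolic link can redirect that path onto the logical host's diskset without changing the data service code. The following is a sketch with hypothetical paths.

# /etc/myservice.d is a path hard-wired into a hypothetical data
# service. Replace it with a symbolic link into a file system on the
# logical host's diskset so the data moves with the logical host.

mv /etc/myservice.d /etc/myservice.d.orig
ln -s /myservice_fs/etc /etc/myservice.d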

In the worst case, the data service's code must be modified to provide some mechanism for pointing to the actual data location. You can do this by implementing additional command-line switches.

Sun Cluster supports the use of UFS, VxFS, and raw partitions on the logical host's diskset. When the system administrator installs and configures Sun Cluster, he or she must specify which disk resources to use for UFS or VxFS file systems and which for raw partitions. Typically, raw partitions are used only by database servers and multimedia servers.

Host Names

You must determine whether the data service ever needs to know the host name of the server on which it is running. If so, the data service might need to be modified to use the host name of the logical host, rather than that of the physical host. Recall that the Sun Cluster concept of "logical host" involves having a physical host "impersonate" a logical host's host name and IP address.

Occasionally, in the client-server protocol for a data service, the server returns its own host name to the client as part of the contents of a message to the client. For such protocols, the client could be depending on this returned host name as the host name to use when contacting the server. For the returned host name to be usable after a takeover or switchover, the host name should be that of the logical host, not the physical host. In this case, you must modify the data service code to return the logical host name to the client.

Multihomed Hosts

The term multihomed host describes a host that is on more than one public network. Such a host has multiple host names and IP addresses: one host name/IP address pair for each network. Sun Cluster is designed to permit a host to appear on any number of networks, including just one (the non-multihomed case). Just as the physical host has multiple host name/IP address pairs, each logical host has multiple host name/IP address pairs, one for each public network. By convention, one of the host names in the set of pairs is the same as the name of the logical host itself. When Sun Cluster moves a logical host from one physical host to another, the complete set of host name/IP address pairs for that logical host is moved.

For each Sun Cluster logical host, the set of host name/IP address pairs is part of the Sun Cluster configuration data and is specified by the system administrator when Sun Cluster is first installed and configured. The Sun Cluster Data Service API contains facilities for querying the set of pairs, specifically, the names_on_subnets field described in the hads(3HA) and haget(1M) man pages.
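
For example, a method script can query these pairs with haget(1M). The option letters shown below are illustrative only; consult the haget(1M) man page for the exact syntax.

#!/bin/sh
# Query the host names configured for logical host hahost1 on the
# public subnets. Option letters are illustrative; see haget(1M).

NAMES=`haget -f names_on_subnets -h hahost1`

for NAME in $NAMES; do
    echo "logical host hahost1 is also known as: $NAME"
done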

Most off-the-shelf data service daemons that have been written for Solaris already handle multihomed hosts properly. Many data services do all their network communication by binding to the Solaris wildcard address INADDR_ANY. This automatically causes them to handle all the IP addresses for all the network interfaces. INADDR_ANY effectively binds to all IP addresses currently configured on the machine. A data service daemon that uses INADDR_ANY generally does not have to be changed to handle the Sun Cluster logical host's IP addresses.

Binding to INADDR_ANY Versus Binding to Specific IP Addresses

Even in the non-multihomed case, the Sun Cluster logical host concept allows the machine to have more than one IP address: one for the physical host itself and one additional IP address for each logical host it currently masters. When a machine becomes the master of a logical host, it dynamically acquires an additional IP address. When it gives up mastery of a logical host, it dynamically relinquishes that IP address.

Some data services cannot work properly using only INADDR_ANY. These data services must dynamically change the set of IP addresses to which they are bound as a logical host is mastered or unmastered. The starting and stopping methods provide the hooks for Sun Cluster to inform the data service that a logical host has appeared or disappeared. One strategy for such a data service to accomplish the rebinding is for its stop and start methods to kill and restart the data service's daemons.

During cluster reconfiguration, there is a relationship between the order in which data service methods are called and the time when the logical host's network addresses are configured by Sun Cluster. See the hareg(1M) man page for details about this relationship.

By the time the data service's stop method returns, the data service should have stopped using the logical host's IP addresses. Similarly, by the time the start_net method returns, the data service should have started to use the logical host's IP addresses. If the data service uses INADDR_ANY rather than binding to individual IP addresses, then there is no problem. If the data service's stop and start methods accomplish their work by killing and restarting the data service's daemons, then the data service stops and starts using the network addresses at the appropriate times.

Client Retry

To a network client, a takeover or switchover appears to be a crash of the logical host followed by a fast reboot. Ideally, the client application and the client-server protocol are structured to do some amount of retrying. If the application and protocol already handle the case of a single server crashing and rebooting, then they also will handle the case of the logical host being taken over or switched over. Some applications might elect to retry endlessly. More sophisticated applications notify the user that a long retry is in progress and allow the user to choose whether or not to continue.

Registering a Data Service

A data service is registered with Sun Cluster using the hareg(1M) program. Registration is persistent in that it survives across takeovers, switchovers, and reboots. Registration with Sun Cluster is usually done as the last step of installing and configuring a data service. Registration is a one-time event. A data service also can be unregistered with hareg(1M). See the hareg(1M) man page for details.
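
For example, an installation script might register the service and its methods with a single hareg(1M) invocation. The option letters, method keywords, and paths below are illustrative only; consult the hareg(1M) man page for the exact syntax.

#!/bin/sh
# Register a data service named hainnamed and its callback methods.
# Option letters, method keywords, and paths are illustrative;
# see hareg(1M).

hareg -r hainnamed \
      -m START_NET=/opt/MYsvc/bin/ha_myservice_start \
      -m STOP_NET=/opt/MYsvc/bin/ha_myservice_stop \
      -h hahost1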

In addition to the distinction between registered and unregistered, Sun Cluster has the concept of a data service being either "on" or "off." The purpose of the "on" and "off" states is to provide the system administrator with a mechanism for temporarily shutting down a data service without taking the more drastic step of unregistering it. For example, a system administrator can turn a data service "off" to do stand-alone backups. While the data service is "off," it is not providing service to clients, and the parameters that Sun Cluster passes to the data service's methods indicate that the data service should not be servicing data from any of the logical hosts.

When a data service is first registered with Sun Cluster, its initial state is "off." The hareg(1M) program is used to transition a data service between the "off" and "on" states. The work of moving a data service between states is accomplished through a reconfiguration as described in the hareg(1M) man page.

Before unregistering a data service, the system administrator first must transition the data service into the "off" state by calling hareg(1M).
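
For example, the sequence for retiring a data service might look like the following. The option letters are illustrative only; consult the hareg(1M) man page for the exact syntax.

#!/bin/sh
# Take a data service out of service: turn it "off", then
# unregister it. Option letters are illustrative; see hareg(1M).

hareg -n hainnamed      # transition hainnamed to the "off" state
hareg -u hainnamed      # unregister hainnamed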

Differences Between the Solstice HA 1.3 and Sun Cluster 2.x API

In Solstice HA, the hareg(1M) man page defined an explicit reconfiguration sequence; for example, stop methods are called before start methods are called, and when a stop method is called, a start method is also eventually called.

However, the Sun Cluster 2.x implementation deviates from the Solstice HA model. Most notably, you should not rely too heavily on the overall reconfiguration sequence, because in Sun Cluster 2.x those ordering guarantees do not always hold.

Working With the Differences Between Solstice HA 1.3 and Sun Cluster 2.x

This section describes some ways in which you can adjust your applications to deal with the differences in the API.

The API definition, and both of its implementations, ultimately require that a method callback be "idempotent," that is, that it can be called multiple times with the same effect as a single call. Pragmatically, a called-back method must be prepared for the scenario in which it has no real work to do because the work was already accomplished during a previous call to the method. Concretely, this means that the method needs to contain logic that determines whether there is any work to do for the logical host(s) that are moving. If not, the method should simply return. An example of this is shown in Chapter 2, "Sample Data Service".
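
A minimal sketch of such a check follows; the daemon, pid file, and paths are hypothetical.

#!/bin/sh
# Idempotent start method fragment: if the daemon is already running,
# a previous call already did the work, so return at once. All names
# and paths are illustrative only.

PIDFILE=/var/run/myservice_hahost1.pid

if [ -f "$PIDFILE" ] && kill -0 `cat "$PIDFILE"` 2>/dev/null; then
    exit 0          # nothing to do
fi

/opt/MYsvc/bin/myservice_daemon -d /myservice_fs/data &
echo $! > "$PIDFILE"
exit 0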

These differences in the API implementations should have minimal impact given that a data service's called-back methods must deal with the basic idempotence issue anyway.