CHAPTER 8

Debugging Applications in the Foundation Services

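For information about how to report and check errors caused by applications and how to debug applications, see the following sections:

  • Reporting Application Errors

  • Reading Error Information for Debugging

  • Stopping the Daemon Monitor for Debugging

  • Broken Pipe Error Messages

  • Return Values of the CMM API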

For debugging purposes, configure remote IP access to all nodes in the cluster. For more information, see “Cluster Addressing and Networking” in Netra High Availability Suite 3.0 1/08 Foundation Services Overview.

You can use standard Solaris Operating System commands in the Foundation Services environment. To debug applications that interact with the Foundation Services nodes, use the debugging software provided with the Sun Studio software.


Reporting Application Errors

Configure applications to report errors and their causes. This information can be used during troubleshooting to reduce the risk that similar errors recur. To facilitate recovery from an error, you can provide the following information:

The standard return values for CMM API errors are summarized in TABLE 8-1.
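As a minimal sketch of such reporting, the following C fragment formats an error record (the failed operation, the return code, and the cause) and sends it to the system log through the standard syslog interface. The helper names format_error_report and report_error and the message layout are assumptions made for illustration. They are not part of the Foundation Services API.

```c
#include <stdio.h>
#include <syslog.h>

/* Hypothetical helper: writes "<operation> failed: code=<n> cause=<text>"
 * into buf. Returns the value of snprintf, that is, the length the full
 * message needs (excluding the terminating null byte). */
int format_error_report(char *buf, size_t len, const char *operation,
                        int code, const char *cause)
{
    return snprintf(buf, len, "%s failed: code=%d cause=%s",
                    operation, code, cause ? cause : "unknown");
}

/* Hypothetical helper: sends the formatted report to the system log
 * at the LOG_ERR priority. */
void report_error(const char *operation, int code, const char *cause)
{
    char msg[256];

    format_error_report(msg, sizeof msg, operation, code, cause);
    syslog(LOG_ERR, "%s", msg);
}
```

An application could call report_error with the name of the failing function and the value it returned, so that the record appears in the system log files alongside the Foundation Services messages.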


Reading Error Information for Debugging

In the Foundation Services, standard error and alert messages are sent to system log files. In error scenarios, you can refer to the system log files to determine the history of a process. Critical errors are written to the console in addition to being logged in the system log files.

Errors can cause notifications to be sent, but notifications are events, not errors in themselves. For information on notifications, see EXAMPLE 6-1.

The NMA enables you to receive information on notifications. Statistics are available to diagnose the cause of errors received. See the Netra High Availability Suite 3.0 1/08 Foundation Services NMA Programming Guide.



Note - The NMA is not available on the Linux platform. It is supported only on the Solaris OS.



For information about using and configuring system log files, see the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide.


Stopping the Daemon Monitor for Debugging

You cannot debug critical services, such as the CMM or Reliable NFS, on a running cluster. Debugging would interrupt the regular messages that these services send between nodes. Debugging tools, such as the truss command, cannot be used on daemons while they are being monitored by the Daemon Monitor.

Before debugging a Foundation Services daemon or a monitored Solaris daemon, stop the Daemon Monitor from monitoring the daemon that you want to debug. When you have finished debugging, restart the Daemon Monitor.

For information about how to stop and restart the Daemon Monitor, see the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide. For a list of monitored daemons, see the nhpmd(1M) man page or, for the Linux OS, the nhpmd(8) man page.


Broken Pipe Error Messages

If an application running on your cluster terminates suddenly, the CMM notification pipes that the application opened remain on the nhcmmd side. You can be left with a broken pipe from the CMM to the dead application. If the CMM later sends a notification to this dead application, the CMM detects that the application is dead and closes the broken pipe. In addition, the CMM frequently checks whether a client application is dead and, if necessary, closes the associated pipes.

If many of your applications die suddenly, without notifying the CMM, the following can happen:

If one of your applications has died suddenly, you receive a system log message such as this:


Dec 23 09:56:07 machine_name CMM[839]: S-CMM notif to /var/run/CMM_884_00000000 fails: Broken pipe

The CMM detects the problem and closes the notification pipe. For further information on accessing system log files, see “Accessing and Maintaining System Log Messages” in the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide and the syslog.conf(4) man page.
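The "Broken pipe" text in such a message corresponds to the standard EPIPE error that a writer receives when the reading end of a pipe is gone. The following C sketch reproduces that condition in isolation. It uses an anonymous pipe and a hypothetical function name for brevity, and it illustrates general POSIX behavior only. It is not a description of how nhcmmd is implemented.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Returns 1 if writing to a pipe that has no reader fails with EPIPE,
 * 0 if the write behaves differently, and -1 if no pipe could be created. */
int write_to_dead_reader_gives_epipe(void)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    /* Ignore SIGPIPE so that write() reports an error instead of the
     * signal terminating the process. */
    void (*old)(int) = signal(SIGPIPE, SIG_IGN);

    close(fds[0]);                      /* the "reader" disappears */
    ssize_t n = write(fds[1], "x", 1);  /* the writer tries to notify it */
    int saved_errno = errno;

    close(fds[1]);                      /* the writer cleans up its end */
    if (old != SIG_ERR)
        signal(SIGPIPE, old);           /* restore the previous handler */

    return (n == -1 && saved_errno == EPIPE) ? 1 : 0;
}
```

An application that exits cleanly and releases its CMM resources before terminating avoids leaving the writer in this state.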


Return Values of the CMM API

The CMM API provides extensive return values for errors and successful function calls. They are listed in TABLE 8-1.


TABLE 8-1   Common Return Values of the CMM API  
Return Value: CMM_OK
Result: The function call succeeded.
Possible Responses: None required.

Return Value: CMM_EAGAIN
Result: Returned information is based on a cluster view that has not been updated by the master node for more than 10 seconds.
Possible Responses: Retry the function call.

Return Value: CMM_EBADF
Result: An identifier or descriptor that corresponds to a file descriptor is invalid. The connection to the CMM is no longer valid. Perhaps the CMM is dead.
Possible Responses: Verify that data in your program is not corrupted. Call the cmm_cmc_register and the cmm_notify_getfd functions to fetch a new connection.

Return Value: CMM_EBUSY
Result:
  • For all functions: The CMM API server is temporarily out of resources to respond to the requested operation.
  • For cmm_cmc_unregister: An attempt to unregister a callback, that is, a call to the cmm_cmc_unregister function, failed because the caller's callback function is active.
See the cmm_cmc_unregister(3CMM) man page.
Possible Responses: Wait, then retry the function call. You can decide the length of wait, based on the application's characteristics.

Return Value: CMM_ECANCELED
Result: A switchover operation was cancelled. For example, when trying to demote the master, no vice-master can take over the master role.
Possible Responses: Continue.

Return Value: CMM_ECONN
Result: The local CMM API process is unreachable.
Possible Responses: Check that the process is currently running. Perhaps it is not running yet. Retry the function call.

Return Value: CMM_EEXIST
Result: Only one function can be registered at a time. An attempt to call the cmm_cmc_register function when a callback is already registered returns this message. The calling process has already registered a callback.
Possible Responses: Verify that the existing function is required for the purpose of your program.

Return Value: CMM_EINVAL
Result: A function parameter has an invalid value. For example, the nodeid is not a master-eligible node.
Possible Responses: Ensure that the type of each parameter matches the type in the function prototype. Cast variables to the expected type if necessary and verify that the area of memory that stores the parameter is valid.

Return Value: CMM_ENOCLUSTER
Result: One of the following has occurred:
  • The local node is not configured in an active cluster. This occurs, for example, when the cluster election is in progress.
  • The local node has been removed from the cluster node table on the master node. For more information, see the cluster_nodes_table(4) man page.
  • There is more than one master node.
  • The master node has been disqualified and no vice-master node has taken over the master role.
  • A failover has been triggered by the disqualification of the master node. During the failover, there is a brief time when there is no master node. The CMM_ENOCLUSTER error was returned during this time.
Possible Responses: Any combination of the following:
  • Add an entry for the node to the cluster node table.
  • Requalify the node.
  • Assign only one master.

Return Value: CMM_ENOENT
Result: An attempted operation on an item failed because the item does not exist. For example, when calling the cmm_cmc_unregister function, no callback has been registered.
Possible Responses: Not critical. Any combination of the following:
  • Verify that the area of memory that stores the item is valid.
  • If you want to delete the item, continue.

Return Value: CMM_ENOMSG
Result: An attempt to dispatch an event failed because there are no events to be dispatched.
Possible Responses: Continue.

Return Value: CMM_ENOTSUP
Result: The operation could not be correctly executed. This error can be the result of a system problem such as a file that cannot be created or a problem with Remote Procedure Call (RPC) services.
Possible Responses: Examine the system log files.

Return Value: CMM_EPERM
Result: The call tried to execute on a node other than the master node, but it can execute only on the master node. For more information, see the cmm_mastership_release(3CMM), cmm_member_setqualif(3CMM), and cmm_member_seizequalif(3CMM) man pages.
Possible Responses: Execute the function only on the master node.

Return Value: CMM_ERANGE
Result: The number of cells in the table is smaller than the number of nodes in the cluster. Returned by the cmm_member_getall function. See the cmm_member_getall(3CMM) man page.
Possible Responses: Add an entry in the table for each potential peer node.

Return Value: CMM_ESRCH
Result:
  • Using the cmm_member_getinfo function to obtain information about a node that is either not in the local cluster node table, or is in the local cluster node table but currently has the CMM_OUT_OF_CLUSTER role.
  • Using the cmm_potential_getinfo function to obtain information about a node that is not in the local cluster node table.
  • Using the cmm_vicemaster_getinfo function while the cluster has no vice-master.
Possible Responses: Any combination of the following:
  • Examine why the master-eligible node is down or isolated.
  • Add an entry for this node to the cluster node table. See the cluster_nodes_table(4) man page.
  • Change the node's role to master or vice-master.

Return Value: CMM_ETIMEDOUT
Result: No response even when an operation is retried, until the delay has expired. The function call was timed out.
Possible Responses: Any combination of the following:
  • Retry the function call.
  • Reduce the load on the system.
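
The responses in TABLE 8-1 can be grouped into a few broad actions, such as retrying, reconnecting, or continuing. The following C sketch shows one possible way to encode that grouping and to retry a call with a bounded wait. So that the sketch compiles on its own, it declares a stand-in type, sketch_error_t, with arbitrary numeric values. A real program would use the error type and constants from the CMM API header instead. The names action_t, suggested_action, call_with_retry, and busy_n_times_then_ok are hypothetical, and the grouping is one reading of the table, not a rule defined by the CMM API.

```c
#include <unistd.h>

/* Stand-in for the CMM API error type. The constant names come from
 * TABLE 8-1; the numeric values here are arbitrary (assumption). */
typedef enum {
    CMM_OK, CMM_EAGAIN, CMM_EBADF, CMM_EBUSY, CMM_ECANCELED, CMM_ECONN,
    CMM_EEXIST, CMM_EINVAL, CMM_ENOCLUSTER, CMM_ENOENT, CMM_ENOMSG,
    CMM_ENOTSUP, CMM_EPERM, CMM_ERANGE, CMM_ESRCH, CMM_ETIMEDOUT
} sketch_error_t;

/* Hypothetical grouping of the "Possible Responses" column. */
typedef enum {
    ACT_NONE, ACT_RETRY, ACT_RECONNECT, ACT_CONTINUE, ACT_INVESTIGATE
} action_t;

action_t suggested_action(sketch_error_t err)
{
    switch (err) {
    case CMM_OK:
        return ACT_NONE;
    case CMM_EAGAIN: case CMM_EBUSY: case CMM_ECONN: case CMM_ETIMEDOUT:
        return ACT_RETRY;          /* the table suggests retrying the call */
    case CMM_EBADF:
        return ACT_RECONNECT;      /* register again to get a new connection */
    case CMM_ECANCELED: case CMM_ENOMSG:
        return ACT_CONTINUE;
    default:
        return ACT_INVESTIGATE;    /* check parameters, tables, roles, logs */
    }
}

/* Hypothetical wrapper type: the caller wraps one CMM call in a function. */
typedef sketch_error_t (*wrapped_call_t)(void *arg);

/* Calls the wrapped function up to max_attempts times, waiting
 * wait_seconds between attempts, while the result is in the retry group.
 * Returns the last result, or CMM_EINVAL if max_attempts is not positive. */
sketch_error_t call_with_retry(wrapped_call_t call, void *arg,
                               int max_attempts, unsigned wait_seconds)
{
    sketch_error_t err = CMM_EINVAL;

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        err = call(arg);
        if (suggested_action(err) != ACT_RETRY)
            break;
        if (attempt + 1 < max_attempts)
            sleep(wait_seconds);
    }
    return err;
}

/* Demonstration stub: reports CMM_EBUSY *arg times, then CMM_OK. */
sketch_error_t busy_n_times_then_ok(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0) {
        (*remaining)--;
        return CMM_EBUSY;
    }
    return CMM_OK;
}
```

For example, passing busy_n_times_then_ok with a counter of 2 and a limit of 5 attempts returns CMM_OK on the third call. The length of the wait is left to the caller because, as the CMM_EBUSY entry notes, it depends on the application's characteristics.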