APPENDIX A

Watchdog Timer Application Mode

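This appendix describes the watchdog timer application mode on the server. It contains the following sections to help you configure and use the watchdog timer: Watchdog Timer Application Mode, Watchdog Timer Limitations, Using the ntwdt Driver, Understanding the User API, Using the Watchdog Timer, and Watchdog Timer Error Messages.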



Note - Once the application watchdog timer is in use, you must reboot the Oracle Solaris OS in order to return to the default (nonprogrammable) watchdog timer and default LED behavior (no Alarm3).



Watchdog Timer Application Mode

The watchdog mechanism detects a system hang, or an application hang or crash, should they occur. The watchdog is a timer that is continually reset by a user application as long as the operating system and user application are running.

When the application is rearming the application watchdog, an expiration can be caused by:

When the system watchdog is running, a system hang, or more specifically, the hang of the clock interrupt handler, causes an expiration.

The system watchdog mode is the default. If the application watchdog is not initialized, then the system watchdog mode is used.

The application mode enables you to:

The setupsc command, an existing command for the ALOM CMT compatibility CLI (in ILOM), can be used to configure the recovery for the system watchdog only:

sc> setupsc

The recovery configuration for the application watchdog is set using input/output control codes (IOCTLs) that are issued to the ntwdt driver.


Watchdog Timer Limitations

The limitations of the watchdog timer mode include:



Note - The system controller displays a console message as a reminder that the watchdog, from the system controller’s perspective, is disabled.


However, when you re-enter the Oracle Solaris OS, the watchdog timer is still enabled from the Oracle Solaris OS perspective. To have both the system controller and the Oracle Solaris OS view the same watchdog state, you must use the watchdog application to either enable or disable the watchdog.

You can execute the following command to locate the system boards that contain kernel (permanent) memory:

# cfgadm -lav | grep -i permanent

However, domain booting is not monitored for:

In the case of a recovery of a hung or failed domain, a boot failure is not detected and no recovery attempts are made.


Using the ntwdt Driver

To enable and control the watchdog’s application mode, you must program the watchdog system using the LOMIOCDOGxxx IOCTLs, described in Understanding the User API.

If the ntwdt driver, as opposed to the system controller, initiates a reset of the Oracle Solaris OS on application watchdog expiration, the value of the following property in the ntwdt driver’s configuration file (ntwdt.conf) is used:

ntwdt-boottimeout="600";

In case of a panic, or an expiration of the application watchdog, the ntwdt driver reprograms the watchdog time-out to the value specified in the property.

Assign a value representing a duration that is longer than the time it takes to reboot and perform a crash dump. If the specified value is not large enough, the system controller resets the host if reset is enabled. Note that this reset by the system controller occurs only once.
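The sizing rule above can be expressed as a simple check. The following is a minimal sketch only: the function name is hypothetical, and the reboot and crash dump durations are values you would measure on your own system, not figures supplied by the driver.

```c
/*
 * Hypothetical helper: returns nonzero when the configured
 * ntwdt-boottimeout value is strictly longer than the time needed
 * to reboot and perform a crash dump (all values in seconds).
 */
int boottimeout_is_sufficient(unsigned int boottimeout_secs,
    unsigned int reboot_secs, unsigned int crash_dump_secs)
{
	return boottimeout_secs > reboot_secs + crash_dump_secs;
}
```

For example, with the value of 600 shown above, a system that is assumed to need 240 seconds to reboot and 300 seconds to write a crash dump passes the check, while one that needs 360 seconds to reboot does not.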


Understanding the User API

The ntwdt driver provides an application programming interface by using IOCTLs. You must open the /dev/ntwdt device node before issuing the watchdog IOCTLs.



Note - Only a single open() of /dev/ntwdt is allowed at a time. A second open() fails with the following error: EAGAIN - The driver is busy, try again.
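An application can check the result of open() and report the busy condition separately from other failures. The following is a minimal sketch; the helper name open_watchdog() is hypothetical, and the open flags follow the sample program later in this appendix with an explicit read-only access mode added.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: open the watchdog device node and explain
 * any failure on stderr. Returns the file descriptor, or -1.
 */
int open_watchdog(const char *path)
{
	int fd = open(path, O_RDONLY | O_EXCL);

	if (fd < 0) {
		if (errno == EAGAIN)
			fprintf(stderr, "%s: already open elsewhere\n", path);
		else
			fprintf(stderr, "%s: %s\n", path, strerror(errno));
	}
	return fd;
}
```

A caller would pass "/dev/ntwdt" and issue the watchdog IOCTLs only when the returned descriptor is not negative.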


The watchdog IOCTLs described in this appendix are LOMIOCDOGTIME, LOMIOCDOGCTL, LOMIOCDOGPAT, and LOMIOCDOGSTATE.


Using the Watchdog Timer

Setting the Timeout Period

The LOMIOCDOGTIME IOCTL sets the timeout period of the watchdog. This IOCTL programs the watchdog hardware with the time specified in this IOCTL. You must set the timeout period (LOMIOCDOGTIME) before attempting to enable the watchdog timer (LOMIOCDOGCTL).

The argument is a pointer to an unsigned integer. This integer holds the new timeout period for the watchdog in multiples of 1 second. You can specify any timeout period in the range of 1 second to 180 minutes.

If the watchdog function is enabled, the timeout period is immediately reset so that the new value can take effect. The IOCTL fails with EINVAL if the timeout period is less than 1 second or longer than 180 minutes.



Note - LOMIOCDOGTIME is not intended for general-purpose use. Setting the watchdog timeout too low might cause the system to receive a hardware reset if the watchdog and reset functions are enabled. If the timeout is set very low, the user application must run at a higher priority (for example, as a real-time thread) and must rearm the watchdog more often to avoid an unintentional expiration.
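An application can validate a timeout against the documented range before issuing LOMIOCDOGTIME. The following is a minimal sketch; the macro and function names are hypothetical, and the driver remains the authority on which values it accepts.

```c
#include <errno.h>

#define WDT_MIN_TIMEOUT_SECS	1U		/* 1 second */
#define WDT_MAX_TIMEOUT_SECS	(180U * 60U)	/* 180 minutes */

/*
 * Hypothetical pre-check that mirrors the documented range.
 * Returns 0 if the value is in range, or EINVAL otherwise.
 */
int check_watchdog_timeout(unsigned int secs)
{
	if (secs < WDT_MIN_TIMEOUT_SECS || secs > WDT_MAX_TIMEOUT_SECS)
		return EINVAL;
	return 0;
}
```

The upper bound of 180 minutes corresponds to 10800 seconds, because the timeout is specified in multiples of 1 second.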


Enabling or Disabling the Watchdog

The LOMIOCDOGCTL IOCTL enables or disables the watchdog, and it enables or disables the reset capability. See Finding and Defining Data Structures for the correct values for the watchdog timer.

The argument is a pointer to the lom_dogctl_t structure. This structure is described in greater detail in Finding and Defining Data Structures.

Use the reset_enable member to enable or disable the system reset function. Use the dog_enable member to enable or disable the watchdog function. The IOCTL fails with EINVAL if the watchdog is disabled but reset is enabled.



Note - If LOMIOCDOGTIME has not been issued to set up the timeout period prior to this IOCTL, the watchdog is not enabled in the hardware.
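The rule for combining the two members can be checked before LOMIOCDOGCTL is issued. The following is a minimal sketch; the function name is hypothetical, and it takes the two flag values directly so that it does not depend on lom_io.h.

```c
#include <errno.h>

/*
 * Hypothetical pre-check for the dog_enable and reset_enable values.
 * Reset cannot be enabled while the watchdog is disabled.
 * Returns 0 for an accepted combination, or EINVAL otherwise.
 */
int check_dog_flags(int dog_enable, int reset_enable)
{
	if (!dog_enable && reset_enable)
		return EINVAL;
	return 0;
}
```

Of the four possible combinations, only a disabled watchdog with reset enabled is rejected.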


Rearming the Watchdog

The LOMIOCDOGPAT IOCTL rearms, or pats, the watchdog so that the watchdog starts ticking again from the beginning; that is, from the value specified by LOMIOCDOGTIME. This IOCTL requires no arguments. If the watchdog is enabled, this IOCTL must be issued at regular intervals that are shorter than the watchdog timeout, or the watchdog expires.
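One way to choose a rearm interval is to derive it from the timeout. The following is a minimal sketch of one possible policy; the function name is hypothetical, and the choice of one third of the timeout is an arbitrary safety margin for illustration, not a value specified by the driver.

```c
/*
 * Hypothetical policy: rearm the watchdog after one third of the
 * timeout period has elapsed. The result is in milliseconds so that
 * a 1-second timeout still yields an interval shorter than the timeout.
 */
unsigned int pat_interval_ms(unsigned int timeout_secs)
{
	return (timeout_secs * 1000U) / 3U;
}
```

With a 30-second timeout this policy rearms every 10 seconds, which leaves room for two missed rearms before the watchdog expires.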

Obtaining the State of the Watchdog Timer

The LOMIOCDOGSTATE IOCTL gets the state of the watchdog and reset functions, and retrieves the current timeout period for the watchdog. If LOMIOCDOGTIME was never issued to set up the timeout period prior to this IOCTL, the watchdog is not enabled in the hardware.

The argument is a pointer to the lom_dogstate_t structure, which is described in greater detail in Finding and Defining Data Structures. The structure members hold the current states of the watchdog and reset circuitry and the current watchdog timeout period. This timeout period is not the time remaining before the watchdog is triggered.

The LOMIOCDOGSTATE IOCTL requires only that open() be successfully called. This IOCTL can be run any number of times after open() is called, and it does not require any other DOG IOCTLs to have been executed.

Finding and Defining Data Structures

All data structures and IOCTLs are defined in lom_io.h, which is available in the SUNWlomh package.

The data structures for the watchdog timer are shown here:
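A minimal sketch of the two structures follows, using stand-in type names. The reset_enable and dog_enable members are named in this appendix; the dog_timeout member name and all member types are assumptions, and lom_io.h remains the authoritative definition. The format_dogstate() helper is hypothetical and only illustrates how the state members might be reported.

```c
#include <stdio.h>

/* Stand-in for lom_dogstate_t; dog_timeout and the types are assumed. */
typedef struct {
	int reset_enable;		/* reset enabled if nonzero */
	int dog_enable;			/* watchdog enabled if nonzero */
	unsigned int dog_timeout;	/* current timeout period, in seconds */
} dogstate_sketch_t;

/* Stand-in for lom_dogctl_t; the member types are assumed. */
typedef struct {
	int reset_enable;		/* reset enabled if nonzero */
	int dog_enable;			/* watchdog enabled if nonzero */
} dogctl_sketch_t;

/* Hypothetical helper: format a state snapshot as one line of text. */
int format_dogstate(const dogstate_sketch_t *st, char *buf, size_t len)
{
	return snprintf(buf, len, "watchdog=%s reset=%s timeout=%us",
	    st->dog_enable ? "on" : "off",
	    st->reset_enable ? "on" : "off",
	    st->dog_timeout);
}
```

In a real program the structure would be filled in by LOMIOCDOGSTATE rather than by hand, and the timeout shown would be the configured period, not the time remaining.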

Example Watchdog Program

Following is a sample program for the watchdog timer.


EXAMPLE A-3 Example Watchdog Program

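#include  <sys/types.h>
#include  <sys/stat.h>
#include  <fcntl.h>
#include  <stdio.h>
#include  <unistd.h>
#include  <lom_io.h>
 
int main(void) {
	uint_t timeout = 30; /* 30 seconds */
	lom_dogctl_t dogctl;
	int fd;
 
	/* Enable both the watchdog and the reset on expiration */
	dogctl.reset_enable = 1;
	dogctl.dog_enable = 1;
 
	/* Only one open() of /dev/ntwdt is allowed at a time */
	fd = open("/dev/ntwdt", O_RDONLY | O_EXCL);
	if (fd < 0) {
		perror("open /dev/ntwdt");
		return (1);
	}
 
	/* Set timeout; this must precede enabling the watchdog */
	if (ioctl(fd, LOMIOCDOGTIME, (void *)&timeout) < 0) {
		perror("LOMIOCDOGTIME");
		return (1);
	}
 
	/* Enable watchdog */
	if (ioctl(fd, LOMIOCDOGCTL, (void *)&dogctl) < 0) {
		perror("LOMIOCDOGCTL");
		return (1);
	}
 
	/* Keep patting, well inside the 30-second timeout */
	while (1) {
		if (ioctl(fd, LOMIOCDOGPAT, NULL) < 0)
			perror("LOMIOCDOGPAT");
		sleep(5);
	}
	return (0);
}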


Watchdog Timer Error Messages

TABLE A-1 describes watchdog timer error messages that might be displayed and what they mean.


TABLE A-1 Watchdog Timer Error Messages

Error Message    Meaning

EAGAIN           An attempt was made to open more than one instance of /dev/ntwdt.
EFAULT           A bad user-space address was specified.
EINVAL           A nonexistent control command was requested or invalid parameters were supplied.
EINTR            A thread awaiting a component state change was interrupted.
ENXIO            The driver is not installed in the system.
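An application can translate these error codes into the meanings listed in TABLE A-1 when reporting a failed open() or IOCTL. The following is a minimal sketch; the function name is hypothetical, and the strings paraphrase the table.

```c
#include <errno.h>

/*
 * Hypothetical helper: map an errno value from a failed watchdog
 * call to the meaning given in TABLE A-1.
 */
const char *ntwdt_error_meaning(int err)
{
	switch (err) {
	case EAGAIN:
		return "/dev/ntwdt is already open";
	case EFAULT:
		return "bad user-space address";
	case EINVAL:
		return "nonexistent control command or invalid parameters";
	case EINTR:
		return "interrupted while awaiting a component state change";
	case ENXIO:
		return "driver is not installed";
	default:
		return "not a documented watchdog error";
	}
}
```

A caller would pass the value of errno immediately after the failing call, before any other library call can overwrite it.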