APPENDIX B

Understanding the ALOM Watchdog Timer

This appendix gives information on the ALOM watchdog timer.

ALOM features a watchdog mechanism to detect and respond to a system hang, should one ever occur. The ALOM watchdog is a timer that is continually reset by a user application as long as the operating system and user application are running. In the event of a system hang, the user application is no longer able to reset the timer. The timer will then expire and will perform an action set by the user, eliminating the need for operator intervention.

In order to fully understand the ALOM watchdog timer, it's useful to understand certain terms associated with the feature's components and how all of the components interact.

1. If the ALOM watchdog timer is enabled, it will automatically begin monitoring the host Netra server and will detect when the host or application encounters a hang condition or stops running. The default timeout period is 60 seconds; in other words, if the ALOM watchdog timer does not hear from the host system within that 60-second window, it will automatically perform the action set through the sys_autorestart variable (see sys_autorestart). You can change the timeout period through the sys_wdttimeout variable (see sys_wdttimeout).

2. If you set XIR as the action that ALOM performs when the watchdog timeout period is reached, ALOM attempts to XIR the host system. If the XIR does not complete within the specified number of seconds (set through the sys_xirtimeout variable), ALOM forces the server to perform a hard reset instead (see sys_xirtimeout).

3. The user application should enable the ALOM watchdog after the host system has booted. ALOM starts a boot timer to detect host boot failures as soon as the host is powered on or reset. The host is considered fully booted once the ALOM watchdog timer is started; if the host fails to boot within a set amount of time, ALOM takes an action that you specify. The amount of time ALOM waits for the host to boot is set through the sys_boottimeout variable (see sys_boottimeout), and the action ALOM takes if the host does not boot in that time is set through the sys_bootrestart variable (see sys_bootrestart).

4. You can set the maximum number of attempted reboots using the sys_maxbootfail variable to keep the system from going through an endless cycle of reboots (see sys_maxbootfail). If the system goes through the number of reboots set through the sys_maxbootfail variable, then ALOM will perform an action set by you; that action is set through the sys_bootfailrecovery variable (see sys_bootfailrecovery).

Note that the boot timer is disabled for the host reset or reboot that follows the action set through the sys_bootfailrecovery variable; it is not enabled again until the user application restarts the watchdog timer.
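The decision flow in the numbered steps above can be modeled in a few lines of C. This is only an illustrative sketch of that logic, not ALOM firmware code: the function names wd_expired, xir_needs_hard_reset, and boot_recovery_due are hypothetical, and the exact boundary conditions (for example, whether expiry happens at or just after the timeout) are assumptions.

```c
#include <stdbool.h>

/*
 * Hypothetical model: true once the host has been silent for the whole
 * timeout period (sys_wdttimeout, 60 seconds by default).
 */
bool
wd_expired(unsigned int now_s, unsigned int last_pat_s, unsigned int timeout_s)
{
	return (now_s - last_pat_s >= timeout_s);
}

/*
 * Hypothetical model: an XIR that has not completed within
 * sys_xirtimeout seconds is escalated to a hard reset.
 */
bool
xir_needs_hard_reset(unsigned int xir_elapsed_s, unsigned int xirtimeout_s)
{
	return (xir_elapsed_s >= xirtimeout_s);
}

/*
 * Hypothetical model: once the number of failed boots reaches
 * sys_maxbootfail, the sys_bootfailrecovery action is due instead of
 * another sys_bootrestart attempt. The ">=" boundary is an assumption.
 */
bool
boot_recovery_due(unsigned int failed_boots, unsigned int maxbootfail)
{
	return (failed_boots >= maxbootfail);
}
```

With the default 60-second timeout, a host last heard from at second 50 would be treated as hung at second 110 in this model.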


Driver Properties

The following property must be present in the /platform/sun4u/kernel/drv/rmclomv.conf file for the ALOM watchdog to function:

rmclomv-watchdog-mode="app";

This property tells the watchdog subsystem to disable the kernel-level heartbeat mechanism. Comment out or remove this line to enable the kernel-level watchdog.

The ntwdt driver will have an associated driver configuration file (ntwdt.conf) that will specify the following parameters:

ntwdt-autorestart

This property indicates the action to be taken if the watchdog timer expires. Following are the acceptable values for this property:

Note that if you enter any value other than those listed above, the software will automatically default to the xir value instead.

ntwdt-boottimeout

When the host system begins to boot the Solaris operating system, the ntwdt-boottimeout value specifies the amount of time, in seconds, within which the watchdog system must be programmed. If the application watchdog is enabled, the user program must program the watchdog system using the LOMIOCDOGTIME or LOMIOCDOGCTL ioctls; otherwise, the kernel does so automatically. If the watchdog is not programmed within that time, ALOM takes the recovery action.

ntwdt-bootrestart

This property specifies the action to be taken when the boot timer expires. Following are the acceptable values for this property:

Note that if you enter any value other than those listed above, the software will automatically default to the xir value instead.



Note - If you set the ntwdt-bootrestart property to xir, you must also set the OpenBoot PROM NVRAM variable auto-boot-on-error? to true and the error-reset-recovery variable to boot. In addition, for this option to work reliably, the xir must be followed by a system reboot, which may not happen in all cases (for example, if the system fails to find the boot disk and drops down to the ok prompt). Because of these restrictions, you may want to set the ntwdt-bootrestart property to reset for more consistent behavior.



ntwdt-xirtimeout

This property specifies how long ALOM will wait, in seconds, to issue a system reset if the ntwdt-autorestart property is set to xir and the watchdog timer expires, but the system did not reset successfully. Acceptable values for this property range from 900 (15 minutes) to 10800 (180 minutes). Any value entered that is outside of this range will be ignored.

ntwdt-maxbootfail

This property limits the number of times that the recovery action applied through the ntwdt-bootfailrecovery property can be taken, keeping the system from performing the recovery action continuously. The maximum value for this property is 6. Any value above 6 is ignored.

ntwdt-bootfailrecovery

This property tells ALOM what recovery action to take if the Netra system fails to boot after the value set in the ntwdt-maxbootfail property is met. Following are the acceptable values for this property:

Note that if you enter any value other than those listed above, the software will automatically default to the powercycle value instead.
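The numeric limits given above for ntwdt-xirtimeout and ntwdt-maxbootfail can be checked before a value is written to ntwdt.conf. The following is a minimal sketch, not driver code: the function and macro names are hypothetical, and treating 0 as the lowest acceptable ntwdt-maxbootfail value is an assumption, because the text states only the upper limit.

```c
#include <stdbool.h>

#define	XIRTIMEOUT_MIN_S	900	/* 15 minutes */
#define	XIRTIMEOUT_MAX_S	10800	/* 180 minutes */
#define	MAXBOOTFAIL_LIMIT	6

/* True if the value falls inside the documented ntwdt-xirtimeout range. */
bool
xirtimeout_in_range(long seconds)
{
	return (seconds >= XIRTIMEOUT_MIN_S && seconds <= XIRTIMEOUT_MAX_S);
}

/*
 * True if the value does not exceed the documented ntwdt-maxbootfail
 * maximum. The lower bound of 0 is an assumption.
 */
bool
maxbootfail_in_range(long count)
{
	return (count >= 0 && count <= MAXBOOTFAIL_LIMIT);
}
```

Values that fail these checks are the ones the text says would be ignored.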


Understanding the User APIs

The ntwdt driver provides several application programming interfaces (APIs) to application programs. You must open the /dev/ntwdt device node before issuing the watchdog ioctls. Only a single open() instance is allowed on /dev/ntwdt; any further attempt to open the device fails with the following error:

EAGAIN 
The driver is busy, try again.
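Because a second open() fails with EAGAIN, an application may want to retry briefly before giving up. The following is a minimal sketch under that assumption: open_watchdog and its max_tries parameter are hypothetical names, and retrying helps only if the process that currently holds the device closes it.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open the watchdog device node, retrying once per second while
 * the driver reports EAGAIN (device already open elsewhere).
 * Returns a file descriptor, or -1 with errno set by the last open().
 */
int
open_watchdog(const char *path, int max_tries)
{
	int fd = -1;
	int i;

	for (i = 0; i < max_tries; i++) {
		fd = open(path, O_RDWR);
		if (fd >= 0 || errno != EAGAIN)
			break;		/* success, or an error not worth retrying */
		(void) sleep(1);
	}
	return (fd);
}
```

A caller would pass "/dev/ntwdt" as the path and treat a return value of -1 as fatal.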

The following APIs are used with the ALOM watchdog timer: LOMIOCDOGTIME, LOMIOCDOGCTL, LOMIOCDOGPAT, and LOMIOCDOGSTATE.


Setting the Timeout Period

The timeout period for the ALOM watchdog is set using the LOMIOCDOGTIME API.

LOMIOCDOGTIME

This API sets the timeout period of the watchdog. The ioctl programs the watchdog hardware with the time that you specify.

The argument is a pointer to an unsigned integer. This integer holds the new timeout period for the watchdog in multiples of 1 second.

The watchdog framework allows only timeouts of at least 1 second. You can specify any timeout period in the range of 1 second to 180 minutes.

If the watchdog function is enabled, the timeout period is immediately reset so that the new value can take effect. An error (EINVAL) is returned if the timeout period is less than 1 second or longer than 180 minutes.
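An application can apply the same range check itself before issuing LOMIOCDOGTIME, so that a bad value is caught early. The following is a minimal sketch that mirrors the documented limits; the function name check_watchdog_timeout is hypothetical, and the check is a convenience, not a replacement for testing the return value of the ioctl.

```c
#include <errno.h>

#define	WD_TIMEOUT_MIN_S	1		/* 1 second */
#define	WD_TIMEOUT_MAX_S	(180 * 60)	/* 180 minutes */

/*
 * Returns 0 if the timeout (in seconds) is inside the documented range,
 * or EINVAL if the driver would be expected to reject it.
 */
int
check_watchdog_timeout(unsigned int seconds)
{
	if (seconds < WD_TIMEOUT_MIN_S || seconds > WD_TIMEOUT_MAX_S)
		return (EINVAL);
	return (0);
}
```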



Note - Setting the timeout period to a value of 0 means that the watchdog timer is uninitialized, so once you arm the watchdog timer, you cannot set the timeout period back to 0. Any attempt to set the timeout period to 0 will be unsuccessful. If you want to disable the watchdog timer, do not attempt to set the timeout period to 0; use the LOMIOCDOGCTL API instead (see LOMIOCDOGCTL for more information).





Note - This ioctl is not intended for general-purpose use. Setting the watchdog timeout too low may cause the system to receive a hardware reset if the watchdog and reset functions are enabled. If the timeout is set to a low value, the user application must run with a higher priority (for example, as a real-time thread) and must pat the watchdog more often to avoid an unintentional expiration.



To change the base unit back to seconds, either remove the line above from the ntwdt.conf file or change the value on that line from 1 to 10:

ntwdt-time-unit=10;


Enabling or Disabling the ALOM Watchdog

The ALOM watchdog is enabled or disabled through the LOMIOCDOGCTL API.

LOMIOCDOGCTL

This API enables or disables the watchdog reset function. The ALOM watchdog is programmed with appropriate values.

The argument is a pointer to the lom_dogctl_t structure (described in greater detail in Data Structures). The reset_enable member is used to enable or disable the system reset function. The dog_enable member is used to enable or disable the watchdog function. An error (EINVAL) is returned if the watchdog is disabled but reset is enabled.
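The one invalid combination described above (reset enabled while the watchdog is disabled) is easy to test for before calling LOMIOCDOGCTL. The following is a minimal sketch: example_dogctl_t is a stand-in copy of lom_dogctl_t so that the block compiles without lom_io.h, and check_dogctl is a hypothetical helper name.

```c
#include <errno.h>

/* Stand-in for lom_dogctl_t from lom_io.h (same two members). */
typedef struct {
	int reset_enable;	/* reset enabled iff non-zero */
	int dog_enable;		/* watchdog enabled iff non-zero */
} example_dogctl_t;

/*
 * Returns 0 if the combination is acceptable, or EINVAL if reset is
 * enabled while the watchdog itself is disabled.
 */
int
check_dogctl(const example_dogctl_t *ctl)
{
	if (!ctl->dog_enable && ctl->reset_enable)
		return (EINVAL);
	return (0);
}
```

To disable the watchdog cleanly under this rule, an application would clear both members before issuing the ioctl.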


Patting the ALOM Watchdog

The ALOM watchdog is patted through the LOMIOCDOGPAT API.

LOMIOCDOGPAT

This API resets (pats) the watchdog so that the watchdog starts ticking from the beginning. This ioctl requires no arguments. If the watchdog is enabled, this ioctl must be used at regular intervals that are less than the watchdog timeout.
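The text requires only that the pat interval be shorter than the watchdog timeout; it does not recommend a specific ratio. The following sketch derives an interval by patting three times per timeout period. The divisor of 3 and the name pat_interval_ms are assumptions chosen for illustration, leaving room for one or two delayed pats before the timer expires.

```c
#define	PAT_DIVISOR	3	/* assumed safety margin, not from the driver */

/*
 * Returns a pat interval in milliseconds that is always shorter than
 * the given watchdog timeout (in seconds, 1 or greater).
 */
unsigned long
pat_interval_ms(unsigned int timeout_s)
{
	return ((unsigned long)timeout_s * 1000UL / PAT_DIVISOR);
}
```

For a 30-second timeout this gives a 10-second interval; an application could reasonably choose a shorter one.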


Getting the State of the Watchdog Timer

The state of the ALOM watchdog is shown using the LOMIOCDOGSTATE API.

LOMIOCDOGSTATE

This API gets the state of the watchdog and reset functions and retrieves the current timeout period for the watchdog. If LOMIOCDOGTIME was never issued to set up the timeout period prior to this ioctl, the watchdog is not enabled in the hardware.

The argument is a pointer to the lom_dogstate_t structure (described in greater detail in Data Structures). The structure members are used to hold the current states of the watchdog reset circuitry and current watchdog timeout period. Note that this is not the time remaining before the watchdog is triggered.
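The values returned by LOMIOCDOGSTATE can be reduced to a single status for logging or monitoring. The following is a minimal sketch of one possible interpretation: example_dogstate_t is a stand-in copy of lom_dogstate_t so that the block compiles without lom_io.h, the enum and function names are hypothetical, and treating a timeout of 0 as "uninitialized" follows the note in the LOMIOCDOGTIME description.

```c
/* Stand-in for lom_dogstate_t from lom_io.h (same three members). */
typedef struct {
	int reset_enable;		/* reset enabled iff non-zero */
	int dog_enable;			/* watchdog enabled iff non-zero */
	unsigned int dog_timeout;	/* current watchdog timeout */
} example_dogstate_t;

typedef enum {
	WD_UNINITIALIZED,	/* timeout never programmed (still 0) */
	WD_DISABLED,		/* timeout set, watchdog not enabled */
	WD_ENABLED_NO_RESET,	/* watchdog running, reset function off */
	WD_ARMED		/* watchdog running, expiry resets the host */
} example_wd_status_t;

/* Collapse the three structure members into one status value. */
example_wd_status_t
classify_dogstate(const example_dogstate_t *st)
{
	if (st->dog_timeout == 0)
		return (WD_UNINITIALIZED);
	if (!st->dog_enable)
		return (WD_DISABLED);
	return (st->reset_enable ? WD_ARMED : WD_ENABLED_NO_RESET);
}
```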


Data Structures

All data structures and ioctls are defined in lom_io.h.

Watchdog/Reset State Data Structure

Following is the watchdog/reset state data structure.

CODE EXAMPLE B-1 Watchdog/Reset State Data Structure
typedef struct {
	int reset_enable;	/* reset enabled iff non-zero */
	int dog_enable;		/* watchdog enabled iff non-zero */
	uint_t dog_timeout;	/* current watchdog timeout */
} lom_dogstate_t;

Watchdog/Reset Control Data Structure

Following is the watchdog/reset control data structure.

CODE EXAMPLE B-2 Watchdog/Reset Control Data Structure
typedef struct {
	int reset_enable;	/* reset enabled iff non-zero */
	int dog_enable;		/* watchdog enabled iff non-zero */
} lom_dogctl_t;


Error Messages

Following are the error messages that may be displayed and what they mean.

EAGAIN

This error message will be displayed if you attempt to open /dev/ntwdt while it is already open; only a single open() instance is allowed.

EFAULT

This error message will be displayed if a bad user-space address was specified.

EINVAL

This error message will be displayed if a nonexistent control command was requested or invalid parameters were supplied.

EINTR

This error message will be displayed if a thread awaiting a component state change was interrupted.

ENXIO

This error message will be displayed if the driver is not installed in the system.
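An application that reports watchdog failures can map these errno values to the meanings listed above. The following is a minimal sketch: ntwdt_error_text is a hypothetical helper, the message strings paraphrase this section, and any other errno value falls back to the standard strerror() text.

```c
#include <errno.h>
#include <string.h>

/* Return a watchdog-specific description for an errno value. */
const char *
ntwdt_error_text(int err)
{
	switch (err) {
	case EAGAIN:
		return ("driver busy: /dev/ntwdt is already open");
	case EFAULT:
		return ("bad user-space address was specified");
	case EINVAL:
		return ("nonexistent control command or invalid parameters");
	case EINTR:
		return ("interrupted while awaiting a component state change");
	case ENXIO:
		return ("driver is not installed in the system");
	default:
		return (strerror(err));	/* not specific to the watchdog */
	}
}
```

A caller would typically pass errno immediately after a failed open() or ioctl() call.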


Sample ALOM Watchdog Program

Following is a sample program for the ALOM watchdog.

CODE EXAMPLE B-3 Example Program for the ALOM Watchdog
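#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stropts.h>	/* ioctl() on Solaris */
#include "lom_io.h"

int
main(void)
{
	uint_t timeout = 30;	/* 30 seconds */
	lom_dogctl_t dogctl;
	int fd = open("/dev/ntwdt", O_RDWR);

	if (fd < 0) {
		perror("open /dev/ntwdt");
		return (1);
	}
	dogctl.reset_enable = 1;
	dogctl.dog_enable = 1;

	/* Set timeout */
	ioctl(fd, LOMIOCDOGTIME, (void *)&timeout);
	/* Enable watchdog */
	ioctl(fd, LOMIOCDOGCTL, (void *)&dogctl);

	/* Keep patting at intervals shorter than the timeout */
	while (1) {
		ioctl(fd, LOMIOCDOGPAT, NULL);
		sleep(5);
	}
}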