APPENDIX B

Understanding the ALOM Watchdog Timer

This appendix gives information on the ALOM watchdog timer.



Note - The ALOM watchdog feature is not supported on all servers. For more information about whether your host system is supported, refer to the Release Notes for your version of the ALOM software.



ALOM features a watchdog mechanism to detect and respond to a system hang, should one ever occur. The ALOM watchdog is a timer that is continually reset by a user application as long as the operating system and user application are running. In the event of a system hang, the user application is no longer able to reset the timer. The timer will then expire and will perform an action set by the user, eliminating the need for operator intervention.

To fully understand the ALOM watchdog timer, you must understand certain terms associated with the feature's components and how all of the components interact.

1. If the ALOM watchdog timer is enabled, it will automatically begin monitoring the host server and will detect when the host or application encounters a hang condition or stops running. The default time-out period is 60 seconds; in other words, if the ALOM watchdog timer does not hear from the host system within that 60-second window, it will automatically perform the action that you specify in the sys_autorestart variable (see sys_autorestart). You can change the time-out period through the sys_wdttimeout variable (see sys_wdttimeout).

2. If you set XIR as the function that ALOM would perform once the watchdog timer time-out period is reached, then ALOM will attempt to XIR the host system. If the XIR does not complete within the specified number of seconds (set through the sys_xirtimeout variable), then ALOM forces the server to perform a hard reset instead (see sys_xirtimeout).

3. The user application should enable the ALOM watchdog after the host system is booted. ALOM starts a boot timer to detect host boot failures as soon as the host is powered on or reset, and it considers the host fully booted once the ALOM watchdog timer is started. If the host fails to boot within a certain amount of time, ALOM takes an action that you have specified. Use the sys_boottimeout variable to specify how long ALOM waits for the host to boot (see sys_boottimeout), and use the sys_bootrestart variable to specify the action that ALOM takes if the host does not boot in that time (see sys_bootrestart).

4. You can set the maximum number of attempted reboots using the sys_maxbootfail variable to keep the system from going through an endless cycle of reboots (see sys_maxbootfail). If the system goes through the number of reboots set through the sys_maxbootfail variable, then ALOM will perform an action that you set through the sys_bootfailrecovery variable (see sys_bootfailrecovery).

Note that the boot timer will be disabled for the host reset or reboot after the action set through the sys_bootfailrecovery variable is taken; it will not be enabled again until after the user application restarts the watchdog timer.
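The escalation described in the numbered steps above can be summarized as a small decision model. This is only an illustrative sketch, not ALOM source code: the type alom_step_t and both function names are hypothetical, and the comparison used for the boot-failure limit is an assumption based on the wording of step 4.

```c
/* Hypothetical model of the recovery steps described above. */
typedef enum {
	STEP_AUTORESTART,	/* action configured in sys_autorestart */
	STEP_HARD_RESET,	/* forced when an XIR does not complete in time */
	STEP_BOOTRESTART,	/* action configured in sys_bootrestart */
	STEP_BOOTFAILRECOVERY	/* action configured in sys_bootfailrecovery */
} alom_step_t;

/*
 * Steps 1 and 2: outcome of a watchdog time-out. If the configured
 * action is XIR and the XIR does not complete within sys_xirtimeout,
 * the result is a hard reset.
 */
alom_step_t
watchdog_expiry_step(int autorestart_is_xir, int xir_completed_in_time)
{
	if (autorestart_is_xir && !xir_completed_in_time)
		return (STEP_HARD_RESET);
	return (STEP_AUTORESTART);
}

/*
 * Steps 3 and 4: outcome of a boot time-out. Assumption: the recovery
 * action applies once the number of failed boots reaches sys_maxbootfail.
 */
alom_step_t
boot_timeout_step(unsigned int failed_boots, unsigned int maxbootfail)
{
	if (failed_boots >= maxbootfail)
		return (STEP_BOOTFAILRECOVERY);
	return (STEP_BOOTRESTART);
}
```

For example, under these assumptions a host that has failed to boot three times with sys_maxbootfail set to 3 receives the sys_bootfailrecovery action rather than another sys_bootrestart action.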


Driver Properties

The following property must be present in the /platform/sun4u/kernel/drv/rmclomv.conf file for the ALOM watchdog to function:


rmclomv-watchdog-mode="app";

This property tells the watchdog subsystem to disable the kernel level heartbeat mechanism. Comment out or remove this line to enable the kernel level watchdog.

The ntwdt driver will have an associated driver configuration file (ntwdt.conf) that will specify the following parameters:

ntwdt-autorestart

This property indicates the action to be taken if the watchdog timer expires. Following are the acceptable values for this property:

Note that if you enter any value other than those listed above, the software will automatically default to the xir value.

ntwdt-boottimeout

When the host system begins to boot the Solaris operating system, the ntwdt-boottimeout value specifies the amount of time, in seconds, within which the watchdog system must be programmed. Note that if the application watchdog is enabled, the user program must program the watchdog system using the LOMIOCDOGTIME or LOMIOCDOGCTL ioctls; otherwise, the kernel programs it automatically. If the watchdog is not programmed within that time, ALOM takes the recovery action.

ntwdt-bootrestart

This property specifies the action to be taken when the boot timer expires. Following are the acceptable values for this property:

Note that if you enter any value other than those listed above, the software will automatically default to the xir value.



Note - If you set the ntwdt-bootrestart property to xir, you must also set the OpenBoot PROM NVRAM variable auto-boot-on-error? to true and the error-reset-recovery variable to boot. In addition, for this option to work reliably, the system must reboot after the XIR, which might not happen in all cases; for example, the system might fail to find the boot disk and drop down to the ok prompt. Because of these restrictions, you might want to set the ntwdt-bootrestart property to reset for more consistent behavior.



ntwdt-xirtimeout

This property specifies how long ALOM waits, in seconds, before issuing a system reset when the ntwdt-autorestart property is set to xir and the watchdog timer expires, but the system does not reset successfully. Acceptable values for this property range from 900 (15 minutes) to 10800 (180 minutes). Any value outside this range is ignored.

ntwdt-maxbootfail

This property allows you to set a limit to the number of times that the recovery action applied through the ntwdt-bootfailrecovery property is allowed to be taken, keeping the system from performing the recovery action continuously. The maximum value for this property is 6. Any value entered that is above 6 will be ignored.

ntwdt-bootfailrecovery

This property tells ALOM what recovery action to take if the host system fails to boot after the value set in the ntwdt-maxbootfail property is met. Following are the acceptable values for this property:

Note that if you enter any value other than those listed above, the software will automatically default to the powercycle value.
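The numeric limits quoted in this section for ntwdt-xirtimeout and ntwdt-maxbootfail can be checked before a value is written to ntwdt.conf. The following is a minimal sketch of such a check; the macro and function names are hypothetical, and treating negative ntwdt-maxbootfail values as unacceptable is an assumption, because this section states only the upper limit.

```c
/* Limits quoted in this section; the macro names are hypothetical. */
#define	NTWDT_XIRTIMEOUT_MIN	900	/* seconds (15 minutes) */
#define	NTWDT_XIRTIMEOUT_MAX	10800	/* seconds (180 minutes) */
#define	NTWDT_MAXBOOTFAIL_MAX	6

/* Returns 1 if the value is inside the documented range, 0 otherwise. */
int
xirtimeout_accepted(long seconds)
{
	return (seconds >= NTWDT_XIRTIMEOUT_MIN &&
	    seconds <= NTWDT_XIRTIMEOUT_MAX);
}

/*
 * Returns 1 if the value does not exceed the documented maximum.
 * Assumption: negative values are also treated as unacceptable.
 */
int
maxbootfail_accepted(long count)
{
	return (count >= 0 && count <= NTWDT_MAXBOOTFAIL_MAX);
}
```

A value that fails either check would be ignored by the software, so a configuration tool could warn about it instead of writing it.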


Understanding the User APIs

The ntwdt driver provides several application programming interfaces (APIs) to application programs. You must open the /dev/ntwdt device node before issuing the watchdog ioctls. Note that only a single instance of open() is allowed on /dev/ntwdt; more than one instance of open() will generate the following error message:


EAGAIN 
The driver is busy, try again.
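Because only one open() is allowed at a time, an application might want to report the busy case separately from other failures. The following sketch shows one way to do that; the function name open_watchdog_node is hypothetical, and the caller is assumed to pass "/dev/ntwdt" as the path.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Opens the watchdog device node and returns its file descriptor,
 * or -1 on failure. EAGAIN is reported separately because it means
 * that another open() on the node is already active.
 */
int
open_watchdog_node(const char *path)
{
	int fd = open(path, O_RDWR);

	if (fd < 0) {
		if (errno == EAGAIN)
			(void) fprintf(stderr,
			    "%s: driver is busy, try again\n", path);
		else
			(void) fprintf(stderr, "%s: %s\n", path,
			    strerror(errno));
		return (-1);
	}
	return (fd);
}
```

An application would typically call open_watchdog_node("/dev/ntwdt") once at startup and keep the descriptor open for as long as it pats the watchdog.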

The following APIs are used with the ALOM watchdog timer:


Setting the Time-out Period

The time-out period for the ALOM watchdog is set using the LOMIOCDOGTIME API.

LOMIOCDOGTIME

This ioctl sets the time-out period of the watchdog by programming the watchdog hardware with the time that you specify.

The argument is a pointer to an unsigned integer. This integer holds the new time-out period for the watchdog in multiples of 1 second.

The watchdog framework allows only time-outs of at least 1 second. You can specify any time-out period in the range of 1 second to 180 minutes.

If the watchdog function is enabled, the time-out period is immediately reset so that the new value can take effect. An error (EINVAL) is returned if the time-out period is less than 1 second or longer than 180 minutes.
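An application can apply the same range check before it issues LOMIOCDOGTIME, so that a bad value is caught early. The following is a minimal sketch of such a check; the macro and function names are hypothetical, and the base unit is assumed to be seconds.

```c
#include <errno.h>

/* Range quoted above, in seconds; the macro names are hypothetical. */
#define	DOGTIME_MIN_S	1u
#define	DOGTIME_MAX_S	(180u * 60u)	/* 180 minutes */

/*
 * Returns 0 if the time-out is inside the documented range, or EINVAL
 * if the driver would be expected to reject it.
 */
int
check_dog_timeout(unsigned int seconds)
{
	if (seconds < DOGTIME_MIN_S || seconds > DOGTIME_MAX_S)
		return (EINVAL);
	return (0);
}
```

For example, check_dog_timeout(30) returns 0, while a value of 0 or a value above 10800 returns EINVAL.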



Note - Setting the time-out period to a value of 0 means that the watchdog timer is uninitialized, so once you arm the watchdog timer, you cannot set the time-out period back to 0. Any attempt to set the time-out period to 0 will be unsuccessful. If you want to disable the watchdog timer, do not attempt to set the time-out period to 0; use the LOMIOCDOGCTL API instead (see LOMIOCDOGCTL for more information).





Note - This ioctl is not intended for general purpose use. Setting the watchdog time-out to too low a value may cause the system to receive a hardware reset if the watchdog and reset functions are enabled. If the time-out is set very low, the user application must run with a higher priority (for example, as a real-time thread) and must pat the watchdog more often to avoid an unintentional expiration.



To change the base unit back to seconds, either remove the line above from the ntwdt.conf file or change the value on that line from 1 to 10:


ntwdt-time-unit=10;


Enabling or Disabling the ALOM Watchdog

The enabling or disabling of the ALOM watchdog is done through the LOMIOCDOGCTL API.

LOMIOCDOGCTL

This API enables or disables the watchdog reset function. The ALOM watchdog is programmed with appropriate values.

The argument is a pointer to the lom_dogctl_t structure (described in greater detail in Data Structures). The reset_enable member is used to enable or disable the system reset function. The dog_enable member is used to enable or disable the watchdog function. An error (EINVAL) is returned if the watchdog is disabled but reset is enabled.
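The one invalid combination of the two members can be detected before the ioctl is issued. The following sketch illustrates the rule; the function name check_dogctl is hypothetical, and the structure shown is a local stand-in that mirrors the lom_dogctl_t layout so that the example is self-contained. A real program would include lom_io.h instead.

```c
#include <errno.h>

/* Stand-in for the lom_dogctl_t structure defined in lom_io.h. */
typedef struct {
	int reset_enable;	/* reset enabled iff non-zero */
	int dog_enable;		/* watchdog enabled iff non-zero */
} lom_dogctl_t;

/*
 * Returns EINVAL for the combination that is described as an error
 * (watchdog disabled but reset enabled), or 0 otherwise.
 */
int
check_dogctl(const lom_dogctl_t *ctl)
{
	if (!ctl->dog_enable && ctl->reset_enable)
		return (EINVAL);
	return (0);
}
```

To disable the watchdog cleanly, an application would therefore clear both members before issuing LOMIOCDOGCTL.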


Patting the ALOM Watchdog

The patting of the ALOM watchdog is done through the LOMIOCDOGPAT API.

LOMIOCDOGPAT

This API resets (pats) the watchdog so that the watchdog starts ticking from the beginning. This ioctl requires no arguments. If the watchdog is enabled, this ioctl must be issued at regular intervals that are shorter than the watchdog time-out.
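The requirement that the pat interval be shorter than the time-out leaves the choice of margin to the application. The following sketch derives an interval of one third of the time-out; this ratio is only an illustrative rule of thumb, not a documented recommendation, and the function name pat_interval_ms is hypothetical.

```c
/*
 * Returns a pat interval in milliseconds that is one third of the
 * watchdog time-out, so that two consecutive pats can be delayed
 * before the timer expires. The one-third ratio is an assumption.
 */
unsigned long
pat_interval_ms(unsigned int timeout_s)
{
	return ((unsigned long)timeout_s * 1000UL / 3UL);
}
```

With a 30-second time-out this gives 10000 milliseconds between pats. A shorter interval leaves more margin at the cost of more frequent ioctl calls.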


Getting the State of the Watchdog Timer

The state of the ALOM watchdog is shown using the LOMIOCDOGSTATE API.

LOMIOCDOGSTATE

This API gets the state of the watchdog and reset functions and retrieves the current time-out period for the watchdog. If LOMIOCDOGTIME was never issued to set up the time-out period prior to this ioctl, the watchdog is not enabled in the hardware.

The argument is a pointer to the lom_dogstate_t structure (described in greater detail in Data Structures). The structure members hold the current states of the watchdog and reset functions and the current watchdog time-out period. Note that the time-out period is not the time remaining before the watchdog is triggered.
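The values returned by LOMIOCDOGSTATE can be reduced to a single summary for logging or monitoring. The following is a sketch under stated assumptions: the type dog_summary_t and the function summarize_dogstate are hypothetical, the structure is a local stand-in that mirrors lom_dogstate_t, and a time-out of 0 is taken to mean that the timer was never initialized.

```c
/* Stand-in for the lom_dogstate_t structure defined in lom_io.h. */
typedef struct {
	int reset_enable;		/* reset enabled iff non-zero */
	int dog_enable;			/* watchdog enabled iff non-zero */
	unsigned int dog_timeout;	/* current watchdog time-out */
} lom_dogstate_t;

/* Hypothetical summary of the state reported by the driver. */
typedef enum {
	DOG_UNINITIALIZED,	/* time-out never set (still 0) */
	DOG_DISABLED,		/* time-out set, watchdog not enabled */
	DOG_ENABLED_NO_RESET,	/* watchdog enabled, reset disabled */
	DOG_ENABLED_WITH_RESET	/* watchdog and reset both enabled */
} dog_summary_t;

dog_summary_t
summarize_dogstate(const lom_dogstate_t *st)
{
	if (st->dog_timeout == 0)
		return (DOG_UNINITIALIZED);
	if (!st->dog_enable)
		return (DOG_DISABLED);
	return (st->reset_enable ? DOG_ENABLED_WITH_RESET :
	    DOG_ENABLED_NO_RESET);
}
```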


Data Structures

All data structures and ioctls are defined in the lom_io.h file.

Watchdog/Reset State Data Structure

Following is the watchdog/reset state data structure.


CODE EXAMPLE B-1 Watchdog/Reset State Data Structure
typedef struct { 
	int reset_enable; /* reset enabled iff non-zero */ 
	int dog_enable; /* watchdog enabled iff non-zero */ 
	uint_t dog_timeout; /* Current watchdog timeout */ 
} lom_dogstate_t; 

Watchdog/Reset Control Data Structure

Following is the watchdog/reset control data structure.


CODE EXAMPLE B-2 Watchdog/Reset Control Data Structure
typedef struct {
	int reset_enable; /* reset enabled iff non-zero */
	int dog_enable; /* watchdog enabled iff non-zero */
} lom_dogctl_t;


Error Messages

TABLE B-1 lists the error messages that might be displayed and what they mean.


TABLE B-1 Error Messages for the Watchdog Timer

Error Message   Description
EAGAIN          Appears if you attempt more than one open() on /dev/ntwdt.
EFAULT          Appears if an invalid user-space address is specified.
EINVAL          Appears if a non-existent control command is requested or invalid parameters are supplied.
EINTR           Appears if a thread awaiting a component state change is interrupted.
ENXIO           Appears if the driver is not installed in the system.
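An application can translate these error codes into driver-specific messages when an open() or ioctl call fails. The following is a minimal sketch; the function name ntwdt_errno_hint is hypothetical, and the message texts are paraphrased from the descriptions above.

```c
#include <errno.h>

/*
 * Maps an errno value from an open() or ioctl() call on /dev/ntwdt
 * to a short hint based on the error descriptions above.
 */
const char *
ntwdt_errno_hint(int err)
{
	switch (err) {
	case EAGAIN:
		return ("/dev/ntwdt is already open; only one open() is allowed");
	case EFAULT:
		return ("invalid user-space address was specified");
	case EINVAL:
		return ("non-existent control command or invalid parameters");
	case EINTR:
		return ("interrupted while awaiting a component state change");
	case ENXIO:
		return ("ntwdt driver is not installed in the system");
	default:
		return ("unexpected error for the watchdog driver");
	}
}
```

A caller would pass the value of errno immediately after the failed call, before any other library call can overwrite it.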



Sample ALOM Watchdog Program

Following is a sample program for the ALOM watchdog.


CODE EXAMPLE B-3 Example Program for the ALOM Watchdog
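#include <sys/types.h>	/* uint_t */
#include <fcntl.h>	/* open(), O_RDWR */
#include <stdio.h>	/* perror() */
#include <stropts.h>	/* ioctl() on Solaris */
#include <unistd.h>	/* sleep() */
#include "lom_io.h"	/* watchdog ioctls and lom_dogctl_t */

int main(void) {
	uint_t timeout = 30; /* 30 seconds */
	lom_dogctl_t dogctl;
	int fd = open("/dev/ntwdt", O_RDWR);

	if (fd < 0) {
		perror("open /dev/ntwdt");
		return 1;
	}
	dogctl.reset_enable = 1;
	dogctl.dog_enable = 1;
	/* Set timeout */
	if (ioctl(fd, LOMIOCDOGTIME, (void *)&timeout) < 0) {
		perror("LOMIOCDOGTIME");
		return 1;
	}
	/* Enable watchdog */
	if (ioctl(fd, LOMIOCDOGCTL, (void *)&dogctl) < 0) {
		perror("LOMIOCDOGCTL");
		return 1;
	}

	/* Keep patting at an interval shorter than the time-out */
	while (1) {
		ioctl(fd, LOMIOCDOGPAT, NULL);
		sleep(5);
	}
}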