APPENDIX A
This appendix gives information on the watchdog timer application mode on the server. It provides the following sections to help you understand how to configure and use the watchdog timer:
The watchdog mechanism detects a system hang, or an application hang or crash, should they occur. The watchdog is a timer that is continually reset by a user application as long as the operating system and user application are running.
When the application is rearming the application watchdog, an expiration can be caused by:
When the system watchdog is running, a system hang, or more specifically, the hang of the clock interrupt handler, causes an expiration.
The system watchdog mode is the default. If the application watchdog is not initialized, then the system watchdog mode is used.
The application mode enables you to:
Configure the watchdog timer – Your applications running on the host can configure and use the watchdog timer, enabling you to detect fatal problems from applications and to recover automatically.
Program Alarm3 – This capability enables you to generate this alarm in case of critical problems in your applications.
The setupsc command, an existing command for the ALOM CMT compatibility CLI (in ILOM), can be used to configure the recovery for the system watchdog only:
sc> setupsc
The recovery configuration for the application watchdog is set using input/output control codes (IOCTLs) that are issued to the ntwdt driver.
The limitations of the watchdog timer mode include:
When the system controller detects a watchdog timer expiration, it attempts recovery only once. If that attempt fails to recover the domain, no further recovery attempts are made.
If the application watchdog is enabled and you break into the OpenBoot PROM by issuing the break command from the system controller’s sc> prompt, the system controller automatically disables the watchdog timer.
Note - The system controller displays a console message as a reminder that the watchdog, from the system controller’s perspective, is disabled.
However, when you re-enter the Solaris OS, the watchdog timer is still enabled from the Solaris OS perspective. To have both the system controller and the Solaris OS view the same watchdog state, you must use the watchdog application to either enable or disable the watchdog.
If you perform a dynamic reconfiguration (DR) operation in which a system board containing kernel (permanent) memory is deleted, then you must disable the watchdog timer’s application mode before the DR operation and enable it after the DR operation. This is required because Solaris software quiesces all system I/O and disables all interrupts during a memory-delete of permanent memory. As a result, system controller firmware and Solaris software cannot communicate during the DR operation. Note that this limitation affects neither the dynamic addition of memory nor the deletion of a board not containing permanent memory. In those cases, the watchdog timer’s application mode can run concurrently with the DR implementation.
You can execute the following command to locate the system boards that contain kernel (permanent) memory:
# cfgadm -lav | grep -i permanent
If the Solaris Operating System hangs under the following conditions, the system controller firmware cannot detect the Solaris software hang:
The watchdog timer provides partial boot monitoring. You can use the application watchdog to monitor a domain reboot.
However, domain booting is not monitored for:
In the case of a recovery of a hung or failed domain, a boot failure is not detected and no recovery attempts are made.
The watchdog timer’s application mode provides no monitoring for application startup. In application mode, if the application fails to start up, the failure is not detected and no recovery is provided.
To enable and control the watchdog’s application mode, you must program the watchdog system using the LOMIOCDOGxxx IOCTLs, described in Understanding the User API.
If the ntwdt driver, as opposed to the system controller, initiates a reset of the Solaris OS on application watchdog expiration, the value of the following property in the ntwdt driver’s configuration file (ntwdt.conf) is used:
ntwdt-boottimeout="600";
If the system panics or the application watchdog expires, the ntwdt driver reprograms the watchdog timeout to the value specified in this property.
Assign a value representing a duration that is longer than the time it takes to reboot and perform a crash dump. If the specified value is not large enough, the system controller resets the host if reset is enabled. Note that this reset by the system controller occurs only once.
The ntwdt driver provides an application programming interface by using IOCTLs. You must open the /dev/ntwdt device node before issuing the watchdog IOCTLs.
Note - Only a single instance of open() is allowed on /dev/ntwdt at a time. Any additional open() fails with the following error: EAGAIN – The driver is busy, try again.
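The open step and the busy case described in the note can be sketched as follows. This is a minimal illustration, not part of the ntwdt API: the helper name wd_open() and the O_RDWR open flag are assumptions, and only the /dev/ntwdt path and the EAGAIN behavior come from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * wd_open() is a hypothetical helper, not part of the ntwdt driver API.
 * It opens the watchdog device node and reports why an open failed.
 * Returns a file descriptor on success, or -1 with errno preserved.
 * The O_RDWR flag is an assumption; use whatever your platform
 * documentation specifies for /dev/ntwdt.
 */
int
wd_open(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDWR);
	if (fd == -1) {
		int saved = errno;

		if (saved == EAGAIN) {
			/* Another process already holds the single allowed open(). */
			(void) fprintf(stderr, "%s: driver is busy, try again\n", path);
		} else {
			(void) fprintf(stderr, "%s: %s\n", path, strerror(saved));
		}
		errno = saved;	/* fprintf() may have changed errno */
	}
	return (fd);
}
```

A caller would typically invoke wd_open("/dev/ntwdt") once at startup and keep the descriptor for all later watchdog IOCTLs, because a second open() is refused while the first is still held.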
The LOMIOCDOGTIME IOCTL sets the timeout period of the watchdog by programming the watchdog hardware with the specified time. You must set the timeout period (LOMIOCDOGTIME) before attempting to enable the watchdog timer (LOMIOCDOGCTL).
The argument is a pointer to an unsigned integer. This integer holds the new timeout period for the watchdog in multiples of 1 second. You can specify any timeout period in the range of 1 second to 180 minutes.
If the watchdog function is enabled, the timeout period is immediately reset so that the new value can take effect. An error (EINVAL) is returned if the timeout period is less than 1 second or longer than 180 minutes.
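The accepted range can be checked in the application before LOMIOCDOGTIME is issued. The sketch below assumes a hypothetical helper, wd_check_timeout(), and hypothetical macro names; it only mirrors the 1 second to 180 minute range stated above and is not the driver's own validation code.

```c
#include <errno.h>

/* Range stated in this appendix, expressed in seconds. */
#define	WD_MIN_TIMEOUT_SEC	1u
#define	WD_MAX_TIMEOUT_SEC	(180u * 60u)	/* 180 minutes = 10800 seconds */

/*
 * wd_check_timeout() is a hypothetical pre-check, not a driver function.
 * Returns 0 if the value lies in the documented range, EINVAL otherwise,
 * matching the error the text says LOMIOCDOGTIME reports.
 */
int
wd_check_timeout(unsigned int seconds)
{
	if (seconds < WD_MIN_TIMEOUT_SEC || seconds > WD_MAX_TIMEOUT_SEC)
		return (EINVAL);
	return (0);
}
```

Checking the value first lets the application report a configuration mistake clearly instead of interpreting a failed IOCTL after the fact.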
The LOMIOCDOGCTL IOCTL enables or disables the watchdog, and it enables or disables the reset capability. See Finding and Defining Data Structures for the correct values for the watchdog timer.
The argument is a pointer to the lom_dogctl_t structure. This structure is described in greater detail in Finding and Defining Data Structures.
Use the reset_enable member to enable or disable the system reset function. Use the dog_enable member to enable or disable the watchdog function. An error (EINVAL) is returned if the watchdog is disabled but reset is enabled.
Note - If LOMIOCDOGTIME has not been issued to set up the timeout period prior to this IOCTL, the watchdog is not enabled in the hardware.
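The one combination that LOMIOCDOGCTL rejects (watchdog disabled, reset enabled) can be caught before the call. In this sketch the structure is a local stand-in that copies the lom_dogctl_t definition given in this appendix so the example compiles without lom_io.h; wd_check_dogctl() is a hypothetical helper, and treating a NULL pointer as EINVAL is this helper's own choice, not documented driver behavior.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Local stand-in mirroring the lom_dogctl_t definition in this appendix.
 * On Solaris, include <lom_io.h> instead of redefining it.
 */
typedef struct {
	int reset_enable;	/* reset enabled if non-zero */
	int dog_enable;		/* watchdog enabled if non-zero */
} lom_dogctl_t;

/*
 * wd_check_dogctl() is a hypothetical pre-check, not a driver function.
 * Returns 0 for combinations the text describes as acceptable and
 * EINVAL for "watchdog disabled but reset enabled".
 */
int
wd_check_dogctl(const lom_dogctl_t *ctl)
{
	if (ctl == NULL)
		return (EINVAL);	/* helper's own choice for a bad pointer */
	if (!ctl->dog_enable && ctl->reset_enable)
		return (EINVAL);
	return (0);
}
```

The remaining three combinations pass: both enabled, watchdog enabled without reset, and both disabled.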
The LOMIOCDOGPAT IOCTL rearms, or pats, the watchdog so that the watchdog starts ticking from the beginning; that is, to the value specified by LOMIOCDOGTIME. This IOCTL requires no arguments. If the watchdog is enabled, this IOCTL must be used at regular intervals that are less than the watchdog timeout, or the watchdog expires.
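Because the pat interval must be shorter than the timeout, an application needs some rule for choosing it. The sketch below uses one third of the timeout, which is an arbitrary safety margin assumed for illustration, not a value from this appendix; wd_pat_interval_ms() is a hypothetical helper.

```c
/*
 * wd_pat_interval_ms() is a hypothetical helper, not a driver function.
 * Given the timeout set with LOMIOCDOGTIME (in seconds), it returns a
 * pat interval in milliseconds equal to one third of the timeout, so
 * the result is always shorter than the timeout itself.
 * The divisor 3 is an assumed safety margin.
 * The documented maximum (10800 s) keeps the multiplication well within
 * the range of a 32-bit unsigned int.
 */
unsigned int
wd_pat_interval_ms(unsigned int timeout_sec)
{
	return (timeout_sec * 1000u / 3u);
}
```

With a 30 second timeout this gives a pat every 10 seconds, leaving room for two missed pats before the watchdog expires. Working in milliseconds keeps the interval non-zero even for the minimum 1 second timeout.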
The LOMIOCDOGSTATE IOCTL gets the state of the watchdog and reset functions, and retrieves the current timeout period for the watchdog. If LOMIOCDOGTIME was never issued to set up the timeout period prior to this IOCTL, the watchdog is not enabled in the hardware.
The argument is a pointer to the lom_dogstate_t structure, which is described in greater detail in Finding and Defining Data Structures. The structure members hold the current states of the watchdog and reset circuitry and the current watchdog timeout period. This timeout period is the configured period, not the time remaining before the watchdog is triggered.
The LOMIOCDOGSTATE IOCTL requires only that open() be successfully called. This IOCTL can be run any number of times after open() is called, and it does not require any other DOG IOCTLs to have been executed.
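Once LOMIOCDOGSTATE has filled in the structure, an application might log the result in a readable form. The sketch below assumes a local stand-in for lom_dogstate_t (with unsigned int in place of uint_t) so that it compiles without lom_io.h; wd_format_state() and its output format are hypothetical.

```c
#include <stdio.h>

/*
 * Local stand-in mirroring the lom_dogstate_t definition in this
 * appendix. On Solaris, include <lom_io.h> instead of redefining it.
 */
typedef struct {
	int reset_enable;		/* reset enabled if non-zero */
	int dog_enable;			/* watchdog enabled if non-zero */
	unsigned int dog_timeout;	/* uint_t in lom_io.h; configured timeout */
} lom_dogstate_t;

/*
 * wd_format_state() is a hypothetical helper, not a driver function.
 * It writes a one-line summary of the state into buf and returns the
 * snprintf() result. dog_timeout is the configured period in seconds,
 * not the time remaining before expiration.
 */
int
wd_format_state(const lom_dogstate_t *st, char *buf, size_t len)
{
	return (snprintf(buf, len, "watchdog=%s reset=%s timeout=%us",
	    st->dog_enable ? "on" : "off",
	    st->reset_enable ? "on" : "off",
	    st->dog_timeout));
}
```

Because LOMIOCDOGSTATE needs only a successful open(), a summary like this can be produced at any point, including before the watchdog has been configured.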
All data structures and IOCTLs are defined in lom_io.h, which is available in the SUNWlomh package.
The data structures for the watchdog timer are shown here:
The watchdog and reset state data structure is as follows:
typedef struct {
    int reset_enable;    /* reset enabled if non-zero */
    int dog_enable;      /* watchdog enabled if non-zero */
    uint_t dog_timeout;  /* Current watchdog timeout */
} lom_dogstate_t;
The watchdog and reset control data structure is as follows:
typedef struct {
    int reset_enable;    /* reset enabled if non-zero */
    int dog_enable;      /* watchdog enabled if non-zero */
} lom_dogctl_t;
Following is a sample program for the watchdog timer.
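The call order that the API sections require (set the timeout, then enable, then pat) can be sketched in a form that compiles without lom_io.h. Everything named wd_ here is hypothetical: the request tags stand in for the LOMIOCDOGxxx codes, and the callback stands in for a wrapper around ioctl() on the /dev/ntwdt descriptor. The recording callback exists only so the sequence can be exercised as a dry run.

```c
#include <stddef.h>

/* Hypothetical tags standing in for the LOMIOCDOGxxx codes in lom_io.h. */
enum wd_request { WD_REQ_TIME, WD_REQ_CTL, WD_REQ_PAT };

/* Local stand-in mirroring lom_dogctl_t as defined in this appendix. */
typedef struct {
	int reset_enable;	/* reset enabled if non-zero */
	int dog_enable;		/* watchdog enabled if non-zero */
} lom_dogctl_t;

/* Hypothetical callback; on Solaris it would wrap ioctl(fd, code, arg). */
typedef int (*wd_issue_fn)(enum wd_request req, void *arg);

/* Issues TIME, then CTL, then the requested number of pats. */
int
wd_run(wd_issue_fn issue, unsigned int timeout_sec, unsigned int pats)
{
	lom_dogctl_t ctl;
	unsigned int i;

	ctl.reset_enable = 1;
	ctl.dog_enable = 1;

	/* 1. Set the timeout first; enabling without it leaves the hardware off. */
	if (issue(WD_REQ_TIME, &timeout_sec) != 0)
		return (-1);
	/* 2. Enable the watchdog and the reset capability. */
	if (issue(WD_REQ_CTL, &ctl) != 0)
		return (-1);
	/* 3. Pat; a real program sleeps less than timeout_sec between pats. */
	for (i = 0; i < pats; i++) {
		if (issue(WD_REQ_PAT, NULL) != 0)
			return (-1);
	}
	return (0);
}

/* Dry-run callback that records the order of requests instead of issuing them. */
static enum wd_request wd_log[16];
static size_t wd_log_len;

int
wd_record_issue(enum wd_request req, void *arg)
{
	(void) arg;
	if (wd_log_len < sizeof (wd_log) / sizeof (wd_log[0]))
		wd_log[wd_log_len++] = req;
	return (0);
}
```

A production program would replace wd_record_issue() with a callback that calls ioctl() using the real codes from lom_io.h, and would loop indefinitely with a sleep between pats rather than patting a fixed number of times.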
TABLE A-1 describes watchdog timer error messages that might be displayed and what they mean.
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.