APPENDIX A

Understanding the Watchdog Timer Application Mode

This appendix gives information on the watchdog timer application mode on the Netra 1280 or Netra 1290 server.

The application mode allows you to:

This appendix provides the following sections to help you understand how to configure and use the watchdog timer and program Alarm3:



Note - Once the application watchdog timer is in use, it is necessary to reboot the Solaris operating system in order to return to the default (non-programmable) watchdog timer and default LED behavior (no Alarm3).




Understanding the Watchdog Timer Application Mode

The watchdog mechanism detects a system hang, or an application hang or crash, should they occur. The watchdog is a timer that is continually reset by a user application as long as the operating system and user application are running.

When the application is rearming the application watchdog, an expiration can be caused by:

When the system watchdog is running, a system hang, or more specifically, the hang of the clock interrupt handler causes an expiration.

The system watchdog mode is the default. If the application watchdog is not initialized, then the system watchdog mode is used.

The setupsc command, an existing command on the SC Lights Out Management, can be used to configure recovery for the system watchdog only:

lom> setupsc

The system controller configuration should be as follows:


SC POST diag Level [off]:
Host Watchdog [enabled]:
Log Reset Data [true]:
Verbose Reset Data [true]:
Rocker Switch [enabled]:
Secure Mode [off]:
 
PROC RTUs installed: 0
PROC Headroom quantity (0 to disable, 4 MAX) [0]:

When the Host Watchdog is enabled, and Log Reset Data is set to true, the system controller sends data to the console about the current state of each CPU before resetting the system. This allows system state data to be preserved if console data is being logged. The output format is the same as the format used by the showresetstate command when dumping the CPU state data for a hung system manually (that is, if Host Watchdog has been disabled).

The Verbose Reset Data setting controls the amount of information that the system controller sends to the console. When it is set to true, the output is the same as that of the showresetstate -v command.

The recovery configuration for the application watchdog is set using Input/Output Control codes (IOCTLs) that are issued to the ntwdt driver.


Using the ntwdt Driver

To use the new application watchdog feature, you must install the ntwdt driver. To enable and control the watchdog's application mode, you must program the watchdog system using the LOMIOCDOGxxx IOCTLs, described in the section "Understanding the User API".

If the ntwdt driver, as opposed to the system controller, initiates a reset of the Solaris OS on application watchdog expiration, the value of the following property in the ntwdt driver's configuration file (ntwdt.conf) is used:

ntwdt-boottimeout="600";

In case of a panic, or an expiration of the application watchdog, the ntwdt driver reprograms the watchdog time-out to the value specified in the property.

Assign a value representing a duration that is longer than the time it takes to reboot and perform a crash dump. If the specified value is not large enough, the SC resets the host if reset is enabled. Note that this reset by the SC occurs only once.


Understanding the User API

The ntwdt driver provides an application programming interface by using IOCTLs. You must open the /dev/ntwdt device node before issuing the watchdog ioctls.



Note - Only a single instance of open() is allowed on /dev/ntwdt; more than one instance of open() will generate the following error message: EAGAIN - The driver is busy, try again.



You can use the following IOCTLs with the watchdog timer:


Setting the Time-out Period

The LOMIOCDOGTIME IOCTL sets the timeout period of the watchdog. This IOCTL programs the watchdog hardware with the time specified in this IOCTL. You must set the time-out period (LOMIOCDOGTIME) before attempting to enable the watchdog timer (LOMIOCDOGCTL).

The argument is a pointer to an unsigned integer. This integer holds the new timeout period for the watchdog in multiples of 1 second. You can specify any timeout period in the range of 1 second to 180 minutes.

If the watchdog function is enabled, the time-out period is immediately reset so that the new value can take effect. An error (EINVAL) is returned if the time-out period is less than 1 second or longer than 180 minutes.
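The documented range can be checked in the application before the IOCTL is issued. The following is a minimal sketch, not part of lom_io.h: the function name wdt_check_timeout and the two macros are hypothetical, and the only facts taken from this appendix are the 1-second and 180-minute limits and the use of EINVAL for out-of-range values.

```c
#include <errno.h>

/* Documented LOMIOCDOGTIME range: 1 second to 180 minutes. */
#define WDT_MIN_TIMEOUT_SECS 1u
#define WDT_MAX_TIMEOUT_SECS (180u * 60u)   /* 10800 seconds */

/*
 * Hypothetical helper (not part of lom_io.h): returns 0 if secs lies
 * in the documented range, or EINVAL otherwise.
 */
int
wdt_check_timeout(unsigned int secs)
{
        if (secs < WDT_MIN_TIMEOUT_SECS || secs > WDT_MAX_TIMEOUT_SECS)
                return (EINVAL);
        return (0);
}
```

An application could call such a helper on a value before passing a pointer to that value to LOMIOCDOGTIME, so that a bad configuration value is reported before the driver is involved.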



Note - The LOMIOCDOGTIME IOCTL is not intended for general-purpose use. Setting the watchdog time-out to too low a value may cause the system to receive a hardware reset if the watchdog and reset functions are enabled. If the time-out is set low, the user application must run at a higher priority (for example, as a real-time thread) and must rearm the watchdog more often to avoid an unintentional expiration.




Enabling or Disabling the Watchdog

The LOMIOCDOGCTL IOCTL enables or disables the watchdog, and it enables or disables the reset capability. (See Finding and Defining Data Structures for the correct values for the watchdog timer.)

The argument is a pointer to the lom_dogctl_t structure (described in greater detail in Finding and Defining Data Structures).

Use the reset_enable member to enable or disable the system reset function. Use the dog_enable member to enable or disable the watchdog function. An error (EINVAL) is returned if the watchdog is disabled but reset is enabled.



Note - If LOMIOCDOGTIME has not been issued to set up the time-out period prior to this IOCTL, the watchdog is NOT enabled in the hardware.
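The one rejected combination of the two members can be caught before the IOCTL is issued. The sketch below is an illustration only: dogctl_sketch_t is a local mirror of lom_dogctl_t so that the code compiles without lom_io.h, and dogctl_check is a hypothetical helper name. It encodes only the rule stated above (reset enabled while the watchdog is disabled yields EINVAL).

```c
#include <errno.h>

/* Local mirror of lom_dogctl_t so the sketch compiles without lom_io.h. */
typedef struct {
        int reset_enable;       /* reset enabled if non-zero */
        int dog_enable;         /* watchdog enabled if non-zero */
} dogctl_sketch_t;

/*
 * Hypothetical pre-check: returns EINVAL for the combination the text
 * says is rejected (reset enabled while watchdog disabled), else 0.
 */
int
dogctl_check(const dogctl_sketch_t *ctl)
{
        if (ctl->dog_enable == 0 && ctl->reset_enable != 0)
                return (EINVAL);
        return (0);
}
```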




Rearming, or Patting, the Watchdog

The LOMIOCDOGPAT IOCTL rearms, or pats, the watchdog so that it restarts its countdown from the value specified by LOMIOCDOGTIME. This IOCTL requires no arguments. If the watchdog is enabled, this IOCTL must be issued at regular intervals that are shorter than the watchdog time-out period; otherwise, the watchdog expires.
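The relationship between the time-out period and the patting interval can be pictured with a toy user-space model. This is not how the driver or the hardware is implemented; the type wdt_model_t and both functions are hypothetical and exist only to illustrate that a pat restarts the countdown from the full period and that the timer expires when no pat arrives in time.

```c
/* Toy model of a countdown watchdog; all names here are hypothetical. */
typedef struct {
        unsigned int timeout;   /* period set via LOMIOCDOGTIME, in seconds */
        unsigned int remaining; /* seconds left before expiration */
} wdt_model_t;

/* Rearm: restart the countdown from the full time-out period. */
void
wdt_model_pat(wdt_model_t *w)
{
        w->remaining = w->timeout;
}

/* Advance the model by secs seconds; returns 1 if the timer expired. */
int
wdt_model_tick(wdt_model_t *w, unsigned int secs)
{
        if (secs >= w->remaining) {
                w->remaining = 0;
                return (1);
        }
        w->remaining -= secs;
        return (0);
}
```

With a 30-second period, patting every 5 seconds leaves the model far from expiration, whereas 30 seconds without a pat expires it.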


Getting the State of the Watchdog Timer

The LOMIOCDOGSTATE IOCTL gets the state of the watchdog and reset functions and retrieves the current time-out period for the watchdog. If LOMIOCDOGTIME was never issued to set up the time-out period prior to this IOCTL, the watchdog is not enabled in the hardware.

The argument is a pointer to the lom_dogstate_t structure (described in greater detail in Finding and Defining Data Structures). The structure members hold the current states of the watchdog and reset circuitry and the current watchdog time-out period. Note that the time-out period is the configured value, not the time remaining before the watchdog is triggered.

The LOMIOCDOGSTATE IOCTL requires only that open() be successfully called. This IOCTL can be run any number of times after open() is called, and it does not require any other DOG IOCTLs to have been executed.


Finding and Defining Data Structures

All data structures and ioctls are defined in lom_io.h, which is available in the SUNWlomu package.

The data structures for the watchdog timer are shown here:

1. The watchdog/reset state data structure is as follows:


CODE EXAMPLE A-1 Watchdog/Reset State Data Structure
typedef struct { 
        int reset_enable; /* reset enabled if non-zero */ 
        int dog_enable; /* watchdog enabled if non-zero */ 
        uint_t dog_timeout; /* Current watchdog timeout */ 
} lom_dogstate_t; 

2. The watchdog/reset control data structure is as follows:


CODE EXAMPLE A-2 Watchdog/Reset Control Data Structure
typedef struct { 
        int reset_enable; /* reset enabled if non-zero */ 
        int dog_enable; /* watchdog enabled if non-zero */ 
} lom_dogctl_t; 
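As an illustration of how the state structure might be consumed after a LOMIOCDOGSTATE call, the sketch below formats the three members as one line of text. The type dogstate_sketch_t is a local mirror of lom_dogstate_t so that the code compiles without lom_io.h, and dogstate_format is a hypothetical helper, not something the SUNWlomu package provides.

```c
#include <stdio.h>

/* Local mirror of lom_dogstate_t so the sketch compiles without lom_io.h. */
typedef struct {
        int reset_enable;               /* reset enabled if non-zero */
        int dog_enable;                 /* watchdog enabled if non-zero */
        unsigned int dog_timeout;       /* configured time-out, in seconds */
} dogstate_sketch_t;

/*
 * Hypothetical helper: writes a one-line summary into buf and returns
 * the value returned by snprintf().
 */
int
dogstate_format(const dogstate_sketch_t *st, char *buf, size_t len)
{
        return (snprintf(buf, len, "watchdog=%s reset=%s timeout=%us",
            st->dog_enable ? "on" : "off",
            st->reset_enable ? "on" : "off",
            st->dog_timeout));
}
```

In a real program the structure would be filled by the driver rather than by hand, and dog_timeout would report the configured period rather than the time remaining.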


Using the Sample Watchdog Program

Following is a sample program for the watchdog timer.


CODE EXAMPLE A-3 Example Watchdog Program
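#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include "lom_io.h"

int
main(void)
{
      uint_t timeout = 30; /* 30 seconds */
      lom_dogctl_t dogctl;
      int fd;

      dogctl.reset_enable = 1;
      dogctl.dog_enable = 1;

      /* Only one open() of /dev/ntwdt is allowed at a time */
      fd = open("/dev/ntwdt", O_EXCL);
      if (fd < 0) {
            perror("open /dev/ntwdt");
            return (1);
      }

      /* Set timeout; this must precede enabling the watchdog */
      if (ioctl(fd, LOMIOCDOGTIME, (void *)&timeout) < 0) {
            perror("LOMIOCDOGTIME");
            return (1);
      }

      /* Enable watchdog and reset */
      if (ioctl(fd, LOMIOCDOGCTL, (void *)&dogctl) < 0) {
            perror("LOMIOCDOGCTL");
            return (1);
      }

      /* Keep patting, well within the 30-second timeout */
      while (1) {
            ioctl(fd, LOMIOCDOGPAT, NULL);
            sleep(5);
      }

      return (0);
}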


Programming Alarm 3

Alarm 3 is available to Solaris Operating System users irrespective of the watchdog mode. The ON and OFF states of Alarm 3, also called the system alarm, have been redefined (see TABLE A-1).

Set the value of Alarm 3 using the LOMIOCALCTL IOCTL. You can program Alarm 3 in the same way that you set and clear Alarm 1 and Alarm 2.

The following table presents the behavior of Alarm 3:


TABLE A-1 Alarm 3 Behavior

Condition              Alarm 3   Relay       System LED (Green)
Poweroff               ON        COM -> NC   OFF
Poweron/LOM up         ON        COM -> NC   OFF
Solaris running        OFF       COM -> NO   ON
Solaris not running    ON        COM -> NC   OFF
Host WDT expires       ON        COM -> NC   OFF
User sets to ON        ON        COM -> NC   OFF
User sets to OFF       OFF       COM -> NO   ON


To summarize the data in the table:

Alarm3 ON = Relay(COM->NC), System LED OFF
Alarm3 OFF = Relay(COM->NO), System LED ON
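The two-line summary can also be written as a small lookup function. The sketch below is illustrative only: relay_pos_t, alarm3_effect_t, and alarm3_effect are hypothetical names, and the function models the documented behavior without touching any hardware.

```c
/* Hypothetical names; the values mirror the two-line summary above. */
typedef enum { RELAY_COM_NC, RELAY_COM_NO } relay_pos_t;

typedef struct {
        relay_pos_t relay;      /* relay contact position */
        int system_led_on;      /* non-zero if the green System LED is lit */
} alarm3_effect_t;

/* Returns the relay position and System LED state for an Alarm 3 state. */
alarm3_effect_t
alarm3_effect(int alarm_on)
{
        alarm3_effect_t e;

        if (alarm_on) {
                e.relay = RELAY_COM_NC;
                e.system_led_on = 0;
        } else {
                e.relay = RELAY_COM_NO;
                e.system_led_on = 1;
        }
        return (e);
}
```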

After Alarm 3 is programmed, you can check its state (the system alarm) by using the showalarm command with the system argument.

For example:


sc> showalarm system
system alarm is on

The data structure used with the LOMIOCALCTL and LOMIOCALSTATE IOCTLs is as follows:


CODE EXAMPLE A-4 LOMIOCALCTL and LOMIOCALSTATE IOCTL Data Structure
#include <lom_io.h>
 
#define ALARM_NUM_1 1 
#define ALARM_NUM_2 2 
#define ALARM_NUM_3 3
 
#define ALARM_OFF 0 
#define ALARM_ON 1
 
typedef struct {
      int alarm_no; 
      int alarm_state;
} lom_aldata_t; 
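A request for LOMIOCALCTL could be prepared as in the following sketch. It rests on several assumptions: the SK_ macros and aldata_sketch_t are local copies of the definitions above so that the code compiles without lom_io.h, aldata_fill is a hypothetical helper, and returning EINVAL for out-of-range arguments is a choice made for this sketch rather than documented driver behavior.

```c
#include <errno.h>

/* Local copies of the definitions above; the SK_ prefix is hypothetical. */
#define SK_ALARM_NUM_1  1
#define SK_ALARM_NUM_3  3
#define SK_ALARM_OFF    0
#define SK_ALARM_ON     1

typedef struct {
        int alarm_no;
        int alarm_state;
} aldata_sketch_t;

/*
 * Hypothetical helper: fills a request structure. Returns EINVAL (a
 * choice made for this sketch) if either argument is out of range,
 * and leaves *req untouched in that case.
 */
int
aldata_fill(aldata_sketch_t *req, int alarm_no, int state)
{
        if (alarm_no < SK_ALARM_NUM_1 || alarm_no > SK_ALARM_NUM_3)
                return (EINVAL);
        if (state != SK_ALARM_OFF && state != SK_ALARM_ON)
                return (EINVAL);
        req->alarm_no = alarm_no;
        req->alarm_state = state;
        return (0);
}
```

With the real header, a filled lom_aldata_t would presumably be passed by pointer to the LOMIOCALCTL IOCTL; this appendix does not spell out the call or the device node, so check both against lom_io.h.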


Understanding Error Messages

Following are the error codes that might be returned and what they mean.

EAGAIN

This error is returned if you attempt to open /dev/ntwdt while another instance of open() is already active.

EFAULT

This error is returned if a bad user-space address is specified.

EINVAL

This error is returned if a nonexistent control command is requested or invalid parameters are supplied.

EINTR

This error is returned if a thread awaiting a component state change is interrupted.

ENXIO

This error is returned if the driver is not installed in the system.
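An application that logs watchdog failures might map these codes to short explanations. The sketch below is one way to do so: ntwdt_explain is a hypothetical function name, and the strings paraphrase the descriptions in this section rather than quoting any driver output.

```c
#include <errno.h>

/*
 * Hypothetical helper: maps an errno value to a short explanation
 * paraphrased from the descriptions in this section.
 */
const char *
ntwdt_explain(int err)
{
        switch (err) {
        case EAGAIN:
                return ("driver busy: /dev/ntwdt is already open");
        case EFAULT:
                return ("bad user-space address");
        case EINVAL:
                return ("nonexistent command or invalid parameters");
        case EINTR:
                return ("interrupted while awaiting a state change");
        case ENXIO:
                return ("driver not installed");
        default:
                return ("error not listed in this appendix");
        }
}
```

A caller would typically pass errno immediately after a failed open() or ioctl() call, before any other library call can overwrite it.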


Knowing Unsupported Features and Limitations

1. When the SC detects a watchdog timer expiration, recovery is attempted only once; if the first attempt fails to recover the domain, no further recovery attempts are made.

2. If the application watchdog is enabled and you break into the OpenBoot PROM (OBP) by issuing the break command from the system controller's lom prompt, the SC automatically disables the watchdog timer.



Note - The SC displays a console message as a reminder that the watchdog, from the SC's perspective, is disabled.



However, when you reenter the Solaris OS, the watchdog timer is still ENABLED from the Solaris Operating System's perspective. To have both the SC and the Solaris OS view the same watchdog state, you must use the watchdog application to either enable or disable the watchdog.

3. If you perform a dynamic reconfiguration (DR) operation in which a system board containing kernel (permanent) memory is deleted, then you must disable the watchdog timer's application mode before the DR operation and enable it after the DR operation. This is required because Solaris software quiesces all system I/O and disables all interrupts during a memory-delete of permanent memory. As a result, system controller firmware and Solaris software cannot communicate during the DR operation. Note that this limitation affects neither the dynamic addition of memory nor the deletion of a board that does not contain permanent memory. In those cases, the watchdog timer's application mode can run concurrently with the DR implementation.

You can execute the following command to locate the system boards that contain kernel (permanent) memory:

sh> cfgadm -lav | grep -i permanent

4. If the Solaris Operating System hangs under the following conditions, the system controller firmware cannot detect the Solaris software hang:

5. The watchdog timer provides partial boot monitoring. You can use the application watchdog to monitor a domain reboot.

However, domain booting is not monitored for:

In the latter cases, a boot failure is not detected and no recovery attempts are made.

6. The watchdog timer's application mode provides no monitoring for application startup. In application mode, if the application fails to start up, the failure is not detected and no recovery is provided.