ChorusOS 5.0 Application Developer's Guide

Chapter 14 Analyzing System Failure : Black Boxes, Watchdog Timers, and Logging

This chapter demonstrates the use of the ChorusOS application programming interfaces that help when diagnosing and recovering from application or system failure. The APIs covered here include:

Software black box

Provides a robust repository for realtime collection and storage of a time ordered vector of historical event information to enable postmortem data analysis for system and application failures.
Application-level two-stage watchdog timer

Provides a two-stage watchdog timer. If the timer is not periodically reset, it will generate an interrupt to initiate a controlled processor restart and then, if the system cannot initiate a controlled restart, will trigger a hardware reset without software intervention (to ensure that the target restarts).
System logging

Provides a configurable system log service to enable the system and applications to log errors, warnings, or other messages to local or remote files.

After implementing the examples provided in this chapter, you should understand how these APIs can be used in an application.

Software Black Box

The ChorusOS black box enables applications to read the event buffer and log events. The black box feature can be defined as a set of microkernel ring buffers. Multiple black boxes can be configured, each identified with a unique integer identifier. The number and sizes of black boxes on your ChorusOS are tunable. A specified black box can be frozen and events directed to another black box.

Events are tagged with specific information (event producer identifier, event producer name, filter data, and timestamp). Some filtering is performed before events are inserted into a black box. Events can be filtered in or out based on the event producer name, and on the severity of the event. Descriptive event tags can be added to a black box along with each event to enable finer grained post-process filtering.

Black Box APIs

All black box APIs are thread-safe. Calling any one of these APIs will produce valid results, even if the call is made by multiple threads. The bb_event and bb_freeze events are async-signal-safe. This means that they can safely be called from a signal handler, and cannot be interrupted by a signal. No black box APIs are cancel-safe or interruptible.

Using the Black Box

At node startup, a specific number of black boxes are allocated. The node will select one of these black boxes to be active. This is performed sufficiently early in the boot process so that all user applications are able to log events to it as soon as they start. Some black box parameters are configurable at node startup time, that is before applications start running.

Before an application component manager launches an application, it prepares to catch the application in case it fails unexpectedly. If the application does fail, the component manager logs information about the failure to the black box, and triggers a freeze of the current black box.

Any application can log events in a black box at any point by calling bb_event. If multiple entities on the node are generating black box events, a specific thread may have fewer events than expected recorded in the black box.

At some point, an event may occur which requires freezing the black box. This is accomplished by calling bb_freeze. When the current black box is frozen, a new black box is chosen (if one is available), and new events are routed into it.

The criteria by which event filtering is performed can be modified dynamically. See "Filtering APIs" for details on the capabilities of the filtering APIs.

A system process can locate a frozen black box, open it, and read its contents to diagnose a failure. After the failure has been diagnosed, the process calls bb_release to allow the system to reselect that black box.

The black box API can also be used to store microkernel-level information. In this case the black box logs exceptions, traps, panics, and failed system calls. To use this feature, the kern.blackbox.kernelLogging tunable must be set to 1.

When the HOT_RESTART feature is selected,persistent memory can be used to store the black box data. To use persistant memory in this way, the BSP must be compiled with the -DBB_PERSISTENT flag.

Note -

When the BSP is compiled with the -DBB_PERSISTENT flag, the total size allocated for the black boxes is fixed and equally distributed for the specified number of black boxes.

Example 14-1 demonstrates the basic use of the black box API in an application.

Example 14-1 Basic Use of the Software Black Box

#include <sys/blackbox.h>

#define SEV_LOW 4 
#define SEV_HIGH 24 
#define SEV_EMERG 31

int main(int argc, char **argv)
{ 
   int fd; 
   
   bb_event("main", SEV_LOW, "Starting with %d args", argc); 
   
   if (argc > 1) { 
   	   bb_event("main", SEV_LOW, "argv[1] = %s", argv[1]);
   } else { 
   	   bb_event("main", SEV_EMERG, "no arguments   exiting");
   	   exit(1);
   } 
   
   if ((fd = open(argv[1], O_RDWR)) ==  1) {
   	   bb_event("main", SEV_HIGH,
   	    "failed to open \"%s\": error %d", argv[1], errno);
   } 
   ... 
}

Note -

Depending on the node's filtering configuration, some of the events in this example may not be added to the black box.

Filtering APIs

The filtering APIs include the following interfaces:

bb_getfilters
bb_getprodids
bb_getseverity
bb_setfilters
bb_setprodids
bb_setseverity
BB_SEV_CLEAR
BB_SEV_CLEAR_LIMIT
BB_SEV_SET
BB_SEV_SET_LIMIT
BB_SEV_TEST

These interfaces enable three levels of pre-process event filtering; filter list, filtered severity bitmap, and global severity bitmap.

You can specify an event producer to use the filter list and filtered severity bitmap (also known as the fine grained filters) by identifying the producer in a set passed to bb_setprodids.

The filter list contains a set of pairs of tag and severity. These entities are described in the definition of bb_getseverity. An event is entered into the black box if:

The call to bb_event has a tag that matches a tag in the filter list.
The severity of the call is enabled in the tag's severity bitmap in the filter list.
The caller of bb_event has been enabled to use the filter list.

The filter list can be updated by reading the current list with bb_getfilters, adding to it, subtracting from it, or modifying it, and then passing the new list to bb_setfilters.

The filtered severity bitmap is a node-wide severity bitmap. An event will be entered into the black box if:

The call to bb_event has a severity that is enabled in the filtered severity bitmap.
The caller of bb_event has been enabled to use the filtered severity bitmap.

The filtered severity bitmap can be updated by including a tag, severity filter with a tag of BB_ANY_TAG in a call to bb_setfilters.

The global severity bitmap is also a node-wide severity bitmap. If a call to bb_event does not find a match in the filter list or the filtered severity bitmap, or if the caller is not using these filters, the bb_event call will fall back to using the global severity bitmap. An event will be entered into the black box if the call to bb_event has a severity that is enabled in the global severity bitmap.

The global severity bitmap can be modified by calling bb_getseverity, modifying the bitmap, and passing the new bitmap to bb_setseverity.

The three filters are used in order, from the most to the least specific:

filter list
filtered severity bitmap
global severity bitmap

If an event matches a filter in the filter list, it will added to the black box, otherwise it is tried against the filtered severity bitmap. If a match is found in the filtered severity bitmap, it will be added to the black box. If no match is found, it will be tried against the global severity bitmap. A consequence of this approach is that there is no strict filter out support. If the first filter tried does not match an event, another filter may find a match, even if the intent was for the more specific filter to filter it out. In such a case, however, the filters as specified are inconsistent, and a semantic should be chosen.

Using the Filtering APIs

The following example illustrates the use of tags in libraries and some filtering interfaces. It also shows what a black box dump should look like.

Example 14-2 Using the Filtering APIs

#include <sys/blackbox.h>

#define SEV_TRACE 8 
#define SEV_INIT  15
#define SEV_HIGH  28
#define SEV_OOM   30

int
foo_init(foo_t **fp, hat_t *hp)
{
  hat_t *nhp;
  bar_t *bp;
  uint_t total_reqs, i;
  *fp = NULL;

  if (hp == NULL) {
    bb_event("libA:foo_init", SEV_TRACE, "hat pointer is NULL, skipping");
    return (0);

  }
	
  for (nhp = hp, total_reqs = 0; nhp != NULL; nhp = nhp->hat_next)
      total_reqs += hp->hat_reqs;
  bb_event("libA:foo_init", SEV_TRACE, "total of %d requests", total_reqs);
		
  if (total_reqs == 0)
      return (0);

  *fp = (foo_t *)malloc(sizeof (foo_t));
  if (*fp == NULL) {
      bb_event("libA:foo_init", SEV_OOM, "out of memory for foo_t");
      return (-1);
  }

  bp = (bar_t *)malloc(total_reqs * sizeof (bar_t));
  if (bp == NULL) {
      bb_event("libA:foo_init", SEV_OOM,
               "no memory for %d bars", total_reqs);
      free(*fp);
      return (-1);
  }

  (*fp)->foo_array = bp;
  (*fp)->foo_hatptr = hp;
  (*fp)->foo_reqcnt = total_reqs;

  for (nhp = hp; nhp != NULL;
      nhp = nhp->hat_next, bp += nhp->hat_reqs) {
      nhp->hat_array = bp;

  for (i = 0; i < nhp->hat_reqs; i++)
      bp->bar_id = i;
  }
  bb_event("libA:foo_init", SEV_INIT, "foo at %p created", *fp);

  return (0);
}

Within the program itself:

int process_hatlist(hat_t *hp)
{
  foo_t *fp;

  bb_event("process_hatlist", SEV_TRACE, "received hat list %p", hp);
		
  if (foo_init(&fp, hp) != 0) {
      bb_event("process_hatlist", SEV_HIGH,
               "cant make foo for hatlist %p", hp);
      return ( 1);
  }

  if (fp != NULL) {
      foo_getreqs(fp);
      foo_enqueue(fp);
  }

  return (0);
}

Given the program from the previous example, the following examples illustrate filters and the potential black box traces that they may generate. (Note that this is simply an example trace format and may bear no resemblance to any output an administrative CLI program might display.)

Example 14-3 Filter and Trace Showing Successful Execution

Filter:

bb_filter_t filts[] = {
		{ 0, "libA:foo_init" },
};
bb_prodid_t prods[] = {
		{ BB_ALL_PROD, BB_ALL_PIDS }
};
bb_severity_t sev;
BB_SEV_SET_LIMIT(SEV_TRACE, &filts[0].bbf_severity);
BB_SEV_SET_LIMIT(SEV_HIGH, &sev);
bb_setfilters(filts, 1);
bb_setprodids(prods, 1);
bb_setseverity(sev);

Trace:

2000 03 15 18:56:02.928980 hatrouter[2583/1]: libA:foo_init(8):
		total of 93 requests
2000-03-15 18:56:02.929301 hatrouter[2583/1]: libA:foo_init(15):
		foo at 0x78a89380 created
2000-03-15 18:56:03.017327 foodb[2448/1]: libA:foo_init(8):
		total of 93 requests
2000-03-15 18:56:03.017936 foodb[2448/1]: libA:foo_init(15):
		foo at 0x300004bc8d8 created

In the previous example, the filter and trace indicate that for both producers, foo_init was successful. No errors ocurred and the chain of 93 hat_ts were successfully passed from the hatrouter to the foodb. The number after the tag libA:foo_init indicates the severity level of the event.

Example 14-4 Filter and Trace Showing Unsuccessful Execution

Filter:

bb_filter_t filts[] = {
		{ 0, BB_ALL_TAGS },
};
bb_prodid_t prods[] = {
		{ "hatrouter", BB_ALL_PIDS }
};
bb_severity_t sev;
BB_SEV_SET_LIMIT(SEV_TRACE, &filts[0].bbf_severity);
BB_SEV_SET_LIMIT(SEV_HIGH, &sev);
bb_setfilters(filts, 1);
bb_setprodids(prods, 1);
bb_setseverity(sev);

Trace:

2000-03-15 19:33:28.139821 hatrouter[2583/1]: process_hatlist(8):
		received hat list 0x7f78a190
2000-03-15 19:33:28.148295 hatrouter[2583/1]: libA:foo_init(30):
		out of memory for 28173 bar's
2000-03-15 19:33:28.148376 hatrouter[2583/1]: process_hatlist(28):
		can't make foo for hat list 0x7f78a190

In the previous example, the filter and trace suggest that the hatrouter received an extremely long chain of hat_ts and could not allocate enough bar_ts to back them up.

Watchdog Timer

The watchdog timer API provides a set of system calls that enable a ChorusOS actor to manage a two-stage watchdog timer. The two-stage watchdog timer can be used by ChorusOS applications or middleware to monitor their sanity.

A user or supervisor actor takes control of the timer by invoking the watchdog timer API. When under the control of an application, the two-stage watchdog timer must be reloaded periodically. If the first-stage timer expires, an interrupt is triggered, enabling the collection of diagnostic information by means of the system dump feature. The system is then restarted.

If the system freezes before the completion of stage one, it will be reset when the second-stage timer expires.

The two-stage watchdog timer uses two watchdog devices. The first device operates in interrupt mode (stage one), while the second operates in reset mode (stage two). The first-stage interrupt handler is system-wide and is therefore not exposed to the user, as shown in Figure 14-1.

Figure 14-1 Two-Stage Watchdog Timer

When the watchdog timer is enabled, the second stage reset-mode watchdog device is armed and is running all the time, even when the two-stage watchdog timer is unallocated and not controlled by an application.

In this situation, the running watchdog device is reloaded silently by the watchdog API. It is never subsequently disarmed. This ensures that the system is guarded continually against system lockups.

Note -

If a process is unscheduled for a long time, for example, as the result of a debugging session, for example, a system dump will occur (if implemented). Following the system dump, the system will be reset on timeout of the first and second-stage watchdog devices.

The watchdog timer API includes the following system calls. Note that in this table, the watchdog timer refers to the watchdog timer named by handle.

Table 14-1 Watchdog Timer API System Calls


System Call	Purpose
`wdt_alloc()`	Returns a valid handle that identifies the allocated watchdog timer
`wdt_realloc()`	Reallocates the watchdog timer, that was returned by the last call to `wdt_alloc()`. This enables a new actor created in a higly available (HA) environment to take over watchdog management for an actor that has died. If the watchdog timer is armed, it will also be reloaded by this call.
`wdt_free()`	Disarms and releases the watchdog timer
`wdt_get_maxinterval()`	Returns the maximum timeout interval that can be set for the timer
`wdt_set_interval()`	Sets the timeout interval for the watchdog timer. If both components of interval are zero, the timer is disarmed.
`wdt_get_interval()`	Returns the current timeout interval set for the watchdog timer
`wdt_arm()`	Starts a new timeout interval, the duration of which will be set by `wdt_set_interval()`
`wdt_disarm()`	Disarms the watchdog timer
`wdt_is_armed()`	Returns the state of the watchdog timer. A positive value is returned if the timer is armed.
`wdt_pat()`	Reloads the watchdog timer. The timer begins a new timeout interval, the duration of which is set by `wdt_set_interval()`.
`wdt_startup_commit()`	Indicates whether the watchdog timer startup sequence has completed successfully. Successful completion denotes that the system no longer needs to be rebooted if the reset mode watchdog device armed by the boot framework expires. Because the reset mode watchdog device must not be disarmed, the system will continue to reload the reset mode device silently until it is shut down, or until the watchdog timer is explicitly allocated and armed by the HA framework (or by another application).
`wdt_shutdown()`	Indicates whether the system is being shut down. During shutdown, the reset mode watchdog device must not be reloaded for more than the configured timeout interval. This will ensure that the system is reset even if the shutdown sequence does not complete within the expected time period. This call will fail if it is invoked while the watchdog device is armed.

Using a Watchdog Timer

The watchdog API is typically used to monitor the progress of an application. In the following example, a thread is dedicated to the watchdog.

Example 14-5 Using the Watchdog Timer

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <assert.h>

#include <util/chKnTimeVal.h>
#include <sched/chSched.h>
#include <exec/chExec.h>
#include <sync/chSem.h>
#include <cx/wdt.h>


#define WDT_RUNS          10   /* number of times the test is run */
#define WDT_PAT_TIMES      3   /* number of times the timer 
                                * is patted at each interval
				*/
#define WDT_PAT_INC       10   /* timeout interval increment */
#define WDT_PAT_MIN        5   /* initial, minimum timeout interval */
#define WDT_STACKSIZE   1000   /* worker thread stack size */
#define WDT_THREAD_PRIO    3   /* worker thread (very high) priority */
#define WDT_EXTRA_SLEEP   10   /* extra delay to check that the timer 
				* is truly disarmed
				*/

wdt_handle_t wdt_handle = 0;   /* watchdog timer handle */


/*
 * sleep for 'secs' seconds, displaying a symbol at each elapsed second
 */
void
wdt_sleep(int secs)
{
    KnTimeVal tv = { 1, 0 };
    int       i;

    putchar('[');

    for (i = 0; i < secs; i++) {
        threadDelay(&tv);
        if (i && (i % 60 == 59))
            putchar('#');
        else
        if (i && (i % 10 == 9))
            putchar('|');
        else
            putchar('.');

        fflush(stdout);
    }
    printf("]\n");
}


/*
 * check the return value of a system call
 */
void
wdt_check(char* fn, int res)
{
    printf("%s: ", fn);

    if (res < 0)
        printf("ERROR: res = %d, errno = %d (%s)\n", res,
	        errno, strerror(errno));
    else
        printf("SUCCESS: res = %d\n", res);
}


/*
 * arm, pat, then disarm the watchdog timer
 */
int
wdt_arm_pat_disarm(int min, int max, int inc, int times, int armed)
{
    timespec_t tv;
    int res, intv, wait, pats;

    intv = min;
    while (intv < max) {

        tv.tv_sec  = intv;
        tv.tv_nsec = 0;

        /* timer already armed? */
        if (!armed) {
            printf("\nsetting interval to %d s...\n", intv);
            res = wdt_set_interval(wdt_handle, &tv);
            wdt_check("wdt_set_interval()", res);
            if (res < 0)
                return res;

            printf("arming...\n");
            res = wdt_arm(wdt_handle);
            wdt_check("wdt_arm()", res);
            if (res < 0) 
                return res;
        }
        
        printf("patting...\n");
        for (pats = 0; pats < times; pats++) {

            /* pat 1 sec before the timer expires */
            wait = intv - 1;
            
            printf("sleeping for %d s\n", wait);
            wdt_sleep(wait);

            res = wdt_pat(wdt_handle);
            wdt_check("wdt_pat()", res);
            if (res < 0)
                return res;
        }

        printf("disarming...\n");
        res = wdt_disarm(wdt_handle);
        wdt_check("wdt_disarm()", res);
        if (res < 0)
            return res;
        armed = 0;

        /* wait 'WDT_EXTRA_SLEEP' secs more than last configured interval */
        wait = intv + WDT_EXTRA_SLEEP;
        printf("checking that the watchdog timer is stopped...\n");
        printf("sleeping for %d s...\n", wait);
        wdt_sleep(wait);

        intv += inc;
    }

    return 0;
}


static KnSem wdt_sem;


/*
 * worker thread
 */
void
wdt_thread()
{    
    timespec_t max;
    int armed, i, res, min;
    
    for (i = 0; i < WDT_RUNS; i++) {
        
      /* Allocate watchdog timer. If timer is already allocated (busy),
         a previous instance of the actor was killed. In this case, 
	 reallocate timer and, if it is armed, continue the test
         without disarming it */
        res = wdt_alloc(&wdt_handle);
        wdt_check("wdt_alloc()", res);

        if (res < 0 && errno == EBUSY) {

            printf("watchdog timer busy -> reallocating it\n");
            
            /* Reallocate timer, using the magic handle (as we have
               no knowledge of the actual handle previously allocated).
               this call also pats the timer if it is armed */
            wdt_handle = WDT_MAGIC_HANDLE;
            res = wdt_realloc(&wdt_handle);
            wdt_check("wdt_realloc()", res);
        }

        printf("--> watchdog handle = %d\n", wdt_handle);
        
        if (res < 0)
            break;

        /* the timer may be already armed when we reallocate it */
        res = armed = wdt_is_armed(wdt_handle);
        wdt_check("wdt_is_armed()", res);
        if (res < 0)
            break;
        printf("--> the watchdog timer is currently%s armed\n",
	       armed ? "" : " NOT");

        /* ok, get current timeout interval of the armed timer */
        if (armed) {
            timespec_t intv;
          
            res = wdt_get_interval(wdt_handle, &intv);
            wdt_check("wdt_get_interval()", res);
            if (res < 0)
                break;
            printf("--> current interval = %d.%09d s\n",
	           intv.tv_sec, intv.tv_nsec);

            /* continue test at point where preceding actor was killed */
            min = intv.tv_sec;
        }
        else {
            min = WDT_PAT_MIN;
        }

        /* then get the maximum allowed timeout interval */
        res = wdt_get_maxinterval(wdt_handle, &max);
        wdt_check("wdt_get_maxinterval()", res);
        printf( "--> max interval = %d.%09d s\n", max.tv_sec, max.tv_nsec);
        if (res < 0)
            break;
        
        res = wdt_arm_pat_disarm(min, max.tv_sec - WDT_EXTRA_SLEEP, 
                                 WDT_PAT_INC, WDT_PAT_TIMES, armed);
        if (res < 0)
            break;
        
        res = wdt_free(wdt_handle);
        wdt_check("wdt_free()", res);
        if (res < 0)
            break;
    }
    
    /* test completed */
    semV(&wdt_sem); 

    threadDelete(K_MYACTOR, K_MYSELF);
}


static long wdt_stack[WDT_STACKSIZE];  /* watchdog thread stack */


int
main()
{
    KnThreadLid          childLid;
    KnThreadDefaultSched schedParam;
    KnDefaultStartInfo_f startInfo;
    KnActorPrivilege     actorPriv;
    int                  res;

    semInit(&wdt_sem, 0);

    actorPrivilege(K_MYACTOR, &actorPriv, NULL);
    
    startInfo.dsPrivilege        = (actorPriv == K_SUPACTOR) ? 
                                   K_SUPTHREAD : K_USERTHREAD;
    startInfo.dsType             = K_DEFAULT_START_INFO;
    startInfo.dsSystemStackSize  = K_DEFAULT_STACK_SIZE;
    startInfo.dsUserStackPointer = &wdt_stack[WDT_STACKSIZE];
    startInfo.dsEntry            = (KnPc) wdt_thread;
    
    /* heavily increase the priority of the worker thread to ensure that
       it will not be later delayed by other applicative threads (if any) */
    schedParam.tdClass           = K_SCHED_DEFAULT;
    schedParam.tdPriority        = WDT_THREAD_PRIO;

    /* create the worker thread */
    res = threadCreate(K_MYACTOR, &childLid, K_ACTIVE,
                       &schedParam, &startInfo);
    assert(res == 0);

    /* then wait for the completion of the test */
    semP(&wdt_sem, K_NOTIMEOUT);

    return 0;
}

System Logging

The ChorusOS operating system provides a general message logging feature (syslog) which is used by the microkernel and applications. This logging feature provides support for logging console activity on a target system and provides a way to store information kept on persistent storage (disk).

The Syslog feature consists of a logging server (syslogd) and a device into which messages are written. Distribution of messages is controlled by the configuration file syslog.conf. The syslog APIs enable applications to send messages to the syslog.

System logging enables the microkernel and other applications to produce error, warning, and other messages. The messages are saved in a file in append-only mode and can be viewed at a later time. Log file data are never overwritten. The logging service modifies the file by appending new data.

To use the system logging feature, an application writes log messages tagged with a specific header (including the facility and the severity). These messages are then sent to the logging service. The local configuration of the logging service defines how the messages are processed. The criteria used to select how a message is processed are based on the values in the Facility and/or Severity fields.

A message can be:

discarded (filtered out)
appended to a local file
forwarded to another node where it will be stored in a file

Using System Logging

The following example shows a basic LOG_ALERT message.

In this example:

The syslog() call logs a message at priority LOG_ALERT.
The FTP daemon ftpd makes a call to openlog() to announce that all messages it logs should have the identifying string, ftpd. It also announces that messages should be treated by syslogd() in the same way that other messages from system daemons are treated, and should include the process ID of the process logging the message.
The FTP daemon then makes a call to setlogmask() to indicate that only messages with priorities from LOG_EMERG through LOG_ERR should be logged. Messages with any other priority will not be logged.
The FTP daemon then calls syslog() to log a message at priority LOG_INFO.

Example 14-6 Using System Logging

syslog(LOG_ALERT, "who: internal error 23");
openlog("ftpd", LOG_PID, LOG_DAEMON);
setlogmask(LOG_UPTO(LOG_ERR));
syslog(LOG_INFO, "Connection from host %d", CallingHost);

A utility written locally would use the following call to syslog() to log a message at priority LOG_INFO. The message would be treated by syslogd() in the same way that other messages to the facility LOG_LOCAL2 are treated:

syslog(LOG_INFO|LOG_LOCAL2, "error: %m");