ChorusOS 5.0 Board Support Package Developer's Guide

Chapter 13 Hardened Driver Requirements

This chapter details the main rules that a hardened driver should satisfy:

"No Panic" describes how a hardened driver must not panic on error.
"Containment" describes how a hardened driver ensures that the effects of the fault are contained.
"Logging" describes how a hardened driver logs driver errors and device failures.
"Notification" describes how a hardened driver notifies its clients of device failure.
"Bus Exceptions" describes how a hardened bus driver handles hardware bus exceptions.
"Corrupt Data Detection" describes how a hardened driver detects corrupt data read from the device.
"Stuck Interrupts" describes how a hardened driver handles persistently asserted interrupts.
"Periodic Health Checks" describes how a hardened driver carries out periodic health checks.

Many of the above requirements are illustrated using code taken from two real hardened device drivers:

The dec21x4x ethernet (PCI) device driver (see the dec21x4x(9DRV) man page).
The Raven PCI host bridge bus driver (see the raven(9DRV) man page).

No Panic

A hardened driver does not panic on error. All detected errors and failures are considered as fault conditions and the system is notified using Driver Framework mechanisms (see "Notification" ).

Containment

Once a failure is detected, a hardened driver ensures that the effects of the fault are contained. The driver maintains an instance-wide state flag for the serviced device (working/failed). This flag should be checked at critical points within the driver code to prevent erroneous data or events from propagating outside the driver, and to avoid multiplying errors by accepting new requests on a failed device.

Code Example 13-1, from the dec21x4x ethernet driver, shows how the 'failed' field of the driver instance data is used to ensure that effects from faults are contained. This field is zeroed in the drv_init() routine. It is set in fail() (see code Example 13-2) and on device removal and is checked on entry to each ethernet DDI down call routine.

Example 13-1 dec21x4x fault containment

typedef struct Dec21Data {
    DevNode            node;
    char*              path;
    int                pathSize;
    DevRegId           devRegId;    /* Device registry id            */
    DevRegEntry        entry;       /* device registry entry         */
    PciBusEvent        evtState;    /* bus events being processed    */

[...]

    Bool               failed;      /* instance wide failed flag     */

[...]

} Dec21Data;

#define IS_DEV_FAILED(d) (d->failed)

    static KnError
open (void* devId, void* clientCookie, EtherCallBack* clientOps)
{
    Dec21Data* dec21 = (Dec21Data*)devId;

    if (IS_DEV_FAILED(dec21)) {
        return K_EFAIL;
    }

[...]

    return K_OK;
}

[...]

    /*
     * PCI bus event handlers.
     * The event handler is invoked by the parent bus driver when a bus
     * event occurs in the bus.
     *
     * The DEC21 driver always supports the PCI_SYS_SHUTDOWN and
     * PCI_DEV_SHUTDOWN events. The PCI_DEV_REMOVAL support is optional and
     * is provided only when DEC21_DEV_REMOVAL is defined.
     */
    static KnError
eventHandler (void* cookie, BusEvent event, void* arg)
{
    Dec21Data* dec21 = (Dec21Data*)cookie;
    KnError    res   = K_OK;

    switch (event) {

[...]

    case PCI_DEV_REMOVAL:
            /*
             * The device removal is processed from either the
             * normal mode or shut down mode. In other words,
             * this event is ignored if the driver already operates in the
             * device removal mode.
             *
             * Here, we flag that the device is entered into removal
             * mode (dec21->evtState). We ask the device registry to
             * notify clients about the device removal event.
             * The real shut down procedure will be done by the relHandler()
             * handler. This handler is called by device registry when the
             * reference to the driver instance goes away (i.e. when
             * svDeviceRelease() is called by client).
             */
        if (dec21->evtState != PCI_DEV_REMOVAL) {
            dec21->evtState = PCI_DEV_REMOVAL;
            dec21->failed   = TRUE;
            DKI_MSG(("%s: entered into removal mode\n", dec21->path));
            svDeviceEvent(dec21->devRegId, DEV_EVENT_REMOVAL, NULL);
        }
        break;

[...]
}

Logging

Driver errors and device failures will be logged using the Driver Framework logging mechanism. The DKI_ERR(), DKI_WARN() and DKI_MSG() macros should be used for that purpose.

Code Example 13-2, from the dec21x4x ethernet driver, illustrates how detected errors are turned into event notifications, and how they are logged. In the example, the Dec21Data structure contains driver instance specific data. In this structure, the failed field is the instance-wide flag indicating a device failure. The evtState field is used to record the current state of the driver instance, regarding propagated events.

Example 13-2 dec21x4x error handling

    /*
     * Enter failed mode (called on device failure).
     * On device failure, the REMOVAL event is propagated to clients, 
     * as the device should not be accessed any more.
     * However, internal state (dec21->evtState) is set to 
     * PCI_DEV_SHUTDOWN, as we may need to further access the device 
     * (to reset it for example). The dec21->failed flag is also set 
     * for next client down calls to return a K_EFAIL result.
     */
    static KnError
fail (Dec21Data* dec21)
{
    dec21->failed = TRUE;
    if (!dec21->evtState) {
        dec21->evtState = PCI_DEV_SHUTDOWN;
        DKI_MSG(("%s: entered into failed mode\n", dec21->path));    
        svDeviceEvent(dec21->devRegId, DEV_EVENT_REMOVAL, NULL);
    }
    return K_OK;
}

    /*
     * The I/O error handler is called by the parent bus driver
     * if a bus error occurs when accessing the device registers.
     * This is considered a device failure.
     */
    static void
ioErrHandler (void* cookie, PciBusError* error)
{
    DKI_ERR(("%s: error -- I/O (%d) at 0x%08x\n",
             ((Dec21Data*)cookie)->path, error->code, error->offset));
    fail(cookie);
}

    /*
     * DMA error handler.
     * This is considered a device failure.
     */
    static void
dmaErrHandler (void* cookie, PciBusError* error)
{
    DKI_ERR(("%s: error -- DMA (%d) at 0x%08x\n",
             ((Dec21Data*)cookie)->path, error->code, error->offset));
    fail(cookie);
}

Refer to Part III for details about the DKI_ERR(), DKI_WARN() and DKI_MSG() macros.

Notification

As soon as a device failure is detected, a hardened driver will notify its clients that the device has failed. It will turn the arbitrary and uncontrolled error into a notification event. A nexus driver will notify its child drivers through the bus event handler invocation. A leaf device driver will then notify its clients through the device event mechanism provided by the device registry (svDeviceEvent()).

Note -

If a nexus driver also exports an extra (orthogonal) interface through the device registry, both mechanisms will be used.

The dec21x4x event handler, shown in code Example 13-3, illustrates how a hardened device driver should notify its clients. This handler is called from the bus driver. The fail() routine is called either from the event handler, or directly from the driver on failure detection.

Example 13-3 dec21x4x event handler

    /*
     * PCI bus event handlers.
     * The event handler is invoked by the parent bus driver when a bus
     * event occurs in the bus.
     *
     * The DEC21 driver always supports the PCI_SYS_SHUTDOWN and
     * PCI_DEV_SHUTDOWN events. The PCI_DEV_REMOVAL support is optional and
     * is provided only when DEC21_DEV_REMOVAL is defined.
     */
    static KnError
eventHandler (void* cookie, BusEvent event, void* arg)
{
    Dec21Data* dec21 = (Dec21Data*)cookie;
    KnError    res   = K_OK;

    switch (event) {

[...]

    case PCI_SYS_ERROR: {
        uint32_f csr5;
            /*
             * PCI_SYS_ERROR is considered a fatal bus error.
             * We first check if this event was caused by our device on the
             * PCI bus.
             * If positive, all bus accesses from the device are disabled,
             * thus, we put the device into failed mode.
             */
        if (arg) {
            csr5 = *((uint32_f*)arg);
        } else {
            csr5 = dec21->pciIoOps->load_32(dec21->pciIoId, CSR5);
        }
        if (csr5 & CSR5_FBE) {
            DKI_ERR(("%s: error -- Fatal bus error (csr5=0x%08x)\n",
                     dec21->path, csr5));
            fail(dec21);
        }
        break;
    }

    case PCI_INTR_DEFECTIVE: {
            /*
             * Our interrupt line is defective (stuck interrupt).
             * We put the device into failed mode.
             */
        DKI_ERR(("%s: error -- interrupt line is defective\n", 
                dec21->path));
        res = fail(dec21);
        break;
    }

    default:
        /*
         * Palette events are ignored
         */
        res = K_ENOTIMP;
    }

    return res;
}

Refer to Part III for details about bus and system event notification.

Bus Exceptions

A hardened bus driver (usually the host bridge driver) will handle hardware bus exceptions to identify the bus address at fault. By invoking the associated driver's error handler, the hardened bus driver will propagate the appropriate error message upstream.

A hardened driver will not panic when its error handlers are invoked. This is considered as a device fault condition by a leaf driver, and is propagated to child drivers by a nexus driver.

In the ChorusOS Driver Framework a bus exception is reported by the nexus driver to a child driver through the error handler invocation. The error handler is specific to a mapped region: a driver specifies an error handler when calling its parent nexus driver to map a memory, I/O or DMA region. When an exception occurs, the bus driver analyzes the faulty bus address, detects the fault region and invokes the associated error handler.

Code Example 13-4 shows the Raven bus error interrupt handler. This handler manages errors at bus level and propagates each error to the error (or event) handler of the device driver concerned.

Example 13-4 Raven bus error handler

        /*
         * PowerPC "machine check" interrupt handler
         * It is called from DKI after the context has been saved on the stack.
         * This handler uses the RAVEN internal register to analyze and
         * dispatch bus errors to device error handlers.
         */
    static CpuIntrStatus
errHandler (RavenData* raven)
{
            PciBusError   error;
            PciMap*       pciMap;
            uint32_f      merad;
            uint16_f      merat;
   volatile uint8_f       merstReg;
            uint8_f       merst;
            uint8_f       overflow;
            KnIntrCtx*    intrCtx;
            CpuIntrStatus status = CPU_INTR_UNCLAIMED;
       /*
        * Read Raven MPC_MERST register ... and clear error bits that will
        * be handled
        */
   merst = merstReg = READ_REG_8(raven->regs.vaddr, MPC_MERST);
   WRITE_REG_8(raven->regs.vaddr, MPC_MERST, merstReg);

   overflow = merst & MPC_MERST_OVF;
   svIntrCtxGet(&intrCtx);
       /*
        * DATA PARITY ERROR and PCI SYSTEM ERROR are propagated as a
        * PCI_SYS_ERROR event, because the RAVEN does not latch any address
        * in these cases.
        */
   if (merst & (MPC_MERST_PERR | MPC_MERST_SERR)) {
       PciDev* pciDev;

       DKI_ERR(("%s: error -- (MC) pc=0x%08x lr=0x%08x sp=0x%08x\n",
                raven->path, intrCtx->pc, intrCtx->lr, intrCtx->r1));
       if (merst & MPC_MERST_PERR) {
           DKI_ERR(("%s: error -- PARITY ERROR detected\n", raven->path));
       }
       if (merst & MPC_MERST_SERR) {
           DKI_ERR(("%s: error -- SYSTEM ERROR detected\n", raven->path));
       }
       pciDev = raven->dev;
       while (pciDev) {
           if (pciDev->evtHandler) {
               pciDev->evtHandler(pciDev->cookie, PCI_SYS_ERROR, NULL);
           }
           pciDev = pciDev->next;
       }
       return CPU_INTR_CLAIMED;
   }

   if (overflow) {
       DKI_WARN(("%s: warning -- error overflow (0x%02x)\n",
                 raven->path, merst));
   }
       /*
        * Read latched fault address and cycle attributes
        */
   merad = READ_REG_32(raven->regs.vaddr, MPC_MERAD);
   merat = READ_REG_16(raven->regs.vaddr, MPC_MERAT);
       /*
        * PowerPC bus error: MERAT/MERAD contains PowerPC cycles attributes
        */
   if (merst & MPC_MERST_MATO) {
       error.code = PCI_ERR_TARGET_ABORT;
           /*
            * Search an existing map to which the latched address belongs
            * and call the associated error handler.
            * DMA maps are in memMap list.
            */
       pciMap = raven->memMap;
       while (pciMap) {
           if ((pciMap->memChunk.paddr <= merad) && 
               (merad < pciMap->memChunk.paddr + pciMap->memChunk.psize)) {
               error.offset = merad - pciMap->memChunk.paddr;
               pciMap->errHandler(pciMap->errCookie, &error);
               status = CPU_INTR_CLAIMED;
               break;
           }
           pciMap = pciMap->next;
       }
           /*
            * Save fault address in dar to raise a kernel exception
            */
       intrCtx->dar = merad;

       if (status == CPU_INTR_UNCLAIMED) {
         DKI_ERR(("%s: error -- (MC) pc=0x%08x lr=0x%08x sp=0x%08x\n",
                  raven->path, intrCtx->pc, intrCtx->lr, intrCtx->r1));
         DKI_ERR(("%s: error -- PowerPC timed-out 0x%08x (merat=0x%04x)\n",
                  raven->path, merad, merat));
         DKI_ERR(("%s: error -- from %s %s TT=0x%02x TSIZ=0x%01x\n",
                raven->path,
                (MPC_MERAT_MID(merat) == MPC_MID_RAVEN) ? "raven" : "cpu",
                (merat & MPC_MERAT_TBST) ? "burst" : "",
                merat & MPC_MERAT_TT,
                MPC_MERAT_TSIZ(merat)));
       }
       return status;
   }
       /*
        * PCI bus errors: MERAT/MERAD contains PCI cycle attributes
        */
   if (merst & MPC_MERST_RTA) {
       error.code = PCI_ERR_TARGET_ABORT;
   }else if (merst & MPC_MERST_SMA) {
       error.code = PCI_ERR_MASTER_ABORT;
   }
   switch (merat & MPC_MERAT_COMM) {
   case MPC_MERAT_IACK:
       DKI_ERR(("%s: error -- PCI IACK cycle\n", raven->path));
       break;
   case MPC_MERAT_CFG_READ:
   case MPC_MERAT_CFG_WRITE:
           /*
            * An error occurred while accessing PCI configuration space
            */
       if ((merad == (raven->confAddr & ~0x3)) ||
           (merad == CONFIG_ADDR_TO_ADDR(raven->confAddr))) {
               /*
                * The latched error address matches the one currently 
                * accessed through a conf_load_xx operation.
                * Reset the accessed (confAddr) address to indicate
                * the operation failed.
                */
           raven->confAddr = 0;
           status = CPU_INTR_CLAIMED;
       } else {
               /*
                * The error is not due to our conf_xxx() operations !
                */
           DKI_ERR(("%s: error -- PCI Configuration cycle\n", raven->path));
       }
       break;
   case MPC_MERAT_IO_READ:
   case MPC_MERAT_IO_WRITE:
           /*
            * An error occurred while accessing PCI I/O space
            * Search an existing map to which the latched address belongs
            * and call the associated error handler
            */
       pciMap = raven->ioMap;
       while (pciMap) {
           if ((pciMap->first <= merad) && (merad <= pciMap->last)) {
               error.offset = merad - pciMap->first;
               pciMap->errHandler(pciMap->errCookie, &error);
               status = CPU_INTR_CLAIMED;
               break;
           }
           pciMap = pciMap->next;
       }
       break;
   case MPC_MERAT_MEM_READ:
   case MPC_MERAT_MEM_WRITE:
   case MPC_MERAT_MEM_READ_MULTI:
   case MPC_MERAT_MEM_READ_LINE:
   case MPC_MERAT_MEM_WRITE_INVAL:
           /*
            * An error occurred while accessing PCI Memory space
            * Search an existing map to which the latched address belongs
            * and call the associated error handler
            */
       pciMap = raven->memMap;
       while (pciMap) {
           if ((pciMap->first <= merad) && (merad <= pciMap->last)) {
               error.offset = merad - pciMap->first;
               pciMap->errHandler(pciMap->errCookie, &error);
               status = CPU_INTR_CLAIMED;
               break;
           }
           pciMap = pciMap->next;
       }
   }

   if (status == CPU_INTR_UNCLAIMED) {
     DKI_ERR(("%s: error -- (MC) pc=0x%08x lr=0x%08x sp=0x%08x\n",
               raven->path, intrCtx->pc, intrCtx->lr, intrCtx->r1));
    DKI_ERR(("%s: error -- (%d) PCI at 0x%08x merat=0x%04x merst=0x%02x\n",
               raven->path, error.code, merad, merat, merst));
     DKI_ERR(("%s: error -- from %s %s BYTE_EN=0x%02x\n",
             raven->path,
             (MPC_MERAT_MID(merat) == MPC_MID_RAVEN) ? "raven" : "cpu",
             (merat & MPC_MERAT_TBST) ? "write-posted" : "",
             merat & MPC_MERAT_BYTE_EN));
   }

   return status;
}

Refer to the ChorusOS man pages section 9DDI: Device Driver Interfaces for details about bus error handling interfaces.

Corrupt Data Detection

A hardened driver assumes that any data which it reads from the device may be corrupt. The data should be sanity checked, before use, if undesirable consequences are anticipated from its use or propagation.

In the dec21x4x ethernet driver, the DMA buffers are not checked for corruption because this is already done by the client's protocol stack (TCP/IP). However, code Example 13-5 illustrates how to avoid an infinite loop when reading a register, by adding a break condition to the loop.

Example 13-5 loop on register value

/*
  * Reset the PHY device.
  */
  static void
phy_reset (Dec21Data* dec21)
{
  unsigned int count = 10000;

  mii_write_reg(dec21, MII_CTRL_REG, MII_CTRL_RESET);
  do {
      msecBusyWait(1);
  } while ((mii_read_reg(dec21, MII_CTRL_REG) & MII_CTRL_RESET) && count--);
  msecBusyWait(1);
}

Device Management and Control Data

Hardened drivers must act with extreme caution when using pointers, array indexes or memory offsets which are read or calculated from data retrieved from the device. These values should not be used until they are checked to ensure that they are within an expected range and have legal alignment. These types of pointer mechanisms can become misleading or malignant if the device has developed a fault.
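The range and alignment checks described above can be sketched as follows. This is a minimal illustration, not code from the dec21x4x driver: the constants RX_RING_SIZE, RX_BUF_SIZE and RX_ALIGN and both function names are hypothetical, and the values are assumed to have been read from a device descriptor.

```c
#include <stdint.h>

#define RX_RING_SIZE  16u     /* hypothetical number of ring descriptors */
#define RX_BUF_SIZE   2048u   /* hypothetical size of the buffer area    */
#define RX_ALIGN      4u      /* hypothetical required alignment         */

/*
 * Return 1 if a descriptor index read from the device can safely be
 * used to index the ring, 0 otherwise.
 */
static int
ring_index_is_valid (uint32_t index)
{
    return index < RX_RING_SIZE;
}

/*
 * Return 1 if [offset, offset + length) lies inside the buffer area
 * and offset is correctly aligned, 0 otherwise.
 * The comparison is arranged so that offset + length is never
 * computed, which avoids unsigned overflow on hostile values.
 */
static int
buf_range_is_valid (uint32_t offset, uint32_t length)
{
    if (offset % RX_ALIGN != 0) {
        return 0;
    }
    if (length == 0 || length > RX_BUF_SIZE) {
        return 0;
    }
    return offset <= RX_BUF_SIZE - length;
}
```

A driver following this approach would treat a failed check as a device failure (for example by calling a routine such as fail() in Example 13-2) rather than using the value.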

A hardened driver will never loop simply upon a register value. An infinite loop may occur if a device breaks and returns stuck data. The hardened driver must have a method to break this type of loop.
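One way to bound such a loop, and to report the time-out to the caller, is sketched below. The RegRead accessor type and the poll_bit_clear() helper are hypothetical names introduced for illustration; the two simulated registers exist only so that the sketch is self-contained. A real driver would also wait between reads, as Example 13-5 does.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*RegRead)(void* cookie);  /* hypothetical accessor */

/*
 * Poll until (register & mask) == 0, giving up after maxPolls reads.
 * Returns 0 on success, -1 if the bits never cleared, in which case
 * the caller should consider the device failed.
 */
static int
poll_bit_clear (RegRead readReg, void* cookie, uint32_t mask,
                unsigned int maxPolls)
{
    while (maxPolls-- > 0) {
        if ((readReg(cookie) & mask) == 0) {
            return 0;
        }
    }
    return -1;
}

/* Simulated broken device: every bit is stuck at 1. */
static uint32_t
stuck_reg (void* cookie)
{
    (void)cookie;
    return 0xFFFFFFFFu;
}

/* Simulated healthy device: busy bit clears after *cookie reads. */
static uint32_t
healthy_reg (void* cookie)
{
    unsigned int* remaining = (unsigned int*)cookie;

    if (*remaining > 0) {
        (*remaining)--;
        return 0x8000u;             /* operation still in progress */
    }
    return 0;                       /* operation complete */
}
```

Returning a status, rather than silently leaving the loop, lets the caller put the device into failed mode when the register never reaches the expected value.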

Driver state information should be maintained in main memory, not on an I/O card.

Received Data

Device errors can result in corrupt data being placed in receive buffers. This corruption is indistinguishable from corruption occurring beyond the domain of the device, for example within a network. Typically, existing software will already be in place to handle such corruption through, for example, integrity checks at the transport layer of a protocol stack or within the application using the device.

If the received data is not going to be subjected to an integrity check at a higher layer, as in the case of a disk driver, it can be integrity-checked within the driver itself. However, such low-level integrity checking can cause the greatest degradation to system performance. If such checks are not performed at device level, the worst results are application failure or file corruption; a total system crash is unlikely.
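As an illustration of a driver-level integrity check, the sketch below computes a standard CRC-32 over a received block and compares it with an expected value. The function names are hypothetical, and the sketch assumes that an expected checksum is available to the driver (for example, stored alongside the block), which this guide does not specify. The bitwise form is deliberately simple; its per-byte cost is an example of the performance penalty mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32 using the reflected polynomial 0xEDB88320
 * (the CRC used by Ethernet and zlib). Table-free but slow.
 */
static uint32_t
crc32_calc (const uint8_t* data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t   i;
    int      bit;

    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}

/*
 * Return 1 if the block matches its expected checksum, 0 otherwise.
 */
static int
block_is_intact (const uint8_t* data, size_t len, uint32_t expected)
{
    return crc32_calc(data, len) == expected;
}
```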

DMA

A defective device may be able to falsely initiate a DMA transfer over the bus. This type of data transfer may corrupt the system memory.

Some host bus bridges provide an IOMMU which allows you to map a DMA region (within the bus address space) to the system memory. On such hardware, the bus driver is able to protect the system memory (which is not used for DMA buffers) from corruption caused by a falsely initiated DMA transfer. The bus driver should not use a static one-to-one mapping (from the system memory to the bus space) to handle DMA transfers. Instead, it should manage IOMMU mappings dynamically. The dma_alloc() method maps a memory region to the bus space, enabling DMA transfers. The dma_free() method invalidates any mapping, disabling DMA to the memory region.

Note -

A defective device may still corrupt a DMA buffer managed by another device driver.
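The effect of dynamic mapping can be modelled with a small software table, as in the sketch below. This is a toy simulation only: the ToyIommu type and the toy_* functions are hypothetical and do not reproduce the signatures of the dma_alloc() and dma_free() bus methods. It shows that a bus address is accepted only while a mapping for it is valid.

```c
#include <stdint.h>

#define IOMMU_SLOTS 8u              /* hypothetical table size */

typedef struct ToyMap {
    uint32_t busAddr;               /* start of the bus window     */
    uint32_t size;                  /* window size in bytes        */
    int      valid;                 /* 1 while DMA is permitted    */
} ToyMap;

typedef struct ToyIommu {
    ToyMap slot[IOMMU_SLOTS];
} ToyIommu;

/* Start with no mapping at all: every DMA access is rejected. */
static void
toy_iommu_init (ToyIommu* mmu)
{
    unsigned int i;

    for (i = 0; i < IOMMU_SLOTS; i++) {
        mmu->slot[i].valid = 0;
    }
}

/* Map a window; returns its slot number, or -1 if the table is full. */
static int
toy_dma_map (ToyIommu* mmu, uint32_t busAddr, uint32_t size)
{
    unsigned int i;

    for (i = 0; i < IOMMU_SLOTS; i++) {
        if (!mmu->slot[i].valid) {
            mmu->slot[i].busAddr = busAddr;
            mmu->slot[i].size    = size;
            mmu->slot[i].valid   = 1;
            return (int)i;
        }
    }
    return -1;
}

/* Invalidate a window; later DMA accesses to it are rejected. */
static void
toy_dma_unmap (ToyIommu* mmu, int index)
{
    if (index >= 0 && (unsigned int)index < IOMMU_SLOTS) {
        mmu->slot[index].valid = 0;
    }
}

/* Return 1 if a device access at busAddr hits a valid window. */
static int
toy_dma_allowed (const ToyIommu* mmu, uint32_t busAddr)
{
    unsigned int i;

    for (i = 0; i < IOMMU_SLOTS; i++) {
        const ToyMap* m = &mmu->slot[i];

        if (m->valid && busAddr >= m->busAddr &&
            busAddr - m->busAddr < m->size) {
            return 1;
        }
    }
    return 0;
}
```

In this model, as in the note above, the check does not record which device owns a window, so an access to a buffer mapped on behalf of another driver is still accepted.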

Stuck Interrupts

A persistently asserted interrupt will severely affect system performance, almost certainly stalling a single processor board. An interrupt handler needs to be able to identify whether it has been called as a result of a hoax interrupt.

A hardened driver's interrupt handler will return a BUS_INTR_UNCLAIMED result unless it detects that the device legitimately asserted the interrupt. Conceptually, an interrupt is legitimate if the device actually requires the driver to do some useful work.

A hardened bus driver is able to detect whether an interrupt line is defective. It disables the defective interrupt line (through the bus controller) and notifies any attached child drivers by calling their event handler, specifying a BUS_INTR_DEFECTIVE event, and passing the child driver the interrupt identifier as an argument.

To detect a defective interrupt line, a bus driver should maintain a count of unclaimed interrupts for each interrupt line. The bus driver may count unclaimed interrupts occurring between two claimed interrupts, resetting the total when an interrupt is claimed. Alternatively, it may count the unclaimed interrupts occurring during a given, configurable period of time, resetting the counter on a time-out invocation. In both cases, if the counter reaches a predetermined, configurable watermark, the bus driver should consider the interrupt line defective. Note that, in such a model, all devices sharing the same interrupt line will fail if stuck interrupts are detected on that line.
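The second strategy (counting unclaimed interrupts within a time window) might be sketched as follows. The IntrLineStat structure and both function names are hypothetical; the sketch assumes that some periodic time-out mechanism calls intr_window_expired() and that the caller masks the line and notifies child drivers when intr_account() returns 1.

```c
#include <stdint.h>

typedef struct IntrLineStat {
    uint32_t unclaimed;     /* unclaimed interrupts in current window */
    uint32_t watermark;     /* configurable threshold                 */
    int      defective;     /* set once the line is declared bad      */
} IntrLineStat;

/*
 * Account for one interrupt on the line. 'claimed' is non-zero if at
 * least one child handler claimed it. Claimed interrupts do not reset
 * the counter in this variant; only the time-out does.
 * Returns 1 if the line has just been declared defective, else 0.
 */
static int
intr_account (IntrLineStat* st, int claimed)
{
    if (st->defective || claimed) {
        return 0;
    }
    if (++st->unclaimed >= st->watermark) {
        st->defective = 1;
        return 1;
    }
    return 0;
}

/*
 * Called from the periodic time-out: start a new counting window.
 * A line already declared defective stays defective.
 */
static void
intr_window_expired (IntrLineStat* st)
{
    if (!st->defective) {
        st->unclaimed = 0;
    }
}
```

Compared with counting consecutive unclaimed interrupts, this variant also catches a line where stuck interrupts are interleaved with occasional legitimate ones, at the cost of needing a timer.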

Code Example 13-6 illustrates how stuck interrupts may be detected by both the bus and device driver interrupt handlers. The Raven handler counts consecutive unclaimed interrupts, and raises a PCI_INTR_DEFECTIVE event when this count reaches a configured value. This handler also forbids enabling defective interrupt lines.

Example 13-6 Raven interrupt handler

#define IS_INTR_DEFECTIVE(raven, l)  (raven->unclaimed[(l)] == (uint32_f)-1)
#define SET_INTR_DEFECTIVE(raven, l) (raven->unclaimed[(l)]  = (uint32_f)-1)

    static void
unmask (PciIntrId intrId)
{
    RavenData* raven = ((PciIntr*)intrId)->devId->pciId;

        /*
         * Check if interrupt line is defective
         */
    if (IS_INTR_DEFECTIVE(raven, ((PciIntr*)intrId)->intrLine)) {
        return;
    }
        /*
         * Mask all PCI interrupts while working on MPIC registers
         */
    raven->intrOps->mask(raven->intrId);
    OPIC_INTR_UNMASK(raven->mpicIoOps,
                     raven->mpicIoId,
                     ((PciIntr*)intrId)->intrLine);
    raven->intrOps->unmask(raven->intrId);
}

    /*
     * Declare an interrupt line as defective
     */
    static void
intrDefective(RavenData* raven, uint32_f line)
{
   PciIntr* intr;
   PciDev*  dev;
        /*
         * Mask defective interrupt line at interrupt controller level
         */
    raven->intrOps->mask(raven->intrId);
    OPIC_INTR_MASK(raven->mpicIoOps, raven->mpicIoId, line);
    SET_INTR_DEFECTIVE(raven, line);
    raven->intrOps->unmask(raven->intrId);
        /*
         * Raise an event to all devices attached to this interrupt.
         * Interrupt identifier is passed as a specific argument.
         */
   for (intr = raven->intr[line] ; intr ; intr = intr->next) {
       dev = intr->devId;
       if (dev->evtHandler) {
          dev->evtHandler(dev->cookie, PCI_INTR_DEFECTIVE, (PciIntrId)intr);
       }
   }
}
        /*
         * PowerPC external interrupts handler.
         * It is called from DKI after the context has been saved on the stack.
         * This handler manages the RAVEN internal MPIC which is OpenPIC
         * compliant.
         */
    static CpuIntrStatus
intrHandler (RavenData* raven)
{
   uint32_f      vector;
   PciIntrStatus intrStatus = PCI_INTR_UNCLAIMED;
   uint32_f      cpu        = mfspr_PIR ();  /* processor id register */
   PciIoOps*     mpicIoOps  = raven->mpicIoOps;
   PciIoId       mpicIoId   = raven->mpicIoId;
   CpuIntrOps*   intrOps    = raven->intrOps;
   CpuIntrId     intrId     = raven->intrId;
   PciIntr*      pciIntr;
   int           claimed    = 0;
       /*
        * Get vector to identify the interrupt source
        */
   vector = OPIC_INTR_ACKNOWLEDGE(mpicIoOps, mpicIoId, cpu);

       /*
        * Ignore spurious interrupt requests
        */
   if (vector == MPIC_SPURIOUS_INTR_VECTOR) {
      raven->spurious++;
      return CPU_INTR_CLAIMED;
   }

       /*
        * Enable external interrupts on CPU
        */
   intrOps->unmask(intrId);
       /*
        * Call device handlers attached to this interrupt vector
        */
   for (pciIntr = raven->intr[vector] ;
        pciIntr ;
        pciIntr = pciIntr->next) {
       intrStatus = pciIntr->intrHandler(pciIntr->intrCookie);
       if (intrStatus != PCI_INTR_UNCLAIMED) {
           claimed++;
       }
   }
       /*
        * Disable external interrupts on CPU
        */
   intrOps->mask(intrId);

   if (intrStatus == PCI_INTR_ACKNOWLEDGED) {
           /*
            * Interrupt handler has already done:
            * - enable()
            * - ....
            * - disable()
            * So we just:
            * - reset task priority to re-enable lower priority interrupts
            * - unmask current interrupt (masked by disable()).
            */
       OPIC_CURRENT_TASK_SET_PRIORITY(mpicIoOps, mpicIoId, cpu,
                                      OPIC_PRIORITY_MIN);
       OPIC_INTR_UNMASK(mpicIoOps, mpicIoId,  vector);
   } else {
           /*
            * Interrupt was just serviced by the handler.
            * Send a non-specific EOI command to open PIC
            */
       OPIC_INTR_EOI(mpicIoOps, mpicIoId, cpu);
   }

   if (claimed == 0) {
           /*
            * Increment unclaimed counter and check against max.
            */
       if (++(raven->unclaimed[vector]) > raven->maxUnclaimed) {
           intrDefective(raven, vector);
       }
   } else {
       raven->unclaimed[vector] = 0; /* Reset unclaimed counter */
   }

   return CPU_INTR_CLAIMED;
}

The dec21x4x interrupt handler, shown in code Example 13-7, checks for unexpected interrupts by masking them from the interrupt status register that is read. If an unexpected interrupt is received, it is considered unclaimed.

Example 13-7 dec21x4x interrupt handler

    /* 
     * The interrupt handler 
     */
    static PciIntrStatus
intrHandler (void* cookie)
{
    Dec21Data* dec21 = (Dec21Data*)cookie;
    uint32_f   csr5;
        /*
         * Get current status and acknowledge all interrupt sources ASAP.
         */
    csr5 = dec21->pciIoOps->load_32(dec21->pciIoId, CSR5);
    dec21->pciIoOps->store_32(dec21->pciIoId, CSR5, csr5);

#ifdef DEBUG_DEC21
    sysLog("%s: intrHandler csr5=0x%08x\n", dec21->path, csr5);
#endif
        /*
         * Check if an unmasked interrupt is pending
         */
    csr5 &= dec21->csr7;
    if (csr5 == 0) {
        return PCI_INTR_UNCLAIMED;
    }

        /*
         * Process Rx interrupt
         */
    if (csr5 & CSR5_RI) {
        CSR7_INTR_MASK(dec21, CSR7_RIE);
        dec21->clientOps->receiptNotify(dec21->clientCookie);
    }
        /*
         * Process Tx interrupt
         */
    if (csr5 & CSR5_TI) {
        CSR7_INTR_MASK(dec21, CSR7_TIE);        
        dec21->clientOps->transmitNotify(dec21->clientCookie);
    }
        /*
         * Process errors, if Abnormal error summary bit is set.
         */
    if (csr5 & CSR5_AIS) {
        intrErr(dec21, csr5);
    }

    return PCI_INTR_CLAIMED;
}

Refer to the ChorusOS man pages section 9DDI: Device Driver Interfaces for details about bus interrupt handling interfaces.

Periodic Health Checks

A latent fault will not show itself until some other action occurs. For example, a hardware failure occurring in a PCI card that is a cold standby could remain undetected until a fault occurs on the master PCI card. Only at that point will it be discovered that the system now contains defective PCI cards. It is essential to identify a failed secondary device so that it can be repaired or replaced before any failure in the primary device occurs. As a general rule, latent faults that are allowed to remain undetected will eventually cause system failure.

A hardened driver must perform periodic health checks on all the devices that it manages. Although this does not directly protect the system from the device, it does allow timely detection of failure during quiet periods. A device may be quiet because it has failed.

Periodic health checks can:

Run a quick access check on the board.
Check a register or memory location on the device whose value the driver expects to have deterministically altered since the last poll.
Time-stamp outgoing requests in order to detect any over-age requests which have not completed.
Initiate an action on the device which should be completed before the next scheduled check.

Note -

These kinds of health checks are intended to be triggered and controlled through the Management DDI. A driver should not start periodic health checks itself, but should instead rely on a driver component manager client to trigger the checks at a rate appropriate to the device and the service it provides. Refer to the mngt(1CC) man page for details about the Management DDI.
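A single pass combining two of the checks listed above (a device value expected to change between polls, and time-stamped requests) might look like the following sketch. The HealthState structure, the MAX_PENDING depth and the function names are hypothetical, and the sketch only shows the pass itself; how and when it is triggered is left to the management client.

```c
#include <stdint.h>

#define MAX_PENDING 4u              /* hypothetical request queue depth */

typedef struct HealthState {
    uint32_t lastCounter;           /* device counter at previous pass  */
    uint32_t issuedAt[MAX_PENDING]; /* issue time of each request       */
    int      pending[MAX_PENDING];  /* 1 while a request is outstanding */
    uint32_t maxAge;                /* allowed completion time          */
} HealthState;

static void
health_init (HealthState* hs, uint32_t counter, uint32_t maxAge)
{
    unsigned int i;

    hs->lastCounter = counter;
    hs->maxAge      = maxAge;
    for (i = 0; i < MAX_PENDING; i++) {
        hs->pending[i]  = 0;
        hs->issuedAt[i] = 0;
    }
}

/*
 * One health-check pass. 'counter' is a device value expected to have
 * changed since the previous pass; 'now' is the current time in the
 * same units as maxAge. Returns 1 if the device looks healthy, else 0.
 */
static int
health_check (HealthState* hs, uint32_t counter, uint32_t now)
{
    unsigned int i;
    int          ok = 1;

    if (counter == hs->lastCounter) {
        ok = 0;                     /* value did not advance */
    }
    hs->lastCounter = counter;

    for (i = 0; i < MAX_PENDING; i++) {
        /* Unsigned subtraction stays correct across timer wrap. */
        if (hs->pending[i] && (now - hs->issuedAt[i]) > hs->maxAge) {
            ok = 0;                 /* over-age request */
        }
    }
    return ok;
}
```

A driver using this kind of pass would put the device into failed mode, and notify its clients, when the result is 0.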
