Writing Device Drivers

Example: kadb on a Deadlocked Thread

The next problem is that the system does not panic, but the mkfs(1M) command hangs and cannot be aborted. Though a core dump can be forced--by sending a break and then using sync from the OBP or using `g 0' from SunMon--in this case kadb(1M) will be used. After logging in remotely and using ps (which indicated that only the mkfs(1M) process was hung, not the entire system) the system is shut down and booted using kadb(1M) .

ok boot kadb -d
Boot device: /sbus/esp@0,800000/sd@3,0   File and args: kadb -d
kadb:kernel/unix
Size: 673348+182896+46008 bytes
/platform/SUNW,Sun_4_75/kernel/unix ...
kadb[0]::cSunOS Release 5.7 Version Generic [UNIX(R) System V Release 4.0]
Copyright (c) 1983-1998, Sun Microsystems, Inc.
...

After the rest of the kernel has loaded, moddebug is patched to see if loading is the problem. Because it got as far as rd_write() before, loading is probably not the problem, but it will be checked anyway.

# ~stopped at 0xfbd01028: ta 0x7d
kadb[0]: moddebug/Xmoddebug:
moddebug: 0
kadb[0]: moddebug/W 0x80000000moddebug: 0x0 = 0x80000000
kadb[0]: :c

modload(1M) is used to load the driver, to separate module loading from the real access:

# modload /home/driver/drv/ramdisk

It loads without errors, so loading is not the problem. The condition is recreated with mkfs(1M).

# mkfs -F ufs -o nsect=8,ntrack=8,free=5 /devices/pseudo/ramdisk@0:c,raw 1024ramdisk0: misusing 524288 bytes of memory

mkfs(1M) hangs. At this point, kadb(1M) is entered and the stack examined:

~stopped        at      Syslimit+0x1028:                ta      0x7d
kadb[0]: $c
Syslimit() + 1028
debug_enter(0x0,0x740000,0xc4,0x0,0x0,0x40000e7) + c0
zsa_xsint(0xf5dd9000,0x80,0xf5dd906c,0x44,0xff0113,0x0) + 244
zs_high_intr(0xf5dd9000) + 204
_level1(0x1) + 13bc
idle(0xf02a0fac,0x0,0xf02a8078,0xf02a8078,0xf004684c,0x4400ae1) + 50

In the previous example, the presence of idle on the current thread stack indicates that this thread is not the cause of the deadlock. To determine the deadlocked thread, the entire thread list is checked:

kadb[0]: $<threadlist

                ============== thread_id        f0244020
p0:
p0:             process args=   sched
t0:
t0:             lwp             proc            wchan
                f02a3e78        f02a8078        0
t0+0x34:        sp              pc
                f0243af0        sched+0x4e0
?(?) + 0
main(0x0,0xf02b7e20,0xf02a800c,0x0,0x6e,0x0)
afsrbuf(?) + 1b8
exitto(0xf0040000,0xf02a0f58,0x3c,0xf02d3c40,0xfbd7d668,0x0)

                ============== thread_id        fbe01ea0
p0:
p0:             process args=   sched
0xfbe01ea0:     lwp             proc            wchan
                0               f02a8078        0
0xfbe01ed4:     sp              pc
                fbe1fe68        debug_enter+0xb0
?(?) + 0
abort_sequence_enter(0x0,0x740000,0xc4,0x0,0x0,0x40000e7)
zsa_xsint(0xf5dd9000,0x80,0xf5dd906c,0x44,0xff0113,0x0) + 244
zs_high_intr(0xf5dd9000) + 204
_level1(0xf02d40ec) + 13bc
idle(0xf02a0fac,0x0,0xf02a8078,0xf02a8078,0xf004684c,0x4400ae1) + 50

...
                ============== thread_id        f6350a80
0xf6285518:     process args=   mkfs -o nsect=8,ntrack=8,free=5 /devices/pseudo/
                ramdisk@0:c,raw 1024
0xf6350a80:     lwp             proc            wchan
                f634baa0        f6285518        f5ef6fd0
0xf6350ab4:     sp              pc
                fbedc9e0        biowait+0xe4
?(?) + 0
biowait(0xf5ef6f68,0xf5ef6f68,0xf604ee00,0xf591b56c,0xf6306430,0xf591b550)
physio(0xf5ef6f68,0xf02d596c,0x200,0x100,0xf591b550,0xfbedcb50) + 364
write(0x200) + 250

Of all the threads, only one has a stack trace that references the ramdisk driver. It happens to be the last one. It seems that the process running mkfs(1M) is blocked in biowait(9F). After a call to physio(9F), biowait(9F) takes a buf(9S) structure as a parameter. The next step is to examine the buf(9S) structure:

kadb[0]:  f5ef6f68$<buf
0xf5ef6f68:     flags
                4129
0xf5ef6f6c:     forw            back            av_forw         av_back
                0               0               0               0
0xf5ef6f80:     bcount          bufsize         error           edev
                512             0               0               1180000
0xf5ef6f84:     addr            blkno           resid           proc
                f5f66250        3ff             0               f6285518
0xf5ef6fac:     iodone          vp              pages
                0               0               0

The resid field is 0, which indicates that the transfer is complete. physio(9F) is still blocked, however. The reference for physio(9F) in the Solaris 2.7 Reference Manual AnswerBook points out that biodone(9F) should be called to unblock biowait(9F). This is the problem; rd_strategy() did not call biodone(9F). Adding a call to biodone(9F) before returning fixes this problem.