Writing Device Drivers

Example: kadb on a Deadlocked Thread

This example shows how kadb can be used to track down a driver bug. It is taken from the development of the ramdisk sample driver, which exports physical memory as a virtual disk. In this case, the dd(1M) command hangs while trying to copy data onto the device and cannot be aborted. Although a crash dump could be forced, kadb(1M) is used here for illustrative purposes. After logging in to the system remotely, ps(1) shows that the system is still running and that only the dd(1M) command is hung.

At this point, the system is rebooted with kadb, which can then be entered by typing STOP-A on the system console. After the rest of the kernel has loaded, moddebug is patched to see whether module loading is the problem:

stopped at:
edd000d8:       ta      %icc,%g0 + 125
kadb[0]: moddebug/X
moddebug:
moddebug:       0
kadb[0]: moddebug/W 0x80000000
moddebug:       0x0             =       0x80000000
kadb[0]: :c

modload(1M) is then used to load the driver explicitly, which separates module loading from the actual device access:

# modload /home/driver/drv/ramdisk

It loads without errors, so loading is not the problem. The condition is recreated with dd(1M):

# dd if=/dev/zero of=/devices/pseudo/ramdisk@0:c,raw

dd(1M) hangs. At this point, kadb(1M) is entered and the stack is examined:

stopped at:
edd000d8:       ta      %icc,%g0 + 125
kadb[0]: $c
intr_vector() + 7dcfc0d8
debug_enter(0,0,10431e50,10,1,b0) + 78
zsa_xsint(80,7044a06c,44,7044a000,ff0113,0) + 278
zs_high_intr(7044a000,1,1,1042f78c,10424680,100949d0) + 20c
sbus_intr_wrapper(704dfad4,0,702bd048,7029cec0,630,10260250) + 30
current_thread(4001fe60,1041a550,10424698,10424698,10150f08,0) + 180
idle(1040b6c0,0,0,1041a550,704d6a98,0) + 54
thread_start(0,0,0,0,0,0) + 4

The presence of idle on the current thread stack indicates that this thread is not the cause of the deadlock. To determine the deadlocked thread, the entire thread list is checked:

kadb[0]: $<threadlist
...
                ============== thread_id        70cef120
70c8b1c0:
                process args    dd if=/dev/zero of=/devices/pseudo/ramdisk@0:c,raw

70cef1c8:       lwp             procp           wchan
                70fa9080        70c8aec0        70691fc8
70cef144:
                pc              sp
                sema_p+0x290    40313a78
?(70691fc8,10424680,1,1042b99c,10460f8c,70691fc8)
biowait(70691f60,1041a6c4,70691f60,70c385d0,40313bcc,705c73a0) + 8c
default_physio(1042e8fc,200,129,100,70eb5b54,705c73a0) + 3bc
write(2002,70aac1d0,70f9f9ac,200,4,200) + 23c
...

Of all the threads, only one has a stack trace that references the ramdisk driver. The process running dd(1M) is blocked in biowait(9F), whose first parameter is a pointer to a buf(9S) structure. The next step is to examine this structure:

kadb[0]: 70691f60$<buf
70691f60:       flags           forw            back
                204129          0               0
70691f6c:       av_forw         av_back         bcount
                0               0               512
70691fa0:       bufsize         error           edev
                0               0               1180000
70691f7c:       un.b_addr       _b_blkno        resid
                710e8000        0                0
70691f94:       proc            iodone          vp
                70c8aec0        0               0
70691f98:       pages
                0

The resid field is 0, which indicates that the transfer is complete, yet physio(9F) is still blocked in biowait(9F). The physio(9F) reference page in the Solaris 9 Reference Manual Collection points out that biodone(9F) should be called to unblock biowait(9F). This is the problem: rd_strategy() did not call biodone(9F). Adding a call to biodone(9F) before returning fixes it.
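
The effect of the missing call can be illustrated with a small user-space model. This is only a sketch, not the ramdisk driver's source and not kernel code: every name prefixed with model_ or MODEL_ is hypothetical, the completion flag stands in for the state that biodone(9F) sets and biowait(9F) waits on, and the copy into a static array is an assumed stand-in for the driver's transfer into its memory. The model shows why a buffer can have a resid of 0 and still leave its waiter blocked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's "I/O done" buffer flag. */
#define MODEL_B_DONE 0x1

/* Hypothetical, heavily reduced stand-in for a buf(9S) structure. */
struct model_buf {
    int     flags;   /* MODEL_B_DONE once completion has been signalled */
    size_t  bcount;  /* number of bytes requested */
    size_t  resid;   /* number of bytes not yet transferred */
    char   *addr;    /* caller's data */
    size_t  offset;  /* byte offset into the model ramdisk */
};

/* Assumed backing store for the model ramdisk. */
static char model_ramdisk[4096];

/* Models biodone(9F): marks the buffer as complete. */
static void model_biodone(struct model_buf *bp)
{
    bp->flags |= MODEL_B_DONE;
}

/* Models the condition biowait(9F) sleeps on: nonzero means "would return". */
static int model_biowait_would_return(const struct model_buf *bp)
{
    return (bp->flags & MODEL_B_DONE) != 0;
}

/* Copies the data; shared by both strategy variants below. */
static void model_transfer(struct model_buf *bp)
{
    assert(bp->offset + bp->bcount <= sizeof (model_ramdisk));
    memcpy(model_ramdisk + bp->offset, bp->addr, bp->bcount);
    bp->resid = 0;   /* everything was transferred */
}

/* Buggy variant: the transfer finishes, but completion is never signalled. */
static int model_rd_strategy_buggy(struct model_buf *bp)
{
    model_transfer(bp);
    return 0;        /* waiter stays blocked even though resid is 0 */
}

/* Fixed variant: signal completion before returning. */
static int model_rd_strategy_fixed(struct model_buf *bp)
{
    model_transfer(bp);
    model_biodone(bp);
    return 0;
}
```

In the buggy variant the buffer ends up in the state seen in the kadb session, with the transfer finished but completion never signalled. In the fixed variant the waiter is released. In the real driver, the corresponding change is the single biodone(9F) call described above.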