The next problem is that the system does not panic, but the mkfs(1M) command hangs and cannot be aborted. Though a core dump can be forced--by sending a break and then using sync from the OBP or using `g 0' from SunMon--in this case kadb(1M) will be used. After logging in remotely and using ps (which indicated that only the mkfs(1M) process was hung, not the entire system) the system is shut down and booted using kadb(1M) .
ok boot kadb -d Boot device: /sbus/esp@0,800000/sd@3,0 File and args: kadb -d kadb:kernel/unix Size: 673348+182896+46008 bytes /platform/SUNW,Sun_4_75/kernel/unix ... kadb[0]::cSunOS Release 5.7 Version Generic [UNIX(R) System V Release 4.0] Copyright (c) 1983-1998, Sun Microsystems, Inc. ...
After the rest of the kernel has loaded, moddebug is patched to see if loading is the problem. Because it got as far as rd_write() before, loading is probably not the problem, but it will be checked anyway.
# ~stopped at 0xfbd01028: ta 0x7d kadb[0]: moddebug/Xmoddebug: moddebug: 0 kadb[0]: moddebug/W 0x80000000moddebug: 0x0 = 0x80000000 kadb[0]: :c
modload(1M) is used to load the driver, to separate module loading from the real access:
# modload /home/driver/drv/ramdisk
It loads without errors, so loading is not the problem. The condition is recreated with mkfs(1M).
# mkfs -F ufs -o nsect=8,ntrack=8,free=5 /devices/pseudo/ramdisk@0:c,raw 1024ramdisk0: misusing 524288 bytes of memory
mkfs(1M) hangs. At this point, kadb(1M) is entered and the stack examined:
~stopped at Syslimit+0x1028: ta 0x7d kadb[0]: $c Syslimit() + 1028 debug_enter(0x0,0x740000,0xc4,0x0,0x0,0x40000e7) + c0 zsa_xsint(0xf5dd9000,0x80,0xf5dd906c,0x44,0xff0113,0x0) + 244 zs_high_intr(0xf5dd9000) + 204 _level1(0x1) + 13bc idle(0xf02a0fac,0x0,0xf02a8078,0xf02a8078,0xf004684c,0x4400ae1) + 50
In the previous example, the presence of idle on the current thread stack indicates that this thread is not the cause of the deadlock. To determine the deadlocked thread, the entire thread list is checked:
kadb[0]: $<threadlist ============== thread_id f0244020 p0: p0: process args= sched t0: t0: lwp proc wchan f02a3e78 f02a8078 0 t0+0x34: sp pc f0243af0 sched+0x4e0 ?(?) + 0 main(0x0,0xf02b7e20,0xf02a800c,0x0,0x6e,0x0) afsrbuf(?) + 1b8 exitto(0xf0040000,0xf02a0f58,0x3c,0xf02d3c40,0xfbd7d668,0x0) ============== thread_id fbe01ea0 p0: p0: process args= sched 0xfbe01ea0: lwp proc wchan 0 f02a8078 0 0xfbe01ed4: sp pc fbe1fe68 debug_enter+0xb0 ?(?) + 0 abort_sequence_enter(0x0,0x740000,0xc4,0x0,0x0,0x40000e7) zsa_xsint(0xf5dd9000,0x80,0xf5dd906c,0x44,0xff0113,0x0) + 244 zs_high_intr(0xf5dd9000) + 204 _level1(0xf02d40ec) + 13bc idle(0xf02a0fac,0x0,0xf02a8078,0xf02a8078,0xf004684c,0x4400ae1) + 50 ... ============== thread_id f6350a80 0xf6285518: process args= mkfs -o nsect=8,ntrack=8,free=5 /devices/pseudo/ ramdisk@0:c,raw 1024 0xf6350a80: lwp proc wchan f634baa0 f6285518 f5ef6fd0 0xf6350ab4: sp pc fbedc9e0 biowait+0xe4 ?(?) + 0 biowait(0xf5ef6f68,0xf5ef6f68,0xf604ee00,0xf591b56c,0xf6306430,0xf591b550) physio(0xf5ef6f68,0xf02d596c,0x200,0x100,0xf591b550,0xfbedcb50) + 364 write(0x200) + 250
Of all the threads, only one has a stack trace that references the ramdisk driver. It happens to be the last one. It seems that the process running mkfs(1M) is blocked in biowait(9F). After a call to physio(9F), biowait(9F) takes a buf(9S) structure as a parameter. The next step is to examine the buf(9S) structure:
kadb[0]: f5ef6f68$<buf 0xf5ef6f68: flags 4129 0xf5ef6f6c: forw back av_forw av_back 0 0 0 0 0xf5ef6f80: bcount bufsize error edev 512 0 0 1180000 0xf5ef6f84: addr blkno resid proc f5f66250 3ff 0 f6285518 0xf5ef6fac: iodone vp pages 0 0 0
The resid field is 0, which indicates that the transfer is complete. physio(9F) is still blocked, however. The reference for physio(9F) in the Solaris 2.7 Reference Manual AnswerBook points out that biodone(9F) should be called to unblock biowait(9F). This is the problem; rd_strategy() did not call biodone(9F). Adding a call to biodone(9F) before returning fixes this problem.