To avoid data loss and other problems, you should take special care when testing a new device driver. This section discusses various testing strategies. For example, setting up a separate system that you control through a serial connection is the safest way to test a new driver. You can load test modules with various kernel variable settings to test performance under different kernel conditions. Should your system crash, you should be prepared to restore backup data, analyze any crash dumps, and rebuild the device directory.
If your system is in a hard hang, then you cannot break into the debugger. If you enable the deadman feature, the system panics instead of hanging indefinitely. You can then use the kmdb(1) kernel debugger to analyze your problem.
The deadman feature checks every second whether the system clock is updating. If the system clock is not updating, then you are in an indefinite hang. If the system clock has not been updated for 50 seconds, the deadman feature induces a panic and puts you in the debugger.
Take the following steps to enable the deadman feature:
Make sure you are capturing crash images with dumpadm(1M).
Set the snooping variable in the /etc/system file. See the system(4) man page for information on the /etc/system file.
Reboot the system so that the /etc/system file is read again and the snooping setting takes effect.
Note that any zones on your system inherit the deadman setting as well.
If your system hangs while the deadman feature is enabled, you should see output similar to the following example on your console:
panic[cpu1]/thread=30018dd6cc0: deadman: timed out after 9 seconds of clock inactivity panic: entering debugger (continue to save dump)
Inside the debugger, use the ::cpuinfo command to investigate why the clock interrupt was not able to fire and advance the system time.
Using a serial connection is a good way to test drivers. Use the tip(1) command to make a serial connection between a host system and a test system. With this approach, the tip window on the host console is used as the console of the test machine. See the tip(1) man page for additional information.
A tip window has the following advantages:
Interactions with the test system and kernel debuggers can be monitored. For example, the window can keep a log of the session for use if the driver crashes the test system.
The test machine can be accessed remotely by logging into a tip host machine and using tip(1) to connect to the test machine.
Note - Although using a tip connection and a second machine are not required to debug an Oracle Solaris device driver, this technique is still recommended.
This connection must be made with a null modem cable.
The terminal entry must match the serial port that is used. The operating system comes with the correct entry for serial port B, but a terminal entry must be added for serial port A:
Note - The baud rate must be set to 9600.
% tip debug connected
The shell window is now a tip window with a connection to the console of the test machine.
Caution - Do not use STOP-A for SPARC machines or F1-A for x86 architecture machines on the host machine to stop the test machine. This action actually stops the host machine. To send a break to the test machine, type ~# in the tip window. Commands such as ~# are recognized only if these characters on first on the line. If the command has no effect, press either the Return key or Control-U.
A quick way to set up the test machine on the SPARC platform is to unplug the keyboard before turning on the machine. The machine then automatically uses serial port A as the console.
Another way to set up the test machine is to use boot PROM commands to make serial port A the console. On the test machine, at the boot PROM ok prompt, direct console I/O to the serial line. To make the test machine always come up with serial port A as the console, set the environment variables: input-device and output-device.
Example 23-1 Setting input-device and output-device With Boot PROM Commands
ok setenv input-device ttya ok setenv output-device ttya
The eeprom command can also be used to make serial port A the console. As superuser, execute the following commands to make the input-device and output-device parameters point to serial port A. The following example demonstrates the eeprom command.
Example 23-2 Setting input-device and output-device With the eeprom Command
# eeprom input-device=ttya # eeprom output-device=ttya
The eeprom commands cause the console to be redirected to serial port A at each subsequent system boot.
On x86 platforms, use the eeprom command to make serial port A the console. This procedure is the same as the SPARC platform procedure. See Setting Up a Target System on the SPARC Platform. The eeprom command causes the console to switch to serial port A (COM1) during reboot.
Note - x86 machines do not transfer console control to the tip connection until an early stage in the boot process unless the BIOS supports console redirection to a serial port. In SPARC machines, the tip connection maintains console control throughout the boot process.
The system(4) file in the /etc directory enables you to set the value of kernel variables at boot time. With kernel variables, you can toggle different behaviors in a driver and take advantage of debugging features that are provided by the kernel. The kernel variables moddebug and kmem_flags, which can be very useful in debugging, are discussed later in this section. See also Enable the Deadman Feature to Avoid a Hard Hang.
Changes to kernel variables after boot are unreliable, because /etc/system is read only once when the kernel boots. After this file is modified, the system must be rebooted for the changes to take effect. If a change in the file causes the system not to work, boot with the ask (-a) option. Then specify /dev/null as the system file.
Note - Kernel variables cannot be relied on to be present in subsequent releases.
The set command changes the value of module or kernel variables. To set module variables, specify the module name and the variable:
For example, to set the variable test_debug in a driver that is named myTest, use set as follows:
% set myTest:test_debug=1
To set a variable that is exported by the kernel itself, omit the module name.
You can also use a bitwise OR operation to set a value, for example:
% set moddebug | 0x80000000
The commands modload(1M), modunload(1M), and modinfo(1M) can be used to add test modules, which is a useful technique for debugging and stress-testing drivers. These commands are generally not needed in normal operation, because the kernel automatically loads needed modules and unloads unused modules. The moddebug kernel variable works with these commands to provide information and set controls.
Use modload(1M) to force a module into memory. The modload command verifies that the driver has no unresolved references when that driver is loaded. Loading a driver does not necessarily mean that the driver can attach. When a driver loads successfully, the driver's _info(9E) entry point is called. The attach() entry point is not necessarily called.
Use modinfo(1M) to confirm that the driver is loaded.
Example 23-3 Using modinfo to Confirm a Loaded Driver
$ modinfo Id Loadaddr Size Info Rev Module Name 6 101b6000 732 - 1 obpsym (OBP symbol callbacks) 7 101b65bd 1acd0 226 1 rpcmod (RPC syscall) 7 101b65bd 1acd0 226 1 rpcmod (32-bit RPC syscall) 7 101b65bd 1acd0 1 1 rpcmod (rpc interface str mod) 8 101ce8dd 74600 0 1 ip (IP STREAMS module) 8 101ce8dd 74600 3 1 ip (IP STREAMS device) ... $ modinfo | grep mydriver 169 781a8d78 13fb 0 1 mydriver (Test Driver 1.5)
The number in the info field is the major number that has been chosen for the driver. The modunload(1M) command can be used to unload a module if the module ID is provided. The module ID is found in the left column of modinfo output.
Sometimes a driver does not unload as expected after a modunload is issued, because the driver is determined to be busy. This situation occurs when the driver fails detach(9E), either because the driver really is busy, or because the detach entry point is implemented incorrectly.
To remove all of the currently unused modules from memory, run modunload(1M) with a module ID of 0:
# modunload -i 0
Prints messages to the console when loading or unloading modules.
Gives more detailed error messages.
Prints more detail when loading or unloading, such as including the address and size.
No auto-unloading drivers. The system does not attempt to unload the device driver when the system resources become low.
No auto-unloading streams. The system does not attempt to unload the STREAMS module when the system resources become low.
No auto-unloading of kernel modules of any type.
If running with kmdb, moddebug causes a breakpoint to be executed and a return to kmdb immediately before each module's _init() routine is called. This setting also generates additional debug messages when the module's _info() and _fini() routines are executed.
The kmem_flags kernel variable enables debugging features in the kernel's memory allocator. Set kmem_flags to 0xf to enable the allocator's debugging features. These features include runtime checks to find the following code conditions:
Writing to a buffer after the buffer is freed
Using memory before the memory is initialized
Writing past the end of a buffer
The Oracle Solaris Modular Debugger Guide describes how to use the kernel memory allocator to analyze such problems.
Note - Testing and developing with kmem_flags set to 0xf can help detect latent memory corruption bugs. Because setting kmem_flags to 0xf changes the internal behavior of the kernel memory allocator, you should thoroughly test without kmem_flags as well.
A number of driver-related system files are difficult, if not impossible, to reconstruct. Files such as /etc/name_to_major, /etc/driver_aliases, /etc/driver_classes, and /etc/minor_perm can be corrupted if the driver crashes the system during installation. See the add_drv(1M) man page.
To be safe, use the beadm(1M) command to make a backup copy of the root file system after the test machine is in the proper configuration. If you plan to modify the /etc/system file, make a backup copy of the file before making modifications.
See the Chapter 4, Administering Boot Environments, in Creating and Administering Oracle Solaris 11 Boot Environments and Booting From a ZFS Boot Environment (Task Map) in Oracle Solaris Administration: Common Tasks for detailed information.
If the system is attached to a network, the test machine can be added as a client of a server. If a problem occurs, the system can be booted from the network. The local disks can then be mounted, and any fixes can be made. Alternatively, the system can be booted directly from the Oracle Solaris system CD-ROM.
Another way to recover from disaster is to have another bootable root file system. Use format(1M) to make a partition that is the exact size of the original. Then use dd(1M) to copy the bootable root file system. After making a copy, run fsck(1M) on the new file system to ensure its integrity.
Subsequently, if the system cannot boot from the original root partition, boot the backup partition. Use dd(1M) to copy the backup partition onto the original partition. You might have a situation where the system cannot boot even though the root file system is undamaged. For example, the damage might be limited to the boot block or the boot program. In such a case, you can boot from the backup partition with the ask (-a) option. You can then specify the original file system as the root file system.
When a system panics, the system writes an image of kernel memory to the dump device. The dump device is by default the most suitable swap device. The dump is a system crash dump, similar to core dumps generated by applications. On rebooting after a panic, savecore(1M) checks the dump device for a crash dump. If a dump is found, savecore makes a copy of the kernel's symbol table, which is called unix.n. The savecore utility then dumps a core file that is called vmcore.n in the core image directory. By default, the core image directory is /var/crash/machine_name. If /var/crash has insufficient space for a core dump, the system displays the needed space but does not actually save the dump. The mdb(1) debugger can then be used on the core dump and the saved kernel.
In the Oracle Solaris operating system, crash dump is enabled by default. The dumpadm(1M) command is used to configure system crash dumps. Use the dumpadm command to verify that crash dumps are enabled and to determine the location of core files that have been saved.
Note - You can prevent the savecore utility from filling the file system. Add a file that is named minfree to the directory in which the dumps are to be saved. In this file, specify the number of kilobytes to remain free after savecore has run. If insufficient space is available, the core file is not saved.
Damage to the /devices and /dev directories can occur if the driver crashes during attach(9E). If either directory is damaged, you can rebuild the directory by booting the system and running fsck(1M) to repair the damaged root file system. The root file system can then be mounted. Recreate the /devices and /dev directories by running devfsadm(1M) and specifying the /devices directory on the mounted disk.
The following example shows how to repair a damaged root file system on a SPARC system. In this example, the damaged disk is /dev/dsk/c0t3d0s0, and an alternate boot disk is /dev/dsk/c0t1d0s0.
Example 23-4 Recovering a Damaged Device Directory
ok boot disk1 ... Rebooting with command: boot kernel.test/sparcv9/unix Boot device: /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@31,0:a File and \ args: kernel.test/sparcv9/unix ... # fsck /dev/dsk/c0t3d0s0** /dev/dsk/c0t3d0s0 ** Last Mounted on / ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups 1478 files, 9922 used, 29261 free (141 frags, 3640 blocks, 0.4% fragmentation) # mount /dev/dsk/c0t3d0s0 /mnt # devfsadm -r /mnt
Note - A fix to the /devices and /dev directories can allow the system to boot while other parts of the system are still corrupted. Such repairs are only a temporary fix to save information, such as system crash dumps, before reinstalling the system.