To avoid unnecessary locks, note the following:
Use the multithreading semantics of the entry points to your advantage.
If an element of a device's state structure is read-mostly (for example, initialized in attach(9E), destroyed in detach(9E), and only read in other entry points), there is no need to acquire a mutex to read that element, because its value does not change while those other entry points can run. Indiscriminately wrapping every access to such a variable in mutex_enter(9F) and mutex_exit(9F) calls adds locking overhead without adding any protection.
Make all entry points re-entrant. Reduce the amount of shared data by changing static variables to automatic variables, or by moving them into your state structure.
Kernel-thread stacks are small (currently 8 Kbytes), so do not allocate large automatic variables, and avoid deep recursion.
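The points above can be illustrated with a minimal sketch. It is a user-space analogy rather than real driver code: POSIX threads stand in for the kernel mutex routines (pthread_mutex_lock() where a driver would call mutex_enter(9F)), and malloc() stands in for a kernel allocator such as kmem_alloc(9F). The xx_state_t structure, its fields, and the xx_* functions are hypothetical names chosen for illustration only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-instance state, standing in for a driver state structure. */
typedef struct xx_state {
    size_t          buf_size; /* read-mostly: set in xx_attach(), then only read */
    pthread_mutex_t lock;     /* protects only the read-write fields below */
    unsigned long   nreads;   /* read-write: updated on every xx_read() call */
} xx_state_t;

/* Analogue of attach(9E): assumed to finish before other entry points run. */
int xx_attach(xx_state_t *sp, size_t buf_size)
{
    sp->buf_size = buf_size;
    sp->nreads = 0;
    return pthread_mutex_init(&sp->lock, NULL);
}

/* Analogue of detach(9E): assumed to run only after other entry points stop. */
void xx_detach(xx_state_t *sp)
{
    pthread_mutex_destroy(&sp->lock);
}

/*
 * Analogue of read(9E). The function is re-entrant: it keeps no static data,
 * and its scratch buffer is heap-allocated rather than a large automatic
 * array, which keeps stack usage small.
 */
int xx_read(xx_state_t *sp, char *dst, size_t len)
{
    assert(sp != NULL);

    /* Read-mostly field: no lock is taken to read it. */
    size_t n = sp->buf_size;
    if (len < n)
        n = len;

    if (n > 0) {
        char *tmp = malloc(n); /* per-call buffer, not shared between callers */
        if (tmp == NULL)
            return -1;
        memset(tmp, 'x', n);   /* placeholder for data produced by the device */
        memcpy(dst, tmp, n);
        free(tmp);
    }

    /* Read-write field: the lock is held only around the update. */
    pthread_mutex_lock(&sp->lock);
    sp->nreads++;
    pthread_mutex_unlock(&sp->lock);

    return (int)n;
}

/* Returns the number of completed reads, taking the lock for the access. */
unsigned long xx_nreads(xx_state_t *sp)
{
    pthread_mutex_lock(&sp->lock);
    unsigned long v = sp->nreads;
    pthread_mutex_unlock(&sp->lock);
    return v;
}
```

In this sketch, only nreads is touched under the mutex. The buf_size field is read without locking because nothing writes it between xx_attach() and xx_detach(), which mirrors the read-mostly case described above.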