While a traditional UNIX® process contains a single thread of control, multithreading separates a process into many execution threads, each of which runs independently. Multithreading your code has a number of benefits, but it can also introduce bugs that might be difficult to find. This article suggests ways of avoiding such bugs in your code as well as strategies for finding these bugs using the dbx command-line debugger.
Multithreading your code can help in the following areas.
Any program in which many activities are not dependent upon each other can be redesigned so that each independent activity is defined as a thread. For example, the user of a multithreaded GUI does not have to wait for one activity to complete before starting another.
Typically, applications that express concurrency requirements with threads need not take into account the number of available processors. The performance of the application improves transparently with additional processors because the operating system takes care of scheduling threads for the number of processors that are available. When multicore processors and multithreaded processors are available, a multithreaded application's performance scales appropriately because the cores and threads are viewed by the OS as processors.
Numerical algorithms and applications with a high degree of parallelism, such as matrix multiplications, can run much faster when implemented with threads on a multiprocessor.
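As a minimal sketch of this kind of data parallelism, the following C code splits an array sum across several POSIX threads. A sum stands in here for the matrix multiplication mentioned above, and the names slice_t, sum_slice, and parallel_sum are illustrative, not part of any library. Whether the threaded version actually runs faster depends on the array size and on how many processors are available.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* One contiguous slice of the input, plus the slot for its partial sum. */
typedef struct {
    const double *data;
    size_t begin;    /* first index in the slice */
    size_t end;      /* one past the last index in the slice */
    double partial;  /* written by exactly one thread */
} slice_t;

static void *sum_slice(void *p)
{
    slice_t *s = p;
    double acc = 0.0;
    for (size_t i = s->begin; i < s->end; i++)
        acc += s->data[i];
    s->partial = acc;   /* no lock needed: each thread owns its own slice_t */
    return NULL;
}

/* Sum data[0..n-1] using nthreads threads (nthreads must be >= 1). */
double parallel_sum(const double *data, size_t n, int nthreads)
{
    /* Heap storage, so no thread is handed a pointer into this stack frame. */
    pthread_t *tids = malloc((size_t)nthreads * sizeof *tids);
    slice_t *slices = malloc((size_t)nthreads * sizeof *slices);
    assert(tids != NULL && slices != NULL);

    size_t chunk = (n + (size_t)nthreads - 1) / (size_t)nthreads;
    for (int t = 0; t < nthreads; t++) {
        size_t begin = (size_t)t * chunk;
        size_t end = begin + chunk;
        slices[t].data = data;
        slices[t].begin = begin < n ? begin : n;
        slices[t].end = end < n ? end : n;
        slices[t].partial = 0.0;
        int rc = pthread_create(&tids[t], NULL, sum_slice, &slices[t]);
        assert(rc == 0);
        (void)rc;
    }

    double total = 0.0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tids[t], NULL);   /* after the join, partial is safe to read */
        total += slices[t].partial;
    }
    free(slices);
    free(tids);
    return total;
}
```

The code contains no reference to the number of processors: the caller chooses how many threads to create, and the operating system schedules them on whatever processors exist.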
Many programs are more efficiently structured as multiple independent or semi-independent units of execution instead of as a single, monolithic thread. For example, a non-threaded program that performs many different tasks might need to devote much of its code just to coordinating the tasks. When the tasks are programmed as threads, the code can be simplified. Multithreaded programs, especially programs that provide service to multiple concurrent users, can be more adaptive to variations in user demands than single-threaded programs.
Programs that use two or more processes that access common data through shared memory are applying more than one thread of control.
However, each process has a full address space and operating environment state. The cost of creating and maintaining this large amount of state information makes each process much more expensive than a thread in both time and space.
In addition, the inherent separation between processes can require a major effort by the programmer. This effort includes handling communication between the threads in different processes, or synchronizing their actions. When the threads are in the same process, communication and synchronization become much easier.
The following list points out some of the more frequent oversights that can cause bugs in multithreaded programs:
Passing a pointer to the caller's stack as an argument to a new thread.
Accessing the shared changeable state of global memory without the protection of a synchronization mechanism, which leads to a data race. A data race occurs when two or more threads in a single process access the same memory location concurrently, and at least one of the threads tries to write to the location. When the threads do not use exclusive locks to control their accesses to that memory, the order of accesses is non-deterministic, and the computation may give different results from run to run depending on that order. Some data races may be benign (for example, when the memory access is used for a busy-wait), but many data races are bugs in the program. The Thread Analyzer tool is useful for detecting data races.
Deadlocks caused by two threads trying to acquire rights to the same pair of global resources in alternate order. One thread controls the first resource and the other controls the second resource. Neither can proceed until the other gives up. The Thread Analyzer tool is useful for detecting deadlocks.
Trying to reacquire a lock already held (recursive deadlock).
Creating a hidden gap in synchronization protection. This gap in protection occurs when a protected code segment contains a function that frees and then reacquires the synchronization mechanism before it returns to the caller. The result is misleading. To the caller, it appears that the global data has been protected when the data actually has not been protected.
When mixing UNIX signals with threads, not using the sigwait(2) model for handling asynchronous signals.
Calling setjmp() and longjmp(), and then long-jumping away without releasing the mutex locks.
Failing to re-evaluate the conditions after returning from a call to *_cond_wait() or *_cond_timedwait().
Forgetting that default threads are created PTHREAD_CREATE_JOINABLE and must be reclaimed with pthread_join(3C). Note that pthread_exit(3C) does not free up its storage space.
Making deeply nested, recursive calls and using large automatic arrays can cause problems because multithreaded programs have a more limited stack size than single-threaded programs.
Specifying an inadequate stack size, or using nondefault stacks.
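Several of these oversights can be avoided with a few habits, sketched below with POSIX threads. The sketch passes heap-allocated arguments instead of pointers into the creator's stack, guards the shared counter with a mutex, re-tests its condition in a loop after pthread_cond_wait(), and reclaims every joinable thread with pthread_join(). The names shared_t, worker_arg_t, worker, and run_workers are hypothetical, and error handling is reduced to assertions to keep the example short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Shared state: every field below is accessed only while 'lock' is held. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_done;
    long            counter;   /* incremented by every worker */
    int             finished;  /* number of workers that have finished */
} shared_t;

/* Per-thread argument, heap-allocated so it outlives the creating frame. */
typedef struct {
    shared_t *shared;
    int       increments;
} worker_arg_t;

static void *worker(void *p)
{
    worker_arg_t *arg = p;
    shared_t *s = arg->shared;

    for (int i = 0; i < arg->increments; i++) {
        pthread_mutex_lock(&s->lock);     /* without the lock: a data race */
        s->counter++;
        pthread_mutex_unlock(&s->lock);
    }

    pthread_mutex_lock(&s->lock);
    s->finished++;
    pthread_cond_signal(&s->all_done);
    pthread_mutex_unlock(&s->lock);

    free(arg);                            /* the worker owns its argument */
    return NULL;
}

/* Runs nthreads workers (nthreads >= 1) and returns the final counter. */
long run_workers(int nthreads, int increments)
{
    shared_t *s = malloc(sizeof *s);
    pthread_t *tids = malloc((size_t)nthreads * sizeof *tids);
    assert(s != NULL && tids != NULL);

    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->all_done, NULL);
    s->counter = 0;
    s->finished = 0;

    for (int i = 0; i < nthreads; i++) {
        worker_arg_t *arg = malloc(sizeof *arg);
        assert(arg != NULL);
        arg->shared = s;
        arg->increments = increments;
        int rc = pthread_create(&tids[i], NULL, worker, arg);
        assert(rc == 0);
        (void)rc;
    }

    /* Re-evaluate the condition after every wakeup; one wakeup proves nothing. */
    pthread_mutex_lock(&s->lock);
    while (s->finished < nthreads)
        pthread_cond_wait(&s->all_done, &s->lock);
    long total = s->counter;
    pthread_mutex_unlock(&s->lock);

    /* Default threads are joinable and must be reclaimed. */
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);

    pthread_cond_destroy(&s->all_done);
    pthread_mutex_destroy(&s->lock);
    free(tids);
    free(s);
    return total;
}
```

With the lock in place, run_workers(4, 10000) returns 40000 on every run. If the lock around the increment is removed, the result may vary from run to run, which is the non-deterministic behavior described in the data race item above.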
Multithreaded programs, especially those containing bugs, often behave differently in two successive runs, given identical inputs, because of differences in the thread scheduling order.
In general, multithreading bugs are statistical instead of deterministic. Tracing is usually a more effective method of finding order of execution problems than is breakpoint-based debugging.
When it detects a multithreaded program, dbx tries to load libthread_db.so, a special system library for thread debugging located in /usr/lib. dbx is synchronous; when any thread or lightweight process (LWP) stops, all other threads and LWPs sympathetically stop. (An LWP is a thread in the Solaris™ kernel that executes kernel code and system calls.) This behavior is sometimes referred to as the “stop the world” model.
You can set breakpoints in multithreaded code using the stop command, trace command, or when command. The basic syntax of these commands is:
stop event-specification [modifier]
trace event-specification [modifier]
when event-specification [ modifier ] { command; ... }
Two thread-specific events were added in Sun Studio 11 dbx:
The thr_create [thread_id] event occurs when a thread, or a thread with the specified thread_id, has been created. For example, in the following stop command, the thread ID t@1 refers to the creating thread, while the thread ID t@5 refers to the created thread.
(dbx) stop thr_create t@5 -thread t@1
The thr_exit event occurs when a thread has exited. To capture the exit of a specific thread, use the -thread option of the stop command as follows:
(dbx) stop thr_exit -thread t@5
You can get an idea of how often your application creates and destroys threads by using the thr_create event and thr_exit event as in the following example:
(dbx) trace thr_create
(dbx) trace thr_exit
(dbx) run
trace: thread created t@2 on l@2
trace: thread created t@3 on l@3
trace: thread created t@4 on l@4
trace: thr_exit t@4
trace: thr_exit t@3
trace: thr_exit t@2
The application created three threads. Note that the threads exited in the reverse order of their creation, which might indicate that if the application had more threads, they would accumulate and consume resources.
To get more interesting information, you could try the following in a different session:
(dbx) when thr_create { echo "XXX thread $newthread created by $thread"; }
XXX thread t@2 created by t@1
XXX thread t@3 created by t@1
XXX thread t@4 created by t@1
The output shows that all three threads were created by thread t@1, which is a common multithreading pattern.
Suppose you want to debug thread t@3 from its outset. You could stop the application at the point that thread t@3 is created as follows:
(dbx) stop thr_create t@3
(dbx) run
t@1 (l@1) stopped in tdb_event_create at 0xff38409c
0xff38409c: tdb_event_create : retl
Current function is main
216 stat = (int) thr_create(NULL, 0, consumer, q, tflags, &tid_cons2);
(dbx)
If your application occasionally spawns a new thread from thread t@5 instead of thread t@1, you could capture that event as follows:
(dbx) stop thr_create -thread t@5
See the Sun Studio 12: Debugging a Program With dbx manual for a complete list of event specifications. Bear in mind that the event you specify may occur in more than one thread, so your program may hit the breakpoint many times. You can specify a thread_id or lwp_id as a modifier to the stop command and the trace command. The action associated with the event is then executed only if the thread or LWP that caused the event matches the thread_id or lwp_id. However, the specific thread or LWP you have in mind might be assigned a different thread_id or lwp_id from one execution of the program to the next.
dbx supports two basic single-step commands: next and step, plus two variants of step, called step up and step to. Both the next command and the step command let the program execute one source line before stopping again. The basic difference between the next and step commands is in how they handle function calls. If the line executed contains a function call:
The next command allows the call to be executed and stops at the following line (“steps over” the call).
The step command stops at the first line in a called function (“steps into” the call).
The syntax of the next command is:
next [ n ] [ -sig signal ] [ thread_id ] [lwp_id ]
The syntax of the step command is:
step [ n ] [ up ] [ -sig signal ] [ thread_id ] [lwp_id ] [ to function ]
To step one line in the current thread or LWP, type:
next
or
step
To step multiple (n) lines in the current thread or LWP, type:
next n
or
step n
To step one line in a different thread, type:
step thread_id
For example:
(dbx) step t@2
With multithreaded programs, when a function call is stepped into or stepped over, all LWPs are implicitly resumed for the duration of that function call in order to avoid deadlock.
You can specify a specific thread_id or lwp_id to the next command or the step command, thus changing the current thread or LWP. However, if you do so, this deadlock avoidance measure is defeated. So to avoid deadlocks, it is safer to change the current thread or LWP using the thread command or lwp command, and then use the next command or step command to step in the new current thread or LWP.
Whenever you give a command that steps a single thread or LWP, you need to be aware of potential deadlocks. If the thread that continues executing needs to acquire a lock that is held by a thread that has not resumed execution, your program deadlocks. If such a deadlock occurs, you can break it only by typing ctrl-C and then resuming all threads.
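The hazard can be seen in a small sketch. Here the calling thread plays the part of a thread that stopped while holding a lock, and a second thread plays the thread being stepped. The sketch uses pthread_mutex_trylock() so that it terminates; with pthread_mutex_lock() the stepped thread would block until the holder was resumed. The names queue_lock, stepped_thread, and stepped_thread_would_block are illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int trylock_result;   /* static storage: outlives any stack frame */

/* Plays the single thread that is stepped while the others stay stopped. */
static void *stepped_thread(void *unused)
{
    (void)unused;
    trylock_result = pthread_mutex_trylock(&queue_lock);
    if (trylock_result == 0)
        pthread_mutex_unlock(&queue_lock);
    return NULL;
}

/* Returns 1 if the stepped thread found the lock held, 0 otherwise. */
int stepped_thread_would_block(void)
{
    pthread_t tid;

    pthread_mutex_lock(&queue_lock);    /* the "stopped" holder owns the lock */
    int rc = pthread_create(&tid, NULL, stepped_thread, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(tid, NULL);            /* the holder makes no progress meanwhile */
    pthread_mutex_unlock(&queue_lock);

    return trylock_result == EBUSY;
}
```

In a real debugging session the holder cannot release the lock until it is resumed, which is why resuming all threads is the only way out of such a deadlock.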
To step up and out of the current function in the current thread or LWP, type:
step up
or
step up lwp_id
To step into a specified function at the current source line, type:
step to function_name
To step into the last function called as determined by the assembly code for the current source line, type:
step to
To deliver a signal while stepping, you can add -sig signal to any of the above next and step commands.
To resume execution of your multithreaded program after hitting a breakpoint or after single-stepping through your code, use the cont command. For multithreaded programs, the syntax is:
cont [ at line ] [ thread_id | lwp_id ] [ -sig signal ]
To continue execution of all threads, type:
cont
To continue execution of a specific thread or LWP, type:
cont thread_id
or
cont lwp_id
For example:
(dbx) cont l@3
To continue execution at a specific source line, type:
cont at line_number thread_id
or
cont at line_number lwp_id
To continue execution with a specific signal, you can add -sig signal in any of the above cont commands. Whenever you give a command that resumes a single thread or LWP, you need to be aware of potential deadlocks. If the thread that continues executing needs to acquire a lock that is held by a thread that has not resumed execution, your program deadlocks. If such a deadlock occurs, you can break it only by typing ctrl-C and then resuming all threads.
To view the threads list, use the threads command. The syntax is:
threads [ -all ] [ -mode [ all|filter ] [ auto|manual ] ]
The threads command displays the thread information shown in the following example:
(dbx) threads
t@1 a l@1 ?() running in main()
t@2 ?() asleep on 0xef751450 in _swtch()
t@3 b l@2 ?() running in sigwait()
t@4 consumer() asleep on 0x22bb0 in _lwp_sema_wait()
*>t@5 b l@4 consumer() breakpoint in Queue_dequeue()
t@6 b l@5 producer() running in _thread_start()
(dbx)
For native code, each line of information displayed by the threads command is composed of the following:
The * (asterisk) indicates that an event requiring user attention has occurred in this thread. Usually this is a breakpoint.
An 'o' instead of an asterisk indicates that a dbx internal event has occurred.
The > (arrow) denotes the current thread.
t@number, the thread id, refers to a particular thread. The number is the thread_t value passed back by thr_create.
b l@number or a l@number means the thread is bound to or active on the designated LWP, meaning the thread is actually runnable by the operating system.
The “Start function” of the thread as passed to thr_create. A ?() means that the start function is not known.
The thread state. (See the table below for descriptions of the thread states.)
The function that the thread is currently executing.
State | Description |
---|---|
suspended | The thread has been explicitly suspended. |
runnable | The thread is runnable and is waiting for an LWP as a computational resource. |
zombie | When a non-detached thread exits (thr_exit()), it is in a zombie state until it has been rejoined through the use of thr_join(). THR_DETACHED is a flag specified at thread creation time (thr_create()). A detached thread that exits is in a zombie state until it has been reaped. |
asleep on syncobj | The thread is blocked on the given synchronization object. Depending on what level of support libthread.so and libthread_db.so provide, syncobj might be as simple as a hexadecimal address or something with more information content. |
active | The thread is active on an LWP, but dbx cannot access the LWP. |
unknown | dbx cannot determine the state. |
lwpstate | A bound or active thread has the state of the LWP associated with it. |
running | The LWP was running but was stopped in synchrony with some other LWP. |
syscall num | The LWP stopped on an entry into the given system call number. |
syscall return num | The LWP stopped on an exit from the given system call number. |
job control | The LWP stopped due to job control. |
LWP suspended | The LWP is blocked in the kernel. |
single stepped | The LWP has just completed a single step. |
breakpoint | The LWP has just hit a breakpoint. |
fault num | The LWP has incurred the given fault number. |
signal name | The LWP has incurred the given signal. |
process sync | The process to which this LWP belongs has just started executing. |
LWP death | The LWP is in the process of exiting. |
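The zombie state described above depends on whether a thread is joinable or detached. The table uses the Solaris threads names (thr_create(), thr_join(), THR_DETACHED); the sketch below assumes the POSIX equivalents instead, and its function names are hypothetical. The first function reclaims a joinable thread with pthread_join(). The second creates a detached thread, which must not be joined and whose resources the threads library releases on its own.

```c
#include <assert.h>
#include <pthread.h>

static void *return_answer(void *unused)
{
    (void)unused;
    static int answer = 42;   /* static, so the pointer stays valid after exit */
    return &answer;
}

/* Joinable thread: its exit status is held until pthread_join() collects it. */
int join_collects_exit_value(void)
{
    pthread_t tid;
    void *ret = NULL;

    int rc = pthread_create(&tid, NULL, return_answer, NULL);
    assert(rc == 0);
    rc = pthread_join(tid, &ret);
    assert(rc == 0);
    (void)rc;
    return *(int *)ret;
}

static pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cond = PTHREAD_COND_INITIALIZER;
static int done;

static void *announce_done(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&done_lock);
    done = 1;
    pthread_cond_signal(&done_cond);
    pthread_mutex_unlock(&done_lock);
    return NULL;
}

/* Detached thread: nothing to join. Returns 1 once the thread has run. */
int run_detached(void)
{
    pthread_t tid;
    pthread_attr_t attr;

    pthread_mutex_lock(&done_lock);
    done = 0;
    pthread_mutex_unlock(&done_lock);

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    int rc = pthread_create(&tid, &attr, announce_done, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_attr_destroy(&attr);

    /* No join is possible, so wait for the thread's own announcement. */
    pthread_mutex_lock(&done_lock);
    while (!done)
        pthread_cond_wait(&done_cond, &done_lock);
    int seen = done;
    pthread_mutex_unlock(&done_lock);
    return seen;
}
```

A joinable thread that has exited but has not yet been joined is the kind of thread that the threads -all command would be expected to list as a zombie.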
To print the list of all known threads, type:
threads
The output of this command might be:
*> t@1 a l@1 ?() signal SIGINT in _XFlushInt()
t@2 b l@2 ?() running in _signotifywait()
t@3 b l@3 ?() running in _lwp_sema_wait()
t@4 ?() sleep on (unknown) in _reap_wait()
To print threads normally not printed (zombies), type:
threads -all
The output of this command might be:
*> t@1 a l@1 ?() signal SIGINT in _XFlushInt()
t@2 b l@2 ?() running in _signotifywait()
t@3 b l@3 ?() running in _lwp_sema_wait()
t@4 ?() sleep on (unknown) in _reap_wait()
t@5 myThread() zombie in in
t@5 myThread() zombie in in
By default, the threads command runs in filter mode, meaning that hidden threads and zombie threads are not printed. To print all threads including hidden threads and zombies, type:
threads -mode all
The thread command lists or changes the current thread. The syntax is:
thread [ -blocks ] [ -blockedby ] [ -info ] [ -hide ] [ -unhide ] [ -suspend ] [ -resume ] thread_id
To change the current thread, type:
thread thread_id
To print everything known about the current thread or given thread, type:
thread -info [thread_id]
For example, this command might produce the following output:
thread -info t@4
Thread t@4 (0xfe60bd70) at priority 127
state: asleep on (unknown)
base function: 0x0: 0x00000000()
stack: 0xfe60bd70[1047920]
flags: DETACHED|DAEMON
masked signals: HUP INT QUIT ILL TRAP ABRT EMT FPE BUS SEGV SYS PIPE ALRM TERM USR1 USR2 CLD PWR WINCH URG POLL TSTP CONT TTIN TTOU VTALRM PROF XCPU XFSZ WAITING FREEZE THAW LOST RTMIN RTMIN+1 RTMIN+2 RTMIN+3 RTMIN+4 RTMIN+5 RTMIN+6 RTMIN+7
Currently inactive in _reap_wait
To print all locks held by the current thread or given thread blocking other threads, type:
thread -blocks [thread_id]
To show which synchronization object the current thread or given thread is blocked by, if any, type:
thread -blockedby [thread_id]
To suspend the current thread or given thread, type:
thread -suspend [thread_id]
To resume (unsuspend) the current thread or given thread, type:
thread -resume [thread_id]
To hide the current thread or given thread so that it will not be displayed by the threads command, type:
thread -hide [thread_id]
To unhide the current thread or given thread so that it will be displayed by the threads command, type:
thread -unhide [thread_id]
Normally, you need not be aware of LWPs, but there are times when thread level queries cannot be completed. In these cases, you can use the lwp command and lwps command to show information about LWPs.
The syntax of the lwp command is:
lwp [ -info ] lwp_id
To list the current LWP, type:
lwp
For example:
lwp l@3
To change the current LWP, type:
lwp lwp_id
To display the name, home, and masked signals of the current LWP, type:
lwp -info
For example, this command might produce the following output:
lwp -info l@2
l@2 running in _signotifywait()
masked signals are:
To list all LWPs in the current process, use the lwps command:
lwps
The lwps command displays the LWP information shown in the following example:
(dbx) lwps
l@1 running in main()
l@2 running in sigwait()
l@3 running in _lwp_sema_wait()
*>l@4 breakpoint in Queue_dequeue()
l@5 running in _thread_start()
(dbx)
Each line of the LWP list contains the following:
The * (asterisk) indicates that an event requiring user attention has occurred in this LWP.
The arrow denotes the current LWP.
l@number refers to a particular LWP.
The next item represents the LWP state.
in function_name() identifies the function that the LWP is currently executing.
Runtime checking in dbx supports multithreaded applications. Along with each access checking error report, RTC prints the ID of the thread on which the error occurred. The leak report generated by RTC includes the leaks from all the threads in the program.
If you are accustomed to using dynamic function calls from dbx when debugging single-threaded programs, take care in using the same technique when debugging multithreaded code.
dbx allows you to use function calls in expressions. For example, the following command forces the target program to call foo():
print foo()
Forcing a function call can be useful because it lets you use the program code to examine the state of the program.
You can use the when command to stop execution at particular locations in the program, print data, and then resume execution:
when in bar {print var;}
If you combine these two examples, as in the following command, you are stopping the program at various locations, forcing it to call a function, and then continuing execution:
when in bar {print foo();}
If you give dbx such a command, you must be sure that calling foo() at the times you stop execution in bar() does not interfere with your program's intended execution.
If your program is multithreaded, it is more difficult to predict when it is safe to force a call to foo(). One thread might be stopped in bar(), which you know is safe, but other threads might be in the process of modifying data that foo() relies on.
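One defensive option, sketched here under the assumption that the inspected data is protected by a mutex, is to give the program a dedicated inspection function that never blocks. The queue_t type and the functions below are hypothetical and not part of dbx or of any library. The helper uses pthread_mutex_trylock() and reports failure instead of waiting when another thread holds the lock, which suggests that thread may be in the middle of an update.

```c
#include <assert.h>
#include <pthread.h>

#define QUEUE_CAPACITY 16

typedef struct {
    pthread_mutex_t lock;                 /* protects items and count */
    int             items[QUEUE_CAPACITY];
    int             count;
} queue_t;

void queue_init(queue_t *q)
{
    int rc = pthread_mutex_init(&q->lock, NULL);
    assert(rc == 0);
    (void)rc;
    q->count = 0;
}

/* Normal program path: blocks until the lock is available. */
int queue_push(queue_t *q, int value)
{
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAPACITY) {
        q->items[q->count++] = value;
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/*
 * Inspection helper intended for forced calls from a debugger.
 * It never blocks: it returns -1 if the lock is currently held,
 * otherwise the number of queued items.
 */
int queue_length_safe(queue_t *q)
{
    if (pthread_mutex_trylock(&q->lock) != 0)
        return -1;
    int len = q->count;
    pthread_mutex_unlock(&q->lock);
    return len;
}
```

From dbx, such a helper could then be used in place of foo(), for example with a command along the lines of print queue_length_safe(&q), where q is a queue variable in the program being debugged. A result of -1 means the data could not be examined safely at that moment.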
For further details, see:
Multithreaded Programming Guide
Sun Studio 12: Debugging a Program With dbx