Debugging a Multithreaded Program

The following discussion describes characteristics that can cause bugs in multithreaded programs. Utilities that you can use to help debug your program are also described.

Common Oversights in Multithreaded Programs

The following list points out some of the more frequent oversights that can cause bugs in multithreaded programs.

Passing a pointer to the caller's stack as an argument to a new thread.
Accessing global memory (shared changeable state) without the protection of a synchronization mechanism, which leads to a data race. A data race occurs when two or more threads in a single process access the same memory location concurrently, and at least one of the threads tries to write to the location. When the threads do not use exclusive locks to control their accesses to that memory, the order of accesses is non-deterministic, and the computation may give different results from run to run depending on that order. Some data races may be benign (for example, when the memory access is used for a busy-wait), but many data races are bugs in the program. The Thread Analyzer tool is useful for detecting data races. See Detecting Data Races and Deadlocks Using Thread Analyzer.
Creating deadlocks by having two threads try to acquire rights to the same pair of global resources in opposite order. One thread controls the first resource and the other controls the second resource. Neither thread can proceed until the other gives up. The Thread Analyzer tool is also useful for detecting deadlocks. See Detecting Data Races and Deadlocks Using Thread Analyzer.
Trying to reacquire a lock already held (recursive deadlock).
Creating a hidden gap in synchronization protection. This gap in protection occurs when a protected code segment contains a function that frees and reacquires the synchronization mechanism before returning to the caller. The result is misleading. To the caller, the appearance is that the global data has been protected when the data actually has not been protected.
Mixing UNIX signals with threads without using the sigwait(2) model for handling asynchronous signals.
Calling setjmp(3C) and then long-jumping away with longjmp(3C) without releasing the mutex locks.
Failing to re-evaluate the conditions after returning from a call to *_cond_wait() or *_cond_timedwait().
Forgetting that threads are created PTHREAD_CREATE_JOINABLE by default and must be reclaimed with pthread_join(3C). Note that pthread_exit(3C) does not free the storage of the exiting thread.
Making deeply nested, recursive calls and using large automatic arrays. These practices can cause problems because multithreaded programs have a more limited stack size than single-threaded programs.
Specifying an inadequate stack size, or using nondefault stacks.
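
Several of the oversights above can be avoided with a few habits. The following is a minimal sketch, not a definitive implementation: the `struct job` type and the `worker()` and `run_job()` functions are hypothetical names introduced for illustration. The sketch passes heap memory rather than a stack pointer to the new thread, re-evaluates the condition in a loop after `pthread_cond_wait()` returns, and reclaims the joinable thread with `pthread_join()`.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical job record shared by the creating thread and the worker. */
struct job {
    pthread_mutex_t lock;
    pthread_cond_t  done_cv;
    int             input;
    int             result;
    int             done;      /* condition, protected by lock */
};

static void *worker(void *arg)
{
    struct job *j = arg;       /* heap memory, not the caller's stack */

    pthread_mutex_lock(&j->lock);
    j->result = j->input * 2;
    j->done = 1;
    pthread_cond_signal(&j->done_cv);
    pthread_mutex_unlock(&j->lock);
    return NULL;
}

/* Doubles input on a separate thread. Returns -1 if setup fails. */
int run_job(int input)
{
    struct job *j = malloc(sizeof *j);
    pthread_t tid;
    int result;

    if (j == NULL)
        return -1;
    pthread_mutex_init(&j->lock, NULL);
    pthread_cond_init(&j->done_cv, NULL);
    j->input = input;
    j->result = 0;
    j->done = 0;

    if (pthread_create(&tid, NULL, worker, j) != 0) {
        pthread_cond_destroy(&j->done_cv);
        pthread_mutex_destroy(&j->lock);
        free(j);
        return -1;
    }

    pthread_mutex_lock(&j->lock);
    while (!j->done)           /* re-evaluate the condition after each wakeup */
        pthread_cond_wait(&j->done_cv, &j->lock);
    result = j->result;
    pthread_mutex_unlock(&j->lock);

    pthread_join(tid, NULL);   /* reclaim the joinable thread's storage */
    pthread_cond_destroy(&j->done_cv);
    pthread_mutex_destroy(&j->lock);
    free(j);
    return result;
}
```

Under these assumptions, a call such as `run_job(21)` yields 42. The `while` loop matters because a wait can return even though the condition is still false.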

Multithreaded programs, especially those containing bugs, often behave differently in two successive runs, even with identical inputs. This behavior is caused by differences in the order that threads are scheduled.
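
As a small illustration of how scheduling order affects results, consider several threads that increment a shared counter. The sketch below uses hypothetical names (`counter`, `add_many()`, `count_with_lock()`) and an arbitrary thread and iteration count. With the mutex in place the final value is the same on every run. If the lock and unlock calls were removed, the increments could interleave and the total could differ from run to run.

```c
#include <pthread.h>

#define NTHREADS 4
#define NITERS   100000

static long counter;                      /* shared changeable state */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_many(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;                        /* protected read-modify-write */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Returns the final counter value, or -1 if a thread could not be created. */
long count_with_lock(void)
{
    pthread_t tids[NTHREADS];
    int started;

    counter = 0;
    for (started = 0; started < NTHREADS; started++)
        if (pthread_create(&tids[started], NULL, add_many, NULL) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);      /* all increments finish before the read */
    return started == NTHREADS ? counter : -1;
}
```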

In general, multithreading bugs are statistical instead of deterministic. Tracing is usually a more effective method of finding order-of-execution problems than breakpoint-based debugging.
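
One lightweight way to trace execution order without stopping threads is to record events in an in-memory buffer and inspect the buffer after the run. The following sketch is only one possible approach and is not part of any Oracle tool. All names (`trace()`, `trace_buf`, `run_traced()`, `trace_in_program_order()`) and the buffer size are assumptions made for this example. Each event claims a slot with a C11 atomic increment, so the slot order approximates the order in which the threads reached the trace points.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NWORKERS  4
#define NSTEPS    3
#define TRACE_CAP 64

struct trace_event {
    int worker;                /* hypothetical worker number, not a pthread_t */
    int step;
};

static struct trace_event trace_buf[TRACE_CAP];
static atomic_int trace_next;  /* index of the next free slot */
static int worker_ids[NWORKERS];

/* Claims a slot atomically, then records the event. Drops events when full. */
static void trace(int worker, int step)
{
    int slot = atomic_fetch_add(&trace_next, 1);

    if (slot < TRACE_CAP) {
        trace_buf[slot].worker = worker;
        trace_buf[slot].step = step;
    }
}

static void *traced_worker(void *arg)
{
    int worker = *(int *)arg;

    for (int step = 0; step < NSTEPS; step++)
        trace(worker, step);
    return NULL;
}

/* Runs the workers. Returns the number of recorded events, or -1 on failure. */
int run_traced(void)
{
    pthread_t tids[NWORKERS];
    int started, n;

    atomic_store(&trace_next, 0);
    for (started = 0; started < NWORKERS; started++) {
        worker_ids[started] = started;
        if (pthread_create(&tids[started], NULL, traced_worker,
                           &worker_ids[started]) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);   /* the buffer is read only after joins */
    if (started != NWORKERS)
        return -1;
    n = atomic_load(&trace_next);
    return n < TRACE_CAP ? n : TRACE_CAP;
}

/* Returns 1 if every worker's steps appear in the buffer in program order. */
int trace_in_program_order(void)
{
    int next_step[NWORKERS] = {0};
    int n = atomic_load(&trace_next);

    if (n > TRACE_CAP)
        n = TRACE_CAP;
    for (int i = 0; i < n; i++) {
        int w = trace_buf[i].worker;

        if (w < 0 || w >= NWORKERS || trace_buf[i].step != next_step[w])
            return 0;
        next_step[w]++;
    }
    return 1;
}
```

The interleaving of workers in `trace_buf` can change between runs, while the steps of any single worker stay in order. Comparing the buffers of a good run and a failing run is often more informative than single-stepping.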

Built-in Error Checking

The standard C library has built-in error checking code. To activate the error checking code, set the _THREAD_ERROR_DETECTION environment variable to either of the following values.

export _THREAD_ERROR_DETECTION=1
export _THREAD_ERROR_DETECTION=2

If the environment variable is set to either value, libc detects lock usage errors and reports them on the standard error output. The following table lists the types of errors that libc detects for each function.

Table 19 Lock Usage Errors Detected by libc

Function Name	Type of Error Detected by `libc`
`mutex_lock()`	Calling thread already owns the lock
`mutex_unlock()`	Calling thread does not own the lock
`cond_wait()`	Calling thread does not own the lock; recursive mutex in `cond_wait()`; condvar process-shared, mutex process-private; condvar process-private, mutex process-shared
`rwlock_rdlock()`	Calling thread owns the writer lock
`rwlock_wrlock()`	Calling thread owns the readers lock; calling thread owns the writer lock
`rwlock_unlock()`	Writer lock held, but not by the calling thread; readers lock held, but not by the calling thread; lock not owned

An error message similar to the following is displayed on the standard error output:

*** _THREAD_ERROR_DETECTION: lock usage error detected ***
mutex_unlock(0x100763f50): calling thread does not own the lock
calling thread is 0x7d5a40bf2a40 thread-id 1
the lock is unowned

Note -

If _THREAD_ERROR_DETECTION is set to 1, the program continues execution.
If _THREAD_ERROR_DETECTION is set to 2, the program is aborted with a core dump for later inspection.
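
The `_THREAD_ERROR_DETECTION` checks are performed by libc at run time and need no source changes. For comparison, portable POSIX code can request a similar check for individual mutexes by creating them with the `PTHREAD_MUTEX_ERRORCHECK` type. This is a different mechanism from the environment variable described above: errors are reported through return values rather than on standard error. The helper names in the following sketch (`errorcheck_mutex_init()`, `unlock_unowned()`, `relock_owned()`) are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* exposes PTHREAD_MUTEX_ERRORCHECK on some systems */
#include <errno.h>
#include <pthread.h>

/* Initializes *m as an error-checking mutex. Returns 0 or an error number. */
int errorcheck_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Unlocks a mutex that no thread owns and returns the reported error. */
int unlock_unowned(void)
{
    pthread_mutex_t m;
    int rc;

    if (errorcheck_mutex_init(&m) != 0)
        return -1;
    rc = pthread_mutex_unlock(&m);   /* POSIX specifies EPERM for this type */
    pthread_mutex_destroy(&m);
    return rc;
}

/* Locks a mutex twice from the same thread and returns the reported error. */
int relock_owned(void)
{
    pthread_mutex_t m;
    int rc;

    if (errorcheck_mutex_init(&m) != 0)
        return -1;
    pthread_mutex_lock(&m);
    rc = pthread_mutex_lock(&m);     /* POSIX specifies EDEADLK for this type */
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

With a default mutex, both of these mistakes have undefined or deadlocking behavior, which is why a checked mode is useful while debugging.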

Tracing and Debugging with DTrace

DTrace is a comprehensive dynamic tracing facility that is built into the Oracle Solaris OS. The DTrace facility can be used to examine the behavior of your multithreaded program. DTrace inserts probes into running programs to collect data at points in the execution path that you specify. The collected data can be examined to determine problem areas. See the Oracle Solaris 11.3 DTrace (Dynamic Tracing) Guide for more information about using DTrace.

Profiling with Performance Analyzer

The Performance Analyzer tool, included in the Oracle Developer Studio software, can be used for extensive profiling of multithreaded and single-threaded programs. The tool enables you to see in detail what a thread is doing at any given point. See Oracle Developer Studio 12.6: Performance Analyzer (https://docs.oracle.com/cd/E77782_01/html/E77798/index.html) for more information.

Detecting Data Races and Deadlocks Using Thread Analyzer

The Oracle Developer Studio software includes a tool called the Thread Analyzer. This tool enables you to analyze the execution of a multithreaded program. It can detect multithreaded programming errors such as data races or deadlocks in code that is written using the Pthread API, the Oracle Solaris thread API, OpenMP directives, Oracle parallel directives, Cray parallel directives, or a mix of these technologies.

See Oracle Developer Studio 12.6: Thread Analyzer User's Guide for more information.

Using `dbx`

The dbx utility is a debugger included in the Oracle Developer Studio developer tools, available from https://www.oracle.com/technetwork/server-storage/developerstudio/downloads/index.html. With the Oracle Developer Studio dbx command-line debugger, you can debug and execute source programs that are written in C, C++, and Fortran. You can use dbx by starting it in a terminal window and interactively debugging your program with dbx commands. If you prefer a graphical interface, you can use the same dbx functionality in the Debugging windows of the Oracle Developer Studio IDE (Integrated Development Environment). For a description of how to start dbx, see the dbx man page. See Oracle Developer Studio 12.6: Debugging a Program With dbx for an overview of dbx. The debugging features in the Oracle Developer Studio IDE are described in the IDE online help.

The Oracle Developer Studio 12.6: Debugging a Program With dbx guide contains detailed information about debugging multithreaded programs. The dbx debugger provides commands to manipulate event handlers for thread events, which are described in the Event Management appendix of Oracle Developer Studio 12.6: Debugging a Program With dbx.

All the dbx options that are listed in Table 20, dbx Options for MT Programs, can support multithreaded applications.

Table 20 dbx Options for MT Programs

Option	Action
`cont at line [-sig signo id]`	Continues execution at `line` with signal `signo`. The `id`, if present, specifies which thread or LWP to continue. The default value is `all`.
`lwp [lwpid]`	Displays the current LWP, or switches to the given LWP `lwpid`.
`lwps`	Lists all LWPs in the current process.
`next ... tid`	Steps the given thread. When a function call is skipped, all LWPs are implicitly resumed for the duration of that function call. Nonactive threads cannot be stepped.
`next ... lwpid`	Steps the given LWP. Does not implicitly resume all LWPs when skipping a function.
`step ... tid`	Steps the given thread. When a function call is skipped, all LWPs are implicitly resumed for the duration of that function call. Nonactive threads cannot be stepped.
`step ... lwpid`	Steps the given LWP. Does not implicitly resume all LWPs when skipping a function.
`stepi ... lwpid`	Steps machine instructions (stepping into calls) in the given LWP.
`stepi ... tid`	Steps machine instructions in the LWP on which the given thread is active.
`thread [ tid ]`	Displays current thread, or switches to thread `tid`. In all the following variations, omitting the `tid` implies the current thread.
`thread -info [ tid ]`	Prints everything known about the given thread.
`thread -blocks [ tid ]`	Prints all locks held by the given thread blocking other threads.
`thread -suspend [ tid ]`	Puts the given thread into suspended state, which prevents it from running. A suspended thread displays with an "S" in the `threads` listing.
`thread -resume [ tid ]`	Unsuspends the given thread so it resumes running.
`thread -hide [ tid ]`	Hides the given or current thread. The thread does not appear in the generic `threads` listing.
`thread -unhide [ tid ]`	Unhides the given or current thread.
`thread -unhide all`	Unhides all threads.
`threads`	Prints the list of all known threads.
`threads -all`	Prints threads that are not usually printed (zombies).
`threads -mode all|filter`	Controls whether `threads` prints all threads or filters threads by default. When filtering is on, threads that have been hidden by the `thread -hide` command are not listed.
`threads -mode auto|manual`	Enables automatic updating of the thread listing.
`threads -mode`	Echoes the current modes. Any of the previous forms can be followed by a thread or LWP ID to get the traceback for the specified entity.
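
A small target program can be handy for trying the thread commands in Table 20. The following sketch is an assumption made for practice purposes, not an example from the dbx documentation. The names `gate`, `blocked_worker()`, `inspect_here()`, and `run_gate_demo()` are hypothetical. The creating thread holds a mutex while the workers start, so a breakpoint in `inspect_here()` is likely, although not guaranteed, to find the workers waiting on that mutex. At that point, commands such as `threads`, `thread -info`, and `thread -blocks` have something to display.

```c
#include <pthread.h>

#define NWORKERS 3

static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;
static int finished;                 /* protected by gate */

static void *blocked_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&gate);       /* waits while the creator holds gate */
    finished++;
    pthread_mutex_unlock(&gate);
    return NULL;
}

/* Convenient breakpoint location: the creator still owns gate here. */
void inspect_here(void)
{
}

/* Returns the number of workers that ran to completion. */
int run_gate_demo(void)
{
    pthread_t tids[NWORKERS];
    int started;

    pthread_mutex_lock(&gate);
    finished = 0;
    for (started = 0; started < NWORKERS; started++)
        if (pthread_create(&tids[started], NULL, blocked_worker, NULL) != 0)
            break;

    inspect_here();                  /* stop here to examine the threads */

    pthread_mutex_unlock(&gate);     /* release the workers */
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return finished;                 /* safe to read after all joins */
}
```

Compile the program with debugging information and call `run_gate_demo()` from `main()` before loading it into the debugger. Which workers have reached the lock when the breakpoint is hit depends on scheduling.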

Using `truss`

See the truss(1) man page for information on tracing system calls, signals and user-level function calls.

Using `mdb`

For information about mdb, see the Oracle Solaris Modular Debugger Guide.

The following mdb commands can be used to access the LWPs of a multithreaded program.

$l: Prints the LWP ID of the representative thread if the target is a user process.
$L: Prints the LWP IDs of each LWP in the target if the target is a user process.
pid::attach: Attaches to process # pid.
::release: Releases the previously attached process or core file. The process can subsequently be continued by prun or it can be resumed by applying MDB or another debugger.

The following commands for managing conditional breakpoints are often useful.

[ addr ] ::bp [+/-dDestT] [-c cmd] [-n count] sym ...: Sets a breakpoint at the specified locations.
addr ::delete [ id | all ]: Deletes the event specifiers with the given ID number.

Multithreaded Programming Guide