Multithreaded Programming Guide

Working with Multiprocessors

Multithreading lets you take advantage of multiprocessors, primarily through parallelism and scalability. Programmers should be aware of the differences between the memory models of a multiprocessor and a uniprocessor.

Memory consistency is always from the viewpoint of the processor interrogating memory. For uniprocessors, memory is obviously consistent because there is only one processor viewing memory.

To improve multiprocessor performance, memory consistency is relaxed. You cannot always assume that changes made to memory by one processor are immediately reflected in the other processors' views of that memory.

You can avoid this complexity by using synchronization variables when you use shared or global variables.

Another multiprocessor issue is efficient synchronization when threads must wait until all threads have reached a common point in their execution.

Barrier synchronization is sometimes an efficient way to control this kind of parallelism on multiprocessors. An example of barriers can be found in Appendix B, Solaris Threads Example: barrier.c.


Note -

The issues discussed here are not important when the threads synchronization primitives are always used to access shared memory locations.


The Underlying Architecture

When threads synchronize access to shared storage locations using the threads synchronization routines, the effect of running a program on a shared-memory multiprocessor is identical to the effect of running the program on a uniprocessor.

However, in many situations a programmer might be tempted to take advantage of the multiprocessor and use "tricks" to avoid the synchronization routines. As Example 10-5 and Example 10-6 show, such tricks can be dangerous.

Understanding the memory models supported by common multiprocessor architectures helps you understand the dangers.

The major multiprocessor components are:

  - The processors themselves

  - Store buffers, which connect the processors to their caches

  - Caches, which hold the contents of recently accessed or modified storage locations

  - Memory, the primary storage, which is shared by all processors

In the simple traditional model, the multiprocessor behaves as if the processors are connected directly to memory: when one processor stores into a location and another immediately loads from the same location, the second processor loads what was stored by the first.

Caches can be used to speed the average memory access, and the desired semantics can be achieved when the caches are kept consistent with one another.

A problem with this simple approach is that the processor must often be delayed to make certain that the desired semantics are achieved. Many modern multiprocessors use various techniques to prevent such delays, which, unfortunately, change the semantics of the memory model.

Two of these techniques and their effects are explained in the next two examples.

"Shared-Memory" Multiprocessors

Consider the purported solution to the producer/consumer problem shown in Example 10-5.

Although this program works on current SPARC-based multiprocessors, it assumes that all multiprocessors have strongly ordered memory. This program is therefore not portable.


Example 10-5 The Producer/Consumer Problem--Shared Memory Multiprocessors

char buffer[BSIZE];
unsigned int in = 0;
unsigned int out = 0;

void
producer(char item) {
    /* Busy-wait until there is room in the buffer. */
    do
        ;  /* nothing */
    while (in - out == BSIZE);

    buffer[in % BSIZE] = item;
    in++;
}

char
consumer(void) {
    char item;

    /* Busy-wait until there is an item in the buffer. */
    do
        ;  /* nothing */
    while (in - out == 0);

    item = buffer[out % BSIZE];
    out++;
    return item;
}

When this program has exactly one producer and exactly one consumer and is run on a shared-memory multiprocessor, it appears to be correct. The difference between in and out is the number of items in the buffer.

The producer waits (by repeatedly computing this difference) until there is room for a new item, and the consumer waits until there is an item in the buffer.

For memory that is strongly ordered (that is, a modification to memory on one processor is immediately available to the other processors), this solution is correct. It is correct even taking into account that in and out will eventually overflow, as long as BSIZE is less than the largest integer that can be represented in a word.
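To see why the difference remains correct after the counters wrap, note that unsigned subtraction is performed modulo the word size. A minimal sketch (the specific values are hypothetical):

#include <assert.h>
#include <limits.h>

int main(void) {
    /* Hypothetical state: the producer has incremented past UINT_MAX
       and wrapped around to 3; the consumer is five items behind. */
    unsigned int in  = 3;
    unsigned int out = UINT_MAX - 1;

    /* Unsigned subtraction is modulo UINT_MAX + 1, so the difference
       still equals the number of items in the buffer. */
    assert(in - out == 5);
    return 0;
}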

Shared-memory multiprocessors do not necessarily have strongly ordered memory. A change to memory by one processor is not necessarily available immediately to the other processors. When two changes to different memory locations are made by one processor, the other processors do not necessarily see the changes in the order in which they were made because changes to memory don't happen immediately.

First the changes are stored in store buffers that are not visible to the cache.

The processor looks at these store buffers to ensure that its own program has a consistent view of memory, but because the store buffers are not visible to other processors, a write by one processor does not become visible to the others until it is written to the cache.

The synchronization primitives (see Chapter 4, "Programming with Synchronization Objects") use special instructions that flush the store buffers to cache. So, using locks around your shared data ensures memory consistency.

When memory ordering is very relaxed, Example 10-5 has a problem because the consumer might see that in has been incremented by the producer before it sees the change to the corresponding buffer slot.

This is called weak ordering because stores made by one processor can appear to another processor to happen out of order (memory, however, is always consistent from the viewpoint of the processor that issued the stores). To fix this problem, the code should use mutexes, which flush the store buffers.
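As a sketch of what that fix looks like, here is Example 10-5 reworked with a mutex and condition variables. POSIX threads interfaces are used for illustration, and the BSIZE definition is assumed; Chapter 4 describes the synchronization objects in detail:

#include <pthread.h>

#define BSIZE 256   /* assumed buffer size */

static char buffer[BSIZE];
static unsigned int in = 0, out = 0;

static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

void producer(char item) {
    pthread_mutex_lock(&lock);
    while (in - out == BSIZE)              /* buffer full */
        pthread_cond_wait(&not_full, &lock);
    buffer[in % BSIZE] = item;
    in++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);           /* unlock flushes the stores */
}

char consumer(void) {
    char item;

    pthread_mutex_lock(&lock);
    while (in - out == 0)                  /* buffer empty */
        pthread_cond_wait(&not_empty, &lock);
    item = buffer[out % BSIZE];
    out++;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return item;
}

Besides blocking instead of spinning, the lock and unlock operations bracket every access to buffer, in, and out, so no thread ever observes a half-published item.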

The trend is toward relaxing memory order. Because of this, programmers are becoming increasingly careful to use locks around all global or shared data.

As demonstrated by Example 10-5 and Example 10-6, locking is essential.

Peterson's Algorithm

The code in Example 10-6 is an implementation of Peterson's Algorithm, which handles mutual exclusion between two threads. This code tries to guarantee that there is never more than one thread in the critical section and that, when a thread calls mut_excl(), it enters the critical section sometime "soon."

An assumption here is that a thread exits fairly quickly after entering the critical section.


Example 10-6 Mutual Exclusion for Two Threads?

void mut_excl(int me /* 0 or 1 */) {
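    /* loser and interested are statics shared by both threads */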
    static int loser;
    static int interested[2] = {0, 0};
    int other; /* local variable */
   
    other = 1 - me;
    interested[me] = 1;
    loser = me;
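    /* Busy-wait: the thread that stored to loser last waits its turn. */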
    while (loser == me && interested[other])
        ;

    /* critical section */
    interested[me] = 0;
}

This algorithm works only when the multiprocessor has strongly ordered memory. On other machines it works just some of the time, so the failure is intermittent and hard to catch.

Some multiprocessors, including some SPARC-based multiprocessors, have store buffers. When a thread issues a store instruction, the data is put into a store buffer. The buffer contents are eventually sent to the cache, but not necessarily right away. (Note that the caches on each of the processors maintain a consistent view of memory, but modified data does not reach the cache right away.)

When multiple memory locations are stored into, the changes reach the cache (and memory) in the correct order, but possibly after a delay. SPARC-based multiprocessors with this property are said to have total store order (TSO).

Suppose one processor stores into location A and then loads from location B, while another processor stores into location B and then loads from location A. The expectation is that either the first processor fetches the newly modified value of B, or the second processor fetches the newly modified value of A, or both; the case in which both processors load the old values simply cannot happen.

However, with the delays caused by load and store buffers, the "impossible case" can happen.
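This effect can be reduced to a classic litmus test. The following sketch is hypothetical (it uses C11 relaxed atomics simply to express the pattern in portable code) and shows two threads performing exactly the store-then-load sequence described above:

#include <stdatomic.h>

atomic_int A, B;     /* both initially 0 */
int r1, r2;

void thread1(void) {
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&B, memory_order_relaxed);
}

void thread2(void) {
    atomic_store_explicit(&B, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&A, memory_order_relaxed);
}

/* If thread1 and thread2 run concurrently on a machine with store
   buffers, the outcome r1 == 0 && r2 == 0 is permitted: each store
   can sit in a store buffer while the other processor loads the
   old value from its cache. */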

What could happen with Peterson's algorithm is that two threads running on separate processors each stores into its own slot of the interested array and then loads from the other slot. They both see the old values (0), assume that the other party is not present, and both enter the critical section. (Note that this is the sort of problem that might not show up when you test a program, but only much later.)

This problem is avoided when you use the threads synchronization primitives, whose implementations issue special instructions to force the writing of the store buffers to the cache.
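For example, Peterson's algorithm can be repaired with sequentially consistent atomic operations, which compile to the same kind of store-buffer-flushing instructions the threads primitives use. This is a sketch using C11 atomics rather than the threads primitives themselves:

#include <stdatomic.h>

void mut_excl(int me /* 0 or 1 */) {
    static atomic_int loser;
    static atomic_int interested[2];   /* statics are zero-initialized */
    int other = 1 - me;

    atomic_store(&interested[me], 1);  /* sequentially consistent store */
    atomic_store(&loser, me);
    /* Sequential consistency prevents the stores above from lingering
       in a store buffer past the loads below, so both threads cannot
       read stale zeros and enter the critical section together. */
    while (atomic_load(&loser) == me && atomic_load(&interested[other]))
        ;

    /* critical section */

    atomic_store(&interested[me], 0);
}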

Parallelizing a Loop on a Shared-Memory Parallel Computer

In many applications, and especially numerical applications, while part of the algorithm can be parallelized, other parts are inherently sequential, as shown in the following:

Thread 1:

    while (many_iterations) {
        sequential_computation
        --- Barrier ---
        parallel_computation
    }

Thread 2 through Thread n:

    while (many_iterations) {
        --- Barrier ---
        parallel_computation
    }

For example, you might produce a set of matrices with a strictly linear computation, then perform operations on the matrices using a parallel algorithm, then use the results of these operations to produce another set of matrices, then operate on them in parallel, and so on.

The nature of the parallel algorithms for such a computation is that little synchronization is required during the computation, but synchronization of all the threads employed is required to ensure that the sequential computation is finished before the parallel computation begins.

The barrier forces all the threads that are doing the parallel computation to wait until all threads involved have reached the barrier. When all the threads have reached the barrier, they are released and begin computing together.
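A minimal sketch of this structure using POSIX barriers, one way to implement the --- Barrier --- line above. The thread count, iteration count, and computation functions are placeholders, not part of the original example:

#include <pthread.h>

#define NTHREADS   4        /* assumed thread count */
#define ITERATIONS 100      /* assumed iteration count */

static pthread_barrier_t barrier;

/* Placeholder phases standing in for the real computation. */
static void sequential_computation(int iter) { (void)iter; }
static void parallel_computation(int id, int iter) { (void)id; (void)iter; }

static void *worker(void *arg) {
    int id = (int)(long)arg;               /* thread 1 has id 0 */

    for (int iter = 0; iter < ITERATIONS; iter++) {
        if (id == 0)
            sequential_computation(iter);  /* only thread 1 runs this */

        /* --- Barrier ---: no thread starts the parallel phase
           until the sequential phase is finished. */
        pthread_barrier_wait(&barrier);

        parallel_computation(id, iter);
        /* If the next sequential phase consumes these results, a
           second pthread_barrier_wait() belongs here as well. */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];

    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}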