Multithreaded Programming Guide

The Underlying Architecture

When threads synchronize access to shared storage locations using the threads synchronization routines, the effect of running a program on a shared-memory multiprocessor is identical to the effect of running the program on a uniprocessor.

However, in many situations a programmer might be tempted to take advantage of the multiprocessor and use "tricks" to avoid the synchronization routines. As Example 9-5 and Example 9-6 show, such tricks can be dangerous.

An understanding of the memory models supported by common multiprocessor architectures helps to explain the dangers.

The major multiprocessor components are:

- The processors themselves
- Store buffers, which connect the processors to their caches
- Caches, which hold the contents of recently accessed or modified storage locations
- Memory, which is the primary storage (and is shared by all processors)

In the simple traditional model, the multiprocessor behaves as if the processors are connected directly to memory: when one processor stores into a location and another immediately loads from the same location, the second processor loads what was stored by the first.

Caches can be used to speed the average memory access, and the desired semantics can be achieved when the caches are kept consistent with one another.

A problem with this simple approach is that the processor must often be delayed to make certain that the desired semantics are achieved. Many modern multiprocessors use various techniques to prevent such delays, which, unfortunately, change the semantics of the memory model.

Two of these techniques and their effects are explained in the next two examples.

"Shared-Memory" Multiprocessors

Consider the purported solution to the producer/consumer problem shown in Example 9-5.

Although this program works on current SPARC-based multiprocessors, it assumes that all multiprocessors have strongly ordered memory. This program is therefore not portable.


Example 9-5 The Producer/Consumer Problem--Shared Memory Multiprocessors

char buffer[BSIZE];
unsigned int in = 0;
unsigned int out = 0;

void
producer(char item) {
    do
        ;   /* busy-wait while the buffer is full */
    while (in - out == BSIZE);

    buffer[in%BSIZE] = item;
    in++;
}

char
consumer(void) {
    char item;

    do
        ;   /* busy-wait while the buffer is empty */
    while (in - out == 0);

    item = buffer[out%BSIZE];
    out++;
    return (item);
}

When this program has exactly one producer and exactly one consumer and is run on a shared-memory multiprocessor, it appears to be correct. The difference between in and out is the number of items in the buffer.

The producer waits (by repeatedly computing this difference) until there is room for a new item, and the consumer waits until there is an item in the buffer.

For memory that is strongly ordered (that is, a modification to memory on one processor is immediately available to the other processors), this solution is correct. It remains correct even when in and out eventually overflow, as long as BSIZE is less than the largest integer that can be represented in a word.
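Because in and out are unsigned, their difference wraps around with them. The following sketch (not part of the guide; the values are illustrative and assume a 32-bit unsigned int) shows why the count stays correct after in has wrapped but out has not:

#include <stdio.h>

int main(void) {
    /*
     * Illustrative values, assuming 32-bit unsigned int: the producer
     * has wrapped past UINT_MAX (4294967295) while the consumer has not.
     * Unsigned subtraction is modulo 2^32, so in - out still yields the
     * true number of items in the buffer.
     */
    unsigned int in  = 2u;           /* 4294967298 items produced, wrapped to 2 */
    unsigned int out = 4294967291u;  /* 4294967291 items consumed */

    printf("%u\n", in - out);        /* prints 7 */
    return 0;
}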

Shared-memory multiprocessors do not necessarily have strongly ordered memory. A change to memory by one processor is not necessarily available immediately to the other processors. When two changes to different memory locations are made by one processor, the other processors do not necessarily detect the changes in the order in which they were made because changes to memory do not happen immediately.

First the changes are stored in store buffers that are not visible to the cache.

The processor checks these store buffers to ensure that a program has a consistent view, but because store buffers are not visible to other processors, a write by one processor does not become visible until it is written to cache.

The synchronization primitives (see Chapter 4, Programming With Synchronization Objects) use special instructions that flush the store buffers to cache. So, using locks around your shared data ensures memory consistency.

When memory ordering is very relaxed, Example 9-5 has a problem because the consumer might see that in has been incremented by the producer before it sees the change to the corresponding buffer slot.

This is called weak ordering because stores made by one processor can appear, to another processor, to have occurred out of order (memory, however, is always consistent as seen from the processor that made the stores). To fix this, the code should use mutexes, which flush the store buffers to cache.
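As a concrete illustration of that fix, here is a sketch, assuming POSIX threads (this is not the guide's own example): buffer, in, and out are protected by a mutex, and condition variables replace the busy-waiting. Acquiring and releasing the lock issues the store-buffer-flushing instructions, so a consumer that sees the incremented in also sees the newly filled buffer slot.

#include <pthread.h>

#define BSIZE 256

static char buffer[BSIZE];
static unsigned int in = 0;
static unsigned int out = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t nonfull  = PTHREAD_COND_INITIALIZER;

void
producer(char item) {
    pthread_mutex_lock(&lock);
    while (in - out == BSIZE)            /* buffer full */
        pthread_cond_wait(&nonfull, &lock);
    buffer[in % BSIZE] = item;
    in++;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

char
consumer(void) {
    char item;

    pthread_mutex_lock(&lock);
    while (in - out == 0)                /* buffer empty */
        pthread_cond_wait(&nonempty, &lock);
    item = buffer[out % BSIZE];
    out++;
    pthread_cond_signal(&nonfull);
    pthread_mutex_unlock(&lock);
    return (item);
}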

The trend is toward relaxing memory order. Because of this, programmers are becoming increasingly careful to use locks around all global or shared data.

As demonstrated by Example 9-5 and Example 9-6, locking is essential.

Peterson's Algorithm

The code in Example 9-6 is an implementation of Peterson's Algorithm, which handles mutual exclusion between two threads. This code tries to guarantee that there is never more than one thread in the critical section and that, when a thread calls mut_excl(), it enters the critical section sometime "soon."

An assumption here is that a thread exits fairly quickly after entering the critical section.


Example 9-6 Mutual Exclusion for Two Threads?

void mut_excl(int me /* 0 or 1 */) {
    static int loser;
    static int interested[2] = {0, 0};
    int other; /* local variable */
   
    other = 1 - me;
    interested[me] = 1;
    loser = me;
    while (loser == me && interested[other])
        ;   /* busy-wait while this thread lost the tie-break and the other is still interested */

    /* critical section */
    interested[me] = 0;
}

This algorithm works when the multiprocessor has strongly ordered memory, an assumption that does not hold for many machines.

Some multiprocessors, including some SPARC-based multiprocessors, have store buffers. When a thread issues a store instruction, the data is put into a store buffer. The buffer contents are eventually sent to the cache, but not necessarily right away. (Note that the caches on each of the processors maintain a consistent view of memory, but modified data does not reach the cache right away.)

When multiple memory locations are stored into, the changes reach the cache (and memory) in the correct order, but possibly after a delay. SPARC-based multiprocessors with this property are said to have total store order (TSO).

When one processor stores into location A and then loads from location B, and another processor stores into location B and then loads from location A, the expectation is that either the first processor fetches the newly modified value in location B, or the second processor fetches the newly modified value in location A, or both; the case in which both processors load the old values simply cannot happen.

However, because of the delays caused by load and store buffers, the "impossible case" can happen.
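The sequence just described can be written as a small test. This sketch (assuming POSIX threads; the names A, B, r1, and r2 are illustrative, not from the guide) races two threads that each store to one location and then load from the other. Because the accesses are unsynchronized, whether the "impossible" outcome is ever observed depends on the hardware and the compiler, and usually requires running the test many times.

#include <pthread.h>
#include <stdio.h>

volatile int A = 0, B = 0;   /* the two shared locations */
volatile int r1, r2;         /* what each thread observed */

void *thread1(void *arg) { (void)arg; A = 1; r1 = B; return NULL; }
void *thread2(void *arg) { (void)arg; B = 1; r2 = A; return NULL; }

int main(void) {
    pthread_t t1, t2;

    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    if (r1 == 0 && r2 == 0)
        printf("both threads loaded the old values\n");
    return 0;
}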

What could happen with Peterson's algorithm is that two threads running on separate processors each store into their own slot of the interested array and then load from the other slot. They both read the old values (0), assume that the other party is not present, and both enter the critical section. (Note that this is the sort of problem that might not occur when you test a program, but only much later.)

This problem is avoided when you use the threads synchronization primitives, whose implementations issue special instructions to force the writing of the store buffers to the cache.
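Another way to restore the ordering, outside the primitives this guide describes, is to declare the shared variables as sequentially consistent atomics; a C11 compiler then emits the memory-barrier instructions that force the store buffers to be written out. A sketch under that assumption (not the guide's own code):

#include <stdatomic.h>

static atomic_int loser;
static atomic_int interested[2];

void mut_excl(int me /* 0 or 1 */) {
    int other = 1 - me;

    atomic_store(&interested[me], 1);
    atomic_store(&loser, me);
    /* The sequentially consistent stores above become visible before the
     * loads below, so the two threads can no longer both read stale zeros. */
    while (atomic_load(&loser) == me && atomic_load(&interested[other]))
        ;   /* busy-wait */

    /* critical section */

    atomic_store(&interested[me], 0);
}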

Parallelizing a Loop on a Shared-Memory Parallel Computer

In many applications, and especially numerical applications, while part of the algorithm can be parallelized, other parts are inherently sequential, as shown in the following:

Thread 1:

    while(many_iterations) {
        sequential_computation
        --- Barrier ---
        parallel_computation
    }

Thread 2 through Thread n:

    while(many_iterations) {
        --- Barrier ---
        parallel_computation
    }

For example, you might produce a set of matrixes with a strictly linear computation, then perform operations on the matrixes using a parallel algorithm, then use the results of these operations to produce another set of matrixes, then operate on them in parallel, and so on.

Such parallel algorithms typically require little synchronization during the computation itself, but all of the threads must synchronize to ensure that the sequential computation has finished before the parallel computation begins.

The barrier forces all the threads that are doing the parallel computation to wait until every thread involved has reached the barrier. When all threads have reached the barrier, they are released and begin computing together.
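A minimal sketch of this structure, assuming POSIX threads with the optional pthread_barrier_t facility (the computation functions and thread count are placeholders, not code from the guide):

#include <pthread.h>

#define NTHREADS    4
#define ITERATIONS  100

static pthread_barrier_t barrier;

static void sequential_computation(int iter) { (void)iter; /* ... */ }
static void parallel_computation(int id, int iter) { (void)id; (void)iter; /* ... */ }

static void *worker(void *arg) {
    int id = (int)(long)arg;
    int i;

    for (i = 0; i < ITERATIONS; i++) {
        if (id == 0)
            sequential_computation(i);   /* thread 1 does the sequential part */
        /* every thread waits here until the sequential part is finished */
        pthread_barrier_wait(&barrier);
        parallel_computation(id, i);     /* all threads compute in parallel */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    long i;

    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}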