This article explains common parallel and multithreading concepts, and differentiates between the hardware and software aspects of parallel processing. It briefly explains the hardware architectures that make parallel processing possible. The article describes several popular parallel programming models. It also makes connections between parallel processing concepts and related Sun hardware and software offerings.

Parallel Processing and Programming Terms

The terms parallel computing, parallel processing, and parallel programming are sometimes used ambiguously, or are not clearly defined and differentiated. Parallel computing is a term that encompasses all the technologies used in running multiple tasks simultaneously on multiple processors. Parallel processing, or parallelism, is accomplished by dividing a single runtime task into multiple, independent, smaller tasks. The tasks can execute simultaneously when more than one processor is available. If only one processor is available, the tasks execute sequentially. On a modern high-speed single processor, the tasks might appear to run at the same time, but in reality only one task executes at any given moment.

Parallel programming, or multithreaded programming, is the software methodology used to implement parallel processing. The program must include instructions to inform the runtime system which parts of the application can be executed simultaneously. The program is then said to be parallelized. Parallel programming is a performance optimization technique that attempts to reduce the “wall clock” runtime of an application by enabling the program to handle many activities simultaneously.

Parallel programming can be implemented using several different software interfaces, or parallel programming models.

This article explains the common parallel and multithreading concepts, and the differences between the hardware and software aspects of parallel processing. It briefly describes the hardware architectures that make parallel processing possible, and presents several popular parallel programming models. Pointers to other locations where you can read more about specific topics are included.

Parallel Processing

Parallel processing is a general term for the process of dividing tasks into multiple subtasks that can execute at the same time. These subtasks are known as threads, which are runtime entities that are able to independently execute a stream of instructions. Parallel processing can occur at the hardware level and at the software level. Distinguishing between these types of parallel processing is important. At the software level, an application might be rewritten to take advantage of parallelism in the code. With the right hardware support, such as a multiprocessing system, the threads can then execute simultaneously at runtime. If not enough processors or cores are available for all the threads to run simultaneously, certain tasks might still execute one after the other. The common way to describe such non-parallel execution is to say these tasks execute sequentially or serially.

Parallelism in the Hardware

Execution of a parallel application is dependent on hardware design. However, even when the system is capable of parallel execution, the software must still divide, schedule, and manage the tasks.

Multiprocessors – More than one processor can be active simultaneously. The processors use shared memory to communicate and share data. The allocation of tasks between the processors is handled by the operating system, so the system is able to execute multiple jobs simultaneously. The simultaneous execution improves the overall throughput, and for a given workload, reduces the turnaround time for the applications when compared to a system with a single processor. In certain cases, this reduction might not be sufficient because executing the single application still takes too long. At that point, parallel programming might be considered as a way to address this problem. The application developer needs to select a suitable parallel programming model such as POSIX Threads or OpenMP to implement the parallelism. Most Sun hardware is available in multiprocessor configurations, with a few entry level servers and workstations having one processor. Sun offers a comparison of Sun server families and a comparison of Sun workstations that detail the number of processors in Sun machines.
Multicore processors – More than one core, or processing unit, in a single chip can be active simultaneously. Multicore processing is sometimes called chip-level multiprocessing (CMP) because multiple processors are on a single chip. The cores use shared memory, a shared system bus, and, in some cases, shared caches, to communicate and share data with each other. The cores generally have their own processing units and registers. The architecture of each core varies with different processor implementations. The operating system views each core as a processor, and handles the allocation of tasks between the cores. A multicore processor is like a multiprocessor system implemented on a single chip. Although differences exist, especially with respect to the sharing of resources, from an application point of view multiple processors and multicore processors are effectively the same. Therefore, with single-threaded applications running on a multicore processor, the throughput of a multijob workload is increased by executing more than one application simultaneously. For example, on a dual-core processor, two programs can run at the same time. For a parallel application, the independent tasks can be scheduled onto the various cores of the processor. In both cases, however, the performance might not be as good as on a true multiprocessor design. The performance largely depends on the multicore implementation, and how many shared resources are needed by the applications that are running simultaneously. Sun servers are available with single-core or dual-core AMD Opteron processors, dual-core or quad-core Intel Xeon processors, and with single-core UltraSPARC III and UltraSPARC IIIi processors or dual-core UltraSPARC IV and UltraSPARC IV+ processors.
The Sun Studio Tools for Parallel Programming in Multicore Environments page provides links to many sources of information about programming for multicore processors.
Multithreaded processors – These processors contain a number of multithreaded cores, which switch between a number of active threads. Some processor cores implement vertical multithreading (VMT), which enables the core to execute multiple threads in an interleaved fashion. If one software thread stalls waiting for a resource (data to come from memory, input/output, and so on), another thread immediately takes over execution. When the second thread stalls, the first thread or another waiting thread takes over. VMT enables processor cycles to be used more efficiently. Early VMT designs suffered from too much resource sharing, and in many cases, overall performance could be improved by disabling resource sharing. Current processors that use vertical threads include the dual-core SPARC64 VI processor, which has two vertical threads per core, and was developed by Fujitsu. The Sun SPARC Enterprise M-series servers use the SPARC64 VI processor.
A further refinement of hardware multithreading technology is called simultaneous multithreading (SMT). A truly multithreaded processor, specifically designed for SMT, does not have a resource sharing problem. Sun has introduced the term chip multithreading (CMT) for a processor design with multiple cores in which each core is multithreaded. The UltraSPARC T1 is the first processor implementing this design. The UltraSPARC T1 processor is deployed in the T1000 and T2000 server models. The UltraSPARC T2 is the latest generation to extend these concepts further. The UltraSPARC T2 processor is deployed in the T5120 and T5220 server models.
Sun uses the term throughput computing for its strategy of processor design that greatly increases throughput, or the amount of work that can be done by a computer in a given period of time. The processors use chip multithreading, fully exploited through optimizations in the Solaris OS, to achieve these performance gains. This term is adapted from networking terminology, in which throughput is defined as the rate at which a computer or network sends or receives data. For more information about throughput computing, see the Throughput Computing White Paper.
The term CoolThreads refers to the CMT-based processor line from Sun, the first of which is the UltraSPARC T1. The CoolThreads name reflects the processor's chip multithreading architecture, and its low power usage, which causes less heat to be dissipated, resulting in a cooler chip. For more information about CoolThreads and the UltraSPARC T2 and UltraSPARC T1, see the Sun Servers with CoolThreads Technology overview.
The Solaris 10 OS is optimized for running on CoolThreads processors. Other operating systems are being ported to the UltraSPARC architecture through the OpenSPARC community.
Cluster computing – A cluster is a group of computers, generally called nodes, working together as a single system. Often the nodes are the same type of computer, running the same operating system, and belonging to the same administrative domain. Special cluster software running on the nodes and a high-speed network connecting the nodes enable rapid communication between them. Clusters can be configured to provide high availability (HA), for situations where the hardware and software must always be up and running. Hardware and software failures in a node do not cause the cluster to fail because built-in redundancy in HA configurations enables other nodes to pick up the tasks of a failed node while the cluster continues to run. Examples of environments requiring high availability are online reservation or ordering systems.
A cluster of systems can also be used as a large parallel computer, useful for high performance computing (HPC). Clusters configured for HPC might be used to run parallelized scientific applications, for example. Usually the HPC and HA uses of a cluster are not combined. When used for HPC environments, all of a cluster's available resources are used for the tasks at hand. If a failure occurs, the hardware or software is fixed and restarted.
To fully realize the multiprocessing benefits of running on a cluster, applications should be parallelized using one of the software parallel programming models.
Grid computing – This term refers to a heterogeneous mix of networked computers working together, similar to a cluster but potentially working across administrative domains or organizations. The nodes on a grid can range from a small group of systems located in the same room to a large set of networked computers installed around the world. Even a cluster can be a node in a grid. Each node in the grid runs special software that enables it to make optimal use of the available resources like CPU cycles and storage that are contributed by the nodes on the grid. Often, the grid software can be configured so that any possible spare CPU cycle is used to run applications. This technique enables optimal use of the system. Originally, grids were used to run scientific applications. More recently, grid use has extended to other environments, including environments where clusters have traditionally been used. As a result, the difference between a cluster and a grid is not always very clear. The system software is often the main differentiator.

Parallelism by Software Programming Models

The Solaris OS kernel and most Solaris services have been multithreaded and optimized for many years in order to take advantage of multiprocessor architectures. Sun continues to invest in parallelizing and optimizing Solaris software to fully support emerging parallel architectures. For a single application to benefit from a multiprocessor architecture including clusters and grids, the program should be parallelized using one of the parallel programming models. In all cases, the application's use of parallelism must improve performance enough to surpass the processing overhead that comes with the programming model. The creation and management of threads are examples of processing overhead.

The programming model used in any application depends on the underlying hardware architecture of the system on which the application is expected to run. Specifically, the developer must distinguish between a shared memory system and a distributed memory system. In a shared memory architecture, the application can transparently access any memory location. A multicore processor is an example of a shared memory system. In a distributed memory environment, the application can only transparently access the memory of the node it is running on. Access to the memory on another node has to be explicitly arranged within the application. Clusters and grids are examples of distributed memory systems.

For more information about parallel computing software models, see the technical article Developing Applications for Parallel Computing by Liang Chen.

Shared Memory Programming Models

Shared memory programming is also called multithreaded or threaded programming. In this context, threads are lightweight processes that exist within a single operating system process. Threads share the memory address space and state information of the process that contains them. The containing process is sometimes also called the parent process. The shared memory model is supported on computers in which every processor or core has access to the same shared memory. Such a system has a single address space. Communication and data exchange between the threads take place through shared memory.
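
To make the shared address space concrete, the following is a minimal sketch that uses the POSIX threads API, one of the models described later in this section. The names threaded_sum, sum_slice, struct slice, and NUM_THREADS are hypothetical and chosen only for this illustration. Each thread reads the same input array and writes its partial result into a structure that the calling thread later reads, so all data exchange happens through ordinary memory.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* hypothetical thread count chosen for this sketch */

/* Work description for one thread. Every field lives in the address
   space of the containing process, so no copying between threads occurs. */
struct slice {
    const int *data;   /* shared input array, read-only */
    size_t begin;      /* first index handled by this thread */
    size_t end;        /* one past the last index */
    long partial;      /* written by the thread, read by the caller */
};

/* Thread entry point: sum one slice of the shared array. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    long total = 0;
    for (size_t i = s->begin; i < s->end; i++)
        total += s->data[i];
    s->partial = total;   /* result is passed back through shared memory */
    return NULL;
}

/* Sum n integers with NUM_THREADS threads.
   Returns 0 and stores the total in *out, or returns -1 if a thread
   could not be created. */
int threaded_sum(const int *data, size_t n, long *out)
{
    pthread_t tid[NUM_THREADS];
    struct slice work[NUM_THREADS];
    size_t chunk = (n + NUM_THREADS - 1) / NUM_THREADS;
    int started = 0;
    long total = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        size_t begin = (size_t)t * chunk;
        size_t end = begin + chunk;
        if (begin > n) begin = n;
        if (end > n) end = n;
        work[t] = (struct slice){ data, begin, end, 0 };
        if (pthread_create(&tid[t], NULL, sum_slice, &work[t]) != 0)
            break;
        started++;
    }

    /* Joining guarantees each partial result is complete before it is read. */
    for (int t = 0; t < started; t++) {
        pthread_join(tid[t], NULL);
        total += work[t].partial;
    }
    if (started != NUM_THREADS)
        return -1;
    *out = total;
    return 0;
}
```

A caller would pass an array and its length, for example threaded_sum(values, count, &total), and check the return value before using total. Because each thread writes only to its own slice structure, this sketch needs no locking. Threads that update the same variable would need a synchronization mechanism such as a mutex.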

Parallel programming can be implemented for shared memory systems using any of the following models.

Automatic parallelization – When the program is compiled, the compiler tries to identify the parallelism in the application. The focus is on loops, either a single loop or a set of nested loops, as this area is typically where most of the execution time is spent. Through a dependence analysis, the compiler determines whether parallelizing a loop is safe. If it is safe, the compiler generates the right parallel infrastructure for parallel execution at runtime. The developer merely has to use the appropriate option on the compiler to activate this feature. With the Sun Studio compilers, this option is the -xautopar option. The -xloopinfo option, which displays parallelization messages, is also highly recommended.
POSIX threads and Solaris threads – The Solaris OS supports two shared-memory threading models. The standard POSIX threads API, usually abbreviated as Pthreads, is available for applications written in C. The older Solaris threads API, which predates the Pthreads standard, is also supported. The POSIX threads API is the standard supported on many UNIX-based operating systems. Use of this standard increases portability. Both libraries are included in the standard C library libc in the Solaris OS. See the pthreads(5) man page for a comparison of both APIs.
For condensed information about Pthreads programming, see the POSIX Threads Programming tutorial at www.llnl.gov. For a more comprehensive understanding of programming with POSIX threads, you might read the books Programming with POSIX Threads by David R. Butenhof and Programming with Threads by Steve Kleiman, Devang Shah, and Bart Smaalders.
OpenMP – This API specification is for implementing parallel programming on a shared memory system. OpenMP offers a higher level model than POSIX threads and also provides additional functionality. In many cases, an OpenMP implementation is built on top of a native threading model like POSIX threads. OpenMP consists of a set of compiler directives, runtime functions, and environment variables. Fortran, C and C++ are supported.
The compiler directive plays a key role in OpenMP. By inserting directives in the source, the developer specifies what parts of the program can be executed in parallel. The compiler transforms these specified parts of the program into the appropriate infrastructure, such as a function call to an underlying multitasking library. OpenMP has four main advantages over other programming models:
- Portability – Although OpenMP is not an official standard, a program using OpenMP is portable to another OpenMP compiler or environment.
- Ease of use – The developer does not have to create and manage threads at the level of POSIX threads, for example. Thread management is handled by the compiler and underlying multitasking library.
- The application can be parallelized step by step – The developer specifies the sections that can be executed in parallel, and can thus incrementally parallelize the application as necessary.
- The sequential version of the program is preserved – If the program is not compiled with the compiler option for OpenMP, the directives in the code are ignored. This behavior effectively disables parallel execution for that source and the program runs sequentially again.
The OpenMP specification is available at http://www.openmp.org/. Many articles about OpenMP are available on the Parallel Programming page of the Sun Developer Network site.
The Sun Studio documentation set includes the Sun Studio 12: OpenMP API User’s Guide, which describes issues specific to the Sun Studio implementation of the OpenMP API.
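
Of the models listed above, automatic parallelization depends entirely on the compiler's dependence analysis. As a rough illustration of what that analysis has to decide, the following sketch contrasts two loops. The function names vector_add and running_sum are hypothetical. Whether a particular compiler, including one invoked with the -xautopar option, actually parallelizes the first loop is not something this sketch can promise. It shows only the property being checked.

```c
#include <stddef.h>

/* Independent iterations: each c[i] depends only on a[i] and b[i].
   The restrict qualifiers state that the arrays do not overlap, which is
   the kind of fact a dependence analysis needs before it can treat the
   loop as safe to execute in parallel. */
void vector_add(size_t n, const double *restrict a,
                const double *restrict b, double *restrict c)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Loop-carried dependence: iteration i reads x[i - 1], which iteration
   i - 1 has just written. Executing the iterations in a different order,
   or at the same time, would change the result, so this loop is not safe
   to parallelize as written. */
void running_sum(size_t n, double *x)
{
    for (size_t i = 1; i < n; i++)
        x[i] = x[i] + x[i - 1];
}
```

In the first function the iterations can be divided among threads in any way without affecting the output. In the second, each iteration must wait for the previous one, which is the kind of loop a dependence analysis would be expected to leave sequential.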

Distributed Memory Programming Models

Developers can implement the parallelism in an application by using a very low-level communication interface, such as sockets, between networked computers. However, using such a method is the equivalent of using assembly language programming for applications: very powerful, but also very minimal. As a result, an application parallelized using such an API might be hard to maintain and expand.
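
The following sketch gives a sense of how much detail such a low-level interface leaves to the developer. It is an illustration under simplifying assumptions: both ends run on one machine and are connected with a UNIX domain socket pair so that the example is self-contained, whereas a cluster would use network sockets between nodes. The names remote_sum, worker, send_all, and recv_all are hypothetical. The worker process uses only the data that arrives over the socket, and the coordinator must frame, send, and receive every byte explicitly.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write exactly len bytes; returns 0 on success, -1 on error. */
static int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes; returns 0 on success, -1 on error or EOF. */
static int recv_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Worker side: receive a count and that many ints, reply with their sum.
   Every path ends in _exit, so this function never returns. */
static void worker(int fd)
{
    int count = 0;
    long sum = 0;
    if (recv_all(fd, &count, sizeof count) != 0) _exit(1);
    for (int i = 0; i < count; i++) {
        int value;
        if (recv_all(fd, &value, sizeof value) != 0) _exit(1);
        sum += value;
    }
    _exit(send_all(fd, &sum, sizeof sum) == 0 ? 0 : 1);
}

/* Coordinator side: ship the data to a worker process and wait for the
   answer. Returns 0 and stores the sum in *result, or -1 on failure. */
int remote_sum(const int *data, int count, long *result)
{
    int fds[2];
    if (count < 0) return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {          /* child: a separate process */
        close(fds[0]);
        worker(fds[1]);
    }

    close(fds[1]);
    int ok = send_all(fds[0], &count, sizeof count) == 0
          && send_all(fds[0], data, (size_t)count * sizeof *data) == 0
          && recv_all(fds[0], result, sizeof *result) == 0;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return ok ? 0 : -1;
}
```

Even for this small task, the code has to define its own message layout (a count followed by the values), handle partial reads and writes, and manage the worker's lifetime. None of that work is related to the computation itself.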

The Message Passing Interface (MPI) model is commonly used to parallelize applications for a cluster of computers, or a grid. Like OpenMP, this interface is an additional software layer on top of basic OS functionality. MPI is built on top of a software networking interface, such as sockets, with a protocol such as TCP/IP. MPI provides a rich set of communication routines, and is widely available.

An MPI program is a sequential C, C++, or Fortran program that runs on a subset of processors, or all processors or cores in the cluster. The programmer implements the distribution of the tasks and communication between the tasks, and decides how the work is allocated to the various processes. To this end, the program needs to be augmented with calls to MPI library functions, for example, to send information to and receive information from other processes.

MPI is a very explicit programming model. Although some convenience functionality is provided, such as a global broadcast operation, the developer has to specifically design the parallel application for this programming model. Many low-level details also need to be handled explicitly.

The advantage of MPI is that an application can run on any type of cluster that has the software to support the MPI programming model. Although originally MPI programs mainly ran on clusters of single processor workstations or PCs, running an MPI application on one or more shared memory computers is now common. An optimized MPI implementation can then also take advantage of the faster communication over shared memory for those processes executing in the same system.

The following resources provide more information about MPI:

The MPI specification is available from Argonne National Laboratory at http://www-unix.mcs.anl.gov/mpi/
Open MPI is an open-source effort by a consortium of research, academic, and industry partners to build an MPI library that combines technologies and resources from several MPI projects. Open MPI is the basis for the Sun HPC ClusterTools 7 software. You can download this software for free from the Sun HPC ClusterTools 7 page.
For a detailed overview of MPI, see the Message Passing Interface (MPI) Tutorial at www.llnl.gov.
Additional online tutorial material about MPI is available at http://www-unix.mcs.anl.gov/mpi/tutorial/.

Hybrid Programming Models

With the emergence of multicore systems, an increasing number of clusters and grids are parallel systems with two layers. Within a single node, fast communication through shared memory can be exploited, and a networking protocol can be used to communicate across the nodes. Programs can take advantage of both shared memory and distributed memory.

The MPI model can be used to run parallel applications on clusters of multicore systems. MPI applications run across the nodes as well as within each node, so both parallelization layers, shared and distributed, could be used through MPI. In certain situations, however, adding the finer-grained parallelization offered by a shared memory programming model such as Pthreads or OpenMP is more efficient. Typically, parallel execution over the nodes is achieved through MPI. Within one node, Pthreads or OpenMP is used. When two programming models are used in one application, the application is said to be parallelized with a hybrid or mixed-mode programming model.

Another hybrid programming model that is sometimes used combines Pthreads and OpenMP. This type of application runs only in one shared-memory system. The work of each thread created through Pthreads is further parallelized using OpenMP, taking advantage of the additional parallelism available within that thread's work.
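
As a hedged sketch of this Pthreads and OpenMP combination, the following code creates two threads with the POSIX threads API and lets an OpenMP directive divide the loop inside each thread. The names hybrid_sum_of_squares, square_half, and struct half are hypothetical, and the split into exactly two threads is an assumption made to keep the example short. Whether nesting the two models pays off depends on the OpenMP implementation and on how many cores are available.

```c
#include <pthread.h>
#include <stddef.h>

/* One half of the input, processed by one Pthreads thread. */
struct half {
    const int *data;       /* shared input array, read-only */
    size_t begin, end;     /* index range [begin, end) */
    long sum_of_squares;   /* result for this half */
};

/* Thread entry point. */
static void *square_half(void *arg)
{
    struct half *h = arg;
    long begin = (long)h->begin;
    long end = (long)h->end;
    long total = 0;

    /* Second level: OpenMP may split this thread's range further.
       When the code is compiled without an OpenMP option, the directive
       is ignored and the loop simply runs sequentially. */
    #pragma omp parallel for reduction(+:total)
    for (long i = begin; i < end; i++)
        total += (long)h->data[i] * h->data[i];

    h->sum_of_squares = total;
    return NULL;
}

/* First level: two Pthreads threads, one per half of the array.
   Returns 0 and stores the result in *out, or -1 on failure. */
int hybrid_sum_of_squares(const int *data, size_t n, long *out)
{
    pthread_t tid[2];
    struct half part[2] = {
        { data, 0, n / 2, 0 },
        { data, n / 2, n, 0 },
    };

    if (pthread_create(&tid[0], NULL, square_half, &part[0]) != 0)
        return -1;
    if (pthread_create(&tid[1], NULL, square_half, &part[1]) != 0) {
        pthread_join(tid[0], NULL);
        return -1;
    }
    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);

    *out = part[0].sum_of_squares + part[1].sum_of_squares;
    return 0;
}
```

The coarse division of work is explicit in the Pthreads calls, while the fine-grained division is left to the directive.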

Sun Parallel Application Development Software

Sun offers software products to support the technologies discussed in this article.

For Shared Memory Systems

Sun software for shared memory systems includes:

Threads – POSIX threads and Solaris threads libraries are both included in the Solaris libc library.

OpenMP – An implementation of OpenMP for C, C++ and Fortran is included in the Sun Studio software, which is free to download. The -xopenmp compile and link-time option instructs the Sun Studio compiler to recognize OpenMP directives and runtime functions in a program. The OpenMP runtime support library, libmtsk, provides support for thread management, synchronization, and scheduling of work. The library is implemented on top of the POSIX threads library.
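
The directive-based style can be sketched in a few lines. The function names dot and max_threads below are hypothetical. The code is meant to build both with and without an OpenMP option: with the Sun Studio compilers the option is -xopenmp as described above, and with GCC the corresponding option is -fopenmp. Without such an option the directive is ignored, the _OPENMP macro is not defined, and the program runs sequentially.

```c
#ifdef _OPENMP
#include <omp.h>   /* OpenMP runtime functions, only when OpenMP is enabled */
#endif

/* Dot product of two vectors. The iterations are independent except for
   the accumulation into sum, which the reduction clause handles by giving
   each thread a private copy and combining the copies at the end. */
double dot(long n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Upper bound on the number of threads a parallel region would use.
   Reports 1 when the program was compiled without OpenMP support. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

The loop body is identical to what a sequential program would contain. Only the single directive line marks the loop as parallel, which is what allows an application to be parallelized one loop at a time.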

For Distributed Memory Systems

An implementation of MPI is included in Sun HPC ClusterTools. This product also includes driver compile scripts and tools to query and manage the jobs at runtime. Note that multiple versions of Sun HPC ClusterTools are available. The ClusterTools 5 and ClusterTools 6 software includes the Sun implementation of MPI, called Sun MPI. The ClusterTools 7 software includes the newer open-source implementation of MPI, called Open MPI. The Sun HPC ClusterTools 7.1 Software Migration Guide describes the differences between Sun MPI and Open MPI to help in upgrading applications that use Sun MPI functions to run with Open MPI. For complete ClusterTools information, see Sun HPC ClusterTools 7.1 Documentation.

Hardware and Software for HPC Cluster and Grid

Sun products for implementing and managing clusters include:

Solaris Cluster software.
N1 System Manager for managing distributed systems.
Sun Customer Ready HPC Cluster, a complete clustering hardware and software solution, and the Sun Customer Ready HPC Scalable Storage Cluster for high-performance data storage.

Sun grid computing products include:

N1 Grid Engine software, which enables your organization to create its own grid.
A publicly accessible utility computing grid, called the Sun Grid Compute Utility, at www.network.com. The Sun Grid Compute Utility Developer Resources page on the Sun Developer Network provides more information.
Sun Customer Ready HPC Cluster, which provides a cluster solution, but can also serve well as the cornerstone of your grid implementation.

See Sun High Performance Computing for more information about HPC at Sun, including the Sun Constellation System, an HPC integrated environment including high performance system and storage hardware, software, and developer tools.