As its name implies, the distinguishing characteristic of a parallel file system is the parallel layout of its files. Unlike serial file systems such as UFS, which conduct file I/O in single, serial streams, PFS can distribute a file across two or more disks, each of which may be attached to a different PFS I/O server. This allows file I/O to be divided into multiple parallel streams, which can yield significant performance gains on file read and write operations.
Standard Solaris file system commands can be used to access and manipulate PFS files. However, the high-performance I/O capabilities of PFS can be fully exploited only through calls to MPI I/O library routines.
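To illustrate, the following minimal C sketch uses standard MPI I/O calls to write a file in parallel from several processes. The path /pfs-demo0/output.dat and the block size are assumptions chosen for the example; any file on a mounted PFS file system could be used in their place.

    /* Minimal MPI I/O sketch: each process writes one block of a shared PFS
       file.  The path and block size below are assumptions for illustration. */
    #include <mpi.h>
    #include <string.h>

    #define BLOCK (64 * 1024)    /* bytes written by each process */

    int main(int argc, char *argv[])
    {
        int rank;
        char buf[BLOCK];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, rank, sizeof(buf));   /* fill the block with this rank's ID */

        /* All processes open the same file on a PFS file system collectively. */
        MPI_File_open(MPI_COMM_WORLD, "/pfs-demo0/output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each process writes its block at a disjoint offset; the collective
           call lets the library carry out the transfers in parallel streams. */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                              MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

Because each process writes at a disjoint offset, PFS can service the transfers through its I/O servers concurrently.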
PFS file systems are defined in the hpc.conf file. There, each file system is given a name and a list of the hostnames of the PFS I/O servers across which it will be distributed.
A PFS I/O server is simply a Sun HPC node that has disk storage systems attached, has been defined as a PFS I/O server in the hpc.conf file, and is running a PFS I/O daemon. A PFS I/O server and the disk storage device(s) attached to it are jointly referred to as a PFS storage system.
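Purely to suggest the kind of information hpc.conf carries - a file system name plus the I/O servers and devices that back it - a hypothetical entry for pfs-demo0 might be sketched as follows. The keywords, hostnames, device paths, and layout shown here are illustrative assumptions, not the literal file format; consult the hpc.conf documentation for your release for the exact syntax.

    # Hypothetical sketch only - not the literal hpc.conf syntax.
    # pfs-demo0 is distributed across three PFS I/O servers.
    Begin PFSFileSystem=pfs-demo0
        NODE    DEVICE                  THREADS
        ios0    /dev/rdsk/c0t1d0s2      1
        ios1    /dev/rdsk/c1t1d0s2      1
        ios2    /dev/rdsk/c1t2d0s2      1
    End PFSFileSystem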
Figure 4-1 illustrates a sample Sun HPC cluster with eight nodes:
Four nodes function as compute servers only - CS0, CS1, CS2, and CS4.
Three nodes function as PFS I/O servers only - IOS0, IOS1, and IOS2.
One node operates as both a compute server and a PFS I/O server - CS3-IOS3.
All four PFS I/O servers have disk storage subsystems attached. PFS I/O servers IOS0 and IOS3 each have a single disk storage unit, while IOS1 and IOS2 are each connected to two disk storage units.
The PFS configuration example in Figure 4-1 shows two PFS file systems, pfs-demo0 and pfs-demo1.
Each PFS file system is distributed across three PFS storage systems. This means an individual file in either file system will be divided into three blocks, which will be written to and read from those three storage systems in three parallel data streams.
Note that two PFS storage systems, IOS1 and IOS2, contain at least two disk partitions, allowing them to be used by both pfs-demo0 and pfs-demo1.
The dashed lines labeled pfs-demo0 I/O indicate the data flow between compute processes 0, 1, and 2 and the PFS file system pfs-demo0. Likewise, the set of solid lines labeled pfs-demo1 I/O represent I/O for the PFS file system pfs-demo1.
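An application can also try to influence how its files map onto the PFS storage systems through the MPI I/O hints mechanism. The sketch below passes the MPI-2 reserved hints striping_factor and striping_unit when opening a file; whether a given PFS release honors these hints is implementation dependent, and the path and values shown are again only illustrative assumptions.

    /* Sketch: requesting a striping layout through MPI I/O hints.  The keys
       striping_factor and striping_unit are reserved by the MPI-2 standard;
       whether PFS honors them is implementation dependent. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "3");     /* three storage systems */
        MPI_Info_set(info, "striping_unit", "65536");   /* 64-Kbyte stripe width */

        MPI_File_open(MPI_COMM_WORLD, "/pfs-demo1/data.dat",  /* hypothetical path */
                      MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
        MPI_Info_free(&info);

        /* ... file reads and writes would go here ... */

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }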
This method of laying out PFS files introduces some file system configuration issues not encountered with UFS and other serial file systems. These issues are discussed in the balance of this section.
Although PFS files are distributed differently from UFS files, the same Solaris utilities can be used to manage them.
If you plan to configure only a subset of a cluster's nodes as PFS I/O servers, you have the option of either colocating applications and PFS I/O daemons on the same nodes or segregating them onto separate nodes. If, however, you configure all the nodes in a cluster as PFS I/O servers, applications and PFS I/O daemons will necessarily be colocated.
Guidelines for making this choice are provided below.
Each of the following conditions favors colocating applications with PFS I/O daemons.
Large nodes (many CPUs per node).
Fast disk-storage devices (storage arrays, for example) on each node.
Lower-performance cluster interconnect, such as 10- or 100-BaseT Ethernet.
Small number of applications competing for node resources.
When these conditions exist in combination, the network is more likely to be a performance-limiting resource than the relatively more powerful nodes. Therefore, it becomes advantageous to locate applications on the PFS I/O servers to decrease the amount of data that must be sent across the network.
You should avoid running applications on I/O server nodes when some or all of the following conditions exist.
Smaller nodes (few CPUs per node).
Slow disk storage devices (single disks, for example) on each node.
Relatively high-performance cluster interconnect, such as SCI or ATM.
Large number of applications competing for node resources.
In this case, the competition for memory, bus bandwidth, and CPU cycles may offset any performance advantages local storage would provide.
By itself, the size of a cluster (number of nodes) does not favor either colocating or not colocating applications and PFS I/O daemons. Larger clusters do, however, attenuate the benefits of colocating. This is because the amount by which colocating reduces network traffic can be expressed as
Tc = Ts - Ts/N
where Tc is the level of network traffic with colocating, Ts is the level of network traffic without colocating, and N is the number of nodes in the cluster. In other words, colocating reduces network traffic by the fraction 1/N. The more nodes there are in the cluster, the smaller the effect of colocating.
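For example, on a four-node cluster colocating removes only the quarter of the traffic that would have been destined for a process's own node, and on a 32-node cluster the saving drops to about 3 percent. The short C sketch below simply evaluates the relation above for a few hypothetical cluster sizes.

    /* Illustrative arithmetic only: evaluates Tc = Ts - Ts/N for a few
       hypothetical cluster sizes, with Ts fixed at 100 units. */
    #include <stdio.h>

    int main(void)
    {
        const double Ts = 100.0;                 /* traffic without colocating */
        const int sizes[] = { 2, 4, 8, 16, 32 }; /* hypothetical cluster sizes */

        for (int i = 0; i < 5; i++) {
            int N = sizes[i];
            double Tc = Ts - Ts / N;             /* traffic with colocating */
            printf("N = %2d: Tc = %5.1f (%.1f%% reduction)\n",
                   N, Tc, 100.0 / N);
        }
        return 0;
    }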