A cluster file system is a proxy between the kernel on one node and the underlying file system and volume manager running on a node that has a physical connection to the disk(s).
Cluster file systems depend on global devices (disks, tapes, CD-ROMs) with physical connections to one or more nodes. A global device can be accessed from any node in the cluster through the same device name (for example, under /dev/global/), whether or not that node has a physical connection to the storage device. You can use a global device the same as a regular device; that is, you can create a file system on it using newfs or mkfs.
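For example, a file system can be created on a global device just as on a local one. This is a sketch only; the device name d0s0 is an assumption for illustration, and the command must be run on a cluster node where the global device is configured:

```shell
# newfs /dev/global/rdsk/d0s0
```

As with a local disk, newfs is run against the raw (rdsk) path of the global device.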
You can mount a file system on a global device globally with mount -g or locally with mount. Programs can access a file in a cluster file system from any node in the cluster through the same file name (for example, /global/foo). A cluster file system is mounted on all cluster members; you cannot mount a cluster file system on a subset of cluster members.
In Sun Cluster, all multihost disks are configured as disk device groups, which can be Solstice DiskSuite disksets, VxVM disk groups, or individual disks not under control of a software-based volume manager. Also, local disks are configured as disk device groups: a path leads to each local disk from each node. This setup does not mean the data on a disk is necessarily available from all nodes. The data only becomes available to all nodes if the file systems on the disks are mounted globally as a cluster file system.
A local file system that is made into a cluster file system has only a single connection to the disk storage. If the node with the physical connection to the disk storage fails, the other nodes lose access to that cluster file system. You can also have local file systems on a single node that are not directly accessible from other nodes.
HA data services are set up so that the data for the service is stored on disk device groups in cluster file systems. This setup has several advantages. First, the data is highly available: because the disks are multihosted, if the path from the node that is currently the primary fails, access is switched to another node that has direct access to the same disks. Second, because the data is on a cluster file system, it can be viewed directly from any cluster node--you do not have to log in to the node that currently masters the disk device group to view the data.
The cluster file system is based on the proxy file system (PXFS), which has the following features:
PXFS makes file access locations transparent. A process can open a file located anywhere in the system and processes on all nodes can use the same path name to locate a file.
PXFS uses coherency protocols to preserve the UNIX file access semantics even if the file is accessed concurrently from multiple nodes.
PXFS provides extensive caching and zero-copy bulk I/O to move large data objects efficiently.
PXFS provides continuous access to data, even when failures occur. Applications do not detect failures as long as a path to disks is still operational. This guarantee is maintained for raw disk access and all file system operations.
PXFS is independent of underlying file system and volume management software. PXFS makes any supported on-disk file system global.
PXFS is built on top of the existing Solaris file system at the vnode interface. This interface enables PXFS to be implemented without extensive kernel modifications.
PXFS is not a distinct file system type. That is, clients see the underlying file system (for example, UFS).
The cluster file system is independent of the underlying file system and volume manager. Currently, you can build cluster file systems on UFS using either Solstice DiskSuite or VERITAS Volume Manager.
As with normal file systems, you can mount cluster file systems in two ways:
Manually--Use the mount command and the -g option to mount the cluster file system from the command line, for example:
# mount -g /dev/global/dsk/d0s0 /global/oracle/data
Automatically--Create an entry in the /etc/vfstab file with a global mount option to mount the cluster file system at boot. You then create a mount point under the /global directory on all nodes. The directory /global is a recommended location, not a requirement. Here's a sample line for a cluster file system from an /etc/vfstab file:
/dev/md/oracle/dsk/d1 /dev/md/oracle/rdsk/d1 /global/oracle/data ufs 2 yes global,logging
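As a quick sanity check, you can list the entries that request a global mount by scanning the mount-options field of /etc/vfstab. The sketch below runs against a sample copy of the file whose entries are assumptions for illustration; on a live cluster node you would read /etc/vfstab itself:

```shell
# Write a sample vfstab (on a live node, read /etc/vfstab directly).
cat > /tmp/vfstab.sample <<'EOF'
/dev/md/oracle/dsk/d1 /dev/md/oracle/rdsk/d1 /global/oracle/data ufs 2 yes global,logging
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
EOF

# Print the mount point (field 3) of every entry whose mount options
# (field 7) include the global keyword.
awk '$7 ~ /(^|,)global(,|$)/ { print $3 }' /tmp/vfstab.sample
```

For the sample above, this prints /global/oracle/data, the one entry mounted globally.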
While Sun Cluster does not impose a naming policy for cluster file systems, you can ease administration by creating a mount point for all cluster file systems under the same directory, such as /global/disk-device-group. See Sun Cluster 3.0 Installation Guide and Sun Cluster 3.0 System Administration Guide for more information.
You can use the syncdir mount option with cluster file systems, but performance is significantly better if you omit it. If you specify syncdir, writes are guaranteed to be POSIX compliant; if you do not, you get the same behavior seen with UFS file systems. For example, without syncdir you might not discover an out-of-space condition until you close a file, whereas with syncdir (and POSIX behavior) the out-of-space condition would be discovered during the write operation. The cases in which omitting syncdir causes problems are rare, so we recommend that you omit it and take the performance benefit.
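For a file system where POSIX out-of-space semantics matter more than performance, syncdir is added alongside the other mount options in the /etc/vfstab entry. The device and mount-point names below repeat the earlier sample and are illustrative:

```
/dev/md/oracle/dsk/d1 /dev/md/oracle/rdsk/d1 /global/oracle/data ufs 2 yes global,logging,syncdir
```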
See "File Systems FAQ" for frequently asked questions about global devices and cluster file systems.