NFS Server Performance and Tuning Guide for Sun Hardware

Disk Drives

Disk drive usage is frequently the tightest constraint in an NFS server. Even a sufficiently large memory configuration may not improve performance if the cache cannot be filled quickly enough from the file systems.

Determining if Disks Are the Bottleneck

  1. Use iostat to determine disk usage.

    Look at the number of read and write operations per second (see "Checking the NFS Server" in Chapter 3, Analyzing NFS Performance).

Because successive NFS requests have little dependence on one another, the disk activity they generate consists largely of random access disk operations. The maximum number of random I/O operations per second ranges from 40 to 90 per disk.

Driving a single disk at more than 60 percent of its random I/O capacity creates a disk bottleneck.
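The 60 percent rule above can be checked with simple arithmetic once you have the per-disk read and write rates from iostat. The following is a minimal sketch, not part of the original guide. The function names are hypothetical, and the capacity figure is an assumed value picked from the 40 to 90 range given above; substitute the figure for your own drive.

```python
# Assumed per-disk random I/O capacity in operations per second.
# The guide gives a range of 40 to 90; 60 is only an illustrative choice.
RANDOM_IOPS_CAPACITY = 60.0

# Threshold stated in the text: above 60 percent, the disk is a bottleneck.
BOTTLENECK_FRACTION = 0.60


def disk_load_fraction(reads_per_sec, writes_per_sec,
                       capacity=RANDOM_IOPS_CAPACITY):
    """Return the fraction of random I/O capacity the disk is using."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (reads_per_sec + writes_per_sec) / capacity


def is_bottleneck(reads_per_sec, writes_per_sec,
                  capacity=RANDOM_IOPS_CAPACITY):
    """Return True when the disk exceeds 60 percent of its capacity."""
    load = disk_load_fraction(reads_per_sec, writes_per_sec, capacity)
    return load > BOTTLENECK_FRACTION


if __name__ == "__main__":
    # 30 reads/s plus 10 writes/s against an assumed capacity of 60 ops/s.
    print(is_bottleneck(30, 10))
```

With the assumed capacity, a disk doing 40 operations per second is at about 67 percent of capacity and would be flagged, while one doing 30 operations per second would not.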

Limiting Disk Bottlenecks

Disk bandwidth on an NFS server has the greatest effect on NFS client performance. Providing sufficient bandwidth and memory for file system caching is crucial to providing the best possible file server performance. Note that read/write latency is also important. For example, each NFSop may involve one or more disk accesses. Disk service times add to the NFSop latency, so slow disks mean a slow NFS server.

Follow these guidelines to ease disk bottlenecks:

If one disk is heavily loaded and others are operating at the low end of their capacity, shuffle directories or frequently accessed files to less busy disks.

Adding disks provides additional disk capacity and disk I/O bandwidth.

See the following section, "Replicating File Systems."

Caches for inodes (file information nodes), file system metadata such as cylinder group information, and name-to-inode translations must be sufficiently large, or additional disk traffic is created on cache misses. For example, if an NFS client opens a file, that operation generates several name-to-inode translations on the NFS server.

If an operation misses the Directory Name Lookup Cache (DNLC), the server must search the disk-based directory entries to locate the appropriate entry name. What would nominally be a memory-based operation degrades into several disk operations. Also, cached pages will not be associated with the file.
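The first guideline above, moving data from a heavily loaded disk to less busy ones, starts with knowing which disks fall into each group. The sketch below is one hedged way to sort disks from their observed operation rates. The function name, the disk names, the assumed capacity of 60 operations per second, and the 30 percent "lightly loaded" mark are all illustrative assumptions; only the 60 percent upper threshold comes from this guide.

```python
def find_imbalance(ops_per_sec, capacity=60.0, high=0.60, low=0.30):
    """Split disks into overloaded and lightly loaded groups.

    ops_per_sec maps a disk name to its observed read plus write
    operations per second. capacity and low are assumed values;
    high is the 60 percent threshold from the guide.
    """
    overloaded = sorted(
        disk for disk, ops in ops_per_sec.items() if ops / capacity > high
    )
    lightly_loaded = sorted(
        disk for disk, ops in ops_per_sec.items() if ops / capacity < low
    )
    return overloaded, lightly_loaded


if __name__ == "__main__":
    # Hypothetical per-disk rates; real figures would come from iostat.
    observed = {"sd0": 50, "sd1": 10, "sd2": 25}
    busy, idle = find_imbalance(observed)
    print("move data away from:", busy)
    print("candidate targets:", idle)
```

Disks that land in neither group are left alone. Directories or frequently accessed files on the overloaded disks are the candidates to move to the lightly loaded ones.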

Replicating File Systems

Commonly used file systems, such as the following, are frequently the most heavily used file systems on an NFS server:

The best way to improve performance for these file systems is to replicate them. One NFS server is limited by disk bandwidth when handling requests for only one file system. Replicating the data increases the size of the aggregate "pipe" from NFS clients to the data. However, replication is not a viable strategy for improving performance with writable data, such as a file system of home directories. Use replication with read-only data.

To replicate file systems, do the following:

Specify a server name in the /etc/vfstab file to create a permanent binding from NFS client to the server. Alternatively, listing all server names in an automounter map entry allows completely dynamic binding, but may also lead to a client imbalance on some NFS servers. Enforcing "workgroup" partitions in which groups of clients have their own replicated NFS server strikes a middle ground between the extremes and often provides the most predictable performance.

The frequency of change of the read-only data determines the schedule and the method for distributing the new data. File systems that undergo a complete change in contents, for example, a flat file with historical data that is updated monthly, can be best handled by copying data from the distribution media on each machine, or using a combination of ufsdump and restore. File systems with few changes can be handled using management tools such as rdist.

Adding the Cache File System

The cache file system is client-centered. You use the cache file system on the client to reduce server load. With the cache file system, files are obtained from the server, block by block. The files are sent to the memory of the client and manipulated directly. Data is written back to the disk of the server.

Adding the cache file system to client mounts provides a local replica for each client. The /etc/vfstab entry for the cache file system looks like this:


# device    device    mount    FS    fsck    mount    mount
# to mount  to fsck   point    type  pass    at boot  options
server:/usr/dist      cache    /usr/dist     cachefs  3  yes  ro,backfstype=nfs,cachedir=/cache

Use the cache file system with file systems that are mainly read, such as application file systems. Also use the cache file system for sharing data across slow networks. Unlike a replicated server, the cache file system can be used with writable file systems, but performance degrades as the percentage of writes climbs. If the percentage of writes is too high, the cache file system may decrease NFS performance.

You should also consider using the cache file system if your networks are high speed networks interconnected by routers.

If the NFS server is frequently updated, do not use the cache file system because doing so would result in more traffic than operating over NFS.

  1. To monitor the effectiveness of the cached file systems, use the cachefsstat command (available with the Solaris 2.5 and later operating environments).

    The syntax of the cachefsstat command is as follows:


    system# /usr/bin/cachefsstat [-z] path

    where:

    -z initializes statistics. You should execute cachefsstat -z (superuser only) before executing cachefsstat again to gather statistics on the cache performance. The statistics printed reflect those just before the statistics are reinitialized.

    path is the path the cache file system is mounted on. If you do not specify a path, all mounted cache file systems are used.

Without the -z option, you can execute this command as a regular UNIX user. The statistical information supplied by the cachefsstat command includes cache hits and misses, consistency checking, and modification operations:

Table 4-1 Statistical Information Supplied by the cachefsstat Command

Output               Description
cache hit rate       Percentage of cache hits over the total number of attempts (followed by the actual numbers of hits and misses)
consistency checks   Number of consistency checks performed, followed by the number that passed and the number that failed
modifies             Number of modify operations, including writes and creates

An example of the cachefsstat command is:


system% /usr/bin/cachefsstat /home/sam
cache hit rate: 73% (1234 hits, 450 misses)
consistency checks:  700 (650 pass, 50 fail)
modifies: 321

As a rule, the cache hit rate for a file system should be higher than thirty percent. The 73 percent shown in the previous example is well above that level. A cache hit rate lower than thirty percent means that the access pattern on the file system is widely randomized or that the cache is too small.

The consistency checks output counts the times the cache file system checks with the server to see whether data is still valid. A high failure rate (15 to 20 percent) means that the data of interest is rapidly changing. The cache may be updated more quickly than is appropriate for a cached file system. When you use the output from consistency checks with the number of modifies, you can learn whether this client or other clients are making the changes.

The output for modifies is the number of times the client has written changes to the file system. This output is another method to understand why the hit rate may be low. A high rate of modify operations likely goes along with a high number of consistency checks and a lower hit rate.
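The checks described above can be automated by parsing cachefsstat output. The following is a minimal sketch that assumes the output has exactly the three-line format of the example in this section. The function names are hypothetical, the thirty percent floor is the figure given above, and the 15 percent cutoff is taken as the low end of the "high failure rate" range.

```python
import re

# Sample text copied from the cachefsstat example in this section.
SAMPLE_OUTPUT = """cache hit rate: 73% (1234 hits, 450 misses)
consistency checks:  700 (650 pass, 50 fail)
modifies: 321
"""


def _find(pattern, text):
    """Return the integer groups of pattern, or raise if it is absent."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError("unexpected cachefsstat output format")
    return [int(group) for group in match.groups()]


def parse_cachefsstat(text):
    """Extract the counters from output in the format shown above."""
    hits, misses = _find(r"\((\d+) hits, (\d+) misses\)", text)
    checks, passed, failed = _find(
        r"consistency checks:\s*(\d+) \((\d+) pass, (\d+) fail\)", text)
    (modifies,) = _find(r"modifies:\s*(\d+)", text)
    return {"hits": hits, "misses": misses, "checks": checks,
            "passed": passed, "failed": failed, "modifies": modifies}


def assess(stats, min_hit_rate=0.30, high_fail_rate=0.15):
    """Apply the hit-rate and failure-rate guidelines to parsed counters."""
    attempts = stats["hits"] + stats["misses"]
    hit_rate = stats["hits"] / attempts if attempts else 0.0
    fail_rate = stats["failed"] / stats["checks"] if stats["checks"] else 0.0
    return {"hit_rate": hit_rate,
            "fail_rate": fail_rate,
            "low_hit_rate": hit_rate < min_hit_rate,
            "high_fail_rate": fail_rate >= high_fail_rate}


if __name__ == "__main__":
    print(assess(parse_cachefsstat(SAMPLE_OUTPUT)))
```

For the sample output, the computed hit rate is about 73 percent and the consistency check failure rate is about 7 percent, so neither warning flag is set.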

Also available, beginning with the Solaris 2.5 software environment, are the commands cachefswssize, which determines the working set size for the cache file system, and cachefslog, which displays where the cache file system statistics are being logged. Use these commands to determine if the cache file system is appropriate and valuable for your installation.

Configuration Rules for Disk Drives

Follow these general guidelines for configuring disk drives. In addition to the following general guidelines, more specific guidelines for configuring disk drives in data-intensive environments and attribute-intensive environments follow:

Keep these rules in mind when configuring disk drives in data-intensive environments:

When configuring disk drives in attribute-intensive environments:

Using Solstice DiskSuite or Online: DiskSuite to Spread Disk Access Load

A common problem in NFS servers is poor load balancing across disk drives and disk controllers.

To balance loads, do the following:

The disk mirroring feature of Solstice DiskSuite or Online: DiskSuite improves disk access time and reduces disk usage by providing access to two or three copies of the same data. This is particularly true in environments dominated by read operations. Write operations are normally slower on a mirrored disk since two or three writes must be accomplished for each logical operation requested.
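The trade-off described above can be made concrete with a little arithmetic. The sketch below estimates the physical operations each copy in a mirror must handle. It assumes reads are spread evenly across the copies, which is a simplification and not a statement about how DiskSuite actually schedules reads, and that every write goes to every copy. The function name is hypothetical.

```python
def mirrored_disk_load(logical_reads, logical_writes, copies):
    """Return physical operations per second handled by each mirror copy.

    Assumes reads are divided evenly among the copies (a simplification)
    and that each logical write is performed on every copy.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    reads_per_disk = logical_reads / copies
    writes_per_disk = logical_writes  # every copy performs every write
    return reads_per_disk + writes_per_disk


if __name__ == "__main__":
    # Read-dominated load: a two-way mirror nearly halves per-disk work.
    print(mirrored_disk_load(90, 10, 1), mirrored_disk_load(90, 10, 2))
    # Write-dominated load: mirroring gives almost no per-disk relief.
    print(mirrored_disk_load(10, 90, 1), mirrored_disk_load(10, 90, 2))
```

Under these assumptions, a load of 90 reads and 10 writes per second drops from 100 to 55 operations per disk with a two-way mirror, while a load of 10 reads and 90 writes drops only from 100 to 95.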

Attaining even disk usage usually requires some iterations of monitoring and data reorganization. In addition, usage patterns change over time. A data layout that works when installed may perform poorly a year later. For more information on checking disk drive usage, see "Checking the NFS Server" in Chapter 3, Analyzing NFS Performance.

Using Log-Based File Systems With Solstice DiskSuite or Online: DiskSuite 3.0

The Solaris 2.4 through Solaris 7 software environments and the Online: DiskSuite 3.0 or Solstice DiskSuite software support a log-based extension to the standard UNIX file system, which works like a disk-based Prestoserve NFS accelerator.

In addition to the main file system disk, a small (typically 10 Mbytes) section of disk is used as a sequential log for writes. This speeds up the same kind of operations as a Prestoserve NFS accelerator with two advantages:


Note - You cannot use the Prestoserve NFS accelerator and the log on the same file system.


Using the Optimum Zones of the Disk

When you analyze your disk data layout, consider zone bit recording.

All of Sun's current disks (except the 207 Mbyte disk) use this type of encoding, which uses the peculiar geometric properties of a spinning disk to pack more data into the parts of the platter closest to its edge. As a result, the lower disk addresses (corresponding to the outside cylinders) usually outperform the inside addresses by 50 percent.

  1. Put the data in the lowest-numbered cylinders.

    The zone bit recording data layout makes those cylinders the fastest ones.

This margin is most often realized in serial transfer performance, but it also affects random access I/O. Data on the outside cylinders (starting at cylinder zero) not only moves past the read/write heads more quickly, but those cylinders also hold more data. Data is therefore spread over fewer, larger cylinders, resulting in fewer and shorter seeks.