NFS Server Performance and Tuning Guide for Sun Hardware

Disk Drives

Disk drive usage is frequently the tightest constraint in an NFS server. Even a sufficiently large memory configuration may not improve performance if the cache cannot be filled quickly enough from the file systems.

Determining if Disks Are the Bottleneck

Use iostat to determine disk usage.

Look at the number of read and write operations per second (see "Checking the NFS Server" in Chapter 3, Analyzing NFS Performance).

Because there is little dependence in the stream of NFS requests, the disk activity generated contains large numbers of random access disk operations. The maximum number of random I/O operations per second ranges from 40 to 90 per disk.

Driving a single disk at more than 60 percent of its random I/O capacity creates a disk bottleneck.
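The 60 percent guideline can be checked with simple arithmetic. The following is a minimal sketch, assuming you have taken the reads and writes per second for one disk from iostat and have chosen a random I/O capacity for that drive somewhere in the 40 to 90 range. The function names and the sample figures are hypothetical.

```python
def disk_load_fraction(reads_per_sec, writes_per_sec, max_random_iops):
    """Return the fraction of the disk's random I/O capacity in use."""
    if max_random_iops <= 0:
        raise ValueError("max_random_iops must be positive")
    return (reads_per_sec + writes_per_sec) / max_random_iops


def is_bottleneck(reads_per_sec, writes_per_sec, max_random_iops, threshold=0.60):
    """True when the disk is driven above the 60 percent guideline."""
    load = disk_load_fraction(reads_per_sec, writes_per_sec, max_random_iops)
    return load > threshold


if __name__ == "__main__":
    # Hypothetical sample: 30 reads/s and 20 writes/s on a disk
    # assumed to sustain 70 random operations per second.
    print(is_bottleneck(30, 20, 70))
```

Because the capacity figure is an estimate, treat a result near the threshold as a prompt to look more closely rather than as a firm verdict.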

Limiting Disk Bottlenecks

Disk bandwidth on an NFS server has the greatest effect on NFS client performance. Providing sufficient bandwidth and memory for file system caching is crucial to providing the best possible file server performance. Note that read/write latency is also important. For example, each NFSop may involve one or more disk accesses. Disk service times add to the NFSop latency, so slow disks mean a slow NFS server.

Follow these guidelines to ease disk bottlenecks:

Balance the I/O load across all disks on the system.

If one disk is heavily loaded and others are operating at the low end of their capacity, shuffle directories or frequently accessed files to less busy disks.

Partition the file system(s) on the heavily used disk and spread the file system(s) over several disks.

Adding disks provides additional disk capacity and disk I/O bandwidth.

Replicate the file system to provide more network-to-disk bandwidth for the clients if the file system used is read-only by the NFS clients, and contains data that doesn't change constantly.

See the following section, "Replicating File Systems."

Size the operating system caches correctly, so that frequently needed file system data may be found in memory.

Caches for inodes (file information nodes), file system metadata such as cylinder group information, and name-to-inode translations must be sufficiently large, or additional disk traffic is created on cache misses. For example, if an NFS client opens a file, that operation generates several name-to-inode translations on the NFS server.

If an operation misses the Directory Name Lookup Cache (DNLC), the server must search the disk-based directory entries to locate the appropriate entry name. What would nominally be a memory-based operation degrades into several disk operations. Also, cached pages will not be associated with the file.
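The first guideline in the list above, balancing the I/O load across all disks, starts with finding which disks are heavily loaded and which have room to spare. The following is a minimal sketch, assuming per-disk operation rates gathered from iostat. The disk names, the assumed capacity, and the 20 percent "lightly loaded" threshold are illustrative assumptions; only the 60 percent figure comes from this guide.

```python
def classify_disks(ops_per_sec, max_random_iops, busy=0.60, idle=0.20):
    """Split disks into heavily loaded and lightly loaded lists.

    ops_per_sec maps a disk name to its combined reads + writes per second.
    busy follows the 60 percent guideline; idle is an assumed threshold.
    """
    heavy, light = [], []
    for disk, ops in sorted(ops_per_sec.items()):
        load = ops / max_random_iops
        if load > busy:
            heavy.append(disk)   # candidate source of data to move
        elif load < idle:
            light.append(disk)   # candidate destination
    return heavy, light


if __name__ == "__main__":
    # Hypothetical rates for three disks, each assumed to sustain 70 ops/s.
    sample = {"sd0": 55, "sd1": 8, "sd2": 30}
    print(classify_disks(sample, 70))
```

Disks in the first list are candidates for having directories or file systems moved off them; disks in the second list are candidates to receive that data.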

Replicating File Systems

Commonly used file systems, such as the following, are frequently the most heavily used file systems on an NFS server:

/usr directory for diskless clients
Local tools and libraries
Third-party packages
Read-only source code archives

The best way to improve performance for these file systems is to replicate them. One NFS server is limited by disk bandwidth when handling requests for only one file system. Replicating the data increases the size of the aggregate "pipe" from NFS clients to the data. However, replication is not a viable strategy for improving performance with writable data, such as a file system of home directories. Use replication with read-only data.

To replicate file systems, do the following:

Identify the file or file systems to be replicated.

If several individual files are candidates, consider merging them in a single file system. The potential decrease in performance that arises from combining heavily used files on one disk is more than offset by performance gains through replication.

Use nfswatch to identify the most commonly used files and file systems in a group of NFS servers. Table A-1 in Appendix A, Using NFS Performance-Monitoring and Benchmarking Tools, lists performance monitoring tools, including nfswatch, and explains how to obtain nfswatch.

Determine how clients will choose a replica.

Specify a server name in the /etc/vfstab file to create a permanent binding from NFS client to the server. Alternatively, listing all server names in an automounter map entry allows completely dynamic binding, but may also lead to a client imbalance on some NFS servers. Enforcing "workgroup" partitions in which groups of clients have their own replicated NFS server strikes a middle ground between the extremes and often provides the most predictable performance.

Choose an update schedule and method for distributing the new data.

The frequency of change of the read-only data determines the schedule and the method for distributing the new data. File systems that undergo a complete change in contents, for example, a flat file with historical data that is updated monthly, are best handled by copying data from the distribution media on each machine, or by using a combination of ufsdump and ufsrestore. File systems with few changes can be handled using management tools such as rdist.

Evaluate what penalties, if any, are involved if users access old data on a replica that is not current. One possible way of doing this is with the Solaris 2.x JumpStart™ facilities in combination with cron.
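The "workgroup" partitioning described in the replica-selection step can be pictured as a fixed mapping from each group of clients to one replica server. The following is a minimal sketch under that assumption. The group names, server names, and the round-robin policy are hypothetical; a real site would more likely assign groups by measured load or network topology.

```python
def assign_workgroups(workgroups, servers):
    """Map each workgroup name to one replica server, round-robin."""
    if not servers:
        raise ValueError("at least one replica server is required")
    ordered = sorted(workgroups)  # sort so the mapping is repeatable
    return {group: servers[i % len(servers)] for i, group in enumerate(ordered)}


if __name__ == "__main__":
    # Hypothetical workgroups and replica servers.
    print(assign_workgroups(["eng", "docs", "qa"], ["nfs1", "nfs2"]))
```

Every client in a workgroup would then name its assigned server in /etc/vfstab, which keeps the binding predictable while still spreading groups across replicas.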

Adding the Cache File System

The cache file system is client-centered. You use the cache file system on the client to reduce server load. With the cache file system, files are obtained from the server, block by block. The files are sent to the memory of the client and manipulated directly. Data is written back to the disk of the server.

Adding the cache file system to client mounts provides a local replica for each client. The /etc/vfstab entry for the cache file system looks like this:

# device          device    mount      FS       fsck  mount    mount
# to mount        to fsck   point      type     pass  at boot  options
server:/usr/dist  cache     /usr/dist  cachefs  3     yes      ro,backfstype=nfs,cachedir=/cache
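To make the seven fields of that entry explicit, the following sketch splits a vfstab entry, written on one line, into named fields and breaks the mount options into key/value pairs. It is an illustration only: the field names in the tuple and the helper functions are assumptions chosen to mirror the column headings, not part of any Solaris tool.

```python
from collections import namedtuple

VfstabEntry = namedtuple(
    "VfstabEntry",
    "device_to_mount device_to_fsck mount_point fs_type "
    "fsck_pass mount_at_boot mount_options",
)


def parse_vfstab_line(line):
    """Parse one non-comment vfstab line into its seven fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    return VfstabEntry(*fields)


def mount_options(entry):
    """Return the mount options as a dict; flags without '=' map to True."""
    options = {}
    for item in entry.mount_options.split(","):
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


if __name__ == "__main__":
    entry = parse_vfstab_line(
        "server:/usr/dist cache /usr/dist cachefs 3 yes "
        "ro,backfstype=nfs,cachedir=/cache"
    )
    print(entry.fs_type, mount_options(entry))
```

Read this way, the entry mounts server:/usr/dist read-only at /usr/dist through the cache file system, with NFS as the back file system and /cache as the cache directory.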

Use the cache file system with file systems that are mainly read, such as application file systems. You should also use the cache file system for sharing data across slow networks. Unlike a replicated server, the cache file system can be used with writable file systems, but performance degrades as the percentage of writes climbs. If the percentage of writes is too high, the cache file system may decrease NFS performance.

You should also consider using the cache file system if your networks are high speed networks interconnected by routers.

If the NFS server is frequently updated, do not use the cache file system because doing so would result in more traffic than operating over NFS.

To monitor the effectiveness of the cached file systems, use the cachefsstat command (available with the Solaris 2.5 and later operating environments).

The syntax of the cachefsstat command is as follows:
```
system# /usr/bin/cachefsstat [-z] path
```
where:

-z initializes statistics. You should execute cachefsstat -z (superuser only) before executing cachefsstat again to gather statistics on the cache performance. The statistics printed reflect those just before the statistics are reinitialized.

path is the path the cache file system is mounted on. If you do not specify a path, all mounted cache file systems are used.

Without the -z option, you can execute this command as a regular UNIX user. The statistical information supplied by the cachefsstat command includes cache hits and misses, consistency checking, and modification operations:

Table 4-1 Statistical Information Supplied by the cachefsstat Command


Output	Description
`cache hit rate`	Percentage of cache hits over the total number of attempts (followed by the actual numbers of hits and misses)
`consistency checks`	Number of consistency checks performed. It is followed by the number that passed and the number that failed.
`modifies`	Number of modify operations, including writes and creates.

An example of the cachefsstat command is:

system% /usr/bin/cachefsstat /home/sam
cache hit rate: 73% (1234 hits, 450 misses)
consistency checks:  700 (650 pass, 50 fail)
modifies: 321

The cache hit rate for a file system should be higher than thirty percent; the 73 percent shown in the previous example is well above that. If the cache hit rate is lower than thirty percent, the access pattern on the file system is widely randomized or the cache is too small.
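The hit rate can be recomputed from the hit and miss counts that cachefsstat prints in parentheses. The following is a minimal sketch, assuming the output has the same layout as the example above. The regular expression and function names are hypothetical and may need adjusting for the exact output of your release.

```python
import re

# Assumed layout, taken from the example output in this section.
HIT_LINE = re.compile(r"cache hit rate:\s*(\d+)%\s*\((\d+) hits, (\d+) misses\)")


def hit_rate_percent(output):
    """Recompute the hit rate from the hit and miss counts."""
    match = HIT_LINE.search(output)
    if match is None:
        raise ValueError("no 'cache hit rate' line found")
    hits, misses = int(match.group(2)), int(match.group(3))
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


def hit_rate_ok(output, minimum=30.0):
    """True when the hit rate is above the thirty percent guideline."""
    return hit_rate_percent(output) > minimum


if __name__ == "__main__":
    sample = "cache hit rate: 73% (1234 hits, 450 misses)"
    print(round(hit_rate_percent(sample)), hit_rate_ok(sample))
```

For the sample line, 1234 hits out of 1684 attempts rounds to the 73 percent that cachefsstat reports.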

The consistency checks output shows how often the cache file system checks with the server to see if data is still valid, and how many of those checks passed or failed. A high failure rate (15 to 20 percent) means that the data of interest is rapidly changing. The cache may be updated more quickly than is appropriate for a cached file system. When you use the output from consistency checks together with the number of modifies, you can learn whether this client or other clients are making the changes.

The output for modifies is the number of times the client has written changes to the file system. This output is another method to understand why the hit rate may be low. A high rate of modify operations likely goes along with a high number of consistency checks and a lower hit rate.
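The consistency failure rate and the modify count discussed above can be pulled from the same cachefsstat output. The following is a minimal sketch, assuming the output layout of the earlier example. The patterns and function names are hypothetical, and the 15 percent threshold is the low end of the range given in this section.

```python
import re

# Assumed layouts, taken from the example output in this section.
CHECK_LINE = re.compile(r"consistency checks:\s*(\d+)\s*\((\d+) pass, (\d+) fail\)")
MODIFY_LINE = re.compile(r"modifies:\s*(\d+)")


def consistency_failure_percent(output):
    """Percentage of consistency checks that failed."""
    match = CHECK_LINE.search(output)
    if match is None:
        raise ValueError("no 'consistency checks' line found")
    passed, failed = int(match.group(2)), int(match.group(3))
    total = passed + failed
    return 100.0 * failed / total if total else 0.0


def modify_count(output):
    """Number of modify operations reported for this client."""
    match = MODIFY_LINE.search(output)
    if match is None:
        raise ValueError("no 'modifies' line found")
    return int(match.group(1))


def high_failure_rate(output, threshold=15.0):
    """True when the failure rate reaches the 15 percent guideline."""
    return consistency_failure_percent(output) >= threshold


if __name__ == "__main__":
    sample = "consistency checks:  700 (650 pass, 50 fail)\nmodifies: 321\n"
    print(consistency_failure_percent(sample), modify_count(sample))
```

In the earlier example, 50 failures out of 700 checks is roughly 7 percent, which is below the range this section describes as high.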

Also available, beginning with the Solaris 2.5 software environment, are the commands cachefswssize, which determines the working set size for the cache file system, and cachefslog, which displays where the cache file system statistics are being logged. Use these commands to determine if the cache file system is appropriate and valuable for your installation.

Configuration Rules for Disk Drives

Follow these general guidelines for configuring disk drives. More specific guidelines for data-intensive and attribute-intensive environments follow the general ones:

Configure additional drives on each host adapter without degrading performance (as long as the number of active drives does not exceed SCSI standard guidelines).

Use Online: DiskSuite or Solstice DiskSuite to spread disk access load across many disks. See "Using Solstice DiskSuite or Online: DiskSuite to Spread Disk Access Load" later in this chapter.

Use the fastest zones of the disk when possible. See "Using the Optimum Zones of the Disk" later in this chapter.

Keep these rules in mind when configuring disk drives in data-intensive environments:

Configure for a sequential environment.

Use disks with the fastest transfer speeds (preferably in stripes).

Configure one RAID device (logical volume or metadisk) for every three active version 3 clients or one device for every four to five version 2 clients.

Configure one drive for every client on Ethernet or Token Ring.

When configuring disk drives in attribute-intensive environments:

Configure with a larger number of smaller disks, which are connected to a moderate number of SCSI host adapters (such as a disk array).

Configure four to five (or up to eight or nine) fully active disks per fast SCSI host adapter. Using smaller disk drives is much better than operating with one large disk drive.

Configure at least one disk drive for every two fully active clients (on any type of network).

Configure no more than eight to ten fully active disk drives for each fast/wide SCSI host adapter.
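The ratios in these rules translate directly into sizing arithmetic. The following is a minimal sketch. Where the guide gives a range, the more conservative end is assumed (four version 2 clients per device, eight disks per fast/wide adapter), and adding the version 3 and version 2 device counts together is also an assumption. The function names are hypothetical.

```python
import math


def data_intensive_devices(v3_clients=0, v2_clients=0):
    """RAID devices for a data-intensive environment.

    One device per three active version 3 clients, plus one per four
    version 2 clients (the low end of the four-to-five range).
    """
    return math.ceil(v3_clients / 3) + math.ceil(v2_clients / 4)


def attribute_intensive_disks(active_clients):
    """At least one disk for every two fully active clients."""
    return math.ceil(active_clients / 2)


def fast_wide_adapters(active_disks, disks_per_adapter=8):
    """Fast/wide SCSI host adapters needed, assuming eight disks each."""
    return math.ceil(active_disks / disks_per_adapter)


if __name__ == "__main__":
    # Hypothetical site: 25 fully active clients, attribute-intensive load.
    disks = attribute_intensive_disks(25)
    print(disks, fast_wide_adapters(disks))
```

For the hypothetical site of 25 fully active clients, this works out to 13 disks spread over two fast/wide adapters.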

Using Solstice DiskSuite or Online: DiskSuite to Spread Disk Access Load

A common problem in NFS servers is poor load balancing across disk drives and disk controllers.

To balance loads, do the following:

Balance loads by physical usage instead of logical usage. Use Solstice DiskSuite or Online: DiskSuite to spread disk access across disk drives transparently by using its striping and mirroring functions.

The disk mirroring feature of Solstice DiskSuite or Online: DiskSuite improves disk access time and reduces disk usage by providing access to two or three copies of the same data. This is particularly true in environments dominated by read operations. Write operations are normally slower on a mirrored disk since two or three writes must be accomplished for each logical operation requested.

Balance loads using disk concatenation when disks are relatively full. This procedure accomplishes a minimum amount of load balancing.

If your environment is data-intensive, stripe the disk with a small interlace to improve disk throughput and distribute the service load. Disk striping improves read and write performance for serial applications. As a starting point for interlace size, use 64 Kbytes divided by the number of disks in the stripe.

If your environment is attribute-intensive, where random access dominates disk usage, stripe the disk with the default interlace (one disk cylinder).

Use the iostat and sar commands to report disk drive usage.

Attaining even disk usage usually requires some iterations of monitoring and data reorganization. In addition, usage patterns change over time. A data layout that works when installed may perform poorly a year later. For more information on checking disk drive usage, see "Checking the NFS Server" in Chapter 3, Analyzing NFS Performance.
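The starting interlace size for a data-intensive stripe, mentioned earlier in this list, is a one-line calculation. The following is a minimal sketch that reads "64 Kbytes per number of disks" as 64 Kbytes divided by the number of disks in the stripe. The function name is hypothetical, and the result is only a starting point to refine through monitoring.

```python
def starting_interlace_kbytes(disks_in_stripe, total_kbytes=64):
    """Suggested starting interlace size, in Kbytes, for a data-intensive stripe."""
    if disks_in_stripe < 1:
        raise ValueError("a stripe needs at least one disk")
    return total_kbytes / disks_in_stripe


if __name__ == "__main__":
    # Hypothetical stripe widths.
    for disks in (2, 4, 8):
        print(disks, starting_interlace_kbytes(disks))
```

A four-disk stripe, for example, would start at 16 Kbytes per disk.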

Using Log-Based File Systems With Solstice DiskSuite or Online: DiskSuite 3.0

The Solaris 2.4 through Solaris 7 software environments and the Online: DiskSuite 3.0 or Solstice DiskSuite software support a log-based extension to the standard UNIX file system, which works like a disk-based Prestoserve NFS accelerator.

In addition to the main file system disk, a small (typically 10 Mbytes) section of disk is used as a sequential log for writes. This speeds up the same kind of operations as a Prestoserve NFS accelerator with two advantages:

In dual-machine high-availability configurations, the Prestoserve NFS accelerator cannot be used. The log, by contrast, can be shared between the two machines, so it can be used in these configurations.

After an operating system crash, the fsck of the log-based file system involves a sequential read of the log only. The sequential read of the log is almost instantaneous, even on very large file systems.

Note - You cannot use the Prestoserve NFS accelerator and the log on the same file system.

Using the Optimum Zones of the Disk

When you analyze your disk data layout, consider zone bit recording.

All of Sun's current disks (except the 207 Mbyte disk) use this type of encoding, which exploits the geometric properties of a spinning disk to pack more data into the parts of the platter closest to its edge. As a result, the lower disk addresses (corresponding to the outside cylinders) usually outperform the inside addresses by 50 percent.

Put the data in the lowest-numbered cylinders.

The zone bit recording data layout makes those cylinders the fastest ones.

This margin is most often realized in serial transfer performance, but it also affects random access I/O. Data on the outside cylinders (numbered from zero) not only moves past the read/write heads more quickly, but the cylinders are also larger. Data is spread over fewer, larger cylinders, resulting in fewer and shorter seeks.