NFS Server Performance and Tuning Guide for Sun Hardware

Tuning Parameters

This section describes how to set the number of NFS threads. It also covers tuning the main NFS performance-related parameters in the /etc/system file. Tune these /etc/system parameters carefully, considering the physical memory size of the server and kernel architecture type.

Note -

Arbitrary tuning creates major instability problems, including an inability to boot.

Setting the Number of NFS Threads in `/etc/init.d/nfs.server`

For improved performance, NFS server configurations should set the number of NFS threads explicitly. Each thread can process one NFS request, so a larger pool of threads enables the server to handle more NFS requests in parallel. The default setting of 16 threads in the Solaris 2.4 through Solaris 7 software environments results in slower NFS response times. Scale the setting with the number of processors and networks, and increase the number of NFS server threads by editing the invocation of nfsd in /etc/init.d/nfs.server:

```
/usr/lib/nfs/nfsd -a 64
```

The previous code box specifies that the maximum allocation of demand-based NFS threads is 64.

There are three ways to size the number of NFS threads. Each method results in about the same number of threads if you follow the configuration guidelines in this manual. Extra NFS threads do not cause a problem.

To set the number of NFS threads, take the maximum of the following three suggestions:

Use 2 NFS threads for each active client process.

A client workstation usually only has one active process. However, a time-shared system that is an NFS client may have many active processes.

Use 16 to 32 NFS threads for each CPU.

Use roughly 16 for a SPARCclassic or a SPARCstation 5 system. Use 32 NFS threads for a system with a 60 MHz SuperSPARC processor.

Use 16 NFS threads for each 10 Mbits of network capacity.

For example, FDDI runs at 100 Mbits per second, so if you have one SunFDDI™ interface, set the number of threads to 160. With two SunFDDI interfaces, set the thread count to 320, and so on.
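The three sizing rules above can be expressed as a small calculation. The following Python sketch is illustrative only: the function name suggested_nfs_threads is hypothetical, and it assumes 32 threads per CPU (the top of the 16 to 32 range); pass threads_per_cpu=16 for a SPARCclassic or SPARCstation 5 class system.

```python
def suggested_nfs_threads(client_processes: int, cpus: int,
                          network_mbits: int, threads_per_cpu: int = 32) -> int:
    """Return the largest of the three NFS thread-count suggestions.

    threads_per_cpu defaults to 32 (an assumption; the guide gives 16 to 32).
    """
    by_clients = 2 * client_processes        # 2 threads per active client process
    by_cpus = threads_per_cpu * cpus         # 16 to 32 threads per CPU
    by_network = 16 * network_mbits // 10    # 16 threads per 10 Mbits of network capacity
    return max(by_clients, by_cpus, by_network)


if __name__ == "__main__":
    # Example: 4 CPUs, 20 active client processes, one 100-Mbit/s FDDI interface.
    threads = suggested_nfs_threads(client_processes=20, cpus=4, network_mbits=100)
    print(f"/usr/lib/nfs/nfsd -a {threads}")  # prints "/usr/lib/nfs/nfsd -a 160"
```

Because the rule takes the maximum, the network term dominates in this example; rounding up is harmless, since extra NFS threads do not cause a problem.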

Identifying Buffer Sizes and Tuning Variables

The number of fixed-size tables in the kernel has been reduced in each release of the Solaris software environment. Most are now dynamically sized or are linked to the maxusers calculation. Extra tuning to increase the DNLC and inode caches is required for the Solaris 2.4 through Solaris 7 software environments. For Solaris version 2.4 you must tune the pager. Tuning the pager is not necessary for Solaris 2.5, 2.5.1, 2.6, or 7 operating environments.

Using `/etc/system` to Modify Kernel Variables

The /etc/system file is read by the operating system kernel at start-up. It configures the search path for loadable operating system kernel modules and enables kernel variables to be set. For more information, see the man page for system(4).
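Before rebooting with a modified /etc/system, it can help to review the set lines it contains. The Python sketch below is a hypothetical review aid (parse_system_sets is not a Solaris tool). It handles only the `set name = value` form used in this guide, treats lines beginning with * or # as comments (check system(4) for your release), keeps decimal and 0x-prefixed hexadecimal values as integers, and leaves anything else as text. It does not check variable names against the kernel.

```python
import re
from typing import Dict, Union

# Matches "set [module:]variable = value", the form used in this guide.
# Other system(4) directives and the "|" / "&" forms of set are ignored.
_SET_LINE = re.compile(r"^set\s+(\w+(?::\w+)?)\s*=\s*(\S+)$")


def _parse_value(token: str) -> Union[int, str]:
    """Decimal and 0x-prefixed hex become ints; anything else stays as text."""
    try:
        if token.lower().startswith("0x"):
            return int(token, 16)
        if token.isdigit():
            return int(token)
    except ValueError:
        pass
    return token


def parse_system_sets(text: str) -> Dict[str, Union[int, str]]:
    """Collect set directives from /etc/system-style text."""
    settings: Dict[str, Union[int, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("*", "#")):
            continue  # blank line or comment line
        match = _SET_LINE.match(line)
        if match:
            settings[match.group(1)] = _parse_value(match.group(2))
    return settings


if __name__ == "__main__":
    sample = "* NFS server tuning\nset maxusers = 200\nset nfs:nfs3_nra=4\n"
    print(parse_system_sets(sample))
```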

Caution -

Use the set commands in /etc/system carefully: each one automatically patches the kernel when the system boots.

If your machine does not boot and you suspect a problem with /etc/system, boot with the boot -a option. With this option, the system prompts (with defaults) for its boot parameters, one of which is the /etc/system configuration file. At that prompt, enter either the name of a backup copy of the original /etc/system file or /dev/null. Fix the file and immediately reboot the system to make sure it is operating correctly.

Adjusting Cache Size: `maxusers`

The maxusers parameter determines the size of various kernel tables such as the process table. The maxusers parameter is set in the /etc/system file. For example:

```
set maxusers = 200
```

In the Solaris 2.4 through Solaris 7 software environments, maxusers is dynamically sized based upon the amount of RAM configured in the system. The sizing method used for maxusers is:

maxusers = Mbytes of RAM configured in the system

The number of Mbytes of RAM configured into the system is actually based upon physmem, which does not include the 2 Mbytes or so that the kernel uses at boot time. The minimum limit is 8 and the maximum automatic limit is 1024, which corresponds to systems with 1 Gbyte or more of RAM. You can still set maxusers manually in /etc/system, but the manual setting is checked and limited to a maximum of 2048. This is a safe level on all kernel architectures, but it uses a large amount of operating system kernel memory.
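The automatic sizing rule and the two limits can be written out as follows. This is a Python sketch of the arithmetic described above, not kernel code; the function names are hypothetical, and the input is physmem expressed in Mbytes, which is already about 2 Mbytes less than the installed RAM.

```python
AUTO_MIN = 8        # minimum automatically chosen maxusers
AUTO_MAX = 1024     # maximum automatic value (1 Gbyte or more of RAM)
MANUAL_MAX = 2048   # limit applied to a value set manually in /etc/system


def auto_maxusers(physmem_mbytes: int) -> int:
    """Automatic sizing: maxusers = Mbytes of RAM, clamped to the 8..1024 range."""
    return max(AUTO_MIN, min(physmem_mbytes, AUTO_MAX))


def effective_manual_maxusers(requested: int) -> int:
    """A manual /etc/system setting is checked and limited to 2048."""
    return min(requested, MANUAL_MAX)


if __name__ == "__main__":
    for physmem in (6, 254, 4094):
        print(physmem, "->", auto_maxusers(physmem))
    print("set maxusers = 4000 ->", effective_manual_maxusers(4000))
```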

Parameters Derived From `maxusers`

Table 4-5 describes the default settings for the performance-related inode cache and name cache operating system kernel parameters.

Table 4-5 Default Settings for Inode and Name Cache Parameters


Kernel Resource	Variable	Default Setting
Inode cache	`ufs_ninode`	17 * `maxusers` + 90
Name cache	`ncsize`	17 * `maxusers` + 90

Adjusting the Buffer Cache (`bufhwm`)

The bufhwm variable, set in the /etc/system file, controls the maximum amount of memory allocated to the buffer cache and is specified in Kbytes. The default value of bufhwm is 0, which allows up to 2 percent of system memory to be used. You can increase this to as much as 20 percent; a dedicated NFS file server with a relatively small amount of memory may need about 10 percent. On a larger system, you may need to limit bufhwm to prevent the system from running out of operating system kernel virtual address space.

The buffer cache is used only to cache inode, indirect block, and cylinder group related disk I/O. The following is an example of a buffer cache (bufhwm) setting in the /etc/system file that can handle up to 10 Mbytes of cache. This is the highest value to which you should set bufhwm.

```
set bufhwm=10240
```
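Because bufhwm is specified in Kbytes, choosing a percentage of memory means converting Mbytes of RAM to Kbytes. The Python sketch below does that conversion and applies the 10240-Kbyte ceiling recommended above; the function name is hypothetical, and the percentage (for example, 10 percent for a small dedicated NFS server) is your choice.

```python
BUFHWM_CEILING_KB = 10240  # highest recommended bufhwm (10 Mbytes), per the text


def bufhwm_kbytes(memory_mbytes: int, percent: int) -> int:
    """Kbytes of buffer cache for `percent` of memory, capped at the ceiling."""
    wanted = memory_mbytes * 1024 * percent // 100
    return min(wanted, BUFHWM_CEILING_KB)


if __name__ == "__main__":
    # A 64-Mbyte dedicated NFS server at 10 percent of memory.
    print(f"set bufhwm={bufhwm_kbytes(64, 10)}")  # prints "set bufhwm=6553"
```

On a 512-Mbyte server the same 10 percent would be 52428 Kbytes, so the ceiling applies and the result is 10240.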

You can monitor the buffer cache using sar -b (see the following code example), which reports the read hit rate (%rcache) and the write hit rate (%wcache) for the buffer cache.

```
# sar -b 5 10
SunOS hostname 5.2 Generic sun4c    08/06/93
23:43:39 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
Average        0      25     100       3      22      88       0       0
```

If a significant number of reads and writes per second occur (greater than 50) and if the read hit rate (%rcache) falls below 90 percent, or if the write hit rate (%wcache) falls below 65 percent, increase the buffer cache size, bufhwm.

In the previous sar -b 5 10 command output, the read hit rate (%rcache) and the write hit rate (%wcache) did not fall below 90 percent or 65 percent respectively.
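The rule of thumb above can be encoded as a simple check on the sar -b columns. In this Python sketch, should_grow_buffer_cache is a hypothetical name, and treating "reads and writes per second" as the sum of the lread/s and lwrit/s columns is an assumption; the thresholds (50 per second, 90 percent, 65 percent) come from the text.

```python
def should_grow_buffer_cache(lread_per_s: float, lwrit_per_s: float,
                             rcache_pct: float, wcache_pct: float) -> bool:
    """Apply the bufhwm rule of thumb to one line of `sar -b` output."""
    # Assumption: "reads and writes per second" = logical reads + logical writes.
    busy = (lread_per_s + lwrit_per_s) > 50
    poor_hit_rate = rcache_pct < 90 or wcache_pct < 65
    return busy and poor_hit_rate


if __name__ == "__main__":
    # Values from the Average line of the sample output above.
    print(should_grow_buffer_cache(25, 22, 100, 88))  # prints "False"
```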

Following are descriptions of the arguments to the sar command.

Table 4-6 Descriptions of the Arguments to the sar Command


`-b`	Checks buffer activity
`5`	Sampling interval in seconds (must be at least 5 seconds)
`10`	Number of times the command gathers statistics

Your system will prevent you from increasing the buffer cache to an unacceptably high level. Signs that the buffer cache has been increased too far include:

Hung server

Device drivers that suffer from a shortage of operating system kernel virtual memory

Directory Name Lookup Cache (DNLC)

The default size of the directory name lookup cache (DNLC) is derived from maxusers. A large cache size (ncsize) significantly increases the efficiency of NFS servers with multiple clients.

To show the DNLC hit rate (cache hits), type vmstat -s.

```
% vmstat -s
... lines omitted
79062 total name lookups (cache hits 94%)
16 toolong
```

Directory names shorter than 30 characters are cached, and names that are too long to be cached are reported on the toolong line. A cache miss means that a disk I/O may be needed to read the directory when traversing the path name components to get to a file. A hit rate of less than 90 percent requires attention.
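To track the DNLC hit rate over time, you can extract it from vmstat -s output. The Python sketch below assumes the line format shown in the sample above (N total name lookups (cache hits P%)); other releases may word it differently. The function name is hypothetical, and the miss count is only approximate because vmstat rounds the percentage.

```python
import re
from typing import Optional, Tuple

_LOOKUPS = re.compile(r"(\d+)\s+total name lookups\s+\(cache hits\s+(\d+)%\)")


def dnlc_hit_rate(vmstat_s_output: str) -> Optional[Tuple[int, int]]:
    """Return (total_lookups, hit_percent) from `vmstat -s` text, or None if absent."""
    match = _LOOKUPS.search(vmstat_s_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample = "79062 total name lookups (cache hits 94%)\n16 toolong\n"
    lookups, hits = dnlc_hit_rate(sample)
    approx_misses = lookups * (100 - hits) // 100
    status = "requires attention" if hits < 90 else "ok"
    print(lookups, hits, approx_misses, status)  # prints "79062 94 4743 ok"
```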

Cache hit rates can significantly affect NFS performance. The getattr, setattr, and lookup operations usually account for more than 50 percent of all NFS calls. If the requested information isn't in the cache, the request generates a disk operation whose performance penalty is as significant as that of a read or write request. The only limit to the size of the DNLC is available kernel memory.

If the hit rate (cache hits) is less than 90 percent and a problem does not exist with the number of long names (toolong), tune the ncsize variable (see "To Reset ncsize," which follows). The variable ncsize specifies the size of the DNLC as the number of name-to-vnode translations that can be cached. Each DNLC entry uses about 50 bytes of extra kernel memory.
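At roughly 50 bytes per entry, the kernel memory cost of a larger DNLC is easy to estimate. The Python sketch below (the function name is hypothetical) multiplies ncsize by that approximate figure for the values mentioned in this section.

```python
DNLC_ENTRY_BYTES = 50  # approximate kernel memory per DNLC entry, per the text


def dnlc_memory_bytes(ncsize: int) -> int:
    """Approximate extra kernel memory used by a DNLC of `ncsize` entries."""
    return ncsize * DNLC_ENTRY_BYTES


if __name__ == "__main__":
    for ncsize in (5000, 16000, 34906):
        print(f"ncsize={ncsize}: about {dnlc_memory_bytes(ncsize) // 1024} Kbytes")
```

By this estimate, even the tested maximum of 34906 entries costs under 2 Mbytes of kernel memory.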

To Reset `ncsize`

Set ncsize in the /etc/system file to a value higher than the default (based on maxusers).

As an initial guideline, double the default size: dedicated NFS servers do not need a lot of RAM, so maxusers will be low and the DNLC will be small.
```
set ncsize=5000
```
The default value of ncsize is:

ncsize (name cache) = 17 * maxusers + 90

For NFS server benchmarks, set it as high as 16000.
For maxusers = 2048, set it at 34906.

Reboot the system.
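The doubling guideline can be turned into an /etc/system line. This Python sketch uses the default formula above (17 * maxusers + 90); the function names are hypothetical, and doubling is only a starting point, to be refined by watching the hit rate reported by vmstat -s.

```python
def default_ncsize(maxusers: int) -> int:
    """Default DNLC size: 17 * maxusers + 90."""
    return 17 * maxusers + 90


def initial_ncsize(maxusers: int) -> int:
    """Initial guideline for a dedicated NFS server: double the default."""
    return 2 * default_ncsize(maxusers)


if __name__ == "__main__":
    maxusers = 128
    print(f"* default ncsize for maxusers={maxusers} is {default_ncsize(maxusers)}")
    print(f"set ncsize={initial_ncsize(maxusers)}")  # prints "set ncsize=4532"
```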

See "Increasing the Inode Cache," which follows.

Increasing the Inode Cache

A memory-resident inode is used whenever an operation is performed on an entity in the file system. The inode read from disk is cached in case it is needed again. ufs_ninode is the size to which the UNIX file system attempts to limit the list of idle inodes. You can have ufs_ninode set to 1 but still have 10,000 idle inodes. As active inodes become idle, if the number of idle inodes exceeds ufs_ninode, memory is reclaimed by discarding idle inodes.
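The relationship between ufs_ninode and the idle inode list can be illustrated with a toy model. This Python sketch is not the UFS implementation: the class name is hypothetical, and discarding the oldest idle inode first is an assumption made for illustration. It models only the behavior described above: the list is trimmed to ufs_ninode entries as inodes become idle, and active inodes are not counted.

```python
from collections import OrderedDict


class IdleInodeList:
    """Toy model of an idle-inode list trimmed to ufs_ninode entries."""

    def __init__(self, ufs_ninode: int) -> None:
        self.ufs_ninode = ufs_ninode
        self.idle: "OrderedDict[int, None]" = OrderedDict()  # oldest idle inode first

    def release(self, inum: int) -> None:
        """An active inode becomes idle; trim the list if it exceeds ufs_ninode."""
        self.idle[inum] = None
        self.idle.move_to_end(inum)
        while len(self.idle) > self.ufs_ninode:
            self.idle.popitem(last=False)  # assumption: discard the oldest idle inode

    def reactivate(self, inum: int) -> bool:
        """Reuse a cached idle inode; True means no disk read was needed."""
        if inum in self.idle:
            del self.idle[inum]
            return True
        return False


if __name__ == "__main__":
    cache = IdleInodeList(ufs_ninode=3)
    for inum in range(5):
        cache.release(inum)
    print(list(cache.idle))      # prints "[2, 3, 4]"
    print(cache.reactivate(3))   # prints "True" (cache hit)
    print(cache.reactivate(0))   # prints "False" (already discarded)
```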

Every entry in the DNLC cache points to an entry in the inode cache, so both caches should be sized together. The inode cache should be at least as big as the DNLC cache. For best performance, it should be the same size in the Solaris 2.4 through Solaris 7 operating environments.

Because ufs_ninode is just a limit, you can change it with adb on a running system, and the change takes effect immediately. The only upper limit is the amount of kernel memory used by the inodes. The tested upper limit corresponds to maxusers = 2048, which is the same as ncsize at 34906.

To report the size of the kernel memory allocation, use sar -k.

In the Solaris 2.4 operating environment, each inode uses 300 bytes of kernel memory from the lg_mem pool.

In the Solaris 2.5.1, 2.6, and 7 operating environments, each inode uses 320 bytes of kernel memory from the lg_mem pool. ufs_ninode is automatically adjusted to be at least ncsize. Tune ncsize to get the hit rate up and let the system pick the default ufs_ninode.
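Using the per-inode figures above, the Python sketch below estimates inode cache memory and models the automatic adjustment in later releases. It is illustrative only: the function names and release strings are hypothetical, Solaris 2.5 is omitted because no per-inode figure is given for it, and the estimate counts only ufs_ninode entries, not additional active inodes.

```python
# Kernel memory per inode (bytes, from the lg_mem pool), as given in the text.
INODE_BYTES = {"2.4": 300, "2.5.1": 320, "2.6": 320, "7": 320}
# Releases in which ufs_ninode is automatically raised to at least ncsize.
AUTO_ADJUSTING = {"2.5.1", "2.6", "7"}


def effective_ufs_ninode(ufs_ninode: int, ncsize: int, release: str) -> int:
    """ufs_ninode after any automatic adjustment for the given release."""
    return max(ufs_ninode, ncsize) if release in AUTO_ADJUSTING else ufs_ninode


def inode_cache_bytes(ufs_ninode: int, ncsize: int, release: str) -> int:
    """Rough kernel memory for ufs_ninode inodes (ignores extra active inodes)."""
    return effective_ufs_ninode(ufs_ninode, ncsize, release) * INODE_BYTES[release]


if __name__ == "__main__":
    # Tested upper limit: ncsize = 34906 (maxusers = 2048) on Solaris 7.
    print(inode_cache_bytes(1178, 34906, "7"))  # prints "11169920"
```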

Increasing the Inode Cache in the Solaris 2.4 or the 2.5 Operating Environments

If the inode cache hit rate is below 90 percent, or if the DNLC requires tuning for local disk file I/O workloads:

Increase the size of the inode cache.

Change the variable ufs_ninode in your /etc/system file to the same size as the DNLC (ncsize). For example, for Solaris 2.4, add:
```
set ufs_ninode=5000
```
The default value of the inode cache is the same as that for ncsize:

ufs_ninode (default value) = 17 * maxusers + 90.

Caution -

Do not set ufs_ninode less than ncsize. The ufs_ninode parameter limits the number of inactive inodes, rather than the total number of active and inactive inodes.

Reboot the system.
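Since automatic adjustment of ufs_ninode is described only for Solaris 2.5.1 and later, raising ncsize alone on Solaris 2.4 or 2.5 can leave ufs_ninode at its smaller default. The hypothetical Python check below flags that case; settings holds any values you set in /etc/system, and unset variables fall back to the default 17 * maxusers + 90.

```python
from typing import Dict, List


def default_cache_size(maxusers: int) -> int:
    """Default for both ncsize and ufs_ninode: 17 * maxusers + 90."""
    return 17 * maxusers + 90


def check_inode_vs_dnlc(settings: Dict[str, int], maxusers: int) -> List[str]:
    """Return warnings if ufs_ninode would be smaller than ncsize."""
    default = default_cache_size(maxusers)
    ncsize = settings.get("ncsize", default)
    ufs_ninode = settings.get("ufs_ninode", default)
    warnings = []
    if ufs_ninode < ncsize:
        warnings.append(f"ufs_ninode ({ufs_ninode}) is less than ncsize ({ncsize})")
    return warnings


if __name__ == "__main__":
    # Only ncsize raised on a maxusers=64 system: ufs_ninode stays at 1178.
    print(check_inode_vs_dnlc({"ncsize": 5000}, maxusers=64))
```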

Increasing Read Throughput

If you are using NFS over a high-speed network such as FDDI, SunFastEthernet, or SunATM™, you can get better read throughput by increasing the number of read-aheads on the NFS client.

Increasing read-aheads is not recommended under these conditions:

The client is very short of RAM.
The network is very busy.
File accesses are randomly distributed.

When free memory is low, read-ahead will not be performed.

By default, the read-ahead is set to 1 block (8 Kbytes with version 2 and 32 Kbytes with version 3). For example, with version 2 a read-ahead set to 2 blocks fetches an additional 16 Kbytes from a file while you are reading the first 8 Kbytes. The read-ahead thus stays ahead of the data you need, fetching it in 8-Kbyte increments.
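The amount of data requested ahead is the read-ahead count multiplied by the block size. This Python sketch assumes the default block sizes given above (8 Kbytes for version 2, 32 Kbytes for version 3); the function name is hypothetical.

```python
# Default block size in Kbytes per NFS protocol version, as given in the text.
BLOCK_KBYTES = {2: 8, 3: 32}


def read_ahead_kbytes(nra_blocks: int, nfs_version: int) -> int:
    """Kbytes requested ahead of the current read position."""
    return nra_blocks * BLOCK_KBYTES[nfs_version]


if __name__ == "__main__":
    # Values used in the procedures that follow.
    for name, nra, version in (("nfs_nra", 4, 2), ("nfs3_nra", 2, 3),
                               ("nfs3_nra", 4, 3), ("nfs3_nra", 6, 3)):
        print(f"{name}={nra}: {read_ahead_kbytes(nra, version)} Kbytes ahead")
```

With version 3, a setting of 6 asks for 192 Kbytes beyond the current read position, which illustrates why read-ahead is not recommended on a client that is very short of RAM.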

Increasing the read-ahead count can improve read throughput up to a point. The optimal read-ahead setting depends on your configuration and application, and increasing the read-ahead value beyond that setting may actually reduce throughput. In most cases, the optimal read-ahead setting is less than eight read-aheads (8 blocks).

Note -

In the following procedure, you can tune the nfs_nra and nfs3_nra values independently. If a client is running the Solaris 2.5, 2.5.1, 2.6, or 7 operating environment, it may need to tune nfs_nra (NFS version 2) if it is talking to a server that does not support version 3.

To Increase the Number of Read-Aheads With Version 2

Add the following line to /etc/system on the NFS client.
```
set nfs:nfs_nra=4
```

Reboot the system to implement the read-ahead value.

To Increase the Number of Read-Aheads With Version 3

Add the following line to /etc/system on the NFS client:
- With versions of the Solaris software environment before the Solaris 2.6 software environment:
```
set nfs:nfs3_nra=6
```
- With the Solaris 2.6 operating environment:
```
set nfs:nfs3_nra=2
```
- With the Solaris 7 operating environment:
```
set nfs:nfs3_nra=4
```
Note -

Raising the read-ahead count too high can make read throughput worse. Consider running benchmarks with different values of nfs3_nra or nfs_nra to see what works best in your environment.

Reboot the system to implement the read-ahead value.
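For the benchmarking suggested in the note above, one simple approach is to time a sequential read of a large file on the NFS mount after each reboot with a new read-ahead value. The Python sketch below is a hypothetical helper, not a standard tool; the 32-Kbyte chunk size is an assumption, and it does not control for client caching, so read a file that has not been read recently (or one larger than client memory) on each run.

```python
import time
from typing import Tuple


def sequential_read_rate(path: str, chunk_bytes: int = 32 * 1024) -> Tuple[int, float, float]:
    """Read `path` front to back; return (bytes_read, seconds, Mbytes_per_second)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # unbuffered: each read() is one system call
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 2:
        nbytes, secs, mbps = sequential_read_rate(sys.argv[1])
        print(f"{nbytes} bytes in {secs:.2f} s = {mbps:.1f} Mbytes/s")
```

Run it against the same large file under each nfs3_nra or nfs_nra value and compare the reported rates.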