This chapter provides configuration recommendations to maximize NFS performance. For troubleshooting tips see Chapter 5, Troubleshooting.
This chapter discusses tuning recommendations for these environments:
Attribute-intensive environments, in which primarily small files (one to two hundred bytes) are accessed. Software development is an example of an attribute-intensive environment.
Data-intensive environments, in which primarily large files are accessed. A large file can be defined as a file that takes one or more seconds to transfer (roughly 1 Mbyte). CAD or CAE are examples of data-intensive environments.
Check these items when tuning the system:
Networks
Disk drives
Central processor units
Memory
Swap space
Number of NFS threads in /etc/init.d/nfs.server
/etc/system to modify kernel variables
Once you have profiled the performance capabilities of your server, begin tuning the system. Tuning an NFS server requires a basic understanding of how networks, disk drives, CPUs, and memory affect performance. To tune the system, determine which parameters need adjusting to improve balance.
Collect statistics. See Chapter 3, Analyzing NFS Performance.
Identify a constraint or overutilized resource and reconfigure around it.
Refer to this chapter and Chapter 3, Analyzing NFS Performance for tuning recommendations.
Measure the performance gain over a long evaluation period.
All NFS processing takes place inside the operating system kernel at a higher priority than user-level tasks.
Do not combine database or time-shared loads on an NFS server; when the NFS load is high, any additional tasks performed by the server will run slowly.
Non-interactive workloads, such as mail delivery and printing, are good candidates for using the server for dual purposes (such as NFS and other tasks). The exceptions are the SPARCprinter (not supported in the Solaris 2.6 and later releases of the Solaris operating environment) and other Sun printers based on the NeWSprint(TM) software. If you have spare CPU power and a light NFS load, then interactive work will run normally.
Providing sufficient network bandwidth and availability is the most important configuration for NFS servers. This means that you should configure the appropriate number and type of networks and interfaces.
Follow these tips when setting up and configuring the network.
Make sure that network traffic is well balanced across all client networks and that networks are not overloaded.
If one client network is excessively loaded, watch the NFS traffic on that segment.
Identify the hosts that are making the largest demands on the servers.
Partition the work load or move clients from one segment to another.
Simply adding disks to a system does not improve its NFS performance unless the system is truly disk I/O-bound. The network itself is likely to be the constraint as the file server increases in size, requiring the addition of more network interfaces to keep the system in balance.
Instead of attempting to move more data blocks over a single network, consider characterizing the amount of data consumed by a typical client and balance the NFS reads and writes over multiple networks.
Data-intensive applications demand relatively few networks. However, the networks must be of high-bandwidth.
If your configuration has either of the following characteristics, then your applications require high-speed networking:
Your clients require aggregate data rates of more than 1 Mbyte per second.
More than one client must be able to simultaneously consume 1 Mbyte per second of network bandwidth.
Configure FDDI, SunATM(TM), or another high-speed network.
If fiber cabling cannot be used for logistical reasons, consider FDDI, CDDI, or SunFastEthernet(TM) over twisted-pair implementations. SunATM uses the same size fiber cabling as FDDI. For more information on FDDI, see the FDDI/S3.0 User's Guide.
Configure one FDDI ring for each five to seven concurrent fully NFS-active clients.
Few data-intensive applications make continuous NFS demands. In typical data-intensive EDA and earth-resources applications, this results in 25-40 clients per ring.
A typical use consists of loading a big block of data that is manipulated then written back to the server. Because the data is written back, these environments can have very high write percentages.
If your installation has Ethernet cabling, configure one Ethernet for every two active clients.
This almost always requires a SPARCserver 1000, SPARCserver 1000E, SPARCcenter 2000, SPARCcenter 2000E system, or an Ultra Enterprise 3000, 4000, 5000, or 6000 system since useful communities require many networks. Configure a maximum of four to six clients per network.
In contrast, most attribute-intensive applications are easily handled with less expensive networks. However, attribute-intensive applications require many networks. Use lower-speed networking media, such as Ethernet or Token Ring.
To configure networking when the primary application of the server is attribute-intensive:
Configure on Ethernet or Token Ring.
Configure one Ethernet network for eight to ten fully active clients.
More than 20 to 25 clients per Ethernet results in severe degradation when many clients are active. As a check, an Ethernet can sustain about 250-300 NFS ops/second on the SPECnfs_097 (LADDIS) benchmark, albeit at high collision rates. It is unwise to exceed 200 NFS ops/second on a sustained basis.
Configure one Token Ring network for each ten to fifteen active clients.
If necessary, 50 to 80 total clients per network are feasible on Token Ring networks, due to their superior degradation characteristics under heavy load (compared to Ethernet).
Mixing network types is not unreasonable. For example, both FDDI and Token Ring are appropriate for a server that supports both a document imaging application (data-intensive) and a group of PCs running a financial analysis application (most likely attribute-intensive).
The platform you choose is often dictated by the type and number of networks, as they may require many network interface cards.
Disk drive usage is frequently the tightest constraint in an NFS server. Even a sufficiently large memory configuration may not improve performance if the cache cannot be filled quickly enough from the file systems.
Use iostat to determine disk usage.
Look at the number of read and write operations per second (see "Checking the NFS Server" in Chapter 3, Analyzing NFS Performance).
Because there is little dependence in the stream of NFS requests, the disk activity generated contains large numbers of random access disk operations. The maximum number of random I/O operations per second ranges from 40-90 per disk.
Driving a single disk at more than 60 percent of its random I/O capacity creates a disk bottleneck.
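The 60 percent guideline above can be checked with a small calculation. This is a minimal sketch; the capacity and observed-operation values are assumed example inputs, which in practice you would take from your own iostat output:

```shell
# Flag a disk as a bottleneck when it runs above 60 percent of its
# random-I/O capacity (40-90 ops/sec per disk, per the text).
capacity=90    # random ops/sec this drive can sustain (assumed value)
ops=60         # observed reads+writes per second from iostat (example)
pct=$(( ops * 100 / capacity ))
if [ "$pct" -gt 60 ]; then
    echo "bottleneck: ${pct}% of random-I/O capacity"
else
    echo "ok: ${pct}% of random-I/O capacity"
fi
```

With these example numbers the disk is at 66 percent of its random-I/O capacity, so it would be flagged as a bottleneck.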
Disk bandwidth on an NFS server has the greatest effect on NFS client performance. Providing sufficient bandwidth and memory for file system caching is crucial to providing the best possible file server performance. Note that read/write latency is also important. For example, each NFSop may involve one or more disk accesses. Disk service times add to the NFSop latency, so slow disks mean a slow NFS server.
Follow these guidelines to ease disk bottlenecks:
Balance the I/O load across all disks on the system.
If one disk is heavily loaded and others are operating at the low end of their capacity, shuffle directories or frequently accessed files to less busy disks.
Partition the file system(s) on the heavily used disk and spread the file system(s) over several disks.
Adding disks provides additional disk capacity and disk I/O bandwidth.
Replicate the file system to provide more network-to-disk bandwidth for the clients if the file system used is read-only by the NFS clients, and contains data that doesn't change constantly.
See the following section, "Replicating File Systems."
Size the operating system caches correctly, so that frequently needed file system data may be found in memory.
Caches for inodes (file information nodes), file system metadata such as cylinder group information, and name-to-inode translations must be sufficiently large, or additional disk traffic is created on cache misses. For example, if an NFS client opens a file, that operation generates several name-to-inode translations on the NFS server.
If an operation misses the Directory Name Lookup Cache (DNLC), the server must search the disk-based directory entries to locate the appropriate entry name. What would nominally be a memory-based operation degrades into several disk operations. Also, cached pages will not be associated with the file.
Commonly used file systems, such as the following, are frequently the most heavily used file systems on an NFS server:
/usr directory for diskless clients
Local tools and libraries
Third-party packages
Read-only source code archives
The best way to improve performance for these file systems is to replicate them. One NFS server is limited by disk bandwidth when handling requests for only one file system. Replicating the data increases the size of the aggregate "pipe" from NFS clients to the data. However, replication is not a viable strategy for improving performance with writable data, such as a file system of home directories. Use replication with read-only data.
To replicate file systems, do the following:
Identify the file or file systems to be replicated.
If several individual files are candidates, consider merging them in a single file system. The potential decrease in performance that arises from combining heavily used files on one disk is more than offset by performance gains through replication.
Use nfswatch to identify the most commonly used files and file systems in a group of NFS servers. Table A-1 in Appendix A, Using NFS Performance-Monitoring and Benchmarking Tools, lists performance monitoring tools, including nfswatch, and explains how to obtain nfswatch.
Determine how clients will choose a replica.
Specify a server name in the /etc/vfstab file to create a permanent binding from NFS client to the server. Alternatively, listing all server names in an automounter map entry allows completely dynamic binding, but may also lead to a client imbalance on some NFS servers. Enforcing "workgroup" partitions in which groups of clients have their own replicated NFS server strikes a middle ground between the extremes and often provides the most predictable performance.
Choose an update schedule and method for distributing the new data.
The frequency of change of the read-only data determines the schedule and the method for distributing the new data. File systems that undergo a complete change in contents, for example, a flat file with historical data that is updated monthly, are best handled by copying data from the distribution media on each machine, or by using a combination of ufsdump and restore. File systems with few changes can be handled using management tools such as rdist, or with the Solaris 2.x JumpStart(TM) facilities in combination with cron.
Evaluate what penalties, if any, are involved if users access old data on a replica that is not current.
The cache file system is client-centered. You use the cache file system on the client to reduce server load. With the cache file system, files are obtained from the server, block by block. The files are sent to the memory of the client and manipulated directly. Data is written back to the disk of the server.
Adding the cache file system to client mounts provides a local replica for each client. The /etc/vfstab entry for the cache file system looks like this:
# device            device   mount      FS       fsck  mount    mount
# to mount          to fsck  point      type     pass  at boot  options
server:/usr/dist    cache    /usr/dist  cachefs  3     yes      ro,backfstype=nfs,cachedir=/cache
Use the cache file system in situations with file systems that are mainly read, such as application file systems. Also, use the cache file system for sharing data across slow networks. Unlike a replicated server, the cache file system can be used with writable file systems, but performance degrades as the percentage of writes climbs. If the percentage of writes is too high, the cache file system may decrease NFS performance.
You should also consider using the cache file system if your networks are high speed networks interconnected by routers.
If the NFS server is frequently updated, do not use the cache file system because doing so would result in more traffic than operating over NFS.
To monitor the effectiveness of the cached file systems use the cachefsstat command (available with the Solaris 2.5 and later operating environment).
The syntax of the cachefsstat command is as follows:
system# /usr/bin/cachefsstat [-z] path
where:
-z initializes statistics. You should execute cachefsstat -z (superuser only) before executing cachefsstat again to gather statistics on the cache performance. The statistics printed reflect those just before the statistics are reinitialized.
path is the path the cache file system is mounted on. If you do not specify a path, all mounted cache file systems are used.
Without the -z option, you can execute this command as a regular UNIX user. The statistical information supplied by the cachefsstat command includes cache hits and misses, consistency checking, and modification operations:

Table 4-1 Statistical Information Supplied by the cachefsstat Command

| Output | Description |
|---|---|
| cache hit rate | Percentage of cache hits over the total number of attempts (followed by the actual numbers of hits and misses) |
| consistency checks | Number of consistency checks performed, followed by the number that passed and the number that failed |
| modifies | Number of modify operations, including writes and creates |
An example of the cachefsstat command is:
system% /usr/bin/cachefsstat /home/sam
cache hit rate: 73% (1234 hits, 450 misses)
consistency checks: 700 (650 pass, 50 fail)
modifies: 321
In the previous example, the cache hit rate for the file system is 73 percent. The cache hit rate should be higher than 30 percent; a rate lower than 30 percent means that the access pattern on the file system is widely randomized or that the cache is too small.
The consistency-check output shows how often the cache file system checked with the server to see whether its cached data was still valid. A high failure rate (15 to 20 percent) means that the data of interest changes rapidly, perhaps more quickly than is appropriate for a cached file system. Comparing the consistency-check output with the number of modifies tells you whether this client or other clients are making the changes.
The modifies output is the number of times the client has written changes to the file system, and is another way to understand why the hit rate may be low. A high rate of modify operations usually goes along with a high number of consistency checks and a lower hit rate.
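The hit rate that cachefsstat reports is simply hits divided by total attempts. A quick sketch of the arithmetic, using the numbers from the example above:

```shell
# Cache hit rate = hits / (hits + misses), from the cachefsstat example.
hits=1234
misses=450
rate=$(( hits * 100 / (hits + misses) ))
echo "cache hit rate: ${rate}%"    # prints "cache hit rate: 73%"
```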
Also available, beginning with the Solaris 2.5 software environment, are the cachefswssize command, which determines the working set size for the cache file system, and the cachefslog command, which displays where the cache file system statistics are being logged. Use these commands to determine whether the cache file system is appropriate and valuable for your installation.
Follow these general guidelines for configuring disk drives. More specific guidelines for configuring disk drives in data-intensive environments and attribute-intensive environments follow these general guidelines:
You can configure additional drives on each host adapter without degrading performance, as long as the number of active drives does not exceed SCSI standard guidelines.
Use Online: DiskSuite or Solstice DiskSuite to spread disk access load across many disks. See "Using Solstice DiskSuite or Online: DiskSuite to Spread Disk Access Load" later in this chapter.
Use the fastest zones of the disk when possible. See "Using the Optimum Zones of the Disk" later in this chapter.
Keep these rules in mind when configuring disk drives in data-intensive environments:
Configure for a sequential environment.
Use disks with the fastest transfer speeds (preferably in stripes).
Configure one RAID device (logical volume or metadisk) for every three active version 3 clients or one device for every four to five version 2 clients.
Configure one drive for every client on Ethernet or Token Ring.
When configuring disk drives in attribute-intensive environments:
Configure with a larger number of smaller disks, which are connected to a moderate number of SCSI host adapters (such as a disk array).
Configure four to five (or up to eight or nine) fully active disks per fast SCSI host adapter. Using several smaller disk drives is much better than operating with one large disk drive.
Configure at least one disk drive for every two fully active clients (on any type of network.)
Configure no more than eight to ten fully active disk drives for each fast/wide SCSI host adapter.
A common problem in NFS servers is poor load balancing across disk drives and disk controllers.
To balance loads, do the following:
Balance loads by physical usage instead of logical usage. Use Solstice DiskSuite or Online: DiskSuite to spread disk access across disk drives transparently by using its striping and mirroring functions.
The disk mirroring feature of Solstice DiskSuite or Online: DiskSuite improves disk access time and reduces disk usage by providing access to two or three copies of the same data. This is particularly true in environments dominated by read operations. Write operations are normally slower on a mirrored disk since two or three writes must be accomplished for each logical operation requested.
Balance loads using disk concatenation when disks are relatively full. This procedure accomplishes only a minimal amount of load balancing.
If your environment is data-intensive, stripe the disk with a small interlace to improve disk throughput and distribute the service load. Disk striping improves read and write performance for serial applications. Use 64 Kbytes per number of disks in the stripe as a starting point for interlace size.
If your environment is attribute-intensive, where random access dominates disk usage, stripe the disk with the default interlace (one disk cylinder).
Use the iostat and sar commands to report disk drive usage.
Attaining even disk usage usually requires some iterations of monitoring and data reorganization. In addition, usage patterns change over time. A data layout that works when installed may perform poorly a year later. For more information on checking disk drive usage, see "Checking the NFS Server" in Chapter 3, Analyzing NFS Performance.
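The interlace starting point for a data-intensive stripe described above (64 Kbytes divided by the number of disks in the stripe) can be sketched as simple arithmetic; the four-disk stripe is an assumed example:

```shell
# Starting-point interlace for a data-intensive stripe:
# 64 Kbytes divided by the number of disks in the stripe.
ndisks=4
interlace=$(( 64 / ndisks ))
echo "${interlace}k interlace for a ${ndisks}-disk stripe"   # prints "16k ..."
```

You would then pass the resulting interlace size to the striping setup in Solstice DiskSuite or Online: DiskSuite.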
The Solaris 2.4 through Solaris 7 software environments and the Online: Disk Suite 3.0 or Solstice DiskSuite software support a log-based extension to the standard UNIX file system, which works like a disk-based Prestoserve NFS accelerator.
In addition to the main file system disk, a small (typically 10 Mbytes) section of disk is used as a sequential log for writes. This speeds up the same kind of operations as a Prestoserve NFS accelerator with two advantages:
In dual-machine high-availability configurations, the Prestoserve NFS accelerator cannot be used. The log, because it resides on disk, can be shared between the machines and therefore can be used.
After an operating system crash, the fsck of the log-based file system involves a sequential read of the log only. The sequential read of the log is almost instantaneous, even on very large file systems.
You cannot use the Prestoserve NFS accelerator and the log on the same file system.
When you analyze your disk data layout, consider zone bit recording.
All of Sun's current disks (except the 207 Mbyte disk) use this type of encoding which uses the peculiar geometric properties of a spinning disk to pack more data into the parts of the platter closest to its edge. This results in the lower disk addresses (corresponding to the outside cylinders) usually outperforming the inside addresses by 50 percent.
Put the data in the lowest-numbered cylinders.
The zone bit recording data layout makes those cylinders the fastest ones.
This margin is most often realized in serial transfer performance, but also affects random access I/O. Data on the outside cylinders (zero) not only moves past the read/write heads more quickly, but the cylinders are also larger. Data will be spread over fewer large cylinders, resulting in fewer and shorter seeks.
This section explains how to determine CPU usage and provides guidelines for configuring CPUs in NFS servers.
To get 30 second averages, type mpstat 30 at the % prompt.
The following screen is displayed:
system% mpstat 30
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    6   0    0  114   14  25    0    6    3   0    48   1   2 25  72
  1    6   0    0   86   85  50    0    6    3   0    66   1   4 24  71
  2    7   0    0   42   42  31    0    6    3   0    54   1   3 24  72
  3    8   0    0    0    0  33    0    6    4   0    54   1   3 24  72
The mpstat 30 command reports statistics per processor. Each row of the table represents the activity of one processor. The first table summarizes all activities since the system was last booted. Each subsequent table summarizes activity for the preceding interval. All values are rates (events per second).
Review the following data in the mpstat output (see Table 4-2):

Table 4-2 Output of the mpstat Command

| Field | Description |
|---|---|
| usr | Percent user time |
| sys | Percent system time (can be caused by NFS processing) |
| wt | Percent wait time (treat as idle time) |
| idl | Percent idle time |
If sys is greater than 50 percent, increase CPU power to improve NFS performance.
Table 4-3 describes guidelines for configuring CPUs in NFS servers.
Table 4-3 Guidelines for Configuring CPUs in NFS Servers
| If | Then |
|---|---|
| Your environment is predominantly attribute-intensive, and you have one to three medium-speed Ethernet or Token Ring networks. | A uniprocessor system is sufficient. For smaller systems, the UltraServer 1, SPARCserver 2, or SPARCserver 5 systems have sufficient processor power. |
| Your environment is predominantly attribute-intensive, and you have 4 to 60 medium-speed Ethernet or Token Ring networks. | Use an UltraServer 2, SPARCserver 10, or SPARCserver 20 system. |
| You have larger attribute-intensive environments, and SBus and disk expansion capacity is sufficient. | Use multiprocessor models of the UltraServer 2, SPARCserver 10, or SPARCserver 20 systems. |
| You have larger attribute-intensive environments. | Use dual-processor systems such as the SPARCserver 10 system Model 512, SPARCserver 20 system, SPARCserver 1000 or 1000E system, Ultra Enterprise 3000, 4000, 5000, or 6000 system, or SPARCcenter 2000/2000E system. Either the 40 MHz/1 Mbyte or the 50 MHz/2 Mbyte module works well for an NFS work load in the SPARCcenter 2000 system; the 50 MHz/2 Mbyte module gives better performance. |
| Your environment is data-intensive and you have a high-speed network. | Configure one SuperSPARC processor per high-speed network (such as FDDI). |
| Your environment is data-intensive and you must use an Ethernet connection due to cabling restrictions. | Configure one SuperSPARC processor for every four Ethernet or Token Ring networks. |
| Your environment is a pure NFS installation. | You do not need to configure additional processors beyond the recommended number on your server(s). |
| Your servers perform tasks in addition to NFS processing. | Add additional processors to increase performance significantly. |
Since NFS is a disk I/O-intensive service, a slow server can suffer from I/O bottlenecks. Adding memory eliminates the I/O bottleneck by increasing the file system cache size.
The system could be waiting for file system pages, or it may be paging process images to and from the swap device. The latter effect is only a problem if additional services are provided by the system, since NFS service runs entirely in the operating system kernel.
If the swap device is not showing any I/O activity, then all paging is due to file I/O operations from NFS reads, writes, attributes, and lookups.
Paging file system data from the disk into memory is a more common NFS server performance problem.
Watch the scan rate reported by vmstat 30.
If the scan rate (sr, the number of pages scanned) is often over 200 pages/second, then the system is short of memory (RAM). The system is trying to find unused pages to be reused and may be reusing pages that should be cached for rereading by NFS clients.
Add memory.
Adding memory eliminates repeated reads of the same data and enables the NFS requests to be satisfied out of the page cache of the server. To calculate the memory required for your NFS server, see "Calculating Memory," which follows.
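The scan-rate threshold above can be turned into a quick check. A minimal sketch, where the sr value is an assumed sample rather than live output; in practice you would read it from the sr column of `vmstat 30`:

```shell
# Memory-shortage check based on the vmstat scan rate (sr column).
# The sample value is illustrative, not a live measurement.
sr=250
if [ "$sr" -gt 200 ]; then
    echo "scan rate ${sr}/sec: system is short of memory, add RAM"
else
    echo "scan rate ${sr}/sec: normal"
fi
```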
The memory capacity required for optimal performance depends on the average working set size of files used on that server. The memory acts as a cache for recently read files. The most efficient cache matches the current working set size as closely as possible.
Because of this memory caching feature, it is not unusual for the free memory in NFS servers to be between 0.5 and 1.0 Mbytes if the server has been active for a long time. Such activity is normal and desirable. Having enough memory allows you to service multiple requests without blocking.
The actual files in the working set may change over time. However, the size of the working set may remain relatively constant. NFS creates a sliding window of active files, with many files entering and leaving the working set throughout a typical monitoring period.
You can calculate memory according to general or specific memory rules.
Follow these general guidelines to calculate the amount of memory you will need.
Virtual memory = RAM (main memory) + swap space
Calculate the amount of memory according to the five-minute rule:
Memory is sized at 16 Mbytes plus memory to cache the data, which will be accessed more often than once in five minutes.
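The five-minute rule can be sketched as simple arithmetic; the amount of frequently reread data is an assumed example value that you would estimate for your own workload:

```shell
# Five-minute rule: 16 Mbytes base, plus memory to cache the data that
# is accessed more often than once in five minutes.
active_data=112   # Mbytes of frequently reread data (assumed example)
echo "memory: $(( 16 + active_data )) Mbytes"   # prints "memory: 128 Mbytes"
```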
Follow these specific guidelines to calculate the amount of memory you will need.
If your server primarily provides user data for many clients, configure relatively minimal memory.
For small installations, this will be 32 Mbytes; for large installations, this will be about 128 Mbytes. In multiprocessor configurations, provide at least 64 Mbytes per processor. Attribute-intensive applications normally benefit slightly more from memory than data-intensive applications.
If your server normally provides temporary file space for applications that use those files heavily, configure your server memory to about 75 percent of the size of the active temporary files in use on the server.
For example, if each client's temporary file is about 5 Mbytes, and the server is expected to handle 20 fully active clients, configure it as follows:
(20 clients x 5 Mbytes)/75% = 133 Mbytes of memory
Note that 128 Mbytes is the closest amount of memory that is easily configured.
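The temporary-file arithmetic above can be sketched as:

```shell
# Memory = (clients x per-client temporary file size) / 75%,
# using the example figures from the text.
clients=20
tempfile_mb=5
mem=$(( clients * tempfile_mb * 100 / 75 ))
echo "configure about ${mem} Mbytes of memory"   # prints "... 133 Mbytes ..."
```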
If the primary task of your server is to provide only executable images, configure server memory to be equal to approximately the combined size of the heavily-used binary files (including libraries).
For example, a server expected to provide /usr/openwin should have enough memory to cache the X server, CommandTool, libX11.so, libview.so and libXt. This NFS application is considerably different from the more typical /home, /src, or /data server in that it normally provides the same files repeatedly to all of its clients and is hence able to effectively cache this data. Clients will not use every page of all of the binaries, which is why it is reasonable to configure only enough to hold the frequently-used programs and libraries. Use the cache file system on the client, if possible, to reduce the load and RAM needs on the server.
If the clients are DOS PCs or Macintosh machines, add more RAM cache on the Sun NFS server; these systems do much less caching than UNIX system clients.
NFS servers need little swap space because they do not run user processes.
Configure at least 64 Mbytes virtual memory, which is RAM plus swap space (see Table 4-3).
Set up fifty percent of main memory as an emergency swap space to save a crash dump in case of a system panic.
Table 4-4 Swap Space Requirements
| Amount of RAM | Swap Space Requirements |
|---|---|
| 16 Mbytes | 48 Mbytes |
| 32 Mbytes | 32 Mbytes |
| 64 or more Mbytes | None |
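The pattern in Table 4-4 is swap space topping up RAM to the 64-Mbyte virtual-memory minimum. This sketch covers only that minimum, not the emergency crash-dump swap (50 percent of main memory) recommended earlier:

```shell
# Swap needed to reach the 64-Mbyte virtual-memory minimum
# (virtual memory = RAM + swap space).
ram=32
swap=$(( 64 - ram ))
if [ "$swap" -lt 0 ]; then
    swap=0
fi
echo "${swap} Mbytes of swap for ${ram} Mbytes of RAM"   # prints "32 Mbytes ..."
```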
NFS version 3 reduces the need for Prestoserve(TM) capability. Using the Prestoserve NFS accelerator makes a significant difference with NFS version 2. The Prestoserve NFS accelerator makes only a slight improvement with NFS version 3.
Adding a Prestoserve NFS accelerator with NFS version 2 is another way to improve NFS performance. NFS version 2 requires all writes to be written to stable storage before responding to the operation. The Prestoserve NFS accelerator enables high-speed NVRAM instead of slow disks to meet the stable storage requirement.
Two types of NVRAM used by the Prestoserve NFS accelerator are:
NVRAM-NVSIMM
SBus
Both types of Prestoserve NFS accelerators speed up NFS server performance by:
Providing faster selection of file systems
Caching writes for synchronous I/O operations
Intercepting synchronous write requests to disk and storing the data in nonvolatile memory
If you can use either NVRAM hardware, use the NVRAM-NVSIMM for the Prestoserve cache. The NVRAM-NVSIMM and SBus hardware are functionally identical. However, the NVRAM-NVSIMM hardware is slightly more efficient and does not require an SBus slot. The NVRAM-NVSIMMs reside in memory and the NVRAM-NVSIMM cache is larger than the SBus hardware.
The NVRAM-NVSIMM Prestoserve NFS accelerator significantly improves the response time of NFS clients with heavily loaded or I/O-bound servers. To improve performance add the NVRAM-NVSIMM Prestoserve NFS accelerator to the following platforms:
SPARCserver 20 system
SPARCserver 1000 or 1000E system
SPARCcenter 2000 or 2000E system
You can use an alternate method for improving NFS performance in Sun Enterprise 3x00, 4x00, 5x00, and 6x00 systems. This method is to upgrade NVRAM in the SPARCstorage Array that is connected to the server.
Sun Enterprise 3x00, 4x00, 5x00, and 6x00 server systems enable SPARCstorage Array NVRAM fast writes. Turn on fast writes by invoking the ssaadm command.
The SBus Prestoserve NFS accelerator contains only a 1 Mbyte cache and resides on the SBus. You can add the SBus Prestoserve NFS accelerator to any SBus-based server except the SPARCserver 1000(E) system, the SPARCcenter 2000(E), or the Sun Enterprise 3x00, 4x00, 5x00, or 6x00 server systems.
Some systems on which you can add the SBus Prestoserve NFS accelerator are:
SPARCserver 5 system
SPARCserver 20 system
Sun Enterprise 1 system
Sun Enterprise 2 system
SPARCserver 600 series
This section describes how to set the number of NFS threads. It also covers tuning the main NFS performance-related parameters in the /etc/system file. Tune these /etc/system parameters carefully, considering the physical memory size of the server and kernel architecture type.
Arbitrary tuning creates major instability problems, including an inability to boot.
For improved performance, NFS server configurations should set the number of NFS threads. Each thread is capable of processing one NFS request. A larger pool of threads enables the server to handle more NFS requests in parallel. The default setting of 16 in Solaris 2.4 through Solaris 7 software environments results in slower NFS response times. Scale the setting with the number of processors and networks and increase the number of NFS server threads by editing the invocation of nfsd in /etc/init.d/nfs.server:
/usr/lib/nfs/nfsd -a 64
The previous code box specifies that the maximum allocation of demand-based NFS threads is 64.
There are three ways to size the number of NFS threads. Each method results in about the same number of threads if you followed the configuration guidelines in this manual. Extra NFS threads do not cause a problem.
To set the number of NFS threads, take the maximum of the following three suggestions:
Use 2 NFS threads for each active client process.
A client workstation usually only has one active process. However, a time-shared system that is an NFS client may have many active processes.
Use 16 to 32 NFS threads for each CPU.
Use roughly 16 for a SPARCclassic or a SPARCstation 5 system. Use 32 NFS threads for a system with a 60 MHz SuperSPARC processor.
Use 16 NFS threads for each 10 Mbits of network capacity.
For example, if you have one SunFDDI interface, set the number of threads to 160. With two SunFDDI interfaces, set the thread count to 320, and so on.
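The three sizing rules above can be sketched as a small calculation. This is an illustrative sketch only; the client count, CPU count, and network capacity below are hypothetical inputs that you would replace with your own server's figures.

```shell
# Hypothetical example inputs -- substitute your own server's figures.
ACTIVE_CLIENTS=100   # active client processes
NCPUS=4              # number of CPUs
NET_MBITS=200        # total network capacity in Mbits (e.g. two FDDI rings)

BY_CLIENTS=`expr $ACTIVE_CLIENTS \* 2`   # 2 threads per active client process
BY_CPUS=`expr $NCPUS \* 32`              # 16 to 32 threads per CPU (32 shown)
BY_NET=`expr $NET_MBITS / 10 \* 16`      # 16 threads per 10 Mbits of network

# Take the maximum of the three estimates.
NTHREADS=$BY_CLIENTS
[ $BY_CPUS -gt $NTHREADS ] && NTHREADS=$BY_CPUS
[ $BY_NET -gt $NTHREADS ] && NTHREADS=$BY_NET
echo "/usr/lib/nfs/nfsd -a $NTHREADS"
```

With these example figures, the network rule dominates and the result is the nfsd invocation to place in /etc/init.d/nfs.server.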
The number of fixed-size tables in the kernel has been reduced in each release of the Solaris software environment. Most are now dynamically sized or are linked to the maxusers calculation. Extra tuning to increase the DNLC and inode caches is required for the Solaris 2.4 through Solaris 7 software environments. For Solaris version 2.4 you must tune the pager. Tuning the pager is not necessary for Solaris 2.5, 2.5.1, 2.6, or 7 operating environments.
The /etc/system file is read by the operating system kernel at start-up. It configures the search path for loadable operating system kernel modules and enables kernel variables to be set. For more information, see the man page for system(4).
Use the set commands in /etc/system carefully because the commands in /etc/system cause automatic patches of the kernel.
If your machine does not boot and you suspect a problem with /etc/system, use the boot -a option. With this option, the system prompts (with defaults) for its boot parameters. One of these is the /etc/system configuration file. Either use the name of a backup copy of the original /etc/system file or /dev/null. Fix the file and immediately reboot the system to make sure it is operating correctly.
The maxusers parameter determines the size of various kernel tables such as the process table. The maxusers parameter is set in the /etc/system file. For example:
set maxusers = 200
In the Solaris 2.4 through Solaris 7 software environments, maxusers is dynamically sized based upon the amount of RAM configured in the system. The sizing method used for maxusers is:
maxusers = Mbytes of RAM configured in the system
The number of Mbytes of RAM configured into the system is actually based upon physmem which does not include the 2 Mbytes or so that the kernel uses at boot time. The minimum limit is 8 and the maximum automatic limit is 1024, which corresponds to systems with 1 Gbyte or more of RAM. It can still be set manually in /etc/system but the manual setting is checked and limited to a maximum of 2048. This is a safe level on all kernel architectures, but uses a large amount of operating system kernel memory.
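The auto-sizing rule above can be expressed as a simple clamp. This is a hedged sketch of the calculation only, with a hypothetical RAM figure; the kernel performs the equivalent computation itself at boot.

```shell
PHYSMEM_MB=256   # example: Mbytes of RAM seen by the kernel (physmem)

# Auto-sizing: maxusers equals Mbytes of RAM, clamped to the range 8 to 1024.
MAXUSERS=$PHYSMEM_MB
[ $MAXUSERS -lt 8 ] && MAXUSERS=8
[ $MAXUSERS -gt 1024 ] && MAXUSERS=1024
echo $MAXUSERS
```

A manual setting in /etc/system bypasses this calculation but is still checked against the 2048 ceiling.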
Table 4-5 describes the default settings for the performance-related inode cache and name cache operating system kernel parameters.
Table 4-5 Default Settings for Inode and Name Cache Parameters
| Kernel Resource | Variable | Default Setting |
|---|---|---|
| Inode cache | ufs_ninode | 17 * maxusers + 90 |
| Name cache | ncsize | 17 * maxusers + 90 |
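As a worked example of the default formula (not a tuning recommendation), a server whose maxusers value is 200 gets the following cache sizes:

```shell
MAXUSERS=200                         # example value; yours depends on RAM
NCSIZE=`expr 17 \* $MAXUSERS + 90`   # default name cache size
UFS_NINODE=$NCSIZE                   # inode cache defaults to the same value
echo $NCSIZE
```

Both caches default to 3490 entries in this example, which is why they are normally tuned together.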
The bufhwm variable, set in the /etc/system file, controls the maximum amount of memory allocated to the buffer cache and is specified in Kbytes. The default value of bufhwm is 0, which allows up to 2 percent of system memory to be used. This can be increased up to 20 percent and may need to be increased to 10 percent for a dedicated NFS file server with a relatively small memory system. On a larger system, the bufhwm variable may need to be limited to prevent the system from running out of the operating system kernel virtual address space.
The buffer cache is used to cache inode, indirect block, and cylinder group related disk I/O only. The following is an example of a buffer cache (bufhwm) setting in the /etc/system file that can handle up to 10 Mbytes of cache. This is the highest value to which you should set bufhwm.
set bufhwm=10240
You can monitor the buffer cache using sar -b (see the following code example), which reports a read (%rcache) and a write hit rate (%wcache) for the buffer cache.
# sar -b 5 10
SunOS hostname 5.2 Generic sun4c 08/06/93

23:43:39 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
Average        0      25     100       3      22      88       0       0
If a significant number of reads and writes per second occur (greater than 50) and if the read hit rate (%rcache) falls below 90 percent, or if the write hit rate (%wcache) falls below 65 percent, increase the buffer cache size, bufhwm.
In the previous sar -b 5 10 command output, the read hit rate (%rcache) and the write hit rate (%wcache) did not fall below 90 percent or 65 percent respectively.
Following are descriptions of the arguments to the sar command.
Table 4-6 Descriptions of the Arguments to the sar Command
| Argument | Description |
|---|---|
| b | Checks buffer activity |
| 5 | Time interval, every 5 seconds (must be at least 5 seconds) |
| 10 | Number of times the command gathers statistics |
The system does not prevent you from increasing the buffer cache to an unacceptably high level. Signs of an oversized buffer cache include:
Hung server
Device drivers that suffer from a shortage of operating system kernel virtual memory
Size the directory name lookup cache (DNLC) to a default value using maxusers. A large cache size (ncsize) significantly increases the efficiency of NFS servers with multiple clients.
To show the DNLC hit rate (cache hits), type vmstat -s.
% vmstat -s
... lines omitted
79062 total name lookups (cache hits 94%)
16 toolong
Directory names less than 30 characters long are cached; the toolong count reports names that are too long to be cached. A cache miss means that a disk I/O may be needed to read the directory when traversing the path name components to get to a file. A hit rate of less than 90 percent requires attention.
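The hit-rate check can be scripted against the vmstat -s output. This is an illustrative sketch only: the captured line below is the sample shown above, and the sed pattern assumes the "(cache hits NN%)" format of that output.

```shell
# Sample line captured from vmstat -s output.
LINE="79062 total name lookups (cache hits 94%)"

# Extract the hit-rate percentage from the line.
HITRATE=`echo "$LINE" | sed 's/.*cache hits \([0-9]*\)%.*/\1/'`

# A hit rate under 90 percent calls for tuning ncsize.
if [ "$HITRATE" -lt 90 ]; then
    STATUS="tune ncsize"
else
    STATUS="DNLC OK"
fi
echo "$STATUS"
```

At the sample's 94 percent hit rate, no ncsize tuning is indicated.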
Cache hit rates can significantly affect NFS performance. getattr, setattr, and lookup usually represent greater than 50 percent of all NFS calls. If the requested information isn't in cache, the request will generate a disk operation that results in a performance penalty as significant as that of a read or write request. The only limit to the size of the DNLC cache is available kernel memory.
If the hit rate (cache hits) is less than 90 percent and a problem does not exist with the number of longnames, tune the ncsize variable (see "To Reset ncsize," which follows). The variable ncsize refers to the size of the DNLC in terms of the number of name and vnode translations that can be cached. Each DNLC entry uses about 50 bytes of extra kernel memory.
Set ncsize in the /etc/system file to values higher than the default (based on maxusers).
As an initial guideline, since dedicated NFS servers do not need a lot of RAM, maxusers will be low and the DNLC will be small; double its size.
set ncsize=5000
The default value of ncsize is:
ncsize (name cache) = 17 * maxusers + 90
For NFS server benchmarks, set it as high as 16000.
For maxusers = 2048, set it at 34906.
See "Increasing the Inode Cache," which follows.
A memory-resident inode is used whenever an operation is performed on an entity in the file system. The inode read from disk is cached in case it is needed again. ufs_ninode is the size to which the UNIX file system attempts to limit the list of idle inodes. You can have ufs_ninode set to 1 but still have 10,000 active inodes. As active inodes become idle, if the number of idle inodes exceeds ufs_ninode, memory is reclaimed by discarding idle inodes.
Every entry in the DNLC cache points to an entry in the inode cache, so both caches should be sized together. The inode cache should be at least as big as the DNLC cache. For best performance, it should be the same size in the Solaris 2.4 through Solaris 7 operating environments.
Because ufs_ninode is just a limit, you can tweak it with adb on a running system with immediate effect. The only upper limit is the amount of kernel memory used by the inodes. The tested upper limit corresponds to maxusers = 2048, which is the same as ncsize at 34906.
To report the size of the kernel memory allocation use sar -k.
In the Solaris 2.4 operating environment, each inode uses 300 bytes of kernel memory from the lg_mem pool.
In the Solaris 2.5.1, 2.6, and 7 operating environments, each inode uses 320 bytes of kernel memory from the lg_mem pool, and ufs_ninode is automatically adjusted to be at least ncsize. Tune ncsize to get the hit rate up and let the system pick the default ufs_ninode.
If the inode cache hit rate is below 90 percent, or if the DNLC requires tuning for local disk file I/O workloads:
Increase the size of the inode cache.
Change the variable ufs_ninode in your /etc/system file to the same size as the DNLC (ncsize). For example, for Solaris 2.4, type:
set ufs_ninode=5000
The default value of the inode cache is the same as that for ncsize:
ufs_ninode (default value) = 17 * maxusers + 90.
Do not set ufs_ninode less than ncsize. The ufs_ninode parameter limits the number of inactive inodes, rather than the total number of active and inactive inodes.
Reboot the system.
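The sizing rule above can be stated as a simple consistency check. This is a hedged sketch using the example values from this section, not a tool the operating environment provides.

```shell
# Example values as they would appear in /etc/system.
NCSIZE=5000
UFS_NINODE=5000

# ufs_ninode limits idle inodes only, but should never be below ncsize,
# because every DNLC entry points at an entry in the inode cache.
if [ "$UFS_NINODE" -lt "$NCSIZE" ]; then
    STATUS="raise ufs_ninode"
else
    STATUS="sizes consistent"
fi
echo "$STATUS"
```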
If you are using NFS over a high speed network such as FDDI, SunFastEthernet, or SunATM, you will have better read throughput by increasing the number of read-aheads on the NFS client.
Increasing read-aheads is not recommended under these conditions:
The client is very short of RAM.
The network is very busy.
File accesses are randomly distributed.
When free memory is low, read-ahead will not be performed.
By default, the read-ahead is set to 1 block (8 Kbytes with version 2 and 32 Kbytes with version 3). For example, with read-ahead set to 2 blocks, the client fetches an additional 16 Kbytes from a file while you are reading the first 8 Kbytes. Thus, the read-ahead stays one step ahead of you, fetching data in 8 Kbyte increments.
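The prefetch arithmetic can be sketched directly; the values below are example figures for NFS version 2, not measurements.

```shell
BLOCK_KB=8   # NFS version 2 transfer block size in Kbytes
NRA=2        # read-ahead count (nfs_nra)

# Amount of data prefetched beyond the block currently being read:
AHEAD_KB=`expr $NRA \* $BLOCK_KB`
echo "${AHEAD_KB} Kbytes prefetched"
```

With version 3's 32 Kbyte blocks, the same count prefetches four times as much data.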
Increasing the read-ahead count can improve read throughput up to a point. The optimal read-ahead setting depends on your configuration and application; increasing the read-ahead value beyond that setting may actually reduce throughput. In most cases, the optimal read-ahead setting is less than eight read-aheads (8 blocks).
In the following procedure you can tune the nfs_nra and the nfs3_nra values independently. If a client is running the Solaris 2.5, 2.5.1, 2.6, or 7 operating environment, the client may need to tune nfs_nra (NFS version 2). This is the case if the client is talking to a server that does not support version 3.
Add the following line to /etc/system on the NFS client.
set nfs:nfs_nra=4
Reboot the system to implement the read-ahead value.
Add the following line to /etc/system on the NFS client:
With versions of the Solaris software environment before the Solaris 2.6 software environment, type:
set nfs:nfs3_nra=6
With the Solaris 2.6 operating environment, type:
set nfs:nfs3_nra=2
With the Solaris 7 operating environment type:
set nfs:nfs3_nra=4
Raising the read-ahead count too high can make read throughput worse. You may consider running benchmarks with different values of nfs3_nra or nfs_nra to see what works best in your environment.
Reboot the system to implement the read-ahead value.