NFS Server Performance and Tuning Guide for Sun Hardware

Chapter 5 Troubleshooting

Troubleshooting Tools

This chapter presents troubleshooting tips for the following types of problems: general tuning problems, client bottlenecks, server bottlenecks, and network bottlenecks.

General Troubleshooting Tuning Tips

This section (see Table 5-1) lists the actions to perform when you encounter a tuning problem.

Table 5-1 General Troubleshooting Tuning Problems and Actions to Perform

| Command/Tool | Command Output/Result | Action |
|---|---|---|
| netstat -i | (Collis + Ierrs + Oerrs) / (Ipkts + Opkts) > 2% | Check the Ethernet hardware. |
| netstat -i | Collis/Opkts > 10% | Add an Ethernet interface and distribute the client load. |
| netstat -i | Ierrs/Ipkts > 25% | The host may be dropping packets, causing a high input error rate. To compensate for bandwidth-limited network hardware, reduce the packet size: set the read buffer size (rsize), the write buffer size (wsize), or both to 2048 when using mount or in the /etc/vfstab file. See "To Check the Network" in Chapter 3, Analyzing NFS Performance. |
| nfsstat -s | readlink > 10% | Replace symbolic links with mount points. |
| nfsstat -s | writes > 5% | Install a Prestoserve NFS accelerator (SBus card or NVRAM-NVSIMM) for peak performance. See "Prestoserve NFS Accelerator" in Chapter 4, Configuring the Server and the Client to Maximize NFS Performance. |
| nfsstat -s | There are any badcalls. | The network may be overloaded. Identify an overloaded network using network interface statistics. |
| nfsstat -s | getattr > 40% | Increase the client attribute cache using the actimeo option. Make sure the DNLC and inode caches are large. Use vmstat -s to determine the percent hit rate (cache hits) for the DNLC and, if needed, increase ncsize in the /etc/system file. See "Directory Name Lookup Cache (DNLC)" in Chapter 4, Configuring the Server and the Client to Maximize NFS Performance. |
| vmstat -s | Hit rate (cache hits) < 90% | Increase ncsize in the /etc/system file. |
| Ethernet monitor, for example: SunNet Manager SharpShooter, NetMetrix | Load > 35% | Add an Ethernet interface and distribute the client load. |
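The three netstat -i checks in Table 5-1 are simple ratios of the interface counters, so they can be sketched in a few lines of Python. This is only an illustration: the function name `check_interface` and its arguments are hypothetical, and the counter values are assumed to have been read by hand from the Ipkts, Ierrs, Opkts, Oerrs, and Collis columns of netstat -i.

```python
def check_interface(ipkts, ierrs, opkts, oerrs, collis):
    """Apply the netstat -i thresholds from Table 5-1 to one interface.

    Hypothetical helper: arguments are the cumulative counters shown by
    netstat -i. Returns the list of suggested actions (possibly empty).
    """
    actions = []
    total = ipkts + opkts

    # (Collis + Ierrs + Oerrs) / (Ipkts + Opkts) > 2%
    if total and (collis + ierrs + oerrs) / total > 0.02:
        actions.append("Check the Ethernet hardware.")

    # Collis / Opkts > 10%
    if opkts and collis / opkts > 0.10:
        actions.append("Add an Ethernet interface and distribute the client load.")

    # Ierrs / Ipkts > 25%
    if ipkts and ierrs / ipkts > 0.25:
        actions.append("Set rsize and/or wsize to 2048; the host may be dropping packets.")

    return actions


# A healthy interface triggers none of the checks.
print(check_interface(ipkts=100000, ierrs=10, opkts=90000, oerrs=5, collis=500))
# An interface with heavy collisions and input errors triggers all three.
print(check_interface(ipkts=1000, ierrs=300, opkts=1000, oerrs=0, collis=150))
```

The zero checks on the denominators are a defensive choice for interfaces that have carried no traffic yet; the guide itself does not discuss that case.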

Client Bottlenecks

This section (see Table 5-2) shows potential client bottlenecks and how to remedy them.

Table 5-2 Client Bottlenecks

| Symptom(s) | Command/Tool | Cause | Solution |
|---|---|---|---|
| NFS server hostname not responding or slow response to commands when using NFS-mounted directories | nfsstat | User's path variable | List directories on local file systems first, critical directories on remote file systems second, and then the rest of the remote file systems. |
| NFS server hostname not responding or slow response to commands when using NFS-mounted directories | nfsstat | Running executable from an NFS-mounted file system | Copy the application locally (if used often). |
| NFS server hostname not responding; badxid > 5% of total calls and badxid = timeout | nfsstat -rc | Client times out before server responds | Check for server bottleneck. If the server's response time isn't improved, increase the timeo parameter in the /etc/vfstab file of clients. Try increasing timeo to 25, 50, 100, 200 (tenths of seconds). Wait one day between modifications and check to see if the number of time-outs decreases. |
| badxid = 0 | nfsstat -rc | Slow network | Increase rsize and wsize in the /etc/vfstab file. Check interconnection devices (bridges, routers, gateways). |
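The two nfsstat -rc rows of Table 5-2 amount to a small decision rule on the calls, badxid, and timeout counters, sketched below. The names `diagnose_client` and `next_timeo` are hypothetical. Two details are assumptions rather than statements from the table: "badxid = timeout" is treated as "equal within 10%", and the badxid = 0 row is only reported when at least one time-out has occurred.

```python
TIMEO_STEPS = (25, 50, 100, 200)  # tenths of a second, as listed in Table 5-2


def diagnose_client(calls, badxid, timeout, tolerance=0.10):
    """Classify nfsstat -rc counters using the rules in Table 5-2.

    Assumption: 'badxid = timeout' means the two counters agree within
    `tolerance` (10% by default); the guide does not define a margin.
    """
    if calls <= 0:
        return "no data"
    if badxid == 0 and timeout > 0:
        return "slow network"
    roughly_equal = abs(badxid - timeout) <= tolerance * max(badxid, timeout)
    if badxid / calls > 0.05 and roughly_equal:
        return "client times out before server responds"
    return "no match"


def next_timeo(current):
    """Return the next timeo value to try, or None when the list is exhausted."""
    for step in TIMEO_STEPS:
        if step > current:
            return step
    return None


print(diagnose_client(calls=10000, badxid=600, timeout=610))
print(diagnose_client(calls=10000, badxid=0, timeout=400))
print(next_timeo(50))
```

Stepping through `TIMEO_STEPS` one value at a time mirrors the advice to wait a day between modifications before checking whether the time-outs decreased.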

Server Bottlenecks

This section (see Table 5-3) shows server bottlenecks and how to remedy them.

Table 5-3 Server Bottlenecks

| Symptom(s) | Command/Tool | Cause | Solution |
|---|---|---|---|
| NFS server hostname not responding | vmstat -s or iostat | Cache hit rate is < 90% | Adjust the suggested parameters for DNLC, then rerun the command to see if the symptom is gone. If not, reset the parameters for DNLC. Adjust the parameters for the buffer cache, then the inode cache, following the same procedure as for the DNLC. |
| NFS server hostname not responding | netstat -m or nfsstat | Server not keeping up with request arrival rate | Check the network. If the problem is not the network, add an appropriate Prestoserve NFS accelerator, or upgrade the server. |
| High I/O wait time or CPU idle time; slow disk access times or NFS server hostname not responding | iostat -x | I/O load not balanced across disks; the svc_t value is greater than 40 ms | Take a large sample (~2 weeks). Balance the load across disks; add disks as necessary. Add a Prestoserve NFS accelerator for synchronous writes. To reduce disk and network traffic, use tmpfs for /tmp for both server and clients. Measure system cache efficiencies. |
| Slow response when accessing remote files | netstat -s or snoop | Ethernet interface dropping packets | If retransmissions are indicated, increase buffer size. For information on how to use snoop, see "snoop" in Appendix A, Using NFS Performance-Monitoring and Benchmarking Tools. |
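Because Table 5-3 recommends judging svc_t over a large sample rather than a single reading, one way to apply the 40 ms threshold is to average the collected values per disk first. The sketch below assumes the svc_t readings from iostat -x have already been gathered into a dictionary keyed by device name; the function name `slow_disks` and the device names are hypothetical.

```python
from statistics import mean

SVC_T_LIMIT_MS = 40.0  # threshold from Table 5-3


def slow_disks(samples):
    """Return devices whose average svc_t exceeds the Table 5-3 threshold.

    `samples` maps a device name to a list of svc_t readings in
    milliseconds; devices with no readings are ignored.
    """
    averages = {dev: mean(vals) for dev, vals in samples.items() if vals}
    return sorted(dev for dev, avg in averages.items() if avg > SVC_T_LIMIT_MS)


readings = {
    "sd0": [55.0, 62.5, 48.0],  # consistently above the threshold
    "sd1": [12.0, 9.5],         # healthy
    "sd3": [41.0, 38.0],        # one high reading, but the average is 39.5
}
print(slow_disks(readings))  # ['sd0']
```

Averaging is a deliberate simplification; a disk that is slow only at peak hours would need a per-period breakdown to show up.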

Network Bottlenecks

This section (see Table 5-4) shows network-related bottlenecks and how to remedy them.

Table 5-4 Network-Related Bottlenecks

| Symptoms | Command/Tool | Cause | Solution |
|---|---|---|---|
| Poor response time when accessing directories mounted on different subnets or NFS server hostname not responding | netstat -rs | NFS requests being routed | Keep clients on the subnet directly connected to the server. |
| Poor response time when accessing directories mounted on different subnets or NFS server hostname not responding | netstat -s shows incomplete or bad headers, bad data length fields, or bad checksums | Network problems | Check the network hardware. |
| Poor response time when accessing directories mounted on different subnets or NFS server hostname not responding; sum of input and output packets per second for an interface is over 600 per second | netstat -i | Network overloaded | The network segment is very busy. If this is a recurring problem, consider adding another (le) network interface. |
| Network interface collisions are over 120 per second | netstat -i | Network overloaded | Reduce the number of machines on the network or check the network hardware. |
| Poor response time when accessing directories mounted on different subnets or NFS server hostname not responding | netstat -i | High packet collision rate (Collis/Opkts > 0.10) | If packets are corrupted, it may be due to a corrupted MUX box; use the Network General Sniffer product or another protocol analyzer to find the cause. Check for an overloaded network; if there are too many nodes, create another subnet. Check the network hardware: the cause could be a bad tap, transceiver, or hub on 10BASE-T. Check cable length and termination. |
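The per-second thresholds in Table 5-4 (600 packets and 120 collisions) apply to rates, while netstat -i reports cumulative counters, so two readings taken a known interval apart are needed. A minimal sketch, assuming each reading has been copied into a dictionary with the hypothetical keys `ipkts`, `opkts`, and `collis`:

```python
PACKETS_PER_SEC_LIMIT = 600     # input + output packets, from Table 5-4
COLLISIONS_PER_SEC_LIMIT = 120  # from Table 5-4


def interface_rates(before, after, seconds):
    """Return (packets per second, collisions per second) between two readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    packets = (after["ipkts"] - before["ipkts"]) + (after["opkts"] - before["opkts"])
    collisions = after["collis"] - before["collis"]
    return packets / seconds, collisions / seconds


def check_rates(before, after, seconds):
    """Compare the measured rates with the Table 5-4 thresholds."""
    pps, cps = interface_rates(before, after, seconds)
    findings = []
    if pps > PACKETS_PER_SEC_LIMIT:
        findings.append("Network overloaded: consider adding another network interface.")
    if cps > COLLISIONS_PER_SEC_LIMIT:
        findings.append("Network overloaded: reduce the number of machines or check the hardware.")
    return findings


first = {"ipkts": 1000, "opkts": 2000, "collis": 10}
second = {"ipkts": 25000, "opkts": 20000, "collis": 1510}
print(interface_rates(first, second, 60))  # (700.0, 25.0)
print(check_rates(first, second, 60))
```

The sketch does not handle counters that wrap or reset between the two readings; a negative difference would need to be discarded in practice.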