This chapter presents troubleshooting tips for the following types of problems:
- General troubleshooting tuning
- Client bottlenecks
- Server bottlenecks
- Network-related bottlenecks
This section (see Table 5-1) lists the actions to perform when you encounter a tuning problem.
Table 5-1 General Troubleshooting Tuning Problems and Actions to Perform
| Command/Tool | Command Output/Result | Action |
|---|---|---|
| netstat -i | (Collis + Ierrs + Oerrs) / (Ipkts + Opkts) > 2% | Check the Ethernet hardware. |
| netstat -i | Collis/Opkts > 10% | Add an Ethernet interface and distribute the client load. |
| netstat -i | Ierrs/Ipkts > 25% | The host may be dropping packets, causing a high input error rate. To compensate for bandwidth-limited network hardware, reduce the packet size: set the read buffer size (rsize) and/or the write buffer size (wsize) to 2048, either with the mount command or in the /etc/vfstab file. See "To Check the Network" in Chapter 3, Analyzing NFS Performance. |
| nfsstat -s | readlink > 10% | Replace symbolic links with mount points. |
| nfsstat -s | writes > 5% | Install a Prestoserve NFS accelerator (SBus card or NVRAM-NVSIMM) for peak performance. See "Prestoserve NFS Accelerator" in Chapter 4, Configuring the Server and the Client to Maximize NFS Performance. |
| nfsstat -s | There are any badcalls. | The network may be overloaded. Identify an overloaded network using network interface statistics. |
| nfsstat -s | getattr > 40% | Increase the client attribute cache using the actimeo mount option. Make sure the DNLC and inode caches are large. Use vmstat -s to determine the hit rate (cache hits) for the DNLC and, if needed, increase ncsize in the /etc/system file. See "Directory Name Lookup Cache (DNLC)" in Chapter 4, Configuring the Server and the Client to Maximize NFS Performance. |
| vmstat -s | Hit rate (cache hits) < 90% | Increase ncsize in the /etc/system file. |
| Ethernet monitor, for example: SunNet Manager, SharpShooter, NetMetrix | Load > 35% | Add an Ethernet interface and distribute the client load. |
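The netstat -i ratio checks in Table 5-1 can be scripted so the thresholds are applied consistently. The sketch below is a minimal illustration, assuming a single le0 counter line in netstat -i column order (Ipkts, Ierrs, Opkts, Oerrs, Collis); the sample values are made up, not real output. On a live system you would pipe the output of netstat -i itself into the awk script instead of the sample string.

```shell
# Fabricated sample line in `netstat -i` column order:
# name mtu net address Ipkts Ierrs Opkts Oerrs Collis Queue
netstat_sample='le0 1500 sunnet hostA 193744 120 200459 0 9876 0'

report=$(echo "$netstat_sample" | awk '
  $1 == "le0" {
    ipkts = $5; ierrs = $6; opkts = $7; oerrs = $8; collis = $9
    # Table 5-1 ratios, expressed as percentages
    err_rate  = (collis + ierrs + oerrs) / (ipkts + opkts) * 100
    coll_rate = collis / opkts * 100
    printf "error rate: %.1f%%\n", err_rate
    printf "collision rate: %.1f%%\n", coll_rate
    if (err_rate > 2)   print "-> check the Ethernet hardware"
    if (coll_rate > 10) print "-> add an interface and distribute the client load"
  }')
echo "$report"
```

With these sample counters the aggregate error rate is about 2.5%, so only the first recommendation fires.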
This section (see Table 5-2) shows potential client bottlenecks and how to remedy them.
Table 5-2 Client Bottlenecks
| Symptom(s) | Command/Tool | Cause | Solution |
|---|---|---|---|
| NFS server hostname not responding, or slow response to commands when using NFS-mounted directories | nfsstat | User's path variable | List directories on local file systems first, critical directories on remote file systems second, and the rest of the remote file systems last. |
| NFS server hostname not responding, or slow response to commands when using NFS-mounted directories | nfsstat | Running an executable from an NFS-mounted file system | Copy the application locally (if it is used often). |
| NFS server hostname not responding; badxid > 5% of total calls and badxid = timeout | nfsstat -rc | Client times out before the server responds | Check for a server bottleneck. If the server's response time doesn't improve, increase the timeo parameter in the /etc/vfstab file of the clients. Try increasing timeo to 25, 50, 100, then 200 (tenths of a second). Wait one day between modifications and check whether the number of time-outs decreases. |
| badxid = 0 | nfsstat -rc | Slow network | Increase rsize and wsize in the /etc/vfstab file. Check interconnection devices (bridges, routers, gateways). |
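The badxid/timeout decision rule from the last two rows of Table 5-2 can be sketched as a small script. The counter values below are invented for illustration; on a real client you would take calls, badxid, and timeout from the output of nfsstat -rc.

```shell
# Fabricated RPC counters; read these from `nfsstat -rc` on a real client.
calls=10000 badxid=700 timeout=700

report=$(awk -v calls="$calls" -v badxid="$badxid" -v timeout="$timeout" '
  BEGIN {
    if (badxid == 0 && timeout > 0)
      # Replies never duplicated: the network is losing requests.
      print "badxid = 0: slow network -- increase rsize/wsize, check routers"
    else if (badxid > calls * 0.05 && badxid == timeout)
      # Server answered, but after the client gave up and retransmitted.
      print "badxid > 5% and badxid = timeout: server too slow -- increase timeo"
    else
      print "no clear retransmission pattern"
  }')
echo "$report"
```

Here badxid equals timeout and exceeds 5% of total calls, so the script points at the server rather than the network.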
This section (see Table 5-3) shows server bottlenecks and how to remedy them.
Table 5-3 Server Bottlenecks
| Symptom(s) | Command/Tool | Cause | Solution |
|---|---|---|---|
| NFS server hostname not responding | vmstat -s or iostat | Cache hit rate is < 90% | Adjust the suggested parameters for the DNLC, then rerun the workload to see whether the symptom is gone. If not, reset the DNLC parameters. Adjust the parameters for the buffer cache, then the inode cache, following the same procedure as for the DNLC. |
| NFS server hostname not responding | netstat -m or nfsstat | Server not keeping up with the request arrival rate | Check the network. If the problem is not the network, add an appropriate Prestoserve NFS accelerator or upgrade the server. |
| High I/O wait time or CPU idle time; slow disk access times, or NFS server hostname not responding | iostat -x | I/O load not balanced across disks; the svc_t value is greater than 40 ms | Take a large sample (about two weeks). Balance the load across disks, adding disks as necessary. Add a Prestoserve NFS accelerator for synchronous writes. To reduce disk and network traffic, use tmpfs for /tmp on both the server and the clients. Measure system cache efficiencies. |
| Slow response when accessing remote files | netstat -s or snoop | Ethernet interface dropping packets | If retransmissions are indicated, increase the buffer size. For information on how to use snoop, see "snoop" in Appendix A, Using NFS Performance-Monitoring and Benchmarking Tools. |
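As a rough illustration of the svc_t check in Table 5-3, the sketch below scans sample data rows shaped like iostat -x output (device name in column 1, average service time svc_t in column 8) and flags any disk over the 40 ms threshold. The device names and numbers are fabricated, and the column position is an assumption about the iostat -x layout; verify it against your system's output before relying on it.

```shell
# Fabricated rows mimicking `iostat -x` data lines:
# device r/s w/s Kr/s Kw/s wait actv svc_t %w %b
iostat_sample='sd0 1.2 3.4 17.1 40.2 0.0 0.1 12.6 0 4
sd1 9.8 22.0 150.3 310.7 0.0 1.9 58.4 1 37'

report=$(echo "$iostat_sample" | awk '
  # Column 8 is assumed to hold svc_t (average service time in ms).
  $8 > 40 { printf "%s: svc_t %.1f ms -- rebalance load onto other disks\n", $1, $8 }')
echo "$report"
```

Only sd1 crosses the threshold in this sample, so it alone is flagged for rebalancing.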
This section (see Table 5-4) shows network-related bottlenecks and how to remedy them.
Table 5-4 Network-Related Bottlenecks
| Symptoms | Command/Tool | Cause | Solution |
|---|---|---|---|
| Poor response time when accessing directories mounted on different subnets, or NFS server hostname not responding | netstat -rs | NFS requests being routed | Keep clients on the subnet directly connected to the server. |
| Poor response time when accessing directories mounted on different subnets, or NFS server hostname not responding | netstat -s shows incomplete or bad headers, bad data length fields, or bad checksums | Network problems | Check the network hardware. |
| Poor response time when accessing directories mounted on different subnets, or NFS server hostname not responding; the sum of input and output packets per second for an interface is over 600 | netstat -i | Network overloaded | The network segment is very busy. If this is a recurring problem, consider adding another (le) network interface. |
| Network interface collisions are over 120 per second | netstat -i | Network overloaded | Reduce the number of machines on the network or check the network hardware. |
| Poor response time when accessing directories mounted on different subnets, or NFS server hostname not responding | netstat -i | High packet collision rate (Collis/Opkts > 0.10) | If packets are corrupted, the cause may be a corrupted MUX box; use the Network General Sniffer product or another protocol analyzer to find the cause. Check for an overloaded network; if there are too many nodes, create another subnet. Check the network hardware: the problem could be a bad tap, transceiver, or hub on 10BASE-T. Check cable length and termination. |
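The per-second thresholds in Table 5-4 (over 600 packets per second, over 120 collisions per second) refer to rates, while netstat -i reports cumulative counters, so you need two snapshots separated by a known interval. The sketch below, using invented counter values, shows the arithmetic; on a live system you would capture the two sets of counters from netstat -i before and after sleeping for the interval.

```shell
# Fabricated cumulative counters from two `netstat -i` snapshots
# taken $interval seconds apart.
interval=10
ipkts1=100000 opkts1=80000 collis1=2000
ipkts2=107500 opkts2=86000 collis2=3500

report=$(awk -v dt="$interval" \
    -v di=$((ipkts2 - ipkts1)) -v dout=$((opkts2 - opkts1)) \
    -v dc=$((collis2 - collis1)) '
  BEGIN {
    pps = (di + dout) / dt   # input + output packets per second
    cps = dc / dt            # collisions per second
    printf "packets/s: %.0f, collisions/s: %.0f\n", pps, cps
    if (pps > 600) print "-> segment overloaded: consider another le interface"
    if (cps > 120) print "-> collisions high: reduce machines or check hardware"
  }')
echo "$report"
```

With these sample deltas both thresholds are exceeded, so both Table 5-4 remedies apply.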