Chapter 5 Troubleshooting

Troubleshooting Tools

This chapter presents troubleshooting tips for the following types of problems:

General tuning problems
Client bottlenecks
Server bottlenecks
Network-related bottlenecks

General Troubleshooting Tuning Tips

This section (see Table 5-1) lists the actions to perform when you encounter a tuning problem.

Table 5-1 General Troubleshooting Tuning Problems and Actions to Perform


Command/Tool	Command Output/Result	Action
`netstat -i`	`(Collis + Ierrs + Oerrs) / (Ipkts + Opkts)` > `2%`	Check the Ethernet hardware.
`netstat -i`	`Collis/Opkts` > `10%`	Add an Ethernet interface and distribute the client load.
`netstat -i`	`Ierrs/Ipkts` > `25%`	The host may be dropping packets, causing a high input error rate. To compensate for bandwidth-limited network hardware, reduce the packet size: set the read buffer size (`rsize`), the write buffer size (`wsize`), or both to 2048 when using `mount` or in the `/etc/vfstab` file. See "To Check the Network" in Chapter 3, Analyzing NFS Performance.
`nfsstat -s`	`readlink` > `10%`	Replace symbolic links with mount points.
`nfsstat -s`	`writes` > `5%`	Install a Prestoserve NFS accelerator (SBus card or NVRAM-NVSIMM) for peak performance. See "Prestoserve NFS Accelerator" in Chapter 4, Configuring the Server and the Client to Maximize NFS Performance.
`nfsstat -s`	`badcalls` > 0	The network may be overloaded. Identify an overloaded network using network interface statistics.
`nfsstat -s`	`getattr` > `40%`	Increase the client attribute cache using the `actimeo` option. Make sure the DNLC and inode caches are large. Use `vmstat -s` to determine the percent hit rate (cache hits) for the DNLC and, if needed, increase `ncsize` in the `/etc/system` file. See "Directory Name Lookup Cache (DNLC)" in Chapter 4, Configuring the Server and the Client to Maximize NFS Performance.
`vmstat -s`	Hit rate (cache hits) < 90%	Increase `ncsize` in the `/etc/system` file.
Ethernet monitor, for example: SunNet Manager SharpShooter, NetMetrix	Load > 35%	Add an Ethernet interface and distribute client load.
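
The ratio checks in Table 5-1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the argument names, and the `NFS_OP_LIMITS` dictionary are hypothetical, the counters are assumed to have been read by hand from `netstat -i` and `nfsstat -s` output, and the first ratio is read here as (Collis + Ierrs + Oerrs) / (Ipkts + Opkts).

```python
def netstat_actions(ipkts, ierrs, opkts, oerrs, collis):
    """Return the Table 5-1 actions triggered by one interface's netstat -i counters."""
    actions = []
    total = ipkts + opkts
    # (Collis + Ierrs + Oerrs) / (Ipkts + Opkts) > 2%
    if total > 0 and (collis + ierrs + oerrs) / total > 0.02:
        actions.append("Check the Ethernet hardware.")
    # Collis / Opkts > 10%
    if opkts > 0 and collis / opkts > 0.10:
        actions.append("Add an Ethernet interface and distribute the client load.")
    # Ierrs / Ipkts > 25%
    if ipkts > 0 and ierrs / ipkts > 0.25:
        actions.append("Reduce the packet size (rsize/wsize of 2048).")
    return actions


# Share-of-total-calls limits taken from the nfsstat -s rows of Table 5-1.
# The keys are assumed to match the per-operation labels in nfsstat output.
NFS_OP_LIMITS = {"readlink": 0.10, "write": 0.05, "getattr": 0.40}


def nfs_ops_over_limit(op_counts):
    """Return the operations whose share of all NFS calls exceeds its limit.

    op_counts maps an operation name to its call count.
    """
    total = sum(op_counts.values())
    if total == 0:
        return []
    return [op for op, limit in NFS_OP_LIMITS.items()
            if op_counts.get(op, 0) / total > limit]
```

For example, an interface with 1000 input packets, 1000 output packets, and 150 collisions trips both the 2% and the 10% checks, so `netstat_actions(1000, 0, 1000, 0, 150)` returns the first two actions.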

Client Bottlenecks

This section (see Table 5-2) shows potential client bottlenecks and how to remedy them.

Table 5-2 Client Bottlenecks


Symptom(s)	Command/Tool	Cause	Solution
NFS server hostname not responding or slow response to commands when using NFS-mounted directories	`nfsstat`	User's path variable	List directories on local file systems first, critical directories on remote file systems second, and then the rest of the remote file systems.
NFS server hostname not responding or slow response to commands when using NFS-mounted directories	`nfsstat`	Running executable from an NFS-mounted file system	Copy the application locally (if used often).
NFS server hostname not responding; `badxid` > 5% of total calls and `badxid` = `timeout`	`nfsstat -rc`	Client times out before server responds	Check for a server bottleneck. If the server's response time does not improve, increase the `timeo` parameter in the `/etc/vfstab` file of the clients. Try increasing `timeo` to 25, 50, 100, and then 200 (tenths of a second). Wait one day between modifications and check whether the number of time-outs decreases.
`badxid` = 0	`nfsstat -rc`	Slow network	Increase `rsize` and `wsize` in the `/etc/vfstab` file. Check interconnection devices (bridges, routers, gateways).
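
The two `nfsstat -rc` rows of Table 5-2 amount to a small decision rule, sketched below in Python. The function names and the `tolerance` parameter are hypothetical: the table says `badxid` = `timeout`, and treating values within 10% of each other as equal is an assumption made here, as is requiring at least one time-out before `badxid` = 0 is read as a slow network.

```python
TIMEO_STEPS = [25, 50, 100, 200]  # tenths of a second, as listed in Table 5-2


def diagnose_client(calls, timeout, badxid, tolerance=0.10):
    """Classify client RPC counters read from nfsstat -rc.

    Returns "server", "network", or "inconclusive".
    """
    if calls <= 0:
        return "inconclusive"
    roughly_equal = abs(badxid - timeout) <= tolerance * max(timeout, 1)
    # badxid > 5% of total calls and badxid = timeout: server is slow to respond.
    if badxid / calls > 0.05 and roughly_equal:
        return "server"
    # Time-outs with no duplicate replies: requests or replies are being lost.
    if timeout > 0 and badxid == 0:
        return "network"
    return "inconclusive"


def next_timeo(current):
    """Return the next timeo value to try, or None once 200 has been reached."""
    for step in TIMEO_STEPS:
        if step > current:
            return step
    return None
```

A "server" result points to the server bottleneck check and, failing that, to the next `timeo` step; a "network" result points to `rsize`, `wsize`, and the interconnection devices.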

Server Bottlenecks

This section (see Table 5-3) shows server bottlenecks and how to remedy them.

Table 5-3 Server Bottlenecks


Symptom(s)	Command/Tool	Cause	Solution
NFS server hostname not responding	`vmstat -s` or `iostat`	Cache hit rate is < 90%	Adjust the suggested parameters for the DNLC, then check again to see if the symptom is gone. If not, reset the parameters for the DNLC. Adjust the parameters for the buffer cache, then the inode cache, following the same procedure as for the DNLC.
NFS server hostname not responding	`netstat -m` or `nfsstat`	Server not keeping up with request arrival rate	Check the network. If the problem is not the network, add appropriate Prestoserve NFS accelerator, or upgrade the server.
High I/O wait time or CPU idle time; slow disk access times or NFS server hostname not responding	`iostat -x`	I/O load not balanced across disks; the `svc_t` value is greater than 40 ms	Take a large sample (about 2 weeks). Balance the load across disks; add disks as necessary. Add a Prestoserve NFS accelerator for synchronous writes. To reduce disk and network traffic, use `tmpfs` for `/tmp` on both server and clients. Measure system cache efficiencies.
Slow response when accessing remote files	`netstat -s` or `snoop`	Ethernet interface dropping packets	If retransmissions are indicated, increase buffer size. For information on how to use `snoop`, see "`snoop`" in Appendix A, Using NFS Performance-Monitoring and Benchmarking Tools.
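
As a rough illustration of the `svc_t` and cache hit-rate checks in Table 5-3, the Python sketch below flags disks and caches that cross the stated limits. The function names and the input layout are hypothetical, the readings are assumed to have been collected by hand from repeated `iostat -x` and `vmstat -s` runs, and averaging `svc_t` over the sample is an assumption rather than something the table prescribes.

```python
from statistics import mean


def overloaded_disks(samples, limit_ms=40.0):
    """Return the disks whose average svc_t exceeds limit_ms.

    samples maps a disk name to a list of svc_t readings in milliseconds.
    Disks with no readings are skipped.
    """
    return sorted(disk for disk, readings in samples.items()
                  if readings and mean(readings) > limit_ms)


def cache_needs_tuning(hit_rate_pct, floor_pct=90.0):
    """Return True when a cache hit rate falls below the 90% floor."""
    return hit_rate_pct < floor_pct
```

Disks returned by `overloaded_disks` are candidates for load balancing; the longer the sample, the less a short burst distorts the average.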

Network Bottlenecks

This section (see Table 5-4) shows network-related bottlenecks and how to remedy them.

Table 5-4 Network-Related Bottlenecks


Symptom(s)	Command/Tool	Cause	Solution
Poor response time when accessing directories mounted on different subnets or NFS server hostname not responding	`netstat -rs`	NFS requests being routed	Keep clients on the subnet directly connected to server.
Poor response time when accessing directories mounted on different subnets or NFS server hostname not responding	`netstat -s` shows incomplete or bad headers, bad data length fields, bad checksums.	Network problems	Check the network hardware.
Poor response time when accessing directories mounted on different subnets or NFS server hostname not responding; sum of input and output packets per second for an interface is over 600 per second	`netstat -i`	Network overloaded	The network segment is very busy. If this is a recurring problem, consider adding another (`le`) network interface.
Network interface collisions are over 120 per second	`netstat -i`	Network overloaded	Reduce the number of machines on the network or check the network hardware.
Poor response time when accessing directories mounted on different subnets or NFS server hostname not responding	`netstat -i`	High packet collision rate (`Collis/Opkts` > `10%`)	- If packets are corrupted, it may be due to a corrupted MUX box; use the Network General Sniffer product or another protocol analyzer to find the cause. - Check for an overloaded network. If there are too many nodes, create another subnet. - Check the network hardware; the cause could be a bad tap, transceiver, or hub on 10BASE-T. Check cable length and termination.
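
The per-second limits in Table 5-4 require two readings of the `netstat -i` counters taken a known interval apart. The Python sketch below shows one way to turn those readings into the three checks. The function name, the dictionary keys, and the result keys are hypothetical, and computing the collision ratio from the difference between the two samples (rather than from the cumulative counters) is an assumption.

```python
def network_findings(first, second, interval_s):
    """Compare two netstat -i samples taken interval_s seconds apart.

    Each sample is a dict with cumulative "ipkts", "opkts", and "collis"
    counters for one interface.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    d_in = second["ipkts"] - first["ipkts"]
    d_out = second["opkts"] - first["opkts"]
    d_coll = second["collis"] - first["collis"]
    return {
        # Sum of input and output packets over 600 per second.
        "packets_per_s_over_600": (d_in + d_out) / interval_s > 600,
        # Collisions over 120 per second.
        "collisions_per_s_over_120": d_coll / interval_s > 120,
        # Collis/Opkts over 10% during the interval.
        "collision_ratio_over_10pct": d_out > 0 and d_coll / d_out > 0.10,
    }
```

Any check that comes back true maps to the matching row of the table: a high packet rate suggests adding an interface, and a high collision rate or ratio suggests reducing the number of machines on the segment or checking the hardware.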