
Identifying Capacity Limitations: An Exercise

A capacity limitation often manifests itself as a performance issue. To differentiate between the two, performance can be defined as “how fast the system is going,” while capacity is “the maximum performance of the system or of an individual component.”

If your CPU utilization is very low (at or around 10%), try to determine whether the disk controllers are fully loaded and whether input/output is the cause. To determine whether your problem is disk related, use the iostat tool as follows:


# iostat -xnMCz -T d 10
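
In this command, -x reports extended per-device statistics, -n displays device names in descriptive cXtYdZ format, -M reports throughput in Mbytes per second, -C adds per-controller aggregates, -z suppresses lines whose statistics are all zero, -T d prints a timestamp with each report, and the trailing 10 is the sampling interval in seconds.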

For example, consider a directory service that is available on the Internet. Customers submit searches from multiple sites, and the Service Level Agreement (SLA) allows no more than 5% of requests to have response times over 3 seconds. Currently, 15% of requests take more than 3 seconds, which puts the business in a penalty situation. The system is a 6800 with twelve 900-MHz CPUs.

The vmstat output looks as follows:
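
The invocation used for this capture is not shown in the guide; an interval report such as the following, where the 10-second interval is an assumption, produces this type of report:

# vmstat 10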


procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr m0 m1 m1 m1   in   sy   cs us sy id
 0 2 0 8948920 5015176 374 642 10 12 13 0 2 1  2  1  2  132 2694 1315 14  3 83
 0 19 0 4089432 188224 466 474 50 276 278 0 55 5 5 4 3 7033 6191 2198 19  4 77
 0 19 0 4089232 188304 430 529 91 211 211 0 34 8 6 5 4 6956 9611 2377 16  5 79
 0 18 0 4085680 188168 556 758 96 218 217 0 40 12 4 6 4 6979 7659 2354 18 6 77
 0 18 0 4077656 188128 520 501 75 217 216 0 46 9 3 5 2 7044 8044 2188 17  5 78

We look at the three rightmost columns, us (user), sy (system), and id (idle), which show that well over 50% of the CPU is idle and available to absorb additional load. One way to detect a memory problem is to look at the sr, or scan rate, column of the vmstat output. If the page scanner ever starts running, that is, if the scan rate climbs above 0, then we need to look more closely at the memory system. The odd part of this display is that the blocked queue on the left has 18 or 19 processes in it while the run queue is empty. This suggests that processes are blocking somewhere in Solaris rather than using all of the available CPU.
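
To confirm that the idle time is spread across all 12 processors rather than a few CPUs being saturated while the rest sit idle, the per-processor view from mpstat can also be checked; the 10-second interval below is an assumption:

# mpstat 10

If every CPU shows substantial idle time while the blocked queue stays high, the bottleneck is almost certainly outside the CPUs, which is what the I/O analysis that follows confirms.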

Next, we look at the I/O subsystem. Starting with Solaris 8, the iostat command has a -C switch that aggregates the I/O statistics at the controller level. We run the iostat command as follows:


#  iostat -xnMCz -T d
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  396.4   10.7    6.6    0.1  0.0 20.3    0.0   49.9   0 199 c1
  400.2    8.8    6.7    0.0  0.0 20.2    0.0   49.4   0 199 c3
  199.3    6.0    3.3    0.0  0.0 10.1    0.0   49.4   0  99 c1t0d0
  197.1    4.7    3.3    0.0  0.0 10.2    0.0   50.4   0 100 c1t1d0
  198.2    3.7    3.4    0.0  0.0  9.4    0.0   46.3   0  99 c3t0d0
  202.0    5.1    3.3    0.0  0.0 10.8    0.0   52.4   0 100 c3t1d0

On controller 1 we are doing 396 reads per second, and on controller 3 we are doing 400 reads per second. On the right side of the output, the %b column shows each controller as almost 200% busy; because -C simply aggregates the per-disk statistics, this reflects two disks per controller that are each about 100% busy. Each individual disk is therefore servicing almost 200 reads per second while being reported as 100% busy. That leads us to a rule of thumb: individual disks perform at approximately 150 I/Os per second. This rule does not apply to LUNs or LDEVs presented by large disk arrays. Our examination of the numbers therefore leads us to suggest adding two disks to each controller and laying the data out again across the larger set of spindles.
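
A rough calculation shows why two additional disks per controller should bring the load back within the rule of thumb, assuming the reads spread evenly across the spindles:

  Today:     ~400 reads/sec per controller / 2 disks = ~200 reads/sec per disk (above the ~150 I/Os per second guideline)
  Proposed:  ~400 reads/sec per controller / 4 disks = ~100 reads/sec per disk (comfortably below the guideline)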

In this exercise we looked at all of the numbers and attempted to locate the precise nature of the problem. Do not assume that adding CPUs and memory will fix every performance problem. In this case, the search programs were exceeding the capacity of the disk drives, which manifested itself as a performance problem: transactions with extreme response times. All of those CPUs were simply waiting on the disk drives.