Sun Java System Access Manager 7.1 Performance Tuning and Troubleshooting Guide

Conducting Baseline Authentication Tests

You will need the following test scripts to generate the basic authentication workload:

For all tests, randomly pick user IDs from a large user pool of at least 100,000 users and up to one million users. The load test script should first log the user in, then either log the user out or simply drop the session and let the session time out. A good practice is to remove all static page and graphics requests from the scripts. This makes the workload cleaner and well defined, and the results easier to interpret.

The test scripts should have zero think time to put the maximum workload on the system. The tests are not focused on response times in this phase. The baseline tests should determine the maximum system capacity based on maximum throughput. The number of test users, sometimes called test threads, is usually a few hundred. The exact number is unimportant. What is important is to get as close to 100% Access Manager CPU usage as possible while keeping the average response time at 500 ms or more. A minimum of 500 ms is used to minimize the impact of relatively small network latencies. If the average response time is too low (for example, 50 ms), a large portion of it is likely to be caused by network latency, and the data will be contaminated with unnecessary noise.

Determine the Number of Test Users

In the following example baseline test, 200 users per Access Manager instance are used. For your tests, you could use 200 users for one Access Manager instance, 400 users for two Access Manager instances, 600 users for three Access Manager instances, and so forth. If the workload is too low, start with 100 users, and increase the number in increments of 100 to find the minimum number. Once you have determined the minimum number of test users per Access Manager instance, use this number for the rest of the tests to make the results more comparable.
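One way to estimate a starting user count, assuming a closed workload with zero think time as described earlier, is Little's Law: the number of concurrent users equals throughput multiplied by average response time. The sketch below applies that relationship. The helper name `min_users` and the throughput figure in the usage note are illustrative assumptions, not values from this guide.

```shell
# min_users is a hypothetical helper, not part of Access Manager.
# It estimates the smallest number of zero-think-time test users that
# keeps the average response time at or above a target, using
# Little's Law (users = throughput x response time).
# $1 = measured peak throughput in requests per second
# $2 = target average response time in milliseconds
min_users() {
    # Integer ceiling of ($1 * $2) / 1000
    echo $(( ($1 * $2 + 999) / 1000 ))
}
```

For example, `min_users 400 500` prints 200: an instance that saturates at an assumed 400 requests per second needs about 200 test users to hold the average response time at 500 ms. Treat the result as a starting point and confirm it by watching CPU usage during the test.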

Determine the System Steady State

In the example baseline tests, the performance data is captured at the steady state. The system can take anywhere from 5 to 15 minutes to reach its steady state. Watch the tests. The following indicators will settle into predictable patterns when the system has reached its steady state:

The following are examples of capturing transactions by category on different systems.

On each Access Manager host, parse the container access log to gather the number of different transactions received. For example, if Access Manager is deployed on Sun Web Server, use the following command to obtain the result:

cd /opt/SUNwbsvr/https-<am1host>/logs
cp access a; grep Login a | wc; grep naming a | wc; grep session a | wc;
grep policy a | wc; grep jaxrpc a | wc; grep notifi a | wc;
grep Logout a | wc; wc a;
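The same counts can be produced with a label on each line, which makes the report easier to read when several runs are compared. The following is a minimal sketch. `count_requests` is a hypothetical helper name, and it assumes a POSIX shell with `grep`, `wc`, and `tr` available.

```shell
# count_requests is a hypothetical helper: it prints one labeled line
# count per pattern for the given access log file.
# $1 = path to a copy of the access log; remaining arguments = grep patterns
count_requests() {
    log=$1
    shift
    for pattern in "$@"; do
        # grep -c reports the number of matching lines, the same figure
        # as the first column of "grep pattern file | wc"
        printf '%-8s %s\n' "$pattern" "$(grep -c -- "$pattern" "$log")"
    done
    # Total number of lines (requests) in the log
    printf '%-8s %s\n' total "$(wc -l < "$log" | tr -d ' ')"
}
```

After copying the log as in the command above (`cp access a`), a call such as `count_requests a Login naming session policy jaxrpc notifi Logout` reports the same categories in the same order, followed by the total.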

On each LDAP server, parse the LDAP access log to gather the number of different transactions received. For example, use the following command to obtain the result:

cd <slapd-xxx>/logs
cp access a; grep BIND a | grep "uid=u" | wc; grep BIND a | wc;
grep UNBIND a | wc; grep SRCH a | wc; grep RESULT a | wc; wc a;

Conduct the Baseline Test

In this example, the baseline test follows this sequence:

  1. Log in and log out on each individual Access Manager instance directly.

  2. Log in and time out on each individual Access Manager instance directly.

  3. Log in and log out using a load balancer with one Access Manager server.

  4. Log in and time out using a load balancer with one Access Manager server.

  5. Log in and log out using a load balancer with two Access Manager servers.

  6. Log in and time out using a load balancer with two Access Manager servers.

If you have two Access Manager instances behind a load balancer, the above tests actually involve at least ten individual test runs: two test runs for each of tests 1 through 4 (one per instance), one test run for test 5, and one test run for test 6.

Note –

To perform any login and timeout test, you must reduce the maximum session timeout value below the default value. For example, change the default of 30 minutes to one minute. Otherwise, at maximum throughput, so many sessions linger on the system for so long that memory is exhausted quickly.

Analyze the Baseline Test Results

The data you capture will help you identify possible trouble spots in the system. The following are examples of things to look for in the baseline test results.

Compare the maximum authentication throughput of individual Access Manager instances with no load balancer in place.

If identical hardware is used in the test, the number of authentication transactions per second should be roughly the same for each Access Manager instance. If there is a large variance in throughput, investigate why one server behaves differently than another.

Compare the maximum authentication throughput of individual Access Manager instances that have a load balancer in front of them.

Using a load balancer should not cause a decrease in the maximum throughput. In the example above, test 3 should yield results similar to test 1 results, and test 4 should yield results similar to test 2 results. If the maximum throughput numbers go down when a load balancer is added to the system, investigate why the load balancer introduces significant overhead. For example, you could conduct a further test with static pages through the load balancer.

Verify that the maximum throughput on a load balancer with two Access Manager instances is roughly twice the throughput on a load balancer with one Access Manager instance behind it.

If the throughput numbers do not increase proportionately with the number of Access Manager instances, the most likely cause is that sticky load balancing is not configured properly. Users logged in to one Access Manager instance are being redirected to another instance for logout. You must correct the load balancer configuration. For related information, see Configuring the Access Manager Load Balancer in Deployment Example 1: Access Manager 7.1 Load Balancing, Distributed Authentication UI, and Session Failover.

Verify that for each test, the Access Manager transaction counts report indicates no unexpected Access Manager requests.

For example, if you perform the Access Manager login and logout test, your test results may look similar to this:

    1581   15810  139128
       0       0       0
       0       0       0
       0       0       0
       0       0       0
       0       0       0
    1609   16090  146419
    3198   31972  286043 a

This output indicates three important pieces of information. First, the system processed 1581 login requests and 1609 logout requests. The two counts are roughly equal, as expected, because each login is followed by one logout. Second, all other types of Access Manager requests were absent. This is also expected. Last, the total number of requests received, 3198, is roughly the sum of 1581 and 1609. This indicates that there are essentially no unexpected requests beyond those the command greps for.
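The comparison between the per-category counts and the total can be automated. The following sketch reads the `wc` output on standard input and checks that the categories add up to the total within a tolerance. `check_totals` is a hypothetical helper name, and the default tolerance of 1% is an assumption, not a figure from this guide.

```shell
# check_totals is a hypothetical helper. It reads the wc output produced
# by the access log commands on standard input: one line per request
# category, with the total for the whole log on the last line.
# $1 = allowed difference as a percentage of the total (assumed default: 1)
# Exit status is 0 when the categories account for the total, 1 otherwise.
check_totals() {
    awk -v tol="${1:-1}" '
        # Add each line to the sum only once a later line has been seen,
        # so that the last line (the total) is held back in prev
        NF >= 3 { if (seen) sum += prev; prev = $1; seen = 1 }
        END {
            diff = prev - sum
            if (diff < 0) diff = -diff
            printf "categories=%d total=%d difference=%d\n", sum, prev, diff
            exit ((diff * 100 > tol * prev) ? 1 : 0)
        }'
}
```

Fed the example output above, the function prints `categories=3190 total=3198 difference=8` and returns success, because 8 requests is well under 1% of 3198.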

Troubleshoot the Problems You Find

A common problem is that when two Access Manager instances are both running, you see not only login and logout requests, but session requests as well. The test results may look similar to this:

    3159   31590  277992
       0       0       0
    5096   50960  486676
       0       0       0
       0       0       0
    1305   13050  127890
    3085   30850  280735
   12664  126621 1174471 a

In this example, for each logout request, there are now extra session and notification requests. The total number of requests does add up, which means there are no other unexpected requests. The reason for the session requests is that sticky load balancing is not working properly. A user logs in on one Access Manager instance, then is sent to another Access Manager instance for logout. The second Access Manager instance must send an extra session request to the originating Access Manager instance to perform the logout. The extra session requests increase the system workload and reduce the maximum throughput the system can provide. In this case, the two Access Manager instances cannot double the throughput of a single Access Manager instance. Instead, there is a mere 20% increase. You can address the problem at this point by reconfiguring the load balancer. This is an example of a problem that should have been caught during the modular verification steps in the system construction phase.
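A quick check for this symptom can be scripted so that it runs after every login and logout test. The sketch below is one possible approach. `check_sticky` is a hypothetical helper name, and it assumes the same `session` and `Logout` grep patterns used in the access log commands earlier.

```shell
# check_sticky is a hypothetical helper for login/logout test runs.
# It counts session requests in a copy of the container access log and
# returns a non-zero status when any are present.
# $1 = path to the access log copy
check_sticky() {
    # grep -c exits non-zero when the count is 0, so guard with || true
    sessions=$(grep -c session "$1" || true)
    logouts=$(grep -c Logout "$1" || true)
    echo "session=$sessions Logout=$logouts"
    # Any session request in this test suggests that a user was routed
    # to a different instance than the one that created the session
    [ "$sessions" -eq 0 ]
}
```

For example, `check_sticky a || echo "review the load balancer configuration"` prints the warning only when session requests appear in the log copy `a`.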

Run Extended Tests for System Stability

Once the system has passed all the basic authentication tests, it is a good practice to put the system under the test workload for an extended period of time to test its stability. You can use test 6 and let it run for several hours. You may need to set up automated scripts that periodically remove excess access logs so that they do not fill up the file systems.
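A log cleanup script for an extended run might look like the following sketch. `prune_logs` is a hypothetical helper name, and the name pattern of rotated log files is an assumption that must be adjusted to match how your container names its rotated logs. The sketch also assumes file names without spaces or newlines. It removes only files that match the pattern, so choose a pattern that excludes the live log file the server is writing to.

```shell
# prune_logs is a hypothetical helper that keeps only the newest rotated
# log files in a directory and removes the rest.
# $1 = log directory
# $2 = name pattern of rotated log files (assumption, e.g. 'access.*')
# $3 = number of newest matching files to keep
prune_logs() {
    dir=$1 pattern=$2 keep=$3
    # List matching files newest first, skip the first $keep entries,
    # and remove everything after them
    ls -t "$dir"/$pattern 2>/dev/null | tail -n +"$((keep + 1))" |
        while IFS= read -r f; do
            rm -f -- "$f"
        done
}
```

The function can be called from a scheduled job or from a simple loop during the test, for example `while true; do prune_logs <log-directory> 'access.*' 5; sleep 600; done`, where the directory, pattern, retention count, and interval are all placeholders to adapt.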