Sun Java System Access Manager 7.1 Performance Tuning and Troubleshooting Guide

Conducting Baseline Authorization Tests

You will need the following test scripts to generate the basic authorization workload:

In this example, the baseline authorization test follows this sequence:

It is good practice to set up a single URL policy that allows all authenticated users to access the wildcard URL protected by the policy agent. This setup keeps the baseline tests simple.

For all tests, randomly pick user IDs from a large user pool of at least 100,000 and up to one million users. The load test scripts log the user in, access a protected static page twice, and then log the user out. It is good practice to remove all other static page or GIF requests from the scripts. This makes the workload cleaner and better defined, and the results easier to interpret.

The test scripts should have zero think time to put the maximum workload on the system. The tests are not focused on response times in this phase. The baseline tests should determine the maximum system capacity based on maximum throughput. The number of test users, sometimes called test threads, is usually a few hundred. The exact number is unimportant. What is important is to bring Access Manager CPU usage as close to 100% as possible while keeping the average response time at 500 milliseconds or more. A well-executed test indicates the maximum system capacity while minimizing the impact of network latencies.

Determine the Number of Test Users

A typical starting point is 200 test users per Access Manager instance. For example, you could use 200 users for one Access Manager instance, 400 users for two Access Manager instances, 600 users for three Access Manager instances, and so on. If the workload is too low, start with 100 users and increase the number in 100-user increments to find the minimum number required. Once you have determined the number of test users per Access Manager instance, continue to use this number for the rest of the tests so that the results are comparable. If you have two Access Manager instances behind a load balancer, the above tests actually involve at least five individual test runs: two runs each for tests 1 and 2, and one run for test 3.
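The sizing arithmetic above can be written down as a small Python sketch. The function names are hypothetical, and the general run-count formula is an assumption extrapolated from the two-instance case in the text (one run per instance for each of tests 1 and 2, plus one combined run for test 3).

```python
USERS_PER_INSTANCE = 200  # typical figure quoted in the text


def users_for(instances, users_per_instance=USERS_PER_INSTANCE):
    """Total test users (test threads) for a given number of instances."""
    return instances * users_per_instance


def baseline_run_count(instances):
    """Minimum number of test runs (assumed generalization of the text).

    Tests 1 and 2 are run once per instance; test 3 is run once
    against all instances together.
    """
    return 2 * instances + 1


print(users_for(1), users_for(2), users_for(3))  # 200 400 600
print(baseline_run_count(2))                     # 5
```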

Verify that for each test, the response time of the second protected resource access is significantly lower than the response time of the first protected page access. On the first access to a protected resource, the agent must perform uncached session validation and authorization, which requires communicating with the Access Manager servers. On the second access, the agent can use cached session validation and authorization and does not need to communicate with the Access Manager servers, so the second access tends to be significantly faster. It is common for the first page access to take about 1 second (this depends heavily on the number of test users), while the second page access takes less than 10 ms (this depends much less on the number of test users).

If the second page access is not as fast as it should be relative to the first page access, investigate why. Is the first page access relatively too fast? If so, increase the number of test users to increase the response time of the first page access. Is the agent machine undersized, so that no matter how much load you put on the system, the agent machine reaches full capacity before Access Manager does? In that case the agent machine, not Access Manager, is the bottleneck, and you can expect both the first and second page accesses to be slow while Access Manager responds quickly.
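As a rough illustration of this check, the following Python sketch compares the average response times of the first and second page accesses. The function names and the 10x threshold are assumptions chosen for illustration; the text only says the second access should be "significantly" faster.

```python
def cache_speedup(first_ms, second_ms):
    """How many times faster the second (cached) access is than the first."""
    return first_ms / second_ms


def second_access_looks_cached(first_ms, second_ms, min_speedup=10.0):
    """True if the second access is at least min_speedup times faster.

    min_speedup is an illustrative threshold, not a documented value.
    """
    return cache_speedup(first_ms, second_ms) >= min_speedup


# Figures from the text: about 1 second for the first access, 10 ms for the second.
print(cache_speedup(1000, 10))               # 100.0
print(second_access_looks_cached(1000, 10))  # True
```

If the check fails, the two questions above apply: either the first access is too fast (add test users) or the agent machine is the bottleneck.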

Analyze the Test Results

The data you capture will help you identify possible trouble spots in the system. The following are examples of things to look for in the baseline test results.

Compare the maximum authorization throughput of individual Access Manager instances with no load balancer in place.

If identical hardware is used in the test, the number of authorization transactions per second should be roughly the same for each Access Manager instance. If there is a large variance in throughput, investigate why one server behaves differently from another.
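One simple way to quantify "roughly the same" is the relative spread between the fastest and slowest instance. The sketch below is in Python; the instance names and throughput figures are made-up illustration values, and no acceptable spread is specified by this guide.

```python
def throughput_spread(tps_by_instance):
    """Relative spread: (max - min) / mean of per-instance throughput."""
    values = list(tps_by_instance.values())
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean


# Hypothetical measurements in authorization transactions per second.
measured = {"am1": 520.0, "am2": 505.0}
print(f"spread: {throughput_spread(measured):.1%}")
```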

Compare the maximum authorization throughput of individual Access Manager instances that have a load balancer in front of them.

Using a load balancer should not cause a decrease in the maximum throughput. In the example above, test 2 should yield results similar to test 1 results. If the maximum throughput numbers go down when a load balancer is added to the system, investigate why the load balancer introduces significant overhead. For example, you could conduct a further test with static pages through the load balancer.

Verify that the maximum throughput on a load balancer with two Access Manager instances is roughly twice the throughput on a load balancer with one Access Manager instance behind it.

If the throughput numbers do not increase proportionately with the number of Access Manager instances, sticky load balancing is most likely not configured properly: users logged in to one Access Manager instance are being redirected to another instance for logout. You must correct the load balancer configuration. When sticky load balancing is properly configured, each Access Manager instance serves requests independently, and the system scales nearly linearly. For related information, see Configuring the Access Manager Load Balancer in Deployment Example 1: Access Manager 7.1 Load Balancing, Distributed Authentication UI, and Session Failover.
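The proportionality check can be expressed as a scaling efficiency, where 1.0 means perfectly linear scaling. This is a minimal Python sketch; the function name and the throughput figures are hypothetical.

```python
def scaling_efficiency(single_tps, combined_tps, instances):
    """Measured combined throughput divided by ideal linear throughput."""
    return combined_tps / (single_tps * instances)


# Hypothetical figures: one instance at 500 tps, two instances at 950 tps.
print(scaling_efficiency(500.0, 950.0, 2))
```

A value well below 1.0 suggests checking the sticky load balancing configuration first.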

Verify that for each test, the Access Manager transaction counts report indicates no unexpected Access Manager requests.

For example, if you perform the Access Manager login and logout test, your test results should look similar to this:

    1079   10790   94952
    1032   10320   99072
    1044   10440  101268
    1064   10640  101080
       0       0       0
       0       0       0
    1066   10660   97006
    5312   53093  495052 a

This output indicates three pieces of information. First, the system processed 1079 login, 1032 naming, 1044 session, 1064 policy, and 1066 logout requests. These numbers are roughly equal: for each login, there is one naming call, one session call (to validate the user's session), one policy call (to authorize the user's access), and one logout. Second, all other types of Access Manager requests were absent, as expected. Lastly, the total number of requests received, 5312, is roughly the sum of the login, naming, session, policy, and logout requests. This indicates there are no unexpected requests that the grep patterns in the command did not match.
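The three checks described above can be automated. The Python sketch below uses the counts quoted in the text; the function name and the 10% tolerance are assumptions made for illustration.

```python
def check_transaction_counts(counts, total, tolerance=0.10):
    """Return a list of warnings for a login/logout test run.

    counts maps request type to request count and must include "login".
    tolerance is an illustrative threshold for "roughly equal".
    """
    warnings = []
    logins = counts["login"]
    # Each request type should occur about once per login.
    for name, value in counts.items():
        ratio = value / logins
        if abs(ratio - 1.0) > tolerance:
            warnings.append(f"{name}: {ratio:.2f} requests per login")
    # The total should be roughly the sum of the expected request types.
    accounted = sum(counts.values())
    if abs(total - accounted) > tolerance * total:
        warnings.append(f"{total - accounted} requests not accounted for")
    return warnings


baseline = {"login": 1079, "naming": 1032, "session": 1044,
            "policy": 1064, "logout": 1066}
print(check_transaction_counts(baseline, total=5312))  # []
```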

Troubleshoot Problems You Find

A common problem is that when two Access Manager instances are both running, the number of session requests exceeds the number of logins. For example, the test output may look similar to this:

    4075   40750  358600
    4167   41670  400032
   19945  199450 1913866
    3979   39790  381984
       0       0       0
    3033   30330  297234
    3946   39460  359086
   39194  391891 3713840 a

Note that for each login request, there are now about 5 session requests and 0.75 notifications. The total number of requests still adds up, which indicates there are no other unexpected requests. There are more session requests per login because sticky load balancing is not working properly. A user logged in on one Access Manager instance is sometimes sent to another Access Manager instance for session validation and logout. The second Access Manager instance must generate extra session and notification requests to the originating Access Manager instance to perform the request. The extra requests increase the system workload and reduce the maximum throughput the system can provide. In this case, two Access Manager instances cannot deliver double the throughput of a single instance. You can address the problem by reconfiguring the load balancer. The problem should have been caught during the modular verification steps in the system construction phase.
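The per-login ratios quoted above can be reproduced with a short Python sketch. The mapping of output rows to request types is an assumption based on the order used in the text (the all-zero row is omitted), and the function names and the 1.5 threshold are hypothetical.

```python
def requests_per_login(counts):
    """Requests of each type per login, rounded to two decimals."""
    logins = counts["login"]
    return {name: round(value / logins, 2) for name, value in counts.items()}


def sticky_balancing_suspect(counts, max_session_ratio=1.5):
    """Flag extra session traffic or any cross-instance notifications."""
    ratios = requests_per_login(counts)
    return (ratios["session"] > max_session_ratio
            or ratios.get("notification", 0) > 0)


two_instances = {"login": 4075, "naming": 4167, "session": 19945,
                 "policy": 3979, "notification": 3033, "logout": 3946}
print(requests_per_login(two_instances))
print(sticky_balancing_suspect(two_instances))  # True
```

With these counts the session ratio comes out close to 5 and the notification ratio close to 0.75, matching the figures in the text.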

Conduct Extended Stability Tests

Once you have passed all the basic authorization tests, it is a good idea to put the system under the workload for an extended period of time to test its stability. You can use test 3 and let it run for several hours. You may need to set up automated scripts that periodically remove excess access logs so that they do not fill up the file systems.
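A periodic cleanup script could look like the following Python sketch. It is only one possible approach: the function name is hypothetical, the log directory depends on your web container and is not specified here, and the age threshold is an assumption. Files that are still being written have a recent modification time and are therefore left alone.

```python
import os
import time


def remove_old_logs(log_dir, max_age_seconds, now=None):
    """Delete regular files in log_dir last modified more than
    max_age_seconds ago. Returns the sorted names of removed files."""
    now = time.time() if now is None else now
    removed = []
    for entry in os.scandir(log_dir):
        # Skip directories and symbolic links.
        if not entry.is_file(follow_symlinks=False):
            continue
        if now - entry.stat().st_mtime > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


# Example call (hypothetical path): remove logs older than one hour.
# remove_old_logs("/path/to/access/logs", max_age_seconds=3600)
```

Such a function can be run from a scheduler such as cron for the duration of the stability test.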