The system is largely performance tuned after you've run the amtune tool. But it is still too early to perform the final complex performance tests. It's always more difficult to troubleshoot performance problems in the entire system than to troubleshoot individual system components performing basic transactions. So in this phase, you perform several baseline tests. Be sure that the specific baseline test scripts you write will:
Verify the functions of the sub-systems under the stress load of basic transactions such as authentications and authorizations.
Establish baseline performance benchmarks for basic transactions.
You will need the following test scripts to generate the basic authentication workload:
Login and logout test
Login and time out test
For all tests, randomly pick user IDs from a large user pool of at least 100,000 and up to one million users. The load test script should first log the user in, then either log the user out or simply drop the session and let the session time out. A good practice is to remove all static page and graphics requests from the scripts. This makes the workload cleaner and more clearly defined, and the results easier to interpret.
The test scripts should have zero think time to put the maximum workload on the system. The tests are not focused on response times in this phase. The baseline tests should determine the maximum system capacity based on maximum throughput. The number of test users, sometimes called test threads, is usually a few hundred. The exact number is unimportant. What is important is to achieve as close to 100% OpenSSO Enterprise CPU usage as possible while keeping the average response time at 500 ms or more. A minimum of 500 ms is used to minimize the impact of relatively small network latencies. If the average response time is too low (for example, 50 ms), a large portion of it is likely to be caused by network latency, and the data will be contaminated with unnecessary noise.
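With zero think time, every test user always has one request in flight, so throughput, user count, and average response time are tied together by Little's Law: throughput is approximately the number of users divided by the average response time. The following is a minimal sketch of that arithmetic. The function name is hypothetical, and the 200 users and 500 ms are illustrative values, not measurements.

```shell
# Estimate closed-loop throughput with zero think time (Little's Law).
# Arguments: number of test users, average response time in milliseconds.
estimate_tps() {
    awk -v users="$1" -v rt_ms="$2" \
        'BEGIN { printf "%.0f\n", users / (rt_ms / 1000) }'
}

estimate_tps 200 500   # prints 400
```

Read the other way around, this is why the exact number of test users matters less than reaching full CPU usage: once the server is saturated, adding users mainly raises the response time rather than the throughput.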
In the following example baseline test, 200 users per OpenSSO Enterprise instance are used. For your tests, you could use 200 users for one OpenSSO Enterprise instance, 400 users for two OpenSSO Enterprise instances, 600 users for three OpenSSO Enterprise instances, and so forth. To find the minimum number, start with 100 users and, if the workload is too low, increase the number in increments of 100. Once you have determined the minimum number of test users per OpenSSO Enterprise instance, use this number for the rest of the tests to make the results more comparable.
In the example baseline tests, the performance data is captured at the steady state. The system can take anywhere from 5 to 15 minutes to reach its steady state. Watch the tests. The following indicators will settle into predictable patterns when the system has reached its steady state:
Transactions per second (TPS), also called throughput
Average response time of individual transactions
CPU usage of all affected servers (including OpenSSO Enterprise, Directory Server, and any load generation machines)
Number of transactions performed by each component in a given period, categorized by transaction types (see Appendix for details)
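One way to watch throughput settle into a steady pattern is to count requests per minute in the container access log. The following is a minimal sketch. It assumes each log line carries a Common Log Format timestamp such as [10/Oct/2008:13:55:36 -0700], which may not match your container's configured log format, and the function name is hypothetical.

```shell
# Print "minute count" pairs for an access log, in first-seen order.
# Assumes a Common Log Format timestamp in square brackets on each line.
requests_per_minute() {
    awk '{
        i = index($0, "[")
        if (i == 0) next                   # skip lines without a timestamp
        minute = substr($0, i + 1, 17)     # dd/Mon/yyyy:HH:MM
        if (!(minute in count)) order[++n] = minute
        count[minute]++
    }
    END { for (k = 1; k <= n; k++) print order[k], count[order[k]] }' "$1"
}

# Usage: requests_per_minute a
```

When the per-minute counts stop trending up or down, the throughput indicator has likely reached its steady state.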
The following are examples of capturing transactions by categories on different systems.
On each OpenSSO Enterprise host, parse the container access log to gather the number of different transactions received. For example, if OpenSSO Enterprise is deployed on Sun Web Server, use the following command to obtain the result:
cd /opt/SUNwbsvr/https-<openssoHost>/logs
cp access a; grep Login a | wc; grep naming a | wc; grep session a | wc; grep policy a | wc; grep jaxrpc a | wc; grep notifi a | wc; grep Logout a | wc; wc a
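The bare wc output is easy to misread because the lines are unlabeled. The following is a minimal sketch of the same counts with a label on each line. The function name is hypothetical, and the patterns are the ones used in the command above.

```shell
# Print one labeled count per request category found in an access log.
# Run it on a copy of the log so the counts come from a stable snapshot.
count_opensso_requests() {
    log="$1"
    for pattern in Login naming session policy jaxrpc notifi Logout; do
        # grep -c prints the number of matching lines (0 when none match)
        printf '%-8s %s\n' "$pattern" "$(grep -c "$pattern" "$log")"
    done
    printf '%-8s %s\n' total "$(wc -l < "$log" | tr -d ' ')"
}

# Usage: cp access a; count_opensso_requests a
```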
On each LDAP server, parse the LDAP access log to gather the number of different transactions received. For example, use the following command to obtain the result:
cd <slapd-xxx>/logs
cp access a; grep BIND a | grep "uid=u" | wc; grep BIND a | wc; grep UNBIND a | wc; grep SRCH a | wc; grep RESULT a | wc; wc a
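The same LDAP counts can be gathered in a single pass over the log. The following is a sketch only: the function name is hypothetical, and like the command above it assumes the test users' IDs begin with "u".

```shell
# Count LDAP operations in one pass over a copy of the access log.
# As with the grep commands above, the BIND count also includes UNBIND
# lines, because "UNBIND" contains the string "BIND".
count_ldap_operations() {
    awk '
        /BIND/ && /uid=u/ { user_bind++ }
        /BIND/            { bind++ }
        /UNBIND/          { unbind++ }
        /SRCH/            { srch++ }
        /RESULT/          { result++ }
        END {
            printf "user BIND %d\nBIND %d\nUNBIND %d\n", user_bind, bind, unbind
            printf "SRCH %d\nRESULT %d\ntotal %d\n", srch, result, NR
        }' "$1"
}

# Usage: cp access a; count_ldap_operations a
```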
In this example, the baseline test follows this sequence:
Log in and log out on each individual OpenSSO Enterprise directly.
Log in and time out on each individual OpenSSO Enterprise directly.
Log in and log out using a load balancer with one OpenSSO Enterprise server.
Log in and time out using a load balancer with one OpenSSO Enterprise server.
Log in and log out using a load balancer with two OpenSSO Enterprise instances behind it.
Log in and time out using a load balancer with two OpenSSO Enterprise instances behind it.
If you have two OpenSSO Enterprise instances behind a load balancer, the above tests actually involve at least ten individual test runs: two test runs each for tests 1 through 4, one test run for test 5, and one test run for test 6.
In order to perform any log in and timeout test, you must reduce the maximum session timeout value to lower than the default value. For example, change the default 30 minutes to one minute. Otherwise, at the maximum throughput, there will be too many sessions lingering on the system for so long that the memory will be exhausted quickly.
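The number of sessions lingering in memory at steady state is roughly the login rate multiplied by the session timeout. The following is a rough sketch of that arithmetic. The function name is hypothetical, and 400 logins per second is a rate chosen only for illustration.

```shell
# Approximate sessions held in memory:
#   logins per second x timeout in minutes x 60 seconds
lingering_sessions() {
    logins_per_sec="$1"
    timeout_min="$2"
    echo $(( logins_per_sec * timeout_min * 60 ))
}

lingering_sessions 400 30   # prints 720000
lingering_sessions 400 1    # prints 24000
```

At the same login rate, cutting the timeout from 30 minutes to one minute reduces the number of lingering sessions by a factor of 30.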
The data you capture will help you identify possible trouble spots in the system. The following are examples of things to look for in the baseline test results.
If identical hardware is used in the test, the number of authentication transactions per second should be roughly the same for each OpenSSO Enterprise instance. If there is a large variance in throughput, investigate why one server behaves differently than another.
Using a load balancer should not cause a decrease in the maximum throughput. In the example above, test 3 should yield results similar to test 1 results, and test 4 should yield results similar to test 2 results. If the maximum throughput numbers go down when a load balancer is added to the system, investigate why the load balancer introduces significant overhead. For example, you could conduct a further test with static pages through the load balancer.
If the throughput numbers do not increase proportionately with the number of OpenSSO Enterprise instances, you have not configured sticky load balancing properly. Users logged in to one OpenSSO Enterprise instance are being redirected to another instance for logout. You must correct the load balancer configuration. For related information, see Configuring Load Balancer 2 for OpenSSO Enterprise in Deployment Example: Single Sign-On, Load Balancing and Failover Using Sun OpenSSO Enterprise 8.0.
For example, if you perform the OpenSSO Enterprise login and logout test, your test results may look similar to this:
# cp access a; grep Login a|wc; grep naming a|wc; grep session a|wc; grep policy a|wc; grep jaxrpc a|wc; grep notifi a|wc; grep Logout a|wc; wc a
    1581   15810  139128
       0       0       0
       0       0       0
       0       0       0
       0       0       0
       0       0       0
    1609   16090  146419
    3198   31972  286043 a
This output indicates three important pieces of information. First, the system processed 1581 login requests and 1609 logout requests. They are roughly equal. This is expected, as each login is followed by one logout. Secondly, all other types of OpenSSO Enterprise requests were absent. This is expected. Lastly, the total number of requests received, 3198, is roughly the sum of 1581 and 1609. This indicates there are no unexpected requests that we didn't grep in the command.
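Rather than comparing the sum with the total by eye, you can count the unmatched lines directly. The following is a minimal sketch, assuming the same copied log file and the same category patterns as the commands above. The function name is hypothetical.

```shell
# Count the requests that match none of the expected categories.
# -v inverts the match, -E enables alternation, -c prints the line count.
uncategorized_requests() {
    grep -c -v -E 'Login|naming|session|policy|jaxrpc|notifi|Logout' "$1"
}

# Usage: uncategorized_requests a
```

A result well above zero suggests the scripts are generating requests that the baseline test did not intend, such as static page or graphics requests.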
A common problem is that when two OpenSSO Enterprise instances are both running, you see not only login and logout requests, but session requests as well. The test results may look similar to this:
# cp access a; grep Login a|wc; grep naming a|wc; grep session a|wc; grep policy a|wc; grep jaxrpc a|wc; grep notifi a|wc; grep Logout a|wc; wc a
    3159   31590  277992
       0       0       0
    5096   50960  486676
       0       0       0
       0       0       0
    1305   13050  127890
    3085   30850  280735
   12664  126621 1174471 a
In this example, for each logout request, there are now extra session and notification requests. The total number of requests does add up. This means there are no other unexpected requests. The reason for the session requests is that the sticky load balancing is not working properly. A user logs in on one OpenSSO Enterprise instance, then is sent to another OpenSSO Enterprise instance for logout. The second OpenSSO Enterprise instance must generate an extra session request to the originating OpenSSO Enterprise instance to perform the request. The extra session request increases the system workload and reduces the maximum throughput the system can provide. In this case, the two OpenSSO Enterprise instances cannot double the throughput of a single OpenSSO Enterprise instance. Instead, there is a mere 20% increase. You can address the problem at this point by reconfiguring the load balancer. This is an example of a problem that should have been caught during modular verification steps in the system construction phase.
Once the system has passed all the basic authentication tests, it's a good practice to put the system under the test workload for an extended period of time to test its stability. You can use test 6 and let it run for several hours. You may need to set up automated scripts to periodically remove excess access logs so that they do not fill up the file systems.
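The following is a minimal sketch of such a cleanup script. It assumes the container is configured to rotate its access log into files named access.<suffix> in the same directory. That naming, and the function name, are assumptions, so check how your container names rotated logs before using anything like this.

```shell
# Keep only the newest $2 rotated logs (access.*) in directory $1.
# The live "access" file does not match the access.* pattern and is left alone.
prune_rotated_logs() {
    log_dir="$1"
    keep="$2"
    # ls -t lists newest first; tail skips the first $keep entries
    ls -t "$log_dir"/access.* 2>/dev/null | tail -n +"$((keep + 1))" |
        while IFS= read -r f; do rm -f "$f"; done
}

# Usage: prune_rotated_logs /path/to/logs 3
```

You could run a script like this periodically, for example from cron, for the duration of the stability test.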
You will need the following test scripts to generate the basic authorization workload:
Login, access an agent-protected page twice, logout test.
In this example, the baseline authorization test follows this sequence:
Perform login, page-access and logout test on each individual OpenSSO Enterprise instance, with no load balancer in place.
This test determines the OpenSSO Enterprise capacity without the influence of a network element such as the load balancer.
Perform login, page-access and logout test on the load balancer with only one OpenSSO Enterprise instance behind it.
This test determines the impact of the load balancer.
Perform login, page-access and logout test on the load balancer with two OpenSSO Enterprise instances behind it.
This test determines the baseline results when multiple OpenSSO Enterprise instances are running, and indicates whether the sticky load balancing is configured properly.
It is a good practice to set up a single URL policy that allows all authenticated users to access the wildcard URL protected by the policy agent. This setup keeps the baseline tests simple.
For all tests, randomly pick user IDs from a large user pool of at least 100,000 and up to one million users. The load test script logs the user in, accesses a protected static page twice, and then logs the user out. A good practice is to remove all other static page or .gif requests from the scripts. This makes the workload cleaner and more clearly defined, and the results easier to interpret.
The test scripts should have zero think time to put the maximum workload on the system. The tests are not focused on response times in this phase. The baseline tests should determine the maximum system capacity based on maximum throughput. The number of test users, sometimes called test threads, is usually a few hundred. The exact number is unimportant. What is important is to achieve as close to 100% OpenSSO Enterprise CPU usage as possible while keeping the average response time to at least 500 milliseconds. A well executed test indicates the maximum system capacity while minimizing the impact of network latencies.
A typical number is 200 users per OpenSSO Enterprise instance. For example, you could use 200 users for one OpenSSO Enterprise instance, 400 users for two OpenSSO Enterprise instances, 600 users for three OpenSSO Enterprise instances, and so on. To find the minimum number, start with 100 users and, if the workload is too low, increase the number in increments of 100. Once the number of test users per OpenSSO Enterprise instance is determined, continue to use this number for the rest of the tests to make the results more comparable. If you have two OpenSSO Enterprise instances behind a load balancer, the above tests actually involve at least five individual test runs: two runs each for tests 1 and 2, and one run for test 3.
Verify that for each test, the response time of the second protected resource access is significantly lower than the response time of the first protected resource access. On the first access to a protected resource, the agent must perform uncached session validation and authorization, which involves communicating with the OpenSSO Enterprise servers. On the second access, the agent can perform cached session validation and authorization and does not need to communicate with the OpenSSO Enterprise servers, so the second access tends to be significantly faster. It's common to see the first page access take 1 second (this depends heavily on the number of test users), while the second page access takes less than 10 ms (this depends much less on the number of test users). If the second page access is not as fast as it should be compared with the first page access, investigate why. Is the first page access relatively too fast? If so, increase the number of test users to increase the response time of the first page access. Is the agent machine undersized, so that no matter how much load you put on the system, the agent machine reaches full capacity before OpenSSO Enterprise does? In that case the agent machine, not the OpenSSO Enterprise machine, is the bottleneck, and you can expect both the first and second page accesses to be slow while OpenSSO Enterprise responds quickly.
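To compare the two accesses, you can average the response times per request type from your load tool's results. The following is a minimal sketch, assuming the results have been exported as lines of the form label,elapsed_ms. That format, the labels, and the function name are hypothetical and depend entirely on the load generation tool you use.

```shell
# Print the average elapsed time (ms) per request label from a results file
# whose lines look like "label,elapsed_ms" (a hypothetical export format).
average_by_label() {
    awk -F, '
        { sum[$1] += $2; n[$1]++ }
        END { for (l in sum) printf "%s %.1f\n", l, sum[l] / n[l] }' "$1" | sort
}

# Usage: average_by_label results.csv
```

If the average for the second access is not far below the average for the first, work through the questions above.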
The data you capture will help you identify possible trouble spots in the system. The following are examples of things to look for in the baseline test results.
If identical hardware is used in the test, the number of authorization transactions per second should be roughly the same for each OpenSSO Enterprise instance. If there is a large variance in throughput, investigate why one server behaves differently than another.
Using a load balancer should not cause a decrease in the maximum throughput. In the example above, test 2 should yield results similar to test 1 results. If the maximum throughput numbers go down when a load balancer is added to the system, investigate why the load balancer introduces significant overhead. For example, you could conduct a further test with static pages through the load balancer.
If the throughput numbers do not increase proportionately with the number of OpenSSO Enterprise instances, you have not configured sticky load balancing properly. Users logged in to one OpenSSO Enterprise instance are being redirected to another instance for logout. When sticky load balancing is properly configured, each OpenSSO Enterprise instance serves requests independently, so the system scales nearly linearly. You must correct the load balancer configuration. For related information, see Configuring Load Balancer 2 for OpenSSO Enterprise in Deployment Example: Single Sign-On, Load Balancing and Failover Using Sun OpenSSO Enterprise 8.0.
For example, if you perform the OpenSSO Enterprise login, page-access and logout test, your test results should look similar to this:
# cp access a; grep Login a|wc; grep naming a|wc; grep session a|wc; grep policy a|wc; grep jaxrpc a|wc; grep notifi a|wc; grep Logout a|wc; wc a
    1079   10790   94952
    1032   10320   99072
    1044   10440  101268
    1064   10640  101080
       0       0       0
       0       0       0
    1066   10660   97006
    5312   53093  495052 a
This output indicates three pieces of information. First, the system processed 1079 login, 1032 naming, 1044 session, 1064 policy and 1066 logout requests. These numbers are roughly equal. For each login, there is one naming call, one session call (to validate the user's session), one policy call (to authorize the user's access) and one logout. Secondly, all other types of OpenSSO Enterprise requests were absent. This is expected. Lastly, the total number of requests received, 5312, is roughly the sum of the login, naming, session, policy and logout requests. This indicates there are no unexpected requests that we didn't grep in the command.
A common problem is that when two OpenSSO Enterprise instances are both running, you see the number of session requests exceeds the number of logins. For example, the test output may look similar to this:
# cp access a; grep Login a|wc; grep naming a|wc; grep session a|wc; grep policy a|wc; grep jaxrpc a|wc; grep notifi a|wc; grep Logout a|wc; wc a
    4075   40750  358600
    4167   41670  400032
   19945  199450 1913866
    3979   39790  381984
       0       0       0
    3033   30330  297234
    3946   39460  359086
   39194  391891 3713840 a
Note that for each login request, there are now about five session requests and 0.75 notifications. The total number of requests does add up, though. This indicates there are no other unexpected requests. There are more session requests per login because the sticky load balancing is not working properly. A user logged in on one OpenSSO Enterprise instance is sometimes sent to another OpenSSO Enterprise instance for session validation and logout. The second OpenSSO Enterprise instance must generate extra session and notification requests to the originating OpenSSO Enterprise instance to perform the request. The extra requests increase the system workload and reduce the maximum throughput the system can provide. In this case, the two OpenSSO Enterprise instances cannot double the throughput of a single OpenSSO Enterprise instance. You can address the problem by reconfiguring the load balancer. The problem should have been caught during modular verification steps in the system construction phase.
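The per-login ratios can be computed straight from the copied log, which makes a sticky load balancing problem easy to spot from one run to the next. The following is a minimal sketch. The function name is hypothetical, and the patterns are the same ones used in the grep commands.

```shell
# Print how many requests matching pattern $2 occur per login request
# in the access log copy $1.
requests_per_login() {
    awk -v pat="$2" '
        /Login/  { login++ }
        $0 ~ pat { hits++ }
        END {
            if (login == 0) { print "no login requests found"; exit 1 }
            printf "%.2f\n", hits / login
        }' "$1"
}

# Usage: requests_per_login a session; requests_per_login a notifi
```

With the counts shown above (19945 session and 3033 notification requests for 4075 logins), the ratios work out to about 4.89 and 0.74. In the healthy run shown earlier, the session ratio is close to 1 and the notification ratio is 0.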
Once you've passed all the basic authorization tests, it's a good idea to put the system under the workload for an extended period of time to test its stability. You can use test 3 and let it run for several hours. You may need to set up automated scripts to periodically remove excess access logs so that they do not fill up the file systems.