Monitoring

BEA AquaLogic Service Bus provides the capability to monitor and collect run-time information for systems operations purposes. AquaLogic Service Bus aggregates run-time statistics that you can view on a customizable Dashboard. The Dashboard allows you to monitor the health of the system and alerts you to problems in your messaging services. With this information, you can quickly and easily isolate and diagnose problems as they occur.

Monitoring Scenarios

The following sections describe some of the tools and functionality available in the AquaLogic Service Bus Console for monitoring messages and system operations.

Operational Health

The Dashboard page in the AquaLogic Service Bus Console provides the ability to view the state of all servers and monitored services immediately. The Dashboard displays two pie charts, a table, and several links. The Service Summary pie chart shows the percentage of alerts according to their severity for all services that were issued in the past 30 minutes. The Server Summary pie chart shows the current status of every server in the AquaLogic Service Bus domain. Additionally, from the Server Summary panel, you can drill down and view the domain logs, which are grouped according to severity.

In addition to the pie charts, these summaries include a list of the most active services and a list of the most critical servers. The service list displays up to ten services, with fully qualified service names, in descending order of alert count. The most critical server list displays the ten most critical servers. This display is based on the health of the running servers, as defined by the WebLogic Diagnostic Service. For more information about the WebLogic Diagnostic Service, see Configuring and Using the WebLogic Diagnostics Framework.

From each of the summaries, you can drill down into more detail by clicking a specific area on a pie chart or by clicking one of the links on the page.

The Alert Summary table lists the alerts that were issued in the past 30 minutes. This table contains the following fields:

Alert Severity: This field specifies the severity of the alert. By default, the table is sorted by alert severity, in the following order:

Fatal
Critical
Major
Minor
Warning
Normal

Time Stamp: This field gives the date and time when the error occurred, in MM/DD/YY HH:MM format.
Service: This field gives the path and the name of the service.
Alert Rule Name: This field gives the name of the alert rule configured for the service.

Note:

The alert rule name is available only in the case of SLA violations.
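
The default sort order of the Alert Summary table can be sketched as follows. This is only an illustration in Python, not console code: the Alert record, its field names, and the sample rows are hypothetical, and only the severity order is taken from the list above.

```python
from dataclasses import dataclass

# Severity order as listed for the Alert Summary table, most severe first.
SEVERITY_ORDER = ["Fatal", "Critical", "Major", "Minor", "Warning", "Normal"]
SEVERITY_RANK = {name: rank for rank, name in enumerate(SEVERITY_ORDER)}


@dataclass
class Alert:
    # Hypothetical record mirroring the table fields described above.
    severity: str
    timestamp: str  # MM/DD/YY HH:MM
    service: str
    rule_name: str


def sort_by_severity(alerts):
    """Return alerts ordered from Fatal down to Normal (stable sort)."""
    return sorted(alerts, key=lambda alert: SEVERITY_RANK[alert.severity])


# Made-up sample rows, for illustration only.
alerts = [
    Alert("Minor", "01/15/07 10:05", "Project/OrderProxy", "SlowResponse"),
    Alert("Fatal", "01/15/07 10:07", "Project/BillingService", "ErrorBurst"),
    Alert("Warning", "01/15/07 10:01", "Project/OrderProxy", "HighLoad"),
]
for alert in sort_by_severity(alerts):
    print(alert.severity, alert.service, alert.rule_name)
```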

Monitoring Alerts

When you log in to the AquaLogic Service Bus Console, you may see a list of alerts on the Dashboard. This display is dynamically refreshed. These alerts could be the result of SLA violations or pipeline alerts. Service Level Agreements (SLAs) are agreements that define the precise level of service expected from the AquaLogic Service Bus business and proxy services.

Each row of the table displays the information that you have configured, such as the severity, timestamp, and associated service. Clicking the severity link will display more details about the alert to help analyze the cause of the alert.

The alert destinations that can be configured for an alert are the console, e-mail, JMS, reporting, and SNMP traps. For example, you can choose e-mail or JMS as an additional or replacement destination for alert notification.

Monitoring Statistics

Monitoring statistics tell you how many messages in a particular service have been processed successfully and how many have failed. To access this information, go from the Dashboard to the Service Monitoring Summary page and filter the display for the relevant service. In addition to the number of messages that succeeded or failed, the page shows which project the service belongs to, the average execution time of message processing, and the number of alerts associated with the service. You can view monitoring statistics for the current aggregation interval, for the period since you last reset statistics for this service, or for the period since you last reset statistics for all services.

You use the Global Settings page in the System Administration module of the AquaLogic Service Bus Console to reset statistics. When you do this, make sure you are not in a WebLogic session on the WebLogic Server Administration Console.

Clicking the name of the service brings you to that service's Service Monitoring Details page. This page provides additional information such as the minimum and maximum response times and the overall average time it takes for the service to execute a message, the success-failure ratio, the number of messages that have failed because of security or validation errors, and the number of messages associated with proxy service components (pipelines and route nodes). You can view this information for specific operations associated with the service. Again, you can view these statistics for the period of the current aggregation interval or you can display the statistics for the period since you last reset statistics for this service or since you last reset statistics for all services.

Verifying Service Level Agreements

Assume that a particular proxy service is generating many SLA violation alerts because of slow response times. To investigate the problem, log in to the AquaLogic Service Bus Console and look at the detailed statistics for the proxy service. At this level, you can identify that a third-party web service invocation stage in the pipeline is taking a long time and is the actual bottleneck. After successfully renegotiating service-level characteristics with the third-party web service provider, you could configure alert metrics to track the provider's compliance with the new agreement terms. In this way, you can use alerts as the basis for negotiating Service Level Agreements.

Pipeline Alert Action

You can also generate alerts inside a stage in the pipeline using the Alert action. For this you use the Alert action in the Reporting category of the Actions menu.

You define the conditions under which a pipeline alert is triggered using the conditional constructs available in the Pipeline Editor, such as the XQuery Editor or an if-then-else construct. You can use the Alert Destination resource in an alert action to define the destination for the alert. You have complete control over the alert body, which can include pipeline and context variables, and you can extract portions of the message.

You can obtain an integrated view of all the alerts generated by a service on the Dashboard page in the AquaLogic Service Bus Console.

Alert Destination

AquaLogic Service Bus Console

The Dashboard shows the overall health-related information for AquaLogic Service Bus. It provides an overview of the state of the system, organized by server, services, and alerts.

After monitoring is enabled, the Service Monitoring Summary page in the AquaLogic Service Bus Console provides a view of the statistics collected for each service. It provides information about the alerts generated due to SLA violations or as a result of alert actions configured in the pipeline.

As previously mentioned, an SLA is an agreement that defines the precise level of service expected from business and proxy services in AquaLogic Service Bus. The SLA Manager, with the help of the AquaLogic Service Configuration module, allows users to configure SLA rule conditions and actions. The SLA Manager monitors SLA violations with the help of data provided by the Aggregator and sends notifications as configured in the alert rule actions. The SLA Manager is always deployed with the Aggregator and resides on only one managed server in a cluster. The SLA Manager sends alerts to the Alert Log, which stores them in the Alert Store.

E-mail Alert Destination

E-mail is one of the destinations for alerts. To configure this alert destination, you must first configure the SMTP global resource. This resource captures the address of the SMTP server corresponding to your e-mail destination, the port number, and, if required, the authentication credentials. The authentication credentials are stored inline and are not stored as a service account. The alert action uses the SMTP resource to send outbound e-mail messages. You can use the SMTP resource to send both pipeline alerts and SLA alerts. When an alert is delivered by e-mail, metadata containing the details of the alert is prefixed to the configured payload.

SNMP Traps

Simple Network Management Protocol (SNMP) traps allow third-party software to interface with the monitoring of Service Level Agreements (SLAs) within AquaLogic Service Bus. When alert notification using SNMP is enabled, Web Services Management (WSM) and Enterprise Service Management (ESM) tools can monitor SLA violations by monitoring alert notifications.

SNMP is an application-layer protocol that allows the exchange of management information about a resource across a network. It enables you to monitor a resource and, if required, take corrective action based on the data obtained from the resource. Both SNMP version 1 and SNMP version 2 are supported in this version of AquaLogic Service Bus. SNMP is made up of the following components:

Managed Resource

This is the resource that is being monitored. The resource and its attributes are added to the Management Information Base (MIB).

Management Information Base (MIB)

The Management Information Base (MIB) is a hierarchical data structure that stores all the resources to be monitored, along with the attributes of those resources. Each resource is given a unique identifier called the Object Identifier (OID). You can use SNMP commands to retrieve management information about a resource. The following section gives an illustration of the WebLogic Server MIB.

An Illustration of WebLogic Server MIB

The WebLogic Server installer creates a copy of the MIB in the following location:

where <BEA_HOME> is the directory in which you installed the WebLogic Server. WebLogic Server exposes thousands of data points in its management system. To organize this data it provides a hierarchical data model that reflects the collection of services and resources that are available in a domain. Figure 3-1 illustrates the hierarchy of objects in the MIB.

For example, if you created two managed servers, MS1 and MS2, in a domain, then the MIB contains one serverTable object, which in turn contains one serverName object. The serverName object in turn contains two instances with the values MS1 and MS2. The MIB assigns a unique number called an object identifier (OID) to each managed object. Once an OID is assigned, you cannot change it. Each OID consists of a sequence of integers. This sequence defines the location of the object in the MIB tree. Each node in the path has both a number and a name associated with it.
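
The idea that an OID is a sequence of integers locating an object in the MIB tree can be sketched in Python as follows. This is a minimal illustration, not part of any SNMP toolkit. The prefix 1.3.6.1.4.1 is the standard "enterprises" subtree; the numbers after it, and the names server_table and server_name, are made up for the example and are not the actual WebLogic Server OIDs.

```python
def parse_oid(text):
    """Convert a dotted OID string such as '1.3.6.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip(".").split("."))


def is_descendant(oid, ancestor):
    """True if `oid` lies strictly below `ancestor` in the MIB tree."""
    return len(oid) > len(ancestor) and oid[:len(ancestor)] == ancestor


# Hypothetical OIDs: only the 1.3.6.1.4.1 prefix is a real, standard subtree.
server_table = parse_oid("1.3.6.1.4.1.99999.1")
server_name = parse_oid("1.3.6.1.4.1.99999.1.1.5")

print(is_descendant(server_name, server_table))  # the column is under the table
print(is_descendant(server_table, server_name))  # but not the other way around
```

Because each OID is just a path of integers, checking whether one object sits beneath another reduces to a prefix comparison.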

SNMP Agent

Each managed resource uses an SNMP agent to update the relevant information in the MIB. To do this, configure the SNMP agent to detect certain conditions within a managed resource and send a trap notification (report) to the SNMP manager. You can configure the SNMP agent to generate traps in one of the following ways:

Automatically: You can configure the SNMP agent to generate traps for events such as server startup or server shutdown.
Using log messages: Using filters, you can configure the SNMP agent to detect specific log messages and generate traps.
Monitoring traps: You can create JMX API clients to monitor changes in attributes and notify the SNMP agent to generate traps. You can also configure the SNMP agent to monitor changes in attributes.

SNMP Manager

The SNMP manager manages the SNMP agents. It is also the primary interface to the Network Management System.

Network Management System (NMS)

The Network Management System forms the interface with the user. It gathers data using the SNMP manager and presents it to the user.

JMS

Java Message Service (JMS) is another destination for pipeline alerts and SLA alerts. You must use a JNDI URL for the JMS destination for alerts. When you configure an alert rule to post a message to a JMS destination, you must create a JMS connection factory and a queue or topic, and target them to the appropriate JMS server in the WebLogic Server Administration Console. For information on how to do this, see "Configuring a JMS Connection Factory" and "JMS Resource Naming Rules for Domain Interoperability" in Configuring JMS System Resources in Configuring and Managing WebLogic JMS. When you define the JMS alert destination, you can use either a destination queue or a destination topic. The message type can be bytes or text. For more information on how to configure a JMS alert destination, see Alert Destinations in Using the AquaLogic Service Bus Console.

Reporting

Reporting is another way to monitor and analyze both pipeline alerts and SLA alerts. It is discussed in detail in Reporting.

About Monitoring

Aggregation Interval

In AquaLogic Service Bus, the monitoring subsystem collects statistical information, such as message count and execution time, over an aggregation interval. The aggregation interval is the time period over which statistical data is collected and displayed in the AquaLogic Service Bus Console.

Consider a proxy service that you have configured to process purchase orders and for which you have enabled monitoring with an aggregation interval of 10 minutes. Monitoring starts when you send the first message through the proxy service. Until the first 10 minutes elapse, the Service Summary page displays partially computed data, because the system does not yet have 10 minutes of data. After the first 10 minutes of data aggregation, the system always displays the last 10 minutes of data. For example, at the 14th minute, the Dashboard displays minutes 4 through 14. If no messages are processed after the 15th minute, then at the 25th minute no data is displayed for the service. For more information about how the aggregation interval affects the display of monitored information, see Alert Rules.
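
The sliding behavior of the aggregation interval in this example can be sketched in Python as follows. This is an illustrative model only, not the product's implementation: the function names are hypothetical, time is measured in whole minutes since monitoring started, and messages are assumed to arrive once per minute from minute 1 through minute 15.

```python
def displayed_window(now_minute, interval=10):
    """Return the (start, end) minutes covered by the display at `now_minute`.

    Before a full interval has elapsed the window starts at minute 0,
    which corresponds to the partially computed data.
    """
    return max(0, now_minute - interval), now_minute


def messages_in_window(arrival_minutes, now_minute, interval=10):
    """Count messages whose arrival minute falls inside the current window."""
    start, end = displayed_window(now_minute, interval)
    return sum(1 for minute in arrival_minutes if start < minute <= end)


# Assumed traffic: one message per minute, none after the 15th minute.
arrivals = list(range(1, 16))

for now in (5, 14, 25):
    print(now, displayed_window(now), messages_in_window(arrivals, now))
```

At minute 5 the window is still partial, at minute 14 it covers minutes 4 through 14, and at minute 25 it covers minutes 15 through 25 and contains no messages, which matches the case in which no data is displayed.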

You must explicitly enable monitoring for any business or proxy service that you create; monitoring is disabled by default. After you have enabled monitoring and set the aggregation interval for your individual services, you can enable or disable monitoring for all those services from the Global Settings page in the System Administration module. For more information, see Monitoring Services.

SLA alerts are automated responses to Service Level Agreement (SLA) violations or occurrences, and they are displayed on the Dashboard. You define alert rules to specify unacceptable service performance according to your business and performance requirements. When you configure an alert rule, you specify the aggregation interval for that rule. This aggregation interval is not affected by the aggregation interval set for the service. Alert rules also allow you to send notifications about the violation to the configured alert destinations. For information on defining alert rules, see Creating Alert Rules in Using the AquaLogic Service Bus Console.
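
As a rough illustration of how an alert rule with its own aggregation interval might be evaluated, consider the following Python sketch. Everything here is an assumption made for the example: the AlertRule class, its fields, the threshold condition on average response time, and the sample data are hypothetical and do not represent the product's rule format or API.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    # Hypothetical rule: fire when the average response time exceeds a limit.
    name: str
    severity: str
    max_avg_response_ms: float
    interval_minutes: int  # the rule's own aggregation interval


def evaluate(rule, samples):
    """Check a rule against (minute, response_ms) samples.

    `samples` must already be restricted to the rule's aggregation interval.
    Returns an alert description, or None when the rule is not violated.
    """
    if not samples:
        return None
    average = sum(ms for _, ms in samples) / len(samples)
    if average > rule.max_avg_response_ms:
        return f"{rule.severity}: {rule.name} (average {average:.0f} ms)"
    return None


rule = AlertRule("SlowResponse", "Major", max_avg_response_ms=500, interval_minutes=5)
print(evaluate(rule, [(1, 400), (2, 700), (3, 900)]))  # violation
print(evaluate(rule, [(1, 100), (2, 200)]))            # no violation
```

Keeping the interval on the rule rather than on the service reflects the point above that a rule's aggregation interval is independent of the one set for the service.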

Monitoring Architecture

The following diagram shows the architecture of AquaLogic Service Bus monitoring.

The Statistics Configuration Manager stores and manages the statistics configuration for each operational resource. An operational resource is defined as the unit for which statistical information can be collected by the monitoring subsystem. An operational resource includes a proxy service, service operations, and pipelines. The Statistics Configuration Manager is notified about changes in the service definition, such as adding, updating, or deleting a pipeline.

Each managed server in a cluster hosts a Statistics Collector. The Statistics Collector collects statistics on operational resources as directed by the Statistics Configuration Manager. The Statistics Collector also keeps a history of samples within the aggregation interval for the collected statistics. At every system-defined checkpoint interval, the Statistics Collector stores a snapshot of the current statistics in a persistent store for recovery purposes and sends the information to the Statistics Aggregator.

One of the managed servers in a cluster, called the Aggregating Server or Aggregator, is designated as the aggregator for cluster-wide statistics. At system-defined checkpoint intervals, each managed server in the cluster sends a checkpoint snapshot of its contributions to the Aggregator. The Aggregator then combines this information to offer cluster-wide statistics to its clients through Retriever APIs. The clients of Aggregator are the Dashboard, SLA Manager, and Service Monitoring modules.

To contribute a data point to the system, an operational resource in the system, such as a run-time proxy service pipeline, calls a method on the Statistics Collector, and identifies itself, the statistic, and the data point.
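
The flow described in this section (operational resources contribute data points to a per-server collector, and collectors send checkpoint snapshots to a single aggregator) can be sketched in Python as follows. The class and method names are hypothetical and chosen only to mirror the component names above. The sketch omits the persistent store and the Retriever APIs, and it is not the product's actual interface.

```python
from collections import defaultdict


class StatisticsCollector:
    """Hypothetical per-server collector of (resource, statistic) values."""

    def __init__(self):
        self._values = defaultdict(list)

    def record(self, resource, statistic, value):
        # The caller identifies itself, the statistic, and the data point.
        self._values[(resource, statistic)].append(value)

    def checkpoint(self):
        """Return a snapshot of the current data and start a new period."""
        snapshot = {key: list(vals) for key, vals in self._values.items()}
        self._values.clear()
        return snapshot


class Aggregator:
    """Hypothetical cluster-wide aggregator that merges server snapshots."""

    def __init__(self):
        self._merged = defaultdict(list)

    def merge(self, snapshot):
        for key, vals in snapshot.items():
            self._merged[key].extend(vals)

    def count(self, resource, statistic):
        return len(self._merged.get((resource, statistic), []))

    def average(self, resource, statistic):
        vals = self._merged.get((resource, statistic), [])
        return sum(vals) / len(vals) if vals else None


# Two managed servers contribute data points for the same pipeline.
ms1, ms2 = StatisticsCollector(), StatisticsCollector()
ms1.record("OrderProxy/pipeline", "elapsed-ms", 120)
ms2.record("OrderProxy/pipeline", "elapsed-ms", 80)
ms2.record("OrderProxy/pipeline", "elapsed-ms", 100)

aggregator = Aggregator()
for collector in (ms1, ms2):
    aggregator.merge(collector.checkpoint())

print(aggregator.count("OrderProxy/pipeline", "elapsed-ms"))
print(aggregator.average("OrderProxy/pipeline", "elapsed-ms"))
```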

Monitoring Services

When you create a business or proxy service, monitoring is disabled by default for that service. Enable monitoring as follows:

To enable monitoring for an individual service, select the Enable Monitoring checkbox on the Manage Monitoring page. Then set the aggregation interval for the service by selecting the interval times from the hour and minute drop-down lists. For information on how to do this, see "Viewing the Dashboard Statistics" in Monitoring in the Using the AquaLogic Service Bus Console.
To enable monitoring for all services, select the Enable Monitoring checkbox on the Global Settings page. For information on how to do this, see "Enabling Monitoring" in System Administration in the Using the AquaLogic Service Bus Console.

Note:

The Enable Monitoring option permits you to enable or disable monitoring of all services that have individually been enabled for monitoring. If monitoring for a particular service has not been enabled, you must first enable it and set the aggregation interval on the Manage Monitoring page before the system starts collecting statistics for that service.
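
The relationship between the global setting and the per-service setting can be summarized with a small Python sketch. The function name is hypothetical; the logic only restates the note above, in which the global option acts on services that have been individually enabled.

```python
def is_collecting(global_enabled, service_enabled):
    """Statistics are collected only when both switches are on."""
    return global_enabled and service_enabled


# Print the outcome for every combination of the two settings.
for global_enabled in (True, False):
    for service_enabled in (True, False):
        print(global_enabled, service_enabled,
              is_collecting(global_enabled, service_enabled))
```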

When creating alert rules, you must enable monitoring before you create the rule. For more information, see Alert Rules and "Create an Alert Rule" in Monitoring in the Using the AquaLogic Service Bus Console.

Refresh Rate of Monitored Information

At run time, the default refresh rate for the Dashboard page is one minute. However, it may take up to three minutes for information to be displayed on the Dashboard. This delay results from the time gaps between when messages are processed by the proxy service, when the metrics are collected, and when the Dashboard is refreshed. The following example illustrates the sequence:

Suppose a proxy service starts sending data at T1, as shown in Figure 3-3. At T2 (that is, the second minute), the Statistics Collector sends the data to the aggregator. However, if an aggregation cycle has just occurred, the aggregator does not merge this data until the next aggregation cycle, which occurs after one minute, or a maximum of two minutes from the previous aggregation cycle. Once the data is merged, it is available to the AquaLogic Service Bus Console. Because the console refreshes every minute, if a refresh cycle has just passed, the console displays the alerts one minute later, for a maximum delay of three minutes.
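
The three-minute figure is the sum of three worst-case waits, as the following Python sketch shows. The one-minute values for the collector checkpoint and the aggregation cycle are read from the example above and are treated here as assumptions. The constant and function names are hypothetical.

```python
CHECKPOINT_MINUTES = 1   # assumed wait until the collector sends its data
AGGREGATION_MINUTES = 1  # assumed wait until the next aggregation cycle
REFRESH_MINUTES = 1      # default Dashboard refresh rate


def worst_case_delay(refresh_minutes=REFRESH_MINUTES):
    """Longest time, in minutes, before processed data reaches the Dashboard.

    Each stage contributes its full period when the data just misses
    the previous cycle of that stage.
    """
    return CHECKPOINT_MINUTES + AGGREGATION_MINUTES + refresh_minutes


print(worst_case_delay())   # with the default one-minute refresh
print(worst_case_delay(5))  # with a hypothetical five-minute polling interval
```

Under these assumptions, only the last term changes when you adjust the Dashboard polling interval.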

You can change the Dashboard polling interval in the System Administration module in the AquaLogic Service Bus Console. For information on how to do this, see "Setting the Dashboard Polling Interval Refresh Rate" in System Administration in the Using the AquaLogic Service Bus Console.

Dashboard

When you log in to the AquaLogic Service Bus Console, the Dashboard is displayed automatically. The Dashboard shows the monitoring information for the last 30 minutes. It provides an overview of the state of the system, organized by server, services, and alerts, as shown in the following figure.

As shown in the previous figure, the Dashboard displays the following information:

Services Summary: if alerts have been configured, summarizes the alert status for both proxy and business services. Alerts notify you of violations of service level agreements, or that an alert action condition defined in a pipeline has been met.
Servers Summary: displays the status of the servers.
Alerts Summary: if alerts have been configured, displays which alert rules have been triggered.

From the Dashboard, you can drill down into the system and easily find specific information, such as the average execution time of a service, the date and time an alert occurred, or how long a server has been running.

You configure the Dashboard and monitoring in the AquaLogic Service Bus Console, which is described in the Monitoring and System Administration sections of Using the AquaLogic Service Bus Console.

Service Summary

About the Service Summary

The Service Summary panel provides an overview of the state of the services. The Service Summary pie chart shows, for the last 30 minutes, the percentage of alerts by severity across all services for which alerts are defined and monitoring is enabled. The severity level of alerts is user configurable and has no absolute meaning. Severity types include Fatal, Critical, Major, Minor, Warning, and Normal. The services with the most alerts are listed beneath the pie chart, as shown in the following figure. Up to ten services are listed, in descending order of alert count.
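
The two pieces of information in this panel, the severity percentages behind the pie chart and the list of up to ten services with the most alerts, can be derived from a list of alerts as in the following Python sketch. The function names and the sample (service, severity) pairs are hypothetical and serve only to illustrate the calculation.

```python
from collections import Counter


def severity_percentages(alerts):
    """Map each severity to its share of all alerts, in percent."""
    counts = Counter(severity for _, severity in alerts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {severity: 100.0 * n / total for severity, n in counts.items()}


def services_with_most_alerts(alerts, limit=10):
    """Return up to `limit` (service, alert_count) pairs, highest count first."""
    counts = Counter(service for service, _ in alerts)
    return counts.most_common(limit)


# Made-up alerts from the last 30 minutes: (service, severity).
alerts = [
    ("Project/OrderProxy", "Major"),
    ("Project/OrderProxy", "Minor"),
    ("Project/Billing", "Major"),
    ("Project/OrderProxy", "Major"),
]
print(severity_percentages(alerts))
print(services_with_most_alerts(alerts))
```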

From the Service Summary panel, you can access more information about alerts by clicking the following:

A specific area on a pie chart: displays the Alert History page for alerts for the given level of severity.
The name of a service under Services With Most Alerts In Current Aggregation Interval: displays the Service Monitoring Details page for that service.
View Service Monitoring Summary: displays the Service Monitoring Summary page. To help you locate specific services, you can filter the services by different criteria.

For information on how to access detailed alert information, see "Viewing the Dashboard Statistics" in Monitoring in the Using the AquaLogic Service Bus Console.

Service Monitoring Summary

The Service Monitoring Summary page provides two views of service monitoring statistics, as shown in the following figures.

The first is a dynamic view of statistical data collected by each service. This view is available when you select Current Aggregation Interval in the Show Metrics For field. The aggregation interval displayed in this view determines the statistics that are displayed. For example, if the aggregation interval of a particular service is 20 minutes, that service's row displays the data collected in the last 20 minutes.

The second view is a running count of the metrics. This view is available when you select Since Last Reset in the Show Metrics For field. The statistics displayed in each row are for the period since you last reset the statistics for an individual service or since you last reset the statistics for all services on the Global Settings page in the System Administration module.

As shown in the top section of the preceding figures, you can filter the display of information using the following criteria:

Name—the name of the proxy service or business service.
Path—the project folder in which the proxy service or business service resides.
Has Alerts—by services that have alert messages.
Has Errors—by services that have failed messages.
Invoked by proxy—the name and path of the proxy service.

For each service, the page displays the following information:

Name—the name of the proxy or business service. The name is a link to the Service Monitoring Details page. See Service Monitoring Details.
Path—the project folder in which the service resides. The path is a link to the Project View or Folder View page, depending on whether the service resides in the top level of a project or in a folder.
Aggregation Interval—the time period over which data points for specific statistics are collected and displayed for the service. This information is displayed only when you have selected Current Aggregation Interval in the Show Metrics For field.
Average Execution Time—the average time it has taken the service to process a message for the period of the current aggregation interval or for the period since the last reset. This is measured in milliseconds.
Message Count—the total number of messages processed by the service for the period of the current aggregation interval or for the period since the last reset.
Error Count—the number of messages that have failed for the period of the current aggregation interval or for the period since the last reset.
Alert Counts—the number of alerts raised by alert rule occurrences and violations for the period of the current aggregation interval or for the period since the last reset.

Service Monitoring Details

The Service Monitoring Details page provides you with two views of detailed information about a specific service, as shown in the following figures.

The first is a dynamic view of the statistical data collected by the service. This view is available when you select Current Aggregation Interval in the Show Metrics For field. The aggregation interval displayed in this view determines the statistics that are displayed. For example, if the aggregation interval of this service is 20 minutes, the view displays the data collected in the last 20 minutes.

The second view is a running count of the metrics. This view is available when you select Since Last Reset in the Show Metrics For field. The statistics displayed are for the period since you last reset statistics for this particular service or since you last reset statistics for all services on the Global Settings page in the System Administration module.

Service Monitoring Details

Alert Status—the current alert status, which is displayed only when you have selected Current Aggregation Interval in the Show Metrics For field.
Aggregation Interval—the time period over which data points for specific statistics are collected and then displayed for the service. This information is displayed only when you have selected Current Aggregation Interval in the Show Metrics For field.
Alerts for Current Aggregation Interval—the total number of alerts associated with this service during the current aggregation interval. This information is displayed only when you have selected Current Aggregation Interval in the Show Metrics For field.
Alerts Since Last Reset—the total number of alerts associated with this service since you last reset statistics for the service or since you last reset statistics for all services on the Global Settings page. This information is displayed only when you have selected Since Last Reset in the Show Metrics For field.
Alert History—a link to the Customized System Alerts History page. See System Alerts History.
Location Path—the project and folder where the service resides.
Display Metrics For—displays the metrics for a server. For a single node, only one item is displayed.

Operations

Operations—the operations associated with the service, if any exist.
Message Count—the number of messages associated with each operation during the current aggregation interval or within the period since the last reset.
Minimum Response Time—the minimum time this operation has taken to execute messages during the current aggregation interval or within the period since the last reset.
Maximum Response Time—the maximum time this operation has taken to execute messages during the current aggregation interval or within the period since the last reset.
Average Execution Time—the average time the operation has taken to execute messages during the current aggregation interval or within the period since the last reset.

Performance

Minimum Response Time—the minimum time this service has taken to execute messages during the current aggregation interval or within the period since the last reset.
Maximum Response Time—the maximum time this service has taken to execute messages during the current aggregation interval or within the period since the last reset.
Overall Average Execution Time—the overall average time that the service has taken to execute messages during the current aggregation interval or within the period since the last reset.
Total Number of Messages—the total number of messages, including failed messages, during the current aggregation interval or within the period since the last reset.
Messages With Errors—the number of messages that failed during the current aggregation interval or within the period since the last reset. In two-way messaging, if the response message fails, but the request message was processed, only the failed response message is counted.
Failover Count—for business services only, the number of failover messages during the current aggregation interval or within the period since the last reset.
Success Ratio—the percentage of successfully processed messages during the current aggregation interval or within the period since the last reset.
Failure Ratio—the percentage of messages that failed to process during the current aggregation interval or within the period since the last reset.
Number of WS Security Errors—the number of messages that failed due to security reasons, such as authentication errors, security policy violations, or authorization errors, during the current aggregation interval or within the period since the last reset.
Number of Validation Errors—the number of messages that failed when a validate action compared one or more parts of a message against an XSD schema or WSDL resource, during the current aggregation interval or within the period since the last reset. This is displayed for proxy services only.
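
The Success Ratio and Failure Ratio values follow directly from the Total Number of Messages and Messages With Errors values, as this Python sketch shows. The function name is hypothetical, and returning no ratios when no messages have been processed is an assumption made for the example, not documented console behavior.

```python
def ratios(total_messages, messages_with_errors):
    """Return (success_percent, failure_percent), or None without messages.

    `total_messages` includes the failed messages, as in the
    Total Number of Messages field.
    """
    if total_messages == 0:
        return None  # assumption: no ratios are shown without any messages
    failure = 100.0 * messages_with_errors / total_messages
    return 100.0 - failure, failure


print(ratios(200, 30))
print(ratios(0, 0))
```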

Flow Components for proxy services

Component Name—the name of the pipeline or node in the message flow.
Message Count—the number of messages associated with each component during the current aggregation interval or within the period since the last reset.
Error Count—the number of failed messages associated with each component during the current aggregation interval or within the period since the last reset.
Average Execution Time—the average time the component has taken to execute a message during the current aggregation interval or within the period since the last reset.

Server Summary

About the Server Summary

The Server Summary panel provides an overview of the state of the servers. The pie chart shows the status of each server in the domain. The status for each server is derived from the WebLogic Diagnostic Service (see Configuring and Using the WebLogic Diagnostics Framework). The five most critical servers are displayed, as shown in Figure 3-10.

Fatal—the server has failed and must be restarted.
Critical—server failure likely; something must be done immediately to prevent failure. For more details, check the server logs and the corresponding RuntimeMBean.
Warning—the server could have problems in the future. For more details, check the server logs and the corresponding RuntimeMBean.
OK—the server is functioning without any problems.
Overloaded—the server has more work assigned to it than the configured threshold; it might refuse more load.

Log Summary

The AquaLogic Service Bus Console allows you to view the WebLogic Server domain log. The domain log file provides a central location from which to view the overall status of the domain. Each server instance forwards a subset of its messages to a domain-wide log file. By default, servers forward only messages of severity level NOTICE or higher. You can modify the set of messages that are forwarded. For more information, see Understanding WebLogic Logging Services in Configuring Log Files and Filtering Log Messages.
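
The default forwarding rule (only messages of severity NOTICE or higher reach the domain log) amounts to a threshold comparison on an ordered list of severities, as in the following Python sketch. The severity ordering shown is an assumption based on the commonly documented WebLogic Server levels and should be checked against your release. The function names and sample messages are hypothetical.

```python
# Assumed WebLogic Server severities, from least to most severe.
SEVERITIES = ["TRACE", "DEBUG", "INFO", "NOTICE", "WARNING",
              "ERROR", "CRITICAL", "ALERT", "EMERGENCY"]
RANK = {name: rank for rank, name in enumerate(SEVERITIES)}


def is_forwarded(severity, threshold="NOTICE"):
    """True if a message of this severity is forwarded to the domain log."""
    return RANK[severity.upper()] >= RANK[threshold.upper()]


def forwarded(messages, threshold="NOTICE"):
    """Keep only the (severity, text) messages that meet the threshold."""
    return [(sev, text) for sev, text in messages if is_forwarded(sev, threshold)]


# Made-up server log entries.
server_log = [
    ("Info", "Routine status message"),
    ("Notice", "Server started"),
    ("Error", "Request failed"),
]
print(forwarded(server_log))
```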

If you configure the logging action in a pipeline, the log is forwarded to the server log. Unless you configure WebLogic Server to forward these messages to the domain log, you cannot view this log from the AquaLogic Service Bus Console. For information on how to do this, see Create Log Filters in the WebLogic Server Administration Console Online Help.

To see the number of messages currently raised by the system, click the View Log Summary link in the Server Summary panel. A table is displayed that contains the number of messages grouped by severity, as shown in the following figure.

Alert—a particular service is in an unusable state while other parts of the system continue to function. Automatic recovery is not possible; immediate attention of the administrator is required to resolve the problem.
Critical—a system or service error has occurred. The system can recover but there might be a momentary loss or permanent degradation of service.
Emergency—the server is in an unusable state. This severity indicates a severe system failure.
Error—a user error has occurred. The system or application can handle the error with no interruption. Limited degradation of service may occur.
Info—reports normal operations; a low-level informational message.
Notice—an informational message with a higher level of importance than Info messages.
Warning—a suspicious operation or configuration has occurred. However, normal operations may not be affected.

This display is based on the health state of the running servers, as defined by the WebLogic Diagnostic Service. For more information about the WebLogic Diagnostic Service, see Configuring and Using the WebLogic Diagnostics Framework.

To view the domain log for a particular type of message, click the number corresponding with the type of message. The following figure shows an example of a domain log file displayed in the AquaLogic Service Bus Console.

Date—the date and time the entry was logged, in a format that is specific to the locale.
Subsystem—the WebLogic Server subsystem that was the source of the message, such as the EJB container or Java Messaging Service (JMS).
Severity—indicates the degree of impact or seriousness of the event.
Message ID—the unique six-digit identification for the message.
Message—a description of the event or condition.

To display details of a single log file on the page, select the radio button for the appropriate log, then click the View button.

Server Summary

The Server Summary page provides a customizable table of servers, as shown in the following figure.

As shown in the upper section of Figure 3-13, the Server Summary page displays the number of messages currently raised by the system. For information about the meaning of each type of status message, see Log Summary.

Status—the status of the server:

Fatal—the server has failed and must be restarted.
Critical—server failure likely; something must be done immediately to prevent failure. For more details, check the server logs and the corresponding RuntimeMBean.
Warning—the server could have problems in the future. For more details, check the server logs and the corresponding RuntimeMBean.
OK—the server is functioning without any problems.
Overloaded—the server has more work assigned to it than its configured threshold; it cannot take on more load.

Server—the name of the server. The name is a link to the View Server Details page. See Server Details.
Cluster Name—if the server is part of a cluster, the name of the cluster.
Machine Name—the name of the computer associated with the server.
State—the state of the server:

RUNNING
FAILED
SHUTDOWN

Uptime—the duration for which this server has been running.

To view this information in the table as a pie or bar chart, click View as a Graph.

To filter the display of servers, click Customize Table above the server table. The available filtering is shown in the following figure.

For information about how to use the Server Summary Table Filter, see "Customize Your View of the Server Summary" in Monitoring in the Using the AquaLogic Service Bus Console.

Server Details

You can access the View Server Details page by clicking the name of a server under Most Critical Servers or by clicking the name of a server in the Server Summary page.

The View Server Details page enables you to view more server monitoring details, as shown in the following figure.

The information displayed on this page is a subset of the Monitoring tab in the AquaLogic Service Bus Console Server Settings page. The details available are:

General—provides general run-time information about the server. Click Advanced to view more information, such as WebLogic Server version or operating system name.
Channels—displays monitoring information about each channel.
Performance—displays performance information about the server.
Threads—displays current run-time characteristics and statistics for the server's active executable queues.
Timers—displays information about the timer used by the server.
Workload—displays statistics for work managers, constraints, and policies configured on the server.
Security—allows you to monitor user-lockout management statistics for the server.
JMS—allows you to monitor JMS information about the server.
JTA—displays the summary of all transaction information for all resource types on the server.

Alert Summary

About the Alert Summary

In AquaLogic Service Bus there are two types of alerts that can occur. They are:

Pipeline Alerts

Alerts that are triggered when alert actions configured within a pipeline are executed are called pipeline alerts. You can use actions grouped under the reporting category. The actions available under the Report category are:

For more information, see Proxy Service: Actions in Using the AquaLogic Service Bus Console. The alerts are monitored using the alert destinations.

Service Level Agreement Alerts (SLA)

The Service Level Agreement (SLA) alerts are generated when the service violates the service level agreement or a predefined condition. The Alert Summary panel contains a customizable table displaying information about violations or occurrences of events in the system. These violations and occurrences are based on SLAs. AquaLogic Service Bus provides various SLA monitors that you can configure to monitor proxy and business services. Some examples of SLA monitors are maximum execution time and authorization failure. You configure these monitors by creating alert rules. When a rule evaluates to true, it raises an alert. This alert can be sent to the console, an SNMP trap, the reporting stream, e-mail recipients, or a JMS queue or topic. These destinations are configured using the alert destination resource.

The AquaLogic Service Bus Console provides several ways to view and find alerts, such as by severity and by service. You can also view alerts graphically. For information on how to do this, see "Listing and Locating Alerts" and "Viewing a Chart of Alerts" in Monitoring in Using the AquaLogic Service Bus Console.

The Alert Summary panel shows alerts for the last 30 minutes. It contains the following details:

Alert Severity—the user-defined severity of the alert. The Severity is a link to the Alert Details page. See System Alert Details.
Timestamp—the date and time when the alert occurred.
Alert Rule Name—the name assigned to the alert. The name is a link to the View Alert Rule Details page. See View Alert Rule Details.

Note:

The Alert Rule Name content acts as a link only for SLA alerts (which possess rule configuration information), not for pipeline alerts.

Service/Project Name—the name of the service and project associated with the alert. The name is a link to the Service Monitoring Details page. See Service Monitoring Details.

To view a complete list of alerts, click View Alert Summary List. See System Alerts History.

To customize the information displayed in the Alert Summary panel, click Customize Table above the summary table. The available filtering is shown in the following figure.

To customize the sort order of the displayed alerts, click the sort icons beside the column headers.

System Alerts History

To access the Customized System Alerts History page, in the Alert Summary panel, click View Alert Summary List. The Customized System Alerts History page enables you to view all the alerts by paging through the table (see Figure 3-18) or by filtering the display of the alerts (see Figure 3-19).

You can customize the table shown in Figure 3-18, which provides the following details:

Alert Severity—the severity level of alerts is user configurable and has no absolute meaning. The field is a link to the System Alert Details page. See System Alert Details.
Timestamp—the date and time when the alert occurred.
Alert Rule Name—the name assigned to the alert. The name is a link to the View Alert Rule Details page. See View Alert Rule Details.

Note:

The Alert Rule Name content acts as a link only for SLA alerts, which are configured with an alert rule, and not for pipeline alerts.

Service—the name of the service and project associated with the alert. The name is a link to the Service Monitoring Details page. See Service Monitoring Details.

To search for a specific alert, you can filter the display of alerts by clicking Customize Table in the Customized System Alerts History table. The available filtering options are shown in the following figure.

For information about how to use the Alerts Table Filter, see "Customizing Your View of Alerts" in Monitoring in the Using the AquaLogic Service Bus Console.

System Alert Details

The System Alert Details page displays complete information about the alert and allows you to add an annotation to the alert, as shown in the following figure.

Alert Rule Name—the alert rule name for SLA alerts, or the alert summary for pipeline alerts. For SLA alerts, the name is a link to the View Alert Rule Details page.
Description—the rule description for SLA alerts, or the alert payload for pipeline alerts.
Timestamp—the date and time of the alert.
Severity—the user-defined severity of the alert.
Service—the name of the service associated with the alert. The name is a link to the Service Monitoring Details page. See Service Monitoring Details.
Annotation—use this field to add notes to the alert.

You access this page from the dashboard by clicking Alert Severity in the Alert Summary table. This page also allows you to delete the alert.

View Alert Rule Details

The View Alert Rule Details page displays complete information about a specific alert rule, as shown in the following figure.

Last Modified By—specifies who modified the alert rule.
Last Modified On—specifies when the alert rule was modified.
References—specifies the number of resources that the alert rule refers to.
Referenced By—specifies the number of resources that refer to the alert rule.
General Configuration

Rule Name—the name of the alert rule. The value in this field is used as the subject for an e-mail alert.
Alert Summary—a summary that describes the purpose of the alert rule. This is also used as the subject line for the e-mail message if this alert rule is configured with an e-mail destination.
Alert Destination—the alert destination associated with the alert rule. The alert destination sets where notifications for the alert rule are sent. You must set an alert destination in order to determine the distribution of alerts by severity.

Note:

Although an alert is detected and counted even if an alert destination is not set, you cannot determine the severity of the alert, and hence it is not reflected on the Dashboard.

Start Time (HH:MM)—specifies the starting time during which the rule is active on each day prior to the expiration date.
End Time (HH:MM)—specifies the ending time during which the rule is active on each day prior to the expiration date.
Rule Expiration Date (MM/DD/YY)—the expiration date of the rule. The rule expires at 12:01 AM on the specified date. If you do not specify a date, the rule never expires.
Rule Enabled—indicates whether the rule is enabled or not.
Alert Severity—the user-defined severity of the alert.
Alert Frequency—indicates whether notifications should be issued to the configured alert destinations every time the alert rule evaluates to true or issued once when the rule evaluates to true within a given aggregation interval.
Stop Processing More Rules—when multiple rules associated with a service exist, this flag indicates whether subsequent rules associated with the service must be processed if the current rule evaluates to true.
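
As an illustration of the Rule Enabled and Stop Processing More Rules settings, the following minimal sketch evaluates a service's rules in order. The `AlertRule` class, its fields, and `evaluate_rules` are hypothetical names chosen for this example; they do not correspond to an AquaLogic Service Bus API, and the ordering of rules is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    name: str
    condition: Callable[[Dict[str, float]], bool]  # evaluated against service statistics
    enabled: bool = True
    stop_processing_more_rules: bool = False


def evaluate_rules(rules: List[AlertRule], stats: Dict[str, float]) -> List[str]:
    """Return the names of the rules that fire, honoring the stop flag."""
    fired = []
    for rule in rules:
        if not rule.enabled:
            continue  # disabled rules are skipped entirely
        if rule.condition(stats):
            fired.append(rule.name)
            if rule.stop_processing_more_rules:
                break  # subsequent rules for this service are not processed
    return fired
```

In this sketch, a rule with the flag set stops processing only when its own condition evaluates to true; if the condition is false, evaluation continues with the next rule.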

Conditions

Condition Expression—displays the condition that triggers the alert rule and aggregation interval details of the alert.

For information about how to define alert rules, see "Create an Alert Rule" in Monitoring in the Using the AquaLogic Service Bus Console.

Alert Rules

About Alert Rules

As mentioned earlier, alerts are automated responses to SLA violations, which are displayed on the Dashboard. You define alert rules to specify unacceptable service performance according to your business and performance requirements. Each alert rule allows you to specify the aggregation interval for that rule when configuring the alert rule. The alert aggregation interval is not affected by the aggregation interval set for the service.

On the Alert Rule page, if you set the Alert Frequency to Every Time, notifications are issued every time the alert rule evaluates to true. If you set the Alert Frequency to Once When Condition Is True, notifications are issued the first time the rule evaluates to true, and no more notifications are generated until the condition resets itself and evaluates to true again.

In the case where the Alert Frequency is set to Every Time, the number of times an alert rule is fired depends on the aggregation interval and the sample interval associated with that rule. For example, if the aggregation interval is set to 5 minutes, the sample interval is 1 minute. Rules are evaluated each time 5 samples of data are available. Therefore, the rule is evaluated for the first time approximately 5 minutes after it is created and every minute thereafter.

In the case where the Alert Frequency is set to Once When Condition Is True, after an alert is fired the first time in an aggregation interval, it is not fired again in the same aggregation interval.
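
The interaction between the aggregation interval, the one-minute sample interval, and the Alert Frequency setting can be sketched as follows. This is a simplified model, not the product's implementation: the function name `notifications`, the error-count condition, and the threshold are assumptions made for illustration, and the "once" mode follows the reset behavior described above (no further notification until the condition becomes false and then true again).

```python
from collections import deque


def notifications(samples, interval_minutes, threshold, every_time):
    """Return the minutes at which a notification would be issued.

    samples holds one error count per one-minute sample. The rule is
    evaluated once interval_minutes samples are available, and again
    every minute after that, over the most recent samples.
    """
    window = deque(maxlen=interval_minutes)
    issued = []
    was_true = False
    for minute, count in enumerate(samples, start=1):
        window.append(count)
        if len(window) < interval_minutes:
            continue  # not enough samples for the first evaluation yet
        is_true = sum(window) > threshold
        if is_true and (every_time or not was_true):
            issued.append(minute)
        was_true = is_true
    return issued


samples = [0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0]
every = notifications(samples, 5, 5, every_time=True)   # [5, 6, 7, 8, 9]
once = notifications(samples, 5, 5, every_time=False)   # [5]
```

With a 5-minute aggregation interval, a single burst of errors in minute 5 keeps the condition true for five consecutive evaluations, so Every Time issues five notifications while the once-only mode issues one.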

General Configuration—defines the name, description, summary, duration, severity, frequency, enabled state, and other general characteristics of the alert rule.
Define Condition—defines one or more conditions that trigger the alert rule. Additionally, you can define the aggregation interval for the condition on this page.

For more information about creating an alert rule, see "Create an Alert Rule" in Monitoring in the Using the AquaLogic Service Bus Console.

Some Uses for Alerts

Monitoring and e-mail notification of WS-Security errors.
Monitoring the number of messages passing through a particular pipeline.
E-mail notification when the average execution time exceeds 5 seconds during stock exchange hours.

Understanding Alert Rules

Question 1: I created a service with an alert rule that has the following condition expression:

Answer: Monitoring statistic collection for each statistical attribute, such as message count and error count, associated with a service begins when a change in the value of that statistic occurs. Data collection for the Message Count attributes begins when the first message is processed by the service and the Message Count attribute is incremented. Similarly, collection of data for the Error Count statistic starts only when the service encounters its first error and the Error Count attribute is incremented. If the service is idle, no monitoring information is collected for that service and subsequently no alert rules are triggered. After the first message is processed, monitoring data for that service is continually collected even if the service does not receive any further requests. Check to see if the service has received any requests.

Question 2: I defined a new alert rule with an aggregation interval that did not exist before and that rule does not seem to raise any alerts. All other rules created prior to this one are working correctly.

Answer: The cause is the same as for Question 1; the service needs to process at least one request after a rule with a new aggregation interval is created to trigger the alert rule. The other rules defined with different aggregation interval values are not affected by the alert rule.

Question 3: I restarted the server and none of my services have processed any requests. Why do I see alerts being generated?

Answer: Once the Monitoring subsystem has started collecting data for services, stopping and restarting a server does not abort the collection process. The data collected is persisted and statistic collection picks up from where it left off.

Why am I being alerted in this case? Shouldn't the success rate be 80% in this case?

Answer: No, the message count value displayed is the total of all messages processed by the service, including the ones that generated an error. Consequently, in this case, the success rate is 75%.
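
The arithmetic behind this answer can be checked with a short sketch. The counts used here (a message count of 4 and an error count of 1) are hypothetical values chosen only because they reproduce the 75% and 80% figures; the formula is an assumption consistent with the explanation above, not a documented definition.

```python
def success_rate(message_count, error_count):
    """Success rate in percent; message_count already includes failed messages."""
    if message_count == 0:
        raise ValueError("no messages processed")
    return 100.0 * (message_count - error_count) / message_count


message_count, error_count = 4, 1  # hypothetical values

# Correct reading: the 4 messages include the 1 that failed.
actual = success_rate(message_count, error_count)

# Mistaken reading: treating the 4 messages as successes in addition to the error.
mistaken = 100.0 * message_count / (message_count + error_count)
```

Here `actual` is 75.0 and `mistaken` is 80.0, which is the difference between counting the failed message inside or outside the message count.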

Question 5: I created a service with an aggregation interval of 10 minutes that sends a JMS message. I could see the message on the Service Monitoring Summary page, but some time later the message count for my service shows as zero.

Answer: The Service Monitoring Summary page displays dynamic statistics. In this case, it shows the message count in the last 10 minutes. Because no messages were processed by the system in the last 10 minutes, the message count is displayed as zero.
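
A dynamic statistic of this kind can be modeled as a count over a moving window. The sketch below is an illustration only; the function name and the use of minute offsets as timestamps are assumptions.

```python
def message_count_in_window(message_times, now, interval_minutes):
    """Count messages whose timestamp falls within the last interval_minutes."""
    return sum(1 for t in message_times if now - interval_minutes < t <= now)


message_times = [2]  # one message processed at minute 2
shortly_after = message_count_in_window(message_times, now=5, interval_minutes=10)
much_later = message_count_in_window(message_times, now=15, interval_minutes=10)
```

In this model the message is counted at minute 5 but has dropped out of the 10-minute window by minute 15, so the displayed count returns to zero.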

Question 6: I changed the aggregation interval of a service from 10 minutes to 5 minutes. The Service Monitoring Summary page shows all statistics as zero. One of the alerts in this server was configured to a statistical element with a 2 minute aggregation interval, which did not fire the next minute.

Answer: Changing the aggregation interval for a service removes the statistical information for all the services and alerts associated with that service. The alert initializes again and triggers an alert when the aggregation interval expires.

Question 7: I have a business service with multiple endpoints with an alert rule defined as Failover-count > 0. When one of the endpoints goes down, the alert is triggered. However, when a service has only one endpoint, the Failover-count is not incremented for this service. Instead, an error is generated.

Answer: Set the Retry count to a number greater than zero. For information about setting the Retry count, see "Adding a Business Service" in Business Services in the Using the AquaLogic Service Bus Console.

Question 8: I see that an alert is generated on the Dashboard but the value for the Alerts for Current Aggregation Interval field on the Service Monitoring Details page displays zero.

Answer: Alert rules are evaluated after the completion of the interval, which occurs after a checkpoint completion. If a rule evaluates to true, the rule's actions are triggered, a log is generated, and the interval-count statistic attribute (Alerts for Current Aggregation Interval) is incremented. The updated value of this counter is processed in the next checkpoint, 60 seconds later. The Monitoring Details page displays the updated count approximately one minute after the alert is generated.

Answer: Consider the case where the active time for a rule is specified as 22:00 to 09:00.
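
One way to reason about an active time such as 22:00 to 09:00 is sketched below. It assumes that a window whose end time is earlier than its start time wraps past midnight; that interpretation, and the function name `is_rule_active`, are assumptions made for this example rather than documented behavior.

```python
from datetime import time


def is_rule_active(now, start, end):
    """Return True if now falls inside the daily active window."""
    if start <= end:
        # Window within a single day, for example 09:00 to 17:00.
        return start <= now <= end
    # Window that wraps past midnight, for example 22:00 to 09:00.
    return now >= start or now <= end
```

Under this assumption, 23:30 and 03:00 fall inside the 22:00 to 09:00 window, and 12:00 falls outside it.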

The ServerStatistics are sent to the Dashboard. The ServerStatistics represent the monitoring run-time data for that minute. In other words, they contain the statistics information for the services that have been enabled.

The monitoring system aggregates the data received every minute and makes it available to the retriever subsystem. The aggregator thread is behind by 15 seconds with respect to the Statistics Collector checkpoint thread.

If you disable monitoring for the domain, you disable the collection of statistics for that domain. The monitoring data is no longer collected from the next minute, which means there is no data returned if you attempt to retrieve it. The same applies when you enable monitoring for the domain. The system initially does not show any data. However, after a maximum of two minutes, the Service Summary page displays the results of monitoring.

Statistics Associated With Different Resources

The following section provides more information on different statistics associated with:

SERVICE

A service has an inbound endpoint or an outbound endpoint that is registered with the Service Directory of the AquaLogic Service Bus. Such services are associated with other resources, such as WSDLs and security settings. The statistics reported for this resource type are listed in Table 3-1, which also gives the type of each statistic.

Table 3-1 Statistics Reported for SERVICE
Statistic	Type
`message-count`	count
`error-count`	count
`failover-count`	count
`response-time`	interval
`validation-errors`	count
`severity-warning`	count
`severity-major`	count
`severity-minor`	count
`severity-normal`	count
`severity-fatal`	count
`severity-critical`	count
`severity-all`	count
`failure-rate`	count
`wss-error`	count
`success-rate`	count

FLOW_COMPONENT

Statistics are collected for two FLOW_COMPONENT types: the Pipeline-pair node and the Route node. For more details on the Pipeline-pair node and the Route node, see Table 2-1 of Modeling Message Flow in AquaLogic Service Bus. The statistics reported for FLOW_COMPONENT are listed in Table 3-2.

Table 3-2 Statistics Reported For FLOW_COMPONENT
Statistic	Type
`elapsed-time`	interval
`message-count`	count
`error-count`	count

WEBSERVICE_OPERATION

The statistics pertaining to the WEBSERVICE_OPERATION such as WSDLs are collected and stored in a runtime XML file. The statistics reported for this type of resource are listed in Table 3-3.

Table 3-3 Statistics Reported for WEBSERVICE_OPERATION
Statistic	Type
`elapsed-time`	interval
`message-count`	count
`error-count`	count
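
For scripts that post-process monitoring data, the three tables above can be captured in a small lookup structure. The dictionary below transcribes the statistic names and types from Tables 3-1 through 3-3; the variable and function names are hypothetical and are not part of any AquaLogic Service Bus API.

```python
COUNT, INTERVAL = "count", "interval"

# Statistic name -> statistic type, per resource type (from Tables 3-1 to 3-3).
STATISTICS = {
    "SERVICE": {
        "message-count": COUNT, "error-count": COUNT, "failover-count": COUNT,
        "response-time": INTERVAL, "validation-errors": COUNT,
        "severity-warning": COUNT, "severity-major": COUNT,
        "severity-minor": COUNT, "severity-normal": COUNT,
        "severity-fatal": COUNT, "severity-critical": COUNT,
        "severity-all": COUNT, "failure-rate": COUNT,
        "wss-error": COUNT, "success-rate": COUNT,
    },
    "FLOW_COMPONENT": {
        "elapsed-time": INTERVAL, "message-count": COUNT, "error-count": COUNT,
    },
    "WEBSERVICE_OPERATION": {
        "elapsed-time": INTERVAL, "message-count": COUNT, "error-count": COUNT,
    },
}


def statistics_of_type(resource_type, statistic_type):
    """List the statistics of one type reported for a resource type."""
    return [name for name, kind in STATISTICS[resource_type].items()
            if kind == statistic_type]
```

For example, `statistics_of_type("SERVICE", INTERVAL)` selects the only interval statistic listed for a service, `response-time`.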

Auditing

Auditing helps you keep track of changes in the configuration of the AquaLogic Service Bus (ALSB). The three types of auditing you can perform are briefly described in:

Configuration Change Auditing

When you make configuration changes in the AquaLogic Service Bus Console, a record of the changes is generated and a history of all configuration changes is maintained. Only the previous image of the object is maintained. You can view the history of configuration changes and the list of resources that were changed during the session only through the console. However, to access all the configuration information, you must activate the session.

Runtime Auditing of Messages

Auditing the entire message flow pipeline during run time is tedious. However, you can use the reporting action to perform selective auditing of the message flow pipeline during run time. You insert the reporting action at the required points in the message flow pipeline and extract the required information. The extracted information can then be stored in a database or sent to the reporting stream in order to write the auditing report.

Security Auditing

When a message is sent to the proxy service and there is a breach in transport-level authentication or Web Services security, WebLogic Server generates an audit trail. You must configure WebLogic Server to generate this audit trail. Using it, you can audit all security violations that occur in the message flow pipeline. WebLogic Server also generates an audit trail whenever it authenticates a user. For more information on security auditing, see Configuring the WebLogic Security Framework: Main Steps in AquaLogic Service Bus Security Guide.
