6 Monitoring Services

This chapter describes how to use Business Transaction Management to monitor services. It explains the meaning of the instruments used for monitoring and describes the Top 10 Services dashboard. It includes the following sections: Ways of Monitoring Services, Top 10 Services Dashboard, and About Instruments.

6.1 Ways of Monitoring Services

This section lists the ways you can monitor services. The list orders these monitoring tasks from the most general to the most specific, and explains, for each task, how you navigate to the view where you can perform the monitoring.

To monitor:

  • Overall status of services, that is, the total number of services that are up or down and the number with SLA warnings or failures

    Navigate to Dashboards > Operational Health Summary > Services

  • Most stressed services

    Navigate to Dashboards > Top 10 Services

  • Current summary information across all services, endpoints, and operations (includes up/down status, SLA compliance, and performance)

    1. Navigate to Explorer > Services to Endpoints or navigate to Explorer > Services to Operations and look at the Main area.

      Services To Endpoints provides a physical view, letting you drill down from services to endpoints and then physical operations. Services To Operations provides a logical view, letting you drill down from services to logical operations.

    2. Select the service or endpoint of interest in the Main area.

    3. Select the Summary tab.

  • Detailed current performance and usage for a specific service, endpoint, or operation

    Navigate to the specific service, endpoint, or operation and click the Analysis tab

  • Detailed current SLA compliance for a specific service, endpoint, or operation

    Navigate to the specific service, endpoint, or operation and click the Compliance tab

  • Recent history of issues affecting a specific service, endpoint, or operation

    Navigate to the specific service, endpoint, or operation and click the Alerts tab

  • Logged messages

    Navigate to the specific service, endpoint, or operation and click the Message Log tab. Message logging is available only if the service endpoint is part of a transaction and message logging is enabled for that transaction.

6.2 Top 10 Services Dashboard

The Top 10 Services dashboard enables you to quickly identify and assess the health of the most stressed services in your system.

To display the top ten services, choose Dashboards > Top 10 Services.

The dashboard provides four tables. Each table is based on a particular instrument and lists the services with the ten highest measurements for that instrument (except for uptime, which is lowest).

The default evaluation period for the data displayed for services is seven days. To change the evaluation period, click the Time Period control at the top of the display. You can change the period to the last day, hour, or 10 minutes.

The tables list the 10 services with the following characteristics:

  • highest throughput (Most Load table)

  • lowest uptime (Uptime Issues table)

  • slowest average response time (Slowest Avg Response Time table)

  • highest number of faults (Most Faults table)

Each table provides numeric instrument values as well as charts. Hover the cursor over a chart to view detailed information for a particular time segment.
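The ranking behind each table can be pictured with a short Python sketch. It is an illustration only: the function name top_n and the sample service names are hypothetical and are not part of Business Transaction Management. It assumes that each service has a single measurement for the selected time period, sorted from highest to lowest for every table except Uptime Issues, which is sorted from lowest to highest.

```python
from typing import Dict, List, Tuple


def top_n(measurements: Dict[str, float], n: int = 10,
          lowest: bool = False) -> List[Tuple[str, float]]:
    """Return up to n (service, value) pairs ranked by value.

    lowest=False ranks by the highest values (load, response time, faults).
    lowest=True ranks by the lowest values (uptime).
    """
    ordered = sorted(measurements.items(),
                     key=lambda item: item[1],
                     reverse=not lowest)
    return ordered[:n]


# Hypothetical uptime percentages for three services.
uptime = {"orderService": 99.9, "creditCheckService": 92.5, "shipService": 97.0}
uptime_issues = top_n(uptime, n=2, lowest=True)
```

Here uptime_issues holds the two services with the lowest uptime, creditCheckService first.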

For in-depth information and analysis, double-click a transaction or service to display the Tabs area, and select the Analysis tab.

6.3 About Instruments

Business Transaction Management uses a variety of instruments to measure the performance and usage characteristics of your business transactions and their underlying services and operations. These instruments are displayed in various parts of the Management Console for interactive monitoring. You can also use most of these instruments as a basis for defining service-level agreements (SLAs).

The period over which these instruments operate is either the evaluation period, in the case of an SLA, or the display period, in the case of interactive monitoring in the Management Console. The following descriptions use the term period to mean the evaluation period and/or display period, depending on the context in which the instrument is used. Some instruments, for example current compliance status, provide a current value only.

6.3.1 Transaction Instruments

The following instruments are available for monitoring transactions.

Average Response Time

The average amount of time a transaction requires to complete. For each instance of the transaction, the instrument measures the time from when the instance's start message is observed until its end message is observed. The instrument keeps a running average of the response time across all instances observed during the period. All completed instances are counted in the response time, regardless of whether condition alerts occurred.

If no transactions are observed during the period, the instrument value is set to a hyphen (-). Response time is measured in milliseconds.
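As a rough illustration of the running average described for this instrument, the following Python sketch updates a mean incrementally as each completed instance is observed. The class and method names are hypothetical, and the incremental formula is one common way to keep a running average; it is not a description of the product's internal implementation.

```python
from typing import Optional


class RunningAverage:
    """Tracks the mean response time (in milliseconds) for one period."""

    def __init__(self) -> None:
        self.count = 0
        self.mean_ms = 0.0

    def observe(self, start_ms: float, end_ms: float) -> None:
        # Response time runs from the start message to the end message.
        elapsed = end_ms - start_ms
        self.count += 1
        # Incremental mean update, so individual samples need not be stored.
        self.mean_ms += (elapsed - self.mean_ms) / self.count

    def value(self) -> Optional[float]:
        # No instances observed in the period means there is no value.
        return self.mean_ms if self.count else None

    def display(self) -> str:
        v = self.value()
        return "-" if v is None else f"{v:.0f} ms"


avg = RunningAverage()
empty_display = avg.display()   # nothing observed yet
avg.observe(0, 100)             # instance that took 100 ms
avg.observe(50, 350)            # instance that took 300 ms
```

After the two observations, the running average is 200 milliseconds.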

Maximum Response Time

The maximum amount of time a transaction requires to complete. The instrument records the single highest response time from all instances of the transaction observed during the period.

Completed Transactions

The number of instances of a transaction that complete during the period. An instance is considered to have completed when both its start and end messages have been observed, regardless of whether condition alerts occurred. However, if the end message is defined as being in the response phase (for example, submit.response) and the end operation faults, the end message will not exist and the instance will, therefore, not be counted.

Completed Transaction Rate

The number of instances of a transaction that complete per hour during the period. This instrument derives its measurements from the completed transactions instrument.
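The relationship between a count instrument and its rate instrument can be sketched as follows. This is a minimal example that assumes the rate is the count normalized to one hour over the length of the period; the function name rate_per_hour is hypothetical. The same arithmetic would apply to the other rate instruments described in this chapter.

```python
def rate_per_hour(count: int, period_seconds: float) -> float:
    """Normalize a count observed over a period to a per-hour rate."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return count * 3600.0 / period_seconds


# 50 completed instances during a 10-minute period.
completed_rate = rate_per_hour(50, 10 * 60)
```

In this example, 50 instances in 10 minutes correspond to a rate of 300 per hour.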

Started Transactions

The number of instances of a transaction that start during the period. An instance is considered to have started when its start message is observed.

Started Transaction Rate

The number of instances of a transaction that start per hour during the period. This instrument derives its measurements from the started transactions instrument.

Condition Alerts

The number of condition alerts generated on the transaction during the period.

Condition Alert Rate

The number of condition alerts generated on the transaction per hour during the period. This instrument derives its measurements from the transaction condition alerts instrument.

Current Compliance Status

The current compliance status for the transaction.

Violation Alerts

The number of SLA violations or warnings caused by a transaction during the period.

6.3.2 Service and Operation Instruments

The following instruments are available for monitoring services, endpoints, and operations.

Average Response Time (services, endpoints, and operations)

The average amount of time a service or operation requires to respond to a request. For each request, the instrument measures the time from when the service receives the request until it sends a corresponding response to the client. The instrument keeps a running average of the response time across all messages received during the period.

Only successfully processed requests are counted in the response time; the response times for faults are not figured into this measurement. The response time is measured individually for each operation. The response time for a service is the average response time of all of its operations. This average is weighted according to the number of messages processed by each operation.

If no requests are observed during the period, the value of the instrument is set to a hyphen (-). Response time is measured in milliseconds.
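The weighted average used for a service's response time can be illustrated with a small Python sketch. The function name and the sample figures are hypothetical; the sketch assumes each operation reports its own average response time and the number of successfully processed messages.

```python
from typing import List, Optional, Tuple


def service_average_response_time(
        operations: List[Tuple[float, int]]) -> Optional[float]:
    """Combine per-operation averages into a service-level average.

    operations holds (average_ms, message_count) pairs, one per operation.
    Each operation's average is weighted by its message count.
    """
    total_messages = sum(count for _, count in operations)
    if total_messages == 0:
        return None  # no requests observed in the period
    weighted_sum = sum(avg_ms * count for avg_ms, count in operations)
    return weighted_sum / total_messages


# One operation averaging 100 ms over 30 messages,
# another averaging 400 ms over 10 messages.
service_avg = service_average_response_time([(100.0, 30), (400.0, 10)])
```

The result is 175 milliseconds rather than the unweighted 250 milliseconds, because the faster operation handled three times as many messages.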

Maximum Response Time (services, endpoints, and operations)

The maximum amount of time a service or operation requires to respond to a request. The instrument records the single highest response time for all requests received during the period.

Link Average Response Time

The average response time to outbound requests. For example, imagine a hypothetical orderService that receives a request from some client, and as a result sends a request to a creditCheckService. In this case, orderService is acting as a client to creditCheckService. The response time is measured from the point of view of the service that is acting as a client. In other words, it measures the time from when the client service sends the request until it receives the response, meaning that network latency, if it exists, is included in the response time.

Only successfully processed requests are counted in the response time; the response times for faults are not figured into this measurement. If no requests are observed during the period, the value of the instrument is set to a hyphen (-). Response time is measured in milliseconds.

Traffic

The number of requests that a service or operation receives during the period. The traffic count equals the throughput plus the fault count. Traffic count is measured individually for each operation. Traffic count for a service is the total traffic count of all of its operations.

Throughput

The number of requests that a service or operation successfully receives, processes, and responds to during the period (in other words, the number of responses). A message that generates a fault is not counted by the throughput instrument. Throughput is measured individually for each operation. Throughput for a service is the total throughput of all of its operations.

Throughput Rate

The number of successfully handled requests per hour during the period. This instrument derives its measurements from the throughput instrument.

Link Throughput

The number of outbound requests to another service that are successfully received, processed, and responded to during the period (in other words, the number of inbound responses; see the link average response time instrument for an explanation of service-to-service calls).

Faults

The number of faults generated by a service or operation during the period. Fault count is measured individually for each operation. The overall fault count for a service is the total fault count of all its operations.

Fault Rate

The average number of faults generated per hour over the period. This instrument derives its measurements from the faults instrument.

Fault Percentage

The percentage of messages that cause faults during the period. This instrument derives its measurements from the faults and traffic instruments.
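The arithmetic that ties the traffic, throughput, faults, and fault percentage instruments together can be sketched in Python as follows. The function names are hypothetical, and returning no value when there is no traffic is an assumption made for this example.

```python
from typing import Optional


def traffic(throughput: int, faults: int) -> int:
    """Traffic count equals the throughput plus the fault count."""
    return throughput + faults


def fault_percentage(throughput: int, faults: int) -> Optional[float]:
    """Percentage of received messages that caused faults."""
    total = traffic(throughput, faults)
    if total == 0:
        return None  # assumption: no traffic means no percentage
    return 100.0 * faults / total


# 95 successful responses and 5 faults during the period.
total_requests = traffic(95, 5)
fault_pct = fault_percentage(95, 5)
```

With 95 successful responses and 5 faults, the traffic count is 100 and the fault percentage is 5.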

Link Faults

The number of faults generated by outbound requests to another service during the period (see the link average response time instrument for an explanation of service-to-service calls).

Current Compliance Status

The current compliance health for the selected object.

Violation Alerts

The number of SLA violations or warnings caused by a service or operation during the period.

Violation Alerts Percentage

The percentage of time that a service or operation is in a state of SLA violation or warning during the period.

Failure Alerts

The number of failure violations that occur during the period.

Warning Alerts

The number of warning violations that occur during the period.

Uptime

The percentage of time that an endpoint's container responds successfully to a periodic ping message. See the configureAlivenessCheck command for details on how you can specify the method to be used for aliveness checking.
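As a simplified illustration, uptime can be approximated as the share of periodic pings that received a successful response. The following Python sketch assumes evenly spaced pings, so that the fraction of successful pings stands in for the fraction of time; the function name is hypothetical, and the sketch does not describe how the product performs or schedules its aliveness checks.

```python
from typing import Optional, Sequence


def uptime_percentage(ping_results: Sequence[bool]) -> Optional[float]:
    """Approximate uptime from evenly spaced ping results.

    Each entry is True if the container responded successfully to that ping.
    """
    if not ping_results:
        return None  # no pings in the period
    return 100.0 * sum(ping_results) / len(ping_results)


# Nine successful pings followed by one failed ping.
uptime = uptime_percentage([True] * 9 + [False])
```

Nine successful pings out of ten give an uptime of 90 percent.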