7 Monitoring at Run Time

RIB runtime monitoring lets you monitor the state and volume of messages flowing through the RIB system, and it reports the status of the system's components. The running RIB system and its message flows are interrogated transparently to collect metrics that help business users and system administrators review the state and health of the system. The monitoring enhancement collects application and adapter statuses, message event counts, transaction counts, error hospital statistics, and server resource utilization statistics.

The following graphic describes the architecture of the system:

Figure 7-1 RIB Monitoring Architecture

Instance and Central Repository

The monitoring metric data is collected in the rib-<app> instances. The data collected from all rib-<app> instances is consolidated in a central location. Both the collection and the consolidation server instances store the data in in-memory repositories. Different pieces of data are collected at different times, depending on the nature of the data and on performance considerations. At any point in time, the repository data shows a complete picture of the state as of the last data collection time.

Monitoring Data as XML

The collected data is reported in a defined XML format, and the components that produce and consume monitoring data exchange it as XML. The rib-<app> instances produce the data; the central repository and the Retail Integration Console (RIC), or third-party tools, consume it.
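As an illustration of exchanging metrics as XML, the following is a minimal Python sketch of a producer serializing adapter metrics and a consumer parsing them. The element and attribute names (`monitoringData`, `adapter`, `eventCount`), the application name, and the adapter name are all hypothetical; the actual RIB schema is not shown in this section.

```python
import xml.etree.ElementTree as ET

def to_xml(app_name, adapters):
    """Serialize per-adapter metrics to an XML string (hypothetical schema)."""
    root = ET.Element("monitoringData", {"app": app_name})
    for name, metrics in adapters.items():
        adapter = ET.SubElement(
            root, "adapter", {"name": name, "status": metrics["status"]}
        )
        ET.SubElement(adapter, "eventCount").text = str(metrics["event_count"])
    return ET.tostring(root, encoding="unicode")

def from_xml(xml_text):
    """Parse the XML back into a dictionary, as a consumer might."""
    root = ET.fromstring(xml_text)
    return {
        a.get("name"): {
            "status": a.get("status"),
            "event_count": int(a.findtext("eventCount")),
        }
        for a in root.findall("adapter")
    }

# Made-up sample data for one adapter in one instance.
payload = to_xml("rib-app", {"adapter_1": {"status": "RUNNING", "event_count": 42}})
parsed = from_xml(payload)
```

Because both sides agree only on the XML shape, a third-party tool could consume the same payload without sharing any code with the producer.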

Push Versus Pull

Some data is collected by scheduled background jobs. Message-related data is collected asynchronously as adapters consume or publish messages. The collected metric data is kept in a local repository in the rib-<app> instance and is pushed to the in-memory central repository on a schedule (every two minutes). If a rib-<app> instance is down, the central repository does not receive data from that instance. The central repository neither polls the rib-<app> instances nor pulls data from them, so it has no dependency on them.

While each rib-<app> has its own monitoring data, the central repository holds the consolidated data from all the rib-<app> instances.
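The push model can be sketched in a few lines of Python. This is an illustration only: the class and method names are hypothetical, and the scheduler that would call `push()` every two minutes is left out so that the example stays deterministic.

```python
class CentralRepository:
    """In-memory store of the latest snapshot pushed by each instance."""

    def __init__(self):
        self.snapshots = {}

    def receive(self, app_name, snapshot):
        # The central side only accepts pushes; it never calls the instances.
        self.snapshots[app_name] = dict(snapshot)


class InstanceRepository:
    """Local in-memory metrics for one rib-<app> instance."""

    def __init__(self, app_name, central):
        self.app_name = app_name
        self.central = central
        self.metrics = {"event_count": 0}
        self.running = True

    def record_event(self):
        self.metrics["event_count"] += 1

    def push(self):
        # Would be invoked by a scheduled job; a stopped instance sends nothing.
        if self.running:
            self.central.receive(self.app_name, self.metrics)


central = CentralRepository()
app_a = InstanceRepository("rib-a", central)
app_b = InstanceRepository("rib-b", central)
app_b.running = False          # simulate an instance that is down

for _ in range(3):
    app_a.record_event()
app_a.push()
app_b.push()                   # nothing reaches the central repository

app_a.record_event()           # collected locally, not yet reported centrally
```

After this run the central repository holds a snapshot only for `rib-a`, and that snapshot still shows three events until the next push, which mirrors the gap between collection time and reporting time.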

Service Interfaces

The monitoring data in the rib-<app> instances and in the central repository is made available to the RIB monitoring system, as well as to third-party tools, through SOAP web services running in the respective server instances.

What is an Event?

RIB messages flow from publishing applications to subscribing applications, TAFRs, and the error hospital in the RIB system. Sometimes messages are rolled back because of application or system errors. Each attempted delivery, whether successful or not, is called an event. The RIB monitoring system counts events, which include both successful and failed deliveries of messages. Changes in adapter status, error hospital data, server resource utilization, and so on are also considered events.

There are two types of events: adapter events and application events.

How Are Event Count and Message Count Related?

The event count includes both successful and failed message deliveries. There is no reliable way to obtain the exact count of successful messages without affecting the performance of the system, so the RIB monitoring system collects event counts instead of message counts. In most cases the two counts are close, but they are not identical.
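A small worked example shows why the two counts diverge. The sketch below is illustrative and uses made-up delivery outcomes: one message is rolled back twice before it succeeds, and two other messages succeed on the first attempt.

```python
def count_events(outcomes):
    """Return (event_count, successful_message_count).

    Each outcome is True for a committed delivery attempt and False for an
    attempt that was rolled back. Every attempt counts as one event.
    """
    events = len(outcomes)
    successes = sum(1 for ok in outcomes if ok)
    return events, successes

# Attempts in order: fail, fail, success (message 1), success (2), success (3).
attempts = [False, False, True, True, True]
events, messages = count_events(attempts)
```

Here three messages are delivered, but five events are recorded, because each rolled-back attempt is also an event.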

Adapter Events

Adapter events are adapter-level events such as message flows (subscription and publishing) and adapter statuses. In the RIB monitoring system, message-related adapter events are collected in real time. Adapter status events are collected by scheduled background threads.

Application Events

Application events are application-level events such as server resource (CPU and memory) utilization, application status, and error hospital data. These metrics are collected by scheduled background threads.

Event Collection Schedule

Different events in the system are collected on different schedules.

Note:

There is a difference between collection time and reporting time. For example, although event counts are collected in real time, they do not appear in the central repository until the rib-<app> instance next pushes its data.

The following is a complete schedule of collection times:

Table 7-1 Schedule of Collection Times

Metric	Event Type	Schedule
Event Count	Adapter	Real time
Adapter Execution Time	Adapter	Real time
API Execution Time	Adapter	Real time
Adapter Status	Adapter	Every three minutes
Application Status	Application	At startup
Error Hospital Statistics	Application	Every five minutes
CPU Utilization	Application	Every five minutes
Memory Utilization	Application	Every five minutes

Publisher Versus Subscriber Events

A publishing event does not collect certain metrics, such as the API Execution Time, because the API execution time cannot be determined once the message is published. It collects only the Adapter Execution Time, which is the time taken to publish the message.

TAFR Instrumentation

TAFRs are instrumented to collect several time metrics. Measurement of the TAFR API Execution Time begins as soon as the TAFR starts transforming the inbound message into an outbound message and ends when the message has been transformed. Measurement of the Adapter Execution Time begins as soon as the message is available for the rib-tafr to transform and ends after the message is routed to the destination topic.

Data Retention

The monitoring data is collected in rib-<app> repositories and in a central repository in the functional artifact app. These are in-memory repositories, so the information in them is lost when the application is restarted. The repositories are not purged, so data accumulates for as long as the applications run. To limit that growth, the monitoring data is collected in hourly buckets, which means there are at most 24 records per day. This strategy reduces the chance of the system running out of memory.
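The effect of hourly bucketing on memory use can be seen in a short Python sketch. It assumes, for illustration, that a bucket is keyed by calendar date and hour; the class name and the sample timestamps are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

class HourlyBuckets:
    """Accumulates event counts in one bucket per (date, hour)."""

    def __init__(self):
        self.counts = defaultdict(int)

    def record(self, when):
        # Many events in the same hour update a single record.
        self.counts[(when.date(), when.hour)] += 1

    def records_for(self, day):
        """Return {hour: count} for one calendar day."""
        return {h: c for (d, h), c in self.counts.items() if d == day}

buckets = HourlyBuckets()
start = datetime(2024, 1, 1, 0, 0)
# One event per minute for a full day: 1,440 events in total.
for minute in range(24 * 60):
    buckets.record(start + timedelta(minutes=minute))

day_records = buckets.records_for(start.date())
```

However many events arrive, a day never produces more than 24 records, so memory grows with uptime in hours rather than with message volume.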

Metric Definitions

The following sections describe the metrics that are collected by the system.

Event Counts

When a message is subscribed or published, an event is generated to increment the event count for the hour of the day.

Adapter Execution Time

For a subscriber adapter, the time is noted as soon as the message arrives, and the elapsed time is calculated at the end of the onMessage method. An Adapter Execution Time event is created, which is used (if applicable) to set the minimum, maximum, and last adapter execution time for the hour of the day.

For a publishing adapter, the time is noted at the beginning and end of the publishing method, and the difference is calculated. An Adapter Execution Time event is created, which is used (if applicable) to set the minimum, maximum, and last adapter execution time for the hour of the day.
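The "if applicable" qualifier can be read as follows: the last value is always overwritten, while the minimum and maximum change only when a new measurement falls outside the current range. The following Python sketch illustrates that reading for a single hourly bucket; the class name and the millisecond values are hypothetical.

```python
class ExecutionTimeStats:
    """Tracks minimum, maximum, and last execution time for one hour."""

    def __init__(self):
        self.minimum = None
        self.maximum = None
        self.last = None

    def update(self, elapsed_ms):
        self.last = elapsed_ms  # always replaced by the newest measurement
        if self.minimum is None or elapsed_ms < self.minimum:
            self.minimum = elapsed_ms
        if self.maximum is None or elapsed_ms > self.maximum:
            self.maximum = elapsed_ms

stats = ExecutionTimeStats()
for elapsed in (120, 45, 80):   # made-up measurements in milliseconds
    stats.update(elapsed)
```

After the three updates, the minimum is 45, the maximum is 120, and the last value is 80. Keeping only three numbers per hour avoids storing every individual measurement.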

API Execution Time

For a subscriber adapter, the time is noted around the API call and the difference is calculated. An API Execution Time event is created, which is used (if applicable) to set the minimum, maximum, and last API execution time for the hour of the day.

For a publishing adapter, there is no API execution time.
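For a subscriber, the API execution time is a sub-interval of the adapter execution time. The Python sketch below illustrates the nesting of the two measurements; `on_message` and the stand-in API (`str.upper`) are hypothetical and only mimic the shape of a subscriber's onMessage processing.

```python
import time

def on_message(message, api_call):
    """Process one message and return (result, adapter_time, api_time)."""
    adapter_start = time.perf_counter()      # message has arrived

    payload = message.strip()                # stand-in for pre-processing

    api_start = time.perf_counter()          # timing around the API call only
    result = api_call(payload)
    api_elapsed = time.perf_counter() - api_start

    adapter_elapsed = time.perf_counter() - adapter_start  # end of handler
    return result, adapter_elapsed, api_elapsed

result, adapter_elapsed, api_elapsed = on_message("  hello  ", str.upper)
```

Because the API timer starts after and stops before the adapter timer, the API time can never exceed the adapter time for the same message.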

Adapter Status

A scheduled background job collects the adapter status and updates the local repository. If the RIB application is down, the job cannot run, so the central repository continues to show the last known adapter status until the cache expires. After the cache expires, the status is "Unknown" until the rib-<app> resets it.
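The expiry behavior can be sketched as a small cache with a time-to-live. This is an illustration only: the class name, the 300-second expiry, and the adapter name are assumptions, since this section does not state the actual cache lifetime. Timestamps are passed in explicitly so that the example is deterministic.

```python
class StatusCache:
    """Holds the last reported status per adapter, with an expiry."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # adapter name -> (status, time reported)

    def report(self, adapter, status, now):
        self.entries[adapter] = (status, now)

    def status(self, adapter, now):
        entry = self.entries.get(adapter)
        if entry is None:
            return "Unknown"
        status, reported_at = entry
        # The last known status is served until the entry is older than the TTL.
        return status if now - reported_at <= self.ttl else "Unknown"

cache = StatusCache(ttl_seconds=300)        # assumed expiry, not from the source
cache.report("adapter_1", "RUNNING", now=0)

before_expiry = cache.status("adapter_1", now=100)   # application went down at t=0
after_expiry = cache.status("adapter_1", now=400)
cache.report("adapter_1", "RUNNING", now=500)        # application is back
after_reset = cache.status("adapter_1", now=510)
```

The last known status is returned at `now=100`, "Unknown" at `now=400`, and the fresh status again once the instance reports at `now=500`.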

Commits and Rollbacks

The commit and rollback counts are the same information that WebLogic Server maintains for EJB transactions. The RIB monitoring system interrogates the JMX MBeans for the commit and rollback counts and updates the local repository. A single message flow may result in more than one commit or rollback, depending on the failure scenario.

Error Hospital Metrics

Error hospital data for the RIB application is queried by a scheduled background thread and the following information is collected:

Total Messages in Error Hospital: Total number of messages in the Error Hospital for the application
Total Messages in Error Hospital due to dependency: Total number of dependent messages in the Error Hospital
Message Family: Message family to which the family-wise statistics apply
Adapter Class Definition: Adapter information for the message family
Error count: Number of error messages for the message family
Dependency count: Number of the dependent messages for the message family
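One way to picture the collected information is as a report with two application-wide totals and a list of per-family entries. The Python sketch below is a hypothetical data model; the field names, family names, adapter names, and numbers are made up for illustration and do not come from the RIB schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FamilyStats:
    message_family: str
    adapter_class: str
    error_count: int
    dependency_count: int

@dataclass
class ErrorHospitalReport:
    total_messages: int
    total_dependent_messages: int
    families: List[FamilyStats] = field(default_factory=list)

    def busiest_family(self):
        """Return the family with the most error messages, or None if empty."""
        return max(self.families, key=lambda f: f.error_count, default=None)

# Made-up sample values for two message families.
report = ErrorHospitalReport(
    total_messages=12,
    total_dependent_messages=4,
    families=[
        FamilyStats("family_a", "AdapterA", error_count=9, dependency_count=3),
        FamilyStats("family_b", "AdapterB", error_count=3, dependency_count=1),
    ],
)
worst = report.busiest_family()
```

A helper such as `busiest_family` shows how a monitoring console might use the family-wise entries to point an administrator at the family that needs attention first.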

RIB Application Status

Status of the RIB application, for example, RUNNING or STOPPED.