Alarms and KPIs Guide

3.9.100 22224 - Average Hold Time Limit Exceeded

Alarm Group:

DIAM

Description:

The average transaction hold time has exceeded its configured limits.

This alarm is generated when KPI #10098 (TmAvgRspTime) exceeds the Peer CNDRA-wide engineering attributes associated with average hold time, which are defined in the DraWorker profile assigned to the DraWorker server. KPI #10098 is the average time (in milliseconds) from when the routing layer (DRL) receives a request message from a downstream peer until an answer response is sent to that downstream peer. The source measurement for KPI #10098 is TmResponseTimeDownstreamMp (10093).

This alarm indicates that the average response time (TmAvgRspTime) for messages forwarded by the Relay Agent is larger than the limit defined for the deployment by the assigned DraWorker profile. One of these problems could exist:

The IP network may be experiencing problems that are adding propagation delays to the forwarded request message and the answer response.
- Verify the IP network connectivity exists between the MP server and the adjacent nodes.
- View the event history logs for additional events or alarms from this MP server.
One or more upstream nodes may be experiencing traffic overload.
One or more MPs may be experiencing traffic overload.
- View the KPI Routing Recv Msgs/Sec.
- View the CPU utilization of MPs.

Severity:

Minor, Major, Critical

Instance:

N/A

HA Score:

Normal

Auto Clear Seconds:

0 (zero)

OID:

eagleXgDiameterAvgHoldTimeLimitExceededNotify

Cause:

Alarm 22224 is generated when KPI #10098 (TmAvgRspTime) exceeds the Peer CNDRA-wide engineering attributes associated with average hold time, which are defined in the DraWorker profile assigned to the DraWorker server. KPI #10098 is the average time (in milliseconds) from when the routing layer (DRL) receives a request message from a downstream peer until an answer response is sent to that downstream peer. The source measurement for KPI #10098 is TmResponseTimeDownstreamMp (10093).
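
As an illustration of the KPI definition above, the following minimal sketch averages per-transaction hold times. It is not the product's implementation: the function name and the (T1, T2) tuple layout are hypothetical, and this guide does not state how the system windows or samples the measurement.

```python
from statistics import fmean
from typing import Iterable, Tuple


def average_hold_time_ms(transactions: Iterable[Tuple[float, float]]) -> float:
    """Return the mean hold time in milliseconds.

    Each transaction is a hypothetical (t1_ms, t2_ms) pair:
      t1_ms: time the routing layer (DRL) received the request
      t2_ms: time the answer response was sent to the downstream peer
    Returns 0.0 when there are no transactions (an assumption for this sketch).
    """
    durations = [t2 - t1 for t1, t2 in transactions]
    return fmean(durations) if durations else 0.0
```

For example, two transactions with hold times of 100 ms and 200 ms give an average of 150 ms.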

The alarm thresholds are configurable for:

- Average hold time minor alarm onset threshold
- Average hold time minor alarm abatement threshold
- Average hold time major alarm onset threshold
- Average hold time major alarm abatement threshold
- Average hold time critical alarm onset threshold
- Average hold time critical alarm abatement threshold

The alarm severity (Minor, Major, or Critical) is determined by the onset and abatement thresholds of each severity level. When the average hold time first exceeds an onset threshold, a minor, major, or critical alarm is raised. When the average hold time subsequently exceeds a higher onset threshold, or drops below an abatement threshold while remaining above the minor alarm abatement threshold, the alarm severity changes to match the highest onset threshold crossed by the current average hold time.
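
The onset and abatement behavior described above can be sketched as a small state function. This is an illustrative reading of the text, not the product's logic: the names `Threshold` and `next_severity` and the example values are hypothetical. Two behaviors are assumptions: how exact equality with a threshold is treated, and that the alarm stays at minor when the value sits between the minor abatement and minor onset thresholds.

```python
from dataclasses import dataclass
from typing import Dict, Optional

SEVERITIES = ("minor", "major", "critical")  # ordered lowest to highest


@dataclass(frozen=True)
class Threshold:
    onset_ms: float      # alarm is raised when the average exceeds this
    abatement_ms: float  # alarm abates when the average drops below this


def next_severity(current: Optional[str], avg_hold_ms: float,
                  thresholds: Dict[str, Threshold]) -> Optional[str]:
    """Return the new alarm severity, or None when no alarm is active."""
    # Highest severity whose onset threshold the current average exceeds.
    crossed = None
    for name in SEVERITIES:
        if avg_hold_ms > thresholds[name].onset_ms:
            crossed = name
    if current is None:
        return crossed
    rank = SEVERITIES.index
    if crossed is not None and rank(crossed) >= rank(current):
        return crossed  # same level, or escalation to a higher onset
    if avg_hold_ms >= thresholds[current].abatement_ms:
        return current  # not yet below the abatement threshold: hold
    if crossed is not None:
        return crossed  # abated to the highest onset still crossed
    # Assumption: still at or above the minor abatement threshold keeps minor.
    if avg_hold_ms >= thresholds["minor"].abatement_ms:
        return "minor"
    return None  # below the minor abatement threshold: alarm clears


# Hypothetical threshold values, in milliseconds, for illustration only.
EXAMPLE = {
    "minor": Threshold(onset_ms=100, abatement_ms=80),
    "major": Threshold(onset_ms=200, abatement_ms=160),
    "critical": Threshold(onset_ms=300, abatement_ms=240),
}
```

With these example values, a major alarm holds at an average of 170 ms (above the major abatement threshold of 160 ms), drops to minor at 150 ms, and clears only once the average falls below 80 ms.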

Diagnostic Information:

If Alarm #22224 is raised, it indicates that the average response time (TmAvgRspTime) for messages forwarded by the Relay Agent is larger than the limit defined for the deployment by the assigned DraWorker profile. One of the following problems could exist:

The IP network may be experiencing problems that are adding propagation delays to the forwarded request message and the answer response.
- Verify the IP network connectivity exists between the MP server and the adjacent nodes.
- View the event history logs for additional events or alarms from this MP server.
One or more upstream nodes may be experiencing traffic overload.
One or more MPs may be experiencing traffic overload.
- View the KPI Routing Recv Msgs/Sec.
- View the CPU utilization of MPs.

Recovery:

The average transaction hold time is exceeding its configured limits, resulting in an abnormally large number of outstanding transactions that may be leading to excessive use of resources like memory.
- Reduce the average hold time by examining the configured Pending Answer Timer values and adjusting any values that are unnecessarily large or small.
- Identify the causes for the large average delay between the Peer CNDRA sending requests to the upstream peers and receiving answers for the requests.
- Confirm whether the peer node(s) or the Peer CNDRA is in overload by viewing KPIs, measurements, and CPU usage, and take corrective action.
- Identify the main contributor to the increased value of (T2-T1), the time difference between the routing layer (DRL) receiving the request and the DRL sending the answer to the downstream peer.
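
One way to look for the main contributor to (T2-T1) is to split each transaction's hold time into stages. The sketch below assumes four timestamps per transaction are available. The `TransactionTimes` record, its field names, and the three-stage split are hypothetical and are not taken from the product's measurements.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TransactionTimes:
    """Hypothetical per-transaction timestamps, in milliseconds."""
    t1_request_received_ms: float  # T1: DRL receives request from downstream peer
    request_forwarded_ms: float    # request sent to the upstream peer
    answer_received_ms: float      # answer received from the upstream peer
    t2_answer_sent_ms: float       # T2: answer sent to the downstream peer


def hold_time_breakdown(tx: TransactionTimes) -> Dict[str, float]:
    """Split (T2-T1) into three stages whose durations sum to the hold time."""
    return {
        "request_processing": tx.request_forwarded_ms - tx.t1_request_received_ms,
        "upstream_wait": tx.answer_received_ms - tx.request_forwarded_ms,
        "answer_processing": tx.t2_answer_sent_ms - tx.answer_received_ms,
    }


def main_contributor(tx: TransactionTimes) -> str:
    """Return the name of the stage with the largest duration."""
    breakdown = hold_time_breakdown(tx)
    return max(breakdown, key=breakdown.get)
```

A large `upstream_wait` points toward upstream peers or the IP network, while large processing stages point toward load on the MP itself.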
If the problem persists, contact My Oracle Support.