6.4.1 Overload Controls

The SBRs that implement the Session and Binding databases must protect themselves from becoming so overloaded that they cease to perform their function. There are two parts to achieving this goal:
  • Detecting the overload condition and its severity
  • Shedding work to reduce load

Overload Control in PCA

The number of ingress messages (both Requests and Answers) received by PCA per second is counted against the PCA ingress message processing capacity. This capacity is an engineered value: the number of ingress messages per second that PCA can process. The number of Request messages received by PCA per second is also measured separately.

PCA defines alarms on queue utilization levels based on configured threshold values. The thresholds are expressed as percentages of the PCA ingress message capacity. If the ingress message rate received at PCA exceeds a configured percentage of the maximum capacity, an alarm is raised. The PCA ingress Request capacity is an engineering-configured value, and the percentage thresholds are applied against it. See Alarm Settings.

PCA congestion is therefore defined by the ingress Request message capacity and the configured threshold values. PCA is considered to be in congestion when the ingress Request rate at PCA exceeds a configured percentage (threshold) of the PCA ingress Request capacity.
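The measurement described above can be sketched as a trailing-window rate meter plus a percentage-of-capacity check. One meter would count all ingress messages and a second would count Requests only. This is a minimal illustration under stated assumptions: the one-second sliding window, the class name `IngressRateMeter` and the helper functions are hypothetical and are not PCA internals.

```python
from collections import deque


class IngressRateMeter:
    """Counts messages received in a trailing time window (hypothetical helper)."""

    def __init__(self, window_sec: float = 1.0):
        self.window_sec = window_sec
        self._arrivals = deque()

    def record(self, now: float) -> None:
        self._arrivals.append(now)

    def rate(self, now: float) -> float:
        # Drop arrivals outside the window (now - window_sec, now].
        while self._arrivals and self._arrivals[0] <= now - self.window_sec:
            self._arrivals.popleft()
        return len(self._arrivals) / self.window_sec


def percent_of_capacity(rate_per_sec: float, capacity_per_sec: float) -> float:
    """Express a measured rate as a percentage of the engineered capacity."""
    if capacity_per_sec <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * rate_per_sec / capacity_per_sec


def threshold_exceeded(rate_per_sec: float, capacity_per_sec: float,
                       threshold_percent: float) -> bool:
    # "Exceeds" is read as strictly greater than (an assumption).
    return percent_of_capacity(rate_per_sec, capacity_per_sec) > threshold_percent
```

Example: with a Request capacity of 1000 per second, a measured rate of 1150 Requests per second is 115% of capacity. That is above a 110% threshold.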

Three PCA congestion levels (CL1, CL2, and CL3) are defined, each associated with onset and abatement threshold values. The onset and abatement values are configurable (see Congestion Options). When PCA is in congestion, a PCA congestion alarm is raised with the severity (Minor, Major, or Critical) that corresponds to the congestion level (CL1, CL2, or CL3).
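The onset/abatement pair gives each congestion level hysteresis. The sketch below shows one plausible way to step between levels using the default thresholds from Table 6-2, expressed as a percentage of ingress Request capacity.

It assumes that:
  • "exceeds" means strictly greater than
  • a level is left only when the rate falls below that level's abatement threshold
  • several levels may be crossed in a single evaluation

The function name and data layout are hypothetical.

```python
# Default (onset %, abatement %) per congestion level, from Table 6-2.
THRESHOLDS = {1: (110.0, 100.0), 2: (140.0, 130.0), 3: (160.0, 150.0)}

# Alarm severity reported for each congestion level (CL0 raises no alarm).
SEVERITY = {0: None, 1: "Minor", 2: "Major", 3: "Critical"}


def next_congestion_level(current: int, utilization_percent: float) -> int:
    """Return the new PCA congestion level (0-3) given the current level."""
    level = current
    # Escalate while the next level's onset threshold is exceeded.
    while level < 3 and utilization_percent > THRESHOLDS[level + 1][0]:
        level += 1
    # Abate while utilization is below the current level's abatement threshold.
    while level > 0 and utilization_percent < THRESHOLDS[level][1]:
        level -= 1
    return level
```

At CL1, a rate of 105% of capacity keeps PCA at CL1 because it is still above the 100% abatement threshold. The same 105% starting from CL0 does not trigger CL1, because it is below the 110% onset threshold.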

When congestion is detected, DA-MP overload control throttles a portion of incoming messages to keep PCA from being severely impacted. The type and percentage of messages to be throttled are configurable through the PCA GUI; Table 6-2 shows the default values.

Table 6-2 PCA Default Overload Control Thresholds

PCA Operational Status | Alarm ID 22721 Severity | Onset Threshold | Abatement Threshold | PCA Congestion Level | PCA Message Throttling Rules
Available | N/A | N/A | N/A | CL0 | No messages are discarded (accept and process 100% of Request and Answer messages)
Available | Minor | 110% | 100% | CL1 | Discard 25% of requests for creating new sessions; 0% of requests for updating existing sessions; 0% of requests for terminating existing sessions; 0% of Answer messages
Available | Major | 140% | 130% | CL2 | Discard 50% of requests for creating new sessions; 25% of requests for updating existing sessions; 0% of requests for terminating existing sessions; 0% of Answer messages
Degraded | Critical | 160% | 150% | CL3 | Discard 100% of requests for creating new sessions; 50% of requests for updating existing sessions; 0% of requests for terminating existing sessions; 0% of Answer messages
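The throttling rules in Table 6-2 can be read as a per-level, per-message-type discard percentage. The sketch below encodes the default values and makes a probabilistic discard decision. The message-kind keys and the use of a random draw are assumptions: a real implementation might use a deterministic counter instead. The `rng` parameter exists only so the decision can be tested.

```python
import random

# Default discard percentages from Table 6-2, keyed by congestion level.
# Message-kind keys are hypothetical labels for the four rows of rules.
DISCARD_PERCENT = {
    0: {"create": 0, "update": 0, "terminate": 0, "answer": 0},
    1: {"create": 25, "update": 0, "terminate": 0, "answer": 0},
    2: {"create": 50, "update": 25, "terminate": 0, "answer": 0},
    3: {"create": 100, "update": 50, "terminate": 0, "answer": 0},
}


def should_discard(level: int, msg_kind: str, rng=random.random) -> bool:
    """Discard with probability DISCARD_PERCENT[level][msg_kind] / 100."""
    percent = DISCARD_PERCENT[level][msg_kind]
    # rng() is in [0, 1), so 0% never discards and 100% always discards.
    return rng() * 100 < percent
```

Session termination requests and Answers are never discarded at any level. This matches the table's intent of always letting PCA finish and clean up existing work.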

PCA's internal congestion state contributes directly to its Operational Status, along with its Admin state and Shutdown state. Consequently, the PCA congestion state affects the message transfer decisions of the Diameter Routing Function. Depending on the PCA's Operational Status (Available, Degraded, or Unavailable), the Diameter Routing Function behaves as follows:
  • Available: it forwards all ingress messages to PCA
  • Degraded or Unavailable: it discards some or all of them

Table 6-3 describes how the Diameter Routing Function handles messages destined for PCA.

Table 6-3 Diameter Routing Function Message Handling Based on PCA Operational Status

PCA Operational Status | Diameter Routing Function Message Handling
Available | Forward all Request and Answer messages to PCA
Degraded | Forward only Answer messages to PCA; discard Request messages
Unavailable | Discard all messages intended for PCA
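Tables 6-2 and 6-3 together link the congestion level to routing:
  • CL0 to CL2 leave PCA Available, and CL3 makes it Degraded.
  • The Operational Status then decides what the Diameter Routing Function forwards.

The sketch below chains the two. The Admin and Shutdown states also feed into the Operational Status, but the source does not give their exact mapping. The `out_of_service` flag is therefore a hypothetical stand-in that simply forces Unavailable.

```python
from enum import Enum


class OpStatus(Enum):
    AVAILABLE = "Available"
    DEGRADED = "Degraded"
    UNAVAILABLE = "Unavailable"


def status_from_congestion(level: int, out_of_service: bool = False) -> OpStatus:
    """Derive PCA Operational Status from its congestion level (Table 6-2)."""
    if out_of_service:  # hypothetical stand-in for Admin/Shutdown state
        return OpStatus.UNAVAILABLE
    return OpStatus.DEGRADED if level >= 3 else OpStatus.AVAILABLE


def forward_to_pca(status: OpStatus, is_request: bool) -> bool:
    """Diameter Routing Function decision from Table 6-3."""
    if status is OpStatus.AVAILABLE:
        return True
    if status is OpStatus.DEGRADED:
        return not is_request  # only Answer messages are forwarded
    return False  # Unavailable: discard everything
```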

PCA checks whether an ingress message is a priority message. Priority messages are inviolable: they are not discarded by the DA-MP overload control function, regardless of the congestion state of the PCA MP. Such messages are assigned the minimum inviolable priority.

PCA also assigns message priority to Gx Re-Auth-Request (RAR) messages that originate in PCA. The priority of a PCA-generated RAR is determined by its intent: either querying the status of a session or removing an existing session. PCA distinguishes between these two types of RAR by the presence or absence of the Session-Release-Cause AVP:
  • If the Session-Release-Cause AVP is included, the RAR is intended to remove an existing session and is assigned a higher priority.
  • Otherwise, the RAR is intended to query the status of a session and is assigned a lower priority.
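The RAR classification reduces to a presence check on a single AVP. In the sketch below, the AVP is matched by its 3GPP Gx name, Session-Release-Cause, and the message is modeled as a plain collection of AVP names.

The numeric priority values are hypothetical. The source states only that the session-removing RAR gets the higher priority, not the actual numbers.

```python
from typing import Iterable

# Hypothetical values; only their relative order comes from the text.
QUERY_RAR_PRIORITY = 1    # RAR without Session-Release-Cause: query session status
RELEASE_RAR_PRIORITY = 3  # RAR with Session-Release-Cause: remove the session


def rar_priority(avp_names: Iterable[str]) -> int:
    """Assign a priority to a PCA-generated RAR based on the AVPs it carries."""
    if "Session-Release-Cause" in set(avp_names):
        return RELEASE_RAR_PRIORITY
    return QUERY_RAR_PRIORITY
```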

The message priority detection in the DA-MP overload control function checks the priority of each ingress message:
  • If the message priority is greater than or equal to the minimum inviolable priority, the message is not throttled, regardless of the congestion level of the PCA MP.
  • If the message priority is lower than the minimum inviolable priority, the DA-MP overload control function discards the message according to the throttling rules for the current congestion level in Table 6-2.
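The priority check comes before the throttling rule: inviolable messages bypass throttling entirely. The sketch below shows that ordering. The numeric value of the minimum inviolable priority is a hypothetical placeholder, since the source does not state it. The caller is assumed to pass in the Table 6-2 discard percentage for the message's type at the current congestion level.

```python
import random

# Hypothetical value: the source names a "minimum inviolable priority" but gives no number.
MIN_INVIOLABLE_PRIORITY = 4


def admit(priority: int, discard_percent: float, rng=random.random) -> bool:
    """Return True if the DA-MP overload control lets an ingress message through.

    discard_percent is the Table 6-2 percentage for this message type at the
    current congestion level (0 at CL0).
    """
    if priority >= MIN_INVIOLABLE_PRIORITY:
        return True  # inviolable: never throttled, whatever the congestion level
    # Otherwise discard with probability discard_percent / 100.
    return rng() * 100 >= discard_percent
```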

Overload Control in SBR

SBR relies on ComAgent for resource monitoring and overload control. The ComAgent Resource Monitoring and Overload Framework does the following:
  • monitors the local MP's resource utilization
  • defines MP congestion based on one or more resource utilization metrics
  • communicates MP congestion levels to peers
  • reports the local MP congestion level to the local application (SBR)

Messages called stack events are used for communication to and from ComAgent.

ComAgent defines MP congestion levels based on two metrics:
  • CPU utilization
  • ingress stack event rate (the number of stack events received per second at the local ComAgent)

Whichever metric is higher relative to its pre-defined congestion threshold determines the level. ComAgent broadcasts the MP congestion state to all of its peers. It also provides APIs that the local SBR can call to receive congestion level notifications.
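Taking "whichever is higher" literally, the MP congestion level is the maximum of the levels derived separately from each metric. The sketch below shows that combination.

All threshold values are hypothetical, since the source gives none for ComAgent. The sketch also omits hysteresis so that the max-of-two-metrics idea stays visible.

```python
from bisect import bisect_right

# Hypothetical onset thresholds for CL1, CL2, CL3 (not specified in the source).
CPU_ONSETS = [60.0, 70.0, 80.0]         # percent CPU utilization
RATE_ONSETS = [10_000, 12_000, 14_000]  # ingress stack events per second


def level_for(value: float, onsets: list) -> int:
    """Number of onset thresholds the value has reached, giving a level 0-3."""
    return bisect_right(onsets, value)


def mp_congestion_level(cpu_percent: float, ingress_events_per_sec: float) -> int:
    """MP congestion level: the worse of the CPU-based and rate-based levels."""
    return max(level_for(cpu_percent, CPU_ONSETS),
               level_for(ingress_events_per_sec, RATE_ONSETS))
```

For example, a CPU utilization of 65% alone would give CL1, but an ingress rate of 12,500 stack events per second gives CL2. The MP therefore reports CL2.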

SBR congestion is measured based on the SBR CPU utilization level. There are four SBR congestion levels: CL0 (Normal), CL1 (Minor), CL2 (Major), and CL3 (Critical). Transitions between levels are governed by onset and abatement threshold values and by abatement time delays.
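An abatement time delay keeps a short dip in CPU utilization from clearing congestion immediately. In the sketch below:
  • escalation to a higher level is immediate
  • the level drops by one only after CPU has stayed below the current level's abatement threshold for the whole delay

The thresholds, the delay and the class name are hypothetical. The source names the mechanism but not its values or exact timing rules.

```python
from typing import Optional

ONSET = {1: 60.0, 2: 70.0, 3: 80.0}  # hypothetical CPU % onset per level
ABATE = {1: 55.0, 2: 65.0, 3: 75.0}  # hypothetical CPU % abatement per level
ABATEMENT_DELAY_SEC = 2.0            # hypothetical abatement time delay


class SbrCongestion:
    """Tracks the SBR congestion level (CL0-CL3) from CPU utilization samples."""

    def __init__(self) -> None:
        self.level = 0
        self._below_since: Optional[float] = None

    def update(self, cpu_percent: float, now: float) -> int:
        # Escalate immediately to the highest level whose onset is reached.
        while self.level < 3 and cpu_percent >= ONSET[self.level + 1]:
            self.level += 1
            self._below_since = None
        if self.level > 0 and cpu_percent < ABATE[self.level]:
            # Below the abatement threshold: start or continue the delay timer.
            if self._below_since is None:
                self._below_since = now
            elif now - self._below_since >= ABATEMENT_DELAY_SEC:
                self.level -= 1
                self._below_since = None  # restart the timer for the next level
        else:
            self._below_since = None  # dip ended before the delay elapsed
        return self.level
```

With samples of 85% at t=0 and then 70% at t=1, 2, 3 and 4, the level reads 3, 3, 3, 2, 2. It drops to CL2 only after two seconds below the CL3 abatement threshold, and stays at CL2 because 70% is above that level's abatement threshold.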

The SBR congestion state (CPU utilization) is managed and controlled by the ComAgents on both the PCA and SBR MPs, based on the ComAgent MP Overload Management Framework. Messages sent from PCA to an SBR are handled according to the congestion state of that SBR. An SBR congestion alarm is raised when an MP congestion notification is received from ComAgent; the notification includes the appropriate alarm severity. The alarm is cleared when a notification from ComAgent indicates that the congestion level has returned to Normal.

To manage overload on an SBR, every stack event is associated with a pre-defined priority. Before a stack event is sent, its priority is compared with the current congestion level of the destination SBR. If the priority is greater than or equal to that congestion level, the stack event is forwarded; otherwise, it is discarded.
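The send rule is a single comparison between the stack event priority (P0 to P3) and the destination's congestion level (CL0 to CL3). The sketch below states that rule; the function name and the range check are illustrative additions.

```python
def should_send(event_priority: int, receiver_congestion_level: int) -> bool:
    """Forward a stack event iff its priority >= the destination SBR's congestion level."""
    if not (0 <= event_priority <= 3 and 0 <= receiver_congestion_level <= 3):
        raise ValueError("priority and congestion level must both be in 0..3")
    return event_priority >= receiver_congestion_level
```

At CL0 every priority passes. At CL3 only P3 stack events are forwarded.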

Table 6-4 PCA-SBR Stack Event Priorities

Stack Event Category | Priority | Reasoning
Audit stack events | 0 | Audits get the lowest priority in the presence of overload.
Response stack events | 3 | Responses get the highest priority because the request has already been made.
Remove stack events | 0 (Audit), 3 (Call Processing) | If done for auditing, Remove gets the lowest priority. If part of call processing, Remove gets the highest priority because it is cleaning up data.
Update stack events | 2 | Falls under the category of in-session processing. Existing sessions/bindings are more important than new sessions/bindings.
Find stack events | 2 | Falls under the category of in-session processing. Existing sessions/bindings are more important than new sessions/bindings.
Create stack events | 1 | New sessions/bindings are lower priority than existing sessions/bindings, but higher priority than audit.
Query stack events | 1 | Query stack events are used for troubleshooting, so they have higher priority than audit but lower priority than most call processing stack events.
MITM RAR events | 0 (Query), 3 (Terminate) | If used for a query RAR, priority 0 is used. If used for a terminate RAR, priority 3 is used.
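Table 6-4 maps a stack event category to a fixed priority, except for Remove and MITM RAR, which depend on how the event is used. The sketch below encodes that lookup. The category strings and keyword arguments are hypothetical labels for the table's rows and contexts.

```python
# Categories whose priority does not depend on context (Table 6-4).
FIXED_PRIORITY = {
    "audit": 0,
    "response": 3,
    "update": 2,
    "find": 2,
    "create": 1,
    "query": 1,
}


def stack_event_priority(category: str, *, for_audit: bool = False,
                         terminate_rar: bool = False) -> int:
    """Return the Table 6-4 priority (0-3) for a stack event."""
    if category in FIXED_PRIORITY:
        return FIXED_PRIORITY[category]
    if category == "remove":
        # Audit removes are lowest priority; call-processing removes clean up data.
        return 0 if for_audit else 3
    if category == "mitm_rar":
        return 3 if terminate_rar else 0
    raise ValueError(f"unknown stack event category: {category}")
```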

In some scenarios, stack events may also be routed from one SBR to another. In this case, congestion control should be based on the congestion state of the receiving SBR. The ComAgent on the sending SBR is responsible for comparing the stack event priority with the congestion level of the receiving SBR and making the routing decision accordingly.
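In SBR-to-SBR routing, the sender's own congestion level plays no part in the decision; only the receiver's level matters. The sketch below models the sending ComAgent as keeping the last congestion level broadcast by each peer.

The class, the peer identifiers and the assumption that a silent peer is treated as CL0 are all hypothetical.

```python
class SendingComAgent:
    """Hypothetical model of the ComAgent on an SBR that forwards stack events to peer SBRs."""

    def __init__(self, local_level: int = 0):
        self.local_level = local_level  # the sender's own CL; deliberately ignored below
        self.peer_levels = {}           # peer id -> last congestion level it broadcast

    def on_peer_congestion(self, peer_id: str, level: int) -> None:
        self.peer_levels[peer_id] = level

    def may_route(self, event_priority: int, receiver_id: str) -> bool:
        # Assumption: a peer that has not reported a level is treated as CL0.
        receiver_level = self.peer_levels.get(receiver_id, 0)
        return event_priority >= receiver_level
```

Even a sender at CL3 may forward a P0 stack event to a peer at CL0. However, it must hold back a P1 stack event destined for a peer at CL2.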

Stack events that are triggered by Diameter messages with inviolable priorities have the highest priority among all stack events. This ensures that those Diameter messages are processed more favorably by SBR and PCA.

Four priority levels (P0, P1, P2, and P3) are used for stack event priority settings. PCA determines whether a stack event to be sent to an active SBR is associated with a priority message. If it is, the stack event is assigned the highest priority (P3); otherwise, its priority level is assigned based on the values shown in Table 6-4.
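The assignment rule overrides the table: a stack event triggered by a priority message is always P3, so it passes even at CL3. The sketch below shows the override on a subset of the call-processing priorities from Table 6-4. The category keys and function name are hypothetical.

```python
P3 = 3  # highest stack event priority

# Call-processing priorities from Table 6-4 (subset; keys are hypothetical labels).
TABLE_6_4 = {"create": 1, "query": 1, "update": 2, "find": 2, "response": 3}


def assign_priority(category: str, triggered_by_priority_message: bool) -> int:
    """P3 for stack events tied to priority (inviolable) messages, else the table value."""
    if triggered_by_priority_message:
        return P3
    return TABLE_6_4[category]
```

A Create stack event is normally P1 and would be discarded at CL2 or CL3. When it is triggered by a priority message it becomes P3 and is still sent at CL3.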

Load Shedding

After an SBR determines that it is in overload (CL1 to CL3), it informs ComAgent that its resources and sub-resources are congested. ComAgent then broadcasts this information to all resource users of the specified resources and sub-resources. The resource users then begin to shed load by sending only certain database requests. They determine which database requests to discard based on the current congestion level of the resource provider.
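The notification path is a fan-out: one provider report reaches every registered user of that resource. Each user records the provider's level for its later send decisions.

The sketch below models that publish/subscribe flow in-process. The classes, the callback-based registration and the resource names are hypothetical, and the sketch does not describe the ComAgent API.

```python
from typing import Callable, Dict, List

Callback = Callable[[str, int], None]


class ComAgentBroadcast:
    """Hypothetical in-process model of ComAgent fanning out a provider's congestion level."""

    def __init__(self) -> None:
        self._users: Dict[str, List[Callback]] = {}

    def register_user(self, resource: str, callback: Callback) -> None:
        self._users.setdefault(resource, []).append(callback)

    def report_congestion(self, resource: str, level: int) -> None:
        # Notify every user registered for this resource.
        for callback in self._users.get(resource, []):
            callback(resource, level)


class ResourceUser:
    """Remembers the latest congestion level of each resource it uses."""

    def __init__(self) -> None:
        self.provider_levels: Dict[str, int] = {}

    def on_congestion(self, resource: str, level: int) -> None:
        self.provider_levels[resource] = level
```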

Database requests are delivered to SBRs using ComAgent stack events, each of which has a priority. The resource user software (on either DA-MPs or SBRs) sets the priority of every stack event it sends, depending on the type of stack event and the circumstances under which it is used. For example, the same stack event may be used for signaling and for auditing, with a different priority in each case. To decide whether a stack event should be sent, its priority is compared with the congestion level of the target server, as shown in Table 6-5.

Table 6-5 Stack Event Load Shedding

Congestion Level | Description
CL0 | The resource provider is not congested. No load shedding occurs. Send all Stack Events.
CL1 | Minor congestion. Auditing is suspended. Send all Stack Events not related to auditing.
CL2 | Major congestion. No new bindings or sessions are created. Existing bindings and sessions are unaffected. Send only Stack Events related to existing sessions.
CL3 | Critical congestion. Send only Stack Events already started and Stack Events that remove sessions or bindings.
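Applying the rule "send if priority is at least the congestion level" to the priorities in Table 6-4 reproduces the descriptions in Table 6-5:
  • CL1 drops only the P0 audit traffic.
  • CL2 keeps only in-session work.
  • CL3 keeps only responses, call-processing removes and terminate RARs.

The sketch below checks this. The event labels are illustrative.

```python
# (label, priority) pairs taken from Table 6-4; labels are illustrative.
EVENTS = [
    ("audit", 0), ("remove (audit)", 0), ("mitm query RAR", 0),
    ("create", 1), ("query", 1),
    ("update", 2), ("find", 2),
    ("response", 3), ("remove (call processing)", 3), ("mitm terminate RAR", 3),
]


def sendable(congestion_level: int) -> list:
    """Stack events still sent to a provider at the given congestion level."""
    return [name for name, priority in EVENTS if priority >= congestion_level]
```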