9 OCNADD Alarms

This section provides information on all the alarms generated by the OCNADD.

Alarm Types

The following table lists the OCNADD alarm types and the alarm number range reserved for each type:

Table 9-1 Alarm Type

Alarm Type | Reason | Range
SECURITY | Security Violation | 1000-1999
COMMUNICATION | Communication Failure | 2000-2999
QOS | Quality Of Service | 3000-3999
PROCESSING_ERROR | Processing Error | 4000-4999
OPERATIONAL_ALARMS | Operational Alarms | 5000-5999
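
As an illustration of how the ranges in Table 9-1 partition the alarm numbers, the following minimal sketch maps an alarm number to its alarm type. The names ALARM_TYPE_RANGES and alarm_type_for are hypothetical and are not part of OCNADD; only the ranges themselves come from the table.

```python
# Ranges copied from Table 9-1. The names used here are illustrative only.
ALARM_TYPE_RANGES = {
    "SECURITY": range(1000, 2000),
    "COMMUNICATION": range(2000, 3000),
    "QOS": range(3000, 4000),
    "PROCESSING_ERROR": range(4000, 5000),
    "OPERATIONAL_ALARMS": range(5000, 6000),
}


def alarm_type_for(alarm_number: int) -> str:
    """Return the alarm type whose documented range contains alarm_number."""
    for alarm_type, numbers in ALARM_TYPE_RANGES.items():
        if alarm_number in numbers:
            return alarm_type
    raise ValueError(f"alarm number {alarm_number} is outside the documented ranges")
```

For example, alarm number 2003 falls in the 2000-2999 range and therefore resolves to COMMUNICATION.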

Note:

Alarm Purge or Clear Criteria:

A raised alarm persists in the database until it is cleared or purged, which happens when either of the following conditions is met:

  • The corresponding service sends a clear alarm request to the alarm service.
  • The configured purge alarm timeout expires. The default timeout value is 7 days.
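
The two criteria above can be sketched as a single check. This is a minimal illustration, not the alarm service's actual logic: the function name and parameters are hypothetical, and treating the timeout as reached when the alarm age is equal to or greater than the configured value is an assumption.

```python
from datetime import datetime, timedelta

# Default taken from the note above; everything else is illustrative.
DEFAULT_PURGE_TIMEOUT = timedelta(days=7)


def should_remove_alarm(
    raised_at: datetime,
    now: datetime,
    clear_requested: bool,
    purge_timeout: timedelta = DEFAULT_PURGE_TIMEOUT,
) -> bool:
    """Return True if the alarm is cleared by request or has aged out."""
    if clear_requested:
        return True  # the owning service sent a clear alarm request
    # Assumption: the alarm is purged once its age reaches the timeout.
    return (now - raised_at) >= purge_timeout
```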

OCNADD OIDs

OCNADD OIDs are listed below:

OCNADD OID: 1.3.6.1.4.1.323.5.3.51

Table 9-2 OCNADD OID

Name | Value
ocnaddconfiguration | 1.3.6.1.4.1.323.5.3.51.20
ocnaddscpaggregation, ocnaddnrfaggregation | 1.3.6.1.4.1.323.5.3.51.22
ocnaddalarm | 1.3.6.1.4.1.323.5.3.51.24
<appname>-adapter | 1.3.6.1.4.1.323.5.3.51.25
ocnaddgui | 1.3.6.1.4.1.323.5.3.51
ocnadduirouter | 1.3.6.1.4.1.323.5.3.51
ocnaddkafka | 1.3.6.1.4.1.323.5.3.51.27
ocnaddhealthmonitoring | 1.3.6.1.4.1.323.5.3.51.28
ocnaddsystem | 1.3.6.1.4.1.323.5.3.51.29
ocnaddadmin | 1.3.6.1.4.1.323.5.3.51.30
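
When filtering traps or alerts by OID, it can help to check whether an OID belongs to the OCNADD subtree. The sketch below is illustrative only (the function name is hypothetical). It compares the OID arc by arc, because a plain string prefix check would wrongly accept an OID such as 1.3.6.1.4.1.323.5.3.510.

```python
# Base OID taken from this section; the helper name is illustrative only.
OCNADD_BASE_OID = "1.3.6.1.4.1.323.5.3.51"


def is_under_ocnadd(oid: str) -> bool:
    """Return True if oid is the OCNADD base OID or lies beneath it."""
    base_arcs = OCNADD_BASE_OID.split(".")
    oid_arcs = oid.strip(".").split(".")
    # Compare whole arcs so that ...3.510 is not mistaken for ...3.51.x
    return oid_arcs[: len(base_arcs)] == base_arcs
```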

Alarm Details

Table 9-3 Alarm Information

Alarm Detail | Description
alarmName | The alarm name is constructed as OCNADDnnnnn (OCNADD followed by a five-digit number), for example OCNADD01000, where the number is the alarm number within the range defined for the alarm type.
alarmType | Type of the alarm [SECURITY, COMMUNICATION, QOS, PROCESSING_ERROR, OPERATIONAL_ALARMS]
alarmSeverity | Severity of the alarm, based on the alarm cause [CRITICAL, MAJOR, MINOR, WARN, INFO]
alarmDescription | Reports the specific problem for which the alarm is raised
additionalInfo | Optional. Provides additional troubleshooting and recovery steps that the user should perform when the alarm occurs
serviceName | Name of the service that raises the alarm
instance | Instance Id of the POD in which the alarm is raised
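
The fields in Table 9-3 can be pictured as a small record type. The following is a sketch for illustration only: the class, its snake_case attribute names, and the validation rules are assumptions derived from the table, not the alarm service's actual data model.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from Table 9-3; the pattern encodes "OCNADD" plus five digits.
ALARM_NAME_PATTERN = re.compile(r"OCNADD\d{5}")
ALARM_TYPES = {"SECURITY", "COMMUNICATION", "QOS", "PROCESSING_ERROR", "OPERATIONAL_ALARMS"}
ALARM_SEVERITIES = {"CRITICAL", "MAJOR", "MINOR", "WARN", "INFO"}


@dataclass(frozen=True)
class Alarm:
    """Hypothetical in-memory representation of one alarm record."""

    alarm_name: str
    alarm_type: str
    alarm_severity: str
    alarm_description: str
    service_name: str
    instance: str
    additional_info: Optional[str] = None  # optional per Table 9-3

    def __post_init__(self) -> None:
        if not ALARM_NAME_PATTERN.fullmatch(self.alarm_name):
            raise ValueError(f"alarm name must look like OCNADDnnnnn: {self.alarm_name!r}")
        if self.alarm_type not in ALARM_TYPES:
            raise ValueError(f"unknown alarm type: {self.alarm_type!r}")
        if self.alarm_severity not in ALARM_SEVERITIES:
            raise ValueError(f"unknown alarm severity: {self.alarm_severity!r}")
```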

Communication Failure Alarms

Table 9-4 Communication Failure Alarms

alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance (POD Instance Id) | Remarks
OCNADD02000: Loss of Connection | COMMUNICATION | MAJOR | Raise: Connection could not be established with the service <service_name>; Clear: Connection Established again for service <service_name> | - | ocnaddhealthmonitoring | - | -
OCNADD02001: Loss of Heartbeat | COMMUNICATION | MINOR | Raise: Missing heartbeat from service <service_name>; Clear: Heartbeat received from <service_name> | The heartbeat from a service is missed. | ocnaddhealthmonitoring | - | -
OCNADD02002: Service Down | COMMUNICATION | MAJOR | Raise: Service <service_name> is down; Clear: Service <service_name> is up | The service is not accessible. The configured number of consecutive heartbeats may have been missed, or the service did not connect after the configured number of retries. | All the services | - | Prometheus Alert
OCNADD02003: Kafka Broker Not Available | COMMUNICATION | CRITICAL | Raise: Service <service_name> is not able to connect to Kafka Broker; Clear: Service <service_name> is able to connect to Kafka again | - | ocnaddadminservice | - | -
OCNADD02004: Kafka Consumption Paused | COMMUNICATION | MINOR | Raise: Kafka consumption by service <service_name> paused; Clear: Kafka consumption by service <service_name> resumed | The service may have experienced connection timeouts or failures from the peer end, applied circuit breaking, and paused consumption from the Kafka topic. | ocnaddadminservice | - | -
OCNADD02005: ThirdParty Connection Failure | COMMUNICATION | MAJOR | Raise: Connection to third party is failed; Clear: Connection to third party is successful | Check connectivity to the third party from the server where the Egress adapter is deployed. | ocnaddconsumeradapter | - | -
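
The relationship between missed heartbeats (OCNADD02001) and a service being declared down (OCNADD02002) can be sketched as a simple counter. This is an illustration under stated assumptions, not the health monitoring implementation: the class name, the event strings, and the default of three consecutive misses are hypothetical, and the table only states that a configured number of consecutive heartbeats may have been missed.

```python
from typing import List


class HeartbeatTracker:
    """Tracks consecutive missed heartbeats for one service (illustrative)."""

    def __init__(self, max_missed: int = 3) -> None:  # 3 is an assumed value
        self.max_missed = max_missed
        self.missed = 0
        self.service_down = False

    def on_missed(self) -> List[str]:
        """Record one missed heartbeat and return the alarms to raise."""
        self.missed += 1
        events = []
        if self.missed == 1:
            events.append("RAISE OCNADD02001")  # Loss of Heartbeat
        if self.missed >= self.max_missed and not self.service_down:
            self.service_down = True
            events.append("RAISE OCNADD02002")  # Service Down
        return events

    def on_heartbeat(self) -> List[str]:
        """Record a received heartbeat and return the alarms to clear."""
        events = []
        if self.missed > 0:
            events.append("CLEAR OCNADD02001")
        if self.service_down:
            events.append("CLEAR OCNADD02002")
        self.missed = 0
        self.service_down = False
        return events
```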

Quality of Service Alarms

Table 9-5 Quality of Service Alarms

alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance (POD Instance Id) | Remarks
OCNADD03006: No Data Available | QOS | MINOR | Raise: No Data available on the Kafka Stream; Clear: Data received on the Kafka Stream | Check the connectivity between the producer and Kafka, and verify whether the producers are generating data. | ocnaddadminservice | - | -

Processing Error Alarms

Table 9-6 Processing Error Alarms

alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance (POD Instance Id) | Remarks
OCNADD04000: Out of Memory | PROCESSING_ERROR | MAJOR | Raise: Not enough memory available for service <service_name>; Clear: Memory Available to service <service_name> | - | All the services | - | -
OCNADD04002: CPU Overload | PROCESSING_ERROR | MAJOR | Raise: CPU usage crossed 70% for service <service_name>; Clear: CPU usage back to less than 70% for service <service_name> | - | All the services | - | Prometheus Alert
OCNADD04004: Storage full | PROCESSING_ERROR | MAJOR | Raise: Storage full for the service <service_name>; Clear: Storage available for the service <service_name> | - | ocnaddhealthmonitoring | - | -
OCNADD04005: Memory overload | PROCESSING_ERROR | MAJOR | Raise: Memory usage crossed 70% for service <service_name>; Clear: Memory usage back to less than 70% for service <service_name> | - | All the services | - | Prometheus Alert
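
The raise and clear conditions for the CPU Overload and Memory overload alarms follow the same pattern: raise when usage crosses 70%, clear when it falls below 70% again. The sketch below illustrates that pattern only. In OCNADD these alarms are reported as Prometheus alerts, so the class and method names here are hypothetical.

```python
from typing import Optional


class UsageThresholdAlarm:
    """Raise/clear state for a usage alarm such as OCNADD04002 (illustrative)."""

    def __init__(self, alarm_name: str, threshold_percent: float = 70.0) -> None:
        self.alarm_name = alarm_name
        self.threshold_percent = threshold_percent
        self.active = False

    def update(self, usage_percent: float) -> Optional[str]:
        """Return 'RAISE', 'CLEAR', or None when the state does not change."""
        if usage_percent > self.threshold_percent and not self.active:
            self.active = True
            return "RAISE"
        if usage_percent < self.threshold_percent and self.active:
            self.active = False
            return "CLEAR"
        return None  # includes usage exactly at the threshold
```

A reading exactly at 70% neither raises nor clears the alarm in this sketch, which matches the wording "crossed 70%" and "less than 70%".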

Operational Alarms

Table 9-7 Operational Alarms

alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance (POD Instance Id) | Remarks
OCNADD05001: POD Instance Created | OPERATIONAL_ALARM | INFO | New POD for the service <service_name> created or registered | - | ocnaddhealthmonitoring | - | -
OCNADD05002: POD Instance Destroyed | OPERATIONAL_ALARM | INFO | POD for the service <service_name> destroyed or de-registered | - | ocnaddhealthmonitoring | - | -
OCNADD05005: Max instances reached | OPERATIONAL_ALARM | INFO | Max instance reached for the service <service_name> | - | ocnaddhealthmonitoring | - | -
OCNADD05006: POD Restarted | OPERATIONAL_ALARM | MINOR | Raised by Prometheus when a POD for OCNADD has restarted | - | All services | - | Prometheus Alert
OCNADD05007: Ingress MPS Threshold crossed | OPERATIONAL_ALARM | WARN, MINOR, MAJOR, CRITICAL | The ingress MPS threshold crossed. WARN: 80%, MINOR: 90%, MAJOR: 95%, and CRITICAL: 100%. The threshold alerts are cleared when the traffic drops back below the configured threshold values. | - | Kafka Aggregation | - | Prometheus Alert
OCNADD05008: Egress MPS Threshold crossed | OPERATIONAL_ALARM | WARN, MINOR, MAJOR, CRITICAL | The egress MPS threshold crossed. WARN: 80%, MINOR: 90%, MAJOR: 95%, and CRITICAL: 100%. The threshold alerts are cleared when the traffic drops back below the configured threshold values. | - | ocnaddconsumeradapter | - | Prometheus Alert
OCNADD05009: Egress MPS Threshold crossed for a particular consumer application | OPERATIONAL_ALARM | CRITICAL | The egress MPS threshold crossed for a particular consumer. CRITICAL: 100%. The threshold alerts are cleared when the traffic drops back below the configured threshold values. | - | ocnaddconsumeradapter | - | Prometheus Alert
OCNADD05010: Average E2E latency threshold crossed | OPERATIONAL_ALARM | WARN, MINOR, MAJOR, CRITICAL | The average e2e latency threshold crossed. WARN: 80%, MINOR: 90%, MAJOR: 95%, and CRITICAL: 100%. The threshold alerts are cleared when the latency drops back below the configured threshold values. | - | ocnaddconsumeradapter | - | Prometheus Alert
OCNADD05011: Average Ingress Packet Drop rate threshold crossed | OPERATIONAL_ALARM | MAJOR, CRITICAL | The average ingress packet drop rate threshold crossed. MAJOR: 1% and CRITICAL: 10%. The threshold alerts are cleared when the packet drop rate drops back below the configured threshold values. | - | Kafka Aggregation | - | Prometheus Alert
OCNADD05012: Average Egress failure rate threshold crossed | OPERATIONAL_ALARM | INFO, WARN, MINOR, MAJOR, CRITICAL | The egress failure rate threshold crossed. WARN: 1%, MINOR: 10%, MAJOR: 25%, and CRITICAL: 50%. The threshold alerts are cleared when the failure rate drops back below the configured threshold values. | - | ocnaddconsumeradapter | - | Prometheus Alert
OCNADD05013: Ingress Traffic spike threshold crossed | OPERATIONAL_ALARM | MAJOR | The ingress traffic spike threshold crossed. MAJOR: 10%. Clear: The threshold alerts are cleared when the traffic spike drops back below the configured threshold values. | - | Kafka Aggregation | - | Prometheus Alert
OCNADD05014: Topic unavailable | OPERATIONAL_ALARM | MAJOR | Raise: <TopicName> topic is not available; Clear: <TopicName> topic is available | Create the <TopicName> topic in Kafka from the Admin service. | ocnaddconsumeradapter, ocnaddaggregation | - | -
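
The graded MPS thresholds (WARN: 80%, MINOR: 90%, MAJOR: 95%, CRITICAL: 100%) can be read as a mapping from utilization to severity. The sketch below is illustrative only. It assumes that the percentages are relative to a configured maximum MPS and that a threshold counts as crossed when utilization is equal to or above it; neither detail is stated in the table, and the actual alerts are evaluated by Prometheus.

```python
from typing import Optional

# Percentages from the MPS threshold alarms above, ordered highest first.
MPS_THRESHOLDS = [
    ("CRITICAL", 100.0),
    ("MAJOR", 95.0),
    ("MINOR", 90.0),
    ("WARN", 80.0),
]


def mps_severity(current_mps: float, max_mps: float) -> Optional[str]:
    """Return the highest severity whose threshold is reached, or None."""
    if max_mps <= 0:
        raise ValueError("max_mps must be positive")
    utilization = 100.0 * current_mps / max_mps
    for severity, threshold in MPS_THRESHOLDS:
        if utilization >= threshold:  # assumption: reaching the value counts as crossing
            return severity
    return None  # below the WARN threshold, so no alert
```

With an assumed maximum of 10000 MPS, a rate of 9200 MPS is 92% utilization and maps to MINOR.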