9 OCNADD Alarms
This section provides information on all the alarms generated by the OCNADD.
Alarm Types
The following table lists the OCNADD alarm types and their number ranges:
Table 9-1 Alarm Type
Alarm Type | Reason | Range |
---|---|---|
SECURITY | Security Violation | 1000-1999 |
COMMUNICATION | Communication Failure | 2000-2999 |
QOS | Quality Of Service | 3000-3999 |
PROCESSING_ERROR | Processing Error | 4000-4999 |
OPERATIONAL_ALARMS | Operational Alarms | 5000-5999 |
Note:
Alarm Purge or Clear Criteria: A raised alarm persists in the database and is cleared or purged when either of the following conditions is met:
- The corresponding service sends a clear alarm request to the alarm service.
- The alarm is purged after the expiry of the configured purge alarm timeout. The default timeout value is 7 days.
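The two clear-or-purge conditions above can be sketched as follows. This is a minimal illustrative model, assuming a simple in-memory store; the names `AlarmStore`, `raise_alarm`, and `purge_expired` are hypothetical and not part of the OCNADD API.

```python
from datetime import datetime, timedelta

PURGE_TIMEOUT = timedelta(days=7)  # default purge alarm timeout

class AlarmStore:
    """Illustrative store: a raised alarm persists until cleared or purged."""

    def __init__(self):
        self.alarms = {}  # alarmName -> timestamp when the alarm was raised

    def raise_alarm(self, name, now):
        self.alarms[name] = now

    def clear_alarm(self, name):
        # Condition 1: the owning service sends a clear alarm request.
        self.alarms.pop(name, None)

    def purge_expired(self, now):
        # Condition 2: the alarm outlives the configured purge timeout.
        expired = [n for n, t in self.alarms.items() if now - t >= PURGE_TIMEOUT]
        for n in expired:
            del self.alarms[n]
        return expired
```

For example, an alarm raised 10 days ago is purged on the next `purge_expired` pass even if its service never sent a clear request.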
OCNADD OIDs
OCNADD OIDs are listed below:
OCNADD OID: 1.3.6.1.4.1.323.5.3.51
Table 9-2 OCNADD OID
Name | Value |
---|---|
ocnaddconfiguration | 1.3.6.1.4.1.323.5.3.51.20 |
ocnaddscpaggregation, ocnaddnrfaggregation | 1.3.6.1.4.1.323.5.3.51.22 |
<appname>-egw | 1.3.6.1.4.1.323.5.3.51.23 |
ocnaddalarm | 1.3.6.1.4.1.323.5.3.51.24 |
<appname>-adapter | 1.3.6.1.4.1.323.5.3.51.25 |
ocnaddgui | 1.3.6.1.4.1.323.5.3.51 |
ocnaddbackendrouter | 1.3.6.1.4.1.323.5.3.51 |
ocnaddkafka | 1.3.6.1.4.1.323.5.3.51.27 |
ocnaddhealthmonitoring | 1.3.6.1.4.1.323.5.3.51.28 |
ocnaddsystem | 1.3.6.1.4.1.323.5.3.51.29 |
ocnaddadmin | 1.3.6.1.4.1.323.5.3.51.30 |
Alarm Details
Table 9-3 Alarm Information
Alarm Detail | Description |
---|---|
alarmName | The alarm name is constructed as OCNADDnnnnn, that is, OCNADD followed by a five-digit number. For example, OCNADD01000, where the number is the alarm number for the defined alarm type. |
alarmType | Type of alarm. The supported types are SECURITY, COMMUNICATION, QOS, PROCESSING_ERROR, and OPERATIONAL_ALARMS. |
alarmSeverity | Severity of the alarm based on its cause. The supported severity levels are CRITICAL, MAJOR, MINOR, WARN, and INFO. |
alarmDescription | The alarm description, which reports the specific problem for which the alarm is raised. |
additionalInfo | An optional field that provides additional troubleshooting and recovery steps that the user should perform when the alarm occurs. |
serviceName | Name of the service that raises the alarm |
instance | Instance Id of the POD in which the alarm is raised |
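The fields in Table 9-3 can be modeled as a simple record. This is an illustrative sketch only: the `Alarm` dataclass and `alarm_name` helper are assumptions for clarity, not OCNADD code; only the field names, the OCNADDnnnnn naming rule, and the enumerated values come from the table above.

```python
from dataclasses import dataclass

ALARM_TYPES = {"SECURITY", "COMMUNICATION", "QOS",
               "PROCESSING_ERROR", "OPERATIONAL_ALARMS"}
SEVERITIES = {"CRITICAL", "MAJOR", "MINOR", "WARN", "INFO"}

def alarm_name(number: int) -> str:
    """Build the alarm name: OCNADD followed by a five-digit alarm number."""
    return f"OCNADD{number:05d}"

@dataclass
class Alarm:
    """One alarm record, with the fields described in Table 9-3."""
    alarmName: str
    alarmType: str
    alarmSeverity: str
    alarmDescription: str
    serviceName: str       # service that raises the alarm
    instance: str          # POD instance ID in which the alarm is raised
    additionalInfo: str = ""  # optional troubleshooting/recovery hints

    def __post_init__(self):
        assert self.alarmType in ALARM_TYPES
        assert self.alarmSeverity in SEVERITIES
```

For example, `alarm_name(1000)` yields `OCNADD01000`, matching the example in the table.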
Communication Failure Alarms
Table 9-4 Communication Failure Alarms
alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance(POD Instance Id) | Remarks |
---|---|---|---|---|---|---|---|
OCNADD02000: Loss of Connection | COMMUNICATION | MAJOR | Alarm is raised when a connection cannot be established with the service identified by <service_name>. Alarm is cleared when the connection is established with the service identified by <service_name>. | | All the services | | |
OCNADD02001: Loss of Heartbeat | COMMUNICATION | MINOR | Alarm is raised when the heartbeat is missing from the service identified by <service_name>. Alarm is cleared when the heartbeat is received from the service identified by <service_name>. | The heartbeat from a service is missed | ocnaddhealthmonitoring | | |
OCNADD02002: Service Down | COMMUNICATION | MAJOR | Alarm is raised when the service is down. Alarm is cleared when the service is up. | The service is not accessible. The configured number of continuous heartbeats may have been missed, or the service is not connected after the configured number of retries | All the services | | Prometheus Alert and Healthmonitoring |
OCNADD02003: Kafka Broker Not Available | COMMUNICATION | MAJOR | Alarm is raised when the service is unable to connect to the Kafka broker. Alarm is cleared when the service is able to connect to the Kafka broker. | | ocnaddadminservice, ocnaddnrfaggregation, ocnaddscpaggregation, <appname>-adapter | | |
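The escalation from Loss of Heartbeat (OCNADD02001, MINOR) to Service Down (OCNADD02002, MAJOR) can be sketched as below. This is an assumed model: the function name and the `max_missed` value are illustrative; the table only states that a configured number of continuous heartbeats may have been missed.

```python
def heartbeat_alarms(missed_count: int, max_missed: int = 3):
    """Map consecutive missed heartbeats to the communication alarms above.

    max_missed is an assumed configuration value, not an OCNADD default.
    Returns a list of (alarmName, alarmSeverity) tuples to raise.
    """
    alarms = []
    if missed_count >= 1:
        alarms.append(("OCNADD02001", "MINOR"))   # Loss of Heartbeat
    if missed_count >= max_missed:
        alarms.append(("OCNADD02002", "MAJOR"))   # Service Down
    return alarms
```

A single missed heartbeat raises only the MINOR alarm; once the configured limit of continuous misses is reached, the MAJOR Service Down alarm is raised as well.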
Quality of Service Alarms
Table 9-5 Quality of Service Alarms
alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance(POD Instance Id) | Remarks |
---|---|---|---|---|---|---|---|
OCNADD03006: No Data Available | QOS | MINOR | Alarm is raised when no data is available on the Kafka stream. Alarm is cleared when data is received on the Kafka stream. | | <appName>-adapter, ocnaddnrfaggregation, ocnaddscpaggregation | | |
Processing Error Alarms
Table 9-6 Processing Error Alarms
alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance(POD Instance Id) | Remarks |
---|---|---|---|---|---|---|---|
OCNADD04000: Out of Memory | PROCESSING_ERROR | MAJOR | Alarm is raised when sufficient memory is unavailable for the service. Alarm is cleared when sufficient memory is available for the service. | | All the services | | |
OCNADD04002: CPU Overload | PROCESSING_ERROR | MAJOR | Alarm is raised when CPU usage has crossed 70% for the service. Alarm is cleared when CPU usage is back below 70% for the service. | | All the services | | Prometheus Alert |
OCNADD04004: Storage Full | PROCESSING_ERROR | MAJOR | Alarm is raised when storage is full for the service. Alarm is cleared when storage is available for the service. | | ocnaddhealthmonitoring | | |
OCNADD04005: Memory Overload | PROCESSING_ERROR | MAJOR | Alarm is raised when memory usage has crossed 70% for the service. Alarm is cleared when memory usage is back below 70% for the service. | | All the services | | Prometheus Alert |
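The 70% raise/clear rule shared by OCNADD04002 (CPU Overload) and OCNADD04005 (Memory Overload) can be sketched as a single check. The function name is illustrative; only the 70% threshold and the alarm numbers come from the table above.

```python
CPU_MEM_THRESHOLD = 70.0  # percent, per OCNADD04002 / OCNADD04005

def resource_alarm(metric: str, usage_percent: float):
    """Return the raise/clear decision for the 70% CPU or memory alarms."""
    name = {"cpu": "OCNADD04002", "memory": "OCNADD04005"}[metric]
    action = "raise" if usage_percent > CPU_MEM_THRESHOLD else "clear"
    return (name, action)
```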
Operational Alarms
Table 9-7 Operational Alarms
alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance(POD Instance Id) | Remarks |
---|---|---|---|---|---|---|---|
OCNADD05001: POD Instance Created | OPERATIONAL_ALARMS | INFO | New POD for the service <service_name> created or registered | | ocnaddhealthmonitoring | | |
OCNADD05002: POD Instance Destroyed | OPERATIONAL_ALARMS | INFO | POD for the service <service_name> destroyed or de-registered | | ocnaddhealthmonitoring | | |
OCNADD05003: Partition Added | OPERATIONAL_ALARMS | INFO | New partition created for the topic <topic_name> | | ocnaddadmin | | |
OCNADD05004: Topic Added | OPERATIONAL_ALARMS | INFO | New topic <topic_name> created | | ocnaddadmin | | |
OCNADD05005: Max instances reached | OPERATIONAL_ALARMS | INFO | Max instances reached for the service <service_name> | | ocnaddhealthmonitoring | | |
OCNADD05006: POD Restarted | OPERATIONAL_ALARMS | MINOR | Raised by Prometheus when an OCNADD POD has restarted | | All services | | Prometheus Alert |
OCNADD05007: Ingress MPS Threshold crossed | OPERATIONAL_ALARMS | WARN, MINOR, MAJOR, CRITICAL | The ingress MPS threshold is crossed: WARN: 80%, MINOR: 90%, MAJOR: 95%, and CRITICAL: 100%. The threshold alerts are cleared when the traffic goes back below the set threshold alert values. | | Kafka Aggregation | | Prometheus Alert |
OCNADD05008: Egress MPS Threshold crossed | OPERATIONAL_ALARMS | WARN, MINOR, MAJOR, CRITICAL | The egress MPS threshold is crossed: WARN: 80%, MINOR: 90%, MAJOR: 95%, and CRITICAL: 100%. The threshold alerts are cleared when the traffic goes back below the set threshold alert values. | | Egress Gateway | | Prometheus Alert |
OCNADD05009: Egress MPS Threshold crossed for a particular consumer application | OPERATIONAL_ALARMS | CRITICAL | The egress MPS threshold is crossed for a particular consumer: CRITICAL: 100%. The threshold alerts are cleared when the traffic goes back below the set threshold alert values. | | Egress Gateway | | Prometheus Alert |
OCNADD05010: Average E2E latency threshold crossed | OPERATIONAL_ALARMS | WARN, MINOR, MAJOR, CRITICAL | The average E2E latency threshold is crossed: WARN: 80%, MINOR: 90%, MAJOR: 95%, and CRITICAL: 100%. The threshold alerts are cleared when the latency goes back below the set threshold alert values. | | Egress Gateway | | Prometheus Alert |
OCNADD05011: Average Ingress Packet Drop rate threshold crossed | OPERATIONAL_ALARMS | MAJOR, CRITICAL | The average ingress packet drop rate threshold is crossed: MAJOR: 1% and CRITICAL: 10%. The threshold alerts are cleared when the packet drop rate goes back below the set threshold alert values. | | Kafka Aggregation | | Prometheus Alert |
OCNADD05012: Average Egress Gateway failure rate threshold crossed | OPERATIONAL_ALARMS | INFO, WARN, MINOR, MAJOR, CRITICAL | The egress failure rate threshold is crossed: CRITICAL: 100%. The threshold alerts are cleared when the failure rate goes back below the set threshold alert values. | | Egress Gateway | | Prometheus Alert |
OCNADD05013: Ingress Traffic spike threshold crossed | OPERATIONAL_ALARMS | MAJOR | The ingress traffic spike threshold is crossed: MAJOR: 10%. The threshold alerts are cleared when the traffic spike goes back below the set threshold alert values. | | Kafka Aggregation | | Prometheus Alert |
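The severity ladder shared by the ingress and egress MPS threshold alarms (OCNADD05007 and OCNADD05008) can be sketched as below. The function name and the rated-capacity parameter are illustrative assumptions; only the WARN 80%, MINOR 90%, MAJOR 95%, and CRITICAL 100% thresholds come from the table above.

```python
# Severity ladder for the ingress/egress MPS threshold alarms
# (OCNADD05007 / OCNADD05008), checked from highest to lowest.
MPS_THRESHOLDS = [
    (100.0, "CRITICAL"),
    (95.0, "MAJOR"),
    (90.0, "MINOR"),
    (80.0, "WARN"),
]

def mps_severity(current_mps: float, rated_mps: float):
    """Return the highest severity whose threshold the current load crosses,
    or None when traffic is back below all thresholds (alert cleared)."""
    utilization = 100.0 * current_mps / rated_mps
    for threshold, severity in MPS_THRESHOLDS:
        if utilization >= threshold:
            return severity
    return None
```

For example, 96% of rated MPS maps to MAJOR, and dropping back to 50% clears the alert (returns None).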