9 OCNADD Alarms
This section provides information on all the alarms generated by the OCNADD.
Alarm Types
The following table lists the OCNADD alarm types and their number ranges:
Table 9-1 Alarm Type
| Alarm Type | Reason | Range |
|---|---|---|
| SECURITY | Security Violation | 1000-1999 |
| COMMUNICATION | Communication Failure | 2000-2999 |
| QOS | Quality Of Service | 3000-3999 |
| PROCESSING_ERROR | Processing Error | 4000-4999 |
| OPERATIONAL_ALARMS | Operational Alarms | 5000-5999 |
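
The ranges in Table 9-1 can be used to derive an alarm's type from its number. The following is a minimal sketch, not part of OCNADD itself; the names `ALARM_RANGES` and `alarm_type_for` are hypothetical and only the ranges come from the table.

```python
# Ranges copied from Table 9-1. The names below are illustrative assumptions.
ALARM_RANGES = {
    "SECURITY": range(1000, 2000),
    "COMMUNICATION": range(2000, 3000),
    "QOS": range(3000, 4000),
    "PROCESSING_ERROR": range(4000, 5000),
    "OPERATIONAL_ALARMS": range(5000, 6000),
}


def alarm_type_for(number: int) -> str:
    """Return the alarm type whose range contains the alarm number."""
    for alarm_type, numbers in ALARM_RANGES.items():
        if number in numbers:
            return alarm_type
    raise ValueError(f"alarm number {number} is outside 1000-5999")


print(alarm_type_for(2003))  # COMMUNICATION
```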
Note:
Alarm Purge or Clear Criteria: A raised alarm persists in the database and is cleared or purged when either of the following conditions is met:
- The corresponding service sends a clear alarm request to the alarm service.
- The configured purge alarm timeout expires. The default timeout value is 7 days.
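The two conditions in the note above can be expressed as a single check. This is a sketch under stated assumptions: the function name `should_remove` and its parameters are hypothetical, and treating an alarm as expired once its age equals the timeout (rather than strictly exceeds it) is an assumption, since the note does not specify the boundary.

```python
from datetime import datetime, timedelta, timezone

# Default purge timeout stated in the note (7 days).
DEFAULT_PURGE_TIMEOUT = timedelta(days=7)


def should_remove(raised_at: datetime, now: datetime, clear_requested: bool,
                  purge_timeout: timedelta = DEFAULT_PURGE_TIMEOUT) -> bool:
    """True when the service requested a clear or the purge timeout has expired."""
    expired = (now - raised_at) >= purge_timeout  # boundary handling is assumed
    return clear_requested or expired


raised = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(should_remove(raised, raised + timedelta(days=8), clear_requested=False))  # True
print(should_remove(raised, raised + timedelta(days=1), clear_requested=False))  # False
```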
OCNADD OIDs
OCNADD OIDs are listed below:
OCNADD OID: 1.3.6.1.4.1.323.5.3.51
Table 9-2 OCNADD OID
| Name | Value |
|---|---|
| ocnaddconfiguration | 1.3.6.1.4.1.323.5.3.51.20 |
| ocnaddscpaggregation, ocnaddnrfaggregation | 1.3.6.1.4.1.323.5.3.51.22 |
| ocnaddalarm | 1.3.6.1.4.1.323.5.3.51.24 |
| <appname>-adapter | 1.3.6.1.4.1.323.5.3.51.25 |
| ocnaddgui | 1.3.6.1.4.1.323.5.3.51 |
| ocnadduirouter | 1.3.6.1.4.1.323.5.3.51 |
| ocnaddkafka | 1.3.6.1.4.1.323.5.3.51.27 |
| ocnaddhealthmonitoring | 1.3.6.1.4.1.323.5.3.51.28 |
| ocnaddsystem | 1.3.6.1.4.1.323.5.3.51.29 |
| ocnaddadmin | 1.3.6.1.4.1.323.5.3.51.30 |
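
When filtering SNMP notifications, it can help to check whether an OID belongs to the OCNADD subtree. The sketch below assumes nothing beyond the base OID listed above; the helper names `parse_oid` and `is_under_base` are hypothetical.

```python
# Base OID as listed in this section.
OCNADD_BASE_OID = "1.3.6.1.4.1.323.5.3.51"


def parse_oid(oid: str) -> tuple:
    """Split a dotted OID string into a tuple of integer arcs."""
    return tuple(int(part) for part in oid.split("."))


def is_under_base(oid: str, base: str = OCNADD_BASE_OID) -> bool:
    """True when oid equals base or is a descendant of it."""
    oid_arcs, base_arcs = parse_oid(oid), parse_oid(base)
    return oid_arcs[:len(base_arcs)] == base_arcs


print(is_under_base("1.3.6.1.4.1.323.5.3.51.24"))  # True
print(is_under_base("1.3.6.1.4.1.323.5.3.510"))    # False
```

Comparing integer arcs rather than raw strings avoids the false match that a plain string-prefix test would give for the second OID.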
Alarm Details
Table 9-3 Alarm Information
| Alarm Detail | Description |
|---|---|
| alarmName | The alarm name is constructed as OCNADDnnnnn (OCNADD followed by a five-digit number), for example OCNADD01000, where the number is the alarm number for the defined alarm type. |
| alarmType | Type of alarm [SECURITY, COMMUNICATION, QOS, PROCESSING_ERROR, OPERATIONAL_ALARMS] |
| alarmSeverity | Severity of alarms as per the alarm cause [CRITICAL, MAJOR, MINOR, WARN, INFO] |
| alarmDescription | Reports the specific problem for which the alarm is raised |
| additionalInfo | Optional. Provides additional troubleshooting and recovery steps that the user should perform when the alarm occurs |
| serviceName | Name of the service that raises the alarm |
| instance | Instance Id of the POD in which the alarm is raised |
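
The fields in Table 9-3 can be modeled as a simple record. This is an illustrative sketch only: the `Alarm` class and the `alarm_name` helper are hypothetical and do not describe the alarm service's actual schema or API. Field names follow the table, and the name format follows the OCNADDnnnnn rule.

```python
from dataclasses import dataclass


def alarm_name(number: int) -> str:
    """Build OCNADDnnnnn: 'OCNADD' plus the alarm number zero-padded to five digits."""
    if not 0 <= number <= 99999:
        raise ValueError("alarm number must fit in five digits")
    return f"OCNADD{number:05d}"


@dataclass
class Alarm:
    alarmName: str
    alarmType: str         # SECURITY, COMMUNICATION, QOS, PROCESSING_ERROR, OPERATIONAL_ALARMS
    alarmSeverity: str     # CRITICAL, MAJOR, MINOR, WARN, INFO
    alarmDescription: str
    serviceName: str
    instance: str
    additionalInfo: str = ""  # optional per Table 9-3


# Example values are made up for illustration.
alarm = Alarm(
    alarmName=alarm_name(2000),
    alarmType="COMMUNICATION",
    alarmSeverity="MAJOR",
    alarmDescription="Connection could not be established with the service example-service",
    serviceName="ocnaddhealthmonitoring",
    instance="example-pod-instance-id",
)
print(alarm.alarmName)  # OCNADD02000
```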
Communication Failure Alarms
Table 9-4 Communication Failure Alarms
| alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance(POD Instance Id) | Remarks |
|---|---|---|---|---|---|---|---|
| OCNADD02000: Loss of Connection | COMMUNICATION | MAJOR | Raise: Connection could not be established with the service <service_name>. Clear: Connection established again for service <service_name>. | | ocnaddhealthmonitoring | | |
| OCNADD02001: Loss of Heartbeat | COMMUNICATION | MINOR | Raise: Missing heartbeat from service <service_name>. Clear: Heartbeat received from <service_name>. | The heartbeat from a service is missed. | ocnaddhealthmonitoring | | |
| OCNADD02002: Service Down | COMMUNICATION | MAJOR | Raise: Service <service_name> is down. Clear: Service <service_name> is up. | The service is not accessible. The configured number of consecutive heartbeats may have been missed, or the service is not connected after the configured number of retries. | All the services | | Prometheus Alert |
| OCNADD02003: Kafka Broker Not Available | COMMUNICATION | CRITICAL | Raise: Service <service_name> is not able to connect to Kafka Broker. Clear: Service <service_name> is able to connect to Kafka again. | | ocnaddadminservice | | |
| OCNADD02004: Kafka Consumption Paused | COMMUNICATION | MINOR | Raise: Kafka consumption by service <service_name> paused. Clear: Kafka consumption by service <service_name> resumed. | The service may have experienced connection timeouts or failures from the peer end, applied circuit breaking, and paused consumption from the Kafka topic. | ocnaddadminservice | | |
| OCNADD02005: ThirdParty Connection Failure | COMMUNICATION | MAJOR | Raise: Connection to third party failed. Clear: Connection to third party is successful. | Check connectivity to the third party from the server where the Egress adapter is deployed. | ocnaddconsumeradapter | | |
Quality of Service Alarms
Table 9-5 Quality of Service Alarms
| alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance(POD Instance Id) | Remarks |
|---|---|---|---|---|---|---|---|
| OCNADD03006: No Data Available | QOS | MINOR | Raise: No Data available on the Kafka Stream. Clear: Data received on the Kafka Stream. | Check the connectivity between the producer and Kafka, and verify whether the producers are generating data. | ocnaddadminservice | | |
Processing Error Alarms
Table 9-6 Processing Error Alarms
| alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance(POD Instance Id) | Remarks |
|---|---|---|---|---|---|---|---|
| OCNADD04000: Out of Memory | PROCESSING_ERROR | MAJOR | Raise: Not enough memory available for service <service_name>. Clear: Memory available to service <service_name>. | | All the services | | |
| OCNADD04002: CPU Overload | PROCESSING_ERROR | MAJOR | Raise: CPU usage crossed 70% for service <service_name>. Clear: CPU usage back to less than 70% for service <service_name>. | | All the services | | Prometheus Alert |
| OCNADD04004: Storage full | PROCESSING_ERROR | MAJOR | Raise: Storage full for the service <service_name>. Clear: Storage available for the service <service_name>. | | ocnaddhealthmonitoring | | |
| OCNADD04005: Memory overload | PROCESSING_ERROR | MAJOR | Raise: Memory usage crossed 70% for service <service_name>. Clear: Memory usage back to less than 70% for service <service_name>. | | All the services | | Prometheus Alert |
Operational Alarms
Table 9-7 Operational Alarms
| alarmName | alarmType | alarmSeverity | alarmDescription | additionalInfo | serviceName | instance(POD Instance Id) | Remarks |
|---|---|---|---|---|---|---|---|
| OCNADD05001: POD Instance Created | OPERATIONAL_ALARM | INFO | New POD for the service <service_name> created or registered | | ocnaddhealthmonitoring | | |
| OCNADD05002: POD Instance Destroyed | OPERATIONAL_ALARM | INFO | POD for the service <service_name> destroyed or de-registered | | ocnaddhealthmonitoring | | |
| OCNADD05005: Max instances reached | OPERATIONAL_ALARM | INFO | Max instances reached for the service <service_name> | | ocnaddhealthmonitoring | | |
| OCNADD05006: POD Restarted | OPERATIONAL_ALARM | MINOR | Raised by Prometheus when a POD for OCNADD has restarted | | All services | | Prometheus Alert |
| OCNADD05007: Ingress MPS Threshold crossed | OPERATIONAL_ALARM | WARN, MINOR, MAJOR, CRITICAL | The ingress MPS threshold is crossed. WARN: 80%, MINOR: 90%, MAJOR: 95%, and CRITICAL: 100%. The threshold alerts are cleared when the traffic drops back below the configured threshold alert values. | | Kafka Aggregation | | Prometheus Alert |
| OCNADD05008: Egress MPS Threshold crossed | OPERATIONAL_ALARM | WARN, MINOR, MAJOR, CRITICAL | The egress MPS threshold is crossed. WARN: 80%, MINOR: 90%, MAJOR: 95%, and CRITICAL: 100%. The threshold alerts are cleared when the traffic drops back below the configured threshold alert values. | | ocnaddconsumeradapter | | Prometheus Alert |
| OCNADD05009: Egress MPS Threshold crossed for a particular consumer application | OPERATIONAL_ALARM | CRITICAL | The egress MPS threshold is crossed for a particular consumer. CRITICAL: 100%. The threshold alerts are cleared when the traffic drops back below the configured threshold alert values. | | ocnaddconsumeradapter | | Prometheus Alert |
| OCNADD05010: Average E2E latency threshold crossed | OPERATIONAL_ALARM | WARN, MINOR, MAJOR, CRITICAL | The average E2E latency threshold is crossed. WARN: 80%, MINOR: 90%, MAJOR: 95%, and CRITICAL: 100%. The threshold alerts are cleared when the latency drops back below the configured threshold alert values. | | ocnaddconsumeradapter | | Prometheus Alert |
| OCNADD05011: Average Ingress Packet Drop rate threshold crossed | OPERATIONAL_ALARM | MAJOR, CRITICAL | The average ingress packet drop rate threshold is crossed. MAJOR: 1% and CRITICAL: 10%. The threshold alerts are cleared when the packet drop rate drops back below the configured threshold alert values. | | Kafka Aggregation | | Prometheus Alert |
| OCNADD05012: Average Egress failure rate threshold crossed | OPERATIONAL_ALARM | INFO, WARN, MINOR, MAJOR, CRITICAL | The egress failure rate threshold is crossed. WARN: 1%, MINOR: 10%, MAJOR: 25%, and CRITICAL: 50%. The threshold alerts are cleared when the failure rate drops back below the configured threshold alert values. | | ocnaddconsumeradapter | | Prometheus Alert |
| OCNADD05013: Ingress Traffic spike threshold crossed | OPERATIONAL_ALARM | MAJOR | The ingress traffic spike threshold is crossed. MAJOR: 10%. The threshold alerts are cleared when the traffic spike drops back below the configured threshold alert values. | | Kafka Aggregation | | Prometheus Alert |
| OCNADD05014: Topic unavailable | OPERATIONAL_ALARM | MAJOR | Raise: <TopicName> topic is not available. Clear: <TopicName> topic is available. | Create the <TopicName> topic in Kafka from the Admin service. | ocnaddconsumeradapter, ocnaddaggregation | | |
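
The MPS threshold levels listed for OCNADD05007 and OCNADD05008 in Table 9-7 can be read as a mapping from utilization to severity. The sketch below is an illustration, not the actual Prometheus alert rules: the names `MPS_LEVELS` and `mps_severity` are hypothetical, and it assumes that the percentages are measured against a configured maximum MPS and that a level is crossed once utilization reaches it.

```python
# Levels from Table 9-7 for the ingress and egress MPS alarms, highest first.
MPS_LEVELS = [(100, "CRITICAL"), (95, "MAJOR"), (90, "MINOR"), (80, "WARN")]


def mps_severity(current_mps: float, configured_max_mps: float):
    """Return the highest severity whose level is reached, or None below 80%."""
    percent = 100.0 * current_mps / configured_max_mps
    for level, severity in MPS_LEVELS:
        if percent >= level:  # "reached" vs "exceeded" is an assumption
            return severity
    return None  # below every level: no alert, or an existing alert clears


print(mps_severity(8500, 10000))   # WARN
print(mps_severity(9500, 10000))   # MAJOR
print(mps_severity(7000, 10000))   # None
```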