SNMP Poller

Overview

The SNMP Poller microservice performs SNMP device discovery for discovery jobs created through the discovery-service's discovery RESTful API, and executes periodic SNMP polling once devices have been provisioned in Unified Assurance through the discovery-service's inventory RESTful API.

The microservice is designed around a Controller-Worker architecture composed of two components: a single coordinator and multiple worker instances. The coordinator manages the workers, publishes metrics, and calculates and coordinates SNMP workloads among its workers, while the workers execute those workloads and publish their results to the appropriate microservice pipelines through the Apache Pulsar bus.

The microservice is expected to run in a separate microservices cluster for each Device Zone, alongside other mandatory microservices. See item 3 of the Prerequisites section below for more details.

Prerequisites

  1. A microservices cluster must be setup. Refer to Microservice Cluster Setup.

  2. Apache Pulsar must be installed. Refer to Apache Pulsar microservice.

  3. The core microservices required for the Device Zone cluster must be installed.

Setup

# Run as the assure1 user
su - assure1
# Set the target namespace and the image registry (Primary Presentation Web FQDN)
export NAMESPACE=a1-zone1-pri
export WEBFQDN=<Primary Presentation Web FQDN>
# Install the SNMP Poller microservice
a1helm install snmp-poller assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN
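
To verify the rollout, you can list the pods in the zone namespace. This is a minimal sketch that assumes kubectl access to the cluster from the installation host; the exact pod names depend on the chart.

# Hypothetical check: list the pods in the namespace and confirm the snmp-poller coordinator and worker pods are running
kubectl get pods -n $NAMESPACE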

Default Global Configuration

Name Default Value Possible Values Notes
LOG_LEVEL INFO FATAL, ERROR, WARN, INFO, DEBUG Global logging level applied to both the coordinator and workers.
PORT_COORDINATOR 38890 Integer Internal port used by the coordinator service.
PORT_WORKER 38891 Integer Internal port used by the worker service.

The above configurations can be changed by passing values to a1helm install prefixed with configData.

Example of setting the logging level to DEBUG for both coordinator and worker

a1helm install ... --set configData.LOG_LEVEL=DEBUG

Default Coordinator-Only Configuration

Name Default Value Possible Values Notes
LOG_LEVEL INFO FATAL, ERROR, WARN, INFO, DEBUG Coordinator logging level. This overrides the global configuration.
GRPC_FALLBACK_USE_IP true Text (true/false) Whether the coordinator communicates with workers using IP addresses instead of hostnames.
WORKER_CONCURRENCY 2000 Integer (0 < value) The maximum number of concurrent SNMP workloads a single worker instance can perform.
DISCOVERY_WORKER_PERCENTAGE 25 Integer (0 <= value <= 100) The percentage of workers allocated exclusively to discovery workloads.
POLLER_RESYNC_PERIOD 15m Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h") How frequently the coordinator re-synchronizes with the Unified Assurance database.
PULSAR_SNMP_DISCOVERY_TOPIC_OVERRIDE "" Text Override for the topic from which the coordinator listens for discovery workload requests.
REDUNDANCY_INIT_DELAY 20s Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h") How long the secondary microservice waits during startup for the primary to report an up status before taking over as active.
REDUNDANCY_POLL_PERIOD 5s Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h") How frequently the secondary microservice polls for failure of the primary microservice.
REDUNDANCY_FAILOVER_THRESHOLD 4 Integer (0 < value) The number of failed checks after which the secondary microservice becomes active.
REDUNDANCY_FALLBACK_THRESHOLD 1 Integer (0 < value) The number of successful checks after which the secondary microservice returns to standby.

These configurations can be changed by passing values to a1helm install prefixed with coordinator.configData.

Example of setting the logging level to DEBUG only for the coordinator

a1helm install ... --set coordinator.configData.LOG_LEVEL=DEBUG
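
Other coordinator settings can be tuned in the same way. The values in this sketch are illustrative only, not recommendations:

a1helm install ... --set coordinator.configData.WORKER_CONCURRENCY=1000 --set coordinator.configData.DISCOVERY_WORKER_PERCENTAGE=50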

Default Worker-Only Configuration

Name Default Value Possible Values Notes
LOG_LEVEL INFO FATAL, ERROR, WARN, INFO, DEBUG Worker logging level. This overrides the global configuration.
GRPC_GRACEFUL_CONN_TIME 60s Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h") The maximum time a worker attempts to connect to the coordinator before failing.
STREAM_OUTPUT_METRIC "" Text Override for the topic where performance polling workload results are published.
STREAM_OUTPUT_AVAILABILITY "" Text Override for the topic where availability polling workload results are published.
PULSAR_DISCOVERY_CALLBACK_OVERRIDE "" Text Override for the topic where discovery workload results are published.

These configurations can be changed by passing values to a1helm install prefixed with worker.configData.

Example of setting the logging level to DEBUG only for workers

a1helm install ... --set worker.configData.LOG_LEVEL=DEBUG
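
Other worker settings can be overridden in the same way. The value in this sketch is illustrative only:

a1helm install ... --set worker.configData.GRPC_GRACEFUL_CONN_TIME=120s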

Autoscaling

The SNMP Poller microservice uses the formulae below to determine the number of workers required to perform SNMP workloads:

polling workers = round up(unique devices being polled / worker concurrency)
discovery workers = round up(polling workers * discovery worker percentage / 100)
total workers required = polling workers + discovery workers
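
For example, assuming 5000 unique polled devices with the default WORKER_CONCURRENCY of 2000 and DISCOVERY_WORKER_PERCENTAGE of 25:

polling workers = round up(5000 / 2000) = 3
discovery workers = round up(3 * 25 / 100) = 1
total workers required = 3 + 1 = 4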

During re-synchronisation with the Unified Assurance database, the coordinator determines the number of unique polled devices and performs the above calculations.

The result is then exposed as the snmp_coordinator_metric_workers_required Prometheus metric, which KEDA ingests to make the scaling decision.

Info:

Autoscaling is disabled by default.

Warn:

While the provided autoscaling works almost out of the box, you will need to manually configure the upper-bound autoscaling limit during installation.

Using the expected number of devices to be polled in each Device Zone, decide the percentage of discovery workers (or keep the default), apply the formulae, and configure the upper-bound limit.
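
Continuing the worked example above (4 workers required), the upper-bound limit might be set at install time along the following lines. The autoscaling.enabled and autoscaling.maxReplicas value names are assumptions here; confirm the exact keys in the autoscaling documentation referenced below:

a1helm install ... --set autoscaling.enabled=true --set autoscaling.maxReplicas=4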

For common microservice scaling configuration options, please refer to the autoscaling docs.

Examples

Modifying scaling triggers

By default, only a single autoscaling trigger is defined. You can define additional triggers during installation alongside the common configuration options.

autoscaling:
  ...
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090
        query: snmp_coordinator_metric_workers_required
        threshold: '1'
        metricType: Value

Microservice self-metrics

The SNMP Poller Microservice exposes the following self-metrics to Prometheus.

Coordinator metrics table

Note:

Each of the metrics below is prefixed with snmp_coordinator. For example, the full name of metric_worker_count is snmp_coordinator_metric_worker_count.

Metric Name Type Labels Description
metric_worker_count Gauge N/A The number of workers currently enrolled with the coordinator.
metric_workforce_count Gauge N/A The number of workers multiplied by worker concurrency.
metric_discovery_worker_count Gauge N/A The number of discovery workers currently enrolled with the coordinator.
metric_polling_worker_count Gauge N/A The number of polling workers currently enrolled with the coordinator.
metric_workers_required Gauge N/A The number of workers required for polling and discovery.
metric_discovery_requests_queued Gauge N/A The number of discovery requests currently queued (real-time value).
metric_discovery_requests_processing Gauge N/A The number of discovery requests currently being processed (real-time value).
metric_polling_requests_queued Gauge N/A The number of polling requests currently queued (real-time value).
metric_polling_requests_processing Gauge N/A The number of polling requests currently being processed (real-time value).
metric_polled_devices_count GaugeVec domain, cycle The number of polled devices per domain and cycle.
metric_polled_objects_count GaugeVec domain, cycle The number of polled objects per domain and cycle.
metric_polling_duration GaugeVec domain, cycle The total polling duration in seconds for the last cycle, per domain and cycle.
metric_polling_average GaugeVec domain, cycle The average polling duration in seconds for the last cycle, per domain and cycle.
metric_polling_percentile95 GaugeVec domain, cycle The 95th percentile average polling duration in seconds for the last cycle, per domain and cycle.
metric_polling_utilisation GaugeVec domain, cycle The polling utilisation in percent for the last cycle, per domain and cycle.
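
For an ad-hoc look at these metrics, you can query Prometheus directly. This is a minimal sketch assuming the in-cluster Prometheus address shown in the autoscaling trigger example above, and that it is reachable from where the command runs (for example, via a port-forward):

# Hypothetical spot check of a coordinator metric through the Prometheus HTTP API
curl -s 'http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090/api/v1/query?query=snmp_coordinator_metric_workers_required'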

Microservice redundancy

Redundancy in the SNMP Poller microservice controls which of the two microservices in a redundant pair is considered active and runs the periodic device polling.

Info:

Redundancy is disabled by default.

Example of enabling redundancy

a1helm install ... --set redundancy.enabled=true
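
The redundancy timing parameters from the coordinator configuration above can be tuned in the same command. The values in this sketch are illustrative only:

a1helm install ... --set redundancy.enabled=true --set coordinator.configData.REDUNDANCY_POLL_PERIOD=10s --set coordinator.configData.REDUNDANCY_FAILOVER_THRESHOLD=6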