Standard Threshold Processor
The Standard Threshold Processor microservice evaluates collected metrics against configured thresholds, generates threshold violation events, and publishes them to an output topic.
This microservice is part of the Event microservice pipeline. It also relies on the Metric microservice pipeline configured for Observability Analytics, and thresholds configured using the NPM Thresholds UI. See Understanding the Event Pipeline and Understanding the Metric Pipeline in Unified Assurance Concepts for conceptual information.
This microservice follows a coordinator–worker pattern. The coordinator manages enrollment and work distribution, and the workers run threshold evaluation work and publish results.
Autoscaling is disabled by default for this microservice. You can optionally enable autoscaling when you deploy the microservice. See Configuring Autoscaling.
You can enable redundancy for this microservice when you deploy it. See Configuring Microservice Redundancy for general information.
This microservice provides additional Prometheus monitoring metrics. See Standard Threshold Processor Self-Monitoring Metrics.
Standard Threshold Processor Prerequisites
Before deploying the microservice, confirm that the following prerequisites are met:
- A microservice cluster is set up. See Microservice Cluster Setup.
- The Apache Pulsar microservice is deployed. See Pulsar.
Deploying Standard Threshold Processor
To deploy the microservice, run the following commands:
```
su - assure1
export NAMESPACE=<namespace>
export WEBFQDN=<WebFQDN>
a1helm install <microservice-release-name> assure1/standard-threshold-processor -n $NAMESPACE --set global.imageRegistry=$WEBFQDN
```
In the commands:
- <namespace> is the namespace where you are deploying the microservice. The default namespace is a1-zone1-pri, but you can change the zone number and, when deploying to a redundant cluster, change pri to sec.
- <WebFQDN> is the fully qualified domain name of the primary presentation server for the cluster.
- <microservice-release-name> is the name to use for the microservice instance. Oracle recommends using the microservice name (standard-threshold-processor) unless you are deploying multiple instances of the microservice to the same cluster.
You can also use the Unified Assurance UI to deploy microservices. See Deploying a Microservice by Using the UI for more information.
Changing Standard Threshold Processor Configuration Parameters
When running the install command, you can optionally change default configuration parameter values by including them in the command with additional --set arguments. You can add as many additional --set arguments as needed.
For example:
- Set a parameter described in Default Global Standard Threshold Processor Configuration by adding --set configData.<parameter_name>=<parameter_value>. For example, --set configData.LOG_LEVEL=DEBUG.
- Enable autoscaling for the microservice by adding --set autoscaling.enabled=true.
Default Global Standard Threshold Processor Configuration
The following table describes the default configuration parameters found in the Helm chart under configData for the microservice. These apply to both workers and coordinators.
| Name | Default Value | Supported Values or Types | Notes |
|---|---|---|---|
| LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, or DEBUG | The logging level for the microservice. |
| STREAM_OUTPUT | persistent://assure1/event/collection | Text, 255 characters | The Pulsar topic where threshold events are published. The topic at the end of the path may be any text value. |
| POLL_INTERVAL | 60 | Integer | The polling interval, in seconds, between threshold evaluations. |
| CHECK_INTERVAL | 900 | Integer | The interval, in seconds, for threshold re-evaluation or integrity checks. |
| BATCH_SIZE | 500 | Integer | The number of records processed for each batch cycle. |
| WORKER_THREADS | 20 | Integer | The number of worker threads used for processing threshold workloads. |
| REDUNDANCY_POLL_PERIOD | 5 | Integer | The number of seconds between status checks from the secondary microservice to the primary microservice. |
| REDUNDANCY_FAILOVER_THRESHOLD | 4 | Integer | The number of times the primary microservice must fail checks before the secondary microservice becomes active. |
| REDUNDANCY_FALLBACK_THRESHOLD | 1 | Integer | The number of times the primary microservice must succeed checks before the secondary microservice becomes inactive. |
| GRPC_GRACEFUL_CONN_TIME | 60 | Integer | The number of seconds the workers should try to connect with the coordinator before failing. |
| GRPC_CLIENT_KEEPALIVE | false | Boolean | Whether to use client-side keepalive checks, sent from the workers, to validate communication with the coordinator. |
| GRPC_CLIENT_KEEPALIVE_TIME | 30 | Integer | The number of seconds of inactivity after which to ping the coordinator. |
| GRPC_CLIENT_KEEPALIVE_TIMEOUT | 5 | Integer | The number of seconds to wait for a response to the ping before the connection to the coordinator is considered down. |
| GRPC_SERVER_KEEPALIVE | false | Boolean | Whether to use server-side keepalive checks, sent from the coordinator, to validate communication with the workers. |
| GRPC_SERVER_KEEPALIVE_TIME | 30 | Integer | The number of seconds of inactivity after which to ping a worker. |
| GRPC_SERVER_KEEPALIVE_TIMEOUT | 5 | Integer | The number of seconds to wait for a response to the ping before the connection to a worker is considered down. |
| SEND_ALL_VIOLATIONS | 0 | 0 or 1 | Controls duplicate threshold violations. Set this to 1 to send events for all violations, including duplicates, or to 0 to suppress duplicate violations. |
About Keep-Alive Configurations
By default, the coordinator and individual workers periodically send heartbeat messages to each other, with no validation, to check that the connection is not idle. To validate the connection, you can optionally enable ping-based gRPC keepalive checks, which expect a response within a configurable timeframe. If no response is received, the connection is considered down and the workers attempt to reestablish communication.
In the Standard Threshold Processor microservice, the coordinator acts as the gRPC server and the workers act as clients. You enable keepalive checks from the coordinator to workers in the GRPC_SERVER_KEEPALIVE parameter and from workers to the coordinator in the GRPC_CLIENT_KEEPALIVE parameter. You set the interval at which the checks are made in the GRPC_SERVER_KEEPALIVE_TIME and GRPC_CLIENT_KEEPALIVE_TIME parameters, and the time within which a response is expected in the GRPC_SERVER_KEEPALIVE_TIMEOUT and GRPC_CLIENT_KEEPALIVE_TIMEOUT parameters.
Client-side keepalive checks have mandatory enforcement policies. If the client checks too frequently, the connection is dropped with an ENHANCE_YOUR_CALM(too_many_pings) error. When you enable client-side keepalive checks, the Standard Threshold Processor automatically sets the enforcement policy to allow pings no more frequently than the value of GRPC_CLIENT_KEEPALIVE_TIME minus the value of GRPC_CLIENT_KEEPALIVE_TIMEOUT, in seconds.
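The enforcement-policy arithmetic described above can be sketched as follows. This is an illustrative calculation only, not code from the microservice; the helper name is hypothetical:

```python
def min_allowed_ping_interval(keepalive_time: int, keepalive_timeout: int) -> int:
    """Hypothetical helper: the minimum interval, in seconds, at which the
    coordinator's enforcement policy permits client pings, computed as
    GRPC_CLIENT_KEEPALIVE_TIME minus GRPC_CLIENT_KEEPALIVE_TIMEOUT."""
    return keepalive_time - keepalive_timeout

# With the default values (30 and 5), clients may ping at most every 25 seconds.
print(min_allowed_ping_interval(30, 5))
```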
Coordinating Poll Intervals
The POLL_INTERVAL parameter interacts with the frequency at which standard thresholds are checked and with the poll intervals of the metrics themselves. The following settings affect timing:
- On a metric: The poll interval determines how frequently the metric's value is updated.
- On a threshold:
  - The frequency determines how frequently the metric value is checked for threshold violations.
  - The time range determines the range of data points to use when checking for threshold violations.
- On the Standard Threshold Processor: The POLL_INTERVAL determines how frequently the processor polls the thresholds to see which need to be checked.
You must consider the interaction between these poll intervals when setting them. Because the Standard Threshold Processor processes all standard thresholds, you must be aware of the frequency of all of the related thresholds when setting the POLL_INTERVAL.
The thresholds will only be checked as frequently as the Standard Threshold Processor polls them. If the Standard Threshold Processor's POLL_INTERVAL is set to 60 seconds, even if you set a threshold's Frequency to 30 seconds, the threshold will still only be checked every minute. However, having the Standard Threshold Processor poll the thresholds more frequently than the most frequent threshold results in unnecessary work.
Similarly, checking the threshold for violations more frequently than the metric data is updated could result in false positives, while checking the threshold too infrequently could result in missed violations. For example, you could get inaccurate data by setting the threshold's Frequency to 60 seconds when the metric's Poll Interval is set to 300 seconds, or setting Frequency to 300 seconds when the metric's Poll Interval is 30 seconds.
As a basic guideline, Oracle recommends setting a threshold's Frequency to be the same as, or less frequent than, both the POLL_INTERVAL for the Standard Threshold Processor and the Poll Interval for the related metric.
See Metrics in Unified Assurance User's Guide and Setting Up NPM Thresholding in Unified Assurance Network Performance Management Reporting Guide for more information about configuring metrics and thresholds, including setting their frequency and poll intervals.
Note:
For thresholds that need very frequent polling times (less than a minute), using in-application thresholding may be more efficient than using the Standard Threshold Processor.
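The guideline above can be expressed as a simple check. This is a hypothetical helper for reasoning about your settings, not part of the product:

```python
def frequency_ok(threshold_frequency: int, processor_poll_interval: int,
                 metric_poll_interval: int) -> bool:
    """Return True when a threshold's Frequency (in seconds) is the same as,
    or less frequent than (>=), both the processor's POLL_INTERVAL and the
    related metric's Poll Interval, per the guideline above."""
    return (threshold_frequency >= processor_poll_interval
            and threshold_frequency >= metric_poll_interval)

print(frequency_ok(300, 60, 300))  # follows the guideline
print(frequency_ok(60, 60, 300))   # checks more often than the metric updates
```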
Poll Interval Example
This example involves the following components:
- Standard Threshold Processor: POLL_INTERVAL is set to 60.
- Threshold 1:
  - Frequency is set to 300s.
  - Time Range is set to 900s.
- Metric 1: Poll Interval is set to 300.
- Threshold 2:
  - Frequency is set to 60s.
  - Time Range is set to 180s.
- Metric 2: Poll Interval is set to 60.
At 10:01, the following happens:
- The Standard Threshold Processor polls the thresholds to see if any should be checked.
- Because threshold 1 was last checked at 10:00, and it only needs to be checked every 5 minutes, the threshold processor does not check it.
- Because threshold 2 was last checked at 10:00, and needs to be checked every minute, the threshold processor checks whether metric 2, which is polled every minute, violates threshold 2.
- If metric 2 violates threshold 2, which is evaluated against data received since 9:58, the threshold processor creates an event.
If POLL_INTERVAL for the Standard Threshold Processor is instead set to 300, the processor does not check threshold 2 frequently enough and might miss violations. If POLL_INTERVAL is instead set to 15, the processor performs unnecessary work, polling the thresholds more often than any of them need to be checked.
Standard Threshold Processor Autoscaling Configuration
Autoscaling is supported for the Standard Threshold Processor microservice. See Configuring Autoscaling for general information and details about the standard autoscaling configurations.
The Standard Threshold Processor microservice also supports the additional configuration described in the following table.
| Name | Default Value | Possible Values | Notes |
|---|---|---|---|
| thresholds.requiredWorkersTarget | 1 | Integer | The target number of workers for scaling. |
The total workers required is determined dynamically according to the value of the stp_required_total_workers Prometheus metric. If this is higher than thresholds.requiredWorkersTarget, KEDA scales up the number of pods for Standard Threshold Processor (up to maxReplicaCount, by default set to 25).
If the total number of required workers (the value of the stp_required_total_workers Prometheus metric) is lower than or equal to thresholds.requiredWorkersTarget, KEDA scales down (down to minReplicaCount, by default set to 1) after cooldownPeriod (by default set to 300 seconds).
For example, assume the value of stp_required_total_workers is 6. With the default values, because this is greater than 1 (the default thresholds.requiredWorkersTarget), KEDA will scale up to 6 worker pods. If stp_required_total_workers subsequently changes to 4, after 300 seconds, KEDA terminates two worker pods to scale down from 6 pods to 4.
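The scaling behavior described above can be sketched as a clamp of the required worker count between the replica bounds. This is a hypothetical simplification of KEDA's actual scaling logic, and it ignores the cooldownPeriod delay on scale-down:

```python
def desired_replicas(required_workers: int,
                     min_replicas: int = 1,
                     max_replicas: int = 25) -> int:
    """Sketch: scale toward the value of the stp_required_total_workers
    metric, bounded by minReplicaCount and maxReplicaCount."""
    return max(min_replicas, min(required_workers, max_replicas))

print(desired_replicas(6))   # scales up to 6 worker pods
print(desired_replicas(4))   # later scales down to 4 pods (after cooldownPeriod)
print(desired_replicas(30))  # capped at the default maxReplicaCount of 25
```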
Standard Threshold Processor Self-Monitoring Metrics
The Standard Threshold Processor microservice exposes the self-monitoring metrics for coordinators described in the following table to Prometheus.
| Metric Name | Type | Description | Component |
|---|---|---|---|
| stp_required_total_workers | Gauge | The total number of workers required to process violation events. | Coordinator |
| stp_threshold_configured_devices | Gauge | The total number of configured devices for threshold violation checks. | Coordinator |
| work_queue_backlog | Gauge | The total number of remaining items in the work queue. | Coordinator |
| poll_cycle_duration_ms | Gauge | The number of milliseconds required to process a single poll cycle. | Coordinator |
| number_of_violations | Gauge | The total number of violations raised. | Worker |
| number_of_clear_alarms | Gauge | The total number of CLEARED alarms processed. | Worker |
| stp_worker_execution_duration_ms | Gauge | The end-to-end device execution time per worker, in milliseconds. | Worker |
| stp_success_work_executions | Gauge | The total number of successful work executions. | Worker |
| stp_failed_work_executions | Gauge | The total number of failed work executions. | Worker |
Note:
In the database, each of the metrics is prefixed with prom and standard-threshold-processor to indicate the services that inserted them.