Standard Threshold Processor
The Standard Threshold Processor microservice evaluates collected metrics against configured thresholds, generates threshold violation events, and publishes them to an output topic.
This microservice is part of the Event microservice pipeline. It also relies on the Metric microservice pipeline configured for Observability Analytics, and thresholds configured using the NPM Thresholds UI. See Understanding the Event Pipeline and Understanding the Metric Pipeline in Unified Assurance Concepts for conceptual information.
This microservice follows a coordinator–worker pattern. The coordinator manages enrollment and work distribution, and the workers run threshold evaluation work and publish results.
Autoscaling is disabled by default for this microservice. You can optionally enable autoscaling when you deploy the microservice. See Configuring Autoscaling.
You can enable redundancy for this microservice when you deploy it. See Configuring Microservice Redundancy for general information.
This microservice provides additional Prometheus monitoring metrics. See Standard Threshold Processor Self-Monitoring Metrics.
Standard Threshold Processor Prerequisites
Before deploying the microservice, confirm that the following prerequisites are met:
- A microservice cluster is set up. See Microservice Cluster Setup.
- The Apache Pulsar microservice is deployed. See Pulsar.
Deploying Standard Threshold Processor
To deploy the microservice, run the following commands:
```
su - assure1
export NAMESPACE=<namespace>
export WEBFQDN=<WebFQDN>
a1helm install <microservice-release-name> assure1/standard-threshold-processor -n $NAMESPACE --set global.imageRegistry=$WEBFQDN
```
In the commands:
- <namespace> is the namespace where you are deploying the microservice. The default namespace is a1-zone1-pri, but you can change the zone number and, when deploying to a redundant cluster, change pri to sec.
- <WebFQDN> is the fully qualified domain name of the primary presentation server for the cluster.
- <microservice-release-name> is the name to use for the microservice instance. Oracle recommends using the microservice name (standard-threshold-processor) unless you are deploying multiple instances of the microservice to the same cluster.
You can also use the Unified Assurance UI to deploy microservices. See Deploying a Microservice by Using the UI for more information.
Changing Standard Threshold Processor Configuration Parameters
When running the install command, you can optionally change default configuration parameter values by including them in the command with additional --set arguments. You can add as many additional --set arguments as needed.
For example:
- Set a parameter described in Default Global Standard Threshold Processor Configuration by adding --set configData.<parameter_name>=<parameter_value>. For example, --set configData.LOG_LEVEL=DEBUG.
- Enable autoscaling for the microservice by adding --set autoscaling.enabled=true.
Default Global Standard Threshold Processor Configuration
The following table describes the default configuration parameters found in the Helm chart under configData for the microservice. These apply to both workers and coordinators.
| Name | Default Value | Supported Values or Types | Notes |
|---|---|---|---|
| LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, or DEBUG | The logging level for the microservice. |
| STREAM_OUTPUT | persistent://assure1/event/collection | Text, 255 characters | The Pulsar topic where threshold events are published. The topic at the end of the path may be any text value. |
| POLL_INTERVAL | 60 | Integer | The polling interval, in seconds, between threshold evaluations. |
| CHECK_INTERVAL | 900 | Integer | The interval, in seconds, for threshold re-evaluation or integrity checks. |
| BATCH_SIZE | 500 | Integer | The number of records processed for each batch cycle. |
| WORKER_THREADS | 20 | Integer | The number of worker threads used for processing threshold workloads. |
| REDUNDANCY_POLL_PERIOD | 5 | Integer | The number of seconds between status checks from the secondary microservice to the primary microservice. |
| REDUNDANCY_FAILOVER_THRESHOLD | 4 | Integer | The number of times the primary microservice must fail checks before the secondary microservice becomes active. |
| REDUNDANCY_FALLBACK_THRESHOLD | 1 | Integer | The number of times the primary microservice must succeed checks before the secondary microservice becomes inactive. |
| GRPC_GRACEFUL_CONN_TIME | 60 | Integer | The number of seconds the workers should try to connect with the coordinator before failing. |
| GRPC_CLIENT_KEEPALIVE | false | Boolean | Whether to use client-side keepalive checks, sent from the workers, to validate communication with the coordinator. |
| GRPC_CLIENT_KEEPALIVE_TIME | 30 | Integer | The number of seconds of inactivity after which to ping the coordinator. |
| GRPC_CLIENT_KEEPALIVE_TIMEOUT | 5 | Integer | The number of seconds to wait for a response to the ping before the connection to the coordinator is considered down. |
| GRPC_SERVER_KEEPALIVE | false | Boolean | Whether to use server-side keepalive checks, sent from the coordinator, to validate communication with the workers. |
| GRPC_SERVER_KEEPALIVE_TIME | 30 | Integer | The number of seconds of inactivity after which to ping a worker. |
| GRPC_SERVER_KEEPALIVE_TIMEOUT | 5 | Integer | The number of seconds to wait for a response to the ping before the connection to a worker is considered down. |
| SEND_ALL_VIOLATIONS | 0 | 0 or 1 | Controls duplicate threshold violations. Set this to 1 to send events for all violations, including duplicates, or to 0 to suppress duplicate violations. |
About Keep-Alive Configurations
By default, the coordinator and individual workers periodically send heartbeat messages to each other, with no validation, to check that the connection is not idle. To validate the connection, you can optionally enable ping-based gRPC keepalive checks, which expect a response within a configurable timeframe. If no response is received, the connection is considered down and the workers attempt to reestablish communication.
In the Standard Threshold Processor microservice, the coordinator acts as the gRPC server and the workers act as clients. You enable keepalive checks from the coordinator to workers in the GRPC_SERVER_KEEPALIVE parameter and from workers to the coordinator in the GRPC_CLIENT_KEEPALIVE parameter. You set the interval at which the checks are made in the GRPC_SERVER_KEEPALIVE_TIME and GRPC_CLIENT_KEEPALIVE_TIME parameters, and the time within which a response is expected in the GRPC_SERVER_KEEPALIVE_TIMEOUT and GRPC_CLIENT_KEEPALIVE_TIMEOUT parameters.
Client-side keepalive checks have mandatory enforcement policies. If the client checks too frequently, the connection is dropped with an ENHANCE_YOUR_CALM(too_many_pings) error. When you enable client-side keepalive checks, the Standard Threshold Processor automatically sets the enforcement policy to allow pings no more frequently than the value of GRPC_CLIENT_KEEPALIVE_TIME minus the value of GRPC_CLIENT_KEEPALIVE_TIMEOUT, in seconds.
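The enforcement-policy arithmetic described above can be sketched as follows. This is an illustrative calculation only, not code from the microservice; the helper name is hypothetical:

```python
def min_allowed_ping_interval(keepalive_time: int, keepalive_timeout: int) -> int:
    """Hypothetical helper: the minimum interval, in seconds, at which the
    coordinator's enforcement policy permits client pings, computed as
    GRPC_CLIENT_KEEPALIVE_TIME minus GRPC_CLIENT_KEEPALIVE_TIMEOUT."""
    return keepalive_time - keepalive_timeout

# With the default values (30 and 5), clients may ping at most every 25 seconds.
print(min_allowed_ping_interval(30, 5))
```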
Coordinating Poll Intervals
The POLL_INTERVAL parameter interacts with the frequency at which standard thresholds are checked and with the poll intervals of the metrics themselves. The following settings affect timing:
- On a metric: The poll interval determines how frequently the metric's value is updated.
- On a threshold:
  - The frequency determines how frequently the metric value is checked for threshold violations.
  - The time range determines the range of data points to use when checking for threshold violations.
- On the Standard Threshold Processor: The POLL_INTERVAL determines how frequently the processor polls the thresholds to see which need to be checked.
You must consider the interaction between these poll intervals when setting them. Because the Standard Threshold Processor processes all standard thresholds, you must be aware of the frequency of all of the related thresholds when setting the POLL_INTERVAL.
The thresholds will only be checked as frequently as the Standard Threshold Processor polls them. If the Standard Threshold Processor's POLL_INTERVAL is set to 60 seconds, even if you set a threshold's Frequency to 30 seconds, the threshold will still only be checked every minute. However, having the Standard Threshold Processor poll the thresholds more frequently than the most frequent threshold results in unnecessary work.
Similarly, checking the threshold for violations more frequently than the metric data is updated could result in false positives, while checking the threshold too infrequently could result in missed violations. For example, you could get inaccurate data by setting the threshold's Frequency to 60 seconds when the metric's Poll Interval is set to 300 seconds, or setting Frequency to 300 seconds when the metric's Poll Interval is 30 seconds.
As a basic guideline, Oracle recommends setting a threshold's Frequency to be the same as, or less frequent than, both the POLL_INTERVAL for the Standard Threshold Processor and the Poll Interval for the related metric.
See Metrics in Unified Assurance User's Guide and Setting Up NPM Thresholding in Unified Assurance Network Performance Management Reporting Guide for more information about configuring metrics and thresholds, including setting their frequency and poll intervals.
Note:
For thresholds that need very frequent polling times (less than a minute), using in-application thresholding may be more efficient than using the Standard Threshold Processor.
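The guideline above can be expressed as a simple check. This is a hypothetical helper for reasoning about your settings, not part of the product:

```python
def frequency_ok(threshold_frequency: int, processor_poll_interval: int,
                 metric_poll_interval: int) -> bool:
    """Return True when a threshold's Frequency (in seconds) is the same as,
    or less frequent than (>=), both the processor's POLL_INTERVAL and the
    related metric's Poll Interval, per the guideline above."""
    return (threshold_frequency >= processor_poll_interval
            and threshold_frequency >= metric_poll_interval)

print(frequency_ok(300, 60, 300))  # follows the guideline
print(frequency_ok(60, 60, 300))   # checks more often than the metric updates
```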
Poll Interval Example
This example involves the following components:
- Standard Threshold Processor: POLL_INTERVAL is set to 60.
- Threshold 1:
  - Frequency is set to 300s.
  - Time Range is set to 900s.
- Metric 1: Poll Interval is set to 300.
- Threshold 2:
  - Frequency is set to 60s.
  - Time Range is set to 180s.
- Metric 2: Poll Interval is set to 60.
At 10:01, the following happens:
- The Standard Threshold Processor polls the thresholds to see if any should be checked.
- Because threshold 1 was last checked at 10:00, and it only needs to be checked every 5 minutes, the threshold processor does not check it.
- Because threshold 2 was last checked at 10:00, and needs to be checked every minute, the threshold processor checks whether metric 2, which is polled every minute, violates threshold 2.
- If metric 2 violates threshold 2, which is evaluated against data received since 9:58, the threshold processor creates an event.
If POLL_INTERVAL for the Standard Threshold Processor is instead set to 300, the processor does not check threshold 2 frequently enough and might miss violations. If POLL_INTERVAL is instead set to 15, the processor performs unnecessary work, polling the thresholds more often than any of them need to be checked.
Standard Threshold Processor Autoscaling Configuration
Autoscaling is supported for the Standard Threshold Processor microservice. See Configuring Autoscaling for general information and details about the standard autoscaling configurations.
The Standard Threshold Processor microservice also supports the additional configuration described in the following table.
| Name | Default Value | Possible Values | Notes |
|---|---|---|---|
| thresholds.requiredWorkersTarget | 1 | Integer | The target number of workers for scaling. |
The total workers required is determined dynamically according to the value of the stp_required_total_workers Prometheus metric. If this is higher than thresholds.requiredWorkersTarget, KEDA scales up the number of pods for Standard Threshold Processor (up to maxReplicaCount, by default set to 25).
If the total number of required workers (the value of the stp_required_total_workers Prometheus metric) is lower than or equal to thresholds.requiredWorkersTarget, KEDA scales down (down to minReplicaCount, by default set to 1) after cooldownPeriod (by default set to 300 seconds).
For example, assume the value of stp_required_total_workers is 6. With the default values, because this is greater than 1 (the default thresholds.requiredWorkersTarget), KEDA will scale up to 6 worker pods. If stp_required_total_workers subsequently changes to 4, after 300 seconds, KEDA terminates two worker pods to scale down from 6 pods to 4.
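The scaling behavior described above can be sketched as a clamp of the required worker count between the replica bounds. This is a hypothetical simplification of KEDA's actual scaling logic, and it ignores the cooldownPeriod delay on scale-down:

```python
def desired_replicas(required_workers: int,
                     min_replicas: int = 1,
                     max_replicas: int = 25) -> int:
    """Sketch: scale toward the value of the stp_required_total_workers
    metric, bounded by minReplicaCount and maxReplicaCount."""
    return max(min_replicas, min(required_workers, max_replicas))

print(desired_replicas(6))   # scales up to 6 worker pods
print(desired_replicas(4))   # later scales down to 4 pods (after cooldownPeriod)
print(desired_replicas(30))  # capped at the default maxReplicaCount of 25
```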
Standard Threshold Processor Self-Monitoring Metrics
The Standard Threshold Processor microservice exposes the self-monitoring metrics for coordinators described in the following table to Prometheus.
| Metric Name | Type | Description | Component |
|---|---|---|---|
| stp_required_total_workers | Gauge | The total number of workers required to process violation events. | Coordinator |
| stp_threshold_configured_devices | Gauge | The total number of configured devices for threshold violation checks. | Coordinator |
| work_queue_backlog | Gauge | The total number of remaining items in the work queue. | Coordinator |
| poll_cycle_duration_ms | Gauge | The number of milliseconds required to process a single poll cycle. | Coordinator |
| number_of_violations | Gauge | The total number of violations raised. | Worker |
| number_of_clear_alarms | Gauge | The total number of CLEARED alarms processed. | Worker |
| stp_worker_execution_duration_ms | Gauge | The end-to-end device execution time per worker, in milliseconds. | Worker |
| stp_success_work_executions | Gauge | The total number of successful work executions. | Worker |
| stp_failed_work_executions | Gauge | The total number of failed work executions. | Worker |
Note:
In the database, each of the metrics is prefixed with prom and standard-threshold-processor to indicate the services that inserted them.