Alert Notifications

The Siebel CRM Observability – Monitoring solution offers the ability to generate alert notifications based on predefined conditions that incorporate the various metrics collected.

Alerting can be handled with Prometheus for all involved resources, as that is where the collected metrics reside. It is also possible to configure alerting with the OCI Notifications service. For details on using the OCI Notifications service, refer to the official documentation available at https://docs.public.oneportal.content.oci.oraclecloud.com/en-us/iaas/Content/Notification/home.htm.

Broadly, this is how alert notification is functionally handled with Prometheus-based configurations:

  • Alerting rules defined in the Prometheus server are evaluated, and when the necessary conditions are fulfilled, alerts are sent to the configured Alertmanager.
  • The Alertmanager can be configured to manage alerts using, among others, actions like the following (see the sketch after this list):
    • Grouping alerts of a similar nature into a single notification.
    • Inhibition: suppressing notifications for certain alerts if certain other alerts are already firing.
    • Silencing: muting alerts for specific time periods, and so on.
  • The Alertmanager can send notifications to:
    • Email systems
    • On-call notification systems
    • Chat platforms
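
As an illustration of inhibition, here is a minimal Alertmanager inhibit_rules block. This is a sketch only: the severity values and label names are assumptions for illustration, not part of the shipped solution.

inhibit_rules:
- source_matchers:
  - severity="critical"
  target_matchers:
  - severity="warning"
  # Suppress the warning-level alert only when both alerts carry
  # the same alertname and instance labels.
  equal: ['alertname', 'instance']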

Alert notifications using Prometheus in the Siebel CRM Observability - Monitoring solution involve:

  • Creating alerting rules in Prometheus.
  • Configuring Prometheus to connect to and notify the Alertmanagers.
  • Setting up and configuring the Alertmanager, which ultimately sends notifications to target channels.

Alerting Rules in Prometheus

Rules for alert trigger conditions are defined in the following place in the Siebel CRM Observability – Monitoring solution:

In the <namespace>-helmcharts/prometheus/templates/configMap.yaml file, under the prometheus.rules key, in the Git repository <namespace>-helmcharts.

Specific metrics and thresholds can be configured by following the Prometheus documentation.

The rules are written in PromQL (Prometheus Query Language).

Here is a sample alert rule block for Prometheus that evaluates to true when container CPU usage is above 60%; the usage is computed as a rate over a 15-minute window.

prometheus.rules: |-
   groups:
   - name: siebel alerts
     rules:
     - alert: ContainerHighCpuUtilization
       expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[15m]))
          BY (instance, name) * 100) > 60
       for: 2m
       labels:
          severity: critical
       annotations:
          summary: >-
             Container High CPU utilization (instance {{ "{{" }} $labels.instance }})
          description: |-
             Container CPU utilization is above 60%
             VALUE = {{ "{{" }} $value }}
             LABELS = {{ "{{" }} $labels }}

A few of the notations used in the example above are briefly explained below. For more details, refer to the Prometheus documentation available at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/.

  • groups: Alerting rules exist within a rule group. Rules within a group are run sequentially at a regular interval, with the same evaluation time.
  • alert: The name of the alert. It must be a valid label value (string type).
  • expr: A string-type PromQL expression to evaluate. In every evaluation cycle this is evaluated at the current time, and all resultant time series become pending or firing alerts.
  • for: The optional for clause causes Prometheus to wait for a certain duration between first encountering a new expression output vector element and counting an alert as firing for this element. In this case, Prometheus checks that the alert continues to be active during each evaluation for 2 minutes before firing it.
  • labels: The labels clause allows specifying a set of additional labels to be attached to the alert. Any existing conflicting labels will be overwritten. The label values can be templated.
  • annotations: The annotations clause specifies a set of informational labels that can be used to store longer, additional information such as alert descriptions or runbook links. The annotation values can be templated.

Label and annotation values can be templated using console templates. The $labels variable holds the label key/value pairs of an alert instance. The $value variable holds the evaluated value of an alert instance.

Refer to the Prometheus documentation for how to use PromQL and all the other options available to set up rules in the Prometheus server that meet your business requirements.
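
For instance, a similar rule for container memory usage could be added to the same group. This is an illustrative sketch following the pattern above: the cAdvisor metric names are standard, but the 80% threshold and the labels are assumptions, not part of the shipped solution.

     - alert: ContainerHighMemoryUsage
       expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name)
          / sum(container_spec_memory_limit_bytes{name!=""} > 0) BY (instance, name) * 100) > 80
       for: 2m
       labels:
          severity: warning
       annotations:
          summary: >-
             Container High Memory usage (instance {{ "{{" }} $labels.instance }})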

Target Alertmanager endpoints are also defined in the same file, <namespace>-helmcharts/prometheus/templates/configMap.yaml, but under the prometheus.yml key. Among the various options available (refer to the official Prometheus documentation for all of them), the Alertmanager's URL and any routing or grouping configurations are noteworthy.

A sample configuration is provided below.

prometheus.yml: |-
   global:
      {{ .Values.server.global | toYaml | trimSuffix "\n" | indent 6 }}
   {{- if .Values.alerting }}
   rule_files:
      - /etc/prometheus/prometheus.rules
   alerting:
      alertmanagers:
      - scheme: http
        static_configs:
        - targets:
          - "prometheus-alertmanager.{{ .Release.Namespace }}.svc:9093"
   {{- end }}

The above configuration tells Prometheus to inform the Alertmanager when a rule previously defined (under the prometheus.rules key) evaluates to true.

The rule files can be reloaded at runtime by sending SIGHUP to the Prometheus process. The changes are only applied if all rule files are well-formatted.

In the above sample:

  • The alerting section defines the target Alertmanager.
  • scheme may contain the values http or https. Because the Alertmanager pods are within the same cluster, the Siebel CRM Observability - Monitoring solution uses http, which is the default scheme.
  • .Values contains the values defined in values.yaml.
  • .Release.Namespace is a variable containing the namespace of the current Helm release in the Helm templating language.
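
For example, for a Helm release deployed in a namespace named siebel (an illustrative namespace, not necessarily yours), the rendered alerting configuration inside prometheus.yml would contain:

rule_files:
   - /etc/prometheus/prometheus.rules
alerting:
   alertmanagers:
   - scheme: http
     static_configs:
     - targets:
       - "prometheus-alertmanager.siebel.svc:9093"
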
Note: Once you update any file in any helmchart, you have to increment the "version" in the respective Chart.yaml for the deployed state to get reconciled to your declared state in the yaml files.
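
For example, an illustrative Chart.yaml (the actual chart metadata in your repository will differ):

# <namespace>-helmcharts/prometheus/Chart.yaml
apiVersion: v2
name: prometheus
version: 1.0.1   # incremented from 1.0.0 so that the change is reconciled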

Prometheus Alertmanager Configurations

Details of Prometheus Alertmanager configurations are available at https://prometheus.io/docs/alerting/latest/configuration/. In this document, we will touch upon a very small set of considerations to keep in mind.

In the Siebel CRM Observability – Monitoring solution, the Alertmanager configuration is defined in the file <namespace>-helmcharts/prometheus-alert-manager/templates/AlertManagerConfigmap.yaml, under the config.yml key, in the Git repository <namespace>-helmcharts.

The Alertmanager is configured using a YAML-based configuration file. Essential configuration components and parameters include:

  • Global Configurations
    • resolve_timeout: This global setting defines the default duration after which an alert will be considered resolved if no more firing alerts are received for it.

    Example snippet (the email_config values referenced here come from the chart's values.yaml; see the sketch after this list):

    global: 
       resolve_timeout: 5m 
       smtp_smarthost: {{ .Values.email_config.smtp_host }}:{{ .Values.email_config.smtp_port }} 
       smtp_from: {{ .Values.email_config.smtp_from }} 
       smtp_auth_username: {{ .Values.email_config.smtp_auth_username }} 
       smtp_auth_password: {{ .Values.email_config.smtp_auth_password }}
  • Route Configurations
    • receiver: Specifies the default receiver for alerts.
    • group_by: Groups alerts by specific labels. In this example, alerts are grouped by alertname and priority.
    • group_wait: Specifies how long to wait before sending the initial notification for a new group of alerts; alerts arriving within this window are grouped into the same notification.
    • group_interval: Defines how long to wait before sending notifications about new alerts added to a group for which an initial notification has already been sent.
    • repeat_interval: Specifies how often to repeat notifications for the same alert group.
    • routes: Defines routing rules. In this example, alerts with a severity label set to "critical" match the sub-route; alerts matching no sub-route fall back to the default receiver. Both point to the alert-emailer receiver here.

    Example snippet:

     route:
        receiver: alert-emailer
        group_by: ['alertname', 'priority']
        group_wait: 10s
        group_interval: 5m
        repeat_interval: 30m
        routes:
        - receiver: alert-emailer
          matchers:
          - severity="critical"
  • Receiver Configurations
    • receivers: Specifies the different receivers for alerts. Each receiver can have various configurations based on the notification channel, such as email, Slack, or other integrations.

    Example snippet:

     receivers:
     - name: alert-emailer
       email_configs:
       - to: "team@example.com"
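
As referenced in the Global Configurations snippet above, the SMTP settings come from the chart's values.yaml. A sketch of the corresponding block is shown below; the key names follow the template references above, while the host, port, and credential values are placeholders:

email_config:
   smtp_host: smtp.example.com
   smtp_port: 587
   smtp_from: alerts@example.com
   smtp_auth_username: alerts@example.com
   smtp_auth_password: "<smtp-password>"
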
Note: Once you update any file in any helmchart, you have to increment "version" in the respective Chart.yaml for the deployed state to get reconciled to your declared state in the yaml files.

It is recommended to point Prometheus to a list of all Alertmanager instances instead of load-balancing across them; Alertmanager deduplicates notifications across its instances, so each Prometheus server should send alerts to all of them.
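
For example, listing two Alertmanager replicas explicitly in prometheus.yml (the hostnames below are illustrative):

alerting:
   alertmanagers:
   - scheme: http
     static_configs:
     - targets:
       - "prometheus-alertmanager-0.example.svc:9093"
       - "prometheus-alertmanager-1.example.svc:9093"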