A References

This section lists additional topics that are referenced while performing some of the procedures in this document.

A.1 Changing Retention Size of Prometheus

This section details the procedure for changing the retention size of Prometheus. Prometheus is configured to delete data only under the following conditions:
  • The data is older than the configured retention period (14 days) or exceeds the configured retention size.
  • The PV is at least 90% full.
The retention size calculation does not consider the write-ahead log (WAL). If the WAL size is greater than 10% of the PV size, the PV fills up before Prometheus detects the "too full" condition. Therefore, it is recommended to calculate the WAL growth and adjust the retention size of Prometheus accordingly.
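To observe the current WAL size on disk before adjusting the retention size, you can run a du command inside the Prometheus server pod. The following is a minimal sketch; the deployment name, container name (prometheus-server), and data path (/data) are taken from the sample outputs later in this section and may differ in your environment:

$ kubectl exec -n occne-infra deploy/occne-prometheus-server -c prometheus-server -- du -sh /data/wal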

Calculating WAL Growth

  1. Refer to the WAL space calculation document for all the formulas used in this section to calculate the WAL growth.
  2. Users must benchmark the metrics size with each release to calculate the approximate WAL growth. Consider the following parameters for analyzing and understanding the metrics size:

    Note:

    • The metrics size must be calculated for a time period of three hours.
    • The traffic must be active and running at the Maximum Transactions Per Second (MAX TPS). The TPS considered in this example is 5000.
    1. Samples scraped per scrape:
      sum(scrape_samples_scraped{kubernetes_namespace="<namespace>"})
      Example for CNE Samples:
      sum(scrape_samples_scraped{kubernetes_namespace="occne-infra"})
      Example for UDR Samples:
      sum(scrape_samples_scraped{kubernetes_namespace="udr"})
    2. The number of samples usually remains constant irrespective of the TPS rate. However, for some NFs, the samples may grow with increasing TPS. In such cases, collect the growth rate of the samples.
    3. PV disk growth with the increasing sample size. This metric is collected only for observation and is not used in calculating the metrics size.
  3. After gathering the data as described in step 2, use the following formula to calculate the WAL growth (a scripted version of this calculation is sketched after Example 2).
    The formula considers the following values:
    • scrape_samples_scraped = total samples gathered in Step 2
    • One Sample Size = 13 bytes
    • Time period = 3 hours, that is, 180 minutes (compaction is generally observed to occur every two to four hours, so the average of three hours is used)
    • Default scrape interval provided by CNE = 60s

    Formula to calculate WAL growth

    WAL Growth = sum(scrape_samples_scraped{kubernetes_namespace="occne-infra"}) * 13 * 180 * (60/scrape-interval)

    Example 1

    Considering the following values:
    • Samples scraped per scrape at 5000 TPS with 4 worker nodes (CNE+UDR) = 75,000
    • scrape-interval = 60s
    • WAL growth = 75,000 * 13 * 180 * 1 = 175,500,000 bytes, which approximates to 175MB
    the upper limit of WAL growth can be considered as 250MB in 3 hours.

    Example 2

    Considering the following values:
    • Samples scraped per scrape at 5000 TPS with 5 worker nodes (CNE+UDR) = 88,200
    • scrape-interval = 10s
    • WAL growth = 88,200 * 13 * 180 * (60/10) = 1,238,328,000 bytes, which approximates to 1.15GB
    the upper limit of WAL growth can be considered as 1.2GB in 3 hours.
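    The calculation above can be scripted as a quick sanity check. The following is a minimal sketch: the curl call assumes the Prometheus API is reachable through the /prometheus route shown in the deployment arguments later in this section (replace <prometheus-host> with the address of your Prometheus endpoint), and the awk line reuses the values from Example 1:

    # Fetch the current samples-per-scrape total from the Prometheus HTTP API
    $ curl -s "http://<prometheus-host>/prometheus/api/v1/query" \
        --data-urlencode 'query=sum(scrape_samples_scraped{kubernetes_namespace="occne-infra"})'

    # WAL growth in bytes = samples * 13 bytes * 180 minutes * (60 / scrape interval in seconds)
    $ awk 'BEGIN { samples=75000; interval=60; print samples * 13 * 180 * (60 / interval) }'
    175500000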

Note:

Prometheus runs garbage collection only after the PV is at least 40% full (for more information, see Prometheus-Storage); garbage collection is not initiated before that threshold. There is also no fixed time interval: for example, if the Prometheus PV is 90% full, Prometheus does not start purging old persisted data immediately but waits until the next garbage collection cycle, and the entire process is internal to Prometheus. To avoid the PV becoming completely full and Prometheus crashing before the garbage collector runs, CNE provides a buffer of 10% in the retention size.

Calculating Retention Size

The following examples provide details on calculating the retention size.

Example 1

Considering the following values:
  • Size of PV equal to 8GB
  • WAL growth in 3 hours greater than or equal to 250MB (0.25GB), which is approximately 5% when rounded to the nearest multiple of 5
  • Buffer left for Garbage Collection (GC) cycle equal to 10%, which is greater than or equal to 0.8GB
the retention size is calculated as 85% (100 - 5 - 10), which is greater than or equal to 6.8GB (85% of 8GB).

Example 2

Considering the following values:
  • Size of PV equal to 8GB
  • WAL growth in 3 hours greater than or equal to 1.2GB, which is approximately 15% when rounded to the nearest multiple of 5
  • Buffer left for Garbage Collection (GC) cycle equal to 10%, which is greater than or equal to 0.8GB
the retention size is calculated as 75% (100 - 15 - 10), which is greater than or equal to 6GB (75% of 8GB).
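The percentage arithmetic in these examples can be scripted. The following is a minimal sketch using the values from Example 1; the PV size and the rounded WAL growth percentage are inputs you must derive for your own deployment:

$ awk 'BEGIN { pv=8; wal_pct=5; gc_pct=10; pct=100-wal_pct-gc_pct; printf "retention size: %d%% of %dGB = %.1fGB\n", pct, pv, pv*pct/100 }'
retention size: 85% of 8GB = 6.8GB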

Changing the Retention Value for CNE 1.8.x and below

  1. Perform the following steps to scale down the Prometheus deployment to 0 replicas.

    Caution:

    This step can lead to loss of data as Prometheus does not scrape or store metrics during the downtime.
    1. Run the following command to get the Prometheus deployment:
      $ kubectl get deploy -n <namespace>
      Example:
      $ kubectl get deploy -n occne-infra
      Sample output:
      NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
      occne-elastic-exporter-elasticsearch-exporter   1/1     1            1           9d
      occne-grafana                                   1/1     1            1           9d
      occne-kibana                                    1/1     1            1           9d
      occne-metrics-server                            1/1     1            1           9d
      occne-prometheus-kube-state-metrics             1/1     1            1           9d
      occne-prometheus-pushgateway                    1/1     1            1           9d
      occne-prometheus-server                         1/1     1            1           9d
      occne-snmp-notifier                             1/1     1            1           9d
      occne-tracer-jaeger-collector                   1/1     1            1           9d
      occne-tracer-jaeger-query                       1/1     1            1           9d
    2. Run the following command to scale down the deployment to 0:
      $ kubectl scale deploy occne-prometheus-server --replicas 0 -n occne-infra
      Sample output:
      deployment.apps/occne-prometheus-server scaled
  2. Edit the retention size of Prometheus deployment and save the deployment:
    $ kubectl edit deploy occne-prometheus-server -n <namespace>
    Example:
    $ kubectl edit deploy occne-prometheus-server -n occne-infra
    Sample output:
     317       - args:
     318         - --storage.tsdb.retention.time=14d
     319         - --config.file=/etc/config/prometheus.yml
     320         - --storage.tsdb.path=/data
     321         - --web.console.libraries=/etc/prometheus/console_libraries
     322         - --web.console.templates=/etc/prometheus/consoles
     323         - --web.enable-lifecycle
     324         - --web.external-url=http://localhost/prometheus
     325         - --storage.tsdb.retention.size=8GB

    Note:

    Edit line number 325 in the sample output, which displays the retention size, to 6.8GB (as per Example 1 in Calculating Retention Size). A non-interactive alternative using kubectl patch is sketched at the end of this procedure.
  3. Scale up the Prometheus:
    $ kubectl scale deployment occne-prometheus-server --replicas 1 -n <namespace>
    Example:
    $ kubectl scale deploy occne-prometheus-server --replicas 1 -n occne-infra
    Sample output:
    deployment.apps/occne-prometheus-server scaled
  4. Check if the pod is up and running:
    $ kubectl get pods -n <namespace>
    Example:
    $ kubectl get pods -n occne-infra | grep prometheus-server
    Sample output:
    NAME                                                             READY   STATUS      RESTARTS   AGE
    occne-prometheus-server-58d4d5c459-7dq2d                         2/2     Running     0          19h
  5. Perform the following steps to check if the deployment reflects the updated retention size:
    1. Run the following command to get the Prometheus deployment:
      $ kubectl get deploy -n occne-infra
      Sample output:
      NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
      occne-elastic-exporter-elasticsearch-exporter   1/1     1            1           9d
      occne-grafana                                   1/1     1            1           9d
      occne-kibana                                    1/1     1            1           9d
      occne-metrics-server                            1/1     1            1           9d
      occne-prometheus-kube-state-metrics             1/1     1            1           9d
      occne-prometheus-pushgateway                    1/1     1            1           9d
      occne-prometheus-server                         1/1     1            1           9d
      occne-snmp-notifier                             1/1     1            1           9d
      occne-tracer-jaeger-collector                   1/1     1            1           9d
      occne-tracer-jaeger-query                       1/1     1            1           9d
    2. Run the following commands to open the occne-prometheus-server.yaml file and verify the updated retention size:
      $ kubectl get deploy occne-prometheus-server -n occne-infra -o yaml > occne-prometheus-server.yaml
      $ cat occne-prometheus-server.yaml
      Sample output:
      317       - args:
      318         - --storage.tsdb.retention.time=14d
      319         - --config.file=/etc/config/prometheus.yml
      320         - --storage.tsdb.path=/data
      321         - --web.console.libraries=/etc/prometheus/console_libraries
      322         - --web.console.templates=/etc/prometheus/consoles
      323         - --web.enable-lifecycle
      324         - --web.external-url=http://localhost/prometheus
      325         - --storage.tsdb.retention.size=6.8GB

      Note:

      Search or grep for "retention.size" in the occne-prometheus-server.yaml file to check that the retention size is updated to 6.8GB (a single-command version is sketched after this procedure).
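If a non-interactive alternative to the kubectl edit in step 2 is preferred, the retention size argument can also be changed with kubectl patch. The following is a sketch only: the container index (0) and argument index (7) correspond to the sample output shown in step 2 and must be verified against your own deployment before use:

$ kubectl patch deploy occne-prometheus-server -n occne-infra --type=json \
    -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args/7", "value": "--storage.tsdb.retention.size=6.8GB"}]'

The verification in step 5 can likewise be reduced to a single command:

$ kubectl get deploy occne-prometheus-server -n occne-infra -o yaml | grep "retention.size"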

Changing the Retention Value for CNE 1.9.x and above

Note:

As CNE 1.9.x and above uses HA for Prometheus, there is no service interruption or loss of data while performing this procedure, because one Prometheus pod is always up during the rolling update.
  1. Run the following command to list the Prometheus component:
    $ kubectl get prometheus -n <namespace>
    Example:
    $ kubectl get prometheus -n occne-infra
    Sample output:
    NAME                                    VERSION   REPLICAS   AGE
    occne-kube-prom-stack-kube-prometheus   v2.24.0   2          7d5h
  2. Run the following command to edit the Prometheus component:
    $ kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n <namespace>
    Example:
    $ kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra
    Sample output:
    51   replicas: 2
    52   retention: 14d
    53   retentionSize: 8GB
    54   routePrefix: /prometheus
    55   ruleNamespaceSelector: {}
    56   ruleSelector:
    57     matchLabels:

    Note:

    1. Edit line number 53 in the sample output, which displays the retention size, to 6.8GB (as per Example 1 in Calculating Retention Size). A non-interactive alternative using kubectl patch is sketched at the end of this procedure.
    2. This initiates a rolling update of the prometheus-occne-kube-prom-stack-kube-prometheus pods, which takes a few minutes to complete.
  3. Check if the pods are up and running.
    $ kubectl get pods -n <namespace>
    Example:
    $ kubectl get pods -n occne-infra | grep kube-prometheus
    Sample output:
    prometheus-occne-kube-prom-stack-kube-prometheus-0               2/2     Running            1          4m22s
    prometheus-occne-kube-prom-stack-kube-prometheus-1               2/2     Running            1          4m58s
  4. Perform the following steps to check if the deployment reflects the updated retention size:
    1. Run the following command to get the Statefulset of the deployment:
      $ kubectl get sts -n occne-infra
      Sample output:
      NAME                                                   READY   AGE
      alertmanager-occne-kube-prom-stack-kube-alertmanager   2/2     7d5h
      occne-elastic-elasticsearch-client                     3/3     7d5h
      occne-elastic-elasticsearch-data                       3/3     7d5h
      occne-elastic-elasticsearch-master                     3/3     7d5h
      prometheus-occne-kube-prom-stack-kube-prometheus       2/2     7d5h
    2. Run the following commands to open the occne-kube-prom-stack.yaml file and verify the updated retention size:
      $ kubectl get sts prometheus-occne-kube-prom-stack-kube-prometheus -n occne-infra -o yaml > occne-kube-prom-stack.yaml
      $ cat occne-kube-prom-stack.yaml
      Sample output:
      - args:
        - --web.console.templates=/etc/prometheus/consoles
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --storage.tsdb.retention.size=6.8GB

      Note:

      Search or grep for "retention.size" in the occne-kube-prom-stack.yaml file to check that the retention size is updated to 6.8GB.
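As with the previous procedure, the interactive edit in step 2 can be replaced with a non-interactive patch of the Prometheus custom resource. The following is a sketch that assumes the retentionSize field shown in the sample output of step 2:

$ kubectl patch prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra \
    --type=merge -p '{"spec": {"retentionSize": "6.8GB"}}'

# Wait for the resulting rolling update to complete, then verify in a single command
$ kubectl rollout status sts prometheus-occne-kube-prom-stack-kube-prometheus -n occne-infra
$ kubectl get sts prometheus-occne-kube-prom-stack-kube-prometheus -n occne-infra -o yaml | grep "retention.size"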