A References

This section lists additional topics that are referenced while performing some of the procedures in this document.

A.1 Changing Retention Size of Prometheus

This section details the procedure for changing the retention size of Prometheus. Prometheus is configured to delete data only under the following conditions:
  • The data is older than the configured retention period (14 days) or exceeds the configured retention size.
  • The PV is at least 90% full.
The retention size calculation does not consider the write-ahead log (WAL). If the WAL size is greater than 10% of the PV size, the PV fills up before Prometheus detects the "too full" condition. Therefore, it is recommended to calculate the WAL growth and adjust the retention size of Prometheus accordingly.
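To observe the current WAL size on disk before adjusting the retention size, you can run a du command inside the Prometheus server pod. The following is a minimal sketch; the deployment name, container name (prometheus-server), and data path (/data) are taken from the sample outputs later in this section and may differ in your environment:

$ kubectl exec -n occne-infra deploy/occne-prometheus-server -c prometheus-server -- du -sh /data/wal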

Calculating WAL Growth

  1. Refer to the WAL space calculation document for all the formulas used in this section to calculate the WAL growth.
  2. Users must benchmark the metrics size with each release to calculate the approximate WAL growth. Consider the following parameters for analyzing and understanding the metrics size:

    Note:

    • The metrics size must be calculated for a time period of three hours.
    • The traffic must be active and running at the Maximum Transactions Per Second (MAX TPS). The TPS considered in this example is 5000.
    1. Samples scraped per scrape:
      sum(scrape_samples_scraped{kubernetes_namespace="<namespace>"})
      Example for CNE Samples:
      sum(scrape_samples_scraped{kubernetes_namespace="occne-infra"})
      Example for UDR Samples:
      sum(scrape_samples_scraped{kubernetes_namespace="udr"})
    2. The number of samples usually remains constant irrespective of the TPS rate. However, for some NFs, the samples may grow with increasing TPS. In such cases, collect the growth rate of the samples.
    3. PV disk growth with the increasing sample size. This metric is collected only for observation and is not used in calculating the metrics size.
  3. After gathering the data as described in step 2, use the following formula to calculate the WAL growth (a scripted version of this calculation is sketched after Example 2).
    The formula considers the following values:
    • scrape_samples_scraped = total samples gathered in Step 2
    • One Sample Size = 13 bytes
    • Time period = 3 hours, that is, 180 minutes (compaction is generally observed to occur every two to four hours, so the average of three hours is used)
    • Default scrape interval provided by CNE = 60s

    Formula to calculate WAL growth

    WAL Growth = sum(scrape_samples_scraped{kubernetes_namespace="occne-infra"}) * 13 * 180 * (60/scrape-interval)

    Example 1

    Considering the following values:
    • Samples scraped per scrape at 5000 TPS with 4 worker nodes (CNE+UDR) = 75,000
    • scrape-interval = 60s
    • WAL growth = 75,000 * 13 * 180 * 1 = 175,500,000 bytes, which approximates to 175MB
    the upper limit of WAL growth can be considered as 250MB in 3 hours.

    Example 2

    Considering the following values:
    • Samples scraped per scrape at 5000 TPS with 5 worker nodes (CNE+UDR) = 88,200
    • scrape-interval = 10s
    • WAL growth = 88,200 * 13 * 180 * (60/10) = 1,238,328,000 bytes, which approximates to 1.15GB
    the upper limit of WAL growth can be considered as 1.2GB in 3 hours.
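    The calculation above can be scripted as a quick sanity check. The following is a minimal sketch: the curl call assumes the Prometheus API is reachable through the /prometheus route shown in the deployment arguments later in this section (replace <prometheus-host> with the address of your Prometheus endpoint), and the awk line reuses the values from Example 1:

    # Fetch the current samples-per-scrape total from the Prometheus HTTP API
    $ curl -s "http://<prometheus-host>/prometheus/api/v1/query" \
        --data-urlencode 'query=sum(scrape_samples_scraped{kubernetes_namespace="occne-infra"})'

    # WAL growth in bytes = samples * 13 bytes * 180 minutes * (60 / scrape interval in seconds)
    $ awk 'BEGIN { samples=75000; interval=60; print samples * 13 * 180 * (60 / interval) }'
    175500000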

Note:

Prometheus runs garbage collection only after the PV is at least 40% full (for more information, see Prometheus-Storage); garbage collection is not initiated before that threshold. There is also no fixed time interval: for example, if the Prometheus PV is 90% full, Prometheus does not start purging old persisted data immediately but waits until the next garbage collection cycle, and the entire process is internal to Prometheus. To avoid the PV becoming completely full and Prometheus crashing before the garbage collector runs, CNE provides a buffer of 10% in the retention size.

Calculating Retention Size

The following examples provide details on calculating the retention size.

Example 1

Considering the following values:
  • Size of PV equal to 8GB
  • WAL growth in 3 hours greater than or equal to 250MB (0.25GB), which is approximately 5% when rounded to the nearest multiple of 5
  • Buffer left for Garbage Collection (GC) cycle equal to 10%, which is greater than or equal to 0.8GB
the retention size is calculated as 85% (100 - 5 - 10), which is greater than or equal to 6.8GB (85% of 8GB).

Example 2

Considering the following values:
  • Size of PV equal to 8GB
  • WAL growth in 3 hours greater than or equal to 1.2GB, which is approximately 15% when rounded to the nearest multiple of 5
  • Buffer left for Garbage Collection (GC) cycle equal to 10%, which is greater than or equal to 0.8GB
the retention size is calculated as 75% (100 - 15 - 10), which is greater than or equal to 6GB (75% of 8GB).
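The percentage arithmetic in these examples can be scripted. The following is a minimal sketch using the values from Example 1; the PV size and the rounded WAL growth percentage are inputs you must derive for your own deployment:

$ awk 'BEGIN { pv=8; wal_pct=5; gc_pct=10; pct=100-wal_pct-gc_pct; printf "retention size: %d%% of %dGB = %.1fGB\n", pct, pv, pv*pct/100 }'
retention size: 85% of 8GB = 6.8GB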

Changing the Retention Value for CNE 1.8.x and below

  1. Perform the following steps to scale down the Prometheus deployment to 0 replicas.

    Caution:

    This step can lead to loss of data as Prometheus does not scrape or store metrics during the downtime.
    1. Run the following command to get the Prometheus deployment:
      $ kubectl get deploy -n <namespace>
      Example:
      $ kubectl get deploy -n occne-infra
      Sample output:
      NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
      occne-elastic-exporter-elasticsearch-exporter   1/1     1            1           9d
      occne-grafana                                   1/1     1            1           9d
      occne-kibana                                    1/1     1            1           9d
      occne-metrics-server                            1/1     1            1           9d
      occne-prometheus-kube-state-metrics             1/1     1            1           9d
      occne-prometheus-pushgateway                    1/1     1            1           9d
      occne-prometheus-server                         1/1     1            1           9d
      occne-snmp-notifier                             1/1     1            1           9d
      occne-tracer-jaeger-collector                   1/1     1            1           9d
      occne-tracer-jaeger-query                       1/1     1            1           9d
    2. Run the following command to scale down the deployment to 0:
      $ kubectl scale deploy occne-prometheus-server --replicas 0 -n occne-infra
      Sample output:
      deployment.apps/occne-prometheus-server scaled
  2. Edit the retention size of Prometheus deployment and save the deployment:
    $ kubectl edit deploy occne-prometheus-server -n <namespace>
    Example:
    $ kubectl edit deploy occne-prometheus-server -n occne-infra
    Sample output:
     317       - args:
     318         - --storage.tsdb.retention.time=14d
     319         - --config.file=/etc/config/prometheus.yml
     320         - --storage.tsdb.path=/data
     321         - --web.console.libraries=/etc/prometheus/console_libraries
     322         - --web.console.templates=/etc/prometheus/consoles
     323         - --web.enable-lifecycle
     324         - --web.external-url=http://localhost/prometheus
     325         - --storage.tsdb.retention.size=8GB

    Note:

    Edit line number 325 in the sample output, which displays the retention size, to 6.8GB (as per Example 1 in Calculating Retention Size). A non-interactive alternative using kubectl patch is sketched at the end of this procedure.
  3. Scale up the Prometheus:
    $ kubectl scale deployment occne-prometheus-server --replicas 1 -n <namespace>
    Example:
    $ kubectl scale deploy occne-prometheus-server --replicas 1 -n occne-infra
    Sample output:
    deployment.apps/occne-prometheus-server scaled
  4. Check if the pod is up and running:
    $ kubectl get pods -n <namespace>
    Example:
    $ kubectl get pods -n occne-infra | grep prometheus-server
    Sample output:
    NAME                                                             READY   STATUS      RESTARTS   AGE
    occne-prometheus-server-58d4d5c459-7dq2d                         2/2     Running     0          19h
  5. Perform the following steps to check if the deployment reflects the updated retention size:
    1. Run the following command to get the Prometheus deployment:
      $ kubectl get deploy -n occne-infra
      Sample output:
      NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
      occne-elastic-exporter-elasticsearch-exporter   1/1     1            1           9d
      occne-grafana                                   1/1     1            1           9d
      occne-kibana                                    1/1     1            1           9d
      occne-metrics-server                            1/1     1            1           9d
      occne-prometheus-kube-state-metrics             1/1     1            1           9d
      occne-prometheus-pushgateway                    1/1     1            1           9d
      occne-prometheus-server                         1/1     1            1           9d
      occne-snmp-notifier                             1/1     1            1           9d
      occne-tracer-jaeger-collector                   1/1     1            1           9d
      occne-tracer-jaeger-query                       1/1     1            1           9d
    2. Run the following commands to open the occne-prometheus-server.yaml file and verify the updated retention size:
      $ kubectl get deploy occne-prometheus-server -n occne-infra -o yaml > occne-prometheus-server.yaml
      $ cat occne-prometheus-server.yaml
      Sample output:
      317       - args:
      318         - --storage.tsdb.retention.time=14d
      319         - --config.file=/etc/config/prometheus.yml
      320         - --storage.tsdb.path=/data
      321         - --web.console.libraries=/etc/prometheus/console_libraries
      322         - --web.console.templates=/etc/prometheus/consoles
      323         - --web.enable-lifecycle
      324         - --web.external-url=http://localhost/prometheus
      325         - --storage.tsdb.retention.size=6.8GB

      Note:

      Search or grep for "retention.size" in the occne-prometheus-server.yaml file to check that the retention size is updated to 6.8GB (a single-command version is sketched after this procedure).
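If a non-interactive alternative to the kubectl edit in step 2 is preferred, the retention size argument can also be changed with kubectl patch. The following is a sketch only: the container index (0) and argument index (7) correspond to the sample output shown in step 2 and must be verified against your own deployment before use:

$ kubectl patch deploy occne-prometheus-server -n occne-infra --type=json \
    -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args/7", "value": "--storage.tsdb.retention.size=6.8GB"}]'

The verification in step 5 can likewise be reduced to a single command:

$ kubectl get deploy occne-prometheus-server -n occne-infra -o yaml | grep "retention.size"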

Changing the Retention Value for CNE 1.9.x and above

Note:

As CNE 1.9.x and above uses HA for Prometheus, there is no service interruption or loss of data while performing this procedure, because one Prometheus pod is always up during the rolling update.
  1. Run the following command to list the Prometheus component:
    $ kubectl get prometheus -n <namespace>
    Example:
    $ kubectl get prometheus -n occne-infra
    Sample output:
    NAME                                    VERSION   REPLICAS   AGE
    occne-kube-prom-stack-kube-prometheus   v2.24.0   2          7d5h
  2. Run the following command to edit the Prometheus component:
    $ kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n <namespace>
    Example:
    $ kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra
    Sample output:
    51   replicas: 2
    52   retention: 14d
    53   retentionSize: 8GB
    54   routePrefix: /prometheus
    55   ruleNamespaceSelector: {}
    56   ruleSelector:
    57     matchLabels:

    Note:

    1. Edit line number 53 in the sample output, which displays the retention size, to 6.8GB (as per Example 1 in Calculating Retention Size). A non-interactive alternative using kubectl patch is sketched at the end of this procedure.
    2. This initiates a rolling update of the prometheus-occne-kube-prom-stack-kube-prometheus pods, which takes a few minutes to complete.
  3. Check if the pods are up and running.
    $ kubectl get pods -n <namespace>
    Example:
    $ kubectl get pods -n occne-infra | grep kube-prometheus
    Sample output:
    prometheus-occne-kube-prom-stack-kube-prometheus-0               2/2     Running            1          4m22s
    prometheus-occne-kube-prom-stack-kube-prometheus-1               2/2     Running            1          4m58s
  4. Perform the following steps to check if the deployment reflects the updated retention size:
    1. Run the following command to get the Statefulset of the deployment:
      $ kubectl get sts -n occne-infra
      Sample output:
      NAME                                                   READY   AGE
      alertmanager-occne-kube-prom-stack-kube-alertmanager   2/2     7d5h
      occne-elastic-elasticsearch-client                     3/3     7d5h
      occne-elastic-elasticsearch-data                       3/3     7d5h
      occne-elastic-elasticsearch-master                     3/3     7d5h
      prometheus-occne-kube-prom-stack-kube-prometheus       2/2     7d5h
    2. Run the following commands to open the occne-kube-prom-stack.yaml file and verify the updated retention size:
      $ kubectl get sts prometheus-occne-kube-prom-stack-kube-prometheus -n occne-infra -o yaml > occne-kube-prom-stack.yaml
      $ cat occne-kube-prom-stack.yaml
      Sample output:
      - args:
        - --web.console.templates=/etc/prometheus/consoles
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --storage.tsdb.retention.size=6.8GB

      Note:

      Search or grep for "retention.size" in the occne-kube-prom-stack.yaml file to check that the retention size is updated to 6.8GB.
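As with the previous procedure, the interactive edit in step 2 can be replaced with a non-interactive patch of the Prometheus custom resource. The following is a sketch that assumes the retentionSize field shown in the sample output of step 2:

$ kubectl patch prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra \
    --type=merge -p '{"spec": {"retentionSize": "6.8GB"}}'

# Wait for the resulting rolling update to complete, then verify in a single command
$ kubectl rollout status sts prometheus-occne-kube-prom-stack-kube-prometheus -n occne-infra
$ kubectl get sts prometheus-occne-kube-prom-stack-kube-prometheus -n occne-infra -o yaml | grep "retention.size"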