A References
This section covers additional topics that are referenced while performing some of the procedures in this document.
A.1 Changing Retention Size of Prometheus
Prometheus purges old data when either of the following conditions is met:
- The data is older than the configured retention period (14 days) or exceeds the configured retention size.
- The PV is at least 90% full.
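Before changing the retention size, you can check how full the Prometheus PV is with a PromQL query. The following is a minimal sketch; it assumes that kubelet volume statistics are scraped and that the PVC name matches the illustrative regular expression:
100 * kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*prometheus.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*prometheus.*"}
A result at or above 90 indicates that the 90% purge threshold has been reached.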
Calculating WAL Growth
- Refer to the WAL space calculation document for all the formulas that are used in this section for calculating the WAL growth.
- Users must benchmark the metrics size with each release to calculate the approximate WAL size growth. Consider the following parameters for analyzing and understanding the metrics size:
Note:
- The metrics size must be calculated for a time period of three hours.
- Traffic must be active and running at the Maximum Transactions Per Second (Max TPS). The TPS considered in this example is 5000.
- Samples scraped per scrape (these queries can also be run through the Prometheus HTTP API, as shown in the curl sketch after this list):
sum(scrape_samples_scraped{kubernetes_namespace="<namespace>"})
Example for CNE samples:
sum(scrape_samples_scraped{kubernetes_namespace="occne-infra"})
Example for UDR samples:
sum(scrape_samples_scraped{kubernetes_namespace="udr"})
- Samples must remain constant irrespective of the TPS rate. However, for some of the NFs, the samples may grow with increasing TPS. In such cases, collect the growth rate of samples.
- PV disk growth with the increasing sample size. This metric is collected only for observation and is not used in calculating the metrics size.
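The queries listed above can be run from the Prometheus UI or, as in the following sketch, against the Prometheus HTTP API. The service name, port, and /prometheus path prefix are illustrative; the prefix follows the --web.external-url setting shown later in this procedure:
$ kubectl port-forward -n occne-infra svc/occne-prometheus-server 9090:80 &
$ curl -sG 'http://localhost:9090/prometheus/api/v1/query' --data-urlencode 'query=sum(scrape_samples_scraped{kubernetes_namespace="occne-infra"})'
The "value" field in the JSON response contains the total samples scraped per scrape.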
- After gathering the data as described in step 2, use the following formula to calculate WAL growth.
The formula considers the following values:
- scrape_samples_scraped = total samples gathered in Step 2
- One Sample Size = 13 bytes
- Time period = 3 hours, that is, 180 minutes (compaction is generally observed to occur at intervals of two to four hours; three hours is the average)
- Default scrape interval provided by CNE = 60s
Formula to calculate WAL growth (in bytes)
WAL Growth = sum(scrape_samples_scraped{kubernetes_namespace="occne-infra"}) * 13 * 180 * (60/scrape-interval)
Example 1
Considering the following values:
- Samples scraped per scrape at 5000 TPS with 4 worker nodes (CNE+UDR) = 75,000
- scrape-interval = 60s
- WAL growth = 75,000 * 13 * 180 * (60/60) = 175,500,000 bytes, which approximates to 175MB
Example 2
Considering the following values:
- Samples scraped per scrape at 5000 TPS with 5 worker nodes (CNE+UDR) = 88,200
- scrape-interval = 10s
- WAL growth = 88,200 * 13 * 180 * (60/10) = 1,238,328,000 bytes, which approximates to 1.15GB
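As a quick sanity check, the formula can be evaluated in a shell. This is a sketch; substitute your own measured sample count and scrape interval:
$ samples=88200; interval=10
$ echo "$samples * 13 * 180 * (60 / $interval)" | bc
1238328000
The result is in bytes and matches Example 2 (approximately 1.15GB).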
Note:
Prometheus runs garbage collection only after the PV is at least 40% full (for more information, see Prometheus-Storage); garbage collection is not initiated before the PV reaches this threshold. For example, if the Prometheus PV is 90% full, Prometheus does not start purging old persisted data immediately. Instead, it waits until the next garbage collection cycle. Therefore, there is no fixed time interval, and the entire process is internal to Prometheus. To avoid Prometheus becoming completely full and crashing before the garbage collector runs, CNE provides a buffer of 10% in the retention size.
Calculating Retention Size
The following examples provide details on calculating the retention size.
Example 1
- Size of PV equal to 8GB
- WAL growth in 3 hours greater than or equal to 250MB (0.25GB), which is approximately 5% when rounded to the nearest multiple of 5
- Buffer left for the Garbage Collection (GC) cycle equal to 10%, which is greater than or equal to 0.8GB
- Resulting retention size = 8GB * (100% - 5% - 10%) = 6.8GB
Example 2
- Size of PV equal to 8GB
- WAL growth in 3 hours greater than or equal to 1.2GB, which is approximately 15% when rounded to the nearest multiple of 5
- Buffer left for the Garbage Collection (GC) cycle equal to 10%, which is greater than or equal to 0.8GB
- Resulting retention size = 8GB * (100% - 15% - 10%) = 6GB
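The same calculation can be scripted as a sketch, with the rounded WAL growth and GC buffer percentages as inputs:
$ awk -v pv=8 -v wal=5 -v buf=10 'BEGIN { print pv * (100 - wal - buf) / 100 }'
6.8
With the values from Example 1, this yields the 6.8GB retention size used in the procedures below.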
Changing the Retention Value for CNE 1.8.x and below
- Perform the following steps to scale down the Prometheus deployment to 0.
Caution:
This step can lead to loss of data as Prometheus does not scrape or store metrics during the downtime.
- Run the following command to get the Prometheus deployment:
$ kubectl get deploy -n <namespace>
Example:
$ kubectl get deploy -n occne-infra
Sample output:
NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
occne-elastic-exporter-elasticsearch-exporter   1/1     1            1           9d
occne-grafana                                   1/1     1            1           9d
occne-kibana                                    1/1     1            1           9d
occne-metrics-server                            1/1     1            1           9d
occne-prometheus-kube-state-metrics             1/1     1            1           9d
occne-prometheus-pushgateway                    1/1     1            1           9d
occne-prometheus-server                         1/1     1            1           9d
occne-snmp-notifier                             1/1     1            1           9d
occne-tracer-jaeger-collector                   1/1     1            1           9d
occne-tracer-jaeger-query                       1/1     1            1           9d
- Run the following command to scale down the deployment to 0:
$ kubectl scale deploy occne-prometheus-server --replicas 0 -n occne-infra
Sample output:
deployment.apps/occne-prometheus-server scaled
- Edit the retention size of the Prometheus deployment and save the deployment:
$ kubectl edit deploy occne-prometheus-server -n <namespace>
Example:
$ kubectl edit deploy occne-prometheus-server -n occne-infra
Sample output:
317         - args:
318           - --storage.tsdb.retention.time=14d
319           - --config.file=/etc/config/prometheus.yml
320           - --storage.tsdb.path=/data
321           - --web.console.libraries=/etc/prometheus/console_libraries
322           - --web.console.templates=/etc/prometheus/consoles
323           - --web.enable-lifecycle
324           - --web.external-url=http://localhost/prometheus
325           - --storage.tsdb.retention.size=8GB
Note:
Edit line number 325 in the sample output, which displays the retention size, to 6.8GB (as per Example 1 in Calculating Retention Size).
- Scale up Prometheus:
$ kubectl scale deployment occne-prometheus-server --replicas 1 -n <namespace>
Example:
$ kubectl scale deploy occne-prometheus-server --replicas 1 -n occne-infra
Sample output:
deployment.apps/occne-prometheus-server scaled
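Optionally, before checking the pod, you can wait for the rollout to finish with the following standard kubectl command (a convenience, not part of the original procedure):
$ kubectl rollout status deployment/occne-prometheus-server -n occne-infra
deployment "occne-prometheus-server" successfully rolled out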
- Check if the pod is up and running:
$ kubectl get pods -n <namespace>
Example:
$ kubectl get pods -n occne-infra | grep prometheus-server
Sample output:
NAME                                       READY   STATUS    RESTARTS   AGE
occne-prometheus-server-58d4d5c459-7dq2d   2/2     Running   0          19h
- Perform the following steps to check if the deployment reflects the updated retention size:
- Run the following command to get the Prometheus deployment:
$ kubectl get deploy -n occne-infra
Sample output:
NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
occne-elastic-exporter-elasticsearch-exporter   1/1     1            1           9d
occne-grafana                                   1/1     1            1           9d
occne-kibana                                    1/1     1            1           9d
occne-metrics-server                            1/1     1            1           9d
occne-prometheus-kube-state-metrics             1/1     1            1           9d
occne-prometheus-pushgateway                    1/1     1            1           9d
occne-prometheus-server                         1/1     1            1           9d
occne-snmp-notifier                             1/1     1            1           9d
occne-tracer-jaeger-collector                   1/1     1            1           9d
occne-tracer-jaeger-query                       1/1     1            1           9d
- Run the following commands to open the occne-prometheus-server.yaml file and verify the updated retention size:
$ kubectl get deploy occne-prometheus-server -n occne-infra -o yaml > occne-prometheus-server.yaml
$ cat occne-prometheus-server.yaml
Sample output:
317         - args:
318           - --storage.tsdb.retention.time=14d
319           - --config.file=/etc/config/prometheus.yml
320           - --storage.tsdb.path=/data
321           - --web.console.libraries=/etc/prometheus/console_libraries
322           - --web.console.templates=/etc/prometheus/consoles
323           - --web.enable-lifecycle
324           - --web.external-url=http://localhost/prometheus
325           - --storage.tsdb.retention.size=6.8GB
Note:
Use search or grep for "retention.size" in the occne-prometheus-server.yaml file to check if the retention size is updated to 6.8GB.
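For example, the check described in the note can be run as follows on the file generated in the previous step:
$ grep "retention.size" occne-prometheus-server.yaml
- --storage.tsdb.retention.size=6.8GB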
Changing the Retention Value for CNE 1.9.x and above
Note:
As CNE 1.9.x and above use HA for Prometheus, there is no service interruption or loss of data while performing this procedure. This is because one Prometheus pod is always up while the procedure is performed.
- Run the following command to list the Prometheus component:
$ kubectl get prometheus -n <namespace>
Example:
$ kubectl get prometheus -n occne-infra
Sample output:
NAME                                    VERSION   REPLICAS   AGE
occne-kube-prom-stack-kube-prometheus   v2.24.0   2          7d5h
- Run the following command to edit the Prometheus component:
$ kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n <namespace>
Example:
$ kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra
Sample output:
51    replicas: 2
52    retention: 14d
53    retentionSize: 8GB
54    routePrefix: /prometheus
55    ruleNamespaceSelector: {}
56    ruleSelector:
57      matchLabels:
Note:
- Edit line number 53 in the sample output, which displays the retention size, to 6.8GB (as per Example 1 in Calculating Retention Size).
- This initiates a rolling update for the prometheus-occne-kube-prom-stack-kube-prometheus pods, which takes a few minutes to complete.
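As a non-interactive alternative to kubectl edit, the retentionSize field of the Prometheus custom resource can be patched directly. The following is a sketch; verify the resource name in your cluster before running it:
$ kubectl patch prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra --type merge -p '{"spec":{"retentionSize":"6.8GB"}}'
prometheus.monitoring.coreos.com/occne-kube-prom-stack-kube-prometheus patched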
- Check if the pods are up and running:
$ kubectl get pods -n <namespace>
Example:
$ kubectl get pods -n occne-infra | grep kube-prometheus
Sample output:
prometheus-occne-kube-prom-stack-kube-prometheus-0   2/2   Running   1   4m22s
prometheus-occne-kube-prom-stack-kube-prometheus-1   2/2   Running   1   4m58s
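You can also follow the rolling update mentioned in the note above by watching the StatefulSet directly (a convenience sketch, not part of the original procedure):
$ kubectl rollout status statefulset/prometheus-occne-kube-prom-stack-kube-prometheus -n occne-infra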
- Perform the following steps to check if the deployment reflects the updated retention size:
- Run the following command to get the StatefulSet of the deployment:
$ kubectl get sts -n occne-infra
Sample output:
NAME                                                   READY   AGE
alertmanager-occne-kube-prom-stack-kube-alertmanager   2/2     7d5h
occne-elastic-elasticsearch-client                     3/3     7d5h
occne-elastic-elasticsearch-data                       3/3     7d5h
occne-elastic-elasticsearch-master                     3/3     7d5h
prometheus-occne-kube-prom-stack-kube-prometheus       2/2     7d5h
- Run the following commands to open the occne-kube-prom-stack.yaml file and verify the updated retention size:
$ kubectl get sts prometheus-occne-kube-prom-stack-kube-prometheus -n occne-infra -o yaml > occne-kube-prom-stack.yaml
$ cat occne-kube-prom-stack.yaml
Sample output:
- args:
  - --web.console.templates=/etc/prometheus/consoles
  - --web.console.libraries=/etc/prometheus/console_libraries
  - --storage.tsdb.retention.size=6.8GB
Note:
Use search or grep for "retention.size" in the occne-kube-prom-stack.yaml file to check if the retention size is updated to 6.8GB.
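In addition to inspecting the YAML file, you can confirm the value that the running Prometheus picked up by querying its flags endpoint. This is a sketch; the service name and port are illustrative, and the /prometheus path prefix follows the routePrefix shown in the sample output above:
$ kubectl port-forward -n occne-infra svc/occne-kube-prom-stack-kube-prometheus 9090 &
$ curl -s http://localhost:9090/prometheus/api/v1/status/flags | grep -o '"storage.tsdb.retention.size":"[^"]*"'
"storage.tsdb.retention.size":"6.8GB"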