7 Maintenance Procedures
This chapter provides detailed instructions about how to maintain the CNE platform.
7.1 Premaintenance Check for VMware Deployments
This section provides details about the checks that must be run on VMware deployments before performing any maintenance procedures.
- Verify the content of the compute/main.tf and compute-lbvm/main.tf files:
  - Run the following command to verify the content of the compute/main.tf file:
    $ cat /var/occne/cluster/${OCCNE_CLUSTER}/modules/compute/main.tf | grep 'ignore_changes\|override_template_disk' -C 2
    Ensure that the content of the file exactly matches the following content:
    }
    override_template_disk {
      bus_type   = "paravirtual"
      size_in_mb = var.disk
    --
      lifecycle {
        ignore_changes = [ vapp_template_id, template_name, catalog_name, override_template_disk ]
      }
    --
    }
    override_template_disk {
      bus_type   = "paravirtual"
      size_in_mb = var.disk
    --
      lifecycle {
        ignore_changes = [ vapp_template_id, template_name, catalog_name, override_template_disk ]
      }
  - Run the following command to verify the content of the compute-lbvm/main.tf file:
    $ cat /var/occne/cluster/${OCCNE_CLUSTER}/modules/compute-lbvm/main.tf | grep 'ignore_changes\|override_template_disk' -C 2
    Ensure that the content of the file exactly matches the following content:
    }
    override_template_disk {
      bus_type   = "paravirtual"
      size_in_mb = var.disk
    --
      lifecycle {
        ignore_changes = [ vapp_template_id, template_name, catalog_name, override_template_disk ]
      }
    --
    }
    override_template_disk {
      bus_type   = "paravirtual"
      size_in_mb = var.disk
    --
      lifecycle {
        ignore_changes = [ vapp_template_id, template_name, catalog_name, override_template_disk ]
      }
- If the files don't contain the ignore_changes argument, then edit the files and add the argument to each of the "vcd_vapp_vm" resources:
  - Run the following command to edit the compute/main.tf file:
    $ vi /var/occne/cluster/${OCCNE_CLUSTER}/modules/compute/main.tf
  - Add the following content between each override_template_disk code block and the metadata = var.metadata line for each "vcd_vapp_vm" resource:
    lifecycle {
      ignore_changes = [ vapp_template_id, template_name, catalog_name, override_template_disk ]
    }
  - Save the compute/main.tf file.
  - Run the following command to edit the compute-lbvm/main.tf file:
    $ vi /var/occne/cluster/${OCCNE_CLUSTER}/modules/compute-lbvm/main.tf
  - Add the following content between each override_template_disk code block and the metadata = var.metadata line for each "vcd_vapp_vm" resource:
    lifecycle {
      ignore_changes = [ vapp_template_id, template_name, catalog_name, override_template_disk ]
    }
  - Save the compute-lbvm/main.tf file.
- Repeat step 1 to ensure that the content of the files matches the content provided in that step.
7.2 Accessing the CNE
This section describes the procedures to access a CNE for maintenance purposes.
7.2.1 Accessing the Bastion Host
This section provides information about how to access a CNE Bastion Host.
Prerequisites
- SSH private key must be available on the server or VM that is used to access the Bastion Host.
- The SSH private keys generated or provided during the installation must match the authorized key (public) present in the Bastion Hosts. For more information about the keys, see the installation prerequisites in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
Procedure
All commands must be run from a server or VM that has network access to the CNE Bastion Hosts. To access the Bastion Host, perform the following tasks.
7.2.1.1 Logging in to the Bastion Host
This section describes the procedure to log in to the Bastion Host.
- Determine the Bastion Host IP address.
Contact your system administrator to obtain the IP addresses of the CNE Bastion Hosts. The system administrator can obtain the IP addresses from the OpenStack Dashboard, VMware Cloud Director, or by other means such as from the BareMetal Hosts.
- To log in to the Bastion Host, run the following command:
Note:
The default value for <user_name> is cloud-user (for vCNE) or admusr (for BareMetal).
$ ssh -i /<ssh_key_dir>/<ssh_key_name>.key <user_name>@<bastion_host_ip_address>
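For example, a login that uses hypothetical values for the key path, user name, and Bastion Host IP address (replace these with the values for your deployment) looks like the following:
$ ssh -i /home/user/.ssh/occne_cluster.key cloud-user@10.75.151.10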
7.2.1.2 Copying Files to the Bastion Host
This section describes the procedure to copy the files to the Bastion Host.
- Determine the Bastion Host IP address.
Contact your system administrator to obtain the IP addresses of the CNE Bastion Hosts. The system administrator can obtain the IP addresses from the OpenStack Dashboard, VMware Cloud Director, or by other means such as from the BareMetal Hosts.
- To copy files to the Bastion Host, run the following
command:
$ scp -i /<ssh_key_dir>/<ssh_key_name>.key <source_file> <user_name>@<bastion_host_ip_address>:/<path>/<dest_file>
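For example, the following command copies a local file to the /var/occne directory on the Bastion Host; the key path, file name, user name, and IP address shown here are hypothetical values:
$ scp -i /home/user/.ssh/occne_cluster.key myfile.txt cloud-user@10.75.151.10:/var/occne/myfile.txt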
7.2.1.3 Managing Bastion Host
The Bastion Host comes with the following built-in scripts to manage the Bastion Hosts:
- is_active_bastion
- get_active_bastion
- get_other_bastions
- update_active_bastion.sh
These scripts are used to get details about Bastion Hosts, such as checking if the current Bastion Host is the active one and getting the list of other Bastions. This section provides the procedures to manage Bastion Hosts using these scripts.
These scripts are located in the /var/occne/cluster/$OCCNE_CLUSTER/artifacts/ directory. You don't have to change to that directory to run these scripts. You can run them from anywhere within a Bastion Host like a system command, as the directory containing the scripts is a part of $PATH.
The scripts may not work in the following cases:
- If the lb-controller pod is not running.
- If the kubectl admin configuration is not set properly.
7.2.1.3.1 Verifying if the Current Bastion Host is the Active One
This section describes the procedure to verify if the current Bastion
Host is the active one using the is_active_bastion
script.
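For example, when the script is run on the active Bastion Host, it prints the confirmation shown below (this matches the output used in the Local DNS preactivation checks later in this chapter); on a Bastion Host that is not active, the script indicates that the Bastion is not the active one:
$ is_active_bastion
IS active-bastion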
7.2.1.3.2 Getting the Host IP or Hostname of the Current Bastion Host
This section provides details about getting the Host IP or Hostname of
the current Bastion Host using the get_active_bastion
script.
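For example, the script prints the host IP or hostname of the currently active Bastion Host; the value shown below is hypothetical:
$ get_active_bastion
occne1-rainbow-bastion-1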
7.2.1.4 Troubleshooting Bastion Host
This section describes the issues that you may encounter while using Bastion Host and their troubleshooting guidelines.
Permission Denied Error While Running Kubernetes Command
Users may encounter a "Permission Denied" error while running Kubernetes commands if they do not have proper access.
error: error loading config file "/var/occne/cluster/occne1-rainbow/artifacts/admin.conf": open /var/occne/cluster/occne1-rainbow/artifacts/admin.conf: permission denied
Verify that you have permission to access admin.conf. The user running the command must be able to run basic kubectl commands to use the Bastion scripts.
Commands Take Too Long to Respond and Fail to Return Output
A command may take too long to display any output. For example, running the is_active_bastion command may take too long to respond, leading to a timed out error.
error: timed out waiting for the condition
- Verify the status of the bastion-controller. This error can occur if the pods are not running or are in a crash state due to various reasons, such as lack of resources in the cluster.
- Print the bastion controller logs to check the issue. For example, print the logs and check if a crash loop error is caused by lack of resources.
$ kubectl logs -n ${OCCNE_NAMESPACE} deploy/occne-bastion-controller
Sample output:Error from server (BadRequest): container "bastion-controller" in pod "occne-bastion-controller-797db5f845-hqlm6" is waiting to start: ContainerCreating
Command Not Found Error
Users may encounter a "command not found" error while running a script.
-bash: is_active_bastion: command not found
Verify that the $PATH variable is set properly and contains the artifacts directory.
Note:
By default, CNE sets up the path automatically during the installation.
$ echo $PATH
Sample output:
/home/cloud-user/.local/bin:/home/cloud-user/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/var/occne/cluster/occne1-rainbow/artifacts/istio-1.18.2/bin/:/var/occne/cluster/occne1-rainbow/artifacts
7.3 General Configuration
This section describes the general configuration tasks for CNE.
7.3.1 Configuring SNMP Trap Destinations
This section describes the procedure to set up SNMP notifiers within CNE, such that the AlertManager can send alerts as SNMP traps to one or more SNMP receivers.
- Perform the following steps to verify the cluster condition before setting up multiple trap receivers:
  - Run the following command and verify that the alertmanager and snmp-notifier services are running:
    $ kubectl get services --all-namespaces | grep -E 'snmp-notifier|alertmanager'
    Sample output:
    NAMESPACE     NAME                                      TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
    occne-infra   occne-kube-prom-stack-kube-alertmanager   LoadBalancer   10.233.16.156   10.75.151.178   80:31100/TCP   11m
    occne-infra   occne-alertmanager-snmp-notifier          ClusterIP      10.233.41.30    <none>          9464/TCP       11m
  - Run the following command and verify that the alertmanager and snmp-notifier pods are running:
    $ kubectl get pods --all-namespaces | grep -E 'snmp-notifier|alertmanager'
    Sample output:
    occne-infra   alertmanager-occne-kube-prom-stack-kube-alertmanager-0   2/2   Running   0   18m
    occne-infra   alertmanager-occne-kube-prom-stack-kube-alertmanager-1   2/2   Running   0   18m
    occne-infra   occne-alertmanager-snmp-notifier-744b755f96-m8vbx        1/1   Running   0   18m
- Perform the following steps to edit the default snmp-destination and add a new snmp-destination:
  - Run the following command from the Bastion Host to get the current snmp-notifier resources:
    $ kubectl get all -n occne-infra | grep snmp
    Sample output:
    pod/occne-alertmanager-snmp-notifier-75656cf4b7-gw55w   1/1   Running   0   37m
    service/occne-alertmanager-snmp-notifier   ClusterIP   10.233.29.86   <none>   9464/TCP   10h
    deployment.apps/occne-alertmanager-snmp-notifier   1/1   1   1   10h
    replicaset.apps/occne-alertmanager-snmp-notifier-75656cf4b7   1   1   1   37m
  - The snmp-destination is the interface IP address of the trap receiver that receives the traps. Edit the deployment to modify snmp-destination and add a new snmp-destination when needed:
    - Run the following command to edit the deployment:
      $ kubectl edit -n occne-infra deployment occne-alertmanager-snmp-notifier
    - From the vi editor, move down to the snmp-destination section. The default configuration is as follows:
      - --snmp.destination=127.0.0.1:162
    - Add a new destination to receive the traps. For example:
      - --snmp.destination=192.168.200.236:162
    - If you want to add multiple trap receivers, add them on separate lines. For example:
      - --snmp.destination=192.168.200.236:162
      - --snmp.destination=10.75.135.11:162
      - --snmp.destination=10.33.64.50:162
    - After editing, use the :x or :wq command to save and exit.
      Sample output:
      deployment.apps/occne-alertmanager-snmp-notifier edited
- Perform the following steps to verify the new replicaset and delete the old replicaset:
  - Run the following command to get the resources and check the restart time to verify that the pod and replicaset are regenerated:
    $ kubectl get all -n occne-infra | grep snmp
    Sample output:
    pod/occne-alertmanager-snmp-notifier-88976f7cc-xs8mv   1/1   Running   0   90s
    service/occne-alertmanager-snmp-notifier   ClusterIP   10.233.29.86   <none>   9464/TCP   10h
    deployment.apps/occne-alertmanager-snmp-notifier   1/1   1   1   10h
    replicaset.apps/occne-alertmanager-snmp-notifier-75656cf4b7   0   0   0   65m
    replicaset.apps/occne-alertmanager-snmp-notifier-88976f7cc   1   1   1   90s
  - Identify the old replicaset from the previous step and delete it. For example, the restart time of replicaset.apps/occne-alertmanager-snmp-notifier-75656cf4b7 in the previous step output is 65m. This indicates that it is the old replicaset. Use the following command to delete the old replicaset:
    $ kubectl delete -n occne-infra replicaset.apps/occne-alertmanager-snmp-notifier-75656cf4b7
- To test whether the new trap receiver receives the SNMP traps, port 162 of the receiver server must be open and an application must be listening to capture the traps. This step may vary depending on the type of server. The following codeblock provides an example for a Linux server:
  $ sudo iptables -A INPUT -p udp -m udp --dport 162 -j ACCEPT
  $ sudo dnf install -y tcpdump
  $ sudo tcpdump -n -i <interface of the ip address set in snmp-destination> port 162
7.3.2 Changing Network MTU
This section describes the procedure to modify the Maximum Transmission Unit (MTU) of the Kubernetes internal network after the initial CNE installation.
Changing MTU on Internal Interface (eth0) for vCNE (OpenStack or VMware)
Note:
- The MTU value on the VM host depends on the ToR switch configuration:
  - Cisco Nexus9000 93180YC-EX supports "system jumbomtu" up to 9216.
  - If you're using a port-channel, VLAN interface, or uplink interface to the customer switch, then run the "system jumbomtu <mtu>" command and configure "mtu <value>" up to the value obtained from the command.
  - If you're using other types of ToR switches, you can configure the MTU value of the VM host up to the maximum MTU value of the switch. Therefore, check the switches for the maximum MTU value and configure the MTU value accordingly.
- The following steps are for a standard setup with bastion-1 or master-1 on host-1, bastion-2 or master-2 on host-2, and master-3 on host-3. If you have a different setup, then modify the commands accordingly. Each step in this procedure is performed to change MTU for the VM host and the Bastion on the VM host.
- SSH to k8s-host-2 from
bastion-1:
$ ssh k8s-host-2
- Run the following command to show all the
connections:
$ nmcli con show
- Run the following commands to modify the MTU value on all the connections:
  Note:
  Modify the connection names in the following commands according to the connection names obtained from step 2.
  $ sudo nmcli con mod bond0 802-3-ethernet.mtu <MTU value>
  $ sudo nmcli con mod bondbr0 802-3-ethernet.mtu <MTU value>
  $ sudo nmcli con mod "vlan<mgmt vlan id>-br" 802-3-ethernet.mtu <MTU value>
  $ sudo nmcli con up bond0
  $ sudo nmcli con up bondbr0
  $ sudo nmcli con up "vlan<mgmt vlan id>-br"
- Run the following commands if there is a vlan<ilo_vlan_id>-br connection on this host:
  $ sudo nmcli con mod "vlan<ilo vlan id>-br" 802-3-ethernet.mtu <MTU value>
  $ sudo nmcli con up "vlan<ilo vlan id>-br"
- After the values are updated on the VM host, run the following commands to shut down all the VM guests:
  $ sudo virsh list --all
  $ sudo virsh shutdown <VM guest>
  where, <VM guest> is the VM guest name obtained from the $ sudo virsh list --all command.
- Run the virsh list command until the status of the VM guest is changed to "shut off":
  $ sudo virsh list --all
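For example, assuming the host runs VM guests named bastion-2 and k8s-master-2 (hypothetical names; use the names returned by virsh list --all on your host), the shutdown sequence looks like the following:
$ sudo virsh list --all
$ sudo virsh shutdown bastion-2
$ sudo virsh shutdown k8s-master-2
$ sudo virsh list --all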
- Run the following command to start the VM guest:
  $ sudo virsh start <VM guest>
  where, <VM guest> is the name of the VM guest.
- Wait until bastion-2 is reachable and run the following command to SSH to bastion-2:
  $ ssh bastion-2
- Run the following command to list all connections in
bastion-2:
$ nmcli con show
- Run the following commands to modify the MTU value on all the connections in bastion-2:
  Note:
  Modify the connection names in the following commands according to the connection names obtained in step 8.
  $ sudo nmcli con mod "System enp1s0" 802-3-ethernet.mtu <MTU value>
  $ sudo nmcli con mod "System enp2s0" 802-3-ethernet.mtu <MTU value>
  $ sudo nmcli con mod "System enp3s0" 802-3-ethernet.mtu <MTU value>
  $ sudo nmcli con up "System enp1s0"
  $ sudo nmcli con up "System enp2s0"
  $ sudo nmcli con up "System enp3s0"
- Wait until bastion-2 is reachable and run the following command
to SSH to bastion-2:
$ ssh bastion-2
- Repeat steps 9 and 10 to change the MTU value on k8s-host-1 and bastion-1.
- Repeat steps 1 to 10 to change the MTU values on k8s-host-3 and
restart all VM guests on it. You can use bastion-1 or bastion-2 for
performing this step.
Note:
For the VM guests that are controller nodes, perform only the virsh shutdown and virsh start commands to restart the VM guests. The MTU values of these controller nodes are updated in the following section.
Changing MTU on enp1s0 or bond0 Interface for BareMetal Controller or Worker Nodes
- Run the following command to launch the provision
container:
$ podman run -it --rm --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host winterfell:5000/occne/provision:<release> /bin/bash
Where, <release> is the currently installed release.
This creates a Bash shell session running within the provision container.
- Run the following commands to change enp1s0 interfaces for controller nodes and
validate MTU value of the interface:
- Change enp1s0 interfaces for controller nodes:
Replace <MTU value> in the command with a real integer value.
$ ansible -i /host/hosts.ini kube-master -m shell -a 'sudo nmcli con mod "System enp1s0" 802-3-ethernet.mtu <MTU value>; sudo nmcli con up "System enp1s0"'
- Validate the MTU value of the
interface:
$ ansible -i /host/hosts.ini kube-master -m shell -a 'ip link show enp1s0'
- Change enp1s0 interfaces for controller nodes:
- Run the following commands to change bond0 interfaces for worker nodes and validate
the MTU value of the interface:
- Change bond0 interfaces for worker nodes:
Replace <MTU value> in the command with a real integer value.
$ ansible -i /host/hosts.ini kube-node -m shell -a 'sudo nmcli con mod bond0 802-3-ethernet.mtu <MTU value>; sudo nmcli con up bond0'
- Validate the MTU value of the
interface:
$ ansible -i /host/hosts.ini kube-node -m shell -a 'ip link show bond0'
$ exit
- Change bond0 interfaces for controller nodes:
- Log in to the Bastion host and run the following
command:
$ kubectl edit daemonset calico-node -n kube-system
- Locate the line with FELIX_VXLANMTU and replace the current <MTU value> with the new integer value:
  Note:
  The vxlan.calico interface adds an extra header to each packet. Therefore, the modified MTU value must be at least 50 lower than the MTU set in the previous steps. For example, if the host MTU is set to 9000, set FELIX_VXLANMTU to 8950 or lower.
  - name: FELIX_VXLANMTU
    value: "<MTU value>"
- Use
:x
to save and exit the vi editor and run the following command:$ kubectl rollout restart daemonset calico-node -n kube-system
- Run the following command to launch the provision container:
$ podman run -it --rm --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host winterfell:5000/occne/provision:${OCCNE_VERSION} /bin/bash
- Validate the MTU value of the interface on the controller nodes
and worker nodes:
- For BareMetal, run the following command to validate
the MTU
value:
$ ansible -i /host/hosts.ini k8s-cluster -m shell -a 'ip link show vxlan.calico'
- For vCNE (OpenStack or VMware), run the following
command to validate the MTU
value:
$ ansible -i /host/hosts k8s-cluster -m shell -a 'ip link show vxlan.calico'
Note:
It takes some time for all the nodes to change to the new MTU. If the MTU value isn't updated, run the command several times to see the changes in the values.
- For BareMetal, run the following command to validate
the MTU
value:
- Log in to Bastion host and launch the provision container for vCNE or BareMetal using commands from Step 1 of Change MTU on eth0 interface for vCNE and Change MTU on enp1s0 or bond0 interface for BareMetal.
- Run the ansible command for all worker nodes from the provision
container:
Note:
Run this command for worker nodes only and not for controller nodes.- Run the following command for a BareMetal
deployment:
Note:
Replace <MTU value> in the command with an integer value without quote.bash-4.4# ansible -i /host/hosts.ini kube-node -m shell -a 'sudo sed -i '/\\\"mtu\\\"/d' /etc/cni/net.d/calico.conflist.template; sudo sed -i "/\\\"type\\\": \\\"calico\\\"/a \ \ \ \ \ \ \\\"mtu\\\": <MTU value>," /etc/cni/net.d/calico.conflist.template' bash-4.4# exit
  - Run the following command for a vCNE deployment:
    Note:
    Replace <MTU value> in the command with an integer value without quotes.
    bash-4.4# ansible -i /host/hosts kube-node -m shell -a 'sudo sed -i '/\\\"mtu\\\"/d' /etc/cni/net.d/calico.conflist.template; sudo sed -i "/\\\"type\\\": \\\"calico\\\"/a \ \ \ \ \ \ \\\"mtu\\\": <MTU value>," /etc/cni/net.d/calico.conflist.template'
    bash-4.4# exit
- Log in to the Bastion Host and run the following command to restart the daemonset:
$ kubectl rollout restart daemonset calico-node -n kube-system
- Run the following commands to delete deployment and reapply with
YAML file. The calico interface MTU change takes effect while starting a new
pod on the node.
- Verify that the deployment is READY 1/1 before delete
and
reapply:
$ kubectl get deployment occne-kube-prom-stack-grafana -n occne-infra
Sample output:NAME READY UP-TO-DATE AVAILABLE AGE occne-kube-prom-stack-grafana 1/1 1 1 10h
- Run the following commands to delete the deployment and
reapply with YAML
file:
    $ kubectl get deployment occne-kube-prom-stack-grafana -n occne-infra -o yaml > dp-occne-kube-prom-stack-grafana.yaml
    $ kubectl delete deployment occne-kube-prom-stack-grafana -n occne-infra
    $ kubectl apply -f dp-occne-kube-prom-stack-grafana.yaml
- Verify that the deployment is READY 1/1 before delete
and
reapply:
- Run the following commands to verify the MTU change on worker
nodes:
- Verify which node has the new
pod:
$ kubectl get pod -A -o wide | grep occne-kube-prom-stack-grafana
Sample output:occne-infra occne-kube-prom-stack-grafana-79f9b5b488-cl76b 3/3 Running 0 60s 10.233.120.22 k8s-node-2.littlefinger.lab.us.oracle.com <none> <none>
  - Use SSH to log in to the node and check the calico interface change. Only the calico interface created for the new pod has the updated MTU; the MTU of the other calico interfaces changes when the pods of the other services are recreated.
    $ ssh k8s-node-2.littlefinger.lab.us.oracle.com
    [admusr@k8s-node-2 ~] $ ip link
    Sample output:
    ...
    35: calia44682149a1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP mode DEFAULT group default
        link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-7f1a8116-5acf-b7df-5d6a-eb4f56330cf1
    115: calif0adcd64a1c@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu <MTU value> qdisc noqueue state UP mode DEFAULT group default
        link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-7b99dc36-3b3b-75c6-e27c-9045eeb8242d
- Verify which node has the new
pod:
7.3.3 Changing Metrics Storage Allocation
The following procedure describes how to increase the amount of persistent storage allocated to Prometheus for metrics storage.
Prerequisites
Note:
When you increase the storage size for Prometheus, the retention size must also be increased to maintain the purging cycle of Prometheus. The default retention is set to 6.8 GB. If the storage is increased to a higher value and the retention remains at 6.8 GB, the amount of data stored is still limited to 6.8 GB. Therefore, follow the Changing Retention Size of Prometheus procedure to calculate the retention size and update it in Prometheus. These steps are applied while performing Step 3.
Procedure
7.3.4 Changing OpenSearch Storage Allocation
This section describes the procedure to increase the amount of persistent storage allocated to OpenSearch for data storage.
Prerequisites
- Calculate the revised amount of persistent storage required by OpenSearch. Rerun the OpenSearch storage calculations as provided in the "Preinstallation Tasks" section of Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide, and record the calculated log_trace_active_storage and log_trace_inactive_storage values.
Procedure
Use log_trace_active_storage for the opensearch-data PV size and log_trace_inactive_storage for the opensearch-master PV size. The following table displays the sample PV sizes considered in this procedure:
OpenSearch Component | Current PV Size | Desired PV Size |
---|---|---|
occne-opensearch-master | 500Mi | 500Mi |
occne-opensearch-data | 10Gi | 200Gi (log_trace_active_storage) |
opensearch-data-replicas-count | 5 | 7 |
- Store the output of the current configuration values for the
os-master-helm-values.yaml
file.$ helm -n occne-infra get values occne-opensearch-master > os-master-helm-values.yaml
- Update the PVC size block in the os-master-helm-values.yaml file. The PVC size must be updated to the newly required PVC size (in this case, 50Gi as per the sample value considered). The os-master-helm-values.yaml file is required in Step 8 to recreate the occne-opensearch-master statefulset.
  $ vi os-master-helm-values.yaml
  persistence:
    enabled: true
    image: occne-repo-host:5000/docker.io/busybox
    imageTag: 1.31.0
    size: <desired size>Gi
    storageClass: occne-esmaster-sc
- Delete the statefulset of
occne-opensearch-cluster-master
by running the following command:$ kubectl -n occne-infra delete sts --cascade=orphan occne-opensearch-cluster-master
- Delete the
occne-opensearch-cluster-master-2
pod by running the following command:$ kubectl -n occne-infra delete pod occne-opensearch-cluster-master-2
- Update the PVC storage size in the PVC of
occne-opensearch-cluster-master-2
by running the following command:$ kubectl -n occne-infra patch -p '{ "spec": { "resources": { "requests": { "storage": "40Gi" }}}}' pvc occne-opensearch-cluster-master-occne-opensearch-cluster-master-2
- Get the PV volume ID from the PVC of
opensearch-master-2
:$ kubectl get pvc -n occne-infra | grep master-2
Sample output:occne-opensearch-cluster-master-occne-opensearch-cluster-master-2 Bound pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72 30Gi RWO occne-esmaster-sc 17h
In this case, the PV volume ID in the sample output is pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72.
- Hold on to the PV attached to
occne-opensearch-cluster-master-2
PVC using the volume ID until the newly updated size gets reflected. Verify the updated PVC value by running the following command:$ kubectl get pv -w | grep pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72
Sample output:
pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72   30Gi   RWO   Delete   Bound   occne-infra/occne-opensearch-cluster-master-occne-opensearch-cluster-master-2   occne-esmaster-sc   17h
pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72   40Gi   RWO   Delete   Bound   occne-infra/occne-opensearch-cluster-master-occne-opensearch-cluster-master-2   occne-esmaster-sc   17h
- Run Helm upgrade to recreate the
occne-opensearch-master
statefulset:$ helm upgrade -f os-master-helm-values.yaml occne-opensearch-master opensearch-project/opensearch -n occne-infra
- Once the deleted pod (master-2) and its statefulset are up and
running, check the pod's PVC status and verify if it reflects the updated
size.
$ kubectl get pvc -n occne-infra | grep master-2
Sample output:
occne-opensearch-cluster-master-occne-opensearch-cluster-master-2   Bound   pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72   40Gi   RWO   occne-esmaster-sc   17h
In this example, the PV volume ID is pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72.
- Repeat steps 3 through 9 for each of the remaining pods, one after the other (in order master-1, master-0).
- Store the output of the current configuration values in the os-data-helm-values.yaml file:
  $ helm -n occne-infra get values occne-opensearch-data > os-data-helm-values.yaml
- Update the PVC size block in the os-data-helm-values.yaml file. The PVC size must be updated to the newly required PVC size (in this case, 200Gi as per the sample value considered). The os-data-helm-values.yaml file is required in Step 8 of this procedure to recreate the occne-opensearch-data statefulset.
  $ vi os-data-helm-values.yaml
Sample output:
persistence:
  enabled: true
  image: occne-repo-host:5000/docker.io/busybox
  imageTag: 1.31.0
  size: <desired size>Gi
  storageClass: occne-esdata-sc
- Delete the statefulset of occne-opensearch-cluster-data by running the following command:
  $ kubectl -n occne-infra delete sts --cascade=orphan occne-opensearch-cluster-data
- Delete the occne-opensearch-cluster-data-2 pod by running the following command:
  $ kubectl -n occne-infra delete pod occne-opensearch-cluster-data-2
- Update the PVC storage size in the PVC of
occne-opensearch-cluster-data-2
.$ kubectl -n occne-infra patch -p '{ "spec": { "resources": { "requests": { "storage": "20Gi" }}}}' pvc occne-opensearch-cluster-data-occne-opensearch-cluster-data-2
- Get the PV volume ID from the PVC of
opensearch-data-2
.$ kubectl get pvc -n occne-infra | grep data-2
Sample output:occne-opensearch-cluster-data-occne-opensearch-cluster-data-2 Bound pvc-80a56d73-d7b7-417f-a7a7-c8484bc8171d 10Gi RWO occne-esdata-sc 17h
- Hold on to the PV attached to opensearch-data-2 PVC using the
volume ID until the newly updated size gets reflected. Verify the updated
PVC value by running the following command:
$ kubectl get pv -w | grep pvc-80a56d73-d7b7-417f-a7a7-c8484bc8171d
Sample output:
pvc-80a56d73-d7b7-417f-a7a7-c8484bc8171d   10Gi   RWO   Delete   Bound   occne-infra/occne-opensearch-cluster-data-occne-opensearch-cluster-data-2   occne-esdata-sc   17h
pvc-80a56d73-d7b7-417f-a7a7-c8484bc8171d   20Gi   RWO   Delete   Bound   occne-infra/occne-opensearch-cluster-data-occne-opensearch-cluster-data-2   occne-esdata-sc   17h
- Run helm upgrade to recreate the
occne-opensearch-data
statefulset$ helm upgrade -f os-data-helm-values.yaml occne-opensearch-data opensearch-project/opensearch -n occne-infra
- Once the deleted pod (data-2) and its statefulset are up and
running, check the pod's PVC status and verify if it reflects the updated
size.
$ kubectl get pvc -n occne-infra | grep data-2
Sample output:occne-opensearch-cluster-data-occne-opensearch-cluster-data-2 Bound pvc-80a56d73-d7b7-417f-a7a7-c8484bc8171d 20Gi RWO occne-esdata-sc 17h
- Repeat steps 3 through 9 for each of the remaining pods, one after the other (in the order, data-1, data-0,..).
7.3.5 Changing the RAM and CPU Resources for Common Services
This section describes the procedure to change the RAM and CPU resources for CNE common services.
Prerequisites
- The cluster must be in a healthy state. This can be verified by checking if all the common services are up and running.
Note:
- When changing the CPU and RAM resources for any component, the limit value must always be greater than or equal to the requested value.
- Run all the commands in this section from the Bastion Host.
7.3.5.1 Changing the Resources for Prometheus
This section describes the procedure to change the RAM or CPU resources for Prometheus.
Procedure
- Run the following command to edit the Prometheus
resource:
kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra
The system opens a
vi
editor session that contains all the configuration for the CNE Prometheus instances. - Scroll to the resources section and change the CPU and Memory resources to the
desired values. This updates the resources for both the prometheus pods.
For example:
resources:
  limits:
    cpu: 2000m
    memory: 4Gi
  requests:
    cpu: 2000m
    memory: 4Gi
- Type
:wq
to exit the editor session and save the changes. - Verify if both the Prometheus pods are
restarted:
kubectl get pods -n occne-infra |grep kube-prom-stack-kube-prometheus
Sample output:
prometheus-occne-kube-prom-stack-kube-prometheus-0   2/2   Running   0   85s
prometheus-occne-kube-prom-stack-kube-prometheus-1   2/2   Running   0   104s
7.3.5.2 Changing the Resources for Alertmanager
This section describes the procedure to change the RAM or CPU resources for Alertmanager.
Procedure
- Run the following command to edit the Alertmanager
resource:
kubectl edit alertmanager occne-kube-prom-stack-kube-alertmanager -n occne-infra
The system opens a
vi
editor session that contains all the configuration for the CNE Alertmanager instances. - Scroll to the resources section and change the CPU and Memory resources to the
desired values. This updates the resources for the Alertmanager pods.
For example:
resources:
  limits:
    cpu: 20m
    memory: 64Mi
  requests:
    cpu: 20m
    memory: 64Mi
- Type
:wq
to exit the editor session and save the changes. - Verify if the Alertmanager pods are
restarted:
kubectl get pods -n occne-infra |grep alertmanager
Sample output:
alertmanager-occne-kube-prom-stack-kube-alertmanager-0   2/2   Running   0   16s
alertmanager-occne-kube-prom-stack-kube-alertmanager-1   2/2   Running   0   35s
7.3.5.3 Changing the Resources for Grafana
This section describes the procedure to change the RAM or CPU resources for Grafana.
Procedure
- Run the following command to edit the Grafana
resource:
kubectl edit deploy occne-kube-prom-stack-grafana -n occne-infra
The system opens a
vi
editor session that contains all the configuration for the CNE Grafana instances. - Scroll to the resources section and change the CPU and Memory resources to the
desired values. This updates the resources for the Grafana pod.
For example:
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
- Type
:wq
to exit the editor session and save the changes. - Verify if the Grafana pod is
restarted:
kubectl get pods -n occne-infra |grep grafana
Sample output:occne-kube-prom-stack-grafana-84898d89b4-nzkr4 3/3 Running 0 54s
7.3.5.4 Changing the Resources for Kube State Metrics
This section describes the procedure to change the RAM or CPU resources for kube-state-metrics.
Procedure
- Run the following command to edit the kube-state-metrics
resource:
kubectl edit deploy occne-kube-prom-stack-kube-state-metrics -n occne-infra
The system opens a
vi
editor session that contains all the configuration for the CNE kube-state-metrics instances. - Scroll to the resources section and change the CPU and Memory resources to the
desired values. This updates the resources for the kube-state-metrics pod.
For example:
resources:
  limits:
    cpu: 20m
    memory: 100Mi
  requests:
    cpu: 20m
    memory: 32Mi
- Type
:wq
to exit the editor session and save the changes. - Verify if the kube-state-metrics pod is
restarted:
kubectl get pods -n occne-infra |grep kube-state-metrics
Sample output:occne-kube-prom-stack-kube-state-metrics-cff54c76c-t5k7p 1/1 Running 0 20s
7.3.5.5 Changing the Resources for OpenSearch
This section describes the procedure to change the RAM or CPU resources for OpenSearch.
Procedure
- Run the following command to edit the opensearch-master
resource:
kubectl edit sts occne-opensearch-cluster-master -n occne-infra
The system opens a
vi
editor session that contains all the configuration for the CNE opensearch-master instances. - Scroll to the resources section and change the CPU and Memory resources to the
desired values. This updates the resources for the opensearch-master pod.
For example:
resources:
  limits:
    cpu: "1"
    memory: 2Gi
  requests:
    cpu: "1"
    memory: 2Gi
- Type
:wq
to exit the editor session and save the changes. - Verify if the opensearch-master pods are
restarted:
kubectl get pods -n occne-infra |grep opensearch-cluster-master
Sample output:
occne-opensearch-cluster-master-0   1/1   Running   0   3m34s
occne-opensearch-cluster-master-1   1/1   Running   0   4m8s
occne-opensearch-cluster-master-2   1/1   Running   0   4m19s
Note:
Repeat this procedure for opensearch-data and opensearch-client pods if required.
7.3.5.6 Changing the Resources for OpenSearch Dashboard
This section describes the procedure to change the RAM or CPU resources for OpenSearch Dashboard.
Procedure
- Run the following command to edit the opensearch-dashboard
resource:
kubectl edit deploy occne-opensearch-dashboards -n occne-infra
The system opens a
vi
editor session that contains all the configuration for the CNE opensearch-dashboard instances. - Scroll to the resources section and change the CPU and Memory resources to the
desired values. This updates the resources for the opensearch-dashboard
pod.
For example:
resources:
  limits:
    cpu: 100m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 512Mi
- Type
:wq
to exit the editor session and save the changes. - Verify if the opensearch-dashboard pod is
restarted:
kubectl get pods -n occne-infra |grep dashboard
Sample output:occne-opensearch-dashboards-7b7749c5f7-jcs7d 1/1 Running 0 20s
7.3.5.7 Changing the Resources for Fluentd OpenSearch
This section describes the procedure to change the RAM or CPU resources for Fluentd OpenSearch.
Procedure
- Run the following command to edit the
occne-fluentd-opensearch
resource:kubectl edit ds occne-fluentd-opensearch -n occne-infra
The system opens a
vi
editor session that contains all the configuration for the CNE Fluentd OpenSearch instances. - Scroll to the resources section and change the CPU and memory
resources to the desired values. This updates the resources for the Fluentd
OpenSearch pods.
For example:
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
- Type
:wq
to exit the editor session and save the changes. - Verify if the Fluentd OpenSearch pods are
restarted:
kubectl get pods -n occne-infra |grep fluentd-opensearch
Sample output:
occne-fluentd-opensearch-kcx87   1/1   Running   0   19s
occne-fluentd-opensearch-m9zhz   1/1   Running   0   9s
occne-fluentd-opensearch-pbbrw   1/1   Running   0   14s
occne-fluentd-opensearch-rstqf   1/1   Running   0   4s
7.3.5.8 Changing the Resources for Jaeger Agent
This section describes the procedure to change the RAM or CPU resources for Jaeger Agent.
Procedure
- Run the following command to edit the jaeger-agent
resource:
kubectl edit ds occne-tracer-jaeger-agent -n occne-infra
The system opens a
vi
editor session that contains all the configuration for the CNE jaeger-agent instances. - Scroll to the resources section and change the CPU and Memory resources to the
desired values. This updates the resources for the jaeger-agent pods.
For example:
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 256m
    memory: 128Mi
- Type
:wq
to exit the editor session and save the changes. - Verify if the jaeger-agent pods are
restarted:
kubectl get pods -n occne-infra |grep jaeger-agent
Sample output:
occne-tracer-jaeger-agent-dpn4v   1/1   Running   0   58s
occne-tracer-jaeger-agent-dvpnv   1/1   Running   0   62s
occne-tracer-jaeger-agent-h4t67   1/1   Running   0   55s
occne-tracer-jaeger-agent-q92ld   1/1   Running   0   51s
7.3.5.9 Changing the Resources for Jaeger Query
This section describes the procedure to change the RAM or CPU resources for Jaeger Query.
Procedure
- Run the following command to edit the jaeger-query
resource:
kubectl edit deploy occne-tracer-jaeger-query -n occne-infra
The system opens a
vi
editor session that contains all the configuration for the CNE jaeger-query instances. - Scroll to the resources section and change the CPU and Memory resources to the
desired values. This updates the resources for the jaeger-query pod.
For example:
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 256m
    memory: 128Mi
- Type
:wq
to exit the editor session and save the changes. - Verify if the jaeger-query pod is
restarted:
kubectl get pods -n occne-infra |grep jaeger-query
Sample output:occne-tracer-jaeger-query-67bdd85fcb-hw67q 2/2 Running 0 19s
Note:
Repeat this procedure for the jaeger-collector pod if required.
7.3.6 Activating and Configuring Local DNS
This section provides information about activating and configuring local DNS.
7.3.6.1 Activating Local DNS
Note:
Before activating Local DNS, ensure that you are aware of the following conditions:
- Local DNS does not handle backups of any added record.
- You must run this procedure to activate local DNS only after installing or upgrading to release 23.4.x.
7.3.6.1.1 Prerequisites
- Ensure that the cluster is running in a healthy state.
- Ensure that the CNE cluster is running with version 23.4.x. You can validate the CNE version by echoing the OCCNE_VERSION environment variable on the Bastion Host:
  $ echo $OCCNE_VERSION
- Ensure that the cluster is running with the Bastion DNS configuration.
7.3.6.1.2 Preactivation Checks
This section provides information about the checks that are performed before activating local DNS.
Determining the Active Bastion Host
- Log in to one of the Bastion Hosts (for example,
Bastion 1) and determine if that Bastion Host is
active or not by running the following
command:
$ is_active_bastion
The system displays the following output if the Bastion Host is active:
IS active-bastion
- If the current Bastion is not
active, then log in to the mate Bastion Host and verify if it is
active:
$ is_active_bastion
The system displays the following output if the Bastion Host is active:
IS active-bastion
Verifying if Local DNS is Already Activated
- Navigate to the cluster
directory:
$ cd /var/occne/cluster/${OCCNE_CLUSTER}
- Open the occne.ini file (for vCNE) or hosts.ini file (for Bare Metal) and verify if the local_dns_enabled variable under the occne:vars header is set to False.
  Example for vCNE:
  $ cat occne.ini
  Sample output:
  [occne:vars]
  .
  local_dns_enabled=False
  .
  Example for Bare Metal:
  $ cat hosts.ini
  Sample output:
  [occne:vars]
  .
  local_dns_enabled=False
  .
  If local_dns_enabled is set to True, then it indicates that the local DNS feature is already enabled in the CNE cluster.
  Note:
  Ensure that the first character of the variable value (True or False) is capitalized and that there is no space before or after the equals sign.
7.3.6.1.3 Enabling Local DNS
- Log in to the active Bastion Host and run the
following command to navigate to the cluster
directory:
$ cd /var/occne/cluster/${OCCNE_CLUSTER}
- Open the occne.ini file (for vCNE) or hosts.ini file (for Bare Metal) in edit mode:
  Example for vCNE:
  $ vi occne.ini
  Example for Bare Metal:
  $ vi hosts.ini
- Set the local_dns_enabled variable under the occne:vars header to True. If the local_dns_enabled variable is not present under the occne:vars header, then add the variable.
  Note:
  Ensure that the first character of the variable value (True or False) is capitalized and that there is no space before or after the equals sign.
  For example:
  [occne:vars]
  .
  local_dns_enabled=True
  .
- For vCNE (OpenStack or VMware) deployments, additionally add the provider_domain_name and provider_ip_address variables under the occne:vars section of the occne.ini file. You can obtain the provider domain name and IP address from the provider administrator and set the variable values accordingly.
  The following block shows a sample occne.ini file with the additional variables:
  [occne:vars]
  .
  local_dns_enabled=True
  provider_domain_name=<cloud provider domain name>
  provider_ip_address=<cloud provider IP address>
  .
- Update the cluster with the new settings in the
ini
file:
$ OCCNE_CONTAINERS=(K8S) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS='--tags=coredns' pipeline.sh
7.3.6.1.4 Validating Local DNS
This section provides the steps to validate if you have successfully enabled local DNS.
Use the validateLocalDns.py script to validate if you have successfully enabled Local DNS. The validateLocalDns.py script is located at /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/maintenance/validateLocalDns.py. This automated script validates Local DNS by performing the following actions:
- Creating a test record
- Reloading local DNS
- Querying the test record from within a pod
- Getting the response (Success status)
- Deleting the test record
Perform the following steps to run the validateLocalDns.py script:
- Log in to the active Bastion Host and navigate to the cluster
directory:
$ cd /var/occne/cluster/${OCCNE_CLUSTER}
- Run the validateLocalDns.py script:
  $ ./artifacts/maintenance/validateLocalDns.py
  Sample output:
  Beginning local DNS validation
  - Validating local DNS configuration in occne.ini
  - Adding DNS A record.
  - Adding DNS SRV record.
  - Reloading local coredns.
  - Verifying local DNS A record.
  - DNS A entry has not been propagated, retrying in 10 seconds (retry 1/5)
  - Verifying local DNS SRV record.
  - Deleting DNS SRV record.
  - Deleting DNS A record.
  - Reloading local coredns.
  Validation successful
Note:
If the script encounters an error, it returns an error message indicating which part of the process failed. For more information about troubleshooting local DNS errors, see Troubleshooting Local DNS.
- Once you successfully enable Local DNS, add the external hostname records using the Local DNS API to resolve external domain names using CoreDNS. For more information, see Adding and Removing DNS Records.
7.3.6.2 Adding and Removing DNS Records
This section provides the procedures to add and remove DNS records ("A" records and SRV records) using Local DNS API to the core DNS configuration.
Each Bastion Host runs a version of the Local DNS API as a service on port 8000. The system doesn't require any authentication from inside a Bastion Host and runs the API requests locally.
7.3.6.2.1 Prerequisites
- The Local DNS feature must be enabled on the cluster. For more information about enabling Local DNS, see Activating Local DNS.
- The CNE cluster version must be 23.2.x or above.
7.3.6.2.2 Adding an A Record
This section provides information on how to use the Local DNS API to create or add an A record in the CNE cluster.
Note:
- You cannot create and maintain identical A records.
- You cannot create two A records with the same name.
- You cannot create two A records with the same IP address within the same zone.
The following table provides details on how to use the Local DNS API to add an "A" record:
Table 7-1 Adding an A Record
Request URL | HTTP Method | Content Type | Request Body | Response Code | Sample Response |
---|---|---|---|---|---|
http://localhost:8000/occne/dns/a | POST | application/json | Note: Define each field in the request body within double quotes (" "). | 200 | DNS A record added in coredns file for occne.lab.oracle.com 175.80.200.20 3600, msg SUCCESS: Zone info and A record updated for domain name |
The following table provides details about the request body parameters:
Table 7-2 Request Body Parameters
Parameter | Required or Optional | Type | Description |
---|---|---|---|
name | Required | string | Fully-Qualified Domain Name (FQDN) to be included in the core DNS.
This parameter can contain
multiple subdomains where each subdomain can range between 1 and 63 characters and
contain the following characters: This parameter cannot start or end with For example, |
ip-address | Required | string | The IP address to locate a
service. For example, xxx.xxx.xxx.xxx .
The API supports IPv4 protocol only. |
ttl | Required | integer |
The Time To Live (TTL) in seconds. This is the amount of time the record is allowed to be cached by a resolver. The minimum and the maximum value that can be set are 300 and 3600 respectively. |
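The sample request body is not reproduced in the table above. As an illustration only, assuming a flat JSON body that uses the parameter names from Table 7-2 (with every field quoted, as required by the note in Table 7-1) and the values shown in the sample response, an A record can be added from the Bastion Host as follows; verify the exact payload format for your release before use:
$ curl -X POST http://localhost:8000/occne/dns/a \
    -H "Content-Type: application/json" \
    -d '{"name": "occne.lab.oracle.com", "ip-address": "175.80.200.20", "ttl": "3600"}'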
7.3.6.2.3 Deleting an A Record
This section provides information on how to use the Local DNS API to delete an A record in the CNE cluster.
Note:
- When the last A record in a zone is deleted, the system deletes the zone as well.
- You cannot delete an A record that is linked to an existing SRV record. You must first delete the linked SRV record to delete the A record.
The following table provides details on how to use the Local DNS API to delete an "A" record:
Table 7-3 Deleting an A Record
Request URL | HTTP Method | Content Type | Request Body | Response Code | Sample Response |
---|---|---|---|---|---|
http://localhost:8000/occne/dns/a | DELETE | application/json | Note: Define each field in the request body within double quotes (" "). | 200 | DNS A record deleted in coredns file for occne.lab.oracle.com 175.80.200.20, msg SUCCESS: A Record deleted |
The following table provides details about the request body parameters:
Table 7-4 Request Body Parameters
Parameter | Required or Optional | Type | Description |
---|---|---|---|
name | Required | string | Fully-Qualified Domain Name
(FQDN).
This parameter can contain multiple subdomains where each
subdomain can range between 1 and 63 characters and contain the following
characters: This parameter cannot
start or end with For example,
|
ip-address | Required | string | The IP address to locate a service. For example,
xxx.xxx.xxx.xxx .
|
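As an illustration only, assuming the same flat JSON body format as the add request (parameter names from Table 7-4, all fields quoted), the A record added above can be deleted as follows; verify the exact payload format for your release before use:
$ curl -X DELETE http://localhost:8000/occne/dns/a \
    -H "Content-Type: application/json" \
    -d '{"name": "occne.lab.oracle.com", "ip-address": "175.80.200.20"}'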
7.3.6.2.4 Adding an SRV Record
This section provides information on how to use the Local DNS API to create or add an SRV record in the CNE cluster.
Note:
- You cannot create and maintain identical SRV records. However, you can have a different protocol for the same combination of service and target A record.
- Currently, there is no provision to edit an existing SRV record. If you want to edit an SRV record, then delete the existing SRV record and then re-add the record with the updated parameters (weight, priority, or TTL).
The following table provides details on how to use the Local DNS API to create an SRV record:
Table 7-5 Adding an SRV Record
Request URL | HTTP Method | Content Type | Request Body | Response Code | Sample Response |
---|---|---|---|---|---|
https://localhost:8000/occne/dns/srv | POST | application/json | Note: Define each field in the request body within double quotes (" "). | 200 | SUCCESS: SRV record successfully added to config map coredns. |
The following table provides details about the request body parameters:
Table 7-6 Request Body Parameters
Parameter | Required or Optional | Type | Description |
---|---|---|---|
service | Required | string | The symbolic name for the
service, such as "sip", and "my_sql".
The value of this parameter can
range between 1 and 63 characters and contain the following characters:
[a-zA-Z0-9_-]. The parameter cannot start or end with |
protocol | Required | string | The protocol supported by the
service. The allowed values are:
|
dn | Required | string | The domain name that the SRV record is applicable to. This parameter
can contain multiple subdomains where each subdomain can range between 1 and 63
characters and contain the following characters: [a-zA-Z0-9_-] . For
example: lab.oracle.com. If the SRV record is
applicable to the entire domain, then provide only the domain name without
subdomains. For example, The length
of the Top Level Domains (TLD) must be between 1 and 6 characters and must only
contain the following characters: |
ttl | Required | integer |
The Time To Live (TTL) in seconds. This is the amount of time the record is allowed to be cached by a resolver. This value can range between 300 and 3600. |
priority | Required | integer | The priority of the current SRV record in comparison to the other SRV
records.
The values can range from 0 to n. |
weight | Required | integer | The weight of the current SRV record in comparison to the other SRV
records with the same priority.
The values can range from 0 to n. |
port | Required | integer | The port on which the target service is found.
The values can range from 1 to 65535. |
server | Required | string | The name of the machine providing the service without including the
domain name (value provided in the dn field).
The
value can range between 1 and 63 characters and contain the following characters:
|
a_record | Required | string | The "A" record name to which the SRV is added.
The "A" record mentioned here must be already added. Otherwise the request fails. |
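As an illustration only, assuming a flat JSON body that uses the parameter names from Table 7-6 (all fields quoted) and values loosely based on the SRV record shown in the Get Data sample response later in this section, an SRV record can be added as follows; verify the exact payload format for your release before use:
$ curl -X POST https://localhost:8000/occne/dns/srv \
    -H "Content-Type: application/json" \
    -d '{"service": "sip", "protocol": "tcp", "dn": "lab.oracle.com", "ttl": "300", "priority": "10", "weight": "102", "port": "32061", "server": "occne", "a_record": "occne.lab.oracle.com"}'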
7.3.6.2.5 Deleting an SRV Record
This section provides information on how to use the Local DNS API to delete an SRV record in the CNE cluster.
Note:
To delete an SRV record, the details in the request payload must exactly match the details, such as weight, priority, and ttl, of an existing SRV record.
The following table provides details on how to use the Local DNS API to delete an SRV record:
Table 7-7 Deleting an SRV Record
Request URL | HTTP Method | Content Type | Request Body | Response Code | Sample Response |
---|---|---|---|---|---|
https://localhost:8000/occne/dns/srv | DELETE | application/json | Note: Define each field in the request body within double quotes (" "). | 200 | SUCCESS: SRV record successfully deleted from config map coredns |
The following table provides details about the request body parameters:
Table 7-8 Request Body Parameters
Parameter | Required or Optional | Type | Description |
---|---|---|---|
service | Required | string | The symbolic name for the
service, such as "sip", and "my_sql".
The value of this parameter can
range between 1 and 63 characters and contain the following characters:
[a-zA-Z0-9_-]. The parameter cannot start or end with |
protocol | Required | string | The protocol supported by the
service. The allowed values are:
|
dn | Required | string | The domain name that the SRV record is applicable to. This parameter
can contain multiple subdomains where each subdomain can range between 1 and 63
characters and contain the following characters: [a-zA-Z0-9_-] .
The length of the Top Level Domains (TLD) must be between 1 and 6
characters and must only contain the following characters: |
ttl | Required | integer |
The Time To Live (TTL) in seconds. This is the amount of time the record is allowed to be cached by a resolver. This value can range between 300 and 3600. |
priority | Required | integer | The priority of the current SRV record in comparison to the other SRV
records.
The values can range from 0 to n. |
weight | Required | integer | The weight of the current SRV record in comparison to the other SRV
records with the same priority.
The values can range from 0 to n. |
port | Required | integer | The port on which the target service is found.
The values can range from 1 to 65535. |
server | Required | string | The name of the machine providing the service minus the domain name
(the value in the dn field).
The value can range from 1 and 63 characters and
contain the following characters: |
a_record | Required | string | The "A" record name from which the SRV is deleted.
The "A" record mentioned here must be already added. Otherwise the request fails. |
7.3.6.3 Reloading Local or Core DNS Configurations
This section provides information about reloading core DNS configuration
using the reload
endpoint provided by Local DNS API.
Note:
You must reload the core DNS configuration to commit the last configuration update, whenever you:
- add or remove multiple records in the same zone
- update a single or multiple DNS records
The following table provides details on how to use the Local DNS API endpoint to reload the core DNS configuration:
Table 7-9 Reloading Local or Core DNS Configurations
Request URL | HTTP Method | Content Type | Request Body | Response Code | Sample Response |
---|---|---|---|---|---|
http://localhost:8000/occne/coredns/reload | POST | application/json | The request can be sent without a payload (to use the default values) or with a payload that specifies the deployment name and namespace (see Table 7-10). | 200 | Deployment reloaded, msg SUCCESS: Reloaded coredns deployment in ns kube-system |
The following table provides details about the request body parameters:
Table 7-10 Request Body Parameters
Parameter | Required or Optional | Type | Description |
---|---|---|---|
deployment-name | Required | string | The deployment Name to be reloaded. The value must be a
valid Kubernetes deployment name.
The default value is coredns. |
namespace | Required | string | The namespace where the deployment exists. The value must
be a valid Kubernetes namespace name.
The default value is kube-system. |
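As an illustration only, the reload endpoint can be called with no payload (to use the default coredns deployment in the kube-system namespace) or with a payload built from the parameters in Table 7-10; the exact payload format may differ in your release:
$ curl -X POST http://localhost:8000/occne/coredns/reload
$ curl -X POST http://localhost:8000/occne/coredns/reload \
    -H "Content-Type: application/json" \
    -d '{"deployment-name": "coredns", "namespace": "kube-system"}'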
7.3.6.4 Other Local DNS API Endpoints
This section provides information about the additional endpoints provided by Local DNS API.
Get Data
The Local DNS API provides an endpoint to get the current configuration, zones and records of local DNS or core DNS.
The following table provides details on how to use the Local DNS API endpoint to get the Local DNS or core DNS configuration details:
Table 7-11 Get Local DNS or Core DNS Configurations
Request URL | HTTP Method | Content Type | Request Body | Response Code | Sample Response |
---|---|---|---|---|---|
http://localhost:8000/occne/dns/data | GET | NA | NA | 200 | See the sample response below. |

Sample response:
[True, {'api_version': 'v1', 'binary_data': None, 'data': {'Corefile': '.:53 {\n' ... # Output Omitted ... 'db.oracle.com': ';oracle.com db file\n' 'oracle.com. 300 ' 'IN SOA ns1.oracle.com andrei.oracle.com ' '201307231 3600 10800 86400 3600\n' 'occne1.us.oracle.com. ' '3600 IN A ' '10.65.200.182\n' '_sip._tcp.lab.oracle.com 30 IN SRV 10 102 32061 ' 'occne.lab.oracle.com.\n' 'occne.lab.oracle.com. ' '3600 IN A ' '175.80.200.20\n', ... # Output Omitted ...
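As an illustration only, the configuration data can be retrieved from the Bastion Host with a simple GET request:
$ curl http://localhost:8000/occne/dns/data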
7.3.6.5 Troubleshooting Local DNS
This section describes the issues that you may encounter while configuring Local DNS and their troubleshooting guidelines.
By design, the Local DNS functionality is built on top of the core DNS (CoreDNS). Therefore, all the troubleshooting, logging, and configuration management are performed directly on the core DNS. Each cluster runs a CoreDNS deployment (2 pods), with the rolling update strategy. Therefore, any change in the configuration is applied to both the pods one by one. This process can take some time (approximately, 30 to 60 seconds to reload both pods).
A NodeLocalDNS daemonset is a cache implementation of core DNS. The NodeLocalDNS runs as a pod on each node and is used for quick DNS resolution. When a pod requires a certain domain name resolution, it first checks its NodeLocalDNS pod, the one running in the same node, for resolution. If the pod doesn't get the required resolution, then it forwards the request to the core DNS.
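For example, assuming the standard component names (the coredns deployment in the kube-system namespace, as used by the reload endpoint defaults, and a node-local DNS daemonset whose pod names can differ between releases), you can inspect the DNS pods and their logs as follows:
$ kubectl get pods -n kube-system -o wide | grep -iE 'coredns|nodelocal|node-local'
$ kubectl logs -n kube-system deploy/coredns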
Note:
Use the active Bastion to run all the troubleshooting procedures in this section.
7.3.6.5.1 Troubleshooting Local DNS API
This section provides the troubleshooting guidelines for the common scenarios that you may encounter while using Local DNS API.
Validating Local DNS API
Run the following command to check the status of the Local DNS API service (bastion_http_server):
$ systemctl status bastion_http_server
● bastion_http_server.service - Bastion http server Loaded: loaded (/etc/systemd/system/bastion_http_server.service; enabled; vendor preset: disabled) Active: active (running) since Wed 2023-04-12 00:12:51 UTC; 1 day 19h ago Main PID: 283470 (gunicorn) Tasks: 4 (limit: 23553) Memory: 102.6M CGroup: /system.slice/bastion_http_server.service ├─283470 /usr/bin/python3.6 /usr/local/bin/gunicorn --workers=3 --bind 0.0.0.0:8000 --chdir /bin/bastion_http_setup wsgi:app --max-requests 0 --timeout 5 --keep> ├─283474 /usr/bin/python3.6 /usr/local/bin/gunicorn --workers=3 --bind 0.0.0.0:8000 --chdir /bin/bastion_http_setup wsgi:app --max-requests 0 --timeout 5 --keep> ├─283476 /usr/bin/python3.6 /usr/local/bin/gunicorn --workers=3 --bind 0.0.0.0:8000 --chdir /bin/bastion_http_setup wsgi:app --max-requests 0 --timeout 5 --keep> └─641094 /usr/bin/python3.6 /usr/local/bin/gunicorn --workers=3 --bind 0.0.0.0:8000 --chdir /bin/bastion_http_setup wsgi:app --max-requests 0 --timeout 5 --keep>
The sample output shows the status of the
Bastion http server
service as active (running) and
enabled. All Bastion servers have their own independent version of this service.
Therefore, it is recommended to check the status of all Bastion servers.
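For example, a small loop such as the following can check the service on each Bastion Host in one pass. This is a sketch only; it assumes two Bastion Hosts that follow the naming pattern shown and that are reachable over SSH. Adjust the host names for your cluster.
$ for host in ${OCCNE_CLUSTER}-bastion-1 ${OCCNE_CLUSTER}-bastion-2; do echo "--- ${host} ---"; ssh ${host} 'systemctl is-active bastion_http_server'; done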
Starting or Restarting Local DNS API
If Local DNS API is not running, run the following command to start or restart it:
$ sudo systemctl start bastion_http_server
$ sudo systemctl restart bastion_http_server
The start and restart commands don’t display any output on completion. To check the status of Local DNS API, perform the Validating Local DNS API procedure.
If bastion_http_server doesn't run even after starting or restarting it, refer to the following section to check its log.
Generating and Checking Local DNS Logs
This section provides details about generating and checking Local DNS logs.
You can use journalctl
to get the logs of Local DNS API that runs as a
service (bastion_http_server
) on each bastion server.
$ journalctl -u bastion_http_server
$ journalctl -u bastion_http_server --no-pager -n 20
The first command opens the logs in interactive mode; the second command prints the latest 20 log entries without a pager.
Note:
In the interactive mode, you can use the keyboard shortcuts to scroll through the logs. The system displays the latest logs at the end.
Sample output:
-- Logs begin at Tue 2023-04-11 22:36:02 UTC. --
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,357 BHHTTP:INFO: Request payload: Record name occne.lab.oracle.com record ip 175.80.200.20 [/bin/bastion_http_setup/bastionApp.py:125]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,357 BHHTTP:INFO: Domain name oracle.com db name db.oracle.com for record entry [/bin/bastion_http_setup/coreDnsData.py:362]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,369 BHHTTP:INFO: SUCCESS: Validate coredns common config msg data oracle.com [/bin/bastion_http_setup/commons.py:36]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,380 BHHTTP:INFO: SUCCESS: A Record deleted msg data occne.lab.oracle.com [/bin/bastion_http_setup/commons.py:36]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,380 BHHTTP:INFO: SUCCESS: A Record deleted msg data occne.lab.oracle.com [/bin/bastion_http_setup/commons.py:36]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,380 BHHTTP:INFO: Domain name oracle.com db name db.oracle.com for record entry [/bin/bastion_http_setup/coreDnsData.py:362]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,388 BHHTTP:INFO: SUCCESS: Validate coredns common config msg data oracle.com [/bin/bastion_http_setup/commons.py:36]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,388 BHHTTP:INFO: DNS A record deleted in coredns file for occne.lab.oracle.com 175.80.200.20, msg SUCCESS: SUCCESS: A Record deleted [/bin/bastion_http_setup/commons.py:47]
Apr 12 16:34:13 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:34:13,487 BHHTTP:INFO: Deployment reloaded, msg SUCCESS: Reloaded coredns deployment in ns kube-system [/bin/bastion_http_setup/commons.py:47]
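To narrow the journal output to potential problems, you can filter it with grep, for example:
$ journalctl -u bastion_http_server --no-pager | grep -iE 'error|failed'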
Table 7-12 Local DNS Log Messages
Message | Type/Level | Description |
---|---|---|
Deployment reloaded, msg SUCCESS: Reloaded coredns deployment in ns kube-system | INFO | Success message indicating that the core DNS deployment reloaded successfully. |
Validate coredns common config msg data oracle.com | INFO | Indicates that the module was able to process core DNS configuration data for a specific domain name. |
Request payload incomplete. Request requires name and ip-address, error missing param 'ip-address' | ERROR | Indicates an invalid payload. The API sends this type of message when the payload used for a given record is invalid or incomplete. |
FAILED: A record occne.lab.oracle.com does not exists in Zone db.oracle.com | ERROR | This message is used by an API module to trigger the creation of a new zone. This error message does not require any intervention. |
Already exists: DNS A record in coredns file for occne.lab.oracle.com 175.80.200.20 3600, msg SUCCESS: A record occne.lab.oracle.com already exists in Zone db.oracle.com, msg: Record occne.lab.oracle.com cannot be duplicated. | ERROR | Same domain name error. Records in the same zone cannot be duplicated, have the same name, or share the same IP address. This message is displayed if any of these conditions is true. |
DNS A record deleted in coredns file for occne.lab.oracle.com 175.80.200.20, msg SUCCESS: A Record deleted | INFO | Success message indicating that an A record was deleted successfully. |
DNS A record added in coredns file for occne.lab.oracle.com 175.80.200.20 3600, msg SUCCESS: Zone info and A record updated for domain name | INFO | Success message indicating that the API has successfully added a new A record and updated the zone information. |
ERROR in app: Exception on /occne/dns/a [POST] ... Traceback Omitted | ERROR | Fatal error indicating that an exception has occurred while processing a request. You can get more information by performing a traceback. This type of error is not common and must be reported as a bug. |
Zone already present with domain name oracle.com | DEBUG | Debug messages of this type are not enabled by default. They are typically used to print a large amount of information while troubleshooting. |
FAILED: Unable to add SRV record: _sip._tcp.lab.oracle.com. 3600 IN SRV 10 100 35061 occne.lab.oracle.com. - record already exists - data: ... Data Omitted | ERROR | Error message indicating that the record already exists and cannot be duplicated. |
7.3.6.5.2 Troubleshooting Core DNS
This section provides information about troubleshooting Core DNS using the core DNS logs.
Local DNS records are added to the CoreDNS configuration. Therefore, the logs are generated and reported by the core DNS pods. As per the default configuration, CoreDNS reports information logs only at startup (for example, after a reload) and when it runs into an error.
- Run the following command to print all logs from both core DNS pods to the terminal,
separated by
name:
$ for pod in $(kubectl -n kube-system get pods | grep coredns | awk '{print $1}'); do echo "----- $pod -----"; kubectl -n kube-system logs $pod; done
Sample output:----- coredns-8ddb9dc5d-5nvrv ----- [INFO] plugin/ready: Still waiting on: "kubernetes" [INFO] plugin/auto: Inserting zone `occne.lab.oracle.com.' from: /etc/coredns/..2023_04_12_16_34_13.510777403/db.occne.lab.oracle.com .:53 [INFO] plugin/reload: Running configuration SHA512 = 2bc9e13e66182e6e829fe1a954359de92746468f433b8748589dfe16e1afd0e790e1ff75415ad40ad17711abfc7a8348fdda2770af99962db01247526afbe24a CoreDNS-1.9.3 linux/amd64, go1.18.2, 45b0a11 ----- coredns-8ddb9dc5d-6lf5s ----- [INFO] plugin/auto: Inserting zone `occne.lab.oracle.com.' from: /etc/coredns/..2023_04_12_16_34_15.930764941/db.occne.lab.oracle.com .:53 [INFO] plugin/reload: Running configuration SHA512 = 2bc9e13e66182e6e829fe1a954359de92746468f433b8748589dfe16e1afd0e790e1ff75415ad40ad17711abfc7a8348fdda2770af99962db01247526afbe24a CoreDNS-1.9.3 linux/amd64, go1.18.2, 45b0a11
- Additionally, you can pipe the above command to a file for better readability and
sharing:
$ for pod in $(kubectl -n kube-system get pods | grep coredns | awk '{print $1}'); do echo "----- $pod -----"; kubectl -n kube-system logs $pod; done > coredns.logs $ vi coredns.logs
- Run the following command to get the latest logs from any of the CoreDNS
pods:
$ kubectl -n kube-system --tail 20 logs $(kubectl -n kube-system get pods | grep coredns | awk '{print $1 }' | head -n 1)
This command prints the latest 20 log entries. You can modify the
--tail
value as per your requirement.Sample output:[INFO] plugin/auto: Inserting zone `occne.lab.oracle.com.' from: /etc/coredns/..2023_04_13_19_29_29.1646737834/db.occne.lab.oracle.com .:53 [INFO] plugin/reload: Running configuration SHA512 = 2bc9e13e66182e6e829fe1a954359de92746468f433b8748589dfe16e1afd0e790e1ff75415ad40ad17711abfc7a8348fdda2770af99962db01247526afbe24a CoreDNS-1.9.3 linux/amd64, go1.18.2, 45b0a11
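To follow the CoreDNS logs live while a configuration change is being applied, you can also stream the logs. This is a minimal example; kubectl streams the logs of one pod selected from the coredns deployment.
$ kubectl -n kube-system logs -f --tail 20 deployment/coredns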
7.3.6.5.3 Troubleshooting DNS Records
This section provides information about validating and querying internal and external records.
Note:
Use the internal cluster network to resolve the records added to core DNS through Local DNS API. The system does not respond if you query for a DNS record from outside the cluster (for example, querying from a Bastion server).
Validating Records
You can use any pod to access and query a DNS record in core DNS. However, most pods do not have the network utilities to query a record directly. In such cases, use a pod that has the required network utilities, such as bind-utils, bundled with it to access and query records.
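As an alternative, you can start a short-lived pod that contains DNS utilities and run the query from it. The following is a sketch only; the image reference is a placeholder and must point to an image that includes bind-utils (or similar tools) and is available in your registry.
$ kubectl run dns-client -it --rm --restart=Never --image=<your-registry>/<image-with-bind-utils> -- nslookup occne.lab.oracle.com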
- Run the following command from a Bastion server to query an A
record:
$ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup occne.lab.oracle.com
Sample output:.oracle.com Server: 169.254.25.10 Address: 169.254.25.10:53 Name: occne.lab.oracle.com Address: 175.80.200.20
- Run the following command from a Bastion server to query an SRV
record:
$ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup -type=srv _sip._tcp.lab.oracle.com
Sample output:Server: 169.254.25.10 Address: 169.254.25.10:53 _sip._tcp.lab.oracle.com service = 10 100 35061 occne.lab.oracle.com
Note:
Reload the core DNS configuration after adding multiple records to ensure that your changes are applied.
Note:
The following example considers that an A record for occne1.us.oracle.com is already loaded using the API:
$ kubectl -n occne-demo exec -it test-app -- nslookup occne1.us.oracle.com
.oracle.com
Server: 169.254.25.10
Address: 169.254.25.10:53
Name: occne1.us.oracle.com
Address: 10.65.200.182
Querying Non-Existing or External Records
You cannot access or query an external record or a record that is not added using the API. The system terminates such queries with an error code.
- The following code block shows a case where a non-existing A record is queried:
$ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup not-in.oracle.com
Sample output:Server: 169.254.25.10 Address: 169.254.25.10:53 ** server can't find not-in.oracle.com: NXDOMAIN ** server can't find not-in.oracle.com: NXDOMAIN command terminated with exit code 1
- The following code block shows a case where a non-existing SRV record is queried:
$ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup -type=srv not-in.oracle.com
Sample output:Server: 169.254.25.10 Address: 169.254.25.10:53 ** server can't find not-in.oracle.com: NXDOMAIN ** server can't find not-in.oracle.com: NXDOMAIN command terminated with exit code 1
Querying Internal Services
Core DNS is configured to resolve internal services by default. Therefore, you can query any internal Kubernetes services as usual.
- The following code block shows a case where an A record is queried from an internal Kubernetes service:
$ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup kubernetes
Sample output:Server: 169.254.25.10 Address: 169.254.25.10:53 Name: kubernetes.default.svc.test Address: 10.233.0.1 ** server can't find kubernetes.svc.test: NXDOMAIN ** server can't find kubernetes.svc.test: NXDOMAIN ** server can't find kubernetes.test: NXDOMAIN ** server can't find kubernetes.test: NXDOMAIN ** server can't find kubernetes.occne-infra.svc.test: NXDOMAIN ** server can't find kubernetes.occne-infra.svc.test: NXDOMAIN
The sample output displays the response from default.svc.test because the "kubernetes" service exists only in the default namespace.
- The following code block shows a case where an SRV record is queried from an internal Kubernetes service:
$ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup -type=srv kubernetes.default.svc.test
Sample output:Server: 169.254.25.10 Address: 169.254.25.10:53 kubernetes.default.svc.occne3-toby-edwards service = 0 100 443 kubernetes.default.svc.test ** server can't find kubernetes.svc.test: NXDOMAIN ** server can't find kubernetes.occne-infra.svc.test: NXDOMAIN ** server can't find kubernetes.test: NXDOMAIN
The sample output displays the response from default.svc.test because the "kubernetes" service exists only in the default namespace.
7.3.6.5.4 Accessing Configuration Files
This section provides information about accessing configuration files for troubleshooting.
Note:
Local DNS API takes care of configurations and modifications by default. Therefore, it is not recommended to access or update the configmaps, as manual intervention in these files can potentially break the entire CoreDNS functionality.
If it is absolutely necessary to access the configmap for troubleshooting, then use the data endpoint to access the records of all zones along with the CoreDNS configuration.
# The following line, starting with "db.DOMAIN-NAME" represents a Zone file 'db.oracle.com': ';oracle.com db file\n' 'oracle.com. 300 ' # All zone files contain a default SOA entry auto generated 'IN SOA ns1.oracle.com andrei.oracle.com ' '201307231 3600 10800 86400 3600\n' 'occne.lab.oracle.com. ' # User added A record '3600 IN A 175.80.200.20\n' '_sip._tcp.lab.oracle.com 30 IN SRV 10 102 32061 ' # User added SRV record 'occne.lab.oracle.com.\n' 'occne1.us.oracle.com. ' # User added A record '3600 IN A ' '10.65.200.182\n'},
7.3.6.5.5 Troubleshooting Validation Script Errors
The local DNS feature provides the validateLocalDns.py
script to validate if the Local DNS feature is activated successfully. This section provides
information about troubleshooting some of the common issues that occur while using the
validateLocalDns.py
script.
Local DNS variable is not set properly
You can encounter the following error while running the validation script if the Local DNS variable is not set properly:
Beginning local DNS validation - Getting the occne-metallb-controller pod's name. - Validating occne.ini. Unable to continue - err: Cannot continue - local_dns_enabled variable is set to False, which is not valid to continue.
In such cases, ensure that:
- the
local_dns_enabled
variable is set to True:local_dns_enabled=True
- there are no blank spaces before or after the "=" sign
- the variable is typed correctly as it is case sensitive
Note:
To successfully enable Local DNS, you must follow the entire activation procedure. Otherwise, the system doesn't enable the feature successfully even after you set the local_dns_enabled
variable to the correct value.
Unable to access the test pod
The validation script uses the occne-metallb-controller
pod to validate the test record. This is because the DNS records can be accessed
from inside the cluster only, and the MetalLB pod contains the necessary utility
tools to access the records by default. You can encounter the following error while
running the validation script if the MetalLB pod is not
accessible:Beginning local DNS validation - Getting the occne-metallb-controller pod's name. - Error while trying to get occne-metallb-controller pod's name, error: ...
In such cases, ensure that the occne-metallb-controller pod is accessible.
Unable to add a test record
You can encounter the following error while running the validation script if a test record cannot be added:
Beginning local DNS validation - Getting the occne-metallb-controller pod's name. - Validating occne.ini. - Adding DNS A record. Unable to continue - err: Failed to add DNS entry.
The following table lists the possible causes and resolutions for this error:
Table 7-13 Validation Script Errors and Resolutions
Issue | Error Message | Resolution |
---|---|---|
The script was previously run and interrupted before it finished. The script possibly created a test record during the previous unsuccessful run. When the script is run again, it tries to create a duplicate test record and fails. | Cannot add a duplicate record.
Test record: name:occne.dns.local.com, ip-address: 10.0.0.3 |
Delete the existing test record from the system and rerun the validation script. |
A record similar to the test record is added manually. | Cannot add a duplicate record.
Test record: name:occne.dns.local.com, ip-address: 10.0.0.3 |
Delete the existing test record from the system and rerun the validation script. |
Local DNS API is not available. | The Local DNS API is not running or is in an error state | Validate if the Local DNS feature is enabled properly. For more information, see Troubleshooting Local DNS API. |
Local DNS API returns 50X status code. | Kubernetes Admin Configmap missing or misconfigured | Check if Kubernetes admin.conf is properly set to allow the API to interact with Kubernetes. |
Note:
The name and ip-address of the test record are managed by the script. Use these details for validation purposes only.
Unable to reload configuration
Beginning local DNS validation - Getting the occne-metallb-controller pod's name. - Validating occne.ini. - Adding DNS A record. - Adding DNS SRV record. - Reloading local coredns. - Error while trying to reload the local coredns, error: .... # Reason Omitted
In such cases, analyze the cause of the issue using the Local DNS logs. For more information, see Troubleshooting Local DNS API.
Other miscellaneous errors
If you encounter other miscellaneous errors (such as "unable to remove record"), follow the steps in the Troubleshooting Local DNS API section to generate logs and analyze the issue.
7.4 Managing the Kubernetes Cluster
This section provides instructions on how to manage the Kubernetes Cluster.
7.4.1 Creating CNE Cluster Backup
This section describes the procedure to create a backup of CNE cluster
data using the createClusterBackup.py
script.
Critical CNE data can be damaged or lost during a fault recovery scenario. Therefore, it is advised to take a backup of your CNE cluster data regularly. These backups can be used to restore your CNE cluster when the cluster data is lost or damaged.
Backing up CNE cluster data involves the following steps:
- Backing up Bastion Host data
- Backing up Kubernetes data using Velero
The createClusterBackup.py
script is used to back up both the Bastion Host data and the Kubernetes data.
Prerequisites
Before creating CNE cluster backup, ensure that the following prerequisites are met:
- Velero must be activated successfully. For Velero installation procedure, see Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
- Velero v1.10.0 server must be installed and running.
- Velero CLI for v1.10.0 must be installed and running.
- boto3 python module must be installed. For more information, see the "Configuring PIP Repository" section in the Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
- The S3 Compatible Object Storage Provider must be configured and ready to be used.
- The following S3 related credentials must be available:
- Endpoint Url
- Access Key Id
- Secret Access Key
- Region Name
- Bucket Name
- An external S3 compatible data store to store backup data must have been configured while installing CNE.
- The cluster must be in a good state, that is, all content included in the following
namespaces must be up and running:
- occne-infra
- cert-manager
- kube-system
- rook-ceph (for bare metal)
- istio-system
- All bastion-controller and lb-controller PVCs must be in "Bound" status.
Note:
- This procedure creates only a CNE cluster backup that contains the Bastion Host data and the Kubernetes data.
- For Kubernetes, this procedure creates the backup content included in the
following namespaces only:
- occne-infra
- cert-manager
- kube-system
- rook-ceph (for bare metal)
- istio-system
- You must take the bastion backup in the ACTIVE bastion only.
7.4.1.1 Creating a Backup of Bastion Host and Kubernetes Data
This section describes the procedure to back up the Bastion Host and
Kubernetes data using the createClusterBackup.py
script.
- Run the following command to verify if you are currently on an
active Bastion. If you are not, log in to an active Bastion and continue this
procedure.
$ is_active_bastion
Sample output:IS active-bastion
- Use the following commands to run the
createClusterBackup.py
script:$ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts $ ./backup/createClusterBackup.py
Sample output:Initializing cluster backup occne-cluster-20230717-183615 No /var/occne/cluster/occne-cluster/artifacts/backup/cluster_backups_log.json log file, creating new one Creating bastion backup: 'occne-cluster-20230717-183615' Successfully created bastion backup GENERATED LOG FILE AT: /var/occne/cluster/occne-cluster/createBastionBackup-20230717-183615.log Creating velero backup: 'occne-cluster-20230717-183615' Successfully created velero backup Successfully created cluster backup GENERATED LOG FILE AT: /var/occne/cluster/occne-cluster/createClusterBackup.py-20230717-183615.log
- If the
createClusterBackup.py
script fails due to a missing boto3 library, then perform the following steps to add your proxy and download boto3. Else, move to Step 3.- Run the following commands to install boto3
library:
export http_proxy=YOUR_PROXY export https_proxy=$http_proxy export HTTP_PROXY=$http_proxy export HTTPS_PROXY=$http_proxy pip3 install boto3
While installing boto3 library, you may see a warning regarding the versions of dependencies. You can ignore the warning as the boto3 library can work without these dependencies.
- Once you install boto3 library, run the following commands
to unset the
proxy:
unset HTTP_PROXY unset https_proxy unset http_proxy unset HTTPS_PROXY
- Run the following commands to install boto3
library:
- Navigate to the
/home/cloud-user
directory and verify if the backup tar file is generated (see the example after this list).
- Log in to your S3 cloud storage and verify if the Bastion Host data is uploaded successfully.
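For example, the presence of the generated backup archive can be checked with a simple listing. This is a sketch only; the backup file name is assumed to contain the cluster name and a timestamp, as shown in the sample output of the createClusterBackup.py script.
$ ls -lh /home/cloud-user/ | grep "${OCCNE_CLUSTER}"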
7.4.1.2 Verifying Backup in S3 Bucket
This section describes the procedure to verify the CNE cluster data backup in the S3 bucket. The S3 bucket contains the following folders:
- bastion-data-backups: for storing Bastion backup
- velero-backup: for storing Velero backup
- Verify if the Bastion Host data is stored as a
.tar
file in the{BUCKET_NAME}/bastion-data-backups/{CLUSTER-NAME}/{BACKUP_NAME}
folder. Where,{CLUSTER-NAME}
is the name of the cluster and{BACKUP_NAME}
is the name of the backup. - Verify if the Velero Kubernetes backup is stored in the
{BUCKET_NAME}/velero-backup/{BACKUP_NAME}/
folder. Where, {BACKUP_NAME} is the name of the backup.
Caution:
The velero-backup folder must not be modified manually as this folder is managed by Velero. Modifying the folder can corrupt the structure or files.
For information about restoring CNE cluster from a backup, see "Restoring CNE from Backup" in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
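If an S3-compatible command line client is available, you can also list the backup folders directly from the Bastion Host. The following is a sketch only; it assumes the AWS CLI is installed and configured with the S3 credentials listed in the prerequisites. Replace the placeholders with the values for your deployment.
$ aws s3 ls s3://<BUCKET_NAME>/bastion-data-backups/<CLUSTER-NAME>/ --endpoint-url <ENDPOINT_URL>
$ aws s3 ls s3://<BUCKET_NAME>/velero-backup/ --endpoint-url <ENDPOINT_URL>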
7.4.2 Renewing Kubernetes Certificates
Some of the Kubernetes certificates in your cluster are valid for a period of one year. These certificates include various important files that secure the communication within your cluster, such as the API server certificate, the etcd certificate, and the controller manager certificate. To maintain the security and operation of your CNE Kubernetes cluster, it is important to keep these certificates updated. The certificates are renewed automatically during the CNE upgrade. If you have not performed a CNE upgrade in the last year, you must run this procedure to renew your certificates for the continued operation of the CNE Kubernetes cluster.
Introduction
Kubernetes uses many different TLS certificates to secure access to internal services. These certificates are automatically renewed during upgrade. However, if upgrades are not performed regularly, these certificates can expire and cause the Kubernetes cluster to fail. To avoid this situation, follow the procedure below to renew all the certificates used by Kubernetes. This procedure can also be used to renew expired certificates and restore access to the Kubernetes cluster.
List of K8s internal certificates
Table 7-14 Kubernetes Internal Certificates and Validity Period
Node Type | Component Name | .crt File Path | Validity (in years) | .pem File Path | Validity (in years) |
---|---|---|---|---|---|
Kubernetes Controller | etcd | /etc/pki/ca-trust/source/anchors/etcd-ca.crt | 100 | /etc/ssl/etcd/ssl/admin-<node_name>.pem | 100 |
Kubernetes Controller | etcd | NA | NA | /etc/ssl/etcd/ssl/ca.pem | 100 |
Kubernetes Controller | etcd | NA | NA | /etc/ssl/etcd/ssl/member-<node_name>.pem | 100 |
Kubernetes Controller | etcd | NA | NA | /etc/ssl/etcd/ssl/node-<node_name>.pem | 100 |
Kubernetes Controller | Kubernetes | /etc/kubernetes/ssl/ca.crt | 10 | NA | NA |
Kubernetes Controller | Kubernetes | /etc/kubernetes/ssl/apiserver.crt | 1 | NA | NA |
Kubernetes Controller | Kubernetes | /etc/kubernetes/ssl/apiserver-kubelet-client.crt | 1 | NA | NA |
Kubernetes Controller | Kubernetes | /etc/kubernetes/ssl/front-proxy-ca.crt | 10 | NA | NA |
Kubernetes Controller | Kubernetes | /etc/kubernetes/ssl/front-proxy-client.crt | 1 | NA | NA |
Kubernetes Node | Kubernetes | /etc/kubernetes/ssl/ca.crt | 10 | NA | NA |
Prerequisites
Caution:
Run this procedure on each controller node and verify that the certificates are renewed successfully to avoid cluster failures. The controller nodes are the orchestrators and maintainers of the metadata of all objects and components of the cluster. If you do not run this procedure on all the controller nodes and the certificates expire, the integrity of the cluster and the applications that are deployed on the cluster is put at risk. This causes the communication within the internal components to be lost, resulting in a total cluster failure. In such a case, you must recover each controller node or, in the worst case scenario, recover the complete cluster.
Checking Certificate Expiry
Run the following commands on a controller node to check the expiry dates of the Kubernetes certificates:
$ sudo su
# export PATH=$PATH:/usr/local/bin
# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster... [check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml' W0214 13:39:25.870724 84036 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10] CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED admin.conf Feb 14, 2026 17:42 UTC 364d ca no apiserver Feb 14, 2026 17:42 UTC 364d ca no apiserver-kubelet-client Feb 14, 2026 17:42 UTC 364d ca no controller-manager.conf Feb 14, 2026 17:42 UTC 364d ca no front-proxy-client Feb 14, 2026 17:42 UTC 364d front-proxy-ca no scheduler.conf Feb 14, 2026 17:42 UTC 364d ca no super-admin.conf Feb 14, 2026 17:42 UTC 364d ca no CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED ca Feb 12, 2035 17:42 UTC 9y no front-proxy-ca Feb 12, 2035 17:42 UTC 9y no
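You can also inspect an individual certificate file directly with openssl, for example, to confirm the expiry date of the API server certificate listed in Table 7-14:
# openssl x509 -enddate -noout -in /etc/kubernetes/ssl/apiserver.crt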
Procedure
- Use SSH to log in to the active Bastion Host.
- Run the following command to verify if the Bastion Host is the
active Bastion
Host:
$ is_active_bastion
The system displays the following output if the Bastion Host is the active Bastion Host:
IS active-bastion
If the Bastion Host is not the active Bastion Host, try a different Bastion Host.
Note:
If the certificates are expired, theis_active_bastion
command doesn't work as it depends onkubectl
. In this case, skip this step and move to the next step. - Perform the following steps to log in to a controller node as a
root user and back up the SSL directory:
- Use SSH to log in to Kubernetes controller node as a root
user:
$ ssh <k8s-ctrl-node> $ sudo su # export PATH=$PATH:/usr/local/bin
- Take a backup of the
ssl
directory:# cp -r /etc/kubernetes/ssl /etc/kubernetes/ssl_backup
- Use SSH to log in to Kubernetes controller node as a root
user:
- Renew all
kubeadm
certificates:# kubeadm certs renew all
Sample output:[renew] Reading configuration from the cluster... [renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml' W0212 18:04:43.840444 3620859 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10] certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed certificate for serving the Kubernetes API renewed certificate for the API server to connect to kubelet renewed certificate embedded in the kubeconfig file for the controller manager to use renewed certificate for the front proxy client renewed certificate embedded in the kubeconfig file for the scheduler manager to use renewed certificate embedded in the kubeconfig file for the super-admin renewed Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.
- Perform the following steps to remove the manifest files in the
/etc/kubernetes/manifests/
directory and restart the static pods:Note:
This step requires removing (moving the file totmp
folder) the manifest files in the/etc/kubernetes/manifests/
directory and copying back the file to the same directory to restart thekube-apiserver
pod. Each time you remove and copy the manifest files, the system waits for a period configured infileCheckFrequency
.fileCheckFrequency
is a Kubelet configuration and the default value is 20 seconds.- Perform the following steps to restart the API server
pod:
- Remove the
kube-apiserver
pod:# mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp
- Run the watch command until the
kube-apiserver
pod is removed. When the pod is removed, useCtrl+C
to exit the watch command:
Sample output:# watch -n 1 "sudo /usr/local/bin/crictl -r unix:///run/containerd/containerd.sock ps | grep -e api -e kube-controller-manager -e scheduler"
Every 1.0s: sudo /usr/local/bin/crictl -r unix:///run/containerd/contain... occne-example-k8s-ctrl-1: Fri Feb 14 13:52:26 2025 ff79b19fdffd7 9aa1fad941575 27 seconds ago Running kube-scheduler 2 ab0da7c51b413 kube-scheduler-occne-example-k8s-ctrl-1 64059f7efadc5 175ffd71cce3d 27 seconds ago Running kube-controller-manager 3 9591cd755dae4 kube-controller-manager-occne-example-k8s-ctrl-1
- Restore the
kube-apiserver
pod to the/etc/kubernetes/manifests/
directory:# mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests
- Run the watch command until the
kube-apiserver
pod appears in the output. When the pod appears, useCtrl+C
to exit the watch command:
Sample output:# watch -n 1 "sudo /usr/local/bin/crictl -r unix:///run/containerd/containerd.sock ps | grep -e api -e kube-controller-manager -e scheduler"
Every 1.0s: sudo /usr/local/bin/crictl -r unix:///run/containerd/contain... occne-example-k8s-ctrl-1: Fri Feb 14 13:53:28 2025 67c8d5c42645f 6bab7719df100 10 seconds ago Running kube-apiserver 0 3bb9f31dad8c6 kube-apiserver-occne-example-k8s-ctrl-1 ff79b19fdffd7 9aa1fad941575 About a minute ago Running kube-scheduler 2 ab0da7c51b413 kube-scheduler-occne-example-k8s-ctrl-1 64059f7efadc5 175ffd71cce3d About a minute ago Running kube-controller-manager 3 9591cd755dae4 kube-controller-manager-occne-example-k8s-ctrl-1
- Remove the
- Perform the following steps to restart the controller
manager pod:
- Remove the
kube-controller-manager
pod:# mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp
- Run the watch command until the
kube-controller-manager
pod is removed. When the pod is removed, useCtrl+C
to exit the watch command:
Sample output:# watch -n 1 "sudo /usr/local/bin/crictl -r unix:///run/containerd/containerd.sock ps | grep -e api -e kube-controller-manager -e scheduler"
Every 1.0s: sudo /usr/local/bin/crictl -r unix:///run/containerd/contain... occne-example-k8s-ctrl-1: Fri Feb 14 13:55:48 2025 67c8d5c42645f 6bab7719df100 2 minutes ago Running kube-apiserver 0 3bb9f31dad8c6 kube-apiserver-occne-example-k8s-ctrl-1 ff79b19fdffd7 9aa1fad941575 3 minutes ago Running kube-scheduler 2 ab0da7c51b413 kube-scheduler-occne-example-k8s-ctrl-1
- Restore the
kube-controller-manager
pod to the/etc/kubernetes/manifests/
directory:# mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests
- Run the watch command until the
kube-controller-manager
pod appears in the output. When the pod appears, useCtrl+C
to exit the watch command:
Sample output:# watch -n 1 "sudo /usr/local/bin/crictl -r unix:///run/containerd/containerd.sock ps | grep -e api -e kube-controller-manager -e scheduler"
Every 1.0s: sudo /usr/local/bin/crictl -r unix:///run/containerd/contain... occne-example-k8s-ctrl-1: Fri Feb 14 13:57:11 2025 fa16530da2e04 175ffd71cce3d 15 seconds ago Running kube-controller-manager 0 9b6c69c940bfa kube-controller-manager-occne-example-k8s-ctrl-1 67c8d5c42645f 6bab7719df100 3 minutes ago Running kube-apiserver 0 3bb9f31dad8c6 kube-apiserver-occne-example-k8s-ctrl-1 ff79b19fdffd7 9aa1fad941575 5 minutes ago Running kube-scheduler 2 ab0da7c51b413 kube-scheduler-occne-example-k8s-ctrl-1
- Remove the
- Perform the following steps to restart the scheduler
pod:
- Remove the
kube-scheduler
pod:# mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp
- Run the watch command until the
kube-scheduler
pod is removed. When the pod is removed, useCtrl+C
to exit the watch command:
Sample output:# watch -n 1 "sudo /usr/local/bin/crictl -r unix:///run/containerd/containerd.sock ps | grep -e api -e kube-scheduler -e scheduler"
Every 1.0s: sudo /usr/local/bin/crictl -r unix:///run/containerd/contain... occne-example-k8s-ctrl-1: Thu Feb 13 13:16:06 2025 fa16530da2e04 175ffd71cce3d 19 minutes ago Running kube-controller-manager 0 9b6c69c940bfa kube-controller-manager-occne-example-k8s-ctrl-1 67c8d5c42645f 6bab7719df100 23 minutes ago Running kube-apiserver 0 3bb9f31dad8c6 kube-apiserver-occne-example-k8s-ctrl-1
- Restore the
kube-scheduler
pod to the/etc/kubernetes/manifests/
directory:# mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests
- Run the watch command until the
kube-scheduler
pod appears in the output. When the pod appears, useCtrl+C
to exit the watch command:
Sample output:# watch -n 1 "sudo /usr/local/bin/crictl -r unix:///run/containerd/containerd.sock ps | grep -e api -e kube-scheduler -e scheduler"
Every 1.0s: sudo /usr/local/bin/crictl -r unix:///run/containerd/contain... occne-example-k8s-ctrl-1: Fri Feb 14 14:16:35 2025 8c4500f3d61d7 9aa1fad941575 16 seconds ago Running kube-scheduler 0 7c175d8106f0c kube-scheduler-occne-example-k8s-ctrl-1 fa16530da2e04 175ffd71cce3d 19 minutes ago Running kube-controller-manager 0 9b6c69c940bfa kube-controller-manager-occne-example-k8s-ctrl-1 67c8d5c42645f 6bab7719df100 23 minutes ago Running kube-apiserver 0 3bb9f31dad8c6 kube-apiserver-occne-example-k8s-ctrl-1
- Remove the
- Renew the
admin.conf
file and update the contents of$HOME/.kube/config
. Type yes when prompted.# cp -i /etc/kubernetes/admin.conf $HOME/.kube/config cp: overwrite '/root/.kube/config'? yes # chown $(id -u):$(id -g) $HOME/.kube/config
- Run the following command to validate if the certificates are
renewed:
Sample output:# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster... [check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml' W0214 14:21:49.907835 143445 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10] CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED admin.conf Feb 14, 2026 18:51 UTC 364d ca no apiserver Feb 14, 2026 18:51 UTC 364d ca no apiserver-kubelet-client Feb 14, 2026 18:51 UTC 364d ca no controller-manager.conf Feb 14, 2026 18:51 UTC 364d ca no front-proxy-client Feb 14, 2026 18:51 UTC 364d front-proxy-ca no scheduler.conf Feb 14, 2026 18:51 UTC 364d ca no super-admin.conf Feb 14, 2026 18:51 UTC 364d ca no CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED ca Feb 12, 2035 17:42 UTC 9y no front-proxy-ca Feb 12, 2035 17:42 UTC 9y no
- Perform steps 3 through 7 on the remaining controller nodes.
- Exit from the root user privilege:
# exit
- Copy the
/etc/kubernetes/admin.conf
file from the controller node to the artifacts directory of the active Bastion.
Note:
- Replace
<OCCNE_ACTIVE_BASTION>
and<OCCNE_CLUSTER>
with the values corresponding to your system. Refer to Step 2 for the value of<OCCNE_ACTIVE_BASTION>
(For example,occne-example-bastion-1
). - Type yes and enter your password if prompted.
$ sudo scp /etc/kubernetes/admin.conf ${USER}@<OCCNE_ACTIVE_BASTION>:/var/occne/cluster/<OCCNE_CLUSTER>/artifacts
- Replace
- Log in to the active Bastion Host and update the server address in the
admin.conf
file to https://lb-apiserver.kubernetes.local:6443:$ ssh <active-bastion> $ sed -i 's#https://127.0.0.1:6443#https://lb-apiserver.kubernetes.local:6443#' /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/admin.conf
- If you are using a Load Balancer VM (LBVM), perform the following steps to
delete the existing
lb-controller-admin
secret and create a new one:- Run the following command to delete the existing
lb-controller-admin
secret:$ kubectl -n occne-infra delete secret lb-controller-admin-config
- Run the following command to create a new
lb-controller-admin
secret from the updatedadmin.conf
file:$ kubectl -n occne-infra create secret generic lb-controller-admin-config --from-file=/var/occne/cluster/${OCCNE_CLUSTER}/artifacts/admin.conf
- Run the following command to delete the existing
- If you are using a Load Balancer VM (LBVM), perform the following steps to
patch the
lb-controller-admin-config
secret and restart thelb-controller-server
pod:- Patch the
lb-controller-admin-config
secret:$ echo -n "$(kubectl get secret lb-controller-admin-config -n occne-infra -o jsonpath='{.data.admin\.conf}' | base64 -d | sed 's#https://lb-apiserver.kubernetes.local:6443#https://kubernetes.default:443#g')" | base64 -w0 | xargs -I{} kubectl -n occne-infra patch secret lb-controller-admin-config --patch '{"data":{"admin.conf":"{}"}}'
- Remove the
lb-controller-server
pod:$ kubectl scale deployment/occne-lb-controller-server -n occne-infra --replicas=0
- Run the watch command until the
occne-lb-controller-server
pod is removed. When the pod is removed, useCtrl+C
to exit the watch command:$ watch -n 1 "kubectl -n occne-infra get pods | grep lb-controller"
- Restore the
lb-controller-server
pod:$ kubectl scale deployment/occne-lb-controller-server -n occne-infra --replicas=1
- Run the watch command until the
occne-lb-controller-server
pod appears in the output. When the pod appears, useCtrl+C
to exit the watch command:$ watch -n 1 "kubectl -n occne-infra get pods | grep lb-controller"
- Patch the
- Renew the Kyverno certificates by deleting the secrets from the
kyverno
namespace:Note:
You must perform this step to renew the Kyverno certificates manually as the current version of Kyverno doesn't support automatic renewal of certificates.
Sample output:$ kubectl delete secret occne-kyverno-svc.kyverno.svc.kyverno-tls-ca -n kyverno
secret "occne-kyverno-svc.kyverno.svc.kyverno-tls-ca" deleted
Sample output:$ kubectl delete secret occne-kyverno-svc.kyverno.svc.kyverno-tls-pair -n kyverno
secret "occne-kyverno-svc.kyverno.svc.kyverno-tls-pair" deleted
- Perform the following steps to verify if the secrets are recreated and the
certificates are renewed:
- Run the following command to verify the Kyverno
secrets:
Sample output:$ kubectl get secrets -n kyverno
NAME TYPE DATA AGE occne-kyverno-svc.kyverno.svc.kyverno-tls-ca kubernetes.io/tls 2 21s occne-kyverno-svc.kyverno.svc.kyverno-tls-pair kubernetes.io/tls 2 11s sh.helm.release.v1.occne-kyverno-policies.v1 helm.sh/release.v1 1 26h sh.helm.release.v1.occne-kyverno.v1 helm.sh/release.v1 1 26h
- Run the following commands to review the expiry dates of Kyverno
certificates:
Sample output:$ for secret in $(kubectl -n kyverno get secrets --no-headers | grep kubernetes.io/tls | awk {'print $1'}); do currdate=$(date +'%s'); echo $secret; expires=$(kubectl -n kyverno get secrets $secret -o jsonpath="{.data['tls\.crt']}" | base64 -d | openssl x509 -enddate -noout | awk -F"=" {'print $2'} | xargs -d '\n' -I {} date -d '{}' +'%s'); if [ $expires -le $currdate ]; then echo "Certificate invalid, expired: $(date -d @${expires})"; echo "Need to renew certificate using:"; echo "kubectl -n kyverno delete secret $secret"; else echo "Certificate valid, expires: $(date -d @${expires})"; fi done
occne-kyverno-svc.kyverno.svc.kyverno-tls-ca Certificate valid, expires: Wed Feb 25 05:35:03 PM EST 2026 occne-kyverno-svc.kyverno.svc.kyverno-tls-pair Certificate valid, expires: Fri Jul 25 06:35:12 PM EDT 2025
- Run the following command to verify the Kyverno
secrets:
Renewing the Kubelet Server Certificate
This section provides the procedure to renew the Kubelet server certificate using the
renew-kubelet-server-cert.sh
script.
The certificate
rotation configuration of the Kubelet server renews the Kubelet client
certificates automatically, as this configuration is enabled by default. The
renew-kubelet-server-cert.sh
script sets the
--rotate-server-certificates
flag to
true, which enables the
serverTLSBootstrap
variable in the Kubelet
configuration.
- Use SSH to log in to the active Bastion Host.
- Run the following command to verify if the Bastion Host is the
active Bastion
Host:
$ is_active_bastion
The system displays the following output if the Bastion Host is the active Bastion Host:
IS active-bastion
If the Bastion Host is not the active Bastion Host, try a different Bastion Host.
Note:
If the certificates are expired, theis_active_bastion
command doesn't work as it depends onkubectl
. In this case, skip this step and move to the next step. - Navigate to the
/var/occne/cluster/${OCCNE_CLUSTER}/artifacts/
directory:$ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/
- Run the
renew-kubelet-server-cert.sh
script:
Sample output:$ ./renew-kubelet-server-cert.sh
============ Checking if all nodes are accessible via ssh ============ occne3-k8s-ctrl-1 occne3-k8s-ctrl-2 occne3-k8s-ctrl-3 occne3-k8s-node-1 occne3-k8s-node-2 occne3-k8s-node-3 occne3-k8s-node-4 All nodes are healthy and accessible using ssh, Starting kubelet server certificate renewal procedure now... ---------------------------------------------------------------------------------------------- Starting renewal of K8s kubelet server certificate for occne3-k8s-ctrl-1. Adding the line --rotate-server-certificates=true --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 to kubelet environment file. Restarting Kubelet to trigger Certificate signing request... Kubelet is successfully restarted! A signing request has been raised, Verifying it now.... A Certificate signing request csr-lfsq9 has been found, Approving it now! certificatesigningrequest.certificates.k8s.io/csr-lfsq9 approved The CSR has been approved for the node occne3-k8s-ctrl-1. Checking if the new K8s kubelet server certificate has been generated... New K8s kubelet server certificate has been successfully generated for the node occne3-k8s-ctrl-1 as shown below. lrwxrwxrwx. 1 root root 59 Jul 24 08:05 kubelet-server-current.pem -> /var/lib/kubelet/pki/kubelet-server-2024-07-24-08-05-40.pem Marked occne3-k8s-ctrl-1 as RENEWED. Kubelet server certificate creation was successful for the node occne3-k8s-ctrl-1.
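After the script completes, you can optionally confirm the renewed serving certificate on each node. The following loop is a sketch; it assumes passwordless SSH access from the Bastion Host to the cluster nodes, which the script itself also requires.
$ for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do echo "--- ${node} ---"; ssh ${node} 'sudo ls -l /var/lib/kubelet/pki/kubelet-server-current.pem'; done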
7.4.3 Renewing the Kubernetes Secrets Encryption Key
This section describes the procedure to renew the key that is used to encrypt the Kubernetes Secrets stored in the CNE Kubernetes cluster.
The key that is used to encrypt Kubernetes Secrets does not expire. However, it is recommended to change the encryption key periodically to ensure the security of your Kubernetes Secrets. If you think that your key is compromised, you must change the encryption key immediately.
To renew a Kubernetes Secrets encryption key, perform the following steps:
- From the Bastion Host, run the following
commands:
$ NEW_KEY=$(head -c 32 /dev/urandom | base64) $ KEY_NAME=$(cat /dev/random | tr -dc '[:alnum:]' | head -c 10) $ kubectl get nodes | awk '/control-plane/ {print $1}' | xargs -I{} ssh {} " sudo sed -i '/keys:$/a\ - name: key_$KEY_NAME\n\ secret: $NEW_KEY' /etc/kubernetes/ssl/secrets_encryption.yaml; sudo cat /etc/kubernetes/ssl/secrets_encryption.yaml"
This creates a random encryption key with a random key name, and adds it to the
/etc/kubernetes/ssl/secrets_encryption.yaml
file within each controller node. The output shows the new encryption key, the key name, and the contents of the /etc/kubernetes/ssl/secrets_encryption.yaml file.
Sample Output:
kind: EncryptionConfig
apiVersion: v1
resources:
  - resources:
      - secrets
    providers:
      - secretbox:
          keys:
            - name: key_ZOJ1Hf5OCx
              secret: l+CaDTmMkC85LwJRiWJ0LQPYVtOyZ0TdtNZ2ij+kuGA=
            - name: key
              secret: ZXJ1Ulk2U0xSbWkwejdreTlJWkFrZmpJZjhBRzg4U00=
      - identity: {}
- Restart the API server by running the following command. This ensures that the new key is used when the existing secrets are re-encrypted in the next step:
kubectl get nodes | awk '/control-plane/ {print $1}' | xargs -I{} ssh {} " sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml ~; sleep 2; sudo mv ~/kube-apiserver.yaml /etc/kubernetes/manifests"
- To encrypt all the existing secrets with a new key, run the
following
command:
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
Sample output:-secret/occne-cert-manager-webhook-ca replaced secret/sh.helm.release.v1.occne-cert-manager.v1 replaced secret/istio-ca-secret replaced secret/cloud-config replaced secret/external-openstack-cloud-config replaced secret/occne-kyverno-svc.kyverno.svc.kyverno-tls-ca replaced secret/occne-kyverno-svc.kyverno.svc.kyverno-tls-pair replaced secret/sh.helm.release.v1.occne-kyverno-policies.v1 replaced secret/sh.helm.release.v1.occne-kyverno.v1 replaced secret/alertmanager-occne-kube-prom-stack-kube-alertmanager replaced secret/etcd-occne6-j-jorge-l-lopez-k8s-ctrl-1 replaced secret/etcd-occne6-j-jorge-l-lopez-k8s-ctrl-2 replaced secret/etcd-occne6-j-jorge-l-lopez-k8s-ctrl-3 replaced secret/lb-controller-user replaced secret/occne-alertmanager-snmp-notifier replaced secret/occne-kube-prom-stack-grafana replaced secret/occne-kube-prom-stack-kube-admission replaced secret/occne-kube-prom-stack-kube-prometheus-scrape-confg replaced secret/occne-metallb-memberlist replaced secret/occne-tracer-jaeger-elasticsearch replaced secret/prometheus-occne-kube-prom-stack-kube-prometheus replaced secret/prometheus-occne-kube-prom-stack-kube-prometheus-tls-assets-0 replaced secret/prometheus-occne-kube-prom-stack-kube-prometheus-web-config replaced secret/sh.helm.release.v1.occne-alertmanager-snmp-notifier.v1 replaced secret/sh.helm.release.v1.occne-bastion-controller.v1 replaced secret/sh.helm.release.v1.occne-fluentd-opensearch.v1 replaced secret/sh.helm.release.v1.occne-kube-prom-stack.v1 replaced secret/sh.helm.release.v1.occne-lb-controller.v1 replaced secret/sh.helm.release.v1.occne-metallb.v1 replaced secret/sh.helm.release.v1.occne-metrics-server.v1 replaced secret/sh.helm.release.v1.occne-opensearch-client.v1 replaced secret/sh.helm.release.v1.occne-opensearch-dashboards.v1 replaced secret/sh.helm.release.v1.occne-opensearch-data.v1 replaced secret/sh.helm.release.v1.occne-opensearch-master.v1 replaced secret/sh.helm.release.v1.occne-promxy.v1 replaced secret/sh.helm.release.v1.occne-tracer.v1 replaced secret/webhook-server-cert replaced Error from server (Conflict): error when replacing "STDIN": Operation cannot be fulfilled on secrets "alertmanager-occne-kube-prom-stack-kube-alertmanager-generated": the object has been modified; please apply your changes to the latest version and try again Error from server (Conflict): error when replacing "STDIN": Operation cannot be fulfilled on secrets "alertmanager-occne-kube-prom-stack-kube-alertmanager-tls-assets-0": the object has been modified; please apply your changes to the latest version and try again Error from server (Conflict): error when replacing "STDIN": Operation cannot be fulfilled on secrets "alertmanager-occne-kube-prom-stack-kube-alertmanager-web-config": the object has been modified; please apply your changes to the latest version and try again
Note:
You may see some errors on the output depending on how the secret is created. You can ignore these errors and verify the encrypted secret using the following step. - To verify if the new key is used for encrypting the existing
secrets, run the following command from a controller node. Replace <cert
pem file>, <key pem file> and <secret> in
the following command with the corresponding
values.
sudo ETCDCTL_API=3 /usr/local/bin/etcdctl --cert /etc/ssl/etcd/ssl/<cert pem file> --key /etc/ssl/etcd/ssl/<key pem file> get /registry/secrets/default/<secret> -w fields | grep Value
Example:[cloud-user@occne3-user-k8s-ctrl-3 ~]$ sudo ETCDCTL_API=3 /usr/local/bin/etcdctl --cert /etc/ssl/etcd/ssl/node-occne3-user-k8s-ctrl-1.pem --key /etc/ssl/etcd/ssl/node-occne3-user-k8s-ctrl-1-key.pem get /registry/secrets/default/secret1 -w fields | grep Value "Value" : "k8s:enc:secretbox:v1:key_ZOJ1Hf5OCx:&9\x90\u007f'*6\x0e\xf8]\x98\xd7t1\xa9|\x90\x93\x88\xebc\xa9\xfe\x82<\xebƞ\xaa\x17$\xa4\x14%m\xb7<\x1d\xf7N\b\xa7\xbaZ\xb0\xd4#\xbev)\x1bv9\x19\xdel\xab\x89@\xe7\xaf$L\xb8)\xc9\x1bl\x13\xc1V\x1b\xf7\bX\x88\xe7\ue131\x1dG\xe2_\x04\xa2\xf1n\xf5\x1dP\\4\xe7)^\x81go\x99\x98b\xbb\x0eɛ\xc0R;>աj\xeeV54\xac\x06̵\t\x1b9\xd5N\xa77\xd9\x03㵮\x05\xfb%\xa1\x81\xd5\x0e \xcax\xc4\x1cz6\xf3\xd8\xf9?Щ\x9a%\x9b\xe5\xa7й\xcd!,\xb8\x8b\xc2\xcf\xe2\xf2|\x8f\x90\xa9\x05y\xc5\xfc\xf7\x87\xf9\x13\x0e4[i\x12\xcc\xfaR\xdf3]\xa2V\x1b\xbb\xeba6\x1c\xba\v\xb0p}\xa5;\x16\xab\x8e\xd5Ol\xb7\x87BW\tY;寄ƻ\xcaċ\x87Y;\n;/\xf2\x89\xa1\xcc\xc3\xc9\xe3\xc5\v\x1b\x88\x84Ӯ\xc6\x00\xb4\xed\xa5\xe2\xfa\xa9\xff \xd9kʾ\xf2\x04\x8f\x81,l"
This example shows a new key, key_ZOJ1Hf5OCx, being used to encrypt secret1.
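Optionally, you can confirm that newly created secrets are also encrypted with the new key. The following commands are a sketch; the secret name secret-test is hypothetical and used only for this check.
$ kubectl create secret generic secret-test -n default --from-literal=sample=value
After creating the secret, run the etcdctl command from the previous step against /registry/secrets/default/secret-test and confirm that the stored value starts with the new key name. Delete the test secret when you are done:
$ kubectl delete secret secret-test -n default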
7.4.4 Removing a Kubernetes Controller Node
This section describes the procedure to remove a controller node from the CNE Kubernetes cluster in a vCNE deployment.
Note:
- A controller node must be removed from the cluster only when it is required for maintenance.
- This procedure is applicable for vCNE (OpenStack and VMware) deployments only.
- This procedure is applicable for removing a single controller node only.
7.4.4.1 Removing a Controller Node in OpenStack Deployment
This section describes the procedure to remove a single controller node from the CNE Kubernetes cluster in an OpenStack deployment.
- Locate the controller node internal IP address by running the
following command from the Bastion
Host:
$ kubectl get nodes -o wide | egrep control | awk '{ print $1, $2, $6}'
For example:$ [cloud-user@occne7-test-bastion-1 ~]$ kubectl get node -o wide | egrep control | awk '{ print $1, $2, $6}'
Sample output:occne7-test-k8s-ctrl-1 NotReady 192.168.201.158 occne7-test-k8s-ctrl-2 Ready 192.168.203.194 occne7-test-k8s-ctrl-3 Ready 192.168.200.115
Note that the status of controller node 1 is
NotReady
in the sample output. - Run the following commands to backup the
terraform.tfstate
file:$ cd /var/occne/cluster/${OCCNE_CLUSTER} $ cp terraform.tfstate ${OCCNE_CLUSTER}/terraform.tfstate.backup
- From the Bastion Host, use SSH to log in to a working controller
node and run the following commands to list the etcd
members:
$ ssh <working control node hostname> # sudo su # source /etc/etcd.env # /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
For example:$ ssh occne7-test-k8s-ctrl-2 [cloud-user@occne7-test-k8s-ctrl-2]$ sudo su [root@occne7-test-k8s-ctrl-2 cloud-user]# source /etc/etcd.env [root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list 52513ddd2aa49770, started, etcd1, https://192.168.201.158:2380, https://192.168.201.158:2379, false 80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
- From the output, identify the etcd (etcd1, etcd2, or etcd3) to which the failed controller node belongs.
- Copy the controller node ID that is displayed in the first column of the output to be used later in the procedure.
- If the failed controller node is reachable, use SSH to log in to
the failed controller node from the Bastion Host and stop etcd service by
running the following
commands:
$ ssh <failed control node hostname> $ sudo systemctl stop etcd
Example:$ ssh occne7-test-k8s-ctrl-1 $ sudo systemctl stop etcd
- From the Bastion Host, use SSH to log in to a working controller
node and remove the failed controller node from the etcd member
list:
$ ssh <working control node hostname> $ sudo su $ source /etc/etcd.env $ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member remove <failed control node ID>
Example:[root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member remove 52513ddd2aa49770
Sample output:Member 52513ddd2aa49770 removed from cluster f347ab69786ba4f7
- Validate if the failed node is removed from the etcd member
list:
$ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
For example:[root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list 80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
- From the Bastion Host, switch the controller nodes in
terraform.tfstate
by running the following commands:Note:
Perform this step only if the failed controller node is an etcd1 member.
$ cd /var/occne/cluster/$OCCNE_CLUSTER
$ cp terraform.tfstate terraform.tfstate.original
$ python3 scripts/switchTfstate.py
For example:[cloud-user@occne7-test-bastion-1]$ python3 scripts/switchTfstate.py
Sample output:Beginning tfstate switch order k8s control nodes terraform.tfstate.lastversion created as backup Controller Nodes order before rotation: occne7-test-k8s-ctrl-1 occne7-test-k8s-ctrl-2 occne7-test-k8s-ctrl-3 Controller Nodes order after rotation: occne7-test-k8s-ctrl-2 occne7-test-k8s-ctrl-3 occne7-test-k8s-ctrl-1 Success: terraform.tfstate rotated for cluster occne7-test
- Remove the failed controller node from the cluster by performing one of the following steps on the Bastion Host, depending on whether the failed controller node is reachable or not:
- If the failed controller node is reachable, run the
following commands to remove the controller node from the
cluster:
$ kubectl cordon <failed control node hostname> $ kubectl drain <failed control node hostname> --force --ignore-daemonsets --delete-emptydir-data $ kubectl delete node <failed control node hostname>
Example:$ [cloud-user@occne7-test-bastion-1]$ kubectl cordon occne7-test-k8s-ctrl-1 $ [cloud-user@occne7-test-bastion-1]$ kubectl drain occne7-test-k8s-ctrl-1 --force --ignore-daemonsets --delete-emptydir-data $ [cloud-user@occne7-test-bastion-1]$ kubectl delete node occne7-test-k8s-ctrl-1
- If the failed controller node is not reachable, run the
following commands to remove the controller node from the
cluster:
$ kubectl cordon <failed control node hostname> $ kubectl delete node <failed control node hostname>
Example:$ [cloud-user@occne7-test-bastion-1]$ kubectl cordon occne7-test-k8s-ctrl-1 $ [cloud-user@occne7-test-bastion-1]$ kubectl delete node occne7-test-k8s-ctrl-1
- If the failed controller node is reachable, run the
following commands to remove the controller node from the
cluster:
- Verify that the failed controller node is deleted from the cluster.
$ kubectl get node
Sample output:[cloud-user@occne7-test-bastion-1]$ kubectl get node NAME STATUS ROLES AGE VERSION occne7-test-k8s-ctrl-2 Ready control-plane,master 82m v1.23.7 occne7-test-k8s-ctrl-3 Ready control-plane,master 82m v1.23.7 occne7-test-k8s-node-1 Ready <none> 81m v1.23.7 occne7-test-k8s-node-2 Ready <none> 81m v1.23.7 occne7-test-k8s-node-3 Ready <none> 81m v1.23.7 occne7-test-k8s-node-4 Ready <none> 81m v1.23.7
Note:
If you are not able to run kubectl
commands from the Bastion Host, update the /var/occne/cluster/$OCCNE_CLUSTER/artifacts/admin.conf
file with the new working node IP address:vi /var/occne/cluster/occne7-test/artifacts/admin.conf server: https://192.168.203.194:6443
- Delete the failed controller node's instance using the OpenStack
GUI:
- Log in to OpenStack cloud using your credentials.
- From the Compute menu, select Instances, and
locate the failed controller node's instance that you want to delete, as
shown in the following image:
- On the instance record, click the drop-down option in the
Actions column, select Delete Instance to delete the
failed controller node's instance, as shown in the following image:
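The etcd member cleanup performed in the steps above can be captured in one place for reference. The following is a minimal sketch only, run as root on a working controller node; the <working control node IP address> and <failed control node ID> placeholders must be replaced with the values identified from your own member list output, exactly as in the steps above.
# Minimal sketch: list, remove, and re-check etcd members from a working controller node (run as root).
source /etc/etcd.env
# List the members and note the ID (first column) of the failed member.
/usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
# Remove the failed member by its ID.
/usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member remove <failed control node ID>
# Confirm that the failed member no longer appears in the list.
/usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list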
7.4.4.2 Removing a Controller Node in VMware Deployment
This section describes the procedure to remove a single controller node from the CNE Kubernetes cluster in a VMware deployment.
- Locate the controller node internal IP address by running the
following command from the Bastion
Host:
$ kubectl get node -o wide | egrep ctrl | awk '{ print $1, $2, $6}'
For example:$ [cloud-user@occne7-test-bastion-1 ~]$ kubectl get node -o wide | egrep ctrl | awk '{ print $1, $2, $6}'
Sample output:occne7-test-k8s-ctrl-1 NotReady 192.168.201.158 occne7-test-k8s-ctrl-2 Ready 192.168.203.194 occne7-test-k8s-ctrl-3 Ready 192.168.200.115
Note that the status of control node 1 is
NotReady
in the sample output. - Back up the terraform.tfstate file by running the following
commands:
$ cd /var/occne/cluster/${OCCNE_CLUSTER} $ cp terraform.tfstate ${OCCNE_CLUSTER}/terraform.tfstate.backup
- On the Bastion Host, use SSH to log in to a working controller node
and run the following commands to list the etcd
members:
$ ssh <working control node hostname> # sudo su # source /etc/etcd.env # /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
For example:$ ssh occne7-test-k8s-ctrl-2 [cloud-user@occne7-test-k8s-ctrl-2]$ sudo su [root@occne7-test-k8s-ctrl-2 cloud-user]# source /etc/etcd.env [root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
Sample output:52513ddd2aa49770, started, etcd1, https://192.168.201.158:2380, https://192.168.201.158:2379, false 80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
- From the output, identify the etcd (etcd1, etcd2, or etcd3) to which the failed controller node belongs.
- Copy the controller node ID that is displayed in the first column of the output to be used later in the procedure.
- If the failed controller node is reachable, use SSH to log in to the
failed controller node from the Bastion Host and stop the etcd service by running
the following
commands:
$ ssh <failed control node hostname> $ sudo systemctl stop etcd
For example:$ ssh occne7-test-k8s-ctrl-1 $ sudo systemctl stop etcd
- From the Bastion Host, use SSH to log in to a working controller
node and remove the failed controller node from the etcd member
list:
$ ssh <working control node hostname> $ sudo su $ source /etc/etcd.env $ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member remove <failed control node ID>
For example:[root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member remove 52513ddd2aa49770
Sample output:Member 52513ddd2aa49770 removed from cluster f347ab69786ba4f7
- Verify that the failed node is removed from the etcd member
list:
$ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
For example:[root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
Sample output:80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
- From the Bastion Host, switch the controller nodes in
terraform.tfstate
by running the following commands:Note:
Perform this step only if the failed controller node is an etcd1 member.$ cd /var/occne/cluster/${OCCNE_CLUSTER} $ cp terraform.tfstate terraform.tfstate.original $ python3 scripts/switchTfstate.py
For example:[cloud-user@occne7-test-bastion-1]$ python3 scripts/switchTfstate.py
Sample output:Beginning tfstate switch order k8s control nodes terraform.tfstate.lastversion created as backup Controller Nodes order before rotation: occne7-test-k8s-ctrl-1 occne7-test-k8s-ctrl-2 occne7-test-k8s-ctrl-3 Controller Nodes order after rotation: occne7-test-k8s-ctrl-2 occne7-test-k8s-ctrl-3 occne7-test-k8s-ctrl-1 Success: terraform.tfstate rotated for cluster occne7-test
- Remove the failed controller node from the cluster by performing
one of the following steps on the Bastion Host, depending on whether the failed
controller node is reachable:
- If the failed controller node is reachable, run the
following commands to remove the controller node from the
cluster:
$ kubectl cordon <failed control node hostname> $ kubectl drain <failed control node hostname> --force --ignore-daemonsets --delete-emptydir-data $ kubectl delete node <failed control node hostname>
For example:$ [cloud-user@occne7-test-bastion-1]$ kubectl cordon occne7-test-k8s-ctrl-1 $ [cloud-user@occne7-test-bastion-1]$ kubectl drain occne7-test-k8s-ctrl-1 --force --ignore-daemonsets --delete-emptydir-data $ [cloud-user@occne7-test-bastion-1]$ kubectl delete node occne7-test-k8s-ctrl-1
- If the failed controller node is not reachable, run the
following commands to remove the controller node from the
cluster:
$ kubectl cordon <failed control node hostname> $ kubectl delete node <failed control node hostname>
Example:$ [cloud-user@occne7-test-bastion-1]$ kubectl cordon occne7-test-k8s-ctrl-1 $ [cloud-user@occne7-test-bastion-1]$ kubectl delete node occne7-test-k8s-ctrl-1
- If the failed controller node is reachable, run the
following commands to remove the controller node from the
cluster:
- Verify that the failed controller node is deleted from the cluster.
$ kubectl get node
Sample output:NAME STATUS ROLES AGE VERSION occne7-test-k8s-ctrl-2 Ready control-plane,master 82m v1.23.7 occne7-test-k8s-ctrl-3 Ready control-plane,master 82m v1.23.7 occne7-test-k8s-node-1 Ready <none> 81m v1.23.7 occne7-test-k8s-node-2 Ready <none> 81m v1.23.7 occne7-test-k8s-node-3 Ready <none> 81m v1.23.7 occne7-test-k8s-node-4 Ready <none> 81m v1.23.7
Note:
If you are not able to run kubectl
commands from the Bastion Host, update the /var/occne/cluster/$OCCNE_CLUSTER/artifacts/admin.conf
file with the new working node IP address (a non-interactive sketch is provided at the end of this procedure):vi /var/occne/cluster/occne7-test/artifacts/admin.conf server: https://192.168.203.194:6443
- Delete the failed controller node's VM using the VMware GUI:
- Log in to VMware cloud using your credentials.
- From the Compute menu, select Virtual
Machines, and locate the failed controller node's VM to delete, as
shown in the following image:
- From the Actions menu, select Delete to
delete the failed controller node's VM, as shown in the following
image:
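If kubectl access from the Bastion Host is lost because admin.conf still points to the removed controller node, the server address can also be updated non-interactively. The following is a minimal sketch only; the 192.168.203.194 address is the example working controller IP used above and must be replaced with the internal IP of one of your remaining controller nodes.
# Minimal sketch: point the Bastion Host kubeconfig at a working controller node.
ADMIN_CONF=/var/occne/cluster/${OCCNE_CLUSTER}/artifacts/admin.conf
sed -i 's|server: https://.*:6443|server: https://192.168.203.194:6443|' ${ADMIN_CONF}
# Confirm that kubectl commands work again from the Bastion Host.
kubectl get nodes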
7.4.5 Adding a Kubernetes Worker Node
This section provides the procedure to add additional worker nodes to a previously installed CNE Kubernetes cluster.
Note:
- For a BareMetal installation, ensure that you are familiar with the inventory file preparation procedure. For more information about this procedure, see "Inventory File Preparation" section in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
- Run this procedure from the active Bastion Host only.
- You can add only one node at a time using this procedure.
Adding a Kubernetes Worker Node on BareMetal
Note:
For any failure or successful run, the system maintains all Terraform and pipeline output in the/var/occne/cluster/${OCCNE_CLUSTER}/addBmWkrNodeCapture-<mmddyyyy_hhmmss>.log
file.
- Log in to the Bastion Host and verify that it is the active Bastion Host.
If the Bastion Host isn't the active Bastion Host, then log in to another one.
Run the following command to check if the Bastion Host is the active Bastion Host:
$ is_active_bastion
The system displays the following output if the Bastion Host is the active Bastion Host:
IS active-bastion
The system displays the following output if the Bastion Host isn't the active Bastion Host:
NOT active-bastion
- Run the following command to navigate to the cluster
directory:
$ cd /var/occne/cluster/${OCCNE_CLUSTER}/
- Perform the following steps to edit the
hosts.ini
file and add the node details (a quick validation sketch for the finished entry is provided at the end of this procedure):- Run the following command to open the
hosts.ini
file in edit mode:$ vi hosts.ini
- Add the node details under the
[host_hp_gen_X]
or[host_netra_X]
hardware header, depending on your hardware type:[host_hp_gen_10]/[host_netra_X] k8s-node.example.oracle.com ansible_host=<ipv4> hp_ilo=<ipv4> mac=<mac-address> pxe_config_ks_nic=<nic0> pxe_config_nic_list=<nic0>,<nic1>,<nic2> pxe_uefi=False
where,<NODE_FULL_NAME>
is the full name of the node that is added.Note:
<NODE_FULL_NAME>
,ansible_host
,hp_ilo
ornetra_ilom
, andmac
are the required parameters and their values must be unique in thehost.ini
file.<mac-address>
must be a string of six two-digit hexadecimal numbers separated by a dash. For example,a2-27-3d-d3-b4-00
.- All IP addresses must be in proper IPV4 format.
pxe_config_ks_nic
,pxe_config_nic_list
, andpxe_uefi
are the optional parameters. The node details can also contain other optional parameters that are not listed in the example.- All the required and optional parameters
must be in the
<KEY>=<VALUE>
format without any space between the equal to sign. - All defined parameters must have a valid value.
- Comments must be added in a separate line
using
#
and must not be added at the end of the line.
For example, the following code block displays the node details of a worker node (
k8s-node-5.test.us.oracle.com
) added under the[host_hp_gen_10]
hardware header:... [host_hp_gen_10] k8s-host-1.test.us.oracle.com ansible_host=179.1.5.2 hp_ilo=172.16.9.44 mac=a2-27-3d-d3-b4-00 oam_host=10.75.216.13 k8s-host-2.test.us.oracle.com ansible_host=179.1.5.3 hp_ilo=172.16.9.45 mac=4d-d9-1a-e2-7e-e8 oam_host=10.75.216.14 k8s-host-3.test.us.oracle.com ansible_host=179.1.5.4 hp_ilo=172.16.9.46 mac=e1-15-b4-1d-32-10 k8s-node-1.test.us.oracle.com ansible_host=179.1.5.5 hp_ilo=172.16.9.47 mac=3b-d2-2d-f6-1e-20 k8s-node-2.test.us.oracle.com ansible_host=179.1.5.6 hp_ilo=172.16.9.48 mac=a8-1a-37-b1-c0-dc k8s-node-3.test.us.oracle.com ansible_host=179.1.5.7 hp_ilo=172.16.9.49 mac=a4-be-2d-3f-21-f0 k8s-node-4.test.us.oracle.com ansible_host=179.1.5.8 hp_ilo=172.16.9.35 mac=3a-d9-2c-e6-35-18 # New node k8s-node-5.test.us.oracle.com ansible_host=179.1.5.9 hp_ilo=172.16.9.46 mac=2a-e1-c3-d4-12-a9 ...
- Add the full name of the node under the
[kube-node]
header.[kube-node] <NODE_FULL_NAME>
where,
<NODE_FULL_NAME>
is the full name of the node that is added.For example, the following code block shows the full name of the worker node (k8s-node-5.test.us.oracle.com
) added under the[kube-node]
header:... [kube-node] k8s-node-1.test.us.oracle.com k8s-node-2.test.us.oracle.com k8s-node-3.test.us.oracle.com k8s-node-4.test.us.oracle.com # New node k8s-node-5.test.us.oracle.com ...
- Save the
hosts.ini
file and exit.
- Navigate to the
maintenance
directory:$ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/maintenance
- The
addBmWorkerNode.py
script in themaintenance
directory is used to add Kubernetes worker node on BareMetal. Run the following command to add one worker node at a time:$ ./addBmWorkerNode.py -nn <NODE_FULL_NAME>
where,
<NODE_FULL_NAME>
is the full name of the node that you added to thehost.ini
file in the previous steps.For example:$ ./addBmWorkerNode.py -nn k8s-5.test.us.oracle.com
Sample output:Beginning add worker node: k8s-5.test.us.oracle.com - Backing up configuration files - Verify hosts.ini values - Updating /etc/hosts on all nodes with new node - Successfully updated file: /etc/hosts on all servers - check /var/occne/cluster/test/addBmWkrNodeCapture-05312024_224446.log for details. - Set maintenance banner - Successfully set maintenance banner - check /var/occne/cluster/test/addBmWkrNodeCapture-05312024_224446.log for details. - Create toolbox - Checking if the rook-ceph toolbox deployment already exists. - rook-ceph toolbox deployment already exists, skipping creation. - Wait for Toolbox pod - Waiting for Toolbox pod to be in Running state. - ToolBox pod in namespace rook-ceph is now in Running state. - Updating OS on new node - Successfully run Provisioning pipeline - check /var/occne/cluster/test/addBmWkrNodeCapture-05312024_224446.log for details. - Scaling new node into cluster - Successfully run k8_install scale playbook - check /var/occne/cluster/test/addBmWkrNodeCapture-05312024_224446.log for details. - Running verification - Node k8s-5.test.us.oracle.com verification passed. - Restarting rook-ceph operator - rook-ceph pods ready! - Restoring default banner - Successfully run POST stage on PROV container - check /var/occne/cluster/test/addBmWkrNodeCapture-05312024_224446.log for details. Worker node: k8s-5.test.us.oracle.com added successfully
- Run the following commands to verify if the node is added
successfully:
- Run the following command and verify if the new node is in
the Ready state:
$ kubectl get nodes
Sample output:NAME STATUS ROLES AGE VERSION k8s-master-1.test.us.oracle.com Ready control-plane 7d15h v1.29.1 k8s-master-2.test.us.oracle.com Ready control-plane 7d15h v1.29.1 k8s-master-3.test.us.oracle.com Ready control-plane 7d15h v1.29.1 k8s-node-1.test.us.oracle.com Ready <none> 7d15h v1.29.1 k8s-node-2.test.us.oracle.com Ready <none> 7d15h v1.29.1 k8s-node-4.test.us.oracle.com Ready <none> 7d15h v1.29.1 k8s-node-5.test.us.oracle.com Ready <none> 14m v1.29.1
- Run the following command and verify if all pods are in the
Running or Completed
state:
$ kubectl get pod -A
- Run the following command and verify if the services are
running and the service GUIs are
reachable:
$ kubectl get svc -A
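Before running addBmWorkerNode.py, the new hosts.ini entry can be sanity-checked from the Bastion Host. The following is a minimal sketch only; k8s-node-5.test.us.oracle.com is the hypothetical node name from the earlier example, and the MAC check assumes the dash-separated format described in the note above.
# Minimal sketch: sanity-check the new hosts.ini entry before adding the node.
cd /var/occne/cluster/${OCCNE_CLUSTER}
NODE=k8s-node-5.test.us.oracle.com   # hypothetical node name from the example above
# The node must appear under its hardware header and under [kube-node].
grep -n "${NODE}" hosts.ini
# The MAC address must be six two-digit hexadecimal values separated by dashes.
grep "${NODE}" hosts.ini | grep -Eo 'mac=([0-9a-fA-F]{2}-){5}[0-9a-fA-F]{2}' || echo "MAC address missing or not in the expected aa-bb-cc-dd-ee-ff format"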
Adding a Kubernetes Worker Node on vCNE (OpenStack and VMware)
Note:
For any failure or successful run, the system maintains all Terraform and pipeline output in the/var/occne/cluster/${OCCNE_CLUSTER}/addWrkNodeCapture-<mmddyyyy_hhmmss>.log
file.
- Log in to a Bastion Host and ensure that all the pods are in the
Running or Completed
state:
$ kubectl get pod -A
- Verify if the services are reachable and if the common services
GUIs are accessible using the LoadBalancer
EXTERNAL-IPs:
$ kubectl get svc -A | grep LoadBalancer $ curl <svc_external_ip>
- Navigate to the cluster
directory:
$ cd /var/occne/cluster/$OCCNE_CLUSTER/
- Run the following command to open the
$OCCNE_CLUSTER/cluster.tfvars
file. Search for thenumber_of_k8s_nodes
parameter in the file and increment the value of the parameter by one.$ vi $OCCNE_CLUSTER/cluster.tfvars
The following example shows the current value ofnumber_of_k8s_nodes
set to 5:... # k8s nodes # number_of_k8s_nodes = 5 ...
The following example shows the value ofnumber_of_k8s_nodes
incremented by one to 6.... # k8s nodes # number_of_k8s_nodes = 6 ...
- For OpenStack, perform this step to source the
openrc.sh
file. Theopenrc.sh
file sets the necessary environment variables for OpenStack. For VMware, skip this step and move to the next step.- Source the
openrc.sh
file.$ source openrc.sh
- Enter the OpenStack username and password when prompted.
The following block shows the username and password prompt displayed by the system:
Please enter your OpenStack Username: Please enter your OpenStack Password as <username>:
- Run the following command to ensure that the
openstack-cacert.pem
file exists in the same folder and the file is populated with appropriate certificates if TLS is supported:
$ ls /var/occne/cluster/$OCCNE_CLUSTER
Sample output:... openstack-cacert.pem ...
- Run the
addWorkerNode.py
script to add a worker node:Note:
The system backs up number of files such aslbvm/lbCtrlData.json
,cluster.tfvars
,hosts.ini
,terraform.tfstate
(renamed to terraform.tfstate.ORIG), and/etc/hosts
into the/var/occne/cluster/${OCCNE_CLUSTER}/backUpConfig
directory. These files are backed up only once, to preserve the original files.
$ ./scripts/addWorkerNode.py
Sample output for OpenStack:Starting addWorkerNode instance for the last worker node. - Backing up configuration files... - Checking if cluster.tfvars matches with the terraform state... Succesfully checked the number_of_k8s_nodes parameter in the cluster.tfvars file. - Running terraform apply to update its state... Successfully applied Openstack terraform apply - check /var/occne/cluster/occne-test/addWkrNodeCapture-11262024_220914.log for details - Get name for the new worker node... Successfully retrieved the name of the new worker node. - Update /etc/hosts files on all previous servers... Successfully updated file: /etc/hosts on all servers - check /var/occne/cluster/occne-test/addWkrNodeCapture-11262024_220914.log for details. - Setting maintenance banner... Successfully set maintenance banner - check /var/occne/cluster/occne-test/addWkrNodeCapture-11262024_220914.log for details. - Running pipeline.sh for provision - can take considerable time to complete... Successfully run Provisioning pipeline - check /var/occne/cluster/occne-test/addWkrNodeCapture-11262024_220914.log for details. - Running pipeline.sh for k8s_install - can take considerable time to complete... Successfully run K8s pipeline - check /var/occne/cluster/occne-test/addWkrNodeCapture-11262024_220914.log for details. - Get IP address for the new worker node... Successfully retrieved IP address of the new worker node occne-test-k8s-node-5. - Update lbCtrlData.json file... Successfully updated file: /var/occne/cluster/occne-test/lbvm/lbCtrlData.json. - Update lb-controller-ctrl-data and lb-controller-master-ip configmap... Successfully created configmap lb-controller-ctrl-data. Successfully created configmap lb-controller-master-ip. - Restarting LB Controller POD to bind in configmaps... Successfully restarted deployment occne-lb-controller-server. Waiting for occne-lb-controller-server deployment to return to Running status. Deployment "occne-lb-controller-server" successfully rolled out - Update servers from new occne-lb-controller pod... Successfully updated server list for each service in haproxy.cfg on LBVMs with new node: occne-test-k8s-node-5. - Restoring default banner... Successfully restored default banner - check /var/occne/cluster/occne-test/addWkrNodeCapture-11262024_220914.log for details. Worker node successfully added to cluster: occne-test
Sample output for VMware:
Starting addWorkerNode instance for the last worker node. - Backing up configuration files... - Checking if cluster.tfvars matches with the terraform state... Succesfully checked the number_of_k8s_nodes parameter in the cluster.tfvars file. - Running terraform apply to update its state... VmWare terraform apply -refresh-only successful - check /var/occne/cluster/occne5-chandrasekhar-musti/addWkrNodeCapture-11282023_115313.log for details. VmWare terraform apply successful - node - check /var/occne/cluster/occne5-chandrasekhar-musti/addWkrNodeCapture-11282023_115313.log for details. - Get name for the new worker node... Successfully retrieved the name of the new worker node. - Running pipeline.sh for provision - can take considerable time to complete... Successfully run Provisioning pipeline - check /var/occne/cluster/occne5-chandrasekhar-musti/addWkrNodeCapture-11282023_115313.log for details. - Running pipeline.sh for k8s_install - can take considerable time to complete... Successfully run K8s pipeline - check /var/occne/cluster/occne5-chandrasekhar-musti/addWkrNodeCapture-11282023_115313.log for details. - Get IP address for the new worker node... Successfully retrieved IP address of the new worker node occne5-chandrasekhar-musti-k8s-node-4. - Update /etc/hosts files on all previous servers... Successfully updated file: /etc/hosts on all servers - check /var/occne/cluster/occne5-chandrasekhar-musti/addWkrNodeCapture-11282023_115313.log for details. - Update lbCtrlData.json file... Successfully updated file: /var/occne/cluster/occne5-chandrasekhar-musti/lbvm/lbCtrlData.json. - Update lb-controller-ctrl-data and lb-controller-master-ip configmap... Successfully created configmap lb-controller-ctrl-data. Successfully created configmap lb-controller-master-ip. - Deleting LB Controller POD: occne-lb-controller-server-5d8cd867b7-s5gb2 to bind in configmaps... Successfully restarted deployment occne-lb-controller-server. Waiting for occne-lb-controller-server deployment to return to Running status. Deployment "occne-lb-controller-server" successfully rolled out - Update servers from new occne-lb-controller pod... Successfully updated server list for each service in haproxy.cfg on LBVMs with new node: occne5-chandrasekhar-musti-k8s-node-4. Worker node successfully added to cluster: occne5-chandrasekhar-musti
- If there's a failure in the previous step, perform the following
steps to rerun the script:
- Copy backup files to the original
files:
$ cp /var/occne/cluster/${OCCNE_CLUSTER}/backupConfig/cluster.tfvars ${OCCNE_CLUSTER}/cluster.tfvars $ cp /var/occne/cluster/${OCCNE_CLUSTER}/backupConfig/lbCtrlData.json lbvm/lbCtrlData.json # sudo cp /var/occne/cluster/${OCCNE_CLUSTER}/backupConfig/hosts /etc/hosts
- If you ran Podman commands before the failure, then drain
the new node before rerunning the
script:
$ kubectl drain --ignore-daemonsets <worker_node_hostname>
For example:$ kubectl drain --ignore-daemonsets ${OCCNE_CLUSTER}-k8s-node-5
- Rerun the
addWorkerNode.py
script:$ scripts/addWorkerNode.py
- After rerunning the script, uncordon the
nodes:
$ kubectl uncordon <new node>
For example:$ kubectl uncordon ${OCCNE_CLUSTER}-k8s-node-5
- Verify the nodes, pods, and services (a consolidated verification sketch follows this procedure):
- Verify if the new nodes are in Ready state by running the
following
command:
$ kubectl get nodes
- Verify if all pods are in the Running or Completed state by
running the following
command:
$ kubectl get pod -A -o wide
- Verify if the services are running and the service GUIs are
reachable:
$ kubectl get svc -A
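The verification in the last step can be collected into a single pass. The following is a minimal sketch only; the node name follows the ${OCCNE_CLUSTER}-k8s-node-<n> pattern used in the examples above and must be replaced with the name of the worker node you added.
# Minimal sketch: confirm the new worker node is Ready and the cluster is healthy.
NEW_NODE=${OCCNE_CLUSTER}-k8s-node-5   # hypothetical name based on the examples above
kubectl get node ${NEW_NODE}                       # the node should report the Ready status
kubectl get pod -A -o wide | grep ${NEW_NODE}      # pods scheduled on the new node
kubectl get svc -A | grep LoadBalancer             # LoadBalancer services still have external IPs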
7.4.6 Removing a Kubernetes Worker Node
This section describes the procedure to remove a worker node from the CNE Kubernetes cluster after the original CNE installation. This procedure is used to remove a worker node that is unreachable (crashed or powered off), or that is up and running.
Note:
- This procedure is used to remove only one node at a time. If you want to remove multiple nodes, then perform this procedure on each node.
- Removing multiple worker nodes can cause unwanted side effects such as increasing the overall load of your cluster. Therefore, before removing multiple nodes, make sure that there is enough capacity left in the cluster.
- CNE requires a minimum of three worker nodes to properly run some of the common services such as Opensearch, the Bare Metal Rook Ceph cluster, and any daemonsets that require three or more replicas.
- For a vCNE deployment, this procedure is used to remove only the last worker node in the Kubernetes cluster. Therefore, refrain from using this procedure to remove any other worker node.
Note:
For any failure or successful run, the system maintains all terraform and pipeline output in the
/var/occne/cluster/${OCCNE_CLUSTER}/removeWrkNodeCapture-<mmddyyyy_hhmmss>.log
file.
- Log in to a Bastion Host and verify the following:
- Run the following command to verify if all pods are in the
Running or
Completed state:
$ kubectl get pod -A
Sample output:NAMESPACE NAME READY STATUS RESTARTS AGE cert-manager occne-cert-manager-6dcffd5b9-jpzmt 1/1 Running 1 (3h17m ago) 4h56m cert-manager occne-cert-manager-cainjector-5d6bccc77d-f4v56 1/1 Running 2 (3h15m ago) 3h48m cert-manager occne-cert-manager-webhook-b7f4b7bdc-rg58k 0/1 Completed 0 3h39m cert-manager occne-cert-manager-webhook-b7f4b7bdc-tx7gz 1/1 Running 0 3h17m ...
- Run the following command to verify if the service
LoadBalancer IPs are reachable and common service GUIs are
running:
$ kubectl get svc -A | grep LoadBalancer
Sample output:occne-infra occne-kibana LoadBalancer 10.233.36.151 10.75.180.113 80:31659/TCP 4h57m occne-infra occne-kube-prom-stack-grafana LoadBalancer 10.233.63.254 10.75.180.136 80:32727/TCP 4h56m occne-infra occne-kube-prom-stack-kube-alertmanager LoadBalancer 10.233.32.135 10.75.180.204 80:30155/TCP 4h56m occne-infra occne-kube-prom-stack-kube-prometheus LoadBalancer 10.233.3.37 10.75.180.126 80:31964/TCP 4h56m occne-infra occne-promxy-apigw-nginx LoadBalancer 10.233.42.250 10.75.180.4 80:30100/TCP 4h56m occne-infra occne-tracer-jaeger-query LoadBalancer 10.233.4.43 10.75.180.69 80:32265/TCP,16687:30218/TCP 4h56m
- Navigate to the
/var/occne/cluster/${OCCNE_CLUSTER}/
directory:$ cd /var/occne/cluster/${OCCNE_CLUSTER}/
- Open the
$OCCNE_CLUSTER/cluster.tfvars
file and decrement the value of thenumber_of_k8s_nodes
field by 1:$ vi $OCCNE_CLUSTER/cluster.tfvars
The following example shows the current value ofnumber_of_k8s_nodes
set to 6:... # k8s nodes # number_of_k8s_nodes = 6 ...
The following example shows the value ofnumber_of_k8s_nodes
decremented by 1 to 5:... # k8s nodes # number_of_k8s_nodes = 5 ...
- For OpenStack, perform this step to establish a connection between
Bastion Host and OpenStack cloud. For VMware, skip this step and move to the
next step.
Source the
openrc.sh
file. Enter the Openstack username and password when prompted. Theopenrc.sh
file sets the necessary environment variables for OpenStack. Once you source the file, ensure that theopenstack-cacert.pem
file exists in the same folder and the file is populated for TLS support:$ source openrc.sh
The following block shows the username and password prompt displayed by the system:Please enter your OpenStack Username: Please enter your OpenStack Password as <username>: Please enter your OpenStack Domain:
- Run the following command to get the list of
nodes:
$ kubectl get nodes -o wide | grep -v control-plane
Sample output:NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME occne6-my-cluster-k8s-node-1 Ready <none> 6d23h v1.25.6 192.168.201.183 <none> Oracle Linux Server 8.7 5.4.17-2136.316.7.el8uek.x86_64 containerd://1.6.15 occne6-my-cluster-k8s-node-2 Ready <none> 6d23h v1.25.6 192.168.201.136 <none> Oracle Linux Server 8.7 5.4.17-2136.316.7.el8uek.x86_64 containerd://1.6.15 occne6-my-cluster-k8s-node-3 Ready <none> 6d23h v1.25.6 192.168.201.131 <none> Oracle Linux Server 8.7 5.4.17-2136.316.7.el8uek.x86_64 containerd://1.6.15 occne6-my-cluster-k8s-node-4 Ready <none> 6d23h v1.25.6 192.168.200.100 <none> Oracle Linux Server 8.7 5.4.17-2136.316.7.el8uek.x86_64 containerd://1.6.15
- Run the following command to obtain the worker node IPs and verify
if the worker node IPs match with the list obtained in Step
4:
$ kubectl exec -it $(kubectl -n occne-infra get pods | grep occne-lb-controller-server) -n occne-infra -- /bin/bash -c "sqlite3 /data/sqlite/db/lbCtrlData.db 'SELECT * FROM nodeIps;'"
Sample output:192.168.201.183 192.168.201.136 192.168.201.131 192.168.200.100
- Run the
removeWorkerNode.py
script.Note:
The system backs up thelbvm/lbCtrlData.json
,cluster.tfvars
,hosts.ini
,terraform.tfstate
, and/etc/hosts
files into the/var/occne/cluster/${OCCNE_CLUSTER}/backUpConfig
directory. These files are backed up only once to back up the original files.$ ./scripts/removeWorkerNode.py
Example for OpenStack deployment:$ ./scripts/removeWorkerNode.py
Sample output:Starting removeWorkerNode instance for the last worker node. - Backing up configuration files... - Checking if cluster.tfvars matches with the terraform state... Succesfully checked the number_of_k8s_nodes parameter in the cluster.tfvars file. - Getting the IP address for the worker node to be deleted... Successfully gathered occne7-devansh-m-marwaha-k8s-node-4's ip: 192.168.200.105. - Draining node - can take considerable time to complete... Successfully drained occne7-devansh-m-marwaha-k8s-node-4 node. - Removing node from the cluster... Successfully removed occne7-devansh-m-marwaha-k8s-node-4 from the cluster. - Running terraform apply to update its state... Successfully applied Openstack terraform apply - check /var/occne/cluster/occne7-devansh-m-marwaha/removeWkrNodeCapture-11282023_090320.log for details - Updating /etc/hosts on all servers... Successfully updated file: /etc/hosts on all servers - check /var/occne/cluster/occne7-devansh-m-marwaha/removeWkrNodeCapture-11282023_090320.log for details. - Updating lbCtrlData.json file... Successfully updated file: /var/occne/cluster/occne7-devansh-m-marwaha/lbvm/lbCtrlData.json. - Updating lb-controller-ctrl-data and lb-controller-master-ip configmap... Successfully created configmap lb-controller-ctrl-data. Successfully created configmap lb-controller-master-ip. - Deleting LB Controller POD: occne-lb-controller-server-fc869755-lm4hd to bind in configmaps... Successfully restarted deployment occne-lb-controller-server. Waiting for occne-lb-controller-server deployment to return to Running status. Deployment "occne-lb-controller-server" successfully rolled out - Update servers from new occne-lb-controller pod... Successfully removed the node: occne7-devansh-m-marwaha-k8s-node-4 from server list for each service in haproxy.cfg on LBVMs. Worker node successfully removed from cluster: occne7-devansh-m-marwaha
Example for VMware deployment:./scripts/removeWorkerNode.py
Sample output:Starting removeWorkerNode instance for the last worker node. Successfully obtained index 3 from node occne5-chandrasekhar-musti-k8s-node-4. - Backing up configuration files... - Checking if cluster.tfvars matches with the terraform state... Succesfully checked the number_of_k8s_nodes parameter in the cluster.tfvars file. - Getting the IP address for the worker node to be deleted... Successfully gathered occne5-chandrasekhar-musti-k8s-node-4's ip: 192.168.1.15. - Draining node - can take considerable time to complete... Successfully drained occne5-chandrasekhar-musti-k8s-node-4 node. - Removing node from the cluster... Successfully removed occne5-chandrasekhar-musti-k8s-node-4 from the cluster. - Running terraform apply to update its state... Successfully applied VmWare terraform apply - check /var/occne/cluster/occne5-chandrasekhar-musti/removeWkrNodeCapture-11282023_105101.log fodetails. - Updating /etc/hosts on all servers... Successfully updated file: /etc/hosts on all servers - check /var/occne/cluster/occne5-chandrasekhar-musti/removeWkrNodeCapture-11282023_1051.log for details. - Updating lbCtrlData.json file... Successfully updated file: /var/occne/cluster/occne5-chandrasekhar-musti/lbvm/lbCtrlData.json. - Updating lb-controller-ctrl-data and lb-controller-master-ip configmap... Successfully created configmap lb-controller-ctrl-data. Successfully created configmap lb-controller-master-ip. - Deleting LB Controller POD: occne-lb-controller-server-7b894fb6b5-5cr8g to bind in configmaps... Successfully restarted deployment occne-lb-controller-server. Waiting for occne-lb-controller-server deployment to return to Running status. Deployment "occne-lb-controller-server" successfully rolled out - Update servers from new occne-lb-controller pod... Successfully removed the node: occne5-chandrasekhar-musti-k8s-node-4 from server list for each service in haproxy.cfg on LBVMs. Worker node successfully removed from cluster: occne5-chandrasekhar-musti
- Verify that the specified node is removed (a consolidated sketch is provided at the end of this section):
- Run the following command to list the worker
nodes:
$ kubectl get nodes -o wide
Sample output:NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME occne6-my-cluster-k8s-ctrl-1 Ready control-plane,master 6d23h v1.25.6 192.168.203.106 <none> Oracle Linux Server 8.7 5.4.17-2136.316.7.el8uek.x86_64 containerd://1.6.15 occne6-my-cluster-k8s-ctrl-2 Ready control-plane,master 6d23h v1.25.6 192.168.202.122 <none> Oracle Linux Server 8.7 5.4.17-2136.316.7.el8uek.x86_64 containerd://1.6.15 occne6-my-cluster-k8s-ctrl-3 Ready control-plane,master 6d23h v1.25.6 192.168.202.248 <none> Oracle Linux Server 8.7 5.4.17-2136.316.7.el8uek.x86_64 containerd://1.6.15 occne6-my-cluster-k8s-node-1 Ready <none> 6d23h v1.25.6 192.168.201.183 <none> Oracle Linux Server 8.7 5.4.17-2136.316.7.el8uek.x86_64 containerd://1.6.15 occne6-my-cluster-k8s-node-2 Ready <none> 6d23h v1.25.6 192.168.201.136 <none> Oracle Linux Server 8.7 5.4.17-2136.316.7.el8uek.x86_64 containerd://1.6.15 occne6-my-cluster-k8s-node-3 Ready <none> 6d23h v1.25.6 192.168.201.131 <none> Oracle Linux Server 8.7 5.4.17-2136.316.7.el8uek.x86_64 containerd://1.6.15
- Run the following command and check if the targeted worker
node is
removed:
$ kubectl exec -it $(kubectl -n occne-infra get pods | grep occne-lb-controller-server) -n occne-infra -- /bin/bash -c "sqlite3 /data/sqlite/db/lbCtrlData.db 'SELECT * FROM nodeIps;'"
Sample output:192.168.201.183 192.168.201.136 192.168.201.131
Note:
For any failure or successful run, the system maintains all pipeline outputs in the
/var/occne/cluster/${OCCNE_CLUSTER}/removeWrkNodeCapture-<mmddyyyy_hhmmss>.log
file. The system displays other outputs, messages, or errors directly on the
terminal during the runtime of the script.
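The post-removal verification in step 7 can also be run as one pass. The following is a minimal sketch only; replace the placeholder with the hostname of the worker node that was removed, and note that the lb-controller query is the same command used in step 7.
# Minimal sketch: confirm the worker node was removed from Kubernetes and from the LB controller data.
REMOVED_NODE=<removed worker node hostname>
kubectl get nodes -o wide | grep "${REMOVED_NODE}" || echo "Node is no longer present in the Kubernetes node list"
kubectl exec -it $(kubectl -n occne-infra get pods | grep occne-lb-controller-server) -n occne-infra -- /bin/bash -c "sqlite3 /data/sqlite/db/lbCtrlData.db 'SELECT * FROM nodeIps;'"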
7.4.7 Adding a New External Network
This section provides the procedure to add a new external network that applications can use to communicate with external clients. The network is added as a Peer Address Pool (PAP) in a virtualized CNE (vCNE) or Bare Metal deployment after CNE installation.
OCCNE_STAGES=(TEST) pipeline.sh
7.4.7.1 Adding a New External Network in vCNE
The system captures the output of this procedure in a log file named addpapCapture-<mmddyyyy_hhmmss>.log
. For example,
addPapCapture-09172021_000823.log
. The log includes the output from the
Terraform and the pipeline call to configure the new LBVMs.
The system also saves a backup of the existing configuration in a directory named addPapSave-<mmddyyyy-hhmmss>
. The following files from the
/var/occne/cluster/<cluster_name>
directory are saved in the
addPapSave-<mmddyyyy-hhmmss>
directory:
- lbvm/lbCtrlData.json
- metallb.auto.tfvars
- mb_resources.yaml
- terraform.tfstate
- hosts.ini
- cluster.tfvars
- On an OpenStack deployment, run the following steps to source the OpenStack
environment file. This step is not required for a VMware deployment as the credential
settings are derived automatically.
- Log in to Bastion Host and change the directory to the cluster
directory:
$ cd /var/occne/cluster/${OCCNE_CLUSTER}
- Source the OpenStack environment
file:
$ source openrc.sh
Procedure
7.4.8 Renewing the Platform Service Mesh Root Certificate
This section describes the procedure to renew the root certificate used by the platform service mesh to generate certificates for Mutual Transport Layer Security (mTLS) communication when the Intermediate Certification Authority (ICA) issuer type is used.
- The CNE platform service mesh must have been configured to use the Intermediate CA issuer type.
- A network function configured with the platform service mesh, commonly istio, must be available.
- Renew the root CA certificate
- Verify that the root certificate is renewed
7.4.9 Performing an etcd Data Backup
This section describes the procedure to back up the etcd database. Perform this backup in the following scenarios:
- After a 5G NF is installed, uninstalled, or upgraded
- Before and after CNE is upgraded
- Find Kubernetes controller hostname: Run the following command to
get the names of Kubernetes controller nodes. The backup must be taken from any
one of the controller nodes that is in the Ready
state (a sketch for selecting a Ready controller node automatically is provided at the end of this procedure).
$ kubectl get nodes
- Run the etcd-backup script:
- On the Bastion Host, switch to the
/var/occne/cluster/${OCCNE_CLUSTER}/artifacts
directory:$ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts
- Run the
etcd_backup.sh
script:$ ./etcd_backup.sh
On running the script, the system prompts you to enter the k8s-ctrl node name. Enter the name of the controller node from which you want to back up the etcd data.
Note:
The script keeps only three backup snapshots in the PVC and automatically deletes the older snapshots.
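When scripting this backup, a suitable controller node can be selected automatically instead of reading it from the kubectl get nodes output. The following is a minimal sketch only; it assumes the controller nodes report the control-plane role shown in the sample outputs of this guide, and the selected name is then entered at the etcd_backup.sh prompt.
# Minimal sketch: pick the first Ready controller node and use it for the etcd backup.
CTRL_NODE=$(kubectl get nodes --no-headers | awk '$3 ~ /control-plane/ && $2 == "Ready" {print $1; exit}')
echo "Using controller node: ${CTRL_NODE}"
cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts
./etcd_backup.sh    # enter ${CTRL_NODE} when prompted for the k8s-ctrl node name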
7.5 Updating OpenStack Credentials
This section describes the procedure to update the OpenStack credentials for vCNE.
Prerequisites
- You must have access to active Bastion Host of the cluster.
- All commands in this procedure must be run from the active CNE Bastion Host.
- You must have knowledge of kubectl and handling base64 encoded and decoded strings.
Modifying Password for Cinder Access
Kubernetes uses the cloud-config secret when interacting with OpenStack Cinder to acquire persistent storage for applications. The following steps describe how to update this secret to include the new password. A scripted sketch of the same flow is provided after these steps.
- Run the following command to decode and save the current
cloud-config secret configurations in a temporary
file:
$ kubectl get secret cloud-config -n kube-system -o jsonpath="{.data.cloud\.conf}" | base64 --decode > /tmp/decoded_cloud_config.txt
- Run the following command to open the temporary file in vi editor
and update the username and password fields in the file with required
values:
$ vi /tmp/decoded_cloud_config.txt
Sample to edit the username and password:username="new_username" password="new_password"
After updating the credentials, save and exit from the file.
- Run the following command to re-encode the
cloud-config
secret in Base64. Save the encoded output to use it in the following step.$ cat /tmp/decoded_cloud_config.txt | base64 -w0
- Run the following command to edit the
cloud-config
Kubernetes secret:$ kubectl edit secret cloud-config -n kube-system
Refer to the following sample to edit thecloud-config
Kubernetes secret:Note:
Replace<encoded-output>
in the following sample with the encoded output that you saved in the previous step.# Please edit the object below. Lines beginning with a '#' will be ignored, # and an empty file will abort the edit. If an error occurs while saving this file will be # reopened with the relevant failures. # apiVersion: v1 data: cloud.conf: <encoded-output> kind: Secret metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"v1","data":{"cloud.conf":"<encoded-output>"},"kind":"Secret","metadata":{"annotations":{},"name":"cloud-config","namespace":"kube-system"}} creationTimestamp: "2022-01-12T02:34:52Z" name: cloud-config namespace: kube-system resourceVersion: "2225" uid: 0994b024-6a4d-41cf-904c type: Opaque
Save the changes and exit the editor.
- Run the following command to remove the temporary
file:
$ rm /tmp/decoded_cloud_config.txt
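The decode, edit, re-encode, and update flow above can also be performed without an interactive kubectl edit session. The following is a minimal sketch only; it assumes the username and password lines appear in the decoded cloud.conf exactly as shown in the sample above, and the new values are placeholders.
# Minimal sketch: non-interactive update of the cloud-config secret in kube-system.
kubectl get secret cloud-config -n kube-system -o jsonpath="{.data.cloud\.conf}" | base64 --decode > /tmp/decoded_cloud_config.txt
# Replace the credential lines in place (placeholder values shown).
sed -i 's/^username=.*/username="new_username"/' /tmp/decoded_cloud_config.txt
sed -i 's/^password=.*/password="new_password"/' /tmp/decoded_cloud_config.txt
# Re-encode the file, patch the secret with the new content, and remove the temporary file.
ENCODED=$(base64 -w0 < /tmp/decoded_cloud_config.txt)
kubectl patch secret cloud-config -n kube-system --type=merge -p "{\"data\":{\"cloud.conf\":\"${ENCODED}\"}}"
rm /tmp/decoded_cloud_config.txt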
Modifying Password for OpenStack Cloud Controller Access
Kubernetes uses the external-openstack-cloud-config
secret when interacting with the OpenStack Controller. The following steps describe
the procedure to update the secret to include the new credentials.
- Run the following command to decode the current
external-openstack-cloud-config
secret configurations in a temporary file:$ kubectl get secret external-openstack-cloud-config -n kube-system -o jsonpath="{.data.cloud\.conf}" | base64 --decode > /tmp/decoded_external_openstack_cloud_config.txt
- Run the following command to open the temporary file in vi editor
and update the username and password fields in the file with required
values:
$ vi /tmp/decoded_external_openstack_cloud_config.txt
Sample to edit the username and password:username="new_username" password="new_password"
After updating the credentials, save and exit from the file.
- Run the following command to re-encode
external-openstack-cloud-config
in Base64. Save the encoded output to use it in the following step.$ cat /tmp/decoded_external_openstack_cloud_config.txt | base64 -w0
- Run the following command to edit the Kubernetes Secret named,
external-openstack-cloud-config
:$ kubectl edit secret external-openstack-cloud-config -n kube-system
Refer to the following sample to edit theexternal-openstack-cloud-config
Kubernetes Secret with the new encoded value:Note:
- Replace
<encoded-output>
in the following sample with the encoded output that you saved in the previous step. - An empty file aborts the edit. If an error occurs while saving, the file reopens with the relevant failures.
apiVersion: v1 data: ca.cert: cloud.conf:<encoded-output> kind: Secret metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"v1","data":{"ca.cert":" ","cloud.conf":"<encoded-output>"},"kind":"Secret","metadata":{"annotations":{},"name":"external-openstack-cloud-config","namespace":"kube-system"}} creationTimestamp: "2022-07-21T17:05:26Z" name: external-openstack-cloud-config namespace: kube-system resourceVersion: "16" uid: 9c18f914-9c78-401d-ae79 type: Opaque
Save the changes and exit the editor.
- Replace
- Run the following command to remove the temporary
file:
$ rm /tmp/decoded_external_openstack_cloud_config.txt
Restarting Affected Pods to Use the New Password
Note:
Before restarting the services, verify that all the affected Kubernetes resources to be restarted are in a healthy state.- Perform the following steps to restart Cinder Container Storage
Interface (Cinder CSI) controller plugin:
- Run the following command to restart Cinder Container
Storage Interface (Cinder CSI)
deployment:
$ kubectl rollout restart deployment csi-cinder-controllerplugin -n kube-system
Sample output:deployment.apps/csi-cinder-controllerplugin restarted
- Run the following command to get the pod and verify if it is
running:
$ kubectl get pods -l app=csi-cinder-controllerplugin -n kube-system
Sample output:NAME READY STATUS RESTARTS AGE csi-cinder-controllerplugin-7c9457c4f8-88sbt 6/6 Running 0 19m
- [Optional]: If the pod is not up or if the pod is in the
crashloop
state, get the logs from thecinder-csi-plugin
container inside thecsi-cinder-controller
pod using labels and validate the logs for more information:$ kubectl logs -l app=csi-cinder-controllerplugin -c cinder-csi-plugin -n kube-system
Sample output to show a successful log retrieval:I0904 21:36:09.162886 1 server.go:106] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
Sample output to show a log retrieval failure:W0904 21:34:34.252515 1 main.go:105] Failed to GetOpenStackProvider: Authentication failed
- Perform the following steps to restart Cinder Container Storage
Interface (Cinder CSI) nodeplugin daemonset:
- Run the following command to restart Cinder Container
Storage Interface (Cinder CSI) nodeplugin
daemonset:
$ kubectl rollout restart -n kube-system daemonset csi-cinder-nodeplugin
Sample output:daemonset.apps/csi-cinder-nodeplugin restarted
- Run the following command to get the pod and verify if it is
running:
$ kubectl get pods -l app=csi-cinder-nodeplugin -n kube-system
Sample output:NAME READY STATUS RESTARTS AGE csi-cinder-nodeplugin-pqqww 3/3 Running 0 3d19h csi-cinder-nodeplugin-vld6m 3/3 Running 0 3d19h csi-cinder-nodeplugin-xg2kj 3/3 Running 0 3d19h csi-cinder-nodeplugin-z5vck 3/3 Running 0 3d19h
- [Optional]: If the pod is not up or if the pod is in the
crashloop
state, verify the logs for more information
- Perform the following steps to restart the OpenStack cloud controller
daemonset:
- Run the following command to restart the OpenStack cloud
controller
daemonset:
$ kubectl rollout restart -n kube-system daemonset openstack-cloud-controller-manager
Sample output:daemonset.apps/openstack-cloud-controller-manager restarted
- Run the following command to get the pod and verify if it
is
running:
$ kubectl get pods -l k8s-app=openstack-cloud-controller-manager -n kube-system
Sample output:NAME READY STATUS RESTARTS AGE openstack-cloud-controller-manager-qtfff 1/1 Running 0 38m openstack-cloud-controller-manager-sn2pg 1/1 Running 0 38m openstack-cloud-controller-manager-w5dcv 1/1 Running 0 38m
- [Optional]: If the pod is not up, or is in the
crashloop
state, verify the logs for more information.
Changing Inventory File
When you perform the steps to modify password for Cinder access and modify password for OpenStack cloud controller access, you modify the
Kubernetes secrets to contain the new credentials. However, running the pipeline
(for example, performing a standard upgrade or adding a new node to the cluster)
takes the current credentials stored in the occne.ini
file, causing
the changes to be overridden. Therefore, it is important to update the
occne.ini
file with the new credentials.
- Navigate to the cluster
directory:
$ cd /var/occne/cluster/${OCCNE_CLUSTER}/
- Open the
occne.ini
file:$ vi occne.ini
- Update the external OpenStack credentials (both username and password) as shown
below:
external_openstack_username = USER external_openstack_password = PASSWORD
- Update Cinder credentials (both username and password) as shown
below (a scripted sketch for updating these values follows this list):
cinder_username = USER cinder_password = PASSWORD
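The same occne.ini update can be scripted. The following is a minimal sketch only; it assumes the four credential keys appear in occne.ini with the exact names and spacing shown above, and USER and PASSWORD are placeholders for the new values.
# Minimal sketch: update the OpenStack credentials stored in occne.ini (placeholder values shown).
cd /var/occne/cluster/${OCCNE_CLUSTER}
sed -i 's/^external_openstack_username = .*/external_openstack_username = USER/' occne.ini
sed -i 's/^external_openstack_password = .*/external_openstack_password = PASSWORD/' occne.ini
sed -i 's/^cinder_username = .*/cinder_username = USER/' occne.ini
sed -i 's/^cinder_password = .*/cinder_password = PASSWORD/' occne.ini
# Confirm the updated values.
grep -E '^(external_openstack|cinder)_(username|password)' occne.ini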
Updating Credentials for lb-controller-user
Note:
Run all the commands in this section from Bastion Host.- Run the following commands to update lb-controller-user
credentials:
$ echo -n "<Username>" | base64 -w0 | xargs -I{} kubectl -n occne-infra patch --type=merge secret lb-controller-user --patch '{"data":{"USERNAME":"{}"}}'
$ echo -n "<Password>" | base64 -w0 | xargs -I{} kubectl -n occne-infra patch --type=merge secret lb-controller-user --patch '{"data":{"PASSWORD":"{}"}}'
where:- <Username>, is the new OpenStack username.
- <Password> is the new OpenStack password.
- Run the following command to restart lb-controller-server to use the new
credentials:
$ kubectl rollout restart deployment occne-lb-controller-server -n occne-infra
- Wait until the
lb-controller
restarts and run the following command to get the lb-controller pod status using labels. Ensure that only one pod is in the Running status:$ kubectl get pods -l app=lb-controller -n occne-infra
Sample output:NAME READY STATUS RESTARTS AGE occne-lb-controller-server-74fd947c7c-vtw2v 1/1 Running 0 50s
- Validate the new credentials by printing the username and password directly
from the new pod's environment
variables:
$ kubectl exec -it $(kubectl get pod -n occne-infra | grep lb-controller-server | cut -d " " -f1) -n occne-infra -- bash -c "echo -n \$USERNAME" $ kubectl exec -it $(kubectl get pod -n occne-infra | grep lb-controller-server | cut -d " " -f1) -n occne-infra -- bash -c "echo -n \$PASSWORD"
7.6 Updating the Guest or Host OS
You must update the host OS (for Bare Metal installations) or guest OS (for virtualized installations) periodically so that CNE has the latest Oracle Linux software. If CNE has not been upgraded recently, or if there are known security patches, then perform an update by referring to the upgrade procedures in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
7.7 CNE Grafana Dashboards
Grafana is an observability tool available in open source and enterprise versions. Grafana supports a number of data sources, such as Prometheus, from which it can read data for analytics. You can find the official list of supported data sources at Grafana Datasources. CNE provides the following Grafana dashboards:
- CNE Kubernetes dashboard
- CNE Prometheus dashboard
- CNE logging dashboard
- CNE persistent storage dashboard (only for Bare Metal)
Note:
The Grafana dashboards provisioned by CNE are read-only. Refrain from updating or modifying these default dashboards.You can clone these dashboards to customize them as per your requirement and save the customized dashboards in JSON format. This section provides details about the features offered by the open source Grafana version to add the required observability framework to CNE.
7.7.1 Accessing Grafana Interface
This section provides the procedure to access Grafana web interface.
- Perform the following steps to get the Load
Balancer IP address and port number for accessing the Grafana web interface:
- Run the following command to get the Load Balancer IP address of the
Grafana
service:
$ export GRAFANA_LOADBALANCER_IP=$(kubectl get services occne-kube-prom-stack-grafana --namespace occne-infra -o jsonpath="{.status.loadBalancer.ingress[*].ip}")
- Run the following command to get the LoadBalancer port number of the
Grafana
service:
$ export GRAFANA_LOADBALANCER_PORT=$(kubectl get services occne-kube-prom-stack-grafana --namespace occne-infra -o jsonpath="{.spec.ports[*].port}")
- Run the following command to get the complete URL for accessing Grafana in
an external
browser:
$ echo http://$GRAFANA_LOADBALANCER_IP:$GRAFANA_LOADBALANCER_PORT/$OCCNE_CLUSTER/grafana
Sample output:http://10.75.225.60:80/mycne-cluster/grafana
- Use the URL obtained in the previous step (in this case, http://10.75.225.60:80/mycne-cluster/grafana) to access the Grafana home page. A reachability sketch is provided at the end of this section.
- Click Dashboards and select Browse.
- Expand the CNE folder to view the CNE dashboards.
Note:
CNE doesn't support user access management on Grafana.
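The address discovery above can be combined into a quick reachability check. The following is a minimal sketch only; it reuses the kubectl queries from step 1 and assumes the Grafana LoadBalancer IP is reachable from the host where the commands are run.
# Minimal sketch: build the Grafana URL and check that it responds.
export GRAFANA_LOADBALANCER_IP=$(kubectl get services occne-kube-prom-stack-grafana --namespace occne-infra -o jsonpath="{.status.loadBalancer.ingress[*].ip}")
export GRAFANA_LOADBALANCER_PORT=$(kubectl get services occne-kube-prom-stack-grafana --namespace occne-infra -o jsonpath="{.spec.ports[*].port}")
GRAFANA_URL="http://${GRAFANA_LOADBALANCER_IP}:${GRAFANA_LOADBALANCER_PORT}/${OCCNE_CLUSTER}/grafana"
echo "${GRAFANA_URL}"
curl -s -o /dev/null -w "%{http_code}\n" "${GRAFANA_URL}"    # expect an HTTP 200 or a redirect code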
7.7.2 Cloning a Grafana Dashboard
This section describes the procedure to clone a Grafana dashboard.
- Open the dashboard that you want to clone.
- Click the Share dashboard or panel icon next to the dashboard name.
- Select Export and click Save to file to save the dashboard in JSON format in your local system.
- Perform the following steps to import the saved dashboard to
Grafana:
- Click Dashboards and select Import.
- Click Upload JSON file and select the dashboard that you saved in step 3.
- Change the name and UID of the
dashboard.
You have cloned the dashboard successfully. You can now use the cloned dashboard to customize the options as per your requirement.
7.7.3 Restoring a Grafana Dashboard
The default Grafana dashboards provided by CNE are stored as configmaps in the CNE cluster and in the artifacts directory so that they can be restored to their default state. This section describes the procedure to restore a Grafana dashboard.
Note:
- This procedure is used to restore the dashboards to the default state (that is, the default dashboards provided by CNE).
- When you restore the dashboards, you lose all the customizations that you made on the dashboards. You can't use this procedure to restore the customizations that you made on top of the CNE default dashboards.
- You can't use this procedure to restore other Grafana dashboards that you created.
- Navigate to the
occne-grafana-dashboard
directory:$ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/occne-grafana-dashboard
- Run the following command to restore all the
dashboards present in the
occne-grafana-dashboard
directory to their default state. The command uses the YAML files of the dashboards in the directory to restore them (a quick verification sketch is provided at the end of this section).$ kubectl -n occne-infra apply -R -f occne-grafana-dashboard
You can also restore a specific dashboard by providing a specific YAML file name in the command. For example, you can use the following command to restore only the CNE Kubernetes dashboard:$ kubectl -n occne-infra apply -f occne-grafana-dashboard/occne-k8s-cluster-dashboard.yaml
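To confirm that the dashboard definitions were applied, the same YAML files can be queried back from the cluster. This is a minimal sketch only; it assumes the current directory is the occne-grafana-dashboard directory used in the steps above.
# Minimal sketch: list the objects defined by the dashboard YAML files after the apply.
cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/occne-grafana-dashboard
kubectl -n occne-infra get -R -f .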
7.8 Managing 5G NFs
This section describes procedures to manage 5G NFs in CNE.
7.8.1 Installing an NF
This section describes the procedure to install an NF in the CNE Kubernetes cluster.
Prerequisites
- Load container images and Helm charts onto Central Server
repositories.
Container and Helm repositories are created on a Central Server for easy CNE deployment at multiple customer sites. These repositories store all of the container images and Helm charts required to install CNE. When necessary, container images and Helm charts are pulled from the central server repositories to the local repositories on the CNE Bastion Hosts. Similarly, NF installation uses Helm so that the container images and Helm charts needed to install NFs are loaded onto the same Central Server repositories. This procedure assumes that all container images and Helm charts required to install the NF are already loaded onto the Central Server repositories.
- Determine the NF deployment parameters
The following values determine the NF's identity and where it is deployed. These values are used in the following procedure:
Table 7-15 NF Deployment Parameters
Parameters Value Description nf-namespace Any valid namespace name The namespace where you want to install the NF. Typically each NF is installed in its own namespace. nf-deployment-name Any valid Kubernetes deployment name The name by which this NF instance is known to Kubernetes.
Load NF artifacts onto Bastion Host repositories
All the steps in this section are run on the CNE Bastion Host where the NF installation happens.
- Create a file container_images.txt listing the Container
images and tags as required by the
NF:
<image-name>:<release>
Example:
busybox:1.29.0
- Run the following command to load the container images into the CNE
Container
registry:
$ retrieve_container_images.sh <external-container-repo-name>:<external-container-repo-port> ${HOSTNAME%%.*}:5000 < container_images.txt
Example:
$ retrieve_container_images.sh mycentralrepo:5000 ${HOSTNAME%%.*}:5000 < container_images.txt
- Create a file helm_charts.txt listing the Helm chart and
version:
<external-helm-repo-name>/<chart-name> <chart-version>
Example:
mycentralhelmrepo/busybox 1.33.0
- Run the following command to load the charts into the CNE Helm
chart
repository:
$ retrieve_helm.sh /var/www/html/occne/charts http://<external-helm-repo-name>/occne/charts [helm_executable_full_path_if_not_default] < helm_charts.txt
Example:
$ retrieve_helm.sh /var/www/html/occne/charts http://mycentralrepo/occne/charts < helm_charts.txt
Install the NF
- On the Bastion Host, create a YAML file named
<nf-short-name>-values.yaml
to contain the values to be passed to the NF Helm chart. - Add NF-specific values to file
See the NF installation instructions to understand which keys and values must be included in the values file.
- Additional NF configuration
Before installing the NF, see the installation instructions to understand the requirements of additional NF configurations along with Helm chart values.
- Run the following command to install the
NF:
$ helm install --namespace <nf-namespace> --create-namespace -f <nf-short-name>-values.yaml <nf-deployment-name> <chart-or-chart-location>
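The following is a minimal sketch of what <nf-short-name>-values.yaml can look like, assuming a hypothetical NF with the short name mynf. Every key shown here is an assumption for illustration only; the real keys and values must be taken from the NF installation instructions.
# mynf-values.yaml -- illustrative sketch only; the actual keys and values
# are defined by the NF's own Helm chart and installation instructions.
global:
  # Hypothetical key pointing the NF at the Bastion Host container registry.
  dockerRegistry: bastion-1:5000
# Hypothetical application-level settings.
replicaCount: 2
logging:
  level: INFO
With such a file in place, the install command from the last step is run with the namespace, deployment name, and chart reference chosen earlier, for example mynf-ns as the namespace and mynf as the deployment name.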
7.8.2 Upgrading an NF
This section describes the procedure to upgrade a 5G network function that was previously installed in the CNE Kubernetes cluster.
Prerequisites
Load container images and Helm charts onto Central Server repositories.
Container and Helm repositories are created on a Central Server for easy CNE deployment at multiple customer sites. These repositories store all of the container images and Helm charts required to install CNE. When necessary, the container images and Helm charts are pulled from the Central Server repositories to the repositories on the local CNE Bastion Hosts. Similarly, Network Function (NF) installation uses Helm, so the container images and Helm charts needed to install NFs are loaded onto the same Central Server repositories. This procedure assumes that all container images and Helm charts required to install the NF are already loaded onto the Central Server repositories.
Procedure
Load NF artifacts onto Bastion Host repositories
All the steps in this section are run on the CNE Bastion Host where the NF installation happens.
- Create a file container_images.txt listing the Container
images and tags as required by the
NF:
<image-name>:<release>
Example:
busybox:1.29.0
- Run the following command to load the container images into the
CNE Container
registry:
$ retrieve_container_images.sh <external-container-repo-name>:<external-container-repo-port> ${HOSTNAME%%.*}:5000 < container_images.txt
Example:
$ retrieve_container_images.sh mycentralrepo:5000 ${HOSTNAME%%.*}:5000 < container_images.txt
- Create a file helm_charts.txt listing the Helm chart and
version:
<external-helm-repo-name>/<chart-name> <chart-version>
Example:
mycentralhelmrepo/busybox 1.33.0
- Run the following command to load the charts into the CNE Helm
chart
repository:
$ retrieve_helm.sh /var/www/html/occne/charts http://<external-helm-repo-name>/occne/charts [helm_executable_full_path_if_not_default] < helm_charts.txt
Example:
$ retrieve_helm.sh /var/www/html/occne/charts http://mycentralrepo/occne/charts < helm_charts.txt
Upgrade the NF
- On the Bastion Host, create a YAML file named <nf-short-name>-values.yaml that contains the new and changed values to be passed to the NF Helm chart. A minimal sketch of such a file is provided after these steps.
See the NF installation instructions to understand which keys and values must be included in the values file. Only values for parameters that were not included in the Helm input values applied to the previous release, or parameters whose names changed from the previous release, must be included in this file.
- If a YAML file was created for this upgrade, run the following command to upgrade the NF with the new values:
$ helm upgrade --namespace <nf-namespace> -f <nf-short-name>-values.yaml <nf-deployment-name> <chart-name-or-chart-location>
Note:
The nf-deployment-name value must match the value used when installing the NF.
- If no new or changed values are needed, run the following command to upgrade the NF by reusing the existing values:
$ helm upgrade --namespace <nf-namespace> --reuse-values <nf-deployment-name> <chart-name-or-chart-location>
Note:
The nf-deployment-name value must match the value used when installing the NF.
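The following is a minimal sketch of an upgrade values file, again assuming a hypothetical NF with the short name mynf; the parameter names are assumptions for illustration only. Per the step above, the file contains only parameters that are new in this release or whose names changed from the previous release.
# mynf-values.yaml for the upgrade -- illustrative sketch only; the real keys
# must come from the NF installation instructions for the new release.
logging:
  level: DEBUG        # hypothetical parameter whose value changes in this upgrade
newFeature:
  enabled: true       # hypothetical parameter introduced in the new release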
7.8.3 Uninstalling an NF
This section describes the procedure to uninstall a 5G network function that was previously installed in the CNE Kubernetes cluster.
Prerequisites
- Determine the NF deployment parameters. The following values determine the NF's identity and where it is deployed:
Table 7-16 NF Deployment Parameters
Variable | Value | Description |
---|---|---|
nf-namespace | Any valid namespace name | The namespace where the NF is installed. Typically, each NF is installed in its own namespace. |
nf-deployment-name | Any valid Kubernetes deployment name | The name by which Kubernetes identifies this NF instance. |
- All commands in this procedure must be run from the Bastion Host.
Procedure
- Run the following command to uninstall an
NF:
$ helm uninstall <nf-deployment-name> --namespace <nf-namespace>
- If there are remaining NF resources, such as PVCs and the namespace, run the following commands to remove them:
- Run the following command to remove residual
PVCs:
$ kubectl --namespace <nf-namespace> get pvc | awk '{print $1}'| xargs -L1 -r kubectl --namespace <nf-namespace> delete pvc
- Run the following command to delete
namespace:
$ kubectl delete namespace <nf-namespace>
Note:
Steps a and b are used to remove all the PVCs from the <nf-namespace> and delete the <nf-namespace>, respectively. If there are other components running in the <nf-namespace>, manually delete the PVCs that need to be removed and skip thekubectl delete namespace <nf-namespace>
command.
7.8.4 Update Alerting Rules for an NF
This section describes the procedure to add or update the alerting rules for any Cloud Native Core 5G NF in Prometheus Operator and OSO.
Prerequisites
- For CNE Prometheus Operator, a YAML file containing a PrometheusRule CRD defining the NF-specific alerting rules is available. The YAML file must be an ordinary text file in a valid YAML format with the extension .yaml. A minimal sketch of such a file is provided after these prerequisites.
- For OSO Prometheus, a valid OSO release must be installed and an alert file describing all NF alert rules according to the old format is required.
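The following is a minimal sketch of a PrometheusRule file for a hypothetical NF named mynf. The metric name, threshold, duration, and severity label are assumptions for illustration only, and any labels required by the Prometheus Operator rule selector in your deployment must also be added.
# rules_file.yaml -- illustrative sketch of a PrometheusRule CRD for a hypothetical NF.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mynf-alerting-rules
  namespace: occne-infra
spec:
  groups:
    - name: mynf-alerts
      rules:
        - alert: MynfHighErrorRate
          expr: rate(mynf_http_errors_total[5m]) > 0.1   # hypothetical metric and threshold
          for: 5m
          labels:
            severity: major
          annotations:
            summary: "mynf HTTP error rate is above 10% for 5 minutes"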
Procedure for Prometheus Operator
- To copy the NF-specific alerting rules file from your computer to the /tmp directory on the Bastion Host, see the Accessing the Bastion Host procedure.
- Run the following command to create or update the PrometheusRule CRD containing the alerting rules for the NF:
$ kubectl apply -f /tmp/rules_file.yaml -n occne-infra

# To verify the creation of the alert-rules CRD, run the following command:
$ kubectl get prometheusrule -n occne-infra
NAME                            AGE
occne-alerting-rules            43d
occne-dbtier-alerting-rules     43d
test-alerting-rules             5m
The alerting rules are automatically loaded into all running Prometheus instances within 1 minute.
- In the Prometheus GUI, select the Alerts tab. Select individual rules from the list to view the alert details and verify that the new rules are loaded.
Figure 7-1 New Alert Rules are loaded in Prometheus GUI
Procedure for OSO
Perform the following steps to add alert rules in the OSO Prometheus GUI:
- Take a backup of the current configuration map of OSO Prometheus:
$ kubectl get configmaps <OSO-prometheus-configmap-name> -o yaml -n <namespace> > /tmp/tempPrometheusConfig.yaml
- Check and add the NF alert file name inside the Prometheus configuration map. The alert file name varies from NF to NF, so retrieve the name of the NF alert rules file before adding it to the configuration map. Once you have the file name, run the following commands to add it to the Prometheus configuration map (the expected result is sketched after these steps):
$ sed -i '/etc\/config\/<nf-alertsname>/d' /tmp/tempPrometheusConfig.yaml
$ sed -i '/rule_files:/a\ \- /etc/config/<nf-alertsname>' /tmp/tempPrometheusConfig.yaml
- Update the configuration map with the updated file:
$ kubectl -n <namespace> replace configmap <OSO-prometheus-configmap-name> -f /tmp/tempPrometheusConfig.yaml
- Patch the NF alert rules into the OSO Prometheus configuration map by specifying the alert rule file path:
$ kubectl patch configmap <OSO-prometheus-configmap-name> -n <namespace> --type merge --patch "$(cat ./NF_alertrules.yaml)"
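After step 2, the rule_files section of /tmp/tempPrometheusConfig.yaml is expected to contain an entry for the NF alert file. The following is a minimal sketch of that fragment, assuming a hypothetical NF alert file name mynf_alertrules.yaml; the exact indentation and any pre-existing entries depend on the OSO Prometheus configuration map.
# Fragment of /tmp/tempPrometheusConfig.yaml after the sed commands (sketch only).
rule_files:
  - /etc/config/mynf_alertrules.yaml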
7.8.5 Configuring Egress NAT for an NF
This section provides information about configuring NF microservices that originate egress requests to ensure compatibility with CNE.
Annotation for Specifying Egress Network
Starting with CNE 22.4.x, egress requests no longer have the IP address of the Kubernetes worker node assigned to the source IP field. Instead, each microservice that originates egress requests specifies an egress network through an annotation, as shown in the following snippet and in the Deployment sketch after the note. An IP address from the indicated network is inserted into the source IP field for all egress requests.
annotations:
  oracle.com.cnc/egress-network: "oam"
Note:
- The value of the annotation must match the name of a configured external network.
- This annotation must not be added for microservices that do not originate egress requests, as it leads to decreased CNE performance.
- CNE does not allow any microservice to pick a separate IP address. When CNE is installed, a single IP address is selected for each network.
- All pods in a microservice get the same source IP address attached to all egress requests.
- CNE 22.4.x supports this annotation in vCNE deployments only.
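The following is a minimal sketch showing where the annotation is typically placed, assuming a hypothetical microservice packaged as a Deployment named mynf-egress; the Deployment name, labels, and image are assumptions for illustration only. The annotation is set on the pod template so that every pod of the microservice carries it.
# Sketch only: a hypothetical Deployment carrying the egress-network annotation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mynf-egress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mynf-egress
  template:
    metadata:
      labels:
        app: mynf-egress
      annotations:
        oracle.com.cnc/egress-network: "oam"
    spec:
      containers:
        - name: mynf-egress
          image: mynf-egress:1.0.0   # hypothetical image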
Configuring Egress Controller Environment
Note:
Do not edit any variables that are not listed in the following table.
Table 7-17 Egress Controller Environment Configuration
Environment Variable | Default Value | Possible Value | Description |
---|---|---|---|
DAEMON_MON_TIME | 0.5 | Between 0.1 and 5 | The interval, in seconds, at which the Egress controller checks the cluster status. |
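As a sketch only, assuming DAEMON_MON_TIME is exposed as an ordinary container environment variable on the Egress controller (the surrounding manifest structure here is an assumption, not a CNE-defined name), the setting would look like the following fragment.
# Sketch only: DAEMON_MON_TIME as a container environment variable.
env:
  - name: DAEMON_MON_TIME
    value: "0.5"    # seconds between cluster status checks, per the table above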
Configuring Egress NAT for Destination Subnet or IP Address
Destination subnets or IP addresses must be specified to route traffic through a particular network. The destination subnet or IP address is specified in the form of a dictionary, where the pools are the dictionary keys and the lists of subnets or IP addresses are the dictionary values.
- Specifying annotation for destination subnet:
annotations:
  oracle.com.cnc/egress-destination: '{"<pool>" : ["<subnet_ip_address>/<subnet_mask>"]}'
For example:
annotations:
  oracle.com.cnc/egress-destination: '{"oam" : ["10.20.30.0/24"]}'
- Specifying annotation for destination IP address:
annotations:
  oracle.com.cnc/egress-destination: '{"<pool>" : ["<ip_address>"]}'
For example:
annotations:
  oracle.com.cnc/egress-destination: '{"oam" : ["10.20.30.40"]}'
- Specifying annotation for multiple pools:
annotations:
  oracle.com.cnc/egress-destination: '{"<pool_one>" : ["<subnet_ip_address>/<subnet_mask>"], "<pool_two>" : ["<subnet_ip_address>/<subnet_mask>"]}'
For example:
annotations:
  oracle.com.cnc/egress-destination: '{"oam" : ["10.20.30.0/24"], "sig" : ["30.20.10.0/24"]}'
- Specifying annotation for multiple pools and multiple destinations:
annotations:
  oracle.com.cnc/egress-destination: '{"<pool_one>" : ["<subnet_ip_address>/<subnet_mask>", "<subnet_ip_address>/<subnet_mask>"], "<pool_two>" : ["<subnet_ip_address>/<subnet_mask>", "<ip_address>"]}'
For example:
annotations:
  oracle.com.cnc/egress-destination: '{"oam" : ["10.20.30.0/24", "100.200.30.0/22"], "sig" : ["30.20.10.0/24", "20.10.5.1"]}'
Compatibility Between Egress NAT and Destination Egress NAT
Both Egress NAT and Destination Egress NAT annotations are independent and compatible, which means that they can be used separately or combined to create more specific rules. Egress NAT routes all traffic from a particular pod through a particular network, whereas Destination Egress NAT routes traffic matching a destination subnet or IP address before the regular Egress NAT rules are matched in the routing table. This allows traffic to be routed through a particular network with more granularity.
In the following example, all traffic from the pod is routed through the sig network, except the traffic destined for the 10.20.30.0/24 subnet, which is routed through the oam network:
annotations:
  oracle.com.cnc/egress-destination: '{"oam" : ["10.20.30.0/24"]}'
  oracle.com.cnc/egress-network: sig