7 Maintenance Procedures

This chapter provides detailed instructions about how to maintain the CNE platform.

7.1 Premaintenance Check for VMware Deployments

This section provides details about the checks that must be run on VMware deployments before performing any maintenance procedures.

  1. Verify the content of the compute/main.tf and compute-lbvm/main.tf files:
    1. Run the following command to verify the content of the compute/main.tf file:
      $ cat /var/occne/cluster/${OCCNE_CLUSTER}/modules/compute/main.tf | grep 'ignore_changes\|override_template_disk' -C 2
      Ensure that the content of the file exactly matches the following content:
       }
      
        override_template_disk {
          bus_type    = "paravirtual"
          size_in_mb  = var.disk
      --
      
        lifecycle {
          ignore_changes = [
            vapp_template_id,
            template_name,
            catalog_name,
            override_template_disk
           ]
        }
      --
        }
      
        override_template_disk {
          bus_type    = "paravirtual"
          size_in_mb  = var.disk
      --
      
        lifecycle {
          ignore_changes = [
            vapp_template_id,
            template_name,
            catalog_name,
            override_template_disk
           ]
        }
    2. Run the following command to verify the content of the compute-lbvm/main.tf file:
      $ cat /var/occne/cluster/${OCCNE_CLUSTER}/modules/compute-lbvm/main.tf | grep 'ignore_changes\|override_template_disk' -C 2 
      Ensure that the content of the file exactly matches the following content:
      }
      
        override_template_disk {
          bus_type    = "paravirtual"
          size_in_mb  = var.disk
      --
      
        lifecycle {
          ignore_changes = [
            vapp_template_id,
            template_name,
            catalog_name,
            override_template_disk
           ]
        }
      --
        }
      
        override_template_disk {
          bus_type    = "paravirtual"
          size_in_mb  = var.disk
      --
      
        lifecycle {
          ignore_changes = [
            vapp_template_id,
            template_name,
            catalog_name,
            override_template_disk
           ]
        }
  2. If the files don't contain the ignore_changes argument, then edit the files and add the argument to each of the "vcd_vapp_vm" resources:
    1. Run the following command to edit the compute/main.tf file:
      $ vi /var/occne/cluster/${OCCNE_CLUSTER}/modules/compute/main.tf
    2. Add the following content between each override_template_disk code block and metadata = var.metadata line for each "vcd_vapp_vm" resource:
      lifecycle {
          ignore_changes = [
            vapp_template_id,
            template_name,
            catalog_name,
            override_template_disk
           ]
        }
    3. Save the compute/main.tf file.
    4. Run the following command to edit the compute-lbvm/main.tf file:
      $ vi /var/occne/cluster/${OCCNE_CLUSTER}/modules/compute-lbvm/main.tf
    5. Add the following content between each override_template_disk code block and metadata = var.metadata line for each "vcd_vapp_vm" resource:
       lifecycle {
          ignore_changes = [
            vapp_template_id,
            template_name,
            catalog_name,
            override_template_disk
           ]
        }
    6. Save the compute-lbvm/main.tf file.
    7. Repeat step 1 to ensure that the content of the files matches the ones provided in the step.
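    If it is hard to locate the correct insertion points, you can list the line numbers of the override_template_disk blocks and the metadata = var.metadata lines in each file, for example (shown here for compute/main.tf; adjust the path for compute-lbvm/main.tf):
      $ grep -n 'override_template_disk\|metadata = var.metadata' /var/occne/cluster/${OCCNE_CLUSTER}/modules/compute/main.tf
    The lifecycle block is then inserted between each matching pair of lines.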

7.2 Accessing the CNE

This section describes the procedures to access a CNE for maintenance purposes.

7.2.1 Accessing the Bastion Host

This section provides information about how to access a CNE Bastion Host.

Prerequisites

  • SSH private key must be available on the server or VM that is used to access the Bastion Host.
  • The SSH private keys generated or provided during the installation must match the authorized key (public) present in the Bastion Hosts. For more information about the keys, see the installation prerequisites in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.

Procedure

All commands must be run from a server or VM that has network access to the CNE Bastion Hosts. To access the Bastion Host, perform the following tasks.

7.2.1.1 Logging in to the Bastion Host

This section describes the procedure to log in to the Bastion Host.

  1. Determine the Bastion Host IP address.

    Contact your system administrator to obtain the IP addresses of the CNE Bastion Hosts. The system administrator can obtain the IP addresses from the OpenStack Dashboard, VMware Cloud Director, or by other means such as from the BareMetal Hosts.

  2. To log in to the Bastion Host, run the following command:

    Note:

    The default value for <user_name> is cloud-user (for vCNE) or admusr (for Baremetal).
    $ ssh -i /<ssh_key_dir>/<ssh_key_name>.key <user_name>@<bastion_host_ip_address>
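    For example, with a hypothetical key path and Bastion Host IP address on a vCNE deployment, the command looks similar to the following:
    $ ssh -i /home/user/.ssh/occne_bastion.key cloud-user@192.168.200.10
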
7.2.1.2 Copying Files to the Bastion Host

This section describes the procedure to copy the files to the Bastion Host.

  1. Determine the Bastion Host IP address.

    Contact your system administrator to obtain the IP addresses of the CNE Bastion Hosts. The system administrator can obtain the IP addresses from the OpenStack Dashboard, VMware Cloud Director, or by other means such as from the BareMetal Hosts.

  2. To copy files to the Bastion Host, run the following command:
    $ scp -i /<ssh_key_dir>/<ssh_key_name>.key <source_file> <user_name>@<bastion_host_ip_address>:/<path>/<dest_file>
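    For example, using the same hypothetical key and Bastion Host IP address to copy a local file named my-values.yaml to the cloud-user home directory:
    $ scp -i /home/user/.ssh/occne_bastion.key my-values.yaml cloud-user@192.168.200.10:/home/cloud-user/my-values.yaml
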
7.2.1.3 Managing Bastion Host

The Bastion Host comes with the following built-in scripts to manage the Bastion Hosts:

  • is_active_bastion
  • get_active_bastion
  • get_other_bastions
  • update_active_bastion.sh

These scripts are used to get details about Bastion Hosts, such as checking if the current Bastion Host is the active one and getting the list of other Bastions. This section provides the procedures to manage Bastion Hosts using these scripts.

These scripts are located in the /var/occne/cluster/$OCCNE_CLUSTER/artifacts/ directory. You don't have to change to this directory to run the scripts; you can run them from anywhere on a Bastion Host like a system command, as the directory containing the scripts is part of $PATH.

All the scripts interact with the bastion-controller pod and its database directly by querying or updating it through kubectl. The commands fail to run in the following conditions:
  • If the bastion-controller pod is not running.
  • If the kubectl admin configuration is not set properly.
For more information about the possible errors and troubleshooting, see Troubleshooting Bastion Host.
7.2.1.3.1 Verifying if the Current Bastion Host is the Active One

This section describes the procedure to verify if the current Bastion Host is the active one using the is_active_bastion script.

  1. Run the following command to check if the current Bastion Host is the active Bastion Host:
    $ is_active_bastion
    On running the command, the system runs the is_active_bastion script to compare the current Bastion IP against the active Bastion IP retrieved from the database of the bastion-controller pod. If the IP stored in the bastion-controller database is equal to the IP of the Bastion where the script is run from, then the system displays the following response:
    IS active-bastion
    If the IP addresses don't match, the system displays the following response, indicating that the current Bastion Host is not the active one:
    NOT active-bastion
7.2.1.3.2 Getting the Host IP or Hostname of the Current Bastion Host

This section provides details about getting the Host IP or Hostname of the current Bastion Host using the get_active_bastion script.

  • Run the following command to get the Host IP of the current Bastion Host:
    $ get_active_bastion

    On running the command, the system runs the get_active_bastion script to get the IP address of the Bastion from the bastion-controller DB:

    Sample output:
    192.168.200.10
  • Run the following command to get the Hostname of the current Bastion Host:
    $ DBFIELD=name get_active_bastion

    The DBFIELD=name parameter in the command is the additional parameter that is passed to get the Hostname from the bastion-controller DB.

    Sample output:
    occne1-rainbow-bastion-1
7.2.1.3.3 Getting the List of Other Bastion Hosts

This section provides details about getting the list of other Bastion Hosts in the cluster using the get_other_bastions script.

  • Run the following command to get the Hostnames of other Bastion Hosts in the cluster:
    $ get_other_bastions
    Sample output:
    occne1-rainbow-bastion-2
  • Run the following command to get the Host IPs of the other Bastion Hosts in the cluster:
    $ DBFIELD=ipaddr get_other_bastions

    The DBFIELD=ipaddr parameter in the command is the additional parameter to get the Host IPs from the bastion-controller DB.

    Sample output:
    192.168.200.11
  • You can provide additional parameters to filter the list of other Bastion Hosts in the cluster.

    For example, you can use the CRITERIA="state == 'HEALTHY'" or CRITERIA="state != 'FAILED'" filter criteria to fetch the list of other Bastion Hosts in the cluster that are healthy:

    $ CRITERIA="state == 'HEALTHY'" get_other_bastions
    Sample output:
    occne1-rainbow-bastion-2
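    The DBFIELD and CRITERIA parameters are shell environment variables, so they can presumably be combined in a single call (an assumption, not verified for every script version). For example, the following sketch would list the Host IPs of only the healthy Bastion Hosts:
    $ DBFIELD=ipaddr CRITERIA="state == 'HEALTHY'" get_other_bastions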
7.2.1.3.4 Changing the Bastion Host to Active

This section provides the steps to make a standby Bastion Host the active Bastion Host using the update_active_bastion.sh script.

  • Run the following command to make the current Bastion Host the active Bastion Host:
    $ update_active_bastion.sh

    On running the command, the system runs the update_active_bastion.sh script, which compares the current Bastion IP with the active Bastion IP retrieved from the bastion-controller pod DB. If the IP stored in the bastion-controller DB is equal to the IP of the Bastion where the script is run from, then the system takes no action as the current Bastion is already the active Bastion. Otherwise, the system updates the DB with the current IP of the Bastion Host, making it the new active Bastion.

    Sample output showing the response when the current Bastion is already the active Bastion:
    2023-11-24:17-17-34: Setting 192.168.200.10 as bastion
    Bastion already set to 192.168.200.10
    Sample output showing the response when the current Bastion is updated as the active Bastion:
    2023-11-24:17-29-09: Setting 192.168.200.11 as bastion
7.2.1.4 Troubleshooting Bastion Host

This section describes the issues that you may encounter while using Bastion Host and their troubleshooting guidelines.

Permission Denied Error While Running Kubernetes Command

Problem:

Users may encounter "Permission Denied" error while running Kubernetes commands if there is no proper access.

Error Message:
error: error loading config file "/var/occne/cluster/occne1-rainbow/artifacts/admin.conf": open /var/occne/cluster/occne1-rainbow/artifacts/admin.conf: permission denied
Resolution:

Verify permission access to admin.conf. The user running the command must be able to run basic kubectl commands to use the Bastion scripts.
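
For example, the following checks confirm that the current user can read the admin configuration and run a basic kubectl command:
$ ls -l /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/admin.conf
$ kubectl get nodes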

Commands Take Too Long to Respond and Fail to Return Output

Problem:

A command may take too long to display any output. For example, running the is_active_bastion command may take too long to respond, leading to a timeout error.

Error Message:
error: timed out waiting for the condition
Resolution:
  • Verify the status of the bastion-controller. This error can occur if the pods are not running or are in a crash state due to various reasons, such as a lack of resources in the cluster.
  • Print the bastion-controller logs to check the issue. For example, print the logs and check if a crash loop error is caused by a lack of resources.
    $ kubectl logs -n ${OCCNE_NAMESPACE} deploy/occne-bastion-controller
    Sample output:
    Error from server (BadRequest): container "bastion-controller" in pod "occne-bastion-controller-797db5f845-hqlm6" is waiting to start: ContainerCreating
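  • To investigate further, describe the deployment and review recent events in the namespace, for example:
    $ kubectl describe -n ${OCCNE_NAMESPACE} deploy/occne-bastion-controller
    $ kubectl get events -n ${OCCNE_NAMESPACE} --sort-by='.lastTimestamp' | tail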

Command Not Found Error

Problem:

User may encounter command not found error while running a script.

Error Message:
-bash: is_active_bastion: command not found
Resolution:
Run the following command and verify that the $PATH variable is set properly and contains the artifacts directory.

Note:

By default, CNE sets up the path automatically during the installation.
$ echo $PATH
Sample output showing the correct $PATH:
/home/cloud-user/.local/bin:/home/cloud-user/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/var/occne/cluster/occne1-rainbow/artifacts/istio-1.18.2/bin/:/var/occne/cluster/occne1-rainbow/artifacts
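If the artifacts directory is missing from the output, you can append it for the current session as a temporary workaround (a sketch assuming the standard cluster directory layout; a new login shell restores the path set during installation):
$ export PATH=$PATH:/var/occne/cluster/${OCCNE_CLUSTER}/artifacts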

7.3 General Configuration

This section describes the general configuration tasks for CNE.

7.3.1 Configuring SNMP Trap Destinations

This section describes the procedure to set up SNMP notifiers within CNE, such that the AlertManager can send alerts as SNMP traps to one or more SNMP receivers.

  1. Perform the following steps to verify the cluster condition before setting up multiple trap receivers:
    1. Run the following command and verify that the alertmanager and snmp-notifier services are running:
      $ kubectl get services --all-namespaces | grep -E 'snmp-notifier|alertmanager'
      Sample output:
      NAMESPACE     NAME                                            TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                                  AGE
      occne-infra    occne-kube-prom-stack-kube-alertmanager          LoadBalancer   10.233.16.156   10.75.151.178   80:31100/TCP                        11m
      occne-infra    occne-alertmanager-snmp-notifier                 ClusterIP      10.233.41.30    <none>          9464/TCP                            11m
    2. Run the following command and verify that the alertmanager and snmp-notifier pods are running:
      $ kubectl get pods --all-namespaces | grep -E 'snmp-notifier|alertmanager'
      Sample output:
      occne-infra     alertmanager-occne-kube-prom-stack-kube-alertmanager-0           2/2     Running   0          18m
      occne-infra     alertmanager-occne-kube-prom-stack-kube-alertmanager-1           2/2     Running   0          18m
      occne-infra     occne-alertmanager-snmp-notifier-744b755f96-m8vbx                             1/1     Running   0          18m
  2. Perform the following steps to edit the default snmp-destination and add a new snmp-destination:
    1. Run the following command from Bastion Host to get the current snmp-notifier resources:
      $ kubectl get all -n occne-infra | grep snmp
      Sample output:
      pod/occne-alertmanager-snmp-notifier-75656cf4b7-gw55w 1/1 Running 0 37m
      service/occne-alertmanager-snmp-notifier ClusterIP 10.233.29.86 <none> 9464/TCP 10h
      deployment.apps/occne-alertmanager-snmp-notifier 1/1 1 1 10h
      replicaset.apps/occne-alertmanager-snmp-notifier-75656cf4b7 1 1 1 37m
    2. The snmp-destination is the interface IP address of the trap receiver to get the traps. Edit the deployment to modify snmp-destination and add a new snmp-destination when needed:
      1. Run the following command to edit the deployment:
        $ kubectl edit -n occne-infra deployment occne-alertmanager-snmp-notifier
      2. From the vi editor, move down to the snmp-destination section. The default configuration is as follows:
        - --snmp.destination=127.0.0.1:162
      3. Add a new destination to receive the traps.
        For example:
         - --snmp.destination=192.168.200.236:162
      4. If you want to add multiple trap receivers, add them in multiple new lines.
        For example:
         - --snmp.destination=192.168.200.236:162
         - --snmp.destination=10.75.135.11:162
         - --snmp.destination=10.33.64.50:162
      5. After editing, use the :x or :wq command to save and exit.
        Sample output:
        deployment.apps/occne-alertmanager-snmp-notifier edited
    3. Perform the following steps to verify the new replicaset and delete the old replicaset:
      1. Run the following command to get the resource and check the restart time to verify that the pod and replicaset are regenerated:
        $  kubectl get all -n occne-infra | grep snmp
        Sample output:
        pod/occne-alertmanager-snmp-notifier-88976f7cc-xs8mv            1/1     Running   0             90s
        service/occne-alertmanager-snmp-notifier                 ClusterIP      10.233.29.86    <none>          9464/TCP                                          10h
        deployment.apps/occne-alertmanager-snmp-notifier           1/1     1            1           10h
        replicaset.apps/occne-alertmanager-snmp-notifier-75656cf4b7           0         0         0       65m
        replicaset.apps/occne-alertmanager-snmp-notifier-88976f7cc            1         1         1       90s
      2. Identify the old replicaset from the previous step and delete it.
        For example, the restart time of the replicaset.apps/occne-alertmanager-snmp-notifier-75656cf4b7 in the previous step output is 65m. This indicates that it is the old replica set. Use the following command to delete the old replicaset:
        $ kubectl delete -n occne-infra replicaset.apps/occne-alertmanager-snmp-notifier-75656cf4b7
        
    4. To test whether the new trap receiver receives the SNMP traps, port 162 of the receiving server must be open and an application must be listening to capture the traps. This step may vary depending on the type of server. The following code block provides an example for a Linux server:
      $ sudo iptables -A INPUT -p udp -m udp --dport 162 -j ACCEPT
      $ sudo dnf install -y tcpdump
      $ sudo tcpdump -n -i <interface of the ip address set in snmp-destination> port 162
7.3.1.1 Verifying Cluster Condition

This section describes how to verify the cluster condition before setting up multiple trap receiver procedure.

  1. To verify that the services are running, run the following command:
    $ kubectl get services --all-namespaces
    NAMESPACE     NAME                                            TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)              AGE
    ...
    occne-infra    occne-kube-prom-stack-kube-alertmanager          LoadBalancer   10.233.16.156   10.75.151.178   80:31100/TCP     11m
    occne-infra    occne-kube-prom-stack-kube-operator              ClusterIP      10.233.59.93    <none>          443/TCP          11m
    occne-infra    occne-kube-prom-stack-kube-prometheus            LoadBalancer   10.233.10.160   10.75.151.161   80:30951/TCP     11m
    occne-infra    occne-kube-prom-stack-kube-state-metrics         ClusterIP      10.233.43.48    <none>          8080/TCP         11m
    occne-infra    occne-kube-prom-stack-prometheus-node-exporter   ClusterIP      10.233.39.145   <none>          9100/TCP         11m
    occne-infra    occne-metrics-server                             ClusterIP      10.233.51.228   <none>          443/TCP          11m
    occne-infra    occne-prometheus-pushgateway                     ClusterIP      10.233.9.91     <none>          9091/TCP         11m
    occne-infra    occne-promxy                                     ClusterIP      10.233.57.218   <none>          80/TCP           11m
    occne-infra    occne-snmp-notifier                              ClusterIP      10.233.41.30    <none>          9464/TCP         11m
    ...
  2. To verify that the alertmanager and the snmp-notifier pods are working, run the following command:
    $ kubectl get pods --all-namespaces
    Result:
    ...
    occne-infra     alertmanager-occne-kube-prom-stack-kube-alertmanager-0   2/2   Running   0  18m
    occne-infra     alertmanager-occne-kube-prom-stack-kube-alertmanager-1   2/2   Running   0  18m
    ...
    occne-infra     occne-snmp-notifier-744b755f96-m8vbx                     1/1   Running   0  18m
    ...
7.3.1.2 Configuring the New SNMP Notifier

This section provides information about creating configuration files to run a second SNMP notifier. It provides example values to configure the second SNMP notifier so that Kubernetes can differentiate it from the default SNMP notifier configuration.

If you need to configure more than two SNMP notifiers, you must rerun this task to add additional notifiers.

  1. Create a Service specification YAML file for the new SNMP notifier.
    From the Bastion Host, run the following command to copy the Service specification for the existing SNMP notifier into a YAML file:
    $ kubectl get service occne-snmp-notifier -n occne-infra -o yaml > svc-occne-snmp-notifier-1.yaml
    Sample Service descriptor YAML file:
    apiVersion: v1
    kind: Service
    metadata:
      annotations:
        meta.helm.sh/release-name: occne-snmp-notifier
        meta.helm.sh/release-namespace: occne-infra
      creationTimestamp: "2022-01-24T22:46:47Z"
      labels:
        app: snmp-notifier
        app.kubernetes.io/managed-by: Helm
        chart: snmp-notifier-0.2.0
        heritage: Helm
        release: occne-snmp-notifier
      name: occne-snmp-notifier
      namespace: occne-infra
      resourceVersion: "5817"
      uid: c10cb879-cd34-4715-93dd-7d0c7f6865bc
    spec:
      clusterIP: 10.233.7.201
      clusterIPs:
      - 10.233.7.201
      ipFamilies:
      - IPv4
      ipFamilyPolicy: SingleStack
      ports:
      - name: http
        port: 9464
        protocol: TCP
        targetPort: 9464
      selector:
        app: snmp-notifier
        release: occne-snmp-notifier
      sessionAffinity: None
      type: ClusterIP
    status:
      loadBalancer: {}
    
  2. Modify the Service specification for the new SNMP notifier.

    On the Bastion Host, use vi or another editor to modify the YAML file to create a Service specification for the new SNMP notifier as follows:

    1. Change the following fields:
      1. Change the value of name from "occne-snmp-notifier" to "occne-snmp-notifier-1"
      2. Change the value of port from "9464" to "9465"
      3. Change the value of targetPort from "9464" to "9465"
      For example:
      name: "occne-snmp-notifier-1"
      port: "9465"
      targetPort: "9465"
    2. Remove the following lines:
      clusterIP: xx.xxx.x.xxx
      clusterIPs:
      - xx.xxx.x.xxx
  3. Create a Deployment specification YAML file for the new SNMP notifier.
    From the Bastion Host, run the following command to copy the Deployment specification for the existing SNMP notifier into a YAML file:
    $ kubectl get deployment occne-snmp-notifier -n occne-infra -o yaml > dp-occne-snmp-notifier-1.yaml
    Sample Deployment specification:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
        deployment.kubernetes.io/revision: "1"
        meta.helm.sh/release-name: occne-snmp-notifier
        meta.helm.sh/release-namespace: occne-infra
      creationTimestamp: "2022-01-24T22:46:47Z"
      generation: 1
      labels:
        app: snmp-notifier
        app.kubernetes.io/managed-by: Helm
        chart: snmp-notifier-0.2.0
        heritage: Helm
        release: occne-snmp-notifier
      name: occne-snmp-notifier
      namespace: occne-infra
      resourceVersion: "5923"
      uid: fe8eb69c-b638-4a4c-b7e6-ac4e387967ff
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: snmp-notifier
          release: occne-snmp-notifier
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: snmp-notifier
            release: occne-snmp-notifier
        spec:
          containers:
          - args:
            - --alert.default-severity=critical
            - --alert.severities=critical,major,minor,warning,info,clear
            - --alert.severity-label=severity
            - --log.format=logger:stderr
            - --log.level=error
            - --snmp.destination=127.0.0.1:162
            - --snmp.retries=1
            - --snmp.trap-default-oid=1.3.6.1.4.1.1664.1
            - --snmp.trap-description-template=/etc/snmp_notifier/description-template.tpl
            - --snmp.trap-oid-label=oid
            - --web.listen-address=:9464
            env:
            - name: snmpCommunityEnvironmentVariable
              value: public
            image: occne-repo-host:5000/docker.io/maxwo/snmp-notifier:v0.3.0
            imagePullPolicy: IfNotPresent
            name: snmp-notifier
            ports:
            - containerPort: 9464
              name: http
              protocol: TCP
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    status:
      availableReplicas: 1
      conditions:
      - lastTransitionTime: "2022-01-24T22:46:49Z"
        lastUpdateTime: "2022-01-24T22:46:49Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"
        type: Available
      - lastTransitionTime: "2022-01-24T22:46:47Z"
        lastUpdateTime: "2022-01-24T22:46:49Z"
        message: ReplicaSet "occne-snmp-notifier-744b755f96" has successfully progressed.
        reason: NewReplicaSetAvailable
        status: "True"
        type: Progressing
      observedGeneration: 1
      readyReplicas: 1
      replicas: 1
      updatedReplicas: 1
  4. Modify the Deployment specification for the new SNMP notifier.

    On the Bastion Host, use vi or any other editor to modify the generated YAML file to create a deployment for the new SNMP notifier.

    1. Change the following fields:
      1. Change the value of name from "occne-snmp-notifier" to "occne-snmp-notifier-1"
      2. Change the value of containerPort from "9464" to "9465"
      For example:
      name: "occne-snmp-notifier-1"
        containerPort: "9465"
    2. Change the following values:
      1. Set the value of - --snmp.destination= to <trap receiver ip address>:162, where <trap receiver ip address> is the IP address of the new SNMP trap receiver.
      2. Set the value of - --web.listen-address= from ":9464" to ":9465".
      For example:
      - --snmp.destination=<trap receiver ip address>:162
      - --web.listen-address=:9465
  5. Save the Service specification and Deployment specification YAML files for the new SNMP Notifier.
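    Optionally, before applying the files in the next section, you can validate them with a client-side dry run, which reports YAML or schema mistakes without changing the cluster:
    $ kubectl apply --dry-run=client -f svc-occne-snmp-notifier-1.yaml
    $ kubectl apply --dry-run=client -f dp-occne-snmp-notifier-1.yaml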
7.3.1.3 Running the New SNMP Notifier

This section describes how to run the new SNMP notifier.

To run the new SNMP notifier, perform the following steps:
  1. To create the service and deployment for the new SNMP notifier, run the following commands from the Bastion Host:
    $ kubectl apply -f svc-occne-snmp-notifier-1.yaml
    Sample output:
    service/occne-snmp-notifier-1 created
    $ kubectl apply -f dp-occne-snmp-notifier-1.yaml
    Sample output:
    deployment.apps/occne-snmp-notifier-1 created
  2. To verify if the new SNMP notifier is running, run the following commands from the Bastion Host:
    $ kubectl get services --all-namespaces
    Sample output:
    occne-infra    occne-snmp-notifier          ClusterIP     10.233.41.30    <none>   9464/TCP    39m
    occne-infra    occne-snmp-notifier-1        ClusterIP     10.233.21.200   <none>   9465/TCP    29s
    $ kubectl get deployment --all-namespaces
    Sample output:
    occne-infra    occne-snmp-notifier           1/1     1       1     47m
    occne-infra    occne-snmp-notifier-1         1/1     1       1     8m48s
    $ kubectl get pod --all-namespaces
    Sample output:
    occne-infra     occne-snmp-notifier-1-658d646d79-67jwn       1/1     Running   0      100s
    occne-infra     occne-snmp-notifier-744b755f96-m8vbx         1/1     Running   0      48m
7.3.1.4 Configuring the New SNMP Notifier in Alert Manager

This section describes how to configure the new SNMP notifier in Alert Manager.

To configure the new SNMP notifier in AlertManager so that traps can be delivered to it, perform the following steps:
  1. To get the AlertManager Secret configuration, run the following command from the Bastion Host:
    $ kubectl get secret alertmanager-occne-kube-prom-stack-kube-alertmanager -o yaml -n occne-infra
    Example output file when alertmanager.yaml is encoded:
    apiVersion: v1
    data:
      alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCnJvdXRlOgogIGdyb3VwX2J5OgogIC0gam9iCiAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgZ3JvdXBfd2FpdDogMzBzCiAgcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbA==
    kind: Secret
    metadata:
      annotations:
        meta.helm.sh/release-name: occne-kube-prom-stack
        meta.helm.sh/release-namespace: occne-infra
      creationTimestamp: "2022-01-24T22:46:34Z"
      labels:
        app: kube-prometheus-stack-alertmanager
        app.kubernetes.io/instance: occne-kube-prom-stack
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/part-of: kube-prometheus-stack
        app.kubernetes.io/version: 18.0.1
        chart: kube-prometheus-stack-18.0.1
        heritage: Helm
        release: occne-kube-prom-stack
      name: alertmanager-occne-kube-prom-stack-kube-alertmanager
      namespace: occne-infra
      resourceVersion: "5175"
      uid: a38eb420-a4d0-4020-a375-ab87421defde
    type: Opaque
  2. Extract the AlertManager configuration. In the example output above, the alertmanager.yaml key (third line) contains the AlertManager configuration in base64-encoded format. Run the following command to decode the configuration:
    echo 'Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCnJvdXRlOgogIGdyb3VwX2J5OgogIC0gam9iCiAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgZ3JvdXBfd2FpdDogMzBzCiAgcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbA=='  | base64 --decode
    Example output file when alertmanager.yaml is decoded:
    global:
      resolve_timeout: 5m
    receivers:
    - name: default-receiver
      webhook_configs:
      - url: http://occne-snmp-notifier:9464/alerts
    route:
      group_by:
      - job
      group_interval: 5m
      group_wait: 30s
      receiver: default-receiver
      repeat_interval: 12h
      routes:
      - match:
          alertname: Watchdog
        receiver: default-receiver
    templates:
    - /etc/alertmanager/config/*.tmpl
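    As an alternative to copying the encoded string manually, the same decoded configuration can be extracted in a single command with a jsonpath query (note the escaped dot in the key name):
    $ kubectl get secret alertmanager-occne-kube-prom-stack-kube-alertmanager -n occne-infra -o jsonpath='{.data.alertmanager\.yaml}' | base64 --decode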
  3. To configure the new SNMP notifier, use vi or any other editor to create a new alertmanager.yaml file with the content in the sample output in Step 2.
    Use the sample decoded output file in Step 2 for reference and add the following lines to a new alertmanager.yaml configuration file. These lines must be placed in the receivers section (line 3) of the configuration after default-receiver. Choose any name for the new SNMP notifier receiver.
    receivers:
    - name: default-receiver
      webhook_configs:
      - url: http://occne-snmp-notifier:9464/alerts
    - name: new-receiver-1
      webhook_configs:
      - url: http://occne-snmp-notifier-1:9465/alerts
  4. Add the following lines to the newly created alertmanager.yaml configuration file and save it.
    These lines must be placed in the routes section (line 18) of the configuration. The values for the receiver field must match the receiver name chosen in the previous step.
    routes:
    - receiver: default-receiver
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      continue: true
    - receiver: new-receiver-1
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      continue: true
    - match:
        alertname: Watchdog
      receiver: default-receiver
    - match:
        alertname: Watchdog
      receiver: new-receiver-1
    Example YAML file:
    global:
      resolve_timeout: 5m
    receivers:
    - name: default-receiver
      webhook_configs:
      - url: http://occne-snmp-notifier:9464/alerts
    - name: new-receiver-1
      webhook_configs:
      - url: http://occne-snmp-notifier-1:9465/alerts
    route:
      group_by:
      - job
      group_interval: 5m
      group_wait: 30s
      receiver: default-receiver
      repeat_interval: 12h
      routes:
      - receiver: default-receiver
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 12h
        continue: true
      - receiver: new-receiver-1
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 12h
        continue: true
      - match:
          alertname: Watchdog
        receiver: default-receiver
      - match:
          alertname: Watchdog
        receiver: new-receiver-1
    templates:
    - /etc/alertmanager/config/*.tmpl
  5. To encode the new SNMP notifier configuration in base64, run the following command that encodes the alertmanager.yaml file contents:
    
    $ cat alertmanager.yaml | base64 -w0
    Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCi0gbmFtZTogbmV3LXJlY2VpdmVyLTEKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyLTE6OTQ2NS9hbGVydHMKcm91dGU6CiAgZ3JvdXBfYnk6CiAgLSBqb2IKICBncm91cF9pbnRlcnZhbDogNW0KICBncm91cF93YWl0OiAzMHMKICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgogIHJlcGVhdF9pbnRlcnZhbDogMTJoCiAgcm91dGVzOgogIC0gcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICAgIGdyb3VwX3dhaXQ6IDMwcwogICAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIC0gcmVjZWl2ZXI6IG5ldy1yZWNlaXZlci0xCiAgICBncm91cF93YWl0OiAzMHMKICAgIGdyb3VwX2ludGVydmFsOiA1bQogICAgcmVwZWF0X2ludGVydmFsOiAxMmgKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgogIC0gbWF0Y2g6CiAgICAgIGFsZXJ0bmFtZTogV2F0Y2hkb2cKICAgIHJlY2VpdmVyOiBuZXctcmVjZWl2ZXItMQp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbAo=
  6. Create a secret for the new SNMP notifier.
    Use vi or a similar editor to create a new alertmanager-secret-k8s.yaml file. The content of the YAML file is as follows:
    $ vi alertmanager-secret-k8s.yaml
    apiVersion: v1
    data:
      alertmanager.yaml: <paste here the encoded content of alertmanager.yaml>
    kind: Secret
    metadata:
      name: alertmanager-occne-kube-prom-stack-kube-alertmanager
      namespace: occne-infra
    type: Opaque
  7. Add the SNMP notifier configuration to the secret.
    After receiving the encoded alertmanager.yaml key, replace the key in the newly created alertmanager-secret-k8s.yaml file (line 4). The resulting YAML file must look similar to the following example:
    $ cat alertmanager-secret-k8s.yaml
    apiVersion: v1
    data:
      alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCi0gbmFtZTogbmV3LXJlY2VpdmVyLTEKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyLTE6OTQ2NS9hbGVydHMKcm91dGU6CiAgZ3JvdXBfYnk6CiAgLSBqb2IKICBncm91cF9pbnRlcnZhbDogNW0KICBncm91cF93YWl0OiAzMHMKICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgogIHJlcGVhdF9pbnRlcnZhbDogMTJoCiAgcm91dGVzOgogIC0gcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICAgIGdyb3VwX3dhaXQ6IDMwcwogICAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIC0gcmVjZWl2ZXI6IG5ldy1yZWNlaXZlci0xCiAgICBncm91cF93YWl0OiAzMHMKICAgIGdyb3VwX2ludGVydmFsOiA1bQogICAgcmVwZWF0X2ludGVydmFsOiAxMmgKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgogIC0gbWF0Y2g6CiAgICAgIGFsZXJ0bmFtZTogV2F0Y2hkb2cKICAgIHJlY2VpdmVyOiBuZXctcmVjZWl2ZXItMQp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbAo=
    kind: Secret
    metadata:
      name: alertmanager-occne-kube-prom-stack-kube-alertmanager
      namespace: occne-infra
    type: Opaque
  8. Run the following command to configure the new SNMP trap receiver:
    $ kubectl apply -f alertmanager-secret-k8s.yaml -n occne-infra
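    To confirm that the new configuration is stored, you can decode the secret again and check that the new receiver is present, for example:
    $ kubectl get secret alertmanager-occne-kube-prom-stack-kube-alertmanager -n occne-infra -o jsonpath='{.data.alertmanager\.yaml}' | base64 --decode | grep new-receiver-1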

7.3.2 Changing Network MTU

This section describes the procedure to modify the Maximum Transmission Unit (MTU) of the Kubernetes internal network after the initial CNE installation.

Changing MTU on Internal Interface (eth0) for vCNE (OpenStack or VMware)

  1. Run the following commands from the Bastion host to launch the provision container:
    $ podman run -it --rm --rmi --network host --name DEPLOY_${OCCNE_CLUSTER} -v /var/occne/cluster/${OCCNE_CLUSTER}:/host -v /var/occne:/var/occne:rw -e OCCNEINV=/host/hosts -v /var/www/html/occne:/var/www/html/occne -e 'OCCNEARGS=--extra-vars=occne_hostname=${OCCNE_CLUSTER}-bastion-1 -i /host/occne.ini' ${CENTRAL_REPO}:5000/occne/provision:${OCCNE_VERSION} bash
  2. Run the following command from the provision container Bash shell session to change the MTU for an OpenStack deployment:
    $ ansible -i /host/hosts k8s-cluster -m shell -a 'sudo nmcli con mod "System eth0" 802-3-ethernet.mtu <MTU value>; sudo nmcli con up "System eth0"'
    $ exit
  3. Run the following command from the provision container Bash shell session to change the MTU for a VMware deployment:

    Note:

    If the interface connection name isn't ens192, then run the following commands to find the connection name:
    1. Run the "ip address" command to get the IP address.
    2. Run the "nmcli con show" command to get the connection name.
    $ ansible -i /host/hosts k8s-cluster -m shell -a 'sudo nmcli con mod ens192 802-3-ethernet.mtu <MTU value>; sudo nmcli con up ens192'
    $ exit
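    To verify the new MTU afterwards, relaunch the provision container (step 1) and run a quick check across the cluster, for example (substitute eth0 or ens192 according to your deployment):
    $ ansible -i /host/hosts k8s-cluster -m shell -a 'ip link show eth0'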
Changing MTU on All Interfaces of BareMetal VM Host and VM Guest

Note:

  • The MTU value on the VM host depends on the ToR switch configuration:
    • Cisco Nexus 9000 93180YC-EX supports "system jumbomtu" up to 9216.
    • If you're using a port-channel, VLAN interface, or uplink interface to the customer switch, then run the "system jumbomtu <mtu>" command and configure "mtu <value>" up to the value set by that command.
    • If you're using other types of ToR switches, you can configure the MTU value of VM host up to the maximum MTU value of the switch. Therefore, check the switches for the maximum MTU value and configure the MTU value accordingly.
  • The following steps are for a standard setup with bastion-1 or master-1 on host-1, bastion-2 or master-2 on host-2, and master-3 on host-3. If you have a different setup, then modify the commands accordingly. Each step in this procedure is performed to change MTU for the VM host and the Bastion on the VM host.
  1. SSH to k8s-host-2 from bastion-1:
    $ ssh k8s-host-2
  2. Run the following command to show all the connections:
    $ nmcli con show 
  3. Run the following commands to modify the MTU value on all the connections:

    Note:

    Modify the connection names in the following commands according to the connection names obtained from step 2.
    $ sudo nmcli con mod bond0 802-3-ethernet.mtu <MTU value>
    $ sudo nmcli con mod bondbr0 802-3-ethernet.mtu <MTU value>
    $ sudo nmcli con mod "vlan<mgmt vlan id>-br" 802-3-ethernet.mtu <MTU value>
  4. Run the following commands if there is vlan<ilo_vlan_id>-br on this host:
    $ sudo nmcli con mod "vlan<ilo vlan id>-br" 802-3-ethernet.mtu <MTU value>
    $ sudo nmcli con up "vlan<ilo vlan id>-br"

    Note:

    • The following sudo nmcli con up commands apply the MTU values modified in the previous step. Ensure that you perform the following steps in the given order. Not following the correct order can make host-2 and bastion-2 unreachable.
    • The base interface must have the higher MTU throughout the sequence: the bond0 MTU must be greater than or equal to the MTU of bondbr0 and the VLAN bridge interfaces. The bond0.<vlan id> interface MTU is modified when the vlan<vlan id>-br interface is modified and restarted.
  5. Run the following commands if the <MTU value> is lower than the old value:
    $ sudo nmcli con up "vlan<mgmt vlan id>-br"
    $ sudo nmcli con up bondbr0
    $ sudo nmcli con up bond0
  6. Run the following commands if the <MTU value> is higher than the old value:
    $ sudo nmcli con up bond0
    $ sudo nmcli con up bondbr0
    $ sudo nmcli con up "vlan<mgmt vlan id>-br"
  7. After the values are updated on VM host, run the following commands to shut down all the VM guests:
    $ sudo virsh list --all
    $ sudo virsh shutdown <VM guest>

    where, <VM guest> is the VM guest name obtained from the $ sudo virsh list --all command.

  8. Run the virsh list command until the status of the VM guest is changed to "shut off":
    $ sudo virsh list --all
  9. Run the following command to start the VM guest:
    $ sudo virsh  start <VM guest> 

    where, <VM guest> is the name of the VM guest.

  10. Exit from host-2 and return to bastion-1. Wait until bastion-2 is reachable and run the following command to SSH to bastion-2:
    $ ssh bastion-2
  11. Run the following command to list all connections in bastion-2:
    $ nmcli con show 
  12. Run the following commands to modify the MTU value on all the connections in bastion-2:

    Note:

    Modify the connection names in the following commands according to the connection names obtained in step 11.
    $ sudo nmcli con mod "System enp1s0" 802-3-ethernet.mtu <MTU value>
    $ sudo nmcli con mod "System enp2s0" 802-3-ethernet.mtu <MTU value>
    $ sudo nmcli con mod "System enp3s0" 802-3-ethernet.mtu <MTU value>
    $ sudo nmcli con up "System enp1s0"
    $ sudo nmcli con up "System enp2s0"
    $ sudo nmcli con up "System enp3s0"
  13. Wait until bastion-2 is reachable and run the following command to SSH to bastion-2:
    $ ssh bastion-2
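    After logging back in to bastion-2, you can optionally confirm the new MTU on each connection, for example:
    $ ip link show enp1s0 | grep mtu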
  14. Repeat steps 1 to 13 to change the MTU value on k8s-host-1 and bastion-1.
  15. Repeat steps 1 to 10 to change the MTU values on k8s-host-3 and restart all VM guests on it. You can use bastion-1 or bastion-2 for performing this step.

    Note:

    For the VM guests that are controller nodes, perform only the virsh shutdown and virsh start commands to restart the VM guests. The MTU values of these controller nodes are updated in the following section.

Changing MTU on enp1s0 or bond0 Interface for BareMetal Controller or Worker Nodes

  1. Run the following command to launch the provision container:
    $ podman run -it --rm --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host winterfell:5000/occne/provision:<release> /bin/bash

    Where, <release> is the currently installed release.

    This creates a Bash shell session running within the provision container.

  2. Run the following commands to change enp1s0 interfaces for controller nodes and validate MTU value of the interface:
    1. Change enp1s0 interfaces for controller nodes:

      Replace <MTU value> in the command with a real integer value.

      $ ansible -i /host/hosts.ini kube-master -m shell -a 'sudo  nmcli con mod "System enp1s0" 802-3-ethernet.mtu <MTU value>;  sudo  nmcli con up "System enp1s0"'
    2. Validate the MTU value of the interface:
      $ ansible -i /host/hosts.ini kube-master -m shell -a 'ip link show enp1s0'
  3. Run the following commands to change bond0 interfaces for worker nodes and validate the MTU value of the interface:
    1. Change bond0 interfaces for controller nodes:

      Replace <MTU value> in the command with a real integer value.

      $ ansible -i /host/hosts.ini kube-node -m shell -a 'sudo  nmcli con mod bond0 802-3-ethernet.mtu <MTU value>;  sudo  nmcli con up bond0'
    2. Validate the MTU value of the interface:
      $ ansible -i /host/hosts.ini kube-node -m shell -a 'ip link show bond0'
      $ exit
Changing MTU on vxlan Interface (vxlan.calico) for BareMetal and vCNE
  1. Log in to the Bastion host and run the following command:
    $ kubectl edit daemonset calico-node -n kube-system
  2. Locate the line with FELIX_VXLANMTU and replace the current <MTU value> with the new integer value:

    Note:

    vxlan.calico adds an extra header to the packet. The modified MTU value must be at least 50 lower than the MTU set in the previous steps.
    - name: FELIX_VXLANMTU
             value: "<MTU value>"
  3. Use :x to save and exit the vi editor and run the following command:
    $ kubectl rollout restart daemonset calico-node -n kube-system
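    As an alternative to editing the daemonset interactively in steps 2 and 3, the same environment variable can be set non-interactively with kubectl set env, which also triggers a rolling restart of the daemonset (a sketch; replace <MTU value> with the new integer value):
    $ kubectl set env daemonset/calico-node -n kube-system FELIX_VXLANMTU="<MTU value>"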
  4. Run the following command to launch the provision container:
    $ podman run -it --rm --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host winterfell:5000/occne/provision:${OCCNE_VERSION} /bin/bash
    
  5. Validate the MTU value of the interface on the controller nodes and worker nodes:
    • For BareMetal, run the following command to validate the MTU value:
      $ ansible -i /host/hosts.ini k8s-cluster -m shell -a 'ip link show vxlan.calico'
    • For vCNE (OpenStack or VMware), run the following command to validate the MTU value:
      $ ansible -i /host/hosts k8s-cluster -m shell -a 'ip link show vxlan.calico' 

      Note:

      It takes some time for all the nodes to change to the new MTU. If the MTU value isn't updated, run the command several times to see the changes in the values.
Changing MTU on Calico Interfaces (cali*) for vCNE or BareMetal
  1. Log in to Bastion host and launch the provision container for vCNE or BareMetal using commands from Step 1 of Change MTU on eth0 interface for vCNE and Change MTU on enp1s0 or bond0 interface for BareMetal.
  2. Run the ansible command for all worker nodes from the provision container:

    Note:

    Run this command for worker nodes only and not for controller nodes.
    • Run the following command for a BareMetal deployment:

      Note:

      Replace <MTU value> in the command with an integer value without quote.
      bash-4.4# ansible -i /host/hosts.ini kube-node -m shell -a 'sudo sed -i '/\\\"mtu\\\"/d' /etc/cni/net.d/calico.conflist.template; sudo sed -i "/\\\"type\\\": \\\"calico\\\"/a \ \ \ \ \ \ \\\"mtu\\\": <MTU value>," /etc/cni/net.d/calico.conflist.template'
      bash-4.4# exit
    • Run the following command for a vCNE deployment:

      Note:

      Replace <MTU value> in the command with an integer value without quote.
      bash-4.4# ansible -i /host/hosts kube-node -m shell -a 'sudo sed -i '/\\\"mtu\\\"/d' /etc/cni/net.d/calico.conflist.template; sudo sed -i "/\\\"type\\\": \\\"calico\\\"/a \ \ \ \ \ \ \\\"mtu\\\": <MTU value>," /etc/cni/net.d/calico.conflist.template'
      bash-4.4# exit
  3. Log in to the Bastion Host and run the following command to restart the deamonset:
    $ kubectl rollout restart daemonset calico-node -n kube-system
  4. Run the following commands to delete deployment and reapply with YAML file. The calico interface MTU change takes effect while starting a new pod on the node.
    1. Verify that the deployment is READY 1/1 before deleting and reapplying it:
      $ kubectl get deployment occne-kube-prom-stack-grafana -n occne-infra
      Sample output:
      NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
      occne-kube-prom-stack-grafana   1/1     1            1           10h
    2. Run the following commands to delete the deployment and reapply with YAML file:
      $ kubectl get deployment occne-kube-prom-stack-grafana -n occne-infra -o yaml  > dp-occne-kube-prom-stack-grafana.yaml
      $ kubectl delete deployment occne-kube-prom-stack-grafana -n occne-infra 
      $ kubectl apply -f dp-occne-kube-prom-stack-grafana.yaml
  5. Run the following commands to verify the MTU change on worker nodes:
    • Verify which node has the new pod:
      $ kubectl get pod -A -o wide | grep occne-kube-prom-stack-grafana
      Sample output:
       occne-infra     occne-kube-prom-stack-grafana-79f9b5b488-cl76b                    3/3     Running     0                60s   10.233.120.22   k8s-node-2.littlefinger.lab.us.oracle.com     <none>           <none>
      
    • Use SSH to log in to the node and check the calico interface change. Only the MTU of the newest calico interface changes because it belongs to the newly created pod. The MTU of the other calico interfaces changes when their corresponding services are restarted.
      $ ssh k8s-node-2.littlefinger.lab.us.oracle.com
       
      [admusr@k8s-node-2 ~] $ ip link
      Sample output:
      ...
      35: calia44682149a1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP mode DEFAULT group default
      link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-7f1a8116-5acf-b7df-5d6a-eb4f56330cf1
      115: calif0adcd64a1c@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu <MTU value> qdisc noqueue state UP mode DEFAULT group default
      link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-7b99dc36-3b3b-75c6-e27c-9045eeb8242d

7.3.3 Changing Metrics Storage Allocation

The following procedure describes how to increase the amount of persistent storage allocated to Prometheus for metrics storage.

Prerequisites

The revised amount of persistent storage required by metrics must be calculated. Rerun the metrics storage calculations as provided in the "Preinstallation Tasks" section of Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide, and record the calculated total_metrics_storage value.

Note:

When you increase the storage size for Prometheus, the retention size must also be increased to maintain the purging cycle of Prometheus. The default retention is set to 6.8 GB. If the storage is increased to a higher value and retention remains at 6.8 GB, the amount of data that is stored inside the storage is still 6.8 GB. Therefore, follow the Changing Retention Size of Prometheus procedure to calculate the retention size and update the retention size in Prometheus. These steps are applied while performing Step 3.

Procedure

  1. A Prometheus resource is used to configure all Prometheus instances running in CNE. Run the following command to identify the Prometheus resource:
    $ kubectl get prometheus -n occne-infra
    Sample output:
     NAME                                    VERSION   DESIRED   READY   RECONCILED   AVAILABLE   AGE
    occne-kube-prom-stack-kube-prometheus   v2.44.0   2         2       True         True        20h
  2. Run the following command to edit the Prometheus resource:
    $ kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra

    Note:

    You are placed in a vi editor session that contains all of the configuration for the CNE Prometheus instances.

    Scroll down to the line that contains the "storage" key (line 91), then change the value to <desired pv size>. Also scroll down to the line that contains the "replicas" key (line 54), then change the value to 0. This scales down both pods. The file must look similar to the following example.

    Sample output:
    53       release: occne-kube-prom-stack
        54   replicas: 2
        55   resources:
        56     limits:
        57       cpu: 2000m
        58       memory: 4Gi
        59     requests:
        60       cpu: 2000m
        61       memory: 4Gi
        62   retention: 14d
        63   retentionSize: 6.8GB
        64   routePrefix: /occne4-utpalkant-kumar/prometheus
        65   ruleNamespaceSelector: {}
        66   ruleSelector:
        67     matchLabels:
        68       role: cnc-alerting-rules
        69   scrapeInterval: 1m
        70   scrapeTimeout: 30s
        71   secrets:
        72   - etcd-occne4-utpalkant-kumar-k8s-ctrl-1
        73   - etcd-occne4-utpalkant-kumar-k8s-ctrl-2
        74   - etcd-occne4-utpalkant-kumar-k8s-ctrl-3
        75   securityContext:
        76     fsGroup: 2000
        77     runAsGroup: 2000
        78     runAsNonRoot: true
        79     runAsUser: 1000
        80   serviceAccountName: occne-kube-prom-stack-kube-prometheus
        81   serviceMonitorNamespaceSelector: {}
        82   serviceMonitorSelector: {}
        83   shards: 1
        84   storage:
        85     volumeClaimTemplate:
        86       spec:
        87         accessModes:
        88         - ReadWriteOnce
        89         resources:
        90           requests:
        91             storage: 10Gi
        92         storageClassName: occne-metrics-sc

    Note:

    Type ":wq" to exit the editor session and save the changes. Verify that Prometheus instances are scaled down.
  3. To change the pvc size of Prometheus pods, run the following command:
    $ kubectl edit pvc prometheus-occne-kube-prom-stack-kube-prometheus-db-prometheus-occne-kube-prom-stack-kube-prometheus-0 -n occne-infra

    Note:

    You will be placed in a vi editor session that contains all of the configuration for the CNE Prometheus pvc. Scroll down to the line that contains the "spec.resources.requests.storage" key, then update the value to the <desired pv size>. The file must look similar to the following example:
     spec:
       accessModes:
       - ReadWriteOnce
       resources:
         requests:
           storage: 10Gi
       storageClassName: occne-metrics-sc
       volumeMode: Filesystem

    Type ":wq" to exit the editor session and save the changes.

  4. To verify if the pvc size change is applied, run the following command:
    $ kubectl get pv |grep kube-prom-stack-kube-prometheus-0
    Sample output:
    pvc-3c595b70-4265-42e8-a0ca-623b28ce4221  10Gi  RWO  Delete  Bound  occne-infra/prometheus-occne-kube-prom-stack-kube-prometheus-db-prometheus-occne-kube-prom-stack-kube-prometheus-0   occne-metrics-sc
    

    Note:

    Wait until the new desired size "10Gi" gets reflected. Repeat step 3 and step 4 for "kube-prom-stack-kube-prometheus-1" pvc.
  5. Once both the pv sizes are updated to the new desired size, run the following command to scale up the Prometheus pods:
    $ kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra

    Note:

    You will be placed in a vi editor session that contains all of the configuration for the CNE Prometheus instances. Scroll down to the line that contains the "replicas" key, then change the value back to 2. This scales both pods back up. The file must look similar to the following example:
    53       release: occne-kube-prom-stack
    54   replicas: 2
    55   resources:
    56     limits:
    57       cpu: 2000m
    58       memory: 4Gi
    59     requests:
    60       cpu: 2000m
    61       memory: 4Gi
    62   retention: 14d
  6. To verify that the Prometheus pods are up and running, run the following command:
    $ kubectl get pods -n occne-infra | grep kube-prom-stack-kube-prometheus
    Example output:
    prometheus-occne-kube-prom-stack-kube-prometheus-0 2/2 Running 1 40s
    prometheus-occne-kube-prom-stack-kube-prometheus-1 2/2 Running 1 29s

7.3.4 Changing OpenSearch Storage Allocation

This section describes the procedure to increase the amount of persistent storage allocated to OpenSearch for data storage.

Prerequisites

  • Calculate the revised amount of persistent storage required by OpenSearch. Rerun the OpenSearch storage calculations as provided in the "Preinstallation Tasks" section of Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide, and record the calculated log_trace_active_storage and log_trace_inactive_storage values.

Procedure

This procedure uses the value of log_trace_active_storage for the opensearch-data PV size and log_trace_inactive_storage for the opensearch-master PV size. The following table displays the sample PV sizes considered in this procedure:

OpenSearch Component             Current PV Size   Desired PV Size
occne-opensearch-master          500Mi             500Mi
occne-opensearch-data            10Gi              200Gi (log_trace_active_storage)
opensearch-data-replicas-count   5                 7

Expanding PV size for opensearch-master nodes
  1. Store the output of the current configuration values in the os-master-helm-values.yaml file.
    $ helm -n occne-infra get values occne-opensearch-master > os-master-helm-values.yaml
  2. Update the PVC size block in the os-master-helm-values.yaml file. The PVC size must be updated to the newly required PVC size (in this case, 50Gi as per the sample value considered). The os-master-helm-values.yaml file is required in Step 8 to recreate occne-opensearch-master Statefulset.
    $ vi os-master-helm-values.yaml
    persistence:
      enabled: true
      image: occne-repo-host:5000/docker.io/busybox
      imageTag: 1.31.0
      size: <desired size>Gi
      storageClass: occne-esmaster-sc
  3. Delete the statefulset of occne-opensearch-cluster-master by running the following command:
    $ kubectl -n occne-infra delete sts --cascade=orphan occne-opensearch-cluster-master
  4. Delete the occne-opensearch-cluster-master-2 pod by running the following command:
    $ kubectl -n occne-infra delete pod occne-opensearch-cluster-master-2
  5. Update the PVC storage size in the PVC of occne-opensearch-cluster-master-2 by running the following command:
    $ kubectl -n occne-infra patch -p '{ "spec": { "resources": { "requests": { "storage": "40Gi" }}}}' pvc occne-opensearch-cluster-master-occne-opensearch-cluster-master-2
    
  6. Get the PV volume ID from the PVC of opensearch-master-2:
    $ kubectl get pvc -n occne-infra | grep master-2
    Sample output:
    occne-opensearch-cluster-master-occne-opensearch-cluster-master-2   Bound    pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72   30Gi       RWO            occne-esmaster-sc   17h

    In this case, the PV volume ID in the sample output is pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72.

  7. Hold on to the PV attached to occne-opensearch-cluster-master-2 PVC using the volume ID until the newly updated size gets reflected. Verify the updated PVC value by running the following command:
    $ kubectl get pv -w | grep pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72
    Sample output:
    pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72   30Gi       RWO            Delete           Bound    occne-infra/occne-opensearch-cluster-master-occne-opensearch-cluster-master-2    occne-esmaster-sc            17h
    pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72   40Gi       RWO            Delete           Bound    occne-infra/occne-opensearch-cluster-master-occne-opensearch-cluster-master-2    occne-esmaster-sc            17h
  8. Run Helm upgrade to recreate the occne-opensearch-master statefulset:
    $ helm upgrade -f os-master-helm-values.yaml occne-opensearch-master opensearch-project/opensearch -n occne-infra
  9. Once the deleted pod (master-2) and its statefulset are up and running, check the pod's PVC status and verify if it reflects the updated size.
    $ kubectl get pvc -n occne-infra | grep master-2
    Sample output:
    occne-opensearch-cluster-master-occne-opensearch-cluster-master-2    Bound    pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72  40Gi       RWO            occne-esmaster-sc   17h
     In this example, the PV volume ID is pvc-9d9897c1-b7b9-43a3-bf78-f03b91ea4d72.
  10. Repeat steps 3 through 9 for each of the remaining pods, one after the other (in order master-1, master-0).
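  The repetition described in step 10 can optionally be scripted. The following is a minimal, illustrative bash sketch of the step 3 through step 9 cycle, assuming the sample desired size of 40Gi used in this procedure. If you are unsure, run the documented steps one at a time instead:
    # Example desired size taken from the sample outputs above; replace with your calculated value.
    DESIRED_SIZE="40Gi"
    for i in 2 1 0; do
      # Steps 3 and 4: remove the statefulset (keeping the pods) and delete the target pod.
      kubectl -n occne-infra delete sts --cascade=orphan occne-opensearch-cluster-master
      kubectl -n occne-infra delete pod occne-opensearch-cluster-master-${i}
      # Step 5: patch the PVC to the new size.
      kubectl -n occne-infra patch pvc occne-opensearch-cluster-master-occne-opensearch-cluster-master-${i} -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${DESIRED_SIZE}\"}}}}"
      # Steps 6 and 7: wait until the PVC reports the new capacity.
      until [ "$(kubectl -n occne-infra get pvc occne-opensearch-cluster-master-occne-opensearch-cluster-master-${i} -o jsonpath='{.status.capacity.storage}')" = "${DESIRED_SIZE}" ]; do
        sleep 10
      done
      # Step 8: recreate the statefulset.
      helm upgrade -f os-master-helm-values.yaml occne-opensearch-master opensearch-project/opensearch -n occne-infra
      # Step 9: wait for the pod to return to Ready before moving to the next pod.
      kubectl -n occne-infra wait --for=condition=Ready pod/occne-opensearch-cluster-master-${i} --timeout=600s
    done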
Expanding PV size for opensearch-data nodes
  1. Store the current configuration values of the occne-opensearch-data Helm release in the os-data-helm-values.yaml file:
    $ helm -n occne-infra get values occne-opensearch-data > os-data-helm-values.yaml
  2. Update the PVC size block in the os-data-helm-values.yaml file. The PVC size must be updated to the newly required PVC size (in this case, 200Gi as per the sample value considered). The os-data-helm-values.yaml file is required in Step 8 of this procedure to recreate the occne-opensearch-data statefulset.
    $ vi os-data-helm-values.yaml
    persistence:
      enabled: true
      image: occne-repo-host:5000/docker.io/busybox
      imageTag: 1.31.0
      size: <desired size>Gi
      storageClass: occne-esdata-sc
  3. Delete the statefulset of occne-opensearch-cluster-data by running the following command:
    $ kubectl -n occne-infra delete sts --cascade=orphan occne-opensearch-cluster-data
  4. Delete the occne-opensearch-cluster-data-2 pod by running the following command:
    $ kubectl -n occne-infra delete pod occne-opensearch-cluster-data-2
  5. Update the PVC storage size in the PVC of occne-opensearch-cluster-data-2.
    $ kubectl -n occne-infra patch -p '{ "spec": { "resources": { "requests": { "storage": "20Gi" }}}}' pvc occne-opensearch-cluster-data-occne-opensearch-cluster-data-2
    
  6. Get the PV volume ID from the PVC of opensearch-data-2.
    $ kubectl get pvc -n occne-infra | grep data-2
    Sample output:
    occne-opensearch-cluster-data-occne-opensearch-cluster-data-2   Bound    pvc-80a56d73-d7b7-417f-a7a7-c8484bc8171d   10Gi       RWO            occne-esdata-sc   17h
  7. Hold on to the PV attached to opensearch-data-2 PVC using the volume ID until the newly updated size gets reflected. Verify the updated PVC value by running the following command:
    $ kubectl get pv -w | grep pvc-80a56d73-d7b7-417f-a7a7-c8484bc8171d
    Sample output:
    pvc-80a56d73-d7b7-417f-a7a7-c8484bc8171d   10Gi       RWO            Delete           Bound    occne-infra/occne-opensearch-cluster-data-occne-opensearch-cluster-data-2    occne-esdata-sc            17h
    pvc-80a56d73-d7b7-417f-a7a7-c8484bc8171d   20Gi       RWO            Delete           Bound    occne-infra/occne-opensearch-cluster-data-occne-opensearch-cluster-data-2    occne-esdata-sc            17h
  8. Run Helm upgrade to recreate the occne-opensearch-data statefulset:
    $ helm upgrade -f os-data-helm-values.yaml occne-opensearch-data opensearch-project/opensearch -n occne-infra
  9. Once the deleted pod (data-2) and its statefulset are up and running, check the pod's PVC status and verify if it reflects the updated size.
    $ kubectl get pvc -n occne-infra | grep data-2
    Sample output:
    occne-opensearch-cluster-data-occne-opensearch-cluster-data-2   Bound    pvc-80a56d73-d7b7-417f-a7a7-c8484bc8171d  20Gi       RWO            occne-esdata-sc   17h
    
  10. Repeat steps 3 through 9 for each of the remaining pods, one after the other (in the order data-1, then data-0).
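  After completing the repetition in step 10, you can optionally confirm in a single command that every opensearch-data PVC reports the new size. The following is a minimal sketch:
    $ kubectl -n occne-infra get pvc -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage | grep cluster-data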

7.3.5 Changing the RAM and CPU Resources for Common Services

This section describes the procedure to change the RAM and CPU resources for CNE common services.

Prerequisites

Before changing the RAM, CPU, or both the resources for CNE common services, make sure that the following prerequisites are met:
  • The cluster must be in a healthy state. This can be verified by checking whether all the common services are up and running.

    Note:

    • When changing the CPU and RAM resources for any component, the limit value must always be greater than or equal to the requested value.
    • Run all the commands in this section from the Bastion Host.
7.3.5.1 Changing the Resources for Prometheus

This section describes the procedure to change the RAM or CPU resources for Prometheus.

Procedure

  1. Run the following command to edit the Prometheus resource:
    kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra

    The system opens a vi editor session that contains all the configuration for the CNE Prometheus instances.

  2. Scroll to the resources section and change the CPU and Memory resources to the desired values. This updates the resources for both the prometheus pods.
    For example:
    resources:
        limits:
          cpu: 2000m
          memory: 4Gi
        requests:
          cpu: 2000m
          memory: 4Gi
  3. Type :wq to exit the editor session and save the changes.
  4. Verify if both the Prometheus pods are restarted:
    kubectl get pods -n occne-infra |grep kube-prom-stack-kube-prometheus
    Sample output:
    prometheus-occne-kube-prom-stack-kube-prometheus-0              2/2     Running     0              85s
    prometheus-occne-kube-prom-stack-kube-prometheus-1              2/2     Running     0              104s
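    As a non-interactive alternative to the kubectl edit session in step 1, the same change can be applied as a merge patch against the Prometheus custom resource. The following is a sketch that uses the sample values shown above; verify the result with step 4 afterwards:
    $ kubectl -n occne-infra patch prometheus occne-kube-prom-stack-kube-prometheus --type merge -p '{"spec":{"resources":{"limits":{"cpu":"2000m","memory":"4Gi"},"requests":{"cpu":"2000m","memory":"4Gi"}}}}'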
7.3.5.2 Changing the Resources for Alertmanager

This section describes the procedure to change the RAM or CPU resources for Alertmanager.

Procedure

  1. Run the following command to edit the Alertmanager resource:
    kubectl edit alertmanager occne-kube-prom-stack-kube-alertmanager -n occne-infra

    The system opens a vi editor session that contains all the configuration for the CNE Alertmanager instances.

  2. Scroll to the resources section and change the CPU and Memory resources to the desired values. This updates the resources for the Alertmanager pods.
    For example:
    resources:
        limits:
          cpu: 20m
          memory: 64Mi
        requests:
          cpu: 20m
          memory: 64Mi
  3. Type :wq to exit the editor session and save the changes.
  4. Verify if the Alertmanager pods are restarted:
    kubectl get pods -n occne-infra |grep alertmanager
    Sample output:
    alertmanager-occne-kube-prom-stack-kube-alertmanager-0          2/2     Running     0              16s
    alertmanager-occne-kube-prom-stack-kube-alertmanager-1          2/2     Running     0              35s
7.3.5.3 Changing the Resources for Grafana

This section describes the procedure to change the RAM or CPU resources for Grafana.

Procedure

  1. Run the following command to edit the Grafana resource:
    kubectl edit deploy occne-kube-prom-stack-grafana -n occne-infra

    The system opens a vi editor session that contains all the configuration for the CNE Grafana instances.

  2. Scroll to the resources section and change the CPU and Memory resources to the desired values. This updates the resources for the Grafana pod.
    For example:
    resources:
        limits:
          cpu: 100m
          memory: 128Mi
        requests:
          cpu: 100m
          memory: 128Mi
  3. Type :wq to exit the editor session and save the changes.
  4. Verify if the Grafana pod is restarted:
    kubectl get pods -n occne-infra |grep grafana
    Sample output:
    occne-kube-prom-stack-grafana-84898d89b4-nzkr4                  3/3     Running     0              54s
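    As a non-interactive alternative to the kubectl edit session in step 1, the resources of the Grafana deployment can be set with kubectl set resources. The following is a sketch using the sample values shown above. Note that, without the -c <container> flag, the command applies the values to every container in the deployment, including sidecars, so target a specific container if that is not intended:
    $ kubectl -n occne-infra set resources deployment occne-kube-prom-stack-grafana --limits=cpu=100m,memory=128Mi --requests=cpu=100m,memory=128Mi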
    
7.3.5.4 Changing the Resources for Kube State Metrics

This section describes the procedure to change the RAM or CPU resources for kube-state-metrics.

Procedure

  1. Run the following command to edit the kube-state-metrics resource:
    kubectl edit deploy occne-kube-prom-stack-kube-state-metrics -n occne-infra

    The system opens a vi editor session that contains all the configuration for the CNE kube-state-metrics instances.

  2. Scroll to the resources section and change the CPU and Memory resources to the desired values. This updates the resources for the kube-state-metrics pod.
    For example:
    resources:
        limits:
          cpu: 20m
          memory: 100Mi
        requests:
          cpu: 20m
          memory: 32Mi
  3. Type :wq to exit the editor session and save the changes.
  4. Verify if the kube-state-metrics pod is restarted:
    kubectl get pods -n occne-infra |grep kube-state-metrics
    Sample output:
    occne-kube-prom-stack-kube-state-metrics-cff54c76c-t5k7p        1/1     Running     0              20s
7.3.5.5 Changing the Resources for OpenSearch

This section describes the procedure to change the RAM or CPU resources for OpenSearch.

Procedure

  1. Run the following command to edit the opensearch-master resource:
    kubectl edit sts occne-opensearch-cluster-master -n occne-infra

    The system opens a vi editor session that contains all the configuration for the CNE opensearch-master instances.

  2. Scroll to the resources section and change the CPU and Memory resources to the desired values. This updates the resources for the opensearch-master pod.
    For example:
    resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
  3. Type :wq to exit the editor session and save the changes.
  4. Verify if the opensearch-master pods are restarted:
    kubectl get pods -n occne-infra |grep opensearch-cluster-master
    Sample output:
    occne-opensearch-cluster-master-0                           1/1     Running   0             3m34s
    occne-opensearch-cluster-master-1                           1/1     Running   0             4m8s
    occne-opensearch-cluster-master-2                           1/1     Running   0             4m19s

    Note:

    Repeat this procedure for opensearch-data and opensearch-client pods if required.
7.3.5.6 Changing the Resources for OpenSearch Dashboard

This section describes the procedure to change the RAM or CPU resources for OpenSearch Dashboard.

Procedure

  1. Run the following command to edit the opensearch-dashboard resource:
    kubectl edit deploy occne-opensearch-dashboards -n occne-infra

    The system opens a vi editor session that contains all the configuration for the CNE opensearch-dashboard instances.

  2. Scroll to the resources section and change the CPU and Memory resources to the desired values. This updates the resources for the opensearch-dashboard pod.
    For example:
    resources:
        limits:
          cpu: 100m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 512Mi
  3. Type :wq to exit the editor session and save the changes.
  4. Verify if the opensearch-dashboard pod is restarted:
    kubectl get pods -n occne-infra |grep dashboard
    Sample output:
    occne-opensearch-dashboards-7b7749c5f7-jcs7d                1/1     Running   0              20s
7.3.5.7 Changing the Resources for Fluentd OpenSearch

This section describes the procedure to change the RAM or CPU resources for Fluentd OpenSearch.

Procedure

  1. Run the following command to edit the occne-fluentd-opensearch resource:
    kubectl edit ds occne-fluentd-opensearch -n occne-infra

    The system opens a vi editor session that contains all the configuration for the CNE Fluentd OpenSearch instances.

  2. Scroll to the resources section and change the CPU and memory resources to the desired values. This updates the resources for the Fluentd OpenSearch pods.
    For example:
    resources:
        limits:
          cpu: 100m
          memory: 128Mi
        requests:
          cpu: 100m
          memory: 128Mi
  3. Type :wq to exit the editor session and save the changes.
  4. Verify if the Fluentd OpenSearch pods are restarted:
    kubectl get pods -n occne-infra |grep fluentd-opensearch
    Sample output:
    occne-fluentd-opensearch-kcx87                                      1/1     Running   0             19s
    occne-fluentd-opensearch-m9zhz                                      1/1     Running   0             9s
    occne-fluentd-opensearch-pbbrw                                      1/1     Running   0             14s
    occne-fluentd-opensearch-rstqf                                      1/1     Running   0             4s
7.3.5.8 Changing the Resources for Jaeger Agent

This section describes the procedure to change the RAM or CPU resources for Jaeger Agent.

Procedure

  1. Run the following command to edit the jaeger-agent resource:
    kubectl edit ds occne-tracer-jaeger-agent -n occne-infra

    The system opens a vi editor session that contains all the configuration for the CNE jaeger-agent instances.

  2. Scroll to the resources section and change the CPU and Memory resources to the desired values. This updates the resources for the jaeger-agent pods.
    For example:
    resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 256m
          memory: 128Mi
  3. Type :wq to exit the editor session and save the changes.
  4. Verify if the jaeger-agent pods are restarted:
    kubectl get pods -n occne-infra |grep jaeger-agent
    Sample output:
    occne-tracer-jaeger-agent-dpn4v                                 1/1     Running     0              58s
    occne-tracer-jaeger-agent-dvpnv                                 1/1     Running     0              62s
    occne-tracer-jaeger-agent-h4t67                                 1/1     Running     0              55s
    occne-tracer-jaeger-agent-q92ld                                 1/1     Running     0              51s
7.3.5.9 Changing the Resources for Jaeger Query

This section describes the procedure to change the RAM or CPU resources for Jaeger Query.

Procedure

  1. Run the following command to edit the jaeger-query resource:
    kubectl edit deploy occne-tracer-jaeger-query -n occne-infra

    The system opens a vi editor session that contains all the configuration for the CNE jaeger-query instances.

  2. Scroll to the resources section and change the CPU and Memory resources to the desired values. This updates the resources for the jaeger-query pod.
    For example:
    resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 256m
          memory: 128Mi
  3. Type :wq to exit the editor session and save the changes.
  4. Verify if the jaeger-query pod is restarted:
    kubectl get pods -n occne-infra |grep jaeger-query
    Sample output:
    occne-tracer-jaeger-query-67bdd85fcb-hw67q                      2/2     Running     0              19s

    Note:

    Repeat this procedure for the jaeger-collector pod if required.

7.3.6 Activating and Configuring Local DNS

This section provides information about activating and configuring local DNS.

7.3.6.1 Activating Local DNS
Local DNS allows CNE clusters to perform domain name lookups within Kubernetes clusters. This section provides the procedure to activate the local DNS feature on a CNE cluster.

Note:

Before activating Local DNS, ensure that you are aware of the following conditions:
  • Local DNS does not handle backups of any added record.
  • You must run this procedure to activate local DNS only after installing or upgrading to release 23.4.x.
7.3.6.1.1 Prerequisites
Before activating local DNS, ensure that the following prerequisites are met:
  • Ensure that the cluster is running in a healthy state.
  • Ensure that the CNE cluster is running with version 23.4.x. You can validate the CNE version by echoing the OCCNE_VERSION environment variable on the Bastion Host:
    echo $OCCNE_VERSION
  • Ensure that the cluster is running with the Bastion DNS configuration.
7.3.6.1.2 Preactivation Checks

This section provides information about the checks that are performed before activating local DNS.

Determining the Active Bastion Host

  1. Log in to one of the Bastion Hosts (for example, Bastion 1) and determine if that Bastion Host is active or not by running the following command:
    $ is_active_bastion
    The system displays the following output if the Bastion Host is active:
    IS active-bastion
  2. If the current Bastion is not active, then log in to the mate Bastion Host and verify if it is active:
    $ is_active_bastion
    The system displays the following output if the Bastion Host is active:
    IS active-bastion

Verifying if Local DNS is Already Activated

  1. Navigate to the cluster directory:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}
  2. Open the occne.ini file (for vCNE) or hosts.ini file (for Bare Metal) and verify if the local_dns_enabled variable under the occne:vars header is set to False.
    Example for vCNE:
    $ cat occne.ini
    Sample output:
    [occne:vars]
    .
    local_dns_enabled=False
    .
    Example for Bare Metal:
    $ cat hosts.ini
    Sample output:
    [occne:vars]
    .
    local_dns_enabled=False
    .
    If local_dns_enabled is set to True, then it indicates that local DNS feature is already enabled in the CNE cluster.

    Note:

    Ensure that the first character of the variable value (True or False) is capitalized and that there is no space before or after the equal sign.
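    As an optional shortcut, you can check the variable directly with grep instead of opening the full file in step 2. The following sketch assumes a vCNE deployment (occne.ini); use hosts.ini instead for Bare Metal:
    $ grep -n 'local_dns_enabled' /var/occne/cluster/${OCCNE_CLUSTER}/occne.ini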
7.3.6.1.3 Enabling Local DNS
This section provides the procedure to enable Local DNS in a CNE cluster.
  1. Log in to the active Bastion Host and run the following command to navigate to the cluster directory:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}
  2. Open the occne.ini file (for vCNE) or hosts.ini file (for Bare metal) in edit mode:

    Example for vCNE:

    $ vi occne.ini
    Example for Bare Metal:
    $ vi hosts.ini
  3. Set the local_dns_enabled variable under the occne:vars header to True. If the local_dns_enabled variable is not present under the occne:vars header, then add the variable.

    Note:

    Ensure that the first character of the variable value (True or False) is capitalized and that there is no space before or after the equal sign.
    For example,
    [occne:vars]
    .
    local_dns_enabled=True
    .
  4. For vCNE (OpenStack or VMware) deployments, additionally add the provider_domain_name and provider_ip_address variables under the occne:vars section of the occne.ini file. You can obtain the provider domain name and IP address from the provider administrator and set the variable values accordingly.
    The following block shows the sample occne.ini file with the additional variables:
    [occne:vars]
    .
    local_dns_enabled=True
    provider_domain_name=<cloud provider domain name>
    provider_ip_address=<cloud provider IP address>
    .
  5. Update the cluster with the new settings in the ini file:
    $ OCCNE_CONTAINERS=(K8S) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS='--tags=coredns' pipeline.sh
7.3.6.1.4 Validating Local DNS

This section provides the steps to validate if you have successfully enabled local DNS.

Local DNS provides the validateLocalDns.py script to validate if you have successfully enabled Local DNS. The script is located at /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/maintenance/validateLocalDns.py. This automated script validates Local DNS by performing the following actions:
  1. Creating a test record
  2. Reloading local DNS
  3. Querying the test record from within a pod
  4. Getting the response (Success status)
  5. Deleting the test record
To run the validateLocalDns.py script:
  1. Log in to the active Bastion Host and navigate to the cluster directory:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}
  2. Run the validateLocalDns.py script:
    $ ./artifacts/maintenance/validateLocalDns.py
    Sample output:
    Beginning local DNS validation
     - Validating local DNS configuration in occne.ini
     - Adding DNS A record.
     - Adding DNS SRV record.
     - Reloading local coredns.
     - Verifying local DNS A record.
       - DNS A entry has not been propagated, retrying in 10 seconds (retry 1/5)
     - Verifying local DNS SRV record.
     - Deleting DNS SRV record.
     - Deleting DNS A record.
     - Reloading local coredns.
    Validation successful

    Note:

    If the script encounters an error, it returns an error message indicating which part of the process failed. For more information about troubleshooting local DNS errors, see Troubleshooting Local DNS.
  3. Once you successfully enable Local DNS, add the external hostname records using the Local DNS API to resolve external domain names using CoreDNS. For more information, see Adding and Removing DNS Records.
7.3.6.2 Adding and Removing DNS Records

This section provides the procedures to add and remove DNS records ("A" records and SRV records) in the core DNS configuration using the Local DNS API.

Each Bastion Host runs a version of the Local DNS API as a service on port 8000. The system doesn't require any authentication from inside a Bastion Host and runs the API requests locally.

7.3.6.2.1 Prerequisites
Before adding or removing DNS records, ensure that the following prerequisites are met:
  • The Local DNS feature must be enabled on the cluster. For more information about enabling Local DNS, see Activating Local DNS.
  • The CNE cluster version must be 23.2.x or above.
7.3.6.2.2 Adding an A Record

This section provides information on how to use the Local DNS API to create or add an A record in the CNE cluster.

The system creates zones to group records with similar domain names together. When an API call is made to the Local DNS API, the request payload (request body) is used to create the zones automatically. The system creates a new zone when it identifies a domain name for the first time. Henceforth, it stores each request coming with the same domain name within the same zone.

Note:

  • You cannot create and maintain identical A records.
  • You cannot create two A records with the same name.
  • You cannot create two A records with the same IP address within the same zone.

The following table provides details on how to use the Local DNS API to add an "A" record:

Table 7-1 Adding an A Record

Request URL HTTP Method Content Type Request Body Response Code Sample Response
http://localhost:8000/occne/dns/a POST application/json
{
  "name": "string",
  "ttl": "integer",
  "ip-address": "string"
}

Note: Define each field in the request body within double quotes (" ").

Sample request:
curl -X POST http://localhost:8000/occne/dns/a \
     -H 'Content-Type: application/json' \
     -d '{"name":"occne.lab.oracle.com","ttl":"3600","ip-address":"175.80.200.20"}'
  • 200: Record Added Successfully.
  • 400: Already exists, request payload/parameters incomplete or not valid.
  • 503: Could not be added, internal error.
200: DNS A record added in coredns file for occne.lab.oracle.com 175.80.200.20 3600, msg SUCCESS: Zone info and A record updated for domain name

The following table provides details about the request body parameters:

Table 7-2 Request Body Parameters

Parameter Required or Optional Type Description
name Required string Fully-Qualified Domain Name (FQDN) to be included in the core DNS.

This parameter can contain multiple subdomains where each subdomain can range between 1 and 63 characters and contain the following characters: [a-zA-Z0-9_-].

This parameter cannot start or end with - or _.

The last segment of this parameter is taken as the domain name to create zones.

For example, sample.oracle.com

ip-address Required string The IP address to locate a service. For example, xxx.xxx.xxx.xxx.

The API supports IPv4 protocol only.

ttl Required integer

The Time To Live (TTL) in seconds. This is the amount of time the record is allowed to be cached by a resolver.

The minimum and the maximum value that can be set are 300 and 3600 respectively.
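The following is a minimal end-to-end sketch that combines the endpoints and verification commands documented in this chapter: it adds the sample A record, reloads CoreDNS, and then queries the record from the MetalLB controller pod (see Troubleshooting DNS Records). Allow approximately 30 to 60 seconds for both CoreDNS pods to reload before querying:
curl -X POST http://localhost:8000/occne/dns/a \
     -H 'Content-Type: application/json' \
     -d '{"name":"occne.lab.oracle.com","ttl":"3600","ip-address":"175.80.200.20"}'
curl -X POST http://localhost:8000/occne/coredns/reload
kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup occne.lab.oracle.com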

7.3.6.2.3 Deleting an A Record

This section provides information on how to use the Local DNS API to delete an A record in the CNE cluster.

Note:

  • When the last A record in a zone is deleted, the system deletes the zone as well.
  • You cannot delete an A record that is linked to an existing SRV record. You must first delete the linked SRV record before you can delete the A record.

The following table provides details on how to use the Local DNS API to delete an "A" record:

Table 7-3 Deleting an A Record

Request URL HTTP Method Content Type Request Body Response Code Sample Response
http://localhost:8000/occne/dns/a DELETE application/json
{
  "name": "string",
  "ip-address": "string"
}

Note: Define each field in the request body within double quotes (" ").

Sample request:
curl -X DELETE http://localhost:8000/occne/dns/a \
     -H 'Content-Type: application/json' \
     -d '{"name":"occne.lab.oracle.com","ip-address":"175.80.200.20"}'
  • 200: Record Deleted Successfully.
  • 400: Record does not exist, or request payload/parameters are incomplete or not valid.
  • 503: Could not be deleted, internal error.
200: DNS A record deleted in coredns file for occne.lab.oracle.com 175.80.200.20, msg SUCCESS: A Record deleted

The following table provides details about the request body parameters:

Table 7-4 Request Body Parameters

Parameter Required or Optional Type Description
name Required string Fully-Qualified Domain Name (FQDN).

This parameter can contain multiple subdomains where each subdomain can range between 1 and 63 characters and contain the following characters: [a-zA-Z0-9_-].

This parameter cannot start or end with - or _.

For example, sample.oracle.com

ip-address Required string The IP address to locate a service. For example, xxx.xxx.xxx.xxx.
7.3.6.2.4 Adding an SRV Record

This section provides information on how to use the Local DNS API to create or add an SRV record in the CNE cluster.

SRV records are linked to A records. To add a new SRV record, you must already have an A record in the system. Adding an SRV record creates a dependency between the A record and the SRV record. Therefore, to delete an A record, you must first delete the SRV record pointing to the A record. When you are adding an SRV record, the system adds the SRV record to the designated zone, matching the domain name and the related A record.

Note:

  • You cannot create and maintain identical SRV records. However, you can have a different protocol for the same combination of service and target A record.
  • Currently, there is no provision to edit an existing SRV record. If you want to edit an SRV record, then delete the existing SRV record and then re-add the record with the updated parameters (weight, priority, or TTL).

The following table provides details on how to use the Local DNS API to create an SRV record:

Table 7-5 Adding an SRV Record

Request URL HTTP Method Content Type Request Body Response Code Sample Response
http://localhost:8000/occne/dns/srv POST application/json
{
  "service": "string",
  "protocol": "string",
  "dn": "string",
  "ttl": "integer"
  "priority": "integer",
  "weight": "integer",
  "port": "integer",
  "server": "string",
  "a_record": "string"
}

Note: Define each field in the request body within double quotes (" ").

Sample request:
curl -X POST http://localhost:8000/occne/dns/srv \
     -H 'Content-Type: application/json' \
          -d '{"service":"sip","protocol":"tcp","dn":"lab.oracle.com","ttl":"3600","weight":"100","port":"35061","server":"occne","priority":"10","a_record":"occne.lab.oracle.com"}'
  • 200: Record Added Successfully.
  • 400: Request payload/parameters incomplete or not valid.
  • 409: Already exists.
  • 503: Could not be added, internal error.
200: SUCCESS: SRV record successfully added to config map coredns.

The following table provides details about the request body parameters:

Table 7-6 Request Body Parameters

Parameter Required or Optional Type Description
service Required string The symbolic name for the service, such as "sip", and "my_sql".

The value of this parameter can range between 1 and 63 characters and contain the following characters: [a-zA-Z0-9_-]. The parameter cannot start or end with - or _.

protocol Required string The protocol supported by the service. The allowed values are:
  • tcp
  • udp
dn Required string The domain name that the SRV record is applicable to. This parameter can contain multiple subdomains where each subdomain can range between 1 and 63 characters and contain the following characters: [a-zA-Z0-9_-]. For example: lab.oracle.com.

If the SRV record is applicable to the entire domain, then provide only the domain name without subdomains. For example, oracle.com

The length of the Top Level Domains (TLD) must be between 1 and 6 characters and must only contain the following characters: [a-z]. For example: .com, .net, and .uk.

ttl Required integer

The Time To Live (TTL) in seconds. This is the amount of time the record is allowed to be cached by a resolver.

This value can range between 300 and 3600.

priority Required integer The priority of the current SRV record in comparison to the other SRV records.

The values can range from 0 to n.

weight Required integer The weight of the current SRV record in comparison to the other SRV records with the same priority.

The values can range from 0 to n.

port Required integer The port on which the target service is found.

The values can range from 1 to 65535.

server Required string The name of the machine providing the service without including the domain name (value provided in the dn field).

The value can range between 1 and 63 characters and contain the following characters: [a-zA-Z0-9_-]. The parameter cannot start or end with - or _. For example: occne-node1.

a_record Required string The "A" record name to which the SRV is added.

The "A" record mentioned here must be already added. Otherwise the request fails.

7.3.6.2.5 Deleting an SRV Record

This section provides information on how to use the Local DNS API to delete an SRV record in the CNE cluster.

Note:

To delete an SRV record, the details in the request payload must exactly match the details, such as weight, priority, and ttl, of an existing SRV record.

The following table provides details on how to use the Local DNS API to delete an SRV record:

Table 7-7 Deleting an SRV Record

Request URL HTTP Method Content Type Request Body Response Code Sample Response
http://localhost:8000/occne/dns/srv DELETE application/json
{
  "service": "string",
  "protocol": "string",
  "dn": "string",
  "ttl": "integer"
  "priority": "integer",
  "weight": "integer",
  "port": "integer",
  "server": "string"
  "a_record": "string"
}

Note: Define each field in the request body within double quotes (" ").

Sample request:
curl -X DELETE http://localhost:8000/occne/dns/srv \
     -H 'Content-Type: application/json' \
          -d '{"service":"sip","protocol":"tcp","dn":"lab.oracle.com","ttl":"3600","weight":"100","port":"35061","server":"occne","priority":"10","a_record":"occne.lab.oracle.com"}'
  • 200: Record Deleted Successfully.
  • 400: Record does not exist, or request payload/parameters incomplete or not valid.
  • 503: Could not be deleted, internal error.
200: SUCCESS: SRV record successfully deleted from config map coredns

The following table provides details about the request body parameters:

Table 7-8 Request Body Parameters

Parameter Required or Optional Type Description
service Required string The symbolic name for the service, such as "sip", and "my_sql".

The value of this parameter can range between 1 and 63 characters and contain the following characters: [a-zA-Z0-9_-]. The parameter cannot start or end with - or _.

protocol Required string The protocol supported by the service. The allowed values are:
  • tcp
  • udp
dn Required string The domain name that the SRV record is applicable to. This parameter can contain multiple subdomains where each subdomain can range between 1 and 63 characters and contain the following characters: [a-zA-Z0-9_-].

The length of the Top Level Domains (TLD) must be between 1 and 6 characters and must only contain the following characters: [a-z]. For example: .com, .net, and .uk.

ttl Required integer

The Time To Live (TTL) in seconds. This is the amount of time the record is allowed to be cached by a resolver.

This value can range between 300 and 3600.

priority Required integer The priority of the current SRV record in comparison to the other SRV records.

The values can range from 0 to n.

weight Required integer The weight of the current SRV record in comparison to the other SRV records with the same priority.

The values can range from 0 to n.

port Required integer The port on which the target service is found.

The values can range from 1 to 65535.

server Required string The name of the machine providing the service minus the domain name (the value in the dn field).

The value can range between 1 and 63 characters and contain the following characters: [a-zA-Z0-9_-]. The parameter cannot start or end with - or _.

a_record Required string The "A" record name from which the SRV is deleted.

The "A" record mentioned here must be already added. Otherwise the request fails.

7.3.6.3 Adding and Removing Forwarding Nameservers

The forward section in CoreDNS forwards queries that CoreDNS cannot resolve to the configured nameservers. This section provides information about adding or removing forwarding nameservers using the forward endpoint provided by the Local DNS API.

7.3.6.3.1 Adding Forwarding Nameservers

This section provides information about adding forward section in coreDNS using the forward endpoint provided by Local DNS API.

Note:

  • Add forward section in coreDNS by using the payload data.
  • Ensure that the forward section is not added already.

The following table provides details on how to use the Local DNS API endpoint to add forward section:

Table 7-9 Adding Forwarding Nameserver

Request URL HTTP Method Content Type Request Body Response Code Sample Response
http://localhost:8000/occne/dns/forward POST application/json
{
  "ip-address": "string",
}

Note: Define each field in the request body within double quotes (" ").

Sample request to add forward section:
curl -X POST http://localhost:8000/occne/dns/forward \
     -H 'Content-Type: application/json' \
     -d '{"ip-address":"192.168.100.25,10.75.100.5"}'
  • 200: Forward successfully configured.
  • 400: Request payload/parameters incomplete or not valid.
  • 503: Could not configure, internal error.
200: SUCCESS: Forwarding nameserver list added to coredns successfully

The following table provides details about the request body parameters:

Table 7-10 Request Body Parameters

Parameter Required or Optional Type Description
ip-address Required string The IP addresses to forward the requests to. You can define up to 15 IP addresses. The IP addresses must be valid IPv4 addresses and must be defined as a comma-separated list without extra spaces.
7.3.6.3.2 Removing Forwarding Nameservers

This section provides information about deleting forward section in coreDNS using the forward endpoint provided by Local DNS API.

Note:

  • CNE doesn't support updating the forward section. If you want to update the forward nameservers, then remove the existing forward section and add a new one with the updated data.
  • Before removing the forward section, ensure that there is already a forward section to delete.

The following table provides details on how to use the Local DNS API endpoint to remove forward section:

Table 7-11 Removing Forwarding Nameserver

Request URL HTTP Method Content Type Request Body Response Code Sample Response
http://localhost:8000/occne/dns/forward DELETE NA
Sample request to delete forward section:
curl -X DELETE http://localhost:8000/occne/dns/forward
  • 200: Forward configuration successfully removed.
  • 400: Request payload/parameters incomplete or not valid. Not configured, unable to remove.
  • 503: Could not configure, internal error.
200: SUCCESS: Forwarding nameserver list deleted from coredns successfully
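Because updating the forward section in place is not supported, an update is performed as a delete followed by an add. The following is a minimal sketch that uses the sample IP addresses from the previous section; substitute your own nameserver list:
curl -X DELETE http://localhost:8000/occne/dns/forward
curl -X POST http://localhost:8000/occne/dns/forward \
     -H 'Content-Type: application/json' \
     -d '{"ip-address":"192.168.100.25,10.75.100.5"}'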
7.3.6.4 Reloading Local or Core DNS Configurations

This section provides information about reloading core DNS configuration using the reload endpoint provided by Local DNS API.

Note:

You must reload the core DNS configuration to commit the last configuration update, whenever you:
  • add or remove multiple records in the same zone
  • update a single or multiple DNS records

The following table provides details on how to use the Local DNS API endpoint to reload the core DNS configuration:

Table 7-12 Reloading Local or Core DNS Configurations

Request URL HTTP Method Content Type Request Body Response Code Sample Response
http://localhost:8000/occne/coredns/reload POST application/json
{
  "deployment-name": "string",
  "namespace": "string"
}
Note:
  • Define each field in the request body within double quotes (" ").
  • Currently, the system always uses the default values for the "deployment-name" and "namespace" parameters and doesn't consider any modification done to them. This will be addressed in a future release.
Sample request to reload the core DNS without payload (using the default values):
curl -X POST http://localhost:8000/occne/coredns/reload
Sample request to reload the core DNS using the payload:
curl -X POST http://localhost:8000/occne/coredns/reload \
     -H 'Content-Type: application/json' \
     -d '{"deployment-name":"coredns","namespace":"kube-system"}'
  • 200: Local DNS or Core DNS Reloaded Successfully.
  • 400: Request payload/parameters incomplete or not valid.
  • 503: Could not be reloaded, internal error.
200: Deployment reloaded, msg SUCCESS: Reloaded coredns deployment in ns kube-system

The following table provides details about the request body parameters:

Table 7-13 Request Body Parameters

Parameter Required or Optional Type Description
deployment-name Required string The deployment Name to be reloaded. The value must be a valid Kubernetes deployment name.

The default value is coredns.

namespace Required string The namespace where the deployment exists. The value must be a valid Kubernetes namespace name.

The default value is kube-system.

7.3.6.5 Other Local DNS API Endpoints

This section provides information about the additional endpoints provided by Local DNS API.

Get Data

The Local DNS API provides an endpoint to get the current configuration, zones and records of local DNS or core DNS.

The following table provides details on how to use the Local DNS API endpoint to get the Local DNS or core DNS configuration details:

Table 7-14 Get Local DNS or Core DNS Configurations

Request URL HTTP Method Content Type Request Body Response Code Sample Response
http://localhost:8000/occne/dns/data GET NA Sample request:
curl -X GET http://localhost:8000/occne/dns/data
  • 200: Returns core DNS configmap data, including zones and records.
  • 503: Could not get data, internal error.
200:
[True, {'api_version': 'v1',
 'binary_data': None,
 'data': {'Corefile': '.:53 {\n'
...
# Output Omitted
...
        'db.oracle.com': ';oracle.com db file\n'
                         'oracle.com.                  300                  '
                         'IN      SOA     ns1.oracle.com  andrei.oracle.com '
                         '201307231 3600 10800 86400 3600\n'
                         'occne1.us.oracle.com.                  '
                         '3600                  IN      A       '
                         '10.65.200.182\n'
                         '_sip._tcp.lab.oracle.com 30 IN SRV 10 102 32061 '
                         'occne.lab.oracle.com.\n'
                         'occne.lab.oracle.com.                  '
                         '3600                  IN      A       '
                         '175.80.200.20\n',
...
# Output Omitted
...
7.3.6.6 Troubleshooting Local DNS

This section describes the issues that you may encounter while configuring Local DNS and their troubleshooting guidelines.

By design, the Local DNS functionality is built on top of the core DNS (CoreDNS). Therefore, all the troubleshooting, logging, and configuration management are performed directly on the core DNS. Each cluster runs a CoreDNS deployment (2 pods) with the rolling update strategy. Therefore, any change in the configuration is applied to both the pods one by one. This process can take some time (approximately 30 to 60 seconds to reload both pods).

A NodeLocalDNS daemonset is a cache implementation of core DNS. The NodeLocalDNS runs as a pod on each node and is used for quick DNS resolution. When a pod requires a certain domain name resolution, it first checks its NodeLocalDNS pod, the one running in the same node, for resolution. If the pod doesn't get the required resolution, then it forwards the request to the core DNS.

All local DNS records are stored inside the core DNS configmap grouped by zones.

Note:

Use the active Bastion to run all the troubleshooting procedures in this section.
7.3.6.6.1 Troubleshooting Local DNS API

This section provides the troubleshooting guidelines for the common scenarios that you may encounter while using Local DNS API.

Validating Local DNS API

The Local DNS API, which is used to add or remove DNS records, runs as a service ("bastion_http_server") on all Bastion servers. To validate if the API is up and running, run the following command from a Bastion server:
$ systemctl status bastion_http_server
Sample output:
● bastion_http_server.service - Bastion http server
Loaded: loaded (/etc/systemd/system/bastion_http_server.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2023-04-12 00:12:51 UTC; 1 day 19h ago
Main PID: 283470 (gunicorn)
Tasks: 4 (limit: 23553)
Memory: 102.6M
CGroup: /system.slice/bastion_http_server.service
├─283470 /usr/bin/python3.6 /usr/local/bin/gunicorn --workers=3 --bind 0.0.0.0:8000 --chdir /bin/bastion_http_setup wsgi:app --max-requests 0 --timeout 5 --keep>
├─283474 /usr/bin/python3.6 /usr/local/bin/gunicorn --workers=3 --bind 0.0.0.0:8000 --chdir /bin/bastion_http_setup wsgi:app --max-requests 0 --timeout 5 --keep>
├─283476 /usr/bin/python3.6 /usr/local/bin/gunicorn --workers=3 --bind 0.0.0.0:8000 --chdir /bin/bastion_http_setup wsgi:app --max-requests 0 --timeout 5 --keep>
└─641094 /usr/bin/python3.6 /usr/local/bin/gunicorn --workers=3 --bind 0.0.0.0:8000 --chdir /bin/bastion_http_setup wsgi:app --max-requests 0 --timeout 5 --keep>

The sample output shows the status of the Bastion http server service as active (running) and enabled. All Bastion servers have their own independent version of this service. Therefore, it is recommended to check the status of all Bastion servers.
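To check every Bastion Host from one place, you can optionally loop over the hosts with ssh. The following is a sketch only; the host names are placeholders, and passwordless ssh between the Bastion Hosts is assumed:
# Replace <bastion-1> and <bastion-2> with the actual Bastion Host names.
for host in <bastion-1> <bastion-2>; do
  echo "----- ${host} -----"
  ssh "${host}" 'systemctl is-active bastion_http_server'
done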

Starting or Restarting Local DNS API

If Local DNS API is not running, run the following command to start or restart it:

Command to start Local DNS API:
$ sudo systemctl start bastion_http_server
Command to restart Local DNS API:
$ sudo systemctl restart bastion_http_server

The start and restart commands don’t display any output on completion. To check the status of Local DNS API, perform the Validating Local DNS API procedure.

If bastion_http_server doesn't run even after starting or restarting it, refer to the following section to check its log.

Generating and Checking Local DNS Logs

This section provides details about generating and checking Local DNS logs.

Generating Local DNS Logs

You can use journalctl to get the logs of Local DNS API that runs as a service (bastion_http_server) on each bastion server.

Run the following command to get the logs of Local DNS API:
$ journalctl -u bastion_http_server
Run the following command to print only the latest 20 logs of Local DNS API:
$ journalctl -u bastion_http_server --no-pager -n 20

Note:

In the interactive mode, you can use the keyboard shortcuts to scroll through the logs. The system displays the latest logs at the end.
Sample output:
-- Logs begin at Tue 2023-04-11 22:36:02 UTC. --
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,357 BHHTTP:INFO: Request payload: Record name occne.lab.oracle.com record ip 175.80.200.20 [/bin/bastion_http_setup/bastionApp.py:125]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,357 BHHTTP:INFO: Domain name oracle.com db name db.oracle.com for record entry [/bin/bastion_http_setup/coreDnsData.py:362]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,369 BHHTTP:INFO: SUCCESS:  Validate coredns common config msg data oracle.com [/bin/bastion_http_setup/commons.py:36]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,380 BHHTTP:INFO: SUCCESS:  A Record deleted msg data occne.lab.oracle.com [/bin/bastion_http_setup/commons.py:36]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,380 BHHTTP:INFO: SUCCESS:  A Record deleted msg data occne.lab.oracle.com [/bin/bastion_http_setup/commons.py:36]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,380 BHHTTP:INFO: Domain name oracle.com db name db.oracle.com for record entry [/bin/bastion_http_setup/coreDnsData.py:362]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,388 BHHTTP:INFO: SUCCESS:  Validate coredns common config msg data oracle.com [/bin/bastion_http_setup/commons.py:36]
Apr 12 16:33:27 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:33:27,388 BHHTTP:INFO: DNS A record deleted in coredns file for occne.lab.oracle.com 175.80.200.20,  msg SUCCESS:  SUCCESS:  A Record deleted [/bin/bastion_http_setup/commons.py:47]
Apr 12 16:34:13 test-bastion-1.novalocal gunicorn[283474]: 2023-04-12 16:34:13,487 BHHTTP:INFO: Deployment reloaded,  msg SUCCESS:  Reloaded coredns deployment in ns kube-system [/bin/bastion_http_setup/commons.py:47]
Checking Local DNS Logs
Local DNS logs contain information, errors, and debug messages about all the activities in Local DNS. The following table lists some of the sample messages and their description:

Table 7-15 Local DNS Log Messages

Message Type/ Level Description
Deployment reloaded, msg SUCCESS: Reloaded coredns deployment in ns kube-system INFO Success message indicating that the core DNS deployment reloaded successfully.
Validate coredns common config msg data oracle.com INFO Indicates that the module was able to process core DNS configuration data for a specific domain name.
Request payload incomplete. Request requires name and ip-address, error missing param 'ip-address' ERROR Indicates an invalid payload. The API sends this type of messages when the payload used for a given record is not valid or not complete.
FAILED: A record occne.lab.oracle.com does not exists in Zone db.oracle.com ERROR This message is used by an API module to trigger a creation of a new zone. This error message does not require any intervention.
Already exists: DNS A record in coredns file for occne.lab.oracle.com 175.80.200.20 3600, msg SUCCESS: A record occne.lab.oracle.com already exists in Zone db.oracle.com, msg: Record occne.lab.oracle.com cannot be duplicated. ERROR Same domain name error. Records in the same zone cannot be duplicated, have the same name, or share the same IP address. This message is displayed if either of these conditions is true.
DNS A record deleted in coredns file for occne.lab.oracle.com 175.80.200.20, msg SUCCESS: A Record deleted INFO Success message indicating that an A record was deleted successfully.
DNS A record added in coredns file for occne.lab.oracle.com 175.80.200.20 3600, msg SUCCESS: Zone info and A record updated for domain name INFO Success message indicating that the API has successfully added a new A record and updated the zone information.
ERROR in app: Exception on /occne/dns/a [POST] ... Traceback Omitted ERROR Fatal error indicating that an exception has occurred while processing a request. You can get more information by performing a traceback. This type of error is not common and must be reported as a bug.
Zone already present with domain name oracle.com DEBUG Debug messages of this type are not enabled by default. They are usually used to print a large amount of information while troubleshooting.
FAILED: Unable to add SRV record: _sip._tcp.lab.oracle.com. 3600 IN SRV 10 100 35061 occne.lab.oracle.com. - record already exists - data: ... Data Omitted ERROR Error message indicating that the record already exists and cannot be duplicated.
7.3.6.6.2 Troubleshooting Core DNS

This section provides information about troubleshooting Core DNS using the core DNS logs.

Local DNS records are added to the CoreDNS configuration. Therefore, the logs are generated and reported by the core DNS pods. As per the default configuration, CoreDNS reports information logs only on startup (for example, after a reload) and when it runs into an error.

  • Run the following command to print all logs from both core DNS pods to the terminal, separated by name:
    $ for pod in $(kubectl -n kube-system get pods | grep coredns | awk '{print $1}'); do echo "----- $pod -----"; kubectl -n kube-system logs $pod; done
    Sample output:
    ----- coredns-8ddb9dc5d-5nvrv -----
    [INFO] plugin/ready: Still waiting on: "kubernetes"
    [INFO] plugin/auto: Inserting zone `occne.lab.oracle.com.' from: /etc/coredns/..2023_04_12_16_34_13.510777403/db.occne.lab.oracle.com
    .:53
    [INFO] plugin/reload: Running configuration SHA512 = 2bc9e13e66182e6e829fe1a954359de92746468f433b8748589dfe16e1afd0e790e1ff75415ad40ad17711abfc7a8348fdda2770af99962db01247526afbe24a
    CoreDNS-1.9.3
    linux/amd64, go1.18.2, 45b0a11
    ----- coredns-8ddb9dc5d-6lf5s -----
    [INFO] plugin/auto: Inserting zone `occne.lab.oracle.com.' from: /etc/coredns/..2023_04_12_16_34_15.930764941/db.occne.lab.oracle.com
    .:53
    [INFO] plugin/reload: Running configuration SHA512 = 2bc9e13e66182e6e829fe1a954359de92746468f433b8748589dfe16e1afd0e790e1ff75415ad40ad17711abfc7a8348fdda2770af99962db01247526afbe24a
    CoreDNS-1.9.3
    linux/amd64, go1.18.2, 45b0a11
  • Additionally, you can pipe the above command to a file for better readability and sharing:
    $ for pod in $(kubectl -n kube-system get pods | grep coredns | awk '{print $1}'); do echo "----- $pod -----"; kubectl -n kube-system logs $pod; done > coredns.logs
    $ vi coredns.logs
  • Run the following command to get the latest logs from any of the CoreDNS pods:
    $ kubectl -n kube-system --tail 20 logs $(kubectl -n kube-system get pods | grep coredns | awk  '{print $1 }' | head -n 1)

    This command prints the latest 20 log entries. You can modify the --tail value as per your requirement.

    Sample output:
    [INFO] plugin/auto: Inserting zone `occne.lab.oracle.com.' from: /etc/coredns/..2023_04_13_19_29_29.1646737834/db.occne.lab.oracle.com
    .:53
    [INFO] plugin/reload: Running configuration SHA512 = 2bc9e13e66182e6e829fe1a954359de92746468f433b8748589dfe16e1afd0e790e1ff75415ad40ad17711abfc7a8348fdda2770af99962db01247526afbe24a
    CoreDNS-1.9.3
    linux/amd64, go1.18.2, 45b0a11
7.3.6.6.3 Troubleshooting DNS Records

This section provides information about validating records and querying internal and external records.

Note:

Use the internal cluster network to resolve the records added to core DNS through local DNS API. The system does not respond if you query for a DNS record from outside the cluster (for example, querying from a Bastion server).

Validating Records

You can use any pod to access and query a DNS record in core DNS. However, most pods do not have the network utilities required to query a record directly. In such cases, you can bundle network utilities, such as bind-utils, with the pods to allow them to access and query records.

You can also use the MetalLB controller pod, which is already bundled with bind-utils, to query the DNS records:
  • Run the following command from a Bastion server to query an A record:
    $ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup occne.lab.oracle.com
    Sample output:
    .oracle.com
    Server: 169.254.25.10
    Address: 169.254.25.10:53
     
     
    Name: occne.lab.oracle.com
    Address: 175.80.200.20
  • Run the following command from a Bastion server to query an SRV record:
    $ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup -type=srv _sip._tcp.lab.oracle.com
    Sample output:
    Server:         169.254.25.10
    Address:        169.254.25.10:53
     
    _sip._tcp.lab.oracle.com        service = 10 100 35061 occne.lab.oracle.com

    Note:

    Reload the core DNS configuration after adding multiple records to ensure that your changes are applied.
The following code block provides the command to query a DNS record using a pod that is created in a different namespace. By default, this pod is not bundled with the CNE cluster:

Note:

This example considers that an A record is already loaded to occne1.us.oracle.com using the API.
$ kubectl -n occne-demo exec -it test-app -- nslookup occne1.us.oracle.com
Sample output:
.oracle.com
Server: 169.254.25.10
Address: 169.254.25.10:53
 
 
Name: occne1.us.oracle.com
Address: 10.65.200.182

Querying Non Existing or External Records

You cannot access or query an external record or a record that is not added using the API. The system terminates such queries with an error code.

For example:
  • The following code block shows a case where a nonexistent A record is queried:
    $ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup not-in.oracle.com
    Sample output:
    Server:         169.254.25.10
    Address:        169.254.25.10:53
     
    ** server can't find not-in.oracle.com: NXDOMAIN
     
    ** server can't find not-in.oracle.com: NXDOMAIN
     
    command terminated with exit code 1
  • The following code block shows a case where a nonexistent SRV record is queried:
    $ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup -type=srv not-in.oracle.com
    Sample output:
    Server:         169.254.25.10
    Address:        169.254.25.10:53
     
    ** server can't find not-in.oracle.com: NXDOMAIN
     
    ** server can't find not-in.oracle.com: NXDOMAIN
     
    command terminated with exit code 1

Querying Internal Services

Core DNS is configured to resolve internal services by default. Therefore, you can query any internal Kubernetes services as usual.

For example,
  • The following code block shows a case where an internal Kubernetes service (A record) is queried:
    $ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup kubernetes
    Sample output:
    Server:         169.254.25.10
    Address:        169.254.25.10:53
     
     
    Name:   kubernetes.default.svc.test
    Address: 10.233.0.1
     
    ** server can't find kubernetes.svc.test: NXDOMAIN
    ** server can't find kubernetes.svc.test: NXDOMAIN
    ** server can't find kubernetes.test: NXDOMAIN
    ** server can't find kubernetes.test: NXDOMAIN
    ** server can't find kubernetes.occne-infra.svc.test: NXDOMAIN
    ** server can't find kubernetes.occne-infra.svc.test: NXDOMAIN

    The sample output displays the response from default.svc.test because the "kubernetes" service exists only in the default namespace.

  • The following code block shows a case where an SRV record is queried for an internal Kubernetes service:
    $ kubectl -n occne-infra exec -i -t $(kubectl -n occne-infra get pod | grep metallb-cont | awk '{print $1}') -- nslookup -type=srv kubernetes.default.svc.test
    Sample output:
    Server:         169.254.25.10
    Address:        169.254.25.10:53
     
    kubernetes.default.svc.occne3-toby-edwards      service = 0 100 443 kubernetes.default.svc.test
     
    ** server can't find kubernetes.svc.test: NXDOMAIN
    ** server can't find kubernetes.occne-infra.svc.test: NXDOMAIN
    ** server can't find kubernetes.test: NXDOMAIN

    The sample output displays the response from default.svc.test because the "kubernetes" service exists only in the default namespace.

7.3.6.6.4 Accessing Configuration Files

This section provides information about accessing configuration files for troubleshooting.

Local DNS does not have any configuration file of its own; it relies entirely on the CoreDNS configuration files. CoreDNS and nodelocaldns (the CoreDNS cache daemon) store their configuration as configmaps in the kube-system namespace.

Note:

The Local DNS API takes care of configuration and modifications by default. Therefore, it is not recommended to access or update the configmaps manually, as manual changes to these files can potentially break the entire CoreDNS functionality.

If it is absolutely necessary to access the configmap for troubleshooting, use the data endpoint to access the records of all zones along with the CoreDNS configuration.

Sample configuration:
# The following line, starting with "db.DOMAIN-NAME" represents a Zone file
'db.oracle.com': ';oracle.com db file\n'
               'oracle.com.                  300                  ' # All zone files contain a default SOA entry auto generated
               'IN      SOA     ns1.oracle.com  andrei.oracle.com '
               '201307231 3600 10800 86400 3600\n'
               'occne.lab.oracle.com.                  ' # User added A record
               '3600                  IN      A       175.80.200.20\n'
               '_sip._tcp.lab.oracle.com 30 IN SRV 10 102 32061 ' # User added SRV record
               'occne.lab.oracle.com.\n'
               'occne1.us.oracle.com.                  ' # User added A record
               '3600                  IN      A       '
               '10.65.200.182\n'},
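
If you only require a read-only view of the underlying configuration for troubleshooting, you can display the configmaps without editing them. For example, assuming the default configmap names coredns and nodelocaldns:
$ kubectl -n kube-system get configmap coredns -o yaml
$ kubectl -n kube-system get configmap nodelocaldns -o yaml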
7.3.6.6.5 Troubleshooting Validation Script Errors

The local DNS feature provides the validateLocalDns.py script to validate if the Local DNS feature is activated successfully. This section provides information about troubleshooting some of the common issues that occur while using the validateLocalDns.py script.

Local DNS variable is not set properly

You can encounter the following or a similar error message while running the validation script if the local DNS variable is not set properly:
Beginning local DNS validation
 - Getting the occne-metallb-controller pod's name.
 - Validating occne.ini.
Unable to continue - err: Cannot continue - local_dns_enabled variable is set to False, which is not valid to continue..
In such cases, ensure that:
  • the local_dns_enabled variable is set to True: local_dns_enabled=True
  • there are no blank spaces before or after the "=" sign
  • the variable is typed correctly as it is case sensitive
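
For example, assuming the occne.ini file resides in the cluster directory on the Bastion Host (adjust the path if your deployment differs), you can confirm the setting as follows:
$ grep local_dns_enabled /var/occne/cluster/${OCCNE_CLUSTER}/occne.ini
Expected output:
local_dns_enabled=True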

Note:

To successfully enable Local DNS, you must follow the entire activation procedure. Otherwise, the system doesn't enable the feature successfully even after you set the local_dns_enabled variable to the correct value.

Unable to access the test pod

The validation script uses the occne-metallb-controller pod to validate the test record. This is because the DNS records can be accessed from inside the cluster only, and the MetalLB pod contains the necessary utility tools to access the records by default. You can encounter the following error while running the validation script if the MetalLB pod is not accessible:
Beginning local DNS validation
 - Getting the occne-metallb-controller pod's name.
  - Error while trying to get occne-metallb-controller pod's name, error: ...

In such cases, ensure that the occne-metallb-controller is accessible.
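
For example, you can confirm that the pod exists and is in the Running state with all containers ready:
$ kubectl -n occne-infra get pod | grep metallb-cont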

Unable to add a test record

While performing a validation, the validation script creates and removes a test record using the API to check if the Local DNS feature is working properly. You can encounter the following error when the script is unable to create a test record:
Beginning local DNS validation
 - Getting the occne-metallb-controller pod's name.
 - Validating occne.ini.
 - Adding DNS A record.
Unable to continue - err: Failed to add DNS entry.
The following table lists some of the common issues why the script can fail to add a test record, along with the resolutions:

Table 7-16 Validation Script Errors and Resolutions

Issue: The script was previously run and interrupted before it finished. The script possibly created a test record the previous time it was run unsuccessfully. When the script is run again, it tries to create a duplicate test record and fails.
Error Message: Cannot add a duplicate record. Test record: name:occne.dns.local.com, ip-address: 10.0.0.3
Resolution: Delete the existing test record from the system and rerun the validation script.

Issue: A record similar to the test record is added manually.
Error Message: Cannot add a duplicate record. Test record: name:occne.dns.local.com, ip-address: 10.0.0.3
Resolution: Delete the existing test record from the system and rerun the validation script.

Issue: Local DNS API is not available.
Error Message: The Local DNS API is not running or is in an error state.
Resolution: Validate if the Local DNS feature is enabled properly. For more information, see Troubleshooting Local DNS API.

Issue: Local DNS API returns 50X status code.
Error Message: Kubernetes Admin Configmap missing or misconfigured.
Resolution: Check if Kubernetes admin.conf is properly set to allow the API to interact with Kubernetes.

Note:

The name and ip-address of the test record are managed by the script. Use these details for validation purposes only.

Unable to reload configuration

You can encounter the following error if the validation script fails to reload the core DNS configuration:
Beginning local DNS validation
 - Getting the occne-metallb-controller pod's name.
 - Validating occne.ini.
 - Adding DNS A record.
 - Adding DNS SRV record.
 - Reloading local coredns.
  - Error while trying to reload the local coredns, error: .... # Reason Omitted

In such cases, analyze the cause of the issue using the Local DNS logs. For more information, see Troubleshooting Local DNS API.

Other miscellaneous errors

If you encounter other miscellaneous errors (such as "unable to remove record"), follow the steps in the Troubleshooting Local DNS API section to generate logs and analyze the issue.

7.4 Managing the Kubernetes Cluster

This section provides instructions on how to manage the Kubernetes Cluster.

7.4.1 Creating CNE Cluster Backup

This section describes the procedure to create a backup of CNE cluster data using the createClusterBackup.py script.

Critical CNE data can be damaged or lost during a fault recovery scenario. Therefore, it is advised to take a backup of your CNE cluster data regularly. These backups can be used to restore your CNE cluster when the cluster data is lost or damaged.

Backing up a CNE cluster data involves the following steps:
  1. Backing up Bastion Host data
  2. Backing up Kubernetes data using Velero

The createClusterBackup.py script is used to back up both the Bastion Host data and the Kubernetes data.

Prerequisites

Before creating CNE cluster backup, ensure that the following prerequisites are met:

  • Velero must be activated successfully. For Velero installation procedure, see Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
  • Velero v1.10.0 server must be installed and running.
  • Velero CLI for v1.10.0 must be installed and running.
  • boto3 python module must be installed. For more information, see the "Configuring PIP Repository" section in the Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
  • The S3 Compatible Object Storage Provider must be configured and ready to be used.
  • The following S3 related credentials must be available:
    • Endpoint Url
    • Access Key Id
    • Secret Access Key
    • Region Name
    • Bucket Name
  • An external S3 compatible data store to store backup data must have been configured while installing CNE.
  • The cluster must be in a good state, that is, all content included in the following namespaces must be up and running:
    • occne-infra
    • cert-manager
    • kube-system
    • rook-ceph (for bare metal)
    • istio-system
  • All bastion-controller and lb-controller PVCs must be in "Bound" status.
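
The following illustrative commands can be run from the Bastion Host to spot-check several of these prerequisites; adjust them to your environment as needed:
$ velero version
$ python3 -c "import boto3; print(boto3.__version__)"
$ kubectl -n occne-infra get pvc | grep -E 'bastion-controller|lb-controller'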

Note:

  • This procedure creates only a CNE cluster backup that contains the Bastion Host data and the Kubernetes data.
  • For Kubernetes, this procedure creates the backup content included in the following namespaces only:
    • occne-infra
    • cert-manager
    • kube-system
    • rook-ceph (for bare metal)
    • istio-system
  • You must take the bastion backup in the ACTIVE bastion only.
7.4.1.1 Creating a Backup of Bastion Host and Kubernetes Data

This section describes the procedure to back up the Bastion Host and Kubernetes data using the createClusterBackup.py script.

Procedure
  1. Run the following command to verify if you are currently on an active Bastion. If you are not, log in to an active Bastion and continue this procedure.
    $ is_active_bastion
    Sample output:
    IS active-bastion
  2. Use the following commands to run the createClusterBackup.py script:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts
    $ ./backup/createClusterBackup.py
    Sample output:
    Initializing cluster backup occne-cluster-20230717-183615
     
    No /var/occne/cluster/occne-cluster/artifacts/backup/cluster_backups_log.json log file, creating new one
    Creating bastion backup: 'occne-cluster-20230717-183615'
     
    Successfully created bastion backup
     
    GENERATED LOG FILE AT: /var/occne/cluster/occne-cluster/createBastionBackup-20230717-183615.log
     
    Creating velero backup: 'occne-cluster-20230717-183615'
     
    Successfully created velero backup
     
    Successfully created cluster backup
    GENERATED LOG FILE AT: /var/occne/cluster/occne-cluster/createClusterBackup.py-20230717-183615.log
  3. If the createClusterBackup.py script fails due to a missing boto3 library, then perform the following steps to add your proxy and install boto3. Otherwise, move to Step 4.
    1. Run the following commands to install boto3 library:
      export http_proxy=YOUR_PROXY
      export https_proxy=$http_proxy
      export HTTP_PROXY=$http_proxy
      export HTTPS_PROXY=$http_proxy
       
       
      pip3 install boto3

      While installing boto3 library, you may see a warning regarding the versions of dependencies. You can ignore the warning as the boto3 library can work without these dependencies.

    2. Once you install boto3 library, run the following commands to unset the proxy:
      unset HTTP_PROXY
      unset https_proxy
      unset http_proxy
      unset HTTPS_PROXY
  4. Navigate to the /home/cloud-user directory and verify if the backup tar file is generated.
  5. Log in to your S3 cloud storage and verify if the Bastion Host data is uploaded successfully.
7.4.1.2 Verifying Backup in S3 Bucket

This section describes the procedure to verify the CNE cluster data backup in S3 bucket.

S3 bucket contains two folders:
  • bastion-data-backups: for storing Bastion backup
  • velero-backup: for storing Velero backup
  1. Verify if the Bastion Host data is stored as a .tar file in the {BUCKET_NAME}/bastion-data-backups/{CLUSTER-NAME}/{BACKUP_NAME} folder, where {CLUSTER-NAME} is the name of the cluster and {BACKUP_NAME} is the name of the backup.
  2. Verify if the Velero Kubernetes backup is stored in the {BUCKET_NAME}/velero-backup/{BACKUP_NAME}/ folder, where {BACKUP_NAME} is the name of the backup.

    Caution:

    The velero-backup folder must not be modified manually as this folder is managed by Velero. Modifying the folder can corrupt the structure or files.

    For information about restoring CNE cluster from a backup, see "Restoring CNE from Backup" in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
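
    In addition to checking the bucket directly, you can list and inspect the Velero backups from the Bastion Host using the Velero CLI, for example:
    $ velero backup get
    $ velero backup describe <BACKUP_NAME>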

7.4.2 Renewing Kubernetes Certificates

The Kubernetes certificates used in your CNE cluster are valid for a period of one year. These certificates include various important files that secure the communication within your cluster, such as the API server certificate, the etcd certificate, and the controller manager certificate. To maintain the security and operation of your CNE Kubernetes cluster, it is important to keep these certificates updated. The certificates are renewed automatically during a CNE upgrade. If you have not performed a CNE upgrade in the last year, you must run this procedure to renew your certificates for the continued operation of the CNE Kubernetes cluster.

Introduction

Kubernetes uses many different TLS certificates to secure access to internal services. These certificates are renewed automatically during an upgrade. However, if an upgrade is not performed regularly, these certificates may expire and cause the Kubernetes cluster to fail. To avoid this situation, follow the procedure below to renew all the certificates used by Kubernetes. This procedure can also be used to renew expired certificates and restore access to the Kubernetes cluster.

List of K8s internal certificates

The following table contains the list of Kubernetes (K8s) internal certificates and their validity.

Table 7-17 K8s internal certificates and their validity

The following list groups the certificates by node type, component, and file type. The validity, in years, is shown in parentheses where it is specified.

K8s Controller node
  • calico
    - .crt files: /etc/calico/certs/ca_cert.crt (100), /etc/calico/certs/cert.crt (100)
    - .pem files: /etc/calico/certs/key.pem
  • etcd
    - .key files: /etc/kubernetes/ssl/etcd/ca.key, /etc/kubernetes/ssl/etcd/peer.key, /etc/kubernetes/ssl/etcd/server.key, /etc/kubernetes/ssl/etcd/healthcheck-client.key, /etc/kubernetes/ssl/apiserver-etcd-client.key
    - .crt files: /etc/pki/ca-trust/source/anchors/etcd-ca.crt (100), /etc/kubernetes/ssl/etcd/ca.crt (10), /etc/kubernetes/ssl/etcd/server.crt (1), /etc/kubernetes/ssl/etcd/healthcheck-client.crt (1), /etc/kubernetes/ssl/apiserver-etcd-client.crt (1), /etc/kubernetes/ssl/etcd/peer.crt (1)
    - .pem files: /etc/ssl/etcd/ssl/admin-<node_name>-key.pem, /etc/ssl/etcd/ssl/admin-<node_name>.pem (100), /etc/ssl/etcd/ssl/ca-key.pem, /etc/ssl/etcd/ssl/ca.pem (100)
    - .conf files: /etc/ssl/etcd/openssl.conf
  • kubernetes
    - .key files: /etc/kubernetes/ssl/ca.key, /etc/kubernetes/ssl/apiserver.key, /etc/kubernetes/ssl/apiserver-kubelet-client.key, /etc/kubernetes/ssl/front-proxy-ca.key, /etc/kubernetes/ssl/front-proxy-client.key, /etc/kubernetes/ssl/sa.key
    - .crt files: /etc/kubernetes/ssl/ca.crt (10), /etc/kubernetes/ssl/apiserver.crt (1), /etc/kubernetes/ssl/apiserver-kubelet-client.crt (1), /etc/kubernetes/ssl/front-proxy-ca.crt (10), /etc/kubernetes/ssl/front-proxy-client.crt (1)
    - .conf files: /etc/kubernetes/admin.conf, /etc/kubernetes/kubelet.conf, /etc/kubernetes/controller-manager.conf, /etc/kubernetes/scheduler.conf

K8s Worker node
  • calico
    - .crt files: /etc/calico/certs/ca_cert.crt (10), /etc/calico/certs/cert.crt (10)
    - .pem files: /etc/calico/certs/key.pem
  • etcd
    - .crt files: /etc/pki/ca-trust/source/anchors/etcd-ca.crt (100)
    - .pem files: /etc/ssl/etcd/ssl/ca.pem (100), /etc/ssl/etcd/ssl/node-<node_name>-key.pem, /etc/ssl/etcd/ssl/node-<node_name>.pem (100)
  • kubernetes
    - .crt files: /etc/kubernetes/ssl/ca.crt (10)
    - .conf files: /etc/kubernetes/kubeadm-client.conf, /etc/kubernetes/bootstrap-kubelet.conf, /etc/kubernetes/kubelet.conf
You can use the above table to keep a record of the Kubernetes certificates and their validity details. Update the validity values as per your own certificates' validity.

Prerequisites

Caution:

It is very important to run this procedure on each controller node and verify that the certificates are renewed successfully to avoid cluster failures. The controller nodes are the orchestrators and maintainers of the metadata of all objects and components of the cluster. If you do not run this procedure on all the controller nodes and the certificates expire, the integrity of the cluster and the applications that are deployed on the cluster is put at risk. The communication between the internal components is lost, resulting in a total cluster failure. In such a case, a recovery of each controller node or, in the worst scenario, a recovery of the complete cluster is required.

Checking the Certificate Expiry

Log in to any Kubernetes controller node and run the following command to verify the expiration date for the Kubernetes certificates:
$ sudo su
# export PATH=$PATH:/usr/local/bin
# kubeadm certs check-expiration
Sample output:
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W0227 14:21:24.800867  107604 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]
 
CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Feb 24, 2023 00:00 UTC   <invalid>       ca                      no
apiserver                  Feb 24, 2023 00:00 UTC   <invalid>       ca                      no
apiserver-kubelet-client   Feb 24, 2023 00:00 UTC   <invalid>       ca                      no
controller-manager.conf    Feb 24, 2023 00:00 UTC   <invalid>       ca                      no
front-proxy-client         Feb 24, 2023 00:00 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             Feb 24, 2023 00:00 UTC   <invalid>       ca                      no
 
CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Feb 18, 2033 16:35 UTC   9y              no
front-proxy-ca          Feb 18, 2033 16:35 UTC   9y              no
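
If kubeadm is not available, you can also inspect an individual certificate directly with openssl. For example, to display the expiry date of the API server certificate listed in Table 7-17:
# openssl x509 -in /etc/kubernetes/ssl/apiserver.crt -noout -enddate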

Procedure

  1. To renew the certificates, log in to each controller node and run the following steps.
    1. Run the following commands as a root user:
      
      sudo su
      export PATH=$PATH:/usr/local/bin
    2. Take a backup of the ssl directory:
      cp -r /etc/kubernetes/ssl /etc/kubernetes/ssl_backup
    3. Renew all kubeadm certificates:
      kubeadm certs renew all
      Sample output:
      [renew] Reading configuration from the cluster...
      [renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
      W1105 16:27:03.069743  752969 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]
       
      certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
      certificate for serving the Kubernetes API renewed
      certificate for the API server to connect to kubelet renewed
      certificate embedded in the kubeconfig file for the controller manager to use renewed
      certificate for the front proxy client renewed
      certificate embedded in the kubeconfig file for the scheduler manager to use renewed
      Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.
    4. Update the file on the default location with the renewed certificates:
      # cp /etc/kubernetes/admin.conf /root/.kube/config
    5. Restart the static pods so that they can use the new certificates:
      # kubectl delete pod/kube-apiserver-$HOSTNAME -n kube-system
      # kubectl delete pod/kube-controller-manager-$HOSTNAME -n kube-system
      # kubectl delete pod/kube-scheduler-$HOSTNAME -n kube-system
      # systemctl restart kubelet
    6. As an alternative to the previous step, you can use the crictl command to restart the pods related to the kubelet service. This option allows you to delete the pods without restarting the kubelet service. Perform the following steps:
      1. Run the following command to list the pods:
        [root@occne2-k8s-ctrl-1 ~]# crictl ps
        Sample output:
        CONTAINER           IMAGE               CREATED             STATE        NAME                                 ATTEMPT     POD ID              POD
        6c64a4f1e31ba       3f1feef5d13e2       3 minutes ago       Running      kube-controller-manager              3           7a8257e016fce       kube-controller-manager-k8s-ctrl-1
        f62e5a1571e70       1dfd746c32714       3 minutes ago       Running      kube-scheduler                       3           a136c832d6af1       kube-scheduler-occne2-k8s-ctrl-1
        bb4df2896f5c4       0137c32dad849       3 minutes ago       Running      kube-apiserver                       750         7a988edce05ca       kube-apiserver-occne2-k8s-ctrl-1
        3dbf50a1c7a14       5bae806f8f123       2 weeks ago         Running      node-cache                           0           2a2c27aa8b811       nodelocaldns-2f8j9
        807dd81b678b9       fb25fc1345285       2 weeks ago         Running      openstack-cloud-controller-manager   0           31ed0a3507462       openstack-cloud-controller-manager-5jdwf
        52c219441e86b       54637cb36d4a1       2 weeks ago         Running      calico-node                          0           eb81082dcf4c9       calico-node-sdtlx
        589278b841a68       8ef37ea6581b5       2 weeks ago         Running      kube-proxy                           0           26c8593266ed9       kube-proxy-6kk5b
      2. Identify the kube-scheduler, kube-apiserver, and kube-controller-manager pods. You can display these pods using the pods sub-command:
        [root@occne2-k8s-ctrl-1 ~]# crictl pods
        Sample output:
        POD ID              CREATED             STATE               NAME                                                        NAMESPACE           ATTEMPT             RUNTIME
        a136c832d6af1       3 minutes ago       Ready               kube-scheduler-occne2-k8s-ctrl-1                            kube-system         1                   (default)
        7a988edce05ca       3 minutes ago       Ready               kube-apiserver-occne2-k8s-ctrl-1                            kube-system         2                   (default)
        7a8257e016fce       3 minutes ago       Ready               kube-controller-manager-occne2-k8s-ctrl-1                   kube-system         1                   (default)
        2a2c27aa8b811       2 weeks ago         Ready               nodelocaldns-2f8j9                                          kube-system         0                   (default)
        31ed0a3507462       2 weeks ago         Ready               openstack-cloud-controller-manager-5jdwf                    kube-system         0                   (default)
        eb81082dcf4c9       2 weeks ago         Ready               calico-node-sdtlx                                           kube-system         0                   (default)
        26c8593266ed9       2 weeks ago         Ready               kube-proxy-6kk5b                                            kube-system         0                   (default)
      3. Run the following command to remove the pods identified in the previous step:
        [root@occne2-k8s-ctrl-1 ~]# crictl rmp a136c832d6af1 7a988edce05ca 7a8257e016fce
        Sample output:
        pod sandbox "7a8257e016fce" is running, please stop it first
        pod sandbox "a136c832d6af1" is running, please stop it first
        pod sandbox "7a988edce05ca" is running, please stop it first
      4. If you get a message stating that the pods are running, then stop the pods using the following command and rerun the remove command as shown in the next step:
        [root@occne2-k8s-ctrl-1 ~]# crictl stopp a136c832d6af1 7a988edce05ca 7a8257e016fce
        Sample output:
        Stopped sandbox a136c832d6af1
        Stopped sandbox 7a988edce05ca
        Stopped sandbox 7a8257e016fce
      5. Run the following command to remove the pods:
        [root@occne2-k8s-ctrl-1 ~]# crictl rmp a136c832d6af1 7a988edce05ca 7a8257e016fce
        Sample output:
        Removed sandbox a136c832d6af1
        Removed sandbox 7a8257e016fce
        Removed sandbox 7a988edce05ca
        
      6. Wait for the pods to restart, and then run the following command to verify that the pods are restarted. You can check the pods' time of creation (CREATED) to confirm the restart.
        [root@occne2-k8s-ctrl-1 ~]# crictl pods
        Sample output:
        POD ID              CREATED             STATE               NAME                                                        NAMESPACE           ATTEMPT             RUNTIME
        894bc1da7b74f       24 seconds ago      Ready               kube-scheduler-occne2-k8s-ctrl-1                            kube-system         2                   (default)
        db6813f7fd012       24 seconds ago      Ready               kube-apiserver-occne2-k8s-ctrl-1                            kube-system         3                   (default)
        c30216ce82ee7       24 seconds ago      Ready               kube-controller-manager-occne2-k8s-ctrl-1                   kube-system         2                   (default)
        2a2c27aa8b811       2 weeks ago         Ready               nodelocaldns-2f8j9                                          kube-system         0                   (default)
        31ed0a3507462       2 weeks ago         Ready               openstack-cloud-controller-manager-5jdwf                    kube-system         0                   (default)
        eb81082dcf4c9       2 weeks ago         Ready               calico-node-sdtlx                                           kube-system         0                   (default)
        26c8593266ed9       2 weeks ago         Ready               kube-proxy-6kk5b                                            kube-system         0                   (default)
  2. Run the following command to verify that the containers are restarted so that the new certificates take effect:
    # kubectl get pod -n kube-system | egrep -i 'kube-apiserver-*|kube-controller-manager*|kube-scheduler-*'
    Sample output:
    kube-apiserver-occne3-cgbu-cne-dbtier-k8s-ctrl-1            1/1     Running   0          23h
    kube-apiserver-occne3-cgbu-cne-dbtier-k8s-ctrl-2            1/1     Running   1          23h
    kube-apiserver-occne3-cgbu-cne-dbtier-k8s-ctrl-3            0/1     Running   1          14s
    kube-controller-manager-occne3-cgbu-cne-dbtier-k8s-ctrl-1   1/1     Running   1          23h
    kube-controller-manager-occne3-cgbu-cne-dbtier-k8s-ctrl-2   1/1     Running   2          23h
    kube-controller-manager-occne3-cgbu-cne-dbtier-k8s-ctrl-3   0/1     Running   0          13s
    kube-scheduler-occne3-cgbu-cne-dbtier-k8s-ctrl-1            1/1     Running   0          23h
    kube-scheduler-occne3-cgbu-cne-dbtier-k8s-ctrl-2            1/1     Running   2          23h
    kube-scheduler-occne3-cgbu-cne-dbtier-k8s-ctrl-3            0/1     Running   1          13s
  3. Run the following command to validate the renewal of the certificates for the controller node:
    # kubeadm certs check-expiration
    Sample output:
    [check-expiration] Reading configuration from the cluster...
    [check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
    W0227 14:32:27.788539  113098 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]
     
    CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
    admin.conf                 Feb 25, 2024 16:35 UTC   359d            ca                      no
    apiserver                  Feb 25, 2024 16:35 UTC   359d            ca                      no
    apiserver-kubelet-client   Feb 25, 2024 16:35 UTC   359d            ca                      no
    controller-manager.conf    Feb 25, 2024 16:35 UTC   359d            ca                      no
    front-proxy-client         Feb 25, 2024 16:35 UTC   359d            front-proxy-ca          no
    scheduler.conf             Feb 25, 2024 16:35 UTC   359d            ca                      no
     
    CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
    ca                      Feb 18, 2033 16:35 UTC   9y              no
    front-proxy-ca          Feb 18, 2033 16:35 UTC   9y              no
  4. Repeat Step 1 to Step 3 on each controller node.
  5. Update the admin.conf config file on Bastion Host and the lb-controller-admin configmap:
    1. Log in to the k8s-controller node:
      $ ssh <k8s-ctrl-node>
    2. Copy the /etc/kubernetes/admin.conf config file from the controller node to the artifacts directory on the Bastion Host:

      Note:

      If you are on a BareMetal deployment, use admusr as the user name instead of <OCCNE_USER> in the following command.
      $ sudo scp /etc/kubernetes/admin.conf <OCCNE_USER>@<OCCNE_CLUSTER>-bastion-1:/var/occne/cluster/<OCCNE_CLUSTER>/artifacts
    3. Update the server address to https://lb-apiserver.kubernetes.local:6443 in the admin.conf file:
      $ sed -i 's#https://127.0.0.1:6443#https://lb-apiserver.kubernetes.local:6443#' /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/admin.conf
    4. If you are using a vCNE deployment, delete the existing lb-controller-admin secret. Otherwise, move to the next step.
      $ kubectl -n occne-infra delete secret lb-controller-admin-config
    5. Recreate the new lb-controller-admin secret from the new admin.conf file.

      Note:

      Ensure that you replace ${OCCNE_CLUSTER} with the correct value according to your system.
      $ kubectl -n occne-infra create secret generic lb-controller-admin-config --from-file=/var/occne/cluster/${OCCNE_CLUSTER}/artifacts/admin.conf
    6. Patch the lb-controller-admin-config secret:
      $ echo -n "$(kubectl get secret lb-controller-admin-config -n occne-infra -o jsonpath='{.data.admin\.conf}' | base64 -d | sed 's#https://lb-apiserver.kubernetes.local:6443#https://kubernetes.default:443#g')" | base64 -w0 | xargs -I{} kubectl -n occne-infra patch secret lb-controller-admin-config --patch '{"data":{"admin.conf":"{}"}}'
      
    7. If you are using a vCNE deployment, restart the lb-controller pod:
      $ kubectl scale deployment/occne-lb-controller-server -n occne-infra --replicas=0
      $ kubectl scale deployment/occne-lb-controller-server -n occne-infra --replicas=1
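
      As an optional sanity check, you can confirm that the patched secret points to the in-cluster API server address. The following command is illustrative and uses the same secret and namespace as the previous steps:
      $ kubectl -n occne-infra get secret lb-controller-admin-config -o jsonpath='{.data.admin\.conf}' | base64 -d | grep 'server:'
      The server address is expected to show https://kubernetes.default:443.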

7.4.3 Renewing the Kubernetes Secrets Encryption Key

This section describes the procedure to renew the key that is used to encrypt the Kubernetes Secrets stored in the CNE Kubernetes cluster.

Procedure

Note:

Secret encryption is enabled by default during CNE installation or upgrade.

The key that is used to encrypt Kubernetes Secrets does not expire. However, it is recommended to change the encryption key periodically to ensure the security of your Kubernetes Secrets. If you think that your key is compromised, you must change the encryption key immediately.

To renew a Kubernetes Secrets encryption key, perform the following steps:

  1. From the Bastion Host, run the following commands:
    $ NEW_KEY=$(head -c 32 /dev/urandom | base64)
    $ KEY_NAME=$(cat /dev/random | tr -dc '[:alnum:]' | head -c 10)
    $ kubectl get nodes | awk '/control-plane/ {print $1}' | xargs -I{} ssh {} " sudo sed -i '/keys:$/a\        - name: key_$KEY_NAME\n\          secret: $NEW_KEY' /etc/kubernetes/ssl/secrets_encryption.yaml; sudo cat /etc/kubernetes/ssl/secrets_encryption.yaml"
    

    This creates a random encryption key with a random key name, and adds it to the /etc/kubernetes/ssl/secrets_encryption.yaml file within each controller node. The output shows the new encryption key, the key name, and the contents of the /etc/kubernetes/ssl/secrets_encryption.yaml file.

    Sample Output:
    kind: EncryptionConfig
    apiVersion: v1
    resources:
      - resources:
        - secrets
     
        providers:
        - secretbox:
            keys:
            - name: key_ZOJ1Hf5OCx
              secret: l+CaDTmMkC85LwJRiWJ0LQPYVtOyZ0TdtNZ2ij+kuGA=
            - name: key
              secret: ZXJ1Ulk2U0xSbWkwejdreTlJWkFrZmpJZjhBRzg4U00=
        - identity: {}
  2. Restart the API server by running the following command. This ensures that the new key is used when the secrets are re-encrypted in the next step:
    kubectl get nodes | awk '/control-plane/ {print $1}' | xargs -I{} ssh {} " sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml ~; sleep 2; sudo mv ~/kube-apiserver.yaml /etc/kubernetes/manifests"
    
  3. To encrypt all the existing secrets with a new key, run the following command:
    kubectl get secrets --all-namespaces -o json | kubectl replace -f -
    Sample output:
    secret/occne-cert-manager-webhook-ca replaced
    secret/sh.helm.release.v1.occne-cert-manager.v1 replaced
    secret/istio-ca-secret replaced
    secret/cloud-config replaced
    secret/external-openstack-cloud-config replaced
    secret/occne-kyverno-svc.kyverno.svc.kyverno-tls-ca replaced
    secret/occne-kyverno-svc.kyverno.svc.kyverno-tls-pair replaced
    secret/sh.helm.release.v1.occne-kyverno-policies.v1 replaced
    secret/sh.helm.release.v1.occne-kyverno.v1 replaced
    secret/alertmanager-occne-kube-prom-stack-kube-alertmanager replaced
    secret/etcd-occne6-j-jorge-l-lopez-k8s-ctrl-1 replaced
    secret/etcd-occne6-j-jorge-l-lopez-k8s-ctrl-2 replaced
    secret/etcd-occne6-j-jorge-l-lopez-k8s-ctrl-3 replaced
    secret/lb-controller-user replaced
    secret/occne-alertmanager-snmp-notifier replaced
    secret/occne-kube-prom-stack-grafana replaced
    secret/occne-kube-prom-stack-kube-admission replaced
    secret/occne-kube-prom-stack-kube-prometheus-scrape-confg replaced
    secret/occne-metallb-memberlist replaced
    secret/occne-tracer-jaeger-elasticsearch replaced
    secret/prometheus-occne-kube-prom-stack-kube-prometheus replaced
    secret/prometheus-occne-kube-prom-stack-kube-prometheus-tls-assets-0 replaced
    secret/prometheus-occne-kube-prom-stack-kube-prometheus-web-config replaced
    secret/sh.helm.release.v1.occne-alertmanager-snmp-notifier.v1 replaced
    secret/sh.helm.release.v1.occne-bastion-controller.v1 replaced
    secret/sh.helm.release.v1.occne-fluentd-opensearch.v1 replaced
    secret/sh.helm.release.v1.occne-kube-prom-stack.v1 replaced
    secret/sh.helm.release.v1.occne-lb-controller.v1 replaced
    secret/sh.helm.release.v1.occne-metallb.v1 replaced
    secret/sh.helm.release.v1.occne-metrics-server.v1 replaced
    secret/sh.helm.release.v1.occne-opensearch-client.v1 replaced
    secret/sh.helm.release.v1.occne-opensearch-dashboards.v1 replaced
    secret/sh.helm.release.v1.occne-opensearch-data.v1 replaced
    secret/sh.helm.release.v1.occne-opensearch-master.v1 replaced
    secret/sh.helm.release.v1.occne-promxy.v1 replaced
    secret/sh.helm.release.v1.occne-tracer.v1 replaced
    secret/webhook-server-cert replaced
    Error from server (Conflict): error when replacing "STDIN": Operation cannot be fulfilled on secrets "alertmanager-occne-kube-prom-stack-kube-alertmanager-generated": the object has been modified; please apply your changes to the latest version and try again
    Error from server (Conflict): error when replacing "STDIN": Operation cannot be fulfilled on secrets "alertmanager-occne-kube-prom-stack-kube-alertmanager-tls-assets-0": the object has been modified; please apply your changes to the latest version and try again
    Error from server (Conflict): error when replacing "STDIN": Operation cannot be fulfilled on secrets "alertmanager-occne-kube-prom-stack-kube-alertmanager-web-config": the object has been modified; please apply your changes to the latest version and try again

    Note:

    You may see some errors on the output depending on how the secret is created. You can ignore these errors and verify the encrypted secret using the following step.
  4. To verify if the new key is used for encrypting the existing secrets, run the following command from a controller node. Replace <cert pem file>, <key pem file> and <secret> in the following command with the corresponding values.
    sudo ETCDCTL_API=3 /usr/local/bin/etcdctl --cert /etc/ssl/etcd/ssl/<cert pem file> --key /etc/ssl/etcd/ssl/<key pem file> get /registry/secrets/default/<secret> -w fields | grep Value
    
    Example:
    [cloud-user@occne3-user-k8s-ctrl-3 ~]$ sudo ETCDCTL_API=3 /usr/local/bin/etcdctl --cert /etc/ssl/etcd/ssl/node-occne3-user-k8s-ctrl-1.pem --key /etc/ssl/etcd/ssl/node-occne3-user-k8s-ctrl-1-key.pem get /registry/secrets/default/secret1 -w fields | grep Value
    "Value" : "k8s:enc:secretbox:v1:key_ZOJ1Hf5OCx:&9\x90\u007f'*6\x0e\xf8]\x98\xd7t1\xa9|\x90\x93\x88\xebc\xa9\xfe\x82<\xebƞ\xaa\x17$\xa4\x14%m\xb7<\x1d\xf7N\b\xa7\xbaZ\xb0\xd4#\xbev)\x1bv9\x19\xdel\xab\x89@\xe7\xaf$L\xb8)\xc9\x1bl\x13\xc1V\x1b\xf7\bX\x88\xe7\ue131\x1dG\xe2_\x04\xa2\xf1n\xf5\x1dP\\4\xe7)^\x81go\x99\x98b\xbb\x0eɛ\xc0R;>աj\xeeV54\xac\x06̵\t\x1b9\xd5N\xa77\xd9\x03㵮\x05\xfb%\xa1\x81\xd5\x0e \xcax\xc4\x1cz6\xf3\xd8\xf9?Щ\x9a%\x9b\xe5\xa7й\xcd!,\xb8\x8b\xc2\xcf\xe2\xf2|\x8f\x90\xa9\x05y\xc5\xfc\xf7\x87\xf9\x13\x0e4[i\x12\xcc\xfaR\xdf3]\xa2V\x1b\xbb\xeba6\x1c\xba\v\xb0p}\xa5;\x16\xab\x8e\xd5Ol\xb7\x87BW\tY;寄ƻ\xcaċ\x87Y;\n;/\xf2\x89\xa1\xcc\xc3\xc9\xe3\xc5\v\x1b\x88\x84Ӯ\xc6\x00\xb4\xed\xa5\xe2\xfa\xa9\xff \xd9kʾ\xf2\x04\x8f\x81,l"

    This example shows a new key, key_ZOJ1Hf5OCx, being used to encrypt secret1.
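
    As a final check, you can confirm that the secrets remain readable through the API server after the key rotation, for example:
    $ kubectl get secrets --all-namespaces | head
    If none of the configured keys can decrypt a secret, such read operations fail with an error.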

7.4.4 Removing a Kubernetes Controller Node

This section describes the procedure to remove a controller node from the CNE Kubernetes cluster in a vCNE deployment.

Note:

  • A controller node must be removed from the cluster only when it is required for maintenance.
  • This procedure is applicable for vCNE (OpenStack and VMWare) deployments only.
  • This procedure is applicable for removing a single controller node only.
7.4.4.1 Removing a Controller Node in OpenStack Deployment

This section describes the procedure to remove a single controller node from the CNE Kubernetes cluster in an OpenStack deployment.

Procedure
  1. Locate the controller node internal IP address by running the following command from the Bastion Host:
    $ kubectl get nodes -o wide | egrep control |  awk '{ print $1, $2, $6}'
    For example:
    [cloud-user@occne7-test-bastion-1 ~]$ kubectl get node -o wide | egrep control |  awk '{ print $1, $2, $6}'
    
    Sample output:
    
    occne7-test-k8s-ctrl-1 NotReady 192.168.201.158
    occne7-test-k8s-ctrl-2 Ready 192.168.203.194
    occne7-test-k8s-ctrl-3 Ready 192.168.200.115

    Note that the status of controller node 1 is NotReady in the sample output.

  2. Run the following commands to backup the terraform.tfstate file:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}
    $ cp terraform.tfstate ${OCCNE_CLUSTER}/terraform.tfstate.backup
  3. From the Bastion Host, use SSH to log in to a working controller node and run the following commands to list the etcd members:
    $ ssh <working control node hostname>
    # sudo su
    # source /etc/etcd.env
    # /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
    
    For example:
    $ ssh occne7-test-k8s-ctrl-2
    
    [cloud-user@occne7-test-k8s-ctrl-2]$ sudo su
    
    [root@occne7-test-k8s-ctrl-2 cloud-user]# source /etc/etcd.env
    
    [root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
    52513ddd2aa49770, started, etcd1, https://192.168.201.158:2380, https://192.168.201.158:2379, false
    80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false
    f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
    1. From the output, identify the etcd (etcd1, etcd2, or etcd3) to which the failed controller node belongs.
    2. Copy the controller node ID that is displayed in the first column of the output to be used later in the procedure.
  4. If the failed controller node is reachable, use SSH to log in to the failed controller node from the Bastion Host and stop etcd service by running the following commands:
    $ ssh <failed control node hostname>
     
    $ sudo systemctl stop etcd
    Example:
    $ ssh occne7-test-k8s-ctrl-1
     
    $ sudo systemctl stop etcd
  5. From the Bastion Host, use SSH to log in to a working controller node and remove the failed controller node from the etcd member list:
    $ ssh <working control node hostname>
    $ sudo su
    $ source /etc/etcd.env
    $ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member remove <failed control node ID>
    
    Example:
    [root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member remove 52513ddd2aa49770
    
    Sample output:
    Member 52513ddd2aa49770 removed from cluster f347ab69786ba4f7
  6. Validate if the failed node is removed from the etcd member list:
    $ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
    
    For example:
    [root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
    80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false
    f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
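
    Optionally, before continuing, you can confirm that the remaining etcd members are healthy. The following illustrative command reuses the environment variables sourced earlier in this procedure:
    $ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} endpoint health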
  7. From the Bastion Host, switch the controller nodes in terraform.tfstate by running the following commands:

    Note:

    Perform this step only if the failed controller node is an etcd1 member.
    $ cd /var/occne/cluster/$OCCNE_CLUSTER
    $ cp terraform.tfstate terraform.tfstate.original
    $ python3 scripts/switchTfstate.py
    For example:
    [cloud-user@occne7-test-bastion-1]$ python3 scripts/switchTfstate.py
    Sample output:
    Beginning tfstate switch order k8s control nodes
     
            terraform.tfstate.lastversion created as backup
     
    Controller Nodes order before rotation:
    occne7-test-k8s-ctrl-1
    occne7-test-k8s-ctrl-2
    occne7-test-k8s-ctrl-3
     
    Controller Nodes order after rotation:
    occne7-test-k8s-ctrl-2
    occne7-test-k8s-ctrl-3
    occne7-test-k8s-ctrl-1
     
    Success: terraform.tfstate rotated for cluster occne7-test
  8. Remove the failed controller node from the cluster by performing one of the following steps on the Bastion Host, depending on whether the failed controller node is reachable:
    • If the failed controller node is reachable, run the following commands to remove the controller node from the cluster:
      $ kubectl cordon <failed control node hostname>
      $ kubectl drain <failed control node hostname> --force --ignore-daemonsets  --delete-emptydir-data
      $ kubectl delete node <failed control node hostname>
       
      Example:
      [cloud-user@occne7-test-bastion-1]$ kubectl cordon occne7-test-k8s-ctrl-1
      [cloud-user@occne7-test-bastion-1]$ kubectl drain occne7-test-k8s-ctrl-1 --force --ignore-daemonsets  --delete-emptydir-data
      [cloud-user@occne7-test-bastion-1]$ kubectl delete node occne7-test-k8s-ctrl-1
    • If the failed controller node is not reachable, run the following commands to remove the controller node from the cluster:
      $ kubectl cordon <failed control node hostname>
      $ kubectl delete node <failed control node hostname>
      Example:
      [cloud-user@occne7-test-bastion-1]$ kubectl cordon occne7-test-k8s-ctrl-1
      [cloud-user@occne7-test-bastion-1]$ kubectl delete node occne7-test-k8s-ctrl-1
  9. Verify if the failed controller node is deleted from cluster.
    $ kubectl get node
    Sample output:
    [cloud-user@occne7-test-bastion-1]$ kubectl get node
    NAME                   STATUS ROLES                AGE VERSION
    occne7-test-k8s-ctrl-2 Ready  control-plane,master 82m v1.23.7
    occne7-test-k8s-ctrl-3 Ready  control-plane,master 82m v1.23.7
    occne7-test-k8s-node-1 Ready  <none>               81m v1.23.7
    occne7-test-k8s-node-2 Ready  <none>               81m v1.23.7
    occne7-test-k8s-node-3 Ready  <none>               81m v1.23.7
    occne7-test-k8s-node-4 Ready  <none>               81m v1.23.7

    Note:

    If you are not able to run kubectl commands from the Bastion Host, update the /var/occne/cluster/$OCCNE_CLUSTER/artifacts/admin.conf file with the new working node IP address:
    vi /var/occne/cluster/occne7-test/artifacts/admin.conf
     server: https://192.168.203.194:6443
  10. Delete the failed controller node's instance using the OpenStack GUI:
    1. Log in to OpenStack cloud using your credentials.
    2. From the Compute menu, select Instances, and locate the failed controller node's instance that you want to delete, as shown in the following image:

      OpenStack_Locate_Instance

    3. On the instance record, click the drop-down option in the Actions column, select Delete Instance to delete the failed controller node's instance, as shown in the following image:

      OpenStack_Delete_Instance

7.4.4.2 Removing a Controller Node in VMware Deployment

This section describes the procedure to remove a single controller node from the CNE Kubernetes cluster in a VMware deployment.

Procedure
  1. Locate the controller node internal IP address by running the following command from the Bastion Host:
    $ kubectl get node -o wide | egrep ctrl |  awk '{ print $1, $2, $6}'
    For example:
    [cloud-user@occne7-test-bastion-1 ~]$ kubectl get node -o wide | egrep ctrl |  awk '{ print $1, $2, $6}'
    
    Sample output:
    
    occne7-test-k8s-ctrl-1 NotReady 192.168.201.158
    occne7-test-k8s-ctrl-2 Ready 192.168.203.194
    occne7-test-k8s-ctrl-3 Ready 192.168.200.115

    Note that the status of controller node 1 is NotReady in the sample output.

  2. Backup the terraform.tfstate file by running the following commands:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}
    $ cp terraform.tfstate ${OCCNE_CLUSTER}/terraform.tfstate.backup
  3. On the Bastion Host, use SSH to log in to a working controller node and run the following commands to list the etcd members:
    $ ssh <working control node hostname>
    # sudo su
    # source /etc/etcd.env
    # /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
    For example:
    $ ssh occne7-test-k8s-ctrl-2
      
    [cloud-user@occne7-test-k8s-ctrl-2]$ sudo su
      
    [root@occne7-test-k8s-ctrl-2 cloud-user]# source /etc/etcd.env
      
    [root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
    Sample output:
    52513ddd2aa49770, started, etcd1, https://192.168.201.158:2380, https://192.168.201.158:2379, false
    80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false
    f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
    1. From the output, identify the etcd (etcd1, etcd2, or etcd3) to which the failed controller node belongs.
    2. Copy the controller node ID that is displayed in the first column of the output to be used later in the procedure.
  4. If the failed controller node is reachable, use SSH to log in to the failed controller node from the Bastion Host and stop etcd service by running the following commands:
    $ ssh <failed control node hostname>
     
    $ sudo systemctl stop etcd
    For example:
    $ ssh occne7-test-k8s-ctrl-1
     
    $ sudo systemctl stop etcd
  5. From the Bastion Host, use SSH to log in to a working controller node and remove the failed controller node from the etcd member list:
    $ ssh <working control node hostname>
    $ sudo su
    $ source /etc/etcd.env
    $ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member remove <failed control node ID>
    
    For example:
    [root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member remove 52513ddd2aa49770
    Sample output:
    Member 52513ddd2aa49770 removed from cluster f347ab69786ba4f7
  6. Validate if the failed node is removed from the etcd member list:
    $ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
    
    For example:
    [root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=${ETCD_PEER_TRUSTED_CA_FILE} --cert=${ETCD_CERT_FILE} --key=${ETCD_KEY_FILE} member list
    
    Sample output:
    80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false
    f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
  7. From the Bastion Host, switch the controller nodes in terraform.tfstate by running the following commands:

    Note:

    Perform this step only if the failed controller node is an etcd1 member.
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}
    $ cp terraform.tfstate terraform.tfstate.original
    $ python3 scripts/switchTfstate.py
    For example:
    [cloud-user@occne7-test-bastion-1]$ python3 scripts/switchTfstate.py
    Sample output:
    Beginning tfstate switch order k8s control nodes
     
            terraform.tfstate.lastversion created as backup
     
    Controller Nodes order before rotation:
    occne7-test-k8s-ctrl-1
    occne7-test-k8s-ctrl-2
    occne7-test-k8s-ctrl-3
     
    Controller Nodes order after rotation:
    occne7-test-k8s-ctrl-2
    occne7-test-k8s-ctrl-3
    occne7-test-k8s-ctrl-1
     
    Success: terraform.tfstate rotated for cluster occne7-test
  8. Remove the failed controller node from the cluster by performing one of the following steps on the Bastion Host, depending on whether the failed controller node is reachable:
    • If the failed controller node is reachable, run the following commands to remove the controller node from the cluster:
      $ kubectl cordon <failed control node hostname>
      $ kubectl drain <failed control node hostname> --force --ignore-daemonsets  --delete-emptydir-data
      $ kubectl delete node <failed control node hostname>
       
      For example:
      [cloud-user@occne7-test-bastion-1]$ kubectl cordon occne7-test-k8s-ctrl-1
      [cloud-user@occne7-test-bastion-1]$ kubectl drain occne7-test-k8s-ctrl-1 --force --ignore-daemonsets  --delete-emptydir-data
      [cloud-user@occne7-test-bastion-1]$ kubectl delete node occne7-test-k8s-ctrl-1
    • If the failed controller node is not reachable, run the following commands to remove the controller node from the cluster:
      $ kubectl cordon <failed control node hostname>
      $ kubectl delete node <failed control node hostname>
      Example:
      [cloud-user@occne7-test-bastion-1]$ kubectl cordon occne7-test-k8s-ctrl-1
      [cloud-user@occne7-test-bastion-1]$ kubectl delete node occne7-test-k8s-ctrl-1
  9. Verify if the failed controller node is deleted from cluster.
    $ kubectl get node
    Sample output:
    NAME                   STATUS ROLES                AGE VERSION
    occne7-test-k8s-ctrl-2 Ready  control-plane,master 82m v1.23.7
    occne7-test-k8s-ctrl-3 Ready  control-plane,master 82m v1.23.7
    occne7-test-k8s-node-1 Ready  <none>               81m v1.23.7
    occne7-test-k8s-node-2 Ready  <none>               81m v1.23.7
    occne7-test-k8s-node-3 Ready  <none>               81m v1.23.7
    occne7-test-k8s-node-4 Ready  <none>               81m v1.23.7

    Note:

    If you are not able to run kubectl commands from the Bastion Host, update the /var/occne/cluster/$OCCNE_CLUSTER/artifacts/admin.conf file with the new working node IP address:
    vi /var/occne/cluster/occne7-test/artifacts/admin.conf
    
    server: https://192.168.203.194:6443
  10. Delete the failed controller node's VM using the VMWare GUI:
    1. Log in to VMware cloud using your credentials.
    2. From the Compute menu, select Virtual Machines, and locate the failed controller node's VM to delete, as shown in the following image:

      VMWare_Locate_VM

    3. From the Actions menu, select Delete to delete the failed controller node's VM, as shown in the following image:

      VMWare_Delete_VM

7.4.5 Adding a Kubernetes Worker Node

This section provides the procedure to add additional worker nodes to a previously installed CNE Kubernetes cluster.

Note:

  • For a BareMetal installation, ensure that you are familiar with the inventory file preparation procedure. For more information about this procedure, see the "Inventory File Preparation" section in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
  • Run this procedure from the active Bastion Host only.
  • You can add only one node at a time using this procedure.

Adding a Kubernetes Worker Node on BareMetal

Note:

For any failure or successful run, the system maintains all Terraform and pipeline output in the /var/occne/cluster/${OCCNE_CLUSTER}/addWrkNodeCapture-<mmddyyyy_hhmmss>.log file.
  1. Log in to Bastion Host and verify if it's an active Bastion Host. If the Bastion Host isn't an active Bastion Host, then log in to another.
    Use the following command to check if the Bastion Host is an active Bastion Host:
    $ is_active_bastion
    The system displays the following output if the Bastion Host is an active Bastion Host:

    IS active-bastion

    The system displays the following output if the Bastion Host isn't an active Bastion Host:

    NOT active-bastion

  2. Run the following command to navigate to the cluster directory:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/
  3. Run the following command to open the host.ini file in edit mode:
    $ vi host.ini
  4. Add the following line containing the node details under the [host_hp_gen_X] or [host_netra_X] hardware header:
    [host_hp_gen_10]/[host_netra_X]
    <NODE_FULL_NAME> ansible_host=<ipv4> hp_ilo=<ipv4> mac=<mac-address> pxe_config_ks_nic=<nic0> pxe_config_nic_list=<nic0>,<nic1>,<nic2> pxe_uefi=False
    where, <NODE_FULL_NAME> is the full name of the node that is added.

    Note:

    • <NODE_FULL_NAME>, ansible_host, hp_ilo or netra_ilom, and mac are required parameters and their values must be unique in the host.ini file.
    • <mac-address> must be a string of six two-digit hexadecimal numbers separated by dashes. For example, a2-27-3d-d3-b4-00. (An optional format-check sketch is provided after the example below.)
    • All IP addresses must be in proper IPv4 format.
    • pxe_config_ks_nic, pxe_config_nic_list, and pxe_uefi are optional parameters. The node details can also contain other optional parameters that are not listed here.
    • All required and optional parameters must be in the <KEY>=<VALUE> format, with no spaces around the equal sign.
    • All required parameters must have valid values.
    • Comments must be added on a separate line using # and must not be added at the end of a node entry line.

    For example, the following code block shows the node details of a worker node (k8s-node-5.test.us.oracle.com) added under the [host_hp_gen_10] hardware header:

    ...
    [host_hp_gen_10]
    k8s-host-1.test.us.oracle.com ansible_host=179.1.5.2 hp_ilo=172.16.9.44 mac=a2-27-3d-d3-b4-00 oam_host=10.75.216.13
    k8s-host-2.test.us.oracle.com ansible_host=179.1.5.3 hp_ilo=172.16.9.45 mac=4d-d9-1a-e2-7e-e8 oam_host=10.75.216.14
    k8s-host-3.test.us.oracle.com ansible_host=179.1.5.4 hp_ilo=172.16.9.46 mac=e1-15-b4-1d-32-10
     
    k8s-node-1.test.us.oracle.com ansible_host=179.1.5.5 hp_ilo=172.16.9.47 mac=3b-d2-2d-f6-1e-20
    k8s-node-2.test.us.oracle.com ansible_host=179.1.5.6 hp_ilo=172.16.9.48 mac=a8-1a-37-b1-c0-dc
    k8s-node-3.test.us.oracle.com ansible_host=179.1.5.7 hp_ilo=172.16.9.49 mac=a4-be-2d-3f-21-f0
    k8s-node-4.test.us.oracle.com ansible_host=179.1.5.8 hp_ilo=172.16.9.35 mac=3a-d9-2c-e6-35-18
    k8s-node-5.test.us.oracle.com ansible_host=179.1.5.9 hp_ilo=172.16.9.46 mac=2a-e1-c3-d4-12-a9
    ...
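
    Before saving the file, you can optionally check the MAC and IP formats described in the note above. The following is a hypothetical format-check sketch (not part of the CNE tooling); replace the sample values with the ones you are adding:
    $ MAC=2a-e1-c3-d4-12-a9
    $ echo "${MAC}" | grep -Eq '^([0-9a-f]{2}-){5}[0-9a-f]{2}$' && echo "MAC format OK" || echo "Invalid MAC format"
    $ IP=179.1.5.9
    $ echo "${IP}" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' && echo "IPv4 format OK" || echo "Invalid IPv4 format"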
  5. Add the full name of the node under the [kube-node] header.
    [kube-node]
    <NODE_FULL_NAME>

    where, <NODE_FULL_NAME> is the full name of the node that is added.

    For example, the following code block shows the full name of the worker node (k8s-node-5.test.us.oracle.com) added under the [kube-node] header:
    ...
    [kube-node]
    k8s-node-1.test.us.oracle.com
    k8s-node-2.test.us.oracle.com
    k8s-node-3.test.us.oracle.com
    k8s-node-4.test.us.oracle.com
    k8s-node-5.test.us.oracle.com
    ...
  6. Save the host.ini file and exit.
  7. Navigate to the maintenance directory:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/maintenance
  8. The addBmWorkerNode.py script in the maintenance directory is used to add a Kubernetes worker node on BareMetal. Run the following command to add one worker node at a time:
    $ ./addBmWorkerNode.py -nn <NODE_FULL_NAME>

    where, <NODE_FULL_NAME> is the full name of the node that you added to the host.ini file in the previous steps.

    For example:
    $ ./addBmWorkerNode.py -nn k8s-5.test.us.oracle.com
    Sample output:
    Beginning add worker node: k8s-5.test.us.oracle.com
     - Backing up configuration files
     - Verify hosts.ini values
     - Updating OS on new node
        - Successfully run Provisioning pipeline - check /var/occne/cluster/test/addBmWkrNodeCapture-05192023_145823.log for details.
     - Scaling new node into cluster
        - Successfully run k8_install scale playbook - check /var/occne/cluster/test/addBmWkrNodeCapture-05192023_145823.log for details.
     - Updating /etc/hosts on all nodes with new node
        - Successfully updated file: /etc/hosts on all servers - check /var/occne/cluster/test/addBmWkrNodeCapture-05192023_145823.log for details.
     - Running verification
     - Restarting rook-ceph operator
       - rook-ceph pods ready!
    Worker node: k8s-5.test.us.oracle.com added successfully
  9. Run the following commands to verify if the node is added successfully:
    1. Run the following command and verify if the new node is in the Ready state:
      $ kubectl get nodes
      Sample output:
      NAME                              STATUS   ROLES           AGE     VERSION
      k8s-master-1.test.us.oracle.com   Ready    control-plane   7d15h   v1.25.6
      k8s-master-2.test.us.oracle.com   Ready    control-plane   7d15h   v1.25.6
      k8s-master-3.test.us.oracle.com   Ready    control-plane   7d15h   v1.25.6
      k8s-node-1.test.us.oracle.com     Ready    <none>          7d15h   v1.25.6
      k8s-node-2.test.us.oracle.com     Ready    <none>          7d15h   v1.25.6
      k8s-node-4.test.us.oracle.com     Ready    <none>          7d15h   v1.25.6
      k8s-node-5.test.us.oracle.com     Ready    <none>          14m     v1.25.6
    2. Run the following command and verify if all pods are in the Running or Completed state:
      $ kubectl get pod -A
    3. Run the following command and verify if the services are running and the service GUIs are reachable:
      $ kubectl get svc -A
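
    As an additional check, you can list only the pods scheduled on the new worker node. This is a minimal sketch that assumes the node name used in the earlier steps:
      $ kubectl get pods -A -o wide --field-selector spec.nodeName=k8s-node-5.test.us.oracle.com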

Adding a Kubernetes Worker Node on vCNE (OpenStack and VMware)

Note:

For any failure or successful run, the system maintains all Terraform and pipeline output in the /var/occne/cluster/${OCCNE_CLUSTER}/addWrkNodeCapture-<mmddyyyy_hhmmss>.log file.
  1. Log in to a Bastion Host and ensure that all the pods are in the Running or Completed state:
    $ kubectl get pod -A
  2. Verify if the services are reachable and if the common services GUIs are accessible using the LoadBalancer EXTERNAL-IPs:
    $ kubectl get svc -A | grep LoadBalancer
    $ curl <svc_external_ip>
  3. Navigate to the cluster directory:
    $ cd /var/occne/cluster/$OCCNE_CLUSTER/
  4. Run the following command to open the $OCCNE_CLUSTER/cluster.tfvars file. Search for the number_of_k8s_nodes parameter in the file and increment the value of the parameter by 1.
    $ vi $OCCNE_CLUSTER/cluster.tfvars
    The following example shows the current value of number_of_k8s_nodes set to 5:
    ...
    # k8s nodes
    #
    number_of_k8s_nodes = 5
    ...
    The following example shows the value of number_of_k8s_nodes incremented by 1 to 6.
    ...
    # k8s nodes
    #
    number_of_k8s_nodes = 6
    ...
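    To confirm the value that is currently set before and after the edit, a simple check such as the following can be used:
    $ grep number_of_k8s_nodes $OCCNE_CLUSTER/cluster.tfvars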
  5. For OpenStack, perform this step to source the openrc.sh file and establish a connection between the Bastion Host and the OpenStack cloud. The openrc.sh file sets the necessary environment variables for OpenStack. For VMware, skip this step and move to the next step.
    1. Source the openrc.sh file.
      $ source openrc.sh
    2. Enter the OpenStack username and password when prompted.
      The following block shows the username and password prompt displayed by the system:
      Please enter your OpenStack Username:
       Please enter your OpenStack Password as <username>:

      Note:

      Once you source the file, ensure that the openstack-cacert.pem file exists in the same folder and the file is populated with appropriate certificates if TLS is supported.
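
      After sourcing, a quick sanity check such as the following can be used. This is a minimal sketch; the OS_* variable names are the standard OpenStack client environment variables:
      $ env | grep ^OS_
      $ ls -l openstack-cacert.pem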
  6. Run the addWorkerNode.py script:

    Note:

    The system backs up a number of files, such as lbvm/lbCtrlData.json, cluster.tfvars, hosts.ini, terraform.tfstate (renamed to terraform.tfstate.ORIG), and /etc/hosts, into the /var/occne/cluster/${OCCNE_CLUSTER}/backUpConfig directory. These files are backed up only once to preserve the original files.
    $ ./scripts/addWorkerNode.py

    Example for OpenStack deployment:

    $ ./scripts/addWorkerNode.py
    Sample output:
    Starting addWorkerNode instance for the last worker node.
     
     - Backing up configuration files...
     
     - Checking if cluster.tfvars matches with the terraform state...
       Succesfully checked the number_of_k8s_nodes parameter in the cluster.tfvars file.
     
     - Running terraform apply to update its state...
       Successfully applied Openstack terraform apply - check /var/occne/cluster/occne7-devansh-m-marwaha/addWkrNodeCapture-11282023_090910.log for details
     
     - Running pipeline.sh for provision - can take considerable time to complete...
       Successfully run Provisioning pipeline - check /var/occne/cluster/occne7-devansh-m-marwaha/addWkrNodeCapture-11282023_090910.log for details.
     
     - Running pipeline.sh for k8s_install - can take considerable time to complete...
       Successfully run K8s pipeline - check /var/occne/cluster/occne7-devansh-m-marwaha/addWkrNodeCapture-11282023_090910.log for details.
     
     - Get IP address for the new worker node...
       Successfully retrieved IP address of the new worker node.
     
     - Get name for the new worker node...
       Successfully retrieved the name of the new worker node.
     
     - Update /etc/hosts files on all previous servers...
       Successfully updated file: /etc/hosts on all servers - check /var/occne/cluster/occne7-devansh-m-marwaha/addWkrNodeCapture-11282023_090910.log for details.
     
     - Update lbCtrlData.json file...
       Successfully updated file: /var/occne/cluster/occne7-devansh-m-marwaha/lbvm/lbCtrlData.json.
     
     - Update lb-controller-ctrl-data and lb-controller-master-ip configmap...
       Successfully created configmap lb-controller-ctrl-data.
       Successfully created configmap lb-controller-master-ip.
     
     - Deleting LB Controller POD: occne-lb-controller-server-5d75bb744d-dmr6x to bind in configmaps...
       Successfully restarted deployment occne-lb-controller-server.
       Waiting for occne-lb-controller-server deployment to return to Running status.
       Deployment "occne-lb-controller-server" successfully rolled out
     
     - Update servers from new occne-lb-controller pod...
       Successfully updated LBVMs haproxy.cfg with new worker nodes.
     
    Worker node successfully added to cluster: occne7-devansh-m-marwaha
    Example for VMware deployment:
    $ ./scripts/addWorkerNode.py
    Sample output:
    Starting addWorkerNode instance for the last worker node.
     
     - Backing up configuration files...
     
     - Checking if cluster.tfvars matches with the terraform state...
       Succesfully checked the number_of_k8s_nodes parameter in the cluster.tfvars file.
     
     - Running terraform apply to update its state...
       VmWare terraform apply -refresh-only successful - check /var/occne/cluster/occne5-chandrasekhar-musti/addWkrNodeCapture-11282023_115313.log for details.
       VmWare terraform apply successful - node - check /var/occne/cluster/occne5-chandrasekhar-musti/addWkrNodeCapture-11282023_115313.log for details.
     
     - Get name for the new worker node...
       Successfully retrieved the name of the new worker node.
     
     - Running pipeline.sh for provision - can take considerable time to complete...
       Successfully run Provisioning pipeline - check /var/occne/cluster/occne5-chandrasekhar-musti/addWkrNodeCapture-11282023_115313.log for details.
     
     - Running pipeline.sh for k8s_install - can take considerable time to complete...
       Successfully run K8s pipeline - check /var/occne/cluster/occne5-chandrasekhar-musti/addWkrNodeCapture-11282023_115313.log for details.
     
     - Get IP address for the new worker node...
       Successfully retrieved IP address of the new worker node occne5-chandrasekhar-musti-k8s-node-4.
     
     - Update /etc/hosts files on all previous servers...
       Successfully updated file: /etc/hosts on all servers - check /var/occne/cluster/occne5-chandrasekhar-musti/addWkrNodeCapture-11282023_115313.log for details.
     
     - Update lbCtrlData.json file...
       Successfully updated file: /var/occne/cluster/occne5-chandrasekhar-musti/lbvm/lbCtrlData.json.
     
     - Update lb-controller-ctrl-data and lb-controller-master-ip configmap...
       Successfully created configmap lb-controller-ctrl-data.
       Successfully created configmap lb-controller-master-ip.
     
     - Deleting LB Controller POD: occne-lb-controller-server-5d8cd867b7-s5gb2 to bind in configmaps...
       Successfully restarted deployment occne-lb-controller-server.
       Waiting for occne-lb-controller-server deployment to return to Running status.
       Deployment "occne-lb-controller-server" successfully rolled out
     
     - Update servers from new occne-lb-controller pod...
       Successfully updated server list for each service in haproxy.cfg on LBVMs with new node: occne5-chandrasekhar-musti-k8s-node-4.
     
    Worker node successfully added to cluster: occne5-chandrasekhar-musti
  7. If there's a failure in Step 6, perform the following steps to rerun the addWorkerNode.py script:
    1. Copy backup files to the original files:
      $ cp /var/occne/cluster/${OCCNE_CLUSTER}/backupConfig/cluster.tfvars ${OCCNE_CLUSTER}/cluster.tfvars
      $ cp /var/occne/cluster/${OCCNE_CLUSTER}/backupConfig/lbCtrlData.json lbvm/lbCtrlData.json
      $ sudo cp /var/occne/cluster/${OCCNE_CLUSTER}/backupConfig/hosts /etc/hosts
    2. If you ran Podman commands before the failure, then drain the new node before rerunning the script:
      $ kubectl drain --ignore-daemonsets <worker_node_hostname>
      For example:
      $ kubectl drain --ignore-daemonsets ${OCCNE_CLUSTER}-k8s-node-5
    3. Rerun the addWorkerNode.py script:
      $ ./scripts/addWorkerNode.py
    4. After rerunning the addWorkerNode.py script, uncordon the nodes:
      $ kubectl uncordon <new node>
      For example:
      $ kubectl uncordon ${OCCNE_CLUSTER}-k8s-node-5
  8. Verify the nodes, pods, and services:
    1. Verify if the new nodes are in Ready state by running the following command:
      $ kubectl get nodes
    2. Verify if all pods are in the Running or Completed state by running the following command:
      $ kubectl get pod -A -o wide
    3. Run the following command and verify if the services are running and the service GUIs are reachable:
      $ kubectl get svc -A
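
    Additionally, you can confirm that the new worker node's IP address was added to the LB controller database. The sqlite3 query is the same one used in the worker node removal procedure later in this section; extracting the pod name with awk is an addition for clarity:
      $ LB_POD=$(kubectl -n occne-infra get pods | grep occne-lb-controller-server | awk '{print $1}' | head -1)
      $ kubectl exec -it ${LB_POD} -n occne-infra -- /bin/bash -c "sqlite3 /data/sqlite/db/lbCtrlData.db 'SELECT * FROM nodeIps;'"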

7.4.6 Removing a Kubernetes Worker Node

This section describes the procedure to remove a worker node from the CNE Kubernetes cluster after the original CNE installation. This procedure is used to remove a worker node that is unreachable (crashed or powered off), or that is up and running.

Procedure

Note:

  • This procedure is used to remove only one node at a time. If you want to remove multiple nodes, then perform this procedure on each node.
  • Removing multiple worker nodes can cause unwanted side effects such as increasing the overall load of your cluster. Therefore, before removing multiple nodes, make sure that there is enough capacity left in the cluster.
  • CNE requires a minimum of three worker nodes to properly run some of the common services such as OpenSearch, the Bare Metal Rook Ceph cluster, and any daemonsets that require three or more replicas.
  • For a vCNE deployment, this procedure is used to remove only the last worker node in the Kubernetes cluster. Therefore, refrain from using this procedure to remove any other worker node.
Remove a Kubernetes Worker Node in OpenStack and VMware Deployment

Note:

For any failure or successful run, the system maintains all terraform and pipeline output in the /var/occne/cluster/${OCCNE_CLUSTER}/removeWrkNodeCapture-<mmddyyyy_hhmmss>.log file.
  1. Log in to a Bastion Host and verify the following:
    1. Run the following command to verify if all pods are in the Running or Completed state:
      $ kubectl get pod -A
      Sample output:
      NAMESPACE       NAME                                                            READY   STATUS                  RESTARTS         AGE
      cert-manager    occne-cert-manager-6dcffd5b9-jpzmt                              1/1     Running                 1 (3h17m ago)    4h56m
      cert-manager    occne-cert-manager-cainjector-5d6bccc77d-f4v56                  1/1     Running                 2 (3h15m ago)    3h48m
      cert-manager    occne-cert-manager-webhook-b7f4b7bdc-rg58k                      0/1     Completed               0                3h39m
      cert-manager    occne-cert-manager-webhook-b7f4b7bdc-tx7gz                      1/1     Running                 0                3h17m
      ...
    2. Run the following command to verify if the service LoadBalancer IPs are reachable and common service GUIs are running:
      $ kubectl get svc -A | grep LoadBalancer
      Sample output:
      occne-infra    occne-kibana                                     LoadBalancer   10.233.36.151   10.75.180.113   80:31659/TCP                                    4h57m
      occne-infra    occne-kube-prom-stack-grafana                    LoadBalancer   10.233.63.254   10.75.180.136   80:32727/TCP                                    4h56m
      occne-infra    occne-kube-prom-stack-kube-alertmanager          LoadBalancer   10.233.32.135   10.75.180.204   80:30155/TCP                                    4h56m
      occne-infra    occne-kube-prom-stack-kube-prometheus            LoadBalancer   10.233.3.37     10.75.180.126   80:31964/TCP                                    4h56m
      occne-infra    occne-promxy-apigw-nginx                         LoadBalancer   10.233.42.250   10.75.180.4     80:30100/TCP                                    4h56m
      occne-infra    occne-tracer-jaeger-query                        LoadBalancer   10.233.4.43     10.75.180.69    80:32265/TCP,16687:30218/TCP                    4h56m
  2. Navigate to the /var/occne/cluster/${OCCNE_CLUSTER}/ directory:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/
  3. Open the $OCCNE_CLUSTER/cluster.tfvars file and decrement the value of the number_of_k8s_nodes field by 1:
    $ vi $OCCNE_CLUSTER/cluster.tfvars
    The following example shows the current value of number_of_k8s_nodes set to 6:
    ...
    # k8s nodes
    #
    number_of_k8s_nodes = 6
    ...
    The following example shows the value of number_of_k8s_nodes decremented by 1 to 5:
    ...
    # k8s nodes
    #
    number_of_k8s_nodes = 5
    ...
  4. For OpenStack, perform this step to establish a connection between the Bastion Host and the OpenStack cloud. For VMware, skip this step and move to the next step.
    Source the openrc.sh file. Enter the OpenStack username and password when prompted. The openrc.sh file sets the necessary environment variables for OpenStack. Once you source the file, ensure that the openstack-cacert.pem file exists in the same folder and is populated with the appropriate certificates if TLS is supported:
    $ source openrc.sh
    The following block shows the username and password prompt displayed by the system:
    Please enter your OpenStack Username:
     Please enter your OpenStack Password as <username>:
     Please enter your OpenStack Domain:
  5. Run the following command to get the list of nodes:
    $ kubectl get nodes -o wide | grep -v control-plane
    Sample output:
    NAME     STATUS   ROLES           AGE     VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                  KERNEL-VERSION                    CONTAINER-RUNTIME
    occne6-my-cluster-k8s-node-1   Ready    <none>          6d23h   v1.25.6   192.168.201.183   <none>        Oracle Linux Server 8.7   5.4.17-2136.316.7.el8uek.x86_64   containerd://1.6.15
    occne6-my-cluster-k8s-node-2   Ready    <none>          6d23h   v1.25.6   192.168.201.136   <none>        Oracle Linux Server 8.7   5.4.17-2136.316.7.el8uek.x86_64   containerd://1.6.15
    occne6-my-cluster-k8s-node-3   Ready    <none>          6d23h   v1.25.6   192.168.201.131   <none>        Oracle Linux Server 8.7   5.4.17-2136.316.7.el8uek.x86_64   containerd://1.6.15
    occne6-my-cluster-k8s-node-4   Ready    <none>          6d23h   v1.25.6   192.168.200.100   <none>        Oracle Linux Server 8.7   5.4.17-2136.316.7.el8uek.x86_64   containerd://1.6.15
  6. Run the following command to obtain the worker node IPs and verify if the worker node IPs match with the list obtained in Step 5:
    $ kubectl exec -it $(kubectl -n occne-infra get pods | grep occne-lb-controller-server) -n occne-infra -- /bin/bash -c "sqlite3 /data/sqlite/db/lbCtrlData.db 'SELECT * FROM nodeIps;'"
    Sample output:
    192.168.201.183
    192.168.201.136
    192.168.201.131
    192.168.200.100
  7. Run the removeWorkerNode.py script.

    Note:

    The system backs up the lbvm/lbCtrlData.json, cluster.tfvars, hosts.ini, terraform.tfstate, and /etc/hosts files into the /var/occne/cluster/${OCCNE_CLUSTER}/backUpConfig directory. These files are backed up only once to preserve the original files.
    $ ./scripts/removeWorkerNode.py
    Example for OpenStack deployment:
    $ ./scripts/removeWorkerNode.py
    Sample output:
    Starting removeWorkerNode instance for the last worker node.
     
     - Backing up configuration files...
     
     - Checking if cluster.tfvars matches with the terraform state...
       Succesfully checked the number_of_k8s_nodes parameter in the cluster.tfvars file.
     
     - Getting the IP address for the worker node to be deleted...
       Successfully gathered occne7-devansh-m-marwaha-k8s-node-4's ip: 192.168.200.105.
     
     - Draining  node - can take considerable time to complete...
       Successfully drained occne7-devansh-m-marwaha-k8s-node-4 node.
     
     - Removing  node from the cluster...
       Successfully removed occne7-devansh-m-marwaha-k8s-node-4 from the cluster.
     
     - Running terraform apply to update its state...
       Successfully applied Openstack terraform apply - check /var/occne/cluster/occne7-devansh-m-marwaha/removeWkrNodeCapture-11282023_090320.log for details
     
     - Updating /etc/hosts on all servers...
       Successfully updated file: /etc/hosts on all servers - check /var/occne/cluster/occne7-devansh-m-marwaha/removeWkrNodeCapture-11282023_090320.log for details.
     
     - Updating lbCtrlData.json file...
       Successfully updated file: /var/occne/cluster/occne7-devansh-m-marwaha/lbvm/lbCtrlData.json.
     
     - Updating lb-controller-ctrl-data and lb-controller-master-ip configmap...
       Successfully created configmap lb-controller-ctrl-data.
       Successfully created configmap lb-controller-master-ip.
     
     - Deleting LB Controller POD: occne-lb-controller-server-fc869755-lm4hd to bind in configmaps...
       Successfully restarted deployment occne-lb-controller-server.
       Waiting for occne-lb-controller-server deployment to return to Running status.
       Deployment "occne-lb-controller-server" successfully rolled out
     
     - Update servers from new occne-lb-controller pod...
       Successfully removed the node: occne7-devansh-m-marwaha-k8s-node-4 from server list for each service in haproxy.cfg on LBVMs.
     
    Worker node successfully removed from cluster: occne7-devansh-m-marwaha
    Example for VMware deployment:
    ./scripts/removeWorkerNode.py
    Sample output:
    Starting removeWorkerNode instance for the last worker node.
       Successfully obtained index 3 from node occne5-chandrasekhar-musti-k8s-node-4.
     
     - Backing up configuration files...
     
     - Checking if cluster.tfvars matches with the terraform state...
       Succesfully checked the number_of_k8s_nodes parameter in the cluster.tfvars file.
     
     - Getting the IP address for the worker node to be deleted...
       Successfully gathered occne5-chandrasekhar-musti-k8s-node-4's ip: 192.168.1.15.
     
     - Draining  node - can take considerable time to complete...
       Successfully drained occne5-chandrasekhar-musti-k8s-node-4 node.
     
     - Removing  node from the cluster...
       Successfully removed occne5-chandrasekhar-musti-k8s-node-4 from the cluster.
     
     - Running terraform apply to update its state...
       Successfully applied VmWare terraform apply - check /var/occne/cluster/occne5-chandrasekhar-musti/removeWkrNodeCapture-11282023_105101.log for details.
     
     - Updating /etc/hosts on all servers...
       Successfully updated file: /etc/hosts on all servers - check /var/occne/cluster/occne5-chandrasekhar-musti/removeWkrNodeCapture-11282023_105101.log for details.
     
     - Updating lbCtrlData.json file...
       Successfully updated file: /var/occne/cluster/occne5-chandrasekhar-musti/lbvm/lbCtrlData.json.
     
     - Updating lb-controller-ctrl-data and lb-controller-master-ip configmap...
       Successfully created configmap lb-controller-ctrl-data.
       Successfully created configmap lb-controller-master-ip.
     
     - Deleting LB Controller POD: occne-lb-controller-server-7b894fb6b5-5cr8g to bind in configmaps...
       Successfully restarted deployment occne-lb-controller-server.
       Waiting for occne-lb-controller-server deployment to return to Running status.
       Deployment "occne-lb-controller-server" successfully rolled out
     
     - Update servers from new occne-lb-controller pod...
       Successfully removed the node: occne5-chandrasekhar-musti-k8s-node-4 from server list for each service in haproxy.cfg on LBVMs.
     
    Worker node successfully removed from cluster: occne5-chandrasekhar-musti
  8. Verify if the specified node is removed:
    1. Run the following command to list the worker nodes:
      $ kubectl get nodes -o wide
      Sample output:
      NAME                             STATUS   ROLES                  AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                  KERNEL-VERSION                      CONTAINER-RUNTIME
      occne6-my-cluster-k8s-ctrl-1   Ready    control-plane,master   6d23h v1.25.6   192.168.203.106   <none>        Oracle Linux Server 8.7   5.4.17-2136.316.7.el8uek.x86_64   containerd://1.6.15
      occne6-my-cluster-k8s-ctrl-2   Ready    control-plane,master   6d23h v1.25.6   192.168.202.122   <none>        Oracle Linux Server 8.7   5.4.17-2136.316.7.el8uek.x86_64   containerd://1.6.15
      occne6-my-cluster-k8s-ctrl-3   Ready    control-plane,master   6d23h v1.25.6   192.168.202.248   <none>        Oracle Linux Server 8.7   5.4.17-2136.316.7.el8uek.x86_64   containerd://1.6.15
      occne6-my-cluster-k8s-node-1   Ready    <none>                 6d23h v1.25.6   192.168.201.183   <none>        Oracle Linux Server 8.7   5.4.17-2136.316.7.el8uek.x86_64   containerd://1.6.15
      occne6-my-cluster-k8s-node-2   Ready    <none>                 6d23h v1.25.6   192.168.201.136   <none>        Oracle Linux Server 8.7   5.4.17-2136.316.7.el8uek.x86_64   containerd://1.6.15
      occne6-my-cluster-k8s-node-3   Ready    <none>                 6d23h v1.25.6   192.168.201.131   <none>        Oracle Linux Server 8.7   5.4.17-2136.316.7.el8uek.x86_64   containerd://1.6.15
    2. Run the following command and check if the targeted worker node is removed:
      $ kubectl exec -it $(kubectl -n occne-infra get pods | grep occne-lb-controller-server) -n occne-infra -- /bin/bash -c "sqlite3 /data/sqlite/db/lbCtrlData.db 'SELECT * FROM nodeIps;'"
      Sample output:
      192.168.201.183
      192.168.201.136
      192.168.201.131
Remove a Kubernetes Worker Node in Bare Metal Deployment

Note:

For any failure or successful run, the system maintains all pipeline outputs in the /var/occne/cluster/${OCCNE_CLUSTER}/removeWrkNodeCapture-<mmddyyyy_hhmmss>.log file. The system displays other outputs, messages, or errors directly on the terminal during the runtime of the script.
  1. Log in to Bastion Host and verify if it's an active Bastion Host. If the Bastion Host isn't an active Bastion Host, then log in to the other Bastion Host.
    Use the following command to check if the Bastion Host is an active Bastion Host:
    $ is_active_bastion
    The system displays the following output if the Bastion Host is an active Bastion Host:

    IS active-bastion

    The system displays the following output if the Bastion Host isn't an active Bastion Host:

    NOT active-bastion

  2. Navigate to the /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/maintenance directory:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/maintenance
  3. Run the following command to get the list of nodes and identify the node to be removed:

    Note:

    • The node that you want to remove must be present in the list and must be a worker node.
    • The node can be in any status (Ready or NotReady).
    $ kubectl get nodes
    Sample output:
    NAME                              STATUS      ROLES           AGE     VERSION
    k8s-master-1.test.us.oracle.com   Ready       control-plane   7d15h   v1.25.6
    k8s-master-2.test.us.oracle.com   Ready       control-plane   7d15h   v1.25.6
    k8s-master-3.test.us.oracle.com   Ready       control-plane   7d15h   v1.25.6
    k8s-node-1.test.us.oracle.com     Ready       <none>          7d15h   v1.25.6
    k8s-node-2.test.us.oracle.com     Ready       <none>          7d15h   v1.25.6
    k8s-node-3.test.us.oracle.com     NotReady    <none>          7d15h   v1.25.6
    k8s-node-4.test.us.oracle.com     Ready       <none>          7d15h   v1.25.6

    For example, this procedure considers removing the k8s-node-3.test.us.oracle.com node, which is in the NotReady state.

  4. Run the following command to remove the targeted worker node:

    Note:

    • Remove one node at a time. If you want to remove multiple worker nodes, remove them one by one.
    • Removing a node can take several minutes. Once the script is run, don't cancel it in between; wait for the script to complete.
    $ ./removeBmWorkerNode.py -nn <NODE_FULL_NAME>
    where, <NODE_FULL_NAME> is the name of the node that you want to remove (in this case, k8s-3.test.us.oracle.com).
    For example:
    $ ./removeBmWorkerNode.py -nn k8s-3.test.us.oracle.com
    Sample output:
    Beginning remove worker node: k8s-3.test.us.oracle.com
     - Backing up configuration files.
        - Backup folder: /var/occne/cluster/littlefinger/backupConfig
     - Checking if the rook-ceph toolbox deployment already exists.
        - rook-ceph toolbox deployment does not exist.
        - Creating the rook-ceph toolbox deployment.
           - rook-ceph toolbox created.
     - Waiting for Toolbox pod to be Running.
        - Waiting for toolbox pod in namespace rook-ceph to be in Running state, current state:
        - ToolBox pod in namespace rook-ceph is now in Running state.
     - Getting the OSD Id.
        - Found OSD ID: 3, for worker node: k8s-3.test.us.oracle.com
     - Scaling down the rook-ceph-operator deployment.
        - rook-ceph-operator deployment scaled down.
     - Scaling down the OSD deployment.
        - rook-ceph-osd-3 deployment scaled down.
     - Purging OSD Id.
        - Unable to purge OSD.3 on attempt 1/12, retrying in 5 seconds...
        - OSD.3 purged.
        - OSD host k8s-3-test-us-oracle-com removed.
     - Deleting OSD deployment.
        - rook-ceph-osd-3 deleted.
     - Scaling up the rook-ceph-operator deployment.
        - rook-ceph-operator deployment scaled up.
     - Checking if the rook-ceph toolbox deployment exists.
        - Removing the rook-ceph toolbox deployment.
           - rook-ceph toolbox removed.
     - Removing BM worker node, could take considerable amount of time.
        - k8s-3.test.us.oracle.com is reachable
           - Removing... Please Wait. Do not cancel.
           - Removed. Check /var/occne/cluster/test/removeBmWkrNodeCapture-06162023_155106.log for details.
     - Cleanup leftover rook-ceph-mon deployment
        - Found pod rook-ceph-mon-g-64b489c6ff-h4bcr, getting deployment...
           - Found deployment rook-ceph-mon-g, deleting...
        - rook-ceph-mon cleanup completed.
    Worker node: k8s-3.test.us.oracle.com removed Successfully
  5. Run the following command to verify if the targeted worker node is removed (in this case, k8s-3.test.us.oracle.com):
    $ kubectl get nodes
    Sample output:
    NAME                              STATUS   ROLES           AGE     VERSION
    k8s-master-1.test.us.oracle.com   Ready    control-plane   7d15h   v1.25.6
    k8s-master-2.test.us.oracle.com   Ready    control-plane   7d15h   v1.25.6
    k8s-master-3.test.us.oracle.com   Ready    control-plane   7d15h   v1.25.6
    k8s-node-1.test.us.oracle.com     Ready    <none>          7d15h   v1.25.6
    k8s-node-2.test.us.oracle.com     Ready    <none>          7d15h   v1.25.6
    k8s-node-4.test.us.oracle.com     Ready    <none>          7d15h   v1.25.6

    The k8s-3.test.us.oracle.com node removed in step 4 is no longer present in the sample output.

  6. Check if the cluster is healthy by checking the pods and services:
    $ kubectl get pods -A
    $ kubectl get svc -A

    There are many ways to check the health of a cluster; checking the pods and services is one of the simplest options. You can use your preferred method to check the cluster health.
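
    For example, the following sketch surfaces pods that are not in the Running or Completed state; an empty result indicates that all pods are healthy:
    $ kubectl get pods -A --no-headers | grep -Ev 'Running|Completed'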

  7. Navigate to the /var/occne/cluster/${OCCNE_CLUSTER}/ directory:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/
  8. Edit the hosts.ini file and remove the lines containing the details of the node that is removed (in this case, k8s-3.test.us.oracle.com):
    1. Open the hosts.ini file in edit mode:
      $ vi hosts.ini
    2. Remove the lines containing the details of the node that is removed, from both the hardware header [host_hp_gen_X]/[host_netra_X] and [kube-node] header.

      Note:

      As per the example considered in this procedure, the node that is removed is k8s-node-3.test.us.oracle.com. The lines to be removed are the k8s-node-3.test.us.oracle.com entries shown in the following sample output.
      [host_hp_gen_X]/[host_netra_X] # Host header may vary depending on your hardware
      k8s-host-1.test.us.oracle.com ansible_host=179.1.5.2 hp_ilo=172.16.9.44 mac=a2-27-3d-d3-b4-00 oam_host=10.75.216.13
      k8s-host-2.test.us.oracle.com ansible_host=179.1.5.3 hp_ilo=172.16.9.45 mac=4d-d9-1a-e2-7e-e8 oam_host=10.75.216.14
      k8s-host-3.test.us.oracle.com ansible_host=179.1.5.4 hp_ilo=172.16.9.46 mac=e1-15-b4-1d-32-10
       
      k8s-node-1.test.us.oracle.com ansible_host=179.1.5.5 hp_ilo=172.16.9.47 mac=3b-d2-2d-f6-1e-20
      k8s-node-2.test.us.oracle.com ansible_host=179.1.5.6 hp_ilo=172.16.9.48 mac=a8-1a-37-b1-c0-dc
      k8s-node-3.test.us.oracle.com ansible_host=179.1.5.7 hp_ilo=172.16.9.49 mac=a4-be-2d-3f-21-f0 
      k8s-node-4.test.us.oracle.com ansible_host=179.1.5.8 hp_ilo=172.16.9.35 mac=3a-d9-2c-e6-35-18
      .
      .
      .
      [kube-node]
      k8s-node-1.test.us.oracle.com
      k8s-node-2.test.us.oracle.com
      k8s-node-3.test.us.oracle.com
      k8s-node-4.test.us.oracle.com
    3. Save the file and exit.

7.4.7 Adding a New External Network

This section provides the procedure to add a new external network that applications can use to communicate with external clients, by adding a Peer Address Pool (PAP) to a virtualized CNE (vCNE) or Bare Metal deployment after CNE installation.

The cluster must be in good working condition. All pods and services must be running and no existing LBVMs must be in a FAILED state. Use the following command to run a full cluster test before starting the procedure:
OCCNE_STAGES=(TEST) pipeline.sh
7.4.7.1 Adding a New External Network in vCNE
The following procedure provides the steps to add a single Peer Address Pool (PAP) in vCNE.
Each time the script is run, the system creates a log file with a timestamp. The format of the log file name is addPapCapture-<mmddyyyy_hhmmss>.log. For example, addPapCapture-09172021_000823.log. The log includes the output from Terraform and the pipeline call that configures the new LBVMs.
Each time the script is run, the system creates a new directory, addPapSave-<mmddyyyy-hhmmss>. The following files from the /var/occne/cluster/<cluster_name> directory are saved in the addPapSave-<mmddyyyy-hhmmss> directory:
  • lbvm/lbCtrlData.json
  • metallb.auto.tfvars
  • mb_resources.yaml
  • terraform.tfstate
  • hosts.ini
  • cluster.tfvars
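To locate the most recent capture log and save directory, a sketch such as the following can be used. It assumes that, like the other capture logs in this chapter, they are created in the cluster directory:
  $ cd /var/occne/cluster/${OCCNE_CLUSTER}
  $ ls -t addPapCapture-*.log 2>/dev/null | head -1
  $ ls -dt addPapSave-* 2>/dev/null | head -1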
Prerequisites
  1. On an OpenStack deployment, perform the following steps to source the OpenStack environment file. This step is not required for a VMware deployment as the credential settings are derived automatically.
    1. Log in to Bastion Host and change the directory to the cluster directory:
      $ cd /var/occne/cluster/${OCCNE_CLUSTER}
    2. Source the OpenStack environment file:
      $ source openrc.sh

Procedure

  1. Update the /var/occne/cluster/<cluster_name>/<cluster_name>/cluster.tfvars file to include the new pool that is required.
    For more information about the following, see Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide:
    • Updating the cluster.tfvars file.
    • Adding a Peer Address Pool (PAP) object to the existing cluster.tfvars file.
    • Adding the additional PAP name to the occne_metallb_peer_addr_pool_names list.
    • Adding the additional pool objects to the occne_metallb_list object.
  2. Run the addPeerAddrPools.py script:
    The script is located on the Bastion Host at /var/occne/cluster/<cluster_name>/scripts. The script is fully automated. When the script completes, the new LBVM pair is fully integrated into the deployment. The script doesn't require any parameters and performs all the steps required to complete the addition of the new Peer Address Pool (PAP), including a validation at the end. All output echoes back to the terminal.
    It is recommended to perform the following before running the script:
    • Start a terminal output capture process (for example, using the script command) to ensure that all data output is captured.
    • Run tail -f on the current addPapCapture log file to see how the Terraform and pipeline runs are progressing.
    1. Run the following commands to run the addPeerAddrPools.py script:
      $ cd /var/occne/cluster/<cluster_name>
      $ ./scripts/addPeerAddrPools.py

      Sample output showing a new PAP, "sig", being added. The cluster_name in the sample output is occne1-cluster:

      - Capturing current lbCtrlData.json data...
       
       - Generating new metallb.auto.tfvars file...
         Successfully generated the metallb.auto.tfvars file.
       
       - Executing terraform apply to generate new LBVM and port resources...
         . terraform apply attempt 1
         Successful execution of terraform apply -auto-approve -compact-warnings -var-file=/var/occne/cluster/occne1-cluster/occne1-cluster/cluster.tfvars
       
       - Generating new mb_resources.yaml file...
         Successfully updated the /var/occne/cluster/occne1-cluster/mb_resources.yaml file.
       
       - Applying new mb_resources.yaml file...
         Successfully applied mb_resources.yaml file.
       
       - Generating new lbvm/lbCtrlData.json file (can take longer due to number of existing LBVM pairs)...
         Successfully updated the lbvm/lbCtrlData.json file.
       
       - Generating new pool list...
       
       - Updating hosts.ini with new LBVMs...
         Successfully updated hosts.ini.
       
       - Running pipeline to config the new LBVMs (can take 10 minutes or more)...
         Successfully updated the new LBVM config.
       
       - Updating /etc/hosts with new LBVMs...
         Successfully updated /etc/hosts
       
       - Generating new config maps for lb-controller...
         . Updating config map: lb-controller-ctrl-data from file: /var/occne/cluster/occne1-cluster/lbvm/lbCtrlData.json
           Successfully applied new config map: lb-controller-ctrl-data from file: /var/occne/occne1-cluster/lbvm/lbCtrlData.json
           Pausing for 30 seconds to allow completion of config map: lb-controller-ctrl-data update from file: /var/occne/cluster/occne1-cluster/lbvm/lbCtrlData.json.
         . Updating config map: lb-controller-master-ip from file: /etc/hosts
           Successfully applied new config map: lb-controller-master-ip from file: /etc/hosts
           Pausing for 30 seconds to allow completion of config map: lb-controller-master-ip update from file: /etc/hosts.
         . Updating config map: lb-controller-mb-resources from file: /var/occne/cluster/occne1-cluster/mb_resources.yaml
           Successfully applied new config map: lb-controller-mb-resources from file: /var/occne/cluster/occne1-cluster/mb_resources.yaml
           Pausing for 30 seconds to allow completion of config map: lb-controller-mb-resources update from file: /var/occne/cluster/occne1-cluster/mb_resources.yaml.
           Successfully restarted deployment: occne-lb-controller-server
           Waiting for occne-lb-controller-server deployment to return to Running status.
           Deployment "occne-lb-controller-server" successfully rolled out
       
       - Generating the LBVM HAProxy templates...
         Successfully updated the templates files in the LBVMs.
       
       - Updating the lb-controller Db...
         Successfully added new pools to the lb_controller Db.
       
       - Restarting occne-egress-controller pods...
           Successfully restarted daemonset: occne-egress-controller
           Waiting for occne-egress-controller daemonset to return to Running status.
         Daemon set "occne-egress-controller" successfully rolled out
       
       - Restarting occne-metallb-controller pod...
           Successfully restarted deployment: occne-metallb-controller
           Waiting for occne-metallb-controller deployment to return to Running status.
         Deployment "occne-metallb-controller" successfully rolled out
       
       - Restarting occne-metallb-speaker pods...
           Successfully restarted daemonset: occne-metallb-speaker
           Waiting for occne-metallb-speaker daemonset to return to Running status.
         Daemon set "occne-metallb-speaker" successfully rolled out
       
       - Validating LBVM configuration
         . Validating IPAddressPool CRD for pool: sig
           IPAddressPool CRD validated successfully for pool: sig
         . Validating BGPAdvertisements CRD for pool: sig
           BGPAdvertisements bgpadvertisement1 validated successfully for pool: sig
         . Validating ACTIVE LBVM for pool: sig
           ACTIVE LBVM haproxy.cfg validation successful for pool: sig
         . Validating LB Controller Db for pool: tme
           LB Controller Db validation successful for pool: sig
       
      Successfully added new LBVMs and ports for the new Peer Address Pool.
    2. Log in to the OpenStack GUI and validate if the new pool LBVMs are created and the new external egress port is attached to the Active LBVM.

      Verify External Network in OpenStack GUI

7.4.7.2 Adding a New External Network in Bare Metal

This section describes the procedure to add an additional network to an existing bare metal installation of CNE.

Note:

Run all commands in this procedure from the Bastion Host.
  1. Navigate to the /var/occne/cluster/${OCCNE_CLUSTER} directory.
  2. Edit the mb_resources.yaml file to reflect the new network configuration. Add the following new address pool configuration as per the directions and example provided in the "Populate the MetalLB Configuration" section of Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide:
    • Add the new IPAddressPool section.
    • Add the new pool's name to the BGPAdvertisement description.
  3. Run the following command to apply the new metalLB configuration file:
    $ kubectl apply -f mb_resources.yaml
  4. Perform the following steps to delete the address pool:
    1. Remove the address pool from the IPAddressPool section of the mb_resources.yaml file.
    2. Remove the address pool's name from the BGPAdvertisement description.
    3. Run the following command to effect the removal of the advertisement:
      $ kubectl apply -f mb_resources.yaml
    4. Run the following command to delete the now orphaned IPAddressPool resource description:
      $ kubectl delete IPAddressPool <pool-name> -n <namespace>
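
After applying changes to the mb_resources.yaml file (in step 3 or step 4), you can confirm the MetalLB resources that are currently defined. This is a minimal sketch; the CRD names are the standard MetalLB ones referenced in this chapter, and the exact resource names depend on the installed MetalLB version:
  $ kubectl get ipaddresspools.metallb.io -A
  $ kubectl get bgpadvertisements.metallb.io -A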

7.4.8 Renewing the Platform Service Mesh Root Certificate

This section describes the procedure to renew the root certificate used by the platform service mesh to generate certificates for Mutual Transport Layer Security (mTLS) communication when the Intermediate Certification Authority (ICA) issuer type is used.

Prerequisites
  1. The CNE platform service mesh must have been configured to use the Intermediate CA issuer type.
  2. A network function configured with the platform service mesh (Istio) must be available.
Procedure
  1. Renew the root CA certificate
    1. Obtain a new certificate and key

      Generate a new signing certificate and key from the external CA that generated the original root CA certificate and key. The generated certificate and key values must be base64 encoded.

      Check the Certificate Authority documentation for generating the required certificate and key.

    2. Set required environment variables

      Set the OCCNE_NEW_CA_CERTIFICATE and OCCNE_NEW_CA_KEY environment variables as follows:

      $ OCCNE_NEW_CA_CERTIFICATE=<base64 encoded root CA certificate>
      $ OCCNE_NEW_CA_KEY=<base64 encoded root CA key>

      Note:

      Ensure that the certificate and the key are each base64 encoded.
      Example:
      OCCNE_NEW_CA_CERTIFICATE=LS0tLS1CRUdJTiBDRVJUSUZJQ0F...0tRU5EIENFUlRJRklDQVRFLS0tLS0K
      OCCNE_NEW_CA_KEY=LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVk...CBSU0EgUFJJVkFURSBLRVktLS0tLQo
    3. Renew the root certificate
      Run the following script to renew the root certificate:
      $ /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/renew_istio_root_certificate.sh ${OCCNE_NEW_CA_CERTIFICATE} ${OCCNE_NEW_CA_KEY}
  2. Verify that the root certificate is renewed
    1. Check that the Kubernetes Secret containing the certificate and key is updated
      Run the following command to retrieve the istio-ca secret:
      $ kubectl -n istio-system get secret istio-ca -o jsonpath="{.data['tls\.crt']}"

      Verify that the retrieved certificate matches the one provided in the previous step (a comparison sketch is provided after this procedure).

    2. Verify that the service mesh is configured to use the new certificate:
      1. Run the following command to retrieve the Istio configured root certificate:
        $ kubectl -n istio-system get cm istio-ca-root-cert -o jsonpath="{.data['root-cert\.pem']}" | base64 -w0; echo
      2. Verify that the retrieved certificate matches the one provided in the previous step.
    3. Verify that NF pods are aware of the new root certificate.
      Check that the NF pod is updated with the new root certificate value. It must be the same as ${OCCNE_NEW_CA_CERTIFICATE}. Repeat this step for the other NF pods as follows:
      # set the istio binary path
      $ ISTIO_BIN=/var/occne/cluster/${OCCNE_CLUSTER}/artifacts/istio-$(cat /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/CFG_container_images.txt | grep istio/pilot | cut -d: -f 2)/bin/istioctl
       
      # set the pod name
      $ POD_NAME=<nf-pod-name>
       
      # set the namespace
      $ NAME_SPACE=<nf-namespace>
       
      # execute istio proxy-config command to get the root certificate of the NF pod
      $ ${ISTIO_BIN} proxy-config secret ${POD_NAME} -n ${NAME_SPACE} -o json | grep -zoP 'trustedCa.*\n.*inlineBytes.*' | tail -n 1 | awk -F: '{ print $2 }' | tr -d \"
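
    The checks in steps 2.1 and 2.2 can also be scripted. The following is a minimal comparison sketch; it assumes that OCCNE_NEW_CA_CERTIFICATE is still set in the current shell:
      $ CURRENT_CERT=$(kubectl -n istio-system get secret istio-ca -o jsonpath="{.data['tls\.crt']}")
      $ [ "${CURRENT_CERT}" = "${OCCNE_NEW_CA_CERTIFICATE}" ] && echo "istio-ca secret matches the new root certificate" || echo "istio-ca secret does not match"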

7.4.9 Performing an etcd Data Backup

This section describes the procedure to back up the etcd database.

A backup copy of the etcd database is required to restore the CNE Kubernetes cluster in case all controller nodes fail simultaneously. You must back up the etcd data in the following scenarios:
  • After a 5G NF is installed, uninstalled, or upgraded
  • Before and after CNE is upgraded
This way, the backed-up etcd data can be used to recover the CNE Kubernetes cluster during disaster scenarios. The etcd data is consistent across all controller nodes within a cluster. Therefore, it is sufficient to take a backup from any one of the active Kubernetes controller nodes.
Procedure
  1. Find the Kubernetes controller hostname: run the following command to get the names of the Kubernetes controller nodes. The backup must be taken from any one of the controller nodes that is in the Ready state.
    $ kubectl get nodes
  2. Run the etcd-backup script:
    1. On the Bastion Host, switch to the /var/occne/cluster/${OCCNE_CLUSTER}/artifacts directory:
      $ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts
    2. Run the etcd_backup.sh script:
      $ ./etcd_backup.sh

      On running the script, the system prompts you to enter the k8s-ctrl node name. Enter the name of the controller node from which you want to back up the etcd data.

      Note:

      The script keeps only three backup snapshots in the PVC and automatically deletes the older snapshots.
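
      To choose a controller node name for the prompt without inspecting the output of kubectl get nodes manually, a sketch such as the following can be used. It assumes the standard control-plane role label shown in the outputs earlier in this chapter:
      $ kubectl get nodes --no-headers | awk '$2 == "Ready" && /control-plane/ {print $1; exit}'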

7.5 Updating Grafana Password

This section describes the procedure to update the Grafana password, which is used to access the Grafana dashboard GUI.

Prerequisites

  1. A CNE cluster of version 22.x or above must be deployed.
  2. The Grafana pod must be up and running.
  3. The Load Balancer IP of Grafana must be accessible through browser.

Limitations and Expectations

  • This procedure takes effect immediately on existing Grafana deployments. However, it requires a restart of the Grafana pod, causing temporary unavailability of the service.
  • The grafana-cli utility is used to update the password by running an exec into the pod, which means the change is ephemeral.
  • The newly set password is not persistent. Therefore, if the Grafana pod restarts or crashes, rerun this procedure to set the password again. Otherwise, the GUI displays Invalid Username or Password errors.
  • Only the password that is updated during an installation is persistent. If you change this password impromptu and patch the new password into the secret again, the password change doesn't take effect. To update and persist a password, use this procedure during the installation of the CNE cluster.

Procedure

  1. Log in to the Bastion Host of your CNE cluster.
  2. Get the name of the Grafana pod in the occne-infra namespace:
    $ kubectl get pods -n occne-infra | grep grafana
    For example:
    $ kubectl get pods | grep grafana
    Sample output:
    occne-prometheus-grafana-7f5fb7c4d4-lxqsr 3/3 Running 0 12m
  3. Use exec to get into the Grafana pod:
    $ kco exec -it occne-prometheus-grafana-7f5fb7c4d4-lxqsr bash
  4. Run the following command to update the password:

    Note:

    Use a strong unpredictable password consisting of complex strings of more than 10 mixed characters.
    $ grafana-cli admin reset-admin-password <password>

    where, <password> is the new password to be updated.

    For example:
    occne-prometheus-grafana-7f5fb7c4d4-lxqsr:/usr/share/grafana$ grafana-cli admin reset-admin-password samplepassword123#
    Sample output:
    INFO [01-29|07:44:25] Starting Grafana logger=settings version= commit= branch= compiled=1970-01-01T00:00:00Z
    WARN [01-29|07:44:25] "sentry" frontend logging provider is deprecated and will be removed in the next major version. Use "grafana" provider instead. logger=settings
    INFO [01-29|07:44:25] Config loaded from logger=settings file=/usr/share/grafana/conf/defaults.ini
    INFO [01-29|07:44:25] Config overridden from Environment variable logger=settings var="GF_PATHS_DATA=/var/lib/grafana/"
    INFO [01-29|07:44:25] Config overridden from Environment variable logger=settings var="GF_PATHS_LOGS=/var/log/grafana"
    INFO [01-29|07:44:25] Config overridden from Environment variable logger=settings var="GF_PATHS_PLUGINS=/var/lib/grafana/plugins"
    INFO [01-29|07:44:25] Config overridden from Environment variable logger=settings var="GF_PATHS_PROVISIONING=/etc/grafana/provisioning"
    INFO [01-29|07:44:25] Config overridden from Environment variable logger=settings var="GF_SECURITY_ADMIN_USER=admin"
    INFO [01-29|07:44:25] Config overridden from Environment variable logger=settings var="GF_SECURITY_ADMIN_PASSWORD=*********"
    INFO [01-29|07:44:25] Target logger=settings target=[all]
    INFO [01-29|07:44:25] Path Home logger=settings path=/usr/share/grafana
    INFO [01-29|07:44:25] Path Data logger=settings path=/var/lib/grafana/
    INFO [01-29|07:44:25] Path Logs logger=settings path=/var/log/grafana
    INFO [01-29|07:44:25] Path Plugins logger=settings path=/var/lib/grafana/plugins
    INFO [01-29|07:44:25] Path Provisioning logger=settings path=/etc/grafana/provisioning
    INFO [01-29|07:44:25] App mode production logger=settings
    INFO [01-29|07:44:25] Connecting to DB logger=sqlstore dbtype=sqlite3
    INFO [01-29|07:44:25] Starting DB migrations logger=migrator
    INFO [01-29|07:44:25] migrations completed logger=migrator performed=0 skipped=484 duration=908.661µs
    INFO [01-29|07:44:25] Envelope encryption state logger=secrets enabled=true current provider=secretKey.v1
     
    Admin password changed successfully ✔
  5. After updating the password, log in to the Grafana GUI using the updated password. Ensure that you are aware of the limitations listed in Limitations and Expectations.
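
Steps 3 and 4 can also be combined into a single non-interactive command. This is a hedged sketch: the container name grafana is an assumption based on the standard Grafana chart, and the pod name is the one retrieved in step 2:
  $ kubectl -n occne-infra exec occne-prometheus-grafana-7f5fb7c4d4-lxqsr -c grafana -- grafana-cli admin reset-admin-password '<password>'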

7.6 Updating OpenStack Credentials

This section describes the procedure to update the OpenStack credentials for vCNE.

Prerequisites

  1. You must have access to active Bastion Host of the cluster.
  2. All commands in this procedure must be run from the active CNE Bastion Host.
  3. You must have knowledge of kubectl and handling base64 encoded and decoded strings.

Modifying Password for Cinder Access

Kubernetes uses the cloud-config secret when interacting with OpenStack Cinder to acquire persistent storage for applications. The following steps describe how to update this secret to include the new password.

  1. Run the following command to decode and save the current cloud-config secret configurations in a temporary file:
    $ kubectl get secret cloud-config -n kube-system -o jsonpath="{.data.cloud\.conf}" | base64 --decode > /tmp/decoded_cloud_config.txt
  2. Run the following command to open the temporary file in vi editor and update the username and password fields in the file with required values:
    $ vi /tmp/decoded_cloud_config.txt
    Sample to edit the username and password:
    username="new_username"
    password="new_password"

    After updating the credentials, save and exit from the file.

  3. Run the following command to re-encode the cloud-config secret in Base64. Save the encoded output to use it in the following step.
    $ cat /tmp/decoded_cloud_config.txt | base64 -w0
  4. Run the following command to edit the Kubernetes Secret named cloud-config:
    $ kubectl edit secret cloud-config -n kube-system
    Refer to the following sample to edit the cloud-config Kubernetes Secret:

    Note:

    Replace <encoded-output> in the following sample with the encoded output that you saved in the previous step.
    # Please edit the object below. Lines beginning with a '#' will be ignored,
    # and an empty file will abort the edit. If an error occurs while saving this file will be
    # reopened with the relevant failures.
    #
    apiVersion: v1
    data:
      cloud.conf: <encoded-output>
    kind: Secret
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"v1","data":{"cloud.conf":"<encoded-output>"},"kind":"Secret","metadata":{"annotations":{},"name":"cloud-config","namespace":"kube-system"}}
      creationTimestamp: "2022-01-12T02:34:52Z"
      name: cloud-config
      namespace: kube-system
      resourceVersion: "2225"
      uid: 0994b024-6a4d-41cf-904c
    type: Opaque

    Save the changes and exit the editor.

  5. Run the following command to remove the temporary file:
    $ rm /tmp/decoded_cloud_config.txt
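
As an alternative to steps 3 and 4, the re-encode and edit can be combined into a single kubectl patch command. This is a minimal sketch that reuses the temporary file from step 2, so run it before removing the file in step 5:
  $ kubectl patch secret cloud-config -n kube-system --type merge \
      -p "{\"data\":{\"cloud.conf\":\"$(base64 -w0 < /tmp/decoded_cloud_config.txt)\"}}"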

Modifying Password for OpenStack Cloud Controller Access

Kubernetes uses the external-openstack-cloud-config secret when interacting with the OpenStack Controller. The following steps describe the procedure to update the secret to include the new credentials.

  1. Run the following command to decode the current external-openstack-cloud-config secret configurations in a temporary file:
    $ kubectl get secret external-openstack-cloud-config -n kube-system -o jsonpath="{.data.cloud\.conf}" | base64 --decode > /tmp/decoded_external_openstack_cloud_config.txt
    
  2. Run the following command to open the temporary file in vi editor and update the username and password fields in the file with required values:
    $ vi /tmp/decoded_external_openstack_cloud_config.txt
    Sample to edit the username and password:
    username="new_username"
    password="new_password"

    After updating the credentials, save and exit from the file.

  3. Run the following command to re-encode external-openstack-cloud-config in Base64. Save the encoded output to use it in the following step.
    $ cat /tmp/decoded_external_openstack_cloud_config.txt | base64 -w0
  4. Run the following command to edit the Kubernetes Secret named external-openstack-cloud-config:
    $ kubectl edit secret external-openstack-cloud-config -n kube-system
    Refer to the following sample to edit the external-openstack-cloud-config Kubernetes Secret with the new encoded value:

    Note:

    Replace <encoded-output> in the following sample with the encoded output that you saved in the previous step.
    # Please edit the object below. Lines beginning with a '#' will be ignored,
    # and an empty file will abort the edit. If an error occurs while saving this file will be
    # reopened with the relevant failures.
    #
    apiVersion: v1
    data:
      cloud.conf: <encoded-output>
    kind: Secret
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"v1","data":{"cloud.conf":"<encoded-output>"},"kind":"Secret","metadata":{"annotations":{},"name":"external-openstack-cloud-config","namespace":"kube-system"}}
      creationTimestamp: "2022-07-21T17:05:26Z"
      name: external-openstack-cloud-config
      namespace: kube-system
      resourceVersion: "16"
      uid: 9c18f914-9c78-401d-ae79
    type: Opaque

    Save the changes and exit the editor.

  5. Run the following command to remove the temporary file:
    $ rm /tmp/decoded_external_openstack_cloud_config.txt
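Optionally, verify that the updated secret now decodes to the new credentials. The following check reuses the kubectl command from step 1 and prints only the username line:
$ kubectl get secret external-openstack-cloud-config -n kube-system -o jsonpath="{.data.cloud\.conf}" | base64 --decode | grep username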

Restarting Affected Pods to Use the New Password

Services that use OpenStack credentials must be restarted to use the new password. The following steps describe how to restart the services.

Note:

Before restarting the services, verify that all the affected Kubernetes resources to be restarted are in a healthy state.
  1. Perform the following steps to restart Cinder Container Storage Interface (Cinder CSI) controller plugin:
    1. Run the following command to restart Cinder Container Storage Interface (Cinder CSI) deployment:
      $ kubectl rollout restart deployment csi-cinder-controllerplugin -n kube-system
      Sample output:
      deployment.apps/csi-cinder-controllerplugin restarted
    2. Run the following command to get the pod and verify if it is running:
      $ kubectl get pods -l app=csi-cinder-controllerplugin -n kube-system
      Sample output:
      NAME                                           READY   STATUS    RESTARTS   AGE
      csi-cinder-controllerplugin-7c9457c4f8-88sbt   6/6     Running   0          19m
    3. [Optional]: If the pod is not up or if the pod is in the crashloop state, get the logs from the cinder-csi-plugin container inside the csi-cinder-controller pod using labels and validate the logs for more information:
      $ kubectl logs -l app=csi-cinder-controllerplugin -c cinder-csi-plugin -n kube-system
      Sample output to show a successful log retrieval:
      I0904 21:36:09.162886       1 server.go:106] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
      Sample output to show a log retrieval failure:
      W0904 21:34:34.252515       1 main.go:105] Failed to GetOpenStackProvider: Authentication failed
  2. Perform the following steps to restart Cinder Container Storage Interface (Cinder CSI) nodeplugin daemonset:
    1. Run the following command to restart Cinder Container Storage Interface (Cinder CSI) nodeplugin daemonset:
      $ kubectl rollout restart -n kube-system daemonset csi-cinder-nodeplugin
      Sample output:
      daemonset.apps/csi-cinder-nodeplugin restarted
    2. Run the following command to get the pod and verify if it is running:
      $ kubectl get pods -l app=csi-cinder-nodeplugin -n kube-system
      Sample output:
      NAME                          READY   STATUS    RESTARTS   AGE
      csi-cinder-nodeplugin-pqqww   3/3     Running   0          3d19h
      csi-cinder-nodeplugin-vld6m   3/3     Running   0          3d19h
      csi-cinder-nodeplugin-xg2kj   3/3     Running   0          3d19h
      csi-cinder-nodeplugin-z5vck   3/3     Running   0          3d19h
    3. [Optional]: If the pod is not up or if the pod is in the crashloop state, verify the logs for more information.
  3. Perform the following steps to restart the OpenStack cloud controller daemonset:
    1. Run the following command to restart the OpenStack cloud controller daemonset:
      $ kubectl rollout restart -n kube-system daemonset openstack-cloud-controller-manager
      Sample output:
      daemonset.apps/openstack-cloud-controller-manager restarted
    2. Run the following command to get the pod and verify if it is running:
      $ kubectl get pods -l k8s-app=openstack-cloud-controller-manager -n kube-system
      Sample output:
      NAME                                       READY   STATUS    RESTARTS   AGE
      openstack-cloud-controller-manager-qtfff   1/1     Running   0          38m
      openstack-cloud-controller-manager-sn2pg   1/1     Running   0          38m
      openstack-cloud-controller-manager-w5dcv   1/1     Running   0          38m
    3. [Optional]: If the pod is not up, or is in the crashloop state, verify the logs for more information.

Changing Inventory File

The procedures for modifying the password for Cinder access and for OpenStack cloud controller access update the Kubernetes secrets to contain the new credentials. However, running the pipeline (for example, performing a standard upgrade or adding a new node to the cluster) uses the current credentials stored in the occne.ini file, causing your changes to be overridden. Therefore, it is important to update the occne.ini file with the new credentials.

  1. Navigate to the cluster directory:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/
  2. Open the occne.ini file:
    $ vi occne.ini
  3. Update the external OpenStack credentials (both username and password) as shown below:
    external_openstack_username = USER
    external_openstack_password = PASSWORD
  4. Update the Cinder credentials (both username and password) as shown below:
    cinder_username = USER
    cinder_password = PASSWORD
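To confirm the edits, you can optionally print the updated entries from the occne.ini file. This check assumes the parameter names shown in steps 3 and 4:
$ grep -E 'external_openstack_username|external_openstack_password|cinder_username|cinder_password' occne.ini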

Updating Credentials for lb-controller-user

Note:

Run all the commands in this section from the Bastion Host.
  1. Run the following commands to update lb-controller-user credentials:
    $ echo -n "<Username>" | base64 -w0 | xargs -I{} kubectl -n occne-infra patch --type=merge secret lb-controller-user --patch '{"data":{"USERNAME":"{}"}}'
    $ echo -n "<Password>" | base64 -w0 | xargs -I{} kubectl -n occne-infra patch --type=merge secret lb-controller-user --patch '{"data":{"PASSWORD":"{}"}}'
    where:
    • <Username> is the new OpenStack username.
    • <Password> is the new OpenStack password.
  2. Run the following command to restart lb-controller-server to use the new credentials:
    $ kubectl rollout restart deployment occne-lb-controller-server -n occne-infra
  3. Wait until the lb-controller restarts and run the following command to get the lb-controller pod status using labels. Ensure that only one pod is in the Running status:
    $ kubectl get pods -l app=lb-controller -n occne-infra
    Sample output:
    NAME                                          READY   STATUS    RESTARTS   AGE
    occne-lb-controller-server-74fd947c7c-vtw2v   1/1     Running   0          50s
  4. Validate the new credentials by printing the username and password directly from the new pod's environment variables:
    $ kubectl exec -it $(kubectl get pod -n occne-infra | grep lb-controller-server | cut -d " " -f1) -n occne-infra -- bash -c "echo -n \$USERNAME"
    $ kubectl exec -it $(kubectl get pod -n occne-infra | grep lb-controller-server | cut -d " " -f1) -n occne-infra -- bash -c "echo -n \$PASSWORD"

7.7 Updating the Guest or Host OS

You must update the host OS (for Bare Metal installations) or guest OS (for virtualized installations) periodically so that CNE has the latest Oracle Linux software. If CNE has not been updated recently, or if there are known security patches, perform an update by referring to the upgrade procedures in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.

7.8 CNE Grafana Dashboards

Grafana is an observability tool that is available in open source and enterprise versions. Grafana supports a number of data sources, such as Prometheus, from which it can read data for analytics. You can find the official list of supported data sources at Grafana Datasources.

CNE offers the following default Grafana dashboards:
  • CNE Kubernetes dashboard
  • CNE Prometheus dashboard
  • CNE logging dashboard
  • CNE persistent storage dashboard (only for Bare Metal)

Note:

The Grafana dashboards provisioned by CNE are read-only. Refrain from updating or modifying these default dashboards.

You can clone these dashboards to customize them as per your requirement and save the customized dashboards in JSON format. This section provides details about the features offered by the open source Grafana version to add the required observability framework to CNE.

7.8.1 Accessing Grafana Interface

This section provides the procedure to access Grafana web interface.

  1. Perform the following steps to get the Load Balancer IP address and port number for accessing the Grafana web interface:
    1. Run the following command to get the Load Balancer IP address of the Grafana service:
      $ export GRAFANA_LOADBALANCER_IP=$(kubectl get services occne-kube-prom-stack-grafana --namespace occne-infra -o jsonpath="{.status.loadBalancer.ingress[*].ip}")
    2. Run the following command to get the LoadBalancer port number of the Grafana service:
      $ export GRAFANA_LOADBALANCER_PORT=$(kubectl get services occne-kube-prom-stack-grafana --namespace occne-infra -o jsonpath="{.spec.ports[*].port}")
    3. Run the following command to get the complete URL for accessing Grafana in an external browser:
      $ echo http://$GRAFANA_LOADBALANCER_IP:$GRAFANA_LOADBALANCER_PORT/$OCCNE_CLUSTER/grafana
      Sample output:
      http://10.75.225.60:80/mycne-cluster/grafana
  2. Use the URL obtained in the previous step (in this case, http://10.75.225.60:80/mycne-cluster/grafana) to access the Grafana home page.
  3. Click Dashboards and select Browse.
  4. Expand the CNE folder to view the CNE dashboards.

    Note:

    CNE doesn't support user access management on Grafana.
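Optionally, before opening the URL in a browser, you can confirm from the Bastion Host that the Grafana endpoint responds. This check uses the variables exported in step 1; the exact HTTP status code returned (typically 200 or a redirect to the login page) depends on your Grafana configuration:
$ curl -s -o /dev/null -w "%{http_code}\n" http://$GRAFANA_LOADBALANCER_IP:$GRAFANA_LOADBALANCER_PORT/$OCCNE_CLUSTER/grafana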

7.8.2 Cloning a Grafana Dashboard

This section describes the procedure to clone a Grafana dashboard.

  1. Open the dashboard that you want to clone.
  2. Click the Share dashboard or panel icon next to the dashboard name.
  3. Select Export and click Save to file to save the dashboard in JSON format in your local system.
  4. Perform the following steps to import the saved dashboard to Grafana:
    1. Click Dashboards and select Import.
    2. Click Upload JSON file and select the dashboard that you saved in step 3.
    3. Change the name and UID of the dashboard.

      You have cloned the dashboard successfully. You can now use the cloned dashboard to customize the options as per your requirement.

7.8.3 Restoring a Grafana Dashboard

The default Grafana dashboards provided by CNE are stored as ConfigMaps in the CNE cluster and in the artifacts directory, which allows you to restore them to their default state. This section describes the procedure to restore a Grafana dashboard.

Note:

  • This procedure is used to restore the dashboards to the default state (that is, the default dashboards provided by CNE).
  • When you restore the dashboards, you lose all the customizations that you made on the dashboards. You can't use this procedure to restore the customizations that you made on top of the CNE default dashboards.
  • You can't use this procedure to restore other Grafana dashboards that you created.
  1. Navigate to the occne-grafana-dashboard directory:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/occne-grafana-dashboard
  2. Run the following command to restore all the dashboards present in the occne-grafana-dashboard directory to their default state. The command uses the YAML files of the dashboards in the directory to restore them.
    $ kubectl -n occne-infra apply -R -f occne-grafana-dashboard
    You can also restore a specific dashboard by providing a specific YAML file name in the command. For example, you can use the following command to restore only the CNE Kubernetes dashboard:
    $ kubectl -n occne-infra apply -f occne-grafana-dashboard/occne-k8s-cluster-dashboard.yaml
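To confirm that the dashboard ConfigMaps were applied, you can optionally list them in the occne-infra namespace; the exact ConfigMap names vary by CNE release:
$ kubectl -n occne-infra get configmaps | grep dashboard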

7.9 Managing 5G NFs

This section describes procedures to manage 5G NFs in CNE.

7.9.1 Installing an NF

This section describes the procedure to install an NF in the CNE Kubernetes cluster.

Prerequisites

  • Load container images and Helm charts onto Central Server repositories.

    Container and Helm repositories are created on a Central Server for easy CNE deployment at multiple customer sites. These repositories store all of the container images and Helm charts required to install CNE. When necessary, container images and Helm charts are pulled from the Central Server repositories onto the local CNE Bastion Host repositories. Similarly, NF installation uses Helm, so the container images and Helm charts needed to install NFs are loaded onto the same Central Server repositories. This procedure assumes that all container images and Helm charts required to install the NF are already loaded onto the Central Server repositories.

  • Determine the NF deployment parameters
    The following values determine the NF's identity and where it is deployed. These values are used in the following procedure:

    Table 7-18 NF Deployment Parameters

    Parameters Value Description
    nf-namespace Any valid namespace name The namespace where you want to install the NF. Typically each NF is installed in its own namespace.
    nf-deployment-name Any valid Kubernetes deployment name The name by which Kubernetes identifies this NF instance.

Load NF artifacts onto Bastion Host repositories

All the steps in this section are run on the CNE Bastion Host where the NF installation happens.

Load NF container images
  1. Create a file container_images.txt listing the Container images and tags as required by the NF:
    <image-name>:<release>

    Example:

    busybox:1.29.0
  2. Run the following command to load the container images into the CNE Container registry:
    $ retrieve_container_images.sh <external-container-repo-name>:<external-container-repo-port> ${HOSTNAME%%.*}:5000 < container_images.txt
    

    Example:

    $ retrieve_container_images.sh mycentralrepo:5000 ${HOSTNAME%%.*}:5000 < container_images.txt
Load NF Helm charts
  1. Create a file helm_charts.txt listing the Helm chart and version:
    <external-helm-repo-name>/<chart-name> <chart-version>

    Example:

    mycentralhelmrepo/busybox 1.33.0
  2. Run the following command to load the charts into the CNE Helm chart repository:
    $ retrieve_helm.sh /var/www/html/occne/charts http://<external-helm-repo-name>/occne/charts [helm_executable_full_path_if_not_default] < helm_charts.txt
    

    Example:

    $ retrieve_helm.sh /var/www/html/occne/charts http://mycentralrepo/occne/charts < helm_charts.txt
    

Install the NF

  1. On the Bastion Host, create a YAML file named <nf-short-name>-values.yaml to contain the values to be passed to the NF Helm chart.
  2. Add NF-specific values to the file.

    See the NF installation instructions to understand which keys and values must be included in the values file.

  3. Perform additional NF configuration.

    Before installing the NF, see the installation instructions to understand the requirements of additional NF configurations along with Helm chart values.

  4. Run the following command to install the NF:
    $ helm install --namespace <nf-namespace> --create-namespace -f <nf-short-name>-values.yaml <nf-deployment-name> <chart-or-chart-location>
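
    The following example uses hypothetical values for illustration only (an NF short name of mynf, namespace mynf-ns, deployment name mynf, and a local chart archive pulled onto the Bastion Host); substitute the values that apply to your NF:
    $ helm install --namespace mynf-ns --create-namespace -f mynf-values.yaml mynf mynf-1.0.0.tgz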
    

7.9.2 Upgrading an NF

This section describes the procedure to upgrade a 5G network function that was previously installed in the CNE Kubernetes cluster.

Prerequisites

Load container images and Helm charts onto Central Server repositories.

Container and Helm repositories are created on a Central Server for easy CNE deployment at multiple customer sites. These repositories store all of the container images and Helm charts required to install CNE. When necessary, container images and Helm charts are pulled from the Central Server repositories onto the local CNE Bastion Host repositories. Similarly, Network Function (NF) installation uses Helm, so the container images and Helm charts needed to install NFs are loaded onto the same Central Server repositories. This procedure assumes that all container images and Helm charts required to install the NF are already loaded onto the Central Server repositories.

Procedure

Perform the following steps to upgrade an NF. All commands must be run from the Bastion Host.
  1. Load NF artifacts onto Bastion Host repositories.
  2. Upgrade the NF.

Load NF artifacts onto Bastion Host repositories

All the steps in this section are run on the CNE Bastion Host where the NF installation happens.

Load NF container images
  1. Create a file container_images.txt listing the Container images and tags as required by the NF:
    <image-name>:<release>
    Example:
    busybox:1.29.0
  2. Run the following command to load the container images into the CNE Container registry:
    $ retrieve_container_images.sh <external-container-repo-name>:<external-container-repo-port> ${HOSTNAME%%.*}:5000 < container_images.txt
    

    Example:

    $ retrieve_container_images.sh mycentralrepo:5000 ${HOSTNAME%%.*}:5000 < container_images.txt
    
Load NF Helm charts
  1. Create a file helm_charts.txt listing the Helm chart and version:
    <external-helm-repo-name>/<chart-name> <chart-version>

    Example:

    mycentralhelmrepo/busybox 1.33.0
  2. Run the following command to load the charts into the CNE Helm chart repository:
    $ retrieve_helm.sh /var/www/html/occne/charts http://<external-helm-repo-name>/occne/charts [helm_executable_full_path_if_not_default] < helm_charts.txt
    

    Example:

    $ retrieve_helm.sh /var/www/html/occne/charts http://mycentralrepo/occne/charts < helm_charts.txt  
    

Upgrade the NF

Change Helm input values used in previous NF release
To change any input value in the Helm chart during the upgrade, refer to the NF-specific upgrade instructions in <NF-specific> Installation, Upgrade, and Fault Recovery Guide for the new release. If any new input parameters are added in the new release, perform the following steps:
  1. On the Bastion Host, create a YAML file named <nf-short-name>-values.yaml to contain the values to be passed to the NF Helm chart.
  2. Create a YAML file that contains new and changed values needed by the NF Helm chart.

    See the NF installation instructions to understand which keys and values must be included in the values file. Only values for parameters that were not included in the Helm input values applied to the previous release, or parameters whose names changed from the previous release, must be included in this file.

  3. If a YAML file was created for this upgrade, run the following command to upgrade the NF with the new values:
    $ helm upgrade -f <nf-short-name>-values.yaml <nf-deployment-name> <chart-name-or-chart-location>
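
    The following example uses hypothetical values for illustration only (values file mynf-values.yaml, deployment name mynf, and a chart archive for the new release); substitute the values that apply to your NF:
    $ helm upgrade -f mynf-values.yaml mynf mynf-1.1.0.tgz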
    

    Note:

    The nf-deployment-name value must match the value used when installing the NF.
Retain all Helm input values used in previous NF release
If there is no requirement to change the Helm chart input values during upgrade and if no new input parameters were added in the new release, then run the following command to upgrade and retain all the NF values:
$ helm upgrade --reuse-values <nf-deployment-name> <chart-name-or-chart-location>

Note:

The nf-deployment-name value must match the value used when installing the NF.

7.9.3 Uninstalling an NF

This section describes the procedure to uninstall a 5G network function that was previously installed in the CNE Kubernetes cluster.

Prerequisites

  • Determine the NF deployment parameters. The following values determine the NF's identity and where it is deployed:

    Table 7-19 NF Deployment Parameters

    Variable Value Description
    nf-namespace Any valid namespace name The namespace where you want to install the NF. Typically each NF is installed in its own namespace.
    nf-deployment-name Any valid Kubernetes deployment name The name by which Kubernetes identifies this NF instance.
  • All commands in this procedure must be run from the Bastion Host.

Procedure

  1. Run the following command to uninstall an NF:
    $ helm uninstall <nf-deployment-name> --namespace <nf-namespace>
  2. If there are remaining NF resources, such as PVCs and the namespace, perform the following steps to remove them:
    1. Run the following command to remove residual PVCs:
      $ kubectl --namespace <nf-namespace> get pvc | awk '{print $1}'| xargs -L1 -r kubectl --namespace <nf-namespace> delete pvc
      
    2. Run the following command to delete the namespace:
      $ kubectl delete namespace <nf-namespace>

      Note:

      Steps a and b are used to remove all the PVCs from the <nf-namespace> and delete the <nf-namespace>, respectively. If there are other components running in the <nf-namespace>, manually delete the PVCs that need to be removed and skip the kubectl delete namespace <nf-namespace> command.
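To confirm that the NF was removed, you can optionally check that its Helm release is no longer listed and, if you deleted the namespace, that the namespace no longer exists (the second command reports NotFound when the namespace is gone):
$ helm list --namespace <nf-namespace>
$ kubectl get namespace <nf-namespace>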

7.9.4 Update Alerting Rules for an NF

This section describes the procedure to add or update the alerting rules for any Cloud Native Core 5G NF in Prometheus Operator and OSO.

Prerequisites

  • For CNE Prometheus Operator, a YAML file containing a PrometheusRule CRD that defines the NF-specific alerting rules must be available (a minimal sample is shown after this list). The YAML file must be an ordinary text file in a valid YAML format with the extension .yaml.
  • For OSO Prometheus, a valid OSO release must be installed, and an alert file describing all NF alert rules according to the old format is required.
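
The following is a minimal, hypothetical PrometheusRule sample for illustration only. The group name, alert name, expression, and labels shown here are placeholders; use the rules provided in the NF documentation, and note that any label selectors required by the CNE Prometheus Operator depend on your CNE release:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alerting-rules
  namespace: occne-infra
spec:
  groups:
    - name: mynf-alerts
      rules:
        - alert: MyNfPodNotReady
          expr: kube_pod_status_ready{namespace="mynf-ns", condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "A pod in the mynf-ns namespace has not been ready for 5 minutes."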

Procedure for Prometheus Operator

  1. To copy the NF-specific alerting rules file from your computer to the /tmp directory on the Bastion Host, see the Accessing the Bastion Host procedure.
  2. Run the following command to create or update the PrometheusRule CRD containing the alerting rules for the NF:
    $ kubectl apply -f /tmp/rules_file.yaml -n occne-infra
    # To verify the creation of the alert-rules CRD, run the following command:
    $ kubectl get prometheusrule -n occne-infra
    NAME                              AGE
    occne-alerting-rules              43d
    occne-dbtier-alerting-rules       43d
    test-alerting-rules               5m

    The alerting rules are automatically loaded into all running Prometheus instances within one minute.

  3. In the Prometheus GUI, select the Alerts tab. Select individual rules from the list to view the alert details and verify if the new rules are loaded.

    Figure 7-1 New Alert Rules are loaded in Prometheus GUI



Procedure for OSO

Perform the following steps to add alert rules in the OSO Prometheus GUI:
  1. Take a backup of the current OSO Prometheus configuration map:
    $ kubectl get configmaps <OSO-prometheus-configmap-name> -o yaml -n <namespace> > /tmp/tempPrometheusConfig.yaml
  2. Add the NF alert file name to the Prometheus configuration map. The NF alert file name varies from NF to NF, so first retrieve the name of the NF alert rules file. Once you have the file name, run the following commands to add it to the Prometheus configuration map:
    $ sed -i '/etc\/config\/<nf-alertsname>/d' /tmp/tempPrometheusConfig.yaml
    $ sed -i '/rule_files:/a\    \- /etc/config/<nf-alertsname>' /tmp/tempPrometheusConfig.yaml
  3. Update configuration map with the updated file:
    $ kubectl -n <namespace> replace configmap <OSO-prometheus-configmap-name> -f /tmp/tempPrometheusConfig.yaml
  4. Patch the NF alert rules in the OSO Prometheus configuration map by specifying the alert rule file path:
    $ kubectl patch configmap <OSO-prometheus-configmap-name> -n <namespace> --type merge --patch "$(cat ./NF_alertrules.yaml)"
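
For illustration, if the NF alert rules file is named mynf_alertrules.yaml (a hypothetical name), the sed commands in step 2 are run as follows:
$ sed -i '/etc\/config\/mynf_alertrules.yaml/d' /tmp/tempPrometheusConfig.yaml
$ sed -i '/rule_files:/a\    \- /etc/config/mynf_alertrules.yaml' /tmp/tempPrometheusConfig.yaml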

7.9.5 Configuring Egress NAT for an NF

This section provides information about configuring NF microservices that originate egress requests to ensure compatibility with CNE.

Annotation for Specifying Egress Network

Starting with CNE 22.4.x, egress requests do not get the IP address of the Kubernetes worker node assigned to the source IP field. Instead, each microservice that originates egress requests specifies an egress network through an annotation. An IP address from the indicated network is inserted into the source IP field of all egress requests.

For each microservice that originates egress requests, add the following annotation to the deployment specification and its pods:
annotations:
  oracle.com.cnc/egress-network: "oam"

Note:

  • The value of the annotation must match the name of a configured external network.
  • This annotation must not be added for microservices that do not originate egress requests, as it leads to decreased CNE performance.
  • CNE does not allow any microservice to pick a separate IP address. When CNE is installed, a single IP address is selected for each network.
  • All pods in a microservice get the same source IP address attached to all egress requests.
  • CNE 22.4.x supports this annotation in vCNE deployments only.

Configuring Egress Controller Environment

The Egress controller runs as a Kubernetes resource of type DaemonSet. The following table provides details about the environment variables that can be edited in the Egress controller manifest file:

Note:

Do not edit any variables that are not listed in the following table.

Table 7-20 Egress Controller Environment Configuration

Environment Variable Default Value Possible Value Description
DAEMON_MON_TIME 0.5 Between 0.1 and 5 This value reflects the time in seconds and highlights the frequency at which the Egress controller checks the cluster status.
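
As a hypothetical illustration only, this environment variable appears in the container specification of the Egress controller DaemonSet manifest in a form similar to the following; the DaemonSet name, namespace, and surrounding structure depend on your CNE release:

env:
  - name: DAEMON_MON_TIME
    value: "0.5"    # time in seconds between Egress controller cluster status checks; valid range is 0.1 to 5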

Configuring Egress NAT for Destination Subnet or IP Address

The destination subnet or IP address must be specified to route traffic through a particular network. The destination subnet or IP address is specified in the form of a dictionary, where the pools are the dictionary keys and the lists of subnets or IP addresses are the dictionary values.

The following annotations show different scenarios of adding pools and destination subnets or IP addresses with examples:
  • Specifying annotation for destination subnet:
    annotations:
      oracle.com.cnc/egress-destination: '{"<pool>" : ["<subnet_ip_address>/<subnet_mask>"]}'
    For example:
    annotations:
      oracle.com.cnc/egress-destination: '{"oam" : ["10.20.30.0/24"]}'
  • Specifying annotation for destination IP address:
    annotations:
      oracle.com.cnc/egress-destination: '{"<pool>" : ["<ip_address>"]}'
    For example:
    annotations:
      oracle.com.cnc/egress-destination: '{"oam" : ["10.20.30.40"]}'
  • Specifying annotation for multiple pools:
    annotations:
      oracle.com.cnc/egress-destination: '{"<pool_one>" : ["<subnet_ip_address>/<subnet_mask>"], "<pool_two>" : ["<subnet_ip_address>/<subnet_mask>"]}'
    For example:
    annotations:
      oracle.com.cnc/egress-destination: '{"oam" : ["10.20.30.0/24"], "sig" : ["30.20.10.0/24"]}'
  • Specifying annotation for multiple pools and multiple destinations:
    annotations:
      oracle.com.cnc/egress-destination: '{"<pool_one>" : ["<subnet_ip_address>/<subnet_mask>", "<subnet_ip_address>/<subnet_mask>"], "<pool_two>" : ["<subnet_ip_address>/<subnet_mask>", "<ip_address>"]}'
    For example:
    annotations:
      oracle.com.cnc/egress-destination: '{"oam" : ["10.20.30.0/24", "100.200.30.0/22"], "sig" : ["30.20.10.0/24", "20.10.5.1"]}'

Compatibility Between Egress NAT and Destination Egress NAT

Both Egress NAT and Destination Egress NAT annotations are independent and compatible. This means that they can be used independently or combined to create more specific rules. Egress NAT routes all traffic from a particular pod through a particular network, whereas Destination Egress NAT permits traffic to be routed using a destination subnet or IP address before regular Egress NAT rules are matched within the routing table. This allows more granularity when routing traffic through a particular network.

For example, the following annotations show both the features combined to route a pod's traffic through the sig network, except the traffic destined for 10.20.30.0/24 subnet, which is routed through the oam network:
annotations:
  oracle.com.cnc/egress-destination: '{"oam" : ["10.20.30.0/24"]}'
  oracle.com.cnc/egress-network: sig