Kubernetes Engine Issues

This section describes known issues and workarounds related to Oracle Private Cloud Appliance Kubernetes Engine (OKE).

Supported OCI Terraform Provider Versions

The Oracle Private Cloud Appliance Kubernetes Engine (OKE) guide provides example Terraform scripts to configure OKE resources. To use these scripts, you must install both Terraform and the Oracle Cloud Infrastructure (OCI) Terraform provider.

If you use Terraform scripts with Kubernetes Engine (OKE), constrain the version of the OCI Terraform provider in your provider block to at least v4.50.0 and no greater than v6.36.0:

provider "oci" {
    version          = ">= 4.50.0, <= 6.36.0"
...
}

Note: If you are using Terraform to create a node pool, include the following block:

    node_eviction_node_pool_settings {
      is_force_delete_after_grace_duration = "true"
    }

Bug: 37934227

Version: 3.0.2

Enable Add-on Work Request Initially in Failed State

When you enable an add-on, the work request might initially show that the add-on installation failed instead of showing it as pending, and the add-on state might be Needs Attention. After reconciliation, the add-on state should change to Active and the work request state should change to Succeeded.

Workaround: Wait for the reconciliation process to run a couple of times. If the work request is still in Failed state and the add-on is still in Needs Attention state after a couple of reconciliation runs, then investigate as described in "Add-on Reconciliation" in the Managing OKE Cluster Add-ons chapter of the Oracle Private Cloud Appliance Kubernetes Engine (OKE) guide.
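
If you want to monitor the reconciliation from the command line, the following is a minimal sketch using the OCI CLI. It assumes the ce cluster add-on and work request subcommands are available in the CLI version used with your appliance; the OCIDs are placeholders.

    # List the add-ons of the cluster and check their lifecycle state
    # (expected to move from NEEDS_ATTENTION to ACTIVE after reconciliation)
    oci ce cluster list-addons --cluster-id <cluster-ocid>

    # Check the state of the work request for the add-on installation
    # (expected to move to SUCCEEDED)
    oci ce work-request get --work-request-id <work-request-ocid>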

Bug: 37967658

Version: 3.0.2

Create Cluster Does Not Support Extension Parameters

In Private Cloud Appliance Release 3.0.2-b1185392, some cluster control plane node properties are specified by using OraclePCA defined tags.

In the previous release, Private Cloud Appliance Release 3.0.2-b1081557, these defined tags are not recognized. You must use free-form tags to specify these values.

Workaround: In Private Cloud Appliance Release 3.0.2-b1081557, use free-form tags to provide the following information for control plane nodes (a CLI sketch follows this list):

  • Your public SSH key.

    Specify sshkey for the tag key. Paste your public SSH key into the Value field.

    Important:

    You cannot add an SSH key after the cluster is created.

  • Number of nodes.

    By default, the number of nodes in the control plane is 3. You can specify 1, 3, or 5 nodes. To specify the number of control plane nodes, specify cp_node_count for the tag key, and enter 1, 3, or 5 in the Value field.

  • Node shape.

    For Private Cloud Appliance X10 systems, the shape of the control plane nodes is VM.PCAStandard.E5.Flex and you cannot change it. For all other Private Cloud Appliance systems, the default shape is VM.PCAStandard1.1, and you can specify a different shape.

    To use a different shape, specify cp_node_shape for the tag key, and enter the name of the shape in the Value field. For a description of each shape, see Compute Shapes in the Oracle Private Cloud Appliance Concepts Guide.

  • Node shape configuration.

    If you specify a shape that is not a flexible shape, do not specify a shape configuration. The number of OCPUs and amount of memory are set to the values shown for this shape in "Standard Shapes" in Compute Shapes in the Oracle Private Cloud Appliance Concepts Guide.

    If you specify a flexible shape, you can change the default shape configuration.

    To provide shape configuration information, specify cp_node_shape_config for the tag key. You must specify the number of OCPUs (ocpus) you want. You can optionally specify the total amount of memory you want (memoryInGBs). The default value for gigabytes of memory is 16 times the number you specify for OCPUs.

    The following are examples of node shape configuration values. Enter everything, including the surrounding single quotation marks, in the Value field for the tag. In the first example, the default amount of memory will be configured.

    '{"ocpus":1}'
    '{"ocpus":2, "memoryInGBs":24}'

Bug: 36979754

Version: 3.0.2

Nodes in Failing State After Upgrade or Patch

Upgrade or patch of an appliance that has OKE clusters with node pools can cause some nodes to move into the FAILING state even though the underlying compute instance is in the RUNNING state.

If you experience this issue, perform the following workaround.

Workaround: Use the following method to replace the failed nodes with new active nodes, automatically transferring workloads from the failed nodes to the new nodes.

Delete the nodes that are in state FAILING or FAILED. Do not increase the size of the node pool (do not scale up the node pool).

The deleted nodes are cordoned and drained and their workloads are automatically transferred to the new nodes that are created to keep the node pool at the same size.
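
As a sketch, the node deletion can be done with the OCI CLI. This assumes the delete-node subcommand and its --is-decrement-size option are available in your CLI version; the OCIDs are placeholders.

    # Delete a failed node without shrinking the node pool;
    # a replacement node is created automatically to keep the pool at the same size
    oci ce node-pool delete-node \
        --node-pool-id <node-pool-ocid> \
        --node-id <node-ocid> \
        --is-decrement-size false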

See also [PCA 3.x] Node Pool Nodes in Failing State Post Upgrade/Patching from 302M3.8 to 302M3.9 (Doc ID 3035508.1).

Bug: 36814183

Version: 3.0.2

OKE Requires Switch Firmware Upgrade on Systems with Administration Network

If your Private Cloud Appliance is configured with a separate administration network, the appliance and data center networking need reconfiguration to enable the traffic flows required by the Oracle Private Cloud Appliance Kubernetes Engine (OKE). In addition, the reconfiguration of the network is dependent on functionality included in a new version of the switch software.

Workaround: Upgrade or patch the software of the switches in your appliance, then reconfigure the network. You can find details and instructions in the switch upgrade and network configuration sections of the Oracle Private Cloud Appliance documentation.

Bug: 36073167

Version: 3.0.2

Previously Used Image Is No Longer Listed

The Compute Web UI and the compute image list command list only the three most recently published versions of each major distribution (for example, Oracle Linux 9) of an image. If an upgrade or patch delivers an updated version of an OKE node image (for example, the same image with a newer Kubernetes version) and three versions of that major distribution image have already been delivered, the oldest of those versions is no longer listed.

Previously delivered images are still accessible, even though they are not listed.

Workaround: To use an image that you have used before but is no longer listed, use the OCI CLI to create the node pool, and specify the OCID of the image. To get the OCID of the image you want, use the ce node-pool get command for a node pool where you used this image before.
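
The following is a minimal sketch of that procedure. The exact parameter used to pass the image OCID depends on the CLI version (--node-source-details is assumed here; older versions may use a different parameter), and all OCIDs and names are placeholders.

    # Look up the image OCID in a node pool that already uses the image
    oci ce node-pool get --node-pool-id <existing-node-pool-ocid>

    # Create the new node pool, passing the image OCID explicitly
    # (other parameters shown here are illustrative)
    oci ce node-pool create \
        --cluster-id <cluster-ocid> \
        --compartment-id <compartment-ocid> \
        --name <node-pool-name> \
        --node-shape VM.PCAStandard1.1 \
        --kubernetes-version <version> \
        --size 3 \
        --node-source-details '{"sourceType": "IMAGE", "imageId": "<image-ocid>"}'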

Bug: 36862970

Version: 3.0.2

Tag Filters Not Available for Kubernetes Node Pools and Nodes

Unlike Oracle Cloud Infrastructure, Private Cloud Appliance currently does not provide the functionality to use Tag Filters for tables listing Kubernetes node pools and nodes. Tag filtering is available for Kubernetes clusters.

Workaround: There is no workaround. The UI does not provide the tag filters in question.

Bug: 36091835

Version: 3.0.2

OKE-Specific Tags Must Not Be Deleted

Certain properties and functions of OKE are enabled through resource tags. These reserved tags are not created by the IAM service, but by users who apply them to resources. Therefore, the IAM service cannot prevent users from deleting such tags. If they are deleted, the OKE service might not work as expected.

Workaround: Do not attempt to delete the resource tags used for specific OKE service functionality. If you delete these tags, you must create them again.

Bug: 37157933

Version: 3.0.2

Unable to Delete an OKE Cluster in Failed State

To deploy a cluster, Oracle Private Cloud Appliance Kubernetes Engine (OKE) uses various types of cloud resources that can also be managed through other infrastructure services, such as compute instances and load balancers. However, OKE cluster resources must be manipulated only through the OKE service, to avoid inconsistencies. If the network load balancer of an OKE cluster is deleted outside the control of the OKE service, that cluster ends up in a failed state and you will no longer be able to delete it.

Workaround: This is a known issue with the Cluster API Provider. If a cluster is in failed state and its network load balancer is no longer present, it must be cleaned up manually. Contact Oracle for assistance.

Bug: 36193835

Version: 3.0.2

UI and CLI Represent Eviction Grace Period Differently

The minimum and default grace period before a node is evicted from a worker node pool is 20 seconds. The OCI CLI displays this value accurately and allows you to modify the grace period in seconds or minutes, using the ISO8601 format. For example, you could change the default of 20 seconds (="PT20S") to 3 minutes (="PT3M") by specifying a new value in the --node-eviction-node-pool-settings command parameter.

In contrast, the Compute Web UI parses the ISO8601 time format into an integer value and displays the eviction grace period in minutes. As a result, the 20 second default appears as 0 minutes in the Node Pool Information tab of the Kubernetes Cluster detail page.

This behavior differs from the Oracle Cloud Infrastructure console (UI), which is capable of displaying time in minutes as a decimal value (for example: 0.35 minutes). It has no minimum grace period, so zero is a valid entry.

Workaround: To set or check the precise eviction grace period of a node pool, use the OCI CLI and specify time in the ISO8601 format. When using the Compute Web UI, consider the limitations described.
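
For example, the following sketch sets the eviction grace period of an existing node pool to 3 minutes with the OCI CLI; the OCID is a placeholder and the JSON key name assumes the standard NodeEvictionNodePoolSettings model.

    # Set the eviction grace period to 3 minutes (ISO 8601 duration)
    oci ce node-pool update \
        --node-pool-id <node-pool-ocid> \
        --node-eviction-node-pool-settings '{"evictionGraceDuration": "PT3M"}'

    # Verify the configured value
    oci ce node-pool get --node-pool-id <node-pool-ocid>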

Bug: 36696595

Version: 3.0.2

Nodes in Node Pool Not Automatically Distributed Across Fault Domains

When you create an OKE node pool without selecting specific fault domains, the Compute service handles distribution of the nodes across the fault domains. By design, node pool nodes (and compute instances in general) are assigned to the compute nodes with the highest available resource capacity. Due to VM activity and differences in resource consumption, the load between the three fault domains might vary considerably. Therefore, the auto-distribution logic cannot guarantee that nodes of the same node pool are spread evenly across fault domains. In fact, all nodes might end up in the same fault domain, which is not preferred.

Workaround: For the best distribution of node pool nodes across fault domains, do not rely on auto-distribution. Instead, select the fault domains to use when creating the node pool.
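
As an illustration, fault domains can be selected at node pool creation time through the placement configuration. The sketch below assumes the faultDomains field of --placement-configs is supported by the CLI and API versions on your appliance; all OCIDs and names are placeholders.

    # Create a node pool whose nodes are placed only in explicitly selected fault domains
    oci ce node-pool create \
        --cluster-id <cluster-ocid> \
        --compartment-id <compartment-ocid> \
        --name <node-pool-name> \
        --node-shape VM.PCAStandard1.1 \
        --kubernetes-version <version> \
        --size 3 \
        --placement-configs '[{"availabilityDomain": "<ad-name>", "subnetId": "<worker-subnet-ocid>", "faultDomains": ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]}]'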

Bug: 36901742

Version: 3.0.2

API Reference on Appliance Not Up-to-Date for OKE Service

Every Private Cloud Appliance provides online API reference pages, conveniently accessible from your browser. For the Compute Enclave, these pages are located at https://console.mypca.mydomain/api-reference. This API reference is not current for all services, including for Oracle Private Cloud Appliance Kubernetes Engine (OKE).

Workaround: Refer to the REST API for Oracle Private Cloud Appliance Compute Enclave documentation on the Oracle Help Center, which shows up-to-date parameters and values in the descriptions of the CreateCluster and CreateNodePool operations.

Note:

Both the console api-reference and the Oracle Help Center REST API for Oracle Private Cloud Appliance Compute Enclave show parameters and parameter values that are not supported because they do not apply to Private Cloud Appliance. If you use these, you might receive a not supported error message, or the parameter or value might be accepted by the API but have no effect.

Bug: 35710716, 36852746

Version: 3.0.2

OKE Cluster Creation Fails

OKE cluster creation might fail if the system is configured with a domain name that contains uppercase characters. Uppercase characters are not supported in domain names.

Workaround: Contact Oracle Support.

Bug: 36611385

Version: 3.0.2

Backend Gets Removed From Load Balancer After Node Cycling of Node Pool

The backend of the load balancer that exposes the underlying application is detached from the load balancer when its node pool is updated and node cycled with maximumUnavailable set to 1. This happens only when maximumUnavailable is set to 1 and the node pool contains a single node. Although the new node rejoins the cluster and node cycling completes successfully, the new node is not added back as a backend of the service load balancer.

Workaround: Delete the service object and recreate it, or set maximumUnavailable to 0 when node cycling if the cluster has only one node pool with a single node.
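
For the first option, a minimal kubectl sketch is shown below; the Service name, namespace, and manifest file are placeholders.

    # Delete the Service that backs the load balancer, then recreate it from its manifest
    kubectl delete service <service-name> -n <namespace>
    kubectl apply -f <service-manifest>.yaml

    # Confirm that the recreated Service again has the node as a backend endpoint
    kubectl get endpoints <service-name> -n <namespace>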

Bug: 38484329

Version: 3.0.2

OKE Cluster Created Initially at M3.11.1 Fails to Upgrade to OKE 1.31.6 After Upgrade from 3.0.2-b1392231 to 3.0.2-b1483396

Following a rack upgrade from software release 3.0.2-b1392231 to software release 3.0.2-b1483396, OKE clusters originally created on software release 3.0.2-b1392231 might encounter an invalid compartment error and cannot be upgraded to Kubernetes version 1.31.6.

Workaround: To resolve this, perform a node pool upgrade to version 1.30.10 or a similar version first. Once the node pool is updated, retry upgrading the cluster to version 1.31.6 or the required version.
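
A sketch of this sequence with the OCI CLI follows; the OCIDs are placeholders and the versions shown are those named in this issue.

    # Step 1: upgrade the node pool to an intermediate Kubernetes version
    oci ce node-pool update \
        --node-pool-id <node-pool-ocid> \
        --kubernetes-version v1.30.10

    # Step 2: after the node pool update completes, retry the cluster upgrade
    oci ce cluster update \
        --cluster-id <cluster-ocid> \
        --kubernetes-version v1.31.6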

Bug: 38409461

Version: 3.0.2

Restarted csi-oci-controller Pods of OKE Cluster in ImagePullBackOff After BYOC

After setting up Bring Your Own Certificate (BYOC) and updating the certificates on the cluster, new or restarted csi-oci-controller pods might not start because they are unable to pull the required image, resulting in an ImagePullBackOff status.

Workaround: To resolve the issue, delete the control plane nodes from your cluster. After deletion, re-register these nodes. This causes the system to launch new pods, which should now start normally.
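
To confirm that you are hitting this issue, you can inspect the pods first; the sketch below assumes the csi-oci-controller pods run in the kube-system namespace.

    # Check whether csi-oci-controller pods are stuck in ImagePullBackOff
    kubectl -n kube-system get pods | grep csi-oci-controller

    # Inspect a failing pod for the image pull error details
    kubectl -n kube-system describe pod <csi-oci-controller-pod-name>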

Bug: 38457116

Version: 3.0.2

Cluster Create with Version v1.31.6 Fails With Connection Error When DNS Domain Not Defined for OKE VCN

After upgrade to software release 3.0.2-b1483396, clusters created with OKE version 1.31.6 do not come up successfully, and control plane tasks, such as the CoreDNS and Flannel add-on deployments, fail because of DNS lookup errors similar to the following:

    dial tcp: lookup <control-plane-node> on 169.254.168.254:53: no such host

Workaround: This problem arises if the VCN was created without a DNS Domain Name. To resolve the issue, ensure that dns-label is set for the VCN.
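
For reference, a DNS label is set when the VCN is created. The following is a minimal sketch; the display name, CIDR block, and DNS label are illustrative values.

    # Create the OKE VCN with a DNS label so control plane hostnames resolve
    oci network vcn create \
        --compartment-id <compartment-ocid> \
        --display-name oke-vcn \
        --cidr-block 10.0.0.0/16 \
        --dns-label okevcn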

Bug: 38309694

Version: 3.0.2