Kubernetes Engine Issues
This section describes known issues and workarounds related to Oracle Private Cloud Appliance Kubernetes Engine (OKE).
Cloning Feature for Block Volume PV Using CSI Plugin Is Not Available
The cloning feature for existing block volumes is not available for PVCs created using the CSI volume plugin in your worker node applications.
For block volume persistent storage, use the CSI plugin as described in Creating Persistent Block Volume Storage.
Workaround: No workaround is available to clone an existing volume using the CSI plugin.
Bug: 36252730
Version: 3.0.2
Supported OCI Terraform Provider Versions
The Oracle Private Cloud Appliance Kubernetes Engine (OKE) guide provides example Terraform scripts to configure OKE resources. To use these scripts, you must install both Terraform and the Oracle Cloud Infrastructure (OCI) Terraform provider.
If you use Terraform scripts with Kubernetes Engine (OKE), specify in your provider block that the version of the OCI Terraform provider to install is at least v4.50.0 but no greater than v6.36.0:
provider "oci" {
  version = ">= 4.50.0, <= 6.36.0"
  ...
}
Bug: 37934227
Version: 3.0.2
Enable Add-on Work Request Initially in Failed State
When you enable an add-on, the work request might initially report that the add-on installation failed rather than pending, and the add-on state might show Needs Attention. After reconciliation, the add-on state should change to Active, and the work request state should change to Succeeded.
Workaround: Wait for the reconciliation process to run a couple of times. If the work request is still in Failed state and the add-on is still in Needs Attention state after a couple of reconciliation runs, then investigate as described in "Add-on Reconciliation" in the Managing OKE Cluster Add-ons chapter of the Oracle Private Cloud Appliance Kubernetes Engine (OKE) guide.
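To recheck the work request from the command line, one option is the OCI CLI work request command; for example (the work request OCID is a placeholder):
oci ce work-request get --work-request-id <work_request_OCID>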
Bug: 37967658
Version: 3.0.2
Create Cluster Does Not Support Extension Parameters
In Private Cloud Appliance Release 3.0.2-b1185392, some cluster control plane node properties are specified by using OraclePCA defined tags.
In the previous release, Private Cloud Appliance Release 3.0.2-b1081557, these defined tags are not recognized. You must use free-form tags to specify these values.
Workaround: In Private Cloud Appliance Release 3.0.2-b1081557, use free-form tags to provide the following information for control plane nodes:
-
Your public SSH key.
Specify sshkey for the tag key. Paste your public SSH key into the Value field.
Important: You cannot add an SSH key after the cluster is created.
-
Number of nodes.
By default, the number of nodes in the control plane is 3. You can specify 1, 3, or 5 nodes. To specify the number of control plane nodes, specify cp_node_count for the tag key, and enter 1, 3, or 5 in the Value field.
-
Node shape.
For Private Cloud Appliance X10 systems, the shape of the control plane nodes is VM.PCAStandard.E5.Flex and you cannot change it. For all other Private Cloud Appliance systems, the default shape is VM.PCAStandard1.1, and you can specify a different shape.
To use a different shape, specify cp_node_shape for the tag key, and enter the name of the shape in the Value field. For a description of each shape, see Compute Shapes in the Oracle Private Cloud Appliance Concepts Guide.
-
Node shape configuration.
If you specify a shape that is not a flexible shape, do not specify a shape configuration. The number of OCPUs and amount of memory are set to the values shown for this shape in "Standard Shapes" in Compute Shapes in the Oracle Private Cloud Appliance Concepts Guide.
If you specify a flexible shape, you can change the default shape configuration.
To provide shape configuration information, specify cp_node_shape_config for the tag key. You must specify the number of OCPUs (ocpus) you want. You can optionally specify the total amount of memory you want (memoryInGBs). The default value for gigabytes of memory is 16 times the number you specify for OCPUs.
The following are examples of node shape configuration values. Enter everything, including the surrounding single quotation marks, in the Value field for the tag. In the first example, the default amount of memory will be configured.
'{"ocpus":1}'
'{"ocpus":2, "memoryInGBs":24}'
Bug: 36979754
Version: 3.0.2
Nodes in Failing State After Upgrade or Patch
Upgrade or patch of an appliance that has OKE clusters with node pools can cause some nodes to move into the FAILING state even though the underlying compute instance is in the RUNNING state.
If you experience this issue, perform the following workaround.
Workaround: Use the following method to replace the failed nodes with new active nodes, automatically transferring workloads from the failed nodes to the new nodes.
Delete the nodes that are in state FAILING or FAILED. Do not increase the size of the node pool (do not scale up the node pool). The deleted nodes are cordoned and drained, and their workloads are automatically transferred to the new nodes that are created to keep the node pool at the same size.
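For example, a sketch of deleting one failed node with the OCI CLI while keeping the node pool at the same size (the OCIDs are placeholders; verify the parameters against your installed CLI version):
oci ce node-pool delete-node \
  --node-pool-id <node_pool_OCID> \
  --node-id <node_instance_OCID> \
  --is-decrement-size false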
Bug: 36814183
Version: 3.0.2
OKE Requires Switch Firmware Upgrade on Systems with Administration Network
If your Private Cloud Appliance is configured with a separate administration network, the appliance and data center networking need reconfiguration to enable the traffic flows required by the Oracle Private Cloud Appliance Kubernetes Engine (OKE). In addition, the reconfiguration of the network is dependent on functionality included in a new version of the switch software.
Workaround: Upgrade or patch the software of the switches in your appliance. Reconfigure the network. You can find details and instructions in the following documentation sections:
-
"Upgrading the Switch Software" in the Oracle Private Cloud Appliance Upgrade Guide
-
"Patching the Switch Software" in the Oracle Private Cloud Appliance Patching Guide
-
"Securing the Network" in the Oracle Private Cloud Appliance Security Guide
The "Securing the Network" section includes a port matrix for systems with a separate administration network. Use it to configure routing and firewall rules so that the required traffic is enabled in a secure way.
Bug: 36073167
Version: 3.0.2
GPU Shapes Must Not Be Selected for Creating Node Pool
On a system that includes nodes with GPUs installed, when you create an OKE node pool, it is possible to select a GPU shape. The operation will succeed, but the OKE cluster is unable to use GPUs because the compute images have no drivers for them. GPU shapes provide access to scarce and expensive resources, which are intended for dedicated workloads on regular compute instances.
Workaround: When creating an OKE node pool, you must always select a standard or flexible shape.
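For example, a sketch of a node pool creation call that uses a flexible standard shape (the OCIDs and shape configuration are placeholders, and other required parameters are omitted; choose a non-GPU shape available on your system and verify the parameters against your installed CLI version):
oci ce node-pool create \
  --cluster-id <cluster_OCID> \
  --compartment-id <compartment_OCID> \
  --name <node_pool_name> \
  --node-shape VM.PCAStandard.E5.Flex \
  --node-shape-config '{"ocpus": 2, "memoryInGBs": 32}'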
Bug: 37576565
Version: 3.0.2
Previously Used Image Is No Longer Listed
The Compute Web UI and the compute image list command list only the three most recently published versions of each major distribution (for example, Oracle Linux 9) of an image. If an upgrade or patch delivers an updated version of an OKE node image, for example the same image with a newer Kubernetes version, and that major distribution image had already been delivered three times, the fourth most recently published version of that image is no longer listed.
Previously delivered images are still accessible, even though they are not listed.
Workaround: To use an image that you have used before but that is no longer listed, use the OCI CLI to create the node pool, and specify the OCID of the image. To get the OCID of the image you want, use the ce node-pool get command for a node pool where you used this image before.
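For example, a sketch of looking up the image OCID from an existing node pool and reusing it when creating a new node pool (the OCIDs are placeholders, omitted parameters are indicated by an ellipsis, and the JSON keys assume the node-source-details structure returned by the CLI; verify both against your installed CLI version):
oci ce node-pool get --node-pool-id <existing_node_pool_OCID> \
  --query 'data."node-source-details"."image-id"' --raw-output
oci ce node-pool create ... \
  --node-source-details '{"sourceType": "IMAGE", "imageId": "<image_OCID>"}'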
Bug: 36862970
Version: 3.0.2
Tag Filters Not Available for Kubernetes Node Pools and Nodes
Unlike Oracle Cloud Infrastructure, Private Cloud Appliance currently does not provide the functionality to use Tag Filters for tables listing Kubernetes node pools and nodes. Tag filtering is available for Kubernetes clusters.
Workaround: There is no workaround. The UI does not provide the tag filters in question.
Bug: 36091835
Version: 3.0.2
OKE-Specific Tags Must Not Be Deleted
Certain properties and functions of OKE are enabled through resource tags. These reserved tags are not created by the IAM service, but by users who apply them to resources. Therefore, the IAM service cannot prevent users from deleting such tags. If they are deleted, the OKE service might not work as expected.
Workaround: Do not attempt to delete the resource tags used for specific OKE service functionality. If you delete these tags, you must create them again.
Bug: 37157933
Version: 3.0.2
Unable to Delete an OKE Cluster in Failed State
To deploy a cluster, Oracle Private Cloud Appliance Kubernetes Engine (OKE) uses various types of cloud resources that can also be managed through other infrastructure services, such as compute instances and load balancers. However, OKE cluster resources must be manipulated only through the OKE service, to avoid inconsistencies. If the network load balancer of an OKE cluster is deleted outside the control of the OKE service, that cluster ends up in a failed state and you will no longer be able to delete it.
Workaround: This is a known issue with the Cluster API Provider. If a cluster is in failed state and its network load balancer is no longer present, it must be cleaned up manually. Contact Oracle for assistance.
Bug: 36193835
Version: 3.0.2
UI and CLI Represent Eviction Grace Period Differently
The minimum and default grace period before a node is evicted from a worker node pool is 20 seconds. The OCI CLI displays this value accurately and allows you to modify the grace period in seconds or minutes, using the ISO8601 format. For example, you could change the default of 20 seconds (="PT20S") to 3 minutes (="PT3M") by specifying a new value in the --node-eviction-node-pool-settings command parameter.
In contrast, the Compute Web UI parses the ISO8601 time format into an integer value and displays the eviction grace period in minutes. As a result, the 20 second default appears as 0 minutes in the Node Pool Information tab of the Kubernetes Cluster detail page.
This behavior differs from the Oracle Cloud Infrastructure console (UI), which is capable of displaying time in minutes as a decimal value (for example: 0.35 minutes). It has no minimum grace period, so zero is a valid entry.
Workaround: To set or check the precise eviction grace period of a node pool, use the OCI CLI and specify time in the ISO8601 format. When using the Compute Web UI, consider the limitations described.
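For example, a sketch of setting a three-minute eviction grace period with the OCI CLI (the node pool OCID is a placeholder, and the evictionGraceDuration key assumes the structure of the node eviction settings object; verify against your installed CLI version):
oci ce node-pool update \
  --node-pool-id <node_pool_OCID> \
  --node-eviction-node-pool-settings '{"evictionGraceDuration": "PT3M"}'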
Bug: 36696595
Version: 3.0.2
Nodes in Node Pool Not Automatically Distributed Across Fault Domains
When you create an OKE node pool without selecting specific fault domains, the Compute service handles distribution of the nodes across the fault domains. By design, node pool nodes (and compute instances in general) are assigned to the compute nodes with the highest available resource capacity. Due to VM activity and differences in resource consumption, the load between the three fault domains might vary considerably. Therefore, the auto-distribution logic cannot guarantee that nodes of the same node pool are spread evenly across fault domains. In fact, all nodes might end up in the same fault domain, which is not preferred.
Workaround: For the best distribution of node pool nodes across fault domains, do not rely on auto-distribution. Instead, select the fault domains to use when creating the node pool.
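For example, a sketch of a node pool placement configuration that distributes nodes across specific fault domains with the OCI CLI (the OCIDs and availability domain name are placeholders, omitted parameters are indicated by an ellipsis, and the faultDomains key assumes the placement configuration structure; verify against your installed CLI version):
oci ce node-pool create ... \
  --placement-configs '[
    {"availabilityDomain": "<AD_name>", "subnetId": "<subnet_OCID>", "faultDomains": ["FAULT-DOMAIN-1"]},
    {"availabilityDomain": "<AD_name>", "subnetId": "<subnet_OCID>", "faultDomains": ["FAULT-DOMAIN-2"]},
    {"availabilityDomain": "<AD_name>", "subnetId": "<subnet_OCID>", "faultDomains": ["FAULT-DOMAIN-3"]}
  ]'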
Bug: 36901742
Version: 3.0.2
API Reference on Appliance Not Up-to-Date for OKE Service
Every Private Cloud Appliance provides online API reference pages, conveniently accessible from your browser. For the Compute Enclave, these pages are located at https://console.mypca.mydomain/api-reference. This API reference is not current for all services, including for Oracle Private Cloud Appliance Kubernetes Engine (OKE).
Workaround: The REST API for Oracle Private Cloud Appliance Compute Enclave shows up-to-date parameters and values in the descriptions of the CreateCluster and CreateNodePool operations.
Note:
Both the console api-reference and the Oracle Help Center REST API for Oracle Private Cloud Appliance Compute Enclave show parameters and parameter values that are not supported because they do not apply to Private Cloud Appliance. If you use these, you might receive a not supported error message, or the parameter or value will be accepted by the API but will do nothing.
Bug: 35710716, 36852746
Version: 3.0.2
OKE Cluster Creation Fails
OKE cluster creation might fail if the system is configured with a domain name that contains uppercase characters. Uppercase characters are not supported in domain names.
Workaround: Contact Oracle Support.
Bug: 36611385
Version: 3.0.2
Review Page for OKE Cluster Creation with VCN-Native Pod Networking Displays Wrong Pod CIDR
You can create OKE clusters with VCN-Native Pod Networking, so that pods use IP addresses from the VCN range, which provides more flexible control of network traffic. However, the review page, which is displayed in the UI before you submit the new cluster configuration, shows the default Flannel Overlay subnet as the Pods CIDR Block. This information is incorrect, but it does not affect the actual cluster network configuration.
Workaround: This is a data display error. It is harmless and can be ignored.
Bug: 37815929
Version: 3.0.2
Service of Type LoadBalancer Stuck in Pending State
Allowing connections from outside Private Cloud Appliance to a containerized application requires an external load balancer. You set it up as a Kubernetes service of type LoadBalancer. However, if the manifest file to create the load balancer service does not explicitly disable security list management, the load balancer gets stuck in Pending state.
Workaround: When creating the service of type LoadBalancer to expose a containerized application outside the Private Cloud Appliance environment, ensure that the manifest file contains an annotation that sets the security list management mode to None. For example:
apiVersion: v1
kind: Service
metadata:
  name: my-nginx-svc
  labels:
    app: nginx
  annotations:
    oci.oraclecloud.com/load-balancer-type: "lb"
    service.beta.kubernetes.io/oci-load-balancer-shape: "400Mbps"
    service.beta.kubernetes.io/oci-load-balancer-security-list-management-mode: None
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: nginx
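After you apply the manifest, the service should receive an external IP address instead of remaining in Pending state; for example (the file name is a placeholder):
kubectl apply -f my-nginx-svc.yaml --kubeconfig <your_cluster.kubeconfig>
kubectl get service my-nginx-svc --kubeconfig <your_cluster.kubeconfig>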
Bug: 37199903
Version: 3.0.2
Node Cycling Operation Times Out for Large Pools
Node cycling in OKE node pools of more than 30 nodes sometimes ends with a timeout. This is likely caused by intermittent Cluster API Provider connection issues. If a timeout occurs during node cycling, some of the nodes might not have completed the process, and are left in a state that does not match the latest specification.
Workaround: Manually delete the nodes that do not match the specification, and scale the cluster up again to the required number of nodes.
-
To identify nodes that were not cycled, list the nodes by creation timestamp. Cycled nodes are typically only minutes old, while uncycled nodes will be the oldest in the list.
kubectl get nodes --sort-by=.metadata.creationTimestamp --kubeconfig <your_cluster.kubeconfig>
-
Cordon and drain the uncycled nodes if application workloads are deployed on them. Ensure that the nodes have been drained before you proceed (see the kubectl sketch after this list).
-
Log in to one of the management nodes and manually delete the uncycled nodes by deleting the corresponding Machine.
kubectl delete Machine <node_name> -n oke
As a result, a new node with updated settings is created.
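The cordon and drain step can be performed with standard kubectl commands; for example (the node name is a placeholder):
kubectl cordon <node_name> --kubeconfig <your_cluster.kubeconfig>
kubectl drain <node_name> --ignore-daemonsets --kubeconfig <your_cluster.kubeconfig>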
Bug: 37145441
Version: 3.0.2