Serviceability Issues
This section describes known issues and workarounds related to service, support, upgrade and data protection features.
Order of Upgrading Components Has Changed
When updating the platform, you must update the components in the order shown below, starting with the compute nodes. Failing to follow this order can cause the upgrade to fail and disrupt the system.
- Compute Nodes
- Management Nodes
- Management Node Operating System
- MySQL Cluster Database
- Secret Service
- Component Firmware
- Kubernetes Cluster
- Microservices
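For scripted upgrades, the required order can be captured as a list that a wrapper iterates over. This is only a sketch: the component names and the echoed placeholder are illustrative, not actual Service CLI syntax.

```shell
# Illustrative only: component names and the upgrade step are placeholders,
# not real Service CLI commands.
components="compute-nodes management-nodes mn-operating-system \
mysql-cluster-database secret-service component-firmware \
kubernetes-cluster microservices"

for component in $components; do
    echo "upgrading: $component"   # replace with the real upgrade command
done
```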
Bug: 34358305
Version: 3.0.1
DR Configurations Are Not Automatically Refreshed for Terminated Instances
If you terminate an instance that is part of a DR configuration, then a switchover or failover operation will fail due to the terminated instance. The correct procedure is to remove the instance from the DR configuration first, and then terminate the instance. If you forget to remove the instance first, you must refresh the DR configuration manually so that the entry for the terminated instance is removed. Keeping the DR configurations in sync with the state of their associated resources is critical in protecting against data loss.
Workaround: This behavior is expected. Either remove the instance from the DR configuration before terminating, or refresh the DR configuration if you terminated the instance without removing it first.
Bug: 33265549
Version: 3.0.1
Rebooting a Management Node while the Cluster State is Unhealthy Causes Platform Integrity Issues
Rebooting the management nodes is a delicate procedure because it requires many internal interdependent operations to be executed in a controlled manner, with accurate timing and often in a specific order. If a management node fails to reboot correctly and rejoin the cluster, it can lead to a destabilization of the appliance platform and infrastructure services. Symptoms include: microservice pods in CrashLoopBackOff state, data conflicts between MySQL cluster nodes, repeated restarts of the NDB cluster daemon process, and so on.
Workaround: Before rebooting a management node, always verify that the MySQL cluster is in a healthy state. From the management node command line, run the command shown in the example below. If your output does not look similar and indicates a cluster issue, power-cycle the affected management node through its ILOM using the restart /System command.
As a precaution, if you need to reboot all the management nodes, for example in a full management cluster upgrade scenario, observe an interval of at least 10 minutes between consecutive management node reboots.
# ndb_mgm -e show
Connected to Management Server at: pcamn01:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     3 node(s)
id=17   @253.255.0.33  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0)
id=18   @253.255.0.34  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0, *)
id=19   @253.255.0.35  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0)
[ndb_mgmd(MGM)] 3 node(s)
id=1    @253.255.0.33  (mysql-8.0.25 ndb-8.0.25)
id=2    @253.255.0.34  (mysql-8.0.25 ndb-8.0.25)
id=3    @253.255.0.35  (mysql-8.0.25 ndb-8.0.25)
[mysqld(API)]   18 node(s)
id=33   @253.255.0.33  (mysql-8.0.25 ndb-8.0.25)
id=34   @253.255.0.33  (mysql-8.0.25 ndb-8.0.25)
[...]
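This health check can be automated with a simple grep: ndb_mgm reports unavailable nodes with the text "not connected". The report below is a canned example so the sketch is self-contained; in practice, capture the live output with report=$(ndb_mgm -e show) and run the same check.

```shell
# Sketch: fail the pre-reboot check if any cluster node is not connected.
# The report is a canned example; normally: report=$(ndb_mgm -e show)
report='id=17  @253.255.0.33  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0)
id=18  (not connected, accepting connect from pcamn02)
id=19  @253.255.0.35  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0)'

if printf '%s\n' "$report" | grep -q 'not connected'; then
    echo "cluster unhealthy: do not reboot this management node"
else
    echo "cluster healthy: safe to proceed"
fi
```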
Bug: 34484128
Version: 3.0.2
ULN Mirror Is Not a Required Parameter for Compute Node Patching
In the current implementation of the patching functionality, the ULN field is required for all patch requests. The administrator uses this field to provide the URL to the ULN mirror that is set up inside the data center network. However, compute nodes are patched differently: patches are applied from a secondary, internal ULN mirror on the shared storage of the management nodes. As a result, the ULN URL is technically not required to patch a compute node, but the patching code treats it as a mandatory parameter, so it must be entered.
Workaround: When patching a compute node, include the URL to the data center ULN mirror as a parameter in your patch request. Regardless of the URL provided, the secondary ULN mirror accessible from the management nodes is used to perform the patching.
Bug: 33730639
Version: 3.0.1
Patch Command Times Out for Network Controller
When patching the platform, the process may fail due to a time-out while updating the network controller. If this is the case, logs will contain entries like "ERROR [pcanwctl upgrade Failed]".
Workaround: Execute the same patch command again. The operation should succeed.
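A retry of this kind can be wrapped in a small loop. The patch command below is a stub so the flow can be demonstrated without a live system; it fails on the first call to mimic the time-out, then succeeds. Substitute the real patch command in practice.

```shell
# Stub standing in for the real patch command: fails once, then succeeds,
# mimicking the network controller time-out described above.
attempt_count=0
patch_network_controller() {
    attempt_count=$((attempt_count + 1))
    [ "$attempt_count" -ge 2 ]   # first call fails (time-out), second succeeds
}

try=1
until patch_network_controller; do
    echo "attempt $try failed (time-out): retrying"
    try=$((try + 1))
done
echo "patch succeeded on attempt $try"
```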
Bug: 33963876
Version: 3.0.1
Upgrade Commands Fail when One Storage Controller Is Unavailable
The ZFS Storage Appliance has two controllers operating in an HA cluster, meaning it continues to operate when one of the controllers goes down. However, with one controller unavailable, upgrade-related operations will fail due to a connection error in the RabbitMQ internal message bus: "Error in RabbitMQ service: No response received after 90 seconds". Even viewing the upgrade job history is not possible, because the upgrade service is unable to send a response.
Workaround: Make sure that both storage controllers are up and running. Then, rerun the required upgrade commands.
Bug: 34507825
Version: 3.0.2
Instances with a Shared Block Volume Cannot Be Part of Different Disaster Recovery Configurations
Multiple instances can have a block volume attached that is shared between them. If you add those instances to a disaster recovery (DR) configuration, their attached volumes are moved to a dedicated ZFS storage project. However, if the instances belong to different DR configurations, each one with its own separate ZFS storage project, the system cannot move any shared block volume as this always results in an invalid DR configuration. Therefore, the Disaster Recovery service does not support adding compute instances with shared block volumes to different DR configurations.
Workaround: Consider including instances with a shared block volume in the same DR configuration, or attaching different block volumes to each instance instead of a shared volume.
Bug: 34566745
Version: 3.0.2
Time-out Occurs when Generating Support Bundle
When you request assistance from Oracle Support, it is usually required to upload a support bundle with your request. A support bundle is generated from a management node using a command similar to this example:
# support-bundles -m time_slice --all -s 2022-01-01T00:00:00.000Z -e 2022-01-02T00:00:00.000Z
If there is a very large number of log entries to be collected for the specified time slice, the process could time out with an API exception and an error message that says "unable to execute command". The data collection actually continues in the background; the error is caused by a time-out of the websocket connection to the Kubernetes pod running the data collection process.
Workaround: If you encounter this time-out issue when collecting data for a support bundle, try specifying a shorter time slice to reduce the amount of data collected. If the process completes within 30 minutes the error should not occur.
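One way to shorten the time slices is to split the collection window and issue one request per slice. The sketch below divides a 24-hour window into four 6-hour slices and only echoes the resulting support-bundles commands rather than running them; the slice length of 6 hours is an arbitrary example.

```shell
# Sketch: split one 24-hour window into four 6-hour slices and print the
# support-bundles command for each (commands are echoed, not executed).
day="2022-01-01"
for h in 0 6 12 18; do
    s=$(date -u -d "$day $h:00:00 UTC" +%Y-%m-%dT%H:%M:%S.000Z)
    e=$(date -u -d "$day $h:00:00 UTC + 6 hours" +%Y-%m-%dT%H:%M:%S.000Z)
    echo "support-bundles -m time_slice --all -s $s -e $e"
done
```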
Bug: 33749450
Version: 3.0.2
DR Operations Intermittently Fail
During certain conditions of heavy load, Site Guard users performing DR operations on the Private Cloud Appliance 3.0 can encounter out-of-session errors when Site Guard EM scripts attempt to perform DR operations using the PCA DR REST API.
This condition occurs when the system is overloaded with requests.
Workaround: Retry the operation.
Bug: 33934952
Version: 3.0.1, 3.0.2
MN01 Host Upgrade Fails When It Is the Last Management Node to Upgrade
Upgrades and patches to the management nodes are performed in sequential order. When MN01 falls last in that order, the management node upgrade or patch operation fails. To avoid this issue, ensure that the Management Node Virtual IP address is assigned to MN02 before you start any management node upgrade or patching operations.
Workaround: Assign the Management Node Virtual IP address to MN02 before you upgrade or patch:
# pcs resource move mgmt-rg pcamn02
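Before starting the upgrade, you can confirm which node currently hosts the resource group. The status line below is a simplified, canned example of the kind of line pcs prints so the check is self-contained; in practice, capture the live output with status=$(pcs status resources).

```shell
# Canned, simplified example line; normally: status=$(pcs status resources)
status='mgmt-rg: Started pcamn02'

case "$status" in
    *"Started pcamn02"*) echo "mgmt-rg on pcamn02: safe to upgrade" ;;
    *)                   echo "run: pcs resource move mgmt-rg pcamn02" ;;
esac
```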
Bug: 35554754
Version: 3.0.2
Failure Draining Node when Patching or Upgrading the Kubernetes Cluster
To prevent microservice pods from entering an inconsistent state, each Kubernetes node is drained before being upgraded to the next available version. The Upgrader allows all pods to be evicted gracefully before proceeding with the node. However, if a pod is stuck or is not evicted in time, the upgrade or patch process stops.
Workaround: If a Kubernetes node cannot be drained because a pod is not evicted, you must manually evict the pod that causes the failure.
- Log on to the Kubernetes node using ssh, and run the following command, using the appropriate host name:
# kubectl drain pcamn00 --ignore-daemonsets --delete-local-data
Wait for the draining to complete. The command output should indicate: node/pcamn00 drained.
- If the drain command fails, the output indicates which pod is causing the failure. Either run the drain command again with the --force option added, or delete the pod directly:
# kubectl delete pod pod-name --force
- Rerun the Kubernetes upgrade or patch command. The Upgrader continues from where the process was interrupted.
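The drain-then-force fallback in the steps above can be sketched as follows. kubectl is stubbed here so the flow can be shown without a live cluster: the plain drain "fails" as if a pod were stuck, and the forced drain succeeds.

```shell
# Sketch of the eviction fallback; "kubectl" is a stub, not the real binary.
kubectl() {
    case "$*" in
        *--force*) return 0 ;;   # forced drain succeeds in this stub
        *)         return 1 ;;   # plain drain "fails" (stuck pod)
    esac
}

node=pcamn00
if kubectl drain "$node" --ignore-daemonsets --delete-local-data; then
    result="node/$node drained"
elif kubectl drain "$node" --ignore-daemonsets --delete-local-data --force; then
    result="node/$node drained (forced)"
fi
echo "$result"
```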
Bug: 35677796
Version: 3.0.2
Oracle Auto Service Request Disabled after Upgrade
When a Private Cloud Appliance has been registered for Oracle Auto Service Request (ASR), and the service is enabled on the appliance, the ASR service may become disabled after an upgrade of the appliance software. The issue has been observed when upgrading to version 3.0.2-b925538.
Workaround: After the appliance software upgrade, verify the ASR configuration. If the ASR service is disabled, manually enable it again. See "Using Auto Service Requests" in the Status and Health Monitoring chapter of the Oracle Private Cloud Appliance Administrator Guide.
Bug: 35704133
Version: 3.0.2