Serviceability Issues
This section describes known issues and workarounds related to service, support, upgrade and data protection features.
Order of Upgrading Components Has Changed
When updating the platform, you must update the compute nodes first, and then update the remaining components in the order listed below. Updating the components in a different order can cause the upgrade to fail and disrupt the system.
- Compute Nodes
- Management Nodes
- Management Node Operating System
- MySQL Cluster Database
- Secret Service
- Component Firmware
- Kubernetes Cluster
- Microservices
Bug: 34358305
Version: 3.0.1
DR Configurations Are Not Automatically Refreshed for Terminated Instances
If you terminate an instance that is part of a DR configuration, then a switchover or failover operation will fail due to the terminated instance. The correct procedure is to remove the instance from the DR configuration first, and then terminate the instance. If you forget to remove the instance first, you must refresh the DR configuration manually so that the entry for the terminated instance is removed. Keeping the DR configurations in sync with the state of their associated resources is critical in protecting against data loss.
Workaround: This behavior is expected. Either remove the instance from the DR configuration before terminating, or refresh the DR configuration if you terminated the instance without removing it first.
Bug: 33265549
Version: 3.0.1
Rebooting a Management Node while the Cluster State is Unhealthy Causes Platform Integrity Issues
Rebooting the management nodes is a delicate procedure because it requires many internal interdependent operations to be executed in a controlled manner, with accurate timing and often in a specific order. If a management node fails to reboot correctly and rejoin the cluster, it can lead to a destabilization of the appliance platform and infrastructure services. Symptoms include: microservice pods in CrashLoopBackOff state, data conflicts between MySQL cluster nodes, repeated restarts of the NDB cluster daemon process, and so on.
Workaround: Before rebooting a management node, always verify that the MySQL cluster is in a healthy state. From the management node command line, run the command shown in the example below. If your output does not look similar and indicates a cluster issue, you should power-cycle the affected management node through its ILOM using the restart /System command.
As a precaution, if you need to reboot all the management nodes, for example in a full management cluster upgrade scenario, observe an interval of at least 10 minutes between two management node reboot operations.
# ndb_mgm -e show
Connected to Management Server at: pcamn01:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 3 node(s)
id=17 @253.255.0.33 (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0)
id=18 @253.255.0.34 (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0, *)
id=19 @253.255.0.35 (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0)

[ndb_mgmd(MGM)] 3 node(s)
id=1 @253.255.0.33 (mysql-8.0.25 ndb-8.0.25)
id=2 @253.255.0.34 (mysql-8.0.25 ndb-8.0.25)
id=3 @253.255.0.35 (mysql-8.0.25 ndb-8.0.25)

[mysqld(API)] 18 node(s)
id=33 @253.255.0.33 (mysql-8.0.25 ndb-8.0.25)
id=34 @253.255.0.33 (mysql-8.0.25 ndb-8.0.25)
[...]
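The health check above can be wrapped in a small guard that is evaluated before a reboot is attempted. The following is a minimal sketch and not part of the appliance tooling: the function name ndb_cluster_healthy is hypothetical, and the sketch assumes that ndb_mgm -e show marks unavailable nodes with the text "not connected" or "starting", as the standard MySQL NDB Cluster management client does.

```shell
# Hypothetical helper: reads `ndb_mgm -e show` output on stdin.
# Succeeds only if a data-node section is present and no node is
# reported as disconnected or still starting (assumed markers).
ndb_cluster_healthy() {
    output=$(cat)
    # No data-node section means the client could not query the cluster.
    printf '%s\n' "$output" | grep -Fq '[ndbd(NDB)]' || return 1
    # Any disconnected or starting node counts as unhealthy.
    if printf '%s\n' "$output" | grep -Eq 'not connected|starting'; then
        return 1
    fi
    return 0
}

# Example use on a management node (not executed here):
#   ndb_mgm -e show | ndb_cluster_healthy && echo "cluster looks healthy"
```

A check of this kind only supplements the manual comparison with the example output. It does not enforce the 10-minute interval between management node reboots.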
Bug: 34484128
Version: 3.0.2
ULN Mirror Is Not a Required Parameter for Compute Node Patching
In the current implementation of the patching functionality, the ULN field is required for all patch requests. The administrator uses this field to provide the URL to the ULN mirror that is set up inside the data center network. However, compute nodes are patched in a slightly different way: patches are applied from a secondary, internal ULN mirror on the shared storage of the management nodes. As a result, the ULN URL is technically not required to patch a compute node, but the patching code does consider it a mandatory parameter, so it must be entered.
Workaround: When patching a compute node, include the URL to the data center ULN mirror as a parameter in your patch request. Regardless of the URL provided, the secondary ULN mirror accessible from the management nodes is used to perform the patching.
Bug: 33730639
Version: 3.0.1
Patch Command Times Out for Network Controller
When patching the platform, the process may fail due to a time-out while updating the network controller. If this is the case, logs will contain entries like "ERROR [pcanwctl upgrade Failed]".
Workaround: Execute the same patch command again. The operation should succeed.
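If patching is driven from a script, the retry can be expressed as a small wrapper. This is a minimal sketch under stated assumptions: the function name run_with_one_retry is hypothetical, and the actual patch command is not shown in this section, so it must be supplied by the caller.

```shell
# Hypothetical wrapper: runs the given command and, if it fails,
# runs exactly the same command one more time.
run_with_one_retry() {
    if "$@"; then
        return 0
    fi
    echo "First attempt failed; running the same command again." >&2
    "$@"
}

# Example use (not executed here; replace the placeholder with the
# patch command that timed out):
#   run_with_one_retry <patch-command> <arguments>
```

The wrapper retries only once, matching the workaround. If the second attempt also fails, its exit status is returned to the caller.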
Bug: 33963876
Version: 3.0.1
Low Transfer Speed when Downloading ISO Image for Appliance Upgrade
During preparation of an appliance software upgrade, you might observe very low transfer speeds while the ISO image is being downloaded to the appliance internal storage. If the download is expected to take many hours to complete, it is likely that the spine switches are performing a very high number of translations. This indicates that the spine switches need to be reloaded.
In addition, on systems running appliance software versions 3.0.2-b1081557 through 3.0.2-b1325160 where active Kubernetes Engine clusters are present, the OKE network load balancer might be unreachable.
Workaround: Contact Oracle for assistance. Instructions will be provided to check the NAT statistics and reload the switches.
When upgrading the appliance software to the latest version, switches are reloaded during the preparation phase. From this point forward the workaround should no longer be required.
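To decide whether a system falls in the build range mentioned above for the OKE network load balancer symptom, the build number can be compared numerically. The following is a minimal sketch: the function name in_affected_build_range is hypothetical, it treats both ends of the range as included, and it assumes the version string has the form 3.0.2-b<number> used throughout this document. How to obtain the running version is not shown here.

```shell
# Hypothetical helper: succeeds if the given appliance software version
# lies in the range 3.0.2-b1081557 through 3.0.2-b1325160 (inclusive).
in_affected_build_range() {
    case "$1" in
        3.0.2-b*) build=${1#3.0.2-b} ;;
        *) return 1 ;;
    esac
    # Reject anything that is not a plain build number.
    case "$build" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$build" -ge 1081557 ] && [ "$build" -le 1325160 ]
}

# Example use:
#   in_affected_build_range "3.0.2-b1200000" && echo "build is in the affected range"
```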
Bug: 37807342
Version: 3.0.2
Switch Upgrade or Patch Procedure Not Blocked When Ports Are Down
The spine and leaf switches are configured in pairs for high availability (HA), and new firmware is installed through a rolling upgrade or patch operation, which is launched with a single command. If certain ports in a switch are down, the HA configuration is impacted, and the upgrade or patch operation causes a brief outage while the new firmware is installed. There is no check in place to block upgrade and patch commands when switch ports are down. Connectivity is restored, but the inactive switch ports must be fixed to reenable HA.
Workaround: Before upgrading or patching the leaf and spine switches, ensure that all necessary ports on all devices are active. Verify switch status in Grafana. If unhealthy ports are detected, ensure that this issue is fixed first. Unavailable switch ports must be fixed so the high-availability configuration of the switch pair(s) can be restored.
Bug: 37049316
Version: 3.0.1
Upgrade Oracle Cloud Infrastructure Images Fails Waiting for Response from Workflow Service
Upgrade procedures consist of many tasks, which are orchestrated through the Upgrader Workflow Service (UWS). During the upgrade of the Oracle Cloud Infrastructure images on the appliance, the UWS might fail to send a response and cause the upgrade workflow to time out. The Upgrader log records the issue as follows:
[2025-05-21 19:38:57 1393662] ERROR (util_tasks:306) [Waiting for UWS response (Waiting for a response from UWS)] Failed Did not receive a response from UWS, manually import with 'importPlatformImages' command
[2025-05-21 19:38:58 1393662] INFO (util_tasks:310) [UpgradePlanUpdateTask (None)] Not Run Task did not run. This task only runs when OciConfigurationTask's upgrade is True AND Upgrade OCI Instance images has status of Passed.
[2025-05-21 19:38:58 1393662] INFO (oci_configuration:52) Component='ociImages', path='None', upgrade-required=True, upgrade-plan=True
Workaround: Retry the images upgrade from the Service Web UI or Service CLI. The upgrade is expected to complete successfully at the next attempt.
Bug: 37984531
Version: 3.0.2
Upgrade Fails Due to Incomplete Backup Job
When upgrading the appliance software, the platform upgrade stage might fail because an internal backup does not complete successfully before a timeout occurs. When this happens, command output includes:
[2025-02-09 01:59:43 5321] INFO (util_tasks:1773) Waiting for BRS cronjob Upgrade to finish processing.
[2025-02-09 01:59:43 5321] ERROR (util_tasks:306) [BRS cronjob Upgrade (Recreate BRS cronjob)] Failed
Logs contain details similar to this example:
Tasks 64 - Name = BRS cronjob Upgrade
Tasks 64 - Message = Command: ['kubectl', '-n', 'default', 'exec', 'brs-76c968c746-58gm4', '-c', 'brs', '--', '/usr/sbin/default-backup'] failed (255): stderr: time="2025-02-09T01:47:59Z" level=error msg="exec failed: unable to start container process: error adding pid 76969 to cgroups: failed to write 76969: openat2 /sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4a3334cf_a8fd_4608_a709_3a6703c6627c.slice/crio-d97e021a3b22d8d805b8fcede91266f59fb177002fab108d3f34affd43858d95.scope/cgroup.procs: no such file or directory"
command terminated with exit code 255
Workaround: Log in to a management node. Delete the pod that controls the backup (BRS service) cronjob. A new pod is launched automatically.
# kubectl delete pod brs-76c968c746-58gm4
pod "brs-76c968c746-58gm4" deleted
# kubectl get pods -A | grep brs
default brs-76c968c746-kcdnh 3/3 Running 0 47s
# kubectl exec -it brs-76c968c746-kcdnh -c brs -- /usr/sbin/default-backup
# echo $?
0
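Because the pod name suffix changes each time the pod is recreated, the name has to be looked up before each step. The following is a minimal sketch for extracting it from the pod listing: the function name brs_pod_name is hypothetical, and the sketch assumes the default kubectl get pods -A column layout, in which the pod name is the second column.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and
# prints the name of the first pod whose name starts with "brs-".
brs_pod_name() {
    awk '$2 ~ /^brs-/ { print $2; exit }'
}

# Example use on a management node (not executed here):
#   pod=$(kubectl get pods -A | brs_pod_name)
#   kubectl delete pod "$pod"
```

If no matching pod is listed, the function prints nothing, so a calling script should check that the result is not empty before using it.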
Bug: 37572149
Version: 3.0.2
IAM Service Reports Sync Status Error After Upgrade Preparation Commands
During the preparation phase of an appliance software upgrade, the latest Upgrader functionality is added to the running system. The Admin Service is a required component in this process, which is already updated when the preUpgrade command is run. This causes a temporary version mismatch between the IAM Service and the Admin Service, with which it is tightly integrated.
New Admin API calls to IAM might fail until the corresponding version of the IAM Service is available on the system, as illustrated by this Service CLI example:
PCA-ADMIN> show iamservice
Data:
  Id = 66acfdd4-aa4e-4bda-ba9d-001d67fccf96
  Type = IamService
  IAM Link Mode = AUTO_SYNC
  Overall Communication State = Error
  Communication Error Message = Error processing iam.syncstatus.get response: PCA_GENERAL_000014: Error returned from IAM service. Code: 'NotSupportedError'. Message: 'Not Supported'
  Name = Iam Service
  Work State = Normal
  FaultIds 1 = id:f021e0a9-11b5-4483-a077-7a62049e637b type:Fault name:IamServiceSyncStatusFaultStatusFault(Iam Service)
Workaround: The error message suggests there is a sync issue with the IAM Service, but in fact all internal operations are proceeding as expected. The IAM Service is up-to-date when the platform and containerized microservices upgrade steps are completed, after which the error disappears. No workaround is required.
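To confirm that the error has disappeared once the upgrade steps are completed, the relevant field can be extracted from saved Service CLI output. This is a minimal sketch: the function name iam_comm_state is hypothetical, and it assumes the output of show iamservice has been captured to a file, one field per line, in the "Field = value" layout shown above. The only value documented in this section is Error.

```shell
# Hypothetical helper: reads saved `show iamservice` output on stdin and
# prints the value of the "Overall Communication State" field.
iam_comm_state() {
    sed -n 's/^[[:space:]]*Overall Communication State = //p'
}

# Example use (the file name is a placeholder):
#   iam_comm_state < iamservice-output.txt
```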
Bug: 37775091
Version: 3.0.2
Instances with a Shared Block Volume Cannot Be Part of Different Disaster Recovery Configurations
Multiple instances can have a block volume attached that is shared between them. If you add those instances to a disaster recovery (DR) configuration, their attached volumes are moved to a dedicated ZFS storage project. However, if the instances belong to different DR configurations, each one with its own separate ZFS storage project, the system cannot move any shared block volume as this always results in an invalid DR configuration. Therefore, the Disaster Recovery service does not support adding compute instances with shared block volumes to different DR configurations.
Workaround: Consider including instances with a shared block volume in the same DR configuration, or attaching different block volumes to each instance instead of a shared volume.
Bug: 34566745
Version: 3.0.2
DR Failover Error Because Initiator Group No Longer Exists
In the Native Disaster Recovery service, a failover plan is meant to be used when the primary system is down. However, the service does not prevent an administrator from performing a failover while the primary system is up and running. During a failover, to be able to perform role reversal between primary and standby, the DR service changes the LUN information in the replicated data stored on the standby system's ZFS Storage Appliance. If the primary system is online, it sends replication updates every 5 minutes, which revert the changes made to the LUN parameters on the standby. This causes role reversal to fail with an error similar to this example:
The initiator group '225fc1ba7c38_grp' no longer exists. It may have been destroyed or renamed by another administrator, or this LUN may have been imported from another system.
Workaround: Do not perform a failover when the primary system is online. Use switchover instead.
If you do run into this failover error, the recovery procedure involves changes on the ZFS Storage Appliance. Contact Oracle for assistance.
Bug: 37988746
Version: 3.0.2
Time-out Occurs when Generating Support Bundle
When you request assistance from Oracle Support, it is usually required to upload a support bundle with your request. A support bundle is generated from a management node using a command similar to this example:
# support-bundles -m time_slice --all -s 2022-01-01T00:00:00.000Z -e 2022-01-02T00:00:00.000Z
If there is a very large number of log entries to be collected for the specified time slice, the process could time out with an API exception and an error message that says "unable to execute command". The data collection actually continues in the background; the error is caused by a time-out of the websocket connection to the Kubernetes pod running the data collection process.
Workaround: If you encounter this time-out issue when collecting data for a support bundle, try specifying a shorter time slice to reduce the amount of data collected. If the process completes within 30 minutes the error should not occur.
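One way to apply this workaround is to split the original window into several shorter slices and collect a bundle for each. The following is a minimal sketch that only prints the resulting commands: the function name print_bundle_slices is hypothetical, it relies on GNU date (the -d option), and it expects timestamps in the format used in the example above. A suitable slice length depends on the volume of log data and is not specified in this section.

```shell
# Hypothetical helper: splits the window START..END into slices of HOURS
# hours and prints one support-bundles command per slice. Nothing is run.
# Usage: print_bundle_slices START END HOURS
# Requires GNU date. Timestamps look like 2022-01-01T00:00:00.000Z.
print_bundle_slices() {
    # Drop the fractional part and the "T" so that date can parse the value.
    start=$(date -u -d "$(printf '%s' "${1%%.*}" | tr 'T' ' ')" +%s) || return 1
    end=$(date -u -d "$(printf '%s' "${2%%.*}" | tr 'T' ' ')" +%s) || return 1
    step=$(( $3 * 3600 ))
    [ "$step" -gt 0 ] || return 1
    while [ "$start" -lt "$end" ]; do
        next=$(( start + step ))
        # The last slice must not run past the end of the window.
        [ "$next" -gt "$end" ] && next=$end
        printf 'support-bundles -m time_slice --all -s %s -e %s\n' \
            "$(date -u -d "@$start" +%Y-%m-%dT%H:%M:%S.000Z)" \
            "$(date -u -d "@$next" +%Y-%m-%dT%H:%M:%S.000Z)"
        start=$next
    done
}

# Example use: split one day into two 12-hour slices.
#   print_bundle_slices 2022-01-01T00:00:00.000Z 2022-01-02T00:00:00.000Z 12
```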
Bug: 33749450
Version: 3.0.2
DR Operations Intermittently Fail
Under certain heavy-load conditions, Site Guard users performing DR operations on Private Cloud Appliance 3.0 can encounter out-of-session errors when Site Guard EM scripts attempt to perform DR operations using the PCA DR REST API.
This condition occurs when the system is overloaded with requests.
Workaround: Retry the operation.
Bug: 33934952
Version: 3.0.1, 3.0.2
MN01 Host Upgrade Fails When it is the Last Management Node to Upgrade
Upgrades and patches to the management nodes are performed in a sequential order. When MN01 falls last in that order, the management node upgrade or patch operation fails. To avoid this issue, ensure that the Management Node Virtual IP address is assigned to MN02 before you start any management node upgrade or patching operations.
Workaround: Assign the Management Node Virtual IP address to MN02 before you upgrade or patch.
# pcs resource move mgmt-rg pcamn02
Bug: 35554754
Version: 3.0.2
Failure Draining Node when Patching or Upgrading the Kubernetes Cluster
To prevent microservice pods from entering an inappropriate state, each Kubernetes node is drained before being upgraded to the next available version. The Upgrader allows all pods to be evicted gracefully before proceeding with the node. However, if a pod is stuck or is not evicted in time, the upgrade or patch process stops.
Workaround: If a Kubernetes node cannot be drained because a pod is not evicted, you must manually evict the pod that causes the failure.
- Log on to the Kubernetes node using ssh, and run the following command, using the appropriate host name:
  # kubectl drain pcamn00 --ignore-daemonsets --delete-local-data
  Wait for the draining to complete. The command output should indicate: node/pcamn00 drained.
- If the drain command fails, the output indicates which pod is causing the failure. Either run the drain command again and add the --force option, or use the delete command. For example:
  # kubectl delete pod pod-name --force
- Rerun the Kubernetes upgrade or patch command. The Upgrader continues from where the process was interrupted.
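The first two steps above can be sketched as a single helper that drains the node and repeats the drain with --force only if the first attempt fails. This is an illustration, not part of the Upgrader: the function name drain_node is hypothetical, and the drain options are the ones shown in the steps above.

```shell
# Hypothetical helper: drains a Kubernetes node and, if the first attempt
# fails, repeats the drain with --force added.
drain_node() {
    node=$1
    if kubectl drain "$node" --ignore-daemonsets --delete-local-data; then
        return 0
    fi
    echo "Drain of $node failed; retrying with --force." >&2
    kubectl drain "$node" --ignore-daemonsets --delete-local-data --force
}

# Example use (not executed here):
#   drain_node pcamn00
```

Forcing the drain skips the graceful eviction that the Upgrader normally waits for, so review the output of the failed attempt to see which pod is stuck before relying on the forced retry.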
Bug: 37291231
Version: 3.0.2
Oracle Auto Service Request Disabled after Upgrade
When a Private Cloud Appliance has been registered for Oracle Auto Service Request (ASR), and the service is enabled on the appliance, the ASR service may become disabled after an upgrade of the appliance software. The issue has been observed when upgrading to version 3.0.2-b925538.
Workaround: After the appliance software upgrade, verify the ASR configuration. If the ASR service is disabled, manually enable it again. See "Using Auto Service Requests" in the Status and Health Monitoring chapter of the Oracle Private Cloud Appliance Administrator Guide.
Bug: 35704133
Version: 3.0.2
Compute Node in NotReady State Blocks Upgrade or Patching
A known Linux issue may cause a compute node to block patching or upgrade. This issue occurs after a compute node is upgraded and rebooted. The node reboots to a NotReady state which prevents further patching or upgrading.
Workaround: Reboot the impacted compute node and continue with the upgrade or patch.
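To identify which node needs the reboot, the node listing can be filtered by status. The following is a minimal sketch under an explicit assumption: that the NotReady state referred to here is the node status reported by kubectl get nodes, with the node name in the first column and the status in the second. The function name not_ready_nodes and the node names in the example are hypothetical.

```shell
# Hypothetical helper: reads `kubectl get nodes` output on stdin and prints
# the names of nodes whose STATUS column contains NotReady.
not_ready_nodes() {
    awk 'NR > 1 && $2 ~ /NotReady/ { print $1 }'
}

# Example use on a management node (not executed here):
#   kubectl get nodes | not_ready_nodes
```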
Bug: 36835607
Version: 3.0.2
Site Guard Precheck Jobs Fail
Site Guard precheck jobs fail with the error "Incorrect hostname found in DNS for local IP nn.nn.nnn.nnn". This error occurs when the domain name contains uppercase letters. Uppercase characters are not supported in domain names.
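A quick way to confirm the cause is to test the configured domain name for uppercase characters. This is a minimal sketch: the function name has_uppercase is hypothetical, and the domain name must be supplied by the caller because this section does not show where it is configured.

```shell
# Hypothetical helper: succeeds if the given string contains at least one
# uppercase character.
has_uppercase() {
    case "$1" in
        *[[:upper:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use:
#   has_uppercase "Example.com" && echo "domain name contains uppercase letters"
```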
Workaround: Contact Oracle Support.
Bug: 36710199
Version: 3.0.2
Instance Migration Fails During Appliance Upgrade
When upgrading Private Cloud Appliance to software version 3.0.2-b1325160, instances might fail to migrate to another compute node. Rollback causes affected instances to return to their original host compute node, which interrupts active workloads. The issue is caused by competing commands and time-outs during statistics collection at the level of the hypervisor.
As part of this particular version upgrade, vmstats-exporter is uninstalled during the preparation phase, and reinstalled during platform upgrade. Until that time, VM stats show NO DATA.
Workaround: When the upgrade to version 3.0.2-b1325160 is complete, the issue is resolved.
Bug: 36936775
Version: 3.0.2
Restore NTP Configuration After Upgrade To PCA Version: PCA:3.0.2-b1392231
After an upgrade to software version 3.0.2-b1392231, time synchronization across nodes is lost, which eventually results in cluster and certificate validation failures.
This issue is fixed in software version 3.0.2-b1410505.
Workaround: See [ PCA 3.x ] Restore NTP Configuration After Upgrade To PCA Version: PCA:3.0.2-b1392231 (M3.11) (Doc ID 3089503.1).
Bug: 38035199
Version: 3.0.2
DR Configurations Are Not Supported Between Different Private Cloud Appliance Platforms
DR configurations are only supported when the DR connection is between two Private Cloud Appliances of the same platform. For example, a Private Cloud Appliance X9 rack can only be configured for DR with another Private Cloud Appliance X9 rack.
Workaround: Only create a DR configuration between two Private Cloud Appliance racks of the same platform: X9 to X9, X10 to X10, X11 to X11.
Bug: 38111294
Version: 3.0.1
Need to Increase Timeout Value for Migration During CN Upgrade
When a system is operating at scale, compute node upgrades may time out. If your environment is running a significant number of virtual machines, contact Oracle Support to perform live compute node upgrades to ensure seamless VM migration.
Workaround: Contact Oracle Support.
Bug: 38205860
Version: 3.0.2