Serviceability Issues

This section describes known issues and workarounds related to service, support, upgrade and data protection features.

Order of Upgrading Components Has Changed

When upgrading the platform, you must upgrade the compute nodes first. Failing to follow the required order can cause the upgrade to fail and disrupt the system.

Workaround: Complete platform upgrades in this order:
  1. Compute Nodes
  2. Management Nodes
  3. Management Node Operating System
  4. MySQL Cluster Database
  5. Secret Service
  6. Component Firmware
  7. Kubernetes Cluster
  8. Microservices

Bug: 34358305

Version: 3.0.1

DR Configurations Are Not Automatically Refreshed for Terminated Instances

If you terminate an instance that is part of a DR configuration, a subsequent switchover or failover operation fails because of the terminated instance. The correct procedure is to remove the instance from the DR configuration first, and then terminate the instance. If you terminate the instance without removing it first, you must refresh the DR configuration manually so that the entry for the terminated instance is removed. Keeping DR configurations in sync with the state of their associated resources is critical to protecting against data loss.

Workaround: This behavior is expected. Either remove the instance from the DR configuration before terminating, or refresh the DR configuration if you terminated the instance without removing it first.
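
The Service CLI sketch below illustrates that order. The command names removeInstanceFromDrConfig and refreshDrConfig and their parameters are assumptions for illustration only; refer to the Disaster Recovery documentation for the exact syntax.

PCA-ADMIN> removeInstanceFromDrConfig drConfigId=<dr-config-id> instanceId=<instance-id>    (assumed syntax)
PCA-ADMIN> refreshDrConfig drConfigId=<dr-config-id>                                        (assumed syntax)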

Bug: 33265549

Version: 3.0.1

Rebooting a Management Node while the Cluster State is Unhealthy Causes Platform Integrity Issues

Rebooting the management nodes is a delicate procedure because it requires many internal, interdependent operations to be executed in a controlled manner, with accurate timing and often in a specific order. If a management node fails to reboot correctly and rejoin the cluster, it can destabilize the appliance platform and infrastructure services. Symptoms include microservice pods in a CrashLoopBackOff state, data conflicts between MySQL cluster nodes, and repeated restarts of the NDB cluster daemon process.

Workaround: Before rebooting a management node, always verify that the MySQL cluster is in a healthy state. From the management node command line, run the command shown in the example below. If your output does not look similar, or indicates a cluster issue, power-cycle the affected management node through its ILOM using the restart /System command.

As a precaution, if you need to reboot all the management nodes, for example in a full management cluster upgrade scenario, observe an interval of at least 10 minutes between two management node reboot operations.

# ndb_mgm -e show
Connected to Management Server at: pcamn01:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     3 node(s)
id=17   @253.255.0.33  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0)
id=18   @253.255.0.34  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0, *)
id=19   @253.255.0.35  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0)

[ndb_mgmd(MGM)] 3 node(s)
id=1    @253.255.0.33  (mysql-8.0.25 ndb-8.0.25)
id=2    @253.255.0.34  (mysql-8.0.25 ndb-8.0.25)
id=3    @253.255.0.35  (mysql-8.0.25 ndb-8.0.25)

[mysqld(API)]   18 node(s)
id=33   @253.255.0.33  (mysql-8.0.25 ndb-8.0.25)
id=34   @253.255.0.33  (mysql-8.0.25 ndb-8.0.25)
[...]
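
In addition to the NDB cluster check, you can look for the pod symptoms mentioned above by listing pods that are not in a Running or Completed state. This is a minimal sketch, assuming kubectl is configured on the management node; on a healthy platform the command returns only the header line.

# kubectl get pods -A | grep -vE 'Running|Completed'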

Bug: 34484128

Version: 3.0.2

ULN Mirror Is Not a Required Parameter for Compute Node Patching

In the current implementation of the patching functionality, the ULN field is required for all patch requests. The administrator uses this field to provide the URL to the ULN mirror that is set up inside the data center network. However, compute nodes are patched differently: patches are applied from a secondary, internal ULN mirror on the shared storage of the management nodes. As a result, the ULN URL is technically not required to patch a compute node, but the patching code considers it a mandatory parameter, so it must be entered.

Workaround: When patching a compute node, include the URL to the data center ULN mirror as a parameter in your patch request. Regardless of the URL provided, the secondary ULN mirror accessible from the management nodes is used to perform the patching.
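
For illustration, a hedged sketch of such a patch request is shown below. The command name patchCN and the parameter names are assumptions here; check the patching documentation for the exact syntax used by your software version.

PCA-ADMIN> patchCN hostIp=<compute-node-ip> ULN=https://uln-mirror.example.com/yum    (assumed syntax)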

Bug: 33730639

Version: 3.0.1

Patch Command Times Out for Network Controller

When patching the platform, the process may fail due to a time-out while updating the network controller. In that case, the logs contain entries such as "ERROR [pcanwctl upgrade Failed]".

Workaround: Execute the same patch command again. The operation should succeed.

Bug: 33963876

Version: 3.0.1

Low Transfer Speed when Downloading ISO Image for Appliance Upgrade

During preparation of an appliance software upgrade, you might observe very low transfer speeds while the ISO image is being downloaded to the appliance internal storage. If the download is expected to take many hours to complete, it is likely that the spine switches are performing a very high number of network address translations, which indicates that the switches need to be reloaded.

In addition, on systems running appliance software versions 3.0.2-b1081557 through 3.0.2-b1325160 where active Kubernetes Engine clusters are present, the OKE network load balancer might be unreachable.

Workaround: Contact Oracle for assistance. Instructions will be provided to check the NAT statistics and reload the switches.

When upgrading the appliance software to the latest version, switches are reloaded during the preparation phase. From this point forward the workaround should no longer be required.

Bug: 37807342

Version: 3.0.2

Switch Upgrade or Patch Procedure Not Blocked When Ports Are Down

The spine and leaf switches are configured in pairs for high availability (HA), and new firmware is installed through a rolling upgrade or patch operation, which is launched with a single command. If certain ports in a switch are down, the HA configuration is impacted, and the upgrade or patch operation causes a brief outage while the new firmware is installed. There is no check in place to block upgrade and patch commands when switch ports are down. Connectivity is restored, but the inactive switch ports must be fixed to reenable HA.

Workaround: Before upgrading or patching the leaf and spine switches, ensure that all necessary ports on all devices are active. Verify switch status in Grafana. If unhealthy or inactive ports are detected, fix them first so that the high-availability configuration of the switch pair(s) is restored before you proceed.

Bug: 37049316

Version: 3.0.1

Upgrade Oracle Cloud Infrastructure Images Fails Waiting for Response from Workflow Service

Upgrade procedures consist of many tasks, which are orchestrated through the Upgrader Workflow Service (UWS). During the upgrade of the Oracle Cloud Infrastructure images on the appliance, the UWS might fail to send a response and cause the upgrade workflow to time out. The Upgrader log records the issue as follows:

[2025-05-21 19:38:57 1393662] ERROR (util_tasks:306) [Waiting for UWS response (Waiting for a response from UWS)] Failed 
Did not receive a response from UWS, manually import with 'importPlatformImages' command
[2025-05-21 19:38:58 1393662] INFO (util_tasks:310) [UpgradePlanUpdateTask (None)] Not Run 
Task did not run. This task only runs when OciConfigurationTask's upgrade is True AND Upgrade OCI Instance images has status of Passed.
[2025-05-21 19:38:58 1393662] INFO (oci_configuration:52) Component='ociImages', path='None', upgrade-required=True, upgrade-plan=True

Workaround: Retry the images upgrade from the Service Web UI or Service CLI. The upgrade is expected to complete successfully at the next attempt.

Bug: 37984531

Version: 3.0.2

Upgrade Fails Due to Incomplete Backup Job

When upgrading the appliance software, the platform upgrade stage might fail because an internal backup does not complete successfully before a timeout occurs. When this happens, command output includes:

[2025-02-09 01:59:43 5321] INFO (util_tasks:1773) Waiting for BRS cronjob Upgrade to finish processing.
[2025-02-09 01:59:43 5321] ERROR (util_tasks:306) [BRS cronjob Upgrade (Recreate BRS cronjob)] Failed

Logs contain details similar to this example:

Tasks 64 - Name = BRS cronjob Upgrade
Tasks 64 - Message = Command: ['kubectl', '-n', 'default', 'exec', 'brs-76c968c746-58gm4', '-c', 'brs', '--', '/usr/sbin/default-backup'] failed (255): stderr: time="2025-02-09T01:47:59Z" level=error msg="exec failed: unable to start container process: error adding pid 76969 to cgroups: failed to write 76969: openat2 /sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4a3334cf_a8fd_4608_a709_3a6703c6627c.slice/crio-d97e021a3b22d8d805b8fcede91266f59fb177002fab108d3f34affd43858d95.scope/cgroup.procs: no such file or directory"
command terminated with exit code 255

Workaround: Log in to a management node. Delete the pod that controls the backup (BRS service) cronjob. A new pod is launched automatically.

# kubectl delete pod brs-76c968c746-58gm4
pod "brs-76c968c746-58gm4" deleted

# kubectl get pods -A | grep brs
default                   brs-76c968c746-kcdnh           3/3     Running    0       47s

# kubectl exec -it brs-76c968c746-kcdnh -c brs -- /usr/sbin/default-backup
# echo $?
0

Bug: 37572149

Version: 3.0.2

IAM Service Reports Sync Status Error After Upgrade Preparation Commands

During the preparation phase of an appliance software upgrade, the latest Upgrader functionality is added to the running system. The Admin Service is a required component in this process and is already updated when the preUpgrade command is run. This causes a temporary version mismatch between the Admin Service and the IAM Service, with which it is tightly integrated.

New Admin API calls to IAM might fail until the corresponding version of the IAM Service is available on the system, as illustrated by this Service CLI example:

PCA-ADMIN> show iamservice
Data:
 Id = 66acfdd4-aa4e-4bda-ba9d-001d67fccf96
 Type = IamService
 IAM Link Mode = AUTO_SYNC
 Overall Communication State = Error
 Communication Error Message = Error processing iam.syncstatus.get response: PCA_GENERAL_000014: Error returned from IAM service. Code: 'NotSupportedError'. Message: 'Not Supported'
 Name = Iam Service
 Work State = Normal
 FaultIds 1 = id:f021e0a9-11b5-4483-a077-7a62049e637b type:Fault name:IamServiceSyncStatusFaultStatusFault(Iam Service)

Workaround: The error message suggests there is a sync issue with the IAM Service, but in fact all internal operations are proceeding as expected. The IAM Service is up to date when the platform and containerized microservices upgrade steps are completed, after which the error disappears. No workaround is required.

Bug: 37775091

Version: 3.0.2

Instances with a Shared Block Volume Cannot Be Part of Different Disaster Recovery Configurations

Multiple instances can have a block volume attached that is shared between them. If you add those instances to a disaster recovery (DR) configuration, their attached volumes are moved to a dedicated ZFS storage project. However, if the instances belong to different DR configurations, each one with its own separate ZFS storage project, the system cannot move any shared block volume as this always results in an invalid DR configuration. Therefore, the Disaster Recovery service does not support adding compute instances with shared block volumes to different DR configurations.

Workaround: Either include instances that share a block volume in the same DR configuration, or attach separate block volumes to each instance instead of a shared volume.

Bug: 34566745

Version: 3.0.2

DR Failover Error Because Initiator Group No Longer Exists

In the Native Disaster Recovery service, a failover plan is meant to be used when the primary system is down. However, the service does not prevent an administrator from performing a failover while the primary system is up and running. During a failover, to be able to perform role reversal between primary and standby, the DR service changes the LUN information in the replicated data stored on the standby system's ZFS Storage Appliance. If the primary system is online, it sends replication updates every 5 minutes, which revert the changes made to the LUN parameters on the standby. This causes role reversal to fail with an error similar to this example:

The initiator group '225fc1ba7c38_grp' no longer exists. It may have been destroyed or 
renamed by another administrator, or this LUN may have been imported from another system.

Workaround: Do not perform a failover when the primary system is online. Use switchover instead.

If you do run into this failover error, the recovery procedure involves changes on the ZFS Storage Appliance. Contact Oracle for assistance.

Bug: 37988746

Version: 3.0.2

Time-out Occurs when Generating Support Bundle

When you request assistance from Oracle Support, you are usually required to upload a support bundle with your request. A support bundle is generated from a management node using a command similar to this example:

# support-bundles -m time_slice --all -s 2022-01-01T00:00:00.000Z -e 2022-01-02T00:00:00.000Z

If there is a very large number of log entries to be collected for the specified time slice, the process can time out with an API exception and an error message that says "unable to execute command". In fact, the data collection continues in the background; the error is caused by a time-out of the websocket connection to the Kubernetes pod running the data collection process.

Workaround: If you encounter this time-out issue when collecting data for a support bundle, try specifying a shorter time slice to reduce the amount of data collected, as shown in the example below. If the process completes within 30 minutes, the error should not occur.
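
For example, using the same command as above but with a six-hour window instead of a full day (the timestamps are illustrative; adjust them to the period you need to investigate):

# support-bundles -m time_slice --all -s 2022-01-01T00:00:00.000Z -e 2022-01-01T06:00:00.000Z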

Bug: 33749450

Version: 3.0.2

DR Operations Intermittently Fail

Under certain heavy load conditions, Site Guard users performing DR operations on Private Cloud Appliance 3.0 can encounter out-of-session errors when the Site Guard EM scripts attempt those operations using the PCA DR REST API.

This condition occurs when the system is overloaded with requests.

Workaround: Retry the operation.

Bug: 33934952

Version: 3.0.1, 3.0.2

MN01 Host Upgrade Fails When It Is the Last Management Node to Upgrade

Upgrades and patches to the management nodes are performed in a sequential order. When MN01 falls last in that order, the management node upgrade or patch operation fails. To avoid this issue, ensure that the Management Node Virtual IP address is assigned to MN02 before you start any management node upgrade or patching operations.

Workaround: Assign the Management Node Virtual IP address to MN02 before you upgrade or patch.

# pcs resource move mgmt-rg pcamn02
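
To confirm that the virtual IP has moved before you start the upgrade or patch, check the cluster resource status and verify that the resources in the mgmt-rg group are started on pcamn02. This is a hedged check based on the pcs command shown above; the exact output format depends on the installed pcs version.

# pcs status resources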

Bug: 35554754

Version: 3.0.2

Failure Draining Node when Patching or Upgrading the Kubernetes Cluster

To prevent microservice pods from going into an inappropriate state, each Kubernetes node is drained before being upgraded to the next available version. The Upgrader allows all pods to be evicted gracefully before proceeding with the node. However, if a pod is stuck or is not evicted in time, the upgrade or patch process stops.

Workaround: If a Kubernetes node cannot be drained because a pod is not evicted, you must manually evict the pod that causes the failure.

  1. Log in to the Kubernetes node using SSH, and run the following command, using the appropriate host name:

    # kubectl drain pcamn00 --ignore-daemonsets --delete-local-data

    Wait for the draining to complete. The command output should indicate: node/pcamn00 drained.

  2. If the drain command fails, the output indicates which pod is causing the failure. Either run the drain command again with the --force option added, or delete the pod that is blocking the drain, replacing pod-name with the name reported in the output:

    # kubectl delete pod pod-name --force
  3. Rerun the Kubernetes upgrade or patch command. The Upgrader continues from where the process was interrupted.

Bug: 37291231

Version: 3.0.2

Oracle Auto Service Request Disabled after Upgrade

When a Private Cloud Appliance has been registered for Oracle Auto Service Request (ASR), and the service is enabled on the appliance, the ASR service may become disabled after an upgrade of the appliance software. The issue has been observed when upgrading to version 3.0.2-b925538.

Workaround: After the appliance software upgrade, verify the ASR configuration. If the ASR service is disabled, manually enable it again. See "Using Auto Service Requests" in the Status and Health Monitoring chapter of the Oracle Private Cloud Appliance Administrator Guide.

Bug: 35704133

Version: 3.0.2

Compute Node in NotReady State Blocks Upgrade or Patching

A known Linux issue can cause a compute node to block patching or upgrading. After a compute node is upgraded and rebooted, it can come back up in a NotReady state, which prevents further patching or upgrading.

Workaround: Reboot the impacted compute node and continue with the upgrade or patch.
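
A minimal check, assuming the affected compute node is registered as a worker node in the appliance Kubernetes cluster and kubectl is available on a management node; if the command returns no output, no nodes are in the NotReady state:

# kubectl get nodes | grep -i notready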

Bug: 36835607

Version: 3.0.2

Site Guard Precheck Jobs Fail

When Site Guard users perform DR prechecks, the prechecks might fail with an error:
"Incorrect hostname found in DNS for local IP nn.nn.nnn.nnn"

This error occurs when the domain name contains uppercase letters. Uppercase characters are not supported in domain names.

Workaround: Contact Oracle Support.

Bug: 36710199

Version: 3.0.2

Instance Migration Fails During Appliance Upgrade

When upgrading Private Cloud Appliance to software version 3.0.2-b1325160, instances might fail to migrate to another compute node. Rollback causes affected instances to return to their original host compute node, which interrupts active workloads. The issue is caused by competing commands and time-outs during statistics collection at the level of the hypervisor.

As part of this particular version upgrade, vmstats-exporter is uninstalled during the preparation phase and reinstalled during the platform upgrade. Until it is reinstalled, VM statistics show NO DATA.

Workaround: When the upgrade to version 3.0.2-b1325160 is complete, the issue is resolved.

Bug: 36936775

Version: 3.0.2

Restore NTP Configuration After Upgrade to PCA Version 3.0.2-b1392231

After an upgrade to software version 3.0.2-b1392231, time synchronization across nodes is lost, which eventually results in cluster and certificate validation failures.

This issue is fixed in software version 3.0.2-b1410505.

Workaround: See [ PCA 3.x ] Restore NTP Configuration After Upgrade To PCA Version: PCA:3.0.2-b1392231 (M3.11) (Doc ID 3089503.1).

Bug: 38035199

Version: 3.0.2

DR Configurations Are Not Supported Between Different Private Cloud Appliance Platforms

DR configurations are only supported when the DR connection is between two Private Cloud Appliances of the same platform. For example, a Private Cloud Appliance X9 rack can only be configured for DR with another Private Cloud Appliance X9 rack.

Workaround: Only create a DR configuration between two Private Cloud Appliance racks of the same platform: X9 to X9, X10 to X10, X11 to X11.

Bug: 38111294

Version: 3.0.1