Serviceability Issues

This section describes known issues and workarounds related to service, support, upgrade and data protection features.

Order of Upgrading Components Has Changed

When upgrading the platform, you must upgrade the compute nodes first. Failing to follow the required order can cause the upgrade to fail and disrupt the system.

Workaround: Complete platform upgrades in this order:
  1. Compute Nodes
  2. Management Nodes
  3. Management Node Operating System
  4. MySQL Cluster Database
  5. Secret Service
  6. Component Firmware
  7. Kubernetes Cluster
  8. Microservices

Bug: 34358305

Version: 3.0.1

DR Configurations Are Not Automatically Refreshed for Terminated Instances

If you terminate an instance that is part of a DR configuration, a subsequent switchover or failover operation fails because of the terminated instance. The correct procedure is to remove the instance from the DR configuration first, and then terminate the instance. If you terminate the instance without removing it first, you must refresh the DR configuration manually so that the entry for the terminated instance is removed. Keeping DR configurations in sync with the state of their associated resources is critical to protecting against data loss.

Workaround: This behavior is expected. Either remove the instance from the DR configuration before terminating, or refresh the DR configuration if you terminated the instance without removing it first.
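
The Service CLI sketch below illustrates that order. The command names removeInstanceFromDrConfig and refreshDrConfig and their parameters are assumptions for illustration only; refer to the Disaster Recovery documentation for the exact syntax.

PCA-ADMIN> removeInstanceFromDrConfig drConfigId=<dr-config-id> instanceId=<instance-id>    (assumed syntax)
PCA-ADMIN> refreshDrConfig drConfigId=<dr-config-id>                                        (assumed syntax)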

Bug: 33265549

Version: 3.0.1

Rebooting a Management Node while the Cluster State is Unhealthy Causes Platform Integrity Issues

Rebooting the management nodes is a delicate procedure because it requires many internal, interdependent operations to be executed in a controlled manner, with accurate timing and often in a specific order. If a management node fails to reboot correctly and rejoin the cluster, it can destabilize the appliance platform and infrastructure services. Symptoms include microservice pods in a CrashLoopBackOff state, data conflicts between MySQL cluster nodes, and repeated restarts of the NDB cluster daemon process.

Workaround: Before rebooting a management node, always verify that the MySQL cluster is in a healthy state. From the management node command line, run the command shown in the example below. If your output does not look similar, or indicates a cluster issue, power-cycle the affected management node through its ILOM using the restart /System command.

As a precaution, if you need to reboot all the management nodes, for example in a full management cluster upgrade scenario, observe an interval of at least 10 minutes between two management node reboot operations.

# ndb_mgm -e show
Connected to Management Server at: pcamn01:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     3 node(s)
id=17   @253.255.0.33  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0)
id=18   @253.255.0.34  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0, *)
id=19   @253.255.0.35  (mysql-8.0.25 ndb-8.0.25, Nodegroup: 0)

[ndb_mgmd(MGM)] 3 node(s)
id=1    @253.255.0.33  (mysql-8.0.25 ndb-8.0.25)
id=2    @253.255.0.34  (mysql-8.0.25 ndb-8.0.25)
id=3    @253.255.0.35  (mysql-8.0.25 ndb-8.0.25)

[mysqld(API)]   18 node(s)
id=33   @253.255.0.33  (mysql-8.0.25 ndb-8.0.25)
id=34   @253.255.0.33  (mysql-8.0.25 ndb-8.0.25)
[...]
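
In addition to the NDB cluster check, you can look for the pod symptoms mentioned above by listing pods that are not in a Running or Completed state. This is a minimal sketch, assuming kubectl is configured on the management node; on a healthy platform the command returns only the header line.

# kubectl get pods -A | grep -vE 'Running|Completed'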

Bug: 34484128

Version: 3.0.2

ULN Mirror Is Not a Required Parameter for Compute Node Patching

In the current implementation of the patching functionality, the ULN field is required for all patch requests. The administrator uses this field to provide the URL to the ULN mirror that is set up inside the data center network. However, compute nodes are patched differently: patches are applied from a secondary, internal ULN mirror on the shared storage of the management nodes. As a result, the ULN URL is technically not required to patch a compute node, but the patching code considers it a mandatory parameter, so it must be entered.

Workaround: When patching a compute node, include the URL to the data center ULN mirror as a parameter in your patch request. Regardless of the URL provided, the secondary ULN mirror accessible from the management nodes is used to perform the patching.
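
For illustration, a hedged sketch of such a patch request is shown below. The command name patchCN and the parameter names are assumptions here; check the patching documentation for the exact syntax used by your software version.

PCA-ADMIN> patchCN hostIp=<compute-node-ip> ULN=https://uln-mirror.example.com/yum    (assumed syntax)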

Bug: 33730639

Version: 3.0.1

Patch Command Times Out for Network Controller

When patching the platform, the process may fail due to a time-out while updating the network controller. In that case, the logs contain entries such as "ERROR [pcanwctl upgrade Failed]".

Workaround: Execute the same patch command again. The operation should succeed.

Bug: 33963876

Version: 3.0.1

Low Transfer Speed when Downloading ISO Image for Appliance Upgrade

During preparation of an appliance software upgrade, you might observe very low transfer speeds while the ISO image is being downloaded to the appliance internal storage. If the download is expected to take many hours to complete, it is likely that the spine switches are performing a very high number of network address translations, which indicates that the switches need to be reloaded.

In addition, on systems running appliance software versions 3.0.2-b1081557 through 3.0.2-b1325160 where active Kubernetes Engine clusters are present, the OKE network load balancer might be unreachable.

Workaround: Contact Oracle for assistance. Instructions will be provided to check the NAT statistics and reload the switches.

When upgrading the appliance software to the latest version, switches are reloaded during the preparation phase. From this point forward the workaround should no longer be required.

Bug: 37807342

Version: 3.0.2

Switch Upgrade or Patch Procedure Not Blocked When Ports Are Down

The spine and leaf switches are configured in pairs for high availability (HA), and new firmware is installed through a rolling upgrade or patch operation, which is launched with a single command. If certain ports in a switch are down, the HA configuration is impacted, and the upgrade or patch operation causes a brief outage while the new firmware is installed. There is no check in place to block upgrade and patch commands when switch ports are down. Connectivity is restored, but the inactive switch ports must be fixed to reenable HA.

Workaround: Before upgrading or patching the leaf and spine switches, ensure that all necessary ports on all devices are active. Verify switch status in Grafana. If unhealthy or inactive ports are detected, fix them first so that the high-availability configuration of the switch pair(s) is restored before you proceed.

Bug: 37049316

Version: 3.0.1

Upgrade Oracle Cloud Infrastructure Images Fails Waiting for Response from Workflow Service

Upgrade procedures consist of many tasks, which are orchestrated through the Upgrader Workflow Service (UWS). During the upgrade of the Oracle Cloud Infrastructure images on the appliance, the UWS might fail to send a response and cause the upgrade workflow to time out. The Upgrader log records the issue as follows:

[2025-05-21 19:38:57 1393662] ERROR (util_tasks:306) [Waiting for UWS response (Waiting for a response from UWS)] Failed 
Did not receive a response from UWS, manually import with 'importPlatformImages' command
[2025-05-21 19:38:58 1393662] INFO (util_tasks:310) [UpgradePlanUpdateTask (None)] Not Run 
Task did not run. This task only runs when OciConfigurationTask's upgrade is True AND Upgrade OCI Instance images has status of Passed.
[2025-05-21 19:38:58 1393662] INFO (oci_configuration:52) Component='ociImages', path='None', upgrade-required=True, upgrade-plan=True

Workaround: Retry the images upgrade from the Service Web UI or Service CLI. The upgrade is expected to complete successfully at the next attempt.

Bug: 37984531

Version: 3.0.2

Upgrade Fails Due to Incomplete Backup Job

When upgrading the appliance software, the platform upgrade stage might fail because an internal backup does not complete successfully before a timeout occurs. When this happens, command output includes:

[2025-02-09 01:59:43 5321] INFO (util_tasks:1773) Waiting for BRS cronjob Upgrade to finish processing.
[2025-02-09 01:59:43 5321] ERROR (util_tasks:306) [BRS cronjob Upgrade (Recreate BRS cronjob)] Failed

Logs contain details similar to this example:

Tasks 64 - Name = BRS cronjob Upgrade
Tasks 64 - Message = Command: ['kubectl', '-n', 'default', 'exec', 'brs-76c968c746-58gm4', '-c', 'brs', '--', '/usr/sbin/default-backup'] failed (255): stderr: time="2025-02-09T01:47:59Z" level=error msg="exec failed: unable to start container process: error adding pid 76969 to cgroups: failed to write 76969: openat2 /sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4a3334cf_a8fd_4608_a709_3a6703c6627c.slice/crio-d97e021a3b22d8d805b8fcede91266f59fb177002fab108d3f34affd43858d95.scope/cgroup.procs: no such file or directory"
command terminated with exit code 255

Workaround: Log in to a management node. Delete the pod that controls the backup (BRS service) cronjob. A new pod is launched automatically.

# kubectl delete pod brs-76c968c746-58gm4
pod "brs-76c968c746-58gm4" deleted

# kubectl get pods -A | grep brs
default                   brs-76c968c746-kcdnh           3/3     Running    0       47s

# kubectl exec -it brs-76c968c746-kcdnh -c brs -- /usr/sbin/default-backup
# echo $?
0

Bug: 37572149

Version: 3.0.2

IAM Service Reports Sync Status Error After Upgrade Preparation Commands

During the preparation phase of an appliance software upgrade, the latest Upgrader functionality is added to the running system. The Admin Service is a required component in this process and is already updated when the preUpgrade command is run. This causes a temporary version mismatch between the Admin Service and the IAM Service, with which it is tightly integrated.

New Admin API calls to IAM might fail until the corresponding version of the IAM Service is available on the system, as illustrated by this Service CLI example:

PCA-ADMIN> show iamservice
Data:
 Id = 66acfdd4-aa4e-4bda-ba9d-001d67fccf96
 Type = IamService
 IAM Link Mode = AUTO_SYNC
 Overall Communication State = Error
 Communication Error Message = Error processing iam.syncstatus.get response: PCA_GENERAL_000014: Error returned from IAM service. Code: 'NotSupportedError'. Message: 'Not Supported'
 Name = Iam Service
 Work State = Normal
 FaultIds 1 = id:f021e0a9-11b5-4483-a077-7a62049e637b type:Fault name:IamServiceSyncStatusFaultStatusFault(Iam Service)

Workaround: The error message suggests there is a sync issue with the IAM Service, but in fact all internal operations are proceeding as expected. The IAM Service is up to date when the platform and containerized microservices upgrade steps are completed, after which the error disappears. No workaround is required.

Bug: 37775091

Version: 3.0.2

Instances with a Shared Block Volume Cannot Be Part of Different Disaster Recovery Configurations

Multiple instances can have a block volume attached that is shared between them. If you add those instances to a disaster recovery (DR) configuration, their attached volumes are moved to a dedicated ZFS storage project. However, if the instances belong to different DR configurations, each one with its own separate ZFS storage project, the system cannot move any shared block volume as this always results in an invalid DR configuration. Therefore, the Disaster Recovery service does not support adding compute instances with shared block volumes to different DR configurations.

Workaround: Either include instances that share a block volume in the same DR configuration, or attach separate block volumes to each instance instead of a shared volume.

Bug: 34566745

Version: 3.0.2

DR Failover Error Because Initiator Group No Longer Exists

In the Native Disaster Recovery service, a failover plan is meant to be used when the primary system is down. However, the service does not prevent an administrator from performing a failover while the primary system is up and running. During a failover, to be able to perform role reversal between primary and standby, the DR service changes the LUN information in the replicated data stored on the standby system's ZFS Storage Appliance. If the primary system is online, it sends replication updates every 5 minutes, which revert the changes made to the LUN parameters on the standby. This causes role reversal to fail with an error similar to this example:

The initiator group '225fc1ba7c38_grp' no longer exists. It may have been destroyed or 
renamed by another administrator, or this LUN may have been imported from another system.

Workaround: Do not perform a failover when the primary system is online. Use switchover instead.

If you do run into this failover error, the recovery procedure involves changes on the ZFS Storage Appliance. Contact Oracle for assistance.

Bug: 37988746

Version: 3.0.2

Time-out Occurs when Generating Support Bundle

When you request assistance from Oracle Support, you are usually required to upload a support bundle with your request. A support bundle is generated from a management node using a command similar to this example:

# support-bundles -m time_slice --all -s 2022-01-01T00:00:00.000Z -e 2022-01-02T00:00:00.000Z

If there is a very large number of log entries to be collected for the specified time slice, the process can time out with an API exception and an error message that says "unable to execute command". In fact, the data collection continues in the background; the error is caused by a time-out of the websocket connection to the Kubernetes pod running the data collection process.

Workaround: If you encounter this time-out issue when collecting data for a support bundle, try specifying a shorter time slice to reduce the amount of data collected, as shown in the example below. If the process completes within 30 minutes, the error should not occur.
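
For example, using the same command as above but with a six-hour window instead of a full day (the timestamps are illustrative; adjust them to the period you need to investigate):

# support-bundles -m time_slice --all -s 2022-01-01T00:00:00.000Z -e 2022-01-01T06:00:00.000Z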

Bug: 33749450

Version: 3.0.2

DR Operations Intermittently Fail

Under certain heavy load conditions, Site Guard users performing DR operations on Private Cloud Appliance 3.0 can encounter out-of-session errors when the Site Guard EM scripts attempt those operations using the PCA DR REST API.

This condition occurs when the system is overloaded with requests.

Workaround: Retry the operation.

Bug: 33934952

Version: 3.0.1, 3.0.2

MN01 Host Upgrade Fails When It Is the Last Management Node to Upgrade

Upgrades and patches to the management nodes are performed in a sequential order. When MN01 falls last in that order, the management node upgrade or patch operation fails. To avoid this issue, ensure that the Management Node Virtual IP address is assigned to MN02 before you start any management node upgrade or patching operations.

Workaround: Assign the Management Node Virtual IP address to MN02 before you upgrade or patch.

# pcs resource move mgmt-rg pcamn02
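
To confirm that the virtual IP has moved before you start the upgrade or patch, check the cluster resource status and verify that the resources in the mgmt-rg group are started on pcamn02. This is a hedged check based on the pcs command shown above; the exact output format depends on the installed pcs version.

# pcs status resources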

Bug: 35554754

Version: 3.0.2

Failure Draining Node when Patching or Upgrading the Kubernetes Cluster

To prevent microservice pods from going into an inappropriate state, each Kubernetes node is drained before being upgraded to the next available version. The Upgrader allows all pods to be evicted gracefully before proceeding with the node. However, if a pod is stuck or is not evicted in time, the upgrade or patch process stops.

Workaround: If a Kubernetes node cannot be drained because a pod is not evicted, you must manually evict the pod that causes the failure.

  1. Log in to the Kubernetes node using SSH, and run the following command, using the appropriate host name:

    # kubectl drain pcamn00 --ignore-daemonsets --delete-local-data

    Wait for the draining to complete. The command output should indicate: node/pcamn00 drained.

  2. If the drain command fails, the output indicates which pod is causing the failure. Either run the drain command again with the --force option added, or delete the pod that is blocking the drain, replacing pod-name with the name reported in the output:

    # kubectl delete pod pod-name --force
  3. Rerun the Kubernetes upgrade or patch command. The Upgrader continues from where the process was interrupted.

Bug: 37291231

Version: 3.0.2

Oracle Auto Service Request Disabled after Upgrade

When a Private Cloud Appliance has been registered for Oracle Auto Service Request (ASR), and the service is enabled on the appliance, the ASR service may become disabled after an upgrade of the appliance software. The issue has been observed when upgrading to version 3.0.2-b925538.

Workaround: After the appliance software upgrade, verify the ASR configuration. If the ASR service is disabled, manually enable it again. See "Using Auto Service Requests" in the Status and Health Monitoring chapter of the Oracle Private Cloud Appliance Administrator Guide.

Bug: 35704133

Version: 3.0.2

Compute Node in NotReady State Blocks Upgrade or Patching

A known Linux issue can cause a compute node to block patching or upgrading. After a compute node is upgraded and rebooted, it can come back up in a NotReady state, which prevents further patching or upgrading.

Workaround: Reboot the impacted compute node and continue with the upgrade or patch.
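
A minimal check, assuming the affected compute node is registered as a worker node in the appliance Kubernetes cluster and kubectl is available on a management node; if the command returns no output, no nodes are in the NotReady state:

# kubectl get nodes | grep -i notready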

Bug: 36835607

Version: 3.0.2

Site Guard Precheck Jobs Fail

When Site Guard users perform DR prechecks, the prechecks might fail with an error:
"Incorrect hostname found in DNS for local IP nn.nn.nnn.nnn"

This error occurs when the domain name contains uppercase letters. Uppercase characters are not supported in domain names.

Workaround: Contact Oracle Support.

Bug: 36710199

Version: 3.0.2

Instance Migration Fails During Appliance Upgrade

When upgrading Private Cloud Appliance to software version 3.0.2-b1325160, instances might fail to migrate to another compute node. Rollback causes affected instances to return to their original host compute node, which interrupts active workloads. The issue is caused by competing commands and time-outs during statistics collection at the level of the hypervisor.

As part of this particular version upgrade, vmstats-exporter is uninstalled during the preparation phase and reinstalled during the platform upgrade. Until it is reinstalled, VM statistics show NO DATA.

Workaround: When the upgrade to version 3.0.2-b1325160 is complete, the issue is resolved.

Bug: 36936775

Version: 3.0.2

Restore NTP Configuration After Upgrade to PCA Version 3.0.2-b1392231

After an upgrade to software version 3.0.2-b1392231, time synchronization across nodes is lost, which eventually results in cluster and certificate validation failures.

This issue is fixed in software version 3.0.2-b1410505.

Workaround: See [ PCA 3.x ] Restore NTP Configuration After Upgrade To PCA Version: PCA:3.0.2-b1392231 (M3.11) (Doc ID 3089503.1).

Bug: 38035199

Version: 3.0.2

DR Configurations Are Not Supported Between Different Private Cloud Appliance Platforms

DR configurations are only supported when the DR connection is between two Private Cloud Appliances of the same platform. For example, a Private Cloud Appliance X9 rack can only be configured for DR with another Private Cloud Appliance X9 rack.

Workaround: Only create a DR configuration between two Private Cloud Appliance racks of the same platform: X9 to X9, X10 to X10, X11 to X11.

Bug: 38111294

Version: 3.0.1