6 Known Limitations and Workarounds

This chapter provides information about the known limitations and workarounds for Oracle Private Cloud Appliance (PCA).

Oracle Private Cloud Appliance Hardware

This section describes hardware-related limitations and workarounds.

Compute Node Boot Sequence Interrupted by LSI BIOS Battery Error

When a compute node is powered off for an extended period of time, a week or longer, the LSI BIOS may stop because of a battery error, waiting for the user to press a key in order to continue.

Workaround: Wait for approximately 10 minutes to confirm that the compute node is stuck in boot. Use the Reprovision button in the Oracle Private Cloud Appliance Dashboard to reboot the server and restart the provisioning process.

Bug 16985965

Reboot From Oracle Linux Prompt May Cause Management Node to Hang

When the reboot command is issued from the Oracle Linux command line on a management node, the operating system could hang during boot. Recovery requires manual intervention through the server ILOM.

Workaround: When the management node hangs during (re-)boot, log in to the ILOM and run these two commands in succession: stop -f /SYS and start /SYS. The management node should reboot normally.
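
A minimal sketch of the recovery sequence from the ILOM command line, assuming you are connected to the management node ILOM over SSH (the prompt shown is the standard ILOM CLI prompt):

-> stop -f /SYS
-> start /SYS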

Bug 28871758

Oracle ZFS Storage Appliance More Aggressively Fails Slow Disks

Oracle ZFS Storage Appliance IDR 8.8.44 5185.1 has a fault management architecture that more aggressively fails slower disks (FMA DISK-8000-VP). Disk failures may be reported because the system-wide slow-disk telemetry variable is set to a lower threshold.

If you encounter this issue, the following command will show ireport.io.scsi.cmd.disk.dev.slow.read with DISK-8000-VP and the HDD disk location.

> maintenance problems show

For more information, see the Oracle Support article Oracle ZFS Storage Appliance: Handling DISK-8000-VP 'fault.io.disk.slow_rw' (Doc ID 2906318.1).

Workaround:

If you determine you have a single UNAVAIL disk or multiple disks that are faulted and in a DEGRADED state, engage Oracle Support to investigate and correct the issue.

Oracle ZFS Storage Appliance Firmware Upgrade 8.7.20 Requires A Two-Phased Procedure

Oracle Private Cloud Appliance racks shipped prior to Release 2.3.4 have all been factory-installed with an older version of the Operating Software (AK-NAS) on the controllers of the ZFS Storage Appliance. A new version has been qualified for use with Oracle Private Cloud Appliance Release 2.3.4, but a direct upgrade is not possible. An intermediate upgrade to version 8.7.14 is required.

Workaround: Upgrade the firmware of storage heads twice: first to version 8.7.14, then to version 8.7.20. Both required firmware versions are provided as part of the Oracle Private Cloud Appliance Release 2.3.4 controller software. For upgrade instructions, refer to "Upgrading the Operating Software on the Oracle ZFS Storage Appliance" in Upgrading Oracle Private Cloud Appliance in the Oracle Private Cloud Appliance Administration Guide for Release 2.4.4.

Bug 28913616

Interruption of iSCSI Connectivity Leads to LUNs Remaining in Standby

If network connectivity between compute nodes and their LUNs is disrupted, it may occur that one or more compute nodes mark one or more iSCSI LUNs as being in standby state. The system cannot automatically recover from this state without operations requiring downtime, such as rebooting VMs or even rebooting compute nodes. The standby LUNs are caused by the specific methods that the Linux kernel and the ZFS Storage Appliance use to handle failover of LUN paths.

Workaround: This issue was resolved in the ZFS Storage Appliance firmware version AK 8.7.6. Customers who have run into issues with missing LUN paths and standby LUNs should update the ZFS Storage Appliance firmware to version AK 8.7.6 or later before upgrading Oracle Private Cloud Appliance.

Bug 24522087

Emulex Fibre Channel HBAs Discover Maximum 128 LUNs

When using optional Broadcom/Emulex Fibre Channel expansion cards in Oracle Server X8-2 compute nodes, and your FC configuration results in more than 128 LUNs between the compute nodes and the FC storage hardware, it may occur that only 128 LUNs are discovered. This is typically caused by a driver parameter for Emulex HBAs.

Workaround: Update the Emulex lpfc driver settings by performing the steps below on each affected compute node. A quick verification sketch follows the list.

  1. On the compute node containing the Emulex card, modify the file /etc/default/grub. At the end of the GRUB_CMDLINE_LINUX parameter, append the scsi_mod and lpfc module options shown.

    GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=vg/lvroot rd.lvm.lv=vg/lvswap \
    rd.lvm.lv=vg/lvusr rhgb quiet numa=off transparent_hugepage=never \
    scsi_mod.max_luns=4096 scsi_mod.max_report_luns=4096 lpfc.lpfc_max_luns=4096"
  2. Rebuild the grub configuration with the new parameters.

    # grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
  3. Reboot the compute node.
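
After the reboot, you can optionally confirm that the new options are active. A minimal check, assuming the lpfc module exposes its parameters under /sys/module (expected values shown for illustration):

# grep -o 'lpfc.lpfc_max_luns=[0-9]*' /proc/cmdline
lpfc.lpfc_max_luns=4096
# cat /sys/module/lpfc/parameters/lpfc_max_luns
4096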

Bug 30461433, 33114489

Fibre Channel LUN Path Discovery Is Disrupted by Other Oracle VM Operations

During the setup of Fibre Channel storage, when the zones on the FC switch have been created, the LUNs become visible to the connected compute nodes. Discovery operations are started automatically, and all discovered LUNs are added to the multipath configuration on the compute nodes. If the storage configuration contains a large number of LUNs, the multipath configuration may take a long time to complete. As long as the multipath configuration has not finished, the system is under high load, and concurrent Oracle VM operations may prevent some of the FC LUN paths from being added to multipath.

Workaround: Avoid Oracle VM operations during FC LUN discovery. Operations related to compute node provisioning and tenant group configuration are especially disruptive, because they include refreshing the storage layer. When LUNs become visible to the compute nodes, they are detected almost immediately. In contrast, the multipath configuration stage is time-consuming and resource-intensive.

Use the lsscsi command to determine the number of detected LUN paths. Its line count equals the number of LUN paths plus one for the system disk. Next, verify that all paths have been added to multipath. The multipath configuration is complete once the number of active paths reported by multipath -ll equals the lsscsi line count minus 1 (for the system disk).

# lsscsi | wc -l
251
# multipath -ll | grep "active ready running" | wc -l
250

When you have established that the multipath configuration is complete, all Oracle VM operations can be resumed.

Bug 30461555

Poor Oracle VM Performance During Configuration of Fibre Channel LUNs

Discovering Fibre Channel LUNs is a time-consuming and resource-intensive operation. As a result, Oracle VM jobs take an unusually long time to complete. Therefore, it is advisable to complete the FC storage configuration and make sure that the configuration is stable before initiating new Oracle VM operations.

Workaround: Schedule Fibre Channel storage setup and configuration changes at a time when no other Oracle VM operations are required. Verify that all FC configuration jobs have been completed, as explained in Fibre Channel LUN Path Discovery Is Disrupted by Other Oracle VM Operations. When the FC configuration is finished, all Oracle VM operations can be resumed.

Bug 30461478

ILOM Firmware Does Not Allow Loopback SSH Access

In Oracle Integrated Lights Out Manager (ILOM) firmware releases newer than 3.2.4, the service processor configuration contains a field named allowed_services that controls which services are permitted on an interface. By default, SSH is not permitted on the loopback interface. However, Oracle Enterprise Manager uses SSH over this interface to register Oracle Private Cloud Appliance management nodes. Therefore, SSH must be enabled manually if the ILOM version is newer than 3.2.4.

Workaround: On management nodes running an ILOM version more recent than 3.2.4, make sure that SSH is included in the allowed_services field of the network configuration. Log into the ILOM CLI through the NETMGT Ethernet port and enter the following commands:

-> cd /SP/network/interconnect
-> set hostmanaged=false
-> set allowed_services=fault-transport,ipmi,snmp,ssh
-> set hostmanaged=true 

Bug 26953763

incorrect opcode Messages in the Console Log

Any installed package that uses the mstflint command with the device (-d flag) specified by PCI ID generates the mst_ioctl 1177: incorrect opcode = 8008d10 error message. Messages similar to the following appear in the console log:

Sep 26 09:50:12 ovcacn10r1 kernel: [  218.707917]   MST::  : print_opcode  549: MST_PARAMS=8028d001 
Sep 26 09:50:12 ovcacn10r1 kernel: [  218.707919]   MST::  : print_opcode  551: PCICONF_READ4=800cd101 
Sep 26 09:50:12 ovcacn10r1 kernel: [  218.707920]   MST::  : print_opcode  552: PCICONF_WRITE4=400cd102 

This issue is caused by an error in the PCI memory mapping associated with the InfiniBand ConnectX device. The messages can be safely ignored; the reported error has no impact on PCA functionality.

Workaround: Using mstflint, access the device from the PCI configuration interface, instead of the PCI ID.

[root@ovcamn06r1 ~]# mstflint -d /proc/bus/pci/13/00.0 q
Image type: FS2
FW Version: 2.11.1280
Device ID: 4099
HW Access Key: Disabled
Description: Node Port1 Port2 Sysimage
GUIDs: 0010e0000159ed0c 0010e0000159ed0d 0010e0000159ed0e 0010e0000159ed0f
MACs: 0010e059ed0d 0010e059ed0e
VSD:
PSID: ORC1090120019 

Bug 29623624

Megaraid Firmware Crash Dump Is Not Available

ILOM console logs may contain many messages similar to this:

[ 1756.232496] megaraid_sas 0000:50:00.0: Firmware crash dump is not available
[ 1763.578890] megaraid_sas 0000:50:00.0: Firmware crash dump is not available
[ 2773.220852] megaraid_sas 0000:50:00.0: Firmware crash dump is not available

These are notifications, not errors or warnings. The crash dump feature in the megaraid controller firmware is not enabled, as it is not required in Oracle Private Cloud Appliance.

Workaround: This behavior is not a bug. No workaround is required.

Bug 30274703

North-South Traffic Connectivity Fails After Restarting Network

This issue may occur if you have not upgraded the Cisco Switch firmware to version NX-OS I7(7) or later. See "Upgrading the Cisco Switch Firmware" in Upgrading Oracle Private Cloud Appliance in the Oracle Private Cloud Appliance Administration Guide for Release 2.4.4.

Bug 29585636

Some Services Require an Upgrade of Hardware Management Pack

Certain secondary services running on Oracle Private Cloud Appliance, such as Oracle Auto Service Request or the Oracle Enterprise Manager Agent, depend on a specific or minimum version of the Oracle Hardware Management Pack. By design, the Controller Software upgrade does not include the installation of a new Oracle Hardware Management Pack or server ILOM version included in the ISO image. This may leave the Hardware Management Pack in a degraded state and not fully compatible with the ILOM version running on the servers.

Workaround: When upgrading the Oracle Private Cloud Appliance Controller Software, make sure that all component firmware matches the qualified versions for the installed Controller Software release. To ensure correct operation of services depending on the Oracle Hardware Management Pack, make sure that the relevant oracle-hmp*.rpm packages are upgraded to the versions delivered in the Controller Software ISO.
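
A quick way to review the installed Hardware Management Pack packages before and after the upgrade, assuming the packages follow the oracle-hmp naming used above (compare the reported versions against the RPMs delivered on the Controller Software ISO):

# rpm -qa 'oracle-hmp*'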

Bug 30123062

Compute Nodes Containing Emulex HBA Card With Maximum FC Paths Reboot With Errors in Oracle VM Manager UI

If a compute node contains an Emulex FC HBA and is configured with 500 LUNs/4000 paths, or 1000 LUNs/4000 paths, you might see the following errors upon reboot of that compute node.

Rack1-Repository errors:

Description: OVMEVT_00A000D_000 Presented repository: Rack1-Repository,
mount: ovcacn31r1_/OVS/Repositories/0004fb00000300009f334f0aad38872b, no
longer found on server: ovcacn31r1.
Please unpresent/present the repository on this server
(fsMountAbsPath: /OVS/Repositories/0004fb00000300009f334f0aad38872b,
fsMountSharePath: , fsMountName: 0004fb00000500003150bc24d6f7c2d5
OVMEVT_00A002D_002 Repository: [RepositoryDbImpl]
0004fb00000300009f334f0aad38872b (Rack1-Repository), is unmounted but in Dom0
DB

Compute Node error:

Description: OVMEVT_003500D_003 Active data was not found. Cluster service is
probably not running.

[root@ovcacn31r1 ~]# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "59b95c6b5c6bc782": Offline
Debug file system at /sys/kernel/debug: mounted

Workaround: Clear the errors for the compute node and the Rack1-Repository as follows.
  1. For the compute node follow these directions:

    See Oracle Support Document 2041602.1: Attempting to Present a repository fails with "Cannot present the Repository to server: <hostname>. Cluster is currently down on the server".

  2. For the Rack1-Repository, acknowledge the critical error, then refresh the repository.

Bug 33124747

Compute Nodes Containing FC HBA with Maximum FC Paths in Dead State After Reprovisioning

If you are reprovisioning a compute node that contains a Fibre Channel HBA with one of the following configurations, reprovisioning fails and leaves the compute node in a dead state.

  • 500 FC LUNs/4000 FC paths

  • 1000 FC LUNs/4000 FC paths

To avoid this issue, follow the directions below to reprovision these types of compute nodes.

Note:

Compute nodes with 128 or fewer FC LUNs, each with 2 paths, reprovision successfully without this workaround.

Workaround:

  1. Log in to the external storage and remove the compute node's FC initiator from the initiator group (the initiator group that was used to create the max FC paths).

  2. Log in to the compute node and run the multipath -F command to flush out the FC LUNs that are no longer available. multipath -ll will now only show 3 default LUNs.

    [root@ovcacn32r1 ~]# multipath -F                                            
                                                                          
    Jul 21 17:23:12 | 3600144f0d0d725c7000060f5ecb30004: map in use
    Jul 21 17:23:18 | 3600062b20200c6002889e3a010d81476: map in use
    Jul 21 17:23:22 | 3600144f0d0d725c7000060f5ecb10003: map in use
    [root@ovcacn32r1 ~]# multipath -ll
    3600144f0d0d725c7000060f5ecb30004 dm-502 SUN,ZFS Storage 7370
    size=3.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    `-+- policy='round-robin 0' prio=50 status=active
      `- 11:0:0:3   sdbks 71:1664  active ready running
    3600062b20200c6002889e3a010d81476 dm-0 AVAGO,MR9361-16i
    size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
    `-+- policy='round-robin 0' prio=1 status=active
      `- 8:2:1:0    sdb   8:16     active ready running
    3600144f0d0d725c7000060f5ecb10003 dm-501 SUN,ZFS Storage 7370
    size=12G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    `-+- policy='round-robin 0' prio=50 status=active
      `- 11:0:0:1   sdbkr 71:1648  active ready running
  3. Reprovision the compute node.

  4. (Emulex only) Log in to the compute node and re-apply the grub customization for the Emulex driver, see Emulex Fibre Channel HBAs Discover Maximum 128 LUNs.

  5. Log in to the external storage and re-add the compute node's FC initiator into the initiator group.

  6. Log in to the Oracle VM Manager UI and add the compute node as an admin server to the Unmanaged FibreChannel Storage Array. Refresh the Unmanaged FibreChannel Storage Array. Max FC paths should be restored.

Bug 33134228

Compute Node FC HBA (QLogic/Emulex) with FC LUNs Having Path Flapping

You might encounter path flapping when hundreds of FC LUNs are presented to a compute node in the following scenarios:

  • After a compute node reprovision

  • After a compute node upgrade

  • After exposing a compute node with hundreds of new LUNs (either by LUN creation on the storage array or by fabric rezoning)

If path flapping is occurring on your system, you will see the following errors on your compute node:

  • The tailf /var/log/devmon.log command shows many event messages similar to the following:

    AGENT_NOTIFY EVENT : Jul 29 19:56:38 {STORAGE} [CHANGE_DM_SD] (dm-961)
    3600144f0d987aa07000061027d9c48c6-10:0:0:1917
    (failed:0x2100000e1e1b95c0:3600144f0d987aa07000061027d9c48c6)
    AGENT_NOTIFY EVENT : Jul 29 19:56:38 {STORAGE} [CHANGE_DM_SD] (dm-988)
    3600144f0d987aa07000061027db248e1-10:0:0:1971
    (failed:0x2100000e1e1b95c0:3600144f0d987aa07000061027db248e1)
    AGENT_NOTIFY EVENT : Jul 29 19:56:38 {STORAGE} [CHANGE_DM_SD] (dm-988)
    3600144f0d987aa07000061027db248e1-10:0:0:1971
    (active:0x2100000e1e1b95c0:3600144f0d987aa07000061027db248e1)
    AGENT_NOTIFY EVENT : Jul 29 19:56:39 {STORAGE} [CHANGE_DM_SD] (dm-961)
    3600144f0d987aa07000061027d9c48c6-10:0:0:1917
    (active:0x2100000e1e1b95c0:3600144f0d987aa07000061027d9c48c6)

    This issue is resolved when /var/log/devmon.log stops logging new CHANGE_DM_SD messages.

  • The systemd-udevd process consumes 100% CPU, as shown in the top command output.

    This is resolved when systemd-udevd no longer consumes a large percentage of the CPU.

  • The multipath -ll command does not show all the LUNs. The command might show a fraction of the LUNs expected.

  • The multipath -ll | grep "active ready running" | wc -l command might not count all the LUNs. The command might show a fraction of the LUNs expected.

Workaround: Follow this procedure to resolve path flapping:

  1. Log in to the compute node as root and execute the systemctl restart multipathd command.

  2. Continue to execute the above detection commands until all 4 outputs are resolved and you see the correct number of FC LUNs/paths. (A simple monitoring command is sketched after this list.)

  3. If any of the monitoring scenarios does not resolve after 3-4 minutes, repeat step 1.
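
As a convenience, the path count can be monitored continuously while the flapping settles. A minimal sketch, assuming the expected total number of paths is known from your configuration (the 10-second interval is arbitrary):

# watch -n 10 'multipath -ll | grep -c "active ready running"'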

Bug 33171816

Upgrade Compute Node Fails with Fibre Channel LUNs

If your compute node contains an Emulex or QLogic Fibre Channel HBA, the compute node upgrade procedure might fail because of a Fibre Channel LUN path flapping problem. Use the following workaround to avoid the issue.

Workaround: See [PCA 2.4.4] Upgrade Compute Node with Fibre Channel Luns may Fail due to FC Path Flapping (Doc ID 2794501.1).

PCA Faultmonitor Check firewall_monitor Fails Due to nc: command not found

If your compute node fails the Faultmonitor firewall_monitor check and displays the following log error, you are encountering a port-check error that creates a false report and, if Phone Home is enabled, pushes it to Phone Home. The firewall_monitor check verifies whether the required ports between Oracle VM Manager and the compute node are open.

[2021-08-03 16:30:15 605830] ERROR (ovmfaultmonitor_utils:487) invalid
literal for int() with base 10: '-bash: nc: command not found'
Traceback (most recent call last):
  File
"/usr/lib/python2.7/site-packages/ovca/monitor/faultmonitor/ovmfaultmonitor/ov
mfaultmonitor_utils.py", line 458, in firewall_monitor
    cmd_outputs[server][port] = int(output.strip())
ValueError: invalid literal for int() with base 10: '-bash: nc: command not
found.

Workaround: To manually fix this error apply the workaround documented at: Oracle Support Document 2797364.1 ([PCA 2.4.4] Faultmonitor Check firewall_monitor Fails due to "nc: command not found").
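
To confirm that the missing nc utility is the cause on a given compute node, a quick check can be run before applying the documented fix (on Oracle Linux 7 the nc command is typically provided by the nmap-ncat package; this package name is an assumption, verify against the support document):

# which nc || echo "nc is not installed"
# rpm -q nmap-ncat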

Certain TZ Configuration is Failing on Cisco Switches

Starting in 2016, tzdata implemented numeric timezone abbreviations like "+03" for new timezones. The Cisco switches only support alphabetic timezone abbreviations like "ASTT". Attempting to change the timezone on a Cisco switch could cause the following error in the ovca.log file:

[2022-05-30 15:08:13 10455] ERROR (cisco:145) Configuration failed
partially:Clock timezone set:: Timezone name should contain alphabets only

Workaround: Do not change the timezones on the Cisco switches. Cisco switches will always report the time in UTC.

Bug 34223027

NFS Shares on Internal ZFS Will Fail After ZFS Firmware Update If vnic Owned by SN02

Starting with ZFSSA AK version 8.8.30, it is now a requirement that the address used to mount shares from a pool must be formally owned by the same head which formally owns the pool, as shown by Configuration -> Cluster.

Software version 2.4.4.1 introduces a new pre-check which flags any storage network interfaces that have owner = ovcasn02r1, so that you can manually correct the owner to ovcasn01r1 before proceeding with the upgrade. If you see the following error, proceed to the workaround below.

[2022-05-24 18:57:22 33554] ERROR (precheck:154) [ZFSSA Storage Network Interfaces Check (Ensure ovcasn01r1 is the owner of all customer-created storage network interfaces)] Failed
The check failed: Detected customer-created storage network interface(s) owned by ovcasn02r1: net/vnic10, net/vnic11, net/vnic12, net/vnic7, net/vnic8, net/vnic9

Workaround: See [PCA 2.4.4.1] Pre-check "ZFSSA Storage Network Interfaces Check" Fails (Doc ID 2876150.1).

Bug 34192251

Repository Size Is Not Reflected Properly in OVMM GUI

The Oracle VM Manager GUI can report an incorrect repository size, which can hide the fact that the repository is actually full and cause VMs to hang. Check the repository size using another method, such as the compute node df output or the OVM CLI.

Workaround: Check the repository size using OVM CLI.

OVM> show repository name=NFS-ExtZFSSA-Repository
Command: show repository name=NFS-ExtZFSSA-Repository
Status: Success
Time: 2021-10-19 10:48:15,294 UTC
Data:
  File System = 14a6cf21-a170-41aa-9a09-7b768aaabc6f  [nfs on
192.168.40.242:/export/NFS-Ext-Repo]
  Manager UUID = 0004fb000001000087ae02edfd0534dc
  File System Free (GiB) = 998.48
  File System Total (GiB) = 1018.84
  File System Used (GiB) = 20.37      
  Used % = 2.0
  Apparent Size (GiB) = 25.0
  Capacity % = 2.5
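
Alternatively, the same information can be read from the df output on a compute node that presents the repository. A sketch, assuming an NFS repository mounted under the standard /OVS/Repositories path (the repository ID is a placeholder):

# df -h /OVS/Repositories/<repository-id>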

Bug 33455258

Oracle Private Cloud Appliance Software

This section describes software-related limitations and workarounds.

Do Not Install Additional Software on Appliance Components

Oracle Private Cloud Appliance is delivered as an appliance: a complete and controlled system composed of selected hardware and software components. If you install additional software packages on the pre-configured appliance components, be it a compute node, management node or storage component, you introduce new variables that potentially disrupt the operation of the appliance as a whole. Unless otherwise instructed, Oracle advises against the installation or upgrade of additional packages, either from a third party or from Oracle's own software channels like the Oracle Linux YUM repositories.

Workaround: Do not install additional software on any internal Oracle Private Cloud Appliance system components. If your internal processes require certain additional tools, contact your Oracle representative to discuss these requirements.

X8 Server Hard Reset Required After Upgrade

When you upgrade to Oracle Private Cloud Appliance Release 2.4.4.2, a workaround for bug 34686980 is applied during the upgrade of the X8 servers.

After the upgrade, the servers must be power cycled to activate the change.

Workaround: Power cycle the servers to activate the workaround that was applied during upgrade.

Instructions for performing this power cycle are included in [PCA 2.4.4.2] Upgrade Guide (Doc ID 2914968.1).

See also [PCA 2.4.x] How to Disable ADDDC in X7-2/X8-2 Nodes (Doc ID 2916308.1) and Power Cycle the Server in the Oracle Servers X8-2 and X8-2L Operating Systems Installation Guide.

Bug: 34686980

Upgrader UI Installation Step Is Skipped

This issue causes the pca_upgrader to skip the ui_install step. The following message appears in the log file: "The Oracle VM upgrade has already been completed on the other manager. Skipping OVCA UI installation on management-node-name."

If you previously performed an upgrade and did not perform the preventive step described in the following workaround, you might need to manually install the UI.

Workaround: Before you begin the upgrade, delete the following file to prevent skipping UI installation:

/nfs/shared_storage/pca_upgrader/pxe_upgrade/.standby_upgrade_complete

To manually install the UI if you did not delete the .standby_upgrade_complete file prior to performing the upgrade, see [PCA 2.x] Accessing Dashboard Gets HTTP 404 Error After Upgrading Management Nodes (Doc ID 2491230.1).

Bug: 34906041

Deleting a Storage Network Is Allowed When the Network Is Assigned to NFS or iSCSI Storage

You are able to delete a storage_network network while that network is assigned to nfs-storage or iscsi-storage.

Workaround: Before you delete a storage_network network, use show nfs-storage and show iscsi-storage to verify that the storage_network is not assigned.

Bug 34767928

Node Manager Does Not Show Node Offline Status

The role of the Node Manager database is to track the various states a compute node goes through during provisioning. After successful provisioning the database continues to list a node as running, even if it is shut down. For nodes that are fully operational, the server status is tracked by Oracle VM Manager. However, the Oracle Private Cloud Appliance Dashboard displays status information from the Node Manager. This may lead to inconsistent information between the Dashboard and Oracle VM Manager, but it is not considered a bug.

Workaround: To verify the status of operational compute nodes, use the Oracle VM Manager user interface.

Bug 17456373

Compute Node State Changes Despite Active Provisioning Lock

The purpose of a lock of the type provisioning or all_provisioning is to prevent all compute nodes from starting or continuing a provisioning process. However, when you attempt to reprovision a running compute node from the Oracle Private Cloud Appliance CLI while an active lock is in place, the compute node state changes to "reprovision_only" and it is marked as "DEAD". Provisioning of the compute node continues as normal when the provisioning lock is deactivated.

Bug 22151616

Compute Nodes Are Available in Oracle VM Server Pool Before Provisioning Completes

Compute node provisioning can take up to several hours to complete. However, those nodes are added to the Oracle VM server pool early on in the process, but they are not placed in maintenance mode. In theory the discovered servers are available for use in Oracle VM Manager, but you must not attempt to alter their configuration in any way before the Oracle Private Cloud Appliance Dashboard indicates that provisioning has completed.

Workaround: Wait for compute node provisioning to finish. Do not modify the compute nodes or server pool in any way in Oracle VM Manager.

Bug 22159111

Virtual Machines Remain in Running Status when Host Compute Node Is Reprovisioned

Using the Oracle Private Cloud Appliance CLI it is possible to force the reprovisioning of a compute node even if it is hosting running virtual machines. The compute node is not placed in maintenance mode. Consequently, the active virtual machines are not shut down or migrated to another compute node. Instead these VMs remain in running status and Oracle VM Manager reports their host compute node as "N/A".

Caution:

Reprovisioning a compute node that hosts virtual machines is considered bad practice. Good practice is to migrate all virtual machines away from the compute node before starting a reprovisioning operation or software update.

Workaround: In this particular condition the VMs can no longer be migrated. They must be killed and restarted. After a successful restart they return to normal operation on a different host compute node in accordance with the start policy defined for the server pool.

Bug 22018046

Ethernet-Based System Management Nodes Have Non-Functional bond0 Network Interface

When the driver for network interface bonding is loaded, the system automatically generates a default bond0 interface. However, this interface is not activated or used in the management nodes of an Oracle Private Cloud Appliance with the Ethernet-based network architecture.

Workaround: The bond0 interface is not configured in any usable way and can be ignored on Ethernet-based systems. On InfiniBand-based systems, the bond0 interface is functional and configured.

Bug 29559810

Network Performance Is Impacted by VxLAN Encapsulation

The design of the all-Ethernet network fabric in Oracle Private Cloud Appliance relies heavily on VxLAN encapsulation and decapsulation. This extra protocol layer requires additional CPU cycles and consequently reduces network performance compared to regular tagged or untagged traffic. In particular the connectivity to and from VMs can be affected. To compensate for the CPU load of VxLAN processing, the MTU (Maximum Transmission Unit) on VM networks can be increased to 9000 bytes, which is the setting across the standard appliance networks. However, the network paths should be analyzed carefully to make sure that the larger MTU setting is supported between the end points: if an intermediate network device only supports an MTU of 1500 bytes, then the fragmentation of the 9000 byte packets will result in a bigger performance penalty.

Workaround: If the required network performance cannot be obtained with a default MTU of 1500 bytes for regular VM traffic, you should consider increasing the MTU to 9000 bytes, both on the VM network and inside the VM itself.
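
Inside a Linux guest, for example, the interface MTU can be raised for testing as shown below; the interface name eth0 is an assumption, and the change should be made persistent in the guest's network configuration once validated:

# ip link set dev eth0 mtu 9000
# ip link show eth0 | grep mtu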

Bug 29664090

Altering Custom Network VLAN Tag Is Not Supported

When you create a custom network, it is technically possible – though not supported – to alter the VLAN tag in Oracle VM Manager. However, when you attempt to add a compute node, the system creates the network interface on the server but fails to enable the modified VLAN configuration. At this point the custom network is stuck in a failed state: neither the network nor the interfaces can be deleted, and the VLAN configuration can no longer be changed back to the original tag.

Workaround: Do not modify appliance-level networking in Oracle VM Manager. There are no documented workarounds and any recovery operation is likely to require significant downtime of the Oracle Private Cloud Appliance environment.

Bug 23250544

Configuring Uplinks with Breakout Ports Results in Port Group Named 'None'

When you split uplink ports for custom network configuration by means of a breakout cable, and subsequently start configuring the port pairs through the Oracle Private Cloud Appliance CLI, all four breakout ports are stored in the configuration database at the same time. This means that when you add the first two of four breakout ports to a port group, the remaining two breakout ports on the same cable are automatically added to another port group named "None", which remains disabled. When you add the second pair of breakout ports to a port group, "None" is replaced with the port group name of your choice, and the port group is enabled. The sequence of commands in the example shows how the configuration changes step by step:

PCA> create uplink-port-group custom_ext_1 '1:1 1:2' 10g-4x
Status: Success

PCA> list uplink-port-group
Port_Group_Name    Ports      Mode    Speed    Breakout_Mode    Enabled   State
---------------    -----      ----    -----    -------------    -------   -----
default_5_1        5:1 5:2    LAG     10g      10g-4x           True      (up)* Not all ports are up
default_5_2        5:3 5:4    LAG     10g      10g-4x           False     down
custom_ext_1       1:1 1:2    LAG     10g      10g-4x           True      up
None               1:3 1:4    LAG     10g      10g-4x           False     up
----------------
4 rows displayed
Status: Success

PCA> create uplink-port-group custom_ext_2 '1:3 1:4' 10g-4x
Status: Success

PCA> list uplink-port-group
Port_Group_Name    Ports      Mode    Speed    Breakout_Mode    Enabled   State
---------------    -----      ----    -----    -------------    -------   -----
default_5_1        5:1 5:2    LAG     10g      10g-4x           True      (up)* Not all ports are up
default_5_2        5:3 5:4    LAG     10g      10g-4x           False     down
custom_ext_1       1:1 1:2    LAG     10g      10g-4x           True      up
custom_ext_2       1:3 1:4    LAG     10g      10g-4x           True      up
----------------
4 rows displayed
Status: Success

Workaround: This behavior is by design, because all four breakout ports must be added to the network configuration at the same time. A port group named "None", consisting of the two (temporarily) unconfigured ports of a 4-way breakout cable, can safely be ignored.

Bug 30426198

DPM Server Pool Policy Interrupts Synchronization of Tenant Group Settings

Tenant groups in Oracle Private Cloud Appliance are based on Oracle VM server pools, with additional configuration for network and storage across the servers included in the tenant group. When a compute node is added to a tenant group, its network and storage configuration is synchronized with the other servers already in the tenant group. This process takes several minutes, and could therefore be interrupted if a distributed power management (DPM) policy is active for the Oracle VM server pool. The DPM policy may force the new compute node to shut down because it contains no running virtual machines, while the tenant group configuration process on the compute node is still in progress. The incomplete configuration causes operational issues at the level of the compute node or even the tenant group.

Workaround: If server pool policies are a requirement, it is suggested to turn them off temporarily when modifying tenant groups or during the installation and configuration of expansion compute nodes.

Bug 30478940

Host Network Parameter Validation Is Too Permissive

When you define a host network, it is possible to enter invalid or contradictory values for the Prefix, Netmask and Route_Destination parameters. For example, when you enter a prefix with "0" as the first octet, the system attempts to configure IP addresses on compute node Ethernet interfaces starting with 0. Also, when the netmask part of the route destination you enter is invalid, the network is still created, even though an exception occurs. When such a poorly configured network is in an invalid state, it cannot be reconfigured or deleted with standard commands.

Workaround: Double-check your CLI command parameters before pressing Enter. If an invalid network configuration is applied, use the --force option to delete the network.

Bug 25729227

Virtual Appliances Cannot Be Imported Over a Host Network

A host network provides connectivity between compute nodes and hosts external to the appliance. It is implemented to connect external storage to the environment. If you attempt to import a virtual appliance, also known as assemblies in previous releases of Oracle VM and Oracle Private Cloud Appliance, from a location on the host network, it is likely to fail, because Oracle VM Manager instructs the compute nodes to use the active management node as a proxy for the import operation.

Workaround: Make sure that the virtual appliance resides in a location accessible from the active management node.

Bug 25801215

Customizations for ZFS Storage Appliance in multipath.conf Are Not Supported

The ZFS stanza in multipath.conf is controlled by the Oracle Private Cloud Appliance software. The internal ZFS Storage Appliance is a critical component of the appliance and the multipath configuration is tailored to the internal requirements. You should never modify the ZFS parameters in multipath.conf, because it could adversely affect the appliance performance and functionality.

Even if customizations were applied for (external) ZFS storage, they are overwritten when the Oracle Private Cloud Appliance Controller Software is updated. A backup of the file is saved prior to the update. Customizations in other stanzas of multipath.conf, for storage devices from other vendors, are preserved during upgrades.

Bug 25821423

Customer Created LUNs Are Mapped to the Wrong Initiator Group

When adding LUNs on the Oracle Private Cloud Appliance internal ZFS Storage Appliance you must add them under the "OVM" target group. Only this default target group is supported; there can be no additional target groups. However, on the initiator side you should not use the default configuration, otherwise all LUNs are mapped to the "All Initiators" group, and accessible for all nodes in the system. Such a configuration may cause several problems within the appliance.

Additional, custom LUNs on the internal storage must instead be mapped to one or more custom initiator groups. This ensures that the LUNs are mapped to the intended initiators, and are not remapped by the appliance software to the default "All Initiators" group.

Workaround: When creating additional, custom LUNs on the internal ZFS Storage Appliance, always use the default target group, but make sure the LUNs are mapped to one or more custom initiator groups.

Bugs 22309236 and 18155778

Storage Head Failover Disrupts Running Virtual Machines

When a failover occurs between the storage heads of a ZFS Storage Appliance, virtual machine operation could be disrupted by temporary loss of disk access. Depending on the guest operating system, and on the configuration of the guest and Oracle VM, a VM could hang, power off or reboot. This behavior is caused by an iSCSI configuration parameter that does not allow sufficient recovery time for the storage failover to complete.

Workaround: Increase the value of node.session.timeo.replacement_timeout in the file /etc/iscsi/iscsid.conf. For details, refer to the support note with Doc ID 2189806.1.
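
The current setting on a compute node can be checked as shown below; the value 120 is the common default, and the replacement value to configure is the one recommended in the support note, not a value chosen here:

# grep replacement_timeout /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 120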

Bug 24439070

Changing Multiple Component Passwords Causes Authentication Failure in Oracle VM Manager

When several different passwords are set for different appliance components using the Oracle Private Cloud Appliance Dashboard, you could be locked out of Oracle VM Manager, or communication between Oracle VM Manager and other components could fail, as a result of authentication failures. The problem is caused by a partially failed password update, whereby a component has accepted the new password while another component continues to use the old password to connect.

The risk of authentication issues is considerably higher when Oracle VM Manager and its directly related components Oracle WebLogic Server and Oracle MySQL database are involved. A password change for these components requires the ovmm service to restart. If another password change occurs within a matter of a few minutes, the operation to update Oracle VM Manager accordingly could fail because the ovmm service was not active. An authentication failure will prevent the ovmm service from restarting.

Workaround: If you set different passwords for appliance components using the Oracle Private Cloud Appliance Dashboard, change them one by one with a 10 minute interval. If the ovmm service is stopped as a result of a password change, wait for it to restart before making further changes. If the ovmm service fails to restart due to authentication issues, it may be necessary to replace the file /nfs/shared_storage/wls1/servers/AdminServer/security/boot.properties with the previous version of the file (boot.properties.old).
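
A minimal sketch of restoring the previous boot.properties file, using the path given above (keep a copy of the current file first; the .failed suffix is arbitrary):

# cd /nfs/shared_storage/wls1/servers/AdminServer/security/
# cp -p boot.properties boot.properties.failed
# cp -p boot.properties.old boot.properties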

Bug 26007398

ILOM Password of Expansion Compute Nodes Is Not Synchronized During Provisioning

After the rack components have been configured with a custom password, any compute node ILOM of a newly installed expansion compute node does not automatically take over the password set by the user in the Wallet. The compute node provisions correctly, and the Wallet maintains access to its ILOM even though it uses the factory-default password. However, it is good practice to make sure that custom passwords are correctly synchronized across all components.

Workaround: Set or update the compute node ILOM password using the Oracle Private Cloud Appliance Dashboard or CLI. This sets the new password both in the Wallet and the compute node ILOM.

Bug 26143197

SSH Host Key Mismatch After Management Node Failover

When logging in to the active management node using SSH, you typically use the virtual IP address shared between both management nodes. However, since they are separate physical hosts, they have a different host key. If the host key is stored in the SSH client, and a failover to the secondary management node occurs, the next attempt to create an SSH connection through the virtual IP address results in a host key verification failure.

Workaround: Do not store the host key in the SSH client. If the key has been stored, remove it from the client's file system; typically inside the user directory in .ssh/known_hosts.
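
With an OpenSSH client, for example, the stored key for the shared virtual IP address can be removed as follows (substitute the virtual IP address or host name used in your environment):

# ssh-keygen -R <virtual-ip-address>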

Bug 22915408

External Storage Cannot Be Discovered Over Data Center Network

The default compute node configuration does not allow connectivity to additional storage resources in the data center network. Compute nodes are connected to the data center subnet to enable public connectivity for the virtual machines they host, but the compute nodes' network interfaces have no IP address in that subnet. Consequently, SAN or file server discovery will fail.

Bug 17508885

Mozilla Firefox Cannot Establish Secure Connection with User Interface

Both the Oracle Private Cloud Appliance Dashboard and the Oracle VM Manager user interface run on an architecture based on Oracle WebLogic Server, Oracle Application Development Framework (ADF) and Oracle JDK 6. The cryptographic protocols supported on this architecture are SSLv3 and TLSv1.0. Mozilla Firefox version 38.2.0 or later no longer supports SSLv3 connections with a self-signed certificate. As a result, an error message might appear when you try to open the user interface login page.

Workaround: Override the default Mozilla Firefox security protocol as follows:

  1. In the Mozilla Firefox address bar, type about:config to access the browser configuration.

  2. Acknowledge the warning about changing advanced settings by clicking I'll be careful, I promise!.

  3. In the list of advanced settings, use the Search bar to filter the entries and look for the settings to be modified.

  4. Double-click the following entries and then enter the new value to change the configuration preferences:

    • security.tls.version.fallback-limit: 1

    • security.ssl3.dhe_rsa_aes_128_sha: false

    • security.ssl3.dhe_rsa_aes_256_sha: false

  5. If necessary, also modify the configuration preference security.tls.insecure_fallback_hosts and enter the affected hosts as a comma-separated list, either as domain names or as IP addresses.

  6. Close the Mozilla Firefox advanced configuration tab. The pages affected by the secure connection failure should now load normally.

Bugs 21622475 and 21803485

Virtual Machine with High Availability Takes Five Minutes to Restart when Failover Occurs

The compute nodes in an Oracle Private Cloud Appliance are all placed in a single clustered server pool during provisioning. A clustered server pool is created as part of the provisioning process. One of the configuration parameters is the cluster time-out: the time a server is allowed to be unavailable before failover events are triggered. To avoid false positives, and thus unwanted failovers, the Oracle Private Cloud Appliance server pool time-out is set to 300 seconds. As a consequence, a virtual machine configured with high availability (HA VM) can be unavailable for 5 minutes when its host fails. After the cluster time-out has passed, the HA VM is automatically restarted on another compute node in the server pool.

This behavior is as designed; it is not a bug. The server pool cluster configuration causes the delay in restarting VMs after a failover has occurred.

CLI Command update appliance Is Deprecated

The Oracle Private Cloud Appliance command line interface contains the update appliance command, which is used in releases prior to 2.3.4 to unpack a Controller Software image and update the appliance with a new software stack. This functionality is now part of the Upgrader tool, so the CLI command is deprecated and will be removed in the next release.

Workaround: Future updates and upgrades will be executed through the Oracle Private Cloud Appliance Upgrader.

Bug 29913246

Certain CLI Commands Fail in Single-command Mode

The Oracle Private Cloud Appliance command line interface can be used in an interactive mode, using a closed shell environment, or in a single-command mode. When using the single-command mode, commands and arguments are entered at the Oracle Linux command prompt as a single line. If such a single command contains special characters, such as quotation marks, they may be stripped out and interpreted incorrectly.

Workaround: Use the CLI in interactive mode to avoid special characters being stripped out of command arguments. If you must use single-command mode, use single and double quotation marks around the arguments where required, so that only the outer quotation marks are stripped out. For example, change this command from:

# pca-admin create uplink-port-group myPortGroup '2:1 2:2' 10g-4x

to

# pca-admin create uplink-port-group myPortGroup "'2:1 2:2'" 10g-4x

Do not nest two pairs of the same type of quotation marks.

Bug 30421250

Upgrader Checks Logged in Different Order

Due to a change in how the Oracle Private Cloud Appliance Upgrader tests are run, the output of the checks could be presented in a different order each time the tests are run.

This behavior is not a bug. There is no workaround required.

Bug 30078487

Virtual Machine Loses IP Address Due to DHCP Timeout During High Network Load

When an Oracle Private Cloud Appliance is configured to the maximum limits and a high load is running, a situation may occur where general DHCP/IP bandwidth limits are exceeded. In this case the DHCP client eventually reaches a timeout, and as a result the virtual machine IP address is lost, then reset to 0.0.0.0. This is normal behavior when the system is operating at full bandwidth capacity.

Workaround: When adequate bandwidth is available, recover from the situation by issuing the dhclient command from the virtual machine to request a new IP address.
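
For example, from inside the guest (the interface name eth0 is an assumption):

# dhclient eth0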

Bug 30143723

Adding the Virtual Machine Role to the Storage Network Causes Cluster to Lose Heartbeat Networking

Attempting to add the Virtual Machine role to the storage network in Oracle VM Manager on an Oracle Private Cloud Appliance can cause your cluster to lose heartbeat networking, which will impact running virtual machines and their workloads. This operation is not supported on Oracle Private Cloud Appliance.

Workaround: Do not add the VM role to the storage-int network.

Bug 30936974

Adding Virtual Machine Role to the Management Network Causes Oracle VM Manager to Lose Contact with the Compute Nodes

Attempting to add the Virtual Machine role to the management network in Oracle VM Manager on an Oracle Private Cloud Appliance causes you to lose connectivity with your compute nodes. The compute nodes are still up, however your manager cannot communicate with the compute nodes, which leaves your rack in a degraded state. This operation is not supported on Oracle Private Cloud Appliance.

Workaround: Do not add the VM role to the mgmt-int network.

Bug 30937049

Inadvertent Reboot of Stand-by Management Node During Upgrade Suspends Upgrade

When upgrading to Oracle Private Cloud Appliance Controller Software release 2.4.3 from release 2.3.4 or a 2.4.x release, you are required to upgrade the original stand-by management node first. Part of that upgrade is a reboot of this node, which happens automatically during the upgrade process. After this reboot the original stand-by management node becomes the new active node. The next step is to upgrade the original active management node. However, if you instead inadvertently reboot the original stand-by node again (the node that is now the new active node), you will be unable to proceed with the upgrade, because this reboot causes the Oracle Private Cloud Appliance services on the new active node to fail.

Workaround: Reboot the original active node. This restarts the Oracle Private Cloud Appliance services on the new active node and you can proceed with upgrading the original active node.

Bug 30968544

Loading Incompatible Spine Switch Configuration Causes Storage Network Outage

When upgrading to Oracle Private Cloud Appliance Controller Software release 2.4.3 on an Ethernet-based system, do not attempt to make any manual changes to the spine switch configurations prior to the completion of the storage network upgrade. Doing so could cause the management nodes to lose access to the storage network. The management nodes may also get rebooted.

Additionally, once an upgrade to Controller Software release 2.4.3 is complete on an Ethernet-based system, do not attempt to reload a spine switch backup from a prior software release. This could cause the management nodes to lose access to the storage network. The management nodes may also get rebooted. For example, you may see these error messages:

192.0.2.1 is unreachable
[root@ovcamn05r1 data]# ping 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
 
 
Mount points under shared storage are gone.
[root@ovcamn05r1 ~]# ls /nfs/shared_storage/
logs NO_STORAGE_MOUNTED
 
 
No master management node any more. o2cb service is offline. Both management nodes are slave now.
[root@ovcamn05r1 ~]# pca-check-master
o2cb service is offline.
NODE: 192.0.2.2 MASTER: False

Workaround: Manually roll back the changes made on the spine switch configurations, then reboot both management nodes.

Bug 31407007

Cloud Backup Task Hangs When a ZFSSA Takeover is Performed During Backup

When the connection to the ZFS storage appliance is interrupted, the Oracle Cloud Infrastructure process will terminate the operation and mark it failed in the task database. In some cases, such as a management node reboot, there is no mechanism to update the state.

Workaround: When the task is unable to change state, delete the task from the task database, delete the oci_backup lock file, and initiate a new backup operation. See "Cloud Backup" in Monitoring and Managing Oracle Private Cloud Appliance in the Oracle Private Cloud Appliance Administration Guide for Release 2.4.4.

Bug 31028898

Export VM to Oracle Cloud Infrastructure Job Shows as Aborted During MN Failover but it is Running in the Background

If there is an Export VM to Oracle Cloud Infrastructure job running when an active management node reboots or crashes, that job status changes to Aborted on Oracle VM Manager. In some cases, the export job will continue on the Exporter Appliance, despite the Abort message.

Workaround: Restart the Export VM to Oracle Cloud Infrastructure job. If the job is still running in the background, a pop up message shows An export operation is already in progress for VM. If the export job was aborted gracefully with the management node failover, then the export job is restarted.

Bug 31687516

Remove Deprecated pca-admin diagnose software Command

As of the Oracle Private Cloud Appliance Software Controller version 2.4.3 release, the pca-admin diagnose software command is no longer functional.

Workaround: Use the diagnostic functions now available through a separate health check tool. See "Health Monitoring" in Monitoring and Managing Oracle Private Cloud Appliance in the Oracle Private Cloud Appliance Administration Guide for Release 2.4.4 for more information.

Bug 31705580

Virtual Machine get Message Failed After 200 Seconds - Observed When kube clusters are Created Concurrently

When using the Oracle Private Cloud Appliance Cloud Native Environment release 1.2 OVA to create kube clusters, if you attempt to start multiple clusters at the same time, some clusters may fail with the following message:

Error_Code           VM_ERROR_004
Error_Message        Error (VM_ERROR_004): Virtual machine
autonas-cc3-master-3 get message failed after 200 seconds:
com.oracle.linux.keepalived.master-addr,com.oracle.linux.k8s.error,com.oracle.
linux.k8s.script-result,com.oracle.linux.keepalived.error.

Workaround: Stop the kube cluster that has failed, then restart that kube cluster.

Bug 32799556

Kube Cluster Creation/Deletion Should Not Be in Progress When Management Node Upgrade is Initiated

When upgrading management nodes from Software Controller release 2.4.3 to Software Controller release 2.4.4, do not initiate any kube cluster start or stop operations. As part of the upgrade procedure, a management node failover occurs. This failover can cause a kube cluster to go into a degraded state, if the kube cluster was attempting to start or stop at the time of the upgrade.

Workaround: Stop the kube cluster that has failed, then restart that kube cluster. These operations will clean up and recreate any VMs that were corrupted.

Bug 32880993

o2cb Service Status Reports "Registering O2CB cluster "ocfs2": Failed" State After Compute Node Provisioned

After compute nodes are provisioned, during the upgrade from Oracle Private Cloud Appliance release 2.4.3 to 2.4.4, you may encounter error messages with the o2cb service. When queried, the service is in the active state, but some clusters may show a failed state, as seen in the example below.

[root@ovcacn08r1 ~]# service o2cb status
Redirecting to /bin/systemctl status o2cb.service
 o2cb.service - Load o2cb Modules
   Loaded: loaded (/usr/lib/systemd/system/o2cb.service; enabled; vendor
preset: disabled)
   Active: active (exited) since Thu 2021-04-22 09:00:51 UTC; 21h ago
 Main PID: 2407 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/o2cb.service

Apr 22 09:00:51 ovcacn08r1 o2cb.init[2407]: Loading stack plugin "o2cb": OK
Apr 22 09:00:51 ovcacn08r1 o2cb.init[2407]: Loading filesystem "ocfs2_dlmfs":
OK
Apr 22 09:00:51 ovcacn08r1 o2cb.init[2407]: Creating directory '/dlm': OK
Apr 22 09:00:51 ovcacn08r1 o2cb.init[2407]: Mounting ocfs2_dlmfs filesystem
at /dlm: OK
Apr 22 09:00:51 ovcacn08r1 o2cb.init[2407]: Setting cluster stack "o2cb": OK
Apr 22 09:00:51 ovcacn08r1 o2cb.init[2407]: Registering O2CB cluster "ocfs2":
Failed
Apr 22 09:00:51 ovcacn08r1 o2cb.init[2407]: o2cb: Unknown cluster 'ocfs2'
Apr 22 09:00:51 ovcacn08r1 o2cb.init[2407]: Unregistering O2CB cluster
"ocfs2": Failed
Apr 22 09:00:51 ovcacn08r1 o2cb.init[2407]: o2cb: Cluster 'ocfs2' is not
active
Apr 22 09:00:51 ovcacn08r1 systemd[1]: Started Load o2cb Modules.
[root@ovcacn08r1 ~]#

Workaround: The failed messages incorrectly report the status of the clusters; the clusters are functioning properly. It is safe to ignore these error messages. To clear the false messages, restart the o2cb service and check the status.

[root@ovcacn10r1 ~]# service o2cb restart
Redirecting to /bin/systemctl restart o2cb.service
[root@ovcacn10r1 ~]# service o2cb status

Bug 32667300

Compute Node Upgrade Restores Default Repository When Compute Node Was Previously Not Part of Any Tenant Group or Repository

When upgrading compute nodes from Software Controller release 2.4.3 to Software Controller release 2.4.4, if you have a compute node that is a part of the default tenant group but has no assigned repositories, the upgrade process restores the default repository to that compute node.

Workaround: If you wish to keep a compute node with no assigned repositories, and the upgrade process assigns the default repository to that compute node, simply unpresent the repository from that compute node after upgrade.

Bug 32847571

Check of Local Repository to Ensure Target Compute Node is Empty

When upgrading compute nodes from Software Controller release 2.4.3 to Software Controller release 2.4.4, if the local repository for a compute node being upgraded has any ISO, VM Files, VM Templates, Virtual Appliances or Virtual Disks present, the upgrade precheck fails. This is expected behavior to ensure the data inside the local repository is retained before the upgrade occurs, which erases that data. If your compute node upgrade pre-checks fail, move all objects located in the compute node local repository to another repository, then retry the upgrade. If there is no need to retain any ISOs, VM Files, VM Templates, Virtual Appliances or Virtual Disks, delete them in order to make the local repository empty.

Workaround: Move items to another repository and retry the upgrade.

  1. Log in to the Oracle VM Manager Web UI for the compute node you are upgrading.
  2. Move each file type as described below:

    Table 6-1 Moving Items Out of the Local Repository

    Item Steps

    ISO

    1. Clone the ISOs to other repositories.
    2. Delete the ISO files from the local repository.

    VM Template

    1. Move the template to another repository using a clone customizer.

    Virtual Appliances

    1. Create VMs using each of the virtual appliances.
    2. Create new virtual appliances from those VMs using "Export to Virtual Appliance", pointing them to other repositories.
    3. Delete the original virtual appliances (used in step 1) from the local repository.

    Virtual Disks

    • If the VMs using these virtual disks are in the local repository, migrate those VMs, along with their virtual disks residing in the local repository, to another repository using a clone customizer.
    • If the VMs using these virtual disks are not in the local repository (for example, only some or all of the virtual disks of certain VMs reside in the local repository), follow these steps:
      1. Stop the VMs using those virtual disks.
      2. Clone the virtual disks with the clone target set to another repository. This clone target repository must be presented to the compute node on which the VMs are hosted.
      3. Delete the original virtual disks from the VMs.
      4. Attach the cloned virtual disks to their corresponding VMs.
      5. Start the VMs.

    VM Files

    1. Migrate the corresponding VMs to some other repository.
  3. Run the pca_upgrader in verify mode to confirm the pre-checks pass.

    If the pre-checks pass, run the upgrade (a minimal invocation is sketched after this procedure).

    [root@ovcamn05r1 ~]# pca_upgrader -V -t compute -c ovcacnXXr1
    PCA Rack Type: PCA X8_BASE.
    Please refer to log file /nfs/shared_storage/pca_upgrader/log/pca_upgrader_<timestamp>.log for more details.
     
     
    Beginning PCA Compute Node Pre-Upgrade Checks...
     
    Check target Compute Node exists                                        1/8
    Check the provisioning lock is not set                                  2/8
    Check OVCA release on Management Nodes                                  3/8
    Check Compute Node's Tenant matches Server Pool                         4/8
    Check target Compute Node has no local networks VNICs                   5/8
    Check target Compute Node has no VMs                                    6/8
    Check local repository of target Compute Node is empty                  7/8
    Check no physical disks on target Compute Node have repositories        8/8
     
    PCA Compute Node Pre-Upgrade Checks completed after 0 minutes
     
    ---------------------------------------------------------------------------
    PCA Compute Node Pre-Upgrade Checks                                  Passed
    ---------------------------------------------------------------------------
    Check target Compute Node exists                                     Passed
    Check the provisioning lock is not set                               Passed
    Check OVCA release on Management Nodes                               Passed
    Check Compute Node's Tenant matches Server Pool                      Passed
    Check target Compute Node has no local networks VNICs                Passed
    Check target Compute Node has no VMs                                 Passed
    Check local repository of target Compute Node is empty               Passed
    Check no physical disks on target Compute Node have repositories     Passed
    ---------------------------------------------------------------------------
    Overall Status                                                       Passed
    ---------------------------------------------------------------------------
    PCA Compute Node Pre-Upgrade Checks                                  Passed
    Please refer to log file /nfs/shared_storage/pca_upgrader/log/pca_upgrader_<timestamp>.log for more details.
     
    [root@ovcamn05r1 ~]#
  4. After you successfully perform the upgrade, restore the items you moved out of the local repository back to the local repository on the newly upgraded compute node (ovcacnXXr1-localfsrepo). You can use the table above as a guide to restore the items, or find detailed instructions in the Repositories Tab section of the Oracle VM Manager User's Guide.
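
For reference, the upgrade itself (performed between steps 3 and 4 above) is started with the same tool in upgrade mode; a minimal sketch using the example node name from above (the same invocation also appears in the next known issue):

    [root@ovcamn05r1 ~]# pca_upgrader -U -t compute -c ovcacnXXr1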

Bug 33093080

Check No Physical Disks on Target Compute Node Have Repositories

When upgrading compute nodes from Software Controller release 2.4.3 to Software Controller release 2.4.4, the precheck fails if repositories are present on any physical disks (iSCSI/FC) that are presented only to the compute node being upgraded.

Workaround: Release the ownership of the repository from the physical disk.

Note:

Check for repositories on all physical disks that are presented only to the compute node being upgraded. You must perform this procedure for each repository that is present on each of these physical disks.

Pre-Upgrade Steps

  1. Log in to the Oracle VM Manager Web UI.
  2. In the Servers and VMs tab, select the appropriate server pool and validate that the compute node is part of that server pool.
  3. From the Repositories tab, select the repository and note which physical disk it resides on.
  4. From the Repositories tab, edit the repository and select the Release Ownership check box.
  5. From the Repositories tab, click Show All Repositories, then select the repository and delete it.

    This only deletes the repository from Oracle VM Manager and not the actual filesystem on the physical disk.

Retry the Compute Node Upgrade

  1. Run the pca_upgrader in verify mode to confirm the pre-checks pass.
    [root@ovcamn05r1 ~]# pca_upgrader -V -t compute -c ovcacnXXr1
    PCA Rack Type: PCA X8_BASE.
    Please refer to log file /nfs/shared_storage/pca_upgrader/log/pca_upgrader_<timestamp>.log for more details.
     
     
    Beginning PCA Compute Node Pre-Upgrade Checks...
     
    Check target Compute Node exists                                        1/8
    Check the provisioning lock is not set                                  2/8
    Check OVCA release on Management Nodes                                  3/8
    Check Compute Node's Tenant matches Server Pool                         4/8
    Check target Compute Node has no local networks VNICs                   5/8
    Check target Compute Node has no VMs                                    6/8
    Check local repository of target Compute Node is empty                  7/8
    Check no physical disks on target Compute Node have repositories        8/8
     
    PCA Compute Node Pre-Upgrade Checks completed after 0 minutes
     
    ---------------------------------------------------------------------------
    PCA Compute Node Pre-Upgrade Checks                                  Passed
    ---------------------------------------------------------------------------
    Check target Compute Node exists                                     Passed
    Check the provisioning lock is not set                               Passed
    Check OVCA release on Management Nodes                               Passed
    Check Compute Node's Tenant matches Server Pool                      Passed
    Check target Compute Node has no local networks VNICs                Passed
    Check target Compute Node has no VMs                                 Passed
    Check local repository of target Compute Node is empty               Passed
    Check no physical disks on target Compute Node have repositories     Passed
    ---------------------------------------------------------------------------
    Overall Status                                                       Passed
    ---------------------------------------------------------------------------
    PCA Compute Node Pre-Upgrade Checks                                  Passed
    Please refer to log file /nfs/shared_storage/pca_upgrader/log/pca_upgrader_<timestamp>.log for more details.
     
    [root@ovcamn05r1 ~]#
  2. If the pre-checks pass, run the upgrade.
    [root@ovcamn05r1 ~]# pca_upgrader -U -t compute -c ovcacnXXr1
    

Post Upgrade Steps to Restore the Repository

  1. In the Storage tab, click the SAN Server that holds the physical disks and refresh the physical disk that held the repository before the upgrade.
  2. In the Storage tab, select Shared File System or Local File System as appropriate for the file system on the physical disk that held the repository, then click the refresh button.
  3. In the Repositories tab, click Show All Repositories, then confirm that the repository deleted earlier in this procedure has been restored.
  4. From the Repositories tab, click Show All Repositories, then edit the repository that was deleted before the upgrade. Click Take Ownership and select the same server pool it was associated with before the upgrade.
  5. Select the repository and click Refresh Selected Repository.

Bug 33093068

Backup of Config Can Fill Filesystem and Cause Numerous Problems

Over time, backups of the Private Cloud Appliance configuration information can accumulate at /nfs/shared_storage/backups and fill the filesystem. Remove old backups periodically to ensure the filesystem does not run out of space.
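
To gauge how much space the backups currently consume before removing anything, a minimal sketch using standard tools on the active management node:

[root@ovcamn05r1 ~]# df -h /nfs/shared_storage
[root@ovcamn05r1 ~]# du -sh /nfs/shared_storage/backups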

Workaround: Remove backups using the following procedure.

Removing Backups

  1. Log in to the active management node as root user.
  2. Move any custom scripts or data located in /nfs/shared_storage/backups to a different location. During this cleanup procedure, everything in that location is deleted except the backup files within the selected retention period.

    /nfs/shared_storage/backups must contain only backup tarballs and uncompressed backup directories.

  3. Remove old backups using this command:
    [root@ovcamn05r1 ~]# find /nfs/shared_storage/backups -maxdepth 1 -mtime +<retention-period-in-days> -exec rm -rf {} \;
    For example, to delete all backups older than 30 days, including any uncompressed backup directories, type the following (a dry-run preview is shown after this procedure):
    [root@ovcamn05r1 ~]# find /nfs/shared_storage/backups -maxdepth 1 -mtime +30 -exec rm -rf {} \;
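
Before running the deletion, you can preview exactly which entries the find command would remove by replacing the -exec action with -print; a minimal sketch for a 30-day retention period:

    [root@ovcamn05r1 ~]# find /nfs/shared_storage/backups -maxdepth 1 -mtime +30 -print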

Bug 33947155

Two Different Release Packages of Same Tool Present in the ISO

There is a chance that an ISO file could contain multiple RPM packages for ipmitool. No action is required; the upgrader tool installs the correct version.
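
If you want to confirm which ipmitool packages the ISO contains, they can be listed from a loopback mount; a hedged sketch (the ISO path and the /mnt/pca-iso mount point are hypothetical, and the directory layout inside the ISO may differ):

[root@ovcamn05r1 ~]# mkdir -p /mnt/pca-iso
[root@ovcamn05r1 ~]# mount -o loop,ro /path/to/pca-software.iso /mnt/pca-iso   # hypothetical ISO path and mount point
[root@ovcamn05r1 ~]# find /mnt/pca-iso -name 'ipmitool*.rpm'
[root@ovcamn05r1 ~]# umount /mnt/pca-iso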

Bug 34375901

Uplink Port Group Cannot Be Deleted After Spine Switch Configuration Restore

When you create an uplink-port-group and then restore the spine switch configuration from a backup, the uplink-port-group is removed from the switch configuration. However, it still appears in the Oracle Private Cloud Appliance interface, and an attempt to delete it may fail.

Workaround: Repeat the delete command a second time.

Bug 34379557
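
A minimal sketch of this workaround, assuming a hypothetical uplink port group named myPortGroup and using the Oracle Private Cloud Appliance CLI on the active management node (verify the exact command syntax against the CLI reference for your release):

PCA> delete uplink-port-group myPortGroup
PCA> delete uplink-port-group myPortGroup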