Oracle Private Cloud Appliance Hardware

This section describes hardware-related limitations and workarounds.

Cisco Firmware Configuration Change

The default resource limit for VRFs under the VDC configuration changed in Cisco Firmware 10.3(4a).

Previous configuration:

limit-resource vrf minimum 2 maximum 4096

Cisco Firmware 10.3(4a) configuration:

limit-resource vrf minimum 2 maximum 4097

This change has no functional impact. No action is required. See Cisco bug "CSCwh68545 Default resource limit change for VRFs" for more information.
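
To confirm the limit in effect on a switch, a command along the following lines can be used (a sketch: show running-config all includes default configuration values, the exact output varies by NX-OS release, and ovcasw15r1 is an example switch name):

ovcasw15r1# show running-config all | include limit-resource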

Bug 36925686

Compute Node Boot Sequence Interrupted by LSI BIOS Battery Error

When a compute node is powered off for an extended period of time, a week or longer, the LSI BIOS may halt with a battery error and wait for the user to press a key to continue.

Workaround: Wait for approximately 10 minutes to confirm that the compute node is stuck in boot. Use the Reprovision button in the Oracle Private Cloud Appliance Dashboard to reboot the server and restart the provisioning process.

Bug 16985965

Reboot From Oracle Linux Prompt May Cause Management Node to Hang

When the reboot command is issued from the Oracle Linux command line on a management node, the operating system could hang during boot. Recovery requires manual intervention through the server ILOM.

Workaround: When the management node hangs during (re-)boot, log in to the ILOM and run these two commands in succession: stop -f /SYS and start /SYS. The management node should reboot normally.
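
For example, from the ILOM CLI:

-> stop -f /SYS
-> start /SYS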

Bug 28871758

Oracle ZFS Storage Appliance More Aggressively Fails Slow Disks

Oracle ZFS Storage Appliance IDR 8.8.44 5185.1 has a fault management architecture that more aggressively fails slow disks (FMA DISK-8000-VP). More disk failures may be reported because the system-wide slow-disk telemetry variable is set to a lower threshold.

If you encounter this issue, the following command will show ireport.io.scsi.cmd.disk.dev.slow.read with DISK-8000-VP and the HDD disk location.

> maintenance problems show

For more information, see the Oracle Support article Oracle ZFS Storage Appliance: Handling DISK-8000-VP 'fault.io.disk.slow_rw' (Doc ID 2906318.1).

Workaround:

If you determine you have a single UNAVAIL disk or multiple disks that are faulted and in a DEGRADED state, engage Oracle Support to investigate and correct the issue.

Oracle ZFS Storage Appliance Firmware Upgrade 8.7.20 Requires A Two-Phased Procedure

Oracle Private Cloud Appliance racks shipped prior to Release 2.3.4 have all been factory-installed with an older version of the Operating Software (AK-NAS) on the controllers of the ZFS Storage Appliance. A new version has been qualified for use with Oracle Private Cloud Appliance Release 2.3.4, but a direct upgrade is not possible. An intermediate upgrade to version 8.7.14 is required.

Workaround: Upgrade the firmware of storage heads twice: first to version 8.7.14, then to version 8.7.20. Both required firmware versions are provided as part of the Oracle Private Cloud Appliance Release 2.3.4 controller software. For upgrade instructions, refer to "Upgrading the Operating Software on the Oracle ZFS Storage Appliance" in Upgrading Oracle Private Cloud Appliance in the Oracle Private Cloud Appliance Administration Guide for Release 2.4.4.

Bug 28913616

Interruption of iSCSI Connectivity Leads to LUNs Remaining in Standby

If network connectivity between compute nodes and their LUNs is disrupted, one or more compute nodes may mark one or more iSCSI LUNs as being in a standby state. The system cannot automatically recover from this state without operations that require downtime, such as rebooting VMs or even rebooting compute nodes. The standby LUNs are caused by the specific methods that the Linux kernel and the ZFS Storage Appliance use to handle failover of LUN paths.

Workaround: This issue was resolved in ZFS Storage Appliance firmware version AK 8.7.6. Customers who have run into issues with missing LUN paths and standby LUNs should update the ZFS Storage Appliance firmware to version AK 8.7.6 or later before upgrading Oracle Private Cloud Appliance.

Bug 24522087

Emulex Fibre Channel HBAs Discover Maximum 128 LUNs

When using optional Broadcom/Emulex Fibre Channel expansion cards in Oracle Server X8-2 compute nodes, and your FC configuration results in more than 128 LUNs between the compute nodes and the FC storage hardware, only 128 LUNs may be discovered. This is typically caused by the default value of a driver parameter for Emulex HBAs.

Workaround: Update the Emulex lpfc driver settings by performing the steps below on each affected compute node.

  1. On the compute node containing the Emulex card, modify the file /etc/default/grub. At the end of the GRUB_CMDLINE_LINUX parameter, append the scsi_mod and lpfc module options shown.

    GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=vg/lvroot rd.lvm.lv=vg/lvswap \
    rd.lvm.lv=vg/lvusr rhgb quiet numa=off transparent_hugepage=never \
    scsi_mod.max_luns=4096 scsi_mod.max_report_luns=4096 lpfc.lpfc_max_luns=4096"
  2. Rebuild the grub configuration with the new parameters.

    # grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
  3. Reboot the compute node.
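
After the reboot, you can verify that the new limits are active by reading the module parameters from sysfs (a sketch; the values shown assume the grub change took effect):

# cat /sys/module/scsi_mod/parameters/max_luns
4096
# cat /sys/module/lpfc/parameters/lpfc_max_luns
4096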

Bug 30461433, 33114489

Fibre Channel LUN Path Discovery Is Disrupted by Other Oracle VM Operations

During the setup of Fibre Channel storage, when the zones on the FC switch have been created, the LUNs become visible to the connected compute nodes. Discovery operations are started automatically, and all discovered LUNs are added to the multipath configuration on the compute nodes. If the storage configuration contains a large number of LUNs, the multipath configuration may take a long time to complete. As long as the multipath configuration has not finished, the system is under high load, and concurrent Oracle VM operations may prevent some of the FC LUN paths from being added to multipath.

Workaround: Avoid Oracle VM operations during FC LUN discovery. Operations related to compute node provisioning and tenant group configuration are especially disruptive, because they include a refresh of the storage layer. When LUNs become visible to the compute nodes, they are detected almost immediately. In contrast, the multipath configuration stage is time-consuming and resource-intensive.

Use the lsscsi command to determine the number of detected LUN paths. The number of lines in its output equals the number of LUN paths plus one for the system disk. Next, verify that all paths have been added to multipath: the configuration is complete once the number of active paths reported by multipath -ll equals the lsscsi line count minus 1 (for the system disk).

# lsscsi | wc -l
251
# multipath -ll | grep "active ready running" | wc -l
250
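
A minimal polling sketch of this check (assumptions: run as root on the compute node; the 30-second interval is arbitrary):

# until [ "$(multipath -ll | grep -c 'active ready running')" -eq $(( $(lsscsi | wc -l) - 1 )) ]; do
>   sleep 30
> done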

When you have established that the multipath configuration is complete, all Oracle VM operations can be resumed.

Bug 30461555

Poor Oracle VM Performance During Configuration of Fibre Channel LUNs

Discovering Fibre Channel LUNs is a time-consuming and resource-intensive operation. As a result, Oracle VM jobs take an unusually long time to complete. Therefore, it is advisable to complete the FC storage configuration and make sure that the configuration is stable before initiating new Oracle VM operations.

Workaround: Schedule Fibre Channel storage setup and configuration changes at a time when no other Oracle VM operations are required. Verify that all FC configuration jobs have been completed, as explained in Fibre Channel LUN Path Discovery Is Disrupted by Other Oracle VM Operations. When the FC configuration is finished, all Oracle VM operations can be resumed.

Bug 30461478

ILOM Firmware Does Not Allow Loopback SSH Access

In Oracle Integrated Lights Out Manager (ILOM) firmware releases newer than 3.2.4, the service processor configuration contains a field named allowed_services that controls which services are permitted on an interface. By default, SSH is not permitted on the loopback interface. However, Oracle Enterprise Manager uses SSH over this interface to register Oracle Private Cloud Appliance management nodes. Therefore, SSH must be enabled manually if the ILOM version is newer than 3.2.4.

Workaround: On management nodes running an ILOM version more recent than 3.2.4, make sure that SSH is included in the allowed_services field of the network configuration. Log in to the ILOM CLI through the NETMGT Ethernet port and enter the following commands:

-> cd /SP/network/interconnect
-> set hostmanaged=false
-> set allowed_services=fault-transport,ipmi,snmp,ssh
-> set hostmanaged=true 
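
To verify the change, display the interface configuration and check that ssh appears in allowed_services (a sketch using the standard ILOM show command):

-> show /SP/network/interconnect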

Bug 26953763

incorrect opcode Messages in the Console Log

Any installed package that invokes the mstflint command with the device (-d flag) specified as a PCI ID generates the error message mst_ioctl 1177: incorrect opcode = 8008d10. Messages similar to the following appear in the console log:

Sep 26 09:50:12 ovcacn10r1 kernel: [  218.707917]   MST::  : print_opcode  549: MST_PARAMS=8028d001 
Sep 26 09:50:12 ovcacn10r1 kernel: [  218.707919]   MST::  : print_opcode  551: PCICONF_READ4=800cd101 
Sep 26 09:50:12 ovcacn10r1 kernel: [  218.707920]   MST::  : print_opcode  552: PCICONF_WRITE4=400cd102 

This issue is caused by an error in the PCI memory mapping associated with the InfiniBand ConnectX device. The messages can be safely ignored; the reported error has no impact on Oracle Private Cloud Appliance functionality.

Workaround: Using mstflint, access the device through the PCI configuration interface instead of the PCI ID.

[root@ovcamn06r1 ~]# mstflint -d /proc/bus/pci/13/00.0 q
Image type: FS2
FW Version: 2.11.1280
Device ID: 4099
HW Access Key: Disabled
Description: Node Port1 Port2 Sysimage
GUIDs: 0010e0000159ed0c 0010e0000159ed0d 0010e0000159ed0e 0010e0000159ed0f
MACs: 0010e059ed0d 0010e059ed0e
VSD:
PSID: ORC1090120019 
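
To locate the PCI configuration path for the adapter, you can first list the Mellanox devices by vendor ID (a sketch; 15b3 is the Mellanox PCI vendor ID, the bus address varies per system, and the output shown is illustrative):

# lspci -d 15b3:
13:00.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3]

The bus address 13:00.0 corresponds to the path /proc/bus/pci/13/00.0 used above.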

Bug 29623624

Megaraid Firmware Crash Dump Is Not Available

ILOM console logs may contain many messages similar to this:

[ 1756.232496] megaraid_sas 0000:50:00.0: Firmware crash dump is not available
[ 1763.578890] megaraid_sas 0000:50:00.0: Firmware crash dump is not available
[ 2773.220852] megaraid_sas 0000:50:00.0: Firmware crash dump is not available

These are notifications, not errors or warnings. The crash dump feature in the megaraid controller firmware is not enabled, as it is not required in Oracle Private Cloud Appliance.

Workaround: This behavior is not a bug. No workaround is required.

Bug 30274703

North-South Traffic Connectivity Fails After Restarting Network

This issue may occur if you have not upgraded the Cisco Switch firmware to version NX-OS I7(7) or later. See "Upgrading the Cisco Switch Firmware" in Upgrading Oracle Private Cloud Appliance in the Oracle Private Cloud Appliance Administration Guide for Release 2.4.4.
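
To check which firmware a Cisco switch is currently running, log in to the switch and inspect the version string (a sketch; ovcasw15r1 is an example switch name):

ovcasw15r1# show version | include NXOS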

Bug 29585636

Some Services Require an Upgrade of Hardware Management Pack

Certain secondary services running on Oracle Private Cloud Appliance, such as Oracle Auto Service Request or the Oracle Enterprise Manager Agent, depend on a specific or minimum version of the Oracle Hardware Management Pack. By design, the Controller Software upgrade does not automatically install the newer Oracle Hardware Management Pack or server ILOM versions included in the ISO image. This may leave the Hardware Management Pack in a degraded state and not fully compatible with the ILOM version running on the servers.

Workaround: When upgrading the Oracle Private Cloud Appliance Controller Software, make sure that all component firmware matches the qualified versions for the installed Controller Software release. To ensure correct operation of services depending on the Oracle Hardware Management Pack, make sure that the relevant oracle-hmp*.rpm packages are upgraded to the versions delivered in the Controller Software ISO.
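
A quick way to list the currently installed Hardware Management Pack packages, for comparison against the versions delivered on the Controller Software ISO (a sketch):

# rpm -qa 'oracle-hmp*'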

Bug 30123062

Compute Nodes Containing Emulex HBA Card With Maximum FC Paths Reboot With Errors in Oracle VM Manager UI

If a compute node contains an Emulex FC HBA and is configured with 500 LUNs/4000 paths, or 1000 LUNs/4000 paths, you might see the following errors upon reboot of that compute node.

Rack1-Repository errors:

Description: OVMEVT_00A000D_000 Presented repository: Rack1-Repository,
mount: ovcacn31r1_/OVS/Repositories/0004fb00000300009f334f0aad38872b, no
longer found on server: ovcacn31r1.
Please unpresent/present the repository on this server
(fsMountAbsPath: /OVS/Repositories/0004fb00000300009f334f0aad38872b,
fsMountSharePath: , fsMountName: 0004fb00000500003150bc24d6f7c2d5
OVMEVT_00A002D_002 Repository: [RepositoryDbImpl]
0004fb00000300009f334f0aad38872b (Rack1-Repository), is unmounted but in Dom0
DB

Compute Node error:

Description: OVMEVT_003500D_003 Active data was not found. Cluster service is
probably not running.

[root@ovcacn31r1 ~]# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "59b95c6b5c6bc782": Offline
Debug file system at /sys/kernel/debug: mounted

Workaround: Clear the errors for the compute node and the Rack1-Repository as follows.
  1. For the compute node, follow the directions in Oracle Support Document 2041602.1: Attempting to Present a repository fails with "Cannot present the Repository to server: <hostname>. Cluster is currently down on the server".

  2. For the Rack1-Repository, acknowledge the critical error, then refresh the repository (a CLI alternative is sketched below).
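
If you prefer the command line over the UI, the refresh can be performed from the Oracle VM CLI along these lines (a sketch, using the repository name from the errors above):

OVM> refresh repository name=Rack1-Repository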

Bug 33124747

Compute Nodes Containing FC HBA with Maximum FC Paths in Dead State After Reprovisioning

If you are reprovisioning a compute node that contains a Fibre Channel HBA with one of the following configurations, reprovisioning fails and leaves the compute node in a dead state.

  • 500 FC LUNs/4000 FC paths

  • 1000 FC LUNs/4000 FC paths

To avoid this issue, follow the directions below to reprovision these types of compute nodes.

Note:

Compute nodes with 128 or fewer FC LUNs, with 2 paths each, reprovision successfully without this workaround.

Workaround:

  1. Log in to the external storage and remove the compute node's FC initiator from the initiator group (the initiator group that was used to create the max FC paths).

  2. Log in to the compute node and run the multipath -F command to flush the FC LUNs that are no longer available. After this, multipath -ll shows only the 3 default LUNs.

    [root@ovcacn32r1 ~]# multipath -F                                            
                                                                          
    Jul 21 17:23:12 | 3600144f0d0d725c7000060f5ecb30004: map in use
    Jul 21 17:23:18 | 3600062b20200c6002889e3a010d81476: map in use
    Jul 21 17:23:22 | 3600144f0d0d725c7000060f5ecb10003: map in use
    [root@ovcacn32r1 ~]# multipath -ll
    3600144f0d0d725c7000060f5ecb30004 dm-502 SUN,ZFS Storage 7370
    size=3.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    `-+- policy='round-robin 0' prio=50 status=active
      `- 11:0:0:3   sdbks 71:1664  active ready running
    3600062b20200c6002889e3a010d81476 dm-0 AVAGO,MR9361-16i
    size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
    `-+- policy='round-robin 0' prio=1 status=active
      `- 8:2:1:0    sdb   8:16     active ready running
    3600144f0d0d725c7000060f5ecb10003 dm-501 SUN,ZFS Storage 7370
    size=12G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    `-+- policy='round-robin 0' prio=50 status=active
      `- 11:0:0:1   sdbkr 71:1648  active ready running
  3. Reprovision the compute node.

  4. (Emulex only) Log in to the compute node and re-apply the grub customization for the Emulex driver; see Emulex Fibre Channel HBAs Discover Maximum 128 LUNs.

  5. Log in to the external storage and re-add the compute node's FC initiator into the initiator group.

  6. Log in to the Oracle VM Manager UI and add the compute node as an admin server to the Unmanaged FibreChannel Storage Array. Refresh the Unmanaged FibreChannel Storage Array. Max FC paths should be restored.

Bug 33134228

Compute Node FC HBA (QLogic/Emulex) with FC LUNs Having Path Flapping

You might encounter path flapping when hundreds of FC LUNs are presented to a compute node in the following scenarios:

  • After a compute node reprovision

  • After a compute node upgrade

  • After exposing hundreds of new LUNs to a compute node (either by LUN creation on the storage array or by fabric rezoning)

If path flapping is occurring on your system, you will see the following errors on your compute node:

  • The tailf /var/log/devmon.log command shows many event messages similar to the following:

    AGENT_NOTIFY EVENT : Jul 29 19:56:38 {STORAGE} [CHANGE_DM_SD] (dm-961)
    3600144f0d987aa07000061027d9c48c6-10:0:0:1917
    (failed:0x2100000e1e1b95c0:3600144f0d987aa07000061027d9c48c6)
    AGENT_NOTIFY EVENT : Jul 29 19:56:38 {STORAGE} [CHANGE_DM_SD] (dm-988)
    3600144f0d987aa07000061027db248e1-10:0:0:1971
    (failed:0x2100000e1e1b95c0:3600144f0d987aa07000061027db248e1)
    AGENT_NOTIFY EVENT : Jul 29 19:56:38 {STORAGE} [CHANGE_DM_SD] (dm-988)
    3600144f0d987aa07000061027db248e1-10:0:0:1971
    (active:0x2100000e1e1b95c0:3600144f0d987aa07000061027db248e1)
    AGENT_NOTIFY EVENT : Jul 29 19:56:39 {STORAGE} [CHANGE_DM_SD] (dm-961)
    3600144f0d987aa07000061027d9c48c6-10:0:0:1917
    (active:0x2100000e1e1b95c0:3600144f0d987aa07000061027d9c48c6)

    This issue is resolved when /var/log/devmon.log stops logging new CHANGE_DM_SD messages.

  • The systemd-udevd process consumes 100% CPU in the top command output.

    This is resolved when systemd-udevd no longer consumes a large percentage of the CPU.

  • The multipath -ll command does not show all the LUNs. The command might show a fraction of the LUNs expected.

  • The multipath -ll | grep "active ready running" | wc -l command might not count all the LUNs. The command might show a fraction of the LUNs expected.

Workaround: Follow this procedure to resolve path flapping:

  1. Log in to the compute node as root and execute the systemctl restart multipathd command.

  2. Continue to execute the detection commands listed above until all 4 outputs are resolved and you see the correct number of FC LUNs/paths.

  3. If any of the monitoring checks has not resolved after 3 to 4 minutes, repeat step 1. A minimal loop automating these steps is sketched below.
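
A minimal sketch of this recovery loop (assumptions: run as root on the compute node; EXPECTED is the number of FC paths you expect, 4000 in the configurations above):

# EXPECTED=4000
# systemctl restart multipathd
# until [ "$(multipath -ll | grep -c 'active ready running')" -ge "$EXPECTED" ]; do
>   sleep 210   # allow 3 to 4 minutes before retrying
>   systemctl restart multipathd
> done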

Bug 33171816

Upgrade Compute Node Fails with Fibre Channel LUNs

If your compute node contains an Emulex or QLogic Fibre Channel HBA, the compute node upgrade procedure might fail because of a Fibre Channel LUN path flapping problem. Use the following workaround to avoid the issue.

Workaround: See [PCA 2.4.4] Upgrade Compute Node with Fibre Channel Luns may Fail due to FC Path Flapping (Doc ID 2794501.1).

PCA Faultmonitor Check firewall_monitor Fails Due to nc: command not found

If your compute node fails the Faultmonitor firewall_monitor check and logs the following error, you are encountering a false report caused by the missing nc command; if Phone Home is enabled, the false report is also pushed to Phone Home. The firewall_monitor check verifies whether the required ports for Oracle VM Manager and the compute node are open.

[2021-08-03 16:30:15 605830] ERROR (ovmfaultmonitor_utils:487) invalid
literal for int() with base 10: '-bash: nc: command not found'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovca/monitor/faultmonitor/ovmfaultmonitor/ovmfaultmonitor_utils.py", line 458, in firewall_monitor
    cmd_outputs[server][port] = int(output.strip())
ValueError: invalid literal for int() with base 10: '-bash: nc: command not found'

Workaround: To fix this error manually, apply the workaround documented in Oracle Support Document 2797364.1 ([PCA 2.4.4] Faultmonitor Check firewall_monitor Fails due to "nc: command not found").
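
As a quick diagnostic, you can confirm whether the nc binary is present on the compute node (a sketch; this only confirms the symptom, the actual fix is in the support document above):

# command -v nc || echo "nc is not installed"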

Certain TZ Configuration is Failing on Cisco Switches

Starting in 2016, tzdata implemented numeric timezone abbreviations, such as "+03", for new timezones. The Cisco switches only support alphabetic timezone abbreviations, such as "ASTT". Attempting to change the timezone on a Cisco switch could cause the following error in the ovca.log file:

[2022-05-30 15:08:13 10455] ERROR (cisco:145) Configuration failed
partially:Clock timezone set:: Timezone name should contain alphabets only

Workaround: Do not change the timezones on the Cisco switches. Cisco switches will always report the time in UTC.

Bug 34223027

NFS Shares on Internal ZFS Will Fail After ZFS Firmware Update If vnic Owned by SN02

Starting with ZFSSA AK version 8.8.30, the address used to mount shares from a pool must be formally owned by the same head that formally owns the pool, as shown under Configuration -> Cluster.

Software version 2.4.4.1 introduces a new pre-check that flags any storage network interfaces with owner = ovcasn02r1, so that the customer can manually correct the owner to ovcasn01r1 before proceeding with the upgrade. If you see the following error, proceed to the workaround below.

[2022-05-24 18:57:22 33554] ERROR (precheck:154) [ZFSSA Storage Network Interfaces Check (Ensure ovcasn01r1 is the owner of all customer-created storage network interfaces)] Failed
The check failed: Detected customer-created storage network interface(s) owned by ovcasn02r1: net/vnic10, net/vnic11, net/vnic12, net/vnic7, net/vnic8, net/vnic9

Workaround: See [PCA 2.4.4.1] Pre-check "ZFSSA Storage Network Interfaces Check" Fails (Doc ID 2876150.1).

Bug 34192251

Repository Size is Not Reflecting Properly in OVMM GUI

The Oracle VM Manager GUI can report an incorrect repository size, which may cause VMs to hang because the repository is actually full. Use another method to check the repository size, such as the df output on a compute node or the Oracle VM CLI.

Workaround: Check the repository size using the Oracle VM CLI.

OVM> show repository name=NFS-ExtZFSSA-Repository
Command: show repository name=NFS-ExtZFSSA-Repository
Status: Success
Time: 2021-10-19 10:48:15,294 UTC
Data:
  File System = 14a6cf21-a170-41aa-9a09-7b768aaabc6f  [nfs on 192.168.40.242:/export/NFS-Ext-Repo]
  Manager UUID = 0004fb000001000087ae02edfd0534dc
  File System Free (GiB) = 998.48
  File System Total (GiB) = 1018.84
  File System Used (GiB) = 20.37      
  Used % = 2.0
  Apparent Size (GiB) = 25.0
  Capacity % = 2.5
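
Alternatively, the df output on a compute node that presents the repository shows the real usage (a sketch; the mount point shown is an example):

# df -h /OVS/Repositories/0004fb00000300009f334f0aad38872b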

Bug 33455258