Oracle Private Cloud Appliance Hardware
This section describes hardware-related limitations and workarounds.
Cisco Firmware Configuration Change
The default resource limit for VRFs under the VDC configuration changed in Cisco Firmware 10.3(4a).
Previous configuration:
limit-resource vrf minimum 2 maximum 4096
Cisco Firmware 10.3(4a) configuration:
limit-resource vrf minimum 2 maximum 4097
This change has no functional impact. No action is required. See Cisco bug "CSCwh68545 Default resource limit change for VRFs" for more information.
Bug 36925686
Compute Node Boot Sequence Interrupted by LSI Bios Battery Error
When a compute node is powered off for an extended period of time, a week or longer, the LSI BIOS may stop because of a battery error, waiting for the user to press a key to continue.
Workaround: Wait for approximately 10 minutes to confirm that the compute node is stuck in boot. Use the Reprovision button in the Oracle Private Cloud Appliance Dashboard to reboot the server and restart the provisioning process.
Bug 16985965
Reboot From Oracle Linux Prompt May Cause Management Node to Hang
When the reboot command is issued from the Oracle Linux command line on a management node, the operating system could hang during boot. Recovery requires manual intervention through the server ILOM.
Workaround: When the management node hangs during (re)boot, log in to the ILOM and run these two commands in succession: stop -f /SYS and start /SYS. The management node should reboot normally.
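A typical recovery session looks like the following (an illustrative sketch; the management node ILOM hostname is an example, and confirmation prompts vary by ILOM version):

# ssh root@ovcamn05r1-ilom
-> stop -f /SYS
Are you sure you want to immediately stop /SYS (y/n)? y
-> start /SYS
Are you sure you want to start /SYS (y/n)? y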
Bug 28871758
Oracle ZFS Storage Appliance More Aggressively Fails Slow Disks
Oracle ZFS Storage Appliance IDR 8.8.44 5185.1 has a fault management architecture that more aggressively fails slower disks (FMA DISK-8000-VP). More disk failures may be reported because the system-wide slow-disk telemetry threshold is set lower.
If you encounter this issue, the following command shows ireport.io.scsi.cmd.disk.dev.slow.read with DISK-8000-VP and the HDD disk location:

> maintenance problems show
For more information, see the Oracle Support article Oracle ZFS Storage Appliance: Handling DISK-8000-VP 'fault.io.disk.slow_rw' (Doc ID 2906318.1).
Workaround: If you determine that you have a single UNAVAIL disk, or multiple disks that are faulted and in a DEGRADED state, engage Oracle Support to investigate and correct the issue.
Oracle ZFS Storage Appliance Firmware Upgrade 8.7.20 Requires A Two-Phased Procedure
Oracle Private Cloud Appliance racks shipped prior to Release 2.3.4 have all been factory-installed with an older version of the Operating Software (AK-NAS) on the controllers of the ZFS Storage Appliance. A new version has been qualified for use with Oracle Private Cloud Appliance Release 2.3.4, but a direct upgrade is not possible. An intermediate upgrade to version 8.7.14 is required.
Workaround: Upgrade the firmware of storage heads twice: first to version 8.7.14, then to version 8.7.20. Both required firmware versions are provided as part of the Oracle Private Cloud Appliance Release 2.3.4 controller software. For upgrade instructions, refer to "Upgrading the Operating Software on the Oracle ZFS Storage Appliance" in Upgrading Oracle Private Cloud Appliance in the Oracle Private Cloud Appliance Administration Guide for Release 2.4.4.
Bug 28913616
Interruption of iSCSI Connectivity Leads to LUNs Remaining in Standby
If network connectivity between compute nodes and their LUNs is disrupted, it may occur that one or more compute nodes mark one or more iSCSI LUNs as being in standby state. The system cannot automatically recover from this state without operations requiring downtime, such as rebooting VMs or even rebooting compute nodes. The standby LUNs are caused by the specific methods that the Linux kernel and the ZFS Storage Appliance use to handle failover of LUN paths.
Workaround: This issue was resolved in ZFS Storage Appliance firmware version AK 8.7.6. Customers who have run into issues with missing LUN paths and standby LUNs should update the ZFS Storage Appliance firmware to version AK 8.7.6 or later before upgrading Oracle Private Cloud Appliance.
Bug 24522087
Emulex Fibre Channel HBAs Discover Maximum 128 LUNs
When using optional Broadcom/Emulex Fibre Channel expansion cards in Oracle Server X8-2 compute nodes, and your FC configuration results in more than 128 LUNs between the compute nodes and the FC storage hardware, it may occur that only 128 LUNs are discovered. This is typically caused by a driver parameter for Emulex HBAs.
Workaround: Update the Emulex lpfc driver settings by performing the steps below on each affected compute node.
1. On the compute node containing the Emulex card, modify the file /etc/default/grub. At the end of the GRUB_CMDLINE_LINUX parameter, append the scsi_mod and lpfc module options shown.

GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=vg/lvroot rd.lvm.lv=vg/lvswap \
rd.lvm.lv=vg/lvusr rhgb quiet numa=off transparent_hugepage=never \
scsi_mod.max_luns=4096 scsi_mod.max_report_luns=4096 lpfc.lpfc_max_luns=4096"

2. Rebuild the grub configuration with the new parameters.

# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

3. Reboot the compute node.
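After the reboot, you can confirm that the new limits are active. This is an illustrative verification using standard kernel interfaces, not part of the documented procedure:

# cat /proc/cmdline
# cat /sys/module/scsi_mod/parameters/max_luns
4096
# cat /sys/module/lpfc/parameters/lpfc_max_luns
4096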
Bug 30461433, 33114489
Fibre Channel LUN Path Discovery Is Disrupted by Other Oracle VM Operations
During the setup of Fibre Channel storage, when the zones on the FC switch have been created, the LUNs become visible to the connected compute nodes. Discovery operations are started automatically, and all discovered LUNs are added to the multipath configuration on the compute nodes. If the storage configuration contains a large number of LUNs, the multipath configuration may take a long time to complete. As long as the multipath configuration has not finished, the system is under high load, and concurrent Oracle VM operations may prevent some of the FC LUN paths from being added to multipath.
Workaround: Avoid Oracle VM operations during FC LUN discovery. In particular, operations related to compute node provisioning and tenant group configuration are disruptive, because they include a refresh of the storage layer. When LUNs become visible to the compute nodes, they are detected almost immediately. In contrast, the multipath configuration stage is time-consuming and resource-intensive.
Use the lsscsi command to determine the number of detected LUN paths. The command output equals the number of LUN paths plus the system disk. Next, verify that all paths have been added to multipath. The multipath configuration is complete once the multipath -ll command output equals the output of the lsscsi command minus 1 (for the system disk).

# lsscsi | wc -l
251
# multipath -ll | grep "active ready running" | wc -l
250
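If you prefer to monitor progress rather than re-run the two commands by hand, they can be combined in a small polling loop. This is a minimal sketch using only the commands above; the 30-second interval is an arbitrary choice:

# Poll until the multipath path count equals the lsscsi device count minus the system disk.
while true; do
    devices=$(lsscsi | wc -l)
    paths=$(multipath -ll | grep "active ready running" | wc -l)
    echo "$(date): lsscsi devices: ${devices}, multipath paths: ${paths}"
    [ "${paths}" -eq "$((devices - 1))" ] && break
    sleep 30
done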
When you have established that the multipath configuration is complete, all Oracle VM operations can be resumed.
Bug 30461555
Poor Oracle VM Performance During Configuration of Fibre Channel LUNs
Discovering Fibre Channel LUNs is a time-consuming and resource-intensive operation. As a result, Oracle VM jobs take an unusually long time to complete. Therefore, it is advisable to complete the FC storage configuration and make sure that the configuration is stable before initiating new Oracle VM operations.
Workaround: Schedule Fibre Channel storage setup and configuration changes at a time when no other Oracle VM operations are required. Verify that all FC configuration jobs have been completed, as explained in Fibre Channel LUN Path Discovery Is Disrupted by Other Oracle VM Operations. When the FC configuration is finished, all Oracle VM operations can be resumed.
Bug 30461478
ILOM Firmware Does Not Allow Loopback SSH Access
In Oracle Integrated Lights Out Manager (ILOM) firmware releases newer than 3.2.4, the service processor configuration contains a field named allowed_services that controls which services are permitted on an interface. By default, SSH is not permitted on the loopback interface. However, Oracle Enterprise Manager uses this mechanism to register Oracle Private Cloud Appliance management nodes. Therefore, SSH must be enabled manually if the ILOM version is newer than 3.2.4.

Workaround: On management nodes running an ILOM version more recent than 3.2.4, make sure that SSH is included in the allowed_services field of the network configuration. Log in to the ILOM CLI through the NETMGT Ethernet port and enter the following commands:

-> cd /SP/network/interconnect
-> set hostmanaged=false
-> set allowed_services=fault-transport,ipmi,snmp,ssh
-> set hostmanaged=true
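To confirm the change, display the interconnect configuration and verify that ssh appears in the allowed_services property (an illustrative verification step):

-> show /SP/network/interconnect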
Bug 26953763
incorrect opcode Messages in the Console Log
Any installed packages that use the mstflint command with a device (-d flag) specified by PCI ID will generate the error message mst_ioctl 1177: incorrect opcode = 8008d10.
Messages similar to the following appear in the console log:

Sep 26 09:50:12 ovcacn10r1 kernel: [ 218.707917] MST:: : print_opcode 549: MST_PARAMS=8028d001
Sep 26 09:50:12 ovcacn10r1 kernel: [ 218.707919] MST:: : print_opcode 551: PCICONF_READ4=800cd101
Sep 26 09:50:12 ovcacn10r1 kernel: [ 218.707920] MST:: : print_opcode 552: PCICONF_WRITE4=400cd102
This issue is caused by an error in the PCI memory mapping associated with the InfiniBand ConnectX device. The messages can be safely ignored; the reported error has no impact on Oracle Private Cloud Appliance functionality.
Workaround: Using mstflint, access the device through the PCI configuration interface instead of the PCI ID.

[root@ovcamn06r1 ~]# mstflint -d /proc/bus/pci/13/00.0 q
Image type:      FS2
FW Version:      2.11.1280
Device ID:       4099
HW Access Key:   Disabled
Description:     Node             Port1            Port2            Sysimage
GUIDs:           0010e0000159ed0c 0010e0000159ed0d 0010e0000159ed0e 0010e0000159ed0f
MACs:                             0010e059ed0d     0010e059ed0e
VSD:
PSID:            ORC1090120019
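To determine the correct PCI configuration path on your system, list the PCI devices first (a hedged example; the bus address and device model line are illustrative and will differ per system):

# lspci | grep -i mellanox
13:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

The bus address 13:00.0 corresponds to the device path /proc/bus/pci/13/00.0 used above.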
Bug 29623624
Megaraid Firmware Crash Dump Is Not Available
ILOM console logs may contain many messages similar to this:
[ 1756.232496] megaraid_sas 0000:50:00.0: Firmware crash dump is not available
[ 1763.578890] megaraid_sas 0000:50:00.0: Firmware crash dump is not available
[ 2773.220852] megaraid_sas 0000:50:00.0: Firmware crash dump is not available
These are notifications, not errors or warnings. The crash dump feature in the megaraid controller firmware is not enabled, as it is not required in Oracle Private Cloud Appliance.
Workaround: This behavior is not a bug. No workaround is required.
Bug 30274703
North-South Traffic Connectivity Fails After Restarting Network
This issue may occur if you have not upgraded the Cisco Switch firmware to version NX-OS I7(7) or later. See "Upgrading the Cisco Switch Firmware" in Upgrading Oracle Private Cloud Appliance in the Oracle Private Cloud Appliance Administration Guide for Release 2.4.4.
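To verify which firmware a switch is currently running, log in to the switch CLI and check the version string (an illustrative check; the switch hostname is an example):

ovcasw15r1# show version | include NXOS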
Bug 29585636
Some Services Require an Upgrade of Hardware Management Pack
Certain secondary services running on Oracle Private Cloud Appliance, such as Oracle Auto Service Request or the Oracle Enterprise Manager Agent, depend on a specific or minimum version of the Oracle Hardware Management Pack. By design, the Controller Software upgrade does not include the installation of a new Oracle Hardware Management Pack or server ILOM version included in the ISO image. This may leave the Hardware Management Pack in a degraded state and not fully compatible with the ILOM version running on the servers.
Workaround: When upgrading the Oracle Private Cloud Appliance Controller Software, make sure that all component firmware matches the qualified versions for the installed Controller Software release. To ensure correct operation of services depending on the Oracle Hardware Management Pack, make sure that the relevant oracle-hmp*.rpm packages are upgraded to the versions delivered in the Controller Software ISO.
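To see which Hardware Management Pack packages are currently installed on a node, a quick illustrative check is:

# rpm -qa 'oracle-hmp*'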
Bug 30123062
Compute Nodes Containing an Emulex HBA Card With Maximum FC Paths Reboot With Errors in Oracle VM Manager UI
If a compute node contains an Emulex FC HBA and is configured with 500 LUNs/4000 paths, or 1000 LUNs/4000 paths, you might see the following errors upon reboot of that compute node.
Rack1-Repository errors:
Description: OVMEVT_00A000D_000 Presented repository: Rack1-Repository, mount: ovcacn31r1_/OVS/Repositories/0004fb00000300009f334f0aad38872b, no longer found on server: ovcacn31r1. Please unpresent/present the repository on this server (fsMountAbsPath: /OVS/Repositories/0004fb00000300009f334f0aad38872b, fsMountSharePath: , fsMountName: 0004fb00000500003150bc24d6f7c2d5
OVMEVT_00A002D_002 Repository: [RepositoryDbImpl] 0004fb00000300009f334f0aad38872b (Rack1-Repository), is unmounted but in Dom0 DB
Compute Node error:
Description: OVMEVT_003500D_003 Active data was not found. Cluster service is probably not running.

[root@ovcacn31r1 ~]# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "59b95c6b5c6bc782": Offline
Debug file system at /sys/kernel/debug: mounted
Workaround:
- For the compute node, follow these directions:
- For the Rack1-Repository, acknowledge the critical error, then refresh the repository.
Bug 33124747
Compute Nodes Containing FC HBA with Maximum FC Paths in Dead State After Reprovisioning
If you are reprovisioning a compute node that contains a Fibre Channel HBA with one of the following configurations, reprovisioning fails and leaves the compute node in a dead state.
- 500 FC LUNs/4000 FC paths
- 1000 FC LUNs/4000 FC paths
To avoid this issue, follow the directions below to reprovision these types of compute nodes.
Note:
Compute nodes with 128 or fewer FC LUNs (2 paths each) reprovision successfully without this workaround.
Workaround:

1. Log in to the external storage and remove the compute node's FC initiator from the initiator group (the initiator group that was used to create the maximum FC paths).

2. Log in to the compute node and run the multipath -F command to flush out the FC LUNs that are no longer available. multipath -ll will now show only the 3 default LUNs.

[root@ovcacn32r1 ~]# multipath -F
Jul 21 17:23:12 | 3600144f0d0d725c7000060f5ecb30004: map in use
Jul 21 17:23:18 | 3600062b20200c6002889e3a010d81476: map in use
Jul 21 17:23:22 | 3600144f0d0d725c7000060f5ecb10003: map in use
[root@ovcacn32r1 ~]# multipath -ll
3600144f0d0d725c7000060f5ecb30004 dm-502 SUN,ZFS Storage 7370
size=3.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  `- 11:0:0:3 sdbks 71:1664 active ready running
3600062b20200c6002889e3a010d81476 dm-0 AVAGO,MR9361-16i
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 8:2:1:0 sdb 8:16 active ready running
3600144f0d0d725c7000060f5ecb10003 dm-501 SUN,ZFS Storage 7370
size=12G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  `- 11:0:0:1 sdbkr 71:1648 active ready running

3. Reprovision the compute node.

4. (Emulex only) Log in to the compute node and re-apply the grub customization for the Emulex driver; see Emulex Fibre Channel HBAs Discover Maximum 128 LUNs.

5. Log in to the external storage and re-add the compute node's FC initiator into the initiator group.

6. Log in to the Oracle VM Manager UI and add the compute node as an admin server to the Unmanaged FibreChannel Storage Array. Refresh the Unmanaged FibreChannel Storage Array. The maximum FC paths should be restored.
Bug 33134228
Compute Node FC HBA (QLogic/Emulex) with FC LUNs Having Path Flapping
You might encounter path flapping when hundreds of FC LUNs are presented to a compute node in the following scenarios:
- After a compute node reprovision
- After a compute node upgrade
- After exposing a compute node to hundreds of new LUNs (either by LUN creation on the storage array or by fabric rezoning)
If path flapping is occurring on your system, you will see the following errors on your compute node:
- The tailf /var/log/devmon.log command shows many event messages similar to the following:

AGENT_NOTIFY EVENT : Jul 29 19:56:38 {STORAGE} [CHANGE_DM_SD] (dm-961) 3600144f0d987aa07000061027d9c48c6-10:0:0:1917 (failed:0x2100000e1e1b95c0:3600144f0d987aa07000061027d9c48c6)
AGENT_NOTIFY EVENT : Jul 29 19:56:38 {STORAGE} [CHANGE_DM_SD] (dm-988) 3600144f0d987aa07000061027db248e1-10:0:0:1971 (failed:0x2100000e1e1b95c0:3600144f0d987aa07000061027db248e1)
AGENT_NOTIFY EVENT : Jul 29 19:56:38 {STORAGE} [CHANGE_DM_SD] (dm-988) 3600144f0d987aa07000061027db248e1-10:0:0:1971 (active:0x2100000e1e1b95c0:3600144f0d987aa07000061027db248e1)
AGENT_NOTIFY EVENT : Jul 29 19:56:39 {STORAGE} [CHANGE_DM_SD] (dm-961) 3600144f0d987aa07000061027d9c48c6-10:0:0:1917 (active:0x2100000e1e1b95c0:3600144f0d987aa07000061027d9c48c6)

This issue is resolved when /var/log/devmon.log stops logging new CHANGE_DM_SD messages.

- The systemd-udevd process consumes 100% CPU in the top command. This is resolved when systemd-udevd no longer consumes a large percentage of the CPU.

- The multipath -ll command does not show all the LUNs. The command might show only a fraction of the LUNs expected.

- The multipath -ll | grep "active ready running" | wc -l command might not count all the LUNs. The command might show only a fraction of the LUNs expected.
Workaround: Follow this procedure to resolve path flapping:

1. Log in to the compute node as root and execute the systemctl restart multipathd command.

2. Continue to execute the above detection commands until all 4 outputs are resolved and you see the correct number of FC LUNs/paths (the sketch after this list combines these checks into a single loop).

3. If any of the monitoring scenarios does not resolve after 3-4 minutes, repeat step 1.
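The detection commands can be combined into a small watch loop that repeats until the path count stabilizes. This is a minimal sketch under a stated assumption: EXPECTED_PATHS is a placeholder that you must set to the number of FC paths actually zoned to this compute node:

EXPECTED_PATHS=4000   # assumption: set to the number of FC paths presented to this node
while true; do
    paths=$(multipath -ll | grep "active ready running" | wc -l)
    udev_cpu=$(top -b -n 1 | awk '/systemd-udevd/ {print $9; exit}')
    echo "$(date): ${paths}/${EXPECTED_PATHS} paths up, systemd-udevd CPU: ${udev_cpu:-0}%"
    [ "${paths}" -ge "${EXPECTED_PATHS}" ] && break
    sleep 60
done

If the count stops increasing for several minutes, restart multipathd again as described in step 1.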
Bug 33171816
Upgrade Compute Node Fails with Fibre Channel LUNs
If your compute node contains an Emulex or QLogic Fibre Channel HBA, the compute node upgrade procedure might fail because of a Fibre Channel LUN path flapping problem. Use the following workaround to avoid the issue.
Workaround: See [PCA 2.4.4] Upgrade Compute Node with Fibre Channel Luns may Fail due to FC Path Flapping (Doc ID 2794501.1).
PCA Faultmonitor Check firewall_monitor Fails Due to nc: command not found
If your compute node fails the Faultmonitor firewall_monitor check and displays the following log error, you are encountering a port error that creates a false report and pushes it to Phone Home, if Phone Home is enabled. The firewall_monitor check verifies whether the required ports for Oracle VM Manager and the compute node are open.
[2021-08-03 16:30:15 605830] ERROR (ovmfaultmonitor_utils:487) invalid
literal for int() with base 10: '-bash: nc: command not found'
Traceback (most recent call last):
File
"/usr/lib/python2.7/site-packages/ovca/monitor/faultmonitor/ovmfaultmonitor/ov
mfaultmonitor_utils.py", line 458, in firewall_monitor
cmd_outputs[server][port] = int(output.strip())
ValueError: invalid literal for int() with base 10: '-bash: nc: command not
found.
Workaround: To manually fix this error, apply the workaround documented in Oracle Support Document 2797364.1 ([PCA 2.4.4] Faultmonitor Check firewall_monitor Fails due to "nc: command not found").
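The underlying cause is that the nc (netcat) utility is not present on the compute node, so the port check cannot run. As a quick illustrative check (the documented fix is in the support note above):

# command -v nc || echo "nc: command not found"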
Certain TZ Configuration is Failing on Cisco Switches
Starting in 2016, tzdata implemented numeric timezone abbreviations, such as "+03", for new timezones. The Cisco switches support only alphabetic timezone abbreviations, such as "ASTT". Attempting to change the timezone on a Cisco switch could cause the following error in the ovca.log file:
[2022-05-30 15:08:13 10455] ERROR (cisco:145) Configuration failed
partially:Clock timezone set:: Timezone name should contain alphabets only
Workaround: Do not change the timezones on the Cisco switches. Cisco switches will always report the time in UTC.
Bug 34223027
NFS Shares on Internal ZFS Will Fail After ZFS Firmware Update If vnic Owned by SN02
Starting with ZFSSA AK version 8.8.30, it is now a requirement that the address used to mount shares from a pool must be formally owned by the same head which formally owns the pool, as shown by Configuration -> Cluster.
Software version 2.4.4.1 introduces a new pre-check which flags any storage network interfaces that have owner = ovcasn02r1, so the customer can manually correct the owner to ovcasn01r1 before proceeding with the upgrade. If you see the following error, proceed to the workaround below.
[2022-05-24 18:57:22 33554] ERROR (precheck:154) [ZFSSA Storage Network Interfaces Check
(Ensure ovcasn01r1 is the owner of all customer-created storage network interfaces)] Failed
The check failed: Detected customer-created storage network interface(s) owned by ovcasn02r1:
net/vnic10, net/vnic11, net/vnic12, net/vnic7, net/vnic8, net/vnic9
Workaround: See [PCA 2.4.4.1] Pre-check "ZFSSA Storage Network Interfaces Check" Fails (Doc ID 2876150.1).
Bug 34192251
Repository Size Is Not Reflected Properly in OVMM GUI
The Oracle VM Manager GUI can report an incorrect repository size, which may cause VMs to hang because the repository is actually full. Use another method to check the repository size, such as the compute node df output or the OVM CLI.
Workaround: Check the repository size using the OVM CLI.
OVM> show repository name=NFS-ExtZFSSA-Repository
Command: show repository name=NFS-ExtZFSSA-Repository
Status: Success
Time: 2021-10-19 10:48:15,294 UTC
Data:
  File System = 14a6cf21-a170-41aa-9a09-7b768aaabc6f [nfs on 192.168.40.242:/export/NFS-Ext-Repo]
  Manager UUID = 0004fb000001000087ae02edfd0534dc
  File System Free (GiB) = 998.48
  File System Total (GiB) = 1018.84
  File System Used (GiB) = 20.37
  Used % = 2.0
  Apparent Size (GiB) = 25.0
  Capacity % = 2.5
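Alternatively, check the mounted repository file system directly from a compute node with df (an illustrative check; the repository UUID in the path is an example):

# df -h /OVS/Repositories/0004fb00000300009f334f0aad38872b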
Bug 33455258