Platform Issues
This section describes known issues and workarounds related to the appliance platform layer.
Compute Node Provisioning Takes a Long Time
The provisioning of a new compute node typically takes only a few minutes. However, there are several factors that may adversely affect the duration of the process. For example, the management nodes may be under a high load or the platform services involved in the provisioning may be busy or migrating between hosts. Also, if you started provisioning several compute nodes in quick succession, note that these processes are not executed in parallel but one after the other.
Workaround: Unless an error is displayed, you should assume that the compute node provisioning process is still ongoing and will eventually complete. At that point, the compute node provisioning state changes to Provisioned.
Bug: 33519372
Version: 3.0.1
Not Authorized to Reconfigure Appliance Network Environment
If you attempt to change the network environment parameters for the rack's external connectivity when you have just completed the initial system setup, your commands are rejected because you are not authorized to make those changes. This is caused by a security feature: the permissions for initial system setup are restricted to only those specific setup operations. Even if you are an administrator with unrestricted access to the Service Enclave, you must disconnect after initial system setup and log back in again to activate all permissions associated with your account.
Workaround: This behavior is expected and was designed to help protect against unauthorized access. In case you need to modify the appliance external network configuration right after the initial system setup, log out and log back in to make sure that your session is launched with the required privileges.
Bug: 33535069
Version: 3.0.1
Error Changing Hardware Component Password
The hardware layer of the Oracle Private Cloud Appliance architecture consists of various types of components with different operating and management software. As standalone products their password policies can vary, but the appliance software enforces a stricter rule set. If an error is returned when you try to change a component password, ensure that your new password complies with the Private Cloud Appliance policy for hardware components.
For more information about password maintenance across the entire appliance environment, refer to the Oracle Private Cloud Appliance Security Guide.
Workaround: For hardware components, use the Service CLI to set a password that conforms to the following rules:
-
consists of at least 8 characters
-
has a maximum length of 20 characters for compute nodes, management nodes, and switches
-
has a maximum length of 16 characters for ILOMs and the ZFS Storage Appliance
-
contains at least one lowercase letter (a-z)
-
contains at least one uppercase letter (A-Z)
-
contains at least one digit (0-9)
-
contains at least one symbol (@$!#%*&)
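If you want to check a candidate password locally before setting it, the following shell sketch mirrors the rules above. It is an example helper only, not a Private Cloud Appliance command; pass 20 or 16 as the maximum length depending on the component type.
#!/bin/bash
# Example helper: check a candidate password against the rules listed above.
# Usage: ./check_component_password.sh '<password>' <max_length>
pw="$1"
max="${2:-20}"   # use 16 for ILOMs and the ZFS Storage Appliance
rc=0
[ "${#pw}" -ge 8 ] && [ "${#pw}" -le "$max" ] || { echo "length must be between 8 and $max characters"; rc=1; }
printf '%s' "$pw" | grep -q '[[:lower:]]' || { echo 'needs at least one lowercase letter'; rc=1; }
printf '%s' "$pw" | grep -q '[[:upper:]]' || { echo 'needs at least one uppercase letter'; rc=1; }
printf '%s' "$pw" | grep -q '[[:digit:]]' || { echo 'needs at least one digit'; rc=1; }
printf '%s' "$pw" | grep -q '[@$!#%*&]'   || { echo 'needs at least one symbol (@$!#%*&)'; rc=1; }
[ "$rc" -eq 0 ] && echo 'password meets the policy'
exit "$rc"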
Bug: 35828215
Version: 3.0.2
Grafana Service Statistics Remain at Zero
The Grafana Service Monitoring folder contains a dashboard named Service Level, which displays statistical information about requests received by the fundamental appliance services. These numbers can remain at zero even though there is activity pertaining to the services monitored through this dashboard.
Workaround: No workaround is currently available.
Bug: 33535885
Version: 3.0.1
Terraform Provisioning Requires Fully Qualified Domain Name for Region
If you use the Oracle Cloud Infrastructure Terraform provider to automate infrastructure provisioning on Oracle Private Cloud Appliance, you must specify the fully qualified domain name of the appliance in the region variable for the Terraform provider.
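For example, if your plan declares a region input variable that is passed to the provider configuration, you can supply the appliance FQDN on the command line as shown in this sketch. The FQDN below is a placeholder; the variable wiring is an assumption about your plan, not a requirement of the provider.
# The region value must be the appliance FQDN, not a short region name
terraform plan -var 'region=mypca.example.com'
terraform apply -var 'region=mypca.example.com'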
Synchronizing Hardware Data Causes Provisioning Node to Appear Ready to Provision
Both the Service Web UI and the Service CLI provide a command to synchronize the information about hardware components with the actual status as currently registered by the internal hardware management services. However, you should not need to synchronize hardware status under normal circumstances, because status changes are detected and communicated automatically.
Furthermore, if a compute node provisioning operation is in progress when you synchronize hardware data, its Provisioning State could be reverted to Ready to Provision. This information is incorrect, and is caused by the hardware synchronization occurring too soon after the provisioning command. In this situation, attempting to provision the compute node again is likely to cause problems.
Workaround: If you have started provisioning a compute node, and its provisioning state reads Provisioning, wait at least another five minutes to see if it changes to Provisioned. If it takes excessively long for the compute node to be listed as Provisioned, run the Sync Hardware Data command.
If the compute node still does not change to Provisioned, retry provisioning the compute node.
Bug: 33575736
Version: 3.0.1
Automatic Disk Shelf Provisioning Disabled for Storage Expansions
During the initial installation of the Private Cloud Appliance, the disk shelves present are automatically added to the appropriate pool: capacity or high-performance. When disk shelves are added at a later time to expand the storage capacity of the appliance, these are no longer automatically provisioned and added to the respective storage pools. This functional change was implemented in appliance software versions newer than 3.0.2-b1081557.
Because storage expansions are processed one disk shelf at a time, regardless of how many shelves are added in a single operation, automated reconfiguration of the storage pools results in an excessive number of spare drives. To ensure cost-effective and correctly balanced use of storage resources, this automation was removed in the latest appliance software.
Workaround: Storage expansions for Private Cloud Appliance are best configured on a case-by-case basis, so that the number of spare drives can be adjusted to the specific storage configuration of the rack. Contact Oracle for assistance. Storage expansion scenarios are covered in the note with Doc ID 3020837.1.
Bug: 36623140
Version: 3.0.2
Rack Elevation for Storage Controller Not Displayed
In the Service Web UI, the Rack Units list shows all hardware components with basic status information. One of the data fields is Rack Elevation, the rack unit number where the component in question is installed. For one of the controllers of the ZFS Storage Appliance, pcasn02, the rack elevation is shown as Not Available.
Workaround: There is no workaround. The underlying hardware administration services currently do not populate this particular data field. The two controllers occupy 2 rack units each and are installed in RU 1-4.
Bug: 33609276
Version: 3.0.1
Fix available: Please apply the latest patches to your system.
Switch Hardware State Reported "Up"
An expansion rack has its own set of switches, connected into the base rack. When the data and administration networks are integrated across the interconnected racks, the appliance platform recognizes the expansion hardware as part of the same system. If you query the switches, the list contains all switches of that type in the system, but the hardware state of an expansion switch is reported as "Up" instead of "OK".
# pca-admin switch leaf list
+----------+-----------+------------+-------------------------------------+-------------+-----------+----------+--------------------+
| RackID   | Rack Unit | CPU Vendor | Model                               | IP Address  | Hostname  | HW State | Provisioning state |
+----------+-----------+------------+-------------------------------------+-------------+-----------+----------+--------------------+
| 1        | 22        | Cisco      | cisco Nexus9000 C9336C-FX2 Chassis  | 100.96.2.22 | pcaswlf01 | OK       | Ready              |
| 1        | 23        | Cisco      | cisco Nexus9000 C9336C-FX2 Chassis  | 100.96.2.23 | pcaswlf02 | OK       | Ready              |
| 2        | 23        | Cisco      | cisco Nexus9000 C9336C-FX2 Chassis  | 100.96.2.40 | pcaswlf03 | Up       | Ready              |
| 2        | 22        | Cisco      | cisco Nexus9000 C9336C-FX2 Chassis  | 100.96.2.41 | pcaswlf04 | Up       | Ready              |
+----------+-----------+------------+-------------------------------------+-------------+-----------+----------+--------------------+
# pca-admin switch mgmt list
+----------+-----------+------------+-------------------------------------+-------------+-----------+----------+--------------------+
| RackID   | Rack Unit | CPU Vendor | Model                               | IP Address  | Hostname  | HW State | Provisioning state |
+----------+-----------+------------+-------------------------------------+-------------+-----------+----------+--------------------+
| 1        | 24        | Cisco      | cisco Nexus9000 C9348GC-FXP Chassis | 100.96.2.1  | pcaswmn01 | OK       | Ready              |
| 2        | 23        | Cisco      | cisco Nexus9000 C9348GC-FXP Chassis | 100.96.2.46 | pcaswmn02 | Up       | Ready              |
+----------+-----------+------------+-------------------------------------+-------------+-----------+----------+--------------------+
Workaround: The different hardware state label is harmless. There is no workaround.
Bug: 37076258
Version: 3.0.2
Free-Form Tags Used for Extended Functionality
You can use the following free-form tags to extend the functionality of Oracle Private Cloud Appliance.
Note:
Do not use these tag names for other purposes.
-
PCA_no_lm
Use this tag to instruct the Compute service not to live migrate an instance. The value can be either True or False.
By default, an instance can be live migrated, such as when you need to evacuate all running instances from a compute node. Live migration can be a problem for some instances. For example, live migration is not supported for instances in a Microsoft Windows cluster. To prevent an instance from being live migrated, set this tag to True on the instance.
Specify this tag in the Tagging section of the Create Instance or Edit instance_name dialog, in the oci compute instance launch or oci compute instance update command, or using the API. The following is an example option for the oci compute instance launch command:
--freeform-tags '{"PCA_no_lm": "True"}'
Setting this tag to True on an instance will not prevent the instance from being moved when you change the fault domain. Changing the fault domain is not a live migration. When you change the fault domain of an instance, the instance is stopped, moved, and restarted.
-
PCA_blocksize
Use this tag to instruct the ZFS storage appliance to create a new volume with a specific block size.
The default block size is 8192 bytes. To specify a different block size, specify the PCA_blocksize tag in the Tagging section of the Create Block Volume dialog, in the oci bv volume create command, or using the API. Supported values are a power of 2 between 512 and 1M bytes, specified as a string and fully expanded. The following is an example option for the oci bv volume create command:
--freeform-tags '{"PCA_blocksize": "65536"}'
The block size cannot be modified once the volume has been created.
Use of these tags counts against your tag limit.
Version: 3.0.1
Terraform Apply Can Delete Default Tags
Sometimes, the OCI Terraform provider unexpectedly deletes existing tag defaults from a resource during terraform apply. For example, the Oracle-Tags.CreatedBy and Oracle-Tags.CreatedOn tag defaults that were automatically assigned when the resource was created might be deleted on a subsequent terraform apply.
Workaround: Add the ignore_defined_tags attribute to your provider block, listing the tags that you want the Terraform provider to ignore during plan or apply, as shown in the following example:
provider "oci" { ignore_defined_tags = ["Oracle-Tags.CreatedBy", "Oracle-Tags.CreatedOn"] }
Bug: 36692217
Version: 3.0.2
Depend on the Tag Definition in a Terraform Resource Definition that Includes a New Defined Tag
If your Terraform plan includes both defined tag definitions and other resource definitions that use those defined tags, then you must tell Terraform about this dependency so that the resources are created in the correct order. In the definition of each resource that uses a tag that is defined in this same plan, include a depends_on meta-argument that points to where the tag is defined, as shown in the following example:
resource "oci_identity_tag_namespace" "example_tag_ns" { tag_namespace_definition } resource "oci_identity_tag" "example_tag" { tag_key_definition } ... resource "oci_resource" "resource_name" { depends_on = [ oci_identity_tag.example_tag ] resource_definition }
If you do not use depends_on to tell Terraform about a dependency, you can use one of the following methods:
-
Create the defined tag in a separate Terraform plan, and apply that plan before you apply the plan that creates the resource that uses the tag.
-
If your apply fails because the tag was unknown when the resource that uses the tag was created, apply the same plan again.
Bug: 36701647
Version: 3.0.2
Failure Creating Dynamic Groups and Policies through Terraform Plan
When using Terraform to create identity groups or dynamic groups and associated policies, it is likely that an error is returned when you apply the Terraform plan. The IAM Service does not allow a policy to be created for an identity group that does not exist. Therefore, if the policy resource defined in the Terraform plan is created before the identity (dynamic) group to which it applies, an authorization error or an object not found error is returned. For example:
...
│ Error: 404-NotAuthorizedOrNotFound, Authorization failed or requested resource not found. Suggestion: Either the resource has been deleted or service Identity Policy need policy to access this resource. Policy reference: https://docs.oracle.com/en-us/iaas/Content/Identity/Reference/policyreference.htm
...
Note that all resources are typically created correctly despite the error message. When the Terraform plan is applied a second time, the command is usually successful.
Workaround: If a dependency exists between Private Cloud Appliance resource creation operations, use the Terraform depends_on feature to make this dependency explicit in the Terraform plan. The depends_on meta-argument tells Terraform to create the depended-on resource before creating the dependent resource. For example, add a depends_on statement similar to the one in the following example.
resource "oci_identity_dynamic_group" "test_dynamic_group" {
compartment_id = "ocid1.tenancy....unique_ID"
description = "Terraform test dependency"
matching_rule = "matching_rule1"
name = "testdyngrp"
}
resource "oci_identity_policy" "dg_policy" {
compartment_id = "ocid1.tenancy....unique_ID"
description = "Test DG Policy"
name = "DGPolicy"
statements = [
"allow dynamic-group testdyngrp to manage all-resources in tenancy"
]
depends_on = [
oci_identity_dynamic_group.test_dynamic_group
]
}
Bug: 36536058
Version: 3.0.2
Maximum Length of User Name Differs from Oracle Cloud Infrastructure
Oracle Cloud Infrastructure accepts accounts with very long user names. The maximum user name length in the IAM service of Private Cloud Appliance is not the same. This can be problematic when migrating a public cloud setup into the Private Cloud Appliance environment. In this case, the IAM service returns an error: "Data too long for column 'name' at row x".
Workaround: Update the user account that causes the issue by setting a shorter user name.
Bug: 36536058
Version: 3.0.2
Terraform Requires Escaping Double Quotation Marks in Complex Tag Values
A complex tag value is one whose value field itself contains a key and a value. In Terraform, such a value requires that you escape the double quotation marks inside the braces that surround the complex value.
The following example shows a complex tag value. In this example, the value of the key1_name tag key is another key and its value:
{"key1_name": {"key2_name": "key2_value"}}
For comparison, the following example shows how to specify this value in the OCI CLI:
--freeform-tags '{"key1_name": {"key2_name": "key2_value"}}'
The following example shows how to specify this value using Terraform:
freeform_tags = {"key1_name" = "{\"key2_name\": \"key2_value\"}"}
The following example is the Terraform for defined tags used to create an OKE cluster. Note the two key/value pairs that are the value of the OraclePCA.cpNodeShapeConfig tag:
defined_tags={"OraclePCA.cpNodeCount"="3","OraclePCA.cpNodeShape"="VM.PCAStandard1.Flex","OraclePCA.cpNodeShapeConfig"="{\"ocpus\":1,\"memoryInGBs\":10}","OraclePCA.sshkey"="sshkey"}
Bug: 36691556
Version: 3.0.2
Imported Images Not Synchronized to High-Performance Pool
In an Oracle Private Cloud Appliance with default storage configuration, when you import compute images, they are stored on the ZFS Storage Appliance in an images LUN inside the standard ZFS pool. If the storage configuration is later extended with a high-performance disk shelf, an additional high-performance ZFS pool is configured on the ZFS Storage Appliance. Because there is no replication between the storage pools, the images from the original pool are not automatically made available in the new high-performance pool. The images have to be imported manually.
Workaround: When adding high-performance storage shelves to the appliance configuration, import the required compute images again to ensure they are loaded into the newly created ZFS pool.
Bug: 33660897
Version: 3.0.1
API Server Failure After Management Node Reboot
When one of the three management nodes is rebooted, it may occur that the API server does not respond to any requests, even though it can still be reached through the other two management nodes in the cluster. This is likely caused by an ownership issue with the virtual IP shared between the management nodes, or by the DNS server not responding quickly enough to route traffic to the service pods on the available management nodes. After the rebooted management node has rejoined the cluster, it may still take several minutes before the API server returns to its normal operating state and accepts requests again.
Workaround: When a single management node reboots, all the services are eventually restored to their normal operating condition, although their pods may be distributed differently across the management node cluster. If your UI, CLI or API operations fail after a management node reboot, wait 5 to 10 minutes and try again.
Bug: 33191011
Version: 3.0.1
Administrators in Authorization Group Other Than SuperAdmin Must Use Service CLI to Change Password
Due to high security restrictions, administrators who are not a member of the SuperAdmin authorization group are unable to change their account password in the Service Web UI. An authorization error is displayed when an administrator from a non-SuperAdmin authorization group attempts to access their own profile.
Workaround: Log in to the Service CLI, find your user id in the user preferences, and change your password as follows:
PCA-ADMIN> show UserPreference
Data:
Id = 1c74b2a5-c1ce-4433-99da-cb17aab4c090
Type = UserPreference
[...]
UserId = id:5b6c1bfa-453c-4682-e692-6f0c91b53d21 type:User name:dcadmin
PCA-ADMIN> changePassword id=<user_id> password=<new_password> confirmPassword=<new_password>
Bug: 33749967
Version: 3.0.1
Service Web UI and Grafana Unavailable when HAProxy Is Down
HAProxy is the load balancer used by the Private Cloud Appliance platform layer for all access to and from the microservices. When the load balancer and proxy services are down, the Service Web UI and Grafana monitoring interface are unavailable. When you attempt to log in, you receive an error message: "Server Did Not Respond".
Workaround: Log in to one of the management nodes. Check the status of the HAProxy cluster resource, and restart if necessary.
# ssh pcamn01
# pcs status
Cluster name: mncluster
Stack: corosync
[...]
Full list of resources:
scsi_fencing (stonith:fence_scsi): Stopped (disabled)
Resource Group: mgmt-rg
vip-mgmt-int (ocf::heartbeat:IPaddr2): Started pcamn03
vip-mgmt-host (ocf::heartbeat:IPaddr2): Started pcamn03
vip-mgmt-ilom (ocf::heartbeat:IPaddr2): Started pcamn03
vip-mgmt-lb (ocf::heartbeat:IPaddr2): Started pcamn03
vip-mgmt-ext (ocf::heartbeat:IPaddr2): Started pcamn03
l1api (systemd:l1api): Started pcamn03
haproxy (ocf::heartbeat:haproxy): Stopped (disabled)
pca-node-state (systemd:pca_node_state): Started pcamn03
dhcp (ocf::heartbeat:dhcpd): Started pcamn03
hw-monitor (systemd:hw_monitor): Started pcamn03
To start HAProxy, use the pcs resource command as shown in the example below. Verify that the cluster resource status has changed from "Stopped (disabled)" to "Started".
# pcs resource enable haproxy
# pcs status
[...]
Resource Group: mgmt-rg
haproxy (ocf::heartbeat:haproxy): Started pcamn03
Bug: 34485377
Version: 3.0.2
Lock File Issue Occurs when Changing Compute Node Passwords
When a command is issued to modify the password for a compute node or ILOM, the system sets a temporary lock on the relevant database to ensure that password changes are applied in a reliable and consistent manner. If the database lock cannot be obtained or released on the first attempt, the system makes several further attempts to complete the request. Under normal operating circumstances, it is expected that the password is eventually successfully changed. However, the command output may contain error messages such as "Failed to create DB lockfile" or "Failed to remove DB lock", even if the final result is "Password successfully changed".
Workaround: The error messages are inaccurate and can be ignored as long as the password operations complete as expected. No workaround is required.
Bug: 34065740
Version: 3.0.2
Compute Node Hangs at Dracut Prompt after System Power Cycle
When an appliance or some of its components need to be powered off, for example to perform maintenance, there is always a minimal risk that a step in the complex reboot sequence is not completed successfully. When a compute node reboots after a system power cycle, it can hang at the dracut prompt because the boot framework fails to build the required initramfs/initrd image. As a result, primary GPT partition errors are reported for the root file system.
Workaround: Log on to the compute node ILOM. Verify that the server has failed to boot, and is in the dracut recovery shell. To allow the compute node to return to normal operation, reset it from the ILOM using the reset /System command.
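For reference, the reset from the ILOM command line looks similar to the following example. The ILOM host name is a placeholder, and the prompts can vary slightly between ILOM versions.
# ssh root@<compute_node_ilom>
-> reset /System
Are you sure you want to reset /System (y/n)? y
Performing reset on /System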
Bug: 34096073
Version: 3.0.2
No Error Reported for Unavailable Spine Switch
When a spine switch goes offline due to loss of power or a fatal error, the system gives no indication of the issue in the Service Enclave UI/CLI or Grafana. This behavior is the result of the switch client not properly handling exceptions and continuing to report the default "healthy" status.
Workaround: There is currently no workaround to make the system generate an error that alerts the administrator of a spine switch issue.
Bug: 34696315
Version: 3.0.2
ZFS Storage Appliance Controller Stuck in Failsafe Shell After Power Cycle
The two controllers of the Oracle ZFS Storage Appliance operate in an active-active cluster configuration. When one controller is taken offline, for example when its firmware is upgraded or when maintenance is required, the other controller takes ownership of all storage resources to provide continuation of service. During this process, several locks must be applied and released. When the rebooted controller rejoins the cluster to take back ownership of its assigned storage resources, the cluster synchronization will fail if the necessary locks are not released correctly. In this situation, the rebooted controller could become stuck in the failsafe shell, waiting for the peer controller to release certain locks. This is likely the result of a takeover operation that was not completed entirely, leaving the cluster in an indeterminate state.
Workaround: There is currently no workaround. If the storage controller cluster ends up in this condition, contact Oracle for assistance.
Bug: 34700405
Version: 3.0.2
Concurrent Compute Node Provisioning Operations Fail Due to Storage Configuration Timeout
When the Private Cloud Appliance has just been installed, or when a set of expansion compute nodes have been added, the system does not prevent you from provisioning all new compute nodes at once. Note, however, that for each provisioned node the storage initiators and targets must be configured on the ZFS Storage Appliance. If there are too many configuration update requests for the storage appliance to process, they will time out. As a result, all compute node provisioning operations will fail and be rolled back to the unprovisioned state.
Workaround: To avoid ZFS Storage Appliance configuration timeouts, provision compute nodes sequentially one by one, or in groups of no more than 3.
Bug: 34739702
Version: 3.0.2
Data Switch Fails to Boot Due to Active Console Connection
If a Cisco Nexus 9336C-FX2 Switch has an active local console session, for example when a terminal server is connected, the switch could randomly hang during reboot. It is assumed that the interruption of the boot sequence is caused by a ghost session on the console port. This behavior has not been observed when no local console connection is used.
Workaround: Do not connect any cables to the console ports of the data switches. There is no need for a local console connection in a Private Cloud Appliance installation.
Bug: 32965120
Version: 3.0.2
Switches in Failed State Due to Expired Certificate
During the appliance software upgrade, new certificates are installed for authentication between components. It may occur that a certificate is not uploaded to the switches. When the certificate on the switches expires, they go into a failed state, which results in critical active faults.
PCA-ADMIN> list fault where status EQ ACTIVE
Data:
id                                    Name                                         Status  Severity
--                                    ----                                         ------  --------
d38d1e3b-893e-49bd-a62a-77b0bd22e5d9  RackUnitRunStateFaultStatusFault(pcaswmn01)  Active  Critical
c0358065-ea81-4ad3-4a6c-017194f73659  RackUnitRunStateFaultStatusFault(pcaswlf01)  Active  Critical
2fe549b7-596c-4a6c-25d3-1f9fa4b0bd1c  RackUnitRunStateFaultStatusFault(pcaswlf02)  Active  Critical
25d3c91d-1e69-4f73-8319-f5ed18f6a903  RackUnitRunStateFaultStatusFault(pcaswsp01)  Active  Critical
4af35eff-994f-c400-a93a-314cc43c97a1  RackUnitRunStateFaultStatusFault(pcaswsp02)  Active  Critical
Workaround: Confirm that the switch certificates have expired, then reprovision the switches to force a new certificate to be uploaded. Follow the instructions in the note with Doc ID 3080032.1. When completed successfully, the switches return to Ready state.
Bug: 37743552
Version: 3.0.2
Federated Login Failure after Appliance Upgrade
Identity federation allows users to log in to Private Cloud Appliance with their existing company user name and password. After an upgrade of the appliance software, the trust relationship between the identity provider and Private Cloud Appliance might be broken, causing all federated logins to fail. During the upgrade the Private Cloud Appliance X.509 external server certificate could be updated for internal service changes. In this case, the certificate on the identity provider side no longer matches.
Workaround: If the identity provider allows it, update its service provider certificate.
-
Retrieve the appliance SAML metadata XML file from https://iaas.<domain>/saml/<TenancyId> and save it to a local file. A command-line sketch of this procedure is shown after this list.
-
Open the local file with a text editor and find the <X509Certificate> element.
<SPSSODescriptor>
  <KeyDescriptor use="signing">
    <KeyInfo>
      <X509Data>
        <X509Certificate>
          <COPY CERTIFICATE CONTENT FROM HERE>
        </X509Certificate>
      </X509Data>
    </KeyInfo>
  </KeyDescriptor>
</SPSSODescriptor>
-
Copy the certificate content and save it to a new *.pem file, structured as follows:
-----BEGIN CERTIFICATE-----
<PASTE CERTIFICATE CONTENT HERE>
-----END CERTIFICATE-----
-
Update the identity provider with this new service provider certificate for your Private Cloud Appliance.
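The following is a minimal command-line sketch of the retrieval and conversion steps above. The file names are examples only, and the certificate content must still be copied manually from the metadata into the text file before it is wrapped.
# Download the SAML metadata from the appliance
curl -k -o pca_saml_metadata.xml "https://iaas.<domain>/saml/<TenancyId>"

# After pasting the <X509Certificate> content into sp_cert_base64.txt,
# wrap it in PEM markers and verify that it parses as a certificate
{
  echo "-----BEGIN CERTIFICATE-----"
  fold -w 64 sp_cert_base64.txt
  echo "-----END CERTIFICATE-----"
} > sp_cert.pem
openssl x509 -in sp_cert.pem -noout -subject -enddate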
If the identity provider offers no easy way to update the certificate, we recommend that you delete the service provider and reconfigure identity federation. For more information, refer to the section "Federating with Microsoft Active Directory" in the Oracle Private Cloud Appliance Administrator Guide.
Bug: 35688600
Version: 3.0.2
Ensure No Storage Buckets Are Present Before Deleting a Compartment or Tenancy
When a command is issued to delete a compartment or tenancy, the appliance software cannot reliably confirm that no object storage buckets exist, because it has no service account with access to all buckets present on the ZFS Storage Appliance. As a result, access to certain object storage buckets could be lost when their compartment is deleted.
Workaround: Before deleting a compartment or tenancy, verify that no object storage buckets are present in that compartment or tenancy.
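For example, you can list the buckets in a compartment with the OCI CLI before deleting it; an empty result means no buckets remain. The compartment OCID below is a placeholder, and the command must be run by a user with permission to inspect Object Storage in that compartment.
# Look up the Object Storage namespace, then list buckets in the target compartment
NAMESPACE=$(oci os ns get --query 'data' --raw-output)
oci os bucket list --namespace-name "$NAMESPACE" --compartment-id <compartment_OCID>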
Bug: 35811594
Version: 3.0.2
Listing Upgrade Jobs Fails with RabbitMQ Error
When you run the Service CLI command getUpgradeJobs, the following error might be returned:
PCA-ADMIN> getUpgradeJobs
Status: Failure
Error Msg: PCA_GENERAL_000012: Error in RabbitMQ service: null
Workaround: The issue is temporary. Please retry the command at a later time.
Bug: 35999461
Version: 3.0.2
Availability Domain Name Change in Version 3.0.2-b1001356
In software version 3.0.2-b1001356 (December 2023), Private Cloud Appliance's single availability domain was renamed from "ad1" to "AD-1". This change was required for compatibility with Oracle Cloud Infrastructure. The availability domain is a mandatory parameter in a small set of commands, and an optional parameter in several other commands.
The --availability-domain parameter is required with the following commands:
oci bv boot-volume create
oci bv boot-volume list
oci bv volume create
oci bv volume-group create
oci compute instance launch
oci compute boot-volume-attachment list
oci fs file-system create
oci fs file-system list
oci fs mount-target create
oci fs mount-target list
oci fs export-set list
oci iam fault-domain list
Workaround: Ensure that the correct value is used to identify the availability domain in your commands, depending on the version of the appliance software your system is running. If you are using scripts or any form of automation that includes the --availability-domain parameter, ensure that your code is updated when you upgrade or patch the appliance with version 3.0.2-b1001356 or newer.
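To check which name your system currently uses, you can query the availability domain with the OCI CLI rather than hardcoding it. This is an example only; the compartment OCID is a placeholder.
# Lists the availability domain; the name is "AD-1" on version 3.0.2-b1001356 or newer
oci iam availability-domain list --compartment-id <tenancy_OCID>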
Bug: 36094977
Version: 3.0.2
No Packages Available to Patch MySQL Cluster Database
With the release of appliance software version 3.0.2-b1001356, new MySQL RPM packages were added to the ULN channel PCA 3.0.2 MN. However, a package signing issue prevents the ULN mirror from downloading them, which means the MySQL cluster database on your system cannot be patched to the latest available version.
When patching the system, you will see no error message or abnormal behavior related to the missing MySQL packages. Follow the workaround to obtain the new packages. Once these have been downloaded to the ULN mirror, you can patch the MySQL cluster database.
Note:
For new ULN mirror installations, the steps to enable updates of MySQL packages have been included in the Oracle Private Cloud Appliance Patching Guide under "Configure Your Environment for Patching".
To determine if a system is affected by this issue, check the ULN mirror for the presence of MySQL packages in the yum directory referenced by the pca302_x86_64_mn soft link. If the search returns no results, the ULN mirror was unable to download the MySQL packages. The default location of the yum setup directory is /var/www/html/yum, which is used in the following example:
# ls -al /var/www/html/yum/pca302_x86_64_mn/getPackage/ | grep mysql
-rw-r--r--. 1 root root  85169400 Dec 19 03:19 mysql-cluster-commercial-client-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   4751220 Dec 19 03:19 mysql-cluster-commercial-client-plugins-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root    689392 Dec 19 03:19 mysql-cluster-commercial-common-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root  12417692 Dec 19 03:19 mysql-cluster-commercial-data-node-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   2229080 Dec 19 03:19 mysql-cluster-commercial-icu-data-files-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   2236184 Dec 19 03:19 mysql-cluster-commercial-libs-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   1279012 Dec 19 03:19 mysql-cluster-commercial-libs-compat-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   3478680 Dec 19 03:19 mysql-cluster-commercial-management-server-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root 364433848 Dec 19 03:19 mysql-cluster-commercial-server-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   2428848 Dec 19 03:19 mysql-connector-j-commercial-8.0.33-1.1.el7.noarch.rpm
-rw-r--r--. 1 root root   4570200 Dec 19 03:19 mysql-connector-odbc-commercial-8.0.33-1.1.el7.x86_64.rpm
Workaround: After you import the appropriate GPG keys on your ULN mirror, it can download the updated MySQL packages. Proceed as follows:
-
Log in to the ULN mirror server.
-
Download the MySQL GPG keys from these locations:
-
Import the GPG keys.
# rpm --import RPM-GPG-KEY-mysql-2022
# rpm --import RPM-GPG-KEY-mysql-2023
-
Update the ULN mirror.
# /usr/bin/uln-yum-mirror
If the key was imported successfully, the new MySQL packages are downloaded to the ULN mirror.
-
For confirmation, verify the signature using one of the new packages.
# rpm --checksig mysql-cluster-commercial-management-server-8.0.33-1.1.el7.x86_64.rpm
mysql-cluster-commercial-management-server-8.0.33-1.1.el7.x86_64.rpm: rsa sha1 (md5) pgp md5 OK
Bug: 36123758
Version: 3.0.2
Uppercase Letters Are Not Supported in Domain Names
Uppercase letters are not supported in domain names. The domain name for your system is used as the base domain for the internal network, and by Oracle Private Cloud Appliance public-facing services. This attribute has a maximum length of 190 characters. Acceptable characters are lowercase letters (a-z), digits (0-9), and the hyphen (-).
Bug: 36484125
Version: 3.0.2
Prometheus Backup Archives Are Corrupt
Private Cloud Appliance follows an internal daily backup schedule to preserve system data in case a major outage occurs. The automated backups originally included the Prometheus monitoring data, but its volume caused significant side effects. The amount of data, even when compressed, fills the file system allocated for backups before the 14-day retention period expires. At some point, a failure occurs while building the archive, which results in a corrupt backup. For this reason, Prometheus was removed from the daily backups in appliance software version 3.0.2-b1081557.
Workaround: The monitoring data from Prometheus is not included in automated backups. To preserve your Prometheus data, create a backup and restore it manually. For more information, refer to the note with Doc ID 3021643.1.
Bug: 36623554
Version: 3.0.2
Sauron Ingress Breaks During Install or Upgrade on Host Names with Capital Letters
When upgrading the appliance, the platform upgrade could fail because of the use of capital letters in the host name. The following example shows the type of request used to set the ingress URI with the fully qualified domain name in lowercase:
curl -k -X PUT -HHost:api.pca.oracledx.com https://253.255.0.32/v1/uri?uri=<FQDN in Lowercase> -u admin:<sauron_password>
Bug: 36792458
Version: 3.0.2
Upgrader Log Reports Health Check Error in Post-Upgrade Tasks
When upgrading or patching to appliance software version 3.0.2-b1261765, the pca-health-checker script returns an error in post-upgrade tasks due to a missing file system mount: 253.255.12.2:/export/clamav_db. The Upgrader is not interrupted by the error, and is expected to complete all tasks.
Workaround: This error can be ignored. It has no functional impact.
Bug: 37311137
Version: 3.0.2
Unable to Collect Support Bundles During Appliance Software Upgrade
When upgrading to appliance software version 3.0.2-b1392231, it is temporarily not possible to generate and collect support bundles. Other methods of log collection are not affected.
Workaround: When the upgrade is completed, the support bundle functionality returns to normal. For log collection during the upgrade window, use a timeslice.
Bug: 37828395
Version: 3.0.2
Cannot Add Performance Pool to Local Endpoint
Peer connections between multiple Private Cloud Appliance systems are configured between their local endpoints. The parameters of the local endpoint configuration include the IP addresses of the ZFS pools. However, if a high-performance ZFS pool is created after the local endpoint, the configuration cannot be updated with the new pool IP. The local endpoint must be deleted and re-created with the new parameters. This adversely affects the native Disaster Recovery (DR) Service. Before you can delete the local endpoint, you must also delete all DR configurations.
Workaround: Delete and re-create the local endpoint. If DR configurations exist, those also need to be deleted and created again.
Bug: 37196927
Version: 3.0.2
DR Configurations Unusable After Deleting and Re-creating a Peer Connection on One Appliance
For the native Disaster Recovery (DR) Service, a peer connection between two Private Cloud Appliance systems must be configured from each side of the connection. If the peer connection needs to be deleted on one of the two systems, and is created again, the system reports that peering is complete and fully functional. However, when you execute a DR plan from the existing DR configurations, the operation returns an error, typically containing messages like Precheck Failed and DR metadata project not replicated.
Workaround: Peer connections are designed to be symmetrical. If you delete and re-create the peer connection on one system, you must delete and re-create it on the second system as well.
Bug: 37260498
Version: 3.0.2
Peering with Unhealthy Storage Controller Results in Replication Action Test Failure
If one of the target Oracle ZFS Storage Appliance controllers is in an unhealthy state when you create a peer connection between two Private Cloud Appliance systems, the peering operation fails. In fact, the target is created but the replication action test fails, causing the peer connection to remain in 'failed' lifecycle state. The error message in the peering logs and progress records looks like this example:
Details = Replication target for host <serial>/<IP>/<Pool> failed post-checks. Post-check failed during replication action test of <Pool>: Failed to create replication action for <share>. Reset the peer connection and try again.
PeerConnectionId = id:ocid1.drpeerconnection.<unique_ID> type:PeerConnection
Workaround: The storage controllers on both systems must be in good operating condition before you create a peer connection. Clean up the configuration, ensure that all storage controllers are healthy, and create the peer connection again.
Bug: 37743102
Version: 3.0.2
Deleting Incomplete Peer Connection Prevents Creating a New One
A peer connection between two Private Cloud Appliance systems must be configured symmetrically on both racks. If the configuration on one rack is stuck in creating state, and the other rack's configuration becomes active, the peering connection cannot be completed. The normal course of action would be to delete the peering configuration on both systems, and create a new peering connection. However, the incomplete configuration cannot be deleted using the normal commands, which implies that new attempts to peer these systems will fail. Testing indicates that this problem occurs when a replication target is registered using an IP address instead of a fully qualified domain name, which suggests a DNS problem at the time of creation.
Workaround: To resolve the incomplete configuration problem and set up peering again between both systems, request support from Oracle. The replication target that was created for the incomplete configuration must be destroyed manually from the ZFS Storage Appliance. The order of operations is as follows:
-
Delete the peer connection from both systems.
-
Manually destroy the replication target that was created for the incomplete configuration. This example shows a target identified by its IP address:
TARGET      LABEL                ACTIONS
target-000  LOCAL-PCA_POOL_HIGH  0
target-001  LOCAL-PCA_POOL       0
target-002  192.168.1.43         0
-
Create the peer connection between both systems again.
Bug: 37873134
Version: 3.0.2
Storage Hardware Faults May Not Clear Automatically
When a storage hardware fault occurs that includes AK-8003-HF in the fault code, the automatic clearing of these faults may not work properly due to a database mismatch between fault codes.
Workaround: Using PCA 3.0 Service Advisor, look at the health of the Storage Appliance to ensure the faults are resolved. If the faults are resolved on the Storage Appliance, but the fault codes are still present in the software healthchecker, you can manually clear the faults using the clearFault id=$faultid command.
PCA-ADMIN> list fault where status EQ ACTIVE
Command: list fault where status EQ ACTIVE
Status: Success
Time: 2025-05-28 12:27:31,527 UTC
Data:
id Name Status Severity
-- ---- ------ --------
08bec50b-00b8-48df-99bb-fc31cb839e03 sn02AK00684129--AK-8003-HF--vnic7 Active Major
7f7baee8-033d-4912-a759-c31c646bb30d sn01AK00684129--AK-8003-F9--PCIe 7/NET0 Active Minor
6357a915-bae9-463e-a619-8abf10c59751 sn02AK00684129--AK-8003-F9--PCIe 7/NET0 Active Minor
f2370a63-4dda-470b-98ad-e7f63bf36702 sn02AK00684129--AK-8003-F9--PCIe 7/NET1 Active Minor
06a377c6-2283-c11c-ad4b-90a861380b0f sn02AK00684129--SPINTEL-8006-CE--DIMM 0/2 Active Minor
e4520a8a-32e1-422f-ac9b-989459aab000 sn02AK00684129--AK-8003-HF--aggr2 Active Major
ff96bce2-4ef5-43f5-9791-e9b4c64e35e8 sn02AK00684129--AK-8003-HF--vnic8 Active Major
3a2f6f38-a062-4205-940f-cda0fc1b19fc sn01AK00684129--AK-8003-F9--PCIe 7/NET1 Active Minor
PCA-ADMIN> show fault id=08bec50b-00b8-48df-99bb-fc31cb839e03
Command: show fault id=08bec50b-00b8-48df-99bb-fc31cb839e03
Status: Success
Time: 2025-05-28 12:27:36,373 UTC
Data:
Id = 08bec50b-00b8-48df-99bb-fc31cb839e03
Type = Fault
Category = Internal
Severity = Major
Status = Active
Last Update Time = 2025-05-27 14:57:24,752 UTC
Cause = Network connectivity via datalink vnic7 has been lost.
Message Id = AK-8003-HF
Time Reported = 2025-05-27 14:56:52,000 UTC
Action = Check the networking cable, switch port, and switch configuration. Contact your vendor for support if the datalink remains inexplicably failed. Please refer to the associated reference document at http://support.oracle.com/msg/AK-8003-HF for the latest service procedures and policies regarding this diagnosis.
Health Exporter = zfssa-analytics-exportersn02AK00684129
Uuid = 60e580ca-f410-41f7-a10e-e9f68d8a442c
Diagnosing Source = zfssa_analytics_exporter
Upgrade Fault = True
Faulted Component Type = HARDWARE
ASR Notification Time = 2025-05-27 14:57:24,748 UTC
Last Occurrence Activation Time = 2025-05-27 14:57:24,739 UTC
Last Occurrence Clearing Time = 2025-05-26 11:33:25,611 UTC
FaultHistoryLogIds 1 = id:a2a2a5f5-13e9-4e7c-bd1f-b09a32de8e36 type:FaultHistoryLog name:
FaultHistoryLogIds 2 = id:1f814f6f-24d3-48d3-b289-76d03a8841a5 type:FaultHistoryLog name:
FaultHistoryLogIds 3 = id:07ef95a5-fef0-4143-ac02-a1437dfb494d type:FaultHistoryLog name:
BaseManagedObjectId = id:Unknown/vnic7/Unknown type:HardwareComponent name:
Description = Network connectivity via datalink vnic7 has been lost.
Name = sn02AK00684129--AK-8003-HF--vnic7
Work State = Normal
PCA-ADMIN> clearFault id=08bec50b-00b8-48df-99bb-fc31cb839e03
Command: clearFault id=08bec50b-00b8-48df-99bb-fc31cb839e03
Status: Success
Time: 2025-05-28 12:27:59,397 UTC
Data:
status = Success
PCA-ADMIN> list fault where status EQ ACTIVE
Command: list fault where status EQ ACTIVE
Status: Success
Time: 2025-05-28 12:28:04,790 UTC
Data:
id Name Status Severity
-- ---- ------ --------
7f7baee8-033d-4912-a759-c31c646bb30d sn01AK00684129--AK-8003-F9--PCIe 7/NET0 Active Minor
6357a915-bae9-463e-a619-8abf10c59751 sn02AK00684129--AK-8003-F9--PCIe 7/NET0 Active Minor
f2370a63-4dda-470b-98ad-e7f63bf36702 sn02AK00684129--AK-8003-F9--PCIe 7/NET1 Active Minor
06a377c6-2283-c11c-ad4b-90a861380b0f sn02AK00684129--SPINTEL-8006-CE--DIMM 0/2 Active Minor
e4520a8a-32e1-422f-ac9b-989459aab000 sn02AK00684129--AK-8003-HF--aggr2 Active Major
ff96bce2-4ef5-43f5-9791-e9b4c64e35e8 sn02AK00684129--AK-8003-HF--vnic8 Active Major
3a2f6f38-a062-4205-940f-cda0fc1b19fc sn01AK00684129--AK-8003-F9--PCIe 7/NET1 Active Minor
PCA-ADMIN>
Bug: 37999786
Version: 3.0.2
Faults AK-8003-F9--PCIe 7/NET0 | NET1 Reported on Freshly Installed Rack
For Private Cloud Appliance systems that are newly installed with software build 3.0.2-b1261765, it is possible that during the first bring-up of the system you might see faults similar to these:
311275ec-5077-4dba-a5eb-8085df4a855d sn02AK00684129--AK-8003-F9--PCIe 7/NET0 Active Minor
efb1b954-d7bc-4a00-bc8c-c7076caad192 sn02AK00684129--AK-8003-F9--PCIe 7/NET1 Active Minor
afa421be-193a-4124-8585-9d742c4e1c57 sn01AK00684129--AK-8003-F9--PCIe 7/NET1 Active Minor
Workaround: These faults can be ignored and will clear automatically.
Bug: 37846460
Version: 3.0.2
Role Reversal Precheck Fails During DR Switchover
If a DR switchover plan fails with the error role_reversal_precheck job has failed, the failure is likely caused by a timeout.
Workaround: Rerun the switchover command.
Bug: 38068506
Version: 3.0.2
After Deleting a Peer Connection on One Appliance, Peered System Reports Active Connection
A peer connection between two Private Cloud Appliance systems must be configured from each side of the connection. If the peer connection is deleted on one of the systems, the other system continues to report that peering is active. This is not correct, but the systems are unable to track the configuration state on the remote side of the connection.
Workaround: Peer connections are designed to be symmetrical. If you delete the peer connection on one system, it must be deleted on the other system as well.
Bug: 37262830
Version: 3.0.2
Adding Instances to a DR Configuration During DR Operation Causes Failure
Administrators must not modify the compute instances included in any DR configuration while a DR operation is in progress. In particular, the combination of executing a (switchover) DR plan and adding compute instances to a DR configuration with the all=True option is highly likely to cause errors. The execution of a DR plan might be interrupted, conflicts between DR configurations might occur, and the ZFS Storage Appliance might become overloaded and stop accepting incoming requests.
Workaround: When adding compute instances to a DR configuration, or modifying DR configurations in general, ensure that no DR plan execution is in progress. Before executing a DR plan, ensure that there are no ongoing updates to the instances list of a DR configuration. Note that adding many compute instances in a single operation, such as with the all=True option, results in a long-running job. Ensure that no such job is still in progress. It is also recommended to add no more than 100 compute instances to a single DR configuration.
Bug: 37299009
Version: 3.0.2
Failure Creating Peer Connection If One System Does Not Expose High-Performance Storage Pool
A peer connection between two Oracle Private Cloud Appliance systems requires a local endpoint to be configured on each side. For the native Disaster Recovery (DR) service, the local endpoint configuration must include the IP address of each storage pool. If your systems have an optional high-performance pool, but you do not include its interface in the local endpoint configuration on one of the systems, then the peer connection cannot be created. An error is returned, indicating the replication target for the high-performance pool cannot be created.
Workaround: If you are peering two systems that contain a high-performance ZFS storage pool, ensure that an IP address is allocated on each system, and expose it through their respective local endpoint configuration. Include both parameters: zfsCapacityPoolEndpointIp and zfsPerformancePoolEndpointIp.
Bug: 37316895
Version: 3.0.2