Platform Issues
This section describes known issues and workarounds related to the appliance platform layer.
Compute Node Provisioning Takes a Long Time
The provisioning of a new compute node typically takes only a few minutes. However, there are several factors that may adversely affect the duration of the process. For example, the management nodes may be under a high load or the platform services involved in the provisioning may be busy or migrating between hosts. Also, if you started provisioning several compute nodes in quick succession, note that these processes are not executed in parallel but one after the other.
Workaround: Unless an error is displayed, you should assume that the compute node provisioning process is still ongoing and will eventually complete. At that point, the compute node provisioning state changes to Provisioned.
Bug: 33519372
Version: 3.0.1
Not Authorized to Reconfigure Appliance Network Environment
If you attempt to change the network environment parameters for the rack's external connectivity when you have just completed the initial system setup, your commands are rejected because you are not authorized to make those changes. This is caused by a security feature: the permissions for initial system setup are restricted to only those specific setup operations. Even if you are an administrator with unrestricted access to the Service Enclave, you must disconnect after initial system setup and log back in again to activate all permissions associated with your account.
Workaround: This behavior is expected and was designed to help protect against unauthorized access. In case you need to modify the appliance external network configuration right after the initial system setup, log out and log back in to make sure that your session is launched with the required privileges.
Bug: 33535069
Version: 3.0.1
Error Changing Hardware Component Password
The hardware layer of the Oracle Private Cloud Appliance architecture consists of various types of components with different operating and management software. As standalone products, their password policies can vary, but the appliance software enforces a stricter rule set. If an error is returned when you try to change a component password, ensure that your new password complies with the Private Cloud Appliance policy for hardware components.
For more information about password maintenance across the entire appliance environment, refer to the Oracle Private Cloud Appliance Security Guide.
Workaround: For hardware components, use the Service CLI to set a password that conforms to the following rules:
- consists of at least 8 characters
- has a maximum length of 20 characters for compute nodes, management nodes, and switches
- has a maximum length of 16 characters for ILOMs and the ZFS Storage Appliance
- contains at least one lowercase letter (a-z)
- contains at least one uppercase letter (A-Z)
- contains at least one digit (0-9)
- contains at least one symbol (@$!#%*&)
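The rules above can be checked programmatically before submitting a password change. The following Python sketch mirrors the stated policy; the function name and the component keywords are illustrative, not part of the appliance software:

```python
import re

# Symbols permitted by the hardware component password policy above.
SYMBOLS = "@$!#%*&"

def is_valid_component_password(password, component="compute"):
    """Check a candidate password against the policy listed above.

    component: "compute", "management", or "switch" allow up to 20
    characters; "ilom" and "zfs" allow up to 16. (Illustrative helper,
    not part of the appliance software.)
    """
    max_len = 16 if component in ("ilom", "zfs") else 20
    if not (8 <= len(password) <= max_len):
        return False
    checks = [
        re.search(r"[a-z]", password),        # at least one lowercase letter
        re.search(r"[A-Z]", password),        # at least one uppercase letter
        re.search(r"[0-9]", password),        # at least one digit
        any(c in SYMBOLS for c in password),  # at least one symbol
    ]
    return all(checks)
```

For example, a 17-character password would pass for a compute node but fail for an ILOM, which caps passwords at 16 characters.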
Bug: 35828215
Version: 3.0.2
Grafana Service Statistics Remain at Zero
The Grafana Service Monitoring folder contains a dashboard named Service Level, which displays statistical information about requests received by the fundamental appliance services. These numbers can remain at zero even though there is activity pertaining to the services monitored through this dashboard.
Workaround: No workaround is currently available.
Bug: 33535885
Version: 3.0.1
Terraform Provisioning Requires Fully Qualified Domain Name for Region
If you use the Oracle Cloud Infrastructure Terraform provider to automate infrastructure provisioning on Oracle Private Cloud Appliance, you must specify the fully qualified domain name of the appliance in the region variable for the Terraform provider.
Synchronizing Hardware Data Causes Provisioning Node to Appear Ready to Provision
Both the Service Web UI and the Service CLI provide a command to synchronize the information about hardware components with the actual status as currently registered by the internal hardware management services. However, you should not need to synchronize hardware status under normal circumstances, because status changes are detected and communicated automatically.
Furthermore, if a compute node provisioning operation is in progress when you synchronize hardware data, its Provisioning State could be reverted to Ready to Provision. This information is incorrect, and is caused by the hardware synchronization occurring too soon after the provisioning command. In this situation, attempting to provision the compute node again is likely to cause problems.
Workaround: If you have started provisioning a compute node, and its provisioning state reads Provisioning, wait at least another five minutes to see if it changes to Provisioned. If it takes excessively long for the compute node to be listed as Provisioned, run the Sync Hardware Data command.
If the compute node still does not change to Provisioned, retry provisioning the compute node.
Bug: 33575736
Version: 3.0.1
Rack Elevation for Storage Controller Not Displayed
In the Service Web UI, the Rack Units list shows all hardware components with basic status information. One of the data fields is Rack Elevation, the rack unit number where the component in question is installed. For one of the controllers of the ZFS Storage Appliance, pcasn02, the rack elevation is shown as Not Available.
Workaround: There is no workaround. The underlying hardware administration services currently do not populate this particular data field. The two controllers occupy 2 rack units each and are installed in RU 1-4.
Bug: 33609276
Version: 3.0.1
Fix available: Please apply the latest patches to your system.
Free-Form Tags Used for Extended Functionality
You can use the following free-form tags to extend the functionality of Oracle Private Cloud Appliance.
Note:
Do not use these tag names for other purposes.
- PCA_no_lm
Use this tag to instruct the Compute service not to live migrate an instance. The value can be either True or False.
By default, an instance can be live migrated, such as when you need to evacuate all running instances from a compute node. Live migration can be a problem for some instances. For example, live migration is not supported for instances in a Microsoft Windows cluster. To prevent an instance from being live migrated, set this tag to True on the instance.
Specify this tag in the Tagging section of the Create Instance or Edit instance_name dialog, in the oci compute instance launch or oci compute instance update command, or using the API. The following is an example option for the oci compute instance launch command:
--freeform-tags '{"PCA_no_lm": "True"}'
Setting this tag to True on an instance will not prevent the instance from being moved when you change the fault domain. Changing the fault domain is not a live migration. When you change the fault domain of an instance, the instance is stopped, moved, and restarted.
- PCA_blocksize
Use this tag to instruct the ZFS Storage Appliance to create a new volume with a specific block size.
The default block size is 8192 bytes. To specify a different block size, specify the PCA_blocksize tag in the Tagging section of the Create Block Volume dialog, in the oci bv volume create command, or using the API. Supported values are a power of 2 between 512 and 1M bytes, specified as a string and fully expanded. The following is an example option for the oci bv volume create command:
--freeform-tags '{"PCA_blocksize": "65536"}'
The block size cannot be modified once the volume has been created.
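Because the supported values must be fully expanded powers of 2 in a fixed range, a quick client-side check can catch invalid tag values before the volume is created. This Python sketch encodes the stated constraints; the helper name is illustrative, not part of the appliance software:

```python
def is_valid_pca_blocksize(value):
    """Check a PCA_blocksize tag value: a fully expanded decimal string
    whose integer value is a power of 2 between 512 and 1048576 (1M).
    (Illustrative helper, not part of the appliance software.)
    """
    if not value.isdigit():   # must be fully expanded, e.g. "65536", not "64K"
        return False
    n = int(value)
    # n is a power of 2 if and only if exactly one bit is set
    return 512 <= n <= 1048576 and (n & (n - 1)) == 0
```

For example, "65536" is accepted, while "64K" (not expanded), "1000" (not a power of 2), and "256" (below the minimum) are rejected.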
Use of these tags counts against your tag limit.
Version: 3.0.1
Do Not Use Reserved Tag Namespace
Oracle Private Cloud Appliance uses a reserved tag namespace named OraclePCA to enable additional functionality. For example, the File Storage Service supports defined tags to set file system quota or database record size. There is no protection mechanism in place to prevent users from using that same namespace for other purposes. However, the reserved tag namespace must not be used for any tags other than those defined by the system.
Workaround: Do not use the OraclePCA tag namespace to create and use your own defined tags. Create your tags in a different tag namespace.
Bug: 35976195
Version: 3.0.2
Imported Images Not Synchronized to High-Performance Pool
In an Oracle Private Cloud Appliance with the default storage configuration, when you import compute images, they are stored on the ZFS Storage Appliance in an images LUN inside the standard ZFS pool. If the storage configuration is later extended with a high-performance disk shelf, an additional high-performance ZFS pool is configured on the ZFS Storage Appliance. Because there is no replication between the storage pools, the images from the original pool are not automatically made available in the new high-performance pool. The images have to be imported manually.
Workaround: When adding high-performance storage shelves to the appliance configuration, import the required compute images again to ensure they are loaded into the newly created ZFS pool.
Bug: 33660897
Version: 3.0.1
API Server Failure After Management Node Reboot
When one of the three management nodes is rebooted, it may occur that the API server does not respond to any requests, even though it can still be reached through the other two management nodes in the cluster. This is likely caused by an ownership issue with the virtual IP shared between the management nodes, or by the DNS server not responding quickly enough to route traffic to the service pods on the available management nodes. After the rebooted management node has rejoined the cluster, it may still take several minutes before the API server returns to its normal operating state and accepts requests again.
Workaround: When a single management node reboots, all the services are eventually restored to their normal operating condition, although their pods may be distributed differently across the management node cluster. If your UI, CLI or API operations fail after a management node reboot, wait 5 to 10 minutes and try again.
Bug: 33191011
Version: 3.0.1
CLI Command Returns Status 500 Due To MySQL Connection Error
When a command is issued from the OCI CLI and accepted by the API server, it starts a series of internal operations involving the microservice pods and the MySQL database, among other components. It may occur that the pod instructed to execute an operation is unable to connect to the MySQL database before the timeout is reached. This exception is reported back to the API server, which in turn reports that the request could not be fulfilled due to an unexpected condition (HTTP status code 500). It is normal for this type of exception to result in a generic server error code. More detailed information may be stored in logs.
Workaround: If a generic status 500 error code is returned after you issued a CLI command, try to execute the command again. If the error was the result of an intermittent connection problem, the command is likely to succeed upon retry.
Bug: n/a
Version: 3.0.1
Administrators in Authorization Group Other Than SuperAdmin Must Use Service CLI to Change Password
Due to high security restrictions, administrators who are not a member of the SuperAdmin authorization group are unable to change their account password in the Service Web UI. An authorization error is displayed when an administrator from a non-SuperAdmin authorization group attempts to access their own profile.
Workaround: Log in to the Service CLI, find your user id in the user preferences, and change your password as follows:
PCA-ADMIN> show UserPreference
Data:
  Id = 1c74b2a5-c1ce-4433-99da-cb17aab4c090
  Type = UserPreference
  [...]
  UserId = id:5b6c1bfa-453c-4682-e692-6f0c91b53d21 type:User name:dcadmin
PCA-ADMIN> changePassword id=<user_id> password=<new_password> confirmPassword=<new_password>
Bug: 33749967
Version: 3.0.1
Service Web UI and Grafana Unavailable when HAProxy Is Down
HAProxy is the load balancer used by the Private Cloud Appliance platform layer for all access to and from the microservices. When the load balancer and proxy services are down, the Service Web UI and Grafana monitoring interface are unavailable. When you attempt to log in, you receive an error message: "Server Did Not Respond".
Workaround: Log in to one of the management nodes. Check the status of the HAProxy cluster resource, and restart if necessary.
# ssh pcamn01
# pcs status
Cluster name: mncluster
Stack: corosync
[...]
Full list of resources:
scsi_fencing (stonith:fence_scsi): Stopped (disabled)
Resource Group: mgmt-rg
vip-mgmt-int (ocf::heartbeat:IPaddr2): Started pcamn03
vip-mgmt-host (ocf::heartbeat:IPaddr2): Started pcamn03
vip-mgmt-ilom (ocf::heartbeat:IPaddr2): Started pcamn03
vip-mgmt-lb (ocf::heartbeat:IPaddr2): Started pcamn03
vip-mgmt-ext (ocf::heartbeat:IPaddr2): Started pcamn03
l1api (systemd:l1api): Started pcamn03
haproxy (ocf::heartbeat:haproxy): Stopped (disabled)
pca-node-state (systemd:pca_node_state): Started pcamn03
dhcp (ocf::heartbeat:dhcpd): Started pcamn03
hw-monitor (systemd:hw_monitor): Started pcamn03
To start HAProxy, use the pcs resource command as shown in the following example. Verify that the cluster resource status has changed from "Stopped (disabled)" to "Started".
# pcs resource enable haproxy
# pcs status
[...]
Resource Group: mgmt-rg
haproxy (ocf::heartbeat:haproxy): Started pcamn03
Bug: 34485377
Version: 3.0.2
Lock File Issue Occurs when Changing Compute Node Passwords
When a command is issued to modify the password for a compute node or ILOM, the system sets a temporary lock on the relevant database to ensure that password changes are applied in a reliable and consistent manner. If the database lock cannot be obtained or released on the first attempt, the system makes several further attempts to complete the request. Under normal operating circumstances, it is expected that the password is eventually successfully changed. However, the command output may contain error messages such as "Failed to create DB lockfile" or "Failed to remove DB lock", even if the final result is "Password successfully changed".
Workaround: The error messages are inaccurate and can be ignored as long as the password operations complete as expected. No workaround is required.
Bug: 34065740
Version: 3.0.2
Compute Node Hangs at Dracut Prompt after System Power Cycle
When an appliance or some of its components need to be powered off, for example to perform
maintenance, there is always a minimal risk that a step in the complex reboot sequence is not
completed successfully. When a compute node reboots after a system power cycle, it can hang at
the dracut
prompt because the boot framework fails to build the required
initramfs/initrd image. As a result, primary GPT partition errors are reported for the root
file system.
Workaround: Log on to the compute node ILOM. Verify that the server has failed to boot and is in the dracut recovery shell. To allow the compute node to return to normal operation, reset it from the ILOM using the reset /System command.
Bug: 34096073
Version: 3.0.2
No Error Reported for Unavailable Spine Switch
When a spine switch goes offline due to loss of power or a fatal error, the system gives no indication of the issue in the Service Enclave UI/CLI or Grafana. This behavior is the result of the switch client not properly handling exceptions and continuing to report the default "healthy" status.
Workaround: There is currently no workaround to make the system generate an error that alerts the administrator of a spine switch issue.
Bug: 34696315
Version: 3.0.2
ZFS Storage Appliance Controller Stuck in Failsafe Shell After Power Cycle
The two controllers of the Oracle ZFS Storage Appliance operate in an active-active cluster configuration. When one controller is taken offline, for example when its firmware is upgraded or when maintenance is required, the other controller takes ownership of all storage resources to provide continuation of service. During this process, several locks must be applied and released. When the rebooted controller rejoins the cluster to take back ownership of its assigned storage resources, the cluster synchronization will fail if the necessary locks are not released correctly. In this situation, the rebooted controller could become stuck in the failsafe shell, waiting for the peer controller to release certain locks. This is likely the result of a takeover operation that was not completed entirely, leaving the cluster in an indeterminate state.
Workaround: There is currently no workaround. If the storage controller cluster ends up in this condition, contact Oracle for assistance.
Bug: 34700405
Version: 3.0.2
Concurrent Compute Node Provisioning Operations Fail Due to Storage Configuration Timeout
When the Private Cloud Appliance has just been installed, or when a set of expansion compute nodes have been added, the system does not prevent you from provisioning all new compute nodes at once. Note, however, that for each provisioned node the storage initiators and targets must be configured on the ZFS Storage Appliance. If there are too many configuration update requests for the storage appliance to process, they will time out. As a result, all compute node provisioning operations will fail and be rolled back to the unprovisioned state.
Workaround: To avoid ZFS Storage Appliance configuration timeouts, provision compute nodes sequentially one by one, or in groups of no more than 3.
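If you drive provisioning through automation, the sequential batching suggested in the workaround can be sketched as follows. Here, provision_batch is a hypothetical placeholder for your actual provisioning call (Service CLI or API), assumed to block until the batch completes; nothing in this sketch is part of the appliance software:

```python
def chunked(items, size=3):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def provision_in_batches(node_names, provision_batch, batch_size=3):
    """Provision compute nodes in small batches to avoid overloading the
    ZFS Storage Appliance with concurrent configuration updates.

    provision_batch: a placeholder callable that provisions one batch of
    nodes and returns only after the batch has finished provisioning.
    """
    for batch in chunked(node_names, batch_size):
        # Wait for each batch to complete before starting the next one.
        provision_batch(batch)
```

The key point is serialization: each batch of at most three nodes finishes before the next one starts, so the storage appliance never queues more configuration requests than it can process.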
Bug: 34739702
Version: 3.0.2
Data Switch Fails to Boot Due to Active Console Connection
If a Cisco Nexus 9336C-FX2 Switch has an active local console session, for example when a terminal server is connected, the switch could randomly hang during reboot. It is assumed that the interruption of the boot sequence is caused by a ghost session on the console port. This behavior has not been observed when no local console connection is used.
Workaround: Do not connect any cables to the console ports of the data switches. There is no need for a local console connection in a Private Cloud Appliance installation.
Bug: 32965120
Version: 3.0.2
Federated Login Failure after Appliance Upgrade
Identity federation allows users to log in to Private Cloud Appliance with their existing company user name and password. After an upgrade of the appliance software, the trust relationship between the identity provider and Private Cloud Appliance might be broken, causing all federated logins to fail. During the upgrade the Private Cloud Appliance X.509 external server certificate could be updated for internal service changes. In this case, the certificate on the identity provider side no longer matches.
Workaround: If the identity provider allows it, update its service provider certificate.
- Retrieve the appliance SAML metadata XML file from https://iaas.<domain>/saml/<TenancyId> and save it to a local file.
- Open the local file with a text editor and find the <X509Certificate> element.
<SPSSODescriptor>
  <KeyDescriptor use="signing">
    <KeyInfo>
      <X509Data>
        <X509Certificate>
          <COPY CERTIFICATE CONTENT FROM HERE>
        </X509Certificate>
      </X509Data>
    </KeyInfo>
  </KeyDescriptor>
</SPSSODescriptor>
- Copy the certificate content and save it to a new *.pem file, structured as follows:
-----BEGIN CERTIFICATE-----
<PASTE CERTIFICATE CONTENT HERE>
-----END CERTIFICATE-----
- Update the identity provider with this new service provider certificate for your Private Cloud Appliance.
If the identity provider offers no easy way to update the certificate, we recommend that you delete the service provider and reconfigure identity federation. For more information, refer to the section "Federating with Microsoft Active Directory" in the Oracle Private Cloud Appliance Administrator Guide.
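The manual certificate extraction and PEM conversion described above can also be scripted. The following Python sketch assumes a standard SAML metadata layout; the function name is illustrative and the script is not part of the appliance software:

```python
import textwrap
import xml.etree.ElementTree as ET

def saml_cert_to_pem(metadata_path, pem_path):
    """Extract the signing certificate from a saved SAML metadata file
    and write it as a PEM file. (Illustrative sketch; assumes the
    metadata contains an X509Certificate element as shown above.)
    """
    tree = ET.parse(metadata_path)
    for elem in tree.iter():
        # Match the element regardless of its XML namespace prefix.
        if elem.tag.endswith("X509Certificate"):
            cert = "".join(elem.text.split())       # strip whitespace and newlines
            body = "\n".join(textwrap.wrap(cert, 64))  # PEM uses 64-char lines
            with open(pem_path, "w") as f:
                f.write("-----BEGIN CERTIFICATE-----\n")
                f.write(body + "\n")
                f.write("-----END CERTIFICATE-----\n")
            return pem_path
    raise ValueError("no X509Certificate element found in metadata")
```

Run it against the metadata file you saved in the first step, then upload the resulting *.pem file to the identity provider.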
Bug: 35688600
Version: 3.0.2
Ensure No Storage Buckets Are Present Before Deleting a Compartment or Tenancy
When a command is issued to delete a compartment or tenancy, the appliance software cannot reliably confirm that no object storage buckets exist, because it has no service account with access to all buckets present on the ZFS Storage Appliance. As a result, access to certain object storage buckets could be lost when their compartment is deleted.
Workaround: Before deleting a compartment or tenancy, verify that no object storage buckets are present in that compartment or tenancy.
Bug: 35811594
Version: 3.0.2
Syntax Issue when Generating Certificate Signing Request
When you plan to implement a new custom certificate, you need to provide a certificate signing request (CSR) to the Certificate Authority (CA) that will create your CA certificate. You generate the CSR from the Service CLI using the generateCustomerCsr command, typically with optional parameters to add details about your organization.
To include multiple Organization Units (OU) in the CSR, you enter them as a comma-separated list. For example: generateCustomerCsr organization="My Company" organizationUnit=Division-1,Division-2. However, the CLI always interprets the comma as a separator in this situation, and no escape character can be used to alter this behavior. This means you cannot enter a name containing a comma, as it will be interpreted as a set of two comma-separated strings.
Workaround: Do not enter strings containing commas when adding optional parameters to the generateCustomerCsr command. Use a comma only as a separator.
Bug: 35946278
Version: 3.0.2
Listing Upgrade Jobs Fails with RabbitMQ Error
When you run the Service CLI command getUpgradeJobs, the following error might be returned:
PCA-ADMIN> getUpgradeJobs
Status: Failure
Error Msg: PCA_GENERAL_000012: Error in RabbitMQ service: null
Workaround: The issue is temporary. Retry the command at a later time.
Bug: 35999461
Version: 3.0.2
Availability Domain Name Change in Version 3.0.2-b1001356
In software version 3.0.2-b1001356 (December 2023), the single availability domain of Private Cloud Appliance was renamed from "ad1" to "AD-1". This change was required for compatibility with Oracle Cloud Infrastructure. The availability domain is a mandatory parameter in a small set of commands, and an optional parameter in several other commands.
The --availability-domain parameter is required with the following commands:
oci bv boot-volume create
oci bv boot-volume list
oci bv volume create
oci bv volume-group create
oci compute instance launch
oci compute boot-volume-attachment list
oci fs file-system create
oci fs file-system list
oci fs mount-target create
oci fs mount-target list
oci fs export-set list
oci iam fault-domain list
Workaround: Ensure that the correct value is used to identify the availability domain in your commands, depending on the version of the appliance software your system is running. If you are using scripts or any form of automation that includes the --availability-domain parameter, ensure that your code is updated when you upgrade or patch the appliance with version 3.0.2-b1001356 or newer.
Bug: 36094977
Version: 3.0.2
No Packages Available to Patch MySQL Cluster Database
With the release of appliance software version 3.0.2-b1001356, new MySQL RPM packages were added to the ULN channel PCA 3.0.2 MN. However, a package signing issue prevents the ULN mirror from downloading them, which means the MySQL cluster database on your system cannot be patched to the latest available version.
When patching the system, you will see no error message or abnormal behavior related to the missing MySQL packages. Follow the workaround to obtain the new packages. Once these have been downloaded to the ULN mirror, you can patch the MySQL cluster database.
Note:
For new ULN mirror installations, the steps to enable updates of MySQL packages have been included in the Oracle Private Cloud Appliance Patching Guide under "Configure Your Environment for Patching".
To determine if a system is affected by this issue, check the ULN mirror for the presence of MySQL packages in the yum directory referenced by the pca302_x86_64_mn soft link. If the search returns no results, the ULN mirror was unable to download the MySQL packages. The default location of the yum setup directory is /var/www/html/yum, which is used in the following example:
# ls -al /var/www/html/yum/pca302_x86_64_mn/getPackage/ | grep mysql
-rw-r--r--. 1 root root  85169400 Dec 19 03:19 mysql-cluster-commercial-client-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   4751220 Dec 19 03:19 mysql-cluster-commercial-client-plugins-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root    689392 Dec 19 03:19 mysql-cluster-commercial-common-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root  12417692 Dec 19 03:19 mysql-cluster-commercial-data-node-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   2229080 Dec 19 03:19 mysql-cluster-commercial-icu-data-files-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   2236184 Dec 19 03:19 mysql-cluster-commercial-libs-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   1279012 Dec 19 03:19 mysql-cluster-commercial-libs-compat-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   3478680 Dec 19 03:19 mysql-cluster-commercial-management-server-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root 364433848 Dec 19 03:19 mysql-cluster-commercial-server-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   2428848 Dec 19 03:19 mysql-connector-j-commercial-8.0.33-1.1.el7.noarch.rpm
-rw-r--r--. 1 root root   4570200 Dec 19 03:19 mysql-connector-odbc-commercial-8.0.33-1.1.el7.x86_64.rpm
Workaround: Import the appropriate GPG key on your ULN mirror so that it can download the updated MySQL packages. Proceed as follows:
- Log in to the ULN mirror server.
- Download the MySQL GPG key from https://repo.mysql.com/RPM-GPG-KEY-mysql-2022.
- Import the GPG key.
# rpm --import RPM-GPG-KEY-mysql-2022
- Update the ULN mirror.
# /usr/bin/uln-yum-mirror
If the key was imported successfully, the new MySQL packages are downloaded to the ULN mirror.
- For confirmation, verify the signature using one of the new packages.
# rpm --checksig mysql-cluster-commercial-management-server-8.0.33-1.1.el7.x86_64.rpm
mysql-cluster-commercial-management-server-8.0.33-1.1.el7.x86_64.rpm: rsa sha1 (md5) pgp md5 OK
Bug: 36123758
Version: 3.0.2
Uppercase Letters Are Not Supported in Domain Names
Uppercase letters are not supported in domain names. The domain name for your system is used as the base domain for the internal network and by Oracle Private Cloud Appliance public-facing services. This attribute has a maximum length of 190 characters. Acceptable characters are lowercase letters (a-z), digits (0-9), and the hyphen (-).
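A domain name can be validated against these constraints before system setup. The following Python sketch is illustrative, not part of the appliance software; it also allows the dot that separates domain labels and, as an assumption beyond the stated policy, applies the usual DNS rule that labels do not begin or end with a hyphen:

```python
import re

# One domain label: lowercase letters, digits, and interior hyphens only.
_LABEL = r"[a-z0-9]([a-z0-9-]*[a-z0-9])?"
_DOMAIN_RE = re.compile(rf"^{_LABEL}(\.{_LABEL})*$")

def is_valid_appliance_domain(name):
    """Check a candidate base domain name: at most 190 characters,
    lowercase letters, digits, and hyphens, with dots between labels.
    (Illustrative helper, not part of the appliance software.)
    """
    return len(name) <= 190 and bool(_DOMAIN_RE.match(name))
```

For example, "my-rack.example.com" passes, while "Example.com" fails because of the uppercase letter.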
Bug: 36484125
Version: 3.0.2