Platform Issues

This section describes known issues and workarounds related to the appliance platform layer.

Compute Node Provisioning Takes a Long Time

The provisioning of a new compute node typically takes only a few minutes. However, several factors may adversely affect the duration of the process. For example, the management nodes may be under a high load, or the platform services involved in the provisioning may be busy or migrating between hosts. Also, if you started provisioning several compute nodes in quick succession, note that these processes are executed sequentially, not in parallel.

Workaround: Unless an error is displayed, you should assume that the compute node provisioning process is still ongoing and will eventually complete. At that point, the compute node provisioning state changes to Provisioned.

Bug: 33519372

Version: 3.0.1

Not Authorized to Reconfigure Appliance Network Environment

If you attempt to change the network environment parameters for the rack's external connectivity when you have just completed the initial system setup, your commands are rejected because you are not authorized to make those changes. This is caused by a security feature: the permissions for initial system setup are restricted to only those specific setup operations. Even if you are an administrator with unrestricted access to the Service Enclave, you must disconnect after initial system setup and log back in again to activate all permissions associated with your account.

Workaround: This behavior is expected and was designed to help protect against unauthorized access. In case you need to modify the appliance external network configuration right after the initial system setup, log out and log back in to make sure that your session is launched with the required privileges.

Bug: 33535069

Version: 3.0.1

Error Changing Hardware Component Password

The hardware layer of the Oracle Private Cloud Appliance architecture consists of various types of components with different operating and management software. As standalone products their password policies can vary, but the appliance software enforces a stricter rule set. If an error is returned when you try to change a component password, ensure that your new password complies with the Private Cloud Appliance policy for hardware components.

For more information about password maintenance across the entire appliance environment, refer to the Oracle Private Cloud Appliance Security Guide.

Workaround: For hardware components, use the Service CLI to set a password that conforms to the following rules:

  • consists of at least 8 characters

    • with a maximum length of 20 characters for compute nodes, management nodes, and switches

    • with a maximum length of 16 characters for ILOMs and the ZFS Storage Appliance

  • contains at least one lowercase letter (a-z)

  • contains at least one uppercase letter (A-Z)

  • contains at least one digit (0-9)

  • contains at least one symbol (@$!#%*&)
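For scripted password rotation, the policy above can be checked client-side before the Service CLI command is submitted. The helper below is illustrative only and is not part of the appliance software:

```shell
# Hypothetical client-side check of the hardware component password policy.
# Pass 16 as the second argument for ILOMs and the ZFS Storage Appliance;
# the default maximum of 20 applies to compute nodes, management nodes,
# and switches.
check_pca_password() {
  local pw="$1" max="${2:-20}"
  [ "${#pw}" -ge 8 ]      || { echo "too short (minimum 8)"; return 1; }
  [ "${#pw}" -le "$max" ] || { echo "too long (maximum $max)"; return 1; }
  case "$pw" in *[a-z]*) ;; *) echo "missing lowercase letter"; return 1 ;; esac
  case "$pw" in *[A-Z]*) ;; *) echo "missing uppercase letter"; return 1 ;; esac
  case "$pw" in *[0-9]*) ;; *) echo "missing digit"; return 1 ;; esac
  case "$pw" in *['@$!#%*&']*) ;; *) echo 'missing symbol (@$!#%*&)'; return 1 ;; esac
  echo "OK"
}
```

For example, `check_pca_password 'MyIlomP@ss1' 16` validates a candidate ILOM password against the 16-character limit before you attempt the change.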

Bug: 35828215

Version: 3.0.2

Grafana Service Statistics Remain at Zero

The Grafana Service Monitoring folder contains a dashboard named Service Level, which displays statistical information about requests received by the fundamental appliance services. These numbers can remain at zero even though there is activity pertaining to the services monitored through this dashboard.

Workaround: No workaround is currently available.

Bug: 33535885

Version: 3.0.1

Terraform Provisioning Requires Fully Qualified Domain Name for Region

If you use the Oracle Cloud Infrastructure Terraform provider to automate infrastructure provisioning on Oracle Private Cloud Appliance, you must specify the fully qualified domain name of the appliance in the region variable for the Terraform provider.

Synchronizing Hardware Data Causes Provisioning Node to Appear Ready to Provision

Both the Service Web UI and the Service CLI provide a command to synchronize the information about hardware components with the actual status as currently registered by the internal hardware management services. However, you should not need to synchronize hardware status under normal circumstances, because status changes are detected and communicated automatically.

Furthermore, if a compute node provisioning operation is in progress when you synchronize hardware data, its Provisioning State could be reverted to Ready to Provision. This information is incorrect, and is caused by the hardware synchronization occurring too soon after the provisioning command. In this situation, attempting to provision the compute node again is likely to cause problems.

Workaround: If you have started provisioning a compute node, and its provisioning state reads Provisioning, wait at least another five minutes to see if it changes to Provisioned. If it takes excessively long for the compute node to be listed as Provisioned, run the Sync Hardware Data command.

If the compute node still does not change to Provisioned, retry provisioning the compute node.

Bug: 33575736

Version: 3.0.1

Rack Elevation for Storage Controller Not Displayed

In the Service Web UI, the Rack Units list shows all hardware components with basic status information. One of the data fields is Rack Elevation, the rack unit number where the component in question is installed. For one of the controllers of the ZFS Storage Appliance, pcasn02, the rack elevation is shown as Not Available.

Workaround: There is no workaround. The underlying hardware administration services currently do not populate this particular data field. The two controllers occupy 2 rack units each and are installed in RU 1-4.

Bug: 33609276

Version: 3.0.1

Fix available: Please apply the latest patches to your system.

Free-Form Tags Used for Extended Functionality

You can use the following free-form tags to extend the functionality of Oracle Private Cloud Appliance.

Note:

Do not use these tag names for other purposes.

  • PCA_no_lm

    Use this tag to instruct the Compute service not to live migrate an instance. The value can be either True or False.

    By default, an instance can be live migrated, such as when you need to evacuate all running instances from a compute node. Live migration can be a problem for some instances. For example, live migration is not supported for instances in a Microsoft Windows cluster. To prevent an instance from being live migrated, set this tag to True on the instance.

    Specify this tag in the Tagging section of the Create Instance or Edit instance_name dialog, in the oci compute instance launch or oci compute instance update command, or using the API.

    The following is an example option for the oci compute instance launch command:

    --freeform-tags '{"PCA_no_lm": "True"}'

    Setting this tag to True on an instance will not prevent the instance from being moved when you change the fault domain. Changing the fault domain is not a live migration. When you change the fault domain of an instance, the instance is stopped, moved, and restarted.

  • PCA_blocksize

    Use this tag to instruct the ZFS storage appliance to create a new volume with a specific block size.

The default block size is 8192 bytes. To specify a different block size, specify the PCA_blocksize tag in the Tagging section of the Create Block Volume dialog, in the oci bv volume create command, or using the API. Supported values are powers of 2 between 512 bytes and 1 MB (1048576 bytes), specified as a string and fully expanded (for example, "65536", not "64K").

    The following is an example option for the oci bv volume create command:

    --freeform-tags '{"PCA_blocksize": "65536"}'

    The block size cannot be modified once the volume has been created.

Use of these tags counts against your tag limit.
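A candidate PCA_blocksize value can be sanity-checked client-side before the tag is applied. This helper is illustrative only; the value is ultimately validated by the appliance:

```shell
# Illustrative check: a valid PCA_blocksize is a fully expanded decimal
# string (no "64K" shorthand) that is a power of 2 between 512 and 1048576.
valid_pca_blocksize() {
  local bs="$1"
  case "$bs" in ''|*[!0-9]*) return 1 ;; esac   # decimal digits only
  [ "$bs" -ge 512 ] && [ "$bs" -le 1048576 ] || return 1
  [ $(( bs & (bs - 1) )) -eq 0 ]                # power-of-2 test
}

# Example: only pass the tag when the value checks out.
# valid_pca_blocksize 65536 && \
#   oci bv volume create ... --freeform-tags '{"PCA_blocksize": "65536"}'
```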

Version: 3.0.1

Do Not Use Reserved Tag Namespace

Oracle Private Cloud Appliance uses a reserved tag namespace named OraclePCA to enable additional functionality. For example, the File Storage Service supports defined tags to set file system quota or database record size. There is no protection mechanism in place to prevent users from using that same namespace for other purposes. However, the reserved tag namespace must not be used for any other tags than those defined by the system.

Workaround: Do not use the OraclePCA tag namespace to create and use your own defined tags. Create your tags in a different tag namespace.

Bug: 35976195

Version: 3.0.2

Imported Images Not Synchronized to High-Performance Pool

In an Oracle Private Cloud Appliance with default storage configuration, when you import compute images, they are stored on the ZFS Storage Appliance in an images LUN inside the standard ZFS pool. If the storage configuration is later extended with a high-performance disk shelf, an additional high-performance ZFS pool is configured on the ZFS Storage Appliance. Because there is no replication between the storage pools, the images from the original pool are not automatically made available in the new high-performance pool. The images have to be imported manually.

Workaround: When adding high-performance storage shelves to the appliance configuration, import the required compute images again to ensure they are loaded into the newly created ZFS pool.

Bug: 33660897

Version: 3.0.1

API Server Failure After Management Node Reboot

When one of the three management nodes is rebooted, it may occur that the API server does not respond to any requests, even though it can still be reached through the other two management nodes in the cluster. This is likely caused by an ownership issue with the virtual IP shared between the management nodes, or by the DNS server not responding quickly enough to route traffic to the service pods on the available management nodes. After the rebooted management node has rejoined the cluster, it may still take several minutes before the API server returns to its normal operating state and accepts requests again.

Workaround: When a single management node reboots, all the services are eventually restored to their normal operating condition, although their pods may be distributed differently across the management node cluster. If your UI, CLI or API operations fail after a management node reboot, wait 5 to 10 minutes and try again.

Bug: 33191011

Version: 3.0.1

CLI Command Returns Status 500 Due To MySQL Connection Error

When a command is issued from the OCI CLI and accepted by the API server, it starts a series of internal operations involving the microservice pods and the MySQL database, among other components. It may occur that the pod instructed to execute an operation is unable to connect to the MySQL database before the timeout is reached. This exception is reported back to the API server, which in turn reports that the request could not be fulfilled due to an unexpected condition (HTTP status code 500). It is normal for this type of exception to result in a generic server error code. More detailed information may be stored in logs.

Workaround: If a generic status 500 error code is returned after you issued a CLI command, try to execute the command again. If the error was the result of an intermittent connection problem, the command is likely to succeed upon retry.
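In scripts, the retry can be automated with a small wrapper. The wrapper below is a sketch and not part of the OCI CLI; the example command at the end is only an illustration of how it might be invoked:

```shell
# Illustrative retry wrapper: rerun a command up to a given number of
# attempts, pausing between tries, for transient failures such as a
# status 500 caused by an intermittent database connection problem.
retry() {
  local attempts="$1" delay="$2"; shift 2
  local i=1
  while ! "$@"; do
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (hypothetical invocation):
#   retry 3 5 oci compute instance list --compartment-id "$COMPARTMENT_OCID"
```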

Bug: n/a

Version: 3.0.1

Administrators in Authorization Group Other Than SuperAdmin Must Use Service CLI to Change Password

Due to high security restrictions, administrators who are not a member of the SuperAdmin authorization group are unable to change their account password in the Service Web UI. An authorization error is displayed when an administrator from a non-SuperAdmin authorization group attempts to access their own profile.

Workaround: Log in to the Service CLI, find your user id in the user preferences, and change your password as follows:

PCA-ADMIN> show UserPreference
Data:
  Id = 1c74b2a5-c1ce-4433-99da-cb17aab4c090
  Type = UserPreference
[...]
  UserId = id:5b6c1bfa-453c-4682-e692-6f0c91b53d21  type:User  name:dcadmin

PCA-ADMIN> changePassword id=<user_id> password=<new_password> confirmPassword=<new_password>

Bug: 33749967

Version: 3.0.1

Service Web UI and Grafana Unavailable when HAProxy Is Down

HAProxy is the load balancer used by the Private Cloud Appliance platform layer for all access to and from the microservices. When the load balancer and proxy services are down, the Service Web UI and Grafana monitoring interface are unavailable. When you attempt to log in, you receive an error message: "Server Did Not Respond".

Workaround: Log in to one of the management nodes. Check the status of the HAProxy cluster resource, and restart if necessary.

# ssh pcamn01
# pcs status
Cluster name: mncluster
Stack: corosync
[...]
Full list of resources:

 scsi_fencing   (stonith:fence_scsi):   Stopped (disabled)
 Resource Group: mgmt-rg
     vip-mgmt-int     (ocf::heartbeat:IPaddr2):       Started pcamn03
     vip-mgmt-host    (ocf::heartbeat:IPaddr2):       Started pcamn03
     vip-mgmt-ilom    (ocf::heartbeat:IPaddr2):       Started pcamn03
     vip-mgmt-lb      (ocf::heartbeat:IPaddr2):       Started pcamn03
     vip-mgmt-ext     (ocf::heartbeat:IPaddr2):       Started pcamn03
     l1api            (systemd:l1api):                Started pcamn03
     haproxy          (ocf::heartbeat:haproxy):       Stopped (disabled)
     pca-node-state   (systemd:pca_node_state):       Started pcamn03
     dhcp             (ocf::heartbeat:dhcpd):         Started pcamn03
     hw-monitor       (systemd:hw_monitor):           Started pcamn03

To start HAProxy, use the pcs resource command as shown in the example below. Verify that the cluster resource status has changed from "Stopped (disabled)" to "Started".

# pcs resource enable haproxy
# pcs status
[...]
 Resource Group: mgmt-rg
     haproxy          (ocf::heartbeat:haproxy):       Started pcamn03

Bug: 34485377

Version: 3.0.2

Lock File Issue Occurs when Changing Compute Node Passwords

When a command is issued to modify the password for a compute node or ILOM, the system sets a temporary lock on the relevant database to ensure that password changes are applied in a reliable and consistent manner. If the database lock cannot be obtained or released on the first attempt, the system makes several further attempts to complete the request. Under normal operating circumstances, it is expected that the password is eventually successfully changed. However, the command output may contain error messages such as "Failed to create DB lockfile" or "Failed to remove DB lock", even if the final result is "Password successfully changed".

Workaround: The error messages are inaccurate and can be ignored as long as the password operations complete as expected. No workaround is required.

Bug: 34065740

Version: 3.0.2

Compute Node Hangs at Dracut Prompt after System Power Cycle

When an appliance or some of its components need to be powered off, for example to perform maintenance, there is always a minimal risk that a step in the complex reboot sequence is not completed successfully. When a compute node reboots after a system power cycle, it can hang at the dracut prompt because the boot framework fails to build the required initramfs/initrd image. As a result, primary GPT partition errors are reported for the root file system.

Workaround: Log on to the compute node ILOM. Verify that the server has failed to boot, and is in the dracut recovery shell. To allow the compute node to return to normal operation, reset it from the ILOM using the reset /System command.

Bug: 34096073

Version: 3.0.2

No Error Reported for Unavailable Spine Switch

When a spine switch goes offline due to loss of power or a fatal error, the system gives no indication of the issue in the Service Enclave UI/CLI or Grafana. This behavior is the result of the switch client not properly handling exceptions and continuing to report the default "healthy" status.

Workaround: There is currently no workaround to make the system generate an error that alerts the administrator of a spine switch issue.

Bug: 34696315

Version: 3.0.2

ZFS Storage Appliance Controller Stuck in Failsafe Shell After Power Cycle

The two controllers of the Oracle ZFS Storage Appliance operate in an active-active cluster configuration. When one controller is taken offline, for example when its firmware is upgraded or when maintenance is required, the other controller takes ownership of all storage resources to provide continuation of service. During this process, several locks must be applied and released. When the rebooted controller rejoins the cluster to take back ownership of its assigned storage resources, the cluster synchronization will fail if the necessary locks are not released correctly. In this situation, the rebooted controller could become stuck in the failsafe shell, waiting for the peer controller to release certain locks. This is likely the result of a takeover operation that was not completed entirely, leaving the cluster in an indeterminate state.

Workaround: There is currently no workaround. If the storage controller cluster ends up in this condition, contact Oracle for assistance.

Bug: 34700405

Version: 3.0.2

Concurrent Compute Node Provisioning Operations Fail Due to Storage Configuration Timeout

When the Private Cloud Appliance has just been installed, or when a set of expansion compute nodes have been added, the system does not prevent you from provisioning all new compute nodes at once. Note, however, that for each provisioned node the storage initiators and targets must be configured on the ZFS Storage Appliance. If there are too many configuration update requests for the storage appliance to process, they will time out. As a result, all compute node provisioning operations will fail and be rolled back to the unprovisioned state.

Workaround: To avoid ZFS Storage Appliance configuration timeouts, provision compute nodes one by one, or in groups of no more than 3.
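The workaround can be scripted with a simple batching loop. In this sketch the node names and the echoed provision step are placeholders; in practice the step would be the relevant Service CLI provisioning command, and each group should be allowed to reach the Provisioned state before the next group is started:

```shell
# Illustrative only: process compute nodes in groups of at most 3.
provision_in_batches() {
  local batch_size=3 count=0 group=""
  for node in "$@"; do
    group="$group $node"
    count=$((count + 1))
    if [ "$count" -eq "$batch_size" ]; then
      echo "provision batch:$group"   # placeholder for the real provisioning step
      group="" count=0
    fi
  done
  if [ -n "$group" ]; then
    echo "provision batch:$group"     # remaining nodes
  fi
}
```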

Bug: 34739702

Version: 3.0.2

Data Switch Fails to Boot Due to Active Console Connection

If a Cisco Nexus 9336C-FX2 Switch has an active local console session, for example when a terminal server is connected, the switch could randomly hang during reboot. It is assumed that the interruption of the boot sequence is caused by a ghost session on the console port. This behavior has not been observed when no local console connection is used.

Workaround: Do not connect any cables to the console ports of the data switches. There is no need for a local console connection in a Private Cloud Appliance installation.

Bug: 32965120

Version: 3.0.2

Federated Login Failure after Appliance Upgrade

Identity federation allows users to log in to Private Cloud Appliance with their existing company user name and password. After an upgrade of the appliance software, the trust relationship between the identity provider and Private Cloud Appliance might be broken, causing all federated logins to fail. During the upgrade the Private Cloud Appliance X.509 external server certificate could be updated for internal service changes. In this case, the certificate on the identity provider side no longer matches.

Workaround: If the identity provider allows it, update its service provider certificate.

  1. Retrieve the appliance SAML metadata XML file from https://iaas.<domain>/saml/<TenancyId> and save it to a local file.

  2. Open the local file with a text editor and find the <X509Certificate> element.

    <SPSSODescriptor>
        <KeyDescriptor use="signing">
            <KeyInfo>
                <X509Data>
                    <X509Certificate>
                        <COPY CERTIFICATE CONTENT FROM HERE>
                    </X509Certificate>
                </X509Data>
            </KeyInfo>
        </KeyDescriptor>
    </SPSSODescriptor>
  3. Copy the certificate content and save it to a new *.pem file, structured as follows:

    -----BEGIN CERTIFICATE-----
    <PASTE CERTIFICATE CONTENT HERE>
    -----END CERTIFICATE-----
  4. Update the identity provider with this new service provider certificate for your Private Cloud Appliance.

If the identity provider offers no easy way to update the certificate, we recommend that you delete the service provider and reconfigure identity federation. For more information, refer to the section "Federating with Microsoft Active Directory" in the Oracle Private Cloud Appliance Administrator Guide.
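Steps 2 and 3 can also be automated. The script below is a sketch under assumptions: it uses only awk, it expects the metadata XML to have been saved locally first (for example with curl, as in step 1), and the file names are placeholders:

```shell
# Illustrative: extract the first <X509Certificate> value from a locally
# saved SAML metadata file and wrap it in PEM markers.
# First save the metadata, for example:
#   curl -o metadata.xml "https://iaas.<domain>/saml/<TenancyId>"
saml_cert_to_pem() {
  echo "-----BEGIN CERTIFICATE-----"
  awk '
    /<X509Certificate>/ { grab = 1 }
    grab {
      line = $0
      gsub(/.*<X509Certificate>/, "", line)       # drop text before the tag
      stop = (line ~ /<\/X509Certificate>/)
      gsub(/<\/X509Certificate>.*/, "", line)     # drop text after the close tag
      gsub(/[ \t]/, "", line)                     # strip indentation
      if (line != "") print line
      if (stop) exit
    }
  ' "$1"
  echo "-----END CERTIFICATE-----"
}

# saml_cert_to_pem metadata.xml > sp-cert.pem
```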

Bug: 35688600

Version: 3.0.2

Ensure No Storage Buckets Are Present Before Deleting a Compartment or Tenancy

When a command is issued to delete a compartment or tenancy, the appliance software cannot reliably confirm that no object storage buckets exist, because it has no service account with access to all buckets present on the ZFS Storage Appliance. As a result, access to certain object storage buckets could be lost when their compartment is deleted.

Workaround: Before deleting a compartment or tenancy, verify that no object storage buckets are present in that compartment or tenancy.

Bug: 35811594

Version: 3.0.2

Syntax Issue when Generating Certificate Signing Request

When you plan to implement a new custom certificate, you need to provide a certificate signing request (CSR) to the Certificate Authority (CA) that will create your CA certificate. You generate the CSR from the Service CLI using the generateCustomerCsr command, typically with optional parameters to add details about your organization.

To include multiple Organization Units (OU) in the CSR, you enter them as a comma-separated list. For example: generateCustomerCsr organization="My Company" organizationUnit=Division-1,Division-2. However, the CLI always interprets the comma as a separator in this situation, and no escape character can be used to alter the behavior. This means you cannot enter a name containing a comma, as it will be interpreted as a set of two comma-separated strings.

Workaround: Do not enter strings containing commas when adding optional parameters to the generateCustomerCsr command. Use a comma only as a separator.
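Automation that builds the generateCustomerCsr command can catch this before the CSR is generated. The pre-flight helper below is illustrative only:

```shell
# Illustrative: refuse organization unit names that contain a comma,
# because the Service CLI would split them into separate units.
check_ou_names() {
  local ou
  for ou in "$@"; do
    case "$ou" in
      *,*) echo "WARNING: OU '$ou' contains a comma and would be split" >&2
           return 1 ;;
    esac
  done
}

# Example: check_ou_names "Division-1" "Division-2" before composing
# generateCustomerCsr organization="My Company" organizationUnit=Division-1,Division-2
```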

Bug: 35946278

Version: 3.0.2

Listing Upgrade Jobs Fails with RabbitMQ Error

When you run the Service CLI command getUpgradeJobs, the following error might be returned:

PCA-ADMIN> getUpgradeJobs
Status: Failure
Error Msg: PCA_GENERAL_000012: Error in RabbitMQ service: null

Workaround: The issue is temporary. Please retry the command at a later time.

Bug: 35999461

Version: 3.0.2

Availability Domain Name Change in Version 3.0.2-b1001356

In software version 3.0.2-b1001356 (December 2023), Private Cloud Appliance's single availability domain has been renamed from "ad1" to "AD-1". This change was required for compatibility with Oracle Cloud Infrastructure. The availability domain is a mandatory parameter in a small set of commands, and an optional parameter in several other commands.

The --availability-domain parameter is required with the following commands:

oci bv boot-volume create
oci bv boot-volume list
oci bv volume create
oci bv volume-group create
oci compute instance launch
oci compute boot-volume-attachment list
oci fs file-system create
oci fs file-system list
oci fs mount-target create
oci fs mount-target list
oci fs export-set list
oci iam fault-domain list

Workaround: Ensure that the correct value is used to identify the availability domain in your commands, depending on the version of the appliance software your system is running. If you are using scripts or any form of automation that includes the --availability-domain parameter, ensure that your code is updated when you upgrade or patch the appliance with version 3.0.2-b1001356 or newer.
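Scripts that must run against both older and newer builds can derive the availability domain name from the build number. The helper below is a sketch based on the rename point documented above; the version-string parsing is an assumption for this example:

```shell
# Illustrative: builds 3.0.2-b1001356 and newer use "AD-1"; older builds
# use "ad1". Expects a version string such as "3.0.2-b1001356".
pca_ad_name() {
  local build="${1##*-b}"   # strip everything up to and including "-b"
  if [ "$build" -ge 1001356 ] 2>/dev/null; then
    echo "AD-1"
  else
    echo "ad1"
  fi
}

# Example (hypothetical invocation):
#   oci fs file-system list --availability-domain "$(pca_ad_name "3.0.2-b1001356")" ...
```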

Bug: 36094977

Version: 3.0.2

No Packages Available to Patch MySQL Cluster Database

With the release of appliance software version 3.0.2-b1001356, new MySQL RPM packages were added to the ULN channel PCA 3.0.2 MN. However, a package signing issue prevents the ULN mirror from downloading them, which means the MySQL cluster database on your system cannot be patched to the latest available version.

When patching the system, you will see no error message or abnormal behavior related to the missing MySQL packages. Follow the workaround to obtain the new packages. Once these have been downloaded to the ULN mirror, you can patch the MySQL cluster database.

Note:

For new ULN mirror installations, the steps to enable updates of MySQL packages have been included in the Oracle Private Cloud Appliance Patching Guide under "Configure Your Environment for Patching".

To determine if a system is affected by this issue, check the ULN mirror for the presence of MySQL packages in the yum directory referenced by the pca302_x86_64_mn soft link. If the search returns no results, the ULN mirror was unable to download the MySQL packages. The default location of the yum setup directory is /var/www/html/yum, which is used in the following example showing the packages present on an unaffected system:

# ls -al /var/www/html/yum/pca302_x86_64_mn/getPackage/ | grep mysql
-rw-r--r--. 1 root root  85169400 Dec 19 03:19 mysql-cluster-commercial-client-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   4751220 Dec 19 03:19 mysql-cluster-commercial-client-plugins-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root    689392 Dec 19 03:19 mysql-cluster-commercial-common-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root  12417692 Dec 19 03:19 mysql-cluster-commercial-data-node-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   2229080 Dec 19 03:19 mysql-cluster-commercial-icu-data-files-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   2236184 Dec 19 03:19 mysql-cluster-commercial-libs-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   1279012 Dec 19 03:19 mysql-cluster-commercial-libs-compat-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   3478680 Dec 19 03:19 mysql-cluster-commercial-management-server-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root 364433848 Dec 19 03:19 mysql-cluster-commercial-server-8.0.33-1.1.el7.x86_64.rpm
-rw-r--r--. 1 root root   2428848 Dec 19 03:19 mysql-connector-j-commercial-8.0.33-1.1.el7.noarch.rpm
-rw-r--r--. 1 root root   4570200 Dec 19 03:19 mysql-connector-odbc-commercial-8.0.33-1.1.el7.x86_64.rpm

Workaround: After you import the appropriate GPG key on your ULN mirror, it can download the updated MySQL packages. Proceed as follows:

  1. Log in to the ULN mirror server.

  2. Download the MySQL GPG key from https://repo.mysql.com/RPM-GPG-KEY-mysql-2022.

  3. Import the GPG key.

    # rpm --import RPM-GPG-KEY-mysql-2022
  4. Update the ULN mirror.

    # /usr/bin/uln-yum-mirror

    If the key was imported successfully, the new MySQL packages are downloaded to the ULN mirror.

  5. For confirmation, verify the signature using one of the new packages.

    # rpm --checksig mysql-cluster-commercial-management-server-8.0.33-1.1.el7.x86_64.rpm
    mysql-cluster-commercial-management-server-8.0.33-1.1.el7.x86_64.rpm: rsa sha1 (md5) pgp md5 OK

Bug: 36123758

Version: 3.0.2

Uppercase Letters Are Not Supported in Domain Names

Uppercase letters are not supported in domain names. The domain name for your system is used as the base domain for the internal network and by Oracle Private Cloud Appliance public-facing services. This attribute has a maximum length of 190 characters. Acceptable characters are "a" through "z", "0" through "9", and the hyphen ("-").

Bug: 36484125

Version: 3.0.2