Compute Service Issues

This section describes known issues and workarounds related to the compute service.

Possible VM Impact When OS Images are Deleted

Under certain operating conditions, if an OS image is deleted while boot volumes and VM instances based on that image are still present in the system, the boot devices of all VMs based on the deleted image can be affected. The symptoms can include but are not limited to the following:

  • Input/output error messages in a running VM's logs or console and possible VM application failures

  • VMs fail to boot with a no bootable device message on the VMs' console

  • Re-attaching a boot volume to a stopped VM might fail

Workaround: To avoid this situation, do not delete an OS image unless all the VM instances, all the boot volumes, their backups and their clones originating from this image have been properly terminated first.
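
Before deleting an image, you can check for dependent resources with the OCI CLI. The following is a minimal sketch, not a definitive procedure: the OCIDs are placeholders, and the "image-id" field used in the JMESPath filter is an assumption based on the standard Compute API instance model.

```shell
# Placeholder OCIDs -- substitute values from your tenancy.
COMP_OCID=ocid1.compartment....unique_ID
IMAGE_OCID=ocid1.image....unique_ID

# List instances in the compartment that were launched from this image.
oci compute instance list --compartment-id $COMP_OCID \
    --query "data[?\"image-id\"=='$IMAGE_OCID'].{name:\"display-name\",id:id}" \
    --output table

# Review the boot volumes in the compartment and confirm that none of them
# (or their backups and clones) originate from the image before deleting it.
oci bv boot-volume list --compartment-id $COMP_OCID --availability-domain AD-1
```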

Bug: 36489907

Version: 3.0.2

E5.Flex Instance Shape Is Not Supported on the X9-2 Hardware Platform

Compute instance shapes are tied to the architecture of the underlying compute nodes. The VM.PCAStandard.E5.Flex shape was added specifically to create instances on Oracle Server X10 compute nodes. It is the only shape supported on the X10 rack configuration. On a Private Cloud Appliance X9-2, all other shapes – including flex shapes – are supported.

Workaround: Select a suitable shape for your Private Cloud Appliance compute node architecture. If the compute nodes in your appliance are Oracle Server X10, always select the VM.PCAStandard.E5.Flex shape. Systems with Oracle Server X9-2 compute nodes support all shapes except VM.PCAStandard.E5.Flex. If you need a flexible shape, select the VM.PCAStandard1.Flex shape instead.
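
You can confirm which shapes your appliance offers before launching an instance. This sketch uses placeholder OCIDs; the --shape-config keys follow the standard flexible-shape syntax of the OCI CLI.

```shell
COMP_OCID=ocid1.compartment....unique_ID

# Show the shapes available in the compartment; only supported shapes are listed.
oci compute shape list --compartment-id $COMP_OCID --output table

# Launch with a flexible shape: OCPU count and memory must be set explicitly.
oci compute instance launch \
    --compartment-id $COMP_OCID \
    --availability-domain AD-1 \
    --shape VM.PCAStandard1.Flex \
    --shape-config '{"ocpus": 2, "memoryInGBs": 32}' \
    --subnet-id ocid1.subnet....unique_ID \
    --image-id ocid1.image....unique_ID \
    --display-name my-flex-instance
```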

Bug: 35549831

Version: 3.0.2

Displaced Instances Not Returned to Their Selected Fault Domains

A displaced instance is an instance that is running in a fault domain that is not the fault domain that is specified in the configuration for that instance. An instance can become displaced during compute node evacuation or failure.

When Auto Recovery is enabled, a displaced instance is automatically returned to the fault domain that is specified in its configuration when resources become available in that fault domain. Auto Recovery is enabled by default.

Workaround:

If your Private Cloud Appliance is running Software Version 3.0.2-b852928 or Software Version 3.0.2-b892153, or if you upgrade to either of these releases, disable Auto Recovery from the Service CLI:

PCA-ADMIN> disableAutoResolveDisplacedInstance

If your Private Cloud Appliance is running a release that is newer than Software Version 3.0.2-b892153, you can enable Auto Recovery.

See "Migrating Instances from a Compute Node" and "Configuring the Compute Service for High Availability" in the Hardware Administration chapter of the Oracle Private Cloud Appliance Administrator Guide for more information about these commands.

If your Private Cloud Appliance is affected by this bug and an instance is displaced, stop and restart the instance to return the instance to its selected fault domain. See "Stopping, Starting, and Resetting an Instance" in the Compute Instance Deployment chapter of the Oracle Private Cloud Appliance User Guide.
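
The stop-and-restart workaround can also be performed from the OCI CLI. A minimal sketch, with a placeholder instance OCID:

```shell
INSTANCE_OCID=ocid1.instance....unique_ID

# Stop the displaced instance gracefully and wait for it to reach STOPPED.
oci compute instance action --instance-id $INSTANCE_OCID \
    --action SOFTSTOP --wait-for-state STOPPED

# Start it again; placement returns it to its configured fault domain.
oci compute instance action --instance-id $INSTANCE_OCID \
    --action START --wait-for-state RUNNING
```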

Bug: 35601960, 35703270

Version: 3.0.2

Terraform Cannot Be Used for Instance Update

Starting with the May 2023 release of the Oracle Private Cloud Appliance software, the Oracle Cloud Infrastructure Terraform provider cannot be used to update an instance on Oracle Private Cloud Appliance. Only the instance update operation is affected by this issue.

Instance update fails when performed using Terraform because the is_live_migration_preferred property is not defined in the Terraform provider. Because the property is unknown, Terraform treats its value as false when it is encountered, which is not a supported value.

Workaround: Use the Compute Web UI or the OCI CLI to perform instance update.
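
For example, an update that would fail through Terraform can be issued directly with the OCI CLI. The OCID and the new display name below are placeholders:

```shell
# Update the instance display name with the OCI CLI instead of Terraform.
oci compute instance update \
    --instance-id ocid1.instance....unique_ID \
    --display-name new-name
```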

Bug: 35421618

Version: 3.0.2

No Consistent Device Paths for Connecting to Block Volumes

When you attach a block volume to an instance, it is not possible to specify a device path that remains consistent between instance reboots. This means the optional --device parameter of the attach-paravirtualized-volume CLI command does not work. Because the device name might be different after the instance is rebooted, this affects tasks you perform on the volume, such as partitioning, creating and mounting file systems, and so on.

Workaround: No workaround is available.
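
Independently of the appliance, a common general Linux practice for unstable device names is to reference file systems by UUID rather than by device path. This sketch is generic guest OS guidance, not a fix for the --device parameter; the device name and UUID are placeholders.

```shell
# Find the UUID of the file system on the attached volume (device name may vary).
blkid /dev/sdb1

# Mount by UUID in /etc/fstab so the entry survives device renaming.
# The UUID below is a placeholder; copy the value reported by blkid.
echo 'UUID=0a1b2c3d-... /data xfs defaults,_netdev 0 2' | sudo tee -a /etc/fstab
```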

Bug: 32561299

Version: 3.0.1

Instance Pools Cannot Be Terminated While Starting or Scaling

While the instances in a pool are being started, and while a scaling operation is in progress to increase or decrease the number of instances in the pool, it is not possible to terminate the instance pool. Individual instances, in contrast, can be terminated at any time.

Workaround: To terminate an instance pool, wait until all instances have started or scaling operations have been completed. Then you can successfully terminate the instance pool as a whole.
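
You can confirm from the OCI CLI that the pool has left a starting or scaling state before terminating it. A sketch with a placeholder pool OCID:

```shell
POOL_OCID=ocid1.instancePool....unique_ID

# Check the pool lifecycle state; wait until it no longer reports a
# starting or scaling operation in progress.
oci compute-management instance-pool get --instance-pool-id $POOL_OCID \
    --query 'data."lifecycle-state"'

# Once the pool is quiescent, terminate it as a whole.
oci compute-management instance-pool terminate --instance-pool-id $POOL_OCID
```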

Bug: 33038853

Version: 3.0.1

TypeError Returned when Attaching an Instance to an Instance Pool

When you attach an existing compute instance to an instance pool, you can include parameters with the OCI CLI command so it reports when the instance reaches the intended ("active") lifecycle state. However, a bug in the OCI CLI could lead to the following error:

# oci compute-management instance-pool-instance attach \
--instance-id ocid1.instance....unique_ID --instance-pool-id ocid1.instancePool....unique_ID \
--wait-for-state ACTIVE --wait-interval-seconds 120 --max-wait-seconds 1200
Action completed. Waiting until the resource has entered state: ('ACTIVE',)
Encountered error while waiting for resource to enter the specified state. Outputting last known resource state
{
  "data": {
    "availability-domain": "AD-1",
    "compartment-id": "ocid1.tenancy....unique_ID",
    "display-name": "Standard1.4",
    "fault-domain": "FAULT-DOMAIN-3",
    "id": "ocid1.instance....unique_ID",
    "instance-configuration-id": null,
    "instance-pool-id": "ocid1.instancePool....unique_ID",
    "lifecycle-state": "ATTACHING",
    "load-balancer-backends": [],
    "region": "mypca.mydomain.com",
    "shape": "VM.PCAStandard1.Flex",
    "state": "RUNNING",
    "time-created": "2023-10-28T03:22:45+00:00"
  },
  "opc-work-request-id": "ocid1.workrequest....unique_ID"
}
TypeError: get_instance_pool_instance() missing 1 required positional argument: 'instance_id'

Workaround: The command option --wait-for-state is unreliable at this time. As an alternative, you can use the command list-instance-pool-instances to check the state of the instances in the pool.
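
The alternative check can be issued as follows; the OCIDs are placeholders. Inspect the state column of the output until the attached instance reports the expected state.

```shell
# List the instances in the pool, including their current state.
oci compute-management instance-pool list-instances \
    --instance-pool-id ocid1.instancePool....unique_ID \
    --compartment-id ocid1.tenancy....unique_ID \
    --output table
```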

Bug: 35956140

Version: 3.0.2

Network Interface on Windows Does Not Accept MTU Setting from DHCP Server

When an instance is launched, it requests an IP address through DHCP. The response from the DHCP server includes the instruction to set the VNIC maximum transmission unit (MTU) to 9000 bytes. However, Windows instances boot with an MTU of 1500 bytes instead, which may adversely affect network performance.

Workaround: When the instance has been assigned its initial IP address by the DHCP server, change the interface MTU manually to the appropriate value, which is typically 9000 bytes for an instance's primary VNIC. This new value is persistent across network disconnections and DHCP lease renewals.
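
On Windows, the manual MTU change can be made with netsh from an elevated command prompt. The interface name "Ethernet" is an assumption; substitute the name shown by the show subinterfaces command.

```shell
:: Set the MTU persistently on the primary interface (run as Administrator).
netsh interface ipv4 set subinterface "Ethernet" mtu=9000 store=persistent

:: Verify the new value.
netsh interface ipv4 show subinterfaces
```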

Alternatively, if the Windows image contains cloudbase-init with the MTUPlugin, it is possible to set the interface MTU from DHCP. To enable this function, execute the following steps:

  1. Edit the file C:\Program Files\Cloudbase Solutions\Cloudbase-Init\conf\cloudbase-init.conf. Add these lines:

    mtu_use_dhcp_config=true
    plugins=cloudbaseinit.plugins.common.mtu.MTUPlugin
  2. Enter the command Restart-Service cloudbase-init.

  3. Confirm that the MTU setting has changed. Use this command: netsh interface ipv4 show subinterfaces.

Bug: 33541806

Version: 3.0.1

Oracle Solaris Instance in Maintenance Mode After Restoring from Backup

It is supported to create a new instance from a backup of the boot volume of an existing instance. The existing instance may be running or stopped. However, if you use a boot volume backup of an instance based on the Oracle Solaris image provided with Private Cloud Appliance, the new instance created from that backup boots in maintenance mode. The Oracle Solaris console displays this message: "Enter user name for system maintenance (control-d to bypass):"

Workaround: When the new Oracle Solaris instance created from the block volume backup has come up in maintenance mode, reboot the instance from the Compute Web UI or the CLI. After this reboot, the instance is expected to return to a normal running state and be reachable through its network interfaces.

Bug: 33581118

Version: 3.0.1

Instance Disk Activity Not Shown in Compute Node Metrics

The virtual disks attached to compute instances are presented to the guest through the hypervisor on the host compute node. Consequently, disk I/O from the instances should be detected at the level of the physical host, and reflected in the compute node disk statistics in Grafana. Unfortunately, the activity on the virtual disks is not aggregated into the compute node disk metrics.

Workaround: To monitor instance disk I/O and aggregated load on each compute node, rather than analyzing compute node metrics, use the individual VM statistics presented through Grafana.

Bug: 33551814

Version: 3.0.1

Attached Block Volumes Not Visible Inside Oracle Solaris Instance

When you attach additional block volumes to a running Oracle Solaris compute instance, they do not become visible automatically to the operating system. Even after manually rescanning the disks, the newly attached block volumes remain invisible. The issue is caused by the hypervisor not sending the correct event trigger to re-enumerate the guest LUNs.

Workaround: When you attach additional block volumes to an Oracle Solaris compute instance, reboot the instance to make sure that the new virtual disks or LUNs are detected.
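
The reboot can be issued from the OCI CLI as well as the Compute Web UI; the instance OCID below is a placeholder.

```shell
# Reboot the Solaris instance gracefully so the new LUNs are enumerated.
oci compute instance action \
    --instance-id ocid1.instance....unique_ID \
    --action SOFTRESET --wait-for-state RUNNING
```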

Bug: 33581238

Version: 3.0.1

Host Name Not Set In Successfully Launched Windows Instance

When you work in a VCN and subnet where DNS is enabled and you launch an instance, its host name is expected to match either the instance display name or the optional host name you provided. However, when you launch a Windows instance, the host name might not be set according to the launch command parameters. In this situation, the instance's fully qualified domain name (FQDN) does resolve as expected, meaning there is no degraded functionality. Only the host name setting within the instance itself is incorrect; the VCN's DNS configuration works as expected.

Workaround: If your instance host name does not match the specified instance launch parameters, you can manually change the host name within the instance. There is no functional impact.

Alternatively, if the Windows image contains cloudbase-init with the SetHostNamePlugin, it is possible to set the instance host name (computer name) based on the instance FQDN (hostname-label). To enable this function, execute the following steps:

  1. Edit the file C:\Program Files\Cloudbase Solutions\Cloudbase-Init\conf\cloudbase-init.conf. Make sure it contains lines with these settings:

    plugins=cloudbaseinit.plugins.common.sethostname.SetHostNamePlugin
    allow_reboot=true
  2. Enter the command Restart-Service cloudbase-init.

  3. Confirm that the instance host name has changed.

Bug: 33736674

Version: 3.0.1

Oracle Solaris Instance Stuck in UEFI Interactive Shell

It has been known to occur that Oracle Solaris 11.4 compute instances, deployed from the image delivered through the management node web server, get stuck in the UEFI interactive shell and fail to boot. If the instance does not complete its boot sequence, users are not able to log in. The issue is likely caused by corruption of the original .oci image file during the import into the tenancy.

Workaround: If your Oracle Solaris 11.4 instance hangs during UEFI boot and remains unavailable, proceed as follows:

  1. Terminate the instance that fails to boot.

  2. Delete the imported Oracle Solaris 11.4 image.

  3. Import the Oracle Solaris 11.4 image again from the management node web server.

  4. Launch an instance from the newly imported image and verify that you can log in after it has fully booted.
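
Steps 1 through 3 can be performed with the OCI CLI. The commands below are a sketch: the OCIDs, display name, and image URI on the management node web server are placeholders.

```shell
# 1. Terminate the instance that fails to boot.
oci compute instance terminate --instance-id ocid1.instance....unique_ID

# 2. Delete the imported image suspected of corruption.
oci compute image delete --image-id ocid1.image....unique_ID

# 3. Import the image again from the management node web server.
oci compute image import from-object-uri \
    --compartment-id ocid1.compartment....unique_ID \
    --display-name solaris-11.4 \
    --uri <management_node_image_url>
```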

Bug: 33736100

Version: 3.0.1

Instance Backups Can Get Stuck in an EXPORTING or IMPORTING State

In rare cases, when an instance is being exported to create a backup, or a backup is being imported, and the system experiences a failure of one of its components, the backup gets stuck in an EXPORTING or IMPORTING state.

Workaround:

  1. Delete the instance backup.
  2. Wait 5 minutes or more to ensure that all internal services are running.
  3. Perform the instance export or import operation again.

See Backing Up and Restoring an Instance in Compute Instance Deployment.

Bug: 34699012

Version: 3.0.1

Instance Not Started After Fault Domain Change

When you change the fault domain of a compute instance, the system stops it, cold-migrates it to a compute node in the selected target fault domain, and restarts the instance on the new host. This process includes a number of internal operations to ensure that the instance can return to its normal running state on the target compute node. If one of these internal operations fails, the instance could remain stopped.

The risk of running into issues with fault domain changes increases with the complexity of the operations. For example, moving multiple instances concurrently to another fault domain, especially if they have shared block volumes and are migrated to different compute nodes in the target fault domain, requires many timing-sensitive configuration changes at the storage level. If the underlying iSCSI connections are not available on a migrated compute instance's new host, the hypervisor cannot bring up the instance.

Workaround: After changing the fault domain, if a compute instance remains stopped, try to start it manually. If the instance failed to come up due to a timing issue as described above, the manual start command is likely to bring the instance back to its normal running state.

Bug: 34550107

Version: 3.0.2

Instance Migration Stuck in MOVING State

When migrating VMs using the Service Web UI, it is possible that a migration gets stuck in the MOVING lifecycle state, leaving you unable to continue further migrations.

This error can occur when administrative activities, such as live migrations, are running during a patching or upgrading process, or administrative activities are started before patching or upgrading processes have fully completed.

Workaround: Contact Oracle Support to resolve this issue.

Bug: 33911138

Version: 3.0.1, 3.0.2

OCI CLI Commands Fail When Run From a Compute Instance

Compute instances based on Oracle Linux images provided since early 2023 are likely to have a firewall configuration that prevents the OCI CLI from connecting to the Private Cloud Appliance identity service. In Oracle Cloud Infrastructure the identity service must now be accessed through a public IP address (or FQDN), while Oracle Private Cloud Appliance provides access through an internal IP address. The Oracle Cloud Infrastructure images are configured by default to block all connections to this internal IP address.

The issue has been observed with these images:

  • uln-pca-oracle-linux-7-9-2023-08-31-0-oci

  • uln-pca-oracle-linux-8-2023-08-31-0-oci

  • all Oracle Linux 9 images with a 2023 availability date

Workaround: If you intend to use the OCI CLI from a compute instance in your Private Cloud Appliance environment, verify its access to the identity service. If connections are refused, check the instance firewall configuration and enable access to the identity service.

  1. Test the instance connection to the identity service. For example, use telnet or netcat.

    # curl -v telnet://identity.mydomain.us.oracle.com:443
    * connect to 169.254.169.254 port 443 failed: Connection refused
    
    -- OR --
    # nc -vz identity.mydomain.us.oracle.com 443
    Ncat: Connection refused.
  2. Confirm that the firewall output chain contains a rule named BareMetalInstanceServices.

    # iptables -L OUTPUT --line-numbers
    Chain OUTPUT (policy ACCEPT)
    num  target                     prot   opt   source           destination         
    1    BareMetalInstanceServices  all    --    anywhere         169.254.0.0/16      
  3. Disable the bare metal instance rules in the firewall configuration.

    1. Rename the file that defines these firewall rules (/etc/firewalld/direct.xml).

    2. Restart the firewalld service.

    Detailed instructions are provided in the note with Doc ID 2983004.1.
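
Step 3 can be sketched as follows. The backup file name is arbitrary, and the full procedure in the referenced note takes precedence over this outline.

```shell
# Rename the direct rules file so firewalld no longer loads it.
sudo mv /etc/firewalld/direct.xml /etc/firewalld/direct.xml.disabled

# Restart firewalld to apply the change.
sudo systemctl restart firewalld

# Verify that the BareMetalInstanceServices rule is gone from the OUTPUT chain.
sudo iptables -L OUTPUT --line-numbers
```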

Bug: 35234468

Version: 3.0.2

Cannot Install OCI CLI on Oracle Linux 9 Instance

To run the OCI CLI on an Oracle Linux 9 compute instance, the package python39-oci-cli and its dependencies are required. These are provided through the Oracle Linux 9 OCI Included Packages (ol9_oci_included) repository, but this repository cannot be accessed outside Oracle Cloud Infrastructure.

An Oracle Linux 9 compute instance on Oracle Private Cloud Appliance must instead retrieve the required packages from the public Oracle Linux 9 repositories – specifically: Oracle Linux 9 Development Packages (ol9_developer) and Oracle Linux 9 Application Stream Packages (ol9_appstream). These repositories are not enabled by default in the provided Oracle Linux 9 image.

Workaround: Enable the ol9_developer and ol9_appstream public yum repositories to install python39-oci-cli.

$ sudo yum --disablerepo="*" --enablerepo="ol9_developer,ol9_appstream" install python39-oci-cli -y
Dependencies resolved.
==================================================================================================================
 Package                               Architecture       Version                    Repository              Size
==================================================================================================================
Installing:
 python39-oci-cli                      noarch             3.40.2-1.el9               ol9_developer           39 M
Upgrading:
 python39-oci-sdk                      x86_64             2.126.2-1.el9              ol9_developer           74 M
Installing dependencies:
 python3-arrow                         noarch             1.1.0-2.el9                ol9_developer          153 k
 python3-importlib-metadata            noarch             4.12.0-2.el9               ol9_developer           75 k
 python3-jmespath                      noarch             0.10.0-4.el9               ol9_developer           78 k
 python3-prompt-toolkit                noarch             3.0.38-4.el9               ol9_appstream          1.0 M
 python3-terminaltables                noarch             3.1.10-8.0.1.el9           ol9_developer           60 k
 python3-wcwidth                       noarch             0.2.5-8.el9                ol9_appstream           65 k
 python3-zipp                          noarch             0.5.1-1.el9                ol9_developer           24 k

Transaction Summary
=================================================================================================================
Install  8 Packages
Upgrade  1 Package
[...]
Complete!

Bug: 35855058

Version: 3.0.2

Instance Launch Fails at 80 Percent Complete with Libvirt Error

When an instance is launched, Libvirt processes a set of instructions to bring the requested virtual machine to a running state. During this process, the connection with the agent sending the requests might be interrupted, which means the responses are not received. As a result, the instance is terminated. The Compute Service logs an internal server error similar to this example:

INFO (errors:66) Libvirt Error: code: 38, domain: 7, message: Cannot recv data: Input/output error

Workaround: This is an intermittent issue. If an instance fails to launch, retry the operation.

Bug: 36100146

Version: 3.0.2

Instance Principal Unavailable Until Next Certificate Renewal Check

An instance principal is a compute instance that is authorized to perform actions on service resources. Before allowing these operations, the Identity and Access Management Service (IAM) validates the instance principal security token: a TLS certificate that expires after 30 days.

The system checks for expired certificates every 24 hours and renews them if necessary. However, an instance principal might lose its authorization after an outage, system maintenance, or upgrade activity. In that case, it cannot obtain an updated certificate until the next renewal check, which could be up to 24 hours later.

Similarly, after upgrading from a release that does not support instance principals to a release that does support instance principals, compute instances might have to wait up to 24 hours to receive their TLS certificates.

Workaround: If you need to have this certificate installed or renewed immediately, contact Oracle for assistance.

Bug: 36165739

Version: 3.0.2

Unable to Delete Tag Due to Instance Principal Error

When cleaning up a set of resources including compute instances with defined tags, it might occur that the compute service is unable to remove the tag from an instance. As a result, the tag key definition cannot be deleted: the associated work request fails, and typically returns the error "Error message from compute: create instance principal". It indicates that an instance principal certificate was regenerated for an instance that belongs to a compartment that no longer exists.

This situation can occur if the tag key definition belongs to a different compartment than the tagged instance, and the cleanup operations are performed in this order: first the tagged instance, then the compartment it belongs to, and then the tag key definition. When attempting to delete the tag key definition, the tagged instances have already been terminated, yet the instance principal certificate can be regenerated during tag deletion. At this point the error is logged.

Workaround: Terminated instances are purged from the database after 24 hours. Once they have been purged, the tag key definition can be deleted. Alternatively, the terminated instances can be removed manually from the database. Contact Oracle for assistance.

Bug: 36348781

Version: 3.0.2

When Instance Is Shut Down from OS, Soft Stop Results in Conflict

To avoid data corruption in applications that take a long time to stop, the recommendation is to shut down a compute instance from within its operating system, before issuing the soft stop command for a graceful shutdown from the Compute Enclave. However, due to a problem with the instance action logic, when an instance has been shut down from the OS, the soft stop command returns a conflict (error 409) because the instance is no longer in a running state.

Workaround: To shut down a compute instance with lowest possible risk of data corruption, use the OS shutdown command first. Then, issue either the force stop command from the Compute Web UI or the instance action STOP from the OCI CLI.
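
On a Linux guest, the recommended sequence can be sketched as follows; the instance OCID is a placeholder.

```shell
# Inside the guest OS: shut down cleanly first.
sudo shutdown -h now

# Then, from a machine with the OCI CLI configured, issue the stop action.
# Use STOP rather than SOFTSTOP, because the instance is no longer running.
oci compute instance action \
    --instance-id ocid1.instance....unique_ID \
    --action STOP --wait-for-state STOPPED
```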

Bug: 36299430

Version: 3.0.2