Compute Service Issues
This section describes known issues and workarounds related to the compute service.
Possible VM Impact When OS Images are Deleted
Under certain operating conditions, if an OS image is deleted while boot volumes and VM instances based on that image are still present in the system, the boot devices of all VMs based on the deleted image can be affected. The symptoms include, but are not limited to, the following:
- Input/output error messages in a running VM's logs or console, and possible VM application failures
- VMs fail to boot with a "no bootable device" message on the VM console
- Re-attaching a boot volume to a stopped VM might fail
Workaround: To avoid this situation, do not delete an OS image until all the VM instances, boot volumes, boot volume backups, and clones originating from that image have been properly terminated.
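As a sketch of such a check (not from the product documentation; the OCIDs and availability domain are placeholders), you can list the boot volumes in a compartment with the OCI CLI and compare their image-id fields against the image you plan to delete:

```shell
# Dry run by default: OCI_CMD prints the command instead of executing it.
# Set OCI_CMD=oci to run against a real appliance.
OCI_CMD="${OCI_CMD:-echo oci}"

# List all boot volumes in the compartment; each entry in the JSON output
# includes an "image-id" field identifying the OS image it originated from.
$OCI_CMD bv boot-volume list \
    --compartment-id ocid1.compartment....unique_ID \
    --availability-domain AD-1 \
    --all
```

Any boot volume whose image-id matches the image to be deleted still depends on it, and must be terminated, together with its backups and clones, before the image is deleted.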
Bug: 36489907
Version: 3.0.2
E5.Flex Instance Shape Is Not Supported on the X9-2 Hardware Platform
Compute instance shapes are tied to the architecture of the underlying compute nodes. The VM.PCAStandard.E5.Flex shape was added specifically to create instances on Oracle Server X10 compute nodes. It is the only shape supported on the X10 rack configuration. On a Private Cloud Appliance X9-2, all other shapes – including flex shapes – are supported.
Workaround: Select a suitable shape for your Private Cloud Appliance compute node architecture. If the compute nodes in your appliance are Oracle Server X10, always select the VM.PCAStandard.E5.Flex shape. Systems with Oracle Server X9-2 compute nodes support all shapes except VM.PCAStandard.E5.Flex. If you need a flexible shape, select the VM.PCAStandard1.Flex shape instead.
Bug: 35549831
Version: 3.0.2
Displaced Instances Not Returned to Their Selected Fault Domains
A displaced instance is an instance that is running in a fault domain that is not the fault domain that is specified in the configuration for that instance. An instance can become displaced during compute node evacuation or failure.
When Auto Recovery is enabled, a displaced instance is automatically returned to the fault domain that is specified in its configuration when resources become available in that fault domain. Auto Recovery is enabled by default.
Workaround:
If your Private Cloud Appliance is running Software Version 3.0.2-b852928 or Software Version 3.0.2-b892153, or if you upgrade to either of these releases, disable Auto Recovery from the Service CLI:
PCA-ADMIN> disableAutoResolveDisplacedInstance
If your Private Cloud Appliance is running a release that is newer than Software Version 3.0.2-b892153, you can enable Auto Recovery.
See "Migrating Instances from a Compute Node" and "Configuring the Compute Service for High Availability" in the Hardware Administration chapter of the Oracle Private Cloud Appliance Administrator Guide for more information about these commands.
If your Private Cloud Appliance is affected by this bug and an instance is displaced, stop and restart the instance to return the instance to its selected fault domain. See "Stopping, Starting, and Resetting an Instance" in the Compute Instance Deployment chapter of the Oracle Private Cloud Appliance User Guide.
Bug: 35601960, 35703270
Version: 3.0.2
Terraform Cannot Be Used for Instance Update
Starting with the May 2023 release of the Oracle Private Cloud Appliance software, the Oracle Cloud Infrastructure Terraform provider cannot be used to update an instance on Oracle Private Cloud Appliance. Only the instance update operation is affected by this issue.
Instance update fails when done using Terraform because the is_live_migration_preferred property does not exist in the Terraform provider. Because the property is unknown, Terraform treats its value as false, which is not a supported value.
Workaround: Use the Compute Web UI or the OCI CLI to perform instance update.
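As a hedged example of the OCI CLI alternative (the OCID and display name are placeholders), a typical instance update such as a display-name change looks like this:

```shell
# Dry run by default: OCI_CMD prints the command instead of executing it.
# Set OCI_CMD=oci to run against a real appliance.
OCI_CMD="${OCI_CMD:-echo oci}"

# Update a mutable instance property (here: the display name) without Terraform.
$OCI_CMD compute instance update \
    --instance-id ocid1.instance....unique_ID \
    --display-name my-renamed-instance
```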
Bug: 35421618
Version: 3.0.2
No Consistent Device Paths for Connecting to Block Volumes
When you attach a block volume to an instance, it is not possible to specify a device path that remains consistent between instance reboots. This means that the optional --device parameter of the attach-paravirtualized-volume CLI command does not work. Because the device name might be different after the instance is rebooted, this affects tasks you perform on the volume, such as partitioning, creating and mounting file systems, and so on.
Workaround: No workaround is available.
Bug: 32561299
Version: 3.0.1
Instance Pools Cannot Be Terminated While Starting or Scaling
While the instances in a pool are being started, and while a scaling operation is in progress to increase or decrease the number of instances in the pool, it is not possible to terminate the instance pool. Individual instances, in contrast, can be terminated at any time.
Workaround: To terminate an instance pool, wait until all instances have started or scaling operations have been completed. Then you can successfully terminate the instance pool as a whole.
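The waiting step can be sketched as a small polling loop. This is an illustration, not a documented procedure: pool_state is a stand-in for a real CLI query such as oci compute-management instance-pool get with a JMESPath --query for the lifecycle state.

```shell
# pool_state is a stub standing in for the real query, for example:
#   oci compute-management instance-pool get \
#       --instance-pool-id "$POOL_ID" \
#       --query 'data."lifecycle-state"' --raw-output
pool_state() { echo "RUNNING"; }   # stubbed so the sketch is self-contained

# Poll until the pool has left its transient STARTING/SCALING state.
while [ "$(pool_state)" != "RUNNING" ]; do
    sleep 30
done
echo "pool is RUNNING; termination can proceed"
# The real termination would then be:
#   oci compute-management instance-pool terminate --instance-pool-id "$POOL_ID"
```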
Bug: 33038853
Version: 3.0.1
TypeError Returned when Attaching an Instance to an Instance Pool
When you attach an existing compute instance to an instance pool, you can include parameters with the OCI CLI command so it reports when the instance reaches the intended ("active") lifecycle state. However, a bug in the OCI CLI could lead to the following error:
# oci compute-management instance-pool-instance attach \
  --instance-id ocid1.instance....unique_ID --instance-pool-id ocid1.instancePool....unique_ID \
  --wait-for-state ACTIVE --wait-interval-seconds 120 --max-wait-seconds 1200
Action completed. Waiting until the resource has entered state: ('ACTIVE',)
Encountered error while waiting for resource to enter the specified state. Outputting last known resource state
{
  "data": {
    "availability-domain": "AD-1",
    "compartment-id": "ocid1.tenancy....unique_ID",
    "display-name": "Standard1.4",
    "fault-domain": "FAULT-DOMAIN-3",
    "id": "ocid1.instance....unique_ID",
    "instance-configuration-id": null,
    "instance-pool-id": "ocid1.instancePool....unique_ID",
    "lifecycle-state": "ATTACHING",
    "load-balancer-backends": [],
    "region": "mypca.mydomain.com",
    "shape": "VM.PCAStandard1.Flex",
    "state": "RUNNING",
    "time-created": "2023-10-28T03:22:45+00:00"
  },
  "opc-work-request-id": "ocid1.workrequest....unique_ID"
}
TypeError: get_instance_pool_instance() missing 1 required positional argument: 'instance_id'
Workaround: The command option --wait-for-state is unreliable at this time. As an alternative, you can use the command list-instance-pool-instances to check the state of the instances in the pool.
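This check can be sketched with the OCI CLI as follows (placeholder OCIDs; dry run by default). Note that in current OCI CLI versions the operation is typically exposed as the list-instances subcommand of instance-pool; verify the exact name with the CLI help for your version.

```shell
# Dry run by default; set OCI_CMD=oci to execute against a real appliance.
OCI_CMD="${OCI_CMD:-echo oci}"

# List the instances in the pool and inspect their lifecycle states
# instead of relying on --wait-for-state.
$OCI_CMD compute-management instance-pool list-instances \
    --compartment-id ocid1.compartment....unique_ID \
    --instance-pool-id ocid1.instancePool....unique_ID
```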
Bug: 35956140
Version: 3.0.2
Network Interface on Windows Does Not Accept MTU Setting from DHCP Server
When an instance is launched, it requests an IP address through DHCP. The response from the DHCP server includes the instruction to set the VNIC maximum transmission unit (MTU) to 9000 bytes. However, Windows instances boot with an MTU of 1500 bytes instead, which may adversely affect network performance.
Workaround: When the instance has been assigned its initial IP address by the DHCP server, change the interface MTU manually to the appropriate value, which is typically 9000 bytes for an instance's primary VNIC. This new value is persistent across network disconnections and DHCP lease renewals.
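For the manual change, a command along these lines can be used from an elevated prompt on the Windows instance. The interface name "Ethernet" is an assumption; list the actual interface names first with netsh interface ipv4 show subinterfaces.

```shell
netsh interface ipv4 set subinterface "Ethernet" mtu=9000 store=persistent
```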
Alternatively, if the Windows image contains cloudbase-init with the MTUPlugin, it is possible to set the interface MTU from DHCP. To enable this function, execute the following steps:
- Edit the file C:\Program Files\Cloudbase Solutions\Cloudbase-Init\conf\cloudbase-init.conf. Add these lines:
  mtu_use_dhcp_config=true
  plugins=cloudbaseinit.plugins.common.mtu.MTUPlugin
- Enter the command Restart-Service cloudbase-init.
- Confirm that the MTU setting has changed. Use this command: netsh interface ipv4 show subinterfaces.
Bug: 33541806
Version: 3.0.1
Oracle Solaris Instance in Maintenance Mode After Restoring from Backup
It is supported to create a new instance from a backup of the boot volume of an existing instance. The existing instance may be running or stopped. However, if you use a boot volume backup of an instance based on the Oracle Solaris image provided with Private Cloud Appliance, the new instance created from that backup boots in maintenance mode. The Oracle Solaris console displays this message: "Enter user name for system maintenance (control-d to bypass):"
Workaround: When the new Oracle Solaris instance created from the block volume backup has come up in maintenance mode, reboot the instance from the Compute Web UI or the CLI. After this reboot, the instance is expected to return to a normal running state and be reachable through its network interfaces.
Bug: 33581118
Version: 3.0.1
Instance Disk Activity Not Shown in Compute Node Metrics
The virtual disks attached to compute instances are presented to the guest through the hypervisor on the host compute node. Consequently, disk I/O from the instances should be detected at the level of the physical host, and reflected in the compute node disk statistics in Grafana. Unfortunately, the activity on the virtual disks is not aggregated into the compute node disk metrics.
Workaround: To monitor instance disk I/O and aggregated load on each compute node, rather than analyzing compute node metrics, use the individual VM statistics presented through Grafana.
Bug: 33551814
Version: 3.0.1
Attached Block Volumes Not Visible Inside Oracle Solaris Instance
When you attach additional block volumes to a running Oracle Solaris compute instance, they do not become visible automatically to the operating system. Even after manually rescanning the disks, the newly attached block volumes remain invisible. The issue is caused by the hypervisor not sending the correct event trigger to re-enumerate the guest LUNs.
Workaround: When you attach additional block volumes to an Oracle Solaris compute instance, reboot the instance to make sure that the new virtual disks or LUNs are detected.
Bug: 33581238
Version: 3.0.1
Host Name Not Set In Successfully Launched Windows Instance
When you work in a VCN and subnet where DNS is enabled, and you launch an instance, it is expected that its host name matches either the instance display name or the optional host name you provided. However, when you launch a Windows instance, it may occur that the host name is not set correctly according to the launch command parameters. In this situation, the instance's fully qualified domain name (FQDN) does resolve as expected, meaning there is no degraded functionality. Only the host name setting within the instance itself is incorrect; the VCN's DNS configuration works as expected.
Workaround: If your instance host name does not match the specified instance launch parameters, you can manually change the host name within the instance. There is no functional impact.
Alternatively, if the Windows image contains cloudbase-init with the SetHostNamePlugin, it is possible to set the instance host name (computer name) based on the instance FQDN (hostname-label). To enable this function, execute the following steps:
- Edit the file C:\Program Files\Cloudbase Solutions\Cloudbase-Init\conf\cloudbase-init.conf. Make sure it contains lines with these settings:
  plugins=cloudbaseinit.plugins.common.sethostname.SetHostNamePlugin
  allow_reboot=true
- Enter the command Restart-Service cloudbase-init.
- Confirm that the instance host name has changed.
Bug: 33736674
Version: 3.0.1
Oracle Solaris Instance Stuck in UEFI Interactive Shell
It has been known to occur that Oracle Solaris 11.4 compute instances, deployed from the image delivered through the management node web server, get stuck in the UEFI interactive shell and fail to boot. If the instance does not complete its boot sequence, users are not able to log in. The issue is likely caused by corruption of the original .oci image file during the import into the tenancy.
Workaround: If your Oracle Solaris 11.4 instance hangs during UEFI boot and remains unavailable, proceed as follows:
- Terminate the instance that fails to boot.
- Delete the imported Oracle Solaris 11.4 image.
- Import the Oracle Solaris 11.4 image again from the management node web server.
- Launch an instance from the newly imported image and verify that you can log in after it has fully booted.
Bug: 33736100
Version: 3.0.1
Instance Backups Can Get Stuck in an EXPORTING or IMPORTING State
In rare cases, when an instance is exporting to create a backup, or a backup is being imported, and the system experiences a failure of one of the components, the exported or imported backup gets stuck in an EXPORTING or IMPORTING state.
Workaround:
- Delete the instance backup.
- Wait 5 minutes or more to ensure that all internal services are running.
- Perform the instance export or import operation again.
See Backing Up and Restoring an Instance in Compute Instance Deployment.
Bug: 34699012
Version: 3.0.1
Instance Not Started After Fault Domain Change
When you change the fault domain of a compute instance, the system stops it, cold-migrates it to a compute node in the selected target fault domain, and restarts the instance on the new host. This process includes a number of internal operations to ensure that the instance can return to its normal running state on the target compute node. If one of these internal operations fails, the instance could remain stopped.
The risk of running into issues with fault domain changes increases with the complexity of the operations. For example, moving multiple instances concurrently to another fault domain, especially if they have shared block volumes and are migrated to different compute nodes in the target fault domain, requires many timing-sensitive configuration changes at the storage level. If the underlying iSCSI connections are not available on a migrated compute instance's new host, the hypervisor cannot bring up the instance.
Workaround: After changing the fault domain, if a compute instance remains stopped, try to start it manually. If the instance failed to come up due to a timing issue as described above, the manual start command is likely to bring the instance back to its normal running state.
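The manual start can also be issued from the OCI CLI, sketched here with a placeholder OCID and a dry-run wrapper:

```shell
# Dry run by default; set OCI_CMD=oci to execute against a real appliance.
OCI_CMD="${OCI_CMD:-echo oci}"

# Start a stopped instance that was left behind by a fault domain change.
$OCI_CMD compute instance action \
    --instance-id ocid1.instance....unique_ID \
    --action START
```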
Bug: 34550107
Version: 3.0.2
Instance Migration Stuck in MOVING State
When migrating VMs using the Service Web UI, it is possible that a migration gets stuck in the MOVING lifecycle state, preventing you from performing further migrations.
This error can occur when administrative activities, such as live migrations, are running during a patching or upgrading process, or administrative activities are started before patching or upgrading processes have fully completed.
Workaround: Contact Oracle Support to resolve this issue.
Bug: 33911138
Version: 3.0.1, 3.0.2
OCI CLI Commands Fail When Run From a Compute Instance
Compute instances based on Oracle Linux images provided since early 2023 are likely to have a firewall configuration that prevents the OCI CLI from connecting to the Private Cloud Appliance identity service. In Oracle Cloud Infrastructure the identity service must now be accessed through a public IP address (or FQDN), while Oracle Private Cloud Appliance provides access through an internal IP address. The Oracle Cloud Infrastructure images are configured by default to block all connections to this internal IP address.
The issue has been observed with these images:
- uln-pca-oracle-linux-7-9-2023-08-31-0-oci
- uln-pca-oracle-linux-8-2023-08-31-0-oci
- all Oracle Linux 9 images with a 2023 availability date
Workaround: If you intend to use the OCI CLI from a compute instance in your Private Cloud Appliance environment, verify its access to the identity service. If connections are refused, check the instance firewall configuration and enable access to the identity service.
- Test the instance connection to the identity service. For example, use telnet or netcat.
  # curl -v telnet://identity.mydomain.us.oracle.com:443
  * connect to 169.254.169.254 port 443 failed: Connection refused
  -- OR --
  # nc -vz identity.mydomain.us.oracle.com 443
  Ncat: Connection refused.
- Confirm that the firewall output chain contains a rule named BareMetalInstanceServices.
  # iptables -L OUTPUT --line-numbers
  Chain OUTPUT (policy ACCEPT)
  num  target                     prot  opt  source    destination
  1    BareMetalInstanceServices  all   --   anywhere  169.254.0.0/16
- Disable the bare metal instance rules in the firewall configuration:
  - Rename the file that defines these firewall rules (/etc/firewalld/direct.xml).
  - Restart the firewalld service.
  Detailed instructions are provided in the note with Doc ID 2983004.1.
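The rename-and-restart step can be sketched as follows. Run it as root on the instance; the .disabled suffix is an arbitrary choice, and the CONF variable is only there so the path can be overridden.

```shell
# Path of the firewalld direct-rules file that defines the
# BareMetalInstanceServices chains; overridable for testing.
CONF="${CONF:-/etc/firewalld/direct.xml}"

if [ -f "$CONF" ]; then
    mv "$CONF" "$CONF.disabled"   # firewalld ignores the renamed file
    systemctl restart firewalld   # reload the rules without the direct chains
fi
```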
Bug: 35234468
Version: 3.0.2
Cannot Install OCI CLI on Oracle Linux 9 Instance
To run the OCI CLI on an Oracle Linux 9 compute instance, the package python39-oci-cli and its dependencies are required. These are provided through the Oracle Linux 9 OCI Included Packages (ol9_oci_included) repository, but this repository cannot be accessed outside Oracle Cloud Infrastructure. An Oracle Linux 9 compute instance on Oracle Private Cloud Appliance must instead retrieve the required packages from the public Oracle Linux 9 repositories – specifically Oracle Linux 9 Development Packages (ol9_developer) and Oracle Linux 9 Application Stream Packages (ol9_appstream). These repositories are not enabled by default in the provided Oracle Linux 9 image.
Workaround: Enable the ol9_developer and ol9_appstream public yum repositories to install python39-oci-cli.
$ sudo yum --disablerepo="*" --enablerepo="ol9_developer,ol9_appstream" install python39-oci-cli -y
Dependencies resolved.
==================================================================================================================
 Package                        Architecture    Version                 Repository         Size
==================================================================================================================
Installing:
 python39-oci-cli               noarch          3.40.2-1.el9            ol9_developer      39 M
Upgrading:
 python39-oci-sdk               x86_64          2.126.2-1.el9           ol9_developer      74 M
Installing dependencies:
 python3-arrow                  noarch          1.1.0-2.el9             ol9_developer     153 k
 python3-importlib-metadata     noarch          4.12.0-2.el9            ol9_developer      75 k
 python3-jmespath               noarch          0.10.0-4.el9            ol9_developer      78 k
 python3-prompt-toolkit         noarch          3.0.38-4.el9            ol9_appstream     1.0 M
 python3-terminaltables         noarch          3.1.10-8.0.1.el9        ol9_developer      60 k
 python3-wcwidth                noarch          0.2.5-8.el9             ol9_appstream      65 k
 python3-zipp                   noarch          0.5.1-1.el9             ol9_developer      24 k

Transaction Summary
==================================================================================================================
Install  8 Packages
Upgrade  1 Package
[...]
Complete!
Bug: 35855058
Version: 3.0.2
Instance Launch Fails at 80 Percent Complete with Libvirt Error
When an instance is launched, Libvirt processes a set of instructions to bring the requested virtual machine to a running state. During this process, the connection with the agent sending the requests might be interrupted, which means the responses are not received. As a result, the instance is terminated. The Compute Service logs an internal server error similar to this example:
INFO (errors:66) Libvirt Error: code: 38, domain: 7, message: Cannot recv data: Input/output error
Workaround: This is an intermittent issue. If an instance fails to launch, retry the operation.
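The retry can be automated with a small loop. This is a sketch, not a documented procedure: try_launch is a stub standing in for the real launch command.

```shell
# try_launch is a stub for the real launch, for example:
#   oci compute instance launch --availability-domain AD-1 ... 
# It should return non-zero when the launch fails.
try_launch() { return 0; }   # stubbed so the sketch is self-contained

attempts=0
until try_launch; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 3 ]; then
        echo "launch failed after $attempts retries"
        exit 1
    fi
    sleep 60   # brief pause before retrying the intermittent failure
done
echo "launch succeeded"
```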
Bug: 36100146
Version: 3.0.2
Instance Principal Unavailable Until Next Certificate Renewal Check
An instance principal is a compute instance that is authorized to perform actions on service resources. Before allowing these operations, the Identity and Access Management Service (IAM) validates the instance principal security token: a TLS certificate that expires after 30 days.
The system checks for expired certificates every 24 hours and renews them if necessary. However, an instance principal might lose its authorization after an outage, system maintenance, or upgrade activity. In that case, it cannot obtain an updated certificate until the next renewal check, which could be up to 24 hours later.
Similarly, after upgrading from a release that does not support instance principals to a release that does support instance principals, compute instances might have to wait up to 24 hours to receive their TLS certificates.
Workaround: If you need to have this certificate installed or renewed immediately, contact Oracle for assistance.
Bug: 36165739
Version: 3.0.2
Unable to Delete Tag Due to Instance Principal Error
When cleaning up a set of resources including compute instances with defined tags, it might occur that the compute service is unable to remove the tag from an instance. As a result, the tag key definition cannot be deleted: the associated work request fails, and typically returns the error "Error message from compute: create instance principal". This indicates that an instance principal certificate was regenerated for an instance that belongs to a compartment that no longer exists.
This situation can occur if the tag key definition belongs to a different compartment than the tagged instance, and the cleanup operations are performed in this order: first the tagged instance, then the compartment it belongs to, and then the tag key definition. When attempting to delete the tag key definition, the tagged instances have already been terminated, yet the instance principal certificate can be regenerated during tag deletion. At this point the error is logged.
Workaround: Terminated instances are purged from the database after 24 hours. Once they have been purged, the tag key definition can be deleted. Alternatively, the terminated instances can be removed manually from the database. Contact Oracle for assistance.
Bug: 36348781
Version: 3.0.2
When Instance Is Shut Down from OS, Soft Stop Results in Conflict
To avoid data corruption in applications that take a long time to stop, the recommendation is to shut down a compute instance from within its operating system, before issuing the soft stop command for a graceful shutdown from the Compute Enclave. However, due to a problem with the instance action logic, when an instance has been shut down from the OS, the soft stop command returns a conflict (error 409) because the instance is no longer in a running state.
Workaround: To shut down a compute instance with the lowest possible risk of data corruption, use the OS shutdown command first. Then, issue either the force stop command from the Compute Web UI or the instance action STOP from the OCI CLI.
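The CLI step can be sketched as follows (placeholder OCID; dry run by default):

```shell
# Dry run by default; set OCI_CMD=oci to execute against a real appliance.
OCI_CMD="${OCI_CMD:-echo oci}"

# After the guest OS has shut down, issue the hypervisor-level stop.
$OCI_CMD compute instance action \
    --instance-id ocid1.instance....unique_ID \
    --action STOP
```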
Bug: 36299430
Version: 3.0.2