Compute Service Issues

This section describes known issues and workarounds related to the compute service.

E5.Flex Instance Shape Is Not Supported on the X9-2 Hardware Platform

Compute instance shapes are tied to the architecture of the underlying compute nodes. The VM.PCAStandard.E5.Flex shape was added specifically to create instances on Oracle Server X10 compute nodes, and it is the only shape supported on the X10 rack configuration. On a Private Cloud Appliance X9-2, all other shapes, including other flex shapes, are supported.

Workaround: Select a suitable shape for your Private Cloud Appliance compute node architecture. If the compute nodes in your appliance are Oracle Server X10, always select the VM.PCAStandard.E5.Flex shape. Systems with Oracle Server X9-2 compute nodes support all shapes except VM.PCAStandard.E5.Flex. If you need a flexible shape, select the VM.PCAStandard1.Flex shape instead.
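
For example, to launch an instance with the flexible shape from the OCI CLI (a sketch only; the availability domain, OCIDs, and shape configuration values are placeholders):

oci compute instance launch \
  --availability-domain AD-1 \
  --compartment-id ocid1.compartment....unique_ID \
  --subnet-id ocid1.subnet....unique_ID \
  --image-id ocid1.image....unique_ID \
  --shape VM.PCAStandard1.Flex \
  --shape-config '{"ocpus": 2, "memoryInGBs": 32}' \
  --display-name my-instance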

Bug: 35549831

Version: 3.0.2

Displaced Instances Not Returned to Their Selected Fault Domains

A displaced instance is an instance that is running in a fault domain other than the one specified in its configuration. An instance can become displaced during compute node evacuation or failure.

When Auto Recovery is enabled, a displaced instance is automatically returned to the fault domain that is specified in its configuration when resources become available in that fault domain. Auto Recovery is enabled by default.

Workaround:

If your Private Cloud Appliance is running Software Version 3.0.2-b852928 or Software Version 3.0.2-b892153, or if you upgrade to either of these releases, disable Auto Recovery from the Service CLI:

PCA-ADMIN> disableAutoResolveDisplacedInstance

If your Private Cloud Appliance is running a release that is newer than Software Version 3.0.2-b892153, you can enable Auto Recovery.

See "Migrating Instances from a Compute Node" and "Configuring the Compute Service for High Availability" in the Hardware Administration chapter of the Oracle Private Cloud Appliance Administrator Guide for more information about these commands.

If your Private Cloud Appliance is affected by this bug and an instance is displaced, stop and restart the instance to return the instance to its selected fault domain. See "Stopping, Starting, and Resetting an Instance" in the Compute Instance Deployment chapter of the Oracle Private Cloud Appliance User Guide.
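
For example, from the OCI CLI (the instance OCID is a placeholder; SOFTSTOP can be used instead of STOP for a graceful shutdown):

oci compute instance action --instance-id ocid1.instance....unique_ID --action STOP
oci compute instance action --instance-id ocid1.instance....unique_ID --action START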

Bug: 35601960, 35703270

Version: 3.0.2

Terraform Cannot Be Used for Instance Update

Starting with the May 2023 release of the Oracle Private Cloud Appliance software, the Oracle Cloud Infrastructure Terraform provider cannot be used to update an instance on Oracle Private Cloud Appliance. Only the instance update operation is affected by this issue.

Instance update fails when performed with Terraform because the Terraform provider does not recognize the is_live_migration_preferred property. When it encounters this unknown property, Terraform treats its value as false, which is not a supported value.

Workaround: Use the Compute Web UI or the OCI CLI to perform instance update.
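
For example, a display name change performed with the OCI CLI instead of Terraform (a sketch; the OCID and name are placeholders):

oci compute instance update --instance-id ocid1.instance....unique_ID --display-name new-name

The same command also accepts the --shape and --shape-config options for shape updates.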

Bug: 35421618

Version: 3.0.2

No Consistent Device Paths for Connecting to Block Volumes

When you attach a block volume to an instance, it is not possible to specify a device path that remains consistent between instance reboots. This means that the optional --device parameter of the attach-paravirtualized-volume CLI command does not work. Because the device name might be different after the instance is rebooted, this affects tasks you perform on the volume, such as partitioning, creating and mounting file systems, and so on.

Workaround: No workaround is available.

Bug: 32561299

Version: 3.0.1

Instance Pools Cannot Be Terminated While Starting or Scaling

While the instances in a pool are being started, and while a scaling operation is in progress to increase or decrease the number of instances in the pool, it is not possible to terminate the instance pool. Individual instances, in contrast, can be terminated at any time.

Workaround: To terminate an instance pool, wait until all instances have started or scaling operations have been completed. Then you can successfully terminate the instance pool as a whole.
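
For example, from the OCI CLI you can check the pool's lifecycle state and terminate the pool once it has returned to RUNNING (a sketch; the OCID is a placeholder):

oci compute-management instance-pool get --instance-pool-id ocid1.instancePool....unique_ID | grep "lifecycle-state"
oci compute-management instance-pool terminate --instance-pool-id ocid1.instancePool....unique_ID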

Bug: 33038853

Version: 3.0.1

TypeError Returned when Attaching an Instance to an Instance Pool

When you attach an existing compute instance to an instance pool, you can include parameters with the OCI CLI command so that it reports when the instance reaches the intended (ACTIVE) lifecycle state. However, a bug in the OCI CLI could lead to the following error:

# oci compute-management instance-pool-instance attach \
--instance-id ocid1.instance....unique_ID --instance-pool-id ocid1.instancePool....unique_ID \
--wait-for-state ACTIVE --wait-interval-seconds 120 --max-wait-seconds 1200
Action completed. Waiting until the resource has entered state: ('ACTIVE',)
Encountered error while waiting for resource to enter the specified state. Outputting last known resource state
{
  "data": {
    "availability-domain": "AD-1",
    "compartment-id": "ocid1.tenancy....unique_ID",
    "display-name": "Standard1.4",
    "fault-domain": "FAULT-DOMAIN-3",
    "id": "ocid1.instance....unique_ID",
    "instance-configuration-id": null,
    "instance-pool-id": "ocid1.instancePool....unique_ID",
    "lifecycle-state": "ATTACHING",
    "load-balancer-backends": [],
    "region": "mypca.example.com",
    "shape": "VM.PCAStandard1.Flex",
    "state": "RUNNING",
    "time-created": "2023-10-28T03:22:45+00:00"
  },
  "opc-work-request-id": "ocid1.workrequest....unique_ID"
}
TypeError: get_instance_pool_instance() missing 1 required positional argument: 'instance_id'

Workaround: The command option --wait-for-state is unreliable at this time. As an alternative, you can use the command list-instance-pool-instances to check the state of the instances in the pool.
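
For example, a sketch assuming the listing operation is exposed in your OCI CLI version as instance-pool list-instances (OCIDs are placeholders; the exact subcommand name can differ between CLI versions):

oci compute-management instance-pool list-instances \
  --compartment-id ocid1.tenancy....unique_ID \
  --instance-pool-id ocid1.instancePool....unique_ID | grep "lifecycle-state"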

Bug: 35956140

Version: 3.0.2

Network Interface on Windows Does Not Accept MTU Setting from DHCP Server

When an instance is launched, it requests an IP address through DHCP. The response from the DHCP server includes the instruction to set the VNIC maximum transmission unit (MTU) to 9000 bytes. However, Windows instances boot with an MTU of 1500 bytes instead, which may adversely affect network performance.

Workaround: When the instance has been assigned its initial IP address by the DHCP server, change the interface MTU manually to the appropriate value, which is typically 9000 bytes for an instance's primary VNIC. This new value is persistent across network disconnections and DHCP lease renewals.
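
For example, from an elevated command prompt inside the instance (the interface name "Ethernet" is an assumption; use the name reported by the show command):

netsh interface ipv4 set subinterface "Ethernet" mtu=9000 store=persistent
netsh interface ipv4 show subinterfaces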

Alternatively, if the Windows image contains cloudbase-init with the MTUPlugin, it is possible to set the interface MTU from DHCP. To enable this function, execute the following steps:

  1. Edit the file C:\Program Files\Cloudbase Solutions\Cloudbase-Init\conf\cloudbase-init.conf. Add these lines:

    mtu_use_dhcp_config=true
    plugins=cloudbaseinit.plugins.common.mtu.MTUPlugin
  2. Enter the command Restart-Service cloudbase-init.

  3. Confirm that the MTU setting has changed. Use this command: netsh interface ipv4 show subinterfaces.

Bug: 33541806

Version: 3.0.1

Oracle Solaris Instance in Maintenance Mode After Restoring from Backup

Creating a new instance from a backup of the boot volume of an existing instance is supported; the existing instance can be running or stopped. However, if you use a boot volume backup of an instance based on the Oracle Solaris image provided with Private Cloud Appliance, the new instance created from that backup boots in maintenance mode. The Oracle Solaris console displays this message: "Enter user name for system maintenance (control-d to bypass):"

Workaround: When the new Oracle Solaris instance created from the block volume backup has come up in maintenance mode, reboot the instance from the Compute Web UI or the CLI. After this reboot, the instance is expected to return to a normal running state and be reachable through its network interfaces.
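
For example, a graceful reboot from the OCI CLI (the instance OCID is a placeholder):

oci compute instance action --instance-id ocid1.instance....unique_ID --action SOFTRESET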

Bug: 33581118

Version: 3.0.1

Oracle Solaris Instance Stuck in UEFI Interactive Shell

Oracle Solaris 11.4 compute instances deployed from the image delivered through the management node web server have been known to get stuck in the UEFI interactive shell and fail to boot. If the instance does not complete its boot sequence, users are not able to log in. The issue is likely caused by corruption of the original .oci image file during the import into the tenancy.

Workaround: If your Oracle Solaris 11.4 instance hangs during UEFI boot and remains unavailable, proceed as follows (a command-line sketch follows the steps):

  1. Terminate the instance that fails to boot.

  2. Delete the imported Oracle Solaris 11.4 image.

  3. Import the Oracle Solaris 11.4 image again from the management node web server.

  4. Launch an instance from the newly imported image and verify that you can log in after it has fully booted.
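
A command-line sketch of these steps, assuming the OCI CLI is used and the image is re-imported from a URL on the management node web server; the OCIDs, URI, and display name are placeholders, and the re-import can also be done from the Compute Web UI:

oci compute instance terminate --instance-id ocid1.instance....unique_ID
oci compute image delete --image-id ocid1.image....unique_ID
oci compute image import from-object-uri \
  --uri http://<management_node_web_server>/<path>/Oracle-Solaris-11.4.oci \
  --compartment-id ocid1.compartment....unique_ID \
  --display-name "Oracle-Solaris-11.4"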

Bug: 33736100

Version: 3.0.1

Slow Data Transfer and Buffering on Oracle Solaris Instance with LRO/LSO

Oracle Solaris uses Large Send Offload (LSO) and Large Receive Offload (LRO) to optimize network performance and CPU load by offloading TCP segmentation to the network interface controller (NIC). This feature is enabled by default in the Oracle Solaris compute image provided with Oracle Private Cloud Appliance. However, LSO/LRO in a compute instance can cause very large packet sizes and retransmissions, leading to low data transfer speeds over the Oracle Solaris (virtual) network interfaces.

Workaround: For performance reasons, we recommend disabling LSO/LRO in Oracle Solaris compute instances hosted on Oracle Private Cloud Appliance. The following commands disable LSO/LRO for the net0 interface. On instances with multiple VNICs, LRO must be disabled for each interface.

dladm set-linkprop -p lro=off net0
ipadm set-prop -p _lso_outbound=0 ip

For new Oracle Solaris instance deployments, add the commands to a custom cloud initialization script and point to it when launching an instance.
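
A minimal sketch of such a script, assuming cloud-init in the Oracle Solaris image runs user-data shell scripts at first boot and that net0 is the primary interface:

#!/bin/sh
# Disable LRO on the primary data link and LSO for outbound TCP
dladm set-linkprop -p lro=off net0
ipadm set-prop -p _lso_outbound=0 ip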

Bug: 36858282, 36405591

Version: 3.0.2

Instance Disk Activity Not Shown in Compute Node Metrics

The virtual disks attached to compute instances are presented to the guest through the hypervisor on the host compute node. Consequently, disk I/O from the instances should be detected at the level of the physical host, and reflected in the compute node disk statistics in Grafana. Unfortunately, the activity on the virtual disks is not aggregated into the compute node disk metrics.

Workaround: To monitor instance disk I/O and the aggregated load on each compute node, use the individual VM statistics presented through Grafana rather than the compute node metrics.

Bug: 33551814

Version: 3.0.1

Attached Block Volumes Not Visible Inside Oracle Solaris Instance

When you attach additional block volumes to a running Oracle Solaris compute instance, they do not become visible automatically to the operating system. Even after manually rescanning the disks, the newly attached block volumes remain invisible. The issue is caused by the hypervisor not sending the correct event trigger to re-enumerate the guest LUNs.

Workaround: When you attach additional block volumes to an Oracle Solaris compute instance, reboot the instance to make sure that the new virtual disks or LUNs are detected.
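
For example, using the OCI CLI to attach the volume and then perform a graceful reboot (the OCIDs are placeholders):

oci compute volume-attachment attach-paravirtualized-volume \
  --instance-id ocid1.instance....unique_ID \
  --volume-id ocid1.volume....unique_ID
oci compute instance action --instance-id ocid1.instance....unique_ID --action SOFTRESET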

Bug: 33581238

Version: 3.0.1

Host Name Not Set In Successfully Launched Windows Instance

When you launch an instance in a VCN and subnet where DNS is enabled, its host name is expected to match either the instance display name or the optional host name you provided. However, when you launch a Windows instance, the host name might not be set according to the launch command parameters. In this situation there is no degraded functionality: the instance's fully qualified domain name (FQDN) still resolves as expected, and the VCN's DNS configuration works correctly. Only the host name setting within the instance itself is incorrect.

Workaround: If your instance host name does not match the specified instance launch parameters, you can manually change the host name within the instance. There is no functional impact.

Alternatively, if the Windows image contains cloudbase-init with the SetHostNamePlugin, it is possible to set the instance host name (computer name) based on the instance FQDN (hostname-label). To enable this function, execute the following steps:

  1. Edit the file C:\Program Files\Cloudbase Solutions\Cloudbase-Init\conf\cloudbase-init.conf. Make sure it contains lines with these settings:

    plugins=cloudbaseinit.plugins.common.sethostname.SetHostNamePlugin
    allow_reboot=true
  2. Enter the command Restart-Service cloudbase-init.

  3. Confirm that the instance host name has changed.

Bug: 33736674

Version: 3.0.1

Instance Backups Can Get Stuck in an EXPORTING or IMPORTING State

In rare cases, when an instance is being exported to create a backup, or a backup is being imported, and the system experiences a failure of one of its components, the backup gets stuck in an EXPORTING or IMPORTING state.

Workaround:

  1. Delete the instance backup.
  2. Wait 5 minutes or more to ensure that all internal services are running.
  3. Perform the instance export or import operation again.

See "Backing Up and Restoring an Instance" in the Compute Instance Deployment chapter of the Oracle Private Cloud Appliance User Guide.

Bug: 34699012

Version: 3.0.1

Instance Not Started After Fault Domain Change

When you change the fault domain of a compute instance, the system stops it, cold-migrates it to a compute node in the selected target fault domain, and restarts the instance on the new host. This process includes a number of internal operations to ensure that the instance can return to its normal running state on the target compute node. If one of these internal operations fails, the instance could remain stopped.

The risk of running into issues with fault domain changes increases with the complexity of the operations. For example, moving multiple instances concurrently to another fault domain, especially if they have shared block volumes and are migrated to different compute nodes in the target fault domain, requires many timing-sensitive configuration changes at the storage level. If the underlying iSCSI connections are not available on a migrated compute instance's new host, the hypervisor cannot bring up the instance.

Workaround: After changing the fault domain, if a compute instance remains stopped, try to start it manually. If the instance failed to come up due to a timing issue as described above, the manual start command is likely to bring the instance back to its normal running state.

Bug: 34550107

Version: 3.0.2

Instance Migration Stuck in MOVING State

When migrating VMs using the Service Web UI, a migration can get stuck in the MOVING lifecycle state, leaving you unable to continue further migrations.

This error can occur when administrative activities, such as live migrations, are running during a patching or upgrade process, or are started before such a process has fully completed.

Workaround: Contact Oracle Support to resolve this issue.

Bug: 33911138

Version: 3.0.1, 3.0.2

OCI CLI Commands Fail When Run From a Compute Instance

Compute instances based on Oracle Linux images provided since early 2023 are likely to have a firewall configuration that prevents the OCI CLI from connecting to the Private Cloud Appliance identity service. In Oracle Cloud Infrastructure the identity service must now be accessed through a public IP address (or FQDN), while Oracle Private Cloud Appliance provides access through an internal IP address. The Oracle Cloud Infrastructure images are configured by default to block all connections to this internal IP address.

The issue has been observed with these images:

  • uln-pca-oracle-linux-7-9-2023-08-31-0-oci

  • uln-pca-oracle-linux-8-2023-08-31-0-oci

  • all Oracle Linux 9 images with a 2023 availability date

Workaround: If you intend to use the OCI CLI from a compute instance in your Private Cloud Appliance environment, verify its access to the identity service. If connections are refused, check the instance firewall configuration and enable access to the identity service.

  1. Test the instance connection to the identity service. For example, use telnet or netcat.

    # curl -v telnet://identity.mydomain:443
    * connect to 169.254.169.254 port 443 failed: Connection refused
    
    -- OR --
    # nc -vz identity.mydomain 443
    Ncat: Connection refused.
  2. Confirm that the firewall output chain contains a rule named BareMetalInstanceServices.

    # iptables -L OUTPUT --line-numbers
    Chain OUTPUT (policy ACCEPT)
    num  target                     prot   opt   source           destination         
    1    BareMetalInstanceServices  all    --    anywhere         169.254.0.0/16      
  3. Disable the bare metal instance rules in the firewall configuration.

    1. Rename the file that defines these firewall rules (/etc/firewalld/direct.xml).

    2. Restart the firewalld service.

    Detailed instructions are provided in the note with Doc ID 2983004.1. A minimal command sketch is shown after this list.
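
The following sketch assumes the default firewalld configuration path shown in step 3; the renamed file name is arbitrary, and the final command repeats the connection test from step 1:

# mv /etc/firewalld/direct.xml /etc/firewalld/direct.xml.disabled
# systemctl restart firewalld
# nc -vz identity.mydomain 443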

Bug: 35234468

Version: 3.0.2

Cannot Install OCI CLI on Oracle Linux 9 Instance

To run the OCI CLI on an Oracle Linux 9 compute instance, the package python39-oci-cli and its dependencies are required. These are provided through the Oracle Linux 9 OCI Included Packages (ol9_oci_included) repository, but this repository cannot be accessed outside Oracle Cloud Infrastructure.

An Oracle Linux 9 compute instance on Oracle Private Cloud Appliance must instead retrieve the required packages from the public Oracle Linux 9 repositories, specifically Oracle Linux 9 Development Packages (ol9_developer) and Oracle Linux 9 Application Stream Packages (ol9_appstream). These repositories are not enabled by default in the provided Oracle Linux 9 image.

Workaround: Enable the ol9_developer and ol9_appstream public yum repositories to install python39-oci-cli.

$ sudo yum --disablerepo="*" --enablerepo="ol9_developer ol9_appstream" install python39-oci-cli -y
Dependencies resolved.
==================================================================================================================
 Package                               Architecture       Version                    Repository              Size
==================================================================================================================
Installing:
 python39-oci-cli                      noarch             3.40.2-1.el9               ol9_developer           39 M
Upgrading:
 python39-oci-sdk                      x86_64             2.126.2-1.el9              ol9_developer           74 M
Installing dependencies:
 python3-arrow                         noarch             1.1.0-2.el9                ol9_developer          153 k
 python3-importlib-metadata            noarch             4.12.0-2.el9               ol9_developer           75 k
 python3-jmespath                      noarch             0.10.0-4.el9               ol9_developer           78 k
 python3-prompt-toolkit                noarch             3.0.38-4.el9               ol9_appstream          1.0 M
 python3-terminaltables                noarch             3.1.10-8.0.1.el9           ol9_developer           60 k
 python3-wcwidth                       noarch             0.2.5-8.el9                ol9_appstream           65 k
 python3-zipp                          noarch             0.5.1-1.el9                ol9_developer           24 k

Transaction Summary
=================================================================================================================
Install  8 Packages
Upgrade  1 Package
[...]
Complete!

Bug: 35855058

Version: 3.0.2

Changing Instance Compartment in OCI CLI Returns Key Error

When you use the OCI CLI to change the compartment where a compute instance resides, the command returns a work request key error.

# oci compute instance change-compartment --instance-id <instance id> --compartment-id <new compartment id> --debug
[...]
DEBUG:oci.base_client.140018383788856: 2024-06-03 18:41:52.605103: Response status: 200
DEBUG:oci.base_client.140018383788856: 2024-06-03 18:41:52.605301: Response returned
DEBUG:oci.base_client.140018383788856:time elapsed for request: 1.5186387430876493
Traceback (most recent call last):
[...]
  File "/root/lib/oracle-cli/lib64/python3.6/site-packages/services/core/src/oci_cli_compute/compute_cli_extended.py", line 767, in change_instance_compartment
    work_request_client = cli_util.build_client('core', 'work_request', ctx)
  File "/root/lib/oracle-cli/lib64/python3.6/site-packages/oci_cli/cli_util.py", line 461, in build_client
    client_class = CLIENT_MAP[spec_name][service_name]
KeyError: 'work_request'

This is a known issue in the OCI CLI. The change compartment function attempts to create a work request in an incorrect way.

Workaround: We advise against changing the compartment of a compute instance using the OCI CLI, because it is unclear what effect this issue has on the code execution in Private Cloud Appliance. Testing does show that the compartment change is applied correctly despite the key error.

Bug: 36691465

Version: 3.0.2

Instance Principal Unavailable Until Next Certificate Renewal Check

An instance principal is a compute instance that is authorized to perform actions on service resources. Before allowing these operations, the Identity and Access Management Service (IAM) validates the instance principal security token: a TLS certificate that expires after 30 days.

The system checks for expired certificates every 24 hours and renews them if necessary. However, an instance principal might lose its authorization after an outage, system maintenance, or upgrade activity. In that case, it cannot obtain an updated certificate until the next renewal check, which could be up to 24 hours later.

Similarly, after upgrading from a release that does not support instance principals to a release that does support instance principals, compute instances might have to wait up to 24 hours to receive their TLS certificates.

Workaround: If you need to have this certificate installed or renewed immediately, contact Oracle for assistance.

Bug: 36165739

Version: 3.0.2

List of Platform Images Includes OKE Images

Private Cloud Appliance provides a set of standard Oracle Linux and Oracle Solaris images for convenient compute instance deployment. When the appliance is upgraded or patched, the latest available images are added. The same mechanism is used to add the images required to deploy clusters with the Oracle Private Cloud Appliance Kubernetes Engine (OKE) service. When users launch an instance from the Compute Web UI, the appropriate images are listed and the OKE-specific images are filtered out. However, the OCI CLI displays all images by default, including those for the OKE service, which should not be used for regular compute instances.

oci compute image list --compartment-id "ocid1.tenancy....unique_ID" | grep "display-name"
      "display-name": "uln-pca-Oracle-Linux-7.9-2024.04.19_0.oci",
      "display-name": "uln-pca-Oracle-Linux-7.9-2024.05.29_0.oci",
      "display-name": "uln-pca-Oracle-Linux-8-2024.04.19_0.oci",
      "display-name": "uln-pca-Oracle-Linux-8-2024.05.29_0.oci",
      "display-name": "uln-pca-Oracle-Linux-9-2024.04.22_0.oci",
      "display-name": "uln-pca-Oracle-Linux-9-2024.05.29_0.oci",
      "display-name": "uln-pca-Oracle-Linux8-OKE-1.26.6-20240210.oci",
      "display-name": "uln-pca-Oracle-Linux8-OKE-1.26.6-20240611.oci",
      "display-name": "uln-pca-Oracle-Linux8-OKE-1.27.7-20240422.oci",
      "display-name": "uln-pca-Oracle-Linux8-OKE-1.27.7-20240602.oci",
      "display-name": "uln-pca-Oracle-Linux8-OKE-1.28.3-20240210.oci",
      "display-name": "uln-pca-Oracle-Linux8-OKE-1.28.3-20240602.oci",
      "display-name": "uln-pca-Oracle-Solaris-11-2024.05.07_0.oci",

Workaround: When launching a regular compute instance or instance pool, use either the Oracle-provided images or your own custom images. Do not use the OKE-specific images.
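
For example, to filter the OKE images out of the listing shown above (a sketch; the tenancy OCID is a placeholder):

oci compute image list --compartment-id "ocid1.tenancy....unique_ID" | grep "display-name" | grep -v "OKE"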

Bug: 36112983

Version: 3.0.2

Import of Custom Images in VMDK Format Not Supported

Private Cloud Appliance does not currently support the import of custom images in the VMDK format.

Workaround: Use the QCOW2 format when importing custom images.
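
If the source image is in another format, for example VMDK, it can typically be converted with qemu-img before import (a sketch; the file names are placeholders and qemu-img must be available on the conversion host):

qemu-img convert -f vmdk -O qcow2 custom-image.vmdk custom-image.qcow2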

Bug: 37049215

Version: 3.0.2

Increasing GPU Usage of Running Instance When Capacity Is Exceeded Results in Stopped Instance

The shape of a compute instance determines the number of GPUs it uses, so you can increase GPU usage by updating the instance to a shape with a higher GPU count. However, if the shape update puts the system over its physical GPU capacity, the update is accepted without error, and the instance in question is stopped.

Note:

When you try to launch a new GPU instance on a system with insufficient GPU capacity available, an appropriate error is returned and the instance launch fails as expected.

Workaround: When updating the shape of a GPU instance to increase GPU usage, confirm that the instance is still running after the update. If the instance is stopped, change its shape back to the original, or check if GPUs can be freed up for use by your instance. An administrator can verify GPU availability using the CLI command getFaultDomainInfo.

PCA-ADMIN> getFaultDomainInfo
Data:
  id               totalsCNs   totalMemory   freeMemory   totalvCPUs   freevCPUs   totalGPUs   freeGPUs   Notes
  --               ---------   -----------   ----------   ----------   ---------   ---------   --------   -----
  UNASSIGNED       6           0.0           0.0          0            0           0           0          N.A.
  UNASSIGNED-GPU   0           0.0           0.0          0            0           0           0          N.A.
  FD1              2           3208.0        3144.0       488          444         0           0          N.A.
  FD1-GPU          1           984.0         24.0         216          2           4           0          N.A.
  FD2              2           3208.0        3160.0       488          446         0           0          N.A.
  FD2-GPU          1           984.0         504.0        216          108         4           2          N.A.
  FD3              2           3208.0        3032.0       488          410         0           0          N.A.
  FD3-GPU          1           984.0         24.0         216          0           4           0          N.A.

Bug: 37278974

Version: 3.0.2

Instances Launched from an Instance Pool Created with an Instance Configuration Including an SR-IOV vNIC Are Inaccessible

If you use an instance configuration that includes an SR-IOV vNIC to create an instance pool and then launch an instance from that pool, you will be unable to access that instance. At this time, creating an instance pool with a configuration that includes an SR-IOV vNIC is not supported, because that vNIC will not attach.

Workaround: Create the instance without the SR-IOV vNIC attached, launch the instance, and then attach the SR-IOV vNIC.

Bug: 37192651

Version: 3.0.2

Compute Instance Migration Fails Because Device Cannot Be Attached

When migrating a compute instance to another compute node, the operation might fail because the path to a device is not found and the device cannot be attached. The issue applies to regular live migration as well as migration of displaced instances. The migration job output looks similar to this example:

PCA-ADMIN> show job id=71782dc1-02d8-412e-97fa-703508c606e9
Data: 
  AssociatedObj = id:a7c7bbfe-21f5-4f9d-9158-87779c7eb7b0  type:ComputeNode  name:pcacn003
  Name = OPERATION-MigrateVm
  Progress Message = Fail to attach device 600144f00d9c3e67000067bcfce30047, path not found, server 100.96.2.66 
  Run State = Failed

Workaround: Manually retry migrating the instance. The new job is expected to succeed.

Bug: 37574886

Version: 3.0.2

GPU Drivers Not Included in Oracle Linux Platform Images

If a Private Cloud Appliance installation includes compute nodes with GPUs, you can access them by selecting a dedicated shape. The GPU shapes can be selected for compute instances based on an Oracle Linux 8 or Oracle Linux 9 platform image. The current image versions do not include GPU drivers. The instance OS detects the allocated GPUs, but to use them, you must install the required drivers with the CUDA Toolkit from the NVIDIA developer site.

Note:

The large download and local repository installation need a large amount of disk space. The default 50GB boot volume is insufficient on Oracle Linux 9 and only just large enough on Oracle Linux 8. It is highly recommended to increase the boot volume size to at least 60GB, and extend the file system accordingly.
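
For example, a larger boot volume can be requested at instance launch and the root file system grown inside the guest (a sketch; the device name, partition number, and XFS file system are assumptions that depend on the image):

$ oci compute instance launch [...] --boot-volume-size-in-gbs 60
$ sudo growpart /dev/sda 3      # partition number is an assumption
$ sudo xfs_growfs /             # Oracle Linux root file systems are typically XFS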

Workaround: After launching the instance, log in to the command line and install the CUDA Toolkit. Follow the instructions for your version of Oracle Linux.

Installing GPU Drivers in an Oracle Linux 9 Instance
  1. From the command line of the instance, download and install the CUDA Toolkit rpm for your OS.

    $ wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-rhel9-12-8-local-12.8.0_570.86.10-1.x86_64.rpm
    $ sudo rpm -i cuda-repo-rhel9-12-8-local-12.8.0_570.86.10-1.x86_64.rpm
    $ sudo dnf clean all
    $ sudo dnf install cuda-toolkit-12-8
  2. Enable the Oracle Linux 9 EPEL yum repository. Install the dkms package.

    $ sudo yum-config-manager --enable ol9_developer_EPEL
    $ sudo dnf install dkms
  3. Install the GPU drivers.

    $ sudo dnf install cuda-12-8
  4. Verify the installation with the NVIDIA System Management Interface.

    $ nvidia-smi
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.86.10              Driver Version: 570.86.10      CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA L40S                    Off |   00000000:00:05.0 Off |                    0 |
    | N/A   26C    P8             23W /  350W |       1MiB /  46068MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
Installing GPU Drivers in an Oracle Linux 8 Instance
  1. From the command line of the instance, download and install the CUDA Toolkit rpm for your OS.

    $ wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-rhel8-12-8-local-12.8.0_570.86.10-1.x86_64.rpm
    $ sudo rpm -i cuda-repo-rhel8-12-8-local-12.8.0_570.86.10-1.x86_64.rpm
    $ sudo dnf clean all
    $ sudo dnf install cuda-toolkit-12-8
  2. Enable the Oracle Linux 8 EPEL yum repository. Install the dkms package.

    $ sudo yum-config-manager --enable ol8_developer_EPEL
    $ sudo dnf install dkms
  3. Install the GPU drivers.

    $ sudo dnf install cuda-12-8
  4. Install the NVIDIA kernel module.

    Confirm which toolset you need, either gcc-toolset-11 or gcc-toolset-13.

    # ls -l /etc/scl/conf/

    Install the correct module. This example shows gcc-toolset-13.

    $ sudo scl enable gcc-toolset-13 bash
    # dkms install nvidia-open -v 570.86.10

    If this make error appears while the kernel module is built, you can safely ignore it.

    Cleaning build area...(bad exit status: 2)
    Failed command:
    make -C /lib/modules/5.15.0-206.153.7.el8uek.x86_64/build M=/var/lib/dkms/nvidia-open/570.86.10/build clean
  5. Verify the installation with the NVIDIA System Management Interface.

    # nvidia-smi
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.86.10              Driver Version: 570.86.10      CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA L40S                    Off |   00000000:00:05.0 Off |                    0 |
    | N/A   26C    P8             23W /  350W |       1MiB /  46068MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+

Bug: n/a

Version: 3.0.2

Unable to Launch GPU Flex Shape Instance Using Terraform or OCI SDK

If a Private Cloud Appliance installation includes compute nodes with GPUs, you must launch compute instances with a dedicated shape to be able to use the GPUs. This can be a standard shape with fixed hardware resource ratios, or a flex shape that allows users to customize the CPU/RAM/GPU ratio. However, due to a difference in shape modeling between Private Cloud Appliance and Oracle Cloud Infrastructure, it is not possible to launch an instance with a GPU flex shape using the OCI SDK or Terraform provider. If you try to launch an instance this way, an error is returned indicating the "gpus" argument is not recognized.

Workaround: Use the Compute Web UI or the OCI CLI to launch a compute instance with a GPU flex shape. As an alternative, a direct curl request providing all the correct arguments is also expected to work.

Bug: 37195244

Version: 3.0.2

Forced Compute Node Evacuation Fails when Non-Migratable Instances Are Running

Evacuating a compute node is an operation to migrate all compute instances to other compute nodes, so the node can be safely taken offline for maintenance. It could also be used to soft-stop all non-migratable instances running on a node with a single command, instead of stopping the instances one by one. However, best practice is to shut down an instance from its guest OS and gracefully stop the instance from the Compute Web UI or OCI CLI.

If you decide to use forced node evacuation when non-migratable instances (such as instances based on a GPU shape or configured with SR-IOV) are running, the job returns an error and remains in a failed state, even if the instances are stopped successfully.

PCA-ADMIN> migrateVm id=<compute_node_id> force=true
JobId: 03816731-2829-471c-9aaf-b8f5e0666bdf
Data: Running

PCA-ADMIN> show job id=03816731-2829-471c-9aaf-b8f5e0666bdf
Data: 
  Name = OPERATION-MigrateVm
  Progress Message = (400, 'LimitExceeded', 'Unable to place VM instance')
                     (403, 'NotAllowed', 'instance ocid1.instance.unique_ID migration not allowed')
  Run State = Failed

Workaround: After performing a forced compute node evacuation, confirm that all non-migratable instances have been stopped successfully.

PCA-ADMIN> getNonMigratableInstances
Data:
  id                           Display Name  Compute Node Id  Domain State
  --                           ------------  ---------------  ------------
  ocid1.instance.unique_ID     instance202   CN_ID            shut off
  ocid1.instance.unique_ID     kqh027        CN_ID            shut off

To clear the error from the failed job, run the evacuation command a second time. Now it should succeed.

For more information, see "Migrating Instances from a Compute Node" in the Hardware Administration chapter of the Oracle Private Cloud Appliance Administrator Guide.

Bug: 37092239

Version: 3.0.2

Instance with Multiple GPUs Fails P2P Verification

When an instance is launched using a shape with multiple GPUs, direct peer-to-peer (P2P) data access between the GPUs is enabled for best performance. However, verification using the simpleP2P test tools may return errors, indicating that peer access between some GPU pairs is not working as expected.

Workaround: After installing the NVIDIA CUDA Toolkit in the compute instance, enable driver persistence mode and disable PCIe relaxed ordering.

  1. Log in to the compute instance as a user with sudo privileges.

  2. Enable GPU driver persistence mode.

    $ sudo systemctl enable nvidia-persistenced.service
    $ sudo systemctl start nvidia-persistenced.service
    
    $ systemctl status nvidia-persistenced.service
      nvidia-persistenced.service - NVIDIA Persistence Daemon
         Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; preset: enabled)
         Active: active (running) since Tue 2025-02-04 09:50:12 GMT; 15s ago
        Process: 413704 ExecStart=/usr/bin/nvidia-persistenced --verbose (code=exited, status=0/SUCCESS)
       Main PID: 413705 (nvidia-persiste)
          Tasks: 1 (limit: 1284273)
         Memory: 876.0K
            CPU: 14ms
         CGroup: /system.slice/nvidia-persistenced.service
                 └─413705 /usr/bin/nvidia-persistenced --verbose
  3. Look up the PCI identifiers for the GPUs.

    $ lspci | grep -i nvidia
    00:05.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
    00:06.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
  4. Disable PCIe relaxed ordering for the GPUs. Use this command:

    $ sudo setpci -s <gpu-device> CAP_EXP+8.w=0

    For example:

    $ sudo setpci -s 00:05.0 CAP_EXP+8.w=0
    $ sudo setpci -s 00:06.0 CAP_EXP+8.w=0
  5. If the compute instance is stopped and started, because of a reboot or another operation that changes the lifecycle state, repeat the commands to disable PCIe relaxed ordering for all GPU devices.

Bug: 37279887

Version: 3.0.2