Integrating GPU Expansion Nodes

The GPU nodes must be discovered and provisioned before their hardware resources are available for use within Private Cloud Appliance compute instances. Unlike standard compute nodes, which are added to the base rack and automatically integrated and prepared for provisioning, GPU nodes go through a more strictly controlled process.

GPU nodes are installed in an expansion rack, whose networking components must be connected to the base rack so that the new hardware can be integrated into the hardware administration and data networks. This section assumes the GPU expansion rack has been installed and connected to the Private Cloud Appliance base rack. For installation requirements, physical hardware installation information, and cabling details, see Expanding Private Cloud Appliance with GPU Capacity.

The GPU expansion rack is activated by running a script from one of the management nodes. With precise timing and orchestration based on a static mapping, this script powers on and configures each component in the GPU expansion rack. The required ports on the switches are enabled so that all hardware can be discovered and registered in the component database. When the scripted operations are complete, the data and management networks are operational across the interconnected racks. The operating system and additional software are then installed on the new nodes, after which they are ready for provisioning.

Installation and activation of the expansion rack and GPU nodes are performed by Oracle. From this point forward, the system treats GPU nodes the same way as all other compute nodes. After provisioning, appliance administrators can manage and monitor them from the Service Enclave UI or CLI. See Performing Administrative Operations on Compute Nodes.
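
For example, a registered GPU node can be inspected from the Service CLI in the same way as any other compute node. The commands below are an illustrative sketch only; the node ID is a placeholder, and the exact command set is described in Performing Administrative Operations on Compute Nodes.

  PCA-ADMIN> list ComputeNode                         (lists all compute nodes registered in the appliance, including GPU nodes)
  PCA-ADMIN> show ComputeNode id=<compute_node_id>    (displays the details of one node; the ID is a placeholder)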

Note

Live migration is not supported for GPU instances. This restriction affects the following compute node operations:

  • Evacuating a GPU node fails because its instances cannot be live-migrated. The instances must be stopped manually instead.

  • The high availability configuration of the Compute Service applies to GPU instances, but is further restricted by limited hardware resources.

    When a GPU node goes offline and returns to normal operation, the Compute Service restarts the instances that were stopped during the outage. An instance might be restarted, through cold migration, on another GPU node if enough hardware resources are available.

Caution

For planned maintenance or upgrades, the best practice is to issue a shutdown command from the instance OS, then gracefully stop the instance from the Compute Web UI or OCI CLI.
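
As a sketch of that sequence, assuming a Linux guest and an OCI CLI configuration that points at the appliance's Compute Enclave, the commands might look as follows. The instance OCID is a placeholder.

  # Step 1: shut down the operating system cleanly from inside the instance.
  sudo shutdown -h now

  # Step 2: from an OCI CLI client, stop the instance so that its state becomes STOPPED.
  oci compute instance action \
      --instance-id ocid1.instance.....<unique_id> \
      --action STOP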

GPU nodes are added to the 3 existing fault domains, which is consistent with the overall Oracle cloud architecture. The fault domains might become unbalanced because, unlike standard compute nodes, GPU nodes can be added one at a time. This has no functional impact on the fault domains because the server families operate separately from each other. The GPU nodes can only host compute instances based on a GPU shape, and migrations between different server families in the same fault domain are not supported.

In the Compute Enclave, consuming the resources provided by a GPU node is straightforward: users deploy compute instances based on a dedicated GPU shape, which allocates between 1 and 4 GPUs to the instance. Instances based on a GPU shape always run on a GPU node.
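
For illustration, launching a GPU instance from the OCI CLI differs from a standard instance launch only in the shape that is selected. The shape name, availability domain, and OCIDs below are placeholders; use the GPU shapes and values listed for your system.

  # Launch a compute instance on a GPU node by selecting a GPU shape.
  oci compute instance launch \
      --availability-domain <availability_domain_name> \
      --compartment-id ocid1.compartment.....<unique_id> \
      --shape <GPU_shape_name> \
      --subnet-id ocid1.subnet.....<unique_id> \
      --image-id ocid1.image.....<unique_id> \
      --display-name gpu-instance-01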