Integrating GPU Expansion Nodes

GPU nodes are installed in an expansion rack. Its networking components must be connected to the base rack so the new hardware can be integrated into the hardware administration and data networks. For installation requirements, physical hardware installation information, and cabling details, refer to Optional GPU Expansion in the "Oracle Private Cloud Appliance Installation Guide".

This section assumes that the GPU expansion rack has been installed and connected to the Private Cloud Appliance base rack. The GPU nodes must be discovered and provisioned before their hardware resources are available for use within compute instances. Unlike standard compute nodes, which are added to the base rack and automatically integrated and prepared for provisioning, GPU nodes in an expansion rack go through a more strictly controlled process.

The GPU expansion rack is activated by running a script from one of the management nodes. With precise timing and orchestration based on a static mapping, this script powers on and configures each component in the GPU expansion rack. The required ports on the switches are enabled so that all hardware can be discovered and registered in the component database. When the scripted operations are completed, the data and management networks are operational across the interconnected racks. The operating system and additional software are installed on the new nodes, after which they are ready to provision.

Installation and activation of the expansion rack and GPU nodes are performed by Oracle. From this point forward, the system treats GPU nodes the same way as all other compute nodes. After provisioning, appliance administrators can manage and monitor them from the Service Enclave UI or CLI. See Performing Compute Node Operations.
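
For example, a registered GPU node can be listed and inspected from the Service CLI in the same way as any other compute node. The commands below are a minimal sketch, assuming the Service CLI syntax of recent appliance software releases; the node ID is a placeholder.

    PCA-ADMIN> list ComputeNode
    PCA-ADMIN> show ComputeNode id=<gpu_node_id>

The show command returns the details registered for the node, which can be used to verify that a newly activated GPU node is ready for provisioning.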

Note:

Live migration is not supported for GPU instances. This impacts some compute node operations.

  • Evacuating a GPU node fails because its instances cannot be live migrated. Instances must be stopped manually.

  • The high availability configuration of the Compute Service applies to GPU instances, but is further restricted by limited hardware resources.

    When a GPU node goes offline and returns to normal operation, the Compute Service restarts instances that were stopped during the outage. An instance might be restarted, through cold migration, on another GPU node if enough hardware resources are available.

Caution:

For planned maintenance or upgrades, best practice is to issue a shutdown command from the instance OS, then gracefully stop the instance from the Compute Web UI or OCI CLI.
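
For example, after the guest OS has been shut down, the instance can be stopped gracefully with the OCI CLI. This is a minimal sketch: the instance OCID is a placeholder, and the SOFTSTOP action requests a graceful shutdown before the instance is powered off.

    oci compute instance action \
      --instance-id <instance_OCID> \
      --action SOFTSTOP

Using STOP instead of SOFTSTOP powers the instance off immediately, without the graceful shutdown.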

GPU nodes are added to the three existing fault domains, which is consistent with the overall Oracle cloud architecture. The fault domains might become unbalanced because, unlike standard compute nodes, GPU nodes can be added one at a time. This has no functional impact on the fault domains, because the server families operate separately from each other: GPU nodes can only host compute instances based on a GPU shape, and migrations between different server families in the same fault domain are not supported.

In the Compute Enclave, consuming resources provided by a GPU node is straightforward. Users deploy compute instances with a dedicated GPU shape, which allocates 1 to 4 GPUs to the instance. Instances based on a GPU shape always run on a GPU node.
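
For example, an instance could be placed on a GPU node with a launch command similar to the one below. This is a sketch only: the shape name, availability domain, fault domain, and OCIDs are placeholders, and the GPU shape names that are actually available depend on the installed hardware and the appliance software version.

    oci compute instance launch \
      --availability-domain <availability_domain> \
      --compartment-id <compartment_OCID> \
      --shape <GPU_shape_name> \
      --subnet-id <subnet_OCID> \
      --image-id <image_OCID> \
      --display-name gpu-instance-01 \
      --fault-domain FAULT-DOMAIN-1

Specifying --fault-domain is optional; when it is omitted, the Compute Service selects a placement itself, which for a GPU shape is always a GPU node.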