Configure an HPC Cluster Stack to Deploy NVIDIA AI on an OCI Government Region

Configure and deploy a private cluster of bare metal NVIDIA GPU systems in Oracle US Government Cloud (FedRAMP High). All cloud resources and data remain under your cloud tenancy, giving you full control over software versions, administrative access, encryption keys, and resource sharing.

The HPC Cluster stack uses Terraform to deploy Oracle Cloud Infrastructure (OCI) resources. The stack creates GPU nodes, storage, standard networking, high-performance cluster networking, and a bastion/head node for accessing and managing the cluster.

Before You Begin

Learn more about deploying NVIDIA AI Enterprise on Oracle Cloud Infrastructure Government Cloud. See Deploy high-performance GPU computing for government AI workloads.

Architecture

This architecture deploys a head node that runs the scheduler and also serves as the bastion server for access to the cluster.

You can create compute processing nodes using a variety of NVIDIA GPU instance types to match your processing requirements. We recommend placing the compute processing nodes in the secure private subnet. You can deploy NVIDIA GPU compute cluster instances from Oracle Cloud Marketplace.

This architecture is deployed in a virtual cloud network (VCN) with public and private subnets. The customer network can access the head node and compute nodes only through IPSec VPN, Oracle Cloud Infrastructure FastConnect, or the public internet.
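Before deploying, it can help to sanity-check the address plan for the head node and compute node networks. The following is a minimal sketch in Python using only the standard `ipaddress` module. The CIDR blocks and the subnet names (`public-head-node`, `private-compute`) are illustrative assumptions, not values taken from the HPC Cluster stack.

```python
import ipaddress

# Hypothetical address plan; these CIDR values are assumptions for illustration.
VCN_CIDR = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public-head-node": ipaddress.ip_network("10.0.0.0/24"),
    "private-compute": ipaddress.ip_network("10.0.64.0/18"),
}


def validate_subnets(vcn, subnets):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    # Every subnet must fall inside the VCN address range.
    for name, net in subnets.items():
        if not net.subnet_of(vcn):
            problems.append(f"{name} ({net}) is outside the VCN {vcn}")
    # No two subnets may share addresses.
    names = list(subnets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if subnets[first].overlaps(subnets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    issues = validate_subnets(VCN_CIDR, SUBNETS)
    print("address plan OK" if not issues else "\n".join(issues))
```

A larger private range is assumed for the compute nodes here only because GPU clusters tend to grow; size the ranges to your own requirements.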

The architecture uses a region with one availability domain and regional subnets. You can use the same architecture in a region with multiple availability domains. We recommend that you use regional subnets for your deployment, regardless of the number of availability domains. You can access these cluster networks from Oracle Cloud Marketplace or deploy them manually. In either case, we recommend using the baseline reference architecture and then adjusting it to meet your specific requirements.

The following diagram illustrates this reference architecture.

[Diagram: nvidia-ai-gvt-hpc-oci.png]

nvidia-ngc-ai-gvt-hpc-oci-oracle.zip

The architecture has the following components:

Region
An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).
Availability domains
Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain shouldn't affect the other availability domains in the region.
Fault domains
A fault domain is a grouping of hardware and infrastructure within an availability domain. Each availability domain has three fault domains with independent power and hardware. When you distribute resources across multiple fault domains, your applications can tolerate physical server failure, system maintenance, and power failures inside a fault domain.
Virtual cloud network (VCN) and subnets
A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.
Bastion host
The bastion host is a compute instance that serves as a secure, controlled entry point to the topology from outside the cloud. The bastion host is typically provisioned in a demilitarized zone (DMZ). It enables you to protect sensitive resources by placing them in private networks that can't be accessed directly from outside the cloud. The topology has a single, known entry point that you can monitor and audit regularly. So, you can avoid exposing the more sensitive components of the topology without compromising access to them.
Compute node
Select the bare metal GPU shape to use in this cluster. For example, select BM.GPU4.8 powered by 8 x NVIDIA A100 Tensor Core GPUs, as shown in the example above, or select BM.GPU.H100.8 powered by 8 x NVIDIA H100 Tensor Core GPUs for FP8 performance benefits using the NVIDIA Transformer Engine.
Orchestration node
The orchestration node handles cluster node management, provisioning, deprovisioning, and deployment of software configurations, and it manages compute workflows and job orchestration.
Security list
For each subnet, you can create security rules that specify the source, destination, and type of traffic that must be allowed in and out of the subnet.
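To illustrate how security rules restrict traffic into a subnet, the following minimal sketch models ingress rules in plain Python and checks whether a connection would be allowed. It is a simplified model, not the OCI API: the `IngressRule` class, the `is_allowed` function, and the example CIDR block are hypothetical names and values chosen for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """Simplified stand-in for one ingress security rule (hypothetical model)."""
    source: str      # CIDR block allowed to connect
    protocol: str    # "tcp", "udp", or "all"
    port_min: int    # lowest destination port allowed
    port_max: int    # highest destination port allowed


def is_allowed(rules, src_ip, protocol, port):
    """Return True if any rule permits the connection; deny by default."""
    addr = ipaddress.ip_address(src_ip)
    for rule in rules:
        if addr not in ipaddress.ip_network(rule.source):
            continue
        if rule.protocol not in ("all", protocol):
            continue
        if rule.port_min <= port <= rule.port_max:
            return True
    return False


# Assumed example: the private compute subnet accepts SSH only from
# the subnet that hosts the bastion/head node (10.0.0.0/24 is illustrative).
COMPUTE_RULES = [IngressRule("10.0.0.0/24", "tcp", 22, 22)]

if __name__ == "__main__":
    print(is_allowed(COMPUTE_RULES, "10.0.0.15", "tcp", 22))    # from the head node subnet
    print(is_allowed(COMPUTE_RULES, "203.0.113.7", "tcp", 22))  # from an outside address
```

The deny-by-default behavior in this sketch reflects the idea that only traffic matching an explicit rule is let through, which is what keeps the compute nodes reachable solely by way of the bastion.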

About Required Products, Services, and Roles

This solution requires the following products, services, and roles:

Oracle Cloud Infrastructure Government Cloud
NVIDIA AI Enterprise
NVIDIA NeMo Framework
NVIDIA Enroot
NVIDIA NCCL

These are the roles needed for each service.

Service Name: Role	Required to...
Oracle Cloud Infrastructure Government Cloud: Oracle Cloud user for the tenancy	Create a compartment in Oracle Cloud Infrastructure (OCI), deploy the GPU Cluster, and configure the GPU Cluster.
OCI Government Cloud: security or network administrator	Create or edit OCI policies, as needed, to allow you to build the cluster.
OCI Government Cloud: `opc`	Connect to the bastion to review the configuration, update the OS, and run the LLM training workload.

See Oracle Products, Solutions, and Services to get what you need.