Deploy a GPU High Performance Computing Cluster in Oracle Cloud Infrastructure

Introduction

The advent of powerful large language models (LLMs) has increased the need for infrastructure with enough Graphics Processing Unit (GPU) memory to perform fine-tuning tasks, and a GPU cluster is one way to provide it. Oracle Cloud Infrastructure (OCI) can deploy a supercluster of NVIDIA A100 GPUs whose combined power can be used to run or fine-tune an LLM.

Components

A cluster network is an OCI resource for deploying clusters of HPC and GPU machines connected by a high-bandwidth, ultra-low-latency network. Each node in the cluster is a bare metal machine located in close physical proximity to the other nodes. A Remote Direct Memory Access (RDMA) network between the nodes provides single-digit microsecond latency, comparable to on-premises high performance computing (HPC) clusters. For more information, see Cluster Networks with Instance Pools.

To deploy a cluster, you need to create a Dynamic Group that references your work compartment, define a set of Policies that allow the services and the dynamic group to perform the required tasks, optionally import a Custom Image based on Ubuntu for the cluster nodes, and deploy the cluster from a Marketplace stack. For more information, see Managing Dynamic Groups, Policies, Custom Images and Oracle Cloud Marketplace.

Objective

Prerequisites

Task 1: Create a Dynamic Group

Create a dynamic group whose rule references your work compartment.

  1. Log in to the OCI Console, navigate to Identity & Security and click Compartments. Copy the Oracle Cloud Identifier (OCID) from the work compartment.

    Image 1

  2. Click Dynamic Groups and Create Dynamic Group.

  3. Enter a Name and Description. For this tutorial, enter instance-principal as the name. Update the rule with the compartment OCID copied in step 1 and click Create.

    Image 2
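The rule text itself is only visible in the screenshot, so the following is a minimal sketch of what a compartment-based matching rule might look like. It assumes the rule should match every instance in the work compartment. The helper name build_matching_rule and the example OCID are hypothetical.

```python
def build_matching_rule(compartment_ocid: str) -> str:
    """Return a matching rule that covers every instance in one compartment."""
    # Basic sanity check: all OCIDs start with the "ocid1." prefix.
    if not compartment_ocid.startswith("ocid1."):
        raise ValueError(f"not an OCID: {compartment_ocid!r}")
    # Doubled braces produce literal { } in an f-string.
    return f"Any {{instance.compartment.id = '{compartment_ocid}'}}"


# Placeholder OCID for illustration only; use the value copied in step 1.
rule = build_matching_rule("ocid1.compartment.oc1..exampleuniqueid")
print(rule)
```

The printed string can be pasted into the rule field of the Create Dynamic Group page.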

Task 2: Define the Policies

Define the policies required for the deployment process.

  1. Go to the OCI Console, navigate to Identity & Security and click Policies.

  2. Click Create Policy, enter a Name and Description, and select the root compartment.

  3. Click Show manual editor, enter the following policies, replace <> with your compartment name, and click Create.

    Allow service compute_management to use tag-namespace in tenancy
    
    Allow service compute_management to manage compute-management-family in tenancy
    
    Allow service compute_management to read app-catalog-listing in tenancy
    
    Allow group Administrators to manage all-resources in compartment <>
    
    Allow dynamic-group instance-principal to read app-catalog-listing in tenancy
    
    Allow dynamic-group instance-principal to use tag-namespace in tenancy
    
    Allow dynamic-group instance-principal to manage compute-management-family in compartment <>
    
    Allow dynamic-group instance-principal to manage instance-family in compartment <>
    
    Allow dynamic-group instance-principal to use virtual-network-family in compartment <>
    
    Allow dynamic-group instance-principal to use volumes in compartment <>
    

    Image 3
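To avoid editing each <> placeholder by hand, the statements can be generated from a template. This is a minimal sketch that assumes the dynamic group name from Task 1 and the Administrators group from the list above. The function render_policies and the compartment name my-compartment are hypothetical.

```python
DYNAMIC_GROUP = "instance-principal"  # name chosen in Task 1

# Statements scoped to the whole tenancy need no substitution.
TENANCY_STATEMENTS = [
    "Allow service compute_management to use tag-namespace in tenancy",
    "Allow service compute_management to manage compute-management-family in tenancy",
    "Allow service compute_management to read app-catalog-listing in tenancy",
    "Allow dynamic-group {dg} to read app-catalog-listing in tenancy",
    "Allow dynamic-group {dg} to use tag-namespace in tenancy",
]

# Statements scoped to the work compartment.
COMPARTMENT_STATEMENTS = [
    "Allow group Administrators to manage all-resources in compartment {compartment}",
    "Allow dynamic-group {dg} to manage compute-management-family in compartment {compartment}",
    "Allow dynamic-group {dg} to manage instance-family in compartment {compartment}",
    "Allow dynamic-group {dg} to use virtual-network-family in compartment {compartment}",
    "Allow dynamic-group {dg} to use volumes in compartment {compartment}",
]


def render_policies(compartment: str, dynamic_group: str = DYNAMIC_GROUP) -> list[str]:
    """Fill in the compartment and dynamic group names in every statement."""
    templates = TENANCY_STATEMENTS + COMPARTMENT_STATEMENTS
    return [t.format(dg=dynamic_group, compartment=compartment) for t in templates]


# "my-compartment" is a placeholder; use your own compartment name.
policies = render_policies("my-compartment")
print("\n".join(policies))
```

The output is one statement per line, ready to paste into the manual editor.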

Task 3: (Optional) Create a Custom Image

If necessary, create a custom image from an Ubuntu image for GPU machines.

  1. Go to the OCI Console, navigate to Compute and click Custom Images.

    Image 4

  2. Under Custom Images, click Import Image.

    Image 5

  3. Enter the following information.

    • Compartment: Enter the compartment.
    • Name: For this tutorial, enter Ubuntu-22-OCA-OFED-5.8-3.0.7.0-GPU-535-2023.11.30-0 as name.
    • Operating system (OS): Enter OS.
    • Select Import from an Object Storage URL and enter the following URL: https://objectstorage.ca-toronto-1.oraclecloud.com/p/3IlDVBRG3pjDLq4WHlmbpY6Tas8GU4GLuHw7i3ZC8pf4rJZDoB2b1WFxy9OTZCzc/n/hpc_limited_availability/b/images/o/Ubuntu-22-OCA-OFED-5.8-3.0.7.0-GPU-535-2023.11.30-0

    Image 6

  4. Enter the image location in object storage.

    Image 7

    Image 8

  5. Keep the default values for the other settings and click Import Image. It takes a few minutes for the custom image to be ready for use.

    Image 9
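The import URL in step 3 follows the usual layout of an Object Storage pre-authenticated request: region in the host name, then the token, namespace, bucket and object name in the path. The sketch below splits the URL into those parts, which can help confirm that the region and object name match the image you expect. The function name parse_object_storage_url is hypothetical.

```python
import re
from urllib.parse import urlparse

IMAGE_URL = (
    "https://objectstorage.ca-toronto-1.oraclecloud.com/p/"
    "3IlDVBRG3pjDLq4WHlmbpY6Tas8GU4GLuHw7i3ZC8pf4rJZDoB2b1WFxy9OTZCzc"
    "/n/hpc_limited_availability/b/images/o/"
    "Ubuntu-22-OCA-OFED-5.8-3.0.7.0-GPU-535-2023.11.30-0"
)

# Path layout: /p/<token>/n/<namespace>/b/<bucket>/o/<object>
PATH_PATTERN = re.compile(
    r"^/p/(?P<token>[^/]+)/n/(?P<namespace>[^/]+)/b/(?P<bucket>[^/]+)/o/(?P<object>.+)$"
)


def parse_object_storage_url(url: str) -> dict[str, str]:
    """Split a pre-authenticated request URL into its named components."""
    parsed = urlparse(url)
    match = PATH_PATTERN.match(parsed.path)
    if match is None:
        raise ValueError("URL does not look like a pre-authenticated request")
    parts = match.groupdict()
    # Host name layout: objectstorage.<region>.oraclecloud.com
    parts["region"] = parsed.netloc.split(".")[1]
    return parts


info = parse_object_storage_url(IMAGE_URL)
for key in ("region", "namespace", "bucket", "object"):
    print(f"{key}: {info[key]}")
```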

Task 4: Deploy the HPC Stack

A simple and quick way to deploy the HPC stack is to use the following URL: https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/oracle-quickstart/oci-hpc/archive/refs/heads/master.zip. This URL opens Resource Manager with the latest recommended scripts from the repository, which are then used to create the environment.

Note: To check the latest updates to the deploy script, go to the URL: https://github.com/oracle-quickstart/oci-hpc. In the README.md file, click Deploy to Oracle Cloud as shown in the following image.

Image 28
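The quick-deploy link is the Resource Manager create-stack page with a zipUrl query parameter that points to a zip archive of the repository. The sketch below builds such a link, assuming the repository path from the note above and the master branch. The function name deploy_url is hypothetical. Note that urlencode percent-encodes the archive URL, so the result looks different from the link written out above but carries the same zipUrl value.

```python
from urllib.parse import parse_qs, urlencode, urlparse

CONSOLE_CREATE_STACK = "https://cloud.oracle.com/resourcemanager/stacks/create"


def deploy_url(repo: str, branch: str = "master") -> str:
    """Build a create-stack link for a GitHub repository branch archive."""
    zip_url = f"https://github.com/{repo}/archive/refs/heads/{branch}.zip"
    # urlencode percent-encodes the nested URL so it survives as one parameter.
    return f"{CONSOLE_CREATE_STACK}?{urlencode({'zipUrl': zip_url})}"


url = deploy_url("oracle-quickstart/oci-hpc")
print(url)

# Decode the parameter again to see the archive URL the link points at.
print(parse_qs(urlparse(url).query)["zipUrl"][0])
```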

Alternatively, deploy the HPC stack through the Marketplace in the OCI Console.

  1. Go to the OCI Console, click Marketplace and All Applications.

    Image 10

  2. Enter HPC solutions in the Search bar.

    Image 11

  3. Select HPC Cluster.

    Image 12

  4. Enter the required information to create the stack.

    Image 13

    Image 14

    Image 15

    Image 16

    Image 17

    Image 18

    Image 19

    Image 20

  5. Enter the required values to configure the Advanced bastion options.

    Image 21

  6. Enter the cluster network parameters.

    Image 22

    Image 23

  7. Click Create to initialize the stack deployment.

    Image 24

    The stack is created successfully.

    Image 25

  8. To check the instances created, go to the OCI Console and click Compute, Instances.

    Image 26

    Image 27

Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.