Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Deploy a GPU High Performance Computing Cluster in Oracle Cloud Infrastructure
Introduction
The rise of powerful large language models (LLMs) increases the need for infrastructure with enough Graphics Processing Unit (GPU) memory to fine-tune them, and a GPU cluster is one way to meet that need. Oracle Cloud Infrastructure (OCI) can deploy a supercluster of NVIDIA A100 GPUs that you can use to run or fine-tune an LLM.
Components
A cluster network is an OCI resource for deploying clusters of HPC and GPU machines connected by a high-bandwidth, ultra-low-latency network. Each node in the cluster is a bare metal machine located in close physical proximity to the other nodes. A Remote Direct Memory Access (RDMA) network between the nodes provides single-digit microsecond latency, comparable to on-premises high performance computing (HPC) clusters. For more information, see Cluster Networks with Instance Pools.
To deploy a cluster, you need to create a dynamic group with your workspace compartment information, a set of policies that allow the services and the dynamic group to perform the required tasks, and a custom image, built from an Ubuntu image, for the cluster nodes to use. You then deploy a Marketplace stack to create the cluster. For more information, see Managing Dynamic Groups, Policies, Custom Images, and Oracle Cloud Marketplace.
Objective
- Deploy a GPU A100 cluster on OCI using a preconfigured stack.
Prerequisites
- Access to create dynamic groups, user groups and policies. For access permissions, contact your tenancy administrator.
- GPU computing limits. If you do not have enough compute GPU limits, see Requesting a Service Limit Increase.
Task 1: Create a Dynamic Group
Create a dynamic group rule with workspace information.
- Log in to the OCI Console, navigate to Identity & Security and click Compartments. Copy the Oracle Cloud Identifier (OCID) of your work compartment.
- Click Dynamic Groups and Create Dynamic Group.
- Enter a Name and Description. For this tutorial, enter instance-principal as the name. Update the rule with the compartment OCID you copied and click Create.
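The tutorial does not show the text of the rule you edit in this step. As a minimal sketch, assuming the rule should match every instance in the work compartment, the following Python helper builds a matching rule in the standard OCI dynamic group syntax. The function name and the example OCID are hypothetical.

```python
def build_matching_rule(compartment_ocid: str) -> str:
    """Return a dynamic group rule that matches all instances in one compartment.

    Assumption: matching on instance.compartment.id is what this deployment needs.
    """
    ocid = compartment_ocid.strip()
    # Every OCID starts with "ocid1."; reject anything that clearly is not one.
    if not ocid.startswith("ocid1."):
        raise ValueError(f"not an OCID: {ocid!r}")
    # Doubled braces produce literal { and } in the f-string output.
    return f"Any {{instance.compartment.id = '{ocid}'}}"


if __name__ == "__main__":
    # Placeholder OCID; paste the value copied from the Compartments page.
    print(build_matching_rule("ocid1.compartment.oc1..exampleuniqueid"))
```

Paste the printed rule into the rule editor of the Create Dynamic Group page.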
Task 2: Define the Policies
Define the policies required for the deployment process.
- Go to the OCI Console, navigate to Identity & Security and Policies.
- Click Create Policy and enter a Name, Description and select the root compartment.
- Click Show manual editor and enter the following policies, replace <> with your information and click Create.
  Allow service compute_management to use tag-namespace in tenancy
  Allow service compute_management to manage compute-management-family in tenancy
  Allow service compute_management to read app-catalog-listing in tenancy
  Allow group Administrators to manage all-resources in compartment <>
  allow service compute_management to use tag-namespace in tenancy
  allow service compute_management to manage compute-management-family in tenancy
  allow service compute_management to read app-catalog-listing in tenancy
  allow group user to manage all-resources in compartment compartmentName
  Allow dynamic-group instance-principal to read app-catalog-listing in tenancy
  Allow dynamic-group instance-principal to use tag-namespace in tenancy
  Allow dynamic-group instance-principal to manage compute-management-family in compartment <>
  Allow dynamic-group instance-principal to manage instance-family in compartment <>
  Allow dynamic-group instance-principal to use virtual-network-family in compartment <>
  Allow dynamic-group instance-principal to use volumes in compartment <>
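Filling in the <> placeholder by hand is error-prone, so the substitution can be scripted. The sketch below uses a subset of the statements listed above and replaces <> with a compartment name. The function name and the compartment name hpc-workspace are hypothetical; adjust the group and dynamic group names to match your tenancy.

```python
POLICY_TEMPLATE = """\
Allow service compute_management to use tag-namespace in tenancy
Allow service compute_management to manage compute-management-family in tenancy
Allow service compute_management to read app-catalog-listing in tenancy
Allow group Administrators to manage all-resources in compartment <>
Allow dynamic-group instance-principal to read app-catalog-listing in tenancy
Allow dynamic-group instance-principal to use tag-namespace in tenancy
Allow dynamic-group instance-principal to manage compute-management-family in compartment <>
Allow dynamic-group instance-principal to manage instance-family in compartment <>
Allow dynamic-group instance-principal to use virtual-network-family in compartment <>
Allow dynamic-group instance-principal to use volumes in compartment <>
"""


def render_policies(compartment: str, template: str = POLICY_TEMPLATE) -> list[str]:
    """Replace every <> placeholder with the compartment name, one statement per item."""
    name = compartment.strip()
    if not name:
        raise ValueError("compartment name must not be empty")
    # Skip blank lines so the result contains only policy statements.
    return [line.replace("<>", name) for line in template.splitlines() if line.strip()]


if __name__ == "__main__":
    # "hpc-workspace" is a placeholder compartment name.
    for statement in render_policies("hpc-workspace"):
        print(statement)
```

The printed statements can be pasted into the manual editor as they are.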
Task 3: (Optional) Create a Custom Image
If necessary, create a custom image from an Ubuntu image for GPU machines.
- Go to the OCI Console, navigate to Compute and Custom Images.
- Under Custom Images, click Import Image.
- Enter the following information.
  - Compartment: Enter the compartment.
  - Name: For this tutorial, enter Ubuntu-22-OCA-OFED-5.8-3.0.7.0-GPU-535-2023.11.30-0 as the name.
  - Operating system (OS): Enter the OS.
  - Select Import from an Object Storage URL and enter the following URL:
    https://objectstorage.ca-toronto-1.oraclecloud.com/p/3IlDVBRG3pjDLq4WHlmbpY6Tas8GU4GLuHw7i3ZC8pf4rJZDoB2b1WFxy9OTZCzc/n/hpc_limited_availability/b/images/o/Ubuntu-22-OCA-OFED-5.8-3.0.7.0-GPU-535-2023.11.30-0
- Enter the image location in object storage.
- Keep the default values for the other settings and click Import Image. It takes a few minutes for the custom image to be ready for use.
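A mistyped import URL only fails after the import job starts, so it can help to check the URL first. The sketch below assumes the usual layout of an Object Storage pre-authenticated request URL (p/token/n/namespace/b/bucket/o/object) and splits the URL from this task into its parts. The function name is hypothetical.

```python
from urllib.parse import urlparse

IMAGE_URL = (
    "https://objectstorage.ca-toronto-1.oraclecloud.com/p/"
    "3IlDVBRG3pjDLq4WHlmbpY6Tas8GU4GLuHw7i3ZC8pf4rJZDoB2b1WFxy9OTZCzc"
    "/n/hpc_limited_availability/b/images/o/"
    "Ubuntu-22-OCA-OFED-5.8-3.0.7.0-GPU-535-2023.11.30-0"
)


def parse_image_url(url: str) -> dict:
    """Split an Object Storage URL into region, namespace, bucket, and object name."""
    parsed = urlparse(url)
    host_parts = (parsed.hostname or "").split(".")
    # Expected host: objectstorage.<region>.oraclecloud.com
    if parsed.scheme != "https" or len(host_parts) != 4 or host_parts[0] != "objectstorage":
        raise ValueError(f"unexpected host in {url!r}")
    segments = parsed.path.strip("/").split("/")
    # Expected path: p/<token>/n/<namespace>/b/<bucket>/o/<object name>
    markers_ok = len(segments) >= 8 and (segments[0], segments[2], segments[4], segments[6]) == ("p", "n", "b", "o")
    if not markers_ok:
        raise ValueError(f"unexpected path in {url!r}")
    return {
        "region": host_parts[1],
        "namespace": segments[3],
        "bucket": segments[5],
        "object": "/".join(segments[7:]),  # object names may contain slashes
    }


if __name__ == "__main__":
    info = parse_image_url(IMAGE_URL)
    print(info["region"], info["bucket"], info["object"])
```

Here the object name at the end of the URL matches the image name entered in the Name field, which is a quick way to confirm you copied the whole URL.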
Task 4: Deploy the HPC Stack
A simple and quick way to deploy the HPC stack is to use the following URL:
https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/oracle-quickstart/oci-hpc/archive/refs/heads/master.zip
This URL uses the recommended, up-to-date scripts to create the environment.
Note: To check the latest updates to the deploy script, go to https://github.com/oracle-quickstart/oci-hpc and, in the README.md file, click Deploy to Oracle Cloud.
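The quick-deploy link is the Resource Manager stack creation page with a zipUrl query parameter that points at a GitHub branch archive. As a minimal sketch, assuming you want the same kind of link for another branch or a fork, the following Python helper assembles it. The function name is hypothetical, and the zipUrl value is left unencoded to mirror the link used in this tutorial.

```python
from urllib.parse import parse_qs, urlparse

CONSOLE_CREATE_STACK = "https://cloud.oracle.com/resourcemanager/stacks/create"


def deploy_url(repo: str, branch: str = "master") -> str:
    """Build a Resource Manager create-stack link for a GitHub repository branch.

    repo is "<owner>/<name>", for example "oracle-quickstart/oci-hpc".
    """
    owner, _, name = repo.partition("/")
    if not owner or not name or "/" in name:
        raise ValueError(f"expected '<owner>/<name>', got {repo!r}")
    # GitHub serves a zip archive of any branch under /archive/refs/heads/.
    zip_url = f"https://github.com/{owner}/{name}/archive/refs/heads/{branch}.zip"
    return f"{CONSOLE_CREATE_STACK}?zipUrl={zip_url}"


if __name__ == "__main__":
    url = deploy_url("oracle-quickstart/oci-hpc")
    print(url)
    # Show the archive location the console will download.
    print(parse_qs(urlparse(url).query)["zipUrl"][0])
```

Open the printed link in a browser where you are signed in to the OCI Console.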
Alternatively, deploy the HPC stack through the OCI Console.
- Go to the OCI Console, click Marketplace and All Applications.
- Enter HPC solutions in the Search bar.
- Select HPC Cluster.
- Enter the required information to create the stack.
- Enter the required values to configure the Advanced bastion options.
- Enter the cluster network parameters.
- Click Create to initialize the stack deployment.
  The stack is created successfully.
- To check the instances created, go to the OCI Console and click Compute, Instances.
Acknowledgments
- Authors - Douglas Silva (LAD A-Team), Leandro Camargo (LAD A-Team)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
F98205-01
May 2024