Migrate Bare Metal GPU Nodes to OKE as Self-Managed Nodes using an OCI Stack

Introduction

In this tutorial, we will walk through the process of migrating bare metal (BM) GPU nodes to Oracle Cloud Infrastructure Kubernetes Engine (OKE) self-managed nodes using an Oracle Cloud Infrastructure (OCI) stack.

Let us first understand what self-managed nodes are, and why they are an ideal fit for running GPUs on OKE.

What are OKE Self-Managed Nodes?

As the name implies, self-managed nodes are fully controlled and maintained by the customer. This includes provisioning, scaling, configuration, upgrades, and maintenance tasks such as Operating System (OS) patching and node replacement. While this approach requires more manual management, it provides maximum flexibility and control, making it suitable for specialized workloads like those running on GPUs.

Key Features of Self-Managed Nodes:

  - Full control over the node lifecycle, including provisioning, configuration, OS patching, upgrades, and node replacement.
  - Support for specialized compute shapes and workloads, such as BM GPU shapes with RDMA networking.
  - Nodes join an existing enhanced OKE cluster that uses the Flannel CNI for pod networking.

This tutorial covers a use case where BM A100 GPU workloads are currently running on a Slurm cluster in OCI, with the goal of migrating them to an OKE cluster. This can be achieved by deploying an empty OKE cluster with the High Performance Computing (HPC) OKE stack and then adding the existing GPU nodes to it as self-managed nodes.

Objectives

  - Deploy an empty OKE cluster using the HPC OKE stack in OCI Resource Manager.
  - Add the existing BM A100 GPU nodes to the cluster as self-managed nodes.
  - Verify that the GPU nodes have successfully joined the OKE cluster.

Prerequisites

  - Access to an OCI tenancy with permissions to create policies, dynamic groups, Resource Manager stacks, and OKE clusters.
  - Existing BM A100 GPU nodes in OCI (for example, nodes currently serving a Slurm cluster) that you want to migrate to OKE.
  - An SSH key pair for the bastion and worker nodes.

Task 1: Migrate BM A100 GPU Nodes to OKE using HPC OKE Stack

  1. Log in to the OCI Console and create the necessary policies as described on the following GitHub page: Running RDMA (remote direct memory access) GPU workloads on OKE.

  2. Click Deploy to Oracle Cloud and review the terms and conditions.

    GitHub page

  3. Select the region where you want to deploy the stack.

  4. On the Stack information page, enter a Name for your stack.

    Create Stack

  5. On the Configure variables page, enter a Name for your VCN.

    Provide name

  6. In the Bastion & Operator section, enter the Bastion instance details and add an SSH key for the Bastion instance.

    Provide VCN and bastion

  7. (Optional) Select Configure operator shape to create an operator node for monitoring or running jobs.

    Operator shape

  8. Configure the variables in the OKE Cluster, Workers: Operational nodes, and Workers: GPU + RDMA nodes sections. Make sure to select Flannel as the CNI for pod networking, because self-managed nodes require the Flannel CNI.

    Provide OKE cluster conf

    Worker node for operations

    Worker node for RDMA GPU

  9. Select Create a RAID 0 array using local NVMe drives and Install Node Problem Detector & Kube Prometheus Stack.

    Create storage

  10. Review stack information and click Create.

    Review before clicking create

  11. Review the Stack details in Resource Manager and verify the OKE cluster under the Kubernetes section in the OCI Console. An optional CLI check is shown after the images below.

    Check Stack details

    OKE cluster running
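
    (Optional) If you have the OCI Command Line Interface (CLI) configured, you can also confirm from a terminal that the cluster is active. This is a minimal sketch; the compartment OCID below is a placeholder.

    # List active OKE clusters in the compartment where the stack was deployed
    oci ce cluster list --compartment-id <compartment-ocid> --lifecycle-state ACTIVE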

  12. Connect to the OKE cluster using the Access Cluster option in the OCI Console, and then proceed to add the existing GPU nodes to it. A sample kubeconfig setup is shown below.
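
    The Access Cluster dialog in the OCI Console displays the exact commands for your cluster. As a sketch, merging the cluster credentials into a local kubeconfig with the OCI CLI typically looks like the following; <cluster-ocid> and <region> are placeholders, and you may need PRIVATE_ENDPOINT instead of PUBLIC_ENDPOINT when connecting from the bastion or operator node inside the VCN.

    # Merge the OKE cluster credentials into the local kubeconfig
    oci ce cluster create-kubeconfig \
      --cluster-id <cluster-ocid> \
      --file $HOME/.kube/config \
      --region <region> \
      --token-version 2.0.0 \
      --kube-endpoint PUBLIC_ENDPOINT

    # Confirm API access to the cluster
    kubectl get nodes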

  13. Follow all the steps mentioned here: Creating a Dynamic Group and a Policy for Self-Managed Nodes. A reference pattern for the dynamic group rule and policy statement is shown below.
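
    For reference, the dynamic group rule and policy statement described on that page generally follow the pattern below; the dynamic group name, compartment OCID, and compartment name are placeholders.

    Dynamic group matching rule (matches the compute instances that will join the cluster):

    ALL {instance.compartment.id = '<compartment-ocid>'}

    Policy statement allowing those instances to join the cluster:

    Allow dynamic-group <dynamic-group-name> to {CLUSTER_JOIN} in compartment <compartment-name>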

  14. Follow steps 1 and 2 mentioned here: Creating Cloud-init Scripts for Self-managed Nodes. These steps provide the API server endpoint and CA certificate values used by the bootstrap command in the next step; a retrieval sketch is shown below.
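
    These two steps give you the cluster's Kubernetes API server private endpoint and the base64-encoded cluster CA certificate. A minimal sketch of retrieving them, assuming the OCI CLI and a kubeconfig for this cluster are already configured; <cluster-ocid> is a placeholder.

    # Kubernetes API server private endpoint (IP address and port) of the cluster
    oci ce cluster get --cluster-id <cluster-ocid> \
      --query 'data.endpoints."private-endpoint"' --raw-output

    # Base64-encoded cluster CA certificate, read from the kubeconfig
    # (assumes this OKE cluster is the first entry in the kubeconfig)
    kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}'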

  15. Run the following commands on each GPU node to add it to the OKE cluster.

    # Remove the HPC limited-availability OKE node apt repository list
    sudo rm archive_uri-https_objectstorage_ap-osaka-1_oraclecloud_com_p_ltn5w_61bxynnhz4j9g2drkdic3mwpn7vqce4gznmjwqqzdqjamehhuogyuld5ht_n_hpc_limited_availability_b_oke_node_repo_o_ubuntu-jammy.list
    
    # Refresh the package index and install the OKE node packages
    sudo apt update
    sudo apt install -y oci-oke-node-all*
    
    # Bootstrap the node into the cluster. Replace <API SERVER IP> and <CA CERT> with the values
    # gathered in the previous step, and <EXTRA CRI-O ARGS> with any additional CRI-O arguments you need
    sudo oke bootstrap --apiserver-host <API SERVER IP> --ca <CA CERT> --manage-gpu-services --crio-extra-args "<EXTRA CRI-O ARGS>"
    
  16. Run the following command to verify that the nodes have been successfully added to the OKE cluster. An additional GPU-specific check is shown after it.

    kubectl get nodes
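
    Beyond checking that the nodes report a Ready status, you may also want to confirm that the GPUs are advertised to Kubernetes. This assumes the NVIDIA device plugin is running on the nodes; the node name is a placeholder.

    # Show node status, internal IP addresses, and OS image
    kubectl get nodes -o wide

    # Confirm that GPUs are exposed as allocatable resources on a node
    kubectl describe node <node-name> | grep -i "nvidia.com/gpu"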
    

Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.