Deploy High-performance Computing (HPC) on Oracle Cloud Infrastructure

The demands of parallel computing workloads in simulation and modeling can now be cost-effectively managed in the cloud.

Deploy high-performance computing (HPC) resources in a high-bandwidth, low-latency cloud network with performance that rivals that of on-premises HPC networks, but with the cost and operational advantages that cloud computing offers.

Cluster Networking is an Oracle Cloud Infrastructure technology that allows HPC instances to communicate over a high-bandwidth, low-latency network. Each node in the cluster is a bare metal machine located in close physical proximity to the other nodes. Remote direct memory access (RDMA) networking between nodes provides latency below two microseconds, which is comparable to on-premises HPC clusters. Oracle uses the RDMA over Converged Ethernet (RoCE v2) protocol for cluster networking.
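
To get a feel for what these figures mean for message passing, the following minimal sketch estimates one-way transfer time with a simple latency-plus-bandwidth model. It assumes the two-microsecond upper bound on latency stated above and the 100 Gbps link speed quoted later for the HPC cluster nodes. The function name and the model itself are illustrative assumptions; protocol overhead, congestion, and message pipelining are ignored.

```python
def transfer_time_us(message_bytes: int,
                     latency_us: float = 2.0,
                     bandwidth_gbps: float = 100.0) -> float:
    """Estimate one-way transfer time in microseconds.

    Simplified model (an assumption, not a measured result):
        time = latency + message size / bandwidth
    """
    bits = message_bytes * 8
    # 1 Gbps equals 1,000 bits per microsecond.
    serialization_us = bits / (bandwidth_gbps * 1_000)
    return latency_us + serialization_us


# Small messages are dominated by latency, large ones by bandwidth.
for size in (8, 4_096, 1_048_576):
    print(f"{size:>9} bytes: {transfer_time_us(size):8.2f} us")
```

Under this model, tightly coupled workloads that exchange many small messages benefit mostly from the low latency, while bulk data exchange benefits mostly from the link bandwidth.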

Cluster networks are designed for highly demanding parallel computing workloads, including the following:

  • Computational fluid dynamics simulations for automotive or aerospace modeling

  • Crash simulation

  • Financial modeling and risk analysis

  • Biomedical simulations

  • Trajectory analysis and design for space exploration

  • Artificial intelligence and big data workloads

Cluster networks are supported in the following:

  • Virtual cloud network

    • Public subnet

    • Private subnet

    • Internet gateway

    • NAT gateway

  • Compute nodes

    • Bastion host in a public subnet

    • HPC compute nodes in private subnet

Architecture

This reference architecture deploys a head node, which runs the scheduler and also serves as the bastion host for access to the cluster.

You can create a visualization node, such as a GPU virtual machine (VM) or bare metal machine, depending on your requirements. We recommend placing the visualization node in the public subnet. HPC workloads often require visualization tools for pre- or post-processing, monitoring, or analyzing the output of the simulations. You can deploy an NVIDIA GRID-enabled workstation from Oracle Cloud Marketplace.

This architecture is deployed using public and private virtual cloud networks (VCNs). The customer network can access the head node and visualization node only through IPSec VPN, Oracle Cloud Infrastructure FastConnect, or the public internet.

The architecture uses a region with one availability domain and regional subnets. You can use the same architecture in a region with multiple availability domains. We recommend that you use regional subnets for your deployment, regardless of the number of availability domains.

You can access these cluster networks from Oracle Cloud Marketplace or deploy them manually. In either case, we recommend using the baseline reference architecture and then adjusting it to meet your specific requirements.

The following diagram illustrates this reference architecture.



hpc-oci-architecture.zip

The architecture has the following components:

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Availability domains

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.

  • Fault domains

    A fault domain is a grouping of hardware and infrastructure within an availability domain. Each availability domain has three fault domains with independent power and hardware. When you distribute resources across multiple fault domains, your applications can tolerate physical server failure, system maintenance, and power failures inside a fault domain.

  • Virtual cloud network (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Bastion host

    The bastion host is a compute instance that serves as a secure, controlled entry point to the topology from outside the cloud. The bastion host is typically provisioned in a demilitarized zone (DMZ). It enables you to protect sensitive resources by placing them in private networks that can't be accessed directly from outside the cloud. The topology has a single, known entry point that you can monitor and audit regularly, so you can avoid exposing the more sensitive components of the topology without compromising access to them.

  • HPC cluster node

    The head node provisions and deprovisions these compute nodes, which form an RDMA-enabled cluster on an isolated 100 Gbps RoCE v2 network. The nodes process the data stored in file storage and return the results to file storage.

  • Visualization node

    The visualization node generally has a 2D or 3D application installed for visual representation and analysis of data processed by HPC cluster nodes.

  • Security list

    For each subnet, you can create security rules that specify the source, destination, and type of traffic that must be allowed in and out of the subnet.
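
The way security rules gate traffic into a subnet can be sketched as follows. This is a simplified, hypothetical model, not the Oracle Cloud Infrastructure API: the `IngressRule` class, the `is_allowed` function, and the CIDR blocks are assumptions chosen for illustration. It treats the rules as an allow list, so traffic that matches no rule is denied.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Iterable, Optional


@dataclass(frozen=True)
class IngressRule:
    """One allow rule (hypothetical model, not the OCI API)."""
    source_cidr: str              # network the traffic may come from
    protocol: str                 # "tcp", "udp", or "all"
    port: Optional[int] = None    # None means any port


def is_allowed(rules: Iterable[IngressRule],
               source_ip: str, protocol: str, port: int) -> bool:
    """Return True if at least one rule permits the traffic."""
    src = ip_address(source_ip)
    for rule in rules:
        if src not in ip_network(rule.source_cidr):
            continue
        if rule.protocol not in ("all", protocol):
            continue
        if rule.port is not None and rule.port != port:
            continue
        return True
    return False  # no matching rule: traffic is denied


# Assumed layout: public subnet 10.0.0.0/24 (bastion host),
# private subnet 10.0.1.0/24 (HPC compute nodes).
PRIVATE_SUBNET_RULES = [
    IngressRule("10.0.0.0/24", "tcp", 22),  # SSH from the bastion host
    IngressRule("10.0.1.0/24", "all"),      # any traffic between compute nodes
]

print(is_allowed(PRIVATE_SUBNET_RULES, "10.0.0.5", "tcp", 22))     # bastion host
print(is_allowed(PRIVATE_SUBNET_RULES, "203.0.113.9", "tcp", 22))  # outside address
```

With these assumed rules, the compute nodes accept SSH only from the public subnet that holds the bastion host, which matches the single-entry-point design described for the bastion host.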

Recommendations

Use the following recommendations as a starting point to deploy high-performance computing (HPC) on Oracle Cloud Infrastructure.

Your requirements might differ from the architecture described here.

  • VCN

    When you create a VCN, determine the number of CIDR blocks required and the size of each block based on the number of resources that you plan to attach to subnets in the VCN. Use CIDR blocks that are within the standard private IP address space.

    Select CIDR blocks that don't overlap with any other network (in Oracle Cloud Infrastructure, your on-premises data center, or another cloud provider) to which you intend to set up private connections.

    After you create a VCN, you can change, add, and remove its CIDR blocks.

    When you design the subnets, consider your traffic flow and security requirements. Attach all the resources within a specific tier or role to the same subnet, which can serve as a security boundary.

    Use regional subnets.

  • Security lists

    Use security lists to define ingress and egress rules that apply to the entire subnet.

  • Bastion node

    Use the VM.Standard.2.8 Compute shape. Since the node is used as a bastion host and to schedule HPC jobs, it doesn’t require locally attached storage or GPU processing.

  • Visualization node

    Use the VM.GPU3.2 Compute shape, because this node is used for visualization and is likely to run a graphics-intensive application.

  • HPC cluster node

    Use the BM.HPC2.36 Compute shape. This shape has 36 cores from two 3.7 GHz Intel Xeon Gold 6154 processors, 384 GB of RAM, and 6.4 TB of local NVMe storage. By using the powerful NVIDIA GPUs available on Oracle Cloud Infrastructure, you can post-process results in the cloud through remote visualization.
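
The VCN recommendations above (private address space, no overlap with connected networks, one subnet per tier) can be checked before deployment. The following minimal sketch uses Python's standard `ipaddress` module. The function name `validate_cidrs` and all CIDR values are assumptions for illustration; substitute the ranges you actually plan to use.

```python
from ipaddress import ip_network

# The standard private IPv4 address ranges (RFC 1918).
RFC1918_BLOCKS = [
    ip_network("10.0.0.0/8"),
    ip_network("172.16.0.0/12"),
    ip_network("192.168.0.0/16"),
]


def validate_cidrs(vcn_cidr: str, other_networks: list[str]) -> list[str]:
    """Return a list of problems found with a planned VCN CIDR block."""
    vcn = ip_network(vcn_cidr)
    problems = []
    if not any(vcn.subnet_of(block) for block in RFC1918_BLOCKS):
        problems.append(f"{vcn} is outside the private address space")
    for other in other_networks:
        if vcn.overlaps(ip_network(other)):
            problems.append(f"{vcn} overlaps {other}")
    return problems


# Hypothetical plan: one VCN block, checked against an on-premises network.
print(validate_cidrs("10.0.0.0/16", ["192.168.0.0/24"]))  # no problems expected
print(validate_cidrs("10.0.0.0/16", ["10.0.5.0/24"]))     # overlap reported

# Carve one /24 per tier: a public subnet for the bastion host and a
# private subnet for the HPC compute nodes.
subnets = ip_network("10.0.0.0/16").subnets(new_prefix=24)
public_subnet = next(subnets)
private_subnet = next(subnets)
print(public_subnet, private_subnet)
```

Size the private subnet for the largest cluster you expect to run, since every compute node needs an address in it.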

Considerations

When deploying high-performance computing (HPC) on Oracle Cloud Infrastructure, consider these implementation options.

  • Performance

    To get the best performance, choose the correct compute shape with appropriate bandwidth.

  • Availability

    Consider using a high-availability option based on your deployment requirements and region. Options include using multiple availability domains in a region and fault domains.

  • Cost

    A bare metal GPU instance provides the necessary processing power, but at a higher cost. Evaluate your requirements to choose the appropriate compute shape.

  • Monitoring and alerts

    Set up monitoring and alerts on CPU and memory usage for your nodes, so that you can scale the shape up or down as needed.
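
When weighing performance against cost, it can help to translate workload requirements into a node count. The following minimal sketch does this for the BM.HPC2.36 shape, using the core, memory, and storage figures from the recommendations above. The `Shape` class, the helper functions, and the workload numbers (500 cores, 4,000 GB of memory) are hypothetical. No pricing is included.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    name: str
    cores: int
    ram_gb: int
    local_nvme_tb: float


# Figures for BM.HPC2.36 as stated in the recommendations.
BM_HPC2_36 = Shape("BM.HPC2.36", cores=36, ram_gb=384, local_nvme_tb=6.4)


def nodes_needed(shape: Shape, required_cores: int, required_ram_gb: int) -> int:
    """Smallest node count that satisfies both the core and memory requirement."""
    by_cores = math.ceil(required_cores / shape.cores)
    by_ram = math.ceil(required_ram_gb / shape.ram_gb)
    return max(by_cores, by_ram, 1)


def cluster_capacity(shape: Shape, nodes: int) -> dict:
    """Aggregate resources of a cluster built from identical nodes."""
    return {
        "cores": shape.cores * nodes,
        "ram_gb": shape.ram_gb * nodes,
        "local_nvme_tb": round(shape.local_nvme_tb * nodes, 1),
    }


# Hypothetical workload: 500 cores and 4,000 GB of memory.
count = nodes_needed(BM_HPC2_36, required_cores=500, required_ram_gb=4_000)
print(count, cluster_capacity(BM_HPC2_36, count))
```

In this example the core requirement, not memory, determines the node count, which suggests the workload is compute-bound on this shape.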

Deploy

The Terraform code to deploy this reference architecture is available as a stack in Oracle Cloud Marketplace. You can also download the code from GitHub and customize it to your requirements.

  • Deploy using the stack in Oracle Cloud Marketplace:
    1. Go to Oracle Cloud Marketplace.
    2. Click Get App.
    3. Follow the on-screen prompts.
  • Deploy using the code in GitHub:
    1. Go to GitHub.
    2. Clone or download the repository to your local computer.
    3. Follow the instructions in the README document.
