The demands of parallel computing workloads in simulation and modeling can now be cost-effectively managed in the cloud.
Deploy high-performance computing (HPC) resources in a high-bandwidth, low-latency cloud network with performance that rivals that of on-premises HPC networks, but with the cost and operational advantages that cloud computing offers.
Cluster Networking is an Oracle Cloud Infrastructure technology that allows HPC instances to communicate over a high-bandwidth, low-latency network. Each node in the cluster is a bare metal machine located in close physical proximity to the other nodes. Remote direct memory access (RDMA) networking between nodes provides latency below two microseconds, which is comparable to on-premises HPC clusters. Oracle uses the RDMA over Converged Ethernet (RoCE v2) protocol for cluster networking.
Cluster networks are designed for highly demanding parallel computing workloads, including the following:
Computational fluid dynamics simulations for automotive or aerospace modeling
Financial modeling and risk analysis
Trajectory analysis and design for space exploration
Artificial intelligence and big data workloads
Cluster networks are supported in the following:
Virtual cloud network
Bastion host in a public subnet
HPC compute nodes in private subnet
This reference architecture deploys a head node, which runs the scheduler and also serves as the bastion server for access to the cluster.
You can create a visualization node, such as a GPU virtual machine (VM) or bare metal machine, depending on your requirements. We recommend placing the visualization node in the public subnet. HPC workloads often require visualization tools for pre- or post-processing, monitoring, or analyzing the output of the simulations. You can deploy an NVIDIA GRID-enabled workstation from Oracle Cloud Marketplace.
This architecture is deployed using public and private virtual cloud networks (VCNs). The customer network can access the head node and visualization node only through IPSec VPN, Oracle Cloud Infrastructure FastConnect, or public internet.
The architecture uses a region with one availability domain and regional subnets. You can use the same architecture in a region with multiple availability domains. We recommend that you use regional subnets for your deployment, regardless of the number of availability domains.
You can access these cluster networks from Oracle Cloud Marketplace or deploy them manually. In either case, we recommend using the baseline reference architecture and then adjusting it to meet your specific requirements.
The following diagram illustrates this reference architecture.
The architecture has the following components:
- Region
A region is a localized geographic area composed of one or more availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or continents).
- Availability domains
Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.
- Fault domains
A fault domain is a grouping of hardware and infrastructure within an availability domain. Each availability domain has three fault domains with independent power and hardware. When you place Compute instances across multiple fault domains, applications can tolerate physical server failure, system maintenance, and many common networking and power failures inside the availability domain.
- Virtual cloud network (VCN) and subnets
A VCN is a software-defined network that you set up in an Oracle Cloud Infrastructure region. VCNs can be segmented into subnets, which can be specific to a region or to an availability domain. Both region-specific and availability domain-specific subnets can coexist in the same VCN. A subnet can be public or private.
- Bastion host
The bastion host is a compute instance that serves as a secure, controlled entry point to the topology from outside the cloud. The bastion host is typically provisioned in a demilitarized zone (DMZ). It enables you to protect sensitive resources by placing them in private networks that can't be accessed directly from outside the cloud. The topology has a single, known entry point that you can monitor and audit regularly. So, you can avoid exposing the more sensitive components of the topology without compromising access to them.
- HPC cluster node
The head node provisions and deprovisions these compute nodes, which form an RDMA-enabled cluster (100 Gbps RoCE v2 isolated network). They process the data stored in file storage and return the results to file storage.
- Visualization node
The visualization node generally has a 2D or 3D application installed for visual representation and analysis of data processed by HPC cluster nodes.
- Security lists
For each subnet, you can create security rules that specify the source, destination, and type of traffic that must be allowed in and out of the subnet.
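The way security rules gate traffic into a subnet can be illustrated with a minimal Python sketch. It is not the Oracle Cloud Infrastructure API: the `IngressRule` class, the `is_allowed` function, and the example CIDR blocks are all hypothetical names and values chosen for illustration, and the sketch assumes allow-only rules where a packet passes if any rule matches.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical model of one allow rule (not an OCI SDK type)."""
    source_cidr: str        # CIDR block that is allowed to send traffic
    protocol: str           # "tcp", "udp", or "all"
    port_min: int = 0       # lowest destination port allowed
    port_max: int = 65535   # highest destination port allowed


def is_allowed(rules, source_ip, protocol, port):
    """Return True if at least one rule permits the packet."""
    addr = ip_address(source_ip)
    for rule in rules:
        if addr not in ip_network(rule.source_cidr):
            continue  # source is outside this rule's CIDR block
        if rule.protocol not in ("all", protocol):
            continue  # protocol does not match
        if rule.port_min <= port <= rule.port_max:
            return True
    return False  # nothing matched, so the traffic is dropped


# Assumed example: the public subnet accepts SSH from anywhere.
public_subnet_rules = [IngressRule("0.0.0.0/0", "tcp", 22, 22)]

# Assumed example: the private subnet accepts traffic only from inside a
# VCN that uses the (hypothetical) 10.0.0.0/16 address range.
private_subnet_rules = [IngressRule("10.0.0.0/16", "all")]

if __name__ == "__main__":
    print(is_allowed(public_subnet_rules, "203.0.113.5", "tcp", 22))
    print(is_allowed(private_subnet_rules, "203.0.113.5", "tcp", 22))
```

In this sketch an outside client can reach the bastion host over SSH, but the same client cannot reach the HPC compute nodes in the private subnet directly.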
Use the following recommendations as a starting point to deploy high-performance computing (HPC) on Oracle Cloud Infrastructure.
Your requirements might differ from the architecture described here.
- VCN
When you create the VCN, determine how many IP addresses your cloud resources in each subnet require. Using the Classless Inter-Domain Routing (CIDR) notation, specify a subnet mask and a network address range that's large enough for the required IP addresses. Use an address range that's within the standard private IP address space.
Select an address range that doesn’t overlap with your on-premises network, so that you can set up a connection between the VCN and your on-premises network, if necessary.
After you create a VCN, you can't change its address range.
When you design the subnets, consider your traffic flow and security requirements. Attach all the resources within a specific tier or role to the same subnet, which can serve as a security boundary.
Use regional subnets.
- Security lists
Use security lists to define ingress and egress rules that apply to the entire subnet.
- Bastion node
Use the VM.Standard.2.8 Compute shape. Since the node is used as a bastion host and to schedule HPC jobs, it doesn’t require locally attached storage or GPU processing.
- Visualization node
Use the VM.GPU3.2 Compute shape because this node is used for visualization and is likely to have a graphics-intensive application installed.
- HPC Cluster node
Use the BM.HPC2.36 Compute shape. This shape has 36 cores from two 3.7 GHz Intel Xeon Gold 6154 processors, 384 GB of RAM, and 6.4 TB of local NVMe storage. By using powerful NVIDIA GPUs available on Oracle Cloud Infrastructure, you can post-process results on the cloud through remote visualization.
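The VCN sizing and overlap recommendations above can be checked before deployment with a short Python sketch that uses only the standard `ipaddress` module. The function names, the example CIDR blocks, and the default of three reserved addresses per subnet are assumptions made for illustration; confirm the reserved-address count against the current service documentation.

```python
from ipaddress import ip_network

# The standard private IPv4 address space (RFC 1918).
PRIVATE_RANGES = [
    ip_network("10.0.0.0/8"),
    ip_network("172.16.0.0/12"),
    ip_network("192.168.0.0/16"),
]


def smallest_prefix_for(hosts_needed, reserved=3):
    """Return the longest prefix length whose block holds the hosts.

    `reserved` is an assumed number of addresses the platform keeps
    for itself in every subnet.
    """
    total = hosts_needed + reserved
    prefix = 32
    while 2 ** (32 - prefix) < total:
        prefix -= 1
    return prefix


def check_plan(vcn_cidr, subnet_cidrs, on_prem_cidr):
    """Return a list of problems found in the address plan (empty if none)."""
    vcn = ip_network(vcn_cidr)
    on_prem = ip_network(on_prem_cidr)
    subnets = [ip_network(c) for c in subnet_cidrs]
    problems = []
    if not any(vcn.subnet_of(r) for r in PRIVATE_RANGES):
        problems.append(f"{vcn} is outside the private address space")
    if vcn.overlaps(on_prem):
        problems.append(f"{vcn} overlaps on-premises range {on_prem}")
    for subnet in subnets:
        if not subnet.subnet_of(vcn):
            problems.append(f"{subnet} is not inside {vcn}")
    for i, first in enumerate(subnets):
        for second in subnets[i + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    # Hypothetical plan: one public and one private subnet.
    print(smallest_prefix_for(64))
    print(check_plan("10.0.0.0/16", ["10.0.0.0/24", "10.0.1.0/24"],
                     "192.168.0.0/16"))
```

Running the check before you create the VCN gives you the chance to pick a different range while the address plan is still easy to change.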
When deploying high-performance computing (HPC) on Oracle Cloud Infrastructure, consider these implementation options.
- Performance
To get the best performance, choose the correct compute shape with appropriate bandwidth.
- Availability
Consider using a high-availability option based on your deployment requirements and region. Options include using multiple availability domains in a region and fault domains.
- Cost
A bare metal GPU instance provides necessary CPU power for a higher cost. Evaluate your requirements to choose the appropriate compute shape.
- Monitoring and alerts
Set up monitoring and alerts on CPU and memory usage for your nodes, so that you can scale the shape up or down as needed.
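The monitoring consideration above amounts to comparing recent utilization against thresholds. The following Python sketch shows that decision logic under stated assumptions: the `scaling_advice` function and the 80 percent and 20 percent thresholds are hypothetical, and the samples would in practice come from whatever monitoring service you use.

```python
from statistics import mean


def scaling_advice(cpu_samples, mem_samples, high=80.0, low=20.0):
    """Suggest a shape change from recent utilization percentages.

    `high` and `low` are illustrative thresholds, not recommended values.
    """
    cpu = mean(cpu_samples)
    mem = mean(mem_samples)
    if cpu > high or mem > high:
        return "scale up"      # either resource is close to saturation
    if cpu < low and mem < low:
        return "scale down"    # both resources are mostly idle
    return "keep"


if __name__ == "__main__":
    # Hypothetical utilization samples, in percent.
    print(scaling_advice([91.0, 88.5, 95.2], [40.0, 42.5, 39.0]))
```

Scaling up when either metric is high but scaling down only when both are low is a deliberately conservative choice, so that a memory-bound job is not moved to a smaller shape just because its CPU use is light.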
The Terraform code to deploy this reference architecture is available as a stack in Oracle Cloud Marketplace.
- Go to Oracle Cloud Marketplace.
- Click Get App.
- Follow the on-screen prompts.