Computational frameworks used for deep learning and scientific computing are specialized workloads that require specialized Compute shapes. Oracle Cloud Infrastructure (OCI) offers a wide variety of options, from bare metal to virtual machine (VM) GPU shapes. NVIDIA GPU Cloud (NGC) is one example of the options available on OCI.
You can use this reference architecture for multiple applications related to deep learning and scientific computing.
In this example, the architecture is used to run NVIDIA Clara Parabricks, a GPU-based computational framework for genomics applications that speeds up the analysis of whole genomes. For example, all 3 billion base pairs in human chromosomes can be analyzed in under an hour. Clara Parabricks can establish patterns in protein folding, protein-ligand binding, and cell membrane transport, making it a useful application for drug research and discovery.
NVIDIA Clara Parabricks includes the following features:
- Uses NVIDIA’s CUDA, HPC, AI, and data analytics stacks.
- Provides C++ and Python APIs, reference applications, and integrations with third-party applications and workflows for high-performance computing, deep learning, and data analytics tools in genomics.
- Includes the Clara Parabricks Toolkit, which you can use to develop AI-assisted workflows that optimize mapping, aligning, and polishing for de novo genome assembly.
In this simple reference architecture, a GPU node with Block Storage is deployed in a VCN with a public subnet and an internet gateway. All applications are stored in Block Storage.
The following diagram illustrates this reference architecture.
[Diagram: hpc-cloud-guard.png]
The architecture has the following components:
- Region
An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).
- Availability domains
Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.
- Fault domains
A fault domain is a grouping of hardware and infrastructure within an availability domain. Each availability domain has three fault domains with independent power and hardware. When you distribute resources across multiple fault domains, your applications can tolerate physical server failure, system maintenance, and power failures inside a fault domain.
- Virtual cloud network (VCN) and subnets
A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.
- Cloud Guard
You can use Oracle Cloud Guard to monitor and maintain the security of your resources in Oracle Cloud Infrastructure. Cloud Guard uses detector recipes that you can define to examine your resources for security weaknesses and to monitor operators and users for risky activities. When any misconfiguration or insecure activity is detected, Cloud Guard recommends corrective actions and assists with taking those actions, based on responder recipes that you can define.
- BM GPU
Use a Bare Metal GPU shape for hardware-accelerated analytics and other computations.
- Block Storage
Store your applications in Block Storage.
- Internet gateway
The internet gateway allows traffic between the public subnets in a VCN and the public internet.
- Security list
For each subnet, you can create security rules that specify the source, destination, and type of traffic that must be allowed in and out of the subnet.
- Route table
Virtual route tables contain rules to route traffic from subnets to destinations outside a VCN, typically through gateways.
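The fault-domain description above can be illustrated with a small placement sketch. This reference architecture deploys a single GPU node, so the example only applies if you extend it to several nodes. The node names are hypothetical, the fault-domain labels are assumed to follow the `FAULT-DOMAIN-n` pattern, and the round-robin policy is one simple choice rather than anything OCI prescribes.

```python
from collections import Counter
from itertools import cycle

# Assumption: each availability domain exposes three fault domains,
# labeled as below. Adjust the labels if your tenancy reports others.
FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def spread_across_fault_domains(instance_names, fault_domains=FAULT_DOMAINS):
    """Assign each instance to a fault domain in round-robin order."""
    return dict(zip(instance_names, cycle(fault_domains)))


# Hypothetical node names, used only for illustration.
nodes = [f"gpu-node-{i}" for i in range(1, 6)]
placement = spread_across_fault_domains(nodes)

for name, fault_domain in placement.items():
    print(f"{name} -> {fault_domain}")

# Count how many nodes landed in each fault domain.
counts = Counter(placement.values())
print(dict(counts))
```

With round-robin assignment, the number of nodes per fault domain differs by at most one, so a hardware failure or maintenance event in one fault domain affects roughly a third of the nodes.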
Your requirements might differ from the architecture described here. Use the following recommendations as a starting point.
- VCN
When you create a VCN, determine the number of CIDR blocks required and the size of each block based on the number of resources that you plan to attach to subnets in the VCN. Use CIDR blocks that are within the standard private IP address space.
Select CIDR blocks that don't overlap with any other network (in Oracle Cloud Infrastructure, your on-premises data center, or another cloud provider) to which you intend to set up private connections.
After you create a VCN, you can change, add, and remove its CIDR blocks.
When you design the subnets, consider your traffic flow and security requirements. Attach all the resources within a specific tier or role to the same subnet, which can serve as a security boundary.
- Security lists
Use security lists to define ingress and egress rules that apply to the entire subnet.
- Cloud Guard
Clone and customize the default recipes provided by Oracle to create custom detector and responder recipes. These recipes enable you to specify what type of security violations generate a warning and what actions are allowed to be performed on them. For example, you might want to detect Object Storage buckets that have visibility set to public.
Apply Cloud Guard at the tenancy level to cover the broadest scope and to reduce the administrative burden of maintaining multiple configurations.
You can also use the Managed List feature to apply certain configurations to detectors.
- BM GPU
For best performance, use bare metal shapes BM.GPU2.2 or BM.GPU3.8.
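The VCN recommendations above (private address space, no overlap with connected networks) can be checked before deployment. The following is a minimal sketch using Python's standard `ipaddress` module. The function name `validate_cidr_plan` and all CIDR values are hypothetical, and "standard private IP address space" is assumed to mean the RFC 1918 ranges.

```python
import ipaddress
from itertools import combinations

# Assumption: "standard private IP address space" means the RFC 1918 ranges.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def validate_cidr_plan(vcn_cidrs, other_networks):
    """Return a list of human-readable problems found in a CIDR plan."""
    problems = []
    vcn_nets = [ipaddress.ip_network(c) for c in vcn_cidrs]
    other_nets = [ipaddress.ip_network(c) for c in other_networks]

    for net in vcn_nets:
        # Each VCN block should sit inside one of the private ranges.
        if not any(net.subnet_of(private) for private in RFC1918):
            problems.append(f"{net} is outside the private address space")
        # A VCN block must not overlap a network you plan to connect to.
        for other in other_nets:
            if net.overlaps(other):
                problems.append(f"{net} overlaps connected network {other}")

    # The VCN's own blocks must not overlap each other.
    for a, b in combinations(vcn_nets, 2):
        if a.overlaps(b):
            problems.append(f"VCN blocks {a} and {b} overlap")
    return problems


# Hypothetical plan: one VCN block, one on-premises network.
print(validate_cidr_plan(["10.0.0.0/16"], ["172.16.0.0/16"]))

# Carving the VCN block into /24 subnets (first two shown).
subnets = list(ipaddress.ip_network("10.0.0.0/16").subnets(new_prefix=24))
print(subnets[:2])
```

An empty list means the plan passed these checks. The checks cover only the two recommendations named above, not OCI's own limits on block size or count.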
Consider the following points when deploying this reference architecture.
- Performance
To get the best performance, choose the correct compute shape with appropriate bandwidth.
- Availability
Consider using a high-availability option, based on your deployment requirements and region. Options include using multiple availability domains in a region and multiple fault domains within an availability domain.
- Cost
A bare metal GPU instance provides the necessary CPU power, but at a higher cost. Evaluate your requirements to choose the appropriate Compute shape.
- Monitoring and Alerts
Set up monitoring and alerts on CPU and memory usage for your nodes, so that you can scale the shape up or down as needed.
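The scaling decision described above can be sketched as a simple threshold check over recent utilization samples. The function name, the 80% and 20% thresholds, and the sample values are all hypothetical. In practice the samples would come from your monitoring service, and the thresholds should be tuned to your workload.

```python
from statistics import mean


def scaling_recommendation(cpu_samples, mem_samples, high=80.0, low=20.0):
    """Suggest a scaling action from utilization percentages (0-100).

    The thresholds are illustrative defaults, not values taken from
    the reference architecture.
    """
    cpu_avg = mean(cpu_samples)
    mem_avg = mean(mem_samples)
    # Either resource running hot is enough reason to scale up.
    if cpu_avg > high or mem_avg > high:
        return "scale-up"
    # Scale down only when both resources are mostly idle.
    if cpu_avg < low and mem_avg < low:
        return "scale-down"
    return "keep"


# Hypothetical samples for one node over a monitoring window.
print(scaling_recommendation([90, 85, 95], [40, 45, 50]))
print(scaling_recommendation([10, 12, 8], [15, 10, 5]))
print(scaling_recommendation([50, 55, 60], [30, 35, 40]))
```

Scaling up is triggered by either metric, while scaling down requires both to be low. This asymmetry keeps the check from shrinking a shape that is still constrained on one resource.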
The Terraform code for this reference architecture is available on GitHub. You can pull the code into Oracle Cloud Infrastructure Resource Manager with a single click, create the stack, and deploy it. Alternatively, you can download the code from GitHub to your computer, customize the code, and deploy the architecture by using the Terraform CLI.
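For the Terraform CLI path, a typical run is `init`, then `plan`, then `apply`. The sketch below builds those commands and, by default, only prints them. The directory name `./reference-arch` and the `terraform.tfvars` file name are hypothetical, and the repository's own instructions take precedence over this outline.

```python
import shlex
import subprocess


def terraform_commands(workdir, var_file=None):
    """Build the init/plan/apply command lines for a Terraform directory."""
    # -chdir makes Terraform switch to workdir first, so the plan file
    # and var file paths below are resolved relative to that directory.
    base = ["terraform", f"-chdir={workdir}"]
    plan = base + ["plan", "-input=false", "-out=tfplan"]
    if var_file:
        plan.append(f"-var-file={var_file}")
    return [
        base + ["init", "-input=false"],
        plan,
        base + ["apply", "-input=false", "tfplan"],
    ]


def deploy(workdir, var_file=None, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in terraform_commands(workdir, var_file):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sequence if a step fails.
            subprocess.run(cmd, check=True)


# Hypothetical paths; dry run only, so nothing is executed here.
deploy("./reference-arch", var_file="terraform.tfvars")
```

Saving the plan to a file and applying that file means the changes you reviewed are the ones that get applied, which mirrors the Plan and Apply actions in Resource Manager.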
- Deploy using the sample stack in Oracle Cloud Infrastructure Resource Manager:
If you aren't already signed in, enter the tenancy and user credentials.
- Review and accept the terms and conditions.
- Select the region where you want to deploy the stack.
- Follow the on-screen prompts and instructions to create the stack.
- After creating the stack, click Terraform Actions, and select Plan.
- Wait for the job to be completed, and review the plan.
To make any changes, return to the Stack Details page, click Edit Stack, and make the required changes. Then, run the Plan action again.
- If no further changes are necessary, return to the Stack Details page, click Terraform Actions, and select Apply.
- Deploy using the Terraform code in GitHub:
- Go to GitHub.
- Clone or download the repository to your local computer.
- Follow the instructions in the
Learn more about the features of this architecture.
- Best practices framework for Oracle Cloud Infrastructure
- For more information, refer to the NVIDIA Clara Parabricks documentation.