Deploy a Scalable, Distributed File System Using Lustre

Lustre is an open source, parallel, distributed file system used for high-performance computing (HPC) clusters and environments. The name Lustre is a portmanteau of Linux and cluster.

Using Lustre, you can build an HPC file server on Oracle Cloud Infrastructure bare metal Compute and network-attached block storage or NVMe SSDs locally attached to Compute nodes. A Terraform template provides an easy way to deploy Lustre on Oracle Cloud Infrastructure.

Lustre clusters can scale for higher throughput, higher storage capacity, or both. The cost is only a few cents per gigabyte per month for Compute and storage combined.

The Terraform deployment template provisions Oracle Cloud Infrastructure resources, including Compute, storage, virtual cloud networks, and subnets. It also provisions Lustre software, including a Management Server (MGS), a Metadata Server (MDS), an Object Storage Server (OSS), and Lustre client nodes.

Architecture

This reference architecture uses a region with a single availability domain and regional subnets. You can use the same reference architecture in a region with multiple availability domains. We recommend that you use regional subnets for your deployment, regardless of the number of availability domains.

The following diagram illustrates this reference architecture.

[Illustration: lustre-oci.eps]

The scalable Lustre architecture has the following components:

  • Management Server (MGS)

    An MGS stores configuration information for one or more Lustre file systems and provides this information to other Lustre hosts. This global resource can support multiple file systems.

  • Metadata Server (MDS)

    An MDS provides the index, or namespace, for a Lustre file system. The metadata content is stored on volumes called Metadata Targets (MDTs). A Lustre file system’s directory structure and file names, permissions, extended attributes, and file layouts are recorded to MDTs. Each Lustre file system must have a minimum of one MDT.

  • Object Storage Servers (OSS)

    An OSS provides the bulk data storage for all file content in a Lustre file system. Each OSS provides access to a set of storage volumes, called Object Storage Targets (OSTs). Each OST contains several binary objects that represent the data for files in Lustre. Files in Lustre are composed of one or more OST objects, in addition to the metadata inode stored on the MDS.

  • Lustre clients

    Clients are Compute instances that access the Lustre file system.

  • Virtual cloud network (VCN) and subnets

    A VCN is a software-defined network that you set up in an Oracle Cloud Infrastructure region. VCNs can be segmented into subnets, which can be specific to a region or to an availability domain. Both region-specific and availability domain-specific subnets can coexist in the same VCN. A subnet can be public or private.

  • Security lists

    For each subnet, you can create security rules that specify the source, destination, and type of traffic that must be allowed in and out of the subnet.

  • Availability domains

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.
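
To make the relationship between a file and its OST objects more concrete, the following minimal Python sketch models a file that is striped round-robin (RAID-0 style) across several OST objects, which is the kind of file layout that the MDS records. The class `StripedLayout`, its fields `stripe_size` and `stripe_count`, and the method `locate` are hypothetical names chosen for illustration; they are not part of the Terraform template or of any Lustre API, and the 1 MiB stripe size is only an example value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripedLayout:
    """Hypothetical model of a striped file layout (illustration only)."""

    stripe_size: int   # bytes written to one OST object before moving to the next
    stripe_count: int  # number of OST objects that make up the file

    def locate(self, offset: int) -> tuple[int, int]:
        """Map a byte offset in the file to (object index, offset in that object)."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        stripe_number = offset // self.stripe_size
        object_index = stripe_number % self.stripe_count
        # Every full pass over all objects adds one stripe to each object.
        full_rounds = stripe_number // self.stripe_count
        object_offset = full_rounds * self.stripe_size + offset % self.stripe_size
        return object_index, object_offset


MIB = 1024 * 1024
layout = StripedLayout(stripe_size=MIB, stripe_count=4)

for offset in (0, MIB, 4 * MIB, 5 * MIB + 10):
    print(offset, layout.locate(offset))
```

In this model, consecutive 1 MiB chunks of the file land on different objects, so a large sequential read or write can be served by several OSTs at the same time. That is one reason why adding OSS nodes and OSTs can raise aggregate throughput.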

Recommendations

Your requirements might differ from the architecture described here. Use the following recommendations as a starting point.

  • Compute shape, bastion host

    A bastion host is used to access any nodes in the private subnet. Use the VM.Standard.E2.1 or the VM.Standard.E2.2 shape.

  • Compute shape, MGS and MDS

    Because the MGS isn’t resource intensive, you can host the MGS and the MDS on the same instance. To ensure that a node-level outage doesn’t affect the file system, use a bare metal instance with high availability.

  • Bare metal Compute with block volume and high availability

    Use BM.Standard2.52. Two nodes are configured as a pair. Each node has two physical network interface controllers (NICs), each with 25-Gbps network speed. Use one NIC for all traffic to block storage, and use the other NIC for incoming data to the OSS and MDS nodes from client nodes.

    Use block volume storage, with the size and number of volumes chosen according to your deployment requirements, and use multiple-instance attachment to attach each volume to both compute nodes in the pair.

  • Compute shape, OSS

    Our recommendation for OSS is the same as for MGS and MDS.

  • Compute shape, Lustre client

    Choose a virtual machine (VM) shape based on your deployment plans, especially network bandwidth requirements.

    The throughput of an individual client depends on the network bandwidth of its shape. For example, if you deploy 10 clients with 2.5-Gbps network bandwidth each, the aggregate client bandwidth is 25 Gbps.

  • RAID configuration

    Optionally, DenseIO shapes can be configured with RAID 0.

    Use RAID when building one OST per OSS.

    If you are using one OST per OSS, we recommend using eight block volumes per OSS to maximize throughput (RAID 0 is optional).

    Note: The Terraform template builds a bare metal shape with DenseIO or with block volumes.

  • VCN

    When you create the VCN, determine how many IP addresses your cloud resources in each subnet require. Using the Classless Inter-Domain Routing (CIDR) notation, specify a subnet mask and a network address range that's large enough for the required IP addresses. Use an address space that's within the standard private IP address blocks.

    Select an address range that doesn’t overlap with your on-premises network, so that you can set up a connection between the VCN and your on-premises network, if necessary.

    After you create a VCN, you can't change its address range.

    When you design the subnets, consider your functionality and security requirements. Attach all the compute instances within the same tier or role to the same subnet, which can serve as a security boundary.

    Use regional subnets.

  • Security lists

    Use security lists to define ingress and egress rules that apply to the entire subnet. For example, this architecture allows ICMP internally for the entire private subnet.
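
The VCN sizing advice above can be checked before deployment with a short script. The following sketch uses Python's standard `ipaddress` module to verify that a candidate VCN range lies within the standard private blocks and does not overlap an on-premises range, and then carves one subnet per tier. The function name `plan_subnets`, the tier names, the address ranges, and the host counts are all hypothetical. The default of three reserved addresses per subnet is an assumption; verify it against the current Oracle Cloud Infrastructure networking documentation.

```python
import ipaddress

# Standard private IPv4 address blocks (RFC 1918).
PRIVATE_BLOCKS = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def plan_subnets(vcn_cidr, on_prem_cidr, hosts_per_tier, reserved=3):
    """Validate a VCN range and return {tier name: subnet} (illustration only).

    reserved is the assumed number of addresses per subnet that are not
    available to instances.
    """
    vcn = ipaddress.ip_network(vcn_cidr)
    on_prem = ipaddress.ip_network(on_prem_cidr)
    if not any(vcn.subnet_of(block) for block in PRIVATE_BLOCKS):
        raise ValueError(f"{vcn} is outside the standard private address blocks")
    if vcn.overlaps(on_prem):
        raise ValueError(f"{vcn} overlaps the on-premises range {on_prem}")

    plan = {}
    cursor = int(vcn.network_address)
    # Place the largest tiers first so that each subnet starts on its own boundary.
    for tier, hosts in sorted(hosts_per_tier.items(), key=lambda kv: -kv[1]):
        bits = (hosts + reserved - 1).bit_length()  # host bits needed
        size = 1 << bits                            # addresses in the subnet
        cursor = -(-cursor // size) * size          # round up to a multiple of size
        subnet = ipaddress.ip_network((cursor, 32 - bits))
        if not subnet.subnet_of(vcn):
            raise ValueError(f"{vcn} is too small for tier {tier!r}")
        plan[tier] = subnet
        cursor += size
    return plan


plan = plan_subnets(
    vcn_cidr="10.0.0.0/16",
    on_prem_cidr="192.168.0.0/16",
    hosts_per_tier={"bastion": 1, "lustre_servers": 4, "lustre_clients": 10},
)
for tier, subnet in plan.items():
    print(tier, subnet)
```

With these example inputs, the client tier receives a /28, the server tier a /29, and the bastion tier a /30. In practice you would likely choose larger subnets to leave room for growth, because the VCN address range cannot be changed after creation.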

Considerations

  • Performance

    To get the best performance, choose a Compute shape whose network bandwidth matches your throughput requirements.

  • Availability

    Consider using a high-availability option based on your deployment requirement.

  • Cost

    Bare metal instances provide higher network bandwidth at a higher cost. Evaluate your requirements to choose the appropriate Compute shape.

  • Monitoring and alerts

    Set up monitoring and alerts on CPU and memory usage for your MGS, MDS, and OSS nodes to scale the VM shape up or down as needed.
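
When weighing performance against cost, it can help to compare the aggregate bandwidth of the clients with the bandwidth that the OSS nodes can accept. The following sketch repeats the arithmetic from the Lustre client recommendation (10 clients at 2.5 Gbps each) and compares it with the server side. It assumes that each OSS node dedicates one 25-Gbps NIC to client traffic, as recommended earlier. The function names and the count of two OSS nodes are hypothetical, and the results are ideal upper bounds that ignore protocol overhead and storage limits.

```python
def aggregate_gbps(node_count: int, per_node_gbps: float) -> float:
    """Ideal aggregate network bandwidth of identical nodes, in Gbps."""
    return node_count * per_node_gbps


def find_bottleneck(clients, client_gbps, oss_nodes, oss_client_nic_gbps=25.0):
    """Compare client-side and server-side aggregate bandwidth (illustration only)."""
    client_side = aggregate_gbps(clients, client_gbps)
    server_side = aggregate_gbps(oss_nodes, oss_client_nic_gbps)
    return {
        "client_gbps": client_side,
        "server_gbps": server_side,
        "limit_gbps": min(client_side, server_side),
        "limited_by": "clients" if client_side <= server_side else "servers",
    }


# Example values: 10 clients at 2.5 Gbps each, and an assumed pair of OSS nodes.
result = find_bottleneck(clients=10, client_gbps=2.5, oss_nodes=2)
print(result)
```

If the limit is on the client side, adding OSS nodes alone is unlikely to raise throughput; if it is on the server side, adding OSS nodes or choosing shapes with more network bandwidth is the more useful change.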

Deploy

The Terraform code for this reference architecture is available on GitHub.

  1. Go to GitHub.
  2. Clone or download the repository to your local computer.
  3. Follow the instructions in the README document.