Deploy IBM Spectrum LSF with Resource Connector Configured for OCI

Solve the problem of fixed resource allocation by dynamically adjusting the number of resources allocated to a workload based on actual demand with IBM Spectrum LSF resource connector autoscaling. Optimize resource usage, reduce costs, and improve overall efficiency in high-performance computing (HPC) environments.

IBM Spectrum LSF (Load Sharing Facility) is a workload management platform used for distributed computing environments. It allows users to manage and schedule computer jobs across a network of computers or compute clusters, ensuring that jobs are completed efficiently and without disruption.

The resource connector feature of IBM Spectrum LSF (previously referred to as host factory) enables LSF clusters to borrow resources from supported resource providers. When the workload is low, LSF uses the resource connector to reduce the number of allocated resources, which saves costs and improves utilization. When the workload is high, LSF requests more resources from the cloud provider.

Administrative privileges are required to deploy this architecture.

Architecture

This reference architecture shows the IBM Spectrum LSF cluster deployed in an existing subnet with a primary host, cluster nodes (created on demand when the resource connector calls the OCI API), and the Bastion service.

The LSF primary host requires instance_principal authorization to interact with the OCI API. Its default configuration (VM.Standard.E4.Flex / 2 OCPUs / 8 GB) can be adjusted during stack creation.

The LSF resource_connector is preconfigured for the dynamic queue and, depending on the job requirements, can request two types of compute resources from the OCI API: amd2 (VM.Standard.E3.Flex / 2 OCPUs / 4 GB) and amd4 (VM.Standard.E4.Flex / 2 OCPUs / 8 GB). To modify the templates available to the resource_connector, edit the LSF configuration files (<lsf_top>/conf/resource_connector/oci/conf/oci_config.json and <lsf_top>/conf/resource_connector/oci/conf/ociprov_templates.json) and then reload the cluster configuration using these commands:

$ lsadmin reconfig
$ badmin reconfig
$ badmin mbdrestart

By default, the resource_connector can request a maximum of eight hosts from OCI for each available template. If more nodes are required, change maxNumber in the file <lsf_top>/conf/resource_connector/oci/conf/ociprov_templates.json.
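
As a rough illustration of raising this limit, the following sketch edits a template list in memory. The JSON layout used here (a top-level "templates" list whose entries carry "templateId" and "maxNumber") is an assumption made for the example; only the template names amd2 and amd4 and the maxNumber setting come from the text above. Check the actual ociprov_templates.json in your installation before adapting it.

```python
import json

# Hypothetical excerpt of ociprov_templates.json. The field layout is an
# assumption for illustration; the real file may contain more fields or a
# different structure.
SAMPLE = """
{
  "templates": [
    {"templateId": "amd2", "maxNumber": 8},
    {"templateId": "amd4", "maxNumber": 8}
  ]
}
"""


def set_max_number(config: dict, template_id: str, max_number: int) -> dict:
    """Return a copy of config with maxNumber changed for one template."""
    if max_number < 0:
        raise ValueError("max_number must not be negative")
    updated = json.loads(json.dumps(config))  # simple deep copy of JSON data
    for template in updated.get("templates", []):
        if template.get("templateId") == template_id:
            template["maxNumber"] = max_number
            return updated
    raise KeyError(f"template {template_id!r} not found")


config = json.loads(SAMPLE)
new_config = set_max_number(config, "amd4", 16)
print(json.dumps(new_config, indent=2))
```

After a change like this is written back to the real file, the cluster configuration still has to be reloaded with the lsadmin and badmin commands listed earlier.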

The recommended deployment approach is the one-click deployment link, which uses Oracle Cloud Infrastructure Resource Manager.

The following diagram illustrates this reference architecture.



oci-ibm-lfs-architecture-oracle.zip

The architecture has the following components:

  • Tenancy

    A tenancy is a secure and isolated partition that Oracle sets up within Oracle Cloud when you sign up for Oracle Cloud Infrastructure. You can create, organize, and administer your resources in Oracle Cloud within your tenancy. A tenancy is synonymous with a company or organization. Usually, a company will have a single tenancy and reflect its organizational structure within that tenancy. A single tenancy is usually associated with a single subscription, and a single subscription usually only has one tenancy.

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Compartment

    Compartments are cross-region logical partitions within an Oracle Cloud Infrastructure tenancy. Use compartments to organize your resources in Oracle Cloud, control access to the resources, and set usage quotas. To control access to the resources in a given compartment, you define policies that specify who can access the resources and what actions they can perform.

  • Availability domains

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.

  • Fault domains

    A fault domain is a grouping of hardware and infrastructure within an availability domain. Each availability domain has three fault domains with independent power and hardware. When you distribute resources across multiple fault domains, your applications can tolerate physical server failure, system maintenance, and power failures inside a fault domain.

  • Virtual cloud network (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Security list

    For each subnet, you can create security rules that specify the source, destination, and type of traffic that must be allowed in and out of the subnet.

  • Network address translation (NAT) gateway

    A NAT gateway enables private resources in a VCN to access hosts on the internet, without exposing those resources to incoming internet connections.

  • Service gateway

    The service gateway provides access from a VCN to other services, such as Oracle Cloud Infrastructure Object Storage. The traffic from the VCN to the Oracle service travels over the Oracle network fabric and never traverses the internet.

  • Internet gateway

    The internet gateway allows traffic between the public subnets in a VCN and the public internet.

  • Bastion service

    Oracle Cloud Infrastructure Bastion provides restricted and time-limited secure access to resources that don't have public endpoints and that require strict resource access controls, such as bare metal and virtual machines, Oracle MySQL Database Service, Autonomous Transaction Processing (ATP), Oracle Container Engine for Kubernetes (OKE), and any other resource that allows Secure Shell Protocol (SSH) access. With Oracle Cloud Infrastructure Bastion service, you can enable access to private hosts without deploying and maintaining a jump host. In addition, you gain improved security posture with identity-based permissions and a centralized, audited, and time-bound SSH session. Oracle Cloud Infrastructure Bastion removes the need for a public IP for bastion access, eliminating the hassle and potential attack surface when providing remote access.

  • Identity and Access Management (IAM)

    Oracle Cloud Infrastructure Identity and Access Management (IAM) is the access control plane for Oracle Cloud Infrastructure (OCI) and Oracle Cloud Applications. The IAM API and the user interface enable you to manage identity domains and the resources within the identity domain. Each OCI IAM identity domain represents a standalone identity and access management solution or a different user population.

  • Oracle Cloud Infrastructure Resource Manager

    OCI Resource Manager automates deployment and operations for all OCI resources. Using the infrastructure-as-code (IaC) model, the service is based on Terraform.

Recommendations

Use the following recommendations as a starting point to ensure LSF cluster scalability and availability. Your requirements might differ from the architecture described here.
  • VCN and subnets

    When you select an existing subnet, make sure that its CIDR block is large enough to accommodate all compute resources that the LSF resource connector can request.

    In multi-AD regions, use regional subnets.

    Allow all communication within the subnet: add a rule to the subnet's security list that allows all ingress connections from the subnet CIDR block to all destination ports.
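
A quick estimate can help when sizing the subnet. The sketch below assumes the default limits described in the Architecture section (two templates with up to eight hosts each) plus one primary host. It also assumes that OCI reserves three addresses in every subnet. The CIDR blocks are hypothetical examples, and other resources that consume addresses in the subnet are not counted.

```python
import ipaddress

# Assumption: OCI reserves three addresses in every subnet (the first two
# and the last one in the CIDR block).
RESERVED_PER_SUBNET = 3


def usable_addresses(cidr: str) -> int:
    """Number of addresses left for hosts after the reserved ones."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - RESERVED_PER_SUBNET, 0)


def subnet_fits(cidr: str, templates: int, max_per_template: int,
                fixed_hosts: int = 1) -> bool:
    """True if the subnet can hold every dynamic host plus the fixed hosts."""
    required = templates * max_per_template + fixed_hosts
    return usable_addresses(cidr) >= required


# Hypothetical candidate subnets, checked against 2 templates x 8 hosts.
for cidr in ("10.0.1.0/28", "10.0.1.0/27", "10.0.1.0/24"):
    fits = subnet_fits(cidr, templates=2, max_per_template=8)
    print(cidr, usable_addresses(cidr), fits)
```

Under these assumptions a /28 subnet offers 13 usable addresses, which is short of the 17 required, while a /27 subnet offers 29. Raising maxNumber increases the requirement accordingly.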

Considerations

When provisioning, consider the following aspects.

  • IBM Spectrum LSF binaries

    The binaries and the license required to install and run LSF are not included. This deployment was tested with LSF version 10.1 and patch version 601088.

    Before deployment, download the following files from the IBM support portal, load them into an OCI Object Storage bucket, and create pre-authenticated requests for them.

    • lsf10.1_lsfinstall.tar.Z
    • lsf10.1_lnx310-lib217-x86_64.tar.Z
    • lsf10.1_lnx310-lib217-x86_64-601088.tar.Z
    • lsf_entitlement.dat
  • VCN

    DNS resolution must be enabled for the VCN and subnet used for the LSF primary host.
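
Before starting the deployment, it can be useful to confirm that a pre-authenticated request exists for each of the four LSF files listed above. The following sketch compares the last path component of each URL with the required file names. The URLs are placeholders with a made-up host and path, not real pre-authenticated request URLs. The check does not verify that a request is valid or unexpired.

```python
from urllib.parse import unquote, urlparse

# File names taken from the list of required LSF files above.
REQUIRED_FILES = [
    "lsf10.1_lsfinstall.tar.Z",
    "lsf10.1_lnx310-lib217-x86_64.tar.Z",
    "lsf10.1_lnx310-lib217-x86_64-601088.tar.Z",
    "lsf_entitlement.dat",
]


def missing_files(urls):
    """Return the required file names that none of the URLs point to."""
    found = set()
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme != "https":
            continue  # ignore anything that is not an HTTPS URL
        # The object name is taken to be the last path component.
        found.add(unquote(parsed.path).rsplit("/", 1)[-1])
    return [name for name in REQUIRED_FILES if name not in found]


# Placeholder URLs for illustration only (hypothetical host and path).
example_urls = [
    "https://objectstorage.example.invalid/p/TOKEN/o/lsf10.1_lsfinstall.tar.Z",
    "https://objectstorage.example.invalid/p/TOKEN/o/lsf10.1_lnx310-lib217-x86_64.tar.Z",
    "https://objectstorage.example.invalid/p/TOKEN/o/lsf10.1_lnx310-lib217-x86_64-601088.tar.Z",
]

missing = missing_files(example_urls)
print("Missing files:", missing)
```

In this example the entitlement file has no URL, so it is reported as missing.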

Deploy

The Terraform code to deploy the solution is available in GitHub.

  1. Go to GitHub.
  2. Clone or download the repository to your local computer.
  3. Follow the instructions in the README document.

Explore More

Learn more about IBM Spectrum LSF, the IBM Spectrum LSF resource connector, and OCI.

Review these additional resources:

Acknowledgments

Authors: Chandrashekar Avadhani, Andrei Ilas

Contributors: John Sulyok