Managing Cluster Networks

A cluster network is a pool of high-performance computing (HPC), GPU, or Optimized instances that are connected with a high-bandwidth, ultra-low-latency network. Each node in the cluster is a bare metal machine located in close physical proximity to the other nodes. A remote direct memory access (RDMA) network between nodes provides latency as low as single-digit microseconds, comparable to on-premises HPC clusters.

Cluster networks are designed for highly demanding parallel computing workloads. For example:

  • Computational fluid dynamics simulations for automotive or aerospace modeling
  • Financial modeling and risk analysis
  • Biomedical simulations
  • Trajectory analysis and design for space exploration
  • Artificial intelligence and big data workloads

Cluster networks are built on top of the instance pools feature. Most operations in the instance pool are managed directly by the cluster network, though you can resize the underlying instance pool, change the instance configuration that the pool uses to create new instances, monitor the pool, and add tags.

For more information about how to access and store the data that you want to process in your cluster networks, see FastConnect Overview, Overview of File Storage, Overview of Object Storage, and Overview of Block Volume.

Supported Shapes

The following shapes support cluster networks:

  • BM.GPU4.8
  • BM.GPU.GM4.8 (BM.GPU.A100-v2.8)
  • BM.HPC2.36
  • BM.Optimized3.36

A cluster network contains multiple HPC or GPU instances, so you typically must request a service limit increase before you can create one.

Supported Regions and Availability Domains

Cluster networks are supported in selected regions within the Oracle Cloud Infrastructure commercial realm and Government Cloud realms.

Supported Regions in the Commercial Realm
  • Australia East (Sydney)
  • Australia Southeast (Melbourne)
  • Brazil East (Sao Paulo)
  • Brazil Southeast (Vinhedo)
  • Canada Southeast (Toronto)
  • Germany Central (Frankfurt)
  • India South (Hyderabad)
  • India West (Mumbai)
  • Israel Central (Jerusalem)
  • Italy Northwest (Milan)
  • Japan Central (Osaka)
  • Japan East (Tokyo)
  • Netherlands Northwest (Amsterdam)
  • Saudi Arabia West (Jeddah)
  • Singapore (Singapore)
  • South Korea Central (Seoul)
  • South Korea North (Chuncheon)
  • UAE East (Dubai)
  • UK South (London)
  • US East (Ashburn)
  • US Midwest (Chicago)
  • US West (Phoenix)
  • US West (San Jose)

Supported Regions in the Government Cloud Realms
  • UK Gov South (London)
  • UK Gov West (Newport)
  • US Gov East (Ashburn)

The availability domain that you create the cluster network in must have cluster-network-capable hardware.

Required IAM Policy

To use Oracle Cloud Infrastructure, you must be granted security access in a policy by an administrator. This access is required whether you're using the Console or the REST API with an SDK, CLI, or other tool. If you get a message that you don’t have permission or are unauthorized, verify with your administrator what type of access you have and which compartment to work in.

For administrators: For a typical policy that gives access to cluster networks, see Let users manage Compute instance configurations, instance pools, and cluster networks.

Creating a Cluster Network

Use the following steps to create a cluster network.

Prerequisites

Create an instance configuration for the instance pool that is managed by the cluster network. Use the following settings:

  • Image: Click Change image, and then click Oracle images. Select the Oracle Linux HPC cluster networking image.
  • Shape: Click Change shape. Select Bare metal machine. Then, select a shape that supports cluster networks.

    For more information about these shapes, see Compute Shapes.

Using the Console

  1. Open the navigation menu and click Compute. Under Compute, click Cluster Networks.

  2. Click Create cluster network.
  3. Enter a name for the cluster network. It doesn't have to be unique, and you can change it later. Avoid entering confidential information.
  4. Select the compartment to create the cluster network in.
  5. Select the Availability Domain to run the cluster network in. Only the availability domains with cluster-network-capable hardware can be selected.
  6. In the Configure networking section, specify the network that you want to use to administer the cluster network. This network is separate from the closed RDMA network between nodes within the cluster. Enter the following information:

    • Virtual cloud network: The virtual cloud network (VCN) for the cluster network.
    • Subnet: The subnet for the cluster network.
  7. In the Configure instance pool section, enter the following:

    • Instance pool name: A name for the instance pool that is managed by the cluster network. Avoid entering confidential information.
    • Number of instances: The number of instances in the pool.
    • Instance configuration: Select the instance configuration to use when creating instances in the cluster network's instance pool, as described in the prerequisites.
  8. Show tagging options: Optionally, you can add tags. If you have permissions to create a resource, you also have permissions to add free-form tags to that resource. To add a defined tag, you must have permissions to use the tag namespace. For more information about tagging, see Resource Tags. If you are not sure whether you should add tags, skip this option (you can add tags later) or ask your administrator.
  9. Click Create cluster network.

    Instances are provisioned until the required number of instances in the pool are launched, subject to host capacity for nodes in the cluster's RDMA network.

    To track the progress of the operation and troubleshoot errors that occur during instance creation, use the associated work request.
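As a rough illustration of tracking a work request, the following Python sketch polls a status until it reaches a terminal value. The `get_status` callable is hypothetical: it stands in for whatever SDK, CLI, or REST call you use to read the work request. The status names are assumptions to verify against the Work Requests API reference.

```python
import time

# Assumed terminal statuses for a work request; verify against the API reference.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED"}


def wait_for_work_request(get_status, interval_seconds=5.0, max_polls=120,
                          sleep=time.sleep):
    """Poll get_status() until it returns a terminal status.

    get_status is a hypothetical zero-argument callable that returns the
    current status string of the work request.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_seconds)
    raise TimeoutError("work request did not finish within the polling budget")
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays. A status of FAILED is returned rather than raised so that the caller can inspect the work request's errors.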

Using the API

Use the CreateClusterNetwork operation.
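The Console fields above map onto a request body for CreateClusterNetwork. The following Python sketch assembles such a body as a plain dictionary. The function name and the OCID values are hypothetical, and the field names are assumptions based on the API's camelCase conventions, so confirm them in the CreateClusterNetwork API reference before use.

```python
def build_create_cluster_network_body(compartment_id, availability_domain,
                                      subnet_id, instance_configuration_id,
                                      size, cluster_name, pool_name):
    """Assemble a JSON-serializable body for a CreateClusterNetwork request.

    Field names are assumptions; check them against the API reference.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return {
        "compartmentId": compartment_id,
        "displayName": cluster_name,
        # Where the cluster runs and which subnet is used to administer it.
        "placementConfiguration": {
            "availabilityDomain": availability_domain,
            "primarySubnetId": subnet_id,
        },
        # The instance pool that the cluster network manages.
        "instancePools": [
            {
                "displayName": pool_name,
                "instanceConfigurationId": instance_configuration_id,
                "size": size,
            }
        ],
    }


# Placeholder identifiers for illustration only.
example_body = build_create_cluster_network_body(
    compartment_id="ocid1.compartment.oc1..example",
    availability_domain="EXAMPLE-AD-1",
    subnet_id="ocid1.subnet.oc1..example",
    instance_configuration_id="ocid1.instanceconfiguration.oc1..example",
    size=4,
    cluster_name="example-cluster-network",
    pool_name="example-pool",
)
```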

Detaching Instances from a Cluster Network

You can remove specific nodes from a cluster network by detaching instances from the cluster network's underlying instance pool. The instances that you detach are no longer managed as part of the cluster network. If you want to shrink the cluster network and don't need to choose which instances are deleted, resize the cluster network instead.

When you detach an instance, you can choose whether to delete the instance or to retain it. You can also choose whether to replace the detached instance by creating a new instance in the cluster network. If you don't replace the detached instance, the size of the cluster network is decreased.

Using the Console

  1. Open the navigation menu and click Compute. Under Compute, click Cluster Networks.

  2. Click the cluster network that you're interested in.
  3. On the Instance pools page, click the instance pool that you want to detach instances from.
  4. Under Resources, click Attached instances.
  5. For the instance that you want to detach, click the Actions menu. Then, click Detach instance.
  6. If you want to delete the instance and its boot volume, select the Permanently terminate (delete) this instance and its attached boot volume check box.
  7. By default, the size of the underlying instance pool is reduced. If you want the cluster network to remain the same size after you detach the instance, you can provision a replacement instance. Select the Replace the instance with a new instance, using the pool’s instance configuration as a template for the instance check box.
  8. Click Detach (or Detach and terminate, if you're also deleting the instance).

    To track the progress of the operation and troubleshoot errors that occur while the instance is detached or replaced, use the associated work request.

Using the API

To list the instances in a cluster network, use the ListClusterNetworkInstances operation.

To detach instances from a cluster network's underlying instance pool, use the DetachInstancePoolInstance operation.
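The two Console check boxes (terminate the instance, replace the instance) correspond to two independent choices in a detach request. The following Python sketch shows one way to express them and the resulting pool size. The helper names are hypothetical, and the field names `instanceId`, `isAutoTerminate`, and `isDecrementSize` are assumptions to verify in the DetachInstancePoolInstance API reference.

```python
def build_detach_body(instance_id, terminate=False, replace=False):
    """Assemble a body for a DetachInstancePoolInstance request.

    terminate: delete the instance and its boot volume after detaching.
    replace:   provision a replacement so that the pool keeps its size.
    Field names are assumptions; check them against the API reference.
    """
    return {
        "instanceId": instance_id,
        "isAutoTerminate": terminate,
        # Not replacing the instance means that the pool shrinks by one.
        "isDecrementSize": not replace,
    }


def pool_size_after_detach(current_size, replace=False):
    """Return the expected pool size after one instance is detached."""
    return current_size if replace else current_size - 1
```

For example, detaching one instance from a four-node pool with `replace=False` leaves three nodes, whereas `replace=True` keeps the pool at four.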

Resizing a Cluster Network

You can change the number of instances in a cluster network by resizing the underlying instance pool.

When you increase the size, instances are provisioned until the required number of instances in the pool are launched, subject to host capacity for nodes in the cluster's RDMA network.

When you decrease the size, instances are terminated (deleted) in the order that they were created: the oldest instances are terminated first (first-in, first-out). If you want to remove a specific instance from the cluster network, detach that instance from the cluster network instead.
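To preview which instances a size decrease would remove, the first-in, first-out rule can be modeled locally. The following Python sketch is an illustration only: the function name and the `(name, created_at)` tuple layout are hypothetical, and the service's own selection is authoritative.

```python
from datetime import datetime


def instances_to_terminate(instances, new_size):
    """Return the names of the instances removed by a resize, oldest first.

    instances: list of (name, created_at) tuples, where created_at is a
    datetime. This layout is a hypothetical convenience for the sketch.
    """
    excess = len(instances) - new_size
    if excess <= 0:
        return []  # Growing or keeping the size terminates nothing.
    oldest_first = sorted(instances, key=lambda item: item[1])
    return [name for name, _ in oldest_first[:excess]]


pool = [
    ("node-c", datetime(2024, 3, 1)),
    ("node-a", datetime(2024, 1, 1)),
    ("node-b", datetime(2024, 2, 1)),
]
removed = instances_to_terminate(pool, new_size=1)  # ["node-a", "node-b"]
```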

Prerequisites

The cluster network must be in the Running state.

Using the Console

  1. Open the navigation menu and click Compute. Under Compute, click Cluster Networks.

  2. Click the cluster network that you're interested in.
  3. Click Edit.
  4. In the Number of instances box, specify the updated number of instances for the instance pool.
  5. Click Save changes.

    To track the progress of the operation and troubleshoot errors that occur during instance creation, use the associated work request.

Using the API

Use the UpdateClusterNetwork operation.
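A resize through the API targets the underlying instance pool inside the update request. The following Python sketch builds a minimal body for that case. The function name is hypothetical, and the `instancePools`, `id`, and `size` field names are assumptions to confirm in the UpdateClusterNetwork API reference.

```python
def build_resize_body(instance_pool_id, new_size):
    """Assemble an UpdateClusterNetwork body that changes only the pool size.

    Field names are assumptions; check them against the API reference.
    """
    if not isinstance(new_size, int) or new_size < 0:
        raise ValueError("new_size must be a non-negative integer")
    return {
        "instancePools": [
            {"id": instance_pool_id, "size": new_size},
        ]
    }
```

Only the fields being changed are included, so the name, tags, and instance configuration are left out of the body.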

Updating the Instance Configuration for a Cluster Network

To update the instance configuration that a cluster network's underlying instance pool uses when creating instances, you can do either of the following things:

  • Create a new instance configuration with the desired settings, and then attach the new instance configuration to the cluster network.

    If you want the instances in the cluster network to use the settings from the new instance configuration, such as a new shape, detach the existing instances from the cluster network and provision new instances.

    Note

    When you detach instances from a cluster network, the existing instances are detached before new instances are provisioned. Depending on your requirements, you might want to increase the size of the cluster network before detaching instances.
  • If you only want to update the display name or tags of an existing instance configuration, you can update the cluster network's existing instance configuration. For any other updates, create and then attach a new instance configuration with the settings that you want to use.
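The note about increasing the size before detaching suggests an ordering for rolling existing instances onto a new instance configuration without dropping below the original capacity. The following Python sketch produces one possible plan. The function name, the step tuples, and the `surge` parameter are hypothetical, and the plan assumes that the new instance configuration is already attached and that each detach reduces the pool size by one.

```python
def plan_config_rollout(old_instances, surge=1):
    """Plan resize and detach steps that replace old instances in batches.

    old_instances: names of the instances created from the old configuration.
    surge: how many extra instances to add before each batch of detaches.
    Returns a list of ("resize", target_size) and ("detach", name) steps.
    """
    if surge < 1:
        raise ValueError("surge must be at least 1")
    base_size = len(old_instances)
    remaining = list(old_instances)
    steps = []
    while remaining:
        batch, remaining = remaining[:surge], remaining[surge:]
        # Grow first so that new instances exist before old ones leave.
        steps.append(("resize", base_size + len(batch)))
        # Each detach shrinks the pool by one, returning it to base_size.
        steps.extend(("detach", name) for name in batch)
    return steps
```

With three old instances and `surge=2`, the plan grows the pool to five, detaches two old instances, grows it to four, and then detaches the last one. A larger `surge` finishes sooner but needs more spare host capacity and service limit headroom.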

Using the Console

To attach a new instance configuration to a cluster network:

  1. Open the navigation menu and click Compute. Under Compute, click Cluster Networks.

  2. Click the cluster network that you're interested in.
  3. Click Edit.
  4. For Instance configuration, select the instance configuration to use when creating instances in the cluster network's instance pool.
  5. Click Save changes.

Using the API

To attach a new instance configuration to a cluster network, use the UpdateClusterNetwork operation.

Renaming a Cluster Network

Use the following steps to edit the name of a cluster network.

Using the Console

  1. Open the navigation menu and click Compute. Under Compute, click Cluster Networks.

  2. Click the cluster network that you're interested in.
  3. Click Edit.
  4. Enter a new name. Avoid entering confidential information.
  5. Click Save changes.

Using the API

Use the UpdateClusterNetwork operation.

Tagging Resources

You can apply tags to your resources to help you organize them according to your business needs. You can apply tags when you create a resource, or you can update the resource later with the tags that you want. For general information about applying tags, see Resource Tags.

To manage tags for a cluster network

Using the Console:

  1. Open the navigation menu and click Compute. Under Compute, click Cluster Networks.

  2. Click the cluster network that you're interested in.
  3. Click the Tags tab to view or edit the existing tags. Or click Add tags to add new ones.

Using the API: Use the UpdateClusterNetwork operation.
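When you update tags through the API, it can help to compute the full tag set locally before sending it. The following Python sketch merges changes into the current free-form tags. The function name is hypothetical, and the sketch assumes that the update request carries the complete tag map in a `freeformTags` field, which should be verified in the UpdateClusterNetwork API reference.

```python
def build_tag_update_body(current_tags, changes, removals=()):
    """Assemble an update body that carries the full free-form tag map.

    current_tags: tags on the cluster network now (key to value).
    changes:      tags to add or overwrite.
    removals:     tag keys to drop.
    Assumes that the request replaces the whole map; verify before use.
    """
    merged = dict(current_tags)  # Copy so that the input is not mutated.
    merged.update(changes)
    for key in removals:
        merged.pop(key, None)
    return {"freeformTags": merged}
```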

Deleting a Cluster Network

You can terminate (delete) a cluster network that you no longer need.

Caution

When you delete a cluster network, all of its resources are permanently deleted, including associated instances, attached boot volumes, and block volumes.

Using the Console

  1. Open the navigation menu and click Compute. Under Compute, click Cluster Networks.

  2. Click the cluster network that you're interested in.
  3. Click Terminate, and then confirm when prompted.

    To track the progress of the operation and troubleshoot errors that occur during deletion, use the associated work request.

Using the API

Use the TerminateClusterNetwork operation.