Use Open MPI on Oracle Roving Edge Infrastructure

Introduction

Oracle Roving Edge Infrastructure is a ruggedized cloud computing and storage platform suitable for deployment at the network edge or in locations with limited or no external connectivity. As larger, more complex, and more demanding workloads move toward the network edge, they can present challenges for edge infrastructure.

Open MPI is an implementation of the Message Passing Interface (MPI) standard used for developing parallel applications in High Performance Computing (HPC). With Open MPI, HPC and other highly parallel workloads can be deployed across relatively small pieces of infrastructure that then operate as a larger, aggregated set of resources. This approach can be used to distribute a workload across CPU and other compute resources, such as GPU, allowing larger, computationally intensive tasks, such as predictive modeling or other Artificial Intelligence/Machine Learning (AI/ML) tasks, to be deployed at the network edge.

Open MPI can be used to deploy parallel workloads utilizing resources across Oracle Roving Edge Infrastructure nodes. Netfilter provides the necessary Destination Network Address Translation (DNAT) and Source Network Address Translation (SNAT) for VM instances using clustering software hosted across Oracle Roving Edge Infrastructure nodes. This tutorial implements Open MPI using Netfilter on Oracle Roving Edge Infrastructure running a prime number calculator to demonstrate improved performance when using parallel resources.

Background Information

Open MPI can run across several Virtual Machine (VM) instances within a single Oracle Roving Edge Infrastructure node or across multiple Oracle Roving Edge Infrastructure nodes. Running on a single Oracle Roving Edge Infrastructure node is seamless, and does not pose any issues. When running across multiple Oracle Roving Edge Infrastructure nodes, it is important to understand networking on Oracle Roving Edge Infrastructure VM instances and how Open MPI routes traffic to avoid possible issues.

Networking on Oracle Roving Edge Infrastructure Virtual Machine Instances

Virtual Machine instances running on Oracle Roving Edge Infrastructure use private IP addresses to communicate with other VM instances on the same subnet hosted on the same node. Public IP addresses can be assigned to VM instances hosted on Oracle Roving Edge Infrastructure so that an instance can communicate with other subnets and with resources running outside of the Oracle Roving Edge Infrastructure node where it is hosted.

Note: Public IP addresses are assigned to VM instance VNICs from a public IP address pool. Although these addresses are called public IPs, they are actually addresses on the same local network to which the Oracle Roving Edge Infrastructure node is connected through its RJ-45 port. They can be IPv4 addresses reachable from the internet, or addresses on a private subnet of the local network. They are also referred to as external IP addresses, because they allow connectivity with resources external to the node where the VM instance is hosted.

It is important to understand that when a VM instance running within an Oracle Roving Edge Infrastructure node attempts to access resources outside of the node, traffic goes through the external IP address, which is routed within the node and out to the external network connection.
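
The public IP address is not configured on the VM instance's operating system; the translation is performed by the node. As a quick check from inside an instance (a hedged example, assuming a standard Linux image with the iproute2 tools), listing the interface addresses shows only the private IP:

    # List the IPv4 addresses configured on the instance's interfaces.
    # Only the private address (for example, 10.0.1.2) is expected to appear;
    # the public/external IP is handled by the node, not by the guest OS.
    ip -4 addr show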

An example VM instance on an Oracle Roving Edge Infrastructure node is shown in the following image. Notice the assigned public and private IP addresses.

An example VM instance on an Oracle Roving Edge Infrastructure node

Challenges For Running Open MPI Across Multiple Oracle Roving Edge Infrastructure Nodes

Using VM instance private IP addresses is problematic for clustering software, such as Open MPI, when it is implemented across multiple Oracle Roving Edge Infrastructure nodes. Each node is unaware of the private-to-public IP address mappings for VM instances hosted on other nodes, so it cannot translate them. Because the mappings are not shared, packets addressed to private IP addresses on other nodes are either routed incorrectly or lost.

An example of how this problem might look using Open MPI:

Open MPI running on VM instances tries to determine the best network path to reach the other VM instance members. The software might examine the local interfaces and IP addresses and register these private addresses with the other members of the cluster, which cannot reach them across nodes.
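
Open MPI does provide MCA parameters that restrict which interfaces and address ranges it advertises and uses; for example, the TCP byte transfer layer and out-of-band messaging can be limited to the VCN CIDR. The following is a hedged illustration of that control (using the hostfile created later in this tutorial); it is not a substitute for the Netfilter approach this tutorial uses:

    # Limit Open MPI's TCP transport and out-of-band messaging to the VCN
    # address range so only private subnet addresses are advertised.
    mpirun --hostfile mpihosts.txt \
        --mca btl_tcp_if_include 10.0.0.0/16 \
        --mca oob_tcp_if_include 10.0.0.0/16 \
        hostname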

Implement Open MPI Across Oracle Roving Edge Infrastructure

To address the non-shared nature of an Oracle Roving Edge Infrastructure node's internal network, Netfilter software on the Linux VM instances is used to rewrite network packets coming from, and destined for, VMs hosted on other Oracle Roving Edge Infrastructure nodes.

Design Considerations

In this tutorial, three Oracle Roving Edge Infrastructure Roving Edge Device (RED) nodes are used to create an Open MPI cluster. All of the REDs are connected to a shared external network. Each node has been configured with its own external IP pool for allocation to VM instances.

VCN and Subnet CIDR Table

RED Name    VCN CIDR       Subnet CIDR
RED1        10.0.0.0/16    10.0.1.0/24
RED2        10.0.0.0/16    10.0.2.0/24
RED3        10.0.0.0/16    10.0.3.0/24

This example is shown in the following images: the networking configuration on two different REDs.

Example VCN and subnet on "RED1"

Example VCN and subnet on "RED2"

Audience

Oracle Roving Edge Infrastructure administrators, developers and users.

Objectives

  • Implement Open MPI across multiple Oracle Roving Edge Infrastructure nodes, using Netfilter DNAT and SNAT on the VM instances, and validate the deployment with a distributed prime number calculation.

Prerequisites

Task 1: Create Virtual Machine Instances

Create a VM instance in each subnet on each RED.

Example IP address assignments:

RED Name    VM Name    VM O/S          VM Private IP    VM Public IP
RED1        redvm1     Ubuntu 22.04    10.0.1.2/24      10.123.123.32
RED2        redvm2     Ubuntu 22.04    10.0.2.2/24      10.123.123.67
RED3        redvm3     Ubuntu 22.04    10.0.3.2/24      10.123.123.101

Note: The VM instances in this task are created using an image exported from Oracle Cloud Infrastructure (OCI) using OCI Ubuntu 22.04 LTS. Any Linux distribution with suitable Open MPI packages can be used, for example, Oracle Linux 8 or 9, Fedora and so on.

Example VM List on "RED3"

  1. On each RED, navigate to Compute, Instance and click Create Instance.

  2. In Create Compute Instance, enter a Name, select the imported custom image and a Shape, configure the networking, add your SSH keys, and click Create.

    Creating a Compute Instance on each RED

Task 2: Install Open MPI Package On Each VM Instance

Once all the VM instances have been created, log in to each VM instance via SSH with the key provided during provisioning. The following commands are suitable for Ubuntu 22.04; if you are using a different Linux distribution, you may need to adapt these instructions or include additional package repositories.

  1. Run the following command to update the system and reboot.

    sudo apt update && sudo apt upgrade -y
    

    This might take a while to complete. When it finishes, reboot the instance.

    sudo shutdown -r now
    
  2. Log in to the instance and install the Open MPI packages.

    Note:

    • This will pull in quite a few dependencies.

    • The libopenmpi-dev package is only required later, to compile a sample program for testing. If you will not be compiling Open MPI features into programs, this package is not needed.

    sudo apt install openmpi-bin libopenmpi-dev -y
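
    To confirm the installation on each VM instance, you can check the Open MPI version and build details. The exact version reported depends on the Ubuntu 22.04 repositories at install time; on Oracle Linux or Fedora the package names are expected to differ (for example, openmpi and openmpi-devel with dnf), which is an assumption to verify against your distribution.

    # Report the installed Open MPI version and a summary of its build details.
    mpirun --version
    ompi_info | head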
    

Task 3: Set Up Destination Network Address Translation (DNAT) and Source Network Address Translation (SNAT)

  1. View details of each VM instance launched and note down the private and randomly assigned public IP addresses. Navigate to Compute, Instance and click the instance name to view the details.

    Example VM on "RED3"

  2. Create the SNATs.

    • If we map the SNAT rules for each VM in a table, it should look like this:

      SNAT        | Coming From RED1                              | Coming From RED2                              | Coming From RED3
      On redvm1   | N/A                                           | Input src 10.123.123.67, SNAT to src 10.0.2.2 | Input src 10.123.123.101, SNAT to src 10.0.3.2
      On redvm2   | Input src 10.123.123.32, SNAT to src 10.0.1.2 | N/A                                           | Input src 10.123.123.101, SNAT to src 10.0.3.2
      On redvm3   | Input src 10.123.123.32, SNAT to src 10.0.1.2 | Input src 10.123.123.67, SNAT to src 10.0.2.2 | N/A
    • Use iptables commands for each VM.

      • On redvm1.

        sudo iptables -I INPUT --src 10.123.123.0/24 -j ACCEPT -m comment --comment "Allow REDs public subnet access."
        sudo iptables -t nat -I INPUT -p tcp -s 10.123.123.67 -j SNAT --to-source 10.0.2.2
        sudo iptables -t nat -I INPUT -p tcp -s 10.123.123.101 -j SNAT --to-source 10.0.3.2
        sudo netfilter-persistent save
        
      • On redvm2.

        sudo iptables -I INPUT --src 10.123.123.0/24 -j ACCEPT -m comment --comment "Allow REDs public subnet access."
        sudo iptables -t nat -I INPUT -p tcp -s 10.123.123.32 -j SNAT --to-source 10.0.1.2
        sudo iptables -t nat -I INPUT -p tcp -s 10.123.123.101 -j SNAT --to-source 10.0.3.2
        sudo netfilter-persistent save
        
      • On redvm3.

        sudo iptables -I INPUT --src 10.123.123.0/24 -j ACCEPT -m comment --comment "Allow REDs public subnet access."
        sudo iptables -t nat -I INPUT -p tcp -s 10.123.123.32 -j SNAT --to-source 10.0.1.2
        sudo iptables -t nat -I INPUT -p tcp -s 10.123.123.67 -j SNAT --to-source 10.0.2.2
        sudo netfilter-persistent save
        

      Note: The first rule (sudo iptables -I INPUT --src 10.123.123.0/24 -j ACCEPT) is added to allow access from the subnet that the REDs use for their public IPs. Without it (or a similar rule), inbound traffic from RED to RED is likely to be dropped by the receiving RED. On these VMs, the new rules are persisted with sudo netfilter-persistent save; if you are using a different Linux distribution, the equivalent command will likely be different. You can verify the rules as shown below.
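
      As a hedged sanity check on Ubuntu (package names and persistence commands may differ on other distributions), you can make sure the persistence tooling is present and list the rules that are now in place:

        # netfilter-persistent is provided by the iptables-persistent package
        # on Ubuntu/Debian; install it if the save command above is not found.
        sudo apt install iptables-persistent -y

        # List the SNAT rules added to the nat table's INPUT chain.
        sudo iptables -t nat -L INPUT -n -v

        # List the filter INPUT rules, including the public subnet ACCEPT rule.
        sudo iptables -L INPUT -n -v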

  3. Create the DNATs.

    • Similarly, if we map the DNAT rules for each VM in a table, it should look like this:

      DNAT        | Going To RED1                                  | Going To RED2                                  | Going To RED3
      On redvm1   | N/A                                            | Output dst 10.0.2.2, DNAT to dst 10.123.123.67 | Output dst 10.0.3.2, DNAT to dst 10.123.123.101
      On redvm2   | Output dst 10.0.1.2, DNAT to dst 10.123.123.32 | N/A                                            | Output dst 10.0.3.2, DNAT to dst 10.123.123.101
      On redvm3   | Output dst 10.0.1.2, DNAT to dst 10.123.123.32 | Output dst 10.0.2.2, DNAT to dst 10.123.123.67 | N/A
    • Use iptables commands for each VM.

      • On redvm1.

        sudo iptables -t nat -I OUTPUT -p tcp -d 10.0.2.2 -j DNAT --to-destination 10.123.123.67
        sudo iptables -t nat -I OUTPUT -p tcp -d 10.0.3.2 -j DNAT --to-destination 10.123.123.101
        sudo netfilter-persistent save
        
      • On redvm2.

        sudo iptables -t nat -I OUTPUT -p tcp -d 10.0.1.2 -j DNAT --to-destination 10.123.123.32
        sudo iptables -t nat -I OUTPUT -p tcp -d 10.0.3.2 -j DNAT --to-destination 10.123.123.101
        sudo netfilter-persistent save
        
      • On redvm3.

        sudo iptables -t nat -I OUTPUT -p tcp -d 10.0.1.2 -j DNAT --to-destination 10.123.123.32
        sudo iptables -t nat -I OUTPUT -p tcp -d 10.0.2.2 -j DNAT --to-destination 10.123.123.67
        sudo netfilter-persistent save
        

      Note: On the VM instances, the new rules are persisted with sudo netfilter-persistent save.
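
      With both rule sets in place, TCP traffic sent to another VM's private IP address should be translated and delivered. A quick, hedged check from redvm1 (using only the bash /dev/tcp built-in, so no extra packages are assumed) is to confirm that TCP port 22 on the other VMs is reachable through their private addresses; note that the rules above only translate TCP, so a plain ping will not be translated.

        # From redvm1: test TCP port 22 on redvm2 and redvm3 via their
        # private IP addresses, which exercises the DNAT and SNAT rules.
        for ip in 10.0.2.2 10.0.3.2; do
          if timeout 5 bash -c "cat < /dev/null > /dev/tcp/${ip}/22"; then
            echo "${ip}: port 22 reachable"
          else
            echo "${ip}: port 22 NOT reachable"
          fi
        done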

Task 4: Set up Open MPI

Open MPI uses SSH to communicate between cluster members, so there are a few things that need to be taken care of before we can run jobs.

  1. Open MPI will be using the private IP addresses of each of the VMs, so on every Open MPI VM instance, make /etc/hosts entries that map each VM instance name to its private IP address.

    For example, using the configuration above, the /etc/hosts file on redvm1 will contain the following entries:

    127.0.0.1 localhost
    127.0.1.1 redvm1  redvm1
    10.0.2.2  redvm2
    10.0.3.2  redvm3
    

    On redvm2, the /etc/hosts will contain the following entries:

    127.0.0.1 localhost
    127.0.1.1 redvm2  redvm2
    10.0.1.2  redvm1
    10.0.3.2  redvm3
    

    On redvm3, the /etc/hosts will contain the following entries:

    127.0.0.1 localhost
    127.0.1.1 redvm3  redvm3
    10.0.1.2  redvm1
    10.0.2.2  redvm2
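
    A hedged way to append these entries from the shell (shown here for redvm1; adjust the name and IP pairs on redvm2 and redvm3) is:

    # On redvm1: append the other cluster members to /etc/hosts.
    printf '%s\n' '10.0.2.2  redvm2' '10.0.3.2  redvm3' | sudo tee -a /etc/hosts

    # Confirm that name resolution now returns the private addresses.
    getent hosts redvm2 redvm3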
    
  2. We also need to ensure that SSH key equivalence exists between all of the VMs for Open MPI to use.

    Note: The assumption here is that these are new VMs that do not contain existing SSH keys for the ubuntu user. If you are using older VMs that already have SSH keys, you will need to adapt these instructions; otherwise, they could overwrite existing keys and lock you out of your VMs.

    1. On redvm1, create a new public/private key pair (if you do not have keys created already). Use an ssh-keygen command similar to ssh-keygen -b 4096 -t rsa (accept the defaults, and do not set a passphrase for the new keys). This will generate ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub.

    2. Add the new public key to the authorized_keys file by executing cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys, or by copying it in manually using a text editor.

    3. Copy both id_rsa and id_rsa.pub to the ubuntu user’s ~/.ssh directory on redvm2 and redvm3. Ensure you also add id_rsa.pub to authorized_keys by running cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys on redvm2 and redvm3.

    4. Once this has been done, from each VM, connect to all of the other VMs including the VM itself to ensure that connectivity works and that SSH trusts the other hosts.

      • SSH connection on redvm1.

        ubuntu@redvm1:~$ ssh redvm1 date
        The authenticity of host 'redvm1 (127.0.1.1)' can't be established.
        ED25519 key fingerprint is SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
        This key is not known by any other names
        Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
        Warning: Permanently added 'redvm1' (ED25519) to the list of known hosts.
        Fri Apr  5 04:28:57 UTC 2024
        ubuntu@redvm1:~$
        ubuntu@redvm1:~$ ssh redvm2 date
        The authenticity of host 'redvm2 (10.0.2.2)' can't be established.
        ED25519 key fingerprint is SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
        This key is not known by any other names
        Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
        Warning: Permanently added 'redvm2' (ED25519) to the list of known hosts.
        Wed Jan 31 04:29:11 UTC 2024
        ubuntu@redvm1:~$
        ubuntu@redvm1:~$ ssh redvm3 date
        The authenticity of host 'redvm3 (10.0.3.2)' can't be established.
        ED25519 key fingerprint is SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
        This key is not known by any other names
        Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
        Warning: Permanently added 'redvm3' (ED25519) to the list of known hosts.
        Wed Jan 31 04:29:19 UTC 2024
        
    5. Repeat the above steps for redvm2 (connecting to redvm2, redvm1, and redvm3) and for redvm3 (connecting to redvm3, redvm1, and redvm2). A scripted version of this check is shown below.
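
      Once the host keys have been accepted interactively, a hedged way to confirm passwordless SSH from the current VM to every member in one pass is to use BatchMode, which makes ssh fail rather than prompt if key authentication is not working:

        # Verify non-interactive SSH to every cluster member from this VM.
        for h in redvm1 redvm2 redvm3; do
          ssh -o BatchMode=yes "$h" hostname
        done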

  3. Create a common storage location for each cluster member.

    Note: Ideally all VM instances using Open MPI will have a shared storage location. This could take the form of NFS, GlusterFS, OCFS2 or any number of other shared filesystem solutions. This is particularly important if a common working directory or data set is needed for the workload.

    A shared filesystem is not needed for this example; only a location with a common name is needed for our testing binaries. On each VM, create the common name location /mpitest.

    sudo mkdir /mpitest && sudo chown ubuntu:ubuntu /mpitest
    

    If you use a shared filesystem, it would be mounted at this location on all VM instances.
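
    If you do want a shared working directory, a minimal sketch using NFS is shown below. The server name nfs-server and export path /export/mpitest are placeholders for illustration only, and the nfs-common package is assumed on Ubuntu; none of this is required for this tutorial.

    # Hypothetical example: mount a shared NFS export at /mpitest on each VM.
    # Replace nfs-server:/export/mpitest with your actual NFS server and export.
    sudo apt install nfs-common -y
    sudo mount -t nfs nfs-server:/export/mpitest /mpitest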

  4. Create a hostfile for use with mpirun. For more information, see How do I use the --hostfile option to mpirun?.

    1. We will create two hostfiles for testing. On redvm1, using our common name location /mpitest created above, create a file /mpitest/mpihosts.txt with the following content:

      redvm1
      redvm2
      redvm3
      
    2. Create a file /mpitest/mpihosts_slots.txt with the following content:

      redvm1 slots=1
      redvm2 slots=1
      redvm3 slots=1
      

Note: In this tutorial, tests will only be run from redvm1 so we do not need to copy these files to redvm2 and redvm3. If you want to run jobs from other VMs too, you will need to copy these files to the other VM instances, or use a proper shared filesystem like NFS.

Task 5: Test the VM Instance

  1. A simple test of distributed commands.

    • A simple test is to invoke a command such as hostname on all of the cluster members. Here is the expected output running across the three nodes with the slots=1 hostfile (mpihosts_slots.txt). The slots directive tells mpirun how many processes can be allocated to that node, rather than letting mpirun determine the number of processes itself.

      Note: Specifying slots might be necessary if you are using limited resources other than CPU (for example, GPUs), where you want to limit processes to the number of the other resource. Failing to do so can result in processes failing due to not being able to allocate the other resources.

      ubuntu@redvm1:~$ cd /mpitest
      
      ubuntu@redvm1:/mpitest$ cat mpihosts_slots.txt
      redvm1 slots=1
      redvm2 slots=1
      redvm3 slots=1
      
      ubuntu@redvm1:/mpitest$ mpirun --hostfile mpihosts_slots.txt hostname
      redvm1
      redvm2
      redvm3
      
    • Run the same test, but without specifying slots (the mpihosts.txt file). mpirun will determine the available CPUs and run one hostname command per CPU on each node. Each of these three VMs has 16 CPUs, so we should get 3 x 16 = 48 responses (16 of each hostname).

      ubuntu@redvm1:/mpitest$ cat mpihosts.txt
      redvm1
      redvm2
      redvm3
      
      ubuntu@redvm1:/mpitest$ mpirun --hostfile mpihosts.txt hostname | sort | uniq -c
          16 redvm1
          16 redvm2
          16 redvm3
      
  2. Build an Open MPI test binary.

    For a proper test with a program that uses Open MPI, we use a prime number calculator example from John Burkardt, which needs to be downloaded and compiled on redvm1. For more information, see Prime Number Calculator by John Burkardt.

    ubuntu@redvm1:~$ cd /mpitest
    
    ubuntu@redvm1:/mpitest$ curl https://people.sc.fsu.edu/~jburkardt/c_src/prime_mpi/prime_mpi.c -o prime_mpi.c
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                    Dload  Upload   Total   Spent    Left  Speed
    100  4699  100  4699    0     0   2990      0  0:00:01  0:00:01 --:--:--  2991
    
    ubuntu@redvm1:/mpitest$ mpicc prime_mpi.c -o prime_mpi
    
    ubuntu@redvm1:/mpitest$ ls -l prime_mpi
    -rwxrwxr-x 1 ubuntu ubuntu 16736 Apr  5 05:38 prime_mpi
    

    Since a shared filesystem is not set up for testing, the prime_mpi binary needs to be copied to redvm2 and redvm3, into the same location as it is on redvm1. Run the following commands.

    ubuntu@redvm1:/mpitest$ scp prime_mpi redvm2:/mpitest
    prime_mpi                                                                                                                     100%   16KB  27.4MB/s   00:00
    ubuntu@redvm1:/mpitest$ scp prime_mpi redvm3:/mpitest
    prime_mpi                                                                                                                     100%   16KB  28.3MB/s   00:00
    
  3. Establish a baseline by running the Open MPI binary standalone. Run prime_mpi without mpirun to get a single-process result for comparison.

    ubuntu@redvm1:/mpitest$ ./prime_mpi
    31 January 2024 06:08:17 AM
    
    PRIME_MPI
      C/MPI version
    
      An MPI example program to count the number of primes.
      The number of processes is 1
    
            N        Pi          Time
    
            1         0        0.000003
            2         1        0.000000
            4         2        0.000000
            8         4        0.000000
            16         6        0.000000
            32        11        0.000001
            64        18        0.000002
          128        31        0.000022
          256        54        0.000019
          512        97        0.000066
          1024       172        0.000231
          2048       309        0.000810
          4096       564        0.002846
          8192      1028        0.010093
        16384      1900        0.037234
        32768      3512        0.137078
        65536      6542        0.515210
        131072     12251        1.932970
        262144     23000        7.243419
    
    PRIME_MPI - Master process:
      Normal end of execution.
    
    31 January 2024 06:08:27 AM
    

    Note: The number of processes is 1 and it takes about 10 seconds to complete.

  4. A distributed run using Open MPI. Run prime_mpi with Open MPI across all the available CPUs on the three VM instances using the mpihosts.txt file.

    ubuntu@redvm1:/mpitest$ mpirun --hostfile mpihosts.txt ./prime_mpi
    31 January 2024 06:09:02 AM
    
    PRIME_MPI
      C/MPI version
    
      An MPI example program to count the number of primes.
      The number of processes is 48
    
            N        Pi          Time
    
            1         0        0.020740
            2         1        0.000428
            4         2        0.000331
            8         4        0.000392
            16         6        0.000269
            32        11        0.000295
            64        18        0.000374
          128        31        0.000390
          256        54        0.000380
          512        97        0.000331
          1024       172        0.000351
          2048       309        0.000385
          4096       564        0.000740
          8192      1028        0.001931
        16384      1900        0.006316
        32768      3512        0.021577
        65536      6542        0.078834
        131072     12251        0.273368
        262144     23000        0.808825
    
    PRIME_MPI - Master process:
      Normal end of execution.
    
    31 January 2024 06:09:03 AM
    

    48 processes are used and it takes about 1 second to run.

    Explore Open MPI by running the same example, but with the mpihosts_slots.txt file, as shown below. You should see an improvement compared to running prime_mpi standalone, but it will only use 1 processor on each node (3 in total) rather than the full complement on all nodes. By changing the number of slots on each VM instance, you can control the distribution of the job.
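
    The command for that run is identical apart from the hostfile; with slots=1 per node, 3 processes in total would be started (output not reproduced here):

    ubuntu@redvm1:/mpitest$ mpirun --hostfile mpihosts_slots.txt ./prime_mpi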

Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.