Configure SR-IOV interfaces for pods on OKE using Multus CNI

Introduction

SR-IOV is a specification that allows a single PCIe device to appear as multiple separate PCIe devices. SR-IOV works by introducing the idea of physical functions (PFs) and virtual functions (VFs). A PF is used by the host and usually represents a single NIC port; a VF is a lightweight version of that PF. With appropriate support, SR-IOV provides a way for physical hardware (like a SmartNIC) to present itself as several distinct network interface devices. With containers, we can then move one of these interfaces (a VF) from the host into the network namespace of a container or a pod, so that the container can access the interface directly. The advantage this offers is that we avoid the virtio overhead entirely and get native device performance.
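
To make the idea concrete, here is an illustrative sketch of moving an interface into a network namespace by hand; the interface name eth1v0 is hypothetical and will vary by NIC and driver.

# Illustrative only: manually move a (hypothetical) VF netdev into a namespace
ip netns add demo
ip link set eth1v0 netns demo       # eth1v0: hypothetical VF interface name
ip netns exec demo ip link show     # the VF is now visible only inside 'demo'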

Note: The plug-ins and process described in this tutorial apply to bare metal instances only. For virtual machine-based instances, a different set of plug-ins and configuration is required.

Objective

This tutorial describes how to set up secondary, SR-IOV virtual function based network interfaces for pods running on OKE clusters. It leverages the SR-IOV network device plug-in to publish virtual functions as allocatable node resources, the SR-IOV CNI plug-in to plumb those virtual functions into pods, and the Multus meta CNI to add the additional network interfaces to pods.

How it works

The approach has several layers and components. At its crux, a Kubernetes device plug-in manages a set of virtual functions and publishes them as allocatable resources on the node. When a pod requests such a resource, the pod is scheduled to a node where the resource is available, and an SR-IOV CNI plumbs the virtual function into the pod’s network namespace. A CNI meta-plugin such as Multus handles multiple network attachments to the pod so that the pod can communicate over both the SR-IOV and the overlay networks.

We first set up a number of VFs on the SR-IOV capable SmartNICs; these present themselves to the host as individual NICs. We then configure these VFs with MAC addresses that Oracle Cloud Infrastructure (OCI) recognizes. The VFs are created outside Multus, either manually (as described in this tutorial) or using a script invoked at node creation time. At this point, we have a pool of VFs, each identified by the host as a separate NIC and carrying an OCI MAC address. The Kubernetes Network Plumbing Working Group maintains a special-purpose network device plug-in that discovers and publishes VFs as allocatable node resources. The SR-IOV CNI (also from the Kubernetes Network Plumbing Working Group) works alongside the device plug-in and manages the assignment of these virtual functions to pods based on the pod lifecycle.

Now we have one or more nodes with a pool of VFs that are recognized and managed by the SR-IOV device plug-in as allocatable node resources, which pods can request. The SR-IOV CNI plumbs (moves) a VF into the pod’s network namespace on pod creation and releases it (moves it back to the root namespace) on pod deletion, making the VF available for allocation to another pod. A meta plug-in like Multus provides the VF information to the CNI and manages multiple network attachments on the pod.

[Image: Multus pod with multiple network attachments]

Task 1: Set up the hosts

We start with bare metal hosts, where we can set up VFs on the PCIe interfaces, and perform the following steps.

  1. Create VFs: SSH into the bare metal node. Find the physical device and add virtual functions to it.

    Note: The script to get the device name is required only for automation.

    The example here creates two VFs on the device.

    # Get the physical device. Alternatively, run `ip addr show` and look at the primary interface to set $PHYSDEV
    URL=http://169.254.169.254/opc/v1/vnics/
    baseAddress=`curl -s ${URL} | jq -r '.[0] | .privateIp'`
    PHYSDEV=`ip -o -4 addr show | grep ${baseAddress} | awk -F: '{gsub(/^[ \t]|[ \t]$/,"",$2);split($2,out,/[ \t]+/);print out[1]}'`
    
    # Add two VFs
    echo "2" > /sys/class/net/${PHYSDEV}/device/sriov_numvfs
    
    # Verify the VFs
    ip link show ${PHYSDEV}
    
  2. Assign OCI MAC addresses to the VFs.

    These VFs currently have auto-generated MAC addresses (or all zeros). We need to assign them OCI MAC addresses for their traffic to be permitted on the OCI network. Create the same number of VNIC attachments on the host as the number of VFs created, and note the MAC address of each VNIC attachment. Now we assign each of these OCI-recognized MAC addresses to the VFs we created.

    # For each MAC address from the VNIC attachments
    
    ip link set ${PHYSDEV} vf <n> mac <MAC address from VNIC attachment> spoofchk off   # n = 0..numVFs-1
    
    # Verify that all VFs have MAC addresses from OCI
    ip link show ${PHYSDEV}
    
    

This completes the host setup. Ideally this is automated, since it must be performed on every host that provides SR-IOV networking resources to pods; a sketch of one possible approach follows.
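
For illustration, here is a minimal sketch of how the MAC assignment could be scripted using the same instance metadata endpoint. It assumes that exactly one secondary VNIC was attached per VF and that the metadata service lists the primary VNIC first (index 0).

# Sketch: assign each VF a MAC address from a secondary VNIC attachment.
# Assumptions: one secondary VNIC per VF; primary VNIC at metadata index 0.
URL=http://169.254.169.254/opc/v1/vnics/
NUMVFS=$(cat /sys/class/net/${PHYSDEV}/device/sriov_numvfs)
for i in $(seq 0 $((NUMVFS - 1))); do
  MAC=$(curl -s ${URL} | jq -r ".[$((i + 1))].macAddr")
  ip link set ${PHYSDEV} vf ${i} mac ${MAC} spoofchk off
done

# Verify that all VFs now carry OCI MAC addresses
ip link show ${PHYSDEV}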

Task 2: Install the SR-IOV CNI

This CNI can be installed on a Kubernetes 1.16+ cluster as a DaemonSet. Nodes without SR-IOV devices are handled gracefully by the device plug-in itself.

git clone https://github.com/k8snetworkplumbingwg/sriov-cni.git && cd sriov-cni
kubectl apply -f images/k8s-v1.16/sriov-cni-daemonset.yaml && cd ..
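
As a quick check, assuming the default names from the manifest, verify that the DaemonSet pods are running and that the plug-in binary has been placed on the nodes:

# The DaemonSet pods copy the sriov binary into /opt/cni/bin on each node
kubectl -n kube-system get pods -o wide | grep sriov-cni

# On a node:
ls /opt/cni/bin/ | grep sriov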

Task 3: Install the SR-IOV network device plug-in

Note: The device plug-in does not create the VFs on the fly; they must be created separately.

The device plug-in discovers and advertises the SR-IOV capable network devices on the node. To accomplish this, the device plug-in requires a configuration that enables it to create the device plug-in endpoints. The configuration identifies the devices and the drivers they use.

  1. Create a ConfigMap for the SR-IOV resource pool. To set up the ConfigMap, we need to know the vendor ID, device ID, and driver used by the device.

    1. To find the vendor ID and device ID:

      lspci -nn | grep Virtual
      
      31:02.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme-E Ethernet Virtual Function [14e4:16dc]
      31:02.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme-E Ethernet Virtual Function [14e4:16dc]
      
      
    2. In the example above, we have two VFs, and the bracketed values at the end give us the vendor ID (14e4) and device ID (16dc). We can cross-check this against the hwdata database that lspci uses.

      cat /usr/share/hwdata/pci.ids | grep 16dc
      
    3. To find the drivers used:

      # Filter based on the PCIe slot of the VF
      find /sys | grep "drivers.*31:02.0" | awk -F/ '{print $6}'
      
      bnxt_en
      
      
  2. Set up the ConfigMap. The ConfigMap must be named sriovdp-config and must contain a key named config.json.

    cat << EOF > sriovdp-config.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: sriovdp-config
      namespace: kube-system
    data:
      config.json: |
        {
          "resourceList": [{
                  "resourceName": "mlnx_sriov_rdma",
                  "resourcePrefix": "mellanox.com",
                  "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "isRdma": false
                  }
              },
              {
                  "resourceName": "netxtreme_sriov_rdma",
                  "resourcePrefix": "broadcom.com",
                  "selectors": {
                    "vendors": ["14e4"],
                    "devices": ["16dc"],
                    "drivers": ["bnxt_en"],
                    "isRdma": false
                  }
              }
          ]
        }
    EOF
    
    kubectl create -f sriovdp-config.yaml
    
    
  3. Set up the device plug-in. With the ConfigMap created, the device plug-in can be installed as a DaemonSet.

    git clone https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin.git && cd sriov-network-device-plugin
    kubectl create -f deployments/k8s-v1.16/sriovdp-daemonset.yaml && cd ..
    
  4. Once the DaemonSets are deployed, you can check the container logs for troubleshooting, as shown below. After a successful deployment, the node should list the virtual functions as allocatable resources.

    kubectl get node <node_name> -o json | jq '.status.allocatable'
    
    {
      "broadcom.com/netxtreme_sriov_rdma": "2",
      "cpu": "128",
      "ephemeral-storage": "37070025462",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "527632840Ki",
      "pods": "110"
    }
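
    If the VF resources do not appear, the device plug-in pod logs usually show why (for example, a malformed config.json or no matching devices). A generic way to find and read them (pod names depend on the manifest):

    kubectl -n kube-system get pods | grep sriov-device-plugin
    kubectl -n kube-system logs <sriov-device-plugin_pod_name>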
    

Task 4: Install a meta plug-in CNI (Multus)

Multus is a meta plug-in that provides the VF information to a downstream CNI, such as the SR-IOV CNI plug-in, which handles the network resource plumbing. This enables “multi-homed” pods, that is, pods with multiple network interfaces.

  1. Install Multus:

    git clone https://github.com/k8snetworkplumbingwg/multus-cni.git && cd multus-cni
    kubectl apply -f images/multus-daemonset.yml && cd ..
    

    Note:

    • The default image used by the DaemonSet, tagged stable, requires kubelet v1.20.x. If installing on an older cluster, edit the DaemonSet in the manifest to use the Multus image tag v3.7.1.

    • This manifest creates a new CRD for kind: NetworkAttachmentDefinition and provides the Multus binary on all nodes through a DaemonSet.

  2. To attach additional interfaces to pods, we need a configuration for each interface to be attached. This is encapsulated in a custom resource of kind NetworkAttachmentDefinition, which is essentially a CNI configuration packaged as a custom resource.

    Let’s set up a sample NetworkAttachmentDefinition:

    cat << EOF > sriov-net1.yaml
    apiVersion: k8s.cni.cncf.io/v1
    kind: NetworkAttachmentDefinition
    metadata:
      name: sriov-net1
      annotations:
        k8s.v1.cni.cncf.io/resourceName: broadcom.com/netxtreme_sriov_rdma
    spec:
      config: '{
      "type": "sriov",
      "cniVersion": "0.3.1",
      "name": "sriov-network",
      "ipam": {
        "type": "host-local",
        "subnet": "10.20.30.0/25",
        "routes": [{
          "dst": "0.0.0.0/0"
        }],
        "gateway": "10.20.10.1"
      }
    }'
    EOF
    
    kubectl apply -f sriov-net1.yaml
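
    The custom resource can be verified with kubectl; the CRD registers the plural name network-attachment-definitions (short name net-attach-def):

    kubectl get network-attachment-definitions sriov-net1 -o yaml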
    
    

Task 5: Deploy and test pods with multiple interfaces
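
A pod requests a VF through the resource name published by the device plug-in and references the NetworkAttachmentDefinition through the k8s.v1.cni.cncf.io/networks annotation. The following is a minimal sketch; the pod name, image, and sleep command are illustrative.

cat << EOF > sriov-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriov-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net1
spec:
  containers:
  - name: app
    image: container-registry.oracle.com/os/oraclelinux:8   # illustrative image
    command: ["sleep", "infinity"]
    resources:
      requests:
        broadcom.com/netxtreme_sriov_rdma: '1'
      limits:
        broadcom.com/netxtreme_sriov_rdma: '1'
EOF

kubectl apply -f sriov-test-pod.yaml

# The pod should come up with lo, eth0 (overlay), and net1 (the SR-IOV VF)
kubectl exec sriov-test-pod -- ls /sys/class/net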
