Enabling GPU metrics with the OCA HPC plugin
You can enable GPU metrics with the Oracle Cloud Agent High Performance Computing plugin on your instances.
Current OCI HPC package | New OCA plugin | Description |
---|---|---|
oci-cn-auth | Compute HPC RDMA Authentication oci-rdma-authentication |
Configures RDMA/RoCE network interfaces with QoS, MTU, etc. settings and maintains authentication. |
oci-hpc-mlx-configure | Compute HPC RDMA Auto-Configuration oci-hpc-configure |
Configures Mellanox ConnectX-5 firmware and PCIE settings. |
oci-hpc-rdma-configure | Compute HPC RDMA Auto-Configuration oci-hpc-configure |
Configures RDMA interface ip addresses. |
oci-hpc-dapl-configure | Compute HPC RDMA Auto-Configuration oci-hpc-configure |
Configure legacy MPI DAPL oci-dat.conf. |
You can transition from python-based solutions to use the Oracle Cloud Agent High Performance Computing plugin.
Enabling Compute HPC RDMA Authentication and Auto-Configuration on an Existing Instance
Do not perform this workflow on a running workload. These actions can be disruptive and result in data loss.
-
Determine which version of Oracle Cloud Agent is installed. Version 1.35.0 or above is required. If the version is not 1.35.0 or above, contact support to obtain the installation package.
OL7/8
# sudo yum info oracle-cloud-agent
Ubuntu
snap info oracle-cloud-agent
-
Stop the existing oci-cn-auth services.
# sudo systemctl stop oci-cn-auth-renew # sudo systemctl stop oci-cn-auth
-
Verify that oci-cn-auth is stopped.
# sudo systemctl status oci-cn-auth
-
Stop the wpa_supplicant services.
# sudo systemctl stop wpa_supplicant-wired*
-
Verify that wpa_supplicant service are stopped.
# sudo systemctl status wpa_supplicant-wired*
-
Remove the oci-cn-auth, oci-hpc-rdma-configure, oci-hpc-mlx-configure, and oci-hpc-dapl-configure package, if installed.
OL7/8
# sudo yum remove oci-cn-auth oci-hpc-rdma-configure oci-hpc-mlx-configure oci-hpc-dapl-configure
Ubuntu20
# sudo apt-get remove oci-cn-auth oci-hpc-rdma-configure oci-hpc-mlx-configure oci-hpc-dapl-configure
-
Verify that the agent is enabled and running.
OL7/8
# sudo systemctl status oracle-cloud-agent # sudo systemctl status oracle-cloud-agent-updater
Ubuntu20
# sudo systemctl status snap.oracle-cloud-agent.oracle-cloud-agent.service # sudo systemctl status snap.oracle-cloud-agent.oracle-cloud-agent-updater.service
-
Download the current agent configuration on the instance. See Managing Plugins for information on how to enable the plugin.
# curl --silent -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq -r '.agentConfig' > agent-config.json
-
Modify the agent-config.json to enable one or more plugins.
# cat agent-config.json { "monitoringDisabled": false, "managementDisabled": false, "allPluginsDisabled": false, "isManagementDisabled": false, "pluginsConfig": [ { "name": "Compute HPC RDMA Authentication", "desiredState": "ENABLED" }, { "name": "Compute HPC RDMA Auto-Configuration", "desiredState": "ENABLED" } ] }
-
Use the OCI ZCLI or OCI SDK to update the agentConfig for the instance.
# oci compute instance update --instance-id <instance ocid> --agent-config file://agent-config.json
-
Verify that OCA plugin is enabled for the instance via the command line of the SDK.
# curl --silent -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq -r '.agentConfig'
-
Verify that the plugin is running. It takes several minutes for the agentConfig changes to populate to the Oracle Cloud Agent.
# ps -leaf | grep oci-rdma-authentication
-
Confirm that all RDMA network interfaces have a wpa_supplicant
# ps -leaf | grep wpa_supplicant
Launching instance with HPC RDMA Authentication plug-in enabled
Provided the custom image has Oracle Cloud Agent 1.35.0 or above and the OCI HPC packages are not present, the LaunchInstanceDetails is used to apply the agentConfig with the plug-in enabled. OS must have the NVIDIA GPU drivers and Mellanox OFED drivers installed.
For more information, see Oracle Cloud Agent.
Enabling RDMA GPU monitoring
With Oracle Cloud Agent 1.35.0 new functionality to monitor RDMA and GPU is available. To enable this functionality on an existing instance do the following:
-
Download the current agent configuration on the instance. The sections below are only one way of enabling the plug-in. For more information, see Oracle Cloud Agent.
# curl --silent -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq -r '.agentConfig' > agent-config.json
-
Modify the json by adding the "Compute RDMA GPU Monitoring".
# cat agent-config.json { "monitoringDisabled": false, "managementDisabled": false, "allPluginsDisabled": false, "isManagementDisabled": false, "pluginsConfig": [ { "name": "Compute HPC RDMA Authentication", "desiredState": "ENABLED" }, { "name": "Compute HPC RDMA Auto-Configuration", "desiredState": "ENABLED" }, { "name": "Compute RDMA GPU Monitoring", "desiredState": "ENABLED" } ] }
-
Use the OCI CLI or OCI SDK to update the agentConfig for the instance.
# oci compute instance update --instance-id <instance ocid> --agent-config file://agent-config.json
Required policies for RDMA GPU monitoring
If you use a private VPN, you need Service Gateway. If you use a public internet gateway, Service Gateway is not required.
For information on how to use the Monitoring service, see Securing Monitoring.
Create a dynamic group
This example creates a group that contains all instances in a specific compartment.
Any {instance.compartment.id = '<compartment_ocid>'}
Create a policy
Create a policy using the dynamic group to allow to instances to publish metrics. The HPC monitoring plug-in creates 2 custom namespaces that are billed:
gpu_infrastructure_health
rdma_infrastructure_health
Allow dynamic-group <group_name> to use metrics in compartment <compartment_name> where target.metrics.namespace=<metric_namespace>'
Allow dynamic-group <group_name> to read metrics in compartment <compartment_name>
For information on how to publish custom metrics to the Monitoring service, see Publishing Custom Metrics.