Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Monitor GPU Superclusters on Oracle Cloud Infrastructure with NVIDIA Data Center GPU Manager, Grafana and Prometheus
Introduction
Artificial Intelligence (AI) and Machine Learning (ML) on Oracle Cloud Infrastructure (OCI) NVIDIA GPU Superclusters were announced earlier this year. For many customers running these GPU-based workloads at scale, monitoring can be a challenge. Out-of-the-box, OCI provides excellent monitoring solutions, including some GPU instance metrics, but deeper integration with GPU metrics for OCI Superclusters is something our customers are interested in.
Setting up GPU monitoring is a straightforward process. If you are a customer already using the HPC Marketplace stack, then most of the tooling is already included in that deployment image.
Objectives
- Monitor GPU OCI Superclusters with NVIDIA Data Center GPU Manager, Grafana and Prometheus.
Prerequisites
Ensure you have these prerequisites installed on each GPU instance: the NVIDIA GPU driver, Docker, and the NVIDIA Container Toolkit (required for the docker run --gpus commands in the next task). The monitoring solution then requires installing and running the NVIDIA Data Center GPU Manager (DCGM) exporter Docker container on each GPU node, capturing the exported metrics with Prometheus, and finally displaying the data in Grafana.
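A quick way to confirm the GPU and Docker prerequisites are in place on a node (standard commands, included here as a sanity check rather than part of the original steps):

nvidia-smi        # confirms the NVIDIA driver is installed and can see the GPUs
docker --version  # confirms Docker is installed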
Task 1: Install and Execute the NVIDIA DCGM exporter Docker container
Installation and execution of the NVIDIA DCGM exporter container is done with a single docker command.
- Run the following command for a non-persistent Docker container. This execution method will not restart the container if the Docker service is stopped or the GPU server is rebooted.

docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.3-3.1.6-ubuntu20.04
- If persistence is desired, use this execution method instead.

docker run --restart unless-stopped -d --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.3-3.1.6-ubuntu20.04
Note: Ensure that the Docker service is enabled to start at boot for this example to persist through reboots.
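On systemd-based images this is typically done with the following command (a standard systemd command, included here as a sketch rather than part of the original steps).

sudo systemctl enable docker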
Both of these methods require the GPU nodes to have internet access (NAT Gateway is most common) in order to download and run the DCGM exporter container. This container needs to be running on each GPU node being monitored. Once running, the next step is to set up Prometheus on a separate compute node, ideally one which is in an edge network and can be accessed directly over the internet.
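Once the container is running on a node, a quick way to confirm it is exporting data is to query the exporter endpoint locally; the DCGM exporter serves Prometheus-format metrics on the mapped port 9400 at /metrics, with metric names beginning with DCGM_.

curl -s localhost:9400/metrics | head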
Task 2: Install Prometheus
Installation of Prometheus is done by downloading and unpacking the latest release. When a newer release is available, update the release number in the following commands as applicable.
- Fetch the package.

sudo wget https://github.com/prometheus/prometheus/releases/download/v2.37.9/prometheus-2.37.9.linux-amd64.tar.gz -O /opt/prometheus-2.37.9.linux-amd64.tar.gz
- Change directory.

cd /opt
- Extract the package.

sudo tar -zxvf prometheus-2.37.9.linux-amd64.tar.gz
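To confirm the extraction succeeded, the binary can be run with its version flag (the path assumes the default extraction directory from the previous steps).

/opt/prometheus-2.37.9.linux-amd64/prometheus --version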
- Once unpacked, a Prometheus config YAML needs to be generated in order to scrape the DCGM data from the GPU nodes. Start with the standard headers. In this case we are deploying the config to /etc/prometheus/prometheus.yml (the path referenced by the systemd unit in a later step), so first create the directory.

sudo mkdir -p /etc/prometheus
- Edit the YAML file with your editor of choice. The header section includes the standard global config.

# my global config
global:
  scrape_interval: 5s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
  evaluation_interval: 30s # Evaluate rules every 30 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
- Now, entries for each GPU host need to be inserted as targets to scrape. This list can be generated programmatically (see the sketch after the note below) and must use either host IPs or resolvable DNS hostnames.

scrape_configs:
  - job_name: 'gpu'
    scrape_interval: 5s
    static_configs:
      - targets: ['host1_ip:9400','host2_ip:9400', … ,'host64_ip:9400']
Note: The targets section is abbreviated for this tutorial, and the “host1_ip” used here should be a resolvable IP or hostname for each GPU host.
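As mentioned above, the targets list can be generated programmatically. A minimal shell sketch, assuming a hypothetical file gpu_hosts.txt containing one GPU host IP or hostname per line:

# Wrap each host as 'host:9400' and join the entries with commas
targets=$(sed "s/.*/'&:9400'/" gpu_hosts.txt | paste -sd, -)

# Print a targets line ready to paste into the static_configs section
echo "      - targets: [${targets}]"

Once /etc/prometheus/prometheus.yml is complete, it can be validated with the promtool utility that ships in the Prometheus release tarball.

/opt/prometheus-2.37.9.linux-amd64/promtool check config /etc/prometheus/prometheus.yml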
- Once this file is in place, a systemd unit needs to be created so that the Prometheus service can be started, managed, and persisted. Edit /lib/systemd/system/prometheus.service with your editor of choice (as root or with sudo).

[Unit]
Description=Prometheus daemon
After=network.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/opt/prometheus-2.37.9.linux-amd64/prometheus --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/data \
    --storage.tsdb.max-block-duration=2h \
    --storage.tsdb.min-block-duration=2h \
    --web.enable-lifecycle
PrivateTmp=true

[Install]
WantedBy=multi-user.target
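After creating the unit file, reload systemd so that it picks up the new unit; enabling the service also makes it start automatically at boot. These are standard systemd commands and are not part of the original steps.

sudo systemctl daemon-reload
sudo systemctl enable prometheus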
- Finally, a data directory needs to be created. Ideally, a block storage volume should be used here to provide dedicated, resilient storage for the Prometheus data. In the above example, the target is a block volume mounted at /data on the compute node, as seen in the storage.tsdb.path directive.
- All Prometheus pieces are now in place, and the service can be started.

sudo systemctl start prometheus
- Check the status to ensure it’s running using the following command.

systemctl status prometheus
- Verify data is present.

ls /data/
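You can also confirm that Prometheus is scraping the GPU nodes by querying its HTTP API on the default port 9090; each configured target should report a health of up. A quick check:

curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'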
Task 3: Set up the Grafana Dashboard
The last piece of this setup is the Grafana Dashboard. Installing Grafana requires enabling the Grafana software repository. For more details, see Grafana docs.
- For Oracle Enterprise Linux, install the GPG key, set up the software repository, and then run a yum install for the Grafana server, as shown in the sketch below.
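A minimal sketch of those steps follows; the repository URL and GPG key reflect Grafana’s published RPM instructions at the time of writing, so verify them against the current Grafana documentation before use.

# Import the Grafana GPG key (verify the URL against the current Grafana docs)
sudo rpm --import https://rpm.grafana.com/gpg.key

# Define the Grafana repository
sudo tee /etc/yum.repos.d/grafana.repo <<'EOF'
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
EOF

# Install the Grafana server package
sudo yum install -y grafana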
- Start the Grafana server and ensure the VCN has an open port to access the GUI. Alternatively, an SSH tunnel can be used if edge access is prohibited. Either way, the server can be accessed on TCP port 3000.

sudo systemctl start grafana-server
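If the SSH tunnel approach is used, a typical invocation looks like the following (the opc user and the host address are placeholders for your environment); Grafana is then reachable at http://localhost:3000 on your workstation.

ssh -L 3000:localhost:3000 opc@<grafana_host_public_ip>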
- Log in to the Grafana GUI; the admin password will need to be changed as part of the first login step. Once that is done, navigate to the data sources to set up Prometheus as a data source.
- Select “+ Add new data source”, select Prometheus, and then fill in the HTTP URL section. The Prometheus source entry uses localhost:9090 in this case.
- Once that is complete, scroll down, click Save & test, and validate the connection. Finally, the DCGM dashboard from NVIDIA can be imported. Navigate to the Dashboards menu.
- Choose New and Import.
- Enter the NVIDIA DCGM exporter dashboard ID “12239”, click Load, select Prometheus as the data source from the drop-down list, and click Import.
Once imported, the NVIDIA DCGM dashboard displays GPU information for the cluster hosts targeted by Prometheus. Out-of-the-box, the dashboard includes valuable information for a given date/time range.
Examples:
- Temperature
- Power
- SM Clock
- Utilization
This monitoring data can be invaluable for customers who want deeper insights into their infrastructure while running AI/ML workloads on OCI GPU Superclusters. If you have AI/ML GPU-based workloads that require ultrafast cluster networking, consider OCI for its industry-leading scalability at a much better price than other cloud providers.
Acknowledgments
Author - Zachary Smith (Principal Member of Technical Staff, OCI IaaS - Product & Customer engagement)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
Monitor GPU Superclusters on Oracle Cloud Infrastructure with NVIDIA Data Center GPU Manager, Grafana and Prometheus
F87825-01
October 2023
Copyright © 2023, Oracle and/or its affiliates.