Learn About Publishing GPU Instance Metrics to Oracle Cloud Infrastructure Monitoring

Customers use Oracle Cloud Infrastructure Monitoring to actively and passively monitor cloud resources by using custom metrics and alarm features that are built into the service. Currently, though, Oracle Cloud Infrastructure Monitoring doesn't natively support collection of GPU-specific data from instances running those advanced chipsets.

Despite that limitation, if your instance uses an NVIDIA GPU, you can publish custom metrics to the Monitoring service. This solution shows how to use a Linux shell script or a Windows PowerShell script to publish GPU temperature, GPU utilization, and GPU memory utilization metrics from OCI GPU instances to Oracle Cloud Infrastructure Monitoring. The scripts you develop in this exercise can also be modified to publish other GPU-specific data.

About Oracle Cloud Infrastructure Monitoring

Oracle Cloud Infrastructure Monitoring enables you to actively and passively monitor your cloud resources using the Metrics and Alarms features.

The Metrics feature relays metric data about the health, capacity, and performance of your cloud resources. A metric is a measurement related to the health, capacity, or performance of a given resource. Resources, services, and applications emit metrics to the Monitoring service. Common metrics reflect data related to:
  • Availability and latency
  • Application uptime and downtime
  • Completed transactions
  • Failed and successful operations
  • Key performance indicators (KPIs), such as sales and engagement quantifiers
By querying Monitoring for this data, you can understand how well the systems and processes are working to achieve the service levels you commit to your customers. For example, you can monitor the CPU utilization and disk reads of your Compute instances. You can then use this data to determine when to launch more instances to handle increased load, troubleshoot issues with your instance, or better understand system behavior.
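
As an illustration of such a query, the following sketch asks the OCI CLI for the mean CPU utilization of Compute instances at one-minute intervals. The helper mql_query is a hypothetical function written for this example, and COMPARTMENT_ID is assumed to be an environment variable that holds your compartment OCID. The CLI call runs only if the oci command is installed and that variable is set.

```shell
#!/bin/sh
# Sketch only: query the Monitoring service for mean CPU utilization.

# Hypothetical helper: build a Monitoring Query Language (MQL) expression
# of the form Metric[interval].statistic()
mql_query() {
  printf '%s[%s].%s()\n' "$1" "$2" "$3"
}

# Run the query only when the OCI CLI is present and a compartment is given.
if command -v oci >/dev/null 2>&1 && [ -n "${COMPARTMENT_ID:-}" ]; then
  oci monitoring metric-data summarize-metrics-data \
    --compartment-id "$COMPARTMENT_ID" \
    --namespace oci_computeagent \
    --query-text "$(mql_query CpuUtilization 1m mean)"
fi
```

Custom GPU metrics published by this solution could be queried the same way, using the custom namespace and metric names chosen when publishing.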

Architecture

For Linux images, the solution is a shell script that is run by a cron job automatically. For Windows images, the solution is a PowerShell script that is run by a scheduled task automatically.

The scripts provided in this solution do the following:
  1. Query metrics from the GPU instance by using nvidia-smi.
  2. Convert the metrics gathered from nvidia-smi to valid JSON.
  3. Publish the JSON to the Monitoring service by using the CLI.
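
The three steps above can be sketched as a single shell script. This is a minimal illustration under stated assumptions, not the script provided with this solution. The metric names (gpuTemperature, gpuUtilization, gpuMemoryUtilization), the gpuIndex dimension, the gpu_monitoring namespace, the compartment OCID, and the region are hypothetical placeholders, and the helper gpu_csv_to_json is invented for the example. It assumes that nvidia-smi and the OCI CLI are installed and that the CLI is already configured.

```shell
#!/bin/sh
# Sketch only: query nvidia-smi, convert its CSV output to JSON,
# and publish the JSON with the OCI CLI.

# Hypothetical placeholders: replace with your own values.
NAMESPACE="${NAMESPACE:-gpu_monitoring}"
COMPARTMENT_ID="${COMPARTMENT_ID:-ocid1.compartment.oc1..example}"
REGION="${REGION:-us-ashburn-1}"

# Step 2: read "index, temperature, gpu_util, mem_util" CSV lines on stdin
# and write a JSON array of metric objects to stdout.
# $1 is the RFC 3339 timestamp to attach to every datapoint.
gpu_csv_to_json() {
  awk -F', *' -v ns="$NAMESPACE" -v cid="$COMPARTMENT_ID" -v ts="$1" '
    BEGIN {
      split("gpuTemperature gpuUtilization gpuMemoryUtilization", names, " ")
      printf "["
    }
    NF >= 4 {
      for (i = 1; i <= 3; i++) {
        # Skip values such as [N/A] that are not plain numbers.
        if ($(i + 1) !~ /^[0-9]+(\.[0-9]+)?$/) continue
        printf "%s{\"namespace\":\"%s\",\"compartmentId\":\"%s\",\"name\":\"%s\",", sep, ns, cid, names[i]
        printf "\"dimensions\":{\"gpuIndex\":\"%s\"},", $1
        printf "\"datapoints\":[{\"timestamp\":\"%s\",\"value\":%s}]}", ts, $(i + 1)
        sep = ","
      }
    }
    END { print "]" }
  '
}

# Run the pipeline only when both command line tools are available.
if command -v nvidia-smi >/dev/null 2>&1 && command -v oci >/dev/null 2>&1; then
  payload="$(mktemp)"

  # Step 1: query one CSV line per GPU, without header or units.
  nvidia-smi \
    --query-gpu=index,temperature.gpu,utilization.gpu,utilization.memory \
    --format=csv,noheader,nounits |
    gpu_csv_to_json "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$payload"

  # Step 3: post the JSON to the Monitoring ingestion endpoint.
  oci monitoring metric-data post \
    --metric-data "file://$payload" \
    --endpoint "https://telemetry-ingestion.${REGION}.oraclecloud.com"

  rm -f "$payload"
fi
```

To run such a script every minute from cron, a crontab entry like `* * * * * /path/to/publish_gpu_metrics.sh` could be used, where the path and file name are hypothetical. Keeping the conversion in its own function makes it possible to check the JSON on a machine without a GPU by piping sample CSV lines into it.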

[Illustration: publish_gpu_instance_to_oci_monitoring.png]

After the setup is completed, the metrics become visible in Oracle Cloud Infrastructure Monitoring. The following image shows the GPU temperature for a specific time frame:

[Illustration: metrics_explorer.png]

Considerations for Publishing GPU Instance Metrics to Oracle Cloud Infrastructure Monitoring

Before starting this solution, you should consider these important factors:

  • This solution applies only to NVIDIA graphics processing units.
  • Operating system versions:
    • For Linux images, the solution is a shell script that is run by a cron job automatically. This solution supports these Linux releases:
      • Oracle Linux 7
      • Ubuntu 16.04
      • Ubuntu 18.04
    • For Windows images, the solution is a PowerShell script that is run by a scheduled task automatically. This solution supports these releases:
      • Windows Server 2012 R2
      • Windows Server 2016
    • Where necessary, this solution presents separate instructions based on your specific operating system. Ensure that you are following the correct instructions for your operating system.
  • IAM policy: By default, the script publishes the metrics to the same compartment as the GPU instance being monitored. You probably have the necessary IAM policy already configured for your user. If you plan to use a separate compartment for publishing the metrics, or if you get a message that you don’t have permission or are unauthorized, check with your tenancy administrator.
  • User interface: This solution uses command line interfaces (CLIs) to run the process. It uses the NVIDIA System Management Interface (nvidia-smi) command line utility to gather metrics from the GPUs and the Oracle Cloud Infrastructure CLI to upload the metrics to Oracle Cloud Infrastructure Monitoring. Further information about these interfaces follows.

About nvidia-smi

This solution uses the NVIDIA System Management Interface (nvidia-smi) command line utility to gather metrics from the GPUs, and then uses the Oracle Cloud Infrastructure command line interface (CLI) to publish these metrics to the Monitoring service.

The nvidia-smi utility ships with the NVIDIA GPU display drivers. GPU shapes with Oracle Linux 7 images come with the GPU drivers installed. GPU shapes with Ubuntu 16.04, Ubuntu 18.04, Windows Server 2012 R2, and Windows Server 2016 images are supported, but you must install the appropriate GPU drivers from NVIDIA.
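
Because the utility is available only after the drivers are installed, a quick check on a Linux instance can confirm that it is on the PATH before any metrics script is scheduled. The following is a minimal sketch. The function name check_nvidia_smi is hypothetical, and the query simply prints each GPU's name and driver version.

```shell
#!/bin/sh
# Sketch only: confirm that nvidia-smi is installed and list the GPUs it sees.
check_nvidia_smi() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # One CSV line per GPU: product name and installed driver version.
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
  else
    echo "nvidia-smi not found: install the NVIDIA GPU driver first" >&2
    return 1
  fi
}
```

A setup script could call `check_nvidia_smi || exit 1` so that it stops early on an instance where the drivers are missing.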