Prepare to Publish GPU Instance Metrics to Oracle Cloud Infrastructure Monitoring

Before you can publish the GPU instance metrics, you need to ensure that both command-line tools (the Oracle Cloud Infrastructure CLI and NVIDIA-smi) are installed, and then create the Linux shell script or Windows PowerShell script that will collect the metrics, convert them, and produce a version that the Oracle Cloud Infrastructure Monitoring service can consume.

Note:

Procedures in this article are specific to the operating system you are using. Ensure you are using the instructions related to your operating system.

Install the Oracle Cloud Infrastructure CLI

The script uses the Oracle Cloud Infrastructure command-line interface (CLI) to upload the metrics to the Oracle Cloud Infrastructure Monitoring service, so you need to install the CLI on the GPU instances that you want to monitor.

To install the CLI, use one of the following commands (depending on the operating system):

On Linux:
bash -c "$(curl -L https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.sh)"
On Windows:
powershell -NoProfile -ExecutionPolicy Bypass -Command "iex ((New-Object System.Net.WebClient).DownloadString('https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.ps1'))"

To have the CLI walk you through the first-time setup process, use the oci setup config command. The command prompts you for the information required for the configuration file and the API public/private keys. The setup dialog generates an API key pair and creates the configuration file.
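
When the setup finishes, the CLI writes the configuration to ~/.oci/config by default (C:\Users\<username>\.oci\config on Windows). For reference, a minimal configuration file looks roughly like the following sketch; the OCID, fingerprint, key path, and region values shown here are placeholders, so substitute the values from your own tenancy:

[DEFAULT]
user=ocid1.user.oc1..<unique_user_ID>
fingerprint=<your_api_key_fingerprint>
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..<unique_tenancy_ID>
region=us-ashburn-1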

If you change the default installation location of the CLI, or use Ubuntu Linux as the operating system, make sure you update the cliLocation variable in the shell script.

For Linux:
# OCI CLI binary location
# Default installation location for Oracle Linux 7 is /home/opc/bin/oci
# Default installation location for Ubuntu 18.04 and Ubuntu 16.04 is /home/ubuntu/bin/oci
cliLocation="/home/opc/bin/oci"
For Windows:
# OCI CLI binary location
# Default installation location is "C:\Users\opc\bin"
$cliLocation = "C:\Users\opc\bin"

Verify NVIDIA-smi Installation

Next, verify that NVIDIA-smi is installed. This command-line utility is built on top of the NVIDIA Management Library (NVML) and is used to manage and monitor NVIDIA GPU devices.

If you are already using your GPU instances to run GPU workloads, you most likely already have the appropriate NVIDIA drivers installed. On Windows, the default installation directory for NVIDIA-smi is C:\Program Files\NVIDIA Corporation\NVSMI. The script checks whether NVIDIA-smi is installed and in your path. To check manually, connect to your GPU instance over SSH (Linux) or RDP (Windows) and run nvidia-smi (nvidia-smi.exe on Windows) at the command line. If NVIDIA-smi is installed, you'll see output similar to the following:

Thu Nov 07 21:05:43 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 426.23       Driver Version: 426.23       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  TCC  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    22W / 300W |      0MiB / 16258MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+ 
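
As an additional, optional check, you can confirm that the query interface the script relies on works by asking NVIDIA-smi for a few GPU properties in CSV form (the specific fields queried here are just an illustration). On Windows, run nvidia-smi.exe instead:

nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu --format=csv

The command should print a CSV header followed by one line per GPU.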

Create the Script

You now create the script for publishing GPU temperature, GPU utilization, and GPU memory utilization metrics from GPU instances to the Monitoring service.

Note:

This procedure describes both the Linux and Windows coding required to create the script. Be sure you're following the procedures appropriate to your operating system.

In any code editor:

  1. If you are creating a Linux script, start by redirecting the script's output to a log file:
    #!/bin/bash
    
    # Write logs to /tmp/gpuMetrics.log
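    # Save the original stdout and stderr on file descriptors 3 and 4, restore them
    # when the script exits (or is interrupted), and append all script output to the log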
    exec 3>&1 4>&2
    trap 'exec 2>&4 1>&3' 0 1 2 3
    exec 1>>/tmp/gpuMetrics.log 2>&1
    date
  2. Identify the location of each command-line utility: NVIDIA-smi and the Oracle Cloud Infrastructure CLI (plus jq and cURL on Linux). The Linux code exits with an appropriate error if a utility is not installed; the Windows code adds the utilities to the path for the current session if needed:
    • For Linux:
      # OCI CLI binary location
      # Default installation location for Oracle Linux 7 is /home/opc/bin/oci
      # Default installation location for Ubuntu 18.04 and Ubuntu 16.04 is /home/ubuntu/bin/oci
      cliLocation="/home/opc/bin/oci"
      
      # Check if OCI CLI, nvidia-smi, jq, and curl are installed
      if ! [ -x "$(command -v $cliLocation)" ]; then
        echo 'Error: OCI CLI is not installed. Please follow the instructions in this link: https://docs.cloud.oracle.com/iaas/Content/API/SDKDocs/cliinstall.htm' >&2
        exit 1
      fi
      
      if ! [ -x "$(command -v nvidia-smi)" ]; then
        echo 'Error: nvidia-smi is not installed.' >&2
        exit 1
      fi
      
      if ! [ -x "$(command -v jq)" ]; then
        echo 'Error: jq is not installed.' >&2
        exit 1
      fi
      
      if ! [ -x "$(command -v curl)" ]; then
        echo 'Error: curl is not installed.' >&2
        exit 1
      fi
      
    • For Windows:
      Get-Date
      
      # OCI CLI binary location
      # Default installation location is "C:\Users\opc\bin"
      $cliLocation = "C:\Users\opc\bin"
      
      # nvidia-smi binary location
      # Default installation location is "C:\Program Files\NVIDIA Corporation\NVSMI"
      $nvidiaSmiLocation = "C:\Program Files\NVIDIA Corporation\NVSMI"
      
      # Check if oci.exe and nvidia-smi.exe are in the path and add them if needed. This is not a persistent add.
      if ((Get-Command "oci.exe" -ErrorAction SilentlyContinue) -eq $null) { 
         $env:Path = $env:Path + ";" + "$cliLocation"
      }
      
      if ((Get-Command "nvidia-smi.exe" -ErrorAction SilentlyContinue) -eq $null) { 
         $env:Path = $env:Path + ";" + "$nvidiaSmiLocation"
      } 
  3. Get instance metadata. By default, metrics are published to the same compartment as the instance being monitored. You can change the following variables if you want to use different values.
    • For Linux:
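      # 169.254.169.254 is the instance metadata service endpoint; it returns the
      # instance details as JSON, and jq extracts the individual fields used below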
      compartmentId=$(curl -s -L http://169.254.169.254/opc/v1/instance/ | jq -r '.compartmentId')
      metricNamespace="gpu_monitoring"
      metricResourceGroup="gpu_monitoring_rg"
      instanceName=$(curl -s -L http://169.254.169.254/opc/v1/instance/ | jq -r '.displayName')
      instanceId=$(curl -s -L http://169.254.169.254/opc/v1/instance/ | jq -r '.id')
      endpointRegion=$(curl -s -L http://169.254.169.254/opc/v1/instance/ | jq -r '.canonicalRegionName')
    • For Windows:
      $getMetadata = (curl -s -L http://169.254.169.254/opc/v1/instance/) | ConvertFrom-Json
      $compartmentId = $getMetadata.compartmentId
      $metricNamespace = "gpu_monitoring"
      $metricResourceGroup = "gpu_monitoring_rg"
      $instanceName = $getMetadata.displayName
      $instanceId = $getMetadata.id
      $endpointRegion = $getMetadata.canonicalRegionName
  4. Get the GPU temperature, GPU utilization, and GPU memory utilization data from NVIDIA-smi and convert it to values that Oracle Cloud Infrastructure Monitoring accepts:
    • For Linux:
      1. Use nvidia-smi to get the GPU information (time of reading, temperature, GPU utilization, and memory utilization), convert the time of reading to the format that Oracle Cloud Infrastructure Monitoring expects, and then assign the temperature, GPU utilization, and memory utilization values to variables.
        getMetrics=$(nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,utilization.memory --format=csv,noheader,nounits) 
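        # nvidia-smi returns the timestamp as "YYYY/MM/DD HH:MM:SS.mmm"; the sed command below
        # drops the fractional seconds, replaces the space with "T" and the slashes with "-",
        # and appends "Z" to match the timestamp format that the Monitoring service expects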
        gpuTimestamp=$(echo $getMetrics | awk -F, '{print $1}' | sed -e 's/\.[^.]*$//' -e 's/ /T/' -e 's/\//-/g' -e 's/$/Z/')
        gpuTemperature=$(echo $getMetrics | awk -F, '{print $2}' | xargs)
        gpuUtilization=$(echo $getMetrics | awk -F, '{print $3}' | xargs)
        gpuMemoryUtilization=$(echo $getMetrics | awk -F, '{print $4}' | xargs)
      2. Build the JSON payload for the custom metrics, starting with the GPU temperature:
        
        cat << EOF > /tmp/metrics.json
        [
           {
              "namespace":"$metricNamespace",
              "compartmentId":"$compartmentId",
              "resourceGroup":"$metricResourceGroup",
              "name":"gpuTemperature",
              "dimensions":{
                 "resourceId":"$instanceId",
                 "instanceName":"$instanceName"
              },
              "metadata":{
                 "unit":"degrees Celcius",
                 "displayName":"GPU Temperature"
              },
              "datapoints":[
                 {
                    "timestamp":"$gpuTimestamp",
                    "value":$gpuTemperature
                 }
              ]
           },
      3. Add the GPU utilization custom metric:
        {
              "namespace":"$metricNamespace",
              "compartmentId":"$compartmentId",
              "resourceGroup":"$metricResourceGroup",
              "name":"gpuUtilization",
              "dimensions":{
                 "resourceId":"$instanceId",
                 "instanceName":"$instanceName"
              },
              "metadata":{
                 "unit":"percent",
                 "displayName":"GPU Utilization"
              },
              "datapoints":[
                 {
                    "timestamp":"$gpuTimestamp",
                    "value":$gpuUtilization
                 }
              ]
           },
      4. Add the GPU memory utilization custom metric and complete the JSON file:
        
        {
              "namespace":"$metricNamespace",
              "compartmentId":"$compartmentId",
              "resourceGroup":"$metricResourceGroup",
              "name":"gpuMemoryUtilization",
              "dimensions":{
                 "resourceId":"$instanceId",
                 "instanceName":"$instanceName"
              },
              "metadata":{
                 "unit":"percent",
                 "displayName":"GPU Memory Utilization"
              },
              "datapoints":[
                 {
                    "timestamp":"$gpuTimestamp",
                    "value":$gpuMemoryUtilization
                 }
              ]
           }
        ]
        EOF
        
      5. Use the $cliLocation variable, which was set earlier to the location of the OCI CLI installation on the local machine, to run the OCI CLI post command and send the metrics:
        
        $cliLocation monitoring metric-data post --metric-data file:///tmp/metrics.json --endpoint https://telemetry-ingestion.$endpointRegion.oraclecloud.com
    • For Windows:
      1. Use nvidia-smi to get the GPU information (time of reading, temperature, GPU utilization, and memory utilization), convert the time of reading to the format that Oracle Cloud Infrastructure Monitoring expects, and then assign the temperature, GPU utilization, and memory utilization values to variables.
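        # nvidia-smi returns the timestamp as "YYYY/MM/DD HH:MM:SS.mmm"; the $gpuTimestamp line
        # removes the fractional seconds, swaps the space for "T" and the slashes for "-", and
        # appends "Z" to match the timestamp format that the Monitoring service expects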
        $nvidiaTimestamp = (nvidia-smi.exe --query-gpu=timestamp --format="csv,noheader,nounits")
        $gpuTimestamp = ($nvidiaTimestamp.Replace(" ", "T").Replace("/", "-")).Substring(0, $nvidiaTimestamp.IndexOf('.')) + "Z"
        $gpuTemperature = (nvidia-smi.exe --query-gpu=temperature.gpu --format="csv,noheader,nounits")
        $gpuUtilization = (nvidia-smi.exe --query-gpu=utilization.gpu --format="csv,noheader,nounits")
        $gpuMemoryUtilization = (nvidia-smi.exe --query-gpu=utilization.memory --format="csv,noheader,nounits")
        
      2. Build the JSON payload for the custom metrics, starting with the GPU temperature:
        
        $metricsJson = @"
        [
           {
              "namespace":"$metricNamespace",
              "compartmentId":"$compartmentId",
              "resourceGroup":"$metricResourceGroup",
              "name":"gpuTemperature",
              "dimensions":{
                 "resourceId":"$instanceId",
                 "instanceName":"$instanceName"
              },
              "metadata":{
                 "unit":"degrees Celcius",
                 "displayName":"GPU Temperature"
              },
              "datapoints":[
                 {
                    "timestamp":"$gpuTimestamp",
                    "value":$gpuTemperature
                 }
              ]
           },
      3. Add the GPU utilization custom metric:
        
           {
              "namespace":"$metricNamespace",
              "compartmentId":"$compartmentId",
              "resourceGroup":"$metricResourceGroup",
              "name":"gpuUtilization",
              "dimensions":{
                 "resourceId":"$instanceId",
                 "instanceName":"$instanceName"
              },
              "metadata":{
                 "unit":"percent",
                 "displayName":"GPU Utilization"
              },
              "datapoints":[
                 {
                    "timestamp":"$gpuTimestamp",
                    "value":$gpuUtilization
                 }
              ]
           },
      4. Add the GPU memory utilization custom metric and write the completed JSON to a file:
        {
              "namespace":"$metricNamespace",
              "compartmentId":"$compartmentId",
              "resourceGroup":"$metricResourceGroup",
              "name":"gpuMemoryUtilization",
              "dimensions":{
                 "resourceId":"$instanceId",
                 "instanceName":"$instanceName"
              },
              "metadata":{
                 "unit":"percent",
                 "displayName":"GPU Memory Utilization"
              },
              "datapoints":[
                 {
                    "timestamp":"$gpuTimestamp",
                    "value":$gpuMemoryUtilization
                 }
              ]
           }
        ]
        "@
        
        $metricsJson | Out-File $env:TEMP\metrics.json -Encoding ASCII
      5. Add this line to run the OCI CLI post command and send the metrics (the oci executable was added to the path in step 2):
        oci monitoring metric-data post --metric-data file://$env:TEMP\metrics.json --endpoint https://telemetry-ingestion.$endpointRegion.oraclecloud.com
You've now created the script and the JSON needed to publish the GPU temperature, GPU utilization, and GPU memory utilization metrics. In the next article, you will see how to publish the collected instance metrics to Oracle Cloud Infrastructure Monitoring.