Prepare to Publish GPU Instance Metrics to Oracle Cloud Infrastructure Monitoring
Note:
Procedures in this article are specific to the operating system you are using. Ensure you are using the instructions related to your operating system.
Install the Oracle Cloud Infrastructure CLI
The script uses the Oracle Cloud Infrastructure command-line interface (CLI) to upload metrics to the Oracle Cloud Infrastructure Monitoring service, so you need to install the CLI on each GPU instance that you want to monitor.
To install the CLI, use one of the following commands, depending on your operating system.
Linux:
bash -c "$(curl -L https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.sh)"
Windows (PowerShell):
powershell -NoProfile -ExecutionPolicy Bypass -Command "iex ((New-Object System.Net.WebClient).DownloadString('https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.ps1'))"
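After the installer finishes, you can confirm that the CLI is available on your path by checking its version:
oci --version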
To have the CLI walk you through the first-time setup process, use the oci setup config command. The command prompts you for the information required for the configuration file and the API public/private keys. The setup dialog generates an API key pair and creates the configuration file.
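When setup completes, the CLI writes the configuration file (by default, ~/.oci/config). For reference, a generated configuration file follows this general shape; the OCIDs, fingerprint, key path, and region shown here are placeholders, not values to copy:
[DEFAULT]
user=ocid1.user.oc1..<unique_ID>
fingerprint=<your_key_fingerprint>
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..<unique_ID>
region=us-ashburn-1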
If you change the default installation location of the CLI, or if you use Ubuntu Linux as the operating system, make sure that you update the cliLocation variable in the shell script.
# OCI CLI binary location
# Default installation location for Oracle Linux 7 is /home/opc/bin/oci
# Default installation location for Ubuntu 18.04 and Ubuntu 16.04 is /home/ubuntu/bin/oci
cliLocation="/home/opc/bin/oci"
# OCI CLI binary location
# Default installation location is "C:\Users\opc\bin"
$cliLocation = "C:\Users\opc\bin"
Verify NVIDIA-smi Installation
Next, verify that NVIDIA-smi is installed. This command-line utility is built on top of the NVIDIA Management Library (NVML) and helps you manage and monitor NVIDIA GPU devices.
If you already use your GPU instances to run GPU workloads, you most likely have the appropriate NVIDIA drivers installed. On Windows, the default installation directory for NVIDIA-smi is C:\Program Files\NVIDIA Corporation\NVSMI. The script checks whether NVIDIA-smi is installed and in your path. To check manually, connect to your GPU instance by using SSH (Linux) or RDP (Windows) and run nvidia-smi (Linux) or nvidia-smi.exe (Windows) on the command line. If NVIDIA-smi is installed, you receive a response similar to the following:
Thu Nov 07 21:05:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 426.23       Driver Version: 426.23       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  TCC  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    22W / 300W |      0MiB / 16258MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
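Beyond this default summary view, NVIDIA-smi also provides a query interface that returns individual metric values, which is how a script can read GPU temperature, GPU utilization, and GPU memory utilization. The following Linux shell snippet is a minimal sketch of that step, not the final script; the variable names are illustrative:
# Minimal sketch only: read the three GPU metrics with the nvidia-smi query interface.
# The variable names are illustrative; the full script is created in the next section.
query="temperature.gpu,utilization.gpu,utilization.memory"
read -r gpuTemperature gpuUtilization gpuMemoryUtilization <<< \
  "$(nvidia-smi --query-gpu=${query} --format=csv,noheader,nounits | tr -d ',')"
echo "Temperature: ${gpuTemperature} C, GPU: ${gpuUtilization} %, Memory: ${gpuMemoryUtilization} %"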
Create the Script
You now create the script for publishing GPU temperature, GPU utilization, and GPU memory utilization metrics from GPU instances to the Monitoring service.
Note:
This procedure describes both the Linux and Windows coding required to create the script. Be sure you're following the procedures appropriate to your operating system.
In any code editor: