Publish GPU Instance Metrics to Oracle Cloud Infrastructure Monitoring

After you create the script, publish the GPU instance metrics to Oracle Cloud Infrastructure Monitoring.

Note:

Instructions in this article are specific to the operating system you are using. Ensure you are using the instructions related to your operating system.

Test the Linux Shell Script

If you created a Linux shell script, run the script manually to verify there are no errors.

In this example, we are using a script called publishGPUMetrics.sh saved to the directory oci-gpu-monitoring/scripts.
  1. If necessary, change to the directory where you saved the script:
    cd oci-gpu-monitoring/scripts
  2. Run the script:
    sh ./publishGPUMetrics.sh
    By default, the script writes logs to /tmp/gpuMetrics.log.
  3. Check the log for errors. If the script ran successfully, the log looks similar to the following:
    [opc@gputest scripts]$ cat /tmp/gpuMetrics.log
    
    Wed Oct 30 18:58:24 GMT 2019
    {
      "data": {
        "failed-metrics": [],
        "failed-metrics-count": 0
      }
    }
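
The manual log check in step 3 can also be scripted. The following is a minimal sketch, assuming the log keeps the format shown above (each run appends a JSON response). The function names last_failed_count and log_is_clean are hypothetical and are not part of publishGPUMetrics.sh.

```shell
#!/bin/sh
# Hypothetical helper (not part of publishGPUMetrics.sh): print the most
# recent "failed-metrics-count" value found in a log file.
# Assumes the log format shown above, where each run appends a JSON response.
last_failed_count() {
  log_file="$1"
  sed -n 's/.*"failed-metrics-count": *\([0-9][0-9]*\).*/\1/p' "$log_file" | tail -n 1
}

# Hypothetical helper: succeed only if the latest run reported zero failures.
log_is_clean() {
  count="$(last_failed_count "$1")"
  [ -n "$count" ] && [ "$count" -eq 0 ]
}
```

For example, `log_is_clean /tmp/gpuMetrics.log && echo "No failed metrics"` prints the message only when the latest logged run reports a count of 0.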

Create a Cron Job to Run the Linux Script Automatically

Assuming the log from the previous exercise shows no errors, create a Cron job so that the script runs automatically.

The example job below runs the script every minute, but you can change the frequency of the Cron job to suit your needs. Custom metrics can be posted as often as once per second, but the minimum aggregation interval is one minute.
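
If you want a schedule other than every minute, the crontab line changes only in its first field. The following is a minimal sketch, assuming a cron implementation that supports the `*/N` step syntax and an interval between 1 and 59 minutes. The function name cron_line is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a crontab line that runs a script every N minutes.
# Assumes a cron implementation that accepts the "*/N" step syntax and that
# N is between 1 and 59.
cron_line() {
  minutes="$1"
  script_path="$2"
  if [ "$minutes" -le 1 ]; then
    schedule='* * * * *'
  else
    schedule="*/$minutes * * * *"
  fi
  printf '%s sh %s\n' "$schedule" "$script_path"
}

# Example: a line that runs the script every five minutes.
cron_line 5 /home/opc/oci-gpu-monitoring/scripts/publishGPUMetrics.sh
```

Values of N that do not divide 60 evenly produce an uneven gap at the top of each hour, because the step restarts at minute 0.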

  1. From the command line, open the crontab file:
    crontab -e
  2. Add the following line then save the file:

    Note:

    If you have changed the script location, update the example commands with the new location.
    • For Oracle Linux 7:
      * * * * * sh /home/opc/oci-gpu-monitoring/scripts/publishGPUMetrics.sh
    • For Ubuntu 16.04 / Ubuntu 18.04:
      * * * * * sh /home/ubuntu/oci-gpu-monitoring/scripts/publishGPUMetrics.sh
  3. Check the Cron jobs list to make sure the job is listed:
    crontab -l
    The list of jobs should include the following line (or a similar line if you changed the script location or are using Ubuntu):
    * * * * * sh /home/opc/oci-gpu-monitoring/scripts/publishGPUMetrics.sh
  4. Wait a couple of minutes to allow the script to run and publish the metrics. Then log in to the Oracle Cloud Infrastructure Console, go to Monitoring, and then Metrics Explorer to verify that the metrics are available in Oracle Cloud Infrastructure Monitoring.
  5. In the Metrics Explorer, select the following values and click Update Chart.
    • Compartment: Name of the compartment to which you publish the metrics. The default is the compartment that contains the GPU instance being monitored. You can configure it in the shell script by changing the value of the compartmentId variable.
    • Metric Namespace: Default value is gpu_monitoring. You can configure it in the shell script by changing the value of the metricNamespace variable.
    • Resource Group: Default value is gpu_monitoring_rg. You can configure it in the shell script by changing the value of the metricResourceGroup variable.
    • Metric Name: Default values are gpuMemoryUtilization, gpuTemperature, and gpuUtilization. Choose gpuTemperature to see non-zero data.
    • Interval: Default value is 1m. Select any value in the console that suits your needs. Although custom metrics can be posted as often as once per second, the minimum aggregation interval is one minute.
    • Statistic: Default value is Mean. Select any value in the console that suits your needs.
    • Dimension Name: You can choose either resourceId or instanceName. resourceId is the OCID of the GPU instance, and instanceName is the display name of the GPU instance.
    The chart should now display values.
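
Steps 1 through 3 above (editing the crontab and checking the result) can also be done without an interactive editor. The following is a minimal sketch, assuming your crontab command accepts a new table on standard input with `crontab -`. The function name merge_cron_entry is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read an existing crontab on stdin and print it with
# the given entry present exactly once (identical existing lines are dropped
# first, then the entry is appended).
merge_cron_entry() {
  entry="$1"
  # -F: fixed string, -x: whole-line match, -v: invert the match.
  # "|| true" keeps going when grep prints nothing (for example, an empty crontab).
  grep -Fxv -- "$entry" || true
  printf '%s\n' "$entry"
}

# Example (commented out so that running this sketch does not change your crontab):
# entry='* * * * * sh /home/opc/oci-gpu-monitoring/scripts/publishGPUMetrics.sh'
# { crontab -l 2>/dev/null || true; } | merge_cron_entry "$entry" | crontab -
```

Because identical lines are removed before the entry is appended, running the commented command more than once does not create duplicate jobs.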

Note:

Instead of selecting the values from the fields in the console, you can also use the Query Code Editor by selecting Advanced Mode. For example, to get the same chart as above, use this query (be sure to change the value of the resourceId to your GPU instance ID):
gpuTemperature[1m]{resourceId = "your_gpu_instance_id"}.mean()
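
The query above follows the pattern metric[interval]{dimension = "value"}.statistic(). The following is a minimal sketch that assembles such a query from its parts. The function names build_mql and query_gpu_metric are hypothetical. The wrapper assumes that the OCI CLI is installed and configured, and that the default namespace and resource group from the script are in use; check its options against the OCI CLI reference before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: build a query of the form
#   metric[interval]{dimension = "value"}.statistic()
build_mql() {
  metric="$1"; interval="$2"; dim_name="$3"; dim_value="$4"; statistic="$5"
  printf '%s[%s]{%s = "%s"}.%s()\n' "$metric" "$interval" "$dim_name" "$dim_value" "$statistic"
}

# Hypothetical wrapper around the OCI CLI (assumed installed and configured).
# It is defined here but not called, so this sketch makes no network requests.
query_gpu_metric() {
  compartment_id="$1"
  query="$2"
  oci monitoring metric-data summarize-metrics-data \
    --compartment-id "$compartment_id" \
    --namespace gpu_monitoring \
    --resource-group gpu_monitoring_rg \
    --query-text "$query"
}

# Example: reproduce the query shown above.
build_mql gpuTemperature 1m resourceId your_gpu_instance_id mean
```

The example call prints the same query text as the one shown above, so the output can be pasted into the Query Code Editor or passed to the wrapper as its second argument.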

Test the PowerShell Script

If you created a Windows PowerShell script, run it manually to verify there are no errors.

In this example, we are using a script called publishGPUMetrics.ps1 saved to the directory oci-gpu-monitoring. You will run the following commands in PowerShell.
  1. If necessary, change to the directory where you've stored the script:
    cd "$Home\oci-gpu-monitoring"
  2. Run the script:
    PS C:\Users\opc\oci-gpu-monitoring> .\publishGPUMetrics.ps1
    If the script ran successfully, the output looks similar to this:
    Thursday, November 7, 2019 11:25:51 PM
    {
      "data": {
        "failed-metrics": [],
        "failed-metrics-count": 0
      }
    } 

Create and Execute a Scheduled Task to Run the Windows Script Automatically

Assuming the output from the previous exercise shows no errors, create a scheduled task so that the script runs automatically.

This example runs the script every minute, but you can change the frequency of the task to suit your needs. Custom metrics can be posted as often as once per second, but the minimum aggregation interval is one minute.

In your code editor, create a new scheduled task script:

  1. Set the name of the task. In this exercise, we'll call it oci-gpu-monitoring:
    $taskName = "oci-gpu-monitoring"
  2. Set the script and log locations:
    $scriptLocation = "$Home\oci-gpu-monitoring\publishGPUMetrics.ps1"
    $logLocation = "$Home\oci-gpu-monitoring\gpuMetrics.log"
    $script =  "$scriptLocation 2>&1 > $logLocation"
  3. Set the run frequency, then define the task action and trigger:
    $frequency = (New-TimeSpan -Minutes 1)
    
    $action = New-ScheduledTaskAction -Execute "$pshome\powershell.exe" -Argument "$script; exit"
    $trigger = New-ScheduledTaskTrigger -Once -At (Get-Date).Date -RepetitionInterval $frequency
  4. Prompt for the credentials of the user that will run the task, then define the task settings:
    $msg = "Enter the username and password that will run the task"; 
    $credential = $Host.UI.PromptForCredential("Task username and password",$msg,"$env:userdomain\$env:username",$env:userdomain)
    $username = $credential.UserName
    $password = $credential.GetNetworkCredential().Password
    $settings = New-ScheduledTaskSettingsSet -StartWhenAvailable -RunOnlyIfNetworkAvailable -DontStopOnIdleEnd
  5. Register the task:
    Register-ScheduledTask -TaskName $taskName -Action $action -Trigger $trigger -RunLevel Highest -User $username -Password $password -Settings $settings
    

    The complete script should look similar to this:

    # Name of the scheduled task
    $taskName = "oci-gpu-monitoring"
    
    # Script and log location defaults. Change if you need to.
    $scriptLocation = "$Home\oci-gpu-monitoring\publishGPUMetrics.ps1"
    $logLocation = "$Home\oci-gpu-monitoring\gpuMetrics.log"
    $script =  "$scriptLocation 2>&1 > $logLocation"
    
    # Running frequency
    $frequency = (New-TimeSpan -Minutes 1)
    
    $action = New-ScheduledTaskAction -Execute "$pshome\powershell.exe" -Argument "$script; exit"
    $trigger = New-ScheduledTaskTrigger -Once -At (Get-Date).Date -RepetitionInterval $frequency
    
    # Prompt for the username and password of the account that will run the task
    $msg = "Enter the username and password that will run the task"; 
    $credential = $Host.UI.PromptForCredential("Task username and password",$msg,"$env:userdomain\$env:username",$env:userdomain)
    $username = $credential.UserName
    $password = $credential.GetNetworkCredential().Password
    $settings = New-ScheduledTaskSettingsSet -StartWhenAvailable -RunOnlyIfNetworkAvailable -DontStopOnIdleEnd
     
    Register-ScheduledTask -TaskName $taskName -Action $action -Trigger $trigger -RunLevel Highest -User $username -Password $password -Settings $settings
  6. Verify that the task is registered: run taskschd.msc in PowerShell to open Task Scheduler. The list of tasks should include the name of the task that you created in the previous step.
  7. Wait a couple of minutes to allow the script to run and publish the metrics. Then log in to the Oracle Cloud Infrastructure Console, go to Monitoring, and then Metrics Explorer to verify that the metrics are available in Oracle Cloud Infrastructure Monitoring.
  8. In the Metrics Explorer, select the following values and click Update Chart.
    • Compartment: Name of the compartment to which you publish the metrics. The default is the compartment that contains the GPU instance being monitored. You can configure it in the PowerShell script by changing the value of the compartmentId variable.
    • Metric Namespace: Default value is gpu_monitoring. You can configure it in the PowerShell script by changing the value of the metricNamespace variable.
    • Resource Group: Default value is gpu_monitoring_rg. You can configure it in the PowerShell script by changing the value of the metricResourceGroup variable.
    • Metric Name: Default values are gpuMemoryUtilization, gpuTemperature, and gpuUtilization. Choose gpuTemperature to see non-zero data.
    • Interval: Default value is 1m. Select any value in the console that suits your needs. Although custom metrics can be posted as often as once per second, the minimum aggregation interval is one minute.
    • Statistic: Default value is Mean. Select any value in the console that suits your needs.
    • Dimension Name: You can choose either resourceId or instanceName. resourceId is the OCID of the GPU instance, and instanceName is the display name of the GPU instance.
    The chart should now display values.

Note:

Instead of selecting the values from the fields in the console, you can also use the Query Code Editor by selecting Advanced Mode. For example, to get the same chart as above, use this query (be sure to change the value of the resourceId to your GPU instance ID):
gpuTemperature[1m]{resourceId = "your_gpu_instance_id"}.mean()