Created by Tomasz Gajger.
Notice: although the author is a Dynatrace employee, this is a private project. It is neither maintained nor endorsed by Dynatrace.
The project is released under the MIT License.
A Dynatrace OneAgent extension for gathering NVIDIA GPU metrics using the NVIDIA Management Library (NVML); the implementation leverages the Python bindings for NVML.
The extension is capable of monitoring multiple GPUs: metrics coming from all the devices are aggregated and sent as a combined timeseries. There is no support for sending a separate timeseries per device.
Note that the extension can attach metrics to multiple processes at once, but the metrics will only be displayed for processes whose types are specified in `processTypeNames` in `plugin.json`. If the process type is not specified there, the metrics will still be sent, but they won't appear in the WebUI. Currently there is no way to specify `Any` in `processTypeNames`, hence all the process types of interest need to be explicitly enumerated. An illustrative fragment is shown below.
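For illustration only, the relevant part of `plugin.json` could look roughly like the fragment below; the extension name and the listed process types are placeholders, and the exact placement of `processTypeNames` within the file should be checked against the Plugin SDK documentation.

```json
{
  "name": "custom.python.nvml_extension",
  "processTypeNames": ["PYTHON", "JAVA"]
}
```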
Device metrics are reported for the HOST entity, while process-specific metrics are reported per PGI (process group instance).
- NVML installed and available on the system.
- Device of Fermi or newer architecture.
- No requirements on CUDA version.
- OneAgent version >= 1.175.
- For extension development: OneAgent Plugin SDK v1.175 or newer.
- Python >= 3.6.
- `enable_debug_log` - enables debug logging for troubleshooting purposes.
The table below outlines metrics collected by the extension. Figures 1 and 2 exemplify how metrics are presented on the WebUI.
Key | Entity | Metric description |
---|---|---|
`gpu_mem_total` | HOST | Total available global memory |
`gpu_mem_used` | HOST | Device (global) memory usage |
`gpu_mem_used_by_pgi` | PGI | Global memory usage per process |
`gpu_mem_percentage_used` | HOST | Artificial metric (`gpu_mem_used` / `gpu_mem_total`) for raising the High GPU memory alert |
`gpu_utilization` | HOST | Percent of time over the past sample period (within the CUDA driver) during which one or more kernels were executing on the GPU |
`gpu_memory_controller_utilization` | HOST | Percent of time over the past sample period (within the CUDA driver) during which global memory was being read from or written to |
`gpu_processes_count` | HOST | Number of processes making use of the GPU |
If there are multiple GPUs present, the metrics will be displayed in a joint fashion, i.e.:
- `gpu_mem_total` will be a sum of all the devices' global memory,
- `gpu_mem_used` and `gpu_mem_used_by_pgi` will be the total memory usage across all the devices,
- `gpu_utilization` and `gpu_memory_controller_utilization` will be an average of the per-device usage metrics,
- `gpu_processes_count` will show the unique count of processes using any of the GPUs, i.e. if a single process is using two GPUs it will be counted as one (a sketch of this aggregation is shown below).
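As an illustration of the aggregation described above, here is a minimal sketch using `pynvml` directly; it is not the extension's actual code and the variable names are made up:

```python
import pynvml

pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]

    # gpu_mem_total / gpu_mem_used: sum of global memory across all devices,
    # converted from bytes to MiB.
    mem_infos = [pynvml.nvmlDeviceGetMemoryInfo(h) for h in handles]
    gpu_mem_total = sum(m.total for m in mem_infos) / (1024 * 1024)
    gpu_mem_used = sum(m.used for m in mem_infos) / (1024 * 1024)

    # gpu_utilization / gpu_memory_controller_utilization: average of the
    # per-device utilization rates (assumes at least one device is present).
    rates = [pynvml.nvmlDeviceGetUtilizationRates(h) for h in handles]
    gpu_utilization = sum(r.gpu for r in rates) / len(rates)
    gpu_memory_controller_utilization = sum(r.memory for r in rates) / len(rates)

    # gpu_processes_count: unique PIDs across all devices, so a process
    # using two GPUs is counted only once.
    pids = {p.pid for h in handles
            for p in pynvml.nvmlDeviceGetComputeRunningProcesses(h)}
    gpu_processes_count = len(pids)
finally:
    pynvml.nvmlShutdown()
```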
Fig 1. Host metrics reported by the extension
Fig 2. PGI metrics reported by the extension
Note that although memory usage metric values are in MiB, they are displayed as MB in the WebUI, since that is the convention for timeseries labelling in Dynatrace.
Internally, the extension collects several data samples and aggregates them before passing them on to the extension execution engine. By default, 5 samples are collected at 2-second intervals; this can be customized by modifying `SAMPLES_COUNT` and `SAMPLING_INTERVAL` in `constants.py`.
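A minimal sketch of this sampling scheme, assuming a hypothetical `read_sample` callable that returns one numeric reading (the extension's real aggregation logic may differ):

```python
import time

# Defaults described above; in the extension these live in constants.py.
SAMPLES_COUNT = 5
SAMPLING_INTERVAL = 2  # seconds


def sample_and_aggregate(read_sample):
    """Collect SAMPLES_COUNT readings, SAMPLING_INTERVAL seconds apart,
    and return their average. `read_sample` is a hypothetical callable."""
    samples = []
    for i in range(SAMPLES_COUNT):
        samples.append(read_sample())
        if i < SAMPLES_COUNT - 1:
            time.sleep(SAMPLING_INTERVAL)
    return sum(samples) / len(samples)
```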
Concerning per-PGI memory usage: on Windows this metric won't be available if the card is managed by the WDDM driver; the card needs to be running in TCC (WDM) mode. Note that this mode is not supported by GeForce series cards prior to the Volta architecture.
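One way to check whether a device runs under WDDM, and hence whether per-PGI memory usage will be unavailable, is NVML's driver model query; this is a hedged sketch, not taken from the extension:

```python
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        # Returns the (current, pending) driver model; WDDM means per-PGI
        # memory usage cannot be read on Windows.
        current, _pending = pynvml.nvmlDeviceGetDriverModel(handle)
        if current == pynvml.NVML_DRIVER_WDDM:
            print("Device is in WDDM mode - per-PGI memory usage unavailable")
    except pynvml.NVMLError:
        # The query is unsupported on Linux, where this check is irrelevant.
        pass
finally:
    pynvml.nvmlShutdown()
```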
Three alerts are predefined in the extension; all of them are generated by Davis when metrics exceed certain threshold values. These alerts are reported for the host entity and are visible on the host screen, see Figure 3.
- High GPU utilization alert - raised when `gpu_utilization` exceeds the predefined threshold (default: 90%) in a given time period, an example is shown in Figure 4,
- High GPU memory controller utilization alert - raised when `gpu_memory_controller_utilization` exceeds the predefined threshold (default: 90%) in a given time period,
- High GPU memory utilization alert - raised when `gpu_mem_percentage_used` exceeds the predefined threshold (default: 90%) relative to `gpu_mem_total` in a given time period, an example is shown in Figure 5.
Alert thresholds are customizable by going to WebUI > Settings > Anomaly Detection > Plugin events.
Fig 3. Alerts as seen on host screen
Fig 4. High GPU utilization alert as seen on metrics screen
Fig 5. High GPU memory utilization alert as seen on metrics screen
Note that the High GPU memory utilization alert is based on two separate metrics (`gpu_mem_used` and `gpu_mem_total`).
Due to current extension limitations, it is not possible to define such a server-side alert without introducing an artificial metric combining the other two.
The alert could be reported by the extension directly via `results_builder.report_performance_event()`, but then it wouldn't be connected to a particular metric (from the server's perspective) and wouldn't be marked on the respective chart; it would only appear on the host screen.
Thus, an artificial metric representing percentage usage of the GPU memory, hidden on the Memory usage chart, had to be introduced; a sketch of how it can be derived is shown below.
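A small sketch of how such a derived metric can be computed; the commented-out reporting call is an assumption about the Plugin SDK's `results_builder` API, not a confirmed excerpt from the extension:

```python
def gpu_mem_percentage_used(gpu_mem_used, gpu_mem_total):
    """Derive the artificial percentage metric from the two real ones."""
    if gpu_mem_total == 0:
        return 0.0
    return 100.0 * gpu_mem_used / gpu_mem_total

# Hypothetical usage inside the plugin's query() method; absolute() is assumed
# to be the results_builder call for reporting absolute metric values:
# self.results_builder.absolute(key='gpu_mem_percentage_used',
#                               value=gpu_mem_percentage_used(used, total))
```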
Fig 6. Problem view for active high GPU utilization alert
Fig 7. Problem view for resolved high GPU memory utilization alert
- Bartosz Pollok for code review and guidance through the Python world