Created by Tomasz Gajger.
Notice: although the author is a Dynatrace employee, this is a private project. It is neither maintained nor endorsed by Dynatrace.
The project is released under the MIT License.
A Dynatrace OneAgent extension for gathering NVIDIA GPU metrics using the NVIDIA Management Library (NVML); the implementation leverages the Python bindings for NVML.
The extension is capable of monitoring multiple GPUs: metrics coming from all the devices are aggregated and sent as a combined timeseries. There is no support for sending a separate timeseries per device.
Note that the extension can attach metrics to multiple processes at once, but the metrics will only be displayed for processes whose types are specified in `processTypeNames` in `plugin.json`. If the process type is not specified there, the metrics will still be sent, but they won't appear in the WebUI. Currently there is no way to specify `Any` in `processTypeNames`, hence all the process types of interest need to be explicitly enumerated. An illustrative fragment is shown below.
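For illustration only, the relevant part of `plugin.json` could look roughly like the fragment below; the extension name and the listed process types are placeholders, and the exact placement of `processTypeNames` within the file should be checked against the Plugin SDK documentation.

```json
{
  "name": "custom.python.nvml_extension",
  "processTypeNames": ["PYTHON", "JAVA"]
}
```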
Device metrics are reported for the HOST entity, while process-specific metrics are reported per PGI (process group instance).
- NVML installed and available on the system.
- Device of Fermi or newer architecture.
- No requirements on CUDA version.
- OneAgent version >= 1.175.
- For extension development: OneAgent Plugin SDK v1.175 or newer.
- Python >= 3.6.
- `enable_debug_log` - enables debug logging for troubleshooting purposes.
The table below outlines metrics collected by the extension. Figures 1 and 2 exemplify how metrics are presented on the WebUI.
Key | Entity | Metric description |
---|---|---|
`gpu_mem_total` | HOST | Total available global memory |
`gpu_mem_used` | HOST | Device (global) memory usage |
`gpu_mem_used_by_pgi` | PGI | Global memory usage per process |
`gpu_mem_percentage_used` | HOST | Artificial metric (`gpu_mem_used` / `gpu_mem_total`) for raising the High GPU memory alert |
`gpu_utilization` | HOST | Percent of time over the past sample period (within the CUDA driver) during which one or more kernels were executing on the GPU |
`gpu_memory_controller_utilization` | HOST | Percent of time over the past sample period (within the CUDA driver) during which global memory was being read from or written to |
`gpu_processes_count` | HOST | Number of processes making use of the GPU |
If there are multiple GPUs present, the metrics will be displayed in a joint fashion, i.e.:
- `gpu_mem_total` will be a sum of all the devices' global memory,
- `gpu_mem_used` and `gpu_mem_used_by_pgi` will be the total memory usage across all the devices,
- `gpu_utilization` and `gpu_memory_controller_utilization` will be an average of the per-device usage metrics,
- `gpu_processes_count` will show the unique count of processes using any of the GPUs, i.e. if a single process is using two GPUs it will be counted as one (a sketch of this aggregation is shown below).
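As an illustration of the aggregation described above, here is a minimal sketch using `pynvml` directly; it is not the extension's actual code and the variable names are made up:

```python
import pynvml

pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]

    # gpu_mem_total / gpu_mem_used: sum of global memory across all devices,
    # converted from bytes to MiB.
    mem_infos = [pynvml.nvmlDeviceGetMemoryInfo(h) for h in handles]
    gpu_mem_total = sum(m.total for m in mem_infos) / (1024 * 1024)
    gpu_mem_used = sum(m.used for m in mem_infos) / (1024 * 1024)

    # gpu_utilization / gpu_memory_controller_utilization: average of the
    # per-device utilization rates (assumes at least one device is present).
    rates = [pynvml.nvmlDeviceGetUtilizationRates(h) for h in handles]
    gpu_utilization = sum(r.gpu for r in rates) / len(rates)
    gpu_memory_controller_utilization = sum(r.memory for r in rates) / len(rates)

    # gpu_processes_count: unique PIDs across all devices, so a process
    # using two GPUs is counted only once.
    pids = {p.pid for h in handles
            for p in pynvml.nvmlDeviceGetComputeRunningProcesses(h)}
    gpu_processes_count = len(pids)
finally:
    pynvml.nvmlShutdown()
```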
Fig 1. Host metrics reported by the extension
Fig 2. PGI metrics reported by the extension
Note that although memory usage metric values are in MiB, they are displayed as MB in the WebUI, since that is the convention for timeseries labelling in Dynatrace.
Internally, the extension collects several data samples and aggregates them before passing them on to the extension execution engine. By default, 5 samples are collected at 2-second intervals; this can be customized by modifying `SAMPLES_COUNT` and `SAMPLING_INTERVAL` in `constants.py`.
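A minimal sketch of this sampling scheme, assuming a hypothetical `read_sample` callable that returns one numeric reading (the extension's real aggregation logic may differ):

```python
import time

# Defaults described above; in the extension these live in constants.py.
SAMPLES_COUNT = 5
SAMPLING_INTERVAL = 2  # seconds


def sample_and_aggregate(read_sample):
    """Collect SAMPLES_COUNT readings, SAMPLING_INTERVAL seconds apart,
    and return their average. `read_sample` is a hypothetical callable."""
    samples = []
    for i in range(SAMPLES_COUNT):
        samples.append(read_sample())
        if i < SAMPLES_COUNT - 1:
            time.sleep(SAMPLING_INTERVAL)
    return sum(samples) / len(samples)
```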
Concerning per-PGI memory usage: on Windows this metric won't be available if the card is managed by the WDDM driver; the card needs to be running in TCC (WDM) mode. Note that this mode is not supported by GeForce series cards prior to the Volta architecture.
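One way to check whether a device runs under WDDM, and hence whether per-PGI memory usage will be unavailable, is NVML's driver model query; this is a hedged sketch, not taken from the extension:

```python
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        # Returns the (current, pending) driver model; WDDM means per-PGI
        # memory usage cannot be read on Windows.
        current, _pending = pynvml.nvmlDeviceGetDriverModel(handle)
        if current == pynvml.NVML_DRIVER_WDDM:
            print("Device is in WDDM mode - per-PGI memory usage unavailable")
    except pynvml.NVMLError:
        # The query is unsupported on Linux, where this check is irrelevant.
        pass
finally:
    pynvml.nvmlShutdown()
```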
Three alerts are predefined in the extension; all of them are generated by Davis when metrics exceed certain threshold values. These alerts are reported for the host entity and are visible on the host screen, see Figure 3.
- High GPU utilization alert - raised when `gpu_utilization` exceeds the predefined threshold (default: 90%) in a given time period, an example is shown in Figure 4,
- High GPU memory controller utilization alert - raised when `gpu_memory_controller_utilization` exceeds the predefined threshold (default: 90%) in a given time period,
- High GPU memory utilization alert - raised when `gpu_mem_percentage_used` exceeds the predefined threshold (default: 90%) relative to `gpu_mem_total` in a given time period, an example is shown in Figure 5.
Alert thresholds are customizable by going to WebUI > Settings > Anomaly Detection > Plugin events.
Fig 3. Alerts as seen on host screen
Fig 4. High GPU utilization alert as seen on metrics screen
Fig 5. High GPU memory utilization alert as seen on metrics screen
Note that the High GPU memory utilization alert is based on two separate metrics (`gpu_mem_used` and `gpu_mem_total`).
Due to current extension limitations, it is not possible to define such a server-side alert without introducing an artificial metric combining the other two.
The alert could be reported by the extension directly via `results_builder.report_performance_event()`, but then it wouldn't be connected to a particular metric (from the server's perspective) and wouldn't be marked on the respective chart; it would only appear on the host screen.
Thus, an artificial metric representing percentage usage of the GPU memory, hidden on the Memory usage chart, had to be introduced; a sketch of how it can be derived is shown below.
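A small sketch of how such a derived metric can be computed; the commented-out reporting call is an assumption about the Plugin SDK's `results_builder` API, not a confirmed excerpt from the extension:

```python
def gpu_mem_percentage_used(gpu_mem_used, gpu_mem_total):
    """Derive the artificial percentage metric from the two real ones."""
    if gpu_mem_total == 0:
        return 0.0
    return 100.0 * gpu_mem_used / gpu_mem_total

# Hypothetical usage inside the plugin's query() method; absolute() is assumed
# to be the results_builder call for reporting absolute metric values:
# self.results_builder.absolute(key='gpu_mem_percentage_used',
#                               value=gpu_mem_percentage_used(used, total))
```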
Fig 6. Problem view for active high GPU utilization alert
Fig 7. Problem view for resolved high GPU memory utilization alert
- Bartosz Pollok for code review and guidance through the Python world