| 1 | +# Format |
| 2 | +# If line starts with a '#' it is considered a comment |
| 3 | +# DCGM FIELD, Prometheus metric type, help message |
| 4 | + |
| 5 | +# Clocks |
| 6 | +DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz). |
| 7 | +DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz). |
| 8 | + |
| 9 | +# Temperature |
| 10 | +DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C). |
| 11 | +DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C). |
| 12 | + |
| 13 | +# Power |
| 14 | +DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W). |
| 15 | +DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ). |
| 16 | + |
| 17 | +# PCIe |
| 18 | +# DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML. |
| 19 | +# DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML. |
| 20 | +DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries. |
| 21 | + |
| 22 | +# Utilization (the sample period varies depending on the product) |
| 23 | +DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %). |
| 24 | +DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %). |
| 25 | +DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %). |
| 26 | +DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %). |
| 27 | + |
| 28 | +# Errors and violations |
| 29 | +DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered. |
| 30 | +# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us). |
| 31 | +# DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us). |
| 32 | +# DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us). |
| 33 | +# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us). |
| 34 | +# DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us). |
| 35 | +# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us). |
| 36 | + |
| 37 | +# Memory usage |
| 38 | +DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB). |
| 39 | +DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB). |
| 40 | + |
| 41 | +# ECC |
| 42 | +# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors. |
| 43 | +# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors. |
| 44 | +# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors. |
| 45 | +# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors. |
| 46 | + |
| 47 | +# Retired pages |
| 48 | +# DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors. |
| 49 | +# DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors. |
| 50 | +# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement. |
| 51 | + |
| 52 | +# NVLink |
| 53 | +# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors. |
| 54 | +# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors. |
| 55 | +# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries. |
| 56 | +# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors. |
| 57 | +DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes. |
| 58 | +# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload. |
| 59 | + |
| 60 | +# vGPU license status |
| 61 | +DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU license status. |
| 62 | + |
| 63 | +# Remapped rows |
| 64 | +DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors. |
| 65 | +DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors. |
| 66 | +DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed. |
| 67 | + |
| 68 | +# Static configuration information. These appear as labels on the other metrics |
| 69 | +DCGM_FI_DRIVER_VERSION, label, Driver Version |
| 70 | +# DCGM_FI_NVML_VERSION, label, NVML Version |
| 71 | +# DCGM_FI_DEV_BRAND, label, Device Brand |
| 72 | +# DCGM_FI_DEV_SERIAL, label, Device Serial Number |
| 73 | +# DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version |
| 74 | +# DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version |
| 75 | +# DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version |
| 76 | +# DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version |
| 77 | +# DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device |
| 78 | + |
| 79 | +# DCP metrics |
| 80 | +DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %). |
| 81 | +# DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %). |
| 82 | +# DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %). |
| 83 | +DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %). |
| 84 | +DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %). |
| 85 | +# DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %). |
| 86 | +# DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %). |
| 87 | +# DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %). |
| 88 | +DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second. |
| 89 | +DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second. |
| 90 | + |
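The format comment at the top of the file (lines starting with '#' are comments; every other line is `DCGM FIELD, Prometheus metric type, help message`) can be illustrated with a small reader. The following is a minimal Python sketch, not the exporter's own parser: the names `MetricSpec`, `parse_counters`, `VALID_TYPES`, and `SAMPLE` are hypothetical, and restricting the type column to `gauge`, `counter`, and `label` is an assumption based only on the values that appear in this file.

```python
from typing import List, NamedTuple

# Assumption: only the three types that appear in this file are accepted.
VALID_TYPES = {"gauge", "counter", "label"}


class MetricSpec(NamedTuple):
    """One active (non-comment) line of the counters file."""
    field: str        # DCGM field name, e.g. DCGM_FI_DEV_SM_CLOCK
    metric_type: str  # gauge, counter, or label
    help_text: str    # free-form help message


def parse_counters(text: str) -> List[MetricSpec]:
    """Parse counters-file text, skipping blank lines and '#' comments."""
    specs = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (including disabled metrics)
        # Split on the first two commas only, so a help message that
        # itself contains commas stays in one piece.
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(
                f"line {lineno}: expected 3 comma-separated fields, got {len(parts)}"
            )
        field, metric_type, help_text = parts
        if metric_type not in VALID_TYPES:
            raise ValueError(f"line {lineno}: unknown metric type {metric_type!r}")
        specs.append(MetricSpec(field, metric_type, help_text))
    return specs


# A few lines in the same shape as the file above.
SAMPLE = """\
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
DCGM_FI_DRIVER_VERSION, label, Driver Version
"""

if __name__ == "__main__":
    for spec in parse_counters(SAMPLE):
        print(spec)
```

Stripping whitespace around each field means a stray space before a comma does not change the parsed field name, and a metric that is commented out is simply skipped, which matches how the file toggles metrics on and off.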