GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring
-
Updated
Mar 28, 2026 - Python
GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring
Simulate NVIDIA GPUs for testing. 7 behavior profiles, scale to 1000+ GPUs, Docker-ready Prometheus exporter using DCGM
GPU-native agent-swarm orchestration for the NVIDIA AI stack — NeMo, NIM, Triton, DCGM, NGC, NIXL, OpenShell. Spawn GPU-pinned agent teams across DGX/HGX nodes with NVLink-aware scheduling, task DAGs, adaptive scheduling, and full observability.
nvidia dcgm exporter container only
Production-grade health monitoring and predictive fault management system for NVIDIA A100/H100 GPU fleets
Complete security toolkit for enterprise NVIDIA GPU infrastructure. Includes NIST 800-53 controls, Zero Trust architecture, threat models, incident response playbooks, forensic scripts, and monitoring configurations for H100/A100/L40S and other datacenter GPUs.
Add a description, image, and links to the dcgm topic page so that developers can more easily learn about it.
To associate your repository with the dcgm topic, visit your repo's landing page and select "manage topics."