dcgm

Here are 7 public repositories matching this topic...

facebookresearch / gcm

GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring

ai monitoring hpc slurm nvml health-checks ai-training ai-cluster dcgm

Updated Mar 28, 2026
Python

ajcasagrande / fakeai

FakeAI: Rapid Development and Testing for AI Infrastructure

ai nvidia openai llm dcgm ai-dynamo aiperf

Updated Oct 8, 2025
Python

saiakhil2012 / dcgm-fake-gpu-exporter

Simulate NVIDIA GPUs for testing. 7 behavior profiles, scale to 1000+ GPUs, Docker-ready Prometheus exporter using DCGM

testing research monitoring simulation metrics gpu telemetry observability chaos-engineering gpu-monitoring dcgm-exporter dcgm

Updated Jan 7, 2026
Python

alokemajumder / nemospawn

GPU-native agent-swarm orchestration for the NVIDIA AI stack: NeMo, NIM, Triton, DCGM, NGC, NIXL, OpenShell. Spawn GPU-pinned agent teams across DGX/HGX nodes with NVLink-aware scheduling, task DAGs, adaptive scheduling, and full observability.

python cli nim hpc gpu slurm orchestration nvidia hyperparameter-optimization triton nemo ai-agents mlops dgx nvlink llm agent-swarm dcgm

Updated Mar 28, 2026
Python

spwangxp / dcgm-docker-exporter

NVIDIA DCGM exporter, container only

docker monitor gpu prometheus nvidia prometheus-exporter grafana-dashboard dcgm-exporter dcgm

Updated Feb 9, 2026
Go

SolidRegardless / gpu-health-monitor

Production-grade health monitoring and predictive fault management system for NVIDIA A100/H100 GPU fleets

machine-learning kafka monitoring gpu nvidia predictive-maintenance failure-prediction timescaledb health-monitoring a100 h100 dcgm

Updated Feb 20, 2026
Python

Zero-Trust-AI-Security / gpu-security-toolkit

Complete security toolkit for enterprise NVIDIA GPU infrastructure. Includes NIST 800-53 controls, Zero Trust architecture, threat models, incident response playbooks, forensic scripts, and monitoring configurations for H100/A100/L40S and other datacenter GPUs.

incident-response forensics zero-trust container-security kubernetes-security nist-800-53 dcgm gpu-security nvidia-security cryptomining-detection

Updated Feb 1, 2026
Shell
