Distributed, GPU-aware workload scheduler for heterogeneous clusters
Mixed Workloads • Resource Quotas • High Availability
DGPU Scheduler is a distributed GPU scheduling system designed for medium-scale GPU clusters (50-200 nodes). It provides:
- Mixed Workload Support: Online inference services and batch processing tasks
- Resource Isolation: Strict quota management with configurable allocation ratios
- High Availability: Active-standby scheduler with automatic failover
- Dual API: gRPC for internal agents, HTTP REST for external users
- No External Dependencies: Self-contained architecture without Redis/etcd requirements
┌─────────────────────────────────────────────────┐
│ Users/Services (HTTP REST) │
└─────────────────┬───────────────────────────────┘
│
┌─────────▼─────────┐
│ API Gateway │
│ (REST + gRPC) │
└─────────┬─────────┘
│
┌─────────────▼──────────────┐
│ Scheduler Master (HA) │
│ - Scheduling Engine │
│ - State Management │
│ - Quota Management │
└─────────────┬──────────────┘
│ gRPC
┌────────┼────────┐
│ │ │
┌────▼───┐ ┌─▼──────┐ ┌─▼──────┐
│ Agent  │ │ Agent  │ │ Agent  │
│GPU Node│ │GPU Node│ │GPU Node│
└────────┘ └────────┘ └────────┘
See Design Document for detailed architecture.
- Go 1.19+
- Protocol Buffers compiler (protoc)
- NVIDIA GPU with CUDA support (for agents)
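Before cloning, it may help to confirm the toolchain is actually present. A quick check (standard version flags for each tool; nothing project-specific):

```shell
# Go 1.19+ is needed to build the binaries
go version 2>/dev/null || echo "Go not found: install Go 1.19+"

# protoc is needed for `make proto`
protoc --version 2>/dev/null || echo "protoc not found: install the Protocol Buffers compiler"

# Agents need the NVIDIA driver; nvidia-smi -L lists the node's GPUs
nvidia-smi -L 2>/dev/null || echo "nvidia-smi not found (only required on GPU agent nodes)"
```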
# Clone the repository
git clone https://github.com/chicogong/dgpu-scheduler.git
cd dgpu-scheduler
# Install dependencies
make deps
# Generate protobuf code
make proto
# Build binaries
make build

# Edit configuration
vim configs/scheduler.yaml
# Run scheduler master
./bin/scheduler -config configs/scheduler.yaml

# Edit configuration
vim configs/agent.yaml
# Run agent on GPU node
./bin/agent -config configs/agent.yaml

dgpu-scheduler/
├── cmd/ # Application entrypoints
│ ├── scheduler/ # Scheduler master binary
│ └── agent/ # Agent binary
├── pkg/ # Core packages
│ ├── scheduler/ # Scheduler logic
│ ├── agent/ # Agent logic
│ ├── api/ # API Gateway
│ ├── models/ # Data models
│ ├── config/ # Configuration
│ └── logger/ # Logging
├── api/
│ └── proto/ # Protobuf definitions
├── configs/ # Configuration templates
├── docs/ # Documentation
│ └── plans/ # Design documents
├── deployments/ # Deployment files
└── scripts/ # Utility scripts
See configs/scheduler.yaml for configuration options:
- Server addresses (gRPC, HTTP)
- Scheduler role (master/standby)
- Quota percentages
- Replication settings
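An illustrative fragment covering the options above (the key names here are assumptions for illustration; configs/scheduler.yaml is the authoritative schema):

```yaml
# Hypothetical layout -- see configs/scheduler.yaml for the real keys
server:
  grpc_addr: "0.0.0.0:9090"   # agents connect here
  http_addr: "0.0.0.0:8080"   # REST API for users/services
role: master                   # master | standby
quota:
  online_percent: 60           # share reserved for online inference
  batch_percent: 40            # share available to batch tasks
replication:
  peer_addr: "standby-host:9090"
```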
See configs/agent.yaml for configuration options:
- Scheduler addresses
- GPU detection method
- Task execution method
- Heartbeat intervals
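A matching illustrative fragment for the agent side (again, key names are assumptions; configs/agent.yaml is authoritative):

```yaml
# Hypothetical layout -- see configs/agent.yaml for the real keys
scheduler_addrs:
  - "master-host:9090"
  - "standby-host:9090"
gpu_detection: nvml            # how the agent discovers local GPUs
task_execution: exec           # how submitted commands are launched
heartbeat_interval: 5s         # reporting interval to the scheduler
```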
make test
make test-coverage
make fmt
make lint

Submit a task:
curl -X POST http://localhost:8080/api/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"priority": "high",
"gpu_count": 2,
"command": "python train.py"
  }'

Query task status:
curl http://localhost:8080/api/v1/tasks/{task_id}

See Design Document for the complete API reference.
# Build Docker images
make docker-build
# Run with Docker Compose
docker-compose up -d

# Apply Kubernetes manifests
kubectl apply -f deployments/k8s/

- System design
- Project structure
- Core scheduler implementation (Week 1-4)
- High availability features (Week 5-7)
- Monitoring and observability (Week 8-9)
- Auto-scaling support (Future)
Contributions are welcome! Please read our contributing guidelines before submitting PRs.
This project is licensed under the MIT License - see the LICENSE file for details.
- Design Document
- API Reference (Coming soon)
- Deployment Guide (Coming soon)
- Troubleshooting Guide (Coming soon)