DGPU Scheduler

Distributed, GPU-aware workload scheduler for heterogeneous clusters

Mixed Workloads • Resource Quotas • High Availability

Overview

DGPU Scheduler is a distributed GPU scheduling system designed for medium-scale GPU clusters (50-200 nodes). It provides:

Mixed Workload Support: Online inference services and batch processing tasks
Resource Isolation: Strict quota management with configurable allocation ratios
High Availability: Active-standby scheduler with automatic failover
Dual API: gRPC for internal agents, HTTP REST for external users
No External Dependencies: Self-contained architecture without Redis/etcd requirements

Architecture

┌─────────────────────────────────────────────────┐
│           Users/Services (HTTP REST)             │
└─────────────────┬───────────────────────────────┘
                  │
        ┌─────────▼─────────┐
        │   API Gateway     │
        │  (REST + gRPC)    │
        └─────────┬─────────┘
                  │
    ┌─────────────▼──────────────┐
    │   Scheduler Master (HA)     │
    │   - Scheduling Engine       │
    │   - State Management        │
    │   - Quota Management        │
    └─────────────┬──────────────┘
                  │ gRPC
        ┌─────────┼─────────┐
        │         │         │
   ┌────▼───┐ ┌───▼────┐ ┌──▼─────┐
   │ Agent  │ │ Agent  │ │ Agent  │
   │GPU node│ │GPU node│ │GPU node│
   └────────┘ └────────┘ └────────┘

See Design Document for detailed architecture.

Quick Start

Prerequisites

Go 1.19+
Protocol Buffers compiler (protoc)
NVIDIA GPU with CUDA support (for agents)

Build

# Clone the repository
git clone https://github.com/chicogong/dgpu-scheduler.git
cd dgpu-scheduler

# Install dependencies
make deps

# Generate protobuf code
make proto

# Build binaries
make build

Run Scheduler

# Edit configuration
vim configs/scheduler.yaml

# Run scheduler master
./bin/scheduler -config configs/scheduler.yaml

Run Agent

# Edit configuration
vim configs/agent.yaml

# Run agent on GPU node
./bin/agent -config configs/agent.yaml

Project Structure

dgpu-scheduler/
├── cmd/                # Application entrypoints
│   ├── scheduler/      # Scheduler master binary
│   └── agent/          # Agent binary
├── pkg/                # Core packages
│   ├── scheduler/      # Scheduler logic
│   ├── agent/          # Agent logic
│   ├── api/            # API Gateway
│   ├── models/         # Data models
│   ├── config/         # Configuration
│   └── logger/         # Logging
├── api/
│   └── proto/          # Protobuf definitions
├── configs/            # Configuration templates
├── docs/               # Documentation
│   └── plans/          # Design documents
├── deployments/        # Deployment files
└── scripts/            # Utility scripts

Configuration

Scheduler Configuration

See configs/scheduler.yaml for configuration options:

Server addresses (gRPC, HTTP)
Scheduler role (master/standby)
Quota percentages
Replication settings

Agent Configuration

See configs/agent.yaml for configuration options:

Scheduler addresses
GPU detection method
Task execution method
Heartbeat intervals

Development

Run Tests

make test

Run Tests with Coverage

make test-coverage

Format Code

make fmt

Run Linter

make lint

API Documentation

REST API

Submit a task:

curl -X POST http://localhost:8080/api/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "priority": "high",
    "gpu_count": 2,
    "command": "python train.py"
  }'

Query task status:

curl http://localhost:8080/api/v1/tasks/{task_id}

See Design Document for complete API reference.

Deployment

Docker

# Build Docker images
make docker-build

# Run with Docker Compose
docker-compose up -d

Kubernetes

# Apply Kubernetes manifests
kubectl apply -f deployments/k8s/

Roadmap

System design
Project structure
Core scheduler implementation (Week 1-4)
High availability features (Week 5-7)
Monitoring and observability (Week 8-9)
Auto-scaling support (Future)

Contributing

Contributions are welcome! Please read our contributing guidelines before submitting PRs.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Documentation

Design Document
API Reference (Coming soon)
Deployment Guide (Coming soon)
Troubleshooting Guide (Coming soon)
