First verified working NVIDIA CUDA distributed inference for exo
Run large language models across multiple NVIDIA GPUs with automatic node discovery
Quick Start • Verified Hardware • Multi-Node Setup • Troubleshooting
The original exo focuses on Apple Silicon (MLX). This fork restores full NVIDIA CUDA support via tinygrad:
| Feature | Original exo | exo-cuda |
|---|---|---|
| Apple Silicon (MLX) | ✅ | ✅ |
| NVIDIA CUDA | ❌ Broken | ✅ Working |
| Tesla V100/M40 | ❌ | ✅ Tested |
| Multi-GPU cluster | ✅ | ✅ CUDA cluster |
| Distributed inference | ✅ | ✅ |
```bash
# Clone this repo
git clone https://github.com/Scottcjn/exo-cuda.git
cd exo-cuda

# Create venv and install
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

# Upgrade tinygrad to latest (fixes CUDA issues)
pip install --upgrade git+https://github.com/tinygrad/tinygrad.git

# Start with CUDA backend
exo --inference-engine tinygrad --chatgpt-api-port 8001 --disable-tui
```

| Component | Requirement |
|---|---|
| OS | Ubuntu 22.04/24.04, Debian 12+ |
| Python | 3.10+ (3.12 recommended) |
| NVIDIA Driver | 525+ (nvidia-smi to verify) |
| CUDA Toolkit | 12.0+ (nvcc --version to verify) |
| GPU Memory | 8GB+ per node |
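To sanity-check an existing machine against these requirements in one pass, the standard version commands are enough (a quick sketch using stock Ubuntu/Debian tools):

```bash
# Confirm OS, Python, NVIDIA driver, and CUDA toolkit versions
grep PRETTY_NAME /etc/os-release
python3 --version
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv
nvcc --version | grep release
```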
```bash
# Ubuntu/Debian
sudo apt install nvidia-cuda-toolkit

# Verify
nvcc --version
nvidia-smi
```

Tested December 2024 - January 2025:
| Server | GPU | VRAM | Status |
|---|---|---|---|
| Dell PowerEdge C4130 | Tesla V100-SXM2 | 16GB | ✅ Working |
| Dell PowerEdge C4130 | Tesla M40 | 24GB | ✅ Working |
| Custom Build | RTX 3090 | 24GB | ✅ Working |
| Multi-node cluster | V100 + M40 | 40GB total | ✅ Working |
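To compare your own hardware against this table, `nvidia-smi` can report each GPU's model, VRAM, and compute capability (the `compute_cap` query field needs a reasonably recent driver; drop it on older ones):

```bash
# One line per GPU: model name, total VRAM, compute capability
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
```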
```bash
# On the node that will serve the ChatGPT-compatible API
exo --inference-engine tinygrad --chatgpt-api-port 8001 --disable-tui

# On each additional node
exo --inference-engine tinygrad --disable-tui
```

That's it! Nodes auto-discover via UDP broadcast. No manual configuration needed.
```bash
# Create peers.json
echo '{"peers": ["192.168.1.100:5678", "192.168.1.101:5678"]}' > peers.json

# Start with manual discovery
exo --inference-engine tinygrad --discovery-module manual \
    --discovery-config-path peers.json
```

exo provides a ChatGPT-compatible API:
```bash
# Chat completion
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

# List models
curl http://localhost:8001/v1/models
```

All tinygrad-compatible models work:
| Model | Parameters | Min VRAM |
|---|---|---|
| Llama 3.2 1B | 1B | 4GB |
| Llama 3.2 3B | 3B | 8GB |
| Llama 3.1 8B | 8B | 16GB |
| Llama 3.1 70B | 70B | 140GB (cluster) |
| DeepSeek Coder | Various | Varies |
| Qwen 2.5 | 0.5B-72B | Varies |
| Mistral 7B | 7B | 14GB |
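The names in this table are what goes in the `model` field of API requests. Streaming should also work the usual OpenAI way; a sketch, assuming the endpoint honors the standard `stream` flag:

```bash
# Streamed chat completion; -N disables curl's output buffering.
# "stream": true is the standard OpenAI-style flag and is assumed to be honored here.
curl -N http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "stream": true
  }'
```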
```bash
# Debug logging (0-9, higher = more verbose)
DEBUG=2 exo --inference-engine tinygrad

# Tinygrad-specific debug (1-6)
TINYGRAD_DEBUG=2 exo --inference-engine tinygrad

# Limit GPU visibility
CUDA_VISIBLE_DEVICES=0,1 exo --inference-engine tinygrad
```

| Issue | Solution |
|---|---|
| `nvcc` not found | `sudo apt install nvidia-cuda-toolkit` |
| OpenCL exp2 error | `pip install --upgrade git+https://github.com/tinygrad/tinygrad.git` |
| No GPU detected | Check `nvidia-smi` and `nvcc --version` |
| Out of memory | Use smaller model or add more nodes |
| Connection refused | Check firewall allows UDP broadcast |
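If nodes refuse connections or never discover each other, make sure the firewall permits exo's peer traffic. A minimal `ufw` sketch, assuming the default port 5678 used in the peers.json example above:

```bash
# Open exo's peer port for discovery and node-to-node traffic
# (5678 is taken from the peers.json example; adjust if yours differs)
sudo ufw allow 5678/udp
sudo ufw allow 5678/tcp
```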
```bash
# Fix tinygrad CUDA issues
pip install --upgrade git+https://github.com/tinygrad/tinygrad.git

# Verify CUDA is working
python3 -c "from tinygrad import Device; print(Device.DEFAULT)"
# Should print: CUDA

# Test GPU memory
nvidia-smi --query-gpu=memory.free --format=csv
```

- Use SXM2 GPUs - NVLink provides faster inter-GPU communication (see the topology check after this list)
- Match GPU types - Heterogeneous clusters work but homogeneous is faster
- 10GbE+ networking - For multi-node clusters, network is the bottleneck
- Disable TUI - `--disable-tui` reduces overhead
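To see whether your GPUs are actually connected over NVLink (relevant to the SXM2 tip above), `nvidia-smi` can print the interconnect topology:

```bash
# Show the GPU-to-GPU interconnect matrix
# (NV1/NV2 = NVLink, PHB/PIX/SYS = PCIe or system paths)
nvidia-smi topo -m
```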
| Project | Description |
|---|---|
| nvidia-power8-patches | NVIDIA drivers for IBM POWER8 |
| cuda-power8-patches | CUDA toolkit for POWER8 |
| llama-cpp-power8 | llama.cpp on POWER8 |
GPL-3.0 (same as original exo)
Maintained by Elyan Labs
Distributed NVIDIA inference that actually works