First verified working NVIDIA CUDA distributed inference for exo
Run large language models across multiple NVIDIA GPUs with automatic node discovery
Quick Start • Verified Hardware • Multi-Node Setup • Troubleshooting
The original exo focuses on Apple Silicon (MLX). This fork restores full NVIDIA CUDA support via tinygrad:
| Feature | Original exo | exo-cuda |
|---|---|---|
| Apple Silicon (MLX) | ✅ | ✅ |
| NVIDIA CUDA | ❌ Broken | ✅ Working |
| Tesla V100/M40 | ❌ | ✅ Tested |
| Multi-GPU cluster | ✅ | ✅ CUDA cluster |
| Distributed inference | ✅ | ✅ |
```bash
# Clone this repo
git clone https://github.com/Scottcjn/exo-cuda.git
cd exo-cuda

# Create venv and install
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

# Upgrade tinygrad to latest (fixes CUDA issues)
pip install --upgrade git+https://github.com/tinygrad/tinygrad.git

# Start with CUDA backend
exo --inference-engine tinygrad --chatgpt-api-port 8001 --disable-tui
```

| Component | Requirement |
|---|---|
| OS | Ubuntu 22.04/24.04, Debian 12+ |
| Python | 3.10+ (3.12 recommended) |
| NVIDIA Driver | 525+ (nvidia-smi to verify) |
| CUDA Toolkit | 12.0+ (nvcc --version to verify) |
| GPU Memory | 8GB+ per node |
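To sanity-check an existing machine against these requirements in one pass, the standard version commands are enough (a quick sketch using stock Ubuntu/Debian tools):

```bash
# Confirm OS, Python, NVIDIA driver, and CUDA toolkit versions
grep PRETTY_NAME /etc/os-release
python3 --version
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv
nvcc --version | grep release
```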
```bash
# Ubuntu/Debian
sudo apt install nvidia-cuda-toolkit

# Verify
nvcc --version
nvidia-smi
```

Tested December 2024 - January 2025:
| Server | GPU | VRAM | Status |
|---|---|---|---|
| Dell PowerEdge C4130 | Tesla V100-SXM2 | 16GB | ✅ Working |
| Dell PowerEdge C4130 | Tesla M40 | 24GB | ✅ Working |
| Custom Build | RTX 3090 | 24GB | ✅ Working |
| Multi-node cluster | V100 + M40 | 40GB total | ✅ Working |
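To compare your own hardware against this table, `nvidia-smi` can report each GPU's model, VRAM, and compute capability (the `compute_cap` query field needs a reasonably recent driver; drop it on older ones):

```bash
# One line per GPU: model name, total VRAM, compute capability
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
```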
```bash
# On the node that will serve the ChatGPT-compatible API
exo --inference-engine tinygrad --chatgpt-api-port 8001 --disable-tui

# On each additional node
exo --inference-engine tinygrad --disable-tui
```

That's it! Nodes auto-discover via UDP broadcast. No manual configuration needed.
```bash
# Create peers.json
echo '{"peers": ["192.168.1.100:5678", "192.168.1.101:5678"]}' > peers.json

# Start with manual discovery
exo --inference-engine tinygrad --discovery-module manual \
    --discovery-config-path peers.json
```

exo provides a ChatGPT-compatible API:
```bash
# Chat completion
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

# List models
curl http://localhost:8001/v1/models
```

All tinygrad-compatible models work:
| Model | Parameters | Min VRAM |
|---|---|---|
| Llama 3.2 1B | 1B | 4GB |
| Llama 3.2 3B | 3B | 8GB |
| Llama 3.1 8B | 8B | 16GB |
| Llama 3.1 70B | 70B | 140GB (cluster) |
| DeepSeek Coder | Various | Varies |
| Qwen 2.5 | 0.5B-72B | Varies |
| Mistral 7B | 7B | 14GB |
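The names in this table are what goes in the `model` field of API requests. Streaming should also work the usual OpenAI way; a sketch, assuming the endpoint honors the standard `stream` flag:

```bash
# Streamed chat completion; -N disables curl's output buffering.
# "stream": true is the standard OpenAI-style flag and is assumed to be honored here.
curl -N http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "stream": true
  }'
```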
```bash
# Debug logging (0-9, higher = more verbose)
DEBUG=2 exo --inference-engine tinygrad

# Tinygrad-specific debug (1-6)
TINYGRAD_DEBUG=2 exo --inference-engine tinygrad

# Limit GPU visibility
CUDA_VISIBLE_DEVICES=0,1 exo --inference-engine tinygrad
```

| Issue | Solution |
|---|---|
| `nvcc` not found | `sudo apt install nvidia-cuda-toolkit` |
| OpenCL exp2 error | `pip install --upgrade git+https://github.com/tinygrad/tinygrad.git` |
| No GPU detected | Check `nvidia-smi` and `nvcc --version` |
| Out of memory | Use smaller model or add more nodes |
| Connection refused | Check firewall allows UDP broadcast |
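If nodes refuse connections or never discover each other, make sure the firewall permits exo's peer traffic. A minimal `ufw` sketch, assuming the default port 5678 used in the peers.json example above:

```bash
# Open exo's peer port for discovery and node-to-node traffic
# (5678 is taken from the peers.json example; adjust if yours differs)
sudo ufw allow 5678/udp
sudo ufw allow 5678/tcp
```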
```bash
# Fix tinygrad CUDA issues
pip install --upgrade git+https://github.com/tinygrad/tinygrad.git

# Verify CUDA is working
python3 -c "from tinygrad import Device; print(Device.DEFAULT)"
# Should print: CUDA

# Test GPU memory
nvidia-smi --query-gpu=memory.free --format=csv
```

- Use SXM2 GPUs - NVLink provides faster inter-GPU communication (see the topology check after this list)
- Match GPU types - Heterogeneous clusters work but homogeneous is faster
- 10GbE+ networking - For multi-node clusters, network is the bottleneck
- Disable TUI - `--disable-tui` reduces overhead
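To see whether your GPUs are actually connected over NVLink (relevant to the SXM2 tip above), `nvidia-smi` can print the interconnect topology:

```bash
# Show the GPU-to-GPU interconnect matrix
# (NV1/NV2 = NVLink, PHB/PIX/SYS = PCIe or system paths)
nvidia-smi topo -m
```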
| Project | Description |
|---|---|
| nvidia-power8-patches | NVIDIA drivers for IBM POWER8 |
| cuda-power8-patches | CUDA toolkit for POWER8 |
| llama-cpp-power8 | llama.cpp on POWER8 |
GPL-3.0 (same as original exo)
Maintained by Elyan Labs
Distributed NVIDIA inference that actually works