Run Docker's AI Model Runner with native GPU acceleration on macOS Apple Silicon — without Docker Desktop.
This project provides an automated setup script that configures Docker Model Runner to use your Mac's GPU (Metal) for AI inference, using Colima instead of Docker Desktop. Perfect for running local LLMs with optimal performance.
Docker has open-sourced its Model Runner and CLI plugin, but they're designed to work exclusively with Docker Desktop. When you try to use `docker model` with open-source alternatives like Colima on macOS, you lose GPU acceleration and fall back to CPU-only inference, resulting in significantly slower performance.
Docker Desktop achieves GPU acceleration by running llama.cpp in an Apple native sandbox with Metal support. However, there was no equivalent solution for users who want a fully open-source Docker setup with Colima.
This project provides a forked model-runner with patches that enable the same GPU-accelerated architecture outside of Docker Desktop:
- Runs llama.cpp in Apple's native sandbox: Same security model as Docker Desktop, with Metal GPU acceleration
- Works seamlessly with Colima: Fully open-source Docker environment
- Drop-in replacement: Uses the same `docker model` CLI commands
- Host-level service: model-runner runs as a macOS LaunchAgent for direct GPU access
- Native GPU Performance: Full Metal acceleration via llama.cpp, just like Docker Desktop
- Fully Open Source: Both Colima and our forked model-runner are open source
- Sandboxed Security: Models run in Apple's sandbox for container-like security isolation
- OpenAI-Compatible API: Drop-in replacement for OpenAI endpoints at `localhost:12434`
- Container Integration: Accessible from Colima containers via `host.lima.internal:12434`
- Automatic Service Management: Runs as a macOS LaunchAgent (starts automatically at login)
- macOS with Apple Silicon (M1/M2/M3/M4)
- Homebrew installed
- Colima installed: `brew install colima`
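If you want to confirm the prerequisites before running the setup script, a minimal check looks like this (`colima version` only needs to print a version string; any recent release should work):

```bash
# Confirm Apple Silicon, Homebrew, and Colima are all in place
uname -m         # should print arm64
brew --version
colima version
```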
```bash
git clone https://github.com/Liquescent-Development/colima-model-runner.git
cd colima-model-runner
./setup-colima-gpu-model-runner.sh
```

The script will:
- Install dependencies (llama.cpp, Docker CLI)
- Download pre-built model-runner binary from our GPU-optimized fork
- Configure model-runner as a macOS LaunchAgent service
- Set up Colima for Docker support
- Test GPU support and connectivity
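Once the script finishes (and your shell has picked up the new environment, as shown in the next step), a quick sanity check might look like this; each command is the same check used later in this README:

```bash
# Is the LaunchAgent loaded?
launchctl list | grep model-runner

# Is the API answering on the host?
curl -s http://localhost:12434/models

# Is the docker model CLI talking to the runner?
docker model ls
```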
Reload your shell so the new environment variables take effect:

```bash
source ~/.zshrc   # or ~/.bashrc if you use bash
```

```bash
# Pull a model
docker model pull ai/smollm2

# List installed models
docker model ls

# Run inference (interactive)
docker model run ai/smollm2

# Run inference (single prompt)
docker model run ai/smollm2 "Explain quantum computing in simple terms"
```

Common commands:

```bash
# Interactive chat
docker model run ai/smollm2

# Single prompt
docker model run ai/smollm2 "Write a haiku about code"

# List models
docker model ls

# Remove a model
docker model rm ai/smollm2
```

Call the OpenAI-compatible API directly:

```bash
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

```bash
# Use from any container running in Colima
docker run -e OPENAI_API_BASE=http://host.lima.internal:12434/v1 your-app
```
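To verify that containers can actually reach the runner through the Lima host alias, a quick sketch (using `busybox` only because it ships with a small HTTP client; any image with curl or wget works):

```bash
# From inside any Colima container, the runner is reachable via host.lima.internal
docker run --rm busybox wget -qO- http://host.lima.internal:12434/models
```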
From Python, using the OpenAI SDK:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:12434/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="ai/smollm2",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
```

How the pieces fit together:

```
┌───────────────────────────────────────────────────────────────┐
│                          macOS Host                           │
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ model-runner (LaunchAgent)                              │  │
│  │ • Port: 12434                                           │  │
│  │ • GPU: Metal acceleration via llama.cpp                 │  │
│  │ • API: OpenAI-compatible                                │  │
│  └─────────────────────────────────────────────────────────┘  │
│                               ▲                               │
│                               │ http://localhost:12434        │
│                               │                               │
│  ┌────────────────────────────┴────────────────────────────┐  │
│  │ docker model CLI                                        │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ Colima (Lima VM)                                        │  │
│  │ • Docker daemon                                         │  │
│  │ • Containers access via host.lima.internal:12434        │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
```
Check that the service is running:

```bash
launchctl list | grep model-runner
```

View the logs:

```bash
# Standard output
tail -f ~/Library/Logs/model-runner.log

# Errors
tail -f ~/Library/Logs/model-runner.err
```

Restart the service:

```bash
launchctl unload ~/Library/LaunchAgents/com.liquescent.model-runner.plist
launchctl load ~/Library/LaunchAgents/com.liquescent.model-runner.plist
```

Stop the service:

```bash
launchctl unload ~/Library/LaunchAgents/com.liquescent.model-runner.plist
```

Monitor GPU usage during inference:

```bash
sudo powermetrics --samplers gpu_power -i 1000
```

Check that GPU support is enabled:

```bash
grep "gpuSupport=true" ~/Library/Logs/model-runner.log
```

You should see output confirming that GPU support is active.
- Check the logs: `tail -f ~/Library/Logs/model-runner.err`
- Verify the binary exists: `ls -la ~/.local/bin/model-runner`
- Ensure the Xcode Command Line Tools are installed: `xcode-select -p`
- Verify you're on Apple Silicon: `uname -m` (should show `arm64`)
- Check the llama.cpp installation: `which llama-server`
- Review the model-runner logs for Metal initialization (see the sketch below)
```bash
# Ensure environment variable is set
echo $MODEL_RUNNER_HOST   # Should show http://localhost:12434

# Test connectivity
curl http://localhost:12434/models

# Reinstall Docker CLI
brew upgrade docker
```

```bash
# Restart Colima
colima stop
colima start --cpu 4 --memory 8 --disk 60 --vm-type=vz

# Check status
colima status
```

The setup uses the following defaults:

- Service Port: 12434
- Binary Location: `~/.local/bin/model-runner`
- Installation Directory: `~/.local/share/model-runner`
- Logs: `~/Library/Logs/model-runner.{log,err}`
- LaunchAgent: `~/Library/LaunchAgents/com.liquescent.model-runner.plist`
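If you want to see exactly what the installed service runs, macOS's `plutil` can pretty-print the LaunchAgent definition (read-only; shown here purely as a convenience):

```bash
# Dump the LaunchAgent plist installed by the setup script
plutil -p ~/Library/LaunchAgents/com.liquescent.model-runner.plist
```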
Environment variables used by the service:

```bash
MODEL_RUNNER_HOST=http://localhost:12434
MODEL_RUNNER_PORT=12434
LLAMA_SERVER_PATH=/opt/homebrew/bin
```

- Choose the Right Model Size: Start with quantized models (Q4_K_M) for the best speed/quality balance (see the example after this list)
- Monitor GPU Usage: Use `powermetrics` to verify GPU utilization during inference
- Colima Resources: Allocate enough RAM and CPU for your containers
- Model Caching: Pulled models are cached locally for fast reuse
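As an illustration of the model-size tip above: quantized variants are usually published as separate tags on the model repository. The tag below is hypothetical, so check the model's page on Docker Hub for the tags that actually exist:

```bash
# Hypothetical tag shown for illustration; real tag names vary per model
docker model pull ai/smollm2:360M-Q4_K_M
```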
```bash
# Stop and remove service
launchctl unload ~/Library/LaunchAgents/com.liquescent.model-runner.plist
rm ~/Library/LaunchAgents/com.liquescent.model-runner.plist

# Remove binaries and data
rm -rf ~/.local/share/model-runner
rm ~/.local/bin/model-runner

# Remove environment variable from shell profile
# (edit ~/.zshrc or ~/.bashrc and remove the MODEL_RUNNER_HOST export)
```

Contributions are welcome! Please feel free to submit issues or pull requests.
- Docker Model Runner - Official Docker AI documentation
- Liquescent-Development/model-runner - Our GPU-optimized fork
- Colima - Container runtime for macOS
- llama.cpp - LLM inference with Metal support
See LICENSE file for details.
- Issues: GitHub Issues
- Documentation: See CLAUDE.md for development details
- Docker Model Runner Docs: https://docs.docker.com/ai/model-runner/