Run GPT-4 class AI coding assistants 100% locally. No API costs. No cloud. Total privacy.
Complete guide with agentic workflows, prompt engineering, runner comparison, and real-world examples
Quick Links:
Quick Start · Agentic Coding · Runners · Guardrails · Prompts · Community
Full navigation:
- DeepSeek V3 + R1 Guide - The open-source king
- Gemini 2.0 Flash Guide - Speed & context economics
- Holiday Freeze Protocol - Enterprise survival guide
- 2026 Agentic Trends - Vibe coding & corporate immune systems
- Runner Comparison - Ollama vs llama.cpp vs vLLM
- Model Selection
- IDE Integration
- Alternative Tools - LM Studio, Tabby
- Agentic Coding - Autonomous bug fixing
- Guardrails & TDD - Prevent hallucinations
- Prompt Engineering - Better local prompts
- Real-World Workflows
- Community Experiences - Reddit/HN insights
- Advanced Patterns - Architect-Builder, YOLO Mode
- FAQ - Quick answers
- Gotchas & Common Mistakes
- Diagrams - Visual workflows
- Optimization Guide
- Cost Analysis
- Docker Compose - One-command setup
- Config Templates - Ready-to-use configs
- Benchmark Script - Test your hardware
| Cloud AI | Local AI |
|---|---|
| ❌ $200-500/month API costs | ✅ $0/month after hardware |
| ❌ Your code sent to servers | ✅ 100% private |
| ❌ Network latency (~200-500ms) | ✅ <50ms response |
| ❌ Rate limits | ✅ Unlimited usage |
| ❌ Requires internet | ✅ Works offline |
2026 Reality: Qwen2.5-Coder-32B scores 92.7% on HumanEval, matching GPT-4o. The switch is no longer a compromise; it's an upgrade.
Speed (t/s) ≈ Memory Bandwidth (GB/s) / Model Size (GB)
Example: RTX 4090 (1008 GB/s) + Qwen 32B Q4 (18GB)
→ 1008 / 18 = 56 t/s ✅
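This rule of thumb is easy to check for your own hardware. A minimal Python sketch (the numbers are the ones quoted above; real speeds vary with quantization and context length):

def estimate_tps(bandwidth_gbs: float, model_size_gb: float) -> float:
    # Rule of thumb: decode speed ~= memory bandwidth / model size
    return bandwidth_gbs / model_size_gb

print(estimate_tps(1008, 18))  # RTX 4090 + 32B Q4 -> ~56 t/s
print(estimate_tps(400, 18))   # M3 Max + 32B Q4 -> ~22 t/s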
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows - Download from https://ollama.com/download

# For 24GB VRAM (RTX 3090/4090)
ollama pull qwen2.5-coder:32b
# For 16GB VRAM
ollama pull qwen2.5-coder:14b
# For 8GB VRAM or laptops
ollama pull qwen2.5-coder:7b
# For autocomplete (fast, small)
ollama pull qwen2.5-coder:1.5b-base

ollama run qwen2.5-coder:32b
>>> Write a Python function to find prime numbers

- Install Continue extension
- Configure ~/.continue/config.json:
{
"models": [{
"title": "Qwen 32B (Chat)",
"provider": "ollama",
"model": "qwen2.5-coder:32b"
}],
"tabAutocompleteModel": {
"title": "Qwen 1.5B (Fast)",
"provider": "ollama",
"model": "qwen2.5-coder:1.5b-base"
}
}

Done! You now have a local Copilot alternative.
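Before pointing the editor at it, you can sanity-check that Ollama is answering. A minimal sketch against Ollama's /api/generate endpoint (assumes the default port 11434 and a model tag you actually pulled):

import json
import urllib.request

payload = {
    "model": "qwen2.5-coder:7b",   # any tag you pulled above
    "prompt": "Write a Python one-liner that reverses a string.",
    "stream": False,               # return a single JSON object
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])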
vLLM is 19x faster than Ollama under concurrent load (Red Hat benchmarks).
| Runner | Throughput | Best For |
|---|---|---|
| Ollama | ~41 TPS | Single dev, easy setup |
| llama.cpp | ~44 TPS | CLI hackers, full control |
| vLLM | ~793 TPS | Team servers, CI/CD |
| SGLang | ~47 TPS | DeepSeek, structured JSON |
Single developer on desktop?
├─ Want simplicity? → Ollama
└─ Want control? → llama.cpp

Running team server?
├─ High throughput? → vLLM
└─ JSON outputs? → SGLang
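The throughput numbers above are server-side figures under concurrent load; to see what a single request gets on your own box, here is a minimal timing sketch against Ollama (eval_count in the response is the number of generated tokens):

import json, time, urllib.request

payload = {"model": "qwen2.5-coder:7b",
           "prompt": "Write a quicksort in Python.",
           "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
start = time.time()
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(f"{body['eval_count'] / (time.time() - start):.1f} t/s (wall clock)")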
Full Runner Comparison Guide →
Reddit's #1 requested feature: "Show me a real workflow, not just setup."
# Install Aider
pip install aider-chat
# Configure for Ollama
cat > ~/.aider.conf.yml << 'EOF'
model: ollama/qwen2.5-coder:32b
openai-api-base: http://localhost:11434/v1
openai-api-key: "ollama"
EOF
# Start fixing bugs!
cd /your/project
aider .

YOU: Tests test_user_login and test_user_logout are failing. Please:
1) Run `pytest tests/test_auth.py`
2) Read failing tests and source files
3) Explain the bug and create a plan
4) Apply minimal fix
5) Run tests until they pass
AIDER: [Reads files, proposes fix, applies, runs tests, iterates...]
YOU: git diff # Review
YOU: git commit -am "Fix auth bug"
- Open failing file in VS Code
- Open Continue → Select Agent mode
- Prompt with specific instructions
- Let Agent iterate with tools
Full Agentic Coding Guide →
Prevent local models from hallucinating and breaking your code.
1. YOU write failing test
2. AI implements code
3. Test runs automatically
4. If fail → AI analyzes, retries
5. If pass → Move to next feature
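Scripted by hand rather than driven through Aider, the same loop looks roughly like this. A sketch only: ask_model and apply_patch are hypothetical helpers standing in for a call to your local model and for applying its diff:

import subprocess

def run_tests() -> tuple[bool, str]:
    # Run pytest and capture output for the model to analyze
    result = subprocess.run(["pytest", "tests/", "-x", "--tb=short"],
                            capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def tdd_loop(ask_model, apply_patch, max_attempts=5) -> bool:
    for _ in range(max_attempts):
        passed, output = run_tests()
        if passed:
            return True   # step 5: move on to the next feature
        # Step 4: feed the failure back so the model can revise its fix
        patch = ask_model(f"Tests failed:\n{output}\nPropose a minimal fix.")
        apply_patch(patch)
    return False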
PROMPT (Step 1 - Plan):
"Analyze the failing test. DO NOT write code yet.
Create a numbered plan with 3-7 steps."
PROMPT (Step 2 - Execute):
"I approve the plan. Now implement step by step.
Run tests after each major change."
RULES:
- Only modify: PaymentService.ts
- Do NOT touch: config.ts, package.json
- Do NOT add new files
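Constraints like these can also be enforced mechanically before a diff is applied. A sketch, assuming the model was told to output unified diffs (the file names are the ones from the rules above):

ALLOWED = {"PaymentService.ts"}
FORBIDDEN = {"config.ts", "package.json"}

def diff_respects_rules(diff: str) -> bool:
    # Reject any unified diff that touches files outside the allowlist
    for line in diff.splitlines():
        if line.startswith(("--- ", "+++ ")):
            path = line.split(maxsplit=1)[1]
            if path == "/dev/null":
                return False  # file created or deleted: not allowed
            name = path.removeprefix("a/").removeprefix("b/").split("/")[-1]
            if name in FORBIDDEN or name not in ALLOWED:
                return False
    return True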
Full Guardrails Guide →
Local models need more explicit, tightly structured prompts than GPT-4-class cloud models.
CONTEXT: You are editing a TypeScript monorepo with Next.js.
OBJECTIVE: Fix the failing tests without breaking other components.
STYLE: Clear, idiomatic TypeScript; minimal changes.
RESPONSE:
1. Short explanation (3-5 bullets)
2. Step-by-step plan
3. Unified diff for changed files only
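If you reuse this scaffold often, it is trivial to template. A small sketch (the field names are just the ones from the structure above):

def build_prompt(context: str, objective: str, style: str) -> str:
    return (f"CONTEXT: {context}\n"
            f"OBJECTIVE: {objective}\n"
            f"STYLE: {style}\n"
            "RESPONSE:\n"
            "1. Short explanation (3-5 bullets)\n"
            "2. Step-by-step plan\n"
            "3. Unified diff for changed files only")

print(build_prompt(
    "You are editing a TypeScript monorepo with Next.js.",
    "Fix the failing tests without breaking other components.",
    "Clear, idiomatic TypeScript; minimal changes.",
))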
"You are Qwen, a highly capable coding assistant created by Alibaba Cloud.
You are an expert in algorithms, system design, and clean code principles.
You strictly adhere to user constraints and always think step-by-step."
You are a coding assistant focused on small, safe changes.
RULES:
1. Never invent external APIs
2. Prefer minimal diffs over rewrites
3. Keep style consistent with existing code
4. If ambiguous, ask clarifying questions
5. Output ONLY unified diffs
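One way to pin a system prompt like this in place is Ollama's chat endpoint, which accepts a system-role message. A minimal sketch (adjust the model tag to whatever you run):

import json
import urllib.request

SYSTEM = """You are a coding assistant focused on small, safe changes.
Never invent external APIs. Prefer minimal diffs over rewrites.
If ambiguous, ask clarifying questions. Output ONLY unified diffs."""

payload = {
    "model": "qwen2.5-coder:32b",
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Rename get_data() to fetch_data() in utils.py."},
    ],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])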
Full Prompt Engineering Guide →
| Model | Size | VRAM | HumanEval | Best For |
|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | 24GB | 92.7% | All-around KING |
| DeepSeek-Coder-V2 | 236B (MoE) | 48GB+ | ~89% | Multi-GPU setups |
| Qwen 2.5 Coder 14B | 14B | 16GB | ~85% | Mid-range GPUs |
| Qwen 2.5 Coder 7B | 7B | 8GB | ~80% | Laptops |
| Codestral 22B | 22B | 20GB | ~82% | FIM specialist |
| Quant | Quality | Use Case |
|---|---|---|
| Q4_K_M | ★★★★ | Default. Best balance. |
| Q5_K_M | ★★★★★ | Complex refactors |
| Q8_0 | ★★★★★ | If VRAM allows |
| Q2_K | ★★ | ❌ Avoid for coding |
Warning: Don't go below Q4 for coding. Logic breaks at low precision.
Speed (t/s) ≈ Memory Bandwidth (GB/s) / Model Size (GB)
| Hardware | Bandwidth | 32B Q4 Speed |
|---|---|---|
| RTX 4090 (24GB) | 1008 GB/s | ~56 t/s |
| RTX 3090 (24GB) | 936 GB/s | ~52 t/s |
| M3 Max (96GB) | 400 GB/s | ~22 t/s |
| RTX 4060 Ti (16GB) | 288 GB/s | N/A (won't fit) |
| Persona | Hardware | Best Model | Speed |
|---|---|---|---|
| Budget Learner | RTX 3060 12GB | Qwen 7B | ~40 t/s |
| Pro Developer | RTX 4090 24GB | Qwen 32B | ~56 t/s |
| AI Architect | Mac Studio 128GB | Llama 70B | ~22 t/s |
| Home Lab | Dual RTX 3090 | Llama 70B Q5 | ~35 t/s |
{
"models": [{
"title": "Qwen 32B",
"provider": "ollama",
"model": "qwen2.5-coder:32b"
}],
"tabAutocompleteModel": {
"title": "StarCoder2 3B",
"provider": "ollama",
"model": "starcoder2:3b"
}
}

Settings → Models → OpenAI API Base URL
→ http://localhost:11434/v1
API Key: ollama
Model: qwen2.5-coder:32b
pip install aider-chat
export OLLAMA_API_BASE=http://localhost:11434
aider --model ollama/qwen2.5-coder:32b

| Tool | Best For |
|---|---|
| LM Studio | Visual exploration, model comparison |
| Tabby | Self-hosted autocomplete (<100ms) |
| LocalAI | Kubernetes/DevOps, multi-model |
| vLLM | Team servers, CI/CD pipelines |
Full Alternative Tools Guide →
1. Open failing file + test in VS Code
2. Continue Agent mode
3. Prompt: "Avatar doesn't update after profile change..."
4. Let agent read, test, fix, iterate
5. Review diffs and commit
1. Write failing test first
2. Aider: "Implement /api/users/{id} to pass the test"
3. Agent implements, runs tests, iterates
4. Review and commit
1. Plan mode: "Create characterization tests"
2. Agent mode: "Refactor to Python 3.12"
3. Verify all tests pass
4. Review and commit
| Mistake | Fix |
|---|---|
| Expecting GPT-4 from 7B | Use 32B for complex tasks |
| Dumping entire repo | Limit to relevant files |
| Using Q2 quantization | Stay ≥Q4 for coding |
| Long sessions | Clear context regularly |
| No tests | Always have verification |
Symptoms:
- Model repeats itself
- Ignores instructions
- Quality drops suddenly
Fix:
- /clear or restart session
- Use RAG instead of stuffing
- Summarize before continuing
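The "summarize before continuing" fix can be scripted too. A sketch; ask_model is a hypothetical helper that calls your local model:

def compact_context(ask_model, history: list[str]) -> list[str]:
    # Replace a long session with a short model-written summary
    summary = ask_model(
        "Summarize this coding session in under 10 bullets, keeping "
        "file names, decisions made, and remaining TODOs:\n"
        + "\n".join(history)
    )
    # Continue in a fresh context seeded only with the summary
    return [f"Session summary so far:\n{summary}"]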
export OLLAMA_KEEP_ALIVE=-1  # Never unload

cat << 'EOF' > Modelfile
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
EOF
ollama create qwen32k -f Modelfile

| Factor | Cloud (GPT-4o) | Local (RTX 4090) |
|---|---|---|
| Monthly Cost | $200-500 | $0 |
| Hardware | $0 | ~$1,800 one-time |
| Break-even | - | 4-9 months |
| Privacy | ❌ | ✅ |
| Offline | ❌ | ✅ |
Insight: If you already have a gaming PC, local AI is essentially free.
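The break-even row is straight arithmetic. Using the numbers from the table:

def breakeven_months(hardware_cost: float, monthly_cloud_cost: float) -> float:
    return hardware_cost / monthly_cloud_cost

print(breakeven_months(1800, 500))  # heavy user: ~4 months
print(breakeven_months(1800, 200))  # lighter user: ~9 months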
| Resource | Link |
|---|---|
| Ollama Docs | docs.ollama.com |
| Continue.dev | docs.continue.dev |
| Aider | aider.chat |
| r/LocalLLaMA | reddit.com/r/LocalLLaMA |
| Qwen2.5-Coder | Hugging Face |
We welcome contributions! Help us keep this guide updated.
| Type | Examples |
|---|---|
| Tips | Workflows, shortcuts, hidden features |
| Bug Reports | New issues, workarounds |
| Benchmarks | Model comparisons, speed tests |
| Configs | Modelfiles, Continue configs |
- Fork this repo
- Add your changes
- Submit a PR