Run GPT-4 class AI coding assistants 100% locally. No API costs. No cloud. Total privacy.
Complete guide with agentic workflows, prompt engineering, runner comparison, and real-world examples
Quick Links:
Quick Start · Agentic Coding · Runners · Guardrails · Prompts · Community
Full navigation:
- DeepSeek V3 + R1 Guide - The open-source king
- Gemini 2.0 Flash Guide - Speed & context economics
- Holiday Freeze Protocol - Enterprise survival guide
- 2026 Agentic Trends - Vibe coding & corporate immune systems
- Runner Comparison - Ollama vs llama.cpp vs vLLM
- Model Selection
- IDE Integration
- Alternative Tools - LM Studio, Tabby
- Agentic Coding - Autonomous bug fixing
- Guardrails & TDD - Prevent hallucinations
- Prompt Engineering - Better local prompts
- Real-World Workflows
- Community Experiences - Reddit/HN insights
- Advanced Patterns - Architect-Builder, YOLO Mode
- FAQ - Quick answers
- Gotchas & Common Mistakes
- Diagrams - Visual workflows
- Optimization Guide
- Cost Analysis
- Docker Compose - One-command setup
- Config Templates - Ready-to-use configs
- Benchmark Script - Test your hardware
| Cloud AI | Local AI |
|---|---|
| ❌ $200-500/month API costs | ✅ $0/month after hardware |
| ❌ Your code sent to servers | ✅ 100% private |
| ❌ Network latency (~200-500ms) | ✅ <50ms response |
| ❌ Rate limits | ✅ Unlimited usage |
| ❌ Requires internet | ✅ Works offline |
2026 Reality: Qwen2.5-Coder-32B scores 92.7% on HumanEval, matching GPT-4o. The switch is no longer a compromise; it's an upgrade.
Speed (t/s) ≈ Memory Bandwidth (GB/s) / Model Size (GB)
Example: RTX 4090 (1008 GB/s) + Qwen 32B Q4 (18GB)
→ 1008 / 18 = 56 t/s ✅
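This rule of thumb is easy to check for your own hardware. A minimal Python sketch (the numbers are the ones quoted above; real speeds vary with quantization and context length):

def estimate_tps(bandwidth_gbs: float, model_size_gb: float) -> float:
    # Rule of thumb: decode speed ~= memory bandwidth / model size
    return bandwidth_gbs / model_size_gb

print(estimate_tps(1008, 18))  # RTX 4090 + 32B Q4 -> ~56 t/s
print(estimate_tps(400, 18))   # M3 Max + 32B Q4 -> ~22 t/s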
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows - Download from https://ollama.com/download

# For 24GB VRAM (RTX 3090/4090)
ollama pull qwen2.5-coder:32b
# For 16GB VRAM
ollama pull qwen2.5-coder:14b
# For 8GB VRAM or laptops
ollama pull qwen2.5-coder:7b
# For autocomplete (fast, small)
ollama pull qwen2.5-coder:1.5b-base

ollama run qwen2.5-coder:32b
>>> Write a Python function to find prime numbers

- Install Continue extension
- Configure ~/.continue/config.json:
{
"models": [{
"title": "Qwen 32B (Chat)",
"provider": "ollama",
"model": "qwen2.5-coder:32b"
}],
"tabAutocompleteModel": {
"title": "Qwen 1.5B (Fast)",
"provider": "ollama",
"model": "qwen2.5-coder:1.5b-base"
}
}

Done! You now have a local Copilot alternative.
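Before pointing the editor at it, you can sanity-check that Ollama is answering. A minimal sketch against Ollama's /api/generate endpoint (assumes the default port 11434 and a model tag you actually pulled):

import json
import urllib.request

payload = {
    "model": "qwen2.5-coder:7b",   # any tag you pulled above
    "prompt": "Write a Python one-liner that reverses a string.",
    "stream": False,               # return a single JSON object
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])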
vLLM is 19x faster than Ollama under concurrent load (Red Hat benchmarks).
| Runner | Throughput | Best For |
|---|---|---|
| Ollama | ~41 TPS | Single dev, easy setup |
| llama.cpp | ~44 TPS | CLI hackers, full control |
| vLLM | ~793 TPS | Team servers, CI/CD |
| SGLang | ~47 TPS | DeepSeek, structured JSON |
Single developer on desktop?
├─ Want simplicity? → Ollama
└─ Want control? → llama.cpp

Running team server?
├─ High throughput? → vLLM
└─ JSON outputs? → SGLang
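The throughput numbers above are server-side figures under concurrent load; to see what a single request gets on your own box, here is a minimal timing sketch against Ollama (eval_count in the response is the number of generated tokens):

import json, time, urllib.request

payload = {"model": "qwen2.5-coder:7b",
           "prompt": "Write a quicksort in Python.",
           "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
start = time.time()
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(f"{body['eval_count'] / (time.time() - start):.1f} t/s (wall clock)")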
Full Runner Comparison Guide →
Reddit's #1 requested feature: "Show me a real workflow, not just setup."
# Install Aider
pip install aider-chat
# Configure for Ollama
cat > ~/.aider.conf.yml << 'EOF'
model: ollama/qwen2.5-coder:32b
openai-api-base: http://localhost:11434/v1
openai-api-key: "ollama"
EOF
# Start fixing bugs!
cd /your/project
aider .

YOU: Tests test_user_login and test_user_logout are failing. Please:
1) Run `pytest tests/test_auth.py`
2) Read failing tests and source files
3) Explain the bug and create a plan
4) Apply minimal fix
5) Run tests until they pass
AIDER: [Reads files, proposes fix, applies, runs tests, iterates...]
YOU: git diff # Review
YOU: git commit -am "Fix auth bug"
- Open failing file in VS Code
- Open Continue → Select Agent mode
- Prompt with specific instructions
- Let Agent iterate with tools
Full Agentic Coding Guide →
Prevent local models from hallucinating and breaking your code.
1. YOU write failing test
2. AI implements code
3. Test runs automatically
4. If fail → AI analyzes, retries
5. If pass → Move to next feature
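Scripted by hand rather than driven through Aider, the same loop looks roughly like this. A sketch only: ask_model and apply_patch are hypothetical helpers standing in for a call to your local model and for applying its diff:

import subprocess

def run_tests() -> tuple[bool, str]:
    # Run pytest and capture output for the model to analyze
    result = subprocess.run(["pytest", "tests/", "-x", "--tb=short"],
                            capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def tdd_loop(ask_model, apply_patch, max_attempts=5) -> bool:
    for _ in range(max_attempts):
        passed, output = run_tests()
        if passed:
            return True   # step 5: move on to the next feature
        # Step 4: feed the failure back so the model can revise its fix
        patch = ask_model(f"Tests failed:\n{output}\nPropose a minimal fix.")
        apply_patch(patch)
    return False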
PROMPT (Step 1 - Plan):
"Analyze the failing test. DO NOT write code yet.
Create a numbered plan with 3-7 steps."
PROMPT (Step 2 - Execute):
"I approve the plan. Now implement step by step.
Run tests after each major change."
RULES:
- Only modify: PaymentService.ts
- Do NOT touch: config.ts, package.json
- Do NOT add new files
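Constraints like these can also be enforced mechanically before a diff is applied. A sketch, assuming the model was told to output unified diffs (the file names are the ones from the rules above):

ALLOWED = {"PaymentService.ts"}
FORBIDDEN = {"config.ts", "package.json"}

def diff_respects_rules(diff: str) -> bool:
    # Reject any unified diff that touches files outside the allowlist
    for line in diff.splitlines():
        if line.startswith(("--- ", "+++ ")):
            path = line.split(maxsplit=1)[1]
            if path == "/dev/null":
                return False  # file created or deleted: not allowed
            name = path.removeprefix("a/").removeprefix("b/").split("/")[-1]
            if name in FORBIDDEN or name not in ALLOWED:
                return False
    return True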
Full Guardrails Guide →
Local models need more explicit, tightly structured prompts than GPT-4-class cloud models.
CONTEXT: You are editing a TypeScript monorepo with Next.js.
OBJECTIVE: Fix the failing tests without breaking other components.
STYLE: Clear, idiomatic TypeScript; minimal changes.
RESPONSE:
1. Short explanation (3-5 bullets)
2. Step-by-step plan
3. Unified diff for changed files only
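If you reuse this scaffold often, it is trivial to template. A small sketch (the field names are just the ones from the structure above):

def build_prompt(context: str, objective: str, style: str) -> str:
    return (f"CONTEXT: {context}\n"
            f"OBJECTIVE: {objective}\n"
            f"STYLE: {style}\n"
            "RESPONSE:\n"
            "1. Short explanation (3-5 bullets)\n"
            "2. Step-by-step plan\n"
            "3. Unified diff for changed files only")

print(build_prompt(
    "You are editing a TypeScript monorepo with Next.js.",
    "Fix the failing tests without breaking other components.",
    "Clear, idiomatic TypeScript; minimal changes.",
))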
"You are Qwen, a highly capable coding assistant created by Alibaba Cloud.
You are an expert in algorithms, system design, and clean code principles.
You strictly adhere to user constraints and always think step-by-step."
You are a coding assistant focused on small, safe changes.
RULES:
1. Never invent external APIs
2. Prefer minimal diffs over rewrites
3. Keep style consistent with existing code
4. If ambiguous, ask clarifying questions
5. Output ONLY unified diffs
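One way to pin a system prompt like this in place is Ollama's chat endpoint, which accepts a system-role message. A minimal sketch (adjust the model tag to whatever you run):

import json
import urllib.request

SYSTEM = """You are a coding assistant focused on small, safe changes.
Never invent external APIs. Prefer minimal diffs over rewrites.
If ambiguous, ask clarifying questions. Output ONLY unified diffs."""

payload = {
    "model": "qwen2.5-coder:32b",
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Rename get_data() to fetch_data() in utils.py."},
    ],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])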
Full Prompt Engineering Guide →
| Model | Size | VRAM | HumanEval | Best For |
|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | 24GB | 92.7% | All-around KING |
| DeepSeek-Coder-V2 | 236B (MoE) | 48GB+ | ~89% | Multi-GPU setups |
| Qwen 2.5 Coder 14B | 14B | 16GB | ~85% | Mid-range GPUs |
| Qwen 2.5 Coder 7B | 7B | 8GB | ~80% | Laptops |
| Codestral 22B | 22B | 20GB | ~82% | FIM specialist |
| Quant | Quality | Use Case |
|---|---|---|
| Q4_K_M | ★★★★ | Default. Best balance. |
| Q5_K_M | ★★★★★ | Complex refactors |
| Q8_0 | ★★★★★ | If VRAM allows |
| Q2_K | ★★ | ❌ Avoid for coding |
Warning: Don't go below Q4 for coding. Logic breaks at low precision.
Speed (t/s) ≈ Memory Bandwidth (GB/s) / Model Size (GB)
| Hardware | Bandwidth | 32B Q4 Speed |
|---|---|---|
| RTX 4090 (24GB) | 1008 GB/s | ~56 t/s |
| RTX 3090 (24GB) | 936 GB/s | ~52 t/s |
| M3 Max (96GB) | 400 GB/s | ~22 t/s |
| RTX 4060 Ti (16GB) | 288 GB/s | N/A (won't fit) |
| Persona | Hardware | Best Model | Speed |
|---|---|---|---|
| Budget Learner | RTX 3060 12GB | Qwen 7B | ~40 t/s |
| Pro Developer | RTX 4090 24GB | Qwen 32B | ~56 t/s |
| AI Architect | Mac Studio 128GB | Llama 70B | ~22 t/s |
| Home Lab | Dual RTX 3090 | Llama 70B Q5 | ~35 t/s |
{
"models": [{
"title": "Qwen 32B",
"provider": "ollama",
"model": "qwen2.5-coder:32b"
}],
"tabAutocompleteModel": {
"title": "StarCoder2 3B",
"provider": "ollama",
"model": "starcoder2:3b"
}
}

Settings → Models → OpenAI API Base URL
→ http://localhost:11434/v1
API Key: ollama
Model: qwen2.5-coder:32b
pip install aider-chat
export OLLAMA_API_BASE=http://localhost:11434
aider --model ollama/qwen2.5-coder:32b

| Tool | Best For |
|---|---|
| LM Studio | Visual exploration, model comparison |
| Tabby | Self-hosted autocomplete (<100ms) |
| LocalAI | Kubernetes/DevOps, multi-model |
| vLLM | Team servers, CI/CD pipelines |
Full Alternative Tools Guide →
1. Open failing file + test in VS Code
2. Continue Agent mode
3. Prompt: "Avatar doesn't update after profile change..."
4. Let agent read, test, fix, iterate
5. Review diffs and commit
1. Write failing test first
2. Aider: "Implement /api/users/{id} to pass the test"
3. Agent implements, runs tests, iterates
4. Review and commit
1. Plan mode: "Create characterization tests"
2. Agent mode: "Refactor to Python 3.12"
3. Verify all tests pass
4. Review and commit
| Mistake | Fix |
|---|---|
| Expecting GPT-4 from 7B | Use 32B for complex tasks |
| Dumping entire repo | Limit to relevant files |
| Using Q2 quantization | Stay ≥Q4 for coding |
| Long sessions | Clear context regularly |
| No tests | Always have verification |
Symptoms:
- Model repeats itself
- Ignores instructions
- Quality drops suddenly
Fix:
- /clear or restart session
- Use RAG instead of stuffing
- Summarize before continuing
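The "summarize before continuing" fix can be scripted too. A sketch; ask_model is a hypothetical helper that calls your local model:

def compact_context(ask_model, history: list[str]) -> list[str]:
    # Replace a long session with a short model-written summary
    summary = ask_model(
        "Summarize this coding session in under 10 bullets, keeping "
        "file names, decisions made, and remaining TODOs:\n"
        + "\n".join(history)
    )
    # Continue in a fresh context seeded only with the summary
    return [f"Session summary so far:\n{summary}"]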
export OLLAMA_KEEP_ALIVE=-1  # Never unload

cat << 'EOF' > Modelfile
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
EOF
ollama create qwen32k -f Modelfile

| Factor | Cloud (GPT-4o) | Local (RTX 4090) |
|---|---|---|
| Monthly Cost | $200-500 | $0 |
| Hardware | $0 | ~$1,800 one-time |
| Break-even | - | 4-9 months |
| Privacy | ❌ | ✅ |
| Offline | ❌ | ✅ |
Insight: If you already have a gaming PC, local AI is essentially free.
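The break-even row is straight arithmetic. Using the numbers from the table:

def breakeven_months(hardware_cost: float, monthly_cloud_cost: float) -> float:
    return hardware_cost / monthly_cloud_cost

print(breakeven_months(1800, 500))  # heavy user: ~4 months
print(breakeven_months(1800, 200))  # lighter user: ~9 months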
| Resource | Link |
|---|---|
| Ollama Docs | docs.ollama.com |
| Continue.dev | docs.continue.dev |
| Aider | aider.chat |
| r/LocalLLaMA | reddit.com/r/LocalLLaMA |
| Qwen2.5-Coder | Hugging Face |
We welcome contributions! Help us keep this guide updated.
| Type | Examples |
|---|---|
| Tips | Workflows, shortcuts, hidden features |
| Bug Reports | New issues, workarounds |
| Benchmarks | Model comparisons, speed tests |
| Configs | Modelfiles, Continue configs |
- Fork this repo
- Add your changes
- Submit a PR