
Local AI Coding Guide Banner

🦙 Local AI Coding Guide

Run GPT-4 class AI coding assistants 100% locally. No API costs. No cloud. Total privacy.


Complete guide with agentic workflows, prompt engineering, runner comparison, and real-world examples


⚡ Quick Links:

🚀 Quick Start · 🤖 Agentic Coding · 🔀 Runners · 🛡️ Guardrails · 🎯 Prompts · 🗣️ Community · ⚠️ Gotchas


📋 Table of Contents

Click to expand full navigation

🚀 Getting Started

🔥 Hot Topics (January 2026) - NEW!

🔧 Infrastructure

🤖 Advanced Workflows (NEW)

⚠️ Troubleshooting

🛠️ Tools & Configs


🎯 Why Local AI?

| Cloud AI | Local AI |
|----------|----------|
| ❌ $200-500/month API costs | ✅ $0/month after hardware |
| ❌ Your code sent to servers | ✅ 100% private |
| ❌ Network latency (~200-500ms) | ✅ <50ms response |
| ❌ Rate limits | ✅ Unlimited usage |
| ❌ Requires internet | ✅ Works offline |

2026 Reality: Qwen2.5-Coder-32B scores 92.7% on HumanEval, matching GPT-4o. The switch is no longer a compromise; it's an upgrade.

The Bandwidth Formula

Speed (t/s) ≈ Memory Bandwidth (GB/s) / Model Size (GB)

Example: RTX 4090 (1008 GB/s) + Qwen 32B Q4 (18GB)
         ≈ 1008 / 18 = 56 t/s ✓

🚀 Quick Start

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows - Download from https://ollama.com/download

Step 2: Download Coding Model

# For 24GB VRAM (RTX 3090/4090)
ollama pull qwen2.5-coder:32b

# For 16GB VRAM
ollama pull qwen2.5-coder:14b

# For 8GB VRAM or laptops
ollama pull qwen2.5-coder:7b

# For autocomplete (fast, small)
ollama pull qwen2.5-coder:1.5b-base

Step 3: Test It

ollama run qwen2.5-coder:32b
>>> Write a Python function to find prime numbers
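
Optionally, confirm the server is also reachable over HTTP. Ollama listens on localhost:11434 by default; the prompt below is just an example:

# One-shot, non-streaming request against Ollama's REST API
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:32b",
  "prompt": "Write a one-line Python lambda that squares a number.",
  "stream": false
}'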

Step 4: Install Continue.dev (VS Code)

  1. Install Continue extension
  2. Configure ~/.continue/config.json:
{
  "models": [{
    "title": "Qwen 32B (Chat)",
    "provider": "ollama",
    "model": "qwen2.5-coder:32b"
  }],
  "tabAutocompleteModel": {
    "title": "Qwen 1.5B (Fast)",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b-base"
  }
}

Done! You now have a local Copilot alternative.


🔀 Runner Comparison

vLLM is 19x faster than Ollama under concurrent load (Red Hat benchmarks).

| Runner | Throughput | Best For |
|--------|------------|----------|
| Ollama | ~41 TPS | Single dev, easy setup |
| llama.cpp | ~44 TPS | CLI hackers, full control |
| vLLM | ~793 TPS | Team servers, CI/CD |
| SGLang | ~47 TPS | DeepSeek, structured JSON |

Quick Decision

Single developer on desktop?
├─ Want simplicity? → Ollama
└─ Want control? → llama.cpp

Running team server?
├─ High throughput? → vLLM
└─ JSON outputs? → SGLang
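
If you land on the vLLM branch, a minimal launch looks like the sketch below (the model name and flag are illustrative; check the vLLM docs for your version):

# Serve an OpenAI-compatible API on port 8000
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct --max-model-len 16384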

📖 Full Runner Comparison Guide →


🤖 Agentic Coding (NEW!)

Reddit's #1 requested feature: "Show me a real workflow, not just setup."

The Bug Fix Workflow (Aider + Ollama)

# Install Aider
pip install aider-chat

# Point Aider at the local Ollama server (matches the Aider docs and the IDE section below)
export OLLAMA_API_BASE=http://localhost:11434

# Make the model choice permanent
cat > ~/.aider.conf.yml << 'EOF'
model: ollama/qwen2.5-coder:32b
EOF

# Start fixing bugs! (add only the relevant files rather than the whole repo)
cd /your/project
aider

Example Session

YOU: Tests test_user_login and test_user_logout are failing. Please:
     1) Run `pytest tests/test_auth.py`
     2) Read failing tests and source files
     3) Explain the bug and create a plan
     4) Apply minimal fix
     5) Run tests until they pass

AIDER: [Reads files, proposes fix, applies, runs tests, iterates...]

YOU: git diff  # Review
YOU: git commit -am "Fix auth bug"

Continue.dev Agent Mode

  1. Open failing file in VS Code
  2. Open Continue → Select Agent mode
  3. Prompt with specific instructions
  4. Let Agent iterate with tools

📖 Full Agentic Coding Guide →


πŸ›‘οΈ Guardrails & Coding Plans

Prevent local models from hallucinating and breaking your code.

Strategy 1: TDD as Feedback Loop

1. YOU write failing test
2. AI implements code
3. Test runs automatically
4. If fail → AI analyzes, retries
5. If pass → Move to next feature
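
A minimal sketch of this loop with Aider and pytest (the slugify module and file paths are hypothetical; --test-cmd and --auto-test are Aider options that re-run the tests after each change):

# 1. YOU write the failing test
cat > tests/test_slugify.py << 'EOF'
from app.utils import slugify

def test_slugify_strips_punctuation():
    assert slugify("Hello, World!") == "hello-world"
EOF

# 2. Let the agent implement; failures are fed back automatically
aider tests/test_slugify.py app/utils.py \
  --test-cmd "pytest tests/test_slugify.py" --auto-test \
  --message "Implement slugify so the failing test passes."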

Strategy 2: Plan Before Code

PROMPT (Step 1 - Plan):
"Analyze the failing test. DO NOT write code yet.
Create a numbered plan with 3-7 steps."

PROMPT (Step 2 - Execute):
"I approve the plan. Now implement step by step.
Run tests after each major change."

Strategy 3: Scope Limiting

RULES:
- Only modify: PaymentService.ts
- Do NOT touch: config.ts, package.json
- Do NOT add new files
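
With Aider, one way to make rules like these stick is a read-only conventions file (the file name follows Aider's conventions docs; the rules are this example's):

cat > CONVENTIONS.md << 'EOF'
- Only modify: PaymentService.ts
- Do NOT touch: config.ts, package.json
- Do NOT add new files
EOF

# --read adds the file as read-only context for every request
aider --read CONVENTIONS.md PaymentService.ts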

📖 Full Guardrails Guide →


🎯 Prompt Engineering

Local models need more carefully engineered prompts than GPT-4-class cloud models do.

The CO-STAR Framework

CONTEXT: You are editing a TypeScript monorepo with Next.js.
OBJECTIVE: Fix the failing tests without breaking other components.
STYLE: Clear, idiomatic TypeScript; minimal changes.
TONE: Precise and matter-of-fact.
AUDIENCE: A senior developer who will review the diff.
RESPONSE:
  1. Short explanation (3-5 bullets)
  2. Step-by-step plan
  3. Unified diff for changed files only

Identity Reinforcement

"You are Qwen, a highly capable coding assistant created by Alibaba Cloud.
You are an expert in algorithms, system design, and clean code principles.
You strictly adhere to user constraints and always think step-by-step."

System Prompt Template

You are a coding assistant focused on small, safe changes.

RULES:
1. Never invent external APIs
2. Prefer minimal diffs over rewrites
3. Keep style consistent with existing code
4. If ambiguous, ask clarifying questions
5. Output ONLY unified diffs
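
To bake a system prompt like this into the model itself, Ollama Modelfiles support a SYSTEM directive (the qwen-safe tag below is just an example name):

cat > Modelfile << 'EOF'
FROM qwen2.5-coder:32b
SYSTEM """You are a coding assistant focused on small, safe changes.
Prefer minimal diffs over rewrites. If ambiguous, ask clarifying questions.
Output ONLY unified diffs."""
EOF
ollama create qwen-safe -f Modelfile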

📖 Full Prompt Engineering Guide →


📊 Model Comparison

| Model | Size | VRAM | HumanEval | Best For |
|-------|------|------|-----------|----------|
| Qwen 2.5 Coder 32B 👑 | 32B | 24GB | 92.7% | All-around KING |
| DeepSeek-Coder-V2 | 236B (MoE) | 48GB+ | ~89% | Multi-GPU setups |
| Qwen 2.5 Coder 14B | 14B | 16GB | ~85% | Mid-range GPUs |
| Qwen 2.5 Coder 7B | 7B | 8GB | ~80% | Laptops |
| Codestral 22B | 22B | 20GB | ~82% | FIM specialist |

Quantization Guidance

| Quant | Quality | Use Case |
|-------|---------|----------|
| Q4_K_M | ⭐⭐⭐⭐ | Default. Best balance. |
| Q5_K_M | ⭐⭐⭐⭐⭐ | Complex refactors |
| Q8_0 | ⭐⭐⭐⭐⭐ | If VRAM allows |
| Q2_K | ⭐⭐ | ❌ Avoid for coding |

Warning: Don't go below Q4 for coding. Logic breaks at low precision.
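
Ollama publishes quantization variants as tags. The tag below is illustrative; check the model's Tags page on ollama.com for the exact names:

# Pull a specific quantization instead of the default
ollama pull qwen2.5-coder:32b-instruct-q8_0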


💻 Hardware Requirements

The Speed Formula

Speed (t/s) ≈ Memory Bandwidth (GB/s) / Model Size (GB)

| Hardware | Bandwidth | 32B Q4 Speed |
|----------|-----------|--------------|
| RTX 4090 (24GB) | 1008 GB/s | ~56 t/s |
| RTX 3090 (24GB) | 936 GB/s | ~52 t/s |
| M3 Max (96GB) | 400 GB/s | ~22 t/s |
| RTX 4060 Ti (16GB) | 288 GB/s | N/A (won't fit) |

Recommendations

| Persona | Hardware | Best Model | Speed |
|---------|----------|------------|-------|
| Budget Learner | RTX 3060 12GB | Qwen 7B | ~40 t/s |
| Pro Developer | RTX 4090 24GB | Qwen 32B | ~56 t/s |
| AI Architect | Mac Studio 128GB | Llama 70B | ~22 t/s |
| Home Lab | Dual RTX 3090 | Llama 70B Q5 | ~35 t/s |

🔧 IDE Integration

Continue.dev (Recommended)

{
  "models": [{
    "title": "Qwen 32B",
    "provider": "ollama",
    "model": "qwen2.5-coder:32b"
  }],
  "tabAutocompleteModel": {
    "title": "StarCoder2 3B",
    "provider": "ollama",
    "model": "starcoder2:3b"
  }
}

Cursor (Local Mode)

Settings → Models → OpenAI API Base URL
→ http://localhost:11434/v1
API Key: ollama
Model: qwen2.5-coder:32b

Aider (Terminal)

pip install aider-chat
export OLLAMA_API_BASE=http://localhost:11434
aider --model ollama/qwen2.5-coder:32b

🖥️ Alternative Tools

| Tool | Best For |
|------|----------|
| LM Studio | Visual exploration, model comparison |
| Tabby | Self-hosted autocomplete (<100ms) |
| LocalAI | Kubernetes/DevOps, multi-model |
| vLLM | Team servers, CI/CD pipelines |

📖 Full Alternative Tools Guide →


🔄 Real-World Workflows

Workflow 1: Debug React Component

1. Open failing file + test in VS Code
2. Continue Agent mode
3. Prompt: "Avatar doesn't update after profile change..."
4. Let agent read, test, fix, iterate
5. Review diffs and commit

Workflow 2: Add API Endpoint (TDD)

1. Write failing test first
2. Aider: "Implement /api/users/{id} to pass the test"
3. Agent implements, runs tests, iterates
4. Review and commit

Workflow 3: Refactor Legacy Code

1. Plan mode: "Create characterization tests"
2. Agent mode: "Refactor to Python 3.12"
3. Verify all tests pass
4. Review and commit

📖 Full Workflows Guide →


⚠️ Gotchas

Top 5 Mistakes

| Mistake | Fix |
|---------|-----|
| Expecting GPT-4 from 7B | Use 32B for complex tasks |
| Dumping entire repo | Limit to relevant files |
| Using Q2 quantization | Stay ≥Q4 for coding |
| Long sessions | Clear context regularly |
| No tests | Always have verification |

Context Window Exhaustion

Symptoms:
- Model repeats itself
- Ignores instructions
- Quality drops suddenly

Fix:
- /clear or restart session
- Use RAG instead of stuffing
- Summarize before continuing

📖 Full Gotchas Guide →


⚡ Optimization Guide

Keep Model in Memory

export OLLAMA_KEEP_ALIVE=-1  # Never unload

Increase Context Window

cat << 'EOF' > Modelfile
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
EOF
ollama create qwen32k -f Modelfile
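
Then run the new tag and confirm the parameter took effect (ollama show prints the model's parameters):

ollama run qwen32k
ollama show qwen32k   # should list num_ctx 32768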

💰 Cost Analysis

| Factor | Cloud (GPT-4o) | Local (RTX 4090) |
|--------|----------------|------------------|
| Monthly Cost | $200-500 | $0 |
| Hardware | $0 | ~$1,800 one-time |
| Break-even | - | 4-9 months |
| Privacy | ❌ | ✅ |
| Offline | ❌ | ✅ |
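
The break-even row is just the hardware cost divided by the monthly cloud bill:

$1,800 / $500 per month ≈ 4 months (heavy user)
$1,800 / $200 per month = 9 months (light user)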

Insight: If you already have a gaming PC, local AI is essentially free.


📈 Star History


📚 Resources

| Resource | Link |
|----------|------|
| 📖 Ollama Docs | docs.ollama.com |
| 🔧 Continue.dev | docs.continue.dev |
| 🤖 Aider | aider.chat |
| 🦙 r/LocalLLaMA | reddit.com/r/LocalLLaMA |
| 🏷️ Qwen2.5-Coder | Hugging Face |

🤝 Contributing


We welcome contributions! Help us keep this guide updated.

| Type | Examples |
|------|----------|
| 🆕 Tips | Workflows, shortcuts, hidden features |
| 🐛 Bug Reports | New issues, workarounds |
| 📊 Benchmarks | Model comparisons, speed tests |
| 🔧 Configs | Modelfiles, Continue configs |

  1. Fork this repo
  2. Add your changes
  3. Submit a PR

πŸ’ Support


⭐ Star this repo if it helped you!

Made with ❤️ by Murat Aslan


Last updated: January 2026
