Enable Ollama-compatible and OpenAI-compatible local LLM inference using Intel NPU (Neural Processing Unit) hardware via OpenVINO GenAI. The solution provides a FastAPI-based proxy that bridges modern AI client interfaces with Intel's NPU, GPU, and CPU devices.
Primary Use Case: Allow WSL2 Linux applications (like Claude Code) to access Windows-hosted Intel NPU hardware for efficient on-device AI inference.
2. Environment Context
- **OS**: Windows 11 (23H2+) with NPU drivers, or Linux (native NPU driver support)
- **CPU**: Intel Core Ultra (Meteor Lake, Lunar Lake, Arrow Lake) with integrated NPU
- **NPU**: Intel AI Boost (verified working)
- **GPU**: Intel Arc Graphics (optional, used as fallback)
- **Runtime**: Python 3.12+, OpenVINO GenAI 2025.x
- **Framework**: FastAPI + Uvicorn (async HTTP server)
3. Problem Statement
The Challenge
- **WSL2 NPU Gap**: Intel NPU hardware is not accessible from WSL2 Linux distributions
  - GPU-PV (paravirtualization) exists for GPUs but not for NPUs
  - Microsoft's dxgkrnl lacks NPU device passthrough
  - Intel has acknowledged the request (GitHub #56) but has provided no timeline after 14+ months
- **Client Compatibility**: Modern AI tools (Claude Code, Continue, Cursor) expect Ollama or OpenAI APIs
  - None of these clients integrate with OpenVINO directly
  - An API translation layer is needed
- **NPU Complexity**: The Intel NPU has constraints not present on GPU/CPU
  - NPU memory is constrained (2-4 GB depending on the model)
  - Static KV-cache shapes must be compiled at model load time
  - Longer contexts require more memory for attention computation
Memory Constraints
| Constraint | Value | Notes |
|---|---|---|
| NPU memory | 2-4 GB | Shared with system memory |
| Concurrent models | 1 | Only ONE LLM model at a time |
| Model swap | Not supported | Must restart the server to change models |
Model Compilation Time
| Phase | Time | Notes |
|---|---|---|
| Cold start (first load) | 80-130 seconds | Model compilation to NPU kernels |
| Warm start (cached) | 5-8 seconds | Uses the cached compiled model |
| Inference latency | 1-4 seconds | After the model is loaded |
**Mitigation**: NPU warmup on startup (`engine.warmup(warmup_tokens=16)`)
NPU vs GPU vs CPU Comparison
| Characteristic | NPU | GPU | CPU |
|---|---|---|---|
| Cold start | 80-130s | 20-30s | 5-10s |
| Inference speed | Moderate | Fast | Slow |
| Power efficiency | Excellent | Moderate | Poor |
| Memory limit | 2-4GB | 4-8GB+ | System RAM |
| Concurrent models | 1 | 1-2 | Multiple |
| Dynamic shapes | ❌ | ✅ | ✅ |
| Long context | Limited | ✅ | ✅ |
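Because the NPU cannot handle long contexts, short requests go to the NPU and longer ones fall back to another device. A sketch of that token-threshold routing, mirroring the `NPU_PROXY_TOKEN_LIMIT` setting (this is an illustration, not the actual npu-proxy implementation):

```python
import os

# Illustrative token-threshold router: prompts under the NPU token limit
# run on the NPU; longer contexts fall back to GPU/CPU. The reason string
# mirrors what the X-NPU-Proxy-Route-Reason header might carry.
TOKEN_LIMIT = int(os.environ.get("NPU_PROXY_TOKEN_LIMIT", "1800"))

def route_device(token_count: int, fallback: str = "GPU") -> tuple[str, str]:
    """Return (device, reason) for a request of the given token count."""
    if token_count <= TOKEN_LIMIT:
        return "NPU", f"token_count {token_count} <= limit {TOKEN_LIMIT}"
    return fallback, f"token_count {token_count} > limit {TOKEN_LIMIT}"
```

For example, a 500-token request routes to the NPU while a 2500-token request falls back to the GPU (or CPU, if configured as the fallback device).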
7. API Reference
OpenAI-Compatible Endpoints
| Method | Path | Description | Streaming |
|---|---|---|---|
| POST | `/v1/chat/completions` | Chat completion | ✅ SSE |
| POST | `/v1/embeddings` | Generate embeddings | ❌ |
| GET | `/v1/models` | List available models | ❌ |
Ollama-Compatible Endpoints
| Method | Path | Description | Streaming |
|---|---|---|---|
| POST | `/api/generate` | Raw text generation | ✅ SSE |
| POST | `/api/chat` | Chat completion | ✅ SSE |
| POST | `/api/embed` | Batch embeddings | ❌ |
| POST | `/api/embeddings` | Single embedding (legacy) | ❌ |
| GET | `/api/ps` | List running models | ❌ |
| POST | `/api/show` | Show model details | ❌ |
| POST | `/api/pull` | Download model | ✅ Progress |
| GET | `/api/tags` | List models | ❌ |
| GET | `/api/version` | Version info | ❌ |
System Endpoints
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Health check with NPU status |
| GET | `/health/devices` | Detailed device information |
| GET | `/metrics` | Prometheus metrics |
Response Headers
| Header | Description |
|---|---|
| `X-Request-ID` | Unique request identifier (`req_<24-char-hex>`) |
| `X-NPU-Proxy-Device` | Device used (NPU/GPU/CPU) |
| `X-NPU-Proxy-Route-Reason` | Why the device was selected |
| `X-NPU-Proxy-Token-Count` | Token count used for the routing decision |
8. Configuration Reference
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `NPU_PROXY_HOST` | `0.0.0.0` | Server bind address |
| `NPU_PROXY_PORT` | `11435` | Server port (Ollama's default is 11434) |
| `NPU_PROXY_DEVICE` | `NPU` | Preferred device (`NPU`, `GPU`, `CPU`) |
| `NPU_PROXY_FALLBACK_DEVICE` | (auto) | Override fallback device selection |
| `NPU_PROXY_REAL_INFERENCE` | `0` | Enable real inference (`1`) or mock responses (`0`) |
| `NPU_PROXY_MODEL_PATH` | `~/.cache/npu-proxy/models` | Model cache directory |
| `NPU_PROXY_INFERENCE_TIMEOUT` | `180` | Inference timeout in seconds |
| `NPU_PROXY_MAX_PROMPT_LEN` | `4096` | Maximum prompt length for NPU |
| `NPU_PROXY_TOKEN_LIMIT` | `1800` | Token threshold for NPU routing |
| `NPU_PROXY_EMBEDDING_MODEL` | `BAAI/bge-small-en-v1.5` | Default embedding model |
| `NPU_PROXY_EMBEDDING_DEVICE` | `CPU` | Device for embeddings |
| `NPU_PROXY_EMBEDDING_CACHE_SIZE` | `1000` | LRU cache size for embeddings |
| `NPU_PROXY_LOG_LEVEL` | `INFO` | Logging verbosity |
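The embedding LRU cache controlled by `NPU_PROXY_EMBEDDING_CACHE_SIZE` can be sketched with `functools.lru_cache`; the embedding function below is a stand-in for the real backend, not the npu-proxy implementation:

```python
from functools import lru_cache

# Sketch of the embedding LRU cache implied by NPU_PROXY_EMBEDDING_CACHE_SIZE:
# repeated texts skip the model call entirely. CALLS counts backend invocations
# so the caching effect is observable; embed() is a stand-in for a real
# BGE-Small forward pass (which would return 384 dimensions).
CALLS = {"count": 0}

@lru_cache(maxsize=1000)
def embed(text: str) -> tuple[float, ...]:
    CALLS["count"] += 1
    return tuple(float(ord(c)) for c in text[:4])
```

Embedding the same text twice performs only one backend call, which is why repeated queries in the benchmarks below can come back in tens of milliseconds.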
Example Configuration
```powershell
# Windows - Production configuration
$env:NPU_PROXY_REAL_INFERENCE = "1"
$env:NPU_PROXY_DEVICE = "NPU"
$env:NPU_PROXY_INFERENCE_TIMEOUT = "300"

# Start server
npu-proxy --host 0.0.0.0 --port 11435
```

```bash
# WSL2 - Client configuration
WINDOWS_HOST=$(ip route show | grep default | awk '{print $3}')
export OLLAMA_HOST="http://${WINDOWS_HOST}:11435"

# Use with Claude Code or other Ollama clients
claude --chat
```
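If the gateway lookup needs to happen from Python rather than the shell, the same parse of `ip route show` output can be sketched as follows (an illustrative helper, not part of npu-proxy):

```python
from typing import Optional

# Parse the WSL2 default gateway (the Windows host) out of `ip route show`
# output; this mirrors the grep/awk pipeline in the shell snippet above.
def windows_host_from_routes(route_output: str) -> Optional[str]:
    for line in route_output.splitlines():
        fields = line.split()
        if fields[:1] == ["default"] and "via" in fields:
            return fields[fields.index("via") + 1]
    return None
```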
9. Deployment
Critical Constraint: Host-Only Deployment
⚠️ NPU Proxy MUST run as a native host service.
Intel NPU drivers cannot be containerized (no Docker, no Kubernetes).
WSL2 workloads connect via HTTP bridge to Windows host.
Windows Deployment
```powershell
# Install from source
pip install -e .

# Run as a Windows Service (planned):
# winget install npu-proxy

# Start server
npu-proxy --host 0.0.0.0 --port 11435
```
Linux Deployment (systemd)
```bash
# Install
sudo ./scripts/install_linux.sh

# Enable and start
sudo systemctl enable npu-proxy
sudo systemctl start npu-proxy

# Check status
sudo systemctl status npu-proxy
```
10. Performance Benchmarks
Test System: Intel Core Ultra 7 155H (Meteor Lake), 32GB RAM, Windows 11 23H2
Inference Latency (TinyLlama 1.1B INT4)
| Device | Avg Latency | Tokens/sec |
|---|---|---|
| NPU | 4.03s | ~5 tok/s |
| GPU | 2.25s | ~9 tok/s |
| CPU | 8.5s | ~2.4 tok/s |
Cold Start Performance
| Model | Device | Load Time |
|---|---|---|
| TinyLlama 1.1B INT4 | NPU | 8.12s (cached) |
| TinyLlama 1.1B INT4 | GPU | 21.96s |
| TinyLlama 1.1B INT4 | CPU | 5.2s |
Embedding Performance (BGE-Small)
| Device | Single Query | Batch (3 docs) |
|---|---|---|
| CPU | ~28ms | ~25ms |
| NPU | ~35ms | ~30ms |
11. Model Compatibility Matrix
| Model | Type | NPU | GPU | CPU | Notes |
|---|---|---|---|---|---|
| TinyLlama 1.1B INT4 | LLM | ✅ | ✅ | ✅ | Recommended for NPU |
| Phi-2 2.7B INT4 | LLM | ✅ | ✅ | ✅ | Good balance |
| Mistral 7B INT4 | LLM | ⚠️ | ✅ | ✅ | May exceed NPU memory |
| LLaMA-2 7B INT4 | LLM | ⚠️ | ✅ | ✅ | May exceed NPU memory |
| Granite 4 Micro | LLM | ✅ | ✅ | ✅ | 1B FP32 model |
| BGE-Small | Embedding | ✅ | ✅ | ✅ | 384 dimensions |
| BGE-Base | Embedding | ✅ | ✅ | ✅ | 768 dimensions |
| All-MiniLM-L6-v2 | Embedding | ✅ | ✅ | ✅ | Lightweight |
Legend: ✅ Supported | ⚠️ May work with limitations | ❌ Not recommended
12. Prometheus Metrics
```text
# Counter: Total requests by endpoint and status
npu_proxy_requests_total{endpoint="/v1/chat/completions", status="200"}

# Histogram: Inference latency
npu_proxy_inference_duration_seconds{model="tinyllama", device="NPU"}

# Histogram: Time to first token (critical SLO)
npu_proxy_time_to_first_token_seconds{model="tinyllama"}

# Histogram: Inter-token latency
npu_proxy_inter_token_latency_seconds{model="tinyllama"}

# Gauge: Tokens per second (real-time throughput)
npu_proxy_tokens_per_second{model="tinyllama"}

# Gauge: Currently loaded models
npu_proxy_loaded_models{model="tinyllama"}

# Counter: Tokens generated
npu_proxy_tokens_generated_total{model="tinyllama"}
```
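Each line scraped from `/metrics` follows the Prometheus text exposition format: metric name, a label set in braces, and the sample value. A minimal formatter, for illustration only (a real server would typically use a client library such as `prometheus_client` instead):

```python
# Minimal sketch of the Prometheus text exposition format served by /metrics:
# one sample per line as name{label="value",...} value.
def format_sample(name: str, labels: dict, value: float) -> str:
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{name}{{{label_str}}} {value}"
```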