Optimized llama.cpp server with Vulkan GPU acceleration for AMD Strix Halo (Ryzen AI Max+ 395) integrated GPUs.
- 1M token context window - Full model capacity
- Vulkan GPU acceleration - Works with AMD Ryzen AI Max+ 395 and similar iGPUs
- q8_0 KV cache quantization - Balance of quality and memory efficiency
- OpenAI-compatible API - Drop-in replacement for OpenAI endpoints
Download the model (choose one):
```bash
# Q8 quantization (~34GB) - Higher quality, recommended
./download-model.sh

# Q4 quantization (~18GB) - Smaller, faster, less RAM
./download-model-q4.sh
```
Build and run:
```bash
docker compose up -d
```
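To confirm the container started and the model loaded, the standard Compose commands are enough (a minimal sketch; no project-specific service names are assumed):

```bash
# Show the state of the services in this compose project
docker compose ps

# Follow the server logs; model loading progress and the listening port
# appear here on startup
docker compose logs -f
```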
Test:
```bash
curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "Hello"}]}'
```
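Since the endpoint follows the OpenAI chat completions schema, streaming should also work by adding "stream": true; a minimal sketch:

```bash
# Stream tokens as server-sent events instead of waiting for the full reply
curl -N http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
    "stream": true
  }'
```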
Two quantization levels are available from Hugging Face:
| Model | Script | Size | RAM Required | Quality |
|---|---|---|---|---|
| Q8_K_XL | ./download-model.sh | ~34GB | ~85GB | Higher |
| Q4_K_XL | ./download-model-q4.sh | ~18GB | ~50GB | Good |
To switch models, update MODEL_PATH in docker-compose.yml:
```yaml
- MODEL_PATH=/models/Qwen3-Coder-30B-A3B-Instruct-1M-UD-Q4_K_XL.gguf
```

Edit docker-compose.yml to customize:
| Variable | Default | Description |
|---|---|---|
| MODEL_PATH | - | Path to GGUF model file |
| CTX_SIZE | 1048576 | Context window (tokens) |
| PARALLEL | 1 | Concurrent request slots |
| GPU_LAYERS | 999 | Layers on GPU (999=all) |
| KV_CACHE_TYPE | q8_0 | KV cache quantization |
| FLASH_ATTENTION | true | Enable flash attention |
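Edits to these variables take effect only after the container is recreated; for example:

```bash
# Re-run compose; it recreates the container when its configuration changed
docker compose up -d

# Or force a recreate explicitly
docker compose up -d --force-recreate
```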
| Type | Memory | Quality | Use Case |
|---|---|---|---|
| q4_0 | Lowest | Good | Maximum context |
| q8_0 | Medium | Better | Recommended |
| f16 | Highest | Best | Short contexts |
| Model | Context | KV Cache | Prompt Speed | Generation |
|---|---|---|---|---|
| Q8_K_XL | 1M tokens | 52 GB (q8_0) | ~400-480 tok/s | ~35-40 tok/s |
| Q4_K_XL | 1M tokens | 27 GB (q4_0) | ~450-500 tok/s | ~38-42 tok/s |
Note: The first request after startup is slower (~120 tok/s) due to Vulkan shader compilation. Subsequent requests achieve full speed.
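One way to absorb that one-time shader compilation cost is a tiny warm-up request right after startup; a minimal sketch:

```bash
# Send a one-token request so Vulkan shaders are compiled before real traffic
curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "ok"}], "max_tokens": 1}' \
  > /dev/null
```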
| Model | Context | KV Cache | RAM | Prompt Speed | Generation |
|---|---|---|---|---|---|
| Q4_K_XL | 512K tokens | q4_0 | 64 GB | ~30 tok/s | ~31 tok/s |
Note: The Radeon 780M iGPU has less memory bandwidth than the Radeon 8060S. The 1M token context exceeds available GPU memory; 512K is the maximum tested configuration on this system.
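To see how much memory the iGPU can actually address on a given host, the amdgpu driver exposes reported VRAM and GTT totals through sysfs (a sketch; the paths assume the amdgpu kernel driver and the card index can differ):

```bash
# Print reported VRAM and GTT totals (in GiB) for each amdgpu device
for f in /sys/class/drm/card*/device/mem_info_vram_total \
         /sys/class/drm/card*/device/mem_info_gtt_total; do
  [ -r "$f" ] && printf '%s: %d GiB\n' "$f" "$(( $(cat "$f") / 1073741824 ))"
done
```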
| Model | Context | KV Cache | RAM | Prompt Speed | Generation |
|---|---|---|---|---|---|
| Q4_K_XL | 256K tokens | q4_0 | 64 GB | ~74 tok/s | ~13 tok/s |
Note: The Radeon Vega iGPU (Zen 3) with DDR4-3600 has less bandwidth than newer systems. The 512K context exceeds available GPU memory; 256K is the maximum tested configuration.
| System | GPU | RAM | Max Context | Prompt | Generation |
|---|---|---|---|---|---|
| Ryzen AI Max+ 395 | Radeon 8060S | 128GB | 1M | ~450 tok/s | ~40 tok/s |
| Ryzen 9 7940HS | Radeon 780M | 64GB DDR5 | 512K | ~30 tok/s | ~31 tok/s |
| Ryzen 7 5700G | Radeon Vega | 64GB DDR4 | 256K | ~74 tok/s | ~13 tok/s |
- Docker with GPU support
- AMD GPU with Vulkan support (Radeon 8060S or similar)
- Sufficient RAM for model + KV cache:
  - Q8 model with q8_0 KV cache: ~85GB
  - Q4 model with q4_0 KV cache: ~50GB
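A quick pre-flight check of these requirements (assuming `vulkan-tools` is installed for `vulkaninfo`):

```bash
# Total and available system memory
free -h

# Confirm a Vulkan-capable GPU is visible to the host
vulkaninfo --summary | grep -i deviceName
```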
- GET /health - Health check
- GET /v1/models - List models
- POST /v1/chat/completions - Chat completions (OpenAI-compatible)
- POST /v1/completions - Text completions
- GET /metrics - Prometheus metrics
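A quick smoke test of the non-chat endpoints listed above:

```bash
# Server health
curl -s http://localhost:8091/health

# Models the server exposes
curl -s http://localhost:8091/v1/models

# Prometheus metrics
curl -s http://localhost:8091/metrics | head
```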
StrixHalo/
├── docker-compose.yml # Container configuration
├── Dockerfile # Build instructions (Vulkan + llama.cpp)
├── entrypoint.sh # Server startup script
├── download-model.sh # Download Q8 model (~34GB)
├── download-model-q4.sh # Download Q4 model (~18GB)
├── models/ # Model storage directory
└── README.md # This file
Check that /dev/dri is accessible:
```bash
ls -la /dev/dri/
```

Verify group IDs in docker-compose.yml match your system:
```bash
getent group video render
```

If the server runs out of memory, reduce the context size or switch to the q4_0 KV cache:
```yaml
- CTX_SIZE=524288
- KV_CACHE_TYPE=q4_0
```

If the first request is slow, this is normal: Vulkan shaders compile on first use, and subsequent requests will be fast.
Qwen Code is an open-source AI coding agent for the terminal (similar to Claude Code). It can connect to StrixHalo as its backend.
```bash
# NPM (requires Node.js 20+)
npm install -g @qwen-code/qwen-code@latest

# or Homebrew
brew install qwen-code
```

Set environment variables to point to your local server:
```bash
export OPENAI_API_BASE=http://localhost:8091/v1
export OPENAI_API_KEY=not-needed
export OPENAI_MODEL=qwen3-coder
```

Or create a config file:
```bash
# Run qwen-code and configure via settings
qwen
# Then use /settings to configure the API endpoint
```

```bash
# Interactive mode in your project directory
qwen

# Headless mode for scripts/CI
qwen -p "explain this codebase"

# Single question
qwen -p "write a function to parse JSON"
```

- Agentic coding: Understands codebases, writes code, runs commands
- Plan mode: Complex multi-step task planning
- IDE integration: VS Code and Zed extensions available
- Open source: Both the CLI and model are fully open source
For more information: https://github.com/QwenLM/qwen-code
- Qwen3-Coder - The model powering this server
- Qwen Code CLI - Terminal AI agent
- llama.cpp - Inference engine
- Unsloth GGUF - Quantized models
MIT