fix(gpu): try all available backends before falling back to CPU#240
Open
Conversation
Previously, when the first preferred GPU backend (e.g. CUDA) was detected
as available but failed to initialize, QMD fell straight to CPU — skipping
other viable backends like Vulkan entirely.
This is a common scenario on machines where CUDA is installed but the
prebuilt binary is incompatible (e.g. older Pascal GPUs, mismatched toolkit
versions), while Vulkan is fully functional.
Changes:
- Try all detected GPU backends in priority order before falling back to CPU:
cuda -> metal -> vulkan
This ensures CUDA/Metal users always get the best available backend,
while Vulkan serves as a universal fallback.
- Only fall back to CPU after every GPU backend has been attempted (sketched after this list).
- Emit clear, concise warnings listing which backends were tried and why
each one failed — much easier to diagnose than a silent CPU fallback.
- Respect QMD_FORCE_CPU / FORCE_CPU env vars to skip GPU selection entirely.
- Remove @ts-expect-error hack; proper LlamaGpuType casting throughout.
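For context, a minimal sketch of the selection flow described above, assuming node-llama-cpp's getLlama() entry point and that it throws when an explicitly requested backend fails to initialize; the helper name, the exact env-var check, and the warning wording are illustrative, not the actual ensureLlama() implementation:

```ts
import { getLlama, type Llama, type LlamaGpuType } from "node-llama-cpp";

// Illustrative sketch only; the real ensureLlama() in src/llm.ts differs in detail.
async function initLlama(candidates: LlamaGpuType[]): Promise<Llama> {
  // QMD_FORCE_CPU / FORCE_CPU skip GPU selection entirely (exact truthiness check is an assumption).
  const forceCpu = Boolean(process.env.QMD_FORCE_CPU || process.env.FORCE_CPU);
  const failures: string[] = [];

  if (!forceCpu) {
    for (const gpu of candidates) {
      try {
        // First backend that initializes wins: cuda -> metal -> vulkan.
        return await getLlama({ gpu });
      } catch (err) {
        failures.push(`${gpu}: ${err instanceof Error ? err.message : String(err)}`);
      }
    }
    if (failures.length > 0) {
      process.stderr.write(
        `QMD Warning: GPU backends failed (${failures.join("; ")}); falling back to CPU.\n`
      );
    }
  }

  // Reached only when CPU was forced or every detected GPU backend failed.
  return await getLlama({ gpu: false });
}
```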
Tested on: GTX 1050 Ti (compute_61) + CUDA 13.1 + Vulkan, Node 25, Windows 11
Before: CPU fallback (GPU: none)
After: GPU: vulkan (offloading: yes)
Pull request overview
This PR improves GPU backend selection by attempting all available GPU backends (CUDA, Metal, Vulkan) in priority order before falling back to CPU, instead of trying only one backend and immediately falling back to CPU on failure.
Changes:
- Implements a fallback loop that tries all detected GPU backends in order (CUDA → Metal → Vulkan) before falling back to CPU
- Adds support for QMD_FORCE_CPU and FORCE_CPU environment variables to explicitly disable GPU
- Improves error reporting with detailed messages about which backends were tried and why each failed
Comments suppressed due to low confidence (1)
src/llm.ts:572
- When forceCpu is true, two warning messages will be printed: first "GPU disabled via QMD_FORCE_CPU/FORCE_CPU. Running on CPU." (line 561), then "no GPU acceleration, running on CPU (slow). Run 'qmd status' for details." (lines 569-571). This creates redundant output for users who explicitly requested CPU mode.
To fix, add && !forceCpu to the condition on line 568 to prevent the second message when CPU mode is explicitly requested.
```ts
if (!llama.gpu) {
  process.stderr.write(
    "QMD Warning: no GPU acceleration, running on CPU (slow). Run 'qmd status' for details.\n"
  );
}
```
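Applied to the snippet above, the suggested condition would look roughly like this (assuming forceCpu is the boolean derived from QMD_FORCE_CPU / FORCE_CPU, as described in the comment):

```ts
// Skip the generic warning when the user explicitly asked for CPU mode, since a
// dedicated "GPU disabled via QMD_FORCE_CPU/FORCE_CPU" message was already printed.
if (!llama.gpu && !forceCpu) {
  process.stderr.write(
    "QMD Warning: no GPU acceleration, running on CPU (slow). Run 'qmd status' for details.\n"
  );
}
```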
Problem
When the first preferred GPU backend (CUDA) is detected as available but fails to initialize, QMD falls straight to CPU, skipping other viable backends like Vulkan entirely.
This is a common scenario on machines where CUDA is installed but the prebuilt binary is incompatible (e.g. older Pascal GPUs, mismatched toolkit versions) while Vulkan is fully functional.
Solution
Instead of trying one backend and giving up, iterate over all detected GPU backends in priority order before falling back to CPU:
cuda -> metal -> vulkan
This ensures CUDA/Metal users always get the best available backend, while Vulkan serves as a universal fallback.
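A minimal sketch of that priority ordering, with illustrative names rather than QMD's actual code:

```ts
// Fixed preference order; only backends actually detected on the machine are kept.
const GPU_PRIORITY = ["cuda", "metal", "vulkan"] as const;
type GpuBackend = (typeof GPU_PRIORITY)[number];

function orderedCandidates(available: Set<GpuBackend>): GpuBackend[] {
  return GPU_PRIORITY.filter((b) => available.has(b));
}
```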
Changes
- src/llm.ts: Rework ensureLlama() to loop through all available backends
- Respect QMD_FORCE_CPU / FORCE_CPU env vars to skip GPU entirely
- Proper LlamaGpuType casting (removes the @ts-expect-error hack), illustrated below
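As a hypothetical illustration of the LlamaGpuType bullet (the real call site in src/llm.ts may differ):

```ts
import { getLlama, type LlamaGpuType } from "node-llama-cpp";

// Instead of silencing the compiler with @ts-expect-error, the candidate
// backend name is cast to the library's own LlamaGpuType.
async function tryBackend(backend: string) {
  return await getLlama({ gpu: backend as LlamaGpuType });
}
```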
Tested on
GTX 1050 Ti (compute_61) + CUDA 13.1 + Vulkan, Node 25, Windows 11
Before: GPU: none (running on CPU)
After: GPU: vulkan (offloading: yes)