fix(gpu): try all available backends before falling back to CPU #240

Open

haddadrm wants to merge 1 commit into tobi:main from haddadrm:main

Conversation

@haddadrm

Problem

When the first preferred GPU backend (CUDA) was detected as available but failed to initialize, QMD fell straight back to CPU, skipping other viable backends such as Vulkan entirely.

This is a common scenario on machines where CUDA is installed but the prebuilt binary is incompatible:

  • Older Pascal-generation NVIDIA GPUs (GTX 10xx, compute_61) with newer CUDA toolkits
  • CUDA toolkit version mismatches
  • Systems where Vulkan works fine but CUDA does not

Solution

Instead of trying one backend and giving up, iterate through all detected GPU backends in priority order before falling back to CPU (sketched below):

cuda → metal → vulkan

This ensures:

  • Users with working CUDA/Metal always get the best backend (no regression)
  • Users whose CUDA fails fall through gracefully to Vulkan instead of dropping to CPU
  • CPU is used only after every GPU backend has been attempted and failed
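
A minimal sketch of this selection loop, assuming node-llama-cpp's getLlama({ gpu }) option throws when the requested backend cannot initialize, and that LlamaGpuType is the exported backend-name type the PR's cast refers to; the helper name tryGpuBackends and the hard-coded priority list are illustrative, not the exact code in src/llm.ts:

    import { getLlama, type Llama, type LlamaGpuType } from "node-llama-cpp";

    const GPU_PRIORITY: LlamaGpuType[] = ["cuda", "metal", "vulkan"];

    // Try each backend in priority order; return the first one that
    // initializes, plus the reasons any earlier candidates failed.
    async function tryGpuBackends(): Promise<{ llama: Llama | null; failures: string[] }> {
      const failures: string[] = [];
      for (const gpu of GPU_PRIORITY) {
        try {
          // First backend that initializes wins; CUDA/Metal users see no change.
          return { llama: await getLlama({ gpu }), failures };
        } catch (err) {
          failures.push(`${gpu}: ${err instanceof Error ? err.message : String(err)}`);
        }
      }
      return { llama: null, failures };
    }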

Changes

  • src/llm.ts: Rework ensureLlama() to loop through all available backends (see the sketch after this list)
  • Clear warnings listing which backends were tried and why each failed
  • Respect QMD_FORCE_CPU / FORCE_CPU env vars to skip GPU entirely
  • Proper LlamaGpuType casting (removes @ts-expect-error hack)
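
Building on the tryGpuBackends sketch above, the env-var gate, the aggregated warning, and the final CPU fallback could fit together roughly as follows. The truthiness check on QMD_FORCE_CPU / FORCE_CPU and any warning wording beyond what the Copilot review quotes are assumptions, and ensureLlamaSketch stands in for the real ensureLlama():

    // Hypothetical wrapper around the loop sketched above.
    async function ensureLlamaSketch(): Promise<Llama> {
      if (process.env.QMD_FORCE_CPU || process.env.FORCE_CPU) {
        // Explicit opt-out: skip GPU selection entirely.
        process.stderr.write("QMD Warning: GPU disabled via QMD_FORCE_CPU/FORCE_CPU. Running on CPU.\n");
        return getLlama({ gpu: false });
      }

      const { llama, failures } = await tryGpuBackends();
      if (llama) return llama;

      // Every detected GPU backend was attempted and failed; say which ones
      // and why before falling back to CPU.
      process.stderr.write(`QMD Warning: all GPU backends failed (${failures.join("; ")}). Running on CPU.\n`);
      return getLlama({ gpu: false });
    }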

Tested on

  • GTX 1050 Ti (compute_61) + CUDA 13.1 + Vulkan, Node 25.2.1, Windows 11
  • Before: GPU: none (running on CPU)
  • After: GPU: vulkan (offloading: yes)

Copilot AI review requested due to automatic review settings February 21, 2026 14:35

Copilot AI left a comment

Pull request overview

This PR improves GPU backend selection by attempting all available GPU backends (CUDA, Metal, Vulkan) in priority order before falling back to CPU, instead of trying only one backend and immediately falling back to CPU on failure.

Changes:

  • Implements a fallback loop that tries all detected GPU backends in order (CUDA → Metal → Vulkan) before falling back to CPU
  • Adds support for QMD_FORCE_CPU and FORCE_CPU environment variables to explicitly disable GPU
  • Improves error reporting with detailed messages about which backends were tried and why each failed
Comments suppressed due to low confidence (1)

src/llm.ts:572

  • When forceCpu is true, two warning messages will be printed: first "GPU disabled via QMD_FORCE_CPU/FORCE_CPU. Running on CPU." (line 561), then "no GPU acceleration, running on CPU (slow). Run 'qmd status' for details." (lines 569-571). This creates redundant output for users who explicitly requested CPU mode.

To fix, add && !forceCpu to the condition on line 568 to prevent the second message when CPU mode is explicitly requested.

      if (!llama.gpu) {
        process.stderr.write(
          "QMD Warning: no GPU acceleration, running on CPU (slow). Run 'qmd status' for details.\n"
        );
      }
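
Applied to the quoted snippet, the suggested guard would look roughly like this, assuming a forceCpu boolean derived from QMD_FORCE_CPU / FORCE_CPU is in scope at that point in ensureLlama():

      // Skip the generic CPU warning when CPU mode was explicitly requested,
      // since a dedicated QMD_FORCE_CPU/FORCE_CPU message was already printed.
      if (!llama.gpu && !forceCpu) {
        process.stderr.write(
          "QMD Warning: no GPU acceleration, running on CPU (slow). Run 'qmd status' for details.\n"
        );
      }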
