
Conversation

lukemarsden (Collaborator) commented Nov 24, 2025

Note

Adds AMD/ROCm support with vendor-aware GPU detection, memory parsing, vLLM env selection, compose/install updates, and API/UI field changes.

  • Runner/Inference:
    • Add vendor detection (nvidia/amd) and use nvidia-smi or rocm-smi accordingly for GPU count/memory; parse ROCm outputs; expose SDKVersion (CUDA/ROCm) replacing CUDAVersion.
    • Initialize/update per-GPU stats for AMD; update logs and error paths; adjust process matching for vLLM CUDA/ROCm venvs.
    • vLLM runtime chooses Python venv by vendor (/workspace/vllm-cuda or /workspace/vllm-rocm).
  • Wolf/External Agent:
    • Build GOW_REQUIRED_DEVICES dynamically from GPU_VENDOR (adds /dev/kfd for AMD) in both lobbies/apps flows; neutralize GPU monitoring messages.
  • Docker/Build:
    • Dockerfile: create separate CUDA and ROCm vLLM virtualenvs; for ROCm, install ROCm PyTorch and build vLLM from source; copy examples to both; add symlink for backward compat.
    • docker-compose: add wolf-amd service (profile code-amd); make moonlight-web GPU-agnostic.
  • Installer:
    • Detect GPU vendor (NVIDIA/AMD/Intel), set GPU_VENDOR in .env; select compose profiles (code vs code-amd); conditionally install NVIDIA runtime; load uhid module; generate runner.sh with vendor-specific GPU flags (NVIDIA --gpus, AMD /dev/kfd + ROCR_VISIBLE_DEVICES); see the flag sketch after this list.
  • API/Types:
    • Change GPUStatus field from cuda_version to sdk_version; propagate in runner status and related structs; clarify GPU stats comments to include ROCm.
  • Frontend:
    • Make GPU labels/vendor references generic (e.g., "GPU" and "SDK"); vendor-neutral model name parsing in Runner and Admin dashboards; update GPU query label text.
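
As a rough sketch of the vendor-specific runner.sh flag generation described in the Installer item above (only --gpus, /dev/kfd, and ROCR_VISIBLE_DEVICES are taken from the summary; the variable names and structure here are assumptions):

    # Sketch: pick docker run GPU flags based on the detected vendor.
    # GPU_VENDOR is assumed to come from .env, written by the installer.
    case "$GPU_VENDOR" in
        nvidia)
            # NVIDIA container toolkit path: expose all GPUs
            GPU_FLAGS="--gpus all"
            ;;
        amd)
            # AMD path: pass the DRI render nodes, add the ROCm compute
            # device only if present, and scope visible GPUs for ROCm
            GPU_FLAGS="--device=/dev/dri"
            [ -e /dev/kfd ] && GPU_FLAGS="$GPU_FLAGS --device=/dev/kfd"
            GPU_FLAGS="$GPU_FLAGS -e ROCR_VISIBLE_DEVICES=0"
            ;;
    esac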

Written by Cursor Bugbot for commit ab9fa8c. This will update automatically on new commits.

cursor bot left a comment


This PR is being reviewed by Cursor Bugbot


Bug: AMD render node not detected for Wolf

The WOLF_RENDER_NODE auto-detection only checks for NVIDIA GPUs by matching driver name "nvidia", but never checks for AMD GPUs (driver name "amdgpu"). When GPU_VENDOR is set to "amd", the code falls back to the default /dev/dri/renderD128, which may be incorrect on systems with multiple GPUs or virtual GPUs. AMD systems need the same render node detection logic to find the correct AMD GPU render node.

helix/install.sh, lines 1665 to 1685 at ab9fa8c:

# Auto-detect first NVIDIA render node for Wolf
# On some systems (Lambda Labs), renderD128 is virtio-gpu (virtual), NVIDIA starts at renderD129
WOLF_RENDER_NODE="/dev/dri/renderD128"  # Default
if [ -d "/sys/class/drm" ]; then
    for render_node in /dev/dri/renderD*; do
        if [ -e "$render_node" ]; then
            # Check if this render node is NVIDIA by checking driver symlink
            node_name=$(basename "$render_node")
            driver_link="/sys/class/drm/$node_name/device/driver"
            if [ -L "$driver_link" ]; then
                driver=$(readlink "$driver_link" | grep -o '[^/]*$')
                if [[ "$driver" == "nvidia" ]]; then
                    WOLF_RENDER_NODE="$render_node"
                    echo "Auto-detected NVIDIA render node: $WOLF_RENDER_NODE"
                    break
                fi
            fi
        fi
    done
fi

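A minimal sketch of the fix the report suggests, extending the driver match above to AMD (illustrative only, not the actual patch; it assumes GPU_VENDOR is already set):

# Choose the driver name to match based on the configured vendor
case "$GPU_VENDOR" in
    amd) want_driver="amdgpu" ;;
    *)   want_driver="nvidia" ;;
esac

# ...then, inside the render-node loop above:
if [[ "$driver" == "$want_driver" ]]; then
    WOLF_RENDER_NODE="$render_node"
    echo "Auto-detected $want_driver render node: $WOLF_RENDER_NODE"
    break
fi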


case "nvidia":
cmd = exec.Command("nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits")
case "amd":
cmd = exec.Command("rocm-smi", "--showmeminfo", "vram", "--csv")

Bug: Incorrect rocm-smi flag in updateGPUMemoryMap

The updateGPUMemoryMap function uses rocm-smi --showmeminfo vram --csv with the --csv flag, but the parsing logic at lines 950-1027 expects structured text format with patterns like "VRAM Total Used Memory (B):". This is inconsistent with other AMD memory queries in the same file (lines 218 and 403) which correctly use rocm-smi --showmeminfo vram without the --csv flag. The --csv flag will produce CSV output that won't match the structured text parser, causing AMD GPU memory updates to fail silently.

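For comparison, a sketch of reading used VRAM from the structured text form the parser expects (the "VRAM Total Used Memory (B):" pattern comes from the report above; the exact rocm-smi line layout is an assumption):

# Print one used-VRAM value in bytes per GPU, assuming output lines like:
#   GPU[0]          : VRAM Total Used Memory (B): 4294967296
rocm-smi --showmeminfo vram | awk '/VRAM Total Used Memory \(B\)/ {print $NF}'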

Enables CI image builds for controlplane, runner, and other
components when tags are created on the feature/amd branch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
chocobar and others added 2 commits November 24, 2025 13:15
- Relax check_amd_gpu() to detect AMD via lspci + /dev/dri only
- Add check_amd_rocm() to separately check for ROCm support
- Print clear instructions to install ROCm when /dev/kfd is missing
- Make /dev/kfd conditional in runner GPU flags

This fixes AMD GPU detection on Azure NVads_V710 VMs where ROCm
is not pre-installed. Desktop streaming (Helix Code) works with
just /dev/dri, while ML compute requires ROCm installation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Moonlight auto-pairing was failing on AMD systems because the init script
hardcoded 'wolf' as the hostname, but the AMD profile uses 'wolf-amd'.

Changes:
- Added WOLF_HOSTNAME env var to moonlight-web service (defaults to 'wolf')
- Made data.json.template use {{WOLF_HOSTNAME}} template variable
- Updated init-moonlight-config.sh to substitute WOLF_HOSTNAME in data.json
- Updated init-moonlight-config.sh to use WOLF_HOSTNAME when checking port
- Modified install.sh to set WOLF_HOSTNAME=wolf-amd for AMD GPUs
- Added /var/run/wolf volume mount to moonlight-web for Wolf socket access

This fixes the "Wolf failed to start within 120 seconds" pairing timeout
and enables automatic pairing on both NVIDIA and AMD systems.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
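
A sketch of the substitution step this commit describes (the {{WOLF_HOSTNAME}} placeholder, the 'wolf' default, and the file names come from the message; the sed-based rendering is an assumption):

# Render data.json for the active profile: wolf (NVIDIA) or wolf-amd (AMD)
WOLF_HOSTNAME="${WOLF_HOSTNAME:-wolf}"
sed "s/{{WOLF_HOSTNAME}}/${WOLF_HOSTNAME}/g" data.json.template > data.json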
}

# Function to check for AMD GPU specifically (for Helix Code)
# Detects AMD GPU via lspci + /dev/dri (does not require ROCm/kfd)
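
Given the excerpt above and the earlier commit message, the relaxed checks might look roughly like this (only the function names and the lspci + /dev/dri vs /dev/kfd split are from the PR; the bodies are guesses):

check_amd_gpu() {
    # AMD GPU present for desktop streaming: an AMD/ATI display device
    # on the PCI bus plus DRI render nodes; ROCm is not required
    lspci | grep -Ei 'vga|display|3d' | grep -qiE 'amd|ati' && [ -d /dev/dri ]
}

check_amd_rocm() {
    # ML compute additionally needs ROCm's kernel fusion driver
    [ -e /dev/kfd ]
}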

lukemarsden marked this pull request as draft November 25, 2025 04:59

@lukemarsden

superseded by #1379, I'll integrate any of the AMD changes needed there, thanks!

