
Conversation

lukemarsden (Collaborator) commented Nov 24, 2025

Note

Adds AMD/ROCm support with vendor-aware GPU detection, memory parsing, vLLM env selection, compose/install updates, and API/UI field changes.

  • Runner/Inference:
    • Add vendor detection (nvidia/amd) and use nvidia-smi or rocm-smi accordingly for GPU count/memory; parse ROCm outputs; expose SDKVersion (CUDA/ROCm) replacing CUDAVersion.
    • Initialize/update per-GPU stats for AMD; update logs and error paths; adjust process matching for vLLM CUDA/ROCm venvs.
    • vLLM runtime chooses Python venv by vendor (/workspace/vllm-cuda or /workspace/vllm-rocm).
  • Wolf/External Agent:
    • Build GOW_REQUIRED_DEVICES dynamically from GPU_VENDOR (adds /dev/kfd for AMD) in both lobbies/apps flows; neutralize GPU monitoring messages.
  • Docker/Build:
    • Dockerfile: create separate CUDA and ROCm vLLM virtualenvs; for ROCm, install ROCm PyTorch and build vLLM from source; copy examples to both; add symlink for backward compat.
    • docker-compose: add wolf-amd service (profile code-amd); make moonlight-web GPU-agnostic.
  • Installer:
    • Detect GPU vendor (NVIDIA/AMD/Intel), set GPU_VENDOR in .env; select compose profiles (code vs code-amd); conditionally install NVIDIA runtime; load uhid module; generate runner.sh with vendor-specific GPU flags (NVIDIA --gpus, AMD /dev/kfd + ROCR_VISIBLE_DEVICES); see the flag sketch after this list.
  • API/Types:
    • Change GPUStatus field from cuda_version to sdk_version; propagate in runner status and related structs; clarify GPU stats comments to include ROCm.
  • Frontend:
    • Make GPU labels/vendor references generic (e.g., "GPU" and "SDK"); vendor-neutral model name parsing in Runner and Admin dashboards; update GPU query label text.
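
As a rough sketch of the vendor-specific runner.sh flag generation described in the Installer item above (only --gpus, /dev/kfd, and ROCR_VISIBLE_DEVICES are taken from the summary; the variable names and structure here are assumptions):

    # Sketch: pick docker run GPU flags based on the detected vendor.
    # GPU_VENDOR is assumed to come from .env, written by the installer.
    case "$GPU_VENDOR" in
        nvidia)
            # NVIDIA container toolkit path: expose all GPUs
            GPU_FLAGS="--gpus all"
            ;;
        amd)
            # AMD path: pass the DRI render nodes, add the ROCm compute
            # device only if present, and scope visible GPUs for ROCm
            GPU_FLAGS="--device=/dev/dri"
            [ -e /dev/kfd ] && GPU_FLAGS="$GPU_FLAGS --device=/dev/kfd"
            GPU_FLAGS="$GPU_FLAGS -e ROCR_VISIBLE_DEVICES=0"
            ;;
    esac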

Written by Cursor Bugbot for commit ab9fa8c. This will update automatically on new commits.

cursor bot left a comment


This PR is being reviewed by Cursor Bugbot


Bug: AMD render node not detected for Wolf

The WOLF_RENDER_NODE auto-detection only checks for NVIDIA GPUs by matching driver name "nvidia", but never checks for AMD GPUs (driver name "amdgpu"). When GPU_VENDOR is set to "amd", the code falls back to the default /dev/dri/renderD128, which may be incorrect on systems with multiple GPUs or virtual GPUs. AMD systems need the same render node detection logic to find the correct AMD GPU render node.

helix/install.sh, lines 1665 to 1685 at ab9fa8c:

# Auto-detect first NVIDIA render node for Wolf
# On some systems (Lambda Labs), renderD128 is virtio-gpu (virtual), NVIDIA starts at renderD129
WOLF_RENDER_NODE="/dev/dri/renderD128"  # Default
if [ -d "/sys/class/drm" ]; then
    for render_node in /dev/dri/renderD*; do
        if [ -e "$render_node" ]; then
            # Check if this render node is NVIDIA by checking driver symlink
            node_name=$(basename "$render_node")
            driver_link="/sys/class/drm/$node_name/device/driver"
            if [ -L "$driver_link" ]; then
                driver=$(readlink "$driver_link" | grep -o '[^/]*$')
                if [[ "$driver" == "nvidia" ]]; then
                    WOLF_RENDER_NODE="$render_node"
                    echo "Auto-detected NVIDIA render node: $WOLF_RENDER_NODE"
                    break
                fi
            fi
        fi
    done
fi

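A minimal sketch of the fix the report suggests, extending the driver match above to AMD (illustrative only, not the actual patch; it assumes GPU_VENDOR is already set):

# Choose the driver name to match based on the configured vendor
case "$GPU_VENDOR" in
    amd) want_driver="amdgpu" ;;
    *)   want_driver="nvidia" ;;
esac

# ...then, inside the render-node loop above:
if [[ "$driver" == "$want_driver" ]]; then
    WOLF_RENDER_NODE="$render_node"
    echo "Auto-detected $want_driver render node: $WOLF_RENDER_NODE"
    break
fi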


case "nvidia":
cmd = exec.Command("nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits")
case "amd":
cmd = exec.Command("rocm-smi", "--showmeminfo", "vram", "--csv")

Bug: Incorrect rocm-smi flag in updateGPUMemoryMap

The updateGPUMemoryMap function uses rocm-smi --showmeminfo vram --csv with the --csv flag, but the parsing logic at lines 950-1027 expects structured text format with patterns like "VRAM Total Used Memory (B):". This is inconsistent with other AMD memory queries in the same file (lines 218 and 403) which correctly use rocm-smi --showmeminfo vram without the --csv flag. The --csv flag will produce CSV output that won't match the structured text parser, causing AMD GPU memory updates to fail silently.

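For comparison, a sketch of reading used VRAM from the structured text form the parser expects (the "VRAM Total Used Memory (B):" pattern comes from the report above; the exact rocm-smi line layout is an assumption):

# Print one used-VRAM value in bytes per GPU, assuming output lines like:
#   GPU[0]          : VRAM Total Used Memory (B): 4294967296
rocm-smi --showmeminfo vram | awk '/VRAM Total Used Memory \(B\)/ {print $NF}'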

Enables CI image builds for controlplane, runner, and other
components when tags are created on the feature/amd branch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
chocobar and others added 2 commits November 24, 2025 13:15
- Relax check_amd_gpu() to detect AMD via lspci + /dev/dri only
- Add check_amd_rocm() to separately check for ROCm support
- Print clear instructions to install ROCm when /dev/kfd is missing
- Make /dev/kfd conditional in runner GPU flags

This fixes AMD GPU detection on Azure NVads_V710 VMs where ROCm
is not pre-installed. Desktop streaming (Helix Code) works with
just /dev/dri, while ML compute requires ROCm installation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Moonlight auto-pairing was failing on AMD systems because the init script
hardcoded 'wolf' as the hostname, but the AMD profile uses 'wolf-amd'.

Changes:
- Added WOLF_HOSTNAME env var to moonlight-web service (defaults to 'wolf')
- Made data.json.template use {{WOLF_HOSTNAME}} template variable
- Updated init-moonlight-config.sh to substitute WOLF_HOSTNAME in data.json
- Updated init-moonlight-config.sh to use WOLF_HOSTNAME when checking port
- Modified install.sh to set WOLF_HOSTNAME=wolf-amd for AMD GPUs
- Added /var/run/wolf volume mount to moonlight-web for Wolf socket access

This fixes the "Wolf failed to start within 120 seconds" pairing timeout
and enables automatic pairing on both NVIDIA and AMD systems.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
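
A sketch of the substitution step this commit describes (the {{WOLF_HOSTNAME}} placeholder, the 'wolf' default, and the file names come from the message; the sed-based rendering is an assumption):

# Render data.json for the active profile: wolf (NVIDIA) or wolf-amd (AMD)
WOLF_HOSTNAME="${WOLF_HOSTNAME:-wolf}"
sed "s/{{WOLF_HOSTNAME}}/${WOLF_HOSTNAME}/g" data.json.template > data.json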
}

# Function to check for AMD GPU specifically (for Helix Code)
# Detects AMD GPU via lspci + /dev/dri (does not require ROCm/kfd)
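
Given the excerpt above and the earlier commit message, the relaxed checks might look roughly like this (only the function names and the lspci + /dev/dri vs /dev/kfd split are from the PR; the bodies are guesses):

check_amd_gpu() {
    # AMD GPU present for desktop streaming: an AMD/ATI display device
    # on the PCI bus plus DRI render nodes; ROCm is not required
    lspci | grep -Ei 'vga|display|3d' | grep -qiE 'amd|ati' && [ -d /dev/dri ]
}

check_amd_rocm() {
    # ML compute additionally needs ROCm's kernel fusion driver
    [ -e /dev/kfd ]
}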

lukemarsden marked this pull request as draft November 25, 2025 04:59

@lukemarsden

superseded by #1379, I'll integrate any of the AMD changes needed there, thanks!

