Feature/amd #1374
base: main
Conversation
This PR is being reviewed by Cursor Bugbot
Bug: AMD render node not detected for Wolf
The WOLF_RENDER_NODE auto-detection only checks for NVIDIA GPUs by matching driver name "nvidia", but never checks for AMD GPUs (driver name "amdgpu"). When GPU_VENDOR is set to "amd", the code falls back to the default /dev/dri/renderD128, which may be incorrect on systems with multiple GPUs or virtual GPUs. AMD systems need the same render node detection logic to find the correct AMD GPU render node.
install.sh, lines 1665 to 1685 at ab9fa8c:
```bash
# Auto-detect first NVIDIA render node for Wolf
# On some systems (Lambda Labs), renderD128 is virtio-gpu (virtual), NVIDIA starts at renderD129
WOLF_RENDER_NODE="/dev/dri/renderD128"  # Default
if [ -d "/sys/class/drm" ]; then
  for render_node in /dev/dri/renderD*; do
    if [ -e "$render_node" ]; then
      # Check if this render node is NVIDIA by checking driver symlink
      node_name=$(basename "$render_node")
      driver_link="/sys/class/drm/$node_name/device/driver"
      if [ -L "$driver_link" ]; then
        driver=$(readlink "$driver_link" | grep -o '[^/]*$')
        if [[ "$driver" == "nvidia" ]]; then
          WOLF_RENDER_NODE="$render_node"
          echo "Auto-detected NVIDIA render node: $WOLF_RENDER_NODE"
          break
        fi
      fi
    fi
  done
fi
```
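As a hedged sketch (not code from this PR), the same loop could honor `GPU_VENDOR` and match the `amdgpu` kernel driver the same way it matches `nvidia`:

```bash
# Sketch only: pick the driver name to match from GPU_VENDOR.
# The wanted_driver mapping is an assumption; "amdgpu" is the kernel
# driver name for modern AMD GPUs.
case "$GPU_VENDOR" in
  amd) wanted_driver="amdgpu" ;;
  *)   wanted_driver="nvidia" ;;
esac

WOLF_RENDER_NODE="/dev/dri/renderD128"  # Default
for render_node in /dev/dri/renderD*; do
  [ -e "$render_node" ] || continue
  node_name=$(basename "$render_node")
  driver_link="/sys/class/drm/$node_name/device/driver"
  if [ -L "$driver_link" ]; then
    driver=$(basename "$(readlink "$driver_link")")
    if [ "$driver" = "$wanted_driver" ]; then
      WOLF_RENDER_NODE="$render_node"
      echo "Auto-detected $GPU_VENDOR render node: $WOLF_RENDER_NODE"
      break
    fi
  fi
done
```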
| case "nvidia": | ||
| cmd = exec.Command("nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits") | ||
| case "amd": | ||
| cmd = exec.Command("rocm-smi", "--showmeminfo", "vram", "--csv") |
Bug: Incorrect rocm-smi flag in updateGPUMemoryMap
The updateGPUMemoryMap function uses rocm-smi --showmeminfo vram --csv with the --csv flag, but the parsing logic at lines 950-1027 expects structured text format with patterns like "VRAM Total Used Memory (B):". This is inconsistent with other AMD memory queries in the same file (lines 218 and 403) which correctly use rocm-smi --showmeminfo vram without the --csv flag. The --csv flag will produce CSV output that won't match the structured text parser, causing AMD GPU memory updates to fail silently.
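For reference, a hedged sketch of what the structured-text parser appears to expect (the exact output format is an assumption inferred from the pattern quoted above):

```bash
# Sketch: without --csv, rocm-smi emits structured text such as:
#   GPU[0]  : VRAM Total Memory (B): 17163091968
#   GPU[0]  : VRAM Total Used Memory (B): 10477568
# Extracting the used-VRAM byte count from that format might look like:
rocm-smi --showmeminfo vram \
  | awk -F': ' '/VRAM Total Used Memory \(B\)/ {print $NF}'
```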
Enables CI image builds for controlplane, runner, and other components when tags are created on the feature/amd branch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed from 68158ab to b3cd415.
- Relax check_amd_gpu() to detect AMD via lspci + /dev/dri only (sketched below)
- Add check_amd_rocm() to separately check for ROCm support
- Print clear instructions to install ROCm when /dev/kfd is missing
- Make /dev/kfd conditional in runner GPU flags

This fixes AMD GPU detection on Azure NVads_V710 VMs where ROCm is not pre-installed. Desktop streaming (Helix Code) works with just /dev/dri, while ML compute requires ROCm installation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
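A minimal sketch of the relaxed detection described above. The function names match the commit message, but the bodies are assumptions, not the PR's actual code:

```bash
# Sketch: detect an AMD GPU without requiring ROCm (/dev/kfd).
check_amd_gpu() {
  # lspci lists PCI devices; AMD GPUs show up as VGA/Display/3D
  # controllers from "AMD" or "ATI".
  lspci | grep -Ei 'vga|display|3d' | grep -Eqi 'amd|ati' || return 1
  # A /dev/dri render node is enough for desktop streaming.
  ls /dev/dri/renderD* >/dev/null 2>&1
}

check_amd_rocm() {
  # ROCm compute additionally requires the kernel fusion driver node.
  [ -e /dev/kfd ]
}

if check_amd_gpu && ! check_amd_rocm; then
  echo "AMD GPU found, but /dev/kfd is missing; install ROCm for ML compute." >&2
fi
```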
Moonlight auto-pairing was failing on AMD systems because the init script
hardcoded 'wolf' as the hostname, but the AMD profile uses 'wolf-amd'.
Changes:
- Added WOLF_HOSTNAME env var to moonlight-web service (defaults to 'wolf')
- Made data.json.template use {{WOLF_HOSTNAME}} template variable
- Updated init-moonlight-config.sh to substitute WOLF_HOSTNAME in data.json (see the sketch after this message)
- Updated init-moonlight-config.sh to use WOLF_HOSTNAME when checking port
- Modified install.sh to set WOLF_HOSTNAME=wolf-amd for AMD GPUs
- Added /var/run/wolf volume mount to moonlight-web for Wolf socket access
This fixes the "Wolf failed to start within 120 seconds" pairing timeout
and enables automatic pairing on both NVIDIA and AMD systems.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
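A hedged sketch of the substitution and wait steps, assuming data.json.template uses a {{WOLF_HOSTNAME}} placeholder as described; the file paths and the port (47989, the standard GameStream HTTP port) are assumptions, and the actual init-moonlight-config.sh may differ:

```bash
# Sketch: render data.json with the configured Wolf hostname.
WOLF_HOSTNAME="${WOLF_HOSTNAME:-wolf}"  # default matches the NVIDIA profile
sed "s/{{WOLF_HOSTNAME}}/${WOLF_HOSTNAME}/g" \
  /etc/moonlight/data.json.template > /etc/moonlight/data.json

# Sketch: wait for Wolf's HTTP port on the configured host instead of a
# hardcoded 'wolf', so the wolf-amd profile can pair too.
until curl -fsS "http://${WOLF_HOSTNAME}:47989/" >/dev/null 2>&1; do
  sleep 2
done
```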
Force-pushed from bdb9171 to b186edc.
```bash
# Function to check for AMD GPU specifically (for Helix Code)
# Detects AMD GPU via lspci + /dev/dri (does not require ROCm/kfd)
```
but doesn't this just imply the amdgpu drivers weren't properly installed? ref https://learn.microsoft.com/en-us/azure/virtual-machines/linux/azure-n-series-amd-gpu-driver-linux-installation-guide#rocm-driver-installation
Superseded by #1379, I'll integrate any of the AMD changes needed there, thanks!
Force-pushed from a7ca1a5 to b186edc.
Note
Adds AMD/ROCm support with vendor-aware GPU detection, memory parsing, vLLM env selection, compose/install updates, and API/UI field changes.
- Detect the GPU vendor (nvidia/amd) and use nvidia-smi or rocm-smi accordingly for GPU count/memory; parse ROCm outputs; expose SDKVersion (CUDA/ROCm), replacing CUDAVersion.
- Select the vLLM environment per vendor (/workspace/vllm-cuda or /workspace/vllm-rocm).
- Build GOW_REQUIRED_DEVICES dynamically from GPU_VENDOR (adds /dev/kfd for AMD) in both lobbies/apps flows; neutralize GPU monitoring messages.
- Add a wolf-amd service (profile code-amd); make moonlight-web GPU-agnostic.
- Set GPU_VENDOR in .env; select compose profiles (code vs code-amd); conditionally install the NVIDIA runtime; load the uhid module; generate runner.sh with vendor-specific GPU flags (NVIDIA --gpus, AMD /dev/kfd + ROCR_VISIBLE_DEVICES).
- Rename the GPUStatus field from cuda_version to sdk_version; propagate in runner status and related structs; clarify GPU stats comments to include ROCm.

Written by Cursor Bugbot for commit ab9fa8c. This will update automatically on new commits.
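To make the vendor-specific runner flags in the summary concrete, a hedged sketch of how runner.sh generation might branch on GPU_VENDOR; only the flags named above come from the PR, and "my-runner-image" is a hypothetical image name:

```bash
# Sketch only: build vendor-specific GPU flags for the generated runner.sh.
case "$GPU_VENDOR" in
  nvidia)
    GPU_FLAGS="--gpus all"
    ;;
  amd)
    # AMD containers need the DRI render nodes plus /dev/kfd (ROCm compute),
    # and ROCR_VISIBLE_DEVICES to scope which GPUs ROCm sees.
    GPU_FLAGS="--device /dev/kfd --device /dev/dri -e ROCR_VISIBLE_DEVICES=0"
    ;;
  *)
    GPU_FLAGS=""
    ;;
esac

cat > runner.sh <<EOF
#!/bin/sh
exec docker run --rm ${GPU_FLAGS} my-runner-image
EOF
chmod +x runner.sh
```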