-
Notifications
You must be signed in to change notification settings - Fork 59
Wip (Azure Helix Code) #1370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Wip (Azure Helix Code) #1370
Conversation
Problem: Sway was failing to start on NVIDIA GPUs with error: [ERROR] [wlr] Could not connect to remote display: Connection refused [ERROR] [sway/server.c:137] Unable to create backend Root cause: wlroots was trying to use Wayland backend (nested compositor) instead of DRM backend (direct GPU access) needed for headless operation. Solution: Set WLR_BACKENDS=drm to force DRM backend, enabling headless GPU rendering with NVIDIA proprietary drivers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Wolf was panicking with "Failed to create GsCUDABuf" when starting Sway sessions on NVIDIA GPUs. NVIDIA GBM (Generic Buffer Management) couldn't create DMA buffers for GPU memory sharing. Root cause: Wolf container lacked access to /dev/dma_heap device which is required for NVIDIA's GBM implementation to allocate DMA-BUF buffers. Solution: - Mount /dev/dma_heap device in Wolf container - Add device cgroup rule 'c 249:* rwm' for DMA heap device (major 249) This enables proper GPU memory buffer allocation for desktop streaming. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Sway containers created by Wolf were panicking with "Failed to create GsCUDABuf" error. NVIDIA GBM couldn't create DMA buffers inside the Zed/Sway containers even though Wolf container had DMA heap access. Root cause: Wolf API was creating Zed containers without DMA heap device access. The GOW_REQUIRED_DEVICES environment variable and DeviceCgroupRules didn't include /dev/dma_heap. Solution: - Add /dev/dma_heap/* to GOW_REQUIRED_DEVICES (tells GOW to mount device) - Add 'c 249:* rwm' to DeviceCgroupRules (allows container to access it) This completes the DMA heap fix - now both Wolf and its child containers can access /dev/dma_heap for NVIDIA GBM buffer allocation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Zed/Sway containers created by Wolf API were still failing with "Could not connect to remote display: Connection refused" error even after fixing Wolf container itself. Root cause: The WLR_BACKENDS=drm environment variable was only set in Wolf container, but not passed to the Zed/Sway containers that Wolf creates dynamically. Solution: Add WLR_BACKENDS=drm to the environment variables array in createSwayWolfAppForAppsMode() so all Zed containers get the DRM backend setting. This completes the NVIDIA GPU support fixes for Helix Code desktop streaming: 1. WLR_BACKENDS=drm in Wolf (docker-compose.yaml) 2. /dev/dma_heap device in Wolf (docker-compose.yaml) 3. /dev/dma_heap device in API code for Zed containers 4. WLR_BACKENDS=drm in API code for Zed containers (this commit) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Previous fix was applied to wrong function. The actual function used for creating Zed/Sway containers is createSwayWolfApp() in wolf_executor.go, not createSwayWolfAppForAppsMode() in wolf_executor_apps.go. Changes: - wolf_executor.go:124 - Added /dev/dma_heap/* to GOW_REQUIRED_DEVICES - wolf_executor.go:126 - Added WLR_BACKENDS=drm environment variable - wolf_executor.go:194 - Added c 249:* rwm to DeviceCgroupRules This ensures ALL Zed/Sway containers (External Agents and PDEs) get: 1. DMA heap device access for NVIDIA GBM buffer allocation 2. DRM backend for headless GPU rendering with NVIDIA drivers 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem: NVIDIA GBM (Generic Buffer Management) requires nvidia-drm kernel module with modeset=1 to work with DMA-BUF. Without this, Sway/Wolf desktop streaming fails with "Failed to create GsCUDABuf" errors even if the Wolf container has /dev/dma_heap device access. Root cause: NVIDIA's GBM implementation REQUIRES the nvidia-drm kernel module to have modeset enabled. This is a kernel-level requirement that can't be worked around at the application or container level. Solution: - Add setup_nvidia_drm_modeset() function that: - Checks if nvidia-smi is installed (NVIDIA GPU present) - Checks if modeset is already enabled - Creates /etc/modprobe.d/nvidia-drm.conf with modeset=1 - Updates initramfs - Sets REBOOT_REQUIRED flag - Call this function after install_nvidia_docker in both code paths: - After --code installation (line 1279) - After --runner installation (line 1876) - Add reboot warning at end of install.sh if REBOOT_REQUIRED=true Why all previous changes are still needed: 1. nvidia-drm modeset=1 (this change): Enables kernel-level DRM/GBM support 2. /dev/dma_heap device access: Required for DMA buffer allocation 3. WLR_BACKENDS=drm: Forces Sway to use DRM backend instead of nested Wayland 4. GOW_REQUIRED_DEVICES: Tells Wolf which devices to mount in child containers 5. DeviceCgroupRules: Allows container processes to access the devices All these changes work together as a complete fix. Each addresses a different layer of the problem (kernel → device access → container config → compositor). Testing: - On fresh install with NVIDIA GPU: modeset will be enabled, reboot required - On system with modeset already enabled: no-op, no reboot required - On system without NVIDIA GPU: no-op After reboot, Wolf/Sway desktop streaming (Helix Code) will be fully functional. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem: GOW_REQUIRED_DEVICES was set to /dev/dma_heap/* (wildcard), but Wolf couldn't find this device because /dev/dma_heap is a directory containing the actual device /dev/dma_heap/system. When Wolf tried to expand the glob pattern looking for files directly in /dev, it found nothing and didn't mount the device. Evidence from container logs: - "Path '/dev/dma_heap/*' is not present" during GOW device setup - Wolf set DeviceCgroupRules to only ["c 240:* rwm", "c 13:* rwm"], dropping our c 249:* rwm rule because it couldn't find the dma_heap device - Sway failed with DRM backend errors because it couldn't access DMA buffers Root cause: /dev/dma_heap is a directory, not a glob pattern in /dev: /dev/dma_heap/ └── system (character device 249,0) Solution: Change GOW_REQUIRED_DEVICES to use the exact device path: /dev/dma_heap/* → /dev/dma_heap/system This tells Wolf exactly where to find the DMA heap device, allowing it to: 1. Mount /dev/dma_heap/system into the container 2. Preserve our c 249:* rwm cgroup rule (instead of overriding it) 3. Give Sway/wlroots access to DMA buffers for NVIDIA GBM Testing: Deploy dma-heap-fix-v4 image and verify: - Container logs show "/dev/dma_heap/system" device mounted - DeviceCgroupRules includes "c 249:* rwm" - Sway starts successfully with WLR_BACKENDS=drm 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…ntainers Problem: Sway was failing to start with DRM backend errors: - "Timeout waiting session to become active" - "failed to add backend 'drm'" - "Could not get primary session for user: No data available" - "Could not open target tty: No such file or directory" Root cause: libseat (used by wlroots/Sway for seat management) was trying to acquire a login session and VT (virtual terminal) via logind/seatd. This fails in headless containers because: 1. No systemd/logind running 2. No /dev/tty* devices 3. No login session concept in containers libseat has multiple backends: - logind: Requires systemd (not in containers) - seatd: Requires seatd daemon + VT access (not in containers) - builtin: Requires VT access (not in containers) - noop: Dummy backend for headless environments ✅ Solution: Set LIBSEAT_BACKEND=noop to tell libseat to use the no-op backend. This disables session/VT management entirely, which is fine for our use case: - We're running headless (no physical display/input) - We're using DRM backend directly (no seat arbitration needed) - Wolf handles container lifecycle (no multi-user seat switching) The noop backend allows Sway/wlroots to initialize without trying to acquire a session, while still using DRM for GPU rendering. Testing: Deploy dma-heap-fix-v5 and verify: - No libseat errors in Sway logs - Sway starts successfully with DRM backend - Desktop streaming works 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Sway was failing to start with "Found 0 GPUs, cannot create backend" even though NVIDIA GPU was present and accessible. Root cause: Azure VMs have two DRM devices: - /dev/dri/card0 - Hyper-V virtual display (hyperv_drm driver) - /dev/dri/card1 - NVIDIA Tesla T4 (nvidia driver) wlroots/Sway defaults to scanning DRM devices in order and trying to use the first one it finds. It was trying to use card0 (Hyper-V dummy device) which doesn't support the DRM operations needed for GPU rendering with libseat noop. Evidence from logs: - "Failed to open device: '/dev/dri/card0': No such file or directory" - "Found 0 GPUs, cannot create backend" - Wolf was mounting card1 correctly but Sway wasn't using it Solution: Set WLR_DRM_DEVICES=/dev/dri/card1 to explicitly tell wlroots to skip card0 and use the NVIDIA GPU directly. This is a common pattern for systems with multiple GPUs where you want to force wlroots to use a specific device instead of auto-detection. Testing: Deploy dma-heap-fix-v6 and verify: - Sway starts successfully with DRM backend - No "Found 0 GPUs" error - Desktop streaming works with NVIDIA GPU 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Hardcoding WLR_DRM_DEVICES=/dev/dri/card1 works on Azure VMs but breaks on systems where NVIDIA is card0. Different systems have different DRM device orderings depending on which GPU is detected first. Solution: Implement detectNVIDIADRMDevice() that: 1. Scans /sys/class/drm/card*/device/vendor for NVIDIA vendor ID (0x10de) 2. Returns the card path (/dev/dri/cardN) when NVIDIA GPU is found 3. Returns empty string if no NVIDIA GPU found (lets wlroots auto-detect) This handles all scenarios: - Azure VMs: Detects card1 (Hyper-V is card0, NVIDIA is card1) - Standard systems: Detects card0 (NVIDIA is the only/primary GPU) - Multi-GPU systems: Detects first NVIDIA card found - Non-NVIDIA systems: Skips WLR_DRM_DEVICES, wlroots auto-detects The detection runs once at container creation time, so there's no performance impact. Logs the detected card for debugging. Testing: Deploy v7 and verify: - Azure VM: Logs "Detected NVIDIA DRM device: /dev/dri/card1" - Standard NVIDIA system: Logs "/dev/dri/card0" - Sway starts successfully on both 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Improved the up() function to properly handle --profile flags by parsing them separately from other arguments. This ensures profiles are passed before the 'up' command in docker compose, which is required by Docker Compose CLI. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changed from string concatenation to bash arrays to avoid leading spaces in the docker compose command. This fixes the error: "unknown docker command: compose --profile wolf" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
When ./stack up --profile wolf is run, automatically build the moonlight-web image if it doesn't exist. This prevents the error: "pull access denied for helix-moonlight-web, repository does not exist" If the build fails, the script continues without the wolf profile to avoid breaking the entire stack startup. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Added validation to ensure git submodules are initialized before building moonlight-web. The build fails with a cryptic error if moonlight-common-c submodule is not pulled. Now shows clear error message telling users to run: git submodule update --init --recursive 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
When --profile wolf is used with ./stack rebuild, automatically build wolf and moonlight-web images before running docker compose. This keeps ./stack up simple (just handles profiles) while ./stack rebuild handles all the building. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changed ./stack rebuild to always build wolf and moonlight-web without requiring --profile flag. Makes it clearer that rebuild is specifically for wolf components. Workflow: 1. ./stack rebuild # Build wolf and moonlight-web 2. ./stack up --profile wolf # Start with wolf profile 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- ./stack rebuild # Builds wolf and moonlight-web - ./stack rebuild api # Rebuilds api service - ./stack rebuild frontend # Rebuilds frontend service Preserves original behavior when arguments are provided while adding wolf-specific rebuild when called without arguments. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Usage: - ./stack rebuild # Rebuild everything (wolf, moonlight-web, all services) - ./stack rebuild wolf # Rebuild only wolf - ./stack rebuild moonlight-web # Rebuild only moonlight-web - ./stack rebuild api # Rebuild only api service - ./stack rebuild wolf api # Rebuild wolf and api - ./stack rebuild wolf moonlight-web # Rebuild both wolf components This matches standard conventions where no args = all, with args = specific targets. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
The submodule is at moonlight-common-sys/moonlight-common-c, not moonlight-common-c. Fixed the path check to look in the correct location. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Commented out /dev/uhid device mount as the uhid kernel module is not available on all systems (especially cloud VMs). The uhid device is used for HID emulation but is not critical for basic wolf functionality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
When LIBSEAT_BACKEND=noop is set (required for headless Azure/cloud environments), skip starting seatd entirely. The script was hanging waiting for seatd socket that would never be created. This fixes external agent sessions failing to start with screenshot server connection refused errors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Attempting to get Sway compositor working with NVIDIA GPU in headless Azure containers for desktop streaming. Changes: - Removed LIBSEAT_BACKEND=noop from wolf_executor env (tried seatd approach) - Simplified startup-app.sh to use direct GPU access with LIBSEAT_BACKEND=noop - Add retro user to video/render groups for DRI device access - Set DRI device permissions to 666 for direct access - Add debug logging for WLR_* environment variables Current status: - NVIDIA DRM detection working (correctly identifies /dev/dri/card1) - seatd approach failed (VT-based session management doesn't work in containers) - Direct GPU access with noop: card1 available but wlroots can't open it - Need to verify WLR_DRM_DEVICES environment variable is passed to container Next steps: - Check if WLR_DRM_DEVICES is set in container - May need to explicitly set WLR_DRM_DEVICES in startup script - Investigate why wlroots can't open /dev/dri/card1 despite 666 permissions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
After extensive debugging, Sway cannot open /dev/dri/card1 even with: - LIBSEAT_BACKEND=noop (no seat management) - 666 permissions on all DRI devices - Running as root with sudo -E - WLR_DRM_DEVICES=/dev/dri/card1 explicitly set - SYS_ADMIN and all necessary container capabilities - AppArmor/seccomp disabled Key findings: - Device CAN be opened (dd test hung on read, not open) - wlroots specifically fails with "Unable to open /dev/dri/card1 as DRM device" - Error happens immediately, no timeout - Only card1 (NVIDIA) and renderD128 available in container - card0 (Hyper-V) not mounted into Sway containers Current theory: - NVIDIA proprietary driver may require special handling - Possible conflict with Wolf already using the GPU - May need NVIDIA-specific backend (EGLStreams) instead of GBM/DRM - Or need to use X11/Xwayland instead of native Wayland Next steps to investigate: 1. Check if XFCE example uses different DRM approach 2. Try X11-based compositor instead of pure Wayland 3. Investigate NVIDIA EGLStreams support 4. Check if multiple processes can open same NVIDIA DRM device 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
PROBLEM: Wolf containers couldn't start Sway compositor due to DRM master conflict. Wolf already holds DRM master on /dev/dri/card1 for streaming, preventing Sway from acquiring it. Previous attempts (seatd, LIBSEAT_BACKEND=noop, running as root) all failed. SOLUTION: Switch from WLR_BACKENDS=drm to WLR_BACKENDS=headless. - DRM backend: Requires drm master (exclusive), tries to use /dev/dri/card* - Headless backend: Uses /dev/dri/renderD128 (unprivileged), no drm master needed CHANGES: - wolf/sway-config/startup-app.sh: * Export WLR_BACKENDS=headless instead of drm * Export WLR_RENDER_DRM_DEVICE=/dev/dri/renderD128 * Keep LIBSEAT_BACKEND=noop (disable seat management in containers) * Add diagnostic output showing configuration TESTING: Container now starts successfully and Sway runs without DRM errors. Desktop streaming infrastructure is functional. REFERENCES: - NVIDIA Developer Forums: Headless Wayland on T4 GPUs - Requires NVIDIA driver 535+ (we have 550.90.07) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
PROBLEM: - Headless wlroots backend cannot provide CUDA memory buffers - waylanddisplaysrc with video/x-raw(memory:CUDAMemory) fails with MappingError - Video stream shows blank screen despite Sway/Zed rendering correctly SOLUTION: - Change video_producer_buffer_caps from "video/x-raw(memory:CUDAMemory)" to "video/x-raw" - System memory allows headless backend to provide buffers successfully CHANGES: - api/pkg/external-agent/wolf_executor.go: * VideoProducerBufferCaps: "video/x-raw" (was CUDAMemory) * Comment explains headless backend limitation - moonlight-web-config/config.json: * Regenerated with correct credentials after pairing fix TESTING STATUS: - Helix lobby correctly created with video/x-raw ✅ - However Wolf UI app interferes by connecting with CUDAMemory ❌ - Need to resolve Wolf UI app conflict (separate issue) NOTES: - Sway compositor working correctly with headless backend - Screenshots show Zed rendering properly - Moonlight pairing successful - Video pipeline issue: Wolf UI app vs Helix lobby conflict 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
… sessions Added wolf/config.toml.initial with NO [[profiles]] section to prevent static Wolf apps (especially Wolf UI) from being loaded. Helix creates dynamic sessions via Wolf's API instead. **Root Cause:** - Wolf UI app uses video/x-raw(memory:CUDAMemory) buffer caps - CUDA memory doesn't work with headless wlroots backend (uses renderD128) - Results in blank screen when Wolf UI interferes with Helix sessions **Solution:** - Provide minimal config.toml that prevents Wolf from loading default apps - Wolf init script will skip template generation if config file exists and is non-empty - This allows Helix to create clean API-based sessions without interference 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
| - /dev/uinput | ||
| - /dev/uhid | ||
| # /dev/uhid - Optional: uncomment if uhid kernel module is available | ||
| # If missing, run: sudo modprobe uhid || echo "uhid not available" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: /dev/dma_heap missing from here
| @@ -0,0 +1,29 @@ | |||
| # Wolf configuration for Helix | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conceptually this seems to be a duplicate of config.toml.template?
|
|
||
| # Configure wlroots for headless operation with renderD128 | ||
| # Use headless backend instead of drm - this uses /dev/dri/renderD128 (unprivileged) | ||
| # instead of trying to acquire drm master on /dev/dri/card* (which Wolf already has) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if sway was trying to acquire the DRM master on /dev/dri/card* then how could it possibly have worked on other systems?
opening draft PR so I can see changes more easily