Simplify NVIDIA workflow runner selection #400
Remove round-robin GPU selection and concurrency groups in favor of letting GitHub's native self-hosted runner queuing handle distribution. Self-hosted runners only run one job at a time by default.
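A rough sketch of what the simplified job could look like once the runner-selection job and the concurrency group are gone. The workflow name, trigger, job name, and steps here are assumptions for illustration, not the actual contents of nvidia_workflow.yml; only the `runs-on` labels come from this PR.

```yaml
# Hypothetical sketch; names and steps below are illustrative assumptions.
name: NVIDIA workflow (sketch)

on:
  workflow_dispatch:

jobs:
  run:
    # No select-runner job and no concurrency group. The job is routed to
    # an idle runner that carries every label listed here; if all matching
    # runners are busy, the job waits in GitHub's queue.
    runs-on: [self-hosted, nvidia-docker-b200-8-x86-64]
    steps:
      - uses: actions/checkout@v4
      # Assumed smoke check that the GPUs are visible on the runner.
      - run: nvidia-smi
```

Because each self-hosted runner process takes one job at a time, registering one runner per machine (or per GPU group) gives the same serialization the concurrency group was providing, without the workflow having to pick a runner itself.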
Pull request overview
This PR aims to simplify NVIDIA workflow runner selection by removing round-robin GPU selection and explicit concurrency groups, allowing GitHub's native self-hosted runner queuing to handle job distribution. The change adds the self-hosted label to the runner configuration.
Changes:
- Modified the `runs-on` configuration in `nvidia_workflow.yml` to explicitly include the `self-hosted` label alongside the existing `nvidia-docker-b200-8-x86-64` label
concurrency:
  group: ${{ needs.select-runner.outputs.runner }}
  cancel-in-progress: false
runs-on: [self-hosted, nvidia-docker-b200-8-x86-64]
The addition of the self-hosted label creates an inconsistency with other workflows in the repository. The workflows nvidia-on-prem-health.yml (line 12) and runner_ci.yml (lines 15 and 49) use the same runner label [nvidia-docker-b200-8-x86-64] without the self-hosted prefix. For consistency, either all workflows using this runner should include the self-hosted label, or none should. Additionally, the amd-health.yml workflow uses [amdgpu-mi300-8-x86-64] without self-hosted, suggesting a pattern in this repository of not including the self-hosted label explicitly.
Suggested change:
- runs-on: [self-hosted, nvidia-docker-b200-8-x86-64]
+ runs-on: [nvidia-docker-b200-8-x86-64]