Enhance GPU health check with distributed testing #360
Conversation
Increased timeout for health check job and added distributed health check step to verify GPU availability and initialization.
Pull Request Overview
This PR enhances the AMD GPU health check workflow by adding distributed GPU testing capabilities and increasing the timeout period. The changes expand the health check from simple GPU availability to include multi-GPU distributed initialization and communication testing.
- Increased job timeout from 5 to 10 minutes to accommodate distributed testing
- Added comprehensive distributed health check that tests NCCL initialization and inter-GPU communication
- Implemented dynamic GPU detection to test with available hardware (up to 8 GPUs); a rough sketch of this flow follows below
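As a rough illustration only, a distributed health check along these lines might look like the sketch below. The helper name `_check_rank` and the all_reduce on a ones tensor are assumptions for the example, not the workflow's actual code.

```python
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _check_rank(rank: int, world_size: int, master_port: int) -> None:
    # Each spawned process initializes NCCL and exchanges data with its peers.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = str(master_port)
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    t = torch.ones(1, device=f'cuda:{rank}')
    dist.all_reduce(t)                     # inter-GPU communication check (defaults to SUM)
    assert t.item() == world_size          # every rank contributed a 1
    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = min(torch.cuda.device_count(), 8)  # dynamic GPU detection, up to 8 GPUs
    master_port = 12345 + int(time.time()) % 1000   # port for this test run
    mp.spawn(_check_rank, args=(world_size, master_port), nprocs=world_size, join=True)
    print(f'✓ Distributed init and all_reduce succeeded on {world_size} GPU(s)')
```

With `join=True`, `mp.spawn` raises if any rank exits abnormally, so an NCCL init or communication failure surfaces as a failed workflow step.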
.github/workflows/amd-health.yml
Outdated
```python
os.environ['MASTER_PORT'] = str(master_port)

try:
    dist.init_process_group('nccl', rank=rank, world_size=world_size, device_id=torch.device(f'cuda:{rank}'))
```
The device_id parameter in init_process_group should be an integer, not a torch.device object. Change to device_id=rank instead of device_id=torch.device(f'cuda:{rank}').
Suggested change:
```diff
-dist.init_process_group('nccl', rank=rank, world_size=world_size, device_id=torch.device(f'cuda:{rank}'))
+dist.init_process_group('nccl', rank=rank, world_size=world_size, device_id=rank)
```
.github/workflows/amd-health.yml
Outdated
```python
num_gpus = torch.cuda.device_count()
world_size = min(num_gpus, 8)  # Test with available GPUs, up to 8
master_port = 12345 + int(time.time()) % 1000  # One port for all ranks
```
The comment 'One port for all ranks' is misleading. This calculation generates a single port number for the entire distributed group, not one port per rank. Consider updating the comment to 'Generate unique port for this test run' for clarity.
Suggested change:
```diff
-master_port = 12345 + int(time.time()) % 1000  # One port for all ranks
+master_port = 12345 + int(time.time()) % 1000  # Generate unique port for this test run
```
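As a side note (not something this PR does), a time-derived port can still collide if two runs land on the same value; one common alternative is to ask the OS for a free port before spawning the ranks, as in this sketch:

```python
import socket


def find_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused ephemeral port, then release it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('localhost', 0))
        return s.getsockname()[1]


master_port = find_free_port()  # passed to every rank via MASTER_PORT
```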
.github/workflows/amd-health.yml
Outdated
```python
if p.exitcode != 0:
    print('✗ Distributed test failed')
    exit(1)
```
Using exit(1) in a workflow step may not properly propagate the failure. Consider using sys.exit(1) after importing sys, or restructure to use a return statement and check the result outside the Python code block.
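A minimal sketch of that restructuring, assuming the health check runs as an inline Python script inside the workflow step (process-management details are simplified, and the per-rank worker body is a placeholder):

```python
import sys

import torch
import torch.multiprocessing as mp


def _worker(rank: int) -> None:
    # Placeholder per-rank check; the real step would run NCCL init and a collective here.
    torch.cuda.set_device(rank)


def main() -> int:
    world_size = min(torch.cuda.device_count(), 8)
    ctx = mp.get_context('spawn')  # fresh interpreters so CUDA is safe in child processes
    procs = [ctx.Process(target=_worker, args=(rank,)) for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    if any(p.exitcode != 0 for p in procs):
        print('✗ Distributed test failed')
        return 1
    print('✓ Distributed test passed')
    return 0


if __name__ == '__main__':
    sys.exit(main())  # sys.exit raises SystemExit, so the step's exit code reflects the result
```

Returning a status from `main()` and calling `sys.exit` once at the top level keeps the failure path explicit and avoids relying on the built-in `exit()`, which is intended for interactive use.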
Description
Increased the timeout for the health check job and added a distributed health check step to verify GPU availability and initialization.
Checklist
Before submitting this PR, ensure the following steps have been completed:
- Run /verifyruns on your own server. (/verifyruns runs may take a little longer; the Modal run is typically quick.)
For more information on running a cluster bot on your own server, see
README.md.