Enhance GPU health check with distributed testing #360
Conversation
Increased timeout for health check job and added distributed health check step to verify GPU availability and initialization.
Pull Request Overview
This PR enhances the AMD GPU health check workflow by adding distributed GPU testing capabilities and increasing the timeout period. The changes expand the health check from simple GPU availability to include multi-GPU distributed initialization and communication testing.
- Increased job timeout from 5 to 10 minutes to accommodate distributed testing
- Added comprehensive distributed health check that tests NCCL initialization and inter-GPU communication
- Implemented dynamic GPU detection to test with available hardware (up to 8 GPUs); a rough sketch of this flow follows below
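As a rough illustration only, a distributed health check along these lines might look like the sketch below. The helper name `_check_rank` and the all_reduce on a ones tensor are assumptions for the example, not the workflow's actual code.

```python
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _check_rank(rank: int, world_size: int, master_port: int) -> None:
    # Each spawned process initializes NCCL and exchanges data with its peers.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = str(master_port)
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    t = torch.ones(1, device=f'cuda:{rank}')
    dist.all_reduce(t)                     # inter-GPU communication check (defaults to SUM)
    assert t.item() == world_size          # every rank contributed a 1
    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = min(torch.cuda.device_count(), 8)  # dynamic GPU detection, up to 8 GPUs
    master_port = 12345 + int(time.time()) % 1000   # port for this test run
    mp.spawn(_check_rank, args=(world_size, master_port), nprocs=world_size, join=True)
    print(f'✓ Distributed init and all_reduce succeeded on {world_size} GPU(s)')
```

With `join=True`, `mp.spawn` raises if any rank exits abnormally, so an NCCL init or communication failure surfaces as a failed workflow step.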
.github/workflows/amd-health.yml
Outdated
```python
os.environ['MASTER_PORT'] = str(master_port)

try:
    dist.init_process_group('nccl', rank=rank, world_size=world_size, device_id=torch.device(f'cuda:{rank}'))
```
The device_id parameter in init_process_group should be an integer, not a torch.device object. Change to device_id=rank instead of device_id=torch.device(f'cuda:{rank}').
Suggested change:
```diff
-dist.init_process_group('nccl', rank=rank, world_size=world_size, device_id=torch.device(f'cuda:{rank}'))
+dist.init_process_group('nccl', rank=rank, world_size=world_size, device_id=rank)
```
.github/workflows/amd-health.yml
Outdated
```python
num_gpus = torch.cuda.device_count()
world_size = min(num_gpus, 8)  # Test with available GPUs, up to 8
master_port = 12345 + int(time.time()) % 1000  # One port for all ranks
```
The comment 'One port for all ranks' is misleading. This calculation generates a single port number for the entire distributed group, not one port per rank. Consider updating the comment to 'Generate unique port for this test run' for clarity.
Suggested change:
```diff
-master_port = 12345 + int(time.time()) % 1000  # One port for all ranks
+master_port = 12345 + int(time.time()) % 1000  # Generate unique port for this test run
```
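As a side note (not something this PR does), a time-derived port can still collide if two runs land on the same value; one common alternative is to ask the OS for a free port before spawning the ranks, as in this sketch:

```python
import socket


def find_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused ephemeral port, then release it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('localhost', 0))
        return s.getsockname()[1]


master_port = find_free_port()  # passed to every rank via MASTER_PORT
```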
.github/workflows/amd-health.yml
Outdated
```python
if p.exitcode != 0:
    print('✗ Distributed test failed')
    exit(1)
```
Using exit(1) in a workflow step may not properly propagate the failure. Consider using sys.exit(1) after importing sys, or restructure to use a return statement and check the result outside the Python code block.
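A minimal sketch of that restructuring, assuming the health check runs as an inline Python script inside the workflow step (process-management details are simplified, and the per-rank worker body is a placeholder):

```python
import sys

import torch
import torch.multiprocessing as mp


def _worker(rank: int) -> None:
    # Placeholder per-rank check; the real step would run NCCL init and a collective here.
    torch.cuda.set_device(rank)


def main() -> int:
    world_size = min(torch.cuda.device_count(), 8)
    ctx = mp.get_context('spawn')  # fresh interpreters so CUDA is safe in child processes
    procs = [ctx.Process(target=_worker, args=(rank,)) for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    if any(p.exitcode != 0 for p in procs):
        print('✗ Distributed test failed')
        return 1
    print('✓ Distributed test passed')
    return 0


if __name__ == '__main__':
    sys.exit(main())  # sys.exit raises SystemExit, so the step's exit code reflects the result
```

Returning a status from `main()` and calling `sys.exit` once at the top level keeps the failure path explicit and avoids relying on the built-in `exit()`, which is intended for interactive use.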
Description
Increased the timeout for the health check job and added a distributed health check step to verify GPU availability and initialization.
Checklist
Before submitting this PR, ensure the following steps have been completed:
- Run /verifyruns on your own server. (/verifyruns runs may take a little longer; the Modal run is typically quick.)
For more information on running a cluster bot on your own server, see
README.md.