Add GitHub Actions workflow for NVIDIA health check #365
Conversation
Pull Request Overview
This PR adds a new GitHub Actions workflow to perform automated health checks on NVIDIA GPU infrastructure. The workflow is designed to verify that CUDA and PyTorch are functioning correctly on NVIDIA A100 hardware.
Key changes:
- Creates a scheduled workflow that runs nightly at 2 AM UTC
- Sets up a containerized environment with CUDA 12.4.0 on Ubuntu 22.04
- Performs a basic GPU health check using PyTorch
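
Based on the description above, the workflow skeleton might look roughly like this. This is a sketch, not the PR's actual file: the workflow, job, and runner labels are assumptions; only the nightly schedule, the CUDA 12.4.0 / Ubuntu 22.04 container, and the health-check command come from this review.

```yaml
name: NVIDIA Health Check

on:
  schedule:
    - cron: "0 2 * * *"   # nightly at 2 AM UTC

jobs:
  gpu-health-check:
    runs-on: [self-hosted, a100]   # assumed label for the A100 runners
    container:
      image: nvidia/cuda:12.4.0-base-ubuntu22.04   # assumed image tag
    steps:
      - name: Install PyTorch
        run: pip install torch
      - name: GPU Health Check
        run: python -c "import torch; torch.randn(5, device='cuda')"
```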
```yaml
env:
  CUDA_VISIBLE_DEVICES: 0
```
[nitpick] The environment variable CUDA_VISIBLE_DEVICES should be defined at the step level rather than at the job level, as it's only needed for the GPU health check step. This provides better clarity about which steps require GPU access.
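
A step-scoped placement along the lines of this suggestion might look as follows (a sketch; the surrounding indentation is assumed, and the step name and `run` command are taken from the workflow shown later in this review):

```yaml
      - name: GPU Health Check
        env:
          CUDA_VISIBLE_DEVICES: 0
        run: python -c "import torch; torch.randn(5, device='cuda')"
```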
```yaml
          pip install torch

      - name: GPU Health Check
        run: python -c "import torch; torch.randn(5, device='cuda')"
```
The GPU health check command should include error handling and more informative output. Consider adding a check to verify the tensor was created successfully and print GPU information for debugging purposes.
Suggested change — replace

```yaml
        run: python -c "import torch; torch.randn(5, device='cuda')"
```

with

```yaml
        run: |
          python -c "
          import torch
          try:
              print('PyTorch version:', torch.__version__)
              print('CUDA available:', torch.cuda.is_available())
              if torch.cuda.is_available():
                  print('CUDA device count:', torch.cuda.device_count())
                  print('CUDA device name:', torch.cuda.get_device_name(0))
                  t = torch.randn(5, device='cuda')
                  print('Tensor:', t)
                  print('Tensor device:', t.device)
                  assert t.device.type == 'cuda', 'Tensor is not on CUDA device'
                  print('GPU health check PASSED')
              else:
                  raise RuntimeError('CUDA is not available')
          except Exception as e:
              print('GPU health check FAILED:', e)
              raise SystemExit(1)
          "
```
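
The same check could also be factored into a small standalone script instead of an inline `python -c` one-liner (a sketch; the `gpu_health_check` helper name is an assumption, and this version degrades gracefully when PyTorch or a GPU is absent rather than crashing with an ImportError):

```python
import importlib.util


def gpu_health_check() -> bool:
    """Return True only if a tensor can be created on a CUDA device."""
    # Skip cleanly when PyTorch is not installed (e.g. on a CPU-only runner).
    if importlib.util.find_spec("torch") is None:
        print("GPU health check SKIPPED: PyTorch is not installed")
        return False
    import torch

    print("PyTorch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if not torch.cuda.is_available():
        print("GPU health check FAILED: CUDA is not available")
        return False
    print("CUDA device count:", torch.cuda.device_count())
    print("CUDA device name:", torch.cuda.get_device_name(0))
    t = torch.randn(5, device="cuda")
    print("Tensor device:", t.device)
    return t.device.type == "cuda"


if __name__ == "__main__":
    # Non-zero exit code makes the workflow step fail on an unhealthy GPU.
    raise SystemExit(0 if gpu_health_check() else 1)
```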