Add GitHub Actions workflow for NVIDIA health check #365

Closed

msaroufim wants to merge 1 commit into main from msaroufim-patch-9
Conversation

@msaroufim (Member)

No description provided.

Copilot AI review requested due to automatic review settings on October 13, 2025 at 23:30
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds a new GitHub Actions workflow to perform automated health checks on NVIDIA GPU infrastructure. The workflow is designed to verify that CUDA and PyTorch are functioning correctly on NVIDIA A100 hardware.

Key changes:

  • Creates a scheduled workflow that runs nightly at 2 AM UTC
  • Sets up a containerized environment with CUDA 12.4.0 on Ubuntu 22.04
  • Performs a basic GPU health check using PyTorch
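Based on this overview, the workflow likely resembles the following minimal sketch. Only the schedule, container image, and health-check command are described in the PR; the job name, runner labels, and install step are assumptions for illustration:

```yaml
name: NVIDIA Health Check

on:
  schedule:
    - cron: "0 2 * * *"   # nightly at 2 AM UTC

jobs:
  health-check:
    # Runner labels are an assumption; the PR targets NVIDIA A100 hardware
    runs-on: [self-hosted, a100]
    container:
      image: nvidia/cuda:12.4.0-devel-ubuntu22.04
    env:
      CUDA_VISIBLE_DEVICES: 0
    steps:
      - name: Install PyTorch
        run: pip install torch
      - name: GPU Health Check
        run: python -c "import torch; torch.randn(5, device='cuda')"
```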


Comment on lines +30 to +32

    env:
      CUDA_VISIBLE_DEVICES: 0

Copilot AI Oct 13, 2025

[nitpick] The environment variable CUDA_VISIBLE_DEVICES should be defined at the step level rather than at the job level, as it's only needed for the GPU health check step. This provides better clarity about which steps require GPU access.

Suggested change (moving the block from the job level to the step level; the two fragments are textually identical, only the placement differs):

    env:
      CUDA_VISIBLE_DEVICES: 0

          pip install torch

      - name: GPU Health Check
        run: python -c "import torch; torch.randn(5, device='cuda')"

Copilot AI Oct 13, 2025

The GPU health check command should include error handling and more informative output. Consider adding a check to verify the tensor was created successfully and print GPU information for debugging purposes.

Suggested change

    run: python -c "import torch; torch.randn(5, device='cuda')"

    run: |
      python -c "
      import torch
      try:
          print('PyTorch version:', torch.__version__)
          print('CUDA available:', torch.cuda.is_available())
          if torch.cuda.is_available():
              print('CUDA device count:', torch.cuda.device_count())
              print('CUDA device name:', torch.cuda.get_device_name(0))
              t = torch.randn(5, device='cuda')
              print('Tensor:', t)
              print('Tensor device:', t.device)
              assert t.device.type == 'cuda', 'Tensor is not on CUDA device'
              print('GPU health check PASSED')
          else:
              raise RuntimeError('CUDA is not available')
      except Exception as e:
          print('GPU health check FAILED:', e)
          exit(1)
      "

@github-actions

Coverage report

This PR does not seem to contain any modification to coverable code.

@msaroufim msaroufim closed this Oct 13, 2025