Add GitHub Actions workflow for NVIDIA health check #365
Conversation
Pull Request Overview
This PR adds a new GitHub Actions workflow to perform automated health checks on NVIDIA GPU infrastructure. The workflow is designed to verify that CUDA and PyTorch are functioning correctly on NVIDIA A100 hardware.
Key changes:
- Creates a scheduled workflow that runs nightly at 2 AM UTC
- Sets up a containerized environment with CUDA 12.4.0 on Ubuntu 22.04
- Performs a basic GPU health check using PyTorch
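
Based on the description above, the workflow skeleton might look roughly like this. This is a sketch, not the PR's actual file: the workflow, job, and runner labels are assumptions; only the nightly schedule, the CUDA 12.4.0 / Ubuntu 22.04 container, and the health-check command come from this review.

```yaml
name: NVIDIA Health Check

on:
  schedule:
    - cron: "0 2 * * *"   # nightly at 2 AM UTC

jobs:
  gpu-health-check:
    runs-on: [self-hosted, a100]   # assumed label for the A100 runners
    container:
      image: nvidia/cuda:12.4.0-base-ubuntu22.04   # assumed image tag
    steps:
      - name: Install PyTorch
        run: pip install torch
      - name: GPU Health Check
        run: python -c "import torch; torch.randn(5, device='cuda')"
```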
```yaml
env:
  CUDA_VISIBLE_DEVICES: 0
```
[nitpick] The environment variable CUDA_VISIBLE_DEVICES should be defined at the step level rather than at the job level, as it's only needed for the GPU health check step. This provides better clarity about which steps require GPU access.
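
A step-scoped placement along the lines of this suggestion might look as follows (a sketch; the surrounding indentation is assumed, and the step name and `run` command are taken from the workflow shown later in this review):

```yaml
      - name: GPU Health Check
        env:
          CUDA_VISIBLE_DEVICES: 0
        run: python -c "import torch; torch.randn(5, device='cuda')"
```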
```yaml
          pip install torch

      - name: GPU Health Check
        run: python -c "import torch; torch.randn(5, device='cuda')"
```
The GPU health check command should include error handling and more informative output. Consider adding a check to verify the tensor was created successfully and print GPU information for debugging purposes.
Suggested change — replace

```yaml
        run: python -c "import torch; torch.randn(5, device='cuda')"
```

with

```yaml
        run: |
          python -c "
          import torch
          try:
              print('PyTorch version:', torch.__version__)
              print('CUDA available:', torch.cuda.is_available())
              if torch.cuda.is_available():
                  print('CUDA device count:', torch.cuda.device_count())
                  print('CUDA device name:', torch.cuda.get_device_name(0))
                  t = torch.randn(5, device='cuda')
                  print('Tensor:', t)
                  print('Tensor device:', t.device)
                  assert t.device.type == 'cuda', 'Tensor is not on CUDA device'
                  print('GPU health check PASSED')
              else:
                  raise RuntimeError('CUDA is not available')
          except Exception as e:
              print('GPU health check FAILED:', e)
              raise SystemExit(1)
          "
```
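
The same check could also be factored into a small standalone script instead of an inline `python -c` one-liner (a sketch; the `gpu_health_check` helper name is an assumption, and this version degrades gracefully when PyTorch or a GPU is absent rather than crashing with an ImportError):

```python
import importlib.util


def gpu_health_check() -> bool:
    """Return True only if a tensor can be created on a CUDA device."""
    # Skip cleanly when PyTorch is not installed (e.g. on a CPU-only runner).
    if importlib.util.find_spec("torch") is None:
        print("GPU health check SKIPPED: PyTorch is not installed")
        return False
    import torch

    print("PyTorch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if not torch.cuda.is_available():
        print("GPU health check FAILED: CUDA is not available")
        return False
    print("CUDA device count:", torch.cuda.device_count())
    print("CUDA device name:", torch.cuda.get_device_name(0))
    t = torch.randn(5, device="cuda")
    print("Tensor device:", t.device)
    return t.device.type == "cuda"


if __name__ == "__main__":
    # Non-zero exit code makes the workflow step fail on an unhealthy GPU.
    raise SystemExit(0 if gpu_health_check() else 1)
```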