32 changes: 32 additions & 0 deletions .github/workflows/nvidia-new.yml
@@ -0,0 +1,32 @@
name: nvidia-arc

on:
  schedule:
    # Run nightly at 2 AM UTC
    - cron: '0 2 * * *'
  workflow_dispatch:
  push:
    branches: [main]

jobs:
  health-check:
    runs-on: [Nvidia-A100-8-x86-64]
    timeout-minutes: 5
    container:
      image: nvidia/cuda:12.4.0-devel-ubuntu22.04

    steps:
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install PyTorch
        run: |
          pip install torch

      - name: GPU Health Check
        run: python -c "import torch; torch.randn(5, device='cuda')"
Copilot AI Oct 13, 2025
The GPU health check command should include error handling and more informative output. Consider adding a check to verify the tensor was created successfully and print GPU information for debugging purposes.

Suggested change
        run: python -c "import torch; torch.randn(5, device='cuda')"
        run: |
          python -c "
          import torch
          try:
              print('PyTorch version:', torch.__version__)
              print('CUDA available:', torch.cuda.is_available())
              if torch.cuda.is_available():
                  print('CUDA device count:', torch.cuda.device_count())
                  print('CUDA device name:', torch.cuda.get_device_name(0))
                  t = torch.randn(5, device='cuda')
                  print('Tensor:', t)
                  print('Tensor device:', t.device)
                  assert t.device.type == 'cuda', 'Tensor is not on CUDA device'
                  print('GPU health check PASSED')
              else:
                  raise RuntimeError('CUDA is not available')
          except Exception as e:
              print('GPU health check FAILED:', e)
              exit(1)
          "


    env:
      CUDA_VISIBLE_DEVICES: 0
Comment on lines +30 to +32
Copilot AI Oct 13, 2025
[nitpick] The environment variable CUDA_VISIBLE_DEVICES should be defined at the step level rather than at the job level, as it's only needed for the GPU health check step. This provides better clarity about which steps require GPU access.

Suggested change
    env:
      CUDA_VISIBLE_DEVICES: 0
        env:
          CUDA_VISIBLE_DEVICES: 0
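For context, a minimal sketch of the step-scoped placement this comment describes, with the env block attached to the GPU Health Check step rather than the job (indentation assumed to match the existing steps; not part of the suggestion itself):

```yaml
      - name: GPU Health Check
        run: python -c "import torch; torch.randn(5, device='cuda')"
        env:
          CUDA_VISIBLE_DEVICES: 0
```

With this layout, only the health-check step sees the restricted device visibility; the Setup Python and Install PyTorch steps run with the runner's default environment.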
