New Runner Group #746
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #746   +/-   ##
=======================================
  Coverage   45.81%   45.81%
=======================================
  Files          61       61
  Lines       16911    16911
  Branches     1969     1969
=======================================
  Hits         7748     7748
  Misses       7944     7944
  Partials     1219     1219

☔ View full report in Codecov by Sentry.
Note to self:
#!/bin/bash
#SBATCH -Jtest # Job name
#SBATCH -N1 # Number of nodes required
#SBATCH -p gpuA100x4,gpuA100x4-interactive
#SBATCH --account=bdiy-delta-gpu
#SBATCH --gpus-per-node=4
#SBATCH -t 01:00:00 # Duration of the job (Ex: 15 mins)
#SBATCH -n 20
#SBATCH -o job_slug.out # Combined output and error messages file
#SBATCH --constraint="scratch"
# #SBATCH -W # Do not exit until the submitted job terminates.
set -e
set -x
# cd "\$SLURM_SUBMIT_DIR"
echo "Running in $(pwd):"
. ./mfc.sh load -c d -m gpu
gpu_count=$(nvidia-smi -L | wc -l) # number of GPUs on node
gpu_ids=$(seq -s ' ' 0 $((gpu_count-1))) # 0 1 2 ... gpu_count-1 (space-separated)
device_opts="-g $gpu_ids"
n_test_threads=$((gpu_count * 2)) # two test threads per GPU
echo $gpu_count
echo $gpu_ids
echo $device_opts
echo $n_test_threads
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/sw/spack/deltas11-2023-03/apps/linux-rhel8-zen3/nvhpc-24.1/openmpi-4.1.5-zkiklxi/lib/
./mfc.sh build -j 20 --gpu
./mfc.sh test --max-attempts 3 -a -j $n_test_threads $device_opts --no-build -- -c delta
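For reference, the device-list and thread-count arithmetic in the script above can be checked without a GPU node. This is a minimal sketch: the helper names gpu_id_list and test_thread_count are hypothetical (they are not part of mfc.sh), the GPU count is hard-coded as an assumption instead of being read from nvidia-smi, and a plain loop stands in for the seq call used in the script.

```shell
#!/bin/bash
# Hypothetical helpers (not part of mfc.sh) mirroring the arithmetic in the job script.

# Print space-separated device IDs 0..count-1 for a given GPU count.
# A loop is used here instead of seq so that a count of 0 yields an empty list.
gpu_id_list() {
    count=$1
    ids=""
    i=0
    while [ "$i" -lt "$count" ]; do
        ids="${ids:+$ids }$i"   # append with a single space between IDs
        i=$((i + 1))
    done
    echo "$ids"
}

# Two test threads per GPU, matching the n_test_threads line in the job script.
test_thread_count() {
    echo $(( $1 * 2 ))
}

gpu_count=4   # assumed value for illustration; the job script reads it from nvidia-smi -L
echo "-g $(gpu_id_list "$gpu_count")"
echo "-j $(test_thread_count "$gpu_count")"
```

With the assumed four GPUs this prints the -g 0 1 2 3 and -j 8 options that the test command would receive.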
I never did get the Delta runners working, and it is still a mystery why they fail. I learned that NCSA limits processes on Delta to 7 days of concurrent runtime before they are automatically killed, so using it for CI doesn't seem tenable regardless. I'm still curious why it never worked and will continue exploring on other machines (eventually). I am leaving this as an open draft PR for now in case there is something useful in it.
This allowed VERY granular views of what is happening within the toolchain when it is called.
Adds runner groups. This PR is mostly for testing.