Shared node for benching #326

sbryngelson · 2024-01-29T07:10:38Z

The idea here is that many benchmark jobs fail because we need an entire node to benchmark on only 1 or 2 GPUs. Taking over the whole node is ideal for benchmarking, but my view is that our testing should be mostly robust to someone else running peripheral tasks on the same node. This update lets us share nodes and gets us into the queue much quicker (at least that's my experience; we will see how the CI runs). We will also use other runners (like RG Violet/Quorra) for this purpose, so we will have multiple points of contact for performance.
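For illustration only, a shared-node request could be assembled roughly as below. This is a sketch, not the contents of the submit script in this PR: the helper name `shared_node_sbatch_args` and the script name `job.sh` are hypothetical, and the values are examples borrowed from the job details quoted later in this thread. The flags themselves are standard `sbatch` options. Whether other jobs can actually land on the same node depends on how the partition is configured.

```python
def shared_node_sbatch_args(partition, gpus, tasks, mem_per_cpu, time_limit):
    """Build sbatch options for a job that asks for part of one node.

    Hypothetical helper for illustration; not taken from the PR's submit script.
    """
    return [
        f"--partition={partition}",
        "--nodes=1",                     # keep every task and GPU on a single node
        f"--ntasks-per-node={tasks}",
        f"--gpus-per-node={gpus}",       # request only the GPUs being benchmarked
        f"--mem-per-cpu={mem_per_cpu}",
        f"--time={time_limit}",
        # --exclusive is deliberately omitted so the scheduler may place
        # other jobs on the remaining cores and GPUs of the node.
    ]


if __name__ == "__main__":
    args = shared_node_sbatch_args(
        "gpu-v100", gpus=2, tasks=4, mem_per_cpu="4G", time_limit="04:00:00"
    )
    # "job.sh" is a placeholder script name.
    print("sbatch " + " ".join(args) + " job.sh")
```

Pinning `--nodes=1` and using a per-node GPU count, rather than a per-job total, is one way to stop the scheduler from spreading a small request across machines.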

sbryngelson · 2024-01-29T07:20:51Z

Update: It looks like my new submit script may be giving us two nodes with one GPU each.

login-phoenix-slurm-1: 6/sbryngelson3 $ scontrol show job 4681047
JobId=4681047 JobName=MFC-bench-gpu
   UserId=sbryngelson3(3048356) GroupId=p-sbryngelson3(451953) MCS_label=N/A
   Priority=22 Nice=0 Account=gts-sbryngelson3 QOS=embers
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:02:30 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=02:16:57 EligibleTime=02:16:57
   AccrueTime=Unknown
   StartTime=02:17:00 EndTime=06:17:00 Deadline=N/A
   PreemptEligibleTime=03:17:00 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=02:17:00 Scheduler=Main
   Partition=gpu-v100 AllocNode:Sid=login-phoenix-slurm-1:190474
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=atl1-1-02-003-36-0,atl1-1-02-007-31-0
   BatchHost=atl1-1-02-003-36-0
   NumNodes=2 NumCPUs=24 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=4,mem=16G,node=1,billing=17098,gres/gpu=2
   AllocTRES=cpu=24,mem=96G,node=2,billing=17098,gres/gpu=2,gres/gpu:v100=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=V100-16GB DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr
   StdErr=/storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr/bench-gpu.out
   StdIn=/dev/null
   StdOut=/storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr/bench-gpu.out
   Power=
   CpusPerTres=gpu:12
   TresPerJob=gres/gpu:2

Not sure how much this matters at the moment.
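The mismatch is visible in the output above: `ReqTRES` lists `node=1` while `AllocTRES` lists `node=2`. A small check like the following sketch could flag this automatically from `scontrol show job` text. The function names are hypothetical, and the parsing assumes the whitespace-separated `KEY=VALUE` layout shown above. The requested node count appears to act as a minimum rather than an exact figure, so treat a difference as a hint, not an error.

```python
def parse_scontrol(text):
    """Collect KEY=VALUE fields from `scontrol show job` output."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            # Keep the first occurrence of each key.
            fields.setdefault(key, value)
    return fields


def parse_tres(tres):
    """Split a TRES string such as 'cpu=4,mem=16G,node=1' into a dict."""
    return dict(item.split("=", 1) for item in tres.split(",") if "=" in item)


def node_counts(text):
    """Return (requested_nodes, allocated_nodes) for one job."""
    fields = parse_scontrol(text)
    requested = int(parse_tres(fields["ReqTRES"])["node"])
    allocated = int(parse_tres(fields["AllocTRES"])["node"])
    return requested, allocated


# Three lines copied from the job output above.
sample = """
NumNodes=2 NumCPUs=24 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=4,mem=16G,node=1,billing=17098,gres/gpu=2
AllocTRES=cpu=24,mem=96G,node=2,billing=17098,gres/gpu=2,gres/gpu:v100=2
"""

if __name__ == "__main__":
    req, alloc = node_counts(sample)
    if alloc != req:
        print(f"requested {req} node(s) but was allocated {alloc}")
```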

sbryngelson · 2024-01-29T19:42:23Z

@henryleberre this PR is failing but the error is not clear to me. The logs seem fine. Several parts of this seem quite fragile.

 Comparing Bencharks: master/bench-cpu.yaml is x times slower than pr/bench-cpu.yaml.
 Warning: Metadata of lhs and rhs are not equal.

mfc: ERROR > mfc.py finished with a 1 exit code.
mfc: (venv) Exiting the Python virtual environment.
Error: Process completed with exit code 1.
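The "Metadata of lhs and rhs are not equal" warning does not say which entries differ. As a debugging aid, a recursive comparison like the sketch below would list them. It works on plain dictionaries, so it assumes the two benchmark files have already been loaded. The function name `diff_metadata` and the example keys are hypothetical and do not reflect the actual layout of the bench files.

```python
MISSING = "<missing>"  # marker for a key present on only one side


def diff_metadata(lhs, rhs, prefix=""):
    """Return (key_path, lhs_value, rhs_value) for every differing entry."""
    diffs = []
    for key in sorted(set(lhs) | set(rhs), key=str):
        path = f"{prefix}.{key}" if prefix else str(key)
        a = lhs.get(key, MISSING)
        b = rhs.get(key, MISSING)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse so nested differences are reported by full path.
            diffs.extend(diff_metadata(a, b, path))
        elif a != b:
            diffs.append((path, a, b))
    return diffs


if __name__ == "__main__":
    # Hypothetical metadata for the two sides of the comparison.
    master = {"build": {"gpu": True, "mode": "release"}}
    pr = {"build": {"gpu": True, "mode": "debug"}, "host": "node-a"}
    for path, a, b in diff_metadata(master, pr):
        print(f"{path}: master={a!r} pr={b!r}")
```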

Update bench.sh

c6d0c6a

sbryngelson requested a review from henryleberre as a code owner January 29, 2024 07:10

sbryngelson added 2 commits January 29, 2024 02:12

Update submit.sh

692b8e5

Update submit.sh

cbad0c7

sbryngelson changed the title ~~Update bench.sh~~ Shared node for benching Jan 29, 2024

sbryngelson marked this pull request as draft January 29, 2024 07:25

sbryngelson closed this Feb 4, 2024

sbryngelson deleted the fast-benchmarking branch February 4, 2024 14:32
