Shared node for benching #326

Closed
wants to merge 3 commits into from

@sbryngelson sbryngelson commented Jan 29, 2024

The idea here is that many benchmark jobs fail because we currently require an entire node to benchmark on only 1 or 2 GPUs. Taking over the whole node is ideal for benchmarking, but my view is that our testing should be mostly robust to someone else running peripheral tasks on the same node. This update lets us share nodes and gets us through the queue much faster (at least in my experience; we will see how the CI runs). We will also use other runners (like RG Violet/Quorra) for this purpose, so we will have multiple points of contact for performance.

@sbryngelson sbryngelson changed the title from "Update bench.sh" to "Shared node for benching" Jan 29, 2024
@sbryngelson
Member Author

Update: It looks like my new submit script may be giving us two nodes with one GPU each.

login-phoenix-slurm-1: 6/sbryngelson3 $ scontrol show job 4681047
JobId=4681047 JobName=MFC-bench-gpu
   UserId=sbryngelson3(3048356) GroupId=p-sbryngelson3(451953) MCS_label=N/A
   Priority=22 Nice=0 Account=gts-sbryngelson3 QOS=embers
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:02:30 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=02:16:57 EligibleTime=02:16:57
   AccrueTime=Unknown
   StartTime=02:17:00 EndTime=06:17:00 Deadline=N/A
   PreemptEligibleTime=03:17:00 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=02:17:00 Scheduler=Main
   Partition=gpu-v100 AllocNode:Sid=login-phoenix-slurm-1:190474
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=atl1-1-02-003-36-0,atl1-1-02-007-31-0
   BatchHost=atl1-1-02-003-36-0
   NumNodes=2 NumCPUs=24 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=4,mem=16G,node=1,billing=17098,gres/gpu=2
   AllocTRES=cpu=24,mem=96G,node=2,billing=17098,gres/gpu=2,gres/gpu:v100=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=V100-16GB DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr
   StdErr=/storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr/bench-gpu.out
   StdIn=/dev/null
   StdOut=/storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr/bench-gpu.out
   Power=
   CpusPerTres=gpu:12
   TresPerJob=gres/gpu:2

Not sure how much this matters at the moment.
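
The mismatch is visible in the output above: `ReqTRES` lists `node=1` while `AllocTRES` lists `node=2` (and `NumNodes=2`). As a rough sketch only, a check like the following could flag that situation automatically. The function names are hypothetical and not part of MFC or Slurm; it assumes the `KEY=VALUE` layout that `scontrol show job` printed here.

```python
def parse_scontrol(text: str) -> dict[str, str]:
    """Collect whitespace-separated KEY=VALUE tokens into a dict.

    Splits each token on its first '=' only, so values such as
    'cpu=4,mem=16G,node=1' stay intact.
    """
    fields: dict[str, str] = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields.setdefault(key, value)
    return fields


def parse_tres(value: str) -> dict[str, str]:
    """Split a TRES string like 'cpu=4,mem=16G,node=1' into a dict."""
    return dict(item.split("=", 1) for item in value.split(","))


def node_counts(text: str) -> tuple[int, int]:
    """Return (requested_nodes, allocated_nodes) from scontrol output."""
    fields = parse_scontrol(text)
    requested = int(parse_tres(fields["ReqTRES"])["node"])
    allocated = int(parse_tres(fields["AllocTRES"])["node"])
    return requested, allocated


# Excerpt copied from the job output above.
SAMPLE = """
   NumNodes=2 NumCPUs=24 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=4,mem=16G,node=1,billing=17098,gres/gpu=2
   AllocTRES=cpu=24,mem=96G,node=2,billing=17098,gres/gpu=2,gres/gpu:v100=2
"""

requested, allocated = node_counts(SAMPLE)
if allocated > requested:
    print(f"job spans {allocated} nodes but {requested} was requested")
```

For the sample above, `node_counts` returns `(1, 2)`, which matches the two entries in `NodeList`.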

@sbryngelson sbryngelson marked this pull request as draft January 29, 2024 07:25
@sbryngelson
Member Author

sbryngelson commented Jan 29, 2024

@henryleberre this PR is failing but the error is not clear to me. The logs seem fine. Several parts of this seem quite fragile.

 Comparing Bencharks: master/bench-cpu.yaml is x times slower than pr/bench-cpu.yaml.
 Warning: Metadata of lhs and rhs are not equal.

mfc: ERROR > mfc.py finished with a 1 exit code.
mfc: (venv) Exiting the Python virtual environment.
Error: Process completed with exit code 1.
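
The warning says the metadata of the two benchmark summaries differ but not which fields. One way to narrow that down, sketched here on plain dictionaries, is to list every key whose value is not the same on both sides. This is an illustration only: `diff_metadata` is a hypothetical helper, and the example keys are made up rather than taken from MFC's actual YAML layout.

```python
MISSING = "<missing>"  # placeholder shown when a key exists on one side only


def diff_metadata(lhs: dict, rhs: dict) -> dict:
    """Return {key: (lhs_value, rhs_value)} for every key whose values differ."""
    differences = {}
    for key in sorted(set(lhs) | set(rhs)):
        left = lhs.get(key, MISSING)
        right = rhs.get(key, MISSING)
        if left != right:
            differences[key] = (left, right)
    return differences


# Hypothetical metadata; the real keys in bench-cpu.yaml are not shown in this thread.
master_meta = {"nodes": 1, "gpus": 2, "partition": "gpu-v100"}
pr_meta = {"nodes": 2, "gpus": 2}

for key, (left, right) in diff_metadata(master_meta, pr_meta).items():
    print(f"{key}: master={left!r} pr={right!r}")
```

With the made-up inputs above, the differing keys are `nodes` and `partition`.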

@sbryngelson sbryngelson closed this Feb 4, 2024
@sbryngelson sbryngelson deleted the fast-benchmarking branch February 4, 2024 14:32