Hi! I've been using `ts` for a very long time and am currently running into a small problem.
I'm planning to run around 500-1000 experiments, for which ts is perfect. I have successfully done so with many more in the past.
I let it use all 16 GPUs with `TS_SLOTS=16`, and it runs all 16 jobs in parallel. So far, so good. However, the individual experiments are not very memory-hungry: each takes only around 10–20% of a GPU's memory, so I could run 64 or even 128 in parallel.
So I start `TS_VISIBLE_DEVICES=0..15 TS_SLOTS=128 ts --set_gpu_free_perc 20` and schedule the jobs with a bash script (~1000 calls to `ts -G 1 ...`).
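For reference, the submission script is roughly the following sketch, where `train.py` and its `--seed` flag are placeholders for the real experiment command:

```shell
# Enqueue ~1000 experiments, each requesting one GPU slot via `ts -G 1`.
# The `echo` makes this a dry run that only prints the commands;
# drop it to actually submit the jobs.
submit_all() {
  for seed in $(seq 1 1000); do
    echo ts -G 1 python train.py --seed "$seed"
  done
}
```

Running `submit_all` without the `echo` fills the queue, and `ts` then decides when each job may start.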
The first 16 jobs are running while the rest sit in the allocating state. Starting an individual experiment and allocating its ~20% of GPU memory takes anywhere from a few seconds to a few minutes, so at launch time nearly all memory still appears free.
When does `ts` evaluate whether at least 20% of memory is free? If it checks only at the moment a job starts, it might launch all 128 simultaneously and then run into OOM once the jobs actually allocate their memory, especially if one of the runs allocates a bit more than expected.
I hope I have explained my issue clearly; it is related to #8.
Do you have a suggestion?