Hi! I've been using `ts` for a very long time and am currently running into a small problem.
I'm planning to run around 500-1000 experiments, for which ts is perfect. I have successfully done so with many more in the past.
I let it use all 16 GPUs with `TS_SLOTS=16`, and it runs all 16 jobs in parallel. So far, so good. However, the individual experiments are not very memory-hungry: each takes only around 10–20% of a GPU's memory, so I could run 64 or even 128 in parallel.
So I start `TS_VISIBLE_DEVICES=0..15 TS_SLOTS=128 ts --set_gpu_free_perc 20` and schedule the jobs with a bash script (~1000 calls to `ts -G 1 ...`).
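For reference, the submission script is roughly the following sketch, where `train.py` and its `--seed` flag are placeholders for the real experiment command:

```shell
# Enqueue ~1000 experiments, each requesting one GPU slot via `ts -G 1`.
# The `echo` makes this a dry run that only prints the commands;
# drop it to actually submit the jobs.
submit_all() {
  for seed in $(seq 1 1000); do
    echo ts -G 1 python train.py --seed "$seed"
  done
}
```

Running `submit_all` without the `echo` fills the queue, and `ts` then decides when each job may start.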
The first 16 jobs are running while the rest sit in the allocating state. Starting an individual experiment and allocating its ~20% of GPU memory takes anywhere from a few seconds to a few minutes, so at launch time nearly all memory still appears free.
When does `ts` evaluate whether at least 20% of memory is free? If it checks only at the moment a job starts, it might launch all 128 simultaneously and then run into OOM once the jobs actually allocate their memory, especially if one of the runs allocates a bit more than expected.
I hope I have explained my issue clearly; it is related to #8.
Do you have a suggestion?