When is GPU memory measured when running a large list? #59

@bermeitinger-b

Description

Hi! I've been using ts for a very long time and am currently running into a small problem.

I'm planning to run around 500-1000 experiments, for which ts is perfect. I have successfully done so with many more in the past.

I let it use all 16 GPUs with TS_SLOTS=16, and it runs 16 jobs in parallel. So far, so good. However, the individual experiments are not very memory-hungry: each takes only around 10-20% of a GPU's memory, so I could run 64 or even 128 in parallel.

So I start the server with TS_VISIBLE_DEVICES=0..15 TS_SLOTS=128 ts --set_gpu_free_perc 20 and schedule the jobs with a bash script (~1000 calls to ts -G 1 ...), roughly as sketched below.
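For concreteness, the submission loop is essentially this (run_experiment.sh and the configs/ glob are placeholders for my actual setup):

```bash
#!/usr/bin/env bash
# Queue one ts job per experiment; each job requests a single GPU (-G 1).
# run_experiment.sh and configs/*.yaml stand in for my real entry point and configs.
for cfg in configs/*.yaml; do
    ts -G 1 ./run_experiment.sh "$cfg"
done
```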

The first 16 are running while the rest sit in the allocating state. Starting an individual experiment and actually allocating its ~20% of GPU memory takes anywhere from a few seconds to a few minutes, so at the moment a job starts, the GPU's memory still looks entirely free.
When does ts evaluate that at least 20% of memory is free? If the check happens only once, at start time, it might start all 128 jobs simultaneously and then run into OOM as soon as one of the runs allocates a bit more than expected.

I hope I have explained my issue clearly; it is related to #8.

Do you have a suggestion?
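In case it helps to have something concrete: one stopgap I could imagine is a per-job guard that re-checks free memory itself before handing off to the experiment. This is purely a sketch; gpu_guard.sh is hypothetical, and it assumes ts exports CUDA_VISIBLE_DEVICES with a single GPU index for each job:

```bash
#!/usr/bin/env bash
# gpu_guard.sh (hypothetical): wait until the assigned GPU really has
# >= 20% of its memory free, then exec the real experiment command.
# Assumes ts sets CUDA_VISIBLE_DEVICES to a single GPU index.
gpu="${CUDA_VISIBLE_DEVICES:-0}"
total=$(nvidia-smi -i "$gpu" --query-gpu=memory.total --format=csv,noheader,nounits)
while :; do
    free=$(nvidia-smi -i "$gpu" --query-gpu=memory.free --format=csv,noheader,nounits)
    (( free * 100 / total >= 20 )) && break
    sleep $(( RANDOM % 20 + 5 ))   # jittered back-off so jobs don't all re-check at once
done
exec "$@"                          # replace the shell with the real experiment
```

Each submission would then become ts -G 1 ./gpu_guard.sh ./run_experiment.sh "$cfg". Of course, this only narrows the race window rather than closing it (the guard can still pass before earlier jobs have finished allocating), which is why I'd prefer a check on the ts side.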
