Slurm job fails with OOM when no resource limits are set #56

@lchladek

My gourd.toml file does not have a resource_limits section, and jobs fail almost immediately with an out-of-memory (OOM) error.

Excerpt from the gourd.toml docs:

   Example
       An example Slurm Configuration:

       [slurm]
       experiment_name = "my test experiment"
       output_folder = "./slurmout/"
       partition = "compute"
       account = "Education-EEMCS-MSc-CS"

   RESOURCE LIMITS
       To run on Slurm one must also specify resource limits.

The docs imply that a config without resource limits is invalid, and it is not clear whether any defaults apply. However, gourd accepted my file and submitted it to Slurm, which is how I hit the OOM error. There should be a check for missing resource limits before the job is submitted.

Also, it would be nice if the short-form gourd status displayed the OOM status info line instead of just an exit code.
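As a rough sketch of what the short-form line could contain, assuming the job state is available as the string Slurm reports through `sacct` (where an OOM kill appears as `OUT_OF_MEMORY`): the helper name `short_status` and the output wording are hypothetical, and how gourd actually obtains the state is not something this issue establishes.

```python
def short_status(slurm_state: str, exit_code: int) -> str:
    """Build a one-line status; hypothetical helper, not gourd's API."""
    # OUT_OF_MEMORY is the job state Slurm reports for an OOM kill.
    if slurm_state == "OUT_OF_MEMORY":
        return f"failed: out of memory (exit code {exit_code})"
    if slurm_state == "COMPLETED" and exit_code == 0:
        return "success"
    # Fall back to the raw state so no information is hidden.
    return f"failed: {slurm_state.lower()} (exit code {exit_code})"
```

The point is only that the cause appears next to the exit code, so an OOM failure is recognisable without opening the long-form status.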

Labels: bug (Something isn't working)
