sacct cmd execution crashes a calibration pipeline if slurmdbd is down #115

Open
nefrathenrici opened this issue Oct 18, 2024 · 0 comments

sacct exits with an error when the Slurm database daemon (slurmdbd) is down. The failure is not caught, so the whole pipeline exits.

If the sacct call fails, we should catch the error and fall back to squeue. We should also warn the user, because without the accounting database we can't determine whether a completed job succeeded.
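
A minimal sketch of that fallback, assuming the status query goes through `readchomp` on a `Cmd` as in the stack trace below. The function name `job_status_output` and the `(output, source)` return shape are hypothetical and not taken from the codebase; the caller supplies both commands.

```julia
# Hypothetical helper (not from the codebase): try sacct first, fall back to squeue.
# Returns (output, source), where source is :sacct or :squeue, so the caller
# knows whether completion status can be trusted.
function job_status_output(sacct_cmd::Cmd, squeue_cmd::Cmd)
    try
        return (readchomp(sacct_cmd), :sacct)
    catch err
        # A non-zero exit raises ProcessFailedException; a missing binary
        # raises Base.IOError. Anything else is unexpected and is rethrown.
        err isa Union{ProcessFailedException, Base.IOError} || rethrow()
        @warn "sacct failed (is slurmdbd down?); falling back to squeue. " *
              "Completed jobs can no longer be verified as successful." exception = err
        return (readchomp(squeue_cmd), :squeue)
    end
end
```

The caller would need to branch on the returned source, since squeue's output format differs from sacct's and a job that has already left the queue may not show up in it.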

Error:

sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:head1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
ERROR: LoadError: failed process: Process(`sacct --allocations -u esmbuild --starttime now-1hour -o Submit,Start -n`, ProcessExited(1)) [1]
Stacktrace:
 [1] pipeline_error
   @ ./process.jl:565 [inlined]
 [2] read(cmd::Cmd)
   @ Base ./process.jl:449
 [3] read
   @ ./process.jl:458 [inlined]
 [4] readchomp(x::Cmd)
   @ Base ./io.jl:974
 [5] top-level scope
   @ /central/scratch/esm/slurm-buildkite/climaatmos-ci/21206/climaatmos-ci/calibration/test/e2e_test.jl:108
in expression starting at /central/scratch/esm/slurm-buildkite/climaatmos-ci/21206/climaatmos-ci/calibration/test/e2e_test.jl:107