Profile not working well on AWS ParallelCluster with Slurm #102

cbrueffer opened this issue Jul 20, 2022 · 0 comments

AWS ParallelCluster allows running a Slurm cluster on AWS. Here are some things that do not work well with this profile; I'm documenting them both for other users trying this and to possibly make the profile work out of the box on default cluster installations.

Some resources:

Tested with Snakemake version 7.8.5.

Issues:

  • ParallelCluster's Slurm cannot be used with sbatch --mem: using this option sends nodes straight into the DRAINED state; see also https://blog.ronin.cloud/slurm-parallelcluster-troubleshooting/
  • By default ParallelCluster does not come with accounting, so sacct does not work. While the job status script supports querying via scontrol, this also led to issues in my case (to get this far in the first place I removed mem/mem-per-CPU from RESOURCE_MAPPING in slurm-submit.py so jobs would run, see above and the sketch after the log below):
127.0.0.1 - - [19/Jul/2022 14:17:45] "POST /job/register/11557 HTTP/1.1" 200 -
Submitted job 3661 with external jobid '11557'.

[Tue Jul 19 14:17:45 2022]
rule foo:
    input: results/xxx.vcf.gz
    output: results/xxx.pdf
    jobid: 3568
    reason: Missing output files: results/xxx.pdf
    wildcards: sample=xxx
    resources: mem_mb=1000, disk_mb=100000, tmpdir=/scratch, runtime=1000, partition=compute-small

[...]
Submitted job 3747 with external jobid '11561'.
/bin/sh: 11557: command not found
WorkflowError:
Failed to obtain job status. See above for error message.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Cluster sidecar process has terminated (retcode=0).
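
For reference, the workaround mentioned above amounts to dropping the memory keys from RESOURCE_MAPPING in slurm-submit.py, so the profile never passes --mem or --mem-per-cpu to sbatch. A rough sketch of the change (the exact mapping may differ between profile versions):

# In slurm-submit.py: RESOURCE_MAPPING translates Snakemake resource names
# into sbatch options. Commenting out the memory entries keeps the profile
# from ever submitting with --mem / --mem-per-cpu, which ParallelCluster's
# Slurm punishes by draining the node. (Sketch only; check your copy of the
# script for the actual mapping.)
RESOURCE_MAPPING = {
    "time": ("time", "runtime", "walltime"),
    # "mem": ("mem", "mem_mb", "ram", "memory"),
    # "mem-per-cpu": ("mem-per-cpu", "mem_per_cpu", "mem_per_thread"),
    "nodes": ("nodes", "nnodes"),
    "partition": ("partition", "queue"),
}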

config.yaml:

restart-times: 3
jobscript: "slurm-jobscript.sh"
cluster: "slurm-submit.py"
cluster-status: "slurm-status.py"
cluster-status: ""
cluster-sidecar: "slurm-sidecar.py"
cluster-cancel: "scancel"
max-jobs-per-second: 1
max-status-checks-per-second: 10
local-cores: 1
latency-wait: 60

# Example resource configuration
default-resources:
  - runtime=1000
#  - mem_mb=4500
  - disk_mb=100000
  - tmpdir="/scratch"
  - partition="compute-small"
# # set-threads: map rule names to threads
# set-threads:
#   - single_core_rule=1
#   - multi_core_rule=10
# # set-resources: map rule names to resources in general
# set-resources:
#   - high_memory_rule:mem_mb=12000
#   - long_running_rule:runtime=1200

settings.json:

{
    "SBATCH_DEFAULTS": "",
    "CLUSTER_NAME": "",
    "CLUSTER_CONFIG": ""
}
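
Since accounting is not available, the cluster-status check has to rely on scontrol alone. For context, a minimal sketch of what such a status script could look like (hypothetical, not the profile's actual slurm-status.py; Snakemake calls it with the external job ID and expects "success", "failed" or "running" on stdout):

#!/usr/bin/env python3
"""Minimal scontrol-based status check (sketch, assumes no accounting)."""
import re
import subprocess
import sys

jobid = sys.argv[1]

FAILED_STATES = {"BOOT_FAIL", "CANCELLED", "DEADLINE", "FAILED",
                 "NODE_FAIL", "OUT_OF_MEMORY", "PREEMPTED", "TIMEOUT"}

try:
    # "scontrol show -o job <id>" prints all job fields on one line,
    # e.g. "JobId=11557 ... JobState=RUNNING ...".
    out = subprocess.run(
        ["scontrol", "show", "-o", "job", jobid],
        check=True, capture_output=True, text=True,
    ).stdout
    match = re.search(r"JobState=(\w+)", out)
    state = match.group(1) if match else "UNKNOWN"
except subprocess.CalledProcessError:
    # Finished jobs are purged from scontrol's view after a while; without
    # accounting there is no way to distinguish success from failure then.
    state = "UNKNOWN"

if state == "COMPLETED":
    print("success")
elif state in FAILED_STATES:
    print("failed")
else:
    print("running")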