Update README and licence #33

Draft: wants to merge 7 commits into base: main
1 change: 1 addition & 0 deletions LICENSE
@@ -1,6 +1,7 @@
MIT License

Copyright (c) 2019 Giovanni Torres
Copyright (c) 2023 StackHPC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
21 changes: 18 additions & 3 deletions README.md
@@ -1,7 +1,16 @@
# Slurm Docker Cluster

This is a multi-container Slurm cluster using Kubernetes. The Slurm cluster Helm chart creates a named volume for persistent storage of MySQL data files. By default, it also installs the
RookNFS Helm chart (also in this repo) to provide shared storage across the Slurm cluster nodes.
A Helm chart and Dockerfile to run a multi-container Slurm cluster on Kubernetes, featuring:

* Control, login, slurmd (worker), slurmdbd and mariadb pods.
* A shared `/home` directory across the Slurm pods, by default via an install of RookNFS to provide a storage class with ReadWriteMany (RWX) capabilities.
* SSH and HTTPS access to the login pod with an Open OnDemand web GUI.
* A single slurmd pod per Kubernetes worker node with automatic definition of Slurm node memory and CPU configuration.
* Slurm jobs run inside the slurmd pods, using host networking for maximum MPI performance.
* Open MPI installed with support for Slurm's `srun` launcher (via `pmix`) - see example below.
* Support for containerised jobs via Apptainer - see example below.
* Job accounting information retained across container upgrades via a persistent volume claim.
* Credentials/secrets are generated during the Helm install, not embedded in images.

## Dependencies

@@ -178,4 +187,10 @@ and then restart the other dependent deployments to propagate changes:
kubectl rollout restart deployment slurmd slurmctld login slurmdbd
```
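
After a restart it can help to confirm that every deployment has finished rolling out. The following is a minimal sketch, assuming the four deployment names from the restart command above; the `wait_for_rollouts` helper name and the 120 second timeout are illustrative choices, not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper: block until each restarted deployment reports a
# completed rollout. Deployment names come from the restart command above;
# the 120s timeout is an arbitrary assumption and may need tuning.
wait_for_rollouts() {
    for name in slurmd slurmctld login slurmdbd; do
        # `kubectl rollout status` exits non-zero if the rollout fails
        # or does not complete within the timeout.
        kubectl rollout status "deployment/${name}" --timeout=120s || return 1
    done
}
```

Calling `wait_for_rollouts` directly after the `kubectl rollout restart` command returns non-zero as soon as one deployment fails to become ready, which makes it usable in scripts.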

# Known Issues
# Limitations and Known Issues
- Only a single cluster should be deployed per Kubernetes namespace.
- Only the `rocky` user is currently supported.
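
One way to respect the single-cluster-per-namespace limitation is to give each cluster its own namespace and refuse to install into a namespace that is already in use. The sketch below is only illustrative: `CHART_PATH`, the default release name `slurm` and the `install_cluster` helper are hypothetical placeholders, and the check is deliberately conservative (it rejects a namespace containing any Helm release, not just a Slurm cluster).

```shell
#!/bin/sh
# Hypothetical helper: install one Slurm cluster release into its own namespace.
# CHART_PATH and the release name are placeholders, not values from this repo.
CHART_PATH="${CHART_PATH:-./slurm-cluster-chart}"

install_cluster() {
    namespace="$1"
    release="${2:-slurm}"
    # List existing Helm releases in the namespace (names only).
    existing="$(helm list --namespace "$namespace" --short)" || return 1
    if [ -n "$existing" ]; then
        echo "namespace $namespace already contains release(s): $existing" >&2
        return 1
    fi
    # Create the namespace if needed and install the chart into it.
    helm install "$release" "$CHART_PATH" --namespace "$namespace" --create-namespace
}
```

For example, `install_cluster team-a` followed by `install_cluster team-b` would place two clusters in separate namespaces, while a second `install_cluster team-a` would be rejected.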

# Acknowledgements

Originally based on https://github.com/giovtorres/slurm-docker-cluster which defines a docker-compose-based cluster.