I'm trying soperator 1.16.1. I had to build the populate_jail and worker_slurmd images myself because the image pulls kept failing due to image size. I'm using NFS as the shared storage for this test.
After applying the slurm-cluster Helm chart, the slurm1-populate-jail pod finishes running and exits after some time. I assume it prepares the jail root for the login and worker nodes.
However, the login and worker pods all fail with CrashLoopBackOff. Looking into the pod logs gives the same messages everywhere:
Starting slurmd entrypoint script
cgroup v2 detected, creating cgroup for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9fa86a0e_4917_4b4a_a37d_afb81545c892.slice/cri-containerd-d66f3229577110ba455bf50d4cc12422dbef9dd7fb829dec0a6d70ca559b8934.scope
Link users from jail
Link home from jail because slurmd uses it
Bind-mount slurm configs from K8S config map
Make ulimits as big as possible
Apply sysctl limits from /etc/sysctl.conf
vm.max_map_count = 655300
Update linker cache
Complement jail rootfs
+ set -e
+ getopts j:u:wh flag
+ case"${flag}"in
+ jaildir=/mnt/jail
+ getopts j:u:wh flag
+ case "${flag}" in
+ upperdir=/mnt/jail.upper
+ getopts j:u:wh flag
+ case "${flag}" in
+ worker=1
+ getopts j:u:wh flag
+ '[' -z /mnt/jail ']'
+ '[' -z /mnt/jail.upper ']'
+ pushd /mnt/jail
+ echo 'Bind-mount virtual filesystems'
/mnt/jail /
Bind-mount virtual filesystems
+ mount -t proc /proc proc/
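For reference, a quick check to confirm which of the expected mount targets actually exist in the jail would be something like the following (the path is assumed from the trace above; run it wherever the NFS jail export is mounted):
# List the virtual-filesystem mount points that complement_jail.sh expects
ls -ld /mnt/jail/proc /mnt/jail/sys /mnt/jail/dev /mnt/jail/run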
In case it helps, I also captured the file list of the shared jail volume directory:
I think the jail volume is mounted at /mnt/jail, and then /opt/bin/slurm/complement_jail.sh -j /mnt/jail -u /mnt/jail.upper is triggered by the container entrypoint script. The script changes its working directory to /mnt/jail and then tries to mount the virtual filesystems, but the mount points are apparently not present:
mount -t proc /proc proc/
mount -t sysfs /sys sys/
mount --rbind /dev dev/
mount --rbind /run run/
How can this be worked around, or is something wrong with my setup?
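One workaround I'm considering, though I haven't verified it against soperator and it may only paper over whatever populate-jail is actually missing, is to pre-create the mount points in the shared jail before the worker entrypoint runs:
# Assumed workaround: create the directories complement_jail.sh mounts onto,
# directly in the NFS-backed jail root
cd /mnt/jail
mkdir -p proc sys dev run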