-
I addressed that recently in https://stereopipeline.readthedocs.io/en/latest/tools/parallel_stereo.html. Normally parallel_stereo expects as many processes as cores, with --threads-multiprocess set to 1 and --threads-singleprocess set to the total number of cores. For the ASP MGM algorithm, which uses 8 threads per process, the number of processes should be divided by 8, and that is the default. But all of these only tune performance; that you are getting ssh errors means something else is wrong.

I was told by a SLURM user (I don't have access to SLURM myself) that one has to take the output of $SLURM_JOB_NODELIST and split it to one value per line, which is what GNU Parallel expects. I put a note about that here: https://stereopipeline.readthedocs.io/en/latest/examples.html#using-pbs-and-slurm

Do tell me exactly what error you get. You can also try going easy on your system, using just a few nodes and threads, to see whether your problem is fundamental or gets triggered only at higher counts. Depending on what you find, some adjustments to the doc, or to how the nodes are read, may be needed.
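Concretely, a minimal SLURM sketch for a 40-core node might look as follows (the image, camera, and output names are placeholders, not from your setup):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40

# GNU Parallel wants one node name per line.
scontrol show hostname $SLURM_JOB_NODELIST | tr ' ' '\n' > nodelist.lis

# Default algorithm: one process per core, one thread per process.
parallel_stereo left.tif right.tif left.xml right.xml run/run \
    --nodes-list nodelist.lis                                  \
    --processes 40 --threads-multiprocess 1                    \
    --threads-singleprocess 40

# For asp_mgm, which uses 8 threads per process, divide the
# process count by 8, e.g.:
#   --stereo-algorithm asp_mgm --processes 5 --threads-multiprocess 8
```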
-
Was this resolved somehow? Is there something in the doc that is incomplete or missing? I don't have access to SLURM, so I can't tell how things work there, though some folks have reported success with it.
-
The issue seems to be, per your text, that only 47 simultaneous logins are allowed. You can try the --sshdelay option it mentions, passed to parallel_stereo as --parallel-options '--sshdelay 1' (per https://stereopipeline.readthedocs.io/en/latest/tools/parallel_stereo.html). Or, you can use fewer connections than that, for example by ensuring that the product of --threads-multiprocess and --processes stays below 47. This seems to be a quirk of your system.
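For example (again with placeholder file names), either of these should keep you under the cap:

```bash
# Option 1: stagger the ssh logins, 1 second apart.
parallel_stereo left.tif right.tif left.xml right.xml run/run \
    --nodes-list nodelist.lis --parallel-options '--sshdelay 1'

# Option 2: stay below the cap, e.g. 40 processes x 1 thread = 40 < 47.
parallel_stereo left.tif right.tif left.xml right.xml run/run \
    --nodes-list nodelist.lis --processes 40 --threads-multiprocess 1
```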
-
@oleg-alexandrov I have an answer that I fear is specific to my cluster setup. 🤷 Many thanks for your suggestions working through this. The SBATCH commands ended up including:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
```

And then the parameterization for the call is:

```bash
scontrol show hostname $SLURM_NODELIST | tr ' ' '\n' > workdir/nodelist.lis
parallel_stereo ... --nodes-list=workdir/nodelist.lis --processes 40 \
    --threads-singleprocess 1 \
    --parallel-options '--sshdelay 0.1'
```

This is in line with your last suggestion. I think part of the issue was that the nodelist was being written to the proper directory in the first line, but then not being passed (by me) properly to the parallel_stereo command. Prepending the nodelist path with the workdir fixed the issue (verified by sshing to the nodes while running). I also swapped to using the CSM sensors while working on all of this and can report that the performance is quite good.
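For anyone hitting the same thing, one quick sanity check of a nodes list before a long run is to hand it straight to GNU Parallel (this is plain GNU Parallel, not part of ASP):

```bash
# Run a trivial command once per node; ssh failures show up immediately.
parallel --nonall --sshloginfile workdir/nodelist.lis hostname
```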
-
Glad it is solved. What you wrote looks in line with the doc at https://stereopipeline.readthedocs.io/en/latest/examples.html#using-pbs-and-slurm. I don't have access to SLURM myself, so any thoughts about how to improve the docs are welcome.
-
I now modified the doc to make the nodes list for SLURM unique, so that when the user launches several jobs at once they don't conflict. Here is the new suggestion:

```bash
# Create a temporary list of nodes in the current directory
nodesList=$(mktemp -p $(pwd))

# Set up the nodes list
scontrol show hostname $SLURM_NODELIST | tr ' ' '\n' > $nodesList
```

This is not related to your issue, but it may help avoid problems going forward.
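Since mktemp leaves a file behind each run, one could also clean it up when the job script exits (the trap is just a suggestion, not part of the doc):

```bash
# Remove the temporary nodes list when the job script exits.
trap 'rm -f "$nodesList"' EXIT
```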
-
I am struggling to get a working parallel_stereo call on a new cluster. Specifically, I am seeing ssh login errors related to the number of jobs and ssh connections being opened. Before I open a specific issue, I thought it best to request a bit more information about how parallel_stereo is using the parameters. Each node has 40 physical cores (80 hyper-threaded).
My SBATCH commands look as follows:
To generate the nodes-list to be passed to parallel_stereo I am using:

```bash
scontrol show hostnames $SLURM_JOB_NODELIST > nodelist.lis
```

which works to produce the required list of nodes.
I am running steps 1-2 and 5 in serial. Below is my call. Given the CPU layout of the cluster, what is parallel_stereo expecting to see for --processes, --threads-multiprocess, and --threads-singleprocess? If I set these to the number of physical cores (or hyper-threaded cores), I get ssh errors, even when setting the ssh config to avoid overloading ssh.
Thanks for any insight. I can post specific error messages if a more concrete 'here is the exact error for this setup' is helpful.