-
I addressed that recently in https://stereopipeline.readthedocs.io/en/latest/tools/parallel_stereo.html. Normally parallel_stereo expects as many processes as cores, with --threads-multiprocess set to 1 and --threads-singleprocess set to the total number of cores. For the ASP MGM algorithm, which uses 8 threads per process, the number of processes should be divided by 8, and that is the default. But all of these only tune performance; that you are getting ssh errors means something else is wrong.

I was told by a SLURM user (I don't have access to SLURM myself) that one has to take the output of $SLURM_JOB_NODELIST and split it to one value per line, which is what GNU Parallel expects. I put a note about that here: https://stereopipeline.readthedocs.io/en/latest/examples.html#using-pbs-and-slurm

Do tell me exactly what error you get. You can also try going easy on your system, using just a few nodes and threads, to see whether your problem is fundamental or gets triggered only at higher counts. Depending on what you find, some adjustments to the doc, or to how the nodes are read, may be needed.
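Concretely, a minimal SLURM sketch for a 40-core node might look as follows (the image, camera, and output names are placeholders, not from your setup):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40

# GNU Parallel wants one node name per line.
scontrol show hostname $SLURM_JOB_NODELIST | tr ' ' '\n' > nodelist.lis

# Default algorithm: one process per core, one thread per process.
parallel_stereo left.tif right.tif left.xml right.xml run/run \
    --nodes-list nodelist.lis                                  \
    --processes 40 --threads-multiprocess 1                    \
    --threads-singleprocess 40

# For asp_mgm, which uses 8 threads per process, divide the
# process count by 8, e.g.:
#   --stereo-algorithm asp_mgm --processes 5 --threads-multiprocess 8
```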
-
Was this resolved somehow? Is there something in the doc that is incomplete or missing? I don't have access to SLURM, so I can't tell how things work there, though some folks have reported success with it.
-
The issue seems to be, per your text, that only 47 simultaneous logins are allowed. You can try the --sshdelay option it mentions, passed to parallel_stereo as --parallel-options '--sshdelay 1' (per https://stereopipeline.readthedocs.io/en/latest/tools/parallel_stereo.html). Or, you can use fewer connections than that, for example by ensuring that the product of --threads-multiprocess and --processes stays below 47. This seems to be a quirk of your system.
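For example (again with placeholder file names), either of these should keep you under the cap:

```bash
# Option 1: stagger the ssh logins, 1 second apart.
parallel_stereo left.tif right.tif left.xml right.xml run/run \
    --nodes-list nodelist.lis --parallel-options '--sshdelay 1'

# Option 2: stay below the cap, e.g. 40 processes x 1 thread = 40 < 47.
parallel_stereo left.tif right.tif left.xml right.xml run/run \
    --nodes-list nodelist.lis --processes 40 --threads-multiprocess 1
```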
-
@oleg-alexandrov I have an answer that I fear is specific to my cluster setup. 🤷 Many thanks for your suggestions working through this. The SBATCH commands ended up including:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
```

And then the parameterization for the call is:

```bash
scontrol show hostname $SLURM_NODELIST | tr ' ' '\n' > workdir/nodelist.lis
parallel_stereo ... --nodes-list=workdir/nodelist.lis --processes 40 \
    --threads-singleprocess 1 \
    --parallel-options '--sshdelay 0.1'
```

This is in line with your last suggestion. I think part of the issue was that the nodelist was being written to the proper directory in the first line, but then not being passed (by me) properly to the parallel_stereo command. Prepending the nodelist path with the workdir fixed the issue (verified by sshing to the nodes while running). I also swapped to using the CSM sensors while working on all of this and can report that the performance is quite good.
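For anyone hitting the same thing, one quick sanity check of a nodes list before a long run is to hand it straight to GNU Parallel (this is plain GNU Parallel, not part of ASP):

```bash
# Run a trivial command once per node; ssh failures show up immediately.
parallel --nonall --sshloginfile workdir/nodelist.lis hostname
```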
-
Glad it is solved. What you wrote looks in line with the doc at https://stereopipeline.readthedocs.io/en/latest/examples.html#using-pbs-and-slurm. I don't have access to SLURM myself, so any thoughts about how to improve the docs are welcome.
-
I now modified the doc to make the nodes list for SLURM unique, so that when the user launches several jobs at once they don't conflict. Here is the new suggestion:

```bash
# Create a temporary list of nodes in the current directory
nodesList=$(mktemp -p $(pwd))

# Set up the nodes list
scontrol show hostname $SLURM_NODELIST | tr ' ' '\n' > $nodesList
```

This is not related to your issue, but it may help avoid problems going forward.
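Since mktemp leaves a file behind each run, one could also clean it up when the job script exits (the trap is just a suggestion, not part of the doc):

```bash
# Remove the temporary nodes list when the job script exits.
trap 'rm -f "$nodesList"' EXIT
```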
-
I am struggling to get a working parallel_stereo call on a new cluster. Specifically, I am seeing ssh login errors related to the number of jobs and ssh connections being opened. Before I open a specific issue, I thought it best to request a bit more information about how parallel_stereo is using the parameters. Each node has 40 physical cores (80 hyper-threaded).
My SBATCH commands look as follows:
To generate the nodes-list to be passed to parallel_stereo I am using:

```bash
scontrol show hostnames $SLURM_JOB_NODELIST > nodelist.lis
```

which works to produce the required list of nodes.
I am running steps 1-2 and 5 in serial. Below is my call. Given the CPU layout of the cluster, what is parallel_stereo expecting to see for --processes, --threads-multiprocess, and --threads-singleprocess? If I set these to the number of physical cores (or hyper-threaded cores), I get ssh errors, even when setting the ssh config to avoid overloading ssh.
Thanks for any insight. I can post specific error messages if a more concrete 'here is the exact error for this setup' is helpful.