Distributed Training across multiple nodes with 1 GPU each on a SLURM-managed cluster #217
-
A few things: just today I added support for not using all the GPUs in a node :)
-
Hey, thanks for your quick reply. I applied the above changes with the exception of 3:
and I get the following error:
I did not find any changes on the other nodes assigned to the job; all the computation was performed on the first node. I did look into some variables for my trainer and found the following:
Would specifying the number of GPUs per node change my problem? Thanks a bunch for the advice and time.
-
gres may not be used by your cluster; try removing the word gres. You have to specify the number of GPUs in your cluster job or you won't be assigned GPUs.
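For example, the GPU request in the sbatch script usually looks something like one of the following (exact directive support depends on the cluster's SLURM version and configuration, so treat this as a sketch):

```bash
#SBATCH --nodes=4              # number of machines
#SBATCH --ntasks-per-node=1    # one task (and one GPU) per machine
#SBATCH --gres=gpu:1           # generic-resource style request
# or, on clusters where gres is not configured (SLURM >= 19.05):
# #SBATCH --gpus-per-node=1
```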
-
I tried specifying the number of GPUs using all the methods from the man pages. I will contact the cluster admin and see if they can advise me on how to specify the GPUs for each node.
-
I did run it again without any GPU flags, and the first node does use its GPU, but that's probably because your trainer.py changes the ddp flag to false and the single_gpu flag to true, changing the function used. Just out of curiosity, why does your trainer not use DDP when specifying only 1 GPU but multiple nodes?
-
it does on master now |
-
you have to set ddp yourself |
-
Hey, I went through the updated code again for train.py. If the number of GPUs per node is just 1, then no distributed training will take place (please see below); I am trying to scale across multiple nodes, as each node only has 1 GPU available.
In your fit function, if the above flag is set to true, then training will only take place on the first node's GPU.
Why is the single_gpu flag a function only of the number of GPUs, and not of the GPUs and the number of nodes? Sorry for bothering you about this; I really do appreciate your help.
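To make the question concrete, the behaviour described above amounts to a check of roughly this shape (a paraphrase for illustration only, not the actual trainer source; variable names are made up):

```python
# paraphrase of the behaviour described above -- not the actual trainer code
if nb_gpus == 1:
    self.single_gpu = True      # multi-node, 1-GPU-per-node falls through here
    self.use_ddp = False
elif nb_gpus > 1:
    self.use_ddp = True

# whereas the multi-node, single-GPU-per-node case would need something like:
if nb_gpus == 1 and nb_nodes > 1:
    self.use_ddp = True
```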
-
try again? it should be fixed on master now |
-
Thanks for making the changes; I really appreciate it. I just need to fix up some of the environment vars, as they are not set by default. I'll be debugging for a while, as the cluster is currently quite busy. Do you want me to close this question or leave it open until I finish debugging/testing the changes?
-
Keep it open until fixed. Also let me know what else you had to do so I can update the framework or docs. By the way, Lightning handles all the relevant NVIDIA environment flags.
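For reference, the SLURM-side variables that usually feed into this can be inspected from inside a job step; what is actually set varies by cluster, so a quick check is worthwhile:

```bash
# run inside the job to see what SLURM exports for each task
env | grep -E 'SLURM_(NODEID|LOCALID|PROCID|NTASKS|NNODES|JOB_NODELIST)'
# MASTER_ADDR / MASTER_PORT are usually exported by the submit script itself
echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT"
```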
-
Ok, I just ran it on multiple nodes using a single GPU on each. It works for me now :) Please verify it also works for you. A minor difference is that now you do:
-
Hey, I am testing it atm, and it still doesn't work for me. I am not sure what the problem is, but it spawns multiple threads on my first node and deadlocks, which seems to be a problem with the NCCL backend when multiple threads attempt to use the same GPU.
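To get more detail out of a hang like this, NCCL's own logging can be enabled before launching (a general NCCL debugging step, not specific to Lightning):

```bash
export NCCL_DEBUG=INFO          # print NCCL init and communication details
export NCCL_DEBUG_SUBSYS=ALL    # optional: even more verbose output
```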
-
post code and slurm script? |
-
Hey, here is the current script and code that I am running. The model code hasn't changed; it is your MNIST model code. Thanks for giving this a look.
Bash script:
Trainer code:
-
I could be incorrect, but SLURM jobs will be allocated the number of nodes requested but only run on the first node provided. After reading through your code, does it account for this, or does it need to run the code on each node? My question relates to the def fit(self, model): function, as it obtains the local ID of the node being used, which is only node 0. Furthermore, I think the code hangs because in your def __init_tcp_connection(self): function you call dist.init_process_group("nccl", rank=self.proc_rank, world_size=self.world_size), which blocks until all the processes in the group have joined. That never happens, because the code doesn't run on all the nodes, so it is left waiting for processes that do not exist. I don't think the code itself needs changing, but I'm just checking with you to see whether that was the intended design; if so, the SLURM submit script may need a bit of tweaking instead of your code.
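For reference, that blocking behaviour is inherent to the call itself; a minimal standalone sketch (rank and world size taken from SLURM variables, MASTER_ADDR/MASTER_PORT assumed to be exported) hangs in exactly this way if fewer than world_size processes ever reach it:

```python
import os
import torch.distributed as dist

# global rank of this task and the total number of tasks across all nodes
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])

# blocks until world_size processes have joined the group; if the script only
# ever runs on one node, the missing ranks never arrive and this never returns
# (the default env:// init also needs MASTER_ADDR and MASTER_PORT to be set)
dist.init_process_group("nccl", rank=rank, world_size=world_size)
```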
-
You have a few issues in your SLURM script, but I can't look at it today.
When SLURM runs the job it executes the script on every machine... I guarantee Lightning works; it's been thoroughly tested under this setting, and I personally have been running more jobs now on a single GPU across nodes. How do you know only the first node is running? The logs and weights only save from the root node... you have to ssh into the other nodes to check the memory consumption. You also need to set the master port in your SLURM script; I set it to a random number, otherwise it might hang with a "port already in use" error. Look at the flags used in the documentation: https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/ and look at the examples:
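For the master port, a common pattern in the submit script is to derive a quasi-random port from the job ID so repeated submissions don't collide; a sketch (the address is often set the same way):

```bash
# first hostname in the allocation acts as the rendezvous address
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# derive a port from the job id so reruns don't hit "address already in use"
export MASTER_PORT=$((10000 + SLURM_JOB_ID % 20000))
```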
-
Hey, I have no doubt that Lightning works, since you have been testing it after the changes you made to address my earlier problems. The cluster that I have access to has recently been migrated to SLURM for its workload management; as such, I am still checking that it is correctly configured. I have ssh'd onto all the nodes and it only runs on the first one; the others don't even have a python process running when I check htop. I am working on a fix for my submission script whereby I srun the code once for each node allocated; I am still sifting through the environment variables and the like to see if there are any problems with this method. I will adapt the linked examples to the code that I am working with and see if that resolves my problem. Thanks for following up. I will keep you updated on how my problem gets resolved.
-
It's all good. You have to check with your cluster admin. SLURM executes your SLURM script on every machine it runs on (log stuff from the script to prove it). That in turn runs your code, which executes the same Lightning code. It could get hung if your backend has issues or the network interface isn't set up. The docs and demos have flags which can help you debug.
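One quick way to see which machines actually run your command is to log the hostname from each task, e.g.:

```bash
# prints one line per task; with one task per node, that's one line per machine
srun bash -c 'echo "running on $(hostname) NODEID=$SLURM_NODEID PROCID=$SLURM_PROCID"'
```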
-
@Alterith try this? Make sure to set the debug flags, or errors in your code will show up as NCCL errors. It's a good idea to debug in DP mode first.
-
Hey, I will try that. I just needed to run the python code with srun in the bash script so it would run on all the nodes; I am just testing it all now.
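Concretely, the change is along these lines in the sbatch script (the script name here is a placeholder):

```bash
# before: the command ran on the first allocated node only
# python train.py

# after: srun launches one task per allocated node, so every node runs the trainer
srun python train.py
```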
-
Hey, the models do train across multiple nodes after using srun; I really appreciate your help. I have one more question: will it be possible to accumulate the gradients so as to train a single model across multiple nodes? This was implemented when utilizing DP but not DDP. When training I get 4 models, each trained on a subset of the dataset. Would it be possible to accumulate the gradients using something like allreduce to train a single model on multiple nodes?
-
So, your SLURM script is what should call srun python etc... (see example here). It might be useful to check out the details of how DDP works in the PyTorch docs. At a high level, you init a NEW model with the SAME weights on every process. Training only syncs gradients. As a result you don't get 4 models each trained on a subset of the data; you get a single model trained on all the data. In fact, PL saves a single copy of the weights automatically, not 4 copies of the SAME weights. So, gradient accumulation is still possible (in fact, already supported in PL). It sounds like this ticket is solved? It sounds like the problem was configuring the SLURM submit script?
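A bare-PyTorch sketch of that mechanism (not Lightning internals; the tiny model and random data are stand-ins) makes the "same weights, synced gradients" point concrete:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# rank/world size from SLURM; MASTER_ADDR/MASTER_PORT assumed to be exported
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
dist.init_process_group("nccl", rank=rank, world_size=world_size)

# every process builds an identical model; DDP broadcasts rank 0's weights at
# construction, so all replicas start from the SAME parameters
model = torch.nn.Linear(10, 1).cuda()
ddp_model = DDP(model, device_ids=[0])      # 1 GPU per node -> local device 0
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

# each rank sees a disjoint shard of the data...
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
loader = DataLoader(dataset, sampler=DistributedSampler(dataset), batch_size=32)

for x, y in loader:
    loss = torch.nn.functional.mse_loss(ddp_model(x.cuda()), y.cuda())
    loss.backward()        # ...but gradients are all-reduced across ranks here,
    optimizer.step()       # so every replica takes the same step: one model
    optimizer.zero_grad()  # trained on all the data, not N independent ones
```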
-
Problem:
Question:
Is there a way to solve my current problem? I am not sure if the feature that I am requesting is supported, but it does seem possible, since there is support for training across multiple nodes with multiple GPUs.
Code
The following is my bash script, which I run with sbatch.
What have you tried?
I have tried specifying multiple GPUs, all with ID '0', but that did not work (just grasping at straws). I monitored the other nodes allocated, but they are not used.
I combed through the trainer.py file (https://github.com/williamFalcon/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py) and noticed in the __set_distributed_mode() function that if only a single GPU is requested (which I assumed meant a single GPU per node), then no distributed training takes place and a number of threads equal to the number of nodes specified is spawned on the first node.
What's your environment?
Thanks in advance for any help and time taken to resolve my problem; I hope that you have a great day.