Distributed Training across multiple nodes with 1 GPU each on a SLURM-managed cluster #217
-
A few things: just today I added support for not using all the GPUs in a node :)
-
Hey, thanks for your quick reply. I applied the above changes with the exception of 3:
and I get the following error:
I did not find any changes on the other nodes assigned to the job; all the computation was performed on the first node. I did look into some variables for my trainer and found the following:
Would specifying the number of GPUs per node change my problem? Thanks a bunch for the advice and time.
-
gres may not be used by your cluster; try removing the word gres. You have to specify the number of GPUs in your cluster job or you won't be assigned GPUs.
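For example, the GPU request in the sbatch script usually looks something like one of the following (exact directive support depends on the cluster's SLURM version and configuration, so treat this as a sketch):

```bash
#SBATCH --nodes=4              # number of machines
#SBATCH --ntasks-per-node=1    # one task (and one GPU) per machine
#SBATCH --gres=gpu:1           # generic-resource style request
# or, on clusters where gres is not configured (SLURM >= 19.05):
# #SBATCH --gpus-per-node=1
```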
-
I tried specifying the number of GPUs using all the methods from the man pages. I will contact the cluster admin and see if they can advise me on how to specify the GPUs for each node.
-
I did run it again without any GPU flags, and the first node does use its GPU, but that's probably because your trainer.py changes the ddp flag to false and the single_gpu flag to true, changing the function used. Just out of curiosity, why does your trainer not use DDP when specifying only 1 GPU but multiple nodes?
-
it does on master now |
-
you have to set ddp yourself |
-
Hey, I went through the updated code again for train.py. If the number of GPUs per node is just 1, then no distributed training will take place (please see below); I am trying to scale across multiple nodes, as each node only has 1 GPU available.
In your fit function, if the above flag is set to true, then training will only take place on the first node's GPU.
Why is the single_gpu flag a function only of the number of GPUs, and not of the GPUs and the number of nodes? Sorry for bothering you about this; I really do appreciate your help.
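To make the question concrete, the behaviour described above amounts to a check of roughly this shape (a paraphrase for illustration only, not the actual trainer source; variable names are made up):

```python
# paraphrase of the behaviour described above -- not the actual trainer code
if nb_gpus == 1:
    self.single_gpu = True      # multi-node, 1-GPU-per-node falls through here
    self.use_ddp = False
elif nb_gpus > 1:
    self.use_ddp = True

# whereas the multi-node, single-GPU-per-node case would need something like:
if nb_gpus == 1 and nb_nodes > 1:
    self.use_ddp = True
```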
-
try again? it should be fixed on master now |
-
Thanks for making the changes; I really appreciate it. I just need to fix up some of the environment vars, as they are not set by default. I'll be debugging for a while, as the cluster is currently quite busy. Do you want me to close this question or leave it open until I finish debugging/testing the changes?
-
Keep it open until fixed. Also let me know what else you had to do so I can update the framework or docs. By the way, Lightning handles all the relevant NVIDIA environment flags.
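For reference, the SLURM-side variables that usually feed into this can be inspected from inside a job step; what is actually set varies by cluster, so a quick check is worthwhile:

```bash
# run inside the job to see what SLURM exports for each task
env | grep -E 'SLURM_(NODEID|LOCALID|PROCID|NTASKS|NNODES|JOB_NODELIST)'
# MASTER_ADDR / MASTER_PORT are usually exported by the submit script itself
echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT"
```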
-
Ok, I just ran it on multiple nodes using a single GPU on each. It works for me now :) Please verify it also works for you. A minor difference is that now you do:
-
Hey, I am testing it atm, and it still doesn't work for me. I am not sure what the problem is, but it spawns multiple threads on my first node and deadlocks, which seems to be a problem with the NCCL backend when multiple threads attempt to use the same GPU.
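To get more detail out of a hang like this, NCCL's own logging can be enabled before launching (a general NCCL debugging step, not specific to Lightning):

```bash
export NCCL_DEBUG=INFO          # print NCCL init and communication details
export NCCL_DEBUG_SUBSYS=ALL    # optional: even more verbose output
```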
-
post code and slurm script? |
-
Hey, here is the current script and code that I am running. The model code hasn't changed; it is your MNIST model code. Thanks for giving this a look.
Bash script:
Trainer code:
-
I could be incorrect, but SLURM jobs will be allocated the number of nodes requested but only run on the first node provided. After reading through your code, does it account for this, or does it need to run the code on each node? My question relates to the def fit(self, model): function, as it obtains the local ID of the node being used, which is only node 0. Furthermore, I think the code hangs because in your def __init_tcp_connection(self): function you call dist.init_process_group("nccl", rank=self.proc_rank, world_size=self.world_size), which blocks until all the processes in the group have joined. That never happens, because the code doesn't run on all the nodes, so it is left waiting for processes that do not exist. I don't think the code itself needs changing, but I'm just checking with you to see whether that was the intended design; if so, the SLURM submit script may need a bit of tweaking instead of your code.
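For reference, that blocking behaviour is inherent to the call itself; a minimal standalone sketch (rank and world size taken from SLURM variables, MASTER_ADDR/MASTER_PORT assumed to be exported) hangs in exactly this way if fewer than world_size processes ever reach it:

```python
import os
import torch.distributed as dist

# global rank of this task and the total number of tasks across all nodes
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])

# blocks until world_size processes have joined the group; if the script only
# ever runs on one node, the missing ranks never arrive and this never returns
# (the default env:// init also needs MASTER_ADDR and MASTER_PORT to be set)
dist.init_process_group("nccl", rank=rank, world_size=world_size)
```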
-
You have a few issues in your SLURM script, but I can't look at it today.
When SLURM runs the job it executes the script on every machine... I guarantee Lightning works; it's been thoroughly tested under this setting, and I personally have been running more jobs now on a single GPU across nodes. How do you know only the first node is running? The logs and weights only save from the root node... you have to ssh into the other nodes to check the memory consumption. You also need to set the master port in your SLURM script; I set it to a random number, otherwise it might hang with a "port already in use" error. Look at the flags used in the documentation: https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/ and look at the examples:
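For the master port, a common pattern in the submit script is to derive a quasi-random port from the job ID so repeated submissions don't collide; a sketch (the address is often set the same way):

```bash
# first hostname in the allocation acts as the rendezvous address
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# derive a port from the job id so reruns don't hit "address already in use"
export MASTER_PORT=$((10000 + SLURM_JOB_ID % 20000))
```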
-
Hey, I have no doubt that Lightning works, since you have been testing it after the changes you made to address my earlier problems. The cluster that I have access to has recently been migrated to SLURM for its workload management; as such, I am still checking that it is correctly configured. I have ssh'd onto all the nodes and it only runs on the first one; the others don't even have a python process running when I check htop. I am working on a fix for my submission script whereby I srun the code once for each node allocated; I am still sifting through the environment variables and the like to see if there are any problems with this method. I will adapt the linked examples to the code that I am working with and see if that resolves my problem. Thanks for following up. I will keep you updated on how my problem gets resolved.
-
It's all good. You have to check with your cluster admin. SLURM executes your SLURM script on every machine it runs on (log stuff from the script to prove it). That in turn runs your code, which executes the same Lightning code. It could get hung if your backend has issues or the network interface isn't set up. The docs and demos have flags which can help you debug.
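One quick way to see which machines actually run your command is to log the hostname from each task, e.g.:

```bash
# prints one line per task; with one task per node, that's one line per machine
srun bash -c 'echo "running on $(hostname) NODEID=$SLURM_NODEID PROCID=$SLURM_PROCID"'
```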
-
@Alterith try this? Make sure to set the debug flags, or errors in your code will show up as NCCL errors. It's a good idea to debug in DP mode first.
-
Hey, I will try that. I just needed to run the python code with srun in the bash script so it would run on all the nodes; I am just testing it all now.
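Concretely, the change is along these lines in the sbatch script (the script name here is a placeholder):

```bash
# before: the command ran on the first allocated node only
# python train.py

# after: srun launches one task per allocated node, so every node runs the trainer
srun python train.py
```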
-
Hey, the models do train across multiple nodes after using srun; I really appreciate your help. I have one more question: will it be possible to accumulate the gradients so as to train a single model across multiple nodes? This was implemented when utilizing DP but not DDP. When training I get 4 models, each trained on a subset of the dataset. Would it be possible to accumulate the gradients using something like allreduce to train a single model on multiple nodes?
-
So, your SLURM script is what should call srun python etc... (see example here). It might be useful to check out the details of how DDP works in the PyTorch docs. At a high level, you init a NEW model with the SAME weights on every process. Training only syncs gradients. As a result you don't get 4 models each trained on a subset of the data; you get a single model trained on all the data. In fact, PL saves a single copy of the weights automatically, not 4 copies of the SAME weights. So, gradient accumulation is still possible (in fact, already supported in PL). It sounds like this ticket is solved? It sounds like the problem was configuring the SLURM submit script?
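A bare-PyTorch sketch of that mechanism (not Lightning internals; the tiny model and random data are stand-ins) makes the "same weights, synced gradients" point concrete:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# rank/world size from SLURM; MASTER_ADDR/MASTER_PORT assumed to be exported
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
dist.init_process_group("nccl", rank=rank, world_size=world_size)

# every process builds an identical model; DDP broadcasts rank 0's weights at
# construction, so all replicas start from the SAME parameters
model = torch.nn.Linear(10, 1).cuda()
ddp_model = DDP(model, device_ids=[0])      # 1 GPU per node -> local device 0
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

# each rank sees a disjoint shard of the data...
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
loader = DataLoader(dataset, sampler=DistributedSampler(dataset), batch_size=32)

for x, y in loader:
    loss = torch.nn.functional.mse_loss(ddp_model(x.cuda()), y.cuda())
    loss.backward()        # ...but gradients are all-reduced across ranks here,
    optimizer.step()       # so every replica takes the same step: one model
    optimizer.zero_grad()  # trained on all the data, not N independent ones
```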
-
Problem:
Question:
Is there a way to solve my current problem? I am not sure if the feature that I am requesting is supported, but it does seem possible, since there is support for training across multiple nodes with multiple GPUs.
Code
The following is my bash script, which I run with sbatch.
What have you tried?
I have tried specifying multiple GPUs, all with ID '0', but that did not work (just grasping at straws). I monitored the other nodes allocated, but they are not used.
I combed through the trainer.py file (https://github.com/williamFalcon/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py) and noticed in the __set_distributed_mode() function that if only a single GPU is requested (which I assumed meant a single GPU per node), then no distributed training takes place and a number of threads equal to the number of nodes specified is spawned on the first node.
What's your environment?
Thanks in advance for any help and time taken to resolve my problem; I hope that you have a great day.