-
Hello, I wanted to set up distributed training across 8 GPUs on a single machine. I followed the documentation details here and looked at the CIFAR example. Is there a core list of steps required to appropriately convert an Ignite script to support this?

What I tried:

Doing all of this, I get the following TypeError. Not sure what's causing this, so any ideas are appreciated.
-
Hi @aksg87, it looks like you are using the spawn method to run distributed training, and the error comes from the PyTorch DataLoader being unable to pickle the SwigPyObject. Make sure everything the DataLoader has to pickle (the dataset and its attributes) is picklable.

Also, please try the launch method to run distributed training; it is faster than the spawn method:

# spawn method
python train.py --args (training args)

# launch method
python -m torch.distributed.launch --nproc_per_node 8 --use_env train.py --backend nccl --args (training args)

If you use launch, calling the training loop will become:

if __name__ == "__main__":
    # no need for `nproc_per_node` as it is handled by `torch.distributed.launch`
    with idist.Parallel(backend=backend) as parallel:
        parallel.run(training, config)
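For reference, the training function passed to parallel.run can use the ignite.distributed (idist) auto_* helpers so the same script works on 1 GPU or 8. Below is a minimal sketch of that pattern; the toy model, the random dataset, and the config keys are illustrative placeholders, not taken from your script:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset
import ignite.distributed as idist
from ignite.engine import Engine


def training(local_rank, config):
    device = idist.device()  # e.g. cuda:<local_rank> with the nccl backend

    # placeholder dataset; auto_dataloader adds a DistributedSampler and
    # adjusts batch size / num_workers per process
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    train_loader = idist.auto_dataloader(dataset, batch_size=config["batch_size"], shuffle=True)

    # auto_model moves the model to the right device and wraps it in DistributedDataParallel
    model = idist.auto_model(nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2)))
    optimizer = idist.auto_optim(torch.optim.SGD(model.parameters(), lr=config["lr"]))
    criterion = nn.CrossEntropyLoss().to(device)

    def train_step(engine, batch):
        model.train()
        x, y = batch[0].to(device), batch[1].to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        return loss.item()

    trainer = Engine(train_step)
    trainer.run(train_loader, max_epochs=config["max_epochs"])


if __name__ == "__main__":
    config = {"batch_size": 64, "lr": 0.01, "max_epochs": 2}
    # launched via torch.distributed.launch, so nproc_per_node is not passed here
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training, config)

With this layout the same file runs unchanged under torch.distributed.launch, or with the spawn method by passing nproc_per_node=8 to idist.Parallel instead.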
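And if you do keep the spawn method, the usual way around the "cannot pickle 'SwigPyObject'" error is to stop storing the SWIG-backed object on the Dataset and create it lazily inside each worker process instead. A rough sketch of that pattern (open_swig_backed_reader and its load method are hypothetical placeholders for whatever reader library you use):

from torch.utils.data import Dataset


class LazyReaderDataset(Dataset):
    """Holds only picklable state (file paths); the SWIG-backed reader is
    created lazily inside each worker process rather than in __init__."""

    def __init__(self, file_paths):
        self.file_paths = file_paths  # plain strings pickle fine
        self._reader = None           # nothing unpicklable to send at spawn time

    def _get_reader(self):
        if self._reader is None:
            self._reader = open_swig_backed_reader()  # hypothetical factory for the SWIG object
        return self._reader

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        return self._get_reader().load(self.file_paths[idx])  # hypothetical load() call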
-
@aksg87 does your baseline work now with the suggested approach?