-
Hello, I wanted to set up distributed training across 8 GPUs on a single machine. I followed the documentation details here and looked at the CIFAR example. Is there a core list of steps required to appropriately convert an Ignite script to support this?

What I tried:

Doing all of this, I get the following TypeError. Not sure what's causing this, so any ideas are appreciated.
-
Hi @aksg87, it looks like you are using the spawn method to run distributed training, and the error comes from the PyTorch DataLoader being unable to pickle the SwigPyObject. Make sure everything the DataLoader has to pickle (the dataset and its attributes) is picklable.

Also, please try the launch method to run distributed training; it is faster than the spawn method:

# spawn method
python train.py --args (training args)

# launch method
python -m torch.distributed.launch --nproc_per_node 8 --use_env train.py --backend nccl --args (training args)

If you use launch, calling the training loop will become:

if __name__ == "__main__":
    # no need for `nproc_per_node` as it is handled by `torch.distributed.launch`
    with idist.Parallel(backend=backend) as parallel:
        parallel.run(training, config)
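For reference, the training function passed to parallel.run can use the ignite.distributed (idist) auto_* helpers so the same script works on 1 GPU or 8. Below is a minimal sketch of that pattern; the toy model, the random dataset, and the config keys are illustrative placeholders, not taken from your script:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset
import ignite.distributed as idist
from ignite.engine import Engine


def training(local_rank, config):
    device = idist.device()  # e.g. cuda:<local_rank> with the nccl backend

    # placeholder dataset; auto_dataloader adds a DistributedSampler and
    # adjusts batch size / num_workers per process
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    train_loader = idist.auto_dataloader(dataset, batch_size=config["batch_size"], shuffle=True)

    # auto_model moves the model to the right device and wraps it in DistributedDataParallel
    model = idist.auto_model(nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2)))
    optimizer = idist.auto_optim(torch.optim.SGD(model.parameters(), lr=config["lr"]))
    criterion = nn.CrossEntropyLoss().to(device)

    def train_step(engine, batch):
        model.train()
        x, y = batch[0].to(device), batch[1].to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        return loss.item()

    trainer = Engine(train_step)
    trainer.run(train_loader, max_epochs=config["max_epochs"])


if __name__ == "__main__":
    config = {"batch_size": 64, "lr": 0.01, "max_epochs": 2}
    # launched via torch.distributed.launch, so nproc_per_node is not passed here
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training, config)

With this layout the same file runs unchanged under torch.distributed.launch, or with the spawn method by passing nproc_per_node=8 to idist.Parallel instead.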
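And if you do keep the spawn method, the usual way around the "cannot pickle 'SwigPyObject'" error is to stop storing the SWIG-backed object on the Dataset and create it lazily inside each worker process instead. A rough sketch of that pattern (open_swig_backed_reader and its load method are hypothetical placeholders for whatever reader library you use):

from torch.utils.data import Dataset


class LazyReaderDataset(Dataset):
    """Holds only picklable state (file paths); the SWIG-backed reader is
    created lazily inside each worker process rather than in __init__."""

    def __init__(self, file_paths):
        self.file_paths = file_paths  # plain strings pickle fine
        self._reader = None           # nothing unpicklable to send at spawn time

    def _get_reader(self):
        if self._reader is None:
            self._reader = open_swig_backed_reader()  # hypothetical factory for the SWIG object
        return self._reader

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        return self._get_reader().load(self.file_paths[idx])  # hypothetical load() call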
-
@aksg87 does your baseline work now with the suggested approach?