To implement distributed and non-distributed training, we use `MMDistributedDataParallel` and `MMDataParallel` respectively.
All outputs (log files and checkpoints) will be saved to the working directory, which is specified by `work_dir` in the config file.
By default, we evaluate the model on the validation set after some epochs; you can change the evaluation interval by adding the `interval` argument to the training config:

```python
evaluation = dict(interval=2)  # This evaluates the model every 2 epochs.
```
*Important*: The default learning rate in config files is for 2 GPUs and 8 img/gpu (batch size = 2x8 = 16). Equivalently, you may also use 8 GPUs and 2 img/gpu, since all models use cross-GPU SyncBN.
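If you do change the total batch size, a common convention (an assumption here, not something the configs enforce) is to scale the base learning rate linearly with the total batch size. A minimal sketch of that arithmetic, with a hypothetical `base_lr`:

```python
# Minimal sketch of the linear scaling rule (an assumption, not enforced by the configs).
base_lr = 1e-4                   # hypothetical value taken from a config file
base_batch_size = 2 * 8          # 2 GPUs x 8 img/gpu, the setting the configs assume

gpus, imgs_per_gpu = 4, 8        # your actual setup
scaled_lr = base_lr * (gpus * imgs_per_gpu) / base_batch_size
print(scaled_lr)                 # 0.0002 for a total batch size of 32
```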
Official support:

```shell
sh tools/dist_train.sh ${CONFIG_FILE} 1 [optional arguments]
```

Experimental support (you may need to set `PYTHONPATH`):

```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```
If you want to specify the working directory in the command, you can add the argument `--work-dir ${YOUR_WORK_DIR}`.
The process of training on the CPU is consistent with single-GPU training. We just need to disable GPUs before training:

```shell
export CUDA_VISIBLE_DEVICES=-1
```

Then run the script above.
To train with multiple GPUs, use the distributed launcher:
```shell
sh tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```
Optional arguments are:

- `--no-validate` (not suggested): By default, the codebase performs evaluation every k iterations during training. To disable this behavior, use `--no-validate`.
- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
- `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file (to continue the training process).
- `--load-from ${CHECKPOINT_FILE}`: Load weights from a checkpoint file (to start finetuning for another task).
- `--deterministic`: Switch on "deterministic" mode, which slows down training but makes the results reproducible.

Difference between `resume-from` and `load-from` (see the config sketch after this list):

- `resume-from` loads both the model weights and the optimizer state, including the iteration number.
- `load-from` loads only the model weights and starts training from iteration 0.
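For reference, here is a minimal config sketch of these two options, assuming the usual MMCV-style `resume_from`/`load_from` config fields; the paths are hypothetical placeholders:

```python
# Continue an interrupted run: restores weights, optimizer state and the iteration number.
resume_from = 'work_dirs/my_experiment/latest.pth'

# Or fine-tune from pretrained weights: loads weights only, training starts from iteration 0.
# load_from = 'checkpoints/pretrained.pth'
```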
An example of launching a training job:

```shell
# If work_dir is not set, it will be generated automatically.
bash ./tools/dist_train.sh configs/habins/habins_swinl_22k_w7_mix5_bs8.py 2 --work-dir work_dirs/saves/habins/habins_swinl_22k_w7_mix5_bs8.py
```
Note: During training, checkpoints and logs are saved in the same folder structure as the config file under `work_dirs/`. A custom work directory is not recommended, since the evaluation scripts infer work directories from the config file name. If you want to save your weights somewhere else, please use a symlink, for example:

```shell
ln -s ${YOUR_WORK_DIRS} ${TOOLBOX}/work_dirs
```
If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflicts. Otherwise, there will be an error message saying `RuntimeError: Address already in use`.

If you use `dist_train.sh` to launch training jobs, you can set the port in the commands with the environment variable `PORT`:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4
```
If you launch with multiple machines simply connected via Ethernet, you can run the following commands.
On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
```
Usually it is slow if you do not have high-speed networking such as InfiniBand.
Slurm is a good job scheduling system for computing clusters. On a cluster managed by Slurm, you can use `slurm_train.sh` to spawn training jobs. It supports both single-node and multi-node training.
Train with multiple machines:

```shell
[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}
```
Here is an example of using 16 GPUs to train on the dev partition:

```shell
GPUS=16 sh tools/slurm_train.sh dev habins configs/habins/habins_swinl_22k_w7_mix5_bs8.py --work-dir work_dirs/saves/habins/habins_swinl_22k_w7_mix5_bs8.py
```
When using `slurm_train.sh` to start multiple tasks on a node, different ports need to be specified. Three settings are provided:
Option 1:

In `config1.py`:

```python
dist_params = dict(backend='nccl', port=29500)
```

In `config2.py`:

```python
dist_params = dict(backend='nccl', port=29501)
```

Then you can launch two jobs with `config1.py` and `config2.py`:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2
```
Option 2:

You can set different communication ports without modifying the configuration file, but you have to use `--cfg-options` to overwrite the default port set in the configuration file:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1 --cfg-options dist_params.port=29500
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2 --cfg-options dist_params.port=29501
```
Option 3:

You can set the port in the command using the environment variable `MASTER_PORT`:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2
```
By default, we use `TextLoggerHook` and `TensorboardImageLoggerHook` to log information during training.

The former prints logs in the shell, for example:

```text
2022-04-03 00:29:11,300 - depth - INFO - Epoch [3][1200/1514] lr: 3.543e-05, eta: 3:13:52, time: 0.357, data_time: 0.009, memory: 15394, decode.loss_depth: 0.1381, loss: 0.1381, grad_norm: 1.4511
2022-04-03 00:29:29,139 - depth - INFO - Epoch [3][1250/1514] lr: 3.608e-05, eta: 3:13:32, time: 0.357, data_time: 0.009, memory: 15394, decode.loss_depth: 0.1420, loss: 0.1420, grad_norm: 1.5763
```

The latter saves loss/accuracy curves and images to the TensorBoard server. After starting TensorBoard and opening the page, you can watch the training process.
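For completeness, a minimal sketch of how these two hooks are typically wired up in an MMCV-style `log_config` entry; the interval and hook arguments below are assumptions, so check your own config for the exact values:

```python
log_config = dict(
    interval=50,  # log every 50 iterations (assumed value)
    hooks=[
        dict(type='TextLoggerHook', by_epoch=True),
        dict(type='TensorboardImageLoggerHook'),
    ])
```

You can then point TensorBoard at the work directory, e.g. `tensorboard --logdir work_dirs`, to view the curves and logged images.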