
[BUG]: #6202

Open · 2 tasks done
Hongyuan-Liu opened this issue Feb 19, 2025 · 1 comment
Labels: bug (Something isn't working)

Comments

@Hongyuan-Liu

Is there an existing issue for this bug?

  • I have searched the existing issues

The bug has not been fixed in the latest main branch

  • I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

Yes, I will share a minimal reproducible script.

🐛 Describe the bug

I am fine-tuning DeepSeek-R1-Distill-Llama-70B using ColossalAI with the following command:
colossalai run \
    --nproc_per_node 8 \
    ./examples/training_scripts/lora_finetune.py \
    --pretrained ../../../models/DeepSeek-R1-Distill-Llama-70B \
    --dataset ./examples/training_scripts/lora_sft_data.jsonl \
    --plugin moe \
    --lr 2e-5 \
    --max_length 256 \
    -g \
    --ep 8 \
    --pp 3 \
    --batch_size 24 \
    --lora_rank 8 \
    --lora_alpha 16 \
    --num_epochs 2 \
    --tensorboard_dir logs \
    --save_dir DeepSeek-R1-bf16-lora
I have 8 RTX 4090 GPUs. However, the following error occurred:

W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779]
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779] *****************************************
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779] *****************************************
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
[rank3]: Traceback (most recent call last):
[rank3]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank3]: train(args)
[rank3]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank3]: plugin = MoeHybridParallelPlugin(
[rank3]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank3]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank3]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
[rank2]: Traceback (most recent call last):
[rank2]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank2]: train(args)
[rank2]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank2]: plugin = MoeHybridParallelPlugin(
[rank2]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank2]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank2]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[rank4]: Traceback (most recent call last):
[rank4]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank4]: train(args)
[rank4]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank4]: plugin = MoeHybridParallelPlugin(
[rank4]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank4]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank4]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[rank1]: Traceback (most recent call last):
[rank1]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank1]: train(args)
[rank1]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank1]: plugin = MoeHybridParallelPlugin(
[rank1]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank1]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank1]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[rank6]: Traceback (most recent call last):
[rank6]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank6]: train(args)
[rank6]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank6]: plugin = MoeHybridParallelPlugin(
[rank6]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank6]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank6]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[rank7]: Traceback (most recent call last):
[rank7]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank7]: train(args)
[rank7]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank7]: plugin = MoeHybridParallelPlugin(
[rank7]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank7]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank7]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[rank5]: Traceback (most recent call last):
[rank5]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank5]: train(args)
[rank5]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank5]: plugin = MoeHybridParallelPlugin(
[rank5]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank5]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank5]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
Exception ignored in: <function HybridParallelPlugin.__del__ at 0x7fed5ddaf640>
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1245, in __del__
self.pg_mesh.destroy_mesh_process_groups()
AttributeError: 'MoeHybridParallelPlugin' object has no attribute 'pg_mesh'
Exception ignored in: <function HybridParallelPlugin.__del__ at 0x7f8c3d9a7640>
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1245, in __del__
self.pg_mesh.destroy_mesh_process_groups()
AttributeError: 'MoeHybridParallelPlugin' object has no attribute 'pg_mesh'
[02/19/25 09:17:03] INFO colossalai - colossalai - INFO: /home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/initialize.py:75 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 8
[rank0]: Traceback (most recent call last):
[rank0]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank0]: train(args)
[rank0]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank0]: plugin = MoeHybridParallelPlugin(
[rank0]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank0]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank0]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
Exception ignored in: <function HybridParallelPlugin.__del__ at 0x7f704af5b640>
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1245, in __del__
self.pg_mesh.destroy_mesh_process_groups()
AttributeError: 'MoeHybridParallelPlugin' object has no attribute 'pg_mesh'
Exception ignored in: <function HybridParallelPlugin.__del__ at 0x7fd941b03640>
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1245, in __del__
self.pg_mesh.destroy_mesh_process_groups()
AttributeError: 'MoeHybridParallelPlugin' object has no attribute 'pg_mesh'
Exception ignored in: <function HybridParallelPlugin.__del__ at 0x7f4411247640>
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1245, in __del__
self.pg_mesh.destroy_mesh_process_groups()
AttributeError: 'MoeHybridParallelPlugin' object has no attribute 'pg_mesh'
Exception ignored in: <function HybridParallelPlugin.__del__ at 0x7f382381f640>
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1245, in __del__
self.pg_mesh.destroy_mesh_process_groups()
AttributeError: 'MoeHybridParallelPlugin' object has no attribute 'pg_mesh'
Exception ignored in: <function HybridParallelPlugin.__del__ at 0x7feb2b697640>
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1245, in __del__
self.pg_mesh.destroy_mesh_process_groups()
AttributeError: 'MoeHybridParallelPlugin' object has no attribute 'pg_mesh'
Exception ignored in: <function HybridParallelPlugin.__del__ at 0x7f7733453640>
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1245, in __del__
self.pg_mesh.destroy_mesh_process_groups()
AttributeError: 'MoeHybridParallelPlugin' object has no attribute 'pg_mesh'
[rank0]:[W219 09:17:03.293812312 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0219 09:17:03.837000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448006 closing signal SIGTERM
W0219 09:17:03.837000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448007 closing signal SIGTERM
W0219 09:17:03.837000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448009 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448010 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448011 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448012 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448013 closing signal SIGTERM
E0219 09:17:04.000000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 2 (pid: 2448008) of binary: /home/liuhongyuan/miniconda3/envs/colossal-chat/bin/python3.10
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/bin/torchrun", line 8, in
sys.exit(main())
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./examples/training_scripts/lora_finetune.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-02-19_09:17:03
host : servers
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2448008)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 ./examples/training_scripts/lora_finetune.py --pretrained ../../../models/DeepSeek-R1-Distill-Llama-70B --dataset ./examples/training_scripts/lora_sft_data.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 3 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat && export SHELL="/bin/bash" CONDA_BACKUP_GCC_NM="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-gcc-nm" CONDA_EXE="/home/liuhongyuan/miniconda3/bin/conda" CONDA_BACKUP_LDFLAGS="-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/home/liuhongyuan/miniconda3/envs/deepseek/lib -Wl,-rpath-link,/home/liuhongyuan/miniconda3/envs/deepseek/lib -L/home/liuhongyuan/miniconda3/envs/deepseek/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/home/liuhongyuan/miniconda3/envs/deepseek/lib -Wl,-rpath-link,/home/liuhongyuan/miniconda3/envs/deepseek/lib -L/home/liuhongyuan/miniconda3/envs/deepseek/lib" CONDA_BACKUP_CONDA_BUILD_SYSROOT="/home/liuhongyuan/miniconda3/envs/deepseek/x86_64-conda-linux-gnu/sysroot" CONDA_BACKUP_STRIP="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-strip" CONDA_BACKUP_DEBUG_CFLAGS="-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fvar-tracking-assignments -ffunction-sections -pipe -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fvar-tracking-assignments -ffunction-sections -pipe -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include" CONDA_BACKUP_DEBUG_CPPFLAGS="-D_DEBUG -D_FORTIFY_SOURCE=2 -Og -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include -D_DEBUG -D_FORTIFY_SOURCE=2 -Og -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include" CONDA_BACKUP_build_alias="x86_64-conda-linux-gnu" CONDA_BACKUP_DEBUG_CXXFLAGS="-fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fvar-tracking-assignments -ffunction-sections -pipe -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include -fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fvar-tracking-assignments -ffunction-sections -pipe -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include" CONDA_BACKUP_ELFEDIT="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-elfedit" CONDA_BACKUP_SIZE="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-size" CONDA_BACKUP_BUILD="x86_64-conda-linux-gnu" TORCH_CUDA_ARCH_LIST="8.0+PTX" CONDA_BACKUP_CPP="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-cpp" PWD="/home/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat" CONDA_BACKUP_CONDA_TOOLCHAIN_HOST="x86_64-conda-linux-gnu" LOGNAME="liuhongyuan" XDG_SESSION_TYPE="tty" CONDA_PREFIX="/home/liuhongyuan/miniconda3/envs/colossal-chat" CONDA_BACKUP_AS="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-as" CONDA_BACKUP_AR="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-ar" CXX="g++" _="/home/liuhongyuan/miniconda3/envs/colossal-chat/bin/colossalai" CONDA_BACKUP_GPROF="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-gprof" MOTD_SHOWN="pam" HOME="/home/liuhongyuan" LANG="en_US.UTF-8" CONDA_BACKUP_GXX="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-g++" 
CONDA_BACKUP_ADDR2LINE="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-addr2line" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.webp=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:" NVCC_PREPEND_FLAGS=" -ccbin=/home/liuhongyuan/miniconda3/envs/ds-r1-sft/bin/x86_64-conda-linux-gnu-c++" CONDA_BACKUP_OBJCOPY="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-objcopy" CONDA_BACKUP__CONDA_PYTHON_SYSCONFIGDATA_NAME="_sysconfigdata_x86_64_conda_cos6_linux_gnu" CONDA_BACKUP_CFLAGS="-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include" CONDA_BACKUP_HOST="x86_64-conda-linux-gnu" CONDA_PROMPT_MODIFIER="(colossal-chat) " CONDA_BACKUP_LD="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-ld" CUDA_NVCC_FLAGS="-allow-unsupported-compiler" CONDA_BACKUP_GCC="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-gcc" SSH_CONNECTION="58.56.19.187 1854 10.10.101.15 11500" CONDA_BACKUP_CC="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-cc" CONDA_BACKUP_NM="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-nm" CONDA_BACKUP_LD_GOLD="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-ld.gold" LESSCLOSE="/usr/bin/lesspipe %s %s" XDG_SESSION_CLASS="user" CONDA_BACKUP_CXXFLAGS="-fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include -fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include" CONDA_BACKUP_host_alias="x86_64-conda-linux-gnu" CONDA_BACKUP_RANLIB="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-ranlib" TERM="xterm-256color" 
CONDA_BACKUP_GCC_RANLIB="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-gcc-ranlib" CONDA_BACKUP_READELF="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-readelf" LESSOPEN="| /usr/bin/lesspipe %s" USER="liuhongyuan" CONDA_BACKUP_GCC_AR="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-gcc-ar" CONDA_BACKUP_DWP="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-dwp" CONDA_BACKUP_CXXFILT="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-c++filt" CONDA_SHLVL="4" SHLVL="2" CONDA_BACKUP_CMAKE_PREFIX_PATH="/home/liuhongyuan/miniconda3/envs/deepseek:/home/liuhongyuan/miniconda3/envs/deepseek/x86_64-conda-linux-gnu/sysroot/usr" XDG_SESSION_ID="1252" CONDA_BACKUP_CONDA_TOOLCHAIN_BUILD="x86_64-conda-linux-gnu" CONDA_BACKUP_OBJDUMP="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-objdump" CONDA_BACKUP_STRINGS="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-strings" CONDA_PYTHON_EXE="/home/liuhongyuan/miniconda3/bin/python" LD_LIBRARY_PATH="/opt/TensorRT-8.6.1.6/lib/:/usr/local/cuda-12.4/lib64:" CONDA_BACKUP_CC_FOR_BUILD="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-cc" XDG_RUNTIME_DIR="/run/user/1002" SSH_CLIENT="58.56.19.187 1854 11500" CONDA_BACKUP_CPPFLAGS="-DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/liuhongyuan/miniconda3/envs/deepseek/include" CONDA_DEFAULT_ENV="colossal-chat" CONDA_BACKUP_CXX_FOR_BUILD="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-c++" XDG_DATA_DIRS="/usr/local/share:/usr/share:/var/lib/snapd/desktop" HF_ENDPOINT="https://hf-mirror.com" PATH="/opt/TensorRT-8.6.1.6/bin:/home/liuhongyuan/miniconda3/envs/colossal-chat/bin:/home/liuhongyuan/miniconda3/condabin:/usr/local/cuda-12.4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin" CC="gcc" CONDA_BACKUP_MESON_ARGS="-Dbuildtype=release" DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/1002/bus" CONDA_BACKUP_CXX="/home/liuhongyuan/miniconda3/envs/deepseek/bin/x86_64-conda-linux-gnu-c++" SSH_TTY="/dev/pts/17" CONDA_PREFIX_1="/home/liuhongyuan/miniconda3" CONDA_PREFIX_2="/home/liuhongyuan/miniconda3/envs/ds-r1-sft" CONDA_PREFIX_3="/home/liuhongyuan/miniconda3/envs/deepseek" OLDPWD="/home/liuhongyuan/workspace/deepseek/ColossalAI/applications" CUDA_DEVICE_MAX_CONNECTIONS="1" && torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 ./examples/training_scripts/lora_finetune.py --pretrained ../../../models/DeepSeek-R1-Distill-Llama-70B --dataset ./examples/training_scripts/lora_sft_data.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 3 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora'

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes =====
127.0.0.1: failure

====== Stopping All Nodes =====
127.0.0.1: finish
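
As a sanity check, the assertion from the tracebacks can be reproduced outside the launcher with the values from my command (a minimal sketch, assuming the only relevant rule is the tp_size * pp_size divisibility printed in the error; tp_size is 1, as the assertion message itself reports):

# Reproduces the world-size check from moe_hybrid_parallel_plugin.py:238
# using the values from the failing run above.
world_size = 8  # 8 x RTX 4090 on a single node
tp_size = 1     # as reported in the assertion message (--tp was not passed)
pp_size = 3     # from --pp 3

assert world_size % (tp_size * pp_size) == 0, (
    f"World size {world_size} is not divisible by tp_size {tp_size} * pp_size {pp_size}"
)
# 8 % (1 * 3) == 2, so the assertion fails exactly as in the log.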

Environment

No response

Hongyuan-Liu added the bug label on Feb 19, 2025
@ver217
Member

ver217 commented Feb 20, 2025

The sample command is for a 3 x 8 GPU setup (24 GPUs), but you only have 8 GPUs. Adjust the ep size or pp size so that the number of GPUs is divisible by them.
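
In concrete terms, with world size 8 and the default tp_size of 1, the plugin's check requires 8 % (tp_size * pp_size) == 0, so --pp must be 1, 2, 4, or 8. A quick sketch to enumerate the acceptable values (only the divisibility rule from the traceback is encoded here; whether a particular --ep value is then accepted additionally depends on the plugin's MoE process grouping, which this sketch does not model):

# List the --pp values that pass MoeHybridParallelPlugin's world-size assertion
# for an 8-GPU run with tp_size = 1. Only the rule from the traceback is used:
#   world_size % (tp_size * pp_size) == 0
world_size = 8
tp_size = 1

valid_pp = [pp for pp in range(1, world_size + 1) if world_size % (tp_size * pp) == 0]
print(valid_pp)  # [1, 2, 4, 8]

For example, rerunning with --pp 2 --ep 4 (or --pp 4 --ep 2) keeps tp_size * pp_size a divisor of 8 and should get past this particular assertion; the exact ep value that works best is a separate tuning question.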
