Is there an existing issue for this bug?

I have checked the latest main branch; the bug has not been fixed there.

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

Yes, I will share a minimal reproducible script.
🐛 Describe the bug
I am fine-tuning DeepSeek-R1-Distill-Llama-70B using ColossalAI with the following command:
```shell
colossalai run \
    --nproc_per_node 8 \
    ./examples/training_scripts/lora_finetune.py \
    --pretrained ../../../models/DeepSeek-R1-Distill-Llama-70B \
    --dataset ./examples/training_scripts/lora_sft_data.jsonl \
    --plugin moe \
    --lr 2e-5 \
    --max_length 256 \
    -g \
    --ep 8 \
    --pp 3 \
    --batch_size 24 \
    --lora_rank 8 \
    --lora_alpha 16 \
    --num_epochs 2 \
    --tensorboard_dir logs \
    --save_dir DeepSeek-R1-bf16-lora
```
I have 8 RTX 4090 GPUs. However, the following error occurred:
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779]
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779] *****************************************
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779] *****************************************
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
  warnings.warn(
[the warning above is repeated once per rank, 8 times in total]
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn(
[the two warnings above are repeated once per rank]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in <module>
[rank3]:     train(args)
[rank3]:   File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank3]:     plugin = MoeHybridParallelPlugin(
[rank3]:   File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank3]:     dist.get_world_size() % (tp_size * pp_size) == 0
[rank3]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[identical tracebacks are raised on ranks 1, 2, 4, 5, 6, and 7]
Exception ignored in: <function HybridParallelPlugin.__del__ at 0x7fed5ddaf640>
Traceback (most recent call last):
  File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1245, in __del__
    self.pg_mesh.destroy_mesh_process_groups()
AttributeError: 'MoeHybridParallelPlugin' object has no attribute 'pg_mesh'
[the same "Exception ignored" block is printed once per failing rank]
[02/19/25 09:17:03] INFO colossalai - colossalai - INFO: /home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/initialize.py:75 launch
                    INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 8
[rank0]: Traceback (most recent call last):
[rank0]:   File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in <module>
[rank0]:     train(args)
[rank0]:   File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank0]:     plugin = MoeHybridParallelPlugin(
[rank0]:   File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank0]:     dist.get_world_size() % (tp_size * pp_size) == 0
[rank0]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[rank0]:[W219 09:17:03.293812312 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0219 09:17:03.837000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448006 closing signal SIGTERM
W0219 09:17:03.837000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448007 closing signal SIGTERM
W0219 09:17:03.837000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448009 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448010 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448011 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448012 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448013 closing signal SIGTERM
E0219 09:17:04.000000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 2 (pid: 2448008) of binary: /home/liuhongyuan/miniconda3/envs/colossal-chat/bin/python3.10
Traceback (most recent call last):
  File "/home/liuhongyuan/miniconda3/envs/colossal-chat/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
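For context, the `AssertionError` in the log boils down to a simple divisibility constraint on the parallel group sizes. The sketch below is a hypothetical simplification, not the plugin's actual code; the function name is illustrative, and only the assertion message is taken from the log above:

```python
def check_parallel_config(world_size: int, tp_size: int, pp_size: int) -> None:
    # Simplified stand-in for the check in moe_hybrid_parallel_plugin.py:238:
    # the global world size must be divisible by tp_size * pp_size so the
    # ranks can be arranged into a complete tp x pp grid.
    assert world_size % (tp_size * pp_size) == 0, (
        f"World size {world_size} is not divisible by "
        f"tp_size {tp_size} * pp_size {pp_size}"
    )

check_parallel_config(8, 1, 2)  # passes: 8 % (1 * 2) == 0
try:
    check_parallel_config(8, 1, 3)  # the configuration from the command above
except AssertionError as e:
    print(e)
```

With 8 GPUs and the default `tp_size` of 1, `--pp` would need to divide 8 (1, 2, 4, or 8) for this check to pass, which is why `--pp 3` fails on every rank.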
Is there an existing issue for this bug?
The bug has not been fixed in the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
I am fine-tuning DeepSeek-R1-Distill-Llama-70B using ColossalAI with the following command:
colossalai run
--nproc_per_node 8
./examples/training_scripts/lora_finetune.py
--pretrained ../../../models/DeepSeek-R1-Distill-Llama-70B
--dataset ./examples/training_scripts/lora_sft_data.jsonl
--plugin moe
--lr 2e-5
--max_length 256
-g
--ep 8
--pp 3
--batch_size 24
--lora_rank 8
--lora_alpha 16
--num_epochs 2
--tensorboard_dir logs
--save_dir DeepSeek-R1-bf16-lora
I have 8 4090 GPU,However, the following error occurred:
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779]
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779] *****************************************
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0219 09:16:58.710000 140066961602368 torch/distributed/run.py:779] *****************************************
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
[rank3]: Traceback (most recent call last):
[rank3]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank3]: train(args)
[rank3]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank3]: plugin = MoeHybridParallelPlugin(
[rank3]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in init
[rank3]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank3]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
[rank2]: Traceback (most recent call last):
[rank2]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank2]: train(args)
[rank2]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank2]: plugin = MoeHybridParallelPlugin(
[rank2]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in init
[rank2]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank2]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[rank4]: Traceback (most recent call last):
[rank4]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank4]: train(args)
[rank4]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank4]: plugin = MoeHybridParallelPlugin(
[rank4]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in init
[rank4]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank4]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[rank1]: Traceback (most recent call last):
[rank1]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in
[rank1]: train(args)
[rank1]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank1]: plugin = MoeHybridParallelPlugin(
[rank1]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in init
[rank1]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank1]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[identical tracebacks from ranks 5, 6, and 7 omitted]
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
Exception ignored in: <function HybridParallelPlugin.__del__ at 0x7fed5ddaf640>
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1245, in __del__
self.pg_mesh.destroy_mesh_process_groups()
AttributeError: 'MoeHybridParallelPlugin' object has no attribute 'pg_mesh'
[02/19/25 09:17:03] INFO colossalai - colossalai - INFO: /home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/initialize.py:75 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 8
[rank0]: Traceback (most recent call last):
[rank0]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 455, in <module>
[rank0]: train(args)
[rank0]: File "/dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat/./examples/training_scripts/lora_finetune.py", line 109, in train
[rank0]: plugin = MoeHybridParallelPlugin(
[rank0]: File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 238, in __init__
[rank0]: dist.get_world_size() % (tp_size * pp_size) == 0
[rank0]: AssertionError: World size 8 is not divisible by tp_size 1 * pp_size 3
[the same 'pg_mesh' AttributeError, repeated six more times on the remaining ranks, omitted]
[rank0]:[W219 09:17:03.293812312 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0219 09:17:03.837000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448006 closing signal SIGTERM
W0219 09:17:03.837000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448007 closing signal SIGTERM
W0219 09:17:03.837000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448009 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448010 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448011 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448012 closing signal SIGTERM
W0219 09:17:03.838000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2448013 closing signal SIGTERM
E0219 09:17:04.000000 140066961602368 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 2 (pid: 2448008) of binary: /home/liuhongyuan/miniconda3/envs/colossal-chat/bin/python3.10
Traceback (most recent call last):
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/liuhongyuan/miniconda3/envs/colossal-chat/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./examples/training_scripts/lora_finetune.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-02-19_09:17:03
host : servers
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2448008)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 ./examples/training_scripts/lora_finetune.py --pretrained ../../../models/DeepSeek-R1-Distill-Llama-70B --dataset ./examples/training_scripts/lora_sft_data.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 3 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Command: 'cd /dataset/liuhongyuan/workspace/deepseek/ColossalAI/applications/ColossalChat && export [full environment-variable dump omitted] && torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 ./examples/training_scripts/lora_finetune.py --pretrained ../../../models/DeepSeek-R1-Distill-Llama-70B --dataset ./examples/training_scripts/lora_sft_data.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 3 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora'
Exit code: 1
Stdout: already printed
Stderr: already printed
====== Training on All Nodes =====
127.0.0.1: failure
====== Stopping All Nodes =====
127.0.0.1: finish
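For context, the assertion in the tracebacks fails because the world size (8 GPUs) must be divisible by tp_size * pp_size, and with the default tp_size of 1 the flag --pp 3 does not divide 8. A minimal sketch of that constraint, assuming only the divisibility check quoted in the log (the helper `valid_pp_sizes` is hypothetical, not a ColossalAI API):

```python
def valid_pp_sizes(world_size: int, tp_size: int) -> list[int]:
    """Pipeline-parallel sizes that pass the plugin's divisibility check:
    world_size % (tp_size * pp_size) == 0."""
    return [pp for pp in range(1, world_size + 1)
            if world_size % (tp_size * pp) == 0]

# With 8 GPUs and tp_size=1 (as in the failing run), pp=3 is rejected.
print(valid_pp_sizes(8, 1))  # divisors of 8: [1, 2, 4, 8]
assert 3 not in valid_pp_sizes(8, 1)
```

So under this reading, on an 8-GPU node --pp would need to be 1, 2, 4, or 8 (with tp_size 1) for the MoeHybridParallelPlugin constructor to get past this assertion.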
Environment
No response