Closed
Labels: bug (Something isn't working)
Description
Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
bash train_grpo_countdown.sh
Train: 0%| | 0/8333 [00:00<?, ?it/s][INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /home/jovyan/zzs/ms-swift-main/output/GRPO_COUNTDOWN/v10-20260310-033228/images
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/cli/rlhf.py", line 7, in <module>
[rank0]: rlhf_main()
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/pipelines/train/rlhf.py", line 243, in rlhf_main
[rank0]: return SwiftRLHF(args).main()
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/pipelines/base.py", line 52, in main
[rank0]: result = self.run()
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/ray/base.py", line 168, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/pipelines/train/sft.py", line 197, in run
[rank0]: return self.train(trainer)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/pipelines/train/sft.py", line 270, in train
[rank0]: trainer.train(resume_checkpoint)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/trainers/mixin.py", line 892, in train
[rank0]: res = super().train(*args, **kwargs)
[rank0]: File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/transformers/trainer.py", line 2325, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/transformers/trainer.py", line 2674, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/grpo_trainer.py", line 1872, in training_step
[rank0]: return super().training_step(model, inputs, num_items_in_batch)
[rank0]: File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 1094, in training_step
[rank0]: output = super().training_step(model, inputs, num_items_in_batch)
[rank0]: File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/transformers/trainer.py", line 4014, in training_step
[rank0]: inputs = self._prepare_inputs(inputs)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/utils.py", line 607, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/grpo_trainer.py", line 194, in _prepare_inputs
[rank0]: generation_batch = self._generate_and_score_completions(generation_batch)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/utils.py", line 607, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/grpo_trainer.py", line 227, in _generate_and_score_completions
[rank0]: inputs = self._generate_completions(inputs)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/grpo_trainer.py", line 208, in _generate_completions
[rank0]: results = self._fast_infer(inputs)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/rollout_mixin.py", line 921, in _fast_infer
[rank0]: self._move_model_to_vllm()
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/utils.py", line 607, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/rollout_mixin.py", line 380, in _move_model_to_vllm
[rank0]: self._move_full_model_to_vllm()
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/rollout_mixin.py", line 677, in _move_full_model_to_vllm
[rank0]: self._load_state_dict_to_vllm(state_dict)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/rollout_mixin.py", line 459, in _load_state_dict_to_vllm
[rank0]: _process_bucket_with_flattened_tensor(self, bucket)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/utils.py", line 1440, in _process_bucket_with_flattened_tensor
[rank0]: trainer.vllm_client.update_flattened_params(metadatas, flattened_tensor)
[rank0]: File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/vllm_client.py", line 407, in update_flattened_params
[rank0]: raise RuntimeError(f'Multiple errors: {all_errors}')
[rank0]: RuntimeError: Multiple errors: [AssertionError('Torch not compiled with CUDA enabled')]
[ERROR] 2026-03-10-03:34:02 (PID:1075906, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception
Train: 0%| | 0/8333 [00:04<?, ?it/s]
W0310 03:34:12.125000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1075907 closing signal SIGTERM
W0310 03:34:12.128000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1075908 closing signal SIGTERM
W0310 03:34:12.131000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1075909 closing signal SIGTERM
W0310 03:34:12.133000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1075910 closing signal SIGTERM
W0310 03:34:12.135000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1075911 closing signal SIGTERM
E0310 03:34:16.411000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 1075906) of binary: /home/jovyan/.conda/envs/ms-swift-py310/bin/python3.10
Traceback (most recent call last):
File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/run.py", line 940, in <module>
main()
File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/run.py", line 936, in main
run(args)
File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run
elastic_launch(
File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/jovyan/zzs/ms-swift-main/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2026-03-10_03:34:12
host : nb-579441364312275866-0
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1075906)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[ERROR] 2026-03-10-03:34:16 (PID:1075511, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
Rollout server log:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:45786 - "GET /health/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:45792 - "POST /close_communicator/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:45792 - "GET /get_world_size/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:45792 - "POST /init_communicator/ HTTP/1.1" 200 OK
(Worker_TP1 pid=1140112) INFO 03-10 04:54:37 [utils.py:285] Found hccl from library libhccl.so
(Worker_TP1 pid=1140112) INFO 03-10 04:54:37 [pyhccl.py:88] vLLM is using pyhccl
(Worker_TP0 pid=1140111) INFO 03-10 04:54:47 [utils.py:285] Found hccl from library libhccl.so
(Worker_TP0 pid=1140111) INFO 03-10 04:54:47 [pyhccl.py:88] vLLM is using pyhccl
INFO: 127.0.0.1:56586 - "POST /get_engine_type/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:60016 - "POST /update_flattened_params/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:60016 - "POST /close_communicator/ HTTP/1.1" 200 OK
How to Reproduce
env
npu-smi 24.1.0.3 Version: 24.1.0.3
8 * 910B2
CANN 8.3.rc2
pip list
Package Version Editable project location
--------------------------------- ----------------- ------------------------------
absl-py 2.4.0
accelerate 1.13.0
addict 2.4.0
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.13.3
aiosignal 1.4.0
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms 2.16.5
annotated-doc 0.0.4
annotated-types 0.7.0
anthropic 0.71.0
antlr4-python3-runtime 4.9.3
anyio 4.12.1
apache-tvm-ffi 0.1.9
arctic_inference 0.1.1
asc_opc_tool 0.1.0
astor 0.8.1
async-timeout 5.0.1
attrdict 2.0.1
attrs 25.4.0
auto_tune 0.1.0
binpacking 2.0.1
blake3 1.0.8
blinker 1.9.0
brotli 1.2.0
cachetools 7.0.4
cbor2 5.8.0
certifi 2026.2.25
cffi 2.0.0
charset-normalizer 3.4.5
click 8.3.1
cloudpickle 3.1.2
cmake 4.2.3
compressed-tensors 0.13.0
conda-pack 0.9.1
contourpy 1.3.2
cpm-kernels 1.0.11
crcmod 1.7
cryptography 46.0.5
cuda-bindings 13.1.1
cuda-pathfinder 1.4.1
cuda-python 13.1.1
cupy-cuda12x 14.0.1
cycler 0.12.1
dacite 1.9.2
dataflow 0.0.1
datasets 3.6.0
decorator 5.2.1
deepspeed 0.18.7
depyf 0.20.0
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dnspython 2.8.0
docstring_parser 0.17.0
einops 0.8.2
email-validator 2.3.0
exceptiongroup 1.3.1
fastapi 0.123.10
fastapi-cli 0.0.24
fastapi-cloud-cli 0.14.1
fastar 0.8.0
ffmpy 1.0.0
filelock 3.25.0
flashinfer-python 0.5.3
Flask 3.1.3
fonttools 4.61.1
frozenlist 1.8.0
fsspec 2025.3.0
gguf 0.18.0
gradio 5.50.0
gradio_client 1.14.0
groovy 0.1.2
grpcio 1.78.1
grpcio-reflection 1.78.1
h11 0.16.0
h2 4.3.0
hccl 0.1.0
hccl_parser 0.1
hf-xet 1.3.2
hjson 3.1.0
hpack 4.1.0
httpcore 1.0.9
httptools 0.7.1
httpx 0.28.1
httpx-sse 0.4.3
huggingface_hub 0.36.2
Hypercorn 0.18.0
hyperframe 6.1.0
idna 3.11
ijson 3.5.0
importlib_metadata 8.7.1
interegular 0.3.3
itsdangerous 2.2.0
Jinja2 3.1.6
jiter 0.13.0
jmespath 0.10.0
joblib 1.5.3
json_repair 0.58.5
jsonschema 4.26.0
jsonschema-specifications 2025.9.1
kiwisolver 1.4.9
lark 1.2.2
llguidance 1.3.0
llm_datadist 0.0.1
llm_datadist_v1 0.0.1
llvmlite 0.44.0
lm-format-enforcer 0.11.3
loguru 0.7.3
Markdown 3.10.2
markdown-it-py 4.0.0
MarkupSafe 3.0.3
matplotlib 3.10.8
mcp 1.26.0
mdurl 0.1.2
mistral_common 1.9.1
ml_dtypes 0.5.4
model-hosting-container-standards 0.1.13
modelscope 1.34.0
mpmath 1.3.0
ms_swift 4.1.0.dev0 /home/jovyan/zzs/ms-swift-main
msgpack 1.1.2
msgspec 0.20.0
msobjdump 0.1.0
multidict 6.7.1
multiprocess 0.70.16
networkx 3.4.2
ninja 1.13.0
nltk 3.9.3
numba 0.61.2
numpy 1.26.4
nvidia-cudnn-frontend 1.18.0
nvidia-cutlass-dsl 4.4.1
nvidia-cutlass-dsl-libs-base 4.4.1
nvidia-ml-py 13.590.48
omegaconf 2.3.0
op_compile_tool 0.1.0
op_gen 0.1
op_test_frame 0.1
opc_tool 0.1.0
openai 2.26.0
openai-harmony 0.0.8
opencv-python-headless 4.11.0.86
orjson 3.11.7
oss2 2.19.1
outlines_core 0.2.11
packaging 25.0
pandas 2.3.3
pandas-stubs 2.3.3.260113
partial-json-parser 0.2.1.1.post7
peft 0.18.1
pillow 11.3.0
pip 26.0.1
priority 2.0.0
prometheus_client 0.24.1
prometheus-fastapi-instrumentator 7.1.0
propcache 0.4.1
protobuf 6.33.5
psutil 7.2.2
py-cpuinfo 9.0.0
pyarrow 23.0.1
pybase64 1.4.3
pybind11 3.0.2
pycountry 26.2.16
pycparser 3.0
pycryptodome 3.23.0
pydantic 2.12.3
pydantic_core 2.41.4
pydantic-extra-types 2.11.0
pydantic-settings 2.13.1
pydub 0.25.1
Pygments 2.19.2
PyJWT 2.11.0
pyparsing 3.3.2
python-dateutil 2.9.0.post0
python-dotenv 1.2.2
python-json-logger 4.0.0
python-multipart 0.0.22
pytz 2026.1.post1
PyYAML 6.0.3
pyzmq 27.1.0
Quart 0.20.0
ray 2.54.0
referencing 0.37.0
regex 2026.2.28
requests 2.32.5
rich 14.3.3
rich-toolkit 0.19.7
rignore 0.7.6
rouge 1.0.1
rpds-py 0.30.0
ruff 0.15.5
safehttpx 0.1.7
safetensors 0.7.0
schedule_search 0.0.1
scipy 1.15.3
semantic-version 2.10.0
sentencepiece 0.2.1
sentry-sdk 2.54.0
setproctitle 1.3.7
setuptools 80.10.2
setuptools-scm 9.2.2
shellingham 1.5.4
show_kernel_debug_data 0.1.0
simplejson 3.20.2
six 1.17.0
sniffio 1.3.1
sortedcontainers 2.4.0
sse-starlette 3.3.2
starlette 0.50.0
supervisor 4.3.0
sympy 1.14.0
tabulate 0.10.0
taskgroup 0.2.2
te 0.4.0
tensorboard 2.20.0
tensorboard-data-server 0.7.2
tiktoken 0.12.0
tokenizers 0.22.2
tomli 2.4.0
tomlkit 0.13.3
torch 2.9.0
torch_npu 2.9.0
torchaudio 2.9.1
torchvision 0.24.0
tornado 6.5.4
tqdm 4.67.3
transformers 4.57.6
transformers-stream-generator 0.0.5
triton-ascend 3.2.0
trl 0.29.0
typer 0.24.1
typer-slim 0.24.0
types-pytz 2026.1.1.20260304
typing_extensions 4.15.0
typing-inspection 0.4.2
tzdata 2025.3
urllib3 2.6.3
uvicorn 0.41.0
uvloop 0.22.1
vllm 0.14.0
vllm_ascend 0.14.0rc1
watchfiles 1.1.1
websockets 15.0.1
Werkzeug 3.1.6
wheel 0.46.3
wsproto 1.3.2
xgrammar 0.1.32
xxhash 3.6.0
yarl 1.23.0
zipp 3.23.0
zstandard 0.25.0
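The list above contains several CUDA-oriented packages next to the Ascend stack (`torch_npu`, `vllm_ascend`). Whether any of them is related to the error is not established in this report. The sketch below is only a quick way to flag such packages in `pip list` output. The prefix list `CUDA_HINTS` and the function name `find_cuda_packages` are assumptions made for illustration, not part of ms-swift or pip.

```python
# Heuristic prefixes for CUDA-oriented distributions.
# This is an assumption for illustration, not an official or complete list.
CUDA_HINTS = ("cuda-", "cupy-cuda", "nvidia-", "flashinfer")


def find_cuda_packages(pip_list_text: str) -> list[str]:
    """Return package names from `pip list` output that look CUDA-specific."""
    found = []
    for line in pip_list_text.splitlines():
        parts = line.split()
        # Skip blank lines, the "Package Version" header and the dashed separator.
        if not parts or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        # Normalize underscores so "foo_bar" and "foo-bar" compare the same way.
        name = parts[0].lower().replace("_", "-")
        if name.startswith(CUDA_HINTS):
            found.append(parts[0])
    return found


# A short excerpt in the same shape as the list above.
sample = """\
Package Version
------- -------
cuda-python 13.1.1
cupy-cuda12x 14.0.1
torch 2.9.0
torch_npu 2.9.0
nvidia-ml-py 13.590.48
"""

print(find_cuda_packages(sample))
```

For the full environment, the output of `pip list` can be passed to the same function in place of `sample`.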
train_grpo_countdown.sh
# # Rollout
# ASCEND_RT_VISIBLE_DEVICES=6,7 \
# swift rollout \
# --model '/home/jovyan/zzs/models/Qwen/Qwen2.5-7B-Instruct' \
# --vllm_tensor_parallel_size 2
export TASK_QUEUE_ENABLE=2
export CPU_AFFINITY_CONF=2
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5 \
NPROC_PER_NODE=6 \
swift rlhf \
--rlhf_type grpo \
--model '/home/jovyan/zzs/models/Qwen/Qwen2.5-7B-Instruct' \
--external_plugins examples/train/grpo/plugin/plugin.py \
--reward_funcs external_countdown format \
--use_vllm true \
--vllm_mode server \
--vllm_server_host 127.0.0.1 \
--vllm_server_port 8000 \
--train_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--torch_dtype bfloat16 \
--dataset '/home/jovyan/zzs/ms-swift-main/data/Countdown-Tasks-3to4#50000' \
--load_from_cache_file true \
--max_length 2048 \
--max_completion_length 1024 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 5e-7 \
--gradient_accumulation_steps 8 \
--eval_steps 500 \
--save_steps 100 \
--save_total_limit 20 \
--logging_steps 1 \
--output_dir output/GRPO_COUNTDOWN \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--num_generations 8 \
--deepspeed zero3 \
--temperature 1.0 \
--system 'You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.' \
--log_completions true \
--beta 0.001 \
--num_iterations 1
Additional Information
Both of the following have already been run:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
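`Torch not compiled with CUDA enabled` is the assertion PyTorch raises when CUDA is initialized on a build without CUDA support. As a hedged diagnostic sketch, the following reports which accelerator backends PyTorch can see in a given environment. It may be worth running in both the trainer and the rollout environment after sourcing the scripts above. The function name `accelerator_report` is hypothetical, and the snippet makes no assumption about where ms-swift or vLLM selects the device.

```python
import importlib.util


def accelerator_report() -> dict:
    """Report whether torch and torch_npu import and which devices are visible."""
    report = {
        "torch": False,
        "cuda_available": None,
        "torch_npu": False,
        "npu_available": None,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # torch is not installed in this environment

    import torch

    report["torch"] = True
    report["cuda_available"] = torch.cuda.is_available()

    if importlib.util.find_spec("torch_npu") is not None:
        try:
            import torch_npu  # noqa: F401  (importing registers the torch.npu namespace)
        except Exception:
            # The import can fail, for example if the CANN environment is not set up.
            return report
        report["torch_npu"] = True
        npu = getattr(torch, "npu", None)
        if npu is not None:
            report["npu_available"] = bool(npu.is_available())
    return report


print(accelerator_report())
```

On an Ascend machine one would expect `cuda_available` to be `False` and `npu_available` to be `True`. Any other result suggests an environment problem rather than a problem in the training script.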