GRPO + vLLM server on Ascend 910B fails with 'Torch not compiled with CUDA enabled' #8260

@leetakhing

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

bash train_grpo_countdown.sh

Train:   0%|          | 0/8333 [00:00<?, ?it/s]
[INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /home/jovyan/zzs/ms-swift-main/output/GRPO_COUNTDOWN/v10-20260310-033228/images
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/cli/rlhf.py", line 7, in <module>
[rank0]:     rlhf_main()
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/pipelines/train/rlhf.py", line 243, in rlhf_main
[rank0]:     return SwiftRLHF(args).main()
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/pipelines/base.py", line 52, in main
[rank0]:     result = self.run()
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/ray/base.py", line 168, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/pipelines/train/sft.py", line 197, in run
[rank0]:     return self.train(trainer)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/pipelines/train/sft.py", line 270, in train
[rank0]:     trainer.train(resume_checkpoint)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/trainers/mixin.py", line 892, in train
[rank0]:     res = super().train(*args, **kwargs)
[rank0]:   File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/transformers/trainer.py", line 2325, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/transformers/trainer.py", line 2674, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/grpo_trainer.py", line 1872, in training_step
[rank0]:     return super().training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 1094, in training_step
[rank0]:     output = super().training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/transformers/trainer.py", line 4014, in training_step
[rank0]:     inputs = self._prepare_inputs(inputs)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/utils.py", line 607, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/grpo_trainer.py", line 194, in _prepare_inputs
[rank0]:     generation_batch = self._generate_and_score_completions(generation_batch)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/utils.py", line 607, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/grpo_trainer.py", line 227, in _generate_and_score_completions
[rank0]:     inputs = self._generate_completions(inputs)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/grpo_trainer.py", line 208, in _generate_completions
[rank0]:     results = self._fast_infer(inputs)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/rollout_mixin.py", line 921, in _fast_infer
[rank0]:     self._move_model_to_vllm()
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/utils.py", line 607, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/rollout_mixin.py", line 380, in _move_model_to_vllm
[rank0]:     self._move_full_model_to_vllm()
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/rollout_mixin.py", line 677, in _move_full_model_to_vllm
[rank0]:     self._load_state_dict_to_vllm(state_dict)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/rollout_mixin.py", line 459, in _load_state_dict_to_vllm
[rank0]:     _process_bucket_with_flattened_tensor(self, bucket)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/utils.py", line 1440, in _process_bucket_with_flattened_tensor
[rank0]:     trainer.vllm_client.update_flattened_params(metadatas, flattened_tensor)
[rank0]:   File "/home/jovyan/zzs/ms-swift-main/swift/rlhf_trainers/vllm_client.py", line 407, in update_flattened_params
[rank0]:     raise RuntimeError(f'Multiple errors: {all_errors}')
[rank0]: RuntimeError: Multiple errors: [AssertionError('Torch not compiled with CUDA enabled')]
[ERROR] 2026-03-10-03:34:02 (PID:1075906, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception
Train:   0%|          | 0/8333 [00:04<?, ?it/s]
W0310 03:34:12.125000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1075907 closing signal SIGTERM
W0310 03:34:12.128000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1075908 closing signal SIGTERM
W0310 03:34:12.131000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1075909 closing signal SIGTERM
W0310 03:34:12.133000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1075910 closing signal SIGTERM
W0310 03:34:12.135000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1075911 closing signal SIGTERM
E0310 03:34:16.411000 1075511 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 1075906) of binary: /home/jovyan/.conda/envs/ms-swift-py310/bin/python3.10
Traceback (most recent call last):
  File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/run.py", line 940, in <module>
    main()
  File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/.conda/envs/ms-swift-py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/jovyan/zzs/ms-swift-main/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-03-10_03:34:12
  host      : nb-579441364312275866-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1075906)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[ERROR] 2026-03-10-03:34:16 (PID:1075511, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
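
The assertion text in the traceback above is the message PyTorch raises when a `torch.cuda` API is reached on a build without CUDA support, which is the normal state of a `torch` + `torch_npu` install on Ascend. A minimal probe for checking which backend the training interpreter actually exposes might look like the sketch below. The helper name `detect_accelerator` is hypothetical and not part of ms-swift or vLLM, and the sketch assumes that importing `torch_npu` is what registers the `torch.npu` namespace.

```python
import importlib


def detect_accelerator() -> str:
    """Return 'npu', 'cuda', or 'cpu' depending on what this interpreter exposes.

    Hypothetical debugging helper; not part of ms-swift or vLLM.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return "cpu"  # torch itself is not installed

    try:
        # Assumption: importing torch_npu registers the torch.npu namespace.
        importlib.import_module("torch_npu")
    except Exception:
        pass  # not an Ascend environment, or the plugin failed to load

    npu = getattr(torch, "npu", None)
    if npu is not None and npu.is_available():
        return "npu"
    if torch.cuda.is_available():  # returns False (no exception) on non-CUDA builds
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print("accelerator:", detect_accelerator())
```

If this reports `npu` in the training environment, the failing call is probably a hard-coded `torch.cuda` path somewhere in the weight-sync code rather than a broken install, though the traceback alone does not confirm that.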

Rollout server log

INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:45786 - "GET /health/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:45792 - "POST /close_communicator/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:45792 - "GET /get_world_size/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:45792 - "POST /init_communicator/ HTTP/1.1" 200 OK
(Worker_TP1 pid=1140112) INFO 03-10 04:54:37 [utils.py:285] Found hccl from library libhccl.so
(Worker_TP1 pid=1140112) INFO 03-10 04:54:37 [pyhccl.py:88] vLLM is using pyhccl
(Worker_TP0 pid=1140111) INFO 03-10 04:54:47 [utils.py:285] Found hccl from library libhccl.so
(Worker_TP0 pid=1140111) INFO 03-10 04:54:47 [pyhccl.py:88] vLLM is using pyhccl
INFO:     127.0.0.1:56586 - "POST /get_engine_type/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:60016 - "POST /update_flattened_params/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:60016 - "POST /close_communicator/ HTTP/1.1" 200 OK
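
Every request in the rollout log above, including `POST /update_flattened_params/`, returned `200 OK`. That suggests, without proving it, that the assertion surfaces on the trainer side of the weight sync rather than in the HTTP handler. For longer logs, a small scan for failed requests can confirm this. The sketch below assumes the default uvicorn access-log format shown above; `failed_requests` is a hypothetical helper name.

```python
import re

# Matches uvicorn access-log lines such as:
# INFO:     127.0.0.1:60016 - "POST /update_flattened_params/ HTTP/1.1" 200 OK
ACCESS_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def failed_requests(log_text: str) -> list[tuple[str, str, int]]:
    """Return (method, path, status) for every access-log entry with status >= 400."""
    failures = []
    for line in log_text.splitlines():
        match = ACCESS_RE.search(line)
        if match and int(match["status"]) >= 400:
            failures.append((match["method"], match["path"], int(match["status"])))
    return failures


# Three lines copied from the rollout log in this report.
SAMPLE = '''\
INFO:     127.0.0.1:45792 - "POST /init_communicator/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:60016 - "POST /update_flattened_params/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:60016 - "POST /close_communicator/ HTTP/1.1" 200 OK
'''

if __name__ == "__main__":
    print(failed_requests(SAMPLE))
```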

How to Reproduce

Environment

npu-smi version: 24.1.0.3
NPUs: 8 × Ascend 910B2
CANN: 8.3.rc2

pip list

Package                           Version           Editable project location
--------------------------------- ----------------- ------------------------------
absl-py                           2.4.0
accelerate                        1.13.0
addict                            2.4.0
aiofiles                          24.1.0
aiohappyeyeballs                  2.6.1
aiohttp                           3.13.3
aiosignal                         1.4.0
aliyun-python-sdk-core            2.16.0
aliyun-python-sdk-kms             2.16.5
annotated-doc                     0.0.4
annotated-types                   0.7.0
anthropic                         0.71.0
antlr4-python3-runtime            4.9.3
anyio                             4.12.1
apache-tvm-ffi                    0.1.9
arctic_inference                  0.1.1
asc_opc_tool                      0.1.0
astor                             0.8.1
async-timeout                     5.0.1
attrdict                          2.0.1
attrs                             25.4.0
auto_tune                         0.1.0
binpacking                        2.0.1
blake3                            1.0.8
blinker                           1.9.0
brotli                            1.2.0
cachetools                        7.0.4
cbor2                             5.8.0
certifi                           2026.2.25
cffi                              2.0.0
charset-normalizer                3.4.5
click                             8.3.1
cloudpickle                       3.1.2
cmake                             4.2.3
compressed-tensors                0.13.0
conda-pack                        0.9.1
contourpy                         1.3.2
cpm-kernels                       1.0.11
crcmod                            1.7
cryptography                      46.0.5
cuda-bindings                     13.1.1
cuda-pathfinder                   1.4.1
cuda-python                       13.1.1
cupy-cuda12x                      14.0.1
cycler                            0.12.1
dacite                            1.9.2
dataflow                          0.0.1
datasets                          3.6.0
decorator                         5.2.1
deepspeed                         0.18.7
depyf                             0.20.0
dill                              0.3.8
diskcache                         5.6.3
distro                            1.9.0
dnspython                         2.8.0
docstring_parser                  0.17.0
einops                            0.8.2
email-validator                   2.3.0
exceptiongroup                    1.3.1
fastapi                           0.123.10
fastapi-cli                       0.0.24
fastapi-cloud-cli                 0.14.1
fastar                            0.8.0
ffmpy                             1.0.0
filelock                          3.25.0
flashinfer-python                 0.5.3
Flask                             3.1.3
fonttools                         4.61.1
frozenlist                        1.8.0
fsspec                            2025.3.0
gguf                              0.18.0
gradio                            5.50.0
gradio_client                     1.14.0
groovy                            0.1.2
grpcio                            1.78.1
grpcio-reflection                 1.78.1
h11                               0.16.0
h2                                4.3.0
hccl                              0.1.0
hccl_parser                       0.1
hf-xet                            1.3.2
hjson                             3.1.0
hpack                             4.1.0
httpcore                          1.0.9
httptools                         0.7.1
httpx                             0.28.1
httpx-sse                         0.4.3
huggingface_hub                   0.36.2
Hypercorn                         0.18.0
hyperframe                        6.1.0
idna                              3.11
ijson                             3.5.0
importlib_metadata                8.7.1
interegular                       0.3.3
itsdangerous                      2.2.0
Jinja2                            3.1.6
jiter                             0.13.0
jmespath                          0.10.0
joblib                            1.5.3
json_repair                       0.58.5
jsonschema                        4.26.0
jsonschema-specifications         2025.9.1
kiwisolver                        1.4.9
lark                              1.2.2
llguidance                        1.3.0
llm_datadist                      0.0.1
llm_datadist_v1                   0.0.1
llvmlite                          0.44.0
lm-format-enforcer                0.11.3
loguru                            0.7.3
Markdown                          3.10.2
markdown-it-py                    4.0.0
MarkupSafe                        3.0.3
matplotlib                        3.10.8
mcp                               1.26.0
mdurl                             0.1.2
mistral_common                    1.9.1
ml_dtypes                         0.5.4
model-hosting-container-standards 0.1.13
modelscope                        1.34.0
mpmath                            1.3.0
ms_swift                          4.1.0.dev0        /home/jovyan/zzs/ms-swift-main
msgpack                           1.1.2
msgspec                           0.20.0
msobjdump                         0.1.0
multidict                         6.7.1
multiprocess                      0.70.16
networkx                          3.4.2
ninja                             1.13.0
nltk                              3.9.3
numba                             0.61.2
numpy                             1.26.4
nvidia-cudnn-frontend             1.18.0
nvidia-cutlass-dsl                4.4.1
nvidia-cutlass-dsl-libs-base      4.4.1
nvidia-ml-py                      13.590.48
omegaconf                         2.3.0
op_compile_tool                   0.1.0
op_gen                            0.1
op_test_frame                     0.1
opc_tool                          0.1.0
openai                            2.26.0
openai-harmony                    0.0.8
opencv-python-headless            4.11.0.86
orjson                            3.11.7
oss2                              2.19.1
outlines_core                     0.2.11
packaging                         25.0
pandas                            2.3.3
pandas-stubs                      2.3.3.260113
partial-json-parser               0.2.1.1.post7
peft                              0.18.1
pillow                            11.3.0
pip                               26.0.1
priority                          2.0.0
prometheus_client                 0.24.1
prometheus-fastapi-instrumentator 7.1.0
propcache                         0.4.1
protobuf                          6.33.5
psutil                            7.2.2
py-cpuinfo                        9.0.0
pyarrow                           23.0.1
pybase64                          1.4.3
pybind11                          3.0.2
pycountry                         26.2.16
pycparser                         3.0
pycryptodome                      3.23.0
pydantic                          2.12.3
pydantic_core                     2.41.4
pydantic-extra-types              2.11.0
pydantic-settings                 2.13.1
pydub                             0.25.1
Pygments                          2.19.2
PyJWT                             2.11.0
pyparsing                         3.3.2
python-dateutil                   2.9.0.post0
python-dotenv                     1.2.2
python-json-logger                4.0.0
python-multipart                  0.0.22
pytz                              2026.1.post1
PyYAML                            6.0.3
pyzmq                             27.1.0
Quart                             0.20.0
ray                               2.54.0
referencing                       0.37.0
regex                             2026.2.28
requests                          2.32.5
rich                              14.3.3
rich-toolkit                      0.19.7
rignore                           0.7.6
rouge                             1.0.1
rpds-py                           0.30.0
ruff                              0.15.5
safehttpx                         0.1.7
safetensors                       0.7.0
schedule_search                   0.0.1
scipy                             1.15.3
semantic-version                  2.10.0
sentencepiece                     0.2.1
sentry-sdk                        2.54.0
setproctitle                      1.3.7
setuptools                        80.10.2
setuptools-scm                    9.2.2
shellingham                       1.5.4
show_kernel_debug_data            0.1.0
simplejson                        3.20.2
six                               1.17.0
sniffio                           1.3.1
sortedcontainers                  2.4.0
sse-starlette                     3.3.2
starlette                         0.50.0
supervisor                        4.3.0
sympy                             1.14.0
tabulate                          0.10.0
taskgroup                         0.2.2
te                                0.4.0
tensorboard                       2.20.0
tensorboard-data-server           0.7.2
tiktoken                          0.12.0
tokenizers                        0.22.2
tomli                             2.4.0
tomlkit                           0.13.3
torch                             2.9.0
torch_npu                         2.9.0
torchaudio                        2.9.1
torchvision                       0.24.0
tornado                           6.5.4
tqdm                              4.67.3
transformers                      4.57.6
transformers-stream-generator     0.0.5
triton-ascend                     3.2.0
trl                               0.29.0
typer                             0.24.1
typer-slim                        0.24.0
types-pytz                        2026.1.1.20260304
typing_extensions                 4.15.0
typing-inspection                 0.4.2
tzdata                            2025.3
urllib3                           2.6.3
uvicorn                           0.41.0
uvloop                            0.22.1
vllm                              0.14.0
vllm_ascend                       0.14.0rc1
watchfiles                        1.1.1
websockets                        15.0.1
Werkzeug                          3.1.6
wheel                             0.46.3
wsproto                           1.3.2
xgrammar                          0.1.32
xxhash                            3.6.0
yarl                              1.23.0
zipp                              3.23.0
zstandard                         0.25.0
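
The list above contains several wheels that are normally built for NVIDIA GPUs (`cuda-python`, `cupy-cuda12x`, `flashinfer-python`, `nvidia-*`). Whether any of them is involved in the failure is not established here, but they are worth knowing about on an NPU machine. The sketch below lists such packages in the current environment. The name patterns are an assumption chosen for illustration, not an official list, and `cuda_only_packages` is a hypothetical helper.

```python
import re
from importlib import metadata

# Name patterns that usually indicate a CUDA-only wheel (an assumption, not an official list).
CUDA_NAME_RE = re.compile(r"^(cuda[-_]|cupy[-_]cuda|nvidia[-_]|flashinfer)", re.IGNORECASE)


def cuda_only_packages(names):
    """Return the sorted subset of package names that look CUDA-specific."""
    return sorted(n for n in names if CUDA_NAME_RE.match(n))


def installed_names():
    """Names of all distributions visible to the current interpreter."""
    return [d.metadata["Name"] for d in metadata.distributions() if d.metadata["Name"]]


if __name__ == "__main__":
    for name in cuda_only_packages(installed_names()):
        print(name)
```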

train_grpo_countdown.sh

# # Rollout
# ASCEND_RT_VISIBLE_DEVICES=6,7 \
# swift rollout \
#   --model '/home/jovyan/zzs/models/Qwen/Qwen2.5-7B-Instruct' \
#   --vllm_tensor_parallel_size 2

export TASK_QUEUE_ENABLE=2
export CPU_AFFINITY_CONF=2

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5 \
NPROC_PER_NODE=6 \
    swift rlhf \
    --rlhf_type grpo \
    --model '/home/jovyan/zzs/models/Qwen/Qwen2.5-7B-Instruct' \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_countdown format \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host 127.0.0.1 \
    --vllm_server_port 8000 \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --torch_dtype bfloat16 \
    --dataset '/home/jovyan/zzs/ms-swift-main/data/Countdown-Tasks-3to4#50000' \
    --load_from_cache_file true \
    --max_length 2048 \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 5e-7 \
    --gradient_accumulation_steps 8 \
    --eval_steps 500 \
    --save_steps 100 \
    --save_total_limit 20 \
    --logging_steps 1 \
    --output_dir output/GRPO_COUNTDOWN \
    --warmup_ratio 0.01 \
    --dataloader_num_workers 4 \
    --num_generations 8 \
    --deepspeed zero3 \
    --temperature 1.0 \
    --system 'You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.' \
    --log_completions true \
    --beta 0.001 \
    --num_iterations 1
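
As a sanity check on the launch settings above, the sketch below verifies that the training devices (0-5) and the rollout devices (6,7) do not overlap, and that `NPROC_PER_NODE` matches the number of training devices. It also reproduces the 8333 steps shown in the progress bar, under the assumption that each optimizer step consumes `per_device_train_batch_size * NPROC_PER_NODE * gradient_accumulation_steps` completions. That batching rule is an inference from the numbers in this report, not something confirmed against the trainer source, and all function names are hypothetical.

```python
def parse_devices(value: str) -> list[int]:
    """Parse an ASCEND_RT_VISIBLE_DEVICES-style string such as '0,1,2'."""
    return [int(part) for part in value.split(",") if part.strip()]


def check_partition(train: str, rollout: str, nproc: int) -> None:
    """Raise ValueError if the two device sets overlap or nproc does not match."""
    train_ids, rollout_ids = parse_devices(train), parse_devices(rollout)
    shared = set(train_ids) & set(rollout_ids)
    if shared:
        raise ValueError(f"devices used by both processes: {sorted(shared)}")
    if nproc != len(train_ids):
        raise ValueError(f"NPROC_PER_NODE={nproc} but {len(train_ids)} training devices")


def optimizer_steps(num_prompts, num_generations, per_device_bs, nproc, grad_accum):
    """Steps per epoch if each step consumes per_device_bs * nproc * grad_accum completions.

    This batching rule is an assumption about the trainer, chosen because it
    reproduces the 8333 shown in the progress bar.
    """
    completions_per_step = per_device_bs * nproc * grad_accum
    return num_prompts * num_generations // completions_per_step


if __name__ == "__main__":
    check_partition("0,1,2,3,4,5", "6,7", nproc=6)  # values from the script above
    print(optimizer_steps(50000, 8, 1, 6, 8))  # 8333
```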

Additional Information

Both environment scripts were sourced before running:

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
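
To confirm that the two scripts took effect in the shell that launches training (and, separately, in the shell that launches the rollout server), a quick check of the exported variables can help. The variable names below are an assumption about what these `set_env.sh` scripts export and may differ between CANN and NNAL releases; `missing_vars` is a hypothetical helper.

```python
import os

# Variables these set_env.sh scripts are expected to export (an assumption;
# the exact set can differ between CANN / NNAL releases).
EXPECTED_VARS = ("ASCEND_TOOLKIT_HOME", "ASCEND_HOME_PATH", "ATB_HOME_PATH")


def missing_vars(env, expected=EXPECTED_VARS):
    """Return the expected variable names that are unset or empty in env."""
    return [name for name in expected if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    print("missing:", missing if missing else "none")
```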

Labels: bug
