
Running weclone-cli train-sft fails with "Can't pickle local object" #165

@wang-geek


[INFO|trainer.py:2409] 2025-06-20 08:20:42,538 >> ***** Running training *****
[INFO|trainer.py:2410] 2025-06-20 08:20:42,538 >> Num examples = 16
[INFO|trainer.py:2411] 2025-06-20 08:20:42,538 >> Num Epochs = 2
[INFO|trainer.py:2412] 2025-06-20 08:20:42,538 >> Instantaneous batch size per device = 8
[INFO|trainer.py:2415] 2025-06-20 08:20:42,538 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2416] 2025-06-20 08:20:42,538 >> Gradient Accumulation steps = 4
[INFO|trainer.py:2417] 2025-06-20 08:20:42,538 >> Total optimization steps = 2
[INFO|trainer.py:2418] 2025-06-20 08:20:42,539 >> Number of trainable parameters = 1,261,568
0%| | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
File "/disk/wangzeyi/WeClone/.venv/bin/weclone-cli", line 10, in <module>
sys.exit(cli())
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1442, in __call__
return self.main(*args, **kwargs)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1363, in main
rv = self.invoke(ctx)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1830, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1226, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 794, in invoke
return callback(*args, **kwargs)
File "/disk/wangzeyi/WeClone/weclone/cli.py", line 33, in wrapper
return func(*args, **kwargs)
File "/disk/wangzeyi/WeClone/weclone/cli.py", line 51, in new_runtime_wrapper
return original_cmd_func(*args, **kwargs)
File "/disk/wangzeyi/WeClone/weclone/cli.py", line 85, in train_sft
train_sft_main()
File "/disk/wangzeyi/WeClone/weclone/train/train_sft.py", line 51, in main
run_exp(train_config.model_dump(mode="json"))
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/tuner.py", line 110, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/tuner.py", line 72, in _training_function
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/sft/workflow.py", line 96, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2240, in train
return inner_training_loop(
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2509, in _inner_training_loop
batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches, args.device)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 5263, in get_batch_samples
batch_samples.append(next(epoch_iterator))
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 563, in __iter__
dataloader_iter = self.base_dataloader.__iter__()
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 491, in __iter__
return self._get_iterator()
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 422, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1146, in __init__
w.start()
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'PreTrainedModel.enable_input_require_grads.<locals>.make_inputs_require_grads'
0%| | 0/2 [00:00<?, ?it/s]
[W620 08:20:43.310116176 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

I asked an AI about this, and its answer was that the pickle operation in the multiprocessing module cannot serialize a local object. How can I fix this?
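
For reference, the pickling failure can be reproduced outside of WeClone with a few lines of Python. This is a minimal sketch, not the code path from the traceback: `make_hook`, `hook`, and `can_pickle` are hypothetical names chosen for illustration. The nested `hook` plays the role of `make_inputs_require_grads`, which is defined inside `PreTrainedModel.enable_input_require_grads`. The traceback goes through `popen_spawn_posix.py`, so the DataLoader workers appear to be started with the `spawn` method, which pickles the process object before launching it.

```python
import math
import pickle


def make_hook():
    # Nested function: its qualified name is "make_hook.<locals>.hook",
    # so pickle cannot look it up by name at load time.
    def hook(module, inputs, output):
        return output

    return hook


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False if it is rejected."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # Python 3.10 raises AttributeError ("Can't pickle local object ...");
        # PicklingError is caught too in case other versions differ.
        return False


local_ok = can_pickle(make_hook())  # function defined inside another function
global_ok = can_pickle(math.sqrt)   # function importable by its qualified name
print("local function picklable:", local_ok)
print("module-level function picklable:", global_ok)
```

If the same mechanism is at work here, anything that avoids pickling the model in worker processes should sidestep the error. One thing that may be worth trying is loading data in the main process by setting `dataloader_num_workers` to 0 in the training arguments. That is an assumption and has not been verified against this setup.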


Labels: bug, good first issue
