Description
[INFO|trainer.py:2409] 2025-06-20 08:20:42,538 >> ***** Running training *****
[INFO|trainer.py:2410] 2025-06-20 08:20:42,538 >> Num examples = 16
[INFO|trainer.py:2411] 2025-06-20 08:20:42,538 >> Num Epochs = 2
[INFO|trainer.py:2412] 2025-06-20 08:20:42,538 >> Instantaneous batch size per device = 8
[INFO|trainer.py:2415] 2025-06-20 08:20:42,538 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2416] 2025-06-20 08:20:42,538 >> Gradient Accumulation steps = 4
[INFO|trainer.py:2417] 2025-06-20 08:20:42,538 >> Total optimization steps = 2
[INFO|trainer.py:2418] 2025-06-20 08:20:42,539 >> Number of trainable parameters = 1,261,568
0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/disk/wangzeyi/WeClone/.venv/bin/weclone-cli", line 10, in <module>
sys.exit(cli())
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1442, in __call__
return self.main(*args, **kwargs)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1363, in main
rv = self.invoke(ctx)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1830, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1226, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 794, in invoke
return callback(*args, **kwargs)
File "/disk/wangzeyi/WeClone/weclone/cli.py", line 33, in wrapper
return func(*args, **kwargs)
File "/disk/wangzeyi/WeClone/weclone/cli.py", line 51, in new_runtime_wrapper
return original_cmd_func(*args, **kwargs)
File "/disk/wangzeyi/WeClone/weclone/cli.py", line 85, in train_sft
train_sft_main()
File "/disk/wangzeyi/WeClone/weclone/train/train_sft.py", line 51, in main
run_exp(train_config.model_dump(mode="json"))
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/tuner.py", line 110, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/tuner.py", line 72, in _training_function
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/sft/workflow.py", line 96, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2240, in train
return inner_training_loop(
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2509, in _inner_training_loop
batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches, args.device)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 5263, in get_batch_samples
batch_samples.append(next(epoch_iterator))
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 563, in __iter__
dataloader_iter = self.base_dataloader.__iter__()
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 491, in __iter__
return self._get_iterator()
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 422, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/disk/wangzeyi/WeClone/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1146, in __init__
w.start()
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/root/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'PreTrainedModel.enable_input_require_grads.<locals>.make_inputs_require_grads'
0%| | 0/2 [00:00<?, ?it/s]
[W620 08:20:43.310116176 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
After asking an AI about this, its answer was that the pickle operation in the multiprocessing module cannot serialize a local object. How can I fix this?
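For context on why this happens: CPython's pickle serializes a function by its qualified name, so a function defined inside another function (a "local object") cannot be pickled. That is exactly what `make_inputs_require_grads` is; transformers defines it inside `enable_input_require_grads` and registers it as a hook on the model. Here is a minimal sketch reproducing the same AttributeError; `enable_hook` and `hook` are hypothetical names used only for illustration:

import pickle

def enable_hook():
    # Hypothetical nested function, analogous to make_inputs_require_grads,
    # which transformers defines inside enable_input_require_grads.
    def hook():
        pass
    return hook

# A module-level function pickles fine: pickle stores its qualified name.
pickle.dumps(enable_hook)

# The nested function fails: its qualified name is
# 'enable_hook.<locals>.hook', which cannot be resolved at unpickling time.
try:
    pickle.dumps(enable_hook())
except AttributeError as e:
    print(e)  # Can't pickle local object 'enable_hook.<locals>.hook'

In your traceback the pickling is triggered while the DataLoader starts a worker process (`w.start()` → `ForkingPickler.dump`) under the spawn start method (`popen_spawn_posix`), and the object graph being serialized evidently reaches the model with its unpicklable hook. A common workaround, assuming the training config exposes the underlying Hugging Face `TrainingArguments` field, is to set `dataloader_num_workers` to 0 so no worker subprocess has to pickle anything; alternatively, using the fork start method instead of spawn avoids pickling entirely, since forked workers inherit objects from the parent. Both are sketches of likely fixes, not confirmed WeClone settings.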