[Bug]: RuntimeError #245

Open

WoNiuHu opened this issue Sep 24, 2024 · 1 comment
Labels
bug Something isn't working triage

Comments


WoNiuHu commented Sep 24, 2024

Is there an existing issue?

  • I have searched, and there is no existing issue.

Describe the bug

The command executed is as follows.
formatted_time=$(date +"%Y%m%d%H%M%S")
echo $formatted_time

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

mlx worker launch python finetune.py \
  --model_name_or_path MiniCPM-2B-sft-bf16 \
  --output_dir output/OCNLILoRA/$formatted_time/ \
  --train_data_path data/ocnli_public_chatml/train.json \
  --eval_data_path data/ocnli_public_chatml/dev.json \
  --learning_rate 5e-5 --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 128 --model_max_length 128 --bf16 --use_lora false \
  --gradient_accumulation_steps 1 --warmup_steps 100 \
  --max_steps 1000 --weight_decay 0.01 \
  --evaluation_strategy steps --eval_steps 500 \
  --save_strategy steps --save_steps 500 --seed 42 \
  --log_level info --logging_strategy steps --logging_steps 10 \
  --deepspeed configs/ds_config_zero3_offload.json

Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
***** Running training *****
Num examples = 50,486
Num Epochs = 1
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 1,000
Number of trainable parameters = 2,949,120
0%| | 0/1000 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]: File "finetune.py", line 221, in
[rank0]: trainer.train()
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/transformers/trainer.py", line 1938, in train
[rank0]: return inner_training_loop(
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/transformers/trainer.py", line 3318, in training_step
[rank0]: loss = self.compute_loss(model, inputs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/transformers/trainer.py", line 3363, in compute_loss
[rank0]: outputs = model(**inputs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/accelerate/utils/operations.py", line 820, in forward
[rank0]: return model_forward(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/accelerate/utils/operations.py", line 808, in call
[rank0]: return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/peft/peft_model.py", line 1577, in forward
[rank0]: return self.base_model(
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
[rank0]: return self.model.forward(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/MiniCPM-2B-sft-bf16/modeling_minicpm.py", line 1196, in forward
[rank0]: outputs = self.model(
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/MiniCPM-2B-sft-bf16/modeling_minicpm.py", line 1040, in forward
[rank0]: inputs_embeds = self.embed_tokens(input_ids) * self.config.scale_emb
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 164, in forward
[rank0]: return F.embedding(
[rank0]: File "/root/anaconda3/envs/minicpm/lib/python3.8/site-packages/torch/nn/functional.py", line 2267, in embedding
[rank0]: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0]: RuntimeError: 'weight' must be 2-D
0%| | 0/1000 [00:01<?, ?it/s]
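For context, this error is what F.embedding raises whenever the weight it receives is not a 2-D (vocab_size × hidden_size) matrix. Under DeepSpeed ZeRO-3 each rank locally holds only a flattened shard of every parameter, so an nn.Embedding weight that is touched while still partitioned (i.e. outside the DeepSpeed engine, or before being gathered) triggers exactly this failure. A minimal sketch of that mechanism, assuming nothing about finetune.py itself:

# Minimal sketch (not from the issue): F.embedding needs a 2-D weight;
# a ZeRO-3-partitioned parameter is exposed locally as an empty/flat
# tensor, which reproduces the error in the traceback above.
import torch
import torch.nn.functional as F

input_ids = torch.tensor([[1, 2, 3]])

weight_2d = torch.randn(100, 16)                # normal embedding table
print(F.embedding(input_ids, weight_2d).shape)  # torch.Size([1, 3, 16])

weight_partitioned = torch.empty(0)             # what a ZeRO-3 shard looks like locally
F.embedding(input_ids, weight_partitioned)
# RuntimeError: 'weight' must be 2-D

If that is what is happening here, it would suggest the model's embedding is being used before ZeRO-3 has gathered its partition, rather than a problem with the data or the launch flags.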

To Reproduce

RuntimeError

Expected behavior

No response

Screenshots

No response

Environment

- OS: [e.g. Ubuntu 20.04]
- Pytorch: [e.g. torch 2.0.0]
- CUDA: [e.g. CUDA 11.8]
- Device: [e.g. A10, RTX3090]

Additional context

No response

@WoNiuHu WoNiuHu added bug Something isn't working triage labels Sep 24, 2024
@LDLINGLINGLING
Collaborator

Hi, may I ask which code you used for SFT training? It doesn't look like either our officially recommended code or llamafactory.
