When loading the fine-tuned smaller model, an error happens: Trying to set a tensor of shape torch.Size([311164928]) in "weight" (which has shape torch.Size([151936, 2048])) #1

Open
FireCaramelPudding opened this issue Oct 23, 2024 · 2 comments

@FireCaramelPudding

Hello!

I am a student at Harbin Institute of Technology, currently attempting to replicate your work on CoGenesis. I have run into a tricky issue and hope you can help.

Here is the problem: under the "draft-based method," loading the fine-tuned smaller model fails with the following error: ValueError: Trying to set a tensor of shape torch.Size([311164928]) in "weight" (which has shape torch.Size([151936, 2048])), this looks incorrect. The issue seems to be related to how the model was saved. I have tried the following without success:

- deleting and retraining the model;
- using trainer.save_model from the transformers library instead of your trainer_save_model_safe function (although the principle is similar);
- saving the model from the main process only (which makes the run long enough to trigger NCCL timeout errors).
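
Note that 311164928 = 151936 × 2048, so the checkpoint appears to contain the [151936, 2048] embedding matrix flattened into a single dimension, which is what FSDP's flat parameters look like on disk. A minimal diagnostic sketch I used to check this (the checkpoint path is a placeholder, not the repo's actual output layout):

```python
from safetensors import safe_open

# Diagnostic sketch: list any 1-D tensor whose element count matches the
# expected [151936, 2048] embedding matrix, i.e. a weight saved in
# flattened form. "output/model.safetensors" is a placeholder path.
with safe_open("output/model.safetensors", framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        if tensor.dim() == 1 and tensor.numel() == 151936 * 2048:
            print(name, tuple(tensor.shape))
```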

This problem has been bothering me for a while, and I hope you can spare some time to answer it. I would be very grateful!
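
For reference, my understanding is that trainer_save_model_safe gathers a full, unflattened state dict under FSDP before writing to disk; the sketch below reflects that common pattern (the trainer wiring is my assumption, not the repo's exact code):

```python
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

# Sketch of a full-state-dict save under FSDP: gather the unflattened
# parameters on rank 0 (offloaded to CPU) before saving, so the checkpoint
# holds 2-D weights instead of FSDP's 1-D flat parameters.
def save_model_safe(trainer, output_dir: str):
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(trainer.model, StateDictType.FULL_STATE_DICT, cfg):
        trainer.save_model(output_dir)
```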

Here is my environment setup:

GPU: A100-PCIe-40GB × 2
SLM: Qwen1.5-1.8B-Chat
LLM: Qwen1.5-72B

Package           Version
transformers      4.45.1
vllm              0.6.2
tqdm              4.66.5
colorama          0.4.4
srsly             2.4.8
fire              0.6.0
langchain         0.3.1
langchain-openai  0.2.1
orjson            3.10.7
uvicorn           0.30.6
fastapi           0.115.0
torch             2.1.0+cu121
numpy             1.26.4
requests          2.32.3

@iseesaw
Contributor

iseesaw commented Oct 28, 2024

Hello! This issue may stem from compatibility issues with the environment or library versions. For this project, we used the full training code from FastChat. However, once you have created sketch-based data from LLMs, you can proceed with any code for supervised fine-tuning, such as trl or other utilities available within the Transformers library.
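
For example, a minimal SFT sketch with trl (assuming a recent trl release with SFTConfig; the dataset path, data format, and output directory are placeholders):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Minimal supervised fine-tuning sketch: fine-tune the small model on the
# sketch-based data. Assumes a JSON file where each record has a "text"
# field; the path and output directory are placeholders.
dataset = load_dataset("json", data_files="sketch_sft_data.json", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen1.5-1.8B-Chat",
    train_dataset=dataset,
    args=SFTConfig(output_dir="qwen1.5-1.8b-sft"),
)
trainer.train()
trainer.save_model("qwen1.5-1.8b-sft")
```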

@FireCaramelPudding
Author

> Hello! This issue may stem from compatibility issues with the environment or library versions. For this project, we used the full training code from FastChat. However, once you have created sketch-based data from LLMs, you can proceed with any code for supervised fine-tuning, such as trl or other utilities available within the Transformers library.

Would it be possible to provide detailed version information for every package? Thanks a lot!
