Idea of the training dataset ratio? #21

Open
goodstudent9 opened this issue Oct 11, 2024 · 2 comments

@goodstudent9

Hi,
I noticed that the training mix puts a lot of weight on the R2R dataset and much less on the CVDN dataset; the overall ratio is about 20:1:5:5:5.

Could you please tell me why you chose that? Was it determined by experiment, or is there some intuition behind this ratio?

Thank you for your help!

@zd11024
Owner

zd11024 commented Oct 16, 2024

Thank you for your question!
The choice to focus more on the R2R dataset compared to the other datasets is based on two main reasons: 1) The R2R dataset provides more detailed and diverse instructions, which helps in learning more effective modality alignment. 2) In other datasets, a single instruction often corresponds to multiple trajectories, which means the amount of unique data is actually smaller.
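For readers who want to see what such a ratio means in practice, here is a minimal sketch of ratio-weighted source sampling. It is not the repository's data loader. Only R2R and CVDN are named in this thread, so the labels `other_a`, `other_b`, `other_c` and the assignment of weights to datasets are assumptions made for illustration.

```python
import random
from collections import Counter

# Assumed mapping: only "R2R" and "CVDN" are named in the thread.
# The "other_*" labels are hypothetical placeholders.
MIX_RATIO = {"R2R": 20, "CVDN": 1, "other_a": 5, "other_b": 5, "other_c": 5}


def sampling_probs(ratio):
    """Normalize integer mix weights into per-dataset sampling probabilities."""
    total = sum(ratio.values())
    return {name: weight / total for name, weight in ratio.items()}


def sample_sources(ratio, n, seed=0):
    """Draw n dataset names, each with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


probs = sampling_probs(MIX_RATIO)
counts = Counter(sample_sources(MIX_RATIO, 3600))
```

With these weights, 20 of every 36 draws come from R2R in expectation, while CVDN contributes 1 of every 36.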

@goodstudent9
Author

Makes sense!

But maybe there is a small mistake in point 2) of your response. Isn't it the other way around, with one trajectory corresponding to many instructions? The ratio is around 1:10 in some datasets. I noticed that too!

So, in your opinion, the amount of unique data is very important.
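Either direction of the mapping can be checked by counting distinct trajectories and distinct instructions among the training pairs. The sketch below assumes each sample is a hypothetical `(trajectory_id, instruction)` tuple; the toy data is made up and is not taken from any of the datasets discussed.

```python
def uniqueness_stats(samples):
    """Count samples, distinct trajectories, and distinct instructions.

    `samples` is an iterable of (trajectory_id, instruction) pairs;
    this record layout is an assumption for illustration.
    """
    samples = list(samples)
    trajectories = {traj for traj, _ in samples}
    instructions = {instr for _, instr in samples}
    return {
        "samples": len(samples),
        "unique_trajectories": len(trajectories),
        "unique_instructions": len(instructions),
    }


# Toy case A: one instruction paired with several trajectories.
one_instruction_many_trajectories = [
    ("t1", "go to the kitchen"),
    ("t2", "go to the kitchen"),
    ("t3", "go to the kitchen"),
]

# Toy case B: one trajectory described by several instructions.
one_trajectory_many_instructions = [
    ("t1", "go to the kitchen"),
    ("t1", "walk past the sofa into the kitchen"),
    ("t1", "leave the hall and stop by the fridge"),
]

stats_a = uniqueness_stats(one_instruction_many_trajectories)
stats_b = uniqueness_stats(one_trajectory_many_instructions)
```

In both toy cases there are three samples, but only one distinct item on one side of the pair, which is the sense in which the amount of unique data is smaller than the raw sample count.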
