Idea of the training dataset ratio? #21

goodstudent9 · 2024-10-11T09:10:23Z

Hi,
I noticed that you put a lot of weight on the R2R dataset and much less on the CVDN dataset; the ratio is about 20:1:5:5:5.

Could you please tell me why you chose that? Was it based on experiments, or do you have some other reasoning behind this ratio?

Thank you for your help!

zd11024 · 2024-10-16T09:44:09Z

Thank you for your question!
The choice to focus more on the R2R dataset compared to the other datasets is based on two main reasons: 1) The R2R dataset provides more detailed and diverse instructions, which helps in learning more effective modality alignment. 2) In other datasets, a single instruction often corresponds to multiple trajectories, which means the amount of unique data is actually smaller.
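As a rough illustration of what such a ratio means in practice, here is a minimal sketch, assuming the 20:1:5:5:5 ratio is used as per-dataset sampling weights. This is not the repository's actual code: only R2R and CVDN are named in this thread, so the other three dataset names, the assignment of weights to datasets, and the helper functions are all placeholders.

```python
import random
from collections import Counter

# Hypothetical mapping of the 20:1:5:5:5 ratio to datasets. Only R2R and CVDN
# are named in this thread; the other three names are placeholders.
RATIO = {"R2R": 20, "CVDN": 1, "dataset_c": 5, "dataset_d": 5, "dataset_e": 5}


def sampling_probs(ratio):
    """Normalize integer ratio weights into probabilities that sum to 1."""
    total = sum(ratio.values())
    return {name: weight / total for name, weight in ratio.items()}


def sample_sources(ratio, n, seed=0):
    """Draw n dataset names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


probs = sampling_probs(RATIO)
counts = Counter(sample_sources(RATIO, 3600))
```

Under these assumptions, R2R would supply 20/36 of the sampled examples and CVDN 1/36, which is one way to read "more attention" to a dataset.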

goodstudent9 · 2024-10-17T09:34:28Z

Makes sense!

But maybe there is a small mistake in your point 2). Shouldn't it be that one trajectory corresponds to many instructions? The ratio is around 1:10 in some datasets. I noticed that too!

So, in your opinion, the amount of unique data is very important.
