Hello,
This is great work, and I enjoyed reading the paper! It improved my understanding of VLM training a lot :).
In the paper, you mentioned using SFT data for pre-alignment, which I assume is the same amount of data used for SFT in the full model (i.e., stage 3). If this is correct, I'm curious if you have conducted any comparisons where the pre-align stage is removed, and the pre-align stage's data is instead used as additional data in stage 2 (i.e., more data for 1 epoch) or stage 3 (i.e., 2 epochs, assuming stages 1 and 3 use the same data).
Does the pre-align stage show an advantage in these comparisons? I'm particularly interested in the scenario where the pre-align stage is discarded, and an equivalent amount of training time is added to stage 3. Another paper has shown that scaling training time (with the same amount of data) can also improve performance. I'm curious about your thoughts on this.
Thanks!
Hi @zwcolin, thanks a lot for the great suggestion!
Yes, the pre-alignment stage uses the same amount of SFT data as the full model.
Your suggestion makes a lot of sense. We do intend to conduct a more comprehensive comparison, including the settings you suggested. We will keep you posted.