BERT-score is so different #3
Thank you for pointing that out! Sorry for the confusion. I think that is because we used a different base model for BERTScore, and scores computed with different base models can have very different ranges [please take a look here], but the relative scores can still be used to compare models. I believe we used
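As an illustration of why the reported numbers can differ so much between base models (this is not the repo's code): the `bert_score` package optionally rescales raw scores against a per-model baseline, and the baseline depends on which base model is used. A minimal sketch of that rescaling, with hypothetical baseline values:

```python
def rescale_with_baseline(raw_score: float, baseline: float) -> float:
    """Linearly rescale a raw BERTScore so the baseline maps to 0 and 1 stays 1.

    Mirrors the rescaling bert_score applies when rescale_with_baseline=True;
    the baseline value itself is model-dependent, so the same raw similarity
    can land in very different reported ranges.
    """
    return (raw_score - baseline) / (1.0 - baseline)

# Hypothetical baselines for two different base models (illustrative only):
raw = 0.84
high_baseline_score = rescale_with_baseline(raw, 0.71)  # ~0.45
low_baseline_score = rescale_with_baseline(raw, 0.30)   # ~0.77
```

The same raw score of 0.84 maps to roughly 0.45 under one baseline and 0.77 under another, which is why only relative comparisons under the same base model are meaningful.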
Of course, I'll try it again. Thanks a lot.
By the way, I'd like to ask why my results from the recap-style version are worse than the cap version. Specifically, on Embed Sim↑ I only get 39.40 for the recap_style version, while the cap version gives 40.08. Almost all the results I get are lower than in your paper, and I'd like to know why; maybe I'm doing something wrong? Looking forward to your reply, thanks.
Sorry, I'm not sure about the exact reason. I remember you mentioned in another issue that you replaced one of the input features with a new feature. May I ask what the new feature you used is? Maybe that's the reason for the difference, but I'm not very sure. Thank you!
Hi, thanks for your reply. About the new feature: I only changed it to fix the bug, so we actually use the same features.
I see. Thank you! We inspected that issue and found the root cause is a dependency conflict. We pushed the fix last week. Could you please try the new version? Also, I realized that with the current code you will get two saved checkpoints after training, since it seems the Trainer also saves the final checkpoint. You should load the one with the lowest eval_loss instead of the final checkpoint for inference, for both the retriever and the generator. Sorry for the vagueness about the best checkpoint. If you load the final checkpoint for inference, that might be the reason for the performance degradation.
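The checkpoint-selection advice above can be sketched as follows. This is a minimal illustration, not the repo's code, assuming Hugging Face Trainer-style output where each `checkpoint-*` directory contains a `trainer_state.json` with a `log_history` list holding `eval_loss` entries:

```python
import json
from pathlib import Path

def best_checkpoint(output_dir):
    """Return the checkpoint directory whose trainer_state.json records the
    lowest eval_loss, or None if no checkpoint with eval logs is found."""
    best, best_loss = None, float("inf")
    for ckpt in Path(output_dir).glob("checkpoint-*"):
        state_file = ckpt / "trainer_state.json"
        if not state_file.is_file():
            continue
        state = json.loads(state_file.read_text())
        losses = [e["eval_loss"] for e in state.get("log_history", [])
                  if "eval_loss" in e]
        if losses and min(losses) < best_loss:
            best_loss, best = min(losses), ckpt
    return best
```

Loading `best_checkpoint(output_dir)` instead of the last-saved checkpoint would implement the "lowest eval_loss" rule for both the retriever and the generator.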
I get a BERT-score result that is very different from the results in your paper.
For the recap_style version, I followed every step in this repo and got a BERT-score of 0.8360, which is very different from 0.4350.
Looking forward to your reply. Thanks.