
BERT-score is so different #3

Open
Thornhill-GYL opened this issue Aug 19, 2023 · 6 comments


@Thornhill-GYL

I get a BERTScore result that is very different from the results in your paper.
For the recap_style version, I followed every step in this repo and got a BERTScore of 0.8360, which is very different from the reported 0.4350.
Looking forward to your reply. Thanks.

@Thornhill-GYL Thornhill-GYL changed the title BERT-score is so much different BERT-score is so different Aug 19, 2023
@shuailiu6626
Collaborator

Thank you for pointing that out, and sorry for the confusion! I think the difference arises because we used a different base model for BERTScore, and scores calculated with different base models can have very different ranges [please take a look here], but we can still use the relative scores to compare the models. I believe we used microsoft/deberta-v3-xsmall, but in our published code we didn't specify the model_type. Could you add an argument model_type="microsoft/deberta-v3-xsmall" to the compute_bert_score function and try again? Sorry again for the inconsistency.
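For reference, a minimal sketch of what that change could look like, assuming compute_bert_score is a thin wrapper around bert_score.score (the wrapper shape here is my guess from this thread, not verified against the repo's actual code):

```python
from bert_score import score

def compute_bert_score(candidates, references):
    # Pin the base model so the score range matches the paper.
    # num_layers is set explicitly (DeBERTa-v3-xsmall has 12 transformer
    # layers) in case this model is not in bert_score's built-in
    # model-to-layers defaults -- an assumption on my part.
    P, R, F1 = score(
        candidates,
        references,
        model_type="microsoft/deberta-v3-xsmall",
        num_layers=12,
    )
    return F1.mean().item()
```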

@Thornhill-GYL
Author

Of course, I'll try it again. Thanks a lot.

@Thornhill-GYL
Author

Thornhill-GYL commented Aug 21, 2023

By the way, I would like to ask why my results for the recap_style version are worse than for the cap version. Specifically, for Embed Sim↑ I only get 39.40 with the recap_style version, versus 40.08 for the cap version. Also, almost all of my results are lower than those in your paper; could I be doing something wrong? Looking forward to your reply. Thanks.

@shuailiu6626
Collaborator

Sorry, I'm not sure of the exact reason. I remember you mentioned in another issue that you replaced one of the input features with a new feature. May I ask what new feature you used? That might be the reason for the difference, but I'm not sure. Thank you!

@Thornhill-GYL
Author

Thornhill-GYL commented Aug 28, 2023

Hi, thanks for your reply. About the new feature: I only made that change to fix the bug, so we are actually using the same features.

@shuailiu6626
Collaborator

I see. Thank you! We inspected that issue and found the root cause to be a dependency conflict; we pushed the fix last week. Could you please try the new version? Also, I realized that with the current code you will get two saved checkpoints after training, since the Trainer also saves the final checkpoint. For inference, you should load the checkpoint with the lowest eval_loss rather than the final checkpoint, for both the retriever and the generator. Sorry for the vagueness about the best checkpoint. If you have been loading the final checkpoint for inference, that might be the reason for the performance degradation.
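In case it helps, here is a minimal sketch of how to pick that checkpoint, assuming the standard Hugging Face Trainer output layout where each checkpoint-* directory contains a trainer_state.json whose log_history records eval_loss (the helper name is mine, not from the repo):

```python
import json
from pathlib import Path

def best_checkpoint(output_dir: str) -> str | None:
    """Return the checkpoint-* directory with the lowest recorded eval_loss."""
    best_dir, best_loss = None, float("inf")
    for ckpt in Path(output_dir).glob("checkpoint-*"):
        state_file = ckpt / "trainer_state.json"
        if not state_file.exists():
            continue
        state = json.loads(state_file.read_text())
        # Take the most recent eval_loss logged up to this checkpoint's step.
        losses = [e["eval_loss"] for e in state["log_history"] if "eval_loss" in e]
        if losses and losses[-1] < best_loss:
            best_dir, best_loss = str(ckpt), losses[-1]
    return best_dir
```

You would then pass the returned path to from_pretrained for both the retriever and the generator instead of the final checkpoint.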
