
BERT-score is so different #3

Open
Thornhill-GYL opened this issue Aug 19, 2023 · 6 comments


@Thornhill-GYL

I get a BERTScore result that is very different from the results in your paper.
For the recap_style version, I followed every step in this repo and got a BERTScore of 0.8360, which is very different from the reported 0.4350.
Looking forward to your reply. Thanks.

@Thornhill-GYL Thornhill-GYL changed the title BERT-score is so much different BERT-score is so different Aug 19, 2023
@shuailiu6626
Collaborator

Thank you for pointing that out, and sorry for the confusion! I think the difference arises because we used a different base model for BERTScore, and scores calculated with different base models can have very different ranges [please take a look here], but we can still use the relative scores to compare the models. I believe we used microsoft/deberta-v3-xsmall, but in our published code we didn't specify the model_type. Could you add an argument model_type="microsoft/deberta-v3-xsmall" to the compute_bert_score function and try again? Sorry again for the inconsistency.
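For reference, a minimal sketch of what that change could look like, assuming compute_bert_score is a thin wrapper around bert_score.score (the wrapper shape here is my guess from this thread, not verified against the repo's actual code):

```python
from bert_score import score

def compute_bert_score(candidates, references):
    # Pin the base model so the score range matches the paper.
    # num_layers is set explicitly (DeBERTa-v3-xsmall has 12 transformer
    # layers) in case this model is not in bert_score's built-in
    # model-to-layers defaults -- an assumption on my part.
    P, R, F1 = score(
        candidates,
        references,
        model_type="microsoft/deberta-v3-xsmall",
        num_layers=12,
    )
    return F1.mean().item()
```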

@Thornhill-GYL
Author

Of course, I'll try it again. Thanks a lot.

@Thornhill-GYL
Author

Thornhill-GYL commented Aug 21, 2023

By the way, I would like to ask why my results for the recap_style version are worse than for the cap version. Specifically, for Embed Sim↑ I only get 39.40 with the recap_style version, versus 40.08 for the cap version. Also, almost all of my results are lower than those in your paper; could I be doing something wrong? Looking forward to your reply. Thanks.

@shuailiu6626
Collaborator

Sorry, I'm not sure of the exact reason. I remember you mentioned in another issue that you replaced one of the input features with a new feature. May I ask what new feature you used? That might be the reason for the difference, but I'm not sure. Thank you!

@Thornhill-GYL
Author

Thornhill-GYL commented Aug 28, 2023

Hi, thanks for your reply. About the new feature: I only made that change to fix the bug, so we are actually using the same features.

@shuailiu6626
Collaborator

I see. Thank you! We inspected that issue and found the root cause to be a dependency conflict; we pushed the fix last week. Could you please try the new version? Also, I realized that with the current code you will get two saved checkpoints after training, since the Trainer also saves the final checkpoint. For inference, you should load the checkpoint with the lowest eval_loss rather than the final checkpoint, for both the retriever and the generator. Sorry for the vagueness about the best checkpoint. If you have been loading the final checkpoint for inference, that might be the reason for the performance degradation.
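In case it helps, here is a minimal sketch of how to pick that checkpoint, assuming the standard Hugging Face Trainer output layout where each checkpoint-* directory contains a trainer_state.json whose log_history records eval_loss (the helper name is mine, not from the repo):

```python
import json
from pathlib import Path

def best_checkpoint(output_dir: str) -> str | None:
    """Return the checkpoint-* directory with the lowest recorded eval_loss."""
    best_dir, best_loss = None, float("inf")
    for ckpt in Path(output_dir).glob("checkpoint-*"):
        state_file = ckpt / "trainer_state.json"
        if not state_file.exists():
            continue
        state = json.loads(state_file.read_text())
        # Take the most recent eval_loss logged up to this checkpoint's step.
        losses = [e["eval_loss"] for e in state["log_history"] if "eval_loss" in e]
        if losses and losses[-1] < best_loss:
            best_dir, best_loss = str(ckpt), losses[-1]
    return best_dir
```

You would then pass the returned path to from_pretrained for both the retriever and the generator instead of the final checkpoint.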
