ALCE citation evaluation #7

Thanks for the great work! I am looking at running the ALCE evaluation and noticed that the script loads the NLI model from a local path:

AUTOAIS_MODEL="/scratch/gpfs/hyen/models/t5_xxl_true_nli_mixture"

Is this the same model as this one on Hugging Face? I tried running the script with the Hugging Face model, but got different citation precision / recall numbers for Meta-Llama-3.1-8B-Instruct than those reported in the spreadsheet.

Thanks!
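For context on what those numbers measure: ALCE scores citations with an NLI judge, so the precision/recall values depend directly on which AUTOAIS checkpoint is loaded. Below is a simplified sketch of the recall side only (illustrative, not the official implementation; the real ALCE code also splits responses into sentences and computes citation precision, which additionally checks whether each individual citation is relevant):

```python
# Simplified ALCE-style citation recall (illustrative sketch, not the
# official implementation). `entails` is the NLI judge sketched further below.

def citation_recall(statements, entails):
    """statements: list of (claim_sentence, [cited_passage_texts]) pairs."""
    supported = 0
    for claim, passages in statements:
        # A claim counts as supported when the concatenation of all the
        # passages it cites entails it, according to the NLI model.
        if passages and entails(" ".join(passages), claim):
            supported += 1
    return supported / max(len(statements), 1)
```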
Hi, thank you for your interest in our work! You are correct, the NLI model should be the same one on Hugging Face. Could you share the results that you got and the arguments that you used to run the evaluation?
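For anyone reproducing this, here is a minimal sketch of how that NLI model can be queried (assuming the google/t5_xxl_true_nli_mixture checkpoint on the Hub, which matches the local path's basename, and its documented premise/hypothesis input format, where the model generates "1" for entailment):

```python
# Minimal sketch: querying the TRUE NLI model for entailment.
# Assumes the google/t5_xxl_true_nli_mixture checkpoint, which takes a
# "premise: ... hypothesis: ..." string and generates "1" iff entailed.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL = "google/t5_xxl_true_nli_mixture"  # assumed Hub id matching the local path
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def entails(premise: str, hypothesis: str) -> bool:
    text = f"premise: {premise} hypothesis: {hypothesis}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
    out = model.generate(**inputs, max_new_tokens=10)
    return tokenizer.decode(out[0], skip_special_tokens=True).strip() == "1"
```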
Thanks for getting back to me! I realized that I was not looking at the right context length when comparing my results against those in the spreadsheet, though it does seem like there are still some differences. For reference, I ran the test using this config but for Llama-3.1-8B-Instruct and got the results below for ALCE:
Are these the results for ASQA? They appear to be rather close to the results that we report in the paper. In general, a 1-2 point difference in absolute scores is reasonable given the nondeterministic nature of Flash Attention. However, it would be good to double-check that the arguments used are exactly the same; this is what the args in my output file look like:
For reference, our score file looks like this:
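Since both runs record their arguments in the output/score files, one quick way to catch a mismatch is to diff the two JSON files directly; a minimal sketch, where the file names and the top-level "args" key are hypothetical:

```python
# Minimal sketch: diff the arguments recorded in two score/output files.
# The file names and the "args" key are hypothetical; adjust them to
# wherever the evaluation actually stores its arguments.
import json

def load_args(path):
    with open(path) as f:
        return json.load(f).get("args", {})

mine = load_args("my_scores.json")
reference = load_args("reference_scores.json")
for key in sorted(set(mine) | set(reference)):
    if mine.get(key) != reference.get(key):
        print(f"{key}: mine={mine.get(key)!r} reference={reference.get(key)!r}")
```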