ALCE citation evaluation #7

Open

carriex opened this issue Oct 22, 2024 · 3 comments

Comments

@carriex

carriex commented Oct 22, 2024

Thanks for the great work! I am looking at running the ALCE evaluation and noticed that the script loads an NLI model from a local path:

AUTOAIS_MODEL="/scratch/gpfs/hyen/models/t5_xxl_true_nli_mixture"

Is this the same model as google/t5_xxl_true_nli_mixture on Hugging Face?

I tried running the script with the Hugging Face model but got different citation precision / recall numbers for Meta-Llama-3.1-8B-Instruct than those reported in the spreadsheet.

Thanks!

@howard-yen
Collaborator

Hi, thank you for your interest in our work!

You are correct: the NLI model should be google/t5_xxl_true_nli_mixture, and I will update the evaluation script accordingly. Thanks for catching this mistake.
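
As a rough sketch of how the Hub checkpoint can be used (assuming the standard TRUE-style "premise: ... hypothesis: ..." prompt for this model; the repository's actual evaluation code may differ), loading and querying it with transformers looks roughly like this:

# Minimal sketch, not the repository's evaluation script.
# google/t5_xxl_true_nli_mixture is trained to generate "1" when the premise
# entails the hypothesis and "0" otherwise.
from transformers import T5ForConditionalGeneration, T5Tokenizer

AUTOAIS_MODEL = "google/t5_xxl_true_nli_mixture"  # Hub ID instead of the local path

tokenizer = T5Tokenizer.from_pretrained(AUTOAIS_MODEL)
model = T5ForConditionalGeneration.from_pretrained(AUTOAIS_MODEL, device_map="auto")

def entails(passage: str, claim: str) -> bool:
    # Build the premise/hypothesis prompt and check whether the model outputs "1".
    prompt = f"premise: {passage} hypothesis: {claim}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
    output = model.generate(**inputs, max_new_tokens=10)
    return tokenizer.decode(output[0], skip_special_tokens=True).strip() == "1"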

Could you share the results that you got and the arguments that you used to run the evaluation?

@carriex
Author

carriex commented Oct 23, 2024

Thanks for getting back to me! I realized that I was not looking at the right context length when comparing my results against those in the spreadsheet. Though it does seem like there are some differences in citation_rec / citation_prec for Llama-3.1-8B-Instruct for ALCE at the 128k context length.

For reference, I ran the test using this config, but for Llama-3.1-8B-Instruct, and got the results below for ALCE:

{ "length": 175.03, "str_em": 15.283333333333331, "str_hit": 4.0, "rougeLsum": 19.226019459757403, "citation_rec": 0.08695652173913043, "citation_prec": 0.10526315789473684, "citation_positions": { "0": 2, "1": 2, "2": 1, "3": 1, "4": 1, "5": 1, "6": 1, "7": 1, "8": 1, "9": 1, "10": 1, "11": 1, "12": 1, "13": 1, "14": 1, "15": 1, "16": 1, "17": 1, "18": 1 } }

@howard-yen
Collaborator

Are these the results for ASQA? They appear to be rather close to the results that we report in the paper. In general, a 1-2 point difference in absolute scores is reasonable given the nondeterministic nature of Flash Attention. However, it would be good to double-check that the arguments used are exactly the same; this is what the args in my output file look like:

{
  "config": "configs/alce.yaml",
  "tag": "v12",
  "model_name_or_path": "/scratch/gpfs/hyen/models/Meta-Llama-3.1-8B-Instruct",
  "use_vllm": false,
  "datasets": "alce_asqa_700",
  "demo_files": "prompts/asqa_revised.json",
  "test_files": "data/alce/asqa_eval_gtr_top2000.json",
  "output_dir": "output/Meta-Llama-3.1-8B-Instruct",
  "overwrite": false,
  "max_test_samples": 100,
  "num_workers": 4,
  "num_depths": 10,
  "shots": 2,
  "input_max_length": 131072,
  "do_sample": false,
  "generation_max_length": 300,
  "generation_min_length": 0,
  "temperature": 1.0,
  "top_p": 1.0,
  "stop_newline": false,
  "seed": 42,
  "no_cuda": false,
  "no_bf16": false,
  "no_torch_compile": false,
  "use_chat_template": true,
  "rope_theta": null,
  "debug": false,
  "count_tokens": false,
  "stop_new_line": false
}
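
To double-check that two runs used identical settings, one simple approach is to diff the two args dumps key by key; the file paths below are placeholders for your own output files, not paths from the repository:

# Minimal sketch: compare two args dumps and print any mismatched settings.
import json

with open("my_run_args.json") as f:         # placeholder path
    mine = json.load(f)
with open("reference_run_args.json") as f:  # placeholder path
    reference = json.load(f)

for key in sorted(set(mine) | set(reference)):
    if mine.get(key) != reference.get(key):
        print(f"{key}: mine={mine.get(key)!r} vs reference={reference.get(key)!r}")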

For reference, our score file looks like this:

{
  "length": 167.54,
  "str_em": 16.95,
  "str_hit": 5.0,
  "rougeLsum": 19.499888451263413,
  "mauve": 28.60813027025195,
  "citation_rec": 0.0,
  "citation_prec": 0.0,
  "citation_positions": {
    "1": 1,
    "0": 1
  }
}
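
If run-to-run variance from Flash Attention is a concern when comparing numbers, one option is to load the model with a different attention backend in transformers; whether the evaluation harness exposes this directly is an assumption here, not something the thread confirms:

# Minimal sketch: select a non-flash attention backend when loading the model.
# attn_implementation is a standard transformers from_pretrained argument.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "eager"
    device_map="auto",
)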
