ALCE citation evaluation #7

Open

carriex opened this issue Oct 22, 2024 · 3 comments

Comments

@carriex

carriex commented Oct 22, 2024

Thanks for the great work! I am looking at running the ALCE evaluation and noticed that the script loads an NLI model from a local path:

AUTOAIS_MODEL="/scratch/gpfs/hyen/models/t5_xxl_true_nli_mixture"

Is this the same model as google/t5_xxl_true_nli_mixture on Hugging Face?

I tried running the script with the Hugging Face model but got different citation precision / recall numbers for Meta-Llama-3.1-8B-Instruct than those reported in the spreadsheet.

Thanks!

@howard-yen
Collaborator

Hi, thank you for your interest in our work!

You are correct: the NLI model should be google/t5_xxl_true_nli_mixture, and I will update the evaluation script accordingly. Thanks for catching this mistake.
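
As a rough sketch of how the Hub checkpoint can be used (assuming the standard TRUE-style "premise: ... hypothesis: ..." prompt for this model; the repository's actual evaluation code may differ), loading and querying it with transformers looks roughly like this:

# Minimal sketch, not the repository's evaluation script.
# google/t5_xxl_true_nli_mixture is trained to generate "1" when the premise
# entails the hypothesis and "0" otherwise.
from transformers import T5ForConditionalGeneration, T5Tokenizer

AUTOAIS_MODEL = "google/t5_xxl_true_nli_mixture"  # Hub ID instead of the local path

tokenizer = T5Tokenizer.from_pretrained(AUTOAIS_MODEL)
model = T5ForConditionalGeneration.from_pretrained(AUTOAIS_MODEL, device_map="auto")

def entails(passage: str, claim: str) -> bool:
    # Build the premise/hypothesis prompt and check whether the model outputs "1".
    prompt = f"premise: {passage} hypothesis: {claim}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
    output = model.generate(**inputs, max_new_tokens=10)
    return tokenizer.decode(output[0], skip_special_tokens=True).strip() == "1"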

Could you share the results that you got and the arguments that you used to run the evaluation?

@carriex
Author

carriex commented Oct 23, 2024

Thanks for getting back to me! I realized that I was not looking at the right context length when comparing my results against those in the spreadsheet. Though it does seem like there are some differences in citation_rec / citation_prec for Llama-3.1-8B-Instruct for ALCE at the 128k context length.

For reference, I ran the test using this config, but for Llama-3.1-8B-Instruct, and got the results below for ALCE:

{ "length": 175.03, "str_em": 15.283333333333331, "str_hit": 4.0, "rougeLsum": 19.226019459757403, "citation_rec": 0.08695652173913043, "citation_prec": 0.10526315789473684, "citation_positions": { "0": 2, "1": 2, "2": 1, "3": 1, "4": 1, "5": 1, "6": 1, "7": 1, "8": 1, "9": 1, "10": 1, "11": 1, "12": 1, "13": 1, "14": 1, "15": 1, "16": 1, "17": 1, "18": 1 } }

@howard-yen
Collaborator

Are these the results for ASQA? They appear to be rather close to the results that we report in the paper. In general, a 1-2 point difference in absolute scores is reasonable given the nondeterministic nature of Flash Attention. However, it would be good to double-check that the arguments used are exactly the same; this is what the args in my output file look like:

{
  "config": "configs/alce.yaml",
  "tag": "v12",
  "model_name_or_path": "/scratch/gpfs/hyen/models/Meta-Llama-3.1-8B-Instruct",
  "use_vllm": false,
  "datasets": "alce_asqa_700",
  "demo_files": "prompts/asqa_revised.json",
  "test_files": "data/alce/asqa_eval_gtr_top2000.json",
  "output_dir": "output/Meta-Llama-3.1-8B-Instruct",
  "overwrite": false,
  "max_test_samples": 100,
  "num_workers": 4,
  "num_depths": 10,
  "shots": 2,
  "input_max_length": 131072,
  "do_sample": false,
  "generation_max_length": 300,
  "generation_min_length": 0,
  "temperature": 1.0,
  "top_p": 1.0,
  "stop_newline": false,
  "seed": 42,
  "no_cuda": false,
  "no_bf16": false,
  "no_torch_compile": false,
  "use_chat_template": true,
  "rope_theta": null,
  "debug": false,
  "count_tokens": false,
  "stop_new_line": false
}
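
To double-check that two runs used identical settings, one simple approach is to diff the two args dumps key by key; the file paths below are placeholders for your own output files, not paths from the repository:

# Minimal sketch: compare two args dumps and print any mismatched settings.
import json

with open("my_run_args.json") as f:         # placeholder path
    mine = json.load(f)
with open("reference_run_args.json") as f:  # placeholder path
    reference = json.load(f)

for key in sorted(set(mine) | set(reference)):
    if mine.get(key) != reference.get(key):
        print(f"{key}: mine={mine.get(key)!r} vs reference={reference.get(key)!r}")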

For reference, our score file looks like this:

{
  "length": 167.54,
  "str_em": 16.95,
  "str_hit": 5.0,
  "rougeLsum": 19.499888451263413,
  "mauve": 28.60813027025195,
  "citation_rec": 0.0,
  "citation_prec": 0.0,
  "citation_positions": {
    "1": 1,
    "0": 1
  }
}
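
If run-to-run variance from Flash Attention is a concern when comparing numbers, one option is to load the model with a different attention backend in transformers; whether the evaluation harness exposes this directly is an assumption here, not something the thread confirms:

# Minimal sketch: select a non-flash attention backend when loading the model.
# attn_implementation is a standard transformers from_pretrained argument.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "eager"
    device_map="auto",
)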
