
Misalignment in Reproducing Evaluation Results #11

Closed
jingluw opened this issue Jan 3, 2025 · 1 comment

Comments


jingluw commented Jan 3, 2025

We downloaded the pretrained models from PhysioNet and evaluated both MeLLaMA-13B-chat and MeLLaMA-70B-chat. However, the results we obtained differ significantly from those reported. Our reproduced results are listed below. Do you know of any potential issues that might be affecting the reproducibility of the results? Looking forward to your reply.

[Screenshots: reproduced evaluation results for MeLLaMA-13B-chat and MeLLaMA-70B-chat]

Here's our running script.

```bash
eval_path='/workspace/Me-LLaMA'
export PYTHONPATH="$eval_path/src:$eval_path/src/medical-evaluation:$eval_path/src/metrics/BARTScore"
echo $PYTHONPATH

export VLLM_WORKER_MULTIPROC_METHOD=spawn

MODEL_NAME=hf-causal-vllm

TASKS="PUBMEDQA,MedQA,MedMCQA,DDI2013,hoc,MTSample,PUBMEDSUM,BioNLI"

PRETRAINED=/datasets/MedData/me-llama/models/1.0.0/MeLLaMA-13B-chat
BATCHSIZE=50000

python src/eval.py \
    --model $MODEL_NAME \
    --tasks $TASKS \
    --model_args "use_accelerate=True,pretrained=$PRETRAINED,use_fast=False" \
    --no_cache \
    --batch_size $BATCHSIZE \
    --write_out \
    --output_path "./MeLLaMA-13B-chat_results_${TASKS}.json"
```
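The MeLLaMA-70B-chat run used the same invocation, changing only the checkpoint path and output file. A minimal sketch, assuming the 70B weights are stored alongside the 13B ones under the same PhysioNet directory layout:

```bash
# Same harness call as above, pointed at the 70B chat checkpoint.
# The path below is an assumption based on the 13B layout.
PRETRAINED=/datasets/MedData/me-llama/models/1.0.0/MeLLaMA-70B-chat

python src/eval.py \
    --model $MODEL_NAME \
    --tasks $TASKS \
    --model_args "use_accelerate=True,pretrained=$PRETRAINED,use_fast=False" \
    --no_cache \
    --batch_size $BATCHSIZE \
    --write_out \
    --output_path "./MeLLaMA-70B-chat_results_${TASKS}.json"
```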

@sincere1994
Collaborator

The issue has been addressed via email communication
