
Misalignment in Reproducing Evaluation Results #11

Closed
jingluw opened this issue Jan 3, 2025 · 1 comment

Comments


jingluw commented Jan 3, 2025

We downloaded the pretrained models from PhysioNet and evaluated both MeLLaMA-13B-chat and MeLLaMA-70B-chat. However, the results we obtained differ significantly from those reported. Our reproduced results are listed below. Do you know of any potential issues that might be affecting the reproducibility of the results? Looking forward to your reply.

[Screenshots: reproduced evaluation results for MeLLaMA-13B-chat and MeLLaMA-70B-chat]

Here's our running script.

```bash
eval_path='/workspace/Me-LLaMA'
export PYTHONPATH="$eval_path/src:$eval_path/src/medical-evaluation:$eval_path/src/metrics/BARTScore"
echo $PYTHONPATH

export VLLM_WORKER_MULTIPROC_METHOD=spawn

MODEL_NAME=hf-causal-vllm

TASKS="PUBMEDQA,MedQA,MedMCQA,DDI2013,hoc,MTSample,PUBMEDSUM,BioNLI"

PRETRAINED=/datasets/MedData/me-llama/models/1.0.0/MeLLaMA-13B-chat
BATCHSIZE=50000

python src/eval.py \
    --model $MODEL_NAME \
    --tasks $TASKS \
    --model_args "use_accelerate=True,pretrained=$PRETRAINED,use_fast=False" \
    --no_cache \
    --batch_size $BATCHSIZE \
    --write_out \
    --output_path "./MeLLaMA-13B-chat_results_${TASKS}.json"
```
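The MeLLaMA-70B-chat run used the same invocation, changing only the checkpoint path and output file. A minimal sketch, assuming the 70B weights are stored alongside the 13B ones under the same PhysioNet directory layout:

```bash
# Same harness call as above, pointed at the 70B chat checkpoint.
# The path below is an assumption based on the 13B layout.
PRETRAINED=/datasets/MedData/me-llama/models/1.0.0/MeLLaMA-70B-chat

python src/eval.py \
    --model $MODEL_NAME \
    --tasks $TASKS \
    --model_args "use_accelerate=True,pretrained=$PRETRAINED,use_fast=False" \
    --no_cache \
    --batch_size $BATCHSIZE \
    --write_out \
    --output_path "./MeLLaMA-70B-chat_results_${TASKS}.json"
```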

@sincere1994
Collaborator

The issue has been addressed via email communication
