Description
Hi @Yangsenqiao,
Thank you for your excellent work!
I am trying to reproduce the results reported in your paper using the released VisionThink-Efficient model weights. Most of my results closely match yours, but on a few benchmarks my scores are lower and vary from run to run.
My Results
| Benchmark | Paper's Score | My Score |
|---|---|---|
| MMBench | 80.0 | 79.59 |
| RealWorldQA | 68.5 | 68.5 |
| POPE | 86.0 | 86.69 |
| MME | 2400 | 2403.3 |
| MathVista | 67.5 | 65.7 |
| MathVerse | 48.0 | 45.9 |
| MMVet | 67.1 | 61.8 |
The scores for MMBench, RealWorldQA, POPE, and MME look great and match your paper.
However, my scores for MathVista, MathVerse, and MMVet are lower, and I also noticed that these three scores change each time I rerun the evaluation.
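For context, this is the kind of nondeterminism I suspect: if the judge is queried through an OpenAI-compatible API with default sampling, repeated runs can grade the same answer differently. Below is a minimal sketch of a deterministic judge call; the judge model name and the grading prompt are illustrative placeholders, not your actual setup:

```python
# Minimal sketch of a deterministic LLM-as-judge call.
# Assumptions: an OpenAI-compatible endpoint; "gpt-4o-mini" and the
# grading prompt are placeholders, not the paper's actual setup.
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, prediction: str) -> str:
    """Ask the judge model whether the prediction matches the reference."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,        # greedy decoding reduces run-to-run variance
        seed=0,               # best-effort determinism where supported
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Reply with 'correct' or 'incorrect' only."},
            {"role": "user",
             "content": f"Question: {question}\n"
                        f"Reference answer: {reference}\n"
                        f"Model answer: {prediction}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```

Even with `temperature=0`, some API providers are not fully deterministic, which might explain part of the variance I am seeing.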
Questions
- I think the difference might be because these benchmarks use an LLM as judge. Could you please share some details about your evaluation setup, e.g., which judge model and version you used and its sampling settings?
- I noticed that samples in MMMU can contain multiple images. How did you handle multi-image inputs during evaluation? (See the sketch after this list for what I mean.)
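For the second question, here is a minimal sketch of what I mean by passing multiple images for one sample, assuming the released weights load like a Qwen2.5-VL checkpoint in transformers (the model path, image files, and question text are placeholders):

```python
# Minimal sketch of passing multiple images for one MMMU-style sample.
# Assumptions: the released weights load as a Qwen2.5-VL checkpoint via
# transformers; the model path and image files are placeholders.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_path = "path/to/VisionThink-Efficient"  # placeholder
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# One conversation turn carrying two images plus the question text.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "figure_1.png"},  # placeholder image
        {"type": "image", "image": "figure_2.png"},  # placeholder image
        {"type": "text", "text": "Which figure shows the larger value?"},
    ],
}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

In particular, I am wondering whether your evaluation passed each image separately like this, or merged the images into a single image first.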
Thanks!