# VLM-Eval

A unified, extensible evaluation harness for Vision-Language Models (VLMs).
VLM-Eval provides a single, consistent interface to evaluate any vision-language model across popular benchmarks. Instead of writing a custom eval script for every dataset, you plug in your model and run.
Why another eval framework? Most repos vendor their own ad-hoc eval code that diverges from the official metrics. VLM-Eval centralises dataset loading, preprocessing, and scoring behind a clean API, and adds Chinese-centric benchmarks that are often missing from English-only suites.
## Supported Benchmarks

| Benchmark | Task | Metric |
|---|---|---|
| VQAv2 | Open-ended VQA | Accuracy (soft) |
| TextVQA | OCR-aware VQA | Accuracy |
| COCO Captions | Image captioning | CIDEr, SPICE |
| NoCaps | Zero-shot captioning | CIDEr |
| MMBench | Multi-task VQA | Accuracy |
| MMBench-CN | Chinese multi-task VQA | Accuracy |
| MMStar | Reasoning & perception | Accuracy |
| ScienceQA | Multi-choice science QA | Accuracy |
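VQAv2's "Accuracy (soft)" in the table above deserves a note: a prediction earns partial credit based on how many of the ten human annotators gave the same answer. Below is a simplified sketch of that rule; the official metric additionally normalizes answers (articles, punctuation, number words) and averages over annotator subsets, which this sketch omits.

```python
def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQAv2 soft accuracy: full credit when at least
    3 of the (typically 10) annotators gave the predicted answer."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Two annotators agree -> 2/3 credit; three or more -> full credit
partial = vqa_soft_accuracy("blue", ["blue", "blue"] + ["navy"] * 8)
full = vqa_soft_accuracy("blue", ["blue"] * 3 + ["navy"] * 7)
```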
## Supported Models

| Model | Backend |
|---|---|
| LLaVA-1.5 / LLaVA-1.6 | HuggingFace |
| InternVL2 (4B / 8B / 26B) | HuggingFace |
| Qwen-VL / Qwen2-VL | HuggingFace |
| GPT-4o / GPT-4V | OpenAI API |
| Custom | Plug in any `BaseVLM` subclass |
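The Custom row plugs into the abstract base class in `vlm_eval/models/base.py`. As a rough, illustrative sketch of what that interface looks like (the real signatures live in the repo; this is not the authoritative contract):

```python
from abc import ABC, abstractmethod


class BaseVLM(ABC):
    """Illustrative sketch of the abstract model interface."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    @abstractmethod
    def generate(self, image, prompt: str, **kwargs) -> str:
        """Return the model's text answer for one image + prompt."""
        raise NotImplementedError
```

Concrete backends (LLaVA, InternVL2, Qwen-VL) each implement `generate`; the custom-model section below shows a complete subclass.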
## Installation

```bash
pip install vlm-eval
```

## Quick Start

Evaluate a model on a single benchmark:

```bash
vlm-eval run \
    --model llava-1.6-mistral-7b \
    --benchmark vqav2 \
    --split validation \
    --output ./results/
```

Run several benchmarks in one invocation:

```bash
vlm-eval run \
    --model internvl2-8b \
    --benchmark vqav2 textvqa mmbench mmbench-cn \
    --output ./results/
```

Or drive the evaluation from Python:

```python
from vlm_eval import Evaluator
from vlm_eval.models import LLaVAModel

model = LLaVAModel("llava-hf/llava-v1.6-mistral-7b-hf")
evaluator = Evaluator(model=model)
results = evaluator.run("vqav2", split="validation", max_samples=1000)
print(results)
# {'accuracy': 0.812, 'n_samples': 1000, 'benchmark': 'vqav2'}
```

Aggregate results from multiple runs into a leaderboard:

```bash
vlm-eval leaderboard --results_dir ./results/ --output leaderboard.csv
```

## Project Structure

```
vlm_eval/
├── harness.py             # Core evaluation loop
├── models/
│   ├── base.py            # Abstract model interface
│   ├── llava.py           # LLaVA-1.5/1.6
│   ├── internvl.py        # InternVL2
│   └── qwenvl.py          # Qwen-VL / Qwen2-VL
├── datasets/
│   ├── vqav2.py
│   ├── coco_caption.py
│   ├── textvqa.py
│   └── mmbench.py         # MMBench + MMBench-CN
├── metrics/
│   ├── vqa_accuracy.py
│   └── caption_metrics.py
└── cli.py
```
## Results

Results on validation / test splits with greedy decoding:
| Model | VQAv2 | TextVQA | MMBench | MMBench-CN |
|---|---|---|---|---|
| LLaVA-1.6-Mistral-7B | 81.2 | 65.3 | 72.8 | 62.1 |
| InternVL2-8B | 83.5 | 77.4 | 79.6 | 75.3 |
| Qwen2-VL-7B | 83.0 | 84.3 | 80.5 | 79.8 |
| GPT-4o | 85.7 | 89.1 | 83.2 | 82.4 |
## Adding a Custom Model

Subclass `BaseVLM` and implement `generate`:

```python
from vlm_eval.models.base import BaseVLM
from PIL import Image


class MyModel(BaseVLM):
    def __init__(self, model_path: str):
        super().__init__(model_name="my-model")
        self.model = load_my_model(model_path)

    def generate(self, image: Image.Image, prompt: str, **kwargs) -> str:
        return self.model.predict(image, prompt)


# Register and run
from vlm_eval import Evaluator

evaluator = Evaluator(model=MyModel("./checkpoints/my_model"))
results = evaluator.run("mmbench")
```

## Citation

```bibtex
@misc{xu2024vlmeval,
  title={VLM-Eval: A Unified Evaluation Harness for Vision-Language Models},
  author={Xu, Haowen},
  year={2024},
  url={https://github.com/suncatchin/vlm-eval}
}
```

## License

Apache 2.0 License