# VLM-Eval

A unified, extensible evaluation harness for Vision-Language Models (VLMs).
VLM-Eval provides a single, consistent interface to evaluate any vision-language model across popular benchmarks. Instead of writing a custom eval script for every dataset, you plug in your model and run.
Why another eval framework? Most repos vendor their own ad-hoc eval code that diverges from the official metrics. VLM-Eval centralises dataset loading, preprocessing, and scoring behind a clean API, and adds Chinese-centric benchmarks that are often missing from English-only suites.
## Supported Benchmarks

| Benchmark | Task | Metric |
|---|---|---|
| VQAv2 | Open-ended VQA | Accuracy (soft) |
| TextVQA | OCR-aware VQA | Accuracy |
| COCO Captions | Image captioning | CIDEr, SPICE |
| NoCaps | Zero-shot captioning | CIDEr |
| MMBench | Multi-task VQA | Accuracy |
| MMBench-CN | Chinese multi-task VQA | Accuracy |
| MMStar | Reasoning & perception | Accuracy |
| ScienceQA | Multi-choice science QA | Accuracy |
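VQAv2's "Accuracy (soft)" in the table above deserves a note: a prediction earns partial credit based on how many of the ten human annotators gave the same answer. Below is a simplified sketch of that rule; the official metric additionally normalizes answers (articles, punctuation, number words) and averages over annotator subsets, which this sketch omits.

```python
def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQAv2 soft accuracy: full credit when at least
    3 of the (typically 10) annotators gave the predicted answer."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Two annotators agree -> 2/3 credit; three or more -> full credit
partial = vqa_soft_accuracy("blue", ["blue", "blue"] + ["navy"] * 8)
full = vqa_soft_accuracy("blue", ["blue"] * 3 + ["navy"] * 7)
```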
## Supported Models

| Model | Backend |
|---|---|
| LLaVA-1.5 / LLaVA-1.6 | HuggingFace |
| InternVL2 (4B / 8B / 26B) | HuggingFace |
| Qwen-VL / Qwen2-VL | HuggingFace |
| GPT-4o / GPT-4V | OpenAI API |
| Custom | Plug in any `BaseVLM` subclass |
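The Custom row plugs into the abstract base class in `vlm_eval/models/base.py`. As a rough, illustrative sketch of what that interface looks like (the real signatures live in the repo; this is not the authoritative contract):

```python
from abc import ABC, abstractmethod


class BaseVLM(ABC):
    """Illustrative sketch of the abstract model interface."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    @abstractmethod
    def generate(self, image, prompt: str, **kwargs) -> str:
        """Return the model's text answer for one image + prompt."""
        raise NotImplementedError
```

Concrete backends (LLaVA, InternVL2, Qwen-VL) each implement `generate`; the custom-model section below shows a complete subclass.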
## Installation

```bash
pip install vlm-eval
```

## Quick Start

Evaluate a model on a single benchmark:

```bash
vlm-eval run \
    --model llava-1.6-mistral-7b \
    --benchmark vqav2 \
    --split validation \
    --output ./results/
```

Run several benchmarks in one invocation:

```bash
vlm-eval run \
    --model internvl2-8b \
    --benchmark vqav2 textvqa mmbench mmbench-cn \
    --output ./results/
```

Or drive the evaluation from Python:

```python
from vlm_eval import Evaluator
from vlm_eval.models import LLaVAModel

model = LLaVAModel("llava-hf/llava-v1.6-mistral-7b-hf")
evaluator = Evaluator(model=model)
results = evaluator.run("vqav2", split="validation", max_samples=1000)
print(results)
# {'accuracy': 0.812, 'n_samples': 1000, 'benchmark': 'vqav2'}
```

Aggregate results from multiple runs into a leaderboard:

```bash
vlm-eval leaderboard --results_dir ./results/ --output leaderboard.csv
```

## Project Structure

```
vlm_eval/
├── harness.py             # Core evaluation loop
├── models/
│   ├── base.py            # Abstract model interface
│   ├── llava.py           # LLaVA-1.5/1.6
│   ├── internvl.py        # InternVL2
│   └── qwenvl.py          # Qwen-VL / Qwen2-VL
├── datasets/
│   ├── vqav2.py
│   ├── coco_caption.py
│   ├── textvqa.py
│   └── mmbench.py         # MMBench + MMBench-CN
├── metrics/
│   ├── vqa_accuracy.py
│   └── caption_metrics.py
└── cli.py
```
## Results

Results on validation / test splits with greedy decoding:
| Model | VQAv2 | TextVQA | MMBench | MMBench-CN |
|---|---|---|---|---|
| LLaVA-1.6-Mistral-7B | 81.2 | 65.3 | 72.8 | 62.1 |
| InternVL2-8B | 83.5 | 77.4 | 79.6 | 75.3 |
| Qwen2-VL-7B | 83.0 | 84.3 | 80.5 | 79.8 |
| GPT-4o | 85.7 | 89.1 | 83.2 | 82.4 |
## Adding a Custom Model

Subclass `BaseVLM` and implement `generate`:

```python
from vlm_eval.models.base import BaseVLM
from PIL import Image


class MyModel(BaseVLM):
    def __init__(self, model_path: str):
        super().__init__(model_name="my-model")
        self.model = load_my_model(model_path)

    def generate(self, image: Image.Image, prompt: str, **kwargs) -> str:
        return self.model.predict(image, prompt)


# Register and run
from vlm_eval import Evaluator

evaluator = Evaluator(model=MyModel("./checkpoints/my_model"))
results = evaluator.run("mmbench")
```

## Citation

```bibtex
@misc{xu2024vlmeval,
  title={VLM-Eval: A Unified Evaluation Harness for Vision-Language Models},
  author={Xu, Haowen},
  year={2024},
  url={https://github.com/suncatchin/vlm-eval}
}
```

## License

Apache 2.0 License