# VLM-Eval 📊

A unified, extensible evaluation harness for Vision-Language Models (VLMs).

Python 3.9+ · License: Apache 2.0

VLM-Eval provides a single, consistent interface for evaluating any vision-language model across popular benchmarks. Instead of writing a custom eval script for every dataset, you plug in your model and run.

**Why another eval framework?** Most repos vendor ad-hoc eval code that diverges from the official metrics. VLM-Eval centralises dataset loading, preprocessing, and scoring behind a clean API, and adds Chinese-centric benchmarks (e.g. MMBench-CN) that English-only suites often omit.

## 🎯 Supported Benchmarks

| Benchmark     | Task                    | Metric          |
|---------------|-------------------------|-----------------|
| VQAv2         | Open-ended VQA          | Accuracy (soft) |
| TextVQA       | OCR-aware VQA           | Accuracy        |
| COCO Captions | Image captioning        | CIDEr, SPICE    |
| NoCaps        | Zero-shot captioning    | CIDEr           |
| MMBench       | Multi-task VQA          | Accuracy        |
| MMBench-CN    | Chinese multi-task VQA  | Accuracy        |
| MMStar        | Reasoning & perception  | Accuracy        |
| ScienceQA     | Multi-choice science QA | Accuracy        |
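VQAv2's "Accuracy (soft)" is the standard VQA metric: a prediction scores `min(#matching human answers / 3, 1)`, averaged over the ten leave-one-annotator-out subsets of the ten reference answers. A minimal sketch of that rule (function name is illustrative, not VLM-Eval's actual API; the official scorer also normalises answers, which is omitted here):

```python
def vqa_soft_accuracy(prediction: str, human_answers: list) -> float:
    """Standard VQAv2 soft accuracy: for each leave-one-out subset of
    the human answers, score min(#matches / 3, 1), then average."""
    accs = []
    for i in range(len(human_answers)):
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(ans == prediction for ans in others)
        accs.append(min(matches / 3.0, 1.0))
    return sum(accs) / len(accs)
```

Under this rule a prediction matching three or more annotators in every subset scores 1.0, so rarer phrasings still earn partial credit.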

## 🤖 Supported Models

| Model                    | Backend               |
|--------------------------|-----------------------|
| LLaVA-1.5 / LLaVA-1.6    | HuggingFace           |
| InternVL2 (4B / 8B / 26B)| HuggingFace           |
| Qwen-VL / Qwen2-VL       | HuggingFace           |
| GPT-4o / GPT-4V          | OpenAI API            |
| Custom Plugin            | any `BaseVLM` subclass |

## 🚀 Quick Start

```bash
pip install vlm-eval
```

### Run a single benchmark

```bash
vlm-eval run \
  --model llava-1.6-mistral-7b \
  --benchmark vqav2 \
  --split validation \
  --output ./results/
```

### Evaluate on multiple benchmarks

```bash
vlm-eval run \
  --model internvl2-8b \
  --benchmark vqav2 textvqa mmbench mmbench-cn \
  --output ./results/
```

### Python API

```python
from vlm_eval import Evaluator
from vlm_eval.models import LLaVAModel

model = LLaVAModel("llava-hf/llava-v1.6-mistral-7b-hf")
evaluator = Evaluator(model=model)

results = evaluator.run("vqav2", split="validation", max_samples=1000)
print(results)
# {'accuracy': 0.812, 'n_samples': 1000, 'benchmark': 'vqav2'}
```

### Export leaderboard

```bash
vlm-eval leaderboard --results_dir ./results/ --output leaderboard.csv
```
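Roughly, the leaderboard step pivots per-run results into a model × benchmark grid. A self-contained sketch of that aggregation, assuming (this layout is an assumption, not VLM-Eval's documented format) each run writes a JSON file with `model`, `benchmark`, and `accuracy` keys:

```python
import csv
import json
from pathlib import Path

def build_leaderboard(results_dir: str, out_csv: str) -> None:
    """Collect per-run JSON results into one model x benchmark CSV.

    Assumes each *.json file holds {"model": ..., "benchmark": ...,
    "accuracy": ...} -- a hypothetical layout for illustration.
    """
    rows: dict = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        run = json.loads(path.read_text())
        rows.setdefault(run["model"], {})[run["benchmark"]] = run["accuracy"]

    # Union of all benchmarks seen, so models with missing runs get blanks.
    benchmarks = sorted({b for scores in rows.values() for b in scores})
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["model", *benchmarks])
        for model, scores in sorted(rows.items()):
            writer.writerow([model, *(scores.get(b, "") for b in benchmarks)])
```

Missing (model, benchmark) pairs are left blank rather than zero, so an unevaluated cell is distinguishable from a genuinely zero score.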

## 📂 Project Structure

```
vlm_eval/
├── harness.py          # Core evaluation loop
├── models/
│   ├── base.py         # Abstract model interface
│   ├── llava.py        # LLaVA-1.5/1.6
│   ├── internvl.py     # InternVL2
│   └── qwenvl.py       # Qwen-VL / Qwen2-VL
├── datasets/
│   ├── vqav2.py
│   ├── coco_caption.py
│   ├── textvqa.py
│   └── mmbench.py      # MMBench + MMBench-CN
├── metrics/
│   ├── vqa_accuracy.py
│   └── caption_metrics.py
└── cli.py
```

## 📋 Leaderboard (as of 2025-01)

Results on validation / test splits with greedy decoding:

| Model                | VQAv2 | TextVQA | MMBench | MMBench-CN |
|----------------------|-------|---------|---------|------------|
| LLaVA-1.6-Mistral-7B | 81.2  | 65.3    | 72.8    | 62.1       |
| InternVL2-8B         | 83.5  | 77.4    | 79.6    | 75.3       |
| Qwen2-VL-7B          | 83.0  | 84.3    | 80.5    | 79.8       |
| GPT-4o               | 85.7  | 89.1    | 83.2    | 82.4       |

## ⚙️ Adding a Custom Model

```python
from vlm_eval.models.base import BaseVLM
from PIL import Image

class MyModel(BaseVLM):
    def __init__(self, model_path: str):
        super().__init__(model_name="my-model")
        # load_my_model is a placeholder for your own weight-loading code
        self.model = load_my_model(model_path)

    def generate(self, image: Image.Image, prompt: str, **kwargs) -> str:
        return self.model.predict(image, prompt)

# Register and run
from vlm_eval import Evaluator
evaluator = Evaluator(model=MyModel("./checkpoints/my_model"))
results = evaluator.run("mmbench")
```

## 📖 Citation

```bibtex
@misc{xu2024vlmeval,
  title={VLM-Eval: A Unified Evaluation Harness for Vision-Language Models},
  author={Xu, Haowen},
  year={2024},
  url={https://github.com/suncatchin/vlm-eval}
}
```

## 📄 License

Apache 2.0 License
