VLM-FO1 endows pre-trained VLMs with superior fine-grained perception without compromising their inherent high-level reasoning and general understanding capabilities. It operates as a plug-and-play module that can be integrated with any existing VLM, establishing an effective and flexible paradigm for building the next generation of perception-aware models.
VLM-FO1 excels at a wide range of fine-grained perception tasks, including Object Grounding, Region Generative Understanding, Visual Region Reasoning, and more.
For more details, check out our paper.
🧩 Plug-and-Play Modularity: Our framework is designed as a set of enhancement modules that can be seamlessly integrated with any pre-trained VLM, preserving its original weights and capabilities.
🧠 Hybrid Fine-grained Region Encoder (HFRE): We introduce a novel Dual-Vision Encoder architecture that fuses semantically rich features with perception-enhanced features, creating powerful region tokens that capture both high-level meaning and fine-grained spatial detail.
🎯 State-of-the-Art Performance: VLM-FO1 achieves SOTA results across a diverse suite of benchmarks.
✅ Preserves General Abilities: Our two-stage training strategy ensures that fine-grained perception is gained without causing catastrophic forgetting of the base model's powerful general visual understanding abilities.
- Release Gradio demo
If you are interested in our research, we welcome you to explore our other wonderful projects.
- 🔆 Om AI Lab Blogs
- 🔆 VLM-R1: A stable and generalizable R1-style Large Vision-Language Model
- 🔆 OmAgent: Build Multimodal Language Agents with Ease
- 🔆 OmDet-Turbo: Real-time and accurate open-vocabulary end-to-end object detection
- 🔆 ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
- Installation
- Quick Start
- Inference with Provided Bounding Boxes
- Inference with Object Detector
- Task Templates
- Evaluation
- Citation
- Acknowledgements
- Python 3.10+
- PyTorch (GPU recommended). Please install the CUDA-enabled build that matches your system (see the quick check after this list).
- Linux is the primary tested platform.
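To confirm that the CUDA-enabled PyTorch build is installed and can see your GPU, you can run a quick check like this:

```python
# Quick sanity check that PyTorch is installed with CUDA support and can see a GPU.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device:      {torch.cuda.get_device_name(0)}")
```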
git clone https://github.com/om-ai-lab/VLM-FO1.git
cd VLM-FO1
pip install -r requirements.txt
If you use conda:
conda create -n vlm-fo1 python=3.10 -y
conda activate vlm-fo1
pip install -r requirements.txt
Download the pre-trained VLM-FO1 checkpoints from Hugging Face and place them under resources/:
- Pre-trained model: Hugging Face link
- Suggested local path: resources/VLM-FO1_Qwen2.5-VL-3B-v01/
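If you prefer to script the download, a minimal sketch using huggingface_hub is shown below; the repo id is a placeholder, so substitute the actual repo id from the Hugging Face link above:

```python
# Sketch: download the checkpoint into resources/ with huggingface_hub.
# The repo_id below is a placeholder -- replace it with the actual repo id
# from the Hugging Face link above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="om-ai-lab/VLM-FO1_Qwen2.5-VL-3B-v01",  # placeholder repo id
    local_dir="resources/VLM-FO1_Qwen2.5-VL-3B-v01",
)
```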
Use inference.py to run VLM-FO1 with provided bounding boxes.
Run:
python inference.py
The visualization with predicted boxes will be saved to demo/vlm_fo1_result.jpg.
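For reference, the saved visualization is essentially the input image with the predicted boxes drawn on top. A minimal stand-alone sketch of that step (the file names, box format, and helper name here are illustrative, not the script's actual internals):

```python
# Illustrative only: draw [x1, y1, x2, y2] boxes with labels onto an image and save it,
# similar in spirit to the visualization that inference.py writes out.
from PIL import Image, ImageDraw

def draw_boxes(image_path, boxes, labels, out_path):
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(0, y1 - 12)), label, fill="red")
    image.save(out_path)

# Example with made-up boxes:
# draw_boxes("demo/demo.jpg", [[50, 40, 220, 300]], ["person"], "demo/vlm_fo1_result.jpg")
```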
Note: Due to company policy, we are unable to release the object detector OPN referenced in our paper. Instead, we provide integration with UPN (from ChatRex), which offers similar functionality. You are also free to use any object detector of your choice by preparing bounding box proposals compatible with our pipeline.
Before running, you need to install UPN's requirements and build the ops extension:
cd detect_tools/upn
pip install -r requirements.txt
cd ops
pip install -v -e .
Then, download the UPN checkpoint from this link and place it in your resources/ folder:
wget https://github.com/IDEA-Research/ChatRex/releases/download/upn-large/upn_large.pth -P resources/
Use scripts/inference_with_upn.py to run FO1 with UPN-generated proposals.
Run:
python scripts/inference_with_upn.py
The script will:
- Load UPN and produce fine-grained object proposals
- Filter proposals by score (a sketch of this step follows the list)
- Pass top proposals to FO1 for reasoning
- Save the visualization to demo/vlm_fo1_result.jpg
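The proposal filtering above can be as simple as a score threshold followed by a top-k cut. A minimal sketch, with the threshold, k, and proposal format as assumptions rather than the script's exact values:

```python
# Illustrative proposal filtering: keep proposals above a score threshold,
# then pass the top-k highest-scoring ones on to FO1.
def filter_proposals(boxes, scores, score_thr=0.3, top_k=50):
    kept = [(box, score) for box, score in zip(boxes, scores) if score >= score_thr]
    kept.sort(key=lambda item: item[1], reverse=True)
    kept = kept[:top_k]
    return [box for box, _ in kept], [score for _, score in kept]
```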
We provide task templates to facilitate tasks beyond detection (e.g., counting, grounding, and other vision-language reasoning tasks). See vlm_fo1/task_templates for examples. You can adapt the prompts or message construction to your specific application.
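As an illustration of adapting the message construction, the snippet below builds a counting-style request in a generic chat-message format; it is a made-up example, not one of the templates shipped in vlm_fo1/task_templates:

```python
# Hypothetical example of building a counting-style message; adapt it to the
# prompt format expected by the templates in vlm_fo1/task_templates.
def build_counting_message(image_path, category):
    prompt = f"How many instances of '{category}' are in the image?"
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```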
Model performance when integrated with UPN is as follows:
| Model | COCO mAP | CountBench Accuracy | Pixmo-Count Accuracy | HumanRef DF1/P/R | LVIS SS/S-IoU | PACO SS/S-IoU | COCOText |
|---|---|---|---|---|---|---|---|
| VLM-FO1_Qwen2.5-VL-3B-v01 (FO1 + UPN) | 44.4 | 89.73% | 85.07% | 82.6/86.8/83.5 | 92.5/87 | 88.1/77.9 | 57.1 |
For benchmarking FO1 with UPN on your own datasets, you can adapt the inference pipeline to export predictions and compare them against ground truth. We also include evaluation utilities under evaluation/ for specific tasks.
- CountBench/Pixmo-Count evaluation: evaluation/eval_countbench.py
- COCO evaluation: evaluation/eval_coco.py (this script outputs a prediction json file that you can score with the standard COCO evaluation tool; see the sketch below)
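Once eval_coco.py has produced the prediction json, scoring it with the standard COCO toolkit looks roughly like this (the file paths are placeholders):

```python
# Score a COCO-format prediction file with pycocotools (file paths are placeholders).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")  # ground-truth annotations
coco_dt = coco_gt.loadRes("predictions.json")         # prediction json from eval_coco.py
coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()                                 # prints mAP and related metrics
```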
If you find VLM-FO1 useful in your research or applications, please cite our work:
@article{liu2025vlm,
  title={VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs},
  author={Liu, Peng and Shen, Haozhan and Fang, Chunxin and Sun, Zhicheng and Liao, Jiajia and Zhao, Tiancheng},
  journal={arXiv preprint arXiv:2509.25916},
  year={2025}
}
- Thanks to the Qwen Team for their powerful Qwen2.5-VL model.
- UPN detector is integrated from an external open-source project. Please refer to: ChatRex
- Dataset images used above are from public benchmarks (Pixmo-Count, CountBenchQA), credited to their respective authors.
- Thanks to the numerous researchers and developers who have contributed to the datasets used in our training.

