✨Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence✨

Diankun Wu¹*, Fangfu Liu¹*, Yi-Hsin Hung¹, Yueqi Duan¹
*Equal contribution. ¹Tsinghua University
NeurIPS 2025 (Spotlight)

[Teaser Visualization]

Spatial-MLLM: We propose Spatial-MLLM, a method that significantly enhances the visual-based spatial intelligence of existing video MLLMs. As shown above, Spatial-MLLM can understand and reason about the underlying scene from video input alone, and it achieves state-of-the-art performance across a wide range of spatial reasoning tasks.

📢 News

  • 🎉[01/05/2026] We release two new SFT models: Spatial-MLLM-v1.1-Instruct-135K and Spatial-MLLM-v1.1-Instruct-820K.
  • 🎉[01/05/2026] We refactor the repo and release the refined SFT training code for Spatial-MLLM-v1.1-Instruct, together with the code for space-aware frame sampling.
  • 🎉[05/30/2025] We release Spatial-MLLM-subset-sft, trained on a subset of our proposed Spatial-MLLM-120k dataset, along with the evaluation code for VSI-Bench. Refer to previous_version to use and evaluate this model.
  • 🔥[05/30/2025] We release "Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence". Check out our project page and arXiv paper.

🌟 Overview

[Pipeline Visualization]

Overview of Spatial-MLLM. Our model consists of a 2D visual encoder, a spatial encoder initialized from a feed-forward visual geometry foundation model, a connector, and a large language model backbone. At inference time, we apply a space-aware frame sampling strategy to select spatially informative frames when the number of input frames is limited by GPU memory.

⚙️ Setup

1. Clone Repository

git clone https://github.com/diankun-wu/Spatial-MLLM
cd Spatial-MLLM

2. Environment Setup

We use conda to manage the environment. First, create and activate the conda environment:

conda create -n spatial-mllm python=3.10 -y
conda activate spatial-mllm

Install PyTorch 2.6.0 with CUDA 12.4 support:

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

Install other required packages:

pip install transformers==4.51.3
pip install accelerate datasets decord deepspeed einops matplotlib pandas python_Levenshtein qwen_vl_utils ray safetensors tqdm tyro wandb

Finally, download and install the pre-built wheel for Flash Attention 2:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
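
To verify the installation, you can run a quick sanity check that imports the key packages and reports whether CUDA is visible (on a correctly configured machine this prints the PyTorch version and True):

python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"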

💻 Inference

To run inference, you can use the script src/inference.py. For example:

python src/inference.py \
    --model_path Diankun/Spatial-MLLM-v1.1-Instruct-135K \
    --model_type spatial-mllm \
    --text "How many chair(s) are in this room?\nPlease answer the question using a single word or phrase."

📊 Evaluation

Evaluation on VSI-Bench

To evaluate the model on VSI-Bench, first download the VSI-Bench dataset and place it in the datasets/evaluation/vsibench directory:

# download the VSI-Bench dataset from Hugging Face
hf download nyu-visionx/VSI-Bench \
    --local-dir datasets/evaluation/vsibench \
    --repo-type dataset

# extract the downloaded dataset
for f in datasets/evaluation/vsibench/*.zip; do
    unzip "$f" -d datasets/evaluation/vsibench
done
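
Before running the evaluation, you can list the directory to confirm the extraction worked (the exact file and folder names depend on the VSI-Bench release):

ls datasets/evaluation/vsibench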

Download the Spatial-MLLM-v1.1-Instruct models to the checkpoints directory (recommended when evaluating with multiple GPUs, so every worker loads the same local copy instead of fetching the checkpoint separately):

mkdir -p checkpoints

# download Spatial-MLLM-v1.1-Instruct-135K
hf download Diankun/Spatial-MLLM-v1.1-Instruct-135K \
    --local-dir checkpoints/Spatial-MLLM-v1.1-Instruct-135K

# download Spatial-MLLM-v1.1-Instruct-820K
hf download Diankun/Spatial-MLLM-v1.1-Instruct-820K \
    --local-dir checkpoints/Spatial-MLLM-v1.1-Instruct-820K

Then you can use the provided bash script to evaluate the model.

bash scripts/evaluation/evaluate_vsibench_spatial_mllm.sh

The script will automatically use all available GPUs. If you want to specify the GPUs to use, you can set the CUDA_VISIBLE_DEVICES environment variable before running the script.
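
For example, to run the evaluation on the first two GPUs only:

CUDA_VISIBLE_DEVICES=0,1 bash scripts/evaluation/evaluate_vsibench_spatial_mllm.sh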

Using Space-aware Frame Sampling

To use the space-aware frame sampling strategy during evaluation, we recommend using our pre-sampled frames. You can download them using the following command:

# Download the zip file
hf download Diankun/Spatial-MLLM-Data evaluation/vsibench/sa_sampling_16f.zip \
    --repo-type dataset \
    --local-dir . 

# Unzip the file
unzip evaluation/vsibench/sa_sampling_16f.zip -d datasets/evaluation/vsibench/arkitscenes_sampling_16f

You can also sample frames using our provided script:

bash scripts/evaluation/sa_sampling.sh

This script runs space-aware frame sampling on all videos in datasets/evaluation/vsibench and saves the sampled frames to datasets/evaluation/vsibench/sa_sampling_16f.

Then, you can use the provided bash script to evaluate the model with the sampled frames.

bash scripts/evaluation/evaluate_vsibench_spatial_mllm_w_sa_sampling.sh

Here are our evaluation results for Spatial-MLLM-v1.1-Instruct and baseline models on VSI-Bench with 16 input frames. Acc is accuracy on the multiple-choice tasks, MRA is mean relative accuracy on the numerical-answer tasks, All is the overall score, and Micro/Macro denote question-level vs. task-level averaging:

Model                                          VSI-Bench Micro        VSI-Bench Macro
                                               Acc    MRA    All      Acc    MRA    All
Qwen2.5-VL-3B-Instruct                         35.42  20.72  27.86    37.12  21.65  30.93
Qwen2.5-VL-3B-Instruct-135K                    46.91  52.60  49.84    46.16  52.81  48.82
Spatial-MLLM-v1.1-Instruct-135K                49.28  52.88  51.13    49.12  53.88  51.02
Spatial-MLLM-v1.1-Instruct-135K (SA Sampling)  52.13  53.33  52.75    52.84  54.46  53.49
Spatial-MLLM-v1.1-Instruct-820K                49.56  57.27  53.53    48.02  57.39  51.77
Spatial-MLLM-v1.1-Instruct-820K (SA Sampling)  50.60  57.68  54.24    50.12  58.09  53.30

Evaluation on ScanQA

We also provide evaluation scripts for ScanQA:

bash scripts/evaluation/evaluate_scanqa_spatial_mllm.sh

Note that you need to download and preprocess the ScanNet raw videos and place them in datasets/visuals/scannet/videos before evaluation.

After evaluation, the results are saved to results/scanqa/Spatial-MLLM-v1.1-Instruct-135K-16f.json. Use the following command to compute the metrics:

python src/evaluation/scanqa/score_scanqa.py \
    --input-file "results/scanqa/Spatial-MLLM-v1.1-Instruct-135K-16f.json"

🚂 Training

You can refer to our TRAINING.md for detailed training instructions.

📚 Citation

If you find our work useful for your research or applications, please cite our paper using this BibTeX:

@article{wu2025spatialmllmboostingmllmcapabilities,
    title={Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence},
    author={Wu, Diankun and Liu, Fangfu and Hung, Yi-Hsin and Duan, Yueqi},
    journal={arXiv preprint arXiv:2505.23747},
    year={2025}
}

Acknowledgements

Thanks to these great repositories: thinking-in-space, VGGT, Qwen2.5-VL, open-r1, R1-V, Video-3D-LLM, VLM-R1, VLM-3R, MindCube, Cambrian-S and many other inspiring works in the community.
