Skip to content

THU-SI/Spatial-TTT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

✨ Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training ✨

Fangfu Liu*,1, Diankun Wu*,1, Jiawei Chi*,1, Yimo Cai1, Yi-Hsin Hung1, Xumin Yu2, Hao Li3,
Han Hu2, Yongming Rao†,2, Yueqi Duan†,1
*Equal Contribution   Corresponding Author
1Tsinghua University   2Tencent Hunyuan   3NTU

              

Teaser

Spatial-TTT: We propose Spatial-TTT, a framework for streaming visual-based spatial intelligence with Test-Time Training (TTT). Given a visual-based spatial task, our method updates spatial state with streaming chunks then answers the question, achieving state-of-the-art performance on video spatial benchmarks.


📢 News

  • [2026/03/13] 🎉 We release the paper on arXiv!
  • [2026/03/13] We release the training and evaluation code for Spatial-TTT, the official implementation of Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training.

🌟 Overview

Pipeline

Overview of Spatial-TTT. The model employs a hybrid architecture that interleaves TTT layers with self-attention anchor layers to preserve pretrained knowledge while enabling efficient long spatial-context compression. Within each TTT layer, sliding-window attention (SWA) and the TTT branch operate in parallel with shared Q/K/V projections; the TTT branch applies a spatial-predictive mechanism with depthwise spatiotemporal convolution to capture geometric structure and temporal continuity.

🌟 Introduction

Humans perceive and understand real-world spaces through a stream of visual observations. The ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time.

Spatial-TTT maintains adaptive fast weights that are updated online, acting as a compact non-linear memory to accumulate 3D evidence from long-horizon video streams. Key designs include:

  • Hybrid TTT architecture: Interleaves TTT layers with self-attention anchor layers to preserve pretrained visual-semantic knowledge while enabling efficient long spatial-context compression.
  • Large-chunk updates + sliding-window attention: Large chunk size for higher parallelism and hardware efficiency; sliding-window attention in parallel to preserve intra-chunk spatiotemporal continuity.
  • Spatial-predictive mechanism: Lightweight depth-wise 3D convolutions on the TTT branch to capture geometric correspondence and temporal continuity across frames.
  • Dense scene description: A dense scene-description dataset guides the model to update fast weights to memorize and organize global 3D spatial signals in a structured manner.

⚙️ Setup

1. Clone Repository

git clone https://github.com/THU-SI/Spatial-TTT.git
cd Spatial-TTT/qwen-vl-finetune

2. Environment Setup

We use conda to manage the environment. Recommended versions:

  • Python 3.10+
  • torch>=2.6.0, torchvision, transformers>=4.57.0
  • deepspeed, flash-attn, accelerate, peft, triton, torchcodec
conda create -n spatial-ttt python=3.10 -y
conda activate spatial-ttt
pip install torch torchvision deepspeed accelerate peft transformers==4.57.0
pip install flash-attn --no-build-isolation
pip install torchcodec
pip install qwen-vl-utils

Configure the dataset in qwen-vl-finetune/qwenvl/data/__init__.py: set annotation_path and data_path for Spatial-TTT-Data-97k (download from THU-SI/Spatial-TTT-Data-97k on Hugging Face).

🚂 Training

We provide a single training script with chunk size 2648 and the Spatial-TTT-Data-97k dataset.

Run training

  1. Set in qwen-vl-finetune/spatial_ttt_train.sh:

    • MODEL_PATH: path to pretrained Qwen3-VL (or your base checkpoint).
    • OUTPUT_DIR: where to save checkpoints.
  2. Set Spatial-TTT-Data-97k paths in qwen-vl-finetune/qwenvl/data/__init__.py: edit SPATIAL_TTT_DATA_97K with your annotation_path and data_path (after downloading from THU-SI/Spatial-TTT-Data-97k).

  3. From the qwen-vl-finetune/ directory:

cd qwen-vl-finetune
# 8 GPUs by default; set NPROC_PER_NODE or CUDA_VISIBLE_DEVICES as needed
bash spatial_ttt_train.sh

Main settings: lact_chunk_size=2648, window_size=2648, video_max_frames=128, dataset spatial_ttt_data_97k.

📊 Evaluation

Evaluation on VSI-Bench is under evaluation/spatial/. Use the script with checkpoint path and output name:

# Evaluates on VSI-Bench with 128 frames
bash evaluation/spatial/scripts/eval_spatial_ttt_2b.sh /path/to/checkpoint my_model 8

See evaluation/spatial/readme.md for result summarization.

📦 Data and Model

We release Spatial-TTT-nano, an SFT model trained on a mini spatial dataset with less than 1M samples. Download: Spatial-TTT-nano on Hugging Face. See the Releases page for more.

We provide Spatial-TTT-Data-97k (THU-SI/Spatial-TTT-Data-97k on Hugging Face), a mini high-quality spatial dataset from Spatial-TTT with ~97k samples for training and reproduction. This is the dataset used in the configuration and training steps above.

We also release Spatial-TTT-Data-Streaming (THU-SI/Spatial-TTT-Data-Streaming on Hugging Face), part of our self-prepared streaming data. It can be helpful for VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting) related training in Cambrian-S: Towards Spatial Supersensing in Video (see arXiv:2511.04670).

📋 TODO

  • Update the full model (trained on all data).
  • Release full training data (general spatial QA and dense scene caption data).
  • Release larger-scale Spatial-TTT models.

📁 Repository Structure

Spatial-TTT/
├── assets/
│   ├── teaser.png
│   └── pipeline.png             # Framework figure
├── qwen-vl-finetune/
│   ├── spatial_ttt_train.sh    # Spatial-TTT training (2648, Spatial-TTT-Data-97k)
│   ├── qwenvl/
│   │   ├── train/              # train_spatial_ttt.py, trainer, arguments
│   │   └── data/               # Dataset configs and data processor
│   ├── models/                 # LaCT/TTT layers (causal_swa_lact, spatial_ttt)
│   └── scripts/                # DeepSpeed configs (e.g. zero2.json)
├── evaluation/spatial/         # VSI-Bench evaluation scripts
└── README.md

📚 Citation

If you find Spatial-TTT useful for your research, please cite:

@article{liu2026spatialttt,
  title   = {Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training},
  author  = {Liu, Fangfu and Wu, Diankun and Chi, Jiawei and Cai, Yimo and Hung, Yi-Hsin and Yu, Xumin and Li, Hao and Hu, Han and Rao, Yongming and Duan, Yueqi},
  journal = {arXiv preprint arXiv:2603.12255},
  year    = {2026}
}

Acknowledgements

Thanks to these great repositories and works: Spatial-MLLM, Qwen3-VL, Test-Time Training Done Right (LaCT), and the spatial understanding community.

About

Official Implementation of Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors