Fangfu Liu*,1,
Diankun Wu*,1,
Jiawei Chi*,1,
Yimo Cai1,
Yi-Hsin Hung1,
Xumin Yu2,
Hao Li3,
Han Hu2,
Yongming Rao†,2,
Yueqi Duan†,1
*Equal Contribution †Corresponding Author
1Tsinghua University 2Tencent Hunyuan 3NTU
Spatial-TTT: We propose Spatial-TTT, a framework for streaming visual-based spatial intelligence with Test-Time Training (TTT). Given a visual-based spatial task, our method updates its spatial state from streaming video chunks and then answers the question, achieving state-of-the-art performance on video spatial benchmarks.
- [2026/03/13] 🎉 We release the paper on arXiv!
- [2026/03/13] We release the training and evaluation code for Spatial-TTT, the official implementation of Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training.
Overview of Spatial-TTT. The model employs a hybrid architecture that interleaves TTT layers with self-attention anchor layers to preserve pretrained knowledge while enabling efficient long spatial-context compression. Within each TTT layer, sliding-window attention (SWA) and the TTT branch operate in parallel with shared Q/K/V projections; the TTT branch applies a spatial-predictive mechanism with depthwise spatiotemporal convolution to capture geometric structure and temporal continuity.
Humans perceive and understand real-world spaces through a stream of visual observations. The ability to maintain and update spatial evidence in a streaming fashion from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time.
Spatial-TTT maintains adaptive fast weights that are updated online, acting as a compact non-linear memory to accumulate 3D evidence from long-horizon video streams. Key designs include:
- Hybrid TTT architecture: Interleaves TTT layers with self-attention anchor layers to preserve pretrained visual-semantic knowledge while enabling efficient long spatial-context compression.
- Large-chunk updates + sliding-window attention: Large chunk size for higher parallelism and hardware efficiency; sliding-window attention in parallel to preserve intra-chunk spatiotemporal continuity.
- Spatial-predictive mechanism: Lightweight depth-wise 3D convolutions on the TTT branch to capture geometric correspondence and temporal continuity across frames.
- Dense scene description: A dense scene-description dataset guides the model to update fast weights to memorize and organize global 3D spatial signals in a structured manner.
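To make the fast-weight idea concrete, here is a minimal, framework-agnostic sketch of how an online fast-weight memory can absorb evidence from streaming chunks. This is not the Spatial-TTT implementation: the delta-style update rule, learning rate, and toy dimensions are illustrative assumptions, standing in for the model's actual non-linear TTT update.

```python
import random

def fast_weight_update(W, k, v, lr):
    """Delta-rule step: nudge W so that W @ k predicts v better.

    Returns the squared prediction error measured *before* the update,
    so we can watch the memory improve as the stream goes on.
    """
    dim_out, dim_in = len(W), len(k)
    pred = [sum(W[i][j] * k[j] for j in range(dim_in)) for i in range(dim_out)]
    err = [v[i] - pred[i] for i in range(dim_out)]
    for i in range(dim_out):
        for j in range(dim_in):
            W[i][j] += lr * err[i] * k[j]
    return sum(e * e for e in err)

random.seed(0)
dim = 4
# Fast weights start empty; they act as the compact online memory.
W = [[0.0] * dim for _ in range(dim)]
# A fixed "scene" mapping that the memory should absorb from the stream.
target = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

errors = []
for chunk in range(8):       # streaming chunks arriving over time
    for _ in range(16):      # tokens within a chunk
        k = [random.uniform(-1, 1) for _ in range(dim)]
        v = [sum(target[i][j] * k[j] for j in range(dim)) for i in range(dim)]
        e = fast_weight_update(W, k, v, lr=0.1)
    errors.append(e)         # error of the last token in each chunk

print(round(errors[0], 4), round(errors[-1], 4))
```

The error on later chunks is far smaller than on early ones: the fast weights have compressed the stream's structure without storing any past tokens, which is the property the TTT layers exploit for long-horizon spatial memory.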
```bash
git clone https://github.com/THU-SI/Spatial-TTT.git
cd Spatial-TTT/qwen-vl-finetune
```

We use conda to manage the environment. Recommended versions:
- Python 3.10+
- `torch>=2.6.0`, `torchvision`, `transformers>=4.57.0`, `deepspeed`, `flash-attn`, `accelerate`, `peft`, `triton`, `torchcodec`
```bash
conda create -n spatial-ttt python=3.10 -y
conda activate spatial-ttt
pip install torch torchvision deepspeed accelerate peft transformers==4.57.0
pip install flash-attn --no-build-isolation
pip install torchcodec
pip install qwen-vl-utils
```

Configure the dataset in `qwen-vl-finetune/qwenvl/data/__init__.py`: set `annotation_path` and `data_path` for Spatial-TTT-Data-97k (download from THU-SI/Spatial-TTT-Data-97k on Hugging Face).
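As a hypothetical sketch of what the dataset entry in `qwen-vl-finetune/qwenvl/data/__init__.py` might look like: the `SPATIAL_TTT_DATA_97K` name and the `annotation_path`/`data_path` fields come from this README, but the exact schema, registry dict, and the placeholder paths below are assumptions — match them to the actual config in the repo.

```python
# Hypothetical config fragment; replace the paths with your local copies
# of the dataset downloaded from THU-SI/Spatial-TTT-Data-97k.
SPATIAL_TTT_DATA_97K = {
    "annotation_path": "/data/Spatial-TTT-Data-97k/annotations.json",
    "data_path": "/data/Spatial-TTT-Data-97k/videos",
}

# Assumed dataset registry keyed by the name the training script passes.
data_dict = {
    "spatial_ttt_data_97k": SPATIAL_TTT_DATA_97K,
}
```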
We provide a single training script with chunk size 2648 and the Spatial-TTT-Data-97k dataset.
- Set in `qwen-vl-finetune/spatial_ttt_train.sh`:
  - `MODEL_PATH`: path to pretrained Qwen3-VL (or your base checkpoint).
  - `OUTPUT_DIR`: where to save checkpoints.
- Set Spatial-TTT-Data-97k paths in `qwen-vl-finetune/qwenvl/data/__init__.py`: edit `SPATIAL_TTT_DATA_97K` with your `annotation_path` and `data_path` (after downloading from THU-SI/Spatial-TTT-Data-97k).
- From the `qwen-vl-finetune/` directory:
```bash
cd qwen-vl-finetune
# 8 GPUs by default; set NPROC_PER_NODE or CUDA_VISIBLE_DEVICES as needed
bash spatial_ttt_train.sh
```

Main settings: `lact_chunk_size=2648`, `window_size=2648`, `video_max_frames=128`, dataset `spatial_ttt_data_97k`.
Evaluation on VSI-Bench is under `evaluation/spatial/`. Run the script with the checkpoint path and an output name:
```bash
# Evaluates on VSI-Bench with 128 frames
bash evaluation/spatial/scripts/eval_spatial_ttt_2b.sh /path/to/checkpoint my_model 8
```

See `evaluation/spatial/readme.md` for result summarization.
We release Spatial-TTT-nano, an SFT model trained on a mini spatial dataset of fewer than 1M samples. Download: Spatial-TTT-nano on Hugging Face. See the Releases page for more.
We provide Spatial-TTT-Data-97k (THU-SI/Spatial-TTT-Data-97k on Hugging Face), a mini high-quality spatial dataset from Spatial-TTT with ~97k samples for training and reproduction. This is the dataset used in the configuration and training steps above.
We also release Spatial-TTT-Data-Streaming (THU-SI/Spatial-TTT-Data-Streaming on Hugging Face), part of our self-prepared streaming data. It can be helpful for VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting) related training in Cambrian-S: Towards Spatial Supersensing in Video (see arXiv:2511.04670).
- Update the full model (trained on all data).
- Release full training data (general spatial QA and dense scene caption data).
- Release larger-scale Spatial-TTT models.
```
Spatial-TTT/
├── assets/
│   ├── teaser.png
│   └── pipeline.png              # Framework figure
├── qwen-vl-finetune/
│   ├── spatial_ttt_train.sh      # Spatial-TTT training (2648, Spatial-TTT-Data-97k)
│   ├── qwenvl/
│   │   ├── train/                # train_spatial_ttt.py, trainer, arguments
│   │   └── data/                 # Dataset configs and data processor
│   ├── models/                   # LaCT/TTT layers (causal_swa_lact, spatial_ttt)
│   └── scripts/                  # DeepSpeed configs (e.g. zero2.json)
├── evaluation/spatial/           # VSI-Bench evaluation scripts
└── README.md
```
If you find Spatial-TTT useful for your research, please cite:
```bibtex
@article{liu2026spatialttt,
  title   = {Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training},
  author  = {Liu, Fangfu and Wu, Diankun and Chi, Jiawei and Cai, Yimo and Hung, Yi-Hsin and Yu, Xumin and Li, Hao and Hu, Han and Rao, Yongming and Duan, Yueqi},
  journal = {arXiv preprint arXiv:2603.12255},
  year    = {2026}
}
```

Thanks to these great repositories and works: Spatial-MLLM, Qwen3-VL, Test-Time Training Done Right (LaCT), and the spatial understanding community.

