Chenyang Liu · Jiafan Zhang · Keyan Chen · Man Wang · Zhengxia Zou · Zhenwei Shi*✉
This repo records and tracks recent Remote Sensing Spatio-Temporal Vision-Language Models (RS-STVLMs). If you find any work missing or have suggestions (papers, implementations, or other resources), feel free to open a pull request.
Give us a ⭐ if you're interested in this repo. We will continue to track relevant progress and update this repository.
- You are welcome to open an issue or PR for your RS-STVLM work! We will include it in the next version of our survey.
🔥🔥🔥 The repo is being updated 🔥🔥🔥
✅ The first survey for Remote Sensing Spatio-Temporal Vision-Language Models.
✅ Some public datasets and code links are provided.
✅ We will continue to track related work in this repository.
Timeline of RS-STVLMs:
- 📚 Remote Sensing Spatio-Temporal Vision-Language Tasks and Methods
- 👨‍🏫 Large Language Models Meet Temporal Images
- 🛰️ Dataset
- 💻 Others
- 🖊️ Citation
- 🐲 Contact
Time | Model Name | Paper Title | Code/Project |
---|---|---|---|
2024.06 | ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | link |
2025.01 | text-ITSR | Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing | N/A |
........ |
Time | Model Name | Grounding Output | Paper Title | Code/Project |
---|---|---|---|---|
2024.09 | ChangeChat | mask | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | link |
2024.10 | TEOChat | bbox | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | link |
2024.10 | VisTA | mask | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | link |
2024.12 | RSUniVLM | mask | RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | link |
2024.12 | EarthDial | bbox | EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | link |
2025.03 | Falcon | mask | Falcon: A Remote Sensing Vision-Language Foundation Model | link |
2025.03 | GeoRSMLLM | mask | GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | N/A |
........ |
Time | Model Name | Paper Title | Code/Project |
---|---|---|---|
2025.02 | TGIPG | Image Editing based on Diffusion Model for Remote Sensing Image Change Captioning | N/A |
2025.04 | ChangeDiff | ChangeDiff: A Multi-Temporal Change Detection Data Generator with Flexible Text Prompts via Diffusion Model | link |
2025.07 | -- | Open-vocabulary generative vision-language models for creating a large-scale remote sensing change detection dataset | link |
2025.07 | ChangeBridge | ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing | N/A |
........ |
Time | Method | Paper Title | Function | Code |
---|---|---|---|---|
2024.01 | RSChatgpt | Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models | Single-image analysis | Link |
2024.03 | Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | Spatio-Temporal Change Interpretation | Link |
2024.06 | RS-Agent | RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent | Tool selection and knowledge search | Link |
2024.07 | RS-AGENT | RS-AGENT: Large Language Models Guided Agent System for Remote Sensing Image Generation | Image Generation | N/A |
2024.12 | GeoTool-GPT | GeoTool-GPT: a trainable method for facilitating Large Language Models to master GIS tools | Master GIS tools | N/A |
2025.01 | RescueADI | RescueADI: Adaptive Disaster Interpretation in Remote Sensing Images With Autonomous Agents | Disaster Interpretation | N/A |
........ |
Dataset | Time | Image Size | Image Resolution | Image Pairs | Captions* | Masks | Temporal Image Data Source | Anno. | Link |
---|---|---|---|---|---|---|---|---|---|
DUBAI CCD | 2022.08 | 50×50 | 30m | 500 | 2,500 | - | Landsat-7 imagery | Manual | Link |
LEVIR CCD | 2022.08 | 256×256 | 0.5m | 500 | 2,500 | - | LEVIR-CD | Manual | Link |
LEVIR-CC | 2022.11 | 256×256 | 0.5m | 10,077 | 50,385 | - | LEVIR-CD | Manual | Link |
CCExpert | 2024.11 | - | - | 200K | 1.2M | - | LEVIR-CC, CLEVR-Change, ImageEdit, Spot-the-Diff, STVchrono, Vismin, ChangeSim, SYSU-CD, SECOND | Auto. | Link |
SECTION | 2025.07 | 256×256 | 0.3∼3m | 4,059 | 12,200 | - | SECOND | Manual | Link |
LEVIR-MCI | 2024.03 | 256×256 | 0.5m | 10,077 | 50,385 | building, road | LEVIR-CC | Manual | Link |
LEVIR-CDC | 2024.11 | 256×256 | 0.5m | 10,077 | 50,385 | building | LEVIR-CC | Manual | Link |
WHU-CDC | 2024.11 | 256×256 | 0.075m | 7,434 | 37,170 | building | WHU-CD | Manual | Link |
SECOND-CC | 2025.01 | 256×256 | 0.3∼3m | 6,041 | 30,205 | 6 classes | SECOND | Manual | Link |
Dataset | Time | Instruction Samples | Number of Images | Temporal Length | Temporal Image Data Source | Anno. | Link |
---|---|---|---|---|---|---|---|
CDVQA | 2022.09 | 122,000 | 2,968 | 2 | SECOND | Manual | Link |
ChangeChat-87k | 2024.09 | 87,195 | 10,077 | 2 | LEVIR-CC, LEVIR-MCI | Auto. | Link |
QAG-360K | 2024.10 | 360,000 | 6,810 | 2 | Hi-UCD, SECOND, LEVIR-CD | Auto. | Link |
GeoLLaVA | 2024.10 | 100,000 | 100,000 | 2 | fMoW | Auto. | Link |
TEOChatlas | 2024.10 | 554,071 | - | 1~8 | xBD, S2Looking, QFabric, fMoW | Auto. | Link |
EarthDial | 2024.12 | 11.11 Million | - | 1~4 | fMoW, TreeSatAI-Time-Series, MUDS, xBD, QuakeSet | Manual & Auto. | Link |
UniRS | 2024.12 | 318.8 K | - | 1~T (T>2) | LEVIR-CC, ERA-Video | Auto. | Link |
Falcon_SFT | 2025.03 | 78 Million | 5.6 Million | 1~2 | CDD, EGY-BCD, HRSCD, LEVIR-CD, MSBC, MSOSCD, NJDS, S2Looking, SYSU-CD, WHU-CD | Auto. | Link |
DVL-Suite | 2025.05 | 69,926 | 15,063 | 6.9 (Average) | U.S. National Agriculture Imagery Program (NAIP) | Manual & Auto. | N/A |
.... |
Time | Model Name | Paper Title | Code/Project |
---|---|---|---|
2023.06 | RemoteCLIP | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | link |
2023.06 | GeoRSCLIP | RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing | link |
2023.12 | SkyCLIP | SkyScript: a large and semantically diverse vision-language dataset for remote sensing | link |
2025.01 | Git-RSCLIP | Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model | link |
If you find our survey and repository useful for your research, please consider citing our paper:
@misc{liu2024remotesensingtemporalvisionlanguage,
title={Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey},
author={Chenyang Liu and Jiafan Zhang and Keyan Chen and Man Wang and Zhengxia Zou and Zhenwei Shi},
year={2024},
eprint={2412.02573},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.02573},
}
liuchenyang@buaa.edu.cn