Chen-Yang-Liu/Awesome-RS-SpatioTemporal-VLMs

Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey


Chenyang Liu · Jiafan Zhang · Keyan Chen · Man Wang · Zhengxia Zou · Zhenwei Shi*✉


This repo records and tracks recent Remote Sensing Spatio-Temporal Vision-Language Models (RS-STVLMs). If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request.

⭐ Share us a ⭐

Give us a ⭐ if you're interested in this repo. We will continue to track relevant progress and update this repository.

🙌 Add Your Paper in our Repo and Survey!

  • You are welcome to open an issue or PR for your RS-STVLM work! We will record it for the next version of our survey.

🥳 News

🔥🔥🔥 The repo is being updated 🔥🔥🔥

✨ Highlight!!

✅ The first survey for Remote Sensing Spatio-Temporal Vision-Language Models.

✅ Some public datasets and code links are provided.

✅ We will continue to track related work in this repository.

📖 Introduction

Timeline of RS-STVLMs:

[Figure: timeline of RS-STVLMs]

📚 Remote Sensing Spatio-Temporal Vision-language Tasks and Methods

Change Captioning

Time Model Name Paper Title Visual Encoder Language Decoder Code/Project
2021.10 CNN-RNN Captioning changes in bi-temporal remote sensing images VGG-16 RNN N/A
2022.08 CC-RNN/SVM Change captioning: A new paradigm for multitemporal remote sensing image analysis VGG-16 RNN,SVM N/A
2022.11 RSICCformer Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset ResNet-101 Transformer Decoder link
2023.07 PSNet Progressive Scale-aware Network for Remote sensing Image Change Captioning ViT-B/32 Transformer Decoder link
2023.10 PromptCC A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning ViT-B/32 GPT-2 link
2023.11 Chg2Cap Changes to Captions: An Attentive Network for Remote Sensing Change Captioning ResNet-101 Transformer Decoder link
2023.11 ICT-Net Interactive Change-Aware Transformer Network for Remote Sensing Image Change Captioning ResNet-101 Transformer Decoder link
2024.03 SITS-CC Change Caption for Satellite Images Time Series ResNet-101 Transformer Decoder link
2024.05 RSCaMa RSCaMa: Remote Sensing Image Change Captioning with State Space Model ViT-B/32 Mamba, Transformer Decoder, GPT-2 link
2024.05 SparseFocus A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning ResNet-101 Transformer Decoder link
2024.05 SEN Single-stream Extractor Network with Contrastive Pre-training for Remote Sensing Change Captioning ResNet with 6-channel Transformer Decoder link
2024.05 Diffusion-RSCC Diffusion model for learning cross-modal data distribution ResNet-101 Diffusion link
2024.05 CARD Context-aware Difference Distilling for Multi-change Captioning ResNet-101 Transformer Decoder link
2024.06 ChangeRetCap Towards a multimodal framework for remote sensing image change retrieval and captioning ResNet-101 Transformer Decoder link
2024.06 Intelli-Change Intelli-Change Remote Sensing - A Novel Transformer Approach ResNet-101 Transformer Decoder N/A
2024.06 ChangeExp Towards Temporal Change Explanations from Bi-Temporal Satellite Images LLaVA-1.5 LLaVA-1.5 N/A
2024.07 MAF-Net Multi-scale Attentive Fusion Network for Remote Sensing Image Change Captioning ResNet-101 Transformer Decoder N/A
2024.07 SFEN Scale-wised feature enhancement network for change captioning of remote sensing images WideResNet Transformer Decoder N/A
2024.09 MfrNet MfrNet: A New Multi-Scale Feature Refining Method for Remote Sensing Image Change Captioning ResNet-18 Transformer Decoder N/A
2024.09 SEIFNet Inter-Temporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning ResNet-101 Transformer Decoder link
2024.10 MV-CC MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption InternVideo2 Transformer Decoder link
2024.10 Chareption Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning CLIP ViT-L/14 LLaMA-7B N/A
2024.11 MADiffCC Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model Diffusion Transformer Decoder N/A
2024.11 CCExpert CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset Siglip-400m Qwen-2 link
2024.12 --- Data Augmentation in Remote Sensing Image Change Captioning ViT-B/32 Transformer Decoder N/A
2024.12 Mask Approx Net Mask Approximation Net: A Novel Diffusion Model Approach for Remote Sensing Change Captioning ResNet Transformer Decoder link
2025.01 SAT-Cap Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach ResNet-101 Transformer Decoder link
2025.01 MModalCC Robust Change Captioning in Remote Sensing: SECOND-CC Dataset and MModalCC Framework ResNet-101 Transformer Decoder link
2025.01 SGD-RSCCN Scene Graph and Dependency Grammar Enhanced Remote Sensing Change Caption Network (SGD-RSCCN) ResNet-101 Transformer Decoder N/A
2025.02 TGIPG Image Editing based on Diffusion Model for Remote Sensing Image Change Captioning // // N/A
2025.03 Change3D Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective X3D-L(video) Transformer Decoder link
2025.03 CD4C CD4C: Change Detection for Remote Sensing Image Change Captioning ResNet-101 Transformer Decoder N/A
2025.04 RDD+ACR Region-aware Difference Distilling with Attribute-guided Contrastive Regularization for Change Captioning ResNet-101 Transformer Decoder N/A
2025.04 FST-Net Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning Segformer Transformer Decoder N/A
2025.05 CTSD-Net A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning SegFormer Transformer Decoder N/A
2025.06 CTM Cross-Temporal Remote Sensing Image Change Captioning: A Manifold Mapping and Bayesian Diffusion Approach for Land Use Monitoring CLIP Transformer Decoder N/A
2025.06 IHM-SNet IHM-SNet: An Interactive Hierarchical Mamba-Based Screening Network for Remote Sensing Image Change Captioning CLIP-ViT Transformer Decoder N/A
2025.07 MTI-CC Cross-layer Attention Enhanced Remote Sensing Image Change Captioning via Mamba-Transformer Interaction CLIP-ViT Transformer Decoder N/A
2025.08 CI-Net Restricted supervised Cascade Information Network for remote sensing change captioning with serial sentences Asymmetric Siamese Network Cascade Linguistic Module N/A
2025.08 SCCNet SCCNet: Siamese Networks for Selective Change Captioning in Bi-Temporal Remote Sensing Images ViT Transformer Decoder N/A
2025.08 -- Text-Augmented Semantic Feature Extraction and Difference Information Learning for Remote Sensing Image Change Captioning FastSAM+CLIP Transformer Decoder link
2025.08 C3aptioner C3aptioner: Improving Change Captioning by Leveraging Momentum Cross-view and Cross-modality Contrastive Learning ResNet-101 Transformer Decoder N/A
........

Multitask Learning of Change Detection and Change Captioning

Time Model Name Paper Title Visual Encoder Language Decoder Code/Project
2024.01 Pix4Cap Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning ViT-B/32 Transformer Decoder link
2024.03 Change-Agent Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis ViT-B/32 Transformer Decoder link
2024.07 Semantic-CC Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance SAM Vicuna N/A
2024.09 DetACC * Detection Assisted Change Captioning for Remote Sensing Image ResNet-101 Transformer Decoder N/A
2024.09 KCFI Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning ViT Qwen link
2024.10 MV-CC * MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption InternVideo2 Transformer Decoder link
2024.10 ChangeMinds ChangeMinds: Multi-task Framework for Detecting and Describing Changes in Remote Sensing Swin Transformer Transformer Decoder link
2024.10 CTMTNet A Multi-Task Network and Two Large Scale Datasets for Change Detection and Captioning in Remote Sensing Images ResNet-101 Transformer Decoder N/A
2024.12 Mask Approx Net Mask Approximation Net: A Novel Diffusion Model Approach for Remote Sensing Change Captioning ResNet Transformer Decoder link
2025.01 MModalCC * Robust Change Captioning in Remote Sensing: SECOND-CC Dataset and MModalCC Framework ResNet-101 Transformer Decoder link
2025.03 CD4C * CD4C: Change Detection for Remote Sensing Image Change Captioning ResNet-101 Transformer Decoder N/A
2025.04 FST-Net Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning Segformer Transformer Decoder N/A
......

Change Question Answering

Time Model Name Paper Title Visual Encoder Language Decoder Code/Project
2022.07 change-aware VQA Change-Aware Visual Question Answering CNN RNN N/A
2022.09 CDVQA-Net Change Detection Meets Visual Question Answering CNN RNN link
2024.09 ChangeChat ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning CLIP-ViT Vicuna-v1.5 link
2024.09 CDChat CDChat: A Large Multimodal Model for Remote Sensing Change Description CLIP ViT-L/14 Vicuna-v1.5 link
2024.10 TEOChat TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data CLIP ViT-L/14 LLaMA-2 link
2024.10 GeoLLaVA GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing Video encoder LLaVA-NeXT, Video-LLaVA link
2024.10 VisTA Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection CLIP image Encoder CLIP Text Encoder link
2024.12 RSUniVLM RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts Siglip-400m Qwen2-0.5B link
2024.12 EarthDial EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues InternViT-300M Phi-3-mini link
2024.12 UniRS UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models Siglip-400m Sheared-LLAMA-3B link
2025.05 DVLChat DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding SAM Qwen2.5-VL N/A
......

Text-driven Temporal Image Retrieval

Time Model Name Paper Title Code/Project
2024.06 ChangeRetCap Towards a multimodal framework for remote sensing image change retrieval and captioning link
2025.01 text-ITSR Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing N/A
........

Change Grounding

Time Model Name Grounding Output Paper Title Code/Project
2024.09 ChangeChat mask ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning link
2024.10 TEOChat bbox TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data link
2024.10 VisTA mask Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection link
2024.12 RSUniVLM mask RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts link
2024.12 EarthDial bbox EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues link
2025.03 Falcon mask Falcon: A Remote Sensing Vision-Language Foundation Model link
2025.03 GeoRSMLLM mask GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing N/A
........

Text-driven Temporal Image Generation

Time Model Name Paper Title Code/Project
2025.02 TGIPG Image Editing based on Diffusion Model for Remote Sensing Image Change Captioning N/A
2025.04 ChangeDiff ChangeDiff: A Multi-Temporal Change Detection Data Generator with Flexible Text Prompts via Diffusion Model link
2025.07 -- Open-vocabulary generative vision-language models for creating a large-scale remote sensing change detection dataset link
2025.07 ChangeBridge ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing N/A
........

👨‍🏫 Large Language Models Meet Temporal Images

LLM-driven Task-Specific Spatio-Temporal VLMs

Time Method Paper Title Visual Encoder LLM Fine-tuning Code/Project
2023.10 PromptCC A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning CLIP-ViT-B/32 GPT-2 Prompt Tuning link
2024.06 ChangeExp Towards Temporal Change Explanations from Bi-Temporal Satellite Images CLIP-ViT-L LLaVA-1.5 Prompt Method N/A
2024.07 Semantic-CC Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance SAM Vicuna LoRA N/A
2024.09 KCFI Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning ViT Qwen Prompt Tuning link
2024.09 CDChat CDChat: A Large Multimodal Model for Remote Sensing Change Description CLIP-ViT-L/14 Vicuna-v1.5 LoRA link
2024.10 GeoLLaVA GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing Siglip-400m LLaVA-NeXT LoRA link
2024.10 Chareption Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning CLIP-ViT-L/14 LLaMA-7B Adapter N/A
2024.11 CCExpert CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset Siglip-400m Qwen-2 LoRA link
........

Unified Spatio-Temporal Vision-Language Foundation Models

Time Method Paper Title Visual Encoder LLM Fine-tuning Code/Project
2024.03 Change-Agent Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis Segformer ChatGPT Frozen link
2024.09 ChangeChat ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning CLIP-ViT Vicuna-v1.5 LoRA link
2024.10 TEOChat TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data CLIP ViT-L/14 LLaMA-2 LoRA link
2024.12 RingMoGPT RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and grounded tasks ViT-g/14(EVA-CLIP) Vicuna-13B Frozen N/A
2024.12 RSUniVLM RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts Siglip-400m Qwen2-0.5B MoE link
2024.12 EarthDial EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues InternViT-300M Phi-3-mini Full Fine-tuning link
2024.12 UniRS UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models Siglip-400m Sheared-LLAMA-3B Full Fine-tuning link
2025.03 Falcon Falcon: A Remote Sensing Vision-Language Foundation Model DaViT Florence-2 Full Fine-tuning link
2025.03 GeoRSMLLM GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing SigLIP Qwen2-7B N/A N/A
2025.05 DVLChat DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding SAM Qwen2.5-VL LoRA N/A
........

LLM-driven Remote Sensing Vision-Language Agents

Time Method Paper Title Function Code
2024.01 RSChatGPT Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models Single-image analysis Link
2024.03 Change-Agent Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis Spatio-Temporal Change Interpretation Link
2024.06 RS-Agent RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent Tool selection and knowledge search Link
2024.07 RS-AGENT RS-AGENT: Large Language Models Guided Agent System for Remote Sensing Image Generation Image Generation N/A
2024.12 GeoTool-GPT GeoTool-GPT: a trainable method for facilitating Large Language Models to master GIS tools Master GIS tools N/A
2025.01 RescueADI RescueADI: Adaptive Disaster Interpretation in Remote Sensing Images With Autonomous Agents Disaster Interpretation N/A
........

🛰️ Dataset

Matching Temporal Images, Text, and Masks

Dataset Time Image Size Image Resolution Image Pairs Captions* Masks Temporal Image Data Source Anno. Link
DUBAI CCD 2022.08 50×50 30m 500 2,500 - Landsat-7 imagery Manual Link
LEVIR CCD 2022.08 256×256 0.5m 500 2,500 - LEVIR-CD Manual Link
LEVIR-CC 2022.11 256×256 0.5m 10,077 50,385 - LEVIR-CD Manual Link
CCExpert 2024.11 - - 200K 1.2M - LEVIR-CC, CLEVR-Change, ImageEdit, Spot-the-Diff, STVchrono, Vismin, ChangeSim, SYSU-CD, SECOND Auto. Link
SECTION 2025.07 256×256 0.3-3m 4,059 12,200 - SECOND Manual Link
LEVIR-MCI 2024.03 256×256 0.5m 10,077 50,385 building, road LEVIR-CC Manual Link
LEVIR-CDC 2024.11 256×256 0.5m 10,077 50,385 building LEVIR-CC Manual Link
WHU-CDC 2024.11 256×256 0.075m 7,434 37,170 building WHU-CD Manual Link
SECOND-CC 2025.01 256×256 0.3∼3m 6,041 30,205 6 classes SECOND Manual Link
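
The datasets in this table pair two temporal images with several captions and, for some datasets, a change mask. Below is a minimal Python sketch of how one such sample could be represented, plus a quick check of the captions-per-pair ratio using counts copied from the table. The `ChangeSample` class, its field names, and the `captions_per_pair` helper are hypothetical illustrations; they do not mirror the file layout or loader of any listed dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChangeSample:
    """One bi-temporal sample: two images, captions, and an optional mask.

    Field names are hypothetical and not taken from any dataset's release.
    """
    image_before: str                      # path to the earlier image
    image_after: str                       # path to the later image
    captions: List[str] = field(default_factory=list)
    mask: Optional[str] = None             # path to a change mask, if provided


def captions_per_pair(num_pairs: int, num_captions: int) -> float:
    """Average number of captions annotated per image pair."""
    if num_pairs <= 0:
        raise ValueError("num_pairs must be positive")
    return num_captions / num_pairs


# (image pairs, captions) copied from the table above.
TABLE_COUNTS = {
    "LEVIR-CC": (10_077, 50_385),
    "WHU-CDC": (7_434, 37_170),
    "SECOND-CC": (6_041, 30_205),
}

ratios = {name: captions_per_pair(*counts) for name, counts in TABLE_COUNTS.items()}
```

For the three datasets checked here, the counts in the table work out to five captions per image pair.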

Matching Temporal Images, Instruction and Response

Dataset Time Instruction Samples Number of Images Temporal Length Temporal Image Data Source Anno. Link
CDVQA 2022.09 122,000 2,968 2 SECOND Manual Link
ChangeChat-87k 2024.09 87,195 10,077 2 LEVIR-CC, LEVIR-MCI Auto. Link
QAG-360K 2024.10 360,000 6,810 2 Hi-UCD, SECOND, LEVIR-CD Auto. Link
GeoLLaVA 2024.10 100,000 100,000 2 fMoW Auto. Link
TEOChatlas 2024.10 554,071 - 1~8 xBD, S2Looking, QFabric, fMoW Auto. Link
EarthDial 2024.12 11.11 Million - 1~4 fMoW, TreeSatAI-Time-Series, MUDS, xBD, QuakeSet Manual & Auto. Link
UniRS 2024.12 318.8 K - 1~T (T>2) LEVIR-CC, ERA-Video Auto. Link
Falcon_SFT 2025.03 78 Million 5.6 Million 1~2 CDD, EGY-BCD, HRSCD, LEVIR-CD, MSBC, MSOSCD, NJDS, S2Looking, SYSU-CD, WHU-CD Auto. Link
DVL-Suite 2025.05 69,926 15,063 6.9 (Average) U.S. National Agriculture Imagery Program (NAIP) Manual & Auto. N/A
....

💻 Others

Some CLIP Models in Remote Sensing

Time Model Name Paper Title Code/Project
2023.06 RemoteCLIP RemoteCLIP: A Vision Language Foundation Model for Remote Sensing link
2023.06 GeoRSCLIP RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing link
2023.12 SkyCLIP SkyScript: a large and semantically diverse vision-language dataset for remote sensing link
2025.01 Git-RSCLIP Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model link

🖊️ Citation

If you find our survey and repository useful for your research, please consider citing our paper:

@misc{liu2024remotesensingtemporalvisionlanguage,
      title={Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey}, 
      author={Chenyang Liu and Jiafan Zhang and Keyan Chen and Man Wang and Zhengxia Zou and Zhenwei Shi},
      year={2024},
      eprint={2412.02573},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02573}, 
}

🐲 Contact

liuchenyang@buaa.edu.cn
