A scene graph is a topological structure representing a scene described by text, an image, a video, or another modality. In this graph, nodes correspond to object bounding boxes with their category labels and attributes, while edges represent the pairwise relationships between objects.
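For readers new to the formalism, here is a minimal sketch of how such a graph is often stored in code (plain Python; the categories, boxes, and predicate are made-up illustrative values, not tied to any dataset):

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    category: str                                    # e.g., "person"
    bbox: tuple                                      # (x1, y1, x2, y2) in pixels
    attributes: list = field(default_factory=list)   # e.g., ["young"]

# Nodes: objects with category labels, bounding boxes, and attributes.
nodes = [
    ObjectNode("person", (10, 20, 120, 300), ["young"]),
    ObjectNode("horse", (100, 80, 400, 320), ["brown"]),
]

# Edges: pairwise relationships as (subject_idx, predicate, object_idx) triplets.
edges = [(0, "riding", 1)]

for s, pred, o in edges:
    print(f"{nodes[s].category} --{pred}--> {nodes[o].category}")  # person --riding--> horse
```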
- 🌷 Scene Graph Datasets
- 🍕 Scene Graph Generation
- 🥝 Scene Graph Application
- 🤶 Evaluation Metrics
- 🐱‍🚀 Miscellaneous
- ⭐️ Star History
Dataset | Modality | Obj. Class | BBox | Rela. Class | Triplets | Instances |
---|---|---|---|---|---|---|
Visual Phrase | Image | 8 | 3,271 | 9 | 1,796 | 2,769 |
Scene Graph | Image | 266 | 69,009 | 68 | 109,535 | 5,000 |
VRD | Image | 100 | - | 70 | 37,993 | 5,000 |
Open Images v7 | Image | 600 | 3,290,070 | 31 | 374,768 | 9,178,275 |
Visual Genome | Image | 5,996 | 3,843,636 | 1,014 | 2,347,187 | 108,077 |
GQA | Image | 200 | - | 310 | - | 3,795,907 |
VrR-VG | Image | 1,600 | 282,460 | 117 | 203,375 | 58,983 |
UnRel | Image | - | - | 18 | 76 | 1,071 |
SpatialSense | Image | 3,679 | - | 9 | 13,229 | 11,569 |
SpatialVOC2K | Image | 20 | 5,775 | 34 | 9,804 | 2,026 |
OpenSG | Image (panoptic) | 133 | - | 56 | - | 49K |
AUG | Image (Overhead View) | 76 | - | 61 | - | - |
STAR | Satellite Imagery | 48 | 219,120 | 58 | 400,795 | 31,096 |
ReCon1M | Satellite Imagery | 60 | 859,751 | 64 | 1,149,342 | 21,392 |
SkySenseGPT | Satellite Imagery (Instruction) | - | - | - | - | - |
ImageNet-VidVRD | Video | 35 | - | 132 | 3,219 | 100 |
VidOR | Video | 80 | - | 50 | - | 10,000 |
Action Genome | Video | 35 | 0.4M | 25 | 1.7M | 10,000 |
AeroEye | Video (Drone-View) | 56 | - | 384 | - | 2.2M |
PVSG | Video (panoptic) | 126 | - | 57 | 4,587 | 400 |
ASPIRe | Video (Interlacements) | - | - | 4.5K | - | 1.5K |
Ego-EASG | Video (Ego-view) | 407 | - | 235 | - | - |
3D Semantic Scene Graphs (3DSSG) | 3D | 528 | - | 39 | - | 48K |
PSG4D | 4D | 46 | - | 15 | - | - |
4D-OR | 4D (Operating Room) | 12 | - | 14 | - | - |
FACTUAL | Image, Text | 4,042 | - | 1,607 | 40,149 | 40,369 |
There are three subtasks:

Predicate classification
: given the ground-truth labels and bounding boxes of object pairs, predict the predicate label.

Scene graph classification
: given the ground-truth bounding boxes, jointly classify the predicate labels and the objects' categories.

Scene graph detection
: detect the objects and their categories, then predict the predicate between object pairs.
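To make the differences concrete, here is a minimal sketch of what is given versus predicted in each setting (the detector and relation head below are trivial stand-in stubs, not any particular model):

```python
# Stand-in stubs for a real detector and relation head (illustrative only).
def detect_objects(image):
    return [(0, 0, 10, 10), (5, 5, 20, 20)], ["person", "horse"]

def classify_objects(image, boxes):
    return ["person", "horse"][: len(boxes)]

def classify_predicates(image, boxes, labels):
    return [(0, "riding", 1)]  # (subject_idx, predicate, object_idx)

def predicate_classification(image, gt_boxes, gt_labels):
    """PredCls: GT boxes and labels are given; only predicates are predicted."""
    return classify_predicates(image, gt_boxes, gt_labels)

def scene_graph_classification(image, gt_boxes):
    """SGCls: GT boxes are given; object labels and predicates are predicted."""
    labels = classify_objects(image, gt_boxes)
    return labels, classify_predicates(image, gt_boxes, labels)

def scene_graph_detection(image):
    """SGDet: nothing is given; boxes, labels, and predicates are all predicted."""
    boxes, labels = detect_objects(image)
    return boxes, labels, classify_predicates(image, boxes, labels)
```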
- Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms
- Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency
- Scene Graph Generation with Role-Playing Large Language Models
- SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding
- VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
- LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
- Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World
- GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives
- Less is More: Toward Zero-Shot Local Scene Graph Generation via Foundation Models
- Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge 🙆‍♀️👈
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
- Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction
- BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation
- Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation
- Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
- Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation
- Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation
- Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
- Leveraging Predicate and Triplet Learning for Scene Graph Generation
- DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation
- HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
- EGTR: Extracting Graph from Transformer for Scene Graph Generation
- STAR: A First-Ever Dataset and A Large-Scale Benchmark for Scene Graph Generation in Large-Size Satellite Imagery
- Improving Scene Graph Generation with Relation Words’ Debiasing in Vision-Language Models
- Adaptive Visual Scene Understanding: Incremental Scene Graph Generation
- Ensemble Predicate Decoding for Unbiased Scene Graph Generation
- ReCon1M: A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery
- RepSGG: Novel Representations of Entities and Relationships for Scene Graph Generation
- Hierarchical Relationships: A New Perspective to Enhance Scene Graph Generation
- Improving Scene Graph Generation with Superpixel-Based Interaction Learning
- Unbiased Scene Graph Generation via Two-stage Causal Modeling
- Zero-Shot Scene Graph Generation via Triplet Calibration and Reduction
- Evidential Uncertainty and Diversity Guided Active Learning for Scene Graph Generation
- Prototype-based Embedding Network for Scene Graph Generation
- IS-GGT: Iterative Scene Graph Generation With Generative Transformers
- Learning to Generate Language-supervised and Open-vocabulary Scene Graph using Pre-trained Visual-Semantic Space
- Fast Contextual Scene Graph Generation with Unbiased Context Augmentation
- Devil’s on the Edges: Selective Quad Attention for Scene Graph Generation
- Fine-Grained is Too Coarse: A Novel Data-Centric Approach for Efficient Scene Graph Generation
- Vision Relation Transformer for Unbiased Scene Graph Generation
- Compositional Feature Augmentation for Unbiased Scene Graph Generation
- The Devil Is in the Labels: Noisy Label Correction for Robust Scene Graph Generation
- Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
- Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation
- Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning
- Unbiased Heterogeneous Scene Graph Generation with Relation-Aware Message Passing Neural Network
- VARSCENE: A Deep Generative Model for Realistic Scene Graph Synthesis
- Linguistic Structures as Weak Supervision for Visual Scene Graph Generation
- CogTree: Cognition Tree Loss for Unbiased Scene Graph Generation
- Learning to Generate Scene Graph from Natural Language Supervision
- Context-Aware Scene Graph Generation With Seq2Seq Transformers
- Generative Compositional Augmentations for Scene Graph Prediction
- GPS-Net: Graph Property Sensing Network for Scene Graph Generation
- Learning to Compose Dynamic Tree Structures for Visual Contexts
- Knowledge-Embedded Routing Network for Scene Graph Generation
- Scene Graph Generation From Objects, Phrases and Region Captions
Compared with a traditional scene graph, each object in PSG is grounded by a panoptic segmentation mask, achieving a comprehensive structured scene representation.
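As a toy illustration of the difference, a PSG-style record grounds each node with a mask rather than a box (the field names and 4x4 masks below are illustrative, not the actual PSG annotation format):

```python
import numpy as np

# Toy 4x4 panoptic masks; real masks are full-resolution and non-overlapping.
person_mask = np.zeros((4, 4), dtype=bool); person_mask[1:3, 0:2] = True
horse_mask = np.zeros((4, 4), dtype=bool); horse_mask[1:4, 2:4] = True

psg_record = {
    "objects": [
        {"category": "person", "mask": person_mask},  # mask replaces the bbox
        {"category": "horse", "mask": horse_mask},
    ],
    "relations": [(0, "riding", 1)],  # (subject_idx, predicate, object_idx)
}
```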
- Pair then Relation: Pair-Net for Panoptic Scene Graph Generation
- From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation
- A Fair Ranking and New Model for Panoptic Scene Graph Generation
- OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
- Panoptic scene graph generation with semantics-prototype learning
- TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
- HiLo: Exploiting high low frequency relations for unbiased panoptic scene graph generation
- Haystack: A Panoptic Scene Graph Dataset to Evaluate Rare Predicate Classes
- Deep Generative Probabilistic Graph Neural Networks for Scene Graph Generation
Spatio-Temporal (Video) Scene Graph Generation, a.k.a. dynamic scene graph generation, aims to provide a detailed and structured interpretation of the whole scene by parsing an event into a sequence of interactions between different visual entities. It usually involves two subtasks:
Scene graph detection
: generate scene graphs for given videos, comprising detection results for subject-object pairs and their associated predicates. An object localization is considered accurate when the Intersection over Union (IoU) between the prediction and the ground truth is greater than 0.5.

Predicate classification
: classify the predicates for given oracle detection results of subject-object pairs.

Note: Evaluation is conducted under two settings: ***With Constraint*** and ***No Constraints***. In the former, each subject-object pair is allowed at most one predicate, so the generated graph has at most one edge per pair; in the latter, the graph may have multiple edges per pair. See Metrics for more details.
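As a concrete illustration, here is a minimal Recall@K sketch covering both settings. It matches triplets on labels only and ignores localization (real protocols also require IoU > 0.5 box matches), so treat it as a simplification:

```python
def recall_at_k(pred_triplets, gt_triplets, k, with_constraint=True):
    """pred_triplets: (subject, predicate, object, score) tuples.
    Under With Constraint, only the top-scoring predicate per pair is kept."""
    preds = sorted(pred_triplets, key=lambda t: -t[3])
    if with_constraint:
        seen, kept = set(), []
        for s, p, o, score in preds:
            if (s, o) not in seen:
                seen.add((s, o))
                kept.append((s, p, o, score))
        preds = kept
    topk = {(s, p, o) for s, p, o, _ in preds[:k]}
    return len(topk & set(gt_triplets)) / max(len(gt_triplets), 1)

gt = [("person", "riding", "horse"), ("person", "wearing", "hat")]
preds = [("person", "riding", "horse", 0.9),
         ("person", "on", "horse", 0.8),
         ("person", "wearing", "hat", 0.7)]
print(recall_at_k(preds, gt, k=2, with_constraint=True))   # 1.0
print(recall_at_k(preds, gt, k=2, with_constraint=False))  # 0.5: "on" crowds out "wearing"
```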
- End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting
- Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation
- CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos
- OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
- Action Scene Graphs for Long-Form Understanding of Egocentric Videos
- HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
Summary
Introduces a new dataset that delves into interactivity understanding within visual content by deriving scene graph representations from dense interactivities among humans and objects.
- End-to-End Video Scene Graph Generation With Temporal Propagation Transformer
- Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs
- Video Scene Graph Generation from Single-Frame Weak Supervision
- Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference
- Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs
- VRDFormer: End-to-End Video Visual Relation Detection with Transformers
- Dynamic Scene Graph Generation via Anticipatory Pre-training
- Meta Spatio-Temporal Debiasing for Video Scene Graph Generation
- Spatial-temporal transformer for dynamic scene graph generation
- Target adaptive context aggregation for video scene graph generation
Given a 3D point cloud, 3D Scene Graph Generation aims to map the input point cloud to a reliable, semantically structured scene graph.
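A minimal sketch of the input/output contract (NumPy-based; the point count, categories, and predicate are illustrative assumptions):

```python
import numpy as np

# Input: a point cloud of N points, here xyz + rgb per point.
points = np.random.rand(2048, 6)

# Output: a semantic scene graph over object instances segmented from the cloud.
scene_graph = {
    "objects": [
        {"id": 0, "category": "chair", "point_indices": np.arange(0, 900)},
        {"id": 1, "category": "table", "point_indices": np.arange(900, 2048)},
    ],
    "relations": [(0, "standing next to", 1)],  # (subject_id, predicate, object_id)
}
```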
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
- Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation
- Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds
- EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion
- Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling
- SGRec3D: Self-Supervised 3D Scene Graph Learning via Object-Level Scene Reconstruction
- Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
- CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning
- Incremental 3D Semantic Scene Graph Prediction from RGB Sequences
- VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud
- 3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud
- Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception
- Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph prediction
- SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences
- Exploiting Edge-Oriented Reasoning for 3D Point-based Scene Graph Analysis
- Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions
- 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing
- Scene Graph Parsing via Abstract Meaning Representation in Pre-trained Language Models
- Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval
- SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs
SceneGraphLoc addresses the novel problem of localizing a query image in a database of 3D scenes represented as compact multi-modal 3D scene graphs.
- Composing Object Relations and Attributes for Image-Text Matching
- Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval
- Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Introducing new dataset GBC10M
Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labelled graph structure, with nodes of various types. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and open-vocabulary detection models, by building a new dataset, GBC10M, gathering GBC annotations for about 10M images of the CC12M dataset.
- Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment
- Comprehensive Image Captioning via Scene Graph Decomposition
- From Show to Tell: A Survey on Deep Learning-based Image Captioning
- LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
- SSGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing
- SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance
- Joint Generative Modeling of Scene Graphs and Images via Diffusion Models
- Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
- R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion
- Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation
- Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion
- SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
- Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
- OSCAR-Net: Object-centric Scene Graph Attention for Image Attribution
- SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
- Towards Flexible Visual Relationship Segmentation
Visual relationship understanding has been studied separately in the human-object interaction (HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks; FleVRS is a single model that seamlessly integrates all three. FleVRS leverages the synergy between text and image modalities to ground various types of relationships from images, and uses textual features from vision-language models for visual conceptual understanding.
- LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
- Multi-modal Situated Reasoning in 3D Scenes
Introducing a large-scale multimodal situated reasoning dataset, scalably collected by leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes.
MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios and object modalities within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark, providing texts, images, and point clouds for situation and question description, aiming to resolve the ambiguity of describing situations with single-modality inputs (e.g., texts).
- SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering
- Semantic Compositions Enhance Vision-Language Contrastive Learning
- Compositional Chain-of-Thought Prompting for Large Multimodal Models
- Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
- The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
New dataset and new task (Relation Conversation)
We propose a novel task, termed Relation Conversation (ReC), which unifies the formulation of text generation, object localization, and relation comprehension. Based on the unified formulation, we construct the AS-V2 dataset, which consists of 127K high-quality relation conversation samples, to unlock the ReC capability for Multi-modal Large Language Models (MLLMs).
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
New dataset and a unified vision-language model for open-world panoptic visual recognition and understanding
We propose a new large-scale dataset (AS-1B) for open-world panoptic visual recognition and understanding, using an economical semi-automatic data engine that combines the power of off-the-shelf vision/language models and human feedback. Moreover, we develop a unified vision-language foundation model (ASM) for open-world panoptic visual recognition and understanding. Aligning with LLMs, our ASM supports versatile image-text retrieval and generation tasks, demonstrating impressive zero-shot capability.
- Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
- Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
- Fine-Grained Semantically Aligned Vision-Language Pre-Training
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs
- M3S: Scene Graph Driven Multi-Granularity Multi-Task Learning for Multi-Modal NER
- Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling
- Multimodal Relation Extraction with Efficient Graph Alignment
- EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion
- INSTRUCTLAYOUT: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior
- Compositional 3D Scene Synthesis with Scene Graph Guided Layout-Shape Generation
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion
- Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs
- Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models
Introducing a benchmark based on a scene graph dataset
Specifically, we first provide a systematic definition of relation hallucinations, integrating perspectives from perceptive and cognitive domains. Furthermore, we construct the relation-based corpus utilizing the representative scene graph dataset Visual Genome (VG), from which semantic triplets follow real-world distributions.
- BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations
- Mitigating Hallucination in Visual Language Models with Visual Supervision
- SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation
- Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation
- Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
- Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
- LLM-enhanced Scene Graph Learning for Household Rearrangement
household rearrangement
The household rearrangement task involves spotting misplaced objects in a scene and placing them in appropriate locations.
- Situational Instructions Database: Task Guidance in Dynamic Environments
Situational Instructions Database (SID)
The Situational Instructions Database (SID) is a dataset for dynamic task guidance. It contains situationally-aware instructions for performing a wide range of everyday tasks or completing scenarios in 3D environments. The dataset provides step-by-step instructions for these scenarios, grounded in the context of the situation. This context is defined through a scenario-specific scene graph that captures the objects, their attributes, and their relations in the environment. The dataset is designed to enable research in grounded language learning, instruction following, and situated dialogue.
- RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation
- LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots
- Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
A triplet-matching objective to fine-tune vision-language alignment (VLA) models.
To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format (subject, predicate, object). Grounding is then accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated datasets containing abundant entity relationships.
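A rough sketch of the structural-similarity idea described above, with a hypothetical `encode` function standing in for the VLA model's encoders (the triplets and scoring below are toy values, not the paper's actual pipeline):

```python
import numpy as np

def encode(text):
    """Hypothetical embedding function; a real system would use a
    vision-language alignment model's text/visual encoders."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

# Triplets parsed from the image regions and from the caption.
visual_triplets = [("dog", "chasing", "ball"), ("man", "holding", "leash")]
text_triplets = [("man", "holding", "leash")]

# Structural similarity matrix: cosine similarity between verbalized triplets.
sim = np.array([[encode(" ".join(v)) @ encode(" ".join(t))
                 for t in text_triplets] for v in visual_triplets])

# Propagate triplet-level scores to instances: each visual subject inherits
# its best-matching textual triplet's score.
best_match = sim.max(axis=1)
print(dict(zip([v[0] for v in visual_triplets], best_match.round(3))))
```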
- A Review and Efficient Implementation of Scene Graph Generation Metrics
- Semantic Similarity Score for Measuring Visual Similarity at Semantic Level
Here we provide some toolkits for parsing scene graphs and other useful tools for reference.
- 2nd Workshop on Scene Graphs and Graph Representation Learning
- First ICCV Workshop on Scene Graphs and Graph Representation Learning [paper_list]