- An Outlook into the Future of Egocentric Vision [link]
- (NeurIPS) EgoTracks: A Long-term Egocentric Visual Object Tracking Dataset [link]
- (NeurIPS) EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding [link]
- (NeurIPS) Ego4D Goal-Step: Toward Hierarchical Understanding of Procedural Activities [link]
- (ICCV) HoloAssist: An Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World [link]
- (ICCV) Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception [link]
- (CVPR) AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation [link]
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning [link]
- EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World [link]
- (NeurIPS) Egocentric Video-Language Pretraining [link]
- (CVPR) Actor and Observer: Joint Modeling of First and Third-Person Videos [link]
- (CVPR) Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [link]
- (NeurIPS) Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [link]
- (ACMMM) POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World [link]
- (CVPR) Egocentric Audio-Visual Object Localization [link]
- (ICCV) Self-Supervised Object Detection from Egocentric Videos [link]
- (ICCV) Ego-only: Egocentric Action Detection without Exocentric Transferring [link]
- Retrieval-Augmented Egocentric Video Captioning [link]
- (NeurIPS) EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [link]
- (ECCV) My View is the Best View: Procedure Learning from Egocentric Videos [link]
- (ICCV) STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos [link]
- (CVPR 2020) GSM - Gate-Shift Networks for Video Action Recognition [link]
- (ICCV 2019) TSM - Temporal Shift Module for Efficient Video Understanding [link]
- (ICCV 2019) TBN - EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition [link]
- (ICCV 2019) SlowFast - SlowFast Networks for Video Recognition [link]
- (CVPRW 2022) Ego-Stan - Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation [link]
- (NeurIPS 2021) XViT - Space-time Mixing Attention for Video Transformer [link]
- (ICML 2021) TimeSformer - Is Space-Time Attention All You Need for Video Understanding? [link]
- (ICCV 2021) ViViT - ViViT: A Video Vision Transformer [link]
- (CVPR 2022) MAE - Masked Autoencoders are Scalable Vision Learners [link]
- (CVPR 2022) VideoMAE - VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [link]
- (ECCV 2022) ViS4mer - Long Movie Clip Classification with State-Space Video Models [link]
- (CVPR 2023) Selective Structured State-Spaces for Long-Form Video Understanding [link]
- VMamba - VMamba: Visual State Space Model [link]
- VisionMamba - Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [link]
- VideoMamba - VideoMamba: State Space Model for Efficient Video Understanding [link]
- Keystep (Ego-exo)
  - Fine-grained Keystep Recognition [link]
    - [Train] Input: 1 ego + N exo trimmed video clips. Output: keystep label.
    - [Inference] Input: a trimmed egocentric video clip. Output: predicted keystep label.
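A minimal sketch of the train/inference asymmetry above, assuming PyTorch: exo views are fused into the ego feature only when they are available. The backbone, fusion rule, feature dimension, and taxonomy size are placeholders, not the benchmark's official model.

```python
import torch
import torch.nn as nn

class KeystepRecognizer(nn.Module):
    """Illustrative classifier: trained with 1 ego + N exo clips,
    but able to predict from the ego clip alone at inference time."""

    def __init__(self, num_keysteps, feat_dim=256):
        super().__init__()
        # Stand-in for a real video backbone (e.g., a ViT or SlowFast encoder).
        self.encoder = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),   # (B, C, T, H, W) -> (B, C, 1, 1, 1)
            nn.Flatten(1),
            nn.Linear(3, feat_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_keysteps)

    def forward(self, ego_clip, exo_clips=None):
        feat = self.encoder(ego_clip)                         # (B, feat_dim)
        if exo_clips:                                         # extra views, train time only
            exo = torch.stack([self.encoder(c) for c in exo_clips]).mean(dim=0)
            feat = feat + exo                                 # naive average fusion (illustrative)
        return self.classifier(feat)                          # keystep logits

model = KeystepRecognizer(num_keysteps=20)                    # taxonomy size chosen arbitrarily
ego = torch.randn(2, 3, 8, 224, 224)                          # (B, C, T, H, W) dummy ego clip
exo = [torch.randn(2, 3, 8, 224, 224) for _ in range(4)]      # N = 4 dummy exo clips
train_logits = model(ego, exo)                                # [Train] ego + N exo -> label
test_logits = model(ego)                                      # [Inference] ego only -> label
```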
  - Task Graph [link]
    - Determine how each task should be performed (using keysteps) based on a given video segment. Given a video segment $s_i$ and its segment history $S_{:i-1} = \{s_1, \ldots, s_{i-1}\}$, models have to:
      - Determine the list of previous keysteps to be performed before $s_i$;
      - Infer if $s_i$ is an optional keystep, i.e., the procedure can be completed even when skipping this keystep;
      - Infer if $s_i$ is a procedural mistake, i.e., a mistake due to incorrect keystep ordering;
      - Predict a list of missing keysteps, i.e., keysteps which should have been performed before $s_i$ but have not been performed;
      - Forecast the next keysteps, i.e., keysteps whose dependencies are satisfied after the execution of $s_i$ and which can hence be executed next.
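The five queries above can all be read off a dependency graph over keysteps. The toy sketch below (plain Python, made-up keystep names, hand-written graph) only mirrors the task definition; it is not the benchmark's annotations or evaluation code.

```python
# Toy task graph: keystep -> set of keysteps that must precede it (pre-conditions).
DEPENDS_ON = {
    "boil water":   set(),
    "grind beans":  set(),
    "rinse filter": set(),
    "pour water":   {"boil water", "grind beans"},
    "serve coffee": {"pour water"},
}
OPTIONAL = {"rinse filter"}                       # keysteps the procedure can skip

def previous_keysteps(step):
    """All keysteps that must be performed before `step` (transitive closure)."""
    prev = set(DEPENDS_ON[step])
    for p in DEPENDS_ON[step]:
        prev |= previous_keysteps(p)
    return prev

def is_procedural_mistake(step, history):
    """A mistake due to incorrect ordering: some pre-condition is not in the history yet."""
    return not previous_keysteps(step) <= set(history)

def missing_keysteps(step, history):
    """Keysteps that should have been performed before `step` but were not."""
    return previous_keysteps(step) - set(history)

def next_keysteps(history):
    """Keysteps whose dependencies are satisfied after the observed history."""
    done = set(history)
    return [s for s in DEPENDS_ON if s not in done and DEPENDS_ON[s] <= done]

history = ["boil water", "grind beans"]           # segment history S_{:i-1}
current = "pour water"                            # current segment s_i
print(previous_keysteps(current))                 # {'boil water', 'grind beans'}
print(current in OPTIONAL)                        # False
print(is_procedural_mistake(current, history))    # False
print(missing_keysteps("serve coffee", history))  # {'pour water'}
print(next_keysteps(history + [current]))         # ['rinse filter', 'serve coffee']
```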
  - Energy Efficient [link]
    - [Input] Egocentric video of arbitrary length T comprising a stream of K different sensory modalities (e.g., RGB images, audio, etc.); an energy budget.
    - [Output] Per-frame keystep label (the prediction happens at 5 fps); estimated inference energy consumption.
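A toy sketch of this input/output contract: a streaming loop that greedily decides which modality encoders to run on each frame so that the cumulative estimated energy stays within the budget, and returns per-frame labels plus the total estimated consumption. The per-modality costs, selection heuristic, and placeholder classifier are assumptions made purely for illustration.

```python
# Hypothetical per-frame energy costs (joules) of running each modality encoder.
ENERGY_COST = {"rgb": 0.50, "audio": 0.05}

def predict_keystep(modalities, frame):
    """Placeholder classifier; a real system would run the selected encoders on `frame`."""
    return "stir mixture" if "rgb" in modalities else "background"

def run_stream(frames, energy_budget):
    """Return a keystep label per frame (stream at 5 fps) and the estimated energy used."""
    labels, spent = [], 0.0
    for i, frame in enumerate(frames):
        allowance = (energy_budget - spent) / (len(frames) - i)  # budget left per remaining frame
        # Greedy heuristic: add modalities cheapest-first while the allowance permits.
        chosen, cost = [], 0.0
        for m in sorted(ENERGY_COST, key=ENERGY_COST.get):
            if cost + ENERGY_COST[m] <= allowance:
                chosen.append(m)
                cost += ENERGY_COST[m]
        chosen = chosen or ["audio"]                             # always emit a prediction
        spent += sum(ENERGY_COST[m] for m in chosen)
        labels.append(predict_keystep(chosen, frame))
    return labels, spent

frames = list(range(25))                                         # 5 s of dummy frames at 5 fps
labels, energy = run_stream(frames, energy_budget=5.0)
print(len(labels), round(energy, 2))                             # 25 labels, estimated joules used
```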
- Ego-exo relations
  - Proficiency Estimation (Ego-exo) [link]
    - Demonstrator proficiency estimation: the goal is to estimate the absolute skill level of a participant at the task.
      - [Input] Egocentric video clip; [Optional] exocentric videos synchronized by timestamp.
      - [Output] Proficiency label: Novice, Early Expert, Intermediate Expert, Late Expert.
    - Demonstration proficiency estimation: the goal is to perform a fine-grained analysis of a given task execution to identify good actions from the participant and suggest tips for improvement.
      - [Input] Egocentric video clip; [Optional] exocentric videos synchronized by timestamp.
      - [Output] Temporal localization of a proficiency category: a list of tuples, each containing a timestamp, a proficiency category (i.e., good execution or needs improvement), and its probability.
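The two output formats above map naturally onto small data structures. The sketch below uses hypothetical names to show a clip-level label for demonstrator proficiency and the (timestamp, category, probability) tuples for demonstration proficiency; the example values are made up.

```python
from dataclasses import dataclass
from enum import Enum

class SkillLevel(Enum):
    NOVICE = "Novice"
    EARLY_EXPERT = "Early Expert"
    INTERMEDIATE_EXPERT = "Intermediate Expert"
    LATE_EXPERT = "Late Expert"

@dataclass
class DemonstratorEstimate:
    """Clip-level output: the participant's absolute skill level at the task."""
    video_uid: str
    proficiency: SkillLevel

@dataclass
class ProficiencySegment:
    """One element of the demonstration-proficiency output: a localized judgement."""
    timestamp_s: float     # when the behaviour is observed (seconds)
    category: str          # "good execution" or "needs improvement"
    probability: float     # model confidence for this category

# Example predictions for a single egocentric clip (illustrative values only).
clip_level = DemonstratorEstimate("vid_0001", SkillLevel.EARLY_EXPERT)
fine_grained = [
    ProficiencySegment(12.4, "good execution", 0.91),
    ProficiencySegment(47.0, "needs improvement", 0.63),
]
```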
- Goal-Step (Ego) [link]
  - annotation
    - video_uid: unique video ID
    - start_time: timestamp where a goal segment starts (in seconds)
    - end_time: timestamp where a goal segment ends (in seconds)
    - goal_category: goal category name
    - goal_description: natural language description of the goal
    - goal_wikihow_url: a wikiHow URL that best captures the steps in the video
    - summary: a list of natural language descriptions summarizing the steps captured in the video
    - is_procedural: binary flag indicating whether the current segment contains procedural steps
    - segments: a list of step segments
      - start_time: timestamp where a step segment starts (in seconds)
      - end_time: timestamp where a step segment ends (in seconds)
      - step_category: step category name (shares the same taxonomy with substep categories)
      - step_description: natural language description of the step
      - is_continued: binary flag indicating whether the current segment contains a step continued from an earlier segment
      - is_procedural: binary flag indicating whether the current segment contains procedural steps
      - is_relevant: flag indicating whether the current segment is essential, optional, or irrelevant to the (parent) goal segment
      - summary: a list of natural language descriptions summarizing the substeps captured in the video
      - segments: a list of substep segments
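Assuming the annotations are distributed as JSON records that mirror the field names above, and that substep segments share the step-segment layout, a small loader can walk the goal → step → substep hierarchy. The file name is hypothetical.

```python
import json

def iter_steps(segment, depth=0):
    """Recursively yield (depth, category, start, end) for step and substep segments."""
    for step in segment.get("segments", []):
        yield depth, step["step_category"], step["start_time"], step["end_time"]
        yield from iter_steps(step, depth + 1)   # assumes substeps share the step layout

# Hypothetical file holding a list of goal-segment annotations shaped like the schema above.
with open("goalstep_annotations.json") as f:
    goal_segments = json.load(f)

for goal in goal_segments:
    print(goal["video_uid"], goal["goal_category"], goal["start_time"], goal["end_time"])
    for depth, category, start, end in iter_steps(goal):
        print("  " * (depth + 1), f"{category}: {start:.1f}-{end:.1f}s")
```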