Egocentric-Video-Analysis-and-Understanding

  • An Outlook into the Future of Egocentric Vision [link]

Dataset

2022

  1. (CVPR) Ego4D: Around the World in 3000 Hours of Egocentric Video [link]
  2. (ECCV) Fine-Grained Egocentric Hand-Object Segmentation: Dataset, Model, and Applications [link]

2023

  1. (NeurIPS) EgoTracks: A Long-term Egocentric Visual Object Tracking Dataset [link]
  2. (NeurIPS) EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding [link]
  3. (NeurIPS) Ego4D Goal-Step: Toward Hierarchical Understanding of Procedural Activities [link]
  4. (ICCV) HoloAssist: An Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World [link]
  5. (ICCV) Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception [link]
  6. (CVPR) AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation [link]
  7. MIMIC-IT: Multi-Modal In-Context Instruction Tuning [link]

2024

  1. EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World [link]

Video-Language Pretraining Model

2022

  1. (NeurIPS) Egocentric Video-Language Pretraining [link]

2023

  1. (ICCV) EgoVLPv2: Egocentric Video-Language Pretraining with Fusion in the Backbone [link]
  2. (ICCV) Helping Hands: An Object-Aware Ego-centric Video Recognition Model [link]

Visual Representation (ego-exo)

2018

  1. (CVPR) Actor and Observer: Joint Modeling of First and Third-Person Videos [link]

2021

  1. (CVPR) Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [link]

2023

  1. (NeurIPS) Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [link]
  2. (ACMMM) POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World [link]

Audio-Visual Object Localization

2023

  1. (CVPR) Egocentric Audio-Visual Object Localization [link]

Object Detection

2023

  1. (ICCV) Self-Supervised Object Detection from Egocentric Videos [link]

Action Detection and Recognition

2023

  1. (ICCV) Ego-only: Egocentric Action Detection without Exocentric Transferring [link]

Egocentric Video Captioning

2024

  1. Retrieval-Augmented Egocentric Video Captioning [link]

Efficient Egocentric Video Understanding

2023

  1. (NeurIPS) EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [link]

Procedure Learning

2022

  1. (ECCV) My View is the Best View: Procedure Learning from Egocentric Videos [link]

2023

  1. (ICCV) STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos [link]

Popular Model Architectures

2D

  1. (CVPR 2020) GSM - Gate-Shift Networks for Video Action Recognition [link]
  2. (ICCV 2019) TSM - Temporal Shift Module for Efficient Video Understanding [link]
  3. (ICCV 2019) TBN - EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition [link]

3D

  1. (ICCV 2019) SlowFast - SlowFast Networks for Video Recognition [link]

Transformer

  1. (CVPRW 2022) Ego-Stan - Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation [link]
  2. (NeurIPS 2021) XViT - Space-time Mixing Attention for Video Transformer [link]
  3. (ICML 2021) TimeSformer - Is Space-Time Attention All You Need for Video Understanding? [link]
  4. (ICCV 2021) ViViT - ViViT: A Video Vision Transformer [link]

Auto-Encoder

  1. (CVPR 2022) MAE - Masked Autoencoders are Scalable Vision Learners [link]
  2. (CVPR 2022) VideoMAE - VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [link]

State Space Model

  1. (ECCV 2022) ViS4mer - Long Movie Clip Classification with State-Space Video Models [link]
  2. (CVPR 2023) Selective Structured State-Spaces for Long-Form Video Understanding [link]
  3. VMamba - VMamba: Visual State Space Model [link]
  4. VisionMamba - Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [link]
  5. VideoMamba - VideoMamba: State Space Model for Efficient Video Understanding [link]

Benchmarks

  1. Keystep (Ego-exo)

    • Fine-grained Keystep Recognition [link]

      [Train]
      Input: 1 ego + N exo trimmed video clips
      Output: Keystep label
      
      [Inference]
      Input: A trimmed egocentric video clip
      Output: Predicted keystep label
      
    • Task Graph [link]

      Determine how each task should be performed (using keysteps) based on a given video segment. Given a video segment $s_i$ and its segment history $S_{:i-1} = \{s_1, \ldots, s_{i-1}\}$, models have to (a toy sketch of these queries appears at the end of this section):

      • Determine the list of previous keysteps to be performed before $s_i$;
      • Infer if $s_i$ is an optional keystep, i.e., the procedure can be completed even if this keystep is skipped;
      • Infer if $s_i$ is a procedural mistake, i.e., a mistake due to incorrect keystep ordering;
      • Predict a list of missing keysteps. These are keysteps which should have been performed before $s_i$ but have not been performed;
      • Forecast next keysteps. These are keysteps for which dependencies are satisfied after the execution of $s_i$ and hence can be executed next.
    • Energy Efficient [link]

      [Input]
      - Egocentric video of arbitrary length T comprising a stream of K different sensory modalities (e.g., RGB images, audio, etc.)
      - Energy budget
      
      [Output]
      - Per-frame keystep label (the prediction happens at 5 fps)
      - Estimated inference energy consumption
      
  2. Ego-exo relations

    • Correspondence [link]

      [Input]
      - Time-synchronized Egocentric + Exocentric video clips
      - Object segmentation track in Egocentric or Exocentric view
      
      [Output]
      Segmentation masks of the object in the other view, for the frames in which the object is visible in both views.
      
    • Translation [link]

  3. Proficiency Estimation (Ego-exo) [link]

    • Demonstrator proficiency estimation: the goal is to estimate the absolute skill level of a participant at the task.
      [Input]
          - Egocentric video clip
          - [Optional] Time-synchronized exocentric videos
      
      [Output]
          Proficiency label: Novice, Early Expert, Intermediate Expert, Late Expert
      
    • Demonstration proficiency estimation: the goal is to perform fine-grained analysis of a given task execution to identify good actions from the participant and suggest tips for improvement.
      [Input]
          - Egocentric video clip
          - [Optional] Time-synchronized exocentric videos
      
      [Output]
          - Temporal localization of a proficiency category: list of tuples, each containing a timestamp, a proficiency category (i.e., good execution or needs improvement), and its probability
      
  4. Goal-Step (Ego) [link]

    • Annotation schema (an illustrative example record is sketched at the end of this section)
      - video_uid: Unique video ID
      - start_time: A timestamp where a goal segment starts (in seconds)
      - end_time: A timestamp where a goal segment ends (in seconds)
      - goal_category: Goal category name
      - goal_description: Natural language description of the goal
      - goal_wikihow_url: A wikiHow URL that best captures the steps captured in the video
      - summary: A list of natural language descriptions summarizing steps captured in the video
      - is_procedural: Binary flag indicating whether the current segment contains procedural steps
      - segments: A list of step segments, each with:
        - start_time: A timestamp where a step segment starts (in seconds)
        - end_time: A timestamp where a step segment ends (in seconds)
        - step_category: Step category name (shares the same taxonomy with substep categories)
        - step_description: Natural language description of the step
        - is_continued: Binary flag indicating whether the current segment contains a step that is continued from an earlier segment
        - is_procedural: Binary flag indicating whether the current segment contains procedural steps
        - is_relevant: A flag indicating whether the current segment is essential, optional, or irrelevant to the (parent) goal segment
        - summary: A list of natural language descriptions summarizing substeps captured in the video
        - segments: A list of substep segments
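
To make the Goal-Step annotation nesting concrete, here is a minimal illustrative record in Python that follows the field names listed above. Every value (video UID, timestamps, categories, descriptions) is a made-up placeholder for illustration only, not real Ego4D Goal-Step data.

```python
# Illustrative Goal-Step annotation record (placeholder values, not real data).
goal_annotation = {
    "video_uid": "0000-example-uid",
    "start_time": 12.0,                      # goal segment start (seconds)
    "end_time": 840.5,                       # goal segment end (seconds)
    "goal_category": "cooking:make_omelet",  # placeholder goal category
    "goal_description": "Make an omelet",
    "goal_wikihow_url": "https://www.wikihow.com/...",
    "summary": ["Crack eggs", "Whisk", "Cook in pan"],
    "is_procedural": True,
    "segments": [                            # step segments
        {
            "start_time": 20.0,
            "end_time": 95.0,
            "step_category": "crack_eggs",   # same taxonomy as substep categories
            "step_description": "Crack eggs into a bowl",
            "is_continued": False,
            "is_procedural": True,
            "is_relevant": "essential",      # essential / optional / irrelevant
            "summary": ["Crack two eggs", "Discard shells"],
            "segments": [                    # substep segments
                {
                    "start_time": 20.0,
                    "end_time": 40.0,
                    "step_category": "crack_eggs",
                    "step_description": "Crack the first egg",
                    "is_continued": False,
                    "is_procedural": True,
                    "is_relevant": "essential",
                },
            ],
        },
    ],
}
```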
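Returning to the Task Graph benchmark (item 1 above): its queries amount to reasoning over keystep dependencies. The sketch below assumes the task graph is given as a plain dict mapping each keystep to its set of prerequisite keysteps; this is a toy illustration of the prerequisite / procedural-mistake / missing / next-keystep queries, not the official Ego-Exo4D data format or evaluation code, and it does not model optional keysteps.

```python
# Minimal sketch, assuming a task graph of the form {keystep: set of prerequisite keysteps}.
# Illustrative only; not the official Task Graph benchmark code.

def previous_keysteps(graph, keystep):
    """All keysteps that must be performed (directly or transitively) before `keystep`."""
    required, stack = set(), list(graph.get(keystep, ()))
    while stack:
        k = stack.pop()
        if k not in required:
            required.add(k)
            stack.extend(graph.get(k, ()))
    return required

def missing_keysteps(graph, keystep, history):
    """Prerequisites of `keystep` that do not appear in the observed segment history."""
    return previous_keysteps(graph, keystep) - set(history)

def is_procedural_mistake(graph, keystep, history):
    """A keystep executed before all of its prerequisites is an ordering mistake."""
    return bool(missing_keysteps(graph, keystep, history))

def forecast_next(graph, history):
    """Keysteps whose prerequisites are all satisfied by the history so far."""
    done = set(history)
    return [k for k, prereqs in graph.items() if k not in done and set(prereqs) <= done]

# Toy task graph: mix batter -> pour batter -> bake.
toy_graph = {"mix": set(), "pour": {"mix"}, "bake": {"pour"}}
print(missing_keysteps(toy_graph, "bake", ["mix"]))  # {'pour'}
print(is_procedural_mistake(toy_graph, "bake", ["mix"]))  # True
print(forecast_next(toy_graph, ["mix"]))  # ['pour']
```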