- An Outlook into the Future of Egocentric Vision [link]
- (NeurIPS) EgoTracks: A Long-term Egocentric Visual Object Tracking Dataset [link]
- (NeurIPS) EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding [link]
- (NeurIPS) Ego4D Goal-Step: Toward Hierarchical Understanding of Procedural Activities [link]
- (ICCV) HoloAssist: An Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World [link]
- (ICCV) Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception [link]
- (CVPR) AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation [link]
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning [link]
- EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World [link]
- (NeurIPS) Egocentric Video-Language Pretraining [link]
- (CVPR) Actor and Observer: Joint Modeling of First and Third-Person Videos [link]
- (CVPR) Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [link]
- (NeurIPS) Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [link]
- (ACMMM) POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World [link]
- (CVPR) Egocentric Audio-Visual Object Localization [link]
- (ICCV) Self-Supervised Object Detection from Egocentric Videos [link]
- (ICCV) Ego-only: Egocentric Action Detection without Exocentric Transferring [link]
- Retrieval-Augmented Egocentric Video Captioning [link]
- (NeurIPS) EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [link]
- (ECCV) My View is the Best View: Procedure Learning from Egocentric Videos [link]
- (ICCV) STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos [link]
- (CVPR 2020) GSM - Gate-Shift Networks for Video Action Recognition [link]
- (ICCV 2019) TSM - Temporal Shift Module for Efficient Video Understanding [link]
- (ICCV 2019) TBN - EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition [link]
- (ICCV 2019) SlowFast - SlowFast Networks for Video Recognition [link]
- (CVPRW 2022) Ego-Stan - Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation [link]
- (NeurIPS 2021) XViT - Space-time Mixing Attention for Video Transformer [link]
- (ICML 2021) TimeSformer - Is Space-Time Attention All You Need for Video Understanding? [link]
- (ICCV 2021) ViViT - ViViT: A Video Vision Transformer [link]
- (CVPR 2022) MAE - Masked Autoencoders are Scalable Vision Learners [link]
- (CVPR 2022) VideoMAE - VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [link]
- (ECCV 2022) ViS4mer - Long Movie Clip Classification with State-Space Video Models [link]
- (CVPR 2023) Selective Structured State-Spaces for Long-Form Video Understanding [link]
- VMamba - VMamba: Visual State Space Model [link]
- VisionMamba - Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [link]
- VideoMamba - VideoMamba: State Space Model for Efficient Video Understanding [link]
- Keystep (Ego-exo)
  - Fine-grained Keystep Recognition [link]
    - [Train] Input: 1 ego + N exo trimmed video clips. Output: keystep label.
    - [Inference] Input: a trimmed egocentric video clip. Output: predicted keystep label.
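A minimal sketch of the train/inference asymmetry above, assuming PyTorch: exo views are fused into the ego feature only when they are available. The backbone, fusion rule, feature dimension, and taxonomy size are placeholders, not the benchmark's official model.

```python
import torch
import torch.nn as nn

class KeystepRecognizer(nn.Module):
    """Illustrative classifier: trained with 1 ego + N exo clips,
    but able to predict from the ego clip alone at inference time."""

    def __init__(self, num_keysteps, feat_dim=256):
        super().__init__()
        # Stand-in for a real video backbone (e.g., a ViT or SlowFast encoder).
        self.encoder = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),   # (B, C, T, H, W) -> (B, C, 1, 1, 1)
            nn.Flatten(1),
            nn.Linear(3, feat_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_keysteps)

    def forward(self, ego_clip, exo_clips=None):
        feat = self.encoder(ego_clip)                         # (B, feat_dim)
        if exo_clips:                                         # extra views, train time only
            exo = torch.stack([self.encoder(c) for c in exo_clips]).mean(dim=0)
            feat = feat + exo                                 # naive average fusion (illustrative)
        return self.classifier(feat)                          # keystep logits

model = KeystepRecognizer(num_keysteps=20)                    # taxonomy size chosen arbitrarily
ego = torch.randn(2, 3, 8, 224, 224)                          # (B, C, T, H, W) dummy ego clip
exo = [torch.randn(2, 3, 8, 224, 224) for _ in range(4)]      # N = 4 dummy exo clips
train_logits = model(ego, exo)                                # [Train] ego + N exo -> label
test_logits = model(ego)                                      # [Inference] ego only -> label
```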
  - Task Graph [link]
    - Determine how each task should be performed (using keysteps) based on a given video segment. Given a video segment $s_i$ and its segment history $S_{:i-1} = \{s_1, \ldots, s_{i-1}\}$, models have to:
      - Determine the list of previous keysteps to be performed before $s_i$;
      - Infer if $s_i$ is an optional keystep, i.e., the procedure can be completed even when skipping this keystep;
      - Infer if $s_i$ is a procedural mistake, i.e., a mistake due to incorrect keystep ordering;
      - Predict a list of missing keysteps, i.e., keysteps which should have been performed before $s_i$ but have not been performed;
      - Forecast the next keysteps, i.e., keysteps whose dependencies are satisfied after the execution of $s_i$ and which can hence be executed next.
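The five queries above can all be read off a dependency graph over keysteps. The toy sketch below (plain Python, made-up keystep names, hand-written graph) only mirrors the task definition; it is not the benchmark's annotations or evaluation code.

```python
# Toy task graph: keystep -> set of keysteps that must precede it (pre-conditions).
DEPENDS_ON = {
    "boil water":   set(),
    "grind beans":  set(),
    "rinse filter": set(),
    "pour water":   {"boil water", "grind beans"},
    "serve coffee": {"pour water"},
}
OPTIONAL = {"rinse filter"}                       # keysteps the procedure can skip

def previous_keysteps(step):
    """All keysteps that must be performed before `step` (transitive closure)."""
    prev = set(DEPENDS_ON[step])
    for p in DEPENDS_ON[step]:
        prev |= previous_keysteps(p)
    return prev

def is_procedural_mistake(step, history):
    """A mistake due to incorrect ordering: some pre-condition is not in the history yet."""
    return not previous_keysteps(step) <= set(history)

def missing_keysteps(step, history):
    """Keysteps that should have been performed before `step` but were not."""
    return previous_keysteps(step) - set(history)

def next_keysteps(history):
    """Keysteps whose dependencies are satisfied after the observed history."""
    done = set(history)
    return [s for s in DEPENDS_ON if s not in done and DEPENDS_ON[s] <= done]

history = ["boil water", "grind beans"]           # segment history S_{:i-1}
current = "pour water"                            # current segment s_i
print(previous_keysteps(current))                 # {'boil water', 'grind beans'}
print(current in OPTIONAL)                        # False
print(is_procedural_mistake(current, history))    # False
print(missing_keysteps("serve coffee", history))  # {'pour water'}
print(next_keysteps(history + [current]))         # ['rinse filter', 'serve coffee']
```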
  - Energy Efficient [link]
    - [Input] Egocentric video of arbitrary length T comprising a stream of K different sensory modalities (e.g., RGB images, audio, etc.); an energy budget.
    - [Output] Per-frame keystep label (the prediction happens at 5 fps); estimated inference energy consumption.
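A toy sketch of this input/output contract: a streaming loop that greedily decides which modality encoders to run on each frame so that the cumulative estimated energy stays within the budget, and returns per-frame labels plus the total estimated consumption. The per-modality costs, selection heuristic, and placeholder classifier are assumptions made purely for illustration.

```python
# Hypothetical per-frame energy costs (joules) of running each modality encoder.
ENERGY_COST = {"rgb": 0.50, "audio": 0.05}

def predict_keystep(modalities, frame):
    """Placeholder classifier; a real system would run the selected encoders on `frame`."""
    return "stir mixture" if "rgb" in modalities else "background"

def run_stream(frames, energy_budget):
    """Return a keystep label per frame (stream at 5 fps) and the estimated energy used."""
    labels, spent = [], 0.0
    for i, frame in enumerate(frames):
        allowance = (energy_budget - spent) / (len(frames) - i)  # budget left per remaining frame
        # Greedy heuristic: add modalities cheapest-first while the allowance permits.
        chosen, cost = [], 0.0
        for m in sorted(ENERGY_COST, key=ENERGY_COST.get):
            if cost + ENERGY_COST[m] <= allowance:
                chosen.append(m)
                cost += ENERGY_COST[m]
        chosen = chosen or ["audio"]                             # always emit a prediction
        spent += sum(ENERGY_COST[m] for m in chosen)
        labels.append(predict_keystep(chosen, frame))
    return labels, spent

frames = list(range(25))                                         # 5 s of dummy frames at 5 fps
labels, energy = run_stream(frames, energy_budget=5.0)
print(len(labels), round(energy, 2))                             # 25 labels, estimated joules used
```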
- Ego-exo relations
  - Proficiency Estimation (Ego-exo) [link]
    - Demonstrator proficiency estimation: the goal is to estimate the absolute skill level of a participant at the task.
      - [Input] Egocentric video clip; [Optional] exocentric videos synchronized by timestamp.
      - [Output] Proficiency label: Novice, Early Expert, Intermediate Expert, Late Expert.
    - Demonstration proficiency estimation: the goal is to perform a fine-grained analysis of a given task execution to identify good actions from the participant and suggest tips for improvement.
      - [Input] Egocentric video clip; [Optional] exocentric videos synchronized by timestamp.
      - [Output] Temporal localization of a proficiency category: a list of tuples, each containing a timestamp, a proficiency category (i.e., good execution or needs improvement), and its probability.
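The two output formats above map naturally onto small data structures. The sketch below uses hypothetical names to show a clip-level label for demonstrator proficiency and the (timestamp, category, probability) tuples for demonstration proficiency; the example values are made up.

```python
from dataclasses import dataclass
from enum import Enum

class SkillLevel(Enum):
    NOVICE = "Novice"
    EARLY_EXPERT = "Early Expert"
    INTERMEDIATE_EXPERT = "Intermediate Expert"
    LATE_EXPERT = "Late Expert"

@dataclass
class DemonstratorEstimate:
    """Clip-level output: the participant's absolute skill level at the task."""
    video_uid: str
    proficiency: SkillLevel

@dataclass
class ProficiencySegment:
    """One element of the demonstration-proficiency output: a localized judgement."""
    timestamp_s: float     # when the behaviour is observed (seconds)
    category: str          # "good execution" or "needs improvement"
    probability: float     # model confidence for this category

# Example predictions for a single egocentric clip (illustrative values only).
clip_level = DemonstratorEstimate("vid_0001", SkillLevel.EARLY_EXPERT)
fine_grained = [
    ProficiencySegment(12.4, "good execution", 0.91),
    ProficiencySegment(47.0, "needs improvement", 0.63),
]
```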
- Goal-Step (Ego) [link]
  - annotation
    - video_uid: unique video ID
    - start_time: timestamp where a goal segment starts (in seconds)
    - end_time: timestamp where a goal segment ends (in seconds)
    - goal_category: goal category name
    - goal_description: natural language description of the goal
    - goal_wikihow_url: a wikiHow URL that best captures the steps in the video
    - summary: a list of natural language descriptions summarizing the steps captured in the video
    - is_procedural: binary flag indicating whether the current segment contains procedural steps
    - segments: a list of step segments
      - start_time: timestamp where a step segment starts (in seconds)
      - end_time: timestamp where a step segment ends (in seconds)
      - step_category: step category name (shares the same taxonomy with substep categories)
      - step_description: natural language description of the step
      - is_continued: binary flag indicating whether the current segment contains a step continued from an earlier segment
      - is_procedural: binary flag indicating whether the current segment contains procedural steps
      - is_relevant: flag indicating whether the current segment is essential, optional, or irrelevant to the (parent) goal segment
      - summary: a list of natural language descriptions summarizing the substeps captured in the video
      - segments: a list of substep segments
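Assuming the annotations are distributed as JSON records that mirror the field names above, and that substep segments share the step-segment layout, a small loader can walk the goal → step → substep hierarchy. The file name is hypothetical.

```python
import json

def iter_steps(segment, depth=0):
    """Recursively yield (depth, category, start, end) for step and substep segments."""
    for step in segment.get("segments", []):
        yield depth, step["step_category"], step["start_time"], step["end_time"]
        yield from iter_steps(step, depth + 1)   # assumes substeps share the step layout

# Hypothetical file holding a list of goal-segment annotations shaped like the schema above.
with open("goalstep_annotations.json") as f:
    goal_segments = json.load(f)

for goal in goal_segments:
    print(goal["video_uid"], goal["goal_category"], goal["start_time"], goal["end_time"])
    for depth, category, start, end in iter_steps(goal):
        print("  " * (depth + 1), f"{category}: {start:.1f}-{end:.1f}s")
```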