Maria Escobar, Juanita Puentes, Cristhian Forigua, Jordi Pont-Tuset, Kevis-Kokitsi Maninis, Pablo Arbeláez. EgoCast: Forecasting Egocentric Human Pose in the Wild. arXiv, 2025.
EgoCast is a novel framework for full-body pose forecasting. We use visual and proprioceptive cues to accurately predict body motion.
Our method leverages proprioceptive and visual streams to estimate 3D human pose. (Top) For forecasting, we feed previous camera poses and 3D full-body pose predictions through a forecasting head to estimate future 3D poses from t+1 to t+n. (Bottom) Since ground-truth 3D full-body poses are not available in real-world scenarios, we implement a current-frame estimation module that integrates camera poses and visual cues to estimate the 3D pose at time t.
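For intuition, below is a minimal sketch of how this two-stage design could be wired at inference time: a current-frame estimation module produces pose estimates from camera poses and visual features, and a forecasting head consumes past camera poses together with those estimates to predict future poses. The module names (CurrentFrameEstimator, ForecastingHead), tensor shapes, joint count, and window lengths are assumptions for illustration and do not mirror the actual EgoCast implementation.

# Hypothetical sketch of the two-stage EgoCast pipeline (illustrative only, not the official code).
import torch
import torch.nn as nn

N_JOINTS = 17          # assumed number of body joints
POSE_DIM = N_JOINTS * 3
CAM_DIM = 7            # assumed camera pose: translation (3) + quaternion (4)
PAST, FUTURE = 20, 10  # hypothetical window lengths

class CurrentFrameEstimator(nn.Module):
    """Estimates the full-body pose at time t from camera poses and visual features."""
    def __init__(self, visual_dim=256):
        super().__init__()
        self.fuse = nn.Linear(CAM_DIM + visual_dim, POSE_DIM)

    def forward(self, cam_pose, visual_feat):
        return self.fuse(torch.cat([cam_pose, visual_feat], dim=-1))

class ForecastingHead(nn.Module):
    """Predicts poses for t+1 .. t+n from past camera poses and past pose estimates."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(PAST * (CAM_DIM + POSE_DIM), FUTURE * POSE_DIM)

    def forward(self, past_cam, past_pose):
        x = torch.cat([past_cam, past_pose], dim=-1).flatten(1)
        return self.proj(x).view(-1, FUTURE, POSE_DIM)

# Toy inference: estimate poses over the past window, then forecast the future window.
estimator, forecaster = CurrentFrameEstimator(), ForecastingHead()
past_cam = torch.randn(1, PAST, CAM_DIM)
visual_feat = torch.randn(1, PAST, 256)
past_pose = estimator(past_cam, visual_feat)    # (1, PAST, POSE_DIM)
future_pose = forecaster(past_cam, past_pose)   # (1, FUTURE, POSE_DIM)
print(future_pose.shape)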
-
Clone the repository.
git clone https://github.com/BCV-Uniandes/EgoCast.git
-
Install general dependencies.
To set up the environment and install the necessary dependencies, run the following commands:
cd EgoCast
conda create -n egocast python=3.11 -y
conda activate egocast
pip install .
-
Download model checkpoint.
We use the EgoVPL model from the official EgoVPL implementation. Please download the checkpoint and place it under
model_zoo/
We utilize EgoExo-4D, a large-scale, multi-modal, multi-view video dataset collected across 13 cities worldwide. This dataset serves as a benchmark for egocentric and exocentric human motion analysis.
For training, our model leverages camera poses and egocentric video data.
-
Data Download
To download the dataset, follow the instructions provided in the EgoExo-4D documentation.
To obtain metadata and body pose annotations, run the following command:
egoexo -o dataset --parts annotations --benchmarks egopose --release v2
To download the downscaled takes (448p resolution) of the egocentric videos, run the following command:
egoexo -o dataset --parts downscaled_takes/448 --release v2
-
Data Preparation
To train our model, the downloaded egocentric video takes must be converted into individual frames. This step extracts frames from the videos and saves them as images for further processing.
python video2image.py
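As a point of reference, frame extraction of this kind typically looks like the sketch below, using OpenCV. This only illustrates the general approach, not the contents of video2image.py; the input path, output directory, and file-naming pattern are assumptions.

# Hypothetical frame-extraction sketch (illustrative only; not video2image.py itself).
import os
import cv2

def extract_frames(video_path, out_dir):
    """Decode a video and save each frame as a numbered JPEG image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream or decoding error
            break
        cv2.imwrite(os.path.join(out_dir, f"{idx:06d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx

# Example usage with assumed paths:
# n = extract_frames("dataset/takes/example_take.mp4", "dataset/frames/example_take")
# print(f"Saved {n} frames")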
The Current-Frame Estimation Module predicts the full-body pose at the current timestamp using camera poses and, optionally, egocentric video. This eliminates the reliance on ground-truth body poses at test time, enabling real-world applicability. We offer two training approaches:
-
IMU-Based Approach (Uses only camera poses)
Train using only IMU (headset pose) data:
python main_train_egocast.py -opt options/train_egocast_imu.json
-
EgoCast Approach (Uses camera poses and egocentric video)
Train using both camera pose and visual data:
python main_train_egocast.py -opt options/train_egocast_video.json
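Both training commands (and the test commands below) take a JSON options file via -opt. The sketch below shows one generic way such a file could be parsed; the option keys shown in the comment are hypothetical, and the actual schema is defined by the files under options/.

# Minimal sketch of loading a JSON options file passed with -opt (keys are hypothetical).
import argparse
import json

def parse_options():
    parser = argparse.ArgumentParser()
    parser.add_argument("-opt", type=str, required=True, help="path to the JSON options file")
    args = parser.parse_args()
    with open(args.opt, "r") as f:
        opt = json.load(f)  # e.g. {"task": "egocast_imu", "train": {"lr": 1e-4}}
    return opt

if __name__ == "__main__":
    opt = parse_options()
    print(json.dumps(opt, indent=2))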
-
IMU-Based Testing (Uses only camera poses)
Run the following command to evaluate the IMU-based model:
python main_test_egocast.py -opt options/test_egocast_imu.json
-
EgoCast Testing (Uses camera poses and egocentric video)
Run the following command to test the model using both IMU data and video:
python main_test_multiprocessing.py -opt options/test_egocast_multiprocessing.json
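Evaluation of full-body pose estimation is commonly reported as a mean per-joint position error. The test scripts compute their own metrics, but the sketch below illustrates the basic idea; the joint count, shapes, and units are assumptions.

# Illustrative mean per-joint position error (MPJPE) computation; not the repo's metric code.
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: arrays of shape (frames, joints, 3) in meters. Returns error in centimeters."""
    per_joint = np.linalg.norm(pred - gt, axis=-1)  # Euclidean distance per joint
    return per_joint.mean() * 100.0

# Toy example with random poses (assumed 17 joints):
pred = np.random.rand(50, 17, 3)
gt = np.random.rand(50, 17, 3)
print(f"MPJPE: {mpjpe(pred, gt):.2f} cm")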
Make sure you are on the forecasting branch before running the following command:
python main_train_egocast.py -opt options/train_egocast_forecasting.json
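Forecasting training consumes windows of past observations paired with future target poses. The sketch below shows one generic way to build such (past, future) pairs from a pose sequence; the window lengths and array shapes are assumptions, not the repository's data pipeline.

# Hypothetical construction of (past, future) windows for forecasting training.
import numpy as np

def make_windows(poses, past=20, future=10):
    """poses: (T, pose_dim) sequence. Returns inputs (N, past, pose_dim) and targets (N, future, pose_dim)."""
    inputs, targets = [], []
    for t in range(past, len(poses) - future + 1):
        inputs.append(poses[t - past:t])
        targets.append(poses[t:t + future])
    return np.stack(inputs), np.stack(targets)

# Toy example: a 200-frame sequence of 51-D poses (assumed 17 joints x 3 coordinates).
poses = np.random.rand(200, 51)
x, y = make_windows(poses)
print(x.shape, y.shape)  # (171, 20, 51) (171, 10, 51)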
If you find EgoCast useful for your work, please cite:
@inproceedings{escobar2025egocast,
author = {Escobar, Maria and Puentes, Juanita and Forigua, Cristhian and Pont-Tuset, Jordi and Maninis, Kevis-Kokitsi and Arbeláez, Pablo},
title = {EgoCast: Forecasting Egocentric Human Pose in the Wild},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
year = {2025},
}
This project borrows heavily from AvatarPoser; we thank the authors for their contributions to the community.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.