This repository provides a preprocessing pipeline for monocular videos containing human motion in static scenes.
Given an input video, our pipeline estimates camera poses, reconstructs human poses in world coordinates, and extracts monocular geometric cues (depth and surface normals). The processed data can then be used by HSR to create human-scene reconstructions.
This preprocessing pipeline is maintained as a standalone repository to facilitate its use in other applications beyond HSR.
The pipeline consists of the following sequential steps:
Extract and select sharp frames from a video or an image sequence
Generate human masks
Estimate camera poses
Generate monocular depth and normal maps
Estimate human poses in the camera coordinate frame
Extract human 2D keypoints
Refine human poses with 2D keypoints and temporal smoothness
Align human poses in the world coordinate frame and scale scene to metric units using human body scale
Save processed data in HSR-compatible format
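The metric-scaling step rests on a simple observation: a monocular reconstruction is only defined up to scale, while a human body model has a known metric size. The sketch below illustrates that idea only and is not the implementation in this repository. It assumes you have the same human joints expressed twice, once in the up-to-scale scene frame and once in meters; the function names `estimate_metric_scale` and `rescale_scene` are hypothetical.

```python
import numpy as np


def estimate_metric_scale(joints_scene, joints_metric):
    """Estimate the factor that maps scene units to meters.

    joints_scene:  (F, J, 3) human joints in the up-to-scale scene frame.
    joints_metric: (F, J, 3) the same joints from a metric body model.
    Both inputs are assumptions made for this illustration.
    """
    joints_scene = np.asarray(joints_scene, dtype=np.float64)
    joints_metric = np.asarray(joints_metric, dtype=np.float64)

    def spread(joints):
        # RMS distance of the joints to their per-frame centroid.
        # This is invariant to rotation and translation, so the two
        # joint sets do not need to be aligned beforehand.
        centered = joints - joints.mean(axis=1, keepdims=True)
        return np.sqrt((centered ** 2).sum(axis=-1).mean(axis=-1))

    ratios = spread(joints_metric) / spread(joints_scene)
    # The median over frames is less sensitive to bad pose estimates.
    return float(np.median(ratios))


def rescale_scene(points, camera_centers, scale):
    """Apply the metric scale to scene points and camera centers."""
    return np.asarray(points) * scale, np.asarray(camera_centers) * scale
```

Using a per-frame ratio with a median, rather than a single global fit, keeps a few poorly estimated frames from skewing the scale.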
Clone the repository and its submodules:
git clone --recursive
Set up the environment for Grounded-SAM2 and most of the code in this repository:
conda create -n hsr-data python=3.10
conda activate hsr-data
# SAM2.1 requires torch >=2.5.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url
# For Grounded-SAM2
cd third_party/Grounded-SAM-2
cd checkpoints
cd ../
cd gdino_checkpoints
cd ../
export CUDA_HOME="/usr/local/cuda-12.1"
pip install -e .
pip install --no-build-isolation -e grounding_dino
pip install opencv-python supervision transformers addict yapf pycocotools timm
# For hloc
cd ../../
cd third_party/Hierarchical-Localization
git submodule update --init --recursive
pip install -e .
pip install pyquaternion scipy
pip install cython
pip install simple-romp
pip install --no-index --no-cache-dir pytorch3d -f
pip install smplx open3d
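Because SAM2.1 requires torch >= 2.5.1, it can help to confirm the installed version before running the mask step. The following is a small optional sketch using only the standard library; the helper names `parse_version`, `installed_version`, and `meets_minimum` are hypothetical and not part of this repository.

```python
import re
from importlib import metadata


def parse_version(version):
    """Turn a string such as '2.5.1+cu121' into the tuple (2, 5, 1)."""
    core = version.split("+")[0]  # drop local build tags like '+cu121'
    numbers = [int(n) for n in re.findall(r"\d+", core)[:3]]
    while len(numbers) < 3:  # pad '2.5' to (2, 5, 0)
        numbers.append(0)
    return tuple(numbers)


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(package, minimum):
    """Check whether `package` is installed at `minimum` or newer."""
    version = installed_version(package)
    return version is not None and parse_version(version) >= parse_version(minimum)


if __name__ == "__main__":
    print("torch >= 2.5.1:", meets_minimum("torch", "2.5.1"))
```

This compares only the first three numeric components, which is enough for a quick sanity check but is not a full version-specifier parser.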
Create a separate environment for Metric3Dv2 following the official instructions.
Build the OpenPose Python package following the official guide.
Update the Python and model paths below to match your local installation:
SAM2_PYTHON_PATH = "/home/lixin/miniconda3/envs/sam21/bin/python"
METRIC3D_PYTHON_PATH = "/home/lixin/miniconda3/envs/metric3d/bin/python"
OPENPOSE_PYTHON_PATH = "/usr/bin/python3"
OPENPOSE_MODEL_PATH = "/home/lixin/softwares/openpose/models/"
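Several tools live in their own environments, so a step that needs one of them presumably has to be launched with that environment's interpreter. A minimal sketch of this pattern, assuming each tool is run as a subprocess; the helper name `run_in_env` and the script name in the usage note are hypothetical:

```python
import subprocess
import sys


def run_in_env(python_path, script_args, cwd=None):
    """Run a script with a specific Python interpreter and return its stdout.

    python_path: interpreter of the target environment,
                 e.g. the value of METRIC3D_PYTHON_PATH.
    script_args: script path followed by its command-line arguments.
    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    result = subprocess.run(
        [python_path, *script_args],
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in demo that uses the current interpreter instead of a conda env.
    print(run_in_env(sys.executable, ["-c", "print('hello from a subprocess')"]))
```

In practice a call might look like `run_in_env(METRIC3D_PYTHON_PATH, ["some_step.py", "--data_dir", "data/my_video"])`, where `some_step.py` is a placeholder rather than an actual file in this repository.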
Download the SMPL models (version 1.1.0 for Python 2.7, female and male) and place them under checkpoints/smpl:
mkdir -p checkpoints/smpl
mv /path_to_smpl_models/basicmodel_f_lbs_10_207_0_v1.1.0.pkl checkpoints/smpl/SMPL_FEMALE.pkl
mv /path_to_smpl_models/basicmodel_m_lbs_10_207_0_v1.1.0.pkl checkpoints/smpl/SMPL_MALE.pkl
Prepare the SMPL model files needed by ROMP according to the official instructions and place them under checkpoints/romp:
mkdir -p checkpoints/romp
mv /path_to_romp_models/SMPL_MALE.pth checkpoints/romp/SMPL_MALE.pth
mv /path_to_romp_models/SMPL_FEMALE.pth checkpoints/romp/SMPL_FEMALE.pth
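Before running the pipeline, it can be useful to confirm that the four model files ended up where the commands above put them. This is an optional sketch; the function name `missing_checkpoints` is hypothetical, and the file list simply mirrors the `mv` targets shown above.

```python
from pathlib import Path

# Relative paths taken from the `mv` commands in this section.
EXPECTED_CHECKPOINTS = [
    "smpl/SMPL_FEMALE.pkl",
    "smpl/SMPL_MALE.pkl",
    "romp/SMPL_MALE.pth",
    "romp/SMPL_FEMALE.pth",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected checkpoint files that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print("  checkpoints/" + rel)
    else:
        print("All expected checkpoint files are in place.")
```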
We provide a Python script and a shell script as examples of how to process the data.
# First modify the arguments in the script to fit your data
# Run each step with indices, e.g. 0 1 2 (modify indices as needed)
bash 0 1 2
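To illustrate how index-based step selection can work, here is a minimal sketch. It assumes the indices are zero-based and follow the order of the step list at the top of this README; that mapping, the step names, and the function `resolve_steps` are assumptions made for illustration, not the behavior of the provided scripts.

```python
# Hypothetical step names, ordered like the step list in this README.
STEPS = [
    "select_frames",
    "generate_human_masks",
    "estimate_camera_poses",
    "generate_depth_and_normals",
    "estimate_human_poses",
    "extract_2d_keypoints",
    "refine_human_poses",
    "align_and_scale",
    "save_processed_data",
]


def resolve_steps(indices):
    """Map requested indices to step names, in pipeline order.

    Accepts strings or ints (command-line arguments arrive as strings),
    removes duplicates, and rejects indices outside the valid range.
    """
    unique = sorted({int(i) for i in indices})
    invalid = [i for i in unique if not 0 <= i < len(STEPS)]
    if invalid:
        raise ValueError(f"Unknown step indices: {invalid}")
    return [STEPS[i] for i in unique]


if __name__ == "__main__":
    import sys

    for name in resolve_steps(sys.argv[1:]):
        print("would run:", name)
```

Sorting the indices keeps the steps in pipeline order even if they are passed out of order, which matters because later steps consume the output of earlier ones.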
You can also run each step separately by uncommenting the corresponding command in
Each script contains detailed documentation of its functionality. For example, the frame selection script documents:
Frame Selection Utility for Videos and Image Sequences
--input_path: path to the input video file or a directory of images
--data_dir: output directory for the processed data
--window_size: number of frames to consider in each selection window (default: 10)
--frame_start: starting frame number to process (default: 0)
--frame_end: ending frame number (inclusive) to process (default: 1000000)
--image_resize_factor: factor by which to reduce image size (1, 2, 4, or 8)
Output Structure:
├── images/
│ ├── all_frames/ # Contains all processed frames
│ ├── selected_frames/ # Contains selected sharp frames
│ └── selected_idxs.npy # Numpy array of selected frame indices
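The windowed selection described above can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: it assumes sharpness is scored by the variance of a Laplacian filter (a common blur measure; the metric used by the script is not stated here) and that one frame is kept per window. The function names `sharpness` and `select_sharp_frames` are hypothetical.

```python
import numpy as np


def sharpness(gray):
    """Variance of a 4-neighbor Laplacian; higher values mean sharper images."""
    gray = np.asarray(gray, dtype=np.float64)
    laplacian = (
        gray[:-2, 1:-1] + gray[2:, 1:-1]
        + gray[1:-1, :-2] + gray[1:-1, 2:]
        - 4.0 * gray[1:-1, 1:-1]
    )
    return float(laplacian.var())


def select_sharp_frames(frames, window_size=10, frame_start=0, frame_end=1000000):
    """Pick the sharpest frame in each consecutive window.

    frames: sequence of 2D grayscale arrays.
    frame_end is inclusive, mirroring the --frame_end argument.
    Returns the selected frame indices as an int64 array.
    """
    last = min(frame_end, len(frames) - 1)
    selected = []
    for start in range(frame_start, last + 1, window_size):
        stop = min(start + window_size, last + 1)
        scores = [sharpness(frames[i]) for i in range(start, stop)]
        selected.append(start + int(np.argmax(scores)))
    return np.array(selected, dtype=np.int64)
```

The returned array has the same role as `selected_idxs.npy` in the output structure and could be written with `np.save`.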
This work builds upon several excellent open-source projects. We would like to thank the authors of Vid2Avatar, NeuMAN, hloc, colmap, Metric3D, Grounded-SAM2, openpose, and ROMP.
If you find this work useful for your research, please consider citing our paper:
author={Xue, Lixin and Guo, Chen and Zheng, Chengwei and Wang, Fangjinhua and Jiang, Tianjian and Ho, Hsuan-I and Kaufmann, Manuel and Song, Jie and Hilliges, Otmar},
title={{HSR:} Holistic 3D Human-Scene Reconstruction from Monocular Videos},
booktitle={European Conference on Computer Vision (ECCV)},