
MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

Ruijie Zhu1,2, Jiahao Lu3, Wenbo Hu2, Xiaoguang Han4,
Jianfei Cai5, Ying Shan2, Chuanxia Zheng1
1 NTU 2 ARC Lab, Tencent PCG 3 HKUST 4 CUHK(SZ) 5 Monash University

We introduce MotionCrafter, the first video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion. Given a monocular video as input, MotionCrafter simultaneously predicts a dense point map and scene flow for each frame within a shared world coordinate system, without requiring any post-hoc optimization.

If you find MotionCrafter useful, please give this repo a ⭐; stars help open-source projects reach more people. Thanks!

🚀 Quick Start

🛠️ Installation

  1. Clone this repo:
git clone https://github.com/TencentARC/MotionCrafter
  2. Install dependencies (please refer to requirements.txt):
pip install -r requirements.txt
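
After installing, a quick sanity check can save a failed first run. The sketch below assumes requirements.txt installs a CUDA-enabled PyTorch build, which is typical for video diffusion models; adjust if your setup differs:

import torch  # installed via requirements.txt (assumption)

# Print the installed version and whether a CUDA device is visible.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())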

🔥 Inference

Run inference with our default model:

python run.py \
  --video_path examples/video.mp4 \
  --save_folder examples_output

Run inference with your own model (--model_type selects the deterministic variant, determ, or the diffusion variant, diff):

python run.py \
  --video_path examples/video.mp4 \
  --save_folder examples_output \
  --cache_dir workspace/pretrained_models \
  --unet_path path/to/your/unet \
  --vae_path path/to/your/vae \
  --model_type determ \
  --height 320 --width 640 \
  --adjust_resolution True \
  --num_frames 25

Visualization

Visualize the predicted point maps and scene flow with Viser:

python visualize/visualize.py \
  --video_path examples/video.mp4 \
  --data_path examples_output/video.npz
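
The inference script saves its predictions as a .npz archive (examples_output/video.npz above). The exact key names inside the archive are not documented here, so the sketch below simply enumerates whatever arrays were saved:

import numpy as np

# Open the archive and list every stored array with its shape and dtype.
data = np.load("examples_output/video.npz")
for key in data.files:
    print(key, data[key].shape, data[key].dtype)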

🌟 Training Your Own Model

Dataset Preparation

To train MotionCrafter, download the training datasets following DATASET.md.

Alternatively, prepare your own data in the following layout:

DATASET_NAME
├── SCENE_NAME_1
│   ├── xxxx.hdf5
│   ├── xxxx.mp4
├── SCENE_NAME_2
│   ├── xxxx.hdf5
│   ├── xxxx.mp4
└── meta_infos.txt

xxxx.mp4 is the processed video and xxxx.hdf5 contains the processed annotations, including the following datasets (a writing sketch follows the list):

point_map: T x H x W x 3, camera-centric xyz coordinates.
camera_pose: T x 4 x 4, camera extrinsics.
valid_mask: T x H x W, valid mask for point map.
scene_flow (optional): T x H x W x 3, camera-centric displacements (dx, dy, dz).
deform_mask (optional): T x H x W, valid mask for scene flow.
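
To illustrate the expected format, here is a minimal h5py sketch that writes one scene's .hdf5 with the dataset names and shapes listed above. The array contents are placeholders, and the camera-to-world interpretation of camera_pose in the last step is an assumption; verify it against your data:

import h5py
import numpy as np

T, H, W = 25, 320, 640  # placeholder clip size (matches the inference defaults)
scene = "DATASET_NAME/SCENE_NAME_1/xxxx.hdf5"

with h5py.File(scene, "w") as f:
    # Required annotations.
    f.create_dataset("point_map", data=np.zeros((T, H, W, 3), np.float32))
    f.create_dataset("camera_pose", data=np.tile(np.eye(4, dtype=np.float32), (T, 1, 1)))
    f.create_dataset("valid_mask", data=np.ones((T, H, W), bool))
    # Optional motion annotations.
    f.create_dataset("scene_flow", data=np.zeros((T, H, W, 3), np.float32))
    f.create_dataset("deform_mask", data=np.ones((T, H, W), bool))

# Camera-centric points can be lifted to the shared world frame via camera_pose
# (assuming camera_pose is camera-to-world; invert it if yours is world-to-camera).
with h5py.File(scene, "r") as f:
    pts = f["point_map"][:]      # T x H x W x 3
    pose = f["camera_pose"][:]   # T x 4 x 4
world = np.einsum("tij,thwj->thwi", pose[:, :3, :3], pts) + pose[:, None, None, :3, 3]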

Model Training

First, train the Geometry VAE:

bash scripts/launch.sh configs/vae_train/geometry_vae_train.gin

Then, train the Unified 4D VAE on top of the pretrained Geometry VAE:

bash scripts/launch.sh configs/vae_train/unify_4d_vae_train.gin

Finally, train the Diffusion UNet using the pretrained Unified 4D VAE:

# Deterministic Version
bash scripts/launch.sh configs/unet_train/unet_determ_unify_vae_train.gin
# Diffusion Version
bash scripts/launch.sh configs/unet_train/unet_diffusion_unify_vae_train.gin
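
To run the full pipeline unattended, a minimal Python sketch that chains the three stages is shown below; it assumes each stage's gin config already points at the checkpoints produced by the previous stage:

import subprocess

# The three training stages, in order; swap the last config for
# configs/unet_train/unet_diffusion_unify_vae_train.gin to train the diffusion variant.
STAGES = [
    "configs/vae_train/geometry_vae_train.gin",
    "configs/vae_train/unify_4d_vae_train.gin",
    "configs/unet_train/unet_determ_unify_vae_train.gin",
]

for cfg in STAGES:
    # check=True aborts the pipeline if a stage exits with an error.
    subprocess.run(["bash", "scripts/launch.sh", cfg], check=True)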

📜 Citation

If you find our work useful, please cite:

@article{zhu2025motioncrafter,
  title={MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE},
  author={Zhu, Ruijie and Lu, Jiahao and Hu, Wenbo and Han, Xiaoguang and Cai, Jianfei and Shan, Ying and Zheng, Chuanxia},
  journal={arXiv preprint arXiv:2602.08961},
  year={2026}
}

🤝 Acknowledgements

Our code is based on GeometryCrafter. We thank Tianxing for providing the excellent codebase!