Ruijie Zhu<sup>1,2</sup>, Jiahao Lu<sup>3</sup>, Wenbo Hu<sup>2</sup>, Xiaoguang Han<sup>4</sup>, Jianfei Cai<sup>5</sup>, Ying Shan<sup>2</sup>, Chuanxia Zheng<sup>1</sup>

<sup>1</sup> NTU&nbsp;&nbsp;<sup>2</sup> ARC Lab, Tencent PCG&nbsp;&nbsp;<sup>3</sup> HKUST&nbsp;&nbsp;<sup>4</sup> CUHK(SZ)&nbsp;&nbsp;<sup>5</sup> Monash University
We introduce MotionCrafter, the first video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion. Given a monocular video as input, MotionCrafter simultaneously predicts a dense point map and scene flow for each frame within a shared world coordinate system, without requiring any post-hoc optimization.

If you find MotionCrafter useful, please give this repo a ⭐; stars really help open-source projects. Thanks!
- Clone this repo:

  ```bash
  git clone https://github.com/TencentARC/MotionCrafter
  ```

- Install dependencies (please refer to requirements.txt):

  ```bash
  pip install -r requirements.txt
  ```

Run the inference code with our default model:
```bash
python run.py \
    --video_path examples/video.mp4 \
    --save_folder examples_output
```

Run the inference code with your own model:
```bash
python run.py \
    --video_path examples/video.mp4 \
    --save_folder examples_output \
    --cache_dir workspace/pretrained_models \
    --unet_path path/to/your/unet \
    --vae_path path/to/your/vae \
    --model_type determ \
    --height 320 --width 640 \
    --adjust_resolution True \
    --num_frames 25
```

`--model_type` selects the deterministic (`determ`) or diffusion (`diff`) variant.

Visualize the predicted point maps & scene flows with Viser:
```bash
python visualize/visualize.py \
    --video_path examples/video.mp4 \
    --data_path examples_output/video.npz
```

To train MotionCrafter, you should download the training datasets following DATASET.md.
Or you can prepare your own data like this:
```
DATASET_NAME
├── SCENE_NAME_1
│   ├── xxxx.hdf5
│   ├── xxxx.mp4
├── SCENE_NAME_2
│   ├── xxxx.hdf5
│   ├── xxxx.mp4
└── mete_infos.txt
```
`xxxx.mp4` is the processed video; `xxxx.hdf5` holds the processed annotations, including:

- `point_map`: T x H x W x 3, camera-centric xyz coordinates.
- `camera_pose`: T x 4 x 4, camera extrinsics.
- `valid_mask`: T x H x W, valid mask for the point map.
- `scene_flow` (optional): T x H x W x 3, camera-centric dx dy dz.
- `deform_mask` (optional): T x H x W, valid mask for the scene flow.
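Since `point_map` and `scene_flow` are stored in camera coordinates, placing them in a shared world frame means applying each frame's `camera_pose`. The sketch below assumes `camera_pose` is a camera-to-world matrix; the annotation spec above does not state the convention, so treat this as an illustration rather than the codebase's exact transform:

```python
import numpy as np

def points_to_world(point_map, camera_pose):
    """Lift camera-centric xyz maps into world coordinates.

    point_map:   (T, H, W, 3) camera-centric xyz (the `point_map` key)
    camera_pose: (T, 4, 4) extrinsics, assumed camera-to-world
    """
    R = camera_pose[:, :3, :3]  # (T, 3, 3) per-frame rotation
    t = camera_pose[:, :3, 3]   # (T, 3) per-frame translation
    # Rotate every pixel's xyz by its frame's R, then add the translation.
    return np.einsum("tij,thwj->thwi", R, point_map) + t[:, None, None, :]

def flows_to_world(scene_flow, camera_pose):
    """Scene-flow vectors (dx dy dz) are directions, so only R applies."""
    R = camera_pose[:, :3, :3]
    return np.einsum("tij,thwj->thwi", R, scene_flow)

# Tiny synthetic sample; identity poses leave the points unchanged.
T, H, W = 2, 2, 2
point_map = np.arange(T * H * W * 3, dtype=np.float32).reshape(T, H, W, 3)
camera_pose = np.tile(np.eye(4, dtype=np.float32), (T, 1, 1))
world = points_to_world(point_map, camera_pose)
```

Vectors (scene flow) and positions (point maps) transform differently: the translation component of the pose applies only to positions, which is why `flows_to_world` uses the rotation alone.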
First, we train the Geometry VAE:

```bash
bash scripts/launch.sh configs/vae_train/geometry_vae_train.gin
```

Then, we combine the pretrained Geometry VAE to train the Unified 4D VAE:

```bash
bash scripts/launch.sh configs/vae_train/unify_4d_vae_train.gin
```

Finally, we train the Diffusion UNet on top of the pretrained Unified 4D VAE:

```bash
# Deterministic version
bash scripts/launch.sh configs/unet_train/unet_determ_unify_vae_train.gin
# Diffusion version
bash scripts/launch.sh configs/unet_train/unet_diffusion_unify_vae_train.gin
```

If you find our work useful, please cite:
```bibtex
@article{zhu2025motioncrafter,
  title={MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE},
  author={Zhu, Ruijie and Lu, Jiahao and Hu, Wenbo and Han, Xiaoguang and Cai, Jianfei and Shan, Ying and Zheng, Chuanxia},
  journal={arXiv preprint arXiv:2602.08961},
  year={2026}
}
```

Our code is based on GeometryCrafter. We thank Tianxing for providing the excellent codebase!

