Hi @JianhongBai, first of all, thanks for open-sourcing this great work!
I would like to ask whether the currently released checkpoint supports video-conditioned multi-view video generation (as in Sec 3.5 of the paper). I followed the inference instructions in Sec 3.5 and replaced the first view's noisy latent with the reference video features, but I got very strange results. Do I need to finetune the released checkpoint following the training instructions in Sec 3.5? Thanks!
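For concreteness, here is a rough sketch of the replacement step I attempted. Everything here is a hypothetical placeholder, not the repo's actual API: `denoise_step`, the latent shapes, and the toy noise schedule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_views, latent_shape = 4, (4, 8, 8)  # illustrative: (channels, H, W) per view
ref_latent = rng.standard_normal(latent_shape)            # encoded reference video (placeholder)
latents = rng.standard_normal((num_views, *latent_shape))  # initial noise for all views

def denoise_step(latents, t):
    """Placeholder for the real multi-view denoiser; dummy update only."""
    return latents * 0.9

timesteps = np.linspace(1.0, 0.0, 10)  # toy schedule, not the model's real one
for t in timesteps:
    # Re-noise the reference latent to the current noise level and
    # overwrite the first view, so it stays pinned to the reference video.
    noise = rng.standard_normal(latent_shape)
    latents[0] = np.sqrt(1 - t**2) * ref_latent + t * noise
    latents = denoise_step(latents, t)

print(latents.shape)  # -> (4, 4, 8, 8)
```

Is this replace-at-every-step reading of Sec 3.5 correct, or should the reference latent only be injected at the first step?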
By the way, I find that the text-conditioned generation results from the released checkpoint tend to look "synthetic" rather than realistic. Below is a video produced from the provided example text prompt. Could this be because the model is mostly trained on a synthetic multi-view dataset?
prompt0_merged.mp4
Thanks for your time, and I look forward to your reply!