Hi @JianhongBai, first of all, thanks for open-sourcing this great work!
I would like to ask whether the currently released checkpoint supports video-conditioned multi-view video generation (as in Sec 3.5 of the paper). I followed the inference instructions in Sec 3.5 and replaced the first view's noisy latent with the reference video features, but I got very strange results. Do I need to finetune the released checkpoint following the training instructions in Sec 3.5? Thanks!
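For concreteness, here is a rough sketch of the replacement step I attempted. Everything here is a hypothetical placeholder, not the repo's actual API: `denoise_step`, the latent shapes, and the toy noise schedule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_views, latent_shape = 4, (4, 8, 8)  # illustrative: (channels, H, W) per view
ref_latent = rng.standard_normal(latent_shape)            # encoded reference video (placeholder)
latents = rng.standard_normal((num_views, *latent_shape))  # initial noise for all views

def denoise_step(latents, t):
    """Placeholder for the real multi-view denoiser; dummy update only."""
    return latents * 0.9

timesteps = np.linspace(1.0, 0.0, 10)  # toy schedule, not the model's real one
for t in timesteps:
    # Re-noise the reference latent to the current noise level and
    # overwrite the first view, so it stays pinned to the reference video.
    noise = rng.standard_normal(latent_shape)
    latents[0] = np.sqrt(1 - t**2) * ref_latent + t * noise
    latents = denoise_step(latents, t)

print(latents.shape)  # -> (4, 4, 8, 8)
```

Is this replace-at-every-step reading of Sec 3.5 correct, or should the reference latent only be injected at the first step?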
By the way, I find that the text-conditioned generation results from the released checkpoint tend to look "synthetic" rather than realistic. Below is a video produced from the provided example text prompt. Could this be because the model is mostly trained on a synthetic multi-view dataset?
prompt0_merged.mp4
Thanks for your time, and I look forward to your reply!