Hao Yang2, Zhiyu Tan1,2†, Jia Gong2, Luozheng Qin2, Hesen Chen1,2, Xiaomeng Yang2, Yuqing Sun2, Yuetan Lin2, Mengping Yang2*, Hao Li1,2*
1Fudan University | 2Shanghai Academy of Artificial Intelligence for Science
*Corresponding Author †Project Lead
- February 12, 2026: 🔥🔥 We release the Technical Report of Omni-Video 2 on arXiv!
- February 12, 2026: 🔥🔥 We are glad to release a lighter model, OmniVideo2-1.3B. It is much smaller and faster while still delivering strong quality!
- January 22, 2026: 🔥🔥 The all-new OmniVideo2 is now released!
- August 6, 2025: We are glad to release the v0.1 code, which supports both inference and fine-tuning!
- August 6, 2025: Our v0.1 model is now available on Hugging Face!
- July 7, 2025: We release the Technical Report of Omni-Video
- July 7, 2025: We release the project page of Omni-Video
We present a unified video editing and generation framework that pairs a text-to-video DiT backbone with vision-language understanding for precise, controllable edits. A VLM reads the source video and edit instruction to predict a detailed caption of the expected edited result, converting sparse prompts into explicit semantics about content, attributes, and temporal changes. The DiT model then uses mixed cross-attention conditioning, injecting source VAE latents (optionally concatenated with other cues) together with the expanded text semantics, to preserve identity, layout, and motion while enabling flexible control. This yields a single pipeline that supports text-to-video, video-to-video editing, and mixed-condition generation.
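Conceptually, the pipeline can be summarized by the following minimal sketch. It is illustrative only: the class and callable names below are placeholders and do not exist in the codebase (the actual pipeline lives in omnivideo/x2x_gen_unified.py).

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch of the two-stage flow described above; the callables
# stand in for the real VLM captioner, VAE encoder, and DiT sampler.
@dataclass
class OmniVideoPipelineSketch:
    caption_expander: Callable  # VLM: (source_video, instruction) -> detailed caption
    vae_encoder: Callable       # VAE: source_video -> source latents
    dit_sampler: Callable       # DiT: (expanded caption, source latents) -> edited video

    def edit(self, source_video, instruction):
        # Stage 1: turn the sparse edit instruction into explicit semantics
        # about content, attributes, and temporal changes.
        expanded_caption = self.caption_expander(source_video, instruction)
        # Stage 2: condition the DiT on both the expanded caption and the
        # source latents so identity, layout, and motion are preserved.
        source_latents = self.vae_encoder(source_video)
        return self.dit_sampler(expanded_caption, source_latents)
```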
Note: Left side shows the source video, right side shows the edited result.
Multi-element transformations combining appearance, lighting, and environmental changes.
Challenging edits on fast-moving subjects with dynamic clothing and dramatic motion.
Precise object-level modifications while preserving surrounding context and motion.
Adding objects and accessories to videos.
- Add a scarf around the first fox's neck
- Add a tiny pirate hat on the parrot's head
- Add a red headband to the player's forehead
- Add a tiny crown to the hummingbird's head
Removing elements from videos while maintaining scene coherence.
- Remove the meditation cushion from the scene
- Remove the two cubs from the scene
- Remove the two lizards from the scene
- Remove the black cat from the scene
Local attribute changes on specific objects.
omnivideo2_release/
├── omnivideo/
│ ├── configs/ # Model configurations
│ ├── distributed/ # FSDP and sequence parallel utilities
│ ├── modules/ # Core model components (attention, VAE, T5, etc.)
│ ├── utils/ # Utility functions and solvers
│ ├── vllm_model.py # Qwen3-VL integration
│ └── x2x_gen_unified.py # Main generation pipeline
└── tools/
└── inference/
├── generate_omni_v2v.py # Inference script
└── inference_omni_v2v.sh # Shell launcher
- Python >= 3.10
- PyTorch >= 2.8 with CUDA support
- NVIDIA GPU with sufficient VRAM (recommended: 80GB for A14B model)
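A quick way to confirm the environment meets these requirements is a small standalone check like the one below (illustrative only, not part of the repository):

```python
import torch

# Standalone environment sanity check (illustrative, not part of the repository).
major, minor = (int(v) for v in torch.__version__.split(".")[:2])
assert (major, minor) >= (2, 8), f"PyTorch >= 2.8 required, found {torch.__version__}"
assert torch.cuda.is_available(), "A CUDA-capable GPU is required"

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.0f} GB")
if vram_gb < 80:
    print("Note: ~80 GB VRAM is recommended for the A14B model; "
          "the 1.3B model is a lighter alternative.")
```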
- Clone the repository:
git clone https://github.com/your-org/omnivideo2.git
cd omnivideo2
- Create a conda environment:
conda create -n omnivideo2 python=3.10
conda activate omnivideo2
- Install dependencies:
pip install -r requirements.txt
pip install flash-attn --no-build-isolation  # Optional but recommended for faster attention
Download the pretrained checkpoints and organize them as follows:
${CKPT_DIR}/
├── high_noise_model/
│ └── model.pt # High-noise timestep model
├── low_noise_model/
│ └── model.pt # Low-noise timestep model
├── special_tokens.pkl # Special token embeddings
├── models_t5_umt5-xxl-enc-bf16.pth # T5 encoder
└── Wan2.1_VAE.pth # VAE model
You will also need the Qwen3-VL model for visual feature extraction:
- Download from: Qwen3-VL-30B-A3B-Instruct
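To catch path mistakes early, you can verify that the diffusion checkpoints match the layout shown above before running inference. A minimal sketch (replace the directory with your own `${CKPT_DIR}`):

```python
from pathlib import Path

# Check that the checkpoint directory matches the expected layout.
ckpt_dir = Path("/path/to/your/checkpoints")

expected = [
    "high_noise_model/model.pt",
    "low_noise_model/model.pt",
    "special_tokens.pkl",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1_VAE.pth",
]

missing = [rel for rel in expected if not (ckpt_dir / rel).exists()]
if missing:
    raise FileNotFoundError(f"Missing checkpoint files: {missing}")
print("All expected checkpoint files are present.")
```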
Create a JSONL file with your prompts. Each line should be a JSON object:
For Video-to-Video editing:
{"sample_id": "001", "edit_prompt": "Change the dog to a cat", "source_clip_path": "/path/to/source_video.mp4"}- Edit the configuration in
- Edit the configuration in tools/inference/inference_omni_e2e.sh:
# Update these paths
CKPT_DIR="/path/to/your/checkpoints"
QWEN3VL_MODEL_PATH="/path/to/Qwen3-VL-30B-A3B-Instruct"
DATA_FILE="/path/to/your/prompts.jsonl"
# Adjust generation parameters as needed
GEN_SIZE="832*480" # Video resolution (width*height)
GEN_FRAME_NUM=41 # Number of frames
GEN_SAMPLE_FPS=8 # Output FPS
GEN_TASK="v2v-A14B" # Task type: v2v-A14B or t2v-A14B or v2v-A1.3B or t2v-A1.3B- Run the inference script:
## for OmniVideo2-A14B
bash tools/inference/inference_omni_e2e.sh
## for OmniVideo2-1.3B
bash tools/inference/inference_omni_v2v_1_3B.sh
| Task | Description |
|---|---|
| `t2v-A14B` | Text-to-Video generation with OmniVideo2-A14B model |
| `v2v-A14B` | Video-to-Video editing with OmniVideo2-A14B model |
| `t2v-1.3B` | Text-to-Video generation with OmniVideo2-1.3B model |
| `v2v-1.3B` | Video-to-Video editing with OmniVideo2-1.3B model |
| Parameter | Default | Description |
|---|---|---|
| `--size` | `832*480` | Output video resolution (width*height) |
| `--frame_num` | `41` | Number of frames to generate |
| `--sample_fps` | `8` | Output video FPS |
| `--sample_steps` | `40` | Number of diffusion sampling steps |
| `--sample_guide_scale` | `3.0` | Classifier-free guidance scale |
| `--sample_shift` | `5` | Noise schedule shift parameter |
| `--sample_solver` | `unipc` | Sampling solver (unipc, ddim, euler) |
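These flags belong to the underlying inference script, so the shell launcher's defaults can be overridden by invoking it directly. A sketch of such a call is shown below; only the flags documented above are included, and additional required arguments (checkpoint, model, and data paths) are omitted, so consult tools/inference/inference_omni_v2v.sh for the full command line.

```python
import subprocess

# Launch the inference script with explicit sampling parameters.
# Checkpoint/model/data arguments required by the script are assumed to be
# added as in the shell launcher.
cmd = [
    "python", "tools/inference/generate_omni_v2v.py",
    "--size", "832*480",
    "--frame_num", "41",
    "--sample_fps", "8",
    "--sample_steps", "40",
    "--sample_guide_scale", "3.0",
    "--sample_shift", "5",
    "--sample_solver", "unipc",
]
subprocess.run(cmd, check=True)
```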
We sincerely thank the following teams for their outstanding contributions that made this project possible:
- Wan Team: For the foundational video generation architecture, VAE model, and diffusion framework.
- Qwen-VL Team: For the powerful Qwen3-VL vision-language model.
Please refer to the LICENSE file for details.
If you find this work useful, please consider citing:
@article{yang2026omnivideo2,
title={Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing},
author={Yang, Hao and Tan, Zhiyu and Gong, Jia and Qin, Luozheng and Chen, Hesen and Yang, Xiaomeng and Sun, Yuqing and Lin, Yuetan and Yang, Mengping and Li, Hao},
journal={arXiv preprint arXiv:2602.08820},
year={2026}
}
@article{tan2025omni,
title={Omni-Video: Democratizing Unified Video Understanding and Generation},
author={Tan, Zhiyu and Yang, Hao and Qin, Luozheng and Gong, Jia and Yang, Mengping and Li, Hao},
journal={arXiv preprint arXiv:2507.06119},
year={2025}
}