
Omni-Video 2

A flexible framework to bridge video understanding, generation and editing

Project Page   HuggingFace Model   arXiv

Hao Yang2, Zhiyu Tan1,2†, Jia Gong2, Luozheng Qin2, Hesen Chen1,2, Xiaomeng Yang2, Yuqing Sun2, Yuetan Lin2, Mengping Yang2*, Hao Li1,2*

1Fudan University  |  2Shanghai Academy of Artificial Intelligence for Science
*Corresponding Author    †Project Lead


🔥 Latest News

  • February 12, 2026: 🔥🔥 We release the Technical Report of Omni-Video 2 on arXiv!
  • February 12, 2026: 🔥🔥 We are glad to release a lighter model, OmniVideo2-1.3B: it is much smaller and faster, while its output quality remains competitive!
  • January 22, 2026: 🔥🔥 The all-new OmniVideo2 is now released!
  • August 6, 2025: We are glad to release v0.1's code, which includes support for both inference and fine-tuning!
  • August 6, 2025: Our v0.1 model is now available on HF Model!
  • July 7, 2025: We release the Technical Report of Omni-Video
  • July 7, 2025: We release the project page of Omni-Video

Introduction

We present a unified video editing and generation framework that pairs a text-to-video DiT backbone with vision-language understanding for precise, controllable edits. A VLM reads the source video and edit instruction to predict a detailed caption of the expected edited result, converting sparse prompts into explicit semantics about content, attributes, and temporal changes. The DiT model then uses mixed cross-attention conditioning, injecting source VAE latents (optionally concatenated with other cues) together with the expanded text semantics, to preserve identity, layout, and motion while enabling flexible control. This yields a single pipeline that supports text-to-video, video-to-video editing, and mixed-condition generation.

Framework


Video Editing Demos

Note: Left side shows the source video, right side shows the edited result.

Advanced Video Editing

Complex Edit

Multi-element transformations combining appearance, lighting, and environmental changes.

Change the man's black jacket to a tattered gray overcoat, replace the wall with faded blue wallpaper

Change the woman's red shirt to glowing neon cyan, transform window glow to electric blue moonlight

Change the man's black jacket to a gray coat with glowing thread, replace blue light with warm amber

Change workout attire to vibrant crimson sports bra and leggings, replace towel with flowing silk scarf

High Motion

Challenging edits on fast-moving subjects with dynamic clothing and dramatic motion.

Change the woman's black top to a flowing blood-red silk gown that billows with motion

Change the woman's green jacket to a deep crimson cloak that billows dramatically

Change the armored suit from red-and-black to matte charcoal gray with cyan circuitry accents

Change the woman's white shirt to a blood-red silk blouse that clings to her form

Diverse Local Edit

Precise object-level modifications while preserving surrounding context and motion.

Change the real raccoon to a stuffed raccoon

Change the firefighter's pizza to a steaming cup of coffee

Change the light brown fur to deep obsidian-black fur with icy blue ethereal mist

Change the golden retriever to a black Labrador


Basic Video Editing

Add

Adding objects and accessories to videos.

Add a scarf around the first fox's neck

Add a tiny pirate hat on the parrot's head

Add a red headband to the player's forehead

Add a tiny crown to the hummingbird's head

Remove

Removing elements from videos while maintaining scene coherence.

Remove the meditation cushion from the scene

Remove the two cubs from the scene

Remove the two lizards from the scene

Remove the black cat from the scene

Local Change

Local attribute changes on specific objects.

Change the woman's white dress to a blood-stained black gown

Change the fox into a badger

Change the man with thick beard to a woman with short silver hair

Change the engineer's navy jacket to a bright crimson trench coat

Project Structure

omnivideo2_release/
├── omnivideo/
│   ├── configs/           # Model configurations
│   ├── distributed/       # FSDP and sequence parallel utilities
│   ├── modules/           # Core model components (attention, VAE, T5, etc.)
│   ├── utils/             # Utility functions and solvers
│   ├── vllm_model.py      # Qwen3-VL integration
│   └── x2x_gen_unified.py # Main generation pipeline
└── tools/
    └── inference/
        ├── generate_omni_v2v.py    # Inference script
        └── inference_omni_v2v.sh   # Shell launcher

Environment Setup

Requirements

  • Python >= 3.10
  • PyTorch >= 2.8 with CUDA support
  • NVIDIA GPU with sufficient VRAM (recommended: 80GB for A14B model)

Installation

  1. Clone the repository:
git clone https://github.com/SAIS-FUXI/Omni-Video.git
cd Omni-Video
  2. Create a conda environment:
conda create -n omnivideo2 python=3.10
conda activate omnivideo2
  3. Install dependencies:
pip install -r requirements.txt
pip install flash-attn --no-build-isolation  # Optional but recommended for faster attention
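After installing, it can help to sanity-check the environment before downloading checkpoints. A minimal sketch (not part of the repository; the `2.8` minimum matches the requirement above, and `flash_attn` is treated as optional):

```python
import importlib.util

def meets_min_version(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, e.g. '2.10.0' >= '2.8'."""
    def parts(v):
        # Drop local build tags such as '+cu121' before comparing.
        return [int(p) for p in v.split("+")[0].split(".") if p.isdigit()]
    return parts(installed) >= parts(minimum)

def check_env() -> dict:
    """Report PyTorch version/CUDA status and whether flash-attn is importable."""
    report = {}
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["torch"] = torch.__version__
        report["torch_ok"] = meets_min_version(torch.__version__, "2.8")
        report["cuda"] = torch.cuda.is_available()
    else:
        report["torch"] = None
    report["flash_attn"] = importlib.util.find_spec("flash_attn") is not None
    return report
```

Running `check_env()` on the target machine should show `torch_ok: True` and `cuda: True` before proceeding.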

Model Checkpoints

Download the pretrained checkpoints and organize them as follows:

${CKPT_DIR}/
├── high_noise_model/
│   └── model.pt              # High-noise timestep model
├── low_noise_model/
│   └── model.pt              # Low-noise timestep model
├── special_tokens.pkl        # Special token embeddings
├── models_t5_umt5-xxl-enc-bf16.pth  # T5 encoder
└── Wan2.1_VAE.pth            # VAE model
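A quick way to confirm the layout before launching inference is to check for each expected file. A hedged sketch (the file names are taken from the tree above; the helper itself is not part of the repository):

```python
from pathlib import Path

# Relative paths under ${CKPT_DIR}, as listed in the checkpoint tree above.
EXPECTED = [
    "high_noise_model/model.pt",
    "low_noise_model/model.pt",
    "special_tokens.pkl",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1_VAE.pth",
]

def missing_checkpoints(ckpt_dir) -> list:
    """Return the expected relative paths that are absent under ckpt_dir."""
    root = Path(ckpt_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]
```

An empty return value means the checkpoint directory is complete.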

You will also need the Qwen3-VL model (e.g. Qwen3-VL-30B-A3B-Instruct, as referenced in the inference script) for visual feature extraction.

Inference

Prepare Input Data

Create a JSONL file with your prompts. Each line should be a JSON object:

For Video-to-Video editing:

{"sample_id": "001", "edit_prompt": "Change the dog to a cat", "source_clip_path": "/path/to/source_video.mp4"}
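Such a file can also be written programmatically; the keys below match the example line above (a minimal sketch):

```python
import json

def write_jsonl(path, samples):
    """Write one JSON object per line, the format expected by the inference script."""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")

samples = [
    {
        "sample_id": "001",
        "edit_prompt": "Change the dog to a cat",
        "source_clip_path": "/path/to/source_video.mp4",
    },
]
# write_jsonl("prompts.jsonl", samples)
```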

Run Inference

  1. Edit the configuration in tools/inference/inference_omni_v2v.sh:
# Update these paths
CKPT_DIR="/path/to/your/checkpoints"
QWEN3VL_MODEL_PATH="/path/to/Qwen3-VL-30B-A3B-Instruct"
DATA_FILE="/path/to/your/prompts.jsonl"

# Adjust generation parameters as needed
GEN_SIZE="832*480"       # Video resolution (width*height)
GEN_FRAME_NUM=41         # Number of frames
GEN_SAMPLE_FPS=8         # Output FPS
GEN_TASK="v2v-A14B"      # Task type: v2v-A14B, t2v-A14B, v2v-1.3B, or t2v-1.3B
  2. Run the inference script:
## for OmniVideo2-A14B
bash tools/inference/inference_omni_v2v.sh
## for OmniVideo2-1.3B
bash tools/inference/inference_omni_v2v_1_3B.sh
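Note that `GEN_SIZE` is a single `width*height` string rather than two numbers; any tooling around the script must split it into integers. A sketch of that parsing (a hypothetical helper, not part of the repository code):

```python
def parse_size(size: str):
    """Split a 'width*height' string such as '832*480' into (width, height) ints."""
    w, h = size.split("*")
    return int(w), int(h)

print(parse_size("832*480"))  # → (832, 480)
```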

Available Tasks

Task      Description
t2v-A14B  Text-to-Video generation with OmniVideo2-A14B model
v2v-A14B  Video-to-Video editing with OmniVideo2-A14B model
t2v-1.3B  Text-to-Video generation with OmniVideo2-1.3B model
v2v-1.3B  Video-to-Video editing with OmniVideo2-1.3B model

Generation Parameters

Parameter             Default  Description
--size                832*480  Output video resolution (width*height)
--frame_num           41       Number of frames to generate
--sample_fps          8        Output video FPS
--sample_steps        40       Number of diffusion sampling steps
--sample_guide_scale  3.0      Classifier-free guidance scale
--sample_shift        5        Noise schedule shift parameter
--sample_solver       unipc    Sampling solver (unipc, ddim, euler)
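With the defaults above, the output duration follows directly from the frame count and FPS: 41 frames at 8 FPS is about 5.1 seconds of video.

```python
def clip_duration_seconds(frame_num: int, fps: int) -> float:
    """Duration of the generated clip at a given frame rate."""
    return frame_num / fps

print(clip_duration_seconds(41, 8))  # → 5.125
```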

Acknowledgements

We sincerely thank the following teams for their outstanding contributions that made this project possible:

  • Wan Team: For the foundational video generation architecture, VAE model, and diffusion framework.

  • Qwen-VL Team: For the powerful Qwen3-VL vision-language model.

License

Please refer to the LICENSE file for details.

Citation

If you find this work useful, please consider citing:

@article{yang2026omnivideo2,
  title={Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing},
  author={Yang, Hao and Tan, Zhiyu and Gong, Jia and Qin, Luozheng and Chen, Hesen and Yang, Xiaomeng and Sun, Yuqing and Lin, Yuetan and Yang, Mengping and Li, Hao},
  journal={arXiv preprint arXiv:2602.08820},
  year={2026}
}
@article{tan2025omni,
  title={Omni-Video: Democratizing Unified Video Understanding and Generation},
  author={Tan, Zhiyu and Yang, Hao and Qin, Luozheng and Gong, Jia and Yang, Mengping and Li, Hao},
  journal={arXiv preprint arXiv:2507.06119},
  year={2025}
}
