This repository contains the evaluation code for EditVerseBench, the instruction-based video editing benchmark from the paper "EditVerse: A Unified Framework for Editing and Generation via In-Context Learning".
Xuan Ju<sup>1,2</sup>, Tianyu Wang<sup>1</sup>, Yuqian Zhou<sup>1</sup>, He Zhang<sup>1</sup>, Qing Liu<sup>1</sup>, Nanxuan Zhao<sup>1</sup>, Zhifei Zhang<sup>1</sup>, Yijun Li<sup>1</sup>, Yuanhao Cai<sup>3</sup>, Shaoteng Liu<sup>1</sup>, Daniil Pakhomov<sup>1</sup>, Zhe Lin<sup>1</sup>, Soo Ye Kim<sup>1*</sup>, Qiang Xu<sup>2*</sup>
<sup>1</sup>Adobe Research <sup>2</sup>The Chinese University of Hong Kong <sup>3</sup>Johns Hopkins University <sup>*</sup>Corresponding Author
🌐 Project Page | 📜 Arxiv | 🤗 Benchmark | 📹 Slides | 👀 Comparison
(Optional) Create a Conda environment
conda create -n EditVerse python=3.10
conda activate EditVerse
Install PyTorch
(You may adjust the version or CUDA support depending on your hardware)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
Install required packages
pip install -r requirements.txt
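(Optional) A quick sanity check that the installed PyTorch build matches your hardware, i.e. that a CUDA device is visible:

```python
import torch

# Print the installed version and whether a CUDA device is visible.
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```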
Download benchmark dataset
git lfs install
git clone https://huggingface.co/datasets/sooyek/EditVerseBench
Download the videos
The source videos cannot be directly distributed due to licensing restrictions. Instead, you can download them using the provided script with the Pixabay API. (The network connection may occasionally fail, so you might need to run the script multiple times.)
⚠️ Note: Remember to replace the API key in download_source_video.py with your own key. You can find the API key here (shown under Parameters → key (required) on that page). The API is free, but you need to sign up for an account to obtain a key.
cd EditVerseBench
python download_source_video.py
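For reference, the Pixabay videos API is a simple JSON endpoint, so clips that failed to download can also be re-fetched individually. The snippet below is only an illustrative sketch, not part of the official script: the helper name, the choice of the "large" rendition, and the assumption that the leading number in each benchmark file name is the Pixabay video ID are all assumptions.

```python
import requests

PIXABAY_API_KEY = "YOUR_API_KEY"  # replace with your own key (see the note above)

def download_pixabay_video(video_id: str, out_path: str) -> None:
    """Fetch one video's metadata from the Pixabay videos API and download the file."""
    resp = requests.get(
        "https://pixabay.com/api/videos/",
        params={"key": PIXABAY_API_KEY, "id": video_id},
        timeout=30,
    )
    resp.raise_for_status()
    hits = resp.json().get("hits", [])
    if not hits:
        raise RuntimeError(f"No Pixabay video found for id {video_id}")
    # Pick one of the available renditions; "large" is assumed here.
    url = hits[0]["videos"]["large"]["url"]
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

# Example: the leading number in "videos/174008-850361316.mp4" looks like the Pixabay ID.
# download_pixabay_video("174008", "videos/174008-850361316.mp4")
```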
After downloading, the benchmark file structure should look like this:
EditVerseBench/
  ├── test.json
  ├── depths/
  │   ├── xx.mp4
  ├── edited_first_frame/
  │   ├── xx.mp4
  ├── images/
  │   ├── xx.mp4
  ├── inpaint_video_and_masks/
  │   ├── xx.mp4
  ├── poses/
  │   ├── xx.mp4
  ├── sketchs/
  │   ├── xx.mp4
  ├── videos/
  │   ├── xx.mp4
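Because the download script can fail intermittently, it is worth checking that every file referenced by test.json actually exists before running the evaluation. A minimal sketch, assuming that tag keys such as <video1> store paths relative to the folder that contains test.json:

```python
import json
from pathlib import Path

# Paths referenced in test.json are assumed to be relative to this folder.
root = Path("EditVerseBench/EditVerseBench")
entries = json.loads((root / "test.json").read_text())

missing = []
for idx, entry in entries.items():
    for key, value in entry.items():
        # Tag keys such as "<video1>" or "<image1>" hold relative media paths;
        # "<text>" holds the instruction and is skipped.
        if key.startswith("<") and key.endswith(">") and key != "<text>":
            if not (root / value).exists():
                missing.append((idx, value))

print(f"{len(missing)} referenced files are missing")
for idx, path in missing:
    print(f"  entry {idx}: {path}")
```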
Unpack comparison results
cd EditVerseBench
tar -zxvf EditVerse_Comparison_Results.tar.gz
rm EditVerse_Comparison_Results.tar.gz
Command
python eval.py --metrics [metrics] \
--test_json_path EditVerseBench/EditVerseBench/test.json \
--generate_results_dir [results_dir] \
--output_csv [output_csv] \
--gpt_api_key [your_api_key]
Arguments
- metrics: Use all to evaluate every metric. To select specific metrics, provide a comma-separated list (no spaces), e.g. clip_temporal_consistency,dino_temporal_consistency. Supported metrics:
  - clip_temporal_consistency (a sketch of the idea behind the temporal-consistency metrics follows this list)
  - dino_temporal_consistency
  - frame_text_alignment
  - video_text_alignment
  - pick_score_video_quality
  - editing_vlm_evaluation
- test_json_path: Path to the benchmark entrypoint JSON file.
- generate_results_dir: Directory containing the generated results (must follow the required structure).
- output_csv: Path where the evaluation CSV file will be saved.
- gpt_api_key: OpenAI API key (required for editing_vlm_evaluation).
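As a rough illustration of what the temporal-consistency metrics measure: clip_temporal_consistency is commonly computed as the average cosine similarity of CLIP image features between consecutive frames. The sketch below shows that idea only; it is not the repository's exact implementation, and the model choice and frame handling are assumptions.

```python
import cv2
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_temporal_consistency(video_path: str, stride: int = 1) -> float:
    """Average cosine similarity of CLIP features between consecutive frames."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Read frames (BGR -> RGB) from the video.
    cap, frames = cv2.VideoCapture(video_path), []
    ok, frame = cap.read()
    while ok:
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        ok, frame = cap.read()
    cap.release()
    frames = frames[::stride]

    with torch.no_grad():
        inputs = processor(images=frames, return_tensors="pt")
        feats = model.get_image_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        # Cosine similarity between each frame and the next one.
        sims = (feats[:-1] * feats[1:]).sum(dim=-1)
    return sims.mean().item()
```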
Example
Evaluate the provided EditVerse results and save output to EditVerse_eval.csv:
python eval.py --metrics all \
--test_json_path EditVerseBench/EditVerseBench/test.json \
--generate_results_dir EditVerseBench/EditVerse_Comparison_Results/EditVerse \
--output_csv EditVerse_eval.csv \
--gpt_api_key [Your API key]
👉 Pre-computed evaluation results for EditVerse and previous methods are available at: EditVerseBench/automatic_evaluation_results.
You can also evaluate your model outputs by following the same format.
Step 1: Refer to the benchmark JSON format
See EditVerseBench/EditVerseBench/test.json for reference.
Each entry looks like this:
{
    "0": {
        "<text>": "<video1> Add a small golden crown ...",
        "<video1>": "videos/174008-850361316.mp4",
        "<video1> link": "https://pixabay.com/videos/woman-smile-communication-gesture-174008/",
        "direction": "horizontal",
        "target_prompt": "A young woman stands outside in front of ...",
        "type": "add object",
        "source_prompt": "A young woman stands outside in front of ..."
    },
    "1": {
        ...
    },
    ...
}
Key fields:
- `<text>`: A natural-language instruction describing the required edit in an interleaved format. The instruction may include special tags such as `<video1>`, `<video2>`, or `<image1>`; each tag corresponds to a key of the same name in the JSON entry.
- `<video1>`: The local file path of the source video.
- `<video1> link`: The reference URL pointing to the source video's original location.
- direction: horizontal or vertical.
- target_prompt: A detailed textual description of the desired edited video.
- type: The category of the edit.
- source_prompt: A description of the original, unedited video.
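As an example of consuming this format, a model wrapper might load an entry, look up the media referenced by the tags embedded in `<text>`, and pass the interleaved instruction to the model. A minimal sketch, where the tag-matching regex and paths are illustrative assumptions rather than part of eval.py:

```python
import json
import re
from pathlib import Path

bench_root = Path("EditVerseBench/EditVerseBench")
entries = json.loads((bench_root / "test.json").read_text())

entry = entries["0"]
instruction = entry["<text>"]

# Find tags such as <video1>, <video2>, <image1> inside the instruction
# and look up the media path stored under the same key in the entry.
tags = re.findall(r"<(?:video|image)\d+>", instruction)
media = {tag: bench_root / entry[tag] for tag in tags if tag in entry}

print(instruction)
print(media)  # e.g. {"<video1>": ".../videos/174008-850361316.mp4"}
print(entry["type"], entry["direction"])
```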
Step 2: Format your results
After generating results with your model, arrange files as follows:
Your_Folder/
  ├── 0/
  │   ├── generate.mp4   # model-generated video
  │   └── video1.mp4     # source video
  ├── 1/
  │   ├── generate.mp4
  │   └── video1.mp4
  ...
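A sketch of how such a folder could be assembled from the benchmark JSON; run_my_model is a hypothetical placeholder for your own inference code, not a function provided by this repository.

```python
import json
import shutil
from pathlib import Path

bench_root = Path("EditVerseBench/EditVerseBench")
out_root = Path("Your_Folder")
entries = json.loads((bench_root / "test.json").read_text())

for idx, entry in entries.items():
    sample_dir = out_root / idx
    sample_dir.mkdir(parents=True, exist_ok=True)

    # Copy the source video next to the generated result so eval.py can find both.
    src_video = bench_root / entry["<video1>"]
    shutil.copy(src_video, sample_dir / "video1.mp4")

    # run_my_model should write the edited video to the given path.
    # run_my_model(entry["<text>"], src_video, sample_dir / "generate.mp4")
```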
Step 3: Run evaluation
python eval.py --metrics all \
--test_json_path EditVerseBench/EditVerseBench/test.json \
--generate_results_dir [Your_Folder] \
--output_csv [Your_Results.csv] \
--gpt_api_key [your_api_key]
| Method | VLM Evaluation: Editing Quality ↑ | Video Quality: Pick Score ↑ | Text Alignment: Frame ↑ | Text Alignment: Video ↑ | Temporal Consistency: CLIP ↑ | Temporal Consistency: DINO ↑ |
|---|---|---|---|---|---|---|
| Attention Manipulation (Training-free) | ||||||
| TokenFlow | 5.26 | 19.73 | 25.57 | 22.70 | 98.36 | 98.09 | 
| STDF | 4.41 | 19.45 | 25.24 | 22.26 | 96.04 | 95.22 | 
| First-Frame Propagation (w/ End-to-End Training) | ||||||
| Señorita-2M | 6.97 | 19.71 | 26.34 | 23.24 | 98.05 | 97.99 | 
| Instruction-Guided (w/ End-to-End Training) | ||||||
| InsV2V | 5.21 | 19.39 | 24.99 | 22.54 | 97.15 | 96.57 | 
| Lucy Edit | 5.89 | 19.67 | 26.00 | 23.11 | 98.49 | 98.38 | 
| EditVerse (Ours) | 7.65 | 20.07 | 26.73 | 23.93 | 98.56 | 98.42 |
Files under ./automatic_evaluation/viclip are from InternVideo and under Apache 2.0 License. Files under ./automatic_evaluation except for those under the folder viclip are modified from awesome-diffusion-v2v under MIT License and modifications by Adobe are under Adobe Research License. All other materials are licensed under Adobe Research License.
If you find our work useful for your research, please consider citing our paper:
@article{ju2025editverse,
  title   = {EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning},
  author  = {Xuan Ju and Tianyu Wang and Yuqian Zhou and He Zhang and Qing Liu and Nanxuan Zhao and Zhifei Zhang and Yijun Li and Yuanhao Cai and Shaoteng Liu and Daniil Pakhomov and Zhe Lin and Soo Ye Kim and Qiang Xu},
  journal = {arXiv preprint arXiv:2509.20360},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.20360}
}