Commit 4d4fad2 ("1st")
scofield7419 committed Dec 3, 2024 · 1 parent 0b90927
Showing 48 changed files with 6,492 additions and 28 deletions.
124 changes: 96 additions & 28 deletions README.md
<a href="https://github.com/scofield7419/Video-of-Thought">
<img src="https://img.shields.io/badge/VoT-1.0-blue" alt="pytorch 1.8.1">
</a>
<a href="https://github.com/scofield7419/Video-of-Thought" rel="nofollow">
<img src="https://img.shields.io/badge/MLLM-1.0-red" alt="pytorch 1.8.1">
</a>
<a href="https://huggingface.co/docs/transformers/index" rel="nofollow">
<img src="https://img.shields.io/badge/transformers-4.24.0-green" alt="Build Status">
</a>
<h1 align="center">
🤔🎞️ Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
</h1>

**ICML (Oral) 2024**

[Hao Fei](http://haofei.vip/), [Shengqiong Wu](https://chocowu.github.io/), Wei Ji, [Hanwang Zhang](https://personal.ntu.edu.sg/hanwangzhang/), [Meishan Zhang](https://zhangmeishan.github.io/), [Mong Li Lee](https://www.comp.nus.edu.sg/~leeml/), and [Wynne Hsu](https://www.comp.nus.edu.sg/~whsu/)


<a href='https://haofei.vip/VoT/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href='https://openreview.net/pdf?id=fO31YAyNbI'><img src='https://img.shields.io/badge/Paper-PDF-orange'></a>
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/2fKCWjetV-Y)
![License](https://img.shields.io/badge/License-BSD-blue.svg)
<a href="https://pytorch.org" rel="nofollow">
<img src="https://img.shields.io/badge/pytorch-1.10.0-orange" alt="pytorch 1.8.1">
</a>


**The implementation of the ICML 2024 paper [Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition](https://is.gd/fcfZeO)**

### 🎉 Visit the project page: [VoT](http://haofei.vip/VoT/)

----------
## Abstract
Existing research on video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension.
This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating video spatial-temporal scene graph (STSG) representation.
Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler and manageable sub-problems, and addressing them step-by-step from low-level pixel perception to high-level cognitive interpretation.
Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly boosts the existing state-of-the-art.

![framework](./assets/intro.png)


## Overview<a name="overview" />
> Following the CoT idea, VoT decomposes complex video understanding into multiple steps from low to high levels, enabling not only pixel perceptive recognition but also cognitive understanding of videos.

<p align="center">
<img src="./figures/VoT.png" width="550"/>
<img src="./assets/VoT.png" width="550"/>
</p>

> We also introduce a novel video MLLM, namely MotionEpic, which supports not only video input but also the encoding, understanding and generation of STSGs.

<p align="center">
<img src="./figures/MotionEpic.png" width="650"/>
<img src="./assets/MotionEpic.png" width="650"/>
</p>


----------
## Method

- **MotionEpic: Fine-grained Spatial-temporal Grounded Video MLLM**
  - We employ [Vicuna-7B (v1.5)](https://huggingface.co/lmsys/vicuna-7b-v1.5) as the backbone LLM.
  - To perceive video input, we adopt the [ViT-L/14](https://huggingface.co/openai/clip-vit-large-patch14) encoder and a Q-Former projector.
  - MotionEpic is designed to support the STSG signal, for which we retrofit the Graph Transformer with recurrent propagation to encode multi-frame STSG information.

- **Video-of-Thought Reasoning Framework** (a minimal sketch of the five-step pipeline is given after this list)
  - Step-1: Task Definition and Target Identification
  - Step-2: Object Tracking
  - Step-3: Action Analyzing
  - Step-4: Question Answering via Ranking
  - Step-5: Answer Verification
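
The following is a minimal, illustrative sketch of how the five VoT steps could be chained on top of a video MLLM. The `VideoMLLM` wrapper, its `chat(video, prompt)` method, and all prompt strings are assumptions for exposition only, not the repository's actual API; see [predict.py](predict.py) for the real implementation.

```python
# Illustrative sketch of the five-step Video-of-Thought loop (not the real API).
from dataclasses import dataclass
from typing import List


@dataclass
class VideoMLLM:
    """Stand-in for a video MLLM such as MotionEpic (hypothetical interface)."""
    name: str = "MotionEpic"

    def chat(self, video: str, prompt: str) -> str:
        # A real model would ground the prompt against the video and its STSG
        # and return generated text; this placeholder only marks the interface.
        raise NotImplementedError


def video_of_thought(model: VideoMLLM, video: str, question: str, candidates: List[str]) -> str:
    # Step-1: task definition and target identification.
    targets = model.chat(video, f"Question: {question}\nWhich objects in the video must be examined to answer it?")
    # Step-2: object tracking -- follow the identified targets across frames.
    tracks = model.chat(video, f"Track these targets over time and describe their trajectories: {targets}")
    # Step-3: action analyzing -- interpret what the tracked motion implies.
    actions = model.chat(video, f"Given the tracking result:\n{tracks}\nAnalyze the actions and their implications.")
    # Step-4: question answering via ranking of the candidate answers.
    ranking = model.chat(
        video,
        f"Question: {question}\nCandidates: {candidates}\n"
        f"Using the analysis:\n{actions}\nRank the candidates from most to least likely.",
    )
    # Step-5: answer verification -- double-check the top-ranked answer.
    return model.chat(video, f"Verify the top-ranked answer in:\n{ranking}\nReturn the final answer only.")
```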



----------
## Installation

Please first clone the repo and set up the required environment by running the following commands:
```
git clone https://github.com/scofield7419/Video-of-Thought.git
cd Video-of-Thought

conda create -n motionepic python=3.8
conda activate motionepic

#### CUDA 12.1
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

pip install -r requirements.txt
```
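
Optionally, a quick sanity check (a minimal sketch, assuming the environment above) to confirm that PyTorch, `transformers`, and the GPU are visible:

```python
# Optional environment sanity check.
import torch
import transformers

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```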

----

## Training
First, prepare the datasets, including [Action Genome](https://github.com/JingweiJ/ActionGenome), [WebVid](data/webvid/prepare.md), MSR-VTT, and [ActivityNet](http://activity-net.org/).
Then, modify the `DATASET_NAME_LIST` parameter in the training scripts (e.g., `finetune.sh`) to select the datasets used for training and fine-tuning.
Next, run the following commands for alignment learning and fine-tuning:
```
#### for alignment learning
bash pretrain.sh
#### for fine-tuning
bash finetune.sh
```

----

## Inference
We implement the CoT-based inference (i.e., VoT); please refer to [predict.py](predict.py) for details.
Run the following command to obtain the results:
```
python predict.py
```
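
If fine-tuning used LoRA (as in `finetune.sh`), the adapter can first be merged into the base checkpoint with `merge_lora_weights.py` (added in this commit). A minimal programmatic sketch that simply reuses that script's defaults:

```python
# Merge the fine-tuned LoRA adapter into the base checkpoint before inference,
# reusing merge_lora_weights.py; the paths are that script's argparse defaults.
from argparse import Namespace

from merge_lora_weights import merge_lora

merge_lora(Namespace(
    model_path="./checkpoints/finetune",                      # LoRA adapter from finetune.sh
    model_base="./checkpoints/pretrain",                      # base checkpoint from alignment learning
    save_model_path="./checkpoints/motionepic-v1.5-7b-lora",  # merged output
))
```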

----

## Citation

If you use this work, please kindly cite:

```
@inproceedings{VoT24Hao,
  author    = {Hao Fei and Shengqiong Wu and Wei Ji and Hanwang Zhang and Meishan Zhang and Mong-Li Lee and Wynne Hsu},
  title     = {Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2024}
}
```


----------

## Acknowledgement
Our code is based on the respective official repositories of [NExT-GPT](https://next-gpt.github.io) and [graphtransformer](https://github.com/graphdeeplearning/graphtransformer/). We sincerely thank the authors for releasing their code.




### License

The code is released under the Apache License 2.0 for non-commercial use only.



----------


### Contact

For any questions, feel free to contact [Hao Fei](mailto:haofei37@nus.edu.sg).



File renamed without changes
File renamed without changes
Binary file added assets/intro.png
Empty file added data/AG/prepare.md
Empty file.
Empty file added data/ActivityNet/prepare.md
Empty file.
Empty file added data/NExT-QA/prepare.md
Empty file.
20 changes: 20 additions & 0 deletions data/webvid/config.yaml
@@ -0,0 +1,20 @@
subsampling: {}

reading:
  yt_args:
    download_size: 360
    download_audio_rate: 44100
  yt_metadata_args: null
  timeout: 60
  sampler: null

storage:
  number_sample_per_shard: 1000
  oom_shard_count: 5
  captions_are_subtitles: False

distribution:
  processes_count: 16
  thread_count: 16
  subjob_size: 1000
  distributor: "multiprocessing"
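
For reference, this config is plain YAML and can be inspected programmatically; a small sketch, assuming PyYAML is installed:

```python
# Load and inspect the video2dataset config (assumes PyYAML is installed).
import yaml

with open("data/webvid/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["distribution"]["processes_count"])     # -> 16
print(cfg["storage"]["number_sample_per_shard"])  # -> 1000
```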
13 changes: 13 additions & 0 deletions data/webvid/download.sh
@@ -0,0 +1,13 @@
#!/bin/bash

# Download the WebVid-2M metadata first (uncomment if not yet fetched):
# wget -nc http://www.robots.ox.ac.uk/~maxbain/webvid/results_2M_train.csv

video2dataset --url_list="results_2M_train.csv" \
    --input_format="csv" \
    --output_format="webdataset" \
    --output_folder="dataset" \
    --url_col="contentUrl" \
    --caption_col="name" \
    --save_additional_columns='[videoid,page_idx,page_dir,duration]' \
    --enable_wandb=True \
    --config="path/to/config.yaml"
46 changes: 46 additions & 0 deletions data/webvid/prepare.md
@@ -0,0 +1,46 @@
## Preparation

WebVid is a large-scale text-video dataset containing 10 million video-text pairs scraped from stock footage sites.
To download the 2M training split, run the following commands:

```
wget -nc http://www.robots.ox.ac.uk/~maxbain/webvid/results_2M_train.csv
video2dataset --url_list="results_2M_train.csv" \
    --input_format="csv" \
    --output_format="webdataset" \
    --output_folder="dataset" \
    --url_col="contentUrl" \
    --caption_col="name" \
    --save_additional_columns='[videoid,page_idx,page_dir,duration]' \
    --enable_wandb=True \
    --config="path/to/config.yaml"
```
For more details, please refer to [video2dataset](https://github.com/iejMac/video2dataset/blob/main/dataset_examples/WebVid.md).


### Postprocess
Once you've downloaded the dataset, please verify the download status, as some video files may not have been successfully downloaded. Afterward, organize the dataset into a json file with the following format:
```
[
    {
        "caption": "Merida, mexico - may 23, 2017: tourists are walking on a roadside near catholic church in the street of mexico at sunny summer day.",
        "video_name": "31353427.mp4"
    },
    {
        "caption": "Happy family using laptop on bed at home",
        "video_name": "14781349.mp4"
    },
    ...
]
```

The data file structure should be:
```
data/T-X_pair_data/webvid
├── webvid.json
├── videos
| ├── 31353427.mp4
| ├── 14781349.mp4
| └── ...
```
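
To assemble `webvid.json` in the format above, something like the following could be used. It is a minimal sketch that assumes each extracted video `<id>.mp4` sits next to a video2dataset metadata sidecar `<id>.json` whose `"caption"` field holds the text; adjust it to however your shards were actually extracted.

```python
# Sketch: build webvid.json from extracted video2dataset samples.
# Assumes each <id>.mp4 has a sibling <id>.json metadata file whose
# "caption" field holds the text; videos that failed to download are skipped.
import json
from pathlib import Path

video_dir = Path("data/T-X_pair_data/webvid/videos")
records = []

for video_path in sorted(video_dir.glob("*.mp4")):
    meta_path = video_path.with_suffix(".json")
    if not meta_path.exists():  # missing metadata -> likely a failed download
        continue
    caption = json.loads(meta_path.read_text()).get("caption", "")
    if not caption:
        continue
    records.append({"caption": caption, "video_name": video_path.name})

out_path = video_dir.parent / "webvid.json"
out_path.write_text(json.dumps(records, indent=2))
print(f"wrote {len(records)} entries to {out_path}")
```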
15 changes: 15 additions & 0 deletions data/webvid/preprocess.sh
@@ -0,0 +1,15 @@
#!/bin/bash

# Extract the downloaded WebVid shards (00100.tar .. 00226.tar) into the
# videos folder, cleaning up leftover .txt sidecar files before each shard.
for i in {100..226}; do
    rm -f ./data/webvid/videos/*.txt
    echo "Extracting 00${i}.tar"
    tar -xf ./data/webvid/dataset/00${i}.tar -C ./data/webvid/videos/
done
56 changes: 56 additions & 0 deletions finetune.sh
@@ -0,0 +1,56 @@
#!/bin/bash



# =================== Fine-tuning ======================
DATASET_NAME_LIST=(
""
)
DATASET_NAME_LIST="${DATASET_NAME_LIST[@]}"

LLM_MODEL_NAME="./pretrain_ckpt/vicuna-7b-v1.5"
MM_MODEL_NAME="./pretrain_ckpt/clip"

echo "DATASET_NAME_LIST: $DATASET_NAME_LIST"
echo "LLM_MODEL_NAME: $LLM_MODEL_NAME"
echo "MM_MODEL_NAME: $MM_MODEL_NAME"


accelerate launch --main_process_port 8922 train_mem.py \
--lora_enable True --lora_r 128 --lora_alpha 256 \
--mm_input_projector_lr 2e-5 --mm_output_projector_lr 2e-5 \
--deepspeed ./scripts/zero2.json \
--model_name_or_path $LLM_MODEL_NAME \
--version v1 \
--dataset_name_list $DATASET_NAME_LIST \
--multimodal_tower $MM_MODEL_NAME \
--group_by_modality_length True \
--group_by_modality_type False \
--pretrain_mm_input_adapter ./checkpoints/pretrain/mm_input_projector.bin \
--tune_mm_input_adapter True \
--freeze_mm_input_adapter False \
--mm_input_projector_type mlp \
--mm_use_vid_start_end False \
--mm_use_vid_patch_token False \
--image_aspect_ratio pad \
--bf16 True \
--output_dir ./checkpoints/finetune \
--num_train_epochs 1 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to tensorboard
23 changes: 23 additions & 0 deletions merge_lora_weights.py
@@ -0,0 +1,23 @@
import argparse

from motionepic.model.builder import load_pretrained_model
from motionepic.mm_utils import get_model_name_from_path


def merge_lora(args):
    # Load the LoRA fine-tuned checkpoint (--model-path) together with its
    # base model (--model-base) on CPU; the hard-coded name marks it as a LoRA run.
    # model_name = get_model_name_from_path(args.model_path)
    model_name = 'motionepic-v1.5-7b-lora'
    tokenizer, model, video_processor, context_len, model_config = load_pretrained_model(
        args.model_path, args.model_base, model_name, device_map='cpu')

    # Save the merged weights and tokenizer as a standalone checkpoint.
    model.save_pretrained(args.save_model_path)
    tokenizer.save_pretrained(args.save_model_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, default='./checkpoints/finetune')
    parser.add_argument("--model-base", type=str, default='./checkpoints/pretrain')
    parser.add_argument("--save-model-path", type=str, default='./checkpoints/motionepic-v1.5-7b-lora')

    args = parser.parse_args()
    merge_lora(args)
Empty file added motionepic/__init__.py
Empty file.
36 changes: 36 additions & 0 deletions motionepic/constants.py
@@ -0,0 +1,36 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

CONTROLLER_HEART_BEAT_EXPIRATION = 30
WORKER_HEART_BEAT_INTERVAL = 15
LOGDIR = "."
IGNORE_INDEX = -100
VIDEO_TOKEN_INDEX = -300
DEFAULT_VIDEO_TOKEN = "<video>"
DEFAULT_VIDEO_PATCH_TOKEN = "<vid_patch>"
DEFAULT_VID_START_TOKEN = "<vid_start>"
DEFAULT_VID_END_TOKEN = "<vid_end>"
VIDEO_PLACEHOLDER = "<video-placeholder>"
MAX_VIDEO_LENGTH = 16

SG_TOKEN_INDEX = -200
MAX_SG_LENGTH = 16
SG_PLACEHOLDER = "<sg-placeholder>"
DEFAULT_SG_TOKEN = "<sg>"
DEFAULT_SG_PATCH_TOKEN = "<sg_patch>"
DEFAULT_SG_START_TOKEN = "<sg_start>"
DEFAULT_SG_END_TOKEN = "<sg_end>"
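
For intuition on how sentinel constants such as `VIDEO_TOKEN_INDEX` are typically consumed, here is an illustrative, LLaVA-style sketch; it is not MotionEpic's actual preprocessing code, and the toy `fake_encode` tokenizer is purely hypothetical.

```python
# Illustrative only (LLaVA-style): the textual "<video>" placeholder in a prompt
# is replaced by a reserved negative id so the model can later splice video
# features at that position. NOT MotionEpic's actual code.
from typing import Callable, List

VIDEO_TOKEN_INDEX = -300
DEFAULT_VIDEO_TOKEN = "<video>"


def tokenize_with_video_placeholder(prompt: str, encode: Callable[[str], List[int]]) -> List[int]:
    """Encode the text chunks normally and insert VIDEO_TOKEN_INDEX wherever "<video>" appears."""
    input_ids: List[int] = []
    for i, chunk in enumerate(prompt.split(DEFAULT_VIDEO_TOKEN)):
        if i > 0:
            input_ids.append(VIDEO_TOKEN_INDEX)  # marks where video features will be inserted
        input_ids.extend(encode(chunk))
    return input_ids


# Toy usage with a fake tokenizer that maps each word to its length.
fake_encode = lambda text: [len(word) for word in text.split()]
print(tokenize_with_video_placeholder("Describe <video> briefly.", fake_encode))
```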


