Commit 4d4fad2 ("1st")
scofield7419 committed Dec 3, 2024 · 1 parent 0b90927
Showing 48 changed files with 6,492 additions and 28 deletions.
124 changes: 96 additions & 28 deletions README.md
<a href="https://github.com/scofield7419/Video-of-Thought">
<img src="https://img.shields.io/badge/VoT-1.0-blue" alt="pytorch 1.8.1">
</a>
<a href="https://github.com/scofield7419/Video-of-Thought" rel="nofollow">
<img src="https://img.shields.io/badge/MLLM-1.0-red" alt="pytorch 1.8.1">
</a>
<a href="https://huggingface.co/docs/transformers/index" rel="nofollow">
<img src="https://img.shields.io/badge/transformers-4.24.0-green" alt="Build Status">
</a>
<h1 align="center">
🤔🎞️ Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
</h1>

**ICML (Oral) 2024**

[Hao Fei](http://haofei.vip/), [Shengqiong Wu](https://chocowu.github.io/), Wei Ji, [Hanwang Zhang](https://personal.ntu.edu.sg/hanwangzhang/), [Meishan Zhang](https://zhangmeishan.github.io/), [Mong Li Lee](https://www.comp.nus.edu.sg/~leeml/), and [Wynne Hsu](https://www.comp.nus.edu.sg/~whsu/)


<a href='https://haofei.vip/VoT/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href='https://openreview.net/pdf?id=fO31YAyNbI'><img src='https://img.shields.io/badge/Paper-PDF-orange'></a>
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/2fKCWjetV-Y)
![License](https://img.shields.io/badge/License-BSD-blue.svg)
<a href="https://pytorch.org" rel="nofollow">
<img src="https://img.shields.io/badge/pytorch-1.10.0-orange" alt="pytorch 1.8.1">
</a>


**The implementation of the ICML 2024 paper [Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition](https://is.gd/fcfZeO)**

### 🎉 Visit the project page: [VoT](http://haofei.vip/VoT/)

----------
## Abstract
Existing research on video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension.
This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating video spatial-temporal scene graph (STSG) representation.
Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler and manageable sub-problems, and addressing them step-by-step from low-level pixel perception to high-level cognitive interpretation.
Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly boosts the existing state-of-the-art.

![framework](./assets/intro.png)


## Overview<a name="overview" />
> Following the CoT idea, VoT decomposes complex video understanding into multiple steps from low to high levels, enabling not only pixel perceptive recognition but also cognitive understanding of videos.

<p align="center">
<img src="./figures/VoT.png" width="550"/>
<img src="./assets/VoT.png" width="550"/>
</p>

> We also introduce a novel video MLLM, namely MotionEpic, which supports not only video input but also the encoding, understanding and generation of STSGs.

<p align="center">
<img src="./figures/MotionEpic.png" width="650"/>
<img src="./assets/MotionEpic.png" width="650"/>
</p>


----------
## Method

- **MotionEpic: Fine-grained Spatial-temporal Grounded Video MLLM**
  - We employ [Vicuna-7B (v1.5)](https://huggingface.co/lmsys/vicuna-7b-v1.5) as the backbone LLM.
  - To perceive video input, we adopt the [ViT-L/14](https://huggingface.co/openai/clip-vit-large-patch14) encoder and a Q-Former projector.
  - MotionEpic is designed to support the STSG signal, for which we retrofit the Graph Transformer with recurrent propagation to encode multi-frame STSG information.

- **Video-of-Thought Reasoning Framework** (a minimal sketch of the five-step pipeline is given after this list)
  - Step-1: Task Definition and Target Identification
  - Step-2: Object Tracking
  - Step-3: Action Analyzing
  - Step-4: Question Answering via Ranking
  - Step-5: Answer Verification
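
The following is a minimal, illustrative sketch of how the five VoT steps could be chained on top of a video MLLM. The `VideoMLLM` wrapper, its `chat(video, prompt)` method, and all prompt strings are assumptions for exposition only, not the repository's actual API; see [predict.py](predict.py) for the real implementation.

```python
# Illustrative sketch of the five-step Video-of-Thought loop (not the real API).
from dataclasses import dataclass
from typing import List


@dataclass
class VideoMLLM:
    """Stand-in for a video MLLM such as MotionEpic (hypothetical interface)."""
    name: str = "MotionEpic"

    def chat(self, video: str, prompt: str) -> str:
        # A real model would ground the prompt against the video and its STSG
        # and return generated text; this placeholder only marks the interface.
        raise NotImplementedError


def video_of_thought(model: VideoMLLM, video: str, question: str, candidates: List[str]) -> str:
    # Step-1: task definition and target identification.
    targets = model.chat(video, f"Question: {question}\nWhich objects in the video must be examined to answer it?")
    # Step-2: object tracking -- follow the identified targets across frames.
    tracks = model.chat(video, f"Track these targets over time and describe their trajectories: {targets}")
    # Step-3: action analyzing -- interpret what the tracked motion implies.
    actions = model.chat(video, f"Given the tracking result:\n{tracks}\nAnalyze the actions and their implications.")
    # Step-4: question answering via ranking of the candidate answers.
    ranking = model.chat(
        video,
        f"Question: {question}\nCandidates: {candidates}\n"
        f"Using the analysis:\n{actions}\nRank the candidates from most to least likely.",
    )
    # Step-5: answer verification -- double-check the top-ranked answer.
    return model.chat(video, f"Verify the top-ranked answer in:\n{ranking}\nReturn the final answer only.")
```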



----------
## Installation

Please first clone the repo and set up the required environment by running the following commands:
```
git clone https://github.com/scofield7419/Video-of-Thought.git
cd Video-of-Thought

conda create -n motionepic python=3.8
conda activate motionepic

#### CUDA 12.1
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

pip install -r requirements.txt
```
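
Optionally, a quick sanity check (a minimal sketch, assuming the environment above) to confirm that PyTorch, `transformers`, and the GPU are visible:

```python
# Optional environment sanity check.
import torch
import transformers

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```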

----

## Training
First, prepare the datasets, including [Action Genome](https://github.com/JingweiJ/ActionGenome), [WebVid](data/webvid/prepare.md), MSR-VTT, and [ActivityNet](http://activity-net.org/).
Then, modify the `DATASET_NAME_LIST` parameter in the training scripts (e.g., `finetune.sh`) to select the datasets used for training and fine-tuning.
Next, run the following commands for alignment learning and fine-tuning:
```
#### for alignment learning
bash pretrain.sh
#### for fine-tuning
bash finetune.sh
```

----

## Inference
We implement the CoT-based inference (i.e., VoT); please refer to [predict.py](predict.py) for details.
Run the following command to obtain the results:
```
python predict.py
```
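
If fine-tuning used LoRA (as in `finetune.sh`), the adapter can first be merged into the base checkpoint with `merge_lora_weights.py` (added in this commit). A minimal programmatic sketch that simply reuses that script's defaults:

```python
# Merge the fine-tuned LoRA adapter into the base checkpoint before inference,
# reusing merge_lora_weights.py; the paths are that script's argparse defaults.
from argparse import Namespace

from merge_lora_weights import merge_lora

merge_lora(Namespace(
    model_path="./checkpoints/finetune",                      # LoRA adapter from finetune.sh
    model_base="./checkpoints/pretrain",                      # base checkpoint from alignment learning
    save_model_path="./checkpoints/motionepic-v1.5-7b-lora",  # merged output
))
```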

----

## Citation

If you use this work, please kindly cite:

```
@inproceedings{VoT24Hao,
  author    = {Hao Fei and Shengqiong Wu and Wei Ji and Hanwang Zhang and Meishan Zhang and Mong-Li Lee and Wynne Hsu},
  title     = {Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2024}
}
```


----------

## Acknowledgement
Our code is based on the respective official repositories of [NExT-GPT](https://next-gpt.github.io) and [graphtransformer](https://github.com/graphdeeplearning/graphtransformer/). We sincerely thank the authors for releasing their code.




### License

The code is released under the Apache License 2.0 for non-commercial use only.



----------


### Contact

For any questions, feel free to contact [Hao Fei](mailto:haofei37@nus.edu.sg).



File renamed without changes
File renamed without changes
Binary file added assets/intro.png
Empty file added data/AG/prepare.md
Empty file.
Empty file added data/ActivityNet/prepare.md
Empty file.
Empty file added data/NExT-QA/prepare.md
Empty file.
20 changes: 20 additions & 0 deletions data/webvid/config.yaml
@@ -0,0 +1,20 @@
subsampling: {}

reading:
  yt_args:
    download_size: 360
    download_audio_rate: 44100
  yt_metadata_args: null
  timeout: 60
  sampler: null

storage:
  number_sample_per_shard: 1000
  oom_shard_count: 5
  captions_are_subtitles: False

distribution:
  processes_count: 16
  thread_count: 16
  subjob_size: 1000
  distributor: "multiprocessing"
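
For reference, this config is plain YAML and can be inspected programmatically; a small sketch, assuming PyYAML is installed:

```python
# Load and inspect the video2dataset config (assumes PyYAML is installed).
import yaml

with open("data/webvid/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["distribution"]["processes_count"])     # -> 16
print(cfg["storage"]["number_sample_per_shard"])  # -> 1000
```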
13 changes: 13 additions & 0 deletions data/webvid/download.sh
@@ -0,0 +1,13 @@
#!/bin/bash

# Download the WebVid-2M metadata first (uncomment if not yet fetched):
# wget -nc http://www.robots.ox.ac.uk/~maxbain/webvid/results_2M_train.csv

video2dataset --url_list="results_2M_train.csv" \
    --input_format="csv" \
    --output_format="webdataset" \
    --output_folder="dataset" \
    --url_col="contentUrl" \
    --caption_col="name" \
    --save_additional_columns='[videoid,page_idx,page_dir,duration]' \
    --enable_wandb=True \
    --config="path/to/config.yaml"
46 changes: 46 additions & 0 deletions data/webvid/prepare.md
@@ -0,0 +1,46 @@
## Preparation

WebVid is a large-scale text-video dataset containing 10 million video-text pairs scraped from stock footage sites.
To download the 2M training split, run the following commands:

```
wget -nc http://www.robots.ox.ac.uk/~maxbain/webvid/results_2M_train.csv
video2dataset --url_list="results_2M_train.csv" \
    --input_format="csv" \
    --output_format="webdataset" \
    --output_folder="dataset" \
    --url_col="contentUrl" \
    --caption_col="name" \
    --save_additional_columns='[videoid,page_idx,page_dir,duration]' \
    --enable_wandb=True \
    --config="path/to/config.yaml"
```
For more details, please refer to [video2dataset](https://github.com/iejMac/video2dataset/blob/main/dataset_examples/WebVid.md).


### Postprocess
Once you've downloaded the dataset, please verify the download status, as some video files may not have been successfully downloaded. Afterward, organize the dataset into a json file with the following format:
```
[
    {
        "caption": "Merida, mexico - may 23, 2017: tourists are walking on a roadside near catholic church in the street of mexico at sunny summer day.",
        "video_name": "31353427.mp4"
    },
    {
        "caption": "Happy family using laptop on bed at home",
        "video_name": "14781349.mp4"
    },
    ...
]
```

The data file structure should be:
```
data/T-X_pair_data/webvid
├── webvid.json
├── videos
| ├── 31353427.mp4
| ├── 14781349.mp4
| └── ...
```
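
To assemble `webvid.json` in the format above, something like the following could be used. It is a minimal sketch that assumes each extracted video `<id>.mp4` sits next to a video2dataset metadata sidecar `<id>.json` whose `"caption"` field holds the text; adjust it to however your shards were actually extracted.

```python
# Sketch: build webvid.json from extracted video2dataset samples.
# Assumes each <id>.mp4 has a sibling <id>.json metadata file whose
# "caption" field holds the text; videos that failed to download are skipped.
import json
from pathlib import Path

video_dir = Path("data/T-X_pair_data/webvid/videos")
records = []

for video_path in sorted(video_dir.glob("*.mp4")):
    meta_path = video_path.with_suffix(".json")
    if not meta_path.exists():  # missing metadata -> likely a failed download
        continue
    caption = json.loads(meta_path.read_text()).get("caption", "")
    if not caption:
        continue
    records.append({"caption": caption, "video_name": video_path.name})

out_path = video_dir.parent / "webvid.json"
out_path.write_text(json.dumps(records, indent=2))
print(f"wrote {len(records)} entries to {out_path}")
```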
15 changes: 15 additions & 0 deletions data/webvid/preprocess.sh
@@ -0,0 +1,15 @@
#!/bin/bash

# Extract the downloaded WebVid shards (00100.tar .. 00226.tar) into the
# videos folder, cleaning up leftover .txt sidecar files before each shard.
for i in {100..226}; do
    rm -f ./data/webvid/videos/*.txt
    echo "Extracting 00${i}.tar"
    tar -xf ./data/webvid/dataset/00${i}.tar -C ./data/webvid/videos/
done
56 changes: 56 additions & 0 deletions finetune.sh
@@ -0,0 +1,56 @@
#!/bin/bash



# =================== Fine-tuning ======================
DATASET_NAME_LIST=(
""
)
DATASET_NAME_LIST="${DATASET_NAME_LIST[@]}"

LLM_MODEL_NAME="./pretrain_ckpt/vicuna-7b-v1.5"
MM_MODEL_NAME="./pretrain_ckpt/clip"

echo "DATASET_NAME_LIST: $DATASET_NAME_LIST"
echo "LLM_MODEL_NAME: $LLM_MODEL_NAME"
echo "MM_MODEL_NAME: $MM_MODEL_NAME"


accelerate launch --main_process_port 8922 train_mem.py \
--lora_enable True --lora_r 128 --lora_alpha 256 \
--mm_input_projector_lr 2e-5 --mm_output_projector_lr 2e-5 \
--deepspeed ./scripts/zero2.json \
--model_name_or_path $LLM_MODEL_NAME \
--version v1 \
--dataset_name_list $DATASET_NAME_LIST \
--multimodal_tower $MM_MODEL_NAME \
--group_by_modality_length True \
--group_by_modality_type False \
--pretrain_mm_input_adapter ./checkpoints/pretrain/mm_input_projector.bin \
--tune_mm_input_adapter True \
--freeze_mm_input_adapter False \
--mm_input_projector_type mlp \
--mm_use_vid_start_end False \
--mm_use_vid_patch_token False \
--image_aspect_ratio pad \
--bf16 True \
--output_dir ./checkpoints/finetune \
--num_train_epochs 1 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to tensorboard
23 changes: 23 additions & 0 deletions merge_lora_weights.py
@@ -0,0 +1,23 @@
import argparse

from motionepic.model.builder import load_pretrained_model
from motionepic.mm_utils import get_model_name_from_path


def merge_lora(args):
    # Load the LoRA fine-tuned checkpoint (--model-path) together with its
    # base model (--model-base) on CPU; the hard-coded name marks it as a LoRA run.
    # model_name = get_model_name_from_path(args.model_path)
    model_name = 'motionepic-v1.5-7b-lora'
    tokenizer, model, video_processor, context_len, model_config = load_pretrained_model(
        args.model_path, args.model_base, model_name, device_map='cpu')

    # Save the merged weights and tokenizer as a standalone checkpoint.
    model.save_pretrained(args.save_model_path)
    tokenizer.save_pretrained(args.save_model_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, default='./checkpoints/finetune')
    parser.add_argument("--model-base", type=str, default='./checkpoints/pretrain')
    parser.add_argument("--save-model-path", type=str, default='./checkpoints/motionepic-v1.5-7b-lora')

    args = parser.parse_args()
    merge_lora(args)
Empty file added motionepic/__init__.py
Empty file.
36 changes: 36 additions & 0 deletions motionepic/constants.py
@@ -0,0 +1,36 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

CONTROLLER_HEART_BEAT_EXPIRATION = 30
WORKER_HEART_BEAT_INTERVAL = 15
LOGDIR = "."
IGNORE_INDEX = -100
VIDEO_TOKEN_INDEX = -300
DEFAULT_VIDEO_TOKEN = "<video>"
DEFAULT_VIDEO_PATCH_TOKEN = "<vid_patch>"
DEFAULT_VID_START_TOKEN = "<vid_start>"
DEFAULT_VID_END_TOKEN = "<vid_end>"
VIDEO_PLACEHOLDER = "<video-placeholder>"
MAX_VIDEO_LENGTH = 16

SG_TOKEN_INDEX = -200
MAX_SG_LENGTH = 16
SG_PLACEHOLDER = "<sg-placeholder>"
DEFAULT_SG_TOKEN = "<sg>"
DEFAULT_SG_PATCH_TOKEN = "<sg_patch>"
DEFAULT_SG_START_TOKEN = "<sg_start>"
DEFAULT_SG_END_TOKEN = "<sg_end>"
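
For intuition on how sentinel constants such as `VIDEO_TOKEN_INDEX` are typically consumed, here is an illustrative, LLaVA-style sketch; it is not MotionEpic's actual preprocessing code, and the toy `fake_encode` tokenizer is purely hypothetical.

```python
# Illustrative only (LLaVA-style): the textual "<video>" placeholder in a prompt
# is replaced by a reserved negative id so the model can later splice video
# features at that position. NOT MotionEpic's actual code.
from typing import Callable, List

VIDEO_TOKEN_INDEX = -300
DEFAULT_VIDEO_TOKEN = "<video>"


def tokenize_with_video_placeholder(prompt: str, encode: Callable[[str], List[int]]) -> List[int]:
    """Encode the text chunks normally and insert VIDEO_TOKEN_INDEX wherever "<video>" appears."""
    input_ids: List[int] = []
    for i, chunk in enumerate(prompt.split(DEFAULT_VIDEO_TOKEN)):
        if i > 0:
            input_ids.append(VIDEO_TOKEN_INDEX)  # marks where video features will be inserted
        input_ids.extend(encode(chunk))
    return input_ids


# Toy usage with a fake tokenizer that maps each word to its length.
fake_encode = lambda text: [len(word) for word in text.split()]
print(tokenize_with_video_placeholder("Describe <video> briefly.", fake_encode))
```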


