Tasks: Add video-text-to-text task page #878
Merged
**`packages/tasks/src/tasks/video-text-to-text/about.md`**
Most video language models can take in a single video, multiple videos, a single image, or multiple images. Some of these models also accept interleaved inputs, where images and videos are embedded within the text, so the prompt can refer back to specific input images and videos.
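As a concrete illustration, an interleaved prompt can be expressed as a list of typed content entries, with placeholder entries marking where each video or image appears in the text. This sketch follows the Transformers chat-template convention for content lists; the wording of the prompt is made up for illustration.

```python
# An interleaved user turn: text, a video placeholder, more text, an image
# placeholder, and a final question referring back to both inputs.
interleaved_conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is a clip:"},
            {"type": "video"},
            {"type": "text", "text": "and a reference photo:"},
            {"type": "image"},
            {"type": "text", "text": "Does the animal in the photo appear in the clip?"},
        ],
    },
]
```

A processor's `apply_chat_template` can then turn this structure into a single text prompt with the video and image tokens in the right positions.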
## Different Types of Video Language Models

Video language models come in three types:

- **Base:** Pre-trained models that can be fine-tuned.
- **Instruction:** Base models fine-tuned on video-instruction pairs and answers.
- **Chatty/Conversational:** Base models fine-tuned on video conversation datasets.
## Use Cases

### Video Question Answering

Video language models trained on video-question-answer pairs can be used for video question answering and generating captions for videos.
### Video Chat

Video language models can be used to have a dialogue about a video.
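A dialogue is just a growing list of turns: earlier user and assistant messages stay in the history so the model sees the full conversation at each step. The sketch below only manages the conversation structure; the model and processor calls are the same as in the inference example further down, and the example answers here are invented for illustration.

```python
# Running chat history about a single video. Only the first user turn carries
# the video placeholder; follow-up questions refer back to the same video.
history = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "What is the cat doing?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "The cat is rolling on the floor."}],
    },
]


def add_user_turn(history, question):
    # Append a text-only follow-up turn to the existing dialogue.
    history.append({"role": "user", "content": [{"type": "text", "text": question}]})
    return history


add_user_turn(history, "Why might it be doing that?")
```

After each model reply, you would append an `assistant` turn with the generated text and repeat.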
### Video Recognition with Instructions

Video language models can recognize entities through textual descriptions: when given detailed descriptions of specific entities, they can classify those entities as they appear in a video.
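One way to set this up is to fold the entity descriptions into the text prompt and ask the model to answer with one of the listed names. This is a hypothetical sketch; the entity names and descriptions are made up, and the resulting string would be used as the text part of a conversation like the one in the inference example.

```python
# Turn entity descriptions into a classification-style instruction prompt.
entity_descriptions = {
    "golden retriever": "a large dog with a wavy golden coat",
    "tabby cat": "a cat with a striped grey-brown coat",
}


def build_recognition_prompt(descriptions):
    # One "- name: description" line per candidate entity.
    lines = [f"- {name}: {desc}" for name, desc in descriptions.items()]
    return (
        "You will watch a video. Classify the main animal using these descriptions:\n"
        + "\n".join(lines)
        + "\nAnswer with the entity name only."
    )
```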
## Inference

You can use the Transformers library to interact with video-language models. Below we load [a video language model](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf), write a small utility to sample frames from a video, format the text prompt with the chat template, process the video and the text prompt together, and run inference. To run the snippet below, install [OpenCV](https://pypi.org/project/opencv-python/) by running `pip install opencv-python`.
```python
import uuid

import cv2
import requests
import torch
from PIL import Image
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(device)

processor = LlavaNextVideoProcessor.from_pretrained(model_id)


def sample_frames(url, num_frames):
    # Download the video to a temporary local file.
    response = requests.get(url)
    path = f"./{uuid.uuid4()}.mp4"
    with open(path, "wb") as f:
        f.write(response.content)

    # Keep every `interval`-th frame, converting BGR (OpenCV) to RGB (PIL).
    video = cv2.VideoCapture(path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    interval = total_frames // num_frames
    frames = []
    for i in range(total_frames):
        ret, frame = video.read()
        if not ret:
            continue
        if i % interval == 0:
            pil_img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append(pil_img)
    video.release()
    return frames


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

video_url = "https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_1.mp4"
video = sample_frames(video_url, 8)

inputs = processor(text=prompt, videos=video, padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

# Why is this video funny? ASSISTANT: The humor in this video comes from the cat's facial expression and body language. The cat appears to be making a funny face, with its eyes squinted and mouth open, which can be interpreted as a playful or mischievous expression. Cats often make such faces when they are in a good mood or are playful, and this can be amusing to people who are familiar with their behavior. The combination of the cat's expression and the close-
```
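Note that sampling every `interval`-th frame, as `sample_frames` above does, can return slightly more than `num_frames` frames and breaks when the video has fewer frames than requested (the interval becomes zero). A more predictable alternative, shown here as a sketch rather than part of the original snippet, is to compute evenly spaced frame indices up front and read only those:

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    # Evenly spaced indices across the whole video; if the video is shorter
    # than num_frames, just return every frame index once.
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]
```

For example, `uniform_frame_indices(100, 8)` yields exactly eight indices spread across the clip, and a 5-frame video simply returns all five indices.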
## Useful Resources

- [Transformers task guide on video-text-to-text](https://huggingface.co/docs/transformers/tasks/video_text_to_text)
**`packages/tasks/src/tasks/video-text-to-text/data.ts`**
```typescript
import type { TaskDataCustom } from "..";

const taskData: TaskDataCustom = {
	datasets: [
		{
			description: "Multiple-choice questions and answers about videos.",
			id: "lmms-lab/Video-MME",
		},
		{
			description: "A dataset of instructions and question-answer pairs about videos.",
			id: "lmms-lab/VideoChatGPT",
		},
	],
	demo: {
		inputs: [
			{
				filename: "video-text-to-text-input.gif",
				type: "img",
			},
			{
				label: "Text Prompt",
				content: "What is happening in this video?",
				type: "text",
			},
		],
		outputs: [
			{
				label: "Answer",
				content:
					"The video shows a series of images showing a fountain with water jets and a variety of colorful flowers and butterflies in the background.",
				type: "text",
			},
		],
	},
	metrics: [],
	models: [
		{
			description: "A robust video-text-to-text model that can take in image and video inputs.",
			id: "llava-hf/llava-onevision-qwen2-72b-ov-hf",
		},
		{
			description: "Large and powerful video-text-to-text model that can take in image and video inputs.",
			id: "llava-hf/LLaVA-NeXT-Video-34B-hf",
		},
	],
	spaces: [
		{
			description: "An application to chat with a video-text-to-text model.",
			id: "llava-hf/video-llava",
		},
	],
	summary:
		"Video-text-to-text models take in a video and a text prompt and output text. These models are also called video-language models.",
	widgetModels: [""],
	youtubeId: "",
};

export default taskData;
```