Tasks: Add video-text-to-text task page #878

Merged
merged 19 commits into from
Sep 6, 2024
Changes from 5 commits
2 changes: 1 addition & 1 deletion packages/tasks/src/tasks/image-text-to-text/data.ts
@@ -60,7 +60,7 @@ const taskData: TaskDataCustom = {
},
{
description: "Strong image-text-to-text model.",
- id: "llava-hf/llava-v1.6-mistral-7b-hf",
+ id: "microsoft/Phi-3.5-vision-instruct",
},
],
spaces: [
99 changes: 99 additions & 0 deletions packages/tasks/src/tasks/video-text-to-text/about.md
@@ -0,0 +1,99 @@
Most video language models accept a single video, multiple videos, a single image, or multiple images as input. Some of these models also accept interleaved inputs, in which images and videos appear inside the text, so that the text prompt can refer to a specific input image or video.
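
As a rough sketch of what an interleaved input might look like, the snippet below builds a user message in the same message format as the Inference example on this page, mixing one image placeholder and one video placeholder with text. The `{"type": "image"}` entry, the prompt wording, and the helper `count_placeholders` are assumptions made for illustration; whether a given checkpoint accepts this mix depends on its processor.

```python
# Hypothetical interleaved prompt: each text segment refers to the media
# placeholder that precedes it. The actual image and video data would be
# passed to the processor separately, in the same order.
interleaved_conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "This image shows a cat breed chart."},
            {"type": "video"},
            {"type": "text", "text": "Which breed from the chart appears in the video?"},
        ],
    },
]


def count_placeholders(conversation, kind):
    """Count content entries of a given type, e.g. "image" or "video"."""
    return sum(
        1
        for message in conversation
        for item in message["content"]
        if item["type"] == kind
    )


print(count_placeholders(interleaved_conversation, "image"))
print(count_placeholders(interleaved_conversation, "video"))
```

Counting the placeholders before calling the processor is one simple way to check that the number of images and videos supplied matches what the prompt refers to.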

## Different Types of Video Language Models

Video language models come in three types:

- **Base:** Pre-trained models that can be fine-tuned.
- **Instruction:** Base models fine-tuned on video-instruction pairs and answers.
- **Chatty/Conversational:** Base models fine-tuned on video conversation datasets.


## Use Cases

### Video Question Answering

Video language models trained on video-question-answer pairs can be used for video question answering and generating captions for videos.

### Video Chat

Video language models can be used to have a dialogue about a video.
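
A minimal sketch of how such a dialogue could be represented, assuming the same message format as the Inference example on this page. The helper `add_turn` and the hard-coded assistant reply are hypothetical; in practice the reply would come from the model's generated output.

```python
# Sketch of a multi-turn dialogue about one video. Only the first user turn
# carries the video placeholder; later turns are text only.
chat = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this video?"},
            {"type": "video"},
        ],
    },
]


def add_turn(conversation, role, text):
    """Append a text-only turn (hypothetical helper, not a library function)."""
    conversation.append({"role": role, "content": [{"type": "text", "text": text}]})
    return conversation


# The assistant text is hard-coded here to keep the sketch self-contained.
add_turn(chat, "assistant", "A cat is playing with a toy.")
add_turn(chat, "user", "What color is the toy?")

print([turn["role"] for turn in chat])
```

The resulting list could then be formatted with `processor.apply_chat_template`, as in the Inference section, so that the model sees the full history when answering the follow-up question.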

### Video Recognition with Instructions

Video language models can recognize the content of a video from descriptions. When given detailed descriptions of specific entities, they can classify those entities in a video.

## Inference

You can use the Transformers library to interact with video-language models.
Below we load [a video language model](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf), write a simple utility to sample frames from a video, use the chat template to format the text prompt, process the video and the text prompt, and run inference.

```python
import uuid
import requests
import cv2
import torch
from PIL import Image
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(device)

processor = LlavaNextVideoProcessor.from_pretrained(model_id)

def sample_frames(url, num_frames):
    # Download the video to a local file so OpenCV can read it.
    response = requests.get(url)
    path_id = str(uuid.uuid4())
    path = f"./{path_id}.mp4"

    with open(path, "wb") as f:
        f.write(response.content)

    video = cv2.VideoCapture(path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    # Avoid a zero interval when the video has fewer frames than requested.
    interval = max(total_frames // num_frames, 1)
    frames = []
    for i in range(total_frames):
        ret, frame = video.read()
        # Skip unreadable frames before trying to convert them.
        if not ret:
            continue
        if i % interval == 0:
            # OpenCV decodes frames as BGR; convert to RGB for PIL.
            pil_img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append(pil_img)
    video.release()
    return frames

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

video_url = "https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_1.mp4"
video = sample_frames(video_url, 8)

inputs = processor(text=prompt, videos=video, padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

# Why is this video funny? ASSISTANT: The humor in this video comes from the cat's facial expression and body language. The cat appears to be making a funny face, with its eyes squinted and mouth open, which can be interpreted as a playful or mischievous expression. Cats often make such faces when they are in a good mood or are playful, and this can be amusing to people who are familiar with their behavior. The combination of the cat's expression and the close-

```
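
A sampler that keeps every frame whose index is a multiple of `total_frames // num_frames` can return more frames than requested: with 100 frames and `num_frames=8`, the interval is 12 and indices 0, 12, ..., 96 give 9 frames. The sketch below shows one possible alternative that computes at most `num_frames` evenly spaced indices up front. The helper name `uniform_frame_indices` is hypothetical and not part of Transformers or OpenCV.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Return up to `num_frames` evenly spaced frame indices.

    Hypothetical helper: handles videos shorter than `num_frames` and never
    returns more than `num_frames` indices.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    if total_frames <= num_frames:
        # Short video: keep every frame.
        return list(range(total_frames))
    # Integer arithmetic keeps indices strictly increasing and below total_frames.
    return [i * total_frames // num_frames for i in range(num_frames)]


print(uniform_frame_indices(100, 8))  # [0, 12, 25, 37, 50, 62, 75, 87]
print(uniform_frame_indices(5, 8))    # [0, 1, 2, 3, 4]
```

The returned indices could be turned into a set and checked inside the frame-reading loop in place of the modulo test, so that the number of frames passed to the processor stays predictable.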

## Useful Resources
- [Transformers task guide on video-text-to-text](https://huggingface.co/docs/transformers/tasks/video_text_to_text)
59 changes: 59 additions & 0 deletions packages/tasks/src/tasks/video-text-to-text/data.ts
@@ -0,0 +1,59 @@
import type { TaskDataCustom } from "..";

const taskData: TaskDataCustom = {
datasets: [
{
description: "Multiple-choice questions and answers about videos.",
id: "lmms-lab/Video-MME",
},
{
description: "A dataset of instructions and question-answer pairs about videos.",
id: "lmms-lab/VideoChatGPT",
},

],
demo: {
inputs: [
{
filename: "video-text-to-text-input.gif",
type: "video",
},
{
label: "Text Prompt",
content: "What is happening in this video?",
type: "text",
},
],
outputs: [
{
label: "Answer",
content:
"The video shows a series of images showing a fountain with water jets and a variety of colorful flowers and butterflies in the background.",
type: "text",
},
],
},
metrics: [],
models: [
{
description: "A robust video-text-to-text model that can take in image and video inputs.",
id: "llava-hf/llava-onevision-qwen2-72b-ov-hf",
},
{
description: "Large and powerful video-text-to-text model that can take in image and video inputs.",
id: "llava-hf/LLaVA-NeXT-Video-34B-hf",
},
],
spaces: [
{
description: "An application to chat with a video-text-to-text model.",
id: "llava-hf/video-llava",
},
],
summary:
"Video-text-to-text models take in a video and a text prompt and output text. These models are also called video-language models.",
widgetModels: [""],
youtubeId: "",
};

export default taskData;