[Draft] Add Multimodal RAG notebook #2497

Merged
merged 29 commits into latest from multimodal-rag on Dec 23, 2024
Commits
29 commits
ee3c3c5
first draft
openvino-dev-samples Oct 31, 2024
ebb5fbe
reformat
openvino-dev-samples Nov 1, 2024
5db6113
fix ci
openvino-dev-samples Nov 1, 2024
6a17791
add gradio demo
openvino-dev-samples Nov 4, 2024
e09b544
reformat
openvino-dev-samples Nov 4, 2024
9dfa8c4
update gradio UI
openvino-dev-samples Nov 4, 2024
669ad71
transfer to optimum-intel
openvino-dev-samples Nov 18, 2024
ec0126f
Merge branch 'latest' into multimodal-rag
openvino-dev-samples Nov 18, 2024
2a531e2
fix spelling
openvino-dev-samples Nov 18, 2024
655ab9c
add load image function
openvino-dev-samples Nov 19, 2024
7dffa6a
update the method of audio extraction
openvino-dev-samples Nov 19, 2024
044ec6b
replace video upload component
openvino-dev-samples Nov 20, 2024
7691683
update the screenshot display method
openvino-dev-samples Nov 20, 2024
7207e0b
replace llm test case
openvino-dev-samples Nov 21, 2024
89f9ec5
update with accuracy-aware quantization
openvino-dev-samples Nov 22, 2024
087787f
solve conflict
openvino-dev-samples Nov 22, 2024
a868ab5
solve conflict
openvino-dev-samples Nov 22, 2024
fd1df74
switch to int8 ASR
openvino-dev-samples Nov 22, 2024
c7456d3
skip macos
openvino-dev-samples Nov 27, 2024
023d915
solve conflict
openvino-dev-samples Nov 29, 2024
dfc16ae
reduce the number of frame saved
openvino-dev-samples Nov 29, 2024
0833101
solve conflict
openvino-dev-samples Dec 2, 2024
5203bc1
add skipped os
openvino-dev-samples Dec 6, 2024
316d3b2
Merge branch 'latest' into multimodal-rag
eaidova Dec 16, 2024
dfd4aed
update video url
openvino-dev-samples Dec 17, 2024
7290856
change the ASR model id
openvino-dev-samples Dec 17, 2024
6a6f0ce
update ov version
openvino-dev-samples Dec 18, 2024
c66976e
ignore mul-rag in docker ci
openvino-dev-samples Dec 19, 2024
650f625
Merge branch 'latest' into multimodal-rag
openvino-dev-samples Dec 20, 2024
3 changes: 3 additions & 0 deletions .ci/spellcheck/.pyspelling.wordlist.txt
@@ -600,6 +600,7 @@ OpenVINO
openvino
OpenVino
OpenVINO's
OpenVINOMultiModal
openvoice
OpenVoice
OpenVoiceBaseClass
@@ -981,6 +982,7 @@ VITS
vitt
VL
vl
VLM
VLModel
VM
Vladlen
@@ -1025,6 +1027,7 @@ YOLOv
yolov
Youri
youri
YouTube
ZavyChromaXL
Zongyuan
ZeroScope
27 changes: 27 additions & 0 deletions notebooks/multimodal-rag/README.md
@@ -0,0 +1,27 @@
# Multimodal RAG for video analytics with LlamaIndex

Constructing a RAG pipeline for text is relatively straightforward, thanks to the tools developed for parsing, indexing, and retrieving text data. However, adapting RAG models for video content presents a greater challenge. Videos combine visual, auditory, and textual elements, requiring more processing power and sophisticated video pipelines.

To build a truly multimodal search for videos, you need to work with the different modalities of a video, such as its spoken content and its visual frames. In this notebook, we showcase a Multimodal RAG pipeline designed for video analytics. It utilizes the Whisper model to convert spoken content to text, the CLIP model to generate multimodal embeddings, and a Vision Language Model (VLM) to process the retrieved images and text messages. The following picture illustrates how this pipeline works.

![image](https://github.com/user-attachments/assets/fb3ec06f-e4b0-4ca3-aac6-71465ae14808)
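
As a rough sketch of the "Convert and Optimize model" step, the snippet below exports the speech-recognition and vision-language checkpoints to OpenVINO with optimum-intel. The model IDs and output folders here are illustrative assumptions, not necessarily the checkpoints the notebook pins:

```python
# Sketch only: export ASR and VLM checkpoints to OpenVINO IR with optimum-intel
# (pip install "optimum[openvino]"). The model IDs below are placeholder assumptions.
from optimum.intel import OVModelForSpeechSeq2Seq, OVModelForVisualCausalLM

# Whisper turns the video's spoken content into text for indexing.
asr = OVModelForSpeechSeq2Seq.from_pretrained("openai/whisper-base", export=True)
asr.save_pretrained("whisper-base-ov")

# A vision-language model answers questions over the retrieved frames and text.
vlm = OVModelForVisualCausalLM.from_pretrained("llava-hf/llava-1.5-7b-hf", export=True)
vlm.save_pretrained("llava-1.5-7b-ov")
```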

## Notebook contents
The tutorial consists of the following steps:

- Install requirements
- Convert and Optimize model
- Download and process video
- Create the multi-modal index
- Search text and image embeddings
- Generate final response using VLM
- Launch Interactive demo

In this demonstration, you'll create an interactive Q&A system that can answer questions about the provided video's content.
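
To make the indexing and search steps concrete, here is a minimal sketch using stock LlamaIndex multimodal APIs. It assumes the extracted video frames and the Whisper transcript already sit in a `./mixed_data` folder, and it embeds both text and images with CLIP so a query can match either modality; the notebook's actual classes (for example, its OpenVINO-backed embeddings and VLM wrapper) may differ:

```python
# Sketch only: build a multimodal index over extracted frames (.png) and the
# transcript (.txt), then retrieve both text and images for a query.
# Assumes ./mixed_data was produced by the "Download and process video" step.
from llama_index.core import Settings, SimpleDirectoryReader
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.schema import ImageNode
from llama_index.embeddings.clip import ClipEmbedding

Settings.embed_model = ClipEmbedding()  # CLIP embeds text and images into one space

documents = SimpleDirectoryReader("./mixed_data").load_data()
index = MultiModalVectorStoreIndex.from_documents(documents)

# Retrieve the transcript chunks and frames most relevant to the question.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
results = retriever.retrieve("Tell me more about the Gaussian function")

frames = [r.node.metadata["file_path"] for r in results if isinstance(r.node, ImageNode)]
texts = [r.node.text for r in results if not isinstance(r.node, ImageNode)]
# `frames` and `texts` are then passed to the VLM to generate the final answer.
```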

## Installation instructions
This is a self-contained example that relies solely on its own code.<br/>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](../../README.md).

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/multimodal-rag/README.md" />
132 changes: 132 additions & 0 deletions notebooks/multimodal-rag/gradio_helper.py
@@ -0,0 +1,132 @@
from typing import Callable
import gradio as gr

examples = [
    ["Tell me more about gaussian function"],
    ["Show me the formula of gaussian function"],
    ["What is the Herschel Maxwell derivation of a Gaussian?"],
]


def clear_files():
    # Reset the status box when the loaded video is removed.
    return "Vector Store is Not ready"


def handle_user_message(message, history):
    """
    Callback for updating user messages in the interface on submit button click.

    Params:
      message: current message
      history: conversation history
    Returns:
      an empty string (to clear the message box) and the updated history
    """
    # Append the user's message to the history with an empty assistant reply.
    return "", history + [[message, ""]]


def make_demo(
    example_path: str,
    build_index: Callable,
    search: Callable,
    run_fn: Callable,
    stop_fn: Callable,
):
    with gr.Blocks(
        theme=gr.themes.Soft(),
        css=".disclaimer {font-variant-caps: all-small-caps;}",
    ) as demo:
        gr.Markdown("""<h1><center>QA over Video</center></h1>""")
        gr.Markdown("""<center>Powered by OpenVINO</center>""")
        image_list = gr.State([])
        txt_list = gr.State([])

        with gr.Row():
            with gr.Column(scale=1):
                video_file = gr.File(
                    label="Step 1: Load a '.mp4' video file",
                    value=example_path,
                    file_types=[
                        ".mp4",
                    ],
                )
                load_video = gr.Button("Step 2: Build Vector Store", variant="primary")
                status = gr.Textbox(
                    "Vector Store is Ready",
                    show_label=False,
                    max_lines=1,
                    interactive=False,
                )

            with gr.Column(scale=4):
                chatbot = gr.Chatbot(
                    height=800,
                    label="Step 3: Input Query",
                )
                with gr.Row():
                    with gr.Column():
                        with gr.Row():
                            msg = gr.Textbox(
                                label="QA Message Box",
                                placeholder="Chat Message Box",
                                show_label=False,
                                container=False,
                            )
                    with gr.Column():
                        with gr.Row():
                            submit = gr.Button("Submit", variant="primary")
                            stop = gr.Button("Stop")
                            clear = gr.Button("Clear")
                gr.Examples(
                    examples,
                    inputs=msg,
                    label="Click on any example and press the 'Submit' button",
                )

        # Keep Submit disabled until a vector store exists for the current video.
        video_file.clear(clear_files, outputs=[status], queue=False).then(lambda: gr.Button(interactive=False), outputs=submit)
        load_video.click(lambda: gr.Button(interactive=False), outputs=submit).then(
            fn=build_index,
            inputs=[video_file],
            outputs=[status],
            queue=True,
        ).then(lambda: gr.Button(interactive=True), outputs=submit)

        # On submit (Enter key or button): record the user message, retrieve the
        # relevant frames and transcript chunks, then generate the answer.
        submit_event = (
            msg.submit(handle_user_message, [msg, chatbot], [msg, chatbot], queue=False)
            .then(
                search,
                [chatbot],
                [image_list, txt_list],
                queue=True,
            )
            .then(
                run_fn,
                [chatbot, image_list, txt_list],
                chatbot,
                queue=True,
            )
        )
        submit_click_event = (
            submit.click(handle_user_message, [msg, chatbot], [msg, chatbot], queue=False)
            .then(
                search,
                [chatbot],
                [image_list, txt_list],
                queue=True,
            )
            .then(
                run_fn,
                [chatbot, image_list, txt_list],
                chatbot,
                queue=True,
            )
        )
        stop.click(
            fn=stop_fn,
            inputs=None,
            outputs=None,
            cancels=[submit_event, submit_click_event],
            queue=False,
        )
        clear.click(lambda: None, None, chatbot, queue=False)
    return demo
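

# --- Example wiring (illustrative; not part of the original helper) ----------
# The notebook supplies the actual callbacks. The stubs below are assumptions
# that only show the expected signatures:
#
#     def build_index(video_file):
#         ...  # extract audio + frames, embed them, build the vector store
#         return "Vector Store is Ready"
#
#     def search(history):
#         ...  # retrieve frames / transcript chunks for history[-1][0]
#         return image_list, txt_list
#
#     def run_fn(history, image_list, txt_list):
#         ...  # stream the VLM answer into history[-1][1]
#         yield history
#
#     demo = make_demo("video.mp4", build_index, search, run_fn, stop_fn=lambda: None)
#     demo.launch()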