This is a multimodal model design for the Visual Question Answering (VQA) task. It integrates the Llama2 13B, OWL-ViT, and YOLOv8 models, utilizing hard prompt tuning.
- Llama2 13B handles language understanding and generation.
- OWL-ViT identifies objects in the image relevant to the question.
- YOLOv8 efficiently detects and annotates objects within the image.
Combining these models leverages their strengths for precise and efficient VQA, ensuring accurate object recognition and context understanding from both language and visual inputs.
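As a rough illustration of the hard-prompt idea (a minimal sketch, not this repo's exact pipeline), the code below runs YOLOv8 on the image, folds the detected object names into a fixed prompt template, and asks Llama2 13B for a short answer. The OWL-ViT grounding step is omitted for brevity, and the checkpoint names, prompt template, and helper functions (`build_hard_prompt`, `answer`) are illustrative assumptions rather than the actual code in `zero_shot.py`.

```python
# Sketch only: detect objects with YOLOv8, build a hard prompt, query Llama2 13B.
from ultralytics import YOLO
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_hard_prompt(image_path: str, question: str, yolo_weight: str = "yolov8n.pt") -> str:
    """Run YOLOv8 on the image and insert the detected object names into a fixed template."""
    detector = YOLO(yolo_weight)                      # weight file name is a placeholder
    result = detector(image_path)[0]
    names = [result.names[int(c)] for c in result.boxes.cls]
    objects = ", ".join(sorted(set(names))) or "no objects detected"
    # Hard prompt: the template text is fixed; only the detections and the question vary.
    return (
        f"The image contains the following objects: {objects}.\n"
        f"Question: {question}\n"
        f"Answer with a single word or short phrase:"
    )

def answer(image_path: str, question: str,
           llm_name: str = "meta-llama/Llama-2-13b-chat-hf") -> str:  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(llm_name)
    llm = AutoModelForCausalLM.from_pretrained(llm_name, device_map="auto")
    prompt = build_hard_prompt(image_path, question)
    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    output = llm.generate(**inputs, max_new_tokens=16)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

if __name__ == "__main__":
    print(answer("example.jpg", "What color is the car?"))
```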
pip install -r requirements.txt
I evaluate the model on the test data of the GQA dataset.
python val_zero_shot.py
- --imgs_path: path to the GQA images
- --dataroot: path to the GQA dataset root
- --mode: one of ['testdev', 'val', 'train']
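For example, assuming a hypothetical local layout where the GQA images and question files live under `./data/GQA` (the paths are placeholders):

python val_zero_shot.py --imgs_path ./data/GQA/images --dataroot ./data/GQA --mode testdev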
python zero_shot.py
- --img_path: path to the question image
- --yolo_weight: path to the pre-trained YOLOv8 weights
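For example, with placeholder paths for the query image and a YOLOv8 checkpoint (the file names are assumptions):

python zero_shot.py --img_path ./examples/question.jpg --yolo_weight ./weights/yolov8.pt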
- The accuracy score on the GQA dataset is 0.52.