This is a multimodal model design for the Visual Question Answering (VQA) task. It integrates the Llama2 13B, OWL-ViT, and YOLOv8 models, utilizing hard prompt tuning.
- Llama2 13B handles language understanding and generation.
- OWL-ViT identifies objects in the image relevant to the question.
- YOLOv8 efficiently detects and annotates objects within the image.
Combining these models leverages their strengths for precise and efficient VQA, ensuring accurate object recognition and context understanding from both language and visual inputs.
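As a rough illustration of the hard-prompt idea (a minimal sketch, not this repo's exact pipeline), the code below runs YOLOv8 on the image, folds the detected object names into a fixed prompt template, and asks Llama2 13B for a short answer. The OWL-ViT grounding step is omitted for brevity, and the checkpoint names, prompt template, and helper functions (`build_hard_prompt`, `answer`) are illustrative assumptions rather than the actual code in `zero_shot.py`.

```python
# Sketch only: detect objects with YOLOv8, build a hard prompt, query Llama2 13B.
from ultralytics import YOLO
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_hard_prompt(image_path: str, question: str, yolo_weight: str = "yolov8n.pt") -> str:
    """Run YOLOv8 on the image and insert the detected object names into a fixed template."""
    detector = YOLO(yolo_weight)                      # weight file name is a placeholder
    result = detector(image_path)[0]
    names = [result.names[int(c)] for c in result.boxes.cls]
    objects = ", ".join(sorted(set(names))) or "no objects detected"
    # Hard prompt: the template text is fixed; only the detections and the question vary.
    return (
        f"The image contains the following objects: {objects}.\n"
        f"Question: {question}\n"
        f"Answer with a single word or short phrase:"
    )

def answer(image_path: str, question: str,
           llm_name: str = "meta-llama/Llama-2-13b-chat-hf") -> str:  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(llm_name)
    llm = AutoModelForCausalLM.from_pretrained(llm_name, device_map="auto")
    prompt = build_hard_prompt(image_path, question)
    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    output = llm.generate(**inputs, max_new_tokens=16)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

if __name__ == "__main__":
    print(answer("example.jpg", "What color is the car?"))
```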
pip install -r requirements.txt
I evaluate the model on the test data of the GQA dataset.
python val_zero_shot.py
- --imgs_path: path to the GQA images
- --dataroot: path to the GQA dataset root
- --mode: one of ['testdev', 'val', 'train']
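For example, assuming a hypothetical local layout where the GQA images and question files live under `./data/GQA` (the paths are placeholders):

python val_zero_shot.py --imgs_path ./data/GQA/images --dataroot ./data/GQA --mode testdev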
python zero_shot.py
- --img_path: path to the question image
- --yolo_weight: path to the pre-trained YOLOv8 weights
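For example, with placeholder paths for the query image and a YOLOv8 checkpoint (the file names are assumptions):

python zero_shot.py --img_path ./examples/question.jpg --yolo_weight ./weights/yolov8.pt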
- The accuracy score on the GQA dataset is 0.52.