Link to paper: https://drive.google.com/drive/folders/1KQ5kyeSGI-lTvtb5xW2XubNO7nPyWhJw?usp=sharing
This project presents an image retrieval system that retrieves images from natural-language descriptions. It leverages Vision Transformers (ViT) for image encoding and BERT for text embedding, fusing the two modalities with an additive attention mechanism. The goal is a more intuitive search experience in applications such as e-commerce and multimedia search, where users can query with plain language instead of example images alone.
Traditional retrieval systems often rely on visual similarity alone, neglecting semantic context. This work combines ViT-based image encoding with BERT-based text embedding and demonstrates improved retrieval performance on the FashionIQ dataset.
### Feature Extraction
- Image Encoding: Utilizes ViT to process images into fixed-length feature vectors.
- Text Encoding: Employs BERT to convert textual queries into embeddings (both encoders are sketched below).
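A minimal encoding sketch using Hugging Face `transformers`. The checkpoints (`google/vit-base-patch16-224-in21k`, `bert-base-uncased`) and the [CLS]-token pooling are assumptions; the paper's exact backbones and pooling strategy may differ.

```python
import torch
from PIL import Image
from transformers import ViTModel, ViTImageProcessor, BertModel, BertTokenizer

# Assumed checkpoints; swap in whichever backbones the paper fine-tunes.
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
bert = BertModel.from_pretrained("bert-base-uncased")
tok = BertTokenizer.from_pretrained("bert-base-uncased")

@torch.no_grad()
def encode_image(path: str) -> torch.Tensor:
    """Map an image to a single feature vector via ViT's [CLS] token."""
    pixels = vit_proc(images=Image.open(path).convert("RGB"), return_tensors="pt")
    return vit(**pixels).last_hidden_state[:, 0]  # shape (1, 768)

@torch.no_grad()
def encode_text(query: str) -> torch.Tensor:
    """Map a textual query to an embedding via BERT's [CLS] token."""
    tokens = tok(query, return_tensors="pt", truncation=True)
    return bert(**tokens).last_hidden_state[:, 0]  # shape (1, 768)
```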
### Fine-Tuning
- Custom architectures built on ViT and BERT are fine-tuned to optimize the embeddings for the retrieval task (one possible design is sketched below).
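The exact fine-tuned architecture isn't spelled out here, so the head below is a hypothetical but common design: a small MLP projecting each encoder's [CLS] feature into a shared retrieval space, trained while the backbone stays frozen or is partially unfrozen. The name `ProjectionHead` and its dimensions are assumptions.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Hypothetical retrieval head fine-tuned on top of ViT/BERT features."""
    def __init__(self, in_dim: int = 768, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x):   # x: (batch, in_dim) encoder [CLS] features
        return self.net(x)  # (batch, out_dim) retrieval embedding
```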
### Multimodal Fusion
- Combines the image and text embeddings into a unified query representation using an additive attention mechanism (sketched below).
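A sketch of additive (Bahdanau-style) attention over the two modality embeddings, matching the mechanism named in the overview; the module structure and dimensions are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AdditiveAttentionFusion(nn.Module):
    """Fuse image and text embeddings with additive attention weights."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.w = nn.Linear(dim, dim)  # score projection
        self.v = nn.Linear(dim, 1)    # score vector

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # Treat the two modalities as a length-2 sequence: (batch, 2, dim).
        modalities = torch.stack([img_emb, txt_emb], dim=1)
        scores = self.v(torch.tanh(self.w(modalities)))  # (batch, 2, 1)
        weights = torch.softmax(scores, dim=1)           # attention over modalities
        return (weights * modalities).sum(dim=1)         # fused vector (batch, dim)
```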
### Training and Optimization
- The system is trained with a cross-entropy loss and optimized with the AdamW optimizer (one plausible formulation is sketched below).
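The README names cross-entropy and AdamW but not the exact formulation. A plausible reading, sketched here as an assumption, is cross-entropy over in-batch cosine similarities, so each fused query must rank its paired target image above the other images in the batch. The temperature and optimizer hyperparameters are assumed values.

```python
import torch
import torch.nn.functional as F

def in_batch_ce_loss(query_embs, image_embs, temperature=0.07):
    """Cross-entropy over in-batch similarities (contrastive retrieval loss).
    query_embs, image_embs: (batch, dim); row i of each is a matched pair."""
    q = F.normalize(query_embs, dim=-1)
    v = F.normalize(image_embs, dim=-1)
    logits = q @ v.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Assumed hyperparameters; the paper's settings may differ.
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```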
### Inference
- Retrieves the top-K images by cosine similarity between the query embedding and the database image embeddings (see the sketch below).
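A minimal inference sketch: embed the query once, then rank precomputed database embeddings by cosine similarity and keep the top K. The shapes and function name are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_top_k(query_emb: torch.Tensor, db_embs: torch.Tensor, k: int = 10):
    """query_emb: (1, dim); db_embs: (N, dim) precomputed image embeddings.
    Returns indices and similarity scores of the k best-matching images."""
    sims = F.cosine_similarity(query_emb, db_embs, dim=-1)  # (N,)
    scores, indices = sims.topk(k)
    return indices, scores
```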
The system was evaluated on the FashionIQ dataset, demonstrating:
- Recall, Precision, and F1 scores that all improve as the number of retrieved images K increases.
- Strong performance at larger K (e.g., Recall of 0.93 at K = 50), indicating robust retrieval.
| K  | Recall | Precision | F1 Score |
|----|--------|-----------|----------|
| 5  | 0.3055 | 0.1479    | 0.1994   |
| 10 | 0.4342 | 0.2510    | 0.3181   |
| 20 | 0.6345 | 0.4305    | 0.5130   |
| 30 | 0.7314 | 0.5386    | 0.6203   |
| 50 | 0.9318 | 0.7430    | 0.8268   |
In summary, this work presents a multimodal image retrieval approach that combines textual feedback with image queries, improving both retrieval accuracy and the overall search experience.
To run this project:
1. Clone the repository.
2. Install the required libraries.
3. Download and prepare the FashionIQ dataset from Kaggle.
4. Run the training script to fine-tune the models.