Link to paper: https://drive.google.com/drive/folders/1KQ5kyeSGI-lTvtb5xW2XubNO7nPyWhJw?usp=sharing
This project presents an image retrieval system that retrieves images from natural-language descriptions. It leverages Vision Transformers (ViT) for image encoding and BERT for text embedding, fusing the two modalities with an additive attention mechanism. The goal is a more intuitive search experience in applications such as e-commerce and multimedia search, where users can query with plain language instead of example images alone.
Traditional retrieval systems often rely on visual similarity alone, neglecting semantic context. This work combines ViT-based image encoding with BERT-based text embedding and demonstrates improved retrieval performance on the FashionIQ dataset.
### Feature Extraction
- Image Encoding: Utilizes ViT to process images into fixed-length feature vectors.
- Text Encoding: Employs BERT to convert textual queries into embeddings (both encoders are sketched below).
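A minimal encoding sketch using Hugging Face `transformers`. The checkpoints (`google/vit-base-patch16-224-in21k`, `bert-base-uncased`) and the [CLS]-token pooling are assumptions; the paper's exact backbones and pooling strategy may differ.

```python
import torch
from PIL import Image
from transformers import ViTModel, ViTImageProcessor, BertModel, BertTokenizer

# Assumed checkpoints; swap in whichever backbones the paper fine-tunes.
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
bert = BertModel.from_pretrained("bert-base-uncased")
tok = BertTokenizer.from_pretrained("bert-base-uncased")

@torch.no_grad()
def encode_image(path: str) -> torch.Tensor:
    """Map an image to a single feature vector via ViT's [CLS] token."""
    pixels = vit_proc(images=Image.open(path).convert("RGB"), return_tensors="pt")
    return vit(**pixels).last_hidden_state[:, 0]  # shape (1, 768)

@torch.no_grad()
def encode_text(query: str) -> torch.Tensor:
    """Map a textual query to an embedding via BERT's [CLS] token."""
    tokens = tok(query, return_tensors="pt", truncation=True)
    return bert(**tokens).last_hidden_state[:, 0]  # shape (1, 768)
```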
### Fine-Tuning
- Custom architectures built on ViT and BERT are fine-tuned to optimize the embeddings for the retrieval task (one possible design is sketched below).
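The exact fine-tuned architecture isn't spelled out here, so the head below is a hypothetical but common design: a small MLP projecting each encoder's [CLS] feature into a shared retrieval space, trained while the backbone stays frozen or is partially unfrozen. The name `ProjectionHead` and its dimensions are assumptions.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Hypothetical retrieval head fine-tuned on top of ViT/BERT features."""
    def __init__(self, in_dim: int = 768, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x):   # x: (batch, in_dim) encoder [CLS] features
        return self.net(x)  # (batch, out_dim) retrieval embedding
```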
### Multimodal Fusion
- Combines the image and text embeddings into a unified query representation using an additive attention mechanism (sketched below).
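A sketch of additive (Bahdanau-style) attention over the two modality embeddings, matching the mechanism named in the overview; the module structure and dimensions are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AdditiveAttentionFusion(nn.Module):
    """Fuse image and text embeddings with additive attention weights."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.w = nn.Linear(dim, dim)  # score projection
        self.v = nn.Linear(dim, 1)    # score vector

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # Treat the two modalities as a length-2 sequence: (batch, 2, dim).
        modalities = torch.stack([img_emb, txt_emb], dim=1)
        scores = self.v(torch.tanh(self.w(modalities)))  # (batch, 2, 1)
        weights = torch.softmax(scores, dim=1)           # attention over modalities
        return (weights * modalities).sum(dim=1)         # fused vector (batch, dim)
```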
### Training and Optimization
- The system is trained with a cross-entropy loss and optimized with the AdamW optimizer (one plausible formulation is sketched below).
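The README names cross-entropy and AdamW but not the exact formulation. A plausible reading, sketched here as an assumption, is cross-entropy over in-batch cosine similarities, so each fused query must rank its paired target image above the other images in the batch. The temperature and optimizer hyperparameters are assumed values.

```python
import torch
import torch.nn.functional as F

def in_batch_ce_loss(query_embs, image_embs, temperature=0.07):
    """Cross-entropy over in-batch similarities (contrastive retrieval loss).
    query_embs, image_embs: (batch, dim); row i of each is a matched pair."""
    q = F.normalize(query_embs, dim=-1)
    v = F.normalize(image_embs, dim=-1)
    logits = q @ v.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Assumed hyperparameters; the paper's settings may differ.
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```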
### Inference
- Retrieves the top-K images by cosine similarity between the query embedding and the database image embeddings (see the sketch below).
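A minimal inference sketch: embed the query once, then rank precomputed database embeddings by cosine similarity and keep the top K. The shapes and function name are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_top_k(query_emb: torch.Tensor, db_embs: torch.Tensor, k: int = 10):
    """query_emb: (1, dim); db_embs: (N, dim) precomputed image embeddings.
    Returns indices and similarity scores of the k best-matching images."""
    sims = F.cosine_similarity(query_emb, db_embs, dim=-1)  # (N,)
    scores, indices = sims.topk(k)
    return indices, scores
```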
The system was evaluated on the FashionIQ dataset, demonstrating:
- Recall, Precision, and F1 scores that all improve as the number of retrieved images K increases.
- Strong performance at larger K (e.g., Recall of 0.93 at K = 50), indicating robust retrieval.
| K  | Recall | Precision | F1 Score |
|----|--------|-----------|----------|
| 5  | 0.3055 | 0.1479    | 0.1994   |
| 10 | 0.4342 | 0.2510    | 0.3181   |
| 20 | 0.6345 | 0.4305    | 0.5130   |
| 30 | 0.7314 | 0.5386    | 0.6203   |
| 50 | 0.9318 | 0.7430    | 0.8268   |
In summary, this work presents a multimodal image retrieval approach that combines textual feedback with image queries, improving both retrieval accuracy and the overall search experience.
To run this project:
1. Clone the repository.
2. Install the required libraries.
3. Download and prepare the FashionIQ dataset from Kaggle.
4. Run the training script to fine-tune the models.