Caption It! - Harnessing the Power of CNN/ResNet Models for Image Description

Introduction

"Caption It!" is a deep learning project focusing on the automation of image captioning. Utilizing the Flickr 8k Dataset and pre-trained CNN/ResNet models, this project compares two approaches: CNN+LSTM and ResNet+GRU, evaluating their performance using BLEU scores.

Agenda

  1. Problem Statement
  2. Technical Approach
  3. Dataset Analysis
  4. Exploratory Data Analysis
  5. Deep Learning Approaches: VGG16+LSTM and ResNet50+GRU
  6. Performance Evaluation
  7. Conclusion

Problem Statement

Our goal was to develop a model that automatically generates captions for images using advanced deep learning techniques.

Technical Approach

  • Import libraries and modules
  • Load and preprocess dataset
  • Perform feature extraction and caption tokenization
  • Model building, training, and evaluation (the feature-extraction and tokenization steps are sketched below)
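
The notebooks implement these steps in Keras. The following is a minimal sketch of the feature-extraction and tokenization steps, not the repo's exact code; the sample caption and any paths are illustrative.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model

# Drop VGG16's classification head; the 4096-d "fc2" activations
# serve as the image feature vector.
base = VGG16(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(image_path):
    """Return a 4096-d VGG16 feature vector for one image."""
    img = load_img(image_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return encoder.predict(x, verbose=0)[0]

# Tokenize captions; startseq/endseq mark sequence boundaries.
captions = ["startseq a dog runs on the beach endseq"]  # illustrative
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1
```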

Dataset Source

Dataset Analysis

The Flickr 8k Dataset, comprising 8000 images each with 5 captions, was used. This dataset offers a diverse range of images and high-quality captions, ideal for training image captioning models. Below are some visuals from the dataset!

Caption Length Distribution

Top 50 Words in the Dataset

Model Training Visualization: Sample Image Captions

Deep Learning Approaches

VGG16 & LSTM

  • VGG16: A pre-trained CNN for image classification.
  • LSTM: A recurrent neural network well suited to capturing temporal dependencies (see the sketch below).
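
A minimal sketch of how such a model can be wired in Keras, using the common "merge" architecture for Flickr 8k captioning; the 256-unit layer sizes and the vocabulary and length values are illustrative assumptions, not the repo's exact configuration.

```python
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.models import Model

vocab_size, max_length = 8000, 35  # illustrative values

# Image branch: project the 4096-d VGG16 feature down to 256 dims.
img_in = Input(shape=(4096,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: embed the partial caption and run it through an LSTM.
txt_in = Input(shape=(max_length,))
txt_vec = LSTM(256)(Dropout(0.5)(Embedding(vocab_size, 256, mask_zero=True)(txt_in)))

# Merge both branches and predict the next word in the caption.
merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
out = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```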

Output of trained VGG16 & LSTM model


ResNet50 & GRU

  • ResNet50: A deep residual network for image recognition.
  • GRU: A gated recurrent unit, efficient at capturing temporal relationships in sequence modeling (see the sketch below).
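
The second model follows the same pattern with the encoder and decoder swapped; a sketch under the same illustrative assumptions (ResNet50 with global average pooling yields a 2048-d feature per image):

```python
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.layers import Input, Dense, GRU, Embedding, Dropout, add
from tensorflow.keras.models import Model

vocab_size, max_length = 8000, 35  # illustrative values

# ResNet50 with average pooling produces 2048-d image features,
# precomputed per image as in the VGG16 sketch above.
feature_extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

img_in = Input(shape=(2048,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

txt_in = Input(shape=(max_length,))
txt_vec = GRU(256)(Dropout(0.5)(Embedding(vocab_size, 256, mask_zero=True)(txt_in)))

merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
out = Dense(vocab_size, activation="softmax")(merged)
model = Model(inputs=[img_in, txt_in], outputs=out)
```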

Output of trained ResNet50 & GRU model


Performance Evaluation

  • Model performance was evaluated using BLEU scores.

BLEU Scores Comparison


  • The VGG16+LSTM model exhibited higher BLEU scores, indicating its effectiveness in generating more accurate captions.
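
BLEU can be computed with NLTK's corpus_bleu, scoring each generated caption against the image's five reference captions; a minimal sketch with illustrative token lists:

```python
from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis is scored against all reference captions for its image.
references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "dog", "is", "running", "along", "the", "shore"]]]  # illustrative
candidates = [["a", "dog", "runs", "on", "the", "sand"]]

bleu1 = corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0))
print(f"BLEU-1: {bleu1:.3f}, BLEU-2: {bleu2:.3f}")
```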

Conclusion

The project shows how the choice of pre-trained image feature extractor and sequence decoder affects the quality of generated captions, and it points to avenues for further improvement in automated image captioning.

Dependencies

This project requires the following libraries:

  • TensorFlow
  • Keras
  • NumPy
  • Pandas
  • Matplotlib
  • Pillow (PIL)
  • NLTK

Install these using pip or conda as shown in the provided Python notebooks.
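
For example, with pip (TensorFlow 2.x bundles Keras, and the Pillow package provides the PIL module):

```
pip install tensorflow numpy pandas matplotlib pillow nltk
```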

Instructions to Run the Project

To run this project, follow these steps:

  1. Clone the repository to your local machine.
  2. Ensure you have Jupyter Notebook installed.
  3. Open and run the eda.ipynb notebook for exploratory data analysis.
  4. Proceed with vgg16_lstm.ipynb for the VGG16+LSTM model training and evaluation.
  5. Finally, execute resnet_gru.ipynb for the ResNet50+GRU model training and evaluation.
  6. Compare the BLEU scores as outputted by the notebooks to evaluate the models.
  7. Refer to the included PPT for an overview of the project flow.
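
For example, to clone the repository and launch the notebooks:

```
git clone https://github.com/tanzealist/AutoImageCaption-CNNvsResNet.git
cd AutoImageCaption-CNNvsResNet
jupyter notebook
```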

Please refer to each notebook for detailed instructions on the steps involved in the respective processes.
