- EfficientNetB0, a state-of-the-art model for image processing, is used. The model and its variants are described here.
- An Attention Mechanism is used to focus on certain parts of the image. Here is a nice explanation of Attention.
- A Gated Recurrent Unit (GRU) is used for processing text.
- The dataset used here is Flickr30K.
- The dataset contains 30k images with 5 annotations per image. Only the first annotation of each image is used here.
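Keeping only the first annotation per image can be sketched with a pandas `groupby`. The column names and caption strings below are hypothetical; the actual Flickr30K captions file layout may differ.

```python
import pandas as pd

# Hypothetical captions table: Flickr30K provides 5 captions per image.
captions = pd.DataFrame({
    "image": ["1.jpg"] * 5 + ["2.jpg"] * 5,
    "caption": [f"caption {i} for image 1" for i in range(5)]
             + [f"caption {i} for image 2" for i in range(5)],
})

# Keep only the first annotation of each image, as done in this project.
first_captions = captions.groupby("image", as_index=False).first()
print(first_captions)
```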
- TensorFlow, NLTK, NumPy, and Pandas are used.
- To convert text to embedding vectors, TextVectorization is used with a vocabulary size of 5000, a sequence length of 25, and an embedding dimension of 256.
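A minimal sketch of this text pipeline with the stated hyperparameters; the two-sentence corpus is only illustrative (the project adapts the vectorizer on the Flickr30K captions):

```python
import tensorflow as tf

VOCAB_SIZE = 5000
SEQ_LEN = 25
EMBED_DIM = 256

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_sequence_length=SEQ_LEN,
)
# Tiny illustrative corpus; note that the default standardization
# lowercases and strips punctuation, including the brackets in "[start]".
vectorizer.adapt(["[start] a dog runs on grass [end]",
                  "[start] a man rides a bike [end]"])

embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)

tokens = vectorizer(["[start] a dog runs [end]"])   # (1, 25), padded
vectors = embedding(tokens)                          # (1, 25, 256)
```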
- The input image size for EfficientNet is (224, 224, 3). EfficientNet is loaded with ImageNet weights.
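Loading the backbone as a feature extractor might look like the following. `weights=None` keeps the sketch light; the project loads `weights="imagenet"` as described above. Flattening the 7x7 feature grid into 49 locations is a common preparation step for attention, assumed here:

```python
import tensorflow as tf

# The project uses weights="imagenet"; None avoids the download here.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights=None, input_shape=(224, 224, 3)
)

images = tf.random.uniform((1, 224, 224, 3))
features = backbone(images)                       # (1, 7, 7, 1280)
# Flatten the spatial grid into 49 locations for the attention layer.
features = tf.reshape(features, (1, -1, 1280))    # (1, 49, 1280)
```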
- The GRU has 512 units.
- Training is done with a batch size of 64 for 25 epochs.
- The encoder consists of EfficientNet and a fully connected (FC) layer for fine-tuning. The decoder consists of a GRU with an Attention Mechanism.
- First, the image is passed through EfficientNet to obtain an image context vector. This context vector, together with the hidden state (the initial state of the decoder), is passed to the Attention layer. The attention output is then passed to the GRU along with the embedding vector of the "[start]" token.
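One decoder step can be sketched as follows, assuming Bahdanau-style additive attention (the layer and variable names here are hypothetical, and the token id for "[start]" is made up):

```python
import tensorflow as tf

UNITS = 512
EMBED_DIM = 256

class BahdanauAttention(tf.keras.layers.Layer):
    # Additive attention over the image feature locations.
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 49, embed_dim), hidden: (batch, units)
        hidden = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden)))
        weights = tf.nn.softmax(scores, axis=1)
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights

# One decode step with illustrative shapes (batch of 2):
features = tf.random.uniform((2, 49, EMBED_DIM))  # encoder output
hidden = tf.zeros((2, UNITS))                     # initial decoder state
attention = BahdanauAttention(UNITS)
context, _ = attention(features, hidden)          # (2, 256)

embedding = tf.keras.layers.Embedding(5000, EMBED_DIM)
start_tok = embedding(tf.constant([[2], [2]]))    # "[start]" id, hypothetical
x = tf.concat([tf.expand_dims(context, 1), start_tok], axis=-1)

gru = tf.keras.layers.GRU(UNITS, return_sequences=True, return_state=True)
output, state = gru(x)                            # (2, 1, 512), (2, 512)
```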
- Teacher Forcing is used: during training, the word vectors of the target sentence are fed to the GRU instead of the model's own predictions.
- At test time, the input to the GRU is the previous output together with the attention output.
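The test-time loop amounts to greedy decoding: feed back the previous prediction until the end token or the maximum length. A minimal sketch, with an untrained stand-in decoder step and made-up token ids (the real step would also run the attention layer):

```python
import tensorflow as tf

VOCAB_SIZE = 5000
MAX_LEN = 25
START_ID, END_ID = 2, 3   # hypothetical token ids

# Stand-in for the trained decoder step; the real model would also
# compute attention over the image features at each step.
embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 256)
gru_cell = tf.keras.layers.GRUCell(512)
project = tf.keras.layers.Dense(VOCAB_SIZE)

def decode_step(token_id, state):
    x = embedding(token_id)                 # (1, 256)
    out, new_states = gru_cell(x, [state])
    return project(out), new_states[0]      # logits (1, 5000), state (1, 512)

state = tf.zeros((1, 512))                  # would come from the encoder
token = tf.constant([START_ID])
result = []
for _ in range(MAX_LEN):
    logits, state = decode_step(token, state)
    next_id = int(tf.argmax(logits, axis=-1)[0])
    if next_id == END_ID:                   # stop at the end token
        break
    result.append(next_id)
    token = tf.constant([next_id])          # feed prediction back in
```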
- The loss obtained by the model is 0.511, and the BLEU score on the test data is 0.129.
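The BLEU score can be computed with NLTK, which this project already uses. The reference/candidate pair below is invented for illustration; the real score averages over the whole test split, and smoothing is assumed here to avoid zero n-gram counts on short sentences:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical reference and predicted caption (token lists).
references = [[["a", "dog", "runs", "on", "the", "grass"]]]
candidates = [["a", "dog", "runs", "on", "grass"]]

# Smoothing (assumed) keeps short sentences from scoring exactly 0.
smooth = SmoothingFunction().method1
score = corpus_bleu(references, candidates, smoothing_function=smooth)
print(round(score, 3))
```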
- Here are two examples after training.