To generate a grammatically correct sentence that accurately describes the scene of an image, enabling anyone to visualize the image mentally. Rather than simply detecting objects, the network aims to establish relationships among the entities in the image.
Image features are extracted using a pretrained InceptionV3 model. The captioning model is trained on the Flickr8k dataset.
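As a minimal sketch of this feature-extraction step (TensorFlow/Keras assumed; the exact preprocessing used in this repo may differ):

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

# Pretrained InceptionV3 without the classification head: the globally
# pooled 2048-dim output serves as the image feature vector.
extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    # InceptionV3 expects 299x299 RGB inputs
    img = image.load_img(img_path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)  # shape: (1, 2048)
```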
Navigate to the 'Flickr' directory in the command prompt and run:
python run.py
The parent folder of this repository should contain the trained caption_model weights.
The Image Captioning Model is deployed as a REST API: both the web app and our Flutter application make API calls to the server by sending an image, and the server responds with a caption (a sketch of such an endpoint is shown below).
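A minimal sketch of what such an endpoint could look like (Flask is assumed; the route, form-field name, and the `generate_caption` helper are hypothetical, not taken from `run.py`):

```python
import tempfile
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/caption", methods=["POST"])
def caption():
    # The client uploads an image file; the server extracts features,
    # runs the caption model, and returns the caption as JSON.
    img_file = request.files["image"]  # assumed form-field name
    with tempfile.NamedTemporaryFile(suffix=".jpg") as tmp:
        img_file.save(tmp.name)
        features = extract_features(tmp.name)  # see the sketch above
        text = generate_caption(features)      # hypothetical caption-model call
    return jsonify({"caption": text})

if __name__ == "__main__":
    app.run(port=5000)
```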
The web app displays the bidirectional and unidirectional approaches side by side, along with a table showing the accuracy of each predicted word.
Sample web-app results (images omitted): Football Players, Snowy Scene, Running Dog, Jumping Dog, 2 Running Dogs.
The caption model is deployed as a REST API locally on a laptop, and the Flutter application fetches captions from it; an illustrative client call is sketched below.
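For illustration, the call the Flutter application makes can be reproduced from Python with `requests` (the URL and field name follow the hypothetical endpoint sketched above):

```python
import requests

# Send an image to the local captioning API and print the returned caption
with open("example.jpg", "rb") as f:
    resp = requests.post("http://localhost:5000/caption", files={"image": f})
print(resp.json()["caption"])
```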
Sample Flutter-app results (images omitted): Dog on Beach, 3 Dogs, Dog Jumping over Hurdle, Basketball Boy.
The BLEU metric has been used to evaluate the test images; a higher BLEU score (closer to 1) corresponds to a more accurate description.
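As a sketch of how one predicted caption can be scored against reference captions with NLTK's BLEU implementation (the repo's exact evaluation code may differ):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "through", "the", "grass"]]   # tokenized ground truth(s)
candidate = ["a", "dog", "is", "running", "in", "the", "grass"]  # tokenized model output

# Smoothing avoids zero scores when a higher-order n-gram has no match
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # closer to 1 means closer to the reference
```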
Captions are currently generated on a laptop CPU, which results in higher processing time; deploying the model on the cloud could improve performance.