The VGG network takes each photo and produces a feature vector from it.
BiGRU, BiLSTM, GRU, and LSTM models are trained on these feature vectors together with the captions available for each photo. Considerable development is still needed to improve the accuracy of these models.
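
One common way to pair a photo's feature vector with its captions for training is to split every caption into (prefix, next word) samples. The sketch below illustrates that idea in plain Python. It is not taken from this project: the function names, the `startseq`/`endseq` tokens, the left-padding scheme, and the `max_len` parameter are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical boundary tokens; the project may use different ones.
START, END = "startseq", "endseq"

Sample = Tuple[Sequence[float], List[int], int]


def build_vocab(captions: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each word to an integer id. Id 0 is reserved for padding."""
    words = {START, END}
    for caps in captions.values():
        for cap in caps:
            words.update(cap.lower().split())
    return {word: i for i, word in enumerate(sorted(words), start=1)}


def make_training_pairs(
    features: Dict[str, Sequence[float]],
    captions: Dict[str, List[str]],
    vocab: Dict[str, int],
    max_len: int,
) -> List[Sample]:
    """Expand each caption into (feature vector, padded prefix, next word id)."""
    samples: List[Sample] = []
    for photo_id, caps in captions.items():
        feat = features[photo_id]
        for cap in caps:
            tokens = [START] + cap.lower().split() + [END]
            ids = [vocab[t] for t in tokens]
            for i in range(1, len(ids)):
                # Keep at most max_len prefix tokens, then left-pad with 0.
                prefix = ids[:i][-max_len:]
                padded = [0] * (max_len - len(prefix)) + prefix
                samples.append((feat, padded, ids[i]))
    return samples
```

Under these assumptions, a recurrent model (GRU, LSTM, or a bidirectional variant) would receive the padded prefix and the feature vector as inputs and learn to predict the next word id. A caption with n words yields n + 1 samples, since the end token is also predicted.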