Image classification is a well-known problem in Computer Vision. Here we will use the well-known CIFAR-10 dataset and compare two basic approaches: Bag of Visual Words (BoVW) and VLAD.
The Bag of Visual Words (BoVW) approach is similar to the Bag of Words (BoW) approach used in NLP. In NLP, BoW represents text as a frequency histogram of words. Similarly, BoVW represents each image as a frequency histogram of important feature points and their descriptors. Keypoints are distinctive to an image: even if the image is rotated, shrunk, or expanded, these points do not change.

To extract the feature points and their descriptors, the SURF descriptor is used, which provides a 64-dimensional vector describing each keypoint. These vectors are treated as code words, analogous to words in text documents. From these codewords we build a codebook, which is simply the vocabulary for the images.

To create this vocabulary, the KMeans algorithm is used; the resulting cluster centroids act as the codewords, and the set of clusters makes up the whole codebook. Since KMeans is an unsupervised algorithm, it is difficult to choose the number of cluster centres (K) correctly, but after some trial and error, K = 300 proved to be a good number of clusters.

Once the vocabulary (the codebook) is created, we can perform classification with any linear classification algorithm. Before that, each image must be converted to a histogram in which each bin corresponds to a codeword and the frequency counts how many times that codeword appears in the image. These histograms serve as feature vectors which, together with the provided labels, train the classifier. Logistic Regression gave an accuracy of 24.52% while SVM gave an accuracy of 23% on the CIFAR-10 training dataset.
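The codebook-and-histogram pipeline above can be sketched as follows. This is a minimal illustration, not the report's exact code: SURF requires the opencv-contrib package (`cv2.xfeatures2d.SURF_create()`), so random 64-dimensional arrays stand in for real SURF descriptors, and a tiny K is used instead of K = 300 to keep the example fast.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for SURF output: in practice each image yields a variable number
# of 64-dimensional descriptors from cv2.xfeatures2d.SURF_create() (contrib).
rng = np.random.default_rng(0)
descriptors_per_image = [rng.normal(size=(30, 64)) for _ in range(10)]

# Build the codebook: cluster all descriptors together; the cluster
# centroids act as the codewords (the report uses K = 300 on CIFAR-10).
K = 5
all_desc = np.vstack(descriptors_per_image)
codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_desc)

def bovw_histogram(desc, codebook, K):
    """Assign each descriptor to its nearest codeword and count frequencies."""
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()  # normalize so keypoint count doesn't matter

# One K-bin histogram per image; these rows are the feature vectors fed,
# with the labels, to Logistic Regression or an SVM.
features = np.array([bovw_histogram(d, codebook, K)
                     for d in descriptors_per_image])
print(features.shape)  # (10, 5)
```

With real SURF descriptors the only change is replacing the random arrays with the detector's output; the clustering and histogram steps are identical.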
VLAD stands for Vector of Locally Aggregated Descriptors. In BoVW, we build a codebook (the vocabulary obtained from the descriptors of all the images) and create a training vector for each image by matching its descriptors against the codewords and updating the corresponding frequencies. VLAD is an extension of this idea that builds the training vectors differently. The vocabulary is computed as before, but instead of counting matches, we accumulate the residual of each descriptor with respect to its assigned cluster: each descriptor of an image is matched to its closest cluster, and for every cluster we store the sum of the differences between the descriptors assigned to that cluster and its centroid. Finally, we apply power normalization, also known as square-root normalization, followed by L2 normalization. To train the vocabulary in batches, we use MiniBatchKMeans with a batch size of 1000 and 50 clusters. However, accuracy improved as the number of clusters decreased, and the best result was obtained with a vocabulary of size 20, i.e. K = 20. With simple Logistic Regression as the linear classification algorithm, the accuracy obtained is 38.96%.
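The residual-accumulation and normalization steps described above can be sketched like this. As before, random arrays stand in for real SURF descriptors, and a small K replaces the report's K = 20, so the numbers here are illustrative only.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(1)
# Placeholder for per-image SURF descriptors (64-dimensional).
descriptors_per_image = [rng.normal(size=(40, 64)) for _ in range(8)]

# Train the vocabulary in batches, as in the report (batch_size=1000);
# K is kept tiny here, whereas the report found K = 20 worked best.
K = 4
vocab = MiniBatchKMeans(n_clusters=K, batch_size=1000, n_init=3,
                        random_state=0).fit(np.vstack(descriptors_per_image))

def vlad(desc, vocab):
    centers = vocab.cluster_centers_
    assign = vocab.predict(desc)          # nearest cluster per descriptor
    v = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = desc[assign == k]
        if len(members):
            # Sum of residuals: descriptors minus their assigned centroid.
            v[k] = (members - centers[k]).sum(axis=0)
    v = v.ravel()                          # concatenate per-cluster sums
    v = np.sign(v) * np.sqrt(np.abs(v))    # power (square-root) normalization
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v     # L2 normalization

# Each image becomes a K * 64 dimensional VLAD vector for the classifier.
X = np.array([vlad(d, vocab) for d in descriptors_per_image])
print(X.shape)  # (8, 256)
```

Note that the VLAD vector's dimensionality is K times the descriptor dimension, which is why a smaller vocabulary (K = 20 versus K = 300 for BoVW) is workable here.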