In this group work, we introduce several approaches to identifying fine-grained locations from 2D images. The methods first shortlist candidate images by the similarity of features extracted with a pre-trained Vision Transformer (ViT) [1] to speed up the process. They then select the best-matching image, the one sharing the largest number of confident matching points from pre-trained LoFTR [2], and use its geographic coordinates as the prediction for the input image. The approaches differ in the additional techniques they apply, such as image transformation or result adjustment, to improve accuracy. We evaluate the methods experimentally and show that they can plausibly recognise the positions of unseen images.
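A minimal sketch of the shared shortlisting step is shown below. It assumes the timm library and the vit_base_patch16_224 checkpoint (neither is confirmed by the report), and `train_feats` is a hypothetical (N, 768) matrix of pre-computed, L2-normalised training embeddings:

```python
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

# Pre-trained ViT used purely as a feature extractor (num_classes=0 -> pooled embedding).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0).eval()
transform = create_transform(**resolve_data_config({}, model=model))

@torch.no_grad()
def embed(path):
    """L2-normalised ViT embedding of one image, shape (1, 768)."""
    x = transform(Image.open(path).convert("RGB")).unsqueeze(0)
    return torch.nn.functional.normalize(model(x), dim=1)

def shortlist(query_path, train_feats, k=10):
    """Indices of the k training images most cosine-similar to the query."""
    sims = (train_feats @ embed(query_path).T).squeeze(1)  # features are pre-normalised
    return sims.topk(k).indices.tolist()
```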
A total of 7,500 training images and 1,200 test images are used, each taken in an art gallery at 680 (W) x 490 (H) pixels. Training images carry geographic coordinates, x and y, produced by a mapping algorithm. For simplicity, we assume there is no radial distortion and ignore other artefacts and distortions in the images. To select the best model, we compare performance on a validation set of 750 images initially split from the original training images.
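The held-out split can be reproduced along these lines; the 10% ratio matches the 750 of 7,500 images stated above, but the random seed is an assumption:

```python
import numpy as np

# Hold out 750 of the 7,500 labelled images (10%) for model selection.
rng = np.random.default_rng(0)          # the seed is an assumption, not from the report
idx = rng.permutation(7500)
val_idx, train_idx = idx[:750], idx[750:]
```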
The evaluation metric is the Mean Absolute Error (MAE) between the predicted and the true geographic coordinates.
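Concretely, with predictions and ground truth as (N, 2) arrays of (x, y) coordinates, the metric can be computed as below; averaging over both coordinates jointly is an assumption about the exact competition formula:

```python
import numpy as np

def mae(pred_xy, true_xy):
    """Mean Absolute Error over all (x, y) components; inputs are (N, 2) arrays."""
    return np.abs(np.asarray(pred_xy) - np.asarray(true_xy)).mean()
```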
A. Top 1 ViT-similar
- Directly use the coordinates of the most ViT-similar image in the train set as the prediction
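Given the similarity ranking from the sketch above, approach A reduces to a table lookup; it reuses `shortlist()`, and `train_xy` is a hypothetical (N, 2) array of training coordinates:

```python
def predict_top1(query_path, train_feats, train_xy):
    """Approach A: copy the coordinates of the single most ViT-similar training image."""
    best = shortlist(query_path, train_feats, k=1)[0]
    return train_xy[best]               # predicted (x, y)
```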
B. Top 10 ViT-similar & most LoFTR matching points
- Shortlist the top 10 most ViT-similar images in the train set
- Count the LoFTR matching points between the test image and each of the 10 candidates, and select the image with the highest count (see the sketch after this list)
- Predict the location as the x, y coordinates of the best-matching train image
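A sketch of the matching step, using kornia's LoFTR wrapper rather than the original LoFTR repository (an assumption; kornia is not in the dependency list). `cand_idx` holds the 10 shortlisted indices:

```python
import cv2
import numpy as np
import torch
import kornia.feature as KF

matcher = KF.LoFTR(pretrained="indoor").eval()   # gallery scenes are indoor

def to_gray_tensor(path):
    """Load a grayscale image as a (1, 1, H, W) float tensor in [0, 1]."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return torch.from_numpy(img)[None, None].float() / 255.0

@torch.no_grad()
def loftr_matches(path0, path1):
    """Matched keypoints (as numpy arrays) and per-match confidences."""
    out = matcher({"image0": to_gray_tensor(path0), "image1": to_gray_tensor(path1)})
    return (out["keypoints0"].numpy(), out["keypoints1"].numpy(),
            out["confidence"].numpy())

def predict_top10_loftr(query_path, train_paths, cand_idx, train_xy):
    """Approach B: among the shortlisted candidates, pick the most-matched image."""
    counts = [loftr_matches(query_path, train_paths[i])[2].size for i in cand_idx]
    return train_xy[cand_idx[int(np.argmax(counts))]]
```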
C. Affine Transformation & Linear Regression
- Shortlist the top 10 most ViT-similar images in the train set
- Count the good points (over 0.7 confidence) from LoFTR and select the image with the highest count
- Randomly select three of the confident point correspondences and estimate an affine matrix; no prediction is produced when the matrix elements cannot be found
- Train a linear regression (with default settings) on the affine elements as features, predicting the offset between the true location and the selected image's location (see the sketch after this list)
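One way to realise the affine step: three confident correspondences give a 2 x 3 affine matrix via OpenCV, whose six elements feed a regressor on the coordinate offset. The scikit-learn regressor is an assumption (the report says only "default setting"), and `feat_list`, `offset_list`, `best`, `k0`, `k1`, `c` are hypothetical names for pre-collected training data and the selected match:

```python
import cv2
import numpy as np
from sklearn.linear_model import LinearRegression

CONF_THRESH = 0.7
rng = np.random.default_rng(0)

def affine_features(kpts0, kpts1, conf):
    """Six elements of the affine matrix fit to 3 random confident correspondences."""
    good = np.where(conf > CONF_THRESH)[0]
    if len(good) < 3:
        return None                     # the "no output" case noted above
    sel = rng.choice(good, size=3, replace=False)
    M = cv2.getAffineTransform(kpts0[sel].astype(np.float32),
                               kpts1[sel].astype(np.float32))
    return M.ravel()                    # shape (6,)

# Training: X holds affine elements, y holds (true_xy - matched_image_xy) offsets.
reg = LinearRegression().fit(np.stack(feat_list), np.stack(offset_list))

# Prediction: the matched image's coordinates plus the regressed correction.
pred_xy = train_xy[best] + reg.predict(affine_features(k0, k1, c)[None, :])[0]
```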
D. Camera Transformation - Essential Matrix
- Shortlist the top 10 most ViT-similar images in the train set
- Estimate an essential matrix for each of the 10 candidate pairs from the LoFTR matching points
- Select the image whose essential matrix has the most inlier points and predict its coordinates (see the sketch after this list)
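The essential-matrix step can be implemented with OpenCV's RANSAC estimator. The intrinsics `K` are an assumption (the report gives no calibration); we place the principal point at the centre of the 680 x 490 frame, and `match_pts` is a hypothetical map from candidate index to its (kpts0, kpts1) LoFTR matches:

```python
import cv2
import numpy as np

# Assumed pinhole intrinsics: principal point at the image centre and a
# placeholder focal length; the true calibration is not given in the report.
f = 600.0
K = np.array([[f, 0.0, 340.0],
              [0.0, f, 245.0],
              [0.0, 0.0, 1.0]])

def essential_inliers(kpts0, kpts1):
    """Number of RANSAC inliers of the essential matrix between two match sets."""
    if len(kpts0) < 5:                  # the five-point algorithm needs >= 5 matches
        return -1
    E, mask = cv2.findEssentialMat(kpts0, kpts1, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    return 0 if mask is None else int(mask.sum())

# Approach D: keep the candidate whose essential matrix explains the most matches.
best = max(cand_idx, key=lambda i: essential_inliers(*match_pts[i]))
pred_xy = train_xy[best]
```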
E. Camera Transformation - Camera Pose
- Shortlist the top 10 most ViT-similar images in the train set
- Estimate an essential matrix for each of the 10 candidate pairs from the LoFTR matching points
- Decompose each essential matrix into a rotation and a translation
- Select the image with the most LoFTR points consistent with the recovered rotation and translation, and predict its coordinates (see the sketch after this list)
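cv2.recoverPose decomposes E into R and t and returns how many correspondences pass the cheirality check, i.e. the points consistent with the recovered pose; that count is the selection criterion here. This reuses the assumed `K`, `cand_idx`, and `match_pts` from the previous sketch:

```python
import cv2

def pose_support(kpts0, kpts1):
    """Approach E: matches consistent with the recovered rotation and translation."""
    if len(kpts0) < 5:
        return -1
    E, _ = cv2.findEssentialMat(kpts0, kpts1, K, method=cv2.RANSAC)
    if E is None or E.shape != (3, 3):  # estimation may fail or stack candidate solutions
        return -1
    # recoverPose decomposes E into R, t and returns the number of points that
    # pass the cheirality (in-front-of-both-cameras) check.
    n_good, R, t, _ = cv2.recoverPose(E, kpts0, kpts1, K)
    return n_good

best = max(cand_idx, key=lambda i: pose_support(*match_pts[i]))
pred_xy = train_xy[best]
```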
| Approach | A | B | C | D | E |
|---|---|---|---|---|---|
| MAE (test) | 7.01 | 6.14 | 5.86 | 4.35 | 4.37 |
These approaches achieve competitive performance in the Kaggle competition; the best model ranks 11th out of 215 teams.
Torch 1.7.1
Torchvision 0.10.1
PIL 8.3.1
Pandas 1.3.1
Numpy 1.19.5
Matplotlib 3.4.2
OpenCV 4.5.2
[1] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[2] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, "LoFTR: Detector-free local feature matching with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8922-8931.
Chen-An Fan @derek20F
Hee Won Kim @Heewon-Hailey