- Faster-RCNN-50-FPN -pretrained off-the-box pytorch network
- Mask-RCNN-50-FPN - pretrained off-the-box pytorch network
- Mask-RCNN-101-FPN - built with pretrained backbone on ImageNet dataset
- Mask-RCNN-101-FPN with customised anchors sizes=(16, 32, 64, 128, 256, 512) - built with pretrained backbone on ImageNet dataset
- Mean Average Precision (AP or mAP) at IoU [0.5, 0.05, 0.95], 0.75 and 0.50
With the developments in deep learning, the applicability of computer vision has been widely spread in fields such as robotics, image search, recognition and autonomous driving. In this work Mask-RCNN-ResNet50-FPN, Mask-RCNN-ResNet101-FPN, Mask-RCNN-ResNet101-FPN with customised anchor sizes and Faster-ResNet50-FPN were used for the instance semantic segmentation and object detection on the PASCAL VOC 2012 dataset.
A python environment was setup and the experiments were built using Pytorch.
For this project only the images, annotations and segmented class masks of the Pascal VOC 2012 kit dataset were used [1]. This dataset has 21 classes including the background. The dataset contained in total 2913 images with annotations and ground truth. The holdout method with proportion of 80:10:10 for training, validation and test of the models.
voc_classes: ' __background__ ','aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus','car','cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'
The approach to training was transfer learning and fine tuning the last three layers of the models. The models were trained to minimize the loss with SGD and Adam optimizers, a weght decay of 0.0001 and learning rates of 0.001, 0.005 and 0.0001 with a step scheduler for 5 steps and gamma of 0.2. All models were trained on google colab.
The performance of the model was measured after each epoch training on the validation subset and on the test subset after training. The coco style mean average precision was measure at IoU thresholds [0:50,0:05,0:95], 0.50 and 0.75 [2]. The mean average precision (mAP) was computed for all classes and per each class on the test subset [3].
Tables 1-2 show the mean average precision of the best models for all classes and per class, respectively. Mask-RCNN-ResNet-50-FPN pretrained with the v2 of the parameters hit the highest mAP for segmentation and detection at all intersections over union (IoU) thresholds (Table 1). It can be observed that the custom size of the anchors had a positive effect on increasing the mAP for Mask-RCNN-101-FPN for both v1 and v2 of the weights (Table 1). This maybe has contributed to detecting occluded or difficult-to-detect objects in images. The Mask-RCNN models' performance was higher with the v2 (Table 1-2). Except for the aeroplane and boat, Mask-RCNN-ResNet-50 FPN all classes obtained the highest detection accuracies (Table 2)
Table 1. Best models performance on the test dataset. The mean average precision is a percentage for all Pascal VOC classes 2012. Mask-RCNN-ResNet101-FPN-CA has customised anchors sizes=(16, 32, 64, 128, 256, 512). The best results are shown in bold.
Network | Epochs | lr | Optimizer | mAP @IoU [0:50,0:05,0:95] detection |
mAP @IoU=50 detection |
mAP @IoU=75 detection |
mAP @IoU [0:50,0:05,0:95] segmentation |
mAP @IoU=50 segmentation |
mAP @IoU=75 segmentation |
---|---|---|---|---|---|---|---|---|---|
Faster-RCNN-ResNet50-FPN | 15 | 0.005 | SGD | 51.7 | 79.1 | 56.13 | NA | NA | NA |
Mask-RCNN-ResNet50-FPN | 15 | 0.005 | SGD | 53.6 | 78.2 | 62.43 | 44.56 | 71.43 | 47.7 |
Mask-RCNN-ResNet101-FPN | 20 | 0.0001 | Adam | 42.1 | 72.2 | 44.4 | 38.2 | 64.2 | 39.8 |
Mask-RCNN-ResNet101-FPN-CA& | 20 | 0.0001 | Adam | 43.7 | 69.9 | 48.8 | 39.2 | 63.9 | 42.5 |
Mask-RCNN-ResNet50-FPN& | 20 | 0.005 | SGD | 73.51 | 92.08 | 84.86 | 66.7 | 91.04 | 75.01 |
Mask-RCNN-ResNet101-FPN& | 20 | 0.0001 | Adam | 42.82 | 80.0 | 52.18 | 49.62 | 78.25 | 55.29 |
Mask-RCNN-ResNet101-FPN-CA& | 20 | 0.0001 | Adam | 60.97 | 85.62 | 68.61 | 56.03 | 83.67 | 61.25 |
& Results were obtained by fine-tuning the models using the weight version 2.
Table 2. Best models performance on the test dataset. The mean average precision is given per each object class of the Pascal VOC 2012 dataset. All results are shown in percentages. Mask-RCNN-ResNet101-FPN-CA has customised anchors sizes=(16, 32, 64, 128, 256, 512).
Model | Task | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tvmonitor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Faster-RCNN-ResNet50-FPN | detection | 92.5 | 68.17 | 89.73 | 76.43 | 69.0 | 92.32 | 81.42 | 85.87 | 48.28 | 78.64 | 53.57 | 84.38 | 88.61 | 93.32 | 89.21 | 71.39 | 57.67 | 80.08 | 94.1 | 87.68 |
Mask-RCNN-ResNet50-FPN& | detection | 94.91 | 95.45 | 93.16 | 93.03 | 86.3 | 99.84 | 93.23 | 93.72 | 84.78 | 93.18 | 91.42 | 94.26 | 97.36 | 95.78 | 94.35 | 82.14 | 90.05 | 80.49 | 98.77 | 89.43 |
Mask-RCNN-ResNet50-FPN& | segmentation | 95.79 | 95.91 | 93.16 | 85.4 | 86.12 | 99.84 | 93.93 | 93.72 | 80.58 | 93.99 | 89.91 | 92.57 | 97.36 | 95.78 | 94.21 | 83.61 | 90.05 | 70.72 | 98.77 | 89.43 |
Mask-RCNN-ResNet101-FPN& | detection | 81.08 | 86.14 | 81.77 | 86.67 | 68.85 | 95.55 | 75.43 | 83.41 | 69.79 | 71.82 | 73.77 | 82.97 | 74.76 | 86.72 | 89.02 | 78.17 | 81.37 | 58.37 | 86.88 | 87.46 |
Mask-RCNN-ResNet101-FPN& | segmentation | 83.99 | 75.4 | 83.76 | 84.12 | 64.91 | 94.15 | 74.92 | 83.41 | 61.93 | 71.82 | 66.31 | 82.77 | 73.26 | 86.68 | 87.67 | 80.03 | 83.24 | 53.21 | 86.88 | 86.46 |
Mask-RCNN-ResNet101-FPN-CA& | detection | 97.11 | 86.02 | 92.59 | 94.73 | 67.88 | 89.2 | 87.47 | 89.2 | 78.35 | 84.82 | 84.95 | 82.84 | 88.24 | 92.61 | 90.93 | 76.63 | 81.53 | 68.6 | 93.47 | 85.28 |
Mask-RCNN-ResNet101-FPN-CA& | segmentation | 97.11 | 73.47 | 94.68 | 92.53 | 67.88 | 89.2 | 87.47 | 89.97 | 69.28 | 85.93 | 69.15 | 79.87 | 87.69 | 93.85 | 90.3 | 78.02 | 82.42 | 66.41 | 94.65 | 83.53 |
& Results by class were obtained by fine-tuning the models using the weight version 2.
Models inference on the test subset are depicted in Figure 1.
(a)
(b)
(c)
(d)
Figure 1. Trained Models inference on the test set. (a) Faster-RCNN-ResNet-50-FPN. (b) Mask-RCNN-ResNet50-FPN. (c) Mask-RCNN-ResNet101-FPN. (d) Mask-RCNN-ResNet101-FPN-CA.
[1] The PASCAL Visual Object Classes Homepage’. http://host.robots.ox.ac.uk/pascal/VOC/
[2] COCO - Common Objects in Context’. https://cocodataset.org/#detection-eval (accessedAug. 02, 2020).
[3] K. Morabia, J. Arora, and T. Vijaykumar, ‘Attention-based Joint Detection of Object and Semantic Part’, arXiv:2007.02419 [cs], Jul. 2020, Accessed: Jul. 02, 2020. [Online]. Available: http://arxiv.org/abs/2007.02419