Bharath Kumar Bolla, Dinesh, Manu, Sabeesh
ASSIGNMENT
Twelve different models were built and trained using various architectures. The architecture of each model experimented upon is described below.
The following were the augmentations used in the code
Model 1 – Base model for architecture tuning
Model 2
- There is no dilation block. Total 147,616 parameters
- Four Convolutional blocks – 2 layers per block
- No separate dilation layer.
- 3x3 convolution with stride 2 to replicate a max pooling like layer (a sketch of this layer follows the list). No 1x1 convolution in the max pooling like layer.
- Normal sequential passing of layers. No specialized functions such as torch.add to combine layer outputs, as there is no dilation.
- Highest Accuracy – 84.01 (100 epoch)
- Target Accuracy – 84.01 (100 epoch)
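A minimal PyTorch sketch of the "max pool like" layer described above: a plain 3x3 convolution with stride 2 stands in for max pooling, and no 1x1 convolution follows. The channel count (32) and input size are illustrative assumptions, not values taken from the actual model.

```python
import torch
import torch.nn as nn

# "Max pool like" layer: a strided 3x3 convolution halves the spatial
# dimensions instead of using MaxPool2d. No 1x1 convolution follows.
# Channel count (32) and input size are illustrative only.
downsample = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 32, 32, 32)          # N x C x H x W
print(downsample(x).shape)              # torch.Size([1, 32, 16, 16])
```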
Model 3
- Total 196,336 parameters
- Three convolutional layers per block – four convolutional blocks in total
- No separate dilation layer.
- 3x3 convolution with stride 2, padding 2 and dilation 2 to replicate (dilation + max pooling), as sketched after this list. No 1x1 convolution in the max pooling like layer.
- Normal sequential passing of layers.
- Target Accuracy – 85.51 (125 epoch)
- Highest Accuracy – 86.32 (236 epoch)
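A minimal sketch of Model 3's downsampling layer as described above: a single 3x3 convolution with stride 2, padding 2 and dilation 2 that combines a dilated kernel with max-pooling-like downsampling. Channel count and input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stride-2 dilated 3x3 convolution: downsamples like max pooling while the
# dilated kernel (effective size 5x5) enlarges the receptive field.
# Channel count (32) and input size are illustrative only.
dilated_downsample = nn.Conv2d(32, 32, kernel_size=3, stride=2,
                               padding=2, dilation=2)

x = torch.randn(1, 32, 32, 32)
print(dilated_downsample(x).shape)      # torch.Size([1, 32, 16, 16])
```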
Model 4
- Similar to Model 7 but without the dilation block. Total 147,616 parameters
- Four Convolutional blocks – 2 layers per block
- No separate dilation layer.
- 3x3 convolution with stride 2 to replicate max pooling like layer. No 1x1 convolution in the max pooling like layer.
- Normal sequential passing of layers. No specialized functions such as torch.add to combine layer outputs, as there is no dilation.
- Training was terminated at epoch 23 as there was no improvement. Highest Accuracy – 75.71 (21 epoch)
- Target Accuracy – 75.71 (21 epoch)
Model 5
- Similar to Model 7 but without the dilation block. Total 147,616 parameters
- Four Convolutional blocks - 2 layers per block
- No separate dilation layer.
- 3x3 convolution with stride 2 to replicate a max pooling like layer. A 1x1 convolution is introduced for the first time in the max pooling like layer.
- Normal sequential passing of layers. No specialized functions such as torch.add to combine layer outputs, as there is no dilation.
- Target Accuracy – 85.19 (166 epoch)
- Highest Accuracy – 85.82 (201 epoch)
Model 6
- Total 187,296 parameters
- Four Convolutional blocks – 2 layers per block
- No separate dilation layer.
- Pure dilation with different kernel sizes (k = 10, 5, 3) in successive blocks, followed by a 1x1 convolution – max pool like layer
- Normal sequential passing of layers. No specialized functions such as torch.add to combine layer outputs
- Highest Accuracy – 77.96 (232 epoch)
- Target Accuracy – 77.96 (232 epoch)
Model 7
- Total 153,104 parameters
- Four Convolutional blocks – 2 layers per block
- Dilation layer in third block
- No adding of features of dilation layer with normal layer in the third block
- 3x3 convolution with stride 2 to replicate a max pooling like layer.
- Target Accuracy – 84.50 (248 epoch)
- Highest Accuracy – 84.50 (248 epoch). Not adding the layer outputs in the dilation block does not improve performance.
Model 8
- Total 153,104 parameters
- Four Convolutional blocks – 2 layers per block
- Dilation layer in third block
- torch.add() in the 1st, 2nd and 3rd conv blocks – two similar output layers are added before being passed into the max pool like layer
- 3x3 convolution with stride 2 + 1x1 convolution block – max pool like layer
- Target Accuracy – 85.08 (171 epoch)
- Highest Accuracy – 85.40 (248 epoch)
Model 9
- Total 197,888 parameters
- Four Convolutional blocks
- Dilation layer in second convolutional block
- torch.add() in the 2nd conv block – two similar output layers are added before being passed into the max pool like layer.
- Pure dilation layers (8, 4, 2) followed by a 1x1 convolution – max pool like layer
- There is no significant improvement in model accuracy when pure dilation layers are used (accuracy is static at 67% validation and 53% training – effectively a random model). The model fails in the case of pure dilation layers.
Model 10 - This is the ideal model
- Total 153,104 parameters
- Four Convolutional blocks
- Dilation layer in third block
- torch.add() in the third conv block
- 3x3 convolution with stride 2 followed by a 1x1 convolution – max pool like layer
- Four depthwise convolutional layers
- Target Accuracy – 85.09 (139 epoch)
- Highest Accuracy – 86.31 (316 epoch)
- Receptive field calculation – the effective receptive field is 83 (see the calculation sketch after this list).
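A small sketch of how such a receptive field figure can be computed, assuming the standard per-layer recurrence r ← r + (k_eff − 1)·j and j ← j·s, with k_eff = d·(k − 1) + 1 for dilated kernels. The layer list below is a hypothetical placeholder; Model 10's actual kernels, strides and dilations would have to be substituted to reproduce the reported value of 83.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, in forward order.
    Returns the receptive field of one output unit on the input image."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1        # effective kernel size under dilation
        rf += (k_eff - 1) * jump       # growth scaled by cumulative stride
        jump *= s
    return rf

# Hypothetical layer stack -- substitute Model 10's real layers here.
example_layers = [(3, 1, 1), (3, 1, 1), (3, 2, 1),
                  (3, 1, 1), (3, 1, 2), (3, 2, 1)]
print(receptive_field(example_layers))  # 23 for this toy stack
```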
Model 11
- Total 153,104 parameters
- Four Convolutional blocks
- Dilation layer in third block
- torch.add() in the 1st, 2nd and 3rd conv blocks – two similar output layers are added before being passed into the max pool like layer
- 3x3 convolution with stride 2 followed by a 1x1 convolution – max pool like layer
- Target Accuracy – 85.08 (171 epoch)
- Highest Accuracy – 85.40 (248 epoch). Accuracy is the same as when features are added in the dilation block alone; feature addition in the normal layers contributes nothing.
Model 12
- Total 99,936 parameters
- Four convolutional blocks
- Dilation layer in the third block
- torch.mul() in the 1st, 2nd and 3rd conv blocks – two similar output layers are multiplied before being passed into the max pool like layer
- 3x3 convolution followed by a 1x1 convolution with stride 2 – max pool like layer
- All layers use depthwise convolution
- Target Accuracy – 82.98 (249 epoch)
- Highest Accuracy – 82.98 (249 epoch). No significant improvement when multiplying the features of the dilation and non-dilation layers.
Analysis and Findings of the architecture
- Reason for a normal 3x3 convolution layer following the depthwise convolution layer. A conventional 3x3 convolutional layer has been used in the first layer of every block and in all the layers of the fourth block. It is hypothesized that, since depthwise convolution has fewer parameters and the initial extraction of features is important for the final prediction, this preliminary feature extraction cannot be compromised: fewer parameters mean lower-quality feature extraction in the initial layers. Adding a normal 3x3 convolution after a depthwise convolution increases the parameter count, so feature learning is not compromised (a sketch follows).
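A minimal sketch of the pairing described above, assuming a depthwise 3x3 followed by a conventional 3x3; the channel widths (32 → 64) are illustrative, not the model's actual values. The weight counts in the comments show why the conventional layer carries most of the feature-extraction capacity.

```python
import torch.nn as nn

# Depthwise 3x3 (groups == in_channels) followed by a conventional 3x3.
# Channel widths (32 -> 64) are illustrative only.
block = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32),  # depthwise: 32*3*3 = 288 weights
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),              # conventional: 64*32*3*3 = 18,432 weights
    nn.ReLU(),
)
```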
- Addition of features from the layer after the dilated-kernel layer. The third convolutional block consists of two layers: a layer without dilation and a layer with dilation, which extract the same number of features with the same output dimensions. Because of the dilated kernel, the pattern of feature extraction changes relative to the previously trained layers, which may cause the model's validation accuracy to vary. To prevent this, the two layer outputs are added using torch.add() (sketched below). It is hypothesized that this results in feature augmentation and hence better model performance than without the feature addition.
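A minimal sketch of that addition, assuming the dilated layer consumes the plain layer's output and the two outputs are then fused with torch.add(); the channel count (64) and feature-map size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Third block: a plain 3x3 layer and a dilated 3x3 layer with matching
# output shape; their outputs are added before the "max pool like" layer.
# Channel count (64) and spatial size are illustrative only.
conv_plain   = nn.Conv2d(64, 64, kernel_size=3, padding=1)
conv_dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 64, 16, 16)
out_plain   = conv_plain(x)
out_dilated = conv_dilated(out_plain)
fused = torch.add(out_plain, out_dilated)   # shapes match: 1 x 64 x 16 x 16
```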
- Adding a 1x1 convolution after the "max pool like" layer. Since no max pooling layer is used, a kernel applied with stride 2 extracts features while skipping some positions because of the stride. To compensate for this loss, feature learning is augmented with a 1x1 convolution. Because a 1x1 convolution sums the features across channels into a new feature map, this property is exploited in place of max pooling: the 1x1 convolution combines all the features that were convolved separately, so features are not lost (see the sketch below).
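A minimal sketch of this downsampling pair, a strided 3x3 convolution followed by a 1x1 convolution; the channel count and input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# "Max pool like" layer plus channel mixing: the strided 3x3 downsamples,
# and the 1x1 convolution combines information across channels to make up
# for positions skipped by the stride. Channel count (64) is illustrative.
downsample = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(64, 64, kernel_size=1),
)

x = torch.randn(1, 64, 16, 16)
print(downsample(x).shape)              # torch.Size([1, 64, 8, 8])
```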
- torch.add() on normal layers. It was found that adding the feature outputs of two consecutive same-channel, same-dimension layers within the same convolutional block did not significantly increase the performance of the model. However, removing torch.add() from the convolutional block containing the dilation layer caused the model's performance to fall. It can be hypothesized that the way features are extracted needs to remain consistent (i.e. a gradual increase in receptive field) across all layers; any sudden jump in receptive field size distorts the learned features and therefore drops performance. Adding the normal output to the dilated output restores this feature learning and results in better model performance.
- torch.mul() on all layers. Multiplication of features was also experimented with on all layers having same-dimension, same-channel outputs (sketched below). It was hypothesized that multiplying the outputs would produce more exaggerated feature extraction, but this proved to be incorrect. It is instead hypothesized that multiplying features from similar-dimension, similar-channel outputs scales the extracted features by a multiplicative factor, so some features become over-represented while others are under-represented. This distorts learning and hence reduces model performance.
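A minimal sketch of the multiplicative fusion tried in Model 12: an element-wise product of two same-shape outputs via torch.mul(). The channel count and the particular layer pair are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Element-wise multiplication of two same-shape feature maps, as tried in
# Model 12, in place of torch.add(). Channel count (64) is illustrative.
conv_a = nn.Conv2d(64, 64, kernel_size=3, padding=1)
conv_b = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 64, 16, 16)
fused = torch.mul(conv_a(x), conv_b(x))  # element-wise product, 1 x 64 x 16 x 16
```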