This is an attempt to replicate the following paper, since the hyperparameter link given in the paper is no longer working.
Maxout Networks (Goodfellow et al.), arXiv:1302.4389 [stat.ML]
- dataset: the MNIST database of handwritten digits
- GPU: 1 × Tesla M60 (GM204GL), 8 GB
- CPU: 4 cores, 30.5 GiB RAM
- logs and model: here
Diagram: the maxout module with multilayer perceptrons.
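The computation the diagram depicts is simple: each maxout unit outputs the maximum over k learned linear pieces, max_j(w_j^T x + b_j). A minimal sketch, assuming PyTorch (the repo's actual framework and class names are assumptions), with layer sizes taken from the first table below:

```python
import torch.nn as nn

class Maxout(nn.Module):
    """A linear layer that outputs the max over k learned pieces."""
    def __init__(self, in_features, out_features, k):
        super().__init__()
        self.out_features, self.k = out_features, k
        self.linear = nn.Linear(in_features, out_features * k)

    def forward(self, x):
        z = self.linear(x)                         # (N, out_features * k)
        z = z.view(-1, self.out_features, self.k)  # group the k pieces per unit
        return z.max(dim=2).values                 # max over the pieces

# Sizes from the first table below: 2048 units with 4 pieces,
# then 10 outputs with 2 pieces.
model = nn.Sequential(
    nn.Flatten(),
    Maxout(28 * 28, 2048, k=4),
    Maxout(2048, 10, k=2),
)
```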
- Train (first 50,000 training samples): `python mnist.py --mlp 1 --train true`
- Validation (remaining 10,000 training samples): `python mnist.py --mlp 1 --valid true`
- Train continuation (whole training set, continuing from the previous run): `python mnist.py --mlp 1 --train_cont true`
- Testing: `python mnist.py --mlp 1 --test true`
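Judging by the commands above, mnist.py dispatches on these flags. The following is a hypothetical sketch of that argument parsing, not the repo's actual code; note that flags passed as the string "true" need explicit parsing, since Python's bool() treats any non-empty string as true:

```python
import argparse

def str2bool(v):
    # argparse's bool() would treat the string "false" as True,
    # so parse "true"/"false" explicitly.
    return str(v).lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser(description="Maxout MNIST replication")
parser.add_argument("--mlp", type=int, default=0)   # 1 selects the maxout MLP model
parser.add_argument("--conv", type=int, default=0)  # 1 selects the conv maxout model
parser.add_argument("--train", type=str2bool, default=False)       # first 50000 samples
parser.add_argument("--valid", type=str2bool, default=False)       # held-out 10000 samples
parser.add_argument("--train_cont", type=str2bool, default=False)  # continue on the whole set
parser.add_argument("--test", type=str2bool, default=False)        # evaluate on the test set
args = parser.parse_args()
```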
For the complete hyperparameter tuning, see the hyper-tuning.rst file.
- Learning rate: 0.005
Training (first 50,000 training samples):

| Epochs | Batch size | Layer1 pieces | Layer1 units | Layer2 pieces | Layer2 units | Acc % | Loss |
|---|---|---|---|---|---|---|---|
| 5 | 64 | 4 | 2048 | 2 | 10 | 97.79 | 1.5060 |
| 5 | 64 | 4 | 1024 | 2 | 10 | 97.44 | 1.5107 |
Validation (remaining 10,000 training samples):

| Epochs | Batch size | Layer1 pieces | Layer1 units | Layer2 pieces | Layer2 units | Acc % | Loss |
|---|---|---|---|---|---|---|---|
| 5 | 64 | 4 | 2048 | 2 | 10 | 96.94 | 1.5097 |
| 5 | 64 | 4 | 1024 | 2 | 10 | 96.83 | 1.5108 |
The best model was then trained further on the whole training dataset, with the following accuracy and loss.
| Epochs | Batch size | Layer1 pieces | Layer1 units | Layer2 pieces | Layer2 units | Acc % | Loss |
|---|---|---|---|---|---|---|---|
| 5 | 64 | 4 | 2048 | 2 | 10 | 99.02 | 1.4827 |
Testing:

| Batch size | Layer1 pieces | Layer1 units | Layer2 pieces | Layer2 units | Acc % | Loss |
|---|---|---|---|---|---|---|
| 64 | 4 | 2048 | 2 | 10 | 97.17 | 1.5007 |
- Train (50,000 shuffled training samples): `python mnist.py --conv 1 --train true`
- Validation (remaining 10,000 training samples): `python mnist.py --conv 1 --valid true`
- Train continuation (whole training set, continuing from the previous run): `python mnist.py --conv 1 --train_cont true`
- Testing: `python mnist.py --conv 1 --test true`
For training on the 50,000 shuffled samples, the learning rate is first set to 0.01 and halved at epoch 5. The model with the lowest validation error is then retrained from its pretrained weights on the whole training set; this time the learning rate starts at 0.001 and is again halved at epoch 5.
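This halving schedule maps directly onto a step scheduler. A minimal sketch, assuming PyTorch and SGD (the README states the learning rates and the halving point, but not the optimizer):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(784, 10)  # stand-in model; the conv net is sketched below

# Assumption: plain SGD. Start at 0.01 (0.001 for the retraining run).
optimizer = optim.SGD(model.parameters(), lr=0.01)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5], gamma=0.5)

for epoch in range(10):
    # ... one epoch of training would run here ...
    scheduler.step()  # lr is 0.01 for epochs 0-4, 0.005 afterwards
```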
The architecture presented in the paper is:

conv -> maxpool -> conv -> maxpool -> conv -> maxpool -> MLP -> softmax

The MLP's output is 10 (one per class), and its input is whatever number of features comes out of the third maxpool. The only things I had to adjust were the kernels and paddings of the convolutional layers, since those are the only free parameters in the network. For example, with a 7 x 7 kernel and padding 3, a 28 x 28 MNIST image stays 28 x 28 after Conv1, and each 2 x 2 maxpool with stride 1 removes one pixel per side, so the third maxpool outputs 25 x 25 = 625 features, matching the first row of the tables below.
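A minimal sketch of that first configuration, assuming PyTorch (channel counts are not recorded in the tables, so the 32 used here is a placeholder, and all names are illustrative). Maxout over feature maps is implemented by producing k times as many maps and taking an element-wise max over each group of k:

```python
import torch.nn as nn

class MaxoutConv2d(nn.Module):
    """Convolution producing k times the target maps, then an
    element-wise max over each group of k feature maps."""
    def __init__(self, in_ch, out_ch, kernel, pad, k=2):
        super().__init__()
        self.out_ch, self.k = out_ch, k
        self.conv = nn.Conv2d(in_ch, out_ch * k, kernel, padding=pad)

    def forward(self, x):
        z = self.conv(x)  # (N, out_ch * k, H, W)
        n, _, h, w = z.shape
        return z.view(n, self.out_ch, self.k, h, w).max(dim=2).values

# The last maxout conv is reduced to a single map so that flattening
# yields the 625 MLP inputs listed in the tables; the repo's actual
# channel counts may differ.
model = nn.Sequential(
    MaxoutConv2d(1, 32, kernel=7, pad=3),   # 28x28 -> 28x28
    nn.MaxPool2d(2, stride=1),              # -> 27x27
    MaxoutConv2d(32, 32, kernel=5, pad=2),  # -> 27x27
    nn.MaxPool2d(2, stride=1),              # -> 26x26
    MaxoutConv2d(32, 1, kernel=5, pad=2),   # -> 26x26
    nn.MaxPool2d(2, stride=1),              # -> 25x25, i.e. 625 features
    nn.Flatten(),
    nn.Linear(625, 10),  # softmax is applied by the cross-entropy loss
)
```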
Training (50,000 shuffled training samples):

| Epochs | Batch | Conv1 kernel | Conv1 pad | Maxpool1 size | Maxpool1 stride | Conv2 kernel | Conv2 pad | Maxpool2 size | Maxpool2 stride | Conv3 kernel | Conv3 pad | Maxpool3 size | Maxpool3 stride | MLP in | MLP out | Acc % | Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 64 | 7 x 7 | 3 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 625 | 10 | 97.09 | 1.4921 |
| 10 | 64 | 5 x 5 | 3 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 729 | 10 | 87.62 | 1.5856 |
| 10 | 64 | 5 x 5 | 3 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 961 | 10 | 95.43 | 1.5088 |
| 10 | 64 | 5 x 5 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 841 | 10 | 95.96 | 1.5037 |
Validation (remaining 10,000 training samples):

| Batch | Conv1 kernel | Conv1 pad | Maxpool1 size | Maxpool1 stride | Conv2 kernel | Conv2 pad | Maxpool2 size | Maxpool2 stride | Conv3 kernel | Conv3 pad | Maxpool3 size | Maxpool3 stride | MLP in | MLP out | Acc % | Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 7 x 7 | 3 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 625 | 10 | 96.85 | 1.4928 |
| 64 | 5 x 5 | 3 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 729 | 10 | 87.76 | 1.5828 |
| 64 | 5 x 5 | 3 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 961 | 10 | 95.16 | 1.5828 |
| 64 | 5 x 5 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 841 | 10 | 96.15 | 1.5012 |
Train continuation (whole training set, retrained from the pretrained weights):

| Epochs | Batch | Conv1 kernel | Conv1 pad | Maxpool1 size | Maxpool1 stride | Conv2 kernel | Conv2 pad | Maxpool2 size | Maxpool2 stride | Conv3 kernel | Conv3 pad | Maxpool3 size | Maxpool3 stride | MLP in | MLP out | Acc % | Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 64 | 7 x 7 | 3 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 625 | 10 | 97.58 | 1.4874 |
| 10 | 64 | 5 x 5 | 3 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 729 | 10 | 88.04 | 1.5811 |
| 10 | 64 | 5 x 5 | 3 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 961 | 10 | 96.25 | 1.5011 |
| 10 | 64 | 5 x 5 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 841 | 10 | 96.75 | 1.4960 |
Testing:

| Batch | Conv1 kernel | Conv1 pad | Maxpool1 size | Maxpool1 stride | Conv2 kernel | Conv2 pad | Maxpool2 size | Maxpool2 stride | Conv3 kernel | Conv3 pad | Maxpool3 size | Maxpool3 stride | MLP in | MLP out | Acc % | Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 7 x 7 | 3 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 625 | 10 | 96.87 | 1.4929 |
| 64 | 5 x 5 | 3 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 5 x 5 | 2 | 2 x 2 | 1 | 729 | 10 | 87.39 | 1.5861 |
| 64 | 5 x 5 | 3 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 961 | 10 | 95.52 | 1.5070 |
| 64 | 5 x 5 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 3 x 3 | 2 | 2 x 2 | 1 | 841 | 10 | 96.30 | 1.4989 |