We implement the Vision Transformer (ViT, https://arxiv.org/abs/2010.11929) in the cloud on AWS. Since new AWS accounts are not granted GPU instances by default, we had to scale the original ViT down significantly. Using a c5.4xlarge instance (16 Intel vCPUs), we obtain 75% classification accuracy on the CIFAR-10 dataset. Hyperparameters:
Hyperparameter | Value |
---|---|
Epochs | 20 |
Batch size | 256 |
Optimiser | AdamW |
Weight decay | 1e-4 |
Learning rate | 1e-3 |
Attention heads | 5 |
Patch size | 4 |
Transformer layers | 6 |
Transformer MLP units (per block) | 128, 64 |
Classification head MLP units | 512, 128 |
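
As a reference for the scaled-down architecture, here is a minimal sketch assuming a Keras/TensorFlow implementation (the framework is not stated above; the 64-dimensional patch projection and the dropout rates are likewise assumptions, not values from the table):

```python
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 10                       # CIFAR-10
input_shape = (32, 32, 3)
patch_size = 4                         # from the table
num_patches = (32 // patch_size) ** 2  # 8 x 8 = 64 patches
projection_dim = 64                    # assumed, not stated in the table
num_heads = 5                          # from the table
transformer_units = [128, 64]          # MLP inside each transformer block
transformer_layers = 6                 # from the table
mlp_head_units = [512, 128]            # classification head MLP


class PatchEmbed(layers.Layer):
    """Patchify with a strided convolution and add learned position embeddings."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.proj = layers.Conv2D(projection_dim, patch_size, strides=patch_size)
        self.pos = layers.Embedding(num_patches, projection_dim)

    def call(self, images):
        x = self.proj(images)  # (batch, 8, 8, projection_dim)
        x = tf.reshape(x, (-1, num_patches, projection_dim))
        return x + self.pos(tf.range(num_patches))


def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation="gelu")(x)
        x = layers.Dropout(dropout_rate)(x)
    return x


def build_vit():
    inputs = layers.Input(shape=input_shape)
    x = PatchEmbed()(inputs)
    for _ in range(transformer_layers):
        # Pre-norm multi-head self-attention with a residual connection.
        x1 = layers.LayerNormalization(epsilon=1e-6)(x)
        attn = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1)(x1, x1)
        x2 = layers.Add()([attn, x])
        # Pre-norm MLP block (128 -> 64 units) with a residual connection.
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        x3 = mlp(x3, transformer_units, dropout_rate=0.1)
        x = layers.Add()([x3, x2])
    # Flatten the token representations and classify with the MLP head.
    rep = layers.LayerNormalization(epsilon=1e-6)(x)
    rep = layers.Flatten()(rep)
    features = mlp(rep, mlp_head_units, dropout_rate=0.5)
    outputs = layers.Dense(num_classes)(features)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```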
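
The training setup from the table could then look like this (again a sketch: `tf.keras.optimizers.AdamW` requires TensorFlow 2.11 or newer, and the [0, 1] pixel scaling is an assumption):

```python
# Load CIFAR-10 and scale pixels to [0, 1] (preprocessing is assumed).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = build_vit()
model.compile(
    # Optimiser, learning rate, and weight decay from the table.
    optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=256, epochs=20,
          validation_data=(x_test, y_test))
```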