In this step, we randomly visualize a sample of images and get rough idea of how the data are distributed.
Some things to try:
- Number of images in each group
- Number of images across different sizes.
- Images with watermark / weird logos.
In this step, we create a pipeline of preprocessing steps to generate a final dataset of images with equal sizes.
P.S. Size of an image will depend on the neural network used (I usually use the default size from the paper, but this could change depending on the dataset).
This step is only to generate additional datasets. Things to try:
- Mirroring
- Random Cropping (Crop a large portion of image, probably there are some common heuristic for this)
- Rotation
- Shearing
- Color shifting (See PCA Color Augmentation for more advanced algorithms)
Apply this only if it makes sense to do so. Common implementation for this is to use some CPU threads to perform data augmentation s.t. some processes load data from harddisk, some processes perform data augmentation, and some other processsed perform Training (GPU). I think people normally use ImageDataGenerator
for keras
or manually assigning CPU and GPU for tensorflow
.
Do not forget to resize the original images to the size you want to fit to your model.
If there are public datasets that have the same distribution as our test data, please recommend.
Things to try:
- All pretrained model available in
keras
API. For starters, try fine-tuning only the last few layers of the pre-trained model (Do not forget to freeze all the layers you do not want to fine tune). Also, replace the final layer to suit the number of classes of the problem which in this case is 18. - Simple Conv Neural Network with different hyperparameters.
- Other pretrained models that have been trained for this particular problem.
- Retrain the whole pretrained models (Be cautious of overfitting, because our dataset doesn't seem to be that large.
Do not forget to always save the model trained with details unless you only want to do some quick dirty prototyping.
This step has to be synchronized with the Modelling
part and the part that we have to fix before working on our solutions.
Different evaluation methods that are possible:
- K-Fold Cross Validation / Stratified K-Fold (More robust if the dataset is small, but more expensive)
- Holdout / Stratified Holdout (Good if the dataset is enough)
Save training results for future analysis
Before generating prediction, retrain every model using the whole training data available.
Generate submission file with the same format as sample submission.csv
. Always save with different name files that explains how they are generated.
Main analysis:
- Bias & Variance Trade-off