CNNs are typically used for image recognition tasks such as image classification, but they also have applications in recommender systems, natural language processing, and time series analysis.
In general, whenever you face a problem that requires sliding a window over the input, think of a CNN (e.g. an NLP problem that requires a sliding window of text can also be addressed with a CNN).
In this work, we will anchor the explanation of CNNs in image classification tasks, but the knowledge is transferable to other applications.
CNNs take images / video frames as inputs. This might look complicated, but under the hood an image is just an N-dimensional array, and these arrays are what the network actually receives.
For example, a black-and-white (grayscale) image is a 2D array in which each cell contains a number between 0 and 255, indicating the "level of white" of the pixel.
A color image is a 3D array, with one channel each for red, green and blue (RGB).
The next example shows how a "smiley face" can be represented using a simple black (1) or white (0) encoding.
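Below is a minimal sketch of such an encoding (NumPy is assumed purely for illustration; the pattern itself is just a toy example):

```python
import numpy as np

# Tiny 8x8 "smiley face": 1 = black pixel, 0 = white pixel
smiley = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
])
print(smiley.shape)  # (8, 8) -> a plain 2D array, just like a tiny B&W image
```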
This tool provides a nice visualization of how a CNN works: https://www.cs.ryerson.ca/~aharley/vis/conv/flat.html
In the convolution step, feature detectors are applied to the input image to transform it into multiple feature maps.
A Feature Detector is a small grid of numbers (weights) designed to detect a specific feature in an image (e.g. a filter that detects grass).
- Feature detectors are also called filters or kernels.
- The weights of the feature detectors are LEARNED during training. This is where the power of CNNs lies: the algorithm learns which features are important from the data.
- Size of feature detectors: traditionally they are 3x3, but other sizes like 5x5 or 7x7 are also used.
- Stride: the step (in pixels) by which we slide the feature detector when creating the feature map.
- A stride of 1 or 2 is typical, but this is a hyper-parameter.
- The stride and the size of the feature maps are related. The higher the stride, the smaller the feature map.
- Smaller feature maps make all downstream processing easier because there is less information to handle.
- However, strides that are too high may miss important areas of the image.
- When the combination of stride and filter size does not fit exactly into the image size, zero-padding is added to the image to make the dimensions match (see the sketch below).
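To make the relationship between stride, filter size, and padding concrete: the spatial size of a feature map follows floor((n + 2p - k) / s) + 1, where n is the input size, k the filter size, s the stride, and p the padding. A small Python sketch (function name is illustrative):

```python
def conv_output_size(n: int, k: int, s: int, p: int = 0) -> int:
    """Spatial size of a feature map: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(n=32, k=3, s=1))        # 30 -> stride 1, no padding
print(conv_output_size(n=32, k=3, s=2))        # 15 -> higher stride, smaller map
print(conv_output_size(n=32, k=3, s=1, p=1))   # 32 -> padding preserves the size
```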
The Feature Map is the result of applying a feature detector to an input image. It is a spatial representation of how strongly each feature is detected in each area of the image (i.e. how active each area is for that particular feature).
- Feature Maps are also called activation maps or convolved features.
- Are we losing information by applying a feature detector?
- Yes and No.
- Yes, because we are reducing the size relative to the original image, so some information is lost.
- No, because the learned feature detectors focus only on relevant features and discard information that is irrelevant for the problem at hand.
CNNs simultaneously learn and apply multiple feature detectors in the convolution layer. This means that multiple feature maps are created from one image (one feature map per feature detector).
In the end, the hyper-parameters of the convolution layer are: feature detector size, stride, and depth (number of feature detectors).
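As an illustration of how these hyper-parameters show up in code, here is a minimal sketch assuming PyTorch (the concepts themselves are framework-agnostic):

```python
import torch
import torch.nn as nn

# One convolution layer: depth 32 (i.e. 32 feature detectors), each 3x3, stride 1.
conv = nn.Conv2d(in_channels=3,    # RGB input -> 3 channels
                 out_channels=32,  # depth: one feature map per feature detector
                 kernel_size=3,    # 3x3 feature detector
                 stride=1,
                 padding=1)        # zero-padding so the spatial size is preserved

image = torch.randn(1, 3, 64, 64)  # one 64x64 color image (batch of 1)
feature_maps = conv(image)
print(feature_maps.shape)          # torch.Size([1, 32, 64, 64]) -> 32 feature maps
```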
Clipping of negative values in the feature maps is done at the output of each convolution layer by applying a non-linear activation function.
- Clipping = map negative values to 0 or something close to 0.
Typically, a ReLU activation function is applied to each cell of the feature map. However, other functions like the Sigmoid, the Leaky ReLU, or the Parametric ReLU (PReLU, a Leaky ReLU whose negative slope is learned) can also be used.
Why do we need this? It has been observed experimentally that, if the nonlinear clipping operation is removed, the system performance drops by a large margin. The mathematical grounding of why this is needed is complicated and beyond the scope of this summary.
An intuitive (non-precise) explanation is that we want a system architecture that encourages feature detectors to learn weights that ACTIVATE (i.e. go positive) when a feature is detected. With this in mind, a negative activation makes no sense.
Additionally, a negative activation in layer 1 might be picked up by a filter with negative weights in layer 2, resulting in a positive layer-2 output. At this point, the system cannot differentiate between this double negative activation and an activation that has been positive all the way. This ultimately hurts robustness and performance.
You can find more on the mathematical grounding of this here: https://arxiv.org/pdf/1609.04112.pdf
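A minimal sketch of the clipping itself (NumPy assumed purely for illustration):

```python
import numpy as np

feature_map = np.array([[ 2.0, -1.5],
                        [-0.3,  4.0]])

relu = np.maximum(feature_map, 0)  # ReLU: negatives are clipped to 0
leaky = np.where(feature_map > 0, feature_map, 0.01 * feature_map)  # negatives shrunk instead

print(relu)   # [[2. 0.]  [0. 4.]]
print(leaky)  # [[ 2.    -0.015]  [-0.003  4.   ]]
```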
Downsampling (pooling) is done to make the network more robust and to prevent parameter explosion.
- Robustness: pooling makes the network "less picky" about minor variations in the features the filters detect. In particular, we don't want the network to be "tricked" by small differences in the same feature, like slight rotations or differences in texture (this property is called spatial invariance).
- Prevent parameter explosion: downsampling discards information that is not related to the filter's feature. In practice this reduces the number of parameters downstream, which in turn prevents overfitting and makes the processing faster.
- Types: there are multiple types of pooling (e.g. min pooling, max pooling, subsampling / average pooling).
- Max pooling is the most common.
- Size of the pooling window: 2x2 is the most common choice.
- Stride: 2 is a very common setting to use.
More information about the details of different pooling configurations: http://ais.uni-bonn.de/papers/icann2010_maxpool.pdf
- Especially parts 1 and 3.
- In the paper: Subsampling = average pooling
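A minimal sketch of 2x2 max pooling with stride 2 (NumPy assumed; the reshape trick is just one way to implement it):

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [0, 2, 8, 5],
                 [1, 1, 3, 7]])

# 2x2 max pooling with stride 2: keep only the strongest activation in each window
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 2]
               #  [2 8]]
```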
After the last pooling layer, we take each pooled feature map and flatten it, concatenating everything into one big column vector that serves as the input layer for the next step.
We then add a fully connected feed-forward ANN that uses this flattened vector as its input layer. This FF-ANN might have one or many fully connected hidden layers. The number of neurons in the output layer depends on the nature of the task (e.g. 1 neuron for a binary classification problem).
The hyper-parameters of the fully connected network are the same as those of a regular ANN.
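Putting the pieces together, a minimal end-to-end sketch assuming PyTorch (layer sizes and names are illustrative, not prescriptive):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),  # convolution: 32 feature maps
    nn.ReLU(),                                   # non-linear clipping
    nn.MaxPool2d(kernel_size=2, stride=2),       # downsampling: 64x64 -> 32x32
    nn.Flatten(),                                # one big column vector per image
    nn.Linear(32 * 32 * 32, 128),                # fully connected hidden layer
    nn.ReLU(),
    nn.Linear(128, 1),                           # 1 output neuron: binary classification
    nn.Sigmoid(),
)

x = torch.randn(8, 3, 64, 64)  # a batch of 8 color images
print(model(x).shape)          # torch.Size([8, 1])
```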
The training of a CNN is similar to the training of an ANN. It is also rooted in the calculation of an error using a cost function and backpropagating that error to adjust the weights of each neuron.
- One important thing to highlight is that the feature detectors are made up of weights that are LEARNED during training.
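Continuing the PyTorch sketch above (reusing `model` and `x`, with dummy labels purely for illustration), a minimal training loop; note that the feature detector weights inside the convolution layer are updated exactly like every other weight:

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()                                  # cost function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # updates ALL weights, filters included

labels = torch.randint(0, 2, (8, 1)).float()  # dummy labels for the batch `x`
for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), labels)  # 1. compute the error with the cost function
    loss.backward()                     # 2. backpropagate the error
    optimizer.step()                    # 3. adjust the weights (feature detectors too)
```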
Prediction works the same way as in a traditional ANN: the image is fed as input, and all the weight operations are applied and propagated forward until the output layer.
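Continuing the same sketch, prediction is just a forward pass with the learned weights held fixed:

```python
new_image = torch.randn(1, 3, 64, 64)  # one unseen image
model.eval()                           # inference mode
with torch.no_grad():                  # no gradients needed: weights stay fixed
    prob = model(new_image)            # forward pass through every layer
print(prob.item())                     # predicted probability for the positive class
```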