Convolutional Neural Networks
CNNs are very good at image recognition. They are made up of multiple layers, and each layer builds on the knowledge gained from the previous one. The first layer recognises simple things like edges and dots; the second layer can recognise parts such as eyes or car tyres; later layers build on this knowledge to recognise faces or cars. In the example below, we have two images of an X in white pixels on a black background. Should they both be interpreted as an X, and thus the same image?


If a letter were guaranteed to occupy the exact same pixels each time, it would be relatively simple to check the pixel values. However, we want a more flexible system which can cope with scaling (different size), translation (different relative location), rotation, or altered weighting (line thickness). A number of steps are required for the convolutional neural network to do this.
Below is how a computer would see the images: a 2-dimensional array of numbers, with the values -1 and 1 representing black and white. A greyscale image would use intermediate numbers to represent the brightness of each pixel.
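As a minimal sketch, such an array can be built directly in NumPy. The 9x9 size and the exact pixel layout here are assumptions for illustration, not taken from the original images:

```python
import numpy as np

# A hypothetical 9x9 "X" image: 1 for white pixels, -1 for black,
# matching the encoding described above.
x_image = -np.ones((9, 9), dtype=int)
for i in range(1, 8):
    x_image[i, i] = 1       # downward-sloping stroke of the X
    x_image[i, 8 - i] = 1   # upward-sloping stroke of the X

print(x_image)
```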


A convolutional neural network effectively breaks the image up into pieces or "features", and uses these features to find similar patterns in the other image.


In this case we'll use 3x3 pixel features, shown below.



The math behind this is relatively simple.
- Line up the feature and the image patch.
- Multiply each image pixel by the corresponding feature pixel.

As it's a perfect match, each pixel of the feature is multiplied by an identical value in the image, so every product is 1.
- Add up all the results.
- Divide by the total number of pixels in the feature.

To keep track of where the feature matched, we record the result (a 1 means a perfect match) at the position in the image where the feature was placed. We repeat these steps for every location in the image.
If the filter doesn't match perfectly with the image, the result will be some number less than 1. This is shown in the example below.

Again we add up the resulting numbers and divide by the number of values in the feature vector.
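The arithmetic above can be sketched in a few lines. The feature and patch values here are illustrative, chosen to show one perfect and one imperfect match:

```python
import numpy as np

# A hypothetical 3x3 feature (a downward-sloping diagonal) and two
# 3x3 patches taken from an image: one matching, one not.
feature = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

perfect_patch = feature.copy()    # identical to the feature
imperfect_patch = feature.copy()
imperfect_patch[0, 2] = 1         # one pixel disagrees

def match_score(feature, patch):
    # Multiply pixel by pixel, add up the results,
    # then divide by the number of pixels in the feature.
    return (feature * patch).sum() / feature.size

print(match_score(feature, perfect_patch))    # 1.0: every product is 1
print(match_score(feature, imperfect_patch))  # 7/9: one product is -1
```

A single disagreeing pixel flips one product from 1 to -1, so the score drops from 9/9 to 7/9.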

We perform the convolution operation by moving each of the 3 filters chosen earlier across every location in the image, resulting in a map of values like the ones below.


This convolution is the result of combining the original X image with the upward sloping filter.


This convolution is the result of combining the original X image with the cross filter.


This convolution is the result of combining the original X image with the downward sloping filter.
So in convolution, one image becomes a stack of filtered images. The number of filtered images is the same as the number of filters.
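The full sweep can be sketched as follows. The 9x9 image and the exact filter values are assumptions for illustration (the three filters stand in for the upward-sloping, cross, and downward-sloping features chosen earlier):

```python
import numpy as np

# A hypothetical 9x9 "X" image: 1 for white, -1 for black.
x_image = -np.ones((9, 9), dtype=int)
for i in range(1, 8):
    x_image[i, i] = 1       # downward-sloping stroke
    x_image[i, 8 - i] = 1   # upward-sloping stroke

# Illustrative stand-ins for the three 3x3 features.
filters = [
    np.array([[-1, -1,  1], [-1,  1, -1], [ 1, -1, -1]]),  # upward slope
    np.array([[ 1, -1,  1], [-1,  1, -1], [ 1, -1,  1]]),  # cross
    np.array([[ 1, -1, -1], [-1,  1, -1], [-1, -1,  1]]),  # downward slope
]

def convolve(image, kernel):
    """Record the normalised match score at every valid location."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = (patch * kernel).sum() / kernel.size
    return out

# One image becomes a stack of filtered images, one per filter.
stack = [convolve(x_image, f) for f in filters]
print([fm.shape for fm in stack])  # three 7x7 filtered images
```

The centre of the X looks exactly like the cross feature, so the cross's filtered image scores a perfect 1.0 at the middle position.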
The next step is pooling, which shrinks each filtered image:
- Pick a window size (usually 2 or 3).
- Pick a stride (usually 2).
- Walk your window across your filtered images.
- From each window, take the maximum value.
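The steps above can be sketched as a small function. Windows are allowed to hang off the edge of the image, which is how a 7x7 input pools down to 4x4 rather than 3x3:

```python
import numpy as np

def max_pool(image, window=2, stride=2):
    """Walk a window across the image in steps of `stride`,
    keeping only the maximum value from each window position."""
    out_h = -(-image.shape[0] // stride)  # ceiling division:
    out_w = -(-image.shape[1] // stride)  # windows may overhang the edge
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = image[r * stride : r * stride + window,
                              c * stride : c * stride + window].max()
    return out

filtered = np.random.rand(7, 7)  # a stand-in for one filtered image
print(max_pool(filtered).shape)  # (4, 4)
```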

The result is a similar pattern in a smaller image: we've gone from 7x7 to 4x4, roughly halving each dimension. This technique can be used to shrink very large images to make them easier to work with, while retaining their important information.
Pooling is not overly sensitive to position, as it takes the maximum value from the window without caring about the precise location of that value. As a result, if you're looking for a particular feature in the image, it can be slightly to the left or right, or slightly rotated, and it will still be detected.
Although normalisation is not strictly necessary for the classifier to work, it can help. A ReLU layer changes every negative value to zero. Normalisation can also act across a neighbourhood of neurons: a strongly excited neuron dampens the signal of its neighbours, emphasising the local maxima. In a uniformly active neighbourhood this dampening reduces values across the board, so the neurons with the largest activations stand out.
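The ReLU part is a one-liner; the example values below are illustrative:

```python
import numpy as np

# ReLU: every negative value becomes zero,
# positive values pass through unchanged.
def relu(x):
    return np.maximum(0, x)

filtered = np.array([[ 0.77, -0.11,  0.33],
                     [-0.55,  1.00, -0.11]])
print(relu(filtered))
```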



The layers are stacked on top of one another, so the output of one layer becomes the input of the next layer.

This can be done as many times as needed. With deep stacking images become more filtered (as they go through convolution layers) and smaller (as they go through pooling layers).
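Deep stacking can be sketched as function composition: each layer's output is the next layer's input. The compact `convolve`, `relu` and `max_pool` helpers below are restated here as assumptions so the sketch runs on its own:

```python
import numpy as np

def convolve(img, k):
    kh, kw = k.shape
    return np.array([[(img[r:r + kh, c:c + kw] * k).sum() / k.size
                      for c in range(img.shape[1] - kw + 1)]
                     for r in range(img.shape[0] - kh + 1)])

def relu(x):
    return np.maximum(0, x)

def max_pool(img, window=2, stride=2):
    oh, ow = -(-img.shape[0] // stride), -(-img.shape[1] // stride)
    return np.array([[img[r * stride:r * stride + window,
                          c * stride:c * stride + window].max()
                      for c in range(ow)]
                     for r in range(oh)])

kernel = np.array([[1, -1, -1], [-1, 1, -1], [-1, -1, 1]])
image = np.random.choice([-1, 1], size=(9, 9)).astype(float)

# conv -> relu -> pool, twice: the image gets more filtered
# at each convolution and smaller at each pooling step.
x = max_pool(relu(convolve(image, kernel)))  # 9x9 -> 7x7 -> 4x4
x = max_pool(relu(convolve(x, kernel)))      # 4x4 -> 2x2 -> 1x1
print(x.shape)
```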

This is the final, fully connected layer in the network. The stack of images can be conceptualised as a single vector of values, and each value can be thought of as a vote. When an X is fed into the network, certain values will be high, voting strongly for an X rather than, for example, an O.
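The voting can be sketched as a weighted sum per class. All the feature values and weights below are hypothetical, stand-ins for what training would actually learn:

```python
import numpy as np

# The pooled stack, flattened into a vector (illustrative values).
features = np.array([0.9, 0.65, 0.45, 0.87, 0.96, 0.25])

# One row of weights per class; a high weight means that feature
# votes strongly for that class (values are made up, not learned).
weights = np.array([
    [1.0, 1.0, 0.0, 1.0, 1.0, 0.0],   # weights for class "X"
    [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],   # weights for class "O"
])

votes = weights @ features            # total vote for each class
classes = ["X", "O"]
print(classes[int(np.argmax(votes))])  # the class with the most votes wins
```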

This tutorial was adapted from Brandon Rohrer's video tutorial, which can be found here.
A very good video, which I highly recommend for understanding how the dimensions of your data change as they progress through the network and how kernels/filters work, can be found here.