Convolutional Neural Networks

Ishani Kathuria edited this page Dec 25, 2022 · 6 revisions

Overview

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm that can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and be able to differentiate one from the other.

The advancements in Computer Vision with Deep Learning have been constructed and perfected with time, primarily over one particular algorithm — a Convolutional Neural Network. What makes CNNs special is their ability to extract features from images.

Table of Contents

  1. Architecture
    1. Convolution Layer
    2. Pooling Layer
    3. Fully Connected Layer
  2. Properties
    1. Invariance
    2. Equivariance
    3. Stability

Architecture

The basic CNN architecture consists of a convolutional layer followed by pooling layers and fully connected layers.
While convolutional layers can be followed by additional convolutional layers or pooling layers, the fully connected layer is always the final layer.

Convolution Layer

  • The convolutional layer is the first layer of a convolutional network.
  • It requires a few components, which are input data, a filter, and a feature map.
  • The feature detector, also known as a kernel or a filter, moves across the receptive fields of the image, checking whether a feature is present. This process is known as a convolution.
  • The feature detector is a two-dimensional (2-D) array of weights that represents part of the image.
  • While filters can vary in size, a 3x3 matrix is typical; the filter size also determines the size of the receptive field.
  • The filter is then applied to an area of the image, and a dot product is calculated between the input pixels and the filter. This dot product is then fed into an output array.
  • Afterwards, the filter shifts by a stride, repeating the process until the kernel has swept across the entire image.
  • The final output from the series of dot products from the input and the filter is known as a feature map, activation map, or convolved feature.
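The sweep described above can be sketched directly in NumPy. This is a minimal illustration (valid padding, no bias, a hypothetical 3x3 vertical-edge filter), not a production implementation:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image, computing a dot product with each
    receptive field to build the feature map (valid padding, no bias)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return feature_map

# A 3x3 vertical-edge filter applied to a 5x5 image with a vertical edge
image = np.array([[1, 1, 0, 0, 0],
                  [1, 1, 0, 0, 0],
                  [1, 1, 0, 0, 0],
                  [1, 1, 0, 0, 0],
                  [1, 1, 0, 0, 0]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(convolve2d(image, kernel).shape)  # (3, 3)
```

The feature map responds strongly where the receptive field covers the edge and is zero elsewhere, which is exactly the "checking if the feature is present" behaviour described above.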

There are three hyperparameters that affect the volume size of the output that needs to be set before the training of the neural network begins. These include:

  • The number of filters affects the depth of the output. For example, three distinct filters would yield three different feature maps, creating a depth of three.
  • Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride values of two or greater are rare, a larger stride yields a smaller output.
  • Zero-padding is usually used when the filters do not fit the input image. This sets all elements that fall outside of the input matrix to zero, producing a larger or equally sized output. There are three types of padding:
    • Valid padding: This is also known as no padding. In this case, the last convolution is dropped if the dimensions do not align.
    • Same padding: This padding ensures that the output layer has the same size as the input layer.
    • Full padding: This type of padding increases the size of the output by adding zeros to the border of the input.
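The interaction of these three hyperparameters can be checked with the standard output-size formula O = (W − F + 2P) / S + 1, where W is the input size, F the filter size, P the padding, and S the stride. A small helper (the function name is ours) makes the three padding regimes concrete:

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Spatial output size of a convolution: O = (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# 32x32 input with a 3x3 filter
print(conv_output_size(32, 3, padding=0, stride=1))  # 30  (valid padding shrinks the map)
print(conv_output_size(32, 3, padding=1, stride=1))  # 32  (same padding preserves size)
print(conv_output_size(32, 3, padding=2, stride=1))  # 34  (full padding grows the map)
print(conv_output_size(32, 3, padding=0, stride=2))  # 15  (a larger stride shrinks it further)
```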

After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the feature map, introducing nonlinearity to the model.
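ReLU itself is a one-line operation, zeroing out negative activations and passing positives through unchanged:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: max(0, x), applied element-wise."""
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 1.5],
                        [ 0.0, -0.5]])
print(relu(feature_map))  # [[0.  1.5]
                          #  [0.  0. ]]
```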

Since the output array does not need to map directly to each input value, convolutional (and pooling) layers are commonly referred to as “partially connected” layers.

Ultimately, the convolutional layer converts the image into numerical values, allowing the neural network to interpret and extract relevant patterns.

Pooling Layer

  • Pooling layers, also known as downsampling, conduct dimensionality reduction, reducing the number of parameters in the input.
  • Like the convolutional layer, the pooling operation sweeps a filter across the entire input, but the difference is that this filter does not have any weights.
  • Instead, the kernel applies an aggregation function to the values within the receptive field, populating the output array.
  • Max pooling: As the filter moves across the input, it selects the pixel with the maximum value to send to the output array. As an aside, this approach tends to be used more often compared to average pooling.
  • Average pooling: As the filter moves across the input, it calculates the average value within the receptive field to send to the output array.
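Both pooling modes can be sketched with the same weightless sweep; only the aggregation function differs. A minimal illustration (the function name is ours):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Sweep a weightless window over x, aggregating each receptive
    field with max (max pooling) or mean (average pooling)."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    agg = np.max if mode == "max" else np.mean
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = agg(x[i*stride:i*stride+size, j*stride:j*stride+size])
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(pool2d(x, mode="max"))      # [[6. 8.]
                                  #  [3. 4.]]
print(pool2d(x, mode="average"))  # [[3.75 5.25]
                                  #  [2.   2.  ]]
```

Note how a 4x4 input is reduced to 2x2: the parameter reduction is where the efficiency benefit below comes from.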

While a lot of information is lost in the pooling layer, it also brings several benefits for CNNs. Pooling layers help to:

  • Reduce complexity
  • Improve efficiency
  • Limit the risk of overfitting

Fully Connected Layer

  • The pixel values of the input image are not directly connected to the output layer in partially connected layers (convolution and pooling layers).
  • However, in the fully connected layer, each node in the output layer connects directly to a node in the previous layer.
  • This layer performs the task of classification based on the features extracted through the previous layers and their different filters.
  • While convolutional and pooling layers tend to use ReLU functions, FC layers usually leverage a softmax activation function to classify inputs appropriately, producing probabilities between 0 and 1.
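The softmax step can be shown in a few lines. This is a generic sketch with made-up scores for three hypothetical classes (the max-shift is a standard trick for numerical stability):

```python
import numpy as np

def softmax(logits):
    """Turn the FC layer's raw scores into class probabilities that sum to 1.
    Subtracting the max first avoids overflow in exp()."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw FC outputs for 3 classes
probs = softmax(scores)
print(probs.sum())  # 1.0 — a valid probability distribution
```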

Properties

Invariance

  • In simple terms, invariance is the property of a function, where for a perturbed input, the output stays unchanged. Types:
    • Translation invariance, in essence, is the ability to ignore positional shifts, or translations, of the target in the image.
    • Rotation invariance is where, regardless of the change in the input tilt, the predicted class does not change.
    • Invariance is also often related to robustness to noise, which is the ability of the model to maintain high performance even on corrupted inputs. For example, a cat is still a cat regardless of whether it appears in the top or bottom half of the image, whether it has been rotated or flipped, whether the image is noisy, and so on.
  • CNNs have inbuilt translation invariance and are thus better suited to image datasets than ordinary fully connected networks.
  • This translation invariance in the convolutional NN is achieved by a combination of convolutional layers and max pooling layers.
    • Firstly, the convolutional layer reduces the image to a set of features and their respective positions.
    • Then the max pooling layer takes the output from the convolutional layer and reduces its resolution and complexity.
    • It does so by outputting only the max value from a grid.
    • So, the information about the exact position of the max value in the grid is discarded.
  • Disadvantage of Max Pooling
    • The major issue with max pooling is that the network fails to learn the spatial relation between different features, and will thus give a false positive if all features are present but in the wrong positions relative to one another.
    • Because max pooling is applied separately to each filter, much of the relative positional information between features is lost.
    • As a result, the next layer can only produce output based on the presence of features.
    • In subsequent layers, more and more spatial information is lost, as each max pooling layer compounds the effect of translation invariance.
  • Alternative to max pooling for translation invariance: data augmentation
    • Data augmentation is a form of regularisation used to make the model more robust against distortions in the data.
    • This is done by applying those distortions to the training data and training the model on this "augmented" data.
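A minimal augmentation sketch, assuming simple flip-and-shift distortions (real pipelines also use crops, rotations, colour jitter, etc.):

```python
import numpy as np

def augment(image, rng):
    """Randomly flip and circularly shift an image so the model sees
    translated/mirrored variants of each training example."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)           # horizontal flip
    shift = rng.integers(-2, 3)        # shift by -2..2 pixels
    out = np.roll(out, shift, axis=1)  # circular shift along the width
    return out

rng = np.random.default_rng(0)
image = np.arange(16, dtype=float).reshape(4, 4)
augmented = augment(image, rng)
print(augmented.shape)  # (4, 4) — same size, but a distorted view of the content
```

Training on many such variants teaches the network that the label is unchanged under these distortions, rather than relying on pooling to discard the positional information.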

Equivariance

  • Equivariance is a term often confused with invariance, but it means that the function responds to a change in the input with a corresponding change in the output.
  • This property is desirable for image-to-image tasks. For example, in image segmentation, if we rotate the input image, the segmentation map of that image should rotate as well.
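Convolution itself is translation-equivariant: shifting the input and then convolving gives the same result as convolving and then shifting the output. A 1-D sketch using circular cross-correlation (circular boundaries make the equality exact):

```python
import numpy as np

def circular_conv(x, k):
    """1-D circular cross-correlation: each output element is the dot
    product of the kernel with a (wrapped) receptive field of x."""
    n, m = len(x), len(k)
    return np.array([sum(k[j] * x[(i + j) % n] for j in range(m))
                     for i in range(n)])

x = np.array([0.0, 1.0, 3.0, 1.0, 0.0, 0.0])
k = np.array([1.0, -1.0])  # simple difference (edge) filter

shifted_then_conv = circular_conv(np.roll(x, 2), k)
conv_then_shifted = np.roll(circular_conv(x, k), 2)
print(np.allclose(shifted_then_conv, conv_then_shifted))  # True
```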

Capsule Networks

  • Pooling discards a substantial amount of information and provides invariance instead of equivariance.
  • CNNs have the significant drawback of being unable to build relationships between the image features.
  • An example of such a failure is the “Picasso problem”, where a face with mouth and eye swapped would still be classified as a face.
  • Capsule networks are an alternative to CNNs, and can also be incorporated into the deeper layers of a CNN architecture, remedying the above-mentioned limitations.
  • Capsules are sets of neurons that encode both the orientation of a feature and its probability.
  • Capsule networks generally require much less data to train; however, the training procedure, which differs from standard backpropagation, is much more time-consuming.

Harmonic Networks

  • Harmonic networks design patch-wise 360-rotational equivariance into deep image representations, by constraining the filters to the family of circular harmonics.
  • The circular harmonics are steerable filters, which means that we can represent all rotated versions of a filter, using just a finite, linear combination of steering bases.
  • This overcomes the issue of learning multiple filter copies in CNNs, guarantees rotational equivariance, and produces feature maps that transform predictably under input rotation.
  • Each layer is a collection of feature maps of different rotation orders, which transform predictably under the rotation of the input to the network and the 360-rotation equivariance is achieved with finite computation.

Stability

Stability to deformations is the property that small, smooth deformations of the input (e.g. slight local warping, as opposed to a rigid translation) produce only correspondingly small changes in the network's output.