Deep Learning


What are Neural Networks?

Neural networks are a class of machine learning models designed to operate similarly to biological neurons and the human nervous system.
They have the following properties:

  1. The core architecture of a neural network is composed of a large number of simple processing nodes called neurons, which are interconnected and organized in layers.

  2. An individual node in a layer is connected to several nodes in the previous and the next layer. Each node receives inputs from the previous layer, processes them, and passes its output to the next layer.

  3. The first layer of this architecture is the input layer, which accepts the inputs; the last layer is the output layer, which produces the final output; and every layer in between is a hidden layer.

Table of Contents

  1. Parts of a neural network
    1. Input neurons
    2. Output neurons
    3. Hidden Layers and Neurons per Hidden Layer
  2. Hyperparameters
    1. Batch Size
    2. Number of epochs
    3. Learning Rate
  3. Activation Function
  4. Loss Function
  5. Regularization
    1. L1 and L2 Regularization
    2. Dropout
    3. Early Stopping
  6. Optimizers
    1. Gradient-Descent
    2. Adam
  7. Neural Networks Cheatsheet

Parts of a neural network

Input neurons

This is the number of features your neural network uses to make its predictions.

Output neurons

This is the number of predictions you want to make. For regression tasks, this is a single neuron, or one neuron per predicted value in multi-variate regression. For binary classification, a single output neuron is used. For multi-class classification, there is one output neuron per class.

Hidden Layers and Neurons per Hidden Layer

The number of hidden layers is highly dependent on the problem and the architecture of your neural network. Finding a good configuration is largely a matter of trial and error.

Using the same number of neurons for all hidden layers is sufficient most of the time. For some datasets, having a large first layer and following it up with smaller layers will lead to better performance as the first layer can learn a lot of lower-level features that can feed into a few higher-order features in the subsequent layers.

Generally, 1–5 hidden layers and 1–100 neurons will serve well for most problems. You can start with this number and slowly add more layers and neurons until the model starts overfitting.

Choosing a smaller number of layers/neurons might cause the model to not be able to learn the underlying patterns in your data and thus be useless.
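
As a concrete illustration of these parts, here is a minimal Keras sketch; the feature count, layer widths, and number of output classes are arbitrary placeholder values, not recommendations.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical sizes: 20 input features, two hidden layers, 3 output classes.
model = keras.Sequential([
    layers.Input(shape=(20,)),              # input layer: one neuron per feature
    layers.Dense(64, activation="relu"),    # first (larger) hidden layer
    layers.Dense(32, activation="relu"),    # smaller follow-up hidden layer
    layers.Dense(3, activation="softmax"),  # output layer: one neuron per class
])
model.summary()
```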

Hyperparameters

Batch Size

Batch size refers to the number of training examples used in one iteration, i.e. how many samples are processed before the model's parameters are updated.

A training dataset can be divided into one or more batches.

  • Batch Gradient Descent: Batch Size = Size of Training Set
  • Stochastic Gradient Descent: Batch Size = 1
  • Mini-Batch Gradient Descent: 1 < Batch Size < Size of Training Set

In the case of mini-batch gradient descent, popular batch sizes include 32, 64, and 128 samples.

Number of epochs

This is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.
An epoch refers to a single pass through the entire training dataset, during which each sample in the dataset is used to update the internal model parameters.

This process may be divided into one or more batches, with each batch representing a subset of the training dataset.

Typically, the learning algorithm is run for a large number of epochs, often in the hundreds or thousands, in order to minimize the error produced by the model. This process continues until the error has been sufficiently reduced.
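
Tying the two hyperparameters above together, here is a sketch of how they are set in Keras, continuing from the model defined earlier; the data is synthetic and the values are purely illustrative.

```python
import numpy as np

# Synthetic data purely for illustration: 1000 samples, 20 features, 3 classes.
x_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 3, size=1000)

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Mini-batch training: 32 samples per weight update, 100 full passes over the data.
# batch_size=1 would be SGD; batch_size=len(x_train) would be batch gradient descent.
history = model.fit(x_train, y_train, batch_size=32, epochs=100, validation_split=0.2)
```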

Learning Rate

The learning rate is a hyperparameter that determines the size of the adjustments made to the model weights at each iteration, based on the estimated error.

It controls how quickly the model learns, i.e. how much its internal parameters are updated in response to the error. The amount by which the weights are adjusted at each step is also called the step size.

If the learning rate is set too high, the model may converge too rapidly to a suboptimal solution, while a learning rate that is too low may result in the optimization process becoming stuck. It is important to select an appropriate learning rate in order to balance the speed of convergence and the quality of the model's solutions.
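
In Keras, the learning rate is simply a constructor argument of the optimizer; the values below are illustrative starting points, not recommendations for any particular problem.

```python
from tensorflow import keras

# Smaller steps: slower but more stable convergence.
careful_optimizer = keras.optimizers.SGD(learning_rate=0.001)

# Larger steps: faster but may overshoot or oscillate around a minimum.
aggressive_optimizer = keras.optimizers.SGD(learning_rate=0.1)
```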

Activation Function

Activation functions are applied to the weighted sum of a neuron's inputs and bias to decide whether, and how strongly, the neuron fires. They can be linear or non-linear, and they control the outputs of the neural network.

In general, the performance from using different activation functions improves in this order (from lowest → highest performing):
logistic → tanh → ReLU → Leaky ReLU → ELU → SELU

Output Layer Activation

  • Regression

    • Regression problems don’t require activation functions for their output neurons because we want the output to take on any value.
    • In cases where we want output values to be bounded into a certain range, we can use tanh for [-1, 1] values and logistic function for [0, 1] values.
    • In cases where we’re only looking for positive output, we can use softplus activation.
  • Classification

    • Use the sigmoid activation function for binary classification to ensure the output is between 0 and 1.
    • Use softmax for multi-class classification to ensure the output probabilities add up to 1.
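
A sketch of how these output-layer choices look in Keras; the layer sizes are placeholders.

```python
from tensorflow.keras import layers

# Regression: no activation (linear output), one neuron per predicted value.
regression_output = layers.Dense(1)

# Binary classification: a single sigmoid neuron, output in [0, 1].
binary_output = layers.Dense(1, activation="sigmoid")

# Multi-class classification: one softmax neuron per class, outputs sum to 1.
multiclass_output = layers.Dense(10, activation="softmax")
```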

Detailed overview of Activation Functions

Loss Function

Loss and cost functions are used interchangeably but have one key difference. A loss function is for a single training example/input. A cost function, on the other hand, is the average loss over the entire training dataset.

During training, the loss is what the neural network minimizes in order to fit the data.
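
The loss is specified when compiling the model and depends on the task. A minimal sketch using Keras built-in losses:

```python
# Regression: mean squared error.
model.compile(optimizer="adam", loss="mse")

# Binary classification (sigmoid output): binary cross-entropy.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Multi-class classification (softmax output): categorical cross-entropy.
model.compile(optimizer="adam", loss="categorical_crossentropy")
```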

Detailed overview of Loss Functions

Regularization

Regularization is a method used to improve the generalization ability of a deep learning model, which is its ability to perform well on new, unseen data. It does this by preventing overfitting.

L1 and L2 Regularization

L1 and L2 regularization are techniques that are used to prevent overfitting in deep learning models by adding a penalty term to the loss function during training.

The strength of the regularization is controlled by a hyperparameter called the regularization rate or lambda.

  • L1 regularization adds a penalty term to the loss function that is proportional to the sum of the absolute values of the model's weights.
    • This encourages the model to learn weight values that are close to zero, which can also help to prevent overfitting.

$$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \lVert w \rVert$$

  • L2 regularization, also known as weight decay, adds a penalty term to the loss function that is proportional to the sum of the squares of the model's weights.
    • This encourages the model to learn smaller weight values, which can help to prevent the model from overfitting to the training data.

$$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \lVert w \rVert^2$$
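
In Keras, these penalties are attached per layer through the `kernel_regularizer` argument; the λ values below are arbitrary illustrations.

```python
from tensorflow.keras import layers, regularizers

# L1 penalty (sum of absolute weight values) with lambda = 0.01.
l1_layer = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.01))

# L2 penalty / weight decay (sum of squared weight values) with lambda = 0.01.
l2_layer = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.01))
```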

Dropout

Dropout is a regularization technique that delivers a large improvement in generalization for how simple it actually is.

All dropout does is randomly turn off a percentage of neurons at each layer, at each training step. This makes the network more robust because it can’t rely on any particular set of input neurons for making predictions. The knowledge is distributed amongst the whole network.

  • A good dropout rate is between 0.1 to 0.5; 0.3 for RNNs, and 0.5 for CNNs. Use larger rates for bigger layers.
  • Increasing the dropout rate decreases overfitting, and decreasing the rate is helpful to combat under-fitting.
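
A sketch of dropout layers placed between dense layers; the rates used here are just the commonly cited values mentioned above.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),   # randomly zeroes 50% of activations at each training step
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),   # smaller layer, smaller rate
    layers.Dense(1, activation="sigmoid"),
])
```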

Early Stopping

Early Stopping works by monitoring the performance of the model on a validation dataset during training and interrupting the training process when the performance on the validation set begins to deteriorate.

By interrupting the training process before the model starts to overfit, early stopping can help to improve the generalization ability of the model and prevent it from performing poorly on new, unseen data.

The specific number of epochs to wait before interrupting the training process is a hyperparameter that must be tuned based on the specific characteristics of the dataset and the model. In Keras, this hyperparameter is called patience.

It is often used in conjunction with other regularization techniques, such as weight decay and dropout, to further improve the generalization performance of the model.
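
In Keras, early stopping is implemented as the `EarlyStopping` callback; the sketch below assumes a compiled `model` and `x_train`/`y_train` arrays already exist, and the `patience` value is illustrative.

```python
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=10,                 # stop after 10 epochs without improvement
    restore_best_weights=True,   # roll back to the best weights seen
)

model.fit(x_train, y_train, epochs=1000, validation_split=0.2,
          callbacks=[early_stopping])
```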

Optimizers

An optimizer is a mathematical algorithm that is used to adjust the internal parameters of a deep learning model in order to minimize the loss function during training.

There are several different types of optimizers that are commonly used in deep learning and choosing the right optimizer for a particular deep learning problem can have a significant impact on the performance of the model.

Gradient Descent

The basic idea behind gradient descent is to iteratively update the model's parameters in the direction that reduces the loss function. This is done by calculating the gradient of the loss function with respect to the model's parameters and taking a step in the opposite direction.

The size of the step is determined by the learning rate, which controls the rate at which the model's parameters are updated.

$$b = a - \gamma \nabla f(a)$$

where,

  • $b$ is the next step
  • $a$ is the current position
  • $\gamma$ is the learning rate
  • $\nabla f(a)$ is the gradient of the loss function

In order for the gradient descent algorithm to find the local minimum of the loss function, it is important to set the learning rate to a suitable value.

  • If the learning rate is too high, the algorithm may overshoot the local minimum and oscillate around it, preventing it from converging.
  • If the learning rate is too low, the algorithm will converge slowly and may take a long time to reach the local minimum.

There are several variants of gradient descent that are used in practice, including batch gradient descent, mini-batch gradient descent, and stochastic gradient descent (SGD). Each variant of gradient descent has its own trade-offs and may be more or less suitable for a given problem.
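
To make the update rule above concrete, here is a minimal plain-Python sketch that minimizes the simple quadratic loss $f(a) = (a - 3)^2$; the function and learning rate are purely illustrative.

```python
def grad(a):
    # Gradient of f(a) = (a - 3)^2 is 2 * (a - 3).
    return 2 * (a - 3)

a = 10.0             # starting position
learning_rate = 0.1  # gamma in the formula above

for step in range(50):
    a = a - learning_rate * grad(a)   # b = a - gamma * grad f(a)

print(a)  # converges towards the minimum at a = 3
```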

Adam

Adam (Adaptive Moment Estimation) is a popular optimizer that combines the ideas of SGD with momentum and adapts the learning rate for each weight based on the historical gradient of that weight. This can help the optimizer to converge more quickly and robustly to a good solution.

Like SGD, Adam updates the model's parameters by calculating the gradient of the loss function with respect to the parameters and taking a step in the opposite direction. However, unlike SGD, Adam uses an exponentially decaying average of the past gradients to scale the learning rate for each parameter. This allows the learning rate to be adapted based on the specific characteristics of the data and the model and can help the optimizer to converge more quickly and robustly to a good solution.

Adam also incorporates the idea of momentum, which is a technique that helps to smooth out the oscillations that can occur when using SGD by adding a fraction of the past update to the current update. This can help the optimizer to escape from local minima and saddle points more easily and can improve the convergence speed of the algorithm.
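
Adam is available as a built-in Keras optimizer; the hyperparameters shown below are its commonly used defaults, and the `model` is assumed to be defined as in the earlier sketches.

```python
from tensorflow import keras

# beta_1 and beta_2 control the exponentially decaying averages of the
# gradients (momentum) and of the squared gradients (per-weight scaling).
adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=adam, loss="mse")
```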

Neural Networks Cheatsheet
