Recurrent Neural Networks

Overview

Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while maintaining hidden states. An RNN keeps an internal memory through feedback connections and therefore supports temporal behavior.

Why RNN?

  • ANNs and CNNs take static (spatial) input: each sample is processed on its own, so previous or next inputs are not required.
  • Sequential data such as time series, video, and speech requires knowledge of the data points around the current one.
  • Feed-forward networks cannot take previous inputs into account.

Advantages and Disadvantages

| Advantages | Disadvantages |
| --- | --- |
| Possibility of processing input of any length | Computation is slow |
| Model size doesn't increase with the size of the input | Difficulty of accessing information from a long time ago |
| Computation takes historical information into account | Cannot consider any future input for the current state |
| Weights are shared across time | |

Table of Contents

  1. Architecture
  2. Types of RNN

Architecture

  • Formula for calculating current state: $h_t=f(h_{t-1}, x_{t})$
    • $h_t$ → current state
    • $h_{t-1}$ → previous state
    • $x_t$ → input state
  • Formula for applying the activation function: $h_t=\tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_{t})$
    • $W_{hh}$ → weight at the recurrent neuron
    • $W_{xh}$ → weight at input neuron
  • Formula for calculating output: $y_t=W_{hy}.h_t$
    • $y_t$ → output
    • $W_{hy}$ → weight at the output layer (see the sketch after this list)
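
A minimal NumPy sketch of one forward step using the weight names from the formulas above; the dimensions and random initialization are illustrative assumptions, not part of the original:

```python
import numpy as np

# Illustrative sizes (assumptions): 4 input features, 8 hidden units, 3 outputs
input_size, hidden_size, output_size = 4, 8, 3

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # weight at input neuron
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # weight at recurrent neuron
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1  # weight at output layer

def rnn_step(h_prev, x_t):
    """One time step: h_t = tanh(W_hh.h_{t-1} + W_xh.x_t), y_t = W_hy.h_t."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(hidden_size)            # initial hidden state
x = rng.standard_normal(input_size)  # one input vector
h, y = rnn_step(h, x)
```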

Training through RNN

  1. A single time step of the input is provided to the network.
  2. The current state is then calculated from the current input and the previous state.
  3. The current $h_t$ becomes $h_{t-1}$ for the next time step.
  4. One can go through as many time steps as the problem requires, combining the information from all the previous states.
  5. Once all the time steps are completed the final current state is used to calculate the output.
  6. The output is then compared to the actual (target) output, and an error is generated.
  7. The error is then backpropagated through the network to update the weights, and the RNN is thereby trained (a framework-level sketch follows these steps).
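
The unrolling over time, backpropagation through time, and weight updates described in these steps are handled automatically by deep learning frameworks. As a hedged sketch (assuming tf.keras and made-up toy data; the layer sizes and hyperparameters are arbitrary illustrative choices):

```python
import numpy as np
import tensorflow as tf

# Toy many-to-one data (assumption): 100 sequences, 10 time steps, 4 features each
x = np.random.randn(100, 10, 4).astype("float32")
y = np.random.randn(100, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(8, input_shape=(10, 4)),  # h_t = tanh(W_hh.h_{t-1} + W_xh.x_t)
    tf.keras.layers.Dense(1),                           # output computed from the final state
])

# fit() runs the forward pass over all time steps, compares the output to the
# target, backpropagates the error through time, and updates the weights.
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=5, verbose=0)
```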

Different RNN architectures

| Type of RNN | Inputs $T_x$ and outputs $T_y$ |
| --- | --- |
| One-to-one | $T_x=T_y=1$ |
| One-to-many | $T_x=1$ & $T_y>1$ |
| Many-to-one | $T_x>1$ & $T_y=1$ |
| Many-to-many | $T_x=T_y$ |
| Many-to-many | $T_x\neq T_y$ |
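
For the many-to-one and many-to-many ($T_x=T_y$) rows above, a brief sketch (assuming tf.keras; the sizes are illustrative) of how the two cases differ in practice:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(10, 4))  # T_x = 10 time steps, 4 features each

# Many-to-one: only the final hidden state is returned -> shape (batch, 8)
many_to_one = tf.keras.layers.SimpleRNN(8)(inputs)

# Many-to-many (T_x = T_y): one hidden state per time step -> shape (batch, 10, 8)
many_to_many = tf.keras.layers.SimpleRNN(8, return_sequences=True)(inputs)
```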

Types of RNN

The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long-term dependencies because of multiplicative gradients that can be exponentially decreasing/increasing with respect to the number of layers.
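
A rough NumPy illustration of this multiplicative effect: repeatedly multiplying a gradient by the same recurrent weight matrix shrinks it or blows it up depending on the weight scale (the scales 0.5 and 1.5 and the matrix size are arbitrary assumptions, and the $\tanh$ derivative is ignored for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 50

for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    # Random recurrent weight matrix with spectral radius roughly equal to `scale`
    W_hh = rng.standard_normal((hidden_size, hidden_size)) * scale / np.sqrt(hidden_size)
    grad = np.ones(hidden_size)
    for _ in range(steps):       # backpropagating through 50 time steps
        grad = W_hh.T @ grad     # the gradient is multiplied by W_hh.T at every step
    print(label, np.linalg.norm(grad))
```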

Gradient clipping is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.
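
As a hedged sketch (assuming tf.keras), clipping can be enabled directly on the optimizer; the threshold of 1.0 is an arbitrary illustrative choice:

```python
import tensorflow as tf

# Clip the global gradient norm to at most 1.0 before each weight update
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Alternatively, clip each gradient component to the range [-1.0, 1.0]
# optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipvalue=1.0)
```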

Types of gates: To remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted $\Gamma$ and are equal to:

$$\Gamma=\sigma(Wx_{t}+Uh_{t-1}+b)$$

where $W$, $U$, and $b$ are coefficients specific to the gate and $\sigma$ is the sigmoid function.
The main ones are summed up here:

| Type of gate | Sign | Role | Used in |
| --- | --- | --- | --- |
| Update gate | $\Gamma_u$ | How much past should matter now? | GRU, LSTM |
| Relevance gate | $\Gamma_r$ | Drop previous information? | GRU, LSTM |
| Forget gate | $\Gamma_f$ | Erase a cell or not? | LSTM |
| Output gate | $\Gamma_o$ | How much to reveal of a cell? | LSTM |
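
A minimal NumPy sketch of the gate formula above; the dimensions and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 4, 8
rng = np.random.default_rng(0)

# Coefficients specific to one gate: Gamma = sigmoid(W.x_t + U.h_{t-1} + b)
W = rng.standard_normal((hidden_size, input_size)) * 0.1
U = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b = np.zeros(hidden_size)

def gate(x_t, h_prev):
    return sigmoid(W @ x_t + U @ h_prev + b)   # each entry lies in (0, 1)
```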

Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU.

Long Short-Term Memory

  1. Input Gate

    • The addition of useful information to the cell state is done by the input gate.
    • First, a sigmoid function applied to the inputs $h_{t-1}$ and $x_t$ regulates the information and filters the values to be remembered, similar to the forget gate.
    • Then a vector is created using the $\tanh$ function, which gives an output from -1 to +1 and contains all the possible candidate values from $h_{t-1}$ and $x_t$.
    • At last, the values of the vector and the regulated values are multiplied to obtain useful information.
  2. Forget Gate

    • Information that is no longer useful in the cell state is removed with the forget gate.
    • Two inputs $x_t$ (input at the particular time) and $h_{t-1}$ (previous cell output) are fed to the gate and multiplied with weight matrices followed by the addition of bias.
    • The result is passed through a sigmoid activation function that gives an output between 0 and 1.
    • A value close to 0 means the piece of information is forgotten, while a value close to 1 means it is kept for future use.
  3. Output Gate

    • The task of extracting useful information from the current cell state to be presented as output is done by the output gate.
    • First, a vector is generated by applying the $\tanh$ function to the cell state.
    • Then, a sigmoid function applied to the inputs $h_{t-1}$ and $x_t$ regulates the information and filters the values to be revealed.
    • Finally, the values of the vector and the regulated values are multiplied and sent as the output of the cell and as input to the next cell (a combined single-step sketch follows this list).
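
Putting the three gates together, a minimal single-step LSTM sketch in NumPy (the dimensions, initialization, and variable names are illustrative assumptions, not a definitive implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 4, 8
rng = np.random.default_rng(0)

def init():
    """One (W, U, b) coefficient set, as in the gate formula."""
    W = rng.standard_normal((hidden_size, input_size)) * 0.1
    U = rng.standard_normal((hidden_size, hidden_size)) * 0.1
    b = np.zeros(hidden_size)
    return W, U, b

(Wi, Ui, bi), (Wf, Uf, bf), (Wo, Uo, bo), (Wc, Uc, bc) = init(), init(), init(), init()

def lstm_step(x_t, h_prev, c_prev):
    i = sigmoid(Wi @ x_t + Ui @ h_prev + bi)        # input gate: what to add
    f = sigmoid(Wf @ x_t + Uf @ h_prev + bf)        # forget gate: what to erase
    o = sigmoid(Wo @ x_t + Uo @ h_prev + bo)        # output gate: how much to reveal
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + bc)  # candidate values in (-1, +1)
    c_t = f * c_prev + i * c_tilde                  # updated cell state
    h_t = o * np.tanh(c_t)                          # exposed part of the cell state
    return h_t, c_t

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.standard_normal(input_size), h, c)
```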

Gated Recurrent Unit