# Recurrent Neural Networks cheatsheet

*By Afshine Amidi and Shervine Amidi*

## Overview

**Architecture of a traditional RNN** ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:

For each timestep $t$, the activation $a^{<t>}$ and the output $y^{<t>}$ are expressed as follows:

$$a^{<t>}=g_1(W_{aa}a^{<t-1>}+W_{ax}x^{<t>}+b_a)\qquad\textrm{and}\qquad y^{<t>}=g_2(W_{ya}a^{<t>}+b_y)$$

where $W_{ax}, W_{aa}, W_{ya}, b_a, b_y$ are coefficients that are shared temporally and $g_1, g_2$ are activation functions.
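As a concrete sketch, the recurrence above can be written in a few lines of NumPy. The dimensions, the choice of tanh for $g_1$ and softmax for $g_2$, and the random initialization are illustrative assumptions, not prescribed by the formulas:

```python
import numpy as np

def rnn_step(a_prev, x_t, Waa, Wax, Wya, ba, by):
    """One timestep of a vanilla RNN: returns the new hidden state and output."""
    # Hidden state: a<t> = g1(Waa a<t-1> + Wax x<t> + ba), with g1 = tanh
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    # Output: y<t> = g2(Wya a<t> + by), with g2 = softmax (stable form)
    z = Wya @ a_t + by
    y_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    return a_t, y_t

# Toy sizes (assumed): hidden size 4, input size 3, output size 2
rng = np.random.default_rng(0)
Waa, Wax = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
Wya = rng.normal(size=(2, 4))
ba, by = np.zeros(4), np.zeros(2)

a = np.zeros(4)                      # initial hidden state a<0>
for x_t in rng.normal(size=(5, 3)):  # unroll over 5 timesteps, same weights each step
    a, y = rnn_step(a, x_t, Waa, Wax, Wya, ba, by)
```

Note how the same weight matrices are reused at every timestep, which is exactly the temporal weight sharing mentioned above.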

The pros and cons of a typical RNN architecture are summed up in the table below:

| Advantages | Drawbacks |
|---|---|
| • Possibility of processing input of any length<br>• Model size not increasing with size of input<br>• Computation takes into account historical information<br>• Weights are shared across time | • Computation being slow<br>• Difficulty of accessing information from a long time ago<br>• Cannot consider any future input for the current state |

**Applications of RNNs** ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:

| Type of RNN | Example |
|---|---|
| One-to-one ($T_x=T_y=1$) | Traditional neural network |
| One-to-many ($T_x=1, T_y>1$) | Music generation |
| Many-to-one ($T_x>1, T_y=1$) | Sentiment classification |
| Many-to-many ($T_x=T_y$) | Named entity recognition |
| Many-to-many ($T_x\neq T_y$) | Machine translation |

**Loss function** ― In the case of a recurrent neural network, the loss function $\mathcal{L}$ of all time steps is defined based on the loss at every time step as follows:

$$\mathcal{L}(\widehat{y},y)=\sum_{t=1}^{T_y}\mathcal{L}(\widehat{y}^{<t>},y^{<t>})$$

**Backpropagation through time** ― Backpropagation is done at each point in time. At timestep $T$, the derivative of the loss $\mathcal{L}$ with respect to the weight matrix $W$ is expressed as follows:

$$\frac{\partial \mathcal{L}^{(T)}}{\partial W}=\sum_{t=1}^{T}\left.\frac{\partial\mathcal{L}^{(T)}}{\partial W}\right|_{(t)}$$
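To make the summed loss concrete, here is a minimal sketch using cross-entropy as the per-timestep loss; the predictions, labels, and the function name `sequence_loss` are illustrative assumptions:

```python
import numpy as np

def sequence_loss(y_hat, y):
    """Total loss = sum of per-timestep cross-entropy losses.

    y_hat: (Ty, n_classes) predicted distributions; y: (Ty,) true class indices.
    """
    # Per-timestep loss L(y_hat<t>, y<t>) = -log p(correct class at t)
    per_step = -np.log(y_hat[np.arange(len(y)), y])
    return per_step.sum()

# Toy example (assumed): two timesteps, three classes
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([0, 1])
total = sequence_loss(y_hat, y)  # -log(0.7) - log(0.8)
```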

## Handling long term dependencies

**Commonly used activation functions** ― The most common activation functions used in RNN modules are described below:

| Sigmoid | Tanh | RELU |
|---|---|---|
| $g(z)=\dfrac{1}{1+e^{-z}}$ | $g(z)=\dfrac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$ | $g(z)=\max(0,z)$ |
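These three activations translate directly into NumPy; the definitions below follow the formulas above literally (in practice, `np.tanh` is the numerically stable choice for large $|z|$):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}), squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # g(z) = (e^z - e^{-z}) / (e^z + e^{-z}), squashes to (-1, 1)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    # g(z) = max(0, z), zeroes out negative entries
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
s, t, r = sigmoid(z), tanh(z), relu(z)
```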

**Vanishing/exploding gradient** ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. They happen because it is difficult to capture long-term dependencies: the multiplicative gradients can decrease or increase exponentially with respect to the number of layers.
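A toy illustration of the exponential behavior: backpropagating through $T$ timesteps multiplies the gradient by roughly the same Jacobian factor $T$ times. The scalar factors below stand in for that Jacobian and are purely illustrative:

```python
# If each backward step scales the gradient by `factor`, after T steps the
# gradient is factor**T: it vanishes when factor < 1 and explodes when factor > 1.
def gradient_through_time(factor, T=50):
    return factor ** T

vanished = gradient_through_time(0.9)   # ~0.005 after 50 steps
exploded = gradient_through_time(1.1)   # ~117 after 50 steps
```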

**Gradient clipping** ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.
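One common variant is clipping by the gradient's L2 norm (rescaling it when the norm exceeds a threshold); element-wise clipping is another option. The threshold `max_norm=5.0` below is an illustrative value, not a prescribed one:

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm (norm clipping)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        # Rescaling preserves the gradient's direction, only caps its magnitude
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])    # norm 50: large enough to destabilize an update
g_clipped = clip_gradient(g)  # rescaled down to norm 5, same direction
```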

**Types of gates** ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted $\Gamma$ and are equal to:

$$\Gamma=\sigma(Wx^{<t>}+Ua^{<t-1>}+b)$$

where $W, U, b$ are coefficients specific to the gate and $\sigma$ is the sigmoid function.
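A generic gate of this form can be sketched as follows; the sizes and random weights are illustrative assumptions. Because of the sigmoid, every entry of $\Gamma$ lies in $(0, 1)$, so the gate acts as a soft on/off switch on the values it multiplies:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x_t, a_prev, W, U, b):
    """Generic gate: Gamma = sigmoid(W x<t> + U a<t-1> + b)."""
    return sigmoid(W @ x_t + U @ a_prev + b)

# Toy sizes (assumed): input size 3, hidden size 4
rng = np.random.default_rng(1)
W, U, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
gamma = gate(rng.normal(size=3), np.zeros(4), W, U, b)  # entries in (0, 1)
```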