Recurrent Neural Networks cheatsheet

By Afshine Amidi and Shervine Amidi

Overview

Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:

For each timestep t, the activation a^{<t>} and the output y^{<t>} are expressed as follows:

a^{<t>} = g_1(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a) \quad \textrm{and} \quad y^{<t>} = g_2(W_{ya}a^{<t>} + b_y)

where W_{ax}, W_{aa}, W_{ya}, b_a, b_y are coefficients that are shared temporally and g_1, g_2 are activation functions.
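
As an illustration, below is a minimal NumPy sketch of this recurrence (not part of the original cheatsheet), assuming g_1 is tanh and g_2 is softmax; the function name rnn_forward and the argument shapes are illustrative choices.

import numpy as np

def rnn_forward(x_seq, a0, Waa, Wax, Wya, ba, by):
    # Forward pass of a vanilla RNN over a sequence.
    # x_seq: list of input vectors x^<t>, each of shape (n_x, 1)
    # a0: initial hidden state of shape (n_a, 1)
    # Waa, Wax, Wya, ba, by: parameters shared across all timesteps
    a = a0
    a_states, y_outputs = [], []
    for x in x_seq:
        # a^<t> = g1(Waa a^<t-1> + Wax x^<t> + ba), with g1 = tanh (assumed)
        a = np.tanh(Waa @ a + Wax @ x + ba)
        # y^<t> = g2(Wya a^<t> + by), with g2 = softmax (assumed)
        z = Wya @ a + by
        y = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
        a_states.append(a)
        y_outputs.append(y)
    return a_states, y_outputs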

The pros and cons of a typical RNN architecture are summed up below:

Advantages
• Possibility of processing input of any length
• Model size not increasing with size of input
• Computation takes into account historical information
• Weights are shared across time

Drawbacks
• Computation being slow
• Difficulty of accessing information from a long time ago
• Cannot consider any future input for the current state

Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up below:

• One-to-one (T_x = T_y = 1): traditional neural network
• One-to-many (T_x = 1, T_y > 1): music generation
• Many-to-one (T_x > 1, T_y = 1): sentiment classification
• Many-to-many (T_x = T_y): named entity recognition
• Many-to-many (T_x ≠ T_y): machine translation

Loss function ― In the case of a recurrent neural network, the loss function \mathcal{L} over all time steps is defined based on the loss at every time step as follows:

\mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_y} \mathcal{L}(\hat{y}^{<t>}, y^{<t>})
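
A minimal sketch of this sum, assuming cross-entropy as the per-timestep loss (any per-timestep loss fits the formula); the function name sequence_loss is an illustrative choice.

import numpy as np

def sequence_loss(y_hat_seq, y_seq):
    # y_hat_seq: list of predicted probability vectors y_hat^<t>
    # y_seq: list of one-hot ground-truth vectors y^<t>
    total = 0.0
    for y_hat, y in zip(y_hat_seq, y_seq):
        # cross-entropy at timestep t (assumed per-timestep loss)
        total += -np.sum(y * np.log(y_hat + 1e-12))
    return total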

Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss \mathcal{L} with respect to the weight matrix W is expressed as follows:

\frac{\partial \mathcal{L}^{(T)}}{\partial W} = \sum_{t=1}^{T} \left.\frac{\partial \mathcal{L}^{(T)}}{\partial W}\right|_{(t)}
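
As a sketch of how this sum is accumulated in practice, the toy example below computes the gradient of a squared loss at the final timestep with respect to the shared weight of a scalar, linear RNN, adding one contribution per timestep; the scalar and linear simplifications are assumptions made purely for illustration.

def bptt_scalar(x, waa, wax, a0, y_target):
    # Scalar linear RNN: a^<t> = waa*a^<t-1> + wax*x^<t>
    # Loss at the last timestep only: L^(T) = 0.5*(a^<T> - y_target)^2
    T = len(x)
    a = [a0]                              # forward pass, keep all hidden states
    for t in range(T):
        a.append(waa * a[t] + wax * x[t])
    dL_daT = a[-1] - y_target             # dL^(T)/da^<T>
    grad_waa = 0.0                        # dL^(T)/dwaa, summed over per-timestep contributions
    daT_dat = 1.0                         # da^<T>/da^<t>, built up going backwards
    for t in range(T, 0, -1):
        grad_waa += dL_daT * daT_dat * a[t - 1]   # contribution of timestep t
        daT_dat *= waa                            # chain back one more step
    return grad_waa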

Handling long term dependencies

Commonly used activation functions ― The most common activation functions used in RNN modules are described below:

Sigmoid: g(z) = \frac{1}{1 + e^{-z}}
Tanh: g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
ReLU: g(z) = \max(0, z)
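
These formulas translate directly into NumPy, for instance:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)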

Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. They occur because the gradient is multiplied by a factor at every layer during backpropagation, so it can decrease or increase exponentially with the number of layers (timesteps), which makes long-term dependencies difficult to capture.
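
A toy numerical illustration of this multiplicative effect (an illustration, not from the cheatsheet): multiplying a gradient by a constant factor at each of 50 timesteps.

for factor, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    grad = 1.0
    for _ in range(50):          # backpropagate through 50 timesteps
        grad *= factor           # repeated multiplication through time
    print(label, grad)           # ~8.9e-16 (vanishing) vs ~6.4e+08 (exploding)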

Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.
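
A minimal sketch of clipping by norm, assuming the gradient is stored in a NumPy array; the threshold value of 5 is an arbitrary illustrative choice.

import numpy as np

def clip_gradient(grad, max_norm=5.0):
    # Rescale the gradient if its norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad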

Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to: