Introduction to RNNs

In this notebook, you will learn:

  1. What is an RNN?
  2. Architecture of a basic RNN, LSTM and GRU
  3. Back Propagation through time
  4. Implementation of LSTM, RNN, GRU in PyTorch
  5. Bidirectional RNNs

What is an RNN?

Recurrent Neural Networks (RNNs) are a type of artificial neural network designed for processing sequential data. Unlike traditional feedforward neural networks, which have a fixed input and output, RNNs can accept inputs of variable lengths and maintain a kind of memory of what has come before in the sequence. This ability makes RNNs particularly well-suited for time-series tasks such as language modeling, speech recognition, and stock price prediction.

At a high level, an RNN processes a sequence of inputs by iteratively passing the current input and an internal state as inputs to a network of interconnected nodes called ‘memory cells.’ These memory cells use the current input and their internal state to update their output, which is then passed on to the next memory cell in the sequence. This cyclic flow of information allows RNNs to maintain information in their internal state, enabling them to make predictions and generate output based on the context of the previous items of the sequence.
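
The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a tanh activation; the weight names `Wx`, `Wh`, and `b` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

Wx = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
Wh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state mixes the current input with the previous state
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

# Process a toy sequence of 5 steps, carrying the hidden state forward
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = rnn_step(x_t, h)

print(h.shape)  # (4,)
```

Note that the same `Wx`, `Wh`, and `b` are reused at every step; this weight sharing is what lets the network handle sequences of any length.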

Applications of RNNs

RNNs have found applications in various fields due to their ability to process sequential data. Here are a few examples:

  1. Natural Language Processing (NLP): RNNs have been used extensively for NLP tasks such as language modeling, machine translation, sentiment analysis, and speech recognition. For example, RNNs can model the probability of a sequence of words in a sentence or generate new sentences; they can also predict the next word of a sentence as you type.

  2. Time Series Analysis: RNNs can be used to model time series data such as stock prices, weather patterns, and energy consumption. They can capture the temporal dependencies in the data and predict the future values based on the previous ones.

  3. Image and Video Recognition: RNNs have also been used for image and video recognition tasks. For example, they can be used to caption images and videos by generating natural language descriptions of them. They can also be used for video classification and action recognition.

  4. Music Generation: RNNs have been used for generating music by training on a large dataset of music compositions. They can generate new music compositions that sound similar to the ones in the dataset.

  5. Reinforcement Learning: RNNs have also been used in reinforcement learning, where the model learns to take actions based on the previous ones and the current state. They have been used to learn to play games such as Atari games and Go.

Architecture of a basic RNN

At its core, an RNN is just a neural network that uses its own output as input for the next step in the sequence. The basic architecture of an RNN looks like this:

image.png

Other types of RNN

LSTM

There are many types of recurrent blocks; one of the most widely used is the Long Short-Term Memory (LSTM) cell. The architecture of a simple LSTM block is shown here:

RNN_auto_x2_colored_toned_light_ai.jpg

At each time step, the LSTM takes an input vector $x_t$ and combines it with the hidden state vector $h_{t-1}$ from the previous time step to produce a new hidden state vector $h_t$. The output $o_t$ is then computed from the hidden state vector. The parameters of the LSTM are shared across all time steps, allowing it to handle inputs of varying lengths.
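
For reference, the gated update sketched above is usually written as follows (a standard summary, with $\sigma$ the sigmoid function, $\odot$ the element-wise product, and $W$, $U$ the input and recurrent weight matrices):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```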

GRU

The Gated Recurrent Unit (GRU) is a type of recurrent cell that, in certain cases, has advantages over Long Short-Term Memory (LSTM): it uses less memory and is faster to train, although LSTMs tend to be more accurate on datasets with longer sequences.

The architecture of GRU is shown here:

image.png
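
One common way to write the GRU update (with $z_t$ the update gate and $r_t$ the reset gate; conventions differ on which gate multiplies the old state):

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```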

Back Propagation through time

Recurrent Neural Networks are networks that deal with sequential data. They predict outputs using not only the current inputs but also those that occurred before. In other words, the current output depends on the current input as well as a memory element (which takes the past inputs into account). To train such networks, we use good old backpropagation, but with a slight twist: we don't train the system only at a specific time "t"; we also train it on all that happened before time "t", like t-1, t-2, t-3.

Consider the following representation of an RNN:

image.png

S1, S2, S3 are the hidden states or memory units at time t1, t2, t3 respectively, and Ws is the weight matrix associated with it. X1, X2, X3 are the inputs at time t1, t2, t3 respectively, and Wx is the weight matrix associated with it. Y1, Y2, Y3 are the outputs at time t1, t2, t3 respectively, and Wy is the weight matrix associated with it. g1 and g2 are activation functions. For any time, t, we have the following two equations:

image.png
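
In standard notation, with the symbols just defined, these two equations are usually written as:

```latex
\begin{aligned}
S_t &= g_1(W_x X_t + W_s S_{t-1})\\
Y_t &= g_2(W_y S_t)
\end{aligned}
```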

Let us now perform back propagation at time t = 3. Let the error function be:

image.png

Updating Wy

image.png

We backpropagate to update Wy with the following formula:

image.png

Explanation:

E3 is a function of Y3. Hence, we differentiate E3 w.r.t Y3.

Y3 is a function of Wy. Hence, we differentiate Y3 w.r.t Wy.
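
In symbols, this chain rule for the output weights reads:

```latex
\frac{\partial E_3}{\partial W_y}
= \frac{\partial E_3}{\partial Y_3}\,
  \frac{\partial Y_3}{\partial W_y}
```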

Updating Ws

image.png

We calculate the gradient with respect to Ws through all time steps up to t = 3. So:

image.png

image.png

image.png

Explanation:

E3 is a function of Y3. Hence, we differentiate E3 w.r.t Y3.

Y3 is a function of S3. Hence, we differentiate Y3 w.r.t S3.

S3 is a function of Ws. Hence, we differentiate S3 w.r.t Ws.

But we can't stop there; we also have to take the previous time steps into consideration. So, we partially differentiate the error function with respect to the memory units S2 and S1, taking the weight matrix Ws into account. Keep in mind that a memory unit St is a function of its previous memory unit St-1. Hence, we differentiate S3 w.r.t S2, and S2 w.r.t S1.

Generally, we can express this formula as:

image.png
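
In standard BPTT notation, this sum over earlier time steps can be written as:

```latex
\frac{\partial E_3}{\partial W_s}
= \sum_{k=1}^{3}
  \frac{\partial E_3}{\partial Y_3}\,
  \frac{\partial Y_3}{\partial S_3}\,
  \frac{\partial S_3}{\partial S_k}\,
  \frac{\partial S_k}{\partial W_s}
```

where $\partial S_3 / \partial S_k$ unfolds as a product of adjacent terms, e.g. $\frac{\partial S_3}{\partial S_1} = \frac{\partial S_3}{\partial S_2}\frac{\partial S_2}{\partial S_1}$, and the k = 3 term uses $\partial S_3 / \partial S_3 = 1$.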

Updating Wx

image.png

Formula:

image.png

image.png

image.png

Explanation:

E3 is a function of Y3. Hence, we differentiate E3 w.r.t Y3.

Y3 is a function of S3. Hence, we differentiate Y3 w.r.t S3.

S3 is a function of Wx. Hence, we differentiate S3 w.r.t Wx.

Again, we can't stop there; we also have to take the previous time steps into consideration. So, we partially differentiate the error function with respect to the memory units S2 and S1, taking the weight matrix Wx into account.

Generally, we can express this formula as:

image.png
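
The general form mirrors the one for Ws, with the innermost derivative taken with respect to Wx:

```latex
\frac{\partial E_3}{\partial W_x}
= \sum_{k=1}^{3}
  \frac{\partial E_3}{\partial Y_3}\,
  \frac{\partial Y_3}{\partial S_3}\,
  \frac{\partial S_3}{\partial S_k}\,
  \frac{\partial S_k}{\partial W_x}
```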

Issues

This method of Back Propagation through time (BPTT) can only be used up to a limited number of time steps, like 8 or 10. If we backpropagate further, the gradient $\delta$ becomes too small. This is called the "vanishing gradient" problem: the contribution of information decays geometrically over time, so if the number of time steps exceeds roughly 10, that information is effectively discarded.

One famous solution to this problem is to use Long Short-Term Memory (LSTM) cells instead of the traditional RNN cells. But yet another problem can arise here, called the exploding gradient problem, where the gradient grows uncontrollably large.

So what can we do about that?

A popular method called gradient clipping can be used: at each time step, we check whether the gradient $\delta$ exceeds a threshold, and if so, rescale (normalize) it.
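
PyTorch provides this directly via `torch.nn.utils.clip_grad_norm_`. A minimal sketch (the model and threshold here are purely illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()

x = torch.randn(32, 10)
y = torch.randn(32, 1) * 1000.0  # large targets force large gradients

loss = criterion(model(x), y)
loss.backward()

threshold = 1.0
# Rescales all gradients in place so their global norm is at most `threshold`
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=threshold)

grads = torch.cat([p.grad.flatten() for p in model.parameters()])
total_norm = grads.norm().item()
print(total_norm)  # at most 1.0, up to floating-point error
```

In a training loop, the call goes between `loss.backward()` and `optimizer.step()`.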

Implementing in PyTorch

Now let's see how we can implement three types of RNN in PyTorch. We'll start with the LSTM.

Download Dataset

First, we download a time-series dataset. Here we use a dataset of monthly airline passenger counts. Although it is small and simple, it serves well for learning purposes.

#!wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/shampoo.csv
!wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv
--2023-04-18 11:24:07--  https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2180 (2.1K) [text/plain]
Saving to: ‘airline-passengers.csv’

airline-passengers. 100%[===================>]   2.13K  --.-KB/s    in 0s      

2023-04-18 11:24:07 (36.0 MB/s) - ‘airline-passengers.csv’ saved [2180/2180]

Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.autograd import Variable
from sklearn.preprocessing import MinMaxScaler

Data Plot

training_set = pd.read_csv('airline-passengers.csv')
#training_set = pd.read_csv('shampoo.csv')
training_set.shape
(144, 2)
training_set.head()
Month Passengers
0 1949-01 112
1 1949-02 118
2 1949-03 132
3 1949-04 129
4 1949-05 121

We don't want the dates, so we remove them.

training_set.iloc[:,1:2]
Passengers
0 112
1 118
2 132
3 129
4 121
... ...
139 606
140 508
141 461
142 390
143 432

144 rows × 1 columns

training_set = training_set.iloc[:,1:2].values

#plt.plot(training_set, label = 'Shampoo Sales Data')
plt.plot(training_set, label = 'Airline Passengers Data')
plt.show()

png

Dataloading

Here, we define a function to split the dataset into time-series windows of length seq_length: the ith element of x is a tensor of length seq_length covering the target column over the interval [i, i+seq_length-1], and the corresponding label y is the value at the next time step, i+seq_length.

def sliding_windows(data, seq_length):
    x = []
    y = []
    n = len(data)

    # Each window covers [i, i + seq_length - 1]; its label is the next value
    for i in range(n - seq_length):
        x.append(data[i:i + seq_length])
        y.append(data[i + seq_length])

    return np.array(x), np.array(y)
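
To see what this windowing produces, here is a toy example of the same idea: each window of length seq_length is paired with the value that follows it.

```python
import numpy as np

data = np.arange(10)  # a toy series: 0, 1, ..., 9
seq_length = 3

# Window i covers data[i : i + seq_length]; its label is data[i + seq_length]
x = np.array([data[i:i + seq_length] for i in range(len(data) - seq_length)])
y = np.array([data[i + seq_length] for i in range(len(data) - seq_length)])

print(x[0], y[0])    # [0 1 2] 3
print(x[-1], y[-1])  # [6 7 8] 9
```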

Then we normalize the dataset so that it lies in the range [0, 1].

scaler = MinMaxScaler(feature_range=(0, 1))
training_data = scaler.fit_transform(training_set)
seq_length = 4
x, y = sliding_windows(training_data, seq_length)
x[0]
array([[0.01544402],
       [0.02702703],
       [0.05405405],
       [0.04826255]])

Split test-train set:

train_size = int(len(y) * 0.67)
test_size = len(y) - train_size

dataX = torch.tensor(x, dtype=torch.float32)
dataY = torch.tensor(y, dtype=torch.float32)

trainX = torch.tensor(x[:train_size], dtype=torch.float32)
trainY = torch.tensor(y[:train_size], dtype=torch.float32)

testX = torch.tensor(x[train_size:], dtype=torch.float32)
testY = torch.tensor(y[train_size:], dtype=torch.float32)

Model

class LSTM(nn.Module):

    def __init__(self, num_classes, input_size, hidden_size, num_layers):
        super(LSTM, self).__init__()
        
        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.seq_length = seq_length
        
        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # init hidden state
        h_0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        
        # init cell state
        c_0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        
        # Propagate input through LSTM
        ula, (h_out, _) = self.lstm(x, (h_0, c_0))
        
        h_out = h_out.view(-1, self.hidden_size)
        
        out = self.fc(h_out)
        
        return out

Training

Init the variables needed for training.

num_epochs = 1000
learning_rate = 0.01

input_size = 1
hidden_size = 2
num_layers = 1

num_classes = 1

Init model

model = LSTM(num_classes, input_size, hidden_size, num_layers)

Define the loss function and optimizer

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
#optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

Train the model

train_losses = []
model.train()

for epoch in range(num_epochs):
    outputs = model(trainX)
    optimizer.zero_grad()
    
    # obtain the loss function
    loss = criterion(outputs, trainY)
    train_losses.append(loss.item())
    
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
      print(f"Epoch: {epoch}, {loss.item()}")
Epoch: 0, 0.6578255295753479
Epoch: 100, 0.014887369237840176
Epoch: 200, 0.010555111803114414
Epoch: 300, 0.0050033763982355595
Epoch: 400, 0.0030459200497716665
Epoch: 500, 0.0028110509738326073
Epoch: 600, 0.002621398074552417
Epoch: 700, 0.002457499271258712
Epoch: 800, 0.002317240694537759
Epoch: 900, 0.002198742236942053
plt.title("Train Losses")
plt.plot(train_losses)
plt.show()

png

Testing on the Airline Passengers Dataset

model.eval()
with torch.no_grad():  # inference only; no gradients needed
    train_predict = model(dataX)

data_predict = train_predict.data.numpy()
dataY_plot = dataY.data.numpy()

data_predict = scaler.inverse_transform(data_predict)
dataY_plot = scaler.inverse_transform(dataY_plot)

plt.axvline(x=train_size, c='r', linestyle='--')

plt.plot(dataY_plot, color='b')
plt.plot(data_predict, color='y')
plt.legend(["Test/Train split", "True", "Prediction"], loc ="lower right")

plt.suptitle('Time-Series Prediction')
plt.show()

png

Alternative models: simple RNN

Instead of the LSTM block in our model, we can use a simple RNN block or a GRU block:

class RNN(nn.Module):

    def __init__(self, num_classes, input_size, hidden_size, num_layers):
        super(RNN, self).__init__()
        
        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.seq_length = seq_length
        
        self.rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        h_0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        
        # Propagate input through RNN
        ula, h_out = self.rnn(x, h_0)

        h_out = h_out.view(-1, self.hidden_size)
        
        out = self.fc(h_out)
        
        return out
model = RNN(num_classes, input_size, hidden_size, num_layers)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
#optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
train_losses = []
model.train()

for epoch in range(num_epochs):
    outputs = model(trainX)
    optimizer.zero_grad()
    
    # obtain the loss function
    loss = criterion(outputs, trainY)
    train_losses.append(loss.item())
    
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
      print(f"Epoch: {epoch}, {loss.item()}")
Epoch: 0, 0.21138572692871094
Epoch: 100, 0.009810179471969604
Epoch: 200, 0.0027966599445790052
Epoch: 300, 0.0023493317421525717
Epoch: 400, 0.0021170424297451973
Epoch: 500, 0.0020042527467012405
Epoch: 600, 0.0019361759768798947
Epoch: 700, 0.0018845399608835578
Epoch: 800, 0.0018474081298336387
Epoch: 900, 0.0018271522130817175
plt.title("RNN Train Losses")
plt.plot(train_losses)
plt.show()

png


Alternative models: GRU

class GRU(nn.Module):

    def __init__(self, num_classes, input_size, hidden_size, num_layers):
        super(GRU, self).__init__()
        
        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.seq_length = seq_length
        
        self.gru = nn.GRU(input_size=input_size, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        h_0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        
        # Propagate input through GRU
        ula, h_out = self.gru(x, h_0)

        h_out = h_out.view(-1, self.hidden_size)
        
        out = self.fc(h_out)
        
        return out
model = GRU(num_classes, input_size, hidden_size, num_layers)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
#optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
train_losses = []
model.train()

for epoch in range(num_epochs):
    outputs = model(trainX)
    optimizer.zero_grad()
    
    # obtain the loss function
    loss = criterion(outputs, trainY)
    train_losses.append(loss.item())
    
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
      print(f"Epoch: {epoch}, {loss.item()}")
Epoch: 0, 0.3674144148826599
Epoch: 100, 0.0074386922642588615
Epoch: 200, 0.003244135295972228
Epoch: 300, 0.0029646488837897778
Epoch: 400, 0.002733838278800249
Epoch: 500, 0.002519385190680623
Epoch: 600, 0.0023391328286379576
Epoch: 700, 0.0021935098338872194
Epoch: 800, 0.002077604876831174
Epoch: 900, 0.001986352726817131
plt.title("GRU Train Losses")
plt.plot(train_losses)
plt.show()

png

Bidirectional RNNs

So far, with basic RNNs, we were able to predict the future. Let’s say we have a sentence and we want to predict the next word:

I am ___ .

But what if we want to predict a word in the middle of a sentence? For that we have to take into account the previous and next words. For instance:

I am ___ , I haven’t eaten in hours.

To complete this sentence, we have to process the future (next words) as well as the past (previous words). If we only process the past, there are many possibilities and we don't know which one is correct. By combining the future data, however, we can see that the correct answer is most likely "hungry".

Hence, to enable forward and reverse traversal of the input, Bidirectional RNNs (BRNNs) are used. A BRNN is a combination of two RNNs: one moves forward, beginning from the start of the data sequence, and the other moves backward, beginning from the end. We then use the outputs of both RNNs to compute the prediction.

The network blocks in a BRNN can either be simple RNNs, GRUs, or LSTMs.

image.png
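
PyTorch also supports this directly via the `bidirectional` flag on its recurrent layers, which doubles the output feature dimension. The sizes below are illustrative:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=2, batch_first=True, bidirectional=True)

x = torch.randn(8, 5, 1)  # (batch, seq_len, features)
out, h_n = rnn(x)

print(out.shape)  # torch.Size([8, 5, 4]): hidden size doubled (2 directions)
print(h_n.shape)  # torch.Size([2, 8, 2]): one final state per direction
```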

A simple implementation is shown here:

class BiRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.forward_rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, batch_first=True)
        self.backward_rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, batch_first=True)
        self.num_hiddens = hidden_size * 2  # the output dimension is doubled

    def forward(self, inputs, Hs=None):
        # inputs: (batch, seq_len, input_size)
        forward_H, backward_H = Hs if Hs is not None else (None, None)

        forward_outputs, forward_H = self.forward_rnn(inputs, forward_H)

        # Reverse along the time dimension, run the backward RNN, then
        # reverse its outputs back so the two directions align per time step
        backward_outputs, backward_H = self.backward_rnn(torch.flip(inputs, dims=[1]), backward_H)
        backward_outputs = torch.flip(backward_outputs, dims=[1])

        # Concatenate the two directions along the feature dimension
        outputs = torch.cat((forward_outputs, backward_outputs), dim=-1)

        return outputs, (forward_H, backward_H)


Neural Network - Computer Science Faculty of Shahid Beheshti University. Winter 2023 - Contact us at abtinmahyar@gmail.com