Recurrent Neural Networks
Many sources of data are sequential in nature. Some examples:
- The sequence of words in a document captures narrative, theme and tone: topic classification, sentiment analysis, language translation.
- Time series of temperature, rainfall, wind speed, air quality, etc: forecasting weather or climate.
- Financial time series, market indices, trading volumes, stock and bond prices, exchange rates.
- Recordings of speech, music, other sounds; transcription of text or music, language translation.
- Handwriting, such as doctors' notes or handwritten digits: OCR
The input to a Recurrent Neural Network is a sequence, be it a sequence of measurements, words, notes, sounds, images, or whatever. The architecture of the network will take advantage of the sequential nature of the input data.
The output can be a sequence, a number or a category.
Writing \(X_l = (X_{l1}, \dots, X_{lp})\) for the \(l\)-th element of the input sequence, the network maintains \(K\) hidden activations \(A_{l1}, \dots, A_{lK}\) that are updated at each step:
\[ A_{lk} = g \left( w_{k0} + \sum_{j=1}^p w_{kj} X_{lj} + \sum_{s=1}^K u_{ks} A_{l-1,s} \right) \]
where \(g\) is an activation function such as ReLU and
\[ O_l = \beta_0 + \sum_{k=1}^K \beta_k A_{lk} \]
The weights \(\mathbf{W} = \{w_{kj}\}\), \(\mathbf{U} = \{u_{ks}\}\) and \(\mathbf{B} = \{\beta_k\}\) are the same for each element of the sequence: this is weight sharing.
For regression problems the loss function is \((Y - O_L)^2\), which references only the final output \(O_L = \beta_0 + \sum_{k=1}^K \beta_k A_{Lk}\); over the training set we minimise the sum of these squared errors.
Note that when the desired output \(y\) is a number or category then only \(O_L\) is used; all previous \(O_l\) are discarded, whereas when the desired output is a sequence, all \(\{O_1,\dots, O_L\}\) are used.
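As a concrete illustration, here is a minimal NumPy sketch of this forward recurrence; the function name, the use of ReLU for \(g\) and the zero initialisation of \(A_0\) are assumptions made for the example, not part of the source.

import numpy as np

def rnn_forward(X, W, U, w0, beta0, beta):
    """X: input sequence of shape (L, p); W: (K, p); U: (K, K); w0, beta: (K,)."""
    K = W.shape[0]
    A_prev = np.zeros(K)                                  # A_0 = 0 (assumed initialisation)
    outputs = []
    for x_l in X:                                         # the same W, U, B at every step
        A = np.maximum(0.0, w0 + W @ x_l + U @ A_prev)    # A_{lk} with g = ReLU
        outputs.append(beta0 + beta @ A)                  # O_l
        A_prev = A
    return outputs                                        # for regression only O_L is used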
Document Classification
Is a movie review positive or negative?
This has to be one of the worst films of the 1990s. When my friends & I were watching this film (being the target audience it was aimed at) we just sat & watched the first half an hour with our jaws touching the floor at how bad it really was. The rest of the time, everyone else in the theater just started talking to each other, leaving or generally crying into their popcorn ...
Sentiment analysis judges whether the sentiment of a piece of text is positive or negative. Some applications:
- Social media monitoring
- Brand reputation management
- Customer feedback analysis
- Market research
- Political analysis
- Healthcare and patient feedback
The simplest approach is to treat the text as a bag of words. This takes no account of the order of the words or the structure of the sentences. We can either take the set of words appearing in a text, or a bag of words, which also counts the number of occurrences of each word.
import heapq
import re
from collections import Counter
from os.path import exists
from sys import exit
from textwrap import wrap

import nltk
import numpy as np
from requests import get

# Download the text of James Joyce's Ulysses from Project Gutenberg
ulysses = "pg4300.txt"
if exists("pg4300.txt"):
    text = open(ulysses).read()
else:
    response = get("https://gutenberg.org/cache/epub/4300/pg4300.txt")
    if response.status_code == 200:
        text = response.text
    else:
        exit("Unable to get text")


def most_popular_words(text, n=100):
    word2count = Counter()
    for sentence in nltk.sent_tokenize(text):
        word2count.update(nltk.word_tokenize(sentence.lower()))
    return heapq.nlargest(n, word2count, key=word2count.get)


for line in wrap(" ".join(most_popular_words(text))):
    print(line)
. , the of and a to in ’ he his i that s : with it _ ? was on you for ) ( her ! him is all at by said as she from or they bloom me not out be what up my had there like their mr one have but them an t no so then stephen if when about are which were o your old who this says down we man over too now do see after did two would time ... off back will other into eyes know where more those some could hand
But let's consider the set-of-words approach for movie reviews on the IMDb movie database.
One of the most critically acclaimed films of 2023 is Fallen Leaves. Here is one of its reviews on IMDb. It is 147 words long.
text = """A film brimming with charm, thanks to its human characters, struggling in their own way to make a living. She's a cashier, moving from job to job. She owns her own apartment. He's a manual laborer who works in a factory, but drinks. And he goes from job to job. They cross paths. They're both alone. They're drawn to each other. Girl meets boy. Boy meets girl. They're both shy. But there will be grains of sand in the mechanics of their relationship. Aki Kaurismäki doses the construction of this couple perfectly. Aki Kaurismäki sprinkles his film with references (Jean-Luc Godard, George A. Romero, for example). The result is a short film, and all the better for it. There are no unnecessary sequences here. There's no extra-diegetic music. Without going too fast, Aki Kaurismäki builds the love story between the characters. A film to warm the heart.""" word_count = len(text.split()) pop_words = most_popular_words(text, 32) print(word_count) for line in wrap(" ".join(pop_words)): print(line)
147
a the to film job they in s re there of aki kaurismäki with characters their own she from he but and both girl meets boy for no brimming charm thanks its
- Kaggle has a dataset of 50k Movie Reviews from IMDb.
- To develop a model, we might limit ourselves to the most popular 10k English words.
- We could make a dataframe of 50k rows and 10k columns, each entry being a 1 if the word appears in the review or 0 if it does not.
- Now we have a dataframe with 500 million entries, the vast majority of which are 0. In fact only about 1.3% of the entries are non-zero. Mathematically, such an array is called a sparse matrix, and there are methods for dealing with them in Python for Machine Learning, such as in SciPy.
from numpy import array
from scipy.sparse import csr_matrix

A = array([[1, 0, 0, 1, 0, 0],
           [0, 0, 2, 0, 0, 1],
           [0, 0, 0, 2, 0, 0]])
S = csr_matrix(A)
B = S.todense()
print(A)
print(S)
print(B)
[[1 0 0 1 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]
<Compressed Sparse Row sparse matrix of dtype 'int64'
    with 5 stored elements and shape (3, 6)>
  Coords    Values
  (0, 0)    1
  (0, 3)    1
  (1, 2)    2
  (1, 5)    1
  (2, 3)    2
[[1 0 0 1 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]
In the Lab you will see an example that takes 25,000 reviews and the 10,000 most popular words. Two models are trained on these data:
- A lasso logistic regression
- A neural network with 2 hidden layers, each with 16 ReLU units.
The outcomes are similar.
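As a rough illustration of the first of these models (not the Lab code), here is a minimal scikit-learn sketch of a lasso-penalised logistic regression on a sparse set-of-words matrix; the two toy reviews, the labels and the choice of the penalty strength C are assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["A film brimming with charm, all the better for it.",
           "One of the worst films of the 1990s."]
labels = [1, 0]                                     # 1 = positive, 0 = negative (toy data)

vectorizer = CountVectorizer(max_features=10_000, binary=True)      # 0/1 set-of-words entries
X = vectorizer.fit_transform(reviews)               # a SciPy sparse matrix, as above
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)   # lasso (L1) penalty
clf.fit(X, labels)
print(clf.predict(X))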
Note that the bag-of-words approach takes no account of context. For example, the word "blissfully" in the sentence "The movie was blissfully short" might be considered to carry a negative sentiment, while in the sentence "The movie was blissfully long" it might be positive. One way to take context into account is to consider a bag of n-grams, which counts short runs of adjacent words rather than single words.
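A small sketch of the bag-of-n-grams idea, using scikit-learn's CountVectorizer to count bigrams as well as single words, so that "blissfully short" and "blissfully long" become distinct features (the two example sentences are taken from the paragraph above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The movie was blissfully short", "The movie was blissfully long"]
vectorizer = CountVectorizer(ngram_range=(1, 2))    # unigrams and bigrams
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())           # includes 'blissfully short' and 'blissfully long'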
Another approach is to apply dimensionality reduction techniques to reduce the dimension, from 10,000 in this case to several hundred. [Figure: 20 words represented in 16 dimensions, and the same words reduced to 5 dimensions.]
We call such reductions word embeddings. They have interesting properties: words with similar meanings end up close together in the embedding space, and directions in that space can capture relationships between words.
We can even download embeddings that have been pre-trained on large corpora of text; word2vec and GloVe are well-known examples.
We use these embeddings to transform our sparse representations in a large number of dimensions into denser representations in a (much) smaller number of dimensions, and feed the results into an RNN.
In the Lab we will train an RNN on a 32-dimensional word embedding of the 25,000 IMDb reviews, using dropout regularization. This gives an accuracy of around 76%. It can be improved by techniques such as Long Short-Term Memory (LSTM), where two (or more) hidden-layer activations are maintained in order to ensure that early signals are not dominated by later ones. This can improve accuracy to 87%.
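For concreteness, here is a minimal Keras sketch (not the Lab code) of such a model: an embedding layer followed by an LSTM layer with dropout and a sigmoid output for positive/negative. The vocabulary size, layer sizes and dropout rates are assumptions.

from tensorflow.keras import layers, models

vocab_size = 10_000                                       # most popular words kept (assumption)

model = models.Sequential([
    layers.Embedding(vocab_size, 32),                     # 32-dimensional word embedding
    layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2),  # recurrent layer with dropout
    layers.Dense(1, activation="sigmoid"),                # probability the review is positive
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])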
Time Series Forecasting
Here are some historical trading statistics for the New York Stock Exchange from 1962 to 1986. There are 6,051 daily triples:
- Log trading volume
- Dow Jones Index return
- Log volatility
Predicting stock prices is notoriously hard. Predicting volume is easier and can be an indicator of price development.
We observe repeating patterns in the series: autocorrelation.
We set out to predict volume from past values of volatility, the Dow Jones index return and volume itself. Note that whereas we had 25,000 distinct sequences in the IMDb case, here we have 6,051 days giving overlapping sequences. If we use a lag of 5, then we have 6,046 pairs in our dataset (a sketch of how such pairs could be built is given below). Fitting such an RNN, we get the following results.
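A minimal NumPy sketch of constructing the overlapping lagged pairs; the function name, variable names and column layout are assumptions, not the Lab code.

import numpy as np

def make_lagged_pairs(series, lag=5):
    """series: array of shape (n_days, 3) with columns
    [log_volume, dj_return, log_volatility] (assumed layout)."""
    X, y = [], []
    for t in range(lag, len(series)):
        X.append(series[t - lag:t])   # the previous `lag` days of all three series
        y.append(series[t, 0])        # target: today's log trading volume
    return np.array(X), np.array(y)

# With 6,051 days and lag 5 this yields 6,046 pairs,
# X of shape (6046, 5, 3), ready to feed to an RNN layer.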
Fitting Neural Networks
If our NN has a single hidden layer with \(K\) units, weights \(w_k = (w_{k0}, \dots, w_{kp})\) where \(w_{kj}\) is the weight on input \(j\) for hidden unit \(k\), and output parameters \(\beta = (\beta_0, \dots, \beta_K)\), and we train it on observations \((x_i, y_i), \; i = 1, \dots, n\), then we seek to minimise
\[ \min_{\{w_k\}_{k=1}^K,\, \beta} \frac{1}{2} \sum_{i=1}^n (y_i - f(x_i))^2 \]
where
\[ f(x_i) = \beta_0 + \sum_{k=1}^K \beta_k g\left( w_{k0} + \sum_{j=1}^p w_{kj} x_{ij} \right) \]
Since \(f\) is non-linear in its parameters, we have a nonconvex optimization problem, so there can be multiple local solutions; local techniques are no longer guaranteed to find the optimal one.
Writing all the parameters in one vector \(\theta\), we seek to minimise the objective \[ R(\theta) = \frac{1}{2} \sum_{i=1}^n (y_i - f_\theta(x_i))^2 \]
Now we do gradient descent as follows
- Guess a value for \(\theta^0\). Set \(t = 0\)
- Now iterate until \(R\) fails to decrease.
- Find a small vector \(\delta\) such that \(R(\theta^{t+1}) = R(\theta^t + \delta) < R(\theta^t)\)
- Set \(t \leftarrow t + 1\)
But how do we find \(\delta\)? We calculate the gradient of \(R\) at \(\theta^t\) \[ \nabla R(\theta^t) = \left. \frac {\partial R(\theta)} {\partial \theta} \right|_{\theta = \theta^t} \]
The gradient points in the direction of greatest increase, i.e. up the slope. We want to go down the direction of greatest decrease so we update \[ \theta^{t+1} \leftarrow \theta^t - \rho \nabla R(\theta^t) \] where \(\rho\) is the learning rate.
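A small sketch of this update rule in NumPy; the function names, the stopping rule and the quadratic test objective are assumptions chosen for illustration.

import numpy as np

def gradient_descent(R, grad_R, theta0, rho=0.01, max_iter=10_000, tol=1e-10):
    theta = theta0
    for _ in range(max_iter):
        theta_new = theta - rho * grad_R(theta)   # step against the gradient
        if R(theta) - R(theta_new) < tol:         # stop when R fails to decrease
            return theta
        theta = theta_new
    return theta

# Toy example: R(theta) = 0.5 * ||theta||^2 has gradient theta and minimum at 0.
theta_hat = gradient_descent(lambda t: 0.5 * t @ t, lambda t: t, np.ones(3))
print(theta_hat)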
The calculation of \(\nabla R\) turns out to be more straightforward than it might seem since we can apply the chain rule for differentiation.
To simplify the presentation, in the following we write \(z_{ik}\) for \(w_{k0} + \sum_{j=1}^p w_{kj} x_{ij}\).
In the case of the \(\beta_k\),
\begin{eqnarray*} \frac{\partial R_i(\theta)}{\partial \beta_k} &=& \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial \beta_k} \\ &=& -(y_i - f_\theta(x_i)) \cdot g(z_{ik}) \end{eqnarray*}
And for the \(w_{kj}\),
\begin{eqnarray*} \frac{\partial R_i(\theta)}{\partial w_{kj}} &=& \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial g(z_{ik})} \cdot \frac{\partial g(z_{ik})}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial w_{kj}} \\ &=& -(y_i - f_\theta(x_i)) \cdot \beta_k \cdot g'(z_{ik}) \cdot x_{ij} \end{eqnarray*}
Note that each of these terms is just the residual \(y_i - f_\theta(x_i)\) scaled by a factor, and that these fractions of the residual are propagated back through the network in a process called backpropagation.
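A NumPy sketch of these gradient formulas for a single observation, assuming ReLU for \(g\); the function and variable names are illustrative rather than taken from the source.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0).astype(float)

def per_observation_gradients(x, y, W, w0, beta0, beta):
    """x: (p,) inputs, y: scalar target, W: (K, p), w0: (K,), beta: (K,)."""
    z = w0 + W @ x                         # z_{ik} = w_{k0} + sum_j w_{kj} x_{ij}
    a = relu(z)                            # g(z_{ik})
    f = beta0 + beta @ a                   # f_theta(x_i)
    resid = y - f                          # the residual
    d_beta0 = -resid
    d_beta = -resid * a                    # -(y_i - f) g(z_{ik})
    d_w0 = -resid * beta * relu_prime(z)   # -(y_i - f) beta_k g'(z_{ik})
    d_W = np.outer(d_w0, x)                # -(y_i - f) beta_k g'(z_{ik}) x_{ij}
    return d_beta0, d_beta, d_w0, d_W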
Examples of RNNs
Google Translate as an example of seq2seq learning
2013: Atari games on YouTube
Considerations: Neural Networks
- Understanding versus prediction
- Understandability
- Explainability
- Occam's Razor
- etc.
- Advent of Code 2025
- Course Evaluation