Course Introduction
Today
- Getting to know each other
- Housekeeping: book, assignments, exams, grading, expectations
- What is Machine Learning?
- Some history
- ML around us today and its effects on our lives
Machine Learning
Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.
– Arthur Samuel, 1959
Machine Learning is the science (and art) of programming computers so they can learn from data.
– Aurélien Géron
A computer program is said to learn from experience E with respect to some class of tasks T, and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
– Tom Mitchell
Probabilistic reasoning
We will treat all unknown quantities as random variables endowed with probability distributions that describe a weighted set of possible values the variable may have.
Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking fundamental. It is, of course, not the only view. But it is through this view that we can connect what we do in machine learning to every other computational science, whether that be in stochastic optimisation, control theory, operations research, econometrics, information theory, statistical physics or bio-statistics. For this reason alone, mastery of probabilistic thinking is essential.
– Shakir Mohamed, DeepMind
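To make this concrete, here is a minimal sketch (the unknown quantity and its weights are invented for illustration): an unknown is represented not by one number but by a distribution over possible values, from which we can read off expectations and event probabilities.

```python
import numpy as np

# An unknown quantity -- say, tomorrow's number of spam emails out of 10 --
# represented by a (made-up) probability distribution over possible values.
values = np.arange(11)  # possible values 0..10
weights = np.array([1, 2, 4, 7, 9, 10, 9, 7, 4, 2, 1], dtype=float)
probs = weights / weights.sum()  # normalise to a probability distribution

mean = np.sum(values * probs)  # expected value under the distribution
std = np.sqrt(np.sum((values - mean) ** 2 * probs))  # spread around it
print(f"expected value {mean:.2f}, std {std:.2f}")
print(f"P(more than 7 spam emails) = {probs[values > 7].sum():.3f}")
```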
Data Science
An example: Spam filtering
- spam, ham
- training set of training instances
- performance measure: the ratio of correctly classified emails, known as accuracy
- How would you write a program to detect spam?
- "Typical" spam
- Patterns
- Test and tune
Figure 1: A traditional approach
Figure 2: A Machine Learning approach
Figure 3: Adapting to change
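To make the contrast between Figures 1 and 2 concrete, here is a hedged sketch (the toy emails, the keyword list, and the choice of a naive Bayes classifier from scikit-learn are all illustrative assumptions, not course material): a hand-written rule versus a classifier learned from the same training instances.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A tiny, made-up training set of labelled emails (1 = spam, 0 = ham).
emails = [
    "win a free prize now",
    "cheap meds online now",
    "lunch at noon tomorrow?",
    "draft of the report attached",
]
labels = [1, 1, 0, 0]

# Traditional approach (Figure 1): a hand-written keyword rule.
def rule_based(text):
    return int(any(word in text for word in ("free", "cheap", "prize")))

# Machine Learning approach (Figure 2): learn word weights from the training set.
vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(emails), labels)

test = ["free tickets for you", "see you at the meeting"]
print([rule_based(t) for t in test])              # rule-based predictions
print(model.predict(vectorizer.transform(test)))  # learned predictions
```

The learned model adapts to change (Figure 3) simply by refitting on new training instances, whereas the rule set must be revised by hand.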
An Interactive Introduction to Machine Learning
Identifying flowers
```python
import polars as pl

iris_file = "https://ml.auc-computing.nl/data/iris.csv"
df = pl.read_csv(iris_file)
df
```
| 150 | 4 | setosa | versicolor | virginica |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | i64 |
| 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| … | … | … | … | … |
| 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 6.2 | 3.4 | 5.4 | 2.3 | 2 |
| 5.9 | 3.0 | 5.1 | 1.8 | 2 |
The first line of this CSV file is not a list of column names: it records the number of rows, the number of features, and the three class names. We therefore supply proper column names ourselves, and reuse the original header to map each numeric class id to a species name.

```python
df = pl.read_csv(
    iris_file,
    has_header=True,
    new_columns=[
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
        "class_id",
    ],
)

# Re-read only the header line to recover the species names it contains.
df_head = pl.read_csv(iris_file, has_header=True, n_rows=0)
species = df_head.columns[2:]

# Map each numeric class id (0, 1, 2) to its species name.
mapping = pl.DataFrame(
    {"class_id": list(range(len(species))), "species": species}
)
df = df.join(mapping, on="class_id")
df
```
| sepal_length | sepal_width | petal_length | petal_width | class_id | species |
|---|---|---|---|---|---|
| f64 | f64 | f64 | f64 | i64 | str |
| 5.1 | 3.5 | 1.4 | 0.2 | 0 | "setosa" |
| 4.9 | 3.0 | 1.4 | 0.2 | 0 | "setosa" |
| 4.7 | 3.2 | 1.3 | 0.2 | 0 | "setosa" |
| 4.6 | 3.1 | 1.5 | 0.2 | 0 | "setosa" |
| 5.0 | 3.6 | 1.4 | 0.2 | 0 | "setosa" |
| … | … | … | … | … | … |
| 6.7 | 3.0 | 5.2 | 2.3 | 2 | "virginica" |
| 6.3 | 2.5 | 5.0 | 1.9 | 2 | "virginica" |
| 6.5 | 3.0 | 5.2 | 2.0 | 2 | "virginica" |
| 6.2 | 3.4 | 5.4 | 2.3 | 2 | "virginica" |
| 5.9 | 3.0 | 5.1 | 1.8 | 2 | "virginica" |
```python
df.group_by("species").len()
```
| species | len |
|---|---|
| str | u32 |
| "setosa" | 50 |
| "virginica" | 50 |
| "versicolor" | 50 |
```python
import altair as alt

adf = df.drop("class_id")
features = adf.drop("species").columns

# Scatter-plot matrix: every feature against every other, coloured by species.
alt.Chart(adf).mark_circle().encode(
    alt.X(alt.repeat("column"), type="quantitative"),
    alt.Y(alt.repeat("row"), type="quantitative"),
    color="species:N",
).properties(width=150, height=150).repeat(
    row=features,
    column=features,
).save("images/iris_altair_pairplot.png")
```
```python
import seaborn as sns

# Seaborn expects a pandas DataFrame, so convert from polars first.
pdf = df.drop("class_id").to_pandas()
sns.pairplot(pdf, hue="species").savefig("images/iris_pairplot.png")
```
- data frame
- design matrix
- tabular data
- features and target
- supervised learning
- classification
- Misclassification rate
- Asymmetric loss matrix
| | Setosa | Versicolor | Virginica |
|---|---|---|---|
| Setosa | 0 | 1 | 1 |
| Versicolor | 1 | 0 | 1 |
| Virginica | 10 | 10 | 0 |
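A hedged sketch tying these terms together (the train/test split, the choice of a k-nearest-neighbours classifier, and reading the table with rows as the true class and columns as the predicted class are all illustrative assumptions): build a design matrix X and target y from the data frame, fit a classifier, and score it both by misclassification rate and under the asymmetric loss matrix above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Design matrix X (one row per instance, one column per feature) and target y.
X = df.select(
    ["sepal_length", "sepal_width", "petal_length", "petal_width"]
).to_numpy()
y = df["class_id"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pred = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)

# Misclassification rate: the fraction of wrong predictions (1 - accuracy).
print("misclassification rate:", np.mean(pred != y_test))

# Average loss under the asymmetric loss matrix above: rows index the
# true class, columns the predicted class; errors on virginica cost 10.
L = np.array([[0, 1, 1],
              [1, 0, 1],
              [10, 10, 0]])
print("average asymmetric loss:", L[y_test, pred].mean())
```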
Regression
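Whereas classification predicts a discrete class, regression predicts a continuous quantity. A minimal sketch on the same iris data (the feature choice and the use of scikit-learn's linear regression are illustrative assumptions): predict petal width from petal length.

```python
from sklearn.linear_model import LinearRegression

# Regression: predict a continuous target from a feature.
X = df.select("petal_length").to_numpy()  # shape (150, 1)
y = df["petal_width"].to_numpy()

reg = LinearRegression().fit(X, y)
print(f"petal_width ~ {reg.coef_[0]:.2f} * petal_length + {reg.intercept_:.2f}")
print("R^2 on the training data:", reg.score(X, y))
```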
Mastering Data
- Information for humans and information for programs
- Dutch National Institute for Public Health and the Environment
```python
# Download the latest RIVM data on the reproduction number R_t and plot it.
from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import dates as mpl_dates
from requests import get

plt.style.use("seaborn-v0_8-whitegrid")
date_format = mpl_dates.DateFormatter("%d-%b")
plt.gca().xaxis.set_major_formatter(date_format)

w = get("https://data.rivm.nl/covid-19/COVID-19_reproductiegetal.json")
days = w.json()

def plot_key(key):
    # One value per day; missing entries become NaN so the line shows gaps.
    key_by_day = [
        float(day[key]) if key in day and day[key] else np.nan
        for day in days
    ]
    day_count = len(key_by_day)
    start_date = datetime.today() - timedelta(day_count)
    dates = pd.date_range(start_date, periods=day_count)
    plt.plot(dates, key_by_day, "-")

# Reference line at R = 1: above it the epidemic grows, below it shrinks.
day_count = len(days)
start_date = datetime.today() - timedelta(day_count)
all_dates = pd.date_range(start_date, periods=day_count)
plt.plot(all_dates, [1.0] * day_count, "r-")

plot_key("Rt_up")
plot_key("Rt_avg")
plot_key("Rt_low")

plt.title("Recent reproduction number of COVID-19 in the Netherlands (RIVM)")
plt.ylabel("$R_t$ (low, avg, high)")
plt.savefig("images/r0.png")
# plt.show()
```
- Getting tables from HTML pages. Suppose we are interested in Gini coefficients for countries.
```python
from io import StringIO

import pandas as pd
import polars as pl
from requests import get

url = "https://en.wikipedia.org/wiki/List_of_countries_by_income_equality"
# Send a browser-like User-Agent; some sites refuse the default one.
header = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"
    ),
    "X-Requested-With": "XMLHttpRequest",
}
r = get(url, headers=header)

# pandas.read_html parses every <table> on the page; take the first one.
pdf = pd.read_html(StringIO(r.text))[0]
gini_table = pl.from_pandas(pdf)
gini_table
```
| ('Country/Territory', 'Country/Territory') | ('UN Region', 'UN Region') | ('World Bank Income group (2024)', 'World Bank Income group (2024)') | ('Gini coefficient[a]', 'WB[2]') | ('Gini coefficient[a]', 'Year') | ('Gini coefficient[a]', 'UNU- WIDER[3]') | ('Gini coefficient[a]', 'Year.1') | ('Gini coefficient[a]', 'OECD[4][5][6]') | ('Gini coefficient[a]', 'Year.2') | ('Unnamed: 9_level_0', 'Unnamed: 9_level_1') |
|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | str | str | str | str |
| "Afghanistan" | "Southern Asia" | "Low income" | null | null | "31.00" | "2017" | null | null | null |
| "Angola" | "Middle Africa" | "Lower middle income" | "51.3" | "2018" | "51.27" | "2019" | null | null | null |
| "Albania" | "Southern Europe" | "Upper middle income" | "29.4" | "2020" | "29.42" | "2020" | null | null | null |
| "Andorra" | "Southern Europe" | "High income" | null | null | "27.96" | "2016" | null | null | null |
| "United Arab Emirates" | "Western Asia" | "High income" | "26.4" | "2018" | "25.97" | "2019" | null | null | null |
| … | … | … | … | … | … | … | … | … | … |
| "Yemen" | "Western Asia" | "Low income" | "36.7" | "2014" | "36.71" | "2014" | null | null | null |
| "South Africa" | "Southern Africa" | "Upper middle income" | "63.0" | "2014" | "66.99" | "2017" | null | null | null |
| "Zambia" | "Eastern Africa" | "Lower middle income" | "51.5" | "2022" | "44.00" | "2022" | null | null | null |
| "Zimbabwe" | "Eastern Africa" | "Lower middle income" | "50.3" | "2019" | "50.26" | "2019" | null | null | null |
| "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… |
Some History
- 1805
- Legendre developed Least Squares Linear Regression to predict the movement of the planets
- 1936
- Linear Discriminant Analysis was developed to predict qualitative outcomes.
- 1940s
- Logistic Regression
- 1952
- Arthur Samuel develops one of the first checkers-playing programs, using the minimax algorithm, and later coins the term "Machine Learning" (1959)
- 1957
- The Perceptron
- 1967
- The Nearest Neighbor Algorithm
- 1970s
- Generalized Linear Models
- 1980s
- Computing technology had evolved to the point where non-linear methods, such as classification and regression trees, were no longer computationally prohibitive
- 1990s
- Support Vector Machines
- 1997
- Hochreiter and Schmidhuber develop long short-term memory (LSTM) networks, later widely used in speech recognition.
- 2006
- The Face Recognition Grand Challenge shows significant progress
- 2013
- DeepMind's deep reinforcement learning agents learn to play Atari games
Machine Learning today
- Self-driving cars
- Voice technologies: Siri, Alexa, etc
- Large Language Models: ChatGPT, LLaMA and many others
- “As soon as it works, no one calls it AI anymore.” – John McCarthy

- Google's latest smartphone model
- Ethical questions, data privacy and ownership, reward hacking and the alignment problem.
Course overview and housekeeping
Course home page
We do not put the course material on Canvas. Rather, we build a website in the course of the semester. Each class will have its own page which will contain links to all of the materials discussed.
Classes
- Monday: Lecture in 2.05, 15:45–17:30
- Thursday: Lab in 2.05, 11:00–12:45
- During lectures you will be presented with material on various topics in Machine learning. This will be mostly with closed laptops. You may find it useful to have pen and paper to take occasional notes.
- During the labs you will be working on material you have learned in the previous lecture. Make sure you have your laptop with the necessary software tools.
- Attendance will be taken at the beginning of each class. Please come on time. Late = absent.
Teacher
- Breanndán Ó Nualláin
<o@uva.nl>
Book
An Introduction to Statistical Learning by James, Witten, Hastie & Tibshirani. The full text is freely available from the book's website.
Assignments
- There will be one graded assignment. It will be an assignment on Data Ethics which you will carry out in groups of ±4. You will write a report and give a presentation on a topic of your choice. The deadline for submission of the report and presentation is the date of the presentations.
- Each Thursday you will be given an assignment consisting of a set of applied tasks. The goal is for you to familiarise yourself with the material covered in the Monday class by putting it into practice and submitting it during the Thursday lab class.
- Assignments will not be graded but students will receive feedback on their work.
- As each exam will be similar in form to one of these assignments, carrying out the assignments is essential to acquire the necessary knowledge and skills for the exams.
- The first assignments will be in pairs. Later assignments will be individual.
Grading
See the section on Assessment in the Course manual.
Programming environment
In this course we will emphasise good programming practice by using professional software tools and methods. In particular, note that although the book recommends using Jupyter, we will not be using it in this course. For programming you may use Spyder or one of a range of other IDEs.
My own preferred development environment is Emacs, a highly flexible and powerful tool which has an intimidating learning curve. During the course, I will show you how to use Emacs but you are not required to learn it.
Why not Jupyter?
Software tools
Since we will be working with large data sets we will learn to use professional tools for programming and managing the software that we develop. Among these are:
- Python
- Some Python libraries:
- Numpy adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- Pandas offers data structures and operations for manipulating numerical tables and time series.
- Polars offers coherent, fast dataframes with lazy, column-oriented, out-of-core (larger-than-memory) processing.
- Scikit-learn features various classification, regression and clustering algorithms as well as tools for building powerful data pipelines.
- Matplotlib is a comprehensive library for creating static, animated, and interactive data visualizations in Python.
- Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- Anaconda is a distribution of the Python programming language for scientific computing, containing the above libraries and more. It aims to simplify package management and deployment. Anaconda has questionable licensing terms that you might not like to agree to.
- Spyder is an Integrated Development Environment (IDE) for Python included in Anaconda.
- Emacs is an extensible, customizable, free/libre text editor. It includes a full development environment for Python and interaction with Git and many other features.
- Git is a free and open-source distributed version control system.
- Forgejo is a web-based collaborative software platform for both developing and sharing computer applications. AUC runs its own instance of Forgejo.
Assignments
- In the labs you will work in pairs for the first few weeks. Each week the pairs will be assigned randomly.
- You will receive an assignment to be worked on in class on Thursdays. Together with your partner, you can also work on the assignment before and after class.
- This coming Thursday we will set up our computational environment and software tools. This is very important because you will not be able to program or submit anything until you have them set up correctly.
- In the meantime you might like to read the introductory chapter of An Introduction to Statistical Learning.