Course Introduction

Today

  • Getting to know each other
  • Housekeeping: book, assignments, exams, grading, expectations
  • What is Machine Learning?
  • Some history
  • ML around us today and its effects on our lives

Machine Learning

Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.

– Arthur Samuel, 1959

Machine Learning is the science (and art) of programming computers so they can learn from data.

– Aurélien Géron

A computer program is said to learn from experience E with respect to some class of tasks T, and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

– Tom Mitchell

Probabilistic reasoning

We will treat all unknown quantities as random variables, that are endowed with probability distributions which describe a weighted set of possible values the variable may have.
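
As a small illustration of this idea, an unknown discrete quantity can be written down as a mapping from each possible value to its probability. The probabilities below are invented for the example and do not come from any data set.

```python
# An unknown quantity (the species of a flower we have not yet identified)
# modelled as a discrete random variable: each possible value gets a weight.
species_dist = {"setosa": 0.7, "versicolor": 0.2, "virginica": 0.1}

# The weights of a probability distribution are non-negative and sum to one.
assert all(p >= 0 for p in species_dist.values())
assert abs(sum(species_dist.values()) - 1.0) < 1e-9

# The single most probable value (the mode of the distribution).
most_probable = max(species_dist, key=species_dist.get)
print(most_probable)
```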

Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking fundamental. It is, of course, not the only view. But it is through this view that we can connect what we do in machine learning to every other computational science, whether that be in stochastic optimisation, control theory, operations research, econometrics, information theory, statistical physics or bio-statistics. For this reason alone, mastery of probabilistic thinking is essential.

– Shakir Mohamed, DeepMind

An example: Spam filtering

  • spam, ham
  • training set of training instances
  • performance measure: accuracy, the fraction of emails that are classified correctly.
  • How would you write a program to detect spam?
    • "Typical" spam
    • Patterns
    • Test and tune
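
The traditional approach in the list above can be sketched as a hand-written rule plus an accuracy check. The phrases and the four example emails are made up for illustration; they are not course data, and `is_spam` and `accuracy` are hypothetical helper names.

```python
# Hand-written patterns that we believe are typical of spam (illustrative only).
SPAM_PHRASES = ("free money", "act now", "winner")


def is_spam(email: str) -> bool:
    """Flag an email as spam if it contains any hand-picked phrase."""
    text = email.lower()
    return any(phrase in text for phrase in SPAM_PHRASES)


def accuracy(predictions, labels):
    """Fraction of instances whose prediction matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# A tiny labelled set: (text, True if spam, False if ham).
emails = [
    ("You are a WINNER! Claim your free money", True),
    ("Act now to renew your subscription", True),
    ("Lecture notes for Monday are online", False),
    ("Cheap watches, limited offer", True),  # no rule matches this one
]

predictions = [is_spam(text) for text, _ in emails]
labels = [label for _, label in emails]
print(accuracy(predictions, labels))
```

The last email slips through because nobody wrote a rule for it. Fixing that means adding yet another pattern by hand, which is the "test and tune" loop.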

traditional-approach.png

Figure 1: A traditional approach

ml-approach.png

Figure 2: A Machine Learning approach

adaptation.png

Figure 3: Adapting to change
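
By contrast, a Machine Learning approach derives the patterns from labelled examples, and adapting to change amounts to retraining on newer data. The sketch below is a deliberately naive word-count learner, not a method taken from the course material; `train`, `predict` and the example emails are hypothetical.

```python
from collections import Counter


def train(emails):
    """Learn the set of words seen in more spam emails than ham emails."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in emails:
        # Count each word at most once per email.
        words = set(text.lower().split())
        (spam_counts if is_spam else ham_counts).update(words)
    return {w for w, n in spam_counts.items() if n > ham_counts[w]}


def predict(spam_words, text):
    """Flag an email if it contains any learned spam word."""
    return any(w in spam_words for w in text.lower().split())


training_set = [
    ("win free money today", True),
    ("free prize inside", True),
    ("lecture notes for today", False),
    ("lab starts at eleven", False),
]
model = train(training_set)

# Spammers change their wording: retrain on the old plus the new examples.
new_examples = [("cheap watches for you", True), ("notes for you", False)]
updated_model = train(training_set + new_examples)

print(predict(model, "cheap watches on sale"))
print(predict(updated_model, "cheap watches on sale"))
```

No rule was edited by hand between the two models; only the training data changed.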

An Interactive Introduction to Machine Learning

Identifying flowers

iris-species.png

import polars as pl

iris_file = "https://ml.auc-computing.nl/data/iris.csv"

df = pl.read_csv(iris_file)
df

shape: (150, 5)
150   4     setosa  versicolor  virginica
f64   f64   f64     f64         i64
5.1   3.5   1.4     0.2         0
4.9   3.0   1.4     0.2         0
4.7   3.2   1.3     0.2         0
4.6   3.1   1.5     0.2         0
5.0   3.6   1.4     0.2         0
…     …     …       …           …
6.7   3.0   5.2     2.3         2
6.3   2.5   5.0     1.9         2
6.5   3.0   5.2     2.0         2
6.2   3.4   5.4     2.3         2
5.9   3.0   5.1     1.8         2

df = pl.read_csv(
    iris_file,
    has_header=True,
    new_columns=[
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
        "class_id",
    ],
)

df_head = pl.read_csv(
    iris_file,
    has_header=True,
    n_rows=0
)
species = df_head.columns[2:]
mapping = pl.DataFrame(
    {"class_id": list(range(len(species))), "species": species}
)
df = df.join(mapping, on="class_id")
df

shape: (150, 6)
sepal_length  sepal_width  petal_length  petal_width  class_id  species
f64           f64          f64           f64          i64       str
5.1           3.5          1.4           0.2          0         "setosa"
4.9           3.0          1.4           0.2          0         "setosa"
4.7           3.2          1.3           0.2          0         "setosa"
4.6           3.1          1.5           0.2          0         "setosa"
5.0           3.6          1.4           0.2          0         "setosa"
…             …            …             …            …         …
6.7           3.0          5.2           2.3          2         "virginica"
6.3           2.5          5.0           1.9          2         "virginica"
6.5           3.0          5.2           2.0          2         "virginica"
6.2           3.4          5.4           2.3          2         "virginica"
5.9           3.0          5.1           1.8          2         "virginica"

df.group_by("species").len()

shape: (3, 2)
species       len
str           u32
"setosa"      50
"virginica"   50
"versicolor"  50

import altair as alt

adf = df.drop("class_id")
features = adf.drop('species').columns

alt.Chart(adf).mark_circle().encode(
    alt.X(alt.repeat("column"), type="quantitative"),
    alt.Y(alt.repeat("row"), type="quantitative"),
    color="species:N",
).properties(width=150, height=150).repeat(
    row=features, column=features,
).save('images/iris_altair_pairplot.png')

iris_altair_pairplot.png

import seaborn as sns

pdf = df.drop('class_id').to_pandas()

sns.pairplot(pdf, hue="species").savefig("images/iris_pairplot.png")

iris_pairplot.png

  • data frame
  • design matrix
  • tabular data
  • features and target
  • supervised learning
  • classification
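
To make these terms concrete, the sketch below splits the ten rows displayed earlier into a design matrix `X` (one row per flower, one column per feature) and a target `y`, then fits a one-feature threshold rule. This is a hand-rolled stand-in for illustration, not the decision tree shown in the figures. `fit_threshold` and `predict` are hypothetical helpers, and the ten-row sample happens to contain only two of the three species.

```python
# The ten rows printed earlier: four features followed by the species label.
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (4.6, 3.1, 1.5, 0.2, "setosa"),
    (5.0, 3.6, 1.4, 0.2, "setosa"),
    (6.7, 3.0, 5.2, 2.3, "virginica"),
    (6.3, 2.5, 5.0, 1.9, "virginica"),
    (6.5, 3.0, 5.2, 2.0, "virginica"),
    (6.2, 3.4, 5.4, 2.3, "virginica"),
    (5.9, 3.0, 5.1, 1.8, "virginica"),
]

X = [row[:4] for row in rows]  # design matrix: features only
y = [row[4] for row in rows]   # target: the class to predict

PETAL_LENGTH = 2  # column index of petal_length in X


def fit_threshold(X, y, column, low_class, high_class):
    """Pick the midpoint between adjacent feature values with the fewest errors."""
    values = sorted({x[column] for x in X})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        preds = [low_class if x[column] < t else high_class for x in X]
        return sum(p != label for p, label in zip(preds, y))

    return min(candidates, key=errors)


def predict(x, threshold, column, low_class, high_class):
    """Classify one flower by comparing a single feature with the threshold."""
    return low_class if x[column] < threshold else high_class


# Supervised learning: the labels in y guide the choice of threshold.
threshold = fit_threshold(X, y, PETAL_LENGTH, "setosa", "virginica")

# Misclassification rate on the training rows.
mistakes = sum(
    predict(x, threshold, PETAL_LENGTH, "setosa", "virginica") != label
    for x, label in zip(X, y)
)
misclassification_rate = mistakes / len(y)
print(threshold, misclassification_rate)
```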

iris_dtree_12_0.png

iris_dtree_10_0.svg

iris_dtree_19_0.svg

iris_dtree_20_0.png

  • Misclassification rate
  • Asymmetric loss matrix
              Setosa  Versicolor  Virginica
  Setosa      0       1           1
  Versicolor  1       0           1
  Virginica   10      10          0
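
One way such a loss matrix could be used is to pick, for given class probabilities, the prediction with the lowest expected loss. The sketch below assumes that rows are the true species and columns the predicted species (the table itself does not say which), and the probabilities are invented.

```python
CLASSES = ["setosa", "versicolor", "virginica"]

# LOSS[true][predicted], copied from the table above.
# Assumption: rows are the true class, columns the predicted class.
LOSS = [
    [0, 1, 1],
    [1, 0, 1],
    [10, 10, 0],
]


def expected_losses(probs):
    """Expected loss of each possible prediction under the class probabilities."""
    return [
        sum(probs[t] * LOSS[t][p] for t in range(len(CLASSES)))
        for p in range(len(CLASSES))
    ]


def best_prediction(probs):
    """The prediction that minimises the expected loss."""
    losses = expected_losses(probs)
    return CLASSES[losses.index(min(losses))]


# Hypothetical model output: versicolor is the most probable class.
probs = [0.1, 0.6, 0.3]
print(expected_losses(probs))
print(best_prediction(probs))
```

Although versicolor is the most probable class here, the high cost of missing a virginica makes virginica the lowest-loss prediction. With a symmetric 0/1 loss the same rule reduces to choosing the most probable class, which minimises the misclassification rate.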

Regression

linreg_poly_vs_degree_9_2.png

linreg_poly_vs_degree_9_5.png

linreg_poly_vs_degree_9_8.png

linreg_poly_vs_degree_9_11.png

linreg_poly_vs_degree_9_14.png

linreg_poly_vs_degree_7_2.png

Mastering Data

# Download the latest RIVM estimates of the reproduction number R_t and plot them.

from requests import get
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import dates as mpl_dates
from datetime import datetime, timedelta

plt.style.use("seaborn-v0_8-whitegrid")
# plt.style.use("seaborn-v0_8-darkgrid")

date_format = mpl_dates.DateFormatter("%d-%b")
plt.gca().xaxis.set_major_formatter(date_format)

w = get("https://data.rivm.nl/covid-19/COVID-19_reproductiegetal.json")
days = w.json()


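def plot_key(key):
    # Missing or empty values become NaN so that they leave gaps in the line.
    key_by_day = [
        float(day[key]) if key in day and day[key] else np.nan for day in days
    ]
    day_count = len(key_by_day)
    # Assumes one record per consecutive day, with the series ending about now.
    start_date = datetime.today() - timedelta(day_count)
    dates = pd.date_range(start_date, periods=day_count)

    # plt.plot accepts dates directly; plot_date is deprecated in recent Matplotlib.
    plt.plot(dates, key_by_day, "-")


day_count = len(days)
start_date = datetime.today() - timedelta(day_count)
all_dates = pd.date_range(start_date, periods=day_count)
# Red reference line at 1.0, the boundary between growth and decline.
plt.plot(all_dates, [1.0] * day_count, "r-")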

plot_key("Rt_up")
plot_key("Rt_avg")
plot_key("Rt_low")

plt.title("Recent Reproduction number COVID-19 in the Netherlands (RIVM)")
plt.ylabel("$R_t$ (low, avg, up)")

plt.savefig("images/r0.png")
# plt.show()

r0.png

  • Getting tables from HTML pages. Suppose we are interested in Gini coefficients for countries.
from requests import get
import pandas as pd
import polars as pl
from io import StringIO

url = "https://en.wikipedia.org/wiki/List_of_countries_by_income_equality"
header = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"
    ),
    "X-Requested-With": "XMLHttpRequest",
}

r = get(url, headers=header)

pdf = pd.read_html(StringIO(r.text))[0]
gini_table = pl.from_pandas(pdf)
gini_table

shape: (196, 10)
('Country/Territory', 'Country/Territory') | ('UN Region', 'UN Region') | ('World Bank Income group (2024)', 'World Bank Income group (2024)') | ('Gini coefficient[a]', 'WB[2]') | ('Gini coefficient[a]', 'Year') | ('Gini coefficient[a]', 'UNU- WIDER[3]') | ('Gini coefficient[a]', 'Year.1') | ('Gini coefficient[a]', 'OECD[4][5][6]') | ('Gini coefficient[a]', 'Year.2') | ('Unnamed: 9_level_0', 'Unnamed: 9_level_1')
str | str | str | str | str | str | str | str | str | str
"Afghanistan" | "Southern Asia" | "Low income" | null | null | "31.00" | "2017" | null | null | null
"Angola" | "Middle Africa" | "Lower middle income" | "51.3" | "2018" | "51.27" | "2019" | null | null | null
"Albania" | "Southern Europe" | "Upper middle income" | "29.4" | "2020" | "29.42" | "2020" | null | null | null
"Andorra" | "Southern Europe" | "High income" | null | null | "27.96" | "2016" | null | null | null
"United Arab Emirates" | "Western Asia" | "High income" | "26.4" | "2018" | "25.97" | "2019" | null | null | null
… | … | … | … | … | … | … | … | … | …
"Yemen" | "Western Asia" | "Low income" | "36.7" | "2014" | "36.71" | "2014" | null | null | null
"South Africa" | "Southern Africa" | "Upper middle income" | "63.0" | "2014" | "66.99" | "2017" | null | null | null
"Zambia" | "Eastern Africa" | "Lower middle income" | "51.5" | "2022" | "44.00" | "2022" | null | null | null
"Zimbabwe" | "Eastern Africa" | "Lower middle income" | "50.3" | "2019" | "50.26" | "2019" | null | null | null
"^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin… | "^ The Gini coefficient, or Gin…

Some History

1805
Legendre developed Least Squares Linear Regression to predict the movement of the planets
1936
Linear Discriminant Analysis was developed to predict qualitative results.
1940s
Logistic Regression
1952
Arthur Samuel develops the first computer program to play Checkers using the Minimax algorithm and coins the phrase "Machine Learning"
1957
The Perceptron
1967
The Nearest Neighbor Algorithm
1970s
Generalized Linear Models
1980s
Computing technology had evolved to the point where non-linear methods could be developed.
1990s
Support Vector Machines
1997
Schmidhuber and Hochreiter develop long short-term memory for speech recognition.
2006
The Facial Recognition Grand Challenge shows significant progress
2013
Atari games

Machine Learning today

Course overview and housekeeping

Course home page

We do not put the course material on Canvas. Rather, we build a website in the course of the semester. Each class will have its own page which will contain links to all of the materials discussed.

Classes

  • Monday: Lecture in 2.05, 15:45–17:30
  • Thursday: Lab in 2.05, 11:00–12:45
  • During lectures you will be presented with material on various topics in Machine Learning. Laptops will mostly stay closed, so you may find it useful to bring pen and paper for occasional notes.
  • During the labs you will be working on material you have learned in the previous lecture. Make sure you have your laptop with the necessary software tools.
  • Attendance will be taken at the beginning of each class. Please come on time. Late = absent.

Teacher

  • Breanndán Ó Nualláin <o@uva.nl>

Book

An Introduction to Statistical Learning by James, Witten, Hastie & Tibshirani. The full text is freely available online.

Assignments

  • There will be one graded assignment, on Data Ethics, which you will carry out in groups of about four. You will write a report and give a presentation on a topic of your choice. The deadline for submitting both the report and the presentation is the date of the presentations.
  • Each Thursday you will be given an assignment consisting of a set of applied tasks. The goal is for you to familiarise yourself with the material covered in the Monday class by putting it into practice and submitting it during the Thursday lab class.
  • These weekly assignments will not be graded, but students will receive feedback on their work.
  • As each exam will be similar in form to one of these assignments, carrying out the assignments is essential to acquire the necessary knowledge and skills to take the exams.
  • The first assignments will be in pairs. Later assignments will be individual.

Grading

See the section on Assessment in the Course manual.

Programming environment

In this course we will emphasise good programming practice by using professional software tools and methods. Note in particular that, although the book recommends Jupyter, we will not be using it in this course. For programming you may use Spyder or one of a range of other IDEs.

My own preferred development environment is Emacs, a highly flexible and powerful tool which has an intimidating learning curve. During the course, I will show you how to use Emacs but you are not required to learn it.

Software tools

Since we will be working with large data sets we will learn to use professional tools for programming and managing the software that we develop. Among these are:

  • Python
  • Some Python libraries:
    • Numpy adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
    • Pandas offers data structures and operations for manipulating numerical tables and time series.
    • Polars provides coherent, fast dataframes (lazy, column-oriented, able to work with larger-than-memory data).
    • Scikit-learn features various classification, regression and clustering algorithms as well as tools for building powerful data pipelines.
    • Matplotlib is a comprehensive library for creating static, animated, and interactive data visualizations in Python.
    • Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
  • Anaconda is a distribution of the Python programming language for scientific computing, containing the above libraries and more. It aims to simplify package management and deployment. Anaconda has questionable licensing terms that you might not like to agree to.
  • Spyder is an Integrated Development Environment (IDE) for Python, included in Anaconda.
  • Emacs is an extensible, customizable, free/libre text editor. It includes a full development environment for Python and interaction with Git and many other features.
  • Git is a free and open source distributed version control system.
  • Forgejo is a web-based collaborative software platform for both developing and sharing computer applications. AUC runs its own instance of Forgejo.

Assignments

  • In the labs you will work in pairs for the first few weeks. Each week the pairs will be assigned randomly.
  • You will receive an assignment to be worked on in class on Thursdays. Together with your partner, you can also work on the assignment before and after class.
  • This coming Thursday we will set up our computational environment and software tools. This is very important because you will not be able to program or submit anything until you have them set up correctly.
  • In the meantime you might like to read the introductory chapter of An Introduction to Statistical Learning.

Author: Breanndán Ó Nualláin <o@uva.nl>

Date: 2025-09-01 Mon 13:35