Classification
The regression problems we studied so far have a response variable that is quantitative, i.e. a number. Now we switch to looking at problems where the response variable is qualitative. We call such variables categorical.
Here are some categorical variables we might like to predict:
- eye colour
  - from genetic data
- fraudulent credit-card payments
  - from transaction history, time and place of payment, payment amount and goods or services bought
- medical conditions
  - from symptoms, patient medical history and demographics
- deleterious DNA mutations
  - from sequence data and epidemiological data on patients
Whereas predicting quantitative variables is known as regression, predicting qualitative variables is known as classification. We aim to predict which of a number of classes or categories a given observation belongs to.
Doing so might involve predicting the probabilities that the observation belongs to each category and then choosing the category with the highest probability. In this sense, classification reduces to a number of regressions.
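For example, a minimal sketch of that final "pick the most probable class" step might look like this (the class names and probabilities are made up):

import numpy as np

# Hypothetical predicted probabilities for one observation over three classes.
classes = np.array(["stroke", "drug overdose", "epileptic seizure"])
probabilities = np.array([0.2, 0.7, 0.1])

# Classification then reduces to choosing the class with the highest probability.
print(classes[np.argmax(probabilities)])  # drug overdose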
Let's look at the Default data set which aims to predict whether a credit
card holder will default on their debt based on their monthly income and
current credit card balance.
In this plot those who defaulted are shown as orange pluses and those who did not as blue circles. The overall default rate is only around 3%, so only a subsample of the observations is shown. It appears that those who defaulted tend to have higher credit card balances.
These boxplots show income and balance split by default status. It appears that default has a strong relationship with balance but a much weaker one with income.
Why not Linear Regression?
Consider a case where we would like to predict the medical condition of a
patient in an emergency room. There are three possibilities: stroke, drug
overdose and epileptic seizure. Couldn't we just encode these possibilities
as numbers and do linear regression?
| stroke | 1 |
| drug overdose | 2 |
| epileptic seizure | 3 |
With the usual interpretation of these numbers, this encoding suggests that
drug overdose is in some sense between stroke and epileptic seizure, and that
the "distance" between stroke and drug overdose is the same as that between
drug overdose and epileptic seizure.
Further, if we had encoded them as follows:
| epileptic seizure | 1 |
| stroke | 2 |
| drug overdose | 3 |
then we might get a different set of predictions.
We might use linear regression if there were some implied order between the
categories, for example mild, moderate and severe, but even then we would be
implying that the distance between mild and moderate is the same as that
between moderate and severe.
For binary classification we could use the dummy variable approach to assign the values 0 and 1 to the two classes. For example:
\begin{equation} Y = \begin{cases} 0 & \text{if stroke} \\ 1 & \text{if drug overdose} \end{cases} \end{equation}
Now if our linear regression gives \(\hat{Y} > 0.5\) we predict drug overdose, and otherwise stroke. We can show that \(\hat{Y}\) is an estimate of \(P(\mathtt{drug\ overdose} \mid X)\).
But what if we get values for \(\hat{Y}\) outside the \([0,1]\) interval? We can't interpret that as a probability!
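To see the problem concretely, here is a minimal sketch with made-up data (a 0/1 response that switches from 0 to 1 as \(x\) increases): the least-squares line happily strays outside \([0,1]\).

import numpy as np

# Made-up data: a binary response coded 0/1 that switches from 0 to 1 at x = 5.
x = np.linspace(0, 10, 100)
y = (x >= 5).astype(float)

# Fit a straight line by least squares.
beta_1, beta_0 = np.polyfit(x, y, deg=1)
y_hat = beta_0 + beta_1 * x

# The fitted values escape the [0, 1] interval at both ends.
print(y_hat.min(), y_hat.max())  # roughly -0.25 and 1.25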
The blue line in the plot on the left is a linear regression for Default
against Balance. Note how it takes negative values and always remains under
\(0.5\), suggesting that nobody will ever default. We now look into a technique
which can transform this line into the blue line in the plot on the right,
giving us interpretable probabilities.
Logistic regression
This is the logistic function: \[ f(x) = \frac{1}{1 + e^{-x}} \] It has many applications in biology (especially ecology), biomathematics, chemistry, demography, economics, geoscience, mathematical psychology, probability, sociology, political science, linguistics, statistics, and even neural networks.
import numpy as np
import matplotlib.pyplot as plt

def logistic(x):
    return 1 / (1 + np.exp(-x))

fig = plt.figure()
x = np.linspace(-5, 5)
plt.plot(x, logistic(x))
plt.savefig("images/python-matplot-fig.png")
plt.show()
Since this is a binary classification problem, to model the response directly
we should predict whether Default is Yes or No. But we will take a
different approach. We will model the probability that Default is Yes. We
will write the probability that Default is Yes given a particular balance
as \[ Pr(\mathtt{Default} = \mathtt{Yes} | \mathtt{balance}) \] We will
abbreviate this as \(p(\mathtt{balance})\).
We can choose our threshold. For example we can predict \(\mathtt{Default} = \mathtt{Yes}\) whenever \(p(\mathtt{balance}) > 0.5\), or whenever \(p(\mathtt{balance}) > 0.1\).
We could model this probability as a straightforward linear regression problem: \[ p_l(X) = \beta_0 + \beta_1 X \]
But this could lead to the issue in the plot on the left above where we could get values less than zero or even greater than one.
So, instead of modelling \(p_l(X)\) directly, we will model \(p(X) = f(p_l(X))\) where \(f\) is the logistic function.
\[ p(X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]
The result of this is shown in the right plot above.
With a little algebra we can write \[ \frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1 X} \] This quantity is called the odds. It can take any value in \([0,\infty)\).
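Spelling out that algebra, starting from the model for \(p(X)\) above: \[ 1 - p(X) = \frac{e^{-(\beta_0 + \beta_1 X)}}{1 + e^{-(\beta_0 + \beta_1 X)}} \quad\text{so}\quad \frac{p(X)}{1-p(X)} = \frac{1}{e^{-(\beta_0 + \beta_1 X)}} = e^{\beta_0 + \beta_1 X} \]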
If one in five people default, then the odds will be \[ \frac{1/5}{1 - 1/5} = \frac{1}{4} \]
If nine out of ten people default, then the odds will be \[ \frac{9/10}{1 - 9/10} = 9 \]
If we take logarithms of both sides above we get \[ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X \]
This quantity is called the log odds or logit. Since the odds can take any value in \([0,\infty)\), the logit can take any value in \((-\infty, \infty)\).
We see that this logistic regression model has a logit which is linear in \(X\). Increasing \(X\) by one unit increases the log odds by \(\beta_1\) and thus multiplies the odds by \(e^{\beta_1}\).
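As a small worked example with a made-up coefficient, if \(\beta_1 = 0.1\) then a one-unit increase in \(X\) gives \[ \frac{p(X+1)}{1-p(X+1)} = e^{\beta_0 + \beta_1 (X+1)} = e^{\beta_1}\, e^{\beta_0 + \beta_1 X} \approx 1.105 \times \frac{p(X)}{1-p(X)} \] since \(e^{0.1} \approx 1.105\).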
Previously we used the method of least squares for linear regression. For logistic regression we will use a more general method called maximum likelihood.
The idea is this. We would like to predict a number as close as possible to \(1\) for each credit card holder who defaulted and a number as close as possible to \(0\) for each credit card holder who did not default.
We write \(i:y_i=1\) for the set of indices of those who defaulted and \(i:y_i=0\) for those who did not. Now we write \(\prod_{i:y_i=1} p(x_i)\) for the product of the predicted probabilities of defaulting over those who defaulted, and \(\prod_{i:y_i=0} (1 - p(x_i))\) for the product of the predicted probabilities of not defaulting over those who did not.
Now we write our likelihood function as the product of these: \[ \ell(\beta_0, \beta_1) = \prod_{i:y_i=1} p(x_i) \prod_{i:y_i=0} (1 - p(x_i)) \]
Now we need to maximize this function to find \(\hat\beta_0\) and \(\hat\beta_1\).
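To make the idea concrete, here is a minimal sketch of maximum likelihood estimation on made-up data, maximising the likelihood numerically with scipy (below we will instead let statsmodels do this for us):

import numpy as np
from scipy.optimize import minimize

# Made-up data generated from a logistic model with known coefficients.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
true_beta0, true_beta1 = -1.0, 2.0
p_true = 1 / (1 + np.exp(-(true_beta0 + true_beta1 * x)))
y = rng.uniform(size=x.size) < p_true  # Bernoulli outcomes

def negative_log_likelihood(beta):
    beta0, beta1 = beta
    p = 1 / (1 + np.exp(-(beta0 + beta1 * x)))
    # log-likelihood: sum of log p(x_i) over i with y_i = 1,
    # plus sum of log (1 - p(x_i)) over i with y_i = 0; negated for minimization
    return -(np.log(p[y]).sum() + np.log(1 - p[~y]).sum())

fit = minimize(negative_log_likelihood, x0=[0.0, 0.0])
print(fit.x)  # should be close to the true values (-1, 2)

Maximizing the likelihood is the same as minimizing the negative log-likelihood, which is numerically much better behaved than the raw product of probabilities.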
Maximum likelihood estimation is an entire field in its own right. We will not be delving further into the theory in this course but let's look at how we can use it in practice.
Stock market data
Let's look at the Smarket data set which contains daily percentage returns
for the S&P 500 stock index between 2001 and 2005.
First some imports
import statsmodels.api as sm
import pandas as pd
from ISLP import load_data
from ISLP.models import ModelSpec as MS
Now load the data and examine it.
Smarket = load_data("Smarket")
print(Smarket.shape)
(1250, 9)
print(Smarket.columns)
Index(['Year', 'Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume', 'Today',
'Direction'],
dtype='object')
print(Smarket.describe())
Year Lag1 Lag2 Lag3 Lag4 \
count 1250.000000 1250.000000 1250.000000 1250.000000 1250.000000
mean 2003.016000 0.003834 0.003919 0.001716 0.001636
std 1.409018 1.136299 1.136280 1.138703 1.138774
min 2001.000000 -4.922000 -4.922000 -4.922000 -4.922000
25% 2002.000000 -0.639500 -0.639500 -0.640000 -0.640000
50% 2003.000000 0.039000 0.039000 0.038500 0.038500
75% 2004.000000 0.596750 0.596750 0.596750 0.596750
max 2005.000000 5.733000 5.733000 5.733000 5.733000
Lag5 Volume Today
count 1250.00000 1250.000000 1250.000000
mean 0.00561 1.478305 0.003138
std 1.14755 0.360357 1.136334
min -4.92200 0.356070 -4.922000
25% -0.64000 1.257400 -0.639500
50% 0.03850 1.422950 0.038500
75% 0.59700 1.641675 0.596750
max 5.73300 3.152470 5.733000
Smarket.corr(numeric_only=True)
Let's look at the market Volume by day.
print(Smarket.plot(y="Volume"))
We now fit a logistic regression model to predict the categorical variable
Direction from Volume and the lag columns. We would like to predict whether
the market will go Up or Down based on its behaviour over the previous five
days.
Since logistic regression is a special case of the more general class of
Generalized Linear Models, we use GLM with the Binomial family of distributions.
df_lagvol = Smarket.columns.drop(["Today", "Direction", "Year"])
design = MS(df_lagvol)
X = design.fit_transform(Smarket)
y = Smarket.Direction == "Up"
logreg = sm.GLM(y, X, family=sm.families.Binomial())
results = logreg.fit()
print(results.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Direction No. Observations: 1250
Model: GLM Df Residuals: 1243
Model Family: Binomial Df Model: 6
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -863.79
Date: Mon, 06 Oct 2025 Deviance: 1727.6
Time: 12:56:39 Pearson chi2: 1.25e+03
No. Iterations: 4 Pseudo R-squ. (CS): 0.002868
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
intercept -0.1260 0.241 -0.523 0.601 -0.598 0.346
Lag1 -0.0731 0.050 -1.457 0.145 -0.171 0.025
Lag2 -0.0423 0.050 -0.845 0.398 -0.140 0.056
Lag3 0.0111 0.050 0.222 0.824 -0.087 0.109
Lag4 0.0094 0.050 0.187 0.851 -0.089 0.107
Lag5 0.0103 0.050 0.208 0.835 -0.087 0.107
Volume 0.1354 0.158 0.855 0.392 -0.175 0.446
==============================================================================
In particular, we look at the \(p\)-values.
print(results.pvalues)
intercept    0.600700
Lag1         0.145232
Lag2         0.398352
Lag3         0.824334
Lag4         0.851445
Lag5         0.834998
Volume       0.392404
dtype: float64
None of them is particularly small. This is unsurprising; if we could predict market behaviour from previous days' performance we could make a lot of money!
Predicting credit-card defaults
Now let's return to the credit card data set and predict Default from balance.
df_cc = load_data('Default')
print(df_cc.shape)
(10000, 4)
print(df_cc.columns)
Index(['default', 'student', 'balance', 'income'], dtype='object')
print(df_cc)
default student balance income
0 No No 729.526495 44361.625074
1 No Yes 817.180407 12106.134700
2 No No 1073.549164 31767.138947
3 No No 529.250605 35704.493935
4 No No 785.655883 38463.495879
... ... ... ... ...
9995 No No 711.555020 52992.378914
9996 No No 757.962918 19660.721768
9997 No No 845.411989 58636.156984
9998 No No 1569.009053 36669.112365
9999 No Yes 200.922183 16862.952321
[10000 rows x 4 columns]
df_balance = df_cc[["balance"]]
design = MS(df_balance)
X = design.fit_transform(df_cc)
y = df_cc.default == "Yes"
logreg = sm.GLM(y, X, family=sm.families.Binomial())
results = logreg.fit()
print(results.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: default No. Observations: 10000
Model: GLM Df Residuals: 9998
Model Family: Binomial Df Model: 1
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -798.23
Date: Mon, 06 Oct 2025 Deviance: 1596.5
Time: 12:56:39 Pearson chi2: 7.15e+03
No. Iterations: 9 Pseudo R-squ. (CS): 0.1240
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
intercept -10.6513 0.361 -29.491 0.000 -11.359 -9.943
balance 0.0055 0.000 24.952 0.000 0.005 0.006
==============================================================================
Let's have a look at some of the predictions for the training set.
print(results.predict()[:8])
[0.00130568 0.00211259 0.00859474 0.00043444 0.00177696 0.00370415 0.00221143 0.00201617]
We convert these probabilities into the labels Yes or No.
labels = np.array(['No '] * df_cc.shape[0])  # 'No ' padded so the array dtype also fits 'Yes'
labels[results.predict() > 0.5] = 'Yes'
print(np.unique(labels, return_counts=True))
(array(['No ', 'Yes'], dtype='<U3'), array([9858, 142]))
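To see how these training-set predictions line up with the actual outcomes, one option is a simple cross-tabulation (a sketch reusing the labels computed above; note this is an optimistic, training-set-only view):

import pandas as pd

# Cross-tabulate predicted labels against the actual default column.
print(pd.crosstab(labels, df_cc.default, rownames=["predicted"], colnames=["actual"]))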
Now let's make predictions for the probabilities of default for credit-card holders with balances of $1000 and $2000.
df_new = pd.DataFrame({"balance": [1000, 2000]})
X_new = design.transform(df_new)
predictions = results.get_prediction(X_new)
print(predictions.predicted_mean)
We map these probabilities to labels.
label = np.vectorize(lambda x: 'Yes' if x > 0.5 else 'No')
print(label(predictions.predicted_mean))
We see that the model predicts that a credit-card holder with a balance of $1000 has a less than half a percent probability of defaulting, while one with a balance of $2000 has nearly a 60% probability of defaulting.
To use these probabilities for a classification task we could assign a
Default of Yes to any credit-card holder with a greater than 50%
probability and No otherwise.
However, note that although we get numbers between 0 and 1, these are not
necessarily "probabilities" in the true sense of the word. As such we can
choose a threshold other than \(0.5\), perhaps \(0.6\), if it gives us better
predictions. This may help in a case like this where we have highly imbalanced
data, with only about 3% of the data points having a Default of Yes.
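As a rough sketch of how one might compare thresholds (reusing results and df_cc from above, on the training data only; the threshold values are just examples):

import numpy as np

probabilities = results.predict()
actual = (df_cc.default == "Yes").to_numpy()

for threshold in [0.1, 0.5, 0.6]:
    predicted = probabilities > threshold
    accuracy = np.mean(predicted == actual)
    caught = np.mean(predicted[actual])  # fraction of true defaults flagged
    print(f"threshold {threshold}: accuracy {accuracy:.3f}, defaults caught {caught:.3f}")

A lower threshold catches more of the true defaults at the cost of more false alarms; which trade-off is better depends on the relative cost of each kind of error.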
etc
- Note that there is an errata page for the book
- Python f-string cheat sheets
- Feedback on the Linear Regression assignment
The week's teams
[('Adrian', 'Emil'), ('Jess', 'Janina'), ('Carlijn', 'Fridolin'), ('Asmahene', 'Jagoda'), ('Efraim', 'Lynn'), ('Romane', 'Natalia'), ('Henrike', 'Giorgia'), ('Matúš', 'Madalina'), ('Yaohong', 'Max'), ('Nora', 'Mariana'), ('Emilio', 'Louanne', 'Roman'), ]