Tree-based Methods

Exam 1 recap

Questions

  • How was your experience of the exam?
  • Were you surprised by anything in the exam description?
  • In what ways did you feel you were well prepared for the exam?
  • In what ways did you feel you were not well prepared for the exam?
  • What went well?
  • What went wrong?
  • What did you spend more time on than you expected to?
  • How much of your time did you spend on…
    • coding
    • debugging
    • writing text
    • looking up documentation
    • searching for answers to questions
    • interacting with LLMs
    • other things
  • How did you use LLMs?
  • Which LLM(s) did you use?
  • If you could go back in time three weeks, what would you have prepared better or practiced more?
  • Do you think this is a fair way of examining this course?
  • polars vs pandas

Feedback

  • polars vs pandas
  • It's not helpful to plot median_house_value and longitude on the same scale.
  • Note

    "\\mnt\\c\\Users\\98755\\Desktop\\Machine Learning"
    

    or

    r"\mnt\c\Users\98755\Desktop\Machine Learning"
    

    Also, with pathlib you can write paths with "/".
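
    For example, a small sketch reusing the same folder names:

    from pathlib import Path

    # pathlib accepts forward slashes, and "/" joins path components
    data_dir = Path("/mnt/c/Users/98755/Desktop") / "Machine Learning"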

  • 60 lines of imports, all used? ruff
  • 400 LoC, 4 commits
  • Is there a better way to get numeric columns?

    # Select numeric columns
    numeric_cols = [
        col
        for col, dtype in zip(df.columns, df.dtypes)
        if dtype
        in (
            pl.Int8,
            pl.Int16,
            pl.Int32,
            pl.Int64,
            pl.UInt8,
            pl.UInt16,
            pl.UInt32,
            pl.UInt64,
            pl.Float32,
            pl.Float64,
        )
        and col != target
    ]
    

    Try

    df.select(cs.numeric())
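
    This needs polars' selectors module; excluding the target can then be done with a selector set difference, e.g. (assuming the same df and target as in the code above):

    import polars.selectors as cs

    # all numeric columns
    df.select(cs.numeric())

    # numeric columns minus the target column
    df.select(cs.numeric() - cs.by_name(target))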
    

LLMs

  • Inaccessible links!
  • for (NL) translation
  • ask in your native language
  • prep in advance
  • review your work
  • priming

    remember all of this information in the document! go thorugh it carefully and
    remember the code, approaches and logic used. i will give you tasks to
    solve/code later but only ever adapt code and use the the approaches that
    were aready mentioned here! this is very important. if something is not
    dicusses how to solve, then let me know that you don´t know from the document
    how to solve it. remember tor prioritise polars over pandas
    
  • exam

    Awesome — here’s an exam-ready starter that sticks exactly to your cheat-sheet + doc
    
    • Overview. During the exam you depended so heavily on ChatGPT that you appear to have lost sight of what you were doing. It would have been better to ask questions step by step. That way you keep the overview and can help ChatGPT help you solve your problems.

  • Similarity mapping to previous exercises
  • Good use of LLMs: asking Claude and ChatGPT for feedback on code and report. I very much like this approach (and not only because the last part does my work for me!)

Data Ethics Project

Tree-based Methods

  • We will look at tree-based methods for both classification and regression.
  • Decision trees
  • Sometimes called Classification and Regression Trees (CART)
  • Two general techniques that are especially effective for combining decision trees
    • bagging
    • boosting

Decision trees for apartments in NYC and SF

Decision trees for regression

We begin by building a decision tree for regression.

We use the Hitters data set of data on baseball players to predict their Salary from the number of Years they have been playing and the number of Hits they made in the previous year. Salaries are expressed as thousands of dollars.

from os import getenv
from pathlib import Path

import polars as pl
from dotenv import load_dotenv

load_dotenv(".envrc")

ISLP_DATADIR = Path(
    getenv("ISLP_DATADIR")
    or (
        "/home/bon/projects/auc/courses/ml/website/"
        "islp/data/ALL CSV FILES - 2nd Edition"
    )
)

hitters_csv = ISLP_DATADIR / "Hitters.csv"

hitters = pl.read_csv(hitters_csv)
hitters.shape
(322, 20)

Let's set the Polars print settings so that it does not elide the columns.

pl.Config.set_tbl_cols(-1)
pl.Config.set_tbl_width_chars(-1)

Let's have a look at the salaries.

hitters["Salary"]
shape: (322,)
Series: 'Salary' [str]
[
	"NA"
	"475"
	"480"
	"500"
	"91.5"
	…
	"700"
	"875"
	"385"
	"960"
	"1000"
]

It contains some NA values so the whole column is read as a string.

Let's try to read the csv file again, indicating that null values are represented as NA in the file.

hitters = pl.read_csv(hitters_csv, null_values=["NA"])
hitters.dtypes
[Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, String, String, Int64, Int64, Int64, Float64, String]

Now Salary is Float64 with some nulls.

import matplotlib.pyplot as plt

plt.clf()

bins = 32
plt.hist(hitters["Salary"].drop_nulls(), bins=bins)
plt.xlabel("salary")
plt.ylabel("frequency")
plt.savefig("images/hitters-salaries.png")

hitters-salaries.png

We see they are quite skewed and we suspect that such data are lognormally distributed. If this is so, then the logarithms of the data points will be normally distributed.

plt.clf()
plt.xlabel("log(salary)")
plt.ylabel("frequency")
plt.hist(hitters["Salary"].drop_nulls().log(), bins=bins)
plt.savefig("images/hitters-log-salaries.png")

hitters-log-salaries.png

Now it does not look very normal, but it is certainly less skewed, so we will build a decision tree to predict LogSalary from Years and Hits. Here is an example of such a tree.

8_1.png

We note the following:

  • Like many trees in computing, the tree is "upside down". Its root is at the top and its leaves are at the bottom.
  • At the root is the decision \(Years < 4.5\). The training points for which this is true go to the left branch. The remaining points go to the right branch.
  • On the left is a leaf with the value \(5.11\), the mean of the log salaries of all of the training points that have arrived there.
  • On the right is another decision, \(Hits < 117.5\). Again the training points for which this is true go to the left branch while the remaining points go to the right branch.
  • The features Years and Hits are both of type integer. The splits are made half way between two integers.
  • For a new test data point, we start at the root and follow the decisions left and right until we reach a leaf. Then our prediction for LogSalary is the value at the leaf, e.g. \(5.11\), so our prediction for Salary is \(e^{5.11} \times 1000 \approx \$165,670\)
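
To check the conversion from log salary back to dollars (salaries are in thousands):

import numpy as np

# converting the leaf value (mean log salary, in thousands) back to dollars
np.exp(5.11) * 1000  # approximately 165,670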

8_2.png

This tree partitions the training data points into three regions.

  • \(R_1: Years < 4.5\)
  • \(R_2: Years \geq 4.5 \wedge Hits < 117.5\)
  • \(R_3: Years \geq 4.5 \wedge Hits \geq 117.5\)

The predictions for each of these regions are:

  • \(R_1: \$165,670\)
  • \(R_2: \$403,428\)
  • \(R_3: \$845,561\)

We might interpret this as saying that the most important factor for predicting Salary is the number of Years played. For players with few Years, their previous year's number of Hits is less relevant, but after 4 or 5 years it begins to play a role.

This might be an oversimplification but it has the advantage of being explainable.
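
For reference, here is a minimal sketch (not in the original notes) of fitting such a three-leaf tree with scikit-learn to the Hitters data we loaded above. The splits it finds should be close to, but need not exactly match, those in the figure.

from sklearn.tree import DecisionTreeRegressor, export_text

# keep only players with a known salary
complete = hitters.drop_nulls(subset=["Salary"])

X = complete.select(["Years", "Hits"]).to_numpy()
y = complete["Salary"].log().to_numpy()

# a tree with three leaves, like the one in the figure
tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Years", "Hits"]))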

Stratification of the feature space

Now we take a step back and ask ourselves, of all of the partitions of the feature space, which one gives the best predictions? We can divide the process into two steps.

  1. We start with the \(p\) features \(X_1, \dots, X_p\). We divide the predictor space, the set of possible values of these features, into \(J\) non-overlapping regions \(R_1, \dots, R_J\), which together cover the entire feature space. This is called a partition.
  2. For every observation that falls into region \(R_j\), we make the same prediction, namely the mean of the response values for the training points in \(R_j\).

Step 2 is straightforward but how about Step 1? Mathematically we seek to minimise the \(RSS\) given by: \[ RSS = \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 \]

But how can we choose from among all possible partitions, \(R_1, \dots, R_J\)? Finding the optimal partition is known to be NP-complete, which means that it is computationally intractable, so we must use a heuristic approach. Many such approaches have been tried. Here we take the straightforward approach of recursive binary splitting. This is a top-down, greedy approach.

top down
means that we start at the root of our tree, which contains all of our training data, and split by branching.
binary splitting
means that each split divides the data at the tree node into two.
greedy
means that we make the best choice at each step before proceeding down the tree, rather than trying to look ahead and optimize over several steps.

At each step we select one predictor \(X_i\) and one value of \(X_i\) to split on, which we call the cutpoint \(s\), and then split the data points at the node into those for which \(X_i < s\), which we place to the left, and those for which \(X_i \geq s\), which we place to the right. We examine all possible choices of \(i\) and \(s\) and select the ones which lead to the greatest reduction in RSS.

We write \(\{X \mid X_i < s\}\) for the region in the \(p\)-dimensional space of predictors where \(X_i < s\).

Mathematically, we define the two half-planes

\begin{eqnarray*} R_l(i, s) &= \{X \mid X_i < s\} \\ R_r(i, s) &= \{X \mid X_i \geq s\} \\ \end{eqnarray*}

and write \(\hat{y}_{R_l}\) and \(\hat{y}_{R_r}\) for the mean response in \(R_l(i, s)\) and \(R_r(i, s)\) respectively.

Then what we are looking for are the \(i\) and \(s\) which minimize \[ \sum_{k : \; x_k \in R_l(i, s)} (y_k - \hat{y}_{R_l})^ 2 + \sum_{k : \; x_k \in R_r(i, s)} (y_k - \hat{y}_{R_r})^ 2 \]

When \(p\) is not too large, we can find \(i\) and \(s\) quite quickly.
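
Here is a minimal numpy sketch of this exhaustive search (for illustration only; it is not the implementation we use later):

import numpy as np

def best_split(X, y):
    """Exhaustive search for the predictor i and cutpoint s which
    minimize the two-region RSS. Candidate cutpoints are the observed
    values of each predictor (one could also use midpoints)."""
    best_i, best_s, best_rss = None, None, np.inf
    for i in range(X.shape[1]):
        for s in np.unique(X[:, i]):
            left, right = y[X[:, i] < s], y[X[:, i] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + (
                (right - right.mean()) ** 2
            ).sum()
            if rss < best_rss:
                best_i, best_s, best_rss = i, s, rss
    return best_i, best_s, best_rss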

We iterate this process until we reach a stopping criterion. Examples of stopping criteria are:

  • The node contains fewer than 8 data points
  • All of the data points at the node have the same response (more usual for classification than regression). We say that a node is pure if all of its responses are the same. The impurity of a node is the extent to which its responses differ.

We will investigate some more sophisticated stopping criteria later.

8_3.png

Rubik's_cube.svg.png

Number Puzzle Wall Panel 1-15.jpg

Tree pruning

As we saw in the example of apartments in New York and San Francisco, if we continue too far with splitting we can overfit because our model becomes too complex.

We could prevent this by stopping once the reduction in RSS drops below a certain level. This could lead to a smaller tree which might generalise better.

However, since we are adopting a greedy approach, it might be that one step has a small reduction in RSS but the subsequent step has a larger reduction. To find such cases we first grow a large tree and then prune it.

But how should we prune it? Which subtree should we select? We should select the subtree that we expect to have the lowest test error, which we could estimate using cross-validation. But again, just as with the original optimization problem, considering every possible subtree is intractable.

Rather than doing cross validation on every subtree, let's just "punish" the subtrees that are more complex, i.e. the bigger ones. For a subtree \(T\) with terminal regions \(R_1, \dots, R_{\left| T \right|}\), rather than minimizing its RSS \[ \sum_{m=1}^{\left| T \right|} \sum_{k : \; x_k \in R_m} (y_k - \hat{y}_{R_m})^2 \] alone, let's add an extra term and minimize \[ \sum_{m=1}^{\left| T \right|} \sum_{k : \; x_k \in R_m} (y_k - \hat{y}_{R_m})^2 + \alpha \left| T \right| \]

Here \(\alpha\) is a number and \(\left| T \right|\) is the number of terminal nodes (leaves) of the subtree \(T\). In this way we punish the larger subtrees by giving them a higher cost, and \(\alpha\) determines the degree to which we punish them. For \(\alpha=0\) there is no punishment, while for a large \(\alpha\) larger subtrees are heavily punished.

(The idea of punishing more complex structures in general is known as regularization. We will see examples in other machine learning methods.)

To apply the pruning algorithm, write \(R(T)\) for the error (RSS) of a tree \(T\), and initialize \(\alpha_1 = 0\) and \(T^1\) to be the full unpruned tree, which minimizes \(R(T) + \alpha \left| T \right|\) when \(\alpha = 0\).

Now iterate. At step \(i\), writing \(T^i_t\) for the subtree of \(T^i\) rooted at node \(t\),

  • Find the node \(t_i \in T^i\) which minimizes \(g_i(t) = { R(t) - R(T^i_t) \over \left| T^i_t \right| - 1 }\)
  • Set \(\alpha_{i+1} = g_i(t_i)\) and \(T^{i+1} = T^i - T^i_{t_i}\), i.e. prune away the subtree at \(t_i\)

This yields a sequence of subtrees \[ T^1 \supseteq T^2 \supseteq \dots \supseteq \text{root} \] and a sequence of alphas \[ \alpha_1 \leq \alpha_2 \leq \dots \leq \alpha_k \]

We choose the best \(\alpha_i\) from this sequence by estimating the test error for each, e.g. by cross-validation.
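
For instance, a minimal sketch of that cross-validation step with scikit-learn, using the diabetes data as a stand-in dataset (scikit-learn's ccp_alpha plays the role of \(\alpha\)):

from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# candidate alphas from the cost-complexity pruning path
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# estimate the test error for each alpha by 5-fold cross-validation
scores = {
    alpha: cross_val_score(
        DecisionTreeRegressor(ccp_alpha=alpha, random_state=0), X, y, cv=5
    ).mean()
    for alpha in path.ccp_alphas
}
best_alpha = max(scores, key=scores.get)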

There are some worked examples here. We will now work through a fuller example with scikit-learn.

Cost Complexity Pruning

This example, from the scikit-learn documentation, applies Cost Complexity Pruning to a data set on breast cancer for classification.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

We load the data and use cost_complexity_pruning_path to generate the paths, the alphas and the impurities.

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

The final element corresponds to a tree with only a single node, so we remove it and then plot how the value of \(\alpha\) affects the impurity.

fig, ax = plt.subplots()
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.savefig("images/impurities.png")

impurities.png

Now we train a decision tree for each of these alphas.

clfs = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(
        X_train, y_train
    )
    for alpha in ccp_alphas
]
nodes = clfs[-1].tree_.node_count
ccp_alpha = ccp_alphas[-1]
f"Last tree has {nodes} node(s) and ccp_alpha: {ccp_alpha:0.4f}"
Last tree has 1 node(s) and ccp_alpha: 0.3273

Let's remove those last elements and plot the number of nodes and the depth of the trees for the varying alphas.

clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depths = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1)

ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")

ax[1].plot(ccp_alphas, depths, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")

fig.tight_layout()
plt.savefig("images/tree_depths.png")

tree_depths.png

Finally, let us see how the accuracy of our model on the training and test sets varies with \(\alpha\).

train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

fig, ax = plt.subplots()

ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")

common = {"marker": "o", "drawstyle": "steps-post"}
ax.plot(ccp_alphas, train_scores, label="train", **common)
ax.plot(ccp_alphas, test_scores, label="test", **common)
ax.legend()
plt.savefig("images/tree_accuracies.png")

tree_accuracies.png

We see the familiar pattern of a continuing drop in training accuracy but a clear maximum in test accuracy at around \(\alpha = 0.015\).

Pruning the Hitters dataset

Here is an as yet unpruned tree for the Hitters dataset.

8_4.png

And here is an error analysis on training data, test data and cross-validation. The original dataset was split in two to form the training and test sets and cross-validation was done sixfold.

The pruned tree with three terminal nodes is the one shown at the top of this section.

8_5.png

Decision trees for classification

Using decision trees for classification is very similar to using them for regression. We just need to make several tweaks.

Response

For regression, we used the mean of the responses at each node as its prediction. For classification we will use the most common response (the majority class) at the node.

Split function

classification error rate

For regression we used RSS to make binary splits. For classification we use the classification error rate \[ E = 1 - \max_k (\hat{p}_{mk}) \] where \(\hat{p}_{mk}\) is the proportion of training observations in region \(m\) that are from class \(k\).

It turns out that classification error rate is not sensitive enough for our purposes so two other splitting functions are commonly used.

Gini index

The Gini index, defined by \[ G = \sum_{k=1}^K \hat{p}_{mk} (1 - \hat{p}_{mk}) \] is closely related to the Gini coefficient commonly used as an indicator of wealth inequality.

If most values of \(\hat{p}_{mk}\) are close to 0 or 1, then \(G\) is low. The Gini index is often described as a measure of node purity.

Entropy

An information-theoretic measure closely related to the Gini index is entropy, defined in our terms as \[ D = - \sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk} \] Like the Gini index, entropy takes values close to zero when most values of \(\hat{p}_{mk}\) are close to 0 or 1, and in fact the two measures behave quite similarly.

Of the three measures, experience shows that Gini or entropy tend to be better for splitting, whereas classification error rate works best for pruning.
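
Here is a small sketch (not from the notes) computing the three measures for a single node from its class counts:

import numpy as np

def node_measures(counts):
    """Classification error rate, Gini index and entropy for one node,
    given the class counts of the training observations in it."""
    p = np.asarray(counts) / np.sum(counts)  # the p-hat_mk
    error = 1 - p.max()
    gini = np.sum(p * (1 - p))
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))  # natural log, as above
    return error, gini, entropy

# a fairly pure node (9 Yes, 2 No) versus an impure one (6 Yes, 5 No)
node_measures([9, 2]), node_measures([6, 5])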

Here is an example using the Heart dataset (not documented in the ISLP documentation), consisting of 303 patients who presented with chest pain. There are 13 predictors of whether an angiographic test would indicate the presence of heart disease. This is thus a classification task with a Yes/No outcome.

Furthermore, this dataset has categorical predictors. This does not present a problem, since we can split on a predictor taking one of its values as against the other possible values.

8_6.png

Something that might appear strange at first glance is that there are nodes which split into two child nodes with the same predicted value, e.g. the node with RestECG < 1 has Yes at both children. This is because of node purity. The leaf node on the left has \(7/11\) observed data points with Yes, whereas the node on the right has all \(9\) as Yes. We can thus be more confident of a Yes for RestECG \(\geq 1\). So this split does nothing to improve the classification error but it does improve our confidence in some cases.

Tree-based models vs Linear models

Tree-based models and Linear models are very different in nature. What are their advantages and disadvantages?

Advantages of tree-based models

  • They are easily explainable. There has been a great deal of attention in recent years to explainable AI, with ethical questions being asked about black-box systems which can be understood by neither laypersons nor experts.

    In the EU, the Right to Explanation is a cornerstone of the GDPR.

  • They are considered more natural and similar to human reasoning.
  • They can be displayed graphically which aids visual thinking.
  • They can easily cater for categorical variables.
  • They require very little preprocessing.
  • Feature importance can be easily derived from a decision tree.
  • They can be made robust against collinearity.

Disadvantages of tree-based models

  • They do not always reach the levels of predictive accuracy of other modelling approaches.
  • They are not very robust, with small changes in the input data sometimes giving very different results.
  • Learning an optimal decision tree is NP-complete so we need to use heuristics.

However, there are techniques which can mitigate these disadvantages to some degree.

Ensemble Methods

An ensemble method combines a number of weaker models into a single, stronger machine-learning method. We investigate several approaches to making ensembles of tree-based methods.

Bagging

In an earlier class we looked at the bootstrap as a way of getting a better estimate of a value when the variance is high. Decision trees also suffer from high variance, so we can use a kind of bootstrap known as bootstrap aggregation or bagging.

We saw that if we take \(n\) samples \(Z_1, Z_2, \dots, Z_n\), each with variance \(\sigma^2\), then the variance of their mean \(\bar{Z}\) will be \(\sigma^2/n\).

When building decision trees we can likewise bootstrap \(B\) different training sets, fit a tree to each, and average their predictions. We grow the trees deep without pruning. As a result they have high variance, but combining hundreds or even thousands of them by bagging can significantly improve accuracy.

For regression we average the predictions. For classification we use majority vote.

An interesting byproduct of the way bagging is done is that, on average, only about 2/3 of the data points appear in each bootstrap sample. For large \(n\) the fraction approaches \[ 1 - \frac{1}{e} \approx 0.632 \] This means that we have about 1/3 of the data points available for validation. These are called the out-of-bag (OOB) data points. So instead of having to make a test set or do cross-validation, we have a ready-made set of points to test on.
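
As a sketch, scikit-learn's BaggingClassifier can report this OOB accuracy directly (here on the breast cancer data we used earlier, rather than the Heart data of the figure below):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)

# bag 500 deep decision trees and score them on the out-of-bag points
bag = BaggingClassifier(n_estimators=500, oob_score=True, random_state=0)
bag.fit(X, y)
bag.oob_score_  # OOB estimate of test accuracy, no separate test set needed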

Here is a comparison of several of these approaches for the Heart data set. We see that bagging with a test set improves on a decision tree (the dashed line) and that OOB improves upon that. Random forests make a further improvement.

8_8.png

Random forests

Random forests are bagged decision trees with a refinement. Instead of considering all \(p\) features when splitting at a node, we consider only a random sample of \(m \approx \sqrt{p}\) of them.

This has the advantage of bringing more diversity to the trees generated. Without sampling, if there are several strong predictors among the features they will almost always get selected for splitting and the resulting trees will be highly correlated. With sampling, the other features are chosen more often and the correlation will be less. It decorrelates the trees.

We can see in the previous figure how this improves the errors.

The hyperparameter \(m\) can be adjusted depending on the data set. For a data set with a large number of features, many of which are correlated, it can be useful to take an even smaller \(m\).
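
In scikit-learn this corresponds to the max_features argument of RandomForestClassifier; a minimal sketch, again on the breast cancer data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# consider only m = sqrt(p) features at each split
rf = RandomForestClassifier(
    n_estimators=500, max_features="sqrt", random_state=0
)
cross_val_score(rf, X, y, cv=5).mean()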

The NCI 60 dataset contains high-dimensional biological data. The goal was to predict one of 14 cancer types based on gene expression. In the plot we see that 100 or so trees gives good performance and that random forests with \(m \approx \sqrt{p}\) improves on plain bagging (\(m = p\)).

8_8.png

Boosting

Boosting is another way in which we can improve decision trees. It is a general approach that can be applied to other methods of machine learning. While bagging takes a number of independent samples, boosting generates a sequence of trees where each one is generated from the previous one. Instead of immediately fitting a deep tree, we deliberately keep the trees small. We fit a first tree to the outcome \(y\) and then fit each subsequent tree to the residuals of the previous tree. In this way we seek to improve our tree in areas where it does not perform well. The idea is to learn slowly. The algorithm for regression is as follows.

  1. Initialize by setting \(\hat{f}(x) = 0\) and \(r_i = y_i\) for each \(i\) in the training set.
  2. For \(b \in 1,2,\dots,B\)
    1. Fit a tree \(\hat{f}_b\) with \(d\) splits to the training data \((X, r)\).
    2. Update \(\hat{f}\) by adding a shrunken version of the new tree: \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}_b(x)\)
    3. Update the residuals: \(r_i \leftarrow r_i - \lambda \hat{f}_b(x_i)\)
  3. The boosted model is \[ \hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}_b(x) \]
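
As a minimal sketch (not from the notes), the loop above can be written directly using scikit-learn regression stumps; here \(d = 1\), so each tree makes a single split:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
B, lam = 1000, 0.01  # number of trees and shrinkage

trees = []
r = y.astype(float)  # start with f-hat = 0, so the residuals are just y
for _ in range(B):
    tree = DecisionTreeRegressor(max_depth=1).fit(X, r)  # a stump fit to r
    r -= lam * tree.predict(X)  # update the residuals
    trees.append(tree)

# the boosted model: a sum of shrunken trees
f_hat = lam * sum(tree.predict(X) for tree in trees)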

Note that there are three hyperparameters to tune.

\(B\)
the number of trees
\(\lambda\)
the shrinkage parameter, a kind of learning rate, typically \(10^{-2}\) or \(10^{-3}\)
\(d\)
the number of splits in each tree, which can be as low as 1 (giving an additive model of stumps); more generally \(d\) is known as the interaction depth.
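
In scikit-learn these correspond roughly to the constructor arguments of GradientBoostingRegressor (a loose mapping: max_depth limits the depth of each tree rather than fixing the number of splits exactly):

from sklearn.ensemble import GradientBoostingRegressor

# B -> n_estimators, lambda -> learning_rate, d -> max_depth (approximately)
boost = GradientBoostingRegressor(
    n_estimators=1000, learning_rate=0.01, max_depth=1, random_state=0
)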

Here is boosting applied to the previous cancer data set. We see that it slightly outperforms random forests. As it produces a sum of smaller trees, this can aid explainability.

8_11.png

Bayesian Additive Regression Trees

Bayesian Additive Regression Trees (BART) combine the advantages of bagging and boosting by taking elements from both. As in bagging and random forests, each tree is constructed with an element of randomness; as in boosting, each tree focuses on capturing signal not yet accounted for by the current model.

We consider regression.

At each of the \(B\) iterations we will have \(K\) trees. We write \(\hat{f}_k^b(x_i)\) for the prediction at \(x_i\) by tree \(k\) on iteration \(b\). After each iteration, we sum the \(K\) trees from that iteration. \[ \hat{f}^b(x) = \sum_{k=1}^K \hat{f}^b_k(x) \]

At each iteration BART randomly perturbs each tree from the previous iteration in one of several ways, looking for variations that improve the fit. For example, the following tree:

8_12a.png

With different predictions at some terminal nodes

8_12b.png

After pruning

8_12c.png

With terminal nodes added

8_12d.png

These perturbations serve to increase the diversity of the trees.

Another feature of BART is the employment of a burn-in period. The trees generated in early iterations are often of low quality and do not contribute to a good final model so they are not included. Typically several hundred iterations are excluded. Here we see BART on the Heart data settling down after a burn-in period.

8_13.png

Variable importance measures

When we have a single decision tree, it is easy by inspection to see which features are more important than others. In methods such as bagging, where we have many trees, this can be less obvious.

We can, however, look at how many times nodes were split on each feature and what the total reduction in RSS (or, for classification, in the Gini index) was. In this way we can get a measure of variable importance.

8_9.png
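
In scikit-learn, a fitted forest exposes this measure (the mean decrease in impurity) as feature_importances_. A minimal sketch, on the breast cancer data rather than the Heart data shown in the figure:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import polars as pl

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(data.data, data.target)

# mean decrease in impurity, summed over all trees in the forest
importances = pl.DataFrame(
    {"feature": list(data.feature_names), "importance": rf.feature_importances_}
).sort("importance", descending=True)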

Decision Trees from Scratch

This material is adapted from Machine Learning from Scratch by Marvin Lanhenke.

CheatSheetDT.png

from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np


class Node:
    def __init__(
        self,
        feature=None,
        threshold=None,
        left=None,
        right=None,
        *,
        value=None
    ):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

    def is_leaf(self):
        return self.value is not None


class DecisionTree:
    def __init__(self, max_depth=100, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None

    def _is_finished(self, depth):
        return (
            depth >= self.max_depth
            or self.n_class_labels == 1
            or self.n_samples < self.min_samples_split
        )

    def _entropy(self, y):
        proportions = np.bincount(y) / len(y)
        entropy = -np.sum([p * np.log2(p) for p in proportions if p > 0])
        return entropy

    def _create_split(self, X, thresh):
        left_idx = np.argwhere(X <= thresh).flatten()
        right_idx = np.argwhere(X > thresh).flatten()
        return left_idx, right_idx

    def _information_gain(self, X, y, thresh):
        parent_loss = self._entropy(y)
        left_idx, right_idx = self._create_split(X, thresh)
        n, n_left, n_right = len(y), len(left_idx), len(right_idx)

        if n_left == 0 or n_right == 0:
            return 0

        child_loss = (n_left / n) * self._entropy(y[left_idx]) + (
            n_right / n
        ) * self._entropy(y[right_idx])
        return parent_loss - child_loss

    def _best_split(self, X, y, features):
        split = {"score": -1, "feat": None, "thresh": None}

        for feat in features:
            X_feat = X[:, feat]
            thresholds = np.unique(X_feat)
            for thresh in thresholds:
                score = self._information_gain(X_feat, y, thresh)

                if score > split["score"]:
                    split["score"] = score
                    split["feat"] = feat
                    split["thresh"] = thresh

        return split["feat"], split["thresh"]

    def _build_tree(self, X, y, depth=0):
        self.n_samples, self.n_features = X.shape
        self.n_class_labels = len(np.unique(y))

        # stopping criterion
        if self._is_finished(depth):
            most_common_label = np.argmax(np.bincount(y))
            return Node(value=most_common_label)

        # get best split
        rnd_feats = np.random.choice(
            self.n_features, self.n_features, replace=False
        )
        best_feat, best_thresh = self._best_split(X, y, rnd_feats)

        # grow children recursively
        left_idx, right_idx = self._create_split(X[:, best_feat], best_thresh)
        left_child = self._build_tree(X[left_idx, :], y[left_idx], depth + 1)
        right_child = self._build_tree(
            X[right_idx, :], y[right_idx], depth + 1
        )
        return Node(best_feat, best_thresh, left_child, right_child)

    def _traverse_tree(self, x, node):
        if node.is_leaf():
            return node.value

        if x[node.feature] <= node.threshold:
            return self._traverse_tree(x, node.left)
        return self._traverse_tree(x, node.right)

    def fit(self, X, y):
        self.root = self._build_tree(X, y)

    def predict(self, X):
        predictions = [self._traverse_tree(x, self.root) for x in X]
        return np.array(predictions)


def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)


data = datasets.load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

clf = DecisionTree(max_depth=10)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
acc = accuracy(y_test, y_pred)

f"Accuracy: {acc:0.4f}"
Accuracy: 0.9561

Author: Breanndán Ó Nualláin <o@uva.nl>

Date: 2025-11-06 Thu 13:01