Chapter 10 of 20

Decision Trees

Understand how decision trees split the feature space with if/else questions — Gini impurity, entropy and information gain, regression trees, controlling overfitting with depth and pruning, feature importance, and a full scikit-learn workflow with plot_tree.

Meritshot16 min read
Machine LearningDecision TreesGiniEntropyInformation GainFeature Importance
All Machine Learning Chapters

What Is a Decision Tree?

A decision tree is a model that makes predictions by asking a sequence of simple yes/no questions about the features, following the answers down a branching path until it reaches a conclusion. Each question splits the data into smaller, more homogeneous groups.

In the previous chapters you saw Logistic Regression draw a single smooth boundary and K-Nearest Neighbors vote among close-by points. A decision tree takes a completely different, refreshingly human approach: it carves the feature space into rectangular regions using one threshold at a time.

Intuitive analogy. Imagine a loan officer named Rahul evaluating applications with a flowchart pinned to his desk. Is monthly income greater than ₹50,000? If no, decline. If yes, is the credit score above 700? If yes, approve; if no, does the applicant already have more than 2 active loans? — and so on. Each question narrows down the group of applicants until Rahul can confidently say "approve" or "decline." A decision tree learns this flowchart automatically from data, choosing which question to ask, in what order, and at what threshold.

Goal: learn a hierarchy of feature-based questions so that the groups at the bottom of the tree are as pure as possible — mostly one class (for classification) or with a tight spread of values (for regression).

Examples:
→ Approve or decline a loan based on income, credit score, and existing debt
→ Predict whether a customer will churn from usage and tenure
→ Estimate the price of a used car from age, mileage, and brand
→ Diagnose a fault from a small set of sensor readings

Anatomy of a Tree: Root, Internal, and Leaf Nodes

Every decision tree has three kinds of nodes. Understanding them makes the rest of the chapter easy.

  • Root node — the very first question, applied to the entire dataset.
  • Internal (decision) nodes — every subsequent question. Each one tests a single feature against a threshold and sends data left or right.
  • Leaf (terminal) nodes — the end of a path. A leaf holds a prediction: a class label (classification) or a number (regression). No more questions are asked here.
                 [ income > 50000 ? ]          <- root node
                    /            \
                  no              yes
                  /                 \
            [ DECLINE ]      [ credit_score > 700 ? ]   <- internal node
              (leaf)             /            \
                                no             yes
                               /                 \
                    [ loans > 2 ? ]           [ APPROVE ]  <- leaf
                       /       \                 (leaf)
                     no         yes
                     /            \
                [ APPROVE ]     [ DECLINE ]
                  (leaf)          (leaf)

The depth of a tree is the length of the longest path from the root to a leaf. A tree of depth 3 asks at most 3 questions before deciding. Depth is one of the most important knobs you will tune, as you will see later.

How a Split Is Chosen

The heart of tree learning is a single question repeated over and over: which feature, at which threshold, best separates the data? The tree tries every candidate split and keeps the one that makes the resulting groups purest. "Purity" needs a formal measure, and there are two classic choices for classification: Gini impurity and entropy.

Gini Impurity

Gini impurity measures the chance that a randomly picked item from a node would be mislabeled if we labeled it randomly according to the class proportions in that node. Lower is better; 0 means perfectly pure (all one class).

Gini impurity of a node:

Gini = 1 - Σ (pᵢ)²      for i = 1 ... number of classes

where pᵢ = fraction of items in the node belonging to class i

Range: 0 (pure) to (1 - 1/k) for k classes.
For 2 classes the maximum is 0.5 (a 50/50 mix).

Worked example for a node with 10 items — 8 "approve" and 2 "decline":

p_approve = 8/10 = 0.8
p_decline = 2/10 = 0.2

Gini = 1 - (0.8² + 0.2²)
     = 1 - (0.64 + 0.04)
     = 1 - 0.68
     = 0.32

A pure node (all 10 "approve") gives Gini = 1 - (1.0² ) = 0. A perfectly mixed node (5/5) gives Gini = 1 - (0.5² + 0.5²) = 0.5.

Entropy and Information Gain

Entropy comes from information theory and measures the disorder or uncertainty in a node. Like Gini, it is 0 for a pure node and maximal for an even mix.

Entropy of a node:

Entropy = - Σ pᵢ × log₂(pᵢ)      for i = 1 ... number of classes

For the same 8-approve / 2-decline node:

Entropy = -(0.8 × log₂ 0.8) - (0.2 × log₂ 0.2)
        = -(0.8 × -0.3219) - (0.2 × -2.3219)
        = 0.2575 + 0.4644
        = 0.7219

A tree that uses entropy chooses the split that maximizes information gain — the reduction in entropy achieved by splitting a parent node into children.

Information Gain = Entropy(parent)
                 - Σ ( (n_child / n_parent) × Entropy(child) )

The children are weighted by how many items fall into each,
so a split that isolates a tiny impure group is not over-rewarded.

Worked information-gain example. A parent with 10 items (6 approve, 4 decline) is split by income > 50000 into a left child of 4 items (1 approve, 3 decline) and a right child of 6 items (5 approve, 1 decline):

Entropy(parent)  = -(0.6·log₂0.6) - (0.4·log₂0.4)  = 0.971
Entropy(left)    = -(0.25·log₂0.25) - (0.75·log₂0.75) = 0.811
Entropy(right)   = -(5/6·log₂ 5/6) - (1/6·log₂ 1/6)   = 0.650

Weighted child entropy = (4/10)(0.811) + (6/10)(0.650)
                       = 0.3244 + 0.3900 = 0.7144

Information Gain = 0.971 - 0.7144 = 0.2566

The tree evaluates the information gain (or Gini reduction) for every possible feature and threshold, then greedily picks the winner. In scikit-learn, criterion="gini" (the default) and criterion="entropy" select the two measures. In practice they usually produce very similar trees — Gini is slightly faster because it avoids the logarithm.

Recursive Partitioning

Once the best split is chosen, the same procedure runs again inside each child, then inside each grandchild, and so on. This top-down, greedy process is called recursive partitioning. It stops when a node is pure, when it becomes too small to split, or when a limit you set (like max_depth) is reached. Each split is chosen to be locally optimal; the tree never looks ahead, which is why trees are fast to train but not guaranteed to be globally optimal.

Regression Trees: Predicting Numbers

Everything above assumed classification. A regression tree predicts a continuous number, so purity is measured differently: instead of Gini or entropy, splits are chosen to minimize the variance (or mean squared error) within each child. The prediction at a leaf is simply the mean of the target values of the training samples that landed there.

For regression, a common split criterion is the reduction in MSE:

MSE(node) = (1/n) × Σ (yᵢ - ȳ)²    where ȳ = mean of y in the node

The best split minimizes the weighted MSE of the two children.
Leaf prediction = mean of y over the training samples in that leaf.

Because a leaf outputs a single mean, a regression tree produces a staircase of flat predictions rather than a smooth line. If you split a used-car dataset by age and mileage, every car falling into the same leaf gets the identical predicted price — the average price of the training cars in that region.

A Full scikit-learn Example

Let's classify loan approvals with DecisionTreeClassifier. The workflow mirrors the Train-Test Split & Cross-Validation chapter: split first, fit on train, evaluate on test.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

# --- Illustrative loan dataset (numbers are made up for teaching) ---
rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
    "income":        rng.integers(15000, 120000, n),   # monthly income (₹)
    "credit_score":  rng.integers(300, 850, n),
    "existing_loans": rng.integers(0, 5, n),
})
# A rule-of-thumb label with some noise, purely for demonstration
approve = ((df.income > 45000) & (df.credit_score > 680) &
           (df.existing_loans <= 2)).astype(int)
df["approved"] = (approve ^ (rng.random(n) < 0.08)).astype(int)  # 8% label noise

X = df[["income", "credit_score", "existing_loans"]]
y = df["approved"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# --- Train a controlled tree (see 'Overfitting' below for why) ---
clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=4,
    min_samples_leaf=20,
    random_state=42,
)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Test accuracy:", round(accuracy_score(y_test, y_pred), 3))
print(classification_report(y_test, y_pred))

Expected output (illustrative — your exact numbers will vary):

Test accuracy: 0.9

              precision    recall  f1-score   support
           0       0.91      0.94      0.92        95
           1       0.88      0.83      0.85        55
    accuracy                           0.90       150

Visualizing the tree. One of the biggest selling points of trees is that you can see them. Use plot_tree (or export_text for a plain-text version):

plt.figure(figsize=(16, 8))
plot_tree(
    clf,
    feature_names=X.columns,
    class_names=["decline", "approve"],
    filled=True,     # colour nodes by majority class
    rounded=True,
    fontsize=9,
)
plt.show()

# Text alternative:
from sklearn.tree import export_text
print(export_text(clf, feature_names=list(X.columns)))

Each box plot_tree draws shows the split condition, the Gini value, the sample count, and the class distribution — a fully auditable decision path you can hand to a compliance team.

Overfitting and How to Control It

Left unchecked, a decision tree will keep splitting until every leaf is perfectly pure — often one training sample per leaf. Such a tree memorizes the training data, including its noise, and generalizes terribly. This is the classic overfitting you met conceptually in the workflow chapters, and it is the single biggest weakness of trees.

The fix is to limit the tree's complexity, either while growing it (pre-pruning) or after (post-pruning).

Pre-Pruning Hyperparameters

HyperparameterWhat it controlsEffect of a larger value
max_depthMaximum number of question levelsLarger allows more splits, higher variance
min_samples_splitMin samples a node needs to be eligible to splitLarger forces bigger, more general splits
min_samples_leafMin samples that must remain in each leafLarger smooths predictions, prevents tiny leaves
max_leaf_nodesHard cap on the total number of leavesSmaller yields a simpler tree
max_featuresFeatures considered per splitSmaller adds randomness, reduces overfitting

A good starting point for a first tree: set max_depth to something modest like 3 to 6 and min_samples_leaf to a value such as 1% of your training rows. Then tune with cross-validation.

Post-Pruning with Cost-Complexity (ccp_alpha)

Pruning grows a large tree first, then chops back the branches that add little predictive value. scikit-learn implements cost-complexity pruning, governed by a single parameter ccp_alpha.

Cost-complexity of a subtree T:

R_alpha(T) = R(T) + alpha × |leaves(T)|

R(T)          = total impurity of the leaves (training error)
|leaves(T)|   = number of leaves (a complexity penalty)
alpha         = ccp_alpha, the price paid per leaf

alpha = 0        -> keep the full, unpruned tree
larger alpha     -> heavier penalty -> smaller, simpler tree

The tree keeps a branch only if the impurity it removes is "worth" the alpha cost of its extra leaves. You choose alpha empirically:

# Find candidate alphas from the training data, then cross-validate them
path = clf.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas          # increasing sequence of effective alphas

from sklearn.model_selection import cross_val_score
scores = []
for a in alphas:
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=a)
    scores.append(cross_val_score(pruned, X_train, y_train, cv=5).mean())

best_alpha = alphas[int(np.argmax(scores))]
print("Best ccp_alpha:", round(best_alpha, 5))

final_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
final_tree.fit(X_train, y_train)

The winning alpha is the one that maximizes cross-validated accuracy, not training accuracy — which always favours the biggest tree.

Feature Importance

After training, a tree can tell you which features mattered most. scikit-learn computes importance as the total reduction in impurity (Gini or entropy) contributed by each feature across all the splits that use it, weighted by how many samples pass through those splits. The values are normalized to sum to 1.

importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
credit_score     0.58
income           0.34
existing_loans   0.08
dtype: float64

This says credit score drove most of the decisions, income mattered moderately, and the number of existing loans played a minor role. Two cautions: (1) importances are computed on the training data, so a feature that overfits can look important; (2) impurity-based importance is biased toward high-cardinality and continuous features. For a more reliable ranking, prefer permutation importance, which measures how much test-set accuracy drops when a feature's values are shuffled.

Interpretability: The Big Advantage

Decision trees are among the most interpretable models in machine learning. A prediction is just a path of human-readable conditions: income greater than 45,000 and credit score greater than 680 and no more than 2 existing loans, therefore approve. You can:

  • Trace any single decision end-to-end and explain it to a customer or regulator.
  • Visualize the whole model with plot_tree or export_text.
  • Handle mixed feature types (numeric and categorical) with almost no preprocessing — trees need no feature scaling, unlike KNN or logistic regression, because splits are threshold-based.
  • Capture non-linear relationships and interactions naturally, without you specifying them.

This transparency is exactly why banks, insurers, and healthcare providers often prefer trees (and tree ensembles) over black-box models when decisions must be justified.

Pros and Cons

AspectProsCons
InterpretabilityHighly transparent; a path of if/else rules anyone can readVery deep trees become hard to read
PreprocessingNo scaling needed; handles numeric and categorical mixesCategoricals need encoding in scikit-learn
RelationshipsCaptures non-linearities and interactions automaticallySplits are axis-aligned; struggles with diagonal boundaries
SpeedFast to train and predictGreedy search is not globally optimal
StabilitySimple to buildUnstable — a small data change can reshape the whole tree
OverfittingControllable via depth and pruningOverfits badly if left unpruned
ExtrapolationWorks well within the observed rangeCannot extrapolate; regression leaves output a constant beyond the data

The instability and overfitting rows are precisely what the next chapter tackles by averaging many trees.

Common Mistakes

1. Growing an unpruned, deep tree

A tree with no max_depth or min_samples_leaf will reach near-100% training accuracy and then collapse on test data. Always constrain the tree and validate with cross-validation. High training accuracy paired with low test accuracy is the tell-tale sign.

2. Trusting a single tree's stability

Trees are high-variance models: change a few training rows, or a single random seed, and the root split — and therefore the entire structure — can flip. Never read deep significance into "the exact tree" you got. If you need robustness, move to Random Forests or other ensembles.

3. Expecting a tree to extrapolate

A regression tree predicts the mean of a leaf, so for any input beyond the training range it just returns the nearest leaf's constant. Ask it to predict the price of a car far older than any in training and it will confidently repeat a value it has already seen. Trees interpolate; they do not extrapolate.

4. Over-reading impurity-based feature importance

Impurity importance inflates continuous and high-cardinality features and is measured on training data. Do not present it as ground truth. Cross-check with permutation importance on the test set.

5. Scaling features unnecessarily

Because splits compare a feature to a threshold, standardizing or normalizing inputs has no effect on a tree. Spending time on scaling here is wasted effort (though it does no harm) — save it for distance- and gradient-based models.

6. Ignoring class imbalance

On a fraud dataset that is 99% legitimate, an unconstrained tree can score 99% accuracy by rarely predicting fraud. Use class_weight="balanced", resampling, and metrics like precision, recall, and F1 (covered in Model Evaluation Metrics) rather than raw accuracy.

Practice Exercises

  1. A node contains 15 samples: 9 of class A and 6 of class B. Compute its Gini impurity and its entropy. Which value is larger, and why does that make sense?

  2. A parent node (20 samples: 12 positive, 8 negative) is split into a left child (8 samples: 2 positive, 6 negative) and a right child (12 samples: 10 positive, 2 negative). Compute the information gain of this split.

  3. Using the loan example in this chapter, train two DecisionTreeClassifier models: one with max_depth=None and one with max_depth=3. Compare their training and test accuracy. Which one overfits, and how can you tell?

  4. Run cost_complexity_pruning_path on your training data, cross-validate across the returned ccp_alphas, and plot cross-validated accuracy against alpha. Report the alpha you would choose and the number of leaves in the resulting tree.

  5. Build a regression tree on a small dataset of used-car prices (features: age, mileage). Predict the price of a car whose age is larger than any in the training set. Explain the number you get in terms of leaf means and extrapolation.

  6. Compute both feature_importances_ and permutation importance (sklearn.inspection.permutation_importance) for your loan tree. Where do the two rankings disagree, and which would you trust for a stakeholder report?

Summary

In this chapter you learned:

  • A decision tree predicts by asking a sequence of single-feature if/else questions, partitioning the feature space into rectangular regions.
  • Trees have a root node, internal (decision) nodes, and leaf nodes that hold the final prediction.
  • Splits are chosen to maximize purity: Gini impurity (Gini = 1 - Σ pᵢ²) or entropy with information gain (IG = Entropy(parent) - weighted Entropy(children)). Gini and entropy usually give similar trees.
  • The tree is built by greedy recursive partitioning — locally best splits, no lookahead.
  • Regression trees minimize within-node variance/MSE and predict the mean of a leaf, producing a staircase of constants.
  • Unpruned trees overfit; control them with max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes, and post-pruning via ccp_alpha (cost-complexity pruning), tuned with cross-validation.
  • Feature importance ranks features by total impurity reduction; prefer permutation importance for a less biased view.
  • Trees are highly interpretable, need no feature scaling, and handle non-linearities — but they are unstable and cannot extrapolate.
  • Common pitfalls: deep unpruned trees, trusting a single tree's structure, expecting extrapolation, over-reading impurity importance, and ignoring class imbalance.

A single tree is powerful and transparent, but its instability limits its accuracy on its own.

Next up: Random Forests — combine hundreds of decorrelated decision trees by averaging their votes to tame variance and dramatically boost accuracy while keeping much of the tree's flexibility.