What Is a Decision Tree?
A decision tree is a model that makes predictions by asking a sequence of simple yes/no questions about the features, following the answers down a branching path until it reaches a conclusion. Each question splits the data into smaller, more homogeneous groups.
In the previous chapters you saw Logistic Regression draw a single smooth boundary and K-Nearest Neighbors vote among close-by points. A decision tree takes a completely different, refreshingly human approach: it carves the feature space into rectangular regions using one threshold at a time.
Intuitive analogy. Imagine a loan officer named Rahul evaluating applications with a flowchart pinned to his desk. Is monthly income greater than ₹50,000? If no, decline. If yes, is the credit score above 700? If yes, approve; if no, does the applicant already have more than 2 active loans? — and so on. Each question narrows down the group of applicants until Rahul can confidently say "approve" or "decline." A decision tree learns this flowchart automatically from data, choosing which question to ask, in what order, and at what threshold.
Goal: learn a hierarchy of feature-based questions so that the groups at the bottom of the tree are as pure as possible — mostly one class (for classification) or with a tight spread of values (for regression).
Examples:
→ Approve or decline a loan based on income, credit score, and existing debt
→ Predict whether a customer will churn from usage and tenure
→ Estimate the price of a used car from age, mileage, and brand
→ Diagnose a fault from a small set of sensor readings
Anatomy of a Tree: Root, Internal, and Leaf Nodes
Every decision tree has three kinds of nodes. Understanding them makes the rest of the chapter easy.
- Root node — the very first question, applied to the entire dataset.
- Internal (decision) nodes — every subsequent question. Each one tests a single feature against a threshold and sends data left or right.
- Leaf (terminal) nodes — the end of a path. A leaf holds a prediction: a class label (classification) or a number (regression). No more questions are asked here.
[ income > 50000 ? ] <- root node
/ \
no yes
/ \
[ DECLINE ] [ credit_score > 700 ? ] <- internal node
(leaf) / \
no yes
/ \
[ loans > 2 ? ] [ APPROVE ] <- leaf
/ \ (leaf)
no yes
/ \
[ APPROVE ] [ DECLINE ]
(leaf) (leaf)
The depth of a tree is the length of the longest path from the root to a leaf. A tree of depth 3 asks at most 3 questions before deciding. Depth is one of the most important knobs you will tune, as you will see later.
How a Split Is Chosen
The heart of tree learning is a single question repeated over and over: which feature, at which threshold, best separates the data? The tree tries every candidate split and keeps the one that makes the resulting groups purest. "Purity" needs a formal measure, and there are two classic choices for classification: Gini impurity and entropy.
Gini Impurity
Gini impurity measures the chance that a randomly picked item from a node would be mislabeled if we labeled it randomly according to the class proportions in that node. Lower is better; 0 means perfectly pure (all one class).
Gini impurity of a node:
Gini = 1 - Σ (pᵢ)² for i = 1 ... number of classes
where pᵢ = fraction of items in the node belonging to class i
Range: 0 (pure) to (1 - 1/k) for k classes.
For 2 classes the maximum is 0.5 (a 50/50 mix).
Worked example for a node with 10 items — 8 "approve" and 2 "decline":
p_approve = 8/10 = 0.8
p_decline = 2/10 = 0.2
Gini = 1 - (0.8² + 0.2²)
= 1 - (0.64 + 0.04)
= 1 - 0.68
= 0.32
A pure node (all 10 "approve") gives Gini = 1 - (1.0² ) = 0. A perfectly mixed node (5/5) gives Gini = 1 - (0.5² + 0.5²) = 0.5.
Entropy and Information Gain
Entropy comes from information theory and measures the disorder or uncertainty in a node. Like Gini, it is 0 for a pure node and maximal for an even mix.
Entropy of a node:
Entropy = - Σ pᵢ × log₂(pᵢ) for i = 1 ... number of classes
For the same 8-approve / 2-decline node:
Entropy = -(0.8 × log₂ 0.8) - (0.2 × log₂ 0.2)
= -(0.8 × -0.3219) - (0.2 × -2.3219)
= 0.2575 + 0.4644
= 0.7219
A tree that uses entropy chooses the split that maximizes information gain — the reduction in entropy achieved by splitting a parent node into children.
Information Gain = Entropy(parent)
- Σ ( (n_child / n_parent) × Entropy(child) )
The children are weighted by how many items fall into each,
so a split that isolates a tiny impure group is not over-rewarded.
Worked information-gain example. A parent with 10 items (6 approve, 4 decline) is split by income > 50000 into a left child of 4 items (1 approve, 3 decline) and a right child of 6 items (5 approve, 1 decline):
Entropy(parent) = -(0.6·log₂0.6) - (0.4·log₂0.4) = 0.971
Entropy(left) = -(0.25·log₂0.25) - (0.75·log₂0.75) = 0.811
Entropy(right) = -(5/6·log₂ 5/6) - (1/6·log₂ 1/6) = 0.650
Weighted child entropy = (4/10)(0.811) + (6/10)(0.650)
= 0.3244 + 0.3900 = 0.7144
Information Gain = 0.971 - 0.7144 = 0.2566
The tree evaluates the information gain (or Gini reduction) for every possible feature and threshold, then greedily picks the winner. In scikit-learn, criterion="gini" (the default) and criterion="entropy" select the two measures. In practice they usually produce very similar trees — Gini is slightly faster because it avoids the logarithm.
Recursive Partitioning
Once the best split is chosen, the same procedure runs again inside each child, then inside each grandchild, and so on. This top-down, greedy process is called recursive partitioning. It stops when a node is pure, when it becomes too small to split, or when a limit you set (like max_depth) is reached. Each split is chosen to be locally optimal; the tree never looks ahead, which is why trees are fast to train but not guaranteed to be globally optimal.
Regression Trees: Predicting Numbers
Everything above assumed classification. A regression tree predicts a continuous number, so purity is measured differently: instead of Gini or entropy, splits are chosen to minimize the variance (or mean squared error) within each child. The prediction at a leaf is simply the mean of the target values of the training samples that landed there.
For regression, a common split criterion is the reduction in MSE:
MSE(node) = (1/n) × Σ (yᵢ - ȳ)² where ȳ = mean of y in the node
The best split minimizes the weighted MSE of the two children.
Leaf prediction = mean of y over the training samples in that leaf.
Because a leaf outputs a single mean, a regression tree produces a staircase of flat predictions rather than a smooth line. If you split a used-car dataset by age and mileage, every car falling into the same leaf gets the identical predicted price — the average price of the training cars in that region.
A Full scikit-learn Example
Let's classify loan approvals with DecisionTreeClassifier. The workflow mirrors the Train-Test Split & Cross-Validation chapter: split first, fit on train, evaluate on test.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
# --- Illustrative loan dataset (numbers are made up for teaching) ---
rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
"income": rng.integers(15000, 120000, n), # monthly income (₹)
"credit_score": rng.integers(300, 850, n),
"existing_loans": rng.integers(0, 5, n),
})
# A rule-of-thumb label with some noise, purely for demonstration
approve = ((df.income > 45000) & (df.credit_score > 680) &
(df.existing_loans <= 2)).astype(int)
df["approved"] = (approve ^ (rng.random(n) < 0.08)).astype(int) # 8% label noise
X = df[["income", "credit_score", "existing_loans"]]
y = df["approved"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
# --- Train a controlled tree (see 'Overfitting' below for why) ---
clf = DecisionTreeClassifier(
criterion="gini",
max_depth=4,
min_samples_leaf=20,
random_state=42,
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Test accuracy:", round(accuracy_score(y_test, y_pred), 3))
print(classification_report(y_test, y_pred))
Expected output (illustrative — your exact numbers will vary):
Test accuracy: 0.9
precision recall f1-score support
0 0.91 0.94 0.92 95
1 0.88 0.83 0.85 55
accuracy 0.90 150
Visualizing the tree. One of the biggest selling points of trees is that you can see them. Use plot_tree (or export_text for a plain-text version):
plt.figure(figsize=(16, 8))
plot_tree(
clf,
feature_names=X.columns,
class_names=["decline", "approve"],
filled=True, # colour nodes by majority class
rounded=True,
fontsize=9,
)
plt.show()
# Text alternative:
from sklearn.tree import export_text
print(export_text(clf, feature_names=list(X.columns)))
Each box plot_tree draws shows the split condition, the Gini value, the sample count, and the class distribution — a fully auditable decision path you can hand to a compliance team.
Overfitting and How to Control It
Left unchecked, a decision tree will keep splitting until every leaf is perfectly pure — often one training sample per leaf. Such a tree memorizes the training data, including its noise, and generalizes terribly. This is the classic overfitting you met conceptually in the workflow chapters, and it is the single biggest weakness of trees.
The fix is to limit the tree's complexity, either while growing it (pre-pruning) or after (post-pruning).
Pre-Pruning Hyperparameters
| Hyperparameter | What it controls | Effect of a larger value |
|---|---|---|
max_depth | Maximum number of question levels | Larger allows more splits, higher variance |
min_samples_split | Min samples a node needs to be eligible to split | Larger forces bigger, more general splits |
min_samples_leaf | Min samples that must remain in each leaf | Larger smooths predictions, prevents tiny leaves |
max_leaf_nodes | Hard cap on the total number of leaves | Smaller yields a simpler tree |
max_features | Features considered per split | Smaller adds randomness, reduces overfitting |
A good starting point for a first tree: set max_depth to something modest like 3 to 6 and min_samples_leaf to a value such as 1% of your training rows. Then tune with cross-validation.
Post-Pruning with Cost-Complexity (ccp_alpha)
Pruning grows a large tree first, then chops back the branches that add little predictive value. scikit-learn implements cost-complexity pruning, governed by a single parameter ccp_alpha.
Cost-complexity of a subtree T:
R_alpha(T) = R(T) + alpha × |leaves(T)|
R(T) = total impurity of the leaves (training error)
|leaves(T)| = number of leaves (a complexity penalty)
alpha = ccp_alpha, the price paid per leaf
alpha = 0 -> keep the full, unpruned tree
larger alpha -> heavier penalty -> smaller, simpler tree
The tree keeps a branch only if the impurity it removes is "worth" the alpha cost of its extra leaves. You choose alpha empirically:
# Find candidate alphas from the training data, then cross-validate them
path = clf.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas # increasing sequence of effective alphas
from sklearn.model_selection import cross_val_score
scores = []
for a in alphas:
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=a)
scores.append(cross_val_score(pruned, X_train, y_train, cv=5).mean())
best_alpha = alphas[int(np.argmax(scores))]
print("Best ccp_alpha:", round(best_alpha, 5))
final_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
final_tree.fit(X_train, y_train)
The winning alpha is the one that maximizes cross-validated accuracy, not training accuracy — which always favours the biggest tree.
Feature Importance
After training, a tree can tell you which features mattered most. scikit-learn computes importance as the total reduction in impurity (Gini or entropy) contributed by each feature across all the splits that use it, weighted by how many samples pass through those splits. The values are normalized to sum to 1.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
credit_score 0.58
income 0.34
existing_loans 0.08
dtype: float64
This says credit score drove most of the decisions, income mattered moderately, and the number of existing loans played a minor role. Two cautions: (1) importances are computed on the training data, so a feature that overfits can look important; (2) impurity-based importance is biased toward high-cardinality and continuous features. For a more reliable ranking, prefer permutation importance, which measures how much test-set accuracy drops when a feature's values are shuffled.
Interpretability: The Big Advantage
Decision trees are among the most interpretable models in machine learning. A prediction is just a path of human-readable conditions: income greater than 45,000 and credit score greater than 680 and no more than 2 existing loans, therefore approve. You can:
- Trace any single decision end-to-end and explain it to a customer or regulator.
- Visualize the whole model with
plot_treeorexport_text. - Handle mixed feature types (numeric and categorical) with almost no preprocessing — trees need no feature scaling, unlike KNN or logistic regression, because splits are threshold-based.
- Capture non-linear relationships and interactions naturally, without you specifying them.
This transparency is exactly why banks, insurers, and healthcare providers often prefer trees (and tree ensembles) over black-box models when decisions must be justified.
Pros and Cons
| Aspect | Pros | Cons |
|---|---|---|
| Interpretability | Highly transparent; a path of if/else rules anyone can read | Very deep trees become hard to read |
| Preprocessing | No scaling needed; handles numeric and categorical mixes | Categoricals need encoding in scikit-learn |
| Relationships | Captures non-linearities and interactions automatically | Splits are axis-aligned; struggles with diagonal boundaries |
| Speed | Fast to train and predict | Greedy search is not globally optimal |
| Stability | Simple to build | Unstable — a small data change can reshape the whole tree |
| Overfitting | Controllable via depth and pruning | Overfits badly if left unpruned |
| Extrapolation | Works well within the observed range | Cannot extrapolate; regression leaves output a constant beyond the data |
The instability and overfitting rows are precisely what the next chapter tackles by averaging many trees.
Common Mistakes
1. Growing an unpruned, deep tree
A tree with no max_depth or min_samples_leaf will reach near-100% training accuracy and then collapse on test data. Always constrain the tree and validate with cross-validation. High training accuracy paired with low test accuracy is the tell-tale sign.
2. Trusting a single tree's stability
Trees are high-variance models: change a few training rows, or a single random seed, and the root split — and therefore the entire structure — can flip. Never read deep significance into "the exact tree" you got. If you need robustness, move to Random Forests or other ensembles.
3. Expecting a tree to extrapolate
A regression tree predicts the mean of a leaf, so for any input beyond the training range it just returns the nearest leaf's constant. Ask it to predict the price of a car far older than any in training and it will confidently repeat a value it has already seen. Trees interpolate; they do not extrapolate.
4. Over-reading impurity-based feature importance
Impurity importance inflates continuous and high-cardinality features and is measured on training data. Do not present it as ground truth. Cross-check with permutation importance on the test set.
5. Scaling features unnecessarily
Because splits compare a feature to a threshold, standardizing or normalizing inputs has no effect on a tree. Spending time on scaling here is wasted effort (though it does no harm) — save it for distance- and gradient-based models.
6. Ignoring class imbalance
On a fraud dataset that is 99% legitimate, an unconstrained tree can score 99% accuracy by rarely predicting fraud. Use class_weight="balanced", resampling, and metrics like precision, recall, and F1 (covered in Model Evaluation Metrics) rather than raw accuracy.
Practice Exercises
-
A node contains 15 samples: 9 of class A and 6 of class B. Compute its Gini impurity and its entropy. Which value is larger, and why does that make sense?
-
A parent node (20 samples: 12 positive, 8 negative) is split into a left child (8 samples: 2 positive, 6 negative) and a right child (12 samples: 10 positive, 2 negative). Compute the information gain of this split.
-
Using the loan example in this chapter, train two
DecisionTreeClassifiermodels: one withmax_depth=Noneand one withmax_depth=3. Compare their training and test accuracy. Which one overfits, and how can you tell? -
Run
cost_complexity_pruning_pathon your training data, cross-validate across the returnedccp_alphas, and plot cross-validated accuracy againstalpha. Report thealphayou would choose and the number of leaves in the resulting tree. -
Build a regression tree on a small dataset of used-car prices (features: age, mileage). Predict the price of a car whose age is larger than any in the training set. Explain the number you get in terms of leaf means and extrapolation.
-
Compute both
feature_importances_and permutation importance (sklearn.inspection.permutation_importance) for your loan tree. Where do the two rankings disagree, and which would you trust for a stakeholder report?
Summary
In this chapter you learned:
- A decision tree predicts by asking a sequence of single-feature if/else questions, partitioning the feature space into rectangular regions.
- Trees have a root node, internal (decision) nodes, and leaf nodes that hold the final prediction.
- Splits are chosen to maximize purity: Gini impurity (
Gini = 1 - Σ pᵢ²) or entropy with information gain (IG = Entropy(parent) - weighted Entropy(children)). Gini and entropy usually give similar trees. - The tree is built by greedy recursive partitioning — locally best splits, no lookahead.
- Regression trees minimize within-node variance/MSE and predict the mean of a leaf, producing a staircase of constants.
- Unpruned trees overfit; control them with
max_depth,min_samples_split,min_samples_leaf,max_leaf_nodes, and post-pruning viaccp_alpha(cost-complexity pruning), tuned with cross-validation. - Feature importance ranks features by total impurity reduction; prefer permutation importance for a less biased view.
- Trees are highly interpretable, need no feature scaling, and handle non-linearities — but they are unstable and cannot extrapolate.
- Common pitfalls: deep unpruned trees, trusting a single tree's structure, expecting extrapolation, over-reading impurity importance, and ignoring class imbalance.
A single tree is powerful and transparent, but its instability limits its accuracy on its own.
Next up: Random Forests — combine hundreds of decorrelated decision trees by averaging their votes to tame variance and dramatically boost accuracy while keeping much of the tree's flexibility.