Chapter 19 of 20

Ensemble Methods — Bagging, Boosting & XGBoost

Learn how bagging, boosting, and stacking combine many weak learners into one strong model — AdaBoost and gradient boosting explained, XGBoost, LightGBM and CatBoost, voting classifiers, key hyperparameters, and how to avoid overfitting with early stopping.

Meritshot · 17 min read
Tags: Machine Learning, Ensemble, Bagging, Boosting, XGBoost, Gradient Boosting, Stacking

What Are Ensemble Methods?

An ensemble method combines the predictions of many individual models, called base learners or weak learners, into a single, stronger model. The core idea is deceptively simple: a crowd of mediocre models, pooled the right way, often beats the best single model you could build on its own.

The reason is statistical. Each individual model makes errors. If those errors are at least partly independent of one another, then combining many models causes the errors to partly cancel out while the correct signal reinforces. You keep the collective wisdom and throw away much of the individual noise. This is why ensembles like Random Forests and XGBoost have dominated tabular-data competitions on Kaggle and platforms like Analytics Vidhya for years — for structured business data (churn, credit risk, demand forecasting), a well-tuned gradient-boosted ensemble is very often the model to beat.
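To make the error-cancelling argument concrete, here is a minimal sketch that computes how often a majority vote is right when every model is correct with the same probability and the models err fully independently. Full independence is an idealising assumption (real models trained on the same data are correlated), and the function name majority_vote_accuracy is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models is correct.

    Assumes each model is right with probability p, independently of
    the others, and that n_models is odd so ties cannot occur.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each model alone is only 60% accurate; the committee does better as it grows.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With p fixed at 0.5 the vote never improves on a coin flip, which is why the base learners need to be at least slightly better than chance.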

Intuitive analogy. Imagine a bank in Mumbai deciding whether to approve a loan for a customer named Priya. One loan officer, deciding alone, might be swayed by a bad morning or a single red flag. But a committee — a credit analyst, a risk officer, and a branch manager, each weighing different evidence — reaches a decision that is far more reliable than any one of them alone. An ensemble is that committee. The interesting part is how the committee is assembled, and that gives us three distinct families.

The three families of ensembles:
→ Bagging   — train many models in PARALLEL on random data samples, then average/vote
→ Boosting  — train many models SEQUENTIALLY, each fixing the previous ones' mistakes
→ Stacking  — train a META-MODEL that learns how to best combine several base models

You have already met one ensemble in depth: the Random Forest (see the Random Forests chapter) is the flagship bagging method. This chapter zooms out to the whole landscape and then dives deep into boosting and the gradient-boosting libraries — XGBoost, LightGBM, and CatBoost — that win competitions.

Why Ensembles Win: The Bias–Variance View

To understand why the three families work differently, recall the bias–variance decomposition from the Bias-Variance, Overfitting & Regularization chapter. A model's expected error breaks into three parts:

Expected error  =  Bias²  +  Variance  +  Irreducible noise

Bias      = error from wrong/too-simple assumptions (underfitting)
Variance  = error from sensitivity to the exact training sample (overfitting)
Noise     = irreducible randomness in the data itself

Different ensembles attack different terms:

  • Bagging reduces variance. A single deep decision tree has low bias but high variance — it fits the training data well but changes wildly if you perturb the data. Averaging many such trees, each trained on a different bootstrap sample, cancels the variance while leaving the (already low) bias roughly unchanged.
  • Boosting reduces bias. Boosting starts with weak, high-bias learners (typically shallow trees) and adds them one at a time, each new learner correcting the residual errors of the ensemble so far. The committee grows progressively more accurate, driving bias down.
  • Stacking can reduce both, by letting a meta-model learn where each base model is trustworthy.

Rule of thumb:
If your base model overfits (low bias, high variance)  → use BAGGING (average it down)
If your base model underfits (high bias, low variance) → use BOOSTING (add learners)

Bagging: Parallel and Variance-Reducing

Bagging stands for bootstrap aggregating. It trains each base model on a bootstrap sample — rows drawn from the training set with replacement — and then aggregates: a majority vote for classification, or the average for regression. Because the trees are independent, you can train them fully in parallel.
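As a small illustration of what "with replacement" means in practice, the sketch below draws one bootstrap sample of row indices with NumPy and checks how many distinct rows it contains. The variable names are illustrative only; the roughly 63% figure follows from 1 − (1 − 1/n)ⁿ ≈ 1 − 1/e for large n.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000

# Draw n_rows indices WITH replacement: some rows repeat, others never appear.
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

unique_fraction = np.unique(bootstrap_idx).size / n_rows

# Rows that were never drawn are "out-of-bag" for this particular model.
oob_mask = np.ones(n_rows, dtype=bool)
oob_mask[bootstrap_idx] = False

print(f"Distinct rows in the bootstrap sample: {unique_fraction:.1%}")
print(f"Out-of-bag rows: {oob_mask.mean():.1%}")
```

The out-of-bag rows are what makes out-of-bag error estimates possible: each model can be scored on rows it never saw.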

The variance reduction is real and quantifiable. If you average B models each with variance σ², and their pairwise correlation is ρ, the variance of the average is:

Var(average)  =  ρ·σ²  +  (1 − ρ)/B · σ²

As B → ∞, the second term vanishes, leaving ρ·σ².
Lesson: to keep reducing variance, you must keep correlation ρ LOW.

That last line is exactly why the Random Forest adds a second trick on top of plain bagging — at each split it considers only a random subset of features — to force the trees to be decorrelated (small ρ). See the Random Forests chapter for the full treatment, including out-of-bag error and feature importances.
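The variance formula above can be checked numerically. The sketch below simulates B model errors that share a common component (so any two have correlation ρ) and compares the empirical variance of their average with ρ·σ² + (1 − ρ)/B · σ². The Gaussian error model and both function names are assumptions made for this illustration.

```python
import numpy as np


def variance_of_average(rho: float, sigma2: float, n_models: int) -> float:
    """Closed-form variance of the mean of n_models equally correlated errors."""
    return rho * sigma2 + (1.0 - rho) / n_models * sigma2


def simulated_variance(rho, sigma2, n_models, n_trials=50_000, seed=0):
    """Empirical variance of the averaged error under a shared-component model."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_trials, 1))           # component common to all models
    private = rng.normal(size=(n_trials, n_models))   # component unique to each model
    # Each column has variance sigma2; any two columns have correlation rho.
    errors = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)
    return float(errors.mean(axis=1).var())


for B in (1, 10, 100):
    formula = variance_of_average(0.3, 1.0, B)
    empirical = simulated_variance(0.3, 1.0, B)
    print(f"B={B:>3}  formula={formula:.4f}  simulated={empirical:.4f}")
```

Adding more models only shrinks the second term; the floor ρ·σ² stays put, which is the numerical version of the lesson about keeping ρ low.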

You can bag any base estimator, not just trees, using scikit-learn's BaggingClassifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Bag 200 unpruned trees, each on a bootstrap sample of the rows
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # any estimator works
    n_estimators=200,
    max_samples=1.0,      # bootstrap sample size = full dataset
    bootstrap=True,       # sample WITH replacement
    n_jobs=-1,            # trees train in parallel
    random_state=42,
)
scores = cross_val_score(bag, X, y, cv=5, scoring="accuracy")
print(f"Bagging CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
Bagging CV accuracy: 0.918 +/- 0.011   (illustrative — a single tree would be lower and more variable)

Boosting: Sequential and Bias-Reducing

Boosting flips bagging on its head. Instead of many strong, independent models trained in parallel, boosting trains many weak learners sequentially, where each new learner focuses on the examples the current ensemble gets wrong. The final prediction is a weighted sum of all the learners.

The typical weak learner is a decision stump or a shallow tree (max_depth of 1 to 6). On its own each one is only slightly better than a coin flip, but added in sequence they compound into a highly accurate model. There are two classic recipes for "focus on the mistakes."

AdaBoost: Reweighting the Hard Examples

AdaBoost (Adaptive Boosting) works by re-weighting the training rows. Every row starts with equal weight. After each weak learner is trained, AdaBoost increases the weight of the rows it misclassified and decreases the weight of the ones it got right, so the next learner is pushed to concentrate on the hard cases.

AdaBoost (binary, labels in {-1, +1}) — the loop:

1. Initialise sample weights:  wᵢ = 1/n  for all i
2. For m = 1 ... M:
   a. Fit weak learner hₘ using the current weights wᵢ
   b. Weighted error:   errₘ = Σ wᵢ·[hₘ(xᵢ) ≠ yᵢ] / Σ wᵢ
   c. Learner weight:    αₘ = 0.5 · ln((1 − errₘ) / errₘ)
   d. Update weights:    wᵢ ← wᵢ · exp(−αₘ · yᵢ · hₘ(xᵢ))   then renormalise
3. Final model:  H(x) = sign( Σ αₘ · hₘ(x) )

Notice that a more accurate learner (small errₘ) earns a larger αₘ — a bigger vote in the final sum. AdaBoost minimises an exponential loss and is beautifully simple, but it is sensitive to noisy labels and outliers, because those rows get their weights blown up round after round.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=300,
    learning_rate=0.5,
    random_state=42,
)
ada.fit(X, y)
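For readers who want to see the reweighting loop itself, here is a from-scratch sketch of the AdaBoost steps listed earlier, using scikit-learn decision stumps as the weak learners. It is a teaching sketch under stated assumptions, not how AdaBoostClassifier is implemented internally: the helper names adaboost_fit and adaboost_predict are invented here, and the error is clipped away from 0 and 1 purely to avoid dividing by zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost with decision stumps. Labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 1: equal weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # step 2a: fit with current weights
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)        # step 2b: weighted error
        err = np.clip(err, 1e-10, 1 - 1e-10)             # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)            # step 2c: learner weight
        w = w * np.exp(-alpha * y * pred)                # step 2d: up-weight mistakes
        w = w / w.sum()                                  # renormalise
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas


def adaboost_predict(learners, alphas, X):
    """Step 3: sign of the alpha-weighted vote (ties go to +1)."""
    score = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.where(score >= 0, 1, -1)


X_demo, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y_demo = np.where(y01 == 1, 1, -1)               # map {0, 1} labels to {-1, +1}

learners, alphas = adaboost_fit(X_demo, y_demo, n_rounds=50)
pred_demo = adaboost_predict(learners, alphas, X_demo)
boosted_acc = np.mean(pred_demo == y_demo)
stump_acc = np.mean(learners[0].predict(X_demo) == y_demo)  # first stump = plain stump
print(f"Single stump training accuracy: {stump_acc:.3f}")
print(f"Boosted (50 stumps) training accuracy: {boosted_acc:.3f}")
```

These are training accuracies, shown only to make the compounding visible; any real comparison should use held-out data.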

Gradient Boosting: Fitting the Residuals

Gradient Boosting generalises the idea. Rather than reweighting rows, each new tree is trained to predict the residuals — the errors — of the ensemble built so far. Formally, each tree fits the negative gradient of the loss function, which is why it works for any differentiable loss (squared error for regression, log-loss for classification, and more).

Gradient Boosting (regression with squared error) — the intuition:

1. Start with a constant prediction:  F₀(x) = mean(y)
2. For m = 1 ... M:
   a. Compute residuals (pseudo-gradients):  rᵢ = yᵢ − Fₘ₋₁(xᵢ)
   b. Fit a small tree hₘ to predict those residuals rᵢ
   c. Update:  Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x)
3. Final model:  F(x) = F₀(x) + η · Σ hₘ(x)

η (eta) = the LEARNING RATE (also called shrinkage), typically 0.01 to 0.3

The learning rate η is the single most important lever. It scales down each tree's contribution so the ensemble takes small, careful steps toward the answer. A small η needs more trees but usually generalises better; a large η learns fast but tends to overfit. The safe recipe is a small learning rate plus many trees plus early stopping (more on that below).
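The residual-fitting loop above can be written out in a few lines. The following is a minimal sketch for squared-error regression, where the residual y − F(x) is exactly the negative gradient of ½(y − F(x))². The helper names gradient_boost_fit and gradient_boost_predict are hypothetical, and the sketch omits everything a real library adds (subsampling, regularization, early stopping).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor


def gradient_boost_fit(X, y, n_rounds=100, eta=0.1, max_depth=2):
    """Squared-error gradient boosting: each tree fits the current residuals."""
    f0 = float(np.mean(y))                       # step 1: constant start F0
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                     # step 2a: residuals of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)                   # step 2b: small tree predicts residuals
        pred = pred + eta * tree.predict(X)      # step 2c: shrunken update
        trees.append(tree)
    return f0, trees


def gradient_boost_predict(f0, trees, X, eta=0.1):
    """Step 3: F(x) = F0 + eta * sum of tree outputs (eta must match training)."""
    return f0 + eta * sum(tree.predict(X) for tree in trees)


X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
f0, trees = gradient_boost_fit(X_reg, y_reg, n_rounds=100, eta=0.1)

mse_constant = float(np.mean((y_reg - f0) ** 2))
mse_boosted = float(np.mean((y_reg - gradient_boost_predict(f0, trees, X_reg, eta=0.1)) ** 2))
print(f"Training MSE, constant prediction: {mse_constant:.1f}")
print(f"Training MSE, after 100 rounds:    {mse_boosted:.1f}")
```

Rerunning with a larger eta drives the training error down in fewer rounds, which is the speed-versus-overfitting trade-off in miniature.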

Here is GradientBoostingClassifier from scikit-learn on a realistic churn-style problem:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Imagine features: tenure, monthly_charges, num_support_calls, etc.
X, y = make_classification(
    n_samples=5000, n_features=25, n_informative=10,
    weights=[0.75, 0.25], random_state=42,   # 25% churners
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

gbc = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,          # shallow trees = weak learners
    subsample=0.8,        # stochastic gradient boosting: 80% of rows per tree
    random_state=42,
)
gbc.fit(X_train, y_train)

proba = gbc.predict_proba(X_test)[:, 1]
print(f"Test ROC-AUC: {roc_auc_score(y_test, proba):.3f}")
print(classification_report(y_test, gbc.predict(X_test)))
Test ROC-AUC: 0.94   (illustrative)
              precision    recall  f1-score   support
           0       0.92      0.95      0.93       750
           1       0.83      0.75      0.79       250
    accuracy                           0.90      1000

Note the use of predict_proba and ROC-AUC rather than raw accuracy — with a 75/25 class imbalance, accuracy alone would be misleading (see the Model Evaluation Metrics chapter).
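To see why accuracy misleads here, consider a "model" that predicts the majority class for every customer. The sketch below uses a synthetic label vector with the same 75/25 split; the arrays are made up for illustration and no classifier is trained.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 750 non-churners (0) and 250 churners (1), mirroring the 75/25 split.
y_true = np.array([0] * 750 + [1] * 250)

# A useless baseline that never predicts churn.
y_always_negative = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_negative)
rec = recall_score(y_true, y_always_negative)   # recall for the churn class (label 1)

print(f"Accuracy: {acc:.2f}")      # 0.75, looks respectable
print(f"Churn recall: {rec:.2f}")  # 0.00, not a single churner is caught
```

Any accuracy figure on this data should therefore be read against the 0.75 baseline.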

XGBoost, LightGBM & CatBoost: Modern Gradient Boosting

Plain GradientBoostingClassifier is correct but slow and lightly regularized. The libraries that actually win competitions are optimized, regularized re-implementations of the same gradient-boosting idea.

  • XGBoost (eXtreme Gradient Boosting) adds an explicit regularization term on tree complexity to the loss, uses second-order (Newton) gradient information, handles missing values natively, and supports built-in early stopping. It is the long-standing default for tabular problems.
  • LightGBM (from Microsoft) grows trees leaf-wise (splitting the leaf with the largest loss reduction) instead of level-wise, and buckets features into histograms. It is typically much faster on large datasets and uses less memory, which makes it a good fit when you have millions of rows.
  • CatBoost (from Yandex) handles categorical features natively with a clever ordered target-encoding scheme, so you often skip one-hot encoding entirely. It has strong out-of-the-box defaults and is excellent for datasets full of categorical columns (city, product category, plan type).

All three share the gradient-boosting core and expose nearly the same key hyperparameters. Here is XGBoost with early stopping, the single most important practice for boosting:

from xgboost import XGBClassifier

# Split off a validation set to watch performance and stop at the right time
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

xgb = XGBClassifier(
    n_estimators=2000,          # an UPPER BOUND — early stopping picks the real number
    learning_rate=0.03,
    max_depth=4,
    subsample=0.8,              # row sampling (stochastic boosting)
    colsample_bytree=0.8,       # feature sampling per tree
    reg_lambda=1.0,             # L2 regularization on leaf weights
    reg_alpha=0.0,              # L1 regularization
    eval_metric="auc",
    early_stopping_rounds=50,   # stop if val AUC does not improve for 50 rounds (constructor argument in xgboost >= 1.6)
    n_jobs=-1,
    random_state=42,
)
xgb.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)

# best_iteration is a 0-based index, so the number of trees kept is best_iteration + 1
print(f"Best iteration (0-based): {xgb.best_iteration}")
print(f"Test ROC-AUC: {roc_auc_score(y_test, xgb.predict_proba(X_test)[:, 1]):.3f}")
Best iteration (0-based): 412   (illustrative; far fewer than the 2000 cap)
Test ROC-AUC: 0.95   (illustrative)

The pattern is: set n_estimators deliberately high, use a small learning_rate, and let early_stopping_rounds decide when to quit by watching a held-out validation set. This gives you the accuracy of many trees without the overfitting of forcing all of them.
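What early stopping is looking for can be made visible with scikit-learn alone. The sketch below trains a fixed number of trees and then uses staged_predict_proba to score the held-out set after every boosting round, picking the round with the best validation AUC. This is an after-the-fact illustration under assumed synthetic data, not the patience-based rule that early_stopping_rounds applies during training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_es, y_es = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_es, y_es, test_size=0.25, stratify=y_es, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X_fit, y_fit)

# staged_predict_proba yields the ensemble's probabilities after 1, 2, ..., 300 trees.
val_auc = [
    roc_auc_score(y_hold, proba[:, 1])
    for proba in model.staged_predict_proba(X_hold)
]

best_n_trees = int(np.argmax(val_auc)) + 1   # +1 because stages are counted from 1
print(f"Best validation AUC {max(val_auc):.3f} at {best_n_trees} trees")
print(f"Validation AUC with all 300 trees: {val_auc[-1]:.3f}")
```

Plotting val_auc against the round number gives the usual picture: a fast rise, a plateau, and sometimes a slow decline once the extra trees only fit noise.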

Stacking: A Meta-Model Over Base Models

Stacking (stacked generalization) is the third family. Instead of averaging or sequencing, it trains a meta-model (also called a blender) whose inputs are the predictions of several diverse base models. The base learners might be a Random Forest, an SVM, a logistic regression, and an XGBoost model; the meta-model — often a simple logistic regression — learns how much to trust each one and in which situations.

The critical detail is avoiding leakage: the meta-model must be trained on out-of-fold predictions (predictions made by base models on data they did not train on), which scikit-learn's StackingClassifier handles for you via internal cross-validation.

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

base_learners = [
    ("rf",  RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)),
    ("svm", SVC(probability=True, random_state=42)),
    ("gb",  GradientBoostingClassifier(random_state=42)),
]

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # the meta-model / blender
    cv=5,                                   # out-of-fold predictions prevent leakage
    n_jobs=-1,
)
stack.fit(X_train, y_train)
print(f"Stacking test AUC: {roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]):.3f}")

Stacking often squeezes out the last fraction of a percent of accuracy, which is why it appears on winning Kaggle leaderboards — but it is heavier to train and serve, so weigh the complexity against the gain.
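To show what "out-of-fold predictions" means mechanically, the sketch below builds the meta-model's training matrix by hand with cross_val_predict. It is a simplified illustration of the idea, not a description of StackingClassifier's internals; the three lightweight base models and all variable names are chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_st, y_st = make_classification(n_samples=1000, n_features=15, random_state=0)

base_models = [
    DecisionTreeClassifier(max_depth=4, random_state=0),
    GaussianNB(),
    LogisticRegression(max_iter=1000),
]

# For every row, each base model's probability comes from a fold in which
# that row was held out, so no prediction is made on data the model trained on.
oof_columns = [
    cross_val_predict(m, X_st, y_st, cv=5, method="predict_proba")[:, 1]
    for m in base_models
]
meta_features = np.column_stack(oof_columns)     # shape: (n_rows, n_base_models)

# The blender learns how much weight each base model's probability deserves.
meta_model = LogisticRegression().fit(meta_features, y_st)
print("Meta-model coefficients per base model:", meta_model.coef_.round(2))
```

To score new data, each base model would then be refit on the full training set and its predictions passed through meta_model.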

Voting Classifiers: The Simplest Ensemble

Before reaching for stacking, try a voting classifier — the lightest way to combine several fitted models. There are two flavours.

  • Hard voting: each model casts one vote for a class label; the majority label wins.
  • Soft voting: each model outputs class probabilities; you average the probabilities and pick the class with the highest average. Soft voting usually beats hard voting because it uses the models' confidence, not just their final label. It requires every base model to implement predict_proba, and it works best when those probabilities are reasonably well calibrated.

from sklearn.ensemble import VotingClassifier

voting = VotingClassifier(
    estimators=base_learners,
    voting="soft",              # average predicted probabilities
    weights=[2, 1, 2],          # optionally trust some models more
    n_jobs=-1,
)
voting.fit(X_train, y_train)
print(f"Soft-voting test AUC: {roc_auc_score(y_test, voting.predict_proba(X_test)[:, 1]):.3f}")
Hard voting: majority label wins       — good when models only give labels
Soft voting: average probabilities     — usually better, needs predict_proba
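A tiny worked example shows how the two rules can disagree. The three probabilities below are invented for illustration: two models lean weakly towards class 0 while one is very confident in class 1.

```python
import numpy as np

# Predicted probability of class 1 from three hypothetical models, for one row.
proba_class1 = np.array([0.45, 0.45, 0.90])

# Hard voting: threshold each model at 0.5, then take the majority label.
hard_votes = (proba_class1 >= 0.5).astype(int)              # [0, 0, 1]
hard_label = int(hard_votes.sum() > len(hard_votes) / 2)    # 1 vote out of 3 -> class 0

# Soft voting: average the probabilities first, then threshold.
soft_label = int(proba_class1.mean() >= 0.5)                # mean 0.60 -> class 1

print(f"Hard vote: class {hard_label}")   # class 0
print(f"Soft vote: class {soft_label}")   # class 1
```

Whether the soft vote is the better answer here depends on how trustworthy that 0.90 is, which is why calibration matters for soft voting.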

Key Hyperparameters for Boosting

The hyperparameters below account for most of the tuning effort. Understanding how they interact is what separates a strong result from an overfit one.

| Hyperparameter | What it controls | Typical range | Effect if too high | Effect if too low |
| --- | --- | --- | --- | --- |
| n_estimators | Number of boosting rounds (trees) | 100 to 2000+ | Overfits (memorises training set) | Underfits (stops too early) |
| learning_rate | Shrinkage per tree; 0 < lr ≤ 1 | 0.01 to 0.3 | Overfits, unstable | Very slow; needs many trees |
| max_depth | Depth of each weak tree | 3 to 8 | Overfits; trees too complex | Underfits; can't capture interactions |
| subsample | Fraction of rows per tree | 0.6 to 1.0 | Less regularization | Trees see too little data |
| colsample_bytree | Fraction of features per tree | 0.6 to 1.0 | Less decorrelation | Trees starved of signal |
| reg_lambda / reg_alpha | L2 / L1 penalty on leaf weights | 0 to 10 | Underfits (over-penalised) | Little regularization |

The most important relationship: learning_rate and n_estimators trade off. Halve the learning rate and you roughly need to double the number of trees. The standard workflow is to fix a small learning rate, set n_estimators high, and let early stopping find the right count on a validation set.

Bagging vs Boosting vs Stacking

| Aspect | Bagging | Boosting | Stacking |
| --- | --- | --- | --- |
| Training order | Parallel (independent) | Sequential (dependent) | Base parallel, meta after |
| Primarily reduces | Variance | Bias | Both |
| Base learners | Strong (deep trees) | Weak (shallow trees) | Diverse, mixed types |
| How combined | Vote / average | Weighted sum | Learned meta-model |
| Sensitivity to noise | Robust | Sensitive (can overfit) | Depends on base models |
| Flagship example | Random Forest | XGBoost, LightGBM, AdaBoost | Kaggle blends |
| Overfitting risk | Low | Higher (needs early stopping) | Moderate |
| Speed to train | Fast (parallel) | Slower (sequential) | Slowest |

Common Mistakes

1. Too many boosting rounds without early stopping

Setting n_estimators = 5000 with no validation set and no early stopping is
the classic boosting failure. Boosting keeps reducing TRAINING error round
after round; past a point it is only memorising noise. Always pass an
eval_set and early_stopping_rounds, or tune n_estimators with cross-validation.

2. Learning rate too high

A large learning_rate (say 0.5) makes each tree take a big, greedy step.
The ensemble converges fast on the training data but generalises poorly.
Prefer a small learning_rate (0.01 to 0.1) with more trees + early stopping.

3. Bagging a low-variance model (or boosting a high-variance one)

Bagging shines on HIGH-variance base learners (deep trees). Bagging a
logistic regression barely helps — there's little variance to average away.
Likewise, boosting DEEP trees defeats the purpose: boosting wants WEAK,
high-bias learners (shallow trees) to correct sequentially.

4. Leaking data into a stacking meta-model

If the meta-model trains on base-model predictions made on the SAME rows the
base models were trained on, it sees inflated, over-optimistic inputs and
overfits. Always use out-of-fold predictions (StackingClassifier cv=... does
this automatically). Never blend on in-sample predictions.

5. Reading feature importances as causation

XGBoost/RF feature importances tell you what the model USED to split, not
what CAUSES the outcome. Correlated features can steal each other's
importance. Use them for rough intuition; for reliable attributions prefer
permutation importance or SHAP values.

6. Using accuracy on imbalanced data

On a 95/5 fraud dataset, a model that always predicts "not fraud" scores 95%
accuracy while catching zero fraud. Judge boosted ensembles with ROC-AUC,
PR-AUC, precision/recall, or F1 — see the Model Evaluation Metrics chapter.
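As a follow-up to mistake 5, here is a minimal sketch of permutation importance using scikit-learn's permutation_importance on held-out data. The synthetic dataset is an assumption chosen so the ground truth is known: with shuffle=False, the first three columns are the informative ones and the remaining five are pure noise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2; columns 3-7 are noise.
X_pi, y_pi = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_pi_train, X_pi_test, y_pi_train, y_pi_test = train_test_split(
    X_pi, y_pi, test_size=0.3, stratify=y_pi, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_pi_train, y_pi_train)

# Shuffle one column at a time on HELD-OUT data and measure the drop in score.
result = permutation_importance(
    forest, X_pi_test, y_pi_test, n_repeats=5, random_state=0
)

for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {i}: {mean:.3f} +/- {std:.3f}")
```

Permutation importance is still a statement about the model, not about causation, and strongly correlated features can still share or mask each other's scores.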

Practice Exercises

  1. Bagging vs single tree. Train a single unpruned DecisionTreeClassifier and a BaggingClassifier of 200 such trees on the same dataset using 5-fold cross-validation. Report the mean and standard deviation of accuracy for each. Which term of the bias–variance decomposition did bagging reduce, and how can you tell?

  2. Learning-rate / n_estimators trade-off. Train a GradientBoostingClassifier with learning_rate = 0.3 and n_estimators = 100, then again with learning_rate = 0.03 and n_estimators = 1000. Compare test ROC-AUC. Which generalises better, and why?

  3. Early stopping. Fit an XGBClassifier with n_estimators = 3000, a small learning rate, an eval_set, and early_stopping_rounds = 50. Print best_iteration. How many trees did it actually keep versus the cap you set?

  4. Soft vs hard voting. Build a VotingClassifier over a Random Forest, an SVM (probability=True), and a logistic regression, once with voting="hard" and once with voting="soft". Which performs better on your validation set, and what does that tell you about the value of probability estimates?

  5. Stacking. Construct a StackingClassifier with those same three base learners and a LogisticRegression meta-model. Does it beat the best single base learner? By how much, and is the extra complexity worth it for a production system?

  6. Categorical data. Take a dataset with several categorical columns (for example: city, plan type, payment method). Compare CatBoost (categorical features passed directly) against XGBoost with one-hot encoding. Which is easier to set up and which scores higher?

Summary

In this chapter you learned:

  • Ensembles combine many weak learners into one strong model; their partly-independent errors cancel, keeping the signal.
  • Bagging trains models in parallel on bootstrap samples and averages/votes — it reduces variance; the Random Forest is its flagship.
  • Boosting trains weak learners sequentially, each correcting the previous ones' errors — it reduces bias.
  • AdaBoost re-weights the misclassified rows; Gradient Boosting fits each new tree to the residuals (negative gradient) of the ensemble so far.
  • XGBoost, LightGBM, and CatBoost are fast, regularized gradient-boosting libraries that dominate tabular competitions; XGBoost is the default, LightGBM is fastest on huge data, CatBoost handles categoricals natively.
  • Stacking trains a meta-model on the out-of-fold predictions of diverse base models; voting (hard vs soft) is the lightweight alternative.
  • Key knobs: n_estimators, learning_rate, and max_depth; a small learning rate + many trees + early stopping is the safe recipe.
  • Common pitfalls: boosting too many rounds without early stopping, learning rate too high, bagging low-variance models, leaking data into a stack, and judging imbalanced problems by accuracy.

Ensembles of decision trees are the workhorses of tabular machine learning — for structured business data they are very often the strongest model you can train. But some problems (images, audio, text, sequences) have structure that trees cannot exploit, and there a different architecture takes over.

Next up: Introduction to Neural Networks & Deep Learning — how layers of artificial neurons learn hierarchical representations directly from raw data, powering the modern era of computer vision and language models.