Chapter 18 of 20

Bias-Variance, Overfitting & Regularization

Understand the bias-variance trade-off, diagnose underfitting and overfitting with learning curves, and fight variance using more data, simpler models, cross-validation, early stopping, and L1/L2/ElasticNet regularization.

Meritshot · 17 min read
Tags: Machine Learning, Overfitting, Underfitting, Bias-Variance, Regularization, Ridge, Lasso

The Central Tension in Machine Learning

Every model you build sits somewhere on a spectrum between two failure modes:

  • Underfitting: the model is too simple to capture the real pattern. It does badly on training data and on new data.
  • Overfitting: the model is too complex and memorises the training data — including its noise. It does great on training data but poorly on new data.

The whole art of modelling is landing in the sweet spot between them, where the model is complex enough to learn the signal but not so complex that it learns the noise.

Analogy — the exam student. Imagine Priya preparing for an exam.

Underfitting student:
  Reads only the chapter titles. Understands nothing deeply.
  → Scores badly on practice tests AND the real exam.

Overfitting student:
  Memorises last year's exact question paper word-for-word.
  → Scores 100% on last year's paper, but the real exam has
    different questions → scores badly.

Good student:
  Understands the underlying concepts.
  → Scores well on practice tests AND on the real, unseen exam.

Generalization — performing well on data the model has never seen — is the goal. Everything in this chapter is about measuring and improving generalization.

The Bias-Variance Trade-off

The generalization error of a model can be broken down into three parts. This decomposition explains why underfitting and overfitting happen and why fixing one often worsens the other.

Expected prediction error = Bias² + Variance + Irreducible Error

Bias      = error from wrong assumptions (model too simple).
            A high-bias model consistently misses the true pattern
            in the same direction, no matter which training set it sees.

Variance  = error from sensitivity to the training data (model too complex).
            A high-variance model changes drastically when you swap
            in a slightly different training sample.

Irreducible error = noise in the data itself (measurement error,
            randomness). No model can remove this.

Think of it with a dartboard.

              Low Variance          High Variance
            (tight grouping)      (scattered darts)

Low Bias    Bullseye cluster       Spread around bullseye
(on target)   ← IDEAL               ← overfitting risk

High Bias   Tight cluster,         Scattered AND
(off        but off-centre         off-centre
 target)     ← underfitting         ← worst case

Why It's a Trade-off

As you increase model complexity (deeper trees, higher-degree polynomials, more features), bias falls but variance rises. As you decrease complexity, bias rises but variance falls. You cannot minimise both by tuning complexity alone — you have to find the balance point where their sum is smallest.

Error
  |\                              /
  | \                          /  ← Variance (rises with complexity)
  |  \                      /
  |   \  Total error     /
  |    \    ___       /
  |     \_/    \_ _ /   ← minimum total error = SWEET SPOT
  |     /          \
  |   /             \___
  | /  Bias (falls with complexity)  \___
  |/________________________________________
      Low          Model Complexity      High
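The trade-off can also be checked numerically. The sketch below is a minimal simulation under assumptions that are not from this chapter: the "true" function is a sine wave, the noise level is 0.3, and the models are polynomials of degree 1, 4 and 10. It refits each model on many fresh training sets and estimates bias² and variance at fixed test points.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground-truth signal, chosen only for illustration
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.05, 0.95, 50)        # fixed evaluation points
n_train, n_repeats, noise_sd = 30, 200, 0.3

results = {}
for degree in (1, 4, 10):
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Draw a fresh noisy training set and refit the model
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + rng.normal(0, noise_sd, n_train)
        poly = np.polynomial.Polynomial.fit(x, y, degree)
        preds[r] = poly(x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_fn(x_test)) ** 2))  # systematic miss
    variance = float(np.mean(preds.var(axis=0)))                  # spread across refits
    results[degree] = (bias_sq, variance)
    print(f"degree={degree:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

In a run like this, the straight line should show the largest bias² and the degree-10 polynomial the largest variance, which is the pattern the diagram describes.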

The good news: regularization, more data, and ensembles can bend this curve — letting you reduce variance without adding much bias. That is the real payoff of the techniques below.

Symptoms: Reading the Train vs Validation Gap

You diagnose bias and variance by comparing performance on the training set (data the model learned from) against a held-out validation set (data it never saw). See the Train-Test Split & Cross-Validation chapter for how to build these splits properly.

Symptom                Training score   Validation score   Diagnosis                     What it means
------------------------------------------------------------------------------------------------------------------------------
Both poor              Low              Low                Underfitting (high bias)      Model too simple; add complexity/features
Big gap                High             Low                Overfitting (high variance)   Model memorised noise; simplify/regularize
Both good, close       High             High               Good fit                      Ship it (and keep monitoring)
Both great, tiny gap   Very high        Very high          Suspiciously good             Check for data leakage

A concrete illustrative example on a customer-churn dataset:

Model                     Train Accuracy   Validation Accuracy   Gap
----------------------------------------------------------------------
Logistic (few features)      0.71               0.70            0.01   ← underfit
Decision tree (max_depth=3)  0.79               0.77            0.02   ← good
Decision tree (unlimited)    1.00               0.72            0.28   ← overfit!

The unlimited tree scores a perfect 1.00 on training — a red flag. That gap of 0.28 is the fingerprint of overfitting.
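To produce a table like this yourself, a minimal sketch follows. It uses a synthetic dataset from `make_classification` as a stand-in for the churn data (an assumption; the chapter's dataset is not available here), so the numbers will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% randomly flipped labels (noise to memorise)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for depth in (1, 3, None):                 # None = grow the tree without limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    scores[depth] = (train_acc, val_acc)
    print(f"max_depth={str(depth):4s}  train={train_acc:.2f}  "
          f"val={val_acc:.2f}  gap={train_acc - val_acc:.2f}")
```

The unlimited tree reaches perfect training accuracy because it can isolate every training point, flipped labels included; the gap column is what to watch.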

Learning Curves

A learning curve plots training and validation scores as you increase the amount of training data. Its shape tells you which problem you have.

HIGH BIAS (underfitting):        HIGH VARIANCE (overfitting):

score                            score
  | train ___________            | train ________________
  |      /                       |
  |     /  val ______            |          val ____
  |    / __/                     |      ___/
  |   //                         |   __/
  |__/___________ data           |__/______________ data
                                    
  Both curves plateau LOW        Wide, persistent GAP between
  and CLOSE together.            train (high) and val (low).
  → More data won't help.        → More data likely WILL help.
  → Add complexity/features.     → The curves converge as data grows.

  • Curves converge at a low score → high bias. Adding more data is a waste; you need a more expressive model or better features.
  • A large gap that shrinks slowly as data grows → high variance. More data helps; so does regularization.
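scikit-learn can compute the numbers behind a learning curve with `learning_curve`. The sketch below assumes a synthetic dataset and an unpruned decision tree (both chosen here for illustration, not taken from the chapter) and prints the mean scores instead of plotting them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# Refit the model on 10%, 32.5%, ..., 100% of each CV training fold
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

# Each row is one training size; columns are the 5 CV folds
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n_train={n:5d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
```

Plot the two mean columns against `sizes` to get the picture above. A gap that stays wide points to high variance; two curves meeting at a low score point to high bias.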

How to Fight Overfitting

Once you have diagnosed high variance, you have a toolkit. Roughly in order of "reach for this first":

  1. Get more (and more varied) training data. The single most reliable cure for variance. More examples make it harder for the model to memorise noise.
  2. Use a simpler model. Fewer parameters, a shallower tree (max_depth), a lower-degree polynomial. Less capacity to memorise.
  3. Feature selection / reduce dimensionality. Drop irrelevant features; they are pure opportunity for the model to fit noise. See Dimensionality Reduction & PCA and Feature Engineering & Scaling.
  4. Cross-validation. Doesn't cure overfitting directly, but gives an honest estimate of generalization so you tune toward the sweet spot instead of the training score.
  5. Early stopping. For iterative learners (gradient descent, boosting, neural nets), stop training when validation error starts rising — before the model begins memorising.
  6. Regularization. Add a penalty that discourages large/complex parameters. The most surgical tool, and the focus of the rest of this chapter.
  7. Ensembles (bagging, boosting), covered in the next chapter. Bagging averages many models to cut variance; boosting mainly targets bias.
  8. Dropout / data augmentation for neural networks (brief note near the end).

Early Stopping in One Picture

Validation
error
  |\
  | \                                  /
  |  \    ← still learning signal     /   ← memorising noise:
  |   \                             _/      validation error RISES
  |    \___                      __/
  |        \___              ___/
  |            \____________/
  |                  ^
  |______________________________________ training iterations
                     ^
              early stopping point (minimum validation error)
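One way to locate that stopping point after the fact is to track validation error at every boosting stage. This is a sketch on synthetic data, using `GradientBoostingRegressor.staged_predict`; the dataset and hyperparameters are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=20, n_informative=5,
                       noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                max_depth=3, random_state=0)
gbr.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., 300 trees
val_errors = [mean_squared_error(y_val, pred)
              for pred in gbr.staged_predict(X_val)]
best_n = int(np.argmin(val_errors)) + 1   # number of trees at the minimum

print(f"Lowest validation MSE at {best_n} trees: {val_errors[best_n - 1]:.1f}")
print(f"Validation MSE with all 300 trees:  {val_errors[-1]:.1f}")
```

In practice you would usually let the library stop for you; `GradientBoostingRegressor` accepts `n_iter_no_change` and `validation_fraction` for that purpose.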

Regularization: The Core Idea

Regularization adds a penalty to the model's loss function that grows with the size or number of the coefficients. The model now has to trade off fitting the data against keeping its coefficients small. Small coefficients mean a smoother, simpler function that is less able to chase noise.

For linear models, ordinary least squares minimises just the fit error:

OLS loss (no regularization):
  Loss = Σ (yᵢ − ŷᵢ)²          where ŷᵢ = b₀ + b₁x₁ + ... + bₚxₚ

Regularization tacks on a penalty term controlled by a strength α (also written λ, "lambda"):

Regularized loss = (fit error)  +  α × (penalty on coefficients)

  Larger α  → stronger penalty → smaller coefficients → simpler model
                                → higher bias, lower variance
  α = 0     → no penalty → back to plain OLS
  α → ∞     → all coefficients forced toward 0 → underfitting

α is a hyperparameter you tune with cross-validation. The two classic penalties are L2 (Ridge) and L1 (Lasso).

L2 Regularization — Ridge Regression

Ridge adds the sum of squared coefficients as the penalty. It shrinks all coefficients toward zero smoothly, but rarely makes any of them exactly zero.

Ridge loss = Σ (yᵢ − ŷᵢ)²  +  α × Σ bⱼ²
                                    j=1..p   (note: intercept b₀ is NOT penalised)

Effect: coefficients shrink smoothly toward zero (the larger α, the more they shrink).
        Correlated features share the weight rather than one dominating.
        Good default when you believe MOST features are useful.
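For the Ridge loss above there is a closed-form solution, b = (XᵀX + αI)⁻¹Xᵀy, once the features and target are centred so the intercept drops out. The sketch below implements that formula directly in NumPy on made-up data (the data and the helper name `ridge_coefs` are assumptions for illustration) and shows the coefficient norm falling as α grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
true_b = np.array([3.0, -2.0, 0.0, 0.0, 1.0])     # hypothetical true coefficients
y = X @ true_b + rng.normal(scale=1.0, size=n)

# Centre features and target so the unpenalised intercept drops out
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefs(Xc, yc, alpha):
    """Minimise sum((yc - Xc b)^2) + alpha * sum(b^2) in closed form."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = [float(np.linalg.norm(ridge_coefs(Xc, yc, a))) for a in alphas]
for a, nrm in zip(alphas, norms):
    print(f"alpha={a:7.1f}  ||b|| = {nrm:.3f}")
```

With α = 0 the function returns the ordinary least-squares solution; every increase in α pulls the whole coefficient vector closer to zero without setting any entry exactly to zero.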

L1 Regularization — Lasso Regression

Lasso ("Least Absolute Shrinkage and Selection Operator") adds the sum of absolute coefficient values. Its geometry drives some coefficients exactly to zero — so Lasso performs automatic feature selection.

Lasso loss = Σ (yᵢ − ŷᵢ)²  +  α × Σ |bⱼ|
                                    j=1..p

Effect: some coefficients become EXACTLY 0 → those features are dropped.
        Produces a sparse, interpretable model.
        Good when you suspect only a FEW features truly matter.

The intuition for why L1 zeros things out: the absolute-value penalty has sharp corners on the coordinate axes, and the optimum tends to land exactly on a corner (where a coefficient is 0). The squared penalty of Ridge is smooth and round, so it slides coefficients close to — but not onto — zero.
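The same contrast shows up algebraically in the simplest possible setting. Assume a single coefficient whose unpenalised estimate is z, so the problem is to minimise (b − z)² plus the penalty (this one-dimensional setup is a simplifying assumption, equivalent to orthonormal features). Ridge then gives b = z / (1 + α), while Lasso gives the soft-threshold sign(z) · max(|z| − α/2, 0).

```python
import numpy as np

def ridge_shrink(z, alpha):
    # Minimiser of (b - z)^2 + alpha * b^2: scales z down, never reaches 0
    return z / (1.0 + alpha)

def lasso_shrink(z, alpha):
    # Minimiser of (b - z)^2 + alpha * |b|: subtracts alpha/2, clips at 0
    return np.sign(z) * np.maximum(np.abs(z) - alpha / 2.0, 0.0)

z = np.array([3.0, 1.0, 0.4, -0.2])   # hypothetical unpenalised estimates
alpha = 1.0

ridge_b = ridge_shrink(z, alpha)
lasso_b = lasso_shrink(z, alpha)

print("unpenalised:", z)
print("ridge      :", ridge_b)
print("lasso      :", lasso_b)
print("non-zero   : ridge =", np.count_nonzero(ridge_b),
      " lasso =", np.count_nonzero(lasso_b))
```

Ridge halves every estimate here, so small ones stay small but non-zero. Lasso subtracts a fixed amount, so any estimate smaller than α/2 in magnitude is set to exactly zero.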

ElasticNet — The Best of Both

ElasticNet blends L1 and L2. A mixing ratio l1_ratio decides how much of each. It gives you Lasso's feature selection while keeping Ridge's stability when features are correlated (Lasso alone tends to arbitrarily pick one of several correlated features and drop the rest).

ElasticNet loss = Σ (yᵢ − ŷᵢ)²
                  + α × [ l1_ratio × Σ|bⱼ|  +  (1 − l1_ratio) × Σ bⱼ² ]

  l1_ratio = 1  → pure Lasso
  l1_ratio = 0  → pure Ridge
  0 < l1_ratio < 1 → a blend (e.g. 0.5 is a common starting point)

Aspect                Ridge (L2)                  Lasso (L1)                  ElasticNet
--------------------------------------------------------------------------------------------------------------
Penalty               Sum of squares Σ bⱼ²        Sum of absolutes Σ |bⱼ|     Weighted mix of both (l1_ratio)
Coefficients          Shrinks all toward 0        Drives some to exactly 0    Some to 0, rest shrunk
Feature selection     No                          Yes (built-in)              Yes
Correlated features   Shares weight across them   Picks one, drops others     Handles groups well
Best when             Most features useful        Few features truly matter   Many features, some correlated
sklearn class         Ridge                       Lasso                       ElasticNet
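The correlated-features row can be seen in an extreme case: two columns that are exact copies of each other. The sketch below is a toy setup invented for illustration. Note that scikit-learn scales its objective differently from the simplified formulas above (the fit term is divided by 2n and the L2 term carries a factor of 0.5), so `alpha` values are not directly comparable with the α in the text.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# Columns 0 and 1 are identical copies; column 2 is unrelated noise
X = np.column_stack([x, x, rng.normal(size=n)])
y = 4.0 * x + rng.normal(scale=0.5, size=n)

# Tight tolerance so both solvers run close to their optimum
lasso = Lasso(alpha=0.1, tol=1e-10, max_iter=100_000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X, y)

print("Lasso      coefficients:", np.round(lasso.coef_, 3))
print("ElasticNet coefficients:", np.round(enet.coef_, 3))
```

With identical columns the Lasso objective cannot tell the copies apart, so the solver's update order decides which one gets the weight. The L2 part of ElasticNet makes the optimum unique and splits the weight evenly between the two copies.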

Ridge and Lasso in scikit-learn

Let's watch regularization shrink coefficients in practice. We predict house prices from several features, some of which are noisy or redundant.

Always scale features before regularizing. The penalty depends on coefficient magnitude, so features on larger scales (e.g. area in sq ft vs number of bedrooms) would be penalised unfairly. Use a Pipeline with StandardScaler so scaling is learned on training data only and applied consistently.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Synthetic housing data: 6 features, but only 3 truly drive price
rng = np.random.RandomState(42)
n = 500
area      = rng.normal(1000, 300, n)      # sq ft         (matters)
bedrooms  = rng.normal(3, 1, n)           # count         (matters)
age       = rng.normal(15, 8, n)          # years         (matters)
noise1    = rng.normal(0, 1, n)           # irrelevant
noise2    = rng.normal(0, 1, n)           # irrelevant
noise3    = area + rng.normal(0, 5, n)    # redundant (correlated with area)

# True price (in ₹ lakhs) depends only on area, bedrooms, age
price = 0.05*area + 8*bedrooms - 0.9*age + rng.normal(0, 5, n)

X = pd.DataFrame({"area": area, "bedrooms": bedrooms, "age": age,
                  "noise1": noise1, "noise2": noise2, "redundant": noise3})
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit plain OLS, Ridge, and Lasso — all inside a scaling pipeline
models = {
    "OLS":   make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=10.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
}

coefs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    coefs[name] = model[-1].coef_          # coefficients of the final estimator
    print(f"{name:6s}  R² (test) = {model.score(X_test, y_test):.3f}")

coef_table = pd.DataFrame(coefs, index=X.columns).round(2)
print("\nCoefficients (on standardised features):")
print(coef_table)

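Expected output (illustrative only; exact numbers depend on the random draw and scikit-learn version, and because area and redundant are almost perfectly correlated, how each model divides weight between those two can differ noticeably from this table):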

OLS     R² (test) = 0.912
Ridge   R² (test) = 0.914
Lasso   R² (test) = 0.913

Coefficients (on standardised features):
             OLS   Ridge   Lasso
area       10.42    9.10    9.88
bedrooms    8.05    7.71    7.42
age        -7.18   -6.95   -6.81
noise1      0.31    0.22    0.00   ← Lasso zeroed it out
noise2     -0.44   -0.29   -0.00   ← Lasso zeroed it out
redundant   4.91    2.40    0.00   ← Lasso dropped the redundant feature

Read the coefficients carefully — this is the whole lesson in one table:

  • OLS happily assigns non-zero weights to the two pure-noise features and splits weight between area and its redundant copy. It has no reason not to.
  • Ridge shrinks every coefficient toward zero, taming the noise and the redundant feature, but never quite eliminates them.
  • Lasso drives noise1, noise2, and redundant to exactly zero — automatically recovering the three real drivers of price. That is feature selection for free.

Tuning alpha with Cross-Validation

Don't guess alpha. scikit-learn ships RidgeCV and LassoCV that search a grid of alpha values with built-in cross-validation.

from sklearn.linear_model import LassoCV

lasso_cv = make_pipeline(
    StandardScaler(),
    LassoCV(alphas=np.logspace(-3, 1, 50), cv=5, random_state=42)
)
lasso_cv.fit(X_train, y_train)

best_alpha = lasso_cv[-1].alpha_
print(f"Best alpha chosen by 5-fold CV: {best_alpha:.4f}")
print(f"Test R²: {lasso_cv.score(X_test, y_test):.3f}")

Expected output (illustrative; your exact numbers will vary):

Best alpha chosen by 5-fold CV: 0.0546
Test R²: 0.915
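The Ridge counterpart works the same way. Here is a self-contained sketch on synthetic data from `make_regression` (an assumption, so that the block runs on its own rather than depending on the housing variables above).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=15, n_informative=5,
                       noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

alphas = np.logspace(-3, 3, 25)            # candidate penalty strengths
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
ridge_cv.fit(X_train, y_train)             # CV happens inside fit, on training data only

best_alpha = ridge_cv[-1].alpha_
print(f"Best alpha chosen by 5-fold CV: {best_alpha:.4f}")
print(f"Test R²: {ridge_cv.score(X_test, y_test):.3f}")
```

If you leave `cv` at its default of `None`, `RidgeCV` uses an efficient form of leave-one-out cross-validation instead of k-fold.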

A regularization path plot (coefficient value vs alpha) is a great diagnostic: as alpha grows, watch coefficients shrink and, for Lasso, drop to zero one by one.

Coefficient
 value  | area \_______
        | bed  \______
        | age  ________/¯¯¯  (magnitudes shrink as alpha ↑)
        | redundant \__0
        | noise1 __0
        | noise2 _0
        +--------------------------- increasing alpha →
          weak penalty        strong penalty
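The numbers behind such a plot can be computed with scikit-learn's `lasso_path`. The sketch below uses synthetic data (an assumption) and counts how many coefficients are non-zero at each alpha instead of drawing the curves. It relies on the fact that, for scikit-learn's Lasso objective with centred data, every coefficient is zero once alpha reaches max|Xᵀy| / n.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)
y = y - y.mean()                    # lasso_path does not fit an intercept

# Smallest alpha at which every coefficient is exactly zero
alpha_max = np.max(np.abs(X.T @ y)) / len(y)
alphas = alpha_max * np.logspace(0.1, -3, 30)   # from just above alpha_max downward

# coefs has shape (n_features, n_alphas); alphas_out is in decreasing order
alphas_out, coefs, _ = lasso_path(X, y, alphas=alphas)
n_nonzero = (coefs != 0).sum(axis=0)

for a, k in zip(alphas_out[::5], n_nonzero[::5]):
    print(f"alpha={a:9.4f}  non-zero coefficients: {k}")
```

Plotting each row of `coefs` against `alphas_out` on a log axis gives the path diagram: features enter the model one at a time as the penalty weakens.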

A Note on Regularization Elsewhere

Regularization is not just a linear-model idea — the same principle appears across the toolkit:

  • Logistic Regression / SVM: scikit-learn parametrises strength as C = 1/α. So a small C means strong regularization (the opposite direction from alpha). See the Logistic Regression and Support Vector Machines chapters.
  • Decision Trees: limiting max_depth, min_samples_leaf, or applying cost-complexity pruning (ccp_alpha) is regularization for trees.
  • Neural networks — dropout. During training, dropout randomly "switches off" a fraction of neurons on each forward pass (e.g. dropout=0.3 drops 30%). The network can't rely on any single neuron, so it learns redundant, robust representations — a powerful anti-overfitting technique. Neural nets also use L2 (called weight decay) and early stopping. The Introduction to Neural Networks & Deep Learning chapter revisits this.
  • Data augmentation (images, text) synthetically expands the training set and acts as regularization by exposing the model to more variation.
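The direction of C is easy to verify. This sketch fits `LogisticRegression` (with its default L2 penalty) at three values of C on synthetic data, both chosen here for illustration, and prints the size of the coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

coef_norms = {}
for C in (0.001, 0.1, 10.0):
    # Smaller C = stronger penalty, the reverse of alpha in Ridge/Lasso
    clf = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    clf.fit(X, y)
    coef_norms[C] = float(np.linalg.norm(clf[-1].coef_))
    print(f"C={C:6.3f}  ||coef|| = {coef_norms[C]:.3f}")
```

The coefficient norm grows as C grows, confirming that small C plays the role of large alpha.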

Underfit vs Good Fit vs Overfit at a Glance

Property            Underfit (high bias)                   Good fit        Overfit (high variance)
-----------------------------------------------------------------------------------------------------------------------------
Model complexity    Too low                                Balanced        Too high
Training error      High                                   Low             Very low (near zero)
Validation error    High                                   Low             High
Train-vs-val gap    Small                                  Small           Large
Learning curves     Converge low                           Converge high   Wide persistent gap
Typical cause       Too few features; over-simple model    Right balance   Too many features/params; too little data
Fix                 Add features/complexity; reduce alpha  Keep it         More data; simpler model; increase alpha; regularize

Common Mistakes

1. Judging the model on training accuracy

"My model is 99% accurate!" — measured on the training set.
This tells you almost nothing about generalization. A model can memorise
its way to 100% training accuracy. ALWAYS report validation/test scores.

2. Forgetting to scale before regularizing

Ridge/Lasso penalise coefficient magnitude. If 'area' is in the thousands
and 'bedrooms' is 1–5, the penalty crushes 'bedrooms' unfairly.
Fix: put StandardScaler inside a Pipeline so every feature is comparable.

3. Tuning alpha on the test set

Choosing alpha by checking which value scores best on the TEST set leaks
the test set into training — your reported score becomes optimistic.
Fix: tune with cross-validation (RidgeCV/LassoCV/GridSearchCV) on the
training data, and touch the test set only ONCE at the very end.
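
A sketch of the correct procedure with `GridSearchCV`, on synthetic data (an assumption; the step names "scale" and "ridge" are arbitrary labels chosen here):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=15, n_informative=5,
                       noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# All tuning happens via 5-fold CV inside the training data
search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
test_r2 = search.score(X_test, y_test)     # test set touched once, at the end

print(f"Best alpha (CV): {best_alpha}")
print(f"Test R²: {test_r2:.3f}")
```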

4. Confusing sklearn's C with alpha

For Ridge/Lasso/ElasticNet: LARGER alpha = STRONGER regularization.
For LogisticRegression/SVM:  SMALLER C   = STRONGER regularization (C = 1/α).
Mixing these up sends you tuning in the wrong direction.

5. Assuming more data always fixes the problem

More data cures HIGH VARIANCE (overfitting). It does NOT cure HIGH BIAS.
If your learning curves have already converged at a low score, collecting
more rows is wasted effort — you need a better model or better features.

6. Treating Lasso's dropped features as "proven useless"

Lasso zeros coefficients partly based on chance and correlations.
Among several correlated features it may keep one and drop the rest
almost arbitrarily. Don't read a zeroed coefficient as "this feature has
no real-world effect." Use ElasticNet when features are correlated.

Practice Exercises

  1. A model scores 0.98 accuracy on training data and 0.61 on validation data. Name the problem, identify whether it is high bias or high variance, and list three concrete fixes.

  2. Write the loss functions for Ridge, Lasso, and ElasticNet from memory (use a fenced code block). In one sentence each, state what each penalty does to the coefficients.

  3. Using the housing code above, loop alpha over [0.01, 0.1, 1, 10, 100] for Ridge, and print how many coefficients drop below 0.5 in magnitude at each value. Describe the trend you see.

  4. You plot learning curves and both the training and validation scores plateau together at about 0.68. Is collecting more data likely to help? What should you try instead, and why?

  5. Explain in your own words why L1 (Lasso) produces exactly-zero coefficients while L2 (Ridge) only shrinks them. A rough geometric argument is fine.

  6. A colleague tuned alpha by picking the value that maximised the test-set R². Explain what is wrong with this and describe the correct procedure using cross-validation.

Summary

In this chapter you learned:

  • Underfitting (high bias) = too simple; poor on both train and validation. Overfitting (high variance) = too complex; great on train, poor on validation. The goal is generalization.
  • Bias-variance decomposition: Expected error = Bias² + Variance + Irreducible error. Increasing complexity lowers bias but raises variance; the sweet spot minimises their sum.
  • Diagnose with the train-vs-validation gap and with learning curves: converge-low means high bias; wide persistent gap means high variance.
  • Fight overfitting with more data, simpler models, feature selection, cross-validation, early stopping, ensembles, and — most surgically — regularization.
  • Ridge (L2) penalises Σ bⱼ² and shrinks all coefficients smoothly. Lasso (L1) penalises Σ |bⱼ| and drives some coefficients to exactly zero → built-in feature selection. ElasticNet blends both via l1_ratio.
  • The strength alpha (λ) is a hyperparameter: larger means stronger penalty, more bias, less variance. Tune it with RidgeCV / LassoCV, never on the test set.
  • Always scale features (via a Pipeline with StandardScaler) before regularizing. Remember C = 1/α for LogisticRegression/SVM.
  • Dropout, weight decay, and data augmentation bring the same anti-overfitting principle to neural networks.

Getting the bias-variance balance right is what separates a model that impresses on a slide from one that actually holds up in production.

Next up: Ensemble Methods — Bagging, Boosting & XGBoost — learn how combining many models (via bagging to cut variance and boosting to cut bias) produces some of the most accurate, competition-winning models in machine learning.