The Central Tension in Machine Learning
Every model you build sits somewhere on a spectrum between two failure modes:
- Underfitting: the model is too simple to capture the real pattern. It does badly on training data and on new data.
- Overfitting: the model is too complex and memorises the training data — including its noise. It does great on training data but poorly on new data.
The whole art of modelling is landing in the sweet spot between them, where the model is complex enough to learn the signal but not so complex that it learns the noise.
Analogy — the exam student. Imagine Priya preparing for an exam.
Underfitting student:
Reads only the chapter titles. Understands nothing deeply.
→ Scores badly on practice tests AND the real exam.
Overfitting student:
Memorises last year's exact question paper word-for-word.
→ Scores 100% on last year's paper, but the real exam has
different questions → scores badly.
Good student:
Understands the underlying concepts.
→ Scores well on practice tests AND on the real, unseen exam.
Generalization — performing well on data the model has never seen — is the goal. Everything in this chapter is about measuring and improving generalization.
The Bias-Variance Trade-off
The generalization error of a model can be broken down into three parts. This decomposition explains why underfitting and overfitting happen and why fixing one often worsens the other.
Expected prediction error = Bias² + Variance + Irreducible Error
Bias = error from wrong assumptions (model too simple).
A high-bias model consistently misses the true pattern
in the same direction, no matter which training set it sees.
Variance = error from sensitivity to the training data (model too complex).
A high-variance model changes drastically when you swap
in a slightly different training sample.
Irreducible error = noise in the data itself (measurement error,
randomness). No model can remove this.
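Written out for squared-error loss, the decomposition above takes the following form. This is a sketch under the usual textbook assumptions, which the text does not spell out: the data follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the model fitted on a randomly drawn training set, and the expectations run over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The first term measures how far the average fitted model sits from the truth, the second how much individual fits scatter around that average, and the third is the noise floor.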
Think of it with a dartboard.
| | Low variance (tight grouping) | High variance (scattered darts) |
|---|---|---|
| Low bias (on target) | Bullseye cluster ← IDEAL | Spread around bullseye ← overfitting risk |
| High bias (off target) | Tight cluster, but off-centre ← underfitting | Scattered AND off-centre ← worst case |
Why It's a Trade-off
As you increase model complexity (deeper trees, higher-degree polynomials, more features), bias falls but variance rises. As you decrease complexity, bias rises but variance falls. You cannot minimise both by tuning complexity alone — you have to find the balance point where their sum is smallest.
[Figure: error versus model complexity (low to high). Bias falls with complexity, variance rises with complexity, and the total error curve dips to a minimum in between: the SWEET SPOT.]
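The trade-off can also be checked numerically. The sketch below is an illustration under assumptions that are not part of the chapter: the true function is sin(2πx), the noise has standard deviation 0.3, the training inputs are held fixed while only the noise is redrawn, and polynomial degree stands in for model complexity. It refits each model on many noisy training sets and estimates bias² and variance from the spread of the predictions.

```python
import numpy as np
from numpy.polynomial import Polynomial

NOISE_SD = 0.3                         # assumed noise level
X_TRAIN = np.linspace(0.0, 1.0, 30)    # fixed training inputs (a simplification)
X_GRID = np.linspace(0.05, 0.95, 50)   # points where predictions are compared


def true_fn(x):
    """Assumed ground-truth signal for this illustration."""
    return np.sin(2 * np.pi * x)


def bias_variance(degree, n_repeats=200, seed=0):
    """Estimate bias^2 and variance of a degree-`degree` polynomial fit."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_repeats, X_GRID.size))
    for r in range(n_repeats):
        # Redraw the noise, refit, and record the predictions.
        y = true_fn(X_TRAIN) + rng.normal(0.0, NOISE_SD, X_TRAIN.size)
        preds[r] = Polynomial.fit(X_TRAIN, y, degree)(X_GRID)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_fn(X_GRID)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance


results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

In this setup the straight line should show the largest bias² and the smallest variance, and the degree-9 polynomial the reverse; the exact figures depend on the assumed signal and noise level.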
The good news: regularization, more data, and ensembles can bend this curve — letting you reduce variance without adding much bias. That is the real payoff of the techniques below.
Symptoms: Reading the Train vs Validation Gap
You diagnose bias and variance by comparing performance on the training set (data the model learned from) against a held-out validation set (data it never saw). See the Train-Test Split & Cross-Validation chapter for how to build these splits properly.
| Symptom | Training score | Validation score | Diagnosis | What it means |
|---|---|---|---|---|
| Both poor | Low | Low | Underfitting (high bias) | Model too simple; add complexity/features |
| Big gap | High | Low | Overfitting (high variance) | Model memorised noise; simplify/regularize |
| Both good, close | High | High | Good fit | Ship it (and keep monitoring) |
| Both great, tiny gap | Very high | Very high | Suspiciously good | Check for data leakage |
A concrete illustrative example on a customer-churn dataset:
| Model | Train accuracy | Validation accuracy | Gap | Verdict |
|---|---|---|---|---|
| Logistic (few features) | 0.71 | 0.70 | 0.01 | underfit |
| Decision tree (max_depth=3) | 0.79 | 0.77 | 0.02 | good |
| Decision tree (unlimited) | 1.00 | 0.72 | 0.28 | overfit! |
The unlimited tree scores a perfect 1.00 on training — a red flag. That gap of 0.28 is the fingerprint of overfitting.
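The diagnosis rules can be turned into a small triage helper. This is only a sketch: the function name `diagnose` and the thresholds `low_bar` and `gap_bar` are hypothetical choices made for this illustration, and sensible values depend on the task and the metric. It does not attempt the "suspiciously good" leakage check.

```python
def diagnose(train_score, val_score, low_bar=0.75, gap_bar=0.10):
    """Rough triage of a train/validation score pair.

    low_bar and gap_bar are hypothetical thresholds picked for this
    example; they are not universal constants.
    """
    gap = train_score - val_score
    if gap > gap_bar:
        return "overfit (high variance)"
    if train_score < low_bar:
        return "underfit (high bias)"
    return "good fit"


# Scores taken from the customer-churn illustration.
churn_results = {
    "Logistic (few features)": (0.71, 0.70),
    "Decision tree (max_depth=3)": (0.79, 0.77),
    "Decision tree (unlimited)": (1.00, 0.72),
}

for name, (train, val) in churn_results.items():
    print(f"{name:30s} gap = {train - val:.2f} -> {diagnose(train, val)}")
```

Checking the gap first is a deliberate choice: a model with a large gap is treated as overfit even if its training score is modest.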
Learning Curves
A learning curve plots training and validation scores as you increase the amount of training data. Its shape tells you which problem you have.
[Figure: two learning-curve panels, each plotting the train and val scores against the amount of training data.]
HIGH BIAS (underfitting):
Both curves plateau LOW and CLOSE together.
→ More data won't help.
→ Add complexity/features.
HIGH VARIANCE (overfitting):
Wide, persistent GAP between train (high) and val (low).
→ More data likely WILL help.
→ The curves converge as data grows.
- Curves converge at a low score → high bias. Adding more data is a waste; you need a more expressive model or better features.
- A large gap that shrinks slowly as data grows → high variance. More data helps; so does regularization.
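Both shapes can be reproduced with a few lines of NumPy. The sketch below rests on assumptions of its own: a sin(2πx) signal with noise of standard deviation 0.3, a straight line as the high-bias model, a degree-9 polynomial as the high-variance model, and mean squared error instead of a score (so lower is better and the curves are flipped relative to the description above). scikit-learn also provides `sklearn.model_selection.learning_curve` for real estimators.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
NOISE_SD = 0.3  # assumed noise level


def sample(n):
    """Draw n noisy points from the assumed signal sin(2*pi*x)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, NOISE_SD, n)
    return x, y


x_val, y_val = sample(2000)  # held-out validation set


def learning_curve_mse(degree, sizes, n_repeats=20):
    """Average train/validation MSE of a polynomial fit per training size."""
    rows = []
    for n in sizes:
        train_err, val_err = [], []
        for _ in range(n_repeats):
            x, y = sample(n)
            model = Polynomial.fit(x, y, degree, domain=[0.0, 1.0])
            train_err.append(np.mean((model(x) - y) ** 2))
            val_err.append(np.mean((model(x_val) - y_val) ** 2))
        rows.append((n, float(np.mean(train_err)), float(np.mean(val_err))))
    return rows


sizes = (30, 100, 300, 1000)
high_bias = learning_curve_mse(1, sizes)      # too simple
high_variance = learning_curve_mse(9, sizes)  # flexible

for label, rows in (("degree 1", high_bias), ("degree 9", high_variance)):
    for n, train_mse, val_mse in rows:
        print(f"{label}  n={n:4d}  train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```

The straight line should settle at a high error with almost no gap, while the degree-9 fit should start with a visible gap that narrows as the training size grows.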
How to Fight Overfitting
Once you have diagnosed high variance, you have a toolkit. Roughly in order of "reach for this first":
- Get more (and more varied) training data. The single most reliable cure for variance. More examples make it harder for the model to memorise noise.
- Use a simpler model. Fewer parameters, a shallower tree (max_depth), a lower-degree polynomial. Less capacity to memorise.
- Feature selection / reduce dimensionality. Drop irrelevant features; they are pure opportunity for the model to fit noise. See Dimensionality Reduction & PCA and Feature Engineering & Scaling.
- Cross-validation. Doesn't cure overfitting directly, but gives an honest estimate of generalization so you tune toward the sweet spot instead of the training score.
- Early stopping. For iterative learners (gradient descent, boosting, neural nets), stop training when validation error starts rising — before the model begins memorising.
- Regularization. Add a penalty that discourages large/complex parameters. The most surgical tool, and the focus of the rest of this chapter.
- Ensembles (covered in the next chapter). Bagging averages many models to cut variance; boosting mainly targets bias, so bagging is the variant to reach for against overfitting.
- Dropout / data augmentation for neural networks (brief note near the end).
Early Stopping in One Picture
[Figure: validation error versus training iterations. The error falls while the model is still learning signal, reaches a MINIMUM (the early stopping point, best generalization), and beyond this rises again as the model memorises noise.]
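A minimal early-stopping loop might look like the sketch below. Everything specific in it is an assumption made for illustration: the degree-12 polynomial features, the learning rate, the `patience` of 200 iterations, and the synthetic sine data. Whether and when the validation error turns upward depends on the data, so the loop simply remembers the best weights seen so far and stops once the validation error has failed to improve for `patience` iterations.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Noisy sine data with degree-12 polynomial features (assumed setup)."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, n)
    return np.vander(x, 13, increasing=True), y


def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))


X_train, y_train = make_data(20)   # small training set, easy to overfit
X_val, y_val = make_data(200)      # held-out validation set

learning_rate, max_iters, patience = 0.05, 5000, 200
w = np.zeros(X_train.shape[1])
best_w, best_val, best_iter = w.copy(), mse(X_val, y_val, w), 0
val_history = [best_val]

for it in range(1, max_iters + 1):
    # One gradient-descent step on the training MSE.
    grad = 2.0 / len(y_train) * X_train.T @ (X_train @ w - y_train)
    w -= learning_rate * grad

    val = mse(X_val, y_val, w)
    val_history.append(val)
    if val < best_val:
        # New best validation error: remember these weights.
        best_w, best_val, best_iter = w.copy(), val, it
    elif it - best_iter >= patience:
        break  # no improvement for `patience` iterations: stop early

print(f"ran {len(val_history) - 1} iterations, best at iteration {best_iter}")
print(f"best validation MSE = {best_val:.3f}, final = {val_history[-1]:.3f}")
```

Keeping `best_w` rather than the final `w` matters: the weights you deploy are the ones from the validation minimum, not from the last iteration.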
Regularization: The Core Idea
Regularization adds a penalty to the model's loss function that grows with the size or number of the coefficients. The model now has to trade off fitting the data against keeping its coefficients small. Small coefficients mean a smoother, simpler function that is less able to chase noise.
For linear models, ordinary least squares minimises just the fit error:
OLS loss (no regularization):
Loss = Σ (yᵢ − ŷᵢ)² where ŷᵢ = b₀ + b₁x₁ + ... + bₚxₚ
Regularization tacks on a penalty term controlled by a strength α (also written λ, "lambda"):
Regularized loss = (fit error) + α × (penalty on coefficients)
Larger α → stronger penalty → smaller coefficients → simpler model
→ higher bias, lower variance
α = 0 → no penalty → back to plain OLS
α → ∞ → all coefficients forced toward 0 → underfitting
α is a hyperparameter you tune with cross-validation. The two classic penalties are L2 (Ridge) and L1 (Lasso).
L2 Regularization — Ridge Regression
Ridge adds the sum of squared coefficients as the penalty. It shrinks all coefficients toward zero smoothly, but rarely makes any of them exactly zero.
Ridge loss = Σ (yᵢ − ŷᵢ)² + α × Σ bⱼ²
j=1..p (note: intercept b₀ is NOT penalised)
Effect: coefficients shrink proportionally.
Correlated features share the weight rather than one dominating.
Good default when you believe MOST features are useful.
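For intuition, the Ridge loss above has a closed-form minimiser, which makes the shrinkage easy to see. The sketch below is a bare NumPy illustration rather than scikit-learn's implementation; the helper name `ridge_fit` and the synthetic data are assumptions. It centres the data so that the intercept stays unpenalised, then solves (XᵀX + αI) b = Xᵀy.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form Ridge on centred data; the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef


# Synthetic data (assumed): four features, the last one irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(0.0, 1.0, 100)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for alpha in alphas:
    _, coef = ridge_fit(X, y, alpha)
    norms.append(float(np.linalg.norm(coef)))
    print(f"alpha = {alpha:7.1f}  coef = {np.round(coef, 3)}  norm = {norms[-1]:.3f}")
```

The coefficient norm should fall steadily as alpha grows, while no individual coefficient lands exactly on zero; with alpha = 0 the result coincides with ordinary least squares.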
L1 Regularization — Lasso Regression
Lasso ("Least Absolute Shrinkage and Selection Operator") adds the sum of absolute coefficient values. Its geometry drives some coefficients exactly to zero — so Lasso performs automatic feature selection.
Lasso loss = Σ (yᵢ − ŷᵢ)² + α × Σ |bⱼ|
j=1..p
Effect: some coefficients become EXACTLY 0 → those features are dropped.
Produces a sparse, interpretable model.
Good when you suspect only a FEW features truly matter.
The intuition for why L1 zeros things out: the absolute-value penalty corresponds to a diamond-shaped constraint region with sharp corners on the coordinate axes, and the optimum often lands exactly on a corner (where a coefficient is 0). The squared penalty of Ridge corresponds to a smooth, round region, so it slides coefficients close to zero but not onto it.
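The same contrast shows up algebraically in the simplest possible case, a single coefficient whose unpenalised estimate is z. Under that simplifying assumption (it is not the general multi-feature solution), the L1-penalised minimiser is the soft-thresholding rule and the L2-penalised minimiser is a proportional rescaling. The function names below are hypothetical.

```python
import numpy as np


def lasso_1d(z, alpha):
    """Minimiser of 0.5*(b - z)**2 + alpha*|b|: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)


def ridge_1d(z, alpha):
    """Minimiser of 0.5*(b - z)**2 + 0.5*alpha*b**2: proportional shrinkage."""
    return z / (1.0 + alpha)


z = np.array([3.0, 0.4, -0.2, -2.0])   # unpenalised estimates
lasso_coefs = lasso_1d(z, alpha=0.5)
ridge_coefs = ridge_1d(z, alpha=0.5)

print("unpenalised:", z)
print("lasso      :", lasso_coefs)
print("ridge      :", np.round(ridge_coefs, 3))
```

Any estimate whose magnitude is below alpha is set to exactly zero by the L1 rule, whereas the L2 rule divides every estimate by the same factor and so never reaches zero unless the estimate was already zero.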
ElasticNet — The Best of Both
ElasticNet blends L1 and L2. A mixing ratio l1_ratio decides how much of each. It gives you Lasso's feature selection while keeping Ridge's stability when features are correlated (Lasso alone tends to arbitrarily pick one of several correlated features and drop the rest).
ElasticNet loss = Σ (yᵢ − ŷᵢ)²
+ α × [ l1_ratio × Σ|bⱼ| + (1 − l1_ratio) × Σ bⱼ² ]
l1_ratio = 1 → pure Lasso
l1_ratio = 0 → pure Ridge
0 < l1_ratio < 1 → a blend (e.g. 0.5 is a common starting point)
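The penalty term in the formula above is simple enough to compute by hand, which helps when checking the two limiting cases. The helper `elastic_net_penalty` below is a hypothetical function written for this illustration and follows the chapter's simplified formula. scikit-learn's `ElasticNet` documents a differently scaled objective (a 1/(2n) factor on the fit term and a factor of 0.5 on the L2 part), so alpha values are not directly interchangeable between the two.

```python
import numpy as np


def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty from the chapter's formula (not scikit-learn's scaling)."""
    coefs = np.asarray(coefs, dtype=float)
    l1 = np.sum(np.abs(coefs))   # sum of absolute values
    l2 = np.sum(coefs ** 2)      # sum of squares
    return float(alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2))


b = [3.0, -1.0, 0.0, 0.5]   # example coefficients (intercept excluded)

pure_lasso = elastic_net_penalty(b, alpha=2.0, l1_ratio=1.0)
pure_ridge = elastic_net_penalty(b, alpha=2.0, l1_ratio=0.0)
blend = elastic_net_penalty(b, alpha=2.0, l1_ratio=0.5)

print(f"l1_ratio = 1.0 -> {pure_lasso}")
print(f"l1_ratio = 0.0 -> {pure_ridge}")
print(f"l1_ratio = 0.5 -> {blend}")
```

For these coefficients the sum of absolute values is 4.5 and the sum of squares is 10.25, so the three penalties come out as 9.0, 20.5, and 14.75.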
| Aspect | Ridge (L2) | Lasso (L1) | ElasticNet |
|---|---|---|---|
| Penalty | Sum of squares Σ bⱼ² | Sum of absolutes Σ \|bⱼ\| | Weighted blend of both |
| Coefficients | Shrinks all toward 0 | Drives some to exactly 0 | Some to 0, rest shrunk |
| Feature selection | No | Yes (built-in) | Yes |
| Correlated features | Shares weight across them | Picks one, drops others | Handles groups well |
| Best when | Most features useful | Few features truly matter | Many features, some correlated |
| sklearn class | Ridge | Lasso | ElasticNet |
Ridge and Lasso in scikit-learn
Let's watch regularization shrink coefficients in practice. We predict house prices from several features, some of which are noisy or redundant.
Always scale features before regularizing. The penalty depends on coefficient magnitude, so features on larger scales (e.g. area in sq ft vs number of bedrooms) would be penalised unfairly. Use a Pipeline with StandardScaler so scaling is learned on training data only and applied consistently.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
# Synthetic housing data: 6 features, but only 3 truly drive price
rng = np.random.RandomState(42)
n = 500
area = rng.normal(1000, 300, n) # sq ft (matters)
bedrooms = rng.normal(3, 1, n) # count (matters)
age = rng.normal(15, 8, n) # years (matters)
noise1 = rng.normal(0, 1, n) # irrelevant
noise2 = rng.normal(0, 1, n) # irrelevant
noise3 = area + rng.normal(0, 5, n) # redundant (correlated with area)
# True price (in ₹ lakhs) depends only on area, bedrooms, age
price = 0.05*area + 8*bedrooms - 0.9*age + rng.normal(0, 5, n)
X = pd.DataFrame({"area": area, "bedrooms": bedrooms, "age": age,
                  "noise1": noise1, "noise2": noise2, "redundant": noise3})
y = price
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Fit plain OLS, Ridge, and Lasso, all inside a scaling pipeline
models = {
    "OLS": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=10.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
}
coefs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    coefs[name] = model[-1].coef_  # coefficients of the final estimator
    print(f"{name:6s} R² (test) = {model.score(X_test, y_test):.3f}")
coef_table = pd.DataFrame(coefs, index=X.columns).round(2)
print("\nCoefficients (on standardised features):")
print(coef_table)
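Expected output (illustrative; exact numbers depend on the random draw and library version, and the split of weight between area and redundant is especially unstable because the two are almost perfectly correlated):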
OLS R² (test) = 0.912
Ridge R² (test) = 0.914
Lasso R² (test) = 0.913
Coefficients (on standardised features):
OLS Ridge Lasso
area 10.42 9.10 9.88
bedrooms 8.05 7.71 7.42
age -7.18 -6.95 -6.81
noise1 0.31 0.22 0.00 ← Lasso zeroed it out
noise2 -0.44 -0.29 -0.00 ← Lasso zeroed it out
redundant 4.91 2.40 0.00 ← Lasso dropped the redundant feature
Read the coefficients carefully — this is the whole lesson in one table:
- OLS happily assigns non-zero weights to the two pure-noise features and splits weight between area and its redundant copy. It has no reason not to.
- Ridge shrinks every coefficient toward zero, taming the noise and the redundant feature, but never quite eliminates them.
- Lasso drives noise1, noise2, and redundant to exactly zero, automatically recovering the three real drivers of price. That is feature selection for free.
Tuning alpha with Cross-Validation
Don't guess alpha. scikit-learn ships RidgeCV and LassoCV that search a grid of alpha values with built-in cross-validation.
from sklearn.linear_model import LassoCV
lasso_cv = make_pipeline(
    StandardScaler(),
    LassoCV(alphas=np.logspace(-3, 1, 50), cv=5, random_state=42)
)
lasso_cv.fit(X_train, y_train)
best_alpha = lasso_cv[-1].alpha_
print(f"Best alpha chosen by 5-fold CV: {best_alpha:.4f}")
print(f"Test R²: {lasso_cv.score(X_test, y_test):.3f}")
Best alpha chosen by 5-fold CV: 0.0546
Test R²: 0.915
A regularization path plot (coefficient value vs alpha) is a great diagnostic: as alpha grows, watch coefficients shrink and, for Lasso, drop to zero one by one.
[Figure: regularization path, coefficient value versus increasing alpha (weak penalty to strong penalty). The magnitudes of area, bed, and age shrink as alpha grows; redundant, noise1, and noise2 drop to 0.]
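The numbers behind such a path can be collected by refitting Lasso over a grid of alpha values. The sketch below uses its own small synthetic dataset (an assumption, not the housing data) with three informative features and two pure-noise ones, and counts how many coefficients are still non-zero at each alpha.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): features 0-2 matter, features 3-4 are noise.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0.0, 1.0, n)

X_scaled = StandardScaler().fit_transform(X)

alphas = np.logspace(-3, 2, 11)   # weak penalty -> strong penalty
n_nonzero = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_scaled, y)
    n_nonzero.append(int(np.count_nonzero(model.coef_)))
    print(f"alpha = {alpha:8.3f}  non-zero = {n_nonzero[-1]}  "
          f"coef = {np.round(model.coef_, 2)}")
```

Plotting each column of coefficients against alpha on a log axis gives the path picture; with a strong enough penalty every coefficient ends up at zero.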
A Note on Regularization Elsewhere
Regularization is not just a linear-model idea — the same principle appears across the toolkit:
- Logistic Regression / SVM: scikit-learn parametrises strength as C = 1/α. So a small C means strong regularization (the opposite direction from alpha). See the Logistic Regression and Support Vector Machines chapters.
- Decision Trees: limiting max_depth, raising min_samples_leaf, or applying cost-complexity pruning (ccp_alpha) is regularization for trees.
- Neural networks: dropout. During training, dropout randomly "switches off" a fraction of neurons on each forward pass (e.g. dropout=0.3 drops 30%). The network can't rely on any single neuron, so it learns redundant, robust representations, which makes dropout a powerful anti-overfitting technique. Neural nets also use L2 (called weight decay) and early stopping. The Introduction to Neural Networks & Deep Learning chapter revisits this.
- Data augmentation (images, text) synthetically expands the training set and acts as regularization by exposing the model to more variation.
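To make the dropout idea concrete, here is a framework-free sketch in NumPy. The function name `dropout` and its signature are hypothetical. The rescaling by 1/keep_prob (often called inverted dropout) is a common convention that keeps the expected activation unchanged; the text above does not specify it.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of activations during training.

    Survivors are scaled by 1/keep_prob so the expected value is unchanged
    (the "inverted dropout" convention, assumed here).
    """
    if not training or rate == 0.0:
        return activations           # dropout is a no-op at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
a = np.ones(100_000)                 # stand-in for a layer's activations
out = dropout(a, rate=0.3, rng=rng)

dropped_fraction = float(np.mean(out == 0.0))
print(f"dropped fraction = {dropped_fraction:.3f}")
print(f"mean activation  = {out.mean():.3f}")
```

A fresh mask is drawn on every call, so each forward pass during training sees a different thinned network; at inference time the layer passes activations through untouched.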
Underfit vs Good Fit vs Overfit at a Glance
| Property | Underfit (high bias) | Good fit | Overfit (high variance) |
|---|---|---|---|
| Model complexity | Too low | Balanced | Too high |
| Training error | High | Low | Very low (near zero) |
| Validation error | High | Low | High |
| Train-vs-val gap | Small | Small | Large |
| Learning curves | Converge low | Converge high | Wide persistent gap |
| Typical cause | Too few features; over-simple model | Right balance | Too many features/params; too little data |
| Fix | Add features/complexity; reduce alpha | Keep it | More data; simpler model; increase alpha; regularize |
Common Mistakes
1. Judging the model on training accuracy
"My model is 99% accurate!" — measured on the training set.
This tells you almost nothing about generalization. A model can memorise
its way to 100% training accuracy. ALWAYS report validation/test scores.
2. Forgetting to scale before regularizing
Ridge/Lasso penalise coefficient magnitude. If 'area' is in the thousands
and 'bedrooms' is 1–5, the penalty crushes 'bedrooms' unfairly.
Fix: put StandardScaler inside a Pipeline so every feature is comparable.
3. Tuning alpha on the test set
Choosing alpha by checking which value scores best on the TEST set leaks
the test set into training — your reported score becomes optimistic.
Fix: tune with cross-validation (RidgeCV/LassoCV/GridSearchCV) on the
training data, and touch the test set only ONCE at the very end.
4. Confusing sklearn's C with alpha
For Ridge/Lasso/ElasticNet: LARGER alpha = STRONGER regularization.
For LogisticRegression/SVM: SMALLER C = STRONGER regularization (C = 1/α).
Mixing these up sends you tuning in the wrong direction.
5. Assuming more data always fixes the problem
More data cures HIGH VARIANCE (overfitting). It does NOT cure HIGH BIAS.
If your learning curves have already converged at a low score, collecting
more rows is wasted effort — you need a better model or better features.
6. Treating Lasso's dropped features as "proven useless"
Lasso zeros coefficients partly based on chance and correlations.
Among several correlated features it may keep one and drop the rest
almost arbitrarily. Don't read a zeroed coefficient as "this feature has
no real-world effect." Use ElasticNet when features are correlated.
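Mistake 4 is easy to check empirically. The sketch below, on a synthetic classification dataset that is an assumption of this example, fits LogisticRegression at several values of C and compares the size of the learned coefficients. A smaller C should give a smaller coefficient norm, which is the opposite direction from alpha in Ridge and Lasso.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumed for this illustration).
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=C, max_iter=1000))
    model.fit(X, y)
    # Norm of the coefficients of the final estimator in the pipeline.
    coef_norms[C] = float(np.linalg.norm(model[-1].coef_))
    print(f"C = {C:6.2f}  coefficient norm = {coef_norms[C]:.3f}")
```

If you search over C with a grid, it helps to remember that the small end of the grid is the heavily regularized end.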
Practice Exercises
- A model scores 0.98 accuracy on training data and 0.61 on validation data. Name the problem, identify whether it is high bias or high variance, and list three concrete fixes.
- Write the loss functions for Ridge, Lasso, and ElasticNet from memory (use a fenced code block). In one sentence each, state what each penalty does to the coefficients.
- Using the housing code above, loop alpha over [0.01, 0.1, 1, 10, 100] for Ridge, and print how many coefficients drop below 0.5 in magnitude at each value. Describe the trend you see.
- You plot learning curves and both the training and validation scores plateau together at about 0.68. Is collecting more data likely to help? What should you try instead, and why?
- Explain in your own words why L1 (Lasso) produces exactly-zero coefficients while L2 (Ridge) only shrinks them. A rough geometric argument is fine.
- A colleague tuned alpha by picking the value that maximised the test-set R². Explain what is wrong with this and describe the correct procedure using cross-validation.
Summary
In this chapter you learned:
- Underfitting (high bias) = too simple; poor on both train and validation. Overfitting (high variance) = too complex; great on train, poor on validation. The goal is generalization.
- Bias-variance decomposition: Expected error = Bias² + Variance + Irreducible error. Increasing complexity lowers bias but raises variance; the sweet spot minimises their sum.
- Diagnose with the train-vs-validation gap and with learning curves: converge-low means high bias; wide persistent gap means high variance.
- Fight overfitting with more data, simpler models, feature selection, cross-validation, early stopping, ensembles, and, most surgically, regularization.
- Ridge (L2) penalises Σ bⱼ² and shrinks all coefficients smoothly. Lasso (L1) penalises Σ |bⱼ| and drives some coefficients to exactly zero → built-in feature selection. ElasticNet blends both via l1_ratio.
- The strength alpha (λ) is a hyperparameter: larger means stronger penalty, more bias, less variance. Tune it with RidgeCV/LassoCV, never on the test set.
- Always scale features (via a Pipeline with StandardScaler) before regularizing. Remember C = 1/α for LogisticRegression/SVM.
- Dropout, weight decay, and data augmentation bring the same anti-overfitting principle to neural networks.
Getting the bias-variance balance right is what separates a model that impresses on a slide from one that actually holds up in production.
Next up: Ensemble Methods — Bagging, Boosting & XGBoost — learn how combining many models (via bagging to cut variance and boosting to cut bias) produces some of the most accurate, competition-winning models in machine learning.