Model Evaluation Metrics

Why Metrics Matter: You Cannot Improve What You Measure Wrong

You have trained a model. It "works." But is it any good? That question has no single answer — it depends entirely on how you measure and what you care about. A metric is the lens through which you judge a model, and the wrong lens can make a useless model look brilliant.

Intuitive analogy. Imagine a hospital in Pune screening patients for a rare disease that affects 1 in 100 people. A lazy model that says "healthy" to everyone is correct 99% of the time. On a report, 99% accuracy looks like a triumph. In reality it catches zero sick patients — it is worse than useless, it is dangerous. Accuracy, the most intuitive metric, has quietly lied to you. Choosing the right metric is the difference between a model that helps and one that harms.

This chapter gives you a complete toolkit. For classification you will learn the confusion matrix, precision, recall, F1, ROC-AUC, and how to average across many classes. For regression you will learn MAE, MSE, RMSE, R-squared, and MAPE. Most importantly, you will learn to pick the metric that matches the business cost of being wrong.

Everything here assumes you evaluate on a held-out test set (or via cross-validation), never on the training data — a discipline you set up in the Train-Test Split & Cross-Validation chapter.

The one rule that underlies this entire chapter:
→ A metric encodes what you value. Optimise the metric that
  reflects the real-world cost of your model's mistakes.

The Confusion Matrix: The Foundation of Classification Metrics

Almost every classification metric is built from four counts. For a binary problem — a positive class (the thing you are trying to detect, e.g. fraud, disease, churn) and a negative class — every prediction falls into one of four boxes.

                      PREDICTED
                Positive     Negative
             +------------+------------+
ACTUAL  Pos  |     TP     |     FN     |
             +------------+------------+
        Neg  |     FP     |     TN     |
             +------------+------------+

TP (True Positive)  = predicted positive, actually positive  (correct hit)
TN (True Negative)  = predicted negative, actually negative  (correct rejection)
FP (False Positive) = predicted positive, actually negative  (false alarm, "Type I error")
FN (False Negative) = predicted negative, actually positive  (miss, "Type II error")

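A worked reading of the confusion matrix from the previous chapter's churn example. This matrix follows the scikit-learn convention, which orders rows and columns by label value, so the negative class (0, "stay") comes first and the layout is [[TN, FP], [FN, TP]], the mirror image of the diagram above: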

[[512  38]      Row 0 = actual "stay":   TN=512, FP=38
 [ 71 129]]     Row 1 = actual "churn":  FN=71,  TP=129

So the model:
→ correctly kept 512 stayers (TN)
→ falsely flagged 38 stayers as churners (FP)
→ missed 71 real churners (FN)
→ correctly caught 129 churners (TP)
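
The same reading can be done in code. The sketch below is plain Python and assumes the [[TN, FP], [FN, TP]] layout used for the churn matrix; the variable names are illustrative.

```python
# Churn confusion matrix, rows = actual class, columns = predicted class.
# Layout assumed here: [[TN, FP], [FN, TP]].
cm = [[512, 38],
      [71, 129]]

# Unpack the four counts row by row.
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
print(f"TN={tn} FP={fp} FN={fn} TP={tp} total={total}")

# Row sums tell you how many customers actually stayed or churned.
actual_stay = tn + fp
actual_churn = fn + tp
print(f"actual stay={actual_stay}, actual churn={actual_churn}")
```

If the matrix is a NumPy array returned by sklearn.metrics.confusion_matrix for a binary problem, `tn, fp, fn, tp = cm.ravel()` gives the same four counts.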

The whole game is that FP and FN are different kinds of mistakes with different costs. A false positive on a spam filter deletes a real email. A false negative on a cancer screen sends a sick patient home. Metrics differ precisely in how they weigh these two errors.

Accuracy — and Why It Lies Under Class Imbalance

Accuracy is the fraction of all predictions that were correct. It is the metric everyone reaches for first, and the one that fails most often.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = (correct predictions) / (total predictions)

Accuracy is perfectly fine when your classes are roughly balanced and the two error types cost about the same. It becomes actively misleading under class imbalance.

Rare-disease example (1 positive in 100):
Dataset of 1000 patients: 990 healthy, 10 sick.

"Predict healthy for everyone" model:
TP = 0, TN = 990, FP = 0, FN = 10
Accuracy = (0 + 990) / 1000 = 0.99   ← looks excellent!
Recall on the sick = 0 / 10 = 0.0     ← catches NOBODY.

The 99% accuracy is the base rate of the majority class.
The model has learned nothing useful.
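
The arithmetic above is easy to check yourself. This is a minimal sketch in plain Python that rebuilds the 1000-patient example and scores the "predict healthy for everyone" model; the label encoding (0 = healthy, 1 = sick) is an assumption made for illustration.

```python
# 1000 patients: 990 healthy (0) and 10 sick (1).
y_true = [0] * 990 + [1] * 10

# The lazy model predicts "healthy" for every patient.
y_pred = [0] * len(y_true)

# Count the four confusion-matrix cells.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f}  recall={recall:.2f}")
```

The accuracy equals the share of healthy patients in the data, which is exactly what a model with no skill would score.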

Whenever one class dominates — fraud, disease, churn, defaults, defect detection — accuracy is a trap. You need metrics that focus on the rare, important class. That is what precision and recall do.

Precision, Recall, and F1

These three are the core of imbalanced classification. Each answers a different, precise question.

Precision — "When I say positive, how often am I right?"

Precision = TP / (TP + FP)

Of all the items the model FLAGGED as positive, what fraction truly were?
High precision = few false alarms.

Recall (Sensitivity, True Positive Rate) — "Of all the real positives, how many did I catch?"

Recall = TP / (TP + FN)

Of all the ACTUAL positives, what fraction did the model find?
High recall = few misses.

F1 Score — the harmonic mean of the two

F1 = 2 · (Precision · Recall) / (Precision + Recall)

The harmonic mean punishes imbalance: F1 is high only when BOTH
precision and recall are high. A model with precision 1.0 and
recall 0.01 has F1 ≈ 0.02, not 0.5.

A worked example on the churn matrix TP=129, FP=38, FN=71:

Precision = 129 / (129 + 38) = 129 / 167 = 0.772
Recall    = 129 / (129 + 71) = 129 / 200 = 0.645
F1        = 2 · (0.772 · 0.645) / (0.772 + 0.645) = 0.703

Read as: when the model predicts "churn" it is right 77% of the time,
but it only catches 65% of the customers who actually churn.
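
The three formulas fit in one small helper. The sketch below is plain Python; the function name is hypothetical, and returning 0.0 when a ratio is undefined is a convention chosen here, not the only reasonable one.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts.

    A ratio with a zero denominator is reported as 0.0 (a convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Churn example: TP=129, FP=38, FN=71.
p, r, f1 = precision_recall_f1(tp=129, fp=38, fn=71)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Harmonic vs arithmetic mean for a lopsided model (precision 1.0, recall 0.01).
lopsided_p, lopsided_r = 1.0, 0.01
harmonic = 2 * lopsided_p * lopsided_r / (lopsided_p + lopsided_r)
arithmetic = (lopsided_p + lopsided_r) / 2
print(f"harmonic={harmonic:.3f} arithmetic={arithmetic:.3f}")
```

The last two lines show why F1 uses the harmonic mean: the arithmetic mean would reward the lopsided model with about 0.5, while the harmonic mean stays near the weaker of the two scores.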

The Precision-Recall Trade-off

Precision and recall pull against each other. Push the model to catch more positives (raise recall) and it inevitably flags more borderline cases as positive, so more of them are wrong (precision drops) — and vice versa. You control this balance with the decision threshold.

Threshold on predict_proba (probability of the positive class):

Lower threshold (e.g. 0.3):
→ predict "positive" more readily
→ catch more real positives  → RECALL up
→ but more false alarms       → PRECISION down

Higher threshold (e.g. 0.7):
→ predict "positive" only when very confident
→ fewer false alarms          → PRECISION up
→ but miss more real positives → RECALL down

The default 0.5 is just a starting point, rarely the best choice.

Threshold tuning means choosing the threshold that best matches your business costs, not accepting 0.5 by default. You inspect a range of thresholds and pick the operating point where the precision-recall balance is right for your problem.

import numpy as np
from sklearn.metrics import precision_recall_curve

# y_test = true labels, y_proba = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# Inspect a few operating points
for t in [0.3, 0.5, 0.7]:
    y_pred_t = (y_proba >= t).astype(int)
    tp = np.sum((y_pred_t == 1) & (y_test == 1))
    fp = np.sum((y_pred_t == 1) & (y_test == 0))
    fn = np.sum((y_pred_t == 0) & (y_test == 1))  # missed positives
    prec = tp / (tp + fp) if (tp + fp) else 0
    rec = tp / (tp + fn) if (tp + fn) else 0
    print(f"threshold={t}: precision={prec:.2f}, recall={rec:.2f}")

Illustrative output (shape, not a real benchmark):
threshold=0.3: precision=0.61, recall=0.82
threshold=0.5: precision=0.77, recall=0.65
threshold=0.7: precision=0.88, recall=0.44

As the threshold rises, precision climbs and recall falls — the trade-off.
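
One way to turn "match the threshold to your business costs" into a procedure is to put a price on each error type and pick the threshold with the lowest total cost. The sketch below is plain Python; the ten scores, the labels, and the two cost values are all hypothetical numbers chosen for illustration.

```python
# Hypothetical labels (1 = churn) and predicted churn probabilities.
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
y_proba = [0.10, 0.25, 0.35, 0.40, 0.55, 0.60, 0.65, 0.80, 0.20, 0.90]

# Assumed costs: a missed churner (FN) costs five times a wasted offer (FP).
COST_FP = 1.0
COST_FN = 5.0


def total_cost(threshold):
    """Total cost of the false alarms and misses produced at this threshold."""
    fp = sum(1 for t, s in zip(y_true, y_proba) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_proba) if t == 1 and s < threshold)
    return COST_FP * fp + COST_FN * fn


candidates = [0.3, 0.5, 0.7]
costs = {t: total_cost(t) for t in candidates}
best = min(costs, key=costs.get)

for t in candidates:
    print(f"threshold={t}: cost={costs[t]}")
print("cheapest threshold:", best)
```

Because misses are priced higher than false alarms here, the lowest threshold wins. With the costs reversed, the choice would move the other way. In practice you would pick the threshold on validation data and keep the test set for the final check.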

The ROC Curve and AUC

A single confusion matrix reflects one threshold. But a good model should rank positives above negatives at every threshold. The ROC curve (Receiver Operating Characteristic) visualises performance across all thresholds at once.

ROC curve plots, as the threshold sweeps from 1 down to 0:

  True Positive Rate (Recall)  =  TP / (TP + FN)      on the y-axis
  False Positive Rate          =  FP / (FP + TN)      on the x-axis

Each threshold gives one point. Connecting them traces the curve.

→ A perfect model hugs the top-left corner (TPR=1, FPR=0).
→ A random-guess model lies on the diagonal from (0,0) to (1,1).
→ A curve BELOW the diagonal is worse than random (invert it!).
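
To see where the points of the curve come from, you can sweep the threshold by hand. The sketch below is plain Python on six hypothetical scores; the function name is illustrative. In scikit-learn, sklearn.metrics.roc_curve returns the same kind of (FPR, TPR) arrays.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points, one per distinct threshold, high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Threshold above every score: nothing is flagged, so FPR = TPR = 0.
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and scores: one negative (0.7) outranks one positive (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each positive that is reached moves the curve up, and each negative moves it right. The single misranked negative is what keeps this curve from touching the top-left corner.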

AUC (Area Under the ROC Curve) collapses that curve into a single number between 0 and 1.

AUC interpretation:
AUC = 1.0   → perfect ranking
AUC = 0.5   → no better than random guessing
AUC < 0.5   → worse than random (predictions are inverted)

Probabilistic meaning:
AUC = P(a random positive is ranked above a random negative)
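
The probabilistic meaning can be computed directly by comparing every positive with every negative. The sketch below is plain Python with hypothetical scores; counting ties as half a win is the usual convention. It is quadratic in the number of samples, so treat it as an illustration rather than something to run on large data.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 1, 0, 0]

auc_good = pairwise_auc(y_true, [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])      # one misranked pair
auc_perfect = pairwise_auc(y_true, [0.9, 0.8, 0.3, 0.7, 0.2, 0.1])   # all positives on top
auc_inverted = pairwise_auc(y_true, [0.1, 0.2, 0.8, 0.3, 0.9, 0.7])  # all positives at bottom
auc_constant = pairwise_auc(y_true, [0.5] * 6)                       # no ranking information

print(f"good={auc_good:.3f} perfect={auc_perfect} "
      f"inverted={auc_inverted} constant={auc_constant}")
```

The four cases line up with the interpretation table: 8 of 9 pairs ranked correctly, then 1.0, 0.0, and 0.5.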

AUC's great virtues: it is threshold-independent (it judges the ranking, not one operating point) and it is far more robust to class imbalance than accuracy. Its main limitation: on severely imbalanced data, ROC-AUC can look optimistic, because FPR = FP / (FP + TN) has the huge negative count in its denominator, so even a large number of false alarms barely moves it. In that case the precision-recall AUC (average precision) is more honest, because precision = TP / (TP + FP) compares those false alarms directly with the few true positives.

Quick guidance:
→ Balanced or moderate imbalance  → ROC-AUC is a great overall summary.
→ Severe imbalance, positives rare → prefer PR-AUC / average precision.

Multiclass Averaging: Macro, Micro, and Weighted

With more than two classes (e.g. classify a ticket as Billing, Technical, or Sales) precision and recall are computed per class, then averaged into a single number. How you average matters enormously.

Averaging	How it combines classes	Effect on class imbalance	Use when
`macro`	Simple unweighted mean of the per-class scores	Treats every class equally, so rare classes count as much as common ones	You care about performance on rare classes as much as common ones
`weighted`	Mean weighted by each class's number of true instances (`support`)	Dominated by common classes	You want a single number that reflects the overall population
`micro`	Pool all TP, FP, FN globally, then compute the metric once	Dominated by common classes; equals accuracy for single-label problems	You care about total correct decisions across all predictions

Illustrative 3-class recall:
  Billing  : recall 0.90 (support 900)
  Technical: recall 0.60 (support  80)
  Sales    : recall 0.40 (support  20)

macro recall    = (0.90 + 0.60 + 0.40) / 3            = 0.633
weighted recall = (0.90·900 + 0.60·80 + 0.40·20)/1000 = 0.866

Macro exposes the weak minority classes; weighted hides them behind Billing.
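
The two averages are short enough to compute by hand in code. This sketch is plain Python and uses the illustrative per-class recalls and supports from the example; the dictionary layout is just one convenient way to hold them.

```python
# Per-class recall and support from the illustrative ticket example.
per_class = {
    "Billing":   {"recall": 0.90, "support": 900},
    "Technical": {"recall": 0.60, "support": 80},
    "Sales":     {"recall": 0.40, "support": 20},
}

# Macro: plain mean of the per-class scores, every class counts once.
macro_recall = sum(c["recall"] for c in per_class.values()) / len(per_class)

# Weighted: each class counts in proportion to its support.
total_support = sum(c["support"] for c in per_class.values())
weighted_recall = sum(
    c["recall"] * c["support"] for c in per_class.values()
) / total_support

print(f"macro recall    = {macro_recall:.3f}")
print(f"weighted recall = {weighted_recall:.3f}")
```

Billing holds 90% of the support, so the weighted value sits close to Billing's own recall, while the macro value is pulled down by the two small classes.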

The lesson: report macro when small classes matter (they usually do), and never let a high weighted score lull you into ignoring a class the model is failing.

Regression Metrics: MAE, MSE, RMSE, R-squared, MAPE

For regression — predicting a continuous number like a house price or delivery time — there is no confusion matrix. Instead we measure how far predictions ŷ land from the truth y.

Notation: yᵢ = true value, ŷᵢ = predicted value, n = number of samples,
          ȳ = mean of the true values.

Mean Absolute Error (MAE)

MAE = (1/n) · Σ |yᵢ - ŷᵢ|

Average size of the error, in the ORIGINAL units.
Treats all errors linearly — a ₹10,000 miss counts exactly twice
a ₹5,000 miss. Robust to outliers. Easy to explain to stakeholders.

Mean Squared Error (MSE)

MSE = (1/n) · Σ (yᵢ - ŷᵢ)²

Average of the SQUARED errors. Squaring punishes large errors much
more than small ones — one big miss dominates. Units are squared
(e.g. rupees², which is hard to interpret), so it is mostly used
internally as a training loss.

Root Mean Squared Error (RMSE)

RMSE = √MSE = √( (1/n) · Σ (yᵢ - ŷᵢ)² )

The square root of MSE, back in the ORIGINAL units. Keeps MSE's heavy
penalty on large errors but is interpretable. RMSE is always >= MAE;
the gap between them signals the presence of large outlier errors.
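
A small experiment makes the MAE/RMSE gap concrete. The sketch below is plain Python on two hypothetical sets of predictions that share the same MAE: one spreads the error evenly, the other concentrates it in a single large miss.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


y_true = [100, 100, 100, 100, 100]
even_errors = [110, 90, 110, 90, 110]     # every prediction is off by 10
one_big_miss = [100, 100, 100, 100, 150]  # four exact hits, one miss of 50

print(f"even errors : MAE={mae(y_true, even_errors):.2f}  "
      f"RMSE={rmse(y_true, even_errors):.2f}")
print(f"one big miss: MAE={mae(y_true, one_big_miss):.2f}  "
      f"RMSE={rmse(y_true, one_big_miss):.2f}")
```

Both models have an MAE of 10, but the RMSE of the second is more than twice as large. When all errors have the same size, RMSE equals MAE; the more the error is concentrated in a few samples, the wider the gap.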

R-squared (Coefficient of Determination)

R² = 1 - (SS_res / SS_tot)
   = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²

Fraction of the variance in y explained by the model.
R² = 1.0  → perfect predictions
R² = 0.0  → no better than always predicting the mean ȳ
R² < 0    → WORSE than predicting the mean (yes, it can go negative)
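
The three reference values of R² can be reproduced with a few lines. This is a plain-Python sketch on a hypothetical four-point target; the function assumes the true values are not all identical, because SS_tot would then be zero.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot. Assumes y_true is not constant."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y_true = [10, 20, 30, 40]

r2_perfect = r_squared(y_true, [10, 20, 30, 40])   # exact predictions
r2_mean = r_squared(y_true, [25, 25, 25, 25])      # always predict the mean
r2_reversed = r_squared(y_true, [40, 30, 20, 10])  # trend predicted backwards

print(f"perfect={r2_perfect}  mean-only={r2_mean}  reversed={r2_reversed}")
```

The reversed predictions give a strongly negative value: their squared error is four times that of simply predicting the mean.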

Mean Absolute Percentage Error (MAPE)

MAPE = (100/n) · Σ | (yᵢ - ŷᵢ) / yᵢ |     (expressed as a percentage)

Error as a percentage of the true value — scale-free and intuitive
("we are off by 8% on average"). Caution: it explodes when yᵢ is near
zero, and it penalises over-prediction and under-prediction unevenly.
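
Both cautions are easy to demonstrate. The sketch below is plain Python with hypothetical numbers: first the same absolute miss on large and near-zero targets, then an over-prediction and an under-prediction of one target.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Undefined if any y is 0."""
    ratios = [abs((y - p) / y) for y, p in zip(y_true, y_pred)]
    return 100 / len(y_true) * sum(ratios)


# Every prediction is off by exactly 5 units in both cases.
mape_large = mape([100, 200, 400], [105, 205, 405])
mape_near_zero = mape([0.5, 200, 400], [5.5, 205, 405])
print(f"large targets: {mape_large:.2f}%   one near-zero target: {mape_near_zero:.2f}%")

# Asymmetry: for a true value of 100, a non-negative prediction can be at most
# 100% too low, but there is no ceiling on how far it can be too high.
under = mape([100], [0])
over = mape([100], [300])
print(f"predict 0 -> {under:.0f}%   predict 300 -> {over:.0f}%")
```

A single target of 0.5 turns a modest 5-unit miss into an error of several hundred percent, which is why MAPE is a poor fit when targets can be close to zero.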

Which Regression Metric When?

Metric	Units	Outlier sensitivity	Best for
MAE	Original units	Low (robust)	You want a plain, robust "average error"; outliers are noise you don't want to over-weight
MSE	Squared units	High	A training loss; when large errors are disproportionately bad
RMSE	Original units	High	Reporting error size while still penalising big misses; the default report metric for many teams
R-squared	Unitless ratio	Moderate	Comparing models on the same target; explaining "variance captured"
MAPE	Percentage	Depends on scale	Communicating relative error to business; comparing across targets of different scale (avoid when values near zero)

Rule of thumb:
→ Care about typical error, distrust outliers → MAE.
→ Big mistakes are especially costly            → RMSE / MSE.
→ Need a business-friendly percentage           → MAPE (if no near-zero targets).
→ Comparing models / "how much did we explain"  → R-squared.

Putting It Together in scikit-learn

sklearn.metrics gives you everything above with one-line calls. Here is a compact classification evaluation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             roc_auc_score)

# Imbalanced binary data (10% positive) so accuracy will be misleading
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = Pipeline([("scaler", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_tr, y_tr)

y_pred  = model.predict(X_te)
y_proba = model.predict_proba(X_te)[:, 1]   # P(positive) for ROC-AUC

print("Accuracy :", round(accuracy_score(y_te, y_pred), 3))
print("Precision:", round(precision_score(y_te, y_pred), 3))
print("Recall   :", round(recall_score(y_te, y_pred), 3))
print("F1       :", round(f1_score(y_te, y_pred), 3))
print("ROC-AUC  :", round(roc_auc_score(y_te, y_proba), 3))
print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred))

Illustrative output (shape, not a real benchmark):
Accuracy : 0.918
Precision: 0.667
Recall   : 0.36
F1       : 0.468
ROC-AUC  : 0.90
[[441   9]
 [ 32  18]]
              precision    recall  f1-score   support
           0       0.93      0.98      0.96       450
           1       0.67      0.36      0.47        50
    accuracy                           0.92       500
   macro avg       0.80      0.67      0.71       500
weighted avg       0.91      0.92      0.91       500

Notice: 92% accuracy looks great, but recall on class 1 is only 0.36 —
the model misses most of the minority class. Accuracy hid the problem.

Reading classification_report: each row is a class; the macro avg row is the unweighted mean across classes (exposes weak minorities); weighted avg weights by support. Always scan the per-class rows, not just the averages.

For multiclass and regression, the same module has you covered.

# --- Multiclass averaging ---
# y_true and y_pred are placeholders for your multiclass labels and predictions.
from sklearn.metrics import f1_score
f1_macro    = f1_score(y_true, y_pred, average="macro")
f1_weighted = f1_score(y_true, y_pred, average="weighted")
f1_micro    = f1_score(y_true, y_pred, average="micro")

# --- Regression ---
# Here y_true and y_pred stand for continuous targets and predictions instead.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, mean_absolute_percentage_error)

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # RMSE = sqrt(MSE)
r2   = r2_score(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)  # returns a fraction
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}  MAPE={mape*100:.1f}%")

Illustrative regression output:
MAE=142000.0  RMSE=205000.0  R2=0.871  MAPE=8.4%

RMSE (₹2.05L) is noticeably larger than MAE (₹1.42L) → a few large
errors are inflating RMSE. If those big misses matter, act on them.

Choosing the Right Metric for the Business Goal

This is the most important section of the chapter. The metric is not a technicality — it encodes what your organisation values. Ask: what does a mistake cost?

Scenario	Costly error	Optimise for	Why
Cancer / disease screening	Missing a sick patient (FN)	Recall	A missed case can be fatal; a false alarm just triggers a follow-up test
Spam / promotional filter	Deleting a real email (FP)	Precision	A lost genuine email is far worse than one spam message slipping through
Fraud detection (rare fraud)	Both, but misses are expensive	Recall + PR-AUC	Catch fraud, then tune threshold against the false-alarm review cost
Loan default (imbalanced)	Approving a defaulter (FN on default)	Recall / ROC-AUC	A default costs far more than declining a good applicant
Balanced product rating class	Either error similar	Accuracy / F1	Classes balanced and costs symmetric, so accuracy is meaningful
Ranking leads for a sales team	Ordering quality, not a hard cutoff	ROC-AUC / PR-AUC	You act on the ranked list, not a single threshold
House price prediction	Large price misses	RMSE / MAE	RMSE if big misses hurt more; MAE for a robust typical error
Demand forecasting for stakeholders	Percentage-off matters	MAPE	Business speaks in "% off"; comparable across products

The decision procedure:
1. Is it classification or regression?
2. Are the classes imbalanced? (If yes, drop plain accuracy.)
3. Which error is more expensive — FP or FN?
     FN worse → optimise RECALL.
     FP worse → optimise PRECISION.
     Both matter → F1.
     Judging the ranking, not one threshold → ROC-AUC / PR-AUC.
4. For regression: outliers noise → MAE; outliers dangerous → RMSE;
   need a % → MAPE; explaining variance → R-squared.

Recall the disease example from the top: there, recall is king — missing a patient is unacceptable, so we tolerate false alarms. For a spam filter, precision wins — we would rather let a little spam through than lose a real email. Same math, opposite metric, because the costs are opposite.

Common Mistakes

1. Reporting accuracy on imbalanced data

On a 99:1 dataset, 99% accuracy can mean the model never predicts the
rare class. Always check the per-class recall and precision, and prefer
F1, ROC-AUC, or PR-AUC when classes are imbalanced.

2. Evaluating on the training set

Scoring on data the model already saw gives inflated, meaningless numbers
(the model can memorise). ALWAYS evaluate on a held-out test set or via
cross-validation. A near-perfect training score with a poor test score is
the classic overfitting signature (see the next chapter).

3. Optimising the wrong metric

Tuning for accuracy when the business cares about catching fraud gets you
a model that is "accurate" and useless. Decide the metric FROM the business
cost BEFORE training, then optimise that metric (e.g. scoring="recall"
in GridSearchCV), not whatever is convenient.
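
As an illustration of tuning against the chosen metric, here is a minimal scikit-learn sketch that selects hyperparameters by recall instead of the default accuracy. The synthetic dataset, the parameter grid, and the choice of 5 folds are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 10% positive), for illustration only.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Hypothetical grid: regularisation strength and class weighting.
param_grid = {"clf__C": [0.1, 1.0, 10.0],
              "clf__class_weight": [None, "balanced"]}

# scoring="recall" makes cross-validation rank candidates by recall
# on the positive class rather than by accuracy.
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X, y)

print("best params   :", search.best_params_)
print("best CV recall:", round(search.best_score_, 3))
```

Recall on its own can be maximised by flagging almost everything, so it is worth checking precision for the selected model as well before you commit to it.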

4. Accepting the default 0.5 threshold blindly

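For most binary classifiers that expose predict_proba, predict() is equivalent
to a 0.5 cutoff on the positive-class probability. For imbalanced or
cost-asymmetric problems that is rarely optimal. Use predict_proba and pick the threshold from the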
precision-recall curve according to your FP-vs-FN costs.

5. Using ROC-AUC on severely imbalanced data without thinking

With very rare positives, ROC-AUC can look great because the enormous
negative class swamps the false-positive rate. Prefer PR-AUC (average
precision), which focuses on the positive class you actually care about.

6. Comparing R-squared or MAPE across different datasets

R² depends on the variance of the target, and MAPE explodes near zero
targets — so a "higher R²" on a different dataset does NOT mean a better
model. Compare metrics only on the SAME target and test set.
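
The dependence of R² on target variance shows up clearly when the same errors are applied to two different targets. The sketch below is plain Python; both datasets and the error pattern are hypothetical.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot. Assumes y_true is not constant."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# The same four errors are added to two targets with very different spread.
errors = [2, -2, 2, -2]
narrow_true = [48, 50, 52, 50]   # values cluster tightly around 50
wide_true = [10, 40, 70, 100]    # values range widely

narrow_pred = [y + e for y, e in zip(narrow_true, errors)]
wide_pred = [y + e for y, e in zip(wide_true, errors)]

r2_narrow = r_squared(narrow_true, narrow_pred)
r2_wide = r_squared(wide_true, wide_pred)
print(f"narrow target: R2={r2_narrow:.3f}   wide target: R2={r2_wide:.3f}")
```

The predictions are equally far off in absolute terms, yet one R² is negative and the other is close to 1. The difference comes entirely from the spread of the target, not from the quality of the model.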

Practice Exercises

1. A fraud model produces this confusion matrix (row = actual, [[TN, FP], [FN, TP]]): [[9500, 100], [200, 200]]. Compute accuracy, precision, recall, and F1. Is accuracy a good summary here? Explain in one sentence.
2. For a disease screening tool and a promotional-email filter, state which of precision or recall you would prioritise for each, and justify each choice in terms of the cost of a false negative versus a false positive.
3. Using scikit-learn's load_breast_cancer, build a StandardScaler + LogisticRegression pipeline, then print the classification_report and roc_auc_score. Which class has lower recall, and what does that mean?
4. Take the model from Exercise 3, extract predict_proba, and compute precision and recall at thresholds 0.3, 0.5, and 0.7. Describe how each metric moves and explain why.
5. For true values y = [100, 200, 300, 400] and predictions ŷ = [110, 190, 350, 380], compute MAE, MSE, RMSE, and MAPE by hand. Which prediction contributes most to RMSE, and why?
6. You have a 3-class classifier where one class has only 30 of 1000 samples. Explain when you would report macro F1 versus weighted F1, and what each would hide or reveal.

Summary

In this chapter you learned:

The confusion matrix (TP, FP, TN, FN) is the foundation of classification metrics; the key tension is that FP and FN are different mistakes with different costs.
Accuracy = (TP + TN) / total is intuitive but lies under class imbalance — a lazy majority-class model can score 99% and catch nothing.
Precision = TP / (TP + FP) ("when I say positive, am I right?") and Recall = TP / (TP + FN) ("did I catch the positives?") trade off against each other; F1 is their harmonic mean.
The decision threshold controls the precision-recall balance — lower it for more recall, raise it for more precision; 0.5 is rarely optimal.
ROC-AUC summarises ranking quality across all thresholds and is imbalance-robust; prefer PR-AUC when positives are very rare.
Multiclass metrics are averaged as macro (every class equal, exposes rare classes), weighted (by support), or micro (global pooling).
Regression metrics: MAE (robust, original units), MSE/RMSE (punish big errors), R-squared (variance explained), MAPE (percentage error) — choose by how much outliers should count.
Choose the metric from the business cost: recall for disease detection, precision for spam, ROC-AUC/PR-AUC for ranking, RMSE vs MAE by outlier sensitivity.
Avoid the classic traps: accuracy on imbalanced data, evaluating on the training set, optimising the wrong metric, and the blind 0.5 threshold.

Metrics are the language in which you argue that a model is good — pick the words that actually mean what your business needs.

Next up: Bias-Variance, Overfitting & Regularization — why a model that memorises the training set fails on new data, how to diagnose the bias-variance trade-off, and how regularization keeps models honest.