Why Metrics Matter: You Cannot Improve What You Measure Wrong
You have trained a model. It "works." But is it any good? That question has no single answer — it depends entirely on how you measure and what you care about. A metric is the lens through which you judge a model, and the wrong lens can make a useless model look brilliant.
Intuitive analogy. Imagine a hospital in Pune screening patients for a rare disease that affects 1 in 100 people. A lazy model that says "healthy" to everyone is correct 99% of the time. On a report, 99% accuracy looks like a triumph. In reality it catches zero sick patients — it is worse than useless, it is dangerous. Accuracy, the most intuitive metric, has quietly lied to you. Choosing the right metric is the difference between a model that helps and one that harms.
This chapter gives you a complete toolkit. For classification you will learn the confusion matrix, precision, recall, F1, ROC-AUC, and how to average across many classes. For regression you will learn MAE, MSE, RMSE, R-squared, and MAPE. Most importantly, you will learn to pick the metric that matches the business cost of being wrong.
Everything here assumes you evaluate on a held-out test set (or via cross-validation), never on the training data — a discipline you set up in the Train-Test Split & Cross-Validation chapter.
The one rule that underlies this entire chapter:
→ A metric encodes what you value. Optimise the metric that
reflects the real-world cost of your model's mistakes.
The Confusion Matrix: The Foundation of Classification Metrics
Almost every classification metric is built from four counts. For a binary problem — a positive class (the thing you are trying to detect, e.g. fraud, disease, churn) and a negative class — every prediction falls into one of four boxes.
                  PREDICTED
              Positive   Negative
            +----------+----------+
ACTUAL Pos  |    TP    |    FN    |
            +----------+----------+
       Neg  |    FP    |    TN    |
            +----------+----------+
TP (True Positive) = predicted positive, actually positive (correct hit)
TN (True Negative) = predicted negative, actually negative (correct rejection)
FP (False Positive) = predicted positive, actually negative (false alarm, "Type I error")
FN (False Negative) = predicted negative, actually positive (miss, "Type II error")
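A worked reading of the confusion matrix from the previous chapter's churn example. Note that scikit-learn orders rows and columns by label, so the negative class ("stay") comes first and the layout is [[TN, FP], [FN, TP]], the mirror image of the diagram above: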
[[512 38] Row 0 = actual "stay": TN=512, FP=38
[ 71 129]] Row 1 = actual "churn": FN=71, TP=129
So the model:
→ correctly kept 512 stayers (TN)
→ falsely flagged 38 stayers as churners (FP)
→ missed 71 real churners (FN)
→ correctly caught 129 churners (TP)
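The four counts can also be tallied directly from paired label lists. The following is a minimal pure-Python sketch; the helper name `confusion_counts` and the toy labels are illustrative assumptions, not data from the churn example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1          # correct hit
        elif predicted == positive:
            fp += 1          # false alarm
        elif actual == positive:
            fn += 1          # miss
        else:
            tn += 1          # correct rejection
    return tp, fp, fn, tn

# Toy labels (hypothetical): 1 = churn, 0 = stay
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=3
```

With scikit-learn, `confusion_matrix(y_true, y_pred).ravel()` returns the same counts for 0/1 labels in the order tn, fp, fn, tp.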
The whole game is that FP and FN are different kinds of mistakes with different costs. A false positive on a spam filter deletes a real email. A false negative on a cancer screen sends a sick patient home. Metrics differ precisely in how they weigh these two errors.
Accuracy — and Why It Lies Under Class Imbalance
Accuracy is the fraction of all predictions that were correct. It is the metric everyone reaches for first, and the one that fails most often.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
= (correct predictions) / (total predictions)
Accuracy is perfectly fine when your classes are roughly balanced and the two error types cost about the same. It becomes actively misleading under class imbalance.
Rare-disease example (1 positive in 100):
Dataset of 1000 patients: 990 healthy, 10 sick.
"Predict healthy for everyone" model:
TP = 0, TN = 990, FP = 0, FN = 10
Accuracy = (0 + 990) / 1000 = 0.99 ← looks excellent!
Recall on the sick = 0 / 10 = 0.0 ← catches NOBODY.
The 99% accuracy is the base rate of the majority class.
The model has learned nothing useful.
Whenever one class dominates — fraud, disease, churn, defaults, defect detection — accuracy is a trap. You need metrics that focus on the rare, important class. That is what precision and recall do.
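The rare-disease arithmetic above is easy to reproduce. This short sketch builds the 990/10 dataset and the "predict healthy for everyone" model in plain Python; the variable names are illustrative.

```python
# 990 healthy (0) and 10 sick (1) patients, as in the rare-disease example
y_true = [0] * 990 + [1] * 10
# The lazy model predicts "healthy" for every patient
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.99, recall=0.00
```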
Precision, Recall, and F1
These three are the core of imbalanced classification. Each answers a different, precise question.
Precision — "When I say positive, how often am I right?"
Precision = TP / (TP + FP)
Of all the items the model FLAGGED as positive, what fraction truly were?
High precision = few false alarms.
Recall (Sensitivity, True Positive Rate) — "Of all the real positives, how many did I catch?"
Recall = TP / (TP + FN)
Of all the ACTUAL positives, what fraction did the model find?
High recall = few misses.
F1 Score — the harmonic mean of the two
F1 = 2 · (Precision · Recall) / (Precision + Recall)
The harmonic mean punishes imbalance: F1 is high only when BOTH
precision and recall are high. A model with precision 1.0 and
recall 0.01 has F1 ≈ 0.02, not 0.5.
A worked example on the churn matrix TP=129, FP=38, FN=71:
Precision = 129 / (129 + 38) = 129 / 167 = 0.772
Recall = 129 / (129 + 71) = 129 / 200 = 0.645
F1 = 2 · (0.772 · 0.645) / (0.772 + 0.645) = 0.703
Read as: when the model predicts "churn" it is right 77% of the time,
but it only catches 65% of the customers who actually churn.
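As a quick check of the worked example, the three formulas can be evaluated directly from the counts. This is a small sketch in plain Python; it also reproduces the harmonic-mean illustration (precision 1.0, recall 0.01).

```python
tp, fp, fn = 129, 38, 71  # counts from the churn confusion matrix

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.772 recall=0.645 f1=0.703

# The harmonic mean stays close to the weaker of the two scores
lopsided_f1 = 2 * 1.0 * 0.01 / (1.0 + 0.01)
print(f"lopsided F1={lopsided_f1:.3f}")  # lopsided F1=0.020
```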
The Precision-Recall Trade-off
Precision and recall pull against each other. Push the model to catch more positives (raise recall) and it inevitably flags more borderline cases as positive, so more of them are wrong (precision drops) — and vice versa. You control this balance with the decision threshold.
Threshold on predict_proba (probability of the positive class):
Lower threshold (e.g. 0.3):
→ predict "positive" more readily
→ catch more real positives → RECALL up
→ but more false alarms → PRECISION down
Higher threshold (e.g. 0.7):
→ predict "positive" only when very confident
→ fewer false alarms → PRECISION up
→ but miss more real positives → RECALL down
The default 0.5 is just a starting point, rarely the best choice.
Threshold tuning means choosing the threshold that best matches your business costs, not accepting 0.5 by default. You inspect a range of thresholds and pick the operating point where the precision-recall balance is right for your problem.
import numpy as np
from sklearn.metrics import precision_recall_curve
# y_test = true labels, y_proba = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# Inspect a few operating points
for t in [0.3, 0.5, 0.7]:
    y_pred_t = (y_proba >= t).astype(int)
    tp = np.sum((y_pred_t == 1) & (y_test == 1))
    fp = np.sum((y_pred_t == 1) & (y_test == 0))
    fn = np.sum((y_pred_t == 0) & (y_test == 1))  # missed positives
    prec = tp / (tp + fp) if (tp + fp) else 0
    rec = tp / (tp + fn) if (tp + fn) else 0
    print(f"threshold={t}: precision={prec:.2f}, recall={rec:.2f}")
Illustrative output (shape, not a real benchmark):
threshold=0.3: precision=0.61, recall=0.82
threshold=0.5: precision=0.77, recall=0.65
threshold=0.7: precision=0.88, recall=0.44
As the threshold rises, precision climbs and recall falls — the trade-off.
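One way to turn "match the threshold to your business costs" into a concrete rule is to attach a cost to each error type and pick the threshold with the lowest total. The sketch below assumes made-up labels, probabilities, and costs; `total_cost` is a hypothetical helper, not a scikit-learn function.

```python
def total_cost(y_true, y_proba, threshold, cost_fp, cost_fn):
    """Total cost of false alarms and misses at one threshold."""
    fp = sum(p >= threshold and t == 0 for t, p in zip(y_true, y_proba))
    fn = sum(p < threshold and t == 1 for t, p in zip(y_true, y_proba))
    return cost_fp * fp + cost_fn * fn

# Hypothetical labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_proba = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.80, 0.90]
candidates = [0.3, 0.5, 0.7]

# Assumed costs: a miss (FN) is five times as expensive as a false alarm (FP)
costs = {t: total_cost(y_true, y_proba, t, cost_fp=1, cost_fn=5) for t in candidates}
best = min(costs, key=costs.get)
print(costs, "-> best threshold:", best)            # lowest cost at 0.3

# Reverse the costs and the preferred threshold moves the other way
swapped = {t: total_cost(y_true, y_proba, t, cost_fp=5, cost_fn=1) for t in candidates}
best_swapped = min(swapped, key=swapped.get)
print(swapped, "-> best threshold:", best_swapped)  # lowest cost at 0.7
```

In practice the threshold should be chosen on validation data rather than the final test set, so the test score remains an honest estimate.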
The ROC Curve and AUC
A single confusion matrix reflects one threshold. But a good model should rank positives above negatives at every threshold. The ROC curve (Receiver Operating Characteristic) visualises performance across all thresholds at once.
ROC curve plots, as the threshold sweeps from 1 down to 0:
True Positive Rate (Recall) = TP / (TP + FN) on the y-axis
False Positive Rate = FP / (FP + TN) on the x-axis
Each threshold gives one point. Connecting them traces the curve.
→ A perfect model hugs the top-left corner (TPR=1, FPR=0).
→ A random-guess model lies on the diagonal from (0,0) to (1,1).
→ A curve BELOW the diagonal is worse than random (invert it!).
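To make the threshold sweep concrete, the sketch below computes the (FPR, TPR) points by hand for four toy predictions. `roc_points` is a hypothetical helper that assumes 0/1 labels; scikit-learn's `roc_curve` does the same job for real data.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs as the threshold sweeps from high to low."""
    n_pos = sum(y_true)               # assumes labels are 0 or 1
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]             # threshold above every score: nothing flagged
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(s >= threshold and t == 1 for t, s in zip(y_true, y_score))
        fp = sum(s >= threshold and t == 0 for t, s in zip(y_true, y_score))
        points.append((fp / n_neg, tp / n_pos))
    return points

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]       # toy scores (illustrative)

points = roc_points(y_true, y_score)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```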
AUC (Area Under the ROC Curve) collapses that curve into a single number between 0 and 1.
AUC interpretation:
AUC = 1.0 → perfect ranking
AUC = 0.5 → no better than random guessing
AUC < 0.5 → worse than random (predictions are inverted)
Probabilistic meaning:
AUC = P(a random positive is ranked above a random negative)
AUC's great virtues: it is threshold-independent (it judges the ranking, not one operating point) and it is far more robust to class imbalance than accuracy. Its main limitation: on severely imbalanced data, ROC-AUC can look optimistic because the huge negative class dominates the FPR calculation. In that case the precision-recall AUC (average precision) is more honest, because it focuses on the positive class.
Quick guidance:
→ Balanced or moderate imbalance → ROC-AUC is a great overall summary.
→ Severe imbalance, positives rare → prefer PR-AUC / average precision.
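The probabilistic reading of AUC can be checked by brute force: compare every positive with every negative and count how often the positive scores higher. This is a sketch for intuition only (it is quadratic in the data size); `pairwise_auc` is a hypothetical helper, and the scores are toy values.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

On the same inputs, `sklearn.metrics.roc_auc_score` should agree with this pairwise count, since both measure the same ranking probability.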
Multiclass Averaging: Macro, Micro, and Weighted
With more than two classes (e.g. classify a ticket as Billing, Technical, or Sales) precision and recall are computed per class, then averaged into a single number. How you average matters enormously.
| Averaging | How it combines classes | Effect on class imbalance | Use when |
|---|---|---|---|
| macro | Simple unweighted mean of the per-class scores | Treats every class equally, so rare classes count as much as common ones | You care about performance on rare classes as much as common ones |
| weighted | Mean weighted by each class's number of true instances (support) | Dominated by common classes | You want a single number that reflects the overall population |
| micro | Pool all TP, FP, FN globally, then compute the metric once | Dominated by common classes; equals accuracy for single-label problems | You care about total correct decisions across all predictions |
Illustrative 3-class recall:
Billing : recall 0.90 (support 900)
Technical: recall 0.60 (support 80)
Sales : recall 0.40 (support 20)
macro recall = (0.90 + 0.60 + 0.40) / 3 = 0.633
weighted recall = (0.90·900 + 0.60·80 + 0.40·20)/1000 = 0.866
Macro exposes the weak minority classes; weighted hides them behind Billing.
The lesson: report macro when small classes matter (they usually do), and never let a high weighted score lull you into ignoring a class the model is failing.
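The averaging arithmetic for the three-class ticket example can be written out in a few lines. The sketch below uses the per-class recall and support figures from the illustration; the dictionary names are assumptions made for readability.

```python
# Per-class recall and support from the illustrative ticket example
recall = {"Billing": 0.90, "Technical": 0.60, "Sales": 0.40}
support = {"Billing": 900, "Technical": 80, "Sales": 20}
total = sum(support.values())

# Macro: every class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted: each class counts in proportion to its support
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

# Micro: pool the true positives of all classes, then divide once
true_positives = {c: recall[c] * support[c] for c in recall}
micro_recall = sum(true_positives.values()) / total

print(f"macro={macro_recall:.3f} weighted={weighted_recall:.3f} micro={micro_recall:.3f}")
# macro=0.633 weighted=0.866 micro=0.866
```

For recall specifically, the weighted and micro averages coincide, because weighting each class's recall by its support simply recovers the pooled true-positive count. For precision and F1 the two averages generally differ.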
Regression Metrics: MAE, MSE, RMSE, R-squared, MAPE
For regression — predicting a continuous number like a house price or delivery time — there is no confusion matrix. Instead we measure how far predictions ŷ land from the truth y.
Notation: yᵢ = true value, ŷᵢ = predicted value, n = number of samples,
ȳ = mean of the true values.
Mean Absolute Error (MAE)
MAE = (1/n) · Σ |yᵢ - ŷᵢ|
Average size of the error, in the ORIGINAL units.
Treats all errors linearly — a ₹10,000 miss counts exactly twice
a ₹5,000 miss. Robust to outliers. Easy to explain to stakeholders.
Mean Squared Error (MSE)
MSE = (1/n) · Σ (yᵢ - ŷᵢ)²
Average of the SQUARED errors. Squaring punishes large errors much
more than small ones — one big miss dominates. Units are squared
(e.g. rupees², which is hard to interpret), so it is mostly used
internally as a training loss.
Root Mean Squared Error (RMSE)
RMSE = √MSE = √( (1/n) · Σ (yᵢ - ŷᵢ)² )
The square root of MSE, back in the ORIGINAL units. Keeps MSE's heavy
penalty on large errors but is interpretable. RMSE is always >= MAE;
the gap between them signals the presence of large outlier errors.
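A tiny experiment shows how the gap between RMSE and MAE appears. Both error lists below have the same total absolute error, but one concentrates it in a single large miss; the numbers are invented for illustration.

```python
import math

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

even_errors = [1, 1, 1, 1]    # four equal misses
spiky_errors = [0, 0, 0, 4]   # one large miss, same total absolute error

for name, errs in [("even", even_errors), ("spiky", spiky_errors)]:
    print(f"{name}: MAE={mae(errs):.2f} RMSE={rmse(errs):.2f}")
# even: MAE=1.00 RMSE=1.00
# spiky: MAE=1.00 RMSE=2.00
```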
R-squared (Coefficient of Determination)
R² = 1 - (SS_res / SS_tot)
= 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²
Fraction of the variance in y explained by the model.
R² = 1.0 → perfect predictions
R² = 0.0 → no better than always predicting the mean ȳ
R² < 0 → WORSE than predicting the mean (yes, it can go negative)
Mean Absolute Percentage Error (MAPE)
MAPE = (100/n) · Σ | (yᵢ - ŷᵢ) / yᵢ | (expressed as a percentage)
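Error as a percentage of the true value: scale-free and intuitive
("we are off by 8% on average"). Caution: it is undefined when yᵢ = 0 and explodes when yᵢ is near
zero, and it is asymmetric: for a positive target and a non-negative prediction, an under-prediction costs at most 100% while an over-prediction is unbounded.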
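The five formulas above translate almost line for line into code. The following from-scratch sketch is meant for checking small examples by hand, not as a replacement for `sklearn.metrics`; the function name `regression_metrics` and the toy values are assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    # MAPE divides by the true value, so it fails if any y_true entry is zero
    mape = 100 / n * sum(abs(e / t) for e, t in zip(errors, y_true))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "MAPE": mape}

y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]
metrics = regression_metrics(y_true, y_pred)
print({k: round(v, 3) for k, v in metrics.items()})
# {'MAE': 2.5, 'MSE': 6.5, 'RMSE': 2.55, 'R2': 0.948, 'MAPE': 11.875}

# Always predicting the mean gives R2 = 0; reversing the order does worse
r2_mean = regression_metrics(y_true, [25, 25, 25, 25])["R2"]
r2_reversed = regression_metrics(y_true, [40, 30, 20, 10])["R2"]
print(r2_mean, r2_reversed)  # 0.0 -3.0
```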
Which Regression Metric When?
| Metric | Units | Outlier sensitivity | Best for |
|---|---|---|---|
| MAE | Original units | Low (robust) | You want a plain, robust "average error"; outliers are noise you don't want to over-weight |
| MSE | Squared units | High | A training loss; when large errors are disproportionately bad |
| RMSE | Original units | High | Reporting error size while still penalising big misses; the default report metric for many teams |
| R-squared | Unitless ratio | High (built on squared errors) | Comparing models on the same target; explaining "variance captured" |
| MAPE | Percentage | Depends on scale | Communicating relative error to business; comparing across targets of different scale (avoid when values near zero) |
Rule of thumb:
→ Care about typical error, distrust outliers → MAE.
→ Big mistakes are especially costly → RMSE / MSE.
→ Need a business-friendly percentage → MAPE (if no near-zero targets).
→ Comparing models / "how much did we explain" → R-squared.
Putting It Together in scikit-learn
sklearn.metrics gives you everything above with one-line calls. Here is a compact classification evaluation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             roc_auc_score)
# Imbalanced binary data (10% positive) so accuracy will be misleading
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
model = Pipeline([("scaler", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_proba = model.predict_proba(X_te)[:, 1] # P(positive) for ROC-AUC
print("Accuracy :", round(accuracy_score(y_te, y_pred), 3))
print("Precision:", round(precision_score(y_te, y_pred), 3))
print("Recall :", round(recall_score(y_te, y_pred), 3))
print("F1 :", round(f1_score(y_te, y_pred), 3))
print("ROC-AUC :", round(roc_auc_score(y_te, y_proba), 3))
print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred))
Illustrative output (shape, not a real benchmark):
Accuracy : 0.918
Precision: 0.667
Recall : 0.36
F1 : 0.468
ROC-AUC : 0.90
[[441   9]
 [ 32  18]]
              precision    recall  f1-score   support
           0       0.93      0.98      0.96       450
           1       0.67      0.36      0.47        50
    accuracy                           0.92       500
   macro avg       0.80      0.67      0.71       500
weighted avg       0.91      0.92      0.91       500
Notice: 92% accuracy looks great, but recall on class 1 is only 0.36 —
the model misses most of the minority class. Accuracy hid the problem.
Reading classification_report: each row is a class; the macro avg row is the unweighted mean across classes (exposes weak minorities); weighted avg weights by support. Always scan the per-class rows, not just the averages.
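The per-class rows of such a report can be recomputed from the confusion matrix alone, which is a useful sanity check when numbers look suspicious. The sketch below uses the matrix from the illustrative output; `per_class_scores` is a hypothetical helper.

```python
# Confusion matrix in scikit-learn layout: rows = actual, columns = predicted
cm = [[441, 9],
      [32, 18]]

def per_class_scores(cm, k):
    """Precision, recall and support for class k, treating k as the positive class."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: the class's support
    return tp / predicted_k, tp / actual_k, actual_k

rows = [per_class_scores(cm, k) for k in range(len(cm))]
for k, (precision, recall, support) in enumerate(rows):
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} support={support}")
# class 0: precision=0.93 recall=0.98 support=450
# class 1: precision=0.67 recall=0.36 support=50

total = sum(r[2] for r in rows)
macro_recall = sum(r[1] for r in rows) / len(rows)
weighted_recall = sum(r[1] * r[2] for r in rows) / total
print(f"macro recall={macro_recall:.2f} weighted recall={weighted_recall:.2f}")
# macro recall=0.67 weighted recall=0.92
```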
For multiclass and regression, the same module has you covered.
# --- Multiclass averaging ---
# Here y_true / y_pred are the true and predicted labels of a multiclass problem
from sklearn.metrics import f1_score
f1_macro = f1_score(y_true, y_pred, average="macro")
f1_weighted = f1_score(y_true, y_pred, average="weighted")
f1_micro = f1_score(y_true, y_pred, average="micro")
# --- Regression ---
# Here y_true / y_pred are continuous targets and predictions from a separate regressor
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, mean_absolute_percentage_error)
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt(MSE)
r2 = r2_score(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)  # returns a fraction
print(f"MAE={mae:.1f} RMSE={rmse:.1f} R2={r2:.3f} MAPE={mape*100:.1f}%")
Illustrative regression output:
MAE=142000.0 RMSE=205000.0 R2=0.871 MAPE=8.4%
RMSE (₹2.05L) is noticeably larger than MAE (₹1.42L) → a few large
errors are inflating RMSE. If those big misses matter, act on them.
Choosing the Right Metric for the Business Goal
This is the most important section of the chapter. The metric is not a technicality — it encodes what your organisation values. Ask: what does a mistake cost?
| Scenario | Costly error | Optimise for | Why |
|---|---|---|---|
| Cancer / disease screening | Missing a sick patient (FN) | Recall | A missed case can be fatal; a false alarm just triggers a follow-up test |
| Spam / promotional filter | Deleting a real email (FP) | Precision | A lost genuine email is far worse than one spam message slipping through |
| Fraud detection (rare fraud) | Both, but misses are expensive | Recall + PR-AUC | Catch fraud, then tune threshold against the false-alarm review cost |
| Loan default (imbalanced) | Approving a defaulter (FN on default) | Recall / ROC-AUC | A default costs far more than declining a good applicant |
| Balanced product rating class | Either error similar | Accuracy / F1 | Classes balanced and costs symmetric, so accuracy is meaningful |
| Ranking leads for a sales team | Ordering quality, not a hard cutoff | ROC-AUC / PR-AUC | You act on the ranked list, not a single threshold |
| House price prediction | Large price misses | RMSE / MAE | RMSE if big misses hurt more; MAE for a robust typical error |
| Demand forecasting for stakeholders | Percentage-off matters | MAPE | Business speaks in "% off"; comparable across products |
The decision procedure:
1. Is it classification or regression?
2. Are the classes imbalanced? (If yes, drop plain accuracy.)
3. Which error is more expensive — FP or FN?
FN worse → optimise RECALL.
FP worse → optimise PRECISION.
Both matter → F1.
Judging the ranking, not one threshold → ROC-AUC / PR-AUC.
4. For regression: outliers noise → MAE; outliers dangerous → RMSE;
need a % → MAPE; explaining variance → R-squared.
Recall the disease example from the top: there, recall is king — missing a patient is unacceptable, so we tolerate false alarms. For a spam filter, precision wins — we would rather let a little spam through than lose a real email. Same math, opposite metric, because the costs are opposite.
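The decision procedure can be sketched as a small lookup function. This is only a mnemonic under simplifying assumptions: `suggest_metric` and its argument values are hypothetical, and a real choice should still be argued from the actual costs.

```python
def suggest_metric(task, imbalanced=False, costlier_error=None,
                   ranking=False, regression_need=None):
    """Hypothetical helper mirroring the four-step decision procedure."""
    if task == "regression":
        return {
            "outliers_are_noise": "MAE",
            "outliers_are_dangerous": "RMSE",
            "percentage": "MAPE",
            "explain_variance": "R-squared",
        }[regression_need]
    if ranking:                      # judging the ordering, not one threshold
        return "ROC-AUC / PR-AUC"
    if costlier_error == "FN":
        return "recall"
    if costlier_error == "FP":
        return "precision"
    if costlier_error == "both" or imbalanced:
        return "F1"                  # plain accuracy is dropped under imbalance
    return "accuracy"

print(suggest_metric("classification", imbalanced=True, costlier_error="FN"))  # recall
print(suggest_metric("classification", costlier_error="FP"))                   # precision
print(suggest_metric("regression", regression_need="percentage"))              # MAPE
```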
Common Mistakes
1. Reporting accuracy on imbalanced data
On a 99:1 dataset, 99% accuracy can mean the model never predicts the
rare class. Always check the per-class recall and precision, and prefer
F1, ROC-AUC, or PR-AUC when classes are imbalanced.
2. Evaluating on the training set
Scoring on data the model already saw gives inflated, meaningless numbers
(the model can memorise). ALWAYS evaluate on a held-out test set or via
cross-validation. A near-perfect training score with a poor test score is
the classic overfitting signature (see the next chapter).
3. Optimising the wrong metric
Tuning for accuracy when the business cares about catching fraud gets you
a model that is "accurate" and useless. Decide the metric FROM the business
cost BEFORE training, then optimise that metric (e.g. scoring="recall"
in GridSearchCV), not whatever is convenient.
4. Accepting the default 0.5 threshold blindly
For binary classifiers that output probabilities, predict() is equivalent to a
0.5 cutoff on the positive-class probability. For imbalanced or cost-asymmetric
problems that is rarely optimal. Use predict_proba and pick the threshold from the
precision-recall curve according to your FP-vs-FN costs.
5. Using ROC-AUC on severely imbalanced data without thinking
With very rare positives, ROC-AUC can look great because the enormous
negative class swamps the false-positive rate. Prefer PR-AUC (average
precision), which focuses on the positive class you actually care about.
6. Comparing R-squared or MAPE across different datasets
R² depends on the variance of the target, and MAPE explodes near zero
targets — so a "higher R²" on a different dataset does NOT mean a better
model. Compare metrics only on the SAME target and test set.
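Mistakes 2 and 3 can be avoided together by fixing the metric before tuning and scoring only on held-out data. The sketch below assumes a synthetic imbalanced dataset and a small, arbitrary search space for C; it shows one possible setup, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (about 10% positive); purely illustrative
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # assumed search space
    scoring="recall",                    # metric chosen from the business cost
    cv=5,                                # tuning uses cross-validation on training data
)
grid.fit(X_tr, y_tr)

# The test set is touched once, after tuning is finished
test_recall = recall_score(y_te, grid.predict(X_te))
print("best C:", grid.best_params_["C"])
print("cross-validated recall:", round(grid.best_score_, 3))
print("held-out recall:", round(test_recall, 3))
```

The cross-validated and held-out recall will usually differ somewhat; a large gap is a reason to revisit the model rather than the metric.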
Practice Exercises
1. A fraud model produces this confusion matrix (rows = actual, layout [[TN, FP], [FN, TP]]): [[9500, 100], [200, 200]]. Compute accuracy, precision, recall, and F1. Is accuracy a good summary here? Explain in one sentence.
2. For a disease screening tool and a promotional-email filter, state which of precision or recall you would prioritise for each, and justify each choice in terms of the cost of a false negative versus a false positive.
3. Using scikit-learn's load_breast_cancer, build a StandardScaler + LogisticRegression pipeline, then print the classification_report and roc_auc_score. Which class has lower recall, and what does that mean?
4. Take the model from Exercise 3, extract predict_proba, and compute precision and recall at thresholds 0.3, 0.5, and 0.7. Describe how each metric moves and explain why.
5. For true values y = [100, 200, 300, 400] and predictions ŷ = [110, 190, 350, 380], compute MAE, MSE, RMSE, and MAPE by hand. Which prediction contributes most to RMSE, and why?
6. You have a 3-class classifier where one class has only 30 of 1000 samples. Explain when you would report macro F1 versus weighted F1, and what each would hide or reveal.
Summary
In this chapter you learned:
- The confusion matrix (TP, FP, TN, FN) is the foundation of classification metrics; the key tension is that FP and FN are different mistakes with different costs.
- Accuracy = (TP + TN) / total is intuitive but lies under class imbalance: a lazy majority-class model can score 99% and catch nothing.
- Precision = TP / (TP + FP) ("when I say positive, am I right?") and Recall = TP / (TP + FN) ("did I catch the positives?") trade off against each other; F1 is their harmonic mean.
- The decision threshold controls the precision-recall balance: lower it for more recall, raise it for more precision; 0.5 is rarely optimal.
- ROC-AUC summarises ranking quality across all thresholds and is imbalance-robust; prefer PR-AUC when positives are very rare.
- Multiclass metrics are averaged as macro (every class equal, exposes rare classes), weighted (by support), or micro (global pooling).
- Regression metrics: MAE (robust, original units), MSE/RMSE (punish big errors), R-squared (variance explained), MAPE (percentage error); choose by how much outliers should count.
- Choose the metric from the business cost: recall for disease detection, precision for spam, ROC-AUC/PR-AUC for ranking, RMSE vs MAE by outlier sensitivity.
- Avoid the classic traps: accuracy on imbalanced data, evaluating on the training set, optimising the wrong metric, and the blind 0.5 threshold.
Metrics are the language in which you argue that a model is good — pick the words that actually mean what your business needs.
Next up: Bias-Variance, Overfitting & Regularization — why a model that memorises the training set fails on new data, how to diagnose the bias-variance trade-off, and how regularization keeps models honest.