Logistic Regression

What Is Logistic Regression?

Despite the word "regression" in its name, logistic regression is a classification algorithm. It answers yes/no questions: Will this customer churn? Is this transaction fraud? Will this loan applicant default?

In the previous chapter you saw how Linear Regression predicts a continuous number like a salary or a house price. But if you tried to predict a label (say, churn = 1 vs stay = 0) with a straight line, the model would happily output values like -0.4 or 1.7 — nonsense as probabilities. Logistic regression fixes this by wrapping the linear model in a squashing function that keeps the output between 0 and 1.

Intuitive analogy. Think of a bank loan officer named Priya. She looks at an applicant's income, credit score, and existing loans, and mentally adds up "evidence for approval." A little evidence nudges her confidence up; a lot of evidence pushes it near certainty; strong negative evidence pushes it near zero. She never says "I am 170% confident" — her confidence saturates. Logistic regression models exactly this: it computes a weighted score from the features, then converts that score into a calibrated-looking probability that flattens out near 0 and near 1.

Goal: learn feature weights so that the predicted probability of the positive class is high for positive examples and low for negative ones — then apply a threshold to turn probability into a decision.

Examples:
→ Predict whether a customer will churn (churn = 1, stay = 0)
→ Predict whether a UPI transaction is fraudulent (fraud = 1, legit = 0)
→ Predict whether an email is spam (spam = 1, ham = 0)
→ Predict whether a patient has a disease (positive = 1, negative = 0)

From a Line to a Probability: The Sigmoid Function

Logistic regression starts with the same linear combination you know from linear regression. This raw score is called the logit or log-odds, and we usually denote it z.

Linear score (logit):
z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

z can be any real number from -∞ to +∞.

The problem: z is unbounded, but a probability must live in the range 0 <= p <= 1. We pass z through the sigmoid (also called the logistic) function, which squashes any real number into that range.

Sigmoid / logistic function:
σ(z) = 1 / (1 + e^(-z))

Key behaviour:
z → -∞   ⇒  σ(z) → 0
z = 0    ⇒  σ(z) = 0.5
z → +∞   ⇒  σ(z) → 1

The curve is S-shaped: steep near z = 0, flat in the tails.
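
The key behaviour above can be checked numerically. The following is a minimal NumPy sketch; the function name sigmoid and the sample scores are chosen here for illustration and are not part of any library.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z into the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative scores spanning the steep middle and the flat tails.
scores = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(scores)

for z, p in zip(scores, probs):
    print(f"z = {z:6.1f}  ->  sigma(z) = {p:.4f}")
```

For very large negative z this naive form can raise an overflow warning in np.exp; scipy.special.expit is a numerically stable alternative if SciPy is available.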

Putting the two pieces together gives the full model. The predicted probability that the label is 1 (given features x) is:

p = P(y = 1 | x) = σ(z) = 1 / (1 + e^(-(β₀ + β₁x₁ + ... + βₙxₙ)))

Logistic regression is still a linear classifier. The sigmoid is monotonic, so thresholding p is the same as thresholding the linear score z: the boundary between classes is linear in the features, even though the mapping from score to probability is a smooth curve.

A Quick Numerical Feel

Suppose a fitted model gives z = β₀ + β₁·(credit_score):
z = -4 + 0.01 × credit_score

For credit_score = 300:  z = -4 + 3   = -1.0  ⇒ σ(-1.0) = 0.27
For credit_score = 400:  z = -4 + 4   =  0.0  ⇒ σ(0.0)  = 0.50
For credit_score = 500:  z = -4 + 5   =  1.0  ⇒ σ(1.0)  = 0.73

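Notice: the size of a jump in probability depends on where you are on
the curve. Here both 100-point jumps add about 0.23 because they sit
symmetrically around z = 0, where the sigmoid is steepest. One more
jump, to credit_score = 600, gives z = 2.0 and σ(2.0) = 0.88, a step
of only 0.15: the sigmoid flattens near the extremes.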
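
To see the flattening more clearly, the sketch below reuses the same illustrative coefficients (β₀ = -4, β₁ = 0.01) and extends the credit scores to 600 and 700. Those two extra scores are an assumption added here, not part of the worked example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Coefficients from the worked example above.
beta_0, beta_1 = -4.0, 0.01

# 300-500 come from the example; 600 and 700 are added for illustration.
credit_scores = np.array([300, 400, 500, 600, 700])
z = beta_0 + beta_1 * credit_scores
p = sigmoid(z)

for score, z_i, p_i in zip(credit_scores, z, p):
    print(f"credit_score = {score}:  z = {z_i:+.1f}  p = {p_i:.2f}")

# Change in probability for each successive 100-point step.
jumps = np.diff(p)
print("jumps:", np.round(jumps, 2))
```

The first two steps straddle z = 0 symmetrically and are the same size; the later steps shrink as the curve approaches 1.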

The Decision Boundary and Threshold

The sigmoid gives a probability. To turn it into a hard class label, we apply a threshold — by default 0.5.

Decision rule (default threshold t = 0.5):
if p >= 0.5  ⇒  predict class 1
if p <  0.5  ⇒  predict class 0

Because σ(z) = 0.5 exactly when z = 0, the default threshold is
equivalent to:
predict class 1 when z = β₀ + β₁x₁ + ... + βₙxₙ >= 0

The set of points where z = 0 is the decision boundary. Because z is linear in the features, this boundary is a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions).

The threshold is a business decision, not a mathematical constant. For fraud detection you might lower the threshold to 0.2 so you catch more fraud (higher recall) at the cost of more false alarms. For a spam filter you might raise it to 0.8 so you rarely misclassify a real email. You will study how to pick thresholds using the ROC curve and precision-recall trade-offs in the Model Evaluation Metrics chapter.
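
The link between a threshold on p and a cut-off on z can be sketched as follows. The scores are hypothetical values invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores z for six examples.
z = np.array([-2.5, -0.8, -0.1, 0.0, 0.6, 3.0])
p = sigmoid(z)

# Thresholding p at 0.5 gives the same labels as checking the sign of z.
labels_from_p = (p >= 0.5).astype(int)
labels_from_z = (z >= 0).astype(int)
print(labels_from_p, labels_from_z)

# Any other threshold t on p is a shifted cut-off on z:
#   p >= t   <=>   z >= log(t / (1 - t))
positives = {}
for t in (0.2, 0.5, 0.8):
    z_cut = np.log(t / (1 - t))
    positives[t] = int((p >= t).sum())
    print(f"t = {t}: z cut-off = {z_cut:+.3f}, predicted positives = {positives[t]}")
```

Changing the threshold therefore slides the decision boundary parallel to itself; it stays a hyperplane, and only the number of examples on each side changes.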

Cost Function: Log Loss (Cross-Entropy)

Linear regression minimises squared error. Logistic regression does not: squared error on top of the sigmoid is non-convex in the coefficients, so gradient descent is no longer guaranteed to reach the global minimum and can stall where the sigmoid saturates and the gradients become tiny. Instead we use log loss, also called binary cross-entropy, which is convex for logistic regression and rewards confident, correct probabilities.

Cost for a single example (yᵢ is the true label, pᵢ is the predicted probability):

cost = -[ yᵢ · log(pᵢ) + (1 - yᵢ) · log(1 - pᵢ) ]

Intuition:
→ If yᵢ = 1, cost = -log(pᵢ)      → 0 when pᵢ → 1, → ∞ when pᵢ → 0
→ If yᵢ = 0, cost = -log(1 - pᵢ)  → 0 when pᵢ → 0, → ∞ when pᵢ → 1

Total cost over m examples (this is what training minimises):

J(β) = -(1/m) · Σᵢ [ yᵢ · log(pᵢ) + (1 - yᵢ) · log(1 - pᵢ) ]

The key insight: log loss punishes confident wrong answers brutally. Predicting p = 0.99 for something that is actually class 0 incurs a huge penalty, while predicting p = 0.51 for the same mistake is only mildly penalised. This is why the model learns not just to be right, but to be appropriately uncertain. There is no neat closed-form solution like OLS, so the coefficients β are fit iteratively using optimisation algorithms (gradient descent, lbfgs, newton-cg, etc.).
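
The asymmetry between a confident and a hesitant mistake is easy to check by hand. Below is a minimal NumPy sketch of J(β) as defined above; the helper name log_loss and the eps clipping constant are choices made here for illustration (scikit-learn ships its own sklearn.metrics.log_loss).

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; eps keeps log() away from exactly 0."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# One example whose true label is 0, predicted with different confidence.
for p in (0.01, 0.51, 0.99):
    cost = log_loss([0], [p])
    print(f"true y = 0, predicted p = {p:.2f}  ->  cost = {cost:.3f}")
```

The confident wrong answer (p = 0.99) costs about 4.6, the hesitant wrong answer (p = 0.51) about 0.71, and the confident correct answer (p = 0.01) about 0.01.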

Interpreting Coefficients: Log-Odds and Odds Ratios

This is logistic regression's superpower over black-box models: the coefficients are interpretable. The trick is to think in terms of odds rather than probability.

Odds = p / (1 - p)          (e.g. p = 0.75  ⇒  odds = 3, i.e. "3 to 1")

The model is linear in the LOG of the odds:
log( p / (1 - p) ) = z = β₀ + β₁x₁ + ... + βₙxₙ

So each coefficient βⱼ is the change in log-odds for a one-unit
increase in xⱼ (holding other features constant).
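
The statement that the model is linear in the log-odds can be verified directly. In this sketch the one-feature coefficients (β₀ = -1, β₁ = 0.5) are hypothetical values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds, the inverse of the sigmoid."""
    return np.log(p / (1 - p))

z = np.array([-2.0, 0.0, 1.0, 3.0])
p = sigmoid(z)
print("odds     :", np.round(p / (1 - p), 3))
print("log-odds :", np.round(logit(p), 3))   # recovers z

# Hypothetical one-feature model: each extra unit of x adds beta_1 to the log-odds.
beta_0, beta_1 = -1.0, 0.5
x = np.array([0.0, 1.0, 2.0, 3.0])
log_odds = logit(sigmoid(beta_0 + beta_1 * x))
steps = np.diff(log_odds)
print("step in log-odds per unit of x:", np.round(steps, 3))
```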

Log-odds are hard to explain to a stakeholder, so we exponentiate the coefficient to get an odds ratio:

Odds ratio for feature xⱼ:  OR = e^(βⱼ)

Interpretation of a one-unit increase in xⱼ (others held constant):
OR > 1  ⇒  odds of class 1 MULTIPLY by OR   (feature increases likelihood)
OR = 1  ⇒  no effect
OR < 1  ⇒  odds of class 1 shrink            (feature decreases likelihood)

Worked Interpretation

Churn model coefficient: β_tenure = -0.12  (tenure in months)
Odds ratio = e^(-0.12) = 0.887

Interpretation:
"Each additional month a customer stays multiplies their odds of
churning by 0.887 — roughly an 11% reduction in churn odds per month,
holding all other features constant."

Another: β_monthly_charge = 0.03 (charge in ₹100 units)
Odds ratio = e^(0.03) = 1.030
"Each extra ₹100 of monthly charge multiplies churn odds by 1.03
(a 3% increase in churn odds)."

Note the difference from linear regression: the coefficient acts on the odds multiplicatively, not on the probability additively. A 0.03 coefficient does not mean "3% more probability."
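
To make the "multiplicative on odds, not additive on probability" point concrete, the sketch below applies the tenure odds ratio to a few starting probabilities. The starting probabilities and the helper name apply_odds_ratio are assumptions made for illustration.

```python
import numpy as np

# Coefficient from the worked interpretation above.
beta_tenure = -0.12
odds_ratio = float(np.exp(beta_tenure))
print(f"odds ratio = {odds_ratio:.3f}")

def apply_odds_ratio(p, odds_ratio):
    """Probability after multiplying the odds p / (1 - p) by odds_ratio."""
    new_odds = (p / (1 - p)) * odds_ratio
    return new_odds / (1 + new_odds)

# Hypothetical starting churn probabilities.
changes = {}
for p in (0.05, 0.50, 0.90):
    p_new = apply_odds_ratio(p, odds_ratio)
    changes[p] = p_new - p
    print(f"p = {p:.2f} -> {p_new:.3f}  (change {changes[p]:+.3f})")
```

The same odds ratio moves the probability by a different amount at each starting point, with the largest shift near p = 0.5. That is why a coefficient cannot be read as a fixed change in probability.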

Binary vs Multiclass Classification

Plain logistic regression handles two classes. Real problems often have more (e.g. classify a support ticket as Billing, Technical, or Sales). There are two standard extensions.

One-vs-Rest (OvR)

Train one binary classifier per class: "this class vs everything else." For K classes you fit K models, get K probabilities, and predict the class whose model gives the highest score.

3 classes {A, B, C} ⇒ train 3 binary models:
  Model_A: A  vs (B or C)
  Model_B: B  vs (A or C)
  Model_C: C  vs (A or B)
Predict = class with the highest probability.
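
The prediction step of one-vs-rest can be sketched in a few lines of NumPy. The weights, intercepts, and example below are invented for illustration; in practice they would come from K fitted binary models.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

classes = np.array(["A", "B", "C"])

# Hypothetical fitted parameters: one row of weights and one intercept
# per binary "class vs rest" model (values invented for illustration).
W = np.array([[ 1.2, -0.4],    # Model_A
              [-0.3,  0.9],    # Model_B
              [-0.8, -0.6]])   # Model_C
b = np.array([-0.5, -0.2, 0.1])

x = np.array([1.0, 0.5])       # one example with two features

scores = sigmoid(W @ x + b)    # K independent probabilities, one per model
prediction = classes[np.argmax(scores)]

print("per-model probabilities:", np.round(scores, 3))
print("sum of probabilities   :", round(float(scores.sum()), 3))
print("predicted class        :", prediction)
```

Because each model is trained independently, the K raw probabilities generally do not sum to 1; only their ranking is used to pick the class.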

Softmax / Multinomial Logistic Regression

A single model that outputs a full probability distribution over all K classes at once, using the softmax function (a generalisation of the sigmoid). The probabilities across all classes sum to 1.

Softmax for class k out of K classes:
P(y = k | x) = e^(zₖ) / Σⱼ e^(zⱼ)     (sum over j = 1..K)

The K probabilities are guaranteed to sum to 1.0.
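
A minimal NumPy sketch of the softmax formula above. The class scores are hypothetical, and subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(z):
    """Turn a vector of K class scores into K probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)      # avoids overflow; the ratio is unchanged
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# Hypothetical scores for three classes (e.g. Billing, Technical, Sales).
z = np.array([2.0, 1.0, -1.0])
probs = softmax(z)
print(np.round(probs, 3), "sum =", round(float(probs.sum()), 6))

# With two classes, softmax reduces to the sigmoid of the score difference.
two = softmax([1.5, 0.0])
sig = 1.0 / (1.0 + np.exp(-1.5))
print(two[0], sig)
```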

In scikit-learn, recent versions fit a multinomial model by default for multiclass problems when the solver supports it (the default lbfgs does; liblinear does not). To force one-vs-rest, wrap the estimator in OneVsRestClassifier. Multinomial is usually better calibrated when classes are genuinely mutually exclusive; OvR is a robust, simple fallback.

Logistic Regression in scikit-learn

Here is a realistic end-to-end binary classification workflow. We predict customer churn from a small feature set. Note the pipeline with scaling — logistic regression is sensitive to feature scale when regularised (more on that below).

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score)

# --- Illustrative telecom churn data (features already engineered) ---
# Columns: tenure_months, monthly_charge, num_support_calls
df = pd.read_csv("telecom_churn.csv")

X = df[["tenure_months", "monthly_charge", "num_support_calls"]]
y = df["churned"]          # 1 = churned, 0 = stayed

# Stratify to preserve the churn ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scale + model in one pipeline so scaling is fit ONLY on training data
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])

model.fit(X_train, y_train)

# Hard predictions (uses the default 0.5 threshold internally)
y_pred = model.predict(X_test)

# Probabilities — column 1 is P(class = 1)
y_proba = model.predict_proba(X_test)[:, 1]

print("Accuracy :", round(accuracy_score(y_test, y_pred), 3))
print("ROC-AUC  :", round(roc_auc_score(y_test, y_proba), 3))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Illustrative output (numbers shown for shape, not a real benchmark):

Accuracy : 0.855
ROC-AUC  : 0.889
[[512  38]
 [ 71 129]]
              precision    recall  f1-score   support
           0       0.88      0.93      0.90       550
           1       0.77      0.65      0.70       200
    accuracy                           0.85       750
   macro avg       0.83      0.79      0.80       750
 weighted avg       0.85      0.85      0.85       750

Using predict_proba and a Custom Threshold

predict_proba is what makes logistic regression valuable — you get a ranked score, not just a label. To catch more churners (higher recall) you can lower the decision threshold yourself.

# Default predict() uses 0.5. Here we use a lower, business-chosen threshold.
threshold = 0.35
y_pred_custom = (y_proba >= threshold).astype(int)

# Lowering the threshold typically increases recall (catch more class 1)
# at the cost of precision (more false positives).
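
The trade-off can be made visible with a small, self-contained sketch. The labels and scores below are hypothetical values for ten customers, and precision_recall is a helper written here for illustration (scikit-learn provides precision_score and recall_score for real work).

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for ten customers.
y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.33, 0.45, 0.70, 0.90])

def precision_recall(y_true, y_score, threshold):
    """Precision and recall for class 1 at a given decision threshold."""
    y_hat = (y_score >= threshold).astype(int)
    tp = int(((y_hat == 1) & (y_true == 1)).sum())
    fp = int(((y_hat == 1) & (y_true == 0)).sum())
    fn = int(((y_hat == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.35, 0.50, 0.80):
    prec, rec = precision_recall(y_true, y_score, t)
    print(f"threshold = {t:.2f}: precision = {prec:.2f}, recall = {rec:.2f}")
```

On this toy data, raising the threshold from 0.35 to 0.80 lifts precision from 0.60 to 1.00 while recall falls from 0.75 to 0.25.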

Reading Coefficients as Odds Ratios

# Pull the fitted coefficients out of the pipeline's classifier step.
clf = model.named_steps["clf"]
feature_names = X.columns

coefs = pd.DataFrame({
    "feature": feature_names,
    "coefficient": clf.coef_[0],
    "odds_ratio": np.exp(clf.coef_[0])
}).sort_values("odds_ratio", ascending=False)

print(coefs)

Illustrative output (coefficients are on SCALED features):
             feature  coefficient  odds_ratio
2  num_support_calls        0.640       1.897
1     monthly_charge        0.310       1.363
0      tenure_months       -0.880       0.415

Read as: after scaling, more support calls sharply raise churn odds,
while longer tenure strongly lowers them.

Because the pipeline scaled the features, these odds ratios describe the effect of a one-standard-deviation change, not a one-raw-unit change. If you need raw-unit odds ratios for stakeholders, fit the model on unscaled features (accepting the caveats in Common Mistakes).
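
Another option is to convert the scaled coefficients back to raw units using the statistics stored in the fitted StandardScaler. The sketch below uses synthetic stand-in data invented for illustration (not the churn file); the names X_demo, y_demo, and pipe are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (invented for illustration).
rng = np.random.default_rng(0)
n = 500
X_demo = np.column_stack([
    rng.uniform(1, 72, n),        # tenure-like feature, in months
    rng.uniform(200, 2000, n),    # charge-like feature, in rupees
])
true_z = 0.5 - 0.05 * X_demo[:, 0] + 0.001 * X_demo[:, 1]
y_demo = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_z))).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

scaler = pipe.named_steps["scaler"]
clf = pipe.named_steps["clf"]

# x_scaled = (x - mean) / scale, so beta_raw = beta_scaled / scale.
coef_raw = clf.coef_[0] / scaler.scale_
intercept_raw = clf.intercept_[0] - np.sum(clf.coef_[0] * scaler.mean_ / scaler.scale_)

print("odds ratio per 1 SD      :", np.round(np.exp(clf.coef_[0]), 3))
print("odds ratio per 1 raw unit:", np.round(np.exp(coef_raw), 4))
```

The converted coefficients describe the same fitted model in raw units. They are not identical to what you would get by fitting on unscaled features with the same C, because the penalty was applied on the scaled coefficients.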

The Regularization Parameter C

Logistic regression in scikit-learn is regularized by default (L2). Regularization discourages large coefficients, which prevents overfitting and stabilises the model when features are correlated. The strength is controlled by C.

C is the INVERSE of regularization strength:
Small C  ⇒  strong regularization  ⇒  smaller coefficients, simpler model (may underfit)
Large C  ⇒  weak regularization    ⇒  larger coefficients, more flexible (may overfit)

C = 1.0 is the scikit-learn default.

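You can also switch the penalty type (penalty="l1" for sparse models that zero out useless features, penalty="l2" for the default, or penalty="elasticnet" for a mix of the two). The default lbfgs solver supports only L2, so pair L1 with solver="liblinear" or solver="saga", and elastic net with solver="saga" plus an l1_ratio. Tune C with cross-validation.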

from sklearn.model_selection import GridSearchCV

param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(model, param_grid, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)

print("Best C     :", grid.best_params_)
print("Best ROC-AUC:", round(grid.best_score_, 3))

The full treatment of the bias-variance trade-off, why regularization works, and L1 vs L2 lives in the Bias-Variance, Overfitting & Regularization chapter. For now, remember: C is a knob you tune, and it interacts with feature scaling — which is why the pipeline above scales first.
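
To see what the knob does, the sketch below fits the same pipeline at several values of C and prints the size of the coefficient vector. The data comes from make_classification and is synthetic; the names X_demo, y_demo, and coef_norms are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (invented for illustration).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0)

coef_norms = {}
for C in [0.01, 0.1, 1, 10, 100]:
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(C=C, max_iter=5000)),
    ]).fit(X_demo, y_demo)
    # Euclidean length of the coefficient vector at this C.
    coef_norms[C] = float(np.linalg.norm(pipe.named_steps["clf"].coef_))
    print(f"C = {C:<6} ->  ||coefficients|| = {coef_norms[C]:.3f}")
```

Smaller C should give a visibly shorter coefficient vector; as C grows, the coefficients approach the unregularized fit.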

When to Use Logistic Regression: Pros and Cons

Aspect	Details
Problem type	Binary or multiclass classification; probabilistic ranking
Output	Calibrated-ish probability via `predict_proba`, plus a hard label
Interpretability	High — coefficients map directly to odds ratios
Training speed	Very fast; scales to large datasets
Decision boundary	Linear (a hyperplane) in the feature space
Assumptions	Roughly linear relationship between features and log-odds
Handles non-linearity	Only if you add polynomial or interaction features manually
Feature scaling	Recommended (keeps the regularization penalty even across features and speeds up convergence)

Pros

Fast, simple, and a strong baseline — always try it before reaching for complex models.
Coefficients are interpretable as odds ratios, which regulators and business stakeholders love.
Outputs probabilities, not just labels, so you can rank and set custom thresholds.
Rarely overfits when regularized; works well even with many features.

Cons

The decision boundary is linear; it cannot capture complex non-linear patterns without feature engineering.
Sensitive to outliers and to strongly correlated features (multicollinearity destabilises coefficients).
Assumes a roughly linear relationship between features and the log-odds.
Probabilities are not automatically well-calibrated, especially with imbalanced classes or heavy regularization.

If the boundary is clearly non-linear, algorithms like K-Nearest Neighbors, Decision Trees, or Support Vector Machines (all covered later in this series) may fit better.

Common Mistakes

1. Not scaling features before regularized logistic regression

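Because L2 regularization penalises the SIZE of coefficients, the penalty
depends on each feature's units. A feature with small values (e.g. age in
years) needs a larger coefficient than one with large values (e.g. income
in ₹) to have the same effect, so it gets shrunk harder. Always scale
(StandardScaler) inside a Pipeline so the penalty is fair and the solver
converges quickly. Symptom of skipping this: a "ConvergenceWarning".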

2. Ignoring class imbalance

If only 2% of transactions are fraud, a model predicting "never fraud"
scores 98% accuracy — and is useless. Fixes:
→ use class_weight="balanced" in LogisticRegression
→ resample (SMOTE / undersampling)
→ judge with ROC-AUC, precision, recall — NOT plain accuracy.

3. Treating predict_proba output as perfectly calibrated

A predicted 0.90 does NOT guarantee "90 out of 100 such cases are positive."
Regularization and imbalance distort calibration. If you need trustworthy
probabilities (e.g. for pricing or risk), calibrate with
CalibratedClassifierCV and check a calibration (reliability) curve.

4. Interpreting coefficients on scaled features as raw-unit effects

After StandardScaler, a coefficient describes a ONE-STANDARD-DEVIATION
change in the feature, not a one-rupee or one-year change. Don't tell a
stakeholder "each extra ₹1 does X" when the model saw scaled inputs.

5. Using the default 0.5 threshold blindly

0.5 is rarely optimal for imbalanced or cost-asymmetric problems.
Choose the threshold from business costs (cost of a false negative vs
false positive) using the ROC / precision-recall curve — not by default.

6. Confusing coefficient magnitude with importance across unscaled features

On unscaled data, a small coefficient on a large-range feature can matter
more than a big coefficient on a tiny-range feature. Compare importance
only after scaling, or use standardized coefficients.
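
The accuracy trap from mistake 2 takes only a few lines to reproduce. The sketch below uses hypothetical labels (1,000 transactions, 2% fraud) and a "model" that never predicts fraud.

```python
import numpy as np

# Hypothetical labels: 1,000 transactions, 2% of them fraud.
y_true = np.zeros(1000, dtype=int)
y_true[:20] = 1

# A "model" that never predicts fraud.
y_pred = np.zeros_like(y_true)

accuracy = float((y_pred == y_true).mean())
true_pos = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_pos / int((y_true == 1).sum())

print(f"accuracy = {accuracy:.2%}")
print(f"recall on fraud = {recall:.2%}")
```

Accuracy looks excellent while recall on the fraud class is zero, which is why the metrics listed under mistake 2 matter more than plain accuracy on imbalanced data.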

Practice Exercises

1. A fitted model has z = -3 + 0.5·x. Compute the predicted probability p = σ(z) for x = 4, x = 6, and x = 8. At what value of x is the decision boundary (where p = 0.5)?
2. A logistic regression coefficient for a "has_dependents" (0/1) feature is β = -0.7. Compute the odds ratio and explain in plain English what it means for churn.
3. Load any binary dataset (e.g. scikit-learn's load_breast_cancer). Build a Pipeline with StandardScaler and LogisticRegression, fit it, and report accuracy and ROC-AUC on a stratified test set.
4. Using the model from Exercise 3, extract predict_proba, then compute predictions at thresholds 0.3, 0.5, and 0.7. Describe how precision and recall shift as the threshold changes.
5. Explain in one paragraph why logistic regression uses log loss instead of mean squared error as its cost function.
6. Take an imbalanced dataset. Train two models (one with default settings and one with class_weight="balanced") and compare their recall on the minority class. What changed and why?

Summary

In this chapter you learned:

Logistic regression is classification, not regression — it predicts the probability of a class, then applies a threshold.
The sigmoid σ(z) = 1 / (1 + e^(-z)) squashes the linear logit z into the range 0 to 1.
The decision boundary is where z = 0 (probability 0.5); the threshold is a tunable business choice, not a fixed constant.
Training minimises log loss (cross-entropy), which is convex and heavily punishes confident wrong answers.
Coefficients are interpretable: e^(βⱼ) is the odds ratio — how a one-unit feature change multiplies the odds of the positive class.
Multiclass is handled by one-vs-rest (K binary models) or softmax / multinomial (one model, probabilities sum to 1).
In scikit-learn, use a Pipeline with StandardScaler, call fit/predict, and use predict_proba for probabilities and custom thresholds.
C is the inverse regularization strength — small C regularizes more; tune it with cross-validation (see the regularization chapter).
Watch for the classic pitfalls: unscaled features, imbalanced classes, uncalibrated probabilities, and blind use of the 0.5 threshold.

Logistic regression is the workhorse baseline of classification — fast, interpretable, and probabilistic — and a model you will reach for constantly in real data science work.

Next up: K-Nearest Neighbors (KNN) — a delightfully simple, non-parametric classifier that makes predictions by looking at the closest examples in the training data, with no explicit training step at all.